JP3472038B2

JP3472038B2 - Logical identification of document elements

Info

Publication number: JP3472038B2
Application number: JP13851796A
Authority: JP
Inventors: 正治尾▲崎▼
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1995-06-07
Filing date: 1996-05-31
Publication date: 2003-12-02
Anticipated expiration: 2016-05-31
Also published as: JPH096911A

Description

【発明の詳細な説明】【０００１】【発明の属する技術分野】本発明は、一般に、文書の主
要ホワイト領域を構造モデルと比較して文書内のカラム
構造を分析することによって入力文書の文書エレメント
（構成要素）を分類することに関し、特に、特定の原文
書、文書タイプ、または印刷物に対する許容可能なカラ
ムレイアウトを表す構造モデルを使用してカラム中の文
書エレメントを論理的に識別する方法および装置に関す
る。【０００２】【従来の技術および発明が解決しようとする課題】文書
画像内のエレメント（構成要素）を論理的に識別する従
来の方法は、例えば、ナジ（Ｎａｊｉ）等による"A pro
to type Document Image Analysis Systemfor technica
l Journal", pp10-21, Computer, July 1992に概説され
ている。これまで、文書をカラム構造に基づいて解析す
る多くの方法が開発されてきたが、いずれもカラム構造
に関する一般的な推定に基づくものであった。このよう
な推定ベースの方法では、実際のカラム構造が、推定し
た一般的カラム構造と異なる部分や例外を含む場合に問
題があった。通常、異なる原文書は異なる例外を有する
ので、このような例外を軽視するわけにはいかない。【０００３】また、文書構造をコンテクストに依らない
文法規則で表現して許容可能な原文書のページレイアウ
トを識別することによっても文書構造が記述可能である
（上記Ｎａｊｉ参照）。しかし、この方法は、複雑なペ
ージレベルから画素レベルまで、構造をそれぞれ異なる
細度で表現する綿密な文法規則を必要とする。原文書の
文書構造のすべてのレベルを表わすのは複雑で困難な作
業であり、より単純な表現が望まれる。【０００４】特願平７−２４３２１３号は、背景領域を
解析することによって文書を分割する方法および装置を
開示している。この方法は、ページレイアウトを考慮す
ることなく、主要ホワイト領域に基づいて、文書から非
矩形の文書エレメントを抽出する。【０００５】また、特願平７−２４３２１２号は、ホワ
イト空間パターンのマッチングによって、ページ画像を
論理的にタグ付けする方法を開示する。この方法は、比
較的固定的なレイアウトの文書ページの識別に適してい
る。しかし、多様なエレメントを様々な順序で配置した
柔軟なレイアウトにこの方法を適用するには、可能なす
べてのレイアウトを列挙して、各レイアウトをそれぞれ
対応の文書構造で表現しなければならないという問題が
ある。これを実行できないことはないが、ユーザーにと
って手間のかかるプロセスであり、ここでもより単純な
表現が望まれる。【０００６】【課題を解決するための手段】本発明は、異なるタイプ
の文書の各々に対して単一のカラム構造モデルを使用し
て、入力文書の各カラム内の文書エレメントに「見出
し」、「本文」、「図」などの論理的タグ付けを行う、
正確かつ効率的な文書エレメントの識別方法を提供す
る。本発明では、文書エレメント間および文書エレメン
ト内の主要ホワイト領域のシーケンスを原文書の構造モ
デルとマッチングさせることによって、入力文書のカラ
ム構造が解析される。【０００７】本発明の１態様は、文書画像内の文書エレ
メントを論理的に識別する方法であって、少なくとも１
つの文書エレメントを含む少なくともひとつのカラムを
有する原文書に対応する少なくとも１つの構造モデルを
生成するステップを備え、各構造モデルは、少なくとも
１つのカラム表現を有し、少なくとも１つのエレメント
表現を有し、原文書の少なくとも１つの文書エレメント
間の関係を定義し、文書画像において第１の主要背景領
域を識別するステップを備え、第１の主要背景領域は文
書画像を少なくとも１つのカラムに分割し、少なくとも
ひとつのカラムの各々において第２の主要背景領域を識
別するステップを備え、識別された第２の主要背景領域
から文書画像の各カラムに対する主要背景領域パターン
を形成するステップを備え、文書画像の各カラムに対す
る主要背景領域パターンにマッチする少なくとも１つの
カラムストリングを生成するステップを備え、主要背景
パターンにマッチする少なくとも１つのカラムストリン
グの中からベストカラムストリングを決定するステップ
を備え、ベストカラムストリングは少なくとも１つの構
造モデルのカラム表現に最もマッチし、ベストカラムス
トリングに基づいて、文書画像内の少なくとも１つの文
書エレメントの各々を論理的に識別するステップを備え
る。【０００８】本発明はまた、文書カラム内で文書エレメ
ントを隔てる主要ホワイト領域のサイズと方向に関する
情報を用いてユーザーが構造モデルを展開することを可
能にし、文書画像カラム中の文書エレメント間の関係を
表現できるシステムを提供する。【０００９】主要ホワイト領域は、「ホワイト空間」の
矩形領域であり、その最小サイズはあらかじめ定められ
ている。ホワイト空間とは、文書画像の背景、非画像領
域、または非テキスト領域を意味する。通常の文書は、
ホワイトの背景領域にブラックおよび／またはカラーの
画像を形成したものなので、これらの背景領域を「ホワ
イト空間」と呼ぶこととするが、本発明の解釈において
は、カラーの背景領域に文書を形成した場合、あるいは
ブラックまたはカラーの背景に白抜きでテキストを形成
した場合であっても、非画像領域であるこれらのカラー
（またはブラック）背景領域を「ホワイト空間」と定義
する。【００１０】本発明を用いると、文書エレメントを含む
文書画像部分を解析して、コヒーレントグループ、すな
わち文書エレメントから連結要素を検出する必要がな
い。文書上の画像が走査されて、画像の電子的またはデ
ィジタル表現が形成される。文書エレメントを含む領域
が、所定サイズの主要ホワイト領域によって他の文書エ
レメント領域から離間されている場合は、これらの文書
エレメント領域は、互いに独立したものとみなされる。
文書エレメントは、文書画像中で、見出し、テキスト、
グラフィック（図形）などの情報を含み主要ホワイト領
域によって互いに分離される矩形領域である。【００１１】構造モデルとは、原文書に起こり得る許容
可能な特定のカラムレイアウトを表わしたものである。
構造モデルは、２種類の規則的表現、すなわちカラム表
現とエレメント表現で表わされる。構造モデルのエレメ
ント表現は、実際のタグ付けプロセスに先立って原文書
のユーザーあるいは作成者がオフラインで提供するか、
または、原文書からの実際のトレーニングサンプルから
決定される。構造モデルのカラム表現は、実際のタグ付
けプロセスに先立って原文書のユーザーあるいは作成者
がオフラインで生成する。原文書に対する許容可能なカ
ラムレイアウトは、カラム表現として構造モデル中に示
される。エレメント表現は、構造モデル内で識別されカ
ラム表現で使用される各エレメントタイプに対する、文
書エレメント内の主要ホワイト領域のあらゆる可能なシ
ーケンスを表わすものである。【００１２】本発明の文書エレメント識別システムは、
入力文書画像の主要ホワイト領域シーケンスまたはパタ
ーンの一部と構造モデルのエレメント表現とのすべての
可能なマッチングを抽出する。次いで、入力文書の入力
ページの各カラムごとに、入力文書ページ画像の主要ホ
ワイト領域を示すカラムストリングを少なくともひとつ
生成する。さらに、入力文書画像の主要ホワイト領域を
マッチングするマッチングエレメント表現のすべての組
み合わせを構造モデルのカラム表現と比較する。複数の
カラムストリングまたはマッチングエレメント表現の組
み合わせがカラム表現に対して連続的にテストされる場
合は、選択装置でベストマッチ（すなわち、最も近いマ
ッチング）を選択する。【００１３】入力文書画像との対比に基づき、最も近い
マッチングカラム表現の文書エレメントに関連する論理
タグを用いて、入力文書画像の対応する文書エレメント
にタグ付けを行い論理的に文書エレメントを識別する。【００１４】代替的に、入力文書画像の対応する文書エ
レメントは分割されてプリントまたは記憶のために出力
される。【００１５】本発明のその他の目的、効果は、図面を参
照して以下の好ましい実施形態の詳細な説明から一層明
確となる。【００１６】【発明の実施の形態】図１は、本発明の文書エレメント
識別システム１００の好ましい実施の形態を示す。文書
エレメント識別システム１００は、文書ホワイト領域抽
出システム１１０、主要ホワイト領域選択手段１２０、
メモリ１３０、プロセッサ１４０、構造モデル定義手段
１５０、カラムストリング選択手段１６０、カラム表現
比較手段１７０、論理タグ割り当て手段１８０、および
文書エレメント抽出手段１９０を含み、これらすべてが
バス手段１０５を介して互いに接続される。図１に示す
ように、文書エレメント識別システム１００を汎用コン
ピュータ３００で実行するのが好ましいが、もちろん、
専用コンピュータ、マイクロプロセッサベースまたはマ
イクロコントローラベースのシステム、ＡＳＩＣなどの
集積回路、ディスクリート素子回路などのハードワイヤ
ード電子回路、フィールドプログラマブルゲートアレイ
などのプログラマブルロジックデバイス（ＰＬＤ）など
でも実行可能である。【００１７】図２は、図１の文書エレメント識別システ
ム１００の、文書ホワイト領域抽出システム１１０の好
ましい実施形態を示す。図示のように、文書ホワイト領
域抽出システム１１０は、連続成分識別手段２６０、境
界ボックス生成手段２５０、および主要ホワイト領域抽
出手段２４０を含み、これらはすべてをバス手段１０５
に接続される。まず、スキャナ２１０またはメモリ１３
０から、文書画像データを連結要素識別手段２６０に入
力する。メモリ１３０は、汎用コンピュータ３００の内
部メモリでもよいし、ディスクドライブ、ＣＤ−ＲＯ
Ｍ、ＥＰＲＯＭなどの形態で、周知のように汎用コンピ
ュータ３００の外部に配置されてもよい。スキャナ２１
０からの文書画像データを、連結要素識別手段２６０に
入力する前に、まずメモリ１３０に記憶してもよい。文
書画像データは、バイナリ（２値）画像、または複数ビ
ットのディジタル信号の形態で連結要素識別手段２６０
に入力される。１ビット以上の各セットは、文書画像の
特定の画素がオンであるかオフであるかを示す。【００１８】連結要素識別手段２６０は、文書画像デー
タを受信したならば、その文書画像内のすべての連結要
素を検出する。図４は、文書画像４００の例である。連
結要素４１０は、「オフ」画素（ホワイト画素）に囲ま
れた一連の隣接する「オン」画素（ブラック画素）で構
成される。文書画像４００内の連結要素４１０を検出す
るシステムについては、当技術分野において周知であ
る。【００１９】文書画像４００の連結要素４１０を識別し
たならば、境界ボックス生成手段２５０で、各連結要素
４１０ごとの境界ボックス４２０を生成する。当業界で
周知のように、境界ボックス４２０は、連結要素４１０
を完全に囲い込む最小の矩形のボックスである。連結要
素から４１０から境界ボックス４２０を生成するシステ
ムも当分野において周知である。【００２０】境界ボックス情報を含んだ文書画像データ
は、主要ホワイト領域抽出手段２４０に送られる。主要
ホワイト領域抽出手段２４０は、図５および６に示すよ
うに、文書画像４００の垂直方向および水平方向の主要
ホワイト領域を抽出する。【００２１】文書ホワイト領域抽出システム１１０の好
ましい実施形態では、主要ホワイト領域抽出手段２４０
は、図３に示す２つのセクション、すなわち垂直抽出部
２４１と水平抽出部２４２に分かれている。垂直抽出部
２４１、水平抽出部２４２はそれぞれ、一次（プリミテ
ィブ）ホワイト領域抽出手段２４３、比較手段２４４、
消去手段２４５、グループ化手段２４６を備え、これら
をバス手段１０５に接続する。垂直抽出部２４１と水平
抽出部２４２は同一の構成要素を含み、同様の方法で動
作する。【００２２】図５に示すように、水平抽出部２４２はま
ず、一次（プリミティブ）ホワイト領域４３０−１〜４
３０−１０を抽出し、水平方向に主要ホワイト領域４６
０を形成する。同様に、図６に示すように垂直抽出部４
２１は、一次ホワイト領域４３０−１１〜４３０−１９
を抽出し、垂直方向に主要ホワイト領域４６０を形成す
る。【００２３】水平方向の主要ホワイト領域４６０の形成
は、水平一次ホワイト領域４３０−１〜４３０−１０の
隣接する領域同士を、特定の規則に従ってひとつのグル
ープに併合し、水平方向にグループ化したひとつ以上の
一次ホワイト領域とすることによって実効される。同様
に、垂直方向の主要ホワイト領域４６０の形成も垂直一
次ホワイト領域４３０−１１〜４３０−１９の隣接する
領域同士を、特定の規則にしたがってひとつのグループ
に併合することによって実効され、これによって垂直方
向にグループ化されたひとつ以上の一次ホワイト領域が
できる。垂直および水平方向の一次ホワイト領域をグル
ープ化し併合した後、水平一次ホワイト領域４３０およ
び水平方向にグループ化した一次ホワイト領域の中で、
しきい値幅４４０より広い幅を有し、しきい値高さ４５
０を越える高さの領域を識別する。同様に、垂直一次ホ
ワイト領域４３０と垂直方向にグループ化した一次ホワ
イト領域の中で、しきい値高さ４５０’を越える高さ
で、しきい値幅４４０’より広い幅の領域を識別する。
これらの識別された領域が、主要ホワイト領域となる。【００２４】図７に示すように、多様な文書エレメント
４７０を、垂直方向の主要ホワイト領域４６０によっ
て、カラム４０５にグループ化する。図７の例では、文
書画像４００は、２つのカラム４０５−１、４０５−２
に分けられる。【００２５】しかし、分割された各領域の論理的な識別
はまだ成されていない。すなわち、ここから各カラムの
文書エレメントをひとつ以上の構造モデルを使用して識
別する必要がある。【００２６】文書ホワイト領域抽出システム１１０によ
って、上述のような主要ホワイト領域の識別を行ったな
らば、主要ホワイト領域選択手段１２０で、主要ホワイ
ト領域の多様な領域を選択する。構造モデル定義手段１
５０を設けて、ユーザーが構造モデルカラム表現とエレ
メント表現を入力できるようにしてもよい。原文書の作
成者によって構造モデルがあらかじめ与えられている場
合は、構造モデル定義手段１５０は必要ない。カラムス
トリング選択手段１６０は、文書画像カラム内の選択さ
れた主要ホワイト領域を使用し、入力ページのカラム内
の主要ホワイト領域シーケンスに対応するカラムストリ
ングを識別する。カラムストリング選択手段１６０で
は、エレメント候補の抽出処理も行い、構造モデルエレ
メント表現を文書画像のカラム内の主要ホワイト領域シ
ーケンスとマッチングさせる。カラム表現比較手段１７
０は、文書画像カラム内の選択された主要ホワイト領域
から生成したキャラクタストリングまたはカラムストリ
ングを対応する構造モデルのカラム表現と比較する。カ
ラム表現比較手段１７０によって行われるカラム表現比
較プロセスは、構造照合プロセスと、ベストマッチのカ
ラム表現を識別するカラム表現選択プロセスとを含む。
次いで、論理タグ割り当て手段１８０は、識別されたマ
ッチング構造モデルカラム表現に基づいて論理タグを主
要ホワイト領域間の領域に割り当て、文書エレメント抽
出手段１９０が論理的にタグ付けされた文書エレメント
４７０を抽出する。代替的に、論理タグ割り当て手段１
８０の代わりに、または論理タグ割当手段１８０に付加
して文書エレメント抽出手段を用いることもできる。論
理タグ割り当て手段１８０、文書エレメント抽出手段１
９０、および構造モデル定義手段１５０の詳細について
は、上述の特願平７−２４３２１２および特願平７−２
４３２１３に記載されている。【００２７】文書ホワイト領域抽出システム１１０は、
上述の文書ホワイト領域抽出機能を果たすシステムのひ
とつの例であり、これに限定されるものではない。【００２８】図７に示すように、文書画像４００は任意
の数の文書エレメント４７０を含む。入力文書画像の文
書エレメント間または文書エレメント内に存在する主要
ホワイト領域間の空間的（幾何学的）関係と、各原文書
に対して選択された構造モデル内で定義される空間的
（幾何学的）関係と比較することによって、文書画像４
００の文書要素４７０を論理的に識別する。文書画像の
カラム４０５の主要ホワイト領域間の幾何学的関係が構
造モデルカラム表現で表わされる関係とマッチすれば、
文書画像４００の文書エレメント４７０の文書画像カラ
ムレイアウトが特定される。次いで、文書画像４００の
文書エレメント４７０を主要ホワイト領域によって抽出
する。抽出した文書エレメント４７０に、原文書の構造
モデルカラム表現内の対応のエレメントタイプの論理タ
グを割り当てる。論理タグは、異なるタイプの文書エレ
メント４７０のネームであり、例えば「見出し」、「文
章」、「図」などである。論理タグを文書エレメント４
７０に割り当てることによって、文書エレメント４７０
を論理的に識別する。【００２９】しかし、文書画像４００の文書エレメント
４７０を論理的に識別する前に、各原文書ごとに少なく
ともひとつの構造モデルを、文書エレメント識別システ
ム１００に供給しなければならない。各構造モデルは、
原文書に対応する文書エレメントタイプと、対応する原
文書の文書エレメント間の空間的（幾何学的）関係とを
含む。【００３０】構造モデルは、基準に合ったカラムのレイ
アウトを示す構造を記述したものであり、２種類の規則
的表現、すなわちカラム表現とエレメント表現で表わさ
れる。カラム表現は、任意の原文書において、その原文
書の各カラムに含まれるあらゆる可能な基準に合ったエ
レメントタイプシーケンスを表わす。エレメント表現
は、原文書の各エレメントタイプに対するあらゆる可能
な基準に合った主要ホワイト領域シーケンスを表わす。
エレメント表現の定義は、カラム表現に含まれ得る。各
原文書ごとに、それぞれ異なる既知のカラムレイアウト
の規則があると考えられるので、特定の原文書ごとに構
造モデルが与えられる。本発明の好ましい実施形態で
は、構造モデルのカラム表現はワンカラム（one-colum
n）ページレイアウトの表現であるが、本発明のシステ
ムと方法を、２カラム、３カラム、さらに一般化してｎ
カラムページレイアウトにも適用できる。本発明の実施
形態では、文書エレメント識別システム１００に原文書
が供給されることを前提としており、ここからカラム形
式の文書画像を生成し、その文書エレメントに論理的に
タグ付けする。【００３１】文書エレメントはカラムの上から下へ順に
配置される。カラム内の各文書エレメントの位置は、各
原文書の構造モデルのカラムレイアウト規則にしたがっ
て決定される。各原文書カラムの基準に合ったカラムレ
イアウトは、少なくともひとつのカラム表現で表わされ
る。各カラムはトップとボトムを有し、各原文書は、
「見出し」、「テキスト」、といった許容文書エレメン
トタイプの集合を含む。好ましい実施の形態では、原文
書のユーザーが、その原文書の構造モデルの許容範囲内
のカラムレイアウトを定義する規則集合を生成している
ものとするが、原文書の作成者または供給源から各原文
書のカラム表現を得るようにしてもよい。【００３２】以下で、エレメントタイプおよびカラムの
規則的表現の例を示す。実際のカラムまたは構造モデル
のカラム表現に与えられる規則的表現は、次式（１）で
表わされる。Ｃｏｌｕｍｎ＝＾Ｆ＊（Ｈ？Ｔ｜
Ｆ）（ＨＴ）＊Ｆ＊＄（１）（１）式からわかるように、適正な文書エレメ
ントタイプの集合は、「見出し（Ｈ）」、「図
（Ｆ）」、「テキスト（Ｔ）」を含む。さらに、原文書
の規則として、以下の規則を適用する。１．見出しは、
カラムのボトムには配置されない。２．各カラムに、少
なくともひとつのテキストブロックを配置する。３．テ
キストブロックは、必ず見出しブロックの下に配置す
る。４．ひとつ以上の図を含むシーケンスは、必ずカラ
ムのトップまたはボトムに接する。符号Ｈ、Ｔ、Ｆは
文書エレメントのそれぞれのエレメントタイプである
「見出し」、「テキスト」、「図」を意味する。符号
「＾」は、カラムのトップすなわちカラム表現の開始を
表わす。符号「＄」は、カラムのボトムすなわちカラム
表現の終端を示す。符号「？」は、ひとつ前の表現がオ
プショナルであることを示す。符号「＊」は、先行する
表現が不定回数繰り返され得るか、現われないことを示
す。すなわち、「＊」は、先行する表現がまったく現わ
れないか、一度以上現われ得ることを示す。「｜」は、
論理ＯＲを意味する。括弧（）は、演算子の「自然」優
先順位を無視して、「＊」を最上位に、２番目にエレメ
ントタイプの連結を、そして最後に「｜」をもってく
る。図８は、原文書４０１における、基準に合った３つ
のカラムレイアウト４０５−３、４０５−４、４０５−
５を示す。各許容可能カラム４０５−３、４０５−４、
４０５−５にそれぞれ対応するエレメントシーケンス
を、次式（２）〜（４）で示す。【００３３】Ｃｏｌｕｍｎ＝ＦＨＴ（２）Ｃｏｌｕｍｎ＝ＨＴＦＦ（３）Ｃｏｌｕｍｎ＝ＦＨＦ（４）同様のプロセスを用いて、各原文書の構造モデルで使用
する、識別したエレメントタイプのエレメント表現を定
義する。原文書の各エレメントタイプにおけるエレメン
ト表現は、原文書のユーザーが生成するのが好ましい。
各エレメントタイプは、「見出し」など、同一の論理タ
グを持った文書エレメントのカテゴリーのいずれかに対
応する。すべてのエレメントタイプは、次式（５）に示
す規則的表現によって表わされる。【００３４】Ｅｌｅｍｅｎｔ＝ＡＷ＊Ｂ（５）ここで、「Ｅｌｅｍｅｎｔ」は、「テキスト」、「見出
し」などのエレメントタイプを表わし、符号Ａ、Ｗ、
＊、Ｂは規則的表現を示す。「Ａ」は、その文書エレメ
ントの上方（Ａｂｏｖｅ）に位置する主要ホワイト領域
のサイズにしたがって割り当てられるキャラクタコード
をマッチングさせる規則的表現である。「Ｂ」は、その
文書エレメントの下方（Ｂｅｌｏｗ）に位置する主要ホ
ワイト領域のサイズにしたがって割り当てられるキャラ
クタコードをマッチングさせる規則的表現である。
「＊」は繰り返し演算子である。「Ｗ」は、文書エレメ
ント内部（Ｗｉｔｈｉｎ）に位置する主要ホワイト領域
のサイズにしたがって割り当てられるキャラクタコード
をマッチングさせる規則的表現である。繰り返し可能を
示す「Ｗ＊」は、繰り返し回数の範囲がわかれば、その
範囲を代入してもよい。例えば、Ｗ｛０，３｝と表わさ
れると、Ｗが０回〜３回繰り返されることを示す。【００３５】エレメントタイプのひとつである「テキス
ト」がカラムの中央に位置する場合は、このエレメント
表現は次式（６）で示す規則的表現で表わされる。【００３６】Ｔｅｘｔ＝［ｆ〜ｇ］［ａ〜ｂ］＊［ｆ〜ｇ］（６）ここで、ａ、ｂ、ｆ、ｇはそれぞれ３ポイント、５ポイ
ント、１６ポイント、２０ポイントのライン間隔を表わ
す。これらの値はサンプルデータから得られるものなの
で、エレメント表現は、各原文書のサンプルデータの解
析から自動的に生成される。【００３７】各テキストは、テキスト内の隣接のライン
と３〜５ポイントの間隔を保ち、そのテキストの上ある
いは下に配置される別の隣接文書エレメントとは、１６
〜２０ポイントのライン間隔で隔てられる。【００３８】図９に移って、文書エレメント４０６−１
は、カラム内にテキストエレメントを含む。文書エレメ
ント４０６−２では、テキストはカラムのトップから始
まらなければならないので、このエレメント表現は符号
「＾」で始まる。文書エレメント４６０−３では、テキ
ストはカラムのボトムまで降りる必要があるので、エレ
メント表現は符号「＄」で終了する。文書エレメント４
６０−４では、テキストはカラムのトップからボトムま
で続くので、エレメント表現は符号「＾」で始まり、
「＄」で終了する。【００３９】図９の「テキスト」文書エレメントの４つ
の配置例４０６−１、４０６−２、４０６−３、４０６
−４は、それぞれ式（６）〜（９）に示す規則的表現に
対応する。【００４０】Ｔｅｘｔ＝＾［ａ〜ｂ］＊［ｆ〜ｇ］（７）Ｔｅｘｔ＝［ｆ〜ｇ］［ａ〜ｂ］＊＄（８）Ｔｅｘｔ＝＾［ａ〜ｂ］＊＄（９）これらの４つの規則的エレメント表現をひとつなぎに連
結するか、あるいは論理ＯＲ演算して、このエレメント
タイプ（テキスト）のエレメント表現を求める。換言す
ると、式（６）〜（９）で示す４つの規則的表現をひと
つのエレメント表現へと結合し、次式（１０）に示すよ
うに、４つの配置すべてをカバーする。Ｔｅｘｔ＝［ｆ〜ｇ］［ａ〜ｂ］＊［ｆ〜ｇ］｜＾［ａ〜ｂ］＊［ｆ〜ｇ］｜［ｆ〜ｇ］［ａ〜ｂ］＊＄｜＾［ａ〜ｂ］＊＄（１０）原文書で使用されている、識別したエレメントタイプの
各々に対して、エレメント表現が必要である。各エレメ
ント表現を、カラム表現比較手段４７０によって入力文
書画像の主要ホワイト領域の空間ストリングと比較す
る。この例では、テキストエレメントは、テキスト内の
隣接ラインとのライン間隔は３〜５ポイントという単位
フォーマットを有し、隣接の（上下に位置する）別の文
書エレメントとは１６〜２０ポイントで隔てられる。テ
キストエレメントはカラムのトップ、ボトム、カラム全
体、またはカラム内部に位置することができる。文書カ
ラム内での文書エレメントの別の事象例として、テキス
トを例にとると、式（６）を繰り返すことによって表わ
し得る。【００４１】エレメント表現は、トレーニングデータま
たはサンプルデータから生成される。エレメント表現
は、原文書で認められたエレメントタイプの実際の画像
から得られた文書エレメントを定義する主要ホワイト領
域のサイズ範囲から決定できる。しかし、カラム表現
は、原文書のカラム構造を注意深くチェックできる原文
書の作成者またはユーザが生成するのが好ましい。【００４２】好ましい実施の形態では、タグ付け（解
析）前の文書５００の文書エレメントのタグ付けプロセ
スを初める前に、構造モデルをあらかじめ定義し、メモ
リ１３０の構造モデル格納部にあらかじめ記憶してお
く。したがって、各エレメントタイプのカラム表現とエ
レメント表現が、構造モデルと共にメモリ１３０の構造
モデル格納部に記憶されることになる。【００４３】メモリ１３０に記憶したひとつ以上の構造
モデルを使用して、解析前の文書画像５００の文書エレ
メント４７０を、図１０のように識別する。解析前の文
書５００は、まだ識別されていない文書エレメント４７
０を含み、これらを論理的に識別する必要がある。解析
前の文書５００は、スキャナ２１０または遠隔インター
フェイス２３０によって入力する。文書ホワイト領域抽
出システム１１０で、解析前の文書画像５００から主要
ホワイト領域を抽出する。抽出した主要ホワイト領域
を、主要ホワイト領域選択手段１２０に送る。主要ホワ
イト領域選択手段１２０は、適正な主要ホワイト領域を
選択し、これに基づいて、記憶した構造モデルのカラム
表現と、解析前文書画像５００の識別されていない文書
エレメントとを比較する用する。より正確には、選択し
た主要ホワイト領域のシーケンスと、構造モデルとのマ
ッチングを行う。【００４４】この比較方法として、文書ホワイト領域抽
出システム１１０と主要ホワイト領域選択手段１２０を
連続して繰り返し使用し、文書画像からの主要ホワイト
領域の抽出、選択を繰り返し、文書画像をまずカラムに
分割する。主要ホワイト領域のしきい値を調節した後、
入力文書画像で識別した各カラムを、文書ホワイト領域
抽出システム１１０と主要ホワイト領域選択手段１２０
に再度入力する。カラム内で抽出され選択された主要ホ
ワイト領域は、そのカラムの空間ストリングの検出に使
用される。この空間ストリングを、カラムストリング選
択手段１６０に入力する。別の手法として、文書画像を
まず、非矩形の文書エレメントを含むマルチカラムの文
書画像からカラムを検出する装置に入力し、この装置で
識別したカラムを、本発明の文書エレメント識別システ
ム１００のカラムストリング選択手段１６０に入力して
もよい。【００４５】例えば、図１０に示す解析前文書画像５０
０では、主要ホワイト領域選択手段１２０は、まず文書
画像５００の上から下まで垂直方向に延びる主要ホワイ
ト領域４６０を識別する。次いで、識別した垂直方向の
主要ホワイト領域４６０から、隣合う垂直主要ホワイト
領域４６０のペアをそれぞれ選び出すことによって、解
析前文書画像５００の各カラム５０１を識別する。この
ように識別した画像カラム５０１を、文書ホワイト領域
抽出システム１１０に入力した後、これを主要ホワイト
領域選択手段１２０に送る。主要ホワイト領域選択手段
１２０では、図１１に示すように、解析前文書画像５０
０のカラム５０１を特定する一対の垂直ホワイト領域４
６０の間に存在する、水平方向への主要ホワイト領域４
６０を選択する。主要ホワイト領域選択手段１２０はさ
らに、カラム５０１で選択した各水平主要ホワイト領域
４６０に、その垂直方向への高さに基づいて、それぞれ
キャラクタコードを割り当てる。カラム５０１の各水平
主要ホワイト領域に割り当てられたキャラクタコードを
つなげて、ひとつの空間ストリング５０２を生成する。
空間スリング５０２は、解析前画像５００のカラム５０
１に対応し、これを表わすものである。空間ストリング
５０２はまた、主要ホワイト領域４６０のシーケンス
（パターン）を定義し、このシーケンスを、構造モデル
のカラム表現およびエレメント表現と比較する。図１１
の例では、カラム５０１の空間ストリング５０２は「ａ
ｇｃｄｃｃｃｇｂｆｃｄｄｄｄ」である。【００４６】空間ストリング５０２は、水平方向の主要
ホワイト領域４６０だけを使用しており、水平主要ホワ
イト領域４６０の上下の位置関係を幾何学的的に表わ
す。構造モデルは、「物理的」なカラム構造を表現する
ために生成したものである。連続するカラムが連なって
いる場合は、キャラクタストリングを決定する前にまず
カラムを分割する。または、マルチカラム画像を、まず
一次元キャラクタで表現してから、そのキャラクタスト
リングを構造モデルと比較する。特願平７−２４３２１
３号は、２カラム文書画像などのより複雑な文書画像に
おいて、水平方向の左右関係に基づいてどのようにキャ
ラクタストリングを決定するかを開示している。【００４７】解析前文書画像５０１の空間ストリング５
０１を決定したならば、この空間ストリング５０２で表
わされる解析前文書５００の識別されていない文書エレ
メント４７０を、構造モデルと比較、すなわちマッチン
グを行う。各エレメントタイプに関する構造モデルのエ
レメント表現を、空間ストリング５０２と比較し、この
空間ストリングの任意の一部分とマッチするあらゆるエ
レメントタイプの位置を特定する。特定した各部分の例
を、図１１の双方向矢印５０３で示す。これらが集まっ
て、エレメントタイプリストを形成する。次いで、文書
エレメント識別システム１００は、カラム５０１におい
て、空間ストリング５０２の全体とマッチする、あらゆ
る可能なカラムストリング５０４を検出する。カラムス
トリング５０４は、構造モデルエレメントタイプの任意
の組み合わせであり、結果として原文書のカラムを形成
することになる。検出した各カラムストリング５０４
を、あらかじめ記憶したひとつ以上の構造モデルのカラ
ム表現と比較する。カラムストリング５０４が、ひとつ
以上のエレメントタイプの基準に合わない組み合わせで
ある場合は、そのカラムストリングを消去する。カラム
ストリング５０４のマッチングが複数検出された場合
は、スコアリングまたは選択プロセスで、統計的なデー
タに基づいて最良のマッチングを決定する。【００４８】上述の処理を行うために、空間ストリング
５０２を、主要ホワイト領域選択手段１２０から、カラ
ムストリング選択手段１６０へと出力する。空間ストリ
ング５０２は、解析前文書画像５００のカラム５０１内
の文書エレメント４７０と、この文書エレメント４７０
間または内部の主要ホワイト領域４６０との相対位置を
定義するものである。カラムストリング選択手段１６０
は、空間ストリング５０２とマッチする構造モデルエレ
メントタイプのシーケンスを選択する。より正確には、
カラムストリング選択手段１６０は、構造モデル内で定
義されるひとつ以上のエレメントタイプのエレメント表
現とマッチする、空間ストリング５０２のサブストリン
グを抽出する。【００４９】図１１の空間ストリングは、「ａｇｃｄｃ
ｃｃｇｂｊｃｄｄｄｄ」である。原文書中のエレメント
タイプ「見出し」における構造モデルエレメント表現
は、次式（１１）に示す通りである。【００５０】Ｈｅａｄｉｎｇ＝（＾［ａ〜ｂ］＊［ｆ〜ｇ］｜［ｆ〜ｇ］［ａ〜ｂ］＊［ｆ〜ｇ］）（１１）上述したように、構造モデル規則によって「見出し」エ
レメントがカラムのボトムに配置されることはないの
で、「見出し」のエレメント表現がキャラクタ「＄」で
終わることはない。また、エレメントタイプ「本文」の
構造モデルエレメント表現は、次式（１２）で示され
る。【００５１】Ｂｏｄｙｔｅｘｔ＝（＾［ｃ〜ｄ］＊＄｜＾［ｃ〜ｄ］＊［ｆ〜ｈ］｜［ｆ〜ｈ］［ｃ〜ｄ］＊［ｆ〜ｈ］｜［ｆ〜ｈ］［ｃ〜ｄ］＊＄）（１２）図１１に示す「見出し」ストリング「ａｇ」、「ｇｂ
ｆ」は、式（１１）で定義される「見出し」のエレメン
ト表現とマッチする。これらのマッチする「見出し」ス
トリングを、対応の開始および終了位置とともに、空間
ストリング５０２に記憶する。一方、図１１の「本文」
ストリングである「ｇｃｄｃｃｃｇ」と「ｆｃｄｄｄ
ｄ」も、式（１２）で定義する「本文」エレメント表現
とマッチするので、これらの「本文」ストリングも、対
応の開始位置および終了位置とともに空間ストリング５
０２に記憶する。【００５２】カラムストリング選択手段１６０は、解析
前文書画像５００の空間ストリング５０２全体とマッチ
する、構造モデルエレメントタイプのカラムストリング
５０４の候補リストを作り上げる。候補カラムストリン
グ５０４は、エレメントタイプとマッチする空間ストリ
ングから抽出したサブストリングの位置に基づいて生成
する。特に、マッチするカラムストリング５０４のすべ
ての可能な組み合わせが検出されるまで、抽出したサブ
ストリングの代わりに、エレメントタイプを表わすマッ
チングキャラクタコードを使用する。ひとつの空間スト
リング５０２から複数のカラムストリング５０４が生成
される可能性もあり、それらをカラムストリング候補リ
ストとして集める。カラムストリング選択手段１６０
は、カラムストリング候補リストにある各候補カラムス
トリングを、構造モデルのカラム表現と比較する。候補
カラムストリング５０４がカラム表現とマッチしない場
合は、そのカラムストリング５０４を候補リストから除
外する。こうして残ったカラムストリング５０４が、入
力文書画像カラムの基準に合ったカラムストリングであ
ると解釈される。【００５３】図１１の例では、空間ストリング５０２は
単一のカラムストリング５０４、すなわち「ＨＴＨＴ」
というカラムストリングに変換される。【００５４】カラムストリング選択手段１６０が、カラ
ムストリング候補リストに複数の候補カラムストリング
５０４が残っていると判断した場合は、候補リストをカ
ラム表現比較手段１７０に出力する。カラム表現比較手
段１７０は、残りの候補の中から、構造モデルのカラム
表現と最もマッチするベストマッチのカラムストリング
を選択する。文書エレメント識別システム１００は、主
要ホワイト領域のサイズに関する統計データを有してお
り、主要ホワイト領域サイズは、エレメントの境界にお
ける幾何学的的な位置関係（たとえば、文書エレメント
タイプの内部に位置する、または２つの文書エレメント
間に位置するなど）によって、カテゴリーに分類され
る。カラム表現比較手段１７０は、この統計的データに
基づいて、カラムストリングの最良のマッチングを決定
する。【００５５】構造モデルで「見出し」、「本文」、
「図」の３つのエレメントタイプが定義されるとする
と、図１２に示すように、全部で９つの幾何学的関係が
特定される。すなわち、図１２の白抜きの双方向矢印で
示す各エレメントタイプ（本文、見出し、図）内部での
関係と、実線の矢印で示す２つのエレメントタイプ間の
関係である。各幾何学的関係において、平均値（μ）と
標準偏差（σ）を計算して、文書エレメント識別システ
ム１００に記憶する。カラム表現比較手段１７０は、候
補カラムストリング５０４の各主要ホワイト領域が、ど
の幾何学的関係のカテゴリーに入るかを決定する。そし
て、主要ホワイト領域のサイズを統計サンプルと比較す
ることによって、主要ホワイト領域がその幾何学的関係
のカテゴリーと対応する確率を検出する。【００５６】ｌ個の可能なマッチングが残っているとす
ると、各マッチングｍｊ（ｊ＝１，．．．，ｌ）は、ｎ
個のホワイト空間を含む。正規分布だと仮定すると、ホ
ワイト空間ｗｉ（ｉ＝１，．．．，ｎ）が残りの候補
カラムストリングに対応する確率ｐは、統計理論によっ
て計算される。この計算は、次式（１３）に示すよう
に、正規分布の確率関数である。【００５７】【数１】【００５８】ここで、ｘ１はｗ１のサイズ、μｃおよび
σｃは、それぞれＷｉが属するホワイト空間カテゴリー
の平均値と標準偏差である。【００５９】各可能なマッチングのスコアを、すべての
ホワイト空間の確率値の平均値として計算する。可能な
マッチングｍｊのスコアＳは、次式（１４）で表わされ
る。【００６０】【数２】【００６１】最もよくマッチする候補カラムストリング
５０４は、最高値のスコアＳを有するので、最高のスコ
アの候補カラムストリング５０４が、最終的にマッチす
るカラムストリング５０４として選択される。その後文
書エレメント識別システム１００は、カラム画像を文書
エレメントに分割し、それぞれに対応の論理タグを付け
る。タグ付けを行ったならば、解析前文書画像５００の
文書エレメント４７０を、マッチするカラム表現の対応
の論理タグとともにメモリに記憶するか、以降の処理の
ためにプロセッサ１４０に出力する。以降の処理例とし
て、論理タグに基づいた、特定の文書エレメント４７０
の光学的文字認識処理がある。【００６２】次に、記憶した構造モデル（識別した各エ
レメントタイプのカラム表現およびエレメント表現を含
む）を用いて文書エレメント４７０を識別する方法の良
好な実施の形態を、フローチャートに基づいて説明す
る。【００６３】図１３は、カラム形式の文書画像の文書エ
レメントに論理的にタグ付けを行う処理を示すフローチ
ャートである。ステップＳ１００でオペレーションを開
始し、ステップＳ２００で、原文書のカラム表現と必要
とされるエレメント表現とを含む構造モデルを、文書識
別システム１００のメモリ１３０に記憶する。【００６４】ステップＳ３００で、文書画像の主要ホワ
イト領域を抽出する。ここで抽出した文書画像の主要ホ
ワイト領域に基づいて、まず文書画像のカラムを識別
し、次いで、各カラム内の主要ホワイト領域シーケンス
を検出する。文書画像主要ホワイト領域を抽出したなら
ば、ステップＳ４００で、文書画像カラムの空間ストリ
ングを検出する。抽出したカラムごとに、Ｓ４００以降
のプロセスを繰り返す。【００６５】ステップＳ５００において、主要ホワイト
領域選択手段１２０で生成した空間ストリングとマッチ
する、可能なカラムストリングの集合を検出する。ステ
ップＳ６００で、検出したカラムストリングの候補リス
トの中から、構造モデルで定義されるカラム表現に最も
よくマッチするカラムストリングを選択する。【００６６】ステップＳ７００で、ベストマッチのカラ
ムストリングの主要ホワイト領域に基づいて、そのカラ
ムの文書画像エレメントを分割する。ステップＳ８００
で、分割した文書エレメントを、選択したベストマッチ
のカラムストリングに対応する論理タグでラベル付けす
る。ステップＳ９００で、文書エレメントを抽出して、
例えばプリンタ３００に出力する。ステップＳ１０００
でオペレーションを終了する。【００６７】図１４は、図１３のオペレーションフロー
のうち、文書画像主要ホワイト領域４６０の抽出に関連
するステップＳ３００を、詳細に説明するものである。
ステップＳ３００の開始後、ステップＳ３１０で、文書
画像を入力する。入力文書画像４００は、複数の文書エ
レメント４７０を含む。ステップＳ３２０で、文書画像
４００内の連結要素４１０を識別する。ステップＳ３３
０で、識別した連結要素４１０の各々に関して、境界ボ
ックス４２０を生成する。ステップＳ３４０で、主要ホ
ワイト領域４６０を抽出する。抽出した主要ホワイト領
域に基づいて、ステップＳ３０５で各カラムを検出す
る。ステップＳ３０５ですべてのカラムを検出したなら
ば、ステップＳ３４５で、各カラムごとに、カラム領域
の左から右へ延びるすべての水平ホワイト領域を抽出す
る。抽出した水平ホワイト領域は空間ストリングの生成
に使用される。ステップＳ３５０で、オペレーションは
ステップＳ４００に復帰する。【００６８】図１５は、図１４の主要ホワイト領域抽出
ステップＳ３４０を詳細に示すフローチャートである。
ステップＳ３４０の開始後、ステップＳ３４２で、一次
ホワイト領域４３０を抽出する。図５で示したように、
一次ホワイト領域４３０は、境界ボックス４２０間の矩
形のホワイト空間領域である。次に、ステップＳ３４３
で、水平方向の各一次ホワイト領域４３０の幅と高さ
を、しきい値幅４４０としきい値高さ４５０と比較し、
また、垂直方向の各一次ホワイト領域４３０の幅と高さ
を、しきい値幅４４０’およびしきい値高さ４５０’と
比較する。水平ホワイト領域のしきい値幅４４０は、文
書画像４００の水平方向の長さ（すなわち幅）の１／３
に設定するのが好ましい。水平ホワイト領域のしきい値
高さ４５０は、文書画像中のテキストのライン間隔より
小さく設定するのが好ましい。一方、垂直ホワイト領域
のしきい値高さ４５０’は、文書画像４００の垂直方向
の長さ（すなわち高さ）の１／３に設定するのが好まし
く、しきい値幅４４０’は文書画像中のテキストのライ
ン間隔より大きな値に設定するのが好ましい。比較の結
果、しきい値高さ（４５０、４５０’）およびしきい値
幅（４４０、４４０’）を越えるサイズの一次水平ホワ
イト領域および一次垂直ホワイト領域が、主要ホワイト
領域として識別される。【００６９】ステップＳ３４４で、水平しきい値幅４４
０より狭い幅の水平一次ホワイト領域４３０と、垂直し
きい値高さ４５０’より低い高さの垂直一次ホワイト領
域４３０を消去する。ステップＳ３４５で、残った一次
ホワイト領域４３０をグループに併合して、主要ホワイ
ト領域４６０を識別する。ステップＳ３４６で、これら
の主要ホワイト領域のうち、水平方向と垂直方向の少な
くとも一方が、対応の垂直しきい値と水平しきい値より
小さいものを消去する。または、水平方向と垂直方向の
双方ともに、対応の垂直しきい値と水平しきい値より小
さいものだけを消去してもよい。ステップＳ３４５にお
ける、残りの一次ホワイト領域４３０のグループ化およ
び主要ホワイト領域４６０の検出方法として多様な方法
が考えられるが、例えば前述の特願平７−２４３２１３
号に開示される方法で実行する。ステップＳ３４７で、
オペレーションをステップＳ３５０に復帰する。【００７０】図１６は、ステップＳ４００での文書画像
カラムの空間ストリング決定プロセスの好ましい実施形
態のフローチャートである。ステップＳ４００でオペレ
ーションが開始したならば、ステップＳ４１０で、抽出
した主要ホワイト領域の幾何学的関係を検出する。ステ
ップＳ４２０において、抽出した主要ホワイト領域のう
ち、文書エレメント４７０を分割する主要ホワイト領域
を選択する。ステップＳ４３０で、選択したすべての主
要ホワイト領域に対して、そのサイズに応じてキャラク
タコードを割り当てる。ステップＳ４４０で、全体の文
書画像カラム５０１に含まれる主要ホワイト領域４６０
の空間ストリング５０２を決定する。ステップＳ４５０
で、オペレーションをステップＳ５００に復帰する。【００７１】図１１の例では、垂直方向に揃えた文書エ
レメント４７０の分割に、水平主要ホワイト領域４６０
だけを用いている。このプロセスを一般化して、マルチ
カラムの文書に適用することも可能である。複数カラム
の文書での空間ストリングの検出方法は、前述の特願平
７−２４３２１３号に詳細に述べられている。【００７２】図１７は、ステップＳ５００のカラムスト
リング選択プロセスの好ましい実施形態を示すフローチ
ャートである。ステップＳ５００で開始した後、ステッ
プＳ５０５で構造モデルエレメントタイプを選択する。
ステップＳ５１０で、前工程で選択したエレメントタイ
プのすべてのエレメント表現を空間ストリング５０２を
比較する。ステップＳ５１５で、空間ストリング５０２
の任意の一部分にマッチするエレメント表現があるかど
うかを判断する。エレメントタイプのカテゴリーまたは
エレメント表現が空間ストリングの一部とマッチすれ
ば、ステップＳ５２０に進む。ステップＳ５２０で、空
間ストリングの一部とマッチする各エレメント表現を、
例えばメモリ１３０に記憶する。マッチングしたエレメ
ント表現と共に、少なくとも空間ストリングの対応する
開始部分と終了部分もメモリ１３０に記憶する。ステッ
プＳ５２０で、各カテゴリーのエレメントタイプと、関
連のデータを記憶した後、ステップＳ５２５へ進む。ス
テップＳ５１５で、空間ストリング５０２の一部とマッ
チするエレメントタイプのカテゴリーがない場合は、オ
ペレーションは直接ステップＳ５２５にジャンプする。【００７３】ステップＳ５２０で、構造モデルを調べ
て、空間ストリング５０２と比較すべき構造モデルエレ
メントタイプがまだ残っているかどうかを判定する。比
較すべきエレメントタイプがまだある場合は、ステップ
Ｓ５２５に進み、次のエレメントタイプを選択し、オペ
レーションはステップＳ５１０に戻る。構造モデルで定
義されるすべてのエレメントタイプを空間ストリング５
０２と比較したと判断されたならば、オペレーションは
ステップＳ５２５からＳ５３０へと進む。【００７４】ステップＳ５３０で、構造モデル内の識別
された各エレメントタイプのエレメント表現のうち、空
間ストリング５０２とマッチしたすべてのエレメント表
現をメモリ１３０から呼び込む。ステップＳ５３５で、
空間ストリングの一部とマッチする多種のエレメント表
現を組み合わせて、全体の空間ストリングとマッチする
カラムストリング５０４を生成する。ステップＳ５４０
で、カラムストリング５０４を、構造モデルで定義され
るカラム表現に対してチェックし、カラムストリング５
０４が、構造モデルに対応して、原文書の基準にあった
カラムレイアウトであるかどうかを判断する。【００７５】ステップＳ５４０で、識別したカラムスト
リング５０４が基準を満たすカラムレイアウトであると
判定されたなら、ステップＳ５４５へ進む。ステップＳ
５４５で、このカラムストリングをカラムストリング候
補リストに追加してステップＳ５５５に進む。一方、ス
テップＳ５４０で、前工程で識別したカラムストリング
が基準に合わないカラムレイアウトであると判定された
場合は、ステップＳ５５０でこのカラムストリングを消
去し、ステップＳ５５５に進む。ステップＳ５５５で、
カラムストリング選択手段１６０は、マッチしたエレメ
ントタイプのカテゴリーから、さらに追加のカラムスト
リングが集められるかどうかを判断する。さらに追加の
カラムストリングがある場合は、ステップＳ５３５に戻
り、次のカラムストリングを識別する。ステップＳ５５
５で、すべての可能なカラムストリングを識別したなら
ば、ステップＳ５６０へ進み、オペレーションをステッ
プＳ６００に復帰する。【００７６】図１８は、ステップＳ６００の最良のマッ
チングカラムストリング選択プロセスの良好な実施形態
を示すフローチャートである。ステップＳ６００でオペ
レーションを開始した後、ステップＳ６０５で候補カラ
ムストリングリストをチェックして、ひとつ以上のキャ
ラクタストリング５０４がカラムストリング候補リスト
にあるかどうかを判断する。候補カラムストリングがひ
とつしかない場合は、ステップＳ６１０に進み、そこで
このカラムストリングを、カラム表現の最良のマッチと
して選択する。その後、ステップＳ６１５を経てステッ
プＳ７００に復帰する。【００７７】ステップＳ６０５で、カラムストリング候
補リストがひとつ以上のカラムストリングを含むと判定
された場合は、ステップＳ６２０に進む。ステップ６２
０で、カラム表現の幾何学的関係のカテゴリーの統計デ
ータを検索する。ステップＳ６２５で、第１のカラムス
トリング候補を現在のカラムストリングとして識別す
る。ステップＳ６３０で、現在のカラムストリングの第
１の主要ホワイト領域のスコアを、式（１３）に従って
決定する。式（１３）は、カラムストリングの主要ホワ
イト領域の各々のスコアリングのための式である。ステ
ップＳ６３５で、現在のカラムストリングをチェックし
て、スコアリングすべき主要ホワイト領域がまだ残って
いるかどうかを判定する。スコアリングすべき主要ホワ
イト領域が残っている場合は、ステップＳ６３０へ戻
る。現在のカラムストリングですべての主要ホワイト領
域をスコアリングしたならば、ステップＳ６３５からス
テップＳ６４０に進む。ステップＳ６４０で、式（１
４）に基づいて現在のカラムストリングの総スコアを求
め、例えばメモリ１３０に記憶する、ステップＳ６４５
で、カラムストリング候補リストをチェックして、スコ
アリングするカラムストリングがまだあるかどうかを調
べる。スコアリングすべきカラムストリングが残ってい
る場合は、ステップＳ６５０に進む。ステップＳ６５０
で、次の候補カラムストリングを現在のカラムストリン
グとして選択して、ステップＳ６３０に戻る。ステップ
Ｓ６４５で、すべての候補カラムストリングがスコアリ
ングされたと判断されたら、ステップＳ６５５に進む。
ステップＳ６５５で、最高得点のカラムストリングを、
カラム表現に最もマッチするカラムストリングとして選
択する。ステップＳ６６０で、ステップＳ７００に復帰
する。【００７８】図１９は、カラム中の文書エレメント４７
０に、未知の入力文書にマッチングする主要ホワイト領
域パターンに基づいて、論理的にタグ付けする方法を示
すフローチャートである。まず、ステップＳ１１００を
開始した後、ステップＳ１２００で、複数種類の周知の
原文書の構造モデルを、文書エレメント識別システムに
記憶する。構造モデルは、少なくともカバーページ、カ
ラム表現、および構造モデル内で定義されるエレメント
タイプ間または内部の幾何学的関係のカテゴリーを分割
する主要ホワイト領域の見込みサイズに関する統計的デ
ータを含む。図１２は、３つのエレメントタイプ「見出
し」、「本文」、「図」を有する構造モデルにおいて、
統計データを必要とする幾何学的関係のカテゴリーを示
す。【００７９】ステップＳ１３００で、カバーページと、
論理的にタグ付けを行うべき文書エレメントを含む未知
の入力文書とを、文書エレメント識別システムに入力す
る。入力文書画像は、入力文書をカラムごとに分割する
主要ホワイト領域を有する。この主要ホワイト領域を抽
出して、文書画像のカラムを以降の解析のために記憶す
る。【００８０】ステップＳ１４００で、分割した未知の入
力文書のカバーページを検索し、あらかじめ記憶してお
いた複数種類の原文書カバーページと比較して、マッチ
ングするカバーページを識別する。マッチングしたカバ
ーページで未知の入力文書タイプを識別し、識別された
入力文書タイプの対応の構造モデルを記憶装置から検索
する。複数の文書のカバーページを識別するプロセス
は、特願平７−２４３２１２号に述べられている。また
は、文書エレメント識別システムが、ユーザーに未知の
入力文書タイプをマニュアル（手動）で識別するように
指示を出すようにしてもよい。この場合、ユーザーによ
って未知の入力文タイプが特定された後に、対応の構造
モデルを検索する。【００８１】ステップＳ１５００で、文書エレメント識
別システムは、対応の構造モデルを使用して、入力文書
画像の分割カラムの文書エレメントを分割し、論理的に
タグ付けを行う。【００８２】ステップＳ１６００で、入力文書画像内の
論理的にタグ付けした文書エレメントを抽出し、例えば
プリンタ２００や記憶装置に出力する。【００８３】ステップＳ１７００で、文書エレメント識
別システムは、論理的にタグ付けした入力文書画像の文
書エレメント間の幾何学的関係の統計データを検索し、
識別した対応の構造モデルにおいて、文書エレメント間
または内部の幾何学的関係の統計データを更新する。更
新した幾何学的関係統計データを含む対応の構造モデル
を、再度記憶する。ステップＳ１８００で、プロセスを
終了する。【００８４】以上述べた良好な実施形態は、あくまでも
例示に過ぎず、当業者であれば、本発明の範囲内で多様
なデータ処理構成があり、上述の実施形態の例が本発明
を限定するものではないことは明白である。【００８５】【発明の効果】本発明によれば、文書構造を表わす簡単
な構造モデル表現を用い、ホワイト領域の検出に基づい
て、文書内の各エレメントを識別することができる。こ
の方法を用いると、文書画像の細部をそれぞれ解析して
その連結状態を検出する必要がない。実際の文書の構造
モデルに基づくので、どんな文書の識別にも適用するこ
とができる。Description: BACKGROUND OF THE INVENTION 1. Field of the Invention
Compare the white area required with the structural model to the columns in the document.
Document elements of the input document by analyzing the structure
Regarding the classification of (components), in particular the specific text
Acceptable color for a book, document type, or print
Statements in columns using a structural model that represents the system layout
Method and apparatus for logically identifying document elements
You. [0002] Documents to be Solved by the Related Art
A function that logically identifies an element (component) in an image
The next method is, for example, "A pro
to type Document Image Analysis System for technica
l Journal ", pp10-21, Computer, July 1992.
ing. Until now, documents have been analyzed based on the column structure.
Many methods have been developed, but all have column structures
Was based on general estimates of like this
In an accurate estimation-based method, the actual column structure
Question when there are parts or exceptions that differ from the general column structure
There was a title. Usually different source documents have different exceptions
So we can't afford to ignore such exceptions. In addition, the document structure does not depend on the context.
Page layout of the original document that can be expressed by grammar rules
Document structure can also be described by identifying documents
(See Naji above). However, this method is complicated
Different structure from image level to pixel level
Requires strict grammar rules expressed in fineness. Of the original document
Representing all levels of the document structure is a complex and difficult task.
It is a business, and simpler expressions are desired. Japanese Patent Application No. 7-243213 discloses a background area.
Method and apparatus for splitting a document by parsing
Has been disclosed. This method takes into account page layout
Without document, based on key white areas
Extract the rectangular document element. Also, Japanese Patent Application No. 7-243212 discloses a
Page images by matching site space patterns
A method for logically tagging is disclosed. This method uses the ratio
Suitable for identifying document pages with relatively fixed layout
You. However, various elements were arranged in various orders
To apply this method to a flexible layout,
List all layouts
The problem of having to express in the corresponding document structure
is there. There's nothing you can't do, but
Is a tedious process, again a simpler
Expression is desired. SUMMARY OF THE INVENTION The present invention is directed to different types
Use a single column structure model for each of the documents
, Add a “heading” to the document element in each column of the input document.
Logical tags such as "
Provide an accurate and efficient way to identify document elements
You. In the present invention, between document elements and between document elements
The sequence of the main white areas in the document
By matching with Dell, the color of the input document
The system structure is analyzed. [0007] One aspect of the present invention is a document element in a document image.
A method for logically identifying a statement, wherein at least one
At least one column containing one document element
At least one structural model corresponding to the original document
Generating, wherein each structural model comprises at least
At least one element with one column representation
At least one document element of the source document that has a representation
The first key background area in the document image
Identifying a region, wherein the first primary background region comprises a sentence.
The document image into at least one column,
Identify a second key background area in each of the columns
A second primary background area identified, comprising the step of identifying
To main background area pattern for each column of document image
For each column of the document image
At least one matching main background area pattern
Main step with generating column strings
At least one column string that matches the pattern
To determine the best column string from the list
And the best column string has at least one
Best match the column expression of the
At least one sentence in the document image based on the tring
Logically identifying each of the write elements
You. The present invention also provides a document element in a document column.
The size and orientation of the main white area separating the
Allows users to develop structural models using information
The relationship between document elements in the document image column.
Provide a system that can be expressed. The main white area is the “white space”
A rectangular area whose minimum size is predetermined
ing. White space is the background of the document image,
Area or non-text area. Normal documents are
Black and / or colored on white background area
Because these images are formed, these background areas are
It is called "it space", but in the interpretation of the present invention
Is when the document is formed in a colored background area, or
Form text with white outline on black or color background
These colors are non-image areas even if
(Or black) background area defined as "white space"
I do. With the present invention, including document elements
Analyze the document image part and coherent group,
That is, there is no need to detect connected components from document elements.
No. The image on the document is scanned and the electronic or
A digital representation is formed. Area containing the document element
May have other document
These documents, if separated from the element area
The element regions are considered to be independent of each other.
Document elements include headings, text,
Major white areas including information such as graphics
These are rectangular areas separated from each other by areas. [0011] The structural model is a possible tolerance in the original document.
It represents a specific possible column layout.
Structural models have two types of regular expressions: column tables
It is represented by the present and element expressions. Element of structural model
Portrayal of the original document prior to the actual tagging process.
Users or creators provide offline,
Or from a real training sample from the original document
It is determined. Column representation of the structural model is actually tagged
User or creator of the original document prior to the process
Generated offline. Acceptable power for original document
The ram layout is shown in the structural model as a column representation.
Is done. Element representations are identified and
Statement for each element type used in the ram representation
All possible white space in the main white area
Sequence. The document element identification system according to the present invention comprises:
Key white area sequence or pattern of the input document image
All of the parts of the
Extract possible matches. Next, input the input document
For each column of the page, the main image of the input document page image
At least one column string that indicates a white area
Generate. In addition, the main white area of the input document image
All pairs of matching element expressions to match
Compare the combination with the column representation of the structural model. plural
A set of column strings or matching element expressions
If the match is continuously tested against the column representation
The best match (ie, the closest match)
Touching). Based on a comparison with the input document image, the closest
Logic related to the document element of the matching column expression
Using tags, the corresponding document elements of the input document image
And logically identify document elements. Alternatively, the corresponding document file of the input document image
Elements are split and output for printing or storage
Is done. For other objects and effects of the present invention, see the drawings.
Further clarification from the following detailed description of the preferred embodiments
Be certain. FIG. 1 shows a document element of the present invention.
1 shows a preferred embodiment of the identification system 100. documents
The element identification system 100 extracts the document white area.
Output system 110, main white area selecting means 120,
Memory 130, processor 140, structural model definition means
150, column string selecting means 160, column expression
Comparing means 170, logical tag assigning means 180, and
Including document element extraction means 190, all of which
They are connected to each other via a bus means 105. Shown in FIG.
The document element identification system 100 is
It is preferred to run on pewter 300, but of course
Dedicated computer, microprocessor-based or
Microcontroller-based systems, ASICs, etc.
Hard wires for integrated circuits, discrete element circuits, etc.
Electronic circuit, field programmable gate array
Such as programmable logic devices (PLDs)
But it can be done. FIG. 2 shows the document element identification system of FIG.
Of the document white area extraction system 110 of the
3 shows a preferred embodiment. As shown, document white area
The region extraction system 110 includes a continuous component
Bounding box generation means 250, and main white area extraction
Outgoing means 240, all of which are bus means 105
Connected to. First, the scanner 210 or the memory 13
0, the document image data is input to the connected element identifying means 260.
Power. The memory 130 is included in the general-purpose computer 300.
Internal memory, disk drive, CD-RO
In the form of M, EPROM, etc.
It may be arranged outside the computer 300. Scanner 21
0 to the connected element identifying means 260
Before inputting, it may be stored in the memory 130 first. Sentence
The writing image data can be a binary (binary) image or
Connected element identification means 260 in the form of digital signals
Is input to Each set of 1 bit or more is
Indicates whether a particular pixel is on or off. The connection element identification means 260 stores the document image data
When the data is received, all the link elements in the document image are
Detect element. FIG. 4 is an example of a document image 400. Communicating
Connected element 410 is surrounded by “off” pixels (white pixels)
A series of adjacent “on” pixels (black pixels)
Is done. Detecting connected component 410 in document image 400
Systems are well known in the art.
You. Identify the connected component 410 of the document image 400
Then, each bounding element is
Generate a bounding box 420 for each 410. In the industry
As is well known, bounding box 420 includes connecting element 410
Is the smallest rectangular box that completely encloses. Connection required
A system for generating a bounding box 420 from a prime 410
Systems are also well known in the art. Document image data including bounding box information
Is sent to the main white area extracting means 240. Major
The white area extracting means 240 is shown in FIGS.
As shown in FIG.
Extract the white area. The document white region extraction system 110
In a preferred embodiment, the primary white region extraction means 240
Are the two sections shown in FIG. 3, namely the vertical extractor
241 and a horizontal extraction unit 242. Vertical extractor
241 and the horizontal extraction unit 242 each have a primary (prime
Iv) white region extracting means 243, comparing means 244,
An erasing unit 245 and a grouping unit 246 are provided.
Are connected to the bus means 105. Vertical extractor 241 and horizontal
The extraction unit 242 includes the same components and operates in a similar manner.
Make. As shown in FIG. 5, the horizontal extractor 242 is
, Primary (primitive) white areas 430-1 to 430-4
30-10, and extract the main white area 46 in the horizontal direction.
0 is formed. Similarly, as shown in FIG.
21 is a primary white area 430-11 to 430-19
To form a main white area 460 in the vertical direction
You. Formation of Horizontal Main White Area 460
Of the horizontal primary white areas 430-1 to 430-10
Adjacent areas can be grouped together according to specific rules.
One or more groups merged horizontally and grouped horizontally
This is effected by making it a primary white area. As well
In addition, the formation of the main white region 460 in the vertical direction is
Next to the next white area 430-11 to 430-19
Areas are grouped together according to specific rules
Is effected by merging into
One or more primary white areas grouped in
it can. Glue vertical and horizontal primary white areas
And merged, the horizontal primary white area 430 and
In the primary white area, grouped horizontally and horizontally,
It has a width wider than the threshold width 440 and a threshold height 45
Identify areas with heights greater than zero. Similarly, the vertical primary
Primary white grouped vertically with white area 430
Height above the threshold height of 450 '
Identifies an area wider than the threshold width 440 '.
These identified areas are the primary white areas. As shown in FIG. 7, various document elements
470 is defined by a vertical major white area 460.
And group them into a column 405. In the example of FIG.
The writing image 400 has two columns 405-1 and 405-2.
Divided into However, the logical identification of each divided area
Has not been done yet. In other words, from here,
Document elements are identified using one or more structural models.
I need to separate it. The document white area extraction system 110
Therefore, the main white area was identified as described above.
If the main white area selection means 120 determines
Select various regions of the default region. Structural model definition means 1
50, the user can specify the structural model column expression and element.
It may be possible to input a statement expression. Original document creation
When a structural model is given in advance by an adult
In this case, the structural model definition means 150 is not required. Columns
The tring selection means 160 selects a selected item in the document image column.
In the column of the input page using the main white area
Column streams corresponding to the primary white region sequence
Identify Column string selecting means 160
Also performs element candidate extraction processing, and
Document representation in the main white area
Match with the sequence. Column expression comparison means 17
0 is the selected major white area in the document image column
Character string or column stream generated from
Is compared with the column representation of the corresponding structural model. Mosquito
Column expression ratio performed by ram expression comparison means 170
The comparison process consists of the structure matching process and the best match
A column expression selection process for identifying a column expression.
Next, the logical tag assigning means 180
Logical tags based on the column structure
Assign to areas between white areas that need to be
Document element whose output means 190 is logically tagged
Extract 470. Alternatively, logical tag assigning means 1
In place of 80 or added to logical tag assigning means 180
Then, a document element extracting means can be used. Argument
Logical tag assigning means 180, document element extracting means 1
90 and details of the structural model definition means 150
Are described in Japanese Patent Application Nos. 7-243212 and 7-2.
43213. The document white area extraction system 110
One of the systems that performs the document white area extraction function described above
However, the present invention is not limited to this. As shown in FIG. 7, the document image 400 is optional.
Of document elements 470. Sentence of input document image
Key existing between or within document elements
Spatial (geometric) relationships between white areas and each original document
Spatial defined in the structural model selected for
By comparing with the (geometric) relationship, the document image 4
00 document element 470 is logically identified. Document image
The geometric relationship between the main white areas of column 405 is
If it matches the relationship represented by the modeling model column expression,
Document image color of document element 470 of document image 400
The system layout is specified. Next, the document image 400
Extract document element 470 by major white area
I do. The extracted document element 470 contains the structure of the original document.
Logical element of the corresponding element type in the model column representation
Assign Logical tags represent different types of document elements.
470, for example, "heading", "sentence"
Chapters and figures. Logical element for document element 4
70, the document element 470
Are logically identified. However, the document element of the document image 400
Before logically identifying 470, at least
And one structural model, the document element identification system
Must be supplied to the system 100. Each structural model is
The document element type corresponding to the source document and the corresponding source
The spatial (geometric) relationships between the document elements of the document
Including. The structural model is based on the column layout that meets the criteria.
Out structure is described, two types of rules
Expression, ie, column expression and element expression
It is. The column expression is the source text in any source document.
That meet all possible criteria contained in each column of the
Represents a element type sequence. Element representation
Is available for each element type in the source document
The main white area sequence that meets various criteria.
The definition of the element representation can be included in the column representation. each
Different known column layouts for each source document
Rules for each particular source document.
A model is provided. In a preferred embodiment of the present invention
Means that the column expression of the structural model is one-column
n) Although it is the expression of page layout, the system of the present invention
Method and method, two columns, three columns, and more generalized n
Also applicable to column page layout. Implementation of the present invention
In the embodiment, the original document is stored in the document element identification system 100.
Is assumed to be supplied.
Generate a formula document image and logically add it to the document element
Tag it. The document elements are arranged in order from the top of the column to the bottom.
Be placed. The position of each document element in the column is
According to the column layout rules of the structural model of the original document
Is determined. Columns that meet the criteria for each source document column
Layouts are represented by at least one column expression.
You. Each column has a top and a bottom, and each source document is
Allowable document elements such as "Heading" and "Text"
Type set. In a preferred embodiment,
The user of the document is within the tolerance of the structural model of the original document
Rule set that defines the column layout of
Each original text from the creator or source of the original document.
The column expression of the book may be obtained. In the following, the element type and column
Here is an example of a regular expression. Actual column or structural model
The regular expression given to the column expression of
Is represented. Column = ＾ F * (H? T |
F) (HT) * F * ＄ (1) As can be seen from equation (1), the proper document element
The set of font types is “Heading (H)”, “Figure
(F) "and" text (T) ". In addition, the original document
The following rules apply as the rules for. 1. Headings are
It is not located at the bottom of the column. 2. For each column,
Place at least one text block. 3. Te
Make sure that the text block is located below the heading block.
You. 4. Sequences containing one or more figures must be
Touch the top or bottom of the The symbols H, T and F are
The respective element type of the document element
It means "heading", "text", and "figure". Sign
“＾” indicates the top of the column, that is, the start of the column expression.
Express. The symbol “＄” is the bottom of the column, that is, the column
Indicates the end of the expression. The sign "?" Indicates that the previous expression is
Indicates optional. The sign "*" precedes
Indicates that the expression can be repeated an indefinite number of times or does not appear
You. That is, "*" indicates that the preceding expression does not appear at all.
Indicates that it may not appear or may appear more than once. "|"
Means logical OR. Parentheses () indicate the operator's "natural"
Disregarding the order of priority, “*” is at the top and element at the second
Type connection and end with "|"
You. FIG. 8 shows three original documents 401 that meet the criteria.
Column layouts 405-3, 405-4, 405-
5 is shown. Each acceptable column 405-3, 405-4,
Element sequence corresponding to each of 405-5
Are represented by the following equations (2) to (4). Column = FHT (2) Column = HTFF (3) Column = FHF (4) Using a similar process in the structural model of each original document
The element representation of the identified element type
Justify. Element for each element type in the original document
The default representation is preferably generated by the user of the original document.
Each element type has the same logical tag,
To one of the categories of document elements that have
Respond. All element types are shown in the following equation (5).
Are represented by regular expressions. Element = AW * B (5) where “Element” is “text”, “heading”
Represents an element type such as
*, B indicates a regular expression. "A" is the document element
Main white area located above (Above)
Character code assigned according to the size of
Is a regular expression that matches. "B" is
The main e located below the document element (Below)
Characters assigned according to the size of the white area
This is a regular expression that matches the dictator codes.
“*” Is a repetition operator. "W" is the document element
Main white area located inside the client (Within)
Character code assigned according to the size of
Is a regular expression that matches. Repeatable
"W *" indicates that if the range of the number of repetitions is known,
A range may be substituted. For example, W {0,3}
Indicates that W is repeated 0 to 3 times. One of the element types, "text"
Element is located in the center of the column.
The expression is a regular expression represented by the following expression (6). Text = [fg] [ab] * [fg] (6) where a, b, f, and g are 3 points and 5 poi, respectively.
Points, 16 points and 20 points
You. These values come from the sample data
The element expression is the solution of the sample data of each original document.
Generated automatically from analysis. Each text is an adjacent line in the text
Between 3 and 5 points above the text
Or another adjacent document element located below
Separated by ~ 20 point line spacing. Turning to FIG. 9, document element 406-1
Contains text elements in columns. Document element
In text 406-2, the text starts at the top of the column.
This element representation is
Begins with "＾". In document element 460-3, text
The strike must descend to the bottom of the column,
The statement expression ends with the sign "@". Document element 4
In 60-4, the text is from the top to the bottom of the column.
, So the element expression starts with the sign “＾”,
End with "＄". Four of the "text" document elements of FIG.
406-1, 406-2, 406-3, 406
-4 is a regular expression shown in equations (6) to (9), respectively.
Corresponding. Text = {[ab] * [fg] (7) Text = [fg] [ab] *} (8) Text = {[ab] *} (9) Linking these four regular element expressions together
Connect or perform a logical OR operation on this element
Find the type (text) element representation. Paraphrase
Then, the four regular expressions shown in Expressions (6) to (9) are
Combined into two element expressions, as shown in the following equation (10)
Thus, all four configurations are covered. Text = [fg] [ab] * [fg] | {[ab] * [fg] | [fg] [ab] *} | 〜 [ab ] * ＄ (10) of the identified element type used in the original document
For each, an element representation is required. Each element
Input expression by the column expression comparing means 470.
Compare with the spatial string of the main white area of the document image.
You. In this example, the text element is
The unit between adjacent lines is 3 to 5 points
Another sentence that has a format and is adjacent (top and bottom)
It is separated from the writing element by 16 to 20 points. Te
The quist element is the top, bottom, and entire column
It can be located in the body, or inside the column. Document
Another example of a document element event in a ram
For example, by repeating equation (6),
I can do it. The element expression is used to store training data.
Or generated from sample data. Element representation
Is the actual image of the element type recognized in the source document
Key white space defining document elements derived from
It can be determined from the size range of the area. But the column expression
Is a source document that allows you to carefully check the column structure of the source document.
It is preferably generated by the creator or user of the document. In the preferred embodiment, tagging (solution
Analysis) Tagging process of the document element of the document 500 before
Before you begin, define a structural model
Previously stored in the structural model storage of the
Good. Therefore, the column representation and the
The element representation, together with the structure model,
It will be stored in the model storage. One or more structures stored in memory 130
Using the model, the document element of the document image 500 before analysis is
The statement 470 is identified as in FIG. Statement before parsing
The document 500 is a document element 47 that has not yet been identified.
0, which must be logically identified. analysis
The previous document 500 is stored in the scanner 210 or remote interface.
Input by face 230. Document white area extraction
In the output system 110, the main
Extract the white area. Main white area extracted
To the main white area selecting means 120. Main white
The site selection means 120 selects an appropriate primary white region.
Select and, based on this, the columns of the stored structural model
Representation and unidentified document of pre-analysis document image 500
For comparing with an element. More precisely, choose
Between the sequence of major white areas and the structural model
Perform the switching. As a comparison method, a document white area extraction
Output system 110 and main white area selecting means 120
Main white from document image, used continuously and repeatedly
Repeat the extraction and selection of the area, the document image is first column
To divide. After adjusting the threshold for the main white area,
Each column identified in the input document image is placed in the document white area.
Extraction system 110 and main white area selection means 120
Enter again. The main eaves extracted and selected in the column
The white area is used to find the spatial string for that column.
Used. This spatial string is used for column string selection.
Input to the selection means 160. Another approach is to use a document image
First, a multi-column sentence containing non-rectangular document elements
Input to the device that detects the column from the document image,
The identified column is used in the document element identification system of the present invention.
Input to the column string selection means 160 of the
Is also good. For example, the pre-analysis document image 50 shown in FIG.
0, the main white area selecting means 120
Main white that extends vertically from top to bottom of image 500
The default region 460 is identified. Then the identified vertical
From the primary white area 460, the adjacent vertical primary white
By picking out each pair of regions 460, the solution
Each column 501 of the pre-analysis document image 500 is identified. this
The image column 501 identified as above is placed in the document white area.
After entering the extraction system 110, it is
This is sent to the area selection means 120. Main white area selection means
At 120, as shown in FIG.
A pair of vertical white areas 4 specifying a column 501 of 0
Main white area 4 in the horizontal direction, which exists between 60
Select 60. The main white area selecting means 120 is
In addition, each horizontal major white area selected in column 501
At 460, each based on its vertical height
Assign a character code. Each horizontal of column 501
The character code assigned to the main white area
By concatenation, one spatial string 502 is generated.
The spatial sling 502 corresponds to the column 50 of the image 500 before analysis.
This corresponds to 1 and represents this. Spatial string
502 is also a sequence of primary white areas 460
(Pattern) and define this sequence as a structural model
With the column and element expressions of. FIG.
In the example, the spatial string 502 of the column 501 is "a
gcdcccgbfcdddd ”. The spatial string 502 has a horizontal principal
Only the white area 460 is used,
Geometrical representation of the vertical position of the light
You. Structural models represent a "physical" column structure
Generated for the purpose. Continuous columns
If so, first determine the character string before
Split columns. Or, first, multi-column images
After expressing it as a one-dimensional character,
Compare the ring with the structural model. Japanese Patent Application No. 7-24321
No. 3 is for more complex document images such as two-column document images
And how the camera is based on the horizontal
It discloses whether to determine the lacquer string. Spatial string 5 of document image 501 before analysis
01 is determined by this spatial string 502.
Document elements that are not identified in the pre-analysis document 500
Comparison with the structural model,
Doing Check the structural model for each element type.
The element representation is compared to the spatial string 502 and this
Any match that matches any part of the spatial string
Identify the position of the element type. Examples of each part identified
Is indicated by a bidirectional arrow 503 in FIG. These gather
To form an element type list. Then the document
The element identification system 100 detects the
Match the whole space string 502
The possible column strings 504 are detected. Columns
Tring 504 can be any of the structural model element types
, Resulting in a column of the original document
Will do. Each detected column string 504
Of one or more structural models stored in advance.
Compare with the system expression. One column string 504
Combinations that do not meet the criteria for the above element types
If so, erase that column string. column
When multiple matches of the string 504 are detected
Is the statistical data used in the scoring or selection process.
The best match based on the data. To perform the above processing, a spatial string
502 from the main white area selecting means 120
The output is sent to the string selector 160. Spatial story
502 is in the column 501 of the document image 500 before analysis.
Document element 470 and the document element 470
The relative position with respect to or between the major white areas 460
Is to define. Column string selecting means 160
Is the structural model element that matches the spatial string 502
Select the sequence of the statement type. More precisely,
The column string selection means 160 is defined in the structural model.
Element table for one or more defined element types
Substring of spatial string 502 that matches the current
Extract the bug. The spatial string in FIG. 11 is “agcdc
ccgbjcdddd ". Elements in the original document
Structural model element representation in type "heading"
Is as shown in the following equation (11). Heading = (＾ [ab] * [fg] | [fg] [ab] * [fg]) (11) As described above, the “heading” is determined by the structural model rule. D
Elements are never placed at the bottom of the column
And the element expression of “heading” is the character “＄”
It never ends. In addition, the element type "body"
The structural model element expression is expressed by the following equation (12).
You. Body text = ({[cd] *} | ＾ [cd] * [fh] | [fh] [cd] * [fh] | [fh ] [C-d] * ＄) (12) The “heading” strings “ag” and “gb”
f ”is an element of“ heading ”defined by equation (11).
Matches the expression. These matching "headings"
The tring, along with the corresponding start and end positions,
Stored in string 502. On the other hand, "text" in FIG.
The strings “gcdccccg” and “fcdddd”
"d" is also a "body" element expression defined by equation (12).
Match, so these "body" strings
Spatial string 5 with corresponding start and end positions
02 is stored. The column string selecting means 160
Matches the entire spatial string 502 of the previous document image 500
Column string of structural model element type
Build 504 candidate lists. Candidate column string
Group 504 is a spatial stream matching the element type.
Generated based on the position of the substring extracted from the
I do. In particular, all of the matching column strings 504
Until all possible combinations are detected,
Instead of a string, a map representing the element type
Use the ching character code. One spatial strike
Generate multiple column strings 504 from ring 502
Column string candidate list.
Collect as strike. Column string selecting means 160
Indicates each candidate column in the column string candidate list.
Compare the tring with the column representation of the structural model. Candidate
If the column string 504 does not match the column expression
The column string 504 from the candidate list.
Remove. The remaining column string 504 is
Column string that meets the criteria for
Is interpreted as In the example of FIG. 11, the spatial string 502 is
A single column string 504, "HTHT"
Is converted to a column string. When the column string selecting means 160
Multiple candidate column strings in the system string candidate list
If it is determined that 504 remains, the candidate list is
Output to the ram expression comparing means 170. Column expression comparator
Step 170 is a column of the structural model from among the remaining candidates.
Best match column string that best matches the expression
Select The document element identification system 100
Having statistical data on the size of the white area required
The primary white area size is
(Eg, document elements)
Two document elements located inside the type or
Are located in between)
You. The column expression comparing means 170 converts the statistical data
Determine best match for column string based on
I do. In the structural model, “heading”, “body”,
Suppose that three element types of "Figure" are defined
And, as shown in FIG. 12, a total of nine geometric relationships
Specified. That is, the white two-way arrow in FIG.
Within each element type shown (text, headings, figures)
Relationship between the two element types indicated by the solid arrows
Relationship. For each geometric relationship, the mean (μ) and
Calculate the standard deviation (σ) and use the document element identification system.
Stored in the system 100. The column expression comparing means 170
Each major white area of the complementary column string 504 is
The category of the geometric relationship. Soshi
To compare the size of the key white area with the statistical sample.
The main white region is
Detect categories and probabilities corresponding to them. Assume that l possible matches remain.
Then, each matching mj (j = 1,..., L) becomes n
Contains white space. Assuming a normal distribution,
The white space wi (i = 1,..., N) is a remaining candidate
The probability p corresponding to a column string is determined by statistical theory.
Is calculated. This calculation is performed as shown in the following equation (13).
Is a probability function of a normal distribution. [Equation 1] Here, x1 is the size of w1, μc and
σc is the white space category to which each Wi belongs
Mean and standard deviation. The score of each possible match is
It is calculated as the average of the probability values in the white space. Possible
The score S of the matching mj is expressed by the following equation (14).
You. [Mathematical formula-see original document] The best matching candidate column string
504 has the highest score S, and thus has the highest score.
A candidate column string 504 finally matches
Selected as the column string 504. Then sentence
The book element identification system 100 converts a column image into a document.
Divide into elements and attach corresponding logical tags to each
You. If tagging has been performed, the
Document element 470 is mapped to a matching column expression
Stored in the memory together with the logical tag of
Output to the processor 140. The following processing example
A specific document element 470 based on the logical tag.
Optical character recognition processing. Next, the stored structural model (each identified element)
Includes column and element expressions for element types
Of the document element 470 using the
Preferred embodiments will be described based on a flowchart.
You. FIG. 13 shows a document image of a column-format document image.
Float showing logical tagging of elements
It is a chart. Operation is started in step S100.
First, in step S200, the column expression of the original document and the necessary
The structural model including the element representation
It is stored in the memory 130 of the other system 100. In step S300, the main white of the document image
Extract the site area. The main image of the document image extracted here
Identify document image columns based on white space
And then the key white region sequence in each column
Is detected. If the main white area of the document image is extracted
For example, in step S400, the spatial stream of the document image column
Detecting the ringing. For each extracted column, after S400
Repeat the process. In step S500, the main white
Match with the spatial string generated by the region selection means 120
To find a set of possible column strings. Stay
In step S600, the candidate list of the detected column string is displayed.
Of the column expressions defined in the structural model
Select a column string that matches well. In step S700, the best match color
Based on the primary white area of the
The document image element of the system. Step S800
To select the split document element and select the best match
Label with a logical tag corresponding to the column string of
You. In step S900, a document element is extracted,
For example, it outputs to the printer 300. Step S1000
To end the operation. FIG. 14 shows the operation flow of FIG.
Related to extraction of document image main white area 460
Step S300 will be described in detail.
After the start of step S300, in step S310, the document
Enter an image. The input document image 400 includes a plurality of document
Element 470. In step S320, the document image
Identify connected components 410 within 400. Step S33
0, for each of the identified connected components 410,
Box 420 is generated. In step S340, the main e
The white area 460 is extracted. Main white areas extracted
Each column is detected in step S305 based on the area.
You. If all columns are detected in step S305
For example, in step S345, a column area
Extract all horizontal white regions extending from left to right of
You. The extracted horizontal white area is used to generate a spatial string
Used for In step S350, the operation is
It returns to step S400. FIG. 15 shows the extraction of the main white area in FIG.
It is a flowchart which shows step S340 in detail.
After the start of step S340, in step S342, the primary
The white area 430 is extracted. As shown in FIG.
The primary white area 430 is a rectangle between the bounding boxes 420
The white space region of the shape. Next, step S343
The width and height of each primary white area 430 in the horizontal direction
Is compared with the threshold width 440 and the threshold height 450,
Also, the width and height of each primary white area 430 in the vertical direction
With the threshold width 440 'and the threshold height 450'
Compare. The threshold width 440 of the horizontal white area is a sentence
1/3 of the horizontal length (ie, width) of the written image 400
It is preferable to set Horizontal white area threshold
Height 450 is greater than the line spacing of the text in the document image
It is preferable to set small. Meanwhile, the vertical white area
The threshold height 450 'of the document image 400 is
Preferably set to 1/3 of the length (ie height) of
In addition, the threshold width 440 'is set to
It is preferable to set the value to a value larger than the interval. Comparison results
Result, threshold height (450, 450 ') and threshold
Primary horizontal white that exceeds the width (440, 440 ')
The primary white area and the primary vertical white area
Identified as a region. In step S344, the horizontal threshold width 44
The horizontal primary white area 430 having a width smaller than 0 and the vertical
Vertical primary white area lower than threshold height 450 '
The area 430 is deleted. In step S345, the remaining primary
Merge white area 430 into group
The default region 460 is identified. In step S346, these
Of the major white areas of the horizontal and vertical
At least one is greater than the corresponding vertical and horizontal thresholds
Erase small ones. Or horizontal and vertical
Both are less than the corresponding vertical and horizontal thresholds
You may delete only the thing. In step S345
Grouping of the remaining primary white areas 430
Various methods for detecting the main white area 460
For example, the aforementioned Japanese Patent Application No. 7-243213
In the manner disclosed in US Pat. In step S347,
The operation returns to step S350. FIG. 16 shows the document image in step S400.
Preferred embodiment of the column spatial string determination process
It is a flowchart of a state. Operation in step S400
If the solution starts, in step S410, extraction
Detect the geometric relationship of the selected main white area. Stay
In step S420, the extracted main white area
That is, the main white area into which the document element 470 is divided
Select In step S430, all selected masters
For white areas that require character,
Assign a code. In step S440, the entire sentence
Main white area 460 included in the letter image column 501
Is determined. Step S450
Then, the operation returns to step S500. In the example shown in FIG. 11, the document
The horizontal major white area 460 is divided into the divisions of the element 470.
Only use. Generalize this process to
It can also be applied to column documents. Multiple columns
The method of detecting spatial strings in documents described in
No. 7,243,213. FIG. 17 shows the column list in step S500.
Float showing a preferred embodiment of the ring selection process
It is a chart. After starting in step S500,
In step S505, a structure model element type is selected.
In step S510, the element type selected in the previous process
All element representations of the
Compare. In step S515, the spatial string 502
If there is an element expression that matches any part of
To determine Element type category or
Element representation matches part of a spatial string
If so, the process proceeds to step S520. Empty in step S520
Each element expression that matches a part of the inter-string
For example, it is stored in the memory 130. Matched element
With at least the corresponding
The start part and the end part are also stored in the memory 130. Step
In step S520, the element type of each category and the related
After storing the consecutive data, the process proceeds to step S525. S
In step S515, a part of the spatial string 502 is mapped.
If there is no category for the element type
The operation jumps directly to step S525. At step S520, the structural model is checked.
The structural model element to be compared with the spatial string 502
It is determined whether or not there is any remaining comment type. ratio
If there are more element types to compare
Proceed to step S525, select the next element type, and
The operation returns to step S510. Defined by structural model
All element types defined as spatial string 5
If the operation is determined to be compared to 02,
The process proceeds from step S525 to S530. In step S530, identification in the structural model
Empty in the element representation of each element type
All element tables that matched the inter-string 502
The current is called from the memory 130. In step S535,
Various element tables that match part of the spatial string
Combine the current to match the entire spatial string
A column string 504 is generated. Step S540
The column string 504 is defined in the structural model
Check for column expression, column string 5
04 was the standard of the original document corresponding to the structural model
Determine if the column layout. The column list identified in step S540
If the ring 504 has a column layout that meets the criteria,
If it is determined, the process proceeds to step S545. Step S
At 545, this column string is
Then, the process proceeds to step S555. On the other hand,
The column string identified in the previous step in step S540
Is determined to be a column layout that does not meet the criteria
In this case, this column string is deleted in step S550.
And the process proceeds to step S555. In step S555,
The column string selecting means 160 outputs the matched element.
Additional column list from the
Determine if the rings can be collected. Further additional
If there is a column string, return to step S535.
Identifies the next column string. Step S55
If you identified all possible column strings in 5,
If not, the process proceeds to step S560, and the operation is
The process returns to step S600. FIG. 18 shows the best map of step S600.
Preferred Embodiment of the Tching Column String Selection Process
It is a flowchart which shows. Operation in step S600
After starting the configuration, in step S605
Check the string list to see if one or more
Lactor string 504 is a column string candidate list
To determine if it is. If the candidate column string
If there is only one, go to step S610, where
This column string is used as the best match for the column expression.
And select. Then, after step S615,
The process returns to step S700. In step S605, the column string
Determines that complement list contains one or more column strings
If so, the process proceeds to step S620. Step 62
0 means that the statistical data of the category
Search for data. In step S625, the first column
Identify the string candidate as the current column string
You. In step S630, the current column string
The score of the main white region of 1 is calculated according to equation (13).
decide. Equation (13) is the main white of the column string.
Expressions for scoring each of the site areas. Stay
In step S635, the current column string is checked.
There are still key white areas to score
Is determined. Major whites to score
If there is a site left, the process returns to step S630.
You. All major white regions in the current column string
After scoring the area, the process proceeds from step S635.
Proceed to step S640. In step S640, the expression (1)
Calculate the total score of the current column string based on 4)
For example, it is stored in the memory 130, for example, in step S645.
Check the column string candidate list and
Check if there are more column strings to attach
Bell. Column strings to be scored remain
If not, the process proceeds to step S650. Step S650
To replace the next candidate column string with the current column string.
And returns to step S630. Steps
In S645, all candidate column strings are scored.
If it is determined that the access has been performed, the process proceeds to step S655.
In step S655, the column string with the highest score is
Select as the column string that best matches the column expression
Select. In step S660, the process returns to step S700
I do. FIG. 19 shows a document element 47 in a column.
0, the primary white area matching unknown input documents
Demonstrates how to tag logically based on
FIG. First, step S1100
After starting, in step S1200, a plurality of types of well-known
The structural model of the original document is transferred to the document element identification system.
Remember. The structural model should have at least a cover page,
Ram representations and elements defined in the structural model
Split categories of or internal geometric relationships between types
Statistical data on the expected size of major white areas
Data. FIG. 12 shows three element types "
, "Text", and "figure"
Indicate categories of geometric relationships that require statistical data
You. At step S1300, the cover page and
Unknown containing document elements that should be logically tagged
Input document to the document element identification system.
You. Input document image divides input document into columns
It has a major white area. Extract this main white area
And store the columns of the document image for further analysis.
You. In step S1400, the unknown unknown input
Search for the cover page of the input document and store it in advance.
Compared to multiple types of original document cover pages,
Identify the cover page to load. Matched hippo
Page identifies the unknown input document type and identifies
Search the storage device for the corresponding structural model of the input document type
I do. The process of identifying cover pages for multiple documents
Is described in Japanese Patent Application No. 7-243212. Also
Means that the document element identification system
Manually identify input document types
An instruction may be given. In this case, the user
After the unknown input sentence type is identified, the corresponding structure
Search for a model. At step S1500, the document element identification
Another system uses the corresponding structural model to
Split document elements in the image split column and logically
Perform tagging. At step S1600, the input document image
Extract logically tagged document elements, for example
The data is output to the printer 200 or a storage device. At step S1700, the document element identification
Another system provides a logically tagged input document image sentence.
Search statistical data of geometric relationships between calligraphic elements,
Between the document elements in the identified corresponding structural model
Or update statistics of internal geometric relationships. Change
Corresponding structural model including new geometric relation statistical data
Is stored again. In step S1800, the process
finish. The preferred embodiment described above is to the last
It is merely exemplary, and those skilled in the art will appreciate that various
There are various data processing configurations.
Obviously, this is not a limitation. According to the present invention, a simple structure representing a document structure can be obtained.
Based on white area detection using simple structural model representation
Thus, each element in the document can be identified. This
Is used to analyze each detail in the document image
There is no need to detect the connection state. Actual document structure
Based on the model, apply to any document identification.
Can be.

【図面の簡単な説明】【図１】本発明の文書エレメント識別システムの好まし
い実施形態を示すブロック図である。【図２】図１の文書エレメント識別システムにおける、
文書ホワイト領域抽出システムの好ましい実施形態を示
すブロック図である。【図３】図２の文書ホワイト領域抽出システムにおけ
る、主要ホワイト領域抽出手段の好ましい実施形態を示
すブロック図である。【図４】サンプル文書画像の例を示す図である。【図５】本発明にしたがって文書画像から抽出した、水
平方向の一次ホワイト領域を示す図である。【図６】本発明にしたがって文書画像から抽出した、垂
直方向の一次ホワイト領域を示す図である。【図７】本発明にしたがって抽出した主要ホワイト領域
を有する文書画像の図である。【図８】所定のカラム表現のカラムレイアウトの例を示
すサンプル文書の図である。【図９】文書エレメントの配置を示すサンプル図であ
り、３つの本文エレメントを含む例を示す図である。【図１０】解析前の、入力文書画像カラムの例を示す図
である。【図１１】入力文書画像の空間ストリングを、構造モデ
ルのエレメント表現と比較して、マッチングするカラム
ストリングを識別する手法を示す図である。【図１２】「見出し」、「図」、「本文」の３種類の文
書エレメントを有する構造モデルにおいて、主要ホワイ
ト領域の幾何学的的関係を示す図である。【図１３】カラム形式の文書画像内の文書エレメント
に、主要ホワイト領域パターンのマッチングによって論
理的にタグ付けを行う良好なプロセスを示すフローチャ
ートである。【図１４】主要ホワイト領域と文書エレメントを抽出す
る好ましいプロセスを示すフローチャートである。【図１５】図１４のフローチャートのうち、主要ホワイ
ト領域を抽出する詳細なプロセスを示すフローチャート
である。【図１６】入力文書画像のカラムの空間ストリングを決
定するプロセスを示すフローチャートである。【図１７】空間ストリングとマッチする可能なカラムス
トリングの候補リストを検出するプロセスを示すフロー
チャートである。【図１８】カラムストリング候補リストの中から、カラ
ム表現と最もマッチするカラムストリングを選択するプ
ロセスを示すフローチャートである。【図１９】本発明の第２の実施形態に基づき、原文書の
タイプを識別した後に、対応の文書内の文書エレメント
に論理的にタグ付けをするプロセスを示すフローチャー
トである。【符号の説明】１００文書エレメント識別システム１１０文書ホワイト領域抽出システム１２０主要ホワイト領域選択手段１３０メモリ１４０プロセッサ１５０構造モデル定義手段１６０カラムストリング比較手段１７０カラム表現比較手段１８０論理タグ割り当て手段１９０文書エレメント抽出手段４００文書画像４０５カラム４０６文書エレメント４６０主要ホワイト空間４７０文書エレメントBRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block diagram showing a preferred embodiment of the document element identification system of the present invention. FIG. 2 shows the document element identification system of FIG.
FIG. 2 is a block diagram illustrating a preferred embodiment of a document white region extraction system. FIG. 3 is a block diagram showing a preferred embodiment of a main white area extracting means in the document white area extracting system of FIG. 2; FIG. 4 is a diagram illustrating an example of a sample document image. FIG. 5 is a diagram illustrating a horizontal primary white area extracted from a document image according to the present invention. FIG. 6 shows a vertical primary white area extracted from a document image according to the present invention. FIG. 7 is a diagram of a document image having a main white region extracted according to the present invention. FIG. 8 is a diagram of a sample document showing an example of a column layout of a predetermined column expression. FIG. 9 is a sample diagram showing an arrangement of document elements, and is a diagram showing an example including three text elements. FIG. 10 is a diagram showing an example of an input document image column before analysis. FIG. 11 is a diagram illustrating a method of comparing a spatial string of an input document image with an element expression of a structural model to identify a matching column string. FIG. 12 is a diagram showing a geometric relationship of a main white area in a structural model having three types of document elements of “heading”, “figure”, and “body”. FIG. 13 is a flowchart illustrating a good process for logically tagging document elements in a columnar document image by matching a primary white area pattern. FIG. 14 is a flowchart illustrating a preferred process for extracting primary white regions and document elements. FIG. 15 is a flowchart showing a detailed process of extracting a main white area from the flowchart of FIG. 14; FIG. 16 is a flowchart illustrating a process for determining a spatial string of columns of an input document image. FIG. 17 is a flowchart illustrating a process for detecting a candidate list of possible column strings that match a spatial string. FIG. 18 is a flowchart showing a process of selecting a column string that best matches a column expression from a column string candidate list. FIG. 19 is a flowchart illustrating a process of logically tagging document elements in a corresponding document after identifying the type of the original document according to the second embodiment of the present invention. DESCRIPTION OF SYMBOLS 100 Document element identification system 110 Document white area extraction system 120 Main white area selection means 130 Memory 140 Processor 150 Structural model definition means 160 Column string comparison means 170 Column expression comparison means 180 Logical tag assignment means 190 Document element extraction Means 400 Document image 405 Column 406 Document element 460 Primary white space 470 Document element

フロントページの続き (56)参考文献特開平５−159101（ＪＰ，Ａ) 特開平６−60219（ＪＰ，Ａ) 特開平８−185474（ＪＰ，Ａ) 特開平８−185476（ＪＰ，Ａ) 文書画像構造解析のための知識ベースの一構成法，情報処理学会論文誌，日本，1993年１月，Ｖｏｌ．34 Ｎｏ. １，ｐｐ．75−87 矩形レイアウトモデルに基づく文書画像の領域識別，電子情報通信学会技術研究報告，日本，1993年11月12日，Ｖｏｌ．93 Ｎｏ．319，ｐｐ．45−52 モデルに基づいた文書画像のレイアウト理解，電子情報通信学会論文誌，日本，1992年10月，Ｖｏｌ．Ｊ75−Ｄ−ＩＩＮｏ．10，ｐｐ．1673−1681 (58)調査した分野(Int.Cl.⁷，ＤＢ名) G06K 9/00 - 9/82 G06F 17/21 Continuation of the front page (56) References JP-A-5-159101 (JP, A) JP-A-6-60219 (JP, A) JP-A-8-185474 (JP, A) JP-A-8-185476 (JP) , A) A knowledge base construction method for document image structure analysis, Transactions of Information Processing Society of Japan, Japan, January 1993, Vol. 34 No. 1, pp. 75-87 Document Image Region Identification Based on Rectangular Layout Model, IEICE Technical Report, Japan, November 12, 1993, Vol. 93 No. 319, pp. Understanding of Document Image Layout Based on 45-52 Model, IEICE Transactions, Japan, October 1992, Vol. J75-DII No. 10, pp. 1673-1681 (58) Field surveyed (Int.Cl. ⁷ , DB name) G06K 9/00-9/82 G06F 17/21

Claims

(57) A method for logically identifying a document element in a document image, wherein at least one document element has at least one column including at least one document element. Generating one structural model, wherein each structural model has at least one column representation and at least one
Defining a relationship between at least one document element of the original document and identifying a first main background area in the document image, wherein the first main background area identifies the document image by at least one Dividing into two columns and identifying a second main background area in each of the at least one column, forming a main background area pattern for each column of the document image from the identified second main background area Generating at least one column string that matches a main background pattern for each column of the document image; and determining a best column string from the at least one column string that matches the main background pattern. With at least one best column string Logically identifying document elements in the document image, comprising logically identifying each of the at least one document element in the document image based on a best column string that best matches a column representation of the structured model. Method.