JPH096907A - Logical discrimination method of document element

Info

Publication number: JPH096907A
Application number: JP8140296A
Authority: JP
Inventors: Masaharu Ozaki; 正治尾▲崎▼; Jain Muesita; ジャインムージタ
Original assignee: Xerox Corp
Current assignee: Xerox Corp
Priority date: 1995-06-07
Filing date: 1996-06-03
Publication date: 1997-01-10

Abstract

PROBLEM TO BE SOLVED: To logically identify document elements in a multi-column document image. SOLUTION: The permissible structures (patterns) of the major white regions that separate the columns of a document image are defined in advance using regular expressions based on the intersections of major white (background) regions. After the horizontal and vertical major white regions are extracted from the document image, the sequence of extracted major white regions is compared with a finite state machine (automaton) by approximate regular expression matching to determine the set (i.e., pattern) of major white regions closest to the desired pattern. The column layout is thereby determined, and the document elements can be logically identified.

Description

Detailed Description of the Invention

[0001]

FIELD OF THE INVENTION The present invention relates generally to a method and apparatus for extracting major white regions from a multi-column document image and identifying the columns of document elements.

[0002]

DESCRIPTION OF THE RELATED ART A method for segmenting text regions is proposed, for example, by Baird et al. in "Image Segmentation by Shape Directed Covers," 10th International Conference on Pattern Recognition, pp. 820-825, 16-21 June 1990. Many conventional document segmentation methods, including Baird's, assume that all elements (components) in a document are rectangular. Given the variety of document layouts that exist in practice, this assumption does not always hold, and such conventional methods run into problems when segmenting and identifying non-rectangular element regions. Baird analyzes the white regions of a document image, but no method has been found that segments the columns of document elements by extracting and analyzing the major white regions from the result of that analysis. Here, a "white" region is a region that contains no connected components at all, i.e., the background region of the document.

[0003] Japanese Patent Application No. 7-243212 discloses a segmentation method that extracts non-rectangular document elements by extracting the major white regions of an input document image and finding closed loops of major white regions in the document. However, as shown in FIG. 40, the document segmentation system of Japanese Patent Application No. 7-243212 may happen to detect a white region that is not structurally significant, in which case what is really a single document element is split and identified as two or more separate document elements. In the example of FIG. 40, the document segmentation system misidentifies region A as a major white region, so the single document element C is split and extracted as two elements, C-1 and C-2. Such inappropriate detection of major white regions arises because the document segmentation system does not necessarily handle only permissible column layouts that conform to a standard. To avoid this kind of misidentification, the document segmentation system needs to be given a model of one or more permissible column layouts that defines the major white regions separating the document elements within the columns of a document image.

[0004] A structural model is a representation of a particular column layout that may exist in the original document. The structural model is either generated offline by the user before the actual document segmentation process, or provided by the creator of the original document.

[0005] In addition, the document segmentation system disclosed in Japanese Patent Application No. 7-243212 incurs some loss when extracting major white regions. Because the system uses fixed thresholds to select major white regions, it cannot extract every major white region actually present in the document. Specifically, as shown in FIG. 41, a major white region that is small but highly significant may be overlooked. In the example of FIG. 41, the small major white region E at the center of the page is not extracted because its height is less than the threshold height for vertical major white regions. As a result, document element regions F and G, which should be identified as two separate document elements, are extracted as a single document element. A segmentation system is therefore desired that can properly match the major white regions of an input document against a structural model and so avoid data loss.

[0006] E. Myers and W. Miller, "Approximate Matching of Regular Expressions," Bulletin of Mathematical Biology, vol. 51, No. 1, pp. 5-37, 1989, discloses a method for approximate regular expression matching. Given a sequence A and a regular expression R, the method finds, among all sequences that match R, the one whose alignment with A has the highest score. The solution is based on the highest-scoring sequence alignment or, equivalently, on the minimum cost of the edit operations (deletion, insertion, substitution) needed to convert A into a member of R. Myers and Miller also present several algorithms that find the best-matching sequence in at least O(MN) time, where M and N are the lengths of A and R, respectively.
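The edit-operation cost mentioned above can be illustrated with a minimal sketch. It computes the minimum insertion, deletion and substitution cost between two plain sequences, which is a simpler problem than matching against a regular expression and is not the Myers and Miller algorithm itself. The function name `edit_distance` and the unit costs are assumptions made for illustration.

```python
def edit_distance(a, b, ins=1, dele=1, sub=1):
    """Minimum total cost of insertions, deletions and substitutions
    that turn sequence a into sequence b (costs are illustrative)."""
    # prev[j] holds the cost of converting a[:i-1] into b[:j].
    prev = [j * ins for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * dele]  # converting a[:i] into the empty sequence
        for j in range(1, len(b) + 1):
            change = 0 if a[i - 1] == b[j - 1] else sub
            cur.append(min(prev[j] + dele,         # drop a[i-1]
                           cur[j - 1] + ins,       # insert b[j-1]
                           prev[j - 1] + change))  # keep or substitute
        prev = cur
    return prev[-1]
```

The table has (M+1) by (N+1) cells, so the running time is proportional to MN for sequences of lengths M and N.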

[0007]

SUMMARY OF THE INVENTION The present invention provides a system that identifies document elements by analyzing only the white regions in a document image, i.e., the background regions that contain no text at all.

[0008] The present invention also provides an apparatus that segments document elements accurately and effectively by comparing the major white regions extracted from an input document image with a description of the permissible patterns of a column model document. A permissible column image may contain a union of several rectangular document elements and may therefore be non-rectangular. The document element segmentation system of the present invention segments the document elements of an input document image into columns on the basis of a major white region pattern. The major white region pattern is based on the types, sequence, and intersections of the major white regions.

[0009] In the present invention, the permissible structures (patterns) of the major white regions that separate the columns of a document image are defined in advance using regular expressions based on the intersections of major white regions. After the horizontal and vertical major white regions are extracted from the document image, the sequence of extracted major white regions is compared with a finite state machine (automaton) by approximate regular expression matching to determine the set (i.e., pattern) of major white regions that separate the columns. Because the document element segmentation system of the present invention uses an approximate matching method and outputs the match closest to the desired pattern of major white regions, data loss and misidentification can be avoided.

[0010] Once the document image has been segmented into columns, the document elements of the input document image can be processed further. For example, based on the comparison of the input document image, the logical tags associated with the document elements of the matching column structure can be assigned to the corresponding document elements of the document image so that the document elements are logically identified, or the segmented document elements can be extracted, processed by an optical character recognition unit, and output to a printer.

[0011] With the present invention, there is no need to analyze the portions of the document image that contain document elements in order to detect which connected components form a coherent group, i.e., a document element. In the present invention, the image on the document is first scanned to generate an electronic or digital representation of the input document image. A major white region is a rectangular region of "white space" having a predetermined minimum size. In this specification, white space means the background region, i.e., a region of the document image in which no image or text is present. Because a typical document consists of black or colored images formed on a white background, these background regions are called "white space". For the purposes of the present invention, however, even when a document is formed on a colored background, or when text is reversed out in white on a black or colored background, the colored (or black) background region containing no image is defined as "white space".

[0012] A document element is a rectangular region of the document image that contains information such as a headline, text, or graphics, and document elements are separated from one another by major white regions. A region that contains a document element and is separated from other regions containing document elements by a major white region of a predetermined size is treated as an individual document element.

[0013] One aspect of the present invention is a method for logically identifying the document elements of a multi-column document image, comprising the steps of: identifying the major background regions in the document image; generating an ordered data string corresponding to the arrangement of the major background regions in the document image; comparing the ordered data string with a finite state machine to determine, from among at least one candidate path for the ordered data string, the optimal path that best matches the finite state machine; and identifying a column layout based on the determined optimal path.

[0014] Other objects and advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments with reference to the drawings.

[0015]

DESCRIPTION OF THE PREFERRED EMBODIMENTS FIG. 1 shows a preferred embodiment of the document element segmentation system 100 of the present invention. The document element segmentation system 100 includes a document white region extraction system 110, major white region selection means 120, a memory 130, a processor 140, string conversion means 150, comparison means 160, column layout identification means 170, logical tag assignment means 180, and document element extraction means 190, all of which are connected to one another via bus means 105. One or more of a printer 200, a scanner 210, a user interface 220, a remote interface 230, and a non-volatile memory 270 are also connected via the bus means 105. The remote interface can connect to a LAN, a wide area network (WAN), or another computer. As shown in FIG. 1, the document element identification system 100 is preferably implemented on a general-purpose computer 300, but it can also be implemented on a special-purpose computer, a microprocessor-based or microcontroller-based system, an integrated circuit such as an ASIC, a hard-wired electronic circuit such as a discrete element circuit, or a programmable logic device (PLD) such as a field programmable gate array.

[0016] FIG. 2 shows a preferred embodiment of the document white region extraction system 110 of the document element segmentation system 100 of FIG. 1. As shown, the document white region extraction system 110 includes connected component identification means 260, bounding box generation means 250, and major white region extraction means 240, all of which are connected by the bus means 105. First, document image data is input to the connected component identification means 260 from the scanner 210, the non-volatile memory 270, the remote interface 230, the memory 130, or the like. The memory 130 may be internal memory of the general-purpose computer 300, or it may be well-known external memory attached to the general-purpose computer 300, such as a floppy disk and disk drive, a hard disk drive, a CD-ROM, or an EPROM. Document image data read by the scanner 210 may be stored temporarily in the memory 130 before being input to the connected component identification means 260. The document image data is input to the connected component identification means 260 in the form of a binary image or a multi-bit digital signal. Each bit indicates whether a particular pixel of the document image is ON or OFF.

[0017] Upon receiving the document image data, the connected component identification means 260 detects all connected components in the document image. FIG. 4 shows an example document image 400. A connected component 410 consists of a run of adjacent "on" pixels (black pixels) surrounded by "off" pixels (white pixels). Systems for detecting the connected components 410 in the document image 400 are well known in the art and are not described here.
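Since the specification leaves connected component detection to the known art, the following is only a minimal sketch of one common approach, a breadth-first flood fill over a binary image. The function name `connected_components`, the list-of-rows image format, and the choice of 8-connectivity are assumptions, not details taken from the specification.

```python
from collections import deque


def connected_components(image):
    """Group ON pixels (value 1) into connected components.

    image is a list of rows of 0/1 values. Returns a list of components,
    each a list of (row, col) tuples. 8-connectivity is assumed here.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new component and flood-fill outward from (r, c).
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components
```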

[0018] Once the connected components 410 of the document image 400 have been detected, the bounding box generation means 250 generates a bounding box 420 for each connected component 410. As is well known in the art, a bounding box 420 is the smallest rectangular frame that completely encloses a connected component 410. Systems for generating the bounding boxes 420 from the connected components 410 are also well known in the art.
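As a small illustration of the bounding box definition, the sketch below computes the smallest enclosing rectangle of a set of pixel coordinates. The function name `bounding_box` and the inclusive (top, left, bottom, right) tuple layout are assumptions chosen for the example.

```python
def bounding_box(pixels):
    """Smallest axis-aligned rectangle enclosing all (row, col) pixels.

    Returns (top, left, bottom, right) with inclusive coordinates.
    """
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))
```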

[0019] The document image data, including the bounding box information, is sent to the major white region extraction means 240. As shown in FIGS. 5 and 6, the major white region extraction means 240 extracts the major white regions of the document image 400 in the vertical and horizontal directions.

[0020] In a preferred embodiment of the document white region extraction system 110, the major white region extraction means 240 is divided into the two sections shown in FIG. 3: a vertical extraction section 241 and a horizontal extraction section 242. The vertical extraction section 241 and the horizontal extraction section 242 each include primitive white region extraction means 243, comparison means 244, elimination means 245, and grouping means 246, which are connected to the bus means 105. The two sections contain identical components and operate in the same manner.

[0021] As shown in FIG. 5, the horizontal extraction section 242 first extracts the primitive white regions 430-1 to 430-10 that extend beyond the threshold for horizontal primitive white regions, and assembles them to form the horizontally extending major white regions 460. Similarly, as shown in FIG. 6, the vertical extraction section 241 extracts the primitive white regions 430-11 to 430-19 that extend beyond the threshold for vertical primitive white regions, and assembles them to form the vertically extending major white regions 460.

[0022] A horizontal major white region 460 is formed from horizontal white regions by merging adjacent regions among the horizontal primitive white regions 430-1 to 430-10 into a group according to specific rules, yielding one or more horizontally grouped primitive white regions. Likewise, a vertical major white region 460 is formed from vertical white regions by merging adjacent regions among the vertical primitive white regions 430-11 to 430-19 into a group according to specific rules, yielding one or more vertically grouped primitive white regions. Once the vertical and horizontal primitive white regions have been grouped and merged in this way, those horizontal primitive white regions 430 and horizontally grouped primitive white regions whose width exceeds a threshold width 440 and whose height exceeds a threshold height 450 are identified. Similarly, those vertical primitive white regions 430 and vertically grouped primitive white regions whose height exceeds a threshold height 450' and whose width exceeds a threshold width 440' are identified. These regions are identified as the major white regions.
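The threshold test in this paragraph can be sketched as a simple filter over rectangles. The `Rect` type, the function name `select_major`, and the sample threshold values are hypothetical; the specification gives the thresholds only as reference numerals 440, 450, 440' and 450', and the grouping rules themselves are not reproduced here.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (illustrative type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top


def select_major(regions, min_width, min_height):
    """Keep regions whose width and height both exceed the thresholds."""
    return [r for r in regions
            if r.width > min_width and r.height > min_height]
```

The same function would be called once with the horizontal thresholds and once with the vertical thresholds, since the two directions use separate values.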

[0023] As shown in FIG. 7, the identified major white regions 460 surround the document elements 470 and separate them from one another. The size and orientation of the major white regions 460 can therefore be used to define the logical structure of the document.

[0024] The document white region extraction system 110 described above is one example of the many possible embodiments capable of extracting white regions, and the invention is not limited to it.

[0025] Once the document white region extraction system 110 has identified the major white regions 460, the positions of the intersections 480 at which horizontal major white regions 460 meet vertical major white regions 460 are located. The major white regions 460 that form these intersections are then converted into a one-dimensional data string by the string conversion means 150. The major white region intersections 480 are also classified by type and stored, together with the one-dimensional data string obtained by the conversion, in the memory 130 or the non-volatile memory 270. The comparison means 160 refers to the stored information when performing its comparison. The type of a major white region 460 extending between two intersections 480 is classified or identified based on the intersection types at the end points of that major white region and the types of intersection within the major white region 460. That is, the string conversion means 150 classifies and identifies each major white region 460 by its position and by its intersections 480 with other major white regions 460, and generates a one-dimensional data string corresponding to the rule-ordered sequence of the major white regions appearing in the document image. In the preferred embodiment, the major white regions 460 are concatenated in top-to-bottom, left-to-right order to generate the one-dimensional data string.
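A minimal sketch of the two operations described here, locating intersections and ordering regions into a one-dimensional string, might look as follows. Rectangles are assumed to be (left, top, right, bottom) tuples, the region type symbols are assumed to have been assigned already, and the sort key (top edge, then left edge) is one plausible reading of "top to bottom and left to right". The function names are hypothetical.

```python
def intersects(a, b):
    """True if two (left, top, right, bottom) rectangles overlap or touch."""
    return (a[0] <= b[2] and b[0] <= a[2]
            and a[1] <= b[3] and b[1] <= a[3])


def to_symbol_string(labelled_regions):
    """Order (symbol, rect) pairs top to bottom, then left to right,
    and return the resulting list of symbols."""
    ordered = sorted(labelled_regions,
                     key=lambda item: (item[1][1], item[1][0]))
    return [symbol for symbol, _ in ordered]
```

For example, a central vertical gap interrupted by a full-width horizontal gap would yield the symbols of the upper vertical region, the horizontal region, and the lower vertical region, in that order.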

[0026] The comparison means 160 compares the sequence of major white regions 460, in the form of the one-dimensional data string generated by the string conversion means 150, with the permissible column layouts defined by the structural model. Part of the comparison means 160 is preferably a finite state machine that represents the permissible (valid) document column separator regions. The comparison means 160 matches the data string against the finite state machine using an approximate matching method. The column layout identification means 170 identifies the column layout of the document image based on the result of the comparison by the comparison means 160.
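One way to picture approximate matching of a data string against a finite state machine is as a shortest-path search over pairs of (input position, machine state). The sketch below uses Dijkstra's algorithm for that search. It is an illustration under assumed unit costs, not the procedure of the specification or of Myers and Miller, and the function name, the transition table format, and the example pattern are all hypothetical.

```python
import heapq


def approx_match_cost(seq, transitions, start, accepting,
                      ins=1, dele=1, sub=1):
    """Minimum edit cost between seq and any string the machine accepts.

    transitions maps a state to a list of (symbol, next_state) pairs.
    Costs must be non-negative. Returns None if nothing is accepted.
    """
    n = len(seq)
    best = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        cost, i, q = heapq.heappop(heap)
        if cost > best.get((i, q), float("inf")):
            continue  # stale queue entry
        if i == n and q in accepting:
            return cost
        steps = []
        if i < n:
            # Input symbol has no counterpart in the pattern: delete it.
            steps.append((cost + dele, i + 1, q))
        for symbol, nxt in transitions.get(q, []):
            # Pattern symbol missing from the input: insert it.
            steps.append((cost + ins, i, nxt))
            if i < n:
                # Consume one input symbol: exact match or substitution.
                change = 0 if symbol == seq[i] else sub
                steps.append((cost + change, i + 1, nxt))
        for c, j, s in steps:
            if c < best.get((j, s), float("inf")):
                best[(j, s)] = c
                heapq.heappush(heap, (c, j, s))
    return None
```

As a usage example, the hypothetical pattern "VC (HLR VC)*" can be written as `{0: [("VC", 1)], 1: [("HLR", 2)], 2: [("VC", 1)]}` with start state 0 and accepting set `{1}`. An input of VC, VC then costs one edit, because the separating HLR region is missing, which is the kind of overlooked region the approximate match is meant to tolerate.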

[0027] Based on the column layout identified for the matching document image, the logical tag assignment means 180 attaches logical tags to the regions of the document elements 470 lying between (i.e., separated by) the major white regions 460. The document element extraction means 190 extracts the logically tagged document elements 470. Alternatively, the document element extraction means 190 may be used in place of, or in combination with, the logical tag assignment means 180. Details of the logical tag assignment means 180 and the document element extraction means 190 are given in Japanese Patent Application Nos. 7-243212 and 7-243213.

[0028] As shown in FIG. 7, the document image 400 contains an arbitrary number of major white regions 460. These major white regions have an arbitrary number of intersections 480 and divide the document image into an arbitrary number of document elements. The document elements 470 of the document image 400 are logically identified by comparing the spatial (geometric) relationships among the major white regions lying between the document elements of the input document image with the spatial (geometric) relationships defined by the permissible column layouts of the original document. If the geometric relationships among the major white regions 460 of the document image match those of a permissible column layout, the column layout of the document elements 470 of the document image 400 is determined. In the example of FIG. 7, columns 405-1 and 405-2 of document elements are identified. To identify the column layout of the document elements, however, a model defining the spatial (geometric) relationships among document element types must be supplied to the document element segmentation system 100 in advance.

[0029] The structural model is a description of the structure of the original document that represents the permissible column layouts within it. In the example of FIG. 7, the input document image has a two-column layout, but the present invention is not limited to two-column layouts. The system and method of the present invention can also be applied to three-column and four-column layouts and, more generally, to n-column layouts.

[0030] In a two-column layout, three states must be considered when a figure, table, explanatory text, or the like is inserted into the document image. In the first state, the figure, table, or supplementary text fits within a column. In this case, the pattern of major white regions is unaffected.

[0031] In the second state, an inserted figure, table, or explanatory text (hereinafter collectively called an insert) is wider than one column but narrower than the full width of the document image. In this case, the text region on one or both sides of the insert is cut back according to the size of the insert. The presence of the insert forms different major white regions, so the major white region pattern changes.

[0032] In the third state, the figure, table, or supplementary text has a width equal to that of the entire document image; that is, an insert spans both columns of the two-column document. In this case too, the insert gives rise to different types of major white regions, so the major white region pattern changes.

[0033] The document element segmentation system 100 assumes that document elements inserted on a page do not overlap one another. Therefore, when a single element is inserted, all states of the insertion pattern can be covered by combining the three states above. Under this assumption, layouts in which multiple document elements are inserted across one or more columns are also possible.

[0034] The space separating the columns is represented by an alternating sequence of mutually intersecting horizontal and vertical major white regions 460. In the preferred embodiment, a sequence can be matched even when there is no vertical path at all. In addition, before the document white region extraction system 110 extracts the major white regions 460, the four margins (top, bottom, left, and right) are removed. In that case, after the major white regions have been extracted, the four margins are restored in order to form closed loops of major white regions, and the columns made up of document elements are identified.

【００３５】カラムに挿入部が挿入されている場合は必
ず、挿入部を周囲のテキストから区別するための、少な
くともひとつの主要ホワイト領域４６０も挿入されてい
る。挿入部が、カラムのいずれか一方の垂直端に揃えて
挿入されている場合は、反対側の縦の辺に沿って、残り
の部分と区別するための主要ホワイト領域４６０が設け
られている。挿入部がカラム中央に位置する場合は、上
下のテキストと区分するために、挿入部の上下に水平方
向の主要ホワイト領域が挿入されている。挿入部がカラ
ムの上端に揃えて挿入されている場合は、挿入部の下方
に隣接して水平方向の主要ホワイト領域が設けられ、挿
入部がカラムの下端に揃っている場合は、挿入部の上方
に沿って水平方向主要ホワイト領域が形成されている。Whenever an insert is placed in a column, at least one major white region 460 is also introduced to separate the insert from the surrounding text. When the insert is aligned with one vertical edge of the column, a major white region 460 runs along the opposite vertical side to separate it from the rest of the column. When the insert is located in the middle of the column, horizontal major white regions are introduced above and below the insert to separate it from the text above and below. When the insert is aligned with the top of the column, a horizontal major white region is formed adjacent to the bottom of the insert; when the insert is aligned with the bottom of the column, a horizontal major white region is formed along the top of the insert.

【００３６】好適な実施の形態では、主要ホワイト領域
交差部４８０を形成する主要ホワイト領域の位置によっ
て、図８および９に示す６タイプの主要ホワイト領域に
分類できる。In the preferred embodiment, major white regions can be classified into the six types shown in FIGS. 8 and 9, according to the positions of the major white regions that form a major white region intersection 480.

【００３７】ＨＬ：左余白ホワイト領域とのみ交差す
る水平主要ホワイト領域ＨＲ：右余白ホワイト領域とのみ交差する水平主要ホ
ワイト領域ＨＦ：左余白とも右余白とも交差しないが、少なくと
もひとつの垂直主要ホワイト領域と交差する、水平主要
ホワイト領域ＶＣ：文書画像の縦の中心線の少なくとも一部を含む
垂直主要ホワイト領域ＶＬ：文書画像の縦中心線の左側に位置する垂直主要
ホワイト領域ＶＲ：文書画像の縦中心線の右側に位置する垂直主要
ホワイト領域これら６通りの形態に加えて、ＨＬＲ：カラムの両側の左余白、右余白の双方と交差
する水平主要ホワイト領域 ε：空シンボルとを用いる。The six types are:
HL: a horizontal major white region that intersects only the left margin white region
HR: a horizontal major white region that intersects only the right margin white region
HF: a horizontal major white region that intersects neither the left nor the right margin but intersects at least one vertical major white region
VC: a vertical major white region that contains at least part of the vertical center line of the document image
VL: a vertical major white region located to the left of the vertical center line of the document image
VR: a vertical major white region located to the right of the vertical center line of the document image
In addition to these six types, two further symbols are used:
HLR: a horizontal major white region that intersects both the left and right margins on either side of the column
ε: the empty symbol

【００３８】ストリング変換手段１５０は、上記のタイ
プに基づいて主要ホワイト領域４６０を分類する。垂直
方向の主要ホワイト領域４６０は、それが位置する水平
方向の位置によって分類される。すなわち、水平方向の
中心（縦中心線）を含む位置にある垂直主要ホワイト領
域はＶＣ、水平方向の中心より左側にある垂直主要ホワ
イト領域４６０はＶＬ、水平方向の中心より右側にある
垂直主要ホワイト領域４６０はＶＲとして分類される。
水平主要ホワイト領域４６０もまた、その位置によって
区分され、水平主要ホワイト領域４６０が文書の縦中心
線をまったく横切らない場合は、その水平主要ホワイト
領域を、ストリング変換手段１５０によって消去する。The string conversion means 150 classifies the major white regions 460 according to the types above. A vertical major white region 460 is classified by its horizontal position: one located so that it includes the horizontal center (the vertical center line) is classified as VC, one located to the left of the horizontal center as VL, and one located to the right of the horizontal center as VR. Horizontal major white regions 460 are likewise classified by position; a horizontal major white region 460 that does not cross the vertical center line of the document at all is discarded by the string conversion means 150.
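The classification rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Region` type, the pixel coordinate convention, and the boolean flags passed to `classify_horizontal` are assumptions introduced here, not part of the disclosed system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    """Axis-aligned white region; the coordinate units are hypothetical."""
    left: float
    top: float
    right: float
    bottom: float


def classify_vertical(region: Region, page_width: float) -> str:
    """Return VC, VL, or VR from the region's horizontal position."""
    center_x = page_width / 2.0
    if region.left <= center_x <= region.right:
        return "VC"  # contains part of the vertical center line
    return "VL" if region.right < center_x else "VR"


def classify_horizontal(touches_left: bool, touches_right: bool,
                        crosses_vertical: bool) -> Optional[str]:
    """Return HLR, HL, HR, or HF; None if no type applies."""
    if touches_left and touches_right:
        return "HLR"
    if touches_left:
        return "HL"
    if touches_right:
        return "HR"
    # Neither margin is touched: HF requires at least one vertical crossing.
    return "HF" if crosses_vertical else None
```

For a page 100 units wide, `classify_vertical(Region(48, 0, 52, 100), 100)` yields `"VC"` because the region straddles the center line at x = 50.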

【００３９】ストリング変換手段１５０はまた、垂直お
よび水平方向の主要ホワイト領域４６０にしきい値を適
用する。図１１に示すように、ＨＬ主要ホワイト領域４
６０が、ＶＣ主要ホワイト領域４６０と交差するが、Ｈ
Ｌ主要ホワイト領域４６０の水平方向の幅が、ＶＣ主要
ホワイト領域４６０と左余白との間の幅をそれほど越え
ない（しきい値を越えない）場合は、このＨＬ主要ホワ
イト領域４６０をカラム内で文書エレメントを隔てるセ
パレータとして扱う。したがってこのＨＬ主要ホワイト
領域４６０をシーケンスの中に組み込む。同様に、カラ
ム内でエレメントを隔てるしきい値内のＨＲ主要ホワイ
ト領域もシーケンスの中に組み込む。The string conversion means 150 also applies thresholds to the vertical and horizontal major white regions 460. As shown in FIG. 11, when an HL major white region 460 intersects a VC major white region 460 but its horizontal width does not greatly exceed the distance between the VC major white region 460 and the left margin (that is, the excess stays within the threshold), the HL major white region 460 is treated as a separator between document elements within the column and is therefore incorporated into the sequence. Similarly, an HR major white region that separates elements within a column and falls within the threshold is also incorporated into the sequence.

【００４０】文書ホワイト領域抽出システム１１０で、
入力文書画像の主要ホワイト領域４６０を識別したなら
ば、水平主要ホワイト領域と垂直の主要ホワイト領域と
の交差部４８０の位置を特定し、この交差部に関連する
主要ホワイト領域４６０を、ストリング変換手段１５０
によって一次元データストリングに変換する。２つの交
差部４８０の間に延びる主要ホワイト領域４６０を、そ
の交差タイプと、終点位置に応じて分類し、文書画像中
に主要ホワイト領域が現われる順序付けられたシーケン
スに対応する一次元データストリングを生成する。本実
施形態では、主要ホワイト領域４６０を上から下、およ
び左から右に連結して、一次元データストリングを生成
する。Once the document white region extraction system 110 has identified the major white regions 460 of the input document image, the positions of the intersections 480 between horizontal and vertical major white regions are determined, and the string conversion means 150 converts the major white regions 460 associated with these intersections into a one-dimensional data string. Each major white region 460 extending between two intersections 480 is classified according to its intersection type and end-point positions, and a one-dimensional data string is generated that corresponds to the ordered sequence in which the major white regions appear in the document image. In this embodiment, the one-dimensional data string is generated by concatenating the major white regions 460 from top to bottom and from left to right.
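The top-to-bottom, left-to-right concatenation described above can be illustrated with a short sketch. The tuple layout `(label, top, left)` and the sample coordinates are assumptions made for illustration; they are not taken from any figure of the specification.

```python
from typing import List, Tuple

# (type label, top coordinate, left coordinate); hypothetical layout
LabeledRegion = Tuple[str, float, float]


def to_data_string(regions: List[LabeledRegion]) -> str:
    """Concatenate type labels top to bottom, then left to right."""
    ordered = sorted(regions, key=lambda r: (r[1], r[2]))
    return "-".join(label for label, _top, _left in ordered)


# Illustrative input only: two vertical regions starting at the top,
# a central vertical region lower down, and a full-width region last.
sample = [("VC", 40.0, 50.0), ("VL", 0.0, 20.0),
          ("VR", 0.0, 70.0), ("HLR", 90.0, 0.0)]
print(to_data_string(sample))  # VL-VR-VC-HLR
```

Sorting on the pair (top, left) is the simplest way to express the stated order; it does not model any tolerance on nearly equal top positions.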

【００４１】また、図１０に示すように、文書画像中
に、それぞれの最上部位置が非常に近接して位置する
（しきい値以内）水平方向および垂直方向の主要ホワイ
ト領域がある場合、ストリング変換手段１５０は、これ
らの主要ホワイト領域４６０の最上部位置を同一位置と
して取り扱う。図１０の例では、ＨＦ主要ホワイト領域
と、ＶＣ主要ホワイト領域の最上部位置が同位置とみな
され、左から右の規則にしたがって、まずＨＦ主要ホワ
イト領域が解析されて一次元データストリングに組み込
まれ、次にＶＣ主要ホワイト領域が解析される。このよ
うにして、順次主要ホワイト領域４６０を一次元主要ホ
ワイト領域シーケンスに変換する。図１０の文書画像か
ら生成した一次元ストリング（ＯＤＳ：one-dimensional string）は、次式（１）で表わされる。As shown in FIG. 10, when the document image contains horizontal and vertical major white regions whose topmost positions lie very close together (within a threshold), the string conversion means 150 treats the topmost positions of these major white regions 460 as the same position. In the example of FIG. 10, the topmost positions of the HF and VC major white regions are regarded as identical; following the left-to-right rule, the HF major white region is analyzed first and incorporated into the one-dimensional data string, and the VC major white region is analyzed next. In this way the major white regions 460 are converted, one after another, into a one-dimensional major white region sequence. The one-dimensional string (ODS) generated from the document image of FIG. 10 is given by equation (1).

【００４２】
ＯＤＳ＝ＶＬ−ＶＲ−ＨＦ−ＶＣ−ＨＬＲ　（１）
文書エレメント分割システム１００は、図１２、１３に
示すような主要ホワイト領域４６０の反復シーケンスも
処理することができる。図１２および１３の文書画像中
の、主要ホワイト領域４６０の順序付けられたシーケン
スに対応する一次元データストリングを、次式（２）、
（３）で示す。
ODS = VL-VR-HF-VC-HLR   (1)
The document element division system 100 can also process repeated sequences of major white regions 460 such as those shown in FIGS. 12 and 13. The one-dimensional data strings corresponding to the ordered sequences of major white regions 460 in the document images of FIGS. 12 and 13 are given by equations (2) and (3).

【００４３】
ＯＤＳ＝ＶＣ−ＨＬ−ＶＲ−ＨＬ−ＶＣ−ＨＬ−ＶＲ−ＨＬ−ＶＣ　（２）
ＯＤＳ＝ＶＣ−ＨＲ−ＶＬ−ＨＲ−ＶＣ−ＨＦ−ＶＬ−ＶＲ−ＨＦ−ＶＣ　（３）
図１４に、２カラム形式の文書画像分割の比較に使用す
る、実行可能な有限状態マシンの例を示す。この有限状
態マシンはその入力ストリングが比較モデルのストリン
グ集合に属するかどうかを判定する。図１４に示すよう
に、カラム分割シーケンス（すなわち文書画像中の主要
ホワイト領域４６０を特定するデータストリング）は、
入力ストリングに対応する。比較ストリング集合は、構
造モデルで定義される２カラム形式の許容レイアウトに
対応し、本実施形態では、図１４の有限状態マシンが構
造モデルとなる。
ODS = VC-HL-VR-HL-VC-HL-VR-HL-VC   (2)
ODS = VC-HR-VL-HR-VC-HF-VL-VR-HF-VC   (3)
FIG. 14 shows an example of a finite state machine that can be used for matching in the segmentation of two-column document images. The finite state machine determines whether its input string belongs to the set of strings of the comparison model. As shown in FIG. 14, the column separation sequence (that is, the data string that identifies the major white regions 460 in the document image) is the input string. The set of comparison strings corresponds to the allowable two-column layouts defined by a structural model; in this embodiment, the finite state machine of FIG. 14 serves as the structural model.

【００４４】図１４の有限状態マシンの各状態から状態
への遷移は、比較手段１６０で決定される主要ホワイト
領域と構造モデルとのマッチングに対応する。図１３に
示した文書画像のシーケンスを表わす第（３）式の主要
ホワイト領域一次元データストリングと、図１４の有限
状態マシンの遷移とはマッチする。このマッチングは、
「スタート状態、状態１、状態３、状態１０、状態１
１、状態１２、状態３、状態１７、状態１８、状態１
９、状態２０、状態３、エンド状態」というシーケンス
を生成する。したがって、式（３）の一次元データスト
リングは、２カラム文書画像の主要ホワイト領域の有効
なカラム分割シーケンスである。Each state-to-state transition of the finite state machine of FIG. 14 corresponds to a match, determined by the comparison means 160, between a major white region and the structural model. The one-dimensional data string of equation (3), which represents the sequence of the document image shown in FIG. 13, matches the transitions of the finite state machine of FIG. 14. This match produces the state sequence "start state, state 1, state 3, state 10, state 11, state 12, state 3, state 17, state 18, state 19, state 20, state 3, end state". The one-dimensional data string of equation (3) is therefore a valid column separation sequence of major white regions for a two-column document image.

【００４５】図１５〜２６は、２カラムレイアウトの実
施形態において、図１４の有限状態マシンで定義される
許容（適正）カラム分割シーケンスの文書画像を示す。
図１５〜２６は、単一の挿入部に基づく単事象の許容カ
ラム分割シーケンスだけを示すが、図１２、１３のよう
な複数の事象も、一連の追加主要ホワイト領域パターン
として処理することができる。FIGS. 15 to 26 show document images of the allowable (valid) column separation sequences defined by the finite state machine of FIG. 14 in the two-column layout embodiment. FIGS. 15 to 26 show only single-event allowable column separation sequences based on a single insert, but multiple events such as those of FIGS. 12 and 13 can also be handled as a series of additional major white region patterns.

【００４６】比較手段１６０は、有限状態マシンと入力
ストリングとの間の最良の整合を選択するためのコスト
マトリックスを生成する。コストマトリックスの各行
は、入力ストリングの各キャラクタである。入力ストリ
ングは、本実施形態では上から下、左から右への主要ホ
ワイト領域のシーケンスであるので、例えば、２つの垂
直ホワイト領域４６０の最上部位置が同一である場合、
左側に位置する垂直主要領域が先に選択される。一方、
コストマトリックスの各列は、図１４のような有限状態
マシンのスタート状態から最終（エンド）状態までの各
状態に対応する。各状態に付された数字は、識別名とし
て機能する。編集コストは、「挿入」、「削除」、「置
換」といった編集オペレーションの一次コストの総和で
ある。本実施形態では、挿入、削除、置換オペレーショ
ンのコストを、「−１」に設定する。一次元データスト
リング（またはカラム分割シーケンス）の結果、総編集
コストが「０」になったならば、そのカラム分割シーケ
ンスは有効であり、マッチしたと考えられる。また、パ
スコスト「０」は、そのパスが、スタート状態からエン
ド状態まで、有限状態マシンを完全に通ることを意味す
る。有限状態マシンを完全に通らないパスのコストは、
一次編集コストが負の値で定義されるので、カラム分割
シーケンスのトータルコストもマイナスの値になる。本
実施形態では、最適のパスは最大コストを有し、すなわ
ちもっとも「０」に近い値と解釈される。The comparison means 160 generates a cost matrix for selecting the best match between the finite state machine and the input string. Each row of the cost matrix corresponds to one character of the input string. In this embodiment the input string is the sequence of major white regions taken from top to bottom and from left to right, so that, for example, when two vertical major white regions 460 have the same topmost position, the one on the left is selected first. Each column of the cost matrix corresponds to one state of a finite state machine such as that of FIG. 14, from the start state to the final (end) state. The number attached to each state serves as an identifier. The edit cost is the sum of the primary costs of edit operations such as "insert", "delete", and "substitute". In this embodiment, the cost of each insert, delete, and substitute operation is set to -1. If the total edit cost of a one-dimensional data string (column separation sequence) is 0, the column separation sequence is valid and is regarded as a match. A path cost of 0 also means that the path passes completely through the finite state machine from the start state to the end state. Because the primary edit costs are defined as negative values, a path that does not pass completely through the finite state machine gives the column separation sequence a negative total cost. In this embodiment, the optimal path is the one with the maximum cost, that is, the cost closest to 0.

【００４７】本実施形態では、これらの一次コストは文
書エレメント分割システム１００であらかじめ定められ
ており、メモリ１３０の一次コスト格納領域に記憶され
る。さらに、各一次コストの値は必ずしもすべて等しく
なくてもよい。例えば、置換オペレーションのコスト
を、挿入または削除オペレーションのコスト値の２倍の
−２に設定してもよい。全体のコストマトリックスを決
定した後に、コストマトリックスを最終行から最初の行
へたどることによって、最大コストを有する最適のパス
を獲得できる。In the present embodiment, these primary costs are predetermined by the document element division system 100 and are stored in the primary cost storage area of the memory 130. Moreover, the values of each primary cost need not all be equal. For example, the cost of the replace operation may be set to -2, which is twice the cost value of the insert or delete operation. After determining the overall cost matrix, the optimal path with the highest cost can be obtained by traversing the cost matrix from the last row to the first row.

【００４８】図４０は、解析前の文書画像４００に、Ｖ
Ｌ主要ホワイト領域が余分に挿入されて検出された、挿
入オペレーションの例である。図４０において、ストリ
ング変換手段１５０で生成されるカラム分割シーケンス
は、「ＶＲ−ＨＬ−ＶＬ−ＶＣ」となる。このカラム分
割シーケンスを、図１４の有限状態マシンと比較する
と、生成される最高コストのパスは、「スタート状態、
状態１、状態４、状態５、状態３、エンド状態」であ
る。したがって、ＶＬ主要ホワイト領域は、シーケンス
中で不適法な主要ホワイト領域ということになり、有限
状態マシンは、ＶＬ主要ホワイト領域が、ストリングへ
の不適当な挿入であることを示す。FIG. 40 shows an example of an insertion operation, in which an extra VL major white region is detected in the document image 400 before analysis. In FIG. 40, the column separation sequence generated by the string conversion means 150 is "VR-HL-VL-VC". When this column separation sequence is compared with the finite state machine of FIG. 14, the highest-cost path produced is "start state, state 1, state 4, state 5, state 3, end state". The VL major white region is therefore an invalid major white region in the sequence, and the finite state machine indicates that the VL major white region is an improper insertion into the string.

【００４９】図４１は、解析前の文書画像４００から垂
直主要ホワイト領域Ｅが欠落して検出された欠落オペレ
ーションの例である。ここから、ストリング変換手段１
５０はカラム分割シーケンス「ＶＲ−ＨＬ−ＨＲ−Ｖ
Ｌ」を生成する。このカラム分割シーケンスを、図１４
の有限状態マシンと比較すると、生成される最高コスト
のパスは、「スタート状態、状態１、状態４、状態５、
欠落、状態１３、状態１４、状態３、エンド状態」であ
る。すなわち、有限状態マシンを通る最高コストのパス
は、状態５と状態１３との間にギャップを示し、図４１
における主要ホワイト領域の欠落を示す。FIG. 41 shows an example of a deletion (missing-region) operation, in which the vertical major white region E is missing from the regions detected in the document image 400 before analysis. From this image, the string conversion means 150 generates the column separation sequence "VR-HL-HR-VL". When this column separation sequence is compared with the finite state machine of FIG. 14, the highest-cost path produced is "start state, state 1, state 4, state 5, missing, state 13, state 14, state 3, end state". That is, the highest-cost path through the finite state machine shows a gap between state 5 and state 13, which indicates the missing major white region in FIG. 41.

【００５０】上述のようにして比較手段１６０で文書画
像中の主要ホワイト領域４６０の最良のシーケンスを決
定したら、カラムレイアウト識別手段１７０が、検出さ
れた最良のシーケンスに基づいて、文書画像の文書エレ
メントカラムレイアウトを決定する。When the comparing means 160 determines the best sequence of the main white areas 460 in the document image as described above, the column layout identifying means 170 determines the document element of the document image based on the detected best sequence. Determine the column layout.

【００５１】まず、カラムレイアウト識別手段１７０
は、最良のシーケンス中に余分な挿入、欠落、あるいは
置換があるかどうかを判定する。最良のシーケンスのコ
スト値が最大値「０」であれば、その最良シーケンスの
主要ホワイト領域を、そのままカラム分割領域として使
用する。しかし、図４０のように、最良シーケンスに余
分な挿入がある場合、挿入ホワイト領域の周囲のマッチ
する主要ホワイト領域を識別することによって、挿入ホ
ワイト領域を削除しなければならない。そして、残りの
主要ホワイト領域を、文書画像カラムを分割する空間領
域として使用する。図４０、４１に示すように、最良の
シーケンスに主要ホワイト領域の挿入または欠落がある
場合、マッチする主要ホワイト領域を検出して、余分な
ホワイト領域、あるいは欠落したホワイト領域を検出す
る。置換は、欠落と同様に扱う。次いで、文書ホワイト
領域抽出システム１１０は、しきい値を変えて（最初の
設定より小さくするか、または大きく設定して）文書画
像を再度処理し、最良シーケンスから欠落した主要ホワ
イト領域を抽出し、余分な主要ホワイト領域を除去す
る。First, the column layout identification means 170 determines whether the best sequence contains any extra insertion, deletion, or substitution. If the cost of the best sequence is the maximum value 0, the major white regions of the best sequence are used as the column separation regions as they are. However, when the best sequence contains an extra insertion, as in FIG. 40, the inserted white region must be removed by identifying the matching major white regions around it. The remaining major white regions are then used as the spatial regions that separate the columns of the document image. When the best sequence contains an inserted or missing major white region, as shown in FIGS. 40 and 41, the matching major white regions are found in order to locate the extra or missing white region. A substitution is handled in the same way as a deletion. The document white region extraction system 110 then reprocesses the document image with changed thresholds (set lower or higher than the initial settings) to extract the major white regions missing from the best sequence and to remove the extra major white regions.

【００５２】文書ホワイト領域抽出システム１１０が上
述の操作を達成できなかった場合は、カラムレイアウト
識別手段１７０で、有限状態マシンを通る可能なパスの
コスト値に基づいて、２番目に最適なパスを選択する。
本実施形態では、コスト値が等しい最適パスが複数ある
場合にのみ、これを適用する。そして、最初の最適パス
に代わって選択した２番目の最適パスを使用して、上述
の操作を実行する。If the document white region extraction system 110 cannot complete the operation described above, the column layout identification means 170 selects the second-best path on the basis of the cost values of the possible paths through the finite state machine. In this embodiment, this is done only when there are several optimal paths with equal cost values. The operation described above is then carried out using the selected second optimal path in place of the first.

【００５３】文書ホワイト領域抽出システム１１０で、
２番目の最適パスを用いて、余分な領域の消去および欠
落した領域の抽出プロセスを首尾良く達成できたなら
ば、２番目の最適パスでの主要ホワイト領域シーケンス
に、４つのマージンを戻して、文書画像の文書エレメン
トをカラム画像に分割する。本実施形態では、文書画像
は２つの文書カラムに分割され、これらのカラムは非矩
形であっても構わない。If the document white region extraction system 110 succeeds, using the second optimal path, in removing the extra regions and extracting the missing regions, the four margins are restored to the major white region sequence of the second optimal path, and the document elements of the document image are divided into column images. In this embodiment the document image is divided into two document columns, which may be non-rectangular.

【００５４】入力カラム分割シーケンスおよび有限状態
マシンから生成されるコストマトリックスは、前述のマ
イアーズの文献に述べられている近似正規表現に基づ
く。本実施形態では、マイアーズの第２０ページに開示
される方法を使用してコストマトリックスを生成し、最
適のパスを決定する。The cost matrix generated from the input column partition sequence and the finite state machine is based on the approximate regular expression described in the above-mentioned Myers reference. In this embodiment, a cost matrix is generated using the method disclosed on page 20 of Myers to determine the optimal path.

【００５５】図２７〜２９、および図３０〜３３は、ま
た別の有限機械に基づいてコストマトリックスを生成す
る例を示す図である。図２７の有限状態マシンは、図１
７、１８、１９に示すカラムレイアウトの文書画像を扱
うことができる。この例では、挿入、欠落、置換の編集
コスト値を、あらかじめ「−１」に設定しておく。図２
７の有限状態マシンは、スタート状態とエンド状態をつ
なぐ３つの明らかなパスを有する。すなわち、（１）状
態１→状態２→状態４→状態６、（２）状態１→状態
６、（３）状態１→状態３→状態５→状態６、というパ
スである。これら３つのパスのいずれかとマッチする任
意の主要ホワイト領域シーケンスは、トータルコストが
「０」である。この有限状態マシンは、コストマトリッ
クスとは独立のものである。FIGS. 27 to 29 and FIGS. 30 to 33 illustrate the generation of cost matrices based on another finite state machine. The finite state machine of FIG. 27 can handle document images having the column layouts shown in FIGS. 17, 18, and 19. In this example, the edit costs of insertion, deletion, and substitution are each set to -1 in advance. The finite state machine of FIG. 27 has three distinct paths connecting the start state and the end state: (1) state 1 → state 2 → state 4 → state 6, (2) state 1 → state 6, and (3) state 1 → state 3 → state 5 → state 6. Any major white region sequence that matches one of these three paths has a total cost of 0. The finite state machine itself is independent of the cost matrix.

【００５６】図２８は、図２７の有限状態マシンのう
ち、図３０のレイアウトに対応する下方パス「状態０→
状態１→状態３→状態５→状態６」のコストマトリック
スである。図２７の有限状態マシンによると、図３０の
カラムレイアウトの最初の遷移はεである。このコスト
マトリックスの行「０」は、状態０から、状態１〜６の
各状態へ移るときのコストを示す。これによると、状態
０から状態１に移るコストは０である。行「１」は、入
力シーケンス「スタート、ＶＲ」において、状態１か
ら、状態０、２〜６の各状態への遷移コストを示す。行
「２」は、入力シーケンス「ＶＲ、ＨＬ」において、状
態３から、状態０〜２、４〜６の各状態への遷移コスト
である。行「３」は、入力シーケンス「ＶＲ、ＨＬ、Ｖ
Ｃ」において、状態５から、状態０〜４および６に移る
コストを示す。図２８のコストマトリックスが示すよう
に、状態０→状態１、状態１→状態３、状態３→状態
５、状態５→状態６という遷移コストは０であり、これ
は、入力シーケンス「ＶＲ、ＨＬ、ＶＣ」に対応する。
したがって、「状態０→状態１→状態３→状態５→状態
６」というパスは、図２７の有限状態マシンを通る有効
なパスであり、「ＶＲ、ＨＬ、ＶＣ」は、図３０の文書
レイアウトで示されるように、主要ホワイト領域４６０
の有効カラム分割シーケンスである。FIG. 28 shows the cost matrix for the lower path of the finite state machine of FIG. 27, "state 0 → state 1 → state 3 → state 5 → state 6", which corresponds to the layout of FIG. 30. According to the finite state machine of FIG. 27, the first transition for the column layout of FIG. 30 is ε. Row 0 of the cost matrix gives the cost of moving from state 0 to each of states 1 to 6; the cost of moving from state 0 to state 1 is 0. Row 1 gives the cost of the transition from state 1 to each of states 0 and 2 to 6 for the input sequence "start, VR". Row 2 gives the cost of the transition from state 3 to each of states 0 to 2 and 4 to 6 for the input sequence "VR, HL". Row 3 gives the cost of moving from state 5 to each of states 0 to 4 and 6 for the input sequence "VR, HL, VC". As the cost matrix of FIG. 28 shows, the transitions state 0 → state 1, state 1 → state 3, state 3 → state 5, and state 5 → state 6 each have a cost of 0, and they correspond to the input sequence "VR, HL, VC". The path "state 0 → state 1 → state 3 → state 5 → state 6" is therefore a valid path through the finite state machine of FIG. 27, and "VR, HL, VC" is a valid column separation sequence of major white regions 460, as the document layout of FIG. 30 illustrates.
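The zero-cost traversal described above can be illustrated by simulating a small machine modeled on FIG. 27. The sketch below takes the lower path (VR, HL, VC), the direct VC edge from state 1 to state 6, and the initial ε transition from the text; the labels on the upper path (state 1 → state 2 → state 4 → state 6) are not given in the text and are assumptions made here for illustration only.

```python
from typing import Dict, List, Optional, Tuple

EPS = "ε"
# (state, symbol) -> next state, modeled on FIG. 27.
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, EPS): 1,
    (1, "VR"): 3, (3, "HL"): 5, (5, "VC"): 6,   # lower path, from the text
    (1, "VC"): 6,                               # direct path, from the text
    (1, "VL"): 2, (2, "HR"): 4, (4, "VC"): 6,   # upper path: assumed labels
}
END_STATE = 6


def trace(symbols: List[str]) -> Optional[List[int]]:
    """Return the visited states if the sequence matches exactly, else None."""
    state = 0
    path = [state]
    # The initial ε transition consumes no input symbol.
    if (state, EPS) in TRANSITIONS:
        state = TRANSITIONS[(state, EPS)]
        path.append(state)
    for symbol in symbols:
        next_state = TRANSITIONS.get((state, symbol))
        if next_state is None:
            return None  # no exact (zero-cost) transition exists
        state = next_state
        path.append(state)
    return path if state == END_STATE else None


print(trace(["VR", "HL", "VC"]))  # [0, 1, 3, 5, 6]
print(trace(["VR", "VC"]))        # None
```

An exact simulation like this only detects sequences whose total cost is 0; a sequence such as "VR, VC" needs the approximate matching with edit costs that the cost matrix provides.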

【００５７】図３１は、カラムシーケンス「ＶＲ、Ｖ
Ｃ」を有する文書画像である。図３１の文書画像は、図
２９のコストマトリックスの２通りの異なるパス（パス
ＡとパスＢ）で解析することができる。シーケンス「Ｖ
Ｒ、ＶＣ」を有する図３１の文書画像を、パスＡ、Ｂを
使用して解析した図を、それぞれ図３２および３３に示
す。FIG. 31 shows a document image having the column sequence "VR, VC". The document image of FIG. 31 can be analyzed along two different paths (path A and path B) in the cost matrix of FIG. 29. FIGS. 32 and 33 show the analysis of the document image of FIG. 31, with the sequence "VR, VC", using paths A and B respectively.

【００５８】図３２は、図２７の有限状態マシンを通る
パス「状態０→状態１→状態３→状態５→状態６」（図
２９のパスＡ）を使用した解析であり、ここでは「Ｈ
Ｌ」がパスから欠落している。入力シーケンス「ＶＲ、
ＶＣ」において、パスＡで図２７の有限状態マシンを通
ると、トータルコストゼロ（０）で状態０から状態１
へ、さらに状態３へ移行する。しかし、図２９のコスト
マトリックスの行「１」に示すように、入力シーケンス
「ＶＲ、ＶＣ」において、図２７の有限状態マシンの状
態３からコストゼロでの遷移は皆無であり、状態３から
状態５への最良の遷移は、コスト「−１」での遷移であ
る。これは、図２７に示すように、入力シーケンス「Ｖ
Ｒ、ＶＣ」から状態５（ＨＬ）が不適切に欠落している
ことを意味し、最良シーケンスは「ＶＲ、ＨＬ、ＶＣ」
であることを示す。欠落したＨＬを補った最良のシーケ
ンス「ＨＬ、ＶＣ」は、図２９のコストマトリックスの
行１と行２に示すように、ともに同値のコスト「−１」
を有するので、状態５から状態６への遷移コストは、結
果的にゼロとなる。したがって、パスＡのトータルコス
トは、「−１」である。このように、マッチする近接の
主要ホワイト領域ＶＲとＶＣを比較することによって、
カラムレイアウト識別手段１７０は、「ＨＬ」を欠落し
た遷移として識別する。一方、図３３は、図２７の有限
状態マシンを通るパス「状態０→状態１→状態６のパ
ス」（図２９のパスＢ）を使用した、ＶＲ−ＶＣカラム
分離シーケンスの解析である。図２９のコストマトリッ
クスの行１に示すように、状態１から状態３へ移動する
代わりに、有限状態マシンはコスト「−１」で状態１に
留まる。これは、入力シーケンス「ＶＲ、ＶＣ」には、
状態３（ＶＲ）が不適切に挿入されていることを意味
し、最良のシーケンスは「ＶＣ」であることを示す。ま
た、図２９のコストマトリックスの行１と行２から明ら
かなように、行１の状態１から、行２の状態６への遷移
は、同値のコスト間の遷移なので、この遷移コストはゼ
ロである。したがって、パスＢにおいて、「ＶＲ］が余
分な挿入部であることがわかる。パス「ＶＲ−ＶＣ」を
図２９のコストマトリックスと比較すると、行０の状態
１から、行１の状態１への遷移コストは−１であり、行
１の状態１から、行２の状態６への「ＶＣ」遷移コスト
は０である。結果として、パスＢのトータルコストは−
１になる。FIG. 32 shows the analysis using the path "state 0 → state 1 → state 3 → state 5 → state 6" through the finite state machine of FIG. 27 (path A of FIG. 29); here "HL" is missing from the path. For the input sequence "VR, VC", following path A through the finite state machine of FIG. 27 leads from state 0 to state 1 and then to state 3 at a total cost of zero. However, as row 1 of the cost matrix of FIG. 29 shows, for the input sequence "VR, VC" there is no zero-cost transition out of state 3 of the finite state machine of FIG. 27, and the best transition from state 3 to state 5 has a cost of -1. As FIG. 27 shows, this means that state 5 (HL) is improperly missing from the input sequence "VR, VC" and that the best sequence is "VR, HL, VC". In the best sequence, with the missing HL supplied, "HL, VC" has the same cost of -1 in rows 1 and 2 of the cost matrix of FIG. 29, so the transition from state 5 to state 6 costs zero. The total cost of path A is therefore -1. By comparing the matching neighboring major white regions VR and VC in this way, the column layout identification means 170 identifies "HL" as the missing transition. FIG. 33, on the other hand, shows the analysis of the VR-VC column separation sequence using the path "state 0 → state 1 → state 6" through the finite state machine of FIG. 27 (path B of FIG. 29). As row 1 of the cost matrix of FIG. 29 shows, instead of moving from state 1 to state 3, the finite state machine remains in state 1 at a cost of -1. This means that state 3 (VR) is improperly inserted in the input sequence "VR, VC" and that the best sequence is "VC". As rows 1 and 2 of the cost matrix of FIG. 29 also show, the transition from state 1 in row 1 to state 6 in row 2 is a transition between equal costs, so its cost is zero. In path B, therefore, "VR" is an extra insertion. Comparing the path "VR-VC" with the cost matrix of FIG. 29, the transition from state 1 in row 0 to state 1 in row 1 costs -1, and the "VC" transition from state 1 in row 1 to state 6 in row 2 costs 0. As a result, the total cost of path B is -1.
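The two analyses above can be reproduced with a simplified dynamic program over (input position, state) pairs. This is a sketch under stated assumptions, not the Myers method cited in the specification: it keeps one score row per input symbol, relaxes input-free moves until the row is stable, and uses a machine modeled on FIG. 27 whose upper-path labels are assumed. The function name `best_score` and its parameters are hypothetical.

```python
from typing import List, Tuple

EPS = "ε"
# (source, label, target), modeled on FIG. 27.
EDGES: List[Tuple[int, str, int]] = [
    (0, EPS, 1),
    (1, "VR", 3), (3, "HL", 5), (5, "VC", 6),   # lower path, from the text
    (1, "VC", 6),                               # direct path, from the text
    (1, "VL", 2), (2, "HR", 4), (4, "VC", 6),   # upper path: assumed labels
]
START, END, N_STATES = 0, 6, 7
NEG_INF = float("-inf")


def best_score(symbols: List[str], insert: int = -1,
               delete: int = -1, substitute: int = -1) -> float:
    """Return the best (closest to 0) total edit cost for the input."""
    def close(row: List[float]) -> List[float]:
        # Moves that consume no input: ε edges are free, and skipping a
        # labeled edge means the input lacks that region (a deletion).
        for _ in range(N_STATES):
            for src, label, dst in EDGES:
                step = 0 if label == EPS else delete
                if row[src] + step > row[dst]:
                    row[dst] = row[src] + step
        return row

    row = [NEG_INF] * N_STATES
    row[START] = 0
    row = close(row)
    for symbol in symbols:
        # Staying in place while consuming a symbol: an extra region.
        nxt = [score + insert for score in row]
        for src, label, dst in EDGES:
            if label == EPS:
                continue
            step = 0 if label == symbol else substitute
            nxt[dst] = max(nxt[dst], row[src] + step)
        row = close(nxt)
    return row[END]


print(best_score(["VR", "HL", "VC"]))  # 0
print(best_score(["VR", "VC"]))        # -1
```

With all primary costs at -1, "VR, VC" scores -1 whether HL is treated as missing (path A) or VR as an extra insertion (path B), which agrees with the totals stated in the text.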

【００５９】要するに、図２７の有限状態マシンのう
ち、図３０、３１にそれぞれ示す２つのカラムレイアウ
トに対応する２つのコストマトリックスを示したのが図
２８、２９である。図２８と３０を比べると、シーケン
ス「ＶＲ、ＨＬ、ＶＣ」は有限状態マシンと首尾良くマ
ッチングし、最適パスのコストは０になる。一方、図２
９と３１を比較すると、図２９のマトリックスでそれぞ
れ値が「−１」の２つの可能な最適パスが存在する。こ
のうち、図３２では、図３１のシーケンスを得る最適の
パスからＨＬが欠落し、図３３では、図３１のシーケン
スを得る最適のパスにＶＲが余分に挿入されている。In summary, FIGS. 28 and 29 show the two cost matrices of the finite state machine of FIG. 27 that correspond to the two column layouts shown in FIGS. 30 and 31, respectively. Comparing FIGS. 28 and 30, the sequence "VR, HL, VC" matches the finite state machine successfully, and the cost of the optimal path is 0. Comparing FIGS. 29 and 31, on the other hand, the matrix of FIG. 29 contains two possible optimal paths, each with a cost of -1. In FIG. 32, HL is missing from the optimal path that yields the sequence of FIG. 31; in FIG. 33, an extra VR is inserted in the optimal path that yields the sequence of FIG. 31.

【００６０】こうしてマッチングに基づいて解析前の文
書画像４０１を分割した文書カラムの文書エレメント４
７０は、マッチングしたカラム表現とともにメモリ１３
０に記憶されるか、またはプロセッサ１４０に出力され
て以降の処理を受ける。以降の処理の例として、文書画
像カラムの構造を解析することによって文書エレメント
を分類し、それらの文書エレメントに論理的にタグ付け
を行うなどがある。The document elements 470 of the document columns obtained by dividing the pre-analysis document image 401 on the basis of this matching are stored in the memory 130 together with the matched column representation, or are output to the processor 140 for subsequent processing. Examples of subsequent processing include classifying the document elements by analyzing the structure of the document image columns and attaching logical tags to those document elements.

【００６１】次に、図３４のフローチャートを参照し
て、主要ホワイト領域パターンのマッチングによる複合
カラムの分割を用いた文書エレメント４７０の分割方法
を説明する。ステップＳ２００でオペレーションを開始
した後、ステップＳ３００で文書画像の主要ホワイト領
域を抽出する。主要ホワイト領域を抽出したならば、ス
テップＳ４００で、主要ホワイト領域の交差を解析する
ことによって、入力文書画像の主要ホワイト領域を表わ
す一次元データストリングを決定する。Next, referring to the flowchart in FIG. 34, a method of dividing the document element 470 using division of a composite column by matching the main white area pattern will be described. After starting the operation in step S200, the main white area of the document image is extracted in step S300. Once the major white regions have been extracted, the one-dimensional data string representing the major white regions of the input document image is determined in step S400 by analyzing the intersections of the major white regions.

【００６２】ステップＳ５００で、主要ホワイト領域の
データストリング表現を、入力ストリングとして有限状
態マシンに入力する。有限状態マシンの出力は、入力文
書画像で抽出した主要ホワイト領域の最良の解釈を示す
最適のパスである。有限状態マシンは、任意の識別され
た２つの主要ホワイト領域交差部４８０の間に延びる適
正な許容主要ホワイト領域４６０を表現したものであ
る。最適のパスは、コストマトリックスで決定されるよ
うに、有限状態マシンを通るパスから主要ホワイト領域
を挿入、置換、削除する所定の編集コストに基づいた、
有限状態マシンの状態を通る最小コストのパスである。In step S500, the data string representation of the major white regions is supplied to the finite state machine as the input string. The output of the finite state machine is the optimal path, which gives the best interpretation of the major white regions extracted from the input document image. The finite state machine represents the valid, allowable major white regions 460 extending between any two identified major white region intersections 480. The optimal path is the minimum-cost path, in the sense of requiring the least editing, through the states of the finite state machine, as determined by the cost matrix, based on the predetermined edit costs of inserting, substituting, and deleting major white regions relative to a path through the finite state machine.

【００６３】ステップＳ６００では、識別した最適のパ
スを使用して、入力文書画像のカラムレイアウトを決定
する。ステップＳ７００で、カラムレイアウトから識別
した文書エレメントを分割、論理的にタグ付け、あるい
は出力することによって、入力文書画像の文書エレメン
トを処理する。ステップＳ８００で、プロセスを終了す
る。In step S600, the identified optimal path is used to determine the column layout of the input document image. In step S700, the document element of the input document image is processed by dividing, logically tagging, or outputting the document element identified from the column layout. In step S800, the process ends.

【００６４】図３５は、文書画像主要ホワイト領域４６
０を抽出するステップＳ３００の詳細な工程を示すフロ
ーチャートである。ステップＳ３００でオペレーション
を開始した後、ステップＳ３１０で文書画像４００を入
力する。ステップＳ３２０で、文書画像４００の連結要
素４１０を識別する。ステップＳ３３０において、ステ
ップＳ３２０で識別した各連結要素４１０ごとに、境界
ボックス４２０を生成する。ステップＳ３４０で、主要
ホワイト領域４６０を抽出し、ステップＳ３５０で、ス
テップＳ４００に復帰する。FIG. 35 is a flowchart showing step S300, the extraction of the major white regions 460 of the document image, in detail. After the operation starts in step S300, the document image 400 is input in step S310. In step S320, the connected components 410 of the document image 400 are identified. In step S330, a bounding box 420 is generated for each connected component 410 identified in step S320. In step S340, the major white regions 460 are extracted, and in step S350 control returns to step S400.
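Steps S320 and S330 can be illustrated with a minimal sketch that labels connected components in a binary image and returns one bounding box per component. The input format (a list of rows with 1 for a black pixel), the use of 4-connectivity, and the function name `bounding_boxes` are assumptions made here; the specification does not state how the components are found.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def bounding_boxes(image: List[List[int]]) -> List[Box]:
    """Return one bounding box per 4-connected component of 1-pixels."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from the first unvisited black pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


page = [[1, 1, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
print(bounding_boxes(page))  # [(0, 0, 1, 1), (3, 1, 3, 2)]
```

The white regions of interest are then the background rectangles left between these boxes.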

【００６５】図３６は、主要ホワイト領域を抽出するス
テップＳ３４０の詳細なプロセスを示す。ステップＳ３
４０でオペレーション開始後、ステップＳ３４２で一次
（プリミティブ）ホワイト領域４３０を抽出する。図５
に示すように、一次ホワイト領域４３０は、境界ボック
ス４２０間の矩形のホワイト空間領域である。ステップ
Ｓ３４３で、各水平一次ホワイト領域４３０の幅と高さ
を、しきい値幅４４０、およびしきい値高さ４５０とそ
れぞれ比較し、また、垂直一次ホワイト領域４３０の幅
と高さを、しきい値幅４４０’、およびしきい値高さ４
５０’とそれぞれ比較する。FIG. 36 shows step S340, the extraction of the major white regions, in detail. After the operation starts in step S340, the primary (primitive) white regions 430 are extracted in step S342. As shown in FIG. 5, a primary white region 430 is a rectangular region of white space between bounding boxes 420. In step S343, the width and height of each horizontal primary white region 430 are compared with a threshold width 440 and a threshold height 450, respectively, and the width and height of each vertical primary white region 430 are compared with a threshold width 440' and a threshold height 450', respectively.

【００６６】水平領域のしきい値幅は、文書画像４００
の水平方向の長さ（すなわち文書画像の幅）の１／３に
設定し、水平領域のしきい値高さ４５０は、文書画像の
テキスト（文章）のライン間隔を越える値に設定するの
が好ましい。一方、垂直領域のしきい値高さ４５０’
は、文書画像４００垂直方向の長さ（すなわち文書画像
の高さ）の１／３に設定し、垂直領域のしきい値幅４４
０’を、文書画像のテキストのライン間隔を越える値に
設定するのが好ましい。しきい値高さ（４５０、４５
０’）と、しきい値幅（４４０、４４０’）を越えるサ
イズの水平および垂直一次ホワイト領域を、主要ホワイ
ト領域として識別する。The threshold width 440 for horizontal regions is preferably set to one third of the horizontal length of the document image 400 (that is, the width of the document image), and the threshold height 450 for horizontal regions to a value exceeding the line spacing of the text in the document image. The threshold height 450' for vertical regions is preferably set to one third of the vertical length of the document image 400 (that is, the height of the document image), and the threshold width 440' for vertical regions to a value exceeding the line spacing of the text in the document image. Horizontal and vertical primary white regions whose sizes exceed the threshold heights (450, 450') and the threshold widths (440, 440') are identified as major white regions.
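The preferred threshold settings above reduce to simple arithmetic, sketched below. The function names, the argument units, and the `margin` added to the line spacing (to obtain "a value exceeding the line spacing") are assumptions introduced for illustration.

```python
from typing import Tuple

Threshold = Tuple[float, float]  # (threshold width, threshold height)


def thresholds(page_width: float, page_height: float,
               line_spacing: float,
               margin: float = 1.0) -> Tuple[Threshold, Threshold]:
    """Return the (horizontal, vertical) region thresholds."""
    # Any value above the line spacing would do; `margin` is an assumption.
    above_spacing = line_spacing + margin
    horizontal = (page_width / 3.0, above_spacing)   # (440, 450)
    vertical = (above_spacing, page_height / 3.0)    # (440', 450')
    return horizontal, vertical


def is_major(width: float, height: float, threshold: Threshold) -> bool:
    """True if the region exceeds both the threshold width and height."""
    t_width, t_height = threshold
    return width > t_width and height > t_height


h_thr, v_thr = thresholds(page_width=900, page_height=1200, line_spacing=20)
print(h_thr, v_thr)                 # (300.0, 21.0) (21.0, 400.0)
print(is_major(600, 30, h_thr))     # True: wide and taller than a line gap
print(is_major(200, 30, h_thr))     # False: narrower than a third of the page
```

Requiring both dimensions to exceed their thresholds follows the last sentence of the paragraph above.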

【００６７】ステップＳ３４４で、水平領域のしきい値
幅４４０より小さい幅の水平一次ホワイト領域４３０
と、垂直領域のしきい値高さ４５０’より低い垂直一次
ホワイト領域を消去する。ステップＳ３４５で、残りの
一次ホワイト領域４３０を、主要ホワイト領域４６０に
グループ化する。残りの一次ホワイト領域を主要ホワイ
ト領域へグループ化する方法は多数考えられるが、例え
ば、特願平７−２４３２１２号に開示の方法で、主要ホ
ワイト領域へのグループ化を実行する。In step S344, horizontal primary white areas 430 whose width is smaller than the threshold width 440 of the horizontal area, and vertical primary white areas whose height is smaller than the threshold height 450' of the vertical area, are eliminated. In step S345, the remaining primary white areas 430 are grouped into main white areas 460. Many methods for grouping the remaining primary white areas into main white areas are conceivable; for example, the grouping may be performed by the method disclosed in Japanese Patent Application No. 7-243212.

【００６８】ステップＳ３４６で、主要ホワイト領域の
垂直または水平方向のサイズの少なくとも一方が、対応
の垂直および水平しきい値より小さい場合に、その主要
ホワイト領域を消去する。または、垂直および水平方向
のサイズの双方が、対応のしきい値より小さい主要ホワ
イト領域だけを消去する方法を取ってもよい。ステップ
Ｓ３４７で、オペレーションはステップＳ３５０に復帰
する。In step S346, a main white area is eliminated if at least one of its vertical and horizontal sizes is smaller than the corresponding vertical or horizontal threshold. Alternatively, only those main white areas whose vertical and horizontal sizes are both smaller than the corresponding thresholds may be eliminated. In step S347, operation returns to step S350.
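The two elimination policies of step S346 differ only in whether one undersized dimension is enough to discard an area. The following sketch makes that difference explicit. The function `keep_region` and its `require_both_small` flag are hypothetical names introduced for illustration.

```python
def keep_region(width, height, min_width, min_height, require_both_small=False):
    """Decide whether a main white area survives step S346.

    Default policy: eliminate when at least one size is below its threshold.
    Alternative policy (require_both_small=True): eliminate only when both
    sizes are below their thresholds.
    """
    too_narrow = width < min_width
    too_short = height < min_height
    if require_both_small:
        return not (too_narrow and too_short)
    return not (too_narrow or too_short)


# A long but thin area: dropped by the default policy, kept by the alternative.
print(keep_region(600, 10, min_width=300, min_height=21))
print(keep_region(600, 10, min_width=300, min_height=21, require_both_small=True))
```

The alternative policy is the more permissive of the two, so it retains more candidate areas for the later matching stage.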

【００６９】図３７は、文書画像から抽出した主要ホワ
イト領域の位置と交差部とに基づいて、主要ホワイト領
域の一次元データストリング（カラム分割シーケンス）
を決定するステップＳ４００の詳細な工程である。ステ
ップＳ４００でオペレーションを開始した後、ステップ
Ｓ４１０で、文書画像４００の主要ホワイト領域交差部
４８０を識別する。ステップＳ４２０で、交差部の位置
に基づいて主要ホワイト領域のタイプ（ＨＬ、ＶＣな
ど）を検出する。ステップＳ４３０で、検出した主要ホ
ワイト領域の位置とタイプとに基づいて、主要ホワイト
領域の規則にしたがったシーケンスを検出する。好適な
実施形態では、規則にしたがったシーケンスは、文書画
像４００の上から下へ、次いで文書画像４００の左から
右へ移行するシーケンスである。文書画像において、垂
直方向の同じ位置（高さ）に複数の主要ホワイト領域が
存在する場合、シーケンスの決定要因は、文書画像の左
端から右端へと移る。ただし、文書画像の同じ垂直位置
に水平主要ホワイト領域と垂直主要ホワイト領域がある
場合は、水平主要ホワイト領域を優先し、その後で、垂
直主要ホワイト領域を左から右の順にシーケンスに組み
込む。ステップＳ４４０で、文書画像４００の主要ホワ
イト領域の順序付けられたシーケンスを表わす一次元デ
ータストリングを検出する。本実施形態では、一次元デ
ータストリングは、主要ホワイト領域のタイプを規則に
したがって順次並べた一連のストリングである。ステッ
プＳ４５０で、ステップＳ５００に復帰する。FIG. 37 shows the detailed process of step S400, which determines a one-dimensional data string (column division sequence) of the main white areas based on the positions and intersections of the main white areas extracted from the document image. After operation starts in step S400, the main white area intersections 480 of the document image 400 are identified in step S410. In step S420, the type (HL, VC, etc.) of each main white area is detected based on the positions of the intersections. In step S430, an ordered sequence of the main white areas is determined based on the detected positions and types of the main white areas. In the preferred embodiment, the ordered sequence runs from the top to the bottom of the document image 400, and then from the left to the right of the document image 400. When a plurality of main white areas exist at the same vertical position (height) in the document image, they are ordered from the left edge to the right edge of the document image. However, when a horizontal main white area and a vertical main white area exist at the same vertical position, the horizontal main white area takes priority, and the vertical main white areas are then added to the sequence in left-to-right order. In step S440, a one-dimensional data string representing the ordered sequence of the main white areas of the document image 400 is determined. In the present embodiment, the one-dimensional data string is a string in which the types of the main white areas are arranged in that order. In step S450, operation returns to step S500.
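The ordering rule of steps S430 and S440 amounts to a three-part sort key, as the following sketch shows. It assumes each main white area is given as a `(top, left, kind)` tuple. It also assumes that a type label beginning with "H" denotes a horizontal area and one beginning with "V" a vertical area. Both conventions, and the function name `column_sequence`, are assumptions made for the example.

```python
def column_sequence(regions):
    """Build the one-dimensional data string from (top, left, kind) tuples.

    Sort key: top edge first (top to bottom); at the same height a
    horizontal area precedes vertical ones; remaining ties go left to right.
    """
    def key(region):
        top, left, kind = region
        is_vertical = 0 if kind.startswith("H") else 1
        return (top, is_vertical, left)

    return " ".join(kind for _, _, kind in sorted(regions, key=key))


regions = [
    (0, 300, "VC"),   # vertical area at the top, further right
    (0, 0, "HL"),     # horizontal area at the top
    (200, 0, "HL"),   # horizontal area lower on the page
    (0, 100, "VC"),   # vertical area at the top, further left
]
print(column_sequence(regions))
```

The resulting string is the input that the subsequent step compares with the finite state machine.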

【００７０】図３８は、ステップＳ５００の好適な実施
形態の詳細なプロセスを示す。ステップＳ５００は、一
次元データストリングを有限状態マシン（オートマト
ン）と比較して、文書画像をカラムに分割する可能なパ
スの各々のコストマトリックスを決定し、可能なパスの
中から最適のパスを特定する工程である。ステップＳ５
００でオペレーションを開始し、ステップＳ５１０で、
有限状態マシンを決定（あるいは生成）し、許容カラム
レイアウトを定義する。有限状態マシンは、原文書の構
造モデルの一部として供給するのが好ましい。ステップ
Ｓ５２０で、文書画像を分割する各パスごとに、有限状
態マシンと、一次元データストリングとを比較し、適切
な相対表現マッチングを使用してコストマトリックスを
決定する。ステップＳ５３０で、コストマトリックスを
通る候補のパスを評価し、最適のパスを選択する。最適
のパスは、許容カラムレイアウトと、可能なパスに対す
る所定の編集コスト（挿入、削除、置換）とに基づいて
選択し、文書画像を通してトータルコストゼロ（０）と
なる適正パスを獲得する。ステップＳ５４０で、ステプ
Ｓ６００に戻る。FIG. 38 shows the detailed process of a preferred embodiment of step S500. Step S500 compares the one-dimensional data string with a finite state machine (automaton), determines a cost matrix for each of the possible paths that divide the document image into columns, and identifies the optimal path among the possible paths. Operation starts in step S500, and in step S510 a finite state machine is determined (or generated) to define the allowable column layouts. The finite state machine is preferably supplied as part of the structural model of the original document. In step S520, for each path that divides the document image, the finite state machine is compared with the one-dimensional data string, and a cost matrix is determined using appropriate relative expression matching. In step S530, the candidate paths through the cost matrix are evaluated and the optimal path is selected. The optimal path is selected based on the allowable column layouts and the predetermined editing costs (insertion, deletion, substitution) for the possible paths, so as to obtain a proper path through the document image with a total cost of zero (0). In step S540, operation returns to step S600.
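A minimal sketch can show how a data string is scored against a finite state machine using edit costs. It is not the procedure of the specification: it assumes unit costs for insertion, deletion, and substitution, and it uses a shortest-path search over (position, state) pairs instead of an explicit cost matrix. The function name, the transition table format, and the three-state example machine are all hypothetical.

```python
import heapq


def approximate_match_cost(tokens, transitions, start, accepting):
    """Minimum edit cost for `tokens` to be accepted by the machine.

    transitions: {state: [(symbol, next_state), ...]}
    Each node of the search is (tokens consumed, current state).
    """
    n = len(tokens)
    best = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        cost, i, state = heapq.heappop(heap)
        if cost > best.get((i, state), float("inf")):
            continue  # stale heap entry
        if i == n and state in accepting:
            return cost
        moves = []
        if i < n:
            # Extra token in the input: consume it, stay in the same state.
            moves.append((cost + 1, i + 1, state))
        for symbol, nxt in transitions.get(state, []):
            # Expected token missing from the input: advance the machine only.
            moves.append((cost + 1, i, nxt))
            if i < n:
                # Exact match costs 0, substitution costs 1.
                step = 0 if symbol == tokens[i] else 1
                moves.append((cost + step, i + 1, nxt))
        for c, j, s in moves:
            if c < best.get((j, s), float("inf")):
                best[(j, s)] = c
                heapq.heappush(heap, (c, j, s))
    return None  # no accepting state is reachable


# Hypothetical layout model: HL, then VC, then HL.
model = {0: [("HL", 1)], 1: [("VC", 2)], 2: [("HL", 3)]}
print(approximate_match_cost(["HL", "VC", "HL"], model, start=0, accepting={3}))
```

A cost of zero means the extracted sequence already conforms to an allowable layout. A nonzero cost indicates extra, missing, or substituted main white areas that the later steps would have to resolve.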

【００７１】図３９は、最適のパスから文書画像のカラ
ムレイアウトを識別するステップＳ６００の好適な実施
形態を示す。ステップＳ６００でスタートし、ステップ
Ｓ６０５で、最大値のコストを有するすべての最適パス
を検索する。ステップＳ６１０で最適パスをチェック
し、評価していないパスがまだあるかどうかを調べる。
すべてのパスが評価されたと判定されたら、ステップＳ
６５５に進み、ステップＳ７００へと戻る。FIG. 39 shows a preferred embodiment of step S600, which identifies the column layout of the document image from the optimal path. Operation starts in step S600, and in step S605 all optimal paths having the maximum cost are retrieved. In step S610, the optimal paths are checked to see whether any path has not yet been evaluated. If it is determined that all paths have been evaluated, operation proceeds to step S655 and returns to step S700.

【００７２】ステップＳ６１０で、まだ評価していない
最適パスが残っている場合は、ステップＳ６１５に進
み、評価していないパスを現在の最適パスとして選択
し、ステップＳ６２０に進む。ステップＳ６２０で、現
在の最適パスをチェックして、図４０のような余分な挿
入があるかどうかを調べる。余分な挿入が検出された場
合は、ステップＳ６２５でこれを消去し、ステップＳ６
２５から直接ステップＳ６７０にジャンプする。ステッ
プＳ６２０で、現在の最適パスに余分な挿入がない場合
は、ステップＳ６３０に進む。If an optimal path that has not yet been evaluated remains in step S610, operation proceeds to step S615, where an unevaluated path is selected as the current optimal path, and then proceeds to step S620. In step S620, the current optimal path is checked for an extra insertion such as that shown in FIG. 40. If an extra insertion is detected, it is eliminated in step S625, and operation jumps directly from step S625 to step S670. If the current optimal path has no extra insertion in step S620, operation proceeds to step S630.

【００７３】ステップＳ６３０で、現在の最適パスをチ
ェックして、図４１のような欠落がないかどうかを調べ
る。ステップＳ６３０で、現在の最適パスに欠落が検出
されない場合は、ステップＳ６６０に進む。現在の最適
パスに欠落がある場合は、ステップＳ６３５に進み、欠
落部分周辺の近接のマッチする主要ホワイト領域を識別
する。ステップＳ６４０で、ステップでＳ６３５で識別
した近接のマッチする主要ホワイト領域に基づいて、欠
落した主要ホワイト領域のタイプを識別する。ステップ
Ｓ６４５で、文書ホワイト領域抽出システム１１０は、
低減したしきい値で文書画像を再度処理し、欠落した主
要ホワイト領域の位置を特定する。In step S630, the current optimal path is checked for an omission such as that shown in FIG. 41. If no omission is detected in the current optimal path in step S630, operation proceeds to step S660. If the current optimal path has an omission, operation proceeds to step S635, where the neighboring matched main white areas around the missing portion are identified. In step S640, the type of the missing main white area is identified based on the neighboring matched main white areas identified in step S635. In step S645, the document white area extraction system 110 reprocesses the document image with reduced thresholds to locate the missing main white area.

【００７４】ステップＳ６５０で、新たに抽出した主要
ホワイト領域をチェックし、欠落した主要ホワイト領域
が適正に位置するかどうかを決定する。ステップＳ６５
０で、欠落した主要ホワイト領域が適正に位置しないと
判定された場合は、オペレーションはステップＳ６１０
に戻る。ステップＳ６５０で、欠落した主要ホワイト領
域が適正に位置すると判定された場合は、ステップＳ６
７０に進む。In step S650, the newly extracted main white areas are checked to determine whether the missing main white area has been properly located. If it is determined in step S650 that the missing main white area has not been properly located, operation returns to step S610. If it is determined in step S650 that the missing main white area has been properly located, operation proceeds to step S670.

【００７５】ステップＳ６６０では、現在の最適パスを
チェックして、置換があるかどうかを調べる。置換がな
い場合は、ステップＳ６７０に進み、一方、置換がある
場合は、ステップＳ６６５で置換を削除して、ステップ
Ｓ６３５に進む。In step S660, the current optimal path is checked for a substitution. If there is no substitution, operation proceeds to step S670; if there is a substitution, it is deleted in step S665 and operation proceeds to step S635.

【００７６】ステップＳ６７０で、上下左右の４つのマ
ージン（余白）を、現在の最適パスでマッチした主要ホ
ワイト領域に付け加える。ステップＳ６７５で、最適パ
スで主要ホワイト領域で分割される文書エレメントをカ
ラムに抽出する。その後、ステップＳ６８０で、ステッ
プＳ７００に戻る。In step S670, the four margins (top, bottom, left, and right) are added to the main white areas matched in the current optimal path. In step S675, the document elements separated by the main white areas of the optimal path are extracted as columns. Then, in step S680, operation returns to step S700.
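The branching described for steps S620 through S675 can be summarized in a short control-flow sketch. It assumes that the edit operations found on one optimal path are given as a set of labels ("insert", "missing", "substitute"). These labels, the `relocated_ok` flag standing in for the check of step S650, and the function name `trace_steps` are hypothetical.

```python
def trace_steps(ops, relocated_ok=True):
    """Return the step labels visited for one optimal path.

    ops: set of edit-operation labels present on the path.
    relocated_ok: outcome of the S650 check on the re-extracted area.
    """
    steps = ["S620"]
    if "insert" in ops:
        # Extra insertion: eliminate it, then go straight to the margins.
        return steps + ["S625", "S670", "S675"]
    steps.append("S630")
    if "missing" not in ops:
        steps.append("S660")
        if "substitute" not in ops:
            return steps + ["S670", "S675"]
        # A substitution is deleted and then handled like an omission.
        steps.append("S665")
    steps += ["S635", "S640", "S645", "S650"]
    if relocated_ok:
        return steps + ["S670", "S675"]
    return steps + ["S610"]  # try the next unevaluated optimal path


print(trace_steps({"missing"}, relocated_ok=False))
```

Tracing the labels this way makes it easy to see that a substitution is routed through the same re-extraction steps as an omission.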

【００７７】以上、本発明を特定の実施形態に基づいて
述べてきたが、当業者にとって多様な置換、変形が可能
であることはいうまでもない。上述した本発明の好適な
実施形態は一例に過ぎず、これに制限されずに、本発明
の原理と範囲内で多様な変形が可能である。Although the present invention has been described based on the specific embodiments, it goes without saying that various substitutions and modifications can be made by those skilled in the art. The preferred embodiments of the present invention described above are merely examples, and the present invention is not limited thereto and various modifications can be made within the principle and scope of the present invention.

【００７８】[0078]

【発明の効果】本発明によれば、有限状態マシンとの近
似マッチング方法を使用して、主要ホワイト領域の所望
のパターンに最も近いシーケンスを出力するので、カラ
ムの分割、識別において、データロスや誤認を回避でき
る。したがって、非矩形の文書エレメントカラムを含む
複合カラム文書画像であっても、無関係なホワイト空間
領域を誤って抽出したり、重要なホワイト空間領域を見
落としたりすることなく、正確かつ効果的に文書カラム
の分割、識別が達成できる。According to the present invention, an approximate matching method with a finite state machine is used to output the sequence of main white areas closest to the desired pattern, so data loss and misidentification can be avoided when dividing and identifying columns. Therefore, even for a composite-column document image that includes non-rectangular document element columns, document columns can be divided and identified accurately and effectively without erroneously extracting irrelevant white space areas or overlooking important white space areas.

[Brief description of drawings]

【図１】本発明の文書エレメント分割システムの好まし
い実施形態を示すブロック図である。FIG. 1 is a block diagram showing a preferred embodiment of a document element division system of the present invention.

【図２】図１の文書エレメント分割システムにおける、
文書ホワイト領域抽出システムの好ましい実施形態を示
すブロック図である。FIG. 2 is a block diagram showing a preferred embodiment of the document white area extraction system in the document element division system of FIG. 1.

【図３】図２の文書ホワイト領域抽出システムにおけ
る、主要ホワイト領域抽出手段の好ましい実施形態を示
すブロック図である。FIG. 3 is a block diagram showing a preferred embodiment of the main white area extracting means in the document white area extraction system of FIG. 2.

【図４】サンプル文書画像の例を示す図である。FIG. 4 is a diagram showing an example of a sample document image.

【図５】抽出した水平方向の一次ホワイト領域を有する
文書画像の図である。FIG. 5 is a diagram of a document image having an extracted horizontal primary white area.

【図６】図５の文書画像において、抽出した垂直方向の
一次ホワイト領域を有する文書画像を示す図である。FIG. 6 is a diagram showing the document image of FIG. 5 with the extracted vertical primary white areas.

【図７】抽出した主要ホワイト領域を有する別の文書画
像の図である。FIG. 7 is a diagram of another document image having an extracted main white area.

【図８】垂直方向に延びる主要ホワイト領域の交差のタ
イプを示すサンプル文書画像の図である。FIG. 8 is a sample document image showing types of intersections of major white regions extending vertically.

【図９】水平方向に延びる主要ホワイト領域交差のタイ
プを示すサンプル文書画像の図である。FIG. 9 is an illustration of a sample document image showing types of major white region intersections that extend horizontally.

【図１０】サンプル文書画像、および主要ホワイト領域
のしきい値処理基準を示す図である。FIG. 10 is a diagram showing a sample document image and thresholding criteria for a main white area.

【図１１】サンプル文書画像、および主要ホワイト領域
のしきい値処理基準を示す図である。FIG. 11 is a diagram showing a sample document image and thresholding criteria for a main white area.

【図１２】文書カラム内の文書エレメントの反復配置を
示すサンプル文書画像である。FIG. 12 is a sample document image showing a repetitive arrangement of document elements within a document column.

【図１３】文書カラム内の文書エレメントの反復配置を
示すサンプル文書画像である。FIG. 13 is a sample document image showing a repetitive arrangement of document elements within a document column.

【図１４】本発明の好適な実施形態の有限状態マシンの
各状態を示す図である。FIG. 14 is a diagram showing each state of the finite state machine of the preferred embodiment of the present invention.

【図１５】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 15 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図１６】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 16 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図１７】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 17 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図１８】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 18 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図１９】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 19 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図２０】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 20 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図２１】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 21 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図２２】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 22 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図２３】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 23 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図２４】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 24 is a diagram of a sample document showing one of the major white area models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図２５】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 25 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図２６】図１４の有限状態マシンの状態から状態への
遷移で定義される、主要ホワイト領域モデルのひとつを
示すサンプル文書の図である。FIG. 26 is a diagram of a sample document showing one of the major white region models defined by the state-to-state transitions of the finite state machine of FIG. 14.

【図２７】別のサンプル有限状態マシンを示す図であ
る。FIG. 27 illustrates another sample finite state machine.

【図２８】図２７のサンプル有限状態マシンに対応する
コストマトリックスの図である。FIG. 28 is a diagram of a cost matrix corresponding to the sample finite state machine of FIG. 27.

【図２９】図２７のサンプル有限状態マシンに対応する
コストマトリックスの図である。FIG. 29 is a diagram of a cost matrix corresponding to the sample finite state machine of FIG. 27.

【図３０】図２８のコストマトリックスに対応する主要
ホワイト領域レイアウトのサンプル文書画像のの図であ
る。FIG. 30 is a diagram of a sample document image of a major white area layout corresponding to the cost matrix of FIG. 28.

【図３１】図２９のコストマトリックスに対応する主要
ホワイト領域レイアウトのサンプル文書の図である。FIG. 31 is a sample document of a major white area layout corresponding to the cost matrix of FIG. 29.

【図３２】図３１のサンプル文書の解析の結果、主要ホ
ワイト領域の欠落があることを示す図である。FIG. 32 is a diagram showing that a main white area is missing as a result of analyzing the sample document of FIG. 31.

【図３３】図３１のサンプル文書の解析の結果、主要ホ
ワイト領域の余分な挿入があることを示す図である。FIG. 33 is a diagram showing that there is an extra insertion of a main white area as a result of analysis of the sample document of FIG. 31.

【図３４】主要ホワイト領域パターンのマッピングによ
って、複合カラム文書エレメントを分割する方法の好適
な実施形態を示すフローチャートである。FIG. 34 is a flow chart illustrating a preferred embodiment of a method for splitting a composite column document element by mapping a main white area pattern.

【図３５】図３４のフローチャートのうち、入力画像か
ら主要ホワイト領域を識別するステップの好適な実施形
態を詳細に示すフローチャートである。FIG. 35 is a flowchart detailing a preferred embodiment of the step, in the flowchart of FIG. 34, of identifying the main white areas from an input image.

【図３６】図３５に示した主要ホワイト領域抽出ステッ
プの、好適な実施形態を詳細に示すフローチャートであ
る。FIG. 36 is a flowchart detailing a preferred embodiment of the main white region extraction step shown in FIG. 35.

【図３７】図３４のフローチャートのうち、主要ホワイ
ト領域を表わすデータストリングを生成するステップの
好適な実施形態を詳細に示すフローチャートである。FIG. 37 is a flowchart detailing a preferred embodiment of the step, in the flowchart of FIG. 34, of generating a data string representing the main white areas.

【図３８】図３４のフローチャートのうち、データスト
リングを処理するステップの好適な実施形態を詳細に示
すフローチャートである。FIG. 38 is a flowchart detailing a preferred embodiment of the step, in the flowchart of FIG. 34, of processing the data string.

【図３９】図３４のフローチャートのうち、最適のパス
に対応する入力文書画像のカラムレイアウトを識別する
ステップの好適な実施形態を詳細に示すフローチャート
である。FIG. 39 is a flowchart showing in detail a preferred embodiment of a step of identifying a column layout of an input document image corresponding to an optimum path in the flowchart of FIG. 34.

【図４０】文書分割システムによって抽出された、構造
的に重要でない主要ホワイト領域の余分な挿入を有する
入力画像を示す図である。FIG. 40 shows an input image with extra insertion of structurally insignificant major white regions extracted by the document segmentation system.

【図４１】文書分割システムによって抽出されなかっ
た、構造的に重要な主要ホワイト領域の欠落を有する入
力文書画像の図である。FIG. 41 is an illustration of an input document image with structurally significant missing major white areas that were not extracted by the document segmentation system.

[Explanation of symbols]

100: 文書エレメント分割システム (document element division system)
110: 文書ホワイト領域抽出システム (document white area extraction system)
130: メモリ (memory)
140: プロセッサ (processor)
150: ストリング変換手段 (string conversion means)
160: 比較手段 (comparison means)
170: カラムレイアウト識別手段 (column layout identification means)
180: 論理タグ割り当て手段 (logical tag assignment means)
190: 文書エレメント抽出手段 (document element extraction means)
400: 文書画像 (document image)
460: 主要ホワイト空間 (main white space)
470: 文書エレメント (document element)

フロントページの続き (72)発明者ムージタジャインアメリカ合衆国 85721 アリゾナ州トゥーソンユニヴァーシティーオブアリゾナデパートメントオブコンピューターサイエンス内 Continuation of front page: (72) Inventor Jain Muesita, c/o Department of Computer Science, University of Arizona, Tucson, Arizona 85721, United States of America

Claims

[Claims]

1. A method for logically identifying document elements of a composite column document image, comprising the steps of: identifying major background areas in the document image; determining an ordered data string corresponding to the placement of the major background areas in the document image; comparing the ordered data string with a finite state machine and determining, from among at least one candidate path for the ordered data string, an optimal path that best matches the finite state machine; and identifying a column layout based on the determined optimal path.