JP2005250816A

JP2005250816A - Document image layout analyzing program

Info

Publication number: JP2005250816A
Application number: JP2004059954A
Authority: JP
Inventors: Hiroaki Takebe; 浩明武部; Katsuto Fujimoto; 克仁藤本; Satoshi Naoi; 聡直井
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2004-03-04
Filing date: 2004-03-04
Publication date: 2005-09-15
Anticipated expiration: 2024-03-04
Also published as: JP4480421B2

Abstract

PROBLEM TO BE SOLVED: To extract appropriate lines or text blocks from a document image with a complicated layout.

SOLUTION: A blank area in the document image is extracted as a virtual separator (step S1). With integration of text elements across the virtual separator prohibited, a plurality of text elements are integrated and extracted as an integrated text element (step S2). An integrated text element that reflects the document image layout is thereby extracted. Each extracted integrated text element is then verified as to whether it is appropriate as a line or a text block (step S3). If the verification finds it inappropriate, the size of the blank area is changed by a control parameter, a virtual separator based on the changed blank area is newly extracted for the integrated text element found inappropriate, and the processing of step S2 is carried out again. By recursively repeating this processing, appropriate lines or text blocks can be extracted even from a document image with a complicated layout.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a document image layout analysis program that analyzes a document image layout, that is, the physical arrangement of elements such as characters, lines, text blocks, figures, and frames in a document image.

In recent years, optical character recognition (OCR), which identifies character components in a document image captured into a computer with an optical device such as a scanner and outputs them as character codes, has come into wide use. OCR optically scans a document image containing printed or handwritten characters to obtain quantized data, and extracts text blocks containing character components from that data. Character components are then extracted from each text block, and character recognition is performed by a technique such as pattern matching.

Conventionally, the following methods have been proposed for extracting text blocks from a document image. For example, Patent Document 1 discloses integrating a set of basic elements into lines on the basis of their proximity (for example, character components are arranged relatively densely) and homogeneity (for example, character components are roughly the same size). In the same way, the set of lines is integrated into columns (text blocks) on the basis of proximity and homogeneity. It also discloses re-extracting lines and columns (text blocks) by treating the generated columns (text blocks) as constraints. Specifically, connected components of black pixels in the document image serve as basic elements; they are integrated into lines, and the lines are integrated into text blocks. Whether two document elements should be integrated is decided from the size and positional relationship between a document element and the other document elements around it.

Patent Document 2 discloses cutting out text block areas from a document image by extracting a set of blank areas from the document image, selecting those that satisfy a predetermined size condition, and extracting the areas not covered by the selected blank areas.
Patent Document 1: JP H11-219407 A
Patent Document 2: JP H02-263272 A

However, layouts of document elements are complex and diverse, and it is extremely difficult to link together the document elements that make up the same document element one level up, as Patent Document 1 does, using only information about the local arrangement of document elements within a set. For example, when text blocks are interleaved with each other, or when text blocks and figures are interleaved, character components are over-integrated and character strings from multiple lines are merged into a single line.

FIGS. 21 and 22 show examples of over-integration. FIG. 21 shows a text block extraction result, and FIG. 22 shows a line extraction result.
FIG. 21 shows that two text blocks 310 and 320 have been extracted from the document image 300. FIG. 22 shows the line extraction results (lines a to n) for the text blocks 310 and 320 of FIG. 21. Lines a to h are extracted correctly, but lines j to n are over-integrated horizontally, and lines i, k, m, and n are also over-integrated vertically. Character recognition on these lines does not yield correct sentences.

The method of Patent Document 2, which extracts a text block area as a closed area bounded by a set of blank areas satisfying a certain condition, has the following problem.
The appropriate size of the blank areas surrounding a text block area differs from one area of the document image to another, so it is difficult to select appropriate blank areas under a fixed condition.

FIG. 23 shows an example of a layout in which a text block area is placed in an area surrounded by several figures.
In such a case, a text block area such as the text block area 330 is arranged so that it cannot be separated from other document elements, such as figures, by a simple rectangle. Selecting appropriate blank areas that surround the text block area and cutting it out, as in the method of Patent Document 2, is therefore even more difficult.

The present invention has been made in view of these points, and an object thereof is to provide a document image layout analysis program capable of extracting appropriate text blocks even from a document image having a complicated layout.

To solve the above problems, the present invention provides a document image layout analysis program that causes a computer to analyze a document image layout. As shown in FIG. 1, the program causes the computer 10 to extract a blank area in a document image as a virtual separator (step S1) and, with integration of text elements across the virtual separator prohibited, to integrate a plurality of text elements and extract them as an integrated text element (step S2).

With this document image layout analysis program, the computer 10 extracts a blank area in the document image as a virtual separator. Then, with integration of text elements across the virtual separator prohibited, it integrates a plurality of text elements and extracts them as an integrated text element. An integrated text element that reflects the document image layout is thereby extracted.

The present invention is further characterized in that the program causes the computer 10 to verify whether each extracted integrated text element is suitable or unsuitable as a line or a text block (step S3) and, if it is unsuitable, to change the size of the blank area by a control parameter, re-extract virtual separators for the integrated text element judged unsuitable, and extract new integrated text elements, repeating this processing recursively.

The computer 10 thus verifies whether each extracted integrated text element is suitable or unsuitable as a line or a text block. If the element is unsuitable, the computer changes the size of the blank area by the control parameter and re-extracts, for that element, virtual separators based on the changed blank area. Then, with integration of text elements across those virtual separators prohibited, it integrates a plurality of text elements and extracts new integrated text elements. Recursively repeating this processing yields appropriate lines or text blocks.

Because the present invention extracts a blank area in a document image as a virtual separator and, with integration of text elements across that separator prohibited, integrates a plurality of text elements into an integrated text element, it can extract integrated text elements that reflect the layout of the document image.

In addition, each extracted integrated text element is verified as suitable or unsuitable as a line or a text block. If it is unsuitable, the size of the blank area is changed by the control parameter, and the re-extraction of virtual separators and the extraction of integrated text elements are repeated recursively until the condition for a line or a text block is satisfied. Appropriate integrated text elements that can be recognized as correct sentences can therefore be extracted.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram showing the principle of the document image layout analysis program of the present invention.
The document image layout analysis program of the present invention causes the computer 10 to analyze a document image layout. The program causes the computer 10 to extract a blank area in a document image as a virtual separator (step S1) and, with integration of text elements across the virtual separator prohibited, to integrate a plurality of text elements and extract them as an integrated text element (step S2).

A text element is a character component, a line consisting of a plurality of character components, or a text block consisting of a plurality of lines. An integrated text element is a line or a text block obtained by integrating a plurality of text elements.

The program further causes the computer 10 to verify whether each extracted integrated text element is suitable or unsuitable as a line or a text block (step S3). If the verification finds the integrated text element unsuitable, processing returns to step S1: the size of the blank area is changed by a control parameter (described in detail later), virtual separators are re-extracted for the integrated text element judged unsuitable, and new integrated text elements are extracted. This processing is repeated recursively.

An outline of the document image layout analysis processing performed by the computer 10 when it executes the above program is described below with a specific example.
FIGS. 2 to 5 are diagrams showing an outline of the document image layout analysis processing performed by the document image layout analysis program of the present invention.

When the document image layout analysis processing starts, blank areas (blank rectangles) whose sizes (in the X and Y directions) are determined by certain control parameters (described later) are extracted from a document image 20 such as that of FIG. 2 as virtual separators 21a, 21b, 21c, 21d, and 21e (step S1).

Next, with integration of text elements (character components and lines) across the virtual separators 21a, 21b, 21c, 21d, and 21e extracted in step S1 prohibited, character components are integrated into lines on the basis of the proximity and homogeneity between character components, and lines are likewise integrated into text blocks, as shown in FIG. 3. The results are extracted as integrated text elements (here, text blocks) 22a, 22b, 22c, 22d, and 22e (step S2).
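The merge prohibition of step S2 can be sketched as follows. This is only an illustration under stated assumptions: rectangles are `(x0, y0, x1, y1)` tuples with half-open extents, and "integration across a virtual separator" is approximated by testing whether the bounding rectangle of the merged pair overlaps any separator. The function names `overlaps`, `union`, and `may_merge` are hypothetical, and the text does not specify the actual test.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open, x0 < x1, y0 < y1


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the interiors of the two rectangles intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def may_merge(a: Rect, b: Rect, separators: List[Rect]) -> bool:
    """Assumed rule: forbid the merge if the merged box touches a separator."""
    merged = union(a, b)
    return not any(overlaps(merged, s) for s in separators)
```

The proximity and homogeneity checks would be applied in addition to this test; they are omitted here because their thresholds are not given.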

Each of the extracted text blocks 22a, 22b, 22c, 22d, and 22e is then verified as to whether it satisfies the condition for a text block (step S3).

The condition here is that the text block can be recognized as correct sentences by character recognition. For example, among the text blocks 22a, 22b, 22c, 22d, and 22e extracted as in FIG. 3, the text blocks 22b and 22c are not recognized as correct sentences even when character recognition is applied, so the verification in step S3 judges them unsuitable because they do not satisfy the condition for a text block (the condition is described in detail later).

In this case, processing returns to step S1, and the size of the blank areas extracted as virtual separators is changed by the control parameters (specifically, so that narrower blank areas are extracted as virtual separators). Virtual separators 23a, 23b, 23c, 23d, and 23e are re-extracted for the text blocks 22b and 22c as shown in FIG. 4, and new text blocks 24a, 24b, 24c, 24d, 24e, and 24f are extracted as shown in FIG. 5.

The above processing is repeated recursively until the condition for a text block is satisfied.
The control parameters are set on the basis of the recursion count, the size of the integrated text element, and the size of the characters it contains.

With such a document image layout analysis program, integrated text elements that reflect the layout of the document image can be extracted.
In addition, each extracted integrated text element is verified as suitable or unsuitable as a line or a text block. If it is unsuitable, the size of the blank area is changed by the control parameter, and the re-extraction of virtual separators and the extraction of integrated text elements are repeated recursively until the condition for a line or a text block is satisfied. Appropriate integrated text elements that can be recognized as correct sentences can therefore be extracted.
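The recursive control flow of steps S1 to S3 can be outlined as a small skeleton. This is a sketch, not the program itself: `split` stands in for steps S1 and S2 combined (separator extraction and constrained integration, with the control parameter derived from the recursion depth), `is_valid` stands in for the verification of step S3, and `max_depth` is an added safeguard that the text does not mention. All three names are hypothetical.

```python
from typing import Callable, List, TypeVar

Block = TypeVar("Block")


def extract_blocks(region: Block,
                   split: Callable[[Block, int], List[Block]],
                   is_valid: Callable[[Block], bool],
                   depth: int = 0,
                   max_depth: int = 5) -> List[Block]:
    """Split a region, keep the valid blocks, and recurse into the invalid ones."""
    result: List[Block] = []
    for block in split(region, depth):          # steps S1 + S2 at this depth
        if is_valid(block) or depth + 1 >= max_depth:
            result.append(block)                # step S3 passed, or depth limit hit
        else:
            # Re-run S1 + S2 on the rejected block only, one level deeper,
            # where split() is expected to use a finer separator size.
            result.extend(
                extract_blocks(block, split, is_valid, depth + 1, max_depth))
    return result
```

Only the rejected blocks are reprocessed, which mirrors the description that virtual separators are re-extracted for the integrated text element judged unsuitable rather than for the whole page.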

Details of the embodiment of the present invention are described below.
FIG. 6 shows an example hardware configuration of a document image layout analysis apparatus to which the document image layout analysis program is applied.

The document image layout analysis apparatus 100 is, for example, a PC (personal computer) and includes a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an HDD (Hard Disk Drive) 104, a graphic processing unit 105, an input I/F (interface) 106, a communication I/F 107, and the like, which are interconnected via a bus 108.

The CPU 101 controls each unit in accordance with programs and various data stored in the ROM 102 and the HDD 104.
The ROM 102 stores basic programs executed by the CPU 101 and data.

The RAM 103 stores programs being executed by the CPU 101 and data in the middle of computation.
The HDD 104 stores the OS (Operating System) executed by the CPU 101, various application programs such as the document image layout analysis program of the present invention, and various data such as document image data read by an optical device such as a scanner (not shown).

A display device, for example a display 105a, is connected to the graphic processing unit 105, which displays document images and the like on the screen of the display 105a in accordance with drawing commands from the CPU 101.

A mouse 106a and a keyboard 106b are connected to the input I/F 106, which receives information entered by the user and transmits it to the CPU 101 via the bus 108.
The communication I/F 107 is connected to a network 120 such as the Internet and communicates with other devices connected to the network 120.

Next, the document image layout analysis processing performed by the document image layout analysis apparatus 100 is described in detail.
In the following description, the integrated text element described above is assumed to be a text block, but it may also be a line.

The document image layout analysis processing described below is realized when, under the control of the CPU 101, the document image layout analysis program of the present invention and various data such as document image data, stored for example in the ROM 102 or the HDD 104, are read out, loaded into the RAM 103, and executed.

FIG. 7 is a diagram showing an overview of the entire document image layout analysis processing.
As shown in the figure, the document image layout analysis processing consists of connected component attribute assignment processing (step S10) and recursive text block extraction processing (step S20).

The connected component attribute assignment processing (step S10) assigns one of the attributes character component, separator, figure, frame, or noise to every connected component of black pixels in the document image. Here, a figure is a connected component that is not a character component, separator, frame, or noise and that does not contain any character component. A frame is a border that encloses a plurality of character components.

The recursive text block extraction processing (step S20) corresponds to steps S1 to S3 shown in FIG. 1. For the set of connected components to which attributes were assigned in step S10, virtual separators are extracted, text blocks are extracted, and each text block is verified against the condition for a text block. If a text block is unsuitable, the size of the blank areas is changed by the control parameters, virtual separators are re-extracted for that text block, and new text blocks are extracted. This processing is repeated recursively.

First, the connected component attribute assignment processing is described in detail.
FIG. 8 is a flowchart showing an example of the flow of the connected component attribute assignment processing.
For example, when a document image stored in the HDD 104 is retrieved under the control of the CPU 101, labeling processing is first performed on the document image.

FIG. 9 is a diagram showing a specific example of the labeling processing.
For example, the character component 「た」 ("ta") consists of three connected components 201, 202, and 203 of black pixels. The labeling processing acquires information on the connected components of black pixels by obtaining the coordinate values (XY coordinates) of the circumscribed rectangles 201a, 202a, and 203a, which are the smallest rectangles enclosing the connected components 201, 202, and 203. This processing is performed for all connected components in the document image (step S11).
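The labeling of step S11 can be illustrated with a breadth-first search over a binary image. The sketch below assumes that the image is a list of rows in which 1 means a black pixel, that pixels are joined with 8-connectivity, and that each circumscribed rectangle is reported as a half-open tuple `(x0, y0, x1, y1)`. None of these conventions comes from the text, and `label_components` is a hypothetical name.

```python
from collections import deque
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), x1 and y1 exclusive


def label_components(image: List[List[int]]) -> List[Rect]:
    """Return the circumscribed rectangle of every connected black-pixel component."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes: List[Rect] = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one component, tracking its extreme coordinates.
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1 + 1, y1 + 1))
    return boxes
```

A character such as 「た」 would yield several rectangles at this stage, one per stroke group, which is why later steps merge overlapping components.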

Next, separator determination processing is performed on the set S of connected components obtained in step S11. A connected component is determined to be a separator when the long side of its circumscribed rectangle is at least a certain length and the aspect ratio of the circumscribed rectangle is at least a certain value (step S12).

Noise determination processing is also performed on the set S of connected components. A connected component is determined to be noise when the area of its circumscribed rectangle is at most a certain value (step S13).
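The two threshold tests of steps S12 and S13 can be written down directly. In this sketch the threshold values `min_length`, `min_aspect`, and `max_area` are placeholders chosen for illustration, since the text only says "a certain value", and rectangles are assumed to be half-open `(x0, y0, x1, y1)` tuples.

```python
from typing import Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open


def is_separator(rect: Rect, min_length: int = 100, min_aspect: float = 10.0) -> bool:
    """Step S12: long and thin circumscribed rectangles are separators."""
    width, height = rect[2] - rect[0], rect[3] - rect[1]
    long_side = max(width, height)
    short_side = max(min(width, height), 1)  # guard against division by zero
    return long_side >= min_length and long_side / short_side >= min_aspect


def is_noise(rect: Rect, max_area: int = 9) -> bool:
    """Step S13: very small circumscribed rectangles are noise."""
    width, height = rect[2] - rect[0], rect[3] - rect[1]
    return width * height <= max_area
```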

Hierarchization processing is then performed on the set Sa obtained by removing from the set S of connected components the separators and noise identified in steps S12 and S13.
FIG. 10 is a diagram showing a specific example of the hierarchization processing.

For example, a character component such as 「区」 ("ku") consists of two connected components 211 and 212 of black pixels. These were labeled in step S11 with the coordinate values of their circumscribed rectangles 211a and 212a. In this character, the connected component 212 is contained in the connected component 211. When connected components are related in this way, the hierarchization processing registers the connected component 212 as a "child" of the connected component 211 and the connected component 211 as the "parent" of the connected component 212 (step S14).
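The parent and child registration of step S14 can be sketched from rectangle containment. Two assumptions are made here that the text does not state: containment is tested on the circumscribed rectangles rather than on the pixels, and when several rectangles enclose a component, the smallest one is taken as its parent. The names `contains`, `area`, and `build_hierarchy` are hypothetical.

```python
from typing import Dict, List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open


def contains(outer: Rect, inner: Rect) -> bool:
    """True when inner lies within outer and the two are not identical."""
    return (outer != inner and outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def area(rect: Rect) -> int:
    return (rect[2] - rect[0]) * (rect[3] - rect[1])


def build_hierarchy(rects: List[Rect]) -> Tuple[Dict[int, int], Dict[int, List[int]]]:
    """Return (parent, children) index maps for the given rectangles."""
    parent: Dict[int, int] = {}
    children: Dict[int, List[int]] = {i: [] for i in range(len(rects))}
    for i, inner in enumerate(rects):
        best = None
        for j, outer in enumerate(rects):
            if contains(outer, inner) and (best is None or area(outer) < area(rects[best])):
                best = j  # smallest enclosing rectangle found so far
        if best is not None:
            parent[i] = best
            children[best].append(i)
    return parent, children
```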

When the hierarchization processing is finished, character recognition processing is performed on the set Sa. For each connected component, the area of its circumscribed rectangle is first recognized as a single character as is. If the reliability of the character recognition result is high, the character component flag "CH" is attached to the connected component. Next, if a connected component has "children", as in FIG. 10 for example, overlap integration is performed on the set of "child" connected components, and the circumscribed rectangle area of every resulting overlap component is recognized as a single character.

FIG. 11 is a diagram showing a frame that contains character components.
As shown in this figure, the connected component 220 has as "children" the circumscribed rectangle areas 221, 222, 223, and 224, obtained by overlap integration, whose character recognition results have high reliability. When a connected component has at least a certain number of "children" with highly reliable character recognition results in this way, it may be a frame enclosing a plurality of characters, so the frame flag "FR" is attached to the connected component 220 (step S15).

Next, character component/frame/figure determination processing is performed on the set Sa. Specifically, when a connected component carries the character component flag "CH" and has a "parent", the frame flag "FR" is attached to the "parent" if the "parent" does not carry the character component flag "CH". If the "parent" does carry the character component flag "CH", the reliability of the "child" connected component is compared with that of the "parent" connected component; if the reliability of the "child" is higher, the character component flag "CH" of the "parent" is cancelled and the frame flag "FR" is attached to it. Then, within the set Sa, all connected components carrying the frame flag "FR" are treated as frames. Among the remaining connected components, those that do not carry the character component flag "CH" and whose circumscribed rectangle has an area of at least a certain value are treated as figures. The remaining connected components are treated as character components (step S16).
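The flag rules of step S16 can be traced with a small sketch. The `Component` record, the `classify` function, and the `min_figure_area` threshold are hypothetical, and the components are simply processed in list order, which the text does not prescribe; a different order could change the outcome when flags are cancelled along a chain of parents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    area: int                      # area of the circumscribed rectangle
    confidence: float = 0.0        # reliability of single-character recognition
    ch: bool = False               # character component flag "CH"
    fr: bool = False               # frame flag "FR"
    parent: Optional[int] = None   # index of the parent component, if any


def classify(components: List[Component], min_figure_area: int = 2500) -> List[str]:
    """Apply the parent/child flag rules, then label every component."""
    for comp in components:
        if not comp.ch or comp.parent is None:
            continue
        parent = components[comp.parent]
        if not parent.ch:
            parent.fr = True                       # parent of a character is a frame
        elif comp.confidence > parent.confidence:
            parent.ch = False                      # child is the more reliable character
            parent.fr = True
    labels = []
    for comp in components:
        if comp.fr:
            labels.append("frame")
        elif not comp.ch and comp.area >= min_figure_area:
            labels.append("figure")
        else:
            labels.append("character")
    return labels
```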

Finally, overlap integration processing is performed on the set of connected components classified as character components (step S17).
Through the above processing, one of the attributes character component, separator, figure, frame, or noise can be assigned to every connected component in the document image.

Next, the recursive text block extraction processing of step S20 shown in FIG. 7 is described in detail.
FIG. 12 is a flowchart showing an example of the flow of the recursive text block extraction processing.

When the connected component attribute assignment processing ends and the recursive text block extraction processing starts, processing to obtain the maximal blank rectangles in a rectangular area P of the document image is performed first.
A blank rectangle in a document image is a rectangular area in the document image that contains no black pixels. Among the set of blank rectangles, one that is not contained in any blank rectangle other than itself is called a maximal blank rectangle.

FIG. 13 is a diagram showing an example of a maximal blank rectangle.
The figure shows a rectangular area P in the document image. Within this rectangular area P, a set of circumscribed rectangles S = {R_i ∈ P, i = 1, 2, ..., n}, obtained by the processing described for step S10 of FIG. 7, is given (the figure shows the case n = 5). A blank rectangle in the rectangular area P (hereinafter, an S blank rectangle in P) is defined as a rectangular area within P that does not overlap any rectangle belonging to the set S. Among the S blank rectangles in P, one that is not contained in any S blank rectangle in P other than itself is called an S maximal blank rectangle in P. Hereinafter, the set of S maximal blank rectangles in P is denoted M(P, S). FIG. 13 shows the S maximal blank rectangle 230 in P, which is the largest member of M(P, S) within the rectangular area P.

M(P, S) is restricted by the control parameters n and x, as defined by the following expression.
M_{n,x}(P, S) = {T ∈ M(P, S) | min(T^X, T^Y) ≥ n and max(T^X, T^Y) ≥ x}
Here, T^X denotes the horizontal (X direction) length of an S-maximal blank rectangle T in P belonging to M(P, S), and T^Y denotes the vertical (Y direction) length of T. The condition min(T^X, T^Y) ≥ n means that the shorter of T^X and T^Y is at least the control parameter n, and max(T^X, T^Y) ≥ x means that the longer of the two is at least the control parameter x (step S21).
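As an illustration of the definitions of M(P, S) and M_{n,x}(P, S), the following minimal sketch enumerates the S-maximal blank rectangles in P by brute force and then applies the size condition. The `Rect` type, the half-open coordinate convention, and the function names are assumptions made for this example. The enumeration relies on the fact that every edge of a maximal blank rectangle lies on an edge of P or of a rectangle in S, and it is far too slow for real page images, where a dedicated algorithm would be needed.

```python
from itertools import combinations
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle covering [x0, x1) x [y0, y1)."""
    x0: int
    y0: int
    x1: int
    y1: int


def overlaps(a: Rect, b: Rect) -> bool:
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and outer.x1 >= inner.x1 and outer.y1 >= inner.y1)


def maximal_blank_rects(p: Rect, s: List[Rect]) -> List[Rect]:
    """Brute-force M(P, S): blank rectangles not contained in another one."""
    xs = sorted({p.x0, p.x1} | {v for r in s for v in (r.x0, r.x1)
                                if p.x0 <= v <= p.x1})
    ys = sorted({p.y0, p.y1} | {v for r in s for v in (r.y0, r.y1)
                                if p.y0 <= v <= p.y1})
    blank = []
    for x0, x1 in combinations(xs, 2):
        for y0, y1 in combinations(ys, 2):
            t = Rect(x0, y0, x1, y1)
            if not any(overlaps(t, r) for r in s):
                blank.append(t)
    return [t for t in blank
            if not any(u != t and contains(u, t) for u in blank)]


def filter_by_size(rects: List[Rect], n: int, x: int) -> List[Rect]:
    """M_{n,x}: keep T with min(T^X, T^Y) >= n and max(T^X, T^Y) >= x."""
    result = []
    for t in rects:
        tx, ty = t.x1 - t.x0, t.y1 - t.y0
        if min(tx, ty) >= n and max(tx, ty) >= x:
            result.append(t)
    return result
```

With two text columns above a full-width block, for example, the gutter between the columns and the gap above the block are both maximal blank rectangles; choosing a large x keeps only the long horizontal gap.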

The control parameters n and x are set on the basis of the number of recursions, the size of the text block, and the size of the characters it contains.
On the basis of the attributes assigned to the circumscribed rectangles by the connected component attribute assignment processing described above, the circumscribed rectangles of character components within the set S are denoted "C", and the circumscribed rectangles of non-character components such as frames, separators, and figures are denoted "H". "H" is classified as a set of link-prohibited areas, with which integration of other circumscribed rectangles is prohibited.

At this point, the set of (C ∪ H)-maximal blank rectangles in the rectangular area P obtained in step S21, M_{n,x}(P, C ∪ H), where C ∪ H is the union of C and H, is added as virtual separators to the link-prohibited areas "H". The link-prohibited areas after this addition are denoted "Ha" (step S22).

Next, text blocks are extracted by integrating the circumscribed rectangles of character components in the set "C" on the basis of proximity or homogeneity, while prohibiting any integration that crosses the link-prohibited areas "Ha". A specific method is disclosed in JP-A-11-219407. The processing yields text blocks and the lines that constitute them (step S23).

Next, let l be the number of extracted text blocks and set the loop counter i = 0 (step S24); the following processing is repeated until i = l.
That is, it is determined whether i < l (step S25). If i < l, it is determined whether the extracted text block B_i satisfies the text block suitability condition (described in detail later) (step S26). If it does, i is incremented (step S27) and the processing returns to step S24. When i = l, that is, when all text blocks B_i in the rectangular area P satisfy the text block suitability condition, the processing for the rectangular area P in the document image is terminated, and the processing from step S21 is repeated for another rectangular area (return).

On the other hand, if the text block B_i does not satisfy the text block suitability condition in step S26, the text block B_i is taken as the rectangular area P and, with U denoting a circumscribed rectangle of a character component and V a circumscribed rectangle of a non-character component within B_i, the sets are newly defined as C = {U ∈ C | U ∩ P ≠ ∅} and H = {V ∈ H | V ∩ P ≠ ∅}. The control parameters n and x are then changed for these P, C, and H (step S28), and the processing from step S21 is performed again (step S29). When the recursive processing ends (returns), i is incremented (step S27), the processing returns to step S24, and the next text block B_{i+1} is verified (step S26).
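The control flow of steps S21 to S29 can be sketched as a recursive function. This is a hedged outline only: `extract_blocks`, `is_suitable`, and `shrink` are hypothetical callables standing in for steps S21 to S23, step S26, and step S28 respectively, and the `max_depth` guard is an assumption added here so that the sketch always terminates; the text does not describe such a limit.

```python
from typing import Callable, List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def extract_recursively(
    p: Rect, c: List[Rect], h: List[Rect], n: int, x: int,
    extract_blocks: Callable[[Rect, List[Rect], List[Rect], int, int], List[Rect]],
    is_suitable: Callable[[Rect, List[Rect]], bool],
    shrink: Callable[[int, int], Tuple[int, int]],
    max_depth: int = 8,
) -> List[Rect]:
    """Return the text blocks accepted for area P (sketch of FIG. 12)."""
    accepted = []
    for block in extract_blocks(p, c, h, n, x):        # steps S21 to S23
        if max_depth == 0 or is_suitable(block, c):    # step S26
            accepted.append(block)
            continue
        # Unsuitable block: it becomes the new area P, and C and H are
        # restricted to the rectangles that intersect it.
        sub_c = [u for u in c if intersects(u, block)]
        sub_h = [v for v in h if intersects(v, block)]
        sub_n, sub_x = shrink(n, x)                    # step S28
        accepted.extend(extract_recursively(           # step S29
            block, sub_c, sub_h, sub_n, sub_x,
            extract_blocks, is_suitable, shrink, max_depth - 1))
    return accepted
```

Passing the three steps in as callables keeps the recursion itself independent of how blank rectangles are found or how suitability is judged.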

During the recursive processing, the control parameters n and x are both set so as to decrease. That is, for a text block B_i that did not satisfy the text block suitability condition, the maximal blank rectangles used as virtual separators are made progressively smaller.

In this way, even when text blocks and figures are arranged in a complicated, interleaved manner, problems such as over-integration of character components, in which character strings of several lines are merged into a single line, can be eliminated.

Next, the processing for verifying suitability as a text block in FIG. 12 (step S26) is described in detail.
To determine whether the text block extraction result of step S23 satisfies the condition for a text block (the text block suitability condition), the following two processes are performed.

FIG. 14 is a diagram outlining the processing for verifying suitability as a text block.
Step S30: For each line constituting the text block, it is determined whether the line contains two or more characters across the direction perpendicular to the line direction (the line direction being vertical or horizontal); a suitable line contains no more than one.

FIG. 15 is a diagram showing an example of the lines constituting a text block.
Assume that lines 241 to 245 have been obtained in the text block 240 as shown in the figure. Here, lines 242, 244, and 245 each contain two or more characters across the direction perpendicular to the line direction. In other words, each of these extracted lines actually contains several lines stacked perpendicular to the line direction within its area. Because correct text cannot be obtained from such lines 242, 244, and 245 even if character recognition is performed, they are unsuitable as lines, and the text block 240 containing them is judged unsuitable as a text block.

Step S40: For the lines constituting the text block, it is determined whether a predetermined number of lines or more intersect the same blank area that is larger than the character spacing.
FIG. 16 is a diagram showing a blank area of the text block of FIG. 15.

As shown in the figure, all lines of the text block 240 intersect the same blank area 250, which is larger than the character spacing. Because correct text cannot be obtained from such a text block 240 even if character recognition is performed, it is judged unsuitable as a text block.

Each of the processes of FIG. 14 is described in detail below.
FIG. 17 is a flowchart showing details of the first process of the text block suitability verification processing.

For the extracted text block B_i, let the line extraction result be {L_j : j = 1, 2, …, m}. First, j is set to 0 (step S31), and the following processing is repeated until j = m.

First, it is determined whether j < m (step S32). If j < m, the character candidates M_{L_j} contained in the line L_j are obtained from the character candidate set M. The character candidate set M consists of those components judged to be character components in the connected component attribute assignment processing (FIG. 7) described above that have high recognition reliability (the components carrying the flag "CH" described above may be used) (step S33).

Next, lines are generated from the character candidates M_{L_j} obtained in step S33.
The lines are generated according to the reading order assigned to the character candidates M_{L_j} in the line L_j.

FIG. 18 is a diagram outlining the processing for assigning the reading order.
The character candidates M_{L_j} contained in the line L_j (shown as circumscribed rectangles) are sorted, for example, in ascending order of the Y coordinate of the upper-left point of the circumscribed rectangle. Candidates with the same Y coordinate are sorted in ascending order of X coordinate. The character candidates M_{L_j} are thereby numbered (1) to (16) as shown in the figure.

Here, taking one corner of the character string rectangle enclosing the line L_j as the origin 260, rectangular inspection areas 261 and 262 are set, each having the origin 260 as one corner and containing a character candidate M_{L_j}. The reading order of the character candidates M_{L_j} is determined under the condition that the rectangular inspection area of a candidate contains no candidate that comes later in the reading order.

Specifically, starting from the character candidate numbered (1), a rectangular inspection area that has the origin 260 as one corner and contains the character candidate is set for each candidate in turn. If the area contains no character candidate with a later number, a reading number is assigned; if it contains a candidate with a later number, no reading number is assigned. For example, as shown in FIG. 18, the rectangular inspection area 261 containing the character candidate numbered (1) also contains the character candidates numbered (4), (5), (6), and (7), so no reading number is assigned. Likewise, no reading numbers are assigned to the character candidates numbered (2) and (3). For the character candidate numbered (4), the rectangular inspection area 262 contains no other candidate, so reading number 1 is assigned. By repeating this processing, reading numbers 1 to 16 are assigned within the line L_j as shown in the figure.
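One possible reading of this numbering procedure is sketched below. Several points are assumptions made for the example rather than statements of the text: rectangles are `(x0, y0, x1, y1)` tuples, "contains" is taken to mean full containment of the other candidate's rectangle, candidates that already have a reading number are ignored in later passes, and a fallback picks the first pending candidate if none qualifies.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1)
Point = Tuple[int, int]


def inspection_area(origin: Point, r: Rect) -> Rect:
    """Smallest rectangle with the origin as a corner that contains r."""
    ox, oy = origin
    return (min(ox, r[0]), min(oy, r[1]), max(ox, r[2]), max(oy, r[3]))


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def reading_order(candidates: List[Rect], origin: Point) -> List[Rect]:
    """Return the candidates in reading order (sketch of FIG. 18)."""
    # Provisional numbering: upper-left Y ascending, then X ascending.
    pending = sorted(candidates, key=lambda r: (r[1], r[0]))
    ordered: List[Rect] = []
    while pending:
        chosen = 0  # fallback (assumption): first pending candidate
        for i, r in enumerate(pending):
            area = inspection_area(origin, r)
            blocked = any(k != i and contains(area, o)
                          for k, o in enumerate(pending))
            if not blocked:
                chosen = i
                break
        ordered.append(pending.pop(chosen))
    return ordered
```

For a horizontal line with the origin at its upper-left corner, a tall character at the right of the first row can sort first by Y coordinate, yet it receives its reading number only after the shorter characters to its left, which is the behaviour described for candidates (1) and (4).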

FIG. 19 is a diagram outlining the line generation processing.
The character candidates M_{L_j} are integrated one by one according to the reading order (the assigned reading numbers) determined by the processing shown in FIG. 18, to generate new lines L_{jk}. Here, as shown in FIG. 19, if the character candidate with reading number 5 were integrated to generate a line L_{jk} represented by the circumscribed rectangle 271 (shown by a dotted line), the character candidate with reading number 8, which comes after reading number 5, would be contained in the circumscribed rectangle 271. In this case, the character candidate with reading number 5 is not integrated, and the character candidates with reading numbers 1 to 4 are fixed as one line L_{j1}. In this way, the character candidates are integrated in reading order into new generated lines, under the condition that the integrated result contains no candidate that comes later in the reading order than the candidates being integrated. New lines L_{j2}, L_{j3}, and L_{j4} are generated in the same way, as shown in the figure (step S34).

Next, it is determined whether any of the lines L_{jk} newly generated in step S34 are arranged side by side in the direction perpendicular to the line L_j (step S35). If there are no such lines L_{jk}, j is incremented (step S36) and the processing from step S32 is repeated. If, as for the line L_j shown in FIG. 19, several lines L_{j2} and L_{j3} arranged in the perpendicular direction have been generated, the line L_j is judged unsuitable as a line, the extracted text block B_i containing it is judged unsuitable (step S37), and the processing ends.

On the other hand, when j = m in step S32, that is, when it has been determined for every line L_j constituting the text block B_i that the line does not contain multiple lines L_{jk} in the perpendicular direction, the text block B_i is judged suitable as a text block (step S38), and the processing ends.
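The line generation of step S34 and the check of step S35 can be sketched as follows, taking as input a list of character candidates that is already in reading order. The rectangle representation, the function names, and the criterion used for "arranged in the perpendicular direction" (two generated lines whose extents overlap along the line direction) are assumptions of this example.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def bbox(rects: Sequence[Rect]) -> Rect:
    """Circumscribed rectangle of a non-empty group of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def generate_lines(ordered: List[Rect]) -> List[List[Rect]]:
    """Merge candidates in reading order into generated lines (step S34)."""
    lines: List[List[Rect]] = []
    current: List[Rect] = []
    for k, cand in enumerate(ordered):
        trial = bbox(current + [cand])
        # Do not merge if the merged rectangle would swallow a candidate
        # that comes later in the reading order; close the line instead.
        if current and any(contains(trial, o) for o in ordered[k + 1:]):
            lines.append(current)
            current = [cand]
        else:
            current.append(cand)
    if current:
        lines.append(current)
    return lines


def has_stacked_lines(lines: List[List[Rect]], horizontal: bool = True) -> bool:
    """Step S35 (assumed criterion): do two generated lines overlap when
    projected onto the line direction, i.e. sit side by side across it?"""
    lo, hi = (0, 2) if horizontal else (1, 3)
    boxes = [bbox(line) for line in lines]
    return any(a[lo] < b[hi] and b[lo] < a[hi]
               for i, a in enumerate(boxes) for b in boxes[i + 1:])
```

A line L_j for which `has_stacked_lines` returns true would be judged unsuitable, and with it the text block B_i that contains it.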

Next, the second process (step S40) of the text block suitability verification processing shown in FIG. 14 is described in detail.
FIG. 20 is a flowchart showing details of the second process of the text block suitability verification processing.

As in the text block suitability verification processing of step S30 described above, let the line extraction result for the extracted text block B_i be {L_j : j = 1, 2, …, m}. First, the character components C_i contained in the text block B_i are obtained (step S41).

Next, the average character spacing is estimated from the character components C_i, which are the connected components contained in the text block B_i (step S42). The control parameters (n, x) for the maximal blank rectangles described above are then set from the average character spacing (step S43), and the set of C_i-maximal blank rectangles in B_i, M_{n,x}(B_i, C_i), is obtained (step S44).

Next, it is determined whether any maximal blank rectangle contained in M_{n,x}(B_i, C_i) intersects at least a predetermined number (th) of the lines {L_j : j = 1, 2, …, m} (step S45). If there is no such maximal blank rectangle, the extracted text block B_i is judged suitable (step S46); if there is one that intersects at least that number of lines L_j, the text block B_i is judged unsuitable (step S47). The text block suitability verification processing then ends.
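The decision of steps S45 to S47 reduces to counting, for each maximal blank rectangle, how many line rectangles it intersects. The sketch below assumes that the blank rectangles M_{n,x}(B_i, C_i) and the circumscribed rectangles of the lines L_j have already been computed and are given as `(x0, y0, x1, y1)` tuples; the function names are illustrative.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def block_is_suitable(blank_rects: List[Rect], line_rects: List[Rect],
                      th: int) -> bool:
    """Steps S45 to S47: the block is unsuitable if some blank rectangle
    intersects th or more of its lines."""
    for blank in blank_rects:
        crossed = sum(1 for line in line_rects if intersects(blank, line))
        if crossed >= th:
            return False
    return True
```

A vertical gutter that cuts through every line of a block, as in FIG. 16, makes the block unsuitable, whereas a wide gap inside a single line does not once th is greater than 1.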

The predetermined number of lines (th) is set, for example, according to the number of lines constituting the extracted text block B_i, for example to 10% of that number.
As described above, according to the present invention, text blocks can be extracted correctly even in cases that were difficult for the prior art: when text blocks are interleaved with one another, when text blocks and figures are interleaved, or when a text block is arranged such that it cannot be separated from other document elements by a rectangle. Text blocks can thereby be extracted with high accuracy from documents with complicated layout structures, such as magazines and advertisements.

In the above description, text elements are integrated to extract text blocks and lines. However, by extracting the closed regions generated on the basis of the extracted virtual separators as they are, document elements such as figures, tables, and photographs may be extracted in addition to text elements.

Then, as described above, it may be verified whether each text element among the extracted document elements is suitable or unsuitable as a line or text block; if it is unsuitable, the size of the blank area is changed by means of the control parameters n and x, the virtual separators are re-extracted for the text element judged unsuitable, and the processing for extracting the document elements is repeated recursively.

The processing functions described above can be realized by a computer (with a hardware configuration such as that shown in FIG. 6). In that case, a program is provided that describes the processing content of the functions that the document image layout analysis apparatus 100 should have. Executing the program on a computer realizes the above processing functions on the computer. The program describing the processing content can be recorded on a computer-readable recording medium. Computer-readable recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories. Magnetic recording devices include hard disk drives (HDD), flexible disks (FD), and magnetic tapes. Optical discs include DVD (Digital Versatile Disc), DVD-RAM, CD-ROM, and CD-R (Recordable) / RW (ReWritable). Magneto-optical recording media include MO (Magneto-Optical disk).

When the program is distributed, portable recording media such as DVDs and CD-ROMs on which the program is recorded are sold, for example. The program can also be stored in a storage device of a server computer and transferred from the server computer to other computers via a network.

The computer that executes the program stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. The computer then reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to it. Furthermore, each time a program is transferred from the server computer, the computer can execute processing according to the received program.

The present invention is applied, for example, when character recognition is performed on various document images, such as advertisements and articles, read by an optical device such as a scanner.

(Supplementary Note 1) A document image layout analysis program for causing a computer to execute processing for analyzing a document image layout, the program causing the computer to execute processing comprising:
extracting a blank area in a document image as a virtual separator; and
integrating a plurality of text elements and extracting them as an integrated text element, while prohibiting integration of text elements across the virtual separator.

(Supplementary Note 2) The document image layout analysis program according to Supplementary Note 1, further causing the computer to execute processing comprising:
verifying whether the extracted integrated text element is suitable or unsuitable as a line or text block; and
if it is unsuitable, changing the size of the blank area by means of a control parameter, re-extracting the virtual separator for the integrated text element judged unsuitable, and recursively repeating the processing of extracting a new integrated text element.

(Supplementary Note 3) The document image layout analysis program according to Supplementary Note 2, wherein:
for a line in the extracted integrated text element, connected components, or integrations of connected components, contained in the character string rectangle enclosing the line are subjected to character recognition, and those with high recognition reliability are taken as character candidates;
a rectangular inspection area is set that has one corner of the character string rectangle, taken as the origin, as one of its corners and that contains a character candidate;
the reading order of the character candidates is determined under the condition that the rectangular inspection area contains no character candidate that comes later in the reading order than that character candidate; and
the line is judged unsuitable as a line when a plurality of new generated lines, each generated by integrating the character candidates in reading order under the condition that the result contains no character candidate later in the reading order than the candidates being integrated, exist in the direction perpendicular to the line within the character string rectangle.

(Supplementary Note 4) The document image layout analysis program according to Supplementary Note 2, wherein:
for a line in the extracted integrated text element, connected components, or integrations of connected components, contained in the character string rectangle enclosing the line are subjected to character recognition, and those with high recognition reliability are taken as character candidates;
a rectangular inspection area is set that has one corner of the character string rectangle, taken as the origin, as one of its corners and that contains a character candidate;
the reading order of the character candidates is determined under the condition that the rectangular inspection area contains no character candidate that comes later in the reading order than that character candidate; and
a text block containing the line is judged unsuitable as a text block when a plurality of new generated lines, each generated by integrating the character candidates in reading order under the condition that the result contains no character candidate later in the reading order than the candidates being integrated, exist in the direction perpendicular to the line within the character string rectangle.

(Supplementary Note 5) The document image layout analysis program according to Supplementary Note 2, wherein the extracted integrated text element is judged unsuitable as a text block when, among the lines contained in the extracted integrated text element, at least a predetermined number of the lines intersect the same blank area that is larger than the average character spacing of the integrated text element estimated from the connected components within the integrated text element.

(Supplementary Note 6) A line extraction result verification program for causing a computer to execute processing for verifying a line extracted from a document image, the program causing the computer to execute processing comprising:
for the extracted line, subjecting connected components, or integrations of connected components, contained in the character string rectangle enclosing the line to character recognition, and taking those with high recognition reliability as character candidates;
setting a rectangular inspection area that has one corner of the character string rectangle, taken as the origin, as one of its corners and that contains a character candidate;
determining the reading order of the character candidates under the condition that the rectangular inspection area contains no character candidate that comes later in the reading order than that character candidate; and
judging the extracted line unsuitable as a line when a plurality of new generated lines, each generated by integrating the character candidates in reading order under the condition that the result contains no character candidate later in the reading order than the candidates being integrated, exist in the direction perpendicular to the line within the character string rectangle.

(Supplementary Note 7) A text block extraction result verification program for causing a computer to execute processing for verifying a text block extracted from a document image, the program causing the computer to execute processing comprising:
for a line contained in the extracted text block, subjecting connected components, or integrations of connected components, contained in the character string rectangle enclosing the line to character recognition, and taking those with high recognition reliability as character candidates;
setting a rectangular inspection area that has one corner of the character string rectangle, taken as the origin, as one of its corners and that contains a character candidate;
determining the reading order of the character candidates under the condition that the rectangular inspection area contains no character candidate that comes later in the reading order than that character candidate; and
judging a text block containing the line unsuitable as a text block when a plurality of new generated lines, each generated by integrating the character candidates in reading order under the condition that the result contains no character candidate later in the reading order than the candidates being integrated, exist in the direction perpendicular to the line within the character string rectangle.

(Supplementary Note 8) A text block extraction result verification program that causes a computer to verify a text block extracted from a document image, the program causing the computer to execute a process comprising:
determining that the extracted text block is non-conforming as a text block when, among the lines included in the extracted text block, at least a predetermined number of lines intersect the same blank area, the blank area being larger than the average character spacing of the text block as estimated from the connected components in the text block.
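
The check in Supplementary Note 8 can be illustrated with a minimal sketch. It assumes horizontal text, axis-aligned rectangles, and that "larger than the average character spacing" refers to the width of the blank area; the names `Rect`, `block_is_nonconforming`, `avg_char_spacing`, and `min_lines` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def intersects(self, other: "Rect") -> bool:
        # True when the two rectangles overlap with non-zero area.
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


def block_is_nonconforming(lines, blanks, avg_char_spacing, min_lines):
    """Return True if one blank area wider than the average character
    spacing is crossed by at least `min_lines` lines of the block."""
    for blank in blanks:
        # Assumption: only blanks wider than the estimated spacing matter.
        if blank.right - blank.left <= avg_char_spacing:
            continue
        crossed = sum(1 for line in lines if line.intersects(blank))
        if crossed >= min_lines:
            return True
    return False
```

A block whose lines all span the same wide gutter is the typical case this is meant to catch: two columns that were merged into one text block.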

(Supplementary Note 9) A document image layout analysis program that causes a computer to analyze a document image layout, the program causing the computer to execute a process comprising:
extracting a blank area in the document image as a virtual separator;
extracting document elements by extracting the closed regions generated on the basis of the virtual separator;
verifying whether each text element among the extracted document elements is conforming or non-conforming as a line or a text block; and
in the case of non-conformity, recursively repeating a process of changing the size of the blank area by means of a control parameter, re-extracting the virtual separator for the text element found non-conforming, and extracting the document elements.
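
The recursive control flow described in this note can be sketched on a toy one-dimensional model. Everything here is an illustrative assumption rather than the method of the specification: a region is a list of character x-positions, a blank run of at least `gap` acts as a virtual separator, the control parameter is changed by halving `gap`, and the `depth` limit is added only to guarantee termination. The names `analyze`, `split_at_gaps`, and `no_wide_gap` are hypothetical.

```python
def analyze(region, gap, extract_elements, is_conforming, shrink, depth=8):
    """Split `region` at virtual separators and recursively re-split
    any element that fails verification, using a changed parameter."""
    results = []
    for element in extract_elements(region, gap):
        if is_conforming(element) or depth == 0:
            results.append(element)  # accepted, or depth limit reached
        else:
            # Re-extract separators inside the rejected element only.
            results.extend(analyze(element, shrink(gap), extract_elements,
                                   is_conforming, shrink, depth - 1))
    return results


def split_at_gaps(xs, gap):
    """Toy extractor: a blank run of at least `gap` is a separator."""
    runs, current = [], [xs[0]]
    for prev, x in zip(xs, xs[1:]):
        if x - prev >= gap:
            runs.append(current)
            current = []
        current.append(x)
    runs.append(current)
    return runs


def no_wide_gap(xs, limit=3.0):
    """Toy verifier: an element conforms if no internal gap exceeds `limit`."""
    return all(b - a <= limit for a, b in zip(xs, xs[1:]))


blocks = analyze([0, 1, 2, 6, 7, 8, 20, 21], 10.0,
                 split_at_gaps, no_wide_gap, lambda g: g / 2)
```

With these inputs the first pass separates only the widest blank; the element that still contains an internal gap is rejected and split again once the parameter has been reduced far enough.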

(Supplementary Note 10) A document image layout analysis apparatus that analyzes a document image layout, the apparatus comprising:
virtual separator extraction means for extracting a blank area in a document image as a virtual separator; and
integrated text element extraction means for integrating a plurality of text elements and extracting them as an integrated text element, while prohibiting the integration of text elements across the virtual separator.

(Supplementary Note 11) The document image layout analysis apparatus according to Supplementary Note 10, further comprising verification means for verifying whether the extracted integrated text element is conforming or non-conforming as a line or a text block,
wherein, in the case of non-conformity, the virtual separator extraction means changes the size of the blank area by means of a control parameter and re-extracts the virtual separator for the integrated text element found non-conforming, and the integrated text element extraction means extracts a new integrated text element, this processing being repeated recursively.

A diagram showing the principle of the document image layout analysis program of the present invention.
A diagram (part 1) outlining the document image layout analysis processing performed by the document image layout analysis program of the present invention.
A diagram (part 2) outlining the document image layout analysis processing performed by the document image layout analysis program of the present invention.
A diagram (part 3) outlining the document image layout analysis processing performed by the document image layout analysis program of the present invention.
A diagram (part 4) outlining the document image layout analysis processing performed by the document image layout analysis program of the present invention.
An example hardware configuration of a document image layout analysis apparatus to which the document image layout analysis program is applied.
A diagram outlining the overall document image layout analysis processing.
A flowchart showing an example of the flow of the connected component attribute assignment processing.
A diagram showing a specific example of the labeling processing.
A diagram showing a specific example of the hierarchization processing.
A diagram showing a frame that contains character components.
A flowchart showing an example of the flow of the recursive text block extraction processing.
A diagram showing an example of a maximal blank rectangle.
A diagram outlining the processing for verifying conformity as a text block.
A diagram showing an example of the lines that make up a text block.
A diagram showing the blank areas of the text block of FIG. 15.
A flowchart detailing the first process of the text block conformity verification processing.
A diagram outlining the processing that assigns the reading order.
A diagram outlining the line generation processing.
A flowchart detailing the second process of the text block conformity verification processing.
A diagram showing an example of over-integration: the text block extraction result.
A diagram showing an example of over-integration: the line extraction result.
An example of a layout in which a text block area is placed in a region surrounded by multiple figures.

Explanation of symbols

10 Computer
20 Document image
21a, 21b, 21c, 21d, 21e, 23a, 23b, 23c, 23d, 23e Virtual separator
22a, 22b, 22c, 22d, 22e, 24a, 24b, 24c, 24d, 24e, 24f Text block

Claims

A document image layout analysis program that causes a computer to analyze a document image layout, the program causing the computer to execute a process comprising:
extracting a blank area in a document image as a virtual separator; and
integrating a plurality of text elements and extracting them as an integrated text element, while prohibiting the integration of text elements across the virtual separator.
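
The integration step of this claim can be illustrated with a minimal sketch. It assumes that text elements and virtual separators are axis-aligned rectangles, that two elements are merge candidates when the gap between them is at most `max_gap`, and that a merge "crosses" a separator when the merged bounding box overlaps it. The names `Box`, `integrate`, and `max_gap` are hypothetical, and the greedy merging loop is one simple strategy, not necessarily the one used in the specification.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def intersects(self, o: "Box") -> bool:
        return (self.left < o.right and o.left < self.right
                and self.top < o.bottom and o.top < self.bottom)

    def union(self, o: "Box") -> "Box":
        return Box(min(self.left, o.left), min(self.top, o.top),
                   max(self.right, o.right), max(self.bottom, o.bottom))

    def gap_to(self, o: "Box") -> float:
        # Largest axis-wise separation; 0 when the boxes touch or overlap.
        h = max(self.left, o.left) - min(self.right, o.right)
        v = max(self.top, o.top) - min(self.bottom, o.bottom)
        return max(h, v, 0.0)


def integrate(elements, separators, max_gap):
    """Greedily merge nearby elements, refusing any merge whose
    bounding box would overlap a virtual separator."""
    groups = list(elements)
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            hull = groups[i].union(groups[j])
            near = groups[i].gap_to(groups[j]) <= max_gap
            blocked = any(hull.intersects(s) for s in separators)
            if near and not blocked:
                groups[i] = hull
                del groups[j]
                merged = True
                break  # restart: indices changed after the merge
    return sorted(groups, key=lambda b: (b.top, b.left))
```

The loop is cubic in the number of elements, which is acceptable for illustration; a practical implementation would restrict the candidate pairs to spatial neighbors.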

The document image layout analysis program according to claim 1, further causing the computer to execute a process comprising:
verifying whether the extracted integrated text element is conforming or non-conforming as a line or a text block; and
in the case of non-conformity, recursively repeating a process of changing the size of the blank area by means of a control parameter, re-extracting the virtual separator for the integrated text element found non-conforming, and extracting a new integrated text element.

The document image layout analysis program according to claim 2, wherein:
for a line of the extracted integrated text element, the connected components, or combinations of connected components, that lie within the character string rectangle surrounding the line and are recognized with high reliability are taken as character candidates;
with one corner of the character string rectangle as the origin, a rectangular inspection area is set that has the origin as one of its corners and contains the character candidate;
the reading order of the character candidates is determined under the condition that the rectangular inspection area contains no character candidate that comes later in the reading order; and
the line is determined to be non-conforming as a line when a plurality of newly generated lines, produced by integrating the character candidates one by one in the reading order under the condition that no character candidate later in the reading order than the one being integrated is included, exist in the direction perpendicular to the line.

The document image layout analysis program according to claim 2, wherein:
for a line of the extracted integrated text element, the connected components, or combinations of connected components, that lie within the character string rectangle surrounding the line and are recognized with high reliability are taken as character candidates;
with one corner of the character string rectangle as the origin, a rectangular inspection area is set that has the origin as one of its corners and contains the character candidate;
the reading order of the character candidates is determined under the condition that the rectangular inspection area contains no character candidate that comes later in the reading order; and
the text block containing the line is determined to be non-conforming as a text block when a plurality of newly generated lines, produced by integrating the character candidates one by one in the reading order under the condition that no character candidate later in the reading order than the one being integrated is included, exist in the direction perpendicular to the line.

The document image layout analysis program according to claim 2, wherein the extracted integrated text element is determined to be non-conforming as a text block when, among the lines included in the extracted integrated text element, at least a predetermined number of lines intersect the same blank area, the blank area being larger than the average character spacing of the integrated text element as estimated from the connected components in the integrated text element.