JPH0944594A

JPH0944594A - Dividing method for area of document image and discriminating method for kind of multiple column

Info

Publication number: JPH0944594A
Application number: JP7194399A
Authority: JP
Inventors: Takashi Saito; 高志齋藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1995-07-31
Filing date: 1995-07-31
Publication date: 1997-02-14
Anticipated expiration: 2015-07-31
Also published as: JP3607753B2

Abstract

PROBLEM TO BE SOLVED: To divide a document image into areas by making good use of the column layout when the columns are distinct, while still coping with irregular column layouts. SOLUTION: After the skew of the input document image (101) is corrected (102), a compressed image is generated (103). Small areas are extracted (104) and classified into line-direction character string candidates and other classes (106). Column-dividing blank parts are extracted from the character string candidate small areas by using connected components of long white runs (107). The column type of the input document (single column, multiple columns, or free columns) is determined (108) and blank parts are selected according to that type (109); the small areas are then integrated (110) to extract the text areas.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention: The present invention relates to a method for dividing a document image into areas and a method for determining the column type of a document image.

[0002]

2. Description of the Related Art: Methods for dividing a document image into areas fall roughly into two groups. The first integrates small elements such as characters to obtain coherent text areas (columns and the like) (Japanese Patent Application No. 3-128340).

[0003] The second detects the edges of areas, or the blank parts that separate areas, and divides the image accordingly. For example, the document image processing apparatus described in Japanese Patent Laid-Open No. 1-183783 obtains column edges from the marginal distribution of the start positions of character strings, assigns the same attribute (column number) to character strings belonging to the same column, and integrates vertically adjacent character strings that share an attribute, thereby dividing and extracting character areas from the image. The character recognition apparatus described in Japanese Patent Laid-Open No. 5-166001 determines area division points from marginal distributions in the horizontal and vertical directions.

[0004]

Problems to be Solved by the Invention: The former method is not constrained by the shape of the areas, so it can divide areas even when there are no clear columns. However, its performance depends on the integration parameters, so it may fail to divide the image even when clear columns exist, or may divide excessively when the character spacing is wide.

[0005] In the latter method, if part of a blank part that separates areas cannot be detected, the division may fail as a whole. Division also fails when the image is skewed, when the column layout is irregular, or when the text areas are not rectangular.

[0006] An object of the present invention is to provide a document image area dividing method and a column type determining method which, when dividing a document image into areas, make use of the column layout when it is clear and can still cope when the column layout is irregular.

[0007]

Means for Solving the Problems: To achieve the above object, the invention of claim 1 is a method of dividing a document image into areas, characterized by extracting from the document image a plurality of small areas containing character strings; detecting blank parts or ruled lines from the plurality of small areas; determining, on the basis of the detected blank parts or ruled lines, a column type from among types including single-column, multi-column, and free-column; integrating the small areas using the blank parts in accordance with the column type; and dividing the document image into predetermined areas.

[0008] The invention of claim 2 is characterized in that the column type of the document image is determined on the basis of the number and positions of the blank parts or ruled lines.

[0009] The invention of claim 3 is characterized by detecting blank parts or ruled lines that divide the small areas in the direction parallel to the character strings, dividing the image parallel to the character strings by those blank parts or ruled lines, obtaining the number or type of columns for each resulting image part, and combining the results to determine the column type of the whole image.

[0010] The invention of claim 4 is characterized in that the detected blank parts are selected or discarded according to the column type.

[0011] The invention of claim 5 is characterized in that the blank parts are detected by using together a method that detects connected components of long white runs as blank parts and a method that detects blank parts from a projection histogram of character elements.

[0012] The invention of claim 6 is characterized in that the conditions for integrating the small areas are changed according to the column type.

[0013] The invention of claim 7 is characterized in that, when the skew of the document image is equal to or greater than a predetermined threshold, the column type is set to free-column.

[0014] The invention of claim 8 is characterized in that the column type includes a column type specified in advance.

[0015] The invention of claim 9 is a method of determining the column type of a document image, characterized by extracting from the document image a plurality of small areas containing character strings; detecting blank parts or ruled lines from the plurality of small areas; and determining, on the basis of the number and positions of the blank parts or ruled lines, the column type of the document image from among types including single-column, multi-column, and free-column.

[0016]

Embodiments: An embodiment of the present invention will be described below with reference to the drawings. <Embodiment 1> FIG. 1 shows the configuration of Embodiment 1 of the present invention. In the figure, 101 is an image input means such as a scanner; 102 is an image skew correction means; 103 is a means for generating a compressed image of the input image; 104 is a means for extracting black-pixel connected components (small areas) from the compressed image; 105 is a line direction detection means; 106 is a small area classification means; 107 is a means for extracting column-dividing blank candidates; 108 is a means for determining the column type from the extracted candidates; 109 is a means for selecting among the candidates according to the determined column type; 110 is a means for integrating the small areas, using the column dividing lines (column-dividing blank parts and solid dividing lines), to obtain large coherent text areas (columns and the like); 111 is a data storage unit that stores parameters and various intermediate data; 112 is a control unit that controls each means; and 113 is a data communication path. Means 102 to 110 and 112 can be implemented in software on a single processor.

[0017] FIG. 2 is a processing flowchart of Embodiment 1. First, a document image is input using the image input means 101 (step 201). Next, the skew correction means 102 corrects the skew of the input image (step 202). For this correction, for example, the method described in Japanese Patent Laid-Open No. 5-35914 is used. If the input image is known in advance to have no skew, this correction is unnecessary and the skew correction means 102 may be omitted.

[0018] The image compression means 103 generates a compressed image (step 203). For example, if the input image has a resolution of about 400 DPI, it is reduced to 1/8 both vertically and horizontally (that is, if an 8 × 8 pixel block contains even one black pixel, the compressed pixel is black). As a result, adjacent characters normally fuse, while line gaps and gaps between areas do not. The black-pixel connected components of the compressed image are characters, character strings formed by fused characters, line segments, tables, figures (or parts of figures), and so on. The small area extraction means 104 extracts these connected components as small areas (step 204).
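The block-wise reduction described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `compress_or` and the list-of-rows image representation (1 = black, 0 = white) are assumptions made for the example.

```python
def compress_or(image, factor=8):
    """Shrink a binary image by `factor` in both directions.

    `image` is a list of rows of 0/1 values (1 = black). An output pixel is
    black if any pixel in the corresponding factor x factor block is black.
    Blocks at the right and bottom edges may be smaller than factor x factor.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block_has_black = any(
                image[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            )
            row.append(1 if block_has_black else 0)
        out.append(row)
    return out
```

A single black pixel is enough to turn its whole block black, which is what makes neighbouring characters fuse while wider gaps survive.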

[0019] Next, the line direction detection means 105 detects the direction of the text lines (step 205). For example, the method described in Japanese Patent Laid-Open No. 5-314309 is used. If the line direction is specified in advance, it need not be detected, and the line direction detection means 105 may then be omitted; in that case, however, a means for specifying the line direction is required. The line direction does not have to be obtained at this particular point. It may be obtained before the compressed image is generated or immediately afterwards. However, the small area classification that follows requires the line direction, so the direction must be known before that step.

[0020] Once the line direction is known, the small area classification means 106 sets the coordinate axes with the character string direction as horizontal and classifies the small areas into character string candidates, figures, field separators (solid lines), and so on (step 206). The height of each small area, its black pixel density, and similar features are used for the classification. Because the degree to which pixels fuse under compression depends on the inter-character distance, a character string candidate small area may consist of a single character.

[0021] The embodiment is described below with the text line direction taken as horizontal. The column-dividing blank candidate detection means 107 extracts column-dividing blank candidates from the character string candidate small areas (step 207). The details are described with reference to FIGS. 3 to 13 and FIG. 22. FIG. 3 is a flowchart of the candidate extraction. First, the image is divided into several bands along the horizontal (text line) direction. To do so, horizontal dividing lines are extracted (step 301). For example, given an arrangement of small areas (those classified as character strings) such as that shown in FIG. 4, connected components of horizontally long white runs are obtained. The result is shown in FIG. 5, where 501 and 502 denote such connected components (although not illustrated, long white runs like 501 are also obtained in the other interline gaps). Among these connected components, those with sufficient height (that is, greater than a predetermined threshold) are extracted. These are the horizontal dividing blank parts. In the example of FIG. 5, connected component 501 is lower than the threshold and connected component 502 is higher than it. The height threshold may be a predetermined fixed value, or a ratio of the height of the tallest long-white-run connected component (502 in FIG. 5). Alternatively, it may be tied to the average height of the small areas corresponding to character strings.
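As a rough illustration of step 301, the sketch below finds horizontal blank bands in a binary image and keeps only the tall ones, using the "ratio of the tallest component" variant of the threshold. It simplifies the text in one respect, which is an assumption of this example: rows containing a long white run are grouped when they are vertically consecutive, without checking that the runs overlap horizontally. The names `longest_white_run`, `horizontal_blanks`, `min_run`, and `height_ratio` are hypothetical.

```python
def longest_white_run(row):
    """Length of the longest run of white (0) pixels in one image row."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel == 0 else 0
        best = max(best, current)
    return best


def horizontal_blanks(image, min_run, height_ratio=0.5):
    """Return (top, bottom) row ranges of tall blank bands, bottom exclusive.

    A row qualifies when it contains a white run of at least `min_run`
    pixels. Consecutive qualifying rows form one band (simplified connected
    component). Bands shorter than `height_ratio` times the tallest band
    are discarded, so ordinary interline gaps drop out.
    """
    bands = []
    start = None
    for y, row in enumerate(image):
        if longest_white_run(row) >= min_run:
            if start is None:
                start = y
        elif start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(image)))
    if not bands:
        return []
    tallest = max(bottom - top for top, bottom in bands)
    return [(t, b) for t, b in bands if (b - t) >= height_ratio * tallest]
```

With a one-row interline gap and a three-row gap between blocks, only the three-row gap is returned, mirroring the roles of components 501 and 502 in FIG. 5.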

[0022] If a horizontal solid-line field separator was found in step 206, it is treated in the same way as a horizontal dividing blank part.

[0023] The horizontal dividing lines or blank parts obtained above divide the whole image into several horizontal band regions (step 302). In the example of FIG. 6, the whole is divided by the horizontal dividing blank part 601 into band regions 602 and 603. Column-dividing blank candidates are then extracted for each band region, and the number of columns is determined for each band region (steps 303, 304).

[0024] FIG. 7 is a detailed flowchart of step 304 in FIG. 3. First, vertically long white runs are generated in the same manner as described for FIG. 5 (step 701), and their connected components are obtained (step 702). This is explained with reference to FIG. 8. In band region 801, long white runs 802 and 803 are generated and their connected components are obtained. Among the connected components, only the wide ones are extracted. Whether a component is wide enough may be decided by a predetermined threshold, or by its ratio to the widest of the detected long-white-run connected components. The width threshold may also be varied according to the degree of separation of the small areas corresponding to character strings.

[0025] The degree of separation of the small areas expresses how large the inter-character distance is; it judges whether the character spacing is wide or narrow from how far the characters fuse under image compression. If many of the small areas corresponding to character strings are horizontally long, many characters have fused, so the degree of separation is low and the character spacing can be judged narrow. In that case long white runs are unlikely to appear, so even narrow long-white-run connected components are extracted. Conversely, if there are few horizontally long small areas, the degree of separation is high and long white runs appear easily. In that case only wide long-white-run connected components are extracted.
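One way to turn the degree of separation into a width threshold is sketched below. The text gives no numeric values, so the aspect ratio, the share of wide areas, and the two threshold widths are all placeholder assumptions, as are the function names.

```python
def wide_fraction(boxes, aspect=3.0):
    """Fraction of small areas that are horizontally long.

    `boxes` is a list of (width, height) pairs for character string
    candidates. `aspect` is a hypothetical cut-off: a box counts as
    horizontally long when width >= aspect * height.
    """
    if not boxes:
        return 0.0
    wide = sum(1 for width, height in boxes if width >= aspect * height)
    return wide / len(boxes)


def gap_width_threshold(boxes, narrow=2, wide=4, share=0.5):
    """Pick the minimum width for a vertical white-run component.

    Many wide boxes mean characters fused (low separation), so long white
    runs are rare and narrow components are accepted. Few wide boxes mean
    high separation, so only wide components are accepted.
    """
    return narrow if wide_fraction(boxes) >= share else wide
```

The same `wide_fraction` value could also serve as the separation test used later when the column count cannot be decided from the position flags.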

[0026] The number of columns is provisionally obtained from the long-white-run connected components extracted as above (step 703). In FIG. 9, the extracted components are 901 and 902. The procedure detects which column count's division positions these components occupy. That is, for two columns (two text areas), a connected component must exist at, or near, position 903. Likewise, for three columns components should exist at positions 904, and for four columns at positions 905.

[0027] Each extracted connected component is therefore checked to see which column count's division position it occupies. The flowchart is shown in FIG. 10. First, all flags are set to OFF (step 1001). Next, an unchecked long-white-run connected component is selected for processing; if none remains, the process ends (steps 1002, 1003). If the component lies near the point 1/2 of the way from the left of the band region, the 1/2-point flag is set to ON and processing moves to the next component (steps 1004, 1005).

[0028] Similarly, the component is checked against the 1/3, 2/3, 1/4, and 3/4 points, and the corresponding flag is set to ON if the component lies near one of them (steps 1006 to 1013). If it matches none of these positions, the irregular flag is set to ON (step 1014). The flags may also be set using solid-line field separators in addition to long-white-run connected components. In that case vertical solid-line separators are also processed in steps 1002 and 1003.
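The flag-setting loop of FIG. 10 can be sketched as below. The text only says "near" each point, so the tolerance is an assumed parameter; a value below about 0.041 keeps the 1/4 and 1/3 points (and 2/3 and 3/4) from overlapping. The names `POSITIONS` and `position_flags` are illustrative.

```python
# Division points checked in the same order as the flowchart: 1/2 first,
# then 1/3, 2/3, 1/4, 3/4.
POSITIONS = {"1/2": 1 / 2, "1/3": 1 / 3, "2/3": 2 / 3, "1/4": 1 / 4, "3/4": 3 / 4}


def position_flags(gap_centers, band_left, band_width, tolerance=0.04):
    """Set one flag per division point that some gap lies near.

    `gap_centers` are x coordinates of vertical gaps (or solid separators)
    inside a band that starts at `band_left` and is `band_width` wide.
    A gap matching no division point sets the "irregular" flag.
    """
    flags = {name: False for name in POSITIONS}
    flags["irregular"] = False
    for x in gap_centers:
        relative = (x - band_left) / band_width
        for name, target in POSITIONS.items():
            if abs(relative - target) <= tolerance:
                flags[name] = True
                break
        else:
            flags["irregular"] = True
    return flags
```

For a 300-pixel-wide band with gaps at x = 100 and x = 201, only the 1/3 and 2/3 flags come out ON.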

[0029] The above processing determines where in the band region each extracted long-white-run connected component (and solid-line field separator) lies. Next, the number of columns of the band region is obtained from these position flags, as described with the flowchart of FIG. 12 (FIGS. 10 and 12 together correspond to step 703 in FIG. 7).

[0030] If only the 1/2-point flag is ON, the number of columns is 2 (steps 1201, 1202). If only the 1/3-point and 2/3-point flags are ON, the number of columns is 3 (steps 1203, 1204). If only the 1/4-point and 3/4-point flags are ON together with the 1/2-point flag, the number of columns is 4 (steps 1205, 1206). In all other cases the decision is made from the degree of separation described above (judged by the proportion or absolute number of horizontally long character-equivalent small areas) (step 1207). If the degree of separation is high and the character spacing is expected to be wide, the document is often one produced on a word processor, so the number of columns is set to 1 (step 1208). Otherwise the number of columns is undefined (step 1209). Other methods can be used for this decision; for example, the number of columns may always be treated as undefined when the irregular flag is ON.
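The decision of FIG. 12 reduces to a few set comparisons. In this sketch "only" is read as "no other flag, including the irregular flag, is ON", which is an interpretation rather than something the text states outright; `None` stands for an undefined column count, and the function name is hypothetical.

```python
def columns_from_run_flags(flags, separation_high):
    """Provisional column count from long-white-run position flags.

    `flags` maps flag names ("1/2", "1/3", "2/3", "1/4", "3/4",
    "irregular") to booleans. Returns 1, 2, 3, 4, or None (undefined).
    """
    on = {name for name, value in flags.items() if value}
    if on == {"1/2"}:
        return 2
    if on == {"1/3", "2/3"}:
        return 3
    if on == {"1/2", "1/4", "3/4"}:
        return 4
    # Any other pattern: fall back on the degree of separation.
    return 1 if separation_high else None
```

Exact-set matching makes this stage strict: a stray gap at an unexpected position pushes the band to the fallback branch instead of forcing a column count.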

[0031] Returning to FIG. 7, once the number of columns has been detected by the above processing, the connected components that become column-dividing blank candidates are selected (step 704). If the column count is undefined or 1, the number of candidates is set to 0. Otherwise (2 to 4 columns), all the wide connected components used in the column count detection become candidates. Because a connected component has width, placing the dividing blank at its center line, or at the position of the longest run among the runs making up the component, simplifies the small area integration in step 210. In this way, steps 701 to 704 extract column-dividing blank candidates using long white runs.

[0032] Next, the extraction of column-dividing blank candidates using a marginal distribution histogram, steps 705 to 709, is described. First, a marginal distribution histogram is created that counts the small areas in each class of a certain width (step 705). FIG. 11 shows such a histogram 1101 for the small areas. Next, line heads are detected (step 706). This is done by detecting, in histogram 1101, a class whose frequency exceeds that of the adjacent class by at least a threshold. In FIG. 11, 1102 and 1103 are line heads. The threshold may be a fixed value, or may be normalized by the number of small areas in the band region or by the total frequency of the histogram. Because the frequency distribution does not always form clear valleys as in FIG. 11, the comparison may be made with the class two positions away instead of the adjacent class. If the classes to the left of a detected point of large frequency difference (1102, 1103) have low frequency for at least a threshold number of consecutive classes, that point becomes a blank candidate (step 706).

[0033] For example, in FIG. 11, to the left of 1102 the frequency is low for two consecutive classes, 1104 and 1105. Similarly, to the left of 1103, classes 1106 and 1107 have low frequency. The blank candidates are therefore 1105-1104 and 1107-1106.
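The line-head test and the check for a low-frequency stretch to its left can be sketched as follows, operating on a histogram that is already built. The parameter names and default values (`rise`, `low`, `min_low_classes`) are assumptions for this example; the text only says that thresholds are used.

```python
def blank_candidates(hist, rise=3, low=1, min_low_classes=2):
    """Find blank candidates to the left of line heads in a histogram.

    `hist[i]` is the number of small areas falling in class i, left to
    right. Class i is a line head when hist[i] exceeds its left neighbour
    by at least `rise`. The stretch of classes immediately to its left
    with frequency <= `low` is reported as (first, last) class indices,
    provided it spans at least `min_low_classes` classes.
    """
    result = []
    for i in range(1, len(hist)):
        if hist[i] - hist[i - 1] < rise:
            continue  # no sharp rise here, so not a line head
        j = i - 1
        while j >= 0 and hist[j] <= low:
            j -= 1
        run_length = (i - 1) - j
        if run_length >= min_low_classes:
            result.append((j + 1, i - 1))
    return result
```

A stretch touching the left edge of the band is also reported by this sketch; such a margin would be dropped later, when only candidates near the expected division positions are kept.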

[0034] Next, the number of columns is determined using the detected blank candidates (step 708). First, the division position flags are set as in FIG. 10, except that the irregular flag is not used here. Once the flags are set, the number of columns is decided. FIG. 13 is a detailed flowchart of step 708 in FIG. 7.

[0035] First, if the 1/4, 1/2, and 3/4 flags are ON, the number of columns is set to 4 (steps 1301, 1302). Unlike step 703, the 1/3 and 2/3 flags may also be ON.

[0036] If the 1/3 and 2/3 flags are ON, the number of columns is set to 3 (steps 1303, 1304). If the 1/2 flag is ON, the number of columns is set to 2 (steps 1305, 1306). If none of the above applies, the number of columns is undefined when the number of blank candidates is at or above a threshold, and is set to 1 otherwise (steps 1307, 1308, 1309).
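The histogram-based decision of FIG. 13 differs from the long-white-run one in that extra flags are tolerated and the tests are tried in the order 4, 3, 2. A minimal sketch, with the candidate-count threshold `many` as an assumed value and `None` standing for undefined:

```python
def columns_from_histogram_flags(flags, candidate_count, many=3):
    """Column count from position flags set by histogram blank candidates.

    `flags` maps "1/2", "1/3", "2/3", "1/4", "3/4" to booleans; missing
    keys count as OFF. Other flags being ON does not block a match.
    """
    if flags.get("1/4") and flags.get("1/2") and flags.get("3/4"):
        return 4
    if flags.get("1/3") and flags.get("2/3"):
        return 3
    if flags.get("1/2"):
        return 2
    # No pattern matched: many stray candidates suggest an irregular layout.
    return None if candidate_count >= many else 1
```

Because the four-column test comes first, a band whose candidates also hit the 1/3 point is still reported as four columns.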

[0037] Since the number of columns is detected by the above processing, dividing blank candidates are selected as in step 704. Of the blank candidates used in the column count detection of step 708, only those lying near the division positions for the detected column count become column-dividing blank candidates (step 709). If the column count is undefined or 1, no candidate is extracted (this applies to step 709 only; candidates may still have been extracted in step 704).
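Step 709 can be sketched as a filter over candidate positions. The tolerance value and the names below are assumptions for illustration; `None` again stands for an undefined column count.

```python
def keep_candidates(candidates, columns, band_left, band_width, tolerance=0.04):
    """Keep only candidates near the division positions for `columns`.

    `candidates` are x coordinates of blank candidates inside a band that
    starts at `band_left` and is `band_width` wide. For n columns the
    expected positions are k/n of the band width, k = 1 .. n-1. With an
    undefined column count (None) or a single column, nothing is kept.
    """
    if columns is None or columns <= 1:
        return []
    targets = [k / columns for k in range(1, columns)]
    kept = []
    for x in candidates:
        relative = (x - band_left) / band_width
        if any(abs(relative - t) <= tolerance for t in targets):
            kept.append(x)
    return kept
```

For three columns in a 300-pixel band, candidates at 100 and 200 survive while one at 150 is dropped.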

[0038] Next, the column counts obtained in steps 703 and 708 are checked for consistency and the column count of the band region is decided (step 710), following the decision rules shown in FIG. 22. Once the count is decided, only the column-dividing blank candidates consistent with it are kept (step 711). For example, blank candidates that remained because step 703 judged three columns are removed if the consistency check of step 710 concludes that the column count is undefined. If step 703 gives 2 columns and step 708 gives 4, the count is judged to be 4 as shown in FIG. 22, so both the candidates obtained from long runs and those obtained from the marginal distribution are adopted. This completes the extraction of column-dividing blank candidates in step 207.

[0039] Returning to FIG. 2, the column type determination means 108 determines the column type (step 208). This classifies the input document into one of three types, single-column, multi-column, or free-column, from the column counts obtained for each band region. Taking FIG. 14 as an example, 1401 to 1403 are band regions and 1404 to 1406 are the extracted column-dividing blank candidates. Here the column counts are 1 for 1401, 3 for 1402, and 2 for 1403. The column type of the whole document is determined according to FIG. 15.

[0040] First, the height H of the whole document is measured (step 1501). H may instead be the sum of the heights of the band regions. The latter choice prevents the ratios to the band-region height totals computed below from being distorted when ruled lines, figures, or tables leave gaps between band regions. Next, the total height (Total1) of the band regions judged to have 2 to 4 columns (called multi-column band regions) and the total height (Total2) of the band regions whose column count is undefined are obtained (steps 1502, 1503).

[0041] Once the totals have been obtained, they are compared with thresholds in order. First, if Total1/H is larger than threshold 1, the input document is determined to be multi-column (steps 1504 and 1505). Otherwise, if Total1/H is larger than threshold 2 (< threshold 1), the input document is determined to be free column (steps 1506 and 1507). Otherwise, if (Total1 + Total2)/H is larger than threshold 3, the input document is determined to be free column (steps 1508 and 1509). If none of these conditions holds, the input document is determined to be one column (step 1510).
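The threshold cascade of steps 1501 to 1510 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Band` type, the function name, and the three threshold values are assumptions, since the text gives no numeric thresholds beyond requiring threshold 2 < threshold 1.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Band:
    """One band area (hypothetical type): its height and estimated column count."""
    height: float
    columns: Optional[int]  # None means the column count was judged indeterminate


def classify_column_type(
    bands: Sequence[Band],
    t1: float = 0.5,   # threshold 1 (assumed value)
    t2: float = 0.2,   # threshold 2, must be < threshold 1 (assumed value)
    t3: float = 0.5,   # threshold 3 (assumed value)
    total_height: Optional[float] = None,
) -> str:
    """Return "multi", "free", or "single" following the cascade of FIG. 15."""
    # Step 1501: H is the document height, or the sum of band heights if not given.
    h = total_height if total_height is not None else sum(b.height for b in bands)
    if h <= 0:
        raise ValueError("document height must be positive")
    # Steps 1502-1503: heights of multi-column bands and of indeterminate bands.
    total1 = sum(b.height for b in bands
                 if b.columns is not None and 2 <= b.columns <= 4)
    total2 = sum(b.height for b in bands if b.columns is None)
    if total1 / h > t1:                # steps 1504-1505
        return "multi"
    if total1 / h > t2:                # steps 1506-1507
        return "free"
    if (total1 + total2) / h > t3:     # steps 1508-1509
        return "free"
    return "single"                    # step 1510


# Column counts from the FIG. 14 example (1, 3, 2); the heights are made up.
fig14 = [Band(100, 1), Band(400, 3), Band(300, 2)]
print(classify_column_type(fig14))  # multi
```

With these made-up heights, Total1/H = 700/800, which exceeds the assumed threshold 1, so the document is classified as multi-column.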

[0042] Subsequently, the column division blank portion selecting means 109 selects among the column division blank portion candidates (step 209). The selection follows the document column type determined in step 208. For a one-column document, the candidates are discarded. For a multi-column or free-column document, the candidates are used as column division blank portions as they are.
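The selection rule of step 209 is small enough to state directly in code. The sketch below assumes the column type is represented by the strings "single", "multi", and "free"; that encoding and the function name are illustrative choices, not taken from the source.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_blank_portions(column_type: str, candidates: Sequence[T]) -> List[T]:
    """Step 209: keep or discard column division blank portion candidates."""
    if column_type == "single":
        # A one-column document has no column separators, so drop all candidates.
        return []
    if column_type in ("multi", "free"):
        # Multi-column and free-column documents use the candidates unchanged.
        return list(candidates)
    raise ValueError(f"unknown column type: {column_type!r}")
```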

[0043] The small area integrating means 110 uses these column division blank portions to integrate the small areas corresponding to character strings and extracts coherent text areas (corresponding to columns or parts of columns) (step 210). The small areas are integrated by first merging areas that are close to each other in the row direction into rows (or parts of rows), and then merging those rows (or parts of rows) in the direction perpendicular to the row direction to form areas (for example, by the method described in Japanese Patent Application No. 3-128340).

[0044] When the small areas are integrated in the row direction, the column division blank portions obtained in step 209 are used in the same way as actual field separators (dividing lines). Small areas are not integrated across a dividing line or a column division blank portion. Alternatively, the integration condition is tightened near a column division blank portion so that only areas lying closer together are integrated. The integration parameter is also varied according to the column type. That is, for a one-column document, areas are integrated even if they are far apart, and for a free-column document, only nearby areas are integrated. For a multi-column document, either areas are integrated even if far apart, as in the one-column case, provided they do not cross a dividing line or a column division blank portion, or areas are integrated up to a distance equivalent to the column width, again provided they do not cross a dividing line or a column division blank portion.
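The row-direction merging rule can be illustrated with a small predicate. This is a sketch under stated assumptions: horizontal writing (so the row direction is the x axis), axis-aligned bounding boxes, separators modeled as vertical segments, and a free-column gap limit of one character size. None of these specifics, nor the names `Box`, `Separator`, `gap_limit`, and `can_merge`, come from the source.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """Bounding box of a small area (hypothetical representation)."""
    x0: float
    y0: float
    x1: float
    y1: float


class Separator(NamedTuple):
    """A dividing line or column division blank portion, as a vertical segment."""
    x: float
    y0: float
    y1: float


def gap_limit(column_type: str, char_size: float,
              column_width: Optional[float] = None) -> float:
    """Largest horizontal gap that may be bridged, varied by column type."""
    if column_type == "single":
        return math.inf                  # merge even when far apart
    if column_type == "free":
        return char_size                 # merge only close neighbours (assumed limit)
    # Multi-column: up to the column width if known, otherwise unlimited;
    # in both cases separators still block merging.
    return column_width if column_width is not None else math.inf


def can_merge(a: Box, b: Box, separators: Sequence[Separator],
              limit: float) -> bool:
    """True if a and b lie on the same row, are close enough, and no
    separator lies between them."""
    left, right = (a, b) if a.x0 <= b.x0 else (b, a)
    top, bottom = max(a.y0, b.y0), min(a.y1, b.y1)
    if bottom <= top:                    # no vertical overlap: different rows
        return False
    if max(0.0, right.x0 - left.x1) > limit:
        return False
    for s in separators:
        between = left.x1 <= s.x <= right.x0
        spans_row = s.y0 < bottom and s.y1 > top
        if between and spans_row:        # never merge across a separator
            return False
    return True
```

Applying `can_merge` repeatedly to neighbouring boxes would build rows; merging the rows vertically into areas is a separate pass that this sketch does not cover.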

[0045] <Second Embodiment> FIG. 16 shows the configuration of the second embodiment, and FIG. 17 is its processing flowchart. In this embodiment, the inclination correcting means 102 of the first embodiment is replaced with inclination detecting means 1602, and the rest of the configuration is the same as in the first embodiment. In step 1702, the inclination of the image is detected. When the inclination angle is larger than a threshold, blank portions are difficult to detect, so the blank portion detection of step 1708 and subsequent steps is skipped (step 1707) and the document is processed as free column (step 1712). The remainder is the same as in the first embodiment, and its description is omitted. Note that 1602 to 1610 and 1612 can be realized in software on a single processor.
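The fallback of the second embodiment amounts to a guard placed in front of the column type determination. In the sketch below, the function name, the threshold of 1 degree, and the use of a callable for the normal determination are assumptions made for illustration; the source gives no threshold value.

```python
from typing import Callable


def column_type_with_inclination_guard(
    inclination_deg: float,
    classify: Callable[[], str],
    max_inclination_deg: float = 1.0,   # assumed threshold
) -> str:
    """Treat a strongly inclined page as free column without detecting blanks."""
    if abs(inclination_deg) > max_inclination_deg:
        # Steps 1707 and 1712: skip blank portion detection entirely.
        return "free"
    # Otherwise run the normal blank detection and column type determination.
    return classify()
```

Passing the determination as a callable means the blank portion detection is never executed for inclined pages, which matches the skip described for step 1707.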

[0046] <Third Embodiment> FIG. 18 shows the configuration of the third embodiment, and FIG. 19 is its processing flowchart. This embodiment adds column type designating means 1814 to the configuration of the first embodiment. When the column type designated by the column type designating means 1814 is one column, the column division blank portion detection and the column type determination of step 1909 and subsequent steps are not performed (step 1908).

[0047] When the column type designated by the column type designating means 1814 is not one column, the determination of FIG. 15 is performed in step 1910, except that a document that would otherwise be judged one column is judged free column instead. The remainder is the same as in the first embodiment. Note that 1802 to 1810 and 1812 can be realized in software on a single processor.
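The effect of the column type designation in the third embodiment can be sketched as a thin wrapper around the normal determination. The string values "single" and "non-single" for the designation, the use of `None` for no designation, and the function name are illustrative assumptions.

```python
from typing import Callable, Optional


def apply_designation(designation: Optional[str],
                      classify: Callable[[], str]) -> str:
    """Combine a user-designated column type with the automatic determination."""
    if designation == "single":
        # Step 1908: skip blank portion detection and column type determination.
        return "single"
    result = classify()
    if designation == "non-single" and result == "single":
        # Step 1910: what would be judged one column is judged free column.
        return "free"
    return result
```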

[0048] <Fourth Embodiment> FIG. 20 shows the configuration of the fourth embodiment, and FIG. 21 is its processing flowchart. The fourth embodiment is obtained by removing the column division blank portion selecting means 109 and the small area integrating means 110 from the configuration of the first embodiment, and is an embodiment that determines the column type of a document image. Accordingly, in the fourth embodiment, processing ends once the column type has been determined in step 2108. The remainder is the same as in the first embodiment. Note that 2002 to 2008 and 2010 can be realized in software on a single processor.

【００４９】[0049]

[Effects of the Invention] As described above, according to the invention of claim 1, area division uses only column dividing lines that are highly consistent with the column layout, so highly accurate area division can be performed.

[0050] According to the invention of claim 2, the column type can be determined accurately, so highly accurate area division can be performed.

[0051] According to the invention of claim 3, the column type can be determined accurately even when the number of columns differs from one part of the image to another, so highly accurate area division can be performed.

[0052] According to the invention of claim 4, unreliable area division blank portions need not be used, so highly accurate area division can be performed.

[0053] According to the invention of claim 5, column division blank portions can be detected accurately, so highly accurate area division can be performed.

[0054] According to the invention of claim 6, applying processing suited to the column type makes a one-column layout less likely to be split by mistake and, for a multi-column layout, leaves fewer areas undivided, so highly accurate area division can be performed.

[0055] According to the invention of claim 7, a certain level of area division performance can be maintained even when the input image is inclined and blank portion detection is impaired.

[0056] According to the invention of claim 8, narrowing down the possible column types allows the column type to be determined more accurately, so highly accurate area division can be performed.

[0057] According to the invention of claim 9, the column type can be determined accurately.

[Brief description of drawings]

FIG. 1 shows the configuration of the first embodiment of the present invention.

FIG. 2 is a processing flowchart of the first embodiment.

FIG. 3 is a detailed flowchart of step 207 of FIG. 2.

FIG. 4 is a diagram showing an example of classified small areas.

FIG. 5 is a diagram showing the extracted horizontal dividing blank portions.

FIG. 6 is a diagram in which the entire image is divided into band areas by the horizontal dividing blank portions.

FIG. 7 is a detailed flowchart of step 304 of FIG. 3.

FIG. 8 is a diagram showing extraction of vertically long white-run connected components from a band area.

FIG. 9 is a diagram illustrating the positions of long white-run connected components in a band area.

FIG. 10 is part of the detailed flowchart of step 703 of FIG. 7.

FIG. 11 shows an example of a projection histogram of small areas.

FIG. 12 is part of the detailed flowchart of step 703 of FIG. 7.

FIG. 13 is a detailed flowchart of step 708 of FIG. 7.

FIG. 14 is a diagram illustrating the number of columns for each band area.

FIG. 15 is a detailed flowchart of step 208 of FIG. 2.

FIG. 16 shows the configuration of the second embodiment of the present invention.

FIG. 17 is a processing flowchart of the second embodiment.

FIG. 18 shows the configuration of the third embodiment of the present invention.

FIG. 19 is a processing flowchart of the third embodiment.

FIG. 20 shows the configuration of the fourth embodiment of the present invention.

FIG. 21 is a processing flowchart of the fourth embodiment.

FIG. 22 shows the rules for determining the number of columns.

[Explanation of symbols]

101 Image input means
102 Inclination correcting means
103 Image compressing means
104 Small area extracting means
105 Row direction detecting means
106 Small area classifying means
107 Column division blank portion candidate extracting means
108 Column type discriminating means
109 Column division blank portion selecting means
110 Small area integrating means
111 Data storage unit
112 Control unit
113 Data communication path

Claims

[Claims]

1. A method of dividing a document image into areas, wherein a plurality of small areas including character strings are extracted from the document image, blank portions or ruled lines are detected from the plurality of small areas, the column type, which is one of one column, a plurality of columns, and a free column, is discriminated based on the detected blank portions or ruled lines, and the small areas are integrated using the blank portions according to the column type, whereby the document image is divided into predetermined areas.

2. The method of dividing a document image according to claim 1, wherein the column type of the document image is determined based on the number and position of the blank portions or ruled lines.

3. The method of dividing an area of a document image according to claim 1, wherein a blank portion or ruled line that divides the small areas in a direction parallel to the character strings is detected, the image is divided parallel to the character strings by the blank portion or ruled line, the number of columns or the column type is determined for each divided image portion, and the results are integrated to determine the column type of the entire image.

4. The document image area dividing method according to claim 1, wherein the detected blank portion is selected according to the column type.

5. The method of dividing an area of a document image according to claim 1, wherein the blank portions are detected by using a method of detecting connected components of long white runs as blank portions together with a method of detecting blank portions from a projection histogram of character elements.

6. The method for dividing an area of a document image according to claim 1, wherein the integration condition of the small areas is changed according to the type of the column.

7. The method of dividing an area of a document image according to claim 1, wherein when the inclination of the document image is equal to or larger than a predetermined threshold value, the column type is a free column.

8. The method of dividing an area of a document image according to claim 1, wherein the column type includes a column type designated in advance.

9. A method of discriminating a column type of a document image, comprising: extracting a plurality of small areas including character strings from the document image; detecting blank portions or ruled lines from the plurality of small areas; and discriminating the column type of the document image, which is one of one column, a plurality of columns, and a free column, based on the number and positions of the blank portions or ruled lines.