JP2000067158A

JP2000067158A - Document image processing method and machine-readable recording medium where program allowing computer to implement document image processing method is recorded

Info

Publication number: JP2000067158A
Application number: JP10246519A
Authority: JP
Inventors: Takashi Saito; 高志齋藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-08-18
Filing date: 1998-08-18
Publication date: 2000-03-03
Anticipated expiration: 2018-08-18
Also published as: JP3850995B2

Abstract

PROBLEM TO BE SOLVED: To extract character regions correctly from a document image even when characters belonging to different character regions are close to each other. SOLUTION: A document image processing method extracts character regions from a document image and, when character recognition or layout information acquisition is performed on those regions, corrects over-integration of the extracted character regions. The method includes a step S201 of inputting the document image; a step S202 of reducing the input document image, extracting the circumscribed rectangles of connected components of the black pixels that constitute characters, and extracting basic elements; a step S203 of classifying the basic elements into characters, tables, figures, and others, integrating the character elements to generate lines, and integrating the lines to extract character regions; a step S204 of extracting column information from the character regions; and a step S205 of correcting over-integrated character regions with reference to the positions of the extracted columns.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention is used in document image recognition systems and the like that recognize an input document image. In particular, it relates to a document image processing method that corrects over-integration of extracted character regions when character regions are extracted from a document image, and to a machine-readable recording medium storing a program that causes a computer to execute the document image processing method.

[0002]

[Prior art] Conventionally, to perform character recognition on an input document image or to obtain layout information from it, region extraction must first be performed to identify character regions, figure regions, and other regions.

[0003] A reference describing this kind of region extraction is "Printed Document Recognition System AutoReco/2 -Image Process-" (Information Processing Society of Japan, 47th National Convention (second half of 1993), 2-97 to 2-98). In that work, column boundary lines are detected and character string elements are integrated into regions (hereinafter referred to as the first prior art).

[0004] A technique that detects the column layout of an entire page and extracts character regions accordingly is disclosed in Japanese Patent Laid-Open No. 9-44594, "Method of dividing a document image into regions and method of determining column type" (hereinafter referred to as the second prior art).

[0005]

[Problems to be solved by the invention] However, the prior art described above has the following problems. In the first prior art, over-integration can occur when a figure caption lies close to the body text, as happens in newspapers. FIG. 7 shows a concrete example. Because caption 704 in FIG. 7 is close to the body text, the caption and the body text are erroneously extracted as a single region, as shown by 703.

[0006] In the method disclosed in the second prior art, when character elements are dense, as in a newspaper, no clear column structure is detected at the initial stage, so character regions end up being extracted by integrating neighboring character elements. As a result, over-integration occurs when elements are close together, as shown in FIG. 7.

[0007] In the method disclosed in the first prior art as well, adjacent character elements are grouped at the initial stage, so the edges of a column can no longer be detected. If the method is made to detect column edges even in such cases, many pseudo-edges are detected in addition to the true edges, and character region extraction as a whole becomes cumbersome.

[0008] The present invention has been made in view of the above. Its object is to extract character regions correctly from a document image even when characters belonging to different character regions are close to each other.

[0009]

[Means for solving the problems] To achieve the above object, the document image processing method according to claim 1 extracts character regions from a document image and corrects over-integration of the extracted character regions when character recognition, layout information acquisition, or the like is performed on them. The method includes: an image input step of inputting a document image; a basic element extraction step of reducing the document image input in the image input step, extracting the circumscribed rectangles of connected components of the black pixels that constitute characters, and extracting basic elements; a character region extraction step of classifying the basic elements into characters, tables, figures, and others, integrating the character elements to generate lines, and integrating the lines to extract character regions; a column extraction step of extracting column information from the character regions extracted in the character region extraction step; and a region correction step of correcting over-integrated character regions with reference to the positions of the columns extracted in the column extraction step.

[0010] In the document image processing method according to claim 2, which builds on the method of claim 1, the region correction step detects a character region in which over-integration is suspected from the positions of the columns extracted in the column extraction step, re-integrates the characters or character string elements in that character region into character regions while referring to the column positions, evaluates the consistency of the lines with respect to a preset reference division position, and corrects the character region.

[0011] In the document image processing method according to claim 3, which builds on the method of claim 2, the region correction step, when re-integrating into character regions, first generates the character region that lies outside the reference division position and is formed in the direction perpendicular to the reference line direction, and then evaluates that character region.

[0012] The machine-readable recording medium according to claim 4 stores a program that causes a computer to execute the document image processing method according to any one of claims 1 to 3.

[0013]

[Embodiments of the invention] The document image processing method of the present invention, and the machine-readable recording medium storing a program that causes a computer to execute it, are described in detail below with reference to the accompanying drawings.

[0014] (System configuration) FIG. 1 is a block diagram showing the configuration of a system to which the document image processing method according to the embodiment of the present invention is applied. In the figure, 101 is an image input unit that optically reads a manuscript containing a document as a binary image, using a CCD-equipped scanner, a facsimile machine, or the like, and inputs it. As another concrete form of the image input unit 101, a document image may be acquired from another device over a network (through the data communication unit 110 described later).

[0015] When an image is acquired over a network, a browser function prepared in advance (here, a WWW (World Wide Web) browser function) is used, for example, to browse wide-area data resources distributed across the Internet. Such resources include data written in a language such as HTML (HyperText Markup Language, the document description language for the WWW, an Internet information service that uses hypertext) and identified by a URL (Uniform Resource Locator), which is one form of access information on a WWW server.

[0016] In addition, 102 is an image reduction unit that reduces the image input by the image input unit 101; 103 is a basic element extraction unit that extracts character elements from the image reduced by the image reduction unit 102; 104 is a character region extraction unit that extracts character regions; 105 is a column detection unit that detects columns from the character regions and other data extracted by the character region extraction unit 104; 106 is a character region correction unit that executes the character region correction processing described later; 107 is a control unit that controls the entire system under a predetermined control program; 108 is a data storage unit that stores various data such as input document data and layout information; 109 is a data communication path; and 110 is a data communication unit configured to perform network communication in accordance with, for example, the TCP (Transmission Control Protocol)/IP (Internet Protocol) protocols.

[0017] (System operation) Next, the document image processing performed by the system configured as above is described. FIG. 2 is a flowchart showing the procedure of the document image processing method according to the embodiment of the present invention. First, the target document image is input through the image input unit 101 (S201). The input document image is then reduced by the image reduction unit 102 (S202).

[0018] In this reduction, for example, OR-compressing each 8×8 pixel block into a single pixel leaves the black pixels that make up characters connected along the character string direction. The circumscribed rectangle of each such connected component is extracted and used as a basic element. A basic element may be a single character or part of a character string, while something like a table forms one large basic element by itself.
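The reduction and basic element extraction described in [0018] can be sketched as follows. This is a minimal illustration, not the implementation used in the patent: the function names `or_reduce` and `bounding_boxes` are hypothetical, the image is assumed to be a list of rows of 0/1 values, and 8-connectivity is an assumption, since the text does not state which connectivity is used.

```python
from collections import deque


def or_reduce(image, block=8):
    """Reduce a binary image by OR-ing each block x block tile into one pixel."""
    h, w = len(image), len(image[0])
    rh, rw = (h + block - 1) // block, (w + block - 1) // block
    reduced = [[0] * rw for _ in range(rh)]
    for y in range(h):
        for x in range(w):
            if image[y][x]:
                # Any black pixel in a tile makes the reduced pixel black.
                reduced[y // block][x // block] = 1
    return reduced


def bounding_boxes(image):
    """Return (left, top, right, bottom) for each connected component of 1-pixels.

    8-connectivity is assumed here; the source does not specify it.
    """
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not image[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            left, top, right, bottom = x0, y0, x0, y0
            while queue:  # breadth-first flood fill of one component
                x, y = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

Calling `bounding_boxes(or_reduce(image))` yields rectangles in reduced-image coordinates. Characters that sit within a block or two of each other merge into one rectangle, which is the effect the paragraph relies on.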

[0019] Next, the basic elements are classified into characters, tables, figures, and others; the character elements are integrated to generate lines, and the lines are integrated to extract regions (S203). This processing flow was previously proposed by the present inventor in Japanese Patent Laid-Open No. 9-44594.

[0020] Next, column information is extracted (S204). Columns are explained with reference to FIGS. 3 and 4. The length of an extracted character region along the line direction is called the "region width", and its length in the direction perpendicular to the region width is called the "region height". A histogram of region width is created for the regions whose region height is at or above a preset threshold.

[0021] In the case of FIG. 3, regions 301 to 304 are the character regions whose height is at or above the threshold. FIG. 4 shows the resulting histogram. If a class in the histogram of FIG. 4 accounts for a high proportion of the total frequency, that class is taken as the "column width". In FIG. 4, this is the class of widths from 90 to 120, which corresponds to regions 301 to 303 in FIG. 3.

[0022] The range of values that the column width may take can also be restricted. In the example of FIG. 4, if the allowed range is 80 to 170, then even if the class from 30 to 60 has a high frequency, it is not adopted as the column width, and the column width is ultimately left undetermined. The method of Japanese Patent Laid-Open No. 9-44594 mentioned above detects a consistent column layout from the distribution of character elements, and the column information detected in that way may be used instead.
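The column width estimation of [0020] to [0022] can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `estimate_column_width`, the class size of 30 (chosen to match the 30 to 60 and 90 to 120 classes mentioned for FIG. 4), and the `min_ratio` value that stands in for "a high proportion of the total frequency" are all hypothetical, since the description gives no concrete numbers for them.

```python
from collections import Counter


def estimate_column_width(regions, min_height, bin_size=30,
                          min_ratio=0.5, allowed=None):
    """Estimate the column width class from (region_width, region_height) pairs.

    Returns the (low, high) bounds of the dominant width class, or None when
    the column width is left undetermined. bin_size and min_ratio are
    illustrative assumptions, not values taken from the source.
    """
    # Only regions at or above the height threshold enter the histogram.
    widths = [w for w, h in regions if h >= min_height]
    if not widths:
        return None
    histogram = Counter(w // bin_size for w in widths)
    bin_index, count = histogram.most_common(1)[0]
    # The dominant class must hold a high share of the total frequency.
    if count / len(widths) < min_ratio:
        return None
    low, high = bin_index * bin_size, (bin_index + 1) * bin_size
    # Optional restriction on the range the column width may take.
    if allowed is not None and not (allowed[0] <= low and high <= allowed[1]):
        return None
    return low, high
```

With `allowed=(80, 170)`, a dominant class of 90 to 120 is accepted, while a dominant class of 30 to 60 returns `None`, mirroring the "undetermined" outcome in the text.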

[0023] Once the column information has been extracted, the character regions are corrected using that column information (S205). The details of this character region correction processing are shown in the flowchart of FIG. 5.

[0024] In FIG. 5, a region that matches the estimated column width is detected first (S501). In the example of FIG. 7, this is region 702. Next, candidate regions that should belong to the same column as the region detected in step S501, that is, correction candidates, are detected (S502). A candidate region is one whose top or bottom edge is aligned and whose width does not differ much. When merging with a caption is assumed, the presence of a figure region nearby is an additional condition. In the example of FIG. 7, this is region 703.
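Steps S501 and S502 can be sketched as follows. Several things here are assumptions rather than statements from the source: the body text is assumed to be set vertically (as in a Japanese newspaper), so the "region width" along the line direction is the vertical extent of a box; edge alignment is assumed to be measured against the region found in S501; and the names `Box`, `is_near`, `find_correction_candidates`, and all tolerance values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int

    @property
    def region_width(self) -> int:
        # Extent along the line direction; vertical because body text is
        # assumed to be set vertically (an assumption of this sketch).
        return self.bottom - self.top


def is_near(a: Box, b: Box, gap: int) -> bool:
    """True if the gap between boxes a and b is at most `gap` on both axes."""
    dx = max(a.left - b.right, b.left - a.right, 0)
    dy = max(a.top - b.bottom, b.top - a.bottom, 0)
    return dx <= gap and dy <= gap


def find_correction_candidates(text_regions, figures, column_width,
                               edge_tol=5, width_tol=40, figure_gap=20):
    """Return (anchors, candidates) for S501 and S502.

    All tolerance values are illustrative; the source gives no numbers.
    """
    low, high = column_width
    # S501: regions whose width matches the estimated column width.
    anchors = [r for r in text_regions if low <= r.region_width <= high]
    candidates = []
    for r in text_regions:
        if r in anchors:
            continue
        # Caption merging is assumed, so a figure region must be nearby.
        if not any(is_near(r, f, figure_gap) for f in figures):
            continue
        for a in anchors:
            edge_aligned = (abs(r.top - a.top) <= edge_tol
                            or abs(r.bottom - a.bottom) <= edge_tol)
            similar_width = abs(r.region_width - a.region_width) <= width_tol
            if edge_aligned and similar_width:
                candidates.append(r)  # S502: correction candidate
                break
    return anchors, candidates
```

Treating the figure condition as mandatory reflects only the caption case described in the text; a more general version would make it optional.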

[0025] Once a candidate region has been detected as described above, step S503 is executed. That is, character regions are extracted again using the character elements that make up the candidate region. At this point, a forced division position along the column (see FIG. 6) is assumed, and outside that position, lines and regions are generated in the direction perpendicular to the reference line direction. FIG. 6 is an enlargement of the relevant part of FIG. 7. Lines 602 and 603 are generated from the elements 601 that lie outside the column relative to the forced division position. The consistency of the generated lines is then calculated (S503).
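The forced division and re-generation of lines in [0025] can be sketched as below. The sketch assumes vertically set body text, so the forced division position is a horizontal line at `split_y` and the lines generated outside it run horizontally (perpendicular to the reference line direction). The names `split_at` and `build_horizontal_lines`, the use of element centers to decide the side, and the overlap-based grouping rule are all assumptions made for illustration.

```python
from collections import namedtuple

Box = namedtuple("Box", "left top right bottom")


def split_at(elements, split_y):
    """Partition elements by a horizontal forced division line at y = split_y.

    An element belongs to the side on which its vertical center falls
    (an assumption; the source does not define the rule).
    """
    inside = [e for e in elements if (e.top + e.bottom) / 2 < split_y]
    outside = [e for e in elements if (e.top + e.bottom) / 2 >= split_y]
    return inside, outside


def build_horizontal_lines(elements):
    """Group boxes whose vertical spans overlap into left-to-right lines."""
    lines = []
    for e in sorted(elements, key=lambda b: b.top):
        for line in lines:
            line_top = min(b.top for b in line)
            line_bottom = max(b.bottom for b in line)
            if e.top < line_bottom and e.bottom > line_top:
                line.append(e)  # vertical overlap: same line
                break
        else:
            lines.append([e])  # no overlap with any line: start a new one
    return [sorted(line, key=lambda b: b.left) for line in lines]
```

For example, with three body-text elements above `split_y` and five caption elements below it, `build_horizontal_lines(outside)` produces the caption lines (corresponding to 602 and 603 in FIG. 6), while the `inside` elements are kept for the later regeneration of the body-text region.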

[0026] If the consistency is at or above a preset threshold, this region is accepted as final, and the other character region is generated from the remaining character elements, which lie on the column side of the forced division position (S504). As the line consistency, the slope of the regression line through the centers of gravity of the character elements, the variance of the distances between elements, and similar measures can be used. When multiple lines are generated outside the column, as shown in FIG. 6, a value inversely proportional to the difference in the heights of those lines (the vertical direction in FIG. 6) may be used as the consistency.
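The three consistency measures named in [0026] can be sketched as follows, for lines that run horizontally. The function names are hypothetical, and two details are assumptions of this sketch: the text does not say how the measures are turned into a single score or compared with the threshold, and `height_consistency` adds 1 to the denominator to avoid division by zero when the lines have equal height, which is a choice made here rather than something the source specifies.

```python
from collections import namedtuple
from statistics import pvariance

Box = namedtuple("Box", "left top right bottom")


def centroid(b):
    return ((b.left + b.right) / 2, (b.top + b.bottom) / 2)


def centroid_slope(line):
    """Least-squares slope of centroid y against centroid x.

    Near 0 for a level horizontal line. Returns infinity when all centroids
    share the same x, where the slope is undefined.
    """
    pts = [centroid(b) for b in line]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        return float("inf")
    return sum((x - mean_x) * (y - mean_y) for x, y in pts) / sxx


def gap_variance(line):
    """Population variance of the horizontal gaps between adjacent elements."""
    ordered = sorted(line, key=lambda b: b.left)
    gaps = [b.left - a.right for a, b in zip(ordered, ordered[1:])]
    return pvariance(gaps) if gaps else 0.0


def line_height(line):
    return max(b.bottom for b in line) - min(b.top for b in line)


def height_consistency(lines):
    """Value that falls as the height difference between lines grows.

    The +1 in the denominator is an assumption to keep the value finite.
    """
    heights = [line_height(l) for l in lines]
    return 1.0 / (1.0 + max(heights) - min(heights))
```

A small slope magnitude and a small gap variance both indicate a well-formed line, so a scoring function built on them would have to invert them before comparing against a "consistency at or above threshold" test.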

[0027] With the processing described above, regions can therefore be extracted accurately even when a figure caption and the body text are close to each other, as shown in FIG. 7.

[0028] In the embodiment described above, the document image processing is executed by the system shown in FIG. 1. Alternatively, the document image processing method may be recorded as software on a machine-readable recording medium and executed on a computer.

[0029]

[Effects of the invention] As described above, the document image processing method according to the present invention (claims 1, 2, and 3) evaluates and corrects over-integrated character regions with reference to the positions of the columns extracted in the column extraction step. Therefore, even for a document image in which characters belonging to different character regions are close to each other, over-integration of character regions can be avoided and the character regions can be extracted as they should be.

[0030] The machine-readable recording medium according to the present invention (claim 4) stores a program that causes a computer to execute the document image processing method according to any one of claims 1 to 3, so the operations described in any one of claims 1 to 3 can be realized by a computer.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a system to which the document image processing method according to the embodiment of the present invention is applied.

FIG. 2 is a flowchart showing the procedure of the document image processing method according to the embodiment of the present invention.

FIG. 3 is an explanatory diagram showing an example of column information extraction according to the embodiment of the present invention.

FIG. 4 is an example of the region width histogram created and used when extracting column information according to the embodiment of the present invention.

FIG. 5 is a flowchart showing the detailed procedure of the region correction processing in FIG. 2.

FIG. 6 is an explanatory diagram showing, as an enlargement of part of FIG. 7, an example of forced region re-extraction and evaluation according to the embodiment of the present invention.

FIG. 7 is an explanatory diagram showing a case where a figure caption and the body text are close to each other.

[Explanation of symbols]

101 Image input unit
102 Image reduction unit
103 Basic element extraction unit
104 Character region extraction unit
105 Column detection unit
106 Character region correction unit
107 Control unit
108 Data storage unit
110 Data communication unit

Claims

[Claims]

1. A document image processing method that extracts character regions from a document image and corrects over-integration of the extracted character regions when character recognition, layout information acquisition, or the like is performed on the character regions, the method comprising: an image input step of inputting a document image; a basic element extraction step of reducing the document image input in the image input step, extracting the circumscribed rectangles of connected components of the black pixels that constitute characters, and extracting basic elements; a character region extraction step of classifying the basic elements into characters, tables, figures, and others, integrating the character elements to generate lines, and integrating the lines to extract character regions; a column extraction step of extracting column information from the character regions extracted in the character region extraction step; and a region correction step of correcting over-integrated character regions with reference to the positions of the columns extracted in the column extraction step.

2. The document image processing method according to claim 1, wherein the region correction step detects, from the positions of the columns extracted in the column extraction step, a character region in which over-integration is suspected, re-integrates the characters or character string elements in that character region into character regions with reference to the column positions, evaluates the consistency of the lines with respect to a preset reference division position, and corrects the character region.

3. The document image processing method according to claim 2, wherein the region correction step, when re-integrating into character regions, first generates the character region that lies outside the reference division position and is formed in the direction perpendicular to the reference line direction, and then evaluates that character region.

4. A machine-readable recording medium storing a program for causing a computer to execute the document image processing method according to any one of claims 1 to 3.