JP3850995B2

JP3850995B2 - Document image processing method and machine-readable recording medium storing a program for causing a computer to execute the document image processing method

Info

Publication number: JP3850995B2
Application number: JP24651998A
Authority: JP
Inventors: 高志齋藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-08-18
Filing date: 1998-08-18
Publication date: 2006-11-29
Anticipated expiration: 2018-08-18
Also published as: JP2000067158A

Description

【０００１】
【発明の属する技術分野】
本発明は，入力された文書画像を認識する文書画像認識システムなどに利用され，特に，文書画像から文字領域を抽出する際に，抽出した文字領域の過剰統合を修正する文書画像処理方法および文書画像処理方法をコンピュータに実行させるプログラムを記録した機械読み取り可能な記録媒体に関する。
【０００２】
【従来の技術】
従来，文書画像を入力し，その文書画像から文字認識処理を行ったり，各種レアウト情報を取得するには，文字領域や図形領域・その他の領域などの領域抽出処理を行う必要がある。
【０００３】
このような，領域抽出処理を行う参考技術文献として，『印刷文書認識システムＡｕｔｏＲｅｃｏ／２ −イメージプロセス−』（情報処理学会第４７回（平成５年後期）全国大会２−９７〜９８）が開示されている。ここでは，特に，コラム境界線を検出し，文字列要素を領域へ統合している（これを第１の従来技術という）。
【０００４】
また，ページ全体の段組状態を検出し，それに応じて文字領域を抽出する技術が特開平９−４４５９４号公報の『文書画像の領域分割方法および段組種類判別方法』に開示されている（これを第２の従来技術という）。
【０００５】
【発明が解決しようとする課題】
しかしながら，上記に示されるような従来の技術にあっては，以下のような問題点があった。まず，第１の従来技術にあっては，新聞などで図のキャプションと本文とが近接している場合には過剰統合が生じてしまうことがある。その具体例を図７に示す。図７においてキャプション７０４が本文と近いため，誤って７０３のようなかたちで１つの領域として抽出されてしまう。
【０００６】
また，第２の従来技術に開示されている方法では，新聞などように文字要素が密な場合，初期の段階では明確な段組みが検出されず，近傍文字要素を統合することで文字領域を抽出することになってしまう。つまり，図７に示すように近接した場合に過剰統合が生じる。
【０００７】
また，第１の従来技術に開示されている方法でも，近接した文字要素は初期段階でグループ化されるため，段のエッジ部分が検出されなくなってしまう。また，このような場合でも段のエッジ部分を検出するようにすると，本来のエッジ以外に多数の擬似エッジが検出され，文字領域の抽出が全体として煩雑になってしまう。
【０００８】
本発明は，上記に鑑みてなされたものであって，文書画像から文字領域を抽出する際に，異なる文字領域に属する文字同士が近接していても，正しく文字領域を抽出することを目的とする。
【０００９】
【課題を解決するための手段】
上記の目的を達成するために，請求項１に係る文書画像処理方法にあっては，文書画像における文字要素を統合し行を生成し，前記生成した行を統合することで文字領域を抽出し，前記文字領域を抽出するための行の統合を修正する文書画像処理方法において，前記文字領域から段組の位置情報を抽出する段組情報抽出工程と，前記文字領域の領域幅の出現頻度に基づいて段幅を推定する段幅推定工程と，前記推定した段幅に対応する文字領域における前記段組の位置情報と前記推定した段幅に基づいて修正候補領域を抽出する工程と，前記修正候補領域を前記対応する文字領域内の段の位置に対応する強制分割位置を用いて分割する領域修正工程と，を含むものである。
【００１０】
また，請求項２に係る文書画像処理方法にあっては，請求項１の文書画像処理方法において，前記領域修正工程は，前記段組情報抽出工程で抽出された段の位置情報から推定した段幅に合致する領域と同じ段に所属すべき候補領域を検出し，該領域内の文字あるいは文字列要素を，段の位置を参照しながら再度文字領域へ統合する際に，前記段に沿った強制分割位置を仮定し，この分割位置より外側では前記抽出した段幅に合致する領域と同じ段に所属すべき候補領域の方向と垂直な方向によって構成される文字領域を最初に生成し，その後に当該文字領域の評価を行うものである。
【００１２】
また，請求項３に係る機械読み取り可能な記録媒体にあっては，前記請求項１または２に記載の文書画像処理方法をコンピュータに実行させるプログラムを記録したものである。
【００１３】
【発明の実施の形態】
以下，本発明の文書画像処理方法および文書画像処理方法をコンピュータに実行させるプログラムを記録した機械読み取り可能な記録媒体について添付図面を参照し，詳細に説明する。
【００１４】
（システムの構成）
図１は，本発明の実施の形態に係る文書画像処理方法を適用したシステムの構成を示すブロック図である。図において，１０１はＣＣＤ搭載のスキャナあるいはファクシミリ装置などを用い，文書を含む原稿を光学的に２値画像として読み取り，入力する画像入力部である。また，画像入力部１０１の具体的な手段として，ネットワーク経由（後述するデータ通信部１１０による）で別の機器から文書画像を取得するようにしてもよい。
【００１５】
ネットワーク経由で画像を取り込む場合は，例えば，予め用意されたブラウザ機能（ＷＷＷ（ｗｏｒｌｄｗｉｄｅｗｅｂ）ブラウザ機能とする）を用いてインターネット上に分散する広域のデータ資源（ＷＷＷサーバ上のアクセス情報の一つであるＵＲＬ（ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ）で特定されるＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａｎｇｕａｇｅ：ＷＷＷ（ｗｏｒｌｄｗｉｄｅｗｅｂ：ハイパーテキストを使用したインターネットの情報サービス）用の文書記述言語）などの言語によるデータをブラウズする。
【００１６】
また，１０２は画像入力部１０１によって入力された画像を縮小する画像縮小部，１０３は画像縮小部１０２で縮小された画像から文字要素を抽出する基本要素抽出部，１０４は文字領域を抽出する文字領域抽出部，１０５は文字領域抽出部１０４によって抽出された文字領域などから段組を検出する段組検出部，１０６は後述する文字領域修正処理を実行する文字領域修正部，１０７は本システム全体を所定の制御プログラムに基づいて統括的に制御する制御部，１０８は入力された文書データやレイアウト情報などの各種データを記憶しておくためのデータ記憶部，１０９はデータ通信路，１１０は例えば，ＴＣＰ（ＴｒａｎｓｍｉｓｓｉｏｎＣｏｎｔｒｏｌＰｒｏｔｏｃｏｌ）／ＩＰ（ＩｎｔｅｒｎｅｔＰｒｏｔｏｃｏｌ）プロトコルに従ってネットワーク通信を行うように構成されたデータ通信部である。
【００１７】
（システムの動作）
次に，以上のように構成されたシステムにおける文書画像処理動作について説明する。図２は，本発明の実施の形態に係る文書画像処理方法の手順を示すフローチャートである。まず，画像入力部１０１により対象の文書画像を入力する（Ｓ２０１）。さらに，この入力された文書画像を画像縮小部１０２で縮小する（Ｓ２０２）。
【００１８】
この場合，例えば，８×８画素を１画素にＯＲ圧縮することにより，文字を構成する黒画素が文字列方向に連結した状態となる。この連結成分の外接矩形を抽出して基本要素とする。この基本要素は１文字あるいは文字列の一部の場合もあれば，表などはそれだけで一つの大きな基本要素となる。
【００１９】
続いて，上記基本要素を文字，表，図その他などに分類し，また，文字要素は統合して行を生成し，行は統合して領域を抽出する（Ｓ２０３）。なお，この処理の流れは，本発明者が先に特開平９−４４５９４号公報で提案している。
【００２０】
続いて，段組み情報を抽出する（Ｓ２０４）。この段組みについて図３および図４を用いて説明する。抽出した文字領域の行方向に沿った長さを「領域幅」，該「領域幅」と垂直な方向の長さを「領域高さ」とする。ここで，「領域高さ」が予め設定した閾値以上のものについて「領域幅」のヒストグラムを作成する。
【００２１】
つまり，図３の場合，３０１〜３０４が閾値以上の高さの文字領域となる。このヒストグラムを図４に示す。図４のヒストグラムにおいて，総度数に対して高い比率の頻度を持つクラスがあった場合，それを「段幅」とする。図４では９０〜１２０の幅のクラスがこれに相当する。そして，図３では３０１〜３０３がこれに相当する。
【００２２】
このとき，「段幅」の取り得る範囲に制限を設けてもよい。例えば図４に示す例では，８０〜１７０がその範囲であるとすれば，たとえ３０〜６０のクラスの頻度が高くても段幅としては採用されず，最終的に段幅は未定となる。なお，先の特開平９−４４５９４号公報の方法では，文字要素の分布から整合の取れた段組み状態を検出しており，この検出した段組み情報を流用してもよい。
【００２３】
さて，段組み情報が抽出できたならば，続いて，その段組み情報を利用した文字領域の修正を実行する（Ｓ２０５）。この文字領域の修正処理の詳細について図５のフローチャートに示す。
【００２４】
図５において，まず，推定段幅に合致する領域を検出する（Ｓ５０１）。図７に示す例では７０２に該当する。続いて，ステップＳ５０１で検出した領域と同じ段に所属すべき候補領域，つまり修正対象候補を検出する（Ｓ５０２）。この場合，領域の上端あるいは下端が揃っており，かつ幅がそれほど変わらない領域が候補領域となる。また，キャプションの融合を想定する場合は，近傍に図領域が存在することも条件の一つとなる。図７に示す例では７０３が該当する。
【００２５】
上述の如く候補領域が検出されたならば，次にステップＳ５０３を実行する。すなわち，その候補領域を構成する文字要素を使用して再び文字領域の抽出を行う。このとき，段に沿った強制分割位置（図６参照）を仮定し，それより外側では基準行方向と垂直方向に行の生成および領域の生成を実行する。図７の該当箇所を拡大したものが図６である。強制分割位置に対して段外側の６０１の要素を使って行６０２と６０３を生成する。そして，上記生成した行における整合度を算出する（Ｓ５０３）。
【００２６】
続いて，予め設定した閾値以上の整合度となる場合には，この領域を正式なものとし，残りの強制分割位置より段内側の文字要素を使用し，もう一方の文字領域を生成する（Ｓ５０４）。行の整合度としては，各文字要素の重心の回帰直線の傾きであるとか，要素間距離の分散などを利用することができる。また，図６に示すように段外側で複数行生成される場合には，その行の高さ（図６で上下方向）の差に反比例するような値を整合度としてもよい。
【００２７】
したがって，以上述べてきたような処理動作によって，図７に示すような図のキャプションと本分とが近接している場合であっても正確な領域の抽出を行うことができる。
【００２８】
ところで，以上説明した各実施の形態における文書画像処理動作は図１に示したシステムによって実行したが，この他に，文書画像処方法としてソフトウェアとして機械読み取り可能な記録媒体に記録し，コンピュータ上で実行するようにしてもよい。
【００２９】
【発明の効果】
以上説明したように，本発明に係る文書画像処理方法（請求項１，２）によれば，段組抽出工程で抽出された段の位置を参照し，過剰に統合された文字領域を評価・修正するため，異なる文字領域に属する文字同士が近接するような文書画像であっても，文字領域の過剰統合を回避し，本来抽出されるべき文字領域として抽出することができる。
【００３０】
また，本発明に係る機械読み取り可能な記録媒体（請求項３）にあっては，請求項１または２に記載の文書画像処理方法をコンピュータに実行させるプログラムを記録したことにより，請求項１または２に記載の動作をコンピュータによって実現することが可能となる。
【図面の簡単な説明】
【図１】本発明の実施の形態に係る文書画像処理方法を適用したシステムの構成を示すブロック図である。
【図２】本発明の実施の形態に係る文書画像処理方法の手順を示すフローチャートである。
【図３】本発明の実施の形態に係る段組み情報抽出例を示す説明図である。
【図４】本発明の実施の形態に係る段組み情報抽出の際に作成・使用される領域幅のヒストグラム例である。
【図５】図２における領域修正処理の詳細処理手順を示すフローチャートである。
【図６】本発明の実施の形態に係る強制再領域抽出・評価例について図７を拡大して示す説明図である。
【図７】図のキャプションと本文とが近接している場合を示す説明図である。
【符号の説明】
１０１画像入力部
１０２画像縮小部
１０３基本要素抽出部
１０４文字領域抽出部
１０５段組検出部
１０６文字領域修正部
１０７制御部
１０８データ記憶部
１１０データ通信部
[0001]
[Technical field of the invention]
The present invention is used in, for example, a document image recognition system that recognizes an input document image. In particular, it relates to a document image processing method that corrects excessive integration of extracted character areas when character areas are extracted from a document image, and to a machine-readable recording medium recording a program for causing a computer to execute the document image processing method.
[0002]
[Prior art]
Conventionally, in order to input a document image and then perform character recognition on it or acquire various layout information, it is necessary to perform region extraction processing that extracts character regions, graphic regions, and other regions.
[0003]
As a reference technical document on such region extraction processing, “Print Document Recognition System AutoReco/2 -Image Process-” (Information Processing Society of Japan, 47th National Convention (second half of 1993), 2-97 to 2-98) has been disclosed. There, in particular, column boundary lines are detected and character string elements are integrated into regions (this is referred to as the first prior art).
[0004]
Further, a technique for detecting the column state of the entire page and extracting the character regions accordingly is disclosed in “Document Image Region Division Method and Column Type Discrimination Method” of Japanese Patent Laid-Open No. 9-44594 (this is referred to as the second prior art).
[0005]
[Problems to be solved by the invention]
However, the conventional techniques as described above have the following problems. First, in the first prior art, over-integration may occur when the caption of a figure and the text are close to each other in a newspaper or the like. A specific example is shown in FIG. In FIG. 7, since the caption 704 is close to the text, it is erroneously extracted as one region in the form of 703.
[0006]
Also, in the method disclosed in the second prior art, when the character elements are dense, as in newspapers, no clear column layout is detected at the initial stage, and character areas end up being extracted by integrating neighboring character elements. That is, excessive integration occurs when elements are close to each other as shown in FIG. 7.
[0007]
Even in the method disclosed in the first prior art, adjacent character elements are grouped at the initial stage, so the edge portion of the column is no longer detected. If the edge portion of the column is nevertheless detected in such a case, many pseudo edges are detected in addition to the true edge, and the extraction of character areas becomes complicated as a whole.
[0008]
The present invention has been made in view of the above, and an object of the present invention is to extract character areas correctly from a document image even when characters belonging to different character areas are close to each other.
[0009]
[Means for Solving the Problems]
To achieve the above object, the document image processing method according to claim 1 is a document image processing method that integrates character elements in a document image to generate lines, integrates the generated lines to extract character areas, and corrects the line integration used to extract the character areas, the method including: a column information extraction step of extracting column position information from the character areas; a column width estimation step of estimating a column width based on the appearance frequency of the area widths of the character areas; a step of extracting a correction candidate area based on the estimated column width and on the column position information in the character area corresponding to the estimated column width; and an area correction step of dividing the correction candidate area using a forced division position corresponding to the position of the column in the corresponding character area.
[0010]
The document image processing method according to claim 2 is the document image processing method according to claim 1, wherein the area correction step detects a candidate area that should belong to the same column as an area matching the column width estimated from the column position information extracted in the column information extraction step. When the characters or character string elements in that candidate area are integrated into character areas again with reference to the column position, a forced division position along the column is assumed. Outside this division position, a character area formed in the direction perpendicular to the direction of the candidate area is generated first, and that character area is then evaluated.
[0012]
According to a third aspect of the present invention, there is provided a machine-readable recording medium in which a program for causing a computer to execute the document image processing method according to the first or second aspect is recorded.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a document image processing method of the present invention and a machine-readable recording medium that records a program for causing a computer to execute the document image processing method will be described in detail with reference to the accompanying drawings.
[0014]
(System configuration)
FIG. 1 is a block diagram showing the configuration of a system to which a document image processing method according to an embodiment of the present invention is applied. In the figure, reference numeral 101 denotes an image input unit that uses a CCD-equipped scanner, a facsimile machine, or the like to optically read an original containing a document and input it as a binary image. As a specific means of the image input unit 101, a document image may also be acquired from another device via a network (through a data communication unit 110 described later).
[0015]
When an image is captured via a network, for example, a browser function prepared in advance (here, a WWW (World Wide Web) browser function) is used to browse wide-area data resources distributed on the Internet, that is, data written in a language such as HTML (HyperText Markup Language: a document description language for the WWW, an Internet information service using hypertext) and identified by a URL (Uniform Resource Locator), which is one kind of access information on a WWW server.
[0016]
Reference numeral 102 denotes an image reduction unit that reduces the image input by the image input unit 101, 103 denotes a basic element extraction unit that extracts character elements from the image reduced by the image reduction unit 102, 104 denotes a character area extraction unit that extracts character areas, 105 denotes a column detection unit that detects columns from the character areas and the like extracted by the character area extraction unit 104, 106 denotes a character area correction unit that executes the character area correction processing described later, 107 denotes a control unit that performs overall control of the entire system based on a predetermined control program, 108 denotes a data storage unit for storing various data such as input document data and layout information, 109 denotes a data communication path, and 110 denotes a data communication unit configured to perform network communication according to, for example, the TCP (Transmission Control Protocol) / IP (Internet Protocol) protocol.
[0017]
(System operation)
Next, the document image processing operation in the system configured as described above will be described. FIG. 2 is a flowchart showing the procedure of the document image processing method according to the embodiment of the present invention. First, a target document image is input by the image input unit 101 (S201). Further, the input document image is reduced by the image reduction unit 102 (S202).
[0018]
In this case, for example, by performing OR compression of 8 × 8 pixels into one pixel, the black pixels constituting the character are connected in the character string direction. The circumscribed rectangle of this connected component is extracted and used as a basic element. This basic element may be a single character or a part of a character string, and a table or the like alone becomes one large basic element.
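The reduction and basic-element extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `or_reduce` and `bounding_boxes` are hypothetical, the image is assumed to be a list of rows of 0/1 values, and 8-connectivity is an assumption (the text does not state which connectivity is used).

```python
from collections import deque


def or_reduce(image, block=8):
    """Shrink a binary image by OR-ing each block x block tile into one pixel."""
    h, w = len(image), len(image[0])
    out_h, out_w = (h + block - 1) // block, (w + block - 1) // block
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(h):
        for x in range(w):
            if image[y][x]:
                out[y // block][x // block] = 1
    return out


def bounding_boxes(image):
    """Return circumscribed rectangles (x0, y0, x1, y1), inclusive, of the
    connected black components (8-connectivity is an assumption)."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not image[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

In this sketch, two black pixels that fall into horizontally adjacent 8 × 8 tiles become one connected component after reduction, which is the "connected in the character string direction" effect the paragraph describes.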
[0019]
Subsequently, the basic elements are classified into characters, tables, diagrams, and the like, and the character elements are integrated to generate a line, and the lines are integrated to extract a region (S203). The flow of this process has been previously proposed by the present inventor in Japanese Patent Laid-Open No. 9-44594.
[0020]
Subsequently, column information is extracted (S204). The column layout will be described with reference to FIGS. 3 and 4. The length of an extracted character area along the line direction is defined as the “area width”, and its length in the direction perpendicular to the “area width” is defined as the “area height”. Here, a histogram of “area width” is created for the character areas whose “area height” is equal to or greater than a preset threshold value.
[0021]
That is, in the case of FIG. 3, 301 to 304 are the character areas whose height is equal to or greater than the threshold. The resulting histogram is shown in FIG. 4. If the histogram of FIG. 4 contains a class whose frequency is a high proportion of the total frequency, that class is taken as the “column width”. In FIG. 4, the class with widths of 90 to 120 corresponds to this, and in FIG. 3, areas 301 to 303 fall in this class.
[0022]
At this time, a restriction may be placed on the range that the “column width” can take. For example, if the permitted range in the example of FIG. 4 is 80 to 170, then even if the class of 30 to 60 has a high frequency, it is not adopted as the column width, and the column width ultimately remains undetermined. Note that the method of the above-mentioned Japanese Patent Laid-Open No. 9-44594 detects a consistent column state from the distribution of character elements, and the column information detected in that way may be used instead.
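The histogram-based estimate of S204 can be sketched as below. The function name `estimate_column_width`, the class size of 30 (chosen to match the 90 to 120 class in the example), and the `min_ratio` value standing in for "a high proportion of the total frequency" are all assumptions made for illustration; the text does not fix these values.

```python
from collections import Counter


def estimate_column_width(regions, min_height, class_size=30,
                          min_ratio=0.5, width_range=None):
    """Estimate the column width class from (width, height) pairs.

    Returns the (low, high) bounds of the dominant width class, or None
    when the column width stays undetermined.
    """
    # Only areas that are tall enough take part in the histogram.
    widths = [w for w, h in regions if h >= min_height]
    if not widths:
        return None
    hist = Counter(w // class_size for w in widths)
    cls, count = hist.most_common(1)[0]
    # The dominant class must hold a high share of the total frequency.
    if count / len(widths) < min_ratio:
        return None
    low, high = cls * class_size, (cls + 1) * class_size
    # Optional restriction on the range the column width may take.
    if width_range is not None and not (width_range[0] <= low
                                        and high <= width_range[1]):
        return None
    return low, high
```

With a permitted range of 80 to 170, a dominant class of 30 to 60 is rejected and the function returns `None`, mirroring the "undetermined" outcome described in the paragraph.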
[0023]
If the column information can be extracted, the character area is corrected using the column information (S205). The details of the character area correction processing are shown in the flowchart of FIG.
[0024]
In FIG. 5, first, an area matching the estimated column width is detected (S501). In the example shown in FIG. 7, this is area 702. Subsequently, a candidate area that should belong to the same column as the area detected in step S501, that is, a correction target candidate, is detected (S502). Here, an area whose upper or lower end is aligned with that of the detected area and whose width does not differ much becomes a candidate area. When the merging of a caption is assumed, the presence of a figure area in the vicinity is a further condition. In the example shown in FIG. 7, this is area 703.
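Steps S501 and S502 can be sketched as follows. This is an illustrative reading, not the patented implementation: `Region`, `find_candidates`, and the tolerances `edge_tol` and `width_tol` are hypothetical, each area is reduced to its extent along the line direction (`start`, `end`), and figure proximity is assumed to be precomputed as a boolean flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int                 # leading edge along the line direction
    end: int                   # trailing edge along the line direction
    near_figure: bool = False  # assumed to be computed elsewhere

    @property
    def width(self):
        return self.end - self.start


def find_candidates(regions, column_width, edge_tol=5, width_tol=0.3,
                    require_figure=False):
    """Return (matched, candidates) for a (low, high) column width class."""
    low, high = column_width
    # S501: areas whose width matches the estimated column width.
    matched = [r for r in regions if low <= r.width < high]
    candidates = []
    # S502: areas aligned with a matched area at one end and of similar width.
    for r in regions:
        if r in matched:
            continue
        if require_figure and not r.near_figure:
            continue
        for m in matched:
            aligned = (abs(r.start - m.start) <= edge_tol
                       or abs(r.end - m.end) <= edge_tol)
            similar = abs(r.width - m.width) <= width_tol * m.width
            if aligned and similar:
                candidates.append(r)
                break
    return matched, candidates
```

Setting `require_figure=True` corresponds to the caption case, where a nearby figure area is an additional condition.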
[0025]
If a candidate area is detected as described above, step S503 is executed next. That is, character areas are extracted again using the character elements that make up the candidate area. At this time, a forced division position along the column (see FIG. 6) is assumed, and outside that position, line generation and area generation are performed in the direction perpendicular to the reference line direction. FIG. 6 is an enlarged view of the relevant portion of FIG. 7. Lines 602 and 603 are generated using the elements 601 that lie outside the column with respect to the forced division position. Then, the degree of matching of the generated lines is calculated (S503).
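The forced division of S503 can be sketched as below, under stated assumptions: each character element is represented by its center `(u, v)`, where `u` runs along the reference line direction and `v` perpendicular to it; the forced division position is a single `u` value; and `line_gap` is a hypothetical grouping tolerance. The function names are illustrative and do not come from the source.

```python
def split_at_forced_position(elements, division, outside_is_greater=True):
    """Split element centers (u, v) into (inside, outside) of the column."""
    inside, outside = [], []
    for e in elements:
        beyond = e[0] > division
        (outside if beyond == outside_is_greater else inside).append(e)
    return inside, outside


def group_into_perpendicular_lines(outside, line_gap):
    """Group outside elements into lines running perpendicular to the
    reference line direction: consecutive elements whose u coordinates
    differ by at most line_gap are placed in the same line."""
    lines = []
    for e in sorted(outside):
        if lines and e[0] - lines[-1][-1][0] <= line_gap:
            lines[-1].append(e)
        else:
            lines.append([e])
    # Order the elements of each line along the perpendicular direction.
    return [sorted(line, key=lambda p: p[1]) for line in lines]
```

In the FIG. 6 situation, the outside elements 601 would be grouped into two such lines, corresponding to 602 and 603, while the inside elements are kept for the second character area generated in S504.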
[0026]
Subsequently, when the degree of matching is equal to or higher than a preset threshold value, this area is adopted as final, and the other character area is generated using the remaining character elements on the column-inner side of the forced division position (S504). As the degree of matching of a line, the slope of the regression line through the centers of gravity of the character elements, the variance of the distances between elements, or the like can be used. In addition, when a plurality of lines are generated outside the column as shown in FIG. 6, a value that is inversely proportional to the difference in the heights of those lines (the vertical direction in FIG. 6) may be used as the degree of matching.
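The three matching measures named above can be sketched as follows. How they are turned into a single score and compared with the threshold is not specified in the text, so `height_consistency` (which uses 1 / (1 + difference) to avoid division by zero) and the combined check `accept_lines` with its default thresholds are assumptions for illustration only. Points are `(s, t)` centers with `s` measured along the generated line.

```python
from statistics import pvariance


def regression_slope(points):
    """Least-squares slope of t on s; near 0 for a well-aligned line."""
    n = len(points)
    mean_s = sum(s for s, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    sxx = sum((s - mean_s) ** 2 for s, _ in points)
    if sxx == 0:
        return 0.0  # all elements at the same position along the line
    sxy = sum((s - mean_s) * (t - mean_t) for s, t in points)
    return sxy / sxx


def spacing_variance(points):
    """Variance of the distances between neighboring elements."""
    positions = sorted(s for s, _ in points)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pvariance(gaps) if gaps else 0.0


def height_consistency(heights):
    """Value that falls as the line heights differ (assumed form)."""
    return 1.0 / (1.0 + max(heights) - min(heights))


def accept_lines(lines, heights, max_slope=0.1, min_consistency=0.5):
    """Hypothetical acceptance test combining two of the measures."""
    straight = all(abs(regression_slope(p)) <= max_slope for p in lines)
    return straight and height_consistency(heights) >= min_consistency
```

A real system would tune the thresholds, and possibly the combination of measures, against sample documents.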
[0027]
Therefore, with the processing described above, regions can be extracted accurately even when the caption of a figure and the body text are close to each other, as shown in FIG. 7.
[0028]
Incidentally, the document image processing operation in the embodiment described above is executed by the system shown in FIG. 1, but the document image processing method may instead be recorded as software on a machine-readable recording medium and executed on a computer.
[0029]
[Effects of the invention]
As described above, according to the document image processing method of the present invention (claims 1 and 2), excessively integrated character areas are evaluated and corrected with reference to the column positions extracted in the column information extraction step. Therefore, even for a document image in which characters belonging to different character areas are close to each other, excessive integration of character areas can be avoided and the character areas that should be extracted can be extracted as such.
[0030]
Further, in the machine-readable recording medium according to the present invention (claim 3), a program for causing a computer to execute the document image processing method according to claim 1 or 2 is recorded, so that the operation described in claim 1 or 2 can be realized by a computer.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a system to which a document image processing method according to an embodiment of the present invention is applied.
FIG. 2 is a flowchart showing a procedure of a document image processing method according to the embodiment of the present invention.
FIG. 3 is an explanatory diagram showing an example of column information extraction according to the embodiment of the present invention.
FIG. 4 is a histogram example of a region width created and used when extracting column information according to an embodiment of the present invention.
FIG. 5 is a flowchart showing a detailed processing procedure of region correction processing in FIG. 2;
FIG. 6 is an explanatory diagram showing an enlargement of FIG. 7 for an example of forced re-region extraction / evaluation according to the embodiment of the present invention.
FIG. 7 is an explanatory diagram showing a case where the caption of the figure and the text are close to each other.
[Explanation of symbols]
101 Image input unit
102 Image reduction unit
103 Basic element extraction unit
104 Character area extraction unit
105 Column detection unit
106 Character area correction unit
107 Control unit
108 Data storage unit
110 Data communication unit

Claims

In a document image processing method for integrating character elements in a document image, generating a line, extracting a character area by integrating the generated line, and correcting line integration for extracting the character area,
A column information extraction step of extracting column position information from the character area;
A column width estimation step of estimating a column width based on the appearance frequency of the area width of the character area;
A step of extracting a correction candidate area based on the estimated column width and on the column position information in the character area corresponding to the estimated column width; and
An area correction step of dividing the correction candidate area using a forced division position corresponding to the position of the column in the corresponding character area;
A document image processing method comprising:

The document image processing method according to claim 1, wherein the area correction step detects a candidate area that should belong to the same column as an area matching the column width estimated from the column position information extracted in the column information extraction step, and, when the characters or character string elements in that candidate area are integrated into character areas again with reference to the column position, assumes a forced division position along the column, first generates, outside the division position, a character area formed in the direction perpendicular to the direction of the candidate area, and then evaluates that character area.

A machine-readable recording medium having recorded thereon a program for causing a computer to execute the document image processing method according to claim 1 or 2.