JP3653156B2

JP3653156B2 - Document image area extraction method

Info

Publication number: JP3653156B2
Application number: JP01547297A
Authority: JP
Inventors: 高志齋藤; 敏文山合
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1997-01-29
Filing date: 1997-01-29
Publication date: 2005-05-25
Anticipated expiration: 2017-01-29
Also published as: JPH10214309A

Description

[0001]
[Technical Field of the Invention]
The present invention relates to a document image region extraction method for accurately extracting a character region from a document without columns.
[0002]
[Prior art]
Various methods for extracting text regions from a document image have been proposed. For example, the character region extraction method previously proposed by the present applicant extracts coherent character regions by integrating small elements such as characters in a document image (see Japanese Patent Laid-Open No. 5-81475). However, because this method integrates character rectangles in the same way whether the document has multiple columns or no columns (a single column), regions may be over-segmented (left unintegrated) in a document without columns that has wide character spacing.
[0003]
It is therefore necessary either to designate the document as "single-column", or to determine the column type of the document image (single-column, multi-column, or free-column) from blank portions or ruled lines detected in the image, using the region division method and column type discrimination method previously proposed by the present applicant (Japanese Patent Application No. 7-194399), and then to apply processing adapted to a single column. In that method, when small regions are integrated in the line direction to extract coherent text regions, small regions in a single-column document are integrated even if they are far apart.
[0004]
[Problems to be solved by the invention]
However, although documents such as official notices are considered single-column, the bibliographic items at the top of the page, such as the author and the addressee, are separated to the left and right and must not be integrated. Consequently, regions cannot be extracted correctly by integrating widely merely because the document is single-column.
[0005]
As another method, column edges are obtained from the marginal distribution of the start positions of character strings, character strings belonging to the same column are given the same attribute (the number of the column to which they belong; more than one is allowed), and vertically adjacent character strings with the same attribute are integrated, thereby dividing the image into character regions (see Japanese Patent Laid-Open No. 1-183783).
[0006]
However, this method also cannot be controlled so that body text is integrated even when its parts are far apart while bibliographic items are left unintegrated. Moreover, even in a single-column document, noise at the page edge whose size resembles that of characters is integrated with the body text into the same region, which is undesirable. Furthermore, because the column edges are estimated simply by taking a projection, it is difficult to distinguish noise that should not be integrated from characters such as the numbers of an itemized list.
[0007]
The present invention has been made in consideration of the above-described circumstances, and an object of the present invention is to provide a document image region extraction method for accurately extracting a character region from a document without columns.
[0008]
[Means for Solving the Problems]
To achieve the above object, the invention according to claim 1 is a document image region extraction method that extracts character elements from a document image and extracts character regions by integrating the character elements, characterized in that: a projection of the character elements at the left and right ends or the upper and lower ends of the document image is taken in the vertical or horizontal direction; a minimum of the projection frequency is obtained; when a blank portion in which the absolute value of this minimum is smaller than a predetermined threshold has a width equal to or greater than a predetermined threshold, the blank portion is detected as an edge blank portion; when the width over which character elements exist on the edge side of the minimum is larger than a predetermined threshold, the detected edge blank portion is regarded as not being an edge blank portion that separates noise from a character region; and character regions are extracted with reference to the position of the edge blank portion.
[0009]
[Embodiments of the Invention]
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.
FIG. 1 shows the configuration of an embodiment of the present invention. 101 is image input means for inputting a document image such as a manuscript; 102 is connected component extraction means for compressing the image and extracting the circumscribed rectangles of connected components; 103 is connected component classification means for classifying the connected components into character elements, table elements, dividing lines, and others; 104 is edge blank detection means for detecting blank portions at the edges from the extracted character elements; 105 is title candidate detection means for detecting title candidates among the components classified as character elements by the connected component classification means 103; 106 is character element integration means for obtaining character regions by integrating character elements with reference to the edge blank portions obtained by the edge blank detection means 104 and the title candidate positions obtained by the title candidate detection means 105; 107 is a data storage unit for storing the input image and various information under processing; 108 is an overall control unit; 109 is a data communication path; and 110 is means for designating or determining that the input document has no columns.
[0010]
FIG. 2 shows a processing flowchart of the present invention. The processing operation of the present invention will be described with reference to FIG.
First, a document image is obtained by the image input means 101 (step 201). The image input means is a scanner, a facsimile machine, or the like, and the image may also be obtained from another device via a network. Next, the document is designated as having no columns (step 202). When the column layout is to be determined automatically, the column type discrimination method of the above-mentioned application (Japanese Patent Application No. 7-194399), for example, may be used. In that case, because the determination can be made only after information on the document image has been obtained, a "no-column determination" step is inserted between step 204 and step 205.
[0011]
Next, the connected component extraction unit 102 compresses (reduces) the document image to about 1/8, and then extracts the connected component (step 203). By the compression processing described above, adjacent characters (character strings) become one connected component, and figures and the like are connected together and extracted. The connected component classification means 103 classifies the extracted connected components into character elements, tables, dividing lines, diagrams, and the like (step 204). At the time of classification, characteristics such as the size and aspect ratio of the element, the black pixel density, and the presence and position of ruled line elements in the connected components are used.
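Step 203 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the image is a list of rows of 0/1 values, that the 1/8 reduction is done by marking a reduced pixel black when any pixel in its 8 x 8 block is black, and that components are 8-connected. The names `reduce_or` and `connected_boxes` are hypothetical.

```python
from collections import deque


def reduce_or(img, factor=8):
    """Shrink a binary image; a reduced pixel is black (1) if any
    source pixel in its factor x factor block is black (assumed rule)."""
    h, w = len(img), len(img[0])
    rh, rw = (h + factor - 1) // factor, (w + factor - 1) // factor
    out = [[0] * rw for _ in range(rh)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                out[y // factor][x // factor] = 1
    return out


def connected_boxes(img):
    """Return inclusive bounding boxes (x0, y0, x1, y1) of the
    8-connected black components, in raster-scan order."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not img[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:  # breadth-first flood fill of one component
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Isolated marks that are separate components at full resolution fall into neighbouring blocks after reduction and are extracted as one component, which is the effect the paragraph above describes for adjacent characters.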
[0012]
The following processing is the main feature of the present invention.
First, the edge blank detection means 104 detects edge blank portions from the extracted character elements (step 205). FIG. 3 is a diagram for explaining edge blank detection, and FIG. 4 is a detailed flowchart of step 205 in FIG. 2. Edge blank detection will be described with reference to FIGS. 3 and 4. In FIG. 3, 302, 305, 306, 307, 308, 309, 310, 311, and 312 are character elements, and 301 and 303 are noise.
[0013]
Assuming the character elements are extracted as shown in FIG. 3, a projection of the frequency of character elements at the edges is taken in the line (horizontal) direction and the vertical direction (step 401). This frequency thus represents the number of character elements per unit width in the horizontal and vertical directions. The example of FIG. 3 shows only the left edge; the right edge is handled in the same way.
[0014]
A minimum of this frequency information is obtained, and when its absolute value is lower than a predetermined threshold and the blank portion has a width equal to or greater than a predetermined threshold, the blank portion is taken as the edge blank portion being sought (step 402). At this point, if the width 318 over which character elements exist on the edge side of the minimum (further to the left in FIG. 3) is larger than a predetermined threshold, those elements are probably not noise, so the detected edge blank portion is excluded as not being one that separates noise from a character region. Edge blank portions are detected in this way (when the line direction is vertical, the whole image is rotated by 90 degrees before the above processing is performed). The positions of the detected edge blank portions are recorded and used in the character element integration processing described later (step 207).
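Steps 401 and 402 for the left edge can be sketched as below. This is a simplified illustration under stated assumptions: character elements are boxes `(x0, y0, x1, y1)` with inclusive coordinates, only a band of `band_width` columns at the left edge is examined, and the search for the minimum is approximated by taking the first run of columns whose count is at or below `max_freq`. The function name and all threshold values are hypothetical; the description only says "predetermined thresholds".

```python
def detect_left_edge_blank(boxes, band_width, max_freq=0,
                           min_blank_width=3, max_noise_width=5):
    """Return (first_col, last_col) of the left edge blank, or None."""
    # Step 401: per-column count of character elements in the edge band.
    profile = [0] * band_width
    for x0, _, x1, _ in boxes:
        for x in range(max(x0, 0), min(x1, band_width - 1) + 1):
            profile[x] += 1

    # First column, from the page edge inward, that holds elements.
    start = next((x for x, c in enumerate(profile) if c > max_freq), None)
    if start is None:
        return None

    # Step 402: skip the edge-side elements, then measure the
    # low-frequency run that follows them.
    x = start
    while x < band_width and profile[x] > max_freq:
        x += 1
    blank_start = x
    while x < band_width and profile[x] <= max_freq:
        x += 1
    blank_end = x  # exclusive

    if blank_end - blank_start < min_blank_width:
        return None  # blank too narrow
    if blank_start - start > max_noise_width:
        return None  # edge-side elements too wide to be noise (width 318)
    return (blank_start, blank_end - 1)
```

For example, with a narrow mark in columns 1 to 3 and body text starting at column 10, the sketch reports columns 4 to 9 as the edge blank; if the edge-side element is 7 columns wide, the blank is rejected because the element is treated as text rather than noise.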
[0015]
As a method of detecting the blank portion described above, ruled line extraction using white runs may be performed instead of projection. FIG. 6 is a flowchart of this alternative edge blank detection process, which is described below. Long vertical white runs are extracted at the edge, and their connected components are obtained. The run threshold is set to a large value, and connected components exceeding this threshold become part of a rule-shaped blank portion (step 601).
[0016]
Next, based on the obtained width and length of the blank ruled line, it is determined whether or not it is a blank ruled line dividing the edge noise and the character area (step 602). The position of the blank ruled line obtained in this way is recorded and used for processing (step 207) for integrating character elements to be described later.
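The white-run alternative (steps 601 and 602) might look like the following sketch. It makes several assumptions not stated in the description: the image is a list of rows of 0/1 values, a column qualifies when its longest vertical white run is at least `min_run`, adjacent qualifying columns are grouped as a stand-in for connecting the runs, and a blank rule is accepted only if it is at least `min_width` columns wide and does not touch the page edge itself (so that something lies on its edge side). The names `longest_white_run` and `blank_rules` are hypothetical.

```python
def longest_white_run(img, x):
    """Length of the longest vertical run of white (0) pixels in column x."""
    best = cur = 0
    for row in img:
        cur = cur + 1 if row[x] == 0 else 0
        best = max(best, cur)
    return best


def blank_rules(img, band_width, min_run, min_width):
    """Return (first_col, last_col) spans of blank rules in the left band."""
    # Step 601: columns that contain a sufficiently long white run.
    cols = [longest_white_run(img, x) >= min_run for x in range(band_width)]

    rules = []
    x = 0
    while x < band_width:
        if not cols[x]:
            x += 1
            continue
        start = x
        while x < band_width and cols[x]:
            x += 1
        # Step 602 (assumed criteria): wide enough, and not the bare
        # page margin starting at column 0.
        if x - start >= min_width and start > 0:
            rules.append((start, x - 1))
    return rules
```

The accepted spans play the same role as the edge blank portions found by projection and can be recorded for step 207 in the same way.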
[0017]
Subsequently, the title candidate detection means 105 extracts title candidates (step 206). FIG. 5 is a detailed flowchart of step 206. The title candidate detection process will be described with reference to FIGS. 3 and 5.
[0018]
First, the uppermost character element is detected near the center of the entire page in a predetermined range from the top of the page (steps 501 and 502). In the example of FIG. 3, the character element 305 corresponds to this. The length obtained by adding the character elements adjacent to the left and right of this character element (up and down if the line direction is up and down) is measured, and when this length is larger than a predetermined threshold, it is detected as a title candidate (step 503). This position is recorded and used later when integrating character elements (step 207).
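The title candidate search (steps 501 to 503) can be sketched as follows. This is an illustrative reading, with assumptions labeled: boxes are `(x0, y0, x1, y1)` with y increasing downward, "near the center" means overlapping a band of half-width `center_tol` around the page center, "adjacent" means lying on the same line with a horizontal gap of at most `max_gap`, and the measured length is the sum of the element widths. The function name and every parameter are hypothetical.

```python
def find_title_candidate(boxes, page_width, center_tol, top_limit,
                         max_gap, min_length):
    """Return the seed box of the title candidate, or None."""
    cx = page_width / 2
    # Steps 501-502: topmost element overlapping the central band
    # within the search range from the top of the page.
    central = [b for b in boxes
               if b[0] <= cx + center_tol and b[2] >= cx - center_tol
               and b[1] <= top_limit]
    if not central:
        return None
    seed = min(central, key=lambda b: b[1])

    # Elements that overlap the seed vertically, ordered left to right.
    line = sorted((b for b in boxes if b[1] <= seed[3] and b[3] >= seed[1]),
                  key=lambda b: b[0])
    lo = hi = line.index(seed)
    while lo > 0 and line[lo][0] - line[lo - 1][2] <= max_gap:
        lo -= 1
    while hi < len(line) - 1 and line[hi + 1][0] - line[hi][2] <= max_gap:
        hi += 1

    # Step 503: total length of the seed and its close neighbours.
    length = sum(b[2] - b[0] + 1 for b in line[lo:hi + 1])
    return seed if length > min_length else None
```

The vertical position of the returned box can then be stored as the title position for use in step 207.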
[0019]
Finally, the character element integration means 106 integrates the character elements into regions (step 207). Basically, as in the method described in the above-mentioned publication (Japanese Patent Laid-Open No. 5-81475), this process integrates character elements that are adjacent in the line direction into lines and then integrates the lines into regions.
[0020]
The differences from the conventional method are that the edge blank portions are used and that the title candidate position is used. Taking FIG. 3 as an example, when character elements are integrated into lines and an edge blank portion lies between two elements, the integration conditions (such as the distance between the elements and the degree of vertical agreement) are tightened. As a result, element 301 and character element 305, which are far apart, are not integrated, and element 303 is not integrated with 307 or 308, with which its vertical agreement is low. Character element 302, which is the number of an itemized list, is nevertheless integrated with 306. This makes it possible to correctly distinguish the edge noise 301 and 303, which is hard to tell apart by size alone, from character elements. In addition, because the region is not simply split at the blank portion, character element 302 is not left isolated.
[0021]
In addition, above the position of the title candidate 305 (the range in which bibliographic items exist), the conditions for integrating character elements into lines are tightened. This prevents character elements 309 and 310 from being integrated by mistake. On the other hand, character elements 311 and 312, which are far apart, lie below the title candidate and are therefore integrated despite the distance. As a result, character regions can be extracted correctly even when the character spacing in a document without columns is wide. The character regions finally extracted by the above processing are 313, 314, 315, 316, and 317.
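The two tightening rules used in step 207 can be illustrated with a pairwise merge test. This is a sketch under assumptions, not the method of the cited publication: boxes are `(x0, y0, x1, y1)` with inclusive coordinates, "degree of vertical agreement" is taken as the vertical overlap divided by the smaller box height, `edge_blank` is a `(first_col, last_col)` span, `title_top` is the top y coordinate of the title candidate, and all threshold values are hypothetical.

```python
def should_merge(a, b, edge_blank=None, title_top=None,
                 loose_gap=60, strict_gap=10,
                 loose_overlap=0.3, strict_overlap=0.8):
    """Decide whether two character elements join the same line."""
    left, right = (a, b) if a[0] <= b[0] else (b, a)
    gap = right[0] - left[2]
    overlap = max(min(a[3], b[3]) - max(a[1], b[1]) + 1, 0)
    ratio = overlap / min(a[3] - a[1] + 1, b[3] - b[1] + 1)

    strict = False
    # Rule 1: an edge blank portion lies between the two elements.
    if edge_blank is not None:
        blank_start, blank_end = edge_blank
        if left[2] < blank_start and right[0] > blank_end:
            strict = True
    # Rule 2: both elements lie above the title candidate
    # (the range where bibliographic items exist).
    if title_top is not None and max(a[3], b[3]) < title_top:
        strict = True

    max_gap = strict_gap if strict else loose_gap
    min_ratio = strict_overlap if strict else loose_overlap
    return gap <= max_gap and ratio >= min_ratio
```

With these assumed thresholds, an itemized-list number that is well aligned with the text across the blank still merges, a poorly aligned mark across the blank does not, and widely separated elements merge only when they lie below the title.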
[0022]
The present invention is not limited to the above and can also be realized by software. In that case, as shown in FIG. 7, a general-purpose processing device comprising a CPU, a ROM, a RAM, a display device, a hard disk, a keyboard, a CD-ROM drive, and the like is prepared, and a program that realizes the document image region extraction function of the present invention is recorded on a computer storage medium such as a CD-ROM.
[0023]
[Effects of the Invention]
As described above, according to the present invention, the range in which bibliographic items exist is identified in a document without columns, so the body text and the bibliographic items can be separated into regions accurately.
[0024]
According to the present invention, even when there is noise at the edge of the document, a blank portion or blank rule is detected and its position is used when extracting character regions, so character regions can be extracted accurately.
[Brief description of the drawings]
FIG. 1 shows a configuration of an embodiment of the present invention.
FIG. 2 shows a processing flowchart of the present invention.
FIG. 3 is a diagram for explaining edge blank detection.
FIG. 4 is a detailed flowchart of step 205 in FIG. 2.
FIG. 5 is a detailed flowchart of step 206.
FIG. 6 is a flowchart illustrating another process for detecting an edge blank.
FIG. 7 shows a configuration example when the present invention is realized by software.
[Explanation of symbols]
101 Image input means
102 Connected component extraction means
103 Connected component classification means
104 Edge blank detection means
105 Title candidate detection means
106 Character element integration means
107 Data storage unit
108 Control unit
109 Data communication path
110 No-column designation means or determination means

Claims

A document image region extraction method for extracting character elements from a document image and extracting character regions by integrating the character elements, wherein: a projection of the character elements at the left and right ends or the upper and lower ends of the document image is taken in the vertical or horizontal direction; a minimum of the projection frequency is obtained; when a blank portion in which the absolute value of this minimum is smaller than a predetermined threshold has a width equal to or greater than a predetermined threshold, the blank portion is detected as an edge blank portion; when the width over which character elements exist on the edge side of the minimum is larger than a predetermined threshold, the detected edge blank portion is regarded as not being an edge blank portion that separates noise from a character region; and character regions are extracted with reference to the position of the edge blank portion.