JP3086277B2

JP3086277B2 - Document image processing device

Info

Publication number: JP3086277B2
Application number: JP03128339A
Authority: JP
Inventors: 高志齋藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1991-05-02
Filing date: 1991-05-02
Publication date: 2000-09-11
Anticipated expiration: 2015-09-11
Also published as: JPH04330588A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a document image processing apparatus, and more particularly to a document image processing apparatus that identifies underlined character strings in a document image.

[0002]

[Prior Art] Conventional methods for extracting character string areas from a document image include a method that extracts areas by a blurring process, in which white runs no longer than a threshold in the run-length data of the image are replaced with black runs so that each character string, figure, and the like becomes a single black-pixel connected component (Computer Graphics and Image Processing 20, 375-390, 1982), and a method that segments areas from the peripheral distribution (projection profile) of the image (Transactions of the Institute of Electronics and Communication Engineers, 1986/8, Vol. J69-D, No. 8, pp. 1187-1195).
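The blurring process described above can be illustrated with a minimal sketch. The function names `smear_row` and `smear_image` and the parameter `max_gap` are hypothetical, and the choice to fill only white runs bounded by black pixels on both sides is an assumption; the cited paper may handle image borders differently.

```python
def smear_row(row, max_gap):
    """Fill white runs of length <= max_gap that lie between two black pixels.

    row is a list of 0 (white) / 1 (black) values; a new list is returned.
    Leaving border runs untouched is an assumption of this sketch.
    """
    out = list(row)
    last_black = None  # index of the most recent black pixel seen
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= max_gap:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def smear_image(image, max_gap):
    """Apply horizontal smearing to every row of a binary image."""
    return [smear_row(row, max_gap) for row in image]


row = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(smear_row(row, 2))  # [0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
```

With `max_gap` set near the expected inter-character spacing, the characters of one line merge into a single connected component while the wider gaps between columns or blocks remain white.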

[0003] The extraction methods described above classify document areas into character string areas, graphic areas, photograph areas, vertical and horizontal area dividing lines (separators), and so on, according to the size of the circumscribed rectangle of each area, its aspect ratio, the density of black pixels in the area, and the like. However, they misidentify a figure of the same size as a character string as a character string.
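As a rough illustration of the features named above, the following sketch computes the size, aspect ratio, and black pixel density of a circumscribed rectangle. The function name `region_features` and the dictionary layout are hypothetical, and no classification thresholds are given because the source does not state any.

```python
def region_features(image, left, top, right, bottom):
    """Return size, aspect ratio, and black-pixel density of a rectangle.

    Coordinates are inclusive; image is a list of rows of 0/1 values.
    """
    width = right - left + 1
    height = bottom - top + 1
    black = sum(
        image[y][x]
        for y in range(top, bottom + 1)
        for x in range(left, right + 1)
    )
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "density": black / (width * height),
    }


image = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print(region_features(image, 0, 0, 3, 1))
# {'width': 4, 'height': 2, 'aspect_ratio': 2.0, 'density': 0.375}
```

A long, thin figure and a line of text can produce very similar values for all of these features, which is the misidentification the text describes.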

[0004]

[Problems to Be Solved by the Invention] Methods for addressing this problem include a method that distinguishes character strings from figures by checking for long line segments in the areas extracted by the blurring process (Japanese Patent Laid-Open No. 61-175875) and a method that distinguishes characters from figures by the size of the black-pixel connected components (Technical Report of the Institute of Electronics and Communication Engineers, PRU86-115, pp. 33-40). Even with these methods, however, a character string is identified as a figure when an underline touches it.

[0005] An object of the present invention is to provide a document image processing apparatus that identifies underlined character strings in a document image.

[0006]

[Means for Solving the Problems] To achieve the above object, the present invention provides a document image processing apparatus that extracts document elements from document image data, characterized by comprising: means for determining whether a horizontally written character string candidate area extracted from the document image data contains a black-pixel connected component whose width is larger than a predetermined threshold; means for dividing the character string candidate area into upper and lower parts when the black-pixel connected component is larger than the predetermined threshold; and means for extracting black-pixel connected components within the divided upper area, determining the character string candidate area to be an underlined character string area when every extracted black-pixel connected component is smaller than the predetermined threshold, and otherwise determining it to be a horizontally long figure or a separator.

[0007]

[Operation] A document image input by the image input unit is stored in the storage unit, and the character string candidate area extraction unit extracts, from the image data in the storage unit, character string candidate areas that are elements of the document. For each character string candidate area, the determination processing unit examines the size of the black-pixel connected components in the area. When the width of a connected component exceeds the expected character size by the threshold or more, the area dividing unit divides the area into upper and lower parts. The determination processing unit then compares the connected components in the divided upper area with the same threshold and, if the connected components are at or below the threshold, identifies the character string candidate area as an underlined character string.

[0008]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram of the present invention. Reference numeral 101 denotes an image input unit, such as a scanner, that captures a document image; 102 denotes a storage unit that stores the input image data; 103 denotes a character string candidate area extraction unit that extracts character string candidate areas from the image data; 104 denotes an area dividing unit that divides a character string candidate area into upper and lower parts; and 105 denotes a determination processing unit that identifies underlined character strings among the character string candidate areas.

[0009] FIGS. 2 and 3 are flowcharts of the processing of the present invention. A document image input by the image input unit 101, such as a scanner, is stored in the storage unit 102 (step 201). From the image data stored in the storage unit 102, the character string candidate area extraction unit 103 extracts character string candidate areas, which are elements of the document, by a known method (for example, the blurring process or the area segmentation from the peripheral distribution described above) (step 202).

[0010] FIG. 4 shows the areas extracted from a document image 301. Reference numerals 302 to 307 denote the extracted areas, of which 302 to 305 have been identified as character string candidate areas on the basis of area size and black pixel density. Character string candidate areas identified in this way may therefore include figures and separators (vertical and horizontal area dividing lines).

[0011] Therefore, in the present invention, the determination processing unit 105 examines the size of the black-pixel connected components in each area identified as a character string (step 204). The connected components can be obtained by a known method, for example by labeling the pixels or by using the graph structure called LAG (Line Adjacency Graph) described in the reference cited above (Computer Graphics and Image Processing 20, 375-390, 1982).
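One way to obtain the connected components and their circumscribed rectangles is sketched below. It groups the black runs of adjacent rows with a union-find structure, which is loosely in the spirit of a line-adjacency representation but is not claimed to be the LAG method of the cited paper. The names `row_runs` and `component_boxes` are hypothetical, and the use of 8-connectivity is an assumption.

```python
def row_runs(row):
    """Return (start, end) column pairs, end exclusive, for each black run."""
    runs, start = [], None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x
        elif not pixel and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(row)))
    return runs


def component_boxes(image):
    """Return sorted inclusive boxes (left, top, right, bottom), one per component."""
    parent = []  # union-find forest over run ids
    info = []    # (row, start, end) for each run id

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    prev = []  # ids of the runs on the previous row
    for y, row in enumerate(image):
        cur = []
        for start, end in row_runs(row):
            rid = len(parent)
            parent.append(rid)
            info.append((y, start, end))
            for pid in prev:
                _, ps, pe = info[pid]
                # Runs on adjacent rows touch (8-connectivity, assumed) when
                # their column ranges overlap or are diagonally adjacent.
                if ps <= end and start <= pe:
                    parent[find(pid)] = find(rid)
            cur.append(rid)
        prev = cur

    boxes = {}
    for rid, (y, start, end) in enumerate(info):
        root = find(rid)
        left, top, right, bottom = boxes.get(root, (start, y, end - 1, y))
        boxes[root] = (min(left, start), min(top, y),
                       max(right, end - 1), max(bottom, y))
    return sorted(boxes.values())


image = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
print(component_boxes(image))  # [(0, 0, 1, 1), (0, 3, 4, 3), (4, 0, 4, 1)]
```

The width of each component, which the later steps compare against the threshold, is `right - left + 1` for its box.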

[0012] FIG. 5 shows one of the character string candidate areas in FIG. 4, area 401, which contains a character string with an underline 402. FIG. 6 shows the black-pixel connected regions of FIG. 5, where 501 to 506 are the circumscribed rectangles of the black-pixel connected components. For example, the circumscribed rectangle of area 401 is 501 (rectangle 501 is formed because the characters and the underline are close to or touching each other), and the circumscribed rectangle of character 403 is 504. FIG. 8 shows another example, in which an area that should have been identified as a figure has been identified as character string candidate area 701 on the basis of black pixel density.

[0013] Once the sizes of the black-pixel connected components have been obtained, the determination processing unit 105 compares the size (width) of each black-pixel connected component with a threshold derived from the expected character size in the area (that is, the height of the area) (step 205). When the width of a black-pixel connected component exceeds the expected character size by the threshold or more (for example, twice the expected character size), the area dividing unit 104 divides the area into upper and lower parts (step 206). In the example of FIG. 6, the black-pixel connected component (circumscribed rectangle 501) exceeds the threshold, so the area is divided at line 404 in FIG. 5. In the example of FIG. 8, the black-pixel connected component (circumscribed rectangle 701) exceeds the threshold, so the area is divided at line 702 in FIG. 8. The dividing line is positioned so that the ratio of the heights of the upper and lower parts takes a predetermined value (for example, an upper-to-lower ratio of 3 to 2).
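The width test and the position of the dividing line can be written out as a short sketch. The names `needs_split` and `split_row` are hypothetical; the factor of 2 and the 3-to-2 ratio are the example values given in the text, and treating a width exactly equal to the threshold as "large" and rounding the split position down are assumptions.

```python
def needs_split(component_width, region_height, factor=2.0):
    """True when a component is at least factor times the expected character size.

    The expected character size is taken to be the region height, as in the text.
    """
    return component_width >= factor * region_height


def split_row(top, bottom, upper_parts=3, lower_parts=2):
    """Return the first row of the lower part for rows top..bottom inclusive."""
    height = bottom - top + 1
    upper_height = height * upper_parts // (upper_parts + lower_parts)
    return top + upper_height


print(needs_split(120, 20))  # True: 120 >= 2 * 20
print(needs_split(30, 20))   # False
print(split_row(0, 19))      # 12: rows 0..11 above, rows 12..19 below
```

For a region 20 rows high, the upper part gets 12 rows and the lower part 8, so an underline near the bottom of the region falls in the lower part.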

[0014] The determination processing unit 105 then obtains the connected components of the black pixels in the divided upper area (step 207). FIG. 7 shows the connected components of the black pixels extracted from the upper area obtained by dividing area 401 of FIG. 5 at line 404; 612 is the circumscribed rectangle of the area, 613 is the dividing line, and 601 to 611 are the newly obtained black-pixel connected components of the upper area. Similarly, FIG. 9 shows the connected components of the black pixels extracted from the upper area divided at line 702 of FIG. 8; 801 is the newly obtained black-pixel connected component of the upper area, and 802 is the dividing line.

[0015] The determination processing unit 105 compares the black-pixel connected components in the divided upper area with the same threshold as before (step 208). In the example of FIG. 7, every black-pixel connected component is no larger than the threshold, that is, character-sized, so character string candidate area 401 is identified as an underlined character string (step 209). In the example of FIG. 9, on the other hand, black-pixel connected component 801 is larger than the threshold, so the area is identified as a figure or a separator (step 210).
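The decision in steps 205 to 210 can be traced end to end on toy data. The sketch below is an illustration under stated assumptions, not the apparatus itself: `component_widths`, `classify`, and `parse` are hypothetical names, connected components are found with a simple 8-connected flood fill, the threshold is twice the region height, the split uses the 3-to-2 example ratio, and a region with no wide component is simply reported as a character string.

```python
def component_widths(image):
    """Widths of the 8-connected black components of a binary image."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    widths = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            stack = [(y, x)]
            left = right = x
            while stack:
                cy, cx = stack.pop()
                left, right = min(left, cx), max(right, cx)
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            widths.append(right - left + 1)
    return widths


def classify(region, factor=2.0):
    """Classify one candidate region following steps 205 to 210 (assumed details)."""
    height = len(region)
    threshold = factor * height  # expected character size = region height
    if all(w <= threshold for w in component_widths(region)):
        return "character string"
    upper = region[: height * 3 // 5]  # upper part of a 3:2 split
    if all(w <= threshold for w in component_widths(upper)):
        return "underlined character string"
    return "figure or separator"


def parse(lines):
    """Turn rows of '#' (black) and '.' (white) into a 0/1 image."""
    return [[1 if c == "#" else 0 for c in line] for line in lines]


underlined = parse([
    "##..##..##..##..",
    "##..##..##..##..",
    "##..##..##..##..",
    ".#..............",  # descender touching the underline
    "################",
])
frame = parse([
    "################",
    "#..............#",
    "#..............#",
    "#..............#",
    "################",
])
print(classify(underlined))  # underlined character string
print(classify(frame))       # figure or separator
```

In the first region the underline joins a character into one component 16 pixels wide, but after the split the upper part contains only character-sized components. In the second region the wide component survives the split, so the region is treated as a figure or separator.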

[0016] By performing the above processing on every area identified as a character string candidate area on the basis of area size and black pixel density, underlined character strings are distinguished from figures and separators.

[0017]

[Effects of the Invention] As described above, according to the present invention, each area identified as a character string candidate area on the basis of area size and black pixel density is checked for a black-pixel connected component larger than the threshold. If such a component exists, the area is further divided into upper and lower parts, and the upper part is checked for a black-pixel connected component larger than the threshold. Underlined character strings can therefore be accurately distinguished from horizontally long figures and separators.

[Brief description of the drawings]

FIG. 1 is a block diagram of the present invention.

FIG. 2 is a flowchart of the processing of the present invention.

FIG. 3 is a flowchart of the processing of the present invention, continued from FIG. 2.

FIG. 4 is a diagram showing the areas extracted from a document image.

FIG. 5 is a diagram showing an example of a character string candidate area.

FIG. 6 is a diagram showing the black-pixel connected regions of FIG. 5.

FIG. 7 is a diagram showing the connected components of black pixels extracted from the upper area obtained by dividing the area in FIG. 5.

FIG. 8 is a diagram showing another example, in which an area that should have been identified as a figure has been identified as a character string candidate area.

FIG. 9 is a diagram showing the connected components of black pixels extracted from the divided upper area in FIG. 8.

[Explanation of symbols]

101 Image input unit
102 Storage unit
103 Character string candidate area extraction unit
104 Area dividing unit
105 Determination processing unit
401 Character string candidate area
402 Underline
501 Circumscribed rectangle of a black-pixel connected component

Claims

(57) [Claims]

A document image processing apparatus for extracting document elements from document image data, comprising: means for determining whether a horizontally written character string candidate area extracted from the document image data contains a black-pixel connected component whose width is larger than a predetermined threshold; means for dividing the character string candidate area into upper and lower parts when the black-pixel connected component is larger than the predetermined threshold; and means for extracting black-pixel connected components within the divided upper area, determining the character string candidate area to be an underlined character string area when every extracted black-pixel connected component is smaller than the predetermined threshold, and otherwise determining it to be a horizontally long figure or a separator.