JP2968284B2 - Character recognition apparatus and character area separation method

Info

Publication number: JP2968284B2
Application number: JP1179529A
Authority: JP
Inventors: 裕勝山
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1989-07-12
Filing date: 1989-07-12
Publication date: 1999-10-25
Anticipated expiration: 2014-10-25
Also published as: JPH0343879A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a character recognition device that recognizes characters from binarized image data (image information) obtained by an image scanner or other optical reading device. It further relates to a character area separation method that separates and extracts character areas and graphic areas as preprocessing for character recognition in such a device.

[Conventional technology]

In character recognition processing that recognizes characters from binarized image data, the image information corresponding to each individual character (a character image) is obtained from the image data of a character area by a predetermined cutout process. For a document that contains figures, however, separating the character areas from the graphic areas is indispensable as preprocessing.

Conventionally, one way to separate the character areas from the graphic areas in a character recognition device (hereinafter called the "character area separation method" for brevity) takes histograms of the binarized (black and white) image data alternately in the vertical and horizontal directions and divides the image by treating each continuous stretch as one area. In this method, however, the desired constraints needed for area division are set and the same processing is repeated until they are satisfied, so the amount of processing becomes large, and the method can be difficult to apply when there are many black areas.

There is also a method that divides multi-valued image data into small areas and labels them, but multi-valued image data carries more information than binarized image data and therefore requires a large-capacity image memory.

A method that divides binarized image data into small areas and labels them faces a trade-off: when the small areas are large, processing is fast but resolution is poor; when the small areas are small, resolution is good but processing is slow.

Meanwhile, a method has been proposed that traces the contour of each black area (black pixels) to determine the area in which it lies, represents that area as a rectangle (box), and separates character areas from graphic areas using the height of the rectangle as a parameter. This method allows high-resolution processing in a relatively short time.

[Problems to be solved by the invention]

That is, this method makes it easy to separate character areas from graphic areas according to the size (threshold) of the rectangle circumscribing each contour, and it has the excellent ability to separate and extract character areas surrounded by graphics, such as characters in a table. As the resolution increases, however, each character area tends to be split into small areas.

On the other hand, a character area submitted to character recognition processing must contain a certain number of characters so that character spacing, line spacing, and other inter-character properties can be grasped accurately and the cutout process for each character can be performed stably.

With the method described above, however, when the separated and extracted character areas are small, the cutout process that obtains the individual character images becomes unstable, which can lower the character recognition rate.

An object of the present invention is to provide a character recognition device, and a character area separation method for it, that enable accurate separation and extraction of character areas and also gather the character areas into relatively large areas, thereby avoiding the conventional problems.

[Means for solving the problems]

In the character recognition device according to claim (1), integration processing means repeats in order, for the rectangular areas separated as character areas, a process of integrating adjacent rectangular areas and a process of removing "nests" and "overlaps" among the integrated rectangular areas, and extracts the resulting rectangular areas as character areas.

In the character recognition device according to claim (2), in the device according to claim (1), the process of integrating adjacent rectangular areas obtains the maximum value from the histogram of the heights of the separated rectangular areas and integrates a plurality of rectangular areas whose mutual distance is smaller than that maximum value into one rectangular area.

In the character area separation method according to claim (3), rectangular areas are extracted by tracing the contours of the black areas of the binarized image data, a histogram of the heights of the rectangular areas is obtained, and rectangular areas serving as character areas are extracted on the basis of this histogram. An integration process then combines extracted rectangular areas that lie relatively close to one another into one rectangular area. If the integrated rectangular areas include a "nest", in which one rectangular area lies inside another, or an "overlap", in which a plurality of rectangular areas overlap, a removal process eliminates them. The integration process and the removal process are repeated in order until the rectangular areas no longer change, and the resulting rectangular areas are separated and extracted as character areas.

In the character area separation method according to claim (4), rectangular areas are extracted by tracing the contours of the black areas of the binarized image data, a histogram of the heights of the rectangular areas is obtained, and rectangular areas serving as character areas are extracted on the basis of this histogram. Each extracted rectangular area is thickened outward by a predetermined thickness and then thinned by shaving a predetermined thickness from the outside, which joins rectangular areas that lie close together. For the rectangular areas obtained by this joining, an integration process combines rectangular areas that lie relatively close to one another into one rectangular area. If the integrated rectangular areas include a "nest", in which one rectangular area lies inside another, or an "overlap", in which a plurality of rectangular areas overlap, a removal process eliminates them. The integration process and the removal process are repeated in order until the rectangular areas no longer change, and the resulting rectangular areas are separated and extracted as character areas.

[Operation]

In the present invention, the integration processing means repeats, for the rectangular areas separated as character areas, the process of integrating adjacent rectangular areas and the process of removing "nests" and "overlaps" among the integrated rectangular areas, so that the areas can be gathered into still larger rectangular areas.

Separating and extracting these large rectangular areas as character areas therefore makes it easier to stabilize the cutout process in the subsequent character recognition processing.

Furthermore, in the present invention, applying thickening and thinning to the rectangular areas separated as character areas, or to the rectangular areas in the course of the integration process, greatly improves the efficiency of the integration process.

[Embodiment]

An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing an example of a device configuration that implements the character area separation method of the present invention.

In the figure, the character recognition device 21 is started together with the image scanner 25 under start-up control from the host device (personal computer) 23. The binarized image data (image information) input from the image scanner 25 is stored in the associated memory (RAM) 27 of the character recognition device 21.

The character area separation unit 28 of the character recognition device 21 separates and extracts character areas using the memory 27, which consists of image memories I, II, and III and tables I and II. The separated character areas are passed to the character recognition processing unit 29, which obtains each character image from the character areas by a predetermined cutout process, performs character recognition, and reports the recognition result to the host device 23.

FIG. 2 is a flowchart showing the procedure of an embodiment of the character area separation method of the present invention. FIG. 3 shows an example of the processing result at each processing step.

The flow of the character area separation and extraction process is described below with reference to FIGS. 1 to 3.

The binarized image data (image information) read by the image scanner 25 is stored in image memory I. This image data is converted into a 1/8 reduced image (FIG. 3(a)) using an existing reduction algorithm and stored in image memory II.
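The text only says that an existing reduction algorithm produces the 1/8 image, so the following is a minimal sketch under an assumed rule: a reduced pixel is black if any pixel in its 8 x 8 source block is black. The function name `reduce_binary` and the list-of-rows image representation are illustrative choices, not taken from the patent.

```python
def reduce_binary(image, factor=8):
    """Shrink a binary image (list of rows of 0/1) by `factor` in each direction.

    Assumption: a reduced pixel is black (1) if any pixel in its
    factor x factor source block is black. The patent does not say which
    reduction rule is used.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    reduced = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            # Scan the source block, clipping it at the image border.
            block_has_black = any(
                image[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            )
            row.append(1 if block_has_black else 0)
        reduced.append(row)
    return reduced
```

An OR rule of this kind keeps thin strokes visible after reduction, which suits the later contour tracing; other rules (for example majority voting) would be equally compatible with the text.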

On the reduced image (image memory II), the contour of each black area made up of black points is traced, the rectangular area defined by the minimum and maximum vertical and horizontal coordinates of the points on the contour (FIG. 3(b)) is extracted, and its coordinate data is stored in table I. Image memory III is used as the working area for this process.
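As a rough illustration of this step, the sketch below computes the same kind of rectangle for each black area. It deliberately swaps contour tracing for a breadth-first flood fill over 8-connected black pixels, which yields the same minimum and maximum coordinates for a connected area; the 8-connectivity, the function name `bounding_boxes`, and the `(x_min, y_min, x_max, y_max)` tuple layout are assumptions.

```python
from collections import deque

def bounding_boxes(image):
    """Return (x_min, y_min, x_max, y_max) for each 8-connected black area.

    Note: this uses flood fill rather than the contour tracing described in
    the text; for a connected area both give the same circumscribing rectangle.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not image[y0][x0] or seen[y0][x0]:
                continue
            # Start a new area and grow it over all connected black pixels.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

The returned list plays the role of table I in the description: one row of coordinate data per rectangular area.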

From the coordinate data stored in table I, a histogram of the heights of the rectangular areas (FIG. 3(c)) is obtained, and from this histogram a threshold separating graphic areas from character areas (the height just past the first peak, marked ★ in FIG. 3(c)) is determined and stored in table II. Rectangular areas smaller than this threshold are then extracted as character areas and stored in table I.
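The thresholding step can be sketched as follows. How "the height just past the first peak" is located is not spelled out in the text, so this sketch assumes that the first peak is the most frequent height and that it ends at the first larger height that no rectangle has. The names `height_threshold` and `character_boxes` and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical.

```python
from collections import Counter

def height_threshold(boxes):
    """Pick a threshold just past the first hump of the height histogram.

    Assumption: the hump is centred on the most frequent height and ends at
    the first larger height with a zero count.
    """
    heights = [y_max - y_min + 1 for (_, y_min, _, y_max) in boxes]
    histogram = Counter(heights)
    # Most frequent height; ties are resolved in favour of the smaller height.
    peak_height = max(histogram, key=lambda h: (histogram[h], -h))
    threshold = peak_height + 1
    while histogram[threshold] > 0:
        threshold += 1
    return threshold

def character_boxes(boxes):
    """Keep the rectangles whose height is below the threshold."""
    threshold = height_threshold(boxes)
    return [b for b in boxes if (b[3] - b[1] + 1) < threshold]
```

With heights 3, 3, 4, 3 and 20, for instance, the hump covers heights 3 and 4, the threshold becomes 5, and only the rectangle of height 20 is treated as a graphic area.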

The processing up to this point is the same as in the conventional method; the features of the method of the present invention lie in the processing that follows.

The interior of each rectangular area stored in table I is filled in (FIG. 3(d)) and stored in image memory II.

Next, to join adjacent rectangular areas, each rectangular area stored in image memory II is thickened outward by two dots (FIG. 3(e)) and then thinned by shaving two dots from the outside (FIG. 3(f)), and the result is stored in image memory II. Thickening followed by thinning fills in the small gaps between adjacent rectangular areas.
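Thickening followed by thinning corresponds to what image processing usually calls a morphological closing (dilation, then erosion). The sketch below assumes a square neighbourhood of the given radius and treats pixels outside the image as absent; both choices, and the function names, are illustrative rather than taken from the patent.

```python
def dilate(image, radius):
    """Thicken: every black pixel blackens its (2*radius+1)-square neighbourhood."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[ny][nx] = 1
    return out

def erode(image, radius):
    """Thin: a pixel stays black only if its whole in-image neighbourhood is black."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            out[y][x] = int(all(
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ))
    return out

def thicken_then_thin(image, dots=2):
    """Close gaps of up to 2*dots pixels between black areas."""
    return erode(dilate(image, dots), dots)
```

In a single-row example with black runs separated by gaps of three and seven pixels, a two-dot closing bridges the three-pixel gap and leaves the seven-pixel gap open, while the outer extent of the black areas is unchanged.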

The contours of the black areas made up of black points are then traced again on image memory II, the rectangular areas circumscribing the black areas (FIG. 3(g)) are extracted in the same way, and their coordinate data is stored in table I.

The areas separated and extracted at this point are character areas, but what is a single area on the original document is often split into separate areas, so rectangular areas that lie relatively close together are further integrated.

In this integration of adjacent rectangular areas, the maximum value (☆ in FIG. 3(c)) is obtained from the histogram of the heights of the rectangular areas stored in table II, any two rectangular areas whose mutual distance is smaller than this value are integrated into one rectangular area (FIG. 3(h)), and the result is stored in table I. This integration process gives rise to "nests", in which one rectangular area lies inside another, and "overlaps", in which a plurality of rectangular areas overlap.
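One pass of this integration might look like the sketch below. The text does not define how the distance between two rectangles is measured, so `gap` assumes the larger of the horizontal and vertical clearances in dots; `limit` stands in for the value read from the height histogram. Both function names and the `(x_min, y_min, x_max, y_max)` layout are hypothetical.

```python
def gap(a, b):
    """Clearance in dots between two boxes; 0 if they touch or overlap.

    Assumption: the distance is the larger of the horizontal and vertical gaps.
    """
    dx = max(a[0] - b[2] - 1, b[0] - a[2] - 1, 0)
    dy = max(a[1] - b[3] - 1, b[1] - a[3] - 1, 0)
    return max(dx, dy)

def merge_close_boxes(boxes, limit):
    """One integration pass: merge pairs of boxes closer than `limit`."""
    remaining = list(boxes)
    merged = []
    while remaining:
        current = remaining.pop(0)
        for i, other in enumerate(remaining):
            if gap(current, other) < limit:
                # Replace the pair by the rectangle that encloses both.
                remaining.pop(i)
                current = (min(current[0], other[0]), min(current[1], other[1]),
                           max(current[2], other[2]), max(current[3], other[3]))
                break
        merged.append(current)
    return merged
```

Because each box is merged with at most one partner per pass, the enlarged rectangles can end up containing or overlapping rectangles that were not yet merged, which is the situation the "nest" and "overlap" handling addresses.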

A removal process is therefore performed: for a "nest" the larger rectangular area is kept, and for an "overlap" a new rectangular area enclosing the overlapping areas is created and the old rectangular areas are deleted (FIG. 3(i)). The result is stored in table I.
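The removal step can be sketched as below, again with boxes as `(x_min, y_min, x_max, y_max)` tuples. A nest is handled as a special case of an overlap here, since the rectangle enclosing a nested pair is simply the larger box; that simplification and the function names are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if the two boxes share at least one pixel (this includes nesting)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def remove_nests_and_overlaps(boxes):
    """Replace every group of nested or overlapping boxes by one enclosing box."""
    result = []
    for box in boxes:
        changed = True
        while changed:
            changed = False
            for kept in result:
                if overlaps(box, kept):
                    # Delete the old rectangle and grow the new one around both.
                    result.remove(kept)
                    box = (min(box[0], kept[0]), min(box[1], kept[1]),
                           max(box[2], kept[2]), max(box[3], kept[3]))
                    changed = True
                    break
        result.append(box)
    return result
```

After this step no two rectangles in the list share a pixel, so the list is again a clean set of candidate character areas.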

The integration of adjacent rectangular areas and the removal of "nests" and "overlaps" are then repeated until table I reaches a steady state (no further changes). This determines the final rectangular areas enclosing the character areas (FIG. 3(j)); the coordinates of each rectangular area are stored in table I, and the character area separation and extraction process ends.
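The repeat-until-steady control flow can be sketched independently of the two steps it drives. In the sketch below, `integrate` and `clean_up` are caller-supplied functions standing in for the integration and removal processes; `toy_integrate` is a deliberately simplified stand-in that looks only at horizontal gaps, and the `max_rounds` safety limit is an addition not mentioned in the text.

```python
def repeat_until_stable(boxes, integrate, clean_up, max_rounds=1000):
    """Alternate integration and clean-up until the box list stops changing."""
    current = sorted(boxes)
    for _ in range(max_rounds):
        updated = sorted(clean_up(integrate(current)))
        if updated == current:  # steady state: the table no longer changes
            return current
        current = updated
    return current

def toy_integrate(boxes):
    """Toy stand-in: merge the first pair whose horizontal gap is under 3 dots."""
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            if b[0] - a[2] - 1 < 3 and a[0] - b[2] - 1 < 3:
                rest = [c for k, c in enumerate(boxes) if k not in (i, j)]
                return rest + [(min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))]
    return boxes

boxes = [(0, 0, 4, 5), (6, 0, 9, 5), (11, 0, 14, 5), (30, 0, 34, 5)]
final = repeat_until_stable(boxes, toy_integrate, lambda b: b)
```

In this toy run the first three boxes are merged over two rounds and the third round changes nothing, so the loop stops with two rectangles. Sorting the list before comparison keeps the steady-state test independent of the order in which the steps return their boxes.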

The processing described above is what characterizes the method of the present invention. The coordinates of each character area obtained here are passed to character recognition processing, where image cutout and recognition are performed for each character area.

[Effect of the invention]

As described above, the present invention builds on a character area separation method that makes accurate separation of graphic areas and character areas easy, can separate and extract character areas within a table, for example, and enables fully automatic character recognition processing. According to the invention, the separated character areas can be gathered into large areas, which makes it easier to stabilize the per-character cutout process for those character areas and improves the character recognition rate.

In addition, applying thickening and thinning to each character area before or during the integration process greatly improves the efficiency of the integration process and reduces the memory capacity needed for it.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an example of a device configuration that implements the method of the present invention, FIG. 2 is a flowchart showing the procedure of an embodiment of the method of the present invention, and FIG. 3 shows an example of the processing result at each processing step. In the figures, 21 is the character recognition device, 23 is the host device (personal computer), 25 is the image scanner, 27 is the memory (RAM), 28 is the character area separation unit, and 29 is the character recognition processing unit.

Claims

(57) [Claims]

1. A character recognition device that separates character areas from binarized image data and recognizes characters, comprising integration processing means that, for the rectangular areas separated as character areas, repeats in order a process of integrating adjacent rectangular areas and a process of removing "nests" and "overlaps" among the integrated rectangular areas, and extracts the resulting rectangular areas as character areas.

2. The character recognition device according to claim 1, wherein the process of integrating adjacent rectangular areas obtains the maximum value from the histogram of the heights of the separated rectangular areas and integrates a plurality of rectangular areas whose mutual distance is smaller than the maximum value into one rectangular area.

3. A character area separation method for a character recognition device that recognizes characters from binarized image data, characterized by: extracting rectangular areas by tracing the contours of the black areas of the binarized image data; obtaining a histogram of the heights of the rectangular areas and extracting rectangular areas serving as character areas on the basis of the histogram; performing an integration process that combines extracted rectangular areas lying relatively close to one another into one rectangular area; performing, when the integrated rectangular areas include a "nest" in which one rectangular area lies inside another or an "overlap" in which a plurality of rectangular areas overlap, a removal process that eliminates the "nest" and "overlap"; repeating the integration process and the removal process in order until the rectangular areas no longer change; and separating and extracting the resulting rectangular areas as character areas.

4. A character area separation method for a character recognition device that recognizes characters from binarized image data, characterized by: extracting rectangular areas by tracing the contours of the black areas of the binarized image data; obtaining a histogram of the heights of the rectangular areas and extracting rectangular areas serving as character areas on the basis of the histogram; joining adjacent rectangular areas by thickening each extracted rectangular area outward by a predetermined thickness and then thinning it by shaving a predetermined thickness from the outside; performing, on the rectangular areas obtained by the joining, an integration process that combines rectangular areas lying relatively close to one another into one rectangular area; performing, when the integrated rectangular areas include a "nest" in which one rectangular area lies inside another or an "overlap" in which a plurality of rectangular areas overlap, a removal process that eliminates the "nest" and "overlap"; repeating the integration process and the removal process in order until the rectangular areas no longer change; and separating and extracting the resulting rectangular areas as character areas.