JPH06231306A

JPH06231306A - Character recognition device

Info

Publication number: JPH06231306A
Application number: JP5017245A
Authority: JP
Inventors: Noboru Nakamura; 昇中村
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-02-04
Filing date: 1993-02-04
Publication date: 1994-08-19

Abstract

PURPOSE: To provide a character recognition device with excellent operability that can easily recognize a document containing halftone dot characters. CONSTITUTION: The device comprises: a primary character area determination unit 1 that obtains the circumscribed rectangles of connected figures from a primary binary image, obtained by reading the document with a scanner, and determines character areas from the sizes of those rectangles; a halftone dot area extraction unit 2 that extracts a halftone dot area from the non-character areas based on the ratio of black/white change points to the overall size; a secondary character area determination unit 3 that determines character areas, in the same way as the primary character area determination unit 1, from a secondary binary image obtained by reading the halftone dot area with the scanner at a lighter density; a character cutout unit 5 that cuts out character patterns from the primary and secondary character areas; a character feature extraction unit 6 that extracts features from each character pattern; a recognition accuracy calculation unit 8 that compares the character features with a character feature dictionary 7, which stores the features of all characters, and obtains a recognition accuracy; and a recognized character determination unit 9 that determines the recognized character from the recognition accuracy.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a character recognition device that reads a document or the like and converts its characters into the corresponding character codes, and more particularly to a character recognition device capable of recognizing a document that contains characters overlaid with a halftone dot screen (hereinafter referred to as halftone dot characters).

[0002]

[Prior Art] In recent years, with the spread of word processors and the like, halftone dot characters have come to be used frequently, for example to emphasize characters in a document, and character recognition devices capable of recognizing documents that contain such halftone dot characters are being developed.

[0003] A conventional character recognition device is described below. FIG. 3 shows a recognition target document containing halftone dot characters. FIG. 4 shows the binary image obtained when the recognition target document of FIG. 3 is read from a scanner at normal density. FIG. 5 shows the binary image obtained when the recognition target document of FIG. 3 is read from a scanner at a light density.

[0004] When a conventional character recognition device is used to recognize a document such as that shown in FIG. 3, reading the document from the scanner at normal density yields a binary image like that of FIG. 4, in which the normal character portions can be recognized but the halftone dot character portions cannot. Conversely, reading the document at a light density yields a binary image like that of FIG. 5, in which the halftone dot character portions can be recognized but the normal character portions cannot.

[0005] Therefore, to recognize a document containing halftone dot characters such as that of FIG. 3 with a conventional character recognition device, the document is first read from the scanner at normal density and the normal character portions are recognized. Next, the user adjusts the scanner to a lighter density and has the document read again so that the halftone dot character portions are recognized. Finally, the user merges the two recognition results.

[0006]

[Problems to Be Solved by the Invention] However, with the conventional configuration described above, recognizing a document that contains halftone dot characters requires the user, after the document has been read from the scanner at normal density, to adjust the scanner to a lighter density, have the document read again, and merge the two recognition results. This is cumbersome and laborious, and workability is poor.

[0007] The present invention solves the above conventional problem, and its object is to provide a character recognition device with excellent workability that can easily recognize a recognition target document containing halftone dot characters.

[0008]

[Means for Solving the Problems] To achieve this object, the character recognition device of the present invention comprises: a scanner that can adjust the reading density when reading a recognition target document and outputting a binary image, and/or can read the recognition target document as multi-valued data; a primary character area determination unit that obtains the circumscribed rectangles of connected figures from a primary binary image, which is output either by reading the recognition target document from the scanner at normal density or by reading it as multi-valued data and binarizing it with a normal threshold, and determines primary character areas from the sizes of the circumscribed rectangles; a halftone dot area extraction unit that extracts a halftone dot area from the non-character areas determined by the primary character area determination unit, based on the ratio of black/white change points to the overall size; a secondary character area determination unit that determines secondary character areas, in the same way as the primary character area determination unit, from a secondary binary image, which is output either by reading the halftone dot area portion of the recognition target document again from the scanner at a density lighter than normal or by binarizing the already-read multi-valued data of the halftone dot area portion with a threshold set lighter than the normal threshold; a character cutout unit that cuts out character patterns, according to the size and position of the circumscribed rectangles, from the primary character areas determined by the primary character area determination unit and the secondary character areas determined by the secondary character area determination unit; a character feature extraction unit that extracts character features from the character patterns cut out by the character cutout unit; a character feature dictionary that stores in advance the character features of all characters; a recognition accuracy calculation unit that compares the character features extracted by the character feature extraction unit with the character feature dictionary and obtains a recognition accuracy such as character candidates and similarities; and a recognized character determination unit that determines the recognized character from the recognition accuracy obtained by the recognition accuracy calculation unit.
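
As an illustration of the primary character area determination described above, the following is a minimal sketch, not the patented implementation. It finds the circumscribed rectangle of each connected figure in a binary image and judges by size whether the rectangle could be a character. The 8-connectivity, the function names, and the bounds `min_side` and `max_side` are assumptions made for the example; the integration of neighboring rectangles described in the embodiment is omitted.

```python
from collections import deque


def bounding_rectangles(image):
    """Return (top, left, bottom, right) for each 8-connected group of black (1) pixels."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    rects = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search over one connected figure,
            # growing its circumscribed rectangle as pixels are visited.
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            rects.append((top, left, bottom, right))
    return rects


def is_character_sized(rect, min_side=2, max_side=6):
    """Judge a rectangle as a character candidate from its size alone.

    min_side and max_side are hypothetical bounds chosen for the example.
    """
    top, left, bottom, right = rect
    rect_height = bottom - top + 1
    rect_width = right - left + 1
    return (min_side <= rect_height <= max_side
            and min_side <= rect_width <= max_side)
```

Rectangles that fail the size test would form the non-character areas that are passed on to the halftone dot area extraction unit.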

[0009]

[Operation] With this configuration, the primary character area determination unit determines the primary character areas. For the halftone dot area extracted by the halftone dot area extraction unit, the secondary character area determination unit either reads the area again from the scanner at a light density or binarizes the already-read multi-valued data with a threshold set lighter than normal, and determines the secondary character areas from the resulting secondary binary image in the same way as the primary character area determination unit. By performing character recognition on the primary and secondary character areas, a recognition target document containing halftone dot characters can be recognized easily and automatically.
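
The halftone dot area extraction mentioned above relies on the ratio of black/white change points to the overall size of a region. One way this could be computed is sketched below, under stated assumptions: change points are counted along rows only, the overall size is taken to be the pixel count of the region, and the cutoff `min_ratio` is a hypothetical value. The patent does not specify any of these details.

```python
def change_point_ratio(region):
    """Number of black/white changes along rows, divided by the pixel count."""
    height, width = len(region), len(region[0])
    changes = sum(
        1
        for row in region
        for a, b in zip(row, row[1:])
        if a != b
    )
    return changes / (height * width)


def is_halftone_area(region, min_ratio=0.4):
    """Treat a region as halftone when black and white alternate densely.

    min_ratio is a hypothetical cutoff, not a value from the patent.
    """
    return change_point_ratio(region) >= min_ratio
```

A finely dotted region alternates between black and white far more often per unit area than a photograph, a rule line, or a solid block, which is why this ratio can separate halftone areas from other non-character areas.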

[0010]

[Embodiment] A character recognition device according to an embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of the character recognition device according to the embodiment. Reference numeral 1 denotes a primary character area determination unit that obtains the circumscribed rectangles of connected figures from a primary binary image, input by reading the recognition target document from a scanner (not shown) at normal density, integrates the circumscribed rectangles according to their distance and size, and determines primary character areas and non-character areas based on the sizes of the integrated rectangles. Reference numeral 2 denotes a halftone dot area extraction unit that extracts a halftone dot area from the non-character areas determined by the primary character area determination unit 1, based on the overall size and the ratio of black/white change points. Reference numeral 3 denotes a secondary character area determination unit that determines secondary character areas and non-character areas, in the same way as the primary character area determination unit 1, from a secondary binary image obtained by reading the halftone dot area portion extracted by the halftone dot area extraction unit 2 again from the scanner (not shown) at a density lighter than normal. Reference numeral 4 denotes a character line extraction unit that extracts character lines by taking vertical and horizontal projections of the circumscribed rectangles in the primary character areas determined by the primary character area determination unit 1 and the secondary character areas determined by the secondary character area determination unit 3. Reference numeral 5 denotes a character cutout unit that estimates the character size from the character line width determined by the character line extraction unit 4 and the sizes of the circumscribed rectangles and, using this character size as a reference, merges the circumscribed rectangles according to their size and distance while removing noise components, and cuts out the result as character patterns. Reference numeral 6 denotes a character feature extraction unit that extracts character features from the character patterns cut out by the character cutout unit 5. Reference numeral 7 denotes a character feature dictionary that stores in advance the character features of all characters. Reference numeral 8 denotes a recognition accuracy calculation unit that compares the character features extracted by the character feature extraction unit 6 with the character feature dictionary 7 and obtains a recognition accuracy such as character candidates and similarities. Reference numeral 9 denotes a recognized character determination unit that determines the recognized character from the recognition accuracy obtained by the recognition accuracy calculation unit 8.
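
The projection-based character line extraction performed by unit 4 can be sketched as follows. This is an illustrative simplification under stated assumptions: only horizontally written lines are handled, so the circumscribed rectangles are projected onto the vertical axis alone, and every run of rows covered by at least one rectangle is taken as one character line. The function name and the tuple layout `(top, left, bottom, right)` are hypothetical.

```python
def extract_text_lines(rects, page_height):
    """Return (top, bottom) row ranges of character lines.

    rects holds circumscribed rectangles as (top, left, bottom, right).
    """
    # Projection profile: how many rectangles cover each row of the page.
    profile = [0] * page_height
    for top, _left, bottom, _right in rects:
        for y in range(top, bottom + 1):
            profile[y] += 1

    # Each maximal run of non-empty rows is one character line.
    lines = []
    start = None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y
        elif count == 0 and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, page_height - 1))
    return lines
```

The height of each returned range corresponds to the character line width from which unit 5 estimates the character size.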

[0011] The operation of the character recognition device configured as described above is explained below. FIG. 2 is a flowchart of the character recognition device according to the embodiment. First, the primary character area determination unit 1 obtains the circumscribed rectangles of connected figures from the primary binary image, input by reading the recognition target document from the scanner (not shown) at normal density, integrates the circumscribed rectangles according to their distance and size, and determines primary character areas and non-character areas based on the sizes of the integrated rectangles (S1). Next, the halftone dot area extraction unit 2 extracts a halftone dot area from the non-character areas determined in S1, based on the overall size and the ratio of black/white change points (S2). Next, the secondary character area determination unit 3 obtains the circumscribed rectangles of connected figures from the secondary binary image, input by reading the halftone dot area portion extracted in S2 again from the scanner (not shown) at a density lighter than normal, integrates the circumscribed rectangles according to their distance and size, and determines secondary character areas and non-character areas based on the sizes of the integrated rectangles (S3). Next, the character line extraction unit 4 takes vertical and horizontal histograms of the circumscribed rectangles in the primary character areas determined in S1 and the secondary character areas determined in S3, and extracts character lines (S4). Next, the character cutout unit 5 estimates the character size from the character line width extracted in S4 and the distribution of circumscribed rectangle sizes (S5). Next, using the character size estimated in S5 as a reference, the character cutout unit 5 merges the circumscribed rectangles according to their size and distance while removing noise components, and cuts out the result as character patterns (S6). Next, the character feature extraction unit 6 extracts character features from the character patterns cut out in S6 (S7). Next, the recognition accuracy calculation unit 8 compares the character features extracted in S7 with the character feature dictionary 7 and obtains a recognition accuracy such as character candidates and similarities (S8). Finally, the recognized character determination unit 9 determines, as the recognized character, the character candidate obtained in S8 that has the highest similarity (S9).
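
Steps S8 and S9 can be sketched as below. The patent does not name the feature representation or the similarity measure, so this example assumes that character features are plain numeric vectors and that similarity is the cosine similarity between them. The function names and the dictionary layout (character mapped to reference vector) are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(feature, dictionary):
    """S8: score every dictionary entry and return (character, similarity) pairs, best first."""
    scored = [(char, cosine_similarity(feature, ref))
              for char, ref in dictionary.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def decide_character(candidates):
    """S9: pick the candidate with the highest similarity, or None if there are none."""
    return candidates[0][0] if candidates else None
```

For example, `decide_character(rank_candidates(feature, dictionary))` returns the single recognized character, while the ranked list itself corresponds to the character candidates and similarities that make up the recognition accuracy.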

[0012] In this embodiment, the primary character area determination unit 1 and the secondary character area determination unit 3 read the recognition target document from the scanner (not shown) twice, at different densities. Alternatively, the scanner (not shown) may be one that can read the recognition target document as multi-valued data. In that case, the primary character area determination unit 1 reads the recognition target document from the scanner as multi-valued data and binarizes it with a normal threshold to obtain the primary binary image, and the secondary character area determination unit 3 binarizes the multi-valued data already read by the primary character area determination unit 1 with a threshold lighter than normal to obtain the secondary binary image. This is preferable because the recognition target document is read from the scanner only once, which shortens the time required for character recognition.
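
The single-scan variant can be illustrated with the following sketch. It assumes 8-bit multi-valued data in which 0 is black and 255 is white, and it interprets a "lighter" threshold as one that turns fewer pixels black. The two threshold values and the function names are hypothetical and chosen only for the example.

```python
NORMAL_THRESHOLD = 128  # hypothetical value for the normal threshold
LIGHT_THRESHOLD = 64    # hypothetical value: only clearly dark pixels stay black


def binarize(gray, threshold):
    """Map multi-valued pixels (0 = black, 255 = white) to 1 (black) or 0 (white)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def primary_and_secondary(gray, halftone_rect):
    """Derive both binary images from one multi-valued scan.

    halftone_rect is (top, left, bottom, right) of the extracted halftone dot area.
    """
    primary = binarize(gray, NORMAL_THRESHOLD)
    top, left, bottom, right = halftone_rect
    # Re-binarize only the halftone dot area, using the lighter threshold.
    crop = [row[left:right + 1] for row in gray[top:bottom + 1]]
    secondary = binarize(crop, LIGHT_THRESHOLD)
    return primary, secondary
```

With mid-gray halftone dots (value 100) surrounding a dark stroke (value 20), the normal threshold turns both black, while the lighter threshold keeps only the stroke.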

[0013] It is also preferable to attach, to each recognized character obtained from a secondary character area, information indicating that the character was a halftone dot character. The portions that were halftone dot characters in the recognition target document can then be detected easily from the recognition result, and those portions can, for example, be displayed as halftone dot characters when the recognition result is displayed.
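
A minimal sketch of attaching such information to the recognition result is shown below. The `RecognizedChar` type, the use of a horizontal position to restore reading order, and the bracket notation used to mark halftone dot characters in the output are all assumptions made for illustration; the patent does not prescribe a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognizedChar:
    code: str       # recognized character
    halftone: bool  # True if the character came from a secondary character area


def merge_results(primary, secondary):
    """Merge (x_position, character) results from both areas in reading order."""
    tagged = [(x, RecognizedChar(char, False)) for x, char in primary]
    tagged += [(x, RecognizedChar(char, True)) for x, char in secondary]
    return [item for _x, item in sorted(tagged, key=lambda pair: pair[0])]


def render(chars):
    """Show the result, marking halftone dot characters with brackets."""
    return "".join(f"[{c.code}]" if c.halftone else c.code for c in chars)
```

For instance, merging `[(0, "a"), (2, "c")]` from the primary areas with `[(1, "b")]` from a secondary area and rendering the result gives `a[b]c`.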

[0014]

[Effects of the Invention] As described above, according to the present invention, the primary character area determination unit determines the primary character areas. For the halftone dot area extracted by the halftone dot area extraction unit, the secondary character area determination unit either reads the area again from the scanner at a light density or binarizes the already-read multi-valued data with a threshold set lighter than normal, and determines the secondary character areas from the resulting secondary binary image in the same way as the primary character area determination unit. By performing character recognition on the primary and secondary character areas, the invention realizes a character recognition device with excellent workability that can easily and automatically recognize a recognition target document containing halftone dot characters.

[Brief description of drawings]

FIG. 1 is a block diagram of a character recognition device according to an embodiment of the present invention.

FIG. 2 is a flowchart of the character recognition device according to the embodiment of the present invention.

FIG. 3 is a diagram showing a recognition target document containing halftone dot characters.

FIG. 4 is a diagram showing the binary image obtained when the recognition target document of FIG. 3 is read from a scanner at normal density.

FIG. 5 is a diagram showing the binary image obtained when the recognition target document of FIG. 3 is read from a scanner at a light density.

[Explanation of symbols]

1 Primary character area determination unit
2 Halftone dot area extraction unit
3 Secondary character area determination unit
4 Character line extraction unit
5 Character cutout unit
6 Character feature extraction unit
7 Character feature dictionary
8 Recognition accuracy calculation unit
9 Recognized character determination unit

Claims

[Claims]

1. A character recognition device comprising: a scanner that can adjust the reading density when reading a recognition target document and outputting a binary image, and/or can read the recognition target document as multi-valued data; a primary character area determination unit that obtains the circumscribed rectangles of connected figures from a primary binary image, which is output either by reading the recognition target document from the scanner at normal density or by reading it as multi-valued data and binarizing it with a normal threshold, and determines primary character areas from the sizes of the circumscribed rectangles; a halftone dot area extraction unit that extracts a halftone dot area from the non-character areas determined by the primary character area determination unit, based on the ratio of black/white change points to the overall size; a secondary character area determination unit that determines secondary character areas, in the same way as the primary character area determination unit, from a secondary binary image, which is output either by reading the halftone dot area portion of the recognition target document, extracted by the halftone dot area extraction unit, again from the scanner at a density lighter than normal or by binarizing the already-read multi-valued data of the halftone dot area portion with a threshold set lighter than the normal threshold; a character cutout unit that cuts out character patterns, according to the size and position of the circumscribed rectangles, from the primary character areas determined by the primary character area determination unit and the secondary character areas determined by the secondary character area determination unit; a character feature extraction unit that extracts character features from the character patterns cut out by the character cutout unit; a character feature dictionary that stores in advance the character features of all characters; a recognition accuracy calculation unit that compares the character features extracted by the character feature extraction unit with the character feature dictionary and obtains a recognition accuracy such as character candidates and similarities; and a recognized character determination unit that determines the recognized character from the recognition accuracy obtained by the recognition accuracy calculation unit.