JPH10214308A - Character discrimination method

Info

Publication number: JPH10214308A
Application number: JP9015471A
Authority: JP
Inventors: Takakuni Minewaki; 隆邦嶺脇; Toshihiro Suzuki; 俊博鈴木; Shiori Ooaku; 志緒理大阿久; Shinobu Yamamoto; 忍山本; Toshio Miyazawa; 利夫宮澤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1997-01-29
Filing date: 1997-01-29
Publication date: 1998-08-11

Abstract

PROBLEM TO BE SOLVED: To improve the precision of character discrimination by using features other than the height of character data and by using a plurality of features together. SOLUTION: A character rectangle extraction part 1 extracts character rectangles (bounding rectangles of connected black pixels) from a prescribed area of a document image. A character rectangle feature calculation part 2 calculates the area of each extracted rectangle, and a standard deviation calculation part 3 obtains the standard deviation D of the set of rectangle area values. A comparison and decision part 4 decides that the characters in the prescribed area are handwritten when the standard deviation D is larger than a threshold Dth and that they are printed when D is equal to or smaller than Dth. Recognition processing is then performed in parts 5 and 6 using the respective dictionaries.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to preprocessing in a character recognition device (OCR) that recognizes Japanese or English, and more particularly to a character discrimination method for determining whether the characters of a document are handwritten or printed.

[0002]

[Prior art] Conventional character recognition methods include techniques for recognizing handwritten characters and techniques for recognizing printed characters. Documents to be recognized take various forms, including documents in which handwritten and printed characters are mixed. A character recognition device is therefore needed that can process a document without the user having to take its type into account, even when handwritten and printed characters are mixed on a single page or when it is not known in advance whether the document is handwritten or printed. This requires a process that determines whether the characters on the document are handwritten or printed.

[0003] A conventional technique for discriminating between handwritten and printed characters is the optical character reader described in Japanese Patent Application Laid-Open No. Hei 5-189604. That device determines whether characters are handwritten or printed based on the variation in the distribution of the heights of the segmented character data.

[0004]

[Problem to be solved by the invention] However, because the above device discriminates characters using only the height of the character data, character recognition performed after the discrimination is likely to produce recognition errors.

[0005] An object of the present invention is to provide a character discrimination method that improves the accuracy of character discrimination by using features other than the height of character data and by using a plurality of features in combination, and thereby improves the accuracy of the character recognition performed after discrimination.

[0006]

[Means for solving the problem] To achieve the above object, the invention according to claim 1 is characterized in that character rectangles are extracted from a predetermined character area on a document image, a feature of each extracted character rectangle is calculated, and whether the characters in the character area are handwritten or printed is determined based on the statistical distribution of the calculated features.

[0007] The invention according to claim 2 is characterized in that the area of the character rectangle is used as the feature of the character rectangle.

[0008] The invention according to claim 3 is characterized in that the width of the character rectangle is used as the feature of the character rectangle.

[0009] The invention according to claim 4 is characterized in that the ratio between the height and the width of the character rectangle is used as the feature of the character rectangle.

[0010] The invention according to claim 5 is characterized in that the distance between the center coordinate position of the character rectangle and the line center line coordinate is used as the feature of the character rectangle.

[0011] The invention according to claim 6 is characterized in that the line-direction distance between the center position coordinates of adjacent character rectangles is used as the feature of the character rectangle.

[0012] The invention according to claim 7 is characterized in that the features of claims 2 to 6 are used in combination as the feature of the character rectangle.

[0013]

[Embodiments of the invention] An embodiment of the present invention is described below in detail with reference to the drawings. FIG. 1 shows the configuration of the embodiment. In the figure, 1 is a character rectangle extraction unit that extracts black-pixel connected components corresponding to characters from a predetermined area; 2 is a character rectangle feature calculation unit that calculates a feature (area, width, etc.) of each extracted character rectangle; 3 is a standard deviation calculation unit that calculates the standard deviation of the calculated feature; 4 is a comparison and determination unit that compares the standard deviation with a predetermined threshold to determine whether the characters are handwritten or printed; 5 is a processing unit that performs recognition of handwritten characters when the characters are determined to be handwritten; 6 is a processing unit that performs recognition of printed characters when the characters are determined to be printed; 7 is an image memory; and 8 is a character rectangle data memory.

[0014] The character rectangle extraction unit extracts character rectangles using, for example, the known techniques described in Japanese Patent Application Laid-Open Nos. Sho 62-74181 and Hei 1-114992. In these techniques, character rectangles are extracted from the document image, and rectangles belonging to characters composed of separate parts (for example, 「は」 and 「い」) are merged or split by referring to the character rectangle information, producing one character rectangle per character.

[0015] <Embodiment 1> Assume that a character area corresponding to a "paragraph" has been extracted from the target document and stored in the image memory 7. The following describes how it is determined whether the character images contained in this character area are handwritten or printed. The character area may also be the entire document, a single line corresponding to a paragraph, a plurality of lines, or any range consisting of a plurality of characters.

[0016] FIG. 2 is a processing flowchart of Embodiment 1 of the present invention. First, the character rectangle extraction unit 1 uses the known technique described above to extract black-pixel connected rectangles whose size falls within a predetermined range, corresponding to the characters in the area, and stores them in the character rectangle data memory 8 (step 101). The character rectangle feature calculation unit 2 then calculates the rectangle area (X = rectangle width × rectangle height) of each of the n extracted rectangles (step 102).
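Steps 101 and 102 can be sketched roughly as follows. This is a minimal illustration, not the extraction method of the cited publications: it labels 8-connected black pixels with a breadth-first search and does not merge or split the parts of multi-part characters. The function names, the `(x, y, width, height)` tuple layout, and the `min_size` / `max_size` parameters are all assumptions made for the example.

```python
from collections import deque


def extract_rectangles(image, min_size=1, max_size=100):
    """Return bounding rectangles (x, y, width, height) of 8-connected black pixels.

    `image` is a list of rows; 1 = black pixel, 0 = white pixel.
    Rectangles whose width or height lies outside [min_size, max_size] are dropped.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    rects = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            w, h = right - left + 1, bottom - top + 1
            if min_size <= w <= max_size and min_size <= h <= max_size:
                rects.append((left, top, w, h))
    return rects


def rectangle_areas(rects):
    """Step 102: X = rectangle width * rectangle height."""
    return [w * h for (_x, _y, w, h) in rects]


# Tiny hypothetical bitmap with two separate components.
image = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
rects = extract_rectangles(image)
areas = rectangle_areas(rects)
```

A production system would more likely rely on an image-processing library for connected-component labeling; the pure-Python version is used here only to keep the sketch self-contained.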

[0017] The standard deviation calculation unit 3 calculates the standard deviation D of the set of rectangle area values Xn (step 103).

[0018]

(Equation 1)

[0019] Here, Xi (i = 1 to n) is the rectangle area data and Xm is their mean value.
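The formula image for Equation 1 is not reproduced in this text. Given the definitions of Xi, Xm, and n, the standard deviation would conventionally be written as below. The 1/n normalization is an assumption; the original equation may instead divide by n − 1.

```latex
D = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - X_m\right)^2},
\qquad
X_m = \frac{1}{n}\sum_{i=1}^{n} X_i
```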

[0020] The comparison and determination unit 4 compares the calculated standard deviation D with a predetermined threshold Dth (step 104). When D exceeds Dth (that is, when the variation in the area values is larger than a certain level), the characters in the character area are determined to be handwritten (step 105). FIG. 3 shows an example of handwritten characters and FIG. 5 shows the corresponding distribution: the features of handwritten characters are widely spread, so the standard deviation is large (the actual values are discretely distributed).

[0021] Conversely, when the standard deviation D is smaller than the threshold Dth, the characters in the area are determined to be printed (step 107). FIG. 4 shows an example of printed characters and FIG. 6 shows the corresponding distribution: the features of printed characters are narrowly spread, so the standard deviation is small (the actual values are discretely distributed).
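The comparison in steps 103 to 107 can be illustrated with a short sketch. It assumes the population form of the standard deviation (`statistics.pstdev`) and treats D equal to Dth as printed, which follows the wording of the abstract. The function name, the sample area values, and the threshold `D_TH` are hypothetical.

```python
from statistics import pstdev


def discriminate(values, threshold):
    """Return "handwritten" if the spread of `values` exceeds `threshold`, else "printed"."""
    d = pstdev(values)  # population standard deviation (1/n form, an assumption)
    return "handwritten" if d > threshold else "printed"


# Hypothetical rectangle areas in pixels.
printed_areas = [400, 410, 395, 405, 400]       # nearly uniform boxes
handwritten_areas = [250, 520, 380, 610, 300]   # widely varying boxes
D_TH = 50.0  # hypothetical threshold; the source gives no value

printed_result = discriminate(printed_areas, D_TH)
handwritten_result = discriminate(handwritten_areas, D_TH)
```

Because the area grows with the square of the character size, a usable threshold would in practice depend on the scan resolution and font size; the source does not say how Dth is chosen.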

[0022] Once the characters are determined to be handwritten or printed, processing specific to that type, for example character recognition, is applied to the area. That is, for a printed character area the printed character processing unit 6 performs recognition using a dictionary of printed characters (step 108), and for a handwritten character area the handwritten character processing unit 5 performs recognition using a dictionary of handwritten characters (step 106). This yields accurate recognition results.

[0023] <Embodiment 2> In Embodiment 2, the character rectangle feature calculation unit 2 calculates the width of each of the n extracted character rectangles, and the standard deviation calculation unit 3 calculates the standard deviation D of the set of rectangle width values Yn. As in Embodiment 1, the characters are determined to be handwritten when D exceeds the threshold Dth and printed when D is smaller than Dth. The processing flowchart of Embodiment 2 is that of Embodiment 1 with the rectangle area in step 102 replaced by the rectangle width and the standard deviation of the area in step 103 replaced by the standard deviation of the width.

[0024] <Embodiment 3> In Embodiment 3, the ratio between the height and the width of the character rectangle is calculated as its feature; that is, whether the rectangle is tall, square, or wide is used as the feature. As in Embodiment 1, the standard deviation of this ratio is calculated over the set of character rectangles and compared with a predetermined threshold to determine whether the characters are handwritten (large variation in the ratio) or printed (small variation in the ratio).
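A minimal sketch of the ratio feature follows. The source does not say whether the ratio is height/width or width/height, so height/width is an assumption here, as are the function name, the `(x, y, width, height)` tuple layout, and the sample rectangles.

```python
from statistics import pstdev


def aspect_ratios(rects):
    """Height divided by width for each (x, y, width, height) rectangle."""
    return [h / w for (_x, _y, w, h) in rects]


# Hypothetical rectangles: uniform square boxes versus boxes of mixed shape.
printed = [(0, 0, 20, 20), (24, 0, 20, 20), (48, 0, 20, 20)]
handwritten = [(0, 0, 10, 30), (15, 5, 30, 12), (50, 2, 18, 20)]

printed_spread = pstdev(aspect_ratios(printed))
handwritten_spread = pstdev(aspect_ratios(handwritten))
```

Unlike area or width, the ratio is dimensionless, so its spread is less sensitive to the overall character size. This is an observation about the sketch, not a claim made in the source.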

[0025] <Embodiment 4> In Embodiment 4, the distance s between the center coordinate position of the character rectangle and the line center line coordinate is used as the feature of the character rectangle. FIG. 7 shows the relationship between the center of a character rectangle and the line center line. The standard deviation of the distance s is calculated and, as in Embodiment 1, compared with a predetermined threshold to determine whether the characters are handwritten (large variation in the distance) or printed (small variation in the distance).

[0026] Here, the center line of a line is determined, for example, by fitting the straight line that minimizes the squared distance error from the set of center positions of the character rectangles making up the line, or by fitting the straight line that minimizes the squared distance error from the set of centroid positions weighted by the areas of the character rectangles.
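The first of these fitting options can be sketched as follows, assuming horizontal text and an ordinary least-squares fit y = a·x + b through the rectangle centers (which minimizes vertical rather than perpendicular residuals, a simplification). The distance s is then taken as the perpendicular distance from each center to the fitted line. All function names and sample rectangles are hypothetical.

```python
from math import hypot


def centers(rects):
    """Center point of each (x, y, width, height) rectangle."""
    return [(x + w / 2, y + h / 2) for (x, y, w, h) in rects]


def fit_line(points):
    """Ordinary least-squares fit y = a*x + b through `points`."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx  # assumes the centers do not all share one x coordinate
    return a, my - a * mx


def distances_to_line(points, a, b):
    """Perpendicular distance s from each point to the line y = a*x + b."""
    return [abs(a * x - y + b) / hypot(a, 1.0) for x, y in points]


# Hypothetical printed line: equal boxes on a common baseline.
rects = [(0, 0, 20, 20), (24, 0, 20, 20), (48, 0, 20, 20)]
pts = centers(rects)
a, b = fit_line(pts)
s = distances_to_line(pts, a, b)
```

For the area-weighted variant, each term in the sums would be multiplied by the rectangle's area; that extension is omitted here.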

[0027] <Embodiment 5> In Embodiment 5, as shown in FIG. 8, the line-direction distance p between the center position coordinates of adjacent character rectangles is used as the feature of the character rectangle. The standard deviation of the line-direction distance p is calculated and, as in Embodiment 1, compared with a predetermined threshold to determine whether the characters are handwritten (large variation in the line-direction distance) or printed (small variation in the line-direction distance).
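The line-direction distance p can be sketched as below, assuming horizontal text so that the line direction is the x axis and adjacency is given by sorting the centers by x. The function name and the sample rectangles are hypothetical.

```python
from statistics import pstdev


def pitches(rects):
    """Line-direction distance p between centers of adjacent (x, y, width, height) rectangles."""
    xs = sorted(x + w / 2 for (x, _y, w, _h) in rects)
    return [right - left for left, right in zip(xs, xs[1:])]


# Hypothetical lines of four characters each.
printed = [(0, 0, 20, 20), (24, 0, 20, 20), (48, 0, 20, 20), (72, 0, 20, 20)]
handwritten = [(0, 0, 14, 20), (18, 0, 30, 22), (60, 0, 12, 18), (75, 0, 25, 21)]

printed_spread = pstdev(pitches(printed))
handwritten_spread = pstdev(pitches(handwritten))
```

Proportionally spaced fonts would show some pitch variation even in print, so a threshold tuned for fixed-pitch Japanese text may not carry over to English text. The source does not address this case.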

[0028] <Embodiment 6> In Embodiment 6, the features of Embodiments 1 to 5 are combined as appropriate to make an overall determination.
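The source does not specify how the features are combined. One possible reading, shown here purely as an assumption, is a majority vote over the per-feature threshold tests. The function name, feature names, sample values, and thresholds are all hypothetical.

```python
from statistics import pstdev


def combined_decision(features, thresholds):
    """Majority vote over per-feature spread tests (one assumed way to combine features).

    `features` maps a feature name to its list of values;
    `thresholds` maps the same names to hypothetical thresholds.
    """
    votes = sum(
        1 for name, values in features.items() if pstdev(values) > thresholds[name]
    )
    return "handwritten" if votes * 2 > len(features) else "printed"


features = {
    "area": [250, 520, 380, 610, 300],
    "width": [14, 30, 12, 25, 18],
    "ratio": [1.0, 1.05, 0.98, 1.02, 1.0],
}
thresholds = {"area": 50.0, "width": 3.0, "ratio": 0.2}
result = combined_decision(features, thresholds)
```

Other combinations, such as a weighted sum of normalized standard deviations, would fit the description equally well.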

[0029] The present invention is not limited to the above and can also be realized in software. In that case, as shown in FIG. 9, a general-purpose processing device comprising a CPU, ROM, RAM, a display device, a hard disk, a keyboard, a CD-ROM drive, and the like is provided, and a program that implements the character discrimination function of the present invention is recorded on a computer storage medium such as a CD-ROM.

[0030]

[Effect of the invention] As described above, according to the present invention, various features of the character rectangles in a predetermined character area are calculated and collected, and whether the characters in the area are handwritten or printed is determined based on the variation in their statistical distribution, so the characters are discriminated with high accuracy. Furthermore, because subsequent processing suited to each type of character is performed after the discrimination, more accurate recognition results can be obtained.

[Brief description of the drawings]

FIG. 1 shows the configuration of an embodiment of the present invention.

FIG. 2 is a processing flowchart of Embodiment 1 of the present invention.

FIG. 3 shows an example of handwritten characters.

FIG. 4 shows an example of printed characters.

FIG. 5 shows the distribution of a feature of handwritten characters.

FIG. 6 shows the distribution of a feature of printed characters.

FIG. 7 shows the relationship between the center of a character rectangle and the line center line.

FIG. 8 shows the line-direction distance between character rectangles.

FIG. 9 shows a configuration example in which the present invention is realized in software.

[Explanation of symbols]

1 Character rectangle extraction unit
2 Character rectangle feature calculation unit
3 Standard deviation calculation unit
4 Comparison and determination unit
5 Handwritten character processing unit
6 Printed character processing unit
7 Image memory
8 Character rectangle data memory

Continuation of front page
(72) Inventor: Shinobu Yamamoto, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo
(72) Inventor: Toshio Miyazawa, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo

Claims

1. A character discrimination method comprising: extracting character rectangles from a predetermined character area on a document image; calculating a feature of each extracted character rectangle; and determining, based on the statistical distribution of the calculated features, whether the characters in the character area are handwritten or printed.

2. The character discrimination method according to claim 1, wherein the area of the character rectangle is used as the feature of the character rectangle.

3. The character discrimination method according to claim 1, wherein the width of the character rectangle is used as the feature of the character rectangle.

4. The character discrimination method according to claim 1, wherein the ratio between the height and the width of the character rectangle is used as the feature of the character rectangle.

5. The character discrimination method according to claim 1, wherein the distance between the center coordinate position of the character rectangle and the line center line coordinate is used as the feature of the character rectangle.

6. The character discrimination method according to claim 1, wherein the line-direction distance between the center position coordinates of adjacent character rectangles is used as the feature of the character rectangle.

7. The character discrimination method according to claim 1, wherein the features of claims 2 to 6 are used in combination as the feature of the character rectangle.