JPH02224084A

JPH02224084A - Discriminating method for capital letter, small letter and character with shape similar to kanji (chinese character) and kana (japanese syllabary)

Info

Publication number: JPH02224084A
Application number: JP1196619A
Authority: JP
Inventors: Taiji Mori; 泰二森
Original assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Current assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Priority date: 1988-11-30
Filing date: 1989-07-31
Publication date: 1990-09-06
Anticipated expiration: 2014-08-03
Also published as: JP2930605B2

Abstract

PURPOSE: To reduce erroneous discrimination, when the standard character size is almost the same across character types, by discriminating uppercase from lowercase characters using not only the character size but also the center coordinates of the character.

CONSTITUTION: It is first decided whether a recognized character is one that has both an uppercase and a lowercase form. If so, an external shape feature comprising the character width, the character height, and their product is compared with two thresholds defined relative to the standard character, one for uppercase and one for lowercase, and the character is determined to be uppercase or lowercase accordingly. For a character that satisfies neither threshold, that is, one that cannot be decided from the external shape feature, a center line L of the line is derived from the center coordinates of the recognition results for one line, and the difference between the center of the character and the center line L is compared with a predetermined threshold to discriminate uppercase from lowercase. Erroneous discrimination is thereby reduced.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a method for determining the character type (for example, uppercase or lowercase) in a character recognition device that recognizes characters such as hiragana and katakana. Here "uppercase" and "lowercase" denote the full-size and small forms of a kana character. Examples of kana characters that have both uppercase and lowercase forms are shown in FIG. 6.

[Conventional technology]

Conventionally, uppercase and lowercase characters have been distinguished, for example, by providing only a lowercase threshold and comparing it with the external shape features of the character.

[Problem to be solved by the invention]

However, katakana are generally smaller than kanji, and the ratio between them also varies with the typeface.

With this method, therefore, a character of a given size may be judged uppercase in one typeface and lowercase in another.

The object of this invention is therefore to improve discrimination accuracy by applying separate criteria for uppercase and lowercase characters and, for characters that fall between the two criteria, by judging against a threshold the deviation of the character's center coordinates from the center of the line.

[Means for solving the problem]

The size of each target character, whose standard size is approximately the same regardless of character type, is normalized, and the character is recognized using the same standard pattern for both uppercase and lowercase forms. For each character in the recognition result, the center coordinates of its circumscribing frame are stored, and it is determined whether the character is one that has both uppercase and lowercase forms. If it is, an external shape feature comprising the character width, the character height, and the product of the width and height is obtained. This feature is compared with the uppercase threshold and the lowercase threshold, which are predetermined for each character relative to its standard character, to determine whether the character is uppercase or lowercase. If the character cannot be determined from these thresholds, it is flagged as undetermined. Each time the determination for one line is completed, the center line of the character line is obtained from the center coordinates of the characters in any line that contains an undetermined character, and the difference between the center coordinates of the undetermined character and the coordinates of the center line is compared with a predetermined threshold to make the determination.

[Operation]

It is determined whether a recognized character is one that has both uppercase and lowercase forms. If it is, an external shape feature comprising its width, its height, and their product is compared with the uppercase threshold and the lowercase threshold defined relative to its standard character, and the character is determined to be uppercase or lowercase. For a character that is neither, that is, one that cannot be judged from the external shape feature, the center line of the line is obtained from the center coordinates of the recognition results for one line, and the difference between the center of the character and the center line is compared with a predetermined threshold to discriminate uppercase from lowercase. This reduces erroneous discrimination and improves discrimination accuracy.

[Example]

FIG. 1 is a flowchart showing an embodiment of the invention, and FIG. 2 is an explanatory diagram showing an example of a group of horizontally written characters and its center line.

Following the flowchart of FIG. 1, character image data is first extracted by known image processing, and the target character is recognized by a likewise known method. The center coordinates of the character are then stored. Next, it is determined from the recognition result whether the target character is one that has both uppercase and lowercase forms; if so, its width, its height, and the product of its width and height are obtained. These values are compared with one or more uppercase thresholds predetermined for the target character relative to its standard character, and if the result indicates uppercase, the character is confirmed as uppercase. Otherwise the values are compared with the lowercase threshold, and if the result indicates lowercase, the character is confirmed as lowercase. If neither determination can be made, information indicating that the character is undetermined is attached to it.
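The two-threshold test described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Case`, `ShapeThresholds`, and `classify_by_shape`, the use of ratios to the standard character, and the rule that all three features must pass a threshold are assumptions, since the text does not say how the width, height, and product are combined.

```python
from dataclasses import dataclass
from enum import Enum


class Case(Enum):
    UPPER = "upper"
    LOWER = "lower"
    UNDETERMINED = "undetermined"


@dataclass
class ShapeThresholds:
    # Hypothetical per-character thresholds, expressed as fractions of the
    # standard character's width, height, and area (width * height).
    upper_min: tuple  # at or above all of these -> uppercase
    lower_max: tuple  # at or below all of these -> lowercase


def classify_by_shape(width, height, std_width, std_height, th):
    """Compare width, height, and width * height with both thresholds."""
    features = (
        width / std_width,
        height / std_height,
        (width * height) / (std_width * std_height),
    )
    # Assumption: every feature must satisfy the threshold to confirm a case.
    if all(f >= t for f, t in zip(features, th.upper_min)):
        return Case.UPPER
    if all(f <= t for f, t in zip(features, th.lower_max)):
        return Case.LOWER
    # Between the two criteria: leave the decision to the center-line test.
    return Case.UNDETERMINED
```

Characters that come back as `Case.UNDETERMINED` are the ones that the center-line comparison is meant to resolve once the whole line has been recognized.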

The above steps are repeated to obtain the recognition result for one line. If the line contains an undetermined character, an approximate expression for the center line of the line,

Y = aX + b

is obtained from the center coordinates of the horizontally written characters in the line by a known method such as least squares. The X-direction center coordinate Xc of the undetermined character is substituted into this expression to obtain the Y-direction coordinate YL of the center line at that position, as shown in FIG. 2. The "×" marks in FIG. 2 indicate the center position of each character. The difference (YL - Yc) between this coordinate YL and the Y-direction center coordinate Yc of the undetermined character is then obtained and compared with a predetermined threshold relative to the standard character, and the character is determined to be uppercase or lowercase from the result. In other words, the determination exploits the fact that the difference (YL - Yc) is larger for a lowercase character than for an uppercase character.
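The center-line step can be sketched as follows. The function names and the threshold value are hypothetical. Taking the absolute value of the difference is also an assumption, made so that the sketch does not depend on whether the Y axis points up or down; the text only states that the difference is larger for lowercase characters.

```python
def fit_center_line(centers):
    """Least-squares fit of Y = a*X + b through (x, y) character centers."""
    n = len(centers)
    if n == 0:
        raise ValueError("at least one character center is required")
    sx = sum(x for x, _ in centers)
    sy = sum(y for _, y in centers)
    sxx = sum(x * x for x, _ in centers)
    sxy = sum(x * y for x, y in centers)
    denom = n * sxx - sx * sx
    if denom == 0:
        # All centers share one X position: fall back to a horizontal line.
        return 0.0, sy / n
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b


def resolve_by_center(xc, yc, a, b, threshold):
    """Classify an undetermined character by its offset from the center line."""
    y_line = a * xc + b      # YL: center-line height at the character's Xc
    diff = y_line - yc       # (YL - Yc) as in the description
    # Assumption: magnitude of the offset is compared with the threshold.
    return "lower" if abs(diff) > threshold else "upper"
```

The fit uses every character center in the line, including those of the undetermined characters, which matches the description of deriving the line from the centers of the characters in the line.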

The above description discriminates uppercase from lowercase on the assumption that the standard size of a character does not vary with the character type.

In printed documents and the like, however, the standard size often differs by character type (for example, in printed documents kanji are generally larger than kana). Examples of standard sizes are shown in FIG. 7. There are also characters whose shapes are similar in kanji and in kana (hereinafter also called kanji-kana similar-shape characters). Examples are shown in FIG. 8.

Such cases cannot be handled by the method described above, so the following method is used instead. FIG. 3 is a flowchart explaining the method for such cases.

First, as in the case of FIG. 1, character image data is extracted by known image processing, and the target character is recognized by a likewise known method.

Next, the character code and size obtained from the recognition result are stored sequentially in a format such as that shown in FIG. 4, and the character type (kanji, katakana, hiragana, alphabetic, and so on) is determined from the character code. An attribute table T, set in advance for each character code in a format such as that shown in FIG. 5, is then consulted to determine whether the character has the standard size of its character type, is a character that has a lowercase form such as "や", "ゆ", and "よ" shown in FIG. 6, or is a kanji-kana similar-shape character such as "力" and "夕" shown in FIG. 8. If the character has a lowercase form or is a kanji-kana similar-shape character, the stored character is marked. If the character has the standard size, its size is aggregated per character type by an appropriate method, such as computing a frequency distribution or an average. In this way the recognition result for one document is obtained.

From the aggregated result, the standard size of each character type is calculated, for example by taking the most frequent size in the frequency distribution. The marked characters are then retrieved from the stored characters. For each of them, a threshold is obtained, for example by multiplying the standard size calculated for the corresponding character type by a preset ratio, and the threshold is compared with the actual size of the character to determine whether it is uppercase or lowercase.

Furthermore, the differences among the calculated standard sizes of kanji, hiragana, and katakana are compared with a preset threshold to check whether the sizes differ. If they do, then for each marked kanji-kana similar-shape character, and for every character similar to it, the character size is estimated by multiplying the standard size calculated for its character type by the ratio preset in the table T shown in FIG. 5, which gives the size of the character relative to the standard size of its character type. The estimate is compared with the actual size of the character, and the character whose estimated size is closest is taken as the candidate. These steps are repeated until the end of the document.

When determining whether a kanji-kana similar-shape character is a kanji or a kana, it is desirable to also use a method that determines the character types of the preceding and following characters. In the above description, the uppercase/lowercase determination and the kanji/kana determination for kanji-kana similar-shape characters are carried out together, but it is of course also possible to carry out only one of them.
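The document-level procedure can be sketched as follows. This is an illustration under stated assumptions: character size is treated as a single number, the function names and the ratio of 0.8 are hypothetical, and the preliminary check that the kanji and kana standard sizes actually differ is omitted for brevity.

```python
from collections import Counter, defaultdict


def standard_sizes(chars):
    """Most frequent size per character type among standard-size characters.

    `chars` is an iterable of (char_type, size, is_standard) tuples.
    """
    by_type = defaultdict(Counter)
    for char_type, size, is_standard in chars:
        if is_standard:
            by_type[char_type][size] += 1
    # Mode of the frequency distribution; an average would also fit the text.
    return {t: counts.most_common(1)[0][0] for t, counts in by_type.items()}


def is_lowercase(size, std_size, ratio=0.8):
    # `ratio` is a hypothetical preset; the threshold is ratio * standard size.
    return size < std_size * ratio


def resolve_similar_glyph(size, candidates, std):
    """Pick the candidate whose estimated size is closest to the measured size.

    `candidates` maps a label to (char_type, size_ratio), where size_ratio is
    the character's size relative to the standard size of its character type.
    """
    def distance(item):
        char_type, size_ratio = item[1]
        return abs(std[char_type] * size_ratio - size)

    return min(candidates.items(), key=distance)[0]
```

For example, with standard sizes of 32 for kanji and 26 for katakana, a measured size of 27 for an ambiguous glyph would select the katakana reading, because its estimated size is the closer of the two.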

[Effect of the invention]

According to this invention, when the standard character size is approximately the same regardless of character type, uppercase and lowercase characters are discriminated using not only the size of the character but also its center coordinates. This reduces erroneous discrimination and improves discrimination accuracy.

When the standard character size differs by character type, the standard size is calculated (determined) for each character type. This improves the discrimination accuracy for uppercase characters, lowercase characters, and kanji-kana similar-shape characters.

[Brief description of the drawings]

FIG. 1 is a flowchart showing one embodiment of this invention. FIG. 2 is an explanatory diagram showing an example of a group of horizontally written characters and its center line. FIG. 3 is a flowchart showing another embodiment of this invention. FIG. 4 is an explanatory diagram showing how recognition results are stored. FIG. 5 is a configuration diagram showing an example of the character attribute table. FIG. 6 is an explanatory diagram showing examples of characters whose uppercase and lowercase forms have similar shapes. FIG. 7 is an explanatory diagram showing examples of the standard size for each character type. FIG. 8 is an explanatory diagram showing examples of kanji-kana similar-shape characters.

Explanation of reference signs: L: center line; P: center position of an undetermined character; T: character attribute table.

Claims

1) A method for discriminating between uppercase and lowercase characters, characterized in that: the size of a target character whose standard size is approximately the same regardless of character type is normalized, and the character is recognized using the same standard pattern for both uppercase and lowercase forms; for each character in the recognition result, the center coordinates of its circumscribing frame are stored and it is determined whether the character is one that has both uppercase and lowercase forms; if it has both, an external shape feature including the character width, the character height, and the product of the character width and character height is obtained; the external shape feature is compared with respective thresholds for determining uppercase and lowercase, predetermined for each character relative to a standard character, to determine whether the character is uppercase or lowercase; if the determination cannot be made from these thresholds, information indicating that the character is undetermined is attached to it; and each time the determination for one line is completed, the center line of the character line is obtained from the center coordinates of the characters in the line containing the undetermined character, and the difference between the center coordinates of the undetermined character and the coordinates of the center line is compared with a predetermined threshold to make the determination.

2) A method for discriminating between uppercase and lowercase characters, characterized in that: the size of a target character whose standard size differs by character type is normalized, and the character is recognized using the same standard pattern for both uppercase and lowercase forms; the character code and size of each character in the recognition result are stored sequentially and the character type is determined from the character code; a table set in advance for each character code is consulted to determine whether the character is a standard-size character or a character that has a lowercase form of similar shape; if it is a character that has a lowercase form, the stored character is marked, while the actual sizes of standard-size characters are aggregated for each character type, to obtain the recognition result for one document; a standard size is determined for each character type by obtaining a frequency distribution or an average from the aggregated sizes measured for that character type; and, for each previously marked character, a predetermined threshold is applied to the determined standard size of the corresponding character type to determine whether the character is uppercase or lowercase.

3) A method for discriminating kanji-kana similar-shape characters, characterized in that: the size of a target character whose standard size differs by character type is normalized, and the character is recognized using the same standard pattern for both uppercase and lowercase forms; the character code and size of each character in the recognition result are stored sequentially and the character type is determined from the character code; a table set in advance for each character code is consulted to determine whether the character is a standard-size character or a character whose shape is similar in kanji and kana (a kanji-kana similar-shape character); if it is a kanji-kana similar-shape character, the stored character is marked, while the actual sizes of standard-size characters are aggregated for each character type, to obtain the recognition result for one document; a standard size is determined for each character type by obtaining a frequency distribution or an average from the aggregated sizes measured for that character type; and, for each previously marked character, a predetermined threshold is applied to the determined standard size of the corresponding character type to determine whether the kanji-kana similar-shape character is a kanji or a kana.

4) A method for discriminating uppercase characters, lowercase characters, and kanji-kana similar-shape characters, characterized in that: the size of a target character whose standard size differs by character type is normalized, and the character is recognized using the same standard pattern for both uppercase and lowercase forms; the character code and size of each character in the recognition result are stored sequentially and the character type is determined from the character code; a table set in advance for each character code is consulted to determine whether the character is a standard-size character, a character that has a lowercase form of similar shape, or a character whose shape is similar in kanji and kana (a kanji-kana similar-shape character); if it is a character that has a lowercase form or a kanji-kana similar-shape character, the stored character is marked, while the actual sizes of standard-size characters are aggregated for each character type, to obtain the recognition result for one document; a standard size is determined for each character type by obtaining a frequency distribution or an average from the aggregated sizes measured for that character type; and, for each previously marked character, a predetermined threshold is applied to the determined standard size of the corresponding character type to determine whether the character is uppercase or lowercase, or whether the kanji-kana similar-shape character is a kanji or a kana.

5) The method for determining uppercase and lowercase letters according to claim 4, wherein, in determining whether a kanji-kana similar-shape character is a kanji or a kana, the combination of the preceding and following character types is also judged.

6) The method for discriminating uppercase characters, lowercase characters, and kanji-kana similar-shape characters according to claim 5, wherein, in determining whether a kanji-kana similar-shape character is a kanji or a kana, the combination of the preceding and following character types is also judged.