JP2963474B2

JP2963474B2 - Similar character identification method

Info

Publication number: JP2963474B2
Application number: JP1250091A
Authority: JP
Inventors: 隆邦嶺脇
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-09-26
Filing date: 1989-09-26
Publication date: 1999-10-18
Anticipated expiration: 2014-10-18
Also published as: JPH03111983A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a character recognition device (for example, a kanji OCR) that processes Japanese text, and in particular to a method for identifying similar characters that are difficult to distinguish by dictionary matching.

[Conventional technology]

In a character recognition device, a character image cut out from a document image is generally compressed or expanded to a fixed size (normalized) before it is matched against the dictionary. As a result, characters that differ little in shape but differ in size, such as the punctuation mark "。", the lowercase letter "o", and the uppercase letter "O", cannot be distinguished.

Methods proposed for identifying such similar characters include a method that looks at the character immediately to the left (Japanese Examined Patent Publication No. 60-1676), a method that uses the position information of the entry frame for characters written on forms (Japanese Examined Patent Publication No. 60-9314), and a method that uses the absolute size of the character (Japanese Unexamined Patent Publication No. 63-237584).

[Problems to be solved by the invention]

However, the method of Japanese Examined Patent Publication No. 60-1676 relies on local information and therefore fails when, for example, two or more small characters appear in succession. The method of Japanese Examined Patent Publication No. 60-9314 cannot handle ordinary text without frames. The method of Japanese Unexamined Patent Publication No. 60-237584 cannot handle character sizes that change from line to line.

An object of the present invention is to provide a method that is free of the above restrictions on the target text and that, by simple processing, identifies with high accuracy similar characters that differ in size (for example, "つ" and "っ", or "よ" and "ょ") and similar characters that have the same shape and size but occupy different positions within the line (for example, "゜" and "。", or "’" and "，").

[Means for solving the problem]

The similar character identification method of the present invention is used in a character recognition device that processes Japanese text. For horizontally written text, it detects the ratio of the distance between the upper reference line of the line and the upper end of the character to the standard width of the line (the top blank ratio). For vertically written text, it detects the ratio of the distance between the left reference line of the line and the left end of the character to the standard width of the line (the left blank ratio). Similar characters are then identified by comparing the detected top blank ratio or left blank ratio with a standard top blank ratio or left blank ratio prepared in advance for specific characters.

[Operation]

For characters of nearly identical shape, such as "つ" and "っ", "よ" and "ょ", or "。", "o", and "O", the top blank ratio or left blank ratio differs considerably between the similar characters.

The same holds for similar characters that have the same shape and size and differ only in position, such as "゜" and "。", "’" and "，", or "・" and "．".

Therefore, a standard top blank ratio or left blank ratio is prepared in advance for specific similar characters that are prone to misrecognition in dictionary matching. When dictionary matching yields similar characters as candidates for a character, the top blank ratio or left blank ratio detected for that character is compared with the standard ratio of each candidate, and the candidate with the smallest difference is made the first candidate. In this way, similar characters that are difficult to distinguish by dictionary matching are identified and corrected with high accuracy, which raises the recognition rate.

[Embodiment]

FIG. 1 is a block diagram of a character recognition device (kanji OCR) according to the present invention. In this device, an image input unit (scanner) 10 reads the image of a document (assumed here to be written horizontally), and the image data is stored in an image memory 11. A line/character segmentation unit 12 cuts out lines and characters from the document image, and the image data of each character is stored in a character image memory 13. A character recognition unit 14 matches the features of each cut-out character image, after normalization, against the per-character dictionary data registered in a character dictionary memory 15, selects the candidates with the smallest distance (highest degree of match), for example up to the tenth rank, and stores them in a recognition result memory 16. Up to this point the device is the same as a conventional character recognition device.
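The candidate selection step can be sketched as follows. This is only an illustration: the source does not state the feature representation or the distance measure, so the two-dimensional feature tuples, the Euclidean distance, and the names `top_candidates` and `toy_dictionary` are all assumptions made for the example.

```python
import heapq
import math


def top_candidates(feature, dictionary, k=10):
    """Return up to k dictionary characters ordered by increasing distance.

    feature    -- feature vector of the normalized character image (assumed form)
    dictionary -- mapping from character to its reference feature vector
    """
    # Euclidean distance is an assumption; the patent only says "distance".
    scored = ((math.dist(feature, ref), ch) for ch, ref in dictionary.items())
    return [ch for _, ch in heapq.nsmallest(k, scored)]


# Hypothetical reference features: after normalization "つ" and "っ" are
# almost identical, so their ranking is decided by a very small margin.
toy_dictionary = {"つ": (1.0, 0.0), "っ": (1.0, 0.1), "と": (0.0, 1.0)}
print(top_candidates((1.0, 0.08), toy_dictionary, k=2))
```

Because normalization removes the size difference, the two nearest candidates in such a setup are separated by a small margin, which is the situation the similar character determination unit is meant to resolve.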

This device differs from a conventional character recognition device in two respects. First, during segmentation the line/character segmentation unit 12 determines the upper reference line, the lower reference line, and the standard width of each line, detects the top blank ratio of each character, and stores this information in a segmentation information memory 17. Second, the device has a similar character determination unit 18 and a similar character table memory 19, and the recognition result of the character recognition unit 14 is corrected by the similar character determination unit 18 before being output from an output unit 20.

Taking the character string "ちょっと待った。" in FIG. 2 as an example, Lu is the upper reference line of the line, Ld is the lower reference line of the line, and W is the standard width (height) of the line. Specifically, the upper and lower sides of the circumscribed rectangle of the line's character string are detected as the upper reference line Lu and the lower reference line Ld, and the distance between them is taken as the standard width W of the line. Alternatively, to cope with skew, the straight lines touching the upper ends and the lower ends of the characters in the line may be taken as the upper reference line Lu and the lower reference line Ld, and their average distance taken as the standard width W.
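A minimal sketch of the first (circumscribed rectangle) way of determining Lu, Ld, and W is given below. The `Box` type, the function name `line_reference`, and the pixel values are hypothetical, and image coordinates are assumed to grow downward. The skew-tolerant variant with tangent lines is not covered here.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Circumscribed rectangle of one character (y grows downward)."""
    left: int
    top: int
    right: int
    bottom: int


def line_reference(boxes):
    """Return (Lu, Ld, W) for one horizontally written line.

    Lu and Ld are the top and bottom of the rectangle that circumscribes
    all characters of the line; W is the distance between them.
    """
    lu = min(b.top for b in boxes)
    ld = max(b.bottom for b in boxes)
    return lu, ld, ld - lu


# Hypothetical boxes: one full-size character followed by a small kana.
line = [Box(0, 10, 40, 50), Box(45, 26, 70, 50)]
print(line_reference(line))
```

Taking the extremes over all characters means a single tall character sets the reference lines for the whole line, which matches the circumscribed rectangle description but makes the result sensitive to segmentation noise.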

From these values, the following is obtained for each character:

top blank ratio = top blank Du / standard width W of the line

Here, as shown in FIG. 2, the top blank is the distance Du between the upper end of the character (the upper side of the character's circumscribed rectangle) and the upper reference line Lu.
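The top blank ratio, defined in the claim as the distance between the upper reference line and the upper end of the character divided by the standard width of the line, can be computed as in the sketch below. The function name and the pixel values are illustrative assumptions, with y coordinates growing downward.

```python
def top_blank_ratio(char_top, lu, w):
    """Return Du / W, where Du = char_top - lu.

    char_top -- y coordinate of the top of the character's circumscribed rectangle
    lu       -- y coordinate of the upper reference line Lu
    w        -- standard width (height) W of the line
    """
    if w <= 0:
        raise ValueError("standard width W must be positive")
    return (char_top - lu) / w


# Hypothetical line with Lu = 10 and W = 40: a character whose top edge
# lies 16 pixels below Lu has a top blank ratio of 16 / 40.
print(top_blank_ratio(26, lu=10, w=40))
```

Dividing by W makes the value independent of the absolute character size, so the same standard ratios can be used for lines set in different sizes.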

Now suppose that, for the character string "ちょっと待った。", the character recognition unit 14 obtains the recognition results shown in Table 1 and that the top blank ratios shown in Table 2 are obtained for the individual characters. The processing of the similar character determination unit 18 in this case is described next.

The similar character table memory 19 stores a table of the similar characters that require processing by the similar character determination unit 18, together with their standard top blank ratios. An example of this similar character table is shown in Table 3.

The similar character determination unit 18 compares the recognition result candidates in the recognition result memory 16 (Table 1) with the similar character table (Table 3) in order, and searches for a character whose candidates include similar characters registered in the table. In this example, the second character, for which "よ" and "ょ" are candidates, is found first.
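The search for characters that need correction can be sketched as follows. Tables 1 and 3 are not reproduced in this text, so the candidate lists and the grouping of the table into sets of mutually similar characters are hypothetical stand-ins chosen to be consistent with the walkthrough.

```python
# Assumed table layout: each set holds characters that are similar to one another.
SIMILAR_GROUPS = [{"よ", "ょ"}, {"つ", "っ"}]


def positions_to_check(candidates_per_char, groups=SIMILAR_GROUPS):
    """Return 1-based positions whose candidates contain two or more
    characters from the same similar-character group."""
    hits = []
    for position, candidates in enumerate(candidates_per_char, start=1):
        found = set(candidates)
        if any(len(found & group) >= 2 for group in groups):
            hits.append(position)
    return hits


# Hypothetical recognition candidates for "ちょっと待った。" (first candidate first).
recognized = [
    ["ち", "う"], ["よ", "ょ"], ["っ", "つ"], ["と"],
    ["待", "侍"], ["つ", "っ"], ["た"], ["。", "o"],
]
print(positions_to_check(recognized))
```

With these stand-in candidates the scan reports the second, third, and sixth characters, in that order.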

The similar character determination unit 18 reads the top blank ratio detected for this second character from the segmentation information memory 17, reads the standard top blank ratios of the candidates "よ" and "ょ" from the similar character table memory 19, and compares them. According to Table 2, the top blank ratio of the second character is 0.4. This matches the standard top blank ratio of "ょ" (0.4) and differs greatly from the standard top blank ratio of "よ" (0.1). The similar character determination unit 18 therefore swaps the candidates, making "ょ" the first candidate and "よ" the second candidate for the second character.

The third character is found as the next character to be processed. Its top blank ratio is 0.5, which is closer to the standard top blank ratio of "っ" (0.6) than to that of "つ" (0.2), so the candidates are not swapped.

For the sixth character, conversely, the similar character "つ" is the first candidate and "っ" is the second candidate, so the first and second candidates are swapped.

By repeating the same processing, the recognition results shown in Table 1 are corrected as shown in Table 4.
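The correction applied to the second and third characters can be reproduced with a short sketch. The standard ratios are the four values quoted in the walkthrough; the function name `rerank` and the rule of reordering only the registered candidates within the slots they already occupy are assumptions, since the text describes only the swap of the first and second candidates.

```python
# Standard top blank ratios quoted in the description (part of Table 3).
STANDARD_TOP_BLANK = {"よ": 0.1, "ょ": 0.4, "つ": 0.2, "っ": 0.6}


def rerank(candidates, measured, standard=STANDARD_TOP_BLANK):
    """Reorder registered similar characters by |measured - standard ratio|.

    Candidates that are not in the table keep their original positions.
    """
    slots = [i for i, ch in enumerate(candidates) if ch in standard]
    ranked = sorted(
        (candidates[i] for i in slots),
        key=lambda ch: abs(measured - standard[ch]),
    )
    result = list(candidates)
    for i, ch in zip(slots, ranked):
        result[i] = ch
    return result


# Second character: measured ratio 0.4, so "ょ" moves ahead of "よ".
print(rerank(["よ", "ょ"], 0.4))
# Third character: measured ratio 0.5 is closer to 0.6 than to 0.2, no change.
print(rerank(["っ", "つ"], 0.5))
```

The sixth character is omitted from the usage lines because its measured ratio appears only in Table 2, which is not reproduced here.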

The description so far has covered the processing of horizontally written text. The processing of vertically written text is described next.

When vertically written text is processed, the line/character segmentation unit 12 determines, as illustrated in FIG. 3, a left reference line Ll of the line in place of the upper reference line and a right reference line Lr of the line in place of the lower reference line. From the distance between the two reference lines, that is, the standard width W of the line, and the distance between the left end of the character (the left side of the character's circumscribed rectangle) and the left reference line Ll, that is, the left blank Dl, it obtains

left blank ratio = left blank Dl / standard width W of the line
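For vertical text the same computation is carried out along the horizontal axis. The sketch below derives Ll, Lr, and W from the circumscribed rectangle of one vertical line and returns the left blank ratio Dl / W of each character; the `Box` type, the function name, and the pixel values are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Circumscribed rectangle of one character (x grows to the right)."""
    left: float
    top: float
    right: float
    bottom: float


def left_blank_ratios(boxes):
    """Return Dl / W for every character of one vertically written line."""
    ll = min(b.left for b in boxes)    # left reference line Ll
    lr = max(b.right for b in boxes)   # right reference line Lr
    w = lr - ll                        # standard width W of the line
    if w <= 0:
        raise ValueError("standard width W must be positive")
    return [(b.left - ll) / w for b in boxes]


# Hypothetical vertical line: a full-width character, then a small kana
# set toward the right side of the line.
column = [Box(100, 0, 140, 40), Box(112, 45, 140, 70)]
print(left_blank_ratios(column))
```

The resulting ratios would then be compared with a separate table of standard left blank ratios, in the same way as the top blank ratios for horizontal text.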

In addition, the similar character table memory 19 stores a table of standard left blank ratios of similar characters, obtained in advance for vertically written text.

The similar character determination unit 18 then refers to this similar character table for vertically written text and, for the similar characters in the recognition result, compares the detected left blank ratio with the standard left blank ratio to make the same kind of correction as described above.

[Effect of the invention]

As described above, the present invention identifies, by relatively simple processing and with high accuracy, similar characters that are difficult to distinguish by dictionary matching, and corrects their misrecognition, so the character recognition rate for Japanese text can be improved. As is also clear from the above description, the similar character identification method of the present invention can be applied to ordinary text without frames, to text whose character size changes from line to line, and to character strings in which small characters appear in succession, so it is free of the restrictions on the target text that applied to the conventional methods.

[Brief description of the drawings]

FIG. 1 is a schematic block diagram of a character recognition device according to the present invention. FIG. 2 is a diagram explaining the top blank ratio for horizontally written text. FIG. 3 is a diagram explaining the left blank ratio for vertically written text.

10: image input unit; 11: image memory; 12: line/character segmentation unit; 13: character image memory; 14: character recognition unit; 15: character dictionary memory; 16: recognition result memory; 17: segmentation information memory; 18: similar character determination unit; 19: similar character table memory; 20: output unit.

Continuation of front page

(56) References: JP-A-S62-117090 (JP, A); JP-A-S62-187988 (JP, A); JP-A-S62-169285 (JP, A); JP-A-H01-108691 (JP, A); JP-A-S63-216189 (JP, A); JP-A-H01-171080 (JP, A); JP-A-H02-85985 (JP, A); JP-A-S60-134394 (JP, A); JP-A-S59-167783 (JP, A); JP-A-S59-109979 (JP, A); JP-A-S59-231681 (JP, A)

(58) Fields searched (Int. Cl.⁶, DB name): G06K 9/62; G06K 9/46; G06K 9/68

Claims

(57) [Claims]

1. A similar character identification method in a character recognition device for processing Japanese text, characterized in that: in the case of horizontally written text, the ratio of the distance between the upper reference line of a line and the upper end of a character to the standard width of the line (hereinafter referred to as the top blank ratio) is detected; in the case of vertically written text, the ratio of the distance between the left reference line of a line and the left end of a character to the standard width of the line (hereinafter referred to as the left blank ratio) is detected; and similar characters are identified by comparing the detected top blank ratio or left blank ratio with a standard top blank ratio or left blank ratio prepared in advance for specific characters.