JP2922949B2

JP2922949B2 - Post-processing method for character recognition

Info

Publication number: JP2922949B2
Application number: JP1339788A
Authority: JP
Inventors: 隆邦嶺脇
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-12-27
Filing date: 1989-12-27
Publication date: 1999-07-26
Anticipated expiration: 2014-07-26
Also published as: JPH03198181A

Description

[Detailed description of the invention]

[Field of industrial application]

The present invention relates to a Japanese character recognition device, and more particularly to the post-processing of character recognition results obtained by matching input characters against a dictionary.

[Conventional technology]

In general, character recognition normalizes each character image to a fixed size before matching it against a dictionary. As a result, characters that have the same shape but different sizes cannot be distinguished, for example uppercase (full-size) "あ" and lowercase (small) "ぁ", or uppercase "ヤ" and lowercase "ャ".

To address this problem, methods have been devised that use the position and size of characters to identify uppercase characters, lowercase characters, punctuation marks, and the like.

Known examples include a method that determines a character from its position relative to the character on its left, as described in Japanese Examined Patent Publication No. 60-1676; a method that determines a character from its position within a form entry frame, as described in Japanese Examined Patent Publication No. 60-9314; and a method that determines a character from the arrangement, positions, and sizes of the N surrounding characters, as described in Japanese Unexamined Patent Publication No. 63-78287.

[Problems to be solved by the invention]

However, character size and printing position differ from font to font, so the problem cannot be said to be solved when a wide variety of documents must be recognized.

Accordingly, an object of the present invention is to provide a post-processing method that uses simple processing to correct errors in character recognition results obtained by matching against a character dictionary, in particular the easily confused uppercase and lowercase errors.

[Means for solving the problem]

The present invention is characterized in that a character recognition result obtained by matching against a character dictionary is corrected by checking it against a character arrangement rule that describes a specific character together with the characters that can, or cannot, precede it.

[Operation]

For example, lowercase kana such as "ゃ", "ゅ", and "ょ" never appear at the beginning of a sentence; they are always preceded by some uppercase character. Moreover, not just any uppercase character will do: only a limited set of characters is permitted. In other words, a rule governing the character string exists there.

Therefore, for specific characters whose preceding character is restricted, such as lowercase kana, a rule describing the characters that can (or cannot) precede them is prepared and checked against the character recognition result. Rule-violating misrecognitions, such as confusion between uppercase and lowercase characters, can then be corrected simply and reliably.

[Example]

FIG. 1 is a schematic block diagram of a kanji OCR according to one embodiment of the present invention, and FIG. 2 is a schematic flowchart of its processing.

In FIG. 1, image data of the document to be processed is input from an image input unit 10 and stored in an image memory 11 (processing step). A line/character cutout unit 12 performs line cutout and character cutout on this input image data, and the cut-out character image data is stored in a character image memory 13 (processing step). A character recognition unit 14 applies preprocessing such as size normalization to this character image data and extracts features. The extracted feature vector is matched against the character dictionary stored in a character dictionary memory 15, character candidates are selected in order of increasing distance up to at most the N-th rank, and the result is stored in a recognition result memory 16 (processing step).

The post-processing unit 17 performs uppercase/lowercase correction processing on the character recognition result; the corresponding processing steps shown in FIG. 2 constitute this uppercase/lowercase correction processing.

The character arrangement rule used for this uppercase/lowercase correction processing is stored in a character arrangement rule memory 18.

This character arrangement rule describes, for each character whose preceding character is restricted (lowercase hiragana and katakana), the characters that are permitted to precede it. An example is shown in Table 1.

Next, the uppercase/lowercase correction processing is described using recognition of the character string "およぐように歩く" as an example. Assume that the character recognition unit 14 has produced the recognition results shown in Table 2. Because character size is not taken into account, the pairs "お" and "ぉ", "よ" and "ょ", and "う" and "ぅ" have been misrecognized for one another.

First, the first candidates of the recognition result are compared with the character arrangement rule one after another until a target character that requires correction processing appears. When a target character is found, the first candidate of the character one position before it is compared with the character arrangement rule to check whether it is a permitted preceding character. If no preceding character exists (the target character is at the beginning of the sentence), or if the preceding character is not a permitted preceding character, the second candidate at the target character position is swapped into the first-candidate position. If the second candidate is itself a target character, its preceding character is compared with the rule in the same way: if the preceding character is permitted, the second candidate is swapped into the first-candidate position; otherwise the same processing is applied to the next-ranked candidate.

In this example, the first candidate "ぉ" of character number 1 is a target character. Because it is at the beginning of the sentence, "ぉ" is judged to form an abnormal character string and is replaced with the second candidate "お". Since this "お" is not a target character, the processing of character number 1 ends.

The first candidate "ょ" of the next character, number 2, is a target character, so the first-candidate character immediately before it, "お", is compared with the contents of the character arrangement rule. Since "お" is not in the rule, "ょ" is replaced with the second candidate "よ". Since this "よ" is not a target character, the processing of character number 2 ends.

When the same processing is repeated up to character number 8, the misrecognized characters are corrected and the recognition result becomes as shown in Table 3.

The character arrangement rule may instead describe the characters that cannot precede a target character. For Japanese kana, however, describing the characters that can precede it, as in this embodiment, keeps the rule smaller.

Misrecognitions other than uppercase/lowercase errors can also be corrected by creating similar rules.

In addition to the uppercase/lowercase correction processing, the post-processing unit 17 may also perform correction processing based on validity as a word or sentence.

The recognition result corrected by the post-processing unit 17 as described above is held in the recognition result memory 16 and is output by an output unit 19.

[Effect of the invention]

As described above, according to the present invention, errors such as uppercase/lowercase confusion in character recognition results obtained by matching against a character dictionary can be reliably corrected by simple processing.

[Brief description of the drawings]

FIG. 1 is a schematic block diagram of a kanji OCR according to one embodiment of the present invention, and FIG. 2 is a schematic flowchart of its processing.

10: image input unit, 11: image memory, 12: line/character cutout unit, 13: character image memory, 14: character recognition unit, 15: character dictionary memory, 16: recognition result memory, 17: post-processing unit, 18: character arrangement rule memory, 19: output unit.

Claims

(57) [Claims]

1. A post-processing method for character recognition that corrects a character recognition result obtained by matching an input character against a character dictionary, characterized in that a character arrangement rule describing, for a character to be corrected, the characters that can or cannot precede it is prepared, the character string of the character recognition result is checked against the character arrangement rule, and an uppercase character to be corrected is corrected to lowercase or a lowercase character to be corrected is corrected to uppercase.

2. The post-processing method for character recognition according to claim 1, characterized in that the character to be corrected is corrected when no character exists before it and the character to be corrected cannot be the beginning of a sentence.