JPH03198181A - Post-processing method for character recognition

Info

Publication number: JPH03198181A
Application number: JP1339788A
Authority: JP
Inventors: Takakuni Minewaki; 隆邦嶺脇
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-12-27
Filing date: 1989-12-27
Publication date: 1991-08-29
Anticipated expiration: 2014-07-26
Also published as: JP2922949B2

Abstract

PURPOSE: To correct errors such as confusion between large and small characters by simple processing, by collating a character sequence rule with the character recognition result obtained by matching against a character dictionary.

CONSTITUTION: The character recognition result obtained by matching against a character dictionary 15 is corrected by collating it with a character sequence rule 18, which describes a specific character together with the characters that can precede it or the characters that cannot precede it. For instance, the small kana (Japanese syllabary) characters 'ya', 'yu', 'yo' (ゃ, ゅ, ょ), etc. never appear at the head of a sentence and always follow a large character. Not every large character is allowed in that position; only a limited set is, so a rule on the character string exists. Accordingly, for specific characters such as small kana whose preceding character is restricted, a rule listing the characters that can (or cannot) precede them is prepared and collated with the character recognition result. In this way, misrecognition of large and small characters, etc. that violates the rule can be corrected easily and reliably.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of industrial application]

The present invention relates to a Japanese character recognition device, and in particular to post-processing of character recognition results obtained by matching input characters against a dictionary.

[Conventional technology]

In character recognition, character images are generally normalized to a fixed size before being matched against a dictionary. Characters that have the same shape but different sizes, such as large 「あ」 and small 「ぁ」, or large 「ヤ」 and small 「ャ」, therefore cannot be distinguished.
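The effect of size normalization can be illustrated with a minimal sketch. The functions `place` and `normalize`, the 8x8 canvas, the 4x4 target grid, and the nearest-neighbour rescaling are all assumptions made for illustration; the patent does not specify how normalization is carried out.

```python
def place(shape, canvas=8, top=0, left=0):
    """Draw a small binary shape onto an empty canvas (hypothetical helper)."""
    img = [[0] * canvas for _ in range(canvas)]
    for y, row in enumerate(shape):
        for x, v in enumerate(row):
            img[top + y][left + x] = v
    return img

def normalize(image, size=4):
    """Crop to the ink bounding box, then rescale to size x size (nearest neighbour)."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    top, left = rows[0], cols[0]
    h, w = rows[-1] - top + 1, cols[-1] - left + 1
    return [[image[top + (y * h) // size][left + (x * w) // size]
             for x in range(size)]
            for y in range(size)]

# The same stroke pattern at two sizes: a "small" glyph low in its cell
# and a "large" glyph twice as big.
small = place([[1, 0],
               [1, 1]], top=6, left=0)
large = place([[1, 1, 0, 0],
               [1, 1, 0, 0],
               [1, 1, 1, 1],
               [1, 1, 1, 1]], top=2, left=2)

# After normalization the two images are identical, so a dictionary match
# on the normalized image cannot tell them apart.
print(normalize(small) == normalize(large))  # True
```

Both size and position are discarded by the crop-and-rescale step, which is why the distinction has to be recovered by some other means.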

To address this problem, methods have been devised that use the position and size of characters to distinguish large characters, small characters, punctuation marks, and so on.

Known examples include: a method that identifies a character from its position relative to the character on its left, as described in Japanese Examined Patent Publication No. Sho 60-1676; a method that identifies a character from its position within the entry frame of a form, as described in Japanese Examined Patent Publication No. Sho 60-9314; and a method that identifies a character from the sequence, positions, and sizes of N surrounding characters, as described in Japanese Unexamined Patent Application Publication No. Sho 63-78287.

[Problem to be solved by the invention]

However, character size and printing position vary from font to font, so these methods cannot be said to solve the problem when a wide variety of documents is to be recognized.

The object of the present invention is therefore to provide a post-processing method that corrects, by simple processing, errors in character recognition results obtained by matching against a character dictionary, in particular the easily confused errors between large and small characters.

[Means to solve the problem]

The present invention is characterized in that a character recognition result obtained by matching against a character dictionary is corrected by collating it with a character sequence rule that describes a specific character and the characters that can, or cannot, precede it.

[Operation]

For example, small kana such as 「ゃ」, 「ゅ」, and 「ょ」 never appear at the beginning of a sentence; they always follow some large character. Moreover, not just any large character may precede them: only a limited set is allowed. In other words, a rule on the character string exists there.

Therefore, for specific characters such as small kana, whose preceding character is restricted, a rule describing the characters that can (or cannot) precede them is prepared. By collating this rule with the character recognition result, misrecognition between large and small characters and similar errors that violate the rule can be corrected easily and reliably.

[Example]

FIG. 1 is a schematic block diagram of a kanji OCR system according to an embodiment of the present invention, and FIG. 2 is a schematic flowchart of its processing.

In FIG. 1, the image data of the document to be processed is input from the image input unit 10 and stored in the image memory 11 (a processing step of FIG. 2). The line/character segmentation unit 12 performs line segmentation and character segmentation on this input image data, and the segmented character image data is stored in the character image memory 13 (a further processing step). In the character recognition unit 14, this character image data undergoes preprocessing such as size normalization, followed by feature extraction. The extracted feature vector is matched against the character dictionary stored in the character dictionary memory 15, and up to N character candidates are selected in ascending order of distance (a further processing step).
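The candidate selection described above can be sketched as a nearest-neighbour ranking. The function name `top_candidates`, the two-dimensional toy feature vectors, and the use of Euclidean distance are assumptions for illustration; the patent states only that candidates are ranked by distance and that up to N are kept.

```python
import math

def top_candidates(feature, dictionary, n=3):
    """Return up to n dictionary characters, nearest feature vector first."""
    ranked = sorted(dictionary, key=lambda ch: math.dist(feature, dictionary[ch]))
    return ranked[:n]

# Toy dictionary: the large and small forms have nearly identical features
# because the input was size-normalized before feature extraction.
DICTIONARY = {
    "よ": (1.0, 0.0),
    "ょ": (1.0, 0.1),
    "ま": (0.0, 1.0),
    "は": (0.2, 0.9),
}

print(top_candidates((1.0, 0.08), DICTIONARY))  # ['ょ', 'よ', 'は']
```

In this toy case the small form happens to rank first, while the large form is kept as the second candidate. Retaining several ranked candidates is what allows the later post-processing to swap in an alternative.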

The post-processing unit 17 performs large/small character correction on the character recognition result; a consecutive range of the processing steps shown in FIG. 2 constitutes this large/small character correction.

The character sequence rules used for this large/small character correction are stored in the character sequence rule memory 18.

A character sequence rule describes, for a character whose preceding character is restricted (a small hiragana or katakana character), the characters that are allowed to precede it. An example is shown in Table 1.
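One possible in-memory form of such a rule set is a mapping from each target character to the set of characters allowed before it. The sketch below is not the patent's Table 1, which is not reproduced in this text; the entries are an assumption based on ordinary hiragana spelling, where small 「ゃ」, 「ゅ」, 「ょ」 follow i-column kana, and the names `RULES`, `is_target`, and `may_follow` are hypothetical.

```python
# Assumed rule contents: i-column hiragana that can precede small ya/yu/yo.
I_COLUMN = frozenset("きしちにひみりぎじぢびぴ")

RULES = {
    "ゃ": I_COLUMN,
    "ゅ": I_COLUMN,
    "ょ": I_COLUMN,
}

def is_target(ch):
    """A target character is one that has an entry in the rule set."""
    return ch in RULES

def may_follow(prev, ch):
    """True if ch may appear directly after prev (prev=None means sentence start)."""
    if not is_target(ch):
        return True                      # unrestricted characters always pass
    return prev is not None and prev in RULES[ch]

print(may_follow("き", "ょ"))   # True:  as in "きょ"
print(may_follow("お", "ょ"))   # False: "お" is not listed for "ょ"
print(may_follow(None, "ょ"))   # False: a small kana cannot start a sentence
```

Treating a missing predecessor as a violation covers the sentence-start case with the same lookup.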

Next, the large/small character correction is explained using recognition of the character string 「およぐように歩く」 as an example. Suppose that the character recognition unit 14 has produced the recognition result shown in Table 2. Because character size is not taken into account, 「お」 has been misrecognized as 「ぉ」, 「よ」 as 「ょ」, and 「う」 as 「ぅ」.

First, the first candidates of the recognition result are compared with the character sequence rules in order, advancing until a target character that may require correction appears. When a target character is found, the first candidate of the character immediately before it is checked against the character sequence rule to see whether it is a character that can precede the target. If there is no preceding character (the target character is at the beginning of the sentence), or if the preceding character is not one that can precede it, the second candidate at the target position is swapped into first place. If the second candidate is itself a target character, the preceding character is checked against the rule for that candidate in the same way: if the preceding character is permitted, the second candidate becomes the first candidate; otherwise the same processing is applied to the next-ranked candidate.

In this example, the first candidate 「ぉ」 of character number 1 is a target character. Because it is at the beginning of the sentence, 「ぉ」 is judged to form an abnormal character string and is replaced with the second candidate 「お」. This 「お」 is not a target character, so processing of character number 1 ends.

The first candidate 「ょ」 of the next character, number 2, is a target character, so the first candidate of the preceding character, 「お」, is compared with the content of the character sequence rule. Because 「お」 is not listed in the rule, 「ょ」 is replaced with the second candidate 「よ」. This 「よ」 is not a target character, so processing of character number 2 ends.

Repeating the same processing up to character number 8 corrects the misrecognized characters, as shown in Table 3.
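The correction procedure and the worked example can be sketched as follows. This is a minimal illustration under stated assumptions: the rule entries in `RULES` and the candidate lists in `TABLE2` are invented stand-ins (Tables 1 and 2 are not reproduced in this text), the function name `correct` is hypothetical, and keeping the top candidate when every candidate violates the rules is a fallback the patent does not specify.

```python
# Illustrative rules (NOT the patent's Table 1): each target character maps to
# the set of characters assumed to be allowed immediately before it.
I_COLUMN = set("きしちにひみりぎじぢびぴ")
RULES = {
    "ゃ": I_COLUMN, "ゅ": I_COLUMN, "ょ": I_COLUMN,
    "ぉ": set("ふ"),    # assumed entry
    "ぅ": set("とど"),  # assumed entry
}

def correct(candidates, rules):
    """candidates[i] lists the candidates for character i, best first."""
    result = []
    for ranked in candidates:
        prev = result[-1] if result else None   # already-corrected predecessor
        chosen = ranked[0]   # assumption: keep the top candidate if all fail
        for cand in ranked:
            allowed = rules.get(cand)
            # Accept a non-target character, or a target whose predecessor
            # is listed in its rule (None, i.e. sentence start, never is).
            if allowed is None or prev in allowed:
                chosen = cand
                break
        result.append(chosen)
    return "".join(result)

# Hypothetical stand-in for Table 2: small forms wrongly ranked first.
TABLE2 = [
    ["ぉ", "お"], ["ょ", "よ"], ["ぐ", "く"], ["ょ", "よ"],
    ["ぅ", "う"], ["に", "こ"], ["歩"], ["く", "ぐ"],
]

print(correct(TABLE2, RULES))  # およぐように歩く
```

A small kana that is legitimately preceded by a listed character is left untouched: with candidates `[["き"], ["ょ", "よ"], ["ぅ", "う"]]` the same function returns 「きょう」, keeping 「ょ」 and correcting only 「ぅ」.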

Table 3 (correction results). Note: underlined entries are those that were corrected.

The character sequence rule may instead describe the characters that cannot precede a target character. For Japanese kana, however, describing the characters that can precede it, as in this embodiment, keeps the rule set smaller.
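The size difference between the two rule forms can be made concrete with a small sketch. The alphabet is assumed to be the hiragana letters U+3041 to U+3096, and the allowed set for 「ょ」 is the same illustrative i-column list used for ordinary spelling; neither is taken from the patent's tables.

```python
# Assumed alphabet: hiragana letters from small "a" (U+3041) to small "ke" (U+3096).
HIRAGANA = {chr(cp) for cp in range(0x3041, 0x3097)}

# Assumed "can precede" rule for small yo.
allowed = set("きしちにひみりぎじぢびぴ")

# The equivalent "cannot precede" rule is the complement within the alphabet.
forbidden = HIRAGANA - allowed

print(len(allowed), len(forbidden))  # 12 74

def ok_by_allowed(prev):
    return prev in allowed

def ok_by_forbidden(prev):
    # The sentence-start case (prev is None) needs its own check in this form.
    return prev is not None and prev not in forbidden

# Within the alphabet, both forms give the same decision.
print(all(ok_by_allowed(p) == ok_by_forbidden(p) for p in HIRAGANA))  # True
```

With only hiragana counted, the forbidden form is already about six times larger; adding katakana, kanji, and symbols to the alphabet would widen the gap further.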

Misrecognition other than large/small character errors can also be corrected by creating similar rules.

In addition to the large/small character correction, the post-processing unit 17 may also perform correction based on validity as a word or sentence.

The recognition result corrected by the post-processing unit 17 as described above is held in the recognition result memory 16 and is output by the output unit 19.

[Effect of the invention]

As explained above, the present invention makes it possible to reliably correct, by simple processing, errors such as large/small character confusion in character recognition results obtained by matching against a character dictionary.

[Brief explanation of drawings]

FIG. 1 is a schematic block diagram of a kanji OCR system according to an embodiment of the present invention, and FIG. 2 is a schematic flowchart of its processing.

10: image input unit
11: image memory
12: line/character segmentation unit
13: character image memory
14: character recognition unit
15: character dictionary memory
16: recognition result memory
17: post-processing unit
18: character sequence rule memory
19: output unit

Claims

[Claims]

(1) A post-processing method for character recognition in a character recognition device, characterized in that a character recognition result obtained by matching with a character dictionary is corrected by collating it with a character sequence rule that describes a specific character and the characters that can, or cannot, precede it.