JPS62285189A

JPS62285189A - Character recognition post processing system

Info

Publication number: JPS62285189A
Application number: JP61128558A
Authority: JP
Inventors: Jiichi Igarashi; 五十嵐　治一
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1986-06-03
Filing date: 1986-06-03
Publication date: 1987-12-11

Abstract

PURPOSE: To correct misrecognized characters with high accuracy by performing morphological analysis on all character strings that can be created from the candidate characters. CONSTITUTION: An input sentence read by an OCR reading unit 1 is passed to a character recognition unit 2, which calculates the similarity between each input character and the standard characters and obtains candidate characters; these are sent to a post-processing unit 3. The post-processing unit 3 cuts out words from the candidate characters by referring to a word dictionary 10, a part-of-speech classification table 11, a conjugation table 12, and a connection weight matrix table 13, and performs morphological analysis by checking the connection with the preceding word. Morphological analysis is performed on all character strings of the unknown word part that can be created from the candidate characters of each character of the unknown word part, and the character string with the highest evaluation value is selected. Thus, even when the ability to determine the positions of misrecognized characters is not high, a high capability of correcting misrecognized characters can be expected.

Description

Detailed Description of the Invention

[Technical Field] The present invention relates to a post-processing method for OCR character recognition devices and the like.

[Prior Art] In OCR character recognition devices and the like, recognition accuracy is generally improved by applying further post-processing to candidate characters that have been judged to be misrecognized. In recent years, the use of grammatical knowledge of the language has been considered for this kind of post-processing. A typical example is a method using morphological analysis: morphological analysis is applied to a character string containing a character judged to be misrecognized, taking the candidate characters into account, and the correct character is determined by checking the connections between parts of speech and the like. However, the conventional technology has the drawback that, when the ability to determine the positions of misrecognized characters is low, processing becomes slow and incorrect candidate characters are more often judged to be correct.

[Purpose] The purpose of the present invention is to provide a character recognition post-processing method using morphological analysis in which a high capability of correcting misrecognized characters can be expected even when the ability to determine the positions of misrecognized characters is not high.

[Configuration] The present invention performs character recognition on each input character to detect candidate characters, creates a character string from only the first-ranked candidate characters, and applies morphological analysis to it. The portion that cannot be analyzed is then extracted, and process 1 or process 2 below is executed on that character string portion (referred to as the unknown word part) to make the final estimate of the input characters.

Process 1: Morphological analysis is performed on all character strings of the unknown word part that can be created from the candidate characters of each character of the unknown word part, and the character string with the highest evaluation value is selected.

Process 2: For each character of the unknown word part, the position of misrecognized characters is determined. Candidate character strings for the unknown word part are created by successively substituting candidate characters only at the character positions judged to be misrecognized. Morphological analysis is performed on these candidate character strings, and the character string with the highest evaluation value is selected.

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 shows a block diagram of an embodiment of the present invention.

An input sentence read by the OCR reading unit 1 is passed to the character recognition unit 2, which calculates the similarity between each input character and the standard characters and obtains candidate characters. The candidate characters obtained by the character recognition unit 2 are sent to the post-processing unit 3. The present invention concerns the processing performed by this post-processing unit 3. To perform morphological analysis, the post-processing unit 3 is provided with a word dictionary 10, a part-of-speech classification table 11, a conjugation table 12, and a connection weight matrix table 13.

As shown in FIG. 2, the word dictionary 10 contains, for each word, its reading (the pronunciation of the word written in hiragana), its notation (the kana and kanji form that is output), its part of speech, its frequency rank, and other information.
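The record layout described for FIG. 2 could be modeled roughly as follows. This is only an illustrative sketch: the field names, the two sample entries, the convention that a smaller rank means a more frequent word, and the choice to index entries by notation are all assumptions, since the description lists only which fields exist.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DictEntry:
    reading: str    # pronunciation written in hiragana
    notation: str   # kana/kanji form that is output
    pos: str        # part of speech
    freq_rank: int  # frequency rank (assumed: smaller = more frequent)


def build_index(entries):
    """Index entries by notation so a cut-out string can be looked up directly."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.notation].append(entry)
    return index


# Hypothetical sample entries, not taken from FIG. 2.
ENTRIES = [
    DictEntry("けいさん", "計算", "noun", 1),
    DictEntry("を", "を", "particle", 1),
]
INDEX = build_index(ENTRIES)
```

A string that is absent from the index, such as the misrecognized 「語算」, is what later surfaces as an unanalyzable portion.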

The part-of-speech classification table 11 and the conjugation table 12 are tables that give the row and column numbers used when searching the connection weight matrix table 13. The part-of-speech classification table 11 covers parts of speech that have no conjugated ending and has the record structure shown in FIG. 3. The conjugation table 12 covers parts of speech that have conjugated endings and has the record structure shown in FIG. 3(b). Here, the conjugated-ending column holds the endings that follow the stems of verbs, adjectives, and the like; an entry becomes a candidate for evaluation only when its ending matches the input characters.
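The ending match described for the conjugation table can be sketched as below. The function name, the stem 「行な」, and the list of endings are hypothetical; the description states only that an entry is evaluated once its ending matches the input characters.

```python
def match_ending(text, pos, stem, endings):
    """Return the conjugated ending that follows `stem` at `pos` in `text`.

    Returns None when the stem is not found at `pos` or when no listed
    ending matches the characters that follow it.
    """
    if not text.startswith(stem, pos):
        return None
    rest = pos + len(stem)
    # Try longer endings first so the most specific one wins.
    for ending in sorted(endings, key=len, reverse=True):
        if text.startswith(ending, rest):
            return ending
    return None


# Hypothetical table row: stem of 「行なう」 with a few of its endings.
ENDINGS = ["わ", "い", "う", "え", "お"]
matched = match_ending("計算を行なうことは", 3, "行な", ENDINGS)
```

In this sketch a word whose ending fails to match is simply not considered further, which mirrors the statement that only matching entries are evaluated.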

As shown in FIG. 4, the connection weight matrix table 13 is a matrix whose row direction corresponds to receiving (uke) codes and whose column direction corresponds to dependent (kakari) codes; the entry at each intersection gives the weight of the connection. The procedure leading up to a search of the connection weight matrix table 13 is as follows: the word dictionary 10 is searched; the part of speech of the matching word is used to find its receiving and dependent codes in the part-of-speech classification table 11 or the conjugation table 12 (for a conjugating word, the conjugated ending is matched against the following character string at this point); and the connection is checked using the connection weight matrix table 13.
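A minimal sketch of the lookup just described is given below. The code numbers and weights are invented, and the sketch assumes that the preceding word supplies the dependent code (column) while the following word supplies the receiving code (row); the description does not say which word supplies which code.

```python
# Hypothetical code table: part of speech -> (receiving code, dependent code).
POS_CODES = {
    "noun": (0, 0),
    "particle": (1, 1),
    "verb": (2, 2),
}

# WEIGHTS[receiving code][dependent code]; every value here is made up.
WEIGHTS = [
    [0.2, 1.0, 0.8],  # receiving code 0 (noun)
    [1.0, 0.0, 0.6],  # receiving code 1 (particle)
    [0.1, 1.0, 0.0],  # receiving code 2 (verb)
]


def connection_weight(prev_pos, next_pos):
    """Look up the connection weight between two adjacent words."""
    _, dependent = POS_CODES[prev_pos]   # column: code of the preceding word
    receiving, _ = POS_CODES[next_pos]   # row: code of the following word
    return WEIGHTS[receiving][dependent]


def can_connect(prev_pos, next_pos):
    """Treat a zero weight as a forbidden connection (an assumption)."""
    return connection_weight(prev_pos, next_pos) > 0.0
```

Keeping the codes in a separate table, as the patent does, lets several parts of speech share one row or column of the matrix.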

FIG. 5 shows the processing flowchart of the post-processing unit 3, which forms the core of the present invention. The processing of the post-processing unit 3 is described in detail below with reference to FIG. 5.

From the candidate characters sent by the character recognition unit 2, a character string consisting only of the first-ranked candidate characters is created (step 101), and units are cut out of that character string (step 102). Here, a unit is a character string delimited by separator symbols such as periods, commas, and parentheses. Morphological analysis is performed on each unit, for example by the procedure of FIG. 6 (step 103). That is, a predetermined number of characters is read from the head of the unit (step 201), and a word is cut out by searching the word dictionary 10 (step 202). The part-of-speech classification table 11 or the conjugation table 12 is searched for the cut-out word and, for a conjugating word, the conjugated ending is matched against the input character string (step 203). The connection with the immediately preceding word is then checked using the connection weight matrix table 13 (step 204), the pointer is advanced by the number of characters in the cut-out word (step 205), and the same processing is repeated. The word dictionary used in this morphological analysis is assumed to be large enough to contain every word in the input sentence.
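The loop of steps 201 to 205 might look roughly like the sketch below. It assumes a greedy longest-match search, a dictionary that maps a notation directly to a part of speech, and a caller-supplied `connects` predicate; none of these details is fixed by the description, and the ending match of step 203 is omitted for brevity.

```python
def analyze(unit, dictionary, connects, max_len=4):
    """Segment `unit` into dictionary words, checking each connection.

    Returns (words, pos). When pos < len(unit), the analysis could not
    proceed past position pos.
    """
    words, pos, prev_tag = [], 0, None
    while pos < len(unit):
        # Step 201: read up to max_len characters, longest first.
        for length in range(min(max_len, len(unit) - pos), 0, -1):
            word = unit[pos:pos + length]
            tag = dictionary.get(word)          # step 202: dictionary search
            if tag is None:
                continue
            # Step 204: connection check with the immediately preceding word.
            if prev_tag is not None and not connects(prev_tag, tag):
                continue
            words.append(word)
            prev_tag = tag
            pos += length                       # step 205: advance the pointer
            break
        else:
            return words, pos                   # no word fits here
    return words, pos


# Hypothetical toy dictionary and a permissive connection check.
TOY_DICT = {"計算": "noun", "を": "particle", "行なう": "verb"}
ok_words, ok_pos = analyze("計算を行なう", TOY_DICT, lambda a, b: True)
bad_words, bad_pos = analyze("語算を行なう", TOY_DICT, lambda a, b: True)
```

In the second call the misrecognized 「語算」 is not in the dictionary, so the analysis stops at position 0; this is the kind of failure that marks an unknown word part.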

By applying the above morphological analysis, a character string containing misrecognized characters is extracted as an unknown word part (step 104). Process 1 or process 2 below is then executed on the extracted unknown word part.

In process 1, all possible candidate character strings are first created by successively replacing each character of the unknown word part with its candidate characters (step 105). Next, morphological analysis like that of step 103 is performed on each candidate character string of the unknown word part (step 106), an evaluation value is calculated for each candidate character string (step 107), and the candidate character string with the maximum-likelihood evaluation value is selected (step 108). It is then determined whether the end of the unit has been reached (step 109). If not, the pointer indicating the character position within the unit is advanced to the character following the unknown word part just processed, and processing returns to step 103 (step 110); if so, processing proceeds to the next unit (step 111).
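Steps 105 to 108 amount to an exhaustive search over the Cartesian product of the per-position candidates, which can be sketched as follows. The scorer `toy_evaluate` is a hypothetical stand-in for morphological analysis plus evaluation, and the third-position alternatives 「箕」 and 「昇」 are invented; the other characters are those of the FIG. 7 example in this description.

```python
from itertools import product


def process1(candidates, evaluate):
    """Try every string that the per-position candidates can form.

    candidates: one list of candidate characters per position, best first.
    evaluate:   scorer returning 0 for strings that cannot be analyzed.
    """
    strings = ["".join(chars) for chars in product(*candidates)]  # step 105
    scored = [(evaluate(s), s) for s in strings]                  # steps 106-107
    best_score, best = max(scored, key=lambda pair: pair[0])      # step 108
    return best, best_score, len(strings)


# FIG. 7 characters; "箕" and "昇" are hypothetical third-position candidates.
CANDIDATES = [["な", "た", "田"], ["語", "詐", "計"], ["算", "箕", "昇"]]


def toy_evaluate(s):
    # Hypothetical scorer: only the correct string is analyzable.
    return 1.0 if s == "な計算" else 0.0


best, score, count = process1(CANDIDATES, toy_evaluate)
print(best, score, count)
```

Because every combination is scored, the result does not depend on knowing which position was misrecognized, at the cost of evaluating all 27 strings here.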

In process 2, on the other hand, the positions of misrecognized characters in the extracted unknown word part are first determined on the basis of the similarity (distance) calculated by the character recognition unit 2, after which the processing from step 105 onward is performed (step 110).

In this case, step 105 creates the character strings of the unknown word part by substituting candidate characters only for the characters judged to be misrecognized.
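Process 2 can be sketched by restricting the substitution to the suspect positions. The threshold test on the distance of the first-ranked candidate is an assumption made for illustration, as are the distance values, the scorer, and the third-position alternatives 「箕」 and 「昇」; the description says only that the judgment is based on the similarity (distance) from the character recognition unit 2.

```python
from itertools import product


def process2(candidates, distances, threshold, evaluate):
    """Vary only the positions whose first-ranked candidate looks unreliable.

    distances: distance of the first-ranked candidate at each position
               (assumed: a larger distance means a less reliable character).
    """
    options = [
        chars if dist > threshold else chars[:1]  # keep trusted characters fixed
        for chars, dist in zip(candidates, distances)
    ]
    strings = ["".join(chars) for chars in product(*options)]
    best = max(strings, key=evaluate)
    return best, strings


# FIG. 7 characters; "箕" and "昇" and all distances are hypothetical.
CANDIDATES = [["な", "た", "田"], ["語", "詐", "計"], ["算", "箕", "昇"]]
DISTANCES = [0.1, 0.9, 0.2]


def toy_evaluate(s):
    # Hypothetical scorer: only the correct string is analyzable.
    return 1.0 if s == "な計算" else 0.0


best, tried = process2(CANDIDATES, DISTANCES, 0.5, toy_evaluate)
```

If the position judgment were wrong, for example if it flagged only the first character, the correct string would never be generated, which is why this variant relies on a good position judgment.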

FIG. 7 shows a specific example. Suppose that the input sentence (the correct unit character string) is 「このような計算を行なうことは」 ("performing such a calculation"), and that morphological analysis in step 103 of the first-ranked candidate character string 「このような語算を行なうことは」 has extracted 「な語算」 as the unknown word part.

In process 1, the 3^3 = 27 candidate character strings for the unknown word part (three candidate characters at each of the three positions), such as 「な語算」, 「た語算」, and 「田計算」, are first created. Morphological analysis is then performed on each of these character strings and an evaluation value is calculated; strings that cannot be analyzed are given an evaluation value of zero. In this way the character string with the highest evaluation value, 「な計算」, is selected as the correct answer.

In process 2, the misrecognized character position is determined for the unknown word part; suppose that, as a result, 「語」 is judged to be a misrecognized character. The only candidate character strings created are then 「な語算」, 「な詐算」, and 「な計算」, and 「な計算」 is selected by the same processing as in process 1. Thus, when the ability to determine the positions of misrecognized characters is high, process 2 is faster than process 1 and its correction capability is no worse.
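The candidate counts in this example follow from a simple product. As a sketch, write $k_i$ for the number of candidate characters at position $i$ of an unknown word part of length $n$, and $E$ for the set of positions judged to be misrecognized; these symbols are introduced here for illustration and do not appear in the original. The numbers of strings evaluated by process 1 ($N_1$) and process 2 ($N_2$) are then:

```latex
\[
N_1 = \prod_{i=1}^{n} k_i = 3 \cdot 3 \cdot 3 = 27,
\qquad
N_2 = \prod_{i \in E} k_i = 3 \quad \text{for } E = \{2\}.
\]
```

Process 1 therefore grows exponentially with the length of the unknown word part, while process 2 grows only with the number of positions flagged as misrecognized.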

[Effects] As is clear from the above description, according to the present invention, when post-processing of character recognition is performed using morphological analysis, a high capability of correcting misrecognized characters can be expected even when the ability to determine the positions of misrecognized characters is not high, and faster processing can be achieved when that ability is high.

[Brief explanation of drawings]

FIG. 1 is an overall configuration diagram of an embodiment of the present invention; FIG. 2 shows an example of the word dictionary; FIG. 3 shows examples of the part-of-speech classification table and the conjugation table; FIG. 4 shows an example of the connection weight matrix table; FIG. 5 shows the processing flow of the post-processing unit in FIG. 1; FIG. 6 shows the processing flow of morphological analysis; and FIG. 7 shows a specific example of the processing.

1: OCR reading unit, 2: character recognition unit, 3: post-processing unit, 10: word dictionary, 11: part-of-speech classification table, 12: conjugation table, 13: connection weight matrix table.

Claims

[Claims]

(1) A character recognition post-processing method using morphological analysis, characterized in that a character string is created from the first-ranked candidate characters for the input characters, morphological analysis is performed on that character string to extract the character string portion that cannot be analyzed, morphological analysis is performed on all character strings that can be created from the candidate characters of each character in that character string portion, and the character string with the highest evaluation value is selected.

(2) A character recognition post-processing method using morphological analysis, characterized in that a character string is created from the first-ranked candidate characters for the input characters, morphological analysis is performed on that character string to extract the character string portion that cannot be analyzed, the position of misrecognized characters is determined for each character in that character string portion, morphological analysis is performed on the candidate character strings created by substituting candidate characters only for the characters judged to be misrecognized, and the character string with the highest evaluation value is selected.