JPH03198180A

JPH03198180A - Post-processing method for character recognition

Info

Publication number: JPH03198180A
Application number: JP1339787A
Authority: JP
Inventors: Takakuni Minewaki; 隆邦嶺脇
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-12-27
Filing date: 1989-12-27
Publication date: 1991-08-29

Abstract

PURPOSE: To improve the probability of obtaining a correct recognition result by prioritizing the word candidates for a character recognition result according to their appearance frequency. CONSTITUTION: In addition to the notation of each word, a word dictionary (17) registers information on the word's frequency of appearance in documents. A post-processing part (16) also retrieves this appearance frequency information for every word candidate and sorts the word candidates in descending order of appearance frequency; that is, word candidates are prioritized from the highest appearance frequency down. Thus, even when the candidate order of the character candidates is not appropriate, the word candidate that is generally most likely to be correct is processed preferentially, which improves the probability of obtaining a correct recognition result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to character recognition devices such as kanji OCR systems, and in particular to a post-processing method for recognition results obtained by matching against a character dictionary.

[Conventional technology]

Because kanji and similar scripts contain an extremely large number of look-alike characters, recognition performed one character at a time has a limited accuracy rate. Kanji OCR systems and the like therefore often place a post-processing section after the character recognition section and correct the character recognition results there by checking their validity as words or sentences.

In such post-processing, multiple word candidates are often obtained for a character recognition result, and in that case the word candidates must be prioritized. One known method assigns each recognized candidate character a weight according to its candidate rank, computes, for each candidate word obtained by matching against the word dictionary, the sum of the weights of the candidate characters that make up that word, and prioritizes the candidate words in descending order of that sum (Japanese Examined Patent Publication No. 1-19195).
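The weight-sum ranking described above can be sketched as follows. This is only an illustration under stated assumptions: the cited publication's actual weights are not given here, so the `RANK_WEIGHTS` values, the function name `rank_by_weight_sum`, and the sample candidates and dictionary are all hypothetical.

```python
from itertools import product

# Hypothetical rank weights (assumption): 1st candidate 3, 2nd 2, 3rd 1.
RANK_WEIGHTS = [3, 2, 1]

def rank_by_weight_sum(char_candidates, dictionary):
    """Order dictionary words by the sum of their characters' rank weights."""
    scored = []
    # Pair every candidate with its rank, then try one candidate per position.
    for combo in product(*(enumerate(c) for c in char_candidates)):
        word = "".join(ch for _, ch in combo)
        if word in dictionary:
            score = sum(RANK_WEIGHTS[rank] for rank, _ in combo)
            scored.append((word, score))
    # A larger weight sum means a higher priority; ties keep discovery order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample data, not taken from the patent's tables.
candidates = [["丈", "文", "大"], ["宇", "字", "学"]]
words = {"文字", "文学", "大学", "大字"}
print(rank_by_weight_sum(candidates, words))
```

Note that the ordering here depends only on the character candidate ranks, which is the property the next section identifies as the weakness of this approach.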

[Problem to be solved by the invention]

However, the candidate rank produced by character-by-character recognition does not necessarily correspond to the rank of the word. As a result, a word that appears only very rarely may be given priority, or, conversely, a frequently occurring character with a low candidate rank may cause the words containing it to be ranked low, so that the correct recognition result is not obtained.

An object of the present invention is therefore to solve the above problem in post-processing.

[Means to solve the problem]

The present invention is characterized in that the word dictionary registers not only the notation of each word but also information on the word's frequency of appearance in documents, and in that, when multiple word candidates are obtained by matching the character recognition results against the word dictionary, each word candidate is prioritized according to its appearance frequency information.

[Operation]

Because word candidates are prioritized according to their frequency of appearance as words, the word candidates that are generally most likely to be correct are given priority even when the candidate ranks of the character candidates are not appropriate. This improves the probability of obtaining a correct recognition result.

[Example]

FIG. 1 is a schematic block diagram of a kanji OCR system according to an embodiment of the present invention, and FIG. 2 is a schematic flowchart of the processing.

In FIG. 1, image data of the document to be processed is input from an image input unit 10 and stored in an image memory 11 (processing step ■). A line/character segmentation unit 12 performs line segmentation and character segmentation on this input image data, and the segmented character image data is stored in a character image memory 13 (processing step ■). A character recognition unit 14 applies preprocessing, such as size normalization, and feature extraction to this character image data. The extracted feature vector is matched against the character dictionary stored in a character dictionary memory 15, and up to N character candidates are selected in ascending order of distance (processing step ■).
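The candidate selection in the character recognition unit (nearest dictionary entries first, up to N) might look like the sketch below. The patent does not name the distance measure or the feature layout, so the Euclidean distance, the two-dimensional toy vectors, and the function name `top_n_candidates` are assumptions made purely for illustration.

```python
import math

def top_n_candidates(feature, char_dictionary, n=3):
    """Return up to n (character, distance) pairs, smallest distance first."""
    distances = [
        # Euclidean distance is an assumption; the source only says "distance".
        (ch, math.dist(feature, reference))
        for ch, reference in char_dictionary.items()
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:n]

# Toy reference vectors (hypothetical); a real character dictionary
# would hold far more characters and dimensions.
char_dictionary = {
    "文": (0.0, 1.0),
    "丈": (0.1, 1.0),
    "大": (1.0, 1.0),
    "犬": (3.0, 3.0),
}
print(top_n_candidates((0.08, 1.0), char_dictionary, n=3))
```

In this toy input the look-alike "丈" lands ahead of "文", which is the kind of character-level ranking error the post-processing is meant to correct.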

The processing of the post-processing unit 16 is described here using the example of recognizing and post-processing the character string "...文字..." (the word "文字", meaning "character") contained in a general document.

For example, assume that the processing of the character recognition unit 14 (character-by-character recognition) yields the recognition results shown in Table 1.

Table 1 (Character recognition results)

The post-processing unit 16 matches the character recognition results against the word dictionary stored in a word dictionary memory 17 and selects the word candidates that match a combination of character candidates (processing step ■). Here, assume that the word candidates shown in Table 2 are obtained.
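The dictionary lookup step can be sketched as trying one character candidate per position and keeping the combinations that are registered words. The contents of Tables 1 and 2 are not reproduced in this text, so the candidate lists, the dictionary entries, their frequency values, and the function name `find_word_candidates` are hypothetical.

```python
from itertools import product

def find_word_candidates(char_candidates, word_dictionary):
    """Return every dictionary word formed by one candidate per position."""
    found = []
    for combo in product(*char_candidates):
        word = "".join(combo)
        # Keep each matching word once, in the order it is discovered.
        if word in word_dictionary and word not in found:
            found.append(word)
    return found

# Hypothetical stand-ins for Table 1 (candidates per character position)
# and for the word dictionary (notation -> appearance frequency).
char_candidates = [["丈", "文", "大"], ["宇", "字", "学"]]
word_dictionary = {"文字": 120, "文学": 45, "大学": 80, "大字": 2}
print(find_word_candidates(char_candidates, word_dictionary))
```

Exhaustive enumeration is fine for short words with a handful of candidates per position; a larger system would more likely walk a prefix structure over the dictionary instead.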

Table 2 (Word candidates and frequencies)

However, this alone cannot determine which of several word candidates is closest to the correct answer. In the present invention, therefore, the word dictionary (17) registers, in addition to the notation of each word, information on the word's frequency of appearance in documents. The post-processing unit 16 also retrieves this appearance frequency information for each word candidate and sorts the word candidates in descending order of appearance frequency (processing step ■). That is, the word candidates are prioritized from the highest appearance frequency down. Assuming the appearance frequencies are as shown in Table 2, the sorted result is as shown in Table 3.
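The frequency-based prioritization amounts to a descending sort keyed on the frequency stored with each dictionary entry. A minimal sketch follows, assuming the dictionary is a mapping from notation to a numeric frequency; the frequency values and the function name `prioritize_by_frequency` are hypothetical, since the actual Table 2 values are not reproduced in this text.

```python
def prioritize_by_frequency(word_candidates, word_dictionary):
    """Sort word candidates so the most frequent word comes first."""
    return sorted(
        word_candidates,
        key=lambda word: word_dictionary[word],  # look up registered frequency
        reverse=True,
    )

# Hypothetical dictionary: notation -> appearance frequency in documents.
word_dictionary = {"文字": 120, "文学": 45, "大学": 80, "大字": 2}
word_candidates = ["大字", "文学", "文字", "大学"]
print(prioritize_by_frequency(word_candidates, word_dictionary))
```

Because the sort key ignores character candidate ranks entirely, a word built from second-ranked characters can still come out on top when it is the most frequent.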

Table 3 (Word candidate ranking after sorting)

The post-processing unit 16 then rewrites the character recognition result according to the top-priority word candidate (processing step ■). In this example, "文字" has the highest priority, so the first candidate for character number 1 is rewritten to "文" and the first candidate for character number 2 is rewritten to "字".
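The rewrite step can be sketched as putting each character of the top-priority word into first place at its position. The text only says that the first candidate is rewritten; keeping the displaced candidates after it is a design choice assumed here, and the function name `apply_top_word` and the sample candidate lists are hypothetical.

```python
def apply_top_word(char_candidates, top_word):
    """Make each character of top_word the first candidate at its position."""
    corrected = []
    for candidates, ch in zip(char_candidates, top_word):
        # Assumption: keep the other candidates, in order, after the new first.
        rest = [c for c in candidates if c != ch]
        corrected.append([ch] + rest)
    return corrected

# Hypothetical recognition result before correction.
char_candidates = [["丈", "文", "大"], ["宇", "字", "学"]]
print(apply_top_word(char_candidates, "文字"))
```

When only one word candidate exists, the same call applies with that word, matching the single-candidate case described later in the text.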

The recognition result corrected by this post-processing is stored in a result memory 19 and output by a result output unit 18 (processing step ■).

When there is only one word candidate, that candidate automatically receives top priority and the correction is made accordingly (processing steps ■, ■).

The present invention can also be applied to the post-processing of speech recognition.

[Effects of the Invention]

As explained above, the present invention prioritizes the word candidates for a character recognition result according to their frequency of appearance. This avoids giving priority to words that rarely appear, gives priority to the words that appear most frequently and are generally most likely to be correct, and thereby increases the probability of obtaining a correct recognition result.

[Brief explanation of drawings]

FIG. 1 is a schematic block diagram of a kanji OCR system according to an embodiment of the present invention, and FIG. 2 is a schematic flowchart of the processing.

10: image input unit; 11: image memory; 12: line/character segmentation unit; 13: character image memory; 14: character recognition unit; 15: character dictionary memory; 16: post-processing unit; 17: word dictionary memory; 18: result output unit; 19: result memory.

Claims

[Claims]

(1) A post-processing method for character recognition in which word candidates are obtained by matching character-by-character recognition results against a word dictionary in which words are registered, and the character recognition results are corrected according to the word candidates, characterized in that information on the frequency of appearance of each word in documents is registered in the word dictionary, and, when a plurality of word candidates are obtained by matching the character recognition results against the word dictionary, each word candidate is prioritized according to its appearance frequency information.