JPH03198180A - Post-processing method for character recognition - Google Patents

Post-processing method for character recognition

Info

Publication number
JPH03198180A
JPH03198180A JP1339787A JP33978789A JPH03198180A JP H03198180 A JPH03198180 A JP H03198180A JP 1339787 A JP1339787 A JP 1339787A JP 33978789 A JP33978789 A JP 33978789A JP H03198180 A JPH03198180 A JP H03198180A
Authority
JP
Japan
Prior art keywords
word
character
candidate
post
character recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP1339787A
Other languages
Japanese (ja)
Inventor
Takakuni Minewaki
隆邦 嶺脇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to JP1339787A
Publication of JPH03198180A
Legal status: Pending

Abstract

PURPOSE: To improve the probability of obtaining a correct recognition result by prioritizing the word candidates for a character recognition result according to their appearance frequency. CONSTITUTION: In addition to the notation of each word, a word dictionary (17) registers information on how frequently the word appears in documents. A post-processing part (16) retrieves this appearance frequency information for every word candidate and sorts the word candidates in descending order of appearance frequency; that is, word candidates are prioritized from the most frequent to the least frequent. As a result, even when the candidate ranking of the character candidates is not appropriate, the word candidate that is generally most likely to be correct is processed preferentially, which improves the probability of obtaining a correct recognition result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to character recognition devices such as kanji OCR, and in particular to a post-processing method for recognition results obtained by matching against a character dictionary.

[Prior Art]

Kanji and similar scripts contain a very large number of visually similar characters, so recognition performed one character at a time has limited accuracy. For this reason, kanji OCR systems and the like often place a post-processing unit after the character recognition unit; the post-processing unit checks whether the output is plausible as a word or sentence and corrects the character recognition result accordingly.

In such post-processing, several word candidates are often obtained for a single character recognition result, in which case the word candidates must be prioritized. One known method assigns each recognized candidate character a weight according to its candidate rank, computes, for each candidate word obtained by matching against the word dictionary, the sum of the weights of the candidate characters that make up the word, and prioritizes the candidate words in descending order of that sum (Japanese Examined Patent Publication No. Hei 1-19195).
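The weighted-sum ranking described above can be sketched as follows. This is only an illustration under stated assumptions: the candidate characters, the candidate words, and the weight values `[3, 2, 1]` are hypothetical and do not come from the cited publication, and the function names are invented for this sketch.

```python
def weight_sum(word, char_candidates, weights):
    """Sum the rank weights of the characters that make up `word`.

    char_candidates[pos] lists the candidates for character position `pos`,
    best first; weights[rank] is the weight given to candidate rank `rank`.
    """
    return sum(
        weights[char_candidates[pos].index(ch)]
        for pos, ch in enumerate(word)
    )


def rank_by_character_weights(words, char_candidates, weights):
    """Order candidate words by descending weight sum (prior-art style)."""
    return sorted(
        words,
        key=lambda w: weight_sum(w, char_candidates, weights),
        reverse=True,
    )


# Hypothetical recognition result for a two-character string.
char_candidates = [["又", "文", "丈"], ["宇", "学", "字"]]
weights = [3, 2, 1]  # assumed weights for ranks 1, 2, 3

ranked = rank_by_character_weights(["文字", "文学"], char_candidates, weights)
print(ranked)
```

With these made-up inputs, the order depends only on where each character sits in its candidate list, not on how common the word is.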

[Problem to Be Solved by the Invention]

However, the candidate ranking produced by character-by-character recognition does not necessarily correspond to the ranking of words. As a result, a word that appears only very rarely may be given priority, or conversely, when even a frequently occurring character has a low candidate rank, the word containing it receives a low rank, and the correct recognition result is not obtained.

An object of the present invention is therefore to solve the above problem in post-processing.

[Means for Solving the Problem]

The present invention is characterized in that the word dictionary registers not only the notation of each word but also information on how frequently the word appears in documents, and in that, when matching the character recognition result against the word dictionary yields multiple word candidates, each word candidate is prioritized according to its appearance frequency information.

[Operation]

Because word candidates are prioritized according to their appearance frequency as words, the word candidate that is generally most likely to be correct is given priority even when the candidate ranking of the character candidates is not appropriate, which improves the probability of obtaining a correct recognition result.

[Embodiment]

FIG. 1 is a schematic block diagram of a kanji OCR system according to an embodiment of the present invention, and FIG. 2 is a schematic flowchart of its processing.

In FIG. 1, image data of the document to be processed is input from the image input unit 10 and stored in the image memory 11 (a processing step of FIG. 2). The line/character segmentation unit 12 extracts lines and then individual characters from this input image data, and the extracted character image data is stored in the character image memory 13 (processing step). The character recognition unit 14 applies preprocessing such as size normalization to this character image data and extracts features; the extracted feature vector is then matched against the character dictionary stored in the character dictionary memory 15, and up to N character candidates are selected in ascending order of distance (processing step).
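The selection of up to N character candidates by distance can be sketched as below. The source does not specify the feature representation or the distance measure, so the two-dimensional feature vectors, the use of Euclidean distance, and the name `top_n_candidates` are all assumptions made for illustration.

```python
import math


def top_n_candidates(feature, char_dictionary, n):
    """Return up to `n` characters whose reference vectors are nearest to `feature`.

    char_dictionary maps each character to a reference feature vector.
    Euclidean distance is an assumption; the source only says "distance".
    """
    distances = [
        (ch, math.dist(feature, reference))
        for ch, reference in char_dictionary.items()
    ]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return [ch for ch, _ in distances[:n]]


# Hypothetical character dictionary with toy 2-D feature vectors.
char_dictionary = {
    "文": (1.0, 0.0),
    "又": (0.9, 0.1),
    "丈": (0.7, 0.4),
    "字": (0.0, 1.0),
}

candidates = top_n_candidates((0.92, 0.08), char_dictionary, n=3)
print(candidates)
```

In this toy input the extracted vector lies slightly closer to 又 than to 文, which is the kind of character-level misranking that the post-processing is meant to correct.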

The processing of the post-processing unit 16 is described below, taking as an example the recognition and post-processing of the character string "…文字…" (containing the word 文字, "character") in a general document.

Suppose, for example, that the processing of the character recognition unit 14 (character-by-character recognition) yields the recognition results shown in Table 1.

Table 1 (character recognition results)

The post-processing unit 16 matches the character recognition results against the word dictionary stored in the word dictionary memory 17 and selects the word candidates that correspond to some combination of the character candidates (processing step). Suppose here that the word candidates shown in Table 2 are obtained.
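One way to picture the matching of character-candidate combinations against the word dictionary is the sketch below. The contents of Tables 1 and 2 are not available in this text, so the candidate characters, the dictionary entries, and their frequency values are hypothetical, as is the function name.

```python
from itertools import product


def find_word_candidates(char_candidates, word_dictionary):
    """Return dictionary words that match some combination of character candidates.

    char_candidates[pos] lists the candidates for character position `pos`.
    word_dictionary maps a word's notation to its appearance frequency.
    """
    words = []
    for combination in product(*char_candidates):
        word = "".join(combination)
        if word in word_dictionary and word not in words:
            words.append(word)
    return words


# Hypothetical data standing in for Table 1 and the word dictionary.
char_candidates = [["又", "文", "丈"], ["宇", "学", "字"]]
word_dictionary = {"文字": 120, "文学": 45, "丈夫": 60}

word_candidates = find_word_candidates(char_candidates, word_dictionary)
print(word_candidates)
```

Enumerating every combination is adequate for short strings with a handful of candidates per position; a real system would likely need a more efficient dictionary lookup.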

Table 2 (word candidates and frequencies)

This alone, however, does not make it possible to decide which of several word candidates is closest to the correct answer. In the present invention, therefore, the word dictionary (17) registers, in addition to the notation of each word, information on how frequently the word appears in documents. The post-processing unit 16 also retrieves this appearance frequency information for each word candidate and sorts the word candidates in descending order of appearance frequency (processing step); that is, it prioritizes the word candidates from the most frequent to the least frequent. If the appearance frequencies are as shown in Table 2, the result of the sort is as shown in Table 3.
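The frequency-based sort amounts to a single descending sort keyed on the dictionary's frequency entry. The sketch below assumes the dictionary is a mapping from notation to a numeric frequency; the frequency values and the name `sort_by_frequency` are hypothetical.

```python
def sort_by_frequency(word_candidates, word_dictionary):
    """Sort word candidates so that the most frequent word comes first."""
    return sorted(
        word_candidates,
        key=lambda word: word_dictionary[word],
        reverse=True,
    )


# Hypothetical appearance frequencies registered in the word dictionary.
word_dictionary = {"文字": 120, "文学": 45}

sorted_candidates = sort_by_frequency(["文学", "文字"], word_dictionary)
print(sorted_candidates)
```

A list with a single candidate passes through unchanged, so that candidate ends up with the highest priority without any special handling.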

Table 3 (word candidate ranking after sorting)

The post-processing unit 16 then rewrites the character recognition result using the word candidate with the highest priority (processing step). In this example "文字" has the highest priority, so the first candidate for character number 1 is rewritten to "文" and the first candidate for character number 2 is rewritten to "字".
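The rewriting step can be sketched as below. The source only states that the first candidate at each character position is rewritten; moving the chosen character to the front while keeping the other candidates behind it is an assumption of this sketch, and the candidate lists are hypothetical.

```python
def rewrite_first_candidates(char_candidates, best_word):
    """Make each character of `best_word` the first candidate at its position.

    The remaining candidates are kept in their original order after it
    (an assumed policy; the source does not say what happens to them).
    """
    corrected = []
    for candidates, chosen in zip(char_candidates, best_word):
        others = [ch for ch in candidates if ch != chosen]
        corrected.append([chosen] + others)
    return corrected


# Hypothetical recognition result and the top-priority word candidate.
char_candidates = [["又", "文", "丈"], ["宇", "学", "字"]]

corrected = rewrite_first_candidates(char_candidates, "文字")
recognized_text = "".join(candidates[0] for candidates in corrected)
print(recognized_text)
```

Reading off the first candidate at each position then yields the corrected string.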

The recognition result corrected by this post-processing is stored in the result memory 19 and output by the result output unit 18 (processing step).

When there is only one word candidate, that candidate automatically receives the highest priority and the correction is performed with it (processing steps).

The present invention can also be applied to post-processing for speech recognition.

[Effects of the Invention]

As described above, according to the present invention, the word candidates for a character recognition result are prioritized according to their appearance frequency. This avoids giving priority to words that rarely appear, gives priority to the word with the highest appearance frequency, which is generally the most likely to be correct, and thereby increases the probability of obtaining a correct recognition result.

[Brief Description of the Drawings]

FIG. 1 is a schematic block diagram of a kanji OCR system according to an embodiment of the present invention, and FIG. 2 is a schematic flowchart of its processing.

10 ... image input unit
11 ... image memory
12 ... line/character segmentation unit
13 ... character image memory
14 ... character recognition unit
15 ... character dictionary memory
16 ... post-processing unit
17 ... word dictionary memory
18 ... result output unit
19 ... result memory

Claims (1)

[Claims]

(1) A post-processing method for character recognition in which word candidates are obtained by matching a character-by-character recognition result against a word dictionary in which words are registered, and the character recognition result is corrected according to the word candidates, characterized in that information on the appearance frequency of each word in documents is registered in the word dictionary, and, when matching the character recognition result against the word dictionary yields multiple word candidates, each word candidate is prioritized according to the appearance frequency information.
JP1339787A 1989-12-27 1989-12-27 Post-processing method for character recognition Pending JPH03198180A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP1339787A JPH03198180A (en) 1989-12-27 1989-12-27 Post-processing method for character recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP1339787A JPH03198180A (en) 1989-12-27 1989-12-27 Post-processing method for character recognition

Publications (1)

Publication Number Publication Date
JPH03198180A true JPH03198180A (en) 1991-08-29

Family

ID=18330805

Family Applications (1)

Application Number Title Priority Date Filing Date
JP1339787A Pending JPH03198180A (en) 1989-12-27 1989-12-27 Post-processing method for character recognition

Country Status (1)

Country Link
JP (1) JPH03198180A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0553745A2 (en) * 1992-01-30 1993-08-04 Matsushita Electric Industrial Co., Ltd. Character recognition apparatus
JPH05205110A (en) * 1992-01-30 1993-08-13 Matsushita Electric Ind Co Ltd Character recognizing device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0553745A2 (en) * 1992-01-30 1993-08-04 Matsushita Electric Industrial Co., Ltd. Character recognition apparatus
JPH05205110A (en) * 1992-01-30 1993-08-13 Matsushita Electric Ind Co Ltd Character recognizing device
EP0553745A3 (en) * 1992-01-30 1994-06-22 Matsushita Electric Ind Co Ltd Character recognition apparatus
US5689583A (en) * 1992-01-30 1997-11-18 Matsushita Electric Industrial Co. Ltd. Character recognition apparatus using a keyword

Similar Documents

Publication Publication Date Title
CN108920633B (en) Paper similarity detection method
CN107609006B (en) Search optimization method based on local log research
JPH03198180A (en) Post-processing method for character recognition
CN113609864B (en) Text semantic recognition processing system and method based on industrial control system
CN108959263B (en) Entry weight calculation model training method and device
JP2003331214A (en) Character recognition error correction method, device and program
CN110765767A (en) Extraction method, device, server and storage medium of local optimization keywords
EP1076305A1 (en) A phonetic method of retrieving and presenting electronic information from large information sources, an apparatus for performing the method, a computer-readable medium, and a computer program element
JP2827066B2 (en) Post-processing method for character recognition of documents with mixed digit strings
JP2894305B2 (en) Recognition device candidate correction method
CN113268973B (en) Man-machine multi-turn conversation method and device
JP3548372B2 (en) Character recognition device
JP2746345B2 (en) Post-processing method for character recognition
JP3350127B2 (en) Character recognition device
JPH09185674A (en) Device and method for detecting and correcting erroneously recognized character
JP2918380B2 (en) Post-processing method of character recognition result
Muliadi et al. Comparison of String Similarity Algorithm in post-processing OCR
JP2982244B2 (en) Character recognition post-processing method
JP3314720B2 (en) String search device
JPH0757059A (en) Character recognition device
JP3123181B2 (en) Character recognition device
JP3339879B2 (en) Character recognition device
JPH0262659A (en) Extracting device for correction candidate character of japanese sentence
JPH06161995A (en) Method and device for shaping name data
JPH0540854A (en) Post-processing method for character recognizing result