JPS62285189A - Character recognition post processing system - Google Patents

Character recognition post processing system

Info

Publication number
JPS62285189A
Authority
JP
Japan
Prior art keywords
character
word
candidate
characters
morphological analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP61128558A
Other languages
Japanese (ja)
Inventor
Jiichi Igarashi
五十嵐 治一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to JP61128558A
Publication of JPS62285189A
Legal status: Pending

Abstract

PURPOSE: To correct misrecognized characters with high accuracy by performing morphological analysis on all character strings that can be formed from the candidate characters. CONSTITUTION: For each input character read by an OCR reading unit 1, a character recognition unit 2 computes the similarity to standard characters and obtains candidate characters, which are sent to a post-processing unit 3. The post-processing unit 3 cuts out words from the candidate characters by referring to a word dictionary 10, a part-of-speech classification table 11, a conjugation table 12, and a connection weight matrix table 13, and performs morphological analysis by checking the connection with the preceding word. For an unknown word part, morphological analysis is performed on all character strings that can be formed from the candidate characters of each of its characters, and the character string with the highest evaluation value is selected. Thus, a high ability to correct misrecognized characters can be expected even when the ability to determine the positions of misrecognized characters is not high.

Description

3. Detailed Description of the Invention

[Technical Field] The present invention relates to a post-processing method for OCR character recognition devices and the like.

[Prior Art] In OCR character recognition devices and the like, recognition accuracy is generally improved by applying further post-processing to candidate characters that have been judged to be misrecognized. In recent years, the use of grammatical knowledge of the language has been considered for this kind of post-processing. A representative approach uses morphological analysis: a character string containing a character judged to be misrecognized is morphologically analyzed with its candidate characters taken into account, and the correct character is determined by checking the connections between parts of speech and the like. In the prior art, however, when the ability to determine the positions of misrecognized characters is low, processing becomes slow and wrong candidate characters are more often judged to be correct.

[Purpose] An object of the present invention is to provide a character recognition post-processing method using morphological analysis with which a high ability to correct misrecognized characters can be expected even when the ability to determine the positions of misrecognized characters is not high.

[Configuration] The present invention performs character recognition on each input character to obtain candidate characters, builds a character string from the first-ranked candidate characters only, and performs morphological analysis on it. It then extracts the portion that cannot be analyzed and executes the following Process 1 or Process 2 on that character string portion (referred to as the unknown word part) to make the final estimate of the input characters.

Process 1: Morphological analysis is performed on every character string of the unknown word part that can be formed from the candidate characters of each of its characters, and the character string with the highest evaluation value is selected.

Process 2: Misrecognized character positions are determined for each character of the unknown word part. Candidate character strings for the unknown word part are created by substituting the candidate characters in turn only at the character positions judged to be misrecognized; morphological analysis is performed on these candidate strings, and the character string with the highest evaluation value is selected.

An embodiment of the present invention will be described below with reference to the drawings.

FIG. 1 shows a block diagram of an embodiment of the present invention.

For the input sentence read by the OCR reading unit 1, the character recognition unit 2 calculates the similarity to standard characters for each input character and obtains candidate characters. The candidate characters obtained by the character recognition unit 2 are sent to the post-processing unit 3. The present invention concerns the processing of this post-processing unit 3. To perform morphological analysis, the post-processing unit 3 is provided with a word dictionary 10, a part-of-speech classification table 11, a conjugation table 12, and a connection weight matrix table 13.

As shown in FIG. 2, the word dictionary 10 contains, for each word, the reading (the reading of the word written in hiragana), the notation (the kana or kanji notation that is output), the part of speech, the frequency rank, and other information.
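As an illustration only, one entry of the word dictionary described above could be modelled as a small record. The field names (`reading`, `surface`, `pos`, `freq_rank`, `extra`), the helper `build_index`, and the sample entry are assumptions made for this sketch; the patent names the kinds of information stored, not a concrete layout.

```python
from dataclasses import dataclass, field


@dataclass
class DictEntry:
    """One hypothetical record of word dictionary 10 (field names are assumed)."""
    reading: str      # reading of the word written in hiragana
    surface: str      # kana/kanji notation that is output
    pos: str          # part of speech
    freq_rank: int    # frequency rank (its scale is not given in the source)
    extra: dict = field(default_factory=dict)  # "other information"


def build_index(entries):
    """Index entries by surface form so a cut-out string can be looked up."""
    index = {}
    for entry in entries:
        index.setdefault(entry.surface, []).append(entry)
    return index


# Sample data invented for illustration.
entries = [DictEntry(reading="けいさん", surface="計算", pos="noun", freq_rank=1)]
index = build_index(entries)
```

Indexing by surface form is a design choice of the sketch: the post-processing step looks words up from recognized characters, so the notation rather than the reading is the natural key here.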

The part-of-speech classification table 11 and the conjugation table 12 are tables that give the row and column numbers used when searching the connection weight matrix table 13. The part-of-speech classification table 11 covers parts of speech that have no conjugated ending and has the record structure shown in FIG. 3. The conjugation table 12 covers parts of speech that have conjugated endings and has the record structure shown in FIG. 3(b). Here, the conjugated-ending column holds the ending that follows the stem of a verb, adjective, or the like, and a word becomes a candidate for evaluation only when this ending matches the input characters.

As shown in FIG. 4, the connection weight matrix table 13 is a matrix whose rows take the receiving (uke) code and whose columns take the attaching (kakari) code; each intersection gives the weight of the connection. The procedure leading up to a search of the connection weight matrix table 13 is as follows: the word dictionary 10 is searched, the uke and kakari codes are found from the part of speech of the matching word in the part-of-speech classification table 11 or the conjugation table 12 (for a conjugated word, the conjugated ending is matched against the following character string at this point), and the connection is checked in the connection weight matrix table 13.
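The lookup flow just described can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table contents, the code numbers, the weights, and the function names `codes_for` and `connection_weight` are all invented, and the sketch assumes that the following word supplies the row (uke code) while the preceding word supplies the column (kakari code).

```python
# Hypothetical contents of tables 11, 12 and 13; every code and weight is invented.
POS_TABLE = {                      # part-of-speech classification table 11
    "noun": {"uke": 0, "kakari": 0},
    "particle": {"uke": 1, "kakari": 1},
}
CONJ_TABLE = {                     # conjugation table 12, keyed by (pos, ending)
    ("verb", "う"): {"uke": 2, "kakari": 2},
}
WEIGHTS = [                        # connection weight matrix table 13
    [1, 2, 1],                     # row uke=0 (noun)
    [3, 0, 1],                     # row uke=1 (particle)
    [0, 3, 0],                     # row uke=2 (verb)
]


def codes_for(pos, following_text="", ending=None):
    """Return the uke/kakari codes of a word, or None if it is not evaluated."""
    if ending is None:
        # Word without a conjugated ending: use table 11.
        return POS_TABLE.get(pos)
    # Conjugated word: the ending must match the text that follows the stem.
    if not following_text.startswith(ending):
        return None
    return CONJ_TABLE.get((pos, ending))


def connection_weight(prev_codes, next_codes):
    """Assumed orientation: row = uke code of the next word,
    column = kakari code of the previous word."""
    return WEIGHTS[next_codes["uke"]][prev_codes["kakari"]]
```

With these toy tables, a particle following a noun gets weight 3, while a particle following a particle gets weight 0, which a caller could treat as "cannot connect".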

FIG. 5 is a flowchart of the processing of the post-processing unit 3, which is the core of the present invention. The processing of the post-processing unit 3 is described in detail below with reference to FIG. 5.

From the candidate characters sent by the character recognition unit 2, a character string consisting only of the first-ranked candidate characters is created (step 101), and a unit is cut out from this character string (step 102). Here, a unit is a character string delimited by separator symbols such as periods, commas, and parentheses. Morphological analysis is performed on this unit, for example by the procedure of FIG. 6 (step 103). That is, a predetermined number of characters are read from the head of the unit (step 201), and a word is cut out by searching the word dictionary 10 (step 202). The part-of-speech classification table 11 or the conjugation table 12 is searched for the cut-out word and, for a conjugated word, the conjugated ending is matched against the input character string (step 203). The connection with the immediately preceding word is then checked using the connection weight matrix table 13 (step 204), the pointer is advanced by the number of characters in the cut-out word (step 205), and the same processing is repeated. The word dictionary used in this morphological analysis is assumed to be large enough to contain every word in the input sentence.
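The loop of steps 201 to 205 can be sketched roughly as below. The sketch makes several simplifying assumptions that are not in the patent: it tries the longest dictionary match first, does no backtracking, ignores conjugation, and treats a connection weight of zero as "cannot connect". The dictionary, the weights, the window size `MAX_WORD_LEN`, and the function name `analyze` are all invented for illustration.

```python
MAX_WORD_LEN = 4  # the "predetermined number of characters" (value assumed)

# Toy dictionary: surface form -> part of speech (invented for illustration).
DICTIONARY = {"この": "adnominal", "よう": "noun", "な": "auxiliary",
              "計算": "noun", "を": "particle"}

# Toy connection weights: (previous pos, current pos) -> weight; missing means 0.
CONNECT = {("BOS", "adnominal"): 1, ("adnominal", "noun"): 1,
           ("noun", "auxiliary"): 1, ("auxiliary", "noun"): 1,
           ("noun", "particle"): 1}


def analyze(unit):
    """Segment `unit` greedily; return (words, position where analysis stopped)."""
    words = []
    pos = 0
    prev = "BOS"  # marker for the beginning of the unit
    while pos < len(unit):
        window = unit[pos:pos + MAX_WORD_LEN]              # step 201
        for length in range(len(window), 0, -1):           # step 202, longest first
            cand = window[:length]
            tag = DICTIONARY.get(cand)                     # step 203 (no conjugation)
            if tag is not None and CONNECT.get((prev, tag), 0) > 0:  # step 204
                words.append((cand, tag))
                prev = tag
                pos += length                              # step 205
                break
        else:
            # No word could be cut out here: analysis fails at `pos`.
            return words, pos
    return words, pos
```

In this toy setup, `analyze("このような計算を")` consumes the whole string, whereas `analyze("このような語算を")` stops at the character 「語」. How the surrounding unknown word part is delimited from such a failure point is not spelled out in the source, so the sketch only reports the stopping position.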

By performing the above morphological analysis, a character string containing misrecognized characters is extracted as an unknown word part (step 104). Process 1 or Process 2 below is then executed on the extracted unknown word part.

In Process 1, all possible candidate character strings are first created by replacing each character of the unknown word part with its candidate characters in turn (step 105). Next, the same morphological analysis as in step 103 is performed on each candidate character string of the unknown word part (step 106), an evaluation value is calculated for each candidate character string (step 107), and the candidate character string with the maximum-likelihood evaluation value is selected (step 108). It is then determined whether the end of the unit has been reached (step 109). If not, the pointer indicating the character position within the unit is advanced to the character position following the unknown word part just processed and processing returns to step 103 (step 110); if so, processing proceeds to the next unit (step 111).
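Steps 105 to 108 of Process 1 amount to an exhaustive search over the Cartesian product of the per-character candidate lists. The following is a minimal sketch under stated assumptions: the function name `process1`, its signature, and the scoring callback `evaluate` are hypothetical, and the patent does not specify how the evaluation value is computed.

```python
from itertools import product


def process1(candidates_per_char, evaluate):
    """Try every string that can be formed from the candidates and keep the best.

    candidates_per_char: one list of candidate characters per position,
                         first-ranked candidate first.
    evaluate:            callable returning an evaluation value for a string
                         (stands in for morphological analysis plus scoring).
    """
    best, best_score = None, float("-inf")
    for combo in product(*candidates_per_char):   # step 105: all combinations
        text = "".join(combo)
        score = evaluate(text)                    # steps 106-107
        if score > best_score:                    # step 108: keep the maximum
            best, best_score = text, score
    return best, best_score


def toy_evaluate(text):
    """Invented scorer: pretends that only 「計算」 can be analyzed."""
    return 1.0 if text == "計算" else 0.0


best, score = process1([["語", "計"], ["算"]], toy_evaluate)
```

Because the comparison is strict, ties are resolved in favor of the string generated first, which is the one built from higher-ranked candidates. That tie-breaking rule is a choice of this sketch, not something stated in the source.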

In Process 2, on the other hand, misrecognized character positions are determined for the extracted unknown word part on the basis of the similarity (distance) calculated by the character recognition unit 2, and then the processing from step 105 onward is performed (step 110).

In this case, in step 105, character strings for the unknown word part are created by replacing only the characters judged to be misrecognized with their candidate characters.
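One way to picture Process 2 is as a filter applied before candidate generation. The source says only that the determination is based on the similarity (distance) computed by the character recognition unit; the fixed threshold used below, the distance values, and the function names `flag_misrecognized` and `restrict_candidates` are assumptions made for this sketch.

```python
def flag_misrecognized(distances, threshold):
    """Return the positions whose first-candidate distance exceeds `threshold`.

    Assumed criterion: a large distance to the standard character means the
    first-ranked candidate is unreliable.
    """
    return {i for i, d in enumerate(distances) if d > threshold}


def restrict_candidates(candidates_per_char, flagged):
    """Keep all candidates at flagged positions, only the first one elsewhere."""
    return [list(cands) if i in flagged else [cands[0]]
            for i, cands in enumerate(candidates_per_char)]


# Invented data: three positions with three candidates each.
candidates = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
distances = [0.1, 0.9, 0.2]          # distance of each first-ranked candidate
flagged = flag_misrecognized(distances, threshold=0.5)
pools = restrict_candidates(candidates, flagged)
```

The restricted pools can then be enumerated and scored exactly as in Process 1; only the size of the search space changes.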

FIG. 7 shows a specific example. Suppose the input sentence (the correct unit character string) is 「このような計算を行なうことは」 (roughly, "performing a calculation like this"). Suppose further that morphological analysis of the first-ranked candidate character string 「このような語算を行なうことは」 in step 103 results in 「な語算」 being extracted as the unknown word part.

In Process 1, 3^3 = 27 candidate character strings for the unknown word part, 「な語算」, 「た語算」, 「田計算」, and so on, are first created. Morphological analysis is then performed on each of these character strings and an evaluation value is calculated; strings that cannot be analyzed are given an evaluation value of zero. In this way the character string 「な計算」, which has the highest evaluation value, is selected as the correct answer.

In Process 2, misrecognized character positions are determined for the unknown word part; suppose that, as a result, 「語」 is judged to be a misrecognized character. Then only three candidate character strings are created, 「な語算」, 「な詐算」, and 「な計算」, and 「な計算」 is selected by the same processing as in Process 1. Thus, when the ability to determine misrecognized character positions is high, Process 2 is faster than Process 1 and its correction ability is not inferior.
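The candidate counts in this example can be checked with a few lines of code. The candidates for the first two positions are taken from the strings quoted in the text; the two alternates for the third position (「箕」 and 「草」) do not appear in the source and are invented here only so that each position has three candidates.

```python
from itertools import product

first = ["な", "た", "田"]    # candidates seen in the quoted strings
second = ["語", "詐", "計"]   # candidates seen in the quoted strings
third = ["算", "箕", "草"]    # 「箕」 and 「草」 are invented placeholders

# Process 1: every position ranges over all of its candidates.
all_strings = ["".join(c) for c in product(first, second, third)]

# Process 2: only the second position (「語」) was judged misrecognized,
# so the other positions keep their first-ranked candidate.
restricted = ["".join(c) for c in product(first[:1], second, third[:1])]

count_process1 = len(all_strings)   # 3 * 3 * 3
count_process2 = len(restricted)    # 1 * 3 * 1
```

More generally, with k candidates per character, an unknown word part of n characters gives k^n strings under Process 1, and k^m under Process 2 when m of the positions are flagged, which is where the speed difference comes from.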

[Effects] As is clear from the above description, according to the present invention, when post-processing of character recognition is performed using morphological analysis, a high ability to correct misrecognized characters can be expected even when the ability to determine misrecognized character positions is not high, and faster processing can be achieved when that ability is high.

[Brief Description of the Drawings]

FIG. 1 is an overall configuration diagram of an embodiment of the present invention, FIG. 2 shows an example of the word dictionary, FIG. 3 shows examples of the part-of-speech classification table and the conjugation table, FIG. 4 shows an example of the connection weight matrix table, FIG. 5 shows the processing flow of the post-processing unit in FIG. 1, FIG. 6 shows the processing flow of morphological analysis, and FIG. 7 shows a specific example of the processing.

1: OCR reading unit, 2: character recognition unit, 3: post-processing unit, 10: word dictionary, 11: part-of-speech classification table, 12: conjugation table, 13: connection weight matrix table.

Claims (2)

[Claims]

(1) In a post-processing method for character recognition using morphological analysis, a character recognition post-processing method characterized in that a character string is created from the first-ranked candidate characters for the input characters and morphologically analyzed to extract the character string portion that cannot be analyzed, morphological analysis is performed on all character strings that can be formed from the candidate characters of each character of that portion, and the character string with the highest evaluation value is selected.

(2) In a post-processing method for character analysis using morphological analysis, a character recognition post-processing method characterized in that a character string is created from the first-ranked candidate characters for the input characters and morphologically analyzed to extract the character string portion that cannot be analyzed, misrecognized character positions are determined for each character of that portion, morphological analysis is performed on the candidate character strings created by substituting candidate characters only for the characters judged to be misrecognized, and the character string with the highest evaluation value is selected.
JP61128558A 1986-06-03 1986-06-03 Character recognition post processing system Pending JPS62285189A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP61128558A JPS62285189A (en) 1986-06-03 1986-06-03 Character recognition post processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP61128558A JPS62285189A (en) 1986-06-03 1986-06-03 Character recognition post processing system

Publications (1)

Publication Number Publication Date
JPS62285189A true JPS62285189A (en) 1987-12-11

Family

ID=14987731

Family Applications (1)

Application Number Title Priority Date Filing Date
JP61128558A Pending JPS62285189A (en) 1986-06-03 1986-06-03 Character recognition post processing system

Country Status (1)

Country Link
JP (1) JPS62285189A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5943443A (en) * 1996-06-26 1999-08-24 Fuji Xerox Co., Ltd. Method and apparatus for image based document processing
US9009026B2 (en) 2011-09-26 2015-04-14 Fuji Xerox Co., Ltd. Information processing apparatus, non-transitory computer readable medium storing information processing program, and information processing method

Similar Documents

Publication Publication Date Title
KR101072460B1 (en) Method for korean morphological analysis
JPS62285189A (en) Character recognition post processing system
JP2002091961A (en) System and processing method for detecting/correcting corpus error and program recording medium
JPS62284480A (en) Post processing system for character recognition
JP3350127B2 (en) Character recognition device
JPS62293386A (en) Postprocessing for recognizing character
JP2827066B2 (en) Post-processing method for character recognition of documents with mixed digit strings
JPS62180462A (en) Voice input kana-kanji converter
JPS62284481A (en) Post processing system for character recognition
JP3725206B2 (en) Character recognition device
JPH03125264A (en) Key word extracting device
JPH06149872A (en) Text input device
Marukawa et al. A post-processing method for handwritten Kanji name recognition using Furigana information
JPS63103393A (en) Word recognizing device
JPH0757059A (en) Character recognition device
JPH0795337B2 (en) Word recognition method
JPH0262659A (en) Extracting device for correction candidate character of japanese sentence
JPS6059487A (en) Recognizer of handwritten character
JPH02121078A (en) Vocabulary dictionary retrieving device
JPH04278664A (en) Address analysis processor
JPS62285190A (en) Unknown word processing method
JPH10240736A (en) Morphemic analyzing device
JPH06266700A (en) Kana/kanji converting device
JPS6316370A (en) Word extracting system
JPS61161588A (en) Postprocessing system of character recognition