JPH02118785A

JPH02118785A - Method for correcting erroneous recognition

Info

Publication number: JPH02118785A
Application number: JP63271591A
Authority: JP
Inventors: Michiyoshi Tachikawa; 道義立川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1988-10-27
Filing date: 1988-10-27
Publication date: 1990-05-07
Anticipated expiration: 2014-06-21
Also published as: JP2908460B2

Abstract

PURPOSE: To improve the recognition rate by performing unknown word processing that corrects an unknown-word portion, for which Japanese language analysis has failed, to katakana, alphabetic characters, numerals, or symbols. CONSTITUTION: A character-string portion for which Japanese language analysis ultimately fails is treated as an unknown word, and an unknown word check section 15 writes a pointer Ps to its first character position and a pointer Pe to its last character position into an unknown word position save memory 10. In an unknown word processing section 9, a candidate character search section 16 refers to a candidate lattice memory 4 and searches the candidate characters of the unknown-word portion in order from the first rank, searching the confusion characters last. A character type determination section 17 determines the character type of each retrieved character, and the first katakana, alphabetic, numeral, or symbol character found is swapped with the first-rank candidate. In this way, katakana words and the like, which are difficult to correct by misrecognition correction processing based on Japanese language analysis, can be corrected, and the recognition rate can be improved.

Description

[Detailed Description of the Invention] [Industrial Application Field] The present invention relates to a misrecognition correction method that uses Japanese language analysis to correct misrecognition in the recognition results of pattern recognition devices for Japanese, such as character recognition devices and speech recognition devices.

[Conventional technology]

Pattern recognition devices such as character recognition devices and speech recognition devices perform recognition based on the features of the input image or speech patterns, but with recognition processing that relies on the features of individual patterns it is almost impossible to eliminate misrecognition completely. It is therefore necessary to correct misrecognition by Japanese language analysis (checking words and grammar using a word dictionary and a grammar dictionary) in order to improve the recognition rate.

Known techniques for correcting misrecognition by Japanese language analysis include a character input processing method that corrects misrecognition by performing Japanese language analysis on all combinations of the recognition candidate characters (Japanese Patent Application Laid-Open No. 62-119190), and a misread character correction processing device that adds similar characters prepared in advance to the recognition candidate characters and corrects misrecognition by performing Japanese language analysis on all character strings formed from the candidate characters and the added characters (Japanese Patent Application Laid-Open No. 62-251986).

[Problem to be solved by the invention]

However, conventional misrecognition correction methods of this kind cannot correct words that are not registered in the word dictionary, because Japanese language analysis naturally fails for such words.

One way to reduce how often such uncorrectable cases occur would be to increase the number of words registered in the word dictionary, but covering every word is difficult in practice. Katakana words in particular are extremely numerous and new ones are coined frequently, so registering all of them in a dictionary is impossible. The same applies to combinations of alphabetic characters, numerals, and symbols. Moreover, a larger number of registered words creates further problems: the recognition rate drops because of the increase in similar words, and dictionary search time grows.

The object of the present invention is to provide a misrecognition correction method that, for the recognition results of pattern recognition devices for Japanese such as character recognition devices and speech recognition devices, can correct words made up of katakana, alphabetic characters, numerals, or symbols, which are difficult to correct by Japanese language analysis.

[Means to solve the problem]

The present invention performs misrecognition correction processing by Japanese language analysis on the recognition results of a character recognition device or the like, and also performs unknown word processing, which corrects an unknown-word portion for which Japanese language analysis has failed to katakana, alphabetic characters, numerals, or symbols contained in its candidate characters, or contained in the candidate characters and their similar characters.

This unknown word processing may be performed on every portion for which Japanese language analysis has failed. Alternatively, the part of speech of the unknown-word portion may be assumed to be a common noun or a sa-hen noun (a noun that forms a verb with "suru"), its connection with the following word for which Japanese language analysis succeeded may be checked, and unknown word processing may be performed only on unknown-word portions for which this connection is possible.

[Operation]

Portions consisting of combinations of katakana, alphabetic characters, numerals, or symbols that are not registered in the word dictionary, or that cannot be registered, cannot be corrected by Japanese language analysis. For such unknown words, however, the unknown word processing described above yields the correct word at a fairly high rate, so the recognition rate improves substantially.

Kanji can also form unknown words, but if unknown word processing is performed only on unknown words that can connect to the following word, erroneous corrections of unknown words made up of kanji can be reduced.

[Example]

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram showing one embodiment of the present invention. Reference numeral 1 denotes a scanner for reading a document, and 2 denotes a character recognition device (OCR) that cuts out individual character patterns from the document image read by the scanner 1 and extracts candidate characters by comparing their features against a pattern dictionary. Misrecognition in the recognition results of the character recognition device 2 is corrected by Japanese language analysis (natural language processing), and portions that cannot be corrected by this Japanese language processing are corrected by unknown word processing.

The elements involved in this misrecognition correction are a candidate character lattice generation section 3, a candidate lattice memory 4, a misrecognition correction section 5 based on Japanese language analysis, a word dictionary 6 and a grammar dictionary 7 used for Japanese language analysis, a candidate word memory 8, an unknown word processing section 9 that performs unknown word processing on unknown-word portions for which Japanese language analysis has failed, and an unknown word position save memory 10 that stores the position information (the pointers described later) of the unknown-word portions.

Reference numeral 11 denotes a recognition result output section that outputs the corrected recognition results, and 12 denotes a control section that controls the misrecognition correction section 5, the unknown word processing section 9, and the recognition result output section 11.

The candidate character lattice generation section 3 registers the candidate characters and distances output from the character recognition device 2 in the candidate lattice memory 4. In this embodiment, the characters that are prone to misrecognition can be predicted to some extent from the recognition method, so such easily confused similar characters (confusion characters) are prepared in advance, and the candidate character lattice generation section 3 also writes, for example, the confusion characters for the first-rank candidate character into the candidate lattice memory 4.
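The candidate lattice described above can be pictured as one column per character position, holding the ranked candidates, their distances, and any confusion characters for the first-rank candidate. The following is a minimal sketch under stated assumptions, not the patent's implementation: the names `LatticeColumn`, `build_lattice`, and `CONFUSION_TABLE` are hypothetical, the table contents are invented for illustration, and a smaller distance is assumed to mean a better rank.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical confusion table: first-rank character -> look-alike characters.
# "\u535c" is the kanji "boku"; "\u30c8" is the katakana "to".
CONFUSION_TABLE: Dict[str, List[str]] = {"\u535c": ["\u30c8"]}


@dataclass
class LatticeColumn:
    candidates: List[str]   # ranked candidates, index 0 = first rank
    distances: List[float]  # recognizer distance for each candidate
    confusions: List[str] = field(default_factory=list)


def build_lattice(ocr_output: List[List[Tuple[str, float]]],
                  confusion_table: Dict[str, List[str]] = CONFUSION_TABLE
                  ) -> List[LatticeColumn]:
    """Build one column per character position from (character, distance) pairs."""
    lattice = []
    for ranked in ocr_output:
        # Assumption: a smaller distance means a better rank.
        ranked = sorted(ranked, key=lambda pair: pair[1])
        chars = [c for c, _ in ranked]
        dists = [d for _, d in ranked]
        # Confusion characters are looked up for the first-rank candidate only,
        # skipping any character that is already a candidate.
        extra = [c for c in confusion_table.get(chars[0], []) if c not in chars]
        lattice.append(LatticeColumn(chars, dists, extra))
    return lattice
```

Keeping the confusion characters in a separate list preserves the search order used later in the unknown word processing, where ranked candidates are examined first and confusion characters last.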

FIG. 2 shows an example of the contents of the candidate lattice memory 4. This example is the recognition result for the input sentence 「日本政府の意向がサミットで拒否された。」 ("The Japanese government's intention was rejected at the summit.").

Only the first-rank candidates and the confusion characters are shown, but there are candidates down to the tenth rank.

The misrecognition correction section 5 corrects the recognition results by Japanese language analysis and also detects unknown-word portions. Specifically, a word matching section 13 matches the character strings formed from combinations of the candidate characters (the candidates of every rank, not only the first) and the confusion characters against the word dictionary 6 (word matching), and writes every matched word into the candidate word memory 8 as a candidate word. Next, a grammar check section 14 refers to the grammar dictionary 7, checks whether the part of speech (noun, sa-hen noun, godan verb, and so on) of each candidate word can connect to that of the immediately preceding word for which analysis succeeded, and deletes the candidate words that cannot connect from the candidate word memory 8. When several word connections succeed, the longer word is given priority (the longest-match method is applied). When analysis fails, backtracking is performed: the connection check is carried out on the next-longest candidate word and analysis continues.
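The word matching, connection check, longest-match priority, and backtracking described above can be sketched as a small depth-first search. This is a minimal illustration under stated assumptions, not the patent's implementation: the lattice is simplified to a list of character lists (all candidates and confusion characters per position), the dictionary maps each word to a single part of speech, the grammar dictionary is reduced to a set of allowed (previous, next) part-of-speech pairs, and the names `words_at`, `parse`, and the `"BOS"` sentence-start marker are hypothetical.

```python
def words_at(lattice, start, dictionary):
    """Return dictionary words spellable from the lattice at `start`, longest first."""
    found = []

    def extend(pos, prefix):
        if prefix in dictionary:
            found.append(prefix)
        if pos < len(lattice):
            for ch in lattice[pos]:
                # Only follow prefixes that can still grow into a dictionary word.
                if any(w.startswith(prefix + ch) for w in dictionary):
                    extend(pos + 1, prefix + ch)

    extend(start, "")
    return sorted(set(found), key=len, reverse=True)


def parse(lattice, dictionary, connectable, start=0, prev_pos="BOS"):
    """Try the longest connectable word first and backtrack when analysis fails."""
    if start == len(lattice):
        return []
    for word in words_at(lattice, start, dictionary):
        pos = dictionary[word]
        if (prev_pos, pos) not in connectable:
            continue  # the connection check rejects this candidate word
        rest = parse(lattice, dictionary, connectable, start + len(word), pos)
        if rest is not None:
            return [word] + rest
    return None  # no candidate word connects at this position


# Example modeled on the text: the first-rank candidate is 「恵」,
# and a lower-rank candidate 「意」 completes the dictionary word 「意向」.
dictionary = {"意向": "noun", "が": "particle", "恵": "noun"}
connectable = {("BOS", "noun"), ("noun", "particle")}
print(parse([["恵", "意"], ["向"], ["が"]], dictionary, connectable))
```

A `None` result at some position corresponds to the case the text treats next, where the failed portion is handed over as an unknown word rather than abandoning the whole sentence.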

When a character of a word for which this Japanese language analysis finally succeeded is a second-rank or lower candidate character or a confusion character, the grammar check section 14 swaps the first-rank candidate character in the candidate lattice memory 4 with that lower-rank candidate character or confusion character. In other words, it makes the correction.

In the example of FIG. 2, Japanese language analysis succeeds with the first-rank candidate characters up to 「日本政府の」 ("of the Japanese government"). For 「意向が」 ("the intention"), analysis succeeds with the word 「意向」 formed using the sixth-rank candidate, so the first-rank candidate 「恵」 and the sixth-rank candidate 「意」 are swapped. Analysis does not succeed for the following 「サミッ卜」 (note that 「卜」 here is the kanji read "boku", not the katakana 「ト」), so analysis resumes from the next character, 「で」. Analysis succeeds for 「で」 ("at") and 「拒否された」 ("was rejected").

Alternatively, erroneous phrases may be detected by performing Japanese language analysis on the first-rank candidate character string, and only the detected erroneous phrases may then be corrected by Japanese language analysis that includes the second-rank and lower candidate characters and the confusion characters.

A character-string portion (erroneous phrase) for which the above Japanese language analysis ultimately fails is treated as an unknown word, and the unknown word check section 15 writes a pointer Ps to its first character position and a pointer Pe to its last character position into the unknown word position save memory 10.

In the example of FIG. 2, Japanese language analysis fails at the position of 「サ」 in 「サミッ卜」, so that character position is saved as the pointer Ps. Analysis then continues from the following characters and succeeds at 「で」. The character position immediately before this successfully analyzed word 「で」, that is, the position of 「卜」 (the kanji), is saved as the pointer Pe.
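The bookkeeping for the pointers Ps and Pe can be sketched as a single left-to-right scan. This is a simplified illustration, not the patent's procedure: the function name `find_unknown_spans` and the callback `match_len_at` are hypothetical, the callback stands in for the whole dictionary and grammar check by returning the length of the word accepted at a position (0 when analysis fails there), and backtracking is omitted.

```python
def find_unknown_spans(n_chars, match_len_at):
    """Return inclusive (Ps, Pe) index pairs for portions where analysis fails."""
    spans = []
    ps = None  # start of the unknown span currently being tracked, if any
    pos = 0
    while pos < n_chars:
        length = match_len_at(pos)
        if length > 0:
            if ps is not None:
                # Pe is the position just before the word that succeeded.
                spans.append((ps, pos - 1))
                ps = None
            pos += length
        else:
            if ps is None:
                ps = pos  # Ps is the first position where analysis failed
            pos += 1      # resume analysis from the next character
    if ps is not None:
        spans.append((ps, n_chars - 1))
    return spans


# Positions in the FIG. 2 sentence (19 characters including the final 「。」):
# 日本(0) 政府(2) の(4) 意向(5) が(7) [サ ミ ッ 卜 at 8..11] で(12) 拒否された(13) 。(18)
word_lengths = {0: 2, 2: 2, 4: 1, 5: 2, 7: 1, 12: 1, 13: 5, 18: 1}
print(find_unknown_spans(19, lambda p: word_lengths.get(p, 0)))  # [(8, 11)]
```

The segmentation and word lengths in the usage example are assumptions made for illustration; the source states only where analysis fails and where it resumes.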

The unknown word processing section 9 applies unknown word processing to the portion between the pointer Ps and the pointer Pe. Specifically, a candidate character search section 16 refers to the candidate lattice memory 4 and searches the candidate characters of the unknown-word portion in order from the first rank, searching the confusion characters last. A character type determination section 17 then determines the character type of each retrieved character, and the first katakana, alphabetic, numeral, or symbol character found is swapped with the first-rank candidate (that is, the correction is made). When no katakana, alphabetic, numeral, or symbol character is found, the first-rank candidate is left as it is.
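The search order and character type test described above can be sketched as follows. This is a minimal sketch under stated assumptions: the source does not say how the character type determination section decides the type, so the Unicode ranges and general categories used in `char_type` are this example's own choice, and the names `char_type`, `fix_unknown_span`, and `TARGET_TYPES` are hypothetical. Instead of swapping entries in memory, the function returns the string that would be read from the first rank after the swap.

```python
import unicodedata

TARGET_TYPES = {"katakana", "alphabet", "digit", "symbol"}


def char_type(ch):
    """Classify one character (assumed rule set based on Unicode properties)."""
    code = ord(ch)
    if 0x30A0 <= code <= 0x30FF or 0xFF66 <= code <= 0xFF9F:
        return "katakana"  # full-width and half-width katakana blocks
    if (ch.isascii() and ch.isalpha()) or 0xFF21 <= code <= 0xFF3A \
            or 0xFF41 <= code <= 0xFF5A:
        return "alphabet"  # ASCII and full-width Latin letters
    category = unicodedata.category(ch)
    if category == "Nd":
        return "digit"
    if category[0] in "PS":
        return "symbol"    # punctuation and symbol categories
    return "other"         # kanji, hiragana, and everything else


def fix_unknown_span(columns, ps, pe):
    """columns[i] = (ranked candidates, confusion characters) for position i."""
    fixed = []
    for candidates, confusions in columns[ps:pe + 1]:
        # Ranked candidates are searched first, confusion characters last.
        search_order = list(candidates) + list(confusions)
        chosen = next((ch for ch in search_order if char_type(ch) in TARGET_TYPES),
                      candidates[0])  # nothing found: keep the first-rank candidate
        fixed.append(chosen)
    return "".join(fixed)


# "\u535c" is the kanji "boku"; its confusion character "\u30c8" is the katakana "to".
columns = [(["サ"], []), (["ミ", "三"], []), (["ッ", "ツ"], []),
           (["\u535c", "上"], ["\u30c8"])]
print(fix_unknown_span(columns, 0, 3))  # サミット
```

The candidate lists in the usage example are invented for illustration; only the final kanji and its katakana replacement follow the source's example.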

For 「サミッ卜」 in the example of FIG. 2, the last character 「卜」 (the kanji "boku") is replaced with the katakana 「ト」, and the word is corrected to the all-katakana word 「サミット」 ("summit").

FIG. 3 shows, as a flowchart, the misrecognition correction processing by Japanese language analysis and the unknown word processing described above.

FIG. 4 is a block diagram showing another embodiment of the present invention.

In this embodiment, a grammar check section 19 is added to the unknown word processing section 9a. This section treats the unknown word as a common noun or a sa-hen noun and checks whether it can connect to the following word for which Japanese language analysis succeeded. Unknown word processing by the candidate character search section 16 and the character type determination section 17 is performed only on unknown words judged to be connectable. Performing this grammar check before unknown word processing reduces erroneous corrections in which an unknown word made up of kanji is rewritten to katakana, alphabetic characters, numerals, or symbols.
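The gate added in this embodiment can be sketched as a single connection test that runs before unknown word processing. This is a minimal sketch, assuming the grammar dictionary is reduced to a set of allowed (previous, next) part-of-speech pairs; the function name `should_process_unknown` and the part-of-speech labels are hypothetical.

```python
# The unknown word is provisionally treated as one of these parts of speech.
UNKNOWN_POS = ("common_noun", "sahen_noun")


def should_process_unknown(next_word_pos, connectable):
    """Return True when the unknown word, assumed to be a common noun or a
    sa-hen noun, can connect to the following successfully analyzed word."""
    return any((pos, next_word_pos) in connectable for pos in UNKNOWN_POS)


# Hypothetical connection table: nouns may be followed by a particle,
# and a sa-hen noun may be followed by a form of "suru".
connectable = {("common_noun", "particle"), ("sahen_noun", "particle"),
               ("sahen_noun", "suru_verb")}
print(should_process_unknown("particle", connectable))   # True
print(should_process_unknown("verb_stem", connectable))  # False
```

Only spans that pass this test would be handed to the candidate search and character type determination; the others keep their first-rank candidates.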

FIG. 5 shows the processing of this embodiment as a flowchart.

The embodiments above have been described as applied to a character recognition device, but the present invention can be applied in the same way to speech recognition devices and the like, and each function may be implemented by either hardware or software.

[Effect of the invention]

As described in detail above, the present invention makes it possible to correct katakana words and the like, which are difficult to correct by misrecognition correction processing based on Japanese language analysis, and thereby to improve the recognition rate, while avoiding enlargement of the word dictionary and the disadvantages that come with it.

[Brief explanation of the drawing]

FIG. 1 is a block diagram showing one embodiment of the present invention, FIG. 2 is a diagram showing an example of the contents of the candidate lattice memory, FIG. 3 is a processing flowchart for FIG. 1, FIG. 4 is a block diagram showing another embodiment of the present invention, and FIG. 5 is a processing flowchart for FIG. 4.

2: character recognition device; 3: candidate character lattice generation section; 4: candidate lattice memory; 5: misrecognition correction section; 6: word dictionary; 7: grammar dictionary; 8: candidate word memory; 9, 9a: unknown word processing section; 10: unknown word position save memory; 13: word matching section; 14, 19: grammar check section; 15: unknown word check section; 16: candidate character search section; 17: character type determination section.

Claims

[Claims]

(1) A misrecognition correction method characterized by performing misrecognition correction processing on recognition results by Japanese language analysis, and by performing unknown word processing that corrects an unknown-word portion for which Japanese language analysis has failed to katakana, alphabetic characters, numerals, or symbols contained in its candidate characters, or contained in the candidate characters and their similar characters.

(2) The misrecognition correction method according to claim (1), characterized in that the part of speech of the unknown-word portion is assumed to be a common noun or a sa-hen noun, its connection with the following word for which Japanese language analysis succeeded is checked, and unknown word processing is performed only on unknown-word portions for which this connection is possible.