JPH0652367A - Post-processing method for character recognition result

Info

Publication number: JPH0652367A
Application number: JP4207895A
Authority: JP
Inventors: Yoshitaka Hamaguchi; 佳孝濱口; Akitoshi Tsukamoto; 明利塚本; Sadamasa Hirogaki; 節正広垣
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1992-08-04
Filing date: 1992-08-04
Publication date: 1994-02-25

Abstract

PURPOSE: To provide a post-processing method for character recognition results that yields a correct result in word units even when individual characters are misrecognized, that requires only a small amount of information to be prepared in advance, and that adapts easily when that information changes. CONSTITUTION: Characters are classified according to their error tendencies. No information about the character classification is stored in the word dictionary. A reference word used as the search key is formed from the character recognition result, and a class name notation is generated from the reference word using the character classification. During word retrieval, the character classification is used to check whether the dictionary contains words with the same class name notation as the reference word. Candidate words are then narrowed down from among the words that share the reference word's class name notation.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a post-processing method for character recognition results, and in particular aims to improve recognition performance by searching for words while taking the error tendencies of the recognition results into account.

[0002]

[Prior Art] Some machine translation systems, for example, use a character recognition device as the input means in order to simplify input operations by the user. In a character recognition device used in this way, recognition in word units is as important as recognition in character units. Methods have already been proposed that yield a correct recognition result in word units even when the character-unit recognition contains errors.

[0003] One example is Numakura et al., "Information Retrieval System Retrievable with Incorrect Key," Journal of Information Processing Society of Japan, Vol. 30, No. 11, pp. 1468-1478, November 1989.

[0004] The following describes a word search method (a process forming part of the post-processing of character recognition results) that follows the method disclosed in the above document and obtains a correct recognition result for a recognition target word from the recognition results of its constituent characters.

[0005] Applying this method requires preparing in advance a character classification and a hierarchical word dictionary in which the words are classified by their class name notation. Here, the character classification divides all characters into several classes based on their error tendencies, and each class is given a class name. The class name notation of a word is the sequence formed by lining up the class names of the classes to which the characters of the word belong.

[0006] When a word is recognized, a class name notation X is first created by replacing the recognition result of each character of the recognition target word with the corresponding class name from the character classification. Next, the class name notation Y that best matches X is searched for among the class name notations in the word dictionary. Finally, the words in the word dictionary that have class name notation Y are taken as the search targets, the word that best matches the recognition result is searched for, and that word becomes the search result.

[0007]

[Problems to Be Solved by the Invention] The error tendency of each character differs from typeface to typeface, so the appropriate way to classify the characters also differs by typeface. Rebuilding the classification alone is not particularly laborious, but with the conventional method described above the word dictionary must also be rebuilt, which is very laborious.

[0008] The same problem arises even when the typeface does not change. For example, if the classification of the characters is revised through learning, changing the class of even a single character entails wide-ranging changes to the class name notations in the word dictionary. Such revision is therefore impractical, and this ultimately limited improvements in word recognition accuracy.

[0009] Character-unit recognition methods are also being actively researched in order to improve character-unit recognition accuracy. When such a new character recognition method is adopted, the error tendencies may differ from the previous ones even for the same typeface, and the problem described above arises in that case as well.

[0010] The present invention has been made in view of the above points. It aims to provide a post-processing method for character recognition results that yields a correct recognition result in word units even when the character-unit recognition contains errors, that requires only a small amount of information to be prepared in advance, and that adapts easily to changes in that information.

[0011]

[Means for Solving the Problems] To solve these problems, the present invention provides a post-processing method for character recognition results that includes a process of searching a word dictionary, based on the recognition result of each character of a recognition target word, to obtain candidate words for that word. In this method, all characters are classified in advance into several classes based on their error tendencies, each class is given a class name, and the classification is stored, while the word dictionary stores words without any accompanying class name notation (a sequence of class names). The method is characterized by including a first process of creating, from the recognition result of each character, a reference word to be referred to during the search; a second process of creating a class name notation by lining up the class names of the classes to which the characters of the reference word belong; and a third process of taking the words stored in the word dictionary as search targets, determining from the character classification whether each word's class name notation matches that of the reference word, and excluding words whose notation differs as not satisfying the requirements for a candidate word.

[0012] Among the words not excluded by the third process, those with the fewest characters that mismatch the reference word are preferably taken as the candidate words.

[0013]

[Operation] To adapt easily to changes in the character classification, it suffices to omit the class name notations from the word dictionary, since these are what require extensive revision whenever the classification changes. The present invention is based on this idea.

[0014] That is, a character classification is still prepared, but the word dictionary no longer carries class name notations. A reference word used as the search key is formed from the character recognition result, a class name notation is created from it using the character classification, and whether a dictionary word has the same class name notation is checked at search time using the character classification.

[0015] Dictionary words whose class name notation differs cannot be candidate words, but a matching class name notation alone does not narrow the candidates sufficiently. It is therefore preferable to take as candidate words those whose class name notation is the same as that of the reference word and which also have the fewest characters that mismatch the reference word.

[0016]

[Embodiment] An embodiment in which the post-processing method for character recognition results according to the present invention is applied to the recognition of English words is described in detail below with reference to the drawings.

[0017] Although not illustrated, the hardware of this embodiment is in practice realized by, for example, a workstation equipped with an optical character reader (OCR). Its functional blocks are as shown in Fig. 2.

[0018] In Fig. 2, the document reading means 11 reads a document written on paper as binary data. The character recognition means 12 cuts out character regions, word regions, and so on from the binary data, obtains the features of the binary pattern of each character region, obtains a recognition result for each character from the distance between those features and the reference features prepared in advance for each character, and stores the results in the recognition result storage means 13 for each word region. The reference word creating means 14 creates, from the per-character recognition results, a reference word to be referred to during the search. The character classification storage means 15 stores a character classification, in which all characters are classified based on their error tendencies, together with the class names. The stored contents of the character classification storage means 15 are described later. The word dictionary 16 stores at least the English words expected to appear in general text, and is not organized hierarchically by class name notation. In this embodiment, to shorten the search time, the words are grouped by number of characters, and within the same number of characters they are further grouped by equal hash value. The hash value is, for example, the remainder obtained when the sum of the ASCII codes of the characters in the word is divided by 64. The candidate word search means 17 obtains the class name notation formed from the class names of the characters of the reference word, searches the word dictionary 16 for the words whose class name notation matches it and which have the fewest characters mismatching the reference word, and temporarily stores the retrieved candidate words in the recognition result storage means 13. The output word determination means 18 uses the original recognition result to determine the output word from among the candidate words, and has the result output means 21 print or display it.
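The dictionary organization described for the word dictionary 16 can be sketched as follows. This is only an illustrative sketch: the function names `word_hash` and `build_dictionary`, the in-memory bucket layout, and the sample word list are assumptions and do not appear in the specification. Only the hash rule (sum of ASCII codes modulo 64) and the grouping by character count and hash value come from the text.

```python
from collections import defaultdict

def word_hash(word: str) -> int:
    """Remainder of the sum of the characters' ASCII codes divided by 64."""
    return sum(ord(ch) for ch in word) % 64

def build_dictionary(words):
    """Group words by (number of characters, hash value).

    The bucket layout is an assumption; the text only says that words are
    grouped by length and, within a length, by equal hash value.
    """
    buckets = defaultdict(list)
    for word in words:
        buckets[(len(word), word_hash(word))].append(word)
    return buckets

# Hypothetical sample vocabulary.
buckets = build_dictionary(["take", "tube", "lake", "taken"])
print(word_hash("take"), word_hash("tahe"))
```

With this rule the word "take" and the misrecognized string "tahe" fall into different hash groups of the same length, which is why a search for "tahe" has to continue past the equal-hash group.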

[0019] The processing flow of the method of this embodiment, as realized by these functional blocks, is shown in Figs. 1 and 3.

[0020] First, the overall flow of processing is described with reference to Fig. 3.

[0021] Character regions, word regions, and so on are cut out from the document data converted into binary data, the features of the binary pattern (character pattern) of each character region are obtained, a recognition result for each character is obtained from the distance between those features and the reference features prepared in advance for each character, and the recognition results of the characters are organized for each word region (step 100). Fig. 4 shows an example of a character recognition result for the case where the input recognition target word (more precisely, its pattern) is "take". Even when the character being read is the same character as the one used to create the reference features, differences in the character pattern, such as the typeface, change the features, so the correct character does not always have the smallest distance. Here, all characters whose distance is at or below a certain threshold are taken as the recognition result.

[0022] Once recognition results have been obtained for each character of the word in this way, a reference word is formed by joining the first candidate character for each character position (step 101).

[0023] Fig. 5 is a table showing the information obtained as the processing stages progress. As shown in Fig. 5, when the recognition result shown in Fig. 4 is obtained, the reference word is "tahe".
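Step 101 can be illustrated with a short sketch. The contents of Fig. 4 are not reproduced in this text, so the candidate lists below are hypothetical; they are chosen only so that the first candidates spell "tahe". The sketch also assumes that each position's candidates are ordered by increasing distance, and the name `make_reference_word` is an assumption.

```python
def make_reference_word(recognition_result):
    """Join the first (closest) candidate of every character position."""
    return "".join(candidates[0] for candidates in recognition_result)

# Hypothetical per-position candidate lists for the pattern "take";
# the actual values of Fig. 4 are not available here.
recognition_result = [
    ["t"],
    ["a", "o"],
    ["h", "k"],   # the third character is misread as "h" first
    ["e"],
]

reference_word = make_reference_word(recognition_result)
print(reference_word)
```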

[0024] Once the reference word has been obtained, its class name notation is obtained using the stored character classification, and the word dictionary is searched with this class name notation as the key to obtain candidate words (step 102). The processing of this step is the distinguishing feature of this embodiment and is described in detail later.

[0025] Once one or more candidate words have been obtained, the word to be output is determined, also using the recognition result obtained in step 100 (step 103). For this determination, the method described in the specification and drawings of Japanese Patent Application No. Hei 3-196509 can be used, for example. That is, the reference features of each character of a candidate word are used to obtain the distance to the corresponding character of the recognition target word, and the sum of the distances obtained for the characters is taken as the evaluation value of that candidate word. The candidate word with the smallest evaluation value is the word to be output.
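The evaluation in step 103 amounts to summing per-character distances and taking the minimum. The sketch below is an assumption-heavy illustration: the distance values, the second candidate "tale", and the names `evaluation_value` and `choose_output_word` are all hypothetical, and the feature extraction of the cited application is replaced by a precomputed lookup table.

```python
# Hypothetical distances between the input character pattern at each
# position and the reference features of a character (smaller = closer).
distances = [
    {"t": 0.10},
    {"a": 0.20},
    {"h": 0.30, "k": 0.35, "l": 0.55},
    {"e": 0.10},
]

def evaluation_value(word, distances):
    """Sum of the per-character distances for one candidate word."""
    return sum(distances[i][ch] for i, ch in enumerate(word))

def choose_output_word(candidates, distances):
    """Return the candidate with the smallest evaluation value."""
    return min(candidates, key=lambda w: evaluation_value(w, distances))

output_word = choose_output_word(["take", "tale"], distances)
print(output_word)
```

A real system would compute each distance from the stored reference features rather than read it from a table.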

[0026] The determined output word is then output by printing or display, and the series of processes ends (step 104).

[0027] Next, the word dictionary search processing of step 102 is described in detail with reference to Fig. 1.

[0028] When creation of the reference word is finished and the word dictionary search begins, the hash value of the reference word is first calculated and set in the parameter hash, as shown in Fig. 1(A) (step 200). The hash value is naturally calculated in the same way as for the word dictionary; as described above, it is, for example, the remainder obtained when the sum of the ASCII codes of the characters in the reference word is divided by 64.

[0029] Next, the class name notation for the reference word is created (step 201). This uses the character classification prepared in advance based on error tendencies. Fig. 6 shows an example of how the classification is stored: a table in which each class name is associated with the characters belonging to that class. The classification is made so that reading errors may occur within a class but do not occur between characters of different classes. Fig. 6 indicates, for example, that reading errors may occur among the characters "a", "o", "u", and "v". The method described in the above document can be used for this character classification. Applying the classification of Fig. 6 to the reference word "tahe" of Fig. 5 yields the class name notation "DABC", as shown in Fig. 5.
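Step 201 can be sketched as a table lookup. The table below is a partial, hypothetical stand-in for Fig. 6: the text states only that "a", "o", "u", and "v" share a class and that "tahe" maps to "DABC"; placing "k" in class B is inferred from the later example in which "take" is found as a candidate. The names `CLASS_TABLE`, `CHAR_TO_CLASS`, and `class_notation` are assumptions.

```python
# Partial stand-in for the Fig. 6 table (class name -> member characters).
# Everything beyond the a/o/u/v grouping and the t, a, h, e mappings is assumed.
CLASS_TABLE = {
    "A": "aouv",
    "B": "hk",
    "C": "e",
    "D": "t",
}

# Invert the table so each character maps to its class name.
CHAR_TO_CLASS = {ch: name for name, chars in CLASS_TABLE.items() for ch in chars}

def class_notation(word: str) -> str:
    """Line up the class names of the characters of a word."""
    return "".join(CHAR_TO_CLASS[ch] for ch in word)

print(class_notation("tahe"))
```

Because only this table encodes the error tendencies, switching to another typeface or recognizer would mean replacing `CLASS_TABLE` while leaving the word list untouched.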

[0030] Next, the minimum mismatch count is set to a large value that could not actually occur as a mismatch count (step 202). This parameter holds the smallest number of characters mismatching the reference word found so far while the matching process described below is repeated for the words in the dictionary. As described later, candidate words are selected from the words that have as few characters mismatching the reference word as possible.

[0031] After this initialization, processing proceeds to the determination of candidate words by matching the words in the word dictionary. First, each dictionary word whose hash value equals that of the reference word and whose number of characters (word length) equals that of the reference word is checked to see whether it is a candidate word (the loop of steps 203 to 206). After that, each dictionary word whose hash value differs from that of the reference word but whose number of characters equals that of the reference word is checked in the same way (the loop of steps 207 to 209).

[0032] The loop that checks each dictionary word whose hash value and number of characters (word length) both equal those of the reference word consists of a process of taking a word satisfying these conditions out of the word dictionary (step 203), a process of matching that word, detailed in Fig. 1(B) (step 204), a process of checking whether the word matches the reference word (step 205), and a process of checking whether all words satisfying the conditions have been processed (step 206).

[0033] If, during this loop over the dictionary words whose hash value and number of characters equal those of the reference word, a word with zero characters mismatching the reference word is found, that is, if the reference word itself is in the word dictionary, the dictionary search ends immediately at that point (affirmative result in step 205).

[0034] The loop that checks each dictionary word whose hash value differs from that of the reference word but whose number of characters equals that of the reference word consists of a process of taking a word satisfying this condition out of the word dictionary (step 207), a process of matching that word, detailed in Fig. 1(B) (step 208), and a process of checking whether all words satisfying the condition have been processed (step 209).

[0035] The order in which dictionary words are matched is varied by hash value in this way in order to shorten the search time when the reference word itself is a correct recognition of the word being read. Although recognition errors do occur, correct recognition is more common, so ordering the search in this way shortens the overall search time and therefore the recognition processing time.

[0036] The matching process applied to each word (steps 204 and 208) is the same whether the dictionary word has the same hash value as the reference word or a different one (in both cases with the same number of characters as the reference word), and its details are as shown in Fig. 1(B).

[0037] When a word in the word dictionary becomes the matching target, the parameter diffcount, which counts the number of characters of that word that mismatch the reference word, is set to its initial value 0, and the loop of steps 301 to 305 is then repeated for each character of the matching target word.

[0038] First, for the character taken out in step 301, it is checked whether its class name matches the class name at the same position in the class name notation of the reference word (step 302). If it does not, matching of this word ends immediately and processing returns to the main routine (step 205 or 209). If the class is the same, it is then determined whether the character itself matches the character at that position in the reference word (step 303). If the characters differ, diffcount is incremented by 1 and it is then checked whether the last character has been processed; if they are the same, that check is made immediately (steps 304, 305).

[0039] When the class name check and the character match check have been completed up to the last character, the number of characters of the matching target word that mismatch the reference word, that is, the value of diffcount, is compared with the minimum mismatch count, the smallest value among the words matched so far (step 306). If the mismatch count of the current word (the value of diffcount) is larger than the minimum mismatch count, processing returns to the main routine (step 205 or 209) without adding the word to the candidate words. If it is equal to the minimum mismatch count, the word is added to the candidate words and processing returns to the main routine (step 205 or 209) (step 307). If it is smaller than the minimum mismatch count, the candidate words collected so far are discarded, the current word is registered as a candidate word, the minimum mismatch count is replaced with the mismatch count of the current word (the value of diffcount), and processing returns to the main routine (step 205 or 209) (steps 308, 309).
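The per-word matching routine of Fig. 1(B) can be sketched as below. This is an illustration under assumptions, not the implementation of the embodiment: the class table is a partial, hypothetical stand-in for Fig. 6, the `state` dictionary and the function name `match_word` are invented for the sketch, and the step numbers in the comments follow the description above.

```python
# Partial, hypothetical stand-in for the Fig. 6 classification.
CLASS_TABLE = {"A": "aouv", "B": "hk", "C": "e", "D": "t"}
CHAR_TO_CLASS = {ch: name for name, chars in CLASS_TABLE.items() for ch in chars}

def match_word(word, reference, ref_notation, state):
    """Match one dictionary word of the same length as the reference word."""
    diffcount = 0
    for ch, ref_ch, ref_cls in zip(word, reference, ref_notation):  # steps 301, 305
        if CHAR_TO_CLASS.get(ch) != ref_cls:   # step 302: class differs
            return                             # reject the word immediately
        if ch != ref_ch:                       # step 303
            diffcount += 1                     # step 304
    if diffcount > state["min_diff"]:          # step 306
        return
    if diffcount == state["min_diff"]:
        state["candidates"].append(word)       # step 307
    else:
        state["candidates"] = [word]           # step 308: discard older candidates
        state["min_diff"] = diffcount          # step 309

# Step 202: start from an impossibly large minimum mismatch count.
state = {"min_diff": float("inf"), "candidates": []}
for word in ["toke", "tube", "take", "tave"]:  # hypothetical dictionary words
    match_word(word, "tahe", "DABC", state)
print(state)
```

The class check runs before the character comparison, so a word is dropped at its first out-of-class character without its class name notation ever being stored or fully built.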

[0040] By executing the processing of Fig. 1 described above, if the reference word is in the word dictionary, that word becomes the candidate word. If the reference word is not in the word dictionary, the candidate words are the one or more dictionary words that have the same word length and the same class name notation as the reference word and the fewest characters mismatching it.

[0041] In the case shown in Fig. 5, where the reference word is "tahe" and its class name notation is "DABC", and the character classification of Fig. 6 is applied, no candidate word is obtained from matching the dictionary words whose hash value and word length both equal those of the reference word. However, "take" is obtained as a candidate word from matching the dictionary words whose hash value differs from that of the reference word but whose word length equals that of the reference word.
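The two-pass search order of Fig. 1(A) and the "tahe" example can be traced with the following sketch. The class table is a partial, hypothetical stand-in for Fig. 6, the word list is invented, and the names `search_candidates` and `mismatch_count` are assumptions; the dictionary is scanned as a flat list here rather than through the length and hash groups of the embodiment.

```python
CLASS_TABLE = {"A": "aouv", "B": "hk", "C": "e", "D": "t"}  # partial, assumed
CHAR_TO_CLASS = {ch: name for name, chars in CLASS_TABLE.items() for ch in chars}

def word_hash(word):
    """Sum of ASCII codes modulo 64."""
    return sum(ord(ch) for ch in word) % 64

def mismatch_count(word, reference, ref_notation):
    """Number of differing characters, or None if any class differs."""
    diff = 0
    for ch, ref_ch, ref_cls in zip(word, reference, ref_notation):
        if CHAR_TO_CLASS.get(ch) != ref_cls:
            return None
        diff += ch != ref_ch
    return diff

def search_candidates(reference, dictionary):
    ref_hash = word_hash(reference)                                 # step 200
    ref_notation = "".join(CHAR_TO_CLASS[ch] for ch in reference)   # step 201
    min_diff, candidates = float("inf"), []                         # step 202
    same_length = [w for w in dictionary if len(w) == len(reference)]
    for same_hash_pass in (True, False):    # equal-hash words first, then the rest
        for word in same_length:
            if (word_hash(word) == ref_hash) != same_hash_pass:
                continue
            diff = mismatch_count(word, reference, ref_notation)
            if diff is None or diff > min_diff:
                continue
            if diff < min_diff:
                min_diff, candidates = diff, []
            candidates.append(word)
            if same_hash_pass and diff == 0:  # step 205: reference word found
                return candidates
    return candidates

WORDS = ["take", "toke", "tube", "talk", "taken"]  # hypothetical dictionary
print(search_candidates("tahe", WORDS))
print(search_candidates("take", WORDS))
```

In this trace "tahe" has hash value 34 and "take" has 37, so the first pass finds nothing and "take" is picked up in the second pass with one mismatching character.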

[0042] Thus, according to the above embodiment, candidate words for the recognition target word can be obtained simply by classifying the characters according to their error tendencies, that is, without classifying the words by class name notation.

[0043] As a result, even if the typeface of the documents to be read changes and the error tendencies change with it, this can be handled by changing only the character classification, without changing the word dictionary. Likewise, when the character classification is changed through learning, the existing word dictionary can be used as it is. Furthermore, even if the error tendencies change because a different character recognition method is adopted, this can be handled by changing only the character classification, without changing the word dictionary.

[0044] The above embodiment describes the case where English words are the recognition target, but the present invention is also applicable when words of other languages are the recognition target. In Japanese, for example, word segmentation and word recognition are performed in parallel, but the present invention can still be applied.

[0045] In the above embodiment, a single reference word is set to obtain candidate words, but two or more reference words may be set from the character recognition result to obtain candidate words.
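One way this variation might look is sketched below in Python. How the additional reference words are formed is not specified here, so the sketch assumes they are built from ranked per-character recognition candidates. The classification table, the sample data and the function names are hypothetical.

```python
from itertools import islice, product

CHAR_CLASS = {ch: name
              for name, chars in {"A": "aoec", "B": "hkb", "C": "tfli"}.items()
              for ch in chars}

def reference_words(char_candidates, limit=2):
    # Assumed scheme: combine the per-character candidates and keep the
    # first `limit` combinations as reference words.
    return ["".join(chars) for chars in islice(product(*char_candidates), limit)]

def search(reference, dictionary):
    def cls(ch):
        return CHAR_CLASS.get(ch, ch)
    return [w for w in dictionary
            if len(w) == len(reference)
            and all(cls(a) == cls(b) for a, b in zip(w, reference))]

def search_all(references, dictionary):
    # Merge the candidates of every reference word, keeping first-seen order.
    found = []
    for ref in references:
        for word in search(ref, dictionary):
            if word not in found:
                found.append(word)
    return found

# Third character recognized as "a" (first choice) or "i" (second choice).
refs = reference_words([["b"], ["c"], ["a", "i"], ["t"]])
print(refs)
print(search_all(refs, ["boat", "beat", "bolt", "coat", "boil"]))
```

The second reference word brings in candidates that the first one alone would miss.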

[0046] Further, in the above embodiment, the check for whether the reference word itself is in the word dictionary is performed without distinguishing it from the search of the other words whose hash value equals that of the reference word. Instead, whether the reference word is in the word dictionary may be determined first. In that case, the classification of the words by hash value may also be dispensed with.
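This variation can be sketched in Python as follows, assuming that an exact hit ends the search and that the class-based scan then runs over all words of the same length without any hash grouping. The classification table, the sample dictionary and the function name `lookup` are hypothetical.

```python
CHAR_CLASS = {ch: name
              for name, chars in {"A": "aoec", "B": "hkb", "C": "tfli"}.items()
              for ch in chars}

def lookup(reference, dictionary):
    # Step 1: check first whether the reference word itself is a dictionary word.
    if reference in dictionary:
        return [reference]

    # Step 2: otherwise scan every word of the same length and keep those
    # whose characters fall in the same classes as the reference word.
    def cls(ch):
        return CHAR_CLASS.get(ch, ch)
    return sorted(w for w in dictionary
                  if len(w) == len(reference)
                  and all(cls(a) == cls(b) for a, b in zip(w, reference)))

WORDS = {"boat", "beat", "bolt", "coat"}
print(lookup("boat", WORDS))
print(lookup("bcat", WORDS))
```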

[0047] Furthermore, in the above embodiment, whether the class name notation of a word stored in the word dictionary equals the class name notation of the reference word is determined by comparing the two character by character. Alternatively, the class name notation of each word in the word dictionary may first be derived using the character classification and then compared as a whole.
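Both ways of comparing can be sketched in Python as follows. The classification table and the function names are hypothetical, and the memoization with `functools.lru_cache` is an added convenience, not something described in the text.

```python
from functools import lru_cache

CHAR_CLASS = {ch: name
              for name, chars in {"A": "aoec", "B": "hkb", "C": "tfli"}.items()
              for ch in chars}

def matches_charwise(word, reference):
    # Embodiment: compare the class of each character position in turn.
    return len(word) == len(reference) and all(
        CHAR_CLASS.get(a, a) == CHAR_CLASS.get(b, b)
        for a, b in zip(word, reference))

@lru_cache(maxsize=None)
def notation(word):
    # Variation: derive the whole class name notation at search time.
    return "".join(CHAR_CLASS.get(ch, ch) for ch in word)

def matches_by_notation(word, reference):
    # Compare the two derived notations as whole strings.
    return notation(word) == notation(reference)

for w in ["boat", "beat", "bolt", "coat", "boats"]:
    print(w, matches_charwise(w, "bcat"), matches_by_notation(w, "bcat"))
```

Both functions give the same verdict for every word. The notation is still computed from the character classification at search time, so the word dictionary itself carries no class information.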

[0048] In the above embodiment, a word qualifies as a candidate word only if it has the fewest characters that do not match the reference word, but the condition on the number of mismatched characters may be set somewhat more loosely than this.
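The narrowing step and its relaxed form can be sketched in Python as follows. The parameter `slack` is a hypothetical way of expressing the looser condition, and the sample words are chosen for illustration.

```python
def mismatches(word, reference):
    # Number of character positions where the word differs from the reference.
    return sum(a != b for a, b in zip(word, reference))

def select(reference, matched_words, slack=0):
    # slack=0 keeps only the words with the fewest mismatched characters;
    # a larger slack also admits words that are slightly worse.
    if not matched_words:
        return []
    counts = {w: mismatches(w, reference) for w in matched_words}
    best = min(counts.values())
    return [w for w in matched_words if counts[w] <= best + slack]

# Words assumed to have already passed the class name notation match.
matched = ["boat", "beat", "heat"]
print(select("bcat", matched))
print(select("bcat", matched, slack=1))
```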

[0049]

[Effects of the Invention] As described above, according to the present invention, candidate words that take the error tendency of characters into account can be obtained merely by preparing the character classification information, without including any information on class name notation in the word dictionary. This realizes a post-processing method for character recognition results that requires only a small amount of information to be prepared in advance and that can easily accommodate changes to that information.

[Brief description of drawings]

FIG. 1 is a flowchart showing the word dictionary search process of the embodiment.

FIG. 2 is a functional block diagram showing a configuration for realizing the embodiment.

FIG. 3 is a flowchart showing a series of character recognition processes, including the processing according to the embodiment.

FIG. 4 is an explanatory diagram showing an example of a recognition result.

FIG. 5 is an explanatory diagram showing information obtained in the course of the processing of the embodiment.

FIG. 6 is an explanatory diagram showing an example of the character classification of the embodiment.

[Explanation of symbols]

15: character classification storage means; 16: word dictionary; 17: candidate word search means; 302: processing step for determining whether the class name notation of a word in the word dictionary matches the class name notation of the reference word.

Claims

[Claims]

1. A post-processing method for character recognition results, including a process of searching a word dictionary based on the recognition results of the characters constituting a recognition target word to obtain candidate words for the recognition target word, wherein all characters are classified in advance into several classes based on the error tendency of the characters, each class is given a class name, and the classification is stored; words are stored in the word dictionary without a class name notation, a class name notation being an arrangement of class names; and the method comprises: a first process of creating, from the recognition result of each character, a reference word to be referred to at the time of search; a second process of creating a class name notation in which the class names of the classes to which the characters of the reference word belong are arranged; and a third process of searching the words stored in the word dictionary, determining, based on the character classification, whether the class name notation of each word matches the class name notation of the reference word, and excluding each word that does not match as not satisfying the condition for a candidate word.

2. The post-processing method for character recognition results according to claim 1, wherein, among the words not excluded in the third process, the word having the fewest characters that do not match the reference word is taken as a candidate word.