JP3115459B2

JP3115459B2 - Method of constructing and retrieving character recognition dictionary

Info

Publication number: JP3115459B2
Application number: JP05286113A
Authority: JP
Inventors: 佳孝濱口; 節正広垣
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1993-10-20
Filing date: 1993-10-20
Publication date: 2000-12-04
Anticipated expiration: 2015-12-04
Also published as: JPH07121665A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Industrial Field of Application] The present invention relates to a method of constructing and a method of searching a character recognition dictionary used in post-processing that automatically corrects character recognition results.

[0002]

[Prior Art] Conventionally, in character recognition, in which handwritten or other characters are recognized by optical or similar means, post-processing that collates the recognition result with a word dictionary is performed in order to correct recognition errors automatically. For this post-processing to be fast, the word dictionary must be searched quickly. If the dictionary is simply searched with the first character as the search key, the number of words varies widely from key to key (in an English dictionary, for example, many words begin with "a" and few begin with "x"), so the search is not efficient. One known method therefore constructs a dictionary that can be searched using the character at a specific position in the word, for example the second character, as the search key (Japanese Examined Patent Publication No. 4-38026).

[0003] On the other hand, there is also a method in which the word dictionary is divided using a hash method, a hash value is assigned to each divided word group, and only the word group whose hash value equals the hash value obtained from the character recognition result is searched.

[0004]

[Problems to Be Solved by the Invention] However, the conventional techniques described above have the following problems. Even when the character at a specific position in the word is used as the search key, the number of words registered under each search key still varies. Consequently, if the words registered under each search key are divided by the hash method using the same number of divisions for every search key, the number of words registered under each hash value of each search key can differ greatly.

[0005] As a result, when the dictionary is subdivided more finely than necessary, data such as the hash values assigned to each divided word group grows unnecessarily and memory is used inefficiently. Conversely, when the divisions are too coarse, even a hash-based search cannot be fast.

[0006] For example, in the case shown in FIG. 4, when the second character of each word is used as the search key, 2,469 words are registered under the search key "a" and 62 words under the search key "f". If such a dictionary is divided into 64 parts by hash value so that words registered under "a" can be searched quickly enough, the words registered under "f" must also be divided into 64 parts. But only 62 words are registered under "f", so dividing them into 64 parts is pointless. For these reasons, simply combining the technique of Japanese Examined Patent Publication No. 4-38026 with the hash method did not make it possible to search the word dictionary at high speed.
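The mismatch described above can be checked with simple arithmetic. The following Python sketch is illustrative only: it uses the two word counts quoted from FIG. 4, and the helper name `average_words_per_division` is hypothetical rather than anything defined in the patent.

```python
# Word counts for two second-character search keys, as quoted from FIG. 4.
WORD_COUNTS = {"a": 2469, "f": 62}

# The same number of hash divisions forced on every search key.
UNIFORM_DIVISIONS = 64


def average_words_per_division(word_count: int, divisions: int) -> float:
    """Average number of words that fall under one hash value."""
    return word_count / divisions


for key, count in WORD_COUNTS.items():
    average = average_words_per_division(count, UNIFORM_DIVISIONS)
    print(f"key {key!r}: {count} words, about {average:.2f} per division")
```

Under a uniform 64-way split, key "a" averages roughly 38.6 words per division, while key "f" averages fewer than one, so most of its divisions would hold no word at all.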

[0007] The present invention has been made in view of the above points. Its object is to provide a method of constructing and a method of searching a character recognition dictionary in which a search key code and a hash value are attached to each registered word, so that words can be searched at high speed while memory is used effectively.

[0008]

[Means for Solving the Problems] In the method of constructing a character recognition dictionary of the present invention, each character occurring at a predetermined specific position common to all words is used as a search key; all words whose character at the specific position matches a search key are treated as belonging to that search key; the number of words belonging to each search key is counted and registered for that search key; a code is assigned to each search key such that the more words are registered for a search key, the shorter the bit length of its code; a hash value is calculated for each word belonging to each search key such that the more words are registered for the search key, the longer the bit length of the hash value; and each word is registered with the code of its search key and its hash value attached.

[0009] In the method of searching a character recognition dictionary of the present invention, among the candidate characters obtained by recognizing the character pattern to be recognized, the character at the predetermined specific position is used as a search key; all words matching the code of that search key are retrieved from the character recognition dictionary; a hash value of the candidate characters is calculated from the candidate characters obtained by recognizing the character pattern; among all words retrieved on the basis of the search key code, those whose hash value equals the hash value of the candidate characters are checked for a word that matches the candidate characters; and if a matching word exists, the candidate characters are taken as the recognition result.

[0010]

[Operation] In the method of constructing a character recognition dictionary of the present invention, for an English word dictionary for example, the second character of each word is used as the search key so that the number of words to be searched is as uniform as possible, and the number of words belonging to each search key is counted. A code is then assigned to each search key such that a search key with many words receives a code with a short bit length and a search key with few words receives a code with a long bit length. The bit length of the hash value is made long for search keys with many registered words and short for search keys with few registered words. As a result, the combined bit length of the code and the hash value attached to each word is constant.

[0011] In the method of searching a character recognition dictionary of the present invention, when the recognition result for a character pattern recognized by an optical character reader or the like is corrected using the word dictionary constructed as described above, the word dictionary is searched as follows. The word dictionary is searched using the candidate character for the second character pattern as the search key, a hash value is obtained in the same way as above from the sum of the ASCII code values of the candidate characters for the character patterns, and the word group containing the word suitable for correcting the recognition result is searched. If no suitable word is found in that word group, the other word groups belonging to the same search key are searched.

[0012]

[Embodiments] Embodiments of the present invention will now be described in detail with reference to the drawings. FIG. 1 is a block diagram of one embodiment of an apparatus to which the method of constructing a character recognition dictionary of the present invention is applied. The illustrated apparatus comprises a search key extraction unit 120, a word counting unit 500, a division number determination unit 600, a hash value calculation unit 320, a dictionary creation unit 700, and so on. The search key extraction unit 120 extracts, for each word to be registered in the word dictionary 420, the character at a specific character position as the search key. For English and similar languages this position is, for example, the second character. The apparatus of FIG. 2, described later, recognizes the character pattern at this position and searches the word dictionary 420 using the resulting candidate character as the search key.

[0013] The word counting unit 500 uses the search keys extracted for each word by the search key extraction unit 120 to count how many words are registered under each search key. Because characters generally occur with uneven frequency, the number of registered words differs considerably from one search key to another. To divide the words into word groups of uniform size, the word counting unit 500 therefore counts the registered words for every search key.
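The extraction and counting steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_words_per_key`, the zero-based `key_position` parameter, and the tiny word list are hypothetical, and words too short to have a character at the key position are simply skipped, a case the description does not address.

```python
from collections import Counter


def count_words_per_key(words, key_position=1):
    """Count words by the character at key_position (0-based; 1 = second character).

    Words that are too short to have a character at that position are skipped,
    which is an assumption of this sketch.
    """
    return Counter(
        word[key_position] for word in words if len(word) > key_position
    )


# Hypothetical sample vocabulary, for illustration only.
sample_words = ["read", "real", "cat", "be", "a"]
print(count_words_per_key(sample_words))
```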

[0014] The division number determination unit 600 determines, from the number of words registered under each search key as obtained by the word counting unit 500, the number of divisions for the words registered under each search key, so that the number of words registered under each hash value of each search key is roughly uniform. Even if the number of words registered under each search key differs greatly, a hash-based search can thus be combined effectively with the search by search key. The number of divisions determined for each search key by the division number determination unit 600 is stored in a division number table 220 for use when the apparatus of FIG. 2 searches the word dictionary. For each word to be registered in the word dictionary 420, the hash value calculation unit 320 determines the hash value of the word using a hash calculation method that depends on the number of divisions for the search key extracted by the search key extraction unit 120. A specific example of this calculation method is given later.

[0015] The dictionary creation unit 700 stores each word to be registered in the word dictionary 420, using the search key obtained by the search key extraction unit 120 and the hash value obtained by the hash value calculation unit 320 as its search code. The word dictionary constructed by the apparatus of FIG. 1 is searched by the apparatus of FIG. 2.

[0016] FIG. 2 is a block diagram of one embodiment of an apparatus to which the method of searching a character recognition dictionary of the present invention is applied. The illustrated apparatus comprises a search key extraction unit 110, a division number search unit 210, a hash value calculation unit 310, a dictionary search unit 410, and so on. From the character recognition result of an optical character reader or the like, the search key extraction unit 110 extracts as the search key the character code (the first candidate character) that is the recognition result for the specific position in the word. When the recognition result gives several candidate characters, the first candidate character may be used as the search key first, followed in turn by the second and subsequent candidates.

[0017] The division number search unit 210 looks up, in the division number table 220, how many parts the words registered in the word dictionary 420 under the search key character extracted by the search key extraction unit 110 are divided into by hash value. For the words registered in the word dictionary 420, the division number table 220 stores, for each search key, the number of divisions, that is, how many parts the words registered under that search key are divided into by hash value. Using the number of divisions found by the division number search unit 210, the hash value calculation unit 310 calculates a hash value from the character recognition result or the like by a predetermined hash calculation method that yields that number of divisions. For example, if the number of divisions is 8, the hash value is the remainder when the sum of the character codes (ASCII codes or the like) of the first candidate characters recognized for each character of the word is divided by 8. Any other method of calculating the hash value may be used instead.
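The example hash calculation (sum of character codes, then the remainder after dividing by the number of divisions) can be written as a short Python function. The function name `hash_value` is an assumption of this sketch, and `ord` stands in for the ASCII code lookup, which holds for plain ASCII input.

```python
def hash_value(word: str, divisions: int) -> int:
    """Remainder of the sum of the character codes divided by the division count."""
    if divisions < 1:
        raise ValueError("divisions must be at least 1")
    return sum(ord(ch) for ch in word) % divisions


# With 8 divisions, every word maps to a value in the range 0..7.
print(hash_value("read", 8))
```

Because the description allows any other hash calculation, this particular function is just one admissible choice; what matters is that construction and search use the same one.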

[0018] The dictionary search unit 410 searches the word dictionary 420 for the words registered under the search key obtained by the search key extraction unit 110, starting with those having the hash value obtained by the hash value calculation unit 310. When the character recognition result is correct, both the search key and the hash value are correct, so the word is found in the word dictionary 420 quickly. Post-processing is therefore fast when the character recognition result is correct.

[0019] Next, a specific example of the processing performed by the division number determination unit 600 is described in detail with reference to FIG. 3. As shown in FIG. 3, Huffman code assignment processing 610 assigns a code to each search key by the Huffman method, using the number of words registered under the key. The Huffman method compresses data by assigning short codes to data that occurs frequently and long codes to data that occurs rarely. With it, a search key with many registered words receives a code with a short bit length, and a search key with few registered words receives a code with a long bit length. For example, as shown in FIG. 4, the search key "a", which has the most registered words, is assigned "00". The search key "e", which has the second most registered words, is assigned "110". The search keys "d" and "f", which have few registered words, are assigned "10111101" and "01111100" respectively.
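As a rough illustration of Huffman code assignment from word counts, the Python sketch below builds a code table with a priority queue. It is a generic textbook construction, not the patent's own procedure, and the function name `huffman_codes` is hypothetical. Only the four word counts quoted in the description for "a", "c", "d", and "f" are used; because FIG. 4 covers many more search keys, the codes produced here are not expected to match those in the figure.

```python
import heapq
from itertools import count


def huffman_codes(word_counts):
    """Return a prefix-free bit string for each key; frequent keys get short codes."""
    tiebreak = count()  # unique sequence number so the heap never compares dicts
    heap = [(n, next(tiebreak), {key: ""}) for key, n in word_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n0, _, codes0 = heapq.heappop(heap)  # two least frequent subtrees
        n1, _, codes1 = heapq.heappop(heap)
        merged = {key: "0" + code for key, code in codes0.items()}
        merged.update({key: "1" + code for key, code in codes1.items()})
        heapq.heappush(heap, (n0 + n1, next(tiebreak), merged))
    return heap[0][2]


# Word counts quoted in the description (a subset of FIG. 4).
counts = {"a": 2469, "c": 288, "d": 99, "f": 62}
for key, code in sorted(huffman_codes(counts).items()):
    print(key, code)
```

With this subset, the key with the most words ("a") receives the shortest code and the two rarest keys receive the longest codes, which is the property the description relies on.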

[0020] In the example of FIG. 4, no code beginning with "00", such as "001", is assigned to any key other than "a". Therefore, once "00" is detected, "a" can be identified without examining the bits that follow. In other words, even if arbitrary bits are appended after the code "00" for "a", the code portion alone can be extracted. FIG. 4 shows only some of the search keys; codes not shown are assigned in the same way to the 26 letters of the alphabet, the period, the comma, and so on. Hash value bit length determination processing 620 takes, as the bit length of the hash value, a predetermined bit length minus the bit length of the code assigned to the search key in the Huffman code assignment processing 610. The bit length of the hash value is therefore long for search keys with many registered words and short for search keys with few. That is, the more words are registered under a search key, the greater the number of divisions by hash value.
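The prefix property described above is what allows a combined field to be split back into its code and hash parts. The Python sketch below shows one way this could be done, using only the four codes quoted from FIG. 4; the function name `split_field` and the use of bit strings rather than packed integers are assumptions made for readability.

```python
# Search key codes quoted from FIG. 4 (the figure assigns codes to further keys).
KEY_CODES = {"a": "00", "e": "110", "d": "10111101", "f": "01111100"}


def split_field(field: str):
    """Split a combined bit string into (search key, remaining hash bits).

    Because no code is a prefix of another, at most one code can match the
    start of the field, so the first match is the only match.
    """
    for key, code in KEY_CODES.items():
        if field.startswith(code):
            return key, field[len(code):]
    raise ValueError(f"no known search key code starts {field!r}")


print(split_field("00101010"))  # code "00" followed by six arbitrary bits
```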

[0021] When the search key codes and the numbers of divisions by hash value are determined by the above processing, the sum of the search key code length and the hash value bit length is constant. Regardless of the number of words registered under a search key, the total bit length of the field occupied in the word dictionary by the search key code and the hash value can therefore be kept constant.

[0022] Next, a specific example of the procedure for constructing the dictionary and the procedure for searching the constructed dictionary is described. In this example, the second character of the word is used as the search key. The hash value is calculated as the remainder when the sum of the ASCII codes of the characters of the word is divided by the number of divisions. FIG. 4 shows an example of the processing that determines the number of divisions for the words registered under each search key.

[0023] First, the search key extraction unit 120 of FIG. 1 extracts the second character of every word to be registered as its search key. The word counting unit 500 then counts, for each search key, the number of words from which that search key character was extracted. As shown in FIG. 4, for example, 2,469 words have the search key "a". The search keys "c" and "d" have 288 and 99 words respectively, and the other keys are as shown in the figure. As the illustrated example shows, the number of registered words differs from one search key to another, and the differences are considerable.

[0024] Next, in processing 610 of FIG. 3, a code is assigned to each search key character by the Huffman method using the word counts described above. That is, a short code is assigned to a character with many words and a long code to a character with few words. A code determined in this way has a short bit length, such as "00", when more words are registered than under other search keys, as with the 2,469 words of search key "a". It has a long bit length, such as "01111100", when fewer words are registered than under other search keys, as with the 62 words of search key "f". Then, in processing 620, the bit length of the code assigned to the search key in processing 610 is subtracted from the predetermined bit length to give the bit length of the hash value. For example, if the predetermined bit length is 8, then for search key "a" the bit length 2 of its code "00" is subtracted from 8, the hash value is 6 bits long, and the number of divisions is 2 to the sixth power, or 64.
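The subtraction performed in processing 620 can be expressed directly. In this Python sketch the codes are those quoted from FIG. 4 and the total of 8 bits is the value used in the worked example; the names `TOTAL_BITS` and `division_count` are hypothetical.

```python
TOTAL_BITS = 8  # predetermined bit length used in the worked example

# Search key codes quoted from FIG. 4.
KEY_CODES = {"a": "00", "e": "110", "d": "10111101", "f": "01111100"}


def division_count(code: str, total_bits: int = TOTAL_BITS) -> int:
    """Number of hash divisions left over once the key code has used its bits."""
    hash_bits = total_bits - len(code)
    if hash_bits < 0:
        raise ValueError("search key code is longer than the total bit length")
    return 2 ** hash_bits


for key, code in KEY_CODES.items():
    print(key, code, TOTAL_BITS - len(code), division_count(code))
```

A key whose code already fills all 8 bits, such as "d" or "f", is left with a 0-bit hash value, so its words form a single undivided group.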

[0025] Next, for each search key, the words registered under the key are divided in processing 620 using the number of divisions obtained as described above. As the bottom row of FIG. 4 shows, the difference in the number of words registered under each hash value of each search key then becomes small. For example, the search keys "a" and "f" have 2,469 and 62 registered words, a difference of about 40 times, but after division by hash value the figures are 39 and 62, and the difference shrinks to about 1.6 times. The hash-based search can thus be used effectively, and stable search performance is obtained.
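The figures of 39 and 62 follow from dividing each word count by its number of divisions and rounding up. The short Python check below assumes that rounding up is how the per-division figure in FIG. 4 was obtained, and that keys "a" and "f" have 64 and 1 divisions respectively (8 bits minus code lengths of 2 and 8); the helper name `words_per_division` is hypothetical.

```python
import math


def words_per_division(word_count: int, divisions: int) -> int:
    """Words per hash value, rounded up (rounding is an assumption of this sketch)."""
    return math.ceil(word_count / divisions)


per_a = words_per_division(2469, 64)  # key "a": 2-bit code, 6-bit hash
per_f = words_per_division(62, 1)     # key "f": 8-bit code, 0-bit hash

print(per_a, per_f)
print(round(2469 / 62, 1), round(per_f / per_a, 1))  # imbalance before and after
```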

[0026] FIG. 5 shows the procedure for constructing the word dictionary, taking the word "read" as an example. First, the search key extraction unit 120 extracts "e", the second character of "read", as the search key. According to FIG. 4, when the search key is "e", the search key code is "110" and the number of divisions by the hash method is 32. Next, the hash value calculation unit 320 takes the sum of the ASCII codes of the characters "r", "e", "a", and "d", which is 412, and divides it by the number of divisions, 32; the remainder, 28 (binary "11100"), is the hash value. The word "read" is thus registered in the word dictionary 420 with search key code "110" and hash value "11100".

[0027] In this example the words to be searched are divided into 32 parts, so the hash value is represented by 5 bits. The search key code "110" is 3 bits, making 8 bits together with the hash value. In this way, the bit lengths of the search key code and the hash value always total 8 bits. By combining the search key code and the hash value and using them as a single key, there is no need to reserve separate memory areas for the search key code and for the hash value while allowing for their differing bit lengths. Memory efficiency is therefore very good.
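The registration of "read" can be reproduced as follows. This Python sketch concatenates the search key code and the hash value into one fixed-width bit string; the function name `make_key` and the bit-string representation are assumptions, while the codes "110" and "00" and the 8-bit total come from the worked example.

```python
def make_key(word: str, key_code: str, total_bits: int = 8) -> str:
    """Concatenate the search key code and the hash value into one fixed-width key."""
    hash_bits = total_bits - len(key_code)
    if hash_bits < 0:
        raise ValueError("search key code is longer than the total bit length")
    value = sum(ord(ch) for ch in word) % (2 ** hash_bits)
    # A 0-bit hash contributes no characters at all to the combined key.
    hash_field = format(value, f"0{hash_bits}b") if hash_bits else ""
    return key_code + hash_field


print(sum(ord(ch) for ch in "read"))  # 412
print(make_key("read", "110"))        # 11011100
```

Whatever the code length, the result is always 8 characters wide, which is why a single fixed-size field suffices for every word.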

[0028] FIG. 6 shows an example of dictionary search processing. In dictionary search, search key extraction and hash value calculation are performed not on a registered word but on the character recognition result of an optical character reader or the like. In all other respects the processing is the same as dictionary construction, so the search procedure is similar to the procedure shown in FIG. 5. The procedure is as follows. First, the search key extraction unit 110 extracts the first candidate character "e", the recognition result for the second character of the word, as the search key. Next, the division number search unit 210 looks up the search key code and the number of divisions corresponding to the search key "e" in the division number table 220 and obtains the search key code "110" and the number of divisions 32.

[0029] The hash value calculation unit 310 then takes the sum of the ASCII codes of the first candidate characters "r", "e", "a", and "d", which is 412, divides it by the number of divisions, 32, and uses the remainder, 28 (binary "11100"), as the hash value. The word search begins with the portion of the dictionary for hash value "11100" under the search key code "110" obtained in this way. As long as the first candidate characters obtained as the recognition result are correct, the target word "read" is stored in the hash value portion where the search begins, so the search is fast. If the first candidate characters are incorrect, the hash value portion calculated from them normally does not contain the word being sought. In that case, as in the conventional technique, the words under that search key code with other hash values are compared in turn with the first candidate characters of the recognition result.
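The two-stage search (the matching hash group first, then the remaining groups under the same search key code) might be sketched in Python as below. Several parts are assumptions of this sketch rather than anything the description specifies: the dictionary is held as a Python dict of word lists, the fallback accepts the closest word that differs in at most `max_mismatches` character positions, and the fallback scans every group under the search key code. Only the codes for "a" and "e" from FIG. 4 are included, so candidates whose second character is anything else are not handled.

```python
from collections import defaultdict

KEY_CODES = {"a": "00", "e": "110"}  # subset of the codes quoted from FIG. 4
TOTAL_BITS = 8


def bucket_of(word):
    """Return (search key code, hash value) for a word of at least two characters."""
    code = KEY_CODES[word[1]]
    divisions = 2 ** (TOTAL_BITS - len(code))
    return code, sum(ord(ch) for ch in word) % divisions


def build_dictionary(words):
    dictionary = defaultdict(list)
    for word in words:
        dictionary[bucket_of(word)].append(word)
    return dictionary


def mismatches(a, b):
    """Count differing positions; words of different length never match."""
    if len(a) != len(b):
        return len(a) + len(b)
    return sum(x != y for x, y in zip(a, b))


def search(dictionary, candidate, max_mismatches=1):
    code, hash_value = bucket_of(candidate)
    # Stage 1: exact match in the group selected by code and hash value.
    if candidate in dictionary.get((code, hash_value), []):
        return candidate
    # Stage 2 (assumed criterion): closest word under the same search key code.
    best_word, best_score = None, max_mismatches + 1
    for (other_code, _), words in dictionary.items():
        if other_code != code:
            continue
        for word in words:
            score = mismatches(word, candidate)
            if score < best_score:
                best_word, best_score = word, score
    return best_word


words = build_dictionary(["read", "real", "head", "cat"])
print(search(words, "read"))  # found directly in its hash group
print(search(words, "reod"))  # third character misrecognized; falls back
```

When the recognition result is correct, only one small group is examined; the slower scan is needed only when the hash value computed from a misrecognized string points at the wrong group.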

[0030] Thus, when the character recognition result is correct, the word being sought is found among the words in the word dictionary 420 that have the search key obtained by the search key extraction unit 110 and the hash value obtained by the hash value calculation unit 310. Words are therefore found uniformly and quickly regardless of how many words are registered under the search key code. Although the embodiment above describes constructing and searching an English dictionary, the present invention is not limited to this and can of course be applied to Japanese, French, and any other language.

[0031]

[Effects of the Invention] As described above, the method of constructing and the method of searching a character recognition dictionary of the present invention construct and search the dictionary by combining word search keyed on the character at a specific position in the word with the hash method. The words registered under each search key are therefore divided almost uniformly among the hash values. This prevents the number of divisions from being so large that the amount of search code data grows and memory efficiency deteriorates, and it also prevents the number of divisions from being so small that efficient search becomes impossible. The dictionary search time when the character recognition result is correct can therefore be shortened. Furthermore, using a Huffman code or the like as the search key code makes it possible to concatenate the hash value to the search key code and store them in a single area, which further improves memory efficiency.

[Brief description of the drawings]

[FIG. 1] A block diagram of one embodiment of an apparatus to which the method of constructing a character recognition dictionary of the present invention is applied.

[FIG. 2] A block diagram of one embodiment of an apparatus to which the method of searching a character recognition dictionary of the present invention is applied.

[FIG. 3] An explanatory diagram of the division number determination procedure.

[FIG. 4] An explanatory diagram of an example of division number determination processing.

[FIG. 5] An explanatory diagram of an example of dictionary construction processing.

[FIG. 6] An explanatory diagram of an example of dictionary search processing.

[Explanation of symbols]

110, 120: search key extraction unit
210: division number search unit
220: division number table
310, 320: hash value calculation unit
410: dictionary search unit
420: word dictionary
600: division number determination unit
700: dictionary creation unit

Continuation of the front page

(56) References cited:
JP-A-62-278689 (JP, A)
A. V. Aho and two others, "Information Processing Series 11: Data Structures and Algorithms," first edition, Baifukan Co., Ltd., March 10, 1987, pp. 106-112 and pp. 331-336

(58) Fields searched (Int. Cl.7, DB name):
G06K 9/72
G06F 17/30 412
JICST file (JOIS)

Claims

(57) [Claims]

1. A method of constructing a character recognition dictionary, characterized in that: each character occurring at a predetermined specific position common to all words is used as a search key; all words whose character at the specific position matches a search key are treated as words belonging to that search key; the number of words belonging to each search key is counted and registered for that search key; a code is assigned to each search key such that the more words are registered for a search key, the shorter the bit length of its code; a hash value is calculated for each word belonging to each search key such that the more words are registered for the search key, the longer the bit length of the hash value; and each word is registered with the code of its corresponding search key and the hash value attached.

2. A method of searching a character recognition dictionary, characterized in that: among the candidate characters obtained by recognizing a character pattern to be recognized, the character at a predetermined specific position is used as a search key; all words matching the code of the search key are retrieved from the character recognition dictionary; a hash value of the candidate characters is calculated from the candidate characters obtained by recognizing the character pattern; among all words retrieved on the basis of the code of the search key, those whose hash value equals the hash value of the candidate characters are checked for a word that matches the candidate characters; and if a word matching the candidate characters exists, the candidate characters are taken as the recognition result.