JPH0340079A

JPH0340079A - Post-processing method for character recognition in character reader

Info

Publication number: JPH0340079A
Application number: JP1173057A
Authority: JP
Inventors: Akizo Kadota; 門田　彰三; Toshihiro Hananoi; 花野井　歳弘
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1989-07-06
Filing date: 1989-07-06
Publication date: 1991-02-20

Abstract

PURPOSE: To obtain a more accurate recognition result by outputting not only the best-matching word found by word matching but also information specific to that word, or a pointer into a dictionary in which that information is stored. CONSTITUTION: Code information specific to each word is attached to every word stored in a word dictionary 4. When word matching means 2, 3 derive a highly similar word from combinations of candidate characters, the code information is output together with the word, and the necessary information is retrieved based on that code information. In this way the best-matching word is output by word matching and, because information specific to that word is also obtained, the character reader 1 - 5 can be made more intelligent and a highly accurate recognition result can be obtained.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a character recognition post-processing method for a character reading device, and in particular to a post-processing method that improves recognition accuracy by matching the recognition results read by the character recognition device within the character reading device against a word dictionary.

[Conventional technology]

Character recognition post-processing that matches recognition results against a word dictionary has long been used to improve the recognition accuracy of character reading devices. For example, the invention disclosed in JP-A-60-217490 improves recognition accuracy by selecting a plurality of hypothetical words from a string of character recognition results, determining their similarity to dictionary words, and detecting the word with the highest similarity.

[Problem to be solved by the invention]

With the prior art described above, only the recognition of individual words can be expected. As character reading devices become more intelligent, one may want to recognize an address written in kana and convert it to kanji, or conversely recognize it in kanji and output it in kana, or recognize an address and attach its postal code. In addition, much of the information written on a form is generally interrelated. For example, besides the address field, a postal code or telephone number may be entered, or furigana may be added. Using this information should make it possible to improve recognition accuracy further. Incidentally, JP-A-63-138478 discloses an invention that checks an address using a postal code.

Conventional word matching techniques only output one or more candidates for the best-matching word and cannot be used for the kind of advanced information processing described above. To do so regardless, as shown in FIGS. 9(a) and 9(b), a code conversion dictionary 26 and a code conversion device 27, or a check dictionary 29 and a postal code check device 28, must be prepared in advance in addition to the character recognition device 22, word matching device 23, and word dictionary 24 that make up the character reading device. As is clear from FIGS. 9(a) and 9(b), these dictionaries 26 and 29 must contain the same words as the word dictionary, which wastes memory. Furthermore, code conversion requires searching the conversion dictionary 26 for the word that matches the word matching result. In FIGS. 9(a) and 9(b), 21 denotes a form.

The present invention has been made in view of the above problems of the prior art, and its object is to provide a character recognition processing method in a character reading device that eliminates the waste of memory and makes highly accurate recognition possible.

[Means to solve the problem]

The character recognition post-processing method of the present invention is applied to a character reading device that includes character recognition means that reads a character string written on a form and outputs candidate characters, a word dictionary that stores a plurality of words, and word matching means that matches combinations of the candidate characters output from the recognition means against the words stored in the word dictionary to find words of high similarity. The method is characterized in that code information specific to each word is attached to every word stored in the word dictionary; when the word matching means finds a word of high similarity from the combinations of candidate characters, the code information is output together with it, and the necessary information is retrieved based on that code information.

Possible forms of the code information include:

(1) a word or word group related to the word;

(2) an address or pointer at which a word or word group related to the word is stored;

(3) when words or word groups related to the word are stored in a database, a keyword, record number, or the like that gives access to them.

[Operation]

According to the present invention, word matching yields not only the best-matching word but also information specific to that word, so the method supports more intelligent character reading devices and produces accurate recognition results. The specific information obtained can also be output together with the word, or in place of it, which minimizes the effort of converting recognition results into other information.

[Example]

The present invention is described in more detail below with reference to the embodiments shown in the accompanying drawings.

FIG. 1 is a block diagram showing an embodiment of the character reading device of the present invention. In FIG. 1, 1 is a form, 2 is a character recognition device, 3 is a post-processing device, and 4 is a word dictionary. The character recognition device 2 reads the characters written on the form 1 and outputs a plurality of candidate characters for each character to the post-processing device 3. The following discussion assumes that, as shown in FIG. 1, a postal code (256) and an address (小田原市, Odawara-shi) are entered on the form 1.

In the character recognition device 2, the postal code consists of digits, so recognition accuracy is high; it is recognized almost without error, and one candidate character is output per digit. The address is written in kanji, so recognition accuracy is generally poor and many candidate characters are output. Assume that, as shown in FIG. 2, three candidates 「大」, 「小」, and 「山」 are output for 「小」, two candidates 「田」 and 「日」 for 「田」, the single candidate 「原」 for 「原」, and two candidates 「布」 and 「市」 for 「市」.

The post-processing device 3 receives these candidate characters from the character recognition device 2 and outputs the words of highest similarity as candidate words. That is, the post-processing device 3 matches the words obtained by combining the candidate characters produced for 小田原市 against the words in the word dictionary 4 and finds those that agree.

Assume that 「大田原市」 (Otawara-shi) is obtained as the first candidate and 「小田原市」 (Odawara-shi) as the second candidate.
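The matching step can be illustrated with a minimal Python sketch. It enumerates every combination of the per-character candidates from the FIG. 2 example and keeps those found in a dictionary. The source does not define the similarity measure, so the rank-sum score used here, the two-word dictionary, and the name `match_words` are assumptions made only for illustration.

```python
from itertools import product

# Candidate characters per position, best first, as in the FIG. 2 example.
candidates = [
    ["大", "小", "山"],
    ["田", "日"],
    ["原"],
    ["布", "市"],
]

# Hypothetical dictionary contents; a real word dictionary 4 would hold many more words.
word_dictionary = {"大田原市", "小田原市"}


def match_words(candidates, dictionary):
    """Return the dictionary words that can be formed from the candidates, best first."""
    hits = []
    # Try every combination of candidate ranks, one rank per character position.
    for ranks in product(*(range(len(c)) for c in candidates)):
        word = "".join(c[r] for c, r in zip(candidates, ranks))
        if word in dictionary:
            # Assumed score: the sum of candidate ranks (lower means more similar).
            hits.append((sum(ranks), word))
    return [word for _, word in sorted(hits)]


print(match_words(candidates, word_dictionary))
```

Under this assumed score, 大田原市 ranks ahead of 小田原市 because 「大」 was the top candidate for the first character, which reproduces the ordering assumed in the text.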

The word dictionary 4 has the structure shown in FIG. 3. In FIG. 3, 6 is an address table, from which the start address in the word table 7 of the words beginning with the first character of a candidate word (for example, 「小」) can be obtained. In the address table 6, N indicates the number of words beginning with 「小」, and P1 indicates the start address in the word table 7 of the words beginning with 「小」. Also in FIG. 3, 8 is the code information; the words in the word table 7 and the entries of the code information 8 correspond one to one. The word table 7 is sorted by first character, so words that begin with the same character are grouped together, and the start of each group can be found from the address table 6. In the case of FIG. 2, postal codes are stored in the code information 8.
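The FIG. 3 layout can be sketched as follows, assuming plain Python lists stand in for the word table 7 and the code information 8 and a dict stands in for the address table 6. The entry 小山市 and its code "000" are made-up placeholders added only so that one group holds more than one word; the function names are likewise hypothetical.

```python
# Word table 7 and code information 8 are parallel lists:
# entry i of one belongs to entry i of the other.
# The table is grouped by first character, as the text requires.
word_table = ["大田原市", "小山市", "小田原市"]
code_info = ["324", "000", "256"]  # "000" is a placeholder, not a real postal code


def build_address_table(words):
    """Map each first character to (N, P): word count and start index of its group."""
    table = {}
    for index, word in enumerate(words):
        head = word[0]
        if head not in table:
            table[head] = (1, index)
        else:
            count, start = table[head]
            table[head] = (count + 1, start)
    return table


address_table = build_address_table(word_table)  # stands in for address table 6


def lookup(word):
    """Return the code information for word, scanning only its first-character group."""
    count, start = address_table.get(word[0], (0, 0))
    for i in range(start, start + count):
        if word_table[i] == word:
            return code_info[i]
    return None


print(address_table["小"], lookup("小田原市"))
```

Because the code information sits at the same index as the matched word, no second dictionary has to be searched to obtain it.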

When the second field of the form 1 is read and word-matched, and 「大田原市」 and 「小田原市」 are obtained as candidate words as described above, the postal codes 「324」 and 「256」 are also output as their respective accompanying information based on the code information 8. If the first field was read as 「256」, a check against the postal code eliminates 「大田原市」 and selects 「小田原市」.
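The cross-field check described above amounts to a simple filter, sketched here in Python. The function name `select_by_postal_code` and the pair representation are assumptions; the words and postal codes are those of the example.

```python
def select_by_postal_code(candidates, postal_code_read):
    """Keep the candidate words whose attached postal code equals the one read from the form."""
    return [word for word, code in candidates if code == postal_code_read]


# (word, code information) pairs in similarity order, as in the example above.
candidates = [("大田原市", "324"), ("小田原市", "256")]

print(select_by_postal_code(candidates, "256"))  # ['小田原市']
```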

FIG. 4 is a diagram showing another example of the word dictionary 4.

In FIG. 4, a pointer P1 is stored as the code information 8, and the pointer P1 points to a postal code stored in a postal code dictionary 9.
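The indirection of FIG. 4 can be sketched as below, assuming list indices play the role of pointers. The list names and the function `postal_code_for` are hypothetical.

```python
postal_code_dictionary = ["256", "324"]  # stands in for postal code dictionary 9
word_table = ["大田原市", "小田原市"]      # stands in for word table 7
code_info = [1, 0]                        # code information 8: an index into dictionary 9


def postal_code_for(word):
    """Follow the pointer stored with the word to reach its postal code."""
    pointer = code_info[word_table.index(word)]
    return postal_code_dictionary[pointer]


print(postal_code_for("小田原市"))
```

Compared with storing the postal code directly, the pointer form keeps the word table entries a fixed size, at the cost of one extra lookup.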

FIG. 5 is a diagram showing another example of the word dictionary 4.

In FIG. 5, a pointer P is provided for each word in the word table 7, and the pointer P indicates where the postal code, prefecture name, and reading for that word are stored in a separate database 10. In this case, the address table 6 shown in FIG. 3 may be omitted.

In the above embodiment the postal code check is performed inside the character reading device, but the check can also be left to a host device. In that case, each candidate word is output paired with its code information. The host device can use the code information for checking or output it as data as it is.

FIG. 6 shows a case in which the reading, rather than the postal code, is used as the code data. This too can be used for checks such as a furigana check, and it can also be used for conversion from kanji to kana or from kana to kanji.
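A minimal sketch of the FIG. 6 variant follows, assuming a dict maps each word to its reading. The function names are hypothetical, and the two readings are given only as illustrative entries.

```python
# Code information holding the reading of each word (illustrative entries).
readings = {"小田原市": "おだわらし", "大田原市": "おおたわらし"}


def to_kana(word):
    """Kanji-to-kana conversion: return the reading stored with the word."""
    return readings.get(word)


def to_kanji(kana):
    """Kana-to-kanji conversion: return every word whose stored reading matches."""
    return [word for word, reading in readings.items() if reading == kana]


def furigana_matches(word, furigana):
    """Furigana check: does the reading written on the form agree with the word?"""
    return readings.get(word) == furigana


print(to_kana("小田原市"), to_kanji("おおたわらし"))
```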

FIG. 7 shows a case in which the record number ri of a database 11 outside the character reading device is output as the code information 8. The database 11 is assumed to hold many kinds of information such as postal code, reading, and prefecture name. When the record number ri of the database 11 is output along with the word, the database 11 can be accessed to obtain the necessary information. In the example of FIG. 7, more complex processing is also possible, such as checking against the postal code and furigana and then outputting the prefecture name, or checking against the postal code and then outputting the reading.
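The record-number variant of FIG. 7 can be sketched with an in-memory SQLite table standing in for the external database 11. The table name, column names, record numbers, and the function `prefecture_if_consistent` are all assumptions for illustration; the rows are illustrative entries for the two example words.

```python
import sqlite3

# In-memory stand-in for the external database 11.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE places ("
    "record_no INTEGER PRIMARY KEY, postal_code TEXT, reading TEXT, prefecture TEXT)"
)
db.executemany(
    "INSERT INTO places VALUES (?, ?, ?, ?)",
    [
        (1, "324", "おおたわらし", "栃木県"),
        (2, "256", "おだわらし", "神奈川県"),
    ],
)

# Code information 8: the record number attached to each dictionary word.
record_numbers = {"大田原市": 1, "小田原市": 2}


def prefecture_if_consistent(word, postal_code, furigana):
    """Check postal code and furigana against the record, then return the prefecture."""
    row = db.execute(
        "SELECT postal_code, reading, prefecture FROM places WHERE record_no = ?",
        (record_numbers[word],),
    ).fetchone()
    if row is not None and row[0] == postal_code and row[1] == furigana:
        return row[2]
    return None


print(prefecture_if_consistent("小田原市", "256", "おだわらし"))
```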

Besides the record number ri, a keyword that gives access to the database may also be used as the code information 8.

FIG. 8 is a diagram showing another example of the word dictionary 4.

As shown in FIG. 8, the code information 8 stores, in addition to a pointer P2 into a dictionary 12, the number M of words in the related word group that the pointer P2 points to. Therefore, once word matching yields the word of highest similarity, the group of words related to that word can be output. If the appearance frequencies of the words in the group are known, the words can be stored in order of frequency; alternatively, if appearance frequency information 13 is stored as shown in FIG. 8, they can be rearranged in order of frequency before output. Furthermore, if a reading is entered in another field on the form 1, it can be used as a check according to whether a word matching that reading is included in the word group.
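The FIG. 8 arrangement can be sketched as follows, assuming the related words are readings stored in one flat list and a (P2, M) pair selects a slice of it. The example word 日本 with its two readings, the frequency counts, and the function names are hypothetical illustrations, not data from the source.

```python
related_words = ["にっぽん", "にほん", "おだわらし"]  # stands in for dictionary 12
frequency = [3, 7, 1]  # appearance frequency information 13 (made-up counts)

# Code information 8: word -> (P2, M), the start index and size of its related group.
code_info = {"日本": (0, 2), "小田原市": (2, 1)}


def related_group(word):
    """Return the related words of word, most frequent first."""
    start, count = code_info[word]
    indices = sorted(range(start, start + count), key=lambda i: frequency[i], reverse=True)
    return [related_words[i] for i in indices]


def reading_is_consistent(word, reading):
    """Check a reading written in another field against the related word group."""
    return reading in related_group(word)


print(related_group("日本"))
```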

[Effect of the invention]

As is clear from the above description, the present invention outputs not only the best-matching word found by word matching but also information specific to that word, or a pointer into the dictionary in which that information is stored; comparing this with the information written in other fields yields a more accurate recognition result. The specific information obtained can also be output together with the word, or in place of it, so the method supports more intelligent character reading devices and minimizes the effort of converting recognition results into other information.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an embodiment of the character reading device of the present invention; FIG. 2 is a diagram showing the characters entered on a form and their candidate characters; FIGS. 3 to 6 are diagrams showing configuration examples of the word dictionary shown in FIG. 1; FIGS. 7 and 8 are diagrams showing examples of combining the word dictionary shown in FIG. 1 with a database or the like; and FIGS. 9(a) and 9(b) are block diagrams showing improved examples of a character reading device according to the prior art.

1: form, 2: character recognition device, 3: post-processing device, 4: word dictionary, 5: matching result, 6: address table, 7: word table, 8: code information, 9: postal code dictionary, 10, 11: database, 12: dictionary, 13: appearance frequency information.

Claims

[Claims] 1. A character recognition post-processing method in a character reading device that includes character recognition means for reading a character string written on a form and outputting candidate characters, a word dictionary storing a plurality of words, and word matching means for matching combinations of the candidate characters output from the recognition means against the words stored in the word dictionary to find words of high similarity, the method being characterized in that code information specific to each word is attached to every word stored in the word dictionary, and, when the word matching means finds a word of high similarity from the combinations of candidate characters, the code information is output together with it and the necessary information is retrieved based on the code information.