JP2530659B2 - Optical character reading system

Info

Publication number: JP2530659B2
Application number: JP62179205A
Authority: JP
Inventors: 賢一高本; 彰三門田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1987-07-20
Filing date: 1987-07-20
Publication date: 1996-09-04
Anticipated expiration: 2011-09-04
Also published as: JPS6423384A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to an optical character reading system, and more particularly to an optical character reading system suited to letting the user build a desired word dictionary inside the system.

[Conventional technology]

In conventional optical character reading systems, the word dictionary referred to during word matching is fixed in the system in advance, and the kinds of words it holds are limited, for example to prefecture names and municipality names. Consequently, only words in a field whose word type matches a word type attribute of the word dictionary can be matched; arbitrary words cannot.

Related devices of this kind are described in, for example, JP-A-60-584 and JP-A-59-197974.

[Problems to be solved by the invention]

In the conventional technology described above, no word dictionary is provided for word types specific to a particular user, such as bank names in the banking industry or product name codes in the distribution industry, so word matching cannot be performed for them.

In addition, word matching in conventional optical character reading systems checks every combination of the candidate characters produced by character recognition, so the processing takes a great deal of time.

An object of the present invention is to provide an optical character reading system that enables word matching for user-specific word types and also speeds up word matching.

[Means for solving problems]

The optical character reading system of the present invention is applied to an optical character reading system that comprises character recognition means for reading character patterns on a form and recognizing them one character at a time, a word dictionary holding registered words, and a word matching unit that matches the recognized word against the registered words of the word dictionary, and that outputs the word matching result. It has the following features.

The word dictionary comprises a word table and an address table. The word table is formed by rearranging the words of an arbitrary word type source according to a fixed criterion and, for each rearranged word, storing at least the code of the word together with a pointer to the word table address of another word whose second character is the same.

The address table holds, for the first character of each word in the word type source, the word table address of that word, and, for the second character of each word, the corresponding word table address.

The word matching unit obtains from the address table the word table address for the first or second character of the recognized word, and matches the recognized word against the word table at that address.

[Operation]

According to the present invention, a desired word table and address table are created from a word type source chosen by the user and are registered as the word dictionary. As a result, word matching can be performed for user-specific word types in the same way as for fields of the word types built into the device.

Furthermore, the address table is searched using the first or second character of the recognized word, and the recognized word is matched only against the word table entries at the addresses found, which speeds up word matching.

[Example]

The present invention is described in more detail below with reference to the example shown in the accompanying drawings.

FIG. 1 is a block diagram showing one example of the present invention. As shown in FIG. 1, the optical character reading system consists of an optical character reading device 1 and a general-purpose computer 2. The photoelectric conversion unit 11 optically scans the form 3 and converts the characters written on it into a binarized pattern. The character segmentation unit 12 cuts the binarized pattern into individual characters, and the character recognition unit 13 recognizes each character and stores the results in the character recognition result candidate buffer 15. The word matching unit 14 matches the contents of the character recognition result candidate buffer 15 against the word dictionary 17 and stores the best-matching word in the word matching result buffer 16.

A concrete example of word matching is described in more detail with reference to FIGS. 2(a), (b), (c), and (d), taking the reading of the prefecture field 21 of the form 3 shown in FIG. 2(a) as the example. The characters 「神奈川」 (Kanagawa) written in the prefecture field 21 are recognized one character at a time, and, as shown in FIG. 2(b), the first through fourth candidates for each of the first through third characters are stored in the character recognition result candidate buffer 15.

The word matching unit 14 matches the candidates in the character recognition result candidate buffer 15 against the words registered in the prefecture table 22 of the word dictionary 17 shown in FIG. 2(c), and stores the best-matching word in the word matching result buffer 16. In this example, as shown in FIG. 2(d), the word 『神奈川』 is output as the best match. When the word fields are hierarchically structured, as in an address (prefecture, city, town), word matching can make use of the hierarchy information.
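The idea of picking the best-matching registered word from per-character candidate lists can be sketched as follows. This is only an illustration: the patent does not state how the match quality is scored, so the scoring rule (count the positions where the word's character appears among the candidates), the function names, and the candidate characters are all assumptions made for this sketch.

```python
def match_score(word, candidates):
    """Count positions where the word's character is among the candidates.

    Returns -1 when the word length differs from the number of recognized
    character positions (assumed scoring rule, not taken from the patent).
    """
    if len(word) != len(candidates):
        return -1
    return sum(1 for ch, cands in zip(word, candidates) if ch in cands)


def best_match(words, candidates):
    """Return the registered word with the highest score, or None."""
    best_word, best_score = None, 0
    for word in words:
        score = match_score(word, candidates)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical candidate lists: four candidates for each of three positions.
candidates = [
    ["伸", "神", "祥", "袖"],
    ["奈", "余", "奇", "条"],
    ["川", "州", "小", "リ"],
]
prefectures = ["神奈川", "徳島", "福岡", "福島"]
print(best_match(prefectures, candidates))
```

Scoring every registered word like this is the exhaustive approach; the address table and word table described later exist to avoid it.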

The feature of this example is that the user can freely set the contents of the word dictionary 17 shown in FIG. 1 from a word type source 4, through processing by the general-purpose computer 2 and the control unit 18. This processing is described next. The word type source 4 shown in FIG. 1 is created from a database or the like. In the general-purpose computer 2, the contents of the word type source 4 are sorted in character code order (for example, Shift JIS code order) of the first character of each word, assembled, and then registered in the word dictionary 17 via the control unit 18.

Word registration is described in detail below. First, the user-specific word type is sorted in character code order. The purpose is to allow the storage address to be determined uniquely by a fixed formula when searching for words that contain a given character. The word type source 4 created from a database or the like is used here because the hierarchical structure of words mentioned above can generally be obtained from it easily.

Next, the sorted word type is assembled to create the word dictionary. As shown in FIG. 3, the word dictionary consists of an address table 31 and a word table 32. The address table 31 provides fast access to the word table 32 and consists of a first character word table pointer 33, used when the first character of a word is the key, and a second character word table pointer 34, used when the second character is the key. The address table is accessed as follows. When the character code is Shift JIS (two bytes), let the upper byte be $C_U$ and the lower byte be $C_L$. In hexadecimal notation,

$$C_L = C_L - (40)_H \quad \cdots (1)$$

$$C_U = C_U - (81)_H \quad \cdots (2)$$

However, when $C_U > (\mathrm{A0})_H$, the value of $C_U$ given by $C_U = C_U - (40)_H$ is first substituted into the right-hand side of equation (2) to obtain $C_U$. When the first character word table pointer 33 and the second character word table pointer 34 are each one byte wide, the relative address $A$ from the start of the address table 31 is obtained from the $C_L$ and $C_U$ of equations (1) and (2) by the following equation (3).

$$A = (C_U \cdot 192 + C_L) \cdot 2 \quad \cdots (3)$$

Searching the word table with the word table pointer stored at address $A$ yields the words whose first character has the given code. Words can also be searched with the second character as the key; the relative address in the address table is obtained from the character code of the second character in the same way as for the first character. Among the words that have the given character as their second character, the address of the one that appears first from the start of the word table is stored in the second character word table pointer 34 at address $A + 1$ of the address table 31. The second character is used as the key mainly when the first character cannot be used, for example because it was rejected. In addition to the word 37, each entry of the word table 32 holds a word length 35 giving the length of the word and a second character link pointer 36 used when the second character is the key. The second character link pointer 36 stores the relative address, measured from the current word, of another word with the same second character; following these links until the relative address is 0 yields all the words with the same second character.
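Equations (1) to (3) can be sketched in Python as below. The function and constant names are hypothetical. The handling of the upper lead-byte range is an assumption of this sketch: it applies the extra subtraction of 0x40 to lead bytes of 0xE0 and above, which is the choice that keeps the index contiguous across the gap in Shift JIS lead bytes (0x81 to 0x9F, then 0xE0 upward). Only two-byte characters are handled.

```python
SLOTS_PER_LEAD = 192  # multiplier used in equation (3)
ENTRY_BYTES = 2       # two 1-byte pointers (first- and second-character) per code


def address_table_offset(lead, trail):
    """Relative address A of a two-byte Shift JIS code in the address table."""
    c_l = trail - 0x40            # equation (1)
    if lead >= 0xE0:              # assumed reading of the special case
        lead -= 0x40              # fold the upper lead-byte range down
    c_u = lead - 0x81             # equation (2)
    return (c_u * SLOTS_PER_LEAD + c_l) * ENTRY_BYTES  # equation (3)


def offset_for_char(ch):
    """Convenience wrapper that encodes one double-byte character first."""
    lead, trail = ch.encode("shift_jis")
    return address_table_offset(lead, trail)


# The first-character pointer sits at A and the second-character pointer at A + 1.
a = offset_for_char("神")
print(hex(a), hex(a + 1))
```

With this folding, lead byte 0x9F maps to index 30 and lead byte 0xE0 maps to index 31, so no address table space is reserved for the unused lead-byte range.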

Next, a concrete example of sorting and assembling the contents of the word type source 4 is described with reference to FIGS. 4(a), (b), (c), and (d). First, the words registered in the word type source 4 shown in FIG. 4(a) are rearranged in ascending order of the Shift JIS code of their first character to create the sorted word type source 41 shown in FIG. 4(b). Next, the word table 32 shown in FIG. 4(c) is created from the sorted word type source 41. The length of each word 37 in the sorted word type source 41 is calculated in bytes and stored as the word length 35. When the second character of a word is the same as the second character of another word, the difference between the storage address of that other word and the storage address of the word is stored as the second character link pointer 36. In this example, 徳島 (Tokushima) and 福島 (Fukushima) have the same second character, so the second character link pointer 36 is calculated as $(22)_H - (12)_H = (10)_H$. When no other word contains the same character, 0 is stored as the second character link pointer 36. The addresses 0, A, 12, ... of the word table 32 are given in hexadecimal.
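The sort-and-assemble step for the word table can be sketched as follows. This is a minimal illustration under stated assumptions: the four-word source list is hypothetical (it is not the full contents of FIG. 4), the entry header is assumed to occupy 4 bytes (word length plus link pointer), and the names `build_word_table` and `HEADER_BYTES` are invented for the sketch. Because the list is shorter than the one in the figure, the absolute addresses differ from those quoted in the text.

```python
HEADER_BYTES = 4  # assumed size of word length + second character link pointer


def sjis(text):
    """Shift JIS byte encoding of a string."""
    return text.encode("shift_jis")


def build_word_table(source_words):
    """Sort by the first character's Shift JIS code and lay out the entries."""
    ordered = sorted(source_words, key=lambda w: sjis(w[0]))  # stable sort
    table = []
    address = 0
    for word in ordered:
        length = len(sjis(word))  # word length in bytes
        table.append({"address": address, "length": length, "link": 0, "word": word})
        address += HEADER_BYTES + length
    # Link each word to the next later word that has the same second character.
    for i, entry in enumerate(table):
        second = entry["word"][1:2]
        for later in table[i + 1:]:
            if second and later["word"][1:2] == second:
                entry["link"] = later["address"] - entry["address"]
                break
    return table


for e in build_word_table(["福岡", "徳島", "神奈川", "福島"]):
    print(hex(e["address"]), e["length"], hex(e["link"]), e["word"])
```

In this layout 徳島 is linked to 福島 by a relative address, so the chain of words sharing a second character can be followed without scanning the entries in between.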

Finally, the address table 31 shown in FIG. 4(d) is created from the word table 32. The address table 31 lists the characters in the word table 32 in ascending order of Shift JIS code and corresponds one-to-one with the Shift JIS codes. It is created as follows. First, the first character word table pointer 33 and the second character word table pointer 34 of every character are initialized to $(\mathrm{FF})_H$. Next, for each word 37 in the word table 32, the word table address of that word is stored in the first character word table pointer 33 of the address table 31 entry corresponding to the code of the word's first character. If the entry already holds a value other than $(\mathrm{FF})_H$, it is not updated. In this example, the word 福島 (Fukushima) does not update the address table entry for 福, whose content is $(1\mathrm{A})_H$, because it has already been set by 福岡 (Fukuoka). The second character word table pointer 34 is obtained in the same way for the second character, and the address table 31 is complete once this operation has been carried out for every word 37 in the word table 32.
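The address table construction, including the rule that an already-set pointer is never overwritten, can be sketched as below. The word table literal is a hypothetical four-word table, so its addresses differ from those in FIG. 4, and a Python dictionary keyed by character stands in for the Shift JIS indexed table. The names `build_address_table` and `UNSET` are assumptions of this sketch.

```python
UNSET = 0xFF  # initial value of every pointer

# Hypothetical word table: (word table address, word), already sorted.
WORD_TABLE = [
    (0x00, "神奈川"),
    (0x0A, "徳島"),
    (0x12, "福岡"),
    (0x1A, "福島"),
]


def build_address_table(word_table):
    """Map each character to [first-character pointer, second-character pointer]."""
    # Initialize both pointers of every character that occurs in the word table.
    table = {ch: [UNSET, UNSET] for _, word in word_table for ch in word}
    for address, word in word_table:
        for position, ch in enumerate(word[:2]):
            # Keep the earliest address: update only while still unset.
            if table[ch][position] == UNSET:
                table[ch][position] = address
    return table


for ch, (first, second) in build_address_table(WORD_TABLE).items():
    print(ch, hex(first), hex(second))
```

Here the entry for 福 keeps the address of 福岡 because 福島 arrives later and finds the pointer already set, mirroring the behaviour described in the text.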

The word dictionary object, consisting of the address table 31 for fast word access and the word table 32 with the added word length 35 and second character link pointer 36, is registered by the general-purpose computer 2 as the word dictionary 17 in the optical character reading device. At registration, as shown in FIG. 3, a word dictionary management table 38 is created that stores, for each word type, the start addresses and sizes of the address table 31 and word table 32 of the word dictionary.

The word matching unit 14 shown in FIG. 1 uses the word dictionary 17 described above to match the contents of the character recognition result candidate buffer 15 as follows. Suppose the first candidate character in the character recognition result candidate buffer 15 is 神. The first character word table pointer 33 corresponding to 神 in the address table 31 contains 0, so the entry at address 0 of the word table 32 is referenced, and the word 37 神奈川 (Kanagawa) is obtained as the matching result. In this example no other word has 神 as its first character, so the result is determined uniquely. If other such words exist (they are stored at the following addresses), the following addresses are also referenced. If matching with the first candidate character fails, matching is performed using the second candidate character. In this case, the word table 32 address is obtained using the second character word table pointer 34 of the address table 31, and the word 37 at that address is matched. If the second character link pointer 36 at that address is not 0, the word 37 at the address it indicates is matched as well. Thus, according to this example, the candidate characters in the character recognition result candidate buffer 15 need not be matched against every word in the word dictionary, and the matching process becomes much faster.
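The two lookup paths described above can be sketched as follows: a scan of consecutive entries for a first-character key, and a walk along the link chain for a second-character key. The tables are hypothetical literals (a four-word dictionary, with Python dictionaries standing in for the byte-addressed tables), and the function names are invented for this sketch.

```python
UNSET = 0xFF

# Hypothetical word table: address -> (word, second character link pointer).
WORD_TABLE = {
    0x00: ("神奈川", 0x00),
    0x0A: ("徳島", 0x10),
    0x12: ("福岡", 0x00),
    0x1A: ("福島", 0x00),
}

# Hypothetical address table: character -> (first pointer, second pointer).
ADDRESS_TABLE = {
    "神": (0x00, UNSET), "徳": (0x0A, UNSET), "福": (0x12, UNSET),
    "奈": (UNSET, 0x00), "島": (UNSET, 0x0A), "岡": (UNSET, 0x12),
}


def words_by_first_char(ch):
    """Words starting with ch: start at the pointer and scan following entries."""
    first_ptr, _ = ADDRESS_TABLE.get(ch, (UNSET, UNSET))
    if first_ptr == UNSET:
        return []
    found = []
    for address in sorted(WORD_TABLE):
        if address < first_ptr:
            continue
        word, _ = WORD_TABLE[address]
        if word[0] != ch:  # the table is sorted, so the run has ended
            break
        found.append(word)
    return found


def words_by_second_char(ch):
    """Words whose second character is ch: follow links until the link is 0."""
    _, address = ADDRESS_TABLE.get(ch, (UNSET, UNSET))
    if address == UNSET:
        return []
    found = []
    while True:
        word, link = WORD_TABLE[address]
        found.append(word)
        if link == 0:
            return found
        address += link  # the link is relative to the current entry


print(words_by_first_char("福"))
print(words_by_second_char("島"))
```

Either path touches only the handful of entries that share the key character, which is the source of the speed-up over matching against every registered word.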

Also, because the user can create the word dictionary 17 from the word type source 4, a highly versatile optical character reading system can be provided.

In the example above, the word matching unit 14 and the word dictionary 17 are located in the optical character reading device 1. It is equally conceivable to place the word matching unit 14 and the word dictionary 17 in the general-purpose computer 2, send the contents of the character recognition result candidate buffer 15 to the general-purpose computer 2, and perform word matching there.

[Effects of the invention]

According to the present invention, even for a user-specific word type, assembling that word type and registering a word dictionary consisting of a word table and an address table allows word matching to be performed in the same way as for fields of the word types supplied with the device, so high recognition accuracy can be achieved.

The address table and word table also make high-speed word matching possible.

[Brief description of drawings]

FIG. 1 is a block diagram showing one example of the present invention; FIGS. 2(a), (b), (c), and (d) are explanatory diagrams outlining the processing of the example shown in FIG. 1; FIG. 3 is an explanatory diagram showing an example of the word dictionary of the example in FIG. 1; and FIGS. 4(a), (b), (c), and (d) are explanatory diagrams showing a concrete example of word dictionary creation in the example shown in FIG. 1.

1: optical character reading device, 2: general-purpose computer, 4: word type source, 14: word matching unit, 17: word dictionary, 31: address table, 32: word table, 38: word dictionary management table.

Claims

(57) [Claims]

1. An optical character reading system comprising character recognition means for reading character patterns on a form and recognizing them one character at a time, a word dictionary holding registered words, and a word matching unit that matches a recognized word against the registered words of the word dictionary, the system outputting a word matching result, characterized in that: the word dictionary comprises a word table and an address table; the word table is formed by rearranging the words of an arbitrary word type source according to a fixed criterion and, for each rearranged word, comprises at least a pointer indicating the word table address of another word whose second character is the same as that of the word, and the code of the word; the address table comprises, for the first character of each word in the word type source, the word table address of that word, and, for the second character of each word, the corresponding word table address; and the word matching unit obtains from the address table the word table address for the first or second character of the recognized word and matches the recognized word against the word table.