JPH0636069A - Character recognizing device

Info

Publication number: JPH0636069A
Application number: JP4185079A
Authority: JP
Inventors: Kenji Mishima; 健司三縞
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1992-07-13
Filing date: 1992-07-13
Publication date: 1994-02-10

Abstract

PURPOSE: To enable detailed specification and improve the recognition rate by using a regular expression to specify the character type in format control information (FC). CONSTITUTION: The character recognition device comprises an FC unit 2 that stores the format control information referred to when reading characters and the like recorded on a form, in which the information specifying the character type is expressed as a regular expression; a regular expression analysis unit 31 that analyzes the regular expression in the format control information stored in the FC unit 2; a character type control unit 34 that selects, from a recognition dictionary 33, the dictionary to be used according to the analysis result of the regular expression analysis unit 31; and a collation unit 35 that obtains a reading result by collating the character patterns recorded on the form with the dictionary selected by the character type control unit 34.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a character recognition device that reads characters and the like recorded on a sheet based on format control information (FC) including the reading position, font, character type, and the like.

[0002]

[Prior Art] A conventional character recognition device (OCR) reads characters and the like recorded on a sheet based on format control information (FC) that includes the reading position, font, character type, and the like for a form or similar document. With regard to character type, the format control information (FC) designates a type such as alphabetic (A), numeric (N), or katakana (K) for each character entry field provided on the sheet. Every character recorded in a field is therefore recognized on the assumption that it belongs to the designated character type.

[0003]

[Problems to Be Solved by the Invention] As described above, a conventional character recognition device recognizes every character recorded in a field as the character type designated for that field. Consequently, fine-grained control could not be specified, for example for an amount field, where the first character may be the symbol "¥" and the first digit is never "0". In such a case, depending on the quality of the characters recorded in the amount field, the first digit may be misrecognized as "0", which lowers the recognition rate.

[0004] The present invention has been made in view of the above, and its object is to provide a character recognition device that allows detailed specification by using a regular expression to designate the character type in the format control information (FC), thereby improving the recognition rate.

[0005]

[Means for Solving the Problems] The present invention comprises: format control information storage means for storing format control information that is referred to in order to read characters and the like recorded on a sheet, in which the information designating the character type in the format control information is represented by a regular expression; regular expression analysis means for analyzing the regular expression in the format control information stored in the format control information storage means; and reading means for obtaining a reading result for the characters and the like recorded on the sheet based on the analysis result of the regular expression analysis means.

[0006]

[Operation] With this configuration, because the information on character type contained in the format control information is represented by a regular expression, recognition can be controlled finely for each character to be recognized.

[0007]

[Embodiment] An embodiment of the present invention will be described below with reference to the drawings. FIG. 2 is a block diagram showing the schematic configuration of a character recognition device according to the embodiment, and FIG. 1 is a block diagram showing the detailed configuration of the recognition unit 3 in FIG. 2.

[0008] As shown in FIG. 2, the character recognition device comprises a control unit 1, an FC unit 2, a recognition unit 3, an output unit 4, and an input unit 5. The control unit 1 controls the entire device. The FC unit 2 stores the format control information (FC), including the reading position, font, character type, and the like, that is referred to when character recognition is performed. The character type information in the format control information (FC) can be designated by a regular expression, for example for each field that indicates a character entry position on a form. Regular expressions are described in detail later. In accordance with the format control information (FC) stored in the FC unit 2, the recognition unit 3 reads characters by cutting out character patterns from the image obtained when the input unit 5, described below, optically scans a form or the like, and by performing character recognition on those patterns. The output unit 4 outputs the reading result obtained through character recognition by the recognition unit 3. The input unit 5 optically scans a form or the like on which the characters to be read are recorded, detects the form image, and outputs it to the recognition unit 3.

[0009] As shown in FIG. 1, the recognition unit 3 comprises a regular expression analysis unit 31, a cutout unit 32, a recognition dictionary 33, a character type control unit 34, and a collation unit 35. The regular expression analysis unit 31 receives the format control information (FC) stored in the FC unit 2 and analyzes the regular expression designated in that FC. The cutout unit 32 cuts out character patterns one character at a time from the form image output by the input unit 5, based on the reading position information designated in the format control information (FC). The recognition dictionary 33 stores dictionaries for all recognizable characters. The character type control unit 34 selects, for each character, which character type dictionary is to be used for character recognition according to the analysis result of the regular expression analysis unit 31. The collation unit 35 recognizes a character by collating the dictionary selected by the character type control unit 34 with the character pattern cut out by the cutout unit 32, and outputs the reading result. Next, the operation of the embodiment will be described.

[0010] First, format control information (FC) for the form to be processed is registered in the FC unit 2. The format control information (FC) includes the field position (reading position), which is the target area for character reading on the form, and information on the font, character type, and the like of the characters recorded in the field. Of these, the character type information is represented by a regular expression.

[0011] A regular expression is a notation for describing patterns of character strings and is written according to rules such as those shown in FIG. 3. The description here uses an example regular expression for the character string entered in an amount field on a form.

[0012] For example, the fact that in the amount field the first character may be the symbol "¥" and the first digit is a character other than "0" is expressed as the regular expression

¥?[1-9][0-9]* ... (a)

That is, reading the character string from the left: zero or one "¥", then exactly one of "1" to "9", then zero or more of "0" to "9".
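
As an illustration only, expression (a) can be tried out with Python's standard re module. The sketch below is not part of the disclosed device; the function name is hypothetical, and it assumes that the yen sign is encoded as U+00A5 and that the regular expression rules of FIG. 3 agree with Python's syntax for ?, [ ] and *.

```python
import re

# Expression (a), written in Python's regex syntax.
# Assumption: the yen sign is the single character U+00A5.
AMOUNT = re.compile(r"¥?[1-9][0-9]*")


def is_valid_amount(text):
    """Return True when the whole string matches expression (a)."""
    return AMOUNT.fullmatch(text) is not None
```

Under these assumptions, "¥1200" and "1200" are accepted, while "0120" and "¥0" are rejected because the first digit is "0".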

[0013] The form image detected by the input unit 5 is input to the cutout unit 32 of the recognition unit 3, and the format control information (FC) for the form to be processed, stored in the FC unit 2, is input to the regular expression analysis unit 31 and the cutout unit 32 of the recognition unit 3.

[0014] The cutout unit 32 cuts out character patterns based on information such as the field position and outputs the cut-out character patterns to the collation unit 35. Meanwhile, the regular expression analysis unit 31 analyzes the format control information (FC) for the field being processed. By analyzing a regular expression such as (a) above, the regular expression analysis unit 31 determines that the character entered in the first position is either "¥" or one of "1" to "9" (since "¥" may occur zero times), and instructs the character type control unit 34 to use only the dictionaries for those characters. The collation unit 35 collates the dictionary selected by the character type control unit 34 with the first-position character pattern cut out by the cutout unit 32 and outputs the reading result.
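
The effect of restricting the dictionary for the first position can be sketched as follows. This is a toy model rather than the disclosed collation method: the feature vectors, the Hamming-distance comparison, and the function names are all assumptions made for illustration.

```python
# Toy recognition dictionary: character -> binary feature vector.
# The vectors are made up for illustration only.
RECOGNITION_DICTIONARY = {
    "0": (1, 1, 1, 1, 0, 1),
    "6": (1, 0, 1, 1, 1, 1),
    "8": (1, 1, 1, 1, 1, 1),
    "¥": (0, 1, 0, 0, 1, 0),
}


def select_dictionary(allowed):
    """Character type control: keep only the entries for allowed characters."""
    return {ch: vec for ch, vec in RECOGNITION_DICTIONARY.items() if ch in allowed}


def collate(pattern, dictionary):
    """Collation: return the entry with the smallest Hamming distance."""
    def distance(ch):
        return sum(a != b for a, b in zip(pattern, dictionary[ch]))
    return min(dictionary, key=distance)


# A cut-out pattern that happens to look most like "0".
first_pattern = (1, 1, 1, 1, 0, 1)
unrestricted = collate(first_pattern, RECOGNITION_DICTIONARY)
restricted = collate(first_pattern, select_dictionary({"¥", "6", "8"}))
```

With the full toy dictionary the pattern is read as "0"; once "0" is excluded for the first position, the closest remaining entry is chosen instead, which is the kind of misreading the restriction is meant to prevent.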

[0015] Next, by analyzing (a) again, the regular expression analysis unit 31 determines that, if the reading result for the first position was "¥", the character entered in the second position is one of "1" to "9".

[0016] If the reading result for the first position was one of "1" to "9", the character entered in the second position is one of "0" to "9". In this way, depending on the reading result for the first position, the dictionary used to read the second position can also be narrowed. Furthermore, (a) shows that the dictionary for "0" to "9" suffices for the third and subsequent positions.
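
The position-by-position narrowing described above might be sketched as a small state-set simulation. The representation of expression (a) as a list of (character set, quantifier) items and all function names are assumptions for illustration; the publication does not state how the regular expression analysis unit 31 is implemented.

```python
DIGITS = set("0123456789")

# Expression (a) as (allowed characters, quantifier) items.
# Quantifiers: "1" exactly once, "?" zero or one, "*" zero or more.
PATTERN_A = [({"¥"}, "?"), (DIGITS - {"0"}, "1"), (DIGITS, "*")]


def closure(pattern, positions):
    """Add every item index reachable by skipping optional items."""
    result = set()
    for pos in positions:
        while pos < len(pattern):
            result.add(pos)
            if pattern[pos][1] == "1":
                break  # a mandatory item cannot be skipped
            pos += 1
    return result


def step(pattern, positions, ch):
    """Advance the set of active items over one already-read character."""
    nxt = set()
    for pos in closure(pattern, positions):
        chars, quant = pattern[pos]
        if ch in chars:
            nxt.add(pos if quant == "*" else pos + 1)
    return nxt


def candidates(pattern, prefix):
    """Characters that may legally follow the already-read prefix."""
    positions = {0}
    for ch in prefix:
        positions = step(pattern, positions, ch)
    allowed = set()
    for pos in closure(pattern, positions):
        allowed |= pattern[pos][0]
    return allowed
```

For example, candidates(PATTERN_A, "") yields "¥" and "1" to "9", candidates(PATTERN_A, "¥") yields "1" to "9", and candidates(PATTERN_A, "7") yields "0" to "9", matching the three cases in the text.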

[0017] Because a detailed character type can thus be specified by a regular expression for the character string entered in a field, control at the level of individual characters becomes possible. Restricting the dictionary for each character yields more reliable recognition results, and the restricted dictionary also shortens processing time.

[0018] In the embodiment above, characters are recognized after narrowing the dictionary based on the character type designated by the regular expression. Alternatively, all characters recorded in a field may be read as the same character type, and the regular expression may be referred to when checking the validity of the reading result (post-processing).
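
One way such a post-processing check could look is sketched below. The input format (a ranked list of guesses per character position) and the function name are hypothetical and are not described in the publication. Combinations are tried in itertools.product order, starting with every position's top guess.

```python
import re
from itertools import product

# Expression (a); assumption: the yen sign is U+00A5.
AMOUNT = re.compile(r"¥?[1-9][0-9]*")


def pick_valid_reading(ranked_candidates, pattern=AMOUNT):
    """Return the first candidate combination that matches the pattern.

    ranked_candidates holds, per character position, the recognizer's
    guesses ordered from most to least likely (hypothetical input format).
    """
    for combo in product(*ranked_candidates):
        text = "".join(combo)
        if pattern.fullmatch(text):
            return text
    return None  # no combination satisfies the regular expression
```

For instance, if the first position was read as "0" with "6" as the runner-up, pick_valid_reading([["0", "6"], ["1"], ["2"]]) rejects "012" and returns "612".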

[0019] The above describes designating the character types within a field, but a regular expression can also serve in place of a character set designation. For example, when only "Y" or "N" is ever entered, "NY" can be designated using a regular expression. Likewise, when only a limited set of words is ever entered, they can be designated by a regular expression. For example, a designation such as Tokyo|Saitama|Chiba|Kanagawa can be used as a simple word dictionary for post-processing.
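
A minimal sketch of both uses in Python follows. It assumes that the "NY" designation corresponds to the character class [NY] in Python's syntax, which the publication does not state explicitly, and the helper name is hypothetical.

```python
import re

# Assumption: the "NY" designation is equivalent to the character class [NY].
YES_NO = re.compile(r"[NY]")

# The alternation from the text, used as a small word dictionary.
PREFECTURE = re.compile(r"Tokyo|Saitama|Chiba|Kanagawa")


def allowed_words(alternation):
    """Split a plain alternation (no groups or escapes) into its words."""
    return alternation.split("|")


def is_known_prefecture(text):
    """Return True when the whole reading result is one of the listed words."""
    return PREFECTURE.fullmatch(text) is not None
```

Here is_known_prefecture("Chiba") is True and is_known_prefecture("Osaka") is False, so a reading result outside the listed words can be flagged during post-processing.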

[0020]

[Effect of the Invention] As described above, according to the present invention, using a regular expression to designate the character type in the format control information (FC) allows detailed specification, and the recognition rate can therefore be improved.

[Brief description of drawings]

FIG. 1 is a block diagram showing the detailed configuration of the recognition unit 3 in FIG. 2.

FIG. 2 is a block diagram showing the configuration of a character recognition device according to an embodiment of the present invention.

FIG. 3 is a diagram showing an example of the regular expression rules used in this embodiment.

[Explanation of symbols]

1 ... control unit, 2 ... FC unit, 3 ... recognition unit, 4 ... output unit, 5 ... input unit, 31 ... regular expression analysis unit, 32 ... cutout unit, 33 ... recognition dictionary, 34 ... character type control unit, 35 ... collation unit.

Claims

[Claims]

1. A character recognition device comprising: format control information storage means for storing format control information that is referred to in order to read characters and the like recorded on a sheet, in which information designating a character type in the format control information is represented by a regular expression; regular expression analysis means for analyzing the regular expression in the format control information stored in the format control information storage means; and reading means for obtaining a reading result for the characters and the like recorded on the sheet based on an analysis result of the regular expression analysis means.