JP2000029983A

JP2000029983A - Document reader device

Info

Publication number: JP2000029983A
Application number: JP10196001A
Authority: JP
Inventors: Motomitsu Kikuchi; 基充菊地
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1998-07-10
Filing date: 1998-07-10
Publication date: 2000-01-28

Abstract

PROBLEM TO BE SOLVED: To provide a document reader device that accurately reads a non-standard slip in which plural item names or data are written in one character frame. SOLUTION: Image data of a slip read by an image reader 21 is stored in an image memory 22 and read out to a ruled line extracting part 23, which extracts the ruled lines; a character frame extracting part 24 then extracts plural character frames. Image data such as character strings in each character frame is recognized by a character recognizing part 25, and an item frame detecting part 27 detects, based on an item name dictionary 29, the item description frames in which item names are written. A data frame detecting part 28 detects the data description frame in the same row as each item description frame. A data identifying part 30 associates the item description frame with the data description frame, and when plural lines or plural character strings divided by division codes are present, the item names and the data are associated with each other and identified in the order of division.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document reading apparatus that reads characters and the like written on a document, and in particular to a technique for reading non-standard forms in which the positions of the items to be read are not fixed.

[0002]

[Description of the Related Art] Forms are classified into standard forms and non-standard forms. A standard form, such as a "transfer request form" provided at a bank counter, has the positions of the items to be filled in printed in advance, and the specified data is written at the predetermined positions. When such a standard form is read, a format code for the form is printed in advance at a predetermined position on the form (for example, the upper right), and the position of the data for each item can be determined from this format code. In a non-standard form, on the other hand, the items to be read are determined in advance, but the positions at which those items are written differ from form to form. An example of a non-standard form is one created by a customer in a free format, such as a purchase order for a product.

[0003] FIG. 2 shows an example of a conventional non-standard form. In this non-standard form, the items to be read, such as "product name", "product code", and "quantity", are determined in advance, but the positions at which they are written differ from form to form. No format code is therefore written on the form, and the data corresponding to each item must be identified among the data written on the form and associated with that item. When reading a non-standard form such as that of FIG. 2, a conventional document reading apparatus proceeds as follows. First, the non-standard form is read with an image scanner or the like, which decomposes it into black-and-white binary pixels, and the resulting image data is temporarily stored in an image memory. The image data stored in the image memory is then read out by scanning in the vertical and horizontal directions, and histograms of the frequency of black pixels in the vertical and horizontal directions are generated. From these histograms the positions of the ruled lines are extracted, and then the rectangular character frames enclosed by the vertical and horizontal ruled lines are extracted.
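The histogram-based ruled line extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy image, and the 0.9 coverage threshold are assumptions introduced here, and a real scanner image would need tolerance for skew, noise, and ruled lines several pixels thick.

```python
from typing import List, Tuple


def projection_histograms(image: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Count black pixels (value 1) in every row and every column."""
    row_hist = [sum(row) for row in image]
    col_hist = [sum(col) for col in zip(*image)]
    return row_hist, col_hist


def ruled_line_positions(hist: List[int], length: int, ratio: float = 0.9) -> List[int]:
    """Indices whose black-pixel count covers at least `ratio` of the scanned length.

    `ratio` is an assumed tuning parameter, not a value given in the source.
    """
    threshold = ratio * length
    return [i for i, count in enumerate(hist) if count >= threshold]


# Toy 5x7 binary image: an outer border plus one vertical ruled line at column 3.
image = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
row_hist, col_hist = projection_histograms(image)
horizontal = ruled_line_positions(row_hist, length=len(image[0]))  # rows that are lines
vertical = ruled_line_positions(col_hist, length=len(image))       # columns that are lines
print(horizontal, vertical)
```

Rows or columns that contain only text produce low counts and fall below the threshold, which is why a simple projection is enough to separate ruled lines from characters in a clean binary image.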

[0004] Next, the character or character string written in each extracted character frame is recognized and converted into character codes, and an item name dictionary is consulted to check whether those character codes are registered as the name of an item to be read. If the character or character string in a character frame is registered in the item name dictionary, the frame is judged to be an item description frame; if not, it is judged to be a data description frame. In the form of FIG. 2, the character frames containing the strings "product name", "product code", "quantity", and "unit price" are judged to be item description frames on the basis of the item name dictionary. On the other hand, "voice LSI", "M9841", "100", and "4500" are not registered in the item name dictionary, so the character frames containing these strings are judged to be data description frames. The apparatus then checks whether a data description frame is located to the right of or below the item description frame containing the string "product name". In this case the character frame on the right is a data description frame, so the "voice LSI" written in it is judged to be the data corresponding to the item name "product name".
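The dictionary test in this paragraph amounts to a set lookup on the recognized string. The sketch below assumes the item name dictionary can be modeled as a Python set of translated item names; `ITEM_NAMES` and `classify_frame` are hypothetical names used only for illustration.

```python
# Assumed stand-in for the item name dictionary (translated item names).
ITEM_NAMES = {"product name", "product code", "quantity", "unit price"}


def classify_frame(text: str, item_names=ITEM_NAMES) -> str:
    """Return "item" if the recognized text is a registered item name, else "data"."""
    return "item" if text.strip() in item_names else "data"


for text in ["product name", "voice LSI", "product code", "M9841"]:
    print(text, "->", classify_frame(text))
```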

[0005] Similarly, the apparatus checks whether a data description frame is located to the right of or below the item description frame containing the string "product code". In this case the character frame on the right is an item description frame, but the character frame below is a data description frame, so the "M9841" written in it is judged to be the data corresponding to the item name "product code". No character frame exists to the right of the item description frame containing the string "quantity", but the character frame below it is a data description frame, so the "100" written in it is judged to be the data corresponding to the item name "quantity". Likewise, no character frame exists below the item description frame containing the string "unit price", but the character frame on its right is a data description frame, so the "4500" written in it is judged to be the data corresponding to the item name "unit price". In this way, item names are extracted on the basis of the item name dictionary, each item name is associated with the data written adjacent to it on the right or below, and the non-standard form is read.
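The right-or-below search of the conventional method can be sketched on a small grid. The layout below is a hypothetical arrangement chosen to reproduce the four cases described in the text (it is not a transcription of FIG. 2), and `find_data` is an illustrative name. The sketch checks the right neighbor first and then the one below, which is an assumed priority.

```python
ITEM_NAMES = {"product name", "product code", "quantity", "unit price"}

# Hypothetical layout: (row, column) -> recognized text of the character frame.
frames = {
    (0, 0): "product name", (0, 1): "voice LSI",
    (1, 0): "product code", (1, 1): "quantity",
    (2, 0): "M9841",        (2, 1): "100",
    (3, 0): "unit price",   (3, 1): "4500",
}


def find_data(frames, pos, item_names):
    """Return the text of the adjacent data frame (right first, then below), or None."""
    row, col = pos
    for neighbour in ((row, col + 1), (row + 1, col)):
        text = frames.get(neighbour)
        if text is not None and text not in item_names:
            return text
    return None


pairs = {
    text: find_data(frames, pos, ITEM_NAMES)
    for pos, text in frames.items()
    if text in ITEM_NAMES
}
print(pairs)
```

Each item name finds exactly one adjacent data frame here, which is the situation the conventional method handles well; it has no way to split a frame that holds two names.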

[0006]

[Problems to Be Solved by the Invention] However, the conventional document reading apparatus has the following problems. FIG. 3 shows another example of a non-standard form, used to explain the problems of the conventional document reading apparatus. As shown in FIG. 3, for example, when two item names, "product name" and "product code", are written on two lines in a single item description frame 11 of a form 10, and two data, "voice LSI" and "M9841", are written on two lines in the data description frame 12 below it, the two items could not be reliably identified and read.

[0007] Likewise, when two item names, "quantity" and "unit price", are written in a single item description frame 13 of the form 10 separated by a slash "/" as a delimiter, and two data, "100" and "4500", are written in the data description frame 14 below it separated by a slash "/", the two items could not be reliably identified and read. The present invention solves these problems of the conventional art and provides a document reading apparatus that can reliably read a non-standard form in which a plurality of item names or data are written in a single character frame.

[0008]

[Means for Solving the Problems] To solve the above problems, a first aspect of the present invention is a document reading apparatus comprising: image reading means for reading image data of a document having a plurality of character frames delimited by ruled lines by decomposing it into pixels; ruled line extracting means for extracting vertical and horizontal ruled lines from the image data read by the image reading means; character frame extracting means for extracting the character frames enclosed by the vertical and horizontal ruled lines extracted by the ruled line extracting means; character recognizing means for recognizing the character or character string written in each character frame extracted by the character frame extracting means and converting it into character codes; and item frame detecting means for referring to an item name dictionary, in which the names of the items to be read are registered as item names, and detecting item description frames, that is, character frames whose character or character string as recognized by the character recognizing means is registered in the item name dictionary. The document reading apparatus further comprises: data frame detecting means for detecting data description frames, that is, character frames that are located to the right of or below, and aligned with, an item description frame detected by the item frame detecting means and that contain a character or character string not registered in the item name dictionary; and data identifying means for associating each item description frame detected by the item frame detecting means with a data description frame detected by the data frame detecting means and, when a plurality of item names are written in the item description frame and a plurality of data are written in the data description frame, associating and identifying the item names and the data according to the order in which they are written.

[0009] Because the document reading apparatus of the first aspect is configured as described above, it operates as follows. The image data of a document having a plurality of character frames delimited by ruled lines is read by the image reading means, which decomposes it into pixels, and the ruled line extracting means extracts the vertical and horizontal ruled lines from this image data. The character frame extracting means extracts the character frames enclosed by the extracted vertical and horizontal ruled lines, and the character recognizing means recognizes the character or character string written in each character frame. The item frame detecting means refers to the item name dictionary in which the item names are registered and detects the item description frames, that is, the character frames whose recognized character or character string is registered in the item name dictionary. The data frame detecting means then detects the data description frames, that is, the character frames that are located to the right of or below, and aligned with, a detected item description frame and that contain a character or character string not registered in the item name dictionary. Finally, the data identifying means associates each item description frame with a data description frame and, when a plurality of item names are written in the item description frame and a plurality of data are written in the data description frame, associates and identifies the item names and the data according to the order in which they are written.

[0010] In a second aspect, the data identifying means of the first aspect is configured so that, when characters or character strings divided over a plurality of lines are written in the item description frame and the data description frame, the character or character string on each line is identified as a separate item name or datum, and the item names and the data are associated according to the order in which they are written. According to the second aspect, the data identifying means of the first aspect operates as follows. When characters or character strings divided over a plurality of lines are written in both the item description frame and the data description frame, the character or character string on each line is identified as a separate item name or datum, and the item names and the data are associated according to the order in which they are written.
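The line-by-line pairing of the second aspect can be illustrated as follows, assuming the recognized text of a frame is available as a string with newline characters between lines. `pair_by_lines` is a hypothetical helper, not a name from the source.

```python
def pair_by_lines(item_text: str, data_text: str):
    """Treat each non-empty line as a separate entry and pair entries in written order."""
    names = [line.strip() for line in item_text.splitlines() if line.strip()]
    values = [line.strip() for line in data_text.splitlines() if line.strip()]
    return list(zip(names, values))


# Frames 11 and 12 of the form in FIG. 3, as described in the text.
pairs = pair_by_lines("product name\nproduct code", "voice LSI\nM9841")
print(pairs)
```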

[0011] In a third aspect, the data identifying means of the first aspect is configured so that, when a plurality of characters or character strings separated by a delimiter are written in the item description frame and the data description frame, each delimited character or character string is identified as a separate item name or datum, and the item names and the data are associated according to the order in which they are written. According to the third aspect, the data identifying means of the first aspect operates as follows. When a plurality of characters or character strings separated by a delimiter are written in both the item description frame and the data description frame, each delimited character or character string is identified as a separate item name or datum, and the item names and the data are associated according to the order in which they are written.
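The delimiter-based variant of the third aspect differs only in how a frame's text is split. A minimal sketch, assuming a single-character delimiter passed in by the caller (`pair_by_delimiter` is an illustrative name):

```python
def pair_by_delimiter(item_text: str, data_text: str, delimiter: str = "/") -> dict:
    """Split both frames on the delimiter and map item names to data in written order."""
    names = [part.strip() for part in item_text.split(delimiter)]
    values = [part.strip() for part in data_text.split(delimiter)]
    return dict(zip(names, values))


# Frames 13 and 14 of the form in FIG. 3, as described in the text.
record = pair_by_delimiter("quantity/unit price", "100/4500")
print(record)
```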

[0012] In a fourth aspect, the data identifying means of the first aspect is configured so that, when a plurality of data description frames are arranged consecutively below an item description frame, the item name in the item description frame is associated with the data in each of those data description frames. According to the fourth aspect, the data identifying means of the first aspect operates as follows. When a plurality of data description frames are arranged consecutively below an item description frame, the data in each of those data description frames is associated with the item name in the common item description frame.
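For the tabular case of the fourth aspect, one way to collect the data frames stacked under a header is to walk downward until the column ends or another item name appears. The grid model, the stopping rule, and the values other than "M9841" are assumptions made for this sketch.

```python
ITEM_NAMES = {"product code", "quantity"}

# Hypothetical table: one header frame with three data frames stacked below it.
frames = {
    (0, 0): "product code",
    (1, 0): "M9841",
    (2, 0): "M9842",   # illustrative value, not from the source
    (3, 0): "M9850",   # illustrative value, not from the source
}


def collect_column(frames, header_pos, item_names):
    """Gather the consecutive data frames directly below a header frame."""
    row, col = header_pos
    values = []
    row += 1
    while (row, col) in frames and frames[(row, col)] not in item_names:
        values.append(frames[(row, col)])
        row += 1
    return values


column = collect_column(frames, (0, 0), ITEM_NAMES)
print({frames[(0, 0)]: column})
```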

[0013]

[Embodiments of the Invention] FIG. 1 is a block diagram of a document reading apparatus according to an embodiment of the present invention. The document reading apparatus has image reading means (for example, an image reading unit) 21 that reads image data of a document such as the non-standard form 10 shown in FIG. 3, which has a plurality of character frames delimited by ruled lines, such as item description frames 11 and 13 and data description frames 12 and 14. The image reading unit 21 consists of an image scanner or the like and reads the image on the document by decomposing it into black-and-white binary pixels. An image memory 22 for temporarily storing the read image data is connected to the output side of the image reading unit 21. Ruled line extracting means (for example, a ruled line extraction unit) 23 is connected to the image memory 22. The ruled line extraction unit 23 reads out the image data stored in the image memory 22 by scanning in the vertical and horizontal directions and generates histograms of the frequency of black pixels in the vertical and horizontal directions. It also has the function of extracting the positions of the vertical and horizontal ruled lines from the generated histograms. Character frame extracting means (for example, a character frame extraction unit) 24 is connected to the output side of the ruled line extraction unit 23. The character frame extraction unit 24 extracts the rectangular character frames enclosed by the vertical and horizontal ruled lines extracted by the ruled line extraction unit 23. Character recognizing means (for example, a character recognition unit) 25 is connected to the output sides of the character frame extraction unit 24 and the image memory 22.
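The step from ruled line positions to rectangular character frames, as performed by the character frame extraction unit 24, can be sketched under a strong simplifying assumption: every ruled line spans the whole form, so each pair of adjacent horizontal lines and adjacent vertical lines bounds one frame. Real forms with partial ruled lines or merged cells would need more than this, and `character_frames` is a hypothetical name.

```python
from typing import List, Tuple


def character_frames(horizontal: List[int], vertical: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) for every cell of a full grid of ruled lines."""
    frames = []
    for top, bottom in zip(horizontal, horizontal[1:]):
        for left, right in zip(vertical, vertical[1:]):
            frames.append((top, left, bottom, right))
    return frames


# Horizontal lines at rows 0 and 4, vertical lines at columns 0, 3 and 6.
cells = character_frames([0, 4], [0, 3, 6])
print(cells)
```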

[0014] For each character frame extracted by the character frame extraction unit 24, the character recognition unit 25 cuts out the corresponding image data from the image memory 22, refers to a character dictionary 26 to recognize the character or character string written in the frame, and converts it into character codes. Item frame detecting means (for example, an item frame detection unit) 27 and data frame detecting means (for example, a data frame detection unit) 28 are connected to the output side of the character recognition unit 25. The item frame detection unit 27 searches an item name dictionary 29, in which the names of the items to be read are registered in advance, and judges a character frame to be an item description frame when the character or character string recognized by the character recognition unit 25 is registered there. The data frame detection unit 28 judges a character frame to be a data description frame when it is located to the right of or below, and aligned with, an item description frame detected by the item frame detection unit 27 and its character or character string is not registered in the item name dictionary 29. Data identifying means (for example, a data identification unit) 30 is connected to the output sides of the item frame detection unit 27 and the data frame detection unit 28. The data identification unit 30 associates the item description frames detected by the item frame detection unit 27 with the data description frames detected by the data frame detection unit 28. In addition, when a plurality of item names are written in an item description frame and a plurality of data are written in the data description frame, the data identification unit 30 associates and identifies the item names and the data according to the order in which they are written.

[0015] FIG. 4 is a flowchart showing the operation of the document reading apparatus of FIG. 1. The operation of the apparatus of FIG. 1 is described below with reference to FIGS. 3 and 4. When reading of the non-standard form 10 of FIG. 3 starts, in step S1 of FIG. 4 the form 10 is decomposed into black-and-white binary pixels by the image scanner or the like of the image reading unit 21 of FIG. 1 and read as image data. In step S2, the image data of the form 10 is temporarily stored in the image memory 22. In step S3, the ruled line extraction unit 23 reads out the image data in the image memory 22 by scanning in the vertical and horizontal directions and generates histograms of the frequency of black pixels in the vertical and horizontal directions. The ruled lines are then extracted from the generated histograms. In step S4, the character frame extraction unit 24 extracts the rectangular character frames 11 to 14 enclosed by the vertical and horizontal ruled lines.

[0016] In step S5, the character recognition unit 25 refers to the character dictionary 26, recognizes the character strings written in the character frames 11 to 14, such as "product name" and "product code", and converts them into character codes. In step S6, the item frame detection unit 27 refers to the item name dictionary 29 and checks whether the characters or character strings in the character frames 11 to 14 are registered in it. If the characters or character strings in a character frame are written on a plurality of lines, or are separated by a delimiter such as a slash "/", each one is looked up in the item name dictionary 29 separately. As a result, the character frame 11 of the form 10 in FIG. 3 is judged to be an item description frame 11 in which the two item names "product name" and "product code" are written. Similarly, the character frame 13 is judged to be an item description frame 13 in which the two item names "quantity" and "unit price" are written.
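The separate lookup in step S6 can be sketched by splitting the recognized text on line breaks and on the delimiter before consulting the dictionary. A frame is then treated as an item description frame when at least one part is registered. The set-based dictionary and the name `item_names_in_frame` are assumptions for illustration.

```python
ITEM_NAMES = {"product name", "product code", "quantity", "unit price"}


def item_names_in_frame(text: str, item_names=ITEM_NAMES, delimiter: str = "/"):
    """Split on line breaks and the delimiter, then keep the parts that are item names."""
    parts = [part.strip() for line in text.splitlines() for part in line.split(delimiter)]
    return [part for part in parts if part in item_names]


print(item_names_in_frame("product name\nproduct code"))  # frame 11
print(item_names_in_frame("quantity/unit price"))          # frame 13
print(item_names_in_frame("100/4500"))                     # frame 14: no item names
```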

[0017] In step S7, the data frame detection unit 28 detects the character frames (for example, character frames 12 and 14) that are located to the right of or below, and aligned with, the item description frames 11 and 13 and that contain characters or character strings other than item names. If the characters or character strings in a detected character frame are written on a plurality of lines, or are separated by a delimiter such as a slash "/", they are identified as a plurality of characters or character strings. As a result, the character frame 12 of the form 10 in FIG. 3 is judged to be a data description frame 12 in which the two data "voice LSI" and "M9841" are written. Similarly, the character frame 14 is judged to be a data description frame 14 in which the two data "100" and "4500" are written. In step S8, the data identification unit 30 associates the item description frames 11 and 13 with the data description frames 12 and 14 as follows. The item description frame 13 lies to the right of the item description frame 11 and the data description frame 12 lies below it, so the item description frame 11 is associated with the data description frame 12. No character frame exists to the right of the item description frame 13 and the data description frame 14 lies below it, so the item description frame 13 is associated with the data description frame 14.

[0018] In step S9, the apparatus checks whether the item description frame contains one item name or more than one. If it contains one, processing proceeds to step S10; if it contains more than one, processing proceeds to step S11. In step S10, the item name in the item description frame is associated with the data in the corresponding data description frame. In step S11, the plurality of item names in the item description frame and the plurality of data in the corresponding data description frame are associated one by one in order and identified. For example, the two item names "product name" and "product code" are written in this order on two lines in the item description frame 11, and the two data "voice LSI" and "M9841" are written in this order on two lines in the data description frame 12. The data for the item name "product name" is therefore identified as "voice LSI", and the data for the item name "product code" is identified as "M9841".
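The branch in steps S9 to S11 can be sketched over frame pairs whose contents have already been split into entries. The input shape (a list of `(names, values)` tuples) and the choice to take the first data entry in the single-name case are assumptions of this sketch; the source only states that the single item name is associated with the data in the corresponding frame.

```python
def identify(frame_pairs):
    """Map item names to data for each (item entries, data entries) pair of frames."""
    result = {}
    for names, values in frame_pairs:
        if len(names) == 1:
            # Step S10: a single item name takes the data of its frame
            # (assumed here to be the first entry).
            result[names[0]] = values[0] if values else None
        else:
            # Step S11: several item names are paired with the data in written order.
            for name, value in zip(names, values):
                result[name] = value
    return result


# Frames 11/12 and 13/14 of the form in FIG. 3, already split into entries.
record = identify([
    (["product name", "product code"], ["voice LSI", "M9841"]),
    (["quantity", "unit price"], ["100", "4500"]),
])
print(record)
```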

[0019] Similarly, the two item names "quantity" and "unit price" are written in this order in the item description frame 13, separated by a slash "/" as a delimiter, and the two data "100" and "4500" are written in this order in the data description frame 14, separated by a slash "/". The data for the item name "quantity" is therefore identified as "100", and the data for the item name "unit price" is identified as "4500". When steps S7 to S11 have been completed for all the item description frames detected by the item frame detection unit 27, processing of the form 10 ends. As described above, the document reading apparatus of this embodiment has a data identification unit 30 that, when the characters or character strings in character frames such as the item description frames 11 and 13 and the data description frames 12 and 14 are written on a plurality of lines or separated by a delimiter, identifies them as a plurality of item names or data and associates them according to the order in which they are written. The apparatus can therefore reliably read not only non-standard forms in which one character frame corresponds to one item but also non-standard forms in which a plurality of items are written in one character frame, which has the advantage of increasing the freedom of form design.

【００２０】[0020]

The present invention is not limited to the above embodiment, and various modifications are possible. Examples of such modifications include the following (a) to (f).
(a) The delimiter in a character frame is not limited to the slash “/”; delimiters such as parentheses “()”, the middle dot “・”, the comma “,”, the colon “:”, or the semicolon “;” can be used in the same way.
(b) The number of item names and data written in a character frame is not limited to two; the invention applies equally to a single item or to three or more items.
(c) If the characters or character strings in an item description frame include anything other than item names, the item frame detection unit 27 may ignore it.
(d) If the number of data in the corresponding data description frame is larger than the number of item names in the item description frame, the data identification unit 30 may associate the item names with the data in order and then ignore the remaining data.
(e) When a plurality of character description frames of the same form are arranged consecutively below an item description frame, as in a tabular form, the data identification unit 30 may associate the data in those character description frames with the item names in the common item description frame.
(f) In the document reading apparatus of FIG. 1, the ruled line extracting unit 23, the character frame extracting unit 24, the character recognizing unit 25, the item frame detection unit 27, the data frame detection unit 28, and the data identification unit 30 are each configured as independent processing means, but the functions of these processing means may instead be performed under program control using a computer.
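Modifications (a), (c), and (d) can be illustrated together with a short Python sketch. It is one possible reading, not the patented implementation: the names `DELIMITERS`, `split_frame`, and `identify`, the inclusion of full-width delimiter variants, and the example item name set are all assumptions made for illustration.

```python
import re

# Hypothetical delimiter set following modification (a); full-width variants
# are included as an assumption because Japanese forms often use them.
DELIMITERS = "/／()（）・,，:：;；"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]")


def split_frame(text):
    """Split recognized frame text on any delimiter and drop empty pieces."""
    return [part.strip() for part in _SPLIT_RE.split(text) if part.strip()]


def identify(item_text, data_text, item_name_dictionary):
    """Pair item names with data in written order.

    (c) Strings in the item frame that are not in the dictionary are ignored.
    (d) Data left over after every item name has been paired is ignored,
        because zip stops at the shorter sequence.
    """
    names = [n for n in split_frame(item_text) if n in item_name_dictionary]
    values = split_frame(data_text)
    return dict(zip(names, values))


ITEM_NAMES = {"quantity", "unit price", "amount"}  # illustrative dictionary

print(identify("quantity(unit price)", "100(4500)", ITEM_NAMES))
print(identify("quantity・unit price", "100・4500・7", ITEM_NAMES))
print(identify("quantity/memo/unit price", "100/4500", ITEM_NAMES))
```

Note that treating the comma as a delimiter would also split a value such as “1,000”; this section does not say how such cases are handled, so the sketch makes no attempt to resolve them.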

【００２１】[0021]

[Effects of the Invention] As described in detail above, the first aspect of the invention comprises: item frame detection means for detecting, with reference to an item name dictionary, an item description frame in which the character or character string in the character frame is an item name; data frame detection means for detecting a data description frame that lies in line with the detected item description frame, to its right or below it; and data identification means for associating the item description frame with the data description frame and, when a plurality of item names are written in the item description frame and a plurality of data are written in the data description frame, associating the item names with the data in the order in which they are written. This has the effect that a non-standard form in which a plurality of item names or data are written in one character frame can be reliably read. According to the second aspect, the data identification means identifies characters or character strings divided across a plurality of lines as a plurality of item names and data. This has the effect that, even when a plurality of lines of characters or character strings are written in one character frame, they can be reliably associated with each other.
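The combination of item frame detection, data frame detection to the right or below, and line-by-line association can be sketched as follows. Everything about the data model is an assumption for illustration: the `Frame` class, the integer grid coordinates, the use of "\n" to separate recognized lines, and the rule of trying the right neighbor before the lower one are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    row: int   # hypothetical grid position of the character frame
    col: int
    text: str  # recognized text; lines inside the frame separated by "\n"


def lines_of(frame):
    """Return the non-empty lines written in a frame, in written order."""
    return [line.strip() for line in frame.text.splitlines() if line.strip()]


def is_item_frame(frame, item_names):
    """A frame counts as an item frame if every line is a registered item name."""
    parts = lines_of(frame)
    return bool(parts) and all(part in item_names for part in parts)


def find_data_frame(item_frame, frames, item_names):
    """Look right first, then below, for a frame that is not an item frame."""
    by_pos = {(f.row, f.col): f for f in frames}
    right = (item_frame.row, item_frame.col + 1)
    below = (item_frame.row + 1, item_frame.col)
    for pos in (right, below):
        candidate = by_pos.get(pos)
        if candidate is not None and not is_item_frame(candidate, item_names):
            return candidate
    return None


def read_form(frames, item_names):
    """Associate each item name with its data, line by line, in written order."""
    result = {}
    for frame in frames:
        if is_item_frame(frame, item_names):
            data = find_data_frame(frame, frames, item_names)
            if data is not None:
                result.update(zip(lines_of(frame), lines_of(data)))
    return result


frames = [
    Frame(0, 0, "quantity\nunit price"), Frame(0, 1, "100\n4500"),
    Frame(1, 0, "destination"), Frame(1, 1, "Tokyo"),
]
print(read_form(frames, {"quantity", "unit price", "destination"}))
```

In this sketch a real system would derive the frame positions from the extracted ruled lines; the grid indices simply stand in for that geometry.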

【００２２】[0022]

According to the third aspect, the data identification means identifies characters or character strings separated by a delimiter as a plurality of item names and data. This has the effect that, even when a plurality of characters or character strings are written in one character frame, they can be reliably associated with each other. According to the fourth aspect, when a plurality of data description frames are arranged consecutively below an item description frame, the data identification means associates the data in each of these data description frames with the item names in the common item description frame. This has the effect that item names and data can be reliably associated with each other even in a tabular form.
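The tabular case of the fourth aspect can be sketched by reusing one item description frame for every data description frame stacked below it. The function name `read_table_column`, the string inputs, and the sample values are hypothetical and serve only to illustrate the idea.

```python
def read_table_column(item_text, data_texts, delimiter="/"):
    """Associate each stacked data frame with the shared item frame above it.

    item_text  -- recognized text of the common item description frame
    data_texts -- recognized texts of the data frames below it, top to bottom
    Returns one record (item name -> data) per data frame.
    """
    names = [name.strip() for name in item_text.split(delimiter)]
    records = []
    for text in data_texts:
        values = [value.strip() for value in text.split(delimiter)]
        # The same item names are paired, in written order, with every row.
        records.append(dict(zip(names, values)))
    return records


rows = read_table_column("quantity/unit price", ["100/4500", "20/300", "5/12000"])
for row in rows:
    print(row)
```

Returning a list of records keeps the rows in their top-to-bottom order, which mirrors how the data description frames appear on the form.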

[Brief description of the drawings]

FIG. 1 is a configuration diagram of a document reading apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing an example of a conventional non-standard form.

FIG. 3 is a diagram showing another example of a non-standard form, for explaining the problem with the conventional document reading apparatus.

FIG. 4 is a flowchart showing the operation of the document reading apparatus of FIG. 1.

[Explanation of symbols]

10 Non-standard form
11, 13 Item description frame
12, 14 Data description frame
21 Image reading unit
22 Image memory
23 Ruled line extracting unit
24 Character frame extracting unit
25 Character recognizing unit
26 Character dictionary
27 Item frame detection unit
28 Data frame detection unit
29 Item name dictionary
30 Data identification unit

Claims

[Claims]

1. A document reading apparatus comprising: an image reading unit that reads, pixel by pixel, image data of a document having a plurality of character frames delimited by ruled lines; ruled line extracting means for extracting vertical and horizontal ruled lines from the image data read by the image reading unit; character frame extracting means for extracting character frames enclosed by the vertical and horizontal ruled lines extracted by the ruled line extracting means; character recognition means for recognizing a character or character string written in each character frame extracted by the character frame extracting means and converting it into a character code; an item name dictionary in which the names of items to be read are registered as item names; item frame detection means for detecting an item description frame, being a character frame whose character or character string recognized by the character recognition means is registered in the item name dictionary; data frame detection means for detecting a data description frame, being a character frame that lies in line with the item description frame detected by the item frame detection means, to its right or below it, and in which data consisting of a character or character string not registered in the item name dictionary is written; and data identification means for associating the item description frame detected by the item frame detection means with the data description frame detected by the data frame detection means and, when a plurality of item names are written in the item description frame and a plurality of data are written in the data description frame, identifying the item names and the data by associating them in the order in which they are written.

2. The document reading apparatus according to claim 1, wherein, when characters or character strings divided across a plurality of lines are written in the item description frame and the data description frame, the data identification means identifies the character or character string on each line as a plurality of independent item names and data, and associates the item names with the data in the order in which they are written.

3. The document reading apparatus according to claim 1, wherein, when a plurality of characters or character strings separated by a delimiter are written in each of the item description frame and the data description frame, the data identification means identifies the separated characters or character strings as a plurality of independent item names and data, and associates the item names with the data in the order in which they are written.

4. The document reading apparatus according to claim 1, wherein, when a plurality of the data description frames are arranged consecutively below the item description frame, the data identification means associates the item names in the item description frame with the data in each of these data description frames.