JP2006146741A - Method of reading printing data

Info

Publication number: JP2006146741A
Application number: JP2004338345A
Authority: JP
Inventors: Minenobu Seki; 峰伸関; Katsumi Marukawa; 勝美丸川
Original assignee: Hitachi Computer Peripherals Co Ltd
Current assignee: Hitachi Information and Telecommunication Engineering Ltd
Priority date: 2004-11-24
Filing date: 2004-11-24
Publication date: 2006-06-08
Anticipated expiration: 2024-11-24
Also published as: JP4585837B2

Abstract

PROBLEM TO BE SOLVED: To include in the recognition result of each reading field only the reading result of the target print data, excluding print data mixed in from an adjacent frame, even when the pre-printed frames and the print data are displaced and the direction and amount of displacement differ within the same sheet; and to separate the print data from the frame lines with high precision while keeping processing time down.

SOLUTION: Frames with a print displacement are detected by binary image processing, and color image processing is applied only around the detected frames, so that the print data and the frame lines are separated with high precision while processing time is kept down. Which frame each piece of print displacement data protrudes from is then determined using: detection of print displacement data mixed into areas where no data is printed; the overlap between the frame area and the circumscribed rectangle of the print displacement data; the relation between the position of characters overlapping the frame line that divides two frames and the center positions of those frames; the sizes of the circumscribed rectangle of the print displacement data and of the frame; and the global print displacement direction.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a method for reading print data in an image obtained by scanning a document that contains frame lines, such as a form. In particular, it relates to a method that correctly assigns print data to the frame of each item, so that the reading result for each item's frame consists only of the reading result of the corresponding print data, even when there is a large displacement between the pre-printed frames and the print data or when the direction and amount of displacement differ within the same sheet.

A form carries printed frame lines, item names such as name and amount, and data, and OCR reads the data printed inside predetermined frames. Conventionally, methods for automatically analyzing complex and diverse frame structures were developed, and the data inside the frames extracted by these methods was read. However, data is often printed afterwards onto a form on which the frame lines and item names have been printed in advance (pre-printed); in the present invention, printing of data includes handwritten entry. Because of print position displacement caused by how the form is placed on the printer, subtle differences in form paper, print position settings in the printing software, or displaced handwriting, the print data may protrude from the pre-printed frame. The direction and amount of displacement may even differ from place to place on the same sheet.

Conventionally, an area slightly larger than the reading frame is set as the reading field, and the print data within it is separated from the frame lines and read. Many methods for separating frame lines from characters have been filed. They fall into three groups: methods that use the color difference between frame lines and characters in a color image, as in JP 2003-216894 A (Patent Document 1); methods that use the density difference between frame lines and characters in a grayscale image, as in JP 9-305707 A (Patent Document 2); and methods that, after removing the frame lines from a binary image, restore the removed character components from the positions and shapes of the remaining black pixels, as in JP 9-185676 A (Patent Document 3).

Patent Document 1: JP 2003-216894 A
Patent Document 2: JP 9-305707 A
Patent Document 3: JP 9-185676 A

However, the conventional methods above all concentrate solely on separating characters from frames, and they have the following three problems. First, when a large print misalignment occurs, or when the direction and amount of misalignment differ within a single form, reading the data of each frame cannot distinguish whether print data overlapping a frame line was mixed in from an adjacent frame or is data to be read that protrudes from the frame of interest, so data from adjacent frames ends up in the recognition result. Second, when a large print misalignment occurs, the print data does not fit within the enlarged frame, so reading is performed on truncated character patterns and produces misreadings. Third, there is a trade-off between the accuracy and the processing time of separating frame lines from characters: separation accuracy is low with binary images, while color images require a long processing time.

The present invention has been made in view of these problems. Its object is to provide a highly accurate print data reading method that, even when there is a large displacement between the pre-printed frames and the print data or when the direction and amount of displacement differ within the same sheet, excludes print data mixed in from adjacent frames, outputs as the recognition result only the print data to be read that protrudes from the frame of interest, and separates the print data from the frame lines with high precision while keeping processing time down.

To solve the above problems, the present invention provides a print data reading method for reading print data in an image obtained by scanning a document that contains frame lines, such as a form, the method comprising:
ruled line extraction means for extracting ruled lines from the image;
frame extraction means for extracting frames from the extracted ruled lines;
reading field extraction means for extracting, from the extracted frames, the frames from which print data is to be read;
print misalignment field detection means for detecting every frame from which print data may protrude;
print misalignment data determination means for separating the frame lines from the print data and treating the protruding print data as print misalignment data;
means for assigning print data to reading fields, which determines from which frame each piece of print misalignment data protrudes; and
character string reading means for reading the print data.

The means for assigning print data to reading fields is further characterized by comprising:
propagation-type data assignment means from non-data entry areas, which detects print misalignment data mixed into an area where no data is printed, determines the displacement direction of the detected print data, and uses that direction to assign other print misalignment data to reading fields;
discrimination means based on the degree of overlap between the frame of interest and the rectangle, which calculates the degree of overlap between the frame area and the circumscribed rectangle of the print misalignment data and uses it to determine whether the print misalignment data protrudes from the frame of interest or is mixed in from an adjacent frame;
discrimination means for characters in horizontally adjacent frames, which, for a character straddling two horizontally adjoining frames, assigns the character overlapping the frame line to the left or the right frame using the relation among the position of the frame line dividing the two frames, the position of the overlapping character, the characters in the left frame or the center position of the left frame, and the characters in the right frame or the center position of the right frame;
discrimination means based on rectangle size, which judges print misalignment data whose circumscribed rectangle is taller than the frame, and print misalignment data whose circumscribed rectangle is wider than the frame, to be print misalignment data mixed in from another frame; and
discrimination means based on the global print misalignment direction, which assigns print misalignment data to reading fields using the displacement direction of the print misalignment data already settled by the propagation-type data assignment means, the overlap-based discrimination means, the discrimination means for characters in horizontally adjacent frames, and the rectangle-size discrimination means.
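The rectangle-size rule in the list above is simple enough to sketch directly. The following Python fragment is only an illustration under stated assumptions: the function name and the width/height arguments are hypothetical, and the patent does not prescribe any particular data representation.

```python
def is_mixed_in_by_size(rect_w, rect_h, frame_w, frame_h):
    """Rectangle-size rule: data whose circumscribed rectangle is taller or
    wider than the frame is judged to come from another frame.

    All arguments are pixel sizes; the names are illustrative assumptions.
    """
    return rect_h > frame_h or rect_w > frame_w


# A 30-pixel-tall character string cannot belong to a 20-pixel-tall frame.
print(is_mixed_in_by_size(rect_w=80, rect_h=30, frame_w=200, frame_h=20))
```

The rule works as a cheap early filter: data that cannot physically fit into the frame of interest is excluded before the position-based rules are consulted.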

The print misalignment field detection means is further characterized in that it uses binary image processing to detect every frame from which print data may protrude, and in that the print misalignment data determination processing applies color image processing only around the detected frames.

According to the present invention, when print data is read from an image obtained by scanning a document that contains frame lines, such as a form, it is determined which print data is assigned to which reading field, even when there is a large displacement between the pre-printed frames and the print data or when the direction and amount of displacement differ within the same sheet. In addition, the print data is separated from the frame lines with high precision while processing time is kept down. As a result, a highly accurate character reading result containing only the print data that corresponds to each item's frame is obtained.

FIG. 1 shows the overall processing flow in an embodiment of the present invention. A form image (0101) digitized by a scanner, an OCR device, or the like is input; ruled line extraction (0102), frame extraction (0103), reading field extraction (0104), print misalignment field candidate detection (0105), print misalignment data determination (0106), assignment of print data to reading fields (0107), and character string reading (0108) are executed; and the result (0109) of recognizing the data within the frames of the predetermined reading items is output.

This series of processes is executed by a print data reading device (0201), shown in FIG. 2, which consists of a data input device (0202) for images and other data, an operation terminal device (0203), a display terminal device (0204), an external storage device (0205), a memory (0206), a central processing unit (0207), and a communication device (0208). The device may be connected to a network (0209). The form image (0101) that serves as input data is stored in the external storage device (0205) or the memory (0206) via the data input device (0202), such as a USB interface or a CD/DVD drive, or via the communication device (0208). The program data for the ruled line extraction (0102), frame extraction (0103), reading field extraction (0104), print misalignment field candidate detection (0105), print misalignment data determination (0106), assignment of print data to reading fields (0107), and character string reading (0108) shown in FIG. 1, together with the dictionary data (0110) containing the form definition knowledge used in reading field extraction, is stored in the external storage device (0205) or the memory (0206). It is processed by the central processing unit (0207), triggered by instruction data from the operation terminal device (0203), such as a mouse or keyboard, or from the communication device (0208). The description below follows the processing flow of FIG. 1.

FIG. 3 shows an example of an input form image (0301).
FIG. 4 illustrates the result (0401) of executing the ruled line extraction (0102) and the frame extraction (0103) on the form image (0301). The ruled line extraction (0102) extracts ruled lines by extracting runs of consecutive black pixels in the horizontal and vertical directions. The frame extraction (0103) finds the intersections of the ruled lines and extracts the position of each individual frame from their positional relationships. Various methods are available for this, for example Non-Patent Document 1: Hiroshi Shinjo, Eiichi Hadano, Katsumi Marukawa, Yoshihiro Shima, Hiroshi Sako: A Recursive Analysis for Form Cell Recognition. ICDAR 2001: 694-698.
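As a rough illustration of the run-based idea behind the ruled line extraction, the sketch below scans a binary image for long runs of black pixels. It is not the method of Non-Patent Document 1: the function names, the list-of-lists image format (1 = black), and the `min_len` parameter are all assumptions made for this example.

```python
def horizontal_runs(image, min_len):
    """Return (row, x_start, x_end) for every horizontal run of black pixels
    (value 1) that is at least min_len long; x_end is inclusive."""
    runs = []
    for y, row in enumerate(image):
        start = None
        # A trailing 0 acts as a sentinel so a run touching the right edge is closed.
        for x, px in enumerate(list(row) + [0]):
            if px and start is None:
                start = x
            elif not px and start is not None:
                if x - start >= min_len:
                    runs.append((y, start, x - 1))
                start = None
    return runs


def vertical_runs(image, min_len):
    """Return (column, y_start, y_end) for long vertical runs, found by
    transposing the image and reusing the horizontal scan."""
    transposed = [list(col) for col in zip(*image)]
    return horizontal_runs(transposed, min_len)


# A tiny 4x5 "frame" with a stray pixel inside it.
image = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
print(horizontal_runs(image, 4))  # candidate horizontal ruled lines
print(vertical_runs(image, 4))    # candidate vertical ruled lines
```

Intersections of the horizontal and vertical candidates would then be the starting point for frame extraction; real scans additionally need tolerance for skew, gaps, and line thickness.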

FIG. 5 illustrates the result (0501) of extracting the regions to be read (reading fields) from the frame extraction result (0401). The reading fields are held in advance in the dictionary data (0110) as region information for each frame, and they are extracted by matching this dictionary data (0110) against the frame extraction result (0401). One example of such a matching method is Non-Patent Document 2: 新庄広, 高橋寿一, 古川直広: Form frame structure matching method using DP matching, Technical Report of IEICE, PRMU2002-228 (2003-03).

FIG. 6 shows the processing flow of the print misalignment field candidate detection (0105). A binary image (0601) is input, and for each reading field extracted in the preceding step it is determined whether the field may contain a print misalignment or contains none (0605). The processing is performed per reading field; at this point each reading field covers the same region as its frame. First, the frame lines are removed (0602). Next, connected components of the black pixels remaining in the frame are generated (0603); a connected component is a set of black pixels that are contiguous vertically, horizontally, or diagonally. Then the position of each connected component is compared with the position of the frame lines to determine whether the component touches or lies close to a frame line, and any reading field that contains such a component is judged to be a print misalignment field candidate (0604). To avoid missing reading fields whose print data protrudes from the frame, it is advisable to also flag a field as a candidate when pre-printed item-name characters or noise touch or lie close to a frame line.
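The candidate detection can be sketched as two steps: label 8-connected components, then test whether any component lies within a margin of the frame border. The code below is a minimal sketch under assumptions: it takes a binary image of one frame with the frame lines already removed, and `margin`, the `(x0, y0, x1, y1)` box format, and the function names are illustrative rather than taken from the patent.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (x0, y0, x1, y1), inclusive, of the 8-connected
    components of black pixels (value 1) in a list-of-lists binary image."""
    h = len(image)
    w = len(image[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not image[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                # Visit all eight neighbours (including diagonals).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def is_misalignment_candidate(boxes, frame, margin):
    """True if any component box touches or lies within `margin` pixels of
    the frame border; `frame` is (x0, y0, x1, y1) in the same coordinates."""
    fx0, fy0, fx1, fy1 = frame
    return any(
        bx0 - fx0 <= margin or by0 - fy0 <= margin
        or fx1 - bx1 <= margin or fy1 - by1 <= margin
        for bx0, by0, bx1, by1 in boxes
    )
```

Because this step runs on the binary image only, it is cheap; the more expensive color processing is then reserved for the fields it flags.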

FIG. 7 shows the processing flow of the print misalignment data determination (0106). A partial color image (0701) covering an enlarged region around each detected print misalignment field candidate is input, and the print misalignment data is determined. Frame lines and characters are separated by color dropout processing based on the method of Shima et al. (Patent Document 4: JP 2003-196592 A). Then, in the print data detection processing (0707), character strings are extracted from the character patterns that overlap a frame line and the character patterns inside the frame; character patterns that lie outside the frame without overlapping a frame line are not used for character string extraction. The positions of the extracted character strings and the frame lines are compared, and character strings that overlap a frame line are determined to be print misalignment data. The method of Shima et al. limits the dropout colors in advance to three (red, blue, and green), identifies the colors of the pre-printing and of the print data from a partial image covering the enlarged reading field, and applies dropout processing only to pixels that have the identified color component. In this embodiment, the following three processes are added to the method of Shima et al., so that frame lines and print data are separated faster and more accurately.

The first is the determination of the pre-print color and the print data color in a representative field (0703). The colors of the pre-printing and of the print data are identified in a predetermined representative reading field, and the result is reused for the color dropout processing of all fields. Because the pre-print color and print data color no longer have to be determined for every color dropout operation, processing time is reduced. In addition, the method of Shima et al. determines the print data color only in the central part of the frame, so it cannot identify the print data color when the print data is far from the center or when there is no print data. Therefore, as shown in FIG. 11, a data character string existence determination process (1103) has been added: it determines whether a rectangle presumed to be a character string exists in the central part of the frame and, if none exists, selects a different reading field instead.

The second is the removal of pre-printed components that remain after dropout (0705). When the frame line color is dark, for example, the frame line may not drop out completely and frame line components may remain. To remove them, as shown in FIG. 8, a connected component (0801) is judged to be a frame line remnant and removed if its distance from the center line of the frame line is md or less, its width mw is at least a fixed value Tw, and its height mh is at most a fixed value Th. The values md, Tw, and Th are adjustable parameters.
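The three conditions for a frame line remnant translate directly into a predicate. The sketch below assumes a horizontal frame line and measures the distance from the vertical center of the component's bounding box to the line's center; the patent only states that a distance is compared with md, so that measurement, the box format, and the function name are assumptions.

```python
def is_frame_line_remnant(box, line_center_y, md, tw, th):
    """Judge whether a connected component is a leftover piece of a
    horizontal frame line.

    box           -- (x0, y0, x1, y1) bounding box, inclusive pixel coordinates
    line_center_y -- y coordinate of the frame line's center line
    md, tw, th    -- adjustable thresholds (max distance, min width, max height)
    """
    x0, y0, x1, y1 = box
    mw = x1 - x0 + 1                      # component width
    mh = y1 - y0 + 1                      # component height
    # Assumption: distance is taken from the box's vertical center.
    distance = abs((y0 + y1) / 2 - line_center_y)
    return distance <= md and mw >= tw and mh <= th


# A long, thin component lying on the line is a remnant; a character is not.
print(is_frame_line_remnant((10, 99, 60, 101), 100, md=2, tw=20, th=4))
print(is_frame_line_remnant((30, 90, 40, 104), 100, md=2, tw=20, th=4))
```

For vertical frame lines the roles of width and height would be swapped.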

The third is the restoration of character data components removed by dropout (0706). The color dropout processing (0704) may remove part of the character data together with the pre-printed components. Therefore, as shown in FIGS. 9 and 10, the circumscribed rectangle of the print data is restored from the relation between the frame line position and the positions of the connected components near the frame line.
Although this processing is based on the method of Shima et al., various other color dropout techniques may be used.

Next, print data is assigned to the reading fields (0107). For each reading field that contains print misalignment data, this processing determines whether that data is print data to be read that protrudes from the frame of interest or print data mixed in from an adjacent frame, and assigns only the print data to be read to the reading field. The reading field is then corrected so that it includes the print data to be read and excludes the print data of adjacent frames that has been mixed in. For example, for the image with print misalignment data shown in FIG. 12, the reading fields are corrected as shown in FIG. 13. One field is corrected to include the "0" protruding from the data print frame for amount A, which lies below the item name "amount A" (1301). Another is corrected to include the "10" protruding from the data print frame for amount B, which lies below "amount B" (1302). A third is corrected to include the "130000" protruding from the data print frame for amount C, which lies to the right of "amount C", while excluding the "0" and the "10" (1303). Hereinafter, print data protruding from the frame of interest means print data that properly belongs to the frame of interest but is printed so that it extends into another frame. Print data mixed into the frame of interest means print data that properly belongs to another frame, such as an adjacent one, but is printed in a region that includes the inside of the frame of interest.

By using binary image processing to narrow down the fields that may contain a print misalignment, and applying color image processing only to the reading fields that remain, the print data and the pre-printing can be separated while processing time is kept down.

FIG. 14 shows the processing flow of the assignment of print data to reading fields (0107). For each reading field that contains determined print misalignment data, six discrimination processes are executed to decide whether the print misalignment data protrudes from the frame of interest or is mixed in from an adjacent frame: discrimination based on rectangle size (1401), discrimination of characters in horizontally adjacent frames (1402), propagation-type data assignment from non-data entry areas (1403), discrimination based on the degree of overlap between the frame of interest and the rectangle with T = 0.9 (1404), the same overlap-based discrimination with T = 0.5 (1406), and discrimination based on the global print misalignment direction (1405).

First, the propagation-type data assignment from non-data entry areas (1403) is described. This processing exploits the fact that no data is printed in frames that hold pre-printed item names, such as date of birth or furigana, or in the margins of the form (non-data entry areas). Data that straddles a non-data entry area and the adjacent "reading field X" is judged to be print data protruding from "reading field X". Using this protrusion direction, print data that straddles "reading field X" and "reading field Y", which adjoins "reading field X" on the side opposite the non-data entry area, is then judged to be print data mixed into "reading field X" from "reading field Y". Repeating this operation resolves the protrusion or mixing of data straddling a chain of adjacent reading fields. The information on non-data entry areas is held in the dictionary data (0110) as part of the form definition knowledge.

In the example of FIG. 15, the print misalignment data "0" (1504) straddling the lower edge of the data print frame for amount 5 (1508) extends into the frame of the item name "day", which is a non-data entry area, so it is determined to protrude from frame 1508. This shows that the print data in frame 1508 is shifted downward, so the data "120" (1503) straddling the upper edge of frame 1508 is determined to be mixed into frame 1508 and to protrude from the data print frame for amount 4 (1507). The print data of frame 1507 is therefore also shifted downward, so the data "130000" at the upper edge of frame 1507 is determined to be mixed into frame 1507 and to protrude from the data print frame for amount 3 (1506). Likewise, the print misalignment data "0" (1501) at the upper edge of frame 1506 is determined to protrude from frame 1505. In this way, the processing uses the information about data protruding into a non-data entry area to determine the print misalignment direction of the other reading fields.
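The propagation can be sketched for the simplified case of a single column of regions. In the fragment below, regions are indexed from top to bottom, `is_data[i]` says whether region i is a reading field, and `straddles` holds each boundary index b at which some data straddles regions b and b + 1. This one-dimensional model, the names, and the returned mapping are assumptions made for illustration; the patent describes the idea, not this data structure.

```python
def propagate_assignments(is_data, straddles):
    """Return {boundary: owner_region} for every straddling item that can be
    resolved by propagating from a non-data entry area."""
    owner = {}
    pending = []  # (field index, step toward the next field to examine)
    # Seed: data straddling a reading field and a non-data area must protrude
    # from the reading field, which also reveals the shift direction.
    for b in sorted(straddles):
        if is_data[b] and not is_data[b + 1]:
            owner[b] = b              # field b is shifted toward higher indices
            pending.append((b, -1))
        elif is_data[b + 1] and not is_data[b]:
            owner[b] = b + 1          # field b + 1 is shifted toward lower indices
            pending.append((b + 1, +1))
    # Propagate: data on the opposite side of a shifted field is mixed in from
    # the neighbouring field, which is assumed to be shifted the same way.
    while pending:
        field, step = pending.pop()
        boundary = field - 1 if step == -1 else field
        neighbour = field + step
        if (boundary in straddles and boundary not in owner
                and 0 <= neighbour < len(is_data) and is_data[neighbour]):
            owner[boundary] = neighbour
            pending.append((neighbour, step))
    return owner


# Mirrors the FIG. 15 walk-through: four stacked amount frames (1505 to 1508)
# followed by the non-data "day" frame, with data straddling every boundary.
print(propagate_assignments([True, True, True, True, False], {0, 1, 2, 3}))
```

Each straddling item ends up owned by the field above it, matching the downward shift deduced in the text. Propagation stops at the first boundary without straddling data, so fields beyond a gap stay unresolved and are left to the other rules.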

Next, the discrimination based on the degree of overlap between the frame of interest and the rectangle (1404, 1406) is described. Print data that lies mostly within the frame of interest is likely to be print data protruding from that frame, even if it also overlaps another frame. The proportion of the print data's circumscribed rectangle that overlaps the frame of interest is therefore used. As shown in FIG. 16, let Din be the height of the print data inside the reading frame and Dout the height of the print data outside it. If the proportion of the print data inside the frame, F = Din / (Din + Dout), is at least the threshold T, the print misalignment data is judged to protrude from the frame of interest. The threshold T is adjustable; this embodiment uses two values, 0.9 and 0.5. With T = 0.9, only print data that lies almost entirely within the frame is judged to protrude, so the protrusion/mixing decision is highly accurate, and its result is given relatively high priority over the results of the other discrimination rules. With T = 0.5, the decision is less accurate, but it can classify data for which protrusion or intrusion is ambiguous. It is therefore applied to print data that remains undecided after all other discrimination rules have been applied.
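The ratio F and its comparison with T can be written out as follows. This is a minimal sketch for vertical overlap only; the coordinate convention (top < bottom, in pixels) and the function names are assumptions.

```python
def overlap_ratio(rect_top, rect_bottom, frame_top, frame_bottom):
    """F = Din / (Din + Dout): the fraction of the print data's height that
    lies inside the frame of interest."""
    height = rect_bottom - rect_top
    if height <= 0:
        raise ValueError("rectangle must have positive height")
    d_in = max(0, min(rect_bottom, frame_bottom) - max(rect_top, frame_top))
    d_out = height - d_in
    return d_in / (d_in + d_out)


def protrudes_from_frame(rect_top, rect_bottom, frame_top, frame_bottom, t):
    """True if the data is judged to protrude from the frame of interest."""
    return overlap_ratio(rect_top, rect_bottom, frame_top, frame_bottom) >= t


# Data spanning y = 98..108 against a frame spanning y = 50..104:
# Din = 6, Dout = 4, so F = 0.6.
print(protrudes_from_frame(98, 108, 50, 104, t=0.9))  # strict rule: undecided
print(protrudes_from_frame(98, 108, 50, 104, t=0.5))  # fallback rule: protrusion
```

The example shows why two thresholds are useful: the strict rule leaves this case to the other discrimination processes, and the lenient rule settles it only if they also fail.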

Next, the discrimination process for characters in horizontally adjacent frames (1402) is described. When print misalignment occurs in the horizontal direction, protrusion or mix-in must be determined character by character. In the sample shown in FIG. 17, color dropout processing on field A (1702) yields “1234”, the print that overlaps the frame line of field A or lies inside the frame, as its assigned data, and color dropout processing on field B yields “4”, the print that overlaps the frame line of field B or lies inside the frame, as its assigned data. As a result, “4” is included in both field A and field B. This process uses the positional relationship among the bordering frame line, the characters in field A, and the characters in field B to determine that “4” protrudes from field A and is mixed into field B. It can use the position of the circumscribed rectangle of the overlapping character (Rlap), the position of the rightmost circumscribed rectangle in field A excluding the overlapping character (Rarit), the position of the leftmost circumscribed rectangle in field B excluding the overlapping character (Rblft), the position of the frame line, the center position of a frame area, or the size of a frame area. There are five discrimination patterns, shown in FIGS. 18, 19, 20, 21, and 22.

FIG. 18 shows the discrimination pattern for the case where both Rarit and Rblft exist (discrimination pattern 1). Let Dma be the distance between Rarit and Rlap and Dmb the distance between Rblft and Rlap; if Dma ≤ Dmb, Rlap is determined to protrude from field A, and if Dma > Dmb, from field B.

FIG. 19 shows the discrimination pattern for the case where only one of Rarit and Rblft exists (discrimination pattern 2). When only Rarit exists, let Dcb be the distance between the center of field B and Rlap; if Dma ≤ Dcb, Rlap is determined to protrude from field A, and if Dma > Dcb, from field B. When only Rblft exists, let Dca be the distance between the center of field A and Rlap; if Dmb ≤ Dca, Rlap is determined to protrude from field B, and if Dmb > Dca, from field A.

FIG. 20 shows the discrimination pattern for the case where neither Rarit nor Rblft exists (discrimination pattern 3). Let Dla be the amount by which Rlap protrudes into field A and Dlb the amount by which it protrudes into field B; if Dla ≤ Dlb, Rlap is determined to protrude from field A, and if Dla > Dlb, from field B.

FIG. 21 shows the discrimination pattern for the case where two Rlaps exist (discrimination pattern 4). Let the left Rlap be Rlapa and the right Rlap be Rlapb; Rlapa is determined to protrude from field A and Rlapb from field B. This pattern identifies cases where print data on both the left and the right touches the frame line, for example because a character string fills the frame of field A or field B.

FIG. 22 shows a special case of discrimination patterns 1 and 2 (discrimination pattern 5), which is evaluated before discrimination patterns 1 and 2. Let Wb be the frame width of field B and Wpb the width of the circumscribed rectangle enclosing the characters in field B together with Rlap; if Wb ≤ Wpb, Rlap is determined to protrude from field A.

The discrimination process 1402 applies discrimination pattern 4 first, followed by patterns 5, 1, 2, and 3 in that order; changing the order in which patterns 1, 2, and 3 are applied does not change the discrimination accuracy. The process may also be carried out with any one of the five patterns or any combination of them, although the discrimination accuracy then changes.
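The five patterns and their order of application can be sketched in one function. Several choices here are assumptions, not taken from the source: characters are reduced to horizontal spans (`Span`), the distance between two rectangles is taken as the gap between their facing edges, the distance from a field center to Rlap is measured to Rlap's nearer edge, and Dla and Dlb are read as the widths of Rlap lying on the field A side and the field B side of the border.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """Horizontal extent of a frame or of a character's circumscribed rectangle."""
    left: float
    right: float


def assign_overlaps(field_a: Span, field_b: Span, rlaps: List[Span],
                    chars_a: List[Span], chars_b: List[Span]) -> List[str]:
    """Return 'A' or 'B' (the field each Rlap protrudes from), left to right.

    field_a lies to the left of field_b; chars_a and chars_b exclude the Rlaps.
    """
    rlaps = sorted(rlaps, key=lambda s: s.left)
    if len(rlaps) == 2:                      # pattern 4: Rlapa and Rlapb
        return ["A", "B"]
    if len(rlaps) != 1:
        raise ValueError("expected one or two overlapping characters")
    rlap = rlaps[0]

    # Pattern 5: field B's characters plus Rlap would not fit inside frame B.
    group = chars_b + [rlap]
    wb = field_b.right - field_b.left
    wpb = max(s.right for s in group) - min(s.left for s in group)
    if wb <= wpb:
        return ["A"]

    rarit = max(chars_a, key=lambda s: s.right) if chars_a else None
    rblft = min(chars_b, key=lambda s: s.left) if chars_b else None
    dma = rlap.left - rarit.right if rarit is not None else None
    dmb = rblft.left - rlap.right if rblft is not None else None

    if rarit is not None and rblft is not None:   # pattern 1
        return ["A" if dma <= dmb else "B"]
    if rarit is not None:                         # pattern 2, only Rarit
        dcb = (field_b.left + field_b.right) / 2 - rlap.right
        return ["A" if dma <= dcb else "B"]
    if rblft is not None:                         # pattern 2, only Rblft
        dca = rlap.left - (field_a.left + field_a.right) / 2
        return ["B" if dmb <= dca else "A"]

    border = field_a.right                        # pattern 3
    dla, dlb = border - rlap.left, rlap.right - border
    return ["A" if dla <= dlb else "B"]


# Layout loosely modelled on FIG. 17: "123" inside field A, "4" on the border.
A, B = Span(0, 100), Span(100, 200)
result = assign_overlaps(A, B, [Span(98, 113)],
                         [Span(40, 55), Span(60, 75), Span(80, 95)], [])
```

Patterns 1, 2, and 3 are mutually exclusive (both neighbours exist, exactly one exists, neither exists), which is consistent with the statement that reordering them does not affect accuracy.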

Next, the discrimination process based on rectangle size (1401) is described. Because data is printed at a size suited to the frame, print misalignment data whose circumscribed rectangle is larger than the frame of interest is determined to be mixed in from an adjacent reading field. Specifically, using the height (Wst) and width (Hst) of the print data and the height (Wfr) and width (Hfr) of the frame, the print data is determined to be mixed in from an adjacent frame if Wst > Wfr or Hst > Hfr.
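The size rule is a pair of comparisons. The sketch below uses plain `width` and `height` parameter names instead of the symbols in the text; the function name is hypothetical.

```python
def mixed_in_by_size(data_width: float, data_height: float,
                     frame_width: float, frame_height: float) -> bool:
    """True when the data's circumscribed rectangle exceeds the frame in
    either dimension, so it is judged to come from an adjacent frame."""
    return data_height > frame_height or data_width > frame_width


# A 30-high string touching a frame that is only 24 high cannot belong to it.
too_tall = mixed_in_by_size(data_width=80, data_height=30,
                            frame_width=120, frame_height=24)
fits = mixed_in_by_size(data_width=80, data_height=20,
                        frame_width=120, frame_height=24)
```

A `False` result means only that this rule does not decide the item; the remaining discrimination processes still have to classify it.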

Next, the discrimination process based on the global print misalignment direction (1405) is described. On many forms with print misalignment, the misaligned data on the form is shifted in one consistent direction. A global misalignment direction is therefore determined for each form and used to decide protrusion or mix-in for print data whose misalignment direction is ambiguous. The global direction is determined from the misalignment directions already established by the discrimination processes described so far (1401, 1402, 1403, 1404, 1406). Let DirUpNum, DirDownNum, DirLftNum, and DirRitNum be the numbers of print data items determined to be shifted up, down, left, and right, respectively; let GlobalDir be the global print misalignment direction; and let Up, Down, Lft, and Rit denote its four possible values. GlobalDir is then determined as follows: Up if DirUpNum ≥ DirDownNum + α, Down if DirDownNum ≥ DirUpNum + α, Lft if DirLftNum ≥ DirRitNum + α, and Rit if DirRitNum ≥ DirLftNum + α. The constant α is introduced because the global direction is unreliable when few character strings have been classified or when roughly equal numbers are shifted in different directions. α is an adjustable parameter that ensures a global direction is determined only when the misalignment is biased toward one direction.
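The vote with margin α can be sketched as below. The text does not say how a simultaneous vertical and horizontal result should be combined, so this sketch simply reports the two axes separately; that choice, the function name, and the default `alpha=2` are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def global_direction(directions: Iterable[str],
                     alpha: int = 2) -> Tuple[Optional[str], Optional[str]]:
    """Return (vertical, horizontal) global directions, or None per axis
    when neither side leads the other by at least alpha items."""
    counts = Counter(directions)  # missing keys count as 0
    up, down = counts["Up"], counts["Down"]
    lft, rit = counts["Lft"], counts["Rit"]

    vertical = "Up" if up >= down + alpha else "Down" if down >= up + alpha else None
    horizontal = "Lft" if lft >= rit + alpha else "Rit" if rit >= lft + alpha else None
    return vertical, horizontal


# Four items already judged as shifted down, one as shifted up.
clear_bias = global_direction(["Down", "Down", "Up", "Down", "Down"])
no_bias = global_direction(["Down", "Up"])
```

With a positive α, a form with only a handful of classified items, or with evenly split directions, yields no global direction, which is the behaviour the constant is meant to enforce.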

In the six discrimination processes above, when print misalignment data is mixed into one of the four corners of the frame of interest, it is necessary to determine whether the data was mixed in from the horizontal or the vertical direction in order to decide the direction of field correction. This is determined from the height Lh and width Lw of the region where the frame area and the circumscribed rectangle of the mixed-in character string overlap. (Because characters are generally taller than they are wide, the height is scaled to Lh×0.5 for the comparison.) Specifically, if the scaled height (Lh×0.5) is smaller than the width (Lw), the data is determined to be mixed in from the vertical direction (FIG. 23); if the scaled height (Lh×0.5) is larger than the width (Lw), the data is determined to be mixed in from the horizontal direction (FIG. 24). However, when the height of the part of the rectangle outside the frame (Lh') is very large compared with Lh (Lh' ≥ Lh×2), as in FIGS. 24 and 25, the data is determined to be mixed in from the vertical direction.
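A minimal sketch of the corner test follows. It takes the second comparison (scaled height larger than width) to indicate horizontal mixing, since the exception for a very tall outside part would otherwise have nothing to override; the function name and the handling of the exact-equality case (left undecided) are assumptions.

```python
from typing import Optional


def corner_mix_direction(lw: float, lh: float,
                         lh_outside: float = 0.0) -> Optional[str]:
    """Classify data mixed into a frame corner as 'vertical' or 'horizontal'.

    lw, lh: width and height of the overlap between the frame area and the
    circumscribed rectangle; lh_outside: height of the part outside the frame.
    """
    if lh_outside >= lh * 2:      # exception: rectangle extends far outside
        return "vertical"
    scaled = lh * 0.5             # characters are usually taller than wide
    if scaled < lw:
        return "vertical"
    if scaled > lw:
        return "horizontal"
    return None                   # equality is not covered by the rule


wide_overlap = corner_mix_direction(lw=10, lh=10)
tall_overlap = corner_mix_direction(lw=4, lh=20)
tall_but_mostly_outside = corner_mix_direction(lw=4, lh=20, lh_outside=40)
```

The returned direction would then select the axis along which the reading field is corrected.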

The six discrimination processes (1401 to 1406) described above for the process of assigning print data to reading fields (0107) are independent of one another; any one of them may be used alone, or several may be used in combination. When they are combined, they may be applied in any order.

However, because the discrimination processes can give different protrusion/mix-in results, the order in which they are executed matters. For example, for the print misalignment data shown in FIG. 15, the propagation-type data allocation process from non-data entry areas (1403) correctly identifies the character strings as protruding downward, whereas the overlap-degree discrimination with T=0.5 wrongly judges them to be shifted upward, so that the recognition result of print data from the adjacent frame is output by mistake. The propagation-type data allocation process is thus more accurate than the overlap-degree discrimination, but it cannot be applied to samples in which nothing protrudes into a non-data entry area. From this point of view, the discrimination processes rank in descending order of accuracy as follows: discrimination by rectangle size (1401), discrimination of characters in horizontally adjacent frames (1402), propagation-type data allocation from non-data entry areas (1403), overlap-degree discrimination with T=0.9 (1404), discrimination by global print misalignment direction (1405), and overlap-degree discrimination with T=0.5 (1406). Applying the discrimination rules in this order minimizes errors in associating reading fields with print data. The order may nevertheless be changed depending on the nature of the protruding or mixed-in data on the form. The area of each reading field is then corrected so that it contains only the print data assigned to it.
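Applying the rules in priority order amounts to a first-match cascade. The sketch below is illustrative only: items are plain dictionaries with precomputed, hypothetical fields (`larger_than_frame`, `propagated`, `overlap`), and the rules for 1402 and 1405 are omitted to keep it short.

```python
from typing import Callable, List, Optional, Tuple

Verdict = Optional[str]  # "protrusion", "mix-in", or None when undecided
Rule = Tuple[str, Callable[[dict], Verdict]]


def classify(item: dict, rules: List[Rule]) -> Tuple[Verdict, Optional[str]]:
    """Apply rules in order; the first rule that decides wins."""
    for name, rule in rules:
        verdict = rule(item)
        if verdict is not None:
            return verdict, name
    return None, None


def overlap_rule(threshold: float) -> Callable[[dict], Verdict]:
    def rule(item: dict) -> Verdict:
        return "protrusion" if item["overlap"] >= threshold else None
    return rule


RULES: List[Rule] = [
    ("1401 size", lambda it: "mix-in" if it["larger_than_frame"] else None),
    ("1403 propagation", lambda it: it.get("propagated")),
    ("1404 overlap T=0.9", overlap_rule(0.9)),
    ("1406 overlap T=0.5", overlap_rule(0.5)),
]

# An item that propagation identifies as mixed in, but that overlaps the
# frame of interest by 60 percent.
item = {"larger_than_frame": False, "propagated": "mix-in", "overlap": 0.6}
ordered = classify(item, RULES)
loose_only = classify(item, RULES[-1:])
```

Run in the stated order, the more reliable propagation rule decides the item; with only the T=0.5 overlap rule, the same item is classified the other way, which is the kind of conflict the ordering is meant to resolve.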

Finally, character string reading is performed on the print data in the corrected reading fields to obtain the recognition result (0108). Character string reading is performed on all reading fields: those judged free of print misalignment by the print misalignment field candidate detection process, those found to contain no print misalignment data by the print misalignment data determination process, and those whose areas were corrected for print misalignment data.

As described above, determining whether print misalignment data protrudes from the frame of interest or is mixed in from an adjacent frame makes it possible to exclude print data mixed in from adjacent frames and to output as the recognition result only the target print data of the frame of interest, including the part that protrudes from it.

The invention can be used for general-purpose reading of print data from document images containing frame lines, including salary payment reports handled by local governments as well as receipts, application forms, transfer slips, and medical institution claim forms.

FIG. 1 is a diagram showing the processing flow in an embodiment of the present invention.
FIG. 2 is a diagram showing the hardware configuration in the embodiment.
FIG. 3 is an example of a form image input in the embodiment.
FIG. 4 illustrates an example of a frame extraction result in the embodiment.
FIG. 5 illustrates an example of a reading field extraction result in the embodiment.
FIG. 6 is a diagram showing the processing flow of the print misalignment field candidate detection process.
FIG. 7 is a diagram showing the processing flow of the print misalignment data determination process.
FIG. 8 is a diagram outlining the process of removing pre-print components that remain after dropout.
FIG. 9 is a first example of the process of complementing character data components removed by dropout.
FIG. 10 is a second example of the process of complementing character data components removed by dropout.
FIG. 11 is a diagram showing the processing flow for determining the pre-print color and the print data color in a representative field.
FIG. 12 is an example of a form image containing print misalignment data.
FIG. 13 illustrates the result of the reading field correction process.
FIG. 14 is a diagram showing the processing flow of the process of assigning print data to reading fields.
FIG. 15 illustrates an outline of the propagation-type data allocation process from non-data entry areas.
FIG. 16 is a diagram outlining the discrimination process based on the degree of overlap between the frame of interest and the rectangle.
FIG. 17 is a diagram showing an example of a color dropout result when print misalignment occurs in the horizontal direction.
FIG. 18 is a diagram showing the first discrimination pattern used in the discrimination process for characters in horizontally adjacent frames.
FIG. 19 is a diagram showing the second discrimination pattern used in that process.
FIG. 20 is a diagram showing the third discrimination pattern used in that process.
FIG. 21 is a diagram showing the fourth discrimination pattern used in that process.
FIG. 22 is a diagram showing the fifth discrimination pattern used in that process.
FIG. 23 is a first example of the process of determining the direction from which print data was mixed into one of the four corners of a frame.
FIG. 24 is a second example of the process of determining the direction from which print data was mixed into one of the four corners of a frame.
FIG. 25 is a third example of the process of determining the direction from which print data was mixed into one of the four corners of a frame.

Explanation of symbols

0101: form image; 0102: ruled line extraction process; 0103: frame extraction process; 0104: reading field extraction process; 0105: print misalignment field candidate detection process; 0106: print misalignment data determination process; 0107: process of assigning print data to reading fields; 0108: character string recognition process; 0109: character string recognition result; 0110: dictionary data.
0201: print data reading device; 0202: data input device; 0203: operation terminal device; 0204: display terminal device; 0205: external storage device; 0206: memory; 0207: central processing unit; 0208: communication device; 0209: network.
0301: example of an input form image; 0401: example of a frame extraction result; 0501: example of a reading field extraction result.
0601: binary image; 0602: ruled line removal process; 0603: connected component generation process; 0604: connected component contact determination process; 0605: print misalignment field candidate detection result.
0701: color partial image in which a print misalignment field candidate is enlarged; 0702: color image of the entire form; 0703: process of determining the pre-print color and print data color in a representative field; 0704: color dropout process for each field; 0705: process of removing pre-print components remaining after dropout; 0706: process of complementing character data components removed by dropout; 0707: print data detection process; 0708: print misalignment data determination result.
0801: pre-printed frame line remaining after color dropout.
0901: left part of an example character pattern partially removed by color dropout; 0902: right part of an example character pattern partially removed by color dropout; 0903: first example of the result of complementing character data components removed by dropout.
1001: upper part of an example character pattern partially removed by color dropout; 1002: lower part of an example character pattern partially removed by color dropout; 1003: second example of the result of complementing character data components removed by dropout.
1102: reading field selection process; 1103: data character string existence determination process; 1104: data character string color determination process; 1105: pre-print color determination process; 1106: result of determining the pre-print color and print data color in a representative field.
1301: correction result for the reading field in which the data of amount A is printed; 1302: correction result for the reading field in which the data of amount B is printed; 1303: correction result for the reading field in which the data of amount C is printed.
1401: discrimination process based on rectangle size; 1402: discrimination process for characters in horizontally adjacent frames; 1403: propagation-type data allocation process from non-data entry areas; 1404: discrimination process based on the degree of overlap between the frame of interest and the rectangle (T=0.9); 1405: discrimination process based on the global print misalignment direction; 1406: discrimination process based on the degree of overlap between the frame of interest and the rectangle (T=0.5).
1501: first example of print misalignment data; 1502: second example of print misalignment data; 1503: third example of print misalignment data; 1504: fourth example of print misalignment data; 1505: data print frame for amount 1; 1506: data print frame for amount 3; 1507: data print frame for amount 4; 1508: data print frame for amount 5.
1701: color dropout processing area of reading field A; 1702: reading field A; 1703: reading field B; 1704: color dropout result for reading field A; 1705: color dropout result for reading field B.

Claims

A method of reading print data in an image obtained by digitizing, with a scanner, a document that includes frame lines, such as a form, the method comprising:
a ruled line extraction step of extracting ruled lines from the image;
a frame extraction step of extracting frames from the extracted ruled lines;
a reading field extraction step of extracting, from the extracted frames, frames from which print data is to be read, with reference to form definition knowledge stored in advance;
a print misalignment field detection step of detecting frames from which print data may protrude;
a print misalignment data determination step of separating the frame lines from the print data and treating print data determined to be in contact with a frame line as print misalignment data;
a step of assigning print data to reading fields, which determines from which frame each item of print misalignment data protrudes; and
a character string reading step of reading the print data.

The print data reading method according to claim 1, wherein the step of assigning print data to reading fields includes:
a propagation-type data allocation step from non-data entry areas, which detects print misalignment data mixed into an area where no data is printed, determines the misalignment direction of the detected print data, and uses that direction to assign other print misalignment data to reading fields;
a discrimination step based on the degree of overlap between the frame of interest and the rectangle, which calculates the degree of overlap between the frame area and the circumscribed rectangle of the print misalignment data and uses it to determine whether the print misalignment data protrudes from the frame of interest or is mixed in from an adjacent frame;
a discrimination step for characters in horizontally adjacent frames, which, for a character straddling two horizontally connected frames, uses the relationship among the position of the frame line separating the two frames, the position of the overlapping character, the characters in the left frame or the center position of the left frame, and the characters in the right frame or the center position of the right frame;
a discrimination step based on rectangle size, which determines that print misalignment data whose circumscribed rectangle is taller than the frame, or wider than the frame, is print misalignment data mixed in from another frame; and
a discrimination step based on the global print misalignment direction, which assigns print misalignment data to reading fields using the misalignment directions of the print misalignment data determined by the propagation-type data allocation step, the overlap-degree discrimination step, the discrimination step for characters in horizontally adjacent frames, and the rectangle-size discrimination step.

The print data reading method according to claim 1, wherein the step of assigning print data to reading fields determines, from the positional relationship between a plurality of reading fields and a plurality of items of print misalignment data, from which reading field each item of print misalignment data protrudes.

The print data reading method according to claim 1, wherein the print misalignment field detection step detects, using binary image processing, all frames from which print data may protrude, and the print misalignment data determination step performs color image processing only on the periphery of the detected frames.

A print data reading device comprising: an image input unit that inputs an image obtained by digitizing, with a scanner, a document that includes frame lines, such as a form; a memory device that stores dictionary data including form definition knowledge; and an arithmetic unit,
wherein the arithmetic unit:
extracts ruled lines from the image;
extracts frames from the extracted ruled lines;
extracts, from the extracted frames, frames from which print data is to be read, with reference to the form definition knowledge stored in the memory device;
detects frames from which print data may protrude;
separates the frame lines from the print data and treats print data determined to be in contact with a frame line as print misalignment data;
determines from which frame each item of print misalignment data protrudes; and
reads the print data.

A print data reading program for a print data reading device having an image input unit that inputs an image obtained by digitizing, with a scanner, a document that includes frame lines, such as a form, a memory device that stores dictionary data including form definition knowledge, and an arithmetic unit, the program causing the arithmetic unit to execute:
a ruled line extraction step of extracting ruled lines from the image;
a frame extraction step of extracting frames from the extracted ruled lines;
a reading field extraction step of extracting, from the extracted frames, frames from which print data is to be read, with reference to the form definition knowledge stored in the memory device;
a print misalignment field detection step of detecting frames from which print data may protrude;
a print misalignment data determination step of separating the frame lines from the print data and treating print data determined to be in contact with a frame line as print misalignment data;
a step of assigning print data to reading fields, which determines from which frame each item of print misalignment data protrudes; and
a character string reading step of reading the print data.