JP2008204226A

JP2008204226A - Form recognition device and its program

Info

Publication number: JP2008204226A
Application number: JP2007040489A
Authority: JP
Inventors: Hiroshi Shinjo (広新庄); Minenobu Seki (峰伸関); Katsumi Marukawa (勝美丸川); Takeshi Eisaki (健永崎); Kazuki Nakajima (和樹中島)
Original assignee: Hitachi Computer Peripherals Co Ltd
Current assignee: Hitachi Information and Telecommunication Engineering Ltd
Priority date: 2007-02-21
Filing date: 2007-02-21
Publication date: 2008-09-04
Anticipated expiration: 2027-02-21
Also published as: JP4996940B2

Abstract

PROBLEM TO BE SOLVED: To read characters from forms that are of the same kind but differ in format, by detecting data areas without designating the position coordinates of the reading areas in advance and by judging the attribute of each data area.

SOLUTION: An item name word dictionary 190, in which item name character strings and their attributes are registered, is used. Character strings are detected in the form image and collated with the item name word dictionary 190 to extract the item name character strings. A character string that is successfully collated with the item name word dictionary 190 is judged to be an item name character string, and a character string that fails collation is judged to be a data character string. The attribute of each data character string is judged from the positional relationship between the item name character strings and the data character strings.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to form image processing techniques, and more particularly to techniques for understanding the attributes of information written on a form and for recognizing its characters.

Many form recognition methods in conventional form OCR (optical character reader) systems target fixed-format forms whose reading positions are determined in advance. One conventional technique for such forms is the "format generator" (see, for example, Non-Patent Document 1). This technique strictly specifies the entry positions of the characters to be read in units of 0.1 mm. Many existing OCR models adopt format information similar to that of the format generator.

On the other hand, the above method cannot recognize forms such as salary payment reports and medical fee claims (rezept), in which the number of ruled lines and the positions and sizes of the frames differ slightly from sheet to sheet even within the same form type. For such forms there is a method that exploits the fact that the arrangement of the items is almost constant even though the frame positions and sizes differ: it identifies the reading regions by collating a pre-registered frame structure with the frame structure analyzed from the form image (Patent Document 1).

Forms with an even higher degree of freedom include registration completion notices. In this kind of form, the item names are almost the same within the same form type, but the size and number of frames differ from sheet to sheet. In some cases the arrangement of the items also differs. For such forms there is a method that first analyzes the table structure, recognizes the characters inside the frames to detect the frames in which item names are written, and then treats the frame to the right of or below each item name frame as the data area to be read (Patent Document 2). The method of Patent Document 2 requires the item name character strings and the correspondence between item name frames and data frames to be defined in advance.

Patent Document 3 describes another conventional technique for detecting data from item name character strings. The method of Patent Document 3 targets data whose items have a hierarchical relationship, such as "arrival-quantity", "arrival-amount", "shipment-quantity", and "shipment-amount", as well as data in a two-dimensional table structure. This method makes recognition possible by defining the arrangement relationships between item names in advance.

Patent Document 1: JP 2004-139484 A
Patent Document 2: JP H09-319824 A
Patent Document 3: JP 2005-275830 A
Non-Patent Document 1: "Hitachi OCR Solution Imaging OCR" catalog, Hitachi, Ltd., December 2005 edition, pp. 5-6

The characteristics of the forms targeted for recognition can be summarized as follows.
(1) The item name character string that represents an attribute differs from form to form even though the attribute is the same, as with "氏名" (full name) and "名前" (name).
(2) The positional relationship between the item name character strings and the data character strings, including hierarchical and two-dimensional relationships, is unknown.
(3) Data of several attributes is written in a single frame.
(4) The item name and data character strings are not enclosed in frames.
(5) The data of several items forms one set (a record), and such sets are written repeatedly.
(6) The area in which data is written contains other character strings that do not need to be read, for example the character "円" (yen) written in an amount column.

To handle such forms with the method of Non-Patent Document 1, format information must be created for every form, and the format information to apply must then be identified for each input form. When hundreds to thousands of form types are handled, this method is not practical in terms of either the cost of creating format information or the accuracy of identification.

The method of Patent Document 1 also presupposes that the arrangement of the items (attributes) is the same, so it cannot be applied to non-fixed-format forms such as those described above.

The method of Patent Document 2 can handle some non-fixed-format forms, but because the positional relationship between the item name frames and the data frames and the item name character strings must be defined in advance, it cannot address issues (1) to (6) above.

The method of Patent Document 3 has the advantage that it can analyze hierarchical and two-dimensional relationships between item names, but like Patent Document 2 it cannot address issues (1) to (6) above. Moreover, because the hierarchical relationships and the two-dimensional table relationships must be defined in advance, it cannot handle cases such as a two-dimensional table whose rows and columns have been swapped.

Furthermore, because the conventional methods identify the item name of each piece of data from the arrangement of the frames, they cannot recognize forms written in the styles described in (3) and (4) above.

An object of the present invention is to provide a form recognition device, and a program for it, that can automatically analyze the attributes of data, so that data can be recognized in forms of the same type whose formats differ in frame size, frame position, item order, and the like, without defining the positions of the items to be read for each form.

To achieve the above object, the present invention provides a form recognition device comprising a storage unit that stores a recognition dictionary and an item name word dictionary, and a processing unit that performs recognition processing on a form image. The processing unit detects the regions of the form image that contain character strings, detects the character strings within the detected regions, and recognizes the characters of the detected character strings using the recognition dictionary. It then collates the character recognition result of each character string with the item name words in the item name word dictionary. A character string for which item name word collation succeeds is judged to be a character string in which an item name is written (hereinafter, an item name character string), and a character string for which collation does not succeed is judged to be a character string in which data is written (hereinafter, a data character string). The processing unit associates the data with the item names based on the arrangement relationship between the item name character strings and the data character strings, and thereby determines the attribute of each data character string.

In other words, the invention provides a form recognition program for recognizing characters in a form image. The program causes the processing unit of the form recognition device to execute the steps of: detecting the regions of the form image that contain character strings; detecting the character strings within those regions; recognizing the characters of the detected character strings; collating the character recognition result of each character string with the item name words; judging a character string for which item name word collation succeeds to be an item name character string; judging a character string for which item name word collation does not succeed to be a data character string; and associating the data with the item names based on the arrangement relationship between the item name character strings and the data character strings, thereby determining the attribute corresponding to each item name.

Forms that are similar but differ in format can be recognized without strict definitions. Simply by registering the item name character strings, the data in a form can be recognized while its attributes are analyzed.

The present invention is described in more detail below with reference to the embodiments shown in the drawings. The invention is not limited to these embodiments. Before the specific processing is described, an outline of the invention is given.

The aim of the present invention is to recognize data character strings in forms with a high degree of format freedom without defining the data entry positions in advance. This requires the attributes of the written data to be understood automatically. In this embodiment, therefore, it is not the data entry positions but the item name words representing the data attributes that are registered in a dictionary in advance. Item name character strings are detected by collating the item name words in this dictionary with the character strings in the form. A character string that cannot be collated with the item name word dictionary is regarded as a data character string. The correspondence between data character strings and item name character strings is then recognized by analyzing the arrangement relationship between them. The attribute of the associated item name is the attribute of the data character string.

Specifically, the attributes of the data character strings are recognized by analyzing, in the following order, the arrangement relationships among the item name character strings and between the item name character strings and the data character strings.

(1) Conventional methods treat frames as the basic unit and analyze the arrangement relationship between the frames in which item names are written and the frames in which data is written. The present method instead treats character strings as the basic unit. Specifically, by analyzing the arrangement relationship between item name character strings and data character strings, it can handle cases in which several items are written in the same frame and cases in which there are no frames.

(2) The extracted character strings are recognized and collated with the item name words. A character string for which collation succeeds is judged to be an item name character string, and a character string for which collation fails is judged to be a data character string. The attributes of the data character strings are recognized by analyzing the arrangement relationship between the item name character strings and the data character strings so judged.

(3) Conventional methods define the arrangement relationship between item names and data, the hierarchical relationships between item names, and the two-dimensional relationships between item names. The present method, as described in (2), recognizes the attributes of the data, including hierarchical and two-dimensional relationships, from the item name words alone.

(4) Groups of data that belong together (records) are detected from the sequence of item name and data pairs.
By executing some or all of these processes, the present invention can achieve form recognition with a high recognition rate.

FIG. 2 shows an example of the hardware configuration of the form recognition device according to the first embodiment of the present invention. In FIG. 2, reference numeral 10 denotes an input device (input unit) for entering commands, code data, and the like; 20 denotes an image input device (image input unit) for inputting the form image to be processed; 30 denotes a form recognition unit that detects character strings, recognizes characters, and assigns attributes to data character strings; 40 denotes the dictionary for form recognition in this embodiment, which stores the character recognition dictionary and the item name word dictionary; and 50 denotes a display device (display unit) that displays the recognition results. Form images may be input from an image database (DB) 60 instead of from the image input device 20. The recognition dictionary 40 and the image DB 60 are stored in a storage unit that is not shown.

The form recognition unit 30 is normally implemented by a central processing unit (CPU), and the CPU 30 serving as the form recognition unit executes the form recognition program of this embodiment, which is described in detail below. The form recognition program executed by the CPU is normally stored in a storage unit that is not shown, but it can also be introduced into the storage unit from outside, via a portable storage medium or a network, and stored there.

The processing in the first embodiment is described in detail below with reference to the drawings. FIG. 1 is a flowchart outlining the form processing performed by the form recognition unit 30 of the first embodiment. As noted above, this processing is normally executed as program processing on the CPU. The same applies to the processing described later.

First, in step 110, the regions that contain character strings are detected in the input form image. Such a region is hereinafter called a character string region. In a tabular form, a character string region corresponds to an individual frame (cell). In a non-tabular form, or for a character string outside a table, it corresponds to the character string itself or to a frame formed by treating the blank spaces that separate character strings as virtual ruled lines.

Next, in step 120, the character strings within each character string region are detected. Steps 110 and 120 can be implemented, for example, with the method detailed in JP H11-53466 A. In one embodiment of line extraction, horizontal or vertical character lines are extracted by merging adjacent connected components (blobs of contiguous black pixels) within the region in the horizontal or vertical direction.
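The horizontal merging of connected components can be sketched as follows. This is a minimal illustration, not the method of JP H11-53466 A: the `Box` type, the `merge_into_lines` function, and the `max_gap` threshold are hypothetical names and values chosen for the example, and the connected components are assumed to be already available as bounding boxes.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of a connected component or of a merged text line."""
    left: int
    top: int
    right: int
    bottom: int


def merge_into_lines(components, max_gap=10):
    """Merge component boxes into horizontal text lines.

    A component joins an existing line when it overlaps the line
    vertically and starts no more than max_gap pixels to the right of
    the line's current right edge (max_gap is an assumed threshold).
    """
    lines = []
    for comp in sorted(components, key=lambda b: b.left):
        for line in lines:
            overlaps_vertically = comp.top < line.bottom and line.top < comp.bottom
            close_horizontally = comp.left - line.right <= max_gap
            if overlaps_vertically and close_horizontally:
                # Grow the line box so that it also covers the component.
                line.left = min(line.left, comp.left)
                line.top = min(line.top, comp.top)
                line.right = max(line.right, comp.right)
                line.bottom = max(line.bottom, comp.bottom)
                break
        else:
            # No suitable line was found, so the component starts a new one.
            lines.append(Box(comp.left, comp.top, comp.right, comp.bottom))
    return lines


components = [Box(0, 0, 10, 10), Box(12, 1, 22, 11),
              Box(100, 0, 110, 10), Box(0, 50, 10, 60)]
lines = merge_into_lines(components)
```

Vertical character lines could presumably be extracted in the same way with the roles of the x and y coordinates exchanged.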

In step 130, the individual characters in each character string are recognized using the character recognition dictionary 180.
In step 140, the character recognition result of step 130 is collated with the item name words registered in the item name word dictionary 190. Steps 130 and 140 can be implemented, for example, with the method described in JP 2004-171316 A.

In step 150, a character string for which word collation succeeded in step 140 is judged to be an item name character string.
In step 160, a character string for which word collation failed in step 140 is judged to be a data character string. Because word collation failed, the result of step 130 is used unchanged as the character recognition result. However, word collation may also fail because of character recognition errors, so the character string may instead be treated as a data character string candidate, with the decision on whether it really is a data character string left to later processing.

In step 170, the attribute of each character string is determined from the arrangement relationship between the item name character strings and the data character strings. Attribute recognition is described in detail with reference to FIGS. 6 to 12 for each type of arrangement relationship between item names and data.

FIGS. 3 and 4 show examples of forms that are of the same type but differ in format, such as frame size and item names. The form in FIG. 3 contains the item names "銀行名" (bank name), "支店名" (branch name), "口座番号" (account number), "氏名" (full name), and "金額" (amount). The form in FIG. 4 contains the item names "銀行" (bank), "支店" (branch), "口座No." (account No.), "名前" (name), and "金額" (amount).

FIG. 5 shows an example of the data stored in the item name word dictionary 190 for the forms in FIGS. 3 and 4. Item name words with different spellings are registered, and item name words with the same attribute are given the same attribute ID. For example, "銀行名" (bank name) and "銀行" (bank) both have the attribute ID "1". The item name words to register are chosen by inspecting the forms to be processed and selecting words that have the same attribute but are written differently. In this embodiment the entire character string written as an item name is registered, but only part of an item name may be registered instead, which reduces the number of registered words. For example, if only "銀行" is registered, word collation succeeds for "銀行名" even though "銀行名" itself is not registered. The attribute IDs are written as numbers here, but they may also be character strings.
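A dictionary of this kind, including the effect of registering only part of an item name, might look like the sketch below. The attribute IDs 1 (bank name) and 5 (amount) follow the text; the IDs 2 to 4, the substring-based matching rule, and the names `ITEM_NAME_WORDS` and `lookup_attribute` are assumptions made for illustration.

```python
# Item name word -> attribute ID. IDs 1 and 5 are stated in the text;
# IDs 2, 3, and 4 are assumed here for illustration.
ITEM_NAME_WORDS = {
    "銀行名": 1, "銀行": 1,        # bank name / bank
    "支店名": 2, "支店": 2,        # branch name / branch
    "口座番号": 3, "口座No.": 3,   # account number / account No.
    "氏名": 4, "名前": 4,          # full name / name
    "金額": 5,                     # amount
}


def lookup_attribute(text, dictionary):
    """Return the attribute ID of the longest registered word contained
    in text, or None when no registered word occurs in it."""
    for word in sorted(dictionary, key=len, reverse=True):
        if word in text:
            return dictionary[word]
    return None


# Registering only "銀行" is enough to match "銀行名".
partial_dictionary = {"銀行": 1}
bank_id = lookup_attribute("銀行名", partial_dictionary)
```

Substring matching of this kind would also accept data such as a bank's full name that happens to contain a registered word, so a real system would need a stricter collation rule than this sketch uses.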

The character string attribute analysis of step 170 in FIG. 1 is described below with reference to FIGS. 6 to 12. FIG. 6 shows the overall processing flow of step 170, and FIGS. 7 and 8 are used to explain the detailed processing within FIG. 6.

FIG. 6 illustrates the processing flow of the character string attribute analysis of step 170 in FIG. 1. First, the hierarchical relationship analysis between item name character strings is performed in step 600. This analysis covers only the item name character strings detected through word collation in step 150 of FIG. 1. The arrangement of these character strings is analyzed to determine the hierarchical relationships among them. The details of this processing are explained with the flowchart of FIG. 7 and the frame arrangement example of FIG. 8.

The vertical item name-data relationship analysis of step 610 analyzes the relationship between an item name character string and a data character string located below it, and determines the attribute of the data character string. The details of this processing are explained with the flowchart of FIG. 9 and the frame arrangement examples of FIG. 10.

The horizontal item name-data relationship analysis of step 620 analyzes the relationship between an item name character string and a data character string located to its right, and determines the attribute of the data character string. This processing can be implemented by changing the processing of step 610 to work in the horizontal direction.

The repetition relationship analysis of step 630 analyzes the repetition structure of the data when data character strings with the same attribute are lined up vertically or horizontally. The details of this processing are explained with reference to FIG. 11.

FIG. 6 shows four kinds of analysis, steps 600, 610, 620, and 630, but some of them may be omitted depending on the application. Other processing, such as that described later with reference to FIGS. 13 and 14, may also be added.

FIG. 7 illustrates the processing flow of the hierarchical relationship analysis between item name character strings in step 600 of FIG. 6. First, in step 700, the item name character string regions are taken as the analysis target one at a time. Next, in step 710, it is determined whether another item name character string region exists below the target item name character string region. If one exists, step 720 determines whether the left and right ends of the target region selected in step 700 coincide with the left and right ends of the item name character string regions below it. If they coincide, step 730 determines whether the lower item name character string regions are equal in height. Steps 720 and 730 are explained concretely later with reference to FIG. 8. If the heights are equal, step 740 sets the hierarchical relationship of the attributes, with the upper item name character string region as the upper level and the lower item name character string regions as the lower level. When this processing is complete, or when any of the determinations in steps 710 to 730 is not satisfied, step 750 moves the processing target to the next item name character string region and the flow returns to step 700.

FIG. 8 shows an example in which steps 720 and 730 of FIG. 7 determine that upper and lower item name character string regions have a hierarchical relationship. In steps 700 and 710, the frame containing item name character string 1 is the target item name character string region (800). In steps 720 and 730, the frame containing item name character string 21 and the frame containing item name character string 22 are the lower item name character string regions to be examined (810, 820). The left end of the upper item name character string region 800 coincides with the left end of the lower item name character string region 810, and likewise the right end of region 800 coincides with the right end of region 820, so the condition of step 720 is satisfied. In addition, the frame heights of the two lower item name character string regions 810 and 820 are equal, so the condition of step 730 is satisfied. Item name character strings 21 and 22 can therefore be judged to be lower-level items of item name character string 1.
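The checks of steps 710 to 740 can be sketched as follows. The `Region` type, the `find_child_items` function, and the pixel tolerance `tol` are hypothetical; the text only states that the ends must coincide and the heights must be equal, so the tolerance and the "directly below" test are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Item name character string region (frame) with its text."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def find_child_items(parent, item_regions, tol=2):
    """Return the item name regions that form the lower level of parent,
    or an empty list when the conditions of steps 710-730 fail."""
    # Step 710: item name regions directly below the parent and inside
    # its horizontal span.
    below = sorted(
        (r for r in item_regions
         if r is not parent
         and abs(r.top - parent.bottom) <= tol
         and r.left >= parent.left - tol
         and r.right <= parent.right + tol),
        key=lambda r: r.left,
    )
    if not below:
        return []
    # Step 720: the left end of the leftmost lower region and the right
    # end of the rightmost lower region must coincide with the parent's.
    if (abs(below[0].left - parent.left) > tol
            or abs(below[-1].right - parent.right) > tol):
        return []
    # Step 730: the lower regions must all have the same height.
    heights = [r.bottom - r.top for r in below]
    if max(heights) - min(heights) > tol:
        return []
    # Step 740: the lower regions become the lower level of the parent.
    return below


item1 = Region("item name 1", 0, 0, 100, 20)     # region 800 in FIG. 8
item21 = Region("item name 21", 0, 20, 50, 40)   # region 810
item22 = Region("item name 22", 50, 20, 100, 40) # region 820
children = find_child_items(item1, [item1, item21, item22])
```

Calling the function for every item name region in turn corresponds to the loop formed by steps 700 and 750.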

FIG. 9 illustrates the processing flow of the vertical item name-data relationship analysis in step 610 of FIG. 6. First, in step 900, the item name character string regions are taken as the analysis target one at a time. Next, in step 910, it is determined whether a data character string region exists below the target item name character string region. If one exists, step 920 determines whether the left and right ends of the target region selected in step 900 coincide with the left and right ends of the data character string region below it. Step 920 is explained concretely later with reference to FIG. 10. If the condition of step 920 is satisfied, step 930 judges that the attribute of the character string in the lower data character string region is the attribute of the character string in the upper item name character string region. When this processing is complete, or when the determination in step 910 or 920 is not satisfied, step 940 moves the processing target to the next item name character string region and the flow returns to step 900. The horizontal item name-data relationship analysis of step 620 in FIG. 6 can be implemented by changing the analysis direction of this processing to horizontal.

FIG. 10 shows examples in which step 920 of FIG. 9 determines that upper and lower character string regions have an item name-data relationship. FIG. 10(a) is the case in which a single data frame lies directly below the item name frame. Let the frame containing the item name character string be the target item name character string region (1000), and in step 920 let the frame 1010 containing the data character string be the data character string region. The left and right ends of the upper item name character string region 1000 coincide with the left and right ends of the lower data character string region 1010. The attribute of the character string in 1010 can therefore be judged to be the attribute of the character string in 1000. In the example of FIG. 3, the attribute of the data "AAA" can be judged to be the bank name (attribute ID "1" in FIG. 5).

FIG. 10(b) is the case where a plurality of data frames exist immediately below the item name frame. Let the frame containing the item name character string be the item name character string area under processing (1020), and let the frames containing data character strings 1, 2, and 3 be the data character string areas examined in step 920 (1030, 1040, 1050). The left end of the upper item name character string area 1020 coincides with the left end of the lower data character string area 1030, and likewise the right end of the item name character string area 1020 coincides with the right end of the data character string area 1050, so the condition of step 920 is satisfied. A further condition may be added that 1030, 1040, and 1050 are adjacent to one another and have the same frame height. As a result, the attribute of the character strings in 1030, 1040, and 1050 can be determined to be the attribute of the character string in 1020. In the example of FIG. 4, the attribute of the data in the three frames “(blank)”, “5”, and “000” can be determined to be the amount (attribute ID “5” in FIG. 5).
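The multi-frame condition of FIG. 10(b) might be checked as in the sketch below. Frames are plain `(left, top, right, bottom)` tuples; the function name, the tolerance, and the `require_adjacent` switch for the optional adjacency and equal-height condition are assumptions made for illustration.

```python
from typing import List, Tuple

Frame = Tuple[float, float, float, float]  # (left, top, right, bottom)


def frames_span_item(item: Frame, frames: List[Frame], tol: float = 1.0,
                     require_adjacent: bool = True) -> bool:
    """Return True when the data frames jointly span the item name frame."""
    if not frames:
        return False
    ordered = sorted(frames, key=lambda f: f[0])  # left to right
    # Left end of the first frame and right end of the last frame must match the item.
    if abs(ordered[0][0] - item[0]) > tol or abs(ordered[-1][2] - item[2]) > tol:
        return False
    if require_adjacent:
        for a, b in zip(ordered, ordered[1:]):
            if abs(a[2] - b[0]) > tol:                      # gap or overlap between frames
                return False
            if abs((a[3] - a[1]) - (b[3] - b[1])) > tol:    # frame heights differ
                return False
    return True
```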

FIG. 11 illustrates the repetition structure analysis in step 630 of FIG. 6. In FIG. 11, the processing of FIG. 9 analyzes the arrangement relationship between the item name character string area (1100), which is the frame containing item name character string 1, and the data character string area (1110), which is the frame containing data character string 11, and thereby determines that the attribute of data character string 11 is the attribute of item name character string 1. Further, when a data character string area 1120 of the same width exists below the data character string area 1110, the attribute of data character string 12 in 1120 is also determined to be the attribute of item name character string 1. The same processing is executed for item name character strings 2 and 3 to obtain the repetition relationship of the data character string areas for each item. In addition, it is determined whether the data character string areas containing data character string 11, data character string 21, and data character string 31 have the same height and are adjacent to one another. In FIG. 11 this determination is satisfied, so these data are determined to be one set (one record). Data character string 12, data character string 22, and data character string 32 are likewise determined to be one set.
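A rough sketch of the repetition analysis described for FIG. 11 follows. It assumes the first data row has already received attributes from the analysis of FIG. 9; the `Cell` type, the function names, and the tolerance are hypothetical, and the row grouping relies on cells of one row sharing (nearly) the same top coordinate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    left: float
    top: float
    right: float
    bottom: float
    text: str
    attribute: Optional[str] = None


def propagate_down(cells: List[Cell], tol: float = 1.0) -> None:
    """Give each unlabeled cell the attribute of the nearest same-width cell above it."""
    for cell in sorted(cells, key=lambda c: c.top):
        if cell.attribute is not None:
            continue
        above = [c for c in cells
                 if c.attribute is not None and c.bottom <= cell.top + tol
                 and abs(c.left - cell.left) <= tol and abs(c.right - cell.right) <= tol]
        if above:
            cell.attribute = max(above, key=lambda c: c.bottom).attribute


def group_records(cells: List[Cell], tol: float = 1.0) -> List[List[Cell]]:
    """Group cells of equal height that are horizontally adjacent into one record."""
    records: List[List[Cell]] = []
    prev: Optional[Cell] = None
    for cell in sorted(cells, key=lambda c: (c.top, c.left)):
        same_record = (prev is not None
                       and abs(prev.top - cell.top) <= tol
                       and abs(prev.bottom - cell.bottom) <= tol
                       and abs(prev.right - cell.left) <= tol)
        if same_record:
            records[-1].append(cell)
        else:
            records.append([cell])
        prev = cell
    return records
```

The index of each group in the returned list, plus one, would correspond to the record number.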

FIG. 12 shows an example of the result of executing the processing of FIG. 1 on the form of FIG. 3 using the item name word dictionary of FIG. 5. The attribute of the data “AAA” is “bank name”, the item name located above it, and its attribute ID is “1”. It has no horizontal attribute. The five data items “AAA”, “aaa”, “00000000”, “○○○”, and “2000” are determined to be one set and are assigned record number “1”. Similarly, the data in the row with the bank name “BBB” is recognized as record number “2”. For the data “12000” in the bottom row of FIG. 3, executing the item name-data relationship analysis in both the vertical and horizontal directions (610, 620) shows that it has a two-dimensional relationship in which the vertical attribute is “amount” and the horizontal attribute is “total”. If the item names have a hierarchical relationship, the hierarchy information may also be added to the output. By recognizing the attribute of each data item in this way, recognition results can be stored in the same attribute-based database even for forms with different formats.

The data column in FIG. 12 need not hold the character recognition result of the data; it may instead hold character string coordinates or a partial image of the character string. This supports a mode of use in which, because character recognition cannot recognize all data correctly, OCR is not used for data entry and only the attribute of the area in which the data is written is recognized. In this case, the result can be used for work such as manual key-punch entry of the data by a person.

FIG. 13 shows an example of analysis for the case where a plurality of items exist in the same frame, in the character string attribute analysis of step 170 of FIG. 1. This processing may be added to the flow of FIG. 6. Conventional methods of associating item names with data, such as that of Patent Document 3, use the frame as the basic unit of attribute assignment, so when a single frame contains a plurality of items, such as “bank” and “branch” in FIG. 4, they cannot assign an attribute to each data character string. In the present method, when the analysis of step 610 or step 620 of FIG. 6 finds a plurality of item name character strings in the item name character string area and a plurality of data character strings in the data character string area, a separate attribute is assigned to each data character string. In the case of FIG. 13, the attribute of “data character string 1” is the attribute of “item name character string 1”, and the attribute of “data character string 2” is the attribute of “item name character string 2”. The relative position within the frame can be used as the criterion for association. In the example of FIG. 13, the character strings in the upper position of their respective frames are associated with each other, and so are those in the lower position. Other criteria for this arrangement relationship, such as left-right placement or the line number within the frame, can also be used.
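One way to express the relative-position pairing of FIG. 13 is to sort both groups of strings by their position within their own frame and pair them in order. The sketch below assumes each string is given as `(label, top, left)` relative to its frame and that pairing is attempted only when both frames hold the same number of strings; both choices, and the function name, are assumptions.

```python
from typing import Dict, List, Tuple

Positioned = Tuple[str, float, float]  # (label, top, left) relative to the frame


def pair_by_position(item_strings: List[Positioned],
                     data_strings: List[Positioned]) -> Dict[str, str]:
    """Map each data string to the attribute of the item name in the same relative slot."""
    if len(item_strings) != len(data_strings):
        return {}  # assumed fallback: leave the data unlabeled
    by_position = lambda s: (s[1], s[2])  # top first, then left
    items = sorted(item_strings, key=by_position)
    data = sorted(data_strings, key=by_position)
    return {d[0]: i[0] for i, d in zip(items, data)}
```

Sorting by `(top, left)` covers both the upper/lower arrangement of FIG. 13 and a left/right arrangement on a single line.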

FIG. 14 shows an example of analysis for the case where both an item name character string and a data character string exist in the same frame, in the character string attribute analysis of step 170 of FIG. 1. This processing may be added to the flow of FIG. 6. Conventional methods of associating item names with data, such as that of Patent Document 3, use the frame as the basic unit of attribute assignment, so they cannot assign an attribute to the data in this example either. In the present method, when no data character string area is adjacent to the frame containing the item name character string and a data character string exists in the same frame, the item name character string and that data character string are associated with each other. This association can be made not only within a single frame but also among a plurality of character strings outside any frame.

Furthermore, when only part of a single character string is successfully collated with an item name word, the character string can be divided into the successfully collated part and the remainder, and the association described above can be made between the two parts.
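The splitting of a partially matched character string could look like the sketch below. It uses exact substring matching and prefers the longest dictionary word; both are simplifying assumptions (collation against OCR output would normally need to tolerate recognition errors), and the function name and dictionary layout are hypothetical.

```python
from typing import Dict, Optional, Tuple


def split_partial_match(text: str,
                        dictionary: Dict[str, str]) -> Optional[Tuple[str, str, str]]:
    """Split text into (item name word, attribute, remaining data) or return None."""
    hits = [word for word in dictionary if word in text]
    if not hits:
        return None
    word = max(hits, key=len)  # assumed tie-break: longest matching word
    start = text.index(word)
    rest = (text[:start] + " " + text[start + len(word):]).strip()
    return word, dictionary[word], rest
```

For a string such as "Total 12000" and a dictionary mapping "Total" to a total attribute, the remainder "12000" would then be treated as the data character string for that item name.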

FIG. 15 shows one example of how the recognition result of the first embodiment is displayed. In this example, data of attributes that should not be shown to the operator is hidden. FIG. 15 is an example in which the data in the name column of the form of FIG. 3 is hidden. With this function, an operator who performs key-punch entry or checks recognition results can see only the necessary data without seeing personal information.

FIG. 16 shows another example of how the recognition result of the first embodiment is displayed. In this example, only the data of a specific attribute is displayed as a list. FIG. 16 displays only the data with the bank name attribute, where form 1 is the form of FIG. 3 and form 2 is the form of FIG. 4. The displayed data may be either the recognition result or a partial image of the character string. Although this example lists data from a plurality of forms, the data of only one form may be displayed. Data of a plurality of attributes may also be displayed.
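The two display modes of FIG. 15 and FIG. 16 amount to masking or filtering recognition results by attribute. The sketch below assumes each result is a dictionary with `attribute` and `data` keys; that layout, the function names, the placeholder string, and the sample values in the usage note are made up for illustration.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]  # assumed shape: {"attribute": ..., "data": ...}


def mask_hidden(rows: List[Row], hidden_attributes: Set[str],
                placeholder: str = "***") -> List[Row]:
    """FIG. 15 style: replace the data of hidden attributes before display."""
    return [{**row, "data": placeholder} if row["attribute"] in hidden_attributes
            else dict(row)
            for row in rows]


def list_by_attribute(forms: Dict[str, List[Row]],
                      attribute: str) -> List[Tuple[str, str]]:
    """FIG. 16 style: list (form id, data) for one attribute across forms."""
    return [(form_id, row["data"])
            for form_id, rows in forms.items()
            for row in rows
            if row["attribute"] == attribute]
```

For instance, calling `mask_hidden(rows, {"name"})` would leave bank names visible while replacing name data with the placeholder.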

FIG. 17 illustrates a pre-printed character recognition method according to another embodiment. In this embodiment, pre-printed characters such as “円” (yen) are also registered in the item name word dictionary 190 of FIG. 1. When the item name character string collation in step 140 succeeds against a pre-printed word, step 150 determines that the character string is pre-printed. As a result, the character string attribute analysis in step 170 can determine that the character string “円” is pre-printed text having the amount attribute. This processing prevents pre-printed characters from being misrecognized as data.
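Registering pre-printed words alongside item names can be sketched as a dictionary whose entries carry both an attribute and a kind. The entry layout, the kind labels, and the exact-match lookup below are assumptions; the actual dictionary of FIG. 5 may be organized differently.

```python
from typing import Dict, Optional, Tuple

# word -> (attribute, kind); the contents are illustrative only
WORD_DICTIONARY: Dict[str, Tuple[str, str]] = {
    "Amount": ("amount", "item_name"),
    "yen": ("amount", "preprint"),
}


def classify(text: str,
             dictionary: Dict[str, Tuple[str, str]]) -> Tuple[str, Optional[str]]:
    """Return (kind, attribute); strings that fail collation are treated as data."""
    entry = dictionary.get(text)
    if entry is None:
        return "data", None          # collation failed: data character string
    attribute, kind = entry
    return kind, attribute           # item name or pre-printed word
```

With this layout, a string such as "yen" is classified as pre-printed text with the amount attribute instead of being passed on as data.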

FIG. 1 is a diagram showing the flow of the form recognition processing in the first embodiment.
FIG. 2 is a block diagram showing the schematic configuration of the form recognition device according to the first embodiment.
FIG. 3 is a diagram showing an example of a form to be processed in the first embodiment.
FIG. 4 is a diagram showing an example of a form to be processed in the first embodiment.
FIG. 5 is a diagram showing an example of the item name word dictionary in the first embodiment.
FIG. 6 is a diagram showing an example of the character string attribute analysis processing flow of the first embodiment.
FIG. 7 is a diagram showing an example of the processing flow of the hierarchical relationship analysis between item name character strings of FIG. 6 in the first embodiment.
FIG. 8 is a diagram showing examples of targets of the hierarchical relationship analysis between item name character strings in the first embodiment.
FIG. 9 is a diagram showing an example of the processing flow of the vertical item name-data relationship analysis of FIG. 6 in the first embodiment.
FIG. 10 is a diagram showing examples of targets of the vertical item name-data relationship analysis in the first embodiment.
FIG. 11 is a diagram showing an example of a target of the repetition relationship analysis in the first embodiment.
FIG. 12 is a diagram showing an example of the recognition result for the forms of FIG. 3 and FIG. 4 in the first embodiment.
FIG. 13 is a diagram showing an example in which a plurality of items are written in the same frame.
FIG. 14 is a diagram showing an example in which both an item name character string and a data character string are written in the same frame.
FIG. 15 is a diagram showing an example of hiding a specific item in the first embodiment.
FIG. 16 is a diagram showing an example of listing only the data of a specific attribute in the first embodiment.
FIG. 17 is a diagram showing an example, in another embodiment, in which both a pre-printed character string and a data character string are written in the same frame.

Explanation of symbols

10 ... input device, 20 ... image input device, 30 ... form recognition unit (CPU), 40 ... recognition dictionary, 50 ... display device, 60 ... image DB, 180 ... character recognition dictionary, 190 ... item name word dictionary, 800, 810, 820, 1000, 1020, 1100 ... item name character strings, 1010, 1110, 1120 ... data character strings.

Claims

A form recognition device comprising a storage unit that stores a recognition dictionary and an item name word dictionary, and a processing unit that recognizes a form image,
wherein the processing unit:
detects an area including a character string in the form image;
detects the character string in the detected area;
recognizes characters of the detected character string using the recognition dictionary;
collates the character recognition result of the character string with the item name words in the item name word dictionary;
determines a character string for which the item name word collation succeeded to be an item name character string, and a character string for which the collation did not succeed to be a data character string; and
associates data with an item name based on the arrangement relationship between the item name character string and the data character string, and determines the attribute corresponding to the item name.

The form recognition device according to claim 1,
wherein the processing unit,
when determining the attribute corresponding to the item name by associating data with the item name based on the arrangement relationship between the item name character string and the data character string, analyzes the hierarchical relationship between the attributes of the item names from the arrangement of the item name character strings.

The form recognition device according to claim 1,
wherein, when determining the attribute corresponding to the item name by associating data with the item name based on the arrangement relationship between the item name character string and the data character string,
the attribute of the data character string is analyzed from the vertical or horizontal arrangement relationship between the item name character string and the data character string.

The form recognition device according to claim 3,
wherein, when determining the attribute corresponding to the item name by associating data with the item name based on the arrangement relationship between the item name character string and the data character string,
the repetition structure of data having the same attribute is analyzed from the arrangement relationship of the data character strings.

The form recognition device according to claim 1,
wherein, when determining the attribute corresponding to the item name by associating data with the item name based on the arrangement relationship between the item name character string and the data character string,
and when data character strings having a plurality of attributes exist in the same area, the attributes of the data character strings are analyzed by associating the arrangement of the item name character strings with the arrangement of the data character strings within the area.

The form recognition device according to claim 1,
wherein, when determining the attribute corresponding to the item name by associating data with the item name based on the arrangement relationship between the item name character string and the data character string,
and when both the item name character string and the data character string exist in the same area, both are determined to have the same attribute.

The form recognition device according to claim 1,
wherein, when determining the attribute corresponding to the item name by associating data with the item name based on the arrangement relationship between the item name character string and the data character string,
a character string in which pre-printed characters are written is identified by storing both item name words and pre-printed words in the item name word dictionary.

A form recognition program executed by a form recognition device including a storage unit that stores a recognition dictionary and an item name word dictionary, and a processing unit that performs recognition processing on a form image,
the program causing the processing unit to execute the steps of:
detecting an area including a character string in the form image;
detecting the character string in the area;
recognizing characters of the detected character string;
collating the character recognition result of the character string with the item name words stored in the storage unit;
determining a character string that was successfully collated with an item name word to be an item name character string;
determining a character string that was not successfully collated with an item name word to be a data character string; and
determining the attribute corresponding to the item name by associating data with the item name based on the arrangement relationship between the item name character string and the data character string.

The form recognition program according to claim 8,
wherein the step of determining the attribute
includes a step of analyzing the hierarchical relationship between the attributes of the item names from the arrangement of the item name character strings.

The form recognition program according to claim 8,
wherein the step of determining the attribute
includes a step of analyzing the attribute of the data character string from the vertical or horizontal arrangement relationship between the item name character string and the data character string.