JP2013190993A

JP2013190993A - Device and method for analyzing table structure

Info

Publication number: JP2013190993A
Application number: JP2012056656A
Authority: JP
Inventors: Junichi Hirayama (平山淳一); Masakazu Fujio (藤尾正和); Yoshiyuki Kobayashi (小林義行); Kimiyoshi Machii (待井君吉); Kaoru Kawabata (川端薫)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2012-03-14
Filing date: 2012-03-14
Publication date: 2013-09-26
Anticipated expiration: 2032-03-14
Also published as: JP5775839B2

Abstract

PROBLEM TO BE SOLVED: In technology for extracting item name-data relationships from a large number of unspecified documents, to extract the item name-data relationship stably even when the item name words are not known in advance and a complete item name dictionary cannot be prepared. SOLUTION: The format features and character string features of every pair of adjacent frames in a table are compared, and a differential score representing the difference between them is assigned to the contact (the shared border) of the frame pair. Next, for every ruled line in the table, the differential scores assigned to the frame contacts belonging to that ruled line are projected (for example, by taking their sum or their average) to calculate an item name-data boundary score. The item name-data boundary score is a certainty factor representing whether the ruled line lies on the boundary between item name frames and data frames; it is set on the policy that a contact at which the frame features differ is a boundary between an item name frame and a data frame. Finally, the item name frames in the table are determined from the position of the item name-data boundary, and the item name-data relationship is determined from their adjacency to the other frames.

Description

The present invention relates to a method for analyzing tables in documents, and more particularly to a method for extracting the item names in a table, or the correspondence between item names and data.

Table analysis technology includes methods that take the character strings written in a table, extract the item names (keywords representing the attribute of a character string) and the data (the values of the character strings), and then extract the item name-data relationship (the correspondence between an item name and its data). In conventional item name-data relationship extraction methods, the user registers the item names to be read in an item name dictionary in advance, and the item name-data relationship is extracted using that dictionary.

One application of item name-data relationship extraction is a form recognition system. For example, when a form recognition system is used at a financial institution, the information needed to process a form is extracted automatically from the form by form recognition and passed to a business system. This automates the work in which a user manually enters the necessary information into the business system. In such applications, the item name-data relationships to be extracted are often known in advance, such as "transfer account number" or "delivery amount" paired with a specific amount, or "delivery due date" paired with a specific date. The kinds of corresponding item name words are also limited and often known in advance. The user or system administrator therefore prepares, in advance, an item name dictionary containing the item name words that correspond to the item name-data relationships to be extracted.

In the technique disclosed in Patent Document 1, the character strings in a table are matched against the item name words registered in an item name dictionary; a character string that matches an item name word is judged to be an item name, and a character string that does not match is judged to be a data candidate. The correspondence between item names and data is then determined from the positional relationship between the item names and the data candidates.

Patent Document 2 discloses a technique for detecting frames in a document by extracting ruled lines from the document, extracting the intersections and end points of pairs of ruled lines, and detecting the upper right, upper left, lower right, and lower left corners that correspond to the four corners of a rectangular frame.

Patent Document 1: JP 2008-204226 A
Patent Document 2: JP H11-053466 A

In a document management system that handles a large number of unspecified documents, item name-data relationship extraction can be used to search for similar or related documents and to integrate multiple tables, which makes work involving large volumes of documents more efficient. However, when the target is a large number of unspecified documents, it is difficult to prepare a complete item name dictionary: the item name-data relationships to be extracted are not known in advance, the item name words that serve as keys for extraction are not known, or, even when some are known, their notation varies. It is therefore difficult to apply a method that extracts item name-data relationships using dictionary-registered words as keys to a document management system that handles a large number of unspecified documents.

In view of the above problems in the prior art, an object of the present invention is to enable stable extraction of item name-data relationships from a large number of unspecified documents even when the item name words are not known in advance and a complete item name dictionary cannot be prepared.

Another object of the present invention is to enable extraction of the required item name-data relationships from a group of documents in which item names have many notational variations, without the user having to maintain each variation one by one.

To achieve the above object, one aspect of the present invention is a table structure analysis apparatus having a control unit, a storage unit connected to the control unit, and an input unit to which image information is input. The control unit detects, from the image information input through the input unit, the frame regions that make up a table contained in the image; calculates a frame feature amount for each detected frame region; calculates the difference in frame feature amount between two adjacent frame regions; associates the calculated difference with the line segment between those two adjacent frame regions; extracts as a boundary line, based on the differences associated with the line segments that make up each ruled line of the table, the ruled line that forms the boundary between the frame regions containing item names and the frame regions containing data; and determines, based on the boundary line, the frame regions in the table that contain item names, or the correspondence between the frame regions containing item names and the frame regions containing data.

More preferably, the storage unit stores an item name dictionary in which item names are registered, and the control unit registers the character string in a frame region judged to contain an item name in the item name dictionary as an item name.

Still more preferably, the item name dictionary registers each character string together with likelihood information indicating how likely that character string is to be an item name, and the control unit updates, among the likelihood information registered in the item name dictionary, the likelihood information corresponding to the character string in a frame region judged to contain an item name.

To extract the ruled line that forms the boundary between the frame regions containing item names and the frame regions containing data, in one aspect of the present invention the control unit calculates a frame feature difference value indicating the difference in frame feature amount between two adjacent frame regions, associates the calculated value with the line segment between the two adjacent frame regions, calculates boundary line likelihood information (how likely a ruled line is to be a boundary line) from the frame feature difference values associated with the line segments that make up the ruled line, and judges the ruled line to be a boundary line when its boundary line likelihood information is equal to or greater than a predetermined value. Alternatively, among the ruled lines that make up the table, a predetermined number of ruled lines with the largest likelihood information may be judged to be boundary lines.
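The two decision rules described above (a fixed threshold, or a fixed number of top-ranked ruled lines) can be sketched as follows. This is only an illustrative sketch: the function name `select_boundary_lines`, the ruled-line IDs, and the likelihood values are hypothetical and do not come from the publication.

```python
from typing import Dict, List, Optional


def select_boundary_lines(
    likelihood: Dict[str, float],
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Return the IDs of ruled lines judged to be item name-data boundaries."""
    if (threshold is None) == (top_n is None):
        raise ValueError("specify exactly one of threshold or top_n")
    if threshold is not None:
        # Rule 1: keep every ruled line whose likelihood is at least the threshold.
        return [line for line, score in likelihood.items() if score >= threshold]
    # Rule 2: keep the top_n ruled lines with the largest likelihood.
    ranked = sorted(likelihood.items(), key=lambda kv: kv[1], reverse=True)
    return [line for line, _ in ranked[:top_n]]


# Hypothetical likelihood values for four ruled lines.
scores = {"H1": 3.2, "H2": 0.4, "V1": 2.5, "V2": 0.1}
by_threshold = select_boundary_lines(scores, threshold=2.0)
by_rank = select_boundary_lines(scores, top_n=1)
```

The threshold rule can return any number of boundary lines, including none, whereas the top-N rule always returns a fixed number; which is preferable depends on how much is known about the table layouts being processed.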

According to the present invention, item names and the correspondence between item names and data can be extracted from a large number of unspecified documents. Moreover, item names and their correspondence with data can be extracted stably even when the item name words are not known in advance or when the item names have many notational variations.

FIG. 1 is a processing flow diagram of the item name-data relationship extraction method in an embodiment of the present invention.
FIG. 2 is a diagram of the basic processing flow on which the item name-data relationship extraction method in the embodiment is premised.
FIG. 3 is a module configuration diagram of the item name-data relationship extraction method in the embodiment.
FIG. 4 is a flowchart of the frame feature difference score calculation process in the embodiment.
FIG. 5 is a flowchart of the item name-data boundary line detection process in the embodiment.
FIG. 6 is an example of frame features in the embodiment.
FIG. 7 is a calculation example of the frame feature difference score in the embodiment.
FIG. 8 is a detection example of the item name-data boundary line in the embodiment.
FIG. 9 is a processing flow diagram of an application example of the item name-data relationship extraction method in the embodiment.
FIG. 10 is a processing flow diagram of an application example of the item name-data relationship extraction method in the embodiment.
FIG. 11 is an example of the item name dictionary in the embodiment.
FIG. 12 is an example of correcting the item name dictionary in the embodiment.
FIG. 13 is a flowchart of the item name dictionary correction process in the embodiment.
FIG. 14 is an example of an item name-data relationship extraction result.
FIG. 15 is an example of an item name-data relationship extraction result.
FIG. 16 is a diagram showing the correspondence between ruled lines and tangents in a table.
FIG. 17 is an example of an item name-data boundary line score table, which shows the correspondence between each ruled line, the tangents that make up the ruled line, and the score indicating how likely the ruled line is to be a boundary line.
FIG. 18 is an example of a frame feature difference score table, which shows the correspondence between tangents and frame feature difference scores.

The present invention is a method for extracting the correspondence between item names and data (the item name-data relationship) from a table. First, the features of every frame region in the table are extracted (here and below, "frame features" means the format features of a frame and the features of the character strings in it). Each two adjacent frames are treated as a pair, and a difference score indicating the difference in features between the two frames is obtained for each pair. The difference score is then associated with the tangent of the frame pair (the line segment between the two adjacent frames). Next, for every ruled line in the table, the difference scores assigned to the frame tangents belonging to that ruled line are projected onto it (by taking their sum, their average, or the like) to calculate an item name-data boundary line score. The item name-data boundary line score is a value indicating the certainty (likelihood) that the ruled line is a boundary between item name frames and data frames. Because a tangent with a large difference score is presumed to be a boundary between an item name frame and a data frame, a ruled line containing tangents with large difference scores is judged to be a ruled line that forms such a boundary. Finally, the item name frames in the table are determined from the position of the item name-data boundary line, and the item name-data relationship is determined from their adjacency to the other frames.
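The projection step can be illustrated with a short sketch. The description only says that the tangent scores are combined by "sum, average, or the like"; the function `project_scores`, the IDs, and the numeric values below are hypothetical, chosen to loosely mirror the H/h naming used for the tables in the figures.

```python
from statistics import mean
from typing import Dict, List


def project_scores(
    rule_to_tangents: Dict[str, List[str]],
    tangent_scores: Dict[str, float],
    mode: str = "sum",
) -> Dict[str, float]:
    """Project per-tangent difference scores onto the ruled lines they belong to."""
    if mode not in ("sum", "mean"):
        raise ValueError("mode must be 'sum' or 'mean'")
    combine = sum if mode == "sum" else mean
    return {
        rule: combine([tangent_scores[t] for t in tangents])
        for rule, tangents in rule_to_tangents.items()
    }


# Hypothetical table: ruled line H1 separates a header row from the first
# data row, and H2 separates two data rows.
rule_to_tangents = {"H1": ["h1", "h2", "h3"], "H2": ["h4", "h5", "h6"]}
tangent_scores = {"h1": 4.0, "h2": 3.0, "h3": 5.0, "h4": 0.0, "h5": 1.0, "h6": 0.5}

boundary_sum = project_scores(rule_to_tangents, tangent_scores, mode="sum")
boundary_mean = project_scores(rule_to_tangents, tangent_scores, mode="mean")
```

Here H1 receives a much larger score than H2 under either projection, so H1 would be the stronger boundary candidate. A sum favors long ruled lines with many tangents, while an average normalizes for length; the description leaves the choice open.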

A table is made up of a plurality of ruled lines and has two or more frames (frame regions) containing item names or data. In the present invention, a line enclosing a frame region is called a line segment, and line segments make up the ruled lines. A frame is usually rectangular, in which case it consists of the four line segments enclosing its four sides. A line segment between two adjacent frame regions (the line segment shared by the two adjacent frames) is called the tangent of the two adjacent frames (or simply a tangent).
<First embodiment>
Hereinafter, an item name-data relationship extraction method according to an embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 3 is a block diagram showing the configuration of a table structure analysis computer 331 on which the item name-data extraction method according to an embodiment of the present invention is implemented. The table structure analysis computer 331 includes a memory 309, an external storage device 314, input/output interfaces (I/F) 311 and 312, and a CPU 310. Through the input/output I/Fs 311 and 312 it is connected to a document input device 301 for inputting the documents to be processed and to an output device 303 that outputs processing results such as item name-data relationships. The table structure analysis computer 331 may also have a network I/F 313 and be connected to a network 315 through it.

The memory 309 or the external storage device 314 stores the OS, a frame detection unit (frame detection program) 307 that detects frames in an input document, a character string detection unit (character string detection program) 308 that detects character strings in an input document, an item name-data relationship extraction unit (item name-data relationship extraction program) 302 that extracts item name-data relationships, an item name dictionary 304 that stores the item name words to be extracted, item name-data relationship information 305 indicating the item name-data relationships extracted from documents, and the input document 306. The CPU 310 executes the various programs loaded into the memory 309.
The document input device 301, the CPU 310, the output device 303, the memory 309, and the external storage device 314 need not be physically connected; they may be connected through a network or the like, and each may be configured on a different computer.

The processing flow of item name-data relationship extraction is described below with reference to FIG. 1. First, however, the basic processing flow on which the present invention is premised is described with reference to FIG. 2. In the processing flow of FIG. 2, an item name dictionary is used to extract the character strings of item names and the frames to which they belong, and item name-data relationships are extracted from the positional relationship of the frames. The present invention is configured by replacing the item name dictionary matching step S230 of the basic processing flow of FIG. 2 with the frame feature difference score calculation process S140 and the item name-data boundary line detection process S150 of FIG. 1.
First, in the frame detection process of step S210, frames are detected in the input document. Next, in the character string detection of step S220, character strings are detected in the document. As a specific example of steps S210 and S220, a method such as the technique disclosed in Patent Document 2 can be used, in which ruled lines are extracted from the document, the intersections and end points of pairs of ruled lines are extracted, and frames are detected by finding the upper right, upper left, lower right, and lower left corners that correspond to the four corners of a rectangular frame. In electronic documents such as PDF (Portable Document Format) and Microsoft Office (registered trademark) files, information on ruled lines, frames, and character string rectangles may already be attached to the document; in that case step S210, step S220, or both may be omitted.

In the item name dictionary matching process of step S230, each character string extracted in step S220 is matched against the item name dictionary 304 to determine whether it matches a word or a notation grammar registered there. The item name dictionary 304 defines the character strings of the item names to be extracted, expressed as a list of words such as "Voltage[V]" and "Power" and a list of notation grammars such as "nnn-nnn" (n is a digit) and "Type. nn". During matching, a likelihood is calculated that indicates how closely the character string matches a word or notation grammar in the dictionary. If the resulting score is high, the character string is judged to be an item name; if it is low, the character string is judged to be a data candidate. A specific example of the word matching used in step S230 is the technique disclosed in the non-patent document M. Koga, et al., "Lexical search approach for character string recognition", Proceeding of 3rd Document Analysis System, pp. 237-251, in which a character network that treats individual characters as nodes is matched against a dictionary expressed as a state transition network, the optimal character string path is selected from the state transition network, and the matching result is obtained.
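As a rough illustration of the dictionary matching in step S230, the following sketch scores a character string against a word list and a notation grammar list. It is not the lexical search method of the cited non-patent document: the similarity measure is swapped for Python's `difflib.SequenceMatcher`, and the function names and the 0.8 threshold are assumptions made only for this example. The sketch also assumes that every lowercase "n" in a grammar stands for a digit, which works for the two grammars quoted in the text but would be ambiguous for grammars containing a literal "n".

```python
import re
from difflib import SequenceMatcher

# Dictionary contents quoted in the description of step S230.
WORDS = ["Voltage[V]", "Power"]
GRAMMARS = ["nnn-nnn", "Type. nn"]


def grammar_to_regex(grammar: str) -> "re.Pattern[str]":
    """Turn a notation grammar into a regex: 'n' is a digit, the rest is literal."""
    return re.compile("".join(r"\d" if ch == "n" else re.escape(ch) for ch in grammar))


def item_name_likelihood(text: str) -> float:
    """Return a score in [0, 1]; 1.0 means an exact word or grammar match."""
    if any(grammar_to_regex(g).fullmatch(text) for g in GRAMMARS):
        return 1.0
    # Stand-in similarity measure (assumption), not the cited lexical search.
    return max(SequenceMatcher(None, text, word).ratio() for word in WORDS)


def classify(text: str, threshold: float = 0.8) -> str:
    """Label a string as an item name or a data candidate (threshold is arbitrary)."""
    return "item name" if item_name_likelihood(text) >= threshold else "data candidate"


labels = {s: classify(s) for s in ["Power", "Type. 12", "123-456", "1,350"]}
```

A fuzzy score of this kind tolerates small notational variations, but an item name that is absent from the dictionary altogether still falls through as a data candidate, which is the limitation the boundary-based method is meant to address.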

In the item name frame determination process of step S260, the item name frames are determined from among all the frames that make up the table. In the flow of FIG. 2, a frame containing a character string judged to be an item name in step S230 is judged to be an item name frame. A character string belonging to an item name frame is judged to be an item name character string, and a character string belonging to any other frame is judged to be a data candidate character string. Other ways of determining item name frames are described later, in the explanation of the processing flow of the invention.

In the item name-data relationship determination process of step S270, for the character strings judged in step S260 to be item names and data candidates, it is determined whether the positional relationship between two character strings is valid as an item name-data relationship; if it is valid, the two character strings are extracted as an item name-data relationship. This determination is based on the coordinates of the frames detected in step S210 and the coordinates of the character string regions detected in step S220, using the size, positional relationship, and adjacency of the frames to which the two character strings belong, as well as the size and positional relationship of the two character string rectangles. For example, in one method, the data candidate character string in the frame adjacent below, or adjacent to the right of, the frame containing a character string judged to be an item name is judged to be data and extracted as an item name-data relationship.
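The "adjacent below or adjacent to the right" rule given as an example for step S270 can be sketched as follows. The `Frame` class, the pixel tolerance, and the coordinates are hypothetical; the description does not specify how adjacency is tested, so this sketch assumes axis-aligned rectangles that share a border and overlap along it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Frame:
    left: int
    top: int
    right: int
    bottom: int
    text: str
    is_item_name: bool


def _overlaps(a0: int, a1: int, b0: int, b1: int) -> bool:
    """True if the open intervals (a0, a1) and (b0, b1) intersect."""
    return min(a1, b1) > max(a0, b0)


def is_right_neighbor(a: Frame, b: Frame, tol: int = 2) -> bool:
    """b touches a's right edge and overlaps it vertically."""
    return abs(b.left - a.right) <= tol and _overlaps(a.top, a.bottom, b.top, b.bottom)


def is_below_neighbor(a: Frame, b: Frame, tol: int = 2) -> bool:
    """b touches a's bottom edge and overlaps it horizontally."""
    return abs(b.top - a.bottom) <= tol and _overlaps(a.left, a.right, b.left, b.right)


def extract_relations(frames: List[Frame]) -> List[Tuple[str, str]]:
    """Pair each item name frame with data candidates directly right of or below it."""
    pairs = []
    for name in (f for f in frames if f.is_item_name):
        for other in (f for f in frames if not f.is_item_name):
            if is_right_neighbor(name, other) or is_below_neighbor(name, other):
                pairs.append((name.text, other.text))
    return pairs


# Hypothetical 2x2 table with item names in the left column.
table = [
    Frame(0, 0, 100, 30, "Voltage[V]", True),
    Frame(100, 0, 200, 30, "220", False),
    Frame(0, 30, 100, 60, "Power", True),
    Frame(100, 30, 200, 60, "1,350", False),
]
relations = extract_relations(table)
```

The frame below "Voltage[V]" is itself an item name frame, so it is skipped and only the right-hand neighbors are paired.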

The operations of the item name frame determination process of step S260 and the item name-data relationship determination process of step S270 are not limited to those described above. Any implementation may be used as long as step S260 can determine whether an arbitrary frame in the document is an item name frame or a data frame candidate, and step S270 can determine whether the positional relationship between any two frames in the document is valid as an item name-data relationship and extract that relationship.

Next, the processing flow in an embodiment of the present invention is described with reference to FIG. 1. The frame detection processes of steps S110 and S210, the character string detection processes of steps S120 and S220, and the item name-data relationship determination processes of steps S170 and S270 in FIG. 1 and FIG. 2 are respectively identical. In the basic processing flow of FIG. 2, the present invention replaces the item name dictionary matching of step S230 and the item name dictionary 304 with the frame feature difference score calculation process of step S140 and the item name-data boundary line detection process of step S150, so that item name-data relationships can be extracted even when no complete item name dictionary exists for them. The frame feature difference score calculation process of step S140 and the item name-data boundary line detection process of step S150 are described in detail below.

In the frame feature difference score calculation process of step S140, for the frames extracted in step S110 and the character strings extracted in step S120, the features of two adjacent frames are compared and a frame feature difference score representing the difference between them is calculated. The frame feature difference score is a real value representing how much the formats and the written character strings of the two frames differ. The features used include the format features of the frame (format features) and the format features of the character strings in the frame (character string features). When the input document is a scanned image, or an electronic document created from one, the features include alignment position, character string height, and the numeric representation of the character string. For an electronic document such as a Microsoft Office document, the features additionally include bold setting, italic setting, font type, font size, and background color. The calculated frame feature difference score is assigned to the frame tangent between the two frames and stored in the frame feature difference score tables 1630 and 1640 of FIG. 18. The frame tangents h1 to h13 in the frame feature difference score table 1630 are the tangents that make up the horizontal ruled lines H1 to H4 of the table shown in FIG. 16. The frame tangents v1 to v14 in the frame feature difference score table 1640 are the tangents that make up the vertical ruled lines V1 to V3 of the table shown in FIG. 16. The correspondence between each ruled line and each frame tangent may also be extracted and stored in the item name-data boundary line score tables 1610 and 1620 shown in FIG. 17. The tables 1610, 1620, 1630, and 1640 are stored in the memory 309 or the external storage device 314.
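To make the relationship between frames, tangents, and ruled lines concrete, the following sketch enumerates the tangents of a regular grid and groups them by the interior ruled line they lie on. It assumes a table with no merged cells, which is simpler than the table of FIG. 16; the function name and the "H1"/"V1" numbering of interior ruled lines are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]          # (row, column) of a frame in the grid
Tangent = Tuple[Cell, Cell]     # the two frames that share a line segment


def enumerate_tangents(
    n_rows: int, n_cols: int
) -> Tuple[Dict[str, List[Tangent]], Dict[str, List[Tangent]]]:
    """Group every pair of adjacent frames under the interior ruled line between them."""
    horizontal: Dict[str, List[Tangent]] = {}
    vertical: Dict[str, List[Tangent]] = {}
    for r in range(n_rows):
        for c in range(n_cols):
            if r + 1 < n_rows:
                # Frame (r, c) and the frame below it share a horizontal tangent.
                horizontal.setdefault(f"H{r + 1}", []).append(((r, c), (r + 1, c)))
            if c + 1 < n_cols:
                # Frame (r, c) and the frame to its right share a vertical tangent.
                vertical.setdefault(f"V{c + 1}", []).append(((r, c), (r, c + 1)))
    return horizontal, vertical


# A 2-row by 3-column grid has one interior horizontal ruled line
# and two interior vertical ruled lines.
h_lines, v_lines = enumerate_tangents(2, 3)
```

Each list in the result plays the role of one row of a ruled line-to-tangent correspondence table: attaching a difference score to every tangent and combining the scores per ruled line yields the boundary line scores.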

Next, an example of the frame feature difference score calculation process of step S140 will be described using the flowchart of FIG. 4 and the examples of FIGS. 6 and 7. The process of FIG. 4 is executed for each pair of mutually adjacent frames among the frames extracted in step S110. In step S401 of FIG. 4, the frame feature difference score is initialized to 0. In this embodiment, a larger frame feature difference score indicates a larger difference between the format features and character string features of the frame pair. In step S402, it is determined whether any frame features remain to be processed; if none remain, the process proceeds to S406. Frame features include, as illustrated by 6101 to 6111 in FIG. 6 and items (1) to (11) below, format information such as the decoration and size of a frame, as well as the font of the character string belonging to the frame, whether emphasis is applied, and the regular expression that the character string matches.
○ Examples of frame format information
(1) Alignment of the character string in the frame (left, center, or right)
(2) Difference in frame background color
(3) Ratio of frame widths
(4) Ratio of frame heights
○ Examples of format features of the character string in a frame
(5) Difference in font size
(6) Font type (Times, Arial, Century, …)
(7) Whether bold is set
(8) Whether italic is set
(9) Whether underline is set
○ Examples of semantic features of the character string in a frame
(10) Edit distance between the numeric-symbol expressions of the character strings
(11) Difference in item name score
For example, the frame features of frame 610 in FIG. 6 are: center alignment (6101), background color (R,G,B)=(240, 240, 240) (6102), frame width 324 [dot] (6103), frame height 85 [dot] (6104), font size 18 (6105), font type Arial (6106), bold set (6107), italic not set (6108), no underline (6109), and numeric-symbol expression not applicable (6110). The numeric-symbol expression is a feature showing how the character string in a frame appears when it is masked so that only numerals and symbols (comma, period, hyphen, etc.) are distinguished: "Voltage[V]" becomes "----------" (6110), and "1,350" becomes "n,nnn" (where n is a digit) (6310). The "(10) edit distance between the numeric-symbol expressions of the character strings" above is, for example, 2 for "220" and "1,350", and 0 for "220" and "240". (Edit distance is a measure of the distance between character strings, defined as the number of substitutions, insertions, and deletions needed to convert one string into another. For "nnn" → "n,nnn", two insertions are needed, so the edit distance is 2; for "nnn-nn" → "nn-nnn", one insertion and one deletion are needed, so the edit distance is also 2.)
In step S403, a difference score is calculated for each of the above features between the two adjacent frames. The difference score is set so that it grows as the feature difference grows. For example, for those of features (1) to (11) that take real values, the score can be the absolute value of the difference; for those that do not, it can be a fixed constant when the features differ and 0 when they do not. Each of features (1) to (11) may also be weighted according to its importance.
In step S404, the difference score of the frame feature currently being examined is added to the frame feature difference score, and in step S405 the next frame feature is examined. In step S406, the score obtained through S401 to S405 is written to the frame feature difference score tables 1630 and 1640.
FIG. 7 shows a calculation example of the frame feature difference score calculation process of S140. Table 7000 lists the difference scores between frame 710 and frame 720, calculated feature by feature. For example, for the background color (7002), the frame width (7003), and the frame height (7004), the difference score is calculated from the ratio or difference of the feature values of frame 1 and frame 2. For the alignment position (7001), the font size (7005), the font type (7006), the bold setting (7007), the italic setting (7008), and the underline setting (7009), the difference score is a fixed constant when the feature differs and 0 when it is the same. For the numeric-symbol expression (7010), the difference score is defined as (edit distance ÷ character string length) of the masked character strings. Summing the per-feature difference scores obtained in this way gives a frame feature difference score of 4.591 between frame 710 and frame 720.
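The loop of steps S401 to S405 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the feature names, the mismatch constant of 1.0, the use of a plain absolute difference for numeric features, and the choice of the longer masked string as the "character string length" are all assumptions. The masking rule (digits become "n"; comma, period, and hyphen are kept; everything else becomes "-") is inferred from the "Voltage[V]" and "1,350" examples.

```python
def mask_numeric_symbols(text: str) -> str:
    """Digits -> 'n'; comma, period, hyphen kept; every other character -> '-'."""
    return "".join(
        "n" if ch.isdigit() else ch if ch in ",.-" else "-" for ch in text
    )


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions, deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def _is_real(x) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def frame_feature_difference(f1: dict, f2: dict, weights=None, mismatch=1.0) -> float:
    """Sum per-feature difference scores for two adjacent frames."""
    weights = weights or {}
    score = 0.0                                    # S401: initialize
    for name in sorted(f1.keys() & f2.keys()):     # S402 / S405: iterate features
        a, b = f1[name], f2[name]
        if name == "text":                         # numeric-symbol expression
            ma, mb = mask_numeric_symbols(a), mask_numeric_symbols(b)
            longest = max(len(ma), len(mb))        # assumed normalizer
            diff = edit_distance(ma, mb) / longest if longest else 0.0
        elif _is_real(a) and _is_real(b):          # real-valued feature
            diff = abs(a - b)
        else:                                      # categorical feature
            diff = mismatch if a != b else 0.0
        score += weights.get(name, 1.0) * diff     # S403 / S404: accumulate
    return score


# Hypothetical feature dictionaries for an item name frame and a data frame.
header = {"align": "center", "bold": True, "font_size": 18, "text": "Voltage[V]"}
data = {"align": "right", "bold": False, "font_size": 16, "text": "220"}
score = frame_feature_difference(header, data)  # 1.0 + 1.0 + 2.0 + 1.0 = 5.0
```

The sample values are invented and are not meant to reproduce the 4.591 of FIG. 7, which depends on per-feature constants and ratios that the text does not spell out. Passing a `weights` dictionary corresponds to the optional per-feature weighting mentioned for step S403.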

Next, in the item name-data boundary detection process of step S150, the item name-data boundary line, that is, the boundary between item name frames and data frames, is detected based on the frame feature difference scores calculated for every pair of adjacent frames in the frame feature difference score calculation process of step S140. The item name-data boundary line is detected by calculating an item name-data boundary score. This score is calculated for every ruled line in the table, in such a way that the more likely a ruled line is to be an item name-data boundary line, the larger its score.

An example of the item name-data boundary detection process of step S150 in FIG. 1 will be described using the flowchart of FIG. 5 and the example of FIG. 8. The processing flow of FIG. 5 is executed for every ruled line in the table. In the example of FIG. 8, it is executed for each of the vertical ruled lines V1, V2, and V3 and the horizontal ruled lines H1, H2, H3, and H4. Here, detection for the ruled line H2 is described as an example. In step S501, the item name-data boundary score of the ruled line is initialized to 0. In step S502, it is determined whether any frame tangents in the table remain to be examined. If so, the process proceeds to step S503; if not, it proceeds to step S506. In step S503, the item name-data boundary score tables 1610 and 1620 of FIG. 17 are consulted to determine whether the frame tangent currently being examined belongs to the ruled line; if it does, the process proceeds to step S504, and if not, to step S505. In the example of FIG. 8, frame tangents 801, 802, 803, and 804 belong to the ruled line H2. In step S504, the frame feature difference score of the frame tangent, obtained from the frame feature difference score tables 1630 and 1640, is added to the item name-data boundary score of the ruled line. In step S505, the next frame tangent is examined. Once all frame tangents have been examined, the process proceeds to step S506. In step S506, the item name-data boundary score accumulated in steps S502 to S505 is normalized by the number of frame tangents. In FIG. 8, the number of frame tangents belonging to the ruled line H2 is 4. Next, in step S507, it is determined whether the item name-data boundary score is greater than a predetermined threshold α. If it is, the process proceeds to step S508 and the ruled line is determined to be an item name-data boundary line; otherwise, the detection process for that ruled line ends. In FIG. 8, the ruled lines H2 and V1 are detected as item name-data boundary lines. As an alternative to S507 and S508, the item name-data boundary scores of all ruled lines may be compared without fixing a threshold in advance, and the ruled line with the highest score, or a predetermined number of the highest-scoring ruled lines, may be taken as the boundary lines.
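Steps S501 to S508 amount to averaging the tangent scores on each ruled line and then selecting lines by threshold or by rank. The Python sketch below illustrates this under assumptions: the function names, the dictionary layout, and all score values are hypothetical, and the loop iterates directly over the tangents of each ruled line rather than scanning every tangent in the table and testing membership, which gives the same result.

```python
def boundary_scores(line_tangents: dict, tangent_score: dict) -> dict:
    """Average the frame feature difference scores of the tangents on each ruled line."""
    scores = {}
    for line_id, tangents in line_tangents.items():
        if not tangents:                    # no tangents: nothing to average
            scores[line_id] = 0.0
            continue
        total = 0.0                         # S501: initialize
        for t in tangents:                  # S502 to S505: accumulate
            total += tangent_score[t]
        scores[line_id] = total / len(tangents)   # S506: normalize
    return scores


def detect_boundaries(scores: dict, alpha=None, top_k=1) -> list:
    """Select boundary lines by threshold alpha (S507/S508) or, if alpha is None, by rank."""
    if alpha is not None:
        return sorted(line for line, s in scores.items() if s > alpha)
    ranked = sorted(scores, key=lambda line: scores[line], reverse=True)
    return sorted(ranked[:top_k])


# Invented tangent assignment and scores, for illustration only.
line_tangents = {"H1": ["h1", "h2"], "H2": ["h3", "h4", "h5", "h6"], "V1": ["v1", "v2"]}
tangent_score = {"h1": 0.5, "h2": 0.5, "h3": 4.0, "h4": 5.0,
                 "h5": 4.0, "h6": 3.0, "v1": 3.0, "v2": 3.0}
scores = boundary_scores(line_tangents, tangent_score)
by_threshold = detect_boundaries(scores, alpha=2.0)   # ["H2", "V1"]
by_rank = detect_boundaries(scores, top_k=1)          # ["H2"]
```

Normalizing by the tangent count keeps long ruled lines, which collect more tangents, from outscoring short ones merely because they have more terms in the sum.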

In the item name frame determination process of step S160, when the item name dictionary 304 is available, the frame containing a character string judged to be an item name in the item name dictionary collation process of step S230 can be determined to be an item name frame. When the item name dictionary 304 is not used, as in the processing flow of the inventive method in FIG. 1, the item name frames can be determined in a different way. One such method is to judge the frames located above, and the frames located to the left of, a ruled line that was judged to be an item name-data boundary line in step S150 to be item name frames. In FIG. 8, the horizontal ruled line H2 and the vertical ruled line V1 are judged to be item name-data boundary lines. The frames located above the horizontal ruled line H2, namely those containing the character strings "Electric system", "Input", "Voltage[V]", "Power[W]", and "DC/AC", are therefore judged to be item name frames, as are the frames located to the left of the vertical ruled line V1, namely those containing the character strings "Minimum", "Normal", and "Maximum".
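The position-based rule (frames above a horizontal boundary line or to the left of a vertical boundary line are item name frames) can be sketched in Python as follows. The `Frame` type, the coordinate convention (y grows downward), and the small grid are hypothetical; the grid is not the actual layout of FIG. 8.

```python
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    """Hypothetical frame record: text plus bounding box, y increasing downward."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def item_name_frames(frames: List[Frame],
                     h_boundary_y: Optional[float] = None,
                     v_boundary_x: Optional[float] = None) -> List[Frame]:
    """Return frames lying above the horizontal boundary or left of the vertical one."""
    result = []
    for f in frames:
        above = h_boundary_y is not None and f.bottom <= h_boundary_y
        left_of = v_boundary_x is not None and f.right <= v_boundary_x
        if above or left_of:
            result.append(f)
    return result


# Invented 2x3 grid: one header row and one header column.
frames = [
    Frame("Voltage[V]", 100, 0, 200, 50),
    Frame("Power[W]", 200, 0, 300, 50),
    Frame("Minimum", 0, 50, 100, 100),
    Frame("220", 100, 50, 200, 100),
    Frame("1,350", 200, 50, 300, 100),
]
names = [f.text for f in item_name_frames(frames, h_boundary_y=50, v_boundary_x=100)]
# names == ["Voltage[V]", "Power[W]", "Minimum"]
```

Either boundary argument can be omitted for tables that have only a header row or only a header column.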

Through the above processing, output results such as those shown in the lower parts of FIGS. 14 and 15 can be obtained from table structures such as those shown in the upper parts of FIGS. 14 and 15, and displayed on the output device 303.

As described above, the processing flow of the present method explained with reference to FIG. 1 includes the frame feature difference score calculation process and the item name-data boundary detection process, which makes it possible to extract item name-data relationships from a table without using an item name dictionary. As a result, item name-data relationships can be extracted even when the item name words to be extracted are unknown or when item names have spelling variations. In other words, the method can be applied to use cases, such as analyzing a large number of unspecified tables, to which conventional item name-data extraction methods could not be applied.
<Second Embodiment>
Hereinafter, application examples of the method of the present invention will be described.

The processing flow of FIG. 1 described above includes the frame feature difference score calculation process of step S140 and the item name-data boundary detection process of step S150, and determines item name frames based on differences in frame format information and in the character string information within frames. This makes it possible to extract item name-data relationships even when the item name dictionary is incomplete for the relationships to be extracted. A conventional item name dictionary can also be used in combination, in which case item name-data relationships can be extracted even more stably. This configuration is described below with reference to the processing flow diagram of FIG. 9.

The processing flow of FIG. 9 adds the item name matching process of step S930 and the item name dictionary 304 to steps S110 to S170 of the processing flow of FIG. 1. The frame detection process (steps S910 and S110), the character string detection process (steps S920 and S120), the item name-data boundary detection process (steps S950 and S150), the item name frame determination process (steps S960 and S160), and the item name-data relationship determination process (steps S970 and S170) are respectively identical. The item name matching process of step S930 is also identical to that of step S230 in FIG. 2.

The frame feature difference score calculation process of step S940 is almost the same as step S140, except that an item name dictionary score 6111 is added to the frame features 6101 to 6110 of FIG. 6. The item name dictionary score indicates how likely a word defined in the item name dictionary is to be an item name. It is assigned when a character string matches an item name word in the item name dictionary during the item name dictionary collation process of step S930. FIG. 11 shows an example of the item name dictionary. The dictionary stores item name words and an item name dictionary score for each word. The score can be defined, for example, as the number of past occurrences of an item name normalized by the total number of occurrences of all item names. In FIG. 7, the character string "Voltage[V]" in frame 710 matches the item name word "Voltage[V]" in the item name dictionary of FIG. 11. In this case, the item name dictionary score 0.24 associated with the item name word "Voltage[V]" becomes the item name dictionary score of frame 710. The character string "220" in frame 720, on the other hand, matches no item name word in the dictionary, so the item name dictionary score of frame 720 is 0.00. The difference score between the item name dictionary scores of frames 710 and 720 is therefore 0.24 (7011). By including this difference score when calculating the frame feature difference score, the item name-data boundary line can be detected even in tables whose formatting shows no difference between the frame features of item names and data, provided that some words and frames partially match the item name dictionary. Moreover, even if the item name dictionary is incomplete, combining it with the item name-data boundary detection process of step S950 allows the item name-data boundary line to be detected stably.
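The dictionary-score feature can be illustrated with a short Python sketch. Only the entry "Voltage[V]" with score 0.24 comes from the example above; the exact-match lookup, the function names, and the omission of the other dictionary entries of FIG. 11 are assumptions made for illustration.

```python
def dictionary_score(text: str, item_dict: dict) -> float:
    """Return the item name dictionary score of a string, or 0.0 if it is not registered."""
    return item_dict.get(text, 0.0)


def dictionary_score_difference(text_a: str, text_b: str, item_dict: dict) -> float:
    """Difference score between the dictionary scores of two adjacent frames."""
    return abs(dictionary_score(text_a, item_dict) - dictionary_score(text_b, item_dict))


# Partial dictionary: the remaining entries of FIG. 11 are not reproduced here.
item_dict = {"Voltage[V]": 0.24}
diff = dictionary_score_difference("Voltage[V]", "220", item_dict)  # 0.24
```

Because unregistered strings score 0.0, two data frames and two unregistered header frames both yield a difference of 0, so this feature only contributes where a registered item name sits next to an unregistered string.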
<Third embodiment>
Next, another application example of the present invention will be described.

With the processing flow of FIG. 9, item name-data relationships can be extracted stably even when the item name dictionary is incomplete for the relationships to be extracted, or when the table's formatting shows no difference between the frame features of item names and data. In addition, when a word judged to be an item name during the analysis of a table does not exist in the item name dictionary, adding that word to the dictionary enables automatic learning and automatic maintenance of the dictionary, so that other tables analyzed later can be analyzed using the maintained dictionary. For example, in the item name-data relationship determination example of FIG. 8 and the item name dictionary example of FIG. 11, if frames 810, 820, and 830 are judged to be item name frames, the character string "Maximum" in frame 810 and the character string "Minimum" in frame 830 are registered in the item name dictionary 1101, but the character string "Normal" in frame 820 is not. By newly adding the item name word "Normal" to the item name dictionary 1101, item name-data relationships can be extracted from tables with a different layout that contain the word "Normal".

The processing flow of FIG. 10 adds the item name dictionary correction process of step S1080 to steps S910 to S970 of the processing flow of FIG. 9. With the processing of FIG. 10, the item name dictionary can be corrected automatically based on table analysis results, which has the advantage that the dictionary adapts to the set of documents being analyzed without the user having to maintain it. In the processing flows of FIGS. 9 and 10, the frame detection process (steps S910 and S1010), the character string detection process (steps S920 and S1020), the item name dictionary collation process (steps S930 and S1030), the frame feature difference score calculation process (steps S940 and S1040), the item name-data boundary detection process (steps S950 and S1050), the item name frame determination process (steps S960 and S1060), and the item name-data relationship determination process (steps S970 and S1070) are respectively identical.

In the item name dictionary correction process of step S1080, based on the item names extracted in the item name-data relationship determination process of step S1070, new item name words are added to the item name dictionary, or the item name dictionary score associated with each item name word is corrected.

FIG. 13 shows an example processing flow of the item name dictionary correction process, and FIG. 12 shows an example of a corrected item name dictionary. For the purpose of explanation, assume that the item names "Voltage[V]", "Power[W]", "Maximum", "Normal", and "Minimum" have been extracted, as in the example of FIG. 8. In step S1301, the item name-data relationship extraction results are consulted and an extracted item name is taken up. Next, in step S1302, it is checked whether any extraction results remain; if so, the process proceeds to step S1303, and if not, to step S1307. In step S1303, it is determined whether the extracted item name is in the item name dictionary; if it is, the process proceeds to step S1304, and if not, to step S1305. For example, with the extraction example of FIG. 8 and the pre-correction item name dictionary 1201 of FIG. 12, the extracted item names "Voltage[V]", "Maximum", and "Minimum" are found in the dictionary, whereas "Power[W]" and "Normal" are not. In step S1304, the occurrence count of the dictionary word that matches the extracted item name is incremented by 1. In the case of FIG. 8, for example, the occurrence count of "Voltage[V]" is incremented from 145 to 146. In step S1305, the extracted item name is newly added to the item name dictionary. Comparing the pre-correction item name dictionary 1201 with the post-correction item name dictionary 1202 in FIG. 12, "Power[W]" and "Normal" have been newly added. The initial occurrence count may be 1, or it may be some constant (10 in this example). In step S1306, the next extraction result is examined. In step S1307, the occurrence counts of the item names in the dictionary are normalized. For example, they may be normalized by dividing by the total occurrence count of all item names, or by dividing by the occurrence count of the most frequent item name.
Through the item name dictionary correction process of step S1080 described above, the item name dictionary is corrected as needed based on the item name-data extraction results. As table analyses accumulate, words that were not in the initial item name dictionary are added automatically and the item name dictionary scores of frequently occurring item names are raised automatically, so that the dictionary is learned and corrected without user maintenance.
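The update loop of steps S1301 to S1307 can be sketched in Python as follows. The function name and the dictionary-of-counts representation are hypothetical. The count 145 for "Voltage[V]" and the initial count of 10 for new words follow the example in the text; the counts for "Maximum" and "Minimum" are invented placeholders, since FIG. 12 is not reproduced here.

```python
def update_item_dictionary(counts: dict, extracted: list, initial_count: int = 1) -> dict:
    """Update occurrence counts in place and return normalized dictionary scores."""
    for name in extracted:                  # S1301, S1302, S1306: walk the results
        if name in counts:                  # S1303: already registered?
            counts[name] += 1               # S1304: increment occurrence count
        else:
            counts[name] = initial_count    # S1305: register a new item name
    total = sum(counts.values())            # S1307: normalize by the total count
    if total == 0:
        return {}
    return {name: c / total for name, c in counts.items()}


# 145 is from the text; 30 and 25 are invented placeholder counts.
counts = {"Voltage[V]": 145, "Maximum": 30, "Minimum": 25}
extracted = ["Voltage[V]", "Power[W]", "Maximum", "Normal", "Minimum"]
scores = update_item_dictionary(counts, extracted, initial_count=10)
# counts["Voltage[V]"] is now 146; "Power[W]" and "Normal" are added with count 10.
```

This sketch normalizes by the total count, so the scores sum to 1. Dividing by the largest count instead, the other option mentioned above, would give the most frequent item name a score of 1.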

331 ... Table structure analysis computer
301 ... Document input device
303 ... Output device
309 ... Memory
310 ... CPU
311, 312 ... I/F
313 ... Network I/F
314 ... External storage device
315 ... Network
H1 to H4 ... Horizontal ruled lines constituting the table
V1 to V3 ... Vertical ruled lines constituting the table
h1 to h13 ... Frame tangents constituting the horizontal ruled lines H1 to H4
v1 to v14 ... Frame tangents constituting the vertical ruled lines V1 to V3

Claims

A table structure analyzing apparatus having a control unit, a storage unit connected to the control unit, and an input unit for inputting image information,
wherein the control unit:
detects, from the image information input from the input unit, frame regions constituting a table included in the image;
calculates a frame feature amount indicating a feature amount of each detected frame region;
calculates a difference between the frame feature amounts of two adjacent frame regions;
associates the calculated difference between the frame feature amounts with a line segment between the two adjacent frame regions;
determines, as a boundary line, a ruled line serving as a boundary between a frame region in which an item name exists in the table and a frame region in which data exists, based on the differences in the frame feature amounts associated with the line segments constituting the ruled lines that constitute the table; and
determines, based on the boundary line, the frame region in which an item name exists in the table, or a correspondence between the frame region in which an item name exists and the frame region in which data exists.

The table structure analyzing apparatus according to claim 1,
wherein the storage unit stores an item name dictionary in which item names are registered, and
the control unit registers, as an item name in the item name dictionary, a character string in a frame region determined to be a frame region in which the item name exists.

The table structure analyzing apparatus according to claim 1,
wherein the storage unit stores an item name dictionary in which a character string and likelihood information indicating the item name likelihood of the character string are registered in association with each other, and
the control unit changes, among the likelihood information registered in the item name dictionary, the likelihood information corresponding to a character string in a frame region determined to be a frame region in which the item name exists.

The table structure analyzing apparatus according to claim 1,
wherein the storage unit stores an item name dictionary in which a character string and likelihood information indicating the item name likelihood of the character string are registered in association with each other, and
the control unit:
detects a character string in the detected frame region;
calculates, from the item name dictionary, the likelihood information corresponding to the detected character string as the frame feature amount; and
calculates a difference in the likelihood information between two adjacent frame regions as the difference in the frame feature amounts between the two adjacent frame regions.

The table structure analyzing apparatus according to claim 1,
wherein the control unit:
calculates, as the frame feature amount, a frame format feature amount indicating a feature amount of the frame format of the detected frame region; and
calculates a difference in the frame format feature amounts between two adjacent frame regions as the difference in the frame feature amounts between the two adjacent frame regions.

The table structure analyzing apparatus according to claim 1,
wherein the control unit:
detects a character string in the detected frame region;
calculates, as the frame feature amount, a character string feature amount indicating a feature amount of the detected character string; and
calculates a difference in the character string feature amounts between two adjacent frame regions as the difference in the frame feature amounts between the two adjacent frame regions.

The table structure analyzing apparatus according to claim 1,
wherein the control unit:
calculates a frame feature difference value indicating the difference in the frame feature amounts between two adjacent frame regions;
associates the calculated frame feature difference value with the line segment between the two adjacent frame regions;
calculates, based on the frame feature difference values associated with the line segments constituting a ruled line, boundary line likelihood information indicating how likely the ruled line is to be the boundary line; and
determines the ruled line to be the boundary line when the boundary line likelihood information of the ruled line is equal to or greater than a predetermined value.

The table structure analyzing apparatus according to claim 1,
wherein the control unit:
calculates a frame feature difference value indicating the difference in the frame feature amounts between two adjacent frame regions;
associates the calculated frame feature difference value with the line segment between the two adjacent frame regions;
calculates, based on the frame feature difference values associated with the line segments constituting a ruled line, boundary line likelihood information indicating how likely the ruled line is to be the boundary line; and
determines, as the boundary lines, a predetermined number of ruled lines having large boundary line likelihood information among the plurality of ruled lines constituting the table.

A table structure analyzing apparatus having a control unit and an input unit to which image information is input,
wherein the control unit:
detects, from the image information input from the input unit, frame regions constituting a table included in the image;
calculates a frame feature amount indicating a feature amount of each detected frame region;
calculates a difference between the frame feature amounts of two adjacent frame regions; and
detects a frame region in which an item name exists based on the calculated difference between the frame feature amounts.

A table structure analysis method for analyzing the structure of a table included in electronic information, comprising:
detecting, from input image information, frame regions constituting a table included in the image;
calculating a frame feature amount indicating the feature amount of each detected frame region;
calculating the difference in the frame feature amounts between two adjacent frame regions;
associating the calculated difference in the frame feature amounts with the line segment between the two adjacent frame regions;
determining, as a boundary line, a ruled line serving as the boundary between a frame region in which an item name exists in the table and a frame region in which data exists, based on the differences in the frame feature amounts associated with the line segments constituting the ruled lines of the table; and
determining, based on the boundary line, the frame region in which an item name exists in the table, or the correspondence between a frame region in which an item name exists and a frame region in which data exists.
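
The final step of this method, deriving the item name and data correspondence from the boundary line, might look as follows in the simplest case. This is a sketch under strong assumptions: the table is a regular grid, exactly one horizontal boundary line has been found, the row immediately above it holds the item names, and each data frame is paired with the item name in the same column. The function name `pair_items_with_data` is hypothetical.

```python
from typing import List, Tuple


def pair_items_with_data(
    cells: List[List[str]], boundary_row: int
) -> List[Tuple[str, str]]:
    """Pair each data frame with the item name above it.

    boundary_row is the index r of the horizontal boundary line lying
    between row r and row r + 1 (assumed to be already determined).
    """
    item_names = cells[boundary_row]  # row directly above the boundary
    pairs = []
    for data_row in cells[boundary_row + 1:]:
        for name, value in zip(item_names, data_row):
            pairs.append((name, value))
    return pairs


cells = [["Name", "Price"], ["Apple", "100"], ["Pear", "120"]]
print(pair_items_with_data(cells, 0))
# [('Name', 'Apple'), ('Price', '100'), ('Name', 'Pear'), ('Price', '120')]
```

Tables with merged frames, a vertical boundary, or multi-level item names would need a pairing rule based on frame adjacency rather than on column index.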

The table structure analysis method according to claim 10,
wherein the character string in a frame region determined to be a frame region in which an item name exists is registered as an item name in an item name dictionary in which item names are registered.

The table structure analysis method according to claim 10,
wherein, of the likelihood information registered in an item name dictionary in which character strings are registered in association with likelihood information indicating the item name likelihood of each character string, the likelihood information corresponding to the character string in a frame region determined to be a frame region in which an item name exists is changed.
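
One possible form of this dictionary update is sketched below. The claim only states that the likelihood information is changed; the additive update, the `step` size, the cap at 1.0, and the treatment of unregistered strings as having likelihood 0.0 are all assumptions made for illustration, as is the function name `reinforce_item_names`.

```python
from typing import Dict, Iterable


def reinforce_item_names(
    dictionary: Dict[str, float], item_name_strings: Iterable[str], step: float = 0.1
) -> Dict[str, float]:
    """Raise the item name likelihood of strings found in item name frames."""
    for text in item_name_strings:
        current = dictionary.get(text, 0.0)  # assumption: unknown strings start at 0.0
        dictionary[text] = min(1.0, current + step)  # keep the likelihood within [0, 1]
    return dictionary


item_name_dictionary = {"Price": 0.5}
reinforce_item_names(item_name_dictionary, ["Price", "Qty"], step=0.25)
print(item_name_dictionary)  # {'Price': 0.75, 'Qty': 0.25}
```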

The table structure analysis method according to claim 10, further comprising:
detecting a character string in the detected frame region; and
extracting, as the frame feature amount, the likelihood information corresponding to the detected character string from an item name dictionary in which character strings and likelihood information indicating the item name likelihood of each character string are registered in association with each other,
wherein the difference in the likelihood information between two adjacent frame regions is calculated as the difference in the frame feature amounts between the two adjacent frame regions.
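
As a rough illustration of using dictionary likelihoods as the frame feature, the sketch below looks up both adjacent strings and takes the absolute difference. The default likelihood of 0.0 for strings absent from the dictionary and the function name `likelihood_difference` are assumptions.

```python
from typing import Dict


def likelihood_difference(
    dictionary: Dict[str, float], text_a: str, text_b: str, default: float = 0.0
) -> float:
    """Difference in item name likelihood between two adjacent frame strings."""
    likelihood_a = dictionary.get(text_a, default)
    likelihood_b = dictionary.get(text_b, default)
    return abs(likelihood_a - likelihood_b)


item_name_dictionary = {"Price": 0.75, "Name": 0.5}
print(likelihood_difference(item_name_dictionary, "Price", "100"))   # 0.75
print(likelihood_difference(item_name_dictionary, "Price", "Name"))  # 0.25
```

A large difference suggests that the line segment between the two frames separates an item name from data; two item names, or two data values, tend to produce a small one.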

The table structure analysis method according to claim 10,
wherein a frame format feature amount indicating the format features of the frame in the detected frame region is extracted as the frame feature amount, and
the difference in the frame format feature amounts between two adjacent frame regions is calculated as the difference in the frame feature amounts between the two adjacent frame regions.

The table structure analysis method according to claim 10, further comprising:
detecting a character string in the detected frame region; and
calculating, as the feature amount of the frame region, a character string feature amount indicating the feature amount of the detected character string,
wherein the difference in the character string feature amounts between two adjacent frame regions is calculated as the difference in the frame feature amounts between the two adjacent frame regions.
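
A character string feature amount could take many forms. The sketch below assumes a two-element feature (the ratio of digits and the ratio of letters in the string) and an L1 distance between the features of adjacent frames. Both choices, and the names `string_features` and `string_feature_difference`, are illustrative assumptions rather than the features defined in the specification.

```python
from typing import Tuple


def string_features(text: str) -> Tuple[float, float]:
    """Return (digit ratio, letter ratio) of a frame's character string."""
    if not text:
        return (0.0, 0.0)  # empty frame: no characters to classify
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    return (digits / len(text), letters / len(text))


def string_feature_difference(text_a: str, text_b: str) -> float:
    """L1 distance between the string features of two adjacent frames."""
    features_a = string_features(text_a)
    features_b = string_features(text_b)
    return sum(abs(a - b) for a, b in zip(features_a, features_b))


print(string_feature_difference("Price", "1200"))  # 2.0
print(string_feature_difference("Apple", "Pear"))  # 0.0
```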