JP6430919B2

JP6430919B2 - Ruled line frame correction method, ruled line frame correction apparatus, and ruled line frame correction program

Info

Publication number: JP6430919B2
Application number: JP2015232410A
Authority: JP
Inventors: 高木 郁子; 山田 光一; 名和 長年; 丸山 勉
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-11-27
Filing date: 2015-11-27
Publication date: 2018-11-28
Anticipated expiration: 2035-11-27
Also published as: JP2017097805A

Description

The present invention relates to a ruled line frame correction method, a ruled line frame correction apparatus, and a ruled line frame correction program.

In business operations, forms created as electronic files or on paper are used in a wide range of situations. A form often consists of item names and the item values that correspond to them. In that case, the form can be represented structurally by logically associating item names with other item names, or item names with item values.

Conventionally, there is a known technique that uses information such as the position and size of the ruled line frames of a form to automatically estimate the logical relationships between item names (hereinafter, "between item names") or between an item name and an item value (hereinafter, "between item name and item value"), and that outputs the estimation result as tree structure data (see, for example, Non-Patent Document 1). With this technique, the output tree structure data can be used to link data semi-automatically between the form and other systems.

Non-Patent Document 1: 高木郁子, 名和長年, 丸山勉, "Examination of extraction methods for cross-sectional data manipulation techniques for electronic forms," IEICE Technical Report LOIS2014-11, July 17, 2014.

However, the conventional technique has a problem: when item names or item values are entered in a form in a manner that does not satisfy predetermined requirements, the logical relationships between item names and between item names and item values may not be estimated accurately.

For example, the conventional technique may estimate logical relationships on the premise that each ruled line frame contains exactly one item name or one item value. In that case, entering one item name or item value per ruled line frame is one of the requirements for estimating logical relationships with the conventional technique.

One example in which this requirement is not satisfied is a single ruled line frame that contains both an item name and an item value. In that case, the conventional technique cannot recognize the item name or item value correctly, and may fail to estimate accurately the logical relationships between item names and between item names and item values.

Another example in which the requirement is not satisfied is a single item name or item value that is written across a plurality of ruled line frames. In that case as well, the conventional technique cannot recognize the item name or item value correctly, and may fail to estimate accurately the logical relationships between item names and between item names and item values.

The ruled line frame correction method of the present invention includes: an acquisition step of extracting ruled line frames from a form and acquiring, as ruled line frame information for each ruled line frame, at least the type or thickness of the ruled lines, the character string in the frame, and the fill color in the frame; a combining step of combining a plurality of the ruled line frames when the ruled line frame information of those frames satisfies a preset ruled line frame combination condition; a dividing step of dividing a ruled line frame, after the combining step has been executed, when the ruled line frame information of that frame satisfies a preset ruled line frame division condition; and a deletion step of deleting a ruled line frame, after the dividing step has been executed, when the ruled line frame information of that frame satisfies a preset ruled line frame deletion condition.

The ruled line frame correction apparatus of the present invention includes: an acquisition unit that extracts ruled line frames from a form and acquires, as ruled line frame information for each ruled line frame, at least the type or thickness of the ruled lines, the character string in the frame, and the fill color in the frame; a combining unit that combines a plurality of the ruled line frames when the ruled line frame information of those frames satisfies a preset ruled line frame combination condition; a dividing unit that divides a ruled line frame, after the combining unit has executed its processing, when the ruled line frame information of that frame satisfies a preset ruled line frame division condition; and a deletion unit that deletes a ruled line frame, after the dividing unit has executed its processing, when the ruled line frame information of that frame satisfies a preset ruled line frame deletion condition.
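The fixed order of the correction steps described above (combine, then divide, then delete) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the invention: the `Frame` class, `correct_frames`, and the toy rule functions are hypothetical names, and the toy rules merely stand in for the preset combination, division, and deletion conditions, which this passage does not spell out.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Frame:
    """Hypothetical, simplified ruled line frame (position, size, text only)."""
    x: int
    y: int
    width: int
    height: int
    text: str = ""


def correct_frames(frames, can_combine, combine, should_divide, divide, should_delete):
    """Apply combination, then division, then deletion, in that order."""
    frames = list(frames)
    # 1) Combination: merge pairs until no pair satisfies the condition.
    changed = True
    while changed:
        changed = False
        for a, b in permutations(frames, 2):
            if can_combine(a, b):
                frames = [f for f in frames if f is not a and f is not b]
                frames.append(combine(a, b))
                changed = True
                break
    # 2) Division: runs only after all combinations are done.
    divided = []
    for f in frames:
        divided.extend(divide(f) if should_divide(f) else [f])
    # 3) Deletion: runs only after all divisions are done.
    return [f for f in divided if not should_delete(f)]


# Toy rules (assumptions for illustration only).
def touching_blank(a, b):
    # b is an empty frame directly to the right of a, in the same row.
    return a.y == b.y and a.height == b.height and a.x + a.width == b.x and b.text == ""


def merge(a, b):
    return Frame(a.x, a.y, a.width + b.width, a.height, a.text)


def has_colon(f):
    # Stand-in for "item name and item value share one frame".
    return ":" in f.text


def split_on_colon(f):
    name, value = f.text.split(":", 1)
    half = f.width // 2
    return [Frame(f.x, f.y, half, f.height, name.strip()),
            Frame(f.x + half, f.y, f.width - half, f.height, value.strip())]


def is_blank(f):
    return f.text == ""


frames = [Frame(0, 0, 10, 5, "Name: Alice"), Frame(10, 0, 10, 5), Frame(0, 5, 20, 5)]
result = correct_frames(frames, touching_blank, merge, has_colon, split_on_colon, is_blank)
```

In this toy run, the first two frames are merged, the merged frame is then divided into a name frame and a value frame, and the remaining blank frame is deleted last.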

According to the present invention, the logical relationships between item names and between item names and item values can be estimated accurately even when item names or item values are entered in a form in a manner that does not satisfy the predetermined requirements.

FIG. 1 is a diagram illustrating an example of a form file and tree structure data.
FIG. 2 is a diagram illustrating a configuration example of the data structure extraction apparatus.
FIG. 3-1 is a diagram illustrating a configuration example of the data structure extraction unit.
FIG. 3-2 is a diagram illustrating an example of ruled line frames to be corrected.
FIG. 4-1 is a flowchart illustrating the processing procedure of the data structure extraction apparatus.
FIG. 4-2 is a flowchart illustrating the graph generation process of S2 in FIG. 4-1.
FIG. 4-3 is a flowchart illustrating the ruled line frame correction process of S3 in FIG. 4-1.
FIG. 4-4 is a flowchart illustrating the tree structure estimation process of S5 in FIG. 4-1.
FIG. 5-1 is a flowchart illustrating an example of the operation interface identification process of S131 in FIG. 4-2.
FIG. 5-2 is a flowchart illustrating an example of the form format information acquisition process of S132 in FIG. 4-2.
FIG. 5-3 is a flowchart illustrating an example of the node generation process of S133 in FIG. 4-2.
FIG. 5-4 is a diagram for explaining an example of the node generation process.
FIG. 5-5 is a flowchart illustrating an example of the property information acquisition process of S134 in FIG. 4-2.
FIG. 6 is a diagram illustrating an example of property information and tree structure data in the form database.
FIG. 7-1 is a flowchart illustrating an example of the adjacent edge generation process of S135 in FIG. 4-2.
FIG. 7-2 is a diagram for explaining an example of the adjacent edge generation process.
FIG. 7-3 is a flowchart illustrating an example of the process of obtaining adjacent edges between nodes in S1353 of FIG. 7-1.
FIG. 7-4 is a flowchart illustrating an example of the process of checking adjacency relationships between nodes in S1354 of FIG. 7-1.
FIG. 8 is a flowchart illustrating an example of the inclusion edge generation process of S136 in FIG. 4-2.
FIG. 9 is a flowchart illustrating an example of the processing procedure of the item name registration unit.
FIG. 10-1 is a flowchart illustrating an example of the process of classifying nodes into node clusters in S4 of FIG. 4-1.
FIG. 10-2 is a diagram illustrating an example of node clusters.
FIG. 10-3 is a flowchart illustrating an example of the process of clustering other nodes Y starting from an arbitrary node X in S1383 of FIG. 10-1.
FIG. 11 is a flowchart illustrating an example of the item name assignment process.
FIG. 12-1 is a flowchart illustrating an example of the partial tree pattern generation process of S140 in FIG. 4-4.
FIG. 12-2 is a diagram for explaining the hierarchy of inclusion nodes.
FIG. 12-3 is a flowchart illustrating an example of the partial tree pattern acquisition process of S1405 in FIG. 12-1.
FIG. 12-4 is a flowchart illustrating an example of the tree structure conversion process for C(X, k) in S14056 and S14060 of FIG. 12-3.
FIG. 12-5 is a diagram for explaining the assignment of item attributes and the modification of adjacent edges in accordance with the table type/enumeration type estimation rule.
FIG. 13 is a flowchart illustrating an example of the tree structure data construction process of S141 in FIG. 4-4.
FIG. 14-1 is a flowchart illustrating an example of the tree structure data selection process of S142 in FIG. 4-4.
FIG. 14-2 is a flowchart illustrating an example of the tree structure data selection process in accordance with the tree structure data selection rule of S1424 in FIG. 14-1.
FIG. 15 is a flowchart illustrating an example of the form structure construction process of S6 in FIG. 4-1.
FIG. 16-1 is a diagram illustrating an example of the form structure rules.
FIG. 16-2 is a diagram illustrating an example of a form.
FIG. 16-3 is a diagram for explaining the generation of adjacent edges of node X.
FIG. 16-4 is a diagram for explaining the check of adjacent edges of node X.
FIG. 16-5 is a diagram for explaining the generation of inclusion edges of node X.
FIG. 16-6 is a diagram for explaining an example of the tree structure conversion process.
FIG. 16-7 is a diagram for explaining an example of the table type/enumeration type setting.
FIG. 17 is a flowchart illustrating an example of the ruled line frame combination process.
FIG. 18 is a flowchart illustrating an example of combination process A of S233 in FIG. 17.
FIG. 19 is a flowchart illustrating an example of combination process B of S234 in FIG. 17.
FIG. 20 is a diagram for explaining combination process A of S233 and combination process B of S234 in FIG. 17.
FIG. 21 is a flowchart illustrating an example of the ruled line frame division process.
FIG. 22 is a flowchart illustrating an example of division process A of S244 in FIG. 21.
FIG. 23 is a diagram for explaining division process A of S244 in FIG. 21.
FIG. 24 is a diagram for explaining division process A of S244 in FIG. 21.
FIG. 25 is a flowchart illustrating an example of division process B of S245 in FIG. 21.
FIG. 26 is a diagram for explaining division process B of S245 in FIG. 21.
FIG. 27 is a flowchart illustrating an example of division process C of S246 in FIG. 21.
FIG. 28 is a diagram for explaining division process C of S246 in FIG. 21.
FIG. 29 is a flowchart illustrating an example of the ruled line frame deletion process.
FIG. 30 is a diagram for explaining the ruled line frame deletion process.
FIG. 31 is a flowchart illustrating an example of the ruled line frame addition process.
FIG. 32 is a flowchart illustrating an example of addition process A of S263 in FIG. 31.
FIG. 33 is a diagram for explaining addition process A of S263 in FIG. 31.
FIG. 34 is a diagram for explaining addition process A of S263 in FIG. 31.
FIG. 35 is a diagram for explaining addition process A of S263 in FIG. 31.
FIG. 36 is a flowchart illustrating an example of addition process B of S264 in FIG. 31.
FIG. 37 is a diagram for explaining addition process B of S264 in FIG. 31.
FIG. 38 is a flowchart illustrating another example of the ruled line frame correction process of S3 in FIG. 4-1.
FIG. 39 is a flowchart illustrating another example of the ruled line frame correction process of S3 in FIG. 4-1.
FIG. 40 is a diagram illustrating a computer that executes the ruled line frame correction program.

Hereinafter, a mode for carrying out the present invention (an embodiment) will be described with reference to the drawings. The present invention is not limited to this embodiment.

(Overview)
First, the form file handled by the data structure extraction apparatus 10 will be described with reference to FIG. 1.

A form file consists of one or more sheets, and each sheet contains a table (form) showing item names and the item values corresponding to those item names, as indicated by reference numerals 101 and 102. An inclusion relationship may exist between item names in a form, or between an item name and an item value. For example, in the form indicated by reference numeral 101, item name 14 includes item values 14-1 to 14-4 (vertical inclusion), and item name 19 includes the item values corresponding to item names 14 to 17 (horizontal inclusion). In the form indicated by reference numeral 102, item name 22 includes the item values corresponding to item name 20 and item name 21 (vertical inclusion), and item name 23 includes the item values corresponding to item name 22, item name 20, and item name 21 (horizontal inclusion). In other words, a form may contain a mixture of vertical and horizontal inclusion relationships.

Even when vertical and horizontal logical relationships are mixed in a form in this way, the data structure extraction apparatus 10 interprets the logical structure of the form and extracts tree structure data composed of item name nodes and item value nodes. For example, the data structure extraction apparatus 10 extracts the tree structure data indicated by reference numeral 103 from the form indicated by reference numeral 101, and extracts the tree structure data indicated by reference numeral 104 from the form indicated by reference numeral 102.

However, when the item names and item values of a form are not entered in a manner that satisfies the predetermined requirements, the data structure extraction apparatus 10 may be unable to interpret the logical structure accurately. In this embodiment, therefore, when a form contains a portion entered in a manner that does not satisfy the predetermined requirements, the data structure extraction apparatus 10 first corrects that portion so that the requirements are satisfied, and then interprets the logical structure and extracts the tree structure data.

(Configuration)
The configuration of the data structure extraction apparatus 10 will be described with reference to FIG. 2. The data structure extraction apparatus 10 includes a data structure extraction unit 11, a storage unit 12, and a ruled line frame correction unit 20.

Upon receiving a form file as input from a terminal (for example, a personal computer or a smartphone), the data structure extraction unit 11 refers to the form structure rules 121 (described in detail later), extracts the tree structure data of the form file, and registers it in the form database (form structure information storage unit) 122.

The storage unit 12 holds the form structure rules 121 and the form database 122. The form structure rules 121 store the various rules that the data structure extraction unit 11 refers to when extracting tree structure data from a form file. They include, for example, the node generation rule, meta information generation rule, adjacent edge generation rule, inclusion edge generation rule, node cluster generation rule, and tree structure generation rule shown in (a) of FIG. 16-1, and the adjacent edge check rule, inclusion edge check rule, tree structure condition rule, table type/enumeration type estimation rule, and tree structure selection rule shown in (b) of FIG. 16-1. These rules are described in detail later.

The form database 122 stores form configuration information (see FIG. 6), which includes the tree structure data extracted by the data structure extraction unit 11.

When a form contains a portion entered in a manner that does not satisfy the predetermined requirements, the ruled line frame correction unit 20 corrects that portion so that it satisfies the requirements, and then passes the form to the data structure extraction unit 11. The data structure extraction unit 11 then performs processing such as tree structure data extraction on the basis of the form corrected by the ruled line frame correction unit 20.

(Data structure extraction unit)
Next, the data structure extraction unit 11 of the data structure extraction apparatus 10 will be described in detail with reference to FIG. 3-1. The data structure extraction unit 11 includes a graph generation unit 13, a node cluster unit 138, a tree structure estimation unit 14, and a form structure construction unit 143. The item name database 123 of the storage unit 12 is optional; the case in which it is provided is described later.

The graph generation unit 13 generates node information for the nodes that represent the item names and item values of the form file. The node information includes information indicating adjacency relationships between nodes (adjacent edges) and information indicating inclusion relationships between nodes (inclusion edges). The graph generation unit 13 also acquires, from the form file, property information, which is attribute information of the form file.

The node cluster unit 138 classifies the nodes into node clusters on the basis of the connectivity of the adjacent edges of each node indicated in the node information.
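One plausible reading of classification "by connectivity of adjacent edges" is to treat each connected component of the adjacent-edge graph as one node cluster. The sketch below assumes exactly that and assumes the adjacency is given as a symmetric dictionary; the function name `cluster_nodes` and the node labels are hypothetical, and the actual clustering procedure (FIG. 10-1 and FIG. 10-3) may apply further conditions.

```python
from collections import deque


def cluster_nodes(adjacent):
    """Group nodes into connected components of the adjacent-edge graph.

    adjacent: dict mapping each node to the nodes it shares an adjacent edge with
    (assumed symmetric).
    """
    seen = set()
    clusters = []
    for start in adjacent:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one cluster.
        queue = deque([start])
        seen.add(start)
        cluster = []
        while queue:
            node = queue.popleft()
            cluster.append(node)
            for neighbor in adjacent.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical example: A, B, C are chained by adjacent edges; D is isolated.
example = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
clusters = cluster_nodes(example)
```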

The tree structure estimation unit 14 generates partial tree patterns from the group of nodes classified into a node cluster. The tree structure estimation unit 14 then generates the tree structure data of the form file by combining the generated partial tree patterns so that no node is duplicated.
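Combining partial tree patterns "so that no node is duplicated" can be viewed as choosing sets of patterns whose node sets are pairwise disjoint. The sketch below enumerates such sets by backtracking. It additionally assumes that a valid combination must cover every node, which this passage does not state; `combine_patterns` and the pattern names are hypothetical.

```python
def combine_patterns(patterns, all_nodes):
    """Enumerate combinations of patterns that use each node exactly once.

    patterns: list of (pattern_name, set_of_nodes) pairs.
    all_nodes: every node that a combination must cover (an assumption).
    """
    all_nodes = frozenset(all_nodes)
    results = []

    def extend(chosen, covered, start):
        if covered == all_nodes:
            results.append(list(chosen))
            return
        for i in range(start, len(patterns)):
            name, nodes = patterns[i]
            nodes = frozenset(nodes)
            if nodes & covered:
                continue  # would duplicate a node already used
            chosen.append(name)
            extend(chosen, covered | nodes, i + 1)
            chosen.pop()

    extend([], frozenset(), 0)
    return results


# Hypothetical patterns over three nodes a, b, c.
patterns = [("P1", {"a", "b"}), ("P2", {"b", "c"}), ("P3", {"c"}), ("P4", {"a", "b", "c"})]
combos = combine_patterns(patterns, {"a", "b", "c"})
```

When more than one combination is found, as here, some selection rule is needed to pick one candidate; the source assigns that role to the tree structure selection unit 142.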

The form structure construction unit 143 integrates the tree structure data of the form file output from the tree structure estimation unit 14 with the property information of that form file output from the graph generation unit 13, and registers the result in the form database 122.

(Graph generation unit)
The graph generation unit 13 includes an operation interface identification unit 131, a form format information acquisition unit 132, a node generation unit 133, a property information acquisition unit 134, an adjacent edge generation unit 135, and an inclusion edge generation unit 136. The item name registration unit 137 is optional; the case in which it is provided is described later.

The operation interface identification unit 131 identifies the type of the form file and determines the operation interface to be used for operating on the form file. The operation interface identification unit 131 then outputs information indicating the determined operation interface (operation interface information) to the form format information acquisition unit 132. An operation interface is an interface for acquiring information from a form file, such as an API (Application Programming Interface), COM (Component Object Model), or OLE (Object Linking and Embedding). The operation interface identification unit 131 also outputs the file information of the form file (for example, property information such as the application type, creation date, addition date, size, and file attributes of the form file) to the property information acquisition unit 134.

The form format information acquisition unit 132 uses the operation interface of the form to acquire the document information and format information of each sheet (page) of the form file. The document information is, for example, property information about the form, such as the title, form number, author, creation date, number of pages, number of characters, classification, and keywords. The format information is, for example, property information about the formatting that makes up the form, such as character information (for example, character strings and character types), ruled line information (for example, the start and end positions, type, and thickness of ruled lines), and cell information. Here, formatting refers to, for example, the character strings on the form, the type of ruled line, the thickness of ruled lines, the range enclosed by ruled lines, node merge information, and node color information. Cell information refers to, for example, the height, width, upper-left coordinates, and lower-right coordinates of the cells that make up a sheet of the form file, the merge state of the cells, and the fill color of the cells.

The node generation unit 133 generates nodes representing the item names or item values of the form in accordance with the form structure rules 121. For example, on the basis of the format information of the form obtained by the form format information acquisition unit 132, the node generation unit 133 extracts each portion of the form sheet that is enclosed by ruled lines as a node. The node generation unit 133 then outputs information about the generated nodes (node information) to the adjacent edge generation unit 135. The node generation unit 133 also extracts information on the portions of the sheet that are not enclosed by ruled lines as metadata (non-node information), and outputs it to the property information acquisition unit 134.
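The split between nodes and metadata described above can be sketched as follows. This is a deliberately simplified illustration: it assumes each cell is a dictionary with a `text` string and a `borders` mapping of side name to a boolean, and it treats a cell as "enclosed by ruled lines" only when all four sides have a border. Real frames may span several cells, and `generate_nodes` is a hypothetical name.

```python
SIDES = ("top", "bottom", "left", "right")


def generate_nodes(cells):
    """Split cells into nodes (fully enclosed) and metadata (text outside frames)."""
    nodes = []
    metadata = []
    for cell in cells:
        enclosed = all(cell["borders"].get(side, False) for side in SIDES)
        if enclosed:
            nodes.append(cell)
        elif cell["text"]:
            # Text not enclosed by ruled lines is kept as non-node information.
            metadata.append(cell["text"])
    return nodes, metadata


# Hypothetical sheet content.
full = {side: True for side in SIDES}
cells = [
    {"text": "Name", "borders": full},
    {"text": "Form No. 1", "borders": {}},
    {"text": "", "borders": {}},
]
nodes, metadata = generate_nodes(cells)
```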

The property information acquisition unit 134 outputs the property information of the form (the file information, document information, and non-node information of the form) to the form structure construction unit 143.

The adjacent edge generation unit 135 refers to the format information and node information of the form and generates adjacent edges, which indicate the adjacency relationships between the nodes of the form. The adjacent edge generation unit 135 adds each generated adjacent edge to the node information of the corresponding node.
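As a rough illustration of adjacency between nodes, the sketch below treats two rectangular nodes as adjacent when one side of the first touches the opposite side of the second and their extents along that side overlap. This geometric criterion, the `(x, y, width, height)` tuple layout, and the name `adjacent_edges` are assumptions; the actual adjacent edge generation rule is part of the form structure rules and is not reproduced here.

```python
def adjacent_edges(rects):
    """Return directed adjacent edges as (source, target, direction) tuples.

    rects: dict mapping a node name to (x, y, width, height), with y growing downward.
    """
    edges = set()
    for a, (ax, ay, aw, ah) in rects.items():
        for b, (bx, by, bw, bh) in rects.items():
            if a == b:
                continue
            # b lies immediately to the right of a, and their vertical ranges overlap.
            if ax + aw == bx and ay < by + bh and by < ay + ah:
                edges.add((a, b, "right"))
            # b lies immediately below a, and their horizontal ranges overlap.
            if ay + ah == by and ax < bx + bw and bx < ax + aw:
                edges.add((a, b, "below"))
    return edges


# Hypothetical layout: A and B side by side, C spanning the row beneath them.
layout = {"A": (0, 0, 10, 5), "B": (10, 0, 10, 5), "C": (0, 5, 20, 5)}
edges = adjacent_edges(layout)
```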

The inclusion edge generation unit 136 refers to the adjacent edges of each node indicated in the node information and to the format information of the form, and generates inclusion edges, which indicate the vertical or horizontal inclusion relationships between nodes, from the position and size of each node. The inclusion edge generation unit 136 adds each generated inclusion edge to the node information of the corresponding node.
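The following sketch shows one way inclusion could be derived from position and size alone. It assumes that node X vertically includes node Y when Y sits directly below X and within X's horizontal span, and horizontally includes Y when Y sits directly to the right of X and within X's vertical span. These criteria and the name `inclusion_edges` are assumptions made for illustration; the actual inclusion edge generation and check rules belong to the form structure rules.

```python
def inclusion_edges(rects):
    """Return inclusion edges as (including_node, included_node, direction) tuples.

    rects: dict mapping a node name to (x, y, width, height), with y growing downward.
    """
    edges = set()
    for a, (ax, ay, aw, ah) in rects.items():
        for b, (bx, by, bw, bh) in rects.items():
            if a == b:
                continue
            # Vertical inclusion: b is directly below a and no wider than a's span.
            if ay + ah == by and ax <= bx and bx + bw <= ax + aw:
                edges.add((a, b, "vertical"))
            # Horizontal inclusion: b is directly right of a and no taller than a's span.
            if ax + aw == bx and ay <= by and by + bh <= ay + ah:
                edges.add((a, b, "horizontal"))
    return edges


# Hypothetical layout: a wide header above two narrower frames.
layout = {"Header": (0, 0, 20, 5), "Left": (0, 5, 10, 5), "Right": (10, 5, 10, 5)}
inclusions = inclusion_edges(layout)
```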

(Tree structure estimation unit)
The tree structure estimation unit 14 includes a partial tree pattern generation unit 140, a tree structure data construction unit 141, and a tree structure selection unit 142. The item name assignment unit 139 is optional; the case in which it is provided is described later.

The partial tree pattern generation unit 140 generates partial tree patterns for the group of nodes classified into a node cluster. Specifically, the partial tree pattern generation unit 140 generates a partial tree pattern by modifying the adjacent edges of the node group so that they satisfy the characteristics of a tree structure on a form, and by setting the item attribute of each node, that is, whether the node is an item name or an item value.

In modifying the adjacent edges, the partial tree pattern generation unit 140 also estimates, on the basis of the item attribute and placement of each node in the node group, whether the form structure represented by the node group is of the enumeration type, in which item names and item values are in a one-to-one relationship, or of the table type, in which item names and item values are in a one-to-many relationship. It then modifies the adjacent edges of the node group on the basis of the estimation result, and outputs the generated partial tree patterns to the tree structure data construction unit 141. The generation of partial tree patterns is described in detail later.
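The distinction between the two structure types can be illustrated with a small check. The sketch assumes the item values have already been associated with their item names, which is a simplification: the source estimates the type from item attributes and node placement before the associations are final. The function name `classify_structure` and the sample data are hypothetical.

```python
def classify_structure(values_by_name):
    """Return 'enumeration' if every item name has exactly one value, else 'table'."""
    if all(len(values) == 1 for values in values_by_name.values()):
        return "enumeration"  # one-to-one: item name and item value
    return "table"  # one-to-many: an item name heads several values


# Hypothetical examples.
enumerated = classify_structure({"Name": ["Alice"], "Date": ["2015-11-27"]})
tabular = classify_structure({"Product": ["Pen", "Ink"], "Quantity": ["2", "1"]})
```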

The tree structure data construction unit 141 generates the tree structure data of the form by combining partial tree patterns so that no node is duplicated, and outputs the generated tree structure data to the tree structure selection unit 142. As indicated by reference numerals 103 and 104 in FIG. 1, for example, the tree structure data represents the logical structure of the form, {item name, ..., item name, item value}, converted into a tree structure.
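The logical structure {item name, ..., item name, item value} maps naturally onto a tree in which each sequence of item names is a path and the item value is attached at the end of the path. The sketch below builds such a tree as nested dictionaries. The `_values` key, the function name `build_tree`, and the sample data are illustrative assumptions and do not reproduce the data format of FIG. 1.

```python
def build_tree(paths):
    """Build a nested-dict tree from (item name, ..., item name, item value) tuples."""
    root = {}
    for path in paths:
        *names, value = path
        node = root
        for name in names:
            # Descend one level per item name, creating nodes as needed.
            node = node.setdefault(name, {})
        node.setdefault("_values", []).append(value)
    return root


# Hypothetical logical structures extracted from a form.
paths = [
    ("Customer", "Name", "Alice"),
    ("Customer", "Phone", "123"),
    ("Date", "2015-11-27"),
]
tree = build_tree(paths)
```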

When a plurality of pieces of tree structure data are generated for one form, the tree structure selection unit 142 selects tree structure data in accordance with the form structure rule 121. The selected tree structure data is output to the form structure construction unit 143.

With the data structure extraction device 10 configured in this way, tree structure data can be extracted with high accuracy even when vertical and horizontal logical structures are mixed in a form.

(Ruled line frame correction unit)
The ruled line frame correction unit 20 includes an acquisition unit 21, a correction unit 22, and a correction rule input unit 27. The correction unit 22 includes a combining unit 23 that combines ruled line frames, a dividing unit 24 that divides ruled line frames, a deletion unit 25 that deletes ruled line frames, and an adding unit 26 that adds ruled line frames. The acquisition unit 21 acquires ruled line frame information from the graph generation unit 13, based on the format information and on the node information to which adjacent edges have been added by the adjacent edge generation unit 135. As necessary, the acquisition unit 21 also acquires metadata from the graph generation unit 13 and includes it in the ruled line frame information.

The ruled line frame information consists of the node information of each node and information on the rectangle formed by the ruled lines corresponding to each node. The acquisition unit 21 is not limited to acquiring the ruled line frame information from the graph generation unit 13. For example, the acquisition unit 21 may refer to the form file and acquire the ruled line frame information from the application's standard object structure, or may read the form as an image and acquire the information from the image. The ruled line frame information includes, for example, the coordinate position, width, and height of the rectangle, the fill color inside the frame, the character string inside the frame, and the type and thickness of the ruled lines on the four sides. For a character string that is not enclosed by a ruled line frame, a ruled line frame on its four sides may be held internally. The acquisition unit 21 extracts ruled line frames from the form and acquires, as the ruled line frame information of each frame, at least the type or thickness of the ruled lines, the character string inside the frame, and the fill color inside the frame.

The correction rule storage unit 28 stores correction rules, each of which pairs a condition under which correction is performed with the correction process, that is, the action, performed when the condition is satisfied. As correction rules, the correction rule storage unit 28 stores, for example: a combination rule that pairs the ruled line frame combining process with the ruled line frame combination condition under which that process is performed; a division rule that pairs the ruled line frame dividing process with the ruled line frame division condition under which that process is performed; a deletion rule that pairs the ruled line frame deleting process with the ruled line frame deletion condition under which that process is performed; and an addition rule that pairs the ruled line frame adding process with the ruled line frame addition condition under which that process is performed.

The ruled line frame combination condition, the ruled line frame division condition, and the ruled line frame deletion condition are each, for example, one of the following or a combination of several of them: the type of ruled line is a specific preset type; the thickness of the ruled line falls within a specific preset range; the character string inside the frame is a specific preset character string; the fill color inside the frame is a specific preset color.
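One way to picture a correction rule as a pairing of a condition and an action is the sketch below. The class names `Frame` and `Rule`, the field names, and the helper predicates are all hypothetical; the description only states that a rule combines a condition on line type, line thickness, character string, or fill color with a combine, divide, delete, or add action.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Frame:
    """Assumed subset of the ruled line frame information."""
    text: str = ""
    fill: str = "white"
    line_type: str = "solid"
    line_width_pt: float = 1.0


Condition = Callable[[Frame], bool]


def line_type_is(kind: str) -> Condition:
    return lambda f: f.line_type == kind


def width_between(lo: float, hi: float) -> Condition:
    return lambda f: lo <= f.line_width_pt <= hi


def text_is(text: str) -> Condition:
    return lambda f: f.text == text


def fill_is(color: str) -> Condition:
    return lambda f: f.fill == color


def all_of(*conds: Condition) -> Condition:
    """Combine several elementary conditions into one."""
    return lambda f: all(c(f) for c in conds)


@dataclass(frozen=True)
class Rule:
    action: str          # e.g. "combine", "divide", "delete"
    condition: Condition


# Example rule: delete a frame whose text is empty and whose fill is gray.
delete_blank_gray = Rule("delete", all_of(text_is(""), fill_is("gray")))
```

Keeping conditions as small composable predicates mirrors the statement that a condition may be a single criterion or a combination of several.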

Each unit of the correction unit 22 refers to the correction rules and performs, on each ruled line frame that satisfies a condition, the action corresponding to that condition. When the ruled line frame information of a plurality of ruled line frames satisfies a preset ruled line frame combination condition, the combining unit 23 combines those ruled line frames. After the processing by the combining unit 23 has been executed, the dividing unit 24 divides a ruled line frame when its ruled line frame information satisfies a preset ruled line frame division condition. After the processing by the dividing unit 24 has been executed, the deletion unit 25 deletes a ruled line frame when its ruled line frame information satisfies a preset ruled line frame deletion condition.

After the processing by the deletion unit 25 has been executed, the adding unit 26 adds ruled lines to a region when the region information of that region satisfies a preset ruled line frame addition condition. For this purpose, the acquisition unit 21 extracts, based on the metadata or the like, the regions of the form in which character strings are written, and acquires, as the region information of each region, at least the presence or absence of a ruled line in any of the up, down, left, and right directions of the region, the character string, and the fill color of the region. When a ruled line exists in any of those directions, the acquisition unit 21 also acquires the type or thickness of that ruled line.

Each unit of the correction unit 22 corrects ruled line frames by changing the node information. Correction rules are input to the correction rule storage unit 28, manually or automatically, from the correction rule input unit 27. The node information corrected by the correction unit 22 is passed to the graph generation unit 13, the node cluster unit 138, the tree structure estimation unit 14, or the like.

Examples of ruled line frames corrected by the ruled line frame correction unit 20 are now described with reference to FIG. 3-2, which shows examples of ruled line frames to be corrected. First, as shown in (a) of FIG. 3-2, the deletion unit 25 deletes a ruled line frame when its character string begins with "※". As shown in (b) of FIG. 3-2, when the character string inside a ruled line frame contains both a character representing a check box and a character string representing the item name of that check box, the dividing unit 24 divides the frame into a ruled line frame corresponding to the check box and a ruled line frame corresponding to the item name.

As shown in (c) of FIG. 3-2, when the character string inside a ruled line frame consists of predetermined character strings A and B, the dividing unit 24 divides the frame so that each character string has its own frame. As shown in (d) of FIG. 3-2, when the ruled line between a first ruled line frame and an adjacent second ruled line frame is a dotted line, the combining unit 23 combines the first and second ruled line frames. As shown in (e) of FIG. 3-2, the deletion unit 25 deletes a ruled line frame when the character string inside the frame is empty and the fill color is gray.
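The examples (a), (b), (d), and (e) of FIG. 3-2 can be sketched as small predicates. The function names, the set of check box characters, and the string encodings of fill color and line type ("gray", "dotted") are assumptions for illustration and are not defined in the description.

```python
# Characters treated as check boxes (an assumed set).
CHECKBOX_CHARS = "□☐☑■"


def should_delete(text: str, fill: str) -> bool:
    """(a) text begins with the mark '※', or (e) text is empty and the fill is gray."""
    return text.startswith("※") or (text == "" and fill == "gray")


def split_checkbox(text: str) -> list:
    """(b) split 'check box + item name' into two frame texts; otherwise keep one."""
    rest = text[1:].strip()
    if text[:1] and text[0] in CHECKBOX_CHARS and rest:
        return [text[0], rest]
    return [text]


def should_combine(shared_line_type: str) -> bool:
    """(d) combine two adjacent frames when the ruled line between them is dotted."""
    return shared_line_type == "dotted"
```

Example (c) is omitted because the predetermined character strings A and B are not given in this section.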

(Processing procedure)
Next, the processing procedure of the data structure extraction device 10 is described with reference to FIG. 4-1. When the data structure extraction unit 11 of the data structure extraction device 10 receives a form file as input (S1), the graph generation unit 13 generates a graph of nodes representing the item names and item values of the form file (S2). That is, for each node constituting the form, the graph generation unit 13 generates node information indicating the adjacency and inclusion relationships between nodes. The graph generation unit 13 also acquires the property information of the form.

The ruled line frame correction unit 20 corrects the ruled line frames by changing the node information generated by the graph generation unit 13 (S3). The node information corrected by the ruled line frame correction unit 20 is passed to the node cluster unit 138.

Next, the node cluster unit 138 classifies the nodes into node clusters based on the connectivity of the adjacent edges indicated in the node information of each node corrected by the ruled line frame correction unit 20 (S4). The tree structure estimation unit 14 then generates a subtree pattern for each node group classified into a node cluster, and generates tree structure data by combining the generated subtree patterns (S5: tree structure estimation). After that, the form structure construction unit 143 integrates the tree structure data generated in S5 with the property information of the form acquired by the graph generation unit 13, and registers the result in the form database 122 (S6: form structure construction).
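A minimal sketch of S4, assuming that "connectivity of adjacent edges" means that nodes reachable from one another through adjacent edges fall into the same cluster (that is, connected components). This reading, the function name `cluster_nodes`, and the dictionary input format are assumptions; the exact clustering criterion is not given in this section.

```python
from collections import deque


def cluster_nodes(adjacency):
    """Group nodes into clusters of mutually reachable nodes.

    adjacency: dict mapping each node to an iterable of its neighboring nodes.
    """
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, cluster = deque([start]), []
        # Breadth-first search over adjacent edges from the start node.
        while queue:
            node = queue.popleft()
            cluster.append(node)
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical node names.
clusters = cluster_nodes({"A": ["B"], "B": ["A"], "C": []})
```

Here "A" and "B" share an adjacent edge and form one cluster, while the isolated node "C" forms its own.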

In this way, the data structure extraction device 10 can generate the tree structure data of a form from a form file. The data structure extraction unit 11 can also register, in the form database 122, information that integrates the tree structure data of the form with the property information of that form.

(Graph generation)
Next, the graph generation process of S2 in FIG. 4-1 is described in detail with reference to FIG. 4-2. First, the operation interface identification unit 131 of the graph generation unit 13 identifies the operation interface of the form file (S131), the form format information acquisition unit 132 acquires the format information of the form file (S132), and the node generation unit 133 generates the node information constituting the form file (S133: node generation). The property information acquisition unit 134 aggregates the file information, document information, and non-node information of the form as the property information of the form, and outputs it to the form structure construction unit 143 (S134: property information acquisition). After that, the adjacent edge generation unit 135 refers to the format information and node information of the form, generates adjacent edges indicating the adjacency relationships between the nodes of the form, and adds the generated adjacent edges to the node information of the corresponding nodes (S135: adjacent edge generation). The inclusion edge generation unit 136 then refers to the adjacent edges of each node indicated in the node information and to the format information of the form, generates the inclusion edges of each node, and adds the generated inclusion edges to the node information of the corresponding nodes (S136: inclusion edge generation).

In this way, the graph generation unit 13 can generate, for each node constituting the form, node information indicating the adjacency and inclusion relationships between nodes.

(Ruled line frame correction)
Next, the ruled line frame correction process of S3 in FIG. 4-1 is described in detail with reference to FIG. 4-3. The acquisition unit 21 of the ruled line frame correction unit 20 acquires ruled line frame information from the graph generation unit 13 (S21). When the adding unit 26 is to perform the ruled line frame addition process, the acquisition unit 21 also acquires from the graph generation unit 13 the metadata, which is information on the portions not enclosed by ruled lines, and includes it in the ruled line frame information. The correction unit 22 reads the correction rules from the correction rule storage unit 28 (S22).

When ruled line frame information satisfies the ruled line frame combination condition, the combining unit 23 combines the ruled line frames corresponding to that information (S23). When ruled line frame information satisfies the ruled line frame division condition, the dividing unit 24 divides the ruled line frame corresponding to that information (S24). When ruled line frame information satisfies the ruled line frame deletion condition, the deletion unit 25 deletes the ruled line frame corresponding to that information (S25). When metadata satisfies the ruled line frame addition condition, the adding unit 26 adds a ruled line frame to the portion corresponding to that metadata (S26). The ruled line frame correction unit 20 then outputs the corrected node information (S27).
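The fixed order of S23 to S27 can be sketched as a simple pipeline. The function `correct_frames` and the four stage functions are hypothetical stand-ins operating on plain frame texts; only the ordering (combine, divide, delete, add, output) comes from the description.

```python
def correct_frames(frames, regions, combine, divide, delete, add):
    """Apply the four correction stages in the order S23 to S26 and return the result (S27)."""
    frames = combine(frames)        # S23: combine frames meeting the combination condition
    frames = divide(frames)         # S24: divide frames meeting the division condition
    frames = delete(frames)         # S25: delete frames meeting the deletion condition
    frames = add(frames, regions)   # S26: add frames for unframed regions
    return frames                   # S27: output corrected information


# Toy stage functions (assumptions, for demonstration only).
def combine(frames):
    return list(frames)


def divide(frames):
    out = []
    for text in frames:
        if len(text) > 1 and text[0] == "□":
            out += [text[0], text[1:]]
        else:
            out.append(text)
    return out


def delete(frames):
    return [t for t in frames if not t.startswith("※")]


def add(frames, regions):
    return frames + list(regions)


result = correct_frames(["※note", "□Yes", "Name"], ["Form A-1"],
                        combine, divide, delete, add)
```

Because each stage consumes the output of the previous one, a frame produced by division is still subject to the deletion condition, which is why the order matters.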

Because the ruled line frame correction unit 20 corrects the ruled line frames that satisfy predetermined conditions in this way, the tree structure estimation unit 14 and other units can accurately estimate the logical relationships between item names and between item names and item values. Specific examples of the processing of each unit of the correction unit 22 are described later.

(Tree structure estimation)
Next, the tree structure estimation process of S5 in FIG. 4-1 is described in detail with reference to FIG. 4-4. The subtree pattern generation unit 140 of the tree structure estimation unit 14 generates subtree patterns for the node groups classified into node clusters in S4 (S140). The tree structure estimation unit 14 then constructs tree structure data by combining the subtree patterns generated in S140 (S141). After that, if a plurality of pieces of tree structure data were constructed in S141, the tree structure selection unit 142 selects one of them (S142).

In this way, the tree structure estimation unit 14 can generate tree structure data that reflects the adjacency and inclusion relationships between the nodes of the form.

(Operation interface identification)
Next, an example of the operation interface identification process of S131 in FIG. 4-2 is described with reference to FIG. 5-1. In the following description, the description language used for data output to a designated destination is, for example, CSV (Comma-Separated Values), JSON (JavaScript (registered trademark) Object Notation), or XML (eXtensible Markup Language). Output to the designated destination may be performed folder by folder, or may be performed after data compression in a format such as ZIP or CAB.

First, the operation interface identification unit 131 reads the file information of the form file (S1311) and identifies the type of the form file (S1312). For example, the operation interface identification unit 131 identifies the type of application used for the form file from the property information and extension of the form file contained in the file information, and from the magic number specific to the application. The operation interface identification unit 131 also outputs the read file information to the property information acquisition unit 134. It then determines the operation interface according to the type of form file identified in S1312 (S1313). If the operation interface can be determined (Yes in S1314), the operation interface identification unit 131 outputs the operation interface information to the designated destination (here, the form format information acquisition unit 132) (S1315). If the operation interface cannot be determined (No in S1314), it returns to the user, for example, an error message stating that the information will not be registered as a form (S1316).
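The identification in S1312 to S1316 might look like the following sketch. The byte signatures are the well-known leading bytes of PDF files, ZIP archives, and OLE2 compound files; the function names, the type labels, and the `INTERFACES` table are hypothetical, since the description does not name concrete file types or operation interfaces.

```python
# Leading-byte signatures ("magic numbers") and assumed type labels.
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),                          # ZIP-based documents
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),  # OLE2 compound files
]

# Fallback by file extension when no signature matches (assumed mapping).
EXTENSIONS = {"csv": "csv"}

# Hypothetical operation interface names per file type.
INTERFACES = {
    "pdf": "pdf_reader",
    "zip-container": "zip_document_reader",
    "ole2-container": "ole2_document_reader",
    "csv": "csv_reader",
}


def identify_type(filename: str, head: bytes):
    """S1312: identify the file type from the magic number, then the extension."""
    for magic, kind in MAGIC:
        if head.startswith(magic):
            return kind
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return EXTENSIONS.get(ext)


def decide_interface(filename: str, head: bytes) -> str:
    """S1313 to S1316: return the interface name, or raise when none can be determined."""
    kind = identify_type(filename, head)
    if kind is None:
        raise ValueError("information will not be registered as a form")
    return INTERFACES[kind]
```

Checking the signature before the extension is a design choice made here, so that a renamed file is still recognized by its content.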

In this way, the operation interface identification unit 131 can determine the operation interface of the form file.

(Form format information acquisition)
Next, an example of the form format information acquisition process of S132 in FIG. 4-2 is described with reference to FIG. 5-2. The form format information acquisition unit 132 reads the operation interface information of the form file (S1321), reads the form file (S1322), acquires the format information of each page (sheet) from the form file (S1323), and acquires the document information from the form file (S1324). If the acquired format information and document information contain ruled line information (Yes in S1325), the form format information acquisition unit 132 outputs them to the designated destinations (S1326). For example, the format information is output to the node generation unit 133, and the document information is output to the property information acquisition unit 134. If the acquired format information and document information contain no ruled line information (No in S1325), the form format information acquisition unit 132 returns to the user, for example, an error message stating that the information will not be registered as a form (S1327).

In this way, the form format information acquisition unit 132 can acquire the format information and document information from the form file.

(Node generation)
Next, an example of the node generation process of S133 in FIG. 4-2 is described with reference to FIGS. 5-3, 5-4, and 16-2. The node generation unit 133 reads the format information (S1331), reads the form structure rule 121 (S1332), and acquires node information from the format information in accordance with the node generation rule of the form structure rule 121 (see FIG. 16-1) (S1333). As shown in FIG. 5-4, the node information includes, for example, the character string enclosed by ruled lines (for example, "person in charge"), the upper-left coordinates (px1, py1) and lower-right coordinates (px2, py2) of the cell enclosed by ruled lines, the fill color (for example, white), the ruled line type (for example, solid), and the ruled line size (for example, 1 pt). It may also include the height and width of the cell enclosed by ruled lines.

After acquiring the node information from the format information as described above, the node generation unit 133 outputs the node information to the adjacent edge generation unit 135 (S1334: node information output). The information contained in the format information other than node information (non-node information) is output to the property information acquisition unit 134 (S1335: non-node information output).

Whether a piece of information contained in the format information is node information or non-node information is determined as follows. For example, in accordance with the form structure rule 121, the node generation unit 133 treats parts of the form shown in FIG. 16-2 that are enclosed by ruled lines, such as "Name", as nodes, and treats parts that are not enclosed by ruled lines, such as "Form A-1", as non-nodes. The node generation unit 133 may instead treat a region that is enclosed by ruled lines and filled with the same background color as a node, or may treat a region enclosed by solid ruled lines as a node. Furthermore, the node generation unit 133 may determine in advance the character strings to be treated as non-nodes and treat any region containing such a character string as a non-node, or may determine in advance the areas to be judged as non-nodes (for example, the top or bottom of the form) and treat any region within such an area in which a character string is placed as a non-node. The node generation unit 133 then acquires the parts judged to be nodes as node information and the parts judged to be non-nodes as non-node information.
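The basic node/non-node decision can be sketched as follows. The region dictionaries, their keys (`text`, `enclosed`), and the optional keyword list are assumptions for illustration; the actual rule is given by the node generation rule of the form structure rule 121, which is not reproduced in this section.

```python
def classify_regions(regions, non_node_keywords=()):
    """Split regions into nodes and non-nodes.

    A region is a node when it is enclosed by ruled lines and its text contains
    none of the keywords designated in advance as non-node strings.
    """
    nodes, non_nodes = [], []
    for region in regions:
        has_keyword = any(k in region["text"] for k in non_node_keywords)
        if region["enclosed"] and not has_keyword:
            nodes.append(region)
        else:
            non_nodes.append(region)
    return nodes, non_nodes


# Hypothetical regions loosely modeled on the example in the text.
regions = [
    {"text": "Name", "enclosed": True},
    {"text": "Form A-1", "enclosed": False},
    {"text": "Office use", "enclosed": True},
]
nodes, non_nodes = classify_regions(regions, non_node_keywords=("Office use",))
```

The other variants mentioned above (same background color, solid lines only, predefined areas) would each add one more test on the region before it is accepted as a node.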

In this way, the node generation unit 133 can extract the nodes (node information) needed to generate the tree structure data of the form file. The node generation unit 133 can also extract the attribute information of the form file as non-node information.

(Property information acquisition)
Next, an example of the property information acquisition process of S134 in FIG. 4-2 is described with reference to FIG. 5-5. The property information acquisition unit 134 reads the file information (S1341), the document information (S1342), the non-node information (S1343), and the form structure rule 121 (S1344). It then generates form meta information from the non-node information in accordance with the read form structure rule 121 (S1345). Form meta information is information that accompanies a form, such as its title, date, and form style.

For example, in accordance with the meta information generation rule of the form structure rule 121 (see (a) of FIG. 16-1), the property information acquisition unit 134 generates form meta information based on the delimiters in the character strings of the non-node information (for example, ":" or "/") and on the content of the character strings. For example, if there is non-node information "Name: xxx", it generates the form meta information "Name". When character strings (for example, "Form" or "Name"), special characters (for example, "〒" or "TEL"), data types, formats, or the like contained in the non-node information are known, the property information acquisition unit 134 may generate form meta information based on them. For example, if data types such as dates or employee numbers are known in advance, the property information acquisition unit 134 generates form meta information such as "date" or "employee number" from the data type.
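A minimal sketch of meta information generation from a delimiter or a known data type. The function name `meta_name`, the delimiter set, and the date pattern are assumptions; the actual meta information generation rule is defined in FIG. 16-1, which is not reproduced here.

```python
import re

# Half-width and full-width colon and slash as delimiters (assumed set).
DELIMITER = re.compile(r"[:：/／]")
# A simple numeric date pattern such as 2015/11/27 or 2015-11-27 (assumed format).
DATE = re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}")


def meta_name(text: str):
    """Return the form meta information name for a non-node string, or None."""
    text = text.strip()
    # Known data types are checked first, because a date also contains '/'.
    if DATE.fullmatch(text):
        return "date"
    match = DELIMITER.search(text)
    if match:
        head = text[:match.start()].strip()
        if head:
            return head
    return None
```

For the example in the text, `meta_name("Name: xxx")` yields `"Name"`; a bare date string yields `"date"` through the data type check instead of being cut at its first slash.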

After S1345, the property information acquisition unit 134 aggregates the pieces of attribute information (that is, the file information, document information, and form meta information) as property information (S1346) and outputs it to the form structure construction unit 143 (S1347: property information output).

In this way, the property information acquisition unit 134 can acquire the property information of the form file.

In S1347, the property information is output to the form structure construction unit 143 with the file information, document information, and form meta information associated with one another, as shown for example in FIG. 6. The form structure construction unit 143 then integrates the tree structure data generated by the tree structure estimation unit 14 (reference numerals 601 and 602) with this property information and outputs the result to the form database 122.

(Adjacent edge generation)
Next, an example of the adjacent edge generation process of S135 in FIG. 4-2 is described with reference to FIGS. 7-1 and 7-2. The adjacent edge generation unit 135 reads the node information generated in S133 of FIG. 4-2 (S1351) and reads the form structure rule 121 (S1352). In accordance with the form structure rule 121, it then obtains the adjacent edges between nodes (S1353) and checks the adjacency relationships (S1354). After that, the adjacent edge generation unit 135 adds the adjacent edges generated in S1353 (adjacent edge information) to the node information (S1355) and outputs the node information with the added adjacent edge information to the inclusion edge generation unit 136 (S1356).

For example, the adjacent edge generation unit 135 first refers to the node information of each node and generates, for each node, information indicating the adjacent nodes above, below, to the left of, and to the right of that node (an adjacency vector). When the node has an adjacent node, the index information of the adjacent node is stored in the adjacency vector. When there is no adjacent node, that fact is recorded in the adjacency vector (for example, as "0"). For example, when the "person in charge" node shown in FIG. 7-2 has no adjacent node above, below, or to its left, but the "name" node and the "affiliation" node are adjacent on its right, the adjacent edge generation unit 135 adds adjacent edge information indicating this to the node information of the "person in charge" node.
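The adjacency vector described above can be sketched as a mapping from each direction to the indices of the adjacent nodes, with 0 standing for "no adjacent node". The function name, the edge tuple format, and the use of 1-based node indices (so that 0 stays free as the "none" marker) are assumptions.

```python
DIRECTIONS = ("up", "down", "left", "right")


def adjacency_vector(edges, node):
    """Build the adjacency vector of one node.

    edges: iterable of (source index, direction, destination index).
    Returns a dict holding, per direction, the list of adjacent node indices,
    or 0 when there is no adjacent node in that direction.
    """
    found = {d: [] for d in DIRECTIONS}
    for src, direction, dst in edges:
        if src == node:
            found[direction].append(dst)
    return {d: (ids if ids else 0) for d, ids in found.items()}


# Hypothetical indices: 1 = person in charge, 2 = name, 3 = affiliation.
edges = [
    (1, "right", 2), (1, "right", 3),
    (2, "left", 1), (3, "left", 1),
    (2, "down", 3), (3, "up", 2),
]
vector = adjacency_vector(edges, 1)
```

For the "person in charge" node this gives 0 for up, down, and left, and the indices of "name" and "affiliation" for right, matching the example in the text.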

In this way, the adjacent edge generation unit 135 can generate adjacent edges indicating the adjacency relationships between nodes.

(Process for obtaining adjacent edges between nodes)
Next, an example of the process for obtaining adjacent edges between nodes in S1353 of FIG. 7-1 will be described using FIGS. 7-3 and 16-3. The adjacent edge generation unit 135 refers to the node information read in S1351 of FIG. 7-1 and defines unsearched nodes X and Y (S13532), and then determines whether node X and node Y satisfy the form structure rule 121 relating to adjacency (the adjacent edge generation rule in (a) of FIG. 16-1) (S13533). When the adjacent edge generation unit 135 determines in S13533 that node X and node Y satisfy the rule (Yes in S13533), it creates an adjacent edge between nodes X and Y in the direction of adjacency (S13534). When it determines that they do not satisfy the rule (No in S13533), it skips S13534 and proceeds to S13535. When the adjacent edge generation unit 135 then determines that the adjacency relationships between all nodes have been checked (Yes in S13535), it outputs the adjacent edge information (S13536). When it determines that some adjacency relationships between nodes have not yet been checked (No in S13535), it returns to S13532.

The above adjacent edge generation rule is, for example, a rule that regards node X and node Y shown in FIG. 16-3 as adjacent when (1) the y-coordinate range of node X contains the y-coordinate range of node Y, and (2) the end point of the x-coordinate of node X coincides with the start point of the x-coordinate of node Y. That is, let the upper-left coordinates of node X be X(x1, y1) and its lower-right coordinates be X(x2, y2), and let the upper-left coordinates of node Y be Y(x1, y1) and its lower-right coordinates be Y(x2, y2). When (1) X(*, y1) ≤ Y(*, y1) and X(*, y2) ≥ Y(*, y2), and (2) X(x2, *) = Y(x1, *), the adjacent edge generation unit 135 regards node X and node Y as adjacent.

The above is the adjacent edge generation rule for the rightward (horizontal) direction; the rule for the downward (vertical) direction is defined in the same way. Because the adjacent edge generation unit 135 generates the rightward and downward adjacent edges of each node in this way, the leftward and upward adjacent edges of the paired node (the node adjacent to that node) are obtained as well.

In this way, the adjacent edge generation unit 135 can generate adjacent edges based on the positional relationship between the nodes.
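The rightward rule and the pairwise loop of S13532 to S13535 can be sketched as follows. The `Box` type, the function names, and the sample coordinates are hypothetical. The sketch assumes an origin at the upper left with y increasing downward, and its downward rule is simply the rightward rule with the axes swapped, which is one reading of "defined in the same way".

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Box:
    name: str
    x1: float  # left
    y1: float  # top
    x2: float  # right
    y2: float  # bottom


def right_adjacent(x, y):
    """(1) X's y-range contains Y's, and (2) X's right edge meets Y's left edge."""
    return x.y1 <= y.y1 and x.y2 >= y.y2 and x.x2 == y.x1


def below_adjacent(x, y):
    """Assumed mirror of the rightward rule, with the x and y axes swapped."""
    return x.x1 <= y.x1 and x.x2 >= y.x2 and x.y2 == y.y1


def adjacent_edges(boxes):
    """Test every ordered pair (X, Y) and collect (X, Y, direction) edges."""
    edges = []
    for x, y in permutations(boxes, 2):
        if right_adjacent(x, y):
            edges.append((x.name, y.name, "right"))
        if below_adjacent(x, y):
            edges.append((x.name, y.name, "down"))
    return edges


# Hypothetical layout: one tall cell with two stacked cells on its right.
boxes = [
    Box("person in charge", 0, 0, 10, 20),
    Box("name", 10, 0, 20, 10),
    Box("affiliation", 10, 10, 20, 20),
]
edges = adjacent_edges(boxes)
```

Only rightward and downward edges are stored; reading an edge in reverse gives the leftward or upward neighbour of the paired node.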

(Process for checking adjacency relationships)
Next, an example of the process for checking the adjacency relationships between nodes in S1354 of FIG. 7-1 will be described using FIGS. 7-4 and 16-4. The adjacent edge generation unit 135 reads the adjacent edge information output in S13536 (S13541), and defines an adjacency direction k for an unchecked node X (S13542). The adjacency direction k indicates either the horizontal or the vertical direction; for example, k = 0 for the horizontal direction and k = 1 for the vertical direction. Next, if the adjacent edge of node X in direction k satisfies the form structure rule 121 (the adjacent edge check rule in (b) of FIG. 16-1) (No in S13543), and all nodes and all adjacency directions have been checked (Yes in S13545), the adjacent edge generation unit 135 outputs the adjacent edge information (S13546). If, on the other hand, the adjacent edge of node X in direction k does not satisfy the rule (Yes in S13543), the adjacent edge generation unit 135 deletes that adjacent edge (S13544) and proceeds to S13545. If any node or any adjacency direction has not yet been checked in S13545 (No in S13545), the process returns to S13541.

The above adjacent edge check rule is, for example, a rule that regards the adjacent edges of node X shown in FIG. 16-4 as needing no correction when (1) the y-coordinates of all nodes located to the right of node X are within the y-coordinate range of node X (the condition fails if even one node does not satisfy it), and (2) there is at most one node to the left of node X. For example, as indicated by reference numeral 1601 in FIG. 16-4, if node Y is the only node located to the right of node X and the y-coordinate of node Y is contained in that of node X, the adjacent edge generation unit 135 determines that condition (1) is satisfied. On the other hand, as indicated by reference numeral 1602, if nodes Y and Z are located to the right of node X and the y-coordinate of node Y is not contained in that of node X, it determines that condition (1) is not satisfied. Likewise, as indicated by reference numeral 1603, if node W is the only node located to the left of node X, the adjacent edge generation unit 135 determines that condition (2) is satisfied. On the other hand, as indicated by reference numeral 1604, when two nodes W and V are located to the left of node X, it determines that condition (2) is not satisfied.

The above is the adjacent edge check rule for the rightward (horizontal) direction; the rule for the downward (vertical) direction is defined in the same way.

In this way, when the generated adjacent edges include an adjacency relationship that is inappropriate for the form, the adjacent edge generation unit 135 can correct it.
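A minimal sketch of the rightward check rule follows. The function name, the representation of a node by its (top, bottom) y-range, and the sample values are assumptions made for illustration; the four cases only loosely mirror reference numerals 1601 to 1604, since the figure itself is not reproduced here.

```python
def needs_no_correction(x_range, right_ranges, left_count):
    """Return True when both check conditions hold for node X.

    x_range      -- (top, bottom) y-range of node X
    right_ranges -- (top, bottom) y-ranges of the nodes to the right of X
    left_count   -- number of nodes to the left of X
    """
    top, bottom = x_range
    # Condition (1): every right-hand node lies within X's y-range.
    all_inside = all(top <= t and b <= bottom for t, b in right_ranges)
    # Condition (2): at most one node on the left.
    return all_inside and left_count <= 1


# Hypothetical values for the four situations described in the text.
case_1601 = needs_no_correction((0, 20), [(0, 20)], 0)             # one contained node
case_1602 = needs_no_correction((0, 20), [(-5, 10), (10, 20)], 0)  # one node sticks out
case_1603 = needs_no_correction((0, 20), [], 1)                    # single left node
case_1604 = needs_no_correction((0, 20), [], 2)                    # two left nodes
```

When the function returns False, the corresponding adjacent edge would be a candidate for deletion in S13544.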

(Inclusion edge generation)
Next, an example of the inclusion edge generation process of S136 of FIG. 4-2 will be described using FIGS. 8 and 16-5. The inclusion edge generation unit 136 reads the node information including the adjacent edge information generated in S135 of FIG. 4-2 (S1361), and reads the form structure rule 121 (S1362). Then, the inclusion edge generation unit 136 defines unsearched nodes X and Y and an adjacency direction k (S1363), and determines whether an arbitrary node X, another node Y, and the direction k (vertical or horizontal) satisfy the form structure rule 121 relating to inclusion (the inclusion edge generation rule in (a) of FIG. 16-1) (S1364). If the rule is satisfied (Yes in S1364), the inclusion edge generation unit 136 creates an inclusion edge between nodes X and Y (S1365). If it is not satisfied (No in S1364), the inclusion edge generation unit 136 skips S1365 and proceeds to S1366. If the inclusion relationships between all nodes have been checked in S1366 (Yes in S1366), the inclusion edge generation unit 136 adds the inclusion edge information to the node information (S1367) and outputs the node information to the node cluster unit 138 (S1368). If the inclusion relationship between some nodes has not yet been checked (No in S1366), the process returns to S1363.

The above inclusion edge generation rule is, for example, a rule that regards node X and the nodes adjacent to it on the right (together with the series of nodes adjacent to those nodes) as having an inclusion relationship when all three of the following conditions are satisfied: (1) node X has two or more adjacent nodes on its right; (2) the y-coordinates of the nodes adjacent on the right of node X are all within the y-coordinate range of node X; and (3) the y-coordinate of one of the nodes adjacent on the right of node X coincides with the y-coordinate of the start point of node X, and the y-coordinate of one of those nodes coincides with the y-coordinate of the end point of node X.

According to this inclusion edge generation rule, for example, the inclusion edge generation unit 136 regards node X indicated by reference numeral 161 in FIG. 16-5, the nodes adjacent to node X on the right (nodes Y, Z, and U), and the series of nodes adjacent to those nodes on the right (nodes W, V, and T) as having an inclusion relationship. That is, in this example, the inclusion edge generation unit 136 regards node X and nodes Y, Z, U, W, V, and T as having an inclusion relationship.

When, for example, node U is missing from the node group indicated by reference numeral 161 in FIG. 16-5 (see reference numeral 162), conditions (1) and (2) above are satisfied, but the part of condition (3) requiring that "the y-coordinate of one of the nodes adjacent on the right of node X coincides with the y-coordinate of the end point of node X" is not. The inclusion edge generation unit 136 therefore regards node X and nodes Y, Z, W, V, and T as having no inclusion relationship.

The above is the inclusion edge generation rule for the rightward (horizontal) direction; the rule for the downward (vertical) direction is defined in the same way. For example, the downward (vertical) inclusion edge generation rule regards node X and the nodes adjacent below it (together with the series of nodes adjacent to those nodes) as having an inclusion relationship when all three of the following conditions are satisfied: (1) node X has two or more adjacent nodes below it; (2) the x-coordinates of the nodes adjacent below node X are all within the x-coordinate range of node X; and (3) the x-coordinate of one of the nodes adjacent below node X coincides with the x-coordinate of the start point of node X, and the x-coordinate of one of those nodes coincides with the x-coordinate of the end point of node X.

In this way, the inclusion edge generation unit 136 can generate inclusion edges indicating the inclusion relationship of each node.
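The three conditions of the rightward inclusion rule can be sketched as a single predicate. The function name and the sample ranges are hypothetical, and condition (3) is read here as "some neighbour's top equals X's top and some neighbour's bottom equals X's bottom", which is an interpretation rather than something the text spells out.

```python
def has_inclusion(x_range, neighbour_ranges):
    """Check the rightward inclusion conditions (1) to (3) for node X.

    x_range          -- (top, bottom) y-range of node X
    neighbour_ranges -- (top, bottom) y-ranges of X's right-adjacent nodes
    """
    top, bottom = x_range
    # (1) Two or more nodes are adjacent on the right.
    if len(neighbour_ranges) < 2:
        return False
    # (2) Every right-adjacent node lies within X's y-range.
    if not all(top <= t and b <= bottom for t, b in neighbour_ranges):
        return False
    # (3) One neighbour starts at X's start, and one ends at X's end.
    touches_start = any(t == top for t, _ in neighbour_ranges)
    touches_end = any(b == bottom for _, b in neighbour_ranges)
    return touches_start and touches_end


# Hypothetical ranges: X spans 0..30 with Y, Z, U stacked on its right.
with_u = has_inclusion((0, 30), [(0, 10), (10, 20), (20, 30)])
# The same group with node U missing no longer reaches X's end point.
without_u = has_inclusion((0, 30), [(0, 10), (10, 20)])
```

The second call reproduces the situation of reference numeral 162: conditions (1) and (2) hold, but the end-point part of condition (3) fails.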

(Process for classifying into node clusters)
Next, an example of the process for classifying nodes into node clusters in S4 of FIG. 4-1 will be described using FIGS. 10-1, 10-2, and 10-3. The node cluster unit 138 reads the node information generated by the graph generation process of S2 of FIG. 4-1 (S1381), and reads the form structure rule 121 (S1382). Then, in accordance with the form structure rule 121 (the node cluster generation rule shown in (a) of FIG. 16-1), the node cluster unit 138 clusters the other nodes Y starting from an arbitrary node X (S1383). Thereafter, the node cluster unit 138 checks whether any node remains unclustered (S1384). If there is an unclustered node (Yes in S1384), it selects that node as node X (S1385) and returns to S1383. If there is no unclustered node (No in S1384), the node cluster unit 138 outputs the node clusters to the tree structure estimation unit 14 (S1386).

The above node cluster generation rule is, for example, a rule that regards the group of nodes connected to a given node through adjacent edges as one node cluster. This rule may be combined with, for example, a rule that regards nodes separated from the connected node group by a predetermined ruled line (for example, a thick or red ruled line) as belonging to a different node cluster, or a rule that regards nodes with a predetermined fill color (for example, gray) as belonging to the same node cluster.

By classifying nodes into node clusters in accordance with such a node cluster generation rule, the node cluster unit 138 can, for example, separate tables that are physically apart within a sheet of the same form file into node clusters 1 and 2, as shown in FIG. 10-2. As a result, the tree structure estimation unit 14 can generate separate tree structure data for each of the forms (tables) that are physically apart within the same sheet.

(Process for clustering other nodes Y starting from an arbitrary node X)
The process of S1383 in FIG. 10-1 is performed, for example, according to the procedure shown in FIG. 10-3. First, the node cluster unit 138 reads the node information of node X (S13831) and searches for another node Y starting from node X (S13832). When node Y could not be found (No in S13833), the node cluster unit 138 determines whether node Y has already been clustered (S13834). If node Y has not been clustered (No in S13834), it classifies node Y into the same cluster (node cluster) as node X and sets a classified flag on node Y (S13835). The process then proceeds to S13836. When node Y could be found in S13833 (Yes in S13833), the process proceeds to S13839. If node Y has already been clustered (Yes in S13834), the process proceeds to S13836.

In S13836, when all nodes have been searched (Yes in S13836), the node cluster unit 138 sets a classified flag on node X (S13837) and outputs the node cluster (S13838). When there is a node that has not yet been searched (No in S13836), it defines the unsearched node as node Y (S13839) and returns to S13832.

In this way, the node cluster unit 138 can classify each node into a node cluster.
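The basic node cluster generation rule (nodes connected through adjacent edges form one cluster) amounts to finding connected components. The sketch below uses a breadth-first search with a set of classified nodes in place of the per-node flags; it is not a transcription of the flowchart of FIG. 10-3, and the function name, node names, and edges are hypothetical. The optional ruled-line and fill-color rules are not modeled.

```python
from collections import deque


def cluster_nodes(nodes, edges):
    """Group nodes into clusters of nodes connected by adjacent edges."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    classified = set()  # plays the role of the "classified" flags
    clusters = []
    for start in nodes:  # pick an unclustered node as node X
        if start in classified:
            continue
        cluster = []
        queue = deque([start])
        classified.add(start)
        while queue:
            current = queue.popleft()
            cluster.append(current)
            for nxt in sorted(neighbours[current]):
                if nxt not in classified:  # node Y not yet clustered
                    classified.add(nxt)
                    queue.append(nxt)
        clusters.append(cluster)
    return clusters


# Two physically separate tables on one sheet (hypothetical node names).
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
clusters = cluster_nodes(nodes, edges)
```

Each resulting cluster could then be passed on separately, which is what allows separate tree structure data to be produced for each table.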

(Subtree pattern generation process)
Next, an example of the subtree pattern generation process of S140 of FIG. 4-4 will be described using FIGS. 12-1 to 12-5, 16-6, and 16-7. The subtree pattern generation unit 140 reads the node information of the node clusters classified in S4 of FIG. 4-1 (S1401), and reads the form structure rule 121 (S1402). Then, the subtree pattern generation unit 140 acquires the set of inclusion nodes in the node cluster, and sets the searched inclusion node set to {φ} (S1403). An inclusion node is a node that, in an inclusion relationship between nodes, is on the side that contains other nodes; for example, node X in the node group indicated by reference numeral 161 in FIG. 16-5 is an inclusion node.

Next, the inclusion edge generation unit 136 divides C(X, k) into levels according to the hierarchy of the inclusion nodes. It also sets the maximum hierarchy to m, and sets n (the hierarchy of the inclusion node) to "0" (S1404).

Here, C(X, k) denotes the set of nodes that node X, an inclusion node, contains in direction k. For example, the set of nodes that node X contains in the vertical direction is C(X, 1), and the set it contains in the horizontal direction is C(X, 0). The hierarchy above refers to the depth when inclusion nodes are nested: it is "1" when there is no nested inclusion node, and "the number of nestings + 1" when there is. For example, the node cluster shown in FIG. 12-2 has one nested inclusion node, so its hierarchy is "2".
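The definition of the hierarchy can be illustrated with a short recursive sketch. The `contains` mapping (inclusion node to the nodes it contains) and the node names are hypothetical, and the direction k is ignored for simplicity, so this corresponds to a single fixed direction.

```python
def hierarchy(node, contains):
    """Return 1 if no contained node is itself an inclusion node,
    otherwise the number of nestings + 1."""
    # A contained node counts as an inclusion node if it has its own entry.
    nested = [c for c in contains.get(node, []) if c in contains]
    if not nested:
        return 1
    return 1 + max(hierarchy(c, contains) for c in nested)


# Hypothetical cluster: X contains a and X2, and X2 in turn contains b and c,
# so there is one nested inclusion node.
contains = {"X": ["a", "X2"], "X2": ["b", "c"]}
depth = hierarchy("X", contains)
inner_depth = hierarchy("X2", contains)
```

With one nested inclusion node, the outer node has hierarchy 2 and the inner node has hierarchy 1, matching the rule stated above.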

Next, the subtree pattern generation unit 140 acquires the subtree patterns for C(X, k) at level n (S1405), and adds the level-n inclusion nodes to the searched inclusion node set (S1406). The subtree pattern generation unit 140 then increments n (S1407). If n > m and no unsearched inclusion node remains (Yes in S1408), it outputs the subtree patterns of the inclusion node set to the tree structure data construction unit 141 (S1409). If n > m does not hold, or an unsearched inclusion node remains (No in S1408), the process returns to S1405.

In this way, the subtree pattern generation unit 140 can acquire a subtree pattern for each hierarchy even when inclusion nodes are nested within a node cluster.

(Subtree pattern acquisition process)
Next, an example of the subtree pattern acquisition process of S1405 of FIG. 12-1 will be described using FIG. 12-3. The subtree pattern generation unit 140 reads C(X, k) at level n (S14051), and reads the form structure rule 121 (S14052). Then, the subtree pattern generation unit 140 sets the subtree pattern of C(X, k) to {φ} (S14053). If n is 1 or more (Yes in S14054), then for each inclusion node (node X') in the nearest hierarchy contained in C(X, k), it regards C(X', k) as a dummy node (S14055). For example, of the node group shown in FIG. 12-2, the subtree pattern generation unit 140 regards the node group indicated by reference numeral 1201 as a single node (a dummy node), as shown in balloon 1202 of FIG. 12-3. If n is not 1 or more in S14054 (No in S14054), the process proceeds to S14059.

Next, the subtree pattern generation unit 140 performs the tree structure conversion process on C(X, k) (S14056), and, if a subtree pattern is obtained (Yes in S14057), adds it to the subtree patterns of C(X, k) (S14058). If no subtree pattern is obtained in S14057 (No in S14057), the process proceeds to S14059.

Next, the subtree pattern generation unit 140 regards C(X, k) as containing no inclusion nodes (S14059), performs the tree structure conversion process on C(X, k) in the same way as in S14056 (S14060), and, if a subtree pattern is obtained (Yes in S14061), adds it to the subtree patterns of C(X, k) (S14062). The subtree pattern generation unit 140 then outputs the subtree patterns of C(X, k) (S14063). If no subtree pattern is obtained (No in S14061), it skips S14062 and proceeds to S14063.

In this way, when inclusion nodes are nested within a node cluster, the subtree pattern generation unit 140 can acquire both the subtree pattern for each hierarchy and the subtree pattern of the node cluster as a whole.

(Tree structure conversion process for C(X, k))
Next, an example of the tree structure conversion process for C(X, k) in S14056 and S14060 of FIG. 12-3 will be described using FIG. 12-4. The subtree pattern generation unit 140 reads C(X, k) (S140601), and reads the form structure rule 121 relating to tree structure generation (S140602). Then, in accordance with the form structure rule 121 relating to tree structure generation (the tree structure generation rule in (a) of FIG. 16-1), the subtree pattern generation unit 140 corrects the adjacent edges of each node and sets the item attributes (S140603).

The above tree structure generation rule consists of, for example, the four conditions (1) to (4) below. (1) Excluding the inclusion node, a node has only one adjacent edge on its lower side (in this case, the right side). When a node other than the inclusion node has two or more adjacent edges on its lower side (in this case, the right side), the subtree pattern generation unit 140 cuts those adjacent edges. For example, the subtree pattern generation unit 140 cuts the adjacent edges connecting node a to nodes b and c shown in (1) of FIG. 16-6.

(2) Every node below the inclusion node has the inclusion node above it. After (1), if there is a node, other than the inclusion node, that has no adjacent edge on its upper side (in this case, the left side), the subtree pattern generation unit 140 creates an adjacent edge between that node and the inclusion node (node X) in direction k. For example, the subtree pattern generation unit 140 creates adjacent edges connecting node X to nodes b and c shown in (2) of FIG. 16-6.

(3) A node that has a lower adjacent edge has the item attribute "item name", and a node that does not has the item attribute "item value". After (2), the subtree pattern generation unit 140 sets the item attribute of each node with no lower adjacent edge to "item value". For example, the subtree pattern generation unit 140 sets the item attributes of nodes a, d, e, and f shown in (3) of FIG. 16-6 to "item value".

(4) When the node group includes a portion in which two or more nodes having a single adjacent edge are connected in series, the group is of the table type, is of the enumeration type, or should not have adjacent edges (tree structure incompatible). After (1) to (3), when such a portion is included, the subtree pattern generation unit 140 estimates (determines) whether the node group including the inclusion node is of the table type or the enumeration type, in accordance with the table type / enumeration type estimation rule (described later). For example, when the subtree pattern generation unit 140 finds a series of nodes in which two or more nodes having a single adjacent edge are connected, as indicated by reference numeral 1610 in (4) of FIG. 16-6, it makes that node group subject to the table type / enumeration type estimation.
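Condition (3) of the tree structure generation rule can be sketched as follows. The function name, the edge representation as (upper, lower) pairs, and the small example graph are hypothetical; the example is not the node arrangement of FIG. 16-6. Conditions (1), (2), and (4) are not modeled.

```python
def assign_item_attributes(nodes, lower_edges):
    """Label a node "item name" if it has a lower adjacent edge,
    otherwise "item value".

    lower_edges -- (upper_node, lower_node) pairs, i.e. edges pointing
                   toward the lower (here, right-hand) side
    """
    has_lower = {upper for upper, _ in lower_edges}
    return {
        n: ("item name" if n in has_lower else "item value")
        for n in nodes
    }


# Hypothetical graph after edge correction: X -> b -> d and X -> c -> e.
nodes = ["X", "b", "c", "d", "e"]
lower_edges = [("X", "b"), ("X", "c"), ("b", "d"), ("c", "e")]
attrs = assign_item_attributes(nodes, lower_edges)
```

Leaf nodes d and e end up as item values, while every node that still points to a lower node is treated as an item name.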

After S140603 of FIG. 12-4, the subtree pattern generation unit 140 selects an unsearched node Z from among the nodes whose item attribute was set to "item value" in S140603 (S140604), and traces the path from node Z to node X (S140605). The path from node Z to node X is denoted route(X, Z). Then, in accordance with the table type / enumeration type estimation rule (see (b) of FIG. 16-1), the subtree pattern generation unit 140 determines (estimates) whether there is a path (route(X, Z)) that is of the table type or the enumeration type (S140606). If there is such a path (Yes in S140606), it sets the table type or enumeration type in accordance with the form structure rule 121 relating to table type / enumeration type estimation (the table type / enumeration type estimation rule) (S140607), and proceeds to S140609. If there is no such path (No in S140606), it sets the item attribute of the nodes on the path (route(X, Z)) other than node Z to "item name" (S140608), and proceeds to S140609.

The above table type / enumeration type estimation rule consists of, for example, the four conditions (1) to (4) below. (1) The nodes contained in node X have a grid-like connection relationship. For example, when the nodes contained in node X have a grid-like connection relationship as shown in (1) of FIG. 16-7, the subtree pattern generation unit 140 makes these nodes subject to the table type / enumeration type estimation.

(2) When condition (1) is satisfied, if an "item name" node exists among the nodes not adjacent to node X, the node group is judged to be either "enumeration type" or "incompatible with tree structure". For example, when an "item name" node exists among the nodes not adjacent to node X as shown in (2) of FIG. 16-7, the subtree pattern generation unit 140 estimates the node group to be either "enumeration type" or "incompatible with tree structure".

(3) When condition (1) is satisfied but condition (2) is not, if an end point of each node contained in node X coincides with an end point of one of its adjacent nodes, the node group is estimated to be "table type". For example, if an end point of each node contained in node X coincides with an end point of one of its adjacent nodes as shown in (3) of FIG. 16-7, the subtree pattern generation unit 140 estimates the node group to be "table type".

(4) When conditions (1) and (2) are both satisfied, the node group is estimated to be "enumeration type" if the number of nodes from the node following node X down to the terminal node (that is, excluding node X) is even, and "incompatible with tree structure" if that number is odd. For example, as shown in (4) of FIG. 16-7, the subtree pattern generation unit 140 judges the group to be "enumeration type" if the number of such nodes is even (four) and "incompatible with tree structure" if it is odd (three). In the enumeration type shown in (4) of FIG. 16-7, each "item name" is paired with an "item value". An "item name" with no paired "item value" is unnatural as the tree structure of a form, so the subtree pattern generation unit 140 estimates such a node group to be "incompatible with tree structure". In this way, the subtree pattern generation unit 140 estimates whether a node group is of the table type, of the enumeration type, or unsuitable as a tree structure in the first place.
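The decision order implied by conditions (1) to (4) can be sketched as follows. This is only an illustration: the function name, the boolean inputs, and the `None` return value for cases the four conditions do not cover are assumptions, and the geometric tests behind each boolean are left out.

```python
from typing import Optional


def estimate_group_type(is_grid: bool,
                        has_nonadjacent_item_name: bool,
                        endpoints_overlap: bool,
                        node_count_excluding_x: int) -> Optional[str]:
    """Return "table", "enumeration", "incompatible", or None.

    None means the four conditions do not settle the case
    (an assumption of this sketch, not stated in the text).
    """
    if not is_grid:
        # Condition (1) fails: the nodes are not an estimation candidate.
        return None
    if has_nonadjacent_item_name:
        # Conditions (1) and (2) hold, so condition (4) decides by parity.
        if node_count_excluding_x % 2 == 0:
            return "enumeration"
        return "incompatible"
    if endpoints_overlap:
        # Condition (3): (1) holds, (2) does not, end points coincide.
        return "table"
    return None
```

With the counts used in the example of FIG. 16-7, four nodes give "enumeration" and three nodes give "incompatible".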

Then, following the table type/enumeration type estimation rule above, the subtree pattern generation unit 140 assigns table type item attributes and corrects the adjacent edges for node groups estimated to be of the table type (see (a) of FIG. 12-5), and assigns enumeration type item attributes and corrects the adjacent edges for node groups estimated to be of the enumeration type (see (b) of FIG. 12-5).

After S140607 in FIG. 12-4, if C(X, k) could be converted to the table type or enumeration type in S140607 (Yes in S140609), the subtree pattern generation unit 140 marks the node Z selected in S140604 as searched (S140610). If all nodes whose item attribute is "item value" have been searched (Yes in S140611), the unit outputs the subtree pattern of C(X, k) (S140612). If any node whose item attribute is "item value" remains unsearched (No in S140611), the process returns to S140604.

On the other hand, if C(X, k) could not be converted to the table type or enumeration type (No in S140609), the subtree pattern generation unit 140 determines that C(X, k) does not satisfy the tree structure condition (S140613). In this case, no subtree pattern is output.

In this way, for the node group C(X, k) contained in an inclusion node, the subtree pattern generation unit 140 corrects the adjacent edges of each node and sets the item attributes while taking into account whether the group is of the table type or the enumeration type. The unit does not output, as a subtree pattern, any node group that fails the tree structure condition (the characteristics a form's tree structure should have). As a result, the subtree pattern generation unit 140 can generate subtree patterns with high accuracy.

(Tree structure data construction process)
Next, the tree structure data construction process of S141 in FIG. 4-4 is described with reference to FIG. 13. The tree structure data construction unit 141 reads the subtree pattern of each inclusion node generated in S140 (S1411), reads the node information of the node cluster (S1412), and reads the form structure rule 121 (S1413). After initializing the set of tree structure data to {φ} (S1414), the unit obtains the combinations of subtree patterns in which no node is duplicated (S1415). It then selects, from this group of combinations, a combination that has not yet been checked (S1416), that is, a combination that has not yet undergone the processing of S1419 described later. Next, referring to the node information of the node cluster read in S1412, the unit adds any missing nodes to the combination of subtree patterns selected in S1416 (S1417).
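As a rough illustration of S1415, the sketch below enumerates every combination of subtree patterns whose node sets are pairwise disjoint. The representation of a pattern as a set of node IDs and the name `disjoint_combinations` are assumptions made for this example. The text does not say whether all such combinations or only maximal ones are used, and this sketch returns all of them.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def disjoint_combinations(patterns: Dict[str, Set[int]]) -> List[Tuple[str, ...]]:
    """Return all non-empty combinations of patterns that share no node."""
    names = list(patterns)
    result: List[Tuple[str, ...]] = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            seen: Set[int] = set()
            duplicated = False
            for name in combo:
                if seen & patterns[name]:
                    # A node already used by an earlier pattern appears again.
                    duplicated = True
                    break
                seen |= patterns[name]
            if not duplicated:
                result.append(combo)
    return result


# Hypothetical subtree patterns keyed by name, each listing its node IDs.
example = {"p1": {1, 2}, "p2": {2, 3}, "p3": {4}}
print(disjoint_combinations(example))
```

Here "p1" and "p2" both contain node 2, so no returned combination includes both. Exhaustive enumeration grows exponentially with the number of patterns, so this only suits small inputs.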

Next, the tree structure data construction unit 141 generates a tree structure by referring to the inclusion edges and adjacent edges of the combination of subtree patterns produced by the processing up to S1417 (together with the nodes added to it) (S1418).

The tree structure data construction unit 141 then determines whether the tree structure generated in S1418 satisfies the tree structure conditions, in accordance with the form structure rule 121 on tree structure conditions (the tree structure condition rule in (b) of FIG. 16-1) (S1419). If the conditions are satisfied (Yes in S1419), the unit adds the tree structure to the tree structure data (S1520). After that, if all combinations of subtree patterns have been checked (Yes in S1521), the unit outputs the tree structure data to the tree structure selection unit 142 (S1522).

When the tree structure data construction unit 141 determines that the tree structure generated in S1418 does not satisfy the tree structure conditions (No in S1419), the process returns to S1416. The process also returns to S1416 when, in S1521, a combination of subtree patterns remains unchecked (No in S1521).

The tree structure condition rule consists of, for example, the following three conditions (1) to (3). (1) Edges between "item name" nodes are one-to-many. (2) Edges between an "item name" node and "item value" nodes are one-to-one, except that they are one-to-many for a node group judged to be of the table type by the subtree pattern generation unit 140. (3) An "item value" node has no lower-level nodes. When all three conditions are satisfied, the tree structure data construction unit 141 determines that the tree structure satisfies the tree structure conditions.
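The three conditions can be checked with a short recursive walk. The sketch below is illustrative only: the `Node` class, its `is_table` flag, and the function name are assumptions. Because each node is stored as a child of exactly one parent, condition (1) holds by construction and is not tested explicitly.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    attr: str                                   # "item name" or "item value"
    children: List["Node"] = field(default_factory=list)
    is_table: bool = False                      # group judged to be table type


def satisfies_tree_conditions(node: Node) -> bool:
    """Check conditions (2) and (3) for the subtree rooted at node."""
    if node.attr == "item value":
        # Condition (3): an item value has no lower-level nodes.
        return not node.children
    values = [c for c in node.children if c.attr == "item value"]
    if len(values) > 1 and not node.is_table:
        # Condition (2): one item value per item name unless table type.
        return False
    return all(satisfies_tree_conditions(c) for c in node.children)
```

For example, an "item name" node with two "item value" children passes only when `is_table` is set.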

(Tree structure data selection process)
Next, the tree structure data selection process of S142 in FIG. 4-4 is described with reference to FIG. 14-1. The tree structure selection unit 142 reads the tree structure data output from the tree structure data construction unit 141 (S1421). If more than one kind of tree structure data has been read (No in S1422), the unit reads the form structure rule 121 on tree structure data selection (S1423), performs the selection in accordance with that rule (the tree structure data selection rule in (b) of FIG. 16-1) (S1424), and outputs the selected tree structure data (S1425). If only one kind of tree structure data has been read (Yes in S1422), the tree structure selection unit 142 outputs that tree structure data as it is (S1425).

Examples of the tree structure data selection performed in S1424 of FIG. 14-1 in accordance with the tree structure data selection rule are described with reference to FIG. 14-2.

For example, as shown in (a) of FIG. 14-2, the tree structure selection unit 142 displays the difference information between the multiple tree structure data to the user and selects tree structure data using the information corrected by the user. Specifically, the unit takes the difference of the node information of the tree structure data (S14241) and notifies the user of the result (for example, a notification that an "abnormality" exists) (S14242). After S14242, the unit displays the difference information of the tree structure data (for example, information showing differences in node item attributes, in the inclusion relationships of inclusion edges, and in the adjacency directions of adjacent edges, obtained by comparing the generated tree structure data). When it then receives tree structure correction information from the user through a GUI (Graphical User Interface) or the like (S14243), the unit corrects the tree structure data based on that information and selects the corrected tree structure data (S14244).

In this way, when multiple tree structure data are generated, the tree structure selection unit 142 can output tree structure data that reflects the corrections the user wants.

As another example, as shown in (b) of FIG. 14-2, when processing two or more form files, if multiple tree structure data are generated from one form file, the tree structure selection unit 142 may defer asking the user for input until all (or a certain number of) form files have been processed. For example, when multiple tree structure data are output, the unit takes the difference of the node information of the tree structure data (S14245), outputs (caches) the difference information (S14246), and during this processing displays to the user a message that the information will not be registered as a form (that is, the tree structure data will not be registered in the form database 122) (S14247). The unit then determines whether all form files have been processed (S14248). If an unprocessed form file remains (No in S14248), it processes that form file (S14249), that is, it performs the processing from S14245 onward. When the unit determines that all form files have been processed (Yes in S14248), it reads the difference information output in S14246 (S14250) and notifies the user of the result (S14251). For example, it notifies the user of the difference information of the tree structure data for each form file. The user then decides, for the difference information of these form files, whether to carry out the process of (a) of FIG. 14-2 or not to convert the form file into tree structure data.

In this way, when processing two or more form files, the tree structure selection unit 142 can process all of them without affecting the selection and conversion of the other form files. In addition, when multiple tree structure data are generated, the unit can present the difference information to the user all at once.

As a further example, as shown in (c) of FIG. 14-2, the tree structure selection unit 142 may select, from the multiple tree structure data, the one with the simplest structure. The unit calculates, for each tree structure data, a value indicating the complexity of its structure (S14252), and selects the tree structure data with the simplest structure (S14253). For example, it counts the number of levels of inclusion relationships in each tree structure data and selects the one with the fewest levels. The unit then notifies the user of the selection result (S14254).
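The depth-based selection in S14252 and S14253 can be sketched as below. The nested-dictionary representation with a "children" key and the function names are assumptions for illustration. Ties are resolved in favor of the first candidate, which the text does not specify.

```python
from typing import Dict, List


def inclusion_depth(tree: Dict) -> int:
    """Number of inclusion levels, counting the root as level 1."""
    children = tree.get("children", [])
    return 1 + max((inclusion_depth(c) for c in children), default=0)


def select_simplest(trees: List[Dict]) -> Dict:
    """Return the candidate with the fewest inclusion levels (S14253)."""
    return min(trees, key=inclusion_depth)


# Two hypothetical candidates produced for the same form.
flat = {"name": "order", "children": [{"name": "date"}, {"name": "amount"}]}
deep = {"name": "order", "children": [{"name": "date", "children": [{"name": "amount"}]}]}
print(select_simplest([deep, flat])["children"][1]["name"])
```

Depth is only one possible complexity measure. The text presents it as an example, so node count or branching factor could be substituted in `key`.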

The selection focuses on structural complexity because, when the tree structure data construction unit 141 generates tree structure data with an overly complex structure, that data is likely to differ from the actual logical structure of the form file. In other words, when multiple tree structure data are generated as described above, selecting the one with the simplest structure lets the tree structure selection unit 142 choose tree structure data that is closer to the actual logical structure of the form file.

(Form structure construction process)
Next, the form structure construction process of S6 in FIG. 4-1 is described with reference to FIG. 15. On reading the tree structure data output by the tree structure estimation unit 14 (S1431), the form structure construction unit 143 determines whether the tree structure data of all node clusters has been acquired (S1433). If it has (Yes in S1433), the unit integrates the node clusters (S1434). The unit then reads the property information output from the property information acquisition unit 134 (S1435) and adds the tree structure data to the form configuration containing this property information (S1436). The result is referred to as form configuration information.

For example, as shown in FIG. 6, the form structure construction unit 143 integrates the tree structure data generated by the tree structure estimation unit 14 into a form configuration that contains the property information (file information, document information, and form meta information).

After that, when the form structure construction unit 143 confirms that tree structure data has been acquired for all pages (all sheets) of the form file (Yes in S1437), it outputs the form configuration information generated in S1436 to the form database 122 (S1438). If there is a page of the form file for which tree structure data has not yet been acquired (No in S1437), the node generation unit 133 performs the node generation process of S133 in FIG. 4-2 (S133). If, in S1433, there is a node cluster for which tree structure data has not yet been acquired (No in S1433), the form structure construction unit 143 selects that node cluster (S1440), and the tree structure estimation unit 14 performs the tree structure estimation process of S5 in FIG. 4-1 (S5).

In this way, the form structure construction unit 143 can register in the form database 122 the information (form configuration information) obtained by integrating the tree structure data of each node cluster with the property information.

With the data structure extraction device 10 described above, the tree structure data of a form file can be extracted accurately even when vertical and horizontal logical structures are mixed within the form file. In addition, the data structure extraction unit 11 can register in the form database 122 information that associates the tree structure data of the form file with its property information.

(Other embodiments)
The data structure extraction device 10 may register item names extracted from form files in an item name database 123 and, when it receives a new form file, refer to the item name database 123 to assign item attributes to the nodes of a node cluster.

Such a data structure extraction device 10 includes the item name registration unit 137, the item name database 123, and the item name assignment unit 139, which are indicated by broken lines in FIG. 3-1.

On acquiring node information from the node generation unit 133, the item name registration unit 137 extracts character strings from nodes that are likely to be item name nodes, in accordance with the form structure rule 121 on item name determination, and registers them in the item name database 123.

The item name database 123 stores the character strings extracted by the item name registration unit 137 (character strings that are often used as item names).

The item name assignment unit 139 refers to the item name database 123 and assigns an item attribute ("item name" or "item value") to each node of a node cluster.

With such a data structure extraction device 10, item attributes can be assigned accurately to each node of a node cluster. As a result, the device can generate accurate tree structure data, and the time required for the tree structure data generation process can be reduced.

(Item name registration process)
An example of the processing procedure of the item name registration unit 137 is described with reference to FIG. 9. The item name registration unit 137 reads the node information of the form output from the node generation unit 133 (S1371) and reads the form structure rule 121 on item name determination (S1372). When the unit determines that the node information of the form satisfies this rule (Yes in S1373), it regards the form as a template and extracts character string information from each node in accordance with the rule (S1374). The unit then removes sentence structure from the extracted character strings (S1375) and registers them in the item name database 123 (S1376). When the unit determines that the node information of the form does not satisfy the rule (No in S1373), the process ends.

The rule on item name determination is, for example, that character string information is to be extracted from the node information when any of the following conditions (1) to (3) is satisfied. (1) The number of nodes whose character string information is empty (null) is at or above a threshold (for example, 50% or more of the nodes in the entire form file). (2) A designated fill color, or any fill color other than white or transparent, is used on a number of nodes at or above a threshold (for example, 50% or more of the nodes in the entire form file). (3) The user has defined the form file as a template, that is, a form in which character strings are assigned only to nodes corresponding to item names. The rule further specifies how item names are registered. In case (1) or (3), the item name registration unit 137 registers the character strings of all nodes whose character information is not empty as item names. In case (2), it registers as item names the character strings of the nodes that have the designated fill color, or any fill color other than white or transparent.
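A minimal sketch of this rule follows. The node representation (dictionaries with "text" and "fill" keys), the function name, and the choice to test conditions (1) and (3) before condition (2) when both could apply are assumptions. Condition (2) is simplified to "any fill other than white or transparent", and the 50% threshold is the example value from the text.

```python
from typing import Dict, List, Optional

PLAIN_FILLS = ("white", "transparent", None)   # fills treated as uncolored


def extract_item_names(nodes: List[Dict[str, Optional[str]]],
                       user_says_template: bool = False,
                       threshold: float = 0.5) -> List[str]:
    """Return the strings to register as item names, or [] if no rule applies."""
    total = len(nodes)
    if total == 0:
        return []
    empty = sum(1 for n in nodes if not n.get("text"))
    colored = [n for n in nodes if n.get("fill") not in PLAIN_FILLS]
    if user_says_template or empty / total >= threshold:
        # Conditions (1) or (3): register every non-empty string.
        return [n["text"] for n in nodes if n.get("text")]
    if len(colored) / total >= threshold:
        # Condition (2): register only the strings of colored nodes.
        return [n["text"] for n in colored if n.get("text")]
    return []


# A hypothetical filled-in form whose item name cells are shaded gray.
form = [{"text": "Name", "fill": "gray"}, {"text": "Taro", "fill": "white"},
        {"text": "Date", "fill": "gray"}, {"text": "2015", "fill": "white"}]
print(extract_item_names(form))
```

In the example no node is empty, so condition (1) fails, while half of the nodes are shaded, so condition (2) applies and only the shaded strings are returned.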

In this way, the item name assignment unit 139 can register in the item name database 123 character string information that is more likely to be an item name.

(Item name assignment process)
An example of the processing procedure of the item name assignment unit 139 is described with reference to FIG. 11. The item name assignment unit 139 reads the node information of the node cluster (S1391) and reads the item name list from the item name database 123 (S1392). Next, the unit takes an unchecked node of the node cluster as node X (S1393). If the character string of node X exists in the item name list (Yes in S1394), the unit assigns "item name" to the item attribute of node X (S1395). If it does not (No in S1394), the unit assigns no item attribute to node X (S1396). After S1395 or S1396, when the unit determines that all nodes have been checked (that is, the processing from S1393 onward has been performed for every node) (Yes in S1397), it outputs the node information of the node cluster to the subtree pattern generation unit 140 (S1398). If an unchecked node remains (No in S1397), the process returns to S1393.
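The loop of S1393 to S1397 amounts to a membership test against the registered item names. The sketch below assumes nodes are dictionaries with a "text" key and stores the result under an "attr" key; both names are illustrative.

```python
from typing import Dict, Iterable, List


def assign_item_attributes(nodes: List[Dict[str, str]],
                           item_names: Iterable[str]) -> List[Dict[str, str]]:
    """Mark nodes whose text is a registered item name (S1394, S1395)."""
    known = set(item_names)          # set lookup keeps each check O(1)
    for node in nodes:
        if node["text"] in known:
            node["attr"] = "item name"
        # Otherwise no attribute is assigned (S1396); later steps decide.
    return nodes


cluster = [{"text": "Name"}, {"text": "Taro"}, {"text": "Date"}]
print(assign_item_attributes(cluster, ["Name", "Date", "Address"]))
```

Nodes left without an attribute are not labeled "item value" here, matching S1396, where the attribute is simply not assigned.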

In this way, the item name assignment unit 139 can assign item attributes to each node of the node cluster.

(Specific example of correction processing by the ruled line frame correction unit)
A specific example of the ruled line frame correction processing in the ruled line frame correction unit 20 is now described. In S23 to S26 of FIG. 4-3, each part of the correction unit 22 first reads the ruled line frame information and the correction rules, and performs correction processing for each correction rule. Each part then reflects the node information updates made during correction in the ruled line frame information and outputs the ruled line frame information to the next processing part. For example, the ruled line frame information updated by the correction processing of the dividing unit 24 is output to the deletion unit 25, which then performs its correction processing using that information. Specific examples of the processing of each part of the correction unit 22 are described in detail below.

(Ruled line frame combining process)
First, a specific example of the ruled line frame combining process is described with reference to FIG. 17, which is a flowchart showing an example of the process. As shown in FIG. 17, the combining unit 23 first acquires the group of combination rules from the correction rules (S231). Next, if the ruled line frame combination condition of a given combination rule in the group concerns the type or thickness of a ruled line ("type or thickness of ruled line" in S232), the combining unit 23 executes combining process A (S233). If the condition concerns the character string or fill color inside a frame ("character string or color in frame" in S232), the combining unit 23 executes combining process B (S234).

If all combination rules in the group have then been processed (Yes in S235), the process ends. If not (No in S235), the combining unit 23 acquires an unprocessed combination rule (S236) and returns to S232.

(Combining process A)
Combining process A is described with reference to FIG. 18, which is a flowchart showing an example of combining process A in S233 of FIG. 17. As shown in FIG. 18, the combining unit 23 first acquires the combination rule to be processed (S23301). For example, the ruled line frame combination condition of the rule is "one side of the ruled line frame is a dotted line", and the action is "combine the ruled line frames that are adjacent across the dotted line".

Next, the combining unit 23 refers to the ruled line frame information and acquires a ruled line frame X that matches the ruled line frame combination condition (S23302). It then acquires the direction d of the ruled line that matches the condition (S23303), and acquires the group of ruled line frames Y_1, …, Y_m adjacent to ruled line frame X in direction d (S23304).

If the combination rule additionally specifies a condition on the character string or color, the combining unit 23 determines whether the ruled line frames X, Y_1, ..., Y_m satisfy that condition (S23305). If they do (Yes in S23305), the combining unit 23 combines the ruled line frame X with the ruled line frame group Y_1, ..., Y_m in direction d (S23306). If they do not (No in S23305), the process returns to step S23302.

When the ruled line frames have been combined, the combining unit 23 treats the combined ruled line frame X as ruled line frame Z (S23307) and deletes the ruled line frame group Y_1, ..., Y_m (S23308). The combining unit 23 also updates the adjacent edges around the ruled line frame Z (S23309). If another ruled line frame matches the ruled line frame combination condition (Yes in S23310), the process returns to S23302; otherwise (No in S23310), the process ends.

(Combining process B)
Next, combining process B will be described with reference to FIG. 19. FIG. 19 is a flowchart illustrating an example of combining process B in S234 of FIG. 17. As illustrated in FIG. 19, the combining unit 23 first acquires the combination rule to be processed (S23401). For example, the ruled line frame combination condition of the rule is "the character string of the ruled line frame is '(1)' or 'application date'", and the action is "combine the ruled line frames when the character strings matching the condition are adjacent".

Next, the combining unit 23 refers to the ruled line frame information and acquires the ruled line frames X, Y_1, ..., Y_m that match the ruled line frame combination condition (S23402). Assume, for example, that the character string of ruled line frame X is "(1)" and the character string of the ruled line frame group Y_1, ..., Y_m is "application date". If the ruled line frame X and the ruled line frame group Y_1, ..., Y_m are adjacent (Yes in S23403), the combining unit 23 acquires the adjacency directions d_1, ..., d_m of the ruled line frame group Y_1, ..., Y_m (S23404).
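The adjacency test of S23403 and the direction lookup of S23404 could be sketched as below. The source does not define adjacency precisely, so this example assumes that two frames are adjacent when they share a border segment of non-zero length; `Box` and `adjacency_direction` are hypothetical names, and y is assumed to grow downward.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Hypothetical frame geometry; y grows downward (image coordinates)."""
    left: int
    top: int
    right: int
    bottom: int


def adjacency_direction(x: Box, y: Box) -> Optional[str]:
    """Return the direction in which y adjoins x, or None if they do not.

    Assumption: frames are adjacent only when they share a border segment
    of non-zero length, so frames touching at a corner are not adjacent.
    """
    overlap_v = min(x.bottom, y.bottom) > max(x.top, y.top)
    overlap_h = min(x.right, y.right) > max(x.left, y.left)
    if overlap_v and y.left == x.right:
        return "right"
    if overlap_v and y.right == x.left:
        return "left"
    if overlap_h and y.top == x.bottom:
        return "down"
    if overlap_h and y.bottom == x.top:
        return "up"
    return None
```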

The combining unit 23 then combines the ruled line frame X with each ruled line frame Y_i (i = 1, ..., m) (S23405), treats the combined ruled line frame X as ruled line frame Z (S23406), and deletes the ruled line frame group Y_1, ..., Y_m (S23407). The combining unit 23 also updates the adjacent edges around the ruled line frame Z (S23408). If another ruled line frame matches the ruled line frame combination condition (Yes in S23409), the process returns to S23402; otherwise (No in S23409), the process ends.

Note that the ruled line frames before combining are classified as shown in FIG. 20 according to the direction in which the adjacent ruled line frame group Y_1, ..., Y_m adjoins. FIG. 20 is a diagram for explaining combining process A in S233 and combining process B in S234 of FIG. 17. The format information of the ruled line frame Z produced by the ruled line frame combination processing is expressed, for example, as follows.
• Range
  The combined range of X, Y_1, ..., Y_m
• Upper-left coordinates
  (when d is right or down) the upper-left coordinates of X
  (when d is left or up) the upper-left coordinates of Y_1
• Lower-right coordinates
  (when d is right or down) the lower-right coordinates of Y_m
  (when d is left or up) the lower-right coordinates of X
• Character string
  (when d is right or down) character string of X + character string of Y_1 + ... + character string of Y_m
  (when d is left or up) character string of Y_1 + ... + character string of Y_m + character string of X
• Fill color
  The fill color of ruled line frame X
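The format rules for the combined frame Z listed above can be written out as a small function. This is a sketch under assumptions: the `Frame` class and `merge_frames` are hypothetical, and `ys` is assumed to be non-empty and ordered left to right (or top to bottom), which is the ordering the coordinate rules imply.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """Hypothetical ruled line frame: corner coordinates, text, fill color."""
    left: int
    top: int
    right: int
    bottom: int
    text: str = ""
    fill: Optional[str] = None


def merge_frames(x: Frame, ys: List[Frame], d: str) -> Frame:
    """Build frame Z from X and Y_1..Y_m, which adjoin X in direction d."""
    if d in ("right", "down"):
        # X comes first: its upper-left corner and its text lead.
        first, last = x, ys[-1]
        parts = [x.text] + [y.text for y in ys]
    elif d in ("left", "up"):
        # Y_1 comes first: X contributes the lower-right corner and tail text.
        first, last = ys[0], x
        parts = [y.text for y in ys] + [x.text]
    else:
        raise ValueError(f"unknown direction: {d}")
    # The fill color of Z is always taken from X.
    return Frame(first.left, first.top, last.right, last.bottom,
                 "".join(parts), x.fill)
```

For instance, merging a frame holding "(1)" with two frames to its right yields one frame that spans all three and carries the concatenated text.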

(Ruled line frame division processing)
Next, a specific example of the ruled line frame division processing will be described with reference to FIG. 21. FIG. 21 is a flowchart illustrating an example of ruled line frame division processing. As shown in FIG. 21, the dividing unit 24 first acquires a division rule group from the correction rules (S241). Next, when the ruled line frame division condition of a given division rule in the group concerns the character string in the frame ("character string in the frame" in S242), the dividing unit 24 executes one of division processes A, B, and C. When the condition does not concern the character string in the frame ("other" in S242), the dividing unit 24 executes either division process A or B.

Specifically, when the ruled line frame division criterion of the division rule uses adjacent ruled lines ("use adjacent ruled lines" in S243), the dividing unit 24 executes division process A (S244). When the criterion uses a number of divisions ("use number of divisions" in S243), the dividing unit 24 executes division process B (S245). When the criterion uses character string delimiters ("use character string delimiter" in S243), the dividing unit 24 executes division process C (S246).

When processing has been completed for every rule in the division rule group (Yes in S247), the process ends. Otherwise (No in S247), an unprocessed division rule is acquired (S248), and the process returns to S242.

(Division process A)
Division process A will now be described with reference to FIG. 22. FIG. 22 is a flowchart illustrating an example of division process A in S244 of FIG. 21. As shown in FIG. 22, the dividing unit 24 first acquires the division rule to be processed (S24401). For example, as shown in FIG. 23, the ruled line frame division condition of the rule is "the character string of the ruled line frame is 'name and address'", and the action is "divide at line feed codes, following the ruled line frames to the right". Which ruled line frames are followed to determine the number of divisions and the line segment intervals depends on the reference direction, as shown in FIG. 24. FIGS. 23 and 24 are diagrams for explaining division process A in S244 of FIG. 21.

Next, the dividing unit 24 refers to the ruled line frame information and acquires a ruled line frame X that matches the ruled line frame division condition (S24402). The dividing unit 24 then acquires the direction d used as the reference for dividing the ruled line frame X (S24403), and obtains the number of divisions n and the line segment intervals L_1, ..., L_n from the ruled line frames adjacent to X in direction d (S24404). The dividing unit 24 then divides the character string of the ruled line frame X according to the division criterion and denotes the resulting character strings S_1, ..., S_n (S24405).

If the number of divided character strings is not n (No in S24406), the dividing unit 24 outputs an error (step S24413), and the process returns to S24402. If the number is n (Yes in S24406), the dividing unit 24 generates ruled line frames Z_1, ..., Z_n by dividing the ruled line frame X (S24407) and stores the character strings S_1, ..., S_n in the ruled line frames Z_1, ..., Z_n, respectively (S24408).

The dividing unit 24 then adds the ruled line frames Z_1, ..., Z_n to the node information (S24409) and deletes the ruled line frame X from the node information (S24410). The dividing unit 24 also updates the adjacent edges around the ruled line frames Z (S24411). If another ruled line frame matches the ruled line frame division condition (Yes in S24412), the process returns to S24402; otherwise (No in S24412), the process ends.
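The core of division process A (S24404 to S24408) might look like the sketch below. It rests on several assumptions that the source does not state: the neighboring frames are already sorted in reading order, together they span the same extent as X, and the division criterion is "split at line feed characters". `Frame` and `divide_like_neighbors` are hypothetical names, and the error of S24413 is modeled as a `ValueError` that the caller would catch before moving on to the next frame.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """Hypothetical ruled line frame with corner coordinates and text."""
    left: int
    top: int
    right: int
    bottom: int
    text: str = ""


def divide_like_neighbors(x: Frame, neighbors: List[Frame], d: str) -> List[Frame]:
    """Split X into Z_1..Z_n, copying the intervals of the frames in direction d."""
    n = len(neighbors)              # S24404: number of divisions n
    parts = x.text.split("\n")      # S24405: character strings S_1..S_n
    if len(parts) != n:             # S24406 No -> S24413: report an error
        raise ValueError(f"expected {n} strings, got {len(parts)}")
    if d in ("left", "right"):
        # Neighbors are stacked vertically: reuse their heights L_1..L_n.
        return [Frame(x.left, nb.top, x.right, nb.bottom, s)
                for nb, s in zip(neighbors, parts)]
    # d is "up" or "down": neighbors sit side by side, reuse their widths.
    return [Frame(nb.left, x.top, nb.right, x.bottom, s)
            for nb, s in zip(neighbors, parts)]
```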

(Division process B)
Next, division process B will be described with reference to FIG. 25. FIG. 25 is a flowchart illustrating an example of division process B in S245 of FIG. 21. As shown in FIG. 25, the dividing unit 24 first acquires the division rule to be processed (S24501). For example, the ruled line frame division condition of the rule is "the character string of the ruled line frame is 'name and address'", and the action is "divide into two at the line feed code, following the ruled line frames in any adjacent direction". As shown in FIG. 26, the line segment intervals are determined by following the ruled line frames in a direction in which there are two adjacent ruled line frames. FIG. 26 is a diagram for explaining division process B in S245 of FIG. 21.

Next, the dividing unit 24 refers to the ruled line frame information and acquires a ruled line frame X that matches the ruled line frame division condition (S24502). The dividing unit 24 then divides the character string in the ruled line frame X according to the division criterion and denotes the resulting character strings S_1, ..., S_n (S24503). The dividing unit 24 then acquires a direction d in which the number of ruled line frames adjacent to X is n (S24504). If no such direction d can be acquired, or if n is 1 (Yes in S24505), the process returns to S24502.

If direction d is acquired and n is not 1 (No in S24505), the dividing unit 24 obtains the line segment intervals L_1, ..., L_n from the ruled line frames adjacent to X in direction d (S24506). The dividing unit 24 then generates ruled line frames Z_1, ..., Z_n by dividing the ruled line frame X (S24507) and stores the character strings S_1, ..., S_n in the ruled line frames Z_1, ..., Z_n, respectively (S24508).

The dividing unit 24 then adds the ruled line frames Z_1, ..., Z_n to the node information (S24509) and deletes the ruled line frame X from the node information (S24510). The dividing unit 24 also updates the adjacent edges around the ruled line frames Z (S24511). If another ruled line frame matches the ruled line frame division condition (Yes in S24512), the process returns to S24502; otherwise (No in S24512), the process ends.

(Division process C)
Next, division process C will be described with reference to FIG. 27. FIG. 27 is a flowchart illustrating an example of division process C in S246 of FIG. 21. As shown in FIG. 27, the dividing unit 24 first acquires the division rule to be processed (S24601). For example, as shown in FIG. 28, the ruled line frame division condition of the rule is "the character string of the ruled line frame consists of an item name and a check box", and the action is "divide between the item name and the check box". FIG. 28 is a diagram for explaining division process C in S246 of FIG. 21.

Next, the dividing unit 24 refers to the ruled line frame information and acquires a ruled line frame X that matches the ruled line frame division condition (S24602). The dividing unit 24 then divides the character string in the ruled line frame X according to the division criterion and denotes the resulting character strings S_1, ..., S_n (S24603). If n is 1 (Yes in S24604), the process returns to S24602.

If n is not 1 (No in S24604), the dividing unit 24 obtains line segment intervals L_1, ..., L_n that divide the ruled line frame X in accordance with the lengths of the character strings (S24605). The dividing unit 24 then generates ruled line frames Z_1, ..., Z_n by dividing the ruled line frame X (S24606) and stores the character strings S_1, ..., S_n in the ruled line frames Z_1, ..., Z_n, respectively (S24607).

The dividing unit 24 then adds the ruled line frames Z_1, ..., Z_n to the node information (S24608) and deletes the ruled line frame X from the node information (S24609). The dividing unit 24 also updates the adjacent edges around the ruled line frames Z (S24610). If another ruled line frame matches the ruled line frame division condition (Yes in S24611), the process returns to S24602; otherwise (No in S24611), the process ends.
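As a rough illustration of S24603 and S24605, the sketch below splits a string in front of its first check box glyph and then divides a horizontal extent in proportion to the character counts. Both choices are assumptions: the source only says the intervals follow "the lengths of the character strings", and the glyph set in `box_chars` and the function names are hypothetical.

```python
from typing import List, Tuple


def split_at_checkbox(text: str, box_chars: str = "□☐") -> List[str]:
    """Split 'item name + check boxes' text before the first check box glyph."""
    for i, ch in enumerate(text):
        if ch in box_chars and i > 0:
            return [text[:i], text[i:]]
    return [text]  # n = 1: nothing to divide (Yes in S24604)


def proportional_spans(left: int, right: int, parts: List[str]) -> List[Tuple[int, int]]:
    """Divide [left, right] into spans whose widths follow len(part).

    Assumes every part is non-empty, so the total length is positive.
    """
    total = sum(len(p) for p in parts)
    spans: List[Tuple[int, int]] = []
    start, seen = left, 0
    for p in parts:
        seen += len(p)
        # Integer arithmetic keeps the last span ending exactly at `right`.
        end = left + (right - left) * seen // total
        spans.append((start, end))
        start = end
    return spans
```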

(Ruled line frame deletion processing)
Next, a specific example of the ruled line frame deletion processing will be described with reference to FIG. 29. FIG. 29 is a flowchart illustrating an example of ruled line frame deletion processing. As shown in FIG. 29, the deletion unit 25 first acquires a deletion rule group from the correction rules (S251). Next, when the ruled line frame deletion condition of a given deletion rule in the group concerns the type or thickness of a ruled line or the fill color in the frame ("type of ruled line, thickness, color in the frame" in S252), the deletion unit 25 acquires the ruled line frame region that satisfies the condition, as shown in FIG. 30 (S253). FIG. 30 is a diagram for explaining the ruled line frame deletion processing. As shown in FIG. 30, ruled line types specified in the condition include, for example, dotted lines and thick lines, and fill colors specified in the condition include gray.

Next, the deletion unit 25 acquires the ruled line frame group that satisfies the condition (S254), deletes that ruled line frame group (S255), and updates the node information (S256). When processing has been completed for every rule in the deletion rule group (Yes in S257), the process ends. Otherwise (No in S257), an unprocessed deletion rule is acquired (S258), and the process returns to S252.
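Steps S254 to S256 amount to filtering the node list by a predicate. A minimal sketch follows, assuming a hypothetical `Frame` record with `line_type` and `fill` attributes; updating the adjacent edges of the remaining frames is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Frame:
    """Hypothetical ruled line frame; attribute names are assumptions."""
    name: str
    line_type: str = "solid"    # e.g. "solid", "dotted", "thick"
    fill: Optional[str] = None  # e.g. "gray"


def delete_matching(frames: List[Frame],
                    condition: Callable[[Frame], bool]) -> List[Frame]:
    """Return the node list without the frames that satisfy the condition."""
    return [f for f in frames if not condition(f)]
```

A deletion rule such as "dotted or thick ruled lines, or gray fill" would then be passed as `lambda f: f.line_type in ("dotted", "thick") or f.fill == "gray"`.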

(Ruled line frame addition processing)
Next, a specific example of the ruled line frame addition processing will be described with reference to FIG. 31. FIG. 31 is a flowchart illustrating an example of ruled line frame addition processing. As shown in FIG. 31, the adding unit 26 first acquires an addition rule group from the correction rules (S261). Next, when the ruled line frame addition condition of a given addition rule in the group concerns the presence, type, or thickness of a ruled line ("presence, type, thickness of ruled line" in S262), the adding unit 26 executes addition process A (S263). When the condition concerns the fill color or character string in the frame ("character string, color in the frame" in S262), the adding unit 26 executes addition process B (S264).

When processing has been completed for every rule in the addition rule group (Yes in S265), the process ends. Otherwise (No in S265), an unprocessed addition rule is acquired (S266), and the process returns to S262.

(Addition process A)
Addition process A will now be described with reference to FIG. 32. FIG. 32 is a flowchart illustrating an example of addition process A in S263 of FIG. 31. As illustrated in FIG. 32, the adding unit 26 first acquires the addition rule to be processed (S26301). For example, as shown in FIG. 33, the ruled line frame addition condition of the rule is "there is a ruled line in some direction from the character string", and the action is "add a ruled line frame".

Next, the adding unit 26 refers to the ruled line frame information and acquires a character string range that matches the ruled line frame addition condition (S26302). If there is no ruled line above, below, to the left of, or to the right of the acquired character string range (No in S26303), it is determined whether another character string range matches the ruled line frame addition condition (S26316). If there is a ruled line in any of those directions (Yes in S26303), the adding unit 26 acquires the direction d in which the ruled line lies (S26304).

If d is up ("up" in S26306), the adding unit 26 acquires the set B of ruled line frames whose upper-left coordinates (x_2, y_2) satisfy y = y_2 (S26307). If d is down ("down" in S26306), the adding unit 26 acquires the set B of ruled line frames whose lower-right coordinates (x_2, y_2) satisfy y = y_2 (S26308). If d is left ("left" in S26306), the adding unit 26 acquires the set B of ruled line frames whose upper-left coordinates (x_2, y_2) satisfy x = x_2 (S26309). If d is right ("right" in S26306), the adding unit 26 acquires the set B of ruled line frames whose lower-right coordinates (x_2, y_2) satisfy x = x_2 (S26310). FIG. 34 shows an example in which d is up, and FIG. 35 shows an example in which d is left. FIGS. 34 and 35 are diagrams for explaining addition process A in S263 of FIG. 31.

The adding unit 26 then obtains a connected graph G for the ruled line frame set B (S26311) and obtains from G the line segment range L(s, g) that contains (x, y) (S26312). If d is up or down ("up or down" in S26313), the adding unit 26 creates a ruled line frame Z whose height is the height of the character string and whose width is L(s, g), and adds it to the node information (S26314). If d is left or right ("left or right" in S26313), the adding unit 26 creates a ruled line frame Z whose width is the width of the character string and whose height is L(s, g), and adds it to the node information (S26314).
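One possible reading of S26311 and S26312 is interval merging along the ruled line. The source does not say how the connected graph G is built, so this sketch assumes that frames in B are "connected" when their edges along the ruled line touch or overlap, and that L(s, g) is the merged run containing the position of the character string. `Span`, `connected_runs`, and `segment_containing` are hypothetical names.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) of one frame edge along the ruled line


def connected_runs(spans: List[Span]) -> List[Span]:
    """Merge spans that touch or overlap into maximal connected runs."""
    runs: List[Span] = []
    for start, end in sorted(spans):
        if runs and start <= runs[-1][1]:
            # Touches or overlaps the previous run: extend it.
            runs[-1] = (runs[-1][0], max(runs[-1][1], end))
        else:
            runs.append((start, end))
    return runs


def segment_containing(spans: List[Span], pos: int) -> Optional[Span]:
    """Return the run L(s, g) with s <= pos <= g, or None if there is none."""
    for s, g in connected_runs(spans):
        if s <= pos <= g:
            return (s, g)
    return None
```

With d up or down, `spans` would hold the left and right x coordinates of the frames in B and `pos` the x coordinate of the character string; with d left or right, the same functions apply to y coordinates.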

If another character string range matches the ruled line frame addition condition (Yes in S26316), the process returns to S26302; otherwise (No in S26316), the process ends.

(Addition process B)
Addition process B will now be described with reference to FIG. 36. FIG. 36 is a flowchart illustrating an example of addition process B in S264 of FIG. 31. As shown in FIG. 36, the adding unit 26 first acquires the addition rule to be processed (S26401). For example, as shown in FIG. 37, the ruled line frame addition condition of the rule is "the character string is filled with gray", and the action is "add a ruled line frame".

Next, the adding unit 26 refers to the ruled line frame information and acquires a character string range that matches the ruled line frame addition condition (S26402). If there is a ruled line above, below, to the left of, or to the right of the acquired character string range (Yes in S26403), the adding unit 26 executes addition process A (S26404). If there is no ruled line in any of those directions (No in S26403), the adding unit 26 generates a ruled line frame Z from the coordinate position, height, and width of the character string (S26405) and adds the ruled line frame Z to the node information (S26406).
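The branch at S26403 can be sketched as follows. `TextRegion`, `Frame`, and `add_frame_for_region` are hypothetical names, the flag `has_adjacent_line` stands in for the ruled line check, and addition process A is passed in as a callback rather than implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TextRegion:
    """Hypothetical text region: top-left corner, size, text, fill color."""
    x: int
    y: int
    width: int
    height: int
    text: str
    fill: Optional[str] = None


@dataclass
class Frame:
    """Hypothetical ruled line frame with corner coordinates and text."""
    left: int
    top: int
    right: int
    bottom: int
    text: str = ""


def add_frame_for_region(
    region: TextRegion,
    has_adjacent_line: bool,
    addition_process_a: Callable[[TextRegion], Frame],
) -> Frame:
    if has_adjacent_line:
        # Yes in S26403 -> S26404: reuse addition process A.
        return addition_process_a(region)
    # No in S26403 -> S26405: the frame is the text's bounding box.
    return Frame(region.x, region.y,
                 region.x + region.width, region.y + region.height,
                 region.text)
```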

If another character string range matches the ruled line frame addition condition (Yes in S26407), the process returns to S26402; otherwise (No in S26407), the process ends.

(Effects)
The acquisition unit 21 extracts ruled line frames from a form and acquires, as ruled line frame information for each ruled line frame, at least the type or thickness of the ruled lines, the character string in the frame, and the fill color in the frame. When the ruled line frame information of a plurality of ruled line frames satisfies a preset ruled line frame combination condition, the combining unit 23 combines those ruled line frames. After the processing by the combining unit 23, the dividing unit 24 divides a ruled line frame whose ruled line frame information satisfies a preset ruled line frame division condition. After the processing by the dividing unit 24, the deletion unit 25 deletes a ruled line frame whose ruled line frame information satisfies a preset ruled line frame deletion condition.

As a result, even if a form contains a part in which an item name or item value is written in a way that does not satisfy the predetermined requirements, that part is corrected to a notation that does satisfy them, so the logical relationships between item names and between item names and item values can be estimated accurately.

In addition, the corrected ruled line frames are classified into item names of the form and item values corresponding to the item names, and tree-structured data that associates the item names with the item values is generated and output. This makes it possible to obtain tree-structured data in which the logical relationship between item names and item values is accurately estimated.

The acquisition unit 21 further extracts regions of the form in which character strings are written and acquires, as region information for each region, at least the presence or absence of a ruled line above, below, to the left of, or to the right of the region, the character string, and the fill color of the region; when a ruled line exists in any of those directions, it also acquires the type or thickness of that ruled line. When the region information of a region satisfies a preset ruled line frame addition condition, the adding unit 26 adds ruled lines to the region. This makes it possible to estimate the logical relationship between item names and item values even for character strings that are not enclosed in a ruled line frame.

(Other embodiments)
As shown in FIG. 38, the ruled line frame correction unit 20 may recognize an input operation instruction (step S22a) and perform processing according to that instruction. FIG. 38 is a flowchart showing another example of the ruled line frame correction process in S3 of FIG. 4-1. Examples of operation instructions include tree structure data output, form output, and ruled line frame graph output. The operation when the instruction is tree structure data output is as described above. When the instruction is form output, the ruled line frame correction unit 20 outputs a form using the node information resulting from the correction processing by the correction unit 22 (step S29a). When the instruction is ruled line frame graph output, the ruled line frame correction unit 20 outputs the node information as is, as a ruled line frame graph (step S29b).

As shown in FIG. 39, the correction unit 22 of the ruled line frame correction unit 20 may sort the correction rules in advance in the order of combination rules, division rules, deletion rules, and addition rules. FIG. 39 is a flowchart showing another example of the ruled line frame correction process in S3 of FIG. 4-1. In this case, each unit can acquire its correction rules without searching for them.
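The pre-sorting described here could be done with a stable sort keyed on the rule kind. The dictionary layout and the `kind` values below are assumptions made for this example; only the ordering (combination, division, deletion, addition) comes from the text.

```python
from typing import Dict, List

# Order in which the rule kinds are applied, as stated in the text.
RULE_ORDER = {"combine": 0, "divide": 1, "delete": 2, "add": 3}


def sort_correction_rules(rules: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the rules grouped as combine, divide, delete, add.

    sorted() is stable, so rules of the same kind keep their original order.
    """
    return sorted(rules, key=lambda rule: RULE_ORDER[rule["kind"]])
```

After sorting, each unit can read its rules as one contiguous slice instead of scanning the whole list.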

When form output is performed, the ruled line frame correction unit 20 sets a flag on each piece of node information that it has corrected. For each ruled line frame corresponding to flagged node information, it then rewrites the configuration of the frame, its character string, its fill color, and so on, while referring to the coordinate information included in the node information.
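
The flag-based rewriting can be illustrated with the following Python sketch. The `NodeInfo` fields and the representation of the form as a dictionary keyed by frame coordinates are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), hypothetical


@dataclass
class NodeInfo:
    box: Box                 # coordinate information of the ruled line frame
    text: str
    fill_color: str
    corrected: bool = False  # flag set when the correction process changed the node


def apply_to_form(form: Dict[Box, dict], nodes: List[NodeInfo]) -> int:
    """Rewrite only the frames whose node information is flagged.

    Returns the number of frames that were rewritten."""
    rewritten = 0
    for node in nodes:
        if not node.corrected:
            continue  # unflagged frames are left exactly as they were
        form[node.box] = {"text": node.text, "fill_color": node.fill_color}
        rewritten += 1
    return rewritten
```

Because unflagged frames are skipped, the parts of the original form that the correction did not touch are preserved as they are.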

[System configuration, etc.]
Each component of each illustrated apparatus is functionally conceptual and does not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of each apparatus is not limited to that shown in the figures, and all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of each processing function performed in each apparatus can be realized by a CPU (Central Processing Unit) and a program analyzed and executed by the CPU, or can be realized as hardware using wired logic.

Among the processes described in the present embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and in the drawings can be changed arbitrarily unless otherwise specified.

(Program)
As one embodiment, the ruled line frame correction method may be carried out by a ruled line frame correction apparatus having the same functions as the ruled line frame correction unit 20. In this case, the ruled line frame correction apparatus can be implemented by installing, on a desired computer, a ruled line frame correction program that executes the above ruled line frame correction as packaged software or online software. For example, by causing an information processing apparatus to execute the ruled line frame correction program, the information processing apparatus can be made to function as the ruled line frame correction apparatus. The information processing apparatus referred to here includes desktop and notebook personal computers. It also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).

The ruled line frame correction apparatus can also be implemented as a server apparatus that treats a terminal device used by a user as a client and provides the client with a service related to the above ruled line frame correction. For example, the ruled line frame correction apparatus is implemented as a server apparatus that provides a ruled line frame correction service taking ruled line frame information as input and producing corrected ruled line frame information as output. In this case, the ruled line frame correction apparatus may be implemented as a Web server, or as a cloud that provides the above ruled line frame correction service by outsourcing. As described above, the present invention can also be realized by a computer and a program, and the program can be recorded on a recording medium or provided over a network.

FIG. 40 is a diagram illustrating a computer that executes the ruled line frame correction program. As shown in FIG. 40, the computer 1000 includes, for example, a memory 1010, a CPU (Central Processing Unit) 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. A mouse 1110 and a keyboard 1120, for example, are connected to the serial port interface 1050. A display 1130, for example, is connected to the video adapter 1060.

As shown in FIG. 40, the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored in, for example, the hard disk drive 1090 or the memory 1010.

The ruled line frame correction program is stored in the hard disk drive 1090 as, for example, a program module describing instructions to be executed by the computer 1000. Specifically, a program module describing each process executed by the ruled line frame correction apparatus described above is stored in the hard disk drive 1090.

Data used for information processing by the ruled line frame correction program is stored as program data in, for example, the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the hard disk drive 1090 into the RAM 1012 as necessary and executes each of the procedures described above.

The program module 1093 and the program data 1094 related to the ruled line frame correction program are not limited to being stored in the hard disk drive 1090. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 related to the ruled line frame correction program may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network), and read by the CPU 1020 via the network interface 1070.

Description of symbols
10 Data structure extraction apparatus
11 Data structure extraction unit
12 Storage unit
13 Graph generation unit
20 Ruled line frame correction unit
21 Acquisition unit
22 Correction unit
23 Combining unit
24 Dividing unit
25 Deleting unit
26 Adding unit
27 Correction rule input unit
28 Correction rule storage unit
121 Form structure rule
122 Form database
123 Item name database
131 Operation interface identification unit
132 Form format information acquisition unit
133 Node generation unit
134 Property information acquisition unit
135 Adjacent edge generation unit
136 Inclusion edge generation unit
137 Item name registration unit
138 Node cluster unit
139 Item name assignment unit
140 Subtree pattern generation unit
141 Tree structure data construction unit
142 Tree structure selection unit
143 Form structure construction unit

Claims

A ruled line frame correction method comprising:
an acquisition step of extracting ruled line frames from a form and acquiring, as ruled line frame information for each ruled line frame, at least the type or thickness of the ruled lines, the character string in the frame, and the fill color in the frame;
a combining step of combining a plurality of ruled line frames when the ruled line frame information of the plurality of ruled line frames satisfies a preset ruled line frame combining condition;
a dividing step of dividing a ruled line frame, after the processing by the combining step has been executed, when the ruled line frame information of the ruled line frame satisfies a preset ruled line frame dividing condition; and
a deleting step of deleting a ruled line frame, after the processing by the dividing step has been executed, when the ruled line frame information of the ruled line frame satisfies a preset ruled line frame deletion condition.

The ruled line frame correction method according to claim 1, further comprising a tree structure data output step of, after the combining step, the dividing step, and the deleting step have been executed, classifying the ruled line frames into item names of the form and item values corresponding to the item names, and generating and outputting tree structure data in which the item names are associated with the item values.

The ruled line frame correction method according to claim 1 or 2, wherein the acquisition step further extracts, from the form, areas in which character strings are written, acquires, as area information for each area, at least the presence or absence of a ruled line on the top, bottom, left, or right side of the area, the character string, and the fill color of the area, and, when a ruled line exists on any of the top, bottom, left, and right sides of the area, acquires the type or thickness of that ruled line,
the method further comprising an adding step of adding a ruled line to an area, after the processing by the deleting step has been executed, when the area information of the area satisfies a preset ruled line frame addition condition.

The ruled line frame correction method according to any one of claims 1 to 3, wherein the combining step combines a first ruled line frame and a second ruled line frame adjacent to the first ruled line frame when at least one of the first ruled line frame and the second ruled line frame satisfies at least one of the following: the type of ruled line is a preset specific type; the thickness of the ruled line falls within a preset specific thickness range; the character string in the frame is a preset specific character string; and the fill color in the frame is a preset specific color.

The ruled line frame correction method according to claim 1, wherein the dividing step divides the ruled line frame when the ruled line frame satisfies at least one of the following: the character string in the frame is a preset specific character string; and the fill color in the frame is a preset specific color.

The ruled line frame correction method according to claim 1, wherein the deleting step deletes the ruled line frame when the ruled line frame satisfies at least one of the following: the type of ruled line is a preset specific type; the thickness of the ruled line falls within a preset specific thickness range; the character string in the frame is a preset specific character string; and the fill color in the frame is a preset specific color.

A ruled line frame correction apparatus comprising:
an acquisition unit that extracts ruled line frames from a form and acquires, as ruled line frame information for each ruled line frame, at least the type or thickness of the ruled lines, the character string in the frame, and the fill color in the frame;
a combining unit that combines a plurality of ruled line frames when the ruled line frame information of the plurality of ruled line frames satisfies a preset ruled line frame combining condition;
a dividing unit that divides a ruled line frame, after the processing by the combining unit has been executed, when the ruled line frame information of the ruled line frame satisfies a preset ruled line frame dividing condition; and
a deleting unit that deletes a ruled line frame, after the processing by the dividing unit has been executed, when the ruled line frame information of the ruled line frame satisfies a preset ruled line frame deletion condition.

A ruled line frame correction program for causing a computer to function as the ruled line frame correction apparatus according to claim 7.