JP2013015909A - Table structure automatic recognition program, table structure automatic recognition method and table structure automatic recognition device

Info

Publication number: JP2013015909A
Application number: JP2011146569A
Authority: JP
Inventors: Isao Nanba; 功難波
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-06-30
Filing date: 2011-06-30
Publication date: 2013-01-24
Anticipated expiration: 2031-06-30
Also published as: JP5664481B2

Abstract

PROBLEM TO BE SOLVED: To provide a table structure automatic recognition program, a table structure automatic recognition method, and a table structure automatic recognition device capable of automatically recognizing tables with high accuracy using a simple definition of table structures. SOLUTION: A computer executes processing comprising: extracting multiple tables from a document database storing document data that includes multiple tables, each including multiple sections; referring to a table extraction condition database storing a first heading of the tables to be extracted, and checking the first heading against the data of the heading item sections included in each of the multiple tables; extracting a table whose check result satisfies a predetermined condition; and correcting, on the basis of the length of each section included in the table satisfying the predetermined condition, at least one of the length of a heading item section and the length of the data item section corresponding to that heading item, these lengths being used to associate the heading item sections of the table with the data item sections.

Description

The present invention relates to a table structure automatic recognition program, a table structure automatic recognition method, and a table structure automatic recognition device that automatically recognize a table structure.

In software development, a consistency check is performed to confirm consistency among multiple design documents containing tables that define various requirements. In the consistency check, tables that match a table structure definition are identified in each design document, items are extracted from the identified tables, and the items are compared to confirm whether they match.

In conventional consistency checks, table structures are defined manually, and defining them is difficult when the volume of design documents is large. For this reason, techniques such as automatic table structure recognition have conventionally been used to reduce the effort of defining table structures.

One method for automatically recognizing a table structure is to define item names and item types, such as hierarchical or two-dimensional structures, in advance and then acquire the values corresponding to the items. Another method uses an item name word dictionary in which item name character strings and attributes are registered, detects character strings in a form image, and collates them with the dictionary. A further known method prepares a heading item dictionary representing data attributes, calculates a rectangular area for each heading from the data attributes, the distance from the heading, and the positional relationship, and obtains the item value for which the rectangular area is smallest.

JP 2005-275830 A, JP 2008-204226 A, JP 2009-110416 A

In conventional methods for automatically recognizing a table structure, attributes must be defined for data that are general character strings, so a function for estimating the attribute of each item is required. In addition, when headings and items are associated on the basis of their positional relationship, they may be associated incorrectly, for example when an item contains a large amount of description.

In view of the above circumstances, an object of one embodiment of the present invention is to provide a table structure automatic recognition program, a table structure automatic recognition method, and a table structure automatic recognition device that automatically recognize tables with high accuracy using a simple table structure definition.

The disclosed technique causes a computer to execute processing of: extracting a plurality of tables from a document database storing document data that includes a plurality of tables, each including a plurality of sections; referring to a table extraction condition database storing a first heading of a table to be extracted, and collating the first heading with the data of the heading item sections included in each of the plurality of tables; extracting a table for which the collation result satisfies a predetermined condition; and correcting, based on the length of each section included in the table satisfying the predetermined condition, at least one of the length of a heading item section and the length of the data item section corresponding to that heading item, these lengths being used to associate heading item sections with data item sections in the table.

The technique may also be embodied as a method for executing the above processing, a device that executes the above program, or a computer-readable storage medium storing the above program.

Tables are automatically recognized with high accuracy using a simple table structure definition.

A diagram explaining the outline of automatic recognition of a table structure.
A diagram showing an example hardware configuration of the table structure automatic recognition device.
A diagram explaining an example functional configuration of the table structure automatic recognition device.
A diagram explaining the databases stored in the storage unit.
A diagram showing an example of the table extraction condition database.
A diagram showing an example of the item matching database.
A flowchart explaining the overall operation of the table structure automatic recognition device.
A diagram explaining the extraction of rectangular areas.
A flowchart explaining the processing of the rectangular area extraction unit.
A diagram explaining the extraction of similar tables that match the table extraction conditions.
A flowchart explaining the processing of the heading item collation unit.
A diagram explaining the calculation of similarity and the association of heading items with sections.
A first diagram showing an example in which the similarity is calculated for each heading item.
A second diagram showing an example in which the similarity is calculated for each heading item.
A flowchart explaining the details of the processing of step S1112.
A diagram showing an example in which the sections with the highest similarity are allocated to the heading item list so that the correspondence is one-to-one.
A flowchart explaining the processing by the data item length correction unit.
A first diagram explaining a specific example of correcting the length of a data item.
A second diagram explaining a specific example of correcting the length of a data item.
A flowchart explaining the processing by the heading item length correction unit.
A first diagram showing a specific example of correcting the length of a heading item.
A second diagram showing a specific example of correcting the length of a heading item.
A third diagram showing a specific example of correcting the length of a heading item.
A fourth diagram showing a specific example of correcting the length of a heading item.
A flowchart explaining the processing by the heading correspondence data specifying unit.
A diagram explaining how the data item corresponding to a heading item is identified.
A diagram explaining the input format of a table.
A diagram showing an input example in which heading items and data items are associated.
A diagram showing an example of the association result.

In this embodiment, a table structure automatic recognition device automatically recognizes the structure of tables for which headings are defined. Automatic recognition of a table structure includes associating the headings of a table with its items. It is used, for example, in a consistency check that confirms the consistency of requirement definitions across tables when each piece of document data includes a table defining requirements. The document data is, for example, design document data created during software design.

An outline of automatic table structure recognition by the table structure automatic recognition device is described below with reference to FIG. 1. FIG. 1 is a diagram explaining the outline of automatic recognition of a table structure.

In this embodiment, the documents included in the design documents are classified by type in advance, and the heading items of the tables included in the documents are defined for each type. For each document type, the table structure automatic recognition device extracts tables consisting of a plurality of cells (sections) from the documents as rectangular areas (step S11). The device then collates the heading item data of the extracted tables and extracts tables whose heading item data are similar (step S12). Next, the device corrects the lengths of the data items and the heading items of the extracted tables (step S13). In this embodiment, the length of a heading item is the length of the section corresponding to the heading item, specifically the number of characters in that section. Likewise, the length of a data item is the length of the section corresponding to the data item, specifically the number of characters in that section. Finally, the device identifies candidate data items corresponding to each heading item (step S14).
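The length definition above can be sketched as follows. This is only an illustration: the `Section` class and its field names are assumptions introduced here and do not appear in the publication.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """One cell (section) of a table. Class and field names are illustrative."""
    row: int
    col: int
    text: str

    @property
    def length(self) -> int:
        # The embodiment defines a section's length as its character count.
        return len(self.text)


heading = Section(row=0, col=0, text="organization name")
data = Section(row=1, col=0, text="sales department")
print(heading.length, data.length)
```

Measuring length in characters rather than in page coordinates means the later correction steps only need the cell text, not the rendered layout.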

In this way, this embodiment automatically recognizes the structure of tables contained in documents from a table structure definition that consists only of the heading items of the table and the type of the document containing the table.

The table structure automatic recognition device that realizes the automatic recognition described above is a computer device with a hardware configuration such as that shown in FIG. 2.

FIG. 2 is a diagram showing an example hardware configuration of the table structure automatic recognition device. In the table structure automatic recognition device 100, an input device 11, a display device 12, a main storage device 13, a CPU 14, an interface device 15, an auxiliary storage device 16, and a driver device 17 are interconnected via a bus B.

The input device 11, the display device 12, the main storage device 13, the CPU 14, the interface device 15, the auxiliary storage device 16, and the driver device 17, which are interconnected via the bus B, can exchange data with one another under the control of the CPU 14. The CPU 14 is a central processing unit that controls the operation of the entire table structure automatic recognition device 100.

The interface device 15 receives data from other information processing devices and passes the data to the CPU 14. The interface device 15 also transmits data to other information processing devices in response to instructions from the CPU 14.

The auxiliary storage device 16 stores, as part of the programs that provide the functions of the table structure automatic recognition device 100, at least a table structure automatic recognition program that causes the device 100 to execute automatic table structure recognition processing.

The table structure automatic recognition device 100 becomes a device with an automatic table structure recognition function when the CPU 14 reads the table structure automatic recognition program from the auxiliary storage device 16 and executes it. The program may instead be stored in the main storage device 13, which the CPU 14 can access. The input device 11 accepts data input under the control of the CPU 14. The program can also be recorded on a recording medium 18 that the table structure automatic recognition device 100 can read.

Recording media 18 readable by the table structure automatic recognition device 100 include magnetic recording media, optical discs, magneto-optical recording media, and semiconductor memories. Magnetic recording media include HDDs (Hard Disk Drives), flexible disks (FDs), and magnetic tapes (MTs). Optical discs include DVDs (Digital Versatile Discs), DVD-RAMs (Random Access Memory), CD-ROMs (Compact Disc Read Only Memory), and CD-R (Recordable) / RW (ReWritable) discs. Magneto-optical recording media include MOs (Magneto-Optical disks). To distribute the table structure automatic recognition program, a portable recording medium 18 such as a DVD or CD-ROM on which the program is recorded may be sold, for example.

In the table structure automatic recognition device 100 that executes the table structure automatic recognition program, the driver device 17, for example, reads the program from the recording medium 18 on which it is recorded. The CPU 14 stores the read program in the main storage device 13 or the auxiliary storage device 16.

The table structure automatic recognition device 100 then reads the table structure automatic recognition program from its own storage, namely the main storage device 13 or the auxiliary storage device 16, and executes processing according to the program.

FIG. 3 is a diagram explaining an example functional configuration of the table structure automatic recognition device. The table structure automatic recognition device 100 of this embodiment includes a recognition processing unit 200 and a storage unit 300.

The recognition processing unit 200 includes a rectangular area extraction unit 210, a heading item collation unit 220, a length correction unit 230, and a heading corresponding item specifying unit 240.

The rectangular area extraction unit 210 extracts tables from the design document data stored in the design document database 310. The design document data of this embodiment is document data that includes, for example, a plurality of tables and text data. The document data may also consist of tables only. The design document database 310 of this embodiment is a so-called document database and stores design document data, which is a kind of document data.

The heading item collation unit 220 collates the heading item data of an extracted table with the heading item data of the other tables stored in the design document database 310 and extracts similar tables (hereinafter, "similar tables"). The length correction unit 230 corrects the lengths of the heading items and data items of the extracted table based on the lengths of the heading items and data items of the similar tables and of the extracted table. The heading corresponding item specifying unit 240 identifies the data item corresponding to each heading item in the extracted table. The processing of each unit is described in detail later.

The storage unit 300 includes the design document database 310, a table extraction condition database 320, and an item matching database 330. The storage unit 300 is formed in the main storage device 13 and/or the auxiliary storage device 16, and each of these databases is provided in the main storage device 13 and/or the auxiliary storage device 16.

The databases stored in the storage unit 300 are described below with reference to FIGS. 4 to 6. FIG. 4 is a diagram explaining the databases stored in the storage unit.

Portion 41 in FIG. 4 illustrates the design document database 310. As portion 41 shows, the design document data of this embodiment is divided by document type and stored in the design document database 310 for each document type. In this embodiment, the design document data is divided into document types A, B, and C. Document types are assigned, for example, according to the format of the tables included in the design document data.

Portion 42 in FIG. 4 illustrates the table extraction condition database 320. For each document type, the table extraction condition database 320 of this embodiment stores identifiers of the heading items of the tables included in the design document data in association with the data of those heading items. For example, the identifiers of heading items 1 and 2 in the table format of document type A and the data of heading items 1 and 2 are stored as the table extraction condition for document type A. The table extraction condition database 320 also stores the identifier and the data of heading item 1 in the table format of document type B.

Portion 43 in FIG. 4 illustrates the item matching database 330. The item matching database 330 of this embodiment stores, in association with each other, data items that are expected to be consistent across tables included in design document data of different document types. For example, data item 41 of document type A and data item 43 of document type B are stored in association with each other as data items expected to be consistent. Similarly, data item 44 of document type C is stored in association with data item 43 of document type B as a data item expected to be consistent with it.

FIG. 5 is a diagram showing an example of the table extraction condition database. The table extraction condition database 320 of this embodiment stores document types, heading item identifiers, and heading item data in association with one another.

The example in FIG. 5 stores a list of heading items for each document type. For example, heading item list 321 is the list of heading items for document type A, and heading item list 322 is the list for document type B. The data of heading item A-1 in heading item list 321 is "organization name / role name / person in charge". The data of heading item B-1 in heading item list 322 is "organization and role". In this embodiment, the heading item list stored for each document type in the table extraction condition database 320 serves as the condition for extracting tables from the design document data.
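One way to picture this database is as a nested mapping from document type to heading item identifier to heading data. The sketch below is an assumption for illustration: the variable and function names are invented here, the strings are English renderings of the FIG. 5 values, and the publication does not specify any storage format.

```python
# Hypothetical in-memory stand-in for the table extraction condition database 320.
# Outer key: document type. Inner key: heading item identifier. Value: heading data.
TABLE_EXTRACTION_CONDITIONS = {
    "A": {"A-1": "organization name / role name / person in charge"},
    "B": {"B-1": "organization and role"},
}


def heading_items_for(document_type: str) -> dict[str, str]:
    """Return the heading item list (identifier -> heading data) for a document type."""
    # An unknown document type simply has no extraction condition.
    return TABLE_EXTRACTION_CONDITIONS.get(document_type, {})


print(heading_items_for("B"))
```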

FIG. 6 is a diagram showing an example of the item matching database. The item matching database 330 of this embodiment stores, in association with each other, the heading items of data items that are expected to be consistent. In the example of FIG. 6, heading item A-1 in heading item list 321 of document type A is associated with heading item B-1 in heading item list 322 of document type B. This indicates that the data item corresponding to heading item A-1 and the data item corresponding to heading item B-1 are likely to match.
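The association in FIG. 6 can be sketched as a list of pairs that is searched in both directions. The names `ITEM_MATCHES` and `counterparts` are hypothetical, and the pair representation is an assumption rather than the format used in the publication.

```python
# Hypothetical stand-in for the item matching database 330. Each entry pairs two
# (document type, heading item identifier) keys whose data items should agree.
ITEM_MATCHES = [
    (("A", "A-1"), ("B", "B-1")),
]


def counterparts(document_type: str, heading_id: str) -> list[tuple[str, str]]:
    """Return the heading items expected to be consistent with the given one."""
    key = (document_type, heading_id)
    result = []
    for left, right in ITEM_MATCHES:
        # The association is symmetric, so look at both sides of each pair.
        if left == key:
            result.append(right)
        elif right == key:
            result.append(left)
    return result


print(counterparts("A", "A-1"))
```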

Next, the operation of the table structure automatic recognition device 100 of this embodiment is described with reference to FIG. 7. FIG. 7 is a flowchart explaining the overall operation of the table structure automatic recognition device. The processing of each step shown in FIG. 7 is described in detail later.

When automatic table structure recognition processing starts, the table structure automatic recognition device 100 of this embodiment uses the rectangular area extraction unit 210 to extract rectangular areas from the design document data to be processed (step S701). The rectangular areas extracted here are the tables included in the design document data.

Next, the table structure automatic recognition device 100 uses the heading item list for each document type in the table extraction condition database 320 as the table extraction condition, and the heading item collation unit 220 extracts similar tables that match the condition (step S702). Specifically, for example, the heading item list of the tables included in the design document data of document type A is used as the extraction condition, and tables of document type A whose heading item lists match it to a certain degree are extracted as similar tables.

Next, the table structure automatic recognition device 100 corrects the lengths of the data items of the extracted similar tables using the data item length correction unit 231 of the length correction unit 230 (step S703). The device then corrects the lengths of the heading items of the extracted similar tables using the heading item length correction unit 232 of the length correction unit 230 (step S704).

Next, the table structure automatic recognition device 100 uses the heading correspondence data specifying unit 240 to identify the data item corresponding to each corrected heading item (step S705). The device then outputs a list in which the heading items are associated with the data items (step S706).
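The order of steps S701 to S705 can be sketched as a simple pipeline. The function and parameter names below are assumptions made for illustration. Each step is passed in as a callable because the publication describes the individual steps separately, and nothing here implements them.

```python
from typing import Any, Callable

Step = Callable[[Any], Any]


def run_recognition(
    design_data: Any,
    *,
    extract_rectangles: Step,       # step S701
    extract_similar_tables: Step,   # step S702
    correct_data_lengths: Step,     # step S703
    correct_heading_lengths: Step,  # step S704
    identify_data_items: Step,      # step S705
) -> Any:
    """Chain the recognition steps in the order shown in FIG. 7."""
    tables = extract_rectangles(design_data)
    similar = extract_similar_tables(tables)
    similar = correct_data_lengths(similar)
    similar = correct_heading_lengths(similar)
    # The caller outputs the returned heading/data association list (step S706).
    return identify_data_items(similar)
```

Passing the steps as parameters keeps the sketch runnable while making it explicit that data item lengths are corrected before heading item lengths.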

The extraction of rectangular areas executed in step S701 is described in detail below with reference to FIG. 8. FIG. 8 is a diagram explaining the extraction of rectangular areas.

The design document data 80 shown in FIG. 8 is, for example, design document data of document type A. The design document data 80 includes documents 81, 82, ..., 8n. When the design document data 80 of document type A is selected for processing, the rectangular area extraction unit 210 of this embodiment detects and extracts rectangular areas from each document included in the design document data 80. For example, the rectangular area extraction unit 210 extracts rectangular area 1 and rectangular area 2 from document 81, and rectangular area 3 and rectangular area 4 from document 82.

FIG. 9 is a flowchart explaining the processing of the rectangular area extraction unit.

The rectangular area extraction unit 210 of this embodiment selects the design document data to be processed from the design document database 310 (step S901). It then sets a pointer to the upper left of a document included in the design document data (step S902). Next, it scans the pointer to the right end of the document and determines whether the upper left corner of a rectangular area exists (step S903).

If a corner is detected (step S904), the rectangular area extraction unit 210 proceeds to step S909, described later. If no corner is detected in step S904, the rectangular area extraction unit 210 determines whether the pointer is at the lower right of the document (step S905). If the pointer is at the lower right of the document in step S905, the rectangular area extraction unit 210 determines whether the design document data includes another document (step S906).

If no other document exists in step S906, the rectangular area extraction unit 210 ends the processing. If another document exists, the rectangular area extraction unit 210 selects the next document (step S907) and returns to step S902. If the pointer is not at the lower right of the document in step S905, the rectangular area extraction unit 210 moves the pointer down to the next position and back to the left end (step S908), then returns to step S903. Specifically, for example, if the current pointer position is k, the rectangular area extraction unit 210 sets the pointer position to k+1.

When a corner is detected in step S904, the rectangular area extraction unit 210 scans the lines extending to the right and downward from the upper left corner and detects the upper right corner and the lower left corner (step S909). It then scans the line extending downward from the upper right corner and detects the lower right corner (step S910), and stores the position information of the rectangular area in the document (step S911). The position information consists of the coordinate values of the upper left, upper right, lower left, and lower right corners in the document. The unit then sets the pointer next to the upper right corner (step S912) and returns to step S903. Specifically, if the current pointer position is m, for example, the unit sets the pointer to m + 1. In the present embodiment, a table, which is a rectangular area, is a block of cells arranged in m columns by n rows.
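The scan of steps S902 to S912 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: it assumes that a document page is given as equal-length text lines in which `+` marks a corner or junction, `-` a horizontal ruled line, and `|` a vertical ruled line. The names `Rect` and `extract_rectangles` are hypothetical, and skipping positions that lie inside an already stored rectangle is an added assumption so that inner cell junctions are not reported as separate tables.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    top: int
    left: int
    bottom: int
    right: int


def _is_upper_left(page: List[str], r: int, c: int) -> bool:
    """True if (r, c) is a corner with a line going right and a line going down."""
    return (page[r][c] == "+"
            and c + 1 < len(page[r]) and page[r][c + 1] == "-"
            and r + 1 < len(page) and page[r + 1][c] == "|")


def extract_rectangles(page: List[str]) -> List[Rect]:
    found: List[Rect] = []
    for r in range(len(page)):              # S908: move down one line at a time
        c = 0                               # ... starting again from the left end
        while c < len(page[r]):             # S903: scan toward the right end
            inside = any(x.top <= r <= x.bottom and x.left <= c <= x.right
                         for x in found)    # assumption: skip known tables
            if not inside and _is_upper_left(page, r, c):
                right = c                   # S909: follow the top line rightward
                while right + 1 < len(page[r]) and page[r][right + 1] in "-+":
                    right += 1
                bottom = r                  # S909: follow the left line downward
                while bottom + 1 < len(page) and page[bottom + 1][c] in "|+":
                    bottom += 1
                if page[bottom][right] == "+":      # S910: lower right corner
                    found.append(Rect(r, c, bottom, right))  # S911: store position
                c = right + 1               # S912: continue next to upper right
            else:
                c += 1
    return found


page = [
    "+--+--+  ",
    "|a |b |  ",
    "+--+--+  ",
    "|c |d |  ",
    "+--+--+  ",
    "         ",
    " +---+   ",
    " |x  |   ",
    " +---+   ",
]
rectangles = extract_rectangles(page)
```

In this made-up page the 2 x 2 table and the single-cell box are each stored once, as the coordinates of their outer corners.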

次にステップＳ７０２で行われる表抽出条件に合致した類似表の抽出の詳細を説明する。図１０は、表抽出条件に合致した類似表の抽出を説明する図である。 Next, details of extraction of similar tables that match the table extraction conditions performed in step S702 will be described. FIG. 10 is a diagram for explaining extraction of a similar table that matches a table extraction condition.

Based on the heading item list stored in the table extraction condition database 320, the heading item matching unit 220 of the present embodiment extracts, from the tables extracted by the rectangular area extraction unit 210, the similar tables that match the table extraction condition.

In the example of FIG. 10, the design document data 80 to be processed is of document type A, so the heading item list 321 of document type A stored in the table extraction condition database 320 is used as the table extraction condition. The heading item matching unit 220 of the present embodiment compares the heading item list 321 with the heading items of each table extracted by the rectangular area extraction unit 210, calculates a similarity, and extracts the tables whose similarity is equal to or greater than a predetermined threshold. In the example of FIG. 10, rectangular areas 1 to 4 are extracted. In the present embodiment, a table of a rectangular area extracted in this way by the heading item matching unit 220 is called a similar table.

以下に本実施例の見出し項目照合部２２０の処理の詳細を説明する。図１１は、見出し項目照合部の処理を説明するフローチャートである。 Details of the processing of the heading item matching unit 220 of this embodiment will be described below. FIG. 11 is a flowchart for explaining the processing of the heading item matching unit.

The heading item matching unit 220 of the present embodiment acquires, from the table extraction condition database 320, the table extraction condition corresponding to document type A of the design document data 80 to be processed (step S1101). Specifically, the table extraction condition is the heading item list 321 of document type A. The unit then sets the heading items included in the table extraction condition as m1, m2, ..., mn (step S1102).

続いて見出し項目照合部２２０は、表抽出条件の類似度の閾値を取得する（ステップＳ１１０３）。この閾値は、予め表構造自動認識装置１００の記憶部３００に予め格納されているものとした。 Subsequently, the heading item collation unit 220 acquires a similarity threshold of the table extraction condition (step S1103). This threshold value is stored in advance in the storage unit 300 of the table structure automatic recognition apparatus 100.

Next, the heading item matching unit 220 sets the rectangular areas that were extracted by the rectangular area extraction unit 210 and correspond to the table format of document type A as h1, h2, ..., hn (step S1104). The unit then takes out one rectangular area (step S1105), divides it into sections, and sets the sections as k1, k2, ..., kn (step S1106). A section is an area enclosed by lines within a rectangular area.

Next, the heading item matching unit 220 takes out one heading item included in the heading item list 321 (step S1107) and calculates the similarity between that heading item and the character string in each of the sections k1, k2, ..., kn (step S1108). In the present embodiment, the similarity is a value based on the number of characters that match between the heading item taken out and the character string in the section. Details of the similarity calculation are described later.

続いて見出し項目照合部２２０は、算出結果を保存する（ステップＳ１１０９）。算出結果は、見出し項目、区画（ｋ１，ｋ２，・・・，ｋｎの何れか）、類似度の３つの値が対応付けられた組として記憶部３００等に保存される。続いて見出し項目照合部２２０は、残りの見出し項目が存在するか否かを判断する（ステップＳ１１１０）。ステップＳ１１１０において残りの見出し項目が存在する場合、見出し項目照合部２２０は次の見出し項目を取り出し（ステップＳ１１１１）、ステップＳ１１０８以降の処理を繰り返す。 Subsequently, the heading item matching unit 220 stores the calculation result (step S1109). The calculation result is stored in the storage unit 300 or the like as a set in which three values of a heading item, a section (any one of k1, k2,..., Kn) and similarity are associated. Subsequently, the heading item collation unit 220 determines whether or not there are remaining heading items (step S1110). If there are remaining heading items in step S1110, the heading item matching unit 220 extracts the next heading item (step S1111), and repeats the processing from step S1108 onward.
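The loop of steps S1107 to S1111 can be sketched as follows. The similarity function here is a stand-in and is not expression (1) of the embodiment: it is assumed, for illustration only, to be the number of characters shared by the two strings divided by the length of the longer string. The names `char_overlap_similarity` and `score_sections` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def char_overlap_similarity(heading: str, text: str) -> float:
    """Assumed stand-in: shared character count over the longer string length."""
    if not heading:
        return 0.0
    shared = Counter(heading) & Counter(text)   # per-character minimum counts
    return sum(shared.values()) / max(len(heading), len(text))


def score_sections(headings: List[str],
                   sections: Dict[str, str]) -> List[Tuple[str, str, float]]:
    """Return one (heading item, section, similarity) set per combination."""
    sets = []
    for heading in headings:                    # S1107 / S1111: each heading item
        for name, text in sections.items():     # S1108: each section k1..kn
            sets.append((heading, name, char_overlap_similarity(heading, text)))
    return sets                                 # S1109: the stored sets


sets = score_sections(["Item ID", "Name"],
                      {"k1": "Name", "k2": "Item ID", "k3": "abc"})
```

Each stored set keeps the heading item, the section, and the similarity together, which is the form the later one-to-one allocation works on.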

If no heading items remain in step S1110, the heading item matching unit 220 assigns to each heading item in the heading item list the section of the rectangular area with the highest similarity, so that heading items and sections correspond one-to-one (step S1112). In step S1112, the sets described above are sorted in descending order of similarity and assigned so that each heading item is paired with exactly one section. Details of step S1112 are described later.

The heading item matching unit 220 then calculates the average of the similarities of the heading items and sections paired one-to-one and, if the average is equal to or greater than the threshold, outputs the rectangular area, the heading items, and the character strings in the sections as a similar table (step S1113). The unit then determines whether any rectangular areas remain (step S1114). If so, it takes out the next rectangular area (step S1115) and returns to step S1106. If none remain, the unit ends the process.

以下に図１２を参照して、ステップＳ１１０８、ステップＳ１１０９における見出し項目照合部２２０による類似度の算出と見出し項目と区画の対応付けについて説明する。図１２は、類似度の算出と見出し項目と区画の対応付けを説明する図である。 Hereinafter, with reference to FIG. 12, the calculation of the similarity and the association between the heading item and the section by the heading item matching unit 220 in steps S1108 and S1109 will be described. FIG. 12 is a diagram for explaining the calculation of the similarity and the association between the heading item and the section.

図１２では、例えば文書種別Ａの表抽出条件である見出し項目一覧３２１と矩形領域１〜Ｎを照合する場合を示している。文書種別Ａから出力された矩形領域は、領域１２１に示されるような区画に分割される。 FIG. 12 shows a case where, for example, the heading item list 321 which is the table extraction condition of the document type A is collated with the rectangular areas 1 to N. The rectangular area output from the document type A is divided into sections as shown in the area 121.

見出し項目照合部２２０は、矩形領域１の区画１〜区画８の文字列に対して、見出し項目一覧３２１の見出し項目Ａ−１と一致する文字数を調べる。そして、式（１）に示す類似度算出式により、見出し項目Ａ−１と区画１の類似度を算出し、類似度として記憶部３００等に保持する。 The heading item matching unit 220 checks the number of characters that match the heading item A-1 in the heading item list 321 with respect to the character strings in the sections 1 to 8 of the rectangular area 1. Then, the similarity between the heading item A-1 and the section 1 is calculated by the similarity calculation formula shown in Expression (1), and the similarity is stored in the storage unit 300 or the like.

Similarly, the heading item matching unit 220 checks the number of characters that match heading item A-2 in the character strings of sections 1 to 8, and calculates the similarity between heading item A-2 and each section using the similarity calculation formula of expression (1). The unit performs this processing on each section of rectangular areas 1 to N for every heading item included in the heading item list 321.

FIG. 13 is a first diagram showing an example of the similarity calculated for each heading item. FIG. 13 shows the calculation result 131 of the similarities between sections 1 to 8 of rectangular area 1 and heading items A-1 to A-4. In the present embodiment, the heading item, section, and similarity shown in FIG. 13 are stored as one set.

図１４は、見出し項目毎の類似度を算出した例を示す第二の図である。図１４では、矩形領域Ｎの区画１〜８に対する見出し項目Ａ−１〜Ａ−４との類似度を算出結果１４１を示している。 FIG. 14 is a second diagram illustrating an example of calculating the similarity for each heading item. In FIG. 14, the calculation result 141 shows the similarity between the heading items A-1 to A-4 for the sections 1 to 8 of the rectangular area N.

次に図１５を参照して、ステップＳ１１１２における見出し項目照合部２２０による割り振りの詳細を説明する。図１５は、ステップＳ１１１２の処理の詳細を説明するフローチャートである。 Next, with reference to FIG. 15, the details of the allocation by the headline item collation unit 220 in step S1112 will be described. FIG. 15 is a flowchart illustrating details of the process in step S1112.

The heading item matching unit 220 sorts the sets stored in step S1109 in descending order of similarity (step S1501), sets the pointer to the first set (step S1502), and outputs the heading item, section, and similarity of the set the pointer currently indicates (step S1503). That is, the set with the highest similarity between heading item and section is output. In the present embodiment, this output is the result of the one-to-one allocation.

The heading item matching unit 220 then deletes the output set from the similarity calculation results (step S1504). Instead of deleting the set, an output-prohibited flag may be attached to it so that it is not output again. The unit then advances the pointer by one (step S1505). If a set exists at the position the pointer indicates (step S1506), the unit returns to step S1503. If not, the unit ends the process of step S1112.
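The allocation of steps S1501 to S1506 can be sketched as a greedy assignment. The text states only that an output set is deleted from the calculation results. The sketch below additionally assumes that every remaining set sharing the same heading item or the same section is also excluded, since that is one way to obtain the one-to-one correspondence. The function name and the similarity values are made up for illustration.

```python
from typing import Hashable, List, Tuple

ScoreSet = Tuple[str, Hashable, float]   # (heading item, section, similarity)


def allocate_one_to_one(sets: List[ScoreSet]) -> List[ScoreSet]:
    ordered = sorted(sets, key=lambda s: s[2], reverse=True)   # S1501
    used_headings, used_sections = set(), set()
    allocation = []
    for heading, section, similarity in ordered:               # S1502, S1505
        # Assumption: a set is skipped once its heading item or section is
        # taken, which plays the role of the deletion or flag in S1504.
        if heading in used_headings or section in used_sections:
            continue
        allocation.append((heading, section, similarity))      # S1503
        used_headings.add(heading)
        used_sections.add(section)
    return allocation


# Hypothetical similarities for rectangular area 1.
scores = [
    ("A-1", 1, 0.2), ("A-1", 2, 1.0),
    ("A-2", 3, 1.0), ("A-2", 2, 0.6),
    ("A-3", 4, 0.8), ("A-3", 3, 0.7),
    ("A-4", 1, 0.5), ("A-4", 4, 0.6),
]
allocation = allocate_one_to_one(scores)
```

With these values, heading item A-4 ends up with section 1 at similarity 0.5 even though its similarity to section 4 is higher, because section 4 has already been assigned to A-3.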

図１６は、見出し項目一覧に対して最も類似度の高い区間を対応が1対１となるように割り振った例を示す図である。図１６では、算出結果１３１に対して図１５で説明した処理を行った結果を割振結果１６１として示している。 FIG. 16 is a diagram illustrating an example in which a section having the highest similarity is assigned to the heading item list so that the correspondence is 1: 1. In FIG. 16, the result of performing the processing described with reference to FIG. 15 on the calculation result 131 is shown as the allocation result 161.

割振結果１６１では、矩形領域１において見出し項目Ａ−１と最も類似する区間は区画２であり、見出し項目Ａ−２と最も類似する区間は区画３であり、見出し項目Ａ−３と最も類似する区間は区画４であり、見出し項目Ａ−４と最も類似する区間は区画１となる。本実施例では割振結果１６１以外の組は、記憶部３００から削除されても良い。 In the allocation result 161, the section most similar to the heading item A-1 in the rectangular area 1 is the section 2, the section most similar to the heading item A-2 is the section 3, and is most similar to the heading item A-3. The section is section 4, and the section most similar to the heading item A-4 is section 1. In the present embodiment, groups other than the allocation result 161 may be deleted from the storage unit 300.

The heading item matching unit 220 of the present embodiment obtains the allocation result described above for every rectangular area extracted from the design document data 80 of document type A. The unit then extracts each rectangular area for which the average of the similarities in its allocation result is equal to or greater than the threshold, and holds it in the storage unit 300 as a similar table. For example, for the allocation result 161 shown in FIG. 16, rectangular area 1 is held as similar table 1 if the average of the four similarities 1, 1, 0.8, and 0.5 is equal to or greater than the threshold.
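As a worked example of this check, the average of the four similarities 1, 1, 0.8, and 0.5 is 3.3 / 4 = 0.825. The sketch below compares that average with a threshold. The threshold values 0.8 and 0.9 are assumptions chosen for illustration, since the text does not give the actual threshold, and the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def is_similar_table(similarities: Sequence[float], threshold: float) -> bool:
    """True if the average allocated similarity reaches the threshold."""
    return mean(similarities) >= threshold


allocated = [1, 1, 0.8, 0.5]          # similarities of allocation result 161
average = mean(allocated)             # approximately 0.825
kept_at_08 = is_similar_table(allocated, 0.8)   # assumed threshold 0.8
kept_at_09 = is_similar_table(allocated, 0.9)   # assumed threshold 0.9
```

Under the assumed threshold of 0.8, rectangular area 1 would be kept as a similar table. Under 0.9 it would not.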

次にステップＳ７０３で行われる類似表のデータ項目の長さの補正の詳細を説明する。本実施例の表構造自動認識装置１００において、データ項目長さ補正部２３１は、抽出された類似表のデータ項目の長さを補正する。図１７は、データ項目長さ補正部による処理を説明するフローチャートである。 Next, details of the correction of the length of the data item of the similar table performed in step S703 will be described. In the table structure automatic recognition apparatus 100 of the present embodiment, the data item length correction unit 231 corrects the length of the extracted data item of the similar table. FIG. 17 is a flowchart for explaining processing by the data item length correction unit.

The data item length correction unit 231 prepares the similar tables of the document type that resulted from step S702 (step S1701). The unit then takes out one similar table (step S1702) and divides it into sections (step S1703). Next, the unit acquires the data of the data item in each section and stores the length of the data item in the storage unit 300 or the like (step S1704). The length of a data item is the length of the section corresponding to that data item, specifically the number of characters in the section.

The data item length correction unit 231 then determines whether any similar tables remain (step S1705). If so, the unit takes out the next similar table (step S1706) and returns to step S1703. If no similar tables remain in step S1705, the unit normalizes the lengths of the data items of the similar tables using the acquired data item lengths (step S1707). Specifically, the unit may calculate, for each data item, the average of its lengths across the similar tables and correct the length of that data item to this average.

図１８は、データ項目の長さの補正の具体例を説明する第一図である。図１８では、類似表として、文書種別Ａの矩形領域１〜４が抽出された場合を示している。図１８（Ａ）は、見出し項目とデータ項目との対応を示す図であり、図１８（Ｂ）はデータ項目の長さの補正を説明する図である。 FIG. 18 is a first diagram illustrating a specific example of correcting the length of a data item. FIG. 18 shows a case where rectangular areas 1 to 4 of the document type A are extracted as the similarity table. FIG. 18A is a diagram showing the correspondence between the heading item and the data item, and FIG. 18B is a diagram for explaining the correction of the length of the data item.

本実施例のデータ項目長さ補正部２３１は、見出し項目照合部２２０により見出し項目Ａ−１〜Ａ−４が割り振られた区画以外の区画をデータ項目の区画とする。例えば文書種別Ａの類似表では、図１８（Ａ）に示すように、区画１〜４には見出し項目Ａ−１〜Ａ−４が割り振られている。よって区画５〜８が見出し項目Ａ−１〜Ａ−４に対応するデータ項目１〜データ項目４の区画となる。 The data item length correction unit 231 according to the present embodiment sets a partition other than the partition to which the heading items A-1 to A-4 are allocated by the heading item matching unit 220 as a data item partition. For example, in the similarity table of document type A, as shown in FIG. 18A, heading items A-1 to A-4 are allocated to sections 1 to 4. Therefore, the sections 5 to 8 are sections of the data items 1 to 4 corresponding to the heading items A-1 to A-4.

データ項目長さ補正部２３１は、図１８（Ｂ）に示すように、矩形領域１〜４の区画５〜８の文字数を各区画のデータ項目の長さとして取得する。矩形領域１では、区画５の文字列は９文字であるため、区画５のデータ項目の長さは９となる。尚本実施例では、スペースや符号等も１文字とカウントするものとした。また矩形領域１の区画６の文字数は４文字であるため、区画６のデータ項目の長さは４となる。 As shown in FIG. 18B, the data item length correction unit 231 acquires the number of characters in the sections 5 to 8 of the rectangular areas 1 to 4 as the length of the data item in each section. In the rectangular area 1, since the character string in the section 5 is 9 characters, the length of the data item in the section 5 is 9. In this embodiment, a space, a code, etc. are counted as one character. Since the number of characters in the section 6 of the rectangular area 1 is 4, the length of the data item in the section 6 is 4.

For each extracted rectangular area (similar table), the data item length correction unit 231 acquires the lengths of the data items in sections 5 to 8 and holds them in the storage unit 300.

FIG. 19 is a second diagram illustrating a specific example of correcting the length of a data item. FIG. 19 shows an example in which the data item lengths of each rectangular area are corrected using the data item lengths obtained by the data item length correction unit 231 in FIG. 18. FIG. 19A shows an example of the average lengths of the data items in sections 5 to 8, and FIG. 19B shows an example in which the data item lengths of rectangular area 1 have been corrected.

In the present embodiment, the average of the data item lengths of each section is used as the corrected data item length. As shown in FIG. 19A, the total length of the data items of section 5 in rectangular areas 1 to 4 is 36, so the average over the four rectangular areas is 36 / 4 = 9. The data item length correction unit 231 therefore uses this average value 9 as the corrected data item length of section 5. The unit calculates the corrected data item lengths of sections 6 to 8 in the same way. In calculating the average, the second decimal place is rounded up. The calculated corrected data item lengths are held in the storage unit 300 or the like in association with the respective sections.

The data item length correction unit 231 then corrects the lengths of the character strings in sections 5 to 8 based on the corrected data item lengths. As a result of this correction, as shown for example in FIG. 19B, the character string length of the data item in section 5 of rectangular area 1 becomes 9, that in section 6 becomes 4, that in section 7 becomes 4.5, and that in section 8 becomes 3.5.
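The normalization of step S1707 can be sketched as a per-section average over all similar tables. The per-table lengths below are made up: they are chosen only so that the averages come out to the values stated for FIG. 19 (9, 4, 4.5, and 3.5), with section 5 summing to 36 and rectangular area 1 having lengths 9 and 4 in sections 5 and 6. Rounding up the second decimal place is read here as rounding up to one decimal place, which is an interpretation. The function name is hypothetical.

```python
import math
from fractions import Fraction
from typing import Dict, List


def corrected_data_item_lengths(tables: List[Dict[int, int]]) -> Dict[int, float]:
    """Average each section's length over all similar tables (S1707)."""
    corrected = {}
    for section in tables[0]:
        # Exact rational average avoids floating-point error before rounding.
        average = Fraction(sum(t[section] for t in tables), len(tables))
        # Assumed reading of the rounding rule: round up to one decimal place.
        corrected[section] = math.ceil(average * 10) / 10
    return corrected


# Hypothetical character counts of sections 5 to 8 in rectangular areas 1 to 4.
lengths = [
    {5: 9, 6: 4, 7: 5, 8: 3},    # rectangular area 1
    {5: 10, 6: 3, 7: 4, 8: 4},   # rectangular area 2
    {5: 8, 6: 5, 7: 5, 8: 4},    # rectangular area 3
    {5: 9, 6: 4, 7: 4, 8: 3},    # rectangular area 4
]
corrected = corrected_data_item_lengths(lengths)
```

Every similar table then uses the same corrected length for a given section, regardless of how many characters that section happened to contain in the individual table.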

次にステップＳ７０４で行われる類似表の見出し項目の長さの補正の詳細を説明する。本実施例の表構造自動認識装置１００において、見出し項目長さ補正部２３２は、抽出された類似表の見出し項目の長さを補正する。図２０は、見出し項目長さ補正部による処理を説明するフローチャートである。 Next, details of the correction of the length of the heading item in the similarity table performed in step S704 will be described. In the table structure automatic recognition apparatus 100 of the present embodiment, the heading item length correction unit 232 corrects the length of the extracted heading item of the similar table. FIG. 20 is a flowchart for explaining processing by the heading item length correction unit.

本実施例の見出し項目長さ補正部２３２は、見出し項目の長さ補正の対象とする文書種別に対する表抽出条件を取得する（ステップＳ２００１）。ここで補正対象の文書種別を文書種別Ａとし、文書種別Ａの表抽出条件を対象表抽出条件とする。すなわち対象表抽出条件とは、文書種別Ａの見出し項目一覧３２１である。 The heading item length correction unit 232 according to the present exemplary embodiment acquires a table extraction condition for a document type that is a target of heading item length correction (step S2001). Here, the document type to be corrected is the document type A, and the table extraction condition of the document type A is the target table extraction condition. That is, the target table extraction condition is the heading item list 321 of the document type A.

続いて見出し項目長さ補正部２３２は、項目整合データベース３３０を参照し、文書種別Ａと整合性が期待される文書種別の見出し項目一覧（表抽出条件）を取得する（ステップＳ２００２）。ここでは文書種別Ａと整合性が期待される文書種別を文書種別Ｂとし、文書種別Ｂの表抽出条件を対応表抽出条件とする。すなわち対応表抽出条件とは文書種別Ｂの見出し項目一覧３２２である。 Subsequently, the heading item length correction unit 232 refers to the item matching database 330 and acquires a heading item list (table extraction condition) of the document type expected to be consistent with the document type A (step S2002). Here, the document type expected to be consistent with the document type A is the document type B, and the table extraction condition of the document type B is the correspondence table extraction condition. That is, the correspondence table extraction condition is a heading item list 322 of the document type B.

続いて見出し項目長さ補正部２３２は、対応表抽出条件に基づき文書種別Ｂの設計書データから抽出された類似表を準備する（ステップＳ２００３）。次に見出し項目長さ補正部２３２は、対象表抽出条件から一つ見出し項目を取り出す（ステップＳ２００４）。具体的には例えば、見出し項目一覧３２１から一つの見出し項目を取り出す。 Subsequently, the heading item length correction unit 232 prepares a similar table extracted from the design document data of the document type B based on the correspondence table extraction condition (step S2003). Next, the heading item length correction unit 232 extracts one heading item from the target table extraction condition (step S2004). Specifically, for example, one heading item is extracted from the heading item list 321.

続いて見出し項目長さ補正部２３２は、整合性が期待される文書種別に対応する見出し項目が対応表抽出条件中に存在するか否かを判断する（ステップＳ２００５）。具体的には例えば、文書種別Ｂの見出し項目一覧３２２中に、ステップＳ２００４で取り出した見出し項目と整合性が期待できる見出し項目が存在するか否かを判断する。 Subsequently, the heading item length correction unit 232 determines whether or not a heading item corresponding to the document type for which consistency is expected exists in the correspondence table extraction condition (step S2005). Specifically, for example, it is determined whether or not there is a heading item that can be expected to be consistent with the heading item extracted in step S2004 in the heading item list 322 of the document type B.

ステップＳ２００５において該当する見出し項目が存在する場合、見出し項目長さ補正部２３２は、対応表抽出条件に対する類似表中で、対応する見出し項目と合致する見出し項目の長さの平均値を算出する（ステップＳ２００６）。具体的には例えば、見出し項目一覧３２２を表抽出条件として文書種別Ｂの設計書データから抽出した類似表において、ステップＳ２００４で取り出した見出し項目と整合性が期待できる見出し項目を抽出し、抽出した見出し項目の文字数の平均値を算出する。ステップＳ２００５において該当する見出し項目が存在しない場合、後述するステップＳ２００８へ進む。 If there is a corresponding heading item in step S2005, the heading item length correction unit 232 calculates an average value of the lengths of heading items that match the corresponding heading item in the similar table for the correspondence table extraction condition ( Step S2006). Specifically, for example, a headline item that can be expected to be consistent with the headline item extracted in step S2004 is extracted from the similar table extracted from the design document data of document type B using the headline item list 322 as a table extraction condition. Calculate the average number of characters in the heading item. If no corresponding heading item exists in step S2005, the process proceeds to step S2008 described later.

続いて見出し項目長さ補正部２３２は、ステップＳ２００４で取り出した見出し項目の長さを、ステップＳ２００６で算出した見出し項目の平均値を用いて補正する（ステップＳ２００７）。続いて見出し項目長さ補正部２３２は、残りの見出し項目が存在するか否かを判断する（ステップＳ２００８）。ステップＳ２００８で残りの見出し項目が存在する場合、見出し項目長さ補正部２３２は次の見出し項目を取り出し（ステップＳ２００９）、ステップＳ２００５に戻る。ステップＳ２００８で残りの見出し項目が存在しない場合、見出し項目長さ補正部２３２は処理を終了する。 Subsequently, the heading item length correction unit 232 corrects the length of the heading item extracted in step S2004 using the average value of the heading item calculated in step S2006 (step S2007). Subsequently, the heading item length correction unit 232 determines whether or not there are remaining heading items (step S2008). If there are remaining heading items in step S2008, the heading item length correction unit 232 extracts the next heading item (step S2009) and returns to step S2005. If there is no remaining heading item in step S2008, the heading item length correction unit 232 ends the process.

The correction of the length of a heading item is described in detail below with reference to FIGS. 21 to 24. FIG. 21 is a first diagram illustrating a specific example of correcting the length of a heading item.

図２１では、文書種別Ａの類似表の見出し項目の長さを補正する場合を示している。見出し項目長さ補正部２３２は、項目整合データベース３３０に基づき、文書種別Ａの見出し項目Ａ−１と文書種別Ｂの見出し項目Ｂ−１とが整合性が期待されると判断する。また見出し項目長さ補正部２３２は、見出し項目Ａ−３と文書種別Ｃの見出し項目Ｃ−１とが整合性が期待されると判断する。 FIG. 21 shows a case where the length of the heading item in the similarity table of document type A is corrected. Based on the item matching database 330, the heading item length correction unit 232 determines that the heading item A-1 of the document type A and the heading item B-1 of the document type B are expected to be consistent. The heading item length correction unit 232 determines that the heading item A-3 and the heading item C-1 of the document type C are expected to be consistent.

そして見出し項目長さ補正部２３２は、表抽出条件データベース３２０から文書種別Ｂの見出し項目一覧３２２を表抽出条件として取得する。そして図７のステップＳ７０１、ステップＳ７０２の処理を文書種別Ｂの設計書データに対して行い、文書種別Ｂの類似表を抽出する。また見出し項目長さ補正部２３２は文書種別Ｃについても見出し項目一覧３２３を表抽出条件として取得し、文書種別Ｃの類似表を抽出する。 Then, the heading item length correction unit 232 acquires the heading item list 322 of the document type B from the table extraction condition database 320 as the table extraction condition. Then, the processing in steps S701 and S702 in FIG. 7 is performed on the design document data of the document type B, and the similarity table of the document type B is extracted. The heading item length correction unit 232 also acquires the heading item list 323 for the document type C as a table extraction condition, and extracts a similar table of the document type C.

FIG. 22 is a second diagram illustrating a specific example of correcting the length of a heading item. After extracting the similar tables of document type B and document type C, the heading item length correction unit 232 takes out one heading item from the heading item list 321 of document type A and determines whether document types B and C contain a heading item that can be expected to be consistent with it.

図２２において、例えば見出し項目一覧３２１の見出し項目Ａ−１を取り出すと、見出し項目Ａ−１は見出し項目Ｂ−１と整合性が期待でき、見出し項目Ａ−３は見出し項目Ｃ−１と整合性が期待できることがわかる。また見出し項目Ａ−２、Ａ−４との整合性が期待できる見出し項目は、文書種別Ｂ，Ｃには存在しないことがわかる。 In FIG. 22, for example, when the heading item A-1 is extracted from the heading item list 321, the heading item A-1 can be expected to be consistent with the heading item B-1, and the heading item A-3 is consistent with the heading item C-1. It turns out that sex can be expected. It can also be seen that there are no heading items that can be expected to be consistent with the heading items A-2 and A-4 in the document types B and C.

図２３は、見出し項目の長さの補正の具体例を示す第三の図である。文書種別Ｂ，Ｃにおいて文書種別Ａの見出し項目と整合性が期待できる見出し項目がわかると、見出し項目長さ補正部２３２は、各文書種別の類似表における該当見出し項目の長さの平均値を算出する。 FIG. 23 is a third diagram illustrating a specific example of correcting the length of a heading item. When the heading item that can be expected to be consistent with the heading item of the document type A is found in the document types B and C, the heading item length correction unit 232 calculates the average length of the corresponding heading item in the similarity table of each document type. calculate.

図２３（Ａ）は、文書種別Ｂの類似表における見出し項目Ｂ−１の長さの平均値の算出を説明する図であり、図２３（Ｂ）は、文書種別Ｃの類似表における見出し項目Ｃ−１の長さの平均値の算出を説明する図である。 23A is a diagram for explaining the calculation of the average value of the lengths of the heading items B-1 in the similar table of document type B. FIG. 23B shows the heading items in the similar table of document type C. It is a figure explaining calculation of the average value of the length of C-1.

As shown in FIG. 23A, in document type B, heading item B-1 is the heading item expected to be consistent with heading item A-1. The heading item length correction unit 232 therefore extracts the number of characters of heading item B-1 from each similar table of document type B and calculates the average. In the example of FIG. 23A, three similar tables have been extracted from document type B, so the unit calculates the average of the lengths (numbers of characters) of heading item B-1 in the three similar tables.

As shown in FIG. 23B, in document type C, heading item C-1 is the heading item expected to be consistent with heading item A-3. The heading item length correction unit 232 therefore extracts the number of characters of heading item C-1 from each similar table of document type C and calculates the average. This average is stored in the storage unit 300 or the like.

In the example of FIG. 23B, three similar tables have been extracted from document type C, so the heading item length correction unit 232 calculates the average of the lengths (numbers of characters) of heading item C-1 in the three similar tables. In calculating the average, the fractional part is truncated.

FIG. 24 is a fourth diagram illustrating a specific example of correcting the length of a heading item. FIG. 24(A) is a diagram explaining the correction method, and FIG. 24(B) is a diagram showing an example in which the heading items of rectangular area 1 have been corrected.

The heading item length correction unit 232 of this embodiment corrects the length of a heading item of document type A using that length and the average length of the heading item in another document type that is expected to be consistent with it. Specifically, as shown in FIG. 24(A), the corrected length of the heading item is one half of the sum of the length of the heading item of document type A and the average length of the heading item expected to be consistent with it in the other document type.

In FIG. 24, two heading items are corrected: heading item A-1, for which document type B contains heading item B-1 expected to be consistent with it, and heading item A-3, for which document type C contains heading item C-1 expected to be consistent with it. The length of heading item A-1 is 15, and the average length of heading item B-1 is 11. The corrected length of heading item A-1 is therefore (15 + 11) / 2 = 13.

Likewise, the length of heading item A-3 is 6, and the average length of heading item C-1 is 4. The corrected length of heading item A-3 is therefore (6 + 4) / 2 = 5.

As a result of this correction, as shown in FIG. 24(B), the heading items of rectangular area 1 have the following lengths: heading item A-1 has length 13, heading item A-2 has length 3, heading item A-3 has length 5, and heading item A-4 has length 2.
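The averaging and halving steps described for FIGS. 23 and 24 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names are hypothetical, and the per-table lengths passed to `average_length` are invented so that they reproduce the stated averages of 11 and 4 (the description gives only the averages). The description also does not say how an odd sum would be rounded when halved, so the quotient is returned as-is.

```python
from typing import Sequence


def average_length(lengths: Sequence[int]) -> int:
    """Average heading length over the similar tables, discarding the fraction."""
    return sum(lengths) // len(lengths)


def corrected_length(own_length: int, counterpart_average: int) -> float:
    """Half of the sum of the heading's own length and the counterpart average."""
    return (own_length + counterpart_average) / 2


# Hypothetical per-table lengths; only the averages (11 and 4) come from the text.
avg_b1 = average_length([10, 11, 13])  # 34 // 3
avg_c1 = average_length([4, 4, 5])     # 13 // 3

corrected = {
    "A-1": corrected_length(15, avg_b1),
    "A-2": 3,  # no counterpart heading, length left unchanged
    "A-3": corrected_length(6, avg_c1),
    "A-4": 2,  # no counterpart heading, length left unchanged
}
```

With these inputs the dictionary holds the four lengths listed for FIG. 24(B).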

In this embodiment, when, for example, no heading item expected to be consistent with a heading item of document type A exists in the heading item lists of the other document types, the heading item is corrected using the average number of characters of a similar heading item in a document type whose table format is similar to that of document type A. For example, in the item matching database 330, heading item A-4 of document type A has no heading item expected to be consistent with it. The heading item length correction unit 232 may therefore look in the heading item list 322 of document type B, which contains the heading item expected to be consistent with heading item A-1, find heading item B-5, which is similar to heading item A-4, and use the average of the number of characters of heading item B-5 and the number of characters of heading item A-4 to correct heading item A-4.

Although this embodiment has been described as correcting both the lengths of heading items and the lengths of data items, it is not limited to this. For example, only the lengths of heading items or only the lengths of data items may be corrected. When only one of the two is corrected, it is preferable to correct the lengths of data items.

Next, the details of identifying the data item corresponding to a corrected heading item, which is performed in step S705, will be described. In the table structure automatic recognition apparatus 100 of this embodiment, the heading-corresponding data specifying unit 240 identifies the data item corresponding to each corrected heading item. FIG. 25 is a flowchart explaining the processing performed by the heading-corresponding data specifying unit 240.

The heading-corresponding data specifying unit 240 acquires a corrected similar table in which the data items and heading items have been corrected (step S2501). The heading-corresponding data specifying unit 240 then takes one heading item from the acquired similar table (step S2502). Next, the heading-corresponding data specifying unit 240 calculates the area of the region formed by combining the extracted heading item with each data item in the similar table (step S2503).

Specifically, the heading-corresponding data specifying unit 240 regards the region formed by a heading item and a data item as a trapezoid whose upper and lower sides are the length of the heading item and the length of the data item, respectively, and calculates the area of the region as (length of heading item + length of data item) × height ÷ 2. The height may be calculated, for example, by taking the height of one section as 1.
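As a small worked check of the trapezoid formula, the sketch below evaluates it for the two heading/data pairs whose values the description later gives for FIG. 29 (lengths 2 and 4, and lengths 3 and 5, each with height 6). The function name is hypothetical.

```python
def trapezoid_area(heading_length: float, data_length: float, height: float = 1.0) -> float:
    """Area of the trapezoid whose parallel sides are the two item lengths."""
    return (heading_length + data_length) * height / 2


area_f2_f5 = trapezoid_area(2, 4, 6)  # heading F2 (length 2), data item F5 (length 4)
area_g2_h5 = trapezoid_area(3, 5, 6)  # heading G2 (length 3), data item H5 (length 5)
```

These evaluate to 18 and 24, which agree with the individual region sizes stated for table2.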

The heading-corresponding data specifying unit 240 performs this area calculation for one heading item against all data items, and stores each result in the storage unit 300 or the like as a set associating the heading item, the data item, and the area (step S2504).

The heading-corresponding data specifying unit 240 determines whether any heading items remain (step S2505). If heading items remain in step S2505, the heading-corresponding data specifying unit 240 takes the next heading item (step S2506) and returns to step S2503.

If no heading items remain in step S2505, the heading-corresponding data specifying unit 240 excludes the calculation results that meet predetermined conditions (step S2507). In this embodiment, two kinds of calculation results are excluded: those in which one data item corresponds to a plurality of heading items, and those in which the number of crossings of the line segment connecting the heading item and the data item exceeds a fixed threshold. The crossing threshold may be given, for example, as a fixed proportion of the number of heading items in the similar table. The crossing threshold may also be changed according to the number of heading items. The crossing threshold is stored in advance in the storage unit 300 or the like.

Next, the heading-corresponding data specifying unit 240 selects, for each heading item, the set with the smallest region area from the calculation results (step S2508). The heading-corresponding data specifying unit 240 outputs the heading item and data item of each selected set as the final result (step S2509). Through this processing, the data item corresponding to each heading item in the similar table is identified.
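The flow of steps S2502 to S2509 can be sketched as below. This is an illustration under several assumptions, not the embodiment's code: items are modelled as hypothetical `(coordinate, length)` tuples, the height and the crossing count are supplied by caller-provided functions because the description does not fix how they are derived, and only the crossing-count exclusion of step S2507 is modelled (the rule about one data item corresponding to several heading items is omitted, since the description does not say how such cases are resolved).

```python
from typing import Callable, Dict, List, Tuple

Item = Tuple[str, int]  # (cell coordinate, corrected length); a hypothetical model


def match_headings(
    headings: List[Item],
    data_items: List[Item],
    height_of: Callable[[Item, Item], float],
    crossings_of: Callable[[Item, Item], int],
    max_crossings: int,
) -> Dict[Item, Item]:
    # S2502-S2506: for every heading, compute the area against every data item.
    triples = []
    for heading in headings:
        for data in data_items:
            area = (heading[1] + data[1]) * height_of(heading, data) / 2
            triples.append((heading, data, area))  # S2504: (heading, data, area)

    # S2507 (crossing rule only): drop pairs whose crossing count exceeds the threshold.
    kept = [t for t in triples if crossings_of(t[0], t[1]) <= max_crossings]

    # S2508: per heading, keep the pair with the smallest area.
    best: Dict[Item, Tuple[Item, float]] = {}
    for heading, data, area in kept:
        if heading not in best or area < best[heading][1]:
            best[heading] = (data, area)

    # S2509: output the selected heading/data pairs.
    return {heading: data for heading, (data, _) in best.items()}


# Hypothetical inputs: a constant height of 6 and an invented crossing count of 3
# for the pair (G2, F5); every other pair is assumed to have no crossings.
result = match_headings(
    headings=[("F2", 2), ("G2", 3)],
    data_items=[("F5", 4), ("H5", 5), ("G4", 16)],
    height_of=lambda h, d: 6,
    crossings_of=lambda h, d: 3 if (h[0], d[0]) == ("G2", "F5") else 0,
    max_crossings=1,
)
```

Without the crossing filter, heading G2 would pair with F5 (area 21); with it, G2 falls back to H5 (area 24), which illustrates why the smallest area alone is not the only criterion.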

FIG. 26 is a diagram explaining the identification of the data item corresponding to a heading item. Taking the crossing threshold to be 1 as an example, FIG. 26(A) shows a case in which the number of crossings of the line segment connecting a heading item and a data item exceeds the threshold, and FIG. 26(B) shows a case in which it does not.

In FIG. 26(A), the section corresponding to the data item that forms the smallest-area region with section 261, which corresponds to a heading item, is section 262. However, line segment H1 connecting the heading item of section 261 and the data item of section 262 crosses the line segments connecting other heading items and data items at three points. The number of crossings is therefore 3, which exceeds the threshold, so the data item of section 262 is not selected as the data item corresponding to the heading item of section 261.

In FIG. 26(B), once the data items whose number of crossings exceeds the threshold have been excluded, the section that forms the smallest-area region with the heading item of section 261 is section 263. Line segment H2 connecting the heading item of section 261 and the data item of section 263 does not cross any line segment connecting another heading item and data item. The number of crossings is therefore 0, which does not exceed the threshold, so the data item of section 263 is identified as the data item corresponding to the heading item of section 261.

Thus, in this embodiment, in addition to requiring that the area of the region formed by a heading item and a data item be the smallest, a constraint is placed on the number of crossings of the line segments connecting heading items and data items. This reflects the observation that, when a table is interpreted, the cognitive burden grows as the number of crossings increases.

In this embodiment, constraining the number of crossings in this way improves the accuracy of automatic recognition of the table structure.
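One way to count crossings such as those of line segment H1 in FIG. 26(A) is a standard orientation test on pairs of segments, sketched below. The description does not state where on a section a segment's endpoints lie, so the points used here are purely hypothetical coordinates, and the function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area test: positive, negative, or zero depending on the turn a->b->c."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(s: Segment, t: Segment) -> bool:
    """True when the two segments cross at a single interior point."""
    (p1, p2), (q1, q2) = s, t
    return (_orient(q1, q2, p1) * _orient(q1, q2, p2) < 0
            and _orient(p1, p2, q1) * _orient(p1, p2, q2) < 0)


def crossing_count(candidate: Segment, others: List[Segment]) -> int:
    """Number of other heading-to-data segments that the candidate crosses."""
    return sum(segments_cross(candidate, other) for other in others)


# Hypothetical layout: headings on the line y = 0, data items on the line y = 1.
others = [((1, 0), (0, 1)), ((2, 0), (1, 1)), ((3, 0), (2, 1))]
h1 = ((0, 0), (3, 1))      # a long diagonal that cuts across all three
h2 = ((0, 0), (-1, 1))     # runs parallel to the others and meets none

threshold = 1
h1_allowed = crossing_count(h1, others) <= threshold
h2_allowed = crossing_count(h2, others) <= threshold
```

With a threshold of 1, the first candidate (three crossings) is rejected and the second (no crossings) is accepted, mirroring the two cases of FIG. 26.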

Next, the input format and output format of the table structure used when the table structure automatic recognition apparatus 100 of this embodiment performs automatic recognition of a table structure will be described.

FIG. 27 is a diagram explaining the input format of a table. FIG. 27(A) shows an example of a table, and FIG. 27(B) shows an example of the input format for the table structure shown in FIG. 27(A).

In this embodiment, a heading item is expressed as "H:coordinate:string length", a data item is expressed as "coordinate:string length", and the height of a region is expressed as "Y coordinate:height". In this embodiment, coordinates are expressed, for example, in MS Excel (registered trademark) format.

In table2 shown in FIG. 27(A), the sections at coordinates F2, G2, G3, and H3 are heading items, and the character string of a heading item is entered in each of these sections. The remaining sections of table2 are data items, and the character string of a data item is entered in each of them. As shown in FIG. 27(B), table2 therefore expresses its heading items as "H:F2:2 H:G2:3 H:G3:2". Entries without a leading "H:" are data items, expressed as "F5:4 G4:16 G5:13 H5:5". The height of each section is expressed as "2:1 3:1 4:2 5:2"; for example, a section whose Y coordinate is 2 is interpreted as having height 1.
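A reader for this input format might look like the sketch below. It is an assumption-laden illustration: entries are assumed to be separated by whitespace, a two-field entry is treated as a row height when its first field is all digits and as a data item when its first field looks like a cell coordinate, and the function name and return shape are hypothetical.

```python
import re
from typing import Dict, Tuple

CELL_REF = re.compile(r"^[A-Z]+[0-9]+$")  # spreadsheet-style coordinate such as "F2"


def parse_table_definition(text: str) -> Tuple[Dict[str, int], Dict[str, int], Dict[int, int]]:
    """Split a table definition into heading lengths, data lengths, and row heights."""
    headings: Dict[str, int] = {}
    data_items: Dict[str, int] = {}
    row_heights: Dict[int, int] = {}
    for token in text.split():
        parts = token.split(":")
        if len(parts) == 3 and parts[0] == "H":
            headings[parts[1]] = int(parts[2])          # "H:coordinate:string length"
        elif len(parts) == 2 and CELL_REF.match(parts[0]):
            data_items[parts[0]] = int(parts[1])        # "coordinate:string length"
        elif len(parts) == 2 and parts[0].isdigit():
            row_heights[int(parts[0])] = int(parts[1])  # "Y coordinate:height"
        else:
            raise ValueError(f"unrecognised entry: {token!r}")
    return headings, data_items, row_heights


# The three strings quoted for table2, joined with spaces.
definition = "H:F2:2 H:G2:3 H:G3:2 F5:4 G4:16 G5:13 H5:5 2:1 3:1 4:2 5:2"
headings, data_items, row_heights = parse_table_definition(definition)
```

Note that a data item in column H ("H5:5") is not confused with a heading entry, because a heading entry always has three fields.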

FIG. 28 is a diagram showing an example of input for associating heading items with data items. When the table structures of a plurality of tables are input to the table structure automatic recognition apparatus 100 of this embodiment, they may be input as a single piece of data, as shown in FIG. 28. FIG. 28 shows data containing the table structures of table1 through table6.

Next, the format of the association between heading items and data items output from the table structure automatic recognition apparatus 100 of this embodiment will be described with reference to FIG. 29. FIG. 29 is a diagram showing an example of the association result.

In this embodiment, the result of associating heading items with data items is output in the form "number of crossings#region size#individual region 1#individual region 2#...#individual region n". Each individual region is expressed as "individual region size_region height_heading item_data item", where the heading item is expressed as "H:coordinate:string length" and the data item is expressed as "coordinate:string length". The region size is the area of the rectangular area forming the table, and each individual region size is the area of one section included in the table.

For example, for table2 in FIG. 29, the number of crossings is 1 and the region size is 106.5. Region 1 has a region size of 18 and a region height of 6; its heading item is at coordinate F2 with length 2, and the corresponding data item is at coordinate F5 with length 4. Region 2 has a region size of 24 and a region height of 6; its heading item is at coordinate G2 with length 3, and the corresponding data item is at coordinate H5 with length 5.
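The output format can be read back with a short parser such as the sketch below. The class and function names are hypothetical, ASCII "#", "_" and ":" are assumed as separators, and the sample line is assembled from the table2 values quoted in the text, so it contains only the two regions described there and is not a copy of FIG. 29.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    area: float
    height: float
    heading_ref: str
    heading_length: int
    data_ref: str
    data_length: int


@dataclass
class MatchResult:
    crossings: int
    total_area: float
    regions: List[Region]


def parse_match_result(line: str) -> MatchResult:
    """Parse 'crossings#region size#region 1#...#region n'."""
    crossings, total_area, *region_fields = line.split("#")
    regions = []
    for field in region_fields:
        # Each region is 'size_height_H:coordinate:length_coordinate:length'.
        area, height, heading, data = field.split("_")
        marker, heading_ref, heading_length = heading.split(":")
        if marker != "H":
            raise ValueError(f"heading entry must start with 'H:': {heading!r}")
        data_ref, data_length = data.split(":")
        regions.append(Region(float(area), float(height), heading_ref,
                              int(heading_length), data_ref, int(data_length)))
    return MatchResult(int(crossings), float(total_area), regions)


# Assembled from the values described for table2; only two regions are included.
parsed = parse_match_result("1#106.5#18_6_H:F2:2_F5:4#24_6_H:G2:3_H5:5")
```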

As described above, in this embodiment, the data item corresponding to each heading item can be identified simply by inputting the coordinates and string lengths of the heading items and data items. Automatic recognition of tables can therefore be performed with high accuracy using a simple definition of the table structure.

The present invention includes configurations such as those described in the following supplementary notes.
(Appendix 1)
A table structure automatic recognition program that causes a computer to execute processing comprising:
extracting a plurality of tables from a document database storing document data that includes the plurality of tables, each including a plurality of sections;
collating, with reference to a table extraction condition database storing a first heading of a table to be extracted, the first heading with the data of the sections of the heading items included in each of the plurality of tables;
extracting a table for which the result of the collation satisfies a predetermined condition; and
correcting, based on the length of each section included in the table satisfying the predetermined condition, at least one of the length of the section of a heading item and the length of the section of the data item corresponding to that heading item, the lengths being used for associating the sections of heading items with the sections of data items in the table.
(Appendix 2)
The table structure automatic recognition program according to appendix 1, wherein the correcting causes the computer to execute processing comprising:
acquiring, for each section of a data item included in the table satisfying the predetermined condition, the length of the section of the data item and normalizing that length; and
correcting the length of the section of the data item used for the association to the normalized value.
(Appendix 3)
The table structure automatic recognition program according to appendix 1 or 2, wherein the correcting causes the computer to execute processing comprising:
acquiring a second heading associated with the first heading, with reference to an item matching database in which headings whose data-item section data are consistent with each other are stored in association with each other;
collating the first heading and the second heading with the data of the sections of the heading items included in each of the plurality of tables;
extracting a second table for which the result of the collation satisfies a predetermined condition; and
correcting the length of the section of the heading item used for the association, using the average of the lengths of the sections of the heading items included in the second table.
(Appendix 4)
The table structure automatic recognition program according to any one of appendices 1 to 3, which causes the computer to execute processing comprising:
obtaining line segments each connecting one of the sections of the heading items of the table satisfying the predetermined condition to one of the sections of the data items;
identifying a combination of a heading item and a data item of the table for which the number of times the line segments cross each other is equal to or less than a predetermined number; and
outputting the identified heading item and data item in association with each other.
(Appendix 5)
The table structure automatic recognition program according to appendix 4, wherein the predetermined number is set to be a predetermined proportion of the number of heading items in the table satisfying the predetermined condition.
(Appendix 6)
The table structure automatic recognition program according to any one of appendices 1 to 5, wherein the collating causes the computer to execute:
processing of calculating a similarity for each of the plurality of tables based on the proportion of matches between the heading item data stored in the table extraction condition database and the heading item data of the plurality of tables extracted from the document database; and
processing of extracting, from the plurality of tables, a table whose similarity is equal to or greater than a predetermined threshold.
(Appendix 7)
A table structure automatic recognition method in which a computer automatically recognizes a table structure, the method comprising:
extracting a plurality of tables from a document database storing document data that includes the plurality of tables, each including a plurality of sections;
collating, with reference to a table extraction condition database storing a first heading of a table to be extracted, the first heading with the data of the sections of the heading items included in each of the plurality of tables;
extracting a table for which the result of the collation satisfies a predetermined condition; and
correcting, based on the length of each section included in the table satisfying the predetermined condition, at least one of the length of the section of a heading item and the length of the section of the data item corresponding to that heading item, the lengths being used for associating the sections of heading items with the sections of data items in the table.
(Appendix 8)
A table structure automatic recognition device that automatically recognizes a table structure, the device comprising:
a rectangular area extraction unit that extracts a plurality of tables from a document database storing document data that includes the plurality of tables, each including a plurality of sections;
a heading item collation unit that, with reference to a table extraction condition database storing a first heading of a table to be extracted, collates the first heading with the data of the sections of the heading items included in each of the plurality of tables, and extracts a table for which the result of the collation satisfies a predetermined condition; and
a length correction unit that corrects, based on the length of each section included in the table satisfying the predetermined condition, at least one of the length of the section of a heading item and the length of the section of the data item corresponding to that heading item, the lengths being used for associating the sections of heading items with the sections of data items in the table.

DESCRIPTION OF SYMBOLS
100 Table structure automatic recognition apparatus
200 Recognition processing unit
210 Rectangular area extraction unit
220 Heading item collation unit
230 Length correction unit
231 Data item length correction unit
232 Heading item length correction unit
240 Heading-corresponding data specifying unit
300 Storage unit
310 Design document database
320 Table extraction condition database
330 Item matching database

Claims

A table structure automatic recognition program that causes a computer to execute processing comprising:
extracting a plurality of tables from a document database storing document data that includes the plurality of tables, each including a plurality of sections;
collating, with reference to a table extraction condition database storing a first heading of a table to be extracted, the first heading with the data of the sections of the heading items included in each of the plurality of tables;
extracting a table for which the result of the collation satisfies a predetermined condition; and
correcting, based on the length of each section included in the table satisfying the predetermined condition, at least one of the length of the section of a heading item and the length of the section of the data item corresponding to that heading item, the lengths being used for associating the sections of heading items with the sections of data items in the table.

The table structure automatic recognition program according to claim 1, wherein the correcting causes the computer to execute processing comprising:
acquiring, for each section of a data item included in the table satisfying the predetermined condition, the length of the section of the data item and normalizing that length; and
correcting the length of the section of the data item used for the association to the normalized value.

The table structure automatic recognition program according to claim 1, wherein the correcting causes the computer to execute processing comprising:
acquiring a second heading associated with the first heading, with reference to an item matching database in which headings whose data-item section data are consistent with each other are stored in association with each other;
collating the first heading and the second heading with the data of the sections of the heading items included in each of the plurality of tables;
extracting a second table for which the result of the collation satisfies a predetermined condition; and
correcting the length of the section of the heading item used for the association, using the average of the lengths of the sections of the heading items included in the second table.

The table structure automatic recognition program according to any one of claims 1 to 3, which causes the computer to execute processing comprising:
obtaining line segments each connecting one of the sections of the heading items of the table satisfying the predetermined condition to one of the sections of the data items;
identifying a combination of a heading item and a data item of the table for which the number of times the line segments cross each other is equal to or less than a predetermined number; and
outputting the identified heading item and data item in association with each other.

A table structure automatic recognition method in which a computer automatically recognizes a table structure, the method comprising:
extracting a plurality of tables from a document database storing document data that includes the plurality of tables, each including a plurality of sections;
collating, with reference to a table extraction condition database storing a first heading of a table to be extracted, the first heading with the data of the sections of the heading items included in each of the plurality of tables;
extracting a table for which the result of the collation satisfies a predetermined condition; and
correcting, based on the length of each section included in the table satisfying the predetermined condition, at least one of the length of the section of a heading item and the length of the section of the data item corresponding to that heading item, the lengths being used for associating the sections of heading items with the sections of data items in the table.

A table structure automatic recognition device that automatically recognizes a table structure, the device comprising:
a rectangular area extraction unit that extracts a plurality of tables from a document database storing document data that includes the plurality of tables, each including a plurality of sections;
a heading item collation unit that, with reference to a table extraction condition database storing a first heading of a table to be extracted, collates the first heading with the data of the sections of the heading items included in each of the plurality of tables, and extracts a table for which the result of the collation satisfies a predetermined condition; and
a length correction unit that corrects, based on the length of each section included in the table satisfying the predetermined condition, at least one of the length of the section of a heading item and the length of the section of the data item corresponding to that heading item, the lengths being used for associating the sections of heading items with the sections of data items in the table.