JP6804913B2

JP6804913B2 - Table structure estimation system and method

Info

Publication number: JP6804913B2
Application number: JP2016183089A
Authority: JP
Inventors: 優浅野; 真岩山
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2016-09-20
Filing date: 2016-09-20
Publication date: 2020-12-23
Anticipated expiration: 2036-09-20
Also published as: JP2018049356A

Description

The present invention relates to a system and a method for estimating the structure of tables whose structure is unclear.

Data is increasingly being released as a public asset, a trend known as open data. For example, useful data published by national governments, ministries, and public bodies includes statistical data such as sales, purchasing information, and investment information by prefecture or by industry. At the same time, the need to make effective use of published data is growing. For example, a company can obtain useful insights by searching published statistical data and analyzing it in combination with data held inside the organization. Published data is currently often provided in tabular formats such as Excel (trademark) or comma-separated values (CSV). Hereinafter, data provided in tabular format is simply referred to as a "table".

Published tables are created with a structure decided unilaterally by the data creator or provider, and the table itself often contains no information about that structure. A third party who wants to use the data must therefore work with data of various structures without knowing the structure of each table. However, it is difficult to use data in a table whose structure is unclear without human intervention. In other words, it is difficult for an information processing device such as a computer to process published data mechanically, for example by searching it.

Here, the "structure" of a table means "what" the table represents and "from which viewpoint". The structure can be defined by "dimensions" and "measures": the "what" corresponds to a "measure", and the "viewpoint" corresponds to a "dimension". When the structure of a table is clear, the data in the table can be searched mechanically based on that structure. In addition, by associating the table structure with natural language expressions, searching the data in the table using natural language is expected to be relatively easy to realize. When the structure is unclear, on the other hand, several associations between the table structure and natural language expressions are possible, which makes natural language search difficult. To realize data retrieval from tables, a technique for automatically estimating and clarifying the structure is required.

One technique for estimating the structure of a table whose structure is unclear is described in Non-Patent Document 1. That document describes estimating what each column of a table represents by a machine learning method that uses the notation appearing in the table and the positional relationship with other columns.

Patent Document 1: JP-A-2015-41225

Non-Patent Document 1: Goel, A., Knoblock, C.A., Lerman, K.: Exploiting Structure within Data for Accurate Labeling using Conditional Random Fields. In: Proceedings of the 14th international conference on Artificial Intelligence, ICAI (2012)

Non-Patent Document 1 does not envisage using anything other than the table. Structure estimation is therefore limited to the table alone, for example the notation and positional relationships that appear in it. In this case, estimation becomes difficult if the table does not contain features from which the structure can be estimated.

Patent Document 1 discloses a technique for associating highly co-occurring text information with an image, but it is not a technique for handling tables.

An object of the present invention is to provide a technique for automatically estimating and clarifying the structure of a table with high accuracy even when the table does not contain features for estimating the structure.

One aspect of the present invention that solves the above problem is a table structure estimation system including an input device, an output device, a storage device, and a processing device. In this system, the storage device stores feature definition data that defines features related to both tables and text. The input device receives the data to be analyzed. The processing device includes: a related text information extraction unit that acquires a table from the data to be analyzed and acquires text related to the table as related text; a feature extraction unit that extracts features from the acquired table and related text using the feature definition data; and an identification unit that estimates the structure of the table based on the feature extraction result of the feature extraction unit.

Another aspect of the present invention is a table structure estimation method that uses an input device, an output device, a storage device, and a processing device. In this method, feature definition data that defines features related to both tables and text is prepared in the storage device. The data to be analyzed is input from the input device. The processing device acquires a table from the data to be analyzed, acquires text related to the table as related text, extracts features from the acquired table and related text using the feature definition data, and, based on the extracted features, estimates whether each item of the table is a "dimension" or a "measure".

According to the present invention, the structure of a table can be estimated, and as a result the accuracy of the estimation improves. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.

Conceptual diagram showing an example of a table to be processed
Conceptual diagram of the structure of a table
Conceptual diagram of the processing performed by the table structure estimation system of the embodiment
Hardware configuration block diagram of the system of the embodiment
Configuration block diagram of the preprocessing unit
Conceptual diagram explaining the association between a table and text
Table showing an example of feature definition data
Flow diagram showing the flow of training data processing
Conceptual diagram showing identification results for training data and an identification model
Flow diagram showing the flow of feature extraction processing
Flow diagram showing the flow of processing of the data to be analyzed
Table showing another example of an identification model
Table showing an example of identification processing using an identification model
Conceptual diagram of search processing using natural language

Embodiments will be described in detail with reference to the drawings. However, the present invention should not be construed as being limited to the description of the embodiments below. Those skilled in the art will readily understand that the specific configuration can be changed without departing from the idea or spirit of the present invention.

In the configurations of the invention described below, the same reference numerals are used in common across different drawings for identical parts or parts having similar functions, and duplicate description may be omitted.

Notations such as "first", "second", and "third" in this specification are attached to identify components and do not necessarily limit their number or order. Numbers for identifying components are used per context, and a number used in one context does not necessarily denote the same configuration in another context. Furthermore, a component identified by one number is not precluded from also serving the function of a component identified by another number.

To make the invention easier to understand, the position, size, shape, range, and the like of each configuration shown in the drawings may not represent the actual position, size, shape, range, and the like. The present invention is therefore not necessarily limited to the position, size, shape, range, and the like disclosed in the drawings.

Components expressed in the singular in this specification include the plural unless the context clearly indicates otherwise.

To solve the problem described above, representative examples of the method and system of the present invention estimate the structure by a method that uses features combining a table with documents related to the table. The examples include a table structure estimation method and a table structure estimation system. In the examples described below, documents related to the table are used in addition to the table itself to estimate the structure of the table.

<1. Description of table data>
Before the embodiment is described in detail, an example of a table to be processed by this embodiment is described.

FIG. 1 is an example of a table to be processed by the table structure estimation system of the embodiment. FIG. 1 shows an example of a report 100, which is the data to be analyzed. The report 100 contains a table 101, a table title 102, and an explanatory text 103. The table 101 shows the population of each age group for each year. The title 102 indicates that the table gives the population by year and by age. The explanatory text 103 contains statements related to the table 101. Hereinafter, information including text that accompanies the table 101, such as the title 102 and the explanatory text 103, is referred to as "text" or "text information". In this specification, "text" or "text information" may consist of a plurality of sentences or of a single sentence.

The data format of the table, title, and explanatory text is not particularly limited as long as the content can be recognized as text and its positional relationship, correspondence, or arrangement relative to the table can be identified. Examples include data containing text, such as Excel (trademark) or CSV files. Bitmap image data or paper documents may also be used. In the case of image data, the character portions can be processed by optical character recognition (OCR) or the like to obtain text data.

FIG. 2 shows the concept of the structure of the table 101. The cells 201 to 203 in the top row of the table 101 are "items". The "items" hold the "dimensions" and "measures" that make up the structure of the table. A "measure" is the thing being observed, and a "dimension" is a condition of the observation. In this example, "year" in cell 201 and "age" in cell 202 are "dimensions", and "population" in cell 203 is a "measure". Cells of the table 101 other than the "items" hold "item values". For the item "year" in cell 201, cells 204 and 205 hold "dimension values (year values)". For the item "age" in cell 202, cells 206 and 207 hold "dimension values (age values)". For the item "population" in cell 203, cells 208 and 209 hold the observed "measure values (population values)". Because item cells are generally found in the top row or the leftmost column of a table, that part can be assumed to contain the items by default.
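The structure just described can be sketched as a small data model. This is only an illustration: the class name `Item`, the helper `split_structure`, and the cell values are hypothetical and are not taken from FIG. 2.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    """One header cell ("item") of a table together with its column of values."""
    name: str          # header text, e.g. "year"
    role: str          # "dimension" or "measure"
    values: List[str]  # item values stored below the header


# Hypothetical contents loosely modelled on table 101 (values are made up).
table_101 = [
    Item("year", "dimension", ["2010", "2010"]),
    Item("age", "dimension", ["0-14", "65+"]),
    Item("population", "measure", ["1200", "610"]),
]


def split_structure(items: List[Item]) -> Tuple[List[str], List[str]]:
    """Return (dimension names, measure names) in column order."""
    dimensions = [i.name for i in items if i.role == "dimension"]
    measures = [i.name for i in items if i.role == "measure"]
    return dimensions, measures


print(split_structure(table_101))
```

In this representation, estimating the table structure amounts to filling in the `role` field of every item when it is not known in advance.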

The purpose of the table structure estimation system of this embodiment is to estimate the above structure automatically from a table. For example, the items "year", "age", and "population" admit at least two interpretations: "population by year and by age" and "(average) age and population by year". In the former case, "year" and "age" are "dimensions" and "population" is a "measure". In the latter case, "year" is a "dimension" and "age" and "population" are "measures".

In this embodiment, even when the structure of the table 101 is unknown, the system estimates from the table 101, the title 102, and the explanatory text 103 that, for example, "year" and "age" in FIG. 2 are "dimensions" and "population" is a "measure". Even when the items are omitted, the system estimates whether the item corresponding to each column is a "dimension" or a "measure".

<2. Outline of processing in this embodiment>
FIG. 3 shows the concept of the processing performed by the table structure estimation system of this embodiment. A typical example of this system adopts a supervised machine learning method that uses as features not only the table but also how the items and their values appear in the table's title and explanatory text (text information). This improves the accuracy with which the table structure is estimated.

The table structure estimation system of this embodiment uses training data 301. The training data 301 contains a table 101T, a table title 102T, and an explanatory text 103T. Structure data 302 is attached to the training data 301. The structure data 302 includes the content of the target data, the items that are dimensions, the values of the dimensions, the items that are measures, and so on. An operator can produce the structure data by visually inspecting and analyzing the training data 301, entering the result, and saving it as data.

This embodiment uses a plurality of sets of training data 301 and a corresponding plurality of sets of structure data 302. In general, the more training data 301 there is, the better the estimation accuracy.

Using the training data 301 and structure data 302 prepared in this way, features are extracted and an identification model is generated by a sequence labeling method. The identification model defines the relationship between the feature extraction results and the structure of the table. The identification model 303 includes, for example, a dimension model relating to the structure of dimensions and a measure model relating to the structure of measures. Once the identification model 303 has been prepared, analysis target data 304 is input and analyzed based on the identification model 303 to obtain structure estimation data 305. The configuration of the embodiment is described in detail below.

<3. Overall system configuration of the embodiment>
FIG. 4 is a block diagram showing the configuration of a table structure estimation system that is one example of an embodiment of the present invention. As a concrete example, the table structure estimation system 1 is implemented by an information processing device such as a computer. Like an ordinary information processing device, the table structure estimation system 1 includes a central processing unit (CPU) 11, a known input/output device 13 such as a keyboard and an image monitor, and a memory 15 such as a magnetic disk device or a semiconductor storage device. It may also include a data communication unit 12 as an interface for exchanging data with the outside. The data communication unit 12 is connected, for example, to an external network 16. The term "input/output device" does not mean only a device that has both input and output functions; it covers a device with only an input function, a device with only an output function, and a device with both.

In this embodiment, functions such as computation and control are realized when the CPU 11 executes a program stored in the memory 15, carrying out the prescribed processing in cooperation with other hardware. A program executed by the CPU 11, its function, or the means for realizing that function may be called a "function", "means", "part", "unit", "module", or the like. In FIG. 4, the functions that the CPU 11 executes based on software are shown conceptually as a control unit 14. The control unit 14 includes a preprocessing unit 141, a learning unit 142, an identification unit 143, and a display unit 144. The programs for realizing these functions are stored in the memory 15. The memory 15 also stores, as data, the training data 301, the structure data 302, the identification model 303, the analysis target data 304, the structure estimation data 305 that is the identification result, feature definition data 306, a dictionary 307, and so on.

The training data 301, structure data 302, analysis target data 304, feature definition data 306, and dictionary 307 stored in the memory 15 can be input via the input/output device 13 or the data communication unit 12. The identification model 303 and the structure estimation data 305 can be output via the input/output device 13 or the data communication unit 12.

The above configuration may be implemented on a single computer as shown in FIG. 4, or any part of the input device, output device, processing device, and storage device may be implemented on another computer connected via a network. Functions equivalent to those implemented in software in this embodiment can also be realized in hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit). Such forms are also included in the scope of the present invention.

<4. Description of the data structures of the system of the embodiment>
The data stored in the memory 15 is described below.

As outlined with reference to FIG. 3, the training data 301 is data containing the table 101T and the text information accompanying it. The text information is, for example, the table title 102T and the explanatory text 103T. In this embodiment, the training data 301 is text-based data and includes information indicating the positional relationship between the table 101T and the accompanying text information. This information may be, for example, coordinate information for the table 101T and the accompanying text information. Alternatively, it may be information indicating that the table 101T and the accompanying text information are contained in the same file, or information indicating that they stand in a particular relationship, for example that they are on the same page of an Excel (trademark) file.

As outlined with reference to FIG. 3, the structure data 302 is data indicating the structure of the table 101T in the training data. The structure data 302 specifies, for example, whether each "item name" of the table 101T is a "dimension" or a "measure". It may also specify the values of the "dimensions" and "measures", and it may include information about the name and content of the table. The structure data 302 is created by an operator who visually inspects and analyzes the training data 301.
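As a rough illustration of how operator-provided structure data could be turned into per-item training labels, consider the sketch below. The dictionary layout, the key names, and the function `item_labels` are assumptions made for this example; the source does not prescribe a storage format for the structure data 302.

```python
from typing import Dict, List

# Hypothetical structure data for one training table (layout is an assumption).
structure_data_302: Dict[str, object] = {
    "table_name": "Population by year and age",   # optional description
    "dimensions": {"year": ["2010", "2011"], "age": ["0-14", "65+"]},
    "measures": ["population"],
}


def item_labels(header: List[str], structure: Dict[str, object]) -> List[str]:
    """Map each item name in the header to "dimension", "measure", or "unknown"."""
    labels = []
    for name in header:
        if name in structure["dimensions"]:
            labels.append("dimension")
        elif name in structure["measures"]:
            labels.append("measure")
        else:
            labels.append("unknown")  # item not annotated by the operator
    return labels


print(item_labels(["year", "age", "population"], structure_data_302))
```

The resulting label sequence, one label per item in column order, is the kind of target a supervised learner would be trained against.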

The identification model 303 is a model generated using the training data 301 and the structure data 302. The generation and use of the identification model are described later.

The analysis target data 304 is table data, as shown in FIG. 3, whose table structure is unknown and is to be estimated. Its data organization may be the same as that of the training data 301; an example is the report 100 of FIG. 1.

The structure estimation data 305 is data indicating the table structure of the analysis target data 304, obtained by analyzing the analysis target data 304 based on the identification model 303. Its data organization may be the same as that of the structure data 302.

The feature definition data 306 is data that defines the features used to estimate the structure of a table. For example, an operator may create it based on experience and input it from the input/output device 13. Alternatively, it may be generated automatically.

The dictionary 307 is used in the processing of the related text information extraction unit 1412, which is described later.

<4. Description of the control unit of the system of the embodiment>
The control unit 14 is described below.

<4-1. Preprocessing unit>
FIG. 5 is a functional block diagram of the preprocessing unit 141. The preprocessing unit 141 includes a table-text association unit 1411, a related text information extraction unit 1412, a related text information interpretation unit 1413, a feature definition unit 1414, and a feature extraction unit 1415.

The table-text association unit 1411 extracts the text (text information) that corresponds to a table. Extraction can be performed, for example, on the condition that the table and the text are in a particular arrangement. For example, text can be extracted when it is placed adjacent to the table or appears on the same page. Text embedded inside the table, such as a title, can also be extracted. The table-text association unit 1411 performs this processing on the training data 301 and the analysis target data 304. By associating tables with text, the parts that specifically correspond to the table can be selected and extracted from the text information accompanying it. This processing may, however, be omitted, in which case all text information accompanying the table is used in the subsequent processing.
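One possible reading of the arrangement-based association is sketched below. It assumes that every block of a document carries a page number and vertical coordinates; the `Block` class, the `associate` function, and the `max_gap` threshold are hypothetical names introduced for this illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    kind: str      # "table" or "text"
    page: int
    top: float     # y coordinate of the upper edge (smaller = higher on the page)
    bottom: float  # y coordinate of the lower edge
    content: str


def associate(table: Block, blocks: List[Block],
              max_gap: Optional[float] = None) -> List[Block]:
    """Return text blocks on the same page as `table`.

    If `max_gap` is given, keep only text whose vertical distance to the
    table is at most `max_gap` (an assumed notion of "adjacent").
    """
    result = []
    for b in blocks:
        if b.kind != "text" or b.page != table.page:
            continue
        if max_gap is not None:
            # Gap is positive when the text lies entirely above or below the table.
            gap = max(b.top - table.bottom, table.top - b.bottom, 0.0)
            if gap > max_gap:
                continue
        result.append(b)
    return result


table = Block("table", 1, 100.0, 200.0, "")
title = Block("text", 1, 80.0, 95.0, "Population by year and age")
footer = Block("text", 1, 600.0, 650.0, "Unrelated footer")
other_page = Block("text", 2, 100.0, 120.0, "Text on another page")
print([b.content for b in associate(table, [title, footer, other_page], max_gap=10.0)])
```

Calling `associate` without `max_gap` corresponds to the looser "same page" condition, while a small `max_gap` approximates the "placed adjacent to the table" condition.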

From the text information extracted by the table-text association unit 1411, or from all text information accompanying the table, the related text information extraction unit 1412 extracts as related text information the text in which an item of the table or a value of an item appears. Synonyms of an item are treated in the same way as the item itself. For this purpose, the dictionary (thesaurus) 307 can be used to find synonyms and near-synonyms. Abbreviations of an item and the corresponding words in other languages can likewise be treated in the same way as the item. Item values can be treated as equivalent whether or not a unit is attached. Units that express the same physical quantity or property, such as kilograms (kg) and tons (t), or yen and US dollars, can also be treated as equivalent, as can values with and without a multiplier. These related words can be registered in the dictionary 307.

FIG. 6 shows an example of related text information extraction. From the title 102, text 601 is extracted because it contains the items "year", "age", and "population". From the explanatory text 103, text 602 containing "population" is extracted, as is text 603 containing the item "population" and the item values "2010", "65 years old", and "610 people". One or more pieces of text may be extracted, and each extracted piece may consist of a single sentence or of several sentences.
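The extraction of related text can be approximated by the following sketch, which keeps sentences mentioning an item, one of its synonyms, or an item value. The thesaurus entries, the sample sentences, and the function names are hypothetical, and plain substring matching on English text is used purely as a stand-in for whatever matching the actual system performs.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical thesaurus entries standing in for dictionary 307.
DICTIONARY_307: Dict[str, Set[str]] = {
    "population": {"population", "inhabitants"},
    "year": {"year", "fiscal year"},
}


def surface_forms(term: str, dictionary: Dict[str, Set[str]]) -> Set[str]:
    """Return the term together with its registered synonyms."""
    return dictionary.get(term, {term})


def extract_related_text(text: str, items: List[str], values: List[str],
                         dictionary: Dict[str, Set[str]]
                         ) -> List[Tuple[str, List[str], List[str]]]:
    """Return (sentence, matched items, matched values) for related sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    related = []
    for sentence in sentences:
        lower = sentence.lower()
        hit_items = [i for i in items
                     if any(form in lower for form in surface_forms(i, dictionary))]
        hit_values = [v for v in values if v.lower() in lower]
        if hit_items or hit_values:
            related.append((sentence, hit_items, hit_values))
    return related


sample = ("The number of inhabitants rose. "
          "In 2010 there were 610 people aged 65. "
          "The weather was mild.")
related = extract_related_text(sample, ["year", "age", "population"],
                               ["2010", "65", "610"], DICTIONARY_307)
for sentence, hit_items, hit_values in related:
    print(sentence, hit_items, hit_values)
```

Substring matching is deliberately naive here: it would, for instance, match "year" inside "years". Handling of units, multipliers, and abbreviations as described above would need additional normalization that this sketch omits.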

The related text information interpretation unit 1413 identifies the word order, subject, predicate, object, modification relationships, and so on of the sentences in the text information. It can be implemented with known document analysis software that performs natural language parsing, as used in text mining and similar applications.

The feature definition unit 1414 defines features of tables and text and generates the feature definition data 306.

FIG. 7 shows an example of the feature definition data 306. The example in FIG. 7 is definition data created by an operator based on rules of thumb. The feature definition data 306 includes information such as an ID 701 that uniquely identifies a feature, the target 702 of the feature, and the content 703 of the feature. As shown in FIG. 7, there are features that target the table, such as "feature 1" and "feature 2", and features that target both the table and the text, such as "feature 3" and "feature 4".

例えば、表を対象とする「特徴１」は、項目の表内の位置に関する特徴であり、「項目が表の右側に現れる」ことを内容とする。例えば、表と文書を対象とする「特徴３」は、項目と文章の構造に関する特徴であり、「項目が主語であり、修飾されている」ことを内容とする。また、例えば、表と文章を対象とする「特徴４」は、項目の値と文章の構造に関する特徴であり、「項目の値が述部に含まれる」ことを内容とする。 For example, "feature 1" for a table is a feature relating to the position of an item in the table, and includes "the item appears on the right side of the table". For example, "feature 3" for a table and a document is a feature relating to the structure of an item and a sentence, and includes "the item is the subject and is modified". Further, for example, "feature 4" for a table and a sentence is a feature relating to an item value and a sentence structure, and includes "the item value is included in the predicate".
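One way to picture the feature definition data 306 is as a list of records with the three fields named above. The following Python sketch is only an illustration: the class name, field names, and the `"table"` / `"table+text"` target strings are assumptions, and only the three features whose content is quoted in the text are included.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureDefinition:
    feature_id: str  # corresponds to ID 701
    target: str      # corresponds to target 702: "table" or "table+text" (hypothetical encoding)
    content: str     # corresponds to content 703

# Only the feature contents quoted in the description are listed here.
FEATURE_DEFINITIONS = [
    FeatureDefinition("feature1", "table",
                      "the item appears on the right side of the table"),
    FeatureDefinition("feature3", "table+text",
                      "the item is the subject and is modified"),
    FeatureDefinition("feature4", "table+text",
                      "the item value is included in the predicate"),
]

def needs_text(definition):
    """True when evaluating the feature requires the related text as well as the table."""
    return definition.target == "table+text"
```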

The feature definition unit 1414 defines features such as these and stores them in the memory 15 as the feature definition data 306. The features can, for example, be written by an operator based on empirical rules and entered through the input/output device 13 or the data communication unit 12. Alternatively, they can be produced automatically by generating candidate features exhaustively and then deleting those that do not contribute to the estimation during the learning unit processing described later.

The feature extraction unit 1415 determines, also using the interpretation result of the related text information interpretation unit 1413, whether each feature defined by the feature definition unit 1414 applies to each item of the training data 301 or the analysis target data 304. The determination is made separately for each item of the table.

<4-2. Learning unit>
The learning unit 142 generates the discriminative model 303 using the training data 301 and the structural data 302.

FIG. 8 shows the flow of the processing performed by the learning unit 142.

In process S801, the related text information extraction unit 1412 associates the table 101T with the title 102T and the explanatory text 103T in training data 301 such as that shown in FIG. 3. The concept of this association was described with reference to FIG. 6. As noted earlier, the table-text correspondence unit 1411 may be used to select and extract only the necessary parts of the text information of the training data 301 before processing.

In process S802, the correct answer is assigned to each piece of training data. That is, structural data 302 specifying "dimension" or "measure" is input for each item of, for example, the table 101T of FIG. 3. As noted earlier, an operator may inspect the table 101T visually, decide, and enter the result through the input/output device 13. The structural data 302 is stored in the memory 15. Process S802 may be performed before process S801 or after process S803.

In process S803, the features of the training data 301 are extracted based on the feature definition data 306, using the related text information interpretation unit 1413 and the feature extraction unit 1415.

FIG. 9 is a conceptual diagram illustrating feature extraction from the training data 301. For each item of the table included in the training data 301, it is determined whether the feature content 703 of the feature definition data 306 of FIG. 7 applies, and the features are extracted. In the example of FIG. 9, the feature extraction result 901 assigns a serial number 911 to each item of the training data 301 for convenience and records, for each feature ID 701, whether the feature was extracted for that item: "True" if it was and "False" if it was not. Because the correct answer for each item of the training data 301 is known from the structural data 302, the discriminative model 303 can be obtained by processing the feature extraction result 901 statistically.

In the example of FIG. 9, the discriminative model 303 defines models A, B, C, ... in a model definition table 902, one for each pattern of feature occurrence. When several items match a given model, the structural data 302 shows whether each of them was a "dimension" or a "measure", so the appearance frequency 903 of "dimension" and "measure" for that model can be determined statistically. The generated discriminative model 303 is stored in the memory 15.

The discriminative model 303 of FIG. 9 uses the appearance frequency 903 directly as the determination result for estimating the table structure, but the label with the higher appearance frequency may instead be selected to give a binary "dimension" or "measure" result.
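The statistical processing described for FIG. 9 can be sketched as a frequency table keyed by the True/False pattern of the features. This is a minimal Python illustration under stated assumptions: the function names and data layout are hypothetical, each distinct pattern plays the role of one "model" (A, B, C, ...), and ties in the binary decision fall to "dimension" only because of the key order used here.

```python
from collections import Counter, defaultdict

LABELS = ("dimension", "measure")

def build_model(feature_rows, labels):
    """Map each feature pattern to the relative frequency of each label.

    feature_rows: one tuple of booleans per table item (feature extraction result).
    labels: the correct answer per item, taken from the structural data.
    """
    counts = defaultdict(Counter)
    for pattern, label in zip(feature_rows, labels):
        counts[tuple(pattern)][label] += 1
    model = {}
    for pattern, counter in counts.items():
        total = sum(counter.values())
        model[pattern] = {label: counter[label] / total for label in LABELS}
    return model

def decide(model, pattern):
    """Binary variant: pick the label with the higher frequency, or None if unseen."""
    freq = model.get(tuple(pattern))
    if freq is None:
        return None
    return max(freq, key=freq.get)
```

Returning the frequency dictionary itself corresponds to using the appearance frequency as the result, while `decide` corresponds to the binary alternative.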

When features are generated exhaustively and automatically, as mentioned above, any feature that shows no correlation with the appearance frequency 903 can be judged not to contribute to the estimation and deleted, which reduces the number of features and the amount of processing.
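The text does not specify how the correlation is measured. As one possible stand-in, the Python sketch below compares the share of "dimension" items when a feature is True against the share when it is False, and drops the feature when the gap is small. The function name, the `min_gap` threshold, and this particular measure are assumptions for illustration only.

```python
def prune_features(feature_rows, labels, min_gap=0.1):
    """Return the indices of features worth keeping.

    A feature is dropped when it is constant over the training items, or when
    the "dimension" rate barely differs between its True and False groups.
    """
    keep = []
    n_features = len(feature_rows[0])
    for j in range(n_features):
        on = [lab for row, lab in zip(feature_rows, labels) if row[j]]
        off = [lab for row, lab in zip(feature_rows, labels) if not row[j]]
        if not on or not off:
            continue  # a constant feature carries no information
        rate_on = on.count("dimension") / len(on)
        rate_off = off.count("dimension") / len(off)
        if abs(rate_on - rate_off) >= min_gap:
            keep.append(j)
    return keep
```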

<4-3. Feature extraction process>
FIG. 10 shows the details of the training data feature extraction process S803. The flow of FIG. 10 shows the processing for a single item; the same processing is performed for every item of the table in the training data 301.

FIG. 10 shows an example in which feature extraction is performed using the feature definition data 306 of FIG. 7 to obtain the feature extraction result 901 of FIG. 9.

In process S8031, 1 is assigned to the variable N.

In process S8032, the content of the Nth feature is acquired from the feature definition data 306. The first feature is (feature 1).

In process S8033, the flow branches according to the content of the feature. If the feature relates to both the table and the text, the flow proceeds to process S8036; otherwise it proceeds to process S8034. In the example of the feature definition data 306 of FIG. 7, the features relating to both the table and the text are (feature 3), (feature 4), (feature 7), and (feature 8). The remaining features, which relate to the table only, are (feature 1), (feature 2), (feature 5), and (feature 6).

Process S8034, a feature determination process relating to the table only, determines the position of the item in the table. For example, (feature 1) and (feature 5) of FIG. 7 can be extracted in process S8034.

Process S8035 examines the item of the table or the value of the item. For example, (feature 2) and (feature 6) of FIG. 7 can be extracted in process S8035.

Process S8036, a feature determination process relating to both the table and the text, acquires the related text information extracted in process S801.

In process S8037, the related text information interpretation unit 1413 parses the sentences of the related text information.

Process S8038 determines how the item or the item value is used in the text. In the example of FIG. 7, (feature 3) is extracted by determining whether the item is the subject and is modified, and (feature 4) is extracted by determining whether the item value is included in the predicate.

As a result of the above processing, process S8039 yields a "True" or "False" result for the feature.

Process S8040 increments N. In process S8041, the flow ends if the last feature has been processed; otherwise it returns to process S8032 to extract the next feature. This produces the feature extraction result 901 illustrated in FIG. 9.
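The per-item loop of FIG. 10 can be sketched in Python as follows. This is an illustrative outline, not the actual implementation: the function signature, the `(feature_id, target)` tuples, and the two checker callbacks (which stand in for the table-only determination and the parsed-text determination) are assumptions, and the counter is 0-based here whereas the description starts N at 1.

```python
def extract_features(item, definitions, table_check, text_check, related_text):
    """Evaluate every defined feature for one table item.

    definitions: list of (feature_id, target) with target "table" or "table+text".
    table_check(feature_id, item) -> bool           stands in for S8034/S8035.
    text_check(feature_id, item, related_text) -> bool  stands in for S8036-S8038.
    """
    results = {}
    n = 0                                        # S8031 (0-based in this sketch)
    while n < len(definitions):                  # S8041: stop after the last feature
        feature_id, target = definitions[n]      # S8032: fetch the Nth feature
        if target == "table+text":               # S8033: branch on the feature target
            hit = text_check(feature_id, item, related_text)
        else:
            hit = table_check(feature_id, item)
        results[feature_id] = bool(hit)          # S8039: record True/False
        n += 1                                   # S8040: move to the next feature
    return results
```

Calling this function once per table item yields one row of True/False values per item, which is the shape of the feature extraction result described for FIG. 9.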

The above description covered process S803 for the training data 301, but the feature extraction process S1102 for the analysis target data 304 can be performed in the same way.

<4-4. Identification unit>
The identification unit 143 generates the structure estimation data 305 using the analysis target data 304 and the discriminative model 303.

FIG. 11 shows the flow of the processing performed by the identification unit 143.

In process S1101, the related text information extraction unit 1412 associates the table 101 with the title 102 and the explanatory text 103 in the analysis target data 304. The specific processing may be the same as in process S801.

In process S1102, the features of the analysis target data 304 are extracted based on the feature definition data 306, using the related text information interpretation unit 1413 and the feature extraction unit 1415. The specific processing may be the same as in process S803, that is, the same as described with reference to FIG. 10. The feature extraction result of the analysis target data 304 has the same structure as the feature extraction result 901 of the training data 301 in FIG. 9.

In process S1103, the feature extraction result of the analysis target data 304 is applied to the discriminative model 303 to identify the items of the table included in the analysis target data 304. For example, under the binary scheme, an item for which the features matching model A of FIG. 9 are extracted is determined to be a "dimension".

In process S1104, the identification unit 143 sends the identification result to the display unit 144, which controls, for example, an image display device of the input/output device 13 to display the result. The result may also be transmitted from the data communication unit 12, either at the same time or instead.

The embodiment described above uses as features not only the table itself but also the way the table's items and their values appear in the table's title and explanatory text. Because both the table and the documents related to it are used, the structure can be estimated even when the table itself contains no features for estimating its structure, as long as the related documents contain such features. As a result, the accuracy of the estimation improves. In addition, because the discriminative model for estimating the structure can be generated automatically, the amount of manual work can be reduced.

The discriminative model 303 used in Example 1 uses all of the defined features. However, a discriminative model can also be generated from only those features that correlate strongly with the result rather than from all of them. A desired discriminative model can be generated by supplying a predetermined algorithm.

For example, when the feature extraction results 901 of the training data 301 in FIG. 9 are aggregated, a model can be generated by an algorithm of the form "when a predetermined condition holds, generate a predetermined discriminative model".

For example, when the condition "if a certain feature X is "True", the item is a "dimension" in 90% or more of cases regardless of the results of the other features" holds, the discriminative model "if feature X is "True", the item is a "dimension"" can be generated.
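The 90% condition can be sketched as a single pass over the aggregated feature extraction results. In this Python illustration the function name, the data layout, and the extension of the rule to the "measure" label are assumptions; only the threshold of 90% and the "feature X implies dimension" form come from the text.

```python
def generate_rules(feature_rows, labels, feature_ids, threshold=0.9):
    """Return {feature_id: label} for features that alone predict a label.

    A rule is generated when, among the items where the feature is True,
    at least `threshold` of them carry the same label, whatever the other
    features say.
    """
    rules = {}
    for j, feature_id in enumerate(feature_ids):
        hits = [lab for row, lab in zip(feature_rows, labels) if row[j]]
        if not hits:
            continue  # the feature never fired, so no rule can be derived
        for label in ("dimension", "measure"):
            if hits.count(label) / len(hits) >= threshold:
                rules[feature_id] = label
    return rules
```

A rule derived from very few items is unreliable, so a practical version would probably also require a minimum number of supporting items; that safeguard is not described in the text and is left out here.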

The above is a simple example; a variety of conditions and discriminative models can be adopted, and more complex conditions may be used to generate more complex discriminative models.

In Example 1, the learning unit 142 automatically generated the discriminative model 303 using the training data 301 and the structural data 302.

However, a simple discriminative model 303 can also be created manually by the data user based on experience. The created discriminative model 303 is entered through the input/output device 13 or the data communication unit 12 and stored in the memory 15. The system configuration in this case is basically the same as in Example 1, except that the learning unit 142 of FIG. 4 and the data used by the learning unit 142 are unnecessary.

FIG. 12 shows an example of the discriminative model 303-2 of Example 3. The discriminative model 303-2 of FIG. 12 was created by an operator based on empirical rules and is displayed in classified form. FIG. 12 defines features that target the table, (feature 1), (feature 2), (feature 5), and (feature 6), and features that target the table and the text, (feature 3), (feature 4), (feature 7), and (feature 8). It also distinguishes features that indicate a "measure", (feature 1) to (feature 4), from features that indicate a "dimension", (feature 5) to (feature 8).

For example, (feature 1), which targets the table, concerns the position of an item within the table; its content is "the item appears on the right side of the table". Because items representing a "measure" tend to appear on the right side of a table, an item for which (feature 1) is extracted is determined to be a "measure".

For example, (feature 3), which targets the table and the text, concerns the table item and the sentence structure; its content is "in the text, the item is the subject and is modified". An item for which this feature is extracted is determined to be a "measure".

For example, (feature 4), which targets the table and the text, concerns the value of a table item and the sentence structure; its content is "the item value is included in the predicate". An item for which this feature is extracted is determined to be a "measure".

The processing flow when the identification unit 143 performs identification using a discriminative model 303-2 such as that of FIG. 12 is the same as that described with reference to FIG. 11.

FIG. 13 shows an example of the identification process S1103 for the analysis target data 304 in Example 3. In this example, each item of the table included in the analysis target data 304 is given a serial number 1311 for convenience. For each item, the eight features shown in FIG. 12 are checked, and the number of extracted features is counted. For example, for item #001, the count of features indicating a measure is 3 and the count of features indicating a dimension is 1; because this example decides by majority vote, the result is "measure". Variations are of course possible, such as assigning a different weight to each feature instead of taking a simple majority. The eight features described are only examples, and the number of feature types can be increased or decreased.
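The majority vote of FIG. 13, together with the optional per-feature weights, can be sketched in a few lines of Python. The identifiers and the decision to return `None` on a tie are assumptions of this sketch; the grouping of features 1 to 4 as "measure" indicators and features 5 to 8 as "dimension" indicators follows FIG. 12 as described.

```python
MEASURE_FEATURES = {"feature1", "feature2", "feature3", "feature4"}
DIMENSION_FEATURES = {"feature5", "feature6", "feature7", "feature8"}

def classify_by_vote(extracted, weights=None):
    """Classify one item from the set of feature IDs extracted for it.

    weights: optional {feature_id: weight}; every feature counts 1.0 by default,
    which gives a simple majority vote.
    """
    weights = weights or {}
    measure = sum(weights.get(f, 1.0) for f in extracted if f in MEASURE_FEATURES)
    dimension = sum(weights.get(f, 1.0) for f in extracted if f in DIMENSION_FEATURES)
    if measure == dimension:
        return None  # tie: left undecided in this sketch
    return "measure" if measure > dimension else "dimension"
```

With three measure-indicating features and one dimension-indicating feature extracted, as for item #001, the function returns "measure"; giving the single dimension-indicating feature a large enough weight reverses the result.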

FIG. 14 shows an application example: a natural-language search system that uses the table structure information obtained by Examples 1 to 3.

Through Examples 1 to 3, structural data 302 describing the structure has been attached to the table 101 of the analysis target data 304. A searchable natural language expression 1401 is prepared according to this structure. The natural language expression 1401 may be generated automatically based on predetermined generation rules or may be written by an operator. For example, when the dimension is "prefecture" and the measures are "age" and "population", the natural language expression describing the table is "age and population by prefecture". Presenting such an expression to the user 1402, or having the user 1402 enter one, therefore helps the user search for a desired table in natural language.
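The generation rule behind "age and population by prefecture" is not spelled out in the text. One simple rule that reproduces the example is "measures by dimensions", sketched below in Python; the function names and the English phrasing template are assumptions for illustration.

```python
def join_words(words):
    """Join words as 'a', 'a and b', or 'a, b, and c'."""
    words = list(words)
    if len(words) <= 1:
        return "".join(words)
    if len(words) == 2:
        return f"{words[0]} and {words[1]}"
    return ", ".join(words[:-1]) + f", and {words[-1]}"

def describe_table(dimensions, measures):
    """Build a searchable expression of the form '<measures> by <dimensions>'."""
    return f"{join_words(measures)} by {join_words(dimensions)}"
```

For the example in the text, `describe_table(["prefecture"], ["age", "population"])` yields "age and population by prefecture".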

The present invention is not limited to the embodiments described above and includes various modifications. For example, part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. For part of the configuration of each embodiment, configurations of other embodiments can also be added, deleted, or substituted.

Table: 101
Table title: 102
Explanatory text: 103

Claims

A table structure estimation system comprising an input device, an output device, a storage device, and a processing device, wherein
the storage device stores feature definition data that defines features relating to both a table and text,
the input device accepts analysis target data,
the processing device comprises:
a related text information extraction unit that acquires a table from the analysis target data and acquires text related to the table as related text;
a feature extraction unit that extracts the features from the acquired table and the related text using the feature definition data; and
an identification unit that estimates, based on the feature extraction result of the feature extraction unit, whether each item of the table is a "dimension" or a "measure",
the related text information extraction unit acquires, as the related text, text that includes an item of the table included in the analysis target data,
the features define how the items of the table are used in text,
the processing device has a related text information interpretation unit that parses the related text,
the feature extraction unit extracts the features by determining how the items of the table are used in the related text,
the way in which the items of the table are used includes at least one of:
(1) the item is the subject and is modified; and
(2) the item modifies the subject,
the storage device stores a discriminative model that defines the relationship between the feature extraction result of the feature extraction unit and the distinction between "dimension" and "measure" of the items of the table, and
the identification unit estimates the distinction between "dimension" and "measure" of the items of the table using the discriminative model.

A table structure estimation system comprising an input device, an output device, a storage device, and a processing device, wherein
the storage device stores feature definition data that defines features relating to both a table and text,
the input device accepts analysis target data,
the processing device comprises:
a related text information extraction unit that acquires a table from the analysis target data and acquires text related to the table as related text;
a feature extraction unit that extracts the features from the acquired table and the related text using the feature definition data; and
an identification unit that estimates, based on the feature extraction result of the feature extraction unit, whether each item of the table is a "dimension" or a "measure",
the related text information extraction unit acquires, as the related text, text that includes a value of an item of the table included in the analysis target data,
the features define how the values of the items of the table are used in text,
the processing device has a related text information interpretation unit that parses the related text,
the feature extraction unit extracts the features by determining how the values of the items of the table are used in the related text,
the way in which the values of the items of the table are used includes at least one of:
(1) the item value is included in the predicate; and
(2) the item value modifies the subject,
the storage device stores a discriminative model that defines the relationship between the feature extraction result of the feature extraction unit and the distinction between "dimension" and "measure" of the items of the table, and
the identification unit estimates the distinction between "dimension" and "measure" of the items of the table using the discriminative model.

A table structure estimation system comprising an input device, an output device, a storage device, and a processing device, wherein
the storage device stores feature definition data that defines features relating to both a table and text,
the input device accepts analysis target data,
the processing device comprises:
a related text information extraction unit that acquires a table from the analysis target data and acquires text related to the table as related text;
a feature extraction unit that extracts the features from the acquired table and the related text using the feature definition data; and
an identification unit that estimates, based on the feature extraction result of the feature extraction unit, whether each item of the table is a "dimension" or a "measure",
the related text information extraction unit acquires, as the related text, text that includes a word related to an item of the table included in the analysis target data,
the storage device stores a dictionary that defines words related to a given word as related words,
the word related to the item of the table is a related word defined by the dictionary,
the features define how the items of the table are used in text,
the processing device has a related text information interpretation unit that parses the related text,
the feature extraction unit extracts the features by determining how the items of the table are used in the related text,
the way in which the items of the table are used includes at least one of:
(1) the item is the subject and is modified; and
(2) the item modifies the subject,
the storage device stores a discriminative model that defines the relationship between the feature extraction result of the feature extraction unit and the distinction between "dimension" and "measure" of the items of the table, and
the identification unit estimates the distinction between "dimension" and "measure" of the items of the table using the discriminative model.

A table structure estimation system comprising an input device, an output device, a storage device, and a processing device, wherein
the storage device stores feature definition data that defines features relating to both a table and text,
the input device accepts analysis target data,
the processing device comprises:
a related text information extraction unit that acquires a table from the analysis target data and acquires text related to the table as related text;
a feature extraction unit that extracts the features from the acquired table and the related text using the feature definition data; and
an identification unit that estimates, based on the feature extraction result of the feature extraction unit, whether each item of the table is a "dimension" or a "measure",
the related text information extraction unit acquires, as the related text, text that includes a word related to a value of an item of the table included in the analysis target data,
the storage device stores a dictionary that defines words related to a given word as related words,
the word related to the value of the item of the table is a related word defined by the dictionary,
the features define how the values of the items of the table are used in text,
the processing device has a related text information interpretation unit that parses the related text,
the feature extraction unit extracts the features by determining how the values of the items of the table are used in the related text,
the way in which the values of the items of the table are used includes at least one of:
(1) the item value is included in the predicate; and
(2) the item value modifies the subject,
the storage device stores a discriminative model that defines the relationship between the feature extraction result of the feature extraction unit and the distinction between "dimension" and "measure" of the items of the table, and
the identification unit estimates the distinction between "dimension" and "measure" of the items of the table using the discriminative model.

The table structure estimation system according to any one of claims 1 to 4, wherein
the input device accepts training data,
the processing device includes a learning unit, and
the learning unit
stores, in the storage device as structural data, information on the distinction between "dimension" and "measure" of the items of the table of the training data,
causes the related text information extraction unit and the feature extraction unit to execute their processing with the training data as the analysis target data, and
generates the discriminative model based on the features extracted by the feature extraction unit and the structural data, and stores the discriminative model in the storage device.

The table structure estimation system according to any one of claims 1 to 4, further comprising a table-text correspondence unit, wherein
the table-text correspondence unit extracts, from the analysis target data, text that has a specific positional relationship to the table, and sends the extracted text to the related text information extraction unit.

A method for estimating a table structure using an input device, an output device, a storage device, and a processing device.
In the storage device
Prepare feature definition data that defines features related to both tables and sentences,
From the input device
Enter the data to be analyzed and
The processing device
A table is acquired from the analysis target data, and sentences related to the table are acquired as related texts.
Using the feature definition data, the feature is extracted from the acquired table and the related text.
Based on the extracted features, and estimate the distinction of "measure" and "dimension" of the item of the table,
The text including the table items included in the analysis target data is acquired as the related text, and the text is obtained.
The above features
It defines how table items are used in a document.
The processing device
By determining how the items in the table are used in the related text, the features are extracted.
How to use the items in the above table
(1) The item is the subject and is qualified
(2) The item modifies the subject
Including at least one of
When estimating the distinction between "dimension" and "measure" of the items in the table, a discriminative model is used that defines the relationship between the extracted features and the distinction between "dimension" and "measure" of the items in the table.
A method for estimating a table structure.
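
The two usage patterns named in this claim can be turned into countable features roughly as follows. This is a sketch under stated assumptions: the related text is assumed to have been parsed already into a subject and its modifiers (the parser itself is out of scope here), and `ParsedSentence`, `item_usage_features`, and the feature names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedSentence:
    # Hypothetical pre-parsed form of one sentence of related text.
    subject: str
    subject_modifiers: list = field(default_factory=list)


def item_usage_features(item, sentences):
    """Count how often a table item appears in each of the two usage patterns."""
    features = {"subject_and_modified": 0, "modifies_subject": 0}
    for sentence in sentences:
        # Pattern (1): the item is the subject and is modified.
        if sentence.subject == item and sentence.subject_modifiers:
            features["subject_and_modified"] += 1
        # Pattern (2): the item modifies the subject.
        if item in sentence.subject_modifiers:
            features["modifies_subject"] += 1
    return features


# Assumed parse of a sentence such as "Sales by prefecture are shown."
related_text = [ParsedSentence("sales", ["prefecture"])]
sales_features = item_usage_features("sales", related_text)
prefecture_features = item_usage_features("prefecture", related_text)
```

The resulting counts would then be passed to the discriminative model. The sketch does not assume which pattern indicates "measure" and which indicates "dimension".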

A method for estimating a table structure using an input device, an output device, a storage device, and a processing device.
In the storage device,
Feature definition data that defines features relating to both tables and sentences is prepared,
From the input device,
Analysis target data is input,
In the processing device,
A table is acquired from the analysis target data, and sentences related to the table are acquired as related texts.
Using the feature definition data, the feature is extracted from the acquired table and the related text.
Based on the extracted features, the distinction between "measure" and "dimension" is estimated for each item of the table,
A sentence in the analysis target data that contains a value of an item of the table is acquired as the related text,
The features
Define how the values of the table items are used in text,
In the processing device,
The features are extracted by determining how the values of the items of the table are used in the related text,
The ways in which the values of the items of the table are used
Include at least one of:
(1) The item value is included in the predicate
(2) The item value modifies the subject
When estimating the distinction between "dimension" and "measure" of the items in the table, a discriminative model is used that defines the relationship between the extracted features and the distinction between "dimension" and "measure" of the items in the table.
A method for estimating a table structure.
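
For the value-based variant in this claim, a comparable sketch checks the values of one table column against the predicate and against the subject's modifiers. It assumes each sentence is supplied as a dictionary with already tokenized `predicate_tokens` and `subject_modifiers`. That dictionary layout and the function name `value_usage_features` are hypothetical.

```python
def value_usage_features(column_values, sentences):
    """Count the sentences in which a column's values match each usage pattern."""
    values = set(column_values)
    features = {"value_in_predicate": 0, "value_modifies_subject": 0}
    for sentence in sentences:
        # Pattern (1): an item value is included in the predicate.
        if values & set(sentence["predicate_tokens"]):
            features["value_in_predicate"] += 1
        # Pattern (2): an item value modifies the subject.
        if values & set(sentence["subject_modifiers"]):
            features["value_modifies_subject"] += 1
    return features


# Assumed parse of a sentence such as "Sales in Tokyo were 120."
related_text = [
    {
        "subject": "sales",
        "subject_modifiers": ["Tokyo"],
        "predicate_tokens": ["were", "120"],
    }
]
prefecture_features = value_usage_features(["Tokyo", "Osaka"], related_text)
sales_features = value_usage_features(["120", "95"], related_text)
```

Exact string matching is used here only for brevity. Numeric values in real text would likely need normalization (units, digit grouping) before comparison.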

From the input device,
Training data is input,
From the input device,
Structural data is input that includes information distinguishing the "dimension" and "measure" of the table items included in the training data,
In the processing device,
A table is acquired from the training data, and sentences related to the table are acquired as related texts.
Using the feature definition data, the feature is extracted from the acquired table and the related text.
Based on the extracted features and the structural data, the discriminative model is generated and stored in the storage device.
The method for estimating a table structure according to claim 7 or 8.
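
The training step in this claim (features plus structural data in, discriminative model out) might be sketched as below. The claim does not fix a model type, so logistic regression trained by batch gradient descent is used purely as a stand-in. The function names, the two-feature layout, and the toy labels (1 for "measure", 0 for "dimension") are assumptions made for illustration.

```python
import math


def train_discriminative_model(features, labels, epochs=200, learning_rate=0.5):
    """Fit logistic-regression weights; labels are 1 (measure) or 0 (dimension)."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = predicted - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Average the gradient over the batch and take one descent step.
        weights = [
            w - learning_rate * g / len(features)
            for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / len(features)
    return weights, bias


def estimate(model, x):
    """Apply the stored model to one item's feature vector."""
    weights, bias = model
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return "measure" if z > 0 else "dimension"


# Toy training set: [subject_and_modified, modifies_subject] counts per item,
# with labels taken from hypothetical structural data.
training_features = [[1, 0], [2, 0], [0, 1], [0, 2]]
training_labels = [1, 1, 0, 0]
model = train_discriminative_model(training_features, training_labels)
```

In the claimed method the returned model would be stored in the storage device and reused when estimating items of new tables. Here it is simply kept in the variable `model`.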