JP6817246B2

JP6817246B2 - Data processing equipment, data processing method and data processing program

Info

Publication number: JP6817246B2
Application number: JP2018053984A
Authority: JP
Inventors: 雅晴服部; 康孝西村; 吉原　貴仁; 貴仁吉原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-03-22
Filing date: 2018-03-22
Publication date: 2021-01-20
Anticipated expiration: 2038-03-22
Also published as: JP2019168758A

Description

The present invention relates to a device, a method, and a program for creating structured data and a schema from semi-structured data.

Conventionally, local governments, for example, have published information about their municipalities, such as nursery school information, public facility information, and demographic statistics, as open data. Integrating and analyzing such data makes it possible to provide a house-hunting support service that, when a user searches the Web for a home, also displays information about the town and comparisons with other towns. Likewise, companies hold data such as customer information, product information, and purchase information. Integrating and analyzing these data reveals customers' purchasing tendencies, which can be used to recommend similar products or to develop new ones.

In this way, integrating and analyzing data distributed across multiple sources creates value that no individual data set could provide alone.
However, the data published by the local governments and other bodies to be integrated may differ in format. For example, one local government publishes structured data in CSV (Comma-Separated Values) format with attribute names in the first row, so that it can be handled by an RDB (Relational Database), while another publishes semi-structured data in a format such as PDF (Portable Document Format) or XML (eXtensible Markup Language), which an RDB cannot handle. Therefore, integrating the data requires creating structured data and an RDB schema from semi-structured data in PDF or XML format.
Here, semi-structured data is data that contains both structured data that can be expressed in tabular form, such as a list of childcare facilities, and unstructured data that cannot, such as the titles and prefaces included in published documents.

The following methods, for example, have been proposed for creating structured data and an RDB schema from semi-structured data.
In the method of Patent Document 1, an RDB schema is created from the <xsd:element name="(attribute name)"> declarations in an XML schema.
In the method of Patent Document 2, from descriptions of the form <(attribute name)>attribute value</(attribute name)> in the tags of XML data, the word inside < > is recognized as an attribute name, and structured data and an RDB schema are created.
In the method of Patent Document 3, the sentence structure of natural-language text is assumed in advance, and structured data and an RDB schema are created based on a structuring conversion rule that specifies the attribute names and the positions where their attribute values appear, for example, "[action] was performed for [purpose] at [event]."

In addition, the library of Non-Patent Document 1 treats each text, table, image, or the like as one information unit and converts PDF data into XML data in which the information units are delimited by <ahp:frame></ahp:frame> tags. For example, a section delimited by <ahp:frame ahp:frame-type="table"></ahp:frame> contains the description of tabular data.

Patent Document 1: Japanese Unexamined Patent Publication No. 2006-350924
Patent Document 2: Japanese Unexamined Patent Publication No. 2003-271443
Patent Document 3: Japanese Unexamined Patent Publication No. 2003-288332

Non-Patent Document 1: "Antenna House PDFXML Conversion Library V2.0", [online], Antenna House Co., Ltd., [retrieved March 13, 2018], Internet <http://www.antenna.co.jp/pdfxml/>

However, the conventional methods described above automatically create structured data and a schema usable in an RDB from semi-structured data only by relying on an XML schema or XML data in which attribute names are explicitly stated, or on a structuring conversion rule in which attribute names are specified in advance.

In the method of Patent Document 1, if no XML schema defining the attribute names exists for the target semi-structured data, the attribute names cannot be recognized and an RDB schema cannot be created automatically.
In the method of Patent Document 2, if the XML tags in the target semi-structured data contain descriptions indicating document style (for example, <P> or <TD>) rather than attribute names, the attribute names cannot be recognized from the information in the tags and an RDB schema cannot be created automatically.
In the method of Patent Document 3, a structuring conversion rule specifying the attribute names and the positions of their attribute values must be created manually for each piece of semi-structured data in which attribute names and attribute values are not distinguished, so an RDB schema cannot be created automatically.

Further, although the method of Non-Patent Document 1 can extract structured data in CSV format from semi-structured data, it cannot automatically create correct structured data and an RDB schema when it cannot be recognized whether the value of each cell is an attribute name or an attribute value.

Semi-structured data has no schema. It may contain both an attribute name expressing a property, such as "implementing facility", and an attribute value corresponding to that attribute, such as "First Nursery School", without clearly distinguishing the two. Here, "not distinguished" means that attribute names and attribute values are written in the data in the same form, so that they cannot be told apart from the description alone.
Thus, when attribute names and attribute values are not clearly distinguished in the target semi-structured data, it has been difficult to create correct structured data and an RDB schema with the conventional methods.

An object of the present invention is to provide a data processing device, a data processing method, and a data processing program capable of automatically creating structured data and a schema from semi-structured data in which attribute names and attribute values are not clearly distinguished.

A data processing device according to the present invention includes: a data acquisition unit that converts semi-structured data into general-purpose data and extracts a table-structure portion from the general-purpose data; an attribute candidate extraction unit that calculates the frequency with which each word contained in the table-structure portion appears and, from a histogram of those frequencies, extracts the words whose frequency has the highest count as attribute candidates; an attribute estimation unit that determines, as attribute names, the words arranged in whichever of the predetermined row or column combinations based on the format of the table structure has the highest occurrence rate of the attribute candidates; a structured data creation unit that creates structured data by associating with the attribute names, as attribute values, the data in the general-purpose data other than the attribute names; and a schema creation unit that creates a schema defining the attribute names.

The attribute estimation unit may adopt only those attribute candidates whose similarity to a word in a predetermined common vocabulary is equal to or greater than a threshold.

When the attribute candidates are not adopted by the attribute estimation unit, the attribute candidate extraction unit may extract, as the attribute candidates, the words whose frequency has the next-highest count.

The attribute estimation unit may determine the format of the table structure based on the frequency of the words extracted as the attribute candidates.

The data acquisition unit may convert the semi-structured data into general-purpose data that includes XML format, and extract the table-structure portion by means of a specific tag in the XML format.

In a data processing method according to the present invention, a computer executes: a data acquisition step of converting semi-structured data into general-purpose data and extracting a table-structure portion from the general-purpose data; an attribute candidate extraction step of calculating the frequency with which each word contained in the table-structure portion appears and, from a histogram of those frequencies, extracting the words whose frequency has the highest count as attribute candidates; an attribute estimation step of determining, as attribute names, the words arranged in whichever of the predetermined row or column combinations based on the format of the table structure has the highest occurrence rate of the attribute candidates; a structured data creation step of creating structured data by associating with the attribute names, as attribute values, the data in the general-purpose data other than the attribute names; and a schema creation step of creating a schema defining the attribute names.

A data processing program according to the present invention causes a computer to execute: a data acquisition step of converting semi-structured data into general-purpose data and extracting a table-structure portion from the general-purpose data; an attribute candidate extraction step of calculating the frequency with which each word contained in the table-structure portion appears and, from a histogram of those frequencies, extracting the words whose frequency has the highest count as attribute candidates; an attribute estimation step of determining, as attribute names, the words arranged in whichever of the predetermined row or column combinations based on the format of the table structure has the highest occurrence rate of the attribute candidates; a structured data creation step of creating structured data by associating with the attribute names, as attribute values, the data in the general-purpose data other than the attribute names; and a schema creation step of creating a schema defining the attribute names.

According to the present invention, structured data and a schema can be created automatically from semi-structured data in which attribute names and attribute values are not clearly distinguished.

A diagram showing the functional configuration of the data processing device according to the embodiment.
A diagram illustrating semi-structured data of the first pattern according to the embodiment.
A diagram illustrating semi-structured data of the second pattern according to the embodiment.
A diagram illustrating general-purpose data for the semi-structured data of the first pattern according to the embodiment.
A diagram illustrating a histogram for the semi-structured data of the first pattern according to the embodiment.
A diagram illustrating the result of calculating the similarity between the attribute candidates and the common vocabulary for the semi-structured data of the first pattern according to the embodiment.
A diagram showing the method of distinguishing attribute names from attribute values in the semi-structured data of the first pattern according to the embodiment.
A diagram showing an output example of the structured data and schema for the semi-structured data of the first pattern according to the embodiment.
A diagram illustrating general-purpose data for the semi-structured data of the second pattern according to the embodiment.
A diagram illustrating a histogram for the semi-structured data of the second pattern according to the embodiment.
A diagram illustrating the result of calculating the similarity between the attribute candidates and the common vocabulary for the semi-structured data of the second pattern according to the embodiment.
A diagram showing the method of distinguishing attribute names from attribute values in the semi-structured data of the second pattern according to the embodiment.
A diagram showing an output example of the structured data and schema for the semi-structured data of the second pattern according to the embodiment.
A first flowchart showing the process of creating structured data and a schema according to the embodiment.
A second flowchart showing the process of creating structured data and a schema according to the embodiment.
A third flowchart showing the process of creating structured data and a schema according to the embodiment.

An example of an embodiment of the present invention is described below.
FIG. 1 is a diagram showing the functional configuration of the data processing device 1 according to the present embodiment.
The data processing device 1 is an information processing device such as a server or a personal computer, and implements the various functions of the present embodiment by having its control unit execute predetermined software (a data processing program) stored in its storage unit. The data processing device 1 also includes input/output devices and a communication interface. It takes the semi-structured data to be processed as input and, using the functions of the present embodiment, creates and outputs structured data and a schema.

For example, a local government publishes information on facilities that provide childcare services as a PDF file that partly contains tabular data. Although such data contains tables, its attribute names and attribute values are not clearly distinguished, so it cannot be imported directly into an RDB. From such semi-structured data, the data processing device 1 automatically creates structured data and a schema so that the data can be imported into an RDB.

The data processing device 1 includes a data acquisition unit 11, an attribute candidate extraction unit 12, an attribute estimation unit 13, a structured data creation unit 14, a schema creation unit 15, a memory unit 16, a rule management unit 17, a common vocabulary storage unit 18, and a setting unit 19.

When the data acquisition unit 11 acquires semi-structured data from a data resource, including resources on the Internet, it converts the semi-structured data into general-purpose data using an OCR (Optical Character Recognition/Reader) tool. The general-purpose data is, for example, data in XML format, and is preferably also converted into data in text format.
Further, the data acquisition unit 11 uses a specific tag in the XML format (for example, <table>) as a marker to extract the table-structure portion from the converted general-purpose data, and transfers it to the attribute candidate extraction unit 12 and the memory unit 16.
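The tag-based extraction described above can be sketched as follows. This is a minimal illustration, not the implementation of the data acquisition unit 11: the function name `extract_tables` and the assumption that rows and cells are marked with `tr` and `td`/`th` elements are hypothetical, since the description only names the <table> tag as an example marker.

```python
import xml.etree.ElementTree as ET


def extract_tables(xml_text, table_tag="table"):
    """Return every table in the XML as a list of rows of cell strings.

    Assumption (not from the source): rows are <tr> elements and cells are
    <td> or <th> elements directly under each row.
    """
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter(table_tag):
        rows = []
        for tr in table.iter("tr"):
            cells = [
                "".join(cell.itertext()).strip()
                for cell in tr
                if cell.tag in ("td", "th")
            ]
            rows.append(cells)
        tables.append(rows)
    return tables


sample = (
    "<doc><p>Childcare services</p>"
    "<table>"
    "<tr><td>Facility</td><td>Fee</td></tr>"
    "<tr><td>Nursery A</td><td>500</td></tr>"
    "</table></doc>"
)
tables = extract_tables(sample)
```

Text outside the table element (here the `<p>` title) is ignored, which corresponds to discarding the unstructured part of the semi-structured data.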

The attribute candidate extraction unit 12 calculates the frequency with which each word appears in the data transferred from the data acquisition unit 11, that is, in the table-structure portion of the general-purpose data, and, from a histogram of those frequencies, extracts the words whose frequency has the highest count as attribute candidates. The attribute candidate extraction unit 12 transfers the extracted words to the attribute estimation unit 13.
Further, when the attribute candidates are not adopted by the attribute estimation unit 13 described later, the attribute candidate extraction unit 12 extracts the words whose frequency has the next-highest count as new attribute candidates and provides them to the attribute estimation unit 13.

The attribute estimation unit 13 obtains a similarity, such as the number of matching characters, between each attribute candidate word transferred from the attribute candidate extraction unit 12 and the words of the common vocabulary stored in the common vocabulary storage unit 18, and estimates that words whose similarity is equal to or greater than a threshold are attribute names. The common vocabulary is, for example, the glossary for unifying notation that is being developed by the Information-technology Promotion Agency (IPA).

The attribute estimation unit 13 also refers to the XML-format data stored in the memory unit 16 and, based on the table structure, updates the words estimated to be attribute names by a method described later.
Specifically, among the predetermined row or column combinations based on the format of the table structure, the attribute estimation unit 13 determines the words arranged in the combination with the highest occurrence rate of the words estimated to be attribute names to be the attribute names. Here, the format of the table structure is determined based on, for example, the frequency of appearance of the words extracted as attribute candidates.
The attribute estimation unit 13 transfers the words estimated to be attribute names to the structured data creation unit 14.
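One way to picture the selection of the row or column with the highest occurrence rate is the sketch below. It assumes, purely for illustration, that the candidate positions are the first row and the first column of a single table; the description only speaks of "predetermined row or column combinations", so this choice and the name `pick_header` are hypothetical.

```python
def pick_header(table, estimated_names):
    """Choose between the first row and the first column as the header.

    The position whose cells contain the larger share of words already
    estimated to be attribute names wins; ties go to the first row.
    """
    first_row = table[0]
    first_col = [row[0] for row in table if row]

    def rate(cells):
        # Fraction of cells that are estimated attribute names.
        return sum(c in estimated_names for c in cells) / len(cells) if cells else 0.0

    options = {"row": first_row, "column": first_col}
    best = max(options, key=lambda k: rate(options[k]))
    return best, options[best]


table = [
    ["Facility", "Age", "Fee"],
    ["Nursery A", "0-5", "500"],
    ["Nursery B", "1-5", "600"],
]
orientation, header = pick_header(table, {"Facility", "Fee"})
```

In this example "Age" was not among the estimated names, yet it ends up in the returned header because it sits in the winning row. This mirrors the update of the estimated attribute names based on the table structure.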

The structured data creation unit 14 places the words estimated to be attribute names, transferred from the attribute estimation unit 13, in the first row, treats the data in the general-purpose data stored in the memory unit 16 other than the attribute names as attribute values, arranges them in association with the attribute names, and thereby creates structured data. The structured data creation unit 14 transfers the structured data to the schema creation unit 15.
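The arrangement of attribute names in the first row can be illustrated with the following sketch. It assumes the header position ("row" or "column") has already been decided; the function names `build_structured_rows` and `to_csv` are hypothetical, and CSV is used only because the description mentions it as a format an RDB can handle.

```python
import csv
import io


def build_structured_rows(table, header_in="row"):
    """Return rows with the attribute names first, followed by value rows.

    If the attribute names run down the first column, the table is
    transposed so that they end up in the first row.
    """
    if header_in == "column":
        table = [list(col) for col in zip(*table)]
    header, values = table[0], table[1:]
    return [header] + values


def to_csv(rows):
    """Serialize rows as CSV text with the attribute names on the first line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = build_structured_rows(
    [["Name", "Nursery AA"], ["Address", "1 Main St"]], header_in="column"
)
csv_text = to_csv(rows)
```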

The schema creation unit 15 defines the words in the first row of the structured data transferred from the structured data creation unit 14 as attribute names, determines the data types by referring to the common vocabulary as necessary, and creates an RDB schema. The schema creation unit 15 outputs the created RDB schema and the structured data.
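As a rough illustration of schema creation, the sketch below turns a list of attribute names into a SQL CREATE TABLE statement. The mapping from attribute name to column type (`vocab_types`) stands in for the lookup in the common vocabulary and is an assumption, as are the function name, the table name, and the TEXT default.

```python
def create_table_sql(table_name, attribute_names, vocab_types=None):
    """Build a CREATE TABLE statement with one column per attribute name.

    vocab_types is a hypothetical mapping from attribute name to SQL type;
    names without an entry default to TEXT.
    """
    vocab_types = vocab_types or {}

    def quote(identifier):
        # Double-quote identifiers so names with spaces or non-ASCII text work.
        return '"' + identifier.replace('"', '""') + '"'

    columns = [f"{quote(name)} {vocab_types.get(name, 'TEXT')}" for name in attribute_names]
    return f"CREATE TABLE {quote(table_name)} ({', '.join(columns)});"


sql = create_table_sql("facilities", ["Facility", "Fee"], {"Fee": "INTEGER"})
```

The resulting statement can be passed to any RDB that accepts standard quoted identifiers.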

The memory unit 16 stores the general-purpose data created by the data acquisition unit 11 and provides it to the attribute estimation unit 13, the structured data creation unit 14, and other units.

The rule management unit 17 manages the rules for the automatic generation of structured data and schemas, for example, the acquisition URL, acquisition interval, and conversion rules for semi-structured data used by the data acquisition unit 11, and the threshold for similarity to the common vocabulary used by the attribute estimation unit 13.

The common vocabulary storage unit 18 stores the common vocabulary that serves as a template for the RDB schema. In addition to standardized terms, similar words may be registered in the common vocabulary by referring to dictionaries and the like on the Internet.

The setting unit 19 provides an interface through which an administrator sets the rules and the common vocabulary, and provides the entered rules to the rule management unit 17 and the common vocabulary to the common vocabulary storage unit 18.

Patterns of the semi-structured data that is input to the data processing device 1 are illustrated here.
Semi-structured data can be classified as follows according to the correspondence between the attribute names and attribute values it contains.

FIG. 2 is a diagram illustrating semi-structured data of the first pattern that is input to the data processing device 1 according to the present embodiment.
In this example, in the list on the second page, which follows the text on the first page, attribute names and attribute values have a one-to-many correspondence. For example, a plurality of attribute values ("Nursery A", "Nursery B", ...) are listed for the single attribute name "implementing facility".
However, although this data contains attribute names and attribute values, nothing defines which words are attribute names and which are attribute values, so it is semi-structured data in which attribute names and attribute values are not clearly distinguished.

FIG. 3 is a diagram illustrating semi-structured data of the second pattern that is input to the data processing device 1 according to the present embodiment.
In this example, in the plurality of tables that follow the title of the information, attribute names and attribute values have a one-to-one correspondence. For example, for the single attribute name "name" appearing in the first table from the top, one corresponding attribute value, "AA Nursery School", is listed. Similarly, the attribute name "name" appearing in the second table from the top is associated one-to-one with the attribute value "BB Nursery School".
However, although this data contains attribute names and attribute values, nothing defines which words are attribute names and which are attribute values, so it is semi-structured data in which attribute names and attribute values are not clearly distinguished.

The input semi-structured data can also be classified as follows according to the number of table structures contained in the data file and whether the tables share attribute names.
(Case 1) A single table structure is contained in one file (semi-structured data).
(Case 2) A plurality of table structures are contained in one file.
(Case 2-1) A plurality of table structures that share attribute names are contained in one file.
(Case 2-2) A plurality of table structures that do not share attribute names are contained in one file.

Next, the procedure by which the data processing device 1 creates structured data and a schema is described in detail for each table-structure pattern.

[Example 1]
Example 1 is the case in which semi-structured data of the first pattern (FIG. 2), where attribute names and attribute values are one-to-many, is input.
First, the data acquisition unit 11 extracts the table-structure portion from the semi-structured data.

FIG. 4 is a diagram illustrating general-purpose data for the semi-structured data of the first pattern according to the present embodiment.
It shows the table-structure portion extracted from the text-format and XML-format data in which attribute names and attribute values are described in the one-to-many pattern (FIG. 2).
In this example, the portion identified by the <table> tag in the XML-format data is extracted as the table-structure portion.
The text-format data may be created by removing the tags from the XML-format data.
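Deriving the text format by removing tags can be sketched with the Python standard library as follows. The function name `xml_to_text` and the choice to put each text fragment on its own line are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_text):
    """Drop all tags and return the remaining text, one fragment per line."""
    root = ET.fromstring(xml_text)
    fragments = (text.strip() for text in root.itertext())
    # Skip fragments that were only whitespace between elements.
    return "\n".join(fragment for fragment in fragments if fragment)


text = xml_to_text("<table><tr><td>Facility</td><td>Fee</td></tr></table>")
```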

Next, the attribute candidate extraction unit 12 extracts the words in the table structure and creates a histogram based on the frequency of appearance of each word.

FIG. 5 is a diagram illustrating a histogram for the semi-structured data of the first pattern according to the present embodiment.
Words such as "implementing facility" and "location / contact" are extracted from the table-structure portion of the general-purpose data (FIG. 4), and the frequency with which each word appears in the data is recorded (A).
Then, for each frequency (once, twice, ...), the number of words and the count (frequency × number of words) are tallied, and a histogram (B) is created.

In the pattern where attribute names and attribute values are one-to-many, an attribute name appears only once while attribute values often appear multiple times, so the number of words with a frequency of 1, and the corresponding count, become high. In this example, therefore, the attribute candidate extraction unit 12 extracts as attribute candidates the 46 words with a frequency of 1, for which the count is the highest at 46.
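The histogram and the selection of attribute candidates can be sketched as follows. The count for each frequency is computed as frequency × number of words, as in histogram (B). The function names and the `level` parameter, which models falling back to the next-highest count when the candidates are not adopted, are hypothetical, and the behavior when two frequencies have equal counts is an arbitrary choice here.

```python
from collections import Counter, defaultdict


def frequency_histogram(words):
    """Map each frequency to (words with that frequency, frequency * word count)."""
    occurrences = Counter(words)
    by_frequency = defaultdict(list)
    for word, frequency in occurrences.items():
        by_frequency[frequency].append(word)
    return {f: (ws, f * len(ws)) for f, ws in by_frequency.items()}


def attribute_candidates(words, level=0):
    """Return the words whose frequency has the (level+1)-th highest count."""
    histogram = frequency_histogram(words)
    ranked = sorted(histogram.items(), key=lambda item: item[1][1], reverse=True)
    if level >= len(ranked):
        return []  # No lower level left to fall back to.
    return ranked[level][1][0]


words = ["Facility", "Age", "Fee", "Yes", "Yes"]
first_choice = attribute_candidates(words)
fallback = attribute_candidates(words, level=1)
```

With this toy input, the three words that appear once give a count of 3 and the single word that appears twice gives a count of 2, so the words with a frequency of 1 are selected first.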

Subsequently, the attribute estimation unit 13 refers to the common vocabulary storage unit 18 and narrows the attribute candidates down to words having a high degree of similarity to the common vocabulary.

FIG. 6 illustrates the result of calculating the similarity between the attribute candidates and the common vocabulary for the semi-structured data of the first pattern according to the present embodiment.
In the figure, the words arranged horizontally, such as "implementation facility" and "location / contact", are the attribute candidates extracted by the attribute candidate extraction unit 12, and the words arranged vertically, such as "identification information" and "group code", are the common vocabulary.
In this example, the attribute estimation unit 13 calculates, for each attribute candidate, the number of characters in its character string that match characters of a common vocabulary term, and uses this number as the similarity. The similarity index is not limited to this and may be another index such as the Levenshtein distance.

The attribute estimation unit 13 estimates, for example, that a word with two or more matching characters is an attribute name. In this example, among the attribute candidates, "implementation facility", "target age", "implementation date and childcare hours", and "usage fee", each of which has a similarity of 2 or more to some common vocabulary term, are estimated to be attribute names.
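A minimal sketch of this character-matching similarity and the threshold of 2 is shown below. It assumes one reading of the description: the similarity of a candidate to a vocabulary term is the number of characters of the candidate that also occur in that term, and a candidate is kept if its best similarity over all terms reaches the threshold. The function names and the three-term vocabulary are hypothetical; the terms themselves are taken from the text.

```python
def char_overlap(candidate, vocab_term):
    """Count the characters of candidate that also occur in vocab_term."""
    vocab_chars = set(vocab_term)
    return sum(1 for ch in candidate if ch in vocab_chars)

def estimate_attribute_names(candidates, vocabulary, threshold=2):
    """Keep candidates whose best similarity to any term is >= threshold."""
    estimated = []
    for cand in candidates:
        best = max((char_overlap(cand, term) for term in vocabulary), default=0)
        if best >= threshold:
            estimated.append(cand)
    return estimated

# Hypothetical, very small common vocabulary.
vocabulary = ["識別情報", "団体コード", "サービス＿対象［利用可能年齢］"]
print(estimate_attribute_names(["対象年齢", "所在地・問合先"], vocabulary))
```

Swapping `char_overlap` for another index, such as the Levenshtein distance mentioned above, would only require changing the comparison direction, since a smaller distance means a closer match.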

Then, the attribute estimation unit 13 makes the final determination of attribute names and attribute values by using the characteristic of the table structure that attribute names and attribute values are one-to-many.

FIG. 7 shows a method of determining attribute names and attribute values in the semi-structured data of the first pattern according to the present embodiment.
In the semi-structured data of the first pattern, among the data A_kl in the table structure, the attribute names are described in either k = 1 (the first row) or l = 1 (the first column).
Therefore, the attribute estimation unit 13 determines as the attribute names whichever of the first row and the first column has the higher estimated attribute ratio, that is, the proportion of its data estimated to be attribute names. In this example, the word group in the first row (estimated attribute ratio = 80%) is finally determined to be the attribute names, and "location / contact" is newly added to the estimated attribute names.
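The comparison of the first row and the first column can be illustrated with the sketch below. It assumes the table is held as a list of rows and that the set of words already estimated to be attribute names is available; the function names and the sample table are hypothetical, and preferring the row on a tie is an assumption the description does not address.

```python
def estimated_attribute_ratio(cells, estimated):
    """Proportion of cells that were estimated to be attribute names."""
    if not cells:
        return 0.0
    return sum(1 for c in cells if c in estimated) / len(cells)

def pick_header_first_pattern(table, estimated):
    """Choose the first row or first column, whichever has the higher ratio."""
    first_row = table[0]
    first_col = [row[0] for row in table]
    row_ratio = estimated_attribute_ratio(first_row, estimated)
    col_ratio = estimated_attribute_ratio(first_col, estimated)
    # Assumption: the first row wins a tie.
    if row_ratio >= col_ratio:
        return "row", first_row, row_ratio
    return "column", first_col, col_ratio

table = [
    ["facility", "location", "age", "hours", "fee"],
    ["A", "x", "0-5", "9-17", "100"],
    ["B", "y", "1-5", "8-18", "200"],
]
estimated = {"facility", "age", "hours", "fee"}
print(pick_header_first_pattern(table, estimated))
```

In this sample, four of the five cells in the first row were estimated, so the whole row is taken as the attribute names, which is how a word such as "location" that missed the similarity threshold is still recovered.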

FIG. 8 shows an output example of the structural data and the schema for the semi-structured data of the first pattern according to the present embodiment.
The structural data creation unit 14 places the words determined to be attribute names in the first row and arranges the data remaining after removing the attribute names from the general-purpose data as attribute values in association with the attribute names. The correspondence between attribute names and attribute values is determined based on the tags in the XML data.

Further, the schema creation unit 15 creates an RDB schema by referring to the data formats defined in the common vocabulary in order to define the attribute names placed in the first row of the structural data.
Specifically, the data format of the common vocabulary term calculated to have a high similarity may be adopted as the data format of an attribute name. For example, as shown in FIG. 6, "target age" has a high similarity to the common vocabulary term "service_target [available age]". Therefore, "character string", which is the data format of "service_target [available age]", is adopted. When an attribute name has a high similarity to several common vocabulary terms, as "implementation facility" does, the data format of the term with the highest similarity may be adopted; if several terms share the highest similarity, the data format held by the majority of them may be adopted. If the common vocabulary contains no similar word, a specific data format (for example, character string) may be adopted as a default.
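The selection rule for the data format (highest similarity, majority vote among ties, default otherwise) might be sketched as follows. The names `choose_data_format`, `min_similarity`, and the toy vocabulary are hypothetical, and the character-overlap similarity is the same assumed reading used for attribute estimation; treating a best similarity below `min_similarity` as "no similar word" is likewise an assumption.

```python
from collections import Counter

def char_overlap(candidate, term):
    """Count the characters of candidate that also occur in term."""
    term_chars = set(term)
    return sum(1 for ch in candidate if ch in term_chars)

def choose_data_format(attribute, vocabulary_formats,
                       min_similarity=1, default="string"):
    """Pick a data format for attribute from a {term: format} vocabulary."""
    scores = {t: char_overlap(attribute, t) for t in vocabulary_formats}
    best = max(scores.values(), default=0)
    if best < min_similarity:
        return default  # no similar term in the common vocabulary
    # Formats of all terms sharing the highest similarity.
    tied = [vocabulary_formats[t] for t, s in scores.items() if s == best]
    # Majority vote among the tied formats.
    return Counter(tied).most_common(1)[0][0]

# Toy vocabulary with made-up terms, for illustration only.
formats = {"abc": "integer", "abd": "string", "abe": "string"}
print(choose_data_format("abc", formats))  # single best match
print(choose_data_format("ab", formats))   # three-way tie, majority decides
print(choose_data_format("zz", formats))   # no similar term, default
```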

[Example 2]
Example 2 is the case where semi-structured data of the second pattern (FIG. 3), in which attribute names and attribute values are one-to-one, is input.

FIG. 9 illustrates general-purpose data for the semi-structured data of the second pattern according to the present embodiment.
The figure shows data obtained by extracting the table-structure portion from text-format and XML-format data in which attribute names and attribute values are described in a one-to-one pattern (FIG. 3).

FIG. 10 illustrates a histogram for the semi-structured data of the second pattern according to the present embodiment.
Words such as "facility information" and "group code" are extracted from the table-structure portion (FIG. 9) of the general-purpose data, and the frequency with which each word appears in the data is recorded (A).
Then, for each frequency (once, twice, and so on), the number of words and the count (frequency × number of words) are totaled, and a histogram (B) is created.

In the one-to-one pattern, an attribute name often appears n times, the same as the number of table structures, so the number of words with a particular frequency n of 2 or more, and their count, is high. Therefore, in this example, the attribute candidate extraction unit 12 extracts as attribute candidates the 14 words of frequency 4, which has the highest count (frequency × number of words) of 56.

FIG. 11 illustrates the result of calculating the similarity between the attribute candidates and the common vocabulary for the semi-structured data of the second pattern according to the present embodiment.
In this example, as in Example 1, the attribute estimation unit 13 calculates, for each attribute candidate, the number of characters in its character string that match characters of a common vocabulary term, and uses this number as the similarity.
The attribute estimation unit 13 estimates, for example, that a word with two or more matching characters is an attribute name. In this example, among the attribute candidates, "group code", "group name", "type", "installer", "name", "address", "phone number", "acceptance age", "capacity", "temporary custody", "school lunch", and "allergic food", each of which has a similarity of 2 or more to some common vocabulary term, are estimated to be attribute names.

FIG. 12 shows a method of determining attribute names and attribute values in the semi-structured data of the second pattern according to the present embodiment.
In the semi-structured data of the second pattern, among the data A_kl in the table structure, the attribute names are described in one of the following: the rows with odd k, the rows with even k, the columns with odd l, or the columns with even l.
Therefore, the attribute estimation unit 13 determines as the attribute names whichever combination of rows or columns (odd-numbered rows, even-numbered rows, odd-numbered columns, or even-numbered columns) has the highest estimated attribute ratio, that is, the proportion of its data estimated to be attribute names. In this example, the word group in the odd-numbered columns (estimated attribute ratio = 86%) is finally determined to be the attribute names.
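The four-way comparison for the second pattern can be sketched in the same style. This is an illustrative sketch under the assumption that the table is a list of rows with 1-based indices k (row) and l (column) as in the text; the function names and the sample table are hypothetical, and ties are resolved by the listing order of the groups.

```python
def parity_groups(table):
    """Collect the cells of odd/even rows and odd/even columns (1-based)."""
    groups = {"odd rows": [], "even rows": [],
              "odd columns": [], "even columns": []}
    for k, row in enumerate(table, start=1):
        for l, cell in enumerate(row, start=1):
            groups["odd rows" if k % 2 else "even rows"].append(cell)
            groups["odd columns" if l % 2 else "even columns"].append(cell)
    return groups

def pick_header_second_pattern(table, estimated):
    """Return the group with the highest estimated attribute ratio."""
    ratios = {}
    for name, cells in parity_groups(table).items():
        hits = sum(1 for c in cells if c in estimated)
        ratios[name] = hits / len(cells) if cells else 0.0
    best = max(ratios, key=ratios.get)
    return best, ratios[best]

table = [
    ["name", "A nursery", "address", "1-2-3"],
    ["phone", "000", "capacity", "30"],
]
estimated = {"name", "address", "phone"}
print(pick_header_second_pattern(table, estimated))
```

Here "capacity" was not estimated by similarity, yet it sits in the winning odd-numbered columns and would therefore still be treated as an attribute name.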

FIG. 13 shows an output example of the structural data and the schema for the semi-structured data of the second pattern according to the present embodiment.
As in Example 1, the structural data creation unit 14 places the words determined to be attribute names in the first row and arranges the data remaining after removing the attribute names from the general-purpose data as attribute values in association with the attribute names.
Further, as in Example 1, the schema creation unit 15 creates an RDB schema by referring to the data formats defined in the common vocabulary in order to define the attribute names placed in the first row of the structural data.
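For the second pattern, arranging attribute values under their attribute names could look like the sketch below. It assumes the attribute names were found in the odd-numbered columns and that each name is paired with the cell immediately to its right; the embodiment determines the correspondence from the XML tags, so this pairing rule, like the function names, is only an illustrative assumption.

```python
def table_to_record(table):
    """Pair each odd-column name with the even-column value to its right."""
    record = {}
    for row in table:
        # 0-based indices 0, 2, 4, ... are the 1-based odd columns.
        for l in range(0, len(row) - 1, 2):
            record[row[l]] = row[l + 1]
    return record

def records_to_structured(records):
    """Build structural data: attribute names first, then one row per record."""
    header = list(records[0].keys())
    rows = [[rec.get(name, "") for name in header] for rec in records]
    return [header] + rows

t1 = [["name", "A", "address", "X"], ["phone", "1", "capacity", "30"]]
t2 = [["name", "B", "address", "Y"], ["phone", "2", "capacity", "40"]]
structured = records_to_structured([table_to_record(t1), table_to_record(t2)])
for row in structured:
    print(row)
```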

FIGS. 14 to 16 are flowcharts showing the process of creating the structural data and the schema according to the present embodiment.

In step S1, the data acquisition unit 11 acquires semi-structured data from data resources, including those on the Internet.

In step S2, the data acquisition unit 11 converts the acquired data into general-purpose data in text format and XML format.

In step S3, the data acquisition unit 11 determines whether or not the converted XML-format data includes a table structure. If the determination is YES, the process proceeds to step S4; if NO, the process ends.

In step S4, the data acquisition unit 11 extracts only the table-structure portions from the text-format and XML-format general-purpose data. At this time, a plurality (n) of table structures may be extracted. In that case, the data acquisition unit 11 keeps the table structures separate from one another.
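Steps S3 and S4 can be illustrated with the standard-library XML parser. This sketch assumes well-formed XML with lowercase `table`, `tr`, and `td` tags and no nested tables; the actual tag names depend on the converter used in step S2, and the function name `extract_tables` is hypothetical.

```python
import xml.etree.ElementTree as ET

def extract_tables(xml_text):
    """Return every table in the document as a list of rows of cell text.

    An empty result corresponds to the NO branch of step S3.
    """
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("table"):
        rows = []
        for tr in table.iter("tr"):
            # Join all text inside each cell and trim surrounding whitespace.
            rows.append(["".join(td.itertext()).strip() for td in tr.iter("td")])
        tables.append(rows)  # each table structure is kept separate
    return tables

xml_text = (
    "<doc><title>Nursery list</title>"
    "<table><tr><td>name</td><td>age</td></tr>"
    "<tr><td>A</td><td>0-5</td></tr></table>"
    "<table><tr><td>fee</td></tr></table></doc>"
)
tables = extract_tables(xml_text)
print(len(tables), tables)
```

Unstructured parts such as the title fall outside the `table` elements and are dropped at this stage.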

In step S5, the data acquisition unit 11 stores the conversion result of step S2 and the individual table-structure portions extracted in step S4 in the memory unit 16, and transfers them to the attribute candidate extraction unit 12.

In step S6, the attribute candidate extraction unit 12 extracts words from the table-structure data, calculates the frequency with which each word appears, and creates a histogram based on these frequencies.

In step S7, the attribute candidate extraction unit 12 extracts the words whose frequency has the highest count in the created histogram, and transfers the extracted words to the attribute estimation unit 13 as attribute candidates.

In step S8, the attribute estimation unit 13 calculates the similarity between each extracted word and each word in the common vocabulary storage unit 18.

In step S9, the attribute estimation unit 13 estimates that words whose calculated similarity is equal to or greater than a threshold are attribute names, and extracts them.

In step S10, the data processing device 1 determines whether or not any attribute name has been extracted by the attribute estimation unit 13. If the determination is YES, the process proceeds to step S12; if NO, the process proceeds to step S11.

In step S11, since no word among the attribute candidates is estimated to be an attribute name, the data processing device 1 excludes the words already examined and returns the process to step S7, where the words whose frequency has the next highest count are extracted as attribute candidates.
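The loop formed by steps S7 to S11 can be sketched as trying frequencies in descending order of count until some candidate passes the similarity check. The names below are hypothetical, and `is_attribute_name` stands in for the similarity test of steps S8 and S9; returning `(None, [])` when every frequency is exhausted is an assumption, since the flowchart description does not cover that case.

```python
from collections import Counter

def candidates_by_descending_count(words):
    """Yield (frequency, words) groups, highest count first."""
    word_freq = Counter(words)
    words_per_freq = Counter(word_freq.values())
    ordered = sorted(words_per_freq,
                     key=lambda f: f * words_per_freq[f], reverse=True)
    for f in ordered:
        yield f, [w for w, n in word_freq.items() if n == f]

def find_attribute_names(words, is_attribute_name):
    """Steps S7 to S11: fall back to the next group when none qualifies."""
    for freq, candidates in candidates_by_descending_count(words):
        names = [w for w in candidates if is_attribute_name(w)]
        if names:            # step S10: YES
            return freq, names
        # step S10: NO, so this group is dropped and the next one is tried
    return None, []

words = ["a", "b", "c", "d", "e", "name", "name", "addr", "addr"]
known = {"name", "addr"}
print(find_attribute_names(words, lambda w: w in known))
```

In this sample the frequency-1 group has the highest count (5) but contains no attribute name, so the frequency-2 group (count 4) is examined next.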

In step S12, the data processing device 1 initializes the index i to 1 in order to read out the table structures extracted in step S4 one by one and create structural data.

In step S13, the data processing device 1 determines whether or not the index i exceeds the number n of table structures. If the determination is YES, the process proceeds to step S20; if NO, the process proceeds to step S14.

In step S14, the attribute estimation unit 13 determines whether or not the frequency with the highest count, as determined in step S7, is 1, and classifies the pattern of the target table structure accordingly. If the determination is YES, the table structure is estimated to be the first pattern, in which attribute names and attribute values are one-to-many, and the process proceeds to step S15. If the determination is NO, the table structure is estimated to be the second pattern, in which attribute names and attribute values are one-to-one, and the process proceeds to step S16.

In step S15, since the table structure is the first pattern, the attribute estimation unit 13 compares the first row with the first column and determines the one with the higher estimated attribute ratio to be the attribute names. The attribute estimation unit 13 transfers the words determined to be attribute names to the structural data creation unit 14.

In step S16, since the table structure is the second pattern, the attribute estimation unit 13 compares the odd-numbered rows, even-numbered rows, odd-numbered columns, and even-numbered columns, and determines the one with the highest estimated attribute ratio to be the attribute names. The attribute estimation unit 13 transfers the words determined to be attribute names to the structural data creation unit 14.

In step S17, the structural data creation unit 14 places the words determined to be attribute names in the first row and creates structural data in a predetermined format compatible with an RDB.

In step S18, the structural data creation unit 14 determines that the data remaining after removing the attribute names from the general-purpose data stored in the memory unit 16 are attribute values, and arranges them in the structural data being created. The structural data creation unit 14 transfers the created structural data to the schema creation unit 15.

In step S19, the data processing device 1 increments the index i, and the process returns to step S13.

In step S20, the structural data creation unit 14 determines whether or not the number n of table structures extracted in step S4 is 1, and thereby classifies the semi-structured data to be processed into case 1 (a single table) or case 2 (multiple tables) described above. If the determination is YES, the data falls under case 1 and the process proceeds to step S23. If the determination is NO, the data falls under case 2 and the process proceeds to step S21.

In step S21, the structural data creation unit 14 determines whether or not the attribute names of the structural data (tables) created for the n table structures are common. If the determination is YES, the semi-structured data corresponds to case 2-1 described above, and the process proceeds to step S22. If the determination is NO, the semi-structured data corresponds to case 2-2 described above; the structural data creation unit 14 judges that the structural data are independent of one another, and the process proceeds to step S23.

In step S22, the structural data creation unit 14 places the common attribute names in the first row, arranges the attribute values in order in the second and subsequent rows based on the tags in the XML data (for example, <TR> and <TD>), and integrates the n pieces of structural data.
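The case distinction of steps S20 to S22 can be sketched as follows. The sketch assumes each piece of structural data is a list of rows whose first row holds the attribute names, and that "common" means identical name lists in the same order; the description does not define the comparison in detail, and the function name is hypothetical.

```python
def integrate_tables(tables):
    """Merge tables that share attribute names; otherwise keep them apart."""
    if len(tables) <= 1:
        return list(tables)                      # case 1: single table
    header = tables[0][0]
    if all(t[0] == header for t in tables):      # case 2-1: common names
        merged = [header]
        for t in tables:
            merged.extend(t[1:])                 # values from row 2 onward
        return [merged]
    return list(tables)                          # case 2-2: independent

t1 = [["name", "age"], ["A", "1"]]
t2 = [["name", "age"], ["B", "2"]]
t3 = [["fee"], ["100"]]
print(integrate_tables([t1, t2]))
print(integrate_tables([t1, t3]))
```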

In step S23, the schema creation unit 15 takes the first row of the structural data as the attribute names and creates a schema by referring to the data formats of the common vocabulary.

In step S24, the schema creation unit 15 outputs the created structural data and schema.

According to the present embodiment, the data processing device 1 calculates the frequency with which each word appears in the table-structure portion of the semi-structured data, and extracts as attribute candidates the words whose frequency has the highest count in the histogram of these frequencies. Then, among predetermined combinations of rows or columns based on the format of the table structure, the data processing device 1 determines the words arranged in the combination with the highest occurrence rate of attribute candidates to be the attribute names.
As a result, the data processing device 1 can automatically determine whether a word included in the table structure is an attribute name or an attribute value, and can automatically create structural data and a schema from semi-structured data in which attribute names and attribute values are not clearly distinguished.
Consequently, semi-structured data that has no schema and in which attribute names and attribute values are written but not clearly distinguished can be automatically imported into an RDB and used for analysis.

By adopting only the attribute candidates whose similarity to a standard word in the common vocabulary storage unit 18 is equal to or greater than a threshold, the data processing device 1 can extract more reliable candidates for attribute names and improve the accuracy of distinguishing attribute names from attribute values.
In doing so, if the data processing device 1 cannot obtain any attribute candidate similar to the common vocabulary, it newly extracts as attribute candidates the words whose frequency has the next lower count in the histogram.
As a result, even if the frequency-based estimation of attribute names is wrong, the data processing device 1 can correct it and determine the attribute names.

The data processing device 1 determines the format of the table structure, such as whether attribute names and attribute values are one-to-many or one-to-one, based on the frequency of appearance of the words extracted from the table structure as attribute candidates.
Therefore, the data processing device 1 can automatically extract the characteristics of the table structure from the histogram, and can separate attribute names from attribute values without restricting the table-structure formats that can be processed.

The data processing device 1 converts the semi-structured data into general-purpose data including the XML format, and extracts the table-structure portion by means of a specific tag in the XML format.
As a result, by using the XML tags, the data processing device 1 can efficiently extract the table-structure portion and can also easily grasp the correspondence between attribute names and attribute values to create the structural data.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. In addition, the effects described in the embodiment are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment.

In the embodiment described above, standard attribute names are stored in advance in the common vocabulary storage unit 18 as the common vocabulary, but the way the common vocabulary is referenced is not limited to this. For example, an external database may be referenced.
In addition, an existing schema may be used as the common vocabulary.

The data processing method performed by the data processing device 1 is realized by software. When it is realized by software, the programs constituting the software are installed in an information processing device (computer). These programs may be recorded on a removable medium such as a CD-ROM and distributed to users, or may be distributed by being downloaded to a user's computer via a network. Further, these programs may be provided to a user's computer as a Web service via a network without being downloaded.

1 Data processing device
11 Data acquisition unit
12 Attribute candidate extraction unit
13 Attribute estimation unit
14 Structural data creation unit
15 Schema creation unit
16 Memory unit
17 Rule management unit
18 Common vocabulary storage unit
19 Setting unit

Claims

A data processing device comprising:
a data acquisition unit that converts semi-structured data into general-purpose data and extracts a table-structure portion from the general-purpose data;
an attribute candidate extraction unit that calculates the frequency with which each word included in the table-structure portion appears and extracts, as attribute candidates, the words of the frequency having the highest count in a histogram of the frequencies;
an attribute estimation unit that determines, among predetermined combinations of rows or columns based on the format of the table structure, the words arranged in the combination with the highest occurrence rate of the attribute candidates to be attribute names;
a structural data creation unit that creates structural data by associating the general-purpose data excluding the attribute names, as attribute values, with the attribute names; and
a schema creation unit that creates a schema in which the attribute names are defined.

The data processing device according to claim 1, wherein the attribute estimation unit employs only the attribute candidates whose similarity with words existing in a predetermined common vocabulary is equal to or higher than a threshold value.

The data processing device according to claim 2, wherein the attribute candidate extraction unit extracts a word having a frequency one step lower as the attribute candidate when the attribute candidate is not adopted by the attribute estimation unit.

The data processing device according to any one of claims 1 to 3, wherein the attribute estimation unit determines the format of the table structure based on the frequency of words extracted as the attribute candidates.

The data processing device according to any one of claims 1 to 4, wherein the data acquisition unit converts the semi-structured data into the general-purpose data including the XML format, and extracts the table-structure portion by a specific tag in the XML format.

A data processing method in which a computer executes:
a data acquisition step of converting semi-structured data into general-purpose data and extracting a table-structure portion from the general-purpose data;
an attribute candidate extraction step of calculating the frequency with which each word included in the table-structure portion appears and extracting, as attribute candidates, the words of the frequency having the highest count in a histogram of the frequencies;
an attribute estimation step of determining, among predetermined combinations of rows or columns based on the format of the table structure, the words arranged in the combination with the highest occurrence rate of the attribute candidates to be attribute names;
a structural data creation step of creating structural data by associating the general-purpose data excluding the attribute names, as attribute values, with the attribute names; and
a schema creation step of creating a schema in which the attribute names are defined.

A data processing program for causing a computer to execute:
a data acquisition step of converting semi-structured data into general-purpose data and extracting a table-structure portion from the general-purpose data;
an attribute candidate extraction step of calculating the frequency with which each word included in the table-structure portion appears and extracting, as attribute candidates, the words of the frequency having the highest count in a histogram of the frequencies;
an attribute estimation step of determining, among predetermined combinations of rows or columns based on the format of the table structure, the words arranged in the combination with the highest occurrence rate of the attribute candidates to be attribute names;
a structural data creation step of creating structural data by associating the general-purpose data excluding the attribute names, as attribute values, with the attribute names; and
a schema creation step of creating a schema in which the attribute names are defined.