JP2013120534A

JP2013120534A - Related word classification device, computer program, and method for classifying related word

Info

Publication number: JP2013120534A
Application number: JP2011269000A
Authority: JP
Inventors: Koichi Tanigaki (谷垣 宏一); Mitsuteru Shiba (柴 光輝); Atsushi Ishii (石井 篤); Shigenobu Takayama (高山 茂伸)
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2011-12-08
Filing date: 2011-12-08
Publication date: 2013-06-17

Abstract

PROBLEM TO BE SOLVED: To extract related words while distinguishing words that are deliberately used for different purposes, even when the words to be classified come from structured data such as database definitions. SOLUTION: A word classification unit 60 optimizes model parameters for discriminating the class of a word on the basis of a provisional correct class assigned to each word appearing in the data. The word classification unit 60 then uses the class likelihoods obtained with the current model parameters to select, from among the classes that satisfy a constraint on class assignment, the maximum-likelihood class, and assigns it to the word as its new provisional correct class. After alternately repeating these two steps, the word classification unit 60 extracts words classified into the same class as related words.

Description

The present invention relates to a related word classification device that classifies a plurality of words into groups of related words.

One known technique extracts related words by clustering words on the basis of the similarity between their feature vectors. The elements of a word's feature vector are, for example, the occurrence frequencies of context words that co-occur with the word. The similarity between feature vectors is, for example, their inner product. The clustering method is, for example, the k-means method, a hierarchical clustering method, or a combination of several clustering methods.
Another known technique uses a synonym dictionary to generate many synonym pairs (two words that are synonyms) and non-synonym pairs (two words that are not), and trains a supervised learning model such as a support vector machine (SVM) on the generated pairs, thereby forming a classifier that identifies whether two words are synonyms.
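As an illustration of the first background technique, the following minimal sketch builds co-occurrence count vectors and compares them by inner product. The function names, the one-token context window, and the toy sentences are assumptions made for illustration and do not come from the cited literature.

```python
from collections import Counter


def cooccurrence_vector(word, sentences, window=1):
    """Count the context words that appear within `window` tokens of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != word:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def inner_product(u, v):
    """Inner product of two sparse count vectors stored as Counters."""
    return sum(u[k] * v[k] for k in u if k in v)


# Hypothetical toy corpus.
sentences = [
    ["buy", "cheap", "car"],
    ["buy", "cheap", "automobile"],
    ["eat", "fresh", "apple"],
]
car = cooccurrence_vector("car", sentences)
automobile = cooccurrence_vector("automobile", sentences)
apple = cooccurrence_vector("apple", sentences)
```

Here "car" and "automobile" share the context word "cheap", so their inner product is non-zero, while "car" and "apple" share no context word.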

JP 2011-118526 A

Bellegarda, J. R., et al., “A novel word clustering algorithm based on latent semantic analysis,” Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-96, Vol. 1, pp. 172-175, 1996.
Akiko Aizawa, “Consideration on Calculation of Word Similarity Using Large-scale Text Corpus,” IPSJ Journal, Vol. 49, No. 3, pp. 1426-1436, March 2008.
“WordNet: A lexical database for English,” http://wordnet.princeton.edu/wordnet/
Rubin, D. B., “Iteratively reweighted least squares,” Encyclopedia of Statistical Sciences, Vol. 4, pp. 272-275, Wiley, 1983.
Bishop, C. M., and I. T. Nabney, “Pattern Recognition and Machine Learning: A Matlab Companion,” Springer, 1998.

In a database, the names of tables and columns are chosen freely by the database designer. For this reason, when different databases are integrated, for example, identifying which columns correspond to each other takes a great deal of effort.
Table names and column names usually use words derived from natural language. Therefore, if columns whose names have similar meanings can be extracted by applying a related word extraction technique such as those described above, this can help identify corresponding columns.
However, applying the above techniques to table schemas that define database tables, and to other structured data, raises the following problems.
First, in structured data the contextual connections between words are sparse, so feature vectors built from context may fail to extract related words properly. As a result, words that should be treated as distinct cannot be told apart from words that need not be, and may be extracted as related words.
Second, the synonym-dictionary method requires that the words contained in the target structured data be sufficiently registered in the synonym dictionary in advance; when no such synonym dictionary exists, the method cannot be applied.

An object of the present invention is to extract related words while distinguishing words that are deliberately used differently, even when the target is structured data such as database definitions.

The related word classification device according to the present invention is
a related word classification device that classifies a plurality of words into groups of related words, comprising
a processing device that processes data, a text acquisition unit, a feature vector calculation unit, a word classification unit, and a weight vector calculation unit, wherein
the text acquisition unit uses the processing device to acquire a plurality of texts, each including one or more word groups each consisting of one or more words,
the feature vector calculation unit uses the processing device to calculate, for each of the plurality of words included in the plurality of texts acquired by the text acquisition unit, a feature vector representing the features of the word,
the word classification unit uses the processing device to classify each of the plurality of words included in the plurality of texts acquired by the text acquisition unit into one of a plurality of groups,
the weight vector calculation unit uses the processing device to calculate, for each of the plurality of groups, a weight vector whose inner product with a feature vector represents the likelihood that the word characterized by that feature vector belongs to the group, the weight vector being the one that best fits the classification made by the word classification unit, and
the word classification unit further uses the processing device to calculate, on the basis of the feature vector calculated by the feature vector calculation unit for each of the plurality of words and the weight vector calculated by the weight vector calculation unit for each of the plurality of groups, the likelihood that each of the plurality of words belongs to each of the plurality of groups, and reclassifies the plurality of words on the basis of the calculated likelihoods.
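The alternation described here, fitting one weight vector per group to the current classification and then reclassifying every word by likelihood, can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the softmax normalization, the plain gradient-ascent fitting, the step counts, and all function names are choices made for the example, and the constraint on class assignment mentioned in the abstract is omitted.

```python
import math


def softmax(scores):
    """Normalize raw scores into values that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def group_likelihoods(weights, x):
    """Inner product of each group's weight vector with feature vector x.

    Passing the inner products through softmax is an assumption of this sketch.
    """
    return softmax([sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights])


def fit_weights(features, labels, num_groups, steps=200, lr=0.1):
    """Fit weight vectors to the current (provisional) labels by gradient ascent."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_groups)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(num_groups)]
        for x, y in zip(features, labels):
            p = group_likelihoods(weights, x)
            for k in range(num_groups):
                err = (1.0 if k == y else 0.0) - p[k]
                for j in range(dim):
                    grads[k][j] += err * x[j]
        for k in range(num_groups):
            for j in range(dim):
                weights[k][j] += lr * grads[k][j]
    return weights


def most_likely_group(weights, x):
    """Index of the group with the highest likelihood for feature vector x."""
    p = group_likelihoods(weights, x)
    return max(range(len(p)), key=lambda k: p[k])


def classify(features, labels, num_groups, rounds=5):
    """Alternate between fitting weights and reclassifying the words."""
    for _ in range(rounds):
        weights = fit_weights(features, labels, num_groups)
        labels = [most_likely_group(weights, x) for x in features]
    return labels


# Hypothetical data: three words share one feature, three share another;
# the third word starts in the wrong group.
features = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
initial_labels = [0, 0, 1, 1, 1, 1]
final_labels = classify(features, initial_labels, num_groups=2)
```

In this toy run the third word is moved to the group of the two words that share its feature vector, because the fitted weights follow the majority of the provisional labels.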

According to the present invention, related words can be extracted while distinguishing words that are deliberately used differently, even when the target is structured data such as database definitions.

FIG. 1 is a system configuration diagram showing an example of the overall configuration of a related column extraction system 900 in Embodiment 1.
FIG. 2 is a hardware configuration diagram showing an example of the hardware resources of a related column determination device 903 and other devices in Embodiment 1.
FIG. 3 is a block configuration diagram showing an example of the configuration of the related column determination device 903 in Embodiment 1.
FIG. 4 is a diagram showing an example of related word extraction source data 10 in Embodiment 1.
FIG. 5 is a diagram showing an example of concept dictionary data 30 in Embodiment 1.
FIG. 6 is a diagram showing an example of the hierarchical structure among the concepts represented by the concept dictionary data 30 in Embodiment 1.
FIG. 7 is a diagram showing an example of the hierarchical structure among the concepts represented by the concept dictionary data 30 in Embodiment 1.
FIG. 8 is a diagram for explaining an example of the operation of a word extraction unit 20 in Embodiment 1.
FIG. 9 is a diagram for explaining an example of the operation of a context analysis unit 50 in Embodiment 1.
FIG. 10 is a diagram for explaining an example of the operation of a feature vector generation unit 40 in Embodiment 1.
FIG. 11 is a diagram for explaining an example of the operation of a word classification unit 60 in Embodiment 1.
FIG. 12 is a diagram for explaining an example of the operation of the word classification unit 60 in Embodiment 1.
FIG. 13 is a diagram for explaining an example of the operation of the word classification unit 60 in Embodiment 1.
FIG. 14 is a diagram for explaining an example of the operation of the word classification unit 60 in Embodiment 1.
FIG. 15 is a diagram for explaining an example of the operation of a related word determination unit 70 in Embodiment 1.
FIG. 16 is a flowchart showing an example of the flow of related extraction processing S800 in Embodiment 1.
FIG. 17 is a flowchart showing an example of the flow of word classification processing S804 in Embodiment 1.
FIG. 18 is a flowchart showing an example of the flow of an initialization step S810 in Embodiment 1.
FIG. 19 is a flowchart showing an example of the flow of a classification calculation step S830 in Embodiment 1.
FIG. 20 is a diagram for explaining an example of the operation of the word classification unit 60 in the classification calculation step S830 in Embodiment 1.
FIG. 21 is a diagram for explaining an example of the operation of the context analysis unit 50 in Embodiment 2.

Embodiment 1.
Embodiment 1 will be described with reference to FIGS. 1 to 20.

FIG. 1 is a system configuration diagram showing an example of the overall configuration of the related column extraction system 900 in this embodiment.

The related column extraction system 900 extracts, from among the columns that make up the tables of one or more databases, a plurality of columns that are likely to be related columns. Related columns are columns that belong to different tables and store corresponding content. The related column extraction system 900 extracts columns that are likely to be related columns on the basis of the names given to the columns (column names).
The related column extraction system 900 includes, for example, a database definition storage device 901, a concept dictionary storage device 902, and a related column determination device 903.

The database definition storage device 901 stores database definition data that defines a database.
The concept dictionary storage device 902 stores concept dictionary data representing a concept dictionary. A concept dictionary is a dictionary that describes the meanings of words in a natural language and the semantic relationships between words (synonyms, near-synonyms, antonyms, superordinate and subordinate concepts, and so on). The concept dictionary data stored in the concept dictionary storage device 902 represents, for example, a general-purpose large-vocabulary concept dictionary.
The related column determination device 903 (an example of a related word classification device) acquires column names from the database definition data stored in the database definition storage device 901 and, using the concept dictionary data stored in the concept dictionary storage device 902, extracts columns that are likely to be related columns. The related column determination device 903, for example, displays the extracted columns and presents them to the user of the related column extraction system 900.
The user of the related column extraction system 900 checks the contents of the columns presented by the related column determination device 903 and judges whether or not they are related columns.

FIG. 2 is a hardware configuration diagram showing an example of the hardware resources of the related column determination device 903 and other devices in this embodiment.

The database definition storage device 901, the concept dictionary storage device 902, and the related column determination device 903 are each, for example, a computer having a processing device 911, an input device 912, an output device 913, and a storage device 914.
The processing device 911 processes data and controls the computer as a whole by executing a computer program stored in the storage device 914. The computer may have a plurality of processing devices 911.
The storage device 914 stores the computer program executed by the processing device 911, the data processed by the processing device 911, and the like. For example, the storage device 914 is a volatile storage device that stores temporary data during processing. Alternatively, the storage device 914 is a non-volatile storage device that stores computer programs, persistent data, and the like. Also, for example, the storage device 914 is an integrated circuit or large-scale integrated circuit in which semiconductor memory elements are integrated. Alternatively, the storage device 914 is an external storage device that uses a storage medium such as a magnetic disk or an optical disk. The computer may have a plurality of storage devices 914. The plurality of storage devices 914 may be of the same type or of different types.
The input device 912 takes in information from outside the computer and converts it into data that the processing device 911 can process. For example, the input device 912 is an operation input device such as a keyboard or a mouse that accepts operations by an operator. Alternatively, the input device 912 is an image input device such as a camera or a scanner that inputs still images or moving images. Alternatively, the input device 912 is a voice input device such as a microphone. Alternatively, the input device 912 is a sensor device such as a temperature sensor, an illuminance sensor, or a pressure sensor that measures a physical quantity such as temperature, brightness, or pressure. Alternatively, the input device 912 is an analog-to-digital converter that converts an analog signal into digital data. Alternatively, the input device 912 is a receiving device that receives and demodulates a signal transmitted to the computer by an external device. The computer may have a plurality of input devices 912. The plurality of input devices 912 may be of the same type or of different types. The data converted by the input device 912 may be processed directly by the processing device 911 or may be stored temporarily in the storage device 914.
The output device 913 converts data processed by the processing device 911, data stored in the storage device 914, and the like, and outputs the result to the outside of the computer. For example, the output device 913 is a display device that displays text, images, and the like. Alternatively, the output device 913 is a printing device that prints text, images, and the like. Alternatively, the output device 913 is an audio output device such as a speaker. Alternatively, the output device 913 is a digital-to-analog converter that converts digital data into an analog signal. Alternatively, the output device 913 is a transmitting device that transmits a modulated signal to an external device. The computer may have a plurality of output devices 913. The plurality of output devices 913 may be of the same type or of different types.

The functional blocks described below can be realized by the processing device 911 executing a computer program stored in the storage device 914. One device may be realized using one computer, or one device may be realized using a plurality of computers. Conversely, a plurality of devices may be realized using one computer.
Some or all of these functional blocks may also be realized not by a computer but by other means, for example an electrical configuration such as a digital or analog circuit, or a mechanical configuration.

FIG. 3 is a block configuration diagram showing an example of the configuration of the related column determination device 903 in this embodiment.

The related column determination device 903 includes, for example, a word extraction unit 20, a feature vector generation unit 40, a context analysis unit 50, a word classification unit 60, and a related word determination unit 70.

The word extraction unit 20 (an example of a text acquisition unit) uses the processing device 911 to extract the words to be classified from the related word extraction source data 10 (an example of text).
For example, the word extraction unit 20 acquires a table schema as the related word extraction source data 10 from the database definition data stored in the database definition storage device 901. A table schema is data that defines a database table. The word extraction unit 20 uses the processing device 911 to extract, from the acquired related word extraction source data 10, the words that make up the column names of the table.
A table schema includes one or more column names (an example of a word group). A column name is a character string chosen freely by the database designer, but it is often a natural-language word related to the column's content or a combination of several such words. For example, a column name may be formed by joining several words with a delimiter such as “_” (underscore).
For example, the word extraction unit 20 acquires a column name from the related word extraction source data 10 and determines whether the acquired column name contains a delimiter. If it does not, the word extraction unit 20 determines that the column name consists of a single word and extracts the column name as a word as it is. If it does, the word extraction unit 20 splits the column name at the delimiters into a plurality of character strings and extracts each of those character strings as a word.
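The splitting rule just described can be sketched in a few lines. The function name `extract_words` and the decision to drop empty fragments (for example from a doubled delimiter) are assumptions of this sketch, not details given in the description.

```python
def extract_words(column_name, delimiter="_"):
    """Split a column name into words at the delimiter.

    A name without the delimiter is treated as a single word.
    Dropping empty fragments is an assumption of this sketch.
    """
    if delimiter not in column_name:
        return [column_name]
    return [part for part in column_name.split(delimiter) if part]


words_order_date = extract_words("ORDER_DATE")
words_tel = extract_words("TEL")
```

With the column names used in the example schemas, `ORDER_DATE` yields the two words `ORDER` and `DATE`, and `TEL` is kept as a single word.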

The feature vector generation unit 40 (an example of a feature vector calculation unit) converts each word extracted by the word extraction unit 20 into a concept vector.
For example, the feature vector generation unit 40 generates a concept vector for each word extracted by the word extraction unit 20 on the basis of the concept dictionary data 30 stored in the concept dictionary storage device 902. A concept vector is a vector of numerical values representing, among other things, the meaning of a word and its semantic relationships with other words. For example, each element of the concept vector is associated with a different concept and takes the value “1” or “0”. A value of “1” indicates that the concept represented by the word is the concept associated with that element or one of its subordinate concepts. A value of “0” indicates that the concept represented by the word is neither the concept associated with that element nor one of its subordinate concepts.
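A minimal sketch of such a binary concept vector is shown below. The tiny concept hierarchy, the concept names, and the fixed element order are hypothetical stand-ins for a real concept dictionary, and the sketch assumes each word maps to exactly one concept.

```python
# Hypothetical toy hierarchy: concept -> superordinate concept (None at the root).
PARENT = {
    "entity": None,
    "time": "entity",
    "date": "time",
    "number": "entity",
}

# Fixed order of the concepts that the vector elements correspond to.
CONCEPTS = ["entity", "time", "date", "number"]


def concept_and_superordinates(concept):
    """Return the concept itself followed by all of its superordinate concepts."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def concept_vector(word_concept):
    """Element is 1 if the word's concept equals, or is subordinate to, that element's concept."""
    active = set(concept_and_superordinates(word_concept))
    return [1 if c in active else 0 for c in CONCEPTS]


date_vector = concept_vector("date")
number_vector = concept_vector("number")
```

In this toy hierarchy a word meaning "date" sets the elements for "entity", "time", and "date", while a word meaning "number" sets only "entity" and "number", so the two vectors overlap only at the root concept.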

The context analysis unit 50 (an example of a feature vector calculation unit) uses the processing device 911 to analyze where each word extracted by the word extraction unit 20 sits within the table structure and the column structure.
For example, the context analysis unit 50 generates a context vector for each word extracted by the word extraction unit 20. A context vector is a vector of numerical values representing, among other things, the contextual positional relationship between the word and other words. For example, each element of the context vector belongs to one of three groups: a preceding-word group, a following-word group, and a table-name group.
Each element of the preceding-word group is associated with a different word and takes the value “1” or “0”. A value of “1” indicates that, in a column name containing the word, the word is joined immediately after the word associated with that element. A value of “0” indicates that, in a column name containing the word, the word is either the first word (including the case where the column name consists of a single word) or is joined immediately after a word other than the one associated with that element.
Each element of the following-word group is associated with a different word and takes the value “1” or “0”. A value of “1” indicates that, in a column name containing the word, the word associated with that element is joined immediately after the word. A value of “0” indicates that, in a column name containing the word, the word is either the last word (including the case where the column name consists of a single word) or is immediately followed by a word other than the one associated with that element.
Each element of the table-name group is associated with a different word and takes the value “1” or “0”. A value of “1” indicates that the name of the table (table name) to which the column whose name contains the word belongs includes the word associated with that element. A value of “0” indicates that the table name of the table to which that column belongs does not include the word associated with that element.
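The three groups of elements (the word immediately before, the word immediately after, and the words in the table name) can be illustrated with the following sketch. It builds the vector for a single occurrence of a word in a single column name; how occurrences in several columns are combined is not shown, and the function name, the argument layout, and the small vocabulary are assumptions of this sketch.

```python
def occurrence_context_vector(table_name, column_name, index, vocabulary, delimiter="_"):
    """Context vector for the word at position `index` of `column_name`.

    Layout (an assumption of this sketch): one element per vocabulary word for the
    word immediately before, then for the word immediately after, then for the
    words contained in the table name.
    """
    words = column_name.split(delimiter)
    table_words = set(table_name.split(delimiter))
    prev_word = words[index - 1] if index > 0 else None
    next_word = words[index + 1] if index + 1 < len(words) else None
    preceding = [1 if v == prev_word else 0 for v in vocabulary]
    following = [1 if v == next_word else 0 for v in vocabulary]
    table = [1 if v in table_words else 0 for v in vocabulary]
    return preceding + following + table


# Hypothetical vocabulary fixing the element order within each group.
vocabulary = ["PURCHASE", "ORDER", "DATE"]

# "DATE" in column ORDER_DATE of table PURCHASE_ORDER.
date_context = occurrence_context_vector("PURCHASE_ORDER", "ORDER_DATE", 1, vocabulary)
# "ORDER" in the same column.
order_context = occurrence_context_vector("PURCHASE_ORDER", "ORDER_DATE", 0, vocabulary)
```

For `DATE` the only non-zero element before the table-name part is the one marking `ORDER` as the word immediately before it; for `ORDER` it is the one marking `DATE` as the word immediately after it.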

The word classification unit 60 uses the processing device 911 to cluster the words on the basis of each word's concept vector and the context in which it appears, and determines the class likelihoods of each word. The word classification unit 60 extracts words from each class on the basis of the calculated class likelihoods.
For example, the word classification unit 60 clusters the plurality of words extracted by the word extraction unit 20 on the basis of the concept vectors generated by the feature vector generation unit 40 and the context vectors generated by the context analysis unit 50. For example, for each word extracted by the word extraction unit 20, the word classification unit 60 calculates the likelihood that the word belongs to each of a predetermined number of classes. For each word extracted by the word extraction unit 20, the word classification unit 60 then determines which class the word belongs to on the basis of the calculated likelihoods.

The related word determination unit 70 uses the processing device 911 to extract columns that are likely to be related columns on the basis of the word groups that the word classification unit 60 extracted for each class.
For example, when the set of classes into which the words making up the name of a column in one table were classified matches the set of classes into which the words making up the name of a column in another table were classified, the related word determination unit 70 determines that the two columns are likely to be related columns.
The related word determination unit 70 uses the processing device 911 to output the result of the determination as related word data 80.
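The matching rule can be sketched as follows. The class assignments in `word_class` are hypothetical (in the described system they would come from the word classification unit 60), and treating the classes of a column name as an unordered set is an interpretation made for this sketch.

```python
from itertools import combinations


def related_column_pairs(columns, word_class, delimiter="_"):
    """Return pairs of columns from different tables whose word classes match.

    columns: list of (table_name, column_name) tuples.
    word_class: mapping from word to class id (hypothetical input).
    """
    def class_set(column_name):
        return frozenset(word_class[w] for w in column_name.split(delimiter))

    pairs = []
    for (t1, c1), (t2, c2) in combinations(columns, 2):
        if t1 != t2 and class_set(c1) == class_set(c2):
            pairs.append(((t1, c1), (t2, c2)))
    return pairs


columns = [
    ("PURCHASE_ORDER", "ORDER_DATE"),
    ("PURCHASE_ORDER", "PHONE_NO"),
    ("ORDER", "ODR_DAT"),
    ("ORDER", "TEL"),
]
# Hypothetical classification result, for illustration only.
word_class = {"ORDER": 0, "ODR": 0, "DATE": 1, "DAT": 1, "PHONE": 2, "TEL": 2, "NO": 3}
pairs = related_column_pairs(columns, word_class)
```

With these assumed classes, `ORDER_DATE` and `ODR_DAT` are reported as a candidate pair, whereas `PHONE_NO` and `TEL` are not, because the class of `NO` has no counterpart in `TEL` under an exact set match.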

FIG. 4 is a diagram showing an example of the related word extraction source data 10 in this embodiment.

In this example, the related word extraction source data 10 is a table schema included in the database definition data stored in the database definition storage device 901. Each of the three pieces of related word extraction source data 10a to 10c shown in this example defines one table.
The related word extraction source data 10a defines a table named “PURCHASE_ORDER” (hereinafter the “PURCHASE_ORDER table”; the same convention applies to the other tables). The PURCHASE_ORDER table has an ORDER_DATE column, a LINE_NO column, and a PHONE_NO column. The ORDER_DATE column is a date type. The LINE_NO column is an 8-digit numeric type. The PHONE_NO column is a 10-character fixed-length character string type. For example, the ORDER_DATE column stores the order date, the LINE_NO column stores the product number, and the PHONE_NO column stores the telephone number.
The related word extraction source data 10b defines an ORDER table. The ORDER table has an ODR_DAT column, an ITEM_NO column, a TEL column, and a FAX column. The ODR_DAT column is a date type. The ITEM_NO column is a 6-digit numeric type. The TEL column is a fixed-length character string type of 8 characters. The FAX column is a fixed-length character string type of 8 characters. For example, the ODR_DAT column stores the order date. The ITEM_NO column stores the product number. The TEL column stores the telephone number. The FAX column stores the facsimile number.
The related word extraction source data 10c defines a SHIPMENT table. The SHIPMENT table has a PRODUCT column and a PROD_TYPE column. The PRODUCT column is an 8-digit numeric type. The PROD_TYPE column is a variable-length character string type with a maximum of 128 characters.

In this example, the ORDER_DATE column of the PURCHASE_ORDER table and the ODR_DAT column of the ORDER table both store the order date and are related columns. The LINE_NO column of the PURCHASE_ORDER table, the ITEM_NO column of the ORDER table, and the PRODUCT column of the SHIPMENT table all store product numbers and are related columns. The PHONE_NO column of the PURCHASE_ORDER table and the TEL column of the ORDER table both store telephone numbers and are related columns.

As this example shows, table schemas defined for different systems, or by different system designers, may include columns that store corresponding contents under different names.
Analyzing such correspondences between columns is essential when integrating and optimizing existing business systems or when building a new application that links a plurality of existing databases. The related column extraction system 900 automatically estimates the correspondence between words included in column names, thereby extracts columns that are likely to be related columns, and makes the work of analyzing the correspondence between the large number of columns in existing systems more efficient.

FIG. 5 is a diagram showing an example of the concept dictionary data 30 in this embodiment.

The concept dictionary data 30 includes, for example, a plurality of concept data 31. Each concept data 31 represents one concept.
The concept data 31 includes, for example, a concept identifier 32, a concept display word 33, and a superordinate concept identifier 34.

The concept identifier 32 is an identifier, such as an identification number, for uniquely identifying the concept data 31.
The concept display word 33 represents a word or collocation that expresses the concept. When a plurality of words or collocations express the same concept, the concept display word 33 is a list in which those words or collocations are joined by a delimiter such as “,” (comma). When one word or collocation has a plurality of meanings, the same word may appear in the concept display words 33 of different concept data 31.
The superordinate concept identifier 34 is the concept identifier 32 of the concept data 31 representing the superordinate concept of the concept. A superordinate concept of a superordinate concept is itself a superordinate concept of the original concept, but it can be found by tracing the superordinate concept identifiers 34, so the superordinate concept identifier 34 represents only the immediately superordinate concept. For a concept that has no superordinate concept, the superordinate concept identifier 34 may be blank. When there are a plurality of immediately superordinate concepts, the superordinate concept identifier 34 may be a list in which a plurality of concept identifiers 32 are joined by a delimiter such as “,” (comma).

For example, the concept whose concept identifier 32 is “00001740” (hereinafter “concept {00001740}”; the same convention applies to the other concepts) corresponds to one of the meanings of the word “entity”. The concept {00001740} is the top-level concept and has no superordinate concept.
One of the meanings of the word “object” and one of the meanings of the collocation “physical object” are the same meaning and correspond to the concept {00002684}. The concept immediately above the concept {00002684} is the concept {00001930}.
The word “line” has a great many meanings. One of them is the same as one of the meanings of the word “cable” and of the collocation “transmission line”, and corresponds to the concept {02937552}. Another is the same as one of the meanings of the collocations “railway line” and “rail line”, and corresponds to the concept {03676598}.
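The record layout of the concept dictionary (identifier, comma-separated display words, comma-separated immediately superordinate identifiers) can be sketched as below. This is only an illustrative sketch: the function names and the in-memory dictionary layout are assumptions, the superordinate identifiers of the two “line” concepts are not stated in the text and are left blank here, and the concept {00001930} is deliberately left without its own record because its display words are not given.

```python
def parse_list(text):
    """Split a comma-delimited field into trimmed, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


def load_dictionary(rows):
    """Build {concept_id: {"words": [...], "parents": [...]}} from raw rows."""
    return {cid: {"words": parse_list(words), "parents": parse_list(parents)}
            for cid, words, parents in rows}


def ancestors(concept_id, dictionary):
    """Trace superordinate identifiers upward until no more are found."""
    found = []
    pending = list(dictionary[concept_id]["parents"])
    while pending:
        cid = pending.pop(0)
        if cid in found:
            continue
        found.append(cid)
        if cid in dictionary:  # identifiers without a record end the trace
            pending.extend(dictionary[cid]["parents"])
    return found


def concepts_for_word(word, dictionary):
    """A word with several meanings appears under several concepts."""
    return sorted(cid for cid, c in dictionary.items() if word in c["words"])


rows = [
    ("00001740", "entity", ""),                             # top-level concept
    ("00002684", "object, physical object", "00001930"),
    ("02937552", "line, cable, transmission line", ""),     # parents not given in the text
    ("03676598", "line, railway line, rail line", ""),      # parents not given in the text
]
dictionary = load_dictionary(rows)
line_concepts = concepts_for_word("line", dictionary)
```

Storing only the immediate parents keeps each record small; the full chain of superordinate concepts is recovered on demand by `ancestors`.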

Because restrictions on the character types usable in database column names mean that English words are often used for column names, the concept dictionary represented by the concept dictionary data 30 in this example is an English dictionary. However, the concept dictionary is not limited to an English dictionary; it may be a dictionary of another language such as Japanese, or a dictionary covering a plurality of languages.

FIGS. 6 and 7 are diagrams showing an example of the hierarchical structure between the concepts represented by the concept dictionary data 30 in this embodiment.

The two figures show a hierarchical structure with the concept {00001740} (entity) at its apex. Although the structure is split across two figures, it forms a single hierarchy as a whole. “matter” in the upper part of FIG. 6 is the same node as “matter” in the lower part of that figure, and the two are connected. Likewise, “abstraction / abstract entity”, “document”, and “relation” in FIG. 6 are the same nodes as “abstraction / abstract entity”, “document”, and “relation” in FIG. 7, respectively, and are connected to them. “psychological feature” in the upper part of FIG. 7 is the same node as “psychological feature” in the lower part of that figure, and the two are connected.

Each ellipse represents one concept. A line extending upward from an ellipse leads to a superordinate concept, and a line extending downward leads to a subordinate concept. A word or collocation written inside an ellipse is a word or collocation that expresses the concept. When a plurality of words or collocations are written inside one ellipse, they are synonyms expressing the same concept.

The hierarchical structure between concepts is roughly a tree, but it is not a tree in the strict sense, because a concept may have a plurality of superordinate concepts one level above it (for example, “substance” in the lower part of FIG. 6).
According to this figure, the word “line” has a meaning relatively close to the word “product” (such as a subordinate concept of “merchandise / ware / product” in FIG. 6) as well as a meaning relatively close to the word “phone” (such as a subordinate concept of “instrumentality / instrumentation” in FIG. 6).

FIG. 8 is a diagram for explaining an example of the operation of the word extraction unit 20 in this embodiment.

The table name constituent word data 21 is data representing the words that constitute a table name. The table name constituent word data 21 includes, for example, a table identifier 22 and one or more table name constituent words 23. The table identifier 22 is an identifier for uniquely identifying a table. The word extraction unit 20 generates one table name constituent word data 21 for each table. The table name constituent words 23 are the words constituting the table name of the table identified by the table identifier 22.
The column name constituent word data 24 is data representing the words that constitute a column name. The column name constituent word data 24 includes, for example, a table identifier 25, a column identifier 26, and one or more column name constituent words 27. The table identifier 25 is the table identifier 22 of the table to which the column belongs. The column identifier 26 is an identifier for uniquely identifying a column. The word extraction unit 20 generates one column name constituent word data 24 for each column. The column identifier 26 may identify a column uniquely on its own, or only in combination with the table identifier 25. The column name constituent words 27 are the words constituting the column name of the column identified by the column identifier 26.

The word extraction unit 20 splits the table names and column names of the table schema into words and extracts the words.
For example, for the PURCHASE_ORDER table, the word extraction unit 20 assigns the identifier “T0001” as the table identifier 22. The word extraction unit 20 splits the table name “PURCHASE_ORDER” at the delimiter “_” (underscore) into the word “PURCHASE” and the word “ORDER”, which become the table name constituent words 23.
Similarly, for the ORDER_DATE column of the PURCHASE_ORDER table, the word extraction unit 20 assigns the identifier “C0011” as the column identifier 26. The word extraction unit 20 splits the column name “ORDER_DATE” at the delimiter “_” (underscore) into the word “ORDER” and the word “DATE”, which become the column name constituent words 27.
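The word splitting described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the tuple layout are hypothetical, and the identifier numbering scheme is an assumption chosen only so that the first table and column receive “T0001” and “C0011”; the text does not define how identifiers are generated in general.

```python
def split_name(name, delimiter="_"):
    """Split a table or column name into its constituent words."""
    return [word for word in name.split(delimiter) if word]


def extract_words(schemas):
    """Return (table_records, column_records) for a {table: [columns]} mapping.

    table_records:  (table_id, [table name words])
    column_records: (table_id, column_id, [column name words])
    """
    table_records, column_records = [], []
    for t_index, (table_name, column_names) in enumerate(schemas.items(), start=1):
        table_id = f"T{t_index:04d}"              # assumed numbering scheme
        table_records.append((table_id, split_name(table_name)))
        for c_index, column_name in enumerate(column_names, start=1):
            column_id = f"C{t_index:03d}{c_index}"  # assumed numbering scheme
            column_records.append((table_id, column_id, split_name(column_name)))
    return table_records, column_records


schemas = {
    "PURCHASE_ORDER": ["ORDER_DATE", "LINE_NO", "PHONE_NO"],
    "ORDER": ["ODR_DAT", "ITEM_NO", "TEL", "FAX"],
}
table_records, column_records = extract_words(schemas)
```

Names without a delimiter, such as TEL, simply yield a single constituent word.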

FIG. 9 is a diagram for explaining an example of the operation of the context analysis unit 50 in this embodiment.

The prefix word 51 is the word that appears immediately before a column name constituent word 27 in the column name containing that word. The postfix word 52 is the word that appears immediately after a column name constituent word 27 in the column name containing that word. The table name constituent word 53 is a table name constituent word 23 constituting the table name of the table to which the column containing the column name constituent word 27 belongs.

For example, the context analysis unit 50 extracts the prefix words 51 from the column name constituent words 27 extracted by the word extraction unit 20: it takes only those column name constituent words 27 that are followed by another word and uses them as the prefix words 51.
Similarly, the context analysis unit 50 extracts the postfix words 52 from the column name constituent words 27 extracted by the word extraction unit 20: it takes only those column name constituent words 27 that are preceded by another word and uses them as the postfix words 52.
The context analysis unit 50 uses the table name constituent words 23 extracted by the word extraction unit 20 as the table name constituent words 53.

The context vector 54 is the context vector generated by the context analysis unit 50. The context analysis unit 50 generates one context vector 54 for each word. The context analysis unit 50 distinguishes identically spelled words that appear at different positions (columns or tables). For example, the column name constituent word 27 “NO” appears in three places: column {C0012} (LINE_NO column) of table {T0001} (PURCHASE_ORDER table), column {C0013} (PHONE_NO column) of table {T0001} (PURCHASE_ORDER table), and column {C0222} (ITEM_NO column) of table {T0002} (ORDER table). The context analysis unit 50 distinguishes these three occurrences of the column name constituent word 27 “NO” and generates a context vector 54 for each of them.
For the prefix words 51, the postfix words 52, and the table name constituent words 53, however, the context analysis unit 50 does not distinguish identically spelled words and treats them as one word.

Each element of the context vector 54 is associated with one of the prefix words 51, the postfix words 52, and the table name constituent words 53.
For example, the context analysis unit 50 determines the number of dimensions of the context vector 54 from the numbers of extracted prefix words 51, postfix words 52, and table name constituent words 53: the number of dimensions is the sum of the number of prefix words 51, the number of postfix words 52, and the number of table name constituent words 53.
For each column name constituent word 27, the context analysis unit 50 determines, based on the column name constituent word data 24 generated by the word extraction unit 20, whether a word is located immediately before that column name constituent word 27. If so, the context analysis unit 50 selects that word from the prefix words 51, sets the element of the context vector 54 associated with the selected prefix word 51 to “1”, and sets the elements associated with the other prefix words 51 to “0”.
Likewise, the context analysis unit 50 determines, based on the column name constituent word data 24, whether a word is located immediately after the column name constituent word 27. If so, the context analysis unit 50 selects that word from the postfix words 52, sets the element of the context vector 54 associated with the selected postfix word 52 to “1”, and sets the elements associated with the other postfix words 52 to “0”.
Based on the table name constituent word data 21 generated by the word extraction unit 20, the context analysis unit 50 acquires the table name constituent words 23 constituting the table name of the table identified by the table identifier 25. The context analysis unit 50 selects the acquired table name constituent words 23 from the table name constituent words 53, sets the elements of the context vector 54 associated with the selected table name constituent words 53 to “1”, and sets the elements associated with the other table name constituent words 53 to “0”.

For example, consider the word “ORDER” appearing in column {C0011} (ORDER_DATE column) of table {T0001} (PURCHASE_ORDER table). Because the word “ORDER” is at the beginning of the column name of column {C0011} and no word is located immediately before it, the context analysis unit 50 sets all elements of the context vector 54 associated with the prefix words 51 to “0”. Because the word “DATE” is located immediately after the word “ORDER” in the column name of column {C0011}, the context analysis unit 50 sets, among the elements of the context vector 54 associated with the postfix words 52, the element associated with “DATE” to “1” and the other elements to “0”. Because the table name constituent words 23 of table {T0001} are “PURCHASE” and “ORDER”, the context analysis unit 50 sets, among the elements of the context vector 54 associated with the table name constituent words 53, the elements associated with “PURCHASE” and “ORDER” to “1” and the other elements to “0”.
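The construction of a context vector for one word occurrence can be sketched as below. This is an illustrative sketch only: the function names, the nested data layout, and the alphabetical ordering of the vector elements are assumptions, and only the PURCHASE_ORDER and ORDER tables are included to keep the example short.

```python
def build_axes(tables):
    """Collect the prefix, postfix, and table-name vocabularies.

    tables maps a tuple of table name words to a list of column word lists.
    """
    columns = [col for cols in tables.values() for col in cols]
    prefix = sorted({w for col in columns for w in col[:-1]})   # words followed by a word
    postfix = sorted({w for col in columns for w in col[1:]})   # words preceded by a word
    table_words = sorted({w for table in tables for w in table})
    return prefix, postfix, table_words


def context_vector(table, column, position, axes):
    """0/1 vector for the word at `position` in `column` of `table`."""
    prefix, postfix, table_words = axes
    before = column[position - 1] if position > 0 else None
    after = column[position + 1] if position + 1 < len(column) else None
    return ([int(w == before) for w in prefix]
            + [int(w == after) for w in postfix]
            + [int(w in table) for w in table_words])


tables = {
    ("PURCHASE", "ORDER"): [["ORDER", "DATE"], ["LINE", "NO"], ["PHONE", "NO"]],
    ("ORDER",): [["ODR", "DAT"], ["ITEM", "NO"], ["TEL"], ["FAX"]],
}
axes = build_axes(tables)
# Context of "ORDER" in the ORDER_DATE column of the PURCHASE_ORDER table.
vector = context_vector(("PURCHASE", "ORDER"), ["ORDER", "DATE"], 0, axes)
```

Because the vector is computed per occurrence, the word “NO” in LINE_NO and the word “NO” in PHONE_NO receive different vectors even though they are spelled identically.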

FIG. 10 is a diagram for explaining an example of the operation of the feature vector generation unit 40 in this embodiment.

The concept identifier 41 is the concept identifier 32, in the concept dictionary data 30, of a concept represented by a column name constituent word 27.
For example, the feature vector generation unit 40 extracts from the concept dictionary data 30 the concept data 31 whose concept display word 33 contains any of the column name constituent words 27 extracted by the word extraction unit 20, and uses the concept identifier 32 of the extracted concept data 31 as a concept identifier 41. Based on the superordinate concept identifier 34 of the extracted concept data 31, the feature vector generation unit 40 then extracts from the concept dictionary data 30 the concept data 31 representing the superordinate concept, and adds its concept identifier 32 to the concept identifiers 41. The feature vector generation unit 40 repeats this until it reaches the top-level concept.

The concept vector 44 is the concept vector generated by the feature vector generation unit 40. The feature vector generation unit 40 generates one concept vector 44 for each word. Unlike the context analysis unit 50, however, the feature vector generation unit 40 does not distinguish identically spelled words, so the element values of the concept vector 44 do not depend on where the word appears. For example, the column name constituent word 27 “NO” appears in three places, but the feature vector generation unit 40 generates a single concept vector 44 for it.

Each element of the concept vector 44 is associated with one of the concept identifiers 41.
For example, the feature vector generation unit 40 uses the number of extracted concept identifiers 41 as the number of dimensions of the concept vector 44.
For each column name constituent word 27, the feature vector generation unit 40 extracts from the concept dictionary data 30 the concept data 31 whose concept display word 33 contains that column name constituent word 27, and selects from the concept identifiers 41 the one that matches the concept identifier 32 of the extracted concept data 31. Based on the superordinate concept identifier 34 of the extracted concept data 31, the feature vector generation unit 40 extracts from the concept dictionary data 30 the concept data 31 representing the superordinate concept, and further selects the concept identifier 41 that matches its concept identifier 32. The feature vector generation unit 40 repeats this until it reaches the top-level concept. In this manner, the feature vector generation unit 40 selects the concept identifiers 41 of all concepts on the concept category path of the column name constituent word 27 in the concept dictionary.
The feature vector generation unit 40 sets the elements of the concept vector 44 associated with the selected concept identifiers 41 to “1” and the elements associated with the other concept identifiers 41 to “0”.
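The concept vector generation described above can be sketched as follows. This is a minimal sketch, not the embodiment's implementation: the function names are hypothetical, and the toy dictionary with identifiers “c0” to “c5” is invented for illustration and does not reproduce the actual concept dictionary data 30.

```python
def concept_path(word, dictionary):
    """Ids of every concept whose display words include `word`, plus all ancestors."""
    selected = set()
    pending = [cid for cid, (words, _) in dictionary.items() if word in words]
    while pending:
        cid = pending.pop()
        if cid in selected:
            continue
        selected.add(cid)
        pending.extend(dictionary[cid][1])  # follow immediately superordinate concepts
    return selected


def concept_vectors(words, dictionary):
    """Return (axes, {word: 0/1 vector}); one vector per distinct spelling."""
    paths = {w: concept_path(w, dictionary) for w in set(words)}
    axes = sorted(set().union(*paths.values()))
    return axes, {w: [int(cid in path) for cid in axes] for w, path in paths.items()}


# Hypothetical toy dictionary: id -> (display words, immediately superordinate ids).
dictionary = {
    "c0": (["entity"], []),
    "c1": (["merchandise", "ware", "product"], ["c0"]),
    "c2": (["line"], ["c1"]),
    "c3": (["instrumentality", "instrumentation"], ["c0"]),
    "c4": (["line", "cable"], ["c3"]),
    "c5": (["phone"], ["c3"]),
}
axes, vectors = concept_vectors(["line", "product", "phone", "line"], dictionary)
```

In this toy data, “line” shares ancestors with both “product” and “phone”, which mirrors the observation that the word has meanings close to each of them.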

FIG. 11 is a diagram for explaining an example of the operation of the word classification unit 60 in this embodiment.

The upper part of this figure is a context management graph. The context management graph represents the compositional relationship between tables, columns, and words.
The top level of the network consists of the individual tables identified by the table identifiers 25. The second level consists of the individual columns identified by the column identifiers 26. The third level consists of the column name constituent words 27. In the third level, however, one node is generated per distinct word, so identically spelled words share a single node.
A line connecting a table and a column indicates that the column belongs to the table. Each table is connected to one or more columns. Each column is connected to exactly one table.
A line connecting a column and a column name constituent word 27 indicates that the column name of the column contains that column name constituent word 27. Each column is connected to one or more column name constituent words 27. Because different columns may contain identically spelled words, a column name constituent word 27 may be connected to a plurality of columns.
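The three-level structure of the context management graph can be sketched with plain dictionaries, as below. This is only one possible representation: the function name and the adjacency-list layout are assumptions, and the records are limited to the columns whose identifiers appear in the text.

```python
from collections import defaultdict


def build_context_graph(column_records):
    """Build table -> columns, column -> words, and word -> columns links.

    column_records is a list of (table_id, column_id, [column name words]).
    Identically spelled words share one node in word_to_columns.
    """
    table_to_columns = defaultdict(list)
    column_to_words = {}
    word_to_columns = defaultdict(list)
    for table_id, column_id, words in column_records:
        table_to_columns[table_id].append(column_id)
        column_to_words[column_id] = list(words)
        for word in words:
            if column_id not in word_to_columns[word]:
                word_to_columns[word].append(column_id)
    return dict(table_to_columns), column_to_words, dict(word_to_columns)


records = [
    ("T0001", "C0011", ["ORDER", "DATE"]),
    ("T0001", "C0012", ["LINE", "NO"]),
    ("T0001", "C0013", ["PHONE", "NO"]),
    ("T0002", "C0222", ["ITEM", "NO"]),
]
tables, columns, words = build_context_graph(records)
```

Here the single node for “NO” is linked to three columns, while every column is linked to exactly one table.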

The lower part of the figure represents the classes. A class is a group into which the column name constituent words 27 are classified. The number of classes is predetermined. For example, the related column determination device 903 uses the input device 912 to receive the number of classes set by a user of the related column extraction system 900, and uses the storage device 914 to store that number.
The class identifier 61 is an identifier for uniquely identifying a class. For example, the word classification unit 60 uses the processing device 911 to generate a class identifier 61 that identifies each class based on the stored number of classes.

The word classification unit 60 classifies each column name constituent word 27 into one of the classes. That is, the word classification unit 60 draws the one-to-many edges connecting the column name constituent words 27 (distinct words) and the classes shown in this figure. Each column name constituent word 27 is connected to one class. Each class is connected to one or more column name constituent words 27.
The word classification unit 60 searches for the optimal class assignment that satisfies the given conditions.

FIG. 12 is a diagram for explaining an example of the operation of the word classification unit 60 in this embodiment.

The class identifier 62 is the class identifier 61 of the class into which a column name constituent word 27 is classified.
The word classification unit 60 uses the processing device 911 to classify each column name constituent word 27 extracted by the word extraction unit 20 into one of the classes. The word classification unit 60 uses the class identifier 61 of the class into which the column name constituent word 27 is classified as the class identifier 62.
At this time, the word classification unit 60 does not distinguish identically spelled words that appear at different positions. That is, the word classification unit 60 always classifies identically spelled words appearing at different positions into the same class.

The word classification unit 60 generates a class vector 63 for each column of each table, so the number of class vectors 63 equals the total number of columns. The class vector 63 represents the classes into which the column name constituent words 27 contained in the column name of that column are classified.
Each element of the class vector 63 is associated with a class identified by a class identifier 61. Each element indicates whether any of the column name constituent words 27 contained in the column name of the column is classified into that class.
For example, for each column identified by a column identifier 26 of the table identified by a table identifier 25, the word classification unit 60 generates a class vector 63 whose dimension is the predetermined number of classes. Based on the column name constituent word data 24 generated by the word extraction unit 20 and the class identifiers 62 of the assigned classes, the word classification unit 60 selects, from among the class identifiers 61, those that match the class identifier 62 of a class into which a word belonging to the column is classified. The word classification unit 60 sets the elements of the class vector 63 associated with the selected class identifiers 61 to “1” (an example of a value indicating that the column name contains a word classified into that class) and sets the other elements to “0” (an example of a value indicating that the column name contains no word classified into that class).

For example, suppose that the word classification unit 60 classifies the word “ORDER” into class {y3}, the word “DATE” into class {y3}, the word “LINE” into class {y4}, the word “NO” into class {y1}, and the word “TEL” into class {y1}.
For column {C0011} (ORDER_DATE column) of table {T0001} (PURCHASE_ORDER table), the words “ORDER” and “DATE” are both classified into class {y3}. The word classification unit 60 therefore sets the element of the class vector 63 associated with class {y3} to “1” and the elements associated with the other classes to “0”.

Likewise, for column {C0012} (LINE_NO column) of table {T0001}, the word “LINE” is classified into class {y4} and the word “NO” into class {y1}. The word classification unit 60 therefore sets the elements of the class vector 63 associated with class {y1} and class {y4} to “1” and the elements associated with the other classes to “0”.
For column {C0023} (TEL column) of table {T0002} (ORDER table), the word “TEL” is classified into class {y1}. The word classification unit 60 therefore sets the element of the class vector 63 associated with class {y1} to “1” and the elements associated with the other classes to “0”.
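The class vector construction in this example can be sketched in Python as follows. This is a minimal illustration, not the device's implementation: the function name `class_vector`, the mapping of classes y1 to y4 onto indices 0 to 3, and the total of four classes are assumptions made for the sketch.

```python
from typing import Dict, List


def class_vector(column_words: List[str], word_class: Dict[str, int],
                 num_classes: int) -> List[int]:
    """Return a 0/1 vector with a 1 for each class that contains a word of the column."""
    vector = [0] * num_classes
    for word in column_words:
        vector[word_class[word]] = 1
    return vector


# Classes y1..y4 are mapped to indices 0..3 (the total of four classes is assumed).
WORD_CLASS = {"ORDER": 2, "DATE": 2, "LINE": 3, "NO": 0, "TEL": 0}

order_date = class_vector(["ORDER", "DATE"], WORD_CLASS, 4)  # column C0011
line_no = class_vector(["LINE", "NO"], WORD_CLASS, 4)        # column C0012
tel = class_vector(["TEL"], WORD_CLASS, 4)                   # column C0023
```

Under these assumptions, ORDER_DATE has a single 1 at the position of y3, LINE_NO has 1s at the positions of y1 and y4, and TEL has a single 1 at the position of y1.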

The word classification unit 60 classifies the column name constituent words 27 so that the class vectors 63 generated for columns belonging to the same table differ from one another.
For example, the word classification unit 60 generates the class vectors 63 after classifying all the column name constituent words 27. If any of the class vectors 63 generated for columns belonging to the same table coincide, the word classification unit 60 redoes the classification.
Alternatively, the word classification unit 60 classifies the column name constituent words 27 one at a time and generates the class vector 63 of each column as soon as all the words contained in its column name have been classified. If a newly generated class vector 63 coincides with a class vector 63 already generated for another column of the same table, the word classification unit 60 changes the class of the column name constituent word 27 that it classified last.
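A check for this condition might look like the sketch below. The function name `violates_exclusion` and the data layout (a table identifier mapped to a list of columns, each column a list of words) are hypothetical. The sketch relies on the fact that two 0/1 class vectors are equal exactly when the sets of classes they mark are equal.

```python
from typing import Dict, List


def violates_exclusion(table_columns: Dict[str, List[List[str]]],
                       word_class: Dict[str, int]) -> bool:
    """Return True if two columns of one table yield identical class vectors."""
    for columns in table_columns.values():
        seen = set()
        for words in columns:
            # Two 0/1 class vectors are equal exactly when their class sets are equal.
            classes = frozenset(word_class[word] for word in words)
            if classes in seen:
                return True
            seen.add(classes)
    return False


TABLES = {"T0001": [["ORDER", "DATE"], ["LINE", "NO"]]}
distinct_case = violates_exclusion(TABLES, {"ORDER": 2, "DATE": 2, "LINE": 3, "NO": 0})
clash_case = violates_exclusion(TABLES, {"ORDER": 1, "DATE": 1, "LINE": 1, "NO": 1})
```

A classifier following the first strategy would rerun the classification whenever this check reports a violation. One following the second strategy would run it after each column is completed.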

FIG. 13 is a diagram for explaining an example of the operation of the word classification unit 60 in this embodiment.

The word classification unit 60 (an example of a feature vector generation unit) uses the processing device 911 to concatenate the context vector 54 generated by the context analysis unit 50 with the concept vector 44 generated by the feature vector generation unit 40, thereby generating a feature vector 64. If the context vector 54 has m elements and the concept vector 44 has n elements, the feature vector 64 has m + n elements.
At this time, like the context analysis unit 50, the word classification unit 60 distinguishes identically spelled words that appear at different positions. The word classification unit 60 generates as many feature vectors 64 as the context analysis unit 50 generated context vectors 54.
Each element of the feature vector 64 is associated with one of a prefix word 51, a suffix word 52, a table name constituent word 53, or a concept identifier 41.

For example, for each context vector 54 generated by the context analysis unit 50, the word classification unit 60 selects, from the concept vectors 44 generated by the feature vector generation unit 40, the concept vector 44 whose column name constituent word 27 matches. The word classification unit 60 generates a feature vector 64 with a total of m + n elements: the m elements of the context vector 54 followed by the n elements of the selected concept vector 44.
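The concatenation can be sketched as follows. The key type `Occurrence`, the function name, and all numeric values are illustrative assumptions. The sketch only shows that context vectors are kept per occurrence, while concept vectors are looked up by spelling.

```python
from typing import Dict, List, Tuple

# Hypothetical key for one word occurrence: (table id, column id, word).
Occurrence = Tuple[str, str, str]


def build_feature_vectors(context_vectors: Dict[Occurrence, List[float]],
                          concept_vectors: Dict[str, List[float]]
                          ) -> Dict[Occurrence, List[float]]:
    """Append the concept vector of the matching word to each context vector."""
    features = {}
    for occurrence, context in context_vectors.items():
        word = occurrence[2]
        features[occurrence] = list(context) + list(concept_vectors[word])
    return features


# Toy values: m = 3 context elements and n = 2 concept elements.
context = {("T0001", "C0011", "ORDER"): [0.0, 1.0, 1.0],
           ("T0002", "C0023", "TEL"): [0.0, 0.0, 1.0]}
concept = {"ORDER": [1.0, 0.0], "TEL": [0.0, 1.0]}
features = build_feature_vectors(context, concept)
```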

FIG. 14 is a diagram for explaining an example of the operation of the word classification unit 60 in this embodiment.

The word classification unit 60 (an example of a weight vector generation unit) uses the processing device 911 to generate a weight parameter vector 65 (an example of a weight vector) for each class identified by a class identifier 61. The word classification unit 60 generates as many weight parameter vectors 65 as the predetermined number of classes.
The dimension of the weight parameter vector 65 is the same as that of the feature vector 64. Each element of the weight parameter vector 65 is associated with one of a prefix word 51, a suffix word 52, a table name constituent word 53, or a concept identifier 41.

The weight parameter vector 65 is used to calculate the likelihood that a column name constituent word 27 belongs to the class. The word classification unit 60 calculates the likelihood that a given column name constituent word 27 belongs to a given class from the inner product of the feature vector 64 generated for that word and the weight parameter vector 65 generated for that class. Each element of the weight parameter vector 65 represents how the item associated with that element (a prefix word 51, a suffix word 52, a table name constituent word 53, or a concept identifier 41) affects the likelihood that a word belongs to the class. A positive element means that a word is more likely to belong to the class when the item applies to the word. Conversely, a negative element means that a word is more likely to belong to the class when the item does not apply to the word. The larger the absolute value of the element, the greater the influence. An element of 0 means that whether the item applies to a word has no bearing on whether the word belongs to the class.
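The role of the inner product can be illustrated with the sketch below. Turning the inner products into normalized likelihoods with an exponential is an assumption taken from the multiclass logistic regression model this description uses later. The function name and the toy weights are hypothetical.

```python
import math
from typing import List


def class_likelihoods(phi: List[float], weights: List[List[float]]) -> List[float]:
    """Return one likelihood per class from the inner products w_k . phi."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w_k, phi)) for w_k in weights]
    top = max(scores)                      # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)                          # normalization coefficient
    return [e / z for e in exps]


phi = [1.0, 0.0, 1.0]                      # toy feature vector: items 1 and 3 apply
weights = [[2.0, 0.0, 0.0],                # class 0: item 1 raises the likelihood
           [-2.0, 0.0, 0.0],               # class 1: item 1 lowers the likelihood
           [0.0, 0.0, 0.0]]                # class 2: no item has any influence
likelihoods = class_likelihoods(phi, weights)
```

With these toy weights the word is most likely to belong to class 0 and least likely to belong to class 1, matching the sign interpretation above.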

First, the word classification unit 60 classifies the column name constituent words 27 at random within the range that satisfies the conditions.

Next, the word classification unit 60 generates the weight parameter vectors 65 that are optimal for that classification. The word classification unit 60 uses, for example, supervised learning of multiclass logistic regression. The word classification unit 60 calculates the weight parameter vectors 65 by, for example, finding the optimal solution with the iteratively reweighted least squares method.
The word classification unit 60 uses the calculated weight parameter vectors 65 to calculate the likelihood that each column name constituent word 27 belongs to each class. The word classification unit 60 then reclassifies the column name constituent words 27 using the calculated likelihoods. The word classification unit 60 classifies the column name constituent words 27 by, for example, constrained unsupervised clustering using a multiclass logistic regression model.

The word classification unit 60 compares the new classification with the classification before reclassification. If the classification has changed, the word classification unit 60 recalculates the weight parameter vectors 65 and reclassifies the column name constituent words 27 using the recalculated weight parameter vectors 65. The word classification unit 60 repeats this until the classification no longer changes.
Because the classification may fail to converge, a maximum number of iterations may be set in advance. In that configuration, the word classification unit 60 terminates the processing when the number of iterations reaches the maximum.
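The alternation described above can be outlined as the following loop skeleton. It is a sketch under stated assumptions: `fit_weights` and `reassign` are hypothetical callbacks standing in for the weight calculation and the constrained reclassification, and the default maximum of 100 iterations is arbitrary.

```python
from typing import Callable, Dict, Tuple, TypeVar

W = TypeVar("W")
Assignment = Dict[str, int]


def alternate(initial: Assignment,
              fit_weights: Callable[[Assignment], W],
              reassign: Callable[[W], Assignment],
              max_iterations: int = 100) -> Tuple[Assignment, int]:
    """Alternate weight fitting and reclassification until the classes stop changing."""
    assignment = dict(initial)
    for iteration in range(1, max_iterations + 1):
        weights = fit_weights(assignment)      # classes fixed, weights optimized
        new_assignment = reassign(weights)     # weights fixed, classes reassigned
        if new_assignment == assignment:       # converged: nothing changed
            return assignment, iteration
        assignment = new_assignment
    return assignment, max_iterations          # stopped at the iteration limit


# Toy stand-ins: the "weights" are ignored and the reassignment is constant.
result, iterations = alternate({"A": 1, "B": 1},
                               fit_weights=lambda a: None,
                               reassign=lambda w: {"A": 0, "B": 1})
```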

FIG. 15 is a diagram for explaining an example of the operation of the related word determination unit 70 in this embodiment.

The related word determination unit 70 generates related column data 71 based on the result of classification by the word classification unit 60. The related column data 71 represents a set of columns that are likely to be related columns. The related column data 71 includes, for example, a plurality of pairs of a table identifier 25 and a column identifier 26, and indicates that the columns identified by those column identifiers 26 in the tables identified by those table identifiers 25 are likely to be related columns.

For example, the related word determination unit 70 determines whether there are multiple columns whose class vectors 63 coincide, using the class vectors 63 that the word classification unit 60 generated from the classes into which it finally classified the column name constituent words 27. If there are multiple columns whose class vectors 63 coincide, the related word determination unit 70 generates and outputs related column data 71 based on the table identifier 25 of the table to which each column belongs and the column identifier 26 of each column.

For example, the class vector 63 that the word classification unit 60 generated for column {C0011} (ORDER_DATE column) of table {T0001} (PURCHASE_ORDER table) coincides with the class vector 63 generated for column {C0021} (ODR_DAT column) of table {T0002} (ORDER table). The related word determination unit 70 therefore determines that these two columns are likely to be related columns and generates related column data 71 containing the table identifiers 25 and column identifiers 26 of the two columns.
Similarly, the class vector 63 generated for column {C0012} (LINE_NO column) of table {T0001} coincides with the class vector 63 generated for column {C0022} (ITEM_NO column) of table {T0002}. The related word determination unit 70 therefore determines that these two columns are likely to be related columns and generates related column data 71 containing the table identifiers 25 and column identifiers 26 of the two columns.
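Grouping columns by identical class vectors can be sketched as follows. The function name `related_columns` and the four-element vectors are assumptions for illustration. The vectors for C0021 and C0022 are simply set equal to those of C0011 and C0012, as the example states that they coincide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (table identifier, column identifier)


def related_columns(class_vectors: Dict[Column, Tuple[int, ...]]) -> List[List[Column]]:
    """Group columns whose class vectors coincide; keep groups of two or more."""
    groups = defaultdict(list)
    for column, vector in class_vectors.items():
        groups[vector].append(column)
    return [columns for columns in groups.values() if len(columns) > 1]


VECTORS = {
    ("T0001", "C0011"): (0, 0, 1, 0),   # ORDER_DATE
    ("T0001", "C0012"): (1, 0, 0, 1),   # LINE_NO
    ("T0002", "C0021"): (0, 0, 1, 0),   # ODR_DAT
    ("T0002", "C0022"): (1, 0, 0, 1),   # ITEM_NO
    ("T0002", "C0023"): (1, 0, 0, 0),   # TEL
}
groups = related_columns(VECTORS)
```

Each returned group corresponds to one item of related column data 71. The TEL column is left out because no other column shares its class vector.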

The related word determination unit 70 may additionally include, in the related column data 71, the table name of the table identified by the table identifier 25, the column name of the column identified by the column identifier 26, and the like.
The related word determination unit 70 may also acquire the data type of each column from the related word extraction source data 10 and determine that only columns whose data types match or are similar are likely to be related columns.

Instead of outputting the related column data 71, the related word determination unit 70 may output the sets of column name constituent words 27 that the word classification unit 60 classified into the same class.

FIG. 16 is a flowchart showing an example of the flow of the related extraction process S800 in this embodiment.

In the related extraction process S800, the related column determination device 903 extracts columns that are likely to be related columns from among the columns of the tables represented by the database definition data stored in the concept dictionary data 30.
The related extraction process S800 includes, for example, a word extraction process S801, a context analysis process S802, a concept analysis process S803, a word classification process S804, and a related determination process S805. The related column determination device 903 starts the related extraction process S800 from the word extraction process S801.

In the word extraction process S801, the word extraction unit 20 acquires the related word extraction source data 10 from the database definition storage device 901. The word extraction unit 20 extracts the column name constituent words 27 from the acquired related word extraction source data 10.
In the context analysis process S802, the context analysis unit 50 generates a context vector 54 for each column name constituent word 27 extracted by the word extraction unit 20 in the word extraction process S801.
In the concept analysis process S803, the feature vector generation unit 40 uses the concept dictionary data 30 stored in the concept dictionary storage device 902 to generate a concept vector 44 for each column name constituent word 27 extracted by the word extraction unit 20 in the word extraction process S801.
In the word classification process S804, the word classification unit 60 classifies the column name constituent words 27 extracted in the word extraction process S801, based on the context vectors 54 generated by the context analysis unit 50 in the context analysis process S802 and the concept vectors 44 generated by the feature vector generation unit 40 in the concept analysis process S803.
In the related determination process S805, the related word determination unit 70 determines which columns are likely to be related columns, based on the result of the classification performed by the word classification unit 60 in the word classification process S804.

FIG. 17 is a flowchart showing an example of the flow of the word classification process S804 in this embodiment.

The word classification process S804 includes, for example, an initialization step S810, a weight optimization step S820, a classification calculation step S830, and a convergence determination step S840.

In the initialization step S810, the word classification unit 60 uses the processing device 911 to assign provisional correct classes at random while satisfying the constraints. The word classification unit 60 randomly assigns a provisional correct class to each word while satisfying the following match constraint and exclusion constraint.
(1) Match constraint: words with the same spelling are assigned the same correct class regardless of the table or column context in which they appear.
(2) Exclusion constraint: correct classes are assigned so that any two columns of the same table differ in at least one word class.
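One simple way to produce such an initial assignment is rejection sampling, sketched below. The sampling strategy, the function names, the retry limit, and the splitting of column names on underscores are all assumptions of the sketch. Keying the assignment by spelling satisfies the match constraint by construction, and the exclusion constraint is checked after each draw.

```python
import random
from typing import Dict, List


def satisfies_exclusion(table_columns: Dict[str, List[List[str]]],
                        assignment: Dict[str, int]) -> bool:
    """True if no two columns of the same table have the same set of classes."""
    for columns in table_columns.values():
        class_sets = [frozenset(assignment[w] for w in words) for words in columns]
        if len(set(class_sets)) != len(class_sets):
            return False
    return True


def random_initial_assignment(table_columns: Dict[str, List[List[str]]],
                              num_classes: int, rng: random.Random,
                              max_tries: int = 1000) -> Dict[str, int]:
    """Draw one class per distinct spelling until the exclusion constraint holds."""
    spellings = sorted({w for columns in table_columns.values()
                        for words in columns for w in words})
    for _ in range(max_tries):
        assignment = {w: rng.randrange(num_classes) for w in spellings}
        if satisfies_exclusion(table_columns, assignment):
            return assignment
    raise RuntimeError("no assignment satisfying the constraints was found")


TABLES = {"T0001": [["ORDER", "DATE"], ["LINE", "NO"]],
          "T0002": [["ODR", "DAT"], ["ITEM", "NO"], ["TEL"]]}
initial = random_initial_assignment(TABLES, 4, random.Random(0))
```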

In the weight optimization step S820, the word classification unit 60 uses the processing device 911 to optimize the model parameters, such as the weight parameter vectors 65, with the provisional correct classes held fixed. For example, the word classification unit 60 fixes the correct-class assignment A_t to A_t^(η−1) and optimizes the model parameter w by the following equation.

Here, the model parameter w denotes the M-row, K-column matrix (w_1, w_2, ..., w_K) formed by arranging the K weight parameter vectors {w_k | k = 1, 2, ..., K}. K is the predetermined number of classes. Each weight parameter vector w_k is an M-dimensional column vector, where M is the dimension of the feature vectors. arg max denotes the value of the variable written beneath it (here, w) that maximizes its argument (here, log p(t, w, A_t^(η−1) | Φ)). log denotes the natural logarithm. p(t, w, A_t^(η−1) | Φ) denotes the conditional probability of t, w, and A_t^(η−1) given Φ. t denotes the N-dimensional vector whose elements are the N correct classes {t_i | i = 1, ..., N}. N is the number of column name constituent words 27, counting identically spelled words at different positions separately. t_i is the correct class assigned to the i-th column name constituent word x_i (i = 1, ..., N). A_t denotes the assignment of t. η is an integer of 1 or more that counts the iterations of the loop from the weight optimization step S820 to the convergence determination step S840. For example, w^(η) denotes the w calculated in the η-th weight optimization step S820, and A_t^(η) denotes the A_t calculated in the η-th classification calculation step S830. A_t^(0) denotes the assignment of provisional correct classes made by the word classification unit 60 in the initialization step S810. Φ denotes the M-row, N-column matrix (φ_1, φ_2, ..., φ_N) formed by arranging the N feature vectors {φ_i | i = 1, 2, ..., N}. Each feature vector φ_i is an M-dimensional column vector and is the feature vector 64 that the word classification unit 60 generated for the i-th column name constituent word x_i.
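The equation for this step is not reproduced in this text. From the definitions of arg max, w^(η), and A_t^(η−1) given above, it can be reconstructed as follows. This is a reconstruction, so the typesetting of the original equation may differ.

```latex
w^{(\eta)} = \mathop{\mathrm{arg\,max}}_{w} \; \log p\bigl(t, w, A_t^{(\eta-1)} \mid \Phi\bigr)
```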

The word classification unit 60 obtains the value of log p(t, w, A_t | Φ) by, for example, the following equation.
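The equation itself is missing from this text. The description goes on to discuss three terms, log p(w), log p(A_t | w), and log p(t | Φ; w, A_t), which suggests the factorization sketched below. Treat it as an assumed reconstruction rather than the original equation.

```latex
\log p(t, w, A_t \mid \Phi)
  = \log p(w) + \log p(A_t \mid w) + \log p(t \mid \Phi; w, A_t)
```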

Here, log p(w) is a quadratic regularization term on the model parameter w. It prevents the model from overfitting the training data by using excessively large weights. The word classification unit 60 obtains the value of log p(w) by, for example, the following equation.

Here, C_1 is a predetermined constant (hyperparameter).
For example, the related column determination device 903 inputs a setting value for C_1 using the input device 912 and stores the input setting value using the storage device 914. The word classification unit 60 calculates the value of log p(w) using the stored setting value of C_1.
Alternatively, C_1 need not be a fixed value and may instead be a value that decays as the iteration count η increases. For example, the related column determination device 903 uses the input device 912 to input an initial value C_1^(1), a convergence value C_1^(∞), and a decay coefficient ε_1, and stores them using the storage device 914. In the η-th iteration (where η ≥ 2), the word classification unit 60 calculates the constant C_1^(η) for the current iteration from the value C_1^(η−1) used in the previous iteration, the stored convergence value C_1^(∞), and the decay coefficient ε_1, for example by the following equation.

Gradually decaying the value of C_1 in this way gradually strengthens the learning. As a result, even if an incorrect class assignment is made in the unstable early stage of learning, later iterations can more easily correct it based on the global class assignment.
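The update equation for the decaying constant is not reproduced in this text. The sketch below uses one plausible form, a geometric approach to the convergence value, and is purely an assumption. It only shows how the previous value, the convergence value, and the decay coefficient could be combined. The function name and the numeric values are hypothetical.

```python
def next_constant(previous: float, converged: float, decay: float) -> float:
    """Assumed update: shrink the gap to the convergence value by the factor `decay`."""
    return converged + decay * (previous - converged)


c = 1.0                      # assumed initial value C^(1)
schedule = [c]
for _ in range(3):           # iterations 2, 3 and 4
    c = next_constant(c, converged=0.1, decay=0.5)
    schedule.append(c)
```

With a decay coefficient between 0 and 1, the constant decreases monotonically and never falls below the convergence value.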

In addition, log p(A_t | w) maximizes the sum of the weights that each class receives from all the training data. Together with the size penalty imposed by the regularization term above, it acts as a penalty that balances class size (the number of distinct words) across classes. The word classification unit 60 obtains the value of log p(A_t | w) by, for example, the following equation.

Here, C_2 is a predetermined constant (hyperparameter). For example, the related column determination device 903 inputs a setting value for C_2 using the input device 912 and stores the input setting value using the storage device 914. The word classification unit 60 calculates the value of log p(A_t | w) using the stored setting value of C_2.
Alternatively, C_2 need not be a fixed value and may instead be a value that decays as the iteration count η increases. For example, the related column determination device 903 uses the input device 912 to input an initial value C_2^(1), a convergence value C_2^(∞), and a decay coefficient ε_2, and stores them using the storage device 914. In the η-th iteration (where η ≥ 2), the word classification unit 60 calculates the constant C_2^(η) for the current iteration from the value C_2^(η−1) used in the previous iteration, the stored convergence value C_2^(∞), and the decay coefficient ε_2, for example by the following equation.

Gradually decaying the value of C_2 in this way gradually strengthens the learning. As a result, even if an incorrect class assignment is made in the unstable early stage of learning, later iterations can more easily correct it based on the global class assignment.

In addition, log p(t | Φ; w, A_t) is the likelihood of the training data under the model. For example, the word classification unit 60 obtains the value of log p(t | Φ; w, A_t) by the following equation.

When multiclass logistic regression is used, the word classification unit 60 obtains the likelihood p(y_k | φ_i; w, A_t) that the i-th word x_i belongs to the k-th class y_k by, for example, the following equation.

Here, exp denotes the exponential function with base e. Z_i is a normalization coefficient. The word classification unit 60 obtains the normalization coefficient Z_i by, for example, the following equation.
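The equations referred to here are missing from this text. For multiclass logistic regression, the class likelihood and its normalization coefficient conventionally take the softmax form below, with w_k^T φ_i being the inner product of a weight parameter vector 65 and a feature vector 64. The last line additionally assumes that the data log likelihood is the sum over the N words of the log likelihood of each assigned class. All three lines are reconstructions, not copies of the original equations.

```latex
p(y_k \mid \phi_i; w, A_t) = \frac{1}{Z_i} \exp\bigl(w_k^{\top} \phi_i\bigr),
\qquad
Z_i = \sum_{k'=1}^{K} \exp\bigl(w_{k'}^{\top} \phi_i\bigr)

\log p(t \mid \Phi; w, A_t) = \sum_{i=1}^{N} \log p(t_i \mid \phi_i; w, A_t)
```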

分類算出工程Ｓ８３０において、単語分類部６０は、処理装置９１１を用いて、モデルパラメータを固定して、制約付き最尤クラスを求める。単語分類部６０は、例えば次の式により、モデルパラメータｗをｗ^（η）に固定して、制約付きで最適なクラスを探索し、正解クラスＡ_ｔ ^（η）として割り当て直す。

ただし、Ｓは、制約条件を充足するＡ_ｔの集合を表わす。 In the classification calculation step S830, the word classification unit 60 uses the processing device 911 to fix the model parameters and obtain a constrained maximum likelihood class. The word classifying unit 60 fixes the model parameter w to w ^(η) , for example, by the following formula, searches for an optimal class with constraints, and reassigns it as a correct class A _t ^(η) .

Here, S denotes the set of assignments A_t that satisfy the constraints.

例えば、単語分類部６０は、各カラム名構成単語２７を、各クラスに分類するかしないかを変数とする０−１整数計画問題を解くことにより、クラス選択の最適解（制約付き最尤解）を求める。この方式は、クラス数や単語数が小規模の場合に有効である。
あるいは、単語分類部６０は、貪欲法により平均尤度の高い単語から順番にクラスを割り当てることにより、準最適解を求める。この方式は、クラス数や単語数が大規模な場合に有効である。 For example, the word classification unit 60 obtains the optimal class selection (the constrained maximum likelihood solution) by solving a 0-1 integer programming problem whose variables indicate whether or not each column name constituent word 27 is classified into each class. This method is effective when the number of classes and the number of words are small.
Alternatively, the word classification unit 60 obtains a suboptimal solution by using a greedy method to assign classes to words in descending order of average likelihood. This method is effective when the number of classes and the number of words are large.

収束判定工程Ｓ８４０において、単語分類部６０は、処理装置９１１を用いて、クラスタリング処理の収束判定を行う。例えば、単語分類部６０は、いずれの単語に割り当てた暫定正解クラスも前回の反復で割り当てたクラスと変わらなければクラスタリング処理が収束したと判定し、反復を打ち切る。あるいは、単語分類部６０は、繰り返し回数ηが所定の最大反復回数に達したら、強制的に反復を打ち切る。
反復を打ち切らないと判定した場合、単語分類部６０は、重み最適化工程Ｓ８２０に処理を戻す。
反復を打ち切ると判定した場合、単語分類部６０は、単語分類処理Ｓ８０４を終了する。 In the convergence determination step S840, the word classification unit 60 uses the processing device 911 to determine the convergence of the clustering process. For example, the word classification unit 60 determines that the clustering process has converged if the provisional correct answer class assigned to any word does not differ from the class assigned in the previous iteration, and aborts the iteration. Alternatively, the word classification unit 60 forcibly terminates the repetition when the number of repetitions η reaches a predetermined maximum number of repetitions.
If it determines that the iteration is not to be terminated, the word classification unit 60 returns the process to the weight optimization step S820.
If it determines that the iteration is to be terminated, the word classification unit 60 ends the word classification process S804.
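The control flow of steps S820 to S840 can be sketched as follows. This is only an outline of the alternation and the two stopping rules described above; `fit_weights` and `assign_classes` are hypothetical callables standing in for the weight optimization step and the classification calculation step, and their signatures are assumptions.

```python
def alternate_until_converged(assignment, fit_weights, assign_classes, max_iterations):
    """Alternate weight optimization and constrained class assignment.

    assignment     : dict mapping each word to its provisional correct answer class
    fit_weights    : callable(assignment) -> model parameters        (stands in for S820)
    assign_classes : callable(parameters) -> new assignment dict     (stands in for S830)
    Returns (assignment, parameters, number of iterations performed).
    """
    weights = None
    for eta in range(1, max_iterations + 1):
        weights = fit_weights(assignment)
        new_assignment = assign_classes(weights)
        # Convergence (S840): no word changed class since the previous iteration.
        if new_assignment == assignment:
            return new_assignment, weights, eta
        assignment = new_assignment
    # Forced termination: the maximum number of iterations was reached.
    return assignment, weights, max_iterations


if __name__ == "__main__":
    target = {"NO": 3, "LINE": 0}
    result = alternate_until_converged(
        {"NO": 0, "LINE": 0},
        fit_weights=lambda a: dict(a),      # dummy optimizer
        assign_classes=lambda w: target,    # dummy classifier
        max_iterations=10,
    )
    print(result)
```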

図１８は、この実施の形態における初期化工程Ｓ８１０の流れの一例を示すフロー図である。 FIG. 18 is a flowchart showing an example of the flow of the initialization step S810 in this embodiment.

初期化工程Ｓ８１０は、例えば、単語選択工程Ｓ８１１と、暫定クラス割当工程Ｓ８１２と、カラム選択工程Ｓ８１３と、兄弟カラム選択工程Ｓ８１４と、排他制約判定工程Ｓ８１５とを有する。 The initialization step S810 includes, for example, a word selection step S811, a provisional class assignment step S812, a column selection step S813, a sibling column selection step S814, and an exclusion constraint determination step S815.

単語選択工程Ｓ８１１において、単語分類部６０は、まだ暫定正解クラスを割り当てていないカラム名構成単語２７のなかから、単語ζを一つ選択する。
全語彙（カラム名構成単語２７）に暫定正解クラスを付与済みであり、選択すべき単語が存在しない場合、すべての単語に暫定正解クラスを割り当てたので、単語分類部６０は、初期化工程Ｓ８１０を終了する。
暫定正解クラスを未付与の語彙がある場合、単語分類部６０は、暫定正解クラスを割り当てていない単語を一つ取り出し、単語ζとする。単語分類部６０は、暫定クラス割当工程Ｓ８１２へ処理を進める。 In the word selection step S811, the word classification unit 60 selects one word ζ from the column name constituent words 27 to which no provisional correct answer class has been assigned yet.
If a provisional correct answer class has already been assigned to the entire vocabulary (all column name constituent words 27) and there is no word left to select, every word has a provisional correct answer class, so the word classification unit 60 ends the initialization step S810.
If some word in the vocabulary has not yet been assigned a provisional correct answer class, the word classification unit 60 takes one such word and sets it as the word ζ. The word classification unit 60 then proceeds to the provisional class assignment step S812.

暫定クラス割当工程Ｓ８１２において、単語分類部６０は、単語選択工程Ｓ８１１で選択した単語ζに対してまだ割り当てていないクラスのなかから、暫定正解クラスを一つ選択する。例えば、単語分類部６０は、すべてのクラスをランダムな順序に並べ替えたリストを生成して、暫定正解クラス候補リストとする。単語分類部６０は、暫定正解クラス候補リストの先頭にあるクラスを暫定正解クラスとして選択し、選択したクラスを暫定正解クラス候補リストから削除する。単語分類部６０は、その単語ζについての暫定正解クラス候補リストを記憶しておく。次回の暫定クラス割当工程Ｓ８１２を同じ単語ζに対して実行する場合、単語分類部６０は、その単語ζについて記憶しておいた暫定正解クラス候補リストを使い、暫定正解クラス候補リストの先頭にあるクラスを暫定正解クラスとして選択し、選択したクラスを暫定正解クラス候補リストから削除する。
単語ζに対してまだ割り当てていないクラスが存在しない場合、例えば、その単語ζについて記憶している暫定正解クラス候補リストが空である場合、単語分類部６０は、その単語ζに対してクラスを割り当てることができない。単語分類部６０は、その単語ζに対してクラスを割り当てることを諦め、前回の単語選択工程Ｓ８１１で選択した単語にロールバックする。単語分類部６０は、単語選択工程Ｓ８１１で選択した単語ζを未選択に戻し、前回の単語選択工程Ｓ８１１で選択した単語に対して割り当てた暫定正解クラスを変更する。単語分類部６０は、前回の単語選択工程Ｓ８１１で選択した単語ζに対して、まだ割り当てていないクラスのなかから、暫定正解クラスを一つ選択する。例えば、単語分類部６０は、前回の単語選択工程Ｓ８１１で選択した単語ζについて記憶しておいた暫定正解クラス候補リストの先頭にあるクラスを暫定正解クラスとして選択し、選択したクラスを暫定正解クラス候補リストから削除する。 In the provisional class assignment step S812, the word classification unit 60 selects one provisional correct answer class from among the classes not yet assigned to the word ζ selected in the word selection step S811. For example, the word classification unit 60 generates a list of all classes arranged in random order and uses it as the provisional correct answer class candidate list. The word classification unit 60 selects the class at the head of the provisional correct answer class candidate list as the provisional correct answer class and deletes the selected class from the list. The word classification unit 60 stores the provisional correct answer class candidate list for the word ζ. The next time the provisional class assignment step S812 is executed for the same word ζ, the word classification unit 60 uses the candidate list stored for that word, selects the class at its head as the provisional correct answer class, and deletes the selected class from the list.
When there is no class that has not yet been assigned to the word ζ, for example, when the provisional correct answer class candidate list stored for the word ζ is empty, the word classification unit 60 cannot assign a class to the word ζ. The word classification unit 60 gives up assigning a class to the word ζ and rolls back to the word selected in the previous word selection step S811. The word classification unit 60 returns the word ζ selected in the word selection step S811 to the unselected state and changes the provisional correct answer class assigned to the word selected in the previous word selection step S811. The word classification unit 60 selects one provisional correct answer class for the word ζ selected in the previous word selection step S811 from among the classes not yet assigned to it. For example, the word classification unit 60 selects, as the provisional correct answer class, the class at the head of the provisional correct answer class candidate list stored for the word ζ selected in the previous word selection step S811, and deletes the selected class from the candidate list.

カラム選択工程Ｓ８１３において、単語分類部６０は、全カラムのなかから、暫定クラス割当工程Ｓ８１２で暫定正解クラスを割り当てた単語ζを含むカラムを抽出する。単語分類部６０は、抽出したカラムのなかから、そのカラムのカラム名に含まれるすべての単語（全構成語）に対して暫定正解クラスを割り当てたカラムを抽出して、カラム集合Ｃ１とする。
カラム集合Ｃ１が空集合であって、カラム集合Ｃ１に属するカラムが存在しない場合、単語分類部６０は、単語選択工程Ｓ８１１に処理を戻し、次の単語ζを選択する。
カラム集合Ｃ１が空集合でなく、カラム集合Ｃ１に属するカラムが存在する場合、単語分類部６０は、カラム集合Ｃ１に属するカラムのなかからカラムを１つ選択し、選択したカラムをカラム集合Ｃ１から削除する。単語分類部６０は、その単語ζについてのカラム集合Ｃ１を記憶しておく。次回のカラム選択工程Ｓ８１３を同じ単語ζに対して実行する場合、単語分類部６０は、記憶しておいたカラム集合Ｃ１を使い、カラム集合Ｃ１に属するカラムのなかから、カラムを１つ選択し、選択したカラムをカラム集合Ｃ１から削除する。
単語分類部６０は、選択したカラムについてクラスベクトル６３を生成し、生成したクラスベクトル６３を記憶しておく。単語分類部６０は、兄弟カラム選択工程Ｓ８１４へ処理を進める。 In the column selection step S813, the word classification unit 60 extracts a column including the word ζ to which the provisional correct answer class is assigned in the provisional class assignment step S812 from all the columns. The word classification unit 60 extracts, from the extracted columns, a column to which a provisional correct answer class is assigned to all words (all constituent words) included in the column name of the column, and sets it as a column set C1.
When the column set C1 is an empty set and there is no column belonging to the column set C1, the word classification unit 60 returns to the word selection step S811 and selects the next word ζ.
When the column set C1 is not empty, that is, when there is a column belonging to the column set C1, the word classification unit 60 selects one column from the column set C1 and deletes the selected column from the column set C1. The word classification unit 60 stores the column set C1 for the word ζ. The next time the column selection step S813 is executed for the same word ζ, the word classification unit 60 uses the stored column set C1, selects one column from it, and deletes the selected column from the column set C1.
The word classification unit 60 generates a class vector 63 for the selected column and stores the generated class vector 63. The word classification unit 60 then proceeds to the sibling column selection step S814.

兄弟カラム選択工程Ｓ８１４において、単語分類部６０は、全カラムのなかから、カラム選択工程Ｓ８１３で選択したカラムと同じテーブルに属するカラムを抽出する。単語分類部６０は、抽出したカラムのなかから、そのカラムのカラム名に含まれるすべての単語に対して暫定正解クラスを割り当てたカラムを抽出して、カラム集合Ｃ２とする。カラム集合Ｃ２には、カラム選択工程Ｓ８１３で選択したカラムが必ず含まれるので、単語分類部６０は、カラム選択工程Ｓ８１３で選択したカラムをカラム集合Ｃ２から削除する。
カラム集合Ｃ２が空集合であって、カラム集合Ｃ２に属するカラムが存在しない場合、単語分類部６０は、カラム選択工程Ｓ８１３に処理を戻し、次のカラムをカラム集合Ｃ１から選択する。
カラム集合Ｃ２が空集合でなく、カラム集合Ｃ２に属するカラムが存在する場合、単語分類部６０は、カラム集合Ｃ２に属するカラムのなかからカラムを１つ選択し、選択したカラムをカラム集合Ｃ２から削除する。単語分類部６０は、そのカラムについてのカラム集合Ｃ２を記憶しておく。次回の兄弟カラム選択工程Ｓ８１４を同じカラムに対して実行する場合、単語分類部６０は、記憶しておいたカラム集合Ｃ２を使い、カラム集合Ｃ２に属するカラムのなかから、カラムを１つ選択し、選択したカラムをカラム集合Ｃ２から削除する。
単語分類部６０は、排他制約判定工程Ｓ８１５へ処理を進める。 In the sibling column selection step S814, the word classification unit 60 extracts columns belonging to the same table as the column selected in the column selection step S813 from all the columns. The word classification unit 60 extracts, from the extracted columns, a column to which a provisional correct answer class is assigned to all words included in the column name of the column, and sets it as a column set C2. Since the column set C2 always includes the column selected in the column selection step S813, the word classification unit 60 deletes the column selected in the column selection step S813 from the column set C2.
When the column set C2 is an empty set and there is no column belonging to the column set C2, the word classification unit 60 returns the process to the column selection step S813 and selects the next column from the column set C1.
When the column set C2 is not empty, that is, when there is a column belonging to the column set C2, the word classification unit 60 selects one column from the column set C2 and deletes the selected column from the column set C2. The word classification unit 60 stores the column set C2 for that column. The next time the sibling column selection step S814 is executed for the same column, the word classification unit 60 uses the stored column set C2, selects one column from it, and deletes the selected column from the column set C2.
The word classification unit 60 advances the processing to the exclusion constraint determination step S815.

排他制約判定工程Ｓ８１５において、単語分類部６０は、カラム選択工程Ｓ８１３で選択したカラムについて生成したクラスベクトル６３と、兄弟カラム選択工程Ｓ８１４で選択したカラムについて記憶しておいたクラスベクトル６３とを比較する。
２つのクラスベクトル６３が一致している場合、単語分類部６０は、制約違反であると判定し、暫定クラス割当工程Ｓ８１２に処理を戻して、単語ζに対して次の暫定正解クラスを割り当てる。
２つのクラスベクトル６３が異なる場合、単語分類部６０は、制約を充足していると判定し、兄弟カラム選択工程Ｓ８１４に処理を戻して、次のカラムをカラム集合Ｃ２から選択する。 In the exclusion constraint determination step S815, the word classification unit 60 compares the class vector 63 generated for the column selected in the column selection step S813 with the class vector 63 stored for the column selected in the sibling column selection step S814.
When the two class vectors 63 match, the word classification unit 60 determines that the constraint is violated, returns the process to the provisional class assignment step S812, and assigns the next provisional correct answer class to the word ζ.
When the two class vectors 63 are different, the word classification unit 60 determines that the constraint is satisfied, returns the process to the sibling column selection step S814, and selects the next column from the column set C2.
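The check performed in steps S813 to S815 can be sketched as follows. This is a simplified illustration rather than the flow of FIG. 18: it scans all columns of one table at once instead of iterating over the column sets C1 and C2. The representation of a class vector as a sorted tuple of class numbers, the splitting of column names on underscores, and the function names are assumptions.

```python
def class_vector(column_name, word_class):
    """Return the class vector of a column, or None if a constituent word is unassigned.

    The vector is modelled here (an assumption) as the sorted tuple of the
    classes assigned to the words that make up the column name.
    """
    words = column_name.split("_")
    if any(w not in word_class for w in words):
        return None
    return tuple(sorted(word_class[w] for w in words))


def violates_exclusion(table_columns, word_class):
    """True if two fully assigned columns of the same table share a class vector."""
    seen = set()
    for column in table_columns:
        vec = class_vector(column, word_class)
        if vec is None:
            continue  # columns with unassigned words are not checked yet
        if vec in seen:
            return True
        seen.add(vec)
    return False


if __name__ == "__main__":
    columns = ["LINE_NO", "PHONE_NO"]
    print(violates_exclusion(columns, {"LINE": 1, "PHONE": 1, "NO": 4}))
    print(violates_exclusion(columns, {"LINE": 1, "PHONE": 2, "NO": 4}))
```

In the first call the two column names receive the same combination of classes, which is the situation the exclusion constraint forbids; in the second call the combinations differ.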

図１９は、この実施の形態における分類算出工程Ｓ８３０の流れの一例を示すフロー図である。 FIG. 19 is a flowchart showing an example of the flow of the classification calculation step S830 in this embodiment.

分類算出工程Ｓ８３０は、例えば、単語選択工程Ｓ８３１と、暫定クラス割当工程Ｓ８３２と、カラム選択工程Ｓ８１３と、兄弟カラム選択工程Ｓ８１４と、排他制約判定工程Ｓ８１５とを有する。 The classification calculation step S830 includes, for example, a word selection step S831, a provisional class assignment step S832, a column selection step S813, a sibling column selection step S814, and an exclusive constraint determination step S815.

単語選択工程Ｓ８３１において、単語分類部６０は、すべての単語について、すべてのクラスに対する平均尤度を算出する。単語分類部６０は、例えば次の式により、単語ζがクラスｙ_ｋに属する尤度をすべての出現箇所について平均した平均尤度ｐ￣（ｙ_ｋ｜ζ）を算出する。

ただし、Ｘ_ζは、単語ｘ_ｉがζである添え字ｉの集合を表わす。 In the word selection step S831, the word classification unit 60 calculates, for every word, the average likelihood for every class. For example, the word classification unit 60 uses the following equation to calculate the average likelihood p̄(y_k | ζ), obtained by averaging the likelihood that the word ζ belongs to the class y_k over all of its occurrences.

Here, X_ζ denotes the set of indices i for which the word x_i is ζ.

単語分類部６０は、それぞれの単語ζについて、平均尤度が大きい順にクラスを並べたリストを生成し、その単語ζについての暫定正解クラスリストとする。
単語分類部６０は、暫定正解クラスリストの先頭にあるクラスに対する平均尤度（すなわち、平均尤度の最大値）が大きい順に単語ζを並べたリストを生成し、割当順単語リストとする。
単語分類部６０は、割当順単語リストのなかから、未処理の単語を１つ選択する。
割当順単語リストに未処理の単語が存在しない場合、すべての単語に対するクラス割当てが完了したので、単語分類部６０は、分類算出工程Ｓ８３０を終了する。
割当順単語リストに未処理の単語が存在する場合、単語分類部６０は、未処理の単語のなかで一番前にある単語（すなわち、最大平均尤度が一番大きい単語）を選択して、単語ζとする。単語分類部６０は、暫定クラス割当工程Ｓ８３２へ処理を進める。 The word classification unit 60 generates a list in which classes are arranged in descending order of average likelihood for each word ζ, and sets it as a provisional correct answer class list for the word ζ.
The word classifying unit 60 generates a list in which the words ζ are arranged in descending order of the average likelihood (that is, the maximum value of the average likelihood) for the class at the head of the provisional correct answer class list, and sets it as the assignment order word list.
The word classification unit 60 selects one unprocessed word from the assignment order word list.
If there is no unprocessed word in the assignment order word list, class assignment has been completed for all the words, so the word classification unit 60 ends the classification calculation step S830.
If there is an unprocessed word in the assignment order word list, the word classification unit 60 selects the frontmost unprocessed word (that is, the word whose maximum average likelihood is largest) and sets it as the word ζ. The word classification unit 60 then proceeds to the provisional class assignment step S832.

暫定クラス割当工程Ｓ８３２において、単語分類部６０は、単語選択工程Ｓ８３１で選択した単語ζに対して、単語選択工程Ｓ８３１で生成した暫定正解クラスリストを使って、クラスを割り当てる。単語分類部６０は、暫定正解クラスリストのなかから、未選択のクラスを１つ選択する。
暫定正解クラスリストに未選択のクラスが存在しない場合、その単語ζに対してクラスを割り当てることができないので、単語分類部６０は、１つ前の単語にロールバックする。単語分類部６０は、単語選択工程Ｓ８３１で選択した単語ζについての暫定正解クラスリストをすべて未選択状態に戻す。単語分類部６０は、割当順単語リストの処理済の単語のなかから、一番後ろにある単語を選択して、単語ζとし、未処理状態に戻す。単語分類部６０は、ロールバックした単語ζについての暫定正解クラスリストのなかから、未選択のクラスを１つ選択する。
暫定正解クラスリストに未選択のクラスが存在する場合、単語分類部６０は、未選択のクラスのなかで一番前にあるクラス（すなわち、平均尤度が一番大きいクラス）を選択する。 In the provisional class assignment step S832, the word classification unit 60 assigns a class to the word ζ selected in the word selection step S831, using the provisional correct answer class list generated in the word selection step S831. The word classification unit 60 selects one unselected class from the provisional correct answer class list.
If no unselected class exists in the provisional correct answer class list, a class cannot be assigned to the word ζ, so the word classification unit 60 rolls back to the previous word. The word classification unit 60 returns every class in the provisional correct answer class list for the word ζ selected in the word selection step S831 to the unselected state. The word classification unit 60 selects the rearmost of the processed words in the assignment order word list, sets it as the word ζ, and returns it to the unprocessed state. The word classification unit 60 selects one unselected class from the provisional correct answer class list for the rolled-back word ζ.
When there is an unselected class in the provisional correct answer class list, the word classification unit 60 selects the class that is at the forefront among the unselected classes (that is, the class with the highest average likelihood).

カラム選択工程Ｓ８１３、兄弟カラム選択工程Ｓ８１４及び排他制約判定工程Ｓ８１５は、クラスの割当てが排他制約に違反しないかを検証する工程であり、初期化工程Ｓ８１０と同様である。 The column selection step S813, the sibling column selection step S814, and the exclusion constraint determination step S815 are steps for verifying whether the class assignment does not violate the exclusion constraint, and are the same as the initialization step S810.

初期化工程Ｓ８１０では、単語の選択順やクラスの選択順がランダムであるのに対し、分類算出工程Ｓ８３０では、最大平均尤度が高い順に単語を選択し、平均尤度が高い順にクラスを選択する。これにより、排他制約に反しないという条件を充足しつつ、全体として尤度が高いクラス割当を見つけることができる。 In the initialization step S810, the word selection order and the class selection order are random, whereas in the classification calculation step S830, the words are selected in descending order of the maximum average likelihood, and the classes are selected in descending order of the average likelihood. To do. As a result, it is possible to find a class assignment having a high likelihood as a whole while satisfying the condition that the exclusive constraint is not violated.
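The ordering and rollback described for the classification calculation step S830 can be sketched as a small backtracking search. This is an illustrative outline, not the flow of FIG. 19: the input format (a list of per-occurrence likelihoods), the zero-based class indices, the sorted-tuple class vector, and all function names are assumptions.

```python
from collections import defaultdict


def average_likelihoods(occurrences):
    """occurrences: list of (word, [p(y_1), ..., p(y_K)]), one entry per appearance."""
    sums, counts = {}, defaultdict(int)
    for word, probs in occurrences:
        acc = sums.setdefault(word, [0.0] * len(probs))
        for k, p in enumerate(probs):
            acc[k] += p
        counts[word] += 1
    return {w: [s / counts[w] for s in acc] for w, acc in sums.items()}


def violates(tables, assigned):
    """True if two fully assigned columns of one table get the same class vector."""
    for columns in tables.values():
        seen = set()
        for column in columns:
            words = column.split("_")
            if all(w in assigned for w in words):
                vec = tuple(sorted(assigned[w] for w in words))
                if vec in seen:
                    return True
                seen.add(vec)
    return False


def greedy_assign(avg, tables):
    """Assign classes word by word, most confident word and class first."""
    # Per-word class list, in descending order of average likelihood.
    class_order = {w: sorted(range(len(p)), key=lambda k: -p[k]) for w, p in avg.items()}
    # Words in descending order of their maximum average likelihood.
    word_order = sorted(avg, key=lambda w: -max(avg[w]))
    assigned = {}

    def place(i):
        if i == len(word_order):
            return True
        word = word_order[i]
        for k in class_order[word]:
            assigned[word] = k
            if not violates(tables, assigned) and place(i + 1):
                return True
        assigned.pop(word, None)  # roll back to the previous word
        return False

    return dict(assigned) if place(0) else None


if __name__ == "__main__":
    occ = [("NO", [0.1, 0.1, 0.1, 0.7]), ("NO", [0.1, 0.2, 0.1, 0.6]),
           ("NO", [0.1, 0.5, 0.1, 0.3]), ("LINE", [0.6, 0.2, 0.1, 0.1]),
           ("PHONE", [0.5, 0.3, 0.1, 0.1])]
    print(greedy_assign(average_likelihoods(occ), {"T0001": ["LINE_NO", "PHONE_NO"]}))
```

In this toy input the word "PHONE" prefers the same class as "LINE", but that choice would give both column names the same class vector, so the search falls back to the second class in its list.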

図２０は、この実施の形態における分類算出工程Ｓ８３０単語分類部６０の動作の一例を説明するための図である。 FIG. 20 is a diagram for explaining an example of the operation of the word classification unit 60 in the classification calculation step S830 in this embodiment.

単語選択工程Ｓ８３１において、単語分類部６０は、異なる位置に出現する同じ綴りの単語を区別して、それぞれのカラム名構成単語２７について、それぞれのクラスに対する尤度６６を算出する。
単語分類部６０は、異なる位置に出現する同じ綴りの単語について算出した尤度６６に基づいて、平均尤度６７を算出する。この例では、単語「ＮＯ」が３箇所に出現しているので、単語分類部６０は、それぞれのクラスについて、算出した３つの尤度６６から、平均尤度６７を算出する。
単語分類部６０は、異なる位置に出現する同じ綴りの単語を区別せず、それぞれの単語について、算出した平均尤度６７が大きい順にクラスを並べた暫定正解クラスリスト６８を生成する。また、単語分類部６０は、異なる位置に出現する同じ綴りの単語を区別せず、算出した平均尤度６７のうち最大の平均尤度６７が大きい順に単語を並べた割当順単語リスト６９を生成する。 In the word selection step S831, the word classification unit 60 distinguishes between the same spelled words appearing at different positions, and calculates the likelihood 66 for each class for each column name constituent word 27.
The word classification unit 60 calculates the average likelihood 67 based on the likelihood 66 calculated for the same spelled word appearing at different positions. In this example, since the word “NO” appears in three places, the word classification unit 60 calculates an average likelihood 67 from the calculated three likelihoods 66 for each class.
The word classification unit 60 then, without distinguishing between words of the same spelling that appear at different positions, generates for each word a provisional correct answer class list 68 in which the classes are arranged in descending order of the calculated average likelihood 67. Likewise without distinguishing between words of the same spelling that appear at different positions, the word classification unit 60 generates an assignment order word list 69 in which the words are arranged in descending order of the largest of their calculated average likelihoods 67.

単語分類部６０は、割当順単語リスト６９の上位に位置する単語から順に、クラスを割り当てる。単語分類部６０は、暫定正解クラスリスト６８の先頭に位置するクラスを割り当て、それで排他制約違反になる場合には、次のクラスを割り当てる。 The word classification unit 60 assigns classes in order from the word positioned at the top of the assignment order word list 69. The word classifying unit 60 assigns the class located at the head of the provisional correct answer class list 68, and if it violates the exclusion constraint, assigns the next class.

例えば、単語分類部６０は、単語選択工程Ｓ８３１で、割当順単語リスト６９の先頭にある単語「ＮＯ」を選択し、暫定クラス割当工程Ｓ８３２で、その単語の暫定正解クラスリスト６８の先頭にあるクラス｛ｙ４｝を選択し、単語「ＮＯ」にクラス｛ｙ４｝を割り当てる。
上述したように、単語「ＮＯ」は、３箇所に出現している。それぞれの出現位置について別々に算出した尤度６６によれば、テーブル｛Ｔ０００１｝のカラム｛Ｃ００１２｝で尤度６６が一番大きいのはクラス｛ｙ４｝であり、テーブル｛Ｔ０００１｝のカラム｛Ｃ００１３｝で重みパラメータベクトル６５が一番大きいのもクラス｛ｙ４｝であるが、テーブル｛Ｔ０００２｝のカラム｛Ｃ００２２｝で尤度６６が一番大きいのはクラス｛ｙ２｝である。しかし、単語分類部６０は、出現位置を区別せずに単語「ＮＯ」にクラス｛ｙ４｝を割り当てる。これにより、一致制約を充足する割当てを見つけることができる。 For example, in the word selection step S831, the word classification unit 60 selects the word “NO” at the head of the assignment order word list 69, and in the provisional class assignment step S832, it selects the class {y4} at the head of the provisional correct answer class list 68 for that word and assigns the class {y4} to the word “NO”.
As described above, the word “NO” appears in three places. According to the likelihood 66 calculated separately for each appearance position, the class {y4} has the largest likelihood 66 in the column {C0012} of the table {T0001}, and the class {y4} also has the largest weight parameter vector 65 in the column {C0013} of the table {T0001}, but the class {y2} has the largest likelihood 66 in the column {C0022} of the table {T0002}. Nevertheless, the word classification unit 60 assigns the class {y4} to the word “NO” without distinguishing between appearance positions. This makes it possible to find an assignment that satisfies the matching constraint.

次に、単語分類部６０は、単語選択工程Ｓ８３１で、割当順単語リスト６９の２番目にある単語「ＩＴＥＭ」を選択し、暫定クラス割当工程Ｓ８３２で、その単語の暫定正解クラスリスト６８の先頭にあるクラス｛ｙ１｝を選択し、単語「ＩＴＥＭ」にクラス｛ｙ１｝を割り当てる。この時点で、テーブル｛Ｔ０００２｝のカラム｛Ｃ００２２｝について、カラム名に含まれるすべての単語についてクラスを割り当てたので、単語分類部６０は、カラム選択工程Ｓ８１３で、クラスベクトル６３を生成する。このように、単語分類部６０は、すべての単語にクラスを割り当てたカラムから順に、クラスベクトル６３を生成していく。テーブル｛Ｔ０００２｝に属する他のカラムについては、クラスベクトル６３をまだ生成していないので、単語分類部６０は、排他制約違反ではないと判定する。 Next, the word classification unit 60 selects the second word “ITEM” in the allocation order word list 69 in the word selection step S831, and in the provisional class assignment step S832, the head of the provisional correct answer class list 68 of the word. Class {y1} is selected and class {y1} is assigned to the word “ITEM”. At this point, since the classes are assigned to all the words included in the column name for the column {C0022} of the table {T0002}, the word classification unit 60 generates the class vector 63 in the column selection step S813. In this way, the word classification unit 60 generates the class vector 63 in order from the column in which classes are assigned to all words. For the other columns belonging to the table {T0002}, the class vector 63 has not been generated yet, so the word classification unit 60 determines that it is not an exclusive constraint violation.

このようにして、単語分類部６０は、割当順単語リスト６９の順に、単語にクラスを割り当てていく。 In this way, the word classification unit 60 assigns classes to words in the order of the assignment order word list 69.

分類算出工程Ｓ８３０が単語「ＤＡＴＥ」まで終わったとする。この時点で、単語分類部６０は、テーブル｛Ｔ０００１｝のカラム｛Ｃ００１２｝と、テーブル｛Ｔ０００２｝のカラム｛Ｃ００２２｝と、テーブル｛Ｔ０００３｝のカラム｛Ｃ００３２｝とについて、カラム名に含まれるすべての単語についてクラスを割り当て終わり、クラスベクトル６３を生成している。
次に、単語分類部６０は、単語選択工程Ｓ８３１で、割当順単語リスト６９の次にある単語「ＰＨＯＮＥ」を選択し、その単語の暫定正解クラスリスト６８の先頭にあるクラス｛ｙ１｝を選択し、単語「ＰＨＯＮＥ」にクラス｛ｙ１｝を割り当てる。テーブル｛Ｔ０００１｝のカラム｛Ｃ００１３｝について、カラム名に含まれるすべての単語についてクラスを割り当てたので、単語分類部６０は、カラム選択工程Ｓ８１３で、クラスベクトル６３を生成する。しかし、生成したクラスベクトル６３は、同じテーブル｛Ｔ０００１｝のカラム｛Ｃ００１２｝について生成したクラスベクトル６３と一致しているので、単語分類部６０は、排他制約違反であると判定する。そこで、単語分類部６０は、暫定クラス割当工程Ｓ８３２に処理を戻し、単語「ＰＨＯＮＥ」の暫定正解クラスリスト６８の２番目にあるクラス｛ｙ２｝を選択し、単語「ＰＨＯＮＥ」にクラス｛ｙ２｝を割り当て直す。これにより、排他制約を充足する割当てを見つけることができる。 Suppose that the classification calculation step S830 has been completed up to the word “DATE”. At this point, for the column {C0012} of the table {T0001}, the column {C0022} of the table {T0002}, and the column {C0032} of the table {T0003}, the word classification unit 60 has finished assigning classes to all the words included in the column names and has generated the class vectors 63.
Next, in the word selection step S831, the word classification unit 60 selects the next word in the assignment order word list 69, “PHONE”, selects the class {y1} at the head of the provisional correct answer class list 68 for that word, and assigns the class {y1} to the word “PHONE”. Since classes have now been assigned to all the words included in the column name of the column {C0013} of the table {T0001}, the word classification unit 60 generates a class vector 63 in the column selection step S813. However, because the generated class vector 63 matches the class vector 63 generated for the column {C0012} of the same table {T0001}, the word classification unit 60 determines that the exclusion constraint is violated. The word classification unit 60 therefore returns the process to the provisional class assignment step S832, selects the class {y2}, which is second in the provisional correct answer class list 68 for the word “PHONE”, and reassigns the class {y2} to the word “PHONE”. This makes it possible to find an assignment that satisfies the exclusion constraint.

このようにして、教師なし学習だけでは識別が難しいサンプルにおいても、大域的な識別状況とタスクのヒューリスティクスに基づいて、正しいクラスを割り当てることができ、高精度なクラスタリングを実現できる。 In this way, even for samples that are difficult to identify by unsupervised learning alone, the correct class can be assigned based on the global identification situation and task heuristics, and high-precision clustering can be realized.

この実施の形態における関連語抽出方式（関連カラム判定装置９０３、関連語分類装置）は、テキストを含むデータから多クラス識別モデルを用いて関連語を抽出する。
単語分類部６０（重みベクトル算出部）は、単語のクラスを識別するためのモデルパラメータを、データ中に出現する単語に割り当てられた暫定正解クラスに基づいて最適化する。
単語分類部６０は、現在のモデルパラメータを使って求めたクラスの尤度に基づいて、クラス割り当てに関する制約を充足するクラスのなかで最尤のクラスを選択し、新たな暫定正解クラスとして単語に割り当てる。
単語分類部６０は、この２つを交互に反復処理した後に、同一クラスに分類された単語を関連語として抽出する。 The related word extraction method (related column determination device 903, related word classification device) in this embodiment extracts related words from data including text using a multi-class identification model.
The word classification unit 60 (weight vector calculation unit) optimizes the model parameters for identifying the word class based on the provisional correct answer class assigned to the word appearing in the data.
Based on the class likelihoods obtained using the current model parameters, the word classification unit 60 selects the maximum likelihood class from among the classes that satisfy the constraints on class assignment, and assigns it to the word as a new provisional correct answer class.
After alternately iterating these two processes, the word classification unit 60 extracts the words classified into the same class as related words.

前記テキストを含むデータは、データベースのスキーマ定義データ（テーブルスキーマ）である。 The data including the text is database schema definition data (table schema).

前記クラス割り当てに関する制約は、データ構造中の姉妹データ要素が、異なるクラスのデータ要素を有するような割り当てである。 The constraint on the class assignment is an assignment such that sister data elements in the data structure have different classes of data elements.

この実施の形態における関連語抽出方式によれば、関連語をテーブルスキーマ群から高い精度で自動抽出することができる。 According to the related word extraction method in this embodiment, related words can be automatically extracted from a table schema group with high accuracy.

この実施の形態における関連語抽出方式は、「類似した単語は類似した文脈で出現する」との仮定に基づいて、共起語の出現頻度ベクトルをもとに適当な重み付けを行って特徴ベクトルとし、その類似度を基準に関連語を抽出する。
しかし、テーブルスキーマなどの構造化データは、新聞記事やウェブページなどと比べて、利用可能なデータ量が限定される。このため、共起頻度ベクトルがスパースとなり、共起語だけでは単語間の類似性を十分説明できないことがある。
そこで、この実施の形態における関連語抽出方式は、更に、概念辞書を用いて特徴量を追加する。 The related word extraction method in this embodiment is based on the assumption that “similar words appear in similar contexts” and is weighted appropriately based on the appearance frequency vector of co-occurrence words as feature vectors. Then, related words are extracted based on the similarity.
However, the amount of data that can be used for structured data such as a table schema is limited compared to newspaper articles and web pages. For this reason, the co-occurrence frequency vector becomes sparse, and the similarity between words may not be sufficiently explained only by the co-occurrence words.
Therefore, the related word extraction method in this embodiment further adds feature amounts using a concept dictionary.

しかし、概念辞書を用いて追加した特徴量は、自然言語処理一般に存在する課題である曖昧性や多義性の影響を受ける。
そこで、この実施の形態における関連語抽出方式は、以下の３つの仮定を導入して、単語の使い分けを判別する。 However, the features added using the concept dictionary are affected by ambiguity and polysemy, which are problems common to natural language processing in general.
Therefore, the related word extraction method in this embodiment introduces the following three assumptions to determine the proper use of words.

仮定（１）：関連語は、出現文脈の特徴または概念辞書における登録階層が近い。
仮定（１）は、特徴ベクトルのクラスタリングで関連語を抽出するための基本的な仮定である。仮定（１）は、特徴ベクトル６４の構成に反映されている。文脈ベクトル５４及び概念ベクトル４４を含む特徴ベクトル６４を使うことにより、仮定（１）に沿ったクラスタリングを実現する。 Assumption (1): Related words have similar appearance-context features or are close to each other in the registration hierarchy of the concept dictionary.
Assumption (1) is a basic assumption for extracting related words by clustering feature vectors. Assumption (1) is reflected in the configuration of the feature vector 64. By using the feature vector 64 including the context vector 54 and the concept vector 44, clustering according to the assumption (1) is realized.

仮定（２）：単語は、出現文脈に依らず一貫して同じ概念を表している。
仮定（２）は、スパースなスキーマデータで頻度を集約するために導入するやや強い仮定である。仮定（２）は、上述の一致制約に反映されている。単語分類部６０が、異なる位置に出現する同じ綴りの単語に対して、同じクラスを割り当てることにより、仮定（２）に沿ったクラスタリングを実現する。 Assumption (2): Words consistently represent the same concept regardless of appearance context.
Assumption (2) is a rather strong assumption that is introduced in order to aggregate frequencies over sparse schema data. Assumption (2) is reflected in the matching constraint described above. The word classification unit 60 realizes clustering in line with Assumption (2) by assigning the same class to words of the same spelling that appear at different positions.

仮定（３）：カラム名は、テーブル内で異なる概念を表している。
仮定（３）は、テーブル設計のヒューリスティックスに基づく最も強い仮定である。仮定（３）は、上述の排他制約に反映されている。カラム名が複数の単語で構成されている場合、カラム名が表わす概念は、それを構成する単語が表わす概念を総合したものと考えられる。そこで、仮定（３）に基づいて、単語分類部６０は、同じテーブルに属する複数のカラムの間で、カラム名を構成する単語に割り当てたクラスの組合せが異なるよう、単語にクラスを割り当てる。 Assumption (3): Column names within a table represent different concepts.
Assumption (3) is the strongest assumption based on heuristics of table design. Assumption (3) is reflected in the exclusion constraint described above. When the column name is composed of a plurality of words, the concept represented by the column name is considered to be a combination of the concepts represented by the words constituting the column name. Therefore, based on Assumption (3), the word classification unit 60 assigns classes to words so that combinations of classes assigned to words constituting column names are different among a plurality of columns belonging to the same table.

上述の例において、「ＰＵＲＣＨＡＳＥ＿ＯＲＤＥＲ」テーブルには、「ＬＩＮＥ＿ＮＯ」カラムと「ＰＨＯＮＥ＿ＮＯ」カラムとが属する。仮に、単語「ＬＩＮＥ」と単語「ＰＨＯＮＥ」とを同じクラスに分類したとすると、「ＬＩＮＥ＿ＮＯ」と「ＰＨＯＮＥ＿ＮＯ」とを概念的に区別できなくなる。仮定（３）は、このような直観にそぐわないクラスタリングを回避するための仮定である。 In the above example, the “PURCHASE_ORDER” table contains a “LINE_NO” column and a “PHONE_NO” column. If the word “LINE” and the word “PHONE” were classified into the same class, “LINE_NO” and “PHONE_NO” could no longer be distinguished conceptually. Assumption (3) is introduced to avoid such counterintuitive clustering.

以上のように、この実施の形態における関連語抽出方式は、単語特徴ベクトルのクラスタリングにおいて、特徴ベクトルの類似度を尺度とするだけでなく、元データから取得するヒューリスティクスをクラス分類の一致制約ないし排他制約として利用して、クラスタリングをする。
このように制約条件としてヒューリスティクスを利用する方式は、特徴ベクトルの類似度にペナルティを加えるようなアドホックな方式と異なり、教師あり学習で用いられる識別モデルを用いたクラスタリングにおいて匿名クラス割り当てに反映させて識別学習する方式である。このため、効果的にクラスタリングの精度を向上することができる。 As described above, in clustering word feature vectors, the related word extraction method in this embodiment not only uses the similarity of feature vectors as a measure but also performs the clustering by using heuristics obtained from the original data as matching constraints or exclusion constraints on the classification.
Unlike an ad hoc method that adds a penalty to the similarity of feature vectors, this method of using heuristics as constraint conditions performs discriminative learning by reflecting the heuristics in the anonymous class assignment during clustering with a discriminative model of the kind used in supervised learning. For this reason, the accuracy of clustering can be improved effectively.

更に、この実施の形態における関連語抽出方式は、ヒューリスティクスとして、スキーマ構造中の姉妹データ要素（同じテーブルに属するカラム）は、異なる概念を表すよう設計されていることを仮定し、それゆえ、それらデータ要素は互いに異なるクラスのデータ要素を有することを利用する。この仮定は、スキーマデータから容易に取得可能で、かつ、スキーマ設計においては広く成立することが期待される。これにより、類似した単語の使い分けを混同することなくクラスタリングをすることができる。 Furthermore, the related word extraction method in this embodiment assumes, as a heuristic, that sister data elements in the schema structure (columns belonging to the same table) are designed to represent different concepts, and exploits the consequence that these data elements have mutually different classes of data elements. This assumption can be obtained easily from schema data and is expected to hold widely in schema design. As a result, clustering can be performed without confusing the distinct uses of similar words.

The related word extraction method of this embodiment distinguishes words that should be treated as distinct from words that need not be, and does not extract the former as related words, thereby achieving highly accurate related word extraction.
In addition, the related word extraction method of this embodiment does not need to generate training samples from combinations of related words registered in a dictionary. It is therefore applicable even when the words of the target domain are not sufficiently registered in a dictionary in advance, and it can achieve highly accurate related word extraction even when the words of the target domain are not registered in a dictionary at all.

Note that the context analysis unit 50 may be configured to set to “1” not only the elements of the context vector 54 associated with the words located immediately before or after the column name constituent word 27, but also the elements associated with words located farther away within the same column name.
Alternatively, the context analysis unit 50 may be configured to give the elements of the context vector 54 associated with words at distant positions a value that decreases as the distance between the words increases. For example, the context analysis unit 50 may set the element associated with a word two positions before or after to “1/2” and the element associated with a word three positions before or after to “1/3”.
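The distance-dependent weighting described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `distance_weights`, the column name “SHIP_TO_PHONE_NO”, the splitting on underscores, and the choice to keep the larger weight when a word repeats are all assumptions made for the example.

```python
from fractions import Fraction


def distance_weights(words, index):
    """Weight each other word of a column name by 1/distance from words[index]."""
    weights = {}
    for j, other in enumerate(words):
        if j == index:
            continue
        # Words before the target feed prefix elements, words after it suffix elements.
        side = "prefix" if j < index else "suffix"
        # Adjacent words get 1, two positions away 1/2, three positions away 1/3.
        weight = Fraction(1, abs(j - index))
        key = (side, other)
        # Assumption: if a word repeats on the same side, keep the larger weight.
        weights[key] = max(weights.get(key, Fraction(0)), weight)
    return weights


# Hypothetical column name; the target word is "NO" at position 3.
weights = distance_weights("SHIP_TO_PHONE_NO".split("_"), 3)
```

Exact fractions are used here only so that the values “1/2” and “1/3” from the text can be compared without rounding.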

The feature vector generation unit 40 may also be configured to set to “1” not only the elements of the concept vector 44 associated with the concept represented by the column name constituent word 27 and with its superordinate concepts, but also the elements associated with subordinate concepts of that concept.

Alternatively, the feature vector generation unit 40 may be configured to give the elements of the concept vector 44 associated with concepts at distant levels of the hierarchy a value that decreases as the difference in level increases. For example, the feature vector generation unit 40 may set the element associated with the immediately superordinate concept to “1/2” and the element associated with the superordinate concept two levels up to “1/3”.
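A minimal sketch of this level-dependent weighting, assuming the concept dictionary can be represented as a mapping from each concept to its immediate superordinate concept. The concepts in `PARENT` and the function name `concept_weights` are hypothetical and not taken from the embodiment; the hierarchy is assumed to be acyclic.

```python
from fractions import Fraction

# Hypothetical concept dictionary: concept -> immediate superordinate concept.
PARENT = {"telephone": "device", "device": "artifact", "artifact": None}


def concept_weights(concept, parent=PARENT):
    """Give the concept itself 1, its parent 1/2, its grandparent 1/3, and so on."""
    weights = {}
    level = 0
    while concept is not None:
        weights[concept] = Fraction(1, level + 1)
        concept = parent.get(concept)
        level += 1
    return weights


telephone_weights = concept_weights("telephone")
```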

Alternatively, when the column name constituent word 27 is polysemous, the feature vector generation unit 40 may be configured to apportion the values given to the elements of the concept vector 44 according to how frequently each meaning is used. For example, the feature vector generation unit 40 may set the element associated with the concept that is the primary meaning of the word to “0.5”, the element associated with the secondary meaning to “0.3”, and the element associated with the tertiary meaning to “0.2”.
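One way to obtain such apportioned values is to normalize usage counts per meaning. The sketch below assumes that such counts are available; the sense labels and counts for the word “LINE” are invented for illustration, and the function name `apportion` is hypothetical.

```python
def apportion(sense_frequencies):
    """Split a total weight of 1 across the senses in proportion to their usage."""
    total = sum(sense_frequencies.values())
    return {sense: count / total for sense, count in sense_frequencies.items()}


# Hypothetical usage counts for three senses of the word "LINE".
shares = apportion({"line_of_text": 50, "telephone_line": 30, "product_line": 20})
```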

The class vector 63 is used to find exclusive constraint violations, that is, to determine whether the combination of classes assigned to the words in the column name of one column matches the combination of classes assigned to the words in the column name of another column. Instead of generating the class vector 63, the word classification unit 60 may be configured to find exclusive constraint violations using other data. For example, the word classification unit 60 may generate a list or set of the classes assigned to the words in a column name and use the generated list or set to determine whether the combination of classes assigned to the words in the column name of that column matches the combination of classes assigned to the words in the column name of another column.
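The set-based alternative can be sketched as below. The sketch assumes that column names split on underscores and that a combination of classes is compared as an unordered set; the class numbers, the helper names, and the use of `frozenset` are illustrative choices rather than details of the embodiment.

```python
def class_combination(column_name, word_class):
    """Return the set of classes assigned to the words of one column name."""
    return frozenset(word_class[word] for word in column_name.split("_"))


def violates_exclusive_constraint(column, siblings, word_class):
    """True if another column of the same table has the same class combination."""
    own = class_combination(column, word_class)
    return any(
        class_combination(sibling, word_class) == own
        for sibling in siblings
        if sibling != column
    )


table_columns = ["LINE_NO", "PHONE_NO"]
merged = {"LINE": 1, "PHONE": 1, "NO": 2}    # "LINE" and "PHONE" share a class
separate = {"LINE": 1, "PHONE": 3, "NO": 2}  # "LINE" and "PHONE" kept apart
```

With the `merged` assignment the two sibling columns collapse to the same class combination, which is the situation the exclusive constraint is meant to rule out.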

The word classification unit 60 (weight vector calculation unit) may be configured to calculate the weight parameter vector 65 (weight vector) using an optimization method other than multiclass logistic regression.
Alternatively, the word classification unit 60 may be configured to cluster the words using a clustering method other than the greedy method.

Embodiment 2.
Embodiment 2 will be described with reference to FIG. 21.
Parts in common with Embodiment 1 are denoted by the same reference signs, and their description is omitted.

The overall configuration of the related column extraction system 900, the hardware configuration of the related column determination device 903 and other devices, and the block configuration of the related column determination device 903 in this embodiment are the same as in Embodiment 1.

FIG. 21 is a diagram illustrating an example of the operation of the context analysis unit 50 in this embodiment.

The context analysis unit 50 (an example of a feature vector generation unit) generates a context vector 54 for each column name constituent word 27 without distinguishing identically spelled words that appear at different positions.
Among the elements of the context vector 54, those associated with the prefix words 51 and the suffix words 52 represent how often each word occurs immediately before or immediately after the word in question. The elements associated with the table name constituent words 53 represent the number of tables that contain a column including the word in question and whose table name includes that table name constituent word 53.

For example, for each column name constituent word 27, the context analysis unit 50 extracts, from all columns, the columns whose column names include that word, and takes them as a column set. From the column name constituent words 27 that make up the column names of the columns in the column set, the context analysis unit 50 extracts the words that immediately precede the word, and takes them as a preceding word set. In the preceding word set, the context analysis unit 50 treats identically spelled words as distinct entries. That is, when the word is preceded by an identically spelled word in several columns, the preceding word set contains that spelling several times. For each prefix word 51, the context analysis unit 50 extracts from the preceding word set the words that match the prefix word 51, and sets the number of extracted words as the value of the element of the context vector 54 associated with that prefix word 51.
Similarly, from the column name constituent words 27 that make up the column names of the columns in the column set, the context analysis unit 50 extracts the words that immediately follow the word, and takes them as a following word set. In the following word set as well, identically spelled words are treated as distinct entries. For each suffix word 52, the context analysis unit 50 extracts from the following word set the words that match the suffix word 52, and sets the number of extracted words as the value of the element of the context vector 54 associated with that suffix word 52.
The context analysis unit 50 also extracts, from all tables, the tables to which the columns in the column set belong, and takes them as a table set. In the table set, the same table is not counted more than once. That is, even when the column set contains several columns of the same table, the table set contains that table only once. The context analysis unit 50 extracts the words that make up the table names of the tables in the table set, and takes them as a table name constituent word set. In the table name constituent word set, identically spelled words are treated as distinct entries. That is, when the table names of different tables contain an identically spelled word, the table name constituent word set contains that spelling several times. For each table name constituent word 53, the context analysis unit 50 extracts from the table name constituent word set the words that match the table name constituent word 53, and sets the number of extracted words as the value of the element of the context vector 54 associated with that table name constituent word 53.

For example, suppose that the word “NO” appears in three places: the “LINE_NO” column of the “PURCHASE_ORDER” table, the “PHONE_NO” column of the same “PURCHASE_ORDER” table, and the “ITEM_NO” column of the “ORDER” table.

For the word “NO”, the context analysis unit 50 generates a set of the three columns “LINE_NO”, “PHONE_NO”, and “ITEM_NO” and takes it as the column set.
The context analysis unit 50 generates a set of the three words “LINE”, “PHONE”, and “ITEM” and takes it as the preceding word set. Among the elements of the context vector 54 associated with the prefix words 51, the context analysis unit 50 sets the three elements associated with the words “LINE”, “PHONE”, and “ITEM” to “1” and sets the elements associated with all other prefix words 51 to “0”.
The context analysis unit 50 takes the empty set, which contains no words, as the following word set, and sets all elements of the context vector 54 associated with the suffix words 52 to “0”.
The context analysis unit 50 also generates a set of the two tables “PURCHASE_ORDER” and “ORDER” and takes it as the table set. It then generates a set consisting of the word “PURCHASE”, the word “ORDER”, and the word “ORDER” (note that the word “ORDER” is included twice) and takes it as the table name constituent word set. Among the elements of the context vector 54 associated with the table name constituent words 53, the context analysis unit 50 sets the element associated with the word “PURCHASE” to “1”, sets the element associated with the word “ORDER” to “2”, and sets the elements associated with all other table name constituent words 53 to “0”.
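The counting in this example can be reproduced with a short sketch. It assumes that table and column names split into words on underscores and that tables are identified by name; `COLUMNS` holds the three columns from the example, while the function name `context_counts` and the use of `collections.Counter` are illustrative choices, not part of the embodiment.

```python
from collections import Counter

# (table name, column name) pairs taken from the example in the text.
COLUMNS = [
    ("PURCHASE_ORDER", "LINE_NO"),
    ("PURCHASE_ORDER", "PHONE_NO"),
    ("ORDER", "ITEM_NO"),
]


def context_counts(word, columns):
    """Count preceding words, following words, and table-name words for one word."""
    prefix, suffix = Counter(), Counter()
    tables = set()  # each table is counted once, however many of its columns match
    for table, column in columns:
        parts = column.split("_")
        for i, part in enumerate(parts):
            if part != word:
                continue
            tables.add(table)
            if i > 0:
                prefix[parts[i - 1]] += 1
            if i + 1 < len(parts):
                suffix[parts[i + 1]] += 1
    # Identically spelled words from different table names are counted separately.
    table_words = Counter(w for table in tables for w in table.split("_"))
    return prefix, suffix, table_words


prefix, suffix, table_words = context_counts("NO", COLUMNS)
```

For the word “NO” this yields a count of 1 each for “LINE”, “PHONE”, and “ITEM” as preceding words, no following words, and table-name counts of 1 for “PURCHASE” and 2 for “ORDER”, matching the values given in the text.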

As the context analysis unit 50 generates a context vector 54 that merges identically spelled words appearing at different positions, without distinguishing where the words appear, the word classification unit 60 (an example of a feature vector generation unit) likewise generates a feature vector 64 that merges identically spelled words appearing at different positions. For each word, the word classification unit 60 concatenates the context vector 54 generated by the context analysis unit 50 and the concept vector 44 generated by the feature vector generation unit 40 to form the feature vector 64.

Because there is only one feature vector 64 for identically spelled words, the word classification unit 60 calculates the likelihood 66 for each class without distinguishing where the word appears. The word classification unit 60 therefore does not need to calculate the average likelihood 67. The word classification unit 60 classifies each word into one of the classes on the basis of the calculated likelihood 66.
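A minimal sketch of computing per-class likelihoods from a single feature vector and choosing the most likely class. The text states only that the likelihood is obtained from the feature vector and the per-class weight vectors; the softmax normalization used here is an assumption in the spirit of multiclass logistic regression, and the function names and numbers are hypothetical. Constraint handling is omitted.

```python
import math


def class_likelihoods(feature, weights):
    """Softmax over the inner products of one feature vector with each class weight vector."""
    scores = [sum(w * x for w, x in zip(weight, feature)) for weight in weights]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def most_likely_class(feature, weights):
    """Index of the class with the highest likelihood."""
    likelihoods = class_likelihoods(feature, weights)
    return max(range(len(likelihoods)), key=likelihoods.__getitem__)


# Hypothetical three-element feature vector and two class weight vectors.
feature = [1.0, 0.0, 1.0]
weights = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
likelihoods = class_likelihoods(feature, weights)
best = most_likely_class(feature, weights)
```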

The configurations described in the above embodiments are examples, and other configurations may be used. For example, configurations described in different embodiments may be combined, and non-essential parts of a configuration may be replaced with other configurations.

The related word classification device (related column determination device 903) described above classifies a plurality of words into groups (classes) of related words.
The related word classification device has a processing device (911) that processes data, a text acquisition unit (word extraction unit 20), a feature vector calculation unit (feature vector generation unit 40, context analysis unit 50, word classification unit 60), a word classification unit (60), and a weight vector calculation unit (word classification unit 60).
Using the processing device, the text acquisition unit acquires a plurality of texts (table schemas), each including one or more word groups (column names) each consisting of one or more words (column name constituent words 27).
Using the processing device, the feature vector calculation unit calculates, for each of the plurality of words included in the plurality of texts acquired by the text acquisition unit, a feature vector (64) representing the features of the word.
Using the processing device, the word classification unit classifies each of the plurality of words included in the plurality of texts acquired by the text acquisition unit into one of a plurality of groups.
Using the processing device, the weight vector calculation unit calculates, for each of the plurality of groups, a weight vector (weight parameter vector 65) whose inner product with a feature vector represents the likelihood that the word characterized by that feature vector belongs to the group, the weight vector being the one that best fits the classification made by the word classification unit.
Using the processing device, the word classification unit further calculates the likelihood that each of the plurality of words belongs to each of the plurality of groups, on the basis of the feature vectors calculated by the feature vector calculation unit for the respective words and the weight vectors calculated by the weight vector calculation unit for the respective groups, and reclassifies the plurality of words on the basis of the calculated likelihoods.

As a result, related words can be extracted while distinguishing words that are used for different purposes, even when the target text is structured data or the like.

The weight vector calculation unit further recalculates the weight vectors, using the processing device and on the basis of the classification after reclassification by the word classification unit, in at least one of the following cases: the number of times the word classification unit has reclassified the words is smaller than a predetermined number, or the classification after reclassification by the word classification unit differs from the classification before reclassification.

By alternately repeating the calculation of the weight vectors and the classification of the words, related words can be extracted while distinguishing words that are used for different purposes.
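The alternation and its stopping rule can be sketched as a generic loop. The loop continues while the number of reclassifications is below a minimum or the classification has changed, as stated above. Everything else is an assumption made so that the example runs: the `max_rounds` safety cap, the one-dimensional features, and the centroid-based `fit_centroids` and `nearest_centroid` functions, which merely stand in for the weight vector calculation and the likelihood-based reclassification.

```python
def alternate(initial_classes, fit_weights, reclassify, min_rounds=3, max_rounds=100):
    """Alternate parameter fitting and reclassification until the classes settle."""
    classes = dict(initial_classes)
    rounds = 0
    while True:
        weights = fit_weights(classes)     # fit parameters to the current classes
        new_classes = reclassify(weights)  # reassign every word using the parameters
        rounds += 1
        changed = new_classes != classes
        classes = new_classes
        # Continue while too few rounds have run or the classification changed.
        keep_going = rounds < min_rounds or changed
        if not keep_going or rounds >= max_rounds:  # max_rounds is an added safety cap
            return classes, rounds


# Toy stand-ins: one-dimensional features and class centroids as "parameters".
FEATURES = {"a": 0.0, "b": 0.2, "c": 1.0, "d": 1.1}


def fit_centroids(classes):
    members = {}
    for word, cls in classes.items():
        members.setdefault(cls, []).append(FEATURES[word])
    return {cls: sum(values) / len(values) for cls, values in members.items()}


def nearest_centroid(centroids):
    return {
        word: min(centroids, key=lambda cls: abs(x - centroids[cls]))
        for word, x in FEATURES.items()
    }


final_classes, rounds = alternate(
    {"a": 0, "b": 1, "c": 0, "d": 1}, fit_centroids, nearest_centroid
)
```

In this toy run the classification stops changing after the first reclassification, but the loop still runs until the minimum number of rounds has been reached.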

The word classification unit classifies the plurality of words so that the set of groups into which the one or more words of a given word group are classified differs from the set of groups into which the one or more words of another word group belonging to the same text are classified.

Because the words are classified on the basis of characteristics of structured data such as database definitions, related words can be extracted while distinguishing words that are used for different purposes.

The word classification unit classifies the plurality of words so that the group into which a given word is classified is the same as the group into which an identical word appearing at a different position in the plurality of texts is classified.

For each of the plurality of words, the feature vector calculation unit calculates the feature vector on the basis of at least one of: the concept represented by the word; other words (prefix words 51, suffix words 52) belonging to the word group to which the word belongs; and words (table name constituent words 53) constituting the name (table name) of the word group to which the word belongs.

Because the words are classified on the basis of the general concepts they represent, the contexts in which they appear, and so on, related words can be extracted while distinguishing words that are used for different purposes.

The text acquired by the text acquisition unit is table definition data that defines a table in a database. The word group is the column name of a column constituting the table.
The related word classification device further has a related column determination unit (related word determination unit 70).
Using the processing device, the related column determination unit determines that two columns may be columns storing the same content (related columns) when the set of groups into which the word classification unit has classified the one or more words constituting the column name of one column matches the set of groups into which the word classification unit has classified the one or more words constituting the column name of the other column.
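The related column determination can be sketched by grouping columns on the combination of classes assigned to their words. The column names, class numbers, and the function name `related_columns` are hypothetical, and comparing combinations as sorted tuples (so that word order is ignored) is an assumption of this sketch rather than a detail given in the text.

```python
from collections import defaultdict


def related_columns(columns, word_class):
    """Group columns whose words were classified into the same combination of classes."""
    groups = defaultdict(list)
    for column in columns:
        # Assumption: word order is ignored, so the combination is a sorted tuple.
        key = tuple(sorted(word_class[word] for word in column.split("_")))
        groups[key].append(column)
    # Only combinations shared by two or more columns are candidate related columns.
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical classification result: PHONE/TEL share a class, as do NO/NUMBER.
word_class = {"PHONE": 1, "TEL": 1, "NO": 2, "NUMBER": 2, "LINE": 3}
candidates = related_columns(["PHONE_NO", "TEL_NUMBER", "LINE_NO"], word_class)
```

Here “PHONE_NO” and “TEL_NUMBER” are reported as candidate related columns, while “LINE_NO” is not, because “LINE” belongs to a different class.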

Because columns that may be related columns are determined on the basis of the word classification results, the work of identifying related columns can be made more efficient.

10 related word extraction source data; 20 word extraction unit; 21 table name constituent word data; 22, 25 table identifier; 23, 53 table name constituent word; 24 column name constituent word data; 26 column identifier; 27 column name constituent word; 30 concept dictionary data; 31 concept data; 32, 41 concept identifier; 33 concept display word; 34 superordinate concept identifier; 40 feature vector generation unit; 44 concept vector; 50 context analysis unit; 51 prefix word; 52 suffix word; 54 context vector; 60 word classification unit; 61, 62 class identifier; 63 class vector; 64 feature vector; 65 weight parameter vector; 66 likelihood; 67 average likelihood; 68 provisional correct class list; 69 assignment-order word list; 70 related word determination unit; 71 related column data; 80 related word data; 900 related column extraction system; 901 database definition storage device; 902 concept dictionary storage device; 903 related column determination device; 911 processing device; 912 input device; 913 output device; 914 storage device; S800 related extraction process; S801 word extraction process; S802 context analysis process; S803 concept analysis process; S804 word classification process; S805 relation determination process; S810 initialization step; S811, S831 word selection step; S812, S832 provisional class assignment step; S813 column selection step; S814 sibling column selection step; S815 exclusive constraint determination step; S820 weight optimization step; S830 classification calculation step; S840 convergence determination step.

Claims

A related word classification device that classifies a plurality of words into groups of related words, the device comprising
a processing device that processes data, a text acquisition unit, a feature vector calculation unit, a word classification unit, and a weight vector calculation unit, wherein
the text acquisition unit, using the processing device, acquires a plurality of texts each including one or more word groups each consisting of one or more words,
the feature vector calculation unit, using the processing device, calculates, for each of the plurality of words included in the plurality of texts acquired by the text acquisition unit, a feature vector representing the features of the word,
the word classification unit, using the processing device, classifies each of the plurality of words included in the plurality of texts acquired by the text acquisition unit into one of a plurality of groups,
the weight vector calculation unit, using the processing device, calculates, for each of the plurality of groups, a weight vector whose inner product with a feature vector represents the likelihood that the word characterized by that feature vector belongs to the group, the weight vector being the one that best fits the classification made by the word classification unit, and
the word classification unit further, using the processing device, calculates the likelihood that each of the plurality of words belongs to each of the plurality of groups on the basis of the feature vectors calculated by the feature vector calculation unit for the respective words and the weight vectors calculated by the weight vector calculation unit for the respective groups, and reclassifies the plurality of words on the basis of the calculated likelihoods.

The related word classification device according to claim 1, wherein the weight vector calculation unit further recalculates the weight vectors, using the processing device and on the basis of the classification after reclassification by the word classification unit, in at least one of a case where the number of times the word classification unit has reclassified the words is smaller than a predetermined number and a case where the classification after reclassification by the word classification unit differs from the classification before reclassification.

The related word classification device according to claim 1 or 2, wherein the word classification unit classifies the plurality of words so that the set of groups into which the one or more words of a given word group are classified differs from the set of groups into which the one or more words of another word group belonging to the same text are classified.

The related word classification device according to claim 1, wherein the word classification unit classifies the plurality of words so that the group into which a given word is classified is the same as the group into which an identical word appearing at a different position in the plurality of texts is classified.

The related word classification device according to claim 1, wherein the feature vector calculation unit calculates the feature vector for each of the plurality of words on the basis of at least one of the concept represented by the word, other words belonging to the word group to which the word belongs, and words constituting the name of the word group to which the word belongs.

The related word classification device according to any one of claims 1 to 5, wherein
the text acquired by the text acquisition unit is table definition data that defines a table in a database, and the word group is the column name of a column constituting the table,
the related word classification device further has a related column determination unit, and
the related column determination unit, using the processing device, determines that two columns may be columns storing the same content when the set of groups into which the word classification unit has classified the one or more words constituting the column name of one column matches the set of groups into which the word classification unit has classified the one or more words constituting the column name of the other column.

A computer program that, when executed by a computer having a processing device that processes data, causes the computer to function as the related word classification device according to any one of claims 1 to 6.

A related word classification method in which a related word classification device having a processing device that processes data, a text acquisition unit, a feature vector calculation unit, a word classification unit, and a weight vector calculation unit classifies a plurality of words into groups of related words, wherein
the text acquisition unit, using the processing device, acquires a plurality of texts each including one or more word groups each consisting of one or more words,
the feature vector calculation unit, using the processing device, calculates, for each of the plurality of words included in the plurality of texts acquired by the text acquisition unit, a feature vector representing the features of the word,
the word classification unit, using the processing device, classifies each of the plurality of words included in the plurality of texts acquired by the text acquisition unit into one of a plurality of groups,
the weight vector calculation unit, using the processing device, calculates, for each of the plurality of groups, a weight vector whose inner product with a feature vector represents the likelihood that the word characterized by that feature vector belongs to the group, the weight vector being the one that best fits the classification made by the word classification unit, and
the word classification unit, using the processing device, calculates the likelihood that each of the plurality of words belongs to each of the plurality of groups on the basis of the feature vectors calculated by the feature vector calculation unit for the respective words and the weight vectors calculated by the weight vector calculation unit for the respective groups, and reclassifies the plurality of words on the basis of the calculated likelihoods.