JP7110554B2

JP7110554B2 - Ontology generation device, ontology generation program and ontology generation method

Info

Publication number: JP7110554B2
Application number: JP2017131632A
Authority: JP
Inventors: 育昌鄭; 友樹長瀬
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2022-08-02
Anticipated expiration: 2037-07-05
Also published as: JP2019016074A

Description

The present invention relates to an ontology generation device, an ontology generation program, and an ontology generation method.

To build ontologies of specialized knowledge, work is under way on techniques for extracting, from a document group, phrase pairs whose relationship can express that knowledge. One known method supplies a small number of phrase pairs as clues (seeds) and extracts related phrase pairs by pattern matching, using extraction patterns based on parts of speech and syntactic structure.

With this method, however, phrase pairs that do not have the intended relationship may be extracted. A known countermeasure classifies extraction patterns according to whether they extract correct phrase pairs, uses the classification results as training data for a supervised classifier that judges extraction pattern appropriateness, judges each new extraction pattern with that classifier, and suppresses the extraction of unintended phrase pairs by not using patterns judged inappropriate (see Patent Literature 1 below).

Patent Literature 1: Japanese Patent No. 5430989

However, because the supervised classifier is built from the results of classifying extraction pattern appropriateness, an error in judging appropriateness lowers the accuracy of subsequent judgments.

Accordingly, one aspect of the present invention aims to improve the accuracy of extracting phrase pairs that have the intended relationship, without being affected by the results of intermediate processing.

In one example of an aspect, an ontology generation device that generates an ontology dictionary, which is a semantic database, includes a storage unit, an acquisition unit, a division unit, a semantic sentence extraction unit, an evaluation unit, a generation unit, and a phrase pair extraction unit. The storage unit stores phrase pairs as items of the ontology dictionary. A phrase pair is a pair of semantically related phrases consisting of a front phrase and a back phrase, where the back phrase appears after the front phrase in a semantic sentence containing the pair. The acquisition unit acquires a seed phrase pair, which is a phrase pair input from outside, and stores it in the storage unit as an item. The division unit divides acquired document data into a plurality of semantic sentences. The semantic sentence extraction unit extracts, from the semantic sentences obtained by the division, those that contain a phrase pair stored as an item in the storage unit. The evaluation unit uses a predetermined formula to calculate a score representing how closely the phrases of the phrase pair contained in an extracted semantic sentence are related in meaning to the phrases of the seed phrase pair, determines whether the relatedness is low by comparing the score calculated for the semantic sentence with a predetermined threshold, and excludes the semantic sentence from the extraction result when the relatedness is determined to be low. The generation unit generates a relation extraction pattern from each semantic sentence that remains in the extraction result without being excluded, by replacing each phrase of the phrase pair contained in the sentence with a variable that carries attribute information about that phrase. The phrase pair extraction unit finds, among the semantic sentences obtained by the division, those whose phrases other than the variables match the relation extraction pattern and whose phrases at the variable positions have attribute information matching that of the variables, extracts the pair of phrases at those positions, and stores it in the storage unit as an item.
The predetermined formula calculates, as the score, the weighted average of two vector distances: the distance between the topic distribution of the front phrase of the phrase pair contained in the extracted semantic sentence and the topic distribution of the front phrase of the seed phrase pair, and the distance between the topic distribution of the back phrase of that phrase pair and the topic distribution of the back phrase of the seed phrase pair. The topic distribution of a phrase in a semantic sentence is represented by the vector obtained by multiplying the vector representing the phrase topic distribution of the phrase by the vector representing the document topic distribution of the semantic sentence. The phrase topic distribution of a phrase is a probability distribution giving, for each topic, the probability that the phrase is contained in a document about that topic. The document topic distribution of a semantic sentence is a probability distribution giving, for each topic, the probability that the semantic sentence is contained in a document about that topic.
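The scoring rule in this aspect can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the claimed formula: the "multiplication" of the two vectors is read as an element-wise product, Euclidean distance stands in for the unspecified vector distance, and the names `topic_vector_in_sentence`, `euclidean_distance`, `relatedness_score`, and `front_weight` are hypothetical.

```python
import math
from typing import List, Sequence


def topic_vector_in_sentence(phrase_topics: Sequence[float],
                             sentence_topics: Sequence[float]) -> List[float]:
    """Topic vector of a phrase within one semantic sentence.

    Assumption: the multiplication of the two vectors is element-wise.
    """
    if len(phrase_topics) != len(sentence_topics):
        raise ValueError("both vectors must have one entry per topic")
    return [p * s for p, s in zip(phrase_topics, sentence_topics)]


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumption: Euclidean distance stands in for the vector distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relatedness_score(front: Sequence[float], seed_front: Sequence[float],
                      back: Sequence[float], seed_back: Sequence[float],
                      front_weight: float = 0.5) -> float:
    """Weighted average of the front-phrase and back-phrase distances."""
    back_weight = 1.0 - front_weight
    return (front_weight * euclidean_distance(front, seed_front)
            + back_weight * euclidean_distance(back, seed_back))
```

The sketch deliberately leaves out the threshold comparison and the choice of weights, since neither is fixed by the paragraph above.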

This makes it possible to improve the accuracy of extracting phrase pairs that have the intended relationship, without being affected by the results of intermediate processing.

A diagram for explaining the automatic extraction framework for extraction patterns and phrase pairs.
A diagram for explaining phrase pair extraction using a supervised classifier that judges pattern appropriateness.
A diagram for explaining how an error in judging pattern appropriateness affects subsequent judgments and lowers judgment accuracy.
A diagram showing the functional configuration of the ontology generation device of the first embodiment.
A diagram showing an example of phrases in upper-level and lower-level relationships from which an ontology is built.
A diagram showing an example of a tree-structured ontology that is built.
A diagram showing the result of processing a text with the syntax analysis unit.
A diagram showing the results of morphological analysis and phrase extraction on texts.
A diagram showing the learned phrase topic distribution and document topic distribution.
A diagram for explaining how combinations of positive and negative examples are created in the automatic determination of weights.
A diagram showing an example of weights and semantic relatedness scores in the automatic determination of weights.
A flowchart showing an example of the relation extraction pattern extraction process in the ontology generation devices of the first, second, and fourth embodiments.
A subroutine showing an example of the phrase/document topic distribution learning process in the ontology generation device of the first embodiment.
A subroutine showing an example of the pattern information extraction process in the ontology generation device of the first embodiment.
A diagram for explaining the concept of deciding whether to delete a relation extraction pattern generation candidate.
A diagram showing an example of the hardware configuration of the ontology generation device of each embodiment.
A diagram for explaining the effect of the ontology generation device of each embodiment.
A diagram for explaining that the ontology generation device of each embodiment can extract the intended phrase pairs.
A diagram showing the phrase pairs extracted by a conventional ontology generation method.
A diagram showing the phrase pairs extracted by the ontology generation device of each embodiment.
A diagram showing the phrase pair extraction results of both.
A diagram showing the functional configuration of the ontology generation devices of the second and fourth embodiments.
A diagram showing the phrase topic distribution and document topic distribution in the second embodiment.
A subroutine showing an example of the phrase/document topic distribution learning process in the ontology generation devices of the second and third embodiments.
A subroutine showing an example of the pattern information extraction process in the ontology generation devices of the second and third embodiments.
A diagram showing the functional configuration of the ontology generation device of the third embodiment.
A flowchart showing an example of the relation extraction pattern extraction process in the ontology generation device of the third embodiment.
A diagram showing acquired text after synonym replacement.
A diagram showing the learning result when the acquired text is learned without synonym replacement and the learning result when it is learned with synonym replacement.
A diagram showing the phrase topic distribution, including combinations, and the document topic distribution in the third embodiment.
A subroutine showing an example of the phrase/document topic distribution learning process in the ontology generation device of the fourth embodiment.
A subroutine showing an example of the pattern information extraction process in the ontology generation device of the fourth embodiment.
A diagram showing an example of a search system that uses the ontology generated by the ontology generation devices of the first to fourth embodiments.
A diagram for explaining automatic expansion by a keyword expansion device.

Embodiments of the present invention are described below with reference to the drawings.
First, the automatic extraction framework for relation extraction patterns (hereinafter also called extraction patterns) and phrase pairs is described with reference to FIG. 1. The framework analyzes a document group 11 sentence by sentence and extracts pairs of phrases that stand in a binary relation as phrase pairs. Specifically, relation extraction patterns are acquired from the analyzed sentences (texts) by using predetermined seed (clue) phrase pairs, and are recorded in a relation extraction pattern DB (database) 12. The acquired patterns are then used to obtain phrase pairs from the texts, and these are recorded in a phrase pair DB 13. An ontology is generated from the phrase pairs accumulated in the phrase pair DB 13 in this way and is provided to a user 14, who can give further feedback.
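The alternation between acquiring patterns and acquiring phrase pairs can be sketched as a loop. The following is a minimal illustration, not the device's implementation: `bootstrap`, `induce_patterns`, `apply_patterns`, and `max_rounds` are hypothetical names, and the two callables are placeholders for whatever pattern induction and pattern matching are actually used.

```python
from typing import Callable, Iterable, List, Set, Tuple

Pair = Tuple[str, str]
PatternInducer = Callable[[List[str], Set[Pair]], Set[str]]
PatternApplier = Callable[[List[str], Set[str]], Set[Pair]]


def bootstrap(sentences: List[str], seed_pairs: Iterable[Pair],
              induce_patterns: PatternInducer, apply_patterns: PatternApplier,
              max_rounds: int = 10) -> Tuple[Set[Pair], Set[str]]:
    """Alternate pattern induction and pair extraction, starting from seeds."""
    pair_db: Set[Pair] = set(seed_pairs)   # plays the role of phrase pair DB 13
    pattern_db: Set[str] = set()           # plays the role of pattern DB 12
    for _ in range(max_rounds):
        # Acquire patterns from sentences that contain known pairs.
        pattern_db |= induce_patterns(sentences, pair_db)
        # Use all known patterns to look for pairs not seen before.
        new_pairs = apply_patterns(sentences, pattern_db) - pair_db
        if not new_pairs:
            break  # nothing new was extracted, so stop iterating
        pair_db |= new_pairs
    return pair_db, pattern_db
```

The `max_rounds` cap is an added safeguard for the sketch; the text only describes stopping when no new pairs are found.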

A phrase pair is a pair of phrases and can be written, for example, as (A, B), where A and B are each a phrase. A phrase here is not limited to a phrase A extracted by morphological analysis; it also covers an expression that denotes A, such as "A Corporation", in which A is combined with another word such as "Corporation".

A relation extraction pattern is all or part of a text in which a seed phrase pair appears, with the occurrences of the seed phrase pair replaced by variables. For example, suppose the seed phrase pair is (X, Y), where X is a manufacturer name and Y is a smartphone model name. If the text containing the seed phrase pair (X, Y) is "X's latest product Y has been announced", the relation extraction pattern is obtained by replacing the occurrences of X and Y with the variables for manufacturer name and smartphone model name: "(manufacturer name)'s latest product (smartphone model name) has been announced". Note that "sumaho", the term used in the original Japanese, is an abbreviation of "smartphone"; the same applies below.
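The replacement of a seed phrase pair by variables can be sketched as a simple string operation. This is an illustration only: `make_pattern` and `slot_names` are hypothetical names, and plain substring search on raw text is an assumption made to keep the sketch short.

```python
from typing import Optional, Tuple


def make_pattern(sentence: str, pair: Tuple[str, str],
                 slot_names: Tuple[str, str]) -> Optional[str]:
    """Replace the front and back phrase of `pair` with named variables.

    Returns None when the sentence does not contain the front phrase
    followed later by the back phrase.
    """
    front, back = pair
    front_at = sentence.find(front)
    if front_at < 0:
        return None
    # The back phrase must appear after the end of the front phrase.
    back_at = sentence.find(back, front_at + len(front))
    if back_at < 0:
        return None
    return (sentence[:front_at] + "(" + slot_names[0] + ")"
            + sentence[front_at + len(front):back_at]
            + "(" + slot_names[1] + ")"
            + sentence[back_at + len(back):])
```

For instance, calling it with the sentence "BBB's latest product ArlowsM03 has been announced", the pair `("BBB", "ArlowsM03")`, and the slot names `("manufacturer name", "smartphone model name")` yields a pattern of the same shape as the one quoted in the text.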

An ontology is a kind of dictionary that systematically organizes the concepts that words carry.
The automatic extraction framework described above may extract phrase pairs with an unintended relationship. For example, suppose the relation extraction pattern is "(manufacturer name)'s latest product (smartphone model name) has been announced" and the phrase pair obtained from another text with this pattern is (XX, YY). There is no problem if XX is a manufacturer name and YY is a smartphone model name. However, XX may be a manufacturer name while YY is the name of a garment rather than a smartphone model. In that case a phrase pair with an unintended relationship is extracted.

Patent Literature 1, mentioned above, discloses a technique for suppressing the extraction of phrase pairs with an unintended relationship. The technique classifies relation extraction patterns according to whether they can extract correct phrase pairs and uses the results as training data to build a supervised classifier that judges pattern appropriateness. The classifier judges the appropriateness of each new relation extraction pattern, and inappropriate patterns are not used, which suppresses the extraction of phrase pairs with an unintended relationship.

Specifically, as shown in FIG. 2, pattern extraction (relation extraction pattern extraction) is performed on the text using the seed phrase pair (sedan, Sky○○), and a first-round extraction pattern such as "announced Y, a new X car" is extracted. Here, the desired relationship to be extracted is (vehicle type, product name).

A supervised classifier trained on the pattern judgment results obtained so far judges whether the extracted pattern is appropriate. In this example, the relation extraction patterns "Y, one of the representative X cars" and "X's work Y" serve as training data.

As a result of the judgment, the extracted pattern "announced Y, a new X car" is considered appropriate and is used to extract a new phrase pair (sedan, Fugu). Once the new phrase pair has been extracted, second-round extraction patterns are extracted in the same way as in the first round, namely "announced X's new model Y", "Y's new release announced at X", and "Y is a new product in the X series", and their appropriateness is judged.

As a result of the judgment, the extraction pattern "Y is a new product in the X series" is judged inappropriate, and the patterns judged appropriate are used to extract the new phrase pairs (□□ Fair, Rokusu), (SUV, ROV4), and (Arlows, M03).

However, the extracted phrase pairs may include pairs that do not have the desired relationship. Because the supervised classifier is built from the pattern appropriateness classification results obtained so far, an error in judging pattern appropriateness affects subsequent judgments and lowers judgment accuracy.

This is explained concretely with reference to FIG. 3. A pattern 30, admitted through an error in judging pattern appropriateness, extracts an inappropriate phrase pair 31, and a pattern 32 extracted from the phrase pair 31 in turn extracts another inappropriate phrase pair 33. Once an error in judging pattern appropriateness occurs, it affects subsequent judgments and lowers judgment accuracy. The white frames (white areas) represent all extracted phrase pairs and all extracted patterns, respectively.

The ontology generation device of each embodiment solves this problem and suppresses the extraction of phrase pairs with an unintended relationship.

The ontology generation device of each embodiment is described below.
<First embodiment>
The configuration of the ontology generation device of the first embodiment is described with reference to FIG. 4. The ontology generation device 40 includes a seed phrase input unit 401, a text acquisition unit 402, a syntax analysis unit 403, an extraction processing control unit 404, an extraction pattern generation candidate extraction unit 405, a pattern extraction unit 406, a phrase extraction unit 407, an ontology generation unit 408, and an output unit 409. The ontology generation device 40 further includes a syntactic structure storage unit 410, a phrase storage unit 411, an extraction pattern storage unit 412, an extraction pattern generation candidate storage unit 413, a phrase/document topic distribution learning unit 414, a phrase/document topic distribution storage unit 415, and an extraction pattern generation candidate evaluation unit 416.

The seed phrase input unit 401 accepts in advance a phrase pair that the user specifies as having the desired relationship (a seed phrase pair) and uses it as the seed for the automatic extraction framework for relation extraction patterns and phrase pairs. For example, when the user wants the relationship (smartphone manufacturer, smartphone series) and inputs the phrase pair (BBB, Arlows) as a seed phrase pair, the seed phrase input unit 401 accepts (BBB, Arlows). Here, BBB is, for example, the name of a smartphone manufacturer, and Arlows is the name of a smartphone series.

The text acquisition unit 402 acquires a document group (document data) from, for example, a database that stores document groups, divides it into a plurality of texts (also called semantic sentences), and extracts the texts. A document group is a collection of documents, for example documents in plain text or HTML (HyperText Markup Language) format.
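Dividing a document into sentence-level texts can be sketched with a regular expression. This is a deliberately naive illustration: the function name `split_into_sentences` is hypothetical, and splitting after sentence-final punctuation is an assumption that will mis-split text such as decimal numbers or abbreviations.

```python
import re
from typing import List

# Split right after Japanese or Western sentence-final punctuation,
# swallowing any whitespace that follows it.
_SENTENCE_END = re.compile(r"(?<=[。.!?！？])\s*")


def split_into_sentences(document: str) -> List[str]:
    """Return the non-empty sentences of `document`, punctuation retained."""
    parts = _SENTENCE_END.split(document)
    return [part.strip() for part in parts if part.strip()]
```

A production pipeline would more likely rely on a dedicated sentence segmenter and would strip HTML markup before splitting.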

The syntax analysis unit 403 performs morphological analysis and syntactic analysis, or semantic structure analysis, on the texts extracted by the text acquisition unit 402 and stores the results in the syntactic structure storage unit 410. For example, when the extracted text is "This fiscal year's Best Smartphone Award went to ArlowsM03, BBB's latest product", the syntax analysis unit 403 performs morphological and syntactic analysis, and the clause-segmented result, "this fiscal year's / Best Smartphone Award / BBB's / latest product / ArlowsM03 / won", is stored in the syntactic structure storage unit 410.

The extraction processing control unit 404 controls execution of the automatic extraction framework for relation extraction patterns and phrase pairs. When no new related phrase pair is extracted from the texts, it stops the automatic extraction of related phrase pairs and sends the phrase pairs stored in the phrase storage unit 411 to the ontology generation unit 408.

The extraction pattern generation candidate extraction unit 405 extracts, from the syntactic structures stored in the syntactic structure storage unit 410, the places (sentences) where a phrase pair stored in the phrase storage unit 411 appears, and sends them to the extraction pattern generation candidate storage unit 413. For example, suppose that for the stored phrase pair (CCC, zenfo3) it extracts a first sentence, "CCC's latest product zenfo3.", and a second sentence, "CCC's annual shipment plan raises the sales forecast for zenfo3 by 20%". Here, CCC is, for example, the name of a smartphone manufacturer, and zenfo3 is a smartphone model name. The extraction pattern generation candidate extraction unit 405 sends these sentences to the extraction pattern generation candidate storage unit 413 for storage.

The pattern extraction unit 406 generates, from the sentences stored in the extraction pattern generation candidate storage unit 413, relation extraction patterns for extracting related phrase pairs in the automatic extraction framework. For example, for the sentence "CCC's latest product zenfo3.", the pattern extraction unit 406 turns the occurrences of the phrase pair (CCC, zenfo3) into variables, that is, replaces them with (manufacturer name, smartphone model name). The relation extraction pattern generated in this case is "(manufacturer name)'s latest product (smartphone model name)".

The phrase extraction unit 407 uses the relation extraction patterns recorded in the extraction pattern generation candidate storage unit 413 to extract the phrases at the positions corresponding to the variables. In other words, for the plurality of semantic sentences obtained by the division, it identifies the words that form a concept pair, that is, a pair of lexical concepts (a phrase pair), and records them in the phrase storage unit 411 as a related phrase pair.

For example, consider the relation extraction pattern "(manufacturer name)'s latest product (smartphone model name)". Using this pattern, the phrase extraction unit 407 extracts the phrase pair (BBB, ArlowsM03) corresponding to the variable positions from the clause-segmented text "this fiscal year's / Best Smartphone Award / BBB's / latest product / ArlowsM03 / won" stored in the syntactic structure storage unit 410. The phrase extraction unit 407 then sends the extracted phrase pair to the phrase storage unit 411 for storage.
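Applying a relation extraction pattern to sentences can be sketched with a regular expression plus an attribute check on the captured phrases. The sketch below is illustrative only: `extract_pairs` and `attribute_of` are hypothetical names, the attribute lookup table is an assumed stand-in for whatever attribute information the device holds, and matching on raw whitespace-delimited text (rather than on the stored syntactic structure) is a simplification.

```python
import re
from typing import Dict, List, Tuple

# A slot is written as "(slot name)" inside a pattern string.
SLOT = re.compile(r"\(([^()]+)\)")


def extract_pairs(sentences: List[str], pattern: str,
                  attribute_of: Dict[str, str]) -> List[Tuple[str, str]]:
    """Extract (front, back) pairs whose attributes match the pattern's slots."""
    # Splitting on a capturing group keeps the slot names in the result:
    # [literal, slot, literal, slot, literal]
    parts = SLOT.split(pattern)
    if len(parts) != 5:
        raise ValueError("pattern must contain exactly two slots")
    before, front_slot, between, back_slot, after = parts
    matcher = re.compile(re.escape(before) + r"(\w+)" + re.escape(between)
                         + r"(\w+)" + re.escape(after))
    pairs: List[Tuple[str, str]] = []
    for sentence in sentences:
        found = matcher.search(sentence)
        if found is None:
            continue
        front, back = found.group(1), found.group(2)
        # Keep the pair only when both phrases carry the expected attribute.
        if (attribute_of.get(front) == front_slot
                and attribute_of.get(back) == back_slot):
            pairs.append((front, back))
    return pairs
```

With an attribute table that marks a phrase as a garment name rather than a smartphone model name, a sentence that matches the literal text of the pattern is still rejected, which is the kind of unintended pair the attribute condition is meant to filter out.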

The ontology generation unit 408 organizes and links the related phrase pairs recorded in the phrase storage unit 411 to build an ontology. For example, it builds the tree-structured ontology shown in FIG. 5B from the upper-level and lower-level relationships shown in FIG. 5A. For example, when the upper-level term is "stadium", the lower-level terms are the names of individual stadiums. In FIG. 5B, upper-level terms are drawn as solid circles and lower-level terms as dashed circles.
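Linking upper-level and lower-level phrase pairs into a tree can be sketched with an adjacency map. This is a minimal illustration with hypothetical names (`build_hierarchy`, `find_roots`) and made-up example terms; it does not reproduce the contents of FIG. 5A or FIG. 5B.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_hierarchy(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each upper-level term to its lower-level terms, in input order."""
    children: Dict[str, List[str]] = defaultdict(list)
    for upper, lower in pairs:
        if lower not in children[upper]:  # ignore duplicate pairs
            children[upper].append(lower)
    return dict(children)


def find_roots(children: Dict[str, List[str]]) -> List[str]:
    """Upper-level terms that never appear as someone's lower-level term."""
    all_lower = {lower for lowers in children.values() for lower in lowers}
    return [upper for upper in children if upper not in all_lower]
```

Walking the map from each root gives the tree; a term with several parents would need extra handling that this sketch does not attempt.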

The output unit 409 converts the ontology built by the ontology generation unit 408 into a predetermined format and either displays it on a display device (not shown) or sends the converted content to another device.

The syntactic structure storage unit 410 stores the results produced by the syntax analysis unit 403. For example, it records the structure of "this fiscal year's / Best Smartphone Award / BBB's / latest product / ArlowsM03 / won" described above (see FIG. 6). The arrows in FIG. 6 indicate which clause each clause modifies.

The phrase storage unit 411 stores the seed phrase pairs accepted by the seed phrase input unit 401 and the phrase pairs extracted by the phrase extraction unit 407 during automatic phrase pair extraction.

The extraction pattern storage unit 412 stores the relation extraction patterns extracted by the pattern extraction unit 406, for example the pattern "(manufacturer name)'s latest product (smartphone model name)" described above.

The extraction pattern generation candidate storage unit 413 stores the places (sentences) where phrase pairs appear, as extracted by the extraction pattern generation candidate extraction unit 405. As described above, it stores, for example, "CCC's latest product zenfo3." and "CCC's annual shipment plan raises the sales forecast for zenfo3 by 20%".

The phrase/document topic distribution learning unit 414 performs morphological analysis on the texts acquired by the text acquisition unit 402, learns per-document topic distributions (a Topic Model) for the resulting set of phrases, and stores the learning results (vectors) in the phrase/document topic distribution storage unit 415. That is, it extracts phrases from the semantic sentences (texts) by morphological analysis and learns a phrase topic distribution, which describes how the extracted phrases, and combined phrases formed from them, refer to topics, and a document topic distribution, which describes how the semantic sentences containing the extracted phrases refer to topics.

A Topic Model is a form of clustering (unsupervised classification) intended mainly for document data. It ignores word order, considers only the frequency of the words that appear in a document, and models the topics of words without any labeled training data. That is, a Topic Model assumes that a document is generated probabilistically from multiple latent topics and that each word in the document appears according to the probability distribution of a topic. By assuming a word frequency distribution for each topic, a Topic Model makes it possible to analyze the similarity between topics and their meaning.

Words/phrases do not appear independently. They have latent topics, and words/phrases with the same topic tend to appear in the same document. For example, consider the sentence "○○yama takes the sole lead; △△maru is defeated." Suppose that ○○yama is a famous yokozuna and △△maru is a famous ozeki. The word "sumo" does not appear in the sentence, yet a reader can tell that the sentence is about sumo. That is, the sentence has a single topic, "sumo".

For example, as shown in FIG. 7A, suppose there are N acquired texts. Performing morphological analysis and word/phrase extraction on these texts yields the words/phrases extracted from each text. With K topics, the per-document word/phrase topic distribution learning unit 414 stores the learning results shown in FIG. 7B, namely the word/phrase topic distribution (vector V_w) and the document topic distribution (vector V_D), in the per-document word/phrase topic distribution storage unit 415. The word/phrase topic distribution is the probability distribution with which a word/phrase itself refers to each topic, and the document topic distribution is the probability distribution with which a document refers to each topic.

The values in the word/phrase topic distribution give, for each word/phrase, the probability of it appearing under each topic. For the word "sedan", for example, the value is 0.1 for topic T1, 0.2 for topic T2, and so on, and these values sum to 1. Likewise, the values in the document topic distribution give, for each document, the probability associated with each topic. For document ID "1", for example, the value is 0.01 for topic T1, 0.2 for topic T2, and so on, and these values also sum to 1.

ここでは主題数がＫ個となっているが、Ｋ個に限定されるものではなく、他の数とすることも可能である。また、図７Ａに示す語句aaaは例えば製品名であり、語句ＡＡＡ及び語句ＢＢＢは例えばメーカー名である。 Although the number of themes is K here, it is not limited to K, and other numbers are possible. The word aaa shown in FIG. 7A is, for example, a product name, and the words AAA and BBB are, for example, manufacturer names.

The learning method itself is conventional, so a detailed description is omitted here; the following documents can be consulted. For example: Blei, David M. (2012). Probabilistic topic models. Commun. ACM, 55(4), 77-84. Also: Liu, Jun S. (1994). The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem. Journal of the American Statistical Association, 89(427), 958-966. Also: Mikolov, Tomas; et al. Efficient Estimation of Word Representations in Vector Space.

語句文書別主題分布記憶部４１５は、語句文書別主題分布学習部４１４による学習結果（ベクトル）を記録する。例えば、図７Ａに示す語句主題分布と文書主題分布を記録する。この分布は確率分布である。 The phrase document-by-word topic distribution storage unit 415 records the learning result (vector) by the phrase-by-word document topic distribution learning unit 414 . For example, record the phrase topic distribution and the document topic distribution shown in FIG. 7A. This distribution is a probability distribution.

The extraction pattern generation candidate evaluation unit 416 compares each word/phrase pair occurrence recorded in the extraction pattern generation candidate storage unit 413 with the seed word/phrase pairs recorded in the word/phrase storage unit 411, and determines (evaluates) whether the semantic relevance score between the word/phrase pair at the occurrence location and the seed word/phrase pair exceeds a threshold. If the score does not exceed the threshold, the occurrence location (sentence) is deleted from the extraction pattern generation candidate storage unit 413. That is, among the plurality of word-sense sentences, the unit determines whether a first word (the seed word/phrase pair) forming the concept pair corresponding to a first word-sense sentence is similar to a second word (the word/phrase pair at the occurrence location) forming the concept pair corresponding to a second word-sense sentence. If they are determined not to be similar, the words (word/phrase pair) forming the concept pair are excluded from registration in the ontology dictionary. Here, "similar" means that the word/phrase pairs are related to each other.

意味関連性スコアとは、対比する語句対が持つそれぞれの意味がどの程度関連しているかを評価する値を言う。 The semantic relevance score is a value that evaluates the extent to which the respective meanings of contrasting word pairs are related.

For example, suppose that for the seed word/phrase pair (sedan, Sky○○) the extracted word/phrase pair (sedan, Fugu) has two occurrence locations (sentences). One is "Fugu, a type of sedan, was announced", and the other is "A new Fugu was announced in Sedan". The first sentence concerns (vehicle type, product name), whereas the second concerns (place name, type of musical piece).

In this case, the extraction pattern generation candidate evaluation unit 416 calculates the semantic relevance score between the seed word/phrase pair and the extracted word/phrase pair in the document of the occurrence location, based on the learning results recorded in the per-document word/phrase topic distribution storage unit 415. In this example, the relevance between (sedan, Fugu) in the second sentence and (sedan, Sky○○) is low, so the calculated semantic relevance score does not exceed the threshold, and the second occurrence location (sentence) is deleted from the extraction pattern generation candidate storage unit 413.

That is, the extraction pattern generation candidate evaluation unit 416 calculates, using the predetermined formula (1) below, a semantic relevance score indicating the degree of similarity (relevance) between the first word (the seed word/phrase pair) and the second word (the word/phrase pair at the occurrence location). If the calculated score is equal to or less than a predetermined threshold, it determines that the first word and the second word are not similar (have low relevance).

ここで、出現箇所の文書におけるシード語句対と、出現箇所の文書における抽出された語句対の意味関連性スコアの計算について説明する。 Calculation of semantic relevance scores for seed phrase pairs in the document of occurrence and extracted phrase pairs in the document of occurrence will now be described.

The following describes how the semantic relevance score is calculated for two word/phrase pairs i and j in a document d. The pair i is a seed word/phrase pair and the pair j is an extracted word/phrase pair. In document d, the semantic relevance score SRScore_{i, j, d} is calculated by formula (1) for the pair j = (j^A, j^B) and the pair i = (i^A, i^B). Here j^A is phrase A of pair j (front phrase A), j^B is phrase B of pair j (rear phrase B), i^A is phrase A of pair i (front phrase A), and i^B is phrase B of pair i (rear phrase B). For example, if the pair j is (BBB, Zinrei), phrase A is BBB and phrase B is Zinrei. The front phrase is the first phrase of a pair and the rear phrase is the second.

Here, the vector for the front phrase A of the seed word/phrase pair i represents the topic distribution of that phrase in document d. It is obtained, for example with three topics (dimension K = 3), by multiplying the topic distribution A_k of the phrase A by the topic distribution D_k of the document d, element by element. For example, if the topic distribution A_k of the phrase A is {0.3, 0.1, 0.6} and the topic distribution D_k of the document d is {0.4, 0.4, 0.2}, the topic distribution of the phrase A in document d is {0.3×0.4, 0.1×0.4, 0.6×0.2} = {0.12, 0.04, 0.12}.

語句Ａの主題分布Ａ_ｋ｛０．３、０．１、０．６｝は、３つの各主題における語句Ａの出現する確率分布を示しており、例えば１つ目の主題における語句Ａの出現する確率が０．３である。それぞれの数値を足し合わせると１となる。なお、文書ｄの主題分布Ｄ_ｋ｛０．４、０．４、０．２｝についても同様に考えられる。 The subject distribution A _k {0.3, 0.1, 0.6} of the word A indicates the probability distribution of the appearance of the word A in each of the three themes. The probability of doing so is 0.3. The sum of these numbers equals 1. Note that the subject distribution D _k {0.4, 0.4, 0.2} of document d can also be considered in the same way.
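The element-wise multiplication in the worked example can be sketched in a few lines of Python. The function name `topic_distribution_in_document` is a hypothetical label chosen for illustration; only the numbers come from the description.

```python
def topic_distribution_in_document(phrase_topics, document_topics):
    """Multiply a phrase's topic distribution by a document's, topic by topic."""
    if len(phrase_topics) != len(document_topics):
        raise ValueError("both distributions must cover the same K topics")
    return [a * d for a, d in zip(phrase_topics, document_topics)]


# Values from the worked example (K = 3).
A_k = [0.3, 0.1, 0.6]  # topic distribution of phrase A
D_k = [0.4, 0.4, 0.2]  # topic distribution of document d

vector = topic_distribution_in_document(A_k, D_k)
print([round(v, 2) for v in vector])  # [0.12, 0.04, 0.12]
```

Note that the resulting vector is not renormalized, so its components do not sum to 1; the description gives the product as is.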

A concrete example of the topic distribution of words/phrases in a document follows. Suppose document 1 is "Tomorrow is the day of the machine learning hackathon. My preparation is perfect." and document 2 is "I'm afraid I will stay up late because of the machine learning hackathon." Morphological analysis and word/phrase extraction on document 1 yields "tomorrow, machine learning, hackathon, day, preparation, perfect". The same processing on document 2 yields "machine learning, hackathon, staying up late, afraid".

Here, let the number of topics be K = 3 and assume that the topic distribution of each word/phrase and of each document is known. Based on these, the topic distribution of each word/phrase in document 1 is as follows: "tomorrow {0.1, 0.1, 0.8}, machine learning {0.2, 0.3, 0.5}, hackathon {0.4, 0.3, 0.3}, day {0.2, 0.2, 0.6}, preparation {0.3, 0.5, 0.2}, perfect {0.2, 0.6, 0.2}". The topic distribution of each word/phrase in document 2 is "machine learning {0.4, 0.3, 0.3}, hackathon {0.9, 0.05, 0.05}, staying up late {0.5, 0.3, 0.2}, afraid {0.4, 0.3, 0.3}". Because the documents have different topics, the topic distribution of the same word/phrase also differs from document to document.

Also, the corresponding term in formula (1) indicates the vector distance in document d between the front phrase A of the seed word/phrase pair i and the front phrase A of the extracted word/phrase pair j, and is given, for example, by the cosine similarity of the two phrases' topic distribution vectors in document d.

また、α_lは前方語句Ａ、後方語句Ｂの重み係数を示すものであり、あらかじめ人手によって設定されてもよく、自動で判断して設定されるようにしてもよい。なお、第１の実施形態では重みは２つであるが、後述する他の実施形態では重みが３つ（前方語句Ａ、後方語句Ｂ、組み合わせ）などとなる。 In addition, α _l indicates the weighting factor of the front phrase A and the rear phrase B, and may be set manually in advance, or may be set by automatic judgment. In the first embodiment, there are two weights, but in other embodiments described later, there are three weights (forward term A, backward term B, combination), and the like.
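Formula (1) itself is not reproduced in this text, so the following is only a sketch of one form that is consistent with the description: a weighted sum of per-position cosine similarities divided by the number of positions (2 for A and B, 3 when a combination phrase C is added). The function names, the dictionary layout, and all numeric values are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sr_score(seed_vectors, candidate_vectors, weights):
    """Assumed form of the score: weighted similarities averaged over positions.

    seed_vectors / candidate_vectors map a position label ("A", "B", ...) to the
    phrase's topic distribution vector in document d; weights maps the same
    labels to the weighting factor alpha_l.
    """
    total = sum(
        weights[label] * cosine_similarity(seed_vectors[label], candidate_vectors[label])
        for label in weights
    )
    return total / len(weights)


# Hypothetical vectors for a seed pair i and an extracted pair j in one document.
seed = {"A": [0.12, 0.04, 0.12], "B": [0.05, 0.30, 0.02]}
candidate = {"A": [0.10, 0.05, 0.15], "B": [0.04, 0.28, 0.03]}
alpha = {"A": 1.0, "B": 1.0}  # hypothetical weights

score = sr_score(seed, candidate, alpha)
print(score > 0.5)  # compare against a hypothetical threshold of 0.5
```

With this form, adding the combination phrase of the later embodiment only requires a third key in each dictionary.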

The automatic setting of the weights is as follows. First, prepare word/phrase pairs that match the user's intention (a positive example list) and word/phrase pairs that are definitely incorrect (a negative example list). For the document group to be processed, create combinations of all word/phrase pairs in the positive and negative example lists, and calculate the semantic relevance score using formula (1) above with l = A, B, C and a denominator of 3. Adjust α_l and repeat the score calculation until the semantic relevance score separates the positive examples from the negative examples. The optimized α_l is then set as the weight used in actual processing. This processing may be performed by, for example, the extraction pattern generation candidate evaluation unit 416, or by another component.

For example, as shown in FIG. 8A, positive examples (the positive example list) and negative examples (the negative example list) are provided, for instance by hand. In the positive examples here, the front phrase is a smartphone manufacturer and the rear phrase is a smartphone model name. The phrases BBB, DDD, and EEE are smartphone manufacturers, and the phrases Arlows, GNote, and Xpier are smartphone model names. The negative examples, by contrast, are not pairs of a smartphone manufacturer and a smartphone model name: the front phrase AAA is an automobile manufacturer, the front phrase FFF is a general electronics manufacturer, the rear phrase Pirio is a car product name, and the rear phrase Biri is a television model name.

These positive and negative examples are mixed to create combinations. The combinations include positive with positive, positive with negative, and negative with negative. For example, a combination of two positive examples has the ID 1-2, a combination of a positive and a negative example has the ID 1-R, and a combination of two negative examples has the ID R-S.

Using the created combinations, the semantic relevance score is calculated while adjusting the weights (weight 1, weight 2, weight 3, ...); the results are shown in FIG. 8B. As an example of a weight setting, weight 1 is 0.3, 0.1, 0.6. The threshold used for deciding the weights is, for example, 0.5. In the results of FIG. 8B, only with weight 2 are the scores between positive examples high (0.8 and 0.9) while the scores for combinations of a positive and a negative example are low (0.4, 0.3, and 0.1), below the threshold. From this result, the weighting factor α_l is set to the values of weight 2. Weights that give high scores to combinations of positive examples and low scores to combinations of a positive and a negative example are appropriate for extracting word/phrase pairs that match the user's intention.
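The weight search described above can be sketched as a loop over candidate weight settings. This is a minimal illustration, not the procedure of FIG. 8B: the per-position similarities, the second candidate weight setting, and the function names are all hypothetical, and the score is assumed to be a weighted sum divided by the number of positions.

```python
def weighted_score(similarities, weights):
    """Weighted sum of per-position similarities divided by the number of positions."""
    return sum(w * s for w, s in zip(weights, similarities)) / len(weights)


def select_weights(candidates, positive_combos, mixed_combos, threshold=0.5):
    """Return the first weight setting that separates the two kinds of combination.

    positive_combos: similarities (A, B, C) for positive-with-positive combinations.
    mixed_combos:    similarities (A, B, C) for positive-with-negative combinations.
    """
    for weights in candidates:
        positives_high = all(weighted_score(s, weights) > threshold for s in positive_combos)
        mixed_low = all(weighted_score(s, weights) <= threshold for s in mixed_combos)
        if positives_high and mixed_low:
            return weights
    return None  # no candidate separates the examples


# Hypothetical similarities for a few combinations.
positive_combos = [(0.9, 0.8, 0.9), (0.8, 0.9, 0.7)]
mixed_combos = [(0.3, 0.5, 0.2), (0.4, 0.2, 0.3)]

# The first candidate is "weight 1" from the description; the second is made up.
candidates = [(0.3, 0.1, 0.6), (1.0, 1.0, 1.0)]

print(select_weights(candidates, positive_combos, mixed_combos))  # (1.0, 1.0, 1.0)
```

A real search would adjust the weights iteratively rather than pick from a fixed list; the fixed list only keeps the sketch short.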

次に、第１の実施形態のオントロジー生成装置における関係抽出パターンの抽出処理のフローについて図９を用いて説明する。 Next, a flow of relation extraction pattern extraction processing in the ontology generation device of the first embodiment will be described with reference to FIG.

オントロジー生成装置４０は、シード語句対を受け付け、取得する（ステップＳ９０１）。また、オントロジー生成装置４０は、文書群を記録するデータベースなどから文書群を取得し、取得した文書群の文書からテキストを抽出し、取得する（ステップＳ９０２）。 The ontology generation device 40 receives and acquires seed word/phrase pairs (step S901). Also, the ontology generation device 40 acquires a group of documents from a database or the like that records the group of documents, and extracts and acquires text from the documents of the acquired group of documents (step S902).

オントロジー生成装置４０は、取得されたテキストに対して形態素解析と構文解析、又はテキストの意味構造解析を行う、すなわち構文構造解析を行う（ステップＳ９０３）。オントロジー生成装置４０は、取得されたテキストに対し、形態素解析を行い、解析結果の語句集合の文書別主題分布（Topic Model）を学習、すなわち語句文書別主題分布学習する（ステップＳ９０４）。語句文書別主題分布学習の処理フローの詳細については後述する。 The ontology generation device 40 performs morphological analysis and syntactic analysis on the acquired text, or semantic structure analysis of the text, that is, syntactic structure analysis (step S903). The ontology generation device 40 performs morphological analysis on the acquired text, and learns the document-by-document theme distribution (Topic Model) of the analysis result word/phrase set, that is, word/phrase-document-by-word topic distribution learning (step S904). The details of the processing flow of topic distribution learning for each phrase document will be described later.

オントロジー生成装置４０は、抽出パターン生成候補記憶部４１３に記録された語句対の出現箇所に対して、語句記憶部４１１に記録されたシード語句対と比較し、出現箇所の語句対とシード語句対との意味関連性スコアが閾値を超えたか否かを判断（評価）する。意味関連性スコアが閾値を超えない場合には出現箇所（文）を抽出パターン生成候補記憶部４１３から削除する。すなわち、オントロジー生成装置４０はパターン情報抽出処理を行う（ステップＳ９０５）。パターン情報抽出処理のフローの詳細については後述する。 The ontology generation device 40 compares the occurrence location of the word/phrase pair recorded in the extraction pattern generation candidate storage unit 413 with the seed word/phrase pair recorded in the word/phrase storage unit 411, and compares the occurrence location word/phrase pair with the seed word/phrase pair. It is determined (evaluated) whether or not the semantic relevance score of exceeds the threshold. If the semantic relevance score does not exceed the threshold, the occurrence location (sentence) is deleted from the extraction pattern generation candidate storage unit 413 . That is, the ontology generation device 40 performs pattern information extraction processing (step S905). The details of the flow of pattern information extraction processing will be described later.

The ontology generation device 40 uses the generated relationship extraction patterns to extract the words/phrases that fill the variable portions of the patterns, that is, it performs word/phrase information extraction processing (step S906).

ここで、上記語句文書別主題分布学習の処理フローについて図１０を用いて説明する。オントロジー生成装置４０は、取得されたテキスト内の語句を同定する（ステップＳ１００１）。すなわち、テキスト内から語句を抽出して学習対象の語句を特定する。オントロジー生成装置４０は、同定した語句集合の文書別主題分布を学習、すなわちモデル学習を行う（ステップＳ１００２）。 Here, the processing flow of the topic distribution learning by word/phrase document will be described with reference to FIG. The ontology generation device 40 identifies words in the acquired text (step S1001). That is, words are extracted from the text and the words to be learned are specified. The ontology generation device 40 learns the document-by-document theme distribution of the identified phrase set, that is, performs model learning (step S1002).

The pattern information extraction processing flow is described with reference to FIG. 11. The ontology generation device 40 first extracts the occurrence locations D of the word/phrase pair j recorded in the extraction pattern generation candidate storage unit 413 (step S1101). For each occurrence location dj of the word/phrase pair j = {j^A, j^B} and each seed word/phrase pair i = {i^A, i^B}, it calculates the semantic relevance score in the document group for the phrases {i^A, j^A} and the phrases {i^B, j^B} (step S1102). In other words, the score is calculated by comparing the front phrases with each other and the rear phrases with each other for the first word (the seed word/phrase pair) and the second word (the word/phrase pair at the occurrence location).

The ontology generation device 40 determines whether the calculated semantic relevance score exceeds the threshold (step S1103). If it does not (No in step S1103), the relevance is judged to be low, the occurrence location dj is deleted (excluded) from the candidates for generating relationship extraction patterns (step S1104), and the same processing is performed for the next word/phrase pair. If it does (Yes in step S1103), the ontology generation device 40 performs the same processing for the next seed word/phrase pair.

すべての語句対について処理が終了すると、オントロジー生成装置４０は語句対ｊの出現箇所（フィルタリング済み）から語句対の関係抽出パターンを生成する（ステップＳ１１０５）。 When the processing for all the word/phrase pairs is completed, the ontology generation device 40 generates a relationship extraction pattern of the word/phrase pairs from the occurrence locations (filtered) of the word/phrase pair j (step S1105).
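The filtering part of this flow (steps S1101 to S1104) can be sketched as follows. The sketch assumes one reading of the flowchart, namely that an occurrence is kept only if its score exceeds the threshold for every seed pair; the function name, the stub scores, and the threshold are hypothetical.

```python
def filter_occurrences(occurrences, seed_pairs, score, threshold):
    """Keep only occurrences whose score exceeds the threshold for every seed pair."""
    kept = []
    for occurrence in occurrences:  # S1101: each occurrence location dj
        # S1102-S1103: score against each seed pair; the first failure stops the check.
        if all(score(seed, occurrence) > threshold for seed in seed_pairs):
            kept.append(occurrence)
        # Otherwise S1104: the occurrence is excluded from pattern generation.
    return kept


# Hypothetical precomputed scores standing in for the semantic relevance score.
occurrences = [
    "Fugu, a type of sedan, was announced",
    "A new Fugu was announced in Sedan",
]
stub_scores = {occurrences[0]: 0.8, occurrences[1]: 0.2}
seed_pairs = [("sedan", "Sky○○")]


def stub_score(seed, occurrence):
    return stub_scores[occurrence]


remaining = filter_occurrences(occurrences, seed_pairs, stub_score, threshold=0.5)
print(remaining)  # ['Fugu, a type of sedan, was announced']
```

Relationship extraction patterns (step S1105) would then be generated only from the occurrences that remain.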

第１の実施形態では前方語句同士、後方語句同士の評価をしているが、後述する第２の実施形態では各語句対の語句を１つの語句として組み合わせた語句同士の評価も更にしている。この３つの評価パターンを用いた場合における関係抽出パターンの生成候補の削除判断の概念について図１２を用いて説明する。 In the first embodiment, forward words and phrases and backward words and phrases are evaluated. In the second embodiment, which will be described later, words and phrases that are combined as one word and phrase are further evaluated. . The concept of deletion determination of the relationship extraction pattern generation candidate when these three evaluation patterns are used will be described with reference to FIG. 12 .

For example, consider a seed word/phrase pair (A, B) with front phrase A, rear phrase B, and combination phrase (AB), and a word/phrase pair (C, D) with front phrase C, rear phrase D, and combination phrase (CD). As shown in FIG. 12, in document 2 the front phrases (A, C), the rear phrases (B, D), and the combination phrases (AB, CD) each tend to express the same topic: the front phrases (A, C) express topic T1, the rear phrases (B, D) express topic T2, and the combination phrases (AB, CD) express topic T3.

一方、文書１では組み合わせ語句同士（ＡＢ、ＣＤ）は主題分布傾向が異なる。すなわち、語句（ＡＢ）は主題Ｔ２、語句（ＣＤ）は主題Ｔ１を表し、それぞれの語句の主題が異なる。それぞれの語句が同じ文書で主題が異なることから、文書１の出現箇所を関係抽出パターンの生成候補から削除し、文書２の出現箇所のみから関係抽出パターンを生成する。 On the other hand, in document 1, the theme distribution tendencies of the combination words (AB, CD) are different. That is, the words (AB) represent the subject T2, the words (CD) represent the subject T1, and the subjects of the respective words are different. Since the words and phrases are the same in the document but have different themes, the locations of appearance in Document 1 are deleted from the relationship extraction pattern generation candidates, and a relationship extraction pattern is generated only from the locations of appearance in Document 2.

次に、第１の実施形態のオントロジー生成装置のハードウェア構成について図１３を用いて説明する。なお、後述する他の実施形態のオントロジー生成装置においても同様であるため、他の実施形態のオントロジー生成装置のハードウェア構成についての説明は省略する。 Next, the hardware configuration of the ontology generation device of the first embodiment will be explained using FIG. Since the same applies to ontology generation devices of other embodiments, which will be described later, a description of the hardware configuration of the ontology generation devices of other embodiments will be omitted.

The hardware of the ontology generation device 40 consists of a CPU (Central Processing Unit) 130, an HDD (Hard Disk Drive) 131, a RAM (Random Access Memory) 132, a ROM (Read Only Memory) 133, an input device 134, a communication interface 135, and a bus 136. The CPU 130, HDD 131, RAM 132, ROM 133, input device 134, and communication interface 135 are connected to one another via, for example, the bus 136.

ＣＰＵ１３０は、バス１３６を介して、ＨＤＤ１３１などに格納される各種処理（例えば、パターン情報抽出処理など）を行うためのプログラムを読み込み、読み込んだプログラムをＲＡＭ１３２に一時的に格納し、そのプログラムにしたがって各種処理を行う。ＣＰＵ１３０は、主として上記テキスト取得部４０２、構文解析部４０３、抽出処理制御部４０４、抽出パターン生成候補抽出部４０５、パターン抽出部４０６、語句抽出部４０７、オントロジー生成部４０８、出力部４０９、語句文書別主題分布学習部４１４、抽出パターン生成候補評価部４１６として機能する。 The CPU 130 reads a program for performing various processing (for example, pattern information extraction processing) stored in the HDD 131 or the like via the bus 136, temporarily stores the read program in the RAM 132, and executes the program according to the program. Perform various processing. The CPU 130 mainly includes the text acquisition unit 402, the syntax analysis unit 403, the extraction processing control unit 404, the extraction pattern generation candidate extraction unit 405, the pattern extraction unit 406, the phrase extraction unit 407, the ontology generation unit 408, the output unit 409, the phrase document It functions as a separate subject distribution learning unit 414 and an extraction pattern generation candidate evaluation unit 416 .

ＨＤＤ１３１には、オントロジー生成装置４０の各種処理を行うためのアプリケーションプログラムや、オントロジー生成装置４０の処理に必要なデータなどが格納される。ＨＤＤ１３１は、主として上述した各種記憶部として機能する。 The HDD 131 stores application programs for performing various processes of the ontology generation device 40, data necessary for processing of the ontology generation device 40, and the like. The HDD 131 mainly functions as various storage units described above.

ＲＡＭ１３２は、揮発性メモリであって、ＣＰＵ１３０に実行させるためのＯＳ（Operating System）プログラムやアプリケーションプログラムの一部が一時的に格納される。また、ＲＡＭ１３２には、ＣＰＵ１３０による処理に必要な各種データが格納される。 The RAM 132 is a volatile memory that temporarily stores a part of an OS (Operating System) program and application programs to be executed by the CPU 130 . Also, the RAM 132 stores various data necessary for processing by the CPU 130 .

ＲＯＭ１３３は、不揮発性メモリであって、ブートプログラムやＢＩＯＳ（Basic Input / Output System）などのプログラムを記憶する。 The ROM 133 is a non-volatile memory and stores programs such as a boot program and BIOS (Basic Input/Output System).

The input device 134 is used to input information and is, for example, a keyboard.
The communication interface 135 transmits and receives data over a network to and from external entities (for example, an external device that requests ontology construction results).

The bus 136 is a path that carries control signals, data signals, and the like between the devices.
According to the ontology generation device of the first embodiment, the relevance of each word/phrase pair occurrence to the seed word/phrase pairs is evaluated using the per-document topic distribution of words/phrases, and inappropriate occurrences are deleted, so that extraction of inappropriate relationship extraction patterns can be suppressed. That is, as shown by the dotted region in FIG. 14, suppressing the extraction of inappropriate relationship extraction patterns 1400 improves the accuracy with which word/phrase pairs having the intended relationship are extracted.

また、図１５に示すように、不適切な関係抽出パターンを除外することにより、意図した語句対を抽出できる、すなわち不適切な語句対の抽出を避けることができる。 Also, as shown in FIG. 15, by excluding inappropriate relationship extraction patterns, intended word pairs can be extracted, that is, extraction of inappropriate word/phrase pairs can be avoided.

Here, we verified that using the per-document topic distribution of words/phrases makes it possible to remove word/phrase pairs that do not have the relationship expressed by the seed word/phrase pairs. As shown in FIGS. 16A and 16B, the target document group consists of 706 press release documents, and the seed word/phrase pairs are five pairs of {product/service type} and {product/service name}, such as (smartphone, ARLOWSZISW11P). The comparison is between word/phrase pair extraction by a conventional ontology generation technique, which is not the technique of Patent Literature 1, and extraction by the ontology generation device of the embodiment.

図１６Ａは従来のオントロジー生成方法によって抽出された語句対の結果を示し、図１６Ｂは実施形態のオントロジー生成装置によって抽出された語句対の結果を示している。なお、図１６Ｂの抽出結果を示す表に記載されている○印は意図する語句対が抽出されたことを示している。また、両者の結果を表にまとめたものが図１６Ｃに示されている。 FIG. 16A shows the results of word pairs extracted by the conventional ontology generation method, and FIG. 16B shows the results of word pairs extracted by the ontology generation device of the embodiment. It should be noted that the ◯ mark described in the table showing the extraction result in FIG. 16B indicates that the intended word/phrase pair was extracted. FIG. 16C shows a table summarizing both results.

図１６Ｃに示す抽出件数は抽出された語句対の件数であり、正解件数は意図した関係の語句対が抽出された件数である。また、再現率は正解の語句対がどのくらいあるかを示すものであり、従来技術では意図する関係の有無に関わらず語句対を抽出するため再現率は１としている。適合率は正解件数を抽出件数で割ったもの、すなわち正解率である。Ｆ値は予測結果の評価尺度である。 The number of extractions shown in FIG. 16C is the number of extracted word/phrase pairs, and the number of correct answers is the number of extracted word/phrase pairs having the intended relationship. The recall rate indicates how many correct word pairs exist. In the conventional technology, the recall rate is set to 1 because word pairs are extracted regardless of the presence or absence of an intended relationship. The precision rate is obtained by dividing the number of correct answers by the number of extractions, that is, the correct answer rate. The F value is an evaluation scale for prediction results.

With the conventional technique, the number of extractions is 243, the number of correct answers is 88, recall is 1, precision is 0.36, and the F-measure is 0.53. With the ontology generation device of the embodiment, the number of extractions is 135, the number of correct answers is 81, recall is 0.82, precision is 0.62, and the F-measure is 0.72. These results show that, although the embodiment yields fewer correct answers than the conventional technique, it also yields fewer extractions, so precision (the accuracy rate) improves.
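The figures reported for the conventional technique can be reproduced with a short calculation. The sketch below assumes that the F-measure is the usual harmonic mean of precision and recall; the description only calls it an evaluation measure, so this definition and the function names `precision` and `f_measure` are assumptions for illustration.

```python
def precision(correct: int, extracted: int) -> float:
    """Number of correct answers divided by the number of extractions."""
    return correct / extracted


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * p * r / (p + r)


# Conventional technique: 243 extractions, 88 correct answers, recall set to 1.
p_conv = precision(88, 243)
f_conv = f_measure(p_conv, 1.0)
print(round(p_conv, 2), round(f_conv, 2))  # prints 0.36 0.53
```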

<Second Embodiment>
The configuration of the ontology generation device of the second embodiment is described with reference to FIG. 17. As shown in FIG. 17, the ontology generation device 1700 of the second embodiment differs from that of the first embodiment only in that it has a phrase pair combination generation unit 1701; the other components are the same. Therefore, only the phrase pair combination generation unit 1701 is described, description of the shared components is omitted, and shared components are referred to by the same reference numerals as in the first embodiment.

The phrase pair combination generation unit 1701 analyzes the text acquired from the text acquisition unit 402, extracts phrases, and generates phrase combinations (also called combined phrases) from phrases that appear in the same sentence. For example, if the text is as shown in FIG. 7A, "sedan" and "aaa" appear in the same sentence, so a phrase combination such as "sedan-aaa" is generated as shown in FIG. 18. Phrases such as "sedan" and "aaa" are phrases in the phrase set obtained as the analysis result in the first embodiment.
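As a minimal sketch of this step, the following code builds combined phrases from phrases that co-occur in a sentence. It assumes that each sentence has already been reduced to a list of phrases, that combinations are formed pairwise in order of appearance, and that the parts are joined with a hyphen as in "sedan-aaa". The function name `phrase_combinations` and the second sample sentence are hypothetical.

```python
from itertools import combinations


def phrase_combinations(sentences):
    """Return combined phrases for every pair of phrases sharing a sentence.

    sentences: list of sentences, each given as a list of phrases in
    order of appearance (assumed output of the phrase extraction step).
    """
    combos = set()
    for phrases in sentences:
        unique = list(dict.fromkeys(phrases))  # drop repeats, keep order
        for first, second in combinations(unique, 2):
            combos.add(f"{first}-{second}")
    return combos


# "sedan" and "aaa" share a sentence, so "sedan-aaa" is generated.
sentences = [["sedan", "aaa"], ["BBB", "new business"]]
combos = phrase_combinations(sentences)
```

Phrases from different sentences are never combined, so the combined phrases stay tied to a single occurrence location.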

The phrase per-document topic distribution learning unit 414 learns the per-document topic distribution with the above phrase combinations included in the phrase set of the first embodiment, and stores the learning result (vectors) in the phrase per-document topic distribution storage unit 415.

The extraction pattern generation candidate evaluation unit 416 additionally takes the phrase combinations into account when determining (evaluating) whether the semantic relevance score between the phrase pair at an occurrence location and the seed phrase pair exceeds a threshold. If the semantic relevance score does not exceed the threshold, the occurrence location (sentence) is deleted from the extraction pattern generation candidate storage unit 413.

Specifically, the extraction pattern generation candidate evaluation unit 416 uses the phrase topic distribution and the document topic distribution shown in FIG. 18 and calculates the semantic relevance score by Equation (2).

Here, the calculation of the semantic relevance score of two phrase pairs i and j in document d is described. Phrase pair i is a seed phrase pair and phrase pair j is an extracted phrase pair. In document d, the semantic relevance score SRScore_{i,j,d} is calculated by Equation (2) for phrase pair j = {j^A, j^B} and phrase pair i = {i^A, i^B}. Here, j^A is phrase A of phrase pair j (front phrase A), j^B is phrase B of phrase pair j (rear phrase B), and j^C is the combined phrase C formed from phrases A and B of phrase pair j. Likewise, i^A is phrase A of phrase pair i (front phrase A), i^B is phrase B of phrase pair i (rear phrase B), and i^C is the combined phrase C formed from phrases A and B of phrase pair i. For example, if phrase pair j is (BBB, Zinrei), then phrase A is "BBB", phrase B is "Zinrei", and combined phrase C is "BBB-Zinrei".

Here, the vector associated with i^C and document d in Equation (2) represents the topic distribution in document d of the combined phrase C of seed phrase pair i. It is obtained, for example with three topics (dimension K = 3), by multiplying the topic distribution C_k of combined phrase C by the topic distribution D_k of document d. The topic distribution C_k of combined phrase C is the phrase combination part of the phrase topic distribution shown in FIG. 18, and the topic distribution D_k of document d is the document topic distribution shown in FIG. 18.

For example, if the topic distribution C_k of combined phrase C is {0.3, 0.1, 0.6} and the topic distribution D_k of document d is {0.4, 0.4, 0.2}, the topic distribution of combined phrase C in document d is {0.3×0.4, 0.1×0.4, 0.6×0.2} = {0.12, 0.04, 0.12}.
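The multiplication in this example is element-wise, one product per topic. A minimal sketch that reproduces the numbers above follows; the function name `topic_distribution_in_document` is hypothetical.

```python
def topic_distribution_in_document(phrase_topics, document_topics):
    """Multiply a phrase's topic distribution by a document's, topic by topic."""
    if len(phrase_topics) != len(document_topics):
        raise ValueError("both distributions must have the same number of topics")
    return [c * d for c, d in zip(phrase_topics, document_topics)]


c_k = [0.3, 0.1, 0.6]  # topic distribution of combined phrase C (K = 3)
d_k = [0.4, 0.4, 0.2]  # topic distribution of document d
result = topic_distribution_in_document(c_k, d_k)
print([round(x, 2) for x in result])  # prints [0.12, 0.04, 0.12]
```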

α_l is the weighting coefficient for the front phrase A, the rear phrase B, and the combined phrase C. It may be set manually in advance or, as described above, determined and set automatically.
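Equation (2) itself is not reproduced in this text, so the following is only a sketch of how a weighted score over the three components A, B, and C could be computed. It assumes, purely for illustration, that each component is compared by cosine similarity between the in-document topic vectors of the seed phrase and the extracted phrase, and that the score is the α_l-weighted sum of those similarities. The functions `cosine` and `sr_score`, the sample vectors, and the weights are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sr_score(seed_vectors, candidate_vectors, weights):
    """Weighted sum of per-component similarities (assumed form of the score).

    Each argument is a dict keyed by component: "A" (front phrase),
    "B" (rear phrase), and "C" (combined phrase).
    """
    return sum(
        weights[l] * cosine(seed_vectors[l], candidate_vectors[l])
        for l in weights
    )


# Hypothetical in-document topic vectors for seed pair i and extracted pair j.
seed = {"A": [0.12, 0.04, 0.12], "B": [0.10, 0.20, 0.05], "C": [0.05, 0.05, 0.30]}
candidate = {"A": [0.10, 0.05, 0.10], "B": [0.02, 0.25, 0.01], "C": [0.30, 0.02, 0.03]}
weights = {"A": 0.3, "B": 0.3, "C": 0.4}  # alpha_l, hypothetical values

score = sr_score(seed, candidate, weights)
```

Raising the weight of component C makes the score more sensitive to whether the combined phrases of the two pairs share topics in the document.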

Combined phrases are taken into account in this way because the relationship expressed by a phrase pair can sometimes be expressed by the combination of its constituent phrases. For example, in the phrase pairs (BBB, ArlowsM03) and (CCC, Xpier), "BBB" and "CCC" are smartphone manufacturers, and "ArlowsM03" and "Xpier" are smartphones. However, the phrase pair (BBB, ArlowsM03) represents the relationship between a manufacturer and its own product, whereas the phrase pair (CCC, Xpier) represents the relationship between a manufacturer and another company's product.

Therefore, taking combined phrases into account makes it possible to further suppress the extraction of phrase pairs with unintended relationships.

Next, the flow of the relation extraction pattern extraction process in the ontology generation device of the second embodiment is described. The basic flow is the same as that of the first embodiment shown in FIG. 9, so only the differing processes are described: phrase per-document topic distribution learning and pattern information extraction.

The flow of phrase per-document topic distribution learning is described with reference to FIG. 19. The ontology generation device 1700 identifies phrases in the acquired text (step S1901); that is, it extracts phrases from the text and specifies the phrases to be learned. The ontology generation device 1700 then generates phrase combinations within the same sentence (step S1902). Finally, it learns the per-document topic distribution of the identified phrase set (including the phrase combinations), that is, it performs model learning (step S1903).

The flow of the pattern information extraction process is described with reference to FIG. 20. The ontology generation device 1700 first extracts the occurrence locations D of phrase pair j recorded in the extraction pattern generation candidate storage unit 413 (step S2001). For an occurrence location dj of phrase pair j = {j^A, j^B} and for the seed phrase pair i = {i^A, i^B}, it generates the combined phrases i^C = i^A i^B and j^C = j^A j^B (step S2002).

The ontology generation device 1700 calculates the semantic relevance scores in the document group for the phrases {i^A, j^A}, {i^B, j^B}, and {i^C, j^C} (step S2003). That is, the front phrases of the first word (the seed phrase pair) and the second word (the phrase pair at the occurrence location) are compared with each other, as are their rear phrases, and the phrase combining the front and rear phrases of the first word is compared with the phrase combining the front and rear phrases of the second word, to calculate the semantic relevance score.

The ontology generation device 1700 determines whether the calculated semantic relevance score exceeds the threshold (step S2004). If it does not (No in step S2004), the relevance is deemed low, the occurrence location dj is deleted (excluded) from the candidates for generating relation extraction patterns (step S2005), and the same processing is performed for the next phrase pair. If it does (Yes in step S2004), the ontology generation device 1700 performs the same processing for the next seed phrase pair.

When processing has finished for all phrase pairs, the ontology generation device 1700 generates relation extraction patterns for the phrase pairs from the (filtered) occurrence locations of phrase pair j (step S2006).
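The filtering in steps S2004 and S2005 can be sketched as a simple loop. The code below reflects one reading of the flow: an occurrence location is dropped as soon as its score against any seed phrase pair fails to exceed the threshold, and is kept only if it exceeds the threshold for every seed phrase pair. The function `filter_occurrences`, the scoring callback, and the sample scores are hypothetical.

```python
def filter_occurrences(occurrences, seed_pairs, score_fn, threshold):
    """Keep occurrence locations whose score exceeds the threshold for all seeds.

    score_fn(seed, occurrence) stands in for the semantic relevance score.
    """
    kept = []
    for occurrence in occurrences:
        # all() stops at the first seed whose score does not exceed the
        # threshold, which corresponds to the "No" branch of the check.
        if all(score_fn(seed, occurrence) > threshold for seed in seed_pairs):
            kept.append(occurrence)
    return kept


# Hypothetical precomputed scores for two occurrence locations and two seeds.
SCORES = {
    ("seed1", "sentence1"): 0.8, ("seed2", "sentence1"): 0.6,
    ("seed1", "sentence2"): 0.9, ("seed2", "sentence2"): 0.4,
}
kept = filter_occurrences(
    ["sentence1", "sentence2"],
    ["seed1", "seed2"],
    lambda seed, occ: SCORES[(seed, occ)],
    threshold=0.5,
)
```

Only the occurrence locations that survive this filter would then be used to generate relation extraction patterns.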

<Third Embodiment>
The configuration of the ontology generation device of the third embodiment is described with reference to FIG. 21. As shown in FIG. 21, the ontology generation device 2100 of the third embodiment differs from that of the second embodiment only in that it has a synonym replacement unit 2101 and a synonym dictionary 2102; the other components are the same. Therefore, only the synonym replacement unit 2101 and the synonym dictionary 2102 are described, description of the shared components is omitted, and shared components are referred to by the same reference numerals as in the first and second embodiments.

The synonym replacement unit 2101 unifies phrases that have the same meaning in every case, independent of context or document, into a single expression. That is, when the synonym replacement unit 2101 determines, based on predetermined synonym replacement information (the synonym dictionary 2102), that the text contains a synonymous phrase, it replaces that phrase with a predetermined phrase.

For example, the phrases "BBB(株)" and "BBB株式会社" (an abbreviated and a full way of writing "BBB Co., Ltd.") are synonymous, so they are unified as "BBB株式会社". The unification does not change the meaning of the phrases. This improves the learning effect of the phrase per-document topic distribution learning unit 414 even for a document group containing few documents.
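A dictionary-based replacement of this kind can be sketched as follows. The code assumes that the synonym dictionary is a plain mapping from each variant to its unified form and that replacement is done by exact string matching in a single pass; the function `build_replacer` and the sample text are hypothetical.

```python
import re


def build_replacer(synonyms):
    """Return a function that rewrites every variant to its unified form.

    synonyms: dict mapping a variant phrase to the phrase it is unified to.
    """
    if not synonyms:
        return lambda text: text
    # Try longer variants first so that a long variant is not shadowed
    # by a shorter variant that happens to be its prefix.
    variants = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in variants))
    return lambda text: pattern.sub(lambda m: synonyms[m.group(0)], text)


# Unify the abbreviated company name to the full form.
unify = build_replacer({"BBB(株)": "BBB株式会社"})
replaced = unify("BBB(株) announced a new product.")
```

Because the text is rewritten before phrase extraction, both spellings contribute to the statistics of the same phrase during topic distribution learning.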

The synonym dictionary 2102 stores the dictionary information used for the synonym replacement. This dictionary information may, for example, be prepared in advance by a user or prepared by an automatic extraction method. Automatic preparation is possible, for example, by the following methods.

The first method is based on the meanings of the constituents of a phrase. For example, the phrases "色名" and "カラー名" (both meaning "color name", written with the native word and the loanword for "color", respectively) are each composed of the meanings "color" and "name", so they are regarded as synonymous.

The second method is based on statistical information about occurrence contexts in a large document group. For example, a given tool learns from a large corpus that "BBB" and "BBB株式会社" ("BBB Co., Ltd.") have the same semantic vector, and the two are regarded as synonyms. A corpus is a database of a large amount of linguistic data that can be searched by computer.

Next, the flow of the relation extraction pattern extraction process in the ontology generation device of the third embodiment is described with reference to FIG. 22. This flow differs from those of the first and second embodiments in that it includes the processing of step S2203.

The ontology generation device 2100 receives and acquires seed phrase pairs (step S2201). It also acquires a document group from, for example, a database that stores document groups, and extracts and acquires text from the documents of the acquired document group (step S2202).

The ontology generation device 2100 performs synonym replacement on the acquired text (step S2203). Details of the synonym replacement are described later.

The ontology generation device 2100 performs morphological analysis and syntactic analysis, or semantic structure analysis, on the synonym-replaced text; that is, it performs syntactic structure analysis (step S2204). It also performs morphological analysis on the replaced text and learns the per-document topic distribution (topic model) of the resulting phrase set; that is, it performs phrase per-document topic distribution learning (step S2205). The flow of phrase per-document topic distribution learning is the same as in the second embodiment, so its description is omitted.

The ontology generation device 2100 compares the occurrence locations of phrase pairs recorded in the extraction pattern generation candidate storage unit 413 with the seed phrase pairs recorded in the phrase storage unit 411, and determines (evaluates) whether the semantic relevance score between the phrase pair at each occurrence location and the seed phrase pair exceeds the threshold. If the semantic relevance score does not exceed the threshold, the occurrence location (sentence) is deleted from the extraction pattern generation candidate storage unit 413. That is, the ontology generation device 2100 performs the pattern information extraction process (step S2206). The flow of the pattern information extraction process is the same as in the second embodiment, so its description is omitted.

The ontology generation device 2100 uses the generated relation extraction patterns to extract the phrases corresponding to the variable parts of the patterns; that is, it performs the phrase information extraction process (step S2207).

Here, the synonym replacement is described with reference to FIGS. 23A and 23B. FIG. 23A shows the text after synonym replacement has been applied to the acquired text. FIG. 23B shows the learning result obtained when learning is performed without synonym replacement and the learning result obtained when learning is performed after synonym replacement.

As shown in FIG. 23A, the acquired text contains both "BBB" and "BBB株式会社" ("BBB Co., Ltd."); these are treated as synonymous, and the text is rewritten with "BBB" as the unified form. Because synonyms are replaced and unified, the learning result obtained with replacement is smaller than the learning result obtained without replacement, as shown in FIG. 23B.

<Fourth Embodiment>
The configuration of the ontology generation device of the fourth embodiment is the same as that of the second embodiment, so it is described with reference to the configuration diagram shown in FIG. 17. The fourth embodiment is characterized in that, when the phrase pair combination generation unit 1701 generates phrase combinations, it does so in consideration of the modification relations between phrases.

That is, the phrase pair combination generation unit 1701 considers the modification relations within an occurrence location (sentence) and treats a combination of the phrases close to the phrase under analysis as a single phrase. For example, for the sentence "This year's Best Smartphone Award was won by ArlowsM03, the latest product of BBB", which is an occurrence location (sentence) of the phrase pair (BBB, ArlowsM03), the phrase modification relations are as shown in FIG. 6.

For each phrase, the combination of phrases close to it in the modification relation is registered as a single phrase. The modification structure within a sentence is treated as an undirected graph, from which the search distance between phrases is obtained. For example, from the phrase "ArlowsM03", "this year" can be reached along the path "ArlowsM03" → "won" → "Best Smartphone Award" → "this year" (search distance 3). The search distance is limited, and the group of modifying phrases within that distance is registered as one phrase. For example, with a search distance of 1, "latest product-ArlowsM03-won" is regarded as one phrase. With the maximum search distance, "BBB-latest product-ArlowsM03-won-Best Smartphone Award-this year" is regarded as one phrase.

Accordingly, at an occurrence location of the phrase pair (BBB, ArlowsM03) with a search distance of 1, two phrases are generated: "BBB-latest product", the modification-related phrase of "BBB" (the aggregate of phrases close to BBB), and "latest product-ArlowsM03-won", the modification-related phrase of "ArlowsM03" (the aggregate of phrases close to ArlowsM03). The search distance may be set in advance by the user or set by a predetermined automatic generation method.
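Collecting the phrases within a given search distance amounts to a breadth-first search over the undirected modification graph. The sketch below is illustrative only: the edge list is reconstructed from the example path and the distance-1 groups described above (FIG. 6 itself is not reproduced here), the node labels are English glosses of the phrases in the example sentence, and the function `phrases_within` is hypothetical.

```python
from collections import deque


def phrases_within(graph, start, max_distance):
    """Return the phrases reachable from start within max_distance edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue  # do not expand beyond the search distance
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


# Modification relations of the example sentence (assumed from the text).
EDGES = [
    ("BBB", "latest product"),
    ("latest product", "ArlowsM03"),
    ("ArlowsM03", "won"),
    ("won", "Best Smartphone Award"),
    ("Best Smartphone Award", "this year"),
]
graph = {}
for a, b in EDGES:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

near_product = phrases_within(graph, "ArlowsM03", 1)
near_maker = phrases_within(graph, "BBB", 1)
```

The function returns an unordered set; how the members are ordered when they are joined into one registered phrase is left open here.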

FIG. 24 shows an example of the learning results for the topic distribution of phrases, including the combinations generated in consideration of modification relations, and for the document topic distribution. The underlying text is the text shown in FIG. 7A. The thick-framed part of the phrase topic distribution in FIG. 24 shows the phrase combinations with a search distance of 1 in the modification relation within the same sentence (such as "BBB-AAA-new business").

The calculation of the semantic relevance score in the fourth embodiment is described below using Equation (3). Specifically, the extraction pattern generation candidate evaluation unit 416 uses the phrase topic distribution and the document topic distribution shown in FIG. 24 and calculates the semantic relevance score by Equation (3).

Here, the calculation of the semantic relevance score of two phrase pairs i and j in document d is described. Phrase pair i is a seed phrase pair and phrase pair j is an extracted phrase pair. In document d, the semantic relevance score SRScore_{i,j,d} is calculated by Equation (3) for phrase pair j = {j^A, j^B} and phrase pair i = {i^A, i^B}. Here, j^A is phrase A of phrase pair j (front phrase A), j^B is phrase B of phrase pair j (rear phrase B), and j^C is the combined phrase C formed from phrases A and B of phrase pair j. In addition, j^AG is the modifier set of front phrase A (front modifier phrase AG), j^BG is the modifier set of rear phrase B (rear modifier phrase BG), and j^CG is the modifier set of the combined phrase CG formed from front modifier phrase AG and rear modifier phrase BG (combined modifier phrase CG).

Likewise, i^A is phrase A of phrase pair i (front phrase A), i^B is phrase B of phrase pair i (rear phrase B), and i^C is the combined phrase C formed from phrases A and B of phrase pair i. In addition, i^AG is the modifier set of front phrase A (front modifier phrase AG), i^BG is the modifier set of rear phrase B (rear modifier phrase BG), and i^CG is the modifier set of the combined phrase CG formed from front modifier phrase AG and rear modifier phrase BG (combined modifier phrase CG). Taking the example above, if phrase pair j is (BBB, ArlowsM03), then phrase A is "BBB", phrase B is "ArlowsM03", and combined phrase C is "BBB-ArlowsM03". Phrase AG is "BBB-latest product", phrase BG is "latest product-ArlowsM03-won", and phrase CG is "BBB-latest product-ArlowsM03-won".

Here, the vector associated with i^AG and document d in Equation (3) represents the topic distribution in document d of the front modifier phrase AG of seed phrase pair i.

α_l is the weighting coefficient for the front phrase A, the rear phrase B, the combined phrase C, the front modifier phrase AG, the rear modifier phrase BG, and the combined modifier phrase CG. It may be set manually in advance or, as described above, determined and set automatically.

Next, the flow of the relation extraction pattern extraction process in the ontology generation device of the fourth embodiment is described. The basic flow is the same as that of the first embodiment shown in FIG. 9, so only the differing processes are described: phrase per-document topic distribution learning and pattern information extraction.

The flow of phrase per-document topic distribution learning is described with reference to FIG. 25. The ontology generation device 1700 identifies phrases in the acquired text (step S2501); that is, it extracts phrases from the text and specifies the phrases to be learned. The ontology generation device 1700 then generates combinations in consideration of the modification relations between phrases in the same sentence (step S2502). Finally, it learns the per-document topic distribution of the identified phrase set (including the phrase combinations), that is, it performs model learning (step S2503).

The flow of the pattern information extraction process is described with reference to FIG. 26. The ontology generation device 1700 first extracts the occurrence locations D of phrase pair j recorded in the extraction pattern generation candidate storage unit 413 (step S2601). For an occurrence location dj of phrase pair j = {j^A, j^B}, the ontology generation device 1700 considers the phrase modification relations within dj and generates j^AG, the aggregate of phrases close to j^A; j^BG, the aggregate of phrases close to j^B; and j^CG = j^AG j^BG (step S2602).

The ontology generation device 1700 generates i^AG, i^BG, and i^CG for the seed phrase pair i = {i^A, i^B} (step S2603).

The ontology generation device 1700 calculates the semantic relevance scores in the document group for the phrases {i^A, j^A}, {i^B, j^B}, {i^C, j^C}, {i^AG, j^AG}, {i^BG, j^BG}, and {i^CG, j^CG} (step S2604). That is, the semantic relevance score is calculated by also comparing the modifier sets of the front phrases of the first word (the seed phrase pair) and the second word (the phrase pair at the occurrence location) with each other, the modifier sets of their rear phrases with each other, and the phrase combining the front and rear modifier sets of the first word with the phrase combining the front and rear modifier sets of the second word.

The ontology generation device 1700 determines whether the calculated semantic relevance score exceeds the threshold (step S2605). If it does not (No in step S2605), the relevance is deemed low, the occurrence location dj is deleted (excluded) from the candidates for generating relation extraction patterns (step S2606), and the same processing is performed for the next phrase pair. If it does (Yes in step S2605), the ontology generation device 1700 performs the same processing for the next seed phrase pair.

When processing has finished for all phrase pairs, the ontology generation device 1700 generates relation extraction patterns for the phrase pairs from the (filtered) occurrence locations of phrase pair j (step S2607).

<Fifth Embodiment>
The fifth embodiment describes a search system that uses an ontology generated by the ontology generation device of any of the first to fourth embodiments. As shown in FIG. 27, in this search system the ontology generation device 40 (1700, 2100) is placed on a network 2701 and acquires, from a department document DB 2702, the accumulated document group that is created and updated daily by users of a specific community (XX department, company). The ontology generation device 40 generates an ontology from the acquired document group and records the generated ontology in an XX department ontology DB 2703. A keyword expansion device 2704 automatically expands an input keyword based on the keyword entered into a search server 2705 and the ontology in the XX department ontology DB 2703. The search server 2705 performs a search based on the automatically expanded keywords. The combination of the ontology generation device 40 and the keyword expansion device 2704 is also referred to as an automatic keyword expansion system.

Automatic expansion by the keyword expansion device 2704 is described concretely with reference to FIG. 28. Consider the case where the input keyword is, for example, "炭素" (carbon). The keyword is automatically expanded based on the input keyword "炭素" and the ontology. Using the ontology, "carbon" is extracted for the input keyword as its English name, and phrases such as "カーボン紙／複写を取る" (carbon paper / take a copy) are extracted as Japanese names in addition to "炭素". Furthermore, "C" is extracted as the chemical formula, and "炭素繊維" (carbon fiber), "carbon fiber", "アクリル" (acrylic), "ピッチ" (pitch), and the like are extracted as related terms. In addition, "PAN-based carbon fiber", "carbonized fiber", "graphite fiber", "pitch-based carbon fiber", and the like are extracted as subordinate concepts (solid lines) and related concepts (dashed lines) of the related terms.

When the input keyword "炭素" (carbon) is expanded based on the ontology in this way, the keyword is expanded as shown in FIG. 28, for example: carbon; 炭素／カーボン紙／複写を取る (carbon / carbon paper / take a copy); C; carbon fiber; 炭素繊維 (carbon fiber); {subordinate concept} PAN-based carbon fiber. Searching with the expanded keywords makes it possible to retrieve many related documents as search results.
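The expansion described for FIG. 28 can be sketched as a bounded traversal of the ontology starting from the input keyword. The following Python example is only an illustration: the dictionary layout, the relation names, and the `expand_keyword` function are hypothetical, and the entries are limited to the terms mentioned in the description.

```python
from collections import deque

# Hypothetical in-memory ontology: term -> relation -> list of terms.
ONTOLOGY = {
    "炭素": {
        "english_name": ["carbon"],
        "japanese_name": ["カーボン紙／複写を取る"],
        "chemical_formula": ["C"],
        "related": ["炭素繊維", "carbon fiber", "アクリル", "ピッチ"],
    },
    "炭素繊維": {
        "narrower": ["ＰＡＮ系炭素繊維", "ピッチ系炭素繊維"],
        "related": ["炭素化繊維", "黒鉛繊維"],
    },
}

def expand_keyword(keyword, ontology, max_depth=2):
    """Breadth-first expansion: collect every term reachable from the
    keyword within max_depth relation hops, without duplicates."""
    expanded, seen = [], set()
    queue = deque([(keyword, 0)])
    while queue:
        term, depth = queue.popleft()
        if term in seen:
            continue
        seen.add(term)
        expanded.append(term)
        if depth >= max_depth:
            continue
        for terms in ontology.get(term, {}).values():
            for linked in terms:
                queue.append((linked, depth + 1))
    return expanded

expanded_keywords = expand_keyword("炭素", ONTOLOGY)
# A search server could then combine the terms into one query, e.g. with OR.
query = " OR ".join(expanded_keywords)
```

Limiting the depth is a design choice made here to keep the expanded query from drifting too far from the input keyword; the description does not state how far the expansion goes.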

According to one aspect of the ontology generation device of each embodiment, the accuracy of extracting word/phrase pairs that have the intended relationship can be improved without being affected by the results of intermediate processing.

The following notes are further disclosed with respect to the above embodiments.
(Appendix 1) An ontology generation program executed by an ontology generation device that generates an ontology dictionary that is a semantic database, the program causing a computer of the ontology generation device to execute processing comprising:
dividing acquired document data into a plurality of word-sense sentences;
identifying, for the plurality of divided word-sense sentences, words that constitute a concept pair, which is a pair of lexical concepts; and
determining whether a first word constituting the concept pair corresponding to a first word-sense sentence and a second word constituting the concept pair corresponding to a second word-sense sentence, among the plurality of word-sense sentences, are similar, and, when they are determined not to be similar, excluding the words constituting the concept pair from registration in the ontology dictionary.
(Appendix 2) The ontology generation program according to Appendix 1, wherein, when the words are determined not to be similar, the word-sense sentence in which the second word appears is deleted from the candidates for extraction patterns used to extract words having a relationship similar to the first word, and the second word constituting the concept pair is excluded from registration in the ontology dictionary.
(Appendix 3) The ontology generation program according to Appendix 1 or 2, wherein a score indicating the degree of similarity between the first word and the second word is calculated by a predetermined calculation formula, and the first word and the second word are determined not to be similar when the calculated score is equal to or less than a predetermined threshold.
(Appendix 4) The ontology generation program according to Appendix 3, wherein morphological analysis is performed on the word-sense sentences to extract words/phrases; a word/phrase topic distribution, in which the extracted words/phrases and combined phrases formed by combining the extracted words/phrases refer to topics, and a document topic distribution, in which the word-sense sentences containing the extracted words/phrases refer to topics, are learned; and
the score is calculated using the word/phrase topic distribution and the document topic distribution.
(Appendix 5) The ontology generation program according to Appendix 4, wherein the score is calculated by comparing the forward phrases of the first word and the second word with each other and their backward phrases with each other, or by additionally including a comparison between a phrase combining the forward phrase and the backward phrase of the first word and a phrase combining the forward phrase and the backward phrase of the second word.
(Appendix 6) The ontology generation program according to Appendix 5, wherein the score is calculated by also including comparisons between the modifier sets of the forward phrases of the first word and the second word, between the modifier sets of their backward phrases, and between a phrase combining the modifier set of the forward phrase and the modifier set of the backward phrase of the first word and a phrase combining the modifier set of the forward phrase and the modifier set of the backward phrase of the second word.
(Appendix 7) The ontology generation program according to any one of Appendices 4 to 6, wherein, when it is determined that synonymous words/phrases exist in the word-sense sentences, they are replaced with a predetermined word/phrase based on predetermined information for synonym replacement, and the word/phrase topic distribution is learned after the replacement.
(Appendix 8) An ontology generation method for generating an ontology dictionary that is a semantic database, the method comprising:
dividing acquired document data into a plurality of word-sense sentences;
identifying, for the plurality of divided word-sense sentences, words that constitute a concept pair, which is a pair of lexical concepts; and
determining whether a first word constituting the concept pair corresponding to a first word-sense sentence and a second word constituting the concept pair corresponding to a second word-sense sentence, among the plurality of word-sense sentences, are similar, and, when they are determined not to be similar, excluding the words constituting the concept pair from registration in the ontology dictionary.
(Appendix 9) The ontology generation method according to Appendix 8, wherein, when the words are determined not to be similar, the word-sense sentence in which the second word appears is deleted from the candidates for extraction patterns used to extract words having a relationship similar to the first word, and the second word constituting the concept pair is excluded from registration in the ontology dictionary.
(Appendix 10) The ontology generation method according to Appendix 8 or 9, wherein a score indicating the degree of similarity between the first word and the second word is calculated by a predetermined calculation formula, and the first word and the second word are determined not to be similar when the calculated score is equal to or less than a predetermined threshold.
(Appendix 11) The ontology generation method according to Appendix 10, wherein morphological analysis is performed on the word-sense sentences to extract words/phrases; a word/phrase topic distribution, in which the extracted words/phrases and combined phrases formed by combining the extracted words/phrases refer to topics, and a document topic distribution, in which the word-sense sentences containing the extracted words/phrases refer to topics, are learned; and
the score is calculated using the word/phrase topic distribution and the document topic distribution.
(Appendix 12) The ontology generation method according to Appendix 11, wherein the score is calculated by comparing the forward phrases of the first word and the second word with each other and their backward phrases with each other, or by additionally including a comparison between a phrase combining the forward phrase and the backward phrase of the first word and a phrase combining the forward phrase and the backward phrase of the second word.
(Appendix 13) The ontology generation method according to Appendix 12, wherein the score is calculated by also including comparisons between the modifier sets of the forward phrases of the first word and the second word, between the modifier sets of their backward phrases, and between a phrase combining the modifier set of the forward phrase and the modifier set of the backward phrase of the first word and a phrase combining the modifier set of the forward phrase and the modifier set of the backward phrase of the second word.
(Appendix 14) The ontology generation method according to any one of Appendices 11 to 13, wherein, when it is determined that synonymous words/phrases exist in the word-sense sentences, they are replaced with a predetermined word/phrase based on predetermined information for synonym replacement, and the word/phrase topic distribution is learned after the replacement.
(Appendix 15) An automatic keyword expansion system comprising:
an ontology generation device that generates an ontology dictionary that is a semantic database, the ontology generation device having
a dividing unit that divides acquired document data into a plurality of word-sense sentences,
an identification unit that identifies, for the plurality of divided word-sense sentences, words that constitute a concept pair, which is a pair of lexical concepts, and
a processing unit that determines whether a first word constituting the concept pair corresponding to a first word-sense sentence and a second word constituting the concept pair corresponding to a second word-sense sentence, among the plurality of word-sense sentences, are similar, and, when they are determined not to be similar, excludes the words constituting the concept pair from registration in the ontology dictionary; and
a keyword expansion device that automatically expands an input keyword based on the input keyword and the ontology dictionary generated by the ontology generation device.

11 Document group
12 Relationship extraction pattern DB
13 Word/phrase pair DB
14 User
30, 32 Pattern
31, 33 Word/phrase pair
40, 1700, 2100 Ontology generation device
130 CPU
131 HDD
132 RAM
133 ROM
134 Input device
135 Communication interface
136 Bus
401 Seed phrase input unit
402 Text acquisition unit
403 Syntax analysis unit
404 Extraction processing control unit
405 Extraction pattern generation candidate extraction unit
406 Pattern extraction unit
407 Word/phrase extraction unit
408 Ontology generation unit
409 Output unit
410 Syntax structure storage unit
411 Word/phrase storage unit
412 Extraction pattern storage unit
413 Extraction pattern generation candidate storage unit
414 Word/phrase and document topic distribution learning unit
415 Word/phrase and document topic distribution storage unit
416 Extraction pattern generation candidate evaluation unit
1400 Inappropriate relationship extraction pattern
1701 Word/phrase pair combination generation unit
2101 Synonym replacement unit
2102 Synonym dictionary
2701 Network
2702 Department document DB
2703 XX department ontology DB
2704 Keyword expansion device
2705 Search server

Claims

An ontology generation device that generates an ontology dictionary that is a semantic database, the device comprising:
a storage unit that stores, as items of the ontology dictionary, word/phrase pairs, each of which is a pair of semantically related words/phrases consisting of a forward phrase and a backward phrase, the backward phrase being located after the forward phrase in a word-sense sentence containing the word/phrase pair;
an acquisition unit that acquires a seed word/phrase pair, which is a word/phrase pair input from outside, and stores it in the storage unit as an item;
a dividing unit that divides acquired document data into a plurality of word-sense sentences;
a word-sense sentence extraction unit that extracts, from the plurality of word-sense sentences obtained by the division, word-sense sentences containing a word/phrase pair stored as an item in the storage unit;
an evaluation unit that calculates, using a predetermined calculation formula, a score representing the degree of semantic relevance between the words/phrases constituting the word/phrase pair contained in a word-sense sentence extracted by the word-sense sentence extraction unit and the words/phrases constituting the seed word/phrase pair, determines whether the relevance is low by comparing the score calculated for the word-sense sentence with a predetermined threshold, and, when the relevance is determined to be low, excludes the word-sense sentence from the result of the extraction;
a generation unit that generates a relationship extraction pattern by replacing, in a word-sense sentence that remains in the result of the extraction without being excluded by the evaluation unit, each word/phrase constituting the word/phrase pair contained in the word-sense sentence with a variable to which attribute information about that word/phrase is assigned; and
a word/phrase pair extraction unit that extracts, from a word-sense sentence among the plurality of word-sense sentences obtained by the division in which the words/phrases other than the variables match the relationship extraction pattern and the attribute information of each word/phrase located at a variable position of the relationship extraction pattern matches the attribute information of that variable, the pair of words/phrases located at the variable positions, and stores the pair in the storage unit as an item,
wherein the predetermined calculation formula calculates, as the score, a weighted average of:
a vector distance between the topic distribution for the forward phrase of the word/phrase pair contained in the word-sense sentence and the topic distribution for the forward phrase of the seed word/phrase pair; and
a vector distance between the topic distribution for the backward phrase of the word/phrase pair contained in the word-sense sentence and the topic distribution for the backward phrase of the seed word/phrase pair,
the topic distribution for a word/phrase in a word-sense sentence is represented by a vector obtained by multiplying a vector representing the word/phrase topic distribution for the word/phrase by a vector representing the document topic distribution for the word-sense sentence,
the word/phrase topic distribution for a word/phrase is a probability distribution representing, for each topic, the probability that the word/phrase is contained in a document related to that topic, and
the document topic distribution for a word-sense sentence is a probability distribution representing, for each topic, the probability that the word-sense sentence is contained in a document related to that topic.

An ontology generation device that generates an ontology dictionary that is a semantic database, the device comprising:
a storage unit that stores, as items of the ontology dictionary, word/phrase pairs, each of which is a pair of semantically related words/phrases consisting of a forward phrase and a backward phrase, the backward phrase being located after the forward phrase in a word-sense sentence containing the word/phrase pair;
an acquisition unit that acquires a seed word/phrase pair, which is a word/phrase pair input from outside, and stores it in the storage unit as an item;
a dividing unit that divides acquired document data into a plurality of word-sense sentences;
a word-sense sentence extraction unit that extracts, from the plurality of word-sense sentences obtained by the division, word-sense sentences containing a word/phrase pair stored as an item in the storage unit;
an evaluation unit that calculates, using a predetermined calculation formula, a score representing the degree of semantic relevance between the words/phrases constituting the word/phrase pair contained in a word-sense sentence extracted by the word-sense sentence extraction unit and the words/phrases constituting the seed word/phrase pair, determines whether the relevance is low by comparing the score calculated for the word-sense sentence with a predetermined threshold, and, when the relevance is determined to be low, excludes the word-sense sentence from the result of the extraction;
a generation unit that generates a relationship extraction pattern by replacing, in a word-sense sentence that remains in the result of the extraction without being excluded by the evaluation unit, each word/phrase constituting the word/phrase pair contained in the word-sense sentence with a variable to which attribute information about that word/phrase is assigned; and
a word/phrase pair extraction unit that extracts, from a word-sense sentence among the plurality of word-sense sentences obtained by the division in which the words/phrases other than the variables match the relationship extraction pattern and the attribute information of each word/phrase located at a variable position of the relationship extraction pattern matches the attribute information of that variable, the pair of words/phrases located at the variable positions, and stores the pair in the storage unit as an item,
wherein the predetermined calculation formula calculates, as the score, a weighted average of:
a vector distance between the topic distribution for the forward phrase of the word/phrase pair contained in the word-sense sentence and the topic distribution for the forward phrase of the seed word/phrase pair;
a vector distance between the topic distribution for the backward phrase of the word/phrase pair contained in the word-sense sentence and the topic distribution for the backward phrase of the seed word/phrase pair; and
a vector distance between the topic distribution for a combined phrase formed by combining the words/phrases constituting the word/phrase pair contained in the word-sense sentence and the topic distribution for a combined phrase formed by combining the words/phrases constituting the seed word/phrase pair,
the topic distribution for a word/phrase in a word-sense sentence is represented by a vector obtained by multiplying a vector representing the word/phrase topic distribution for the word/phrase by a vector representing the document topic distribution for the word-sense sentence,
the topic distribution for a combined phrase in a word-sense sentence is represented by a vector obtained by multiplying a vector representing the word/phrase topic distribution for the combined phrase by a vector representing the document topic distribution for the word-sense sentence,
the word/phrase topic distribution for a word/phrase is a probability distribution representing, for each topic, the probability that the word/phrase is contained in a document related to that topic,
the word/phrase topic distribution for a combined phrase is a probability distribution representing, for each topic, the probability that all the words/phrases combined into the combined phrase are contained in a document related to that topic, and
the document topic distribution for a word-sense sentence is a probability distribution representing, for each topic, the probability that the word-sense sentence is contained in a document related to that topic.

An ontology generation device that generates an ontology dictionary that is a semantic database, the device comprising:
a storage unit that stores, as items of the ontology dictionary, word/phrase pairs, each of which is a pair of semantically related words/phrases consisting of a forward phrase and a backward phrase, the backward phrase being located after the forward phrase in a word-sense sentence containing the word/phrase pair;
an acquisition unit that acquires a seed word/phrase pair, which is a word/phrase pair input from outside, and stores it in the storage unit as an item;
a dividing unit that divides acquired document data into a plurality of word-sense sentences;
a word-sense sentence extraction unit that extracts, from the plurality of word-sense sentences obtained by the division, word-sense sentences containing a word/phrase pair stored as an item in the storage unit;
an evaluation unit that calculates, using a predetermined calculation formula, a score representing the degree of semantic relevance between the words/phrases constituting the word/phrase pair contained in a word-sense sentence extracted by the word-sense sentence extraction unit and the words/phrases constituting the seed word/phrase pair, determines whether the relevance is low by comparing the score calculated for the word-sense sentence with a predetermined threshold, and, when the relevance is determined to be low, excludes the word-sense sentence from the result of the extraction;
a generation unit that generates a relationship extraction pattern by replacing, in a word-sense sentence that remains in the result of the extraction without being excluded by the evaluation unit, each word/phrase constituting the word/phrase pair contained in the word-sense sentence with a variable to which attribute information about that word/phrase is assigned; and
a word/phrase pair extraction unit that extracts, from a word-sense sentence among the plurality of word-sense sentences obtained by the division in which the words/phrases other than the variables match the relationship extraction pattern and the attribute information of each word/phrase located at a variable position of the relationship extraction pattern matches the attribute information of that variable, the pair of words/phrases located at the variable positions, and stores the pair in the storage unit as an item,
wherein the predetermined calculation formula calculates, as the score, a weighted average of:
a vector distance between the topic distribution for the forward phrase of the word/phrase pair contained in the word-sense sentence and the topic distribution for the forward phrase of the seed word/phrase pair;
a vector distance between the topic distribution for the backward phrase of the word/phrase pair contained in the word-sense sentence and the topic distribution for the backward phrase of the seed word/phrase pair;
a vector distance between the topic distribution for a combined phrase formed by combining the words/phrases constituting the word/phrase pair contained in the word-sense sentence and the topic distribution for a combined phrase formed by combining the words/phrases constituting the seed word/phrase pair;
a vector distance between the topic distribution for the modifier set of the forward phrase of the word/phrase pair contained in the word-sense sentence and the topic distribution for the modifier set of the forward phrase of the seed word/phrase pair;
a vector distance between the topic distribution for the modifier set of the backward phrase of the word/phrase pair contained in the word-sense sentence and the topic distribution for the modifier set of the backward phrase of the seed word/phrase pair; and
a vector distance between the topic distribution for the modifier set of a combined phrase formed by combining the words/phrases constituting the word/phrase pair contained in the word-sense sentence and the topic distribution for the modifier set of a combined phrase formed by combining the words/phrases constituting the seed word/phrase pair,
the modifier set of a word/phrase constituting a word/phrase pair contained in a word-sense sentence is a set whose elements are the words/phrases that modify that word/phrase in the word-sense sentence,
the modifier set of a combined phrase formed by combining the word/phrase pair contained in a word-sense sentence is a set whose elements are the words/phrases combined into the combined phrase and the words/phrases that modify each of them in the word-sense sentence,
the topic distribution for a word/phrase in a word-sense sentence is represented by a vector obtained by multiplying a vector representing the word/phrase topic distribution for the word/phrase by a vector representing the document topic distribution for the word-sense sentence,
the topic distribution for a combined phrase in a word-sense sentence is represented by a vector obtained by multiplying a vector representing the word/phrase topic distribution for the combined phrase by a vector representing the document topic distribution for the word-sense sentence,
the topic distribution for a modifier set in a word-sense sentence is represented by a vector obtained by multiplying a vector representing the word/phrase topic distribution for the modifier set by a vector representing the document topic distribution for the word-sense sentence,
the word/phrase topic distribution for a word/phrase is a probability distribution representing, for each topic, the probability that the word/phrase is contained in a document related to that topic,
the word/phrase topic distribution for a combined phrase is a probability distribution representing, for each topic, the probability that all the words/phrases combined into the combined phrase are contained in a document related to that topic,
the word/phrase topic distribution for a modifier set is a probability distribution representing, for each topic, the probability that all the words/phrases that are elements of the modifier set are contained in a document related to that topic, and
the document topic distribution for a word-sense sentence is a probability distribution representing, for each topic, the probability that the word-sense sentence is contained in a document related to that topic.

The ontology generation device according to any one of claims 1 to 3, further comprising a synonym replacement unit that determines, based on predetermined information for synonym replacement, whether synonymous words/phrases exist in the plurality of word-sense sentences obtained by the division and, when they are determined to exist, replaces the synonymous words/phrases in those word-sense sentences with a predetermined word/phrase,
wherein the word-sense sentence extraction unit performs the extraction on the word-sense sentences after replacement by the synonym replacement unit together with the remaining word-sense sentences among the plurality of word-sense sentences obtained by the division.

An ontology generation program executed by an ontology generation device that generates an ontology dictionary that is a semantic database,
A phrase pair, which is a pair of phrases having semantic relevance, is composed of a pair of anterior phrase and a posterior phrase, and the posterior phrase is posterior to the anterior phrase in a semantic sentence containing the phrase pair. Obtaining a seed word pair, which is the word pair input from the outside, from the word pairs arranged at the position ;
storing the seed word pair as an item in a word/phrase pair storage unit that stores the word/phrase pair as an item of the ontology dictionary;
Divide the acquired document data into multiple word meaning sentences,
Extracting a word meaning sentence containing the word pair stored as the item in the word pair storage unit from the plurality of word meaning sentences obtained by the division,
Predetermining a score representing a degree of semantic relevance between each word and phrase constituting the word pair and each word and phrase making up the seed word pair included in the word meaning sentence extracted by the word meaning sentence extraction Calculated using the formula,
determining whether the relevance is low by comparing the score calculated for the word-sense sentence extracted by the word-sense sentence extraction with a predetermined threshold;
excluding the word meaning sentence from the result of extracting the word meaning sentence when it is determined that the relevance is low;
In the word meaning sentence that is not excluded by the above-mentioned exclusion and remains as a result of the extraction of the word meaning sentence, each word and phrase that constitutes the word and phrase pair included in the word meaning sentence is obtained by obtaining attribute information for each word and phrase. Generate a relationship extraction pattern by replacing variables with
Attribute information about a word or phrase whose word or phrase other than the variable matches the relation extraction pattern and is arranged at the position of the variable in the relation extraction pattern, among the plurality of word meaning sentences obtained by the division. extracts a pair of words and phrases arranged at the position from the word meaning sentence that matches the attribute information of the variable, and stores the word and phrase pairs as the items in the word and phrase pair storage unit. let
wherein the predetermined calculation formula is a formula that calculates, as the score, for the word meaning sentence extracted by the word meaning sentence extraction, a weighted average of:
a vector distance between the subject distribution of the anterior phrase of the word pair contained in the word meaning sentence and the subject distribution of the anterior phrase of the seed word pair; and
a vector distance between the subject distribution of the posterior phrase of the word pair contained in the word meaning sentence and the subject distribution of the posterior phrase of the seed word pair,
the subject distribution of a phrase in a word meaning sentence is represented by a vector obtained by multiplying a vector representing the word subject distribution of the phrase by a vector representing the document subject distribution of the word meaning sentence,
the word subject distribution of a phrase is a probability distribution representing, for each subject, the probability that the phrase is included in a document related to that subject, and
the document subject distribution of a word meaning sentence is a probability distribution representing, for each subject, the probability that the word meaning sentence is included in a document related to that subject,
An ontology generator.
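
As a non-limiting illustration, the score calculation recited above may be sketched as follows. The sketch assumes that "multiplying" the two vectors means an element-wise product, that the vector distance is Euclidean, and that a larger score (a larger distance from the seed word pair) indicates lower relevance; the claim fixes none of these choices. The function names and the `weight_front` parameter are hypothetical.

```python
import math

def subject_distribution(word_subjects, sentence_subjects):
    """Subject distribution of a phrase in a sentence.

    Assumption: the product of the word subject distribution and the
    document subject distribution is taken element-wise, per subject.
    """
    return [w * d for w, d in zip(word_subjects, sentence_subjects)]

def euclidean(u, v):
    """Vector distance (assumed Euclidean) between two subject distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relevance_score(cand_front, cand_back, seed_front, seed_back, weight_front=0.5):
    """Weighted average of the anterior-phrase and posterior-phrase distances."""
    d_front = euclidean(cand_front, seed_front)
    d_back = euclidean(cand_back, seed_back)
    return weight_front * d_front + (1.0 - weight_front) * d_back

def is_low_relevance(score, threshold):
    """Assumption: a distance-based score above the threshold means low relevance."""
    return score > threshold
```

A word meaning sentence for which `is_low_relevance` returns true would be excluded before any relation extraction pattern is generated from it.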

An ontology generation method for generating an ontology dictionary, which is a semantic database, comprising:
acquiring a seed word pair, which is a word pair input from outside, a word pair being a pair of phrases having semantic relevance that consists of an anterior phrase and a posterior phrase, the posterior phrase being located after the anterior phrase in a word meaning sentence containing the word pair;
storing the seed word pair as an item in a word pair storage unit that stores word pairs as items of the ontology dictionary;
dividing acquired document data into a plurality of word meaning sentences,
Extracting a word meaning sentence containing the word pair stored as the item in the word pair storage unit from the plurality of word meaning sentences obtained by the division,
calculating, using a predetermined calculation formula, a score representing the degree of semantic relevance between each phrase constituting the word pair included in the word meaning sentence extracted by the word meaning sentence extraction and each phrase constituting the seed word pair;
determining whether the relevance is low by comparing the score calculated for the word meaning sentence extracted by the word meaning sentence extraction with a predetermined threshold;
excluding the word meaning sentence from the result of extracting the word meaning sentence when it is determined that the relevance is low;
generating a relation extraction pattern by replacing, in a word meaning sentence that was not excluded by the exclusion and remains in the result of the word meaning sentence extraction, each phrase constituting the word pair included in that word meaning sentence with a variable having the attribute information acquired for that phrase; and
extracting, from among the plurality of word meaning sentences obtained by the division, a word meaning sentence in which the words other than the variables match the relation extraction pattern and in which the attribute information of each phrase located at the position of a variable in the relation extraction pattern matches the attribute information of that variable, extracting from that word meaning sentence the pair of phrases located at the variable positions, and storing the extracted pair as the item in the word pair storage unit,
wherein the above processing is executed by a computer of the ontology generation device,
the predetermined calculation formula is a formula that calculates, as the score, for the word meaning sentence extracted by the word meaning sentence extraction, a weighted average of:
a vector distance between the subject distribution of the anterior phrase of the word pair contained in the word meaning sentence and the subject distribution of the anterior phrase of the seed word pair; and
a vector distance between the subject distribution of the posterior phrase of the word pair contained in the word meaning sentence and the subject distribution of the posterior phrase of the seed word pair,
the subject distribution of a phrase in a word meaning sentence is represented by a vector obtained by multiplying a vector representing the word subject distribution of the phrase by a vector representing the document subject distribution of the word meaning sentence,
the word subject distribution of a phrase is a probability distribution representing, for each subject, the probability that the phrase is included in a document related to that subject, and
the document subject distribution of a word meaning sentence is a probability distribution representing, for each subject, the probability that the word meaning sentence is included in a document related to that subject,
Ontology generation method.
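
As a non-limiting illustration, the generation and application of a relation extraction pattern recited above may be sketched as follows. The sketch assumes that each word meaning sentence is already tokenized into (surface, attribute) pairs, that the attribute information is a part-of-speech label, and that a pattern matches only sentences with the same number of tokens; none of these choices is fixed by the claim. The names `make_pattern` and `match_pattern` are hypothetical.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (surface form, attribute), e.g. ("aspirin", "NOUN")

def make_pattern(sentence: List[Token], pair: Tuple[str, str]):
    """Replace each phrase of the word pair with a variable carrying its attribute."""
    front, back = pair
    pattern = []
    for surface, attr in sentence:
        if surface == front:
            pattern.append(("VAR", "X", attr))   # variable for the anterior phrase
        elif surface == back:
            pattern.append(("VAR", "Y", attr))   # variable for the posterior phrase
        else:
            pattern.append(("LIT", surface, attr))  # literal word kept as-is
    return pattern

def match_pattern(pattern, sentence: List[Token]) -> Optional[Tuple[str, str]]:
    """Return the phrase pair found at the variable positions, or None if no match."""
    if len(pattern) != len(sentence):  # assumption: same-length match only
        return None
    found = {}
    for (kind, name, attr), (surface, tok_attr) in zip(pattern, sentence):
        if kind == "LIT":
            if surface != name:        # words other than variables must match
                return None
        else:
            if tok_attr != attr:       # attribute at a variable position must match
                return None
            found[name] = surface
    if "X" in found and "Y" in found:
        return (found["X"], found["Y"])
    return None

# Usage: build a pattern from a seed sentence, then apply it to another sentence.
seed_sentence = [("aspirin", "NOUN"), ("relieves", "VERB"), ("headache", "NOUN")]
pattern = make_pattern(seed_sentence, ("aspirin", "headache"))
new_pair = match_pattern(
    pattern, [("ibuprofen", "NOUN"), ("relieves", "VERB"), ("fever", "NOUN")]
)
```

A pair returned by `match_pattern` corresponds to a newly extracted word pair that would be stored as an item in the word pair storage unit.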