JP2020047209A

JP2020047209A - Ontology processing apparatus and ontology processing program

Info

Publication number: JP2020047209A
Application number: JP2018177472A
Authority: JP
Inventors: Akihiro Okumura (奥村 晃弘)
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2018-09-21
Filing date: 2018-09-21
Publication date: 2020-03-26

Abstract

To provide an ontology processing apparatus and an ontology processing program capable of supporting the addition of more suitable words to an ontology that has superordinate/subordinate concept relations. SOLUTION: An ontology processing apparatus according to an embodiment of the present disclosure includes a coordinate conversion unit, a distance calculation unit, and an addition candidate extraction unit. The coordinate conversion unit extracts a word vector from a word vector holding unit and converts the extracted word vector into a conversion vector in a predetermined space where cosine similarity can be approximated by Euclidean distance. The distance calculation unit calculates the distance between a hyperplane determined from the result of a fitting process on seed words and the word represented by the conversion vector obtained by the coordinate conversion unit. Based on the distance calculated by the distance calculation unit, the addition candidate extraction unit extracts, from the words held in the word vector holding unit, a word that is a candidate for addition to the ontology in an ontology storage unit. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to an ontology processing apparatus and an ontology processing program.

Research on machine translation and dialogue understanding has long been pursued in language processing technology. In recent years, research has also been conducted on advanced knowledge processing that uses the semantic information (concepts) of words. Ontology technology is one such technology. An "ontology" is a kind of dictionary that systematically organizes the concepts that words carry. Techniques for generating such ontologies are disclosed in, for example, Patent Documents 1 and 2.

Non-Patent Document 1 describes a method of creating word vectors by automatic learning from a large number of documents. Techniques for deriving vector representations from words existed earlier, but the technique of Non-Patent Document 1 makes it possible to express complex concepts. For example, the created vectors support operations such as "France" − "Paris" + "Tokyo" ≈ "Japan". Here, "France" − "Paris" expresses the relation "the country whose capital is that city". In Non-Patent Document 1, the direction of a vector thus comes to carry some kind of "meaning".
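The vector arithmetic described above can be illustrated with a minimal sketch. The three-dimensional vectors below are hand-made for illustration (they are not learned embeddings, and the axis meanings are assumptions); the helper names `cosine` and `analogy` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (hypothetical, not learned from documents).
# Assumed axes: [country-ness, city-ness, Japan(+1) vs. France(-1)].
TOY_VECTORS = {
    "France":  [1.0, 0.0, -1.0],
    "Paris":   [0.0, 1.0, -1.0],
    "Japan":   [1.0, 0.0, 1.0],
    "Tokyo":   [0.0, 1.0, 1.0],
    "Germany": [1.0, 0.0, 0.0],
}

def analogy(a, b, c, table):
    """Return the word whose vector is closest to table[a] - table[b] + table[c]."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(table[w], target))

print(analogy("France", "Paris", "Tokyo", TOY_VECTORS))  # prints: Japan
```

With real word vectors the match is only approximate, which is why the nearest word by cosine similarity is taken rather than an exact equality.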

Patent Document 1: JP 2009-110513 A
Patent Document 2: JP 2017-182168 A

Non-Patent Document 1: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean, "Distributed representations of words and phrases and their compositionality", NIPS, 2013

However, in an ontology generated by the technique of Patent Document 1, all words are on the same level, and an ontology having superordinate/subordinate concept relations cannot be created. It is therefore desirable to provide an ontology processing apparatus and an ontology processing program capable of supporting the addition of more suitable words to an ontology having superordinate/subordinate concept relations.

An ontology processing apparatus according to an embodiment of the present disclosure includes a word vector holding unit that holds a plurality of words and a plurality of word vectors corresponding to those words, and an ontology storage unit that stores an ontology. The ontology processing apparatus further includes a seed word acquisition unit, a coordinate conversion unit, a distance calculation unit, an addition candidate extraction unit, and an ontology editing unit. The seed word acquisition unit acquires, as seed words, words that are subordinate concepts of a specified word from the ontology in the ontology storage unit. The coordinate conversion unit extracts word vectors from the word vector holding unit and converts each extracted word vector into a conversion vector in a predetermined space in which cosine similarity can be approximated by Euclidean distance. The distance calculation unit calculates the distance between a hyperplane obtained from the result of a fitting process on the seed words and the word represented by each conversion vector obtained by the coordinate conversion unit. Based on the distance calculated by the distance calculation unit, the addition candidate extraction unit extracts, from the words held in the word vector holding unit, words that are candidates for additional registration in the ontology of the ontology storage unit. The ontology editing unit adds some or all of the addition candidate words extracted by the addition candidate extraction unit to the ontology in the ontology storage unit.

An ontology processing program according to an embodiment of the present disclosure causes a computer that includes a word vector holding unit, which holds a plurality of words and a plurality of word vectors corresponding to those words, and an ontology storage unit, which stores an ontology, to execute the following steps:
(A) acquiring, as seed words, words that are subordinate concepts of a specified word from the ontology in the ontology storage unit;
(B) extracting word vectors from the word vector holding unit and converting each extracted word vector into a conversion vector in a predetermined space in which the cosine similarity of the seed words can be approximated by Euclidean distance;
(C) calculating the distance between a hyperplane obtained from the result of a fitting process on the seed words and the word represented by each conversion vector obtained by the conversion;
(D) extracting, based on the calculated distance, words that are candidates for additional registration in the ontology of the ontology storage unit from the words held in the word vector holding unit; and
(E) adding some or all of the extracted addition candidate words to the ontology in the ontology storage unit.

The ontology processing apparatus and the ontology processing program according to an embodiment of the present disclosure can support the addition of more suitable words to an ontology having superordinate/subordinate concept relations.

FIG. 1 is a block diagram illustrating the functional configuration of an ontology processing apparatus according to an embodiment of the present disclosure.
FIG. 2 is an explanatory (conceptual) diagram illustrating a configuration example of an ontology before words are added by the ontology processing apparatus of FIG. 1.
FIG. 3 is an explanatory (conceptual) diagram illustrating a configuration example of the ontology after words have been added by the ontology processing apparatus of FIG. 1.
FIG. 4 is an explanatory diagram illustrating a configuration example of the word vectors of seed words created by the word vector creation unit of FIG. 1.
FIG. 5 is an explanatory diagram illustrating the result of the contribution rate calculation performed by the distance calculation unit of FIG. 1.
FIG. 6 is an explanatory diagram illustrating the conversion coefficients, acquired by the distance calculation unit of FIG. 1, for rotating the coordinates to fit the distribution of the seed words.
FIG. 7 is an explanatory diagram illustrating word vectors created by the word vector creation unit of FIG. 1.
FIG. 8 is an explanatory diagram illustrating the parameters obtained by coordinate rotation of the word vectors created by the word vector creation unit of FIG. 1.
FIG. 9 is a flowchart illustrating the operation of the ontology processing apparatus of FIG. 1.
FIG. 10 is a block diagram illustrating a modification of the functional configuration of the ontology processing apparatus of FIG. 1.

Hereinafter, an ontology processing apparatus 1 according to an embodiment of the present disclosure will be described in detail with reference to the drawings. The following description is one specific example of the present disclosure, and the present disclosure is not limited to the following aspects.

[Configuration]
FIG. 1 is a block diagram illustrating the functional configuration of the ontology processing apparatus 1 according to the present embodiment. The ontology processing apparatus 1 includes, for example, a control unit 1, a word vector creation unit 2, an input/output unit 3, an ontology storage unit 4, and a document storage unit 5.

The ontology storage unit 4 stores an ontology formed of a plurality of words (concepts). The ontology stored in the ontology storage unit 4 is structured so that superordinate and subordinate concepts can be associated between words (concepts). As long as superordinate and subordinate concepts can be associated, the specific data description format of the ontology is not limited, and various ontology data description formats can be applied.

The document storage unit 5 stores a large amount of document data (for example, files of document data in various formats, such as text data).

The word vector creation unit 2 extracts a plurality of words from the large amount of document data in the document storage unit 5 and creates a plurality of word vectors corresponding to the extracted words (word vectors for the words contained in the document data). The word vector creation unit 2 stores the extracted words and the corresponding word vectors. The word vector creation unit 2 may generate the word vectors from the document data held in the document storage unit 5 using, for example, the method described in Non-Patent Document 1 or general word2vec. The method of generating word vectors from document data is not limited to these. When decomposing a document into words, the word vector creation unit 2 may refer to the words stored in the ontology storage unit 4 in addition to a word dictionary (not shown) inside the word vector creation unit 2.

The control unit 1 controls each component of the ontology processing apparatus 1 and includes a processing target selection unit 11, a coordinate conversion unit 12, a distance calculation unit 13, and an ontology editing unit 14.

The processing target selection unit 11 accepts from the user a designation of the position (word) on the ontology stored in the ontology storage unit 4 at which addition candidates are to be searched for. Hereinafter, the word designated by the user at this time is referred to as the basic word Wb. FIG. 2 shows "dog" as an example of the basic word Wb. Based on the basic word Wb (concept) designated by the user, the processing target selection unit 11 determines a plurality of seed words Ws that serve as seeds for extracting words to be added (hereinafter, "peripheral words Wa"). FIG. 2 shows "Shiba Inu", "Chihuahua", "Maltese", "Bulldog", and "Dalmatian" as examples of the seed words Ws. The processing target selection unit 11 acquires the word vectors corresponding to the determined seed words Ws from the word vector creation unit 2.

The coordinate conversion unit 12 extracts from the word vector creation unit 2 the word vectors that can be approximated in a space in which cosine similarity can be approximated by Euclidean distance (hereinafter, "space A"), and converts the extracted word vectors into coordinates in space A. Specifically, the coordinate conversion unit 12 first calculates the seed center as the average of the word vectors of the seed words Ws. Next, the coordinate conversion unit 12 extracts from the word vector creation unit 2 the word vectors of the words (seed words Ws and peripheral words Wa) whose cosine similarity to the seed center is within a predetermined range (for example, 0.6 or more). FIG. 3 shows "Poodle" and "Dachshund" as examples of the peripheral words Wa.
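The extraction step above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function names `seed_center` and `extract_near_center` and the two-dimensional vectors are hypothetical, and only the threshold 0.6 comes from the description.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def seed_center(seed_vectors):
    """Component-wise mean of the seed word vectors."""
    count = len(seed_vectors)
    return [sum(column) / count for column in zip(*seed_vectors)]

def extract_near_center(word_vectors, seed_words, threshold=0.6):
    """Keep every word whose cosine similarity to the seed center is >= threshold."""
    center = seed_center([word_vectors[w] for w in seed_words])
    return {w: v for w, v in word_vectors.items() if cosine(v, center) >= threshold}

# Hypothetical 2-dimensional vectors, for illustration only.
WORD_VECTORS = {
    "Shiba Inu": [1.0, 0.2],
    "Chihuahua": [0.9, 0.4],
    "Poodle":    [1.0, 0.3],
    "Tokyo":     [-0.5, 1.0],
}
selected = extract_near_center(WORD_VECTORS, ["Shiba Inu", "Chihuahua"])
print(sorted(selected))  # prints: ['Chihuahua', 'Poodle', 'Shiba Inu']
```

Words that are not seed words but pass the threshold ("Poodle" here) play the role of peripheral words Wa.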

The coordinate conversion unit 12 normalizes the word vector of each seed word Ws extracted from the word vector creation unit 2, and rotates each normalized word vector about the origin such that the normalized seed center becomes (1, 0, ..., 0). The coordinate conversion unit 12 likewise normalizes the word vector of each peripheral word Wa and applies the same rotation. The rotation is chosen so that, when the normalized seed center is converted into the polar coordinates described below (the coordinates of space A), it becomes the origin of those coordinates. Next, the coordinate conversion unit 12 converts the coordinates of each seed word Ws and each peripheral word Wa into polar coordinates (the coordinates of space A), generating vectors with one dimension fewer than the vectors of the word vector creation unit 2 (hereinafter, "converted vectors"). In the following, the dimension of the converted vectors is denoted N (N = L − 1). The distance calculation unit 13 treats the converted vectors (vectors in space A) approximately as vectors in orthogonal coordinates.
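The normalization, rotation, and polar conversion can be sketched as below. Two points are assumptions rather than details from the description: the particular rotation (the one acting only in the plane spanned by the seed center and the first axis) and the ordering of axes in the polar conversion, which is chosen so that (1, 0, ..., 0) maps to (π/2, ..., π/2, 0). All function names are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rotate_center_to_e1(v, center):
    """Apply to v the rotation that maps the unit vector `center` to e1 = (1, 0, ..., 0).

    Uses R x = x - ((c + e1).x / (1 + c.e1)) (c + e1) + 2 (c.x) e1,
    which is undefined only when center == -e1.
    """
    e1 = [1.0] + [0.0] * (len(v) - 1)
    s = [c + e for c, e in zip(center, e1)]
    k = sum(a * b for a, b in zip(s, v)) / (1.0 + center[0])
    cx = sum(a * b for a, b in zip(center, v))
    return [x - k * si + 2.0 * cx * e for x, si, e in zip(v, s, e1)]

def to_angles(y):
    """Unit vector of dimension L -> L-1 angles (r = 1 is dropped).

    Assumed axis ordering: (1, 0, ..., 0) maps to (pi/2, ..., pi/2, 0).
    """
    z = list(y[2:]) + [y[0], y[1]]
    angles = []
    for k in range(len(z) - 2):
        rest = math.sqrt(sum(t * t for t in z[k + 1:]))
        angles.append(math.atan2(rest, z[k]))
    angles.append(math.atan2(z[-1], z[-2]))
    return angles

center = normalize([1.0, 1.0, 1.0])                       # hypothetical seed center
word = rotate_center_to_e1(normalize([1.0, 0.9, 1.1]), center)
angles = to_angles(word)
euclid = math.sqrt((angles[0] - math.pi / 2) ** 2 + angles[1] ** 2)
geodesic = math.acos(word[0])                             # angle to the rotated center e1
print(euclid, geodesic)                                   # the two values differ by less than 1e-3
```

Subtracting (π/2, ..., π/2, 0) from every angle vector would place the seed center at the origin of space A, which matches the later use of the seed center as the origin for coordinate rotation.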

Here, cosine similarity is equivalent to the distance between two points on a hypersphere. Suppose the word vectors are L-dimensional and that converting the coordinates (y_1, y_2, ..., y_L) on the hypersphere to polar coordinates yields (r, θ_1, θ_2, ..., θ_{L-1}). Because r is always 1, consider obtaining the distance between two points from (θ_1, θ_2, ..., θ_{L-1}) with r excluded. Since this is geometry on a hypersphere, it is non-Euclidean and the Pythagorean theorem does not hold; treating (θ_1, θ_2, ..., θ_{L-1}) as orthogonal coordinates and taking the Euclidean distance does not, in general, give the distance on the sphere. However, if one point is fixed at (π/2, π/2, ..., π/2, 0) and the other point lies in its neighborhood, the distance can be approximated in orthogonal coordinates. Note that (π/2, π/2, ..., π/2, 0) is the point obtained by converting (1, 0, ..., 0) into polar coordinates and excluding r. The present embodiment uses this technique to work in a space (space A) in which cosine similarity can be approximated by Euclidean distance. When the word vectors are L-dimensional, space A is (L − 1)-dimensional.
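Why the approximation holds near that point can be seen from the line element of the unit hypersphere. The following is a sketch under the assumption of the standard hyperspherical parametrization, with the axes ordered so that (1, 0, ..., 0) corresponds to (π/2, ..., π/2, 0); the description itself does not state the parametrization.

```latex
\begin{align*}
% Line element on the unit hypersphere (r = 1):
ds^2 &= d\theta_1^2 + \sin^2\theta_1\, d\theta_2^2 + \cdots
        + \Bigl(\prod_{k=1}^{L-2} \sin^2\theta_k\Bigr)\, d\theta_{L-1}^2 \\
% Near \theta_1 = \cdots = \theta_{L-2} = \pi/2, every \sin\theta_k \approx 1:
ds^2 &\approx \sum_{i=1}^{L-1} d\theta_i^2 \\
% For unit vectors u, v separated by a small geodesic distance d:
\cos(u, v) &= \cos d \approx 1 - \tfrac{d^2}{2}
\end{align*}
```

Close to the fixed point the metric is therefore approximately Euclidean in the angles, and a small Euclidean distance in space A corresponds to a cosine similarity close to 1.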

The distance calculation unit 13 searches for words that are candidates for addition as subordinate concepts of the position (word) accepted by the processing target selection unit 11. Specifically, the distance calculation unit 13 first performs a coordinate rotation, with the seed center (converted vector) as the origin, so as to fit the distribution of the converted vectors of the seed words Ws (see, for example, FIG. 4). The distance calculation unit 13 performs a fitting process on the seed words Ws (for example, a fitting process using a calculation similar to principal component analysis). From the result of the fitting process, the distance calculation unit 13 determines an M-dimensional hyperplane based on the converted vectors of the seed words Ws.

A specific example of determining the M-dimensional hyperplane is described below. Assuming that the word vector creation unit 2 has created N-dimensional vectors, the converted vectors of the seed words Ws are represented, for example, as in FIG. 4. The data of FIG. 4 can then be expressed as the matrix shown in Equation (1).

The distance calculation unit 13 obtains the variance-covariance matrix of this matrix and then its eigenvalues and eigenvectors. The distance calculation unit 13 arranges the eigenvectors in descending order of eigenvalue to form a rotation matrix (FIG. 6). The distance calculation unit 13 calculates the contribution rate of each component by dividing each eigenvalue by the sum of the eigenvalues (FIG. 5). Examining the cumulative contribution rate in FIG. 5 in order from the first component (PC1), the distance calculation unit 13 takes as M the number of components at which the cumulative contribution rate first exceeds 90%. The distance calculation unit 13 takes the hyperplane spanned by the first through M-th axes as the M-dimensional hyperplane to be obtained. In FIG. 5, the dimension of the M-dimensional hyperplane is 3.
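The selection of M from the contribution rates can be sketched as follows. The eigenvalues in the example are invented for illustration and are not the values of FIG. 5; the function names are hypothetical.

```python
def contribution_rates(eigenvalues):
    """Each eigenvalue divided by the sum of all eigenvalues, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [value / total for value in ordered]

def choose_m(eigenvalues, threshold=0.90):
    """Smallest M whose cumulative contribution rate first exceeds the threshold."""
    cumulative = 0.0
    for m, rate in enumerate(contribution_rates(eigenvalues), start=1):
        cumulative += rate
        if cumulative > threshold:
            return m
    return len(eigenvalues)

# Hypothetical eigenvalues: cumulative rates are 0.50, 0.80, 0.95, ...
print(choose_m([5.0, 3.0, 1.5, 0.3, 0.2]))  # prints: 3
```

The 90% threshold is the value given in the description; exposing it as a parameter simply makes it easy to see how M changes with a stricter or looser cutoff.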

Next, the distance calculation unit 13 applies the coordinate rotation to the converted vectors of the peripheral words Wa and calculates the distance between the coordinates of each rotated vector and the hyperplane obtained from the result of the fitting process on the seed words Ws. Based on the calculated distances, the distance calculation unit 13 then extracts, from the peripheral words Wa, words that are candidates for additional registration in the ontology of the ontology storage unit 4.

A specific example of the method of extracting candidate words for additional registration in the ontology of the ontology storage unit 4 is described below. FIG. 7 shows an example of the converted vectors of the peripheral words Wa. The distance calculation unit 13 rotates the converted vectors of the peripheral words Wa using a rotation matrix such as that shown in FIG. 6. The distance calculation unit 13 obtains the rotated vectors of the peripheral words Wa by, for example, the matrix operations shown in Equations (2), (3), (4), and (5) (FIG. 8). In FIG. 8, the distance between the M-dimensional hyperplane and the j-th word is the distance from the hyperplane spanned by the first through M-th axes to the point indicated by the corresponding vector in FIG. 8. The distance calculation unit 13 obtains the distance between the M-dimensional hyperplane and the j-th word by Equation (6).
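The rotation and distance calculation can be sketched as below. Equations (2) to (6) are not reproduced in this text, so the form used here is an assumption consistent with the description: after rotation onto orthonormal principal axes, the distance from the hyperplane spanned by the first M axes is the Euclidean norm of the remaining components, d_j = sqrt(z_{j,M+1}^2 + ... + z_{j,N}^2). The vectors are assumed to be already expressed relative to the seed center, and all names and values are hypothetical.

```python
import math

def rotate(vector, axes):
    """Coordinates of `vector` along each principal axis (rows of `axes`, largest eigenvalue first)."""
    return [sum(a * x for a, x in zip(axis, vector)) for axis in axes]

def distance_to_hyperplane(rotated, m):
    """Distance from a rotated point to the hyperplane spanned by the first m axes."""
    return math.sqrt(sum(c * c for c in rotated[m:]))

def rank_candidates(peripheral, axes, m):
    """Return (distance, word) pairs sorted so that the closest words come first."""
    scored = [(distance_to_hyperplane(rotate(v, axes), m), w) for w, v in peripheral.items()]
    return sorted(scored)

# Hypothetical data: identity axes, so the hyperplane is the plane of the first two axes.
AXES = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
PERIPHERAL = {"Poodle": [0.3, -0.1, 0.02], "Tokyo": [0.1, 0.2, 0.5]}
for distance, word in rank_candidates(PERIPHERAL, AXES, m=2):
    print(word, distance)  # "Poodle" is listed before "Tokyo"
```

A cutoff on the distance, or a fixed number of top-ranked words, could then be used to decide which peripheral words are presented as addition candidates; the description leaves that choice open.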

The ontology editing unit 14 extracts the peripheral words Wa whose distance from the M-dimensional hyperplane is small and displays them on the ontology as addition candidate words, together with their distance values, via the output unit 32 (FIG. 3). The ontology editing unit 14 accepts from the user a designation of which of the addition candidate words are to be added to the ontology in the ontology storage unit 4. The ontology editing unit 14 then adds some or all of the addition candidate words to the ontology in the ontology storage unit 4 in accordance with the user's instruction (operation).

The input/output unit 3 serves as the user interface and includes an input unit 31 for accepting operations and information input from the user and an output unit 32 for outputting information to the user. An input device such as a keyboard or a mouse can be used as the input unit 31. An output device such as a display or a printer can be used as the output unit 32.

[Operation]
Next, the operation of the ontology processing apparatus 1 of this embodiment, configured as described above, will be described. FIG. 9 is a flowchart illustrating the operation of the ontology processing apparatus 1.

The flowchart of FIG. 9 assumes, as a precondition (initial state), that the word vector creation unit 2 has finished creating the word vectors (using the document data in the document storage unit 5) and that the created word vector data is held. Because the word vector creation process of the word vector creation unit 2 can use the same process as Non-Patent Document 1 or general word2vec, as described above, a detailed description is omitted.

The flowchart of FIG. 9 also assumes, as a precondition (initial state), that an ontology composed of an arbitrary number of words (concepts) is registered in the ontology storage unit 4.

First, the processing target selection unit 11 accepts from the user a designation (selection) of the position (basic word Wb) on the ontology of the ontology storage unit 4 at which subordinate-concept words are to be added (step S101). For example, the processing target selection unit 11 presents the words (concepts) contained in the ontology of the ontology storage unit 4 to the user via the input/output unit 3 (for example, by displaying a list or map of the words forming the ontology) and accepts the designation (selection) of one of those words (concepts).

Next, based on the concept (basic word Wb) designated in step S101, the processing target selection unit 11 acquires the seed words Ws that serve as seeds for extracting addition candidate words. The processing target selection unit 11 then acquires the word vectors of the acquired seed words Ws from the word vector creation unit 2 (step S102). If the word vector creation unit 2 does not hold word vectors for all of the seed words Ws, the processing target selection unit 11 acquires, for example, only the word vectors that do exist.

In this embodiment, the processing target selection unit 11 acquires, from the ontology stored in the ontology storage unit 4, the concepts (words) one level below the concept designated by the user in step S101. However, if a concept one level below the designated concept is an intermediate concept, the processing target selection unit 11 acquires (refers to) the concepts one further level below that intermediate concept as the subordinate concepts.

The following describes an example in which the user designates the concept (word) "dog" from among the concepts (words) stored in the ontology storage unit 4.

FIG. 2 is an explanatory diagram illustrating an example in which the processing target selection unit 11 acquires the seed words Ws from the concept "dog". In the example of FIG. 2, the concepts "Shiba Inu", "pet dog", "Bulldog", and "Dalmatian" exist one level below "dog". Here, "pet dog" is an intermediate concept, and "Chihuahua" and "Maltese" exist one level below it. The processing target selection unit 11 therefore acquires "Chihuahua" and "Maltese", which are one level below the intermediate concept "pet dog", as part of the seed words Ws for "dog". As a result, in the example of FIG. 2, the processing target selection unit 11 acquires "Shiba Inu", "Chihuahua", "Maltese", "Bulldog", and "Dalmatian" as the seed words for the concept "dog". The processing target selection unit 11 then acquires the word vector of each seed word Ws for "dog" from the word vector creation unit 2.
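The seed word acquisition for this example can be sketched as follows. The dictionary-based ontology representation and the explicit set of intermediate concepts are assumptions made for illustration (the description leaves the data format open), and descending recursively through nested intermediate concepts is likewise an assumption; the description only covers a single intermediate level.

```python
def acquire_seed_words(ontology, intermediate, base_word):
    """Children of base_word, with each intermediate concept replaced by its own children."""
    seeds = []
    for child in ontology.get(base_word, []):
        if child in intermediate:
            seeds.extend(acquire_seed_words(ontology, intermediate, child))
        else:
            seeds.append(child)
    return seeds

# Hypothetical encoding of the ontology of FIG. 2: parent -> list of children.
ONTOLOGY = {
    "dog": ["Shiba Inu", "pet dog", "Bulldog", "Dalmatian"],
    "pet dog": ["Chihuahua", "Maltese"],
}
INTERMEDIATE = {"pet dog"}

print(acquire_seed_words(ONTOLOGY, INTERMEDIATE, "dog"))
# prints: ['Shiba Inu', 'Chihuahua', 'Maltese', 'Bulldog', 'Dalmatian']
```

Seed words without a stored word vector would then be dropped before the later steps, in line with step S102.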

次に、座標変換部１２が、上述の空間Ａで近似可能な単語ベクトルを単語ベクトル作成部２から抽出し、その空間Ａへ座標変換する（ステップＳ１０３）。まず、座標変換部１２は、複数のタネ単語Ｗｓの単語ベクトルの平均からタネ中心を算出する。次に、座標変換部１２は、タネ中心とのコサイン類似度が所定の範囲内（例えば、０．６以上）の単語（タネ単語Ｗｓ）および単語ベクトルを単語ベクトル作成部２から抽出する。座標変換部１２は、単語ベクトル作成部２から抽出した、各タネ単語Ｗｓに対応する単語ベクトルを正規化し、タネ中心を正規化した結果（ベクトル）が(１，０，…，０)となるように、正規化された各単語ベクトルを、原点を中心に回転させる。座標変換部１２は、各周辺単語Ｗａに対応する単語ベクトルについても正規化し、タネ中心を正規化した結果（ベクトル）が(１，０，…，０)となるように、正規化された各単語ベクトルを、原点を中心に回転させる。次に、座標変換部１２は、各タネ単語Ｗｓおよび各周辺単語Ｗａの座標を極座標（空間Ａの座標）に変換して、単語ベクトル作成部２のベクトルよりも１次元少ないベクトル（以下、「変換後ベクトル」と称する。）を生成する。距離計算部１３は、変換後ベクトル（空間Ａにおけるベクトル）を近似的に直交座標におけるベクトルとして扱う。 Next, the coordinate conversion unit 12 extracts, from the word vector creation unit 2, word vectors that can be approximated in the above-described space A, and converts them into coordinates in space A (step S103). First, the coordinate conversion unit 12 calculates a seed center as the average of the word vectors of the plurality of seed words Ws. Next, the coordinate conversion unit 12 extracts, from the word vector creation unit 2, the words (seed words Ws) and word vectors whose cosine similarity with the seed center falls within a predetermined range (for example, 0.6 or more). The coordinate conversion unit 12 normalizes the word vector corresponding to each seed word Ws extracted from the word vector creation unit 2, and rotates each normalized word vector about the origin so that the normalized seed center becomes (1, 0, ..., 0). The coordinate conversion unit 12 likewise normalizes the word vector corresponding to each peripheral word Wa and rotates each normalized word vector about the origin so that the normalized seed center becomes (1, 0, ..., 0). Next, the coordinate conversion unit 12 converts the coordinates of each seed word Ws and each peripheral word Wa into polar coordinates (coordinates in space A), generating a vector with one fewer dimension than the vectors of the word vector creation unit 2 (hereinafter referred to as a "post-conversion vector"). The distance calculation unit 13 treats the post-conversion vectors (vectors in space A) approximately as vectors in orthogonal coordinates.
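The conversion of step S103 can be sketched roughly as follows. This is a minimal illustration, not the implementation in the disclosure: the function names (`rotation_to_first_axis`, `to_angles`, `to_space_a`) are hypothetical, and the angular convention is an assumption chosen so that (1, 0, ..., 0) maps to the all-zero point of space A. The source only states that polar coordinates are used and that the result has one fewer dimension.

```python
import numpy as np

def rotation_to_first_axis(center):
    """Rotation matrix that maps the unit vector `center` onto (1, 0, ..., 0)."""
    n = center.shape[0]
    e1 = np.zeros(n)
    e1[0] = 1.0
    # Rotation within the plane spanned by center and e1
    # (not defined when center == -e1, which this sketch does not handle).
    k = np.outer(e1, center) - np.outer(center, e1)
    return np.eye(n) + k + (k @ k) / (1.0 + center[0])

def to_angles(x):
    """Angles of a unit vector (assumed convention): (1, 0, ..., 0) -> all zeros."""
    n = x.shape[0]
    angles = np.zeros(n - 1)
    r = 1.0  # norm of the leading components not yet converted
    for i in range(n - 1, 1, -1):
        angles[i - 1] = np.arcsin(np.clip(x[i] / r, -1.0, 1.0))
        r *= np.cos(angles[i - 1])
    angles[0] = np.arctan2(x[1], x[0])
    return angles

def to_space_a(seed_vecs, all_vecs, min_cos=0.6):
    """Return (N-1)-dimensional post-conversion vectors and the mask of kept words."""
    # Seed center: mean of the seed word vectors, then normalized.
    center = seed_vecs.mean(axis=0)
    center = center / np.linalg.norm(center)
    # Normalize every word vector and keep those close enough to the seed center.
    unit = all_vecs / np.linalg.norm(all_vecs, axis=1, keepdims=True)
    keep = unit @ center >= min_cos
    # Rotate so that the seed center lies on the first axis, then take angles.
    rotated = unit[keep] @ rotation_to_first_axis(center).T
    return np.array([to_angles(v) for v in rotated]), keep
```

Under this assumed convention, a word vector pointing in the same direction as the seed center lands at the origin of space A, and for nearby words the Euclidean norm of the angle vector stays close to the angle to the seed center.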

次に、距離計算部１３が、ステップＳ１０３で取得された各単語ベクトル（抽出された各タネ単語Ｗｓに対応する単語ベクトル）の分布にフィットするように、タネ中心（変換後ベクトル）を原点とした座標回転を行い、その結果を用いて各単語ベクトルに基づくＭ次元超平面を決定する（ステップＳ１０４）。 Next, the distance calculation unit 13 performs a coordinate rotation with the seed center (post-conversion vector) as the origin so as to fit the distribution of the word vectors acquired in step S103 (the word vectors corresponding to the extracted seed words Ws), and uses the result to determine an M-dimensional hyperplane based on those word vectors (step S104).

以下では、距離計算部１３が各タネ単語Ｗｓに基づくＭ次元超平面を決定する処理の具体例について説明する。 Hereinafter, a specific example of a process in which the distance calculation unit 13 determines an M-dimensional hyperplane based on each seed word Ws will be described.

まず、単語ベクトル作成部２が、単語ベクトルとしてＮ次元（Ｎは１以上の整数）のベクトルを作成していたものとする。そうすると、選択したタネ単語Ｗｓに対応する単語ベクトルは、図４のようなマトリクス（テーブル）で表すことができる。図４のマトリクス（テーブル）では、それぞれの行にタネ単語Ｗｓを割当てている。図４では、１行目から順に「柴犬」、「チワワ」「マルチーズ」、「ブルドッグ」、「ダルメシアン」を割当てている。そして、図４のマトリクス（テーブル）では、１列目から順にＸ１、Ｘ２、Ｘ３、…、ＸＮというパラメータを割当てている。 First, it is assumed that the word vector creating unit 2 has created an N-dimensional (N is an integer of 1 or more) vector as a word vector. Then, the word vector corresponding to the selected seed word Ws can be represented by a matrix (table) as shown in FIG. In the matrix (table) in FIG. 4, the seed word Ws is assigned to each row. In FIG. 4, “Shiba Inu”, “Chihuahua”, “Maltese”, “Bulldog”, and “Dalmatian” are assigned in order from the first line. In the matrix (table) of FIG. 4, parameters X1, X2, X3,..., XN are assigned in order from the first column.

このとき、図４のデータを式（１）のように行列で表す。この場合、距離計算部１３は、式（１）の行列の分散共分散行列を求め、さらにその固有値と固有ベクトルを求める。そして、距離計算部１３は、固有値が大きい順に固有ベクトルを並べて回転行列とする（図６参照）。そして、距離計算部１３は、各固有値を固有値の総和で割って寄与率を算出する。そして、距離計算部１３は、固有値が最大のものの寄与率から順に寄与率を累積加算していき、累積寄与率を算出する。 At this time, the data of FIG. 4 is represented as a matrix, as in Expression (1). The distance calculation unit 13 obtains the variance-covariance matrix of the matrix of Expression (1), and then obtains its eigenvalues and eigenvectors. The distance calculation unit 13 arranges the eigenvectors in descending order of eigenvalue to form a rotation matrix (see FIG. 6). The distance calculation unit 13 then calculates the contribution rate of each component by dividing each eigenvalue by the sum of the eigenvalues. Finally, the distance calculation unit 13 accumulates the contribution rates in order, starting from the component with the largest eigenvalue, to obtain the cumulative contribution rates.
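The fitting step described above is essentially a principal component analysis of the seed word vectors. A minimal sketch with NumPy follows, assuming the seed vectors are stacked as rows of `seed_matrix` (as in FIG. 4); the function name `fit_seed_axes` is hypothetical.

```python
import numpy as np

def fit_seed_axes(seed_matrix):
    """Return (rotation, contribution, cumulative) for row-wise seed vectors."""
    # Variance-covariance matrix: rows are words, columns are parameters X1..XN.
    cov = np.cov(seed_matrix, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]           # largest eigenvalue first
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard tiny negative round-off
    rotation = eigvecs[:, order]                # columns correspond to PC1..PCN
    contribution = eigvals / eigvals.sum()      # contribution rate per component
    cumulative = np.cumsum(contribution)        # cumulative contribution rate
    return rotation, contribution, cumulative
```

The sketch does not handle the degenerate case in which all seed vectors are identical (the eigenvalue sum would be zero).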

図５は、各タネ単語Ｗｓの単語ベクトルを処理した結果得られた各成分（第１成分ＰＣ１、第２成分ＰＣ２、…第Ｎ成分ＰＣＮ）の寄与率と、各成分の累積寄与率（第１成分から当該成分までの寄与率の累積値（合計値））の例について示している。図６は、単語ベクトルを構成する各パラメータ（Ｘ１〜ＸＮ）と、各成分（ＰＣ１〜ＰＣＮ）との各組合せに対応する変換係数について示したマトリクスである。例えば、Ｘ１とＰＣ１との組み合わせに対応する変換係数をａ₁₁と図示し、ＸとＰＣ２との組み合わせに対応する変換係数をａ₁₂と図示している。 FIG. 5 shows an example of the contribution rate of each component (first component PC1, second component PC2, ..., N-th component PCN) obtained by processing the word vectors of the seed words Ws, together with the cumulative contribution rate of each component (the cumulative value (total) of the contribution rates from the first component up to that component). FIG. 6 is a matrix showing the conversion coefficients corresponding to each combination of a parameter (X1 to XN) constituting a word vector and a component (PC1 to PCN). For example, the conversion coefficient corresponding to the combination of X1 and PC1 is shown as a₁₁, and the conversion coefficient corresponding to the combination of X and PC2 is shown as a₁₂.

次に、距離計算部１３は、フィッティング結果において、第１成分ＰＣ１から順に累積寄与率を参照し、初めて所定の累積寄与率Ｔを超えるときの次元数Ｍ（初めて閾値である累積寄与率Ｔを超えるときの成分の番号（順番））を取得する。ここでは、例として累積寄与率Ｔを９０％（０．９０）であるものとするが、累積寄与率Ｔの値は任意に設定することができる。 Next, the distance calculation unit 13 refers to the cumulative contribution rates in the fitting result in order from the first component PC1, and acquires the number of dimensions M at which the cumulative contribution rate first exceeds a predetermined cumulative contribution rate T (that is, the number (position in order) of the component at which the threshold T is first exceeded). Here, as an example, the cumulative contribution rate T is assumed to be 90% (0.90), but the value of T can be set arbitrarily.

例えば、図５に示すフィッティングの結果では、第１成分ＰＣ１から順に累積寄与率を参照すると、初めて累積寄与率が９０％（０．９０）を超えるのは第３成分ＰＣ３となる。したがって、距離計算部１３は、次元数Ｍとして「３」を取得する。 For example, in the result of the fitting shown in FIG. 5, when the cumulative contribution rate is referred to in order from the first component PC1, the third component PC3 has the cumulative contribution rate exceeding 90% (0.90) for the first time. Therefore, the distance calculation unit 13 acquires “3” as the number of dimensions M.
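Selecting M from the cumulative contribution rates can be sketched as below. The function name `choose_dimension` and the numbers in the usage note are illustrative assumptions; the actual values of FIG. 5 are not reproduced in this text.

```python
import numpy as np

def choose_dimension(cumulative, threshold=0.90):
    """1-based index of the first component whose cumulative rate exceeds threshold."""
    exceeds = np.asarray(cumulative) > threshold
    if not exceeds.any():
        # Threshold never exceeded (e.g. threshold == 1.0): fall back to all components.
        return len(cumulative)
    return int(np.argmax(exceeds)) + 1  # argmax returns the first True position
```

With hypothetical cumulative rates of 0.55, 0.80, 0.93 and 1.00, the third component is the first to exceed 0.90, so the function returns 3, matching the M = 3 outcome described for FIG. 5.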

次に、距離計算部１３は、第１軸（第１の成分の軸）から第Ｍ軸（第Ｍの成分の軸）が張る超平面を、求めるべきＭ次元超平面として決定する。ここでは、上述のようにＭ＝３であるため、距離計算部１３は、第１軸、第２軸、第３軸が張る超平面を、求めるべきＭ次元超平面として決定することになる。 Next, the distance calculation unit 13 determines a hyperplane extending from the first axis (the axis of the first component) to the Mth axis (the axis of the Mth component) as an M-dimensional hyperplane to be obtained. Here, since M = 3 as described above, the distance calculation unit 13 determines the hyperplane formed by the first axis, the second axis, and the third axis as the M-dimensional hyperplane to be obtained.

以上のように、距離計算部１３は、Ｍ次元超平面を決定する。 As described above, the distance calculation unit 13 determines an M-dimensional hyperplane.

次に、距離計算部１３が、ステップＳ１０４で決定したＭ次元超平面と、単語ベクトル作成部２で保持された各単語ベクトルの単語との距離を計算する（ステップＳ１０５）。 Next, the distance calculation unit 13 calculates the distance between the M-dimensional hyperplane determined in step S104 and the word of each word vector held by the word vector creation unit 2 (step S105).

以下、距離計算部１３が行うＭ次元超平面と各単語ベクトルとの距離計算処理の具体例について説明する。 Hereinafter, a specific example of the distance calculation processing between the M-dimensional hyperplane and each word vector performed by the distance calculation unit 13 will be described.

図７は、単語ベクトル作成部２内の各単語ベクトルの例について示した説明図である。図７では、「プードル」、「オカメインコ」、「ダックスフント」、…、という各単語に対応する単語ベクトルの各パラメータ（Ｘ１〜ＸＮ）が図示されている。図７では、「プードル」という単語に対応する各パラメータＸ１〜ＸＮの値をｘ１１〜ｘ１Ｎと図示している。また、図７では、「オカメインコ」という単語に対応する各パラメータＸ１〜ＸＮの値をｘ２１〜ｘ２Ｎと図示している。さらに、図７では、「ダックスフント」という単語に対応する各パラメータＸ１〜ＸＮの値をｘ３１〜ｘ３Ｎと図示している。 FIG. 7 is an explanatory diagram showing an example of each word vector in the word vector creation unit 2. FIG. 7 illustrates each parameter (X1 to XN) of a word vector corresponding to each of the words “poodle”, “cockatiel”, “dachshund”,. In FIG. 7, the values of the parameters X1 to XN corresponding to the word "poodle" are shown as x11 to x1N. In FIG. 7, the values of the parameters X1 to XN corresponding to the word “cockatiel” are shown as x21 to x2N. Further, in FIG. 7, the values of the parameters X1 to XN corresponding to the word “Dachshund” are illustrated as x31 to x3N.

そして、図８は、図７に示す各単語ベクトルを座標回転した結果について示した説明図である。以下では、単語ベクトルを座標回転した結果を「回転結果ベクトル」と呼ぶものとする。図８では、「プードル」、「オカメインコ」、「ダックスフント」、…、という各単語に対応する単語ベクトルが回転結果ベクトルのパラメータ（第１成分ＰＣ１〜第Ｎ成分ＰＣＮ）で図示されている。 FIG. 8 is an explanatory diagram showing a result of coordinate rotation of each word vector shown in FIG. Hereinafter, the result of the coordinate rotation of the word vector is referred to as a “rotation result vector”. In FIG. 8, word vectors corresponding to the words “poodle”, “cockatiel”, “dachshund”,... Are illustrated by parameters of the rotation result vector (first component PC1 to Nth component PCN).

図８では、「プードル」という単語に対応する第１成分ＰＣ１〜第Ｎ成分ＰＣＮの値をｚ₁₁〜ｚ_1Nと図示している。また、図８では、「オカメインコ」という単語に対応する第１成分ＰＣ１〜第Ｎ成分ＰＣＮの値をｚ₂₁〜ｚ_2Nと図示している。さらに、図８では、「ダックスフント」という単語に対応する第１成分ＰＣ１〜第Ｎ成分ＰＣＮの値をｚ₃₁〜ｚ_3Nと図示している。さらにまた、図８では、任意の単語に対応する第１成分ＰＣ１〜第Ｎ成分ＰＣＮの値をｚ_i1〜ｚ_iN（ｉは１以上の整数）と図示している。なお、各単語に対応する第１成分ＰＣ１〜第Ｎ成分ＰＣＮの値（ｚ_i1〜ｚ_iN）については、上記の式（２）に示す行列演算によって求めることができる。なお、式（２）においてＡは上記の式（３）の行列となる。また、式（２）において、Ｘは上記の式（４）の行列となる。さらに、式（２）においてＺは上記の式（５）の行列となる。従って、距離計算部１３は、Ｍ次元超平面と、任意の単語（単語ベクトル）との距離は、図８に示す回転結果ベクトルのパラメータの第Ｍ＋１成分以降の２乗和より求めることができる。 In FIG. 8, the values of the first component PC1 to the N-th component PCN corresponding to the word "poodle" are shown as z₁₁ to z_1N. Likewise, the values corresponding to the word "cockatiel" are shown as z₂₁ to z_2N, and the values corresponding to the word "dachshund" are shown as z₃₁ to z_3N. The values of the first component PC1 to the N-th component PCN corresponding to an arbitrary word are shown as z_i1 to z_iN (i is an integer of 1 or more). The values of the first component PC1 to the N-th component PCN corresponding to each word (z_i1 to z_iN) can be obtained by the matrix operation shown in Expression (2), in which A is the matrix of Expression (3), X is the matrix of Expression (4), and Z is the matrix of Expression (5). Accordingly, the distance calculation unit 13 can obtain the distance between the M-dimensional hyperplane and an arbitrary word (word vector) from the sum of squares of the (M+1)-th and subsequent components of the rotation result vector shown in FIG. 8.

具体的には、距離計算部１３は、Ｍ次元超平面と任意の単語（ｉ番目の単語）との距離を（６）式により求めることができる。例えば、Ｍ＝２の場合、図７、図８に示す単語「プードル」の距離Ｄ１は（７）式のように示すことができる。 Specifically, the distance calculation unit 13 can calculate the distance between the M-dimensional hyperplane and an arbitrary word (the i-th word) by Expression (6). For example, when M = 2, the distance D1 of the word “poodle” shown in FIG. 7 and FIG. 8 can be expressed as in equation (7).
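The distance calculation can be sketched as follows. Expressions (2) to (7) are not reproduced in this text, so the details here are assumptions: the rotation is applied as a matrix product of row vectors with the matrix of conversion coefficients, the seed center is subtracted first because the rotation is described as taking the seed center as the origin, and a square root is applied to the sum of squares to give a Euclidean distance (omitting it would not change the ranking of words). The function name `hyperplane_distances` is hypothetical.

```python
import numpy as np

def hyperplane_distances(vectors, origin, rotation, m):
    """Distance of each row vector from the hyperplane spanned by PC1..PCm."""
    # Rotation result vectors: row i holds z_i1 .. z_iN for the i-th word.
    z = (np.asarray(vectors) - origin) @ rotation
    # Components M+1 .. N are the part of each vector lying outside the hyperplane.
    residual = z[:, m:]
    return np.sqrt((residual ** 2).sum(axis=1))
```

For M = 2, the distance of the first word would then be computed from z_13 through z_1N only, which is consistent with the description of Expression (7).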

以上のように、距離計算部１３は、Ｍ次元超平面と、単語ベクトル作成部２内の各単語ベクトル（タネ単語以外の単語ベクトル）との距離を求める。 As described above, the distance calculation unit 13 calculates the distance between the M-dimensional hyperplane and each word vector (word vector other than the seed word) in the word vector creation unit 2.

次に、オントロジー編集部１４は、ステップＳ１０５で計算した各単語の距離に基づいて、単語ベクトル作成部２で保持された単語から、オントロジー記憶部４への追加候補の単語を抽出し、抽出結果をユーザに提示（例えば、入出力部３を介してユーザに表示出力）する（ステップＳ１０６）。 Next, the ontology editing unit 14 extracts, from the words held by the word vector creation unit 2, words that are candidates for addition to the ontology storage unit 4 based on the distance of each word calculated in step S105, and presents the extraction result to the user (for example, displays it to the user via the input/output unit 3) (step S106).

次に、オントロジー編集部１４は、ユーザから追加候補の単語について追加の要否を受付け（例えば、入出力部３を介して入力を受け付け）、追加する旨が入力された単語について、オントロジー記憶部４のオントロジーに追加登録する（ステップＳ１０７）。 Next, the ontology editing unit 14 accepts from the user whether each addition candidate word should be added (for example, accepts input via the input/output unit 3), and additionally registers in the ontology of the ontology storage unit 4 each word for which addition was indicated (step S107).

オントロジー編集部１４は、例えば、距離が所定の閾値よりも小さい（短い）単語を追加候補として抽出して、ユーザに提示するようにしてもよい。例えば、オントロジー編集部１４は、入出力部３を介して操作画面（ＧＵＩ画面）を表示してユーザに追加候補の単語を提示し、それぞれの追加候補の単語についてオントロジーへの追加の要否（「追加する」又は「追加しない」）を受け付けるようにしてもよい。また、オントロジー編集部１４は、図３に示すように各追加候補の単語に計算した距離の情報を付してユーザに提示するようにしてもよい。 For example, the ontology editing unit 14 may extract words whose distance is smaller (shorter) than a predetermined threshold as addition candidates and present them to the user. For example, the ontology editing unit 14 may display an operation screen (GUI screen) via the input/output unit 3 to present the addition candidate words to the user, and accept, for each addition candidate word, whether it should be added to the ontology ("add" or "do not add"). The ontology editing unit 14 may also attach the calculated distance to each addition candidate word when presenting it to the user, as shown in FIG. 3.
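The threshold-based extraction described above can be sketched as follows. The function name `extract_candidates`, the ascending sort order, and the sample distances in the usage note are assumptions for illustration; the source only states that words with a distance below a threshold may be extracted and shown together with their distances.

```python
def extract_candidates(words, distances, threshold, seed_words=()):
    """Return (word, distance) pairs below threshold, nearest first."""
    seeds = set(seed_words)
    candidates = [
        (word, float(dist))
        for word, dist in zip(words, distances)
        # Seed words are already in the ontology, so they are skipped.
        if dist < threshold and word not in seeds
    ]
    return sorted(candidates, key=lambda pair: pair[1])
```

For instance, with hypothetical distances of 0.1 for "poodle", 0.9 for "cockatiel" and 0.2 for "dachshund" and a threshold of 0.5, the result lists "poodle" and "dachshund" with their distances, and the user would then choose "add" or "do not add" for each.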

次に、制御部１は、ユーザから処理の継続の要否を受付け（ステップＳ１０８）、ユーザから処理を継続する旨の入力を受け付けた場合は上述のステップＳ１０１から動作し、ユーザから処理を継続しない旨の入力を受け付けた場合は処理を終了する。 Next, the control unit 1 accepts from the user whether processing should continue (step S108). If the user inputs that processing is to continue, operation resumes from step S101 described above; if the user inputs that processing is not to continue, the processing ends.

[効果] [Effect]
次に、本実施の形態に係るオントロジー処理装置１の効果について説明する。 Next, effects of the ontology processing apparatus 1 according to the present embodiment will be described.

本実施の形態に係るオントロジー処理装置１では、単語ベクトル作成部２から抽出した単語ベクトルが、コサイン類似度をユークリッド距離で近似できる空間Ａへ座標変換される。さらに、本実施の形態に係るオントロジー処理装置１では、タネ単語Ｗｓの単語ベクトルをフィッティング処理した結果に基づいて決定したＭ次元超平面と、空間Ａへ座標変換された変換ベクトルの単語との距離が計算される。このように、本実施の形態に係るオントロジー処理装置１では、単語ベクトルがコサイン類似度をユークリッド距離で近似できる空間（空間Ａ）へ座標変換されることにより、単語同士の概念上の関係性（類義語らしさ）が評価される。これにより、単に、単語ベクトルの距離を比較することで単語同士の概念上の関係性（類義語らしさ）を評価する場合と比べて、精度よく、単語同士の概念上の関係性（類義語らしさ）を評価することができる。従って、上位概念・下位概念の関係を持つオントロジーに対して、より適切な単語追加を支援することができる。 In the ontology processing apparatus 1 according to the present embodiment, the word vectors extracted from the word vector creation unit 2 are coordinate-converted into space A, in which cosine similarity can be approximated by Euclidean distance. Furthermore, the distance is calculated between the M-dimensional hyperplane, determined from the result of fitting the word vectors of the seed words Ws, and the word of each conversion vector converted into space A. In this way, the conceptual relationship between words (how synonym-like they are) is evaluated after the word vectors have been coordinate-converted into a space (space A) in which cosine similarity can be approximated by Euclidean distance. As a result, the conceptual relationship between words can be evaluated more accurately than when it is evaluated simply by comparing distances between word vectors. It is therefore possible to support more appropriate word addition to an ontology having superordinate/subordinate concept relations.

本実施の形態に係るオントロジー処理装置１では、複数のタネ単語Ｗｓの単語ベクトルの平均からタネ中心が算出され、算出により得られたタネ中心とのコサイン類似度が所定の範囲内の単語に対応する単語ベクトルが単語ベクトル作成部２から抽出され、単語ベクトル作成部２から抽出した各単語ベクトルが正規化され、正規化されたタネ中心が空間Ａで原点となるように、正規化された各単語ベクトルが、原点を中心に回転された上で、変換ベクトルに変換される。これにより、空間Ａへの変換により生じる誤差を最小限に抑えることができる。従って、精度よく、単語同士の概念上の関係性（類義語らしさ）を評価することができる。従って、上位概念・下位概念の関係を持つオントロジーに対して、より適切な単語追加を支援することができる。 In the ontology processing apparatus 1 according to the present embodiment, the seed center is calculated from the average of the word vectors of the plurality of seed words Ws; the word vectors corresponding to words whose cosine similarity with the calculated seed center is within a predetermined range are extracted from the word vector creation unit 2; each extracted word vector is normalized; and each normalized word vector is rotated about the origin so that the normalized seed center becomes the origin in space A, and is then converted into a conversion vector. This minimizes the error introduced by the conversion into space A, so the conceptual relationship between words (how synonym-like they are) can be evaluated accurately. It is therefore possible to support more appropriate word addition to an ontology having superordinate/subordinate concept relations.

本実施の形態に係るオントロジー処理装置１では、オントロジー記憶部４のオントロジーにある程度の単語（概念）を登録しておけば、それらをタネ単語として追加候補の単語を自動的に抽出することができる。 In the ontology processing apparatus 1 according to the present embodiment, once a certain number of words (concepts) have been registered in the ontology of the ontology storage unit 4, addition candidate words can be extracted automatically using those words as seed words.

また、本実施の形態に係るオントロジー処理装置１では、タネ単語の単語ベクトルをフィッティング処理した結果に基づいて決定したＭ次元超平面からの距離を算出し、算出した距離に基づいて追加候補となる単語を抽出している。これにより、本実施の形態に係るオントロジー処理装置１では、全体としての類似度ではなく、ある観点での類似度に着目し、既存の上位概念、下位概念を持つオントロジーに新しく単語を追加することができる。 In addition, the ontology processing apparatus 1 according to the present embodiment calculates the distance from the M-dimensional hyperplane determined from the result of fitting the word vectors of the seed words, and extracts addition candidate words based on the calculated distance. The ontology processing apparatus 1 can thereby focus on similarity from a particular viewpoint, rather than on overall similarity, when adding new words to an existing ontology having superordinate and subordinate concepts.

また、本実施の形態に係るオントロジー処理装置１では、抽出された追加候補は単語と共に距離値が表示（例えば、図４参照）されるため、最終的にユーザの操作に応じて、オントロジーに追加するかどうか決定することができる。 In addition, in the ontology processing apparatus 1 according to the present embodiment, each extracted addition candidate is displayed together with its distance value (for example, see FIG. 4), so whether to add it to the ontology can ultimately be decided according to the user's operation.

なお、本実施の形態に係るオントロジー処理装置１では、処理対象選択部１１は、ユーザからの指定によって基本単語Ｗｂを指定していた。しかし、本実施の形態に係るオントロジー処理装置１において、処理対象選択部１１は、あらかじめ規定された単語を基本単語Ｗｂとして指定してもよい。また、本実施の形態に係るオントロジー処理装置１では、処理対象選択部１１は、ユーザからの指定によって逐次、基本単語Ｗｂを指定していた。しかし、本実施の形態に係るオントロジー処理装置１において、処理対象選択部１１は、単語ベクトル作成部２で抽出された一部もしくは全ての単語を基本単語Ｗｂとして指定してもよい。 In the ontology processing device 1 according to the present embodiment, the processing target selection unit 11 has specified the basic word Wb by the specification from the user. However, in the ontology processing apparatus 1 according to the present embodiment, the processing target selection unit 11 may specify a word defined in advance as the basic word Wb. Further, in the ontology processing apparatus 1 according to the present embodiment, the processing target selection unit 11 sequentially specifies the basic words Wb according to the specification from the user. However, in the ontology processing apparatus 1 according to the present embodiment, the processing target selection unit 11 may specify some or all of the words extracted by the word vector creation unit 2 as the basic words Wb.

また、本実施の形態に係るオントロジー処理装置１では、距離計算部１３は、超平面の次元を決めるときの累積寄与率の閾値を９０％としていた。しかし、本実施の形態に係るオントロジー処理装置１において、距離計算部１３は、超平面の次元を決めるときの累積寄与率の閾値を９０％とは異なる値に設定してもよい。このとき、距離計算部１３は、超平面の次元を決めるときの累積寄与率の閾値を、ユーザによって任意に設定できるようにしてもよい。 Further, in the ontology processing apparatus 1 according to the present embodiment, the distance calculation unit 13 sets the threshold of the cumulative contribution ratio when determining the dimension of the hyperplane to 90%. However, in the ontology processing apparatus 1 according to the present embodiment, the distance calculation unit 13 may set the threshold value of the cumulative contribution rate when determining the dimension of the hyperplane to a value different from 90%. At this time, the distance calculation unit 13 may allow the user to arbitrarily set the threshold value of the cumulative contribution rate when determining the dimension of the hyperplane.

また、本実施の形態に係るオントロジー処理装置１では、単語同士の概念上の関係性（類義語らしさ）を評価するときの指標を、Ｍ次元超平面と単語の変換後ベクトルとの距離としていた。しかし、本実施の形態に係るオントロジー処理装置１において、単語同士の概念上の関係性（類義語らしさ）を評価するときの指標を、Ｍ次元超平面と単語の変換後ベクトルとの距離以外の値としてもよい。 In addition, in the ontology processing apparatus 1 according to the present embodiment, the index used to evaluate the conceptual relationship between words (how synonym-like they are) is the distance between the M-dimensional hyperplane and the post-conversion vector of a word. However, in the ontology processing apparatus 1 according to the present embodiment, a value other than the distance between the M-dimensional hyperplane and the post-conversion vector of a word may be used as this index.

＜４．変形例＞ <4. Modification>
以下に、変形例について説明する。 Hereinafter, a modification will be described.

図１０は、上記実施の形態に係るオントロジー処理装置１の機能的構成の一変形例について示したブロック図である。本変形例に係るオントロジー処理装置１は、上記実施の形態に係るオントロジー処理装置１における一部又は全部をソフトウェア的に構成したものである。本変形例に係るオントロジー処理装置１は、例えば、上記実施の形態に係るオントロジー処理装置１において、制御部１および単語ベクトル作成部２を、情報処理部６およびメモリ部７に置き換えたものに相当する。メモリ部７は、例えば、ハードディスクドライブやフラッシュメモリ等で構成されており、オントロジー処理プログラム７１を記憶している。情報処理部６は、メモリ部７に記憶されたオントロジー処理プログラム７１がロードされることにより、制御部１および単語ベクトル作成部２で実行される処理と同様の処理を行う。 FIG. 10 is a block diagram showing a modification of the functional configuration of the ontology processing apparatus 1 according to the above embodiment. In the ontology processing apparatus 1 according to this modification, part or all of the ontology processing apparatus 1 according to the above embodiment is implemented in software. The ontology processing apparatus 1 according to this modification corresponds, for example, to the ontology processing apparatus 1 according to the above embodiment with the control unit 1 and the word vector creation unit 2 replaced by an information processing unit 6 and a memory unit 7. The memory unit 7 is composed of, for example, a hard disk drive or a flash memory, and stores an ontology processing program 71. When the ontology processing program 71 stored in the memory unit 7 is loaded, the information processing unit 6 performs the same processing as that executed by the control unit 1 and the word vector creation unit 2.

本変形例では、オントロジー処理プログラム７１がロードされた情報処理部６によって、制御部１および単語ベクトル作成部２で実行される処理と同様の処理が行われる。従って、本変形例に係るオントロジー処理装置１は、上記実施の形態と同様の効果を奏する。 In this modification, the information processing unit 6 loaded with the ontology processing program 71 performs the same processing as the processing executed by the control unit 1 and the word vector creation unit 2. Therefore, the ontology processing device 1 according to the present modification has the same effects as those of the above embodiment.

以上、実施の形態およびその変形例を挙げて本開示を説明したが、本開示は上記実施の形態等に限定されるものではなく、種々変形が可能である。なお、本明細書中に記載された効果は、あくまで例示である。本開示の効果は、本明細書中に記載された効果に限定されるものではない。本開示が、本明細書中に記載された効果以外の効果を持っていてもよい。 As described above, the present disclosure has been described with reference to the embodiment and the modifications thereof. However, the present disclosure is not limited to the above-described embodiment and the like, and various modifications are possible. Note that the effects described in this specification are merely examples. The effects of the present disclosure are not limited to the effects described in this specification. The present disclosure may have effects other than those described herein.

また、本開示は、以下のような構成を取ることも可能である。
（１）
複数の単語と、複数の単語に対応する複数の単語ベクトルとを保持する単語ベクトル保持部と、
オントロジーを記憶するオントロジー記憶部と、
前記オントロジー記憶部の前記オントロジーから、指定された単語の下位概念に該当する単語をタネ単語として取得するタネ単語取得部と、
前記単語ベクトル保持部から前記単語ベクトルを抽出し、抽出した前記単語ベクトルを、コサイン類似度をユークリッド距離で近似できる所定の空間への変換ベクトルに変換する座標変換部と、
前記タネ単語のフィッティング処理の結果から求められた超平面と、前記座標変換部での変換により得られた前記変換ベクトルの単語との距離を計算する距離計算部と、
前記距離計算部で計算した距離に基づいて、前記単語ベクトル保持部で保持された単語から、前記オントロジー記憶部の前記オントロジーに追加登録する候補となる単語を抽出する追加候補抽出部と、
前記追加候補抽出部で抽出した追加候補の単語の一部又は全部について前記オントロジー記憶部の前記オントロジーに追加するオントロジー編集部と
を備えた
オントロジー処理装置。
（２）
前記座標変換部は、複数の前記タネ単語の単語ベクトルの平均からタネ中心を算出し、算出により得られた前記タネ中心とのコサイン類似度が所定の範囲内の単語に対応する単語ベクトルを前記単語ベクトル保持部から抽出し、前記単語ベクトル保持部から抽出した各前記単語ベクトルを正規化し、正規化されたタネ中心が前記所定の空間で原点となるように、正規化された各前記単語ベクトルを、原点を中心に回転させた上で、前記変換ベクトルに変換する
（１）に記載のオントロジー処理装置。
（３）
複数の単語と、複数の前記単語に対応する複数の単語ベクトルとを保持する単語ベクトル保持部と、オントロジーを記憶するオントロジー記憶部とを備えたコンピュータに対して、
前記オントロジー記憶部の前記オントロジーから、指定された単語の下位概念に該当する単語をタネ単語として取得することと、
前記単語ベクトル保持部から前記単語ベクトルを抽出し、抽出した前記単語ベクトルを、前記タネ単語のコサイン類似度をユークリッド距離で近似できる所定の空間への変換ベクトルに変換することと、
フィッティング処理の結果から求められた超平面と、変換により得られた前記変換ベクトルの単語との距離を計算することと、
計算により得られた距離に基づいて、前記単語ベクトル保持部で保持された単語から、前記オントロジー記憶部の前記オントロジーに追加登録する候補となる単語を抽出することと、
抽出した追加候補の単語の一部又は全部について前記オントロジー記憶部の前記オントロジーに追加することと
を実行させる
オントロジー処理プログラム。
In addition, the present disclosure may have the following configurations.
(1)
An ontology processing apparatus comprising:
a word vector holding unit that holds a plurality of words and a plurality of word vectors corresponding to the plurality of words;
an ontology storage unit that stores an ontology;
a seed word acquisition unit that acquires, from the ontology in the ontology storage unit, words corresponding to subordinate concepts of a specified word as seed words;
a coordinate conversion unit that extracts the word vectors from the word vector holding unit and converts the extracted word vectors into conversion vectors in a predetermined space in which cosine similarity can be approximated by Euclidean distance;
a distance calculation unit that calculates the distance between a hyperplane determined from the result of a fitting process on the seed words and the word of each conversion vector obtained by the conversion in the coordinate conversion unit;
an addition candidate extraction unit that extracts, based on the distance calculated by the distance calculation unit, words that are candidates for additional registration in the ontology in the ontology storage unit from the words held in the word vector holding unit; and
an ontology editing unit that adds some or all of the addition candidate words extracted by the addition candidate extraction unit to the ontology in the ontology storage unit.
(2)
The ontology processing apparatus according to (1), wherein the coordinate conversion unit calculates a seed center from the average of the word vectors of a plurality of the seed words, extracts from the word vector holding unit the word vectors corresponding to words whose cosine similarity with the calculated seed center is within a predetermined range, normalizes each word vector extracted from the word vector holding unit, rotates each normalized word vector about the origin so that the normalized seed center becomes the origin in the predetermined space, and then converts it into the conversion vector.
(3)
An ontology processing program that causes a computer including a word vector holding unit that holds a plurality of words and a plurality of word vectors corresponding to the plurality of words, and an ontology storage unit that stores an ontology, to execute:
acquiring, from the ontology in the ontology storage unit, words corresponding to subordinate concepts of a specified word as seed words;
extracting the word vectors from the word vector holding unit and converting the extracted word vectors into conversion vectors in a predetermined space in which the cosine similarity of the seed words can be approximated by Euclidean distance;
calculating the distance between a hyperplane determined from the result of a fitting process and the word of each conversion vector obtained by the conversion;
extracting, based on the calculated distance, words that are candidates for additional registration in the ontology in the ontology storage unit from the words held in the word vector holding unit; and
adding some or all of the extracted addition candidate words to the ontology in the ontology storage unit.

１…オントロジー処理装置、２…単語ベクトル作成部、３…入出力部、４…オントロジー記憶部、５…文書記憶部、６…情報処理部、７…メモリ部、１１…処理対象選択部、１２…座標変換部、１３…距離計算部、１４…オントロジー編集部、３１…入力部、３２…出力部、７１…オントロジー処理プログラム。 DESCRIPTION OF SYMBOLS 1 ... Ontology processing apparatus, 2 ... Word vector preparation part, 3 ... Input / output part, 4 ... Ontology storage part, 5 ... Document storage part, 6 ... Information processing part, 7 ... Memory part, 11 ... Processing target selection part, 12 ... Coordinate conversion unit, 13 ... Distance calculation unit, 14 ... Ontology editing unit, 31 ... Input unit, 32 ... Output unit, 71 ... Ontology processing program.

Claims

An ontology processing apparatus comprising:
a word vector holding unit that holds a plurality of words and a plurality of word vectors corresponding to the plurality of words;
an ontology storage unit that stores an ontology;
a seed word acquisition unit that acquires, from the ontology in the ontology storage unit, words corresponding to subordinate concepts of a specified word as seed words;
a coordinate conversion unit that extracts the word vectors from the word vector holding unit and converts the extracted word vectors into conversion vectors in a predetermined space in which cosine similarity can be approximated by Euclidean distance;
a distance calculation unit that calculates the distance between a hyperplane determined from the result of a fitting process on the seed words and the word of each conversion vector obtained by the conversion in the coordinate conversion unit;
an addition candidate extraction unit that extracts, based on the distance calculated by the distance calculation unit, words that are candidates for additional registration in the ontology in the ontology storage unit from the words held in the word vector holding unit; and
an ontology editing unit that adds some or all of the addition candidate words extracted by the addition candidate extraction unit to the ontology in the ontology storage unit.

The ontology processing apparatus according to claim 1, wherein the coordinate conversion unit calculates a seed center from the average of the word vectors of a plurality of the seed words, extracts from the word vector holding unit the word vectors corresponding to words whose cosine similarity with the calculated seed center is within a predetermined range, normalizes each word vector extracted from the word vector holding unit, rotates each normalized word vector about the origin so that the normalized seed center becomes the origin in the predetermined space, and then converts it into the conversion vector.

An ontology processing program that causes a computer including a word vector holding unit that holds a plurality of words and a plurality of word vectors corresponding to the plurality of words, and an ontology storage unit that stores an ontology, to execute:
acquiring, from the ontology in the ontology storage unit, words corresponding to subordinate concepts of a specified word as seed words;
extracting the word vectors from the word vector holding unit and converting the extracted word vectors into conversion vectors in a predetermined space in which cosine similarity can be approximated by Euclidean distance;
calculating the distance between a hyperplane determined from the result of a fitting process on the seed words and the word of each conversion vector obtained by the conversion;
extracting, based on the calculated distance, words that are candidates for additional registration in the ontology in the ontology storage unit from the words held in the word vector holding unit; and
adding some or all of the extracted addition candidate words to the ontology in the ontology storage unit.