JP4305083B2

JP4305083B2 - Word similarity calculation device and program

Info

Publication number: JP4305083B2
Application number: JP2003274245A
Authority: JP
Inventors: 一成橋本; 紹明劉
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2003-07-14
Filing date: 2003-07-14
Publication date: 2009-07-29
Anticipated expiration: 2023-07-14
Also published as: JP2005038162A

Description

The present invention relates to an inter-word similarity calculation device and program, and in particular to an inter-word similarity calculation device and program that measure the semantic similarity between words, a basic component technology in natural language processing such as information retrieval, automatic document classification, automatic document clustering, and text mining.

In conventional inter-word similarity calculation devices, a thesaurus or statistical information is used to calculate the similarity between words.

Here, a thesaurus is a database that systematically represents which concepts or words belong to a given concept.

FIG. 11 is a diagram illustrating a thesaurus.

A thesaurus has a hierarchical structure. If concepts and words are taken as nodes, it can be represented as a graph whose links connect a node representing a superordinate concept to a node representing a subordinate concept, or a node representing a concept to a node representing a word belonging to that concept. For example, as shown in FIG. 7, "food" and "fruit" are nodes representing concepts, and "apple" and "mandarin" are nodes representing words. Because the concept "food" is a superordinate concept of the concept "fruit", the two concepts are linked; because the words "apple" and "mandarin" belong to the concept "fruit", each of them is linked to the concept "fruit".
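As an illustration of the graph representation described above, the following minimal sketch encodes the "food"/"fruit" example as a child-to-parent map. The names `PARENT`, `ancestors`, and `links` are hypothetical and are not taken from the patent; the encoding is only one possible choice.

```python
# Hypothetical encoding of the example: each node maps to the node directly above it.
PARENT = {
    "fruit": "food",      # subordinate concept -> superordinate concept
    "apple": "fruit",     # word -> concept it belongs to
    "mandarin": "fruit",  # word -> concept it belongs to
}


def ancestors(node, parent=PARENT):
    """Return the concepts above `node`, nearest first."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def links(parent=PARENT):
    """Return every (upper node, lower node) link in the hierarchy, sorted."""
    return sorted((upper, lower) for lower, upper in parent.items())
```

A child-to-parent map is sufficient here because each node in the example has exactly one node above it.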

For example, there is a technique in which the nodes of two words are connected by a path that passes through the smallest common concept of the two words, and the similarity between the words is calculated using the links on this path (see, for example, Patent Document 1).

Here, Equation 1, for example, expresses a measure dist(wa, wb) of the similarity between words wa and wb.

li: link weight
path(wa, wb): the path between the nodes wa and wb that represent the words

That is, in similarity calculation using a thesaurus, the similarity between two words is measured by way of the concepts to which they belong, so the attributes of the words are reflected in the similarity.
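The image of Equation 1 is not reproduced in this text. The definitions of li and path(wa, wb) suggest a measure that sums the link weights along the path between the two word nodes, and the sketch below is written under that assumption. The weight values and the names `PARENT`, `WEIGHT`, `path_to_root`, and `dist` are hypothetical.

```python
# Child -> parent map for the "food"/"fruit" example.
PARENT = {"fruit": "food", "apple": "fruit", "mandarin": "fruit"}

# Hypothetical link weights l_i, keyed by (upper node, lower node).
WEIGHT = {
    ("food", "fruit"): 1.0,
    ("fruit", "apple"): 0.5,
    ("fruit", "mandarin"): 0.25,
}


def path_to_root(node, parent):
    """Return [node, parent of node, ..., root]."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def dist(wa, wb, parent=PARENT, weight=WEIGHT):
    """Sum the weights of the links on the path wa -> common concept -> wb.

    Assumed form of the measure: a plain sum of link weights over the path.
    """
    up_a = path_to_root(wa, parent)
    up_b = path_to_root(wb, parent)
    common = next(n for n in up_a if n in up_b)  # lowest node shared by both chains
    total = 0.0
    for chain in (up_a, up_b):
        for node in chain[: chain.index(common)]:
            total += weight[(parent[node], node)]
    return total
```

With these weights a smaller value means the words are closer, so `dist` behaves like a distance rather than a score that grows with similarity.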

Here, statistical information refers to the frequency of occurrence of words in a corpus, their degree of co-occurrence, tf-idf, and the like. The vector space model is one technique for measuring the similarity between words using statistical information: it defines vector components from the statistical information and creates a space of word vectors that represent the statistical properties of words. For example, there is a technique that takes the similarity between word vectors in the vector space as the similarity between the words (see, for example, Patent Document 2).

Here, Equation 2, for example, expresses a word vector using degrees of co-occurrence.

In addition, Equation 3 (the cosine method), for example, is a measure of the similarity between words in the vector space model.
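The images of Equations 2 and 3 are not reproduced in this text. As a rough illustration of the idea, the sketch below builds co-occurrence word vectors from a toy corpus and compares them with the standard cosine measure. The corpus, the document-level definition of co-occurrence, and the function names are assumptions made for the example, not the definitions used in the cited document.

```python
import math


def cooccurrence_vector(word, vocabulary, documents):
    """Component i counts the documents containing both `word` and vocabulary[i]."""
    return [
        sum(1 for doc in documents if other != word and word in doc and other in doc)
        for other in vocabulary
    ]


def cosine(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy corpus (hypothetical): each document is a set of words.
DOCUMENTS = [{"apple", "juice"}, {"apple", "pie"}, {"mandarin", "juice"}]
VOCABULARY = ["apple", "juice", "mandarin", "pie"]

apple_vec = cooccurrence_vector("apple", VOCABULARY, DOCUMENTS)
mandarin_vec = cooccurrence_vector("mandarin", VOCABULARY, DOCUMENTS)
similarity = cosine(apple_vec, mandarin_vec)
```

Here "apple" and "mandarin" never appear in the same document, yet their vectors overlap because both co-occur with "juice".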

That is, in the technique using statistical information, the relationship between words is expressed solely by statistical relationships, so the similarity between words can be measured without being bound by the general meanings of the words.

Patent Document 1: JP-A-8-147324
Patent Document 2: JP 2001-273293 A

However, in conventional inter-word similarity calculation devices, the technique using a thesaurus and the technique using statistical information each have problems.

First, as a first problem, consider a thesaurus such as the one shown in FIG. 12(a). As shown in FIG. 12(a), concept B and concept Y belong to a concept X, word b belongs to concept B, concept A and concept Z belong to concept Y, word a belongs to concept A, concept C belongs to concept Z, and word c belongs to concept C. All link weights on the thesaurus paths between words are assumed to be 1.

Here, the path between word a and word b is expressed as
"word a - concept A - concept Y - concept X - concept B - word b".

Likewise, the path between word a and word c is expressed as
"word a - concept A - concept Y - concept Z - concept C - word c".

The number of links between word a and word b and the number of links between word a and word c are both 5, so the similarities between the words are calculated to be equal.
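The link counts for the FIG. 12(a) example can be checked with a short sketch. The tree below follows the description in the text; the helper names `chain_up`, `path_between`, and `link_count` are hypothetical.

```python
# FIG. 12(a) as described in the text, encoded as child -> parent.
PARENT_12A = {
    "B": "X", "Y": "X",  # concepts B and Y belong to concept X
    "b": "B",            # word b belongs to concept B
    "A": "Y", "Z": "Y",  # concepts A and Z belong to concept Y
    "a": "A",            # word a belongs to concept A
    "C": "Z",            # concept C belongs to concept Z
    "c": "C",            # word c belongs to concept C
}


def chain_up(node, parent):
    """Return [node, parent of node, ..., root]."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def path_between(wa, wb, parent=PARENT_12A):
    """Return the node sequence from wa up to the common concept and down to wb."""
    up_a, up_b = chain_up(wa, parent), chain_up(wb, parent)
    common = next(n for n in up_a if n in up_b)
    left = up_a[: up_a.index(common) + 1]  # wa ... common concept
    right = up_b[: up_b.index(common)]     # wb ... node just below the common concept
    return left + right[::-1]


def link_count(wa, wb, parent=PARENT_12A):
    """Number of links on the path; equals dist when every link weight is 1."""
    return len(path_between(wa, wb, parent)) - 1
```

Both pairs give a count of 5 even though word b hangs from a shallower part of the tree than word c, which is the problem the text goes on to describe.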

However, relative to word a, word b belongs to an upper layer and word c belongs to a lower layer. Because classifications in upper layers draw semantically broader distinctions than classifications in lower layers, it is reasonable in terms of actual meaning that the similarity between a given word and a word in a layer below it is greater than the similarity between that word and a word in a layer above it. That is, the similarity between word a and word c should be greater than the similarity between word a and word b.

Thus, the conventional thesaurus-based technique has the problem that, for a given word, it cannot express the difference between its similarity to a word in an upper layer and its similarity to a word in a lower layer.

Next, as a second problem, consider a thesaurus such as the one shown in FIG. 12(b). As shown in FIG. 12(b), concept Q and concept R belong to a concept P, concept S belongs to concept R, word a and word b belong to concept Q, and word c and word d belong to concept S. All link weights on the thesaurus paths between words are assumed to be 1.

Here, the path between word a and word b is expressed as
"word a - concept Q - word b".

Likewise, the path between word c and word d is expressed as
"word c - concept S - word d".

The number of links between word a and word b and the number of links between word c and word d are both 2, so the similarities between the words are calculated to be equal.

However, concept S is two layers below concept P, whereas concept Q is one layer below concept P. Because classifications in upper layers draw semantically broader distinctions than those in lower layers, it is reasonable in terms of actual meaning that words directly under the same concept are more similar the deeper that concept lies in the hierarchy. That is, the similarity between word c and word d should be greater than the similarity between word a and word b.

Thus, the conventional thesaurus-based technique has the problem that the similarity between two words directly under a given concept does not depend on the layer of that concept.

Next, as a third problem, consider a thesaurus such as the one shown in FIG. 12(c). As shown in FIG. 12(c), concept A, concept B, and concept C belong directly to a concept X, and word a belongs directly to concept A, word b to concept B, and word c to concept C. All link weights on the thesaurus paths between words are assumed to be 1.

Here, the path between word a and word b is expressed as
"word a - concept A - concept X - concept B - word b".

Likewise, the path between word a and word c is expressed as
"word a - concept A - concept X - concept C - word c".

The number of links between word a and word b and the number of links between word a and word c are both 4, so the similarities between the words are calculated to be equal. Word b and word c lie at the same depth in the hierarchy, but their superordinate concepts are concept B and concept C, respectively. This difference in superordinate concepts is not reflected, so there is the problem that the similarity between words does not depend on the kinds of concepts on the path.

Accordingly, an object of the present invention is to provide an inter-word similarity calculation device and program capable of effectively overcoming the above problems.

To achieve the above object, the invention of claim 1 is an inter-word similarity calculation device that calculates the semantic similarity between two words to be calculated on the basis of a word material database in which a plurality of words are collected and a vocabulary database in which concepts, and the concepts or words linked to those concepts, are classified and arranged in a hierarchical structure, the device comprising: vocabulary database inter-layer relation setting means for setting vocabulary database inter-layer relation information indicating the relationship between layers in the vocabulary database; superordinate-subordinate concept relation setting means for setting superordinate-subordinate concept relation information indicating the relationship between a concept in the vocabulary database, the superordinate concept above it, and the subordinate concept below it; inter-concept-group correlation setting means for setting inter-concept-group correlation information indicating the relationship between a first concept group in the vocabulary database and a second concept group other than the first concept group; link weight setting means for setting a link weight for each link in the vocabulary database on the basis of the vocabulary database inter-layer relation information, the superordinate-subordinate concept relation information, and the inter-concept-group correlation information; and inter-word similarity calculation means for calculating the similarity between the two words to be calculated on the basis of the link weights set for the links in the vocabulary database by the link weight setting means.

Note that the vocabulary database inter-layer relation setting means corresponds to the thesaurus layer relation parameter setting unit shown in FIG. 2, the superordinate-subordinate concept relation setting means corresponds to the superordinate-subordinate concept relation parameter setting unit shown in FIG. 2, the inter-concept-group correlation setting means corresponds to the inter-concept-group correlation parameter setting unit shown in FIG. 2, the link weight setting means corresponds to the link weight setting unit shown in FIG. 2, and the inter-word similarity calculation means corresponds to the inter-word similarity calculation unit shown in FIG. 1.

The invention of claim 2 is the invention of claim 1, wherein the vocabulary database inter-layer relation setting means sets the vocabulary database inter-layer relation information such that the similarity between second layers, located deeper in the vocabulary database than first layers, is greater than the similarity between the first layers.

The invention of claim 3 is the invention of claim 2, wherein the vocabulary database inter-layer relation information between the second layers is smaller than the vocabulary database inter-layer relation information between the first layers.
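One way to picture claims 2 and 3 is a link weight that shrinks with depth, applied to the FIG. 12(b) thesaurus. The claims do not state a formula at this point, so the geometric decay `decay ** depth` used below is purely an assumption for illustration, as are the names `depth`, `layer_weight`, and `sibling_distance`.

```python
# FIG. 12(b) as described in the text, encoded as child -> parent.
PARENT_12B = {"Q": "P", "R": "P", "S": "R", "a": "Q", "b": "Q", "c": "S", "d": "S"}


def depth(node, parent=PARENT_12B):
    """Number of links between `node` and the root (the root has depth 0)."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d


def layer_weight(upper, parent=PARENT_12B, decay=0.5):
    """Hypothetical inter-layer value for links hanging from `upper`.

    Deeper links receive smaller values, as claim 3 requires; the
    geometric form is an assumption, not the patent's formula.
    """
    return decay ** depth(upper, parent)


def sibling_distance(w1, w2, parent=PARENT_12B, decay=0.5):
    """Sum of the two link weights between words directly under one concept."""
    concept = parent[w1]
    if parent[w2] != concept:
        raise ValueError("words must belong directly to the same concept")
    return 2 * layer_weight(concept, parent, decay)
```

Under this assumed weighting, words c and d (under the deeper concept S) come out closer than words a and b (under concept Q), whereas with all weights fixed at 1 both pairs would be equally distant.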

The invention of claim 4 is the invention of claim 1, wherein the superordinate-subordinate concept relation information is expressed by the difference in features between the superordinate concept and the subordinate concept, and is represented by the link that exists between the superordinate concept and the subordinate concept.

The invention of claim 5 is the invention of claim 4, wherein the superordinate-subordinate concept relation setting means comprises: word feature calculation means for calculating the word features of words from the word material database; concept feature calculation means for calculating the concept feature of a concept on the basis of the word features of all words belonging to the concept; and superordinate-subordinate concept relation information calculation means for calculating the superordinate-subordinate concept relation information on the basis of the word features calculated by the word feature calculation means and the concept features calculated by the concept feature calculation means.

Note that the word feature calculation means corresponds to the word feature calculation unit shown in FIG. 3, the concept feature calculation means corresponds to the concept feature calculation unit shown in FIG. 3, and the superordinate-subordinate concept relation information calculation means corresponds to the superordinate-subordinate concept relation parameter calculation unit shown in FIG. 3.

The invention of claim 6 is the invention of claim 5, wherein, when words belong to the concept, the concept feature calculation means obtains a concept feature sum, which is the sum of the word features of all words belonging to the concept, and a concept word count, which is the number of all those words, and calculates the concept feature of the concept by dividing the concept feature sum by the concept word count; and, when subordinate concepts belong to the concept, the concept feature calculation means obtains the total of the concept feature sums of all subordinate concepts belonging to the concept and the total of the concept word counts of all those subordinate concepts, and calculates the concept feature of the concept by dividing the total of the concept feature sums by the total of the concept word counts.
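The calculation in claim 6 can be sketched as a recursion that carries a (feature sum, word count) pair up the hierarchy. The data layout, the feature values, the "vegetable" and "carrot" nodes, and the function names below are hypothetical; the sketch also accepts a concept that holds words and subordinate concepts together, a case the claim does not address.

```python
def feature_sum_and_count(concept, children, word_features):
    """Return (concept feature sum, concept word count) for `concept`.

    Assumes every concept has at least one word somewhere below it.
    """
    total, count = None, 0
    for child in children[concept]:
        if child in word_features:  # child is a word: contributes its own feature
            vec, n = word_features[child], 1
        else:                       # child is a subordinate concept: recurse
            vec, n = feature_sum_and_count(child, children, word_features)
        total = list(vec) if total is None else [t + v for t, v in zip(total, vec)]
        count += n
    return total, count


def concept_feature(concept, children, word_features):
    """Concept feature = feature sum divided by word count."""
    total, count = feature_sum_and_count(concept, children, word_features)
    return [t / count for t in total]


# Hypothetical hierarchy and word features.
CHILDREN = {
    "food": ["fruit", "vegetable"],
    "fruit": ["apple", "mandarin"],
    "vegetable": ["carrot"],
}
WORD_FEATURES = {"apple": [1.0, 0.0], "mandarin": [0.0, 1.0], "carrot": [2.0, 2.0]}
```

Dividing the pooled sum by the pooled count makes the feature of "food" the mean over all three words, not the mean of the two subordinate concept features.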

The invention of claim 7 is the invention of claim 6, wherein, when setting the superordinate-subordinate concept relation information of a link between a concept and a word belonging to that concept, the superordinate-subordinate concept relation information calculation means calculates the superordinate-subordinate concept relation information on the basis of the concept feature of the concept and the word feature of the word; and, when setting the superordinate-subordinate concept relation information of a link between concepts, it calculates the superordinate-subordinate concept relation information on the basis of the concept feature of the superordinate concept and the concept feature of the subordinate concept of that link.

The invention of claim 8 is the invention of claim 7, wherein the concept feature is represented by a centroid vector in a multidimensional space, and the superordinate-subordinate concept relation information is represented by a distance function whose variables are the concept feature of the superordinate concept and the concept feature of the subordinate concept.
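Claim 8 names a distance function without fixing which one at this point in the text. The sketch below assumes the Euclidean distance and applies it to every link, using the word feature for a word node and the centroid vector for a concept node, as in claim 7. The feature values and the names `euclidean` and `link_relations` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def link_relations(parent, feature, distance=euclidean):
    """Map each (upper node, lower node) link to the distance between their features."""
    return {
        (upper, lower): distance(feature[upper], feature[lower])
        for lower, upper in parent.items()
    }


# Hypothetical features: centroid vectors for concepts, word vectors for words.
PARENT = {"fruit": "food", "apple": "fruit"}
FEATURE = {"food": [0.0, 0.0], "fruit": [3.0, 4.0], "apple": [3.0, 5.0]}

relations = link_relations(PARENT, FEATURE)
```

Passing the distance as a parameter keeps the sketch open to other distance functions, since the text does not commit to one here.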

The invention of claim 9 is the invention of claim 1, wherein the inter-concept-group correlation information is expressed by the relationship between a concept or word included in the first concept group and a concept or word included in the second concept group.

The invention of claim 10 is the invention of claim 9, wherein the inter-concept-group correlation setting means comprises: word co-occurrence information extraction means for extracting co-occurrence information of words from the word material database; link correlation information creation means for attaching the co-occurrence information extracted by the word co-occurrence information extraction means to the links of the vocabulary database and creating link correlation information for each link on the basis of the attached co-occurrence information; and inter-concept-group correlation information calculation means for calculating the inter-concept-group correlation information on the basis of the link correlation information created by the link correlation information creation means.

Note that the word co-occurrence information extraction means corresponds to the word co-occurrence information extraction unit shown in FIG. 4, the link correlation information creation means corresponds to the link correlation information creation unit shown in FIG. 4, and the inter-concept-group correlation information calculation means corresponds to the inter-concept-group correlation parameter calculation unit shown in FIG. 4.

The invention of claim 11 is the invention of claim 10, wherein, when co-occurrence information exists in the link correlation information, the inter-concept-group correlation information calculation means takes the reciprocal of the co-occurrence information total, which is the number of all items of co-occurrence information, as the inter-concept-group correlation information; and, when no co-occurrence information exists in the link correlation information, it takes a positive constant as the inter-concept-group correlation information.
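The rule in claim 11 reduces to a small function. In the sketch below, the function name, the argument name, and the default constant of 1.0 are assumptions; the claim only requires that the constant be positive.

```python
def group_correlation(cooccurrence_total, default=1.0):
    """Inter-concept-group correlation value for one link.

    cooccurrence_total: number of co-occurrence items attached to the link.
    default: positive constant used when the link has no co-occurrence
             information (1.0 is an arbitrary illustrative value).
    """
    if default <= 0:
        raise ValueError("default must be a positive constant")
    if cooccurrence_total > 0:
        return 1.0 / cooccurrence_total  # more co-occurrence -> smaller value
    return default
```

Because the value is a reciprocal, links that carry more co-occurrence information receive smaller values, which shortens paths through them when link weights are summed.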

The invention of claim 12 is the invention of claim 1, wherein the inter-word similarity calculation means takes, as the similarity between the two words to be calculated, the sum of the link weights of the at least one link that exists between those words.

The invention of claim 13 is an inter-word similarity calculation device that calculates the semantic similarity between two words to be calculated on the basis of a word material database in which a plurality of words are collected and a vocabulary database in which concepts, and the concepts or words linked to those concepts, are classified and arranged in a hierarchical structure, the device comprising: inter-concept-group correlation setting means for setting inter-concept-group correlation information indicating the relationship between a first concept group in the vocabulary database and a second concept group other than the first concept group; link weight setting means for setting a link weight for each link in the vocabulary database on the basis of the inter-concept-group correlation information set by the inter-concept-group correlation setting means; and inter-word similarity calculation means for calculating the similarity between the two words to be calculated on the basis of the link weights set for the links in the vocabulary database by the link weight setting means.

The invention of claim 14 is the invention of claim 13, wherein the inter-concept-group correlation information is expressed by the relationship between a concept or word included in the first concept group and a concept or word included in the second concept group.

The invention of claim 15 is the invention of claim 14, wherein the inter-concept-group correlation setting means comprises: word co-occurrence information extraction means for extracting co-occurrence information of words from the word material database; link correlation information creation means for attaching the co-occurrence information extracted by the word co-occurrence information extraction means to the links of the vocabulary database and creating link correlation information for each link on the basis of the attached co-occurrence information; and inter-concept-group correlation information calculation means for calculating the inter-concept-group correlation information on the basis of the link correlation information created by the link correlation information creation means.

The invention of claim 16 is the invention of claim 15, wherein, when co-occurrence information exists in the link correlation information, the inter-concept-group correlation information calculation means takes the reciprocal of the co-occurrence information total, which is the number of all items of co-occurrence information, as the inter-concept-group correlation information; and, when no co-occurrence information exists in the link correlation information, it takes a positive constant as the inter-concept-group correlation information.

The invention of claim 17 is the invention of claim 13, wherein the inter-word similarity calculation means takes, as the similarity between the two words to be calculated, the sum of the link weights of the at least one link that exists between those words.

The invention of claim 18 is an inter-word similarity calculation program for causing a computer to calculate the semantic similarity between two words to be calculated on the basis of a word material database in which a plurality of words are collected and a vocabulary database in which concepts, and the concepts or words linked to those concepts, are classified and arranged in a hierarchical structure, the program causing the computer to function as: vocabulary database inter-layer relation setting means for setting vocabulary database inter-layer relation information indicating the relationship between layers in the vocabulary database; superordinate-subordinate concept relation setting means for setting superordinate-subordinate concept relation information indicating the relationship between a concept in the vocabulary database, the superordinate concept above it, and the subordinate concept below it; inter-concept-group correlation setting means for setting inter-concept-group correlation information indicating the relationship between a first concept group in the vocabulary database and a second concept group other than the first concept group; link weight setting means for setting a link weight for each link in the vocabulary database on the basis of the vocabulary database inter-layer relation information, the superordinate-subordinate concept relation information, and the inter-concept-group correlation information; and inter-word similarity calculation means for calculating the similarity between the two words to be calculated on the basis of the link weights set for the links in the vocabulary database by the link weight setting means.

The invention of claim 19 is the invention of claim 18, wherein the vocabulary database inter-layer relation setting means sets the vocabulary database inter-layer relation information such that the similarity between second layers, located deeper in the vocabulary database than first layers, is greater than the similarity between the first layers.

The invention of claim 20 is the invention of claim 19, wherein the vocabulary database inter-layer relation information between the second layers is smaller than the vocabulary database inter-layer relation information between the first layers.

The invention of claim 21 is the invention of claim 18, wherein the superordinate-subordinate concept relation information is expressed by the difference in features between the superordinate concept and the subordinate concept, and is represented by the link that exists between the superordinate concept and the subordinate concept.

また、請求項２２の発明は、請求項２１の発明において、前記上位下位概念関係設定手段は、更に、前記単語資料データベースから単語の単語特徴量を計算する単語特徴量計算手段、前記概念に属する全ての単語の単語特徴量に基づき、該概念の概念特徴量を計算する概念特徴量計算手段、前記単語特徴量と前記概念特徴量とに基づき、上位下位概念関係情報を計算する上位下位概念関係情報計算手段として機能することを特徴とする。 The invention of claim 22 is the invention of claim 21, wherein the upper subordinate concept relationship setting means further word feature calculation means to calculate a word feature words from the word document database, the concept belongs on the basis of the word feature of all the words, concepts feature quantity calculation means to calculate the concept feature amount of該概precaution, based on said concept feature amount and the word feature, the upper lower for calculating the upper lower conceptually related information It functions as a conceptual relationship information calculation means.

また、請求項２３の発明は、請求項２２の発明において、前記概念に前記単語が属する場合、該概念に属する全ての単語の単語特徴量の和を示す概念特徴総和量と、該全ての単語の数を示す概念単語総和数とを求め、該概念特徴総和量を該概念単語総和数で割算することで、該概念の概念特徴量を前記概念特徴量計算手段が計算し、前記概念に下位概念が属する場合、該概念に属する全ての下位概念の概念特徴総和量の総和量と、該全ての下位概念の概念単語総和数の総和数とを求め、該概念特徴総和量の総和量を概念単語総和数の総和数で割算することで、該概念の概念特徴量を前記概念特徴量計算手段が計算するものであることを特徴とする。 Further, in the invention of claim 23, in the invention of claim 22, when the word belongs to the concept, a concept feature total amount indicating a sum of word feature amounts of all words belonging to the concept, and all the words And calculating the concept feature amount of the concept by the concept feature amount calculation means , and dividing the concept feature sum amount by the concept word sum number. When the subordinate concept belongs, the sum total of the concept feature sums of all the subordinate concepts belonging to the concept and the sum total of the concept word sums of all the subordinate concepts are obtained, and the sum of the concept feature sums is calculated. by dividing the concept words total number of total number, wherein said conceptual feature quantity calculating means concepts characteristic of該概precaution in which to calculate.

また、請求項２４の発明は、請求項２３の発明において、前記概念と該概念に属する単語とのリンクの上位下位概念関係情報を設定する場合、該概念の概念特徴量と該単語の単語特徴量とに基づき、前記上位下位概念関係情報を前記上位下位概念関係情報計算手段が計算し、前記概念同士のリンクの上位下位概念関係情報を設定する場合、該リンクにおける上位概念の概念特徴量と下位概念の概念特徴量とに基づき、前記上位下位概念関係情報を前記上位下位概念関係情報計算手段が計算するものであることを特徴とする。 The invention of claim 24 is the invention of claim 23, wherein, when the upper and lower concept relationship information is set for a link between a concept and a word belonging to that concept, the upper and lower concept relationship information calculation means calculates the upper and lower concept relationship information based on the concept feature amount of the concept and the word feature amount of the word; and when the upper and lower concept relationship information is set for a link between concepts, the upper and lower concept relationship information calculation means calculates it based on the concept feature amount of the superordinate concept and the concept feature amount of the subordinate concept on that link.

また、請求項２５の発明は、請求項２４の発明において、前記概念特徴量は多次元空間の重心ベクトルで表され、前記上位下位概念関係情報は前記上位概念の概念特徴量と、前記下位概念の概念特徴量とを変数とした距離関数で表されることを特徴とする。 The invention of claim 25 is the invention of claim 24, wherein the concept feature amount is represented by a centroid vector in a multidimensional space, and the upper and lower concept relationship information is expressed by a distance function whose variables are the concept feature amount of the superordinate concept and the concept feature amount of the subordinate concept.

また、請求項２６の発明は、請求項１８の発明において、前記概念群間相関関係情報は、前記第１の概念群に含まれる概念若しくは単語と、前記第２の概念群に含まれる概念もしくは単語との関係により表されることを特徴とする。 The invention of claim 26 is the invention of claim 18 , wherein the correlation information between concept groups includes a concept or word included in the first concept group and a concept or word included in the second concept group. It is expressed by the relationship with a word.

また、請求項２７の発明は、請求項２６の発明において、前記概念群間相関関係設定手段は、更に、前記単語資料データベースから単語の共起情報を抽出する単語共起情報抽出手段、該抽出した共起情報を前記語彙データベースのリンクに付与し、該付与された共起情報に基づき、該リンクにおけるリンク相関関係情報を作成するリンク相関関係情報作成手段、該作成したリンク相関関係情報に基づき、前記概念群間相関関係情報を計算する概念群間相関関係情報計算手段として機能することを特徴とする。 The invention of claim 27 is the invention of claim 26, wherein the concept group correlation setting means further functions as word co-occurrence information extraction means for extracting co-occurrence information of words from the word material database, link correlation information creation means for assigning the extracted co-occurrence information to the links of the vocabulary database and creating link correlation information for each link based on the assigned co-occurrence information, and concept group correlation information calculation means for calculating the correlation information between concept groups based on the created link correlation information.

また、請求項２８の発明は、請求項２７の発明において、前記リンク相関関係情報に前記共起情報が存在する場合、前記概念群間相関関係情報計算手段が全ての共起情報の数を示す共起情報総和数の逆数を概念群間相関関係情報とし、前記リンク相関関係情報に前記共起情報が存在しない場合、正の定数を概念群間相関関係情報と前記概念群間相関関係情報計算手段がするものであることを特徴とする。 The invention of claim 28 is the invention of claim 27, wherein, when the co-occurrence information exists in the link correlation information, the concept group correlation information calculation means uses the reciprocal of the co-occurrence information total number, which indicates the number of all pieces of co-occurrence information, as the correlation information between concept groups; and when the co-occurrence information does not exist in the link correlation information, the concept group correlation information calculation means uses a positive constant as the correlation information between concept groups.

また、請求項２９の発明は、請求項１８の発明において、前記計算対象とされる２つの単語間に存在する少なくとも１つのリンクのリンク重みを合計した値を該単語間の類似度と前記単語間類似度計算手段がするものであることを特徴とする。 The invention of claim 29 is the invention of claim 18, wherein the inter-word similarity calculation means uses, as the similarity between the two words to be calculated, the sum of the link weights of the at least one link existing between those words.

また、請求項３０の発明は、コンピュータに、計算対象とされる２つの単語間の意味上の類似性を、複数の単語が収集された単語資料データベースと、概念および該概念にリンクする概念若しくは単語を階層構造で分類・配列した語彙データベースとに基づいて計算させるための単語間類似度計算プログラムにおいて、前記コンピュータを、前記語彙データベースにおける第１の概念群と該第１の概念群以外の第２の概念群との関係を示す概念群間相関関係情報を設定する概念群間相関関係設定手段、該設定した概念群間相関関係情報に基づき、前記語彙データベースにおける各リンクに重みを設定するリンク重み設定手段、前記リンク重みに基づき、前記計算対象とされる２つの単語間の類似度を計算する単語間類似度計算手段として機能させることを特徴とする。 The invention of claim 30 is an inter-word similarity calculation program for causing a computer to calculate the semantic similarity between two words to be calculated, based on a word material database in which a plurality of words are collected and a vocabulary database in which concepts and the concepts or words linked to those concepts are classified and arranged in a hierarchical structure, the program causing the computer to function as concept group correlation setting means for setting correlation information between concept groups, which indicates the relationship between a first concept group in the vocabulary database and a second concept group other than the first concept group, link weight setting means for setting a weight for each link in the vocabulary database based on the set correlation information between concept groups, and inter-word similarity calculation means for calculating the similarity between the two words to be calculated based on the link weights.

また、請求項３１の発明は、請求項３０の発明において、前記概念群間相関関係情報は、前記第１の概念群に含まれる概念若しくは単語と、前記第２の概念群に含まれる概念もしくは単語との関係により表されることを特徴とする。 The invention according to claim 31 is the invention according to claim 30 , wherein the correlation information between concept groups includes a concept or word included in the first concept group and a concept or word included in the second concept group. It is expressed by the relationship with a word.

また、請求項３２の発明は、請求項３１の発明において、前記概念群間相関関係設定手段は、更に、前記単語資料データベースから単語の共起情報を抽出する単語共起情報抽出手段、該抽出した共起情報を前記語彙データベースのリンクに付与し、該付与された共起情報に基づき、該リンクにおけるリンク相関関係情報を作成するリンク相関関係情報作成手段、該作成したリンク相関関係情報に基づき、前記概念群間相関関係情報を計算する概念群間相関関係情報計算手段として機能することを特徴とする。 The invention of claim 32 is the invention of claim 31, wherein the concept group correlation setting means further functions as word co-occurrence information extraction means for extracting co-occurrence information of words from the word material database, link correlation information creation means for assigning the extracted co-occurrence information to the links of the vocabulary database and creating link correlation information for each link based on the assigned co-occurrence information, and concept group correlation information calculation means for calculating the correlation information between concept groups based on the created link correlation information.

また、請求項３３の発明は、請求項３２の発明において、前記リンク相関関係情報に前記共起情報が存在する場合、前記概念群間相関関係情報計算手段が全ての共起情報の数を示す共起情報総和数の逆数を概念群間相関関係情報とし、前記リンク相関関係情報に前記共起情報が存在しない場合、正の定数を概念群間相関関係情報と前記概念群間相関関係情報計算手段がするものであることを特徴とする。 The invention of claim 33 is the invention of claim 32, wherein, when the co-occurrence information exists in the link correlation information, the concept group correlation information calculation means uses the reciprocal of the co-occurrence information total number, which indicates the number of all pieces of co-occurrence information, as the correlation information between concept groups; and when the co-occurrence information does not exist in the link correlation information, the concept group correlation information calculation means uses a positive constant as the correlation information between concept groups.

また、請求項３４の発明は、請求項３３の発明において、前記計算対象とされる２つの単語間に存在する少なくとも１つのリンクのリンク重みを合計した値を該単語間の類似度と前記単語間類似度計算手段がするものであることを特徴とする。 The invention of claim 34 is the invention of claim 33, wherein the inter-word similarity calculation means uses, as the similarity between the two words to be calculated, the sum of the link weights of the at least one link existing between those words.

本発明によれば、シソーラスとコーパスのデータの共起情報とから求めたパラメータに基づき、２つの単語間の距離を算出することで、シソーラス上の２単語間の類似度をシソーラス上の階層の深さに依存させることが可能になり、更に、シソーラス上の任意の概念に直下する２単語間の類似度を当該概念の階層に依存させることが可能になり、更に、シソーラス上の２単語間の類似度をシソーラス上のパス上の概念に依存させることが可能になるという効果を奏する。 According to the present invention, by calculating the distance between two words based on the parameters obtained from the co-occurrence information of the thesaurus and corpus data, the similarity between the two words on the thesaurus can be determined as the level of the hierarchy on the thesaurus. It is possible to depend on the depth, and furthermore, the similarity between two words directly under any concept on the thesaurus can be made to depend on the hierarchy of the concept, and further, between the two words on the thesaurus It is possible to make the degree of similarity dependent on the concept on the path on the thesaurus.

以下、本発明に係わる単語間類似度計算装置およびプログラムの実施の形態について添付図面を参照して詳細に説明する。なお、本発明に係わる単語間類似度計算装置がＰＣ（Personal Computer）等の情報端末装置に組み込まれている構成を実施例として説明する。 Hereinafter, embodiments of the inter-word similarity calculation device and program according to the present invention will be described in detail with reference to the accompanying drawings. A configuration in which the inter-word similarity calculation device according to the present invention is incorporated in an information terminal device such as a PC (Personal Computer) will be described as an example.

図１は、本発明に係わる単語間類似度計算装置１を組み込む情報端末装置２の機能的な構成の一例を示すブロック図である。 FIG. 1 is a block diagram showing an example of a functional configuration of an information terminal device 2 incorporating an inter-word similarity calculation device 1 according to the present invention.

図１に示すように、情報端末装置２は、第１の外部記憶装置３、第２の外部記憶装置４、入力装置５、表示装置６、単語類似度計算装置１から構成されている。 As shown in FIG. 1, the information terminal device 2 includes a first external storage device 3, a second external storage device 4, an input device 5, a display device 6, and a word similarity calculation device 1.

ここで、第１の外部記憶装置３は、コーパスのデータまたはテキスト文のデータを記憶保持し、ハードディスク等から構成されている。 Here, the first external storage device 3 stores and holds corpus data or text data, and is composed of a hard disk or the like.

また、第２の外部記憶装置４は、シソーラスのデータを記憶保持し、ハードディスク等から構成されている。 The second external storage device 4 stores and holds thesaurus data and is composed of a hard disk or the like.

また、入力装置５は、ユーザにより指示された測定対象となる２つの単語を単語類似度計算装置１に入力し、キーボード、マウス等から構成されている。 Further, the input device 5 inputs two words to be measured designated by the user to the word similarity calculation device 1, and is configured from a keyboard, a mouse, and the like.

また、表示装置６は、単語間の類似性を示す数値、シソーラス上の単語間のパス、または各リンク重み等を表示する。 Further, the display device 6 displays a numerical value indicating similarity between words, a path between words on the thesaurus, or each link weight.

また、単語類似度計算装置１は、リンク重み計算部７、単語類似度計算部８から構成されている。 The word similarity calculation device 1 includes a link weight calculation unit 7 and a word similarity calculation unit 8.

ここで、リンク重み計算部７は、リンク重み設定手段により、第１の外部記憶装置３に記憶保持されているコーパスのデータまたはテキスト文のデータと、第２の外部記憶装置４に記憶保持されているシソーラスのデータとに基づき、シソーラス上の各リンクのリンク重みを計算する。なお、リンク重み設定手段については後述にて詳細に説明する。 Here, the link weight calculation unit 7 stores and holds the corpus data or text data stored in the first external storage device 3 and the second external storage device 4 by the link weight setting means. The link weight of each link on the thesaurus is calculated based on the data of the thesaurus. The link weight setting means will be described in detail later.

また、単語類似度計算部８は、リンク重み計算部７により計算されたシソーラスの各リンクのリンク重みと、第２の外部記憶装置４に記憶保持されているシソーラスのデータとに基づき、入力装置５から入力された測定対象となる２つの単語の類似度を計算する。 Further, the word similarity calculation unit 8 is based on the link weight of each link of the thesaurus calculated by the link weight calculation unit 7 and the thesaurus data stored and held in the second external storage device 4. The similarity between two words to be measured input from 5 is calculated.
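The role of the word similarity calculation unit 8 can be sketched in Python as follows. This is a minimal sketch, assuming the thesaurus is a tree stored as a child-to-parent mapping and that the value reported for two words is the sum of the link weights on the path between them, as the summary of claim 29 states. The function names, the word "bread", and all weight values are hypothetical and chosen only for illustration.

```python
def ancestors(node, parent):
    """Return the chain [node, parent(node), ..., root]."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def path_weight_sum(word1, word2, parent, weight):
    """Sum the weights of the links on the tree path between two words.

    parent maps each node to its upper node; weight maps each non-root node
    to the weight of the link joining it to its parent.
    """
    up1 = ancestors(word1, parent)
    up2 = set(ancestors(word2, parent))
    # Lowest common upper concept: first ancestor of word1 that is also above word2.
    common = next(n for n in up1 if n in up2)
    total = 0.0
    for start in (word1, word2):
        node = start
        while node != common:
            total += weight[node]
            node = parent[node]
    return total


# Hypothetical thesaurus fragment and link weights (assumed values).
parent = {"fruit": "food", "apple": "fruit", "mandarin": "fruit", "bread": "food"}
weight = {"fruit": 0.8, "bread": 0.8, "apple": 0.64, "mandarin": 0.64}

same_concept = path_weight_sum("apple", "mandarin", parent, weight)
across_concepts = path_weight_sum("apple", "bread", parent, weight)
```

With these assumed weights, the two words under "fruit" receive a smaller sum than a pair whose path passes through "food"; in this scheme a smaller sum corresponds to a higher evaluated similarity.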

図２は、リンク重み計算部７の機能的な構成の一例を詳細に示すブロック図である。 FIG. 2 is a block diagram illustrating in detail an example of a functional configuration of the link weight calculation unit 7.

図２に示すように、リンク重み計算部７は、シソーラス階層関係パラメータ設定部９、上位下位概念関係パラメータ設定部１０、概念群間相関関係パラメータ設定部１１、リンク重み設定部１２、複数のメモリ（２０１〜２０６）から構成されている。 As shown in FIG. 2, the link weight calculation unit 7 includes a thesaurus hierarchy relationship parameter setting unit 9, a higher and lower concept relationship parameter setting unit 10, a concept group correlation parameter setting unit 11, a link weight setting unit 12, and a plurality of memories. (201 to 206).

ここで、メモリ２０１は、第２の外部記憶装置４からシソーラスのデータを読み出し、読み出したシソーラスのデータを記憶保持し、また、メモリ２０２は、第１の外部記憶装置３からコーパスのデータまたはテキスト文のデータを読み出し、読み出したコーパスのデータまたはテキスト文のデータを記憶保持する。 Here, the memory 201 reads the thesaurus data from the second external storage device 4, stores and holds the read thesaurus data, and the memory 202 stores the corpus data or text from the first external storage device 3. The sentence data is read, and the read corpus data or text sentence data is stored and held.

また、シソーラス階層関係パラメータ設定部９は、シソーラス階層関係パラメータ設定手段により、メモリ２０１に記憶保持されているシソーラスのデータに基づき、シソーラス階層関係パラメータを設定し、設定したシソーラス階層関係パラメータをメモリ２０３に記憶保持する。なお、シソーラス階層関係パラメータおよびシソーラス階層関係パラメータ設定手段については後述にて詳細に説明する。 The thesaurus hierarchy relation parameter setting unit 9 sets the thesaurus hierarchy relation parameters based on the thesaurus data stored and held in the memory 201 by the thesaurus hierarchy relation parameter setting means, and stores the set thesaurus hierarchy relation parameters in the memory 203. Keep it in memory. The thesaurus hierarchy relationship parameters and thesaurus hierarchy relationship parameter setting means will be described in detail later.

また、上位下位概念関係パラメータ設定部１０は、上位下位概念関係パラメータ設定手段により、上位下位概念関係パラメータを設定し、設定した上位下位概念関係パラメータをメモリ２０４に記憶保持する。なお、上位下位概念関係パラメータおよび上位下位概念関係パラメータ設定手段については後述にて詳細に説明する。 The upper / lower concept relationship parameter setting unit 10 sets the upper / lower concept relationship parameters by the upper / lower concept relationship parameter setting means, and stores and holds the set upper / lower concept relationship parameters in the memory 204. The upper / lower concept relationship parameter and the upper / lower concept relationship parameter setting means will be described in detail later.

また、概念群間相関関係パラメータ設定部１１は、概念群間相関関係パラメータ設定手段により、概念群間相関関係パラメータを設定し、設定した概念群間相関関係パラメータをメモリ２０５に記憶保持する。なお、概念群間相関関係パラメータおよび概念群間相関関係パラメータ設定手段については後述にて詳細に説明する。 Further, the concept group correlation parameter setting unit 11 sets the concept group correlation parameter by the concept group correlation parameter setting means, and stores and holds the set concept group correlation parameter in the memory 205. The concept group correlation parameter and the concept group correlation parameter setting means will be described in detail later.

また、リンク重み設定部１２は、リンク重み設定手段により、メモリ２０３に記憶保持されているシソーラス階層関係パラメータと、メモリ２０４に記憶保持されている上位下位概念関係パラメータと、メモリ２０５に記憶保持されている概念群間相関関係パラメータとに基づき、シソーラス上の各リンクのリンク重みを設定し、設定したリンク重みをメモリ２０６に記憶保持する。なお、リンク重み設定手段については後述にて詳細に説明する。 The link weight setting unit 12 stores and holds the thesaurus hierarchy relation parameters stored in the memory 203, the higher and lower concept relation parameters stored in the memory 204, and the memory 205 by the link weight setting means. Based on the correlation parameter between the concept groups, the link weight of each link on the thesaurus is set, and the set link weight is stored and held in the memory 206. The link weight setting means will be described in detail later.

図３は、上位下位概念関係パラメータ設定部１０の機能的な構成の一例を詳細に示すブロック図である。 FIG. 3 is a block diagram illustrating in detail an example of a functional configuration of the upper / lower conceptual relationship parameter setting unit 10.

図３に示すように、上位下位概念関係パラメータ設定部１０は、単語特徴量計算部１３、概念特徴量計算部１４、上位下位概念関係パラメータ計算部１５、複数のメモリ（３０１、３０２）から構成されている。 As shown in FIG. 3, the upper and lower concept relationship parameter setting unit 10 includes a word feature quantity calculation unit 13, a concept feature quantity calculation unit 14, an upper and lower concept relationship parameter calculation unit 15, and a plurality of memories (301 and 302). Has been.

ここで、単語特徴量計算部１３は、単語特徴量計算手段により、メモリ２０２に記憶保持されているコーパスのデータに基づき、メモリ２０１に記憶保持されているシソーラス上の単語に関する単語特徴量を計算し、計算した単語特徴量をメモリ３０１に記憶保持する。なお、単語特徴量計算手段については後述にて詳細に説明する。 Here, the word feature quantity calculation unit 13 calculates the word feature quantity related to the words on the thesaurus stored and held in the memory 201 based on the corpus data stored and held in the memory 202 by the word feature quantity calculation means. Then, the calculated word feature amount is stored and held in the memory 301. The word feature amount calculating means will be described in detail later.

また、概念特徴量計算部１４は、概念特徴量計算手段により、最小単位である概念の概念特徴量を計算する場合、当該概念に属する全ての単語の単語特徴量をメモリ３０１から読み出し、読み出した単語特徴量に基づいて当該概念に関する概念特徴量を計算してメモリ３０２に記憶保持し、また、概念特徴量計算手段により、最小単位でない概念の概念特徴量を計算する場合、当該概念の下位概念の概念特徴量をメモリ３０２から読み出し、読み出した下位概念の概念特徴量に基づいて当該概念の概念特徴量を計算してメモリ３０２に記憶保持し、シソーラスの構造に従って、ボトムアップに全ての概念に関して計算し、最終的には全ての概念の概念特徴量を計算してメモリ３０２に記憶保持する。なお、概念特徴量計算手段については後述にて詳細に説明する。 In addition, when calculating the conceptual feature value of the concept that is the minimum unit by the conceptual feature value calculation unit, the conceptual feature value calculation unit 14 reads the word feature values of all the words belonging to the concept from the memory 301 and reads them. When calculating the concept feature amount related to the concept based on the word feature amount and storing and storing it in the memory 302, and calculating the concept feature amount of the concept that is not the minimum unit by the concept feature amount calculating means, the subordinate concept of the concept The concept feature amount of the concept is read from the memory 302, the concept feature amount of the concept is calculated based on the read concept feature amount of the subordinate concept, and stored in the memory 302, and all the concepts are bottom-up according to the thesaurus structure. Finally, the concept feature values of all the concepts are calculated and stored in the memory 302. Note that the concept feature amount calculating means will be described in detail later.

また、上位下位概念関係パラメータ計算部１５は、上位下位概念関係パラメータ計算手段により、最小単位の概念と単語との間のリンクの上位下位概念関係パラメータを計算する場合、当該単語の単語特徴量をメモリ３０１から読み出すとともに、当該概念の概念特徴量をメモリ３０２から読み出し、読み出した単語特徴量と概念特徴量とに基づいて上位下位概念関係パラメータを計算し、計算した上位下位概念関係パラメータをメモリ２０４に記憶保持し、また、上位下位概念関係パラメータ計算手段により、概念と概念との間のリンクの上位下位概念関係パラメータを計算する場合、当該概念の上位概念の概念特徴量と下位概念の概念特徴量とをメモリ３０２から読み出し、読み出した上位概念の概念特徴量と下位概念の概念特徴量とに基づいて上位下位概念関係パラメータを計算し、計算した上位下位概念関係パラメータをメモリ２０４に記憶保持する。なお、上位下位概念関係パラメータ計算手段については後述にて詳細に説明する。 When calculating the upper and lower concept relationship parameter of a link between a minimum-unit concept and a word, the upper and lower concept relationship parameter calculation unit 15 uses the upper and lower concept relationship parameter calculation means to read the word feature amount of the word from the memory 301 and the concept feature amount of the concept from the memory 302, calculates the upper and lower concept relationship parameter based on the read word feature amount and concept feature amount, and stores the calculated parameter in the memory 204. When calculating the upper and lower concept relationship parameter of a link between two concepts, it reads the concept feature amounts of the superordinate concept and the subordinate concept from the memory 302, calculates the upper and lower concept relationship parameter based on them, and stores the calculated parameter in the memory 204. The upper and lower concept relationship parameter calculation means will be described in detail later.

図４は、概念群間相関関係パラメータ設定部１１の機能的な構成の一例を詳細に示すブロック図である。 FIG. 4 is a block diagram illustrating in detail an example of a functional configuration of the concept group correlation parameter setting unit 11.

図４に示すように、概念群間相関関係パラメータ設定部１１は、単語共起情報抽出部１６、リンク相関関係情報作成部１７、概念群間相関関係パラメータ計算部１８、複数のメモリ（４０１、４０２）から構成されている。 As shown in FIG. 4, the concept group correlation parameter setting unit 11 includes a word co-occurrence information extraction unit 16, a link correlation information creation unit 17, a concept group correlation parameter calculation unit 18, a plurality of memories (401, 402).

ここで、単語共起情報抽出部１６は、単語共起情報抽出手段により、メモリ２０２に記憶保持されているコーパスのデータまたはテキスト文のデータから共起情報を抽出し、抽出した共起情報をメモリ４０１に記憶保持する。なお、単語共起情報抽出手段については後述にて詳細に説明する。 Here, the word co-occurrence information extraction unit 16 extracts the co-occurrence information from the corpus data or the text data stored and held in the memory 202 by the word co-occurrence information extraction unit, and extracts the extracted co-occurrence information. Stored in the memory 401. The word co-occurrence information extracting means will be described in detail later.

また、リンク相関関係情報作成部１７は、リンク相関関係情報作成手段により、メモリ４０１に記憶保持されている共起情報と、メモリ２０１に記憶保持されているシソーラスのデータとに基づき、シソーラス上の各リンクに関するリンク相関関係情報を作成し、作成したリンク相関関係情報をメモリ４０２に記憶保持する。なお、リンク相関関係情報作成手段については後述にて詳細に説明する。 Further, the link correlation information creating unit 17 uses the link correlation information creating unit to create a link on the thesaurus based on the co-occurrence information stored in the memory 401 and the thesaurus data stored in the memory 201. Link correlation information about each link is created, and the created link correlation information is stored and held in the memory 402. The link correlation information creating means will be described in detail later.

また、概念群間相関関係パラメータ計算部１８は、概念群間相関関係パラメータ計算手段により、メモリ４０２に記憶保持されているリンク相関関係情報に基づいてシソーラス上の各リンクの概念群間相関関係パラメータを計算し、計算した概念群間相関関係パラメータをメモリ２０５に記憶保持する。なお、概念群間相関関係パラメータ計算手段については後述にて詳細に説明する。 Further, the inter-concept group correlation parameter calculator 18 calculates the inter-concept group correlation parameter of each link on the thesaurus based on the link correlation information stored in the memory 402 by the inter-concept group correlation parameter calculation means. And the calculated correlation parameter between concept groups is stored and held in the memory 205. The concept group correlation parameter calculation means will be described in detail later.
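The rule stated in the summary of claim 28 for this parameter can be sketched in Python as follows. This is a minimal sketch, assuming that the link correlation information of a link reduces to a single non-negative count of co-occurrences; the function name and the default constant 1.0 are hypothetical, since the text only requires the constant to be positive.

```python
def concept_group_correlation(cooccurrence_total: int, positive_constant: float = 1.0) -> float:
    """Correlation parameter between concept groups for one link.

    cooccurrence_total is the total number of co-occurrences attached to the
    link; positive_constant (assumed value) is used when there are none.
    """
    if cooccurrence_total < 0:
        raise ValueError("the co-occurrence total cannot be negative")
    if positive_constant <= 0:
        raise ValueError("the fallback constant must be positive")
    if cooccurrence_total > 0:
        # Co-occurrence information exists: use the reciprocal of its total.
        return 1.0 / cooccurrence_total
    # No co-occurrence information on this link: fall back to the constant.
    return positive_constant


frequent = concept_group_correlation(4)
absent = concept_group_correlation(0)
```

Under this rule a link with many co-occurrences receives a small value, so it adds little to the sum of link weights along a path.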

そして、リンク重み設定部１２は、シソーラス階層関係パラメータ設定部９により設定したシソーラス階層関係パラメータと、上位下位概念関係パラメータ設定部１０により設定した上位下位概念関係パラメータと、概念群間相関関係パラメータ設定部１１により設定した概念群間相関関係パラメータとに基づき、シソーラス上の各リンクのリンク重みを計算する。 Then, the link weight setting unit 12 sets the thesaurus hierarchy relationship parameters set by the thesaurus hierarchy relationship parameter setting unit 9, the higher and lower concept relationship parameters set by the higher and lower concept relationship parameter setting unit 10, and the inter-concept group correlation parameter setting. Based on the correlation parameter between concept groups set by the unit 11, the link weight of each link on the thesaurus is calculated.

次に、シソーラス階層関係パラメータについて詳細に説明する。 Next, the thesaurus hierarchical relationship parameters will be described in detail.

シソーラス階層関係パラメータは、シソーラスの構造によって定義されたものであり、シソーラスの階層が深いほど小さな値をとるように設定されるため、深い階層での同じ上位概念をもつ２つの下位概念間のパス上に存在するリンクのシソーラス階層関係パラメータの和と、浅い階層での同じ上位概念をもつ２つの下位概念間のパス上に存在するリンクのシソーラス階層関係パラメータの和とを比較すると、深い階層でのリンクのシソーラス階層関係パラメータの方が小さい値となる。 The thesaurus hierarchy relationship parameter is defined by the structure of the thesaurus, and is set so that the deeper the thesaurus hierarchy, the smaller the value, so the path between two subordinate concepts with the same superordinate concept in the deep hierarchy Comparing the sum of the thesaurus hierarchical relationship parameters of the links existing above with the sum of the thesaurus hierarchical relationship parameters of the links existing on the path between two subordinate concepts with the same superordinate concept in the shallow hierarchy, The thesaurus hierarchy relation parameter of the link is smaller.

次に、シソーラス階層関係パラメータ設定部が行うシソーラス階層関係パラメータの設定の処理について図５に示すフローチャートを参照して説明する。 Next, the setting process of the thesaurus hierarchy related parameters performed by the thesaurus hierarchy related parameter setting unit will be described with reference to the flowchart shown in FIG.

シソーラスのデータを読み出し（ステップＳ５０１）、シソーラス階層関係パラメータ設定手段により、シソーラスのデータに基づいてシソーラス階層関係パラメータを設定し（ステップＳ５０２）、シソーラス階層関係パラメータを記憶保持し（ステップＳ５０３）、処理手順を終了する。 The thesaurus data is read (step S501), the thesaurus hierarchy relation parameter setting means sets the thesaurus hierarchy relation parameters based on the thesaurus data (step S502), stores and holds the thesaurus hierarchy relation parameters (step S503), and processing End the procedure.

次に、シソーラス階層関係パラメータ設定手段について詳細に説明する。 Next, the thesaurus hierarchy relationship parameter setting means will be described in detail.

シソーラス階層関係パラメータ設定手段は、数４によりシソーラス階層関係パラメータｌi1を求める。 The thesaurus hierarchy relation parameter setting means obtains the thesaurus hierarchy relation parameter li1 by Equation (4).

ｋi：リンクｉに対する下位概念の階層（ただし、ルートを０階層とする）
０＜ａ＜１

ki: hierarchy level of the lower concept of link i (with the root as level 0)
0 < a < 1

数４により、深い下位階層間のリンクほどｌi1は小さな値をとり、小さい値ほど類似度が大きく評価されることを表すパラメータを設定することができる。例えば、ａ＝０．８とすると、シソーラス階層関係パラメータは階層が深くなるにつれて、０．８、０．６４、０．５１２、…と下降順で値をとることになる。なお、ａには任意の値を設定することができ、また、シソーラス階層関係パラメータを求める数式として数４以外の数式を適用しても良い。 Equation 4 makes it possible to set a parameter for which li1 takes a smaller value for links between deeper levels, where a smaller value indicates that the similarity is evaluated as greater. For example, if a = 0.8, the thesaurus hierarchy relationship parameter takes the values 0.8, 0.64, 0.512, ... in descending order as the hierarchy becomes deeper. Any value may be set for a, and an expression other than Equation 4 may be used to obtain the thesaurus hierarchy relationship parameter.

従って、深い階層ほど単語間の類似度が大きくなるシソーラス階層関係パラメータを求めることにより、１つ目の課題および２つ目の課題を克服することが可能になる。 Therefore, the first problem and the second problem can be overcome by obtaining a thesaurus hierarchical relationship parameter in which the similarity between words increases as the hierarchy becomes deeper.
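The behaviour described for Equation 4 can be sketched in Python as follows. The sketch assumes the form a raised to the power ki; this form is inferred only from the example values 0.8, 0.64, 0.512 given for a = 0.8 and is not taken from the equation itself. The function name is hypothetical.

```python
def hierarchy_parameter(k_i: int, a: float = 0.8) -> float:
    """Thesaurus hierarchy relationship parameter for a link whose lower
    concept sits at level k_i (root = level 0).

    Assumed form a ** k_i, inferred from the example values in the text.
    """
    if not 0.0 < a < 1.0:
        raise ValueError("a must satisfy 0 < a < 1")
    if k_i < 1:
        raise ValueError("the lower concept of a link is at level 1 or deeper")
    return a ** k_i


# Parameters for links whose lower concepts are at levels 1, 2 and 3.
values = [hierarchy_parameter(k) for k in (1, 2, 3)]
```

Because 0 < a < 1, each additional level multiplies the parameter by a, so two words joined under a deep concept accumulate a smaller sum along their path than two words joined under a shallow concept.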

次に、上位下位概念関係パラメータについて詳細に説明する。 Next, the upper and lower conceptual relationship parameters will be described in detail.

上位下位概念関係パラメータは、シソーラスの構造と統計情報とによって定義されたものであり、同一の上位概念に対する下位概念の違いを反映するように設定されるため、リンク周辺の単語に依存した値となる。 The superordinate and subordinate concept relationship parameters are defined by the thesaurus structure and statistical information, and are set to reflect the difference in the subordinate concepts with respect to the same superordinate concept. Become.

次に、上位下位概念関係パラメータ設定部が行う上位下位概念関係パラメータの設定の処理について図６に示すフローチャートを参照して説明する。 Next, the process of setting the upper and lower concept relationship parameters performed by the upper and lower concept relationship parameter setting unit will be described with reference to the flowchart shown in FIG.

コーパスのデータとシソーラスのデータとを読み出し（ステップＳ６０１）、単語特徴量計算手段により、コーパスのデータに基づいてシソーラス上の単語の単語特徴量を計算し（ステップＳ６０２）、単語特徴量を記憶保持し（ステップＳ６０３）、シソーラス上の各概念の概念特徴量の計算を開始する。 The corpus data and thesaurus data are read out (step S601), the word feature amount calculation means calculates the word feature amount of the word on the thesaurus based on the corpus data (step S602), and the word feature amount is stored and held. (Step S603), the calculation of the concept feature amount of each concept on the thesaurus is started.

最小単位の概念である場合（ステップＳ６０４でＹＥＳ）、その概念に属する全ての単語の単語特徴量を読み出し（ステップＳ６０５）、概念特徴量計算手段により、単語特徴量に基づいて概念特徴量を計算し（ステップＳ６０６）、概念特徴量を記憶保持し（ステップＳ６０９）、上位概念がある場合（ステップＳ６１０でＹＥＳ）、ステップＳ６０４に戻る。 If the concept is the minimum unit (YES in step S604), the word feature amounts of all words belonging to the concept are read (step S605), and the concept feature amount is calculated based on the word feature amount by the concept feature amount calculation means. (Step S606), the concept feature quantity is stored and held (step S609), and if there is a superordinate concept (YES in step S610), the process returns to step S604.

ステップＳ６０４において、最小単位の概念でない場合（ステップＳ６０４でＮＯ）、その概念の下位概念の概念特徴量を読み出し（ステップＳ６０７）、概念特徴量計算手段により、概念特徴量に基づいて概念特徴量を計算し（ステップＳ６０８）、概念特徴量を記憶保持する（ステップＳ６０９）。 In step S604, if the concept is not the minimum unit concept (NO in step S604), the concept feature quantity of the subordinate concept of the concept is read (step S607), and the concept feature quantity is calculated based on the concept feature quantity by the concept feature quantity calculation means. The calculation is performed (step S608), and the concept feature amount is stored and held (step S609).

ステップＳ６１０において、上位概念がない場合（ステップＳ６１０でＮＯ）、シソーラス上の各リンクの上位下位概念関係パラメータの計算を開始する。 In step S610, when there is no superordinate concept (NO in step S610), calculation of superordinate / subordinate concept relationship parameters of each link on the thesaurus is started.

リンクが最小単位の概念と単語とのリンクである場合（ステップＳ６１１でＹＥＳ）、概念の概念特徴量と単語の単語特徴量とを読み出し（ステップＳ６１２）、上位下位概念関係パラメータ計算手段により、概念の概念特徴量と単語の単語特徴量とに基づいて上位下位概念関係パラメータを計算し（ステップＳ６１３）、上位下位概念関係パラメータを記憶保持し（ステップＳ６１６）、上位階層のリンクが有る場合（ステップＳ６１７でＹＥＳ）、ステップＳ６１１に戻る。 If the link is a link between the concept and the word in the smallest unit (YES in step S611), the concept feature quantity of the concept and the word feature quantity of the word are read (step S612), and the concept is calculated by the upper and lower concept relationship parameter calculation means. The upper and lower concept relationship parameters are calculated based on the concept feature amount and the word feature amount of the word (step S613), the upper and lower concept relationship parameters are stored and held (step S616), and there is a link in the upper layer (step S616). YES in S617), the process returns to step S611.

ステップＳ６１１において、リンクが最小単位の概念と単語とのリンクでない場合（ステップＳ６１１でＮＯ）、上位概念の概念特徴量と下位概念の概念特徴量とを読み出し（ステップＳ６１４）、上位下位概念関係パラメータ計算手段により、上位概念の概念特徴量と下位概念の概念特徴量とに基づいて上位下位概念関係パラメータを計算し（ステップＳ６１５）、上位下位概念関係パラメータを記憶保持する（ステップＳ６１６）。 In step S611, if the link is not a link between the concept of the smallest unit and a word (NO in step S611), the concept feature quantity of the superordinate concept and the concept feature quantity of the subordinate concept are read (step S614), and the superordinate / lower concept relationship parameters Based on the concept feature quantity of the superordinate concept and the concept feature quantity of the subordinate concept, the superordinate concept relation parameter is calculated by the calculating means (step S615), and the superordinate and subordinate concept relation parameter is stored and held (step S616).

ステップＳ６１７において、上位階層のリンクがない場合（ステップＳ６１７でＮＯ）、処理手順を終了する。 In step S617, if there is no higher-level link (NO in step S617), the processing procedure ends.

次に、上位下位概念関係パラメータ設定手段について詳細に説明する。 Next, the upper and lower concept relationship parameter setting means will be described in detail.

単語特徴量計算手段は、統計情報としてコーパスのデータから単語ｗの単語特徴量ｆwを求める。なお、単語特徴量ｆｗとして単語ベクトルを用い、単語ベクトルの成分にはｔｆ・ｉｄｆ、共起度、カテゴリ等を用いる。 The word feature amount calculating means obtains the word feature amount fw of the word w from the corpus data as statistical information. A word vector is used as the word feature quantity fw, and tf · idf, co-occurrence, category, and the like are used as the word vector components.
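One way to build such a word vector can be sketched in Python as follows. This is a minimal sketch, assuming one vector component per document and the common weighting of raw term frequency multiplied by log(N / df); the text names tf·idf as a possible component but does not specify the weighting, so the formula, the function name, and the toy corpus are assumptions.

```python
import math


def tfidf_word_vector(word, documents):
    """Return one tf-idf component per document for the given word.

    documents is a list of token lists. tf is the raw count of the word in a
    document and idf is log(N / df); this weighting is an assumed variant.
    """
    n_docs = len(documents)
    df = sum(1 for doc in documents if word in doc)
    if df == 0:
        # The word never occurs in the corpus: all components are zero.
        return [0.0] * n_docs
    idf = math.log(n_docs / df)
    return [doc.count(word) * idf for doc in documents]


# Hypothetical tokenized corpus of three documents.
corpus = [["apple", "juice", "apple"], ["mandarin", "juice"], ["bread"]]
apple_vector = tfidf_word_vector("apple", corpus)
juice_vector = tfidf_word_vector("juice", corpus)
```

Vectors built this way have the same length for every word, which allows them to be summed and averaged when the concept feature amounts are computed.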

概念特徴量計算手段は、各概念Ｃiにおいて概念Ｃiに属する全ての単語特徴量の和ＳＣi（以後、概念特徴総和量とする）、および概念Ｃiに属する全ての単語の数ｎＣi（以後、概念単語総和数という）を求め、概念Ｃiが最小単位の概念である場合、数５により概念特徴総和量ＳＣiおよび概念単語総和数ｎＣiを求める。 The concept feature quantity calculation means calculates the sum SCi of all the word feature quantities belonging to the concept Ci in each concept Ci (hereinafter referred to as concept feature total quantity) and the number nCi of all the words belonging to the concept Ci (hereinafter referred to as concept words). When the concept Ci is the concept of the minimum unit, the concept feature summation amount SCi and the concept word summation number nCi are obtained from Equation 5.

｜Ａ｜：集合Ａの要素数

|A|: number of elements in set A

When the concept Ci is not a minimum-unit concept, the concept feature quantity calculation means obtains the concept feature total SCi and the concept word total nCi by Equation 6.

Cij: the j-th subordinate concept immediately below the concept Ci
NCi: number of subordinate concepts of the concept Ci

The concept feature quantity calculation means then obtains the concept feature quantity fCi of the concept Ci from the concept feature total SCi and the concept word total nCi by Equation 7.

Here, the concept feature quantity calculation means first obtains, for each minimum-unit concept, the concept feature total vector representing SCi and the concept word total nCi, and from these obtains the concept feature vector representing the concept feature quantity fCi.

It then obtains the concept feature vector of the superordinate concept to which a minimum-unit concept belongs from the concept feature total vectors and concept word totals of the minimum-unit concepts, and in this way obtains concept feature vectors successively from the minimum-unit concepts up to the topmost concept.
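The bottom-up computation described above can be sketched as follows. This is a minimal illustration based only on the textual description (a feature total and a word total per concept, with the feature quantity as their quotient), since the equation images are not reproduced in this text. The function name `concept_features`, the concept "vegetable", the word "carrot", and all vector values are hypothetical, and every concept is assumed to contain at least one word.

```python
def concept_features(children, word_vectors, root):
    """Compute per-concept feature totals, word totals, and feature vectors.

    children:     concept name -> list of sub-concept names and/or words
    word_vectors: word -> feature vector (list of floats), e.g. tf-idf values
    """
    dim = len(next(iter(word_vectors.values())))
    totals, counts, features = {}, {}, {}

    def visit(node):
        # A word contributes its own vector and a count of one.
        if node in word_vectors:
            return word_vectors[node], 1
        total, n = [0.0] * dim, 0
        for child in children[node]:
            child_total, child_n = visit(child)
            total = [t + c for t, c in zip(total, child_total)]
            n += child_n
        totals[node], counts[node] = total, n
        # Concept feature quantity = feature total divided by word total.
        features[node] = [t / n for t in total]
        return total, n

    visit(root)
    return totals, counts, features


# Hypothetical thesaurus fragment and two-dimensional word vectors.
CHILDREN = {
    "food": ["fruit", "vegetable"],
    "fruit": ["apple", "mandarin"],
    "vegetable": ["carrot"],
}
WORD_VECTORS = {
    "apple": [1.0, 0.0],
    "mandarin": [3.0, 2.0],
    "carrot": [2.0, 4.0],
}

totals, counts, features = concept_features(CHILDREN, WORD_VECTORS, "food")
print(features["fruit"], features["food"], counts["food"])
```

Because each concept passes its total and count upward rather than its average, the vector of "food" is the centroid of all three words, not the mean of the two sub-concept centroids.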

The upper-lower concept relationship parameter calculation means obtains an upper-lower concept relationship parameter for each link on the thesaurus. The superordinate concept vector and the subordinate concept vector are defined by Equation 8.

Chigh: superordinate concept vector
Clow: subordinate concept vector

With the superordinate and subordinate concept vectors defined above, the upper-lower concept relationship parameter li2 based on the Euclidean distance is given by Equation 9.

In this embodiment, the value of li2 becomes smaller as the similarity becomes greater. A distance function other than the Euclidean distance may also be applied to the upper-lower concept relationship parameter.

To evaluate Equation 9, the concept feature total vector and the concept word total of each concept are obtained bottom-up, from the minimum-unit concepts to the topmost concept, and the concept feature vector of each concept is obtained from them. The concept feature vector of a concept is thus obtained from the concepts immediately below it, and the upper-lower concept relationship parameter of a link is obtained from the concept feature vector above the link and the concept feature vector below it. For a link between a word and the concept to which the word belongs, the parameter is obtained from the word vector and the concept feature vector of that concept.
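As a small illustration of the Euclidean-distance parameter described above, the sketch below computes li2 for one concept-to-concept link and one word-to-concept link. The function name `upper_lower_parameter` and the vector values are hypothetical; the text states only that the parameter is a Euclidean distance between the upper and lower vectors.

```python
import math


def upper_lower_parameter(upper_vec, lower_vec):
    """Euclidean distance between the upper and lower feature vectors of a link."""
    return math.dist(upper_vec, lower_vec)


# Hypothetical feature vectors.
food = [2.0, 2.0]    # superordinate concept
fruit = [2.0, 1.0]   # subordinate concept
apple = [1.0, 0.0]   # word belonging to "fruit"

l_concept_link = upper_lower_parameter(food, fruit)   # concept-to-concept link
l_word_link = upper_lower_parameter(fruit, apple)     # word-to-concept link
print(l_concept_link, l_word_link)
```

A smaller value means the two ends of the link have more similar features, which matches the statement that li2 decreases as similarity increases.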

Next, the inter-concept group correlation parameter is described in detail.

The inter-concept group correlation parameter is defined by the structure of the thesaurus and by statistical information, and takes a value that depends on how the two words of each piece of co-occurrence information are distributed over the thesaurus.

Next, the process by which the inter-concept group correlation parameter setting unit sets the parameter is described with reference to the flowchart shown in FIG. 7.

The word co-occurrence information extraction means extracts co-occurrence information from the corpus data (step S701), and the thesaurus data is read out (step S702). The link correlation information creation means creates link correlation information for each link on the thesaurus based on the co-occurrence information (step S703). The inter-concept group correlation parameter calculation means calculates the inter-concept group correlation parameter of each link based on the link correlation information (step S704), the parameter is stored (step S705), and the processing procedure ends.

Next, the inter-concept group correlation parameter setting means is described in detail.

The word co-occurrence information extraction means extracts, as statistical information, combinations of words that co-occur in the corpus data. For example, to extract co-occurrence information from the sentence "Taro ate a rice ball" in the corpus data, morphological and syntactic analysis of the sentence yields the word set {Taro, rice ball, eat}, and three pieces of co-occurrence information are extracted from this set: [Taro, rice ball], [Taro, eat], and [rice ball, eat]. Extraction is not limited to sentence units; extraction in units of a whole text or a paragraph is also applicable.
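The pair extraction in this example amounts to taking every unordered pair from a word set. The sketch below assumes the morphological and syntactic analysis has already produced the word list; the function name `cooccurrence_pairs` is hypothetical.

```python
from itertools import combinations


def cooccurrence_pairs(words):
    """Return every unordered pair of words from one extraction unit."""
    return list(combinations(words, 2))


# Word set assumed to come from analysing "Taro ate a rice ball".
words = ["Taro", "rice ball", "eat"]
pairs = cooccurrence_pairs(words)
print(pairs)
```

A set of k words yields k(k-1)/2 pairs, so the three-word set gives three pairs and a four-word set gives six.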

The link correlation information creation means finds the path on the thesaurus between the two words of each piece of co-occurrence information and assigns that co-occurrence information to every link on the path. Each link thus accumulates the co-occurrence information of the word pairs whose path passes through it, and this accumulated information is the link correlation information.

FIG. 8 is a diagram illustrating an example of the creation of link correlation information by the link correlation information creation means.

For example, suppose the corpus data yields four pieces of co-occurrence information, (Wa, Wb), (Wa, Wc), (Wb, Wc), and (Wc, Wd), and the words Wa, Wb, Wc, and Wd are located on the thesaurus as shown in FIG. 8. The co-occurrence information (Wa, Wc) and (Wb, Wc) is then assigned to link i on the thesaurus. The link correlation information of link i is therefore (Wa, Wc) and (Wb, Wc), and this link is regarded as representing the relationship between the concept group at and below concept A and the set obtained by removing that concept group from the entire thesaurus.
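The assignment of co-occurrence pairs to the links on a path can be sketched as below. The figure is not reproduced in this text, so the tree is a hypothetical layout chosen to be consistent with the description: Wa and Wb sit under concept A, Wc and Wd sit under another concept B, and link i joins A to its parent. Each link is named by its lower node; the names `PARENT`, `Root`, `B`, `path_links`, and `link_correlation` are all assumptions.

```python
from collections import defaultdict

# Hypothetical thesaurus: child -> parent (None marks the root).
PARENT = {
    "Root": None,
    "A": "Root", "B": "Root",
    "Wa": "A", "Wb": "A",
    "Wc": "B", "Wd": "B",
}


def ancestors(node, parent):
    """Return [node, parent(node), ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def path_links(u, v, parent):
    """Links on the tree path between u and v, each named by its lower node."""
    up_u, up_v = ancestors(u, parent), ancestors(v, parent)
    common = set(up_u) & set(up_v)
    links = []
    for chain in (up_u, up_v):
        for node in chain:
            if node in common:   # reached the lowest common ancestor
                break
            links.append(node)
    return links


def link_correlation(pairs, parent):
    """Accumulate, for every link, the pairs whose path passes through it."""
    info = defaultdict(list)
    for u, v in pairs:
        for link in path_links(u, v, parent):
            info[link].append((u, v))
    return info


pairs = [("Wa", "Wb"), ("Wa", "Wc"), ("Wb", "Wc"), ("Wc", "Wd")]
link_info = link_correlation(pairs, PARENT)
print(link_info["A"])   # link i: the link between concept A and its parent
```

In this layout the pair (Wa, Wb) never leaves the subtree of A and (Wc, Wd) never enters it, so only the two pairs that cross between the groups accumulate on link i.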

The inter-concept group correlation parameter calculation means obtains the inter-concept group correlation parameter li3 of link i by Equation 10.

ni: number of pieces of co-occurrence information in the link correlation information of link i
a: positive constant

The inter-concept group correlation parameter is the reciprocal of the number of pieces of co-occurrence information in the link correlation information. When the link correlation information contains no co-occurrence information, the parameter is set to a positive constant. For example, for link i on the thesaurus shown in FIG. 8, ni = 2, so the inter-concept group correlation parameter li3 is 1/2.
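The rule just stated can be written as a short function. This is a sketch of the textual rule only; the function name and the default value of the constant a are assumptions, since the text says only that a is positive.

```python
def concept_group_correlation(n_i, a=1.0):
    """Reciprocal of the co-occurrence count on a link, or the constant a if there is none.

    n_i: number of pieces of co-occurrence information assigned to the link
    a:   positive constant used when the link has none (value assumed here)
    """
    if n_i < 0:
        raise ValueError("n_i must be non-negative")
    if a <= 0:
        raise ValueError("a must be positive")
    return 1.0 / n_i if n_i > 0 else a


l_i3 = concept_group_correlation(2)   # link i in the example above
print(l_i3)
```

Links crossed by many co-occurring pairs receive small values, so concept groups that often appear together end up close to each other.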

Next, how the inter-concept group correlation parameter setting means obtains the parameter is described with reference to a specific example.

"Spring has come and it has become warm. The cherry blossoms have begun to bloom and it is the season for cherry-blossom viewing. At cherry-blossom viewing, take care not to eat or drink too much."
Morphological analysis is applied to this text and the words registered in the thesaurus shown in FIG. 9 are extracted. For example, the three sentences in the text are converted into the following word sets:
1. {spring, warm}
2. {cherry blossom, bloom, cherry-blossom viewing, season}
3. {cherry-blossom viewing, overeating, overdrinking, caution}

Extracting co-occurrence information from these word sets gives the following:
1. (spring, warm)
2. (cherry blossom, bloom), (cherry blossom, cherry-blossom viewing), (cherry blossom, season), (bloom, cherry-blossom viewing), (bloom, season), (cherry-blossom viewing, season)
3. (cherry-blossom viewing, overeating), (cherry-blossom viewing, overdrinking), (cherry-blossom viewing, caution), (overeating, overdrinking), (overeating, caution), (overdrinking, caution)

Link correlation information is then created by assigning the co-occurrence information to the links on each path. For example, the link correlation information of the link connecting (living things) and (nature) consists of the four pieces of co-occurrence information (cherry blossom, cherry-blossom viewing), (cherry blossom, season), (bloom, cherry-blossom viewing), and (bloom, season), so its inter-concept group correlation parameter is 1/4.

The numerical value attached to each link on the thesaurus shown in FIG. 9 is its inter-concept group correlation parameter. Based on these parameters, the inter-word distance between "cherry blossom" and "cherry-blossom viewing", for example, is obtained as follows:
Inter-word distance between "cherry blossom" and "cherry-blossom viewing" = 1/3 + 1/4 + 1/4 + 1/3 + 1/3 + 1/6 + 1/6
= 22/12
= 1.833

Likewise, the inter-word distance between "cherry blossom" and "warm" is obtained as follows:
Inter-word distance between "cherry blossom" and "warm" = 1/3 + 1/4 + 1/4 + 1 + 1
= 34/12
= 2.833

Comparing the two, the distance in terms of the thesaurus structure alone is smaller between "cherry blossom" and "warm" than between "cherry blossom" and "cherry-blossom viewing", yet the inter-word distance based on the inter-concept group correlation parameters is smaller between "cherry blossom" and "cherry-blossom viewing" (1.833) than between "cherry blossom" and "warm" (2.833). In other words, obtaining inter-word distances based on the inter-concept group correlation parameters yields inter-word similarities that reflect the content of the text.
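The two sums above can be checked with exact fractions. The sketch below simply adds the per-link values quoted in the text; the function name `word_distance` and the variable names are hypothetical.

```python
from fractions import Fraction as F


def word_distance(link_values):
    """Sum the per-link parameters along the path between two words."""
    return sum(link_values, F(0))


# Per-link values quoted in the text for the two paths.
to_viewing = [F(1, 3), F(1, 4), F(1, 4), F(1, 3), F(1, 3), F(1, 6), F(1, 6)]
to_warm = [F(1, 3), F(1, 4), F(1, 4), F(1), F(1)]

d_viewing = word_distance(to_viewing)   # "cherry blossom" to "cherry-blossom viewing"
d_warm = word_distance(to_warm)         # "cherry blossom" to "warm"
print(d_viewing, float(d_viewing))
print(d_warm, float(d_warm))
```

The first path has seven links and the second only five, yet the first sum is smaller. This is the effect the comparison describes: the longer path runs over links that many co-occurring pairs cross.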

The inter-concept group correlation parameter obtained in this way represents, for a link on the thesaurus, the correlation between the concept group below the link and the set obtained by removing that concept group from the thesaurus. For a link between a minimum-unit concept and a word, the word is regarded as a concept group, the link is taken to represent the correlation between the word and the set obtained by removing that word from the thesaurus, and the inter-concept group correlation is assumed to hold.

Next, the link weight setting process performed by the link weight setting unit is described with reference to the flowchart shown in FIG. 10.

The thesaurus hierarchy relationship parameter set by the thesaurus hierarchy relationship parameter setting means, the upper-lower concept relationship parameter set by the upper-lower concept relationship parameter setting means, and the inter-concept group correlation parameter set by the inter-concept group correlation parameter setting means are read out (step S1001). The link weight of each link on the thesaurus is calculated based on these three parameters (step S1002), the link weights are stored (step S1003), and the processing procedure ends.

Next, the link weight setting means is described in detail.

The link weight setting means sets the weight of each link on the thesaurus so that it reflects the characteristics of the thesaurus hierarchy relationship parameter, the upper-lower concept relationship parameter, and the inter-concept group correlation parameter.

The link weight setting means obtains the link weight li by Equation 11.

li: link weight
li1: thesaurus hierarchy relationship parameter
li2: upper-lower concept relationship parameter
li3: inter-concept group correlation parameter
α, β: constants

In Equation 11, changing the values of α and β sets how strongly the link weight depends on the upper-lower concept relationship parameter and on the inter-concept group correlation parameter.

For example, when α = 2 and β = 5, the link weight li is given by Equation 12.

That is, the link weight reflects the values of the upper-lower concept relationship parameter and the inter-concept group correlation parameter, so a link weight that reflects the features of the concepts is obtained and the third problem can be overcome.

Accordingly, based on the thesaurus hierarchy relationship parameter, the upper-lower concept relationship parameter, and the inter-concept group correlation parameter, link weights that effectively overcome the three problems can be obtained.

Although the above embodiment describes a configuration in which the link weight is obtained from the thesaurus hierarchy relationship parameter, the upper-lower concept relationship parameter, and the inter-concept group correlation parameter, a configuration in which the link weight is obtained from the inter-concept group correlation parameter alone can also effectively overcome the three problems.

The invention is also applicable in a configuration in which an inter-word similarity calculation program capable of performing the same processing as the inter-word similarity calculation device described in the above embodiment is installed on a general-purpose computer such as an ordinary PC (personal computer).

FIG. 1 is a block diagram showing an example of the functional configuration of an information terminal device 2 incorporating the inter-word similarity calculation device 1 according to the present invention.
FIG. 2 is a block diagram showing in detail an example of the functional configuration of the link weight calculation unit 7.
FIG. 3 is a block diagram showing in detail an example of the functional configuration of the upper-lower concept relationship parameter setting unit 10.
FIG. 4 is a block diagram showing in detail an example of the functional configuration of the inter-concept group correlation parameter setting unit 11.
FIG. 5 is a flowchart showing the process by which the thesaurus hierarchy relationship parameter setting unit sets the thesaurus hierarchy relationship parameter.
FIG. 6 is a flowchart showing the process by which the upper-lower concept relationship parameter setting unit sets the upper-lower concept relationship parameter.
FIG. 7 is a flowchart showing the process by which the inter-concept group correlation parameter setting unit sets the inter-concept group correlation parameter.
FIG. 8 is a diagram showing an example of the creation of link correlation information by the link correlation information creation means.
FIG. 9 is a diagram showing an example of a specific thesaurus.
FIG. 10 is a flowchart showing the link weight setting process performed by the link weight setting unit.
FIG. 11 is a diagram explaining a thesaurus.
FIG. 12 is a diagram showing an example of a thesaurus for explaining the problems of the conventional method.

Explanation of symbols

1 Inter-word similarity calculation device
2 Information terminal device
3 First external storage device
4 Second external storage device
5 Input device
6 Display device
7 Link weight calculation unit
8 Inter-word similarity calculation unit
9 Thesaurus hierarchy relationship parameter setting unit
10 Upper-lower concept relationship parameter setting unit
11 Inter-concept group correlation parameter setting unit
12 Link weight setting unit
13 Word feature quantity calculation unit
14 Concept feature quantity calculation unit
15 Upper-lower concept relationship parameter calculation unit
16 Word co-occurrence information extraction unit
17 Link correlation information creation unit
18 Inter-concept group correlation parameter calculation unit
201, 202, 203, 204, 205, 206, 301, 302, 401, 402 Memory

Claims

An inter-word similarity calculation device that calculates the semantic similarity between two words to be calculated, based on a word material database in which a plurality of words are collected and on a vocabulary database in which concepts, and the concepts or words linked to those concepts, are classified and arranged in a hierarchical structure, the device comprising:
vocabulary database hierarchy relation setting means for setting vocabulary database hierarchy relation information indicating a relationship between hierarchies in the vocabulary database;
upper and lower concept relation setting means for setting upper and lower concept relation information indicating a relationship between a superordinate concept located above a concept in the vocabulary database and a subordinate concept located below that concept;
inter-concept group correlation setting means for setting inter-concept group correlation information indicating a relationship between a first concept group in the vocabulary database and a second concept group other than the first concept group;
link weight setting means for setting a link weight for each link in the vocabulary database based on the vocabulary database hierarchy relation information, the upper and lower concept relation information, and the inter-concept group correlation information; and
inter-word similarity calculation means for calculating the similarity between the two words to be calculated based on the link weight set for each link in the vocabulary database by the link weight setting means.

The inter-word similarity calculation device according to claim 1, wherein the vocabulary database hierarchy relation setting means sets the vocabulary database hierarchy relation information such that the similarity between second hierarchies located deeper than first hierarchies in the vocabulary database is greater than the similarity between the first hierarchies.

The lexical database hierarchy relation information between the second hierarchies is smaller than the lexical database hierarchy relation information between the first hierarchies.

The upper and lower conceptual relationship information is
Represented by the difference in features in the superordinate concept and the subordinate concept,
The inter-word similarity calculation apparatus according to claim 1, wherein the similarity is represented by a link existing between the superordinate concept and the subordinate concept.

The upper and lower concept relationship setting means includes:
Word feature amount calculating means for calculating a word feature amount of a word from the word material database;
A concept feature quantity calculating means for calculating a concept feature quantity of the concept based on word feature quantities of all words belonging to the concept;
Upper and lower concept relation information calculation means for calculating upper and lower concept relation information based on the word feature quantity calculated by the word feature quantity calculation means and the concept feature quantity calculated by the concept feature quantity calculation means. The inter-word similarity calculation apparatus according to claim 4.

The conceptual feature quantity calculating means includes:
When the word belongs to the concept, a concept feature total amount indicating the sum of the word feature amounts of all words belonging to the concept and a concept word total number indicating the number of all the words are obtained, and the concept feature total By dividing the quantity by the total number of concept words, the concept feature quantity of the concept is calculated,
When a subordinate concept belongs to the concept, a sum total of concept feature sums of all subordinate concepts belonging to the concept and a sum total of concept word sums of all the subordinate concepts are obtained, and the concept feature sum total The inter-word similarity calculation apparatus according to claim 5, wherein the concept feature amount of the concept is calculated by dividing the total amount by the sum of the concept word sum.

The upper and lower concept relationship information calculating means includes:
When setting the upper and lower concept relation information of the link between the concept and the word belonging to the concept, the upper and lower concept relation information is calculated based on the concept feature quantity of the concept and the word feature quantity of the word,
When setting the upper / lower concept relationship information of the links between the concepts, the higher / lower concept relationship information is calculated based on the concept feature value of the higher concept and the concept feature value of the lower concept in the link. The inter-word similarity calculation apparatus according to claim 6.

The conceptual feature amount is represented by a centroid vector in a multidimensional space,
The inter-word similarity calculation device according to claim 7, wherein the upper-lower concept relationship information is represented by a distance function having the concept feature value of the higher concept and the concept feature value of the lower concept as variables. .

The concept group correlation information is
The inter-word similarity calculation apparatus according to claim 1, wherein the similarity is represented by a relationship between a concept or word included in the first concept group and a concept or word included in the second concept group.

The inter-word similarity calculation device according to claim 9, wherein the inter-concept group correlation setting means includes:
word co-occurrence information extraction means for extracting word co-occurrence information from the word material database;
link correlation information creation means for assigning the co-occurrence information extracted by the word co-occurrence information extraction means to links of the vocabulary database and creating link correlation information for each link based on the assigned co-occurrence information; and
inter-concept group correlation information calculation means for calculating the inter-concept group correlation information based on the link correlation information created by the link correlation information creation means.

The concept group correlation information calculating means includes:
When the co-occurrence information is present in the link correlation information, the reciprocal of the total number of co-occurrence information indicating the number of all the co-occurrence information is used as the inter-concept group correlation information,
The inter-word similarity calculation apparatus according to claim 10, wherein when the co-occurrence information does not exist in the link correlation information, a positive constant is used as the correlation information between concept groups.

The inter-word similarity calculation means includes:
The inter-word similarity calculation apparatus according to claim 1, wherein a value obtained by summing up link weights of at least one link existing between the two words to be calculated is used as the similarity between the words.

An inter-word similarity calculation device that calculates the semantic similarity between two words to be calculated, based on a word material database in which a plurality of words are collected and on a vocabulary database in which concepts, and the concepts or words linked to those concepts, are classified and arranged in a hierarchical structure, the device comprising:
inter-concept group correlation setting means for setting inter-concept group correlation information indicating a relationship between a first concept group in the vocabulary database and a second concept group other than the first concept group;
link weight setting means for setting a weight for each link in the vocabulary database based on the inter-concept group correlation information set by the inter-concept group correlation setting means; and
inter-word similarity calculation means for calculating the similarity between the two words to be calculated based on the link weight set for each link in the vocabulary database by the link weight setting means.

The concept group correlation information is
The inter-word similarity calculation apparatus according to claim 13, wherein the similarity is expressed by a relationship between a concept or a word included in the first concept group and a concept or a word included in the second concept group.

The inter-word similarity calculation device according to claim 14, wherein the inter-concept group correlation setting means includes:
word co-occurrence information extraction means for extracting word co-occurrence information from the word material database;
link correlation information creation means for assigning the co-occurrence information extracted by the word co-occurrence information extraction means to links of the vocabulary database and creating link correlation information for each link based on the assigned co-occurrence information; and
inter-concept group correlation information calculation means for calculating the inter-concept group correlation information based on the link correlation information created by the link correlation information creation means.

The concept group correlation information calculating means includes:
When the co-occurrence information is present in the link correlation information, the reciprocal of the total number of co-occurrence information indicating the number of all the co-occurrence information is used as the inter-concept group correlation information,
The inter-word similarity calculation apparatus according to claim 15, wherein, when the co-occurrence information does not exist in the link correlation information, a positive constant is used as the correlation information between concept groups.

The inter-word similarity calculation means includes:
The inter-word similarity calculation apparatus according to claim 13, wherein a value obtained by summing up link weights of at least one link existing between two words to be calculated is used as the similarity between the words.

An inter-word similarity calculation program for causing a computer to calculate the semantic similarity between two words to be calculated, based on a word material database in which a plurality of words are collected and on a vocabulary database in which concepts, and the concepts or words linked to those concepts, are classified and arranged in a hierarchical structure, the program causing the computer to function as:
vocabulary database hierarchy relation setting means for setting vocabulary database hierarchy relation information indicating a relationship between hierarchies in the vocabulary database;
upper and lower concept relation setting means for setting upper and lower concept relation information indicating a relationship between a superordinate concept located above a concept in the vocabulary database and a subordinate concept located below that concept;
inter-concept group correlation setting means for setting inter-concept group correlation information indicating a relationship between a first concept group in the vocabulary database and a second concept group other than the first concept group;
link weight setting means for setting a link weight for each link in the vocabulary database based on the vocabulary database hierarchy relation information, the upper and lower concept relation information, and the inter-concept group correlation information; and
inter-word similarity calculation means for calculating the similarity between the two words to be calculated based on the link weight set for each link in the vocabulary database by the link weight setting means.

The inter-word similarity calculation program according to claim 18, wherein the vocabulary database inter-hierarchy relationship setting means sets the vocabulary database inter-hierarchy relationship information such that the similarity between second hierarchies, located deeper in the vocabulary database than first hierarchies, is greater than the similarity between the first hierarchies.

The inter-word similarity calculation program according to claim 18, wherein the superordinate-subordinate concept relationship information is represented by the difference in features between the superordinate concept and the subordinate concept, and by the link existing between the superordinate concept and the subordinate concept.

The inter-word similarity calculation program according to claim 21, wherein the superordinate-subordinate concept relationship setting means further causes the computer to function as:
word feature quantity calculation means for calculating the word feature quantities of words from the word document database;
concept feature quantity calculation means for calculating the concept feature quantity of a concept based on the word feature quantities of all words belonging to the concept; and
superordinate-subordinate concept relationship information calculation means for calculating the superordinate-subordinate concept relationship information based on the concept feature quantity and the word feature quantities.

The inter-word similarity calculation program according to claim 22, wherein:
when words belong to the concept, the concept feature quantity calculation means obtains a concept feature total, indicating the sum of the word feature quantities of all words belonging to the concept, and a concept word count, indicating the number of those words, and calculates the concept feature quantity of the concept by dividing the concept feature total by the concept word count; and
when subordinate concepts belong to the concept, the concept feature quantity calculation means obtains the sum of the concept feature totals of all subordinate concepts belonging to the concept and the sum of the concept word counts of all those subordinate concepts, and calculates the concept feature quantity of the concept by dividing the summed concept feature totals by the summed concept word counts.
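
The two cases in this claim can be sketched with one recursive routine that carries the concept feature total and the concept word count upward together. This is a minimal sketch under assumptions: word feature quantities are taken to be equal-length lists of floats, and `children`, `word_feature`, `concept_totals`, and `concept_feature` are hypothetical names. The sketch also accepts a concept that holds both words and subordinate concepts, a case the claim does not address.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def concept_totals(
    concept: str,
    children: Dict[str, List[str]],
    word_feature: Dict[str, Vector],
) -> Tuple[Vector, int]:
    """Return (concept feature total, concept word count) for a concept."""
    total: Vector = []
    count = 0
    for child in children[concept]:
        if child in word_feature:
            # A word contributes its own feature vector and counts as one word.
            vec, n = word_feature[child], 1
        else:
            # A subordinate concept contributes its feature total and word count.
            vec, n = concept_totals(child, children, word_feature)
        total = list(vec) if not total else [t + v for t, v in zip(total, vec)]
        count += n
    if count == 0:
        raise ValueError(f"concept {concept!r} contains no words")
    return total, count


def concept_feature(
    concept: str,
    children: Dict[str, List[str]],
    word_feature: Dict[str, Vector],
) -> Vector:
    """Divide the concept feature total by the concept word count."""
    total, count = concept_totals(concept, children, word_feature)
    return [t / count for t in total]
```

Dividing summed totals by summed counts, rather than averaging the subordinate concepts' averages, weights every word equally regardless of which subordinate concept it belongs to.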

The inter-word similarity calculation program according to claim 23, wherein the superordinate-subordinate concept relationship information calculation means:
when setting the superordinate-subordinate concept relationship information of a link between a concept and a word belonging to that concept, calculates the information based on the concept feature quantity of the concept and the word feature quantity of the word; and
when setting the superordinate-subordinate concept relationship information of a link between concepts, calculates the information based on the concept feature quantity of the superordinate concept and the concept feature quantity of the subordinate concept on that link.

25. The inter-word similarity calculation program according to claim 24, wherein the concept feature quantity is represented by a centroid vector in a multidimensional space, and the superordinate-subordinate concept relationship information is represented by a distance function whose variables are the concept feature quantity of the superordinate concept and the concept feature quantity of the subordinate concept.
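
A centroid vector and a distance function of this kind can be sketched as follows. The claim does not name a particular distance function; Euclidean distance is used here purely as an assumption, and `centroid` and `upper_lower_relation` are hypothetical names.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the component-wise mean of equal-length feature vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def upper_lower_relation(upper: Sequence[float], lower: Sequence[float]) -> float:
    """Distance between two centroids (Euclidean distance is an assumption)."""
    return math.dist(upper, lower)
```

Any other function of the two centroids, such as a cosine-based distance, would fit the wording of the claim equally well.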

The inter-word similarity calculation program according to claim 18, wherein the inter-concept-group correlation information is represented by the relationship between a concept or word included in the first concept group and a concept or word included in the second concept group.

The inter-word similarity calculation program according to claim 26, wherein the inter-concept-group correlation setting means further causes the computer to function as:
word co-occurrence information extraction means for extracting co-occurrence information of words from the word document database;
link correlation information creation means for assigning the extracted co-occurrence information to links in the vocabulary database and creating link correlation information for those links based on the assigned co-occurrence information; and
inter-concept-group correlation information calculation means for calculating the inter-concept-group correlation information based on the created link correlation information.
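
The first of these means, extraction of word co-occurrence information, can be sketched as follows. The claim does not define the co-occurrence window, so this sketch assumes that two words co-occur when they appear in the same document; `extract_cooccurrences`, `documents`, and `vocabulary` are hypothetical names, and how the counts are then assigned to links is left outside the sketch.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def extract_cooccurrences(
    documents: Iterable[List[str]],
    vocabulary: Set[str],
) -> Counter:
    """Count, per unordered word pair, the documents in which both words occur."""
    counts: Counter = Counter()
    for tokens in documents:
        # Keep only vocabulary words, once each, in a stable sorted order
        # so that each pair is always keyed the same way.
        present = sorted(set(tokens) & vocabulary)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts
```

Each key of the returned counter is a tuple of two words in alphabetical order, and pairs that never co-occur read as zero.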

The inter-word similarity calculation program according to claim 27, wherein the inter-concept-group correlation information calculation means:
when co-occurrence information is present in the link correlation information, sets as the inter-concept-group correlation information the reciprocal of the total co-occurrence count, which indicates the number of all items of co-occurrence information; and
when no co-occurrence information is present in the link correlation information, sets a positive constant as the inter-concept-group correlation information.

The inter-word similarity calculation program according to claim 18, wherein the inter-word similarity calculation means calculates the similarity between the words by summing the link weights of the one or more links existing between the two words to be calculated.

An inter-word similarity calculation program for causing a computer to calculate the semantic similarity between two words to be calculated, based on a word document database in which a plurality of words are collected and on a vocabulary database in which concepts, and the concepts or words linked to those concepts, are classified and arranged in a hierarchical structure, the program causing the computer to function as:
inter-concept-group correlation setting means for setting inter-concept-group correlation information representing the relationship between a first concept group and a second concept group, other than the first concept group, in the vocabulary database;
link weight setting means for setting a link weight for each link in the vocabulary database based on the set inter-concept-group correlation information; and
inter-word similarity calculation means for calculating the similarity between the two words to be calculated based on the link weights.

The inter-word similarity calculation program according to claim 30, wherein the inter-concept-group correlation information is represented by the relationship between a concept or word included in the first concept group and a concept or word included in the second concept group.

The inter-word similarity calculation program according to claim 31, wherein the inter-concept-group correlation setting means further causes the computer to function as:
word co-occurrence information extraction means for extracting co-occurrence information of words from the word document database;
link correlation information creation means for assigning the extracted co-occurrence information to links in the vocabulary database and creating link correlation information for those links based on the assigned co-occurrence information; and
inter-concept-group correlation information calculation means for calculating the inter-concept-group correlation information based on the created link correlation information.

The inter-word similarity calculation program according to claim 32, wherein the inter-concept-group correlation information calculation means:
when co-occurrence information is present in the link correlation information, sets as the inter-concept-group correlation information the reciprocal of the total co-occurrence count, which indicates the number of all items of co-occurrence information; and
when no co-occurrence information is present in the link correlation information, sets a positive constant as the inter-concept-group correlation information.

The inter-word similarity calculation program according to claim 33, wherein the inter-word similarity calculation means calculates the similarity between the words by summing the link weights of the one or more links existing between the two words to be calculated.