JP2014191777A

JP2014191777A - Word meaning analysis device and program

Info

Publication number: JP2014191777A
Application number: JP2013069219A
Authority: JP
Inventors: Ichiro Yamada; 一郎山田; Taro Miyazaki; 太郎宮▲崎▼
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2013-03-28
Filing date: 2013-03-28
Publication date: 2014-10-06
Anticipated expiration: 2033-03-28
Also published as: JP6106489B2

Abstract

PROBLEM TO BE SOLVED: To rank the meanings of a word having a plurality of meanings by how likely each meaning is to be used, according to the search target. SOLUTION: A meaning-characterizing word extraction unit 11 of a word meaning analysis device 1 extracts, from meaning description text data in which a plurality of meanings of a word having a plurality of meanings are described, meaning-characterizing words that characterize each meaning. A related word extraction unit 12 extracts related words of the word from a collection of text data on the basis of co-occurrence with the word. A similarity calculation unit 13 calculates the similarity between the meaning-characterizing words extracted by the meaning-characterizing word extraction unit 11 and the related words extracted by the related word extraction unit 12. A ranking processing unit 14 determines, on the basis of the similarity calculated by the similarity calculation unit 13, the order in which the meanings corresponding to the meaning-characterizing words are likely to be used.

Description

The present invention relates to a word meaning analysis device and a program.

Conventionally, manually produced information, such as that found in Japanese-language dictionaries, has served as the indicator of which meaning an ambiguous word with multiple meanings is most likely to be used in. In WordNet, an English dictionary that classifies word meanings, the meanings of each word are ranked on the basis of resources such as the SemCor Corpus (http://www.gabormelli.com/RKB/SemCor_Corpus), in which each word has been manually annotated with its meaning (see Non-Patent Document 1).

"WordNet", [online], December 27, 2012, PRINCETON UNIVERSITY, [retrieved March 14, 2013], Internet <URL: http://wordnet.princeton.edu/>

Ranking meanings by how likely they are to be used takes a person an enormous amount of time, which makes dictionaries difficult to create and update. Moreover, when a meaning ranking is used for search, the likelihood of each meaning should be set for each search target, and doing this by hand is extremely difficult.

The present invention has been made in view of these circumstances, and provides a word meaning analysis device and a program capable of ranking, according to the search target, the meanings in which a word having multiple meanings is likely to be used.

[1] One aspect of the present invention is a word meaning analysis device comprising: a meaning-characterizing word extraction unit that extracts, from meaning description text data in which a plurality of meanings of a word having a plurality of meanings are described, meaning-characterizing words that characterize each of the meanings; a related word extraction unit that extracts related words of the word from a set of text data on the basis of co-occurrence with the word; a similarity calculation unit that calculates the similarity between the meaning-characterizing words extracted by the meaning-characterizing word extraction unit and the related words extracted by the related word extraction unit; and a ranking processing unit that determines, on the basis of the similarity calculated by the similarity calculation unit, the order in which the meanings corresponding to the meaning-characterizing words are likely to be used.
According to this aspect, the word meaning analysis device extracts, from the meaning description text data, meaning-characterizing words that characterize each meaning of a word having a plurality of meanings, and extracts related words from a set of text data on the basis of co-occurrence with that word. The device calculates the similarity between the extracted meaning-characterizing words and the related words and, on the basis of the similarity obtained, determines the order in which the meanings described in the meaning description text data are likely to be used.
The word meaning analysis device can thus rank the meanings in which a word having a plurality of meanings is likely to be used. In addition, by changing the text data from which related words are extracted, the device can determine a meaning ranking suited to the search target.

[2] One aspect of the present invention is the word meaning analysis device described above, wherein the meaning-characterizing word extraction unit extracts, as a meaning-characterizing word, a noun contained in the final clause of the definition sentence of each meaning described in the meaning description text data.
According to this aspect, the word meaning analysis device extracts the noun that serves as a meaning-characterizing word from the final clause of the definition sentence of each meaning described in the meaning description text data.
The word meaning analysis device can thus extract, as a meaning-characterizing word, a noun that represents the meaning well.

[3] One aspect of the present invention is the word meaning analysis device described above, wherein, when the noun contained in the final clause of the definition sentence is a specific word indicating one among several, the meaning-characterizing word extraction unit extracts, as a meaning-characterizing word, a noun contained in a clause that modifies the final clause.
According to this aspect, when the final clause of a definition sentence is a specific word indicating one among several, such as "ひとつ" ("one") or "一種" ("a kind"), the word meaning analysis device extracts the noun that serves as a meaning-characterizing word from a clause that modifies the final clause.
The word meaning analysis device can thus accurately extract the noun that serves as a meaning-characterizing word.

[4] One aspect of the present invention is the word meaning analysis device described above, wherein the set of text data is a set of text data that is the target of a search based on the word.
According to this aspect, the word meaning analysis device extracts related words of a word used as a keyword from the set of text data that is the target of keyword search.
The word meaning analysis device can thus accurately rank, according to the search target, the meanings in which a word having a plurality of meanings is likely to be used.

[5] One aspect of the present invention is a program for causing a computer used as a word meaning analysis device to function as: a meaning-characterizing word extraction unit that extracts, from meaning description text data in which a plurality of meanings of a word having a plurality of meanings are described, meaning-characterizing words that characterize each of the meanings; a related word extraction unit that extracts related words of the word from a set of text data on the basis of co-occurrence with the word; a similarity calculation unit that calculates the similarity between the meaning-characterizing words extracted by the meaning-characterizing word extraction unit and the related words extracted by the related word extraction unit; and a ranking processing unit that determines, on the basis of the similarity calculated by the similarity calculation unit, the order in which the meanings corresponding to the meaning-characterizing words are likely to be used.

According to the present invention, the meanings in which a word having a plurality of meanings is likely to be used can be ranked according to the search target.

FIG. 1 is a functional block diagram showing the configuration of a word meaning analysis device according to one embodiment of the present invention.
FIG. 2 is a diagram showing an example of meaning description text in the embodiment.
FIG. 3 is a diagram showing an example of a base text set in the embodiment.
FIG. 4 is a flowchart showing the processing procedure of the word meaning analysis device in the embodiment.
FIG. 5 is a diagram showing an example of meaning-characterizing words in the embodiment.
FIG. 6 is a diagram showing an example of the mutual information of related words with respect to the ranking target word in the embodiment.
FIG. 7 is a diagram showing an example of the distribution similarity between related words and meaning-characterizing words in the embodiment.
FIG. 8 is a diagram showing an example of a ranking result in the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a functional block diagram showing the configuration of a word meaning analysis device 1 according to one embodiment of the present invention. The word meaning analysis device 1 is realized by one or more computer devices and, as shown in the figure, comprises a storage unit 10, a meaning-characterizing word extraction unit 11, a related word extraction unit 12, a similarity calculation unit 13, and a ranking processing unit 14.

The storage unit 10 stores the various data used in the processing of each unit. Specifically, the storage unit 10 stores meaning description text and a base text set. The meaning description text is text data (meaning description text data) in which a plurality of meanings are described for an ambiguous word, that is, a word having multiple meanings. The base text set is the set of text data from which words related to the word whose meanings are to be ranked are extracted.

The meaning-characterizing word extraction unit 11 extracts, from the meaning description text stored in the storage unit 10, words that characterize the meanings of the word for which a meaning ranking is to be created. Hereinafter, the word for which a meaning ranking is to be created is referred to as the "ranking target word", and a word that characterizes a meaning is referred to as a "meaning-characterizing word". The related word extraction unit 12 extracts words related to the ranking target word from the base text set stored in the storage unit 10. The base text set is, for example, the set of text data that is searched using the ranking target word as a keyword. Hereinafter, a word related to the ranking target word is referred to as a "related word". The similarity calculation unit 13 calculates the similarity between the meaning-characterizing words extracted by the meaning-characterizing word extraction unit 11 and the related words extracted by the related word extraction unit 12. On the basis of the similarity calculated by the similarity calculation unit 13, the ranking processing unit 14 determines the order in which the meanings corresponding to the meaning-characterizing words are likely to be used. The ranking processing unit 14 thereby generates a ranking of which of the meanings described in the meaning description text are likely to be used.

Next, the data used by the word meaning analysis device 1 will be described.
FIG. 2 is a diagram showing an example of meaning description text. As the meaning description text, for example, the disambiguation pages of Wikipedia (http://ja.wikipedia.org/), an existing encyclopedia service provided on the Internet, can be used. Dictionary data such as a Japanese-language dictionary can also be used as the meaning description text. In the example shown in the figure, sentences defining a plurality of meanings are described for the ranking target word "雷" ("thunder").

FIG. 3 is a diagram showing an example of a base text set. The figure shows the case in which a program EPG (Electronic Program Guide) is used as the base text set. For each of a plurality of programs, the program EPG shown in the figure contains an identifier that identifies the program (Id), the program name (Title), an abbreviated program name (Short Title), a program description (Description), and the program content (Detail).

Next, the operation of the word meaning analysis device 1 will be described.
FIG. 4 is a flowchart showing the operating procedure of the word meaning analysis device 1 shown in FIG. 1.

[Step S1: Meaning-characterizing word extraction]
The meaning-characterizing word extraction unit 11 reads the meaning description text of the ranking target word from, for example, meaning description text published on the Internet, and writes it to the storage unit 10. Alternatively, the meaning-characterizing word extraction unit 11 may read the meaning description text of the ranking target word from meaning description text, such as a Japanese-language dictionary, stored in the storage unit 10 in advance. From the meaning description text in which a plurality of meanings of the ranking target word are described, the meaning-characterizing word extraction unit 11 extracts the meaning-characterizing words that characterize each of those meanings.

Specifically, from the meaning description text in which the meanings of the ranking target word are described, the meaning-characterizing word extraction unit 11 reads the first definition sentence that defines each meaning, parses it syntactically, and takes the final clause of that definition sentence as the clause from which a meaning-characterizing word is to be extracted. Hereinafter, the clause from which a meaning-characterizing word is to be extracted is referred to as the "word extraction target clause". The meaning-characterizing word extraction unit 11 extracts the noun contained in the word extraction target clause.

However, when the final clause is a specific word indicating one among several, such as "ひとつ" ("one") or "一種" ("a kind"), the meaning-characterizing word extraction unit 11 takes as the word extraction target clause the clause that modifies the final clause, is in the "の" (genitive) case, and is closest to the final clause, and extracts the noun from it. The specific words are stored in the storage unit 10 in advance. For example, in the meaning description text for the ranking target word "雷" shown in FIG. 2, the final clause of the definition sentence "ゲーム用語のひとつ" ("one of the game terms") is "ひとつ". The meaning-characterizing word extraction unit 11 therefore takes the clause "ゲーム用語の", which modifies the final clause "ひとつ", as the word extraction target clause and extracts the noun "ゲーム用語" ("game term").
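The clause-selection rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Clause` structure, the `SPECIFIC_WORDS` contents, and the fallback when no genitive modifier exists are hypothetical, and the parsing that would produce the clauses (a Japanese dependency parser) is assumed to have been done already.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical representation of one parsed clause (bunsetsu). A real system
# would obtain this from a Japanese dependency parser; none is assumed here.
@dataclass
class Clause:
    noun: str             # noun contained in the clause
    particle: str         # trailing particle, e.g. "の"; "" if none
    head: Optional[int]   # index of the clause this one modifies; None for the final clause

# Assumed contents of the stored list of "one among several" words.
SPECIFIC_WORDS = {"ひとつ", "一種"}

def target_clause(clauses: List[Clause]) -> Clause:
    """Return the clause from which the meaning-characterizing noun is taken."""
    last = len(clauses) - 1
    final = clauses[last]
    if final.noun not in SPECIFIC_WORDS:
        return final
    # Scan backwards so the genitive modifier nearest the final clause wins.
    for idx in range(last - 1, -1, -1):
        clause = clauses[idx]
        if clause.head == last and clause.particle == "の":
            return clause
    return final  # fallback when no genitive modifier exists (an assumption)

# "ゲーム用語のひとつ" split into two clauses by hand.
definition = [Clause("ゲーム用語", "の", 1), Clause("ひとつ", "", None)]
print(target_clause(definition).noun)
```

For a definition sentence whose final clause is not one of the specific words, the function simply returns the final clause.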

Furthermore, when a noun extracted from the word extraction target clause has an unnecessary suffix, the meaning-characterizing word extraction unit 11 removes that suffix. For example, it removes the suffix "版" ("edition") from "漫画版" ("manga edition") to obtain "漫画" ("manga"). The unnecessary suffixes are stored in the storage unit 10 in advance. In addition, when the definition sentence contains a clause that is in a parallel relationship with the word extraction target clause, the meaning-characterizing word extraction unit 11 also treats the parallel clause as a word extraction target clause and extracts its noun, so that more than one noun may be extracted. In the definition sentence "自然現象・気象のひとつ" ("one of the natural phenomena / weather phenomena") in the example meaning description text of FIG. 2, the clause "気象の", which modifies the final clause "ひとつ" and is closest to it, becomes a word extraction target clause, and the clause "自然現象・", which is parallel to it, also becomes a word extraction target clause. The meaning-characterizing word extraction unit 11 thus extracts the nouns "自然現象" ("natural phenomenon") and "気象" ("weather") from the respective word extraction target clauses.
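A minimal Python sketch of the suffix removal follows. The suffix list contains only the one example given in the text, and the function names are hypothetical. The guard that refuses to strip a noun down to nothing is an added assumption, not something the source states.

```python
# Assumed contents of the stored list of unnecessary suffixes.
UNNECESSARY_SUFFIXES = ("版",)

def strip_suffix(noun: str, suffixes=UNNECESSARY_SUFFIXES) -> str:
    """Remove one listed suffix from the end of the noun, if present."""
    for suffix in suffixes:
        # Leave the noun unchanged if stripping would leave nothing (assumption).
        if noun.endswith(suffix) and len(noun) > len(suffix):
            return noun[: -len(suffix)]
    return noun

def clean_nouns(target_nouns):
    """Clean the nouns taken from the target clause and any parallel clauses."""
    return [strip_suffix(noun) for noun in target_nouns]

print(strip_suffix("漫画版"))
print(clean_nouns(["自然現象", "気象"]))
```

Because parallel clauses can contribute additional nouns, the result for one meaning is a list rather than a single noun.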

When extracting a noun from a word extraction target clause, the meaning-characterizing word extraction unit 11 tries to obtain a noun that is as general as possible: it removes the morphemes (minimal units of meaning) that make up the extracted noun one at a time from the front, checking at each step whether the result is a general noun. For this purpose, a list of frequent nouns, for example the one million nouns that appear most often on the web, is stored in the storage unit 10 in advance. The list may be obtained, for example, from frequent-noun data published on the Internet, or may consist of nouns selected on the basis of the number of hits returned when each noun is searched for on the Internet. The meaning-characterizing word extraction unit 11 removes the morphemes of the extracted noun from the front, one by one, until the remainder matches a frequent noun stored in the storage unit 10. For example, for the fourth definition sentence in the meaning description text of FIG. 2, "日本の男性アイドルグループ" ("Japanese male idol group"), the meaning-characterizing word extraction unit 11 takes the final clause as the word extraction target clause and extracts the noun "男性アイドルグループ" ("male idol group"). It splits this noun into "男性／アイドル／グループ" ("male / idol / group") by morphological analysis. It first checks whether "男性アイドルグループ" is a general noun. Because "男性アイドルグループ" is not among the frequent nouns, the unit judges that it is not a general noun, removes the leading morpheme "男性", and checks whether "アイドルグループ" ("idol group") is a general noun. Because "アイドルグループ" is among the frequent nouns, the unit judges it to be a general noun and extracts "アイドルグループ" from the fourth definition sentence.
The meaning-characterizing word extraction unit 11 outputs the nouns extracted by the above processing to the similarity calculation unit 13 as meaning-characterizing words.
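The front-to-back morpheme removal can be sketched as below. The function name and the tiny frequent-noun set are hypothetical. The source does not say what happens when no suffix of the noun matches a frequent noun, so returning `None` in that case is an assumption.

```python
from typing import List, Optional, Set

def generalize_noun(morphemes: List[str], frequent_nouns: Set[str]) -> Optional[str]:
    """Drop leading morphemes until the remainder is a frequent noun."""
    for start in range(len(morphemes)):
        candidate = "".join(morphemes[start:])
        if candidate in frequent_nouns:
            return candidate
    return None  # no match at all; the behaviour here is an assumption

# Toy stand-in for the stored list of frequent nouns.
frequent = {"アイドルグループ", "グループ"}
print(generalize_noun(["男性", "アイドル", "グループ"], frequent))
```

Checking the longest candidate first means the most specific noun that still counts as general is kept, which matches the "男性アイドルグループ" example.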

FIG. 5 is a diagram showing an example of the meaning-characterizing words extracted by the meaning-characterizing word extraction unit 11 through the above processing. The figure shows the meaning-characterizing words that the meaning-characterizing word extraction unit 11 extracted from each definition sentence of the meaning description text shown in FIG. 2.

[Step S2: Related word extraction]
Next, the related word extraction unit 12 extracts nouns related to the ranking target word from the base text set (step S2). In this processing, the set of text data that is to be searched using the ranking target word as a keyword can be used as the base text set. For example, when television programs are to be searched, text data such as a program EPG is used as the base text set. In this embodiment, the program EPG shown in FIG. 3 is used as the base text set.

The related word extraction unit 12 extracts the sentences describing program content (for example, the sentences in Detail) from the program EPG stored in the storage unit 10, performs morphological analysis on them, and extracts nouns. In doing so, the related word extraction unit 12 extracts only general nouns from each clause. As in step S1, whether a noun is general is judged by whether it matches a frequent noun stored in the storage unit 10.

Next, the related word extraction unit 12 evaluates how strongly each extracted noun is related to the ranking target word. For this evaluation, an established measure such as mutual information can be used. The mutual information MI(A, B) of word A and word B is defined by the following equation (1), where word A is the ranking target word and word B is a related word. A related word is a noun that co-occurs with the ranking target word in the sentences describing program content.
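The equation image is not reproduced in this text. Given the definitions of P(A, B), P(A), and P(B) used for equation (1), it presumably takes the standard pointwise form shown here; the base of the logarithm is an assumption.

```latex
\mathrm{MI}(A, B) = \log \frac{P(A, B)}{P(A)\,P(B)}
```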

In equation (1), P(A, B) is the probability that word A and word B both appear in the program content (Detail) description of the same program, P(A) is the probability that word A appears in the program content descriptions across all programs, and P(B) is the probability that word B appears in the program content descriptions across all programs. The larger the value of the mutual information MI(A, B), the more closely word A and word B are related. Using equation (1), the related word extraction unit 12 calculates the mutual information of each related word (word B) with respect to the ranking target word (word A).
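The computation described above can be sketched in Python, estimating each probability as a fraction of programs. The sketch assumes the standard pointwise form MI(A, B) = log(P(A, B) / (P(A) P(B))) with a base-2 logarithm, since the equation itself is not reproduced in this text. The function name and the toy data are hypothetical, and each program's Detail text is assumed to have been reduced to a set of general nouns already.

```python
import math
from typing import Dict, List, Set

def mutual_information(target: str, docs: List[Set[str]]) -> Dict[str, float]:
    """MI of the target word with every noun that co-occurs with it.

    Each element of `docs` is the set of nouns in one program's Detail text.
    """
    n = len(docs)
    with_target = [d for d in docs if target in d]
    if n == 0 or not with_target:
        return {}
    p_a = len(with_target) / n
    candidates = set().union(*with_target) - {target}
    scores = {}
    for b in candidates:
        p_b = sum(1 for d in docs if b in d) / n
        p_ab = sum(1 for d in with_target if b in d) / n  # > 0 by construction
        scores[b] = math.log2(p_ab / (p_a * p_b))  # log base 2 is an assumption
    return scores

# Toy stand-in for four programs' Detail nouns.
docs = [{"雷", "気象", "予報"}, {"雷", "気象"}, {"料理", "予報"}, {"料理"}]
scores = mutual_information("雷", docs)
for word, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(word, value)
```

Only nouns that actually co-occur with the target word receive a score, which matches the definition of a related word as a co-occurring noun.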

FIG. 6 is a diagram showing an example of the mutual information of each related word with respect to the ranking target word. The figure shows the mutual information of the related nouns that co-occur with the ranking target word "雷" in the program content descriptions of the program EPG shown in FIG. 3. The related word extraction unit 12 outputs each related word, together with the mutual information calculated for it, to the similarity calculation unit 13.

[Step S3: Similarity calculation]
Next, the similarity calculation unit 13 obtains the similarity between the meaning-characterizing words extracted in step S1 and the related words extracted in step S2. In this embodiment, a measure such as distribution similarity is used as the similarity. With distribution similarity, words are clustered on the basis of their dependency relations in actual text, the distribution of each word's probability of belonging to each class is obtained from the clustering result, and the similarity between words is calculated from the distance between these probability distributions. Distribution similarity is described, for example, in the reference: Kazama, De Saeger, Torisawa, Murata, "Creating a large-scale list of similar words using probabilistic clustering of dependencies," Proceedings of the 15th Annual Meeting of the Association for Natural Language Processing, C1-6, pp. 84-87 (2009). Of the related words input from the related word extraction unit 12, the similarity calculation unit 13 takes the 100 with the highest mutual information and, using the descriptions in the base text set, calculates their distribution similarity to each of the meaning-characterizing words input from the meaning-characterizing word extraction unit 11.
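As a rough illustration of similarity derived from the distance between class-membership distributions, the sketch below uses one minus the Jensen-Shannon divergence. That particular distance is an assumption; the text only says the similarity comes from the distance between the distributions. The clustering that would produce the distributions is not shown, and both function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    def kl(x, m):
        # Terms with x_i == 0 contribute nothing; m_i > 0 whenever x_i > 0.
        return sum(xi * math.log2(xi / mi) for xi, mi in zip(x, m) if xi > 0)
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def distribution_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Similarity of two words from their class-membership distributions."""
    return 1.0 - js_divergence(p, q)

def top_related(mi_scores: Dict[str, float], k: int = 100) -> List[str]:
    """Keep the k related words with the highest mutual information."""
    return sorted(mi_scores, key=mi_scores.get, reverse=True)[:k]

print(distribution_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
print(top_related({"気象": 1.0, "予報": 0.0, "天気": 2.5}, k=2))
```

Any other quantitative word-to-word similarity could be swapped in for `distribution_similarity` without changing the rest of the pipeline.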

FIG. 7 shows an example of the calculated distribution similarity between each related word and each meaning-characterizing word. The figure shows the distribution similarity calculated between each meaning-characterizing word shown in FIG. 5 and those related words shown in FIG. 6 whose mutual information is among the top 100. The similarity calculation unit 13 outputs the calculated distribution similarity between each related word and each meaning-characterizing word to the ranking processing unit 14.

Although distribution similarity is used as the similarity in the above description, any other measure that expresses the similarity between words as a quantitative value may be used. For example, the distance between words in a thesaurus can be used as the similarity.

[Step S4: Ranking]
The ranking processing unit 14 ranks the meanings using the distribution similarity calculated in step S3. Letting Sem denote a meaning of the ranking target word, the ranking processing unit 14 calculates Weight(Sem), the weight of each meaning Sem, by the following equation (2).
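Equation (2) is not reproduced in this text. One form consistent with the symbol definitions given for it, offered only as an assumption, averages over the D(noun(Sem)) extracted nouns the similarity summed over the related words t. Here T, the set of related words considered, is notation introduced for this sketch, and the original equation may differ, for example by weighting terms.

```latex
\mathrm{Weight}(Sem) = \frac{1}{D(noun(Sem))} \sum_{i=1}^{D(noun(Sem))} \sum_{t \in T} \mathrm{Dsim}(t, es_i)
```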

In equation (2), noun(Sem) denotes the nouns extracted in step S1 as the meaning-characterizing words of the meaning Sem, and D(noun(Sem)) denotes the number of nouns extracted as meaning-characterizing words from the meaning Sem. Dsim(t, es_i) denotes the distribution similarity between the word t and the word es_i, where t is a related word extracted in step S2 and es_i is the i-th noun noun(Sem) extracted as a meaning-characterizing word from the meaning Sem (i is an integer from 1 to D(noun(Sem))). For example, as shown in FIG. 5, when the meaning Sem is "One of the natural phenomena / weather. Lightning.", noun(Sem) is "natural phenomenon" and "weather", D(noun(Sem)) is 2, the word es_1 is "natural phenomenon", and the word es_2 is "weather".
The ranking processing unit 14 generates data indicating the result of ranking the meanings in descending order of the calculated Weight(Sem).
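The weighting and ranking can be sketched as below. Equation (2) itself is not reproduced in this text, so the form used here is an assumption built only from the symbol definitions: the sum of Dsim(t, es_i) over all related words t and all nouns es_i of a meaning, divided by D(noun(Sem)). The function names, the second meaning, the related words, and the similarity values are hypothetical.

```python
def sense_weight(related_words, sense_nouns, dsim):
    """Assumed form of Weight(Sem): summed similarity, normalized by noun count."""
    if not sense_nouns:
        return 0.0
    total = sum(dsim(t, es) for t in related_words for es in sense_nouns)
    return total / len(sense_nouns)  # len(sense_nouns) plays the role of D(noun(Sem))


def rank_senses(related_words, nouns_by_sense, dsim):
    """Return (meaning, weight) pairs in descending order of weight."""
    weights = {
        sense: sense_weight(related_words, nouns, dsim)
        for sense, nouns in nouns_by_sense.items()
    }
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Hypothetical similarity values standing in for the step S3 output.
toy_dsim_table = {
    ("rain", "natural phenomenon"): 0.6,
    ("rain", "weather"): 0.8,
    ("shrine", "natural phenomenon"): 0.1,
    ("shrine", "weather"): 0.1,
    ("rain", "god"): 0.1,
    ("shrine", "god"): 0.5,
}


def toy_dsim(t, es):
    return toy_dsim_table[(t, es)]


toy_nouns_by_sense = {
    "natural phenomenon / weather": ["natural phenomenon", "weather"],
    "deity (hypothetical second meaning)": ["god"],
}
toy_ranking = rank_senses(["rain", "shrine"], toy_nouns_by_sense, toy_dsim)
```

Dividing by the noun count keeps a meaning with many characterizing nouns from outranking others simply because it contributes more terms to the sum; whether the original equation normalizes in exactly this way cannot be confirmed from the text.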

FIG. 8 shows the ranking result for the meanings of the ranking target word "lightning". As the ranking result, the ranking processing unit 14 generates data that associates the rank of each meaning Sem, the nouns noun(Sem) extracted from that meaning as meaning-characterizing words, and the calculated weight Weight(Sem) of that meaning. The ranking processing unit 14 writes the generated ranking result data to the storage unit 10, or outputs it to a display device, another computer device, or the like.

When a user searches for programs of interest on an Internet website that distributes programs on demand, the program EPG, for example, is used as the search target. In that case, as in the embodiment described above, the ranking of meanings is created using the program EPG as the base text set. The meaning in which a keyword entered by the user for program search was most likely used can then be determined from the ranking result, and using that meaning in the search of the program EPG makes it possible to improve the accuracy of the program search.
If, for example, a set of news texts is used as the base text set, the meaning whose meaning-characterizing words are "natural phenomenon" and "weather" is expected to rank near the top.
In this way, by changing the base text set from which related words are extracted, a ranking of meanings that depends on the search target can be obtained.

As described above, the semantic analysis device 1 of the present embodiment uses a large text set to estimate, without manual intervention, in which meaning an ambiguous word having a plurality of meanings is likely to be used. Furthermore, by using as the base text set either the document set to be searched or a document set of the same or a similar category, the semantic analysis device 1 of the present embodiment can obtain, for each search target, a ranking of how likely each meaning is to be used.

The semantic analysis device 1 described above contains a computer system. The operations of the semantic analysis device 1 are stored in the form of a program on a computer-readable recording medium, and the above processing is performed when the computer system reads and executes this program. The computer system referred to here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The "computer system" also includes a homepage providing environment (or display environment) when a WWW system is used.
The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" further includes a medium that holds a program dynamically for a short time, such as a communication line used when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds a program for a certain period of time, such as volatile memory inside a computer system serving as a server or a client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1 Semantic analysis device
10 Storage unit
11 Meaning-characterizing word extraction unit
12 Related word extraction unit
13 Similarity calculation unit
14 Ranking processing unit

Claims

A semantic analysis device comprising:
a meaning-characterizing word extraction unit that extracts, from meaning-description text data describing a plurality of meanings of a word having the plurality of meanings, a meaning-characterizing word that characterizes each of the meanings;
a related word extraction unit that extracts a related word of the word from a set of text data based on a co-occurrence relationship with the word;
a similarity calculation unit that calculates a similarity between the meaning-characterizing word extracted by the meaning-characterizing word extraction unit and the related word extracted by the related word extraction unit; and
a ranking processing unit that determines, based on the similarity calculated by the similarity calculation unit, a ranking of how likely the meaning corresponding to each meaning-characterizing word is to be used.

The semantic analysis device according to claim 1, wherein
the meaning-characterizing word extraction unit extracts, as the meaning-characterizing word, a noun included in the final clause of the definition sentence of each meaning described in the meaning-description text data.

The semantic analysis device according to claim 2, wherein,
when the noun included in the final clause of the definition sentence is a specific word indicating one of a plurality, the meaning-characterizing word extraction unit extracts, as the meaning-characterizing word, a noun included in a clause that modifies the final clause.

The semantic analysis device according to any one of claims 1 to 3, wherein
the set of text data is a set of text data to be searched based on the word.

A program for causing a computer used as a semantic analysis device to function as:
a meaning-characterizing word extraction unit that extracts, from meaning-description text data describing a plurality of meanings of a word having the plurality of meanings, a meaning-characterizing word that characterizes each of the meanings;
a related word extraction unit that extracts a related word of the word from a set of text data based on a co-occurrence relationship with the word;
a similarity calculation unit that calculates a similarity between the meaning-characterizing word extracted by the meaning-characterizing word extraction unit and the related word extracted by the related word extraction unit; and
a ranking processing unit that determines, based on the similarity calculated by the similarity calculation unit, a ranking of how likely the meaning corresponding to each meaning-characterizing word is to be used.