JP6230190B2

JP6230190B2 - Important word extraction device and program

Info

Publication number: JP6230190B2
Application number: JP2014002745A
Authority: JP
Inventors: 宮崎 太郎; 山田 一郎; 望月 菊佳; 加藤 直人; 田中 英輝
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2014-01-09
Filing date: 2014-01-09
Publication date: 2017-11-15
Anticipated expiration: 2034-01-09
Also published as: JP2015132899A

Description

The present invention relates to an important word extraction device and a program.

Extracting the important words in a text is a significant task. For example, important words taken from a long text can be used to estimate the topic of the text or to obtain cues for automatically summarizing it.

Methods that have long been widely used to extract important words from a text rely on per-word weights obtained by TF-IDF or Okapi BM25. These methods give a high weight to "words that appear often in the text from which important words are to be extracted and seldom in other texts". They are widely used because the computation is simple and they achieve reasonably high performance.

Another technique for extracting important words counts the words that appeared within a certain period, calculates from the resulting frequencies a generality value expressing "how generally the word was used", and treats words that have low generality and appear often in the target text as important words (see, for example, Patent Document 1). Yet another technique uses words that appear in common across many utterances as cues and treats words used frequently at times close to those cue words as important words (see, for example, Patent Document 2).

Patent Document 1: JP 2011-70291 A
Patent Document 2: JP 2011-248409 A

TF-IDF and Okapi BM25 set word weights based only on the number of occurrences in texts and use neither the context nor the meaning of the text as a whole. They therefore cannot extract important words that fit the context. The technique of Patent Document 1 likewise extracts important words using only occurrence frequency and does not consider the meaning of the text or of the words. The technique of Patent Document 2 does not rely solely on occurrence counts, but it still does not use the meaning of the text or of the words.

The present invention has been made in view of these circumstances and provides an important word extraction device and a program that can extract, from a text, important words that fit the context.

One aspect of the present invention is an important word extraction device comprising: a word extraction unit that extracts words of a predetermined part of speech from text data; a word pair creation unit that creates word pairs from the words extracted by the word extraction unit; a similarity calculation unit that calculates, for each word pair created by the word pair creation unit, the similarity between the words constituting the word pair; a similarity totaling unit that calculates, for each word extracted by the word extraction unit, a score based on the similarities calculated by the similarity calculation unit for the word pairs containing that word; and an important word selection unit that selects important words from among the words extracted by the word extraction unit based on the score of each word calculated by the similarity totaling unit.
According to this aspect, the important word extraction device extracts words of a predetermined part of speech from text data and calculates the similarity between the words of each word pair created from the extracted words. The device calculates the score of each word based on the similarities of the word pairs containing that word and selects important words according to the calculated scores.
As a result, the device extracts as important words those words that are highly related to the other words used throughout the text, so it can extract important words and topic words using the context and the meaning of the text.

One aspect of the present invention is the important word extraction device described above, wherein the word extraction unit extracts words whose part of speech is noun from the text data.
According to this aspect, the important word extraction device extracts nouns from the text data and selects, from the extracted nouns, those that are highly related to the other nouns in the text.
As a result, the device can extract as important words those words that are easy to understand as topic words.

One aspect of the present invention is the important word extraction device described above, wherein: the word extraction unit extracts words whose part of speech is noun from the text data and combines extracted words that are adjacent in the text data into one compound noun; the similarity totaling unit treats the words and compound nouns extracted by the word extraction unit as score calculation targets and calculates the score of each target word or compound noun based on the similarities calculated by the similarity calculation unit for each word pair consisting of (a) the target word, or any word constituting the target compound noun, and (b) another target word, or any word constituting another target compound noun; and the important word selection unit selects important words from among the words and compound nouns based on the scores calculated by the similarity totaling unit.
According to this aspect, the important word extraction device extracts nouns from the text and also extracts compound nouns made up of consecutive nouns. The device treats the extracted words and compound nouns as score calculation targets, calculates the score of each target based on the similarities of the word pairs consisting of the target word (or any word constituting the target compound noun) and another target word (or any word constituting another target compound noun), and selects important words according to the calculated scores.
As a result, the device can also extract compound nouns as important words.

One aspect of the present invention is the important word extraction device described above, wherein: the word extraction unit extracts words of a predetermined part of speech from the text data and creates a word group that contains each extracted word as many times as it appears in the text data; the word pair creation unit creates word pairs using the words contained in the word group created by the word extraction unit; and the similarity totaling unit treats each word extracted by the word extraction unit as a score calculation target and calculates the score of the target word as the average of the similarities calculated by the similarity calculation unit for the word pairs containing that word.
According to this aspect, the important word extraction device extracts words of a predetermined part of speech from the text data, creates a word group containing each extracted word as many times as it appears in the text data, and calculates the similarity of the word pairs created from the words in this word group. The device calculates the score of each word as the average of the similarities calculated for the word pairs containing that word and selects important words according to the calculated scores.
As a result, the device more readily judges words that appear many times in the text to be important words.

One aspect of the present invention is a program for causing a computer to function as an important word extraction device comprising: word extraction means for extracting words of a predetermined part of speech from text data; word pair creation means for creating word pairs from the words extracted by the word extraction means; similarity calculation means for calculating, for each word pair created by the word pair creation means, the similarity between the words constituting the word pair; similarity totaling means for treating each word extracted by the word extraction means as a score calculation target and calculating the score of the target word based on the similarities calculated by the similarity calculation means for the word pairs containing that word; and important word selection means for selecting important words from among the words extracted by the word extraction means based on the score of each word calculated by the similarity totaling means.

According to the present invention, important words that fit the context can be extracted from a text.

FIG. 1 is a block diagram showing the configuration of the important word extraction device according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the important word extraction process of the important word extraction device according to the first embodiment.
FIG. 3 is a diagram showing an example of word pair similarities calculated by the important word extraction device according to the first embodiment.
FIG. 4 is a block diagram showing the configuration of the important word extraction device according to the second embodiment.
FIG. 5 is a flowchart showing the important word extraction process of the important word extraction device according to the second embodiment.
FIG. 6 is a diagram showing an example of word pair similarities calculated by the important word extraction device according to the second embodiment.
FIG. 7 is a diagram showing simulation results for the important word extraction device according to the first embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings.
The important words in a text are the words that represent "the content of the text as a whole". The important word extraction device of the present embodiment therefore selects, from the words appearing in the text, those close to its semantic center. To find such words, the device extracts from the text all words of a predetermined part of speech that can become important words and calculates the similarity between the two words of each pair formed from the extracted words. The device then calculates the average similarity for each word and, based on the magnitude of that average, extracts the words that are closest to the content of the whole text and represent its topic.

[First Embodiment]
FIG. 1 is a block diagram showing the configuration of the important word extraction device 1 according to the first embodiment of the present invention; only the functional blocks relevant to this embodiment are shown. The important word extraction device 1 is implemented by a computer device and, as shown in the figure, comprises a similarity database 10, a sentence input unit 11, a word extraction unit 12, a word pair creation unit 13, a similarity calculation unit 14, a similarity totaling unit 15, a ranking unit 16, an important word selection unit 17, and an output unit 18.

The similarity database 10 stores information indicating the similarity between words. The sentence input unit 11 receives the text data from which important words are to be extracted. The word extraction unit 12 extracts words of a predetermined part of speech from the text data. The word pair creation unit 13 creates word pairs for all combinations of the words extracted by the word extraction unit 12. The similarity calculation unit 14 calculates, for each word pair created by the word pair creation unit 13, the similarity between the two words constituting the pair. The similarity is a value that quantitatively expresses how similar two words are. The similarity totaling unit 15 treats each word extracted by the word extraction unit 12 as a target for calculating a score that quantitatively expresses its relatedness to the other words in the whole text, and calculates the score of each target word based on the similarities calculated by the similarity calculation unit 14 for the word pairs containing that word. The ranking unit 16 orders the words extracted by the word extraction unit 12 by the scores calculated by the similarity totaling unit 15. The important word selection unit 17 selects important words from the ordered words according to a predetermined rule. The output unit 18 outputs the important words selected by the important word selection unit 17.

FIG. 2 is a flowchart showing the important word extraction process of the important word extraction device 1.
The sentence input unit 11 receives the text data from which important words are to be extracted (step S11). The text represented by the text data may consist of a single sentence or of multiple sentences, and its length is arbitrary. For example, a text of a few hundred characters such as a program synopsis, or a longer news script, can be used.

The word extraction unit 12 performs morphological analysis on the text data input in step S11 using an existing morphological analyzer such as MeCab or ChaSen. Based on the analysis result, the word extraction unit 12 extracts the words whose part of speech is noun from the text (step S12). Nouns are classified into general nouns (common nouns and proper nouns), suffixes, numerals, and so on; the word extraction unit 12 may extract only general nouns, general nouns together with nouns of predetermined classes, or general nouns excluding proper nouns. When the same noun appears more than once in the text data, the word extraction unit 12 extracts it once per occurrence and creates a word group consisting of the extracted words. If the text data is "あすの天気です。あすは全国的に良い天気でしょう。" ("Here is tomorrow's weather. Tomorrow the weather will be fine nationwide."), the word extraction unit 12 creates a word group consisting of the extracted nouns "tomorrow" (あす), "weather" (天気), "tomorrow", "nationwide" (全国), and "weather".
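As a minimal sketch of step S12, the following assumes the morphological analyzer has already produced (surface form, part-of-speech) tuples. The tag names ("noun", "particle", and so on) and the function name `extract_noun_group` are hypothetical and do not correspond to the actual output format of MeCab or ChaSen.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag); tags are hypothetical


def extract_noun_group(tokens: List[Token], noun_tag: str = "noun") -> List[str]:
    """Keep every noun occurrence in document order, duplicates included."""
    return [surface for surface, pos in tokens if pos == noun_tag]


# Hand-written analysis of the example text (assumed, not real analyzer output).
tokens = [
    ("あす", "noun"), ("の", "particle"), ("天気", "noun"), ("です", "auxiliary"),
    ("。", "symbol"),
    ("あす", "noun"), ("は", "particle"), ("全国", "noun"), ("的", "suffix"),
    ("に", "particle"), ("良い", "adjective"), ("天気", "noun"),
    ("でしょ", "auxiliary"), ("う", "auxiliary"), ("。", "symbol"),
]

word_group = extract_noun_group(tokens)
print(word_group)
```

Keeping duplicates matters here because the later averaging step relies on a word appearing in the group once per occurrence.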

The word pair creation unit 13 creates word pairs for all combinations of the words in the word group created by the word extraction unit 12 (step S13). That is, when the word group consists of the words w_1, w_2, ..., w_n (n is an integer of 2 or more), the word pair creation unit 13 creates all word pairs (w_i, w_j), where i ≠ j and i and j are integers from 1 to n. Suppose, as in the example above, that the word extraction unit 12 has created a word group consisting of "tomorrow", "weather", "tomorrow", "nationwide", and "weather". The word pair creation unit 13 then creates the word pairs (tomorrow, weather), (tomorrow, tomorrow), (tomorrow, nationwide), (tomorrow, weather), (weather, tomorrow), (weather, tomorrow), (weather, nationwide), (weather, weather), (tomorrow, tomorrow), and so on.
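The pair creation in step S13 can be sketched as follows. This assumes that pairs are ordered and formed by position in the word group, which is consistent with the listing above (for example, both (tomorrow, weather) and (weather, tomorrow) appear); the function name `make_word_pairs` is illustrative only.

```python
from itertools import permutations
from typing import List, Tuple


def make_word_pairs(word_group: List[str]) -> List[Tuple[str, str]]:
    """Return every ordered pair (w_i, w_j) with i != j, pairing by position.

    Pairing by index rather than by value means a repeated word can be
    paired with its own other occurrence, e.g. (tomorrow, tomorrow).
    """
    indices = range(len(word_group))
    return [(word_group[i], word_group[j]) for i, j in permutations(indices, 2)]


word_group = ["tomorrow", "weather", "tomorrow", "nationwide", "weather"]
pairs = make_word_pairs(word_group)
print(len(pairs))   # n * (n - 1) ordered pairs
print(pairs[:8])
```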

The similarity calculation unit 14 calculates the similarity between the two words constituting each word pair created by the word pair creation unit 13 (step S14). Here, the similarity calculation unit 14 reads the similarity between the two words of each pair from the similarity database 10.

In the present embodiment, context similarity based on the Jensen-Shannon divergence is used as the similarity between two words. Context similarity is a way of calculating the similarity between words based on the idea that "words that tend to appear in similar texts are similar". To obtain the context similarity between word A and word B, a probability distribution over the contexts in which each word appears is estimated in advance from training data, and the degree to which the distributions of word A and word B differ, computed as the Jensen-Shannon divergence, is taken as the similarity between the two words. This context similarity takes values from 0 to 1, and a smaller value indicates more similar words. Details of context similarity are described in, for example, Jun'ichi Kazama, Stijn De Saeger, Kentaro Torisawa, and Masaki Murata, "Creation of a large-scale similarity list using probabilistic clustering of dependency relations", Proceedings of the 15th Annual Meeting of the Association for Natural Language Processing, 2009, pp. 84-87.
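For illustration, the Jensen-Shannon divergence between two discrete distributions can be computed as below. This sketch assumes base-2 logarithms, which is what bounds the value to the 0 to 1 range mentioned above, and assumes both context distributions are given as aligned probability lists. How those distributions are estimated from training data is not shown; the function names are illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: the mean KL divergence of p and q from their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Identical context distributions give 0; disjoint ones give 1.
print(js_divergence([0.5, 0.5], [0.5, 0.5]))
print(js_divergence([1.0, 0.0], [0.0, 1.0]))
```

Because the midpoint m is positive wherever p or q is positive, the logarithm is always defined, which is one practical reason to prefer this measure over the plain KL divergence.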

The similarity calculation unit 14 may use a website accessed via the Internet as the similarity database 10. One example of a usable website is: National Institute of Information and Communications Technology (NICT), "ALAGIN Language Resource and Speech Resource Site", Advanced Language Information Forum, [online], Internet <URL: https://alaginrc.nict.go.jp/resources/nictmastar/li-resource-info/li-resource-outline.html>.

For the text data above, the similarity calculation unit 14 obtains, for example, similarity 0.804 for (tomorrow, weather), similarity 0 for (tomorrow, tomorrow), similarity 0.965 for (tomorrow, nationwide), similarity 0.804 for (tomorrow, weather), and so on.

The similarity totaling unit 15 totals the score of each word based on the similarity of each word pair calculated by the similarity calculation unit 14 (step S15). Specifically, the similarity totaling unit 15 calculates the score of word w_i (i is an integer from 1 to n) as the average of the similarities calculated by the similarity calculation unit 14 for the word pairs (w_i, w_j) consisting of that word and each other word (j ≠ i, j is an integer from 1 to n). When the same word appears more than once in the text, the words w_1 to w_n include identical words. In this case, the similarity totaling unit 15 removes all but one of each duplicated word from w_1 to w_n and calculates a score with each remaining word as w_i.

For the text data above, the similarity totaling unit 15 calculates the score of the word "tomorrow" as 0.643 by averaging the similarities of (tomorrow, weather), (tomorrow, tomorrow), (tomorrow, nationwide), and (tomorrow, weather). In the same way, the similarity totaling unit 15 calculates the score of the word "weather" as 0.646 and the score of the word "nationwide" as 0.979.
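The averaging in step S15 can be sketched as follows. The similarities 0.804 and 0.965 are taken from the text; the (weather, nationwide) value of 0.976 is a placeholder assumed for this sketch because the text does not state it, so only the score for "tomorrow" can be checked directly against the description. The lookup table stands in for the similarity database, and all names are hypothetical.

```python
from typing import Callable, Dict, List

# Symmetric similarity table standing in for the similarity database.
SIMILARITY = {
    frozenset(["tomorrow", "weather"]): 0.804,     # from the text
    frozenset(["tomorrow", "nationwide"]): 0.965,  # from the text
    frozenset(["weather", "nationwide"]): 0.976,   # assumed placeholder
}


def lookup_similarity(a: str, b: str) -> float:
    """Jensen-Shannon style similarity: 0 for identical words, smaller means closer."""
    return 0.0 if a == b else SIMILARITY[frozenset([a, b])]


def average_scores(word_group: List[str],
                   similarity: Callable[[str, str], float]) -> Dict[str, float]:
    """Score each distinct word by its mean similarity to every other occurrence."""
    scores: Dict[str, float] = {}
    for i, word in enumerate(word_group):
        if word in scores:  # duplicates are scored once
            continue
        others = [similarity(word, word_group[j])
                  for j in range(len(word_group)) if j != i]
        scores[word] = sum(others) / len(others)
    return scores


word_group = ["tomorrow", "weather", "tomorrow", "nationwide", "weather"]
scores = average_scores(word_group, lookup_similarity)
print(round(scores["tomorrow"], 3))
```

For "tomorrow" the four averaged values are 0.804, 0, 0.965, and 0.804, which gives (0.804 + 0 + 0.965 + 0.804) / 4 ≈ 0.643.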

The ranking unit 16 orders the words extracted by the word extraction unit 12 by the scores calculated by the similarity totaling unit 15 (step S16). Because the present embodiment uses context similarity based on the Jensen-Shannon divergence as the similarity, a value closer to 0 indicates higher similarity. Alternatively, the value obtained by subtracting the Jensen-Shannon context similarity from 1 may be used as the similarity, in which case a value closer to 1 indicates higher similarity.

The important word selection unit 17 selects important words from the words ordered by the ranking unit 16 according to a predetermined rule (step S17). The important word selection unit 17 may select the words at or above a predetermined rank, the words whose score is better than a predetermined value, or a predetermined proportion of the words extracted by the word extraction unit 12 in descending order of rank. For example, the important word selection unit 17 selects "the words with the five best scores" or "the words whose scores lie between the best score and 1.2 times the best score".
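The selection rules of step S17 might be sketched as below, assuming the input is a list of (word, score) tuples already sorted best-first with lower scores being better, as with the Jensen-Shannon based similarity. The function names, the rounding up in the proportion rule, and the sample scores are all hypothetical.

```python
import math
from typing import List, Tuple

Ranked = List[Tuple[str, float]]  # sorted best-first; lower score is better


def select_top_k(ranked: Ranked, k: int) -> Ranked:
    """Rule 1: keep the words at or above rank k."""
    return ranked[:k]


def select_within_ratio(ranked: Ranked, ratio: float = 1.2) -> Ranked:
    """Rule 2: keep words scoring between the best score and ratio times the best score."""
    if not ranked:
        return []
    limit = ranked[0][1] * ratio
    return [(word, score) for word, score in ranked if score <= limit]


def select_top_fraction(ranked: Ranked, fraction: float) -> Ranked:
    """Rule 3: keep a fixed proportion of the words (rounding up is an assumption)."""
    return ranked[:math.ceil(len(ranked) * fraction)]


ranked = [("a", 0.50), ("b", 0.58), ("c", 0.61), ("d", 0.90)]  # made-up scores
print(select_top_k(ranked, 3))
print(select_within_ratio(ranked))
print(select_top_fraction(ranked, 0.5))
```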

The output unit 18 outputs the important words selected by the important word selection unit 17 (step S18). For example, the output unit 18 displays the important words, best score first, on a display provided in the important word extraction device 1 or on the display of a computer device connected to the important word extraction device 1 via a network. Alternatively, the output unit 18 may output the text data or its identification information, together with the important words extracted from the text data and their scores, to a storage device inside or outside the important word extraction device 1 and have them stored there.

A specific processing example of the important word extraction device 1 follows.
The sentence input unit 11 receives the text data "山形市内の保育園で園児たちが臼と杵を使った昔ながらの餅つきを体験しました" ("At a nursery school in Yamagata City, the children experienced traditional mochi pounding using a mortar and pestle"). The important word extraction device 1 of the present embodiment uses the similarity between the nouns appearing in a text to extract its important words. The word extraction unit 12 therefore extracts the nouns "Yamagata" (山形), "city" (市内), "nursery school" (保育園), "nursery children" (園児), "mortar" (臼), "pestle" (杵), "mochi pounding" (餅つき), and "experience" (体験) from the text data. The similarity calculation unit 14 calculates the similarity of each word pair that the word pair creation unit 13 created from these words.

FIG. 3 shows the similarity of each word pair calculated by the similarity calculation unit 14. In the figure, for a word pair (w_i, w_j), w_i is the word on the vertical axis and w_j is the word on the horizontal axis. The similarity used is the context similarity expressed as the Jensen-Shannon divergence.

Using the similarity of each word pair calculated by the similarity calculation unit 14, the similarity totaling unit 15 obtains, for each word, the average of its similarities to the other words. The average similarity is 0.961 for "Yamagata", 0.957 for "city", 0.910 for "nursery school", 0.928 for "nursery children", 0.804 for "mortar", 0.827 for "pestle", 0.875 for "mochi pounding", and 0.932 for "experience". The ranking unit 16 orders the words in ascending order of the score given by the average similarity. The resulting order is "mortar", "pestle", "mochi pounding", "nursery school", "nursery children", "experience", "city", "Yamagata". The important word selection unit 17 selects the top three words, "mortar", "pestle", and "mochi pounding", as the important words that form the topic of the text. The output unit 18 outputs the important words "mortar", "pestle", and "mochi pounding" selected by the important word selection unit 17.
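The ranking and top-three selection in this example can be reproduced from the stated averages with a short sketch. The scores are the ones given above; the English glosses used as dictionary keys and the variable names are choices made for this illustration.

```python
# Average similarities from the worked example (lower means closer to the other words).
scores = {
    "Yamagata": 0.961,
    "city": 0.957,
    "nursery school": 0.910,
    "nursery children": 0.928,
    "mortar": 0.804,
    "pestle": 0.827,
    "mochi pounding": 0.875,
    "experience": 0.932,
}

# Ascending sort puts the words nearest the semantic center first.
ranked = sorted(scores.items(), key=lambda item: item[1])
important_words = [word for word, _ in ranked[:3]]

print([word for word, _ in ranked])
print(important_words)
```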

When the same word appears two or more times in the text, the word extraction unit 12 extracts it once per occurrence. The vertical and horizontal axes of FIG. 3 therefore contain each word as many times as it appears. Because the Jensen-Shannon divergence between identical words is 0, the average similarity of such a word becomes smaller. As a result, a word that appears multiple times tends to rank higher.

In the embodiment above, the similarity calculation unit 14 uses context similarity, but the similarity between two words may be calculated by any other method. For example, a similarity calculation method based on statistical word co-occurrence can be used. In general, however, words with the same meaning, such as words used as paraphrases of each other, rarely co-occur within a text, so a co-occurrence-based method may fail to give them a high similarity. From this standpoint, context similarity is preferable.

［第２の実施形態］
上述した第１の実施形態では、各単語について２単語間の類似度に基づくスコアを算出しているため、１単語の単位でしか重要語を得ることはできない。そのため、「気象情報」のような複合名詞についてはスコアを算出することは困難である。そこで、本実施形態では、文章中の複合名詞についても重要語として抽出できるようにする。以下では、第２の実施形態を、第１の実施形態との差分を中心に説明する。 [Second Embodiment]
In the first embodiment described above, a score based on the similarity between two words is calculated for each word, so important words can be obtained only in units of single words. It is therefore difficult to calculate a score for a compound noun such as “weather information” (気象情報). This embodiment makes it possible to extract compound nouns in a text as important words as well. The second embodiment is described below, focusing on its differences from the first embodiment.

図４は、本発明の第２の実施形態による重要語抽出装置２の構成を示すブロック図であり、本実施形態と関係する機能ブロックのみを抽出して示してある。同図において、図１に示す第１の実施形態による重要語抽出装置１と同一の部分には同一の符号を付し、その説明を省略する。重要語抽出装置２は、コンピュータ装置により実現され、同図に示すように、類似度データベース１０、文入力部１１、単語抽出部２２、単語ペア作成部１３、類似度計算部１４、類似度集計部２５、順位付け部２６、重要語選択部２７、及び出力部１８を備えて構成される。 FIG. 4 is a block diagram showing the configuration of the keyword extraction device 2 according to the second embodiment of the present invention; only the functional blocks relevant to this embodiment are shown. In the figure, parts identical to those of the keyword extraction device 1 according to the first embodiment shown in FIG. 1 are given the same reference numerals, and their description is omitted. The keyword extraction device 2 is realized by a computer device and, as shown in the figure, comprises a similarity database 10, a sentence input unit 11, a word extraction unit 22, a word pair creation unit 13, a similarity calculation unit 14, a similarity totaling unit 25, a ranking unit 26, an important word selection unit 27, and an output unit 18.

単語抽出部２２は、文章データから名詞の単語を抽出する。さらに、単語抽出部２２は、文章中で連続する名詞から複合名詞（名詞句）を作成する。その際、単語抽出部２２は、複合名詞を構成する単語の情報も保持しておく。類似度集計部２５は、類似度計算部１４が計算した単語間の類似度に基づいて文章データから抽出された名詞及び複合名詞それぞれのスコアを算出する。順位付け部２６は、文章データから抽出された名詞及び複合名詞を、類似度集計部２５が算出したスコアの順に並べる。重要語選択部２７は、順位付け部２６が並べた名詞及び複合名詞から所定のルールに従って重要語を選択する。 The word extraction unit 22 extracts noun words from the sentence data. Further, the word extraction unit 22 creates a compound noun (noun phrase) from nouns that are continuous in the sentence. At that time, the word extraction unit 22 also holds information on the words constituting the compound noun. The similarity totaling unit 25 calculates the score of each noun and compound noun extracted from the sentence data based on the similarity between words calculated by the similarity calculation unit 14. The ranking unit 26 arranges the nouns and compound nouns extracted from the sentence data in the order of the scores calculated by the similarity totaling unit 25. The important word selection unit 27 selects an important word from the nouns and compound nouns arranged by the ranking unit 26 according to a predetermined rule.

図５は、重要語抽出装置２の重要語抽出処理を示すフローチャートである。
文入力部１１は、文章データの入力を受ける（ステップＳ２１）。単語抽出部２２は、入力された文章データを形態素解析して品詞が名詞の単語を抽出する（ステップＳ２２）。例えば、単語抽出部２２は、一般名詞のみ、一般名詞と所定の分類の名詞、あるいは、固有名詞を除く一般名詞を抽出する。文章データに同一の単語が複数回出現する場合、単語抽出部２２は、出現数に応じてその単語を重複して抽出する。 FIG. 5 is a flowchart showing the keyword extraction process of the keyword extraction device 2.
The sentence input unit 11 receives input of sentence data (step S21). The word extraction unit 22 performs morphological analysis on the input sentence data and extracts a word whose part of speech is a noun (step S22). For example, the word extraction unit 22 extracts only general nouns, general nouns and nouns of a predetermined classification, or general nouns excluding proper nouns. When the same word appears in the sentence data a plurality of times, the word extraction unit 22 extracts the word in duplicate according to the number of appearances.

次に、単語抽出部２２は、ステップＳ２２における形態素解析結果に基づいて文章データが示す文章中で連続する名詞から複合名詞を作成する（ステップＳ２３）。具体的には、単語抽出部２２は、複数の一般名詞が連続する複合名詞や、一般名詞と接尾語や数詞が連続する複合名詞を作成する。文章データに同一の複合名詞が複数回出現する場合、単語抽出部２２は、出現数に応じてその複合名詞を重複して作成する。単語抽出部２２は、ステップＳ２２で抽出した単語と、ステップＳ２３で作成した複合名詞を構成する単語のうちステップＳ２２で抽出されなかった単語とからなる単語群を作成する。さらに、単語抽出部２２は、ステップＳ２２において抽出した単語と、作成した複合名詞を併せて重要語候補とする。なお、単語抽出部２２は、複合名詞を構成する単語のうち、文章中で単独では使用されない単語については重要語候補から除外する。 Next, based on the morphological analysis result of step S22, the word extraction unit 22 creates compound nouns from nouns that are consecutive in the text represented by the sentence data (step S23). Specifically, the word extraction unit 22 creates compound nouns consisting of several consecutive general nouns, and compound nouns consisting of a general noun followed by a suffix or a numeral. When the same compound noun appears in the sentence data multiple times, the word extraction unit 22 creates that compound noun once for each appearance. The word extraction unit 22 then creates a word group consisting of the words extracted in step S22 and those words constituting the compound nouns created in step S23 that were not extracted in step S22. Further, the word extraction unit 22 takes the words extracted in step S22 together with the created compound nouns as important word candidates. Among the words constituting a compound noun, however, the word extraction unit 22 excludes from the important word candidates any word that is not used alone in the text.

例えば、文章データが「出汁の取り方を、料亭の料理人が伝授します。」を示す場合、ステップＳ２２において、単語抽出部２２は、一般名詞の単語「出汁」、「料亭」、「料理」、「伝授」を抽出する。また、ステップＳ２２において、単語抽出部２２は、連続する一般名詞「料理」と接尾語「人」とからなる複合名詞「料理人」を作成する。単語抽出部２２は、ステップＳ２２において抽出した単語「出汁」、「料亭」、「料理」、「伝授」と、ステップＳ２３において作成した複合名詞「料理人」を構成する単語のうちステップＳ２２で抽出されなかった単語「人」とからなる単語群を作成する。さらに、単語抽出部２２は、ステップＳ２２において抽出した単語と、作成した複合名詞を併せて重要語候補とする。ただし、ステップＳ２２において抽出した単語「出汁」、「料亭」、「料理」、「伝授」のうち、複合名詞「料理人」を構成する単語「料理」については文章中で単独で使用されていない。そこで、単語抽出部２２は、ステップＳ２２において抽出した単語から「料理」を除いた単語「出汁」、「料亭」、「伝授」と、複合名詞「料理人」を重要語候補とする。 For example, when the sentence data represents 「出汁の取り方を、料亭の料理人が伝授します。」 (“A chef from a Japanese restaurant teaches how to make dashi stock.”), the word extraction unit 22 extracts the general nouns “dashi” (出汁), “restaurant” (料亭), “cooking” (料理), and “teaching” (伝授) in step S22. In step S23, the word extraction unit 22 creates the compound noun “chef” (料理人) from the consecutive general noun “cooking” (料理) and suffix “person” (人). The word extraction unit 22 creates a word group consisting of the words “dashi”, “restaurant”, “cooking”, and “teaching” extracted in step S22 and the word “person”, which constitutes the compound noun “chef” created in step S23 but was not extracted in step S22. Further, the word extraction unit 22 takes the words extracted in step S22 together with the created compound noun as important word candidates. Among the extracted words “dashi”, “restaurant”, “cooking”, and “teaching”, however, the word “cooking”, which constitutes the compound noun “chef”, is not used alone in the text. The word extraction unit 22 therefore takes as important word candidates the words “dashi”, “restaurant”, and “teaching”, that is, the extracted words other than “cooking”, together with the compound noun “chef”.
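The construction of the word group and the important word candidates can be sketched as below. This is an illustrative sketch under stated assumptions: the tokens are tagged by hand (a real system would obtain them from a morphological analyzer), and the tag names, `CONTINUATION_TAGS`, and `extract_candidates` are hypothetical names introduced here.

```python
# Tags that may continue a compound noun after a leading noun (assumed labels).
CONTINUATION_TAGS = {"suffix", "numeral"}

def extract_candidates(tokens):
    """tokens: list of (surface, tag) pairs in text order.

    Returns (word_group, candidates):
      word_group - every word that takes part in word-pair creation
      candidates - (surface, component words) for each important word candidate
    """
    runs, current = [], []
    for surface, tag in tokens:
        # A run starts with a noun and continues over nouns, suffixes, numerals.
        if tag == "noun" or (tag in CONTINUATION_TAGS and current):
            current.append(surface)
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)

    word_group, candidates = [], []
    for run in runs:
        word_group.extend(run)
        # A run of length 1 is a stand-alone noun; a longer run is a compound
        # noun. Components of a compound are not added as separate candidates
        # unless they also occur alone elsewhere in the text.
        candidates.append(("".join(run), tuple(run)))
    return word_group, candidates

# Hand-tagged tokens for the example sentence (tags are assumptions).
tokens = [
    ("出汁", "noun"), ("の", "other"), ("取り方", "other"), ("を", "other"),
    ("料亭", "noun"), ("の", "other"), ("料理", "noun"), ("人", "suffix"),
    ("が", "other"), ("伝授", "noun"), ("し", "other"), ("ます", "other"),
]

word_group, candidates = extract_candidates(tokens)
print(word_group)
print([surface for surface, _ in candidates])
```

Keeping the component words alongside each candidate mirrors the statement that the word extraction unit retains information on the words constituting each compound noun.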

単語ペア作成部１３は、ステップＳ２３において単語抽出部２２が作成した単語群中の単語を用いて、全ての組み合わせの単語ペアを作成する（ステップＳ２４）。類似度計算部１４は、単語ペア作成部１３が作成した各単語ペアを構成する２つの単語間の類似度を計算する（ステップＳ２５）。 The word pair creation unit 13 creates word pairs of all combinations using the words in the word group created by the word extraction unit 22 in step S23 (step S24). The similarity calculation unit 14 calculates the similarity between two words constituting each word pair created by the word pair creation unit 13 (step S25).

類似度集計部２５は、類似度計算部１４が計算した各単語ペアの類似度に基づいて、各重要語候補のスコアを集計する（ステップＳ２６）。重要語候補を、ｘ_１，ｘ_２，…，ｘ_ｎ（ｎは１以上の整数）としたときに、類似度集計部２５は、各重要語候補ｘ_ｉ（ｉは１以上ｎ以下の整数）のスコアを以下のように算出する。なお、文章中に同一の重要語候補が複数回出現する場合、重要語候補ｘ_１〜重要語候補ｘ_ｎには同じ単語または複合名詞が含まれる。この場合、類似度集計部２５は、重要語候補ｘ_１〜重要語候補ｘ_ｎの中から重複する重要語候補については１つのみ残して削除し、削除の結果残った重要語候補ｘ_１〜重要語候補ｘ_ｎをそれぞれ重要語候補ｘ_ｉとしてスコアを算出すればよい。 Based on the similarity of each word pair calculated by the similarity calculation unit 14, the similarity totaling unit 25 totals the score of each important word candidate (step S26). Letting the important word candidates be x_1, x_2, ..., x_n (n is an integer of 1 or more), the similarity totaling unit 25 calculates the score of each important word candidate x_i (i is an integer from 1 to n) as follows. When the same important word candidate appears multiple times in the text, the important word candidates x_1 to x_n include the same word or compound noun more than once. In this case, the similarity totaling unit 25 may delete the duplicated candidates from x_1 to x_n so that only one of each remains, and calculate the score with each candidate remaining after the deletion as the important word candidate x_i.

類似度集計部２５は、重要語候補ｘ_ｉと他の重要語候補ｘ_ｊ（ｊ≠ｉ，ｊは１以上ｎ以下の整数）それぞれとの類似度の平均によりスコアを算出する。重要語候補ｘ_ｉまたは重要語候補ｘ_ｊのいずれかまたは両方が複合名詞である場合、類似度集計部２５は、重要語候補ｘ_ｉを構成する単語と重要語候補ｘ_ｊを構成する単語とからなる全ての組み合わせの単語ペアの類似度のうち、最もよい類似度を重要語候補ｘ_ｉと重要語候補ｘ_ｊの類似度とする。 The similarity totaling unit 25 calculates the score as the average of the similarities between the important word candidate x_i and each of the other important word candidates x_j (j ≠ i; j is an integer from 1 to n). When either or both of the important word candidates x_i and x_j are compound nouns, the similarity totaling unit 25 takes, as the similarity between x_i and x_j, the best similarity among the similarities of all word pairs formed by combining a word constituting x_i with a word constituting x_j.

例えば、類似度集計部２５は、重要語候補ｘ_ｉ「天気」と重要語候補ｘ_ｊ「大雨警報」の類似度を、単語ペア（天気，大雨）、（天気，警報）の類似度のうちよい方とする。また例えば、類似度集計部２５は、重要語候補ｘ_ｉ「気象情報」と重要語候補ｘ_ｊ「雪」の類似度を、単語ペア（気象，雪）、（情報，雪）の類似度のうちよい方とする。また例えば、類似度集計部２５は、重要語候補ｘ_ｉ「気象情報」と重要語候補ｘ_ｊ「大雨警報」の類似度を、単語ペア（気象，大雨）、（気象，警報）、（情報，大雨）、（情報，警報）の類似度のうち最もよい類似度とする。 For example, the similarity totaling unit 25 takes, as the similarity between the important word candidate x_i “weather” (天気) and the important word candidate x_j “heavy rain warning” (大雨警報), the better of the similarities of the word pairs (weather, heavy rain) and (weather, warning). Likewise, it takes, as the similarity between the important word candidate x_i “weather information” (気象情報) and the important word candidate x_j “snow” (雪), the better of the similarities of the word pairs (weather, snow) and (information, snow). Likewise, it takes, as the similarity between the important word candidate x_i “weather information” and the important word candidate x_j “heavy rain warning”, the best of the similarities of the word pairs (weather, heavy rain), (weather, warning), (information, heavy rain), and (information, warning).
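One way to write the rule above compactly is shown below. The notation is introduced here for illustration: $W(x)$ is the set of words constituting candidate $x$ (a single word for a non-compound candidate), and $s(u, v)$ is the word-pair similarity from the similarity calculation unit 14. Reading "best" as "smallest value" is an assumption, based on the ranking in ascending order of score; the formula also assumes $n \ge 2$.

```latex
\[
\mathrm{sim}(x_i, x_j) \;=\; \min_{u \in W(x_i),\; v \in W(x_j)} s(u, v),
\qquad
\mathrm{score}(x_i) \;=\; \frac{1}{n-1} \sum_{\substack{j = 1 \\ j \neq i}}^{n} \mathrm{sim}(x_i, x_j)
\]
```

For two single-word candidates the minimum is taken over one pair only, so the rule reduces to the word-pair similarity itself.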

あるいは、類似度集計部２５は、重要語候補ｘ_ｉが複合名詞である場合、重要語候補ｘ_ｉを構成する単語毎に、他の重要語候補ｘ_ｊそれぞれとの類似度の平均を求める。類似度集計部２５は、重要語候補ｘ_ｉを構成する各単語について求めた類似度の平均のうち、最もよい値を重要語候補ｘ_ｉのスコアとする。 Alternatively, when the important word candidate x_i is a compound noun, the similarity totaling unit 25 obtains, for each word constituting x_i, the average of its similarities with each of the other important word candidates x_j. The similarity totaling unit 25 then takes, as the score of x_i, the best value among the averages obtained for the words constituting x_i.

例えば、重要語候補ｘ_ｉが「気象情報」であるとする。類似度集計部２５は、重要語候補ｘ_ｉ「気象情報」を構成する単語「気象」と重要語候補ｘ_ｊそれぞれとの類似度の平均を算出する。重要語候補ｘ_ｊが１つの単語であれば、類似度集計部２５は、単語「気象」と重要語候補ｘ_ｊとから構成される単語ペアの類似度を、単語「気象」と重要語候補ｘ_ｊの類似度とする。重要語候補ｘ_ｊが複合名詞であれば、類似度集計部２５は、単語「気象」と、重要語候補ｘ_ｊを構成する各単語とから構成される単語ペアの類似度のうち、最も良い値を単語「気象」と重要語候補ｘ_ｊの類似度とする。同様に、類似度集計部２５は、重要語候補ｘ_ｉを構成する他の単語「情報」と重要語候補ｘ_ｊそれぞれとの類似度の平均を算出する。類似度集計部２５は、単語「気象」と単語「情報」のそれぞれについて算出した類似度の平均のうち良い方を、重要語候補ｘ_ｉ「気象情報」のスコアとする。 For example, suppose that the important word candidate x_i is “weather information” (気象情報). The similarity totaling unit 25 calculates the average of the similarities between the word “weather” (気象), which constitutes x_i, and each of the important word candidates x_j. If x_j is a single word, the similarity totaling unit 25 takes the similarity of the word pair consisting of “weather” and x_j as the similarity between “weather” and x_j. If x_j is a compound noun, it takes, as the similarity between “weather” and x_j, the best value among the similarities of the word pairs consisting of “weather” and each word constituting x_j. In the same way, the similarity totaling unit 25 calculates the average of the similarities between the other word constituting x_i, “information” (情報), and each of the important word candidates x_j. The similarity totaling unit 25 then takes the better of the averages calculated for “weather” and for “information” as the score of the important word candidate x_i “weather information”.

順位付け部２６は、類似度集計部２５が算出したスコアの順に、単語抽出部１２が抽出した重要語候補を並べる（ステップＳ２７）。重要語選択部２７は、予め決定しておいたルールに従って順位付け部２６が並べた重要語候補から重要語を選択する（ステップＳ２８）。出力部１８は、重要語選択部２７が選択した重要語を出力する（ステップＳ２９）。 The ranking unit 26 arranges the important word candidates extracted by the word extraction unit 12 in the order of the scores calculated by the similarity totaling unit 25 (step S27). The important word selection unit 27 selects an important word from the important word candidates arranged by the ranking unit 26 in accordance with a predetermined rule (step S28). The output unit 18 outputs the important word selected by the important word selection unit 27 (step S29).

重要語抽出装置２の具体的な処理例を示す。
文入力部１１が、文章データ「次はあすの気象情報です。」の入力を受ける。単語抽出部２２は、単語「次」、「あす」、「気象」、「情報」を抽出する。単語抽出部２２は、これらの単語の中から文章中で連続する「気象」と「情報」から１つの複合名詞「気象情報」を作成する。単語抽出部２２は、文章データから抽出した単語「次」、「あす」、「気象」、「情報」からなる単語群を作成する。さらに、単語抽出部２２は、単語群の中から、複合名詞「気象情報」と、文章データから抽出した単語のうち、複合名詞「気象情報」を構成し、かつ、文章中で単独では使われていない単語「気象」及び「情報」を除いた単語「次」、「あす」とを重要語候補とする。類似度計算部１４は、単語群に含まれる単語を用いて単語ペア作成部１３が作成した各単語ペアの類似度を計算する。 A specific processing example of the keyword extraction device 2 will be shown.
The sentence input unit 11 receives the sentence data 「次はあすの気象情報です。」 (“Next is tomorrow's weather information.”). The word extraction unit 22 extracts the words “next” (次), “tomorrow” (あす), “weather” (気象), and “information” (情報). From these words, it creates one compound noun, “weather information” (気象情報), from “weather” and “information”, which are consecutive in the text. The word extraction unit 22 creates a word group consisting of the words “next”, “tomorrow”, “weather”, and “information” extracted from the sentence data. It then takes as important word candidates the compound noun “weather information” together with the words “next” and “tomorrow”, that is, the extracted words other than “weather” and “information”, which constitute the compound noun and are not used alone in the text. The similarity calculation unit 14 calculates the similarity of each word pair that the word pair creation unit 13 created from the words in the word group.

図６は、類似度計算部１４が計算した各単語ペアの類似度を示す。同図において、単語ペア作成部１３が単語群に含まれる単語「次」、「あす」、「気象」、「情報」を用いて作成した単語ペアそれぞれについて、類似度計算部１４が算出した文脈類似度を示している。なお、同図においては、同一の複合名詞を構成する単語ペアについては類似度を算出していない。 FIG. 6 shows the similarity of each word pair calculated by the similarity calculation unit 14. In the figure, the context calculated by the similarity calculation unit 14 for each word pair created by the word pair creation unit 13 using the words “next”, “tomorrow”, “weather”, and “information” included in the word group. The similarity is shown. In the figure, the degree of similarity is not calculated for word pairs constituting the same compound noun.
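The word pairs covered by FIG. 6 can be enumerated with a short sketch. It assumes, as the note on the figure suggests, that pairs whose two words belong to the same compound noun are skipped; the names `word_group`, `compounds`, and `same_compound` are introduced here for illustration, and the surface-based membership check is a simplification.

```python
from itertools import combinations

# Word group from the worked example: next, tomorrow, weather, information.
word_group = ["次", "あす", "気象", "情報"]

# Compound nouns and the words that constitute them.
compounds = {"気象情報": ("気象", "情報")}

def same_compound(a, b):
    """True if both words are components of one and the same compound noun."""
    return any(a in parts and b in parts for parts in compounds.values())

# All unordered combinations, minus pairs inside a single compound noun.
pairs = [
    (a, b)
    for a, b in combinations(word_group, 2)
    if not same_compound(a, b)
]

for pair in pairs:
    print(pair)
```

Of the six possible combinations of four words, five remain once the pair inside 「気象情報」 is left out.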

類似度集計部２５は、重要語候補「次」のスコアを、重要語候補「次」と重要語候補「あす」の類似度、及び、重要語候補「次」と重要語候補「気象情報」の類似度の平均により算出する。類似度集計部２５は、重要語候補「次」と重要語候補「あす」の類似度を、類似度計算部１４が算出した単語ペア（次，あす）の類似度「０．６７６」とする。類似度集計部２５は、重要語候補「次」と重要語候補「気象情報」の類似度を、類似度計算部１４が算出した単語ペア（次，気象）の類似度「０．９６５」と、単語ペア（次，情報）の類似度「０．８７５」のうち良い方とする。類似度集計部２５は、重要語候補「次」のスコアを、重要語候補「次」と重要語候補「あす」の類似度「０．６７６」と、重要語候補「次」と重要語候補「気象情報」の類似度「０．８７５」の平均から「０．７７６」と算出する。 The similarity totaling unit 25 calculates the score of the important word candidate “next” as the average of the similarity between “next” and “tomorrow” and the similarity between “next” and “weather information”. As the similarity between “next” and “tomorrow”, it uses the similarity “0.676” of the word pair (next, tomorrow) calculated by the similarity calculation unit 14. As the similarity between “next” and “weather information”, it uses the better of the similarity “0.965” of the word pair (next, weather) and the similarity “0.875” of the word pair (next, information) calculated by the similarity calculation unit 14. The similarity totaling unit 25 thus calculates the score of “next” as “0.776”, the average of the similarity “0.676” between “next” and “tomorrow” and the similarity “0.875” between “next” and “weather information”.

また、類似度集計部２５は、重要語候補「あす」のスコアを、重要語候補「あす」と重要語候補「次」の類似度、及び、重要語候補「あす」と重要語候補「気象情報」の類似度の平均により算出する。類似度集計部２５は、重要語候補「あす」と重要語候補「次」の類似度を、類似度計算部１４が算出した単語ペア（あす，次）の類似度「０．６７６」とする。類似度集計部２５は、重要語候補「あす」と重要語候補「気象情報」の類似度を、類似度計算部１４が算出した単語ペア（あす，気象）の類似度「０．９１８」と、単語ペア（あす，情報）の類似度「０．９９０」のうち良い方とする。類似度集計部２５は、重要語候補「あす」のスコアを、重要語候補「あす」と重要語候補「次」の類似度「０．６７６」と、重要語候補「あす」と重要語候補「気象情報」の類似度「０．９１８」の平均から「０．７９７」と算出する。 Similarly, the similarity totaling unit 25 calculates the score of the important word candidate “tomorrow” as the average of the similarity between “tomorrow” and “next” and the similarity between “tomorrow” and “weather information”. As the similarity between “tomorrow” and “next”, it uses the similarity “0.676” of the word pair (tomorrow, next) calculated by the similarity calculation unit 14. As the similarity between “tomorrow” and “weather information”, it uses the better of the similarity “0.918” of the word pair (tomorrow, weather) and the similarity “0.990” of the word pair (tomorrow, information). The similarity totaling unit 25 thus calculates the score of “tomorrow” as “0.797”, the average of the similarity “0.676” between “tomorrow” and “next” and the similarity “0.918” between “tomorrow” and “weather information”.

また、類似度集計部２５は、重要語候補「気象情報」のスコアを、重要語候補「気象情報」と重要語候補「次」の類似度、及び、重要語候補「気象情報」と重要語候補「あす」の類似度の平均により算出する。類似度集計部２５は、重要語候補「気象情報」と重要語候補「次」の類似度を、類似度計算部１４が算出した単語ペア（気象，次）の類似度「０．９６５」と、単語ペア（情報，次）の類似度「０．８７５」のうち良い方とする。類似度集計部２５は、重要語候補「気象情報」と重要語候補「あす」の類似度を、類似度計算部１４が算出した単語ペア（気象，あす）の類似度「０．９１８」と、単語ペア（情報，あす）の類似度「０．９９０」のうち良い方とする。類似度集計部２５は、重要語候補「気象情報」のスコアを、重要語候補「気象情報」と重要語候補「次」の類似度「０．８７５」と、重要語候補「気象情報」と重要語候補「あす」の類似度「０．９１８」の平均から「０．８９７」と算出する。 Similarly, the similarity totaling unit 25 calculates the score of the important word candidate “weather information” as the average of the similarity between “weather information” and “next” and the similarity between “weather information” and “tomorrow”. As the similarity between “weather information” and “next”, it uses the better of the similarity “0.965” of the word pair (weather, next) and the similarity “0.875” of the word pair (information, next) calculated by the similarity calculation unit 14. As the similarity between “weather information” and “tomorrow”, it uses the better of the similarity “0.918” of the word pair (weather, tomorrow) and the similarity “0.990” of the word pair (information, tomorrow). The similarity totaling unit 25 thus calculates the score of “weather information” as “0.897”, the average of the similarity “0.875” between “weather information” and “next” and the similarity “0.918” between “weather information” and “tomorrow”.

あるいは、類似度集計部２５は、重要語候補「気象情報」のスコアを、単語ペア（気象，次）の類似度及び（気象，あす）の類似度の平均と、単語ペア（情報，次）の類似度及び（情報，あす）の類似度の平均とのうち良い方としてもよい。類似度集計部２５は、重要語候補「気象情報」を構成する単語「気象」の類似度の平均を、単語ペア（気象，次）の類似度「０．９６５」、及び、単語ペア（気象，あす）の類似度「０．９１８」の平均から「０．９４２」と算出する。また、類似度集計部２５は、重要語候補「気象情報」を構成する単語「情報」の類似度の平均を、単語ペア（情報，次）の類似度「０．８７５」、及び、単語ペア（情報，あす）の類似度「０．９９０」の平均から「０．９３３」と算出する。類似度集計部２５は、重要語候補「気象情報」のスコアを、単語「気象」の類似度の平均と、単語「情報」の類似度の平均とのうち良い方の「０．９４２」とする。 Alternatively, the similarity totaling unit 25 may take, as the score of the important word candidate “weather information”, the better of (a) the average of the similarities of the word pairs (weather, next) and (weather, tomorrow) and (b) the average of the similarities of the word pairs (information, next) and (information, tomorrow). For the word “weather”, which constitutes the candidate “weather information”, the similarity totaling unit 25 calculates the average “0.942” from the similarity “0.965” of the word pair (weather, next) and the similarity “0.918” of the word pair (weather, tomorrow). For the word “information”, it calculates the average “0.933” from the similarity “0.875” of the word pair (information, next) and the similarity “0.990” of the word pair (information, tomorrow). The similarity totaling unit 25 sets the score of the important word candidate “weather information” to “0.942”, the better of the average for the word “weather” and the average for the word “information”.

順位付け部２６は、類似度集計部２５が算出したスコアに基づいて、重要語候補を「次」、「あす」、「気象情報」の順に並べる。重要語選択部２７は、順位付け部２６が並べた重要語候補から重要語を選択し、出力部１８は、重要語選択部２７が選択した重要語を出力する。
なお、上記においては、処理を説明するために短い文章のデータを入力したが、もう少し長い文章のデータを入力することで、抽出の精度は向上すると考えられる。 The ranking unit 26 arranges important word candidates in the order of “next”, “tomorrow”, and “weather information” based on the score calculated by the similarity totaling unit 25. The important word selection unit 27 selects an important word from the important word candidates arranged by the ranking unit 26, and the output unit 18 outputs the important word selected by the important word selection unit 27.
In the above description, short sentence data is input to explain the processing. However, it is considered that inputting a little longer sentence data improves the accuracy of extraction.
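The worked example above can be reproduced with a short sketch that uses the pairwise "better of" rule (not the alternative per-word average). It assumes that "better" means the smaller value, which matches the values chosen in the example (0.875 over 0.965, 0.918 over 0.990); the names `pair_sim`, `candidates`, `candidate_similarity`, and `score` are introduced here for illustration.

```python
from itertools import product

# Word-pair similarities quoted in the worked example (FIG. 6 values).
pair_sim = {
    frozenset(("次", "あす")): 0.676,    # (next, tomorrow)
    frozenset(("次", "気象")): 0.965,    # (next, weather)
    frozenset(("次", "情報")): 0.875,    # (next, information)
    frozenset(("あす", "気象")): 0.918,  # (tomorrow, weather)
    frozenset(("あす", "情報")): 0.990,  # (tomorrow, information)
}

# Important word candidates and the words that constitute each of them.
candidates = {
    "次": ("次",),
    "あす": ("あす",),
    "気象情報": ("気象", "情報"),
}

def candidate_similarity(a, b):
    """Best (assumed: smallest) similarity over all component word pairs."""
    return min(
        pair_sim[frozenset((u, v))]
        for u, v in product(candidates[a], candidates[b])
    )

def score(candidate):
    """Average similarity between one candidate and every other candidate."""
    others = [c for c in candidates if c != candidate]
    return sum(candidate_similarity(candidate, c) for c in others) / len(others)

scores = {c: score(c) for c in candidates}
ranking = sorted(scores, key=scores.get)  # ascending order of score

for c in ranking:
    print(c, f"{scores[c]:.4f}")
```

The computed scores agree with the quoted values of 0.776, 0.797, and 0.897 to within rounding, and the ascending order is 「次」, 「あす」, 「気象情報」.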

図７は、第１の実施形態の重要語抽出装置１による評価実験結果を示す図である。評価実験においては、１００番組それぞれの検索ワードと番組概要文とを示す文章データを評価データとして用いた。そして、各番組について３名の作業者が文章データからキーワードを５単語以内で抽出し、この３名の作業者それぞれが選んだキーワードの和集合を重要語の正解データとした。なお、検索ワードと同一の単語は評価の際に除外した。
同図では、重要語抽出装置１が抽出した上位ｎ位の重要語と、従来技術のｏｋａｐｉＢＭ２５を用いて抽出した上位ｎ位の重要語とが、正解データのキーワードに含まれる確率を示している。同図に示すように、特に上位で抽出される単語について、本実施形態の重要語抽出装置１により抽出された重要語が正解データに含まれる確率は従来技術よりも高く、良好な結果が得られた。 FIG. 7 is a diagram illustrating an evaluation experiment result by the keyword extraction device 1 according to the first embodiment. In the evaluation experiment, sentence data indicating a search word and a program summary sentence for each of 100 programs was used as evaluation data. For each program, three workers extracted keywords from the text data within five words, and the union of keywords selected by each of the three workers was used as correct data for important words. Note that the same words as the search words were excluded during the evaluation.
The figure shows the probability that the top-n important words extracted by the keyword extraction device 1, and the top-n important words extracted using the conventional okapi BM25, are included in the keywords of the correct data. As the figure shows, particularly for highly ranked words, the important words extracted by the keyword extraction device 1 of this embodiment are more likely to be included in the correct data than those extracted by the conventional technique, and good results were obtained.
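The evaluation protocol can be sketched as follows. This is one possible reading, not the experiment's actual code: it assumes the reported probability is the fraction of the top-n extracted words found in the correct data, and that search words are removed from both the correct data and the ranked output. All words and annotator sets below are invented for illustration.

```python
def build_gold(annotator_sets, search_words):
    """Union of the keywords chosen by each annotator, minus the search words."""
    gold = set().union(*annotator_sets)
    return gold - set(search_words)

def precision_at_n(ranked_words, gold, search_words, n):
    """Fraction of the top-n extracted words that appear in the gold set."""
    excluded = set(search_words)
    top = [w for w in ranked_words if w not in excluded][:n]
    if not top:
        return 0.0
    return sum(1 for w in top if w in gold) / len(top)

# Hypothetical data for a single program.
annotators = [{"mochi", "mortar"}, {"mortar", "pestle"}, {"children"}]
search_words = ["yamagata"]
ranked = ["mortar", "yamagata", "pestle", "city", "mochi"]

gold = build_gold(annotators, search_words)
for n in (1, 2, 3, 4):
    print(n, precision_at_n(ranked, gold, search_words, n))
```

Averaging this value over all programs for each n would give one curve per extraction method, which is the kind of comparison the figure reports.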

以上説明した実施形態によれば、重要語抽出装置は、文章全体で使用されている他の名詞との関連が高い名詞を重要語として抽出する。文章中の他の名詞との関連が高い名詞とは、文章全体の意味をよく表している意味的中心の単語である。換言すれば、文章中の他の名詞との関連が高い名詞は、文章の流れの中にある意味内容とつながりが高く、文脈にあっている単語である。よって、重要語抽出装置は、単純な単語の出現頻度の確率的な統計ではなく、文章中の文脈や意味を用いて重要語やトピックワードを抽出することができる。例えば、番組検索を行う従来の装置において、ユーザが入力したキーワード等により検索した結果得られた番組の情報を提示する際に、本実施形態の重要語抽出装置が番組概要から抽出した重要語を併せて提示することが考えられる。この重要語の提示により、検索の結果得られた番組がどのような内容であるかをユーザにわかりやすく伝えることができる。また、以上説明した実施形態によれば、重要語抽出装置は、ＴＦ−ＩＤＦによる重み付けを行う場合とは異なり、類似したドメインの文書を大量に集める必要もない。 According to the embodiment described above, the important word extraction device extracts nouns that are highly related to other nouns used in the entire sentence as important words. A noun highly related to other nouns in a sentence is a semantically-centric word that well represents the meaning of the entire sentence. In other words, a noun that is highly related to other nouns in a sentence is a word that is highly connected to the semantic content in the flow of the sentence and is in context. Therefore, the important word extraction device can extract the important words and the topic words using the context and meaning in the sentence, not the probabilistic statistics of the appearance frequency of simple words. For example, in a conventional apparatus that performs a program search, when presenting program information obtained as a result of a search by a keyword input by a user, the important word extraction apparatus of the present embodiment extracts the important words extracted from the program outline. It is possible to present it together. By presenting this important word, the contents of the program obtained as a result of the search can be easily communicated to the user. Also, according to the embodiment described above, the keyword extraction device does not need to collect a large amount of documents of similar domains, unlike the case of weighting by TF-IDF.

上述のように、本実施形態の重要語抽出装置は、文章中で使用される単語の意味を用いて重要語を抽出するため、単語の出現頻度を用いた従来技術よりも高性能に、文脈に合致した重要語を抽出することができる。また、従来使用されているＴＦ−ＩＤＦの場合は、似たようなスタイルの文章を集めて統計をとる必要があるが、本実施形態では、単語間の類似度を計算するための学習データがあればよく、検索のために「似たようなスタイルの文章を多く集める」という必要がない。 As described above, the important word extraction device of this embodiment extracts important words using the meanings of the words used in the text, and can therefore extract important words that match the context with higher performance than conventional techniques based on word frequency. In addition, the conventionally used TF-IDF requires collecting texts of a similar style and compiling statistics over them, whereas this embodiment needs only learning data for calculating the similarity between words; there is no need to “collect many texts of a similar style” for the search.

上述した重要語抽出装置１、２は、内部にコンピュータシステムを有している。そして、重要語抽出装置１、２の動作の過程は、プログラムの形式でコンピュータ読み取り可能な記録媒体に記憶されており、このプログラムをコンピュータシステムが読み出して実行することによって、上記処理が行われる。ここでいうコンピュータシステムとは、ＣＰＵ及び各種メモリやＯＳ、周辺機器等のハードウェアを含むものである。 The above-described important word extraction devices 1 and 2 have a computer system therein. The operation process of the keyword extraction devices 1 and 2 is stored in a computer-readable recording medium in the form of a program, and the above-described processing is performed by the computer system reading and executing this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

また、「コンピュータシステム」は、ＷＷＷシステムを利用している場合であれば、ホームページ提供環境（あるいは表示環境）も含むものとする。
また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ−ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の概念辞書記憶部のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、短時間の間、動的にプログラムを保持するもの、その場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含むものとする。また上記プログラムは、前述した機能の一部を実現するためのものであっても良く、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであっても良い。 Further, the “computer system” includes a homepage providing environment (or display environment) if a WWW system is used.
The “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a conceptual dictionary storage unit such as a hard disk built into a computer system. The “computer-readable recording medium” also includes a medium that holds a program dynamically for a short time, such as a communication line used when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds a program for a certain period of time, such as a volatile memory inside a computer system that serves as the server or client in that case. The program may realize only part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

１、２　重要語抽出装置
１０　類似度データベース
１１　入力部
１２、２２　単語抽出部
１３　単語ペア作成部
１４　類似度計算部
１５、２５　類似度集計部
１６、２６　順位付け部
１７、２７　重要語選択部
１８　出力部

DESCRIPTION OF SYMBOLS
1, 2 Important word extraction device
10 Similarity database
11 Input unit
12, 22 Word extraction unit
13 Word pair creation unit
14 Similarity calculation unit
15, 25 Similarity totaling unit
16, 26 Ranking unit
17, 27 Important word selection unit
18 Output unit

Claims

An important word extraction device comprising:
a word extraction unit that extracts noun words from sentence data and further extracts, as one compound noun, extracted words that are adjacent in the sentence data;
a word pair creation unit that creates word pairs composed of the words extracted by the word extraction unit;
a similarity calculation unit that calculates, for each word pair created by the word pair creation unit, the similarity between the words constituting the word pair;
a similarity totaling unit that takes each word extracted by the word extraction unit as a score calculation target, deletes, when the same word appears multiple times in the sentence data, the duplicated score calculation targets so that only one remains, and calculates the score of each score calculation target remaining after the deletion as the average of the similarities calculated by the similarity calculation unit for the word pairs including that word; and
an important word selection unit that selects important words from the words extracted by the word extraction unit based on the scores calculated by the similarity totaling unit,
wherein the similarity totaling unit takes the words and the compound nouns extracted by the word extraction unit as score calculation targets, and calculates the score of each such word or compound noun based on the similarities calculated by the similarity calculation unit for the word pairs each consisting of the score calculation target word, or any word constituting the score calculation target compound noun, and another score calculation target word, or any word constituting another score calculation target compound noun, and
wherein the important word selection unit selects important words from the words and the compound nouns based on the scores calculated by the similarity totaling unit.

A program for causing a computer to function as the important word extracting device according to claim 1 .