JP5605730B2

JP5605730B2 - Extraction apparatus, extraction method and extraction program

Info

Publication number: JP5605730B2
Application number: JP2011032415A
Authority: JP
Inventors: 友博高木
Original assignee: MEIJI UNIVERSITY LEGAL PERSON
Current assignee: MEIJI UNIVERSITY LEGAL PERSON
Priority date: 2011-02-17
Filing date: 2011-02-17
Publication date: 2014-10-15
Anticipated expiration: 2031-02-17
Also published as: JP2012173800A

Description

The present invention relates to an extraction apparatus, an extraction method, and an extraction program.

Coined words created by combining existing words are currently used as names for new products. Whether a coined word catches on depends on the combination of words that make it up, but because there are so many candidate word combinations, it is not obvious at a glance which words should be combined. It is also difficult to verify, for every possible word combination, whether the resulting coined word will become popular.

To address this problem, Non-Patent Document 1 describes a combination evaluation system that estimates, from the number of times keywords appear on web pages, the novelty of a word combination and the likelihood that it will be accepted by the public, and thereby determines the effectiveness of the combination.

Non-Patent Document 1: Yoko Nishihara, Wataru Sunayama, Masahiko Yachida, "Creative Activity Support by Discovering Effective Combinations," IEICE Transactions D-I, Vol. J87-D-I, No. 10, pp. 939-949, October 2004

However, while the combination evaluation system of Non-Patent Document 1 can extract keywords that appear literally in text such as web pages, it cannot extract concepts that do not appear literally in the text but can be grasped from part or all of it. As a result, it cannot provide unexpected combinations of concepts.

The present invention has been made in view of the above problem, and its object is to provide an extraction apparatus, an extraction method, and an extraction program capable of providing unexpected combinations of concepts.

To solve the above problem, an extraction apparatus according to one aspect of the present invention comprises: a cluster storage unit that stores information indicating a word in association with information indicating an affiliation degree, which is the degree to which the word belongs to a cluster, and stores the information indicating the word in association with information indicating the position of the word; a heuristic index calculation unit that reads from the cluster storage unit, for three or more clusters, the position information of words associated with information indicating words whose affiliation degree is equal to or greater than a predetermined value, and calculates a heuristic index by multiplying, based on the position information of the words, the indirect relevance between two target clusters via a third cluster other than the two target clusters by the unexpectedness of combining the two clusters; a target relevance index calculation unit that reads information indicating the affiliation degree for each cluster from the cluster storage unit and, based on the read information and on information indicating characteristics of a target input from outside the apparatus, calculates a target relevance index indicating the relevance between the target and the two target clusters and the third cluster; and a cluster set extraction unit that extracts a combination of the clusters based on the calculated heuristic index and the target relevance index.

The extraction apparatus may further comprise: an importance storage unit that stores, for each predetermined period, information indicating the word in association with information indicating the importance of the word; and an activation prediction unit that reads the information indicating the importance of the word for each predetermined period from the importance storage unit, reads the information indicating the affiliation degree for each cluster from the cluster storage unit, and predicts the activation of each cluster for each predetermined period based on the importance information and the affiliation degree information. In this case, when the activation prediction unit predicts that at least one cluster in a combination of clusters will be activated, the cluster set extraction unit extracts the combination of clusters based on the heuristic index and the target relevance index.

The activation prediction unit of the extraction apparatus may comprise: an activity calculation unit that calculates, for each predetermined period, the activity of a cluster based on information indicating the affiliation degree with which words belong to that cluster and on the information indicating the importance of those words in that period read from the importance storage unit; and an activity increase expected value calculation unit that calculates, based on the calculated activity, an activity increase expected value, which is the degree to which the activity of each cluster is expected to rise. The activation prediction unit then predicts the activation of the cluster based on the calculated activity and the calculated activity increase expected value.

In the extraction apparatus, the heuristic index may increase as the indirect relevance and the unexpectedness increase, and the cluster set extraction unit may extract the combination of clusters based on a weighted sum of the heuristic index and the target relevance index.
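The scoring described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary keys, and the default weights of 0.5 are assumptions introduced here, and the input values for indirect relevance, unexpectedness, and target relevance are taken as given.

```python
def heuristic_index(indirect_relevance, unexpectedness):
    """Heuristic index as the product of indirect relevance and unexpectedness."""
    return indirect_relevance * unexpectedness


def combined_score(heuristic, target_relevance, w_heuristic=0.5, w_target=0.5):
    """Weighted sum of the heuristic index and the target relevance index.

    The weights are hypothetical; the source does not state their values.
    """
    return w_heuristic * heuristic + w_target * target_relevance


def rank_combinations(candidates, w_heuristic=0.5, w_target=0.5):
    """Return (score, clusters) pairs sorted from highest to lowest score.

    Each candidate is a dict with the (hypothetical) keys "clusters",
    "indirect_relevance", "unexpectedness", and "target_relevance".
    """
    scored = []
    for c in candidates:
        h = heuristic_index(c["indirect_relevance"], c["unexpectedness"])
        score = combined_score(h, c["target_relevance"], w_heuristic, w_target)
        scored.append((score, c["clusters"]))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A combination with high indirect relevance but low unexpectedness scores poorly on the heuristic index, so it can only rank highly if its target relevance and the chosen weights compensate.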

An extraction method according to one aspect of the present invention is executed by an extraction apparatus comprising a cluster storage unit that stores information indicating a word in association with information indicating an affiliation degree, which is the degree to which the word belongs to a cluster, and stores the information indicating the word in association with information indicating the position of the word. The method comprises: a heuristic index calculation procedure of reading from the cluster storage unit, for three or more clusters, the position information of words associated with information indicating words whose affiliation degree is equal to or greater than a predetermined value, and calculating a heuristic index by multiplying, based on the position information of the words, the indirect relevance between two target clusters via a third cluster other than the two target clusters by the unexpectedness of combining the two clusters; a target relevance index calculation procedure of reading information indicating the affiliation degree for each cluster from the cluster storage unit and, based on the read information and on information indicating characteristics of a target input from outside the apparatus, calculating a target relevance index indicating the relevance between the target and the two target clusters and the third cluster; and a cluster set extraction procedure of extracting a combination of the clusters based on the calculated heuristic index and the target relevance index.

An extraction program according to one aspect of the present invention causes a computer of an extraction apparatus, the apparatus comprising a cluster storage unit that stores information indicating a word in association with information indicating an affiliation degree, which is the degree to which the word belongs to a cluster, and stores the information indicating the word in association with information indicating the position of the word, to execute: a heuristic index calculation step of reading from the cluster storage unit, for three or more clusters, the position information of words associated with information indicating words whose affiliation degree is equal to or greater than a predetermined value, and calculating a heuristic index by multiplying, based on the position information of the words, the indirect relevance between two target clusters via a third cluster other than the two target clusters by the unexpectedness of combining the two clusters; a target relevance index calculation step of reading information indicating the affiliation degree for each cluster from the cluster storage unit and, based on the read information and on information indicating characteristics of a target input from outside the apparatus, calculating a target relevance index indicating the relevance between the target and the two target clusters and the third cluster; and a cluster set extraction step of extracting a combination of the clusters based on the calculated heuristic index and the target relevance index.

According to the present invention, unexpected combinations of concepts can be provided.

A block diagram of the extraction apparatus in an embodiment of the present invention.
An example of the word vector table T1 stored in the importance storage unit.
A diagram for explaining the processing performed by the cluster generation unit.
A diagram for explaining the method of calculating activity.
A flowchart showing the flow of processing in which the extraction apparatus of this embodiment generates clusters.
A flowchart showing the flow of processing in which the extraction apparatus of this embodiment extracts combinations of clusters.

Embodiments of the present invention are described in detail below with reference to the drawings. First, an overview of the extraction apparatus 100 according to an embodiment of the present invention is given. The extraction apparatus 100 generates a concept that both rides a current trend and offers a fresh surprise, the two key elements of a buzzword, by combining multiple concepts that are related to the characteristics of the target (person) to whom the concept is offered. In this way, the extraction apparatus 100 can present, according to the characteristics of the target, a concept that is popular in society yet surprising to the target (a hit concept).

Here, a concept is represented as a set of words that appear in the data. As a special case, a concept may consist of a single word.
Among the concepts that serve as elements of a combination, there is a concept Cn whose role is to connect two concepts C1 and C2. The extraction apparatus 100 extracts each of the concepts C1, C2, and Cn from time-series data such as newspapers or the web by evaluating measures of the features defined as trend factors (hit factors).

The extraction apparatus 100 extracts combinations in which the direct relevance between concept C1 and concept C2 is low but the indirect relevance of C1-Cn-C2 via concept Cn is high. For example, if the target is the readership of a certain magazine associated with golf clubs (C1), the extraction apparatus 100 extracts golf club (C1), lipstick (C2), and present (Cn) as the combination of concepts C1, C2, and Cn. At first glance golf clubs and lipstick have little to do with each other, but their indirect relevance becomes high when they are linked via the concept of a present (Cn), so the combination of golf club (C1) and lipstick (C2) is well worth extracting.

Furthermore, the extraction apparatus 100 also requires, as conditions for extraction, that the concepts show an activation trend in the period of interest and that at least one of them is related to the characteristics of the target. For example, around Christmas the concept of a present shows a strong activation trend, and golf clubs are highly relevant to the target, namely the readers of the magazine.

The extraction apparatus 100 outputs the combination of concepts C1, C2, and Cn as information indicating a concept that is novel to the target. The extraction apparatus 100 can thus provide the target with information indicating a concept related to the target (for example, concept C1) and a concept C2 that is related to it through the connecting concept Cn. For example, for the readers of a magazine associated with golf clubs (concept C1), it can propose, as a trending concept (hit concept) for the Christmas season, a special feature on lipstick (concept C2) as a present (concept Cn).

FIG. 1 is a block diagram of the extraction apparatus 100 according to an embodiment of the present invention. The extraction apparatus 100 includes an importance calculation unit 101, an importance storage unit 102, a cluster generation unit 103, a cluster storage unit 104, a heuristic index calculation unit 110, a target relevance index calculation unit 114, an activation prediction unit 120, and a cluster set extraction unit 130.
The heuristic index calculation unit 110 includes an indirect relevance calculation unit 111, an unexpectedness calculation unit 112, and an integration unit 113. The activation prediction unit 120 includes an activity calculation unit 121 and a relative force index calculation unit (activity increase expected value calculation unit) 122.

The importance calculation unit 101 receives an article set D input from outside the apparatus. The article set D is time-series data of documents that reflect social conditions, such as newspapers, or documents that reflect market characteristics, such as magazines. The importance calculation unit 101 divides the article set D into segments, each containing the documents of a predetermined period, and arranges the segments in chronological order. Hereinafter, one such segment is referred to as one document, and the documents of the entire period are referred to as all documents.

The importance calculation unit 101 calculates information indicating the importance of each word in each period. Specifically, for example, the importance calculation unit 101 calculates the tf-idf value of each word in each period by dividing the frequency tf with which the word of interest appears in the document of that period by the total number of words in that document. The tf-idf value is an index commonly used in information retrieval as a measure of word importance.

The importance calculation unit 101 calculates, for each predetermined period, a word vector in which these tf-idf values are arranged in a predetermined word order. The word vector is a list of the tf-idf values of the words and represents the characteristics of that period. The importance calculation unit 101 stores information indicating the calculated word vector, in association with the words, in the word vector table T1 of the importance storage unit 102 for each period.
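The construction of per-period word vectors can be sketched as below. The text only states that the term frequency is divided by the total number of words in the document; the inverse document frequency used here, log(N / df), is the common textbook definition and is an assumption, as are the function and parameter names.

```python
import math
from collections import Counter


def word_vectors(documents, vocabulary):
    """Build one tf-idf word vector per period.

    documents:  list of token lists, one per period, in chronological order.
    vocabulary: list of words fixing the order of entries in each vector.
    """
    n_docs = len(documents)

    # Document frequency: in how many period documents each word occurs.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vec = []
        for word in vocabulary:
            # Term frequency normalised by the document's total word count.
            tf = counts[word] / total if total else 0.0
            # Assumed idf: log(N / df); zero for words that never occur.
            idf = math.log(n_docs / df[word]) if df[word] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors
```

With this definition a word that occurs in every period gets an idf of zero, so its entry in each word vector is zero regardless of how often it appears.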

FIG. 2 shows an example of the word vector table T1 stored in the importance storage unit 102. In this figure, the predetermined period is set to one day, and the tf-idf values of the words for each day are listed in a predetermined word order. Each column represents a word vector (W_1, W_2, W_3, ..., W_30).
Arranging the word vectors in chronological order in this way shows the characteristics of the articles for each predetermined period in order of time.

Returning to FIG. 1, the importance calculation unit 101 outputs the set of word vector information (hereinafter referred to as the word vector set) to the cluster generation unit 103.
The cluster generation unit 103 uses the word vector set input from the importance calculation unit 101 to classify the words into clusters, each of which is a predetermined grouping, and assigns a label to each cluster.

In this embodiment, a concept is assumed to be represented by a set of words that are similar through some commonality or relationship. The set referred to here may be either an ordinary set, in which the affiliation degree of an element is either 0 or 1, or a fuzzy set, in which the affiliation degree of an element takes any value from 0 to 1.

The cluster generation unit 103 therefore clusters the words that appear in the article set D according to a predetermined clustering method. A single cluster typically contains tens of thousands of words, and each word has an affiliation degree Mem_C(w), a value expressing the degree to which word w belongs to cluster C. This value indicates the degree to which the word belongs to the concept that the cluster corresponds to.

Various clustering methods have already been proposed. As one example, the cluster generation unit 103 clusters the words appearing in the article set D using the k-means method. Specifically, the cluster generation unit 103 calculates the clusters that minimize the evaluation value expressed by formula (1) below. Here, k is assumed to be given in advance.

The minimization is subject to conditional expression (2) below.

Here, x_i is the i-th word data point (i is an integer from 1 to I), with x_i = (x_i1, x_i2); K is the number of clusters; v_k is the centroid of the k-th cluster (k is an integer from 1 to K), with v_k = (v_k1, v_k2); and g_ik is the affiliation degree of the i-th data point to the k-th cluster.
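Formulas (1) and (2) themselves are not reproduced in this text. For orientation only, the textbook k-means objective and its hard-assignment constraint, written with the symbols defined above, take the following form. This is an assumption based on the standard method, not a transcription of the patent's formulas.

```latex
% Assumed textbook form corresponding to formula (1)
J = \sum_{i=1}^{I} \sum_{k=1}^{K} g_{ik} \, \lVert x_i - v_k \rVert^{2}

% Assumed textbook form corresponding to conditional expression (2)
g_{ik} \in \{0, 1\}, \qquad \sum_{k=1}^{K} g_{ik} = 1 \quad (i = 1, \dots, I)
```

Under this reading, each word belongs to exactly one cluster, which matches the ordinary-set case in which the affiliation degree is either 0 or 1.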

Although the cluster generation unit 103 uses the k-means method here, it is not limited to this and may instead use the fuzzy c-means method. In that case, specifically, the cluster generation unit 103 calculates the clusters that minimize the evaluation value expressed by formula (3) below. Here, k is assumed to be given in advance.

The minimization is subject to conditional expression (4) below.

Here, x_i is the i-th word data point (i is an integer from 1 to I), with x_i = (x_i1, x_i2); K is the number of clusters; v_k is the centroid of the k-th cluster (k is an integer from 1 to K), with v_k = (v_k1, v_k2); and g_ik is the affiliation degree of the i-th data point to the k-th cluster.
Thus, whether it uses the k-means method or the fuzzy c-means method, the cluster generation unit 103 calculates, for each element, the affiliation degree with which the element belongs to each cluster.
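The difference between the two kinds of affiliation degree can be sketched as follows for a single word position and a set of fixed centroids. The fuzzy update used here is the standard fuzzy c-means membership formula; the fuzzifier m (default 2.0) and all function names are assumptions, since the source does not give them.

```python
import math


def hard_memberships(point, centroids):
    """k-means style affiliation: 1 for the nearest centroid, 0 otherwise."""
    distances = [math.dist(point, c) for c in centroids]
    nearest = distances.index(min(distances))
    return [1.0 if k == nearest else 0.0 for k in range(len(centroids))]


def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means style affiliation in [0, 1], summing to 1 over clusters.

    Uses g_k = 1 / sum_j (d_k / d_j) ** (2 / (m - 1)); m is an assumed fuzzifier.
    """
    if m <= 1.0:
        raise ValueError("fuzzifier m must be greater than 1")
    distances = [math.dist(point, c) for c in centroids]

    # A point lying exactly on one or more centroids belongs fully to them.
    on_centroid = [k for k, d in enumerate(distances) if d == 0.0]
    if on_centroid:
        share = 1.0 / len(on_centroid)
        return [share if k in on_centroid else 0.0 for k in range(len(distances))]

    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_k / d_j) ** exponent for d_j in distances)
        for d_k in distances
    ]
```

A full clustering run would alternate such membership updates with centroid updates until the evaluation value stops decreasing; only the membership step is shown here.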

The cluster generation unit 103 assigns a label to each obtained cluster so that one concept is associated with each cluster. Specifically, the cluster generation unit 103 takes the word closest to the cluster centroid as the representative of the cluster and uses it as the cluster's label. Alternatively, the cluster generation unit 103 may take the word with the highest affiliation degree in the cluster as the representative and use it as the cluster's label.
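Both labeling rules can be sketched in a few lines. The function names and the dictionary-based inputs are assumptions made for illustration.

```python
import math


def label_by_centroid(word_positions, centroid):
    """Pick the word whose position is closest to the cluster centroid.

    word_positions: dict mapping each word in the cluster to its coordinates.
    """
    return min(word_positions, key=lambda w: math.dist(word_positions[w], centroid))


def label_by_membership(memberships):
    """Pick the word with the highest affiliation degree in the cluster.

    memberships: dict mapping each word in the cluster to its affiliation degree.
    """
    return max(memberships, key=memberships.get)
```

The two rules can disagree, for example when the highest-affiliation word lies away from the geometric centroid, so the choice affects which concept name a cluster receives.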

FIG. 3 is a diagram for explaining the processing performed by the cluster generation unit 103. FIG. 3(a) illustrates the clusters generated by the cluster generation unit 103. The article set D is shown on the left side of the figure, and an example of the clusters is shown on the right side on a two-dimensional xy plane.

On this two-dimensional plane, each word that is an element of a cluster is indicated by a cross. Three clusters C_1, C_2, and C_3 are shown, and each cluster contains the words marked by crosses inside its circle. Cluster C_1 is labeled "agricultural products," and its elements include "processor" and "orange." Cluster C_2 is labeled "computer," and its elements include "processor" and "memory." That is, "processor" belongs to cluster C_1 in the sense of a food processor and to cluster C_2 in the sense of a computer processor.

Cluster C_3 is labeled "brain," and its elements include "memory." That is, "memory" belongs to cluster C_2 in the sense of computer memory and to cluster C_3 in the sense of memory in the brain.

FIG. 3(b) shows an example of the concept table T2 stored in the cluster storage unit 104. In the concept table T2, the identification information C_i (i is a positive integer) identifying each cluster shown in FIG. 3(a) is associated with information indicating the label assigned to that cluster.

FIG. 3(c) shows an example of the affiliation degree table T3 stored in the cluster storage unit 104. In the affiliation degree table T3, information indicating each word shown in FIG. 3(a) is associated, for each piece of cluster identification information C_i, with information indicating the affiliation degree, that is, the degree to which the word belongs to that cluster.

FIG. 3(d) shows an example of the coordinate table T4 stored in the cluster storage unit 104. In the coordinate table T4, information indicating each word shown in FIG. 3(a) is associated with information indicating its coordinates, which serve as the information indicating the position of the word.

Returning to FIG. 1, the cluster generation unit 103 stores the cluster identification information C_i (hereinafter, i is a positive integer from 1 to n representing a cluster index) in the cluster storage unit 104 in association with information indicating the label assigned to each cluster. The cluster generation unit 103 also stores in the cluster storage unit 104, for each piece of cluster identification information C_i, information indicating a word in association with information indicating the affiliation degree with which the word belongs to that cluster. In addition, the cluster generation unit 103 stores in the cluster storage unit 104 information indicating a word in association with information indicating the position of that word.

As a result of the processing by the cluster generation unit 103, the cluster storage unit 104 stores the cluster identification information C_i in association with information indicating the label assigned to each cluster, as shown in FIG. 3(b).
Likewise, as shown in FIG. 3(c), the cluster storage unit 104 stores, for each cluster, information indicating a word in association with information indicating the affiliation degree with which the word belongs to that cluster.

As a result of the processing by the cluster generation unit 103, the cluster storage unit 104 also stores information indicating a word in association with information indicating the position of that word, as shown in FIG. 3(d). For example, when the clustering by the cluster generation unit 103 assigns each word a position on a two-dimensional plane, the information indicating the position of each word is information indicating its coordinates on that plane.
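The three tables held by the cluster storage unit 104 can be pictured as simple mappings. The sketch below uses the cluster labels and words from the FIG. 3 discussion, but the affiliation degrees, the coordinates, and all Python names are made-up illustrative values rather than data from the figure.

```python
# Concept table T2: cluster identifier -> label.
concept_table = {
    "C_1": "agricultural products",
    "C_2": "computer",
    "C_3": "brain",
}

# Affiliation degree table T3: cluster identifier -> {word: affiliation degree}.
# The numeric degrees are hypothetical.
affiliation_table = {
    "C_1": {"orange": 0.9, "processor": 0.4},
    "C_2": {"processor": 0.8, "memory": 0.7},
    "C_3": {"memory": 0.6},
}

# Coordinate table T4: word -> position on the two-dimensional plane.
# The coordinates are hypothetical.
coordinate_table = {
    "orange": (1.0, 4.0),
    "processor": (3.0, 3.0),
    "memory": (5.0, 2.0),
}


def positions_above(cluster_id, threshold):
    """Positions of the words whose affiliation degree to the given cluster
    is at least the threshold."""
    return {
        word: coordinate_table[word]
        for word, degree in affiliation_table[cluster_id].items()
        if degree >= threshold
    }
```

Because a word such as "processor" appears under more than one cluster in T3 but has a single entry in T4, a threshold on the affiliation degree decides which clusters that word's position contributes to.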

The heuristic index calculation unit 110 reads from the cluster storage unit 104 the information indicating the affiliation degrees associated with different clusters for a predetermined number of clusters (for example, three) and, based on the read information, calculates a heuristic index that reflects both the relevance between two target clusters via a third cluster other than those two and the unexpectedness of combining the two clusters.
The heuristic index becomes higher as the direct relevance between the two clusters becomes lower, and higher as the relevance of the two clusters to the remaining third cluster becomes higher.

The indirect relevance calculation unit 111 reads from the cluster storage unit 104, for three or more clusters, the position information of words associated with information indicating words whose affiliation degree is equal to or greater than a predetermined value, and, based on that position information, calculates the indirect relevance between two target clusters via a third cluster other than the two. As one example, the indirect relevance calculation unit 111 calculates the maximum indirect relevance MIR, which is the largest of the relevance values between the two clusters via a third cluster other than the two target clusters.

Specifically, for example, the indirect relevance calculation unit 111 calculates the indirect relevance, which indicates the degree to which cluster C_i and cluster C_j (hereinafter, j is an integer from 1 to n representing a cluster index) are related via a connection cluster CN, while varying the connection cluster CN from C_1 to C_n. It then obtains the maximum indirect relevance MIR, the largest of the n calculated indirect relevance values, using the following equation (5). Here, the connection cluster CN can be any of the clusters C_1 to C_N.

MIR(C_i, C_j) = MAX_CN {A(C_i, CN) × A(CN, C_j)}   (5)

Here, MAX_CN is a function that selects the connection cluster CN that maximizes its argument (the indirect relevance inside the braces on the right-hand side) and outputs the value of the argument at that point, and A is a function that calculates the relevance between its first and second arguments. The indirect relevance calculation unit 111 may instead calculate the maximum indirect relevance MIR, which indicates the degree to which cluster C_i and cluster C_j are related via the connection cluster CN, using the following equation (6).

MIR(C_i, C_j) = MAX_CN {A(C_i, CN) + A(CN, C_j)}   (6)

間接関連度算出部１１１は、式（５）または式（６）の中の関連度Ａを、コサイン類似度を用いて算出する。 The indirect relevance calculation unit 111 calculates the relevance A in Expression (5) or Expression (6) using the cosine similarity.

As an example, a method by which the indirect relevance calculation unit 111 calculates the relevance A using the cosine similarity is described. Let vector x be the vector from the origin to the centroid of cluster C_i, and vector y the vector from the origin to the centroid of cluster C_j. For example, the indirect relevance calculation unit 111 calculates the relevance A according to the following equation (7).

A(C_i, C_j) = (x · y) / (|x| × |y|)   (7)

Here, x · y is the inner product of the vectors x and y, given by (x_1 × y_1 + x_2 × y_2 + ... + x_m × y_m), where m is a positive integer, and |x| is the norm of the vector x, that is, √(x · x). The right-hand side of equation (7) is the cosine cos θ of the angle θ formed by the vectors x and y. It is called the cosine similarity and expresses how close the directions of the two vectors are.
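The relevance of equation (7) can be sketched in a few lines of Python. This is a minimal illustration rather than the apparatus's implementation: the function name `cosine_similarity`, the sample centroids, and the handling of a zero vector (returning 0.0) are assumptions that do not appear in the source.

```python
import math

def cosine_similarity(x, y):
    """Relevance A of equation (7): cosine of the angle between x and y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))        # inner product x . y
    norm_x = math.sqrt(sum(a * a for a in x))     # |x| = sqrt(x . x)
    norm_y = math.sqrt(sum(b * b for b in y))     # |y| = sqrt(y . y)
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: the source does not say how a zero vector is handled.
        return 0.0
    return dot / (norm_x * norm_y)

centroid_i = [1.0, 0.0]  # hypothetical centroid of cluster C_i
centroid_j = [1.0, 1.0]  # hypothetical centroid of cluster C_j
a_ij = cosine_similarity(centroid_i, centroid_j)  # cos(45 degrees)
```

Because only the direction of each centroid vector matters, scaling either vector leaves the relevance unchanged.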

The indirect relevance calculation unit 111 may instead calculate the relevance A in equation (5) or equation (6) using another method, such as the Jaccard coefficient or mutual information. When the Jaccard coefficient is used and C_i and C_j are ordinary (non-fuzzy) clusters, the indirect relevance calculation unit 111 calculates the relevance A from the number of occurrences of words that appear in either of the two clusters C_i and C_j. Specifically, the indirect relevance calculation unit 111 calculates the relevance A according to the following equation (8).

Here, |C| is the number of elements (words) included in cluster C. The larger the relevance A, the more similar the two clusters are. When cluster C_i and cluster C_j are fuzzy sets calculated by the fuzzy c-means method, the indirect relevance calculation unit 111 calculates the relevance between cluster C_i and cluster C_j using the following equation (9), where x_p is the p-th element of the word vector x of cluster C_i (p is an integer from 1 to P) and y_q is the q-th element of the word vector y of cluster C_j (q is an integer from 1 to Q).

On the other hand, when mutual information is used, the indirect relevance calculation unit 111 calculates the mutual information MI(C_i, C_j) of cluster C_i and cluster C_j as the relevance A according to the following equation (10). Here, mutual information is a measure of relatedness obtained from the rate at which two given words co-occur.

Here, x_p is the p-th element of the word vector x of C_i, y_q is the q-th element of the word vector y of C_j, P(x_p, y_q) is the joint occurrence probability of x_p and y_q, and P(x_p) and P(y_q) are the marginal occurrence probabilities of x_p and y_q, respectively.

The indirect relevance calculation unit 111 calculates the maximum indirect relevance MIR(C_i, CN_(i,j), C_j) for every combination of cluster C_i and cluster C_j. Here, CN_(i,j) is the cluster selected when the indirect relevance between cluster C_i and cluster C_j is at its maximum; it is selected from among clusters C_1 to C_N for each combination of cluster C_i and cluster C_j. The indirect relevance calculation unit 111 outputs, to the integrating unit 113, information indicating all of the calculated maximum indirect relevance values MIR(C_i, CN_(i,j), C_j) together with information indicating the combination of clusters C_i, CN_(i,j), and C_j used to calculate each of them.
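The search for the connection cluster described by equations (5) and (6) can be sketched as follows. This is an illustrative sketch under stated assumptions: the relevance A is taken to be the cosine similarity of equation (7), cluster centroids are assumed to be nonzero vectors, and the connection cluster is restricted to a third cluster other than C_i and C_j. The names `max_indirect_relevance`, `combine`, and `mir_table` and the sample centroids are hypothetical.

```python
import math
from itertools import combinations

def cosine(x, y):
    """Cosine similarity of two nonzero vectors (relevance A, equation (7))."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def max_indirect_relevance(centroids, i, j, combine=lambda a, b: a * b):
    """Return (MIR, index of the connection cluster CN) for clusters i and j.

    combine is the product for equation (5); pass a sum for equation (6).
    """
    best_value, best_cn = -math.inf, None
    for cn in range(len(centroids)):
        if cn in (i, j):
            continue  # assumption: CN must be a third cluster
        value = combine(cosine(centroids[i], centroids[cn]),
                        cosine(centroids[cn], centroids[j]))
        if value > best_value:
            best_value, best_cn = value, cn
    return best_value, best_cn

# Hypothetical two-dimensional centroids of four clusters.
centroids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]

# MIR and the selected connection cluster for every pair (i, j).
mir_table = {(i, j): max_indirect_relevance(centroids, i, j)
             for i, j in combinations(range(len(centroids)), 2)}
```

Returning the index of the selected connection cluster alongside the value mirrors the description above, in which CN_(i,j) is passed on together with each MIR value.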

The unexpectedness calculation unit 112 reads, from the cluster storage unit 104, information indicating words whose degree of membership is equal to or greater than a predetermined value, for three or more clusters, and calculates the unexpectedness U of each combination of clusters based on the information indicating the positions of the read words. Specifically, for example, the unexpectedness calculation unit 112 uses the reciprocal of the relevance in equation (7) as the unexpectedness and calculates the unexpectedness U(C_i, C_j) between cluster C_i and cluster C_j according to the following equation (11).

U(C_i, C_j) = (|x| × |y|) / (x · y)   (11)

ここで、ベクトルｘは原点からクラスタＣ＿ｉの重心へのベクトル、ベクトルｙを原点からクラスタＣ＿ｊの重心へのベクトルである。 Here, the vector x is a vector from the origin to the center of gravity of the cluster C_i, and the vector y is a vector from the origin to the center of gravity of the cluster C_j.
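Equation (11) is simply the reciprocal of the cosine similarity, which the following sketch makes explicit. The function name `unexpectedness` is hypothetical, and the treatment of a non-positive inner product (returning infinity) is an assumption, since the source does not define U for orthogonal or opposed centroids.

```python
import math

def unexpectedness(x, y):
    """U of equation (11): (|x| * |y|) / (x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    if dot <= 0.0:
        # Assumption: orthogonal or opposed centroids are treated as
        # maximally unexpected; the source leaves this case undefined.
        return math.inf
    return math.hypot(*x) * math.hypot(*y) / dot

u_diagonal = unexpectedness([1.0, 0.0], [1.0, 1.0])  # 1 / cos(45 degrees)
u_parallel = unexpectedness([3.0, 0.0], [2.0, 0.0])  # parallel vectors
```

Parallel centroids give the minimum value of 1, and the value grows without bound as the centroids approach orthogonality.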

The unexpectedness calculation unit 112 may instead calculate the unexpectedness using the reciprocal of the Jaccard coefficient (the reciprocal of the right-hand side of equation (8)). In that case, specifically, the unexpectedness calculation unit 112 calculates the unexpectedness U(C_i, C_j) between cluster C_i and cluster C_j according to the following equation (12).

Here, the lower the relevance between cluster C_i and cluster C_j, the higher the unexpectedness U(C_i, C_j), which reflects that the combination of the two clusters is unexpected. The unexpectedness calculation unit 112 may also calculate the unexpectedness using the reciprocal of the mutual information MI (the reciprocal of the right-hand side of equation (10)). In that case, specifically, the unexpectedness calculation unit 112 calculates the unexpectedness U(C_i, C_j) between cluster C_i and cluster C_j according to the following equation (13).

U(C_i, C_j) = 1 / MI(C_i, C_j)   (13)

The unexpectedness calculation unit 112 calculates the unexpectedness U(C_i, C_j) for every combination of cluster C_i and cluster C_j. It outputs, to the integrating unit 113, information indicating all of the calculated unexpectedness values U(C_i, C_j) together with the identification information of the cluster C_i and the cluster C_j used to calculate each of them.

Next, the integrating unit 113 calculates the discoverability index based on the maximum indirect relevance MIR and the unexpectedness U. Specifically, the integrating unit 113 calculates, according to the following equation (14), a cluster discoverability index S that reflects both the relevance between the two target clusters (C_i, C_j) via a third cluster CN other than those two and the unexpectedness of combining the two clusters (C_i, C_j).

S(C_i, C_j) = MIR(C_i, C_j) × U(C_i, C_j)   (14)

The discoverability index S is an index for reconciling two requirements: that cluster C_i and cluster C_j be related via the cluster CN, and that the combination of cluster C_i and cluster C_j be novel and unexpected. That is, the discoverability index S becomes higher as the direct relevance between the two clusters (C_i, C_j) becomes lower, and higher as the relevance of the two clusters to the remaining third cluster (CN_(i,j)) becomes higher.
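A small numeric sketch shows how equation (14) trades off the two requirements. It assumes that U is taken as the reciprocal of the direct relevance A(C_i, C_j), as in equation (11); the function name `discoverability_index` and the sample values are hypothetical.

```python
def discoverability_index(mir, direct_relevance):
    """S of equation (14), with U = 1 / A(C_i, C_j) as in equation (11)."""
    if direct_relevance <= 0.0:
        # Assumption: U = 1 / A is only meaningful for a positive relevance.
        raise ValueError("direct relevance must be positive")
    return mir * (1.0 / direct_relevance)

# Two hypothetical cluster pairs with the same indirect relevance (MIR = 0.5)
# but different direct relevance.
s_distant = discoverability_index(mir=0.5, direct_relevance=0.1)  # weakly related pair
s_close = discoverability_index(mir=0.5, direct_relevance=0.8)    # strongly related pair
```

With the indirect relevance held equal, the pair that is less directly related scores higher, which is the behavior described above.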

The integrating unit 113 calculates the discoverability index S for every combination of cluster C_i and cluster C_j and outputs information indicating the calculated discoverability index S to the cluster set extraction unit 130. The integrating unit 113 also outputs information indicating the cluster C_i, the cluster C_j, and the connection cluster CN_(i,j) to the target relevance index calculation unit 114.

The target relevance index calculation unit 114 receives information indicating the target characteristic T (for example, the characteristics of the social climate, market, or individual being targeted), which is input from outside the apparatus. It also receives the information indicating the cluster C_i, the cluster C_j, and the connection cluster CN_(i,j) input from the integrating unit 113.

The target relevance index calculation unit 114 reads information indicating the degree of membership for each of the clusters (C_i, C_j, CN_(i,j)) from the cluster storage unit 104. Based on the read membership information and the information indicating the target characteristic T input from outside the apparatus, it calculates a target relevance index N indicating the relevance between the three different clusters (C_i, C_j, CN_(i,j)) and the target.

具体的には、例えば、ターゲット関連性指数算出部１１４は、下記の式（１５）に従って、ターゲット関連性指数Ｎを算出する。 Specifically, for example, the target relevance index calculation unit 114 calculates the target relevance index N according to the following equation (15).

N(C_i, C_j, CN_(i,j), T) = min(A(C_i, T), A(C_j, T), A(CN_(i,j), T))   (15)

ターゲット関連性指数算出部１１４は、算出したターゲット関連性指数Ｎを示す情報をクラスタ組抽出部１３０に出力する。 The target relevance index calculation unit 114 outputs information indicating the calculated target relevance index N to the cluster set extraction unit 130.
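Equation (15) takes the weakest of the three relevance values, so a cluster set scores well only when all three clusters relate to the target. The sketch below assumes that the target characteristic T is represented as a vector in the same space as the cluster vectors and that A is the cosine similarity; neither representation is fixed by the source. The function names and sample vectors are hypothetical.

```python
import math

def cosine(x, y):
    """Cosine similarity of two nonzero vectors (assumed form of A)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def target_relevance(c_i, c_j, c_n, target):
    """N of equation (15): the minimum relevance of the three clusters to T."""
    return min(cosine(c_i, target), cosine(c_j, target), cosine(c_n, target))

c_i, c_j, c_n = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]   # hypothetical cluster vectors
n_axis = target_relevance(c_i, c_j, c_n, [1.0, 0.0])  # target aligned with C_i only
n_diag = target_relevance(c_i, c_j, c_n, [1.0, 1.0])  # target between all three
```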

The activity calculation unit 121 reads information indicating the word vector of each period from the importance storage unit 102 and, based on that information, calculates the activity of each cluster in each period. Specifically, for example, letting R(C_i, k) denote the activity of the i-th cluster C_i in the k-th period, the activity calculation unit 121 calculates the activity according to the following equation (16).

R(C_i, k) = sim(Y_i, W_k)   (16)

Here, Y_i is a membership vector composed of the degrees of membership of the words belonging to cluster C_i, and W_k is the word vector of the documents of the k-th period (k is a positive integer). Equation (16) means that the activity calculation unit 121 takes the similarity between the word vector W_k of the documents of the k-th period and the membership vector Y_i representing cluster C_i directly as the activity of cluster C_i. The function sim represents similarity and is given by the following equation (17), which uses the cosine similarity.

sim(Y_i, W_k) = (Y_i · W_k) / (|Y_i| × |W_k|)   (17)

FIG. 4 is a diagram for explaining the method of calculating the activity. In the figure, each element of the membership vector 401 is the degree of membership of a word (Word 1 to Word M) belonging to the cluster (M is a positive integer). Each element of the word vector 402 of the documents of the k-th period is the tf-idf value, in the documents of the k-th period, of the corresponding word (Word 1 to Word M) belonging to the cluster.
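The activity of equations (16) and (17) can be sketched as a cosine similarity evaluated once per period. The membership values and tf-idf values below are invented for illustration, and returning 0.0 for a period in which none of the cluster's words occur is an assumption not stated in the source.

```python
import math

def sim(y, w):
    """Equation (17): cosine similarity of membership vector y and word vector w."""
    dot = sum(a * b for a, b in zip(y, w))
    norm = math.sqrt(sum(a * a for a in y)) * math.sqrt(sum(b * b for b in w))
    # Assumption: a period with no occurrences of the words has zero activity.
    return dot / norm if norm > 0.0 else 0.0

# Y_i: hypothetical degrees of membership of Word 1..3 in cluster C_i.
membership = [0.9, 0.6, 0.1]

# W_k: hypothetical tf-idf values of the same words in periods k = 1, 2, 3.
word_vectors = [
    [0.0, 0.1, 0.8],
    [0.3, 0.2, 0.4],
    [0.9, 0.6, 0.1],
]

# R(C_i, k) for each period, following equation (16).
activity = [sim(membership, w_k) for w_k in word_vectors]
```

In this example the activity rises over the three periods as the tf-idf profile of the documents moves toward the words with high membership in the cluster.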

The activity calculation unit 121 may use the Jaccard coefficient as the function sim. The activity calculation unit 121 may also calculate the activity R(C_i) of cluster C_i according to the following equation (18).

Here, mem_{C_i}(y_q) is the degree of membership of the word y_q in cluster C_i, MI(x_p, y_q) is the mutual information between the word x_p and the word y_q, and tfidf(x) is the tf-idf value of the word x_p in the word vector.

Instead of using all of the words included in each concept, the activity calculation unit 121 may calculate the activity from a word vector composed only of the tf-idf values of a fixed number of top-ranked words with the highest tf-idf values, or of words whose tf-idf value exceeds a predetermined value. This reduces the number of calculations and therefore shortens the calculation time.

The activity calculation unit 121 outputs information indicating the calculated activity R(C_i, k) of cluster C_i in each period to the relative strength index calculation unit 122. Based on the activity R(C_i, k) of cluster C_i in each period input from the activity calculation unit 121, the relative strength index calculation unit 122 focuses on the change over time in the activity of each cluster and calculates the degree to which the activity of each cluster is expected to rise (the expected activity increase) in society at large, in the target market, or for an individual.

Specifically, for example, the relative strength index calculation unit 122 calculates the relative strength index RSI(C_i) as an example of the expected activity increase. The relative strength index (RSI) is the ratio of upward movement to the total movement of past values, and it is generally said that a value tends to turn upward once its RSI falls below 30. When calculating the RSI, the relative strength index calculation unit 122 sets a sampling period of a predetermined length, such as one month or one day, and calculates the RSI from the rises and falls in activity within that sampling period. For example, the relative strength index calculation unit 122 calculates the RSI according to the following equation (19).

RSI = u / (u + d) × 100   (19)

Here, u is the sum of the increases in activity during the predetermined sampling period, and d is the sum of the decreases in activity during that period. Although the relative strength index calculation unit 122 uses the relative strength index RSI as the expected activity increase, it is not limited to this and may use another economic indicator.
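Equation (19) can be illustrated with a short sketch that sums the rises and falls between consecutive activity values in one sampling period. The function name `relative_strength_index` and the sample series are hypothetical, and returning `None` for a completely flat window (where u + d = 0) is an assumption, since the source does not cover that case.

```python
def relative_strength_index(activity):
    """RSI of equation (19) over one sampling period of activity values."""
    rises = falls = 0.0
    for previous, current in zip(activity, activity[1:]):
        change = current - previous
        if change > 0:
            rises += change   # contributes to u
        else:
            falls -= change   # contributes to d (stored as a positive amount)
    if rises + falls == 0.0:
        return None  # assumption: RSI is undefined when nothing moved
    return rises / (rises + falls) * 100.0

rsi_mixed = relative_strength_index([1.0, 3.0, 2.0, 4.0])  # u = 4, d = 1
rsi_falling = relative_strength_index([3.0, 2.0, 1.0])     # u = 0, d = 2
```

A steadily falling series gives an RSI of 0 and a steadily rising one gives 100, so a low value indicates that recent movement has been mostly downward.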

The activation prediction unit 120 then predicts the activation of each cluster based on the calculated activity and the calculated expected activity increase. Specifically, the activation prediction unit 120 generalizes the value of 30 mentioned above into a threshold L and uses the following two conditions to predict a rise: (i) the relative strength index (RSI) has fallen below the threshold L at some point during a fixed past period, and (ii) the current activity R lies between a lower limit RL and an upper limit Ru. When both conditions are satisfied, the activation prediction unit 120 predicts that the cluster will become active; otherwise, it predicts that the cluster will not become active.
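The two activation prediction conditions can be expressed as a single predicate. In this sketch the function name, the parameter names, and the sample thresholds are hypothetical, and treating the limits RL and Ru as inclusive is an assumption; the source only says that the activity lies between them.

```python
def predicts_activation(rsi_history, current_activity, threshold_l, lower_rl, upper_ru):
    """Return True when conditions (i) and (ii) both hold for one cluster."""
    # Condition (i): the RSI fell below the threshold L at some point in the past period.
    dipped = any(rsi < threshold_l for rsi in rsi_history)
    # Condition (ii): the current activity R lies between RL and Ru
    # (inclusive bounds are an assumption of this sketch).
    in_band = lower_rl <= current_activity <= upper_ru
    return dipped and in_band

will_activate = predicts_activation(
    rsi_history=[55.0, 28.0, 40.0],  # hypothetical RSI values for past sampling periods
    current_activity=0.3,
    threshold_l=30.0,
    lower_rl=0.1,
    upper_ru=0.6,
)
```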

The activation prediction unit 120 outputs information indicating the prediction result to the cluster set extraction unit 130. The cluster set extraction unit 130 receives information indicating the discoverability index S from the integrating unit 113, information indicating the target relevance index N from the target relevance index calculation unit 114, and information indicating the prediction result from the activation prediction unit 120.

When the activation prediction unit 120 predicts the activation of at least one cluster in a combination of clusters, the cluster set extraction unit 130 extracts the combination of clusters based on the discoverability index S and the target relevance index N. Specifically, the cluster set extraction unit 130 extracts a combination of clusters (C_i, C_j, CN(i,j)) based on the following three conditions.

(1) As the condition on the discoverability index S, the discoverability index S of the cluster set C_i, C_j, CN(i,j) is equal to or greater than a predetermined value.
(2) As the condition on activation prediction, the relative strength index (RSI) and the activity R of any one of cluster C_i, cluster C_j, and cluster CN(i,j) satisfy the cluster activation prediction conditions (i) and (ii) described above, respectively.
(3) As the condition on the target relevance index N, any one of the clusters C_i, C_j, and CN(i,j) has a relevance to the target characteristic T that is equal to or greater than a predetermined value.

For example, given a target characteristic T, the cluster set extraction unit 130 determines the optimum combination of clusters (C_i, C_j, CN(i,j)) for that characteristic T from the following equation (20).

argmax {a S(C_i, C_j, CN(i,j)) + b N(C_i, C_j, CN(i,j), T)}   (20)

Here, a and b are coefficients representing the weights for S and N, and argmax is a function that returns the value at which its argument is maximized. Using equation (20), the cluster set extraction unit 130 can extract the combination of clusters for which the argument is largest, provided that the relative strength index (RSI) and the activity R of any one of C_i, C_j, and CN(i,j) satisfy the cluster activation prediction conditions (i) and (ii), respectively.

なお、本実施形態では、クラスタ組抽出部１３０は、一例として、式（２０）の引数が最大となるクラスタの組み合わせを１つ抽出したが、これに限ったものではない。クラスタ組抽出部１３０は、式（２０）の引数の値が所定の値以上となる１つ以上のクラスタの組み合わせすべてを抽出してもよい。また、クラスタ組抽出部１３０は、式（２０）の引数の値が高いほうからトップＭ（Ｍは正の整数）のクラスタの組み合わせすべてを抽出してもよい。 In the present embodiment, the cluster set extraction unit 130 extracts, as an example, one cluster combination that maximizes the argument of Expression (20). However, the present invention is not limited to this. The cluster set extraction unit 130 may extract all the combinations of one or more clusters in which the value of the argument in Expression (20) is a predetermined value or more. In addition, the cluster set extraction unit 130 may extract all combinations of clusters of the top M (M is a positive integer) in descending order of the argument value of Expression (20).
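The selection of equation (20), including the top-M variant, can be sketched as a filter followed by a sort. The dictionary keys (`clusters`, `S`, `N`, `activated`), the function name `rank_cluster_sets`, and the sample values are all hypothetical; the `activated` flag stands for the outcome of the activation prediction conditions (i) and (ii) for that cluster set.

```python
def rank_cluster_sets(candidates, a=1.0, b=1.0, top_m=1):
    """Return the top_m eligible cluster sets ordered by a*S + b*N (equation (20))."""
    # Only cluster sets for which activation is predicted are eligible.
    eligible = [c for c in candidates if c["activated"]]
    eligible.sort(key=lambda c: a * c["S"] + b * c["N"], reverse=True)
    return eligible[:top_m]

# Hypothetical candidates: (C_i, C_j, CN) with precomputed S, N, and prediction.
candidates = [
    {"clusters": ("C1", "C2", "C3"), "S": 5.0, "N": 0.2, "activated": True},
    {"clusters": ("C1", "C4", "C3"), "S": 9.0, "N": 0.9, "activated": False},
    {"clusters": ("C2", "C4", "C1"), "S": 3.0, "N": 0.8, "activated": True},
]

best = rank_cluster_sets(candidates)                       # equal weights
best_target_heavy = rank_cluster_sets(candidates, b=10.0)  # favor target relevance
```

Changing the weights a and b shifts the choice between sets that are more surprising and sets that are closer to the target, and a threshold on the score could replace the `top_m` cut for the variant that extracts every set above a predetermined value.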

The cluster set extraction unit 130 then outputs, to the outside of the apparatus, information indicating the cluster C_i, the cluster C_j, and the cluster CN_(i,j) that make up the extracted combination of clusters. The cluster set extraction unit 130 may also read the label associated with each cluster of the extracted combination from the table T2 of the cluster storage unit 104 and output information indicating those labels to the outside of the apparatus as information indicating a combination of hit concepts.

図５は、本実施形態の抽出装置１００がクラスタを生成する処理の流れを示したフローチャートである。まず、重要度算出部１０１は、所定期間毎の一区切りのドキュメント中に掲載された各単語のｔｆ−ｉｄｆ値の算出する（ステップＳ１０１）。次に、重要度算出部１０１は、所定期間毎に、各単語のｔｆ−ｉｄｆ値が予め決められた単語順に並べられたワードベクトルを算出する（ステップＳ１０２）。 FIG. 5 is a flowchart showing a flow of processing in which the extraction apparatus 100 of the present embodiment generates a cluster. First, the importance calculation unit 101 calculates the tf-idf value of each word posted in a document separated by a predetermined period (step S101). Next, the importance calculation unit 101 calculates a word vector in which tf-idf values of the respective words are arranged in a predetermined word order every predetermined period (step S102).

重要度算出部１０１は、全期間のドキュメントでワードベクトルを算出したか判定する（ステップＳ１０３）。重要度算出部１０１は、全期間のドキュメントでワードベクトルを算出していない場合（ステップＳ１０３ＮＯ）、ステップＳ１０１の処理に戻る。一方、重要度算出部１０１が、全期間のドキュメントでワードベクトルを算出した場合（ステップＳ１０３ＹＥＳ）、クラスタ生成部１０３は、クラスタを生成する（ステップＳ１０４）。 The importance calculation unit 101 determines whether the word vector has been calculated for the document for the entire period (step S103). If the word vector is not calculated for the document for the entire period (NO in step S103), the importance calculation unit 101 returns to the process of step S101. On the other hand, when the importance calculation unit 101 calculates a word vector for the document for the entire period (YES in step S103), the cluster generation unit 103 generates a cluster (step S104).

Next, the cluster generation unit 103 calculates, for each word, its degree of membership in each cluster (step S105). The cluster generation unit 103 then extracts a cluster label for each cluster (step S106), associates the identification information of each cluster with the information indicating its label, and stores them in the cluster storage unit 104 (step S107). Next, the cluster generation unit 103 stores, for each cluster, the information indicating each word in association with the information indicating its degree of membership in that cluster in the cluster storage unit 104 (step S108). This completes the processing of this flowchart.

以上により、抽出装置１００は、記事集合Ｄから所定期間毎の一区切りのドキュメント中に掲載された各単語の重要度を算出することができる。また、抽出装置１００は、記事集合Ｄからクラスタを生成することができる。 As described above, the extraction apparatus 100 can calculate the importance of each word posted in a document separated by a predetermined period from the article set D. Further, the extraction apparatus 100 can generate a cluster from the article set D.

図６は、本実施形態の抽出装置１００がクラスタの組み合わせを抽出する処理の流れを示したフローチャートである。まず、間接関連度算出部１１１は、最大間接関連度ＭＩＲを算出する（ステップＳ２０１）。次に、間接関連度算出部１１１は、全てのクラスタの組み合わせで最大間接関連度ＭＩＲを算出したか否か判定する（ステップＳ２０２）。間接関連度算出部１１１は、全てのクラスタの組み合わせで最大間接関連度ＭＩＲを算出していない場合（ステップＳ２０２ＮＯ）、ステップＳ２０１の処理に戻る。 FIG. 6 is a flowchart showing a flow of processing in which the extraction apparatus 100 according to the present embodiment extracts a combination of clusters. First, the indirect association degree calculation unit 111 calculates the maximum indirect association degree MIR (step S201). Next, the indirect association degree calculation unit 111 determines whether or not the maximum indirect association degree MIR has been calculated for all combinations of clusters (step S202). If the indirect relevance calculation unit 111 has not calculated the maximum indirect relevance MIR for all combinations of clusters (NO in step S202), the process returns to step S201.

一方、間接関連度算出部１１１が全てのクラスタの組み合わせで最大間接関連度ＭＩＲを算出した場合（ステップＳ２０２ＹＥＳ）、意外度算出部１１２は、意外度Ｕを算出する（ステップＳ２０３）。次に、意外度算出部１１２は、全てのクラスタの組み合わせで意外度Ｕを算出したか否か判定する（ステップＳ２０４）。意外度算出部１１２は、全てのクラスタの組み合わせで意外度Ｕを算出していない場合（ステップＳ２０４ＮＯ）、ステップＳ２０３の処理に戻る。 On the other hand, when the indirect association degree calculation unit 111 calculates the maximum indirect association degree MIR for all combinations of clusters (YES in step S202), the unexpected degree calculation unit 112 calculates the unexpected degree U (step S203). Next, the unexpectedness degree calculation unit 112 determines whether or not the unexpectedness degree U has been calculated for all combinations of clusters (step S204). If the unexpectedness degree calculation unit 112 has not calculated the unexpectedness degree U for all combinations of clusters (NO in step S204), the process returns to the process in step S203.

一方、意外度算出部１１２が全てのクラスタの組み合わせで意外度Ｕを算出した場合（ステップＳ２０４ＹＥＳ）、積算部１１３は、発見性指標を算出する（ステップＳ２０５）。次に、積算部１１３は、全期間のドキュメントで発見性指標を算出したか否か判定する（ステップＳ２０６）。積算部１１３は、全期間のドキュメントで発見性指標を算出していない場合（ステップＳ２０６ＮＯ）、ステップＳ２０１の処理に戻る。 On the other hand, when the unexpectedness degree calculation unit 112 calculates the unexpectedness degree U for all the combinations of clusters (YES in step S204), the integrating unit 113 calculates a discoverability index (step S205). Next, the integrating unit 113 determines whether or not the heuristic index is calculated for the document for the entire period (step S206). The accumulation unit 113 returns to the process of step S201 when the discoverability index is not calculated for the document of the entire period (NO in step S206).

一方、積算部１１３が全期間のドキュメントで発見性指標を算出した場合（ステップＳ２０６ＹＥＳ）、ターゲット関連性指数算出部１１４は、ターゲット関連性指数を算出する（ステップＳ２０７）。 On the other hand, when the integrating unit 113 calculates the discoverability index for the document for the entire period (step S206 YES), the target relevance index calculating unit 114 calculates the target relevance index (step S207).

In parallel with the processing of steps S201 to S207, the extraction apparatus 100 performs the processing of steps S208 to S215. It first initializes i, j, and k. Next, the activity calculation unit 121 calculates the activity of the i-th cluster C_i in the k-th period (step S208). The activity calculation unit 121 then determines whether the activity has been calculated for all clusters (step S209). If not (NO in step S209), it increments i by 1 (step S210) and returns to step S208.

一方、活性度算出部１２１が全てのクラスタの活性度を算出した場合（ステップＳ２０９ＹＥＳ）、活性度算出部１２１は、全期間のドキュメントで活性度を算出したか否か判定する（ステップＳ２１１）。活性度算出部１２１は、全期間のドキュメントで活性度を算出していない場合（ステップＳ２１１ＮＯ）、ｋを１増やし（ステップＳ２１２）、ステップＳ２０８の処理に戻る。 On the other hand, when the activity calculation unit 121 has calculated the activity of all clusters (YES in step S209), the activity calculation unit 121 determines whether the activity has been calculated for the documents of all periods (step S211). If the activity calculation unit 121 has not calculated the activity for the documents of all periods (NO in step S211), it increments k by 1 (step S212), and the process returns to step S208.
一方、活性度算出部１２１が全期間のドキュメントで活性度を算出した場合（ステップＳ２１１ＹＥＳ）、相対力指数算出部１２２は、ｊ番目のクラスタＣ＿ｊの相対力指数（ＲＳＩ）を算出する（ステップＳ２１３）。 On the other hand, when the activity calculation unit 121 has calculated the activity for the documents of all periods (YES in step S211), the relative strength index calculation unit 122 calculates the relative strength index (RSI) of the j-th cluster C_j (step S213).

次に、相対力指数算出部１２２は、全てのクラスタの相対力指数（ＲＳＩ）を算出したか否か判定する（ステップＳ２１４）。相対力指数算出部１２２は、全てのクラスタの相対力指数（ＲＳＩ）を算出していない場合（ステップＳ２１４ＮＯ）、ｊを１増やし（ステップＳ２１５）、ステップＳ２１３の処理に戻る。 Next, the relative strength index calculation unit 122 determines whether the relative strength index (RSI) of all clusters has been calculated (step S214). If the relative strength index calculation unit 122 has not calculated the relative strength index (RSI) of all clusters (NO in step S214), it increments j by 1 (step S215), and the process returns to step S213.
一方、相対力指数算出部１２２が、全てのクラスタの相対力指数（ＲＳＩ）を算出した場合（ステップＳ２１４ＹＥＳ）、抽出装置１００は、ステップＳ２１６の処理に進む。 On the other hand, when the relative strength index calculation unit 122 has calculated the relative strength index (RSI) of all clusters (YES in step S214), the extraction device 100 proceeds to step S216.
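The two nested loops of steps S208 to S215 can be illustrated with the short sketch below. It is written under several assumptions that are not taken from this section: the activity of a cluster in a period is taken as the sum, over the cluster's words, of the degree of affiliation times the word's importance in that period; the RSI is the simple, unsmoothed form used in technical analysis (total gains divided by total gains plus total losses, times 100); and a flat series is assigned a neutral value of 50. All function names are hypothetical.

```python
from typing import Dict, List


def activity(membership: Dict[str, float], importance: Dict[str, float]) -> float:
    """Activity of one cluster in one period (assumed form).

    membership maps each word to its degree of affiliation to the cluster;
    importance maps each word to its importance in the period.
    """
    return sum(deg * importance.get(word, 0.0) for word, deg in membership.items())


def rsi(series: List[float]) -> float:
    """Simple, unsmoothed relative strength index of an activity series."""
    steps = list(zip(series, series[1:]))
    gains = sum(max(later - earlier, 0.0) for earlier, later in steps)
    losses = sum(max(earlier - later, 0.0) for earlier, later in steps)
    if gains + losses == 0:
        return 50.0  # assumption: a flat series is treated as neutral
    return 100.0 * gains / (gains + losses)


def activity_and_rsi(
    clusters: Dict[str, Dict[str, float]],
    importance_by_period: List[Dict[str, float]],
):
    # Steps S208 to S212: activity of every cluster in every period.
    activities = {
        name: [activity(membership, imp) for imp in importance_by_period]
        for name, membership in clusters.items()
    }
    # Steps S213 to S215: RSI of every cluster from its activity series.
    indices = {name: rsi(series) for name, series in activities.items()}
    return activities, indices
```

An RSI near 100 means the cluster's activity has almost only risen over the observed periods, and a value near 0 means it has almost only fallen.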

次に、ステップＳ２１６において、クラスタ組抽出部１３０は、活性化予測条件を満たす下で、新規発見性指数とターゲット関連性指数とに基づいた評価値が最大になるクラスタの組み合わせを抽出する（ステップＳ２１６）。以上で、本フローチャートの処理を終了する。 Next, in step S216, the cluster set extraction unit 130 extracts, subject to the activation prediction condition being satisfied, the combination of clusters that maximizes the evaluation value based on the discoverability index and the target relevance index (step S216). This completes the processing of this flowchart.
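Step S216 amounts to a constrained maximization, which might be sketched as below. This is a hedged illustration only: the evaluation value is assumed to be a weighted sum of the discoverability index and the target relevance index (as in the dependent claim on the weighted sum), the activation prediction condition is assumed to mean that at least one cluster of the combination is predicted to be activated, and the weight value, the function name `extract_best_triple`, and the set `activated` are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def extract_best_triple(
    discoverability: Dict[Triple, float],
    target_relevance: Dict[Triple, float],
    activated: Set[str],
    weight: float = 0.5,  # assumed weighting, not specified in the source
) -> Tuple[Optional[Triple], float]:
    """Return the cluster combination with the highest evaluation value.

    Only combinations in which at least one cluster is predicted to be
    activated are considered (assumed activation prediction condition).
    """
    best, best_score = None, float("-inf")
    for triple, d in discoverability.items():
        if not any(cluster in activated for cluster in triple):
            continue  # activation prediction condition not met
        score = weight * d + (1.0 - weight) * target_relevance.get(triple, 0.0)
        if score > best_score:
            best, best_score = triple, score
    return best, best_score
```

If no combination satisfies the condition, the sketch returns `None` with a score of negative infinity, so a caller can tell "nothing qualified" apart from a low-scoring result.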

以上により、本実施形態の抽出装置１００は、抽出された３つのクラスタのうち少なくとも１つが活性化されていること、抽出された２つのクラスタの組み合わせに意外性があること、その２つのクラスタの組み合わせは直接の関連性は薄いが、抽出されたもう１つのクラスタ（第３のクラスタ）を経由すると結び付けられるものであること、そのクラスタの組み合わせを提供する対象であるターゲットの特性と抽出されたクラスタのうち少なくとも１つとが関連性があることという条件下で、クラスタの組み合わせを提供することができる。各クラスタは１つの概念と対応しているので、抽出装置１００は、所定の期間において、そのターゲットにとって意外性があり、第３のクラスタに対応する第３の概念を介して結び付けられる概念の組み合わせを提供することができる。 As described above, the extraction device 100 of the present embodiment can provide a combination of clusters under the following conditions: at least one of the three extracted clusters is activated; the combination of two of the extracted clusters is unexpected; those two clusters have little direct relevance to each other but are linked via the other extracted cluster (the third cluster); and at least one of the extracted clusters is relevant to the characteristics of the target to which the combination of clusters is provided. Since each cluster corresponds to one concept, the extraction device 100 can provide, for a predetermined period, a combination of concepts that is unexpected for the target and that is linked via a third concept corresponding to the third cluster.

また、本実施形態の抽出装置１００の各処理を実行するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、当該記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することにより、抽出装置１００に係る上述した種々の処理を行ってもよい。 In addition, the various processes of the extraction device 100 described above may be performed by recording a program for executing each process of the extraction device 100 of the present embodiment on a computer-readable recording medium and causing a computer system to read and execute the program recorded on that recording medium.

なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものであってもよい。また、「コンピュータシステム」は、ＷＷＷシステムを利用している場合であれば、ホームページ提供環境（あるいは表示環境）も含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、フラッシュメモリ等の書き込み可能な不揮発性メモリ、ＣＤ−ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。 The "computer system" referred to here may include an OS and hardware such as peripheral devices. When a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment). The "computer-readable recording medium" refers to a flexible disk, a magneto-optical disk, a ROM, a writable nonvolatile memory such as a flash memory, a portable medium such as a CD-ROM, or a storage device such as a hard disk built into a computer system.

さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムが送信された場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリ（例えばＤＲＡＭ（ＤｙｎａｍｉｃＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ））のように、一定時間プログラムを保持しているものも含むものとする。また、上記プログラムは、このプログラムを記憶装置等に格納したコンピュータシステムから、伝送媒体を介して、あるいは、伝送媒体中の伝送波により他のコンピュータシステムに伝送されてもよい。ここで、プログラムを伝送する「伝送媒体」は、インターネット等のネットワーク（通信網）や電話回線等の通信回線（通信線）のように情報を伝送する機能を有する媒体のことをいう。また、上記プログラムは、前述した機能の一部を実現するためのものであっても良い。さらに、前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるもの、いわゆる差分ファイル（差分プログラム）であっても良い。 Furthermore, the "computer-readable recording medium" also includes a medium that holds the program for a certain period of time, such as a volatile memory (for example, a DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line. The program may also be transmitted from a computer system that stores the program in a storage device or the like to another computer system via a transmission medium or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may be one that realizes only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

以上、本発明の実施形態について図面を参照して詳述したが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 Although an embodiment of the present invention has been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment and also includes designs and the like within a scope that does not depart from the gist of the invention.

１００抽出装置 100 Extraction device
１０１重要度算出部 101 Importance calculation unit
１０２重要度記憶部 102 Importance storage unit
１０３クラスタ生成部 103 Cluster generation unit
１０４クラスタ記憶部 104 Cluster storage unit
１１０発見性指数算出部 110 Discoverability index calculation unit
１１１間接関連度算出部 111 Indirect relevance calculation unit
１１２意外度算出部 112 Unexpectedness calculation unit
１１３積算部 113 Integrating unit
１１４ターゲット関連性指数算出部 114 Target relevance index calculation unit
１２０活性化予測部 120 Activation prediction unit
１２１活性度算出部 121 Activity calculation unit
１２２相対力指数算出部（活性度上昇期待値算出部） 122 Relative strength index calculation unit (activity increase expected value calculation unit)
１３０クラスタ組抽出部 130 Cluster set extraction unit

Claims

An extraction device comprising:
a cluster storage unit that stores, for each cluster, information indicating a word in association with information indicating the degree of affiliation with which the word belongs to the cluster, and that stores the information indicating the word in association with information indicating the position of the word;
a discoverability index calculation unit that reads from the cluster storage unit, for three or more clusters, the information indicating the position of each word whose degree of affiliation is equal to or greater than a predetermined value, and that, based on the information indicating the position of the word, calculates a discoverability index by multiplying the indirect relevance between two clusters via a third cluster other than the two clusters by the unexpectedness of combining the two clusters;
a target relevance index calculation unit that reads the information indicating the degree of affiliation for each cluster from the cluster storage unit and, based on the read information indicating the degree of affiliation and on information indicating characteristics of a target input from outside the device, calculates a target relevance index indicating the relevance between the target and the two clusters and the third cluster; and
a cluster set extraction unit that extracts a combination of the clusters based on the calculated discoverability index and the target relevance index.

The extraction device according to claim 1, further comprising:
an importance storage unit that stores, for each predetermined period, the information indicating the word in association with information indicating the importance of the word; and
an activation prediction unit that reads the information indicating the importance of the word for each predetermined period from the importance storage unit, reads the information indicating the degree of affiliation for each cluster from the cluster storage unit, and predicts, for each predetermined period, activation of each cluster based on the information indicating the importance of the word and the information indicating the degree of affiliation,
wherein the cluster set extraction unit extracts the combination of the clusters based on the discoverability index and the target relevance index when the activation prediction unit predicts activation of at least one cluster in the combination of clusters.

The extraction device according to claim 2, wherein the activation prediction unit comprises:
an activity calculation unit that calculates, for each predetermined period, the activity of a predetermined cluster based on the information indicating the degree of affiliation of the words belonging to that cluster and the information, read from the importance storage unit, indicating the importance of those words in the period; and
an activity increase expected value calculation unit that calculates, based on the calculated activity, an activity increase expected value, which is the degree to which an increase in the activity of each cluster is expected,
and wherein the activation prediction unit predicts activation of the cluster based on the calculated activity and the calculated activity increase expected value.

The extraction device according to any one of claims 1 to 3, wherein the discoverability index increases as the indirect relevance and the unexpectedness increase,
and the cluster set extraction unit extracts the combination of the clusters based on a weighted sum of the discoverability index and the target relevance index.

An extraction method executed by an extraction device including a cluster storage unit that stores, for each cluster, information indicating a word in association with information indicating the degree of affiliation with which the word belongs to the cluster, and that stores the information indicating the word in association with information indicating the position of the word, the extraction method comprising:
a discoverability index calculation procedure of reading from the cluster storage unit, for three or more clusters, the information indicating the position of each word whose degree of affiliation is equal to or greater than a predetermined value, and calculating, based on the information indicating the position of the word, a discoverability index by multiplying the indirect relevance between two clusters via a third cluster other than the two clusters by the unexpectedness of combining the two clusters;
a target relevance index calculation procedure of reading the information indicating the degree of affiliation for each cluster from the cluster storage unit and calculating, based on the read information indicating the degree of affiliation and on information indicating characteristics of a target input from outside the device, a target relevance index indicating the relevance between the target and the two clusters and the third cluster; and
a cluster set extraction procedure of extracting a combination of the clusters based on the calculated discoverability index and the target relevance index.

An extraction program for causing a computer of an extraction device, the extraction device including a cluster storage unit that stores, for each cluster, information indicating a word in association with information indicating the degree of affiliation with which the word belongs to the cluster, and that stores the information indicating the word in association with information indicating the position of the word, to execute:
a discoverability index calculation step of reading from the cluster storage unit, for three or more clusters, the information indicating the position of each word whose degree of affiliation is equal to or greater than a predetermined value, and calculating, based on the information indicating the position of the word, a discoverability index by multiplying the indirect relevance between two clusters via a third cluster other than the two clusters by the unexpectedness of combining the two clusters;
a target relevance index calculation step of reading the information indicating the degree of affiliation for each cluster from the cluster storage unit and calculating, based on the read information indicating the degree of affiliation and on information indicating characteristics of a target input from outside the device, a target relevance index indicating the relevance between the target and the two clusters and the third cluster; and
a cluster set extraction step of extracting a combination of the clusters based on the calculated discoverability index and the target relevance index.