JP5503632B2

JP5503632B2 - Feature word extraction method, apparatus, and program

Info

Publication number: JP5503632B2
Application number: JP2011286869A
Authority: JP
Inventors: 九月貞光; 邦子齋藤; 賢治今村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-12-27
Filing date: 2011-12-27
Publication date: 2014-05-28
Anticipated expiration: 2031-12-27
Also published as: JP2013134750A

Description

The present invention relates to a feature word extraction method, apparatus, and program, and more particularly to a feature word extraction method, apparatus, and program for extracting feature words for each topic of a hierarchical topic model in which the topics have a hierarchical structure.

Conventionally, there are topic models that perform probabilistic clustering by treating the words contained in a document set as features and each document as one data point. There are also hierarchical topic models, in which the topics have a hierarchical structure. A method has been proposed for extracting feature words for each topic of such a topic model based on the ratio between the parameter average over the entire topic model and the parameter of each topic (see, for example, Non-Patent Document 1).

D. Blei, A. Y. Ng, M. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3 (2003) 993-1022

However, when the method described in Non-Patent Document 1 is applied to a topic in an intermediate hierarchy of a hierarchical topic model, words characteristic of only one of the topics located below that intermediate topic tend to be selected as its feature words. As a result, feature words suited to the intermediate topic, that is, feature words common to the concepts at and below that intermediate hierarchy, cannot be extracted.

The present invention has been made in view of the above circumstances, and its object is to provide a feature word extraction method, apparatus, and program capable of extracting, as feature words for a topic in an intermediate hierarchy of a hierarchical topic model, feature words common to the concepts at and below that intermediate hierarchy.

To achieve the above object, the feature word extraction method of the present invention includes: a target parameter extraction step of extracting a target parameter representing a topic to be processed from a hierarchical topic model that includes parameters representing each of a plurality of topics for probabilistically clustering document data and in which the topics have a hierarchical structure; a peer hierarchy parameter extraction step of extracting, from the hierarchical topic model, peer hierarchy parameters representing topics in the same hierarchy as the topic to be processed; a lower hierarchy parameter extraction step of extracting, from the hierarchical topic model, lower hierarchy parameters representing the lower-hierarchy topics corresponding to the topic to be processed; a score calculation step of calculating, for each word included in the target parameter, a score based on the intra-topic variance in the lower hierarchy and the inter-topic variance in the peer hierarchy, using the target parameter, the peer hierarchy parameters, and the lower hierarchy parameters; and a feature word extraction step of extracting feature words for the topic to be processed based on the calculated score of each word included in the target parameter.

According to the feature word extraction method of the present invention, the target parameter extraction step extracts a target parameter representing the topic to be processed from a hierarchical topic model that includes parameters representing each of a plurality of topics for probabilistically clustering document data and in which the topics have a hierarchical structure. The peer hierarchy parameter extraction step extracts, from the hierarchical topic model, peer hierarchy parameters representing topics in the same hierarchy as the topic to be processed, and the lower hierarchy parameter extraction step extracts, from the hierarchical topic model, lower hierarchy parameters representing the lower-hierarchy topics corresponding to the topic to be processed. The score calculation step then uses the target parameter, the peer hierarchy parameters, and the lower hierarchy parameters to calculate, for each word included in the target parameter, a score based on the intra-topic variance in the lower hierarchy and the inter-topic variance in the peer hierarchy, and the feature word extraction step extracts feature words for the topic to be processed based on the calculated score of each word included in the target parameter.

Because feature words are extracted by calculating a score based on the intra-topic variance in the lower hierarchy and the inter-topic variance in the peer hierarchy, feature words common to the concepts at and below an intermediate hierarchy can be extracted as the feature words for a topic in that intermediate hierarchy of the hierarchical topic model.

The score can be made higher as the variance of each word among the lower-hierarchy topics is smaller and as its variance among the peer-hierarchy topics is larger. By directly reflecting in the score the intuitive properties of feature words that are common to the concepts at and below the intermediate hierarchy, appropriate feature words can be extracted for a topic in the intermediate hierarchy.

The feature word extraction device of the present invention includes: target parameter extraction means for extracting a target parameter representing a topic to be processed from a hierarchical topic model that includes parameters representing each of a plurality of topics for probabilistically clustering document data and in which the topics have a hierarchical structure; peer hierarchy parameter extraction means for extracting, from the hierarchical topic model, peer hierarchy parameters representing topics in the same hierarchy as the topic to be processed; lower hierarchy parameter extraction means for extracting, from the hierarchical topic model, lower hierarchy parameters representing the lower-hierarchy topics corresponding to the topic to be processed; score calculation means for calculating, for each word included in the target parameter, a score based on the intra-topic variance in the lower hierarchy and the inter-topic variance in the peer hierarchy, using the target parameter, the peer hierarchy parameters, and the lower hierarchy parameters; and feature word extraction means for extracting feature words for the topic to be processed based on the calculated score of each word included in the target parameter.

The feature word extraction program of the present invention is a program for causing a computer to execute each step of the above feature word extraction method.

As described above, the feature word extraction method, apparatus, and program of the present invention extract feature words by calculating a score based on the intra-topic variance in the lower hierarchy and the inter-topic variance in the peer hierarchy. This yields the effect that feature words common to the concepts at and below an intermediate hierarchy can be extracted as the feature words for a topic in that intermediate hierarchy of the hierarchical topic model.

FIG. 1 is a functional block diagram showing the configuration of the feature word extraction device according to the present embodiment.
FIG. 2 is a schematic diagram showing a hierarchical topic model.
FIG. 3 is a flowchart showing the contents of the feature word extraction processing routine in the feature word extraction device according to the present embodiment.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

The feature word extraction device 10 according to the present embodiment extracts feature words for each topic of a hierarchical topic model. The feature word extraction device 10 can be configured as a computer including a CPU, a RAM, and a ROM that stores a program for executing the feature word extraction processing routine described later. Functionally, as shown in FIG. 1, this computer can be represented as a configuration including a target component model extraction unit 12, a peer hierarchy component model extraction unit 14, a lower hierarchy component model extraction unit 16, a score calculation unit 18, and a feature word extraction unit 20.

Here, a topic model is a model for performing probabilistic clustering in which the words (v) contained in a document set are the features and each document is one data point (d). Using the topic model parameters, it is expressed by equation (1) below.

In equation (1), the topic model parameters are p(z) (a 1×K matrix) and p(v|z) (a K×V matrix). Here z is a random variable representing a topic, p(z) is the prior probability of z, p(v|z) is the probability of word v under the multinomial distribution given z, and n_dv is the number of times word v appears in document data d.
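Equation (1) is an image in the original publication and is not reproduced in this text. A form consistent with the symbols defined above is the mixture-of-multinomials likelihood sketched below. This is an assumption reconstructed from the definitions, not the published equation, and it further assumes that K is the number of topics and V the vocabulary size.

```latex
% Assumed reconstruction of equation (1): likelihood of document d
% under a mixture of K multinomial topics over a vocabulary of V words.
p(d) = \sum_{z=1}^{K} p(z) \prod_{v=1}^{V} p(v \mid z)^{n_{dv}} \tag{1}
```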

A hierarchical topic model is a topic model in which the topics have a hierarchical structure. For example, as shown in FIG. 2, the first hierarchy contains topic 1, topic 2, topic 3, and topic 4, and the second hierarchy contains topic 2-1, topic 2-2, and topic 2-3, obtained by dividing topic 2 into three. Each topic in each hierarchy is represented by the topic model parameters described above. Topic 2 of the first hierarchy retains its own pre-division topic model parameters as they are, separately from the topic model parameters representing each of the post-division second-hierarchy topics 2-1, 2-2, and 2-3.

The target component model extraction unit 12 takes the hierarchical topic model 22 as input and extracts from it the topic model component (parameter) of the target topic number k (hereinafter, the "target component model"). Here, p(v|k) (a 1×V matrix) is extracted as the target component model 24. The following description assumes k = 2, that is, topic 2 in FIG. 2 is the target topic.

The peer hierarchy component model extraction unit 14 takes the hierarchical topic model 22 as input and extracts from it the topic model components (parameters) of the topics that are in the same hierarchy as the topic with the target topic number k (hereinafter, "peer hierarchy component models"). Here, p(v|t_B) is extracted as a peer hierarchy component model, where t_B is the topic number of a topic in the peer hierarchy. When there are multiple topics in the peer hierarchy, multiple peer hierarchy component models are extracted and together form the peer hierarchy component model group 26. For example, if topic 2 is the target topic k, the topic model components of topic 1, topic 3, and topic 4 of the first hierarchy are extracted as the peer hierarchy component models.

The lower hierarchy component model extraction unit 16 takes the hierarchical topic model 22 as input and extracts from it the topic model components (parameters) of the topics that are in the hierarchy below the topic with the target topic number k (hereinafter, "lower hierarchy component models"). Here, p(v|t_W) is extracted as a lower hierarchy component model, where t_W is the topic number of a topic in the lower hierarchy. When there are multiple topics in the lower hierarchy, multiple lower hierarchy component models are extracted and together form the lower hierarchy component model group 28. For example, if topic 2 is the target topic k, the topic model components of topic 2-1, topic 2-2, and topic 2-3 of the second hierarchy are extracted as the lower hierarchy component models.
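The three extraction units can be illustrated with a minimal sketch. The `Topic` class, the parent-pointer representation of the hierarchy, the function names, and the placeholder probabilities are all hypothetical; the patent does not specify how the hierarchical topic model 22 is stored.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Topic:
    """One topic of the hierarchy (hypothetical representation)."""
    parent: Optional[str]          # None for first-hierarchy topics
    word_probs: Dict[str, float]   # p(v|t) for this topic


def extract_target(model: Dict[str, Topic], k: str) -> Dict[str, float]:
    """Target component model p(v|k)."""
    return model[k].word_probs


def extract_peers(model: Dict[str, Topic], k: str) -> Dict[str, Dict[str, float]]:
    """Peer hierarchy component models p(v|t_B): same parent as k, excluding k."""
    parent = model[k].parent
    return {tid: t.word_probs for tid, t in model.items()
            if tid != k and t.parent == parent}


def extract_lower(model: Dict[str, Topic], k: str) -> Dict[str, Dict[str, float]]:
    """Lower hierarchy component models p(v|t_W): topics whose parent is k."""
    return {tid: t.word_probs for tid, t in model.items() if t.parent == k}


# The structure of FIG. 2, with placeholder probabilities (assumed values).
placeholder = {"word": 1.0}
model = {
    "1": Topic(None, placeholder), "2": Topic(None, placeholder),
    "3": Topic(None, placeholder), "4": Topic(None, placeholder),
    "2-1": Topic("2", placeholder), "2-2": Topic("2", placeholder),
    "2-3": Topic("2", placeholder),
}

peer_ids = sorted(extract_peers(model, "2"))    # topics 1, 3, 4
lower_ids = sorted(extract_lower(model, "2"))   # topics 2-1, 2-2, 2-3
```

With k = 2 this reproduces the example in the text: the peers are topics 1, 3, and 4, and the lower-hierarchy topics are 2-1, 2-2, and 2-3.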

The score calculation unit 18 calculates a score for every word included in the target component model 24, and extracts words whose score is equal to or greater than a predetermined threshold as feature words.

The principle behind the score calculated by the score calculation unit 18 is as follows. The aim of the present invention is to extract, for a topic in an intermediate hierarchy, feature words that are common to the concepts of its lower hierarchy. Intuitively, such feature words have the following properties:
1. Their variance among the lower-hierarchy topics is small (commonality).
2. Their variance among the peer-hierarchy topics is large (distinctiveness).
Building these properties directly into the score function leads to the extraction of the desired common feature words. A score function that takes into account both the intra-topic variance in the lower hierarchy and the inter-topic variance in the peer hierarchy is therefore used.

In accordance with the above principle, the score calculation unit 18 can, for example, calculate for each word the intra-topic variance σ_W(v) in the lower hierarchy by equation (2) below and the inter-topic variance σ_B(v) in the peer hierarchy by equation (3) below, and calculate their ratio (equation (4)) as the score.
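Equations (2) to (4) are images in the original publication and are not reproduced in this text. The block below is one plausible reading, offered as an assumption only: deviations are measured from the target topic's parameter p(v|k), the sums are normalized by the topic counts c_W and c_B, and the ratio is oriented so that the score rises as σ_W(v) falls and σ_B(v) grows, as the description requires. The published equations may differ in these details.

```latex
% Assumed reconstruction, not the published equations.
\sigma_W(v) = \frac{1}{c_W} \sum_{t_W} \bigl( p(v \mid t_W) - p(v \mid k) \bigr)^2 \tag{2}

\sigma_B(v) = \frac{1}{c_B} \sum_{t_B} \bigl( p(v \mid t_B) - p(v \mid k) \bigr)^2 \tag{3}

\mathrm{score}(v) = \frac{\sigma_B(v)}{\sigma_W(v)} \tag{4}
```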

Here, v is a word, k is the topic number of the target topic, t_W is the topic number of a topic in the lower hierarchy, t_B is the topic number of a topic in the peer hierarchy, c_W is the number of topics in the lower hierarchy (3 in this example), and c_B is the number of topics in the peer hierarchy (3 in this example). The score used by the score calculation unit 18 is not limited to the above; any score that expresses the properties of common feature words described above may be used.
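As a worked illustration of the scoring principle, the sketch below computes a ratio-type score on toy data. It rests on stated assumptions rather than on the published equations: deviations are taken from the target parameter p(v|k), the score is σ_B(v) divided by σ_W(v), and a small constant `eps` (not in the source) guards against division by zero. The function names and all probability values are hypothetical, and each distribution is truncated to four vocabulary entries, so rows need not sum to 1.

```python
from typing import Dict, List

WordProbs = Dict[str, float]


def mean_sq_dev(target: WordProbs, others: List[WordProbs], v: str) -> float:
    """Mean squared deviation of p(v|t) over `others` from p(v|k) (assumed form)."""
    return sum((t.get(v, 0.0) - target[v]) ** 2 for t in others) / len(others)


def score_words(target: WordProbs, peers: List[WordProbs],
                lower: List[WordProbs], eps: float = 1e-12) -> Dict[str, float]:
    """Score = sigma_B(v) / (sigma_W(v) + eps); eps is an added safeguard."""
    scores = {}
    for v in target:
        sigma_w = mean_sq_dev(target, lower, v)   # intra-topic variance, lower hierarchy
        sigma_b = mean_sq_dev(target, peers, v)   # inter-topic variance, peer hierarchy
        scores[v] = sigma_b / (sigma_w + eps)
    return scores


# Toy data: "game" is shared by all subtopics of topic 2, "goal" and "bat"
# each belong to one subtopic, and "the" is equally common everywhere.
target = {"game": 0.40, "goal": 0.15, "bat": 0.15, "the": 0.30}   # p(v|k), k = 2
lower = [                                                          # topics 2-1, 2-2, 2-3
    {"game": 0.42, "goal": 0.30, "bat": 0.00, "the": 0.30},
    {"game": 0.38, "goal": 0.00, "bat": 0.30, "the": 0.30},
    {"game": 0.40, "goal": 0.15, "bat": 0.15, "the": 0.30},
]
peers = [                                                          # topics 1, 3, 4
    {"game": 0.05, "goal": 0.05, "bat": 0.05, "the": 0.30},
    {"game": 0.00, "goal": 0.10, "bat": 0.00, "the": 0.30},
    {"game": 0.10, "goal": 0.00, "bat": 0.10, "the": 0.30},
]

scores = score_words(target, peers, lower)
```

On this toy data, "game" receives by far the highest score because it is stable across the subtopics yet rare among the peers, "goal" and "bat" score much lower because each varies strongly across the subtopics, and "the" scores zero because it does not distinguish topic 2 from its peers.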

The feature word extraction unit 20 sorts the words by the scores calculated by the score calculation unit 18, and extracts and outputs the top N words as the feature words for topic k. Alternatively, feature words may be extracted using a predetermined threshold, according to the result of comparing each score with the threshold.
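The two selection rules can be sketched as follows. The function names, the example scores, and the alphabetical tie-break are assumptions made for illustration; the description only specifies sorting by score and taking the top N words, or comparing each score with a threshold.

```python
from typing import Dict, List


def rank_words(scores: Dict[str, float]) -> List[str]:
    """Words in descending score order; ties broken alphabetically (assumption)."""
    return [w for w, _ in sorted(scores.items(), key=lambda item: (-item[1], item[0]))]


def top_n_feature_words(scores: Dict[str, float], n: int) -> List[str]:
    """Top-N rule: keep the N highest-scoring words."""
    return rank_words(scores)[:n]


def feature_words_above(scores: Dict[str, float], threshold: float) -> List[str]:
    """Threshold rule: keep words whose score is at least the threshold."""
    return [w for w in rank_words(scores) if scores[w] >= threshold]


# Hypothetical scores for four words of the target topic.
example_scores = {"game": 465.0, "goal": 0.78, "bat": 0.78, "the": 0.0}

top_two = top_n_feature_words(example_scores, 2)
above_half = feature_words_above(example_scores, 0.5)
```

The top-N rule always returns a fixed number of words, while the threshold rule returns a variable number that depends on the scale of the score, so the threshold has to be tuned to the score function in use.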

Next, the feature word extraction processing routine executed by the feature word extraction device 10 of the present embodiment will be described with reference to FIG. 3.

In step 100, the hierarchical topic model 22 is acquired, and the target component model 24 (p(v|k)) for the target topic number k is extracted from it. Next, in step 102, the peer hierarchy component model group 26 (p(v|t_B)) for the topics in the same hierarchy as the topic with topic number k is extracted from the hierarchical topic model 22. Next, in step 104, the lower hierarchy component model group 28 (p(v|t_W)) for the topics in the hierarchy below the topic with topic number k is extracted from the hierarchical topic model 22.

Next, in step 106, a score is calculated for every word included in the target component model 24 using a score function that takes into account the intra-topic variance in the lower hierarchy and the inter-topic variance in the peer hierarchy. Next, in step 108, the words are sorted by the scores calculated in step 106, the top N words are extracted and output as the feature words for topic k, and the feature word extraction processing ends.

As described above, the feature word extraction device of the present embodiment extracts the parameters (component models) of the topic targeted for feature word extraction, of the topics in its peer hierarchy, and of the topics in its lower hierarchy. It calculates a score that takes into account the intra-topic variance in the lower hierarchy and the inter-topic variance in the peer hierarchy, and extracts, as feature words for the target topic, words with small variance among the lower-hierarchy topics and large variance among the peer-hierarchy topics. It can therefore extract, as feature words for a topic in an intermediate hierarchy of a hierarchical topic model, feature words common to the concepts at and below that intermediate hierarchy.

The present invention is not limited to the above embodiment, and various modifications and applications are possible without departing from the gist of the invention.

In this specification, the embodiment has been described as one in which the program is installed in advance, but the program can also be provided stored on a computer-readable recording medium. The feature word extraction device of the present invention may also be implemented in hardware, such as a semiconductor integrated circuit, that realizes the above processing.

Description of symbols
10 Feature word extraction device
12 Target component model extraction unit
14 Peer hierarchy component model extraction unit
16 Lower hierarchy component model extraction unit
18 Score calculation unit
20 Feature word extraction unit

Claims

1. A feature word extraction method comprising:
a target parameter extraction step of extracting a target parameter representing a processing target topic from a hierarchical topic model that includes parameters representing each of a plurality of topics for probabilistic clustering of document data and in which the topics have a hierarchical structure;
a peer hierarchy parameter extraction step of extracting, from the hierarchical topic model, a peer hierarchy parameter representing a topic in the same hierarchy as the processing target topic;
a lower hierarchy parameter extraction step of extracting, from the hierarchical topic model, a lower hierarchy parameter representing a lower-hierarchy topic corresponding to the processing target topic;
a score calculation step of calculating, for each word included in the target parameter, a score based on the intra-topic variance of the lower hierarchy and the inter-topic variance of the peer hierarchy, using the target parameter, the peer hierarchy parameter, and the lower hierarchy parameter; and
a feature word extraction step of extracting a feature word for the processing target topic based on the calculated score of each word included in the target parameter.

2. The feature word extraction method according to claim 1, wherein the score is made higher as the variance of each word among the lower-hierarchy topics is smaller and as its variance among the peer-hierarchy topics is larger.

3. A feature word extraction device comprising:
target parameter extraction means for extracting a target parameter representing a processing target topic from a hierarchical topic model that includes parameters representing each of a plurality of topics for probabilistic clustering of document data and in which the topics have a hierarchical structure;
peer hierarchy parameter extraction means for extracting, from the hierarchical topic model, a peer hierarchy parameter representing a topic in the same hierarchy as the processing target topic;
lower hierarchy parameter extraction means for extracting, from the hierarchical topic model, a lower hierarchy parameter representing a lower-hierarchy topic corresponding to the processing target topic;
score calculation means for calculating, for each word included in the target parameter, a score based on the intra-topic variance of the lower hierarchy and the inter-topic variance of the peer hierarchy, using the target parameter, the peer hierarchy parameter, and the lower hierarchy parameter; and
feature word extraction means for extracting a feature word for the processing target topic based on the calculated score of each word included in the target parameter.

4. A feature word extraction program for causing a computer to execute each step of the feature word extraction method according to claim 1.