JP6257076B2

JP6257076B2 - Information processing apparatus, information processing method, and information processing program

Info

Publication number: JP6257076B2
Application number: JP2013273479A
Authority: JP
Inventors: 徳章川前
Original assignee: エヌ・ティ・ティ・コムウェア株式会社
Priority date: 2013-12-27
Filing date: 2013-12-27
Publication date: 2018-01-10
Anticipated expiration: 2033-12-27
Also published as: JP2015127916A

Description

The present invention relates to an information processing apparatus, an information processing method, and an information processing program.

Extraction of n-grams (where n is an integer) is made possible by modeling the target document data in units of n words. For example, Non-Patent Document 1 describes n-gram extraction built on a hierarchical probabilistic model based on Bayesian theory.
In the model of Non-Patent Document 1, a hierarchical class is used to obtain a topic latent variable from a document and a word random variable from the topic. The model also obtains the state between the next pair of words from the previous word and the previous topic, and obtains the random variable of the next word from that state. In this way, Non-Patent Document 1 extracts n-grams consisting of n consecutive words.

Xuerui Wang, A. McCallum, Xing Wei, "Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval", ICDM 2007, Oct. 2007, pp. 679-702

However, the model of Non-Patent Document 1 requires the number of topic latent variables to be fixed in advance. It is also difficult with that model to extract n-grams with an optimal number of words. As a result, n-gram extraction has not been sufficiently convenient.

The present invention has been made in view of the above problem, and its object is to provide an information processing apparatus, an information processing method, and an information processing program that make n-gram extraction more convenient.

(1) The present invention has been made to solve the above problem. One aspect of the present invention is a classification apparatus comprising: a latent variable estimation unit that determines the number K of latent variables (K is an integer) from a probability distribution and estimates latent variables for extracting n-gram words (n is an integer); and a word estimation unit that, for each latent variable estimated by the latent variable estimation unit, estimates the n-gram words or the appearance probability of the n-gram words.

(2) One aspect of the present invention is a classification apparatus comprising: a latent variable estimation unit that estimates latent variables for extracting n-gram words (n is an integer); and a word estimation unit that determines the number of words in the n-gram from a probability distribution and, for each latent variable estimated by the latent variable estimation unit, estimates the n-gram words or the appearance probability of the n-gram words.

(3) One aspect of the present invention is any of the classification apparatuses described above, wherein the latent variable estimation unit determines the number K of latent variables by a Chinese restaurant process.

(4) One aspect of the present invention is any of the classification apparatuses described above, wherein the word estimation unit estimates the n-gram words by a Pitman-Yor process.

(5) One aspect of the present invention is any of the classification apparatuses described above, wherein a topic is used as the latent variable.

(6) One aspect of the present invention is a classification method comprising: a latent variable estimation process of determining the number K of latent variables (K is an integer) from a probability distribution and estimating latent variables for extracting n-gram words (n is an integer); and a word estimation process of estimating, for each latent variable estimated in the latent variable estimation process, the n-gram words or the appearance probability of the n-gram words.

(7) One aspect of the present invention is a classification method comprising: a latent variable estimation process of estimating latent variables for extracting n-gram words (n is an integer); and a word estimation process of determining the number of words in the n-gram from a probability distribution and estimating, for each latent variable estimated in the latent variable estimation process, the n-gram words or the appearance probability of the n-gram words.

(8) One aspect of the present invention is a classification program that causes a computer of a classification apparatus to execute: a latent variable estimation step of determining the number K of latent variables (K is an integer) from a probability distribution and estimating latent variables for extracting n-gram words (n is an integer); and a word estimation step of estimating, for each latent variable estimated in the latent variable estimation step, the n-gram words or the appearance probability of the n-gram words.

(9) One aspect of the present invention is a classification program that causes a computer of a classification apparatus to execute: a latent variable estimation step of estimating latent variables for extracting n-gram words (n is an integer); and a word estimation step of determining the number of words in the n-gram from a probability distribution and estimating, for each latent variable estimated in the latent variable estimation step, the n-gram words or the appearance probability of the n-gram words.

According to the present invention, n-gram extraction can be made more convenient.

FIG. 1 is a schematic diagram showing an example of a graphical model according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an example of the symbols used in the embodiment.
FIG. 3 is a system configuration diagram showing an example of the configuration of the classification system according to the embodiment.
FIG. 4 is a schematic block diagram showing an example of the configuration of the classification apparatus according to the embodiment.
FIG. 5 is a flowchart showing an example of the calculation processing in the classification apparatus according to the embodiment.
FIG. 6 is an explanatory diagram illustrating an example of the content of the calculation processing in the classification apparatus according to the embodiment.
FIG. 7 is an explanatory diagram illustrating an example of the content of the calculation processing in the classification apparatus according to the embodiment.
FIG. 8 is a schematic diagram showing an example of a graphical model according to a comparative example.
FIG. 9 is an explanatory diagram illustrating an example of the effect of the classification apparatus according to the embodiment.

<Embodiment>
Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 shows a graphical model according to an embodiment of the present invention. The symbols used in this embodiment and their definitions are shown in FIG. 2.

Node 11 is the node of the parameter α. The parameter α is a hyperparameter (Dirichlet parameter) that generates the probability distribution used to obtain the number of topics and the topics themselves.
Node 12 (the latent variable estimation unit) obtains the topic latent variable Z_ij, where i is the token (word) index and j is the document index. Z_ij represents the topic of the i-th token in the j-th review document. In this embodiment, the Chinese restaurant process is introduced as the process that generates the topic probability distribution. With the Chinese restaurant process, the number of topics K (K is an integer) is determined appropriately without being set in advance.

Node 13 is the node of the parameter λ. The parameter λ is a concentration parameter that generates the per-topic probability distribution used to obtain the rating.
Node 14 obtains the rating observation variable v_j. A review document includes metadata, that is, information accompanying the document. Metadata includes information representing an evaluation of the review document, its creation date and time, the date and time at which the review data was viewed, and the number of times the review document has been viewed. Rating is performed using the metadata of each review document. The same word can carry different meanings depending on the topic of the review document in which it appears. For example, "small" and "light" are positive when the topic is a mobile device but negative when the topic is fruit. Node 14 uses the beta distribution of the metadata for each topic to obtain the rating observation variable v_j for each review document j.

In this embodiment, the rating is treated as a continuous value and a beta distribution is used as its probability distribution. Alternatively, the rating may be treated as a discrete value, with a multinomial distribution used as its probability distribution.
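As a rough illustration of treating a rating as a continuous value under a beta distribution, the sketch below evaluates a beta log-density for a rating mapped into the open interval (0, 1). The function names, the 1-to-5 star scale, and the shape parameters a and b are assumptions made for this example; the publication does not specify how ratings are scaled or how the beta parameters are set.

```python
import math


def beta_log_pdf(v, a, b):
    """Log density of Beta(a, b) at v, for 0 < v < 1."""
    if not 0.0 < v < 1.0:
        raise ValueError("v must lie strictly between 0 and 1")
    # log of the normalizing constant Gamma(a + b) / (Gamma(a) * Gamma(b))
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(v) + (b - 1.0) * math.log(1.0 - v)


def scale_rating(stars, lowest=1, highest=5):
    """Map a star rating onto (0, 1) so the beta density is defined (assumed scaling)."""
    return (stars - lowest + 0.5) / (highest - lowest + 1.0)


# Hypothetical per-topic shape parameters: this topic favors high ratings.
log_density = beta_log_pdf(scale_rating(5), a=4.0, b=1.5)
```

A topic whose beta parameters put more mass near 1 gives a higher density to a five-star review than to a one-star review, which is how a rating can inform the topic assignment.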

Nodes 15_1, 15_2, …, 15_n-1 are the nodes of the parameter γ (γ_0, γ_1, …, γ_n-1), which is the concentration parameter.
Nodes 16_1, 16_2, …, 16_n-1 are the nodes of the parameter d (d_0, d_1, …, d_n-1), which is the discount parameter.

Nodes 17_1, 17_2, …, 17_n-1 obtain the probability distributions G (G_b, G_k, …, G_k^u) of the words of the n-gram. In this embodiment, nodes 17_1, 17_2, …, 17_n-1 apply the Pitman-Yor process hierarchically to the parameter γ, the parameter d, and the preceding distribution G to obtain the probability distribution of each word of the n-gram. Specifically, node 17_1 generates the base probability distribution G_b, which is used across the general document collection as a whole. Nodes 17_2, …, 17_n-1 apply the Pitman-Yor process hierarchically, so that each distribution is drawn from a distribution over distributions, and thereby obtain the probability distributions G_k, …, G_k^u of the words of the n-gram for topic k.

Node 18 (the word estimation unit) obtains the observation variable w_ji, the n-gram word at token i of review document j, from the topic latent variable Z_ji obtained at node 12 and the per-topic word probability distributions G_b, G_k, …, G_k^u. The number of words in the n-gram is set appropriately by applying the Pitman-Yor process hierarchically and introducing a technique similar to the Chinese restaurant process.

The Chinese restaurant process is used with the nonparametric Dirichlet process and generates a random partition. In this embodiment, the Chinese restaurant process serves as the prior probability distribution over topics.

The Chinese restaurant process is described with the metaphor of a restaurant that has infinitely many round tables, each with unlimited seating capacity. Assume the tables are numbered in order, and let Z_i denote the number of the table (topic) at which the i-th customer sits. When a customer enters, the probability of choosing an occupied table is proportional to the number of customers already seated there, and the probability of choosing an empty table is proportional to a constant parameter. That is, the first customer sits at the first table (Z_1 = 1), and the table chosen by the i-th customer then follows the probability distribution given by equation (1).
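The seating rule just described can be sketched in a few lines. This is a minimal illustration of the standard Chinese restaurant process, in which the i-th customer joins occupied table k with probability n_k / (i - 1 + α) and opens a new table with probability α / (i - 1 + α). The function name, the zero-based table numbering, and the value of α are assumptions made for the example, not details from the publication.

```python
import random


def crp_assignments(num_customers, alpha, rng):
    """Seat customers one by one; return each customer's table and the table sizes."""
    counts = []       # counts[k] = number of customers already at table k
    assignments = []  # assignments[i] = table chosen by customer i (zero-based)
    for _ in range(num_customers):
        # An occupied table k has weight counts[k]; a new table has weight alpha.
        weights = counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)      # the customer opens a new table (a new topic)
        else:
            counts[table] += 1    # the customer joins an occupied table
        assignments.append(table)
    return assignments, counts


rng = random.Random(0)
assignments, counts = crp_assignments(200, alpha=1.5, rng=rng)
num_topics = len(counts)  # the number of tables is not fixed in advance
```

The number of occupied tables grows with the data instead of being supplied as an input, which is the property the embodiment relies on to avoid presetting K.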

The Chinese restaurant process can be described as a probability within the Dirichlet process family, in the sense that a random partition is obtained from the order in which customers are seated at tables. Each customer chooses a table and sits down, which yields a partition of the customers. This partition has the same cluster structure as one obtained from a Dirichlet process.

By introducing the Chinese restaurant process as the process that generates the topic probability distribution, the number of topics K is thus determined appropriately without being set in advance.

Next, rating estimation in this embodiment is described. As noted above, this embodiment uses a beta distribution to obtain the rating observation variable v_j of each review document. The procedure can be described by analogy with the Chinese restaurant process: it is expressed as a probability similar to that of the Dirichlet process, in the sense that a random partition is obtained from the order in which customers are seated at the restaurant's tables.

That is, if the topic Z_ij is the index of the table chosen by the i-th customer in review document (restaurant) j, the following distribution is obtained.

The hyperparameter α in this expression is estimated by auxiliary variable sampling. To achieve dimensionality reduction of the word representation and determination of the corresponding rating at the same time, the words and the given rating are coupled through the topic, using a beta distribution. The probability of the rating v_j is therefore given as follows.

Next, estimation of the n-gram words in this embodiment is described. As noted above, this embodiment obtains the probability distribution of each word of the n-gram by applying the Pitman-Yor process hierarchically. Applying the Pitman-Yor process hierarchically makes it possible to control the power law.

A power law is a mathematical relationship between two quantities and is known in linguistics as Zipf's law. For example, in a sufficiently large corpus, the probability P(n_w) of a word w that occurs with frequency n_w is given as follows.

The Pitman-Yor process generalizes the Dirichlet process: by defining a probability distribution over probability distributions on an infinite-dimensional parameter space, it performs nonparametric estimation within a Bayesian framework. The Pitman-Yor process is expressed as follows.

Here, the discount parameter d controls the power-law behavior.

In the Pitman-Yor process above, the probability distribution G_b is used as the parameter G_0, and the resulting distribution is in turn used as the next G_0. Repeating this computation recursively yields the probability distribution of each subsequent word of the n-gram.

The procedure for generating G ~ PYP(γ, d, G_0) can be described with the Chinese restaurant process metaphor of equation (1). The parameters d and γ have the effect of smoothing the power law.

That is, the first customer sits at the first table (z_1 = 1), and the table chosen by the i-th customer follows the distribution below.

Here, as the number of tables grows along with the number of customers entering the restaurant, the discount parameter gives rise to a power law. In other words, the Pitman-Yor process produces outcomes that follow a power-law distribution.
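The effect of the discount parameter can be seen in a small simulation. The sketch below uses the standard Chinese-restaurant form of the Pitman-Yor process: with t occupied tables and i - 1 customers already seated, an occupied table k is chosen with probability (n_k - d) / (γ + i - 1) and a new table with probability (γ + d·t) / (γ + i - 1). The function name and the parameter values are assumptions made for the example; it also assumes γ > 0 and 0 ≤ d < 1.

```python
import random


def pitman_yor_tables(num_customers, gamma, d, rng):
    """Simulate Pitman-Yor seating; return the number of customers at each table."""
    counts = []
    for _ in range(num_customers):
        t = len(counts)
        # Each occupied table is discounted by d; the new table gains d per occupied table.
        weights = [c - d for c in counts] + [gamma + d * t]
        table = rng.choices(range(t + 1), weights=weights)[0]
        if table == t:
            counts.append(1)
        else:
            counts[table] += 1
    return counts


# With d = 0 the process reduces to the Chinese restaurant process.
tables_without_discount = pitman_yor_tables(2000, gamma=1.0, d=0.0, rng=random.Random(0))
tables_with_discount = pitman_yor_tables(2000, gamma=1.0, d=0.8, rng=random.Random(0))
```

With d = 0 the number of tables grows roughly logarithmically in the number of customers, while a larger d produces many more small tables, which is the heavy-tailed behavior associated with word frequencies.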

Introducing the Chinese restaurant process technique also allows the number of words in the n-gram connection model to be set appropriately. In this embodiment, n-gram words are generated by applying the Pitman-Yor process recursively. To form the words of an n-gram, this embodiment samples words hierarchically.

The base word probability distribution G_b is the probability vector of base word unigrams (1-grams), assigned to topic k = 0. G_b is obtained over the general document collection as a whole, without using preceding information. It is obtained as follows.

The topic-specific unigram probability distribution is obtained from the base word probability distribution G_b and the current topic k. The distribution G_k is obtained by using the computed G_b as the next parameter G_0.

To generate the probability distribution of n-gram words for each topic, each topic is regarded as a restaurant owner. This extends the Chinese restaurant process to the Chinese restaurant franchise process.

For example, each word probability distribution G_k^u has an associated restaurant that is indexed by the restaurant (word number) u and managed by the owner (topic) k. G_k^u is the conditional probability of w following u, where u is the number of the preceding n-1 words in topic k. The customers of this restaurant are drawn from G_k^u, the tables are drawn from the base distribution that precedes it, and the dishes are the values taken by the words.

Recursively placing the Pitman-Yor process as the prior over word probabilities defines each n-gram drawn from the same run of consecutive topic assignments in each document.

Using equation (9), the probability distribution of the next word is obtained by recursively substituting the previous distribution G_k^u. The concentration parameter γ and the discount parameter d play a role related to the length of the phrase. The computation is repeated until the background word probability distribution G_b is reached.

For each restaurant (word number) u and dish (word) w, c^k_uwl is defined as the number of customers seated at table l and eating dish w in restaurant u managed by k, and t^k_uwl as the number of tables serving dish w in the same restaurant u under k. Taking marginal counts, c^k_ul is the number of customers seated at table l managed by k, c^k_uw is the count for dish w eaten in u managed by k, c^k_u is the number of customers in u managed by k, and t^k_u is the number of tables in u managed by k. As a result, the next word w drawn from G_k^u, G_k, and G_b is computed recursively as follows.
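The recursion over customer and table counts can be sketched as follows. The publication's own equation is not reproduced in this text, so the code uses the commonly cited predictive rule for hierarchical Pitman-Yor language models, P(w | u) = (c_uw - d·t_uw) / (γ + c_u) + (γ + d·t_u) / (γ + c_u) · P(w | parent of u), as an assumption. The data layout, the function name, the uniform base distribution, and the omission of the topic index k (the counts here stand for a single topic) are all simplifications made for the example.

```python
def predictive_prob(word, context, counts, tables, gamma, d, vocab_size):
    """P(word | context) for one topic under a hierarchical Pitman-Yor sketch.

    counts[context][word]: customers eating `word` in restaurant `context`
    tables[context][word]: tables serving `word` in restaurant `context`
    gamma[n], d[n]: concentration and discount for contexts of length n
    context: tuple of preceding words; () is the unigram level
    """
    if context is None:
        return 1.0 / vocab_size  # assumed uniform base distribution
    # The parent restaurant drops the earliest word of the context.
    parent = context[1:] if context else None
    parent_prob = predictive_prob(word, parent, counts, tables, gamma, d, vocab_size)
    c_u = sum(counts.get(context, {}).values())
    if c_u == 0:
        return parent_prob  # nothing observed here: back off entirely
    t_u = sum(tables.get(context, {}).values())
    n = len(context)
    c_uw = counts[context].get(word, 0)
    t_uw = tables.get(context, {}).get(word, 0)
    direct = max(c_uw - d[n] * t_uw, 0.0) / (gamma[n] + c_u)
    backoff = (gamma[n] + d[n] * t_u) / (gamma[n] + c_u)
    return direct + backoff * parent_prob


# Hypothetical counts for a three-word vocabulary.
vocab = ["a", "b", "c"]
counts = {(): {"a": 2, "b": 1}, ("a",): {"b": 2}}
tables = {(): {"a": 1, "b": 1}, ("a",): {"b": 1}}
gamma = [1.0, 1.0]
d = [0.5, 0.5]
p_b_after_a = predictive_prob("b", ("a",), counts, tables, gamma, d, len(vocab))
```

Each level discounts its own counts and passes the remaining probability mass to the shorter context, so the probabilities over the vocabulary still sum to one at every level.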

FIG. 2 is an explanatory diagram showing an example of the symbols used in this embodiment.
As shown in the figure, the symbols are defined as follows.
D (K, W): number of review documents (number of topics, number of words)
N_j: number of word tokens in review document j
v_j: rating associated with review document j
z_ji: topic associated with the i-th token in review document j
w_ji: i-th token in review document j
u: word sequence (phrase)
u(/): word sequence
G_b: probability distribution of words in the given corpus
G_k: probability distribution of words specific to topic k
G_k^u: probability distribution specific to the words u in topic k
α: hyperparameter
d_n: discount parameter
γ_n: concentration parameter

Next, the processing represented by the graphical model of FIG. 1 is described in detail. As described above, this embodiment introduces the Chinese restaurant process to determine the number of latent variables Z_ji and applies the Pitman-Yor process hierarchically to estimate the n-gram connection model. This processing can be realized by Gibbs sampling.

FIG. 3 is a schematic diagram showing an example of the configuration of the classification system according to this embodiment, and FIG. 4 is a schematic block diagram showing an example of the configuration of the classification apparatus. As shown in FIG. 3, the system according to the embodiment of the present invention comprises a file server 51, a calculation server 52, a database 53, and a service server 54.

As shown in FIG. 4, the file server 51 includes a data file storage unit 61 that stores the review data to be processed. The review data stored in the data file storage unit 61 consists of documents such as blog posts and web pages on the Internet. Review data is document data in which metadata is associated with a review. The review data to be processed is not limited to documents on the Internet.

The calculation server 52 retrieves the review data to be processed from the file server 51, performs the calculation processing represented by the graphical model of FIG. 1, and outputs the result. As shown in FIG. 4, the calculation server 52 includes a preprocessing unit 71 and a calculation processing unit 72. The preprocessing unit 71 extracts words from the files containing the review data to be processed. It then assigns text IDs and word IDs and saves the correspondence table to the file system. The calculation processing unit 72 performs the calculation processing corresponding to the graphical model of FIG. 1.

As shown in FIG. 4, the database 53 has a calculation result storage unit 81. The results computed by the calculation processing unit 72 of the calculation server 52 are sent to the database 53 and stored in the calculation result storage unit 81.

The service server 54 makes the calculation results available to services. As shown in FIG. 4, the service server 54 includes a calling unit 91. In response to a call from the user terminal 55, the calling unit 91 sends the calculation results to the user terminal 55. The results can be used by services such as product search, review search, and marketing that handle document collections containing observed values such as user feedback or subjective evaluations (for example, the continuous observation variable v_j in FIG. 2).

FIG. 5 is a flowchart showing an example of the calculation processing in the calculation server 52.
In FIG. 5, the preprocessing unit 71 of the calculation server 52 first extracts the words and metadata (creation date and time, evaluation, and so on) of each text to be processed. It then assigns a text ID to each text and a word ID to each word (step S1).
For example, in table T1 of FIG. 6, the text of the first record is "text AAAB" and the text of the last record is "text XDCFR". The tokens (words) of the first record are "apple", "operability", and "heritage", and the tokens of the last record are "music", "art", and "apple".

Table T2 of FIG. 6 shows the result of assigning text IDs and word IDs to this data in step S1. In table T2, the text of the first record is assigned the text ID "000" and the text of the last record the text ID "086". The tokens (words) of the first record are assigned the word IDs "0000", "0003", and "0120", and the tokens of the last record the word IDs "1211234", "03042", and "0000".
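Step S1 amounts to building two lookup tables and rewriting each record in terms of IDs. The sketch below is one possible way to do this; the function name and the use of sequential integer IDs are assumptions made for the example, and the IDs it produces differ from the zero-padded values shown in table T2.

```python
def assign_ids(records):
    """records: list of (text_name, [token, ...]) pairs, as in table T1."""
    text_ids = {}   # text name -> text ID
    word_ids = {}   # token -> word ID (the same word always gets the same ID)
    encoded = []
    for text_name, tokens in records:
        text_id = text_ids.setdefault(text_name, len(text_ids))
        token_ids = [word_ids.setdefault(tok, len(word_ids)) for tok in tokens]
        encoded.append((text_id, token_ids))
    return encoded, text_ids, word_ids


records = [
    ("text AAAB", ["apple", "operability", "heritage"]),
    ("text XDCFR", ["music", "art", "apple"]),
]
encoded, text_ids, word_ids = assign_ids(records)
# encoded == [(0, [0, 1, 2]), (1, [3, 4, 0])]
```

As in table T2, "apple" receives the same word ID in both records, which is what lets the later sampling steps pool counts for a word across documents.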

Next, the calculation processing unit 72 of the calculation server 52 generates random numbers and uses their values as the random variable (Z) (step S2). Table T3 of FIG. 7 is the result of step S2: random numbers are generated for table T2, to which text IDs and word IDs have been assigned, and are used as the topic random variable (Z). In the first record of table T3, the random numbers "11", "8", and "3" are entered as the topic random variables. In the last record, the random numbers "2", "1", and "11" are entered.
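A sketch of the random initialization in step S2 is given below. The helper name `init_topics`, the upper bound `num_initial_topics`, and the fixed seed are all assumptions for illustration; the text only states that a random number is generated for each token and used as its topic variable.

```python
import random


def init_topics(encoded, num_initial_topics, seed=0):
    """Give every token a random initial topic label in 1..num_initial_topics.

    `encoded` is a list of (text_id, [word_id, ...]) pairs. The range of the
    labels is an assumption; the source does not state it.
    """
    rng = random.Random(seed)
    return [
        [rng.randint(1, num_initial_topics) for _ in word_ids]
        for _text_id, word_ids in encoded
    ]


encoded = [(0, [0, 3, 120]), (86, [1211234, 3042, 0])]
topics = init_topics(encoded, num_initial_topics=12)
```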

Next, the calculation processing unit 72 of the calculation server 52 estimates the latent variables by Gibbs sampling (step S3). When the number of sampling iterations reaches a preset count, the processing ends (step S4). In this way, the number of topic latent variables is determined optimally, the topic latent variables are estimated, and n-grams can be estimated using the estimated topics.
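The control flow of steps S3 and S4 can be sketched as below, reading step S4 as "stop after a preset number of sweeps". The callback `resample_topic` is hypothetical: it stands in for drawing a token's topic from its conditional distribution, which is not spelled out here.

```python
def run_gibbs(assignments, resample_topic, num_sweeps):
    """Run a fixed number of Gibbs sweeps over all token-level topic labels.

    assignments    : list of lists, assignments[d][i] is the topic of token i
                     in document d (modified in place).
    resample_topic : hypothetical callback (d, i, assignments) -> new topic.
    num_sweeps     : preset number of sweeps after which sampling stops.
    """
    for _sweep in range(num_sweeps):              # step S4: preset iteration count
        for d, doc in enumerate(assignments):
            for i in range(len(doc)):
                doc[i] = resample_topic(d, i, assignments)   # step S3
    return assignments


calls = []

def dummy_resample(d, i, assignments):
    # Placeholder sampler: records the call and keeps the current label.
    calls.append((d, i))
    return assignments[d][i]

state = run_gibbs([[11, 8, 3], [2, 1, 11]], dummy_resample, num_sweeps=4)
```

A real sampler would replace `dummy_resample` with a draw that depends on the counts of the other tokens' assignments.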

<Comparative example>
FIG. 8 is a graphical model of a comparative example. In FIG. 8, node 101 is the node of the hyperparameter α. The hyperparameter α is used to obtain the topic probability distribution θ_d.
Node 102 is a node that obtains the topic probability distribution θ_d specific to each document. Here, D denotes the number of documents, and there are D topic probability distributions θ_d, one per document.

Nodes 103_1, ..., 103_i, 103_{i+1}, ... are nodes that acquire the topic latent variables z (z_1, ..., z_i, z_{i+1}, ...). That is, nodes 103_1, ..., 103_i, 103_{i+1}, ... acquire the topic latent variables z_1, ..., z_i, z_{i+1}, ... from the document data, based on the document-specific topic probability distribution θ_d obtained at node 102. Here, z_i is the topic associated with the i-th word (i is an arbitrary integer; a token is the smallest unit of a word).

Node 104 is the node of the hyperparameter β. The hyperparameter β is used to obtain the word probability distribution φ.
Node 105 is a node that obtains the word probability distribution φ specific to each topic. The number of topics is Z, so there are Z word probability distributions φ at node 105.

Node 106 is the node of the hyperparameter ε. The hyperparameter ε is used to obtain the probability distribution σ of the next inter-word state.
Node 107 is a node that obtains the probability distribution σ of the next inter-word state, specific to each combination of previous word and previous topic. There are (Z × W) such distributions σ.

Node 108 is the node of the hyperparameter γ. The hyperparameter γ is used to obtain the probability distribution ψ of the next word. A random value is used as the initial value of γ.
Node 109 is a node that obtains the probability distribution ψ of the next word, specific to each combination of previous word and current topic. Here, Z denotes the number of topics and W the number of words, so there are (Z × W) next-word probability distributions ψ.

Nodes 110_1, ..., 110_i, 110_{i+1}, ... are nodes that acquire the observed word variables. That is, nodes 110_1, ..., 110_i, 110_{i+1}, ... acquire the observed word variables w_1, ..., w_i, w_{i+1}, ... from the topics z_1, ..., z_i, z_{i+1}, ... acquired at nodes 103_1, ..., 103_i, 103_{i+1}, ..., based on the word probability distribution φ obtained at node 105. In addition, nodes 110_i, 110_{i+1}, ... acquire the observed variables w_i, w_{i+1}, ... of the next word from the previous word and the current topic, based on the next-word probability distribution ψ obtained at node 109. Here, w_i denotes the i-th word (i is an arbitrary integer).

Nodes 111_i, 111_{i+1}, ... are nodes that acquire the next inter-word state from the previous word and the previous topic. That is, nodes 111_i, 111_{i+1}, ... acquire the latent variables x_i, x_{i+1}, ... of the next inter-word state from the previous topics z_1, ..., z_i, z_{i+1}, ... acquired at nodes 103_1, ..., 103_i, 103_{i+1}, ... and the previous words w_1, ..., w_i, w_{i+1}, ... acquired at nodes 110_1, ..., 110_i, 110_{i+1}, ..., based on the probability distribution σ of the next inter-word state obtained at node 107.
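The generative reading of the comparative model in FIG. 8 can be sketched as follows. Several points are assumptions rather than statements from the text: the inter-word state x_i is treated as binary, x_i = 1 is taken to mean "draw w_i from ψ given the previous word and current topic" and x_i = 0 to mean "draw w_i from φ given the current topic", and the first token is given x = 0. The data layout (`sigma` and `psi` as dictionaries keyed by topic and word) and the function name are hypothetical.

```python
import random


def generate_document(length, theta_d, phi, sigma, psi, rng):
    """Sample topics z, inter-word states x and words w for one document.

    theta_d : list, topic probabilities for this document (node 102)
    phi     : phi[z] is a word distribution for topic z (node 105)
    sigma   : sigma[(z_prev, w_prev)] is the probability that x = 1 (node 107)
    psi     : psi[(z, w_prev)] is a next-word distribution (node 109)
    """
    topics, states, words = [], [], []
    for i in range(length):
        z = rng.choices(range(len(theta_d)), weights=theta_d)[0]
        if i == 0:
            x = 0  # assumption: no previous word, so no bigram link
        else:
            p_link = sigma[(topics[-1], words[-1])]
            x = 1 if rng.random() < p_link else 0
        dist = psi[(z, words[-1])] if x == 1 else phi[z]
        w = rng.choices(range(len(dist)), weights=dist)[0]
        topics.append(z)
        states.append(x)
        words.append(w)
    return topics, states, words


# Toy parameters: Z = 2 topics, W = 3 words, all distributions uniform.
Z, W = 2, 3
uniform = [1.0 / W] * W
theta_d = [0.5, 0.5]
phi = [uniform for _ in range(Z)]
sigma = {(z, w): 0.5 for z in range(Z) for w in range(W)}
psi = {(z, w): uniform for z in range(Z) for w in range(W)}
topics, states, words = generate_document(10, theta_d, phi, sigma, psi, random.Random(0))
```

The dictionaries `sigma` and `psi` each hold Z × W entries, which matches the (Z × W) counts given for σ and ψ above.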

As shown in FIG. 8, when topics are used as latent variables in the comparative example, the number of topics must be determined in advance. In the example of FIG. 8, the number of topics is Z, so Z probability distributions are needed for the topic-specific word probability distribution φ, and (Z × W) probability distributions are needed for the next-word probability distribution ψ. Increasing the number of topics increases the amount of processing, while reducing the number of topics prevents the words of each topic from being estimated accurately. In contrast, the present embodiment introduces the Chinese restaurant process, which allows the number of topics to be set appropriately.
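To illustrate how a Chinese restaurant process avoids fixing the number of topics in advance, the sketch below implements the textbook seating rule: an item joins an existing group with probability proportional to the group's size, or opens a new group with probability proportional to a concentration parameter. The function names and the parameter name `concentration` are assumptions, and this is the generic process, not the embodiment's exact sampler.

```python
import random


def crp_probabilities(table_counts, concentration):
    """Return the probability of each existing table plus, last, a new table."""
    total = sum(table_counts) + concentration
    probs = [count / total for count in table_counts]
    probs.append(concentration / total)
    return probs


def crp_assign(num_items, concentration, seed=0):
    """Seat num_items one by one and return the resulting table sizes."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_items):
        probs = crp_probabilities(counts, concentration)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)      # open a new table (a new topic)
        else:
            counts[k] += 1        # join an existing table
    return counts


probs = crp_probabilities([3, 1], concentration=1.0)
sizes = crp_assign(100, concentration=1.0)
```

With tables of size 3 and 1 and a concentration of 1.0, the seating probabilities are 3/5, 1/5, and 1/5 for a new table. The number of tables in `sizes` is an outcome of the sampling rather than an input.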

Furthermore, the comparative example shown in FIG. 8 estimates the next word by acquiring the next inter-word state from the previous word and the previous topic. This configuration is essentially a 2-gram concatenation model. Even if n-gram (n > 3) extraction is performed with the comparative example of FIG. 8, it remains based on a 2-gram model, so meaningful n-gram extraction is not possible. In contrast, the present embodiment realizes an n-gram concatenation model by introducing the Pitman-Yor process hierarchically. In addition, by extending and introducing the Chinese restaurant process, the number of words in n-gram extraction can be set appropriately.
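As background for the hierarchical use of the Pitman-Yor process, the sketch below evaluates the predictive rule commonly used in hierarchical Pitman-Yor language models, where the probability of a word in a given context backs off to the probability in a shorter context. This is the standard form from the language-modelling literature, shown as an assumption about what "hierarchical" means here; it is not taken from the patent. The names `discount`, `strength`, and the count arguments are hypothetical.

```python
def pitman_yor_predictive(word_count, word_tables, total_count, total_tables,
                          discount, strength, parent_prob):
    """Predictive probability of a word in one context of a hierarchical PYP.

    word_count   : customers (occurrences) of the word in this context
    word_tables  : tables serving the word in this context
    total_count  : all customers in this context
    total_tables : all tables in this context
    parent_prob  : probability of the word in the shorter (parent) context
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if strength <= 0.0:
        raise ValueError("strength must be positive in this sketch")
    denom = strength + total_count
    own = max(word_count - discount * word_tables, 0.0) / denom
    backoff_weight = (strength + discount * total_tables) / denom
    return own + backoff_weight * parent_prob


# A word seen twice at one table, in a context with 4 customers at 2 tables.
p = pitman_yor_predictive(2, 1, 4, 2, discount=0.5, strength=1.0, parent_prob=0.25)
# An unseen context falls back entirely to the parent probability.
p_empty = pitman_yor_predictive(0, 0, 0, 0, discount=0.5, strength=1.0, parent_prob=0.25)
```

Chaining the function, with the result for an (n-1)-word context passed as `parent_prob` for the n-word context, gives the back-off structure over successively longer contexts.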

FIG. 9 compares the phrases (2-grams and 3-grams) extracted by the present embodiment with those extracted by the comparative example.
Here, for DVD titles, book titles, and music titles, the phrases extracted by the present embodiment and the phrases extracted by the comparative example shown in FIG. 8 are compared in terms of Precision and Recall.

Here, Precision is (number of correct answers among the calculation results / number of calculation results), and Recall is (number of correct answers among the calculation results / total number of correct answers). Precision indicates how many incorrect results are included in the extraction results, and Recall indicates how many items the extraction missed. In general, Precision and Recall are in a trade-off relationship.
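The two measures defined above can be computed as in the following sketch. The function name and the example phrases are made up for illustration and are not data from FIG. 9.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for a set of extracted phrases.

    precision = correct extracted phrases / all extracted phrases
    recall    = correct extracted phrases / all correct (gold) phrases
    """
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical 2-gram example: 3 phrases extracted, 4 phrases are correct.
extracted = ["new york", "ice cream", "of the"]
gold = ["new york", "ice cream", "hot dog", "san francisco"]
precision, recall = precision_recall(extracted, gold)
```

Here two of the three extracted phrases are correct (precision 2/3) and two of the four correct phrases were found (recall 1/2).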

The results shown in FIG. 9 confirm that the present embodiment yields better extraction results than the comparative example for both Precision and Recall. For example, for 2-grams of book titles, the comparative example gives a Precision of 0.67 and a Recall of 0.60, whereas the present embodiment gives a Precision of 0.87 and a Recall of 0.90. For 3-grams of book titles, the comparative example gives a Precision of 0.46 and a Recall of 0.42, whereas the present embodiment gives a Precision of 0.82 and a Recall of 0.86. In both cases the present embodiment is better on both measures.

As described above, according to the present embodiment, the classification device includes a latent variable estimation unit that automatically determines the number K (K is an integer) of latent variables from a probability distribution and estimates the latent variables for extracting n-gram (n is an integer) words, and a word estimation unit that estimates, for each latent variable estimated by the latent variable estimation unit, the n-gram words or the appearance probability of the n-gram words.
As a result, the number of latent variables (the number of topics) need not be determined in advance when extracting n-grams. Because the number of latent variables is determined automatically, it can be set appropriately, which improves convenience when extracting n-grams.

Also according to the present embodiment, the classification device includes a latent variable estimation unit that estimates latent variables for extracting n-gram (n is an integer) words, and a word estimation unit that determines the number of words of the n-gram from a probability distribution and estimates, for each latent variable estimated by the latent variable estimation unit, the n-gram words or the appearance probability of the n-gram words.
As a result, the number of words, that is, the n in n-gram extraction, can be set appropriately, which improves convenience when extracting n-grams.

In addition, by introducing the Chinese restaurant process, the number of topics need not be determined in advance when n-grams are estimated using topics as latent variables, and the number of topics can be set appropriately. An n-gram concatenation model can be realized by introducing the Pitman-Yor process hierarchically, and the number of words in n-gram extraction can be set appropriately by extending and introducing the Chinese restaurant process.

In the embodiment described above, n-gram extraction has been explained mainly for English words. However, the present invention is not limited to extracting n-grams of English words and can also extract n-grams in Japanese and various other languages.

In the embodiment described above, n-grams are extracted using topics as the latent variables, but the latent variables are not limited to topics. When latent variables other than topics are used, the number of latent variables can likewise be determined optimally by the Chinese restaurant process.

A program for executing each process of the calculation server 52 of the present embodiment may be recorded on a computer-readable recording medium, and the various processes of the calculation server 52 described above may be performed by having a computer system read and execute the program recorded on that recording medium.

Here, the "computer system" may include an OS and hardware such as peripheral devices. If a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment). The "computer-readable recording medium" refers to a storage device such as a flexible disk, a magneto-optical disk, a ROM, a writable nonvolatile memory such as a flash memory, a portable medium such as a CD-ROM, or a hard disk built into a computer system.

Furthermore, the "computer-readable recording medium" also includes media that hold a program for a certain period of time, such as a volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line. The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in the transmission medium.

Here, the "transmission medium" that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment and includes designs and the like within a scope that does not depart from the gist of the invention.

51 File server
52 Calculation server
53 Database
54 Service server
55 User terminal
61 Data file storage unit
71 Pre-processing unit
72 Calculation processing unit
81 Calculation result storage unit
91 Calling unit

Claims

An information processing apparatus comprising:
a latent variable estimation unit that determines the number K (K is an integer) of latent variables from a probability distribution of words specific to each topic of the sentences in a text, and estimates the K latent variables used to extract n-gram (n is an integer) words; and
a word estimation unit that estimates, for each latent variable estimated by the latent variable estimation unit, the n-gram words or the appearance probability of the n-gram words.

The information processing apparatus according to claim 1, wherein the latent variable estimation unit determines a number K of the latent variables by a Chinese restaurant process.

An information processing apparatus comprising:
a latent variable estimation unit that estimates latent variables for extracting n-gram (n is an integer) words; and
a word estimation unit that determines the number of words of the n-gram from a probability distribution of words specific to each topic of the sentences in a text, and estimates, for each latent variable estimated by the latent variable estimation unit, the n-gram words or the appearance probability of the n-gram words.

The information processing apparatus according to claim 1, wherein the word estimation unit estimates n-gram words by a Pitman-Yor process.

The information processing apparatus according to any one of claims 1 to 4, wherein a topic is used as the latent variable.

An information processing method in which an information processing apparatus performs:
a latent variable estimation process of determining the number K (K is an integer) of latent variables from a probability distribution of words specific to each topic of the sentences in a text, and estimating the K latent variables used to extract n-gram (n is an integer) words; and
a word estimation process of estimating, for each latent variable estimated in the latent variable estimation process, the n-gram words or the appearance probability of the n-gram words.

An information processing method in which an information processing apparatus performs:
a latent variable estimation process of estimating latent variables for extracting n-gram (n is an integer) words; and
a word estimation process of determining the number of words of the n-gram from a probability distribution of words specific to each topic of the sentences in a text, and estimating, for each latent variable estimated in the latent variable estimation process, the n-gram words or the appearance probability of the n-gram words.

An information processing program that causes a computer of an information processing apparatus to execute:
a latent variable estimation step of determining the number K (K is an integer) of latent variables from a probability distribution of words specific to each topic of the sentences in a text, and estimating the K latent variables used to extract n-gram (n is an integer) words; and
a word estimation step of estimating, for each latent variable estimated in the latent variable estimation step, the n-gram words or the appearance probability of the n-gram words.

An information processing program that causes a computer of an information processing apparatus to execute:
a latent variable estimation step of estimating latent variables for extracting n-gram (n is an integer) words; and
a word estimation step of determining the number of words of the n-gram from a probability distribution of words specific to each topic of the sentences in a text, and estimating, for each latent variable estimated in the latent variable estimation step, the n-gram words or the appearance probability of the n-gram words.