JP5802597B2

JP5802597B2 - Classification device, classification system, classification method, and classification program

Info

Publication number: JP5802597B2
Application number: JP2012082952A
Authority: JP
Inventors: Noriaki Kawamae (川前 徳章)
Original assignee: NTT Comware Corporation (エヌ・ティ・ティ・コムウェア株式会社)
Priority date: 2012-03-30
Filing date: 2012-03-30
Publication date: 2015-10-28
Anticipated expiration: 2032-03-30
Also published as: JP2013214150A

Description

The present invention relates to a classification device, a classification system, a classification method, and a classification program.

As shown in Non-Patent Document 1, the present inventor has previously proposed a model that estimates a user's interests by using latent classes hierarchically. From text data such as blogs on the Internet, web pages, Twitter posts, and papers, this model estimates the user's interests by hierarchically estimating, as latent variables, which community class the user (the author of the blog or web page) belongs to, which trend class each document belongs to, and which topic its content (each word) belongs to. The estimated interests can be used, for example, to forecast demand in marketing, to present the most suitable advertisement to the user, or to recommend new items that match the user's interests.

Non-Patent Document 1: Noriaki Kawamae: Latent interest-topic model: finding the causal relationships behind dyadic data. CIKM 2010: 649-658

However, in the model shown in Non-Patent Document 1, the hierarchical structure is not reflected in the data generation process, so data cannot be classified separately for each layer (for example, a community layer and a topic layer).

The present invention has been made in view of the above problem, and its object is to provide a classification device, a classification system, a classification method, and a classification program that make it possible to classify data separately for each layer.

(1) The present invention has been made in view of the above circumstances. One aspect of the present invention is a classification device comprising: an observation variable probability distribution acquisition unit that acquires a probability distribution of an observation variable for each layer; and an observation variable generation unit that switches among the probability distributions acquired by the observation variable probability distribution acquisition unit according to the layer, and generates an observation variable based on the selected probability distribution.

(2) In one aspect of the present invention, the above classification device further comprises: a switch variable probability distribution acquisition unit that acquires a probability distribution of a switch variable for each item of target data including time information; and a switch variable generation unit that generates a switch variable for each observation variable based on the probability distribution acquired by the switch variable probability distribution acquisition unit, wherein the observation variable generation unit switches, among the probability distributions acquired by the observation variable probability distribution acquisition unit, to the probability distribution of the layer corresponding to the switch variable acquired by the switch variable probability distribution acquisition unit.

(3) In one aspect of the present invention, the above classification device further comprises: a time frequency distribution acquisition unit that acquires, for each trend category into which trends are classified, a time frequency distribution defining the frequency with which data appears at each time; and a trend classification unit that classifies target data including time information into one of a plurality of trend categories based on the time frequency distribution acquired by the time frequency distribution acquisition unit.

(4) In one aspect of the present invention, in the above classification device, the time frequency distribution is a probability distribution.

(5) One aspect of the present invention is a classification system comprising: an observation variable probability distribution acquisition unit that acquires a probability distribution of an observation variable for each layer; and an observation variable generation unit that switches among the probability distributions acquired by the observation variable probability distribution acquisition unit according to the layer, and generates an observation variable based on the selected probability distribution.

(6) One aspect of the present invention is a classification method comprising: a step in which an observation variable probability distribution acquisition unit acquires a probability distribution of an observation variable for each layer; and a step in which an observation variable generation unit switches among the probability distributions acquired by the observation variable probability distribution acquisition unit according to the layer, and generates an observation variable based on the selected probability distribution.

(7) One aspect of the present invention is a classification program for causing a computer to execute: an observation variable probability distribution acquisition step of acquiring a probability distribution of an observation variable for each layer; and an observation variable generation step of switching among the probability distributions acquired in the observation variable probability distribution acquisition step according to the layer, and generating an observation variable based on the selected probability distribution.

According to the present invention, trends can be extracted from time-series data.

A graphical model used to explain a preference model that uses a hierarchical structure of latent classes.
A graphical model used to explain the preference model according to the embodiment of the present invention.
A diagram showing the correspondence between the name of each block and the name of each functional unit, and the correspondence between the specific probability distributions of this embodiment and the names of the probability distributions.
A schematic block diagram of the classification system according to the embodiment of the present invention.
A functional block diagram used to explain the classification system according to the embodiment of the present invention.
A list of the random variables in the classification system according to the embodiment of the present invention and the parameters corresponding to them.
A flowchart used to explain the calculation processing in the classification system according to the embodiment of the present invention.
An explanatory diagram of the preprocessing in the pre-processing unit according to the embodiment of the present invention.
An explanatory diagram of the content of the calculation processing in the classification system according to the embodiment of the present invention.
A diagram relating the processing in the calculation processing unit according to the embodiment of the present invention to the graphical model.
A diagram showing the probability distributions used in the proposed model according to the embodiment of the present invention.
An explanatory diagram of the algorithm that estimates the latent variables in the classification system according to the embodiment of the present invention.
Two tables: the upper table lists movie titles aggregated by period, and the lower table lists movie titles corresponding to the trends estimated by the proposed model.
A table comparing the recommendation accuracy of the method of this embodiment with that of conventional methods.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. First, to aid understanding of the present invention, the preference model previously proposed by the present inventor is described before the embodiments.

As shown in FIG. 1, the present inventor has proposed a model that estimates a user's interests by using hierarchical latent classes. In this model, the latent classes of community class, trend class, and topic are modeled as a hierarchy. Here, a trend is, for example, a tendency, a current, or a fashion. For instance, a trend is the tendency in how frequently a given word appears within a given period.

In FIG. 1, block 11 is the block of the hyperparameter α used to obtain the community class multinomial distribution ψ. A random value is used as the initial value of the hyperparameter α. Block 12 obtains the community class multinomial distribution ψ.

Block 13 is the block of the hyperparameter β used to obtain the trend class multinomial distribution Ψ. A random value is used as the initial value of the hyperparameter β. Block 14 obtains S multinomial distributions Ψ over trend classes, one for each community class. Here, S is the number of community classes.

Block 15 is the block of the hyperparameter γ used to obtain the topic multinomial distribution θ. A random value is used as the initial value of the hyperparameter γ. Block 16 obtains C multinomial distributions θ over topics, one for each trend class. Here, C is the number of trend classes.

Block 17 is the block of the hyperparameter δ used to obtain the multinomial distribution φ of words in tokens. A random value is used as the initial value of the hyperparameter δ. A token is, for example, a delimited unit (for example, a word) in the text of a blog or web page written on the Internet. Block 18 obtains Z multinomial distributions φ of words in tokens, one for each topic. Here, Z is the number of topics.

Block 19 obtains the community class latent variable s from the user ID. The probability distribution that user a belongs to community class s is denoted as the latent variable s_a. Here, user a is, for example, an author who wrote a blog or web page on the Internet, and is identified by a user ID. A is the number of users (authors). From the user ID, block 19 obtains the latent variable s_a, which indicates that user a belongs to community class s, using the community class multinomial distribution ψ from block 12.

Block 20 obtains the trend class latent variable c from the document ID. Here, document d is, for example, a blog or web page document written on the Internet, and is identified by a document ID. The probability distribution that document d belongs to trend class c is denoted as the latent variable c_d. D_a is the number of documents written by user (author) a. Using the latent variable s_a from block 19, which indicates that user a belongs to community class s, block 20 selects the corresponding per-community multinomial distribution Ψ over trend classes from block 14. With that distribution Ψ, block 20 then estimates the probability distribution that document d belongs to trend class c as the latent variable c_d.

Block 21 obtains the topic latent variable z for each token. The probability distribution that the i-th token belongs to topic z is denoted as the topic latent variable z_i. Here, N_d is the number of tokens in document d. Using the latent variable c_d from block 20, which indicates that document d belongs to trend class c, block 21 selects the corresponding per-trend-class multinomial distribution θ over topics from block 16. With that distribution θ, block 21 then estimates, from the token, the probability distribution that the i-th token belongs to topic z as the latent variable z_i.

Block 22 obtains the observation variable w. The observation variable w is information extracted from the data according to a predetermined rule. For example, if the data is a book purchase history, the observation variable w is the title of a book. If the data is a magazine, the observation variable w is a word, a sentence, a paragraph, or a chapter title. If the data is movies, the observation variable w is the title of a movie.
In this embodiment, as an example, the observation variable w is a probability distribution of tokens, and the appearance probability distribution of the i-th token w is denoted as the observation variable w_i. Using the latent variable z_i, which indicates that the i-th token belongs to topic z, block 22 selects the corresponding per-topic multinomial distribution φ of words in tokens from block 18, and estimates the appearance probability distribution of the i-th token w as the observation variable w_i.

In this way, the model shown in FIG. 1 estimates a user's interests by hierarchically using the latent variables of community class, trend class, and topic.

That is, block 19 estimates the latent variable s_a indicating that user a belongs to community class s. Block 20 uses s_a and the per-community multinomial distribution Ψ over trend classes to estimate the latent variable c_d indicating that document d belongs to trend class c. Block 21 uses c_d and the per-trend-class multinomial distribution θ over topics to estimate the latent variable z_i indicating that the i-th token belongs to topic z. Block 22 uses z_i and the per-topic multinomial distribution φ of words in tokens to estimate the observation variable w_i.
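The chain of blocks 19 to 22 can be read as a generative process. The following is only an illustrative sketch of that reading, not the inventor's implementation: the function names, the symmetric Dirichlet priors with concentration 1.0, the vocabulary size `V`, and the fixed document and token counts are all assumptions introduced for the example.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional probability vector from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw one index from the categorical distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_users, docs_per_user, tokens_per_doc,
                    S=2, C=3, Z=4, V=10, seed=0):
    """Sample users, documents, and tokens following the FIG. 1 hierarchy."""
    rng = random.Random(seed)
    psi = sample_dirichlet(1.0, S, rng)                        # block 12: over community classes
    Psi = [sample_dirichlet(1.0, C, rng) for _ in range(S)]    # block 14: one per community class
    theta = [sample_dirichlet(1.0, Z, rng) for _ in range(C)]  # block 16: one per trend class
    phi = [sample_dirichlet(1.0, V, rng) for _ in range(Z)]    # block 18: one per topic

    corpus = []
    for a in range(num_users):
        s_a = sample_index(psi, rng)                 # block 19: community class of user a
        for _ in range(docs_per_user):
            c_d = sample_index(Psi[s_a], rng)        # block 20: trend class of document d
            words = []
            for _ in range(tokens_per_doc):
                z_i = sample_index(theta[c_d], rng)  # block 21: topic of token i
                words.append(sample_index(phi[z_i], rng))  # block 22: observed word w_i
            corpus.append({"user": a, "community": s_a,
                           "trend": c_d, "words": words})
    return corpus
```

Calling `generate_corpus(2, 3, 5)` yields six documents of five word indices each. In the actual system the direction is reversed: the words are observed and the latent variables are estimated from them.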

However, the model shown in FIG. 1 does not include the date and time at which a document was created (its time stamp). For this reason, it cannot extract trend classes (the joint probability of each element being generated and the time at which it appears).

In addition, the model shown in FIG. 1 does not distinguish between data needed for extracting trend classes and data that is not. For this reason, local trends cannot be extracted automatically and uniquely. For example, some documents, such as newspapers, are purchased by anyone at any time. Such general documents are not suitable as a reflection of trends. There are also texts, such as trade papers, that interest people in a particular industry but are of almost no interest to the general public. Such local documents are not suitable as a reflection of trends either.

FIG. 2 shows the model according to the first embodiment of the present invention. In FIG. 2, blocks 111, 113, 115, and 117 are the blocks of the hyperparameters α, β, γ, and δ, like blocks 11, 13, 15, and 17 in FIG. 1.

Blocks 112, 114, 116, and 118 obtain the community class multinomial distribution ψ, the trend class multinomial distribution Ψ, the topic multinomial distribution θ, and the multinomial distribution φ of words in tokens, like blocks 12, 14, 16, and 18 in FIG. 1. That is, block 118 functions as an observation variable probability distribution acquisition unit that acquires the probability distribution φ of the observation variable w for each layer. In this embodiment, as an example, three layers are provided: a community class layer, a trend class layer, and a topic layer. Blocks 112, 114, and 116 are the same as blocks 12, 14, and 16 in FIG. 1.
Although this embodiment uses three layers, the invention is not limited to this. By using observation variables such as location information or income instead of the time stamp, the data may be divided into layers such as a classification by region or a classification by income. The number of layers may be two or fewer, or four or more.

Regarding block 118, which obtains the multinomial distribution φ of words in tokens: in block 18 of FIG. 1 there were Z multinomial distributions φ, one per topic, whereas in this embodiment there are (Z + C + S + 1) multinomial distributions φ, namely one per topic, one per trend class, one per community class, and one for the whole. This is because there are Z topics, C trend classes, and S community classes.

Block 119 obtains the community class latent variable s, like block 19 in FIG. 1. Block 120 obtains the trend class latent variable c, like block 20 in FIG. 1. That is, block 120 functions as a trend classification unit that classifies target data including time information (for example, a text) into one of a plurality of trend categories based on the time frequency distribution acquired by block 125.
Block 121 obtains the topic latent variable z, like block 21 in FIG. 1. Block 122 obtains the observation variable w. In block 22 of FIG. 1, the observation variable w was estimated using the Z per-topic multinomial distributions φ of words in tokens. In this embodiment, by contrast, block 122 performs the estimation after switching, according to the switch latent variable r, to one of the layers: topic, trend class, community class, or the whole.
That is, block 122 functions as an observation variable generation unit that switches among the probability distributions acquired by block 118 according to the layer, and generates an observation variable based on the selected probability distribution. More specifically, block 122 functions as an observation variable generation unit that switches, among the probability distributions acquired by block 118, to the probability distribution of the layer corresponding to the switch variable acquired by block 129.

Block 125 obtains, for each trend class, a beta distribution λ indicating the frequency with which target data (for example, a text) appears at each time. Here, C is the number of trend classes.
That is, block 125 functions as a time frequency distribution acquisition unit that acquires, for each trend category into which trends are classified, a time frequency distribution defining the frequency with which target data appears at each time. The beta distribution λ is defined over time normalized to the range 0 to 1 (the oldest time is 0 and the current time is 1). This embodiment uses a beta distribution for time so that time can be handled continuously. If time were handled discretely (hourly, daily, weekly, or monthly), a slowly changing trend class could be treated as long-period data such as monthly data, but a frequently changing trend class would have to be treated as short-period data such as hourly or daily data, and the amount of data would grow. By handling time continuously, both slowly changing and frequently changing trend classes can be processed simply as differences in the shape of the distribution curve.
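To make the normalization and the role of the curve shape concrete, here is a minimal sketch. The function names and the example shape parameters are assumptions for illustration; the patent text at this point does not specify how the shape parameters of λ are estimated.

```python
import math
from datetime import datetime


def normalize_time(ts, oldest, now):
    """Map a time stamp onto [0, 1], with the oldest time at 0 and the current time at 1."""
    span = (now - oldest).total_seconds()
    return (ts - oldest).total_seconds() / span


def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t, for t strictly between 0 and 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))


# A time stamp halfway through a one-year window (2011 is not a leap year).
t = normalize_time(datetime(2011, 7, 1), datetime(2011, 1, 1), datetime(2012, 1, 1))

# A broad curve (slowly changing trend) and a sharp curve (short-lived trend)
# differ only in their shape parameters, not in how time is binned.
broad = beta_pdf(0.9, 2.0, 2.0)
sharp = beta_pdf(0.9, 50.0, 6.0)
```

Here `broad` and `sharp` are evaluated at the same normalized time; the sharp curve concentrates its mass near 0.9, which is how a short-lived trend would be represented without switching to a finer time grid.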

Block 126 obtains the time stamp observation variable t from the time stamp of a document, using the beta distribution λ of block 125. The time stamp indicates the date and time at which document d was generated. According to the trend class latent variable c_d obtained in block 120, block 126 selects the corresponding per-trend-class beta distribution λ from block 125. Using that beta distribution λ and the time stamp, it estimates the probability distribution that document d has time stamp t as the observation variable t_d.

Block 127 is the block of the hyperparameter ε used to obtain the multinomial distribution μ. A random value is used as the initial value of the hyperparameter ε. Block 128 obtains D_a multinomial distributions μ. Here, D_a is the number of documents written by user (author) a. That is, block 128 functions as a switch variable probability distribution acquisition unit that acquires a probability distribution of the switch variable for each item of target data (for example, a text).

Block 129 obtains the switch latent variable r. From the multinomial distribution μ_d corresponding to document d, block 129 obtains the switch variable r_i of the i-th token. The multinomial distribution that block 122 uses to estimate the observation variable w is selected by this switch variable. Block 129 functions as a switch variable generation unit that generates a switch variable for each observation variable based on the probability distribution acquired by block 128.
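The relationship between blocks 127, 128, and 129 can be sketched as follows. This is an illustrative reading only: the function name, the symmetric Dirichlet prior with concentration `epsilon`, and the use of four switch values (one per layer plus the whole) are assumptions of the example.

```python
import random


def sample_switches(num_tokens, epsilon=1.0, num_levels=4, seed=0):
    """Draw a per-document switch distribution mu_d, then one switch r_i per token."""
    rng = random.Random(seed)
    # Block 128: one multinomial distribution mu_d per document,
    # drawn here from a symmetric Dirichlet(epsilon) prior (block 127).
    gammas = [rng.gammavariate(epsilon, 1.0) for _ in range(num_levels)]
    total = sum(gammas)
    mu_d = [g / total for g in gammas]
    # Block 129: one switch variable r_i for each token of the document.
    r = rng.choices(range(num_levels), weights=mu_d, k=num_tokens)
    return mu_d, r


mu_d, r = sample_switches(num_tokens=8)
```

Because μ is held per document, two documents by the same author can mix the layers in different proportions.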

In the earlier model, the time of the observation variable is not part of the model. For this reason, the appearance probability of document d at each time could not be extracted for each trend class. In the model according to this embodiment shown in FIG. 2, by contrast, the time t is introduced as one of the observation variables. When the trend class latent variable c changes in block 120, block 126 switches the temporal distribution of the appearance probability of document d supplied by block 125, so the appearance probability of document d at each time t changes. As a result, the appearance probability of document d at each time can be extracted for each trend class.
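One way to picture how a time stamp informs the trend class is to weight each trend class by the density of its beta distribution at the document's normalized time. The sketch below applies Bayes' rule with a fixed prior; this is an assumption made for illustration, since the passage above does not prescribe this exact computation, and the shape parameters and prior are invented values.

```python
import math


def log_beta_density(t, a, b):
    """Log density of Beta(a, b) at t, for t strictly between 0 and 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)


def trend_posterior(t, shapes, prior):
    """Return p(c | t), proportional to prior[c] * Beta(t; a_c, b_c), for each trend class c."""
    weights = [p * math.exp(log_beta_density(t, a, b))
               for (a, b), p in zip(shapes, prior)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical trend classes: class 0 peaks early, class 1 peaks late.
shapes = [(2.0, 8.0), (8.0, 2.0)]
posterior = trend_posterior(0.9, shapes, prior=[0.5, 0.5])
best = max(range(len(posterior)), key=posterior.__getitem__)
```

For a recent document (normalized time 0.9), nearly all of the posterior mass falls on the late-peaking class. In the full model the trend class also depends on the author's community class and on the tokens, which this sketch omits.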

Also, in the earlier model, the observation variable w is obtained using the per-topic multinomial distribution φ. In this embodiment, by contrast, the multinomial distribution used to obtain the observation variable w is switched by the switch variable r among the per-topic multinomial distributions, the per-trend-class multinomial distributions, the per-community-class multinomial distributions, and the multinomial distribution for the whole. Since there are Z topics, C trend classes, S community classes, and one distribution for the whole, the number of multinomial distributions φ in block 118 is (Z + C + S + 1).

If the switch variable is r = 0, the multinomial distribution for the whole is selected, and the observation variable w is generated from the co-occurrence multinomial distribution. The multinomial distribution for the whole is a general distribution that is independent of content and time.

If the switch variable is r = 1, the per-topic multinomial distribution is selected. The topic multinomial distributions describe content that persists over a long period.

If the switch variable is r = 2, the per-trend-class multinomial distribution is selected. The trend class multinomial distributions describe content whose tendency changes over time and that lasts only a short period.

If the switch variable is r = 3, the per-community-class multinomial distribution is selected. The community class multinomial distributions describe local content specific to that community class.

In this way, by introducing the switch variable r, the model according to this embodiment can separate what changes over time from what does not. This makes it possible to express data not only by the joint probability of a time and an element, but also by the probability of the element alone.
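The selection among the (Z + C + S + 1) distributions can be sketched as an index computation. The flat layout used here (the whole first, then topics, trend classes, and community classes) and the function name are assumptions for illustration; only the mapping from r to a layer comes from the description above.

```python
def bank_index(r, z, c, s, Z, C, S):
    """Return the position, within a bank of Z + C + S + 1 word distributions,
    of the distribution selected by switch variable r.

    Assumed layout: [whole | Z topics | C trend classes | S community classes].
    """
    if r == 0:
        return 0                  # distribution for the whole
    if r == 1:
        return 1 + z              # distribution of topic z
    if r == 2:
        return 1 + Z + c          # distribution of trend class c
    if r == 3:
        return 1 + Z + C + s      # distribution of community class s
    raise ValueError(f"unknown switch value: {r}")


Z, C, S = 4, 3, 2
index_for_trend_token = bank_index(2, z=1, c=0, s=1, Z=Z, C=C, S=S)
```

With Z = 4, C = 3, and S = 2 the bank holds ten distributions, and a token with r = 2 in trend class 0 draws its word from position 5 regardless of its topic or the author's community class.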

FIG. 3 shows the correspondence between the name of each block and the name of each functional unit, and the correspondence between the specific probability distributions of this embodiment and the names of the probability distributions. Table T31 shows the correspondence between the name of each block and the name of each functional unit. Table T32 shows the correspondence between the specific probability distributions of this embodiment and the names of the probability distributions.
Next, the processing that carries out the model shown in FIG. 2 to estimate a user's interests is described in detail.

FIG. 4 is a block diagram showing the configuration of the classification system according to the embodiment of the present invention, and FIG. 5 is a functional block diagram of each part. As shown in FIG. 4, the system according to the embodiment includes a file server 501, a calculation server (classification device) 502, a database 503, and a service server 504. In the following, a word is used as an example of a token.

As shown in FIG. 5, the file server 501 has a data file storage unit 511 that stores the document data to be processed. The stored document data may be blog posts, web pages, Twitter posts, papers, and the like. The document data to be processed is not limited to documents on the Internet. The data file storage unit 511 stores each piece of document data in association with a document ID that identifies the document, an author ID that indicates its author, and a time stamp that indicates when the document was created.

The calculation server 502 retrieves document data from the file server 501, performs the calculation represented by the model shown in FIG. 2, and outputs the result. As shown in FIG. 5, the calculation server 502 has a pre-processing unit 521 and a calculation processing unit 522.

The pre-processing unit 521, for example, receives from the file server 501 the document data file to be processed and, for each document in the file, extracts the document ID, the author ID, and the time stamp, together with the words that make up the document. The pre-processing unit 521 then assigns a processing document ID to the document, a processing author ID to the author, and a processing word ID to each extracted word.

The calculation processing unit 522 receives the data processed by the pre-processing unit 521 and performs the calculation corresponding to the model shown in FIG. 2. As described later, this embodiment uses Gibbs sampling to estimate the latent variables.

As shown in FIG. 5, the database 503 has a calculation result storage unit 531. The results computed by the calculation processing unit 522 of the calculation server 502 are sent to the database 503 and stored in the calculation result storage unit 531. As shown in FIG. 6, the calculation results consist of the random variables for community, trend class, time stamp, switch, topic, and word, together with the parameters and type of each probability distribution. In the calculation results of FIG. 6, the time stamp and the word are observed variables; the others are latent variables.

The service server 504 makes the calculation results available to services. As shown in FIG. 5, the service server 504 has a calling unit 541.

In response to a call from the user terminal 505, the calling unit 541 sends the calculation results to the user terminal 505. These results can be used for various services such as marketing, demand forecasting, advertising, and recommendation.

FIG. 7 is a flowchart showing the processing in the calculation server 502. In FIG. 7, the pre-processing unit 521 of the calculation server 502 first assigns a processing document ID, a processing author ID, and processing word IDs to the document data to be processed (step S1).

That is, as shown in FIG. 8(A), each document carries its own author ID and document ID. In FIG. 8(A), the document data of the first record has the author ID "A" and the document ID "001". Its tokens (here, words) are "Rome", "history", ..., "heritage". The document data of the last record has the author ID "Z" and the document ID "087". Its tokens are "ancient", "art", ..., "culture".

FIG. 8(B) shows an example of the result of assigning processing author IDs, processing document IDs, and processing word IDs to such data in step S1. As shown in FIG. 8(B), the document data of the first record is assigned the processing author ID "0" and the processing document ID "0". Its tokens 1 to N are assigned the processing word IDs "22", "0", ..., "1212". The document data of the last record is assigned the processing author ID "100" and the processing document ID "223", and its tokens 1 to N are assigned the processing word IDs "4", "1", ..., "557".
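The ID assignment of step S1 can be sketched as follows. This is a minimal illustration that numbers authors, documents, and words in first-seen order; the word IDs in FIG. 8(B) evidently follow a different ordering, so the concrete numbers will not match the figure. The function name `assign_processing_ids` and the record layout are assumptions made for this example.

```python
def assign_processing_ids(records):
    """Map original author IDs, document IDs, and tokens to dense integer IDs.

    records: list of (author_id, document_id, tokens) tuples.
    Returns the encoded records plus the three lookup tables.
    """
    author_ids, doc_ids, word_ids = {}, {}, {}
    encoded = []
    for author, doc, tokens in records:
        # setdefault keeps an existing ID and otherwise assigns the next free integer.
        a = author_ids.setdefault(author, len(author_ids))
        d = doc_ids.setdefault(doc, len(doc_ids))
        w = [word_ids.setdefault(tok, len(word_ids)) for tok in tokens]
        encoded.append((a, d, w))
    return encoded, author_ids, doc_ids, word_ids
```

For example, two records that share the token "Rome" receive the same processing word ID for it, while each new author and document receives the next unused integer.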

Next, the calculation processing unit 522 of the calculation server 502 sets the number of classes for each of the random variables (C, S, Z), the initial values of the hyperparameters (α, β, γ, δ, ε), and the number of iterations of the calculation (step S2). The calculation processing unit 522 then generates random numbers and assigns them to the random variables (C, S, Z) (step S3).

That is, FIG. 9(A) shows the state after step S1 has assigned processing document IDs, processing author IDs, and processing word IDs to the document data to be processed. For such data, as shown in FIG. 9(B), random numbers are inserted for the community class S, the trend class C, and the topic Z. Here, for example, a random value from "0" to "20" is inserted for the community class, a random value from "0" to "40" for the trend class, and a random value from "0" to "99" for the topic. In the example of FIG. 9(B), the first record receives the random number "7" as its community class S and "20" as its trend class C, and its tokens "1" to "N" receive the random numbers "7", "5", ..., "8" as their topics. The last record receives the random number "12" as its community class S and "11" as its trend class C, and its tokens "1" to "N" receive the random numbers "8", "8", ..., "3" as their topics.
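The random initialization of step S3 might look like the sketch below. It assumes the example ranges given above (community class 0 to 20, trend class 0 to 40, topic 0 to 99) and follows FIG. 9(B) in attaching one community class and one trend class to each record. The function name `init_latent`, the record layout, and the seed parameter are hypothetical.

```python
import random

def init_latent(encoded_docs, n_community=21, n_trend=41, n_topic=100, seed=0):
    """Assign random initial values to the latent variables of every record.

    encoded_docs: list of (author_id, document_id, word_ids) tuples.
    """
    rng = random.Random(seed)
    state = []
    for _author, _doc, words in encoded_docs:
        state.append({
            "s": rng.randrange(n_community),                  # community class S
            "c": rng.randrange(n_trend),                      # trend class C
            "z": [rng.randrange(n_topic) for _ in words],     # one topic Z per token
        })
    return state
```

Fixing the seed makes a run repeatable, which helps when comparing Gibbs sampling runs that start from the same initial state.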

Next, the calculation processing unit 522 of the calculation server 502 estimates the latent variables by Gibbs sampling (step S4). When the number of Gibbs sampling iterations reaches the predetermined number of iterations, the calculation ends (step S5).

Thus, in the present embodiment, the latent variables are estimated by Gibbs sampling. FIG. 10 shows the correspondence between the values estimated by Gibbs sampling and the latent variables obtained from them.

Next, how each latent variable is obtained by Gibbs sampling in the present embodiment is described. As shown in FIG. 11, Dirichlet distributions are introduced so that each latent variable can be estimated statistically. The Dirichlet distribution is a continuous probability distribution, but integrating it out yields a discrete distribution.

By introducing such probability distributions, the joint probability of the entire data is expressed by the following equation.

Here, p(|) in the joint probability of the entire data denotes a conditional probability.
Integrating the above equation gives the following equation.

Here, Γ is the gamma function. n_s is the number of authors belonging to community class s. n_sc is the number of documents for which trend class c was selected from community class s. n_cz is the number of times topic z was selected from trend class c. n_dr is the number of tokens in document d for which switch variable r was selected. n_zw is the number of times observed variable w was selected from topic z.
In this way, introducing the Dirichlet distributions and integrating eliminates the parameters of the multinomial distributions, leaving only the hyperparameters and the frequency counts. The continuous probability distributions thus become discrete probability distributions.
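Why the multinomial parameters vanish can be seen from the standard Dirichlet-multinomial identity for a single level. The expression below is a generic illustration, not a reconstruction of equation (1): it assumes one multinomial parameter vector θ over K outcomes with Dirichlet hyperparameters α_1, ..., α_K, and n_k denotes how often outcome k was observed.

```latex
\int \Bigl(\prod_{k=1}^{K} \theta_k^{\,n_k}\Bigr)\,
     \mathrm{Dir}(\theta \mid \alpha)\, d\theta
  \;=\;
  \frac{\Gamma\!\bigl(\sum_{k=1}^{K} \alpha_k\bigr)}
       {\Gamma\!\bigl(\sum_{k=1}^{K} (n_k + \alpha_k)\bigr)}
  \prod_{k=1}^{K} \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}
```

The right-hand side contains only gamma functions of hyperparameters and counts, which is the pattern the text describes for n_s, n_sc, n_cz, n_dr, and n_zw.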

Rearranging equation (1), which was obtained by integrating out the Dirichlet distributions, yields the expressions used to estimate each latent variable by Gibbs sampling. That is, the probability that the community class s is g can be derived as in the following equation.

The probability that the trend class c is j can be derived as in the following equation.

The probability of the variable r can be derived as in the following equations.

Equation (4) is the probability that r = 0. Equation (5) is the probability that r = 1, where k is a topic identifier. Equation (6) is the probability that r = 2. Equation (7) is the probability that r = 3.
FIG. 12 shows the algorithm that estimates the latent variables by Gibbs sampling.
Following this algorithm, the calculation processing unit 522 performs initialization and sets the number of Gibbs sampling iterations N_{iteration}.
The calculation processing unit 522 then repeats the following processing once for each of the A authors. For each author, it estimates the community-class latent variable s_a using equation (2) and updates the variable n_s. Each time it estimates s_a and updates n_s, it performs the following processing.

The calculation processing unit 522 repeats the following processing once for each of the D documents. For each document, it estimates the trend-class latent variable c_d using equation (3) and updates the variables n_c and λ_c. Each time it estimates c_d and updates n_c, it performs the following processing.

The calculation processing unit 522 repeats the following processing once for each word in document d. Using equations (4), (5), (6), and (7), it estimates the switch latent variable r_di and the topic latent variable z_di, and updates the variables n_dr, n_cz, and n_zw.
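The nesting of these loops can be sketched in Python as below. This is only a control-flow skeleton under stated assumptions: the conditional probabilities of equations (2) to (7) are not reproduced here, so they are passed in as hypothetical callables (`cond_s`, `cond_c`, `cond_rz`) that return unnormalized weights, the documents are assumed to be grouped under their author, and the count updates (n_s, n_c, n_dr, n_cz, n_zw) are left to those callables.

```python
import random

def sample_index(weights, rng):
    """Draw an index with probability proportional to non-negative weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def gibbs_sweeps(authors, n_iteration, cond_s, cond_c, cond_rz, rng):
    """Run n_iteration sweeps over authors, their documents, and their words.

    authors: list of {"s": int, "docs": [{"c": int, "words": [...], "r": [...], "z": [...]}]}
    cond_s(a)        -> weights over community classes   (stand-in for equation (2))
    cond_c(a, d)     -> weights over trend classes       (stand-in for equation (3))
    cond_rz(a, d, i) -> (list of (r, z) pairs, weights)  (stand-in for equations (4)-(7))
    """
    for _ in range(n_iteration):
        for a in authors:
            a["s"] = sample_index(cond_s(a), rng)              # community class of the author
            for d in a["docs"]:
                d["c"] = sample_index(cond_c(a, d), rng)       # trend class of the document
                for i in range(len(d["words"])):
                    pairs, weights = cond_rz(a, d, i)
                    d["r"][i], d["z"][i] = pairs[sample_index(weights, rng)]
    return authors
```

Passing the conditionals as callables keeps the loop structure separate from the model-specific formulas, so the skeleton can be exercised with simple stand-in weights before the real expressions are plugged in.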

After all of the above iterations are complete, if the probability distributions themselves are needed, the calculation processing unit 522 estimates the parameters ψ, Ψ, θ, φ, and μ of the multinomial distributions. In FIG. 12, the hat placed over each of the parameters ψ, Ψ, θ, φ, and μ denotes an estimated value.
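In collapsed Gibbs samplers, a multinomial parameter is commonly recovered from the final counts as a smoothed relative frequency. The sketch below shows that common estimate for a single distribution with a symmetric Dirichlet hyperparameter. Whether FIG. 12 uses exactly this form is an assumption, and `estimate_multinomial` is a hypothetical name.

```python
def estimate_multinomial(counts, alpha):
    """Smoothed estimate of a multinomial parameter vector from counts.

    counts: list of non-negative counts n_k, one per outcome.
    alpha:  symmetric Dirichlet hyperparameter (same value for every outcome).
    """
    total = sum(counts) + alpha * len(counts)
    return [(n + alpha) / total for n in counts]
```

For counts [3, 1, 0] and alpha = 1.0 this yields [4/7, 2/7, 1/7]: the outcome that was never observed keeps a small nonzero probability.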

Next, the effects obtained by the present invention are described. FIG. 13 lists movie popularity rankings for the whole period (the six years from 2000 to 2005) and for 2000 to 2001, 2002 to 2003, and 2004 to 2005. FIG. 13(A) is the result of simple counting. Among the results shown in FIG. 13(A), the underlined titles maintain steady popularity throughout the whole period; that is, they change little over time.

FIG. 13(B) shows the results obtained with the present embodiment. Here, the number of community classes S is 75, the number of movie trend classes C is 75, and the number of movie topics Z is 100. φb represents the overall trend and lists the entries of the overall probability distribution φ in descending order of probability. φc (2000-2001) is obtained by extracting the trend class whose beta distribution peaks closest to t = 0 and listing the entries of that trend class's probability distribution φ in descending order of probability. φc (2002-2003) is obtained in the same way from the trend class whose beta distribution peaks closest to t = 0.5, and φc (2004-2005) from the trend class whose beta distribution peaks closest to t = 1.
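Selecting the trend class whose beta distribution peaks nearest a given time can be sketched as follows. The sketch assumes that time is normalized to the interval [0, 1], as the values t = 0, 0.5, and 1 suggest, and that each trend class has beta parameters a > 1 and b > 1 so that the peak (mode) is (a - 1) / (a + b - 2). The function names and the parameter values in the example are hypothetical.

```python
def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution; only defined here for a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("this sketch requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

def closest_trend_class(beta_params, t):
    """Return the trend class whose beta distribution peaks nearest to time t.

    beta_params: dict mapping trend class -> (a, b).
    t: normalized time in [0, 1].
    """
    return min(beta_params, key=lambda c: abs(beta_mode(*beta_params[c]) - t))
```

With three trend classes whose parameters are (2, 8), (5, 5), and (8, 2), the peaks lie at 0.125, 0.5, and 0.875, so t = 0, t = 0.5, and t = 1 select the first, second, and third class respectively.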

With simple counting, titles that keep steady popularity over the whole period, such as the underlined titles in FIG. 13(A), are included in the rankings. In contrast, as shown in FIG. 13(B), in the present embodiment the underlined titles with steady popularity over the whole period are removed from (or drop out of) the rankings for 2000 to 2001, 2002 to 2003, and 2004 to 2005, and titles that reflect the trend class of each period are ranked instead. In this way, the present embodiment can obtain trend classes by separating what changes over time from what does not.

FIG. 14 compares the method of the present embodiment with conventional methods. In FIG. 14, TOT, DTMs, gPLSA, and LIT are conventional methods. POT is the method of the present embodiment. POT r = {0, 1, 2, 3} is the case where r can take the values 0, 1, 2, and 3. POT r = {1, 2, 3} is the case where r can take the values 1, 2, and 3 but not 0. POT r = {2, 3} is the case where r can take the values 2 and 3 but not 0 or 1. Thus, with POT r = {2, 3}, the observed variable w is generated from the probability distribution of a trend class or of a community class.

As described above, if r = 0, the overall multinomial distribution, which is the general distribution, is selected. If r = 1, the multinomial distribution for each topic, which is a long-lasting distribution, is selected. If r = 2, the multinomial distribution for each trend class, whose tendency changes over time, is selected. If r = 3, the local multinomial distribution for each community class, which is specific to that community class, is selected.

Next, each index in FIG. 14 is described in detail. Top-10 is the probability that a user viewed, during the immediately preceding predetermined period (for example, the most recent month), a recommended movie title ranked in the top ten on the basis of data from the test period excluding that predetermined period.
UC (User Coverage) is the ratio of the number of users to whom each recommendation method can make recommendations to the number of users who viewed movie titles during the test period. The higher the UC, the more users can receive recommendations, so the system is more valuable to users as a whole.

IC (Item Coverage) is the ratio of the number of titles each recommendation method can recommend to the number of movie titles viewed during the test period. IC is one index of the size of the domain of movie titles that the system can recommend. A system with a low IC can present only a very limited set of movie titles and is therefore of low value to users.
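The two coverage ratios can be computed as in the sketch below. It rests on two assumptions that the text does not spell out: a user counts as recommendable when the method produces at least one title for that user, and only recommended titles that were also viewed in the test period count toward IC. The function names and the data layout are hypothetical.

```python
def user_coverage(recommendations, test_viewers):
    """UC: share of test-period viewers who receive at least one recommendation.

    recommendations: dict mapping user -> list of recommended titles.
    test_viewers: set of users who viewed a title during the test period.
    """
    if not test_viewers:
        return 0.0
    covered = {u for u in test_viewers if recommendations.get(u)}
    return len(covered) / len(test_viewers)

def item_coverage(recommendations, test_titles):
    """IC: share of titles viewed in the test period that are recommended to anyone."""
    if not test_titles:
        return 0.0
    recommended = {t for titles in recommendations.values() for t in titles}
    return len(recommended & test_titles) / len(test_titles)
```

For example, if two of four test-period viewers receive recommendations and two of four viewed titles appear in some recommendation list, both UC and IC are 0.5.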

The Gini coefficient is an index of the statistical dispersion of the recommendations when movie titles are recommended to users. It takes a value from 0 to 1: the closer the value is to 0, the smaller the disparity in the number of recommended users per movie title, and the closer it is to 1, the larger the disparity.
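A Gini coefficient over the number of recommended users per title can be computed with the standard sorted-rank formula, as sketched below. The text does not state which formula the evaluation used, so this is an assumption, and `gini` is a hypothetical helper name.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even).

    counts: number of users each movie title was recommended to.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values with 1-based ranks.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

Four titles recommended to five users each give 0, while four titles where a single title receives all ten recommendations give 0.75, the maximum for four items under this formula.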

AE (Average Elapsed time) is the average time elapsed between a movie title's release and its viewing. The smaller this value, the more novel the movie titles are for users.

AD (Average Difference time) is the average difference between the start time of the test period and the time at which a movie title was viewed. The larger this value, the harder the movie title would have been for the user to notice.
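The two time-based indices are plain averages and can be sketched as below. Times are assumed to be numbers on a common scale such as days, and the function names and argument layout are hypothetical.

```python
def average_elapsed(views):
    """AE: mean time from release to viewing.

    views: non-empty list of (release_time, view_time) pairs on a common time scale.
    """
    return sum(view - release for release, view in views) / len(views)

def average_difference(view_times, test_start):
    """AD: mean difference between each viewing time and the test-period start."""
    return sum(view - test_start for view in view_times) / len(view_times)
```

For two viewings at days 10 and 25 of titles released at days 0 and 5, AE is 15; with a test period starting at day 5, AD for the same viewings is 12.5.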

As shown in FIG. 14, with POT r = {1, 2, 3}, the Top-10 value is statistically larger than with any conventional method, so a better set of top-ten movie titles can be recommended than before.
With POT r = {2, 3}, the IC value is statistically larger than with any conventional method, so the recommended movie titles differ more from user to user than before. The classification system of the present embodiment can therefore present a wide range of movie titles, which makes it highly valuable to users.

With POT r = {2, 3}, the Gini value is also statistically smaller than with any conventional method, so the disparity in the number of recommended users per movie title is smaller than before. The classification system of the present embodiment can therefore recommend movie titles broadly, which makes it highly valuable to users.

With POT r = {2, 3}, the AE value is also statistically smaller than with any conventional method, so the time elapsed between a movie title's release and its viewing is shorter than before. The classification system of the present embodiment can therefore recommend newer movie titles than before, which makes it highly valuable to users.

As described above, the calculation server 502 in the present embodiment introduces the time t as an observed variable and extracts the appearance probability of a document d for each trend class and each time. This allows the calculation server 502 to extract, from time-series data, the period of each trend and the tokens (for example, words) that make up each trend at the same time. As a result, the calculation server 502 can also extract, for example, communities (sets of users with similar preferences) at the same time.

The calculation server 502 in the present embodiment also uses the switch variable r to switch the multinomial distribution from which the observed variable w is obtained among the multinomial distribution for each topic, the multinomial distribution for each trend class, the multinomial distribution for each community class, and the overall multinomial distribution. This separates what changes over time from what does not, so that an element can be expressed not only by the joint probability of a time-element combination but also by the probability of the element alone.

Conventional analysis did not take continuous variation along the time series into account and therefore could not predict changes over time. In contrast, the present embodiment introduces the observed time t into the model. This makes it possible to make predictions that incorporate elements varying continuously along the time series, so trends can be extracted.

In conventional analysis, the hierarchical structure was not reflected in the data generation process, so data could not be classified by hierarchy level (for example, community or topic). Because the classification of the analyzed data was not hierarchical, analysis that takes the attributes of the data into account (for example, those related to the size of the relevant community) was not possible.
In contrast, the present embodiment generates a probability distribution for the observed variable w for each hierarchy level (for example, community or topic) and switches the distribution used according to the value of the switch variable r. This allows constituent factors to be classified by hierarchy level (for example, community, trend, or topic). As a result, when author IDs, document IDs, and word IDs are assigned to the analysis data, it is possible to analyze, for example, whether something is specific to a particular author ID or universal.

In addition, in the present embodiment, the observed time t is introduced into the model, a probability distribution that generates the observed variable w is generated for each hierarchy level, and the distribution used is switched according to the value of the switch variable r. With this configuration, the trend of each hierarchy level (for example, community or topic) can be extracted uniquely and without manual intervention. Here, "uniquely" means that the result is always the same regardless of who performs the extraction.

Furthermore, the present embodiment introduces a trend probability distribution (a beta distribution, as one example) so that the time t can be handled as a continuous value. Handling time as a continuous value makes it possible to extract trends with different periods, together with the elements that make up each trend, at the same time. Here, a period is the time scale of a time-series change. Expressing each trend as a probability distribution also makes trends easy to compare. Moreover, because frequencies need not be discretized into fixed time widths, the time width does not have to be adjusted for each data set, which saves that effort.

In addition, in the present embodiment, the observed time t is introduced into the model, a probability distribution that generates the observed variable w is generated for each hierarchy level, the distribution used is switched according to the value of the switch variable r, and a trend probability distribution is introduced so that the time t can be handled as a continuous value. With this configuration, multiple time series with different periods can be classified hierarchically and by the period of their time-series change.

The probability distribution of the time t is not limited to the beta distribution λ; any probability distribution may be used, for example a gamma distribution.
A system comprising a plurality of devices may distribute the processes of the calculation server 502 of the present embodiment across those devices.
A program for executing each process of the calculation server 502 of the present embodiment may be recorded on a computer-readable recording medium, and the various processes of the calculation server 502 described above may be performed by having a computer system read and execute the program recorded on that medium.

The "computer system" referred to here may include an OS and hardware such as peripheral devices. When a WWW system is used, the "computer system" also includes the website-providing environment (or display environment). A "computer-readable recording medium" refers to a writable non-volatile memory such as a flexible disk, a magneto-optical disk, a ROM, or a flash memory; a portable medium such as a CD-ROM; or a storage device such as a hard disk built into a computer system.

The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as the volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line. The program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium capable of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. The program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements those functions in combination with a program already recorded in the computer system.

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes designs and the like within a scope that does not depart from the gist of the invention.

501 File server
502 Calculation server (classification device)
503 Database
504 Service server
505 User terminal
511 Data file storage unit
521 Pre-processing unit
522 Calculation processing unit
531 Calculation result storage unit
541 Calling unit

Claims

A classification device comprising:
an observation variable probability distribution acquisition unit that acquires a probability distribution of observation variables for each hierarchy; and
an observation variable generation unit that switches the probability distribution acquired by the observation variable probability distribution acquisition unit according to the hierarchy and generates an observation variable based on the switched probability distribution.

The classification device according to claim 1, further comprising:
a switch variable probability distribution acquisition unit that acquires a probability distribution of switch variables for each piece of target data including time information; and
a switch variable generation unit that generates a switch variable for each observation variable based on the probability distribution acquired by the switch variable probability distribution acquisition unit,
wherein the observation variable generation unit switches the probability distribution acquired by the observation variable probability distribution acquisition unit to the probability distribution of the hierarchy corresponding to the switch variable acquired by the switch variable probability distribution acquisition unit.
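The switching described in the two claims above can be illustrated with a minimal sketch. It assumes discrete distributions held as plain Python lists and dictionaries. The function names, the two-level hierarchy, and the toy probabilities are hypothetical and do not come from the specification. For each observation, a switch variable is drawn from the switch variable distribution of the target data, and the observation is then drawn from the distribution of the hierarchy that the switch variable selects.

```python
import random


def sample_index(probabilities, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def generate_observations(switch_dist, level_dists, n, rng):
    """Generate n observation variables for one piece of target data.

    switch_dist: probability of each hierarchy level for this target data.
    level_dists: one {observation: probability} dict per hierarchy level.
    Returns a list of (switch_variable, observation) pairs.
    """
    pairs = []
    for _ in range(n):
        # Switch variable for this observation (assumed to be a level index).
        level = sample_index(switch_dist, rng)
        # Switch to the distribution of the selected hierarchy level.
        dist = level_dists[level]
        candidates = list(dist)
        weights = [dist[c] for c in candidates]
        observation = candidates[sample_index(weights, rng)]
        pairs.append((level, observation))
    return pairs


# Hypothetical toy distributions for a two-level hierarchy.
LEVEL_DISTS = [
    {"sports": 0.7, "music": 0.3},  # level 0, e.g. a community-level distribution
    {"goal": 0.5, "match": 0.5},    # level 1, e.g. a topic-level distribution
]

rng = random.Random(0)
samples = generate_observations([0.4, 0.6], LEVEL_DISTS, 5, rng)
```

Each pair in `samples` records which hierarchy level produced the observation. This is the information a model of this kind would need in order to classify observations separately per hierarchy.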

The classification device according to claim 2, further comprising:
a time frequency distribution acquisition unit that acquires a time frequency distribution defining, for each trend category used to classify trends, the appearance frequency of data at each time; and
a trend classification unit that classifies target data including time information into one of a plurality of trend categories based on the time frequency distribution acquired by the time frequency distribution acquisition unit.

The classification device according to claim 3, wherein the time frequency distribution is a probability distribution.
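As a rough illustration of trend classification from a time frequency distribution, the sketch below assigns a piece of timestamped data to the trend category whose distribution gives the highest probability to that timestamp. The claims state only that classification is based on the time frequency distribution, so the arg-max rule is an assumption. The category names, the month-level time bins, and the probabilities are likewise hypothetical.

```python
def classify_trend(timestamp, time_freq):
    """Return the trend category that assigns the highest probability to timestamp.

    time_freq: {category: {time_bin: probability}}, one distribution per category.
    A time bin missing from a category is treated as probability 0.0.
    """
    return max(time_freq, key=lambda category: time_freq[category].get(timestamp, 0.0))


# Hypothetical per-category time frequency distributions (each sums to 1).
TIME_FREQ = {
    "trend_a": {"2012-01": 0.6, "2012-02": 0.3, "2012-03": 0.1},
    "trend_b": {"2012-01": 0.1, "2012-02": 0.2, "2012-03": 0.7},
}

early = classify_trend("2012-01", TIME_FREQ)
late = classify_trend("2012-03", TIME_FREQ)
```

Under these toy values, data stamped early in the period falls into `trend_a` and data stamped late falls into `trend_b`. A fuller model would combine this time-based evidence with the other latent variables instead of using the timestamp alone.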

A classification system comprising:
an observation variable probability distribution acquisition unit that acquires a probability distribution of observation variables for each hierarchy; and
an observation variable generation unit that switches the probability distribution acquired by the observation variable probability distribution acquisition unit according to the hierarchy and generates an observation variable based on the switched probability distribution.

A classification method comprising:
acquiring, by an observation variable probability distribution acquisition unit, a probability distribution of observation variables for each hierarchy; and
switching, by an observation variable generation unit, the probability distribution acquired by the observation variable probability distribution acquisition unit according to the hierarchy, and generating an observation variable based on the switched probability distribution.

A classification program for causing a computer to execute:
an observation variable probability distribution acquisition step of acquiring a probability distribution of observation variables for each hierarchy; and
a step of switching the probability distribution acquired in the observation variable probability distribution acquisition step according to the hierarchy and generating an observation variable based on the switched probability distribution.