JP5446788B2

JP5446788B2 - Information processing apparatus and program

Info

Publication number: JP5446788B2
Application number: JP2009272311A
Authority: JP
Inventors: 大介梶
Original assignee: Konica Minolta Inc
Current assignee: Konica Minolta Inc
Priority date: 2009-11-30
Filing date: 2009-11-30
Publication date: 2014-03-19
Anticipated expiration: 2029-11-30
Also published as: JP2011113519A

Description

The present invention relates to an information processing apparatus and a program.

In recent years, proposals such as product recommendations have often been made on the basis of information about others who show the same tendencies as the person receiving the proposal. For example, if the target person has purchased a certain product (product A) in the past, this method recommends to that person other products bought by other people who also purchased product A. The premise is that people who purchased product A are relatively more likely to share the target person's purchasing tendencies than people who did not. Hereinafter, processing for making such proposals is referred to as "recommendation processing".

Proposals made by this method are considered particularly effective in services based on the long-tail theory. A long-tail service caters to groups made up of a small number of members, that is, groups too small to form one of the large sets that account for a major share of the whole. Because the data in such a small group (for example, purchase histories) have much in common, a proposal based on the differences between the individual data is considered likely to be accepted. For example, suppose a small group contains two customer data B1 and B2, where B1 records purchases of products D1 and D2 and B2 records purchases of products D1 and D3. Product D3 is then recommended to the customer corresponding to B1, and product D2 to the customer corresponding to B2.
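The B1/B2 example can be sketched in a few lines of Python. This is only an illustration of the difference-based idea described above; the function name `recommend` and the dictionary layout are assumptions made for the sketch, not something the patent specifies.

```python
def recommend(histories, target):
    """Recommend items bought by customers who share a purchase with `target`."""
    own = histories[target]
    suggestions = set()
    for customer, items in histories.items():
        # Only customers who share at least one purchase count as similar.
        if customer != target and own & items:
            # Suggest what the similar customer bought and the target did not.
            suggestions |= items - own
    return sorted(suggestions)


# Purchase histories taken from the example in the text.
histories = {"B1": {"D1", "D2"}, "B2": {"D1", "D3"}}
print(recommend(histories, "B1"))  # ['D3']
print(recommend(histories, "B2"))  # ['D2']
```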

When making proposals based on data obtained from such small groups, the data used for the proposal have conventionally been extracted by collaborative filtering (see, for example, Patent Document 1). Collaborative filtering takes a database containing a plurality of sample data, extracts other sample data that are the same as or similar to a given sample data on the basis of cross-correlation, and obtains data from the extracted sample data.

However, because collaborative filtering scans the database with one particular sample data as its reference, it cannot reveal what groups the sample data in the database form. Consequently, quantitative data processing on the database, such as analysis based on the tendencies of the groups formed by the sample data it contains, has not been possible.

To address this, methods are known that cluster a database by analyzing it with the EM algorithm or the variational Bayes method (see, for example, Patent Documents 2 and 3). Here, clustering is the process of dividing a set of data, such as a database, into subsets. The sample data in each subset obtained by clustering share some common feature, and that feature differs from subset to subset.

Patent Document 1: JP 2002-334257 A
Patent Document 2: JP 2007-272291 A
Patent Document 3: JP 2004-117503 A

However, the clustering disclosed in Patent Documents 2 and 3 does not allow the granularity to be adjusted. The granularity indicates the scale of the subsets.
The granularity of clustering is explained with reference to FIG. 10.
The single subset shown in clustering result 101 of FIG. 10 contains more sample data S than any of the subsets shown in clustering results 102 and 103. The clustering granularity of clustering result 101 is therefore larger than that of clustering results 102 and 103.
Conversely, each subset shown in clustering result 103 of FIG. 10 contains fewer sample data S than the subset of clustering result 101 or subset 102a of clustering result 102. The clustering granularity of clustering result 103 is therefore smaller than that of clustering results 101 and 102.
In this way, the granularity of clustering indicates how many sample data a subset contains, that is, the scale of the subset.

The granularity of clustering is determined by the rule governing what the sample data in a subset have in common. For example, when this rule is loose, the granularity tends to be large. A loose rule is, for example, a rule composed of matters with a relatively low probability of occurring across the sample data as a whole. Here, a "matter with a relatively low probability of occurring" means a matter whose occurrence frequency is around 0.5 on a scale where 0 means it never occurs and 1 means it occurs 100% of the time.
Conversely, when the rule is strict, the granularity tends to be small. A strict rule is, for example, a rule composed of matters that are either relatively likely not to occur (frequency near 0) or relatively likely to occur (frequency near 1).
These values express the probability that a matter occurs. When this probability is close to 1 or 0, the probability that the matter does not occur is close to 0 or 1, so whether it occurs is clear-cut. When the probability is 0.5, the probability of non-occurrence is also 0.5, which is the most ambiguous state.
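One way to make "most ambiguous at 0.5" concrete is the entropy of a yes/no event. The patent does not use entropy; it is brought in here only as an illustrative measure, and `bernoulli_entropy` is a name chosen for this sketch.

```python
import math


def bernoulli_entropy(p):
    """Entropy in bits of an event that occurs with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no ambiguity
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Ambiguity is low near 0 and 1 and peaks at 0.5.
for p in (0.01, 0.5, 0.99):
    print(p, round(bernoulli_entropy(p), 3))
```

Under this measure, matters with probability near 0 or 1 (the "strict" case) are almost unambiguous, while a probability of 0.5 (the "loose" case) is maximally ambiguous.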

The clustering disclosed in Patent Documents 2 and 3 does not allow the clustering granularity described above to be adjusted.

An object of the present invention is to make the granularity of clustering adjustable.

The invention according to claim 1 is an information processing apparatus comprising: a storage unit that stores a database including a plurality of sample data; an input unit that inputs a parameter for determining the scale of category data; and a control unit that, based on the parameter input by the input unit and on the database, generates one or more category data, each being subset data that include sample data contained in the database, by an algorithm using a mixture distribution model with the variational Bayes method, wherein the parameter is a parameter for determining the prior distribution of the distributions being mixed and/or the prior distribution on the mixing-ratio side in the mixture distribution model, the mixture distribution model is a mixed Bernoulli distribution, a Dirichlet distribution is used as the prior distribution on the mixing-ratio side with its parameter set to a value not exceeding the dimension of the mixture distribution plus 1, divided by 2, and a beta distribution is used as the prior distribution on the mixed Bernoulli distribution side with its parameter set to 0.1 or less.

The invention according to claim 2 is the information processing apparatus according to claim 1, wherein the scale of the category data is the granularity of clustering.

The invention according to claim 3 is the information processing apparatus according to claim 2, wherein the parameter is a parameter for determining how large or small the clustering granularity is.

The invention according to claim 4 is the information processing apparatus according to any one of claims 1 to 3, wherein the magnitude of the variance of the distributions being mixed correlates with the number of category data generated and with the number of sample data contained in one category data.

The invention according to claim 5 is a program for causing a computer to function as: means for storing a database including a plurality of sample data; means for inputting a parameter for determining the scale of category data; and means for generating, based on the input parameter and on the database, one or more category data, each being subset data that include sample data contained in the database, by an algorithm using a mixture distribution model with the variational Bayes method, wherein the parameter is a parameter for determining the prior distribution of the distributions being mixed and/or the prior distribution on the mixing-ratio side in the mixture distribution model, the mixture distribution model is a mixed Bernoulli distribution, a Dirichlet distribution is used as the prior distribution on the mixing-ratio side with its parameter set to a value not exceeding the dimension of the mixture distribution plus 1, divided by 2, and a beta distribution is used as the prior distribution on the mixed Bernoulli distribution side with its parameter set to 0.1 or less.

According to the present invention, the granularity of clustering can be adjusted.

FIG. 1 is a block diagram of the information processing apparatus 1 according to the present invention.
FIG. 2 is a functional block diagram of the information processing apparatus 1 when performing clustering processing.
FIG. 3 shows an example of the contents of the database D.
FIG. 4 shows an example of a clustering processing result.
FIG. 5 shows another example of a clustering processing result.
FIG. 6 is a flowchart showing the flow of the clustering processing of the information processing apparatus 1.
FIG. 7 is an explanatory diagram showing item extraction by brute-force search.
FIG. 8 is an explanatory diagram showing an example of extracting an item R based on the conditional probability P(R|Q) according to the present embodiment.
FIG. 9 is a flowchart showing the flow of the recommendation processing.
FIG. 10 is an explanatory diagram for describing the granularity of clustering.

Hereinafter, exemplary embodiments of the present invention are described in detail with reference to the drawings.

FIG. 1 shows a block diagram of the information processing apparatus 1 according to the present invention.
The information processing apparatus 1 is a computer and includes a CPU 11, a RAM 12, a ROM 13, a storage device 14, a communication device 15, an input device 16, and a display device 17.

The CPU 11 reads programs, data, and the like appropriate to the processing from the ROM 13, the storage device 14, and so on, executes them, and performs various kinds of processing including control of the operation of the information processing apparatus 1.
The RAM 12 is a storage device that stores the programs, data, and the like read by the CPU 11 and the parameters produced by the processing of the CPU 11.

The ROM 13 is a storage device that stores, in a non-rewritable manner, programs, data, and the like to be read by the CPU 11.
The ROM 13 of the present embodiment stores a clustering processing program E.

The storage device 14 is a storage device that stores, in a rewritable manner, programs, data, and the like to be read by the CPU 11, and that can also store data produced by the processing of the CPU 11 as well as programs, data, and the like input from outside. The storage device 14 consists of, for example, a flash memory, a hard disk drive, another rewritable storage device, or a combination of these.
The storage device 14 of the present embodiment stores a database D including a plurality of sample data.

The communication device 15 communicates with external equipment over a network connection.

The input device 16 provides various inputs to the information processing apparatus 1. The input device 16 of the present embodiment has input equipment such as a keyboard and a mouse, and provides inputs to the information processing apparatus 1 according to the user's input operations. The input device 16 may also be configured to accept input operations made with other equipment or methods. Examples include character input with a pen, input through a touch-panel input device, and combinations of one or more of these with keyboard or mouse input.

The display device 17 produces the various display outputs that accompany the processing of the information processing apparatus 1.

In the information processing apparatus 1 of the present embodiment, the CPU 11 reads the clustering processing program E from the ROM 13 and executes it, thereby realizing the functions of the functional blocks shown in FIG. 2 (described later) and performing clustering processing based on the contents of the database D stored in the storage device 14.

FIG. 2 shows a functional block diagram of the information processing apparatus 1 when performing clustering processing.
The information processing apparatus 1 functions as an input unit 21, a storage unit 22, a control unit 23, and an output unit 24.

The input unit 21 provides various inputs to the information processing apparatus 1. The function of the input unit 21 is provided by the input device 16.
The storage unit 22 stores the database D. The storage unit 22 can also store the result of the clustering processing. The function of the storage unit 22 is provided by storage devices such as the RAM 12, the ROM 13, and the storage device 14.
The control unit 23 reads the database D from the storage unit 22 and performs the clustering processing. The function of the control unit 23 is provided by the CPU 11 executing the clustering processing program E.
The output unit 24 produces various outputs relating to the clustering processing. The function of the output unit 24 is provided by an output device such as the display device 17.

The clustering processing performed by the information processing apparatus 1 of the present embodiment uses a mixture distribution model with the variational Bayes method. A mixture distribution model is a kind of probabilistic model, and the mixture distribution model used in the present embodiment is the mixed Bernoulli distribution.
The mixed Bernoulli distribution with the variational Bayes method used in the present embodiment is described below.

First, the mixed Bernoulli distribution is described.
The mixed Bernoulli distribution is known as a distribution used in latent class analysis and is widely used in applications such as clustering of binary data and recommendation systems. In addition, because the prior distribution of the Bernoulli distribution has a single hyperparameter, as described later, experiments and analysis are straightforward. The probability density function of the Bernoulli distribution is given by equation (1).
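The image for equation (1) is not preserved in this text. As a reference point, the standard form of an M-dimensional Bernoulli distribution, written with the symbols x, μ, and M that the description uses, is shown below. This is the textbook expression and is assumed, not confirmed, to match the patent's equation (1).

```latex
B(x \mid \mu) = \prod_{m=1}^{M} \mu_m^{x_m} \, (1 - \mu_m)^{1 - x_m},
\qquad x_m \in \{0, 1\}, \quad 0 \le \mu_m \le 1
```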

Here, x = (x_1, ..., x_M)^T is the data, μ = (μ_1, ..., μ_M)^T is the parameter, and M is the dimension of both the parameter and the data. The mixed Bernoulli distribution is then defined by equation (2).
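Equation (2) is likewise missing from this text. The usual definition of a K-component mixture of Bernoulli distributions is given below as an assumed reconstruction. The symbols π_k, μ_k, and K follow the surrounding description, while writing θ for the collection of all μ_k is an interpretation.

```latex
p(x \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k \, B(x \mid \mu_k),
\qquad \theta = \{\mu_1, \ldots, \mu_K\}, \quad
\pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```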

Here, π denotes the mixing ratios of the components B(x|μ_k), and K denotes the number of components. Next, a hidden variable is introduced for the data x. The hidden variable z is a competitive vector z = (0, ..., 1, ..., 0) that indicates which distribution generated the data x. When the conjugate priors, namely the Dirichlet distribution and the beta distribution, are used as the prior distributions for the distributions of the hidden variable z and the data x, the distributions of Z = (z_1, ..., z_N), X = (x_1, ..., x_N), π, and θ are given by equations (3) to (6), respectively.

The pair (a, b) consists of the parameters of the prior distributions p(π) and p(θ); these are called hyperparameters.
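Equations (3) to (6) are not preserved here. A conventional way to write the four distributions named above is sketched below, assuming a symmetric Dirichlet prior with parameter a and a symmetric Beta(b, b) prior on every μ_km. The symmetric Beta form is inferred from the statement that the Bernoulli prior has a single hyperparameter; the patent's exact expressions may differ.

```latex
\begin{align}
p(Z \mid \pi) &= \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{\,z_{nk}} \\
p(X \mid Z, \theta) &= \prod_{n=1}^{N} \prod_{k=1}^{K} B(x_n \mid \mu_k)^{\,z_{nk}} \\
p(\pi) &= \frac{\Gamma(Ka)}{\Gamma(a)^{K}} \prod_{k=1}^{K} \pi_k^{\,a-1} \\
p(\theta) &= \prod_{k=1}^{K} \prod_{m=1}^{M}
  \frac{\Gamma(2b)}{\Gamma(b)^{2}} \, \mu_{km}^{\,b-1} (1 - \mu_{km})^{\,b-1}
\end{align}
```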

Hyperparameter a is the parameter of the Dirichlet distribution and determines the prior distribution of the mixing ratio. Hyperparameter b is the parameter of the beta distribution and determines how large or small the variance of the distributions being mixed is under the prior.
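To see how b acts on the prior, the sketch below evaluates the variance of a symmetric Beta(b, b) distribution. Treating the prior as Beta(b, b) is an assumption (the text only says the beta prior has one parameter b), and `symmetric_beta_variance` is a name chosen for this sketch.

```python
def symmetric_beta_variance(b):
    """Variance of Beta(b, b): b*b / ((2b)**2 * (2b + 1)) = 1 / (4 * (2b + 1))."""
    return 1.0 / (4.0 * (2.0 * b + 1.0))


# Small b pushes prior mass toward 0 and 1 (variance approaches 0.25);
# b = 1 is the uniform prior on [0, 1] (variance 1/12).
for b in (0.001, 0.1, 1.0):
    print(b, round(symmetric_beta_variance(b), 4))
```

A prior concentrated near 0 and 1 favors components whose item probabilities are close to "never" or "always", which corresponds to the strict-rule, small-granularity case described earlier.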

Next, the variational Bayes method is described.
In this section, Y denotes all hidden variables, including the parameters, and X denotes all observable variables. Equation (7) then holds for an arbitrary probability distribution q(Y) and the posterior distribution p(Y|X).

Here, C1 and C2 are normalization constants. Each of the above equations involves an expectation taken with respect to the other distribution. Learning by the variational Bayes method is therefore carried out by iterating (7) and (8) in turn.

Next, the variational Bayes method for the mixed Bernoulli distribution is described.
Evaluating the general update equations (14) and (15) of the variational Bayes method, derived in the previous section, under the settings given in the description of the mixed Bernoulli distribution yields the following variational Bayes learning algorithm for the mixed Bernoulli distribution, consisting of the VB e-step and the VB m-step.
Equations (16) and (17) give the VB e-step. Equations (18) to (20) give the VB m-step.

Here, the quantity in equation (21) is used, where ψ is the function known as the digamma function, for which equation (22) holds.

Executing the algorithm given by equations (16) to (22) yields the posterior distribution, which is given by equations (23) and (24).

Note that "learning" refers to the computation of the VB e-step and VB m-step provided by an algorithm such as the variational Bayes method. In this example, the prior distributions are fixed by the hyperparameters (a, b), which are the parameters of the prior distributions p(π) and p(θ), and learning is then performed.
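Since equations (16) to (24) are not preserved in this text, the following is a minimal sketch of the standard textbook VB updates for a Bernoulli mixture with a symmetric Dirichlet(a) prior on the mixing ratio and a Beta(b, b) prior on each Bernoulli parameter. It is not claimed to reproduce the patent's equations term for term. The function names, the random initialization, the fixed iteration count, and the hand-written `digamma` approximation are all assumptions made so that the example runs with the standard library alone.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    t = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - t * (1.0 / 12 - t * (1.0 / 120 - t / 252))


def vb_bernoulli_mixture(X, K, a, b, iters=50, seed=0):
    """Return (alpha, on, off, r) after `iters` VB sweeps over binary rows X."""
    rng = random.Random(seed)
    N, M = len(X), len(X[0])
    # Random soft initialization of the responsibilities r[n][k].
    r = []
    for _ in range(N):
        row = [rng.random() + 1e-3 for _ in range(K)]
        total = sum(row)
        r.append([v / total for v in row])

    def m_step():
        # Posterior Dirichlet parameters and Beta "on"/"off" pseudo-counts.
        alpha = [a + sum(r[n][k] for n in range(N)) for k in range(K)]
        on = [[b + sum(r[n][k] * X[n][m] for n in range(N))
               for m in range(M)] for k in range(K)]
        off = [[b + sum(r[n][k] * (1 - X[n][m]) for n in range(N))
                for m in range(M)] for k in range(K)]
        return alpha, on, off

    for _ in range(iters):
        alpha, on, off = m_step()
        # Expected log parameters under the current posteriors.
        psi_total = digamma(sum(alpha))
        ln_pi = [digamma(alpha[k]) - psi_total for k in range(K)]
        ln_mu = [[digamma(on[k][m]) - digamma(on[k][m] + off[k][m])
                  for m in range(M)] for k in range(K)]
        ln_1mu = [[digamma(off[k][m]) - digamma(on[k][m] + off[k][m])
                   for m in range(M)] for k in range(K)]
        # E-step: responsibilities via a numerically stable softmax.
        for n in range(N):
            logs = [ln_pi[k] + sum(ln_mu[k][m] if X[n][m] else ln_1mu[k][m]
                                   for m in range(M)) for k in range(K)]
            top = max(logs)
            w = [math.exp(v - top) for v in logs]
            z = sum(w)
            r[n] = [v / z for v in w]
    return m_step() + (r,)


# Toy data: two groups of users with opposite answers on four items.
X = [[1, 1, 0, 0]] * 5 + [[0, 0, 1, 1]] * 5
alpha, on, off, r = vb_bernoulli_mixture(X, K=3, a=0.001, b=0.001)
print([round(v, 3) for v in alpha])
```

In this sketch `alpha[k] - a` is the expected number of sample data assigned to component k, and `on[k][m] / (on[k][m] + off[k][m])` is the posterior mean of μ_km. Both hyperparameters must be positive for the digamma calls to be valid.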

Regarding the setting of the hyperparameters, the nature of the prior distributions generally implies the following. When hyperparameter a takes a large value, a posterior distribution containing more components (for example, sample data) is formed; when hyperparameter a takes a small value, the number of components forming the posterior distribution is narrowed down. As an example of a small value, experiments have shown that setting hyperparameter a to at most (M + 1)/2, that is, the dimension M plus 1, divided by 2, narrows down the number of categories.

On the other hand, when hyperparameter b is small, groups are formed from matters with higher probabilistic certainty, so small categories can be found. Conversely, when hyperparameter b is large, small categories tend not to be detected and broad tendencies can be observed instead.

Based on the algorithm in the example above, the information processing apparatus 1 performs clustering processing based on the contents of the database D stored in the storage device 14.

FIG. 3 shows an example of the contents of the database D.
The database D shown in FIG. 3 tabulates, for each item listed under "item name", the answers given by users U1, U2, ... as to whether they want that item. An item that a user said they want is stored as "1", and an item that the user said they do not want is stored as "0". Although not all of them are shown in FIG. 3, in this example each user answered for 83 items, and the answers of 1,750 users were tabulated.
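The 0/1 layout of database D can be illustrated as follows. The item names are a small subset borrowed from the examples later in the description, the answers are made up for illustration, and `build_database` is a hypothetical helper that does not appear in the patent.

```python
def build_database(answers, item_names):
    """Map each user to a 0/1 row: 1 if the user wants the item, else 0."""
    return {user: [1 if item in wanted else 0 for item in item_names]
            for user, wanted in answers.items()}


item_names = ["house", "money", "ps3"]               # illustrative subset of items
answers = {"U1": {"house", "money"}, "U2": {"ps3"}}  # made-up answers
D = build_database(answers, item_names)
print(D["U1"])  # [1, 1, 0]
print(D["U2"])  # [0, 0, 1]
```

In the example of FIG. 3, the corresponding table would have 1,750 rows (users) and 83 columns (items).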

FIG. 4 shows an example of a clustering processing result.
The clustering processing result shown in FIG. 4 is an example obtained by reading the database D shown in FIG. 3 and applying the variational Bayes learning algorithm for the mixed Bernoulli distribution. Each category shown in FIG. 4, and in FIG. 5 described later, is a subset made up of some of the user answers contained in the database D. The entries on the left of each category are item names from the database D shown in FIG. 3. The value to the right of each item is the proportion, among all users making up that category, of users who answered "1" for that item, that is, who said they want it. The proportion is expressed on a scale of 0 to 1 (0 to 100%).
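The per-category proportions described above can be computed as in the sketch below, which assumes each user has already been assigned to exactly one category. The hard assignment in `labels` and the function name `category_item_ratios` are assumptions for illustration; the patent does not state how the figures in FIG. 4 were tabulated.

```python
def category_item_ratios(X, labels, item_names):
    """For each category, the share of its users who answered 1 per item."""
    groups = {}
    for row, k in zip(X, labels):
        groups.setdefault(k, []).append(row)
    result = {}
    for k, rows in groups.items():
        n = len(rows)
        ratios = {name: sum(row[m] for row in rows) / n
                  for m, name in enumerate(item_names)}
        # List the most wanted items first, as in a per-category summary.
        result[k] = dict(sorted(ratios.items(), key=lambda kv: -kv[1]))
    return result


X = [[1, 0], [1, 1], [0, 1]]   # made-up answers of three users
labels = [0, 0, 1]             # hypothetical category assignment
print(category_item_ratios(X, labels, ["house", "ps3"]))
```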

In the clustering processing result shown in FIG. 4, 13 subsets are formed.
Among categories 1 to 13 in FIG. 4, category 1 is the largest subset, made up of the most sample data, that is, the largest number of user answers (number of users). From category 2 through category 13, the subsets follow in descending order of the number of users they contain. Category 13 therefore has the fewest users.

Categories 11 to 13 are all subsets in which an extremely large share of users answered "1" for particular items, that is, said they want those items. For example, in category 11 more than 99.99% of users said they want xbox360 (registered trademark), ps3 (registered trademark), and psp (registered trademark). Similarly, in category 12 more than 99.99% of users want plane and boat, and in category 13 more than 99.99% of users said they want seven items such as "cell phone".
A subset in which an extremely high proportion of users want particular items, as in categories 11 to 13, is very likely to be a subset formed by a small number of users with similar tastes and desires. In other words, categories 11 to 13 in FIG. 4 can be regarded as subsets that reflect the answers of a small number of users with shared interests and desires, that is, subsets that reflect so-called "minority opinions".

To obtain the clustering processing result shown in FIG. 4, learning was performed on the database D of FIG. 3 with the number of mixture components K set to K = 50 and the hyperparameters a and b set to a = b = 0.001. Extracting, as categories with an effective mixing ratio, those whose occurrence probability is 0.0001 or more yields a total of 13 subsets, including categories 1, 2, and 11 to 13 shown in FIG. 4.
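The extraction step can be sketched as follows, under the assumption that a category's "occurrence probability" is the posterior mean of its mixing ratio, alpha_k divided by the sum of all alpha_k, where alpha_k is the posterior Dirichlet parameter of component k. Both that interpretation and the name `effective_categories` are assumptions, and the numbers in the usage example are made up.

```python
def effective_categories(alpha, threshold=1e-4):
    """Keep components whose posterior mean mixing ratio is >= threshold."""
    total = sum(alpha)
    kept = [(k, a_k / total) for k, a_k in enumerate(alpha)
            if a_k / total >= threshold]
    # Largest category first, matching the ordering described for FIG. 4.
    return sorted(kept, key=lambda kw: -kw[1])


# Made-up posterior parameters: the third component is effectively empty.
alpha = [1000.001, 700.001, 0.001, 50.001]
print([k for k, _ in effective_categories(alpha)])  # [0, 1, 3]
```

With K = 50 and a small value of a, most components end up nearly empty, so a threshold of this kind leaves only the categories that actually attracted sample data.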

In this example b = 0.001 was used, but the appropriate value depends on the data and on the desired granularity. The properties of the prior distribution (beta distribution) show that setting this hyperparameter to 1 or less makes minority opinions easier to extract, and experiments indicate that a value of 0.1 or less is preferable.

FIG. 5 shows another example of a clustering processing result.
The clustering processing result shown in FIG. 5 is an example obtained under the same conditions as the result shown in FIG. 4 except for the value of hyperparameter b. The value of b used in the learning for FIG. 5 is larger than the value used in the learning for FIG. 4: b = 1.0.

In the clustering processing result shown in FIG. 5, two subsets are formed.
Category 1 in FIG. 5 is a category with a relatively high proportion of users who said they want items associated with universal happiness, such as house, money, love, friend, and job. Category 2 in FIG. 5 is a category with a high proportion of users whose answers reflect concrete material desires, such as laptop, ipod (registered trademark), house, shoes, and clothes. In both categories 1 and 2, the proportion of users who said they want any given item does not reach 40% of all users in the category. From this it can be said that categories 1 and 2 are both large subsets, made up of the answers of many more users than categories 11 to 13 in FIG. 4.

In this way, the granularity of clustering can be adjusted by adjusting the values of the hyperparameters a and b.

In this embodiment the hyperparameters are specified directly, but a mechanism may be provided that sets them automatically according to conditions such as the purpose of classifying the users or a specified number of subsets. The correspondence between hyperparameter values and the conditions used to set them is determined experimentally and is not particularly limited.

The mixture distribution is not limited to a mixture of Bernoulli distributions, and other mixture distributions may be used. Examples include probability models applicable to real-valued data, such as a mixture of normal distributions, and distributions used in latent semantic analysis, such as pLSA (Probabilistic Latent Semantic Analysis).
Learning may also be performed by a method other than the variational Bayes method. Examples include algorithms in which a parameter corresponding to the variance can be set small in advance, such as the EM algorithm, and techniques such as k-means. In the case of the EM algorithm, the hyperparameter is a parameter that determines the variance of the mixed distributions in the mixture model, and the magnitude of that variance correlates with the number of subset data items generated and with the amount of sample data included in one subset data item.

The clustering result obtained in this way can of course be used to analyze how user preferences are classified, and it also lends itself to various applications, for example as an information source for making proposals to users in a recommendation system. For example, based on the clustering result of the users' answers, information encouraging the purchase of items belonging to a category may be delivered to each user in that category. Alternatively, regardless of whether a user is among those whose answers were clustered, a user who has performed an action such as purchasing or viewing a certain item may be sent information encouraging the purchase of another item included in the category that, among the subsets obtained by the clustering, shows high interest in that item. Various other applications are also conceivable.

If clustering has not been performed in advance, then each time a user performs an action indicating interest in an item, such as purchasing or viewing it, the system must extract the other users who performed a similar action on that item and then extract another item included in a subset closely related to that item. Such processing has a high computational cost, wastes the computer's processing capacity, and results in poor response times.

In this embodiment, by contrast, the clustering process yields data indicating subsets based on the database D (subset data), and the subset data can be stored in the storage unit 22 as needed. In this embodiment subset data is generated individually for each subset, but data that collectively manages information on a plurality of subsets and on the sample data constituting each subset may instead be generated, stored, and used.

By reading the subset data, the user can obtain information on the subsets based on the database D and can analyze or extract data based on those subsets. Data analysis includes, for example, analysis to obtain information on how the desires for individual items relate to one another or on correlations in the number of users. Data extraction includes, for example, extracting data on items closely related to a given item, or extracting the data of a subset closely related to a given item.

The flow of the clustering process of the information processing apparatus 1 is described below with reference to the flowchart shown in FIG. 6.
First, the control unit 23 reads the database D from the storage unit 22 and acquires the data to be clustered (step S1).

Next, parameter determination is performed (step S2). Parameter determination is the process of determining the values of the hyperparameters a and b. The values are, for example, those entered directly by the user via the input unit 21, or those determined from the purpose of the clustering, the desired number of subsets, or similar conditions entered by the user via the input unit 21.

Next, the control unit 23 performs learning based on the hyperparameters a and b obtained in step S2 (step S3). The learning algorithm is as described above.
Next, based on the learning result of step S3, the control unit 23 extracts the subsets that have an effective mixing ratio and performs clustering (step S4).
In this embodiment, the control unit 23 generates subset data indicating the clustering result of step S4 and outputs it for display on the output unit 24.

Next, it is determined whether to end clustering (step S5). The determination in step S5 depends on whether opinions are to be extracted using the clustering result obtained in step S4, and it is made by a user input operation or by processing based on a predetermined condition. An example of processing based on a predetermined condition is as follows: when clustering was performed for a predetermined number of subsets, whether to finish or to cluster again is decided according to whether that number of subsets was obtained.

If clustering is not to be ended in step S5 (step S5: NO), the process returns to step S2.
If clustering is to be ended in step S5 (step S5: YES), data extraction is performed (step S6). Data extraction means extracting the desired data based on the clustering result obtained in step S4. The data obtained in step S6 is, for example, data on items closely related to a given item, or the data of a subset closely related to a given item.
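The loop through steps S2 to S5 can be sketched as follows. This is a minimal illustration, not the apparatus's implementation: `run_clustering`, `choose_hyperparameters`, `cluster_fn`, `is_done`, and the `max_rounds` safety limit are all hypothetical names introduced here, and the learning of steps S3 and S4 is left to whatever function the caller supplies.

```python
def run_clustering(database, choose_hyperparameters, cluster_fn, is_done,
                   max_rounds=10):
    """Repeat steps S2-S5 until the result is accepted.

    All callables are hypothetical placeholders supplied by the caller:
      choose_hyperparameters(round_no) -> (a, b)          # step S2
      cluster_fn(database, a, b)       -> list of subsets # steps S3 and S4
      is_done(subsets)                 -> bool            # step S5
    max_rounds is an added safety limit, not part of the described flow.
    """
    for round_no in range(max_rounds):
        a, b = choose_hyperparameters(round_no)   # step S2
        subsets = cluster_fn(database, a, b)      # steps S3 and S4
        if is_done(subsets):                      # step S5: YES
            return subsets                        # hand over to step S6
        # step S5: NO, so go back to step S2 with new hyperparameters
    return None
```

A caller that wants a fixed number of subsets could pass `is_done=lambda s: len(s) == wanted` and a `choose_hyperparameters` that lowers b on each round, since a smaller b tends to produce more subsets.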

For a given item (item Q), a specific item that is likely to occur in association with it (item R) can be obtained by finding the item R for which the conditional probability P(R|Q) is high.

Conventionally, to extract an item R with a high conditional probability P(R|Q), a brute-force search of the database D was performed after item Q was obtained, as shown in FIG. 7, and the item other than Q that appeared most often in the sample data containing item Q was extracted as item R. Because it is a brute-force process, this extraction has a high computational cost, wastes the computer's processing capacity, and results in poor response times.
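For comparison, the conventional brute-force extraction might look like the sketch below. The function name and the representation of the database as a list of item collections are assumptions made for illustration; ties are broken alphabetically only to keep the result deterministic.

```python
from collections import Counter


def brute_force_item_r(samples, item_q):
    """Scan every sample and return the item that most often co-occurs with item_q.

    samples: iterable of collections of item names (an assumed layout of database D).
    Returns None when no sample contains item_q together with another item.
    """
    counts = Counter()
    for items in samples:              # every sample is visited on every query
        unique = set(items)
        if item_q in unique:
            counts.update(unique - {item_q})
    if not counts:
        return None
    # Highest count wins; sorted() makes the alphabetically first item win ties.
    return max(sorted(counts), key=counts.get)
```

The cost of each call grows with the total size of the database, which is the drawback the passage above describes.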

In this embodiment, therefore, the conditional probability P(R|Q) is applied to each subset obtained by the clustering process described above, and an item R with a high conditional probability P(R|Q) is extracted.
FIG. 8 shows an example of extracting item R based on the conditional probability P(R|Q) according to this embodiment.
First, the control unit 23 extracts, from the subsets obtained by the clustering process, the subset in which item Q has the highest occurrence probability (step S11). The control unit 23 then extracts as item R the item with the highest occurrence probability, excluding item Q, among the items included in the extracted subset (step S12).

Applying the conditional probability P(R|Q) to each subset obtained by clustering greatly reduces the computational cost compared with brute-force processing of all the sample data in the database D, and greatly shortens the response time from obtaining item Q to obtaining item R. Recommendations such as proposing item R based on item Q can therefore be made quickly.
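Steps S11 and S12 can be sketched as a lookup over the per-subset occurrence probabilities. The function name and the representation of each subset as a dictionary from item name to occurrence probability are assumptions for illustration, as are the probability values in the usage note.

```python
def extract_item_r(subsets, item_q):
    """Pick item R from precomputed subsets (steps S11 and S12).

    subsets: non-empty list of dicts mapping item name -> occurrence probability
             within that subset (an assumed layout of the subset data).
    """
    # Step S11: the subset in which item Q has the highest occurrence probability.
    best = max(subsets, key=lambda subset: subset.get(item_q, 0.0))
    # Step S12: the most probable item in that subset other than item Q.
    candidates = {item: p for item, p in best.items() if item != item_q}
    if not candidates:
        return None
    return max(sorted(candidates), key=candidates.get)
```

With two hypothetical subsets `{"house": 0.35, "money": 0.30, "love": 0.20}` and `{"laptop": 0.38, "ipod": 0.30, "house": 0.10}`, a query for "laptop" selects the second subset and proposes "ipod". The work per query depends on the number of subsets and items rather than on the number of samples in the database.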

When extraction is based on a plurality of items, as with a conditional probability P(R|Q1, Q2) for extracting another item R from two or more items (for example, items Q1 and Q2), the same processing as above may be performed based on either one of items Q1 and Q2, or an item with a large value of P(R|Q1)·P(R|Q2) may be extracted.
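The product rule for several given items can be sketched as below. How each P(R|Q) is estimated is outside this sketch: the function assumes a caller-supplied table `cond_prob[q][r]` holding P(r|q), and both that table and the function name are hypothetical.

```python
import math


def extract_item_r_multi(cond_prob, given_items):
    """Return the item R that maximizes the product of P(R|Q) over all given items Q.

    cond_prob: dict of dicts, cond_prob[q][r] = P(r | q) (assumed to be precomputed).
    given_items: list of item names such as ["Q1", "Q2"].
    """
    if not given_items:
        return None
    # Only items with a probability under every given item can be scored.
    shared = set.intersection(*(set(cond_prob[q]) for q in given_items))
    shared -= set(given_items)
    scores = {r: math.prod(cond_prob[q][r] for q in given_items) for r in shared}
    if not scores:
        return None
    return max(sorted(scores), key=scores.get)
```

An item that is likely under only one of the given items scores low, so the product favors items associated with all of them.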

The flow of the recommendation process that extracts and proposes item R is described with reference to the flowchart of FIG. 9.
First, the control unit 23 performs clustering (step S21). The clustering in step S21 is the same as the clustering process described above.
The control unit 23 then extracts, from the subsets obtained by clustering, the subset in which item Q has the highest occurrence probability (step S22). The control unit 23 then extracts as item R the item with the highest occurrence probability, excluding item Q, among the items included in the extracted subset (step S23). The extraction result of step S23 is displayed on the display unit 24.

The control unit 23 then performs proposal processing based on item R (step S24).
The proposal processing of step S24 is, for example, providing a user who purchased item Q with information recommending the purchase of item R. The information is provided, for example, by sending an e-mail or by mailing direct mail or a catalog. The control unit carries out the processing related to such information provision according to a preset. A preset is a predetermined procedure. For e-mail, for example, it is a procedure that adds item R and information about item R to a predetermined template message and sends the message to the user. For direct mail or catalogs, it is a procedure such as printing the user's address label or outputting, to the person in charge of mailing, information indicating the type of direct mail or catalog to send to the user.
The details of the proposal processing in step S24, such as the result of the e-mail transmission and other output information, are displayed on the display unit 24.

In the item R extraction and recommendation processing described above, item R is the item with the highest occurrence probability, excluding item Q, within the subset in which item Q has the highest occurrence probability among the subsets obtained by clustering. The extraction condition is not limited to this, however.
For example, for an entry such as an item included in the subsets, it is also possible to take the subset in which the parameter for that entry (for example, its occurrence probability) is lowest and, within that subset, extract the entry with the lowest parameter other than that entry. Nor are parameters such as the occurrence probability limited to the highest or lowest value: entries such as items can also be extracted using an arbitrary value or an approximation of it.

As described above, according to this embodiment, the information processing apparatus 1 generates subset data indicating subsets based on the database D in accordance with the hyperparameters (a, b) that have been set. The hyperparameters (a, b) determine the granularity of clustering, and the user can freely enter their values via the input unit 21. In other words, the user can adjust the granularity of clustering by entering the values of the hyperparameters (a, b). The information processing apparatus 1 thus makes the granularity of clustering adjustable.

Furthermore, the hyperparameter a is a parameter of the Dirichlet distribution and determines the mixing ratio in the prior distribution. The hyperparameter b is a parameter of the beta distribution and determines the magnitude of the variance of the mixed distributions in the prior distribution. The control unit 23 generates the subset data by an algorithm using the variational Bayes method. This makes the granularity of clustering adjustable in response to changes in the values of the hyperparameters (a, b).

Furthermore, the magnitude of the variance of the mixed distributions in the prior distribution, which is determined by the hyperparameter b, correlates with the number of subset data items generated and with the amount of sample data included in one subset data item. As shown above, the larger the value of b, the fewer the subsets and the more sample data each subset tends to contain; conversely, the smaller the value of b, the more subsets there are and the less sample data each subset tends to contain. The hyperparameter b thus functions as a parameter for adjusting the granularity of clustering.

Furthermore, in learning using a mixture of Bernoulli distributions, fine-grained subsets suited to "extraction of minority opinions" can be obtained by using a Dirichlet distribution as the prior for the mixing ratios with its parameter set to no more than the dimension of the mixture distribution plus 1, divided by 2, and by using a beta distribution as the prior for the Bernoulli distributions with its parameter set to 0.1 or less. This allows data analysis and proposals based on subsets closely related to a specific item.
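The two conditions stated above can be written as a simple check. This is only an illustrative sketch: the function name is hypothetical, and `dim` stands for "the dimension of the mixture distribution", whose exact definition is taken from the rest of the description and is not fixed by this code.

```python
def suits_minority_extraction(a, b, dim):
    """Check the two stated conditions on the hyperparameters.

    a:   Dirichlet parameter for the mixing ratios.
    b:   beta parameter for the Bernoulli distributions.
    dim: dimension of the mixture distribution (how it is counted is an
         assumption left to the caller).
    """
    dirichlet_ok = a <= (dim + 1) / 2   # a no larger than (dimension + 1) / 2
    beta_ok = b <= 0.1                  # b no larger than 0.1
    return dirichlet_ok and beta_ok
```

For instance, with `dim = 10` the bound on a is 5.5, so a = 1.0 with b = 0.001 passes, while b = 1.0 fails the beta condition.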

Furthermore, as in the recommendation process described above, the control unit 23 extracts, based on one or more parameters included in one or more subsets, a parameter different from those parameters. A parameter here means a parameter included in each subset, such as the occurrence probability corresponding to an entry in the database such as an item name.
This makes it possible to extract data closely related to a given entry without a brute-force pass over all the sample data in the database, so the data needed for proposals such as recommendation processing can be extracted quickly and efficiently at low computational cost.

Furthermore, outputting the results of data extraction based on the parameters included in the subsets, and the results of proposal processing, to the display unit 24 makes the result of each process explicit to the user. The user can understand and analyze the data from what is displayed.

The embodiments disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the description above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

For example, the entries in the database D shown in FIG. 3, such as the item names, the number of items, the users' answers, and the number of users, are merely examples and may be changed as desired. The number of subsets shown in FIGS. 4, 12, 15 and so on, the items included in each subset and their probabilities, and the flowchart shown in FIG. 6 are likewise only examples and do not limit the present invention.
The database and the program read by the control unit need not be located as in the embodiment described above; they may be stored in an external storage device, on a medium readable by the information processing apparatus, or the like.

11 CPU
12 RAM
13 ROM
14 Storage device
15 Communication device
16 Input device
17 Display device
21 Input unit
22 Storage unit
23 Control unit
24 Output unit
D Database
E Clustering processing program

Claims

An information processing apparatus comprising:
a storage unit that stores a database including a plurality of sample data items;
an input unit for inputting a parameter that determines the scale of category data; and
a control unit that, based on the parameter input via the input unit and on the database, generates one or more category data items, each being subset data including sample data included in the database, by an algorithm that applies a variational Bayes method to a mixture distribution model,
wherein the parameter is a parameter that determines the prior distribution of the mixed distributions and/or the prior distribution of the mixing ratios in the mixture distribution model,
the mixture distribution model is a mixture of Bernoulli distributions, and
a Dirichlet distribution is used as the prior distribution of the mixing ratios with its parameter set to no more than the dimension of the mixture distribution plus 1, divided by 2, and a beta distribution is used as the prior distribution of the Bernoulli distributions with its parameter set to 0.1 or less.

The information processing apparatus according to claim 1, wherein the category data has a clustering granularity.

The information processing apparatus according to claim 2, wherein the parameter is a parameter for determining a size of the clustering granularity.

The information processing apparatus according to any one of claims 1 to 3, wherein the magnitude of the variance of the mixed distributions correlates with the number of category data items generated and with the amount of sample data included in one category data item.

A program for causing a computer to function as:
means for storing a database including a plurality of sample data items;
means for inputting a parameter that determines the scale of category data; and
means for generating, based on the input parameter and on the database, one or more category data items, each being subset data including sample data included in the database, by an algorithm that applies a variational Bayes method to a mixture distribution model,
wherein the parameter is a parameter that determines the prior distribution of the mixed distributions and/or the prior distribution of the mixing ratios in the mixture distribution model,
the mixture distribution model is a mixture of Bernoulli distributions, and
a Dirichlet distribution is used as the prior distribution of the mixing ratios with its parameter set to no more than the dimension of the mixture distribution plus 1, divided by 2, and a beta distribution is used as the prior distribution of the Bernoulli distributions with its parameter set to 0.1 or less.