JP6175037B2 - Cluster extraction apparatus, method, and program

Info

Publication number: JP6175037B2
Application number: JP2014154303A
Authority: JP
Inventors: 匡宏幸島; 達史松林; 澤田　宏; 宏澤田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-07-29
Filing date: 2014-07-29
Publication date: 2017-08-02
Anticipated expiration: 2034-07-29
Also published as: JP2016031678A

Description

The present invention relates to a cluster extraction apparatus, method, and program, and more particularly to a cluster extraction apparatus, method, and program for finding clusters in product document data given in matrix form in an e-commerce (electronic commerce) service.

It is known that structured data such as purchase histories, typified by POS (point of sale) data, and much unstructured data such as text and image data can be expressed in matrix form after preprocessing. Non-negative matrix factorization (hereinafter "NMF"), which factorizes a matrix consisting only of non-negative values, has been shown to be useful for finding the clusters present in such matrix data (see, for example, Non-Patent Document 1). When NMF is applied, the input matrix is decomposed into a product of matrices of lower rank. These low-rank matrices represent the contribution to each cluster of the entity corresponding to each row and each column, which makes cluster discovery possible. For example, applying NMF to purchase history data yields clusters from which a list of recommended products for a user can be created, and applying it to a collection of news articles allows the articles to be classified automatically.

FIG. 1 shows an example of applying NMF to purchase data.

The product purchase matrix $X = \{x_{ij}\}$ representing the purchase data is a matrix of I rows and J columns in which $x_{ij}$ is the number of purchases of the product corresponding to the j-th column by the user corresponding to the i-th row. The rows and columns of the product purchase matrix X therefore correspond to specific users and specific products, respectively. This is expressed by saying that "the product purchase matrix X has a user attribute and a product attribute". Applying NMF to this product purchase matrix X gives

$$X \simeq AB^{T}$$

where $A = \{a_{ir}\}$ is a user feature matrix of I rows and R columns and $B = \{b_{jr}\}$ is a product feature matrix of J rows and R columns. The symbol $\simeq$ indicates that the two sides are similar, and the superscript T denotes the transpose of a matrix. Looking at the column corresponding to "cluster 1" in the user feature matrix A in FIG. 1, the values in the first, second, and third rows, corresponding to "user 1", "user 2", and "user 3", are greater than 0. This indicates that "user 1", "user 2", and "user 3" belong to "cluster 1". Looking at the row corresponding to "cluster 1" in the product feature matrix B, the values in the columns for the products "beer 1" (first column), "beer 2" (second column), and "beer 3" (third column) are greater than 0. This expresses the feature of "cluster 1" that the three products "beer 1", "beer 2", and "beer 3" tend to be purchased by the same users. The products "beer 1", "beer 2", and "beer 3" are therefore collectively called the "product features of cluster 1". Similarly, the users belonging to "cluster 1" are called the "user features of cluster 1". The product features and user features of "cluster 1" together are called the "features of cluster 1". In this way, cluster extraction as shown in FIG. 2 becomes possible based on the user feature matrix A and the product feature matrix B obtained by applying NMF.

Note that the rank of the product feature matrix, which corresponds to the total number of clusters, is determined in advance, before the analysis.
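The factorization described above can be sketched as follows. This is a minimal illustration, assuming the squared-error objective and the widely used multiplicative update rules for NMF; the function name `nmf`, the toy matrix, and all parameter values are hypothetical and are not taken from the patent.

```python
import numpy as np

def nmf(X, R, n_iter=500, eps=1e-9, seed=0):
    """Factorize a non-negative X (I x J) as A @ B.T, with A (I x R) and B (J x R)."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    A = rng.uniform(0.1, 1.0, size=(I, R))
    B = rng.uniform(0.1, 1.0, size=(J, R))
    for _ in range(n_iter):
        # Multiplicative updates: entries are only rescaled by non-negative
        # ratios, so A and B stay non-negative throughout.
        A *= (X @ B) / (A @ (B.T @ B) + eps)
        B *= (X.T @ A) / (B @ (A.T @ A) + eps)
    return A, B

# Made-up purchase counts: users 0-2 buy three beers, users 3-4 buy two teas.
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

R = 2  # number of clusters, fixed before the analysis
A, B = nmf(X, R)
X_hat = A @ B.T  # low-rank approximation of X
```

Each column of `A` and `B` then plays the role of one cluster: rows with a clearly positive entry in column r are the users (or products) contributing to cluster r.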

It is also known that NMF can be used not only for cluster extraction but also for completing missing values. FIG. 3 shows an example. The product purchase matrix $X = \{x_{ij}\}$ in FIG. 3 is defined in the same way as in FIG. 1. The difference from the product purchase matrix in FIG. 1 is that the element representing the number of purchases of "beer 3" by "user 1" is missing. Even in such a case, NMF can obtain the user feature matrix A and the product feature matrix B from the other, observed values. Using the user feature matrix A and the product feature matrix B obtained in this way, an estimate of the product purchase matrix in which the missing components of the original matrix $X = \{x_{ij}\}$ are completed,

$$\hat{X} = AB^{T}$$

is obtained, so the value of the missing element can also be obtained.
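A minimal sketch of missing-value completion follows. It assumes that missing cells are encoded as NaN and are simply excluded from the squared-error objective by a 0/1 mask; the name `masked_nmf` and the toy data are hypothetical, and the update rule is the common weighted multiplicative update rather than anything specified in the patent.

```python
import numpy as np

def masked_nmf(X, M, R, n_iter=1000, eps=1e-9, seed=0):
    """NMF fitted only on the observed cells (M is True where X is observed)."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    A = rng.uniform(0.1, 1.0, size=(I, R))
    B = rng.uniform(0.1, 1.0, size=(J, R))
    Xo = np.where(M, X, 0.0)  # missing cells contribute nothing to the fit
    for _ in range(n_iter):
        A *= (Xo @ B) / ((M * (A @ B.T)) @ B + eps)
        B *= (Xo.T @ A) / ((M * (A @ B.T)).T @ A + eps)
    return A, B

# Made-up data: the purchase count of "beer 3" (column 2) by user 0 is missing.
X = np.array([
    [1, 1, np.nan, 0, 0],
    [1, 1, 1,      0, 0],
    [1, 1, 1,      0, 0],
    [0, 0, 0,      1, 1],
    [0, 0, 0,      1, 1],
])
M = ~np.isnan(X)
A, B = masked_nmf(X, M, R=2)
X_hat = A @ B.T
filled = np.where(M, X, X_hat)  # keep observed values, fill only the gap
```

Observed entries are left untouched; only the masked cell is replaced by the corresponding entry of `A @ B.T`.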

Non-Patent Document 1: Hiroshi Sawada, "Basics of Non-Negative Matrix Factorization NMF and its Application to Data / Signal Analysis", IEICE Journal, Vol. 95, No. 9, pp. 829-833, 2012.

Non-Patent Document 2: K. Takeuchi, K. Ishiguro, A. Kimura, and H. Sawada, "Non-negative Multiple Matrix Factorization", Proceedings of 23rd International Joint Conference on Artificial Intelligence (IJCAI2013), pp. 1713-1720, 2013.

Hereinafter, the user attribute and the product attribute are collectively referred to as "micro attributes". An attribute that has a correspondence with a micro attribute is referred to as a "macro attribute".

An example of a macro attribute is the category attribute, which represents product categories and corresponds to the product attribute, a micro attribute. When the total number of categories is K, the correspondence between the product attribute and the category attribute can be expressed by a J × K matrix $W = \{w_{jk}\}$. In this matrix W, the element $w_{jk}$ is 1 when product j belongs to category k and 0 otherwise. The value of $w_{jk}$ is not limited to 0 or 1 and may be 0 or any positive integer. Negative values are not used.
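As an illustration, the correspondence matrix W can be built from a product-to-category mapping. The product names, category names, and the helper `build_W` below are hypothetical; the sketch assumes each product belongs to exactly one category and uses the 0/1 encoding described above.

```python
import numpy as np

# Hypothetical catalogue: J = 5 products, K = 2 categories.
products = ["beer1", "beer2", "beer3", "tea1", "tea2"]
categories = ["beer", "tea"]
category_of = {
    "beer1": "beer", "beer2": "beer", "beer3": "beer",
    "tea1": "tea", "tea2": "tea",
}

def build_W(products, categories, category_of):
    """Return the J x K matrix with w_jk = 1 if product j belongs to category k."""
    col = {c: k for k, c in enumerate(categories)}
    W = np.zeros((len(products), len(categories)))
    for j, p in enumerate(products):
        W[j, col[category_of[p]]] = 1.0
    return W

W = build_W(products, categories, category_of)
```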

The problem considered in the present invention is to extract clusters from two matrices, the product purchase matrix $X = \{x_{ij}\}$ having micro attributes and the category purchase matrix $Y = \{y_{ik}\}$ having a macro attribute, given a known matrix W that specifies the correspondence between the micro attribute and the macro attribute. The category purchase matrix Y is a matrix whose element $y_{ik}$ is the number of purchases by user i of products belonging to category k. Between the product purchase matrix $X = \{x_{ij}\}$ and the category purchase matrix $Y = \{y_{ik}\}$, the relation

$$\sum_{j} x_{ij} \simeq y_{ik}$$

(where j ranges over the products belonging to category k) holds for any user i. That is, for any user, the sum of the purchase counts of the products in a category and the purchase count of that category take similar values.
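With a 0/1 correspondence matrix W, the sum over the products of a category is just a matrix product, so the relation above reads Y ≃ XW. The toy numbers below are made up for illustration and assume fully observed, mutually consistent X and Y.

```python
import numpy as np

# Rows: users; columns: beer1, beer2, beer3, tea1, tea2 (hypothetical data).
X = np.array([
    [1, 1, 1, 0, 0],
    [2, 0, 1, 0, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Rows: products; columns: beer category, tea category.
W = np.array([
    [1, 0],
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
], dtype=float)

# y_ik = sum of x_ij over the products j that belong to category k.
Y = X @ W
print(Y)  # [[3. 0.] [3. 1.] [0. 2.]]
```

In real data the two tables may be collected separately, so equality is only approximate, which is why the text uses the similarity symbol rather than an equals sign.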

The degree of similarity here depends on which similarity measure between matrices is used when applying NMF (similarity being expressed by the symbol $\simeq$). For example, consider using, as the similarity measure between any two matrices $U = \{u_{ij}\}$ and $V = \{v_{ij}\}$ of I rows and J columns, the following quantity defined with the Euclidean distance:

$$D(U, V) = \sum_{i=1}^{I} \sum_{j=1}^{J} \left(u_{ij} - v_{ij}\right)^{2}$$

The smaller the value of $D(U, V)$, the more similar the two matrices are, and the value is 0 when they match exactly.
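A direct transcription of this measure, assuming the squared form of the Euclidean distance (the function name `squared_euclidean` is hypothetical):

```python
import numpy as np

def squared_euclidean(U, V):
    """Sum of squared element-wise differences between two same-shaped matrices."""
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    if U.shape != V.shape:
        raise ValueError("U and V must have the same shape")
    return float(np.sum((U - V) ** 2))

same = squared_euclidean([[1, 2], [3, 4]], [[1, 2], [3, 4]])   # 0.0
diff = squared_euclidean([[1, 2], [3, 4]], [[0, 2], [3, 6]])   # 1 + 4 = 5.0
```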

Under this measure, the degree of similarity between the two sides of the factorization

$$X \simeq AB^{T}$$

is given by

$$D(X, AB^{T}) = \sum_{i=1}^{I} \sum_{j=1}^{J} \Bigl(x_{ij} - \sum_{r=1}^{R} a_{ir} b_{jr}\Bigr)^{2}.$$

Between the elements of the product purchase matrix $X = \{x_{ij}\}$ and the category purchase matrix $Y = \{y_{ik}\}$, for any user i, the value of $(\sum_{j} x_{ij} - y_{ik})^{2}$ (where j ranges over the products belonging to category k) must be small to a degree comparable to the value of

$$\Bigl(x_{ij} - \sum_{r=1}^{R} a_{ir} b_{jr}\Bigr)^{2}.$$

The same applies to measures other than the Euclidean distance.

However, the technique of Non-Patent Document 1 handles a single matrix, and cannot take as input the two matrices consisting of the product purchase matrix $X = \{x_{ij}\}$ having micro attributes and the category purchase matrix $Y = \{y_{ik}\}$ having a macro attribute. The technique of Non-Patent Document 2, on the other hand, can handle a plurality of matrices simultaneously. Therefore, if the product purchase matrix X and the category purchase matrix Y have no missing values, performing the matrix factorization

$$X \simeq AB^{T}$$

and the matrix factorization

$$Y \simeq AC^{T}$$

as shown in FIG. 4 yields a cluster extraction result as shown in FIG. 5 (where $C = \{c_{kr}\}$ is the category feature matrix of K rows and R columns). However, because Non-Patent Document 2 does not take into account the relationship between the micro attribute and the macro attribute expressed by the matrix W, there are situations in which missing values cannot be completed and appropriate cluster extraction cannot be performed. FIG. 6 shows an example.

In FIG. 6, all elements of the column of the product purchase matrix $X = \{x_{ij}\}$ corresponding to "beer 3" are missing. The method of Non-Patent Document 2 does not take into account the relationship between the columns of the product purchase matrix X corresponding to "beer 1", "beer 2", and "beer 3" and the column of the category purchase matrix Y corresponding to the beer category. Estimating the elements of the product feature matrix B corresponding to "beer 3" therefore requires the elements of the "beer 3" column of the product purchase matrix X, and when that entire column is missing, the missing values cannot be completed correctly.
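One way to see the difficulty numerically is sketched below. It assumes a masked multiplicative update of the form B ← B ∘ (numerator / denominator), in which only observed cells of X enter; this update form, the NaN encoding of missing cells, and the toy data are assumptions for illustration and are not the exact procedure of Non-Patent Document 2.

```python
import numpy as np

# Made-up data: the whole "beer 3" column (index 2) of X is unobserved.
X = np.array([
    [1, 1, np.nan, 0, 0],
    [1, 1, np.nan, 0, 0],
    [0, 0, np.nan, 1, 1],
])
M = ~np.isnan(X)            # True where a value is observed
Xo = np.where(M, X, 0.0)

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(3, 2))   # user features (I x R)
B = rng.uniform(0.1, 1.0, size=(5, 2))   # product features (J x R)

# Terms of the masked update for B, one row per product.
numer = Xo.T @ A
denom = (M * (A @ B.T)).T @ A

# For the fully missing column both terms vanish: X carries no information
# about the "beer 3" row of B, whatever A is.
print(numer[2], denom[2])
```

Because Y is factorized with its own independent matrix C in that setting, nothing else constrains the "beer 3" row of B either.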

The present invention has been made in view of the above points. Its object is to provide a cluster extraction apparatus, method, and program that, by performing non-negative matrix factorization while taking into account the relationship between micro attributes and macro attributes, can complete the missing values of a matrix and extract clusters even when there are missing values that conventional methods cannot handle, as in FIG. 6.

According to one aspect, there is provided a cluster extraction apparatus for extracting clusters from data given in matrix form, the apparatus comprising:
information processing means for acquiring
a first object relevance information matrix $X = \{x_{ij}\}$ which, regarding the relationship between a first object group and a second object group, has first objects i as rows and second objects j as columns, is a non-negative matrix indicating the degree of relevance between the first object i and the second object j, and has micro attributes,
a second object relevance information matrix $Y = \{y_{ik}\}$ which, regarding the relationship between the first object group and a third object group, has the first objects i as rows and third objects k as columns, is a non-negative matrix indicating the degree of relevance between the first object i and the third object k, and has a macro attribute, and
a third object relevance information matrix $W = \{w_{jk}\}$ which, regarding the relationship between the second object group and the third object group, has the second objects j as rows and the third objects k as columns, is a non-negative matrix whose element is a non-negative real number when the second object j is related to the third object k and zero when it is not, and represents the relationship between the micro attribute and the macro attribute;
feature matrix estimation means for obtaining, using the third object relevance information matrix W,
a first object feature matrix A which, regarding the relationship between the first object group and a cluster group that classifies the first object group, has the first objects i as rows and clusters r as columns, and is a non-negative matrix whose elements are real numbers of zero or more representing the degree to which the first object i belongs to the cluster r,
a second object feature matrix B which, regarding the relationship between the second object group and the cluster group, has the second objects j as rows and the clusters r as columns, and is a non-negative matrix whose elements are real numbers of zero or more representing the degree to which the second object j belongs to the cluster r, and
a third object feature matrix C which, regarding the relationship between the third object group and the cluster group, has the third objects k as rows and the clusters r as columns, and is a non-negative matrix whose elements are real numbers of zero or more representing the degree to which the third object k belongs to the cluster r; and
feature matrix processing means for extracting at least one characterized cluster from the first object feature matrix A, the second object feature matrix B, and the third object feature matrix C obtained by the feature matrix estimation means.

According to one aspect, by performing matrix factorization using a non-negative matrix X having micro attributes, a non-negative matrix Y having a macro attribute, and a non-negative matrix W having both micro and macro attributes, cluster extraction becomes possible even when there are missing values that conventional methods could not handle.

FIG. 1: An example of applying non-negative matrix factorization (NMF).
FIG. 2: An example of a clustering result obtained by applying non-negative matrix factorization (NMF).
FIG. 3: An example of missing value completion by applying non-negative matrix factorization (NMF).
FIG. 4: An example of applying the method of Non-Patent Document 2.
FIG. 5: An example of a clustering result obtained by applying the method of Non-Patent Document 2.
FIG. 6: An example in which missing value completion is impossible with the method of Non-Patent Document 2.
FIG. 7: An example of missing value completion by applying the present invention (part 1).
FIG. 8: An example of missing value completion by applying the present invention (part 2).
FIG. 9: A flowchart of the outline operation in one embodiment of the present invention.
FIG. 10: A configuration example of the cluster extraction apparatus in one embodiment of the present invention.
FIG. 11: An example of the product purchase information table in one embodiment of the present invention.
FIG. 12: An example of the category purchase information table in one embodiment of the present invention.
FIG. 13: An example of the product category correspondence information table in one embodiment of the present invention.
FIG. 14: An example of the user feature table in one embodiment of the present invention.
FIG. 15: An example of the product feature table in one embodiment of the present invention.
FIG. 16: An example of the category feature table in one embodiment of the present invention.
FIG. 17: A flowchart of the processing of the cluster extraction apparatus in one embodiment of the present invention.
FIG. 18: A flowchart of the feature matrix estimation processing in one embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

First, the matrices used in the present invention are presented in generalized form.

The object relevance information matrix X is, regarding the relationship between a first object group and a second object group, a non-negative matrix with first objects i as rows and second objects j as columns, whose elements are real numbers of zero or more representing the degree of relevance between the first object i and the second object j, and which has micro attributes.

The object relevance information matrix Y is, regarding the relationship between the first object group and a third object group, a non-negative matrix with first objects i as rows and third objects k as columns, whose elements are real numbers of zero or more representing the degree of relevance between the first object i and the third object k, and which has a macro attribute.

The object relevance information matrix W is, regarding the relationship between the second object group and the third object group, a non-negative matrix with second objects j as rows and third objects k as columns, whose element is a non-negative real number when the second object j is related to the third object k and zero when it is not, and which has both a micro attribute and a macro attribute.

The object feature matrix A is, regarding the relationship between the first object group and a cluster group, a non-negative matrix with first objects i as rows and clusters r as columns, whose elements are real numbers of zero or more representing the degree to which the first object i belongs to the cluster r.

The object feature matrix B is, regarding the relationship between the second object group and the cluster group, a non-negative matrix with second objects j as rows and clusters r as columns, whose elements are real numbers of zero or more representing the degree to which the second object j belongs to the cluster r.

The object feature matrix C is, regarding the relationship between the third object group and the cluster group, a non-negative matrix with third objects k as rows and clusters r as columns, whose elements are real numbers of zero or more representing the degree to which the third object k belongs to the cluster r.

In the following description, the object relevance information matrix X is called the "product purchase matrix X", the object relevance information matrix Y the "category purchase matrix Y", and the object relevance information matrix W the "product category correspondence matrix W". The first object is a "user", the second object a "product", and the third object a "product category". The object feature matrix A is called the "user feature matrix A", the object feature matrix B the "product feature matrix B", and the object feature matrix C the "category feature matrix C".

The contents of each object are described later, with reference to FIGS. 11 to 13, as the product purchase information table, the category purchase information table, and the product category correspondence information table.

The present invention proposes a non-negative matrix factorization method that uses the product category correspondence matrix W, which represents the relationship between the micro attribute and the macro attribute, to complete missing values and extract clusters even when there are missing values that conventional methods cannot handle. In addition to the product purchase matrix X and the category purchase matrix Y used as input in FIG. 6, the present invention uses the product category correspondence matrix $W = \{w_{jk}\}$ of J rows and K columns, which represents the categories to which the products belong. Using these data, the present invention considers, as in the method of Non-Patent Document 2, the matrix factorization

$$X \simeq AB^{T}, \qquad Y \simeq AC^{T}.$$

In the present invention, however, the constraint $W^{T}B = C$ is imposed between the product feature matrix B and the category feature matrix C. FIG. 7 shows an example of applying the present invention, under this setting, to a product purchase matrix X having the same missing values as in FIG. 6.
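As a reading aid, the constraint can be checked against the relation between X and Y stated earlier. The chain below is a sketch that follows from substituting $C = W^{T}B$ into the factorization of Y; it is a derivation from the definitions in this description, not a formula quoted from the patent.

```latex
\begin{align}
Y \simeq AC^{T} = A\,(W^{T}B)^{T} = AB^{T}W \simeq XW,
\qquad\text{i.e.}\qquad
y_{ik} \simeq \sum_{j=1}^{J} x_{ij}\, w_{jk}.
\end{align}
```

With a 0/1 matrix W, the right-hand sum runs over exactly the products j that belong to category k, so the constraint is consistent with the requirement that category purchase counts resemble the summed product purchase counts.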

Because of the constraint $W^{T}B = C$, once the product feature matrix B in FIG. 7 is given, the category feature matrix C must take the element values shown in the figure, which are expressed in terms of the elements of B. Then, for equality to hold between the category purchase matrix Y and its predicted value

$$\hat{Y} = AC^{T}$$

it follows that $b_{31} = 1$ and $b_{32} = 0$ must hold.

Thus, this constraint makes it possible to estimate the elements of the product feature matrix B corresponding to "beer 3" and to complete the missing values in the "beer 3" column of the product purchase matrix X, so the desired result is obtained. As shown in FIG. 8, the method of the present invention is applicable even when the category purchase matrix Y has missing values.

The "column-missing" pattern, in which every element of a column is missing as in FIG. 7, could not be handled by the conventional method described in Non-Patent Document 2. The present invention, by contrast, can handle such column-missing patterns.

Furthermore, as shown in FIG. 8, the present invention can also handle the case in which the column of the category purchase matrix Y for the category ("beer category" in FIG. 7) corresponding to a column-missing product of the product purchase matrix X ("beer 3" in FIG. 7) is not itself column-missing.
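The idea of the constrained factorization can be sketched as follows. This is an illustrative implementation under stated assumptions, not the estimation procedure of the patent: it substitutes $C = W^{T}B$ directly, minimizes the sum of the masked squared errors for X and Y, and uses gradient-ratio multiplicative updates. The function name `constrained_nmf`, the NaN encoding of missing cells, and the toy data are hypothetical.

```python
import numpy as np

def constrained_nmf(X, Y, W, R, n_iter=2000, eps=1e-9, seed=0):
    """Fit X ~ A B^T and Y ~ A C^T with C = W^T B, using observed cells only."""
    MX, MY = ~np.isnan(X), ~np.isnan(Y)
    Xo, Yo = np.where(MX, X, 0.0), np.where(MY, Y, 0.0)
    rng = np.random.default_rng(seed)
    A = rng.uniform(0.1, 1.0, size=(X.shape[0], R))
    B = rng.uniform(0.1, 1.0, size=(X.shape[1], R))

    def loss(A, B):
        C = W.T @ B
        return float(np.sum((Xo - MX * (A @ B.T)) ** 2)
                     + np.sum((Yo - MY * (A @ C.T)) ** 2))

    losses = [loss(A, B)]
    for _ in range(n_iter):
        C = W.T @ B
        A *= (Xo @ B + Yo @ C) / (
            (MX * (A @ B.T)) @ B + (MY * (A @ C.T)) @ C + eps)
        Xh, Yh = MX * (A @ B.T), MY * (A @ C.T)
        # The W @ (...) terms carry information from Y back to every product
        # row of B, including rows whose column in X is entirely missing.
        B *= (Xo.T @ A + W @ (Yo.T @ A)) / (Xh.T @ A + W @ (Yh.T @ A) + eps)
        losses.append(loss(A, B))
    return A, B, W.T @ B, losses

# Made-up data: the whole "beer 3" column (index 2) of X is missing.
X = np.array([
    [1, 1, np.nan, 0, 0],
    [1, 1, np.nan, 0, 0],
    [0, 0, np.nan, 1, 1],
    [0, 0, np.nan, 1, 1],
])
Y = np.array([[3, 0], [3, 0], [0, 2], [0, 2]], dtype=float)
W = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

A, B, C, losses = constrained_nmf(X, Y, W, R=2)
X_filled = np.where(np.isnan(X), A @ B.T, X)
```

The design choice here is to eliminate C rather than treat the constraint as a penalty, so $W^{T}B = C$ holds exactly at every iteration. How well the missing column is recovered depends on the data and the initialization, so the filled values should be inspected rather than assumed.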

FIG. 9 is a flowchart of the outline operation in one embodiment of the present invention.

Step 1) The cluster extraction apparatus acquires the product purchase matrix, the category purchase matrix, and the product category correspondence matrix input from outside, and sets the elements of each matrix in the fields of the corresponding table.

Step 2) Each feature matrix is estimated using the information in the tables.

Step 3) Each feature matrix is output.

FIG. 10 shows a configuration example of the apparatus in one embodiment of the present invention.

The cluster extraction apparatus 1 shown in the figure has a product purchase information processing unit 10, a category purchase information processing unit 20, a product category correspondence information processing unit 30, a feature matrix estimation unit 40, a feature matrix processing unit 50, a storage unit 60, and an input/output unit 70. The input/output unit 70 is connected to an external device 2 such as an input device or a display device.

The storage unit 60 has a product purchase information table 61, a category purchase information table 62, a product category correspondence information table 63, a user feature table 64, a product feature table 65, and a category feature table 66.

Each table is described below. Since data in table form can be expressed in matrix form, each table and the corresponding matrix are identified with each other in the following and used without distinction.
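The identification of tables with matrices can be sketched as a simple conversion. The record layout (row index, column index, value), the helper name `records_to_matrix`, and in particular the choice to treat cells without a record as missing (NaN) are assumptions made for illustration; the description itself does not say how absent records are interpreted.

```python
import numpy as np

def records_to_matrix(records, n_rows, n_cols):
    """Turn (row index, column index, value) records into a dense matrix.

    Cells with no record are set to NaN, i.e. treated as missing (assumption).
    """
    M = np.full((n_rows, n_cols), np.nan)
    for i, j, value in records:
        if value < 0:
            raise ValueError("values must be non-negative")
        M[i, j] = value
    return M

# Hypothetical rows of a purchase table: (user ID, product ID, purchase count).
purchase_records = [(0, 0, 1), (0, 1, 2), (1, 1, 0)]
X = records_to_matrix(purchase_records, n_rows=2, n_cols=3)
```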

<Product purchase information table 61>
As shown in FIG. 11, the product purchase information table 61 has a user ID field (i), a product ID field (j), and a purchase count field ($x_{ij}$). The user ID field (i) holds an identifier, added by the product purchase information processing unit 10, that identifies a user. The product ID field (j) holds an identifier, added by the product purchase information processing unit 10, that identifies a product purchased by the user. The purchase count field ($x_{ij}$) is set by the product purchase information processing unit 10 to "1" or to the number of purchases of the product by the user. The purchase count may be "0" or a positive integer; a negative value cannot be set.

<Category purchase information table 62>
As shown in FIG. 12, the category purchase information table 62 has a user ID field (i), a category ID field (k), and a purchase count field ($y_{ik}$). The user ID field (i) holds an identifier, added by the category purchase information processing unit 20, that identifies a user. The category ID field (k) holds an identifier, added by the category purchase information processing unit 20, that identifies the category of a product purchased by the user. The purchase count field ($y_{ik}$) is set by the category purchase information processing unit 20 to "1" or to the number of purchases in the category by the user. The purchase count may be "0" or a positive integer; a negative value cannot be set.

<Product category correspondence information table 63>
As shown in FIG. 13, the product category correspondence information table 63 has a product ID field (j), a category ID field (k), and a membership value field ($w_{jk}$). The product ID field (j) holds an identifier, set by the product category correspondence information processing unit 30, that identifies a product. The category ID field (k) holds an identifier, set by the product category correspondence information processing unit 30, that identifies a category. The membership value field ($w_{jk}$) is set by the product category correspondence information processing unit 30 to "1" when the product belongs to the category and to "0" otherwise.

<User feature table 64>
As shown in FIG. 14, the user feature table 64 has a user ID field (i), a cluster ID field (r), and a user feature value field (a_ir). In the user ID field (i), the feature matrix estimation unit 40 sets an identifier that identifies a user. In the cluster ID field (r), the feature matrix estimation unit 40 sets an identifier that identifies a cluster. In the user feature value field (a_ir), the feature value of the user for the cluster, as calculated by the feature matrix estimation unit 40, is set.

<Product feature table 65>
As shown in FIG. 15, the product feature table 65 has a product ID field (j), a cluster ID field (r), and a product feature value field (b_jr). In the product ID field (j), the feature matrix estimation unit 40 sets an identifier that identifies a product. In the cluster ID field (r), the feature matrix estimation unit 40 sets an identifier that identifies a cluster. In the product feature value field (b_jr), the feature value of the product for the cluster, as calculated by the feature matrix estimation unit 40, is set.

<Category feature table 66>
As shown in FIG. 16, the category feature table 66 has a category ID field (k), a cluster ID field (r), and a category feature value field (c_kr). In the category ID field (k), the feature matrix estimation unit 40 sets an identifier that identifies a category. In the cluster ID field (r), the feature matrix estimation unit 40 sets an identifier that identifies a cluster. In the category feature value field (c_kr), the feature value of the category for the cluster, as calculated by the feature matrix estimation unit 40, is set.

The operation of the above configuration will now be described.

FIG. 17 is a flowchart of the processing of the cluster extraction device according to an embodiment of the present invention.

In this embodiment, the feature matrices are estimated from a product purchase matrix, a category purchase matrix, and a product category correspondence matrix given as inputs, and the estimated feature matrices are output. The specific operation is described below.

Step 100) Based on the input product purchase matrix, the product purchase information processing unit 10 stores the number of purchases for each user ID and each product ID in the product purchase information table 61. Specifically, the product purchase information processing unit 10 inserts into the product purchase information table 61 shown in FIG. 11 a row whose user ID field, product ID field, and purchase quantity field are set according to the added user, product, and number of purchases.

The timing at which the product purchase information processing unit 10 updates the product purchase information may, for example, be managed manually by a system administrator based on data supplied from the external device 2, or the external device 2 may start the processing automatically when a new purchase occurs.

Step 200) Based on the input category purchase matrix, the category purchase information processing unit 20 stores the number of purchases for each user ID and each category ID in the category purchase information table 62. Specifically, the category purchase information processing unit 20 inserts into the category purchase information table 62 shown in FIG. 12 a row whose user ID field, category ID field, and purchase quantity field are set according to the added user, category, and number of purchases.

The timing at which the category purchase information processing unit 20 updates the category purchase information may, for example, be managed manually by a system administrator based on POS data supplied from the external device 2, or the processing may be started automatically from the external device 2 when a new purchase occurs.

Step 300) Based on the input product category correspondence matrix, the product category correspondence information processing unit 30 stores the affiliation value for each product ID and each category ID in the product category correspondence information table 63. Specifically, the product category correspondence information processing unit 30 inserts into the product category correspondence information table 63 shown in FIG. 13 a row whose product ID field, category ID field, and affiliation value field are set according to the added product and category.

The timing at which the product category correspondence information processing unit 30 updates the product category correspondence information may, for example, be managed manually by a system administrator based on POS data supplied from the external device 2, or the processing may be started automatically from the external device 2 when a new product appears.

Step 400) The feature matrix estimation unit 40 estimates the features by the following method and stores them in the user feature table 64, the product feature table 65, and the category feature table 66 of the storage unit 60. FIG. 18 shows a flowchart of the feature matrix estimation. In the following, all data in the product purchase information table 61 is denoted by the matrix X = {x_ij}, all data in the category purchase information table 62 by the matrix Y = {y_ik}, and all data in the product category correspondence information table 63 by the matrix W = {w_jk}. The user feature matrix, the product feature matrix, and the category feature matrix are denoted by A = {a_ir}, B = {b_jr}, and C = {c_kr}, respectively. Here, I is the total number of users, J the total number of products, K the total number of categories, and R the total number of clusters. The index i corresponds to the identifier of a user, j to the identifier of a product, k to the identifier of a category, and r to the identifier of a cluster.

Step 410) The user feature matrix A, the product feature matrix B, and the category feature matrix C are each initialized. The threshold ε of the termination condition and the maximum number of iterations are also set.

Step 420) The variable δ, which indicates the maximum change width of the feature updates and is used in the termination condition, is initialized.

Step 430) The user feature matrix A is updated according to expression (1) described later. Then, if the maximum absolute difference between the element values of the user feature matrix A before and after the update, max_{i,r} |a_ir^(new) - a_ir^(old)|, is greater than δ, δ is updated as δ ← max_{i,r} |a_ir^(new) - a_ir^(old)|. The symbol "←" denotes assigning the result calculated on the right-hand side to the variable on the left-hand side. Here, a_ir^(old) denotes the value of an element of the user feature matrix A before the assignment, and a_ir^(new) denotes its value after the assignment.
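The δ bookkeeping in step 430 amounts to keeping a running maximum of the element-wise change. The following minimal sketch shows that bookkeeping only; the function name `update_delta` and the list-of-lists matrix representation are assumptions made for illustration.

```python
from typing import List


def update_delta(delta: float, before: List[List[float]], after: List[List[float]]) -> float:
    """Return the larger of delta and the maximum absolute element-wise change."""
    max_change = max(
        abs(new - old)
        for old_row, new_row in zip(before, after)
        for old, new in zip(old_row, new_row)
    )
    # delta is replaced only when the observed change exceeds it.
    return max_change if max_change > delta else delta


A_before = [[0.5, 1.0], [2.0, 0.25]]
A_after = [[0.75, 1.0], [1.5, 0.25]]
delta = update_delta(0.0, A_before, A_after)
```

The same function can be reused unchanged for the product feature matrix B and the category feature matrix C in the subsequent steps.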

Step 440) The product feature matrix B is updated according to expression (2) described later. Then, if the maximum absolute difference between the element values of the product feature matrix B before and after the update, max_{j,r} |b_jr^(new) - b_jr^(old)|, is greater than δ, δ is updated as δ ← max_{j,r} |b_jr^(new) - b_jr^(old)|. Here, b_jr^(old) denotes the value of an element of the product feature matrix B before the assignment, and b_jr^(new) denotes its value after the assignment.

Step 450) The category feature matrix C is updated according to expression (3) described later. Then, if the maximum absolute difference between the element values of the category feature matrix C before and after the update, max_{k,r} |c_kr^(new) - c_kr^(old)|, is greater than δ, δ is updated as δ ← max_{k,r} |c_kr^(new) - c_kr^(old)|. Here, c_kr^(old) denotes the value of an element of the category feature matrix C before the assignment, and c_kr^(new) denotes its value after the assignment.

Step 460) The calculation iteration count is incremented by 1.

Step 470) If the iteration count exceeds the predetermined maximum number of iterations, or if δ, which represents the maximum change width of the feature updates, is smaller than the predetermined threshold ε, the processing ends. Otherwise, after the update, the processing returns to step 420.
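The control flow of steps 410 to 470 can be sketched as follows. This is a sketch under stated assumptions, not the patented update rule: expressions (1) to (3) are not available in this text, so the updates below are stand-in multiplicative updates for a squared-error objective restricted to non-missing cells, with the category feature matrix obtained from the constraint C = W^T B. All function names, the `None` marker for missing cells, and the parameter defaults are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Matrix = List[List[float]]
Data = List[List[Optional[float]]]  # None marks a missing entry
TINY = 1e-12  # avoids division by zero


def transpose(P):
    return [list(col) for col in zip(*P)]


def matmul(P, Q):
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def add(P, Q):
    return [[p + q for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)]


def observed(D: Data, F: Matrix, G: Matrix) -> Tuple[Matrix, Matrix]:
    """Return D and its estimate F @ G^T with missing cells zeroed in both."""
    est = matmul(F, transpose(G))
    D0 = [[0.0 if d is None else d for d in row] for row in D]
    E0 = [[0.0 if d is None else e for d, e in zip(dr, er)] for dr, er in zip(D, est)]
    return D0, E0


def scale(F: Matrix, num: Matrix, den: Matrix) -> Matrix:
    """Multiplicative step: F * num / den, element-wise."""
    return [[f * n / (d + TINY) for f, n, d in zip(fr, nr, dr)]
            for fr, nr, dr in zip(F, num, den)]


def max_change(P: Matrix, Q: Matrix) -> float:
    return max(abs(p - q) for pr, qr in zip(P, Q) for p, q in zip(pr, qr))


def estimate_features(X: Data, Y: Data, W: Matrix, R: int,
                      eps: float = 1e-6, max_iter: int = 200, seed: int = 0):
    rng = random.Random(seed)
    # Step 410: initialize A and B with positive values; C follows from C = W^T B.
    A = [[rng.random() + 0.1 for _ in range(R)] for _ in X]
    B = [[rng.random() + 0.1 for _ in range(R)] for _ in W]
    C = matmul(transpose(W), B)
    count = 0
    while True:
        delta = 0.0  # Step 420
        # Step 430: update A (stand-in rule), then track the largest change.
        X0, Xh = observed(X, A, B)
        Y0, Yh = observed(Y, A, C)
        A_new = scale(A, add(matmul(X0, B), matmul(Y0, C)),
                      add(matmul(Xh, B), matmul(Yh, C)))
        delta, A = max(delta, max_change(A, A_new)), A_new
        # Step 440: update B (stand-in rule).
        X0, Xh = observed(X, A, B)
        Y0, Yh = observed(Y, A, C)
        num = add(matmul(transpose(X0), A), matmul(W, matmul(transpose(Y0), A)))
        den = add(matmul(transpose(Xh), A), matmul(W, matmul(transpose(Yh), A)))
        B_new = scale(B, num, den)
        delta, B = max(delta, max_change(B, B_new)), B_new
        # Step 450: update C through the constraint C = W^T B.
        C_new = matmul(transpose(W), B)
        delta, C = max(delta, max_change(C, C_new)), C_new
        count += 1  # Step 460
        if count > max_iter or delta < eps:  # Step 470
            return A, B, C, count


# Hypothetical toy input: 2 users, 3 products, 2 categories, one missing cell.
X = [[2.0, 1.0, 0.0], [0.0, None, 3.0]]
Y = [[3.0, 0.0], [1.0, 3.0]]
W = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
A, B, C, n_iter = estimate_features(X, Y, W, R=2, max_iter=50)
```

Resetting δ at the top of each pass means the termination test in step 470 reflects only the changes made during the most recent pass over A, B, and C.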

Expressions (1), (2), and (3) used in steps 430, 440, and 450 are as follows.

[Expressions (1) to (3) and their accompanying definition: formula images not reproduced in this text.]

The symbols used in these expressions denote, in order: the set of all columns j that are not missing in the i-th row of the matrix X; the set of all columns k that are not missing in the i-th row of the matrix Y; likewise, the set of all rows i that are not missing in the j-th column of the matrix X; and the set of all rows i that are not missing in the k-th column of the matrix Y. The remaining two symbols can be regarded as the estimate of x_ij obtained from the user feature matrix A and the product feature matrix B, and the estimate of y_ik obtained from the user feature matrix A and the category feature matrix C.

Expressions (1), (2), and (3) above reflect the constraint described earlier.

When, for every user i, product j, and category k, the estimates of x_ij and y_ik coincide with the values x_ij and y_ik themselves, the left-hand side and the right-hand side of each of the update expressions (1), (2), and (3) coincide, so the variable δ indicating the maximum change width of the update becomes equal to or less than the threshold ε and the updating stops. Conversely, if expression (1) is applied for a certain user i when x_ij and y_ik exceed their estimates for all j and k, the numerator of the right-hand side is larger than its denominator, so a_ir is updated to a value larger than its current value. In other words, the feature a_ir in the user feature value field of the user feature table 64 is updated so that the estimates of x_ij and y_ik become larger.

Step 500) Referring again to FIG. 17, the feature matrix processing unit 50 refers to the user feature table 64, the product feature table 65, and the category feature table 66, and outputs the features corresponding to the arguments of a request.

The output processing may be executed, for example, when a feature output request is input from the external device 2. To output all features, all rows of the user feature table 64, the product feature table 65, and the category feature table 66 may be output. To use only the product features of a cluster, the cluster ID may, for example, be given as the request argument. The product ID field and the product feature value field of the rows having that cluster ID are output from the product feature table 65, and the 10 product IDs with the largest product feature values are then identified, which gives the product features of the cluster.
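The cluster-specific output described above can be sketched as a filter followed by a descending sort. This is a minimal illustration, assuming product feature table rows are available as `(product_id, cluster_id, feature_value)` tuples; the function name `top_products` and the sample values are hypothetical.

```python
from typing import List, Tuple

FeatureRow = Tuple[int, int, float]  # (product ID, cluster ID, product feature value)


def top_products(rows: List[FeatureRow], cluster_id: int, n: int = 10) -> List[int]:
    """Return up to n product IDs of the cluster, in descending order of feature value."""
    in_cluster = [(product_id, value) for product_id, r, value in rows if r == cluster_id]
    in_cluster.sort(key=lambda item: item[1], reverse=True)
    return [product_id for product_id, _ in in_cluster[:n]]


# Hypothetical rows of a product feature table.
feature_rows = [(101, 0, 0.2), (102, 0, 0.9), (103, 1, 0.7), (104, 0, 0.5)]
best = top_products(feature_rows, cluster_id=0, n=2)
```

The default of 10 mirrors the number of product IDs mentioned in the embodiment; a caller may pass a different `n`.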

In the above embodiment, clusters are extracted from matrices representing a product purchase matrix and a category purchase matrix, but the invention is not limited to this example. For example, the input may be a pair consisting of a matrix representing the number of occurrences of each word in each document and a matrix representing the number of occurrences of each word in each category to which the documents belong. More generally, the device can extract clusters from any objects that, like products, users, and categories, can each be identified by an assigned ID number and represented as data in matrix form, and among which a micro/macro relationship exists, as between a product and the category to which it belongs. The values need not be integers such as occurrence counts or purchase counts; in general, any real number equal to or greater than 0 may be used. The method according to the present invention is also applicable when there are three or more input matrices. For example, suppose that, in addition to the three matrices (the product purchase matrix, the category purchase matrix, and the product category correspondence matrix), a group purchase matrix representing the products purchased by each user group (for example, women in their 20s or men in their 30s) and a user group correspondence matrix representing users and the groups to which they belong are added, giving five input matrices in total. In this case, one more constraint is added besides the constraint W^T B = C described above, but the constraint W^T B = C can be used as it is.

The cluster extraction device 1 according to this embodiment can be realized, for example, by causing one or more computers to execute a program describing the processing explained in this embodiment. That is, the functions of the cluster extraction device 1 can be realized by executing a program corresponding to the processing performed by the cluster extraction device 1, using hardware resources built into the computer such as a CPU, memory, and hard disk. The program can be recorded on a computer-readable recording medium (such as a portable memory) and stored or distributed. The program can also be provided over a network, for example via the Internet or electronic mail.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible.

1 Cluster extraction device
2 External device
10 Product purchase information processing unit
20 Category purchase information processing unit
30 Product category correspondence information processing unit
40 Feature matrix estimation unit
50 Feature matrix processing unit
60 Storage unit
61 Product purchase information table
62 Category purchase information table
63 Product category correspondence information table
64 User feature table
65 Product feature table
66 Category feature table
70 Input/output unit

Claims

A cluster extraction device for extracting clusters from data given in a matrix format, comprising:
information processing means for acquiring
a first object relevance information matrix X = {x_ij}, which concerns the relationship between a first object group and a second object group, has first objects i as rows and second objects j as columns, and is a non-negative matrix having a micro attribute that indicates the degree of relevance between the first object i and the second object j,
a second object relevance information matrix Y = {y_ik}, which concerns the relationship between the first object group and a third object group, has the first objects i as rows and third objects k as columns, and is a non-negative matrix having a macro attribute that indicates the degree of relevance between the first object i and the third object k, and
a third object relevance information matrix W = {w_jk}, which concerns the relationship between the second object group and the third object group, has the second objects j as rows and the third objects k as columns, is a non-negative matrix whose element is a non-negative real number when the second object j is related to the third object k and zero when it is not, and represents the relationship between the micro attribute and the macro attribute;
feature matrix estimation means for obtaining, using the third object relevance information matrix W,
a first object feature matrix A, which concerns the relationship between the first object group and a cluster group that classifies the first object group, has the first objects i as rows and clusters r as columns, and is a non-negative matrix whose elements are real numbers of zero or more representing the degree of affiliation of the first object i to the cluster r,
a second object feature matrix B, which concerns the relationship between the second object group and the cluster group, has the second objects j as rows and the clusters r as columns, and is a non-negative matrix whose elements are real numbers of zero or more representing the degree of affiliation of the second object j to the cluster r, and
a third object feature matrix C, which concerns the relationship between the third object group and the cluster group, has the third objects k as rows and the clusters r as columns, and is a non-negative matrix whose elements are real numbers of zero or more representing the degree of affiliation of the third object k to the cluster r; and
feature matrix processing means for extracting at least one characterized cluster from the first object feature matrix A, the second object feature matrix B, and the third object feature matrix C obtained by the feature matrix estimation means.

The feature matrix estimation means includes the following means, which operate when all elements of a certain column of the first object relevance information matrix X are missing, or when at least one element of the first object relevance information matrix X or the second object relevance information matrix Y is missing:
feature matrix initialization means for initializing the first object feature matrix A, the second object feature matrix B, and the third object feature matrix C;
variable initialization means for initializing a variable δ indicating the maximum change width of the parameter update;
first object feature matrix A updating means for updating the first object feature matrix A by a first update formula and updating the variable δ based on a first update condition;
second object feature matrix B updating means for updating the second object feature matrix B by a second update formula and updating the variable δ based on a second update condition;
third object feature matrix C updating means for updating the third object feature matrix C by a third update formula and updating the variable δ based on a third update condition;
count update means for updating the repetition count; and
repeat means for repeating the variable initialization means, the first object feature matrix A updating means, the second object feature matrix B updating means, the third object feature matrix C updating means, and the count update means when the repetition count is less than the maximum number or when the variable δ is larger than a predetermined threshold ε.
The first update formula is
[formula image not reproduced],
the second update formula is
[formula image not reproduced],
and the third update formula is
[formula image not reproduced]
(where, in the first update formula, the second update formula, and the third update formula, an auxiliary definition [formula image not reproduced] applies, and the symbols denote, in order:
the set of all columns j not missing in the i-th row of the first object relevance information matrix X,
the set of all columns k not missing in the i-th row of the second object relevance information matrix Y,
the set of all rows i not missing in the j-th column of the first object relevance information matrix X,
the set of all rows i not missing in the k-th column of the second object relevance information matrix Y,
the estimate of x_ij obtained from the first object feature matrix A and the second object feature matrix B, and
the estimate of y_ik obtained from the first object feature matrix A and the third object feature matrix C).
The first update condition is that, when the maximum absolute difference between the element values of the first object feature matrix A before and after the update, max_{i,r} |a_ir^(new) - a_ir^(old)| (where a_ir^(old) is the value of an element of the first object feature matrix A before the assignment and a_ir^(new) is its value after the assignment), is greater than the variable δ, the variable δ is updated to that maximum.
The second update condition is that, when the maximum absolute difference between the element values of the second object feature matrix B before and after the update, max_{j,r} |b_jr^(new) - b_jr^(old)| (where b_jr^(old) is the value of an element of the second object feature matrix B before the assignment and b_jr^(new) is its value after the assignment), is greater than the variable δ, the variable δ is updated to that maximum.
The third update condition is that, when the maximum absolute difference between the element values of the third object feature matrix C before and after the update, max_{k,r} |c_kr^(new) - c_kr^(old)| (where c_kr^(old) is the value of an element of the third object feature matrix C before the assignment and c_kr^(new) is its value after the assignment), is greater than the variable δ, the variable δ is updated to that maximum.
The cluster extraction device according to claim 1, configured as described above.

The cluster extraction device according to claim 1 or 2, wherein, in the feature matrix estimation means, the first object relevance information matrix X = {x_ij} and the second object relevance information matrix Y = {y_ik} satisfy the relation that, for any first object i, the sum over the second objects j and the sum over the third objects k take similar values.

The cluster extraction device according to claim 1, wherein the first object is a user, the second object is a product, and the third object is a product category.

A cluster extraction method for extracting clusters from data given in a matrix format, the method comprising performing:
an information processing step of acquiring
a first object relevance information matrix X = {x_ij}, which concerns the relationship between a first object group and a second object group, has first objects i as rows and second objects j as columns, and is a non-negative matrix having a micro attribute that indicates the degree of relevance between the first object i and the second object j,
a second object relevance information matrix Y = {y_ik}, which concerns the relationship between the first object group and a third object group, has the first objects i as rows and third objects k as columns, and is a non-negative matrix having a macro attribute that indicates the degree of relevance between the first object i and the third object k, and
a third object relevance information matrix W = {w_jk}, which concerns the relationship between the second object group and the third object group, has the second objects j as rows and the third objects k as columns, is a non-negative matrix whose element is a non-negative real number when the second object j is related to the third object k and zero when it is not, and represents the relationship between the micro attribute and the macro attribute;
a feature matrix estimation step of obtaining, using the third object relevance information matrix W,
a first object feature matrix A, which concerns the relationship between the first object group and a cluster group that classifies the first object group, has the first objects i as rows and clusters r as columns, and is a non-negative matrix whose elements are real numbers of zero or more representing the degree of affiliation of the first object i to the cluster r,
a second object feature matrix B, which concerns the relationship between the second object group and the cluster group, has the second objects j as rows and the clusters r as columns, and is a non-negative matrix whose elements are real numbers of zero or more representing the degree of affiliation of the second object j to the cluster r, and
a third object feature matrix C, which concerns the relationship between the third object group and the cluster group, has the third objects k as rows and the clusters r as columns, and is a non-negative matrix whose elements are real numbers of zero or more representing the degree of affiliation of the third object k to the cluster r; and
a feature matrix processing step of extracting a characterized cluster from the first object feature matrix A, the second object feature matrix B, and the third object feature matrix C obtained in the feature matrix estimation step.

In the feature matrix estimation step, when all elements of a certain column of the first object relevance information matrix X are missing, or when at least one element of the first object relevance information matrix X or the second object relevance information matrix Y is missing, the following steps are performed:
a feature matrix initialization step of initializing the first object feature matrix A, the second object feature matrix B, and the third object feature matrix C;
a variable initialization step of initializing a variable δ indicating the maximum change width of the parameter update;
a first object feature matrix A update step of updating the first object feature matrix A by a first update formula and updating the variable δ based on a first update condition;
a second object feature matrix B update step of updating the second object feature matrix B by a second update formula and updating the variable δ based on a second update condition;
a third object feature matrix C update step of updating the third object feature matrix C by a third update formula and updating the variable δ based on a third update condition;
a count update step of updating the repetition count; and
a repeating step of repeating the variable initialization step, the first object feature matrix A update step, the second object feature matrix B update step, the third object feature matrix C update step, and the count update step when the repetition count is less than the maximum number or when the variable δ is larger than a predetermined threshold ε.
The first update formula is