JP6608721B2

JP6608721B2 - Data analysis apparatus and program

Info

Publication number: JP6608721B2
Application number: JP2016023158A
Authority: JP
Inventors: 英生梅谷; 郁大濱
Original assignee: Panasonic Intellectual Property Corp of America
Current assignee: Panasonic Intellectual Property Corp of America
Priority date: 2016-02-09
Filing date: 2016-02-09
Publication date: 2019-11-20
Anticipated expiration: 2036-02-09
Also published as: JP2017142629A

Description

本発明は、データ分析装置及びプログラムに関する。 The present invention relates to a data analyzer and program.

In recent years, networking has advanced, and a wide variety of data is now collected and accumulated through a wide variety of devices. Such data includes website access information, customer purchase histories, program recording and viewing histories, and customer information such as age and sex. Against this background, recommendation services have emerged that use SNS (Social Networking Service) friendships, purchase histories, and the like to cluster users by attributes such as preferences and to recommend products. As a currently known clustering method, a method that obtains a clustering result using the spectral decomposition of a tensor has been proposed (see, for example, Non-Patent Document 1). In Non-Patent Document 1, the input matrix is expanded into a tensor, and singular value decomposition and CP decomposition, its three-dimensional extension, are used to obtain matrices such that the input matrix can be approximated by the product of three matrices. Using these clustering results makes services such as recommendation possible.

A Tensor Approach to Learning Mixed Membership Community Models, October 28, 2013

先行技術では、入力データをテンソルに拡張するため、計算メモリと計算時間が入力データサイズの３乗オーダーになり、計算コストが大きい。また、入力行列に対称性が仮定されており、クラスタ構造を表す行列にフルランクが仮定されている。これは実際にビジネスで取得されるデータに適用する上では厳しい制約である。 In the prior art, since the input data is expanded to a tensor, the calculation memory and calculation time are in the order of the cube of the input data size, and the calculation cost is high. Also, symmetry is assumed for the input matrix, and full rank is assumed for the matrix representing the cluster structure. This is a severe limitation in applying to data that is actually acquired in business.

Accordingly, the present invention provides a data analysis apparatus and program that reduce the computational memory and impose no constraints on the input matrix.

A data analysis apparatus according to one aspect of the present invention decomposes an N-row, M-column basic matrix, which indicates the degree of association between each of N first objects and each of M second objects, into three matrices (a first matrix, a second matrix, and a third matrix) to cluster the first objects and the second objects. The apparatus includes: an acquisition unit that acquires the basic matrix, in which a value indicating the degree of association has been entered for each element; a setting unit that sets K, the number of clusters of the first objects and the second objects; a dividing unit that divides the basic matrix into at least three submatrices of the basic matrix; a transformation matrix generation unit that generates a transformation matrix for each submatrix from the result of a singular value decomposition using the K largest singular values; an inner product calculation unit that compresses each submatrix by computing the inner product of the submatrix and its corresponding transformation matrix; a tensor generation unit that creates a third-order tensor from the compressed submatrices; a decomposition unit that decomposes the tensor using CP decomposition (CANDECOMP/PARAFAC decomposition); a calculation unit that calculates, from the decomposition result of the decomposition unit, the first matrix, the second matrix, and the third matrix into which the basic matrix is decomposed; and an output unit that outputs the clustering result of the first objects and the second objects by outputting at least one of the first matrix, the second matrix, and the third matrix.

Note that these general or specific aspects may be implemented as a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or as any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.

The data analysis apparatus and program of the present invention reduce the amount of memory and the computation time used for the calculation, and enable clustering that imposes no constraints on the input data.

FIG. 1 is a block diagram showing a schematic configuration of a data analysis system for executing the data analysis method according to Embodiment 1.
FIG. 2 is an explanatory diagram showing an example of an input matrix according to Embodiment 1.
FIG. 3 is a block diagram showing a schematic configuration of the data analysis apparatus according to Embodiment 1.
FIG. 4 is an explanatory diagram of the dividing unit of the data analysis apparatus according to Embodiment 1.
FIG. 5 is an explanatory diagram showing an example of an input matrix according to Embodiment 1.
FIG. 6 is an explanatory diagram showing an example of the parameters obtained by applying the data analysis method to the input matrix of FIG. 5.
FIG. 7 is a flowchart showing the flow of the data analysis method according to Embodiment 1.
FIG. 8 is a modification of the block diagram showing the configuration of the data decomposition apparatus according to Embodiment 1.
FIG. 9 is a block diagram showing a schematic configuration of the data analysis apparatus according to Embodiment 2.
FIG. 10 is an explanatory diagram showing the input transformation unit of the data analysis apparatus according to Embodiment 2.
FIG. 11 is an explanatory diagram showing an example of parameter transformation in the parameter transformation unit of the data analysis apparatus according to Embodiment 2.
FIG. 12 is a flowchart showing the flow of the data analysis method according to Embodiment 2.

(Findings underlying the present invention)
The inventor has found that the method described in the "Background Art" section has the following problems.

従来の方法は、入力として与えられる行列を３次元のテンソル形式に変形する必要がある。そのため、計算時のメモリの使用量と計算時間が入力データサイズの３乗オーダーが必要になる。しかしながら、大規模なデータを扱う際に、計算コストが大きく、データをメモリに載せて計算することができないという課題があった。また、入力行列が対称行列であり、さらにその入力情報に潜在的に含まれているクラスタ構造がフルランクであることを仮定している。しかしながら、購買履歴などの、あるユーザがある商品を買ったという情報は、行列形式では、ユーザ×商品になり、対称形にはならない。さらに、クラスタ構造がフルランクであるとは、ユーザ側のクラスタ数と商品側のクラスタ数が同じであることを仮定している。これは、現実的には考えにくく、実問題を扱う上では非常に厳しい制約となっている。 The conventional method needs to transform a matrix given as an input into a three-dimensional tensor format. For this reason, the amount of memory used and the calculation time at the time of calculation require the cube of the input data size. However, when dealing with large-scale data, there is a problem that the calculation cost is high and the calculation cannot be performed by placing the data in a memory. Also, it is assumed that the input matrix is a symmetric matrix and that the cluster structure potentially included in the input information is full rank. However, information that a certain user has purchased a certain product such as a purchase history becomes user × product in a matrix format, and is not symmetrical. Further, the fact that the cluster structure is full rank assumes that the number of clusters on the user side and the number of clusters on the product side are the same. This is difficult to think realistically, and is a very severe restriction in dealing with actual problems.

To solve these problems, a data analysis method according to one aspect of the present invention decomposes an N-row, M-column basic matrix, which indicates the degree of association between each of N first objects and each of M second objects, into three matrices (a first matrix, a second matrix, and a third matrix) to cluster the first objects and the second objects. The method includes: an acquisition step of acquiring the basic matrix, in which a value indicating the degree of association has been entered for each element; a setting step of setting K, the number of clusters of the first objects and the second objects; a dividing step of dividing the basic matrix into at least three submatrices of the basic matrix; a transformation matrix generation step of generating a transformation matrix for each submatrix from the result of a singular value decomposition using the K largest singular values; an inner product calculation step of compressing each submatrix by computing the inner product of the submatrix and its corresponding transformation matrix; a tensor generation step of creating a third-order tensor from the compressed submatrices; a decomposition step of decomposing the tensor using CP decomposition; a calculation step of calculating, from the decomposition result of the decomposition step, the first matrix, the second matrix, and the third matrix into which the basic matrix is decomposed; and an output step of outputting the clustering result of the first objects and the second objects by outputting at least one of the first matrix, the second matrix, and the third matrix.

As a result, the computation requires memory and time only on the order of the square of the data size, without generating a large-scale tensor. Therefore, even large-scale data can be held in memory for the computation.
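To make the difference in order concrete, the following back-of-the-envelope sketch compares a dense N x N x N array with a dense N x N array. The 8-byte entry size, the value of N, and the helper name `dense_bytes` are illustrative assumptions, not values taken from the specification.

```python
def dense_bytes(n: int, order: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed to hold a dense array with `order` axes of length n."""
    return (n ** order) * bytes_per_entry

n = 10_000                     # hypothetical number of objects
cubic = dense_bytes(n, 3)      # a tensor built directly from the input
quadratic = dense_bytes(n, 2)  # matrix-sized working memory
print(f"N^3: {cubic / 1e12:.1f} TB, N^2: {quadratic / 1e9:.1f} GB")
```

Under these assumptions the cubic layout needs N times more memory than the quadratic one, which is why avoiding the full-size tensor matters for large inputs.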

例えば、さらに、取得ステップで取得された基礎行列が対称行列でない場合に、基礎行列の転置行列を用いて対称行列に変形する入力変形ステップを含んでもよい。 For example, when the basic matrix acquired in the acquiring step is not a symmetric matrix, an input deformation step of deforming into a symmetric matrix using a transposed matrix of the basic matrix may be included.

これにより、非対称の入力データにおいても、対称形に変形して計算が可能となる。 As a result, even asymmetric input data can be calculated by being transformed into a symmetric shape.
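The text here does not spell out how the transposed matrix is combined with the basic matrix. One common construction, used below purely as an assumption, places the N x M matrix and its transpose in the off-diagonal blocks of an (N+M) x (N+M) matrix. The helper name `symmetrize` is hypothetical.

```python
import numpy as np

def symmetrize(G: np.ndarray) -> np.ndarray:
    """Embed a non-symmetric N x M matrix into a symmetric (N+M) x (N+M) one.

    Assumed block layout: [[0, G], [G.T, 0]].
    """
    n, m = G.shape
    S = np.zeros((n + m, n + m), dtype=G.dtype)
    S[:n, n:] = G      # upper-right block: users x products
    S[n:, :n] = G.T    # lower-left block: products x users
    return S

# Hypothetical 2-user x 3-product purchase history.
G = np.array([[1, 0, 1],
              [0, 1, 0]])
S = symmetrize(G)
```

With this layout, rows and columns index users and products together, so the symmetric-input procedure can be applied unchanged.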

例えば、変換行列生成ステップで計算した特異値分解の結果を用いて、特異値全体の和に対して、特異値の上位からの和が所定値以上になるＫを設定してもよい。 For example, using the result of the singular value decomposition calculated in the transformation matrix generation step, K may be set such that the sum of the singular values from the top is greater than or equal to a predetermined value with respect to the sum of all singular values.

これにより、クラスタ数であるＫを自動で設定することが可能となる。 This makes it possible to automatically set K as the number of clusters.

例えば、変換行列生成ステップで計算した特異値分解の結果を用いて、特異値全体の２乗和に対して、特異値の上位からの２乗和が所定値以上になるＫを設定してもよい。 For example, using the result of the singular value decomposition calculated in the transformation matrix generation step, even if K is set such that the sum of squares from the top of the singular value is greater than or equal to a predetermined value with respect to the square sum of the entire singular value Good.

例えば、分解ステップは、特異値分解を用いて分解してもよい。 For example, the decomposition step may be decomposed using singular value decomposition.

これにより、一般的に良く知られている特異値分解のみを利用しての計算が可能となる。 Thereby, calculation using only singular value decomposition which is generally well known is possible.

For example, the method may further include a parameter transformation step that, when the clustering results of the first matrix and the third matrix calculated in the calculation step fall outside the domain, transforms the first matrix and the third matrix by varying the number of clusters so that the results fall within the domain.

これにより、クラスタ構造がフルランクでないものに対しても、正確なクラスタリングを可能とする。 This enables accurate clustering even when the cluster structure is not full rank.

A data analysis apparatus according to one aspect of the present invention decomposes an N-row, M-column basic matrix, which indicates the degree of association between each of N first objects and each of M second objects, into three matrices (a first matrix, a second matrix, and a third matrix) to cluster the first objects and the second objects. The apparatus includes: an acquisition unit that acquires the basic matrix, in which a value indicating the degree of association has been entered for each element; a setting unit that sets K, the number of clusters of the first objects and the second objects; a dividing unit that divides the basic matrix into at least three submatrices of the basic matrix; a transformation matrix generation unit that generates a transformation matrix for each submatrix from the result of a singular value decomposition using the K largest singular values; an inner product calculation unit that compresses each submatrix by computing the inner product of the submatrix and its corresponding transformation matrix; a tensor generation unit that creates a third-order tensor from the compressed submatrices; a decomposition unit that decomposes the tensor using CP decomposition; a calculation unit that calculates, from the decomposition result of the decomposition unit, the first matrix, the second matrix, and the third matrix into which the basic matrix is decomposed; and an output unit that outputs the clustering result of the first objects and the second objects by outputting at least one of the first matrix, the second matrix, and the third matrix.

また、本発明の一態様にかかるプログラムは、コンピュータに、上記のデータ分析方法を実行させるためのプログラムである。 A program according to one embodiment of the present invention is a program for causing a computer to execute the above data analysis method.

(Embodiment 1)
Hereinafter, embodiments will be described in detail with reference to the drawings. Each of the embodiments described below shows a general or specific example. The numerical values, shapes, materials, components, arrangement and connection of components, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present invention. Among the components in the following embodiments, those not recited in the independent claims, which represent the broadest concept, are described as optional components.

[Overall system configuration]
FIG. 1 is a block diagram showing a schematic configuration of a data analysis system for executing the data analysis method according to Embodiment 1.

The data analysis system 1 executes a data analysis method that takes as input an N-row, M-column basic matrix indicating whether each of M objects (second objects) is related to each of N objects (first objects), decomposes it into three matrices, and clusters the users. The objects represent users, products, and the like. As an example of a basic matrix representing the degree of association, for SNS friendships a pair of users is "related" if they are friends and "unrelated" if they are not; in this case the N objects and the M objects are the same set, and the basic matrix is an N-row, N-column symmetric matrix. When the basic matrix is a purchase history, the N objects are users and the M objects are products; a user having purchased a product is "related", and not having purchased it is "unrelated".

図２は、基礎行列の一例を示す説明図である。 FIG. 2 is an explanatory diagram illustrating an example of a basic matrix.

The basic matrix shown in FIG. 2 indicates whether each of the N objects is related to each of the N objects. A value indicating the presence or absence of a relation is entered for each element of the basic matrix. Specifically, "1" is assigned to elements for which a relation exists, and "0" to elements for which it does not. For example, if the basic matrix of FIG. 2 is data representing SNS friendships, "1" is entered where two users are friends and "0" where they are not. The data values and format are only an example, and the present invention is not limited to them.
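As an illustration of this encoding, the sketch below builds a 0/1 basic matrix from a list of friend pairs. The function name `friendship_matrix`, the zero-based user indices, and the example pairs are hypothetical and are not taken from FIG. 2.

```python
import numpy as np

def friendship_matrix(n_users: int, friend_pairs) -> np.ndarray:
    """Return an N x N matrix with 1 where two users are friends, else 0."""
    G = np.zeros((n_users, n_users), dtype=int)
    for i, j in friend_pairs:
        G[i, j] = 1  # user i is related to user j
        G[j, i] = 1  # friendship is mutual, so the matrix is symmetric
    return G

pairs = [(0, 1), (0, 2), (1, 2), (3, 4)]  # hypothetical friendships
G = friendship_matrix(5, pairs)
```

For a purchase history the same idea applies with an N x M array and a single assignment per (user, product) pair, which in general gives a non-symmetric matrix.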

そして、データ分析システム１は、この基礎行列を３つの行列に分解したものを算出することで、対象物をクラスタリングする。 Then, the data analysis system 1 clusters the objects by calculating the basic matrix decomposed into three matrices.

具体的に、データ分析システム１は、図１に示すように、入力装置２００と、表示装置３００と、データ分析装置４００とを備えている。入力装置２００と、表示装置３００と、データ分析装置４００とはネットワーク５００を介して通信可能に接続されている。 Specifically, the data analysis system 1 includes an input device 200, a display device 300, and a data analysis device 400, as shown in FIG. The input device 200, the display device 300, and the data analysis device 400 are communicably connected via a network 500.

ネットワーク５００とは、イーサネット（登録商標）等の有線ネットワーク、無線ＬＡＮ等の無線ネットワーク、公衆網、または、これらのネットワークが組み合わされたネットワーク等である。公衆網とは、電気通信事業者が、不特定多数の利用者の通信のために提供している通信回線のことであり、例えば、一般電話回線またはＩＳＤＮなどが挙げられる。 The network 500 is a wired network such as Ethernet (registered trademark), a wireless network such as a wireless LAN, a public network, or a network in which these networks are combined. A public network is a communication line provided by a telecommunications carrier for communication of an unspecified number of users, and includes, for example, a general telephone line or ISDN.

入力装置２００は、Ｎ行Ｍ列の基礎行列が入力される装置である。入力装置２００は、例えばキーボード、タッチパネル、ポインティングデバイスなどの入力部２１０を備えたパーソナルコンピューター、スマートフォン、フィーチャーフォン、タブレット端末などである。入力装置２００は、Ｎ行Ｍ列の基礎行列が入力されると、当該基礎行列を、ネットワーク５００を介してデータ分析装置４００に送信する。 The input device 200 is a device to which a basic matrix of N rows and M columns is input. The input device 200 is, for example, a personal computer, a smartphone, a feature phone, a tablet terminal, or the like provided with an input unit 210 such as a keyboard, a touch panel, or a pointing device. When the basic matrix of N rows and M columns is input, the input device 200 transmits the basic matrix to the data analysis device 400 via the network 500.

表示装置３００は、クラスタリング結果を表す３つの行列のうち少なくとも一つの行列がデータ分析装置４００から入力されると、当該少なくとも一つの行列を表示する装置である。表示装置３００は、例えばディスプレイなどの表示部３１０を備えたパーソナルコンピューター、スマートフォン、フィーチャーフォン、タブレット端末などである。表示装置３００の表示部３１０に表示された少なくとも一つの行列を解析者が閲覧することで、クラスタリングされた結果を解析することができる。 The display device 300 is a device that displays at least one matrix when at least one of the three matrices representing the clustering result is input from the data analysis device 400. The display device 300 is, for example, a personal computer, a smartphone, a feature phone, a tablet terminal, or the like provided with a display unit 310 such as a display. The analyst views at least one matrix displayed on the display unit 310 of the display device 300, so that the clustered result can be analyzed.

なお、本実施の形態では、入力装置２００と表示装置３００とが独立した異なる端末である場合を例示しているが、入力装置２００と表示装置３００とが一台の端末であってもよい。また、入力装置２００と表示装置３００とデータ分析装置４００が一台の端末であってもよいし、ネットワークを介さずに接続されていてもよい。 In this embodiment, the case where the input device 200 and the display device 300 are independent and different terminals is illustrated, but the input device 200 and the display device 300 may be a single terminal. The input device 200, the display device 300, and the data analysis device 400 may be a single terminal, or may be connected without a network.

[Data analysis apparatus]
The data analysis apparatus 400 is a processing apparatus that takes an N-row, M-column basic matrix as input and calculates three matrices as the clustering result. In Embodiment 1, the input basic matrix is described as an N-row, N-column symmetric matrix. The data analysis apparatus 400 is, for example, a server, a personal computer, a smartphone, a feature phone, or a tablet terminal.

図３は、データ分析装置４００の概略構成を示すブロック図である。 FIG. 3 is a block diagram illustrating a schematic configuration of the data analysis apparatus 400.

図３に示すように、データ分析装置４００は、取得部４１０と、処理部４２０と、出力部４３０とを備えている。 As illustrated in FIG. 3, the data analysis device 400 includes an acquisition unit 410, a processing unit 420, and an output unit 430.

取得部４１０は、入力装置２００からネットワーク５００を介して入力された基礎行列を取得し、処理部４２０に出力する。処理部４２０は、取得部４１０から入力された基礎行列を３つの行列に分解する処理部であり、ＣＰＵ、ＲＡＭ、ＲＯＭ等を備える。処理部４２０は、設定部４２１、分割部４２２、変換行列生成部４２３、内積計算部４２４、テンソル生成部４２５、分解部４２６、パラメータ計算部４２７と、を備える。 The acquisition unit 410 acquires a basic matrix input from the input device 200 via the network 500 and outputs the basic matrix to the processing unit 420. The processing unit 420 is a processing unit that decomposes the basic matrix input from the acquisition unit 410 into three matrices, and includes a CPU, a RAM, a ROM, and the like. The processing unit 420 includes a setting unit 421, a dividing unit 422, a transformation matrix generation unit 423, an inner product calculation unit 424, a tensor generation unit 425, a decomposition unit 426, and a parameter calculation unit 427.

設定部４２１は、クラスタリングを行う際に用いられるクラスタ数を記憶している。例えば、対象物をＫ個のクラスタに分けるなどの設定値である。なお、設定項目は、設定部４２１に予め記憶されていなくとも、入力装置２００から入力された設定値を設定項目としてもよい。この場合、入力装置２００から取得部４１０を介して受信した設定項目を設定部４２１が記憶する。また、後に説明するように入力された基礎行列の情報を用いて自動的に決められてもよい。 The setting unit 421 stores the number of clusters used when performing clustering. For example, it is a set value such as dividing an object into K clusters. Note that the setting item may be a setting value input from the input device 200 even if it is not stored in the setting unit 421 in advance. In this case, the setting unit 421 stores the setting item received from the input device 200 via the acquisition unit 410. Further, as will be described later, it may be automatically determined by using the input basic matrix information.

The dividing unit 422 divides the input basic matrix into at least three submatrices. As a specific example, the case of generating three submatrices is described with reference to FIG. 4. For the data size $N$ of the basic matrix, four sets $X, A, B, C \subset N$ (sets of row and column indices) are generated by an arbitrary method. Then, for the input basic matrix $G \in \{0,1\}^{N \times N}$, the submatrices $G_{X,A}$, $G_{X,B}$, and $G_{X,C}$ are generated such that $G_{X,A} \in \{0,1\}^{X \times A}$, $G_{X,B} \in \{0,1\}^{X \times B}$, and $G_{X,C} \in \{0,1\}^{X \times C}$.
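A minimal sketch of this division follows. Because the text allows the four index sets to be generated by an arbitrary method, the choice below (a random permutation split into four disjoint, nearly equal parts) is an assumption, as is the helper name `split_submatrices`.

```python
import numpy as np

def split_submatrices(G: np.ndarray, rng: np.random.Generator):
    """Pick index sets X, A, B, C and return G[X, A], G[X, B], G[X, C]."""
    n = G.shape[0]
    # Assumed method: shuffle all indices and cut them into four disjoint sets.
    X, A, B, C = np.array_split(rng.permutation(n), 4)
    subs = (G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)])
    return (X, A, B, C), subs

rng = np.random.default_rng(0)
upper = np.triu(rng.integers(0, 2, size=(12, 12)), 1)
G = upper + upper.T  # hypothetical symmetric 0/1 basic matrix
(X, A, B, C), (G_XA, G_XB, G_XC) = split_submatrices(G, rng)
```

Each submatrix shares the row set X, so row x of the three submatrices describes the same object seen against three different column groups.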

The transformation matrix generation unit 423 performs singular value decomposition on each of the submatrices generated by the dividing unit and, using the decomposition results for the K largest singular values, generates a transformation matrix corresponding to each submatrix. Specifically, the transformation matrix generation unit 423 generates the transformation matrix corresponding to each submatrix using Equations (1) to (3).

ここで、Ｋは設定部４２１にて設定された値を用いてもよいし、特異値分解を行った結果を用いて、例えば、特異値の総和または２乗和に対する特異値の上位Ｋ番目までの和または２乗和がある一定の比率（所定値）を超えるようなＫを設定してもよい。ここで、ある一定の比率とは、７０％以上として設定し、この範囲を満たしていれば自由に決めてよい。 Here, K may be a value set by the setting unit 421, or, for example, up to the highest Kth singular value with respect to the sum of singular values or the sum of squares using the result of singular value decomposition. Alternatively, K may be set such that the sum or square sum exceeds a certain ratio (predetermined value). Here, the certain ratio is set as 70% or more, and may be freely determined as long as this range is satisfied.
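The automatic choice of K described above can be sketched as follows: take the smallest K for which the sum (or squared sum) of the K largest singular values reaches the given ratio of the total. The function name `choose_k` and the example singular values are hypothetical; the 0.7 default mirrors the 70% figure in the text.

```python
import numpy as np

def choose_k(singular_values, ratio: float = 0.7, squared: bool = False) -> int:
    """Smallest K whose top-K (squared) singular values reach `ratio` of the total."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]  # descending
    if squared:
        s = s ** 2
    cumulative = np.cumsum(s) / s.sum()
    # First position where the cumulative share is at least `ratio`.
    k = int(np.searchsorted(cumulative, ratio)) + 1
    return min(k, len(s))  # guard against rounding at ratio == 1.0

example = [5.0, 3.0, 1.0, 1.0]  # hypothetical singular values
k_sum = choose_k(example)                  # cumulative shares 0.5, 0.8, ...
k_sq = choose_k(example, squared=True)     # squared shares 25/36, 34/36, ...
```

In practice the singular values would come from a call such as `np.linalg.svd(G_sub, compute_uv=False)` on a submatrix.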

The inner product calculation unit 424 compresses each submatrix by computing the inner product of the submatrix generated by the dividing unit and the transformation matrix generated by the transformation matrix generation unit. Specifically, for $x \in X$ and $k \in K$, the submatrix $G_{X,A}$ and the transformation matrix $W_A$ are compressed by the inner product calculation of Equation (4).

Here, $V_A \in \mathbb{R}^{X \times K}$, and $V_A(x,k)$ is the element in row $x$ and column $k$.

Similarly, the submatrices $G_{X,B}$ and $G_{X,C}$ and their corresponding transformation matrices are compressed by the inner product calculations of Equations (5) and (6).

As a result, a compressed matrix is generated for each submatrix.
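Equations (1) to (6) are not reproduced in this text, so the following is only a sketch of the general shape of the computation. It assumes the transformation matrix is formed from the top-K right singular vectors scaled by the inverse singular values, which is one common whitening-style choice and not necessarily the form used in the specification. The names `transformation_matrix` and `compress` are hypothetical.

```python
import numpy as np

def transformation_matrix(G_sub: np.ndarray, k: int) -> np.ndarray:
    """Assumed form: top-k right singular vectors divided by their singular values."""
    U, s, Vt = np.linalg.svd(G_sub, full_matrices=False)
    return Vt[:k].T / s[:k]  # shape (number of columns of G_sub, k)

def compress(G_sub: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Inner product of each submatrix row with each column of W."""
    return G_sub @ W         # shape (|X|, k)

rng = np.random.default_rng(1)
G_XA = rng.random((6, 5))    # hypothetical |X| x |A| submatrix
W_A = transformation_matrix(G_XA, k=2)
V_A = compress(G_XA, W_A)
```

Whatever the exact form of the transformation matrix, the point of this step is the shape change: each |X| x |A| submatrix becomes an |X| x K matrix, so later steps scale with K rather than with the data size.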

テンソル生成部４２５は、内積計算部４２４で生成した３つの圧縮された部分行列を３階のテンソル形式に変形する。具体的には、式（７）を用いて、テンソル形式に変形する。 The tensor generation unit 425 transforms the three compressed submatrices generated by the inner product calculation unit 424 into a third-order tensor format. Specifically, it is transformed into a tensor format using equation (7).

Here, $V_A(x,k)$ is the row vector of the $x$-th row of $V_A$.
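Equation (7) is not reproduced in this text. A plausible reading, used here only as an assumption, is that the tensor is the average over x of the outer product of the x-th rows of the three compressed matrices. The helper name `build_tensor` is hypothetical.

```python
import numpy as np

def build_tensor(V_A: np.ndarray, V_B: np.ndarray, V_C: np.ndarray) -> np.ndarray:
    """Average of row-wise outer products; the result has shape (K, K, K)."""
    n_rows = V_A.shape[0]
    # T[i, j, k] = (1 / |X|) * sum_x V_A[x, i] * V_B[x, j] * V_C[x, k]
    return np.einsum('xi,xj,xk->ijk', V_A, V_B, V_C) / n_rows

V = np.eye(2)  # hypothetical compressed matrices with |X| = 2, K = 2
T = build_tensor(V, V, V)
```

Because the tensor is built from the compressed matrices, its size is K x K x K regardless of how many objects the basic matrix contains.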

分解部４２６は、テンソル生成部４２５で生成したテンソルを、ＣＰ分解を用いて分解する。なお、ＣＰ分解は、行列に定義された特異値分解を３次元のテンソルに適用できるように拡張したものであるため、特異値分解を用いて同様の分解を行ってもよい。分解結果は、式（８）に基づいて、求めるべきパラメータである３つの行列の一部を含む形で算出される。 The decomposition unit 426 decomposes the tensor generated by the tensor generation unit 425 using CP decomposition. Since the CP decomposition is an extension of the singular value decomposition defined in the matrix so that it can be applied to a three-dimensional tensor, the same decomposition may be performed using the singular value decomposition. The decomposition result is calculated based on Expression (8) in a form including a part of three matrices that are parameters to be obtained.
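The text does not say which algorithm computes the CP decomposition. The sketch below uses alternating least squares (ALS), a standard technique swapped in here for illustration; the function names, iteration count, and random initialization are assumptions.

```python
import numpy as np

def cp_als(T: np.ndarray, rank: int, n_iter: int = 100, seed: int = 0):
    """Approximate T[i,j,k] by sum_r A[i,r] * B[j,r] * C[k,r] using ALS."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    for _ in range(n_iter):
        # Each update solves a linear least-squares problem for one factor
        # while the other two are held fixed.
        A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def cp_reconstruct(A, B, C) -> np.ndarray:
    """Rebuild the tensor from its CP factors."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# Hypothetical rank-1 tensor used as a sanity check.
a, b, c = np.array([1.0, 2.0, 3.0]), np.array([1.0, -1.0]), np.array([2.0, 0.5])
T = np.einsum('i,j,k->ijk', a, b, c)
A, B, C = cp_als(T, rank=1)
```

The factors are determined only up to scaling and permutation of components, so a later step has to normalize them before they are interpreted as cluster parameters.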

The parameter calculation unit 427 uses the result of the decomposition by the decomposition unit to calculate the three matrices $\Pi^T$, $P$, and $\Pi$, which are the parameters to be obtained. Specifically, they are calculated using Equations (9) to (13).

なお、Πを求める際に、確率的ブロックモデル（ＳＢＭ）を仮定するならば、式（１４）を用いてもよい。 Note that if a stochastic block model (SBM) is assumed when obtaining Π, equation (14) may be used.

また、混合メンバシップ・ブロックモデルを仮定するならば、式（１５）を用いてもよい。 If a mixed membership block model is assumed, equation (15) may be used.

出力部４３０は、第一行列、第二行列、第三行列の少なくとも一つを表示装置３００に出力する。出力部４３０は、基礎行列、第一行列、第二行列及び第三行列を一括して出力してもよいし、これらを組み合わせて出力してもよい。また、出力部４３０は、最終的な、第一行列、第二行列、第三行列の積を出力情報の一つとして出力してもよい。 The output unit 430 outputs at least one of the first matrix, the second matrix, and the third matrix to the display device 300. The output unit 430 may output the basic matrix, the first matrix, the second matrix, and the third matrix in a lump, or may output them in combination. The output unit 430 may output the final product of the first matrix, the second matrix, and the third matrix as one piece of output information.

図５は、本実施の形態に係る基礎行列の一例である。図６は、本実施の形態に係るパラメータである３つの行列の一例である。 FIG. 5 is an example of a basic matrix according to the present embodiment. FIG. 6 is an example of three matrices that are parameters according to the present embodiment.

For example, given a 9-row, 9-column basic matrix as in FIG. 10 and K = 3, the result is calculated, as shown in FIG. 11, as a 9-row, 3-column first matrix corresponding to $\Pi^T$, a 3-row, 3-column second matrix corresponding to $P$, and a 3-row, 9-column third matrix corresponding to $\Pi$. For example, if the input matrix of FIG. 5 is SNS friendship data, the first matrix of the parameters in FIG. 6 expresses statements such as "user U1 belongs to cluster C1". From this, the users belonging to cluster C1 are U1, U4, and U6. The users belonging to clusters C2 and C3 can be read from the parameters in the same way. Because the basic matrix is symmetric, the third matrix has the same meaning as the first matrix. The second matrix represents the strength of association between the clusters of the first matrix and the clusters of the third matrix. This second matrix shows that association within the same cluster is strong, while there is no association at all between different clusters.
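The way cluster membership is read from the first matrix can be sketched as follows. Only the statement that U1, U4, and U6 belong to C1 comes from the text; the assignments of the other six users and the identity second matrix are hypothetical stand-ins for the figures, which are not reproduced here.

```python
import numpy as np

# Hypothetical cluster index per user U1..U9 (0 -> C1, 1 -> C2, 2 -> C3).
# Only U1, U4, U6 -> C1 is taken from the text; the rest is assumed.
assignment = [0, 1, 2, 0, 1, 0, 2, 1, 2]

first = np.zeros((9, 3))              # 9 x 3 first matrix
first[np.arange(9), assignment] = 1.0
second = np.eye(3)                    # assumed: relation only within a cluster
third = first.T                       # 3 x 9; same meaning for symmetric input

labels = first.argmax(axis=1)         # strongest cluster for each user
clusters = {
    f"C{c + 1}": [f"U{u + 1}" for u in np.flatnonzero(labels == c)]
    for c in range(3)
}
approx = first @ second @ third       # 9 x 9 approximation of the basic matrix
```

The product of the three matrices gives a value for every pair of users, which is what makes the decomposition usable for recommendation as well as for grouping.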

[Data analysis method]
Next, the data analysis method according to the present embodiment will be described.

FIG. 7 is a flowchart showing the flow of the data analysis method according to the present embodiment.

In the input device 200, a value indicating the presence or absence of a relationship is entered for each element of the basic matrix. The value of K, which indicates the number of clusters, is also entered in the input device 200. After these entries, the input device 200 outputs the basic matrix and K to the data analysis device 400. Note that if the setting items have already been set in the setting unit 421 of the data analysis device 400 and are to be used in the subsequent processing, they need not be entered with the input device 200.

The acquisition unit 410 of the data analysis device 400 acquires the basic matrix and K input from the input device 200 via the network (step S101).

The setting unit 421 stores K acquired by the acquisition unit 410 as a setting item (step S102).

Next, the dividing unit 422 generates three submatrices from the basic matrix (step S103).

Next, the transformation matrix generation unit 423 performs singular value decomposition on each of the three submatrices generated in step S103 and, using the decomposition result restricted to the K largest singular values, generates three transformation matrices corresponding to the three submatrices (step S104).

Next, the inner product calculation unit 424 calculates the inner product of each submatrix generated in step S103 and the corresponding transformation matrix generated in step S104, thereby generating a compressed version of each submatrix (step S105).
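Steps S104 and S105 can be sketched as a truncated singular value decomposition followed by a projection. The exact form of the transformation matrix is not given in this passage, so the sketch below simply uses the top-K left singular vectors as a stand-in; the function names and the random test data are hypothetical.

```python
import numpy as np

def top_k_transform(sub, k):
    """Return the k left singular vectors of `sub` with the largest singular values."""
    u, s, vt = np.linalg.svd(sub, full_matrices=False)
    # np.linalg.svd returns singular values in descending order,
    # so the first k columns of u belong to the k largest values.
    return u[:, :k]

def compress(sub, transform):
    """Project `sub` onto the k-dimensional subspace (result is k x columns)."""
    return transform.T @ sub

rng = np.random.default_rng(0)
sub = rng.integers(0, 2, size=(6, 8)).astype(float)  # made-up 0/1 submatrix

w = top_k_transform(sub, 3)
small = compress(sub, w)
print(small.shape)
```

The compressed matrix has K rows regardless of how many rows the submatrix had, which is what keeps the later tensor small.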

Next, the tensor generation unit 425 generates third-order tensor data using the submatrices compressed in step S105 (step S106).

Next, the decomposition unit 426 decomposes the tensor generated in step S106 using CP decomposition (step S107).
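As a reminder of what CP decomposition produces, the following sketch shows the CP model itself: a third-order tensor written as a weighted sum of K rank-one terms. It does not implement the fitting algorithm used in step S107; the factor matrices, weights, and function name are illustrative assumptions.

```python
import numpy as np

def cp_reconstruct(weights, a, b, c):
    """Rebuild a third-order tensor from CP factors.

    T[i, j, k] = sum_r weights[r] * a[i, r] * b[j, r] * c[k, r]
    """
    return np.einsum("r,ir,jr,kr->ijk", weights, a, b, c)

rng = np.random.default_rng(0)
K = 3
weights = np.array([3.0, 2.0, 1.0])                       # made-up weights
a, b, c = (rng.standard_normal((K, K)) for _ in range(3))  # made-up factors
tensor = cp_reconstruct(weights, a, b, c)

# The same tensor written explicitly as a sum of K rank-one tensors.
explicit = sum(
    weights[r] * np.multiply.outer(np.multiply.outer(a[:, r], b[:, r]), c[:, r])
    for r in range(K)
)
print(np.allclose(tensor, explicit))
```

A CP decomposition routine works in the opposite direction: given the tensor, it estimates the weights and the three factor matrices.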

Next, the parameter calculation unit 427 calculates the parameters using the decomposition result of step S107 (step S108).

Finally, the output unit 430 outputs at least one of the parameters calculated in step S108 to the display device 300, and the processing ends (step S109).

FIG. 8 is a block diagram showing a modification of the first embodiment. This data analysis method can also be carried out without the dividing unit of the first embodiment. As one example, the sets X, A, B, C ⊂ N used in the dividing unit 422 are replaced with N, N, N, N ⊂ N. In other words, the calculation uses the basic matrix as it is, without dividing it. The subsequent calculations can be performed in the same way by reading X, A, B, C ⊂ N as N, N, N, N ⊂ N. When the basic matrix is divided, the calculation does not use all of the basic matrix, so some information is lost. Using the whole basic matrix prevents this loss. The approach is particularly effective when the data size of the basic matrix is small.

[Effects]
In the first embodiment, it is no longer necessary to create a large-scale tensor as in the prior art, and processing can be performed with memory usage and calculation time on the order of the square of the data size N. As a result, even large-scale input data (a basic matrix) can be held in memory for calculation.

(Embodiment 2)
The data analysis method described in the first embodiment is limited to cases where the input basic matrix is symmetric. It also assumes that the second matrix P, which is a parameter, is full rank. It therefore cannot handle a basic matrix, such as a purchase history, that is asymmetric and for which the second matrix is not full rank. For such a basic matrix, a method of transforming the basic matrix into symmetric form and a method of calculating a second matrix that is not full rank are described below. Functions that are the same as in the first embodiment are given the same reference signs, and their description is omitted.

FIG. 9 is a block diagram showing a schematic configuration of the data analysis device 401.

The processing unit 440 includes an input transformation unit 428 and a parameter transformation unit 429 in addition to the configuration of the processing unit 420 of the first embodiment.

When the input basic matrix is asymmetric, the input transformation unit 428 transforms it into symmetric form using the transposed matrix of the basic matrix. FIG. 10 is a diagram showing an example of the transformation. As in Expression (16), the input basic matrix is transformed into symmetric form using the transposed matrix of the basic matrix and matrices filled with zeros.

The subsequent calculations are performed using this symmetrically transformed basic matrix. Because the transformation can also be applied when the input basic matrix is already symmetric, it may be performed in the symmetric case as well.
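One common way to build a symmetric matrix from a rectangular matrix, its transpose, and zero blocks is the block arrangement sketched below. Expression (16) is not reproduced in this text, so the block layout, the function name, and the sample purchase data are assumptions rather than the exact form used in the embodiment.

```python
import numpy as np

def symmetrize(basic):
    """Embed an N x M matrix in a symmetric (N+M) x (N+M) block matrix.

    Assumed layout: [[0, X], [X^T, 0]], built from the basic matrix X,
    its transpose, and zero-filled square blocks.
    """
    n, m = basic.shape
    return np.block([
        [np.zeros((n, n)), basic],
        [basic.T, np.zeros((m, m))],
    ])

# Made-up purchase history: 2 users (rows) x 3 items (columns).
purchases = np.array([[1, 0, 1],
                      [0, 1, 0]], dtype=float)
sym = symmetrize(purchases)
print(sym.shape)
```

With this layout the rows and columns of the transformed matrix index users and items together, so both kinds of object are clustered by the same procedure.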

When either the first matrix or the third matrix, which are the parameters calculated by the parameter calculation unit 427, contains information outside the domain, the parameter transformation unit 429 adds new rows or columns to, or removes rows or columns from, the first matrix or the third matrix so that it falls within the domain, and assigns the out-of-domain nodes to new clusters. The domain here is the condition that each object belongs to exactly one cluster. Belonging to two or more clusters, or to no cluster at all, is outside the domain. The second matrix is then recalculated using the transformed first matrix or third matrix. Specifically, the first matrix and the third matrix are transformed according to Expressions (17) to (19).

Here, the Unique function in Expression (18) is a function that removes duplicate elements.

FIG. 11 is an example showing how the first matrix, which is a parameter, is transformed. It can be seen that the objects M7, M8, M9, and M10 of the first matrix before the transformation are outside the domain. It can also be seen, however, that M7 and M8 share the same cluster and that M9 and M10 share the same cluster. Therefore, by adding two new clusters C4 and C5 and assigning these objects to them, a first matrix that falls within the domain can be created. The same can be done for the third matrix.
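The repair of an out-of-domain first matrix can be sketched as follows. Expressions (17) to (19) are not reproduced here, so this is only one plausible reading: rows that do not contain exactly one 1 are grouped by identical pattern (using NumPy's unique as a stand-in for the Unique function), and each group receives a new cluster column. The row values for M7 to M10 and the function name are hypothetical, and the recalculation of the second matrix is not covered.

```python
import numpy as np

def repair_membership(first):
    """Assign out-of-domain rows of a 0/1 membership matrix to new clusters.

    A row is in the domain when it contains exactly one 1. Out-of-domain
    rows that share the same pattern are sent to the same new cluster.
    """
    first = np.asarray(first, dtype=int)
    bad = first.sum(axis=1) != 1
    if not bad.any():
        return first.copy()
    # One new cluster per distinct out-of-domain row pattern.
    patterns, inverse = np.unique(first[bad], axis=0, return_inverse=True)
    inverse = np.ravel(inverse)
    k_old = first.shape[1]
    out = np.zeros((first.shape[0], k_old + len(patterns)), dtype=int)
    out[~bad, :k_old] = first[~bad]                  # keep valid rows as they are
    out[np.flatnonzero(bad), k_old + inverse] = 1    # one-hot in a new column
    return out

# Made-up first matrix for objects M1..M10 and clusters C1..C3.
first = np.array([
    [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1],
    [1, 1, 0], [1, 1, 0],   # M7, M8: belong to two clusters
    [0, 0, 0], [0, 0, 0],   # M9, M10: belong to no cluster
])
fixed = repair_membership(first)
print(fixed.shape)
```

Which of the two new columns plays the role of C4 and which of C5 depends on the sort order of the patterns and carries no meaning in this sketch.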

FIG. 12 is a flowchart showing the flow of the data analysis method according to the second embodiment. In the input device 200, a value indicating the presence or absence of a relationship is entered for each element of the basic matrix. The value of K, which indicates the number of clusters, is also entered in the input device 200. After these entries, the input device 200 outputs the basic matrix and K to the data analysis device 401. Note that if the setting items have already been set in the setting unit 421 of the data analysis device 401 and are to be used in the subsequent processing, they need not be entered with the input device 200.

The acquisition unit 410 of the data analysis device 401 acquires the basic matrix and K input from the input device 200 via the network (step S101).

Next, the input transformation unit 428 transforms the basic matrix into symmetric form using the transposed matrix of the basic matrix, and the transformed matrix is used as the basic matrix from this point on (step S110).

Next, the inner product calculation unit 424 calculates the inner product of each submatrix generated in step S103 and the corresponding transformation matrix generated in step S104, thereby generating a compressed version of each submatrix (step S105).

Next, when the first matrix or the third matrix is outside the domain, the parameter transformation unit 429 transforms it so that it falls within the domain, and recalculates the second matrix using the transformed first matrix and third matrix (step S111).

[Effects]
According to the second embodiment, accurate calculation becomes possible even for a basic matrix that is asymmetric and for which the second matrix is not full rank. In other words, clustering becomes possible using information with an asymmetric structure, such as a purchase history.

In each of the above embodiments, each component may be configured with dedicated hardware or may be realized by executing a software program suited to that component. Each component may be realized by a program execution unit, such as a CPU or a processor, reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory.

In each of the above embodiments, processing executed by a specific processing unit may instead be executed by another processing unit. The order of a plurality of processes may be changed, and a plurality of processes may be executed in parallel.

The data analysis method according to one or more aspects has been described above based on the embodiments, but the present invention is not limited to these embodiments. As long as they do not depart from the gist of the present invention, forms obtained by applying various modifications conceived by those skilled in the art to the embodiments, and forms constructed by combining components of different embodiments, may also be included within the scope of the one or more aspects.

The present invention is useful as a data analysis method, a data analysis device, and a program used for clustering. That is, the present invention is applicable to various fields that require clustering, such as recommendation systems and text classification.

Description of Symbols
1 Data analysis system
200 Input device
300 Display device
400, 401 Data analysis device
410 Acquisition unit
420, 440 Processing unit
421 Setting unit
422 Dividing unit
423 Transformation matrix generation unit
424 Inner product calculation unit
425 Tensor generation unit
426 Decomposition unit
427 Parameter calculation unit
428 Input transformation unit
429 Parameter transformation unit
500 Network

Claims

A data analysis device that clusters N first objects and M second objects by decomposing a basic matrix of N rows and M columns, which indicates the relevance between each of the first objects and each of the second objects, into three matrices, namely a first matrix, a second matrix, and a third matrix, the data analysis device including:
An acquisition unit that acquires the basic matrix in which a value indicating the degree of relevance has been input for each element of the basic matrix;
A setting unit that sets K, which indicates the number of clusters of the first objects and the second objects;
A dividing unit that divides the basic matrix into at least three submatrices of the basic matrix;
A transformation matrix generation unit that, for each of the submatrices, generates a transformation matrix using the result of performing singular value decomposition and taking the K largest singular values;
An inner product calculation unit that compresses each of the submatrices by calculating the inner product of the submatrix and the transformation matrix corresponding to it;
A tensor generation unit that creates a third-order tensor using the compressed submatrices;
A decomposition unit that decomposes the tensor using CP decomposition (CANDECOMP/PARAFAC decomposition);
A calculation unit that uses the decomposition result of the decomposition unit to calculate the first matrix, the second matrix, and the third matrix into which the basic matrix is decomposed; and
An output unit that outputs a clustering result of the first objects and the second objects by outputting at least one of the first matrix, the second matrix, and the third matrix.

The data analysis device according to claim 1, further including an input transformation unit that, when the basic matrix acquired by the acquisition unit is not a symmetric matrix, transforms the basic matrix into a symmetric matrix using the transposed matrix of the basic matrix.

The data analysis device described, wherein the transformation matrix generation unit sets K such that the sum of the singular values, taken from the largest, is equal to or greater than a predetermined value, using the calculated result of the singular value decomposition.

The data analysis device according to claim 1 or 2, wherein the transformation matrix generation unit sets K such that, relative to the sum of squares of all the singular values, the sum of squares of the singular values taken from the largest is equal to or greater than a predetermined value, using the calculated result of the singular value decomposition.

The data analysis device according to claim 1, wherein the decomposition unit performs the decomposition using singular value decomposition.

The data analysis device according to any one of claims 1 to 5, further including a parameter transformation unit that, when the clustering result of the first matrix and the third matrix calculated by the calculation unit is outside the domain, changes the number of clusters so that the result falls within the domain and transforms the first matrix and the third matrix.

A program for causing a computer to execute a data analysis method that clusters N first objects and M second objects by decomposing a basic matrix of N rows and M columns, which indicates the relevance between each of the first objects and each of the second objects, into three matrices, namely a first matrix, a second matrix, and a third matrix,
The data analysis method including:
An acquisition step of acquiring the basic matrix in which a value indicating the degree of relevance has been input for each element of the basic matrix;
A setting step of setting K, which indicates the number of clusters of the first objects and the second objects;
A dividing step of dividing the basic matrix into at least three submatrices of the basic matrix;
A transformation matrix generation step of generating, for each of the submatrices, a transformation matrix using the result of performing singular value decomposition and taking the K largest singular values;
An inner product calculation step of compressing each of the submatrices by calculating the inner product of the submatrix and the transformation matrix corresponding to it;
A tensor generation step of creating a third-order tensor using the compressed submatrices;
A decomposition step of decomposing the tensor using CP decomposition (CANDECOMP/PARAFAC decomposition);
A calculation step of calculating, using the decomposition result of the decomposition step, the first matrix, the second matrix, and the third matrix into which the basic matrix is decomposed; and
An output step of outputting a clustering result of the first objects and the second objects by outputting at least one of the first matrix, the second matrix, and the third matrix.