JP2018097497A

JP2018097497A - Analysis apparatus, analysis method, and program

Info

Publication number: JP2018097497A
Application number: JP2016239885A
Authority: JP
Inventors: 竹内 孝 (Takashi Takeuchi); 上田 修功 (Shuko Ueda)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-12-09
Filing date: 2016-12-09
Publication date: 2018-06-21
Anticipated expiration: 2036-12-09
Also published as: JP6689737B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of analysis by performing non-negative tensor completion using adjacency relations between feature quantities.
SOLUTION: An analysis apparatus includes: a data input unit that accepts an N-th order tensor X, a mask tensor M indicating whether each element of the tensor X is a missing value, and an adjacency graph matrix W⁽ⁿ⁾ representing adjacency relations between feature quantities for each mode; a parameter update unit that, based on the tensor X, the mask tensor M, and the adjacency graph matrix W⁽ⁿ⁾, repeatedly updates a factor matrix A⁽ⁿ⁾ for each mode n so as to minimize the sum of the value of a cost function using the generalized KL divergence between the tensor X and its estimate X^ and the value of a penalty term, defined using the adjacency graph matrix W⁽ⁿ⁾, on the error between feature quantities in the factor matrix A⁽ⁿ⁾; and a parameter output unit that outputs the factor matrix A⁽ⁿ⁾ for each mode n updated by the parameter update unit.
SELECTED DRAWING: Figure 1

Description

The present invention belongs to the field of machine learning and data mining analysis, and relates in particular to tensor decomposition techniques.

Non-negative tensor completion (NTC) is a technique that estimates and restores missing values in data through tensor decomposition of the observed values (Non-Patent Documents 1 and 2). For the tensor decomposition, NTC uses in particular non-negative tensor factorization (NTF) (Non-Patent Documents 3 and 4). NTF exploits the fact that the data values are non-negative to learn, from the data, a small number of essential co-occurrence patterns composed of non-negative values. Non-Patent Document 7 proposes NTC using the generalized KL divergence (Non-Patent Documents 5 and 6).

Non-Patent Document 7 discloses that, when traffic flow data is treated as a third-order tensor with the attributes observation position, time of day, and date, and is analyzed by NTC using the generalized KL divergence, co-occurrence patterns of the times of day and dates at which particular observation points become congested are extracted and the accuracy of missing-value estimation improves.

Non-Patent Document 1: Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758-1789, 2013.
Non-Patent Document 2: T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
Non-Patent Document 3: N. Gillis. The why and how of nonnegative matrix factorization. arXiv preprint arXiv:1401.5226, 2014.
Non-Patent Document 4: A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proc. ICML, 2005.
Non-Patent Document 5: D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.
Non-Patent Document 6: A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Wiley, 2009.
Non-Patent Document 7: 竹内孝, 納谷太, 上田修功. "Non-negative tensor completion using generalized KL divergence and its application to urban traffic flow data", 8th Forum on Data Engineering and Information Management.
Non-Patent Document 8: D. Cai, X. He, J. Han, and T. S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548-1560, 2011.

However, conventional NTC does not consider the relationships among the feature quantities contained in the data attributes, so the accuracy of the analysis drops when the data contains only a small number of observed values. The relationships among feature quantities are, for example, information such as the distance between observation positions, the elapsed time between times of day, and the number of days elapsed between dates.

The present invention has been made in view of the above, and its object is to improve the accuracy of analysis by performing non-negative tensor completion using the adjacency relations between feature quantities.

According to the disclosed technique, there is provided an analysis apparatus comprising:
a data input unit that accepts an N-th order tensor X composed of non-negative elements, a mask tensor M indicating whether each element of the tensor X is a missing value or an observed value, and an adjacency graph matrix W⁽ⁿ⁾ expressing the adjacency relations between feature quantities for each mode;
a parameter update unit that, based on the tensor X, the mask tensor M, and the adjacency graph matrix W⁽ⁿ⁾ accepted by the data input unit, repeatedly updates the factor matrix A⁽ⁿ⁾ for each mode n so as to minimize the sum of the value of a cost function using the generalized KL divergence between the tensor X and its estimate X^ and the value of a penalty term, defined using the adjacency graph matrix W⁽ⁿ⁾, on the error between feature quantities in the factor matrix A⁽ⁿ⁾; and
a parameter output unit that outputs the factor matrix A⁽ⁿ⁾ for each mode n updated by the parameter update unit.

According to the disclosed technique, the accuracy of analysis can be improved by performing non-negative tensor completion using the adjacency relations between feature quantities.

FIG. 1 is a block diagram showing the configuration of an analysis apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing the processing algorithm of the analysis apparatus according to the embodiment of the present invention.
FIG. 3 is a flowchart of the processing executed by the analysis apparatus according to the embodiment of the present invention.
FIG. 4 is a diagram for explaining the effect.

Hereinafter, an embodiment of the present invention (the present embodiment) will be described with reference to the drawings. The embodiment described below is merely an example, and embodiments to which the present invention can be applied are not limited to it.

<Outline of Embodiment of the Present Invention>
In the present embodiment, the adjacency relations between feature quantities are used, and the regularization technique described in Non-Patent Document 8 is applied to NTC, thereby realizing an NTC that takes the relationships among feature quantities into account (for example, the distance between observation positions, the elapsed time between times of day, and the number of days elapsed between dates). By using the adjacency relations between feature quantities, the parameters corresponding to adjacent feature quantities come to have similar estimated values, and the accuracy of missing-value estimation improves.

<Principle of Embodiment of the Present Invention>
First, the principle of the present embodiment will be described.

NTC using the generalized KL divergence takes as input the N-th order tensors X and M and the initial values of the factor matrices A⁽ⁿ⁾ (n = 1, ..., N), and repeatedly updates A⁽ⁿ⁾ (n = 1, ..., N) in order to solve the loss-function minimization problem described below. In the original specification, symbols denoting tensors are underlined.

Let the N-th order non-negative tensor with N modes, whose feature length (number of features) in mode n is I_n (n = 1, ..., N), be

$$X \in \mathbb{R}_+^{I_1 \times \cdots \times I_N},$$

let i = (i₁, ..., i_N) denote an index of the tensor, and let X_(n) denote the n-mode unfolding of the tensor X. Let M be the mask tensor indicating the missing elements of the tensor X, with elements m_i defined as

$$m_i = \begin{cases} 1 & \text{if } x_i \text{ is observed,} \\ 0 & \text{if } x_i \text{ is missing.} \end{cases}$$
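As a minimal sketch of how the pair (X, M) might be prepared, the Python snippet below derives a mask from raw data in which missing readings are encoded as NaN. The NaN encoding and the names `make_mask` and `x_raw` are assumptions made for illustration; the specification only requires that an element of M be 1 for an observed value and 0 for a missing value.

```python
import numpy as np

def make_mask(x_raw):
    """Return (X, M): X with missing entries zero-filled, M with 1 = observed, 0 = missing.

    Assumption: missing readings are encoded as NaN in x_raw.
    """
    m = (~np.isnan(x_raw)).astype(float)
    x = np.where(m == 1.0, x_raw, 0.0)  # placeholder value; masked entries are never used
    return x, m

x_raw = np.array([[1.0, np.nan], [3.0, 4.0]])
X, M = make_mask(x_raw)
```

The zero written into missing positions is only a placeholder, since every later computation multiplies those positions by the corresponding 0 in M.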

Next, an adjacency graph expressing the adjacency relations between feature quantities is defined for each mode. Let the adjacency graph matrix for the n-th mode be W⁽ⁿ⁾ ∈ R₊^(I_n × I_n), and let its element w_{i,i'}⁽ⁿ⁾ ∈ R₊ represent the similarity between the i-th and i'-th feature quantities. That is, the larger the value of w_{i,i'}⁽ⁿ⁾, the higher the similarity.

The tensor completion model uses the CP decomposition, which approximates the observed values by a linear sum of K factors. In non-negative tensor completion, the number of factors is K for every mode. The k-th factor vector corresponding to the n-th mode, and the factor matrix having the factor vectors as its column vectors, are defined as

$$a_k^{(n)} \in \mathbb{R}_+^{I_n}, \qquad A^{(n)} = \left[ a_1^{(n)}, \ldots, a_K^{(n)} \right] \in \mathbb{R}_+^{I_n \times K},$$

and the set of factor matrices is A = {A⁽¹⁾, A⁽²⁾, ..., A⁽ⁿ⁾, ..., A^(N)}. The estimate of the tensor X is defined as the linear sum of the factor vectors,

$$\hat{x}_i = \sum_{k=1}^{K} \prod_{n=1}^{N} a_{i_n,k}^{(n)}.$$

The cost function f is the linear sum of the element-wise approximation errors between X and X^, that is,

$$f(X \mid \hat{X}) = \sum_{i} m_i \, d(x_i \mid \hat{x}_i),$$

where d is an element-wise divergence. From the above, non-negative tensor completion using the CP decomposition is defined as the minimization problem

$$\min_{A} \; f(X \mid \hat{X}) \quad \text{subject to} \quad A^{(n)} \ge 0 \quad (n = 1, \ldots, N).$$
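The CP estimate described above (multiply one element from each mode's k-th factor vector, then sum over the K factors) can be sketched in NumPy as follows. The function name `cp_reconstruct` and the example sizes are illustrative assumptions, not part of the specification.

```python
import numpy as np

def cp_reconstruct(factors):
    """Build X_hat from factor matrices A^(1..N), each of shape (I_n, K).

    x_hat[i_1, ..., i_N] = sum_k prod_n A^(n)[i_n, k]
    """
    K = factors[0].shape[1]
    x_hat = 0.0
    for k in range(K):
        comp = factors[0][:, k]
        for A in factors[1:]:
            comp = np.multiply.outer(comp, A[:, k])  # rank-one outer product
        x_hat = x_hat + comp
    return x_hat

rng = np.random.default_rng(0)
factors = [rng.random((I, 3)) for I in (2, 4, 5)]  # N = 3 modes, K = 3 factors
X_hat = cp_reconstruct(factors)
```

Because every factor matrix is non-negative, every element of the estimate is non-negative as well.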

The "non-negative" in the method name refers to the fact that X and A⁽ⁿ⁾ (n = 1, ..., N) are allowed to take only non-negative values. The attribute co-occurrence patterns are extracted as the factor matrices A⁽ⁿ⁾ (n = 1, ..., N), one pattern for each of the K factors.

The element-wise generalized KL (Kullback-Leibler) divergence is defined as

$$d(x \mid y) = x \log \frac{x}{y} - x + y.$$
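For illustration, the generalized KL divergence and a cost that sums it over observed elements only might be written as below. This is a sketch under assumptions: the standard form d(x|y) = x log(x/y) - x + y with the convention 0 log 0 = 0 is used, and the small constant `eps` and the names `gkl` and `masked_cost` are not from the specification.

```python
import numpy as np

def gkl(x, y, eps=1e-12):
    """Element-wise generalized KL divergence d(x|y) = x*log(x/y) - x + y.

    Uses the convention 0*log(0) = 0; eps guards against division by zero.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ratio = np.where(x > 0, x / (y + eps), 1.0)
    return x * np.log(ratio) - x + y

def masked_cost(X, M, X_hat):
    """Sum of divergences over observed elements only (M is 1 = observed, 0 = missing)."""
    return float(np.sum(M * gkl(X, X_hat)))
```

The divergence is zero when x equals y and grows as the two values move apart, which is why it can serve as a per-element approximation error.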

In the present embodiment, the adjacency graph and the generalized KL divergence are used to define a penalty term Ω(A⁽ⁿ⁾) on the error between feature quantities within the factor matrix A⁽ⁿ⁾ [defining equation not reproduced in this text].

By this definition, when w_{i,i'}⁽ⁿ⁾ has a non-zero value, the error is zero when the i-th element and the i'-th element of the k-th factor vector take the same value. The cost function g of the present embodiment is defined using the NTC cost function f and the sum over modes of the penalty terms Ω(A⁽ⁿ⁾) [defining equation not reproduced in this text].

From the above, the technique of the present embodiment is defined as the technique of finding the A^* that minimizes g.
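One plausible reading of the penalty term, sketched below, sums w_{i,i'}⁽ⁿ⁾ times the generalized KL divergence between the i-th and i'-th elements of each factor vector. The argument order of the divergence, the absence of a scaling factor, and the regularization weight `lam` are assumptions, since the defining equations are not available in this text; the function names are hypothetical.

```python
import numpy as np

def gkl(x, y, eps=1e-12):
    """Element-wise generalized KL divergence d(x|y) = x*log(x/y) - x + y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ratio = np.where(x > 0, x / (y + eps), 1.0)
    return x * np.log(ratio) - x + y

def graph_penalty(A, W):
    """Assumed form: sum_k sum_{i,i'} W[i,i'] * d(A[i,k] | A[i',k])."""
    d = gkl(A[:, None, :], A[None, :, :])  # d[i, j, k] = d(a_ik | a_jk)
    return float(np.sum(W[:, :, None] * d))

def total_cost(f_value, factors, graphs, lam=1.0):
    """Assumed form of g: f plus the (weighted) sum of penalties over modes."""
    return f_value + lam * sum(graph_penalty(A, W) for A, W in zip(factors, graphs))

W = np.array([[0.0, 1.0], [1.0, 0.0]])       # features 0 and 1 are adjacent
A_same = np.array([[1.0, 2.0], [1.0, 2.0]])  # identical rows: no penalty
A_diff = np.array([[1.0, 2.0], [3.0, 1.0]])  # differing rows: positive penalty
```

With these inputs the penalty vanishes for `A_same` and is positive for `A_diff`, matching the stated property that the error is zero when adjacent elements take the same value.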

<Configuration of Analysis Apparatus According to Embodiment of the Present Invention>
Next, the configuration of the analysis apparatus 100 according to the present embodiment will be described. The analysis apparatus 100 can be configured as a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the analysis processing routine described later. As shown in FIG. 1, the analysis apparatus 100 functionally includes a data input unit 10 and a calculation unit 20.

The data input unit 10 receives data X in tensor form and missing-value indicator data (mask tensor) M in tensor form. The lengths (number of features, number of elements) of the modes of X are I₁, ..., I_N, and only non-negative values (0 or more) are allowed for the elements of X. M is accepted only when its mode lengths are I₁, ..., I_N, and its elements are restricted to 0 or 1. An element of X whose corresponding value in M is 0 is treated as a missing value, and an element whose corresponding value is 1 is treated as an observed value. The data input unit 10 also receives data W⁽ⁿ⁾ (n = 1, ..., N) in matrix form. W⁽ⁿ⁾ is an I_n × I_n matrix, and only non-negative values (0 or more) are allowed for its elements.
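The input constraints listed above lend themselves to a short validation routine. The following is a hypothetical sketch; the function name `validate_inputs` and the use of `ValueError` are assumptions, not part of the described apparatus.

```python
import numpy as np

def validate_inputs(X, M, graphs):
    """Check the constraints stated for X, M, and W^(n) (n = 1..N)."""
    if X.shape != M.shape:
        raise ValueError("M must have the same mode lengths I_1..I_N as X")
    if np.any(X < 0):
        raise ValueError("elements of X must be non-negative")
    if not np.isin(M, (0, 1)).all():
        raise ValueError("elements of M must be 0 (missing) or 1 (observed)")
    if len(graphs) != X.ndim:
        raise ValueError("one adjacency graph matrix is required per mode")
    for n, W in enumerate(graphs):
        if W.shape != (X.shape[n], X.shape[n]):
            raise ValueError(f"W^({n + 1}) must be I_n x I_n")
        if np.any(W < 0):
            raise ValueError(f"elements of W^({n + 1}) must be non-negative")

X = np.ones((2, 3))
M = np.array([[1, 0, 1], [1, 1, 0]])
graphs = [np.zeros((2, 2)), np.zeros((3, 3))]
validate_inputs(X, M, graphs)  # passes silently for valid inputs
```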

The data input unit 10 also receives the set of factor matrices A = {A⁽¹⁾, A⁽²⁾, ..., A^(N)} as the initial values of the parameters. Only non-negative values (0 or more) are allowed for the elements of A⁽ⁿ⁾. In addition, the data input unit 10 receives the number T of parameter update iterations.

The calculation unit 20 includes an unfold calculation unit 26, a parameter update unit 30, and a parameter output unit 38.

For each mode n, the unfold calculation unit 26 calculates the n-mode unfolding X_(n), obtained by unfolding the tensor X along mode n, and the n-mode unfolding M_(n), obtained by unfolding the mask tensor M along mode n.
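A compact way to compute an n-mode unfolding in NumPy is shown below. The column ordering follows the convention of Kolda and Bader (Non-Patent Document 2), in which earlier modes vary fastest; the specification does not state which ordering it uses, so this choice is an assumption, as is the function name `unfold`.

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding: a matrix of shape (I_n, product of the other mode lengths).

    Column index j = sum over k != n of i_k * J_k, with earlier modes varying fastest.
    """
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order="F")

T = np.arange(24).reshape(2, 3, 4)  # I_1 = 2, I_2 = 3, I_3 = 4
T1 = unfold(T, 1)                   # unfolding along the second mode (0-based n = 1)
```

The same function applies unchanged to the mask tensor M, so that X_(n) and M_(n) stay aligned element by element.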

The parameter update unit 30 includes a factor matrix update unit 32 and an iteration determination unit 36.

Through the processing of these units, the parameter update unit 30 repeatedly updates the factor matrix A⁽ⁿ⁾ for each of the N modes so as to minimize the sum of the value of the cost function f and the sum over modes of the penalty terms, based on the n-mode unfolding X_(n) of the tensor X and the n-mode unfolding M_(n) of the mask tensor M calculated by the unfold calculation unit 26 and on the input adjacency graph matrix W⁽ⁿ⁾. The number of factors K is determined in advance. The cost function f is expressed using a sum of generalized KL divergences, each representing the distance between an element of the tensor X and the value obtained by multiplying, over the modes n, the corresponding elements of the factor vectors a_k⁽ⁿ⁾ and summing these products over the K factors. Here, the I_n × K factor matrix A⁽ⁿ⁾ consists of the factor vectors a_k⁽ⁿ⁾, each of which corresponds to the k-th factor, has as many dimensions as the number of features I_n of mode n, and consists of non-negative elements.

Here, the method of updating the factor matrix A⁽ⁿ⁾ is described in more detail.

In updating the set of factor matrices A = {A⁽¹⁾, A⁽²⁾, ..., A^(N)}, the same processing is performed in order for n = 1, ..., N, and this is repeated T times.

A⁽ⁿ⁾ is updated by the following calculation. The n-mode unfolding of the estimate X^ is

$$\hat{X}_{(n)} = A^{(n)} \left( A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)} \right)^{\top},$$

where ⊙ denotes the Khatri-Rao product.
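The relationship between the factor matrices, the Khatri-Rao product, and the n-mode unfolding of the estimate can be checked numerically. The sketch below assumes the Kolda and Bader ordering of the product (modes taken in decreasing order, skipping n); the helper names `khatri_rao` and `unfold` are illustrative.

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding with earlier modes varying fastest along the columns."""
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order="F")

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices that share K columns."""
    K = mats[0].shape[1]
    out = mats[0]
    for B in mats[1:]:
        out = (out[:, None, :] * B[None, :, :]).reshape(-1, K)
    return out

rng = np.random.default_rng(0)
A = [rng.random((I, 2)) for I in (3, 4, 5)]   # three modes, K = 2
X_hat = np.einsum("ik,jk,lk->ijl", *A)        # CP estimate
n = 1
B = khatri_rao([A[m] for m in reversed(range(3)) if m != n])  # A^(3) (.) A^(1)
```

Under these assumptions `unfold(X_hat, n)` equals `A[n] @ B.T`, which is what allows each factor matrix to be updated as a matrix problem while the other modes are held fixed.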

With this notation, the minimization problem with respect to the factor matrix A⁽ⁿ⁾ is obtained [equations not reproduced in this text].

Since g(A⁽ⁿ⁾) is separable over the factor vectors a_k⁽ⁿ⁾, the optimal solution of the above minimization problem is found by updating each factor vector a_k⁽ⁿ⁾ in turn. Here, the graph Laplacian is defined as

$$L^{(n)} = D^{(n)} - W^{(n)},$$

where D⁽ⁿ⁾ is an I_n × I_n diagonal matrix whose diagonal entries are the row sums of W⁽ⁿ⁾ and whose off-diagonal entries are 0. The minimizing factor vector is then obtained by solving a linear equation, referred to below as equation (1), in which a matrix C⁽ⁿ⁾ appears [equation (1) and the definition of C⁽ⁿ⁾ are not reproduced in this text].

The operators appearing in the definition of C⁽ⁿ⁾ are the element-wise product and the element-wise quotient of matrices. That is, the expression in parentheses in that definition is the element-wise quotient of the element-wise product of M_(n) and X_(n) by the n-mode unfolding of the estimate X^. Each factor matrix A⁽ⁿ⁾ is updated in turn using this update rule.
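The graph Laplacian described above (a diagonal matrix of row sums minus the adjacency matrix) can be built in two lines. The sketch below uses the common sign convention L = D - W, which is an assumption here, and the function name `graph_laplacian` is illustrative.

```python
import numpy as np

def graph_laplacian(W):
    """L = D - W, where D is diagonal and holds the row sums of W."""
    D = np.diag(W.sum(axis=1))
    return D - W

# Chain graph on three features: 0 - 1 - 2
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = graph_laplacian(W)
a = np.array([1.0, 2.0, 4.0])
smoothness = float(a @ L @ a)  # sum over edges of squared differences
```

For a symmetric W, the quadratic form a·La equals the sum of squared differences across edges (a standard property of the Laplacian, stated here for orientation rather than as the derivation used in the specification). It is small when adjacent features have similar values.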

In accordance with the method described above, the factor matrix update unit 32 first initializes the factor matrix A⁽ⁿ⁾ for each mode n. Then, for each mode n of the tensor X, the factor matrix update unit 32 updates the factor matrix A⁽ⁿ⁾ according to equation (1) above, using the n-mode unfolding X_(n) and the n-mode unfolding M_(n) calculated by the unfold calculation unit 26 and the adjacency graph matrix W⁽ⁿ⁾, so as to minimize the sum of the value of the cost function and the value of the penalty term on the error between feature quantities in the factor matrix A⁽ⁿ⁾. The value of the cost function is expressed using the sum of generalized KL divergences representing the distance between the elements of the product of the factor matrix A⁽ⁿ⁾ with the Khatri-Rao product of the factor matrices A^(n') of the modes n' other than mode n, and the elements of the n-mode unfolding X_(n) of the tensor X.

The iteration determination unit 36 causes the factor matrix update unit 32 to repeat the update of each factor matrix A⁽ⁿ⁾ until a predetermined condition is satisfied. In the present embodiment, the predetermined condition is that the update has been repeated the maximum number of parameter update iterations T.

The parameter output unit 38 outputs the set of factor matrices A = {A⁽¹⁾, A⁽²⁾, ..., A^(N)} finally obtained by the calculation of the parameter update unit 30.

The analysis apparatus 100 according to the present embodiment can be realized, for example, by causing a computer to execute a program describing the processing described in the present embodiment. That is, the functions of the analysis apparatus 100 can be realized by executing a program corresponding to the processing performed by the analysis apparatus 100, using hardware resources such as the CPU, memory, and hard disk built into the computer. The program can be recorded on a computer-readable recording medium (portable memory or the like) and stored or distributed. The program can also be provided over a network, such as via the Internet or electronic mail.

<Operation of Analysis Apparatus According to Embodiment of the Present Invention>
FIG. 2 shows the processing algorithm of the analysis apparatus 100 according to the present embodiment. The operation of the analysis apparatus 100, which performs processing according to this algorithm, is described along the flow (analysis processing routine) of FIG. 3. FIG. 2 also shows the step numbers corresponding to the flow of FIG. 3.

When the data input unit 10 receives the tensor X, the mask tensor M, the number of factors K, the adjacency graph matrices W⁽ⁿ⁾, the initial value of each factor matrix A⁽ⁿ⁾, and the maximum number of parameter update iterations T, the analysis apparatus 100 executes the analysis processing routine shown in FIG. 3. The data received by the data input unit 10 is stored in the memory or the like of the analysis apparatus 100, which is a computer, and is read out as needed for processing.
First, in step S100, each factor matrix A⁽ⁿ⁾ is initialized based on the initial value of that factor matrix received by the data input unit 10. In addition, t is initialized to t = 1.

Next, in step S102, based on the tensor X and the mask tensor M received by the data input unit 10, the n-mode unfolding X_(n) of the tensor X and the n-mode unfolding M_(n) of the mask tensor M are calculated for each mode n.

In step S104, the factor matrix A⁽ⁿ⁾ is updated according to equation (1) above, based on the n-mode unfolding X_(n) of the tensor X and the n-mode unfolding M_(n) of the mask tensor M calculated in step S102, the factor matrices A⁽ⁿ⁾, and the adjacency graph matrix W⁽ⁿ⁾, so as to minimize the sum of the value of the cost function and the value of the penalty term.

In step S106, it is determined whether the processing of step S104 has been performed for all factor matrices A⁽ⁿ⁾. If there is a factor matrix A⁽ⁿ⁾ for which the t-th execution of step S104 has not yet been performed, the process returns to step S104 and performs step S104 for that factor matrix. If the t-th execution of step S104 has been performed for all factor matrices A⁽ⁿ⁾, the process proceeds to step S108.

In step S108, it is determined whether the iteration count satisfies t = T. If t = T, the process proceeds to step S112. If t ≠ T, the process proceeds to step S110, increments the count as t = t + 1, and returns to step S104 to repeat the processing.

In step S112, the set of factor matrices A = {A⁽¹⁾, A⁽²⁾, ..., A^(N)} as finally updated in step S104 is output, and the processing ends.
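The control flow of steps S100 to S112 can be sketched as a double loop over iterations and modes. Because equation (1) is not available in this text, the sketch below swaps in a different, well-known update for step S104: the multiplicative update for masked generalized-KL factorization in the style of Lee and Seung (Non-Patent Document 5). That stand-in ignores the adjacency graph W⁽ⁿ⁾, so it corresponds to plain NTC rather than the proposed method; all function names are hypothetical.

```python
import numpy as np

def unfold(T, n):
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order="F")

def khatri_rao(mats):
    K = mats[0].shape[1]
    out = mats[0]
    for B in mats[1:]:
        out = (out[:, None, :] * B[None, :, :]).reshape(-1, K)
    return out

def masked_gkl(X, M, factors, eps=1e-12):
    """Masked generalized KL cost between X and the CP estimate."""
    B = khatri_rao(factors[:0:-1])  # A^(N), ..., A^(2)
    X0, M0, Xh = unfold(X, 0), unfold(M, 0), factors[0] @ B.T
    return float(np.sum(M0 * (X0 * np.log((X0 + eps) / (Xh + eps)) - X0 + Xh)))

def fit_masked_kl_cp(X, M, factors, T, eps=1e-12):
    N = X.ndim
    factors = [A.copy() for A in factors]        # S100: initialize
    X_unf = [unfold(X, n) for n in range(N)]     # S102: unfold X and M
    M_unf = [unfold(M, n) for n in range(N)]
    for t in range(T):                           # S108/S110: repeat T times
        for n in range(N):                       # S104/S106: visit every mode
            B = khatri_rao([factors[m] for m in reversed(range(N)) if m != n])
            X_hat = factors[n] @ B.T
            ratio = M_unf[n] * X_unf[n] / (X_hat + eps)
            # Stand-in update (NOT equation (1)): masked gKL multiplicative rule
            factors[n] *= (ratio @ B) / (M_unf[n] @ B + eps)
    return factors                               # S112: output

rng = np.random.default_rng(0)
true = [rng.random((I, 2)) + 0.1 for I in (4, 5, 6)]
X = np.einsum("ik,jk,lk->ijl", *true)
M = (rng.random(X.shape) > 0.3).astype(float)
init = [rng.random((I, 2)) + 0.1 for I in (4, 5, 6)]
fitted = fit_masked_kl_cp(X, M, init, T=30)
```

The multiplicative form keeps every factor non-negative as long as the initial values are non-negative, which is one reason it is a common baseline for this family of models.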

<Explanation of Effects in Experimental Example>
In the experimental example, the effectiveness of the proposed method was confirmed through a missing-value estimation experiment on traffic flow data measured at 419 points in the city of Aarhus, Denmark (CityPulse (http://www.ict-citypulse.eu/)).

The observation period of the data is the 61 days from August 1 to September 30, 2014, during which there are no public holidays in the Danish calendar. Since the data is observed every 30 minutes, each sensor has at most 2,928 observed values if the data is complete. The observed values were examined for each sensor, and a day whose 24 hours of data were not completely observed was treated as missing. From the traffic flow data, a third-order tensor X was created with the modes observation point (441 points), time of day (30-minute intervals over 24 hours), and date (61 days).

For the adjacency matrix of the observation-point mode, w_{i,i'}⁽¹⁾ was set to 1 if the distance between the two observation points was within 100 meters and the difference in the angle from the observation start point to the end point was 45 degrees or less, and to 0 otherwise. For the adjacency matrix of the time-of-day mode, w_{i,i'}⁽²⁾ was set to 1 if the two times were adjacent, and to 0 otherwise. For the adjacency matrix of the date mode, w_{i,i'}⁽³⁾ was set to 1 if the two dates were adjacent, and to 0 otherwise.
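The three adjacency rules above might be coded as follows. This is a sketch under assumptions: the time-of-day chain does not wrap around midnight, the diagonal is set to 0, angle differences are taken modulo 360 degrees, and the inputs `dist_m` (pairwise distances in meters) and `heading_deg` (segment directions in degrees) are hypothetical stand-ins for the sensor metadata.

```python
import numpy as np

def chain_adjacency(size):
    """1 for neighbouring indices (i, i+1), 0 otherwise; no wrap-around (assumption)."""
    W = np.zeros((size, size))
    idx = np.arange(size - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return W

def sensor_adjacency(dist_m, heading_deg, max_dist=100.0, max_angle=45.0):
    """1 if two sensors are within max_dist meters and max_angle degrees of each other."""
    diff = np.abs(heading_deg[:, None] - heading_deg[None, :]) % 360.0
    diff = np.minimum(diff, 360.0 - diff)
    W = ((dist_m <= max_dist) & (diff <= max_angle)).astype(float)
    np.fill_diagonal(W, 0.0)  # no self-loops (assumption)
    return W

W_time = chain_adjacency(48)  # 30-minute slots over 24 hours
W_date = chain_adjacency(61)  # 61 days
dist_m = np.array([[0.0, 50.0, 500.0], [50.0, 0.0, 80.0], [500.0, 80.0, 0.0]])
heading_deg = np.array([10.0, 30.0, 200.0])
W_sensor = sensor_adjacency(dist_m, heading_deg)
```

In the toy sensor example only the first two sensors are adjacent: the third is close enough to the second but points in a very different direction.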

To examine how the accuracy of missing-value estimation changes with the proportion of missing data, observed values were randomly removed in every setting, in addition to the values actually missing from the traffic flow data. Missing rates of 10%, 50%, 70%, and 90% were used. Five trials were performed for each setting; in each trial observed values were removed at random, and 90% of the remaining observed values were used as training data and 10% as test data. The number of factors was selected from 5, 10, and 20 by 5-fold cross validation.
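The sampling protocol described above (remove a share of the observed values at random, then split the remainder 90/10) might look like the sketch below. Interpreting the missing rate as a fraction of the observed elements, as well as the function name `drop_and_split` and the fixed seed, are assumptions.

```python
import numpy as np

def drop_and_split(M, missing_rate, test_frac=0.1, seed=0):
    """Return (M_train, M_test) masks built from the observed entries of M.

    A fraction `missing_rate` of the observed entries is discarded at random;
    the remainder is split into training (1 - test_frac) and test (test_frac).
    """
    rng = np.random.default_rng(seed)
    obs = rng.permutation(np.flatnonzero(M.ravel() == 1))
    n_drop = int(round(missing_rate * obs.size))
    kept = obs[n_drop:]
    n_test = int(round(test_frac * kept.size))
    M_train = np.zeros(M.size)
    M_test = np.zeros(M.size)
    M_test[kept[:n_test]] = 1.0
    M_train[kept[n_test:]] = 1.0
    return M_train.reshape(M.shape), M_test.reshape(M.shape)

M = np.ones((10, 10))
M_train, M_test = drop_and_split(M, missing_rate=0.5)
```

Training then uses `M_train` as the mask tensor, and the estimates are scored only on the elements selected by `M_test`.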

The results are shown in FIG. 4. In the table of FIG. 4, Prop. is the result of the proposed method. The following methods were used for comparison: non-negative tensor completion using the generalized KL divergence (NTC (gKL)); non-negative tensor completion using the Euclidean distance (NTC (Eud)); non-negative matrix completion using the generalized KL divergence, applied to the matrix obtained by unfolding the tensor along the sensor mode (NMC (gKL)); non-negative tensor factorization that treats missing values as 0 (Non-negative Tensor Factorization with gKL: NTF); and the mean of the traffic flow data used as the estimate (Mean).

When the missing rate was 10%, the proposed method and NMC showed the best generalization performance among all methods on every metric. When the missing rate was 30, 50, 70, or 90%, the generalization performance of NMC deteriorated and that of the proposed method was the best. These results confirm the effect of the proposed method.

（実施の形態のまとめ）
以上、説明したように、本実施の形態によれば、非負の要素からなるN階のテンソルX、前記テンソルXにおける各要素が欠損値であるか又は観測値であるかを表すマスクテンソルM、及び、モード毎の特徴量の隣接関係を表現する隣接グラフ行列W⁽ⁿ⁾とを受け付けるデータ入力部と、前記データ入力部により受け付けた前記テンソルX、前記マスクテンソルM、及び前記隣接グラフ行列W⁽ⁿ⁾とに基づいて、前記テンソルXの推定値X^と前記テンソルXとの間の一般化KLダイバージェンスを用いたコスト関数の値と、隣接グラフ行列W⁽ⁿ⁾を用いた因子行列A⁽ⁿ⁾内の特徴量間の誤差に関する罰則項の値との和を最小化するように、モードnの各々についての前記因子行列A⁽ⁿ⁾を更新することを繰り返すパラメータ更新部と、前記パラメータ更新部により更新された前記モードnの各々についての前記因子行列A⁽ⁿ⁾を出力するパラメータ出力部と、を有する解析装置が提供される。 (Summary of embodiment)
As described above, according to the present embodiment, the N-order tensor X composed of non-negative elements, the mask tensor M representing whether each element in the tensor X is a missing value or an observed value, And a data input unit that receives an adjacency graph matrix W ⁽ⁿ⁾ that expresses the adjacency relationship of feature quantities for each mode, the tensor X received by the data input unit, the mask tensor M , and the adjacency graph matrix W based on the ^(n), and the value of the cost function using generalized KL divergence between the estimated value X ^ and the tensor X of the tensor X, factor matrix using the adjacency graph matrix W ⁽ⁿ⁾ a ^a parameter updating unit that repeats updating the factor matrix A ⁽ⁿ⁾ for each of the modes n so as to minimize the sum of the penalties term value regarding the error between the feature quantities in ⁽ⁿ⁾ ; Updated by parameter update unit A parameter output portion for outputting the factor matrix A ⁽ⁿ⁾ for each of the modes n, analyzer with is provided with.

The penalty term is, for example, a term consisting of the sum of the products of the elements of the adjacency graph matrix W⁽ⁿ⁾ and the generalized KL divergences between elements of the factor vectors of the factor matrix A⁽ⁿ⁾.
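One way to read this penalty term is sketched below for a single mode. The sketch assumes that the divergence is taken between the entries of two features i and j within the same factor vector (column r), in the direction d(a[i, r] || a[j, r]), and that any weighting coefficient on the penalty is applied elsewhere. These choices and the names `gkl` and `graph_penalty` are assumptions made for illustration.

```python
import numpy as np

def gkl(p, q, eps=1e-12):
    """Elementwise generalized KL divergence d(p || q) = p * log(p / q) - p + q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # p * log(p / q) is treated as 0 where p == 0.
    return np.where(p > 0, p * np.log((p + eps) / (q + eps)), 0.0) - p + q

def graph_penalty(a, w):
    """Sum over feature pairs (i, j) and factor vectors r of w[i, j] * d(a[i, r] || a[j, r]).

    a: factor matrix of one mode, shape (I, R), non-negative.
    w: adjacency graph matrix of that mode, shape (I, I).
    """
    # pairwise[i, j, r] = d(a[i, r] || a[j, r]) via broadcasting.
    pairwise = gkl(a[:, None, :], a[None, :, :])
    return float(np.einsum("ij,ijr->", w, pairwise))

# Usage: a chain graph 0 - 1 - 2; identical rows incur no penalty.
w = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
smooth = graph_penalty(np.ones((3, 2)), w)
rough = graph_penalty(np.array([[1.0, 1.0], [5.0, 1.0], [1.0, 1.0]]), w)
```

Under these assumptions the penalty is zero when adjacent features have identical factor values and grows as they diverge, which is what encourages adjacent features to share similar patterns.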

The parameter updating unit updates the factor matrix A⁽ⁿ⁾ one factor vector at a time, for example, by an expression that uses the graph Laplacian of the adjacency graph matrix W⁽ⁿ⁾.
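For reference, a graph Laplacian can be built from an adjacency graph matrix as sketched below. The sketch assumes the common unnormalized form L = D - W, where D is the diagonal degree matrix; this passage does not state which form the embodiment uses, and the update expression itself is not reproduced here. The function name `graph_laplacian` is hypothetical.

```python
import numpy as np

def graph_laplacian(w):
    """Unnormalized graph Laplacian L = D - W (assumed form), with D the diagonal degree matrix."""
    w = np.asarray(w, dtype=float)
    degree = np.diag(w.sum(axis=1))  # D[i, i] = sum of edge weights at node i
    return degree - w

# Usage: chain graph 0 - 1 - 2.
w = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
lap = graph_laplacian(w)
```

Each row of this Laplacian sums to zero, and it is symmetric whenever W is symmetric.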

Although the present embodiment has been described above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

Description of symbols
10 Data input unit
20 Computation unit
26 Unfold calculation unit
30 Parameter updating unit
32 Factor matrix update unit
36 Repetition determination unit
38 Parameter output unit
100 Analysis apparatus

Claims

An analysis apparatus comprising:
a data input unit that receives an N-th order tensor X composed of non-negative elements, a mask tensor M indicating whether each element of the tensor X is a missing value or an observed value, and an adjacency graph matrix W⁽ⁿ⁾ representing the adjacency relations between feature quantities for each mode;
a parameter updating unit that, based on the tensor X, the mask tensor M, and the adjacency graph matrix W⁽ⁿ⁾ received by the data input unit, repeatedly updates the factor matrix A⁽ⁿ⁾ for each mode n so as to minimize the sum of the value of a cost function using the generalized KL divergence between the tensor X and its estimate X^ and the value of a penalty term, defined using the adjacency graph matrix W⁽ⁿ⁾, on the error between feature quantities in the factor matrix A⁽ⁿ⁾; and
a parameter output unit that outputs the factor matrix A⁽ⁿ⁾ for each mode n as updated by the parameter updating unit.

The analysis apparatus, wherein the penalty term is a term consisting of the sum of the products of the elements of the adjacency graph matrix W⁽ⁿ⁾ and the generalized KL divergences between elements of the factor vectors of the factor matrix A⁽ⁿ⁾.

The analysis apparatus according to claim 1, wherein the parameter updating unit updates the factor matrix A⁽ⁿ⁾ one factor vector at a time by an expression that uses the graph Laplacian of the adjacency graph matrix W⁽ⁿ⁾.

An analysis method in an analysis apparatus including a data input unit, a parameter updating unit, and a parameter output unit, wherein:
the data input unit receives an N-th order tensor X composed of non-negative elements, a mask tensor M indicating whether each element of the tensor X is a missing value or an observed value, and an adjacency graph matrix W⁽ⁿ⁾ representing the adjacency relations between feature quantities for each mode;
the parameter updating unit, based on the tensor X, the mask tensor M, and the adjacency graph matrix W⁽ⁿ⁾ received by the data input unit, repeatedly updates the factor matrix A⁽ⁿ⁾ for each mode n so as to minimize the sum of the value of a cost function using the generalized KL divergence between the tensor X and its estimate X^ and the value of a penalty term, defined using the adjacency graph matrix W⁽ⁿ⁾, on the error between feature quantities in the factor matrix A⁽ⁿ⁾; and
the parameter output unit outputs the factor matrix A⁽ⁿ⁾ for each mode n as updated by the parameter updating unit.

The analysis method, wherein the penalty term is a term consisting of the sum of the products of the elements of the adjacency graph matrix W⁽ⁿ⁾ and the generalized KL divergences between elements of the factor vectors of the factor matrix A⁽ⁿ⁾.

The analysis method according to claim 4 or 5, wherein the parameter updating unit updates the factor matrix A⁽ⁿ⁾ one factor vector at a time by an expression that uses the graph Laplacian of the adjacency graph matrix W⁽ⁿ⁾.

A program for causing a computer to function as each unit of the analysis apparatus according to any one of claims 1 to 3.