JPWO2015125452A1

JPWO2015125452A1 - Data management device, data analysis device, data analysis system, and analysis method

Info

Publication number: JPWO2015125452A1
Application number: JP2016503968A
Authority: JP
Inventors: 和世成田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2014-02-18
Filing date: 2015-02-16
Publication date: 2017-03-30
Anticipated expiration: 2035-02-16
Also published as: US20170053212A1; JP6504155B2; WO2015125452A1

Abstract

The CD (Coordinate Descent) method can be used even when the size of the training data exceeds the memory size of the computer. A data management device (101) of the present invention includes: a blocking unit (20) that divides training data representing matrix data into a plurality of blocks and generates metadata indicating which columns of the original training data each block holds; and a reblocking unit (40) that, when some components of the parameter learned from the training data converge to 0, replaces each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerates the metadata.

Description

The present invention relates to a data management device, a data analysis device, a data analysis system, and an analysis method for solving an optimization problem using an optimization algorithm.

Machine learning is used in fields such as data analysis and data mining. Many machine learning methods, such as logistic regression and the SVM (Support Vector Machine), define an objective function when learning parameters from training data (also called, for example, a design matrix or features). The optimal parameters are then learned by optimizing this objective function. Such parameters may have so many dimensions that a person cannot analyze them by eye. For this reason, a technique called sparse learning (sparse regularized learning, lasso) is used. Here, "lasso" stands for "least absolute shrinkage and selection operator". To make the learning result easier to analyze, sparse learning trains the parameter so that its value is 0 in most dimensions. In the sparse learning framework, many components of the parameter converge to 0 during learning. Components that have converged to 0 are ignored as having no meaning for the analysis.
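
As an illustration of the sparse learning described above, the lasso for a linear model is commonly written as the objective below. This is the standard textbook form and is not taken from this publication; the symbols X (training data matrix with M columns), y (target vector), and λ (regularization weight) are assumptions introduced here.

```latex
\min_{w \in \mathbb{R}^{M}} \; \frac{1}{2} \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1,
\qquad \lVert w \rVert_1 = \sum_{j=1}^{M} \lvert w_j \rvert
```

The l1 penalty term is what pushes many components w_j to exactly 0 as λ grows, which matches the behavior the paragraph above describes.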

Performing machine learning efficiently is inseparable from solving the optimization problem efficiently. The action recognition apparatus described in Patent Document 1 uses the Coordinate Descent method (hereinafter, the CD method) to match motion features, computing the local minimum D_{R,C}(X, Y) with respect to a rotation matrix R and a correspondence matrix C. The CD method is one technique for solving optimization problems and belongs to the class of algorithms called descent methods.

Here, the operation of the CD method, which is one of the optimization techniques known as gradient methods, is described with reference to FIG. 15. FIG. 15 conceptually shows how the CD method moves in a two-dimensional space. In the example of FIG. 15, the parameter w is a two-dimensional vector whose elements are the components w1 and w2. The ellipses are contour lines, each connecting the combinations of w1 and w2 for which the objective function f(w) takes the same value. The star marks the point at which the value of f(w) is minimum or maximum, that is, the objective solution w*. Given the objective function f(w), the CD method searches for the point (objective solution) w* at which f(w) is minimum or maximum by moving along each coordinate axis (each dimension) of the space of f(w). Specifically, after a starting point for the search ("start" in FIG. 15) is chosen at random, the following processing is repeated. A coordinate axis (dimension) j is selected, the movement direction d and the movement width (step width) α of the search point are determined from the training data, and the component wj of dimension j is updated to wj + α·d (hereinafter written as Δ). In the next round of processing, another coordinate axis (dimension) is selected. This processing is repeated for all coordinate axes (dimensions) in turn until the search point is sufficiently close to the objective solution w*.
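
The axis-by-axis search described above can be sketched as follows. This is a minimal illustration, not the implementation of this publication: it assumes a convex quadratic objective f(w) = 0.5 * w^T A w - b^T w, for which the best step along one axis has a closed form, and it uses a fixed starting point instead of a random one so that the run is reproducible. The names `coordinate_descent`, `A`, and `b` are hypothetical.

```python
def coordinate_descent(A, b, w0, tol=1e-12, max_sweeps=1000):
    """Minimize f(w) = 0.5 * w^T A w - b^T w by cyclic coordinate descent.

    A is assumed symmetric positive definite and is given as a list of rows.
    """
    w = list(w0)
    n = len(w)
    for _ in range(max_sweeps):
        largest_step = 0.0
        for j in range(n):  # select coordinate axis (dimension) j
            # Partial derivative of f with respect to w_j at the current point.
            grad_j = sum(A[j][k] * w[k] for k in range(n)) - b[j]
            # Exact minimizer along axis j; delta plays the role of alpha * d.
            delta = -grad_j / A[j][j]
            w[j] += delta
            largest_step = max(largest_step, abs(delta))
        if largest_step < tol:  # the search point has stopped moving
            break
    return w


A = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, 1.0]
w_star = coordinate_descent(A, b, w0=[5.0, -3.0])
# w_star is approximately [2/7, 6/7], the solution of A w = b.
```

Each inner step touches only row j of A, which is why no matrix inversion or factorization is needed.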

As described above, given the objective function f(w), the CD method searches along each coordinate axis of the space of f(w) for the objective solution w* at which f(w) is minimum or maximum. Processing stops once the search point is sufficiently close to the objective solution w*.

Moreover, unlike Newton's method and similar techniques, the CD method does not require costly matrix operations when computing updates to the parameter w, so its computation is inexpensive. Because the algorithm is simple, it is also relatively easy to implement. For these reasons, many of the major machine learning methods, such as regression and the SVM, are implemented on the basis of the CD method.

Patent Document 1: JP 2006-340903 A

However, the action recognition apparatus using the CD method described in Patent Document 1 has a problem: when the training data is larger than the memory of the computer, the training data cannot all be read into memory, and so the CD method cannot be applied.

In view of the above problem, an object of the present invention is to provide a data management device, a data analysis device, a data analysis system, and a data analysis method that can use the CD method even when the size of the training data exceeds the memory size of the computer.

A data management device according to one aspect of the present invention includes: blocking means for dividing training data representing matrix data into a plurality of blocks and generating metadata indicating which columns of the original training data each block holds; and reblocking means for, when some components of a parameter learned from the training data converge to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata.

A data analysis device according to one aspect of the present invention includes: queue management means for reading a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, and storing the block in a queue; iterative calculation means for reading the predetermined block stored in the queue and performing the iterative calculation of the CD method; and flag management means for, when some components of a parameter converge to 0 during one iteration of the calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that converged to 0 can be removed.

A data analysis system according to one aspect of the present invention includes a data management device and a data analysis device. The data management device includes: blocking means for dividing training data representing matrix data into a plurality of blocks and generating metadata indicating which columns of the original training data each block holds; and reblocking means for, when some components of a parameter learned from the training data converge to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata. The data analysis device includes: queue management means for reading a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, and storing the block in a queue; iterative calculation means for reading the predetermined block stored in the queue and performing the iterative calculation of the CD method; and flag management means for, when some components of the parameter converge to 0 during one iteration of the calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that converged to 0 can be removed.

A first computer-readable recording medium according to one aspect of the present invention stores a program that causes a computer to execute: a process of dividing training data representing matrix data into a plurality of blocks and generating metadata indicating which columns of the original training data each block holds; and a process of, when some components of a parameter learned from the training data converge to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata.

A second computer-readable recording medium according to one aspect of the present invention stores a program that causes a computer to execute: a process of reading a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, and storing the block in a queue; a process of reading the predetermined block stored in the queue and performing the iterative calculation of the CD method; and a process of, when some components of a parameter converge to 0 during one iteration of the calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that converged to 0 can be removed.

A data management method according to one aspect of the present invention divides training data representing matrix data into a plurality of blocks, generates metadata indicating which columns of the original training data each block holds, and, when some components of a parameter learned from the training data converge to 0, replaces each old block that includes an unnecessary column with a block from which the unnecessary column has been removed and regenerates the metadata.

A data analysis method according to one aspect of the present invention reads a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, stores the block in a queue, reads the predetermined block stored in the queue, performs the iterative calculation of the CD method, and, when some components of a parameter converge to 0 during one iteration of the calculation, transmits a flag indicating that the columns of the training data corresponding to the components that converged to 0 can be removed.

An analysis method according to one aspect of the present invention divides training data representing matrix data into a plurality of blocks; generates metadata indicating which columns of the original training data each block holds; when some components of a parameter learned from the training data converge to 0, replaces each old block that includes an unnecessary column with a block from which the unnecessary column has been removed and regenerates the metadata; reads a predetermined block out of the plurality of blocks obtained by dividing the training data and stores it in a queue; reads the predetermined block stored in the queue and performs the iterative calculation of the CD method; and, when some components of the parameter converge to 0 during one iteration of the calculation, transmits a flag indicating that the columns of the training data corresponding to the components that converged to 0 can be removed.

The effect of the present invention is that the CD method can be used even when the size of the training data exceeds the memory size of the computer.

FIG. 1 is a block diagram showing the configuration of the data management device 101 in the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the data management device 101 in the first embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of the data analysis device 102 in the second embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of the data analysis device 102 in the second embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the data analysis system 103 in the third embodiment of the present invention.
FIG. 6 is a block diagram showing an example of a computer that implements the configuration of the data analysis system 103 in the third embodiment of the present invention.
FIG. 7 is a diagram showing an example of training data and its division into blocks in the third embodiment of the present invention.
FIG. 8 is a diagram showing an example of metadata in the third embodiment of the present invention.
FIG. 9 is a flowchart showing the blocking operation in the third embodiment of the present invention.
FIG. 10 is a flowchart showing the queue management operation in the third embodiment of the present invention.
FIG. 11 is a flowchart showing the iterative calculation operation in the third embodiment of the present invention.
FIG. 12 is a flowchart showing the flag management operation in the third embodiment of the present invention.
FIG. 13 is a flowchart showing the reblocking operation in the third embodiment of the present invention.
FIG. 14 is a diagram showing an example of new blocks and metadata generated by reblocking in the third embodiment of the present invention.
FIG. 15 is a diagram showing an example of the operation of the Coordinate Descent method.

<Embodiment 1>
An embodiment of the present invention is described in detail with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the data management device 101 in the first embodiment of the present invention.

The data management device 101 in the first embodiment of the present invention is described with reference to FIG. 1. The reference numerals in FIG. 1 are attached to the elements for convenience, as an aid to understanding, and are not intended to limit the present invention in any way.

As shown in FIG. 1, the data management device 101 in the first embodiment of the present invention includes a blocking unit 20 and a reblocking unit 40. The blocking unit 20 divides the given training data, which is represented as matrix data (for example, a matrix of N rows and M columns, where N and M are integers), into a plurality of blocks, and generates metadata, which is information indicating which rows and columns of the original training data each block holds. The reblocking unit 40 monitors the parameter learned from the training data. The parameter is a component learned from the training data and corresponds, for example, to a vector component of the objective function defined for the CD method. When the learning process on the training data causes some component of the parameter (for example, the component wj of dimension j, which corresponds to column j of the training data) to converge to 0, the reblocking unit 40 replaces each old block that includes an unnecessary column with a block from which the unnecessary column has been removed. Here, an unnecessary column is, for example, a column corresponding to an axis that has converged to 0. A block from which unnecessary columns have been removed is also called an updated block. The reblocking unit 40 then regenerates the metadata described above (the information indicating which rows and columns of the original training data each block holds) for the updated blocks.
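
The blocking step described above can be sketched as follows. This is a minimal illustration under assumptions that the text does not fix: blocks are rectangular tiles of a fixed maximum size, and the metadata is a dictionary of row and column indices per block. The names `make_blocks`, `block_rows`, and `block_cols` are hypothetical.

```python
def make_blocks(matrix, block_rows, block_cols):
    """Split a row-major matrix into tiles and record what each tile holds."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    blocks, metadata = {}, {}
    block_id = 0
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            rows = list(range(r0, min(r0 + block_rows, n_rows)))
            cols = list(range(c0, min(c0 + block_cols, n_cols)))
            # The block stores only the values; the metadata remembers
            # which rows and columns of the original matrix they came from.
            blocks[block_id] = [[matrix[r][c] for c in cols] for r in rows]
            metadata[block_id] = {"rows": rows, "cols": cols}
            block_id += 1
    return blocks, metadata


training_data = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
blocks, metadata = make_blocks(training_data, block_rows=2, block_cols=2)
# blocks[1] == [[3, 4], [7, 8]] and metadata[1] == {"rows": [0, 1], "cols": [2, 3]}
```

Keeping the original column indices in the metadata is what later lets a column be dropped from a block without losing track of which parameter component each remaining column belongs to.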

Next, the operation of the data management device 101 in the first embodiment of the present invention is described with reference to FIG. 2.

FIG. 2 is a flowchart showing the operation of the data management device 101 in the first embodiment of the present invention. The flowchart in FIG. 2 and the following description are one example of the processing; depending on the processing required, the order of the steps may be changed, and steps may be revisited or repeated.

As shown in FIG. 2, the blocking unit 20 divides the given training data representing matrix data into a plurality of blocks and generates metadata, which is information indicating which rows and columns of the original training data each block holds (step S101). When some components of the parameter learned from the training data converge to 0, the reblocking unit 40 replaces each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerates the metadata (step S102).

The data management device 101 in the first embodiment of the present invention makes the CD method usable even when the size of the training data exceeds the memory size of the data management device or the computer. The reason is that dividing the training data into blocks makes each unit of data smaller, so that even when the training data as a whole is larger than the memory, the CD-method processing can be performed block by block, in units that the data management device or the computer can handle.

<Embodiment 2>
The configuration of the data analysis device 102 in the second embodiment of the present invention is described with reference to the drawings. FIG. 3 is a block diagram showing the configuration of the data analysis device 102 in the second embodiment of the present invention.

As shown in FIG. 3, the data analysis device 102 in the second embodiment of the present invention includes a queue management unit 90, an iterative calculation unit 110, and a flag management unit 100.

The queue management unit 90 reads a predetermined block out of a plurality of blocks obtained by dividing the training data, which is represented as matrix data, and stores the block in a queue. The iterative calculation unit 110 reads the predetermined block stored in the queue and performs the iterative calculation of the CD method (corresponding to the learning in the first embodiment). When some components of the parameter converge to 0 during one iteration of the calculation, the flag management unit 100 transmits a flag indicating that the columns of the training data corresponding to those components can be removed.

Next, the operation of the data analysis device 102 in the second embodiment of the present invention is described with reference to FIG. 4.

FIG. 4 is a flowchart showing the operation of the data analysis device 102 in the second embodiment of the present invention. As shown in FIG. 4, the queue management unit 90 reads a predetermined block out of a plurality of blocks obtained by dividing the given training data representing matrix data, and stores the block in the queue (step S201). The iterative calculation unit 110 reads the predetermined block stored in the queue and performs the iterative calculation of the CD method (step S202). When some components of the parameter converge to 0 during one iteration of the calculation, the flag management unit 100 transmits a flag indicating that the columns of the training data corresponding to those components can be removed (step S203).

The data analysis device 102 in the second embodiment of the present invention makes the CD method usable even when the size of the training data exceeds the memory size of the computer. This is because dividing the training data into blocks makes each unit of data smaller, so that even when the training data as a whole is larger than the memory, the CD-method processing can be performed block by block.

<Embodiment 3>
First, the problems to be solved by this embodiment of the present invention are made clear.

The action recognition apparatus using the CD method described in Patent Document 1 has the problem (first problem) that, when the training data is larger than the memory of the computer, the training data cannot all be read into memory and so the CD method cannot be applied. With recent advances in information technology, huge training data exceeding a machine's memory size has become easy to obtain, so cases in which CD-method processing cannot be executed because the training data does not fit in memory are increasing.

Furthermore, the action recognition apparatus using the CD method described in Patent Document 1 has the problem (second problem) that processing takes a long time, because the CD method repeats its calculation many times. In the CD method, a single update must refer to every row of the training data. In particular, when the first problem applies, an out-of-core approach becomes essential as a workaround: as much training data as fits is read into memory and processed, and then the next portion is read. Data is then read so frequently that the processing time increases even further.
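
To make the cost of the out-of-core workaround concrete, the sketch below computes the quantity that a single coordinate update typically needs, the inner product of column j with a residual vector, while holding only one row block in memory at a time. It is a hypothetical illustration: `load_block` stands in for a disk read, and the row-block layout and the use of a residual vector are assumptions, not details given in this publication.

```python
def column_inner_product_out_of_core(load_block, n_blocks, j, residual):
    """Accumulate sum_i X[i][j] * residual[i] one row block at a time.

    Returns the inner product and the number of block loads it required.
    """
    total, offset, loads = 0.0, 0, 0
    for block_id in range(n_blocks):
        block = load_block(block_id)  # stands in for reading a block from storage
        loads += 1
        for i, row in enumerate(block):
            total += row[j] * residual[offset + i]
        offset += len(block)  # rows consumed so far
    return total, loads


# Three rows of training data stored as two row blocks.
row_blocks = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
]
residual = [1.0, -1.0, 2.0]
value, loads = column_inner_product_out_of_core(
    lambda block_id: row_blocks[block_id], len(row_blocks), 0, residual
)
# value == 8.0 and loads == 2: one update of one coordinate reads every block.
```

Because every coordinate update repeats this full pass, the number of reads grows with both the number of coordinates and the number of sweeps, which is the overhead the second problem refers to.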

The data analysis system 103 in the third embodiment of the present invention solves the first and second problems described above. The configuration and operation of the data analysis system 103 in the third embodiment are described below.

First, the configuration of the data analysis system 103 in the third embodiment of the present invention is described with reference to the drawings. FIG. 5 is a block diagram showing the configuration of the data analysis system 103 in the third embodiment of the present invention.

The data analysis system 103 in the third embodiment of the present invention includes a data management device 1, a data analysis device 6, and a training data storage unit 12. The data management device 1, the data analysis device 6, and the training data storage unit 12 are communicably connected via a network 13, a bus, or the like. The training data storage unit 12 stores the training data. The training data storage unit 12 may also store the training data as, for example, a storage device located outside the data analysis system 103. In that case, the data analysis system 103 and the storage device are communicably connected via the network 13 or the like.

The data management device 1 includes a blocking unit 2, a metadata storage unit 3, a reblocking unit 4, and a block storage unit 5. The blocking unit 2 and the reblocking unit 4 have the same configurations and functions as the blocking unit 20 and the reblocking unit 40 included in the data management device 101 in the first embodiment of the present invention described above.

The blocking unit 2 reads the (given) training data stored in the training data storage unit 12 and divides it into a plurality of blocks. The blocking unit 2 then stores the data of each resulting block in the block storage unit 5. The blocking unit 2 also generates metadata indicating which rows and columns of the original training data each block holds, and stores the metadata in the metadata storage unit 3.

The block storage unit 5 stores the data of each block of the divided training data. The metadata storage unit 3 stores the metadata generated by the blocking unit 2.

When some components of the parameter learned from the training data converge to 0, the reblocking unit 4 replaces each old block that includes an unnecessary column with a block from which that column has been removed, and regenerates the metadata described above for the replacement blocks.
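
The reblocking step described above can be sketched as follows. This is a minimal illustration, assuming that blocks and metadata are plain dictionaries keyed by a block ID and that the metadata lists the original row and column indices held by each block. The name `reblock` and the data layout are hypothetical.

```python
def reblock(blocks, metadata, unnecessary_cols):
    """Return new blocks and metadata with the unnecessary columns removed."""
    drop = set(unnecessary_cols)
    new_blocks, new_metadata = {}, {}
    for block_id, block in blocks.items():
        cols = metadata[block_id]["cols"]
        # Positions inside the block whose original column is still needed.
        keep = [pos for pos, col in enumerate(cols) if col not in drop]
        new_blocks[block_id] = [[row[pos] for pos in keep] for row in block]
        new_metadata[block_id] = {
            "rows": list(metadata[block_id]["rows"]),
            "cols": [cols[pos] for pos in keep],
        }
    return new_blocks, new_metadata


blocks = {0: [[1, 2, 3], [4, 5, 6]]}
metadata = {0: {"rows": [0, 1], "cols": [0, 1, 2]}}
# Suppose the component for original column 1 has converged to 0.
new_blocks, new_metadata = reblock(blocks, metadata, unnecessary_cols=[1])
# new_blocks[0] == [[1, 3], [4, 6]] and new_metadata[0]["cols"] == [0, 2]
```

The sketch builds new dictionaries instead of editing in place, so the old blocks stay readable until the replacement is complete; whether the actual device swaps blocks this way is not stated in the text.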

The data analysis device 6 includes a parameter storage unit 7, a queue 8, a queue management unit 9, a flag management unit 10, and an iterative calculation unit 11. The queue management unit 9, the iterative calculation unit 11, and the flag management unit 10 have the same configurations and functions as the queue management unit 90, the iterative calculation unit 110, and the flag management unit 100 included in the data analysis device 102 in the second embodiment of the present invention described above.

The parameter storage unit 7 stores variables to be updated, such as the parameter. The queue 8 stores blocks.

The iterative calculation unit 11 reads from the queue 8 the blocks or representative values needed for the column it is processing, and performs the update calculation. Specifically, it reads a predetermined block stored in the queue 8 and carries out the iterative calculation of the CD method. After each iteration, the iterative calculation unit 11 determines whether each component of the parameter has converged to 0. If a component wj has converged to 0, the iterative calculation unit 11 calls the flag management unit 10 and reports that wj has converged to 0.

The queue management unit 9 discards blocks that are no longer needed from the queue 8 and acquires (for example, fetches) newly needed blocks from the block storage unit 5. The flag management unit 10 receives from the iterative calculation unit 11 the information that the component wj has converged to 0, and outputs the unnecessary column to the data management device 1.

With reference to FIG. 6, a computer that implements the data management device 1 and the data analysis device 6 included in the data analysis system 103 according to the third embodiment of the present invention will be described.

FIG. 6 is a diagram of a typical hardware configuration of the data management device 1 and the data analysis device 6 included in the data analysis system 103 according to the third embodiment of the present invention. As shown in FIG. 6, the data management device 1 and the data analysis device 6 each include, for example, a CPU (Central Processing Unit) 21, a RAM (Random Access Memory) 22, and a storage device 23. Each also includes, for example, a communication interface 24, an input device 25, and an output device 26.

The blocking unit 2 and the reblocking unit 4 of the data management device 1, and the queue management unit 9, the flag management unit 10, and the iterative calculation unit 11 of the data analysis device 6, are implemented by the CPU 21 reading a program into the RAM 22 and executing it. The metadata storage unit 3 and the block storage unit 5 of the data management device 1, and the parameter storage unit 7 and the queue 8 of the data analysis device 6, are, for example, a hard disk or a flash memory.

The communication interface 24 is connected to the CPU 21 and to a network or an external storage medium. External data may be taken into the CPU 21 via the communication interface 24. The input device 25 is, for example, a keyboard, a mouse, or a touch panel. The output device 26 is, for example, a display. The hardware configuration shown in FIG. 6 is merely an example; each component of the data management device 1 and the data analysis device 6 may instead be implemented as an independent logic circuit.

Next, the operation of the data analysis system 103 according to the third embodiment of the present invention will be described with reference to FIGS. 7 to 14.

FIG. 9 is a flowchart showing the operation of the blocking unit 2 in the third embodiment of the present invention. First, the blocking unit 2 acquires the size of the queue 8 of the data analysis device 6 (step S301). Next, the blocking unit 2 divides the training data into blocks small enough to fit comfortably in the queue 8 (step S302). The training data may be divided, for example, in the row direction, in the column direction, or in both directions.

Next, the blocking unit 2 generates, as metadata, information on which values of the training data each block holds (step S303). The blocking unit 2 then stores the data of each block in the block storage unit 5 and stores the generated metadata in the metadata storage unit 3 (step S304).
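Steps S302 and S303 can be illustrated with a minimal Python sketch. The function name `split_into_blocks`, the dictionary layout of a block, and the parameters `block_rows` and `block_cols` are hypothetical and are not taken from the specification; the sketch also assumes that blocks are numbered down each band of columns, so that column 1 of an 8-by-8 matrix split into four blocks falls in blocks 1 and 2.

```python
def split_into_blocks(matrix, block_rows, block_cols):
    """Split a row-major matrix into blocks and build column metadata.

    Returns (blocks, metadata). blocks maps a 1-based block id to the block's
    original row numbers, column numbers, and values. metadata maps each
    1-based original column number to the ids of the blocks that hold it.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    blocks, metadata = {}, {}
    block_id = 0
    # Assumed numbering: walk each band of columns from top to bottom.
    for c0 in range(0, n_cols, block_cols):
        cols = list(range(c0, min(c0 + block_cols, n_cols)))
        for r0 in range(0, n_rows, block_rows):
            rows = list(range(r0, min(r0 + block_rows, n_rows)))
            block_id += 1
            blocks[block_id] = {
                "rows": [r + 1 for r in rows],
                "cols": [c + 1 for c in cols],
                "values": [[matrix[r][c] for c in cols] for r in rows],
            }
            for c in cols:
                metadata.setdefault(c + 1, []).append(block_id)
    return blocks, metadata


# Illustrative 8-by-8 data (made-up values), split into four 4-by-4 blocks.
training = [[10 * r + c for c in range(8)] for r in range(8)]
blocks, metadata = split_into_blocks(training, 4, 4)
```

In an actual system the block size would be derived from the queue size obtained in step S301; here it is simply passed in.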

FIG. 10 is a flowchart showing the operation of the queue management unit 9 in the third embodiment of the present invention. First, the queue management unit 9 obtains from the iterative calculation unit 11 the sequence (j1, j2, ..., jk) of columns to be processed (step S401), where k is an integer of 1 or more. The columns in the sequence may be in descending or ascending order of column number, in random order, or in any other order. Next, the queue management unit 9 initializes a counter r to 1 (step S402). The counter r takes values from 1 to k. The queue management unit 9 refers to the metadata stored in the metadata storage unit 3 and identifies an unprocessed block, stored in the block storage unit 5, that contains column jr (step S403).

Next, if the queue 8 is full (YES in step S404), the queue management unit 9 waits, checking the queue 8 periodically, until space becomes available (step S405). When the queue 8 has space (NO in step S404), the queue management unit 9 reads the identified block from the block storage unit 5 and puts it in the queue 8 (step S406). If another unprocessed block containing column jr exists (YES in step S407), the above processing is repeated (return to step S403). If no other unprocessed block containing column jr exists (NO in step S407), the queue management unit 9 updates the counter r (step S408), for example by adding 1 to it. If the processing of the iterative calculation unit 11 has ended (YES in step S409), the processing of the queue management unit 9 ends. Otherwise (NO in step S409), the above processing is repeated until it ends (return to step S404).

FIG. 11 is a flowchart showing the operation of the iterative calculation unit 11 in the third embodiment of the present invention. First, the iterative calculation unit 11 determines the sequence (j1, j2, ...) of columns to be processed and transmits it to the queue management unit 9 (step S501). The iterative calculation unit 11 initializes a counter r to 1 (step S502) and initializes the update difference Δ to 0 (step S503). Next, the iterative calculation unit 11 acquires a block containing column jr from the queue 8 (step S504) and updates the update difference Δ while reading the block row by row (step S505). Here, for training data of N rows and M columns (N and M are natural numbers), the update difference Δ is calculated, for example, by summing the product xij × g(w) over rows 1 to N, where xij is the value in row i and column j (i is an integer from 1 to N, j is an integer from 1 to M) and g(w) is a function of w.

If the update has not been completed for all rows of column jr (NO in step S506), the iterative calculation unit 11 repeats steps S504 and S505 until every row of column jr has been processed (return to step S504).

When the update has been completed for all rows of column jr (YES in step S506), the iterative calculation unit 11 updates the jr-th component wjr of the parameter w of the objective function f(w) to wjr + Δ (step S507). If the update difference Δ of the parameter w is smaller than a predetermined value (hereinafter, "sufficiently small") (YES in step S508), the iterative calculation unit 11 ends its operation. The predetermined value may be any value indicating that the update difference Δ is sufficiently small, for example, 0.0001.

If the update difference Δ of the parameter w is larger than the predetermined value (NO in step S508), the iterative calculation unit 11 judges that there is still room for updating and determines whether the component wjr has converged to zero (step S509). If wjr has converged to zero (YES in step S509), the iterative calculation unit 11 notifies the flag management unit 10 that wjr has converged to zero (step S510). Next, the iterative calculation unit 11 updates the counter r to r + 1 (step S511) and repeats the above until the update difference Δ becomes sufficiently small (return to step S503).

If the component wjr has not converged to zero (NO in step S509), the iterative calculation unit 11 updates the counter r to r + 1 (step S511) and repeats the above until the update difference Δ becomes sufficiently small (return to step S503).
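The flow of FIG. 11 can be illustrated with a small Python sketch. Because the specification leaves the function g(w) unspecified, the sketch substitutes the standard coordinate descent update for the lasso objective (1/2)‖y − Xw‖² + λ‖w‖₁ with soft-thresholding; this is a swapped-in, well-known technique and not necessarily the update intended here. The names `lasso_cd`, `soft_threshold`, and `on_zero`, the all-ones initial value, and the stopping rule (largest |Δ| over one full sweep below `tol`) are assumptions, and the whole matrix is held in memory rather than read block by block.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_cd(X, y, lam, tol=1e-4, max_sweeps=1000, on_zero=None):
    """Cyclic coordinate descent for the lasso (stand-in for the unspecified g(w))."""
    n_rows, n_cols = len(X), len(X[0])
    w = [1.0] * n_cols  # assumed non-zero starting point
    residual = [y[i] - sum(X[i][j] * w[j] for j in range(n_cols))
                for i in range(n_rows)]
    reported = set()
    for _ in range(max_sweeps):
        max_delta = 0.0
        for j in range(n_cols):  # columns processed in ascending order
            column = [X[i][j] for i in range(n_rows)]
            norm_sq = sum(x * x for x in column)
            if norm_sq == 0.0:
                continue
            # Accumulate over the rows of column j (corresponds to step S505).
            rho = sum(x * r for x, r in zip(column, residual)) + norm_sq * w[j]
            new_value = soft_threshold(rho, lam) / norm_sq
            delta = new_value - w[j]
            residual = [r - x * delta for x, r in zip(column, residual)]
            w[j] = new_value  # corresponds to step S507
            # Report a component that has become exactly zero (steps S509, S510).
            if new_value == 0.0 and j not in reported:
                reported.add(j)
                if on_zero is not None:
                    on_zero(j + 1)  # 1-based column number
            max_delta = max(max_delta, abs(delta))
        if max_delta < tol:  # assumed stopping rule (cf. step S508)
            break
    return w


zeroed = []
w = lasso_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0, on_zero=zeroed.append)
```

In this toy problem the second component is shrunk to exactly zero on the first sweep and is reported through the hypothetical `on_zero` callback, which plays the role of the notification to the flag management unit 10.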

FIG. 12 is a flowchart showing the operation of the flag management unit 10 in the third embodiment of the present invention. As shown in FIG. 12, the flag management unit 10 holds a snapshot of the number of non-zero components of the parameter w as a variable z (step S601). The flag management unit 10 then repeatedly receives the positions of components that have converged to zero (step S602) and determines whether the number of zero-component positions received so far is z/2 or more (step S603). If it is z/2 or more (YES in step S603), the flag management unit 10 transmits to the reblocking unit 4 the position information of the components that have converged to zero together with a reblocking command (step S604). If the processing of the iterative calculation unit 11 has ended (YES in step S605), the processing of the flag management unit 10 ends.

If the processing of the iterative calculation unit 11 has not ended (NO in step S605), the flag management unit 10 repeats the above processing until it ends (return to step S601). If the number of zero-component positions is less than z/2 (NO in step S603), the flag management unit 10 proceeds to step S605. The denominator of z/2 need not be 2; it may be parameterized so that the user can specify any integer.
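The threshold logic of steps S601 to S604 can be sketched as follows. The class name `FlagManager`, the `send_reblock` callback, and the `denominator` parameter are hypothetical names. The specification does not say what happens to the stored positions after a command is sent, so the sketch assumes that they are cleared and that z is refreshed to the remaining number of non-zero components.

```python
class FlagManager:
    """Collects zero-converged column positions and triggers reblocking."""

    def __init__(self, nonzero_count, denominator=2, send_reblock=None):
        self.z = nonzero_count            # snapshot of non-zero components (S601)
        self.denominator = denominator    # user-configurable divisor of z
        self.zero_columns = []
        self.send_reblock = send_reblock or (lambda cols: None)

    def report_zero(self, column):
        """Record one zero-converged column (S602) and test the threshold (S603)."""
        if column not in self.zero_columns:
            self.zero_columns.append(column)
        # Integer form of: number of positions >= z / denominator.
        if len(self.zero_columns) * self.denominator >= self.z:
            cols = sorted(self.zero_columns)
            self.send_reblock(cols)       # reblocking command plus positions (S604)
            self.z -= len(cols)           # assumption: take a fresh snapshot
            self.zero_columns = []        # assumption: clear the sent positions
            return cols
        return None


sent = []
manager = FlagManager(nonzero_count=8, send_reblock=sent.append)
results = [manager.report_zero(c) for c in (2, 3, 4, 6)]
```

With z = 8 and the default divisor of 2, the command is sent only when the fourth position arrives, which matches the worked example with positions (2, 3, 4, 6) given later in this description.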

FIG. 13 is a flowchart showing the operation of the reblocking unit 4 in the third embodiment of the present invention. As shown in FIG. 13, the reblocking unit 4 obtains from the flag management unit 10 a reblocking command and the position information of the components of the parameter w that have converged to zero (step S701). Next, the reblocking unit 4 reconstructs the blocks by concatenating adjacent blocks, within a size that fits comfortably in the queue 8, while excluding the columns corresponding to the components that have converged to zero, and replaces the old blocks in the block storage unit 5 with the reconstructed blocks (step S702). The reblocking unit 4 then generates metadata corresponding to the reconstructed blocks and replaces the old metadata in the metadata storage unit 3 with it (step S703). This completes the operation of the reblocking unit 4.

Next, the detailed operation of the data analysis device 6 for carrying out the invention of the present application will be described.

First, an example of the operation performed by the blocking unit 2 of the data management device 1 will be described with reference to FIG. 7. FIG. 7 is a diagram showing an example of training data and its division into blocks in the third embodiment of the present invention.

The 8-by-8 matrix shown in FIG. 7 is an example of training data. Assume, for example, that the queue 8 of the data analysis device 6 can hold only half the data size of the training data. The blocking unit 2 divides the training data into blocks of an appropriate size so that the largest block is no larger than the queue 8. In this example, the training data is halved in both the row and column directions, yielding four equal blocks.

As shown in FIG. 7, the dotted lines in the 8-by-8 matrix represent the block boundaries. The four blocks are referred to as block 1, block 2, block 3, and block 4. In block 1, for example, row x1 holds "0.36 0.26 0.00 0.00" and row x2 holds "0.00 0.00 0.91 0.00". Row x3 of block 1 holds "0.01 0.00 0.00 0.00", and row x4 holds "0.00 0.00 0.09 0.00".

The method of dividing the training data into blocks is not limited to this example. For example, the data may be divided only in the row direction or only in the column direction, may be divided so that the blocks differ in size, or may be divided after the rows and columns have been rearranged in advance by any method.

The blocking unit 2 calculates the block metadata at the same time as it divides the training data into blocks. FIG. 8 is a diagram showing an example of the metadata in the third embodiment of the present invention; it shows, for example, the metadata of the four blocks in FIG. 7. Each row of the metadata indicates the blocks to which the corresponding column of the training data has been distributed. As shown in FIG. 8, the first row of the metadata indicates, for example, that the values in column 1 of the training data have been distributed to blocks 1 and 2.

The format of the metadata is not limited to this example; any format may be used as long as it contains information on which block each value of the training data belongs to.

Next, a specific example of the reblocking operation will be described with reference to FIGS. 7 and 14.

The data analysis device 6 optimizes the parameter w in the iterative calculation unit 11 while reading blocks into the queue 8 in order. Suppose the initial value of the parameter w is chosen at random as (1, 10, 2, 3, 4, 8, 3) and optimization is started; the number z of non-zero components managed by the flag management unit 10 is then, for example, 8. When, after several iterations, the iterative calculation unit 11 determines that the second component of the parameter w has converged to zero, the flag management unit 10 stores the position information "column 2". Suppose the iterations proceed and columns 3, 4, and 6 are also determined to have converged to zero. The flag management unit 10 likewise stores the position information for columns 3, 4, and 6. Because z/2 or more components have now converged to zero, the flag management unit 10 transmits a reblocking command, together with the position information (2, 3, 4, 6), to the reblocking unit 4 of the data management device 1.

On receiving the command, the reblocking unit 4 reblocks the blocks in the block storage unit 5, excluding the columns given by the received position information (2, 3, 4, 6), so that each new block fits comfortably in the queue 8.

FIG. 14 is a diagram showing an example of the new blocks and metadata generated by reblocking in the third embodiment of the present invention. FIG. 14 shows the four blocks of FIG. 7 after reblocking based on the position information (2, 3, 4, 6). In this case, two blocks from which columns 2, 3, 4, and 6 have been excluded are generated and replace the old blocks (FIG. 7) in the block storage unit 5. As shown in FIG. 14, new metadata (right side of FIG. 14) is generated from the new blocks (left side of FIG. 14).
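The reblocking of four blocks into two can be sketched in Python as follows. The function `reblock`, the dictionary layout of a block, the `max_values` capacity check, and the helper `make_block` with its made-up cell values are all hypothetical. The sketch assumes that "adjacent" blocks are those covering the same rows, so that blocks side by side are concatenated after the converged columns are dropped; the specification does not fix this choice.

```python
def reblock(blocks, zero_columns, max_values):
    """Merge blocks that share the same rows, dropping zero-converged columns.

    Returns (new_blocks, metadata); metadata maps each surviving original
    column number to the 1-based ids of the new blocks that hold it.
    """
    drop = set(zero_columns)
    by_rows = {}
    for block in blocks:
        by_rows.setdefault(tuple(block["rows"]), []).append(block)
    new_blocks, metadata = [], {}
    for rows, group in by_rows.items():
        group.sort(key=lambda b: b["cols"][0])  # left-to-right order
        cols, values = [], [[] for _ in rows]
        for block in group:
            keep = [k for k, c in enumerate(block["cols"]) if c not in drop]
            cols.extend(block["cols"][k] for k in keep)
            for out_row, in_row in zip(values, block["values"]):
                out_row.extend(in_row[k] for k in keep)
        if len(rows) * len(cols) > max_values:
            raise ValueError("merged block would not fit in the queue")
        new_blocks.append({"rows": list(rows), "cols": cols, "values": values})
        for c in cols:
            metadata.setdefault(c, []).append(len(new_blocks))
    return new_blocks, metadata


def make_block(rows, cols):
    # Made-up cell values: 10 * row + column, so each value shows its origin.
    return {"rows": rows, "cols": cols,
            "values": [[10 * r + c for c in cols] for r in rows]}


old_blocks = [
    make_block([1, 2, 3, 4], [1, 2, 3, 4]), make_block([5, 6, 7, 8], [1, 2, 3, 4]),
    make_block([1, 2, 3, 4], [5, 6, 7, 8]), make_block([5, 6, 7, 8], [5, 6, 7, 8]),
]
new_blocks, new_metadata = reblock(old_blocks, [2, 3, 4, 6], max_values=32)
```

Under these assumptions the result is two blocks of four rows each, holding only columns 1, 5, 7, and 8, and the regenerated metadata lists both new blocks for every surviving column.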

Excluding columns that are no longer needed from the blocks increases the proportion of all blocks that can be read into the queue 8, which has the advantage that the necessary information is more likely to reside in a buffer or cache.

As described above, in the data analysis system 103 according to the third embodiment of the present invention, the blocking unit 2 of the data management device 1 reads the training data stored in the training data storage unit 12, divides it into blocks, and stores the blocks in the block storage unit 5. The blocking unit 2 also generates metadata indicating which rows and columns of the original training data each block holds, and stores the metadata in the metadata storage unit 3. Based on the position information of the parameter components that have converged to zero during the iterative calculation, the reblocking unit 4 reconstructs the blocks so as to exclude the corresponding columns of the training data, and replaces the old blocks with the reconstructed ones.

The data analysis device 6 includes the parameter storage unit 7, the queue 8, the queue management unit 9, the flag management unit 10, and the iterative calculation unit 11. The parameter storage unit 7 stores variables to be updated, such as the parameter. The queue 8 stores blocks. The iterative calculation unit 11 reads from the queue 8 the blocks or representative values needed for the column it is processing, and performs the update calculation. Specifically, it reads a predetermined block stored in the queue 8 and carries out the iterative calculation of the CD method. The queue management unit 9 discards blocks that are no longer needed from the queue 8 and acquires newly needed blocks from the block storage unit 5. The flag management unit 10 receives from the iterative calculation unit 11 information indicating that the component wj has converged to 0, and outputs the unnecessary column to the data management device 1. The data analysis system 103 can therefore use the CD method even when the size of the training data exceeds the memory size of the computer, and can shorten the processing time of the CD method in that situation.

The reasons are as follows. Dividing the training data into blocks and processing it block by block allows the CD method to run even when the training data does not fit in memory. In addition, some components of the parameter often converge to 0 partway through the iterative calculation of the optimization. A parameter component that has converged to 0 does not change in subsequent iterations, so the data column corresponding to that component never needs to be read again. Removing such columns by reblocking therefore allows more of the necessary data columns to be read at once, which shortens the calculation.

To explain concretely how the calculation is shortened, consider the CD method applied to the training data shown in FIG. 7. The training data is read from a secondary storage device into a main storage device for processing. Assume, however, that because of capacity limits the computer can read only half of the training data into main memory at a time. One way to cope is to read the training data into main memory four rows at a time. That is, to update the component wj for column j, rows 1 to 4 are read and processed, and then rows 5 to 8 are read and processed. This incurs two IO operations. If the update calculation is performed in order from column 1 to column 8 in each iteration, 16 IO operations occur per iteration. If components 1, 2, 3, and 4 of the parameter w converge to 0 after 50 iterations and the parameter w is fully optimized after 100 iterations, a total of 2 × 8 × 50 + 2 × 4 × 50 = 1200 IO operations occur.

At this point, after 50 iterations, columns 1 to 4 of the training data are never referenced again. The reason is as follows. As described earlier, in the CD method calculation for column j, the component wj of the parameter w is updated to wj + α·d. Here, d is the direction in which the starting point moves in FIG. 15, and α is the movement width (step width). α·d is obtained by summing the product xij × g(w) over the rows i, where xij is the value in row i and column j of the training data and g(w) is a function of w; the values in column j of the training data are therefore used only to update wj.

If the training data on the secondary storage device is then replaced with training data from which columns 1 to 4 have been removed, the data size is halved. In iterations 51 to 100, the replaced data therefore needs to be read only once each time. In this case, a total of 2 × 8 × 50 + 1 × 4 × 50 = 1000 IO operations occur, fewer than when no replacement is performed.
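The two IO totals can be checked with a few lines of Python. The function name `total_io` and its parameter names are hypothetical; the figures (two reads per column before convergence, eight columns for the first 50 iterations, four columns for the remaining 50) come from the example above.

```python
def total_io(reads_per_column_before, reads_per_column_after,
             cols_before=8, cols_after=4, iters_before=50, iters_after=50):
    """Total IO count: reads per column x columns updated x iterations, per phase."""
    first_phase = reads_per_column_before * cols_before * iters_before
    second_phase = reads_per_column_after * cols_after * iters_after
    return first_phase + second_phase


io_without_replacement = total_io(2, 2)  # data still read in two halves
io_with_replacement = total_io(2, 1)     # halved data read in one pass
saving = io_without_replacement - io_with_replacement
```

The difference comes entirely from the second phase, where the reduced data fits in main memory in a single read.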

This shortens the overall processing time.

The present invention has been described above using embodiments, but it is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

Part or all of the above embodiments may also be described as in the following supplementary notes, but are not limited to them.

[Supplementary Note 1]
A data management device comprising:
a blocking unit that divides training data representing matrix data into a plurality of blocks and generates metadata indicating which columns of the original training data each block holds; and
a reblocking unit that, when some components of a parameter learned from the training data have converged to 0, replaces each old block that contains an unnecessary column with a block from which the unnecessary column has been removed, and regenerates the metadata.

[Supplementary Note 2]
The data management device according to Supplementary Note 1, wherein the reblocking unit reconstructs the blocks by concatenating adjacent blocks while excluding, from the columns included in the blocks, the columns corresponding to the components that have converged to 0.

[Supplementary Note 3]
The data management device according to Supplementary Note 2, further comprising a metadata storage unit that stores the metadata,
wherein the reblocking unit generates metadata corresponding to the reconstructed blocks and updates the metadata stored in the metadata storage unit.

[Supplementary Note 4]
A data management method comprising:
dividing training data representing matrix data into a plurality of blocks and generating metadata indicating which columns of the original training data each block holds; and
when some components of a parameter learned from the training data have converged to 0, replacing each old block that contains an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata.

[Supplementary Note 5]
A program that causes a computer to execute:
a process of dividing training data representing matrix data into a plurality of blocks and generating metadata indicating which columns of the original training data each block holds; and
a process of, when some components of a parameter learned from the training data have converged to 0, replacing each old block that contains an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata.

[Appendix 6]
A data analysis device comprising:
a queue management unit that reads a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, and stores the predetermined block in a queue;
an iterative calculation unit that reads the predetermined block stored in the queue and performs an iterative calculation of the CD method; and
a flag management unit that, when some components of a parameter have converged to 0 during one iterative calculation, transmits a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.
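As a non-limiting illustration, one iterative calculation over the queued blocks might be sketched as follows for a lasso objective, (1/2)||y - Xw||^2 + lam * ||w||_1. The choice of objective, the soft-thresholding update, and the simplification that a component counts as converged to 0 when its update yields exactly 0 are assumptions made for this example only; the names `cd_pass` and `soft_threshold` are hypothetical.

```python
from collections import deque
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward 0 by lam (the lasso coordinate update)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def cd_pass(queue, y, w, lam):
    """Run one coordinate descent pass over the blocks in the queue.

    Each queue entry is (block, cols), where cols lists the original column
    indices held by the block. Returns the columns flagged as removable.
    """
    residual = y - sum(block @ w[cols] for block, cols in queue)
    removable = []
    for block, cols in queue:
        for k, j in enumerate(cols):
            x = block[:, k]
            rho = x @ (residual + x * w[j])      # correlation with the partial residual
            new_wj = soft_threshold(rho, lam) / (x @ x)
            residual += x * (w[j] - new_wj)      # keep the residual consistent with w
            w[j] = new_wj
            if new_wj == 0.0:
                removable.append(j)              # candidate column for removal
    return removable

X = np.eye(4)[:, :3]
y = np.array([3.0, 0.2, -2.0, 0.0])
queue = deque([(X[:, :2], [0, 1]), (X[:, 2:], [2])])
w = np.zeros(3)
removable = cd_pass(queue, y, w, lam=0.5)
```

With these orthonormal columns a single pass gives `w = [2.5, 0.0, -1.5]`, and column 1 is reported as removable.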

[Appendix 7]
The data analysis device according to Appendix 6, wherein the iterative calculation unit determines, for each iterative calculation, whether each component of the parameter has converged to 0, and, upon determining that a component has converged to 0, notifies the flag management unit of that component.

[Appendix 8]
The data analysis device according to Appendix 6 or 7, wherein, when the iterative calculation unit has updated at least one component included in the predetermined block, the iterative calculation unit further updates that component if the update difference of the updated component is larger than a predetermined threshold.
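As a non-limiting illustration, the repeated update of Appendix 8 might be sketched as follows. Repeating the update for the whole block (rather than for an individual component), the lasso update rule, the `max_rounds` safeguard, and the name `update_block_until_stable` are assumptions made for this example only.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward 0 by lam (the lasso coordinate update)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def update_block_until_stable(block, cols, residual, w, lam, threshold, max_rounds=1000):
    """Keep updating the components of one block while any update difference exceeds threshold."""
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        largest_change = 0.0
        for k, j in enumerate(cols):
            x = block[:, k]
            new_wj = soft_threshold(x @ (residual + x * w[j]), lam) / (x @ x)
            largest_change = max(largest_change, abs(new_wj - w[j]))
            residual += x * (w[j] - new_wj)  # keep the residual consistent with w
            w[j] = new_wj
        if largest_change <= threshold:      # update difference is small enough: stop
            break
    return rounds

block = np.array([[1.0, 0.5], [0.0, 1.0]])
y = np.array([2.0, 1.0])
w = np.zeros(2)
residual = y.copy()                          # residual for w = 0
rounds = update_block_until_stable(block, [0, 1], residual, w, lam=0.1, threshold=1e-6)
```

Because the two columns in this example are correlated, more than one round is needed before the update difference falls below the threshold.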

[Appendix 9]
The data analysis device according to any one of Appendices 6 to 8, wherein the queue management unit discards from the queue a block that has become unnecessary as a result of the iterative calculation of the CD method, and stores a newly required block in the queue.

[Appendix 10]
The data analysis device according to any one of Appendices 6 to 9, wherein the queue management unit identifies, among the plurality of blocks, a block on which the iterative calculation unit has not yet performed the iterative calculation of the CD method, and reads the identified block as the predetermined block.

[Appendix 11]
The data analysis device according to any one of Appendices 6 to 10, wherein the flag management unit receives, from the iterative calculation unit, information on the components of the parameter that have converged to 0, and transmits a flag indicating that the columns of the training data corresponding to those components can be removed.

[Appendix 12]
The data analysis device according to any one of Appendices 6 to 11, wherein the flag management unit determines whether the number of components of the parameter that have converged to 0 is equal to or greater than a predetermined number, and, when it is, requests that the plurality of blocks be reblocked.
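As a non-limiting illustration, the threshold check of Appendix 12 might be sketched as follows. The class name `FlagManager`, its method `report`, and the use of a set to accumulate the zero-converged components are assumptions made for this example only.

```python
class FlagManager:
    """Tracks zero-converged components and decides when to request reblocking."""

    def __init__(self, min_zero_count):
        self.min_zero_count = min_zero_count  # the predetermined number
        self.zero_components = set()

    def report(self, components):
        """Record newly reported zero-converged components.

        Returns True when reblocking should be requested.
        """
        self.zero_components.update(components)
        return len(self.zero_components) >= self.min_zero_count

manager = FlagManager(min_zero_count=3)
first = manager.report([1])      # one distinct component so far
second = manager.report([1, 4])  # two distinct components so far
third = manager.report([7])      # three distinct components: request reblocking
```

Deferring the request until enough components have converged to 0 avoids rewriting the blocks every time a single column becomes removable.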

[Appendix 13]
A data analysis method comprising:
reading a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, and storing the predetermined block in a queue;
reading the predetermined block stored in the queue and performing an iterative calculation of the CD method; and
when some components of a parameter have converged to 0 during one iterative calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.

[Appendix 14]
A program that causes a computer to execute:
a process of reading a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, and storing the predetermined block in a queue;
a process of reading the predetermined block stored in the queue and performing an iterative calculation of the CD method; and
a process of, when some components of a parameter have converged to 0 during one iterative calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.

[Appendix 15]
A data analysis system comprising a data management device and a data analysis device,
the data management device including:
a blocking unit that divides training data representing matrix data into a plurality of blocks, and generates metadata indicating which columns of the original training data each block holds; and
a reblocking unit that, when some components of a parameter learned from the training data have converged to 0, replaces each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerates the metadata,
the data analysis device including:
a queue management unit that reads a predetermined block out of the plurality of blocks obtained by dividing the training data representing matrix data, and stores the predetermined block in a queue;
an iterative calculation unit that reads the predetermined block stored in the queue and performs an iterative calculation of the CD method; and
a flag management unit that, when some components of the parameter have converged to 0 during one iterative calculation, transmits a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.

[Appendix 16]
The data analysis system according to Appendix 15, wherein the reblocking unit reconstructs the blocks by concatenating adjacent blocks while excluding, from the columns included in those blocks, the columns corresponding to the components that have converged to 0.

[Appendix 17]
The data analysis system according to Appendix 16, further comprising a metadata storage unit that stores the metadata,
wherein the reblocking unit generates metadata corresponding to the reconstructed blocks and updates the metadata stored in the metadata storage unit.

[Appendix 18]
The data analysis system according to Appendix 15, wherein the iterative calculation unit determines, for each iterative calculation, whether each component of the parameter has converged to 0, and, upon determining that a component has converged to 0, notifies the flag management unit of that component.

[Appendix 19]
The data analysis system according to Appendix 15 or 16, wherein, when the iterative calculation unit has updated at least one component included in the predetermined block, the iterative calculation unit further updates that component if the update difference of the updated component is larger than a predetermined threshold.

[Appendix 20]
The data analysis system according to any one of Appendices 15 to 17, wherein the queue management unit discards from the queue a block that has become unnecessary as a result of the iterative calculation of the CD method, and stores a newly required block in the queue.

[Appendix 21]
The data analysis system according to any one of Appendices 15 to 18, wherein the queue management unit identifies, among the plurality of blocks, a block on which the iterative calculation unit has not yet performed the iterative calculation of the CD method, and reads the identified block as the predetermined block.

[Appendix 22]
The data analysis system according to any one of Appendices 15 to 19, wherein the flag management unit receives, from the iterative calculation unit, information on the components of the parameter that have converged to 0, and transmits a flag indicating that the columns of the training data corresponding to those components can be removed.

[Appendix 23]
The data analysis system according to any one of Appendices 15 to 20, wherein the flag management unit determines whether the number of components of the parameter that have converged to 0 is equal to or greater than a predetermined number, and, when it is, requests that the plurality of blocks be reblocked.

[Appendix 24]
An analysis method comprising:
dividing training data representing matrix data into a plurality of blocks, and generating metadata indicating which columns of the original training data each block holds;
when some components of a parameter learned from the training data have converged to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata;
reading a predetermined block out of the plurality of blocks obtained by dividing the training data representing matrix data, and storing the predetermined block in a queue;
reading the predetermined block stored in the queue and performing an iterative calculation of the CD method; and
when some components of the parameter have converged to 0 during one iterative calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.

[Appendix 25]
A program that causes a computer to execute:
a process of dividing training data representing matrix data into a plurality of blocks, and generating metadata indicating which columns of the original training data each block holds;
a process of, when some components of a parameter learned from the training data have converged to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata;
a process of reading a predetermined block out of the plurality of blocks obtained by dividing the training data representing matrix data, and storing the predetermined block in a queue;
a process of reading the predetermined block stored in the queue and performing an iterative calculation of the CD method; and
a process of, when some components of the parameter have converged to 0 during one iterative calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.

This application claims priority based on Japanese Patent Application No. 2014-028454, filed on February 18, 2014, the entire disclosure of which is incorporated herein.

DESCRIPTION OF SYMBOLS
1 Data management device
2 Blocking unit
3 Metadata storage unit
4 Reblocking unit
5 Block storage unit
6 Data analysis device
7 Parameter storage unit
8 Queue
9 Queue management unit
10 Flag management unit
11 Iterative calculation unit
12 Training data storage unit
13 Network
20 Blocking unit
21 CPU
22 RAM
23 Storage device
24 Communication interface
25 Input device
26 Output device
40 Reblocking unit
90 Queue management unit
100 Flag management unit
101 Data management device
102 Data analysis device
103 Data analysis system
110 Iterative calculation unit

Claims

A data management device comprising:
blocking means for dividing training data representing matrix data into a plurality of blocks, and generating metadata indicating which columns of the original training data each block holds; and
reblocking means for, when some components of a parameter learned from the training data have converged to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata.

The data management device according to claim 1, wherein the reblocking means reconstructs the blocks by concatenating adjacent blocks while excluding, from the columns included in those blocks, the columns corresponding to the components that have converged to 0.

A data analysis device comprising:
queue management means for reading a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, and storing the predetermined block in a queue;
iterative calculation means for reading the predetermined block stored in the queue and performing an iterative calculation of the CD method; and
flag management means for, when some components of a parameter have converged to 0 during one iterative calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.

The data analysis device according to claim 3, wherein the iterative calculation means determines, for each iterative calculation, whether each component of the parameter has converged to 0, and, upon determining that a component has converged to 0, notifies the flag management means of that component.

A data analysis system comprising a data management device and a data analysis device,
the data management device including:
blocking means for dividing training data representing matrix data into a plurality of blocks, and generating metadata indicating which columns of the original training data each block holds; and
reblocking means for, when some components of a parameter learned from the training data have converged to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata,
the data analysis device including:
queue management means for reading a predetermined block out of the plurality of blocks obtained by dividing the training data representing matrix data, and storing the predetermined block in a queue;
iterative calculation means for reading the predetermined block stored in the queue and performing an iterative calculation of the CD method; and
flag management means for, when some components of the parameter have converged to 0 during one iterative calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.

A computer-readable recording medium storing a program that causes a computer to execute:
a process of dividing training data representing matrix data into a plurality of blocks, and generating metadata indicating which columns of the original training data each block holds; and
a process of, when some components of a parameter learned from the training data have converged to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata.

A computer-readable recording medium storing a program that causes a computer to execute:
a process of reading a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, and storing the predetermined block in a queue;
a process of reading the predetermined block stored in the queue and performing an iterative calculation of the CD method; and
a process of, when some components of a parameter have converged to 0 during one iterative calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.

A data management method comprising:
dividing training data representing matrix data into a plurality of blocks, and generating metadata indicating which columns of the original training data each block holds; and
when some components of a parameter learned from the training data have converged to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata.

A data analysis method comprising:
reading a predetermined block out of a plurality of blocks obtained by dividing training data representing matrix data, and storing the predetermined block in a queue;
reading the predetermined block stored in the queue and performing an iterative calculation of the CD method; and
when some components of a parameter have converged to 0 during one iterative calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.

An analysis method comprising:
dividing training data representing matrix data into a plurality of blocks, and generating metadata indicating which columns of the original training data each block holds;
when some components of a parameter learned from the training data have converged to 0, replacing each old block that includes an unnecessary column with a block from which the unnecessary column has been removed, and regenerating the metadata;
reading a predetermined block out of the plurality of blocks obtained by dividing the training data representing matrix data, and storing the predetermined block in a queue;
reading the predetermined block stored in the queue and performing an iterative calculation of the CD method; and
when some components of the parameter have converged to 0 during one iterative calculation, transmitting a flag indicating that the columns of the training data corresponding to the components that have converged to 0 can be removed.