JP7029385B2

JP7029385B2 - Learning device, learning method, and learning program

Info

Publication number: JP7029385B2
Application number: JP2018237211A
Authority: JP
Inventors: 清良披田野; 晋作清本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-12-19
Filing date: 2018-12-19
Publication date: 2022-03-03
Anticipated expiration: 2038-12-19
Also published as: JP2020098531A

Description

The present invention relates to a learning device, a learning method, and a learning program that are resistant to data poisoning.

One threat to supervised learning is the data poisoning attack, in which malicious data is mixed into the training data. A learning algorithm called TRIM has been proposed as a countermeasure against this attack (see, for example, Non-Patent Document 1).
In general supervised learning, when a model representing the correspondence between inputs and outputs is learned from training data, the model parameters that minimize a loss function are determined by solving an optimization problem. In TRIM, by contrast, in order to minimize the influence of data poisoning, the model parameters are determined by solving an optimization problem in which the number of normal training data among the N training data contaminated with malicious data is taken to be n (< N). Specifically, TRIM uses alternating minimization to optimize the training data and the model parameters in turn, thereby extracting from the N training data the n training data that minimize the loss function while simultaneously determining the model parameters that minimize the loss function.
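The alternating minimization described above can be sketched as follows. This is a simplified illustration, not the reference TRIM implementation of Non-Patent Document 1: the function name `trim_fit`, the one-parameter model y = w·x with squared loss, the initial subset, and the iteration cap are all assumptions made for the example.

```python
def trim_fit(xs, ys, n, max_iters=100):
    """Alternating minimization for y = w * x with squared loss (illustrative)."""
    N = len(xs)
    kept = list(range(n))  # arbitrary initial subset of n indices (assumption)
    w = 0.0
    for _ in range(max_iters):
        # (a) Fix the subset and minimize the loss over w (closed form for this model).
        sxx = sum(xs[i] * xs[i] for i in kept)
        sxy = sum(xs[i] * ys[i] for i in kept)
        w = sxy / sxx if sxx else 0.0
        # (b) Fix w and keep the n samples with the smallest loss.
        #     This pass touches all N samples on every round.
        losses = [(ys[i] - w * xs[i]) ** 2 for i in range(N)]
        new_kept = sorted(range(N), key=losses.__getitem__)[:n]
        if set(new_kept) == set(kept):
            break  # the subset no longer changes
        kept = new_kept
    return w, sorted(kept)
```

Step (b) is the part whose cost grows with N, and step (a) is the part whose cost grows with n.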

M. Jagielski et al., "Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning," IEEE S&P 2018.

A loss function is generally expressed as the average of the losses for the individual training data. Therefore, when TRIM selects the n training data that minimize the loss function under the current model parameters, it calculates the loss for each of the N training data and selects the n training data with the smallest losses. This requires at least order-N computation.
Likewise, when TRIM selects the model parameters that minimize the loss function for the current n training data, it must calculate the loss for each of the n training data. This requires order-n computation.
Consequently, when N and n are large, the computational cost of TRIM becomes enormous.

An object of the present invention is to provide a learning device, a learning method, and a learning program that, in supervised learning, can learn at high speed even from large-scale training data while minimizing the influence of data poisoning.

A learning device according to the present invention is a learning device that determines parameter values of a function by supervised learning, and includes: an extraction unit that randomly extracts one item from a set of training data; a calculation unit that calculates, for the training data extracted by the extraction unit, the gradient of a loss function under the current parameter values, and updates gradient recording data that stores the gradient corresponding to each item of the training data; and an update unit that extracts a predetermined number of gradients from the gradient recording data, starting from the smallest, and updates the parameter values based on the average of the predetermined number of gradients. The processing by the extraction unit, the calculation unit, and the update unit is repeated until the parameter values converge.

When the number of repetitions is less than a predetermined number, the update unit may extract, starting from the smallest of the non-zero gradients, a number of gradients given by a predetermined ratio corresponding to the number of repetitions.

When the number of extracted non-zero gradients is less than the predetermined number, the update unit may calculate the average by treating the missing gradients as zero.

A learning method according to the present invention is a learning method for determining parameter values of a function by supervised learning, in which a computer repeatedly executes the following steps until the parameter values converge: an extraction step of randomly extracting one item from a set of training data; a calculation step of calculating, for the training data extracted in the extraction step, the gradient of a loss function under the current parameter values and updating gradient recording data that stores the gradient corresponding to each item of the training data; and an update step of extracting a predetermined number of gradients from the gradient recording data, starting from the smallest, and updating the parameter values based on the average of the predetermined number of gradients.

A learning program according to the present invention is a learning program for determining parameter values of a function by supervised learning, and causes a computer to repeatedly execute the following steps until the parameter values converge: an extraction step of randomly extracting one item from a set of training data; a calculation step of calculating, for the training data extracted in the extraction step, the gradient of a loss function under the current parameter values and updating gradient recording data that stores the gradient corresponding to each item of the training data; and an update step of extracting a predetermined number of gradients from the gradient recording data, starting from the smallest, and updating the parameter values based on the average of the predetermined number of gradients.

According to the present invention, supervised learning can be performed at high speed even on large-scale training data while minimizing the influence of data poisoning.

FIG. 1 is a block diagram showing the functional configuration of the learning device according to the embodiment.
FIG. 2 is a flowchart showing the processing of the learning device according to the embodiment.

An example of an embodiment of the present invention is described below.
The learning method according to the present embodiment suppresses the influence of malicious data in supervised learning that uses training data contaminated with malicious data.
It is assumed that malicious data is mixed in at a predetermined ratio, for example 20% of the total, and this predetermined ratio of the training data is removed from the whole.

In supervised learning, the parameter w of a function $f_w$ representing the correspondence between an input x and an output y is derived from a set of training data.
Here, let D1 be a set of training data consisting of n data items, and let $l_i(w)$ be a function that takes w as input and outputs the loss for the i-th training data item. In supervised learning, the parameter w of the function $f_w$ that minimizes the loss function L is derived by solving the following optimization problem.
$$\min_{w} L(D1, w) = \min_{w} \frac{1}{n} \sum_{i \in [n]} l_i(w)$$
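As a concrete reading of this objective, the sketch below evaluates L(D, w) as the average of per-sample losses. The squared-error loss and the one-parameter model f_w(x) = w·x are assumptions chosen for illustration; the description does not fix a particular loss or model.

```python
def sample_loss(w, x, y):
    """Per-sample loss l_i(w); squared error for f_w(x) = w * x (assumed)."""
    return (w * x - y) ** 2


def average_loss(data, w):
    """L(D, w) = (1/n) * sum of l_i(w) over the n samples in D."""
    return sum(sample_loss(w, x, y) for x, y in data) / len(data)
```

For data generated by y = 2x, `average_loss` is zero at w = 2 and grows as w moves away from it.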

In a data poisoning attack, the attacker deliberately manipulates the function $f_w$ by mixing a set of malicious data into the set of training data.
Here, let D2 be the union of the training data set D1 and the malicious data set D1', and let D2 consist of N elements. Let R be a utility function. The utility function R takes as input the parameter w and a set D3 of test data prepared by the attacker, and outputs the utility of the attack. The attacker derives the malicious data set D1' that maximizes the utility function R by solving, for example, the following optimization problem.
$$\max_{D1'} R(D3, w')$$
$$\text{s.t.} \quad w' \in \arg\min_{w} L(D2, w)$$
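The two-level structure of this problem (the attacker chooses D1', the learner then fits w' on D2) can be illustrated with a toy search over candidate poison points. Everything specific here is an assumption for the example: the model f_w(x) = w·x with squared loss, the choice of R as the mean squared error on D3, the restriction of D1' to a single point, and the helper names `fit`, `utility`, and `best_poison`.

```python
def fit(data):
    """argmin_w L(D, w) for f_w(x) = w * x with squared loss (closed form)."""
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    return sxy / sxx


def utility(d3, w):
    """R(D3, w): mean squared error on the attacker's test set (assumed)."""
    return sum((w * x - y) ** 2 for x, y in d3) / len(d3)


def best_poison(d1, candidates, d3):
    """Pick the single candidate point D1' = {p} that maximizes R(D3, w')."""
    # Inner problem: the learner fits w' on D2 = D1 + D1'.
    # Outer problem: the attacker keeps the candidate with the highest utility.
    return max(candidates, key=lambda p: utility(d3, fit(d1 + [p])))
```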

In the learning method of the present embodiment, D1'' is taken to be a subset of D2 consisting of n (< N) elements, and the parameter w that minimizes the influence of the malicious data set is derived by solving the following optimization problem.
$$\min_{w,\, D1''} L(D1'', w)$$
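For a fixed w, the minimization over the subset D1'' has a simple solution: keep the n samples with the smallest loss. The sketch below evaluates that inner minimum; the squared-error loss for f_w(x) = w·x and the name `trimmed_loss` are assumptions for illustration.

```python
def trimmed_loss(data, w, n):
    """min over size-n subsets D1'' of L(D1'', w), for a fixed w."""
    losses = sorted((w * x - y) ** 2 for x, y in data)  # squared loss (assumed)
    return sum(losses[:n]) / n  # average over the n smallest losses
```

With one grossly wrong sample among four, the trimmed loss at the true slope is zero for n = 3 but large for n = 4, which is why excluding N - n samples blunts the poison.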

FIG. 1 is a block diagram showing the functional configuration of the learning device 1 according to the present embodiment.
The learning device 1 is an information processing device (computer) such as a server device or a personal computer, and includes a control unit 10 and a storage unit 20, as well as input/output devices for various data, communication devices, and the like.

The control unit 10 controls the learning device 1 as a whole, and realizes each function of the present embodiment by reading and executing, as appropriate, the various programs stored in the storage unit 20. The control unit 10 may be a CPU.

The storage unit 20 is a storage area for the various programs and data that cause the hardware to function as the learning device 1, and may be a ROM, a RAM, a flash memory, a hard disk drive (HDD), or the like. Specifically, the storage unit 20 stores a program (learning program) that causes the control unit 10 to execute each function of the present embodiment, the parameters of the function that serves as the learning model, and data such as the gradient recording data described later.

The control unit 10 includes an extraction unit 11, a calculation unit 12, and an update unit 13. By operating these functional units repeatedly, the control unit 10 determines the parameter values of the function by supervised learning.

The extraction unit 11 randomly extracts one training data item from the set of training data used for supervised learning.

The calculation unit 12 calculates, for the training data extracted by the extraction unit 11, the gradient of the loss function under the current parameter values, and updates the gradient recording data.
The gradient recording data is vector data that stores the gradient of the loss function corresponding to each training data item. When the extraction unit 11 extracts a training data item that has been extracted before, the value at the same index in the gradient recording data is overwritten.
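The overwrite behavior of the gradient recording data can be sketched as a fixed-length list indexed by training sample. The squared-error loss for f_w(x) = w·x, the scalar parameter, and the name `record_gradient` are assumptions for illustration.

```python
import random


def record_gradient(Z, data, w, rng):
    """Draw one sample at random and overwrite its slot in Z with the new gradient."""
    c = rng.randrange(len(data))      # index of the extracted training sample
    x, y = data[c]
    Z[c] = 2.0 * (w * x - y) * x      # d/dw of (w*x - y)^2 (assumed loss)
    return c                          # only slot c changes; other slots keep old values
```

Drawing the same index again simply replaces the stale gradient in that slot, so Z always holds the most recent gradient seen for each sample.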

The update unit 13 extracts a predetermined number of gradients from the gradient recording data, starting from the smallest, and updates the parameter values based on the average of these gradients.
In the initial stage, when the number of learning iterations is still less than the number of training data, the update unit 13 extracts, starting from the smallest of the non-zero gradients, a number of gradients given by a predetermined ratio corresponding to the number of learning iterations. When the number of non-zero gradients is less than the predetermined number, the update unit 13 calculates the average by treating the missing gradients as zero.

The control unit 10 repeats the processing by the extraction unit 11, the calculation unit 12, and the update unit 13 until the parameter values converge.

FIG. 2 is a flowchart showing the processing of the learning device 1 according to the present embodiment.
The learning device 1 solves the optimization problem described above by applying stochastic gradient descent, and thereby learns the parameter w of the function $f_w$ while removing malicious data.

In step S1, the control unit 10 prepares an N-dimensional vector (gradient recording data) $Z^{(0)} = (z_1^{(0)}, \ldots, z_N^{(0)})$ for recording gradients in the subsequent iterative processing, where $z_i^{(0)} = 0$ for all i.
The control unit 10 also sets the pre-learning initial value $w^{(0)}$ of the parameter w and sets the learning iteration count t to its initial value 0.

In step S2, the control unit 10 (extraction unit 11) increments the learning iteration count t and, for the t-th round of learning, randomly extracts one training data item $(x_c, y_c)$ from the set D2 of N training data in which normal data and malicious data are mixed.

In step S3, the control unit 10 (calculation unit 12) calculates the gradient $\nabla l_c(w^{(t-1)})$ of the loss function $l_c(w^{(t-1)})$ corresponding to the training data $(x_c, y_c)$ extracted in step S2, and overwrites only the c-th element of the vector Z as follows.
$$z_c^{(t)} \leftarrow \nabla l_c(w^{(t-1)})$$

In step S4, the control unit 10 (update unit 13) determines whether the learning iteration count t is less than N. If the determination is YES (t < N), the process proceeds to step S5; if it is NO (t ≥ N), the process proceeds to step S6.

In step S5, the control unit 10 (update unit 13) extracts nt/N elements, starting from the smallest value, from among the non-zero elements of the vector Z, and creates the index set $I^{(t)}$ of these elements.

In step S6, the control unit 10 (update unit 13) extracts n elements of the vector Z, starting from the smallest value, and creates the index set $I^{(t)}$ of these elements.
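The branch in steps S4 to S6 can be sketched as a single selection function. Two points are assumptions not fixed by the description: nt/N is rounded down to an integer, and "smallest" is read as smallest in absolute value. The name `select_indices` is likewise hypothetical.

```python
def select_indices(Z, t, n):
    """Index set I^(t) chosen in steps S5/S6 (illustrative)."""
    N = len(Z)
    if t < N:
        # S5: among non-zero entries, take the nt/N smallest (rounded down; assumption).
        candidates = [i for i, z in enumerate(Z) if z != 0]
        k = (n * t) // N
    else:
        # S6: take the n smallest entries of the whole vector.
        candidates = list(range(N))
        k = n
    # "Small" is read as small in absolute value (assumption).
    return sorted(candidates, key=lambda i: abs(Z[i]))[:k]
```

The slice silently returns fewer than k indices when too few non-zero entries exist, which matches the case where the missing gradients are treated as zero.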

In step S7, the control unit 10 (update unit 13) updates the parameter w using the following update formula, where λ is the step size.
$$w^{(t)} \leftarrow w^{(t-1)} - \frac{\lambda}{n} \sum_{i \in I^{(t)}} z_i^{(t)}$$
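Written as code for a scalar parameter, the update is a single line. The function name and argument names are hypothetical; `step_size` stands for λ.

```python
def update_parameter(w, Z, index_set, n, step_size):
    """w^(t) = w^(t-1) - (lambda / n) * sum of z_i over the index set."""
    return w - (step_size / n) * sum(Z[i] for i in index_set)
```

Because the divisor is always n, an index set with fewer than n entries behaves as if the missing gradients were zero.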

In step S8, the control unit 10 determines whether the value of the parameter w has converged. If the determination is YES, the process ends. If it is NO, the process returns to step S2, and the control unit 10 repeats steps S2 through S7 until the value of the parameter w converges.
When the value of the parameter w converges, the index set I also converges to a specific set. This set corresponds to the benign training data that remains after the predetermined proportion presumed to be malicious has been excluded.
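Steps S1 through S8 can be put together in a short sketch. It is an illustration under stated assumptions, not the patented implementation: the model is f_w(x) = w·x with squared loss and a scalar w, "smallest" means smallest in magnitude, nt/N is rounded down, the initial value w^(0) is 0, and convergence is declared when the step falls below `tol` after at least N iterations. The name `robust_sgd` and all default values are hypothetical.

```python
import random


def robust_sgd(data, n, step_size=0.02, max_iters=3000, tol=1e-12, seed=0):
    """Steps S1-S8 for f_w(x) = w * x with squared loss (illustrative)."""
    rng = random.Random(seed)
    N = len(data)
    Z = [0.0] * N            # S1: gradient recording data, all zeros
    w, index_set = 0.0, []   # S1: initial parameter w^(0) (assumed 0)
    for t in range(1, max_iters + 1):
        c = rng.randrange(N)                 # S2: draw one sample at random
        x, y = data[c]
        Z[c] = 2.0 * (w * x - y) * x         # S3: overwrite the c-th gradient
        if t < N:                            # S4 -> S5
            pool = [i for i in range(N) if Z[i] != 0.0]
            k = (n * t) // N                 # nt/N rounded down (assumption)
        else:                                # S4 -> S6
            pool, k = list(range(N)), n
        # "Small" is read as small in magnitude (assumption).
        index_set = sorted(pool, key=lambda i: abs(Z[i]))[:k]
        step = (step_size / n) * sum(Z[i] for i in index_set)
        w -= step                            # S7: parameter update
        if t >= N and abs(step) < tol:       # S8: convergence test (assumption)
            break
    return w, sorted(index_set)
```

On data with eight clean points on y = 2x and two grossly wrong points, the returned index set is expected to settle on the clean points, since the poisoned gradients stay far larger in magnitude than the clean ones. The sort in this sketch is written for clarity and is not an attempt to reproduce the per-iteration cost discussed in the description.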

As described above, according to the present embodiment, the learning device 1 randomly extracts one training data item in each iteration, calculates the gradient of the loss function with respect to the parameter w of the model (function $f_w$), and updates the parameter w using the average of past gradients. This reduces the computation involved in updating the parameter w in each iteration from order n in the case of TRIM to order 1.
Furthermore, when calculating the average of past gradients, the learning device 1 does not average all of the gradients but instead selects a subset (n) of gradients with small values and averages those. Outliers with large gradients, that is, training data likely to be malicious, are thereby excluded, so the training data is optimized at the same time. In this case, the computation for optimizing the training data is reduced from order N in the case of TRIM to order 1.
In this way, the learning device 1 can, in supervised learning, learn at high speed even from large-scale training data while minimizing the influence of data poisoning.

When the number of repetitions is less than a predetermined number, for example the number of training data, the learning device 1 extracts gradients, starting from the smallest of the recorded non-zero gradients, in a predetermined proportion (nt/N) corresponding to the number of repetitions. As a result, the learning device 1 can learn the parameter w while appropriately excluding malicious data even in the initial stage of learning.

When the number of extracted non-zero gradients is less than the predetermined number n, the learning device 1 calculates the average by treating the missing gradients as zero. As a result, the learning device 1 can keep the influence of each training data item on learning uniform even in the initial stage of learning.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described in the embodiment are merely a list of the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment.

In the embodiment described above, when the number of non-zero gradients used to update the parameter w is less than n, the learning device 1 calculates the average by dividing by the fixed value n so that the missing gradients are treated as zero; however, the invention is not limited to this. For example, the number of extracted gradients, nt/N, may be used in place of n.
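The two normalizations can be compared side by side. In this sketch the variant divides by the number of indices actually extracted, and an empty index set yields 0; both choices, and the name `mean_gradient`, are assumptions for illustration.

```python
def mean_gradient(Z, index_set, n, pad_with_zeros=True):
    """Average the selected gradients with either normalization."""
    total = sum(Z[i] for i in index_set)
    if pad_with_zeros:
        # Embodiment: divide by the fixed n, so missing gradients count as zero.
        return total / n
    # Variant: divide by the number of gradients actually extracted.
    return total / len(index_set) if index_set else 0.0
```

Dividing by the fixed n produces smaller steps early in training, while dividing by the extracted count gives each early gradient more weight.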

In the embodiment described above, in the initial stage of learning when the number of repetitions t is less than the number N of training data, the learning device 1 extracts nt/N elements from the gradient recording data according to the number of repetitions t; however, the invention is not limited to this. For example, the number of elements extracted may be the number of non-zero elements present in the gradient recording data multiplied by the predetermined ratio (n/N).
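The two ways of choosing how many elements to extract in the initial stage differ only in what is multiplied by n/N. Rounding down to an integer and the name `extraction_count` are assumptions for illustration.

```python
def extraction_count(Z, t, n, by_nonzero=False):
    """Number of elements to extract while t < N (illustrative)."""
    N = len(Z)
    if by_nonzero:
        # Variant: (number of non-zero entries) * n / N, rounded down.
        nonzero = sum(1 for z in Z if z != 0)
        return (nonzero * n) // N
    # Embodiment: n * t / N, rounded down.
    return (n * t) // N
```

The two counts diverge when the same sample is drawn more than once early on, because t then exceeds the number of non-zero entries.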

In the embodiment described above, the training data is selected at random, but the invention is not limited to this. For example, the training data may be selected in any manner that picks all training data uniformly, such as in index order.
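Both selection schemes can be expressed in one small helper. The cyclic sweep is one possible reading of "index order"; the name `pick_index` and the convention that t starts at 1 are assumptions for illustration.

```python
import random


def pick_index(t, N, rng=None):
    """Index of the sample used at iteration t (t starts at 1)."""
    if rng is None:
        return (t - 1) % N       # cyclic sweep in index order
    return rng.randrange(N)      # random draw, as in the embodiment
```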

The learning method performed by the learning device 1 is realized by software. When it is realized by software, the programs constituting the software are installed on an information processing device (computer). These programs may be recorded on removable media such as a CD-ROM and distributed to users, or may be distributed by being downloaded to a user's computer via a network. Alternatively, these programs may be provided to a user's computer as a Web service over a network without being downloaded.

1 Learning device
10 Control unit
11 Extraction unit
12 Calculation unit
13 Update unit
20 Storage unit

Claims

1. A learning device that determines parameter values of a function by supervised learning, comprising:
an extraction unit that randomly extracts one item from a set of training data;
a calculation unit that calculates, for the training data extracted by the extraction unit, the gradient of a loss function under the current parameter values, and updates gradient recording data that stores the gradient corresponding to each item of the training data; and
an update unit that extracts a predetermined number of gradients from the gradient recording data, starting from the smallest, and updates the parameter values based on the average of the predetermined number of gradients,
wherein the processing by the extraction unit, the calculation unit, and the update unit is repeatedly executed until the parameter values converge.

2. The learning device according to claim 1, wherein, when the number of repetitions is less than a predetermined number, the update unit extracts, starting from the smallest of the non-zero gradients, a number of gradients given by a predetermined ratio corresponding to the number of repetitions.

3. The learning device according to claim 2, wherein, when the number of extracted non-zero gradients is less than the predetermined number, the update unit calculates the average by treating the missing gradients as zero.

4. A learning method for determining parameter values of a function by supervised learning, in which a computer repeatedly executes, until the parameter values converge:
an extraction step of randomly extracting one item from a set of training data;
a calculation step of calculating, for the training data extracted in the extraction step, the gradient of a loss function under the current parameter values, and updating gradient recording data that stores the gradient corresponding to each item of the training data; and
an update step of extracting a predetermined number of gradients from the gradient recording data, starting from the smallest, and updating the parameter values based on the average of the predetermined number of gradients.

5. A learning program for determining parameter values of a function by supervised learning, the program causing a computer to repeatedly execute, until the parameter values converge:
an extraction step of randomly extracting one item from a set of training data;
a calculation step of calculating, for the training data extracted in the extraction step, the gradient of a loss function under the current parameter values, and updating gradient recording data that stores the gradient corresponding to each item of the training data; and
an update step of extracting a predetermined number of gradients from the gradient recording data, starting from the smallest, and updating the parameter values based on the average of the predetermined number of gradients.