WO2017183587A1 - Learning device, learning method, and learning program - Google Patents
Learning device, learning method, and learning program
- Publication number
- WO2017183587A1 (PCT/JP2017/015337)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- learning
- gradient
- primary
- primary gradient
- moving average
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- the present invention relates to a learning device, a learning method, and a learning program.
- Machine learning is applied to problems to be solved such as classification, regression, and clustering, in which model parameters are learned from observation data so as to lower an error function, and predictions are then made on unknown data.
- In machine learning, a model is created from past observation data, and future data is predicted using that model.
- To do so, it is necessary to create the model so that the deviation (error) between the predicted data and the actually measured data is small.
- In machine learning, it is therefore desirable to build a model with a small error in a short time.
- the stochastic gradient descent method is a method in which learning data is selected at random, an error function is calculated, and an operation of correcting parameters in a gradient direction that reduces the error function is repeated.
- various learning algorithms based on the stochastic gradient descent method have been proposed in order to realize efficient learning.
- “efficient” means that the error function can be lowered with a smaller number of parameter updates compared to the conventional stochastic gradient descent method.
- AdaGrad, which realizes efficient learning by automatically adjusting the learning rate on top of the stochastic gradient descent method, has been proposed (for example, see Non-Patent Document 1).
- the learning rate is a hyperparameter for controlling the amount of parameter update during model learning. How quickly the error function can be minimized depends on how the learning rate is set, as the simple example below illustrates.
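- As an illustrative example (not taken from the publication), consider the one-dimensional error function E(θ) = θ^2, whose primary gradient is 2θ. The plain update θ ← θ - α·2θ scales the parameter by (1 - 2α) at every step: with α = 0.1 the parameter shrinks steadily toward the minimum, while with α = 1.1 it alternates in sign and grows without bound. An appropriately chosen learning rate is therefore essential for minimizing the error function quickly.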
- An algorithm called RMSProp applies automatic adjustment of the learning rate to the learning of complex models such as deep learning.
- Efficient learning algorithms called AdaDelta (for example, see Non-Patent Document 2) and Adam (for example, see Non-Patent Document 3) have also been proposed.
- Among these algorithms that automatically adjust the learning rate, Adam has the highest efficiency.
- In AdaGrad and the algorithms derived from it, the learning rate is adjusted through a weighting that divides the learning rate by a moving average of the absolute values of the past primary gradients.
- The primary gradient refers to the derivative of the error function with respect to a parameter.
- The primary gradient is information that defines the direction of parameter update. It can therefore be inferred that information on the direction of the primary gradient is important for adjusting the learning rate.
- In AdaGrad, RMSProp, AdaDelta, and Adam, however, the absolute value of the primary gradient is used, so information on the direction of the primary gradient is lost when the learning rate is adjusted, and a limit to the efficiency of learning can be expected, as the following illustration suggests.
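- As an illustrative example (not taken from the publication), suppose the primary gradient of one parameter over four iterations is +1, +1, +1, +1 and that of another parameter is +1, -1, +1, -1. A moving average of the absolute values equals 1 in both cases, so a learning rate divided by that quantity treats the two parameters identically. The variance of the gradients, however, is 0 for the first parameter and 1 for the second, so a learning rate divided by the standard deviation can take a larger effective step for the consistently directed gradient and a smaller one for the oscillating gradient.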
- the present invention has been made in view of the above, and an object thereof is to provide a learning device, a learning method, and a learning program capable of realizing efficient learning.
- A learning device according to the present invention is a learning device that performs learning using a stochastic gradient descent method in machine learning, and includes: a gradient calculation unit that calculates a primary gradient in the stochastic gradient descent method;
- a statistic calculation unit that calculates statistics of the primary gradient;
- an initialization bias removal unit that removes, from the statistics of the primary gradient calculated by the statistic calculation unit, the initialization bias introduced when the statistic calculation unit calculates the statistics;
- a learning rate adjustment unit that adjusts the learning rate by dividing the learning rate by the standard deviation of the primary gradient based on the statistics of the primary gradient;
- and a parameter update unit that updates the parameters of the learning model using the learning rate adjusted by the learning rate adjustment unit.
- FIG. 1 is a block diagram illustrating an example of a configuration of a learning device according to the present embodiment.
- FIG. 2 is a flowchart showing a processing procedure of learning processing executed by the learning device shown in FIG.
- FIG. 3 is a diagram showing a learning algorithm used by the learning apparatus shown in FIG.
- FIG. 4 is a flowchart illustrating a processing procedure of learning processing according to the modification of the embodiment.
- FIG. 5 is a diagram illustrating a learning algorithm according to a modification of the embodiment.
- FIG. 6 is a diagram illustrating an example of a computer in which a learning apparatus is realized by executing a program.
- Machine learning is basically a technique in which a model is learned from observation data so as to minimize an error function of a problem to be solved, and unknown data is predicted using the learned model.
- Examples of problems to be solved include data classification, regression, and clustering.
- Examples of the error function include squared error and cross entropy.
- Examples of the model include logistic regression and neural networks.
- the stochastic gradient descent method is a widely used algorithm. In the stochastic gradient descent method, learning is performed by repeatedly applying the following equation (1).
- α is one of the hyperparameters that are set manually to define the parameter update width, and is called the learning rate. Since the learning rate defines the update width, it greatly affects learning efficiency. If the learning rate can be set appropriately, learning can proceed with high efficiency. In recent years, research has been progressing on achieving high efficiency by automatically adjusting the learning rate based on various information. Here, high efficiency means that the error function can be lowered with a smaller number of parameter updates than the conventional stochastic gradient descent method. A minimal sketch of this plain update is given below for reference.
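- The sketch below assumes the usual form of equation (1), θ_t = θ_{t-1} - α·g_t, since the equation itself is not reproduced on this page; the function name and signature are illustrative only.

```python
import numpy as np

def sgd_step(theta, grad, alpha=0.001):
    """Plain stochastic gradient descent update, assumed form of equation (1):
    the parameter is corrected in the gradient direction that lowers the error
    function, scaled by the learning rate alpha."""
    return theta - alpha * grad
```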
- In existing algorithms such as AdaGrad and Adam, the learning rate is automatically adjusted by dividing the learning rate by a moving average of the absolute values of the past primary gradients.
- The primary gradient refers to the derivative of the error function with respect to a parameter, and contains the information that defines the direction of the parameter update.
- Since Adam uses the absolute value of the primary gradient for the learning rate, information on the direction of the primary gradient is lost, and a limit to the efficiency of learning can be expected.
- In the present embodiment, the learning rate is automatically adjusted based on information on the direction of the gradient.
- In the present embodiment, the learning rate is adjusted based on the gradient direction information by repeatedly applying the following series of equations (2) to (7) instead of equation (1).
- the number of repeated calculations is t.
- Equation (2) denotes by the symbol g_{i,t} the primary gradient, in the t-th iteration, with respect to the i-th parameter obtained in the (t-1)-th iteration.
- The approximate value of the moving average of the i-th primary gradient g_{i,t} in the t-th iteration is obtained using the following equation (3).
- The initialization bias is removed from the approximate value m_{i,t} of the moving average of the primary gradient g_{i,t} using the following equation (4).
- The moving average of the variance of the i-th primary gradient g_{i,t} in the t-th iteration is obtained using the following equation (5).
- The moving average c_{i,t} of the variance of the i-th primary gradient g_{i,t} in equation (5) is a moving average, over past iterations, of the variance of the primary gradient.
- The moving average c_{i,t} of the variance of the primary gradient g_{i,t} is a statistic of the primary gradient g_{i,t}.
- The moving average c_{i,t} of the variance of the primary gradient g_{i,t} is a value determined by the past variation in the direction of the primary gradient g_{i,t}, and therefore contains information on the direction of the primary gradient g_{i,t}.
- The initialization bias is removed from the moving average c_{i,t} of the variance of the primary gradient g_{i,t} using the following equation (6).
- the learning rate is adjusted using the following equation (7).
- In the present embodiment, the calculations of equations (2) to (7) are repeated until the learning model parameter θ_t converges.
- In the present embodiment, the learning rate is divided by the square root of the bias-corrected moving average c_{i,t} of the variance of the primary gradient g_{i,t}, that is, by the standard deviation of the primary gradient; the formulation thereby adjusts the learning rate automatically.
- This variance is determined by the past variation in the direction of the primary gradient.
- Therefore, the learning rate can be adjusted based on information on the direction of the primary gradient, and the error function can be lowered. That is, according to the present embodiment, efficient learning can be realized. A minimal sketch of one possible reading of equations (2) to (7) is given below.
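- In the sketch, the equations themselves appear only as images on this page, so the exponential moving-average forms, the bias-correction denominators (1 - beta^t), the (g - m)^2 variance term, and the small constant eps are assumptions modeled on the Adam-style notation used in the text; the default values alpha = 0.001, beta1 = 0.7, and beta2 = 0.99 follow the empirical standard settings mentioned for the learning algorithm of FIG. 3. All names are illustrative.

```python
import numpy as np

def proposed_update(theta, grad_fn, alpha=0.001, beta1=0.7, beta2=0.99,
                    eps=1e-8, max_iter=10000, tol=1e-6):
    """Sketch of the learning loop of equations (2)-(7) under the assumptions
    stated above: the learning rate is divided by the standard deviation of the
    primary gradient (square root of the bias-corrected variance moving average)
    rather than by a moving average of its absolute value."""
    m = np.zeros_like(theta)  # approximate moving average of the primary gradient
    c = np.zeros_like(theta)  # moving average of the variance of the primary gradient
    for t in range(1, max_iter + 1):
        g = grad_fn(theta)                              # equation (2): primary gradient
        m = beta1 * m + (1.0 - beta1) * g               # equation (3): gradient moving average
        m_hat = m / (1.0 - beta1 ** t)                  # equation (4): remove initialization bias
        c = beta2 * c + (1.0 - beta2) * (g - m) ** 2    # equation (5): variance moving average
        c_hat = c / (1.0 - beta2 ** t)                  # equation (6): remove initialization bias
        step = alpha * m_hat / (np.sqrt(c_hat) + eps)   # equation (7): divide by std. deviation
        theta = theta - step
        if np.linalg.norm(step) < tol:                  # convergence criterion (assumed)
            break
    return theta
```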
- FIG. 1 is a block diagram illustrating an example of a configuration of a learning device 10 according to the present embodiment.
- the learning device 10 performs learning using a stochastic gradient descent method in machine learning.
- The learning device 10 receives the standard values of the hyperparameters α, β1, and β2.
- The values of α, β1, and β2 are input only the first time.
- The learning device 10 then outputs, for example, the converged parameter θ_t.
- the learning device 10 according to the present embodiment includes a gradient calculation unit 11, a statistic calculation unit 12, an initialization bias removal unit 13, a learning rate adjustment unit 14, and a parameter update unit 15.
- The gradient calculation unit 11 calculates the primary gradient in the stochastic gradient descent method. Specifically, the gradient calculation unit 11 receives θ_t updated by the parameter update unit 15 as an input. The gradient calculation unit 11 also receives the input data x_t from an external device. The gradient calculation unit 11 calculates the primary gradient g_t for t, the number of repeated calculations, and outputs it to the statistic calculation unit 12.
- the gradient calculation unit 11 initializes each variable.
- Initial values are also set for m_t and c_t after initialization bias removal. This initialization is performed only the first time.
- The input data x_t and the parameter θ_t are input to the gradient calculation unit 11. Subsequently, the gradient calculation unit 11 increments t by 1. With this increment, the approximate value m_t of the moving average of the primary gradient and the moving average c_t of the variance of the primary gradient, from which the initialization bias described later has been removed, become m_{t-1} and c_{t-1}, respectively.
- The gradient calculation unit 11 calculates the primary gradient g_t using equation (2) and outputs it to the statistic calculation unit 12.
- The statistic calculation unit 12 calculates the statistics of the primary gradient. Specifically, the statistic calculation unit 12 receives the primary gradient g_t output from the gradient calculation unit 11 and the standard values of the hyperparameters α, β1, and β2, and calculates, as the statistics, the approximate value m_t of the moving average of the primary gradient g_t and the moving average c_t of the variance of the primary gradient g_t. The statistic calculation unit 12 calculates the approximate value m_t of the moving average of the primary gradient g_t using equation (3), and calculates the moving average c_t of the variance of the primary gradient g_t using equation (5). The statistic calculation unit 12 outputs m_t and c_t to the initialization bias removal unit 13.
- The initialization bias removal unit 13 removes the initialization bias from the statistics of the primary gradient calculated by the statistic calculation unit 12. Specifically, the initialization bias removal unit 13 removes the initialization bias from the approximate value m_t of the moving average of the primary gradient g_t using equation (4), and removes the initialization bias from the moving average c_t of the variance of the primary gradient g_t using equation (6).
- For this removal of the initialization bias, the calculation described in Non-Patent Document 3 may be used.
- The learning rate adjustment unit 14 adjusts the learning rate by dividing it by the standard deviation of the primary gradient based on the statistics of the primary gradient. Specifically, the learning rate adjustment unit 14 adjusts the learning rate using equation (7), based on the approximate value m_t of the moving average of the primary gradient g_t and the moving average c_t of the variance of the primary gradient g_t from which the initialization bias removal unit 13 has removed the initialization bias. That is, the learning rate adjustment unit 14 adjusts the learning rate by dividing it by the standard deviation of the primary gradient based on the bias-corrected statistics.
- The parameter update unit 15 updates the parameters of the learning model using the learning rate adjusted by the learning rate adjustment unit 14. Specifically, the parameter update unit 15 updates the model parameter θ_t based on the calculation result of the learning rate adjustment unit 14. The parameter update unit 15 ends the calculation process when the parameter θ_t has converged. If the parameter θ_t has not converged, the parameter update unit 15 outputs the parameter θ_t to the gradient calculation unit 11. The gradient calculation unit 11 then increments t by 1, and the gradient calculation unit 11, the statistic calculation unit 12, the initialization bias removal unit 13, and the learning rate adjustment unit 14 repeat the calculations of equations (2) to (7).
- FIG. 2 is a flowchart showing a processing procedure of learning processing executed by the learning apparatus 10 shown in FIG.
- The gradient calculation unit 11 receives input of the standard values of the hyperparameters α, β1, and β2 (step S1). Subsequently, the gradient calculation unit 11 initializes each variable (step S2).
- The input data x_t and the parameter θ_t are input to the gradient calculation unit 11, and t is incremented (step S3). Subsequently, the gradient calculation unit 11 calculates the primary gradient g_t using equation (2) (step S4) and outputs it to the statistic calculation unit 12.
- The statistic calculation unit 12 receives the primary gradient g_t output from the gradient calculation unit 11 and the standard values of the hyperparameters α, β1, and β2, calculates the approximate value m_t of the moving average of the primary gradient g_t using equation (3) (step S5), and calculates the moving average c_t of the variance of the primary gradient g_t using equation (5) (step S6).
- The initialization bias removal unit 13 removes the initialization bias from the approximate value m_t of the moving average of the primary gradient g_t and the moving average c_t of the variance of the primary gradient g_t calculated by the statistic calculation unit 12 (step S7).
- Based on the approximate value m_t of the moving average of the primary gradient g_t and the moving average c_t of the variance of the primary gradient g_t from which the initialization bias removal unit 13 has removed the initialization bias, the learning rate adjustment unit 14 adjusts the learning rate using the second term of equation (7) (step S8).
- Specifically, the learning rate is adjusted by taking the product of the learning rate and the value obtained by dividing the approximate value of the moving average of the primary gradient by the standard deviation of the primary gradient, which is the square root of the moving average of the variance of the primary gradient.
- The parameter update unit 15 updates the model parameter θ_t based on the calculation result of step S8 (step S9). Thereafter, the parameter update unit 15 determines whether or not the model parameter θ_t has converged (step S10).
- If the model parameter θ_t has converged, the learning device 10 ends the process.
- If the model parameter θ_t has not converged, the learning device 10 returns to step S3. That is, the gradient calculation unit 11 increments t by 1 and executes the processes from step S4 again.
- the learning rate is adjusted by dividing the learning rate by the standard deviation of the primary gradient.
- In the above learning process, the learning rate is adjusted using the standard deviation of the primary gradient, which contains the information that defines the parameter update direction. For this reason, the above learning process can realize efficient learning. A toy usage of the sketch given earlier is shown below.
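- In the toy usage below (illustrative only; the data, model, and sampling scheme are not from the publication), the parameters of a small least-squares model are fitted with randomly selected samples, mirroring the random selection of learning data in the stochastic gradient descent method. `proposed_update` refers to the hypothetical sketch given earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy observation data
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)  # targets with a little noise

def grad_fn(theta):
    # Primary gradient of the squared error for one randomly selected sample.
    i = rng.integers(len(y))
    return 2.0 * (X[i] @ theta - y[i]) * X[i]

theta = proposed_update(np.zeros(3), grad_fn, alpha=0.01)
print(theta)  # should approach w_true
```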
- FIG. 3 is a diagram showing a learning algorithm used by the learning apparatus 10 shown in FIG.
- the learning algorithm shown in FIG. 3 corresponds to the process shown in the flowchart of the learning process in FIG.
- the learning algorithm increments t by +1 (third line in FIG. 3).
- The third line in FIG. 3 corresponds to step S3 shown in FIG. 2. The learning algorithm calculates the primary gradient g_t using equation (2) (fourth line of FIG. 3).
- The fourth line in FIG. 3 corresponds to step S4 shown in FIG. 2.
- The learning algorithm calculates the approximate value m_t of the moving average of the primary gradient g_t using equation (3) (fifth line of FIG. 3).
- The fifth line in FIG. 3 corresponds to step S5 shown in FIG. 2.
- The learning algorithm calculates the moving average c_t of the variance of the primary gradient g_t using equation (5) (sixth line of FIG. 3).
- The sixth line in FIG. 3 corresponds to step S6 shown in FIG. 2.
- The learning algorithm adjusts the learning rate and updates the parameter θ_t (ninth line of FIG. 3).
- The ninth line in FIG. 3 corresponds to steps S8 and S9 shown in FIG. 2.
- The learning algorithm repeats the processing from the second line to the seventh line of FIG. 3 until the parameter θ_t converges (tenth line of FIG. 3).
- The tenth line in FIG. 3 corresponds to step S10 shown in FIG. 2.
- In the present embodiment, the learning rate is divided not by the absolute value of the primary gradient but by the standard deviation of the primary gradient; by adjusting the learning rate in this way, learning can be performed more efficiently than before.
- The amount by which the error decreases each time the number of repeated calculations t advances by one was experimentally found to be larger than that of conventional Adam (see, for example, Non-Patent Document 3). That is, according to the present embodiment, the parameter θ_t can be converged with a smaller number of iterations t than with conventional Adam. Therefore, the present embodiment can realize more efficient learning than conventional Adam.
- By adjusting the learning rate using the standard deviation of the primary gradient, which contains the information that defines the direction of parameter update, the error function of the learned model becomes smaller than with Adam, and highly accurate results were obtained experimentally.
- Equation (8) denotes by the symbol g_{i,t} the primary gradient, in the t-th iteration, with respect to the i-th parameter obtained in the (t-1)-th iteration.
- The moving average m_{i,t} of the primary gradient g_{i,t} in equation (9) is a moving average of the primary gradient over past iterations.
- The moving average m_{i,t} of the primary gradient is a statistic of the primary gradient g_{i,t}.
- The moving average of the variance of the i-th primary gradient g_{i,t} in the t-th iteration is obtained using the following equation (10).
- The moving average c_{i,t} of the variance of the i-th primary gradient g_{i,t} in equation (10) is a moving average, over past iterations, of the variance of the primary gradient.
- The moving average c_{i,t} of the variance of the primary gradient g_{i,t} is a statistic of the primary gradient g_{i,t}.
- The moving average c_{i,t} of the variance of the primary gradient g_{i,t} is a value determined by the past variation in the direction of the primary gradient g_{i,t}, and therefore contains information on the direction of the primary gradient g_{i,t}.
- The initialization bias is removed from the moving average c_{i,t} of the variance of the primary gradient g_{i,t} using the following equation (11).
- the learning rate is adjusted using the following equation (12).
- In this modification, the calculations of equations (8) to (12) are repeated until the learning model parameter θ_t converges.
- In this modification, the learning rate is divided by the square root of the bias-corrected moving average c_{i,t} of the variance of the primary gradient g_{i,t}, that is, by the standard deviation of the primary gradient; the formulation thereby adjusts the learning rate automatically.
- This variance is determined by the past variation in the direction of the primary gradient.
- Therefore, in this modification as well, the learning rate can be adjusted based on information on the direction of the primary gradient, and the error function can be lowered. A sketch of one possible reading of equations (8) to (12) is given below.
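- In the sketch, the raw primary gradient, rather than its moving average, is divided by the standard deviation of the primary gradient, and only the variance moving average is bias-corrected. As with the earlier sketch, the moving-average forms, the use of beta1 for both moving averages (the modification takes only α and β1 as hyperparameters), the bias-correction denominator, and eps are assumptions; the names are illustrative.

```python
import numpy as np

def proposed_update_variant(theta, grad_fn, alpha=0.001, beta1=0.7,
                            eps=1e-8, max_iter=10000, tol=1e-6):
    """Sketch of the modified learning loop of equations (8)-(12) under the
    assumptions stated above."""
    m = np.zeros_like(theta)  # moving average of the primary gradient (equation (9))
    c = np.zeros_like(theta)  # moving average of the variance of the primary gradient (equation (10))
    for t in range(1, max_iter + 1):
        g = grad_fn(theta)                              # equation (8): primary gradient
        m = beta1 * m + (1.0 - beta1) * g               # equation (9)
        c = beta1 * c + (1.0 - beta1) * (g - m) ** 2    # equation (10)
        c_hat = c / (1.0 - beta1 ** t)                  # equation (11): remove initialization bias
        step = alpha * g / (np.sqrt(c_hat) + eps)       # equation (12): divide g by std. deviation
        theta = theta - step
        if np.linalg.norm(step) < tol:                  # convergence criterion (assumed)
            break
    return theta
```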
- the learning device according to this modification has the same configuration as the learning device 10 shown in FIG. Therefore, a learning process according to this modification will be described.
- FIG. 4 is a flowchart illustrating a processing procedure of learning processing according to the modification of the embodiment.
- The learning device 10 receives input of the standard values of the hyperparameters α and β1 (step S11).
- Steps S12 and S13 shown in FIG. 4 are the same as steps S2 and S3 shown in FIG. 2.
- The statistic calculation unit 12 receives the primary gradient g_t output from the gradient calculation unit 11 and the standard values of the hyperparameters α and β1, and calculates the moving average m_t of the primary gradient g_t using equation (9) (step S15).
- The initialization bias removal unit 13 removes the initialization bias from the moving average c_t of the variance of the primary gradient g_t calculated by the statistic calculation unit 12 (step S17).
- Based on the primary gradient g_t and the moving average c_t of the variance of the primary gradient g_t from which the initialization bias has been removed, the learning rate adjustment unit 14 adjusts the learning rate using the second term of equation (12) (step S18).
- Specifically, the learning rate is adjusted by taking the product of the learning rate and the value obtained by dividing the primary gradient by the standard deviation of the primary gradient, which is the square root of the moving average of the variance of the primary gradient.
- Steps S19 and S20 shown in FIG. 4 are the same as steps S9 and S10 shown in FIG. 2.
- FIG. 5 is a diagram showing a learning algorithm 2 according to this modification.
- the learning algorithm 2 shown in FIG. 5 corresponds to the process shown in the flowchart of the learning process in FIG.
- The first line in FIG. 5 corresponds to step S12 shown in FIG. 4.
- The learning algorithm increments t by +1 (third line in FIG. 5).
- The third line in FIG. 5 corresponds to step S13 shown in FIG. 4.
- The learning algorithm calculates the primary gradient g_t using equation (8) (fourth line of FIG. 5).
- The fourth line in FIG. 5 corresponds to step S14 shown in FIG. 4.
- The learning algorithm calculates the moving average m_t of the primary gradient g_t using equation (9) (fifth line of FIG. 5).
- The fifth line in FIG. 5 corresponds to step S15 shown in FIG. 4.
- The learning algorithm calculates the moving average c_t of the variance of the primary gradient g_t using equation (10) (sixth line of FIG. 5).
- The sixth line in FIG. 5 corresponds to step S16 shown in FIG. 4.
- The learning algorithm removes the initialization bias from the moving average c_t of the variance of the primary gradient g_t using equation (11) (seventh line of FIG. 5).
- The seventh line in FIG. 5 corresponds to step S17 shown in FIG. 4.
- The learning algorithm repeats the processing from the second line to the eighth line of FIG. 5 until the parameter θ_t converges (ninth line of FIG. 5).
- The ninth line in FIG. 5 corresponds to step S20 shown in FIG. 4.
- Each component of the learning device 10 shown in FIG. 1 is functionally conceptual and does not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of the functions of the learning device 10 is not limited to the illustrated one, and all or a part of the functions can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
- each process performed in the learning device 10 may be realized in whole or in part by a CPU (Central Processing Unit) and a program that is analyzed and executed by the CPU.
- Each process performed in the learning device 10 may be realized as hardware by wired logic.
- all or a part of the processes described as being automatically performed can be manually performed.
- all or part of the processing described as being performed manually can be automatically performed by a known method.
- the above-described and illustrated processing procedures, control procedures, specific names, and information including various data and parameters can be changed as appropriate unless otherwise specified.
- FIG. 6 is a diagram illustrating an example of a computer in which the learning apparatus 10 is realized by executing a program.
- the computer 1000 includes a memory 1010 and a CPU 1020, for example.
- the computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
- the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
- the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
- the hard disk drive interface 1030 is connected to the hard disk drive 1090.
- the disk drive interface 1040 is connected to the disk drive 1100.
- a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
- the serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example.
- the video adapter 1060 is connected to the display 1130, for example.
- the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, a program that defines each process of the learning device 10 is implemented as a program module 1093 in which a code executable by the computer 1000 is described.
- the program module 1093 is stored in the hard disk drive 1090, for example.
- a program module 1093 for executing processing similar to the functional configuration in the learning device 10 is stored in the hard disk drive 1090.
- the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
- the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 and executes them as necessary.
- The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (local area network), a WAN (wide area network), etc.), and read by the CPU 1020 from the other computer via the network interface 1070.
Abstract
Description
The main symbols used in the embodiment are shown in the table below. The same symbols are used throughout the following descriptions of the conventional mathematical background, the mathematical background of the embodiment, and the embodiment itself.
First, the background knowledge required for the following description is explained. Machine learning is basically a technique in which a model is learned from observation data so as to minimize the error function of the problem to be solved, and the learned model is used to make predictions on unknown data. Examples of problems to be solved include data classification, regression, and clustering. Examples of the error function include squared error and cross entropy. Examples of the model include logistic regression and neural networks.
In the present embodiment, in the stochastic gradient descent method, the learning rate is automatically adjusted based on information on the direction of the gradient. In the present embodiment, the adjustment of the learning rate based on the gradient direction information is realized by repeatedly applying the following series of equations (2) to (7) instead of equation (1). In the present embodiment, t denotes the number of repeated calculations.
Based on the mathematical background described above, a learning device and the like according to the present embodiment will be described. The following embodiment is merely an example.
FIG. 1 is a block diagram showing an example of the configuration of the learning device 10 according to the present embodiment. The learning device 10 performs learning using the stochastic gradient descent method in machine learning. The learning device 10 takes as input the standard values of the hyperparameters α, β1, and β2. These values of α, β1, and β2 are input only the first time. The learning device 10 then outputs, for example, the converged parameter θ_t. As shown in FIG. 1, the learning device 10 according to the present embodiment includes a gradient calculation unit 11, a statistic calculation unit 12, an initialization bias removal unit 13, a learning rate adjustment unit 14, and a parameter update unit 15.
Next, the learning process executed by the learning device 10 will be described. FIG. 2 is a flowchart showing the processing procedure of the learning process executed by the learning device 10 shown in FIG. 1. First, in the learning device 10, the gradient calculation unit 11 receives input of the standard values of the hyperparameters α, β1, and β2 (step S1). Subsequently, the gradient calculation unit 11 initializes each variable (step S2).
Next, the learning algorithm used by the learning device 10 will be described. FIG. 3 is a diagram showing the learning algorithm used by the learning device 10 shown in FIG. 1. The learning algorithm shown in FIG. 3 corresponds to the processing shown in the flowchart of the learning process in FIG. 2. As shown in FIG. 3, the learning algorithm first indicates empirical standard settings of the hyperparameters. For example, a learning rate α = 0.001, β1 = 0.7, and β2 = 0.99 are shown as empirical standard settings.
In the present embodiment, in the stochastic gradient descent method, the learning rate is adjusted by dividing it not by the absolute value of the primary gradient but by the standard deviation of the primary gradient, so that learning can be performed more efficiently than before.
A modification of the present embodiment will be described. In the modification as well, in the stochastic gradient descent method, the learning rate is automatically adjusted based on information on the direction of the gradient. In this modification, the adjustment of the learning rate based on the gradient direction information is realized by repeatedly applying the following series of equations (8) to (12) instead of equations (2) to (7). In this modification as well, t denotes the number of repeated calculations.
FIG. 4 is a flowchart showing the processing procedure of the learning process according to the modification of the embodiment. First, in the learning device 10, the gradient calculation unit 11 receives input of the standard values of the hyperparameters α and β1 (step S11). Steps S12 and S13 shown in FIG. 4 are the same as steps S2 and S3 shown in FIG. 2.
Next, the learning algorithm according to this modification will be described. FIG. 5 is a diagram showing learning algorithm 2 according to this modification. Learning algorithm 2 shown in FIG. 5 corresponds to the processing shown in the flowchart of the learning process in FIG. 4.
Each component of the learning device 10 shown in FIG. 1 is functionally conceptual and does not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of the functions of the learning device 10 is not limited to the illustrated one, and all or a part of the functions can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
FIG. 6 is a diagram showing an example of a computer on which the learning device 10 is realized by executing a program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
11 Gradient calculation unit
12 Statistic calculation unit
13 Initialization bias removal unit
14 Learning rate adjustment unit
15 Parameter update unit
Claims (7)
- 1. A learning device that performs learning using a stochastic gradient descent method in machine learning, comprising: a gradient calculation unit that calculates a primary gradient in the stochastic gradient descent method; a statistic calculation unit that calculates statistics of the primary gradient; an initialization bias removal unit that removes, from the statistics of the primary gradient calculated by the statistic calculation unit, the initialization bias introduced when the statistic calculation unit calculates the statistics of the primary gradient; a learning rate adjustment unit that adjusts a learning rate by dividing the learning rate by a standard deviation of the primary gradient based on the statistics of the primary gradient; and a parameter update unit that updates parameters of a learning model using the learning rate adjusted by the learning rate adjustment unit.
- 2. The learning device according to claim 1, wherein the statistic calculation unit calculates, as the statistics of the primary gradient, an approximate value of a moving average of the primary gradient and a moving average of a variance of the primary gradient, and the learning rate adjustment unit adjusts the learning rate by taking a product of the learning rate and a value obtained by dividing the approximate value of the moving average of the primary gradient by the standard deviation of the primary gradient, which is a square root of the moving average of the variance of the primary gradient.
- 3. The learning device according to claim 1, wherein the statistic calculation unit calculates, as the statistics of the primary gradient, a moving average of the primary gradient and a moving average of a variance of the primary gradient, and the learning rate adjustment unit adjusts the learning rate by taking a product of the learning rate and a value obtained by dividing the primary gradient by the standard deviation of the primary gradient, which is a square root of the moving average of the variance of the primary gradient.
- 4. The learning device according to claim 2, wherein the initialization bias removal unit removes the initialization bias of the approximate value of the moving average of the primary gradient by dividing the approximate value of the moving average of the primary gradient by a value obtained by subtracting, from 1, the weight used when calculating the moving average of the primary gradient, and removes the initialization bias of the moving average of the variance of the primary gradient by dividing the moving average of the variance of the primary gradient by a value obtained by subtracting, from 1, the weight used when calculating the moving average of the variance of the primary gradient.
- 5. The learning device according to claim 3, wherein the initialization bias removal unit removes the initialization bias of the moving average of the variance of the primary gradient by dividing the moving average of the variance of the primary gradient by a value obtained by subtracting, from 1, the weight used when calculating the moving average of the variance of the primary gradient.
- 6. A learning method executed by a learning device that performs learning using a stochastic gradient descent method in machine learning, the method comprising: a step of calculating a primary gradient in the stochastic gradient descent method; a step of calculating statistics of the primary gradient; a step of removing, from the statistics of the primary gradient, the initialization bias introduced when the statistics of the primary gradient are calculated in the calculating step; a step of adjusting a learning rate by dividing the learning rate by a standard deviation of the primary gradient based on the statistics of the primary gradient; and a step of updating parameters of a learning model using the learning rate adjusted in the adjusting step.
- 7. A learning program for causing a computer to execute: a step of calculating, when learning is performed using a stochastic gradient descent method in machine learning, a primary gradient in the stochastic gradient descent method; a step of calculating statistics of the primary gradient; a step of removing, from the statistics of the primary gradient, the initialization bias used when the statistics of the primary gradient are calculated in the calculating step; a step of adjusting a learning rate by dividing the learning rate by a standard deviation of the primary gradient based on the statistics of the primary gradient; and a step of updating parameters of a learning model using the learning rate adjusted in the adjusting step.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2018513164A JP6417075B2 (ja) | 2016-04-18 | 2017-04-14 | 学習装置、学習方法および学習プログラム |
US16/092,135 US20190156240A1 (en) | 2016-04-18 | 2017-04-14 | Learning apparatus, learning method, and recording medium |
EP17785925.3A EP3432230A4 (en) | 2016-04-18 | 2017-04-14 | LEARNING DEVICE, LEARNING METHOD, AND LEARNING PROGRAM |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2016-083141 | 2016-04-18 | ||
JP2016083141 | 2016-04-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017183587A1 true WO2017183587A1 (ja) | 2017-10-26 |
Family
ID=60115964
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2017/015337 WO2017183587A1 (ja) | 2016-04-18 | 2017-04-14 | 学習装置、学習方法および学習プログラム |
Country Status (4)
Country | Link |
---|---|
US (1) | US20190156240A1 (ja) |
EP (1) | EP3432230A4 (ja) |
JP (1) | JP6417075B2 (ja) |
WO (1) | WO2017183587A1 (ja) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019220525A1 (ja) * | 2018-05-15 | 2019-11-21 | 日本電気株式会社 | 確率的最適化装置、確率的最適化方法、および確率的最適化プログラム |
JP2020063699A (ja) * | 2018-10-17 | 2020-04-23 | トヨタ自動車株式会社 | 内燃機関の制御装置及びその制御方法、並びに内燃機関を制御するための学習モデル及びその学習方法 |
JP2020071562A (ja) * | 2018-10-30 | 2020-05-07 | 株式会社キャンサースキャン | 健康診断受診確率計算方法及び健診勧奨通知支援システム |
JPWO2021029034A1 (ja) * | 2019-08-14 | 2021-02-18 | ||
CN112884160A (zh) * | 2020-12-31 | 2021-06-01 | 北京爱笔科技有限公司 | 一种元学习方法及相关装置 |
JPWO2021186500A1 (ja) * | 2020-03-16 | 2021-09-23 | ||
WO2023100339A1 (ja) | 2021-12-03 | 2023-06-08 | 三菱電機株式会社 | 学習済モデル生成システム、学習済モデル生成方法、情報処理装置、プログラム、学習済モデル、および推定装置 |
JP7436830B2 (ja) | 2020-04-06 | 2024-02-22 | 富士通株式会社 | 学習プログラム、学習方法、および学習装置 |
JP7552996B2 (ja) | 2018-10-09 | 2024-09-18 | 株式会社Preferred Networks | ハイパーパラメータチューニング方法、プログラム、ユーザプログラム、装置、方法 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11126893B1 (en) * | 2018-05-04 | 2021-09-21 | Intuit, Inc. | System and method for increasing efficiency of gradient descent while training machine-learning models |
CN110992432B (zh) * | 2019-10-28 | 2021-07-09 | 北京大学 | 基于深度神经网络最小方差梯度量化压缩及图像处理方法 |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017152990A1 (en) * | 2016-03-11 | 2017-09-14 | Telecom Italia S.P.A. | Convolutional neural networks, particularly for image analysis |
-
2017
- 2017-04-14 WO PCT/JP2017/015337 patent/WO2017183587A1/ja active Application Filing
- 2017-04-14 JP JP2018513164A patent/JP6417075B2/ja active Active
- 2017-04-14 EP EP17785925.3A patent/EP3432230A4/en not_active Ceased
- 2017-04-14 US US16/092,135 patent/US20190156240A1/en not_active Abandoned
Non-Patent Citations (1)
Title |
---|
KINGMA, D. P. ET AL., "ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION", 22 December 2014 (2014-12-22), XP055425253, Retrieved from the Internet <URL:https://arxiv.org/pdf/1412.6980v1.pdf> [retrieved on 2017-06-30] *
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPWO2019220525A1 (ja) * | 2018-05-15 | 2021-03-11 | 日本電気株式会社 | 確率的最適化装置、確率的最適化方法、および確率的最適化プログラム |
WO2019220525A1 (ja) * | 2018-05-15 | 2019-11-21 | 日本電気株式会社 | 確率的最適化装置、確率的最適化方法、および確率的最適化プログラム |
JP7552996B2 (ja) | 2018-10-09 | 2024-09-18 | 株式会社Preferred Networks | ハイパーパラメータチューニング方法、プログラム、ユーザプログラム、装置、方法 |
JP2020063699A (ja) * | 2018-10-17 | 2020-04-23 | トヨタ自動車株式会社 | 内燃機関の制御装置及びその制御方法、並びに内燃機関を制御するための学習モデル及びその学習方法 |
JP2020071562A (ja) * | 2018-10-30 | 2020-05-07 | 株式会社キャンサースキャン | 健康診断受診確率計算方法及び健診勧奨通知支援システム |
JP7279796B2 (ja) | 2019-08-14 | 2023-05-23 | 日本電信電話株式会社 | 秘密勾配降下法計算方法、秘密深層学習方法、秘密勾配降下法計算システム、秘密深層学習システム、秘密計算装置、およびプログラム |
JPWO2021029034A1 (ja) * | 2019-08-14 | 2021-02-18 | ||
WO2021029034A1 (ja) * | 2019-08-14 | 2021-02-18 | 日本電信電話株式会社 | 秘密勾配降下法計算方法、秘密深層学習方法、秘密勾配降下法計算システム、秘密深層学習システム、秘密計算装置、およびプログラム |
CN114207694B (zh) * | 2019-08-14 | 2024-03-08 | 日本电信电话株式会社 | 秘密梯度下降法计算方法及系统、秘密深度学习方法及系统、秘密计算装置、记录介质 |
CN114207694A (zh) * | 2019-08-14 | 2022-03-18 | 日本电信电话株式会社 | 秘密梯度下降法计算方法、秘密深度学习方法、秘密梯度下降法计算系统、秘密深度学习系统、秘密计算装置及程序 |
AU2019461061B2 (en) * | 2019-08-14 | 2023-03-30 | Nippon Telegraph And Telephone Corporation | Secure gradient descent computation method, secure deep learning method, secure gradient descent computation system, secure deep learning system, secure computation apparatus, and program |
WO2021186500A1 (ja) * | 2020-03-16 | 2021-09-23 | 日本電気株式会社 | 学習装置、学習方法、及び、記録媒体 |
JPWO2021186500A1 (ja) * | 2020-03-16 | 2021-09-23 | ||
JP7468619B2 (ja) | 2020-03-16 | 2024-04-16 | 日本電気株式会社 | 学習装置、学習方法、及び、記録媒体 |
JP7436830B2 (ja) | 2020-04-06 | 2024-02-22 | 富士通株式会社 | 学習プログラム、学習方法、および学習装置 |
CN112884160B (zh) * | 2020-12-31 | 2024-03-12 | 北京爱笔科技有限公司 | 一种元学习方法及相关装置 |
CN112884160A (zh) * | 2020-12-31 | 2021-06-01 | 北京爱笔科技有限公司 | 一种元学习方法及相关装置 |
WO2023100339A1 (ja) | 2021-12-03 | 2023-06-08 | 三菱電機株式会社 | 学習済モデル生成システム、学習済モデル生成方法、情報処理装置、プログラム、学習済モデル、および推定装置 |
KR20230084423A (ko) | 2021-12-03 | 2023-06-13 | 미쓰비시덴키 가부시키가이샤 | 학습된 모델 생성 시스템, 학습된 모델 생성 방법, 정보 처리 장치, 기록 매체, 학습된 모델, 및 추정 장치 |
Also Published As
Publication number | Publication date |
---|---|
US20190156240A1 (en) | 2019-05-23 |
EP3432230A4 (en) | 2019-11-20 |
JP6417075B2 (ja) | 2018-10-31 |
JPWO2017183587A1 (ja) | 2018-08-30 |
EP3432230A1 (en) | 2019-01-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6417075B2 (ja) | 学習装置、学習方法および学習プログラム | |
JP7016407B2 (ja) | スマートな供給空気温度設定点制御を介する冷却ユニットのエネルギー最適化 | |
Brownlee | What is the Difference Between a Batch and an Epoch in a Neural Network | |
EP2795413B1 (en) | Hybrid control system | |
Kwon et al. | A method for handling batch-to-batch parametric drift using moving horizon estimation: application to run-to-run MPC of batch crystallization | |
Hong et al. | Extremum estimation and numerical derivatives | |
WO2019194299A1 (ja) | 学習装置、学習方法および学習プログラム | |
CN113544599A (zh) | 执行过程并优化在该过程中使用的控制信号的方法 | |
CN111209083B (zh) | 一种容器调度的方法、设备及存储介质 | |
CN111382906A (zh) | 一种电力负荷预测方法、系统、设备和计算机可读存储介质 | |
US11550274B2 (en) | Information processing apparatus and information processing method | |
Anastasiou et al. | Bounds for the normal approximation of the maximum likelihood estimator | |
Singh et al. | Kernel width adaptation in information theoretic cost functions | |
JP2019040414A (ja) | 学習装置及び学習方法 | |
Pinkse et al. | Estimates of derivatives of (log) densities and related objects | |
JP2018073285A (ja) | L1グラフ計算装置、l1グラフ計算方法及びl1グラフ計算プログラム | |
Fisher et al. | Three-way cross-fitting and pseudo-outcome regression for estimation of conditional effects and other linear functionals | |
WO2021157669A1 (ja) | 回帰分析装置、回帰分析方法及びプログラム | |
Toulisα et al. | Implicit stochastic approximation | |
JP2016212510A (ja) | 非線形最適解探索システム | |
US20170344883A1 (en) | Systems and methods for control, analysis, and/or evaluation of dynamical systems | |
CN114442557A (zh) | 一种机床温度场快速辨识方法及系统 | |
CN104298213A (zh) | 一种基于参考批次的指数时变增益型迭代学习控制算法 | |
CN104134091B (zh) | 一种神经网络训练方法 | |
CN114138025A (zh) | 控制脱硫浆液阀门开度的方法、装置及相关设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
ENP | Entry into the national phase |
Ref document number: 2018513164 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2017785925 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2017785925 Country of ref document: EP Effective date: 20181015 |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 17785925 Country of ref document: EP Kind code of ref document: A1 |