WO2019182059A1 - Model generation device, model generation method, and program - Google Patents

Model generation device, model generation method, and program

Info

Publication number
WO2019182059A1
Authority
WO
WIPO (PCT)
Prior art keywords
error
model
training data
data
model generation
Application number
PCT/JP2019/011865
Other languages
French (fr)
Japanese (ja)
Inventor
裕也 海野 (Yuya Unno)
Original Assignee
Preferred Networks, Inc. (株式会社Preferred Networks)
Application filed by Preferred Networks, Inc.
Publication of WO2019182059A1 publication Critical patent/WO2019182059A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • the present invention relates to a model generation device, a model generation method, and a program.
  • an embodiment of the present invention proposes a model generation device, a model generation method, and a program that perform machine learning with a memory usage amount that does not depend on the mini-batch size.
  • a model generation apparatus includes an input unit, an error output unit, an error calculation unit, and a model generation unit.
  • the input means divides the training data and inputs it to the model.
  • the error output means outputs a first error representing a difference between the data acquired by inputting the divided training data into the model and the correct answer label of the divided training data.
  • the error calculation means calculates a second error representing a difference between the data acquired by inputting the training data into the model and the correct answer label of the training data.
  • the model generation means generates the learned model in which the weight of at least one layer of the neural network is updated by back propagation based on the second error.
  • diagrams showing the concept of a prediction problem (Figs. 1A and 1B).
  • a block diagram showing the functions of a machine learning device according to one embodiment (Fig. 2).
  • a flowchart showing the flow of processing according to one embodiment (Fig. 5).
  • FIG. 1A is a diagram illustrating an example of a concept of an input / output state in forward propagation for a word prediction problem.
  • the input x is a D-dimensional vector, and the product xW^T of the input x and the weight matrix W of the hidden layer is computed.
  • W^T denotes the transpose of W.
  • the weight matrix W is a V × D matrix.
  • V indicates the vocabulary size of the model, that is, the number of words to be predicted in the model.
  • an output element predicted in the model at that time point is obtained by inputting the calculation result to, for example, a softmax function in the output layer.
  • the predicted result is compared with teacher data to calculate a loss.
  • the hidden layer elements are optimized.
  • the model is optimized by repeating this forward propagation and back propagation to learn the hidden layer W, as in the sketch below.
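  • for concreteness, the forward pass of FIG. 1A can be sketched in a few lines of numpy. This is an illustrative sketch, not the patented implementation; the sizes, the loss convention, and every name other than x, W, D, and V are assumptions.

```python
import numpy as np

# Minimal sketch of the forward pass in Fig. 1A (illustrative only).
D, V = 256, 50_000                  # input dimension and vocabulary size (assumed)
W = np.random.randn(V, D) * 0.01    # hidden-layer weight matrix, V x D
x = np.random.randn(D)              # one D-dimensional input vector

logits = x @ W.T                    # xW^T, a V-dimensional score vector
p = np.exp(logits - logits.max())   # softmax, stabilized by subtracting the max
p /= p.sum()

label = 123                         # index of the correct word (teacher data)
loss = -np.log(p[label])            # loss against the teacher label
```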
  • FIG. 1A described above uses a single item of the training data; during error back propagation, the loss is calculated over one or more subsets of the training data.
  • in batch learning, all the training data is used for a single error back propagation pass, but model convergence can be slow and good results may not be obtained.
  • therefore, mini-batch learning, which uses a portion of the data at a time, is generally performed.
  • the number of data items computed at one time is called the batch size.
  • in particular, when a highly parallel processor such as a GPU is used, a small batch size prevents efficient parallel computation, so computational efficiency drops markedly. Computation is therefore performed with a reasonably large batch size to raise processor utilization.
  • in FIG. 1B, each row of X is an input x fed to the input layer.
  • for example, X is a matrix whose rows are the B inputs x_1, x_2, ..., x_i, ..., x_B.
  • the outputs for the inputs x_i are likewise collected into Y; for example, Y is a matrix whose rows are the B outputs y_1, ..., y_i, ..., y_B, where y_i is the output for input x_i.
  • in the output layer, applying the softmax function to each row of the hidden layer's output allows the computation for mini-batch size B to be processed efficiently. By averaging the rows of Y, calculating the loss from the error against the correct labels, and back-propagating that loss to optimize, the convergence and accuracy of the model can be improved.
  • the mini-batch size B can be increased to further increase the calculation efficiency.
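  • to illustrate why this formulation strains memory, the following is a sketch of the computation Y = log(softmax(XW^T)) of FIG. 1B; the sizes are assumptions, deliberately kept small so the example runs comfortably.

```python
import numpy as np

# Sketch of the mini-batch forward pass of Fig. 1B (sizes assumed).
B, D, V = 512, 256, 10_000
X = np.random.randn(B, D)               # mini-batch input, one example per row
W = np.random.randn(V, D) * 0.01

Z = X @ W.T                             # B x V logits
Z -= Z.max(axis=1, keepdims=True)       # row-wise stabilization
Y = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))  # Y = log(softmax(XW^T)), per row

labels = np.random.randint(0, V, size=B)
loss = -Y[np.arange(B), labels].mean()  # average loss over the mini-batch

# At realistic sizes (say B = 4096, V = 50_000, float64), Y alone occupies
# B * V * 8 bytes, about 1.6 GB, which is why B cannot simply be made large.
```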
  • by devising the computation between the input and the hidden layer, this embodiment makes it possible to compute on data with a large mini-batch size, for example a size that would ordinarily be difficult to store in accelerator memory.
  • the memory usage and the loss calculation method are described in detail below.
  • FIG. 2 is a block diagram illustrating functions of the machine learning device according to the present embodiment.
  • the machine learning device 1 includes an input unit 10, a control unit 12, a storage unit 14, a learning unit 16, and an output unit 18, and performs machine learning.
  • the machine learning device 1 functions as a model generation device that generates a model by machine learning.
  • the input unit 10 is an interface through which data from outside the machine learning device 1 is input, and receives input of training data, hyperparameters, and the like.
  • the input data is transmitted to the control unit 12 and processed. Further, the input data may be transmitted to the storage unit 14 and temporarily stored.
  • the control unit 12 performs control for the learning unit 16 to learn the model and control for storing data in the storage unit 14.
  • the learning unit 16 learns the model based on the input training data.
  • the learning unit 16 performs processing by transmitting and receiving data stored in the storage unit 14 at an appropriate timing.
  • the storage unit 14 stores training data input from the input unit 10.
  • the storage unit 14 may include a main storage device and an auxiliary storage device as a hardware configuration.
  • a program for operating the control unit 12, the learning unit 16, and the like may be stored.
  • the training data may first be stored in the main storage device, which has a large capacity but performs data input/output slowly, and then be transferred as needed to the auxiliary storage device, whose capacity is smaller than the main storage device but which performs data input/output quickly.
  • the output unit 18 outputs the model learned by the learning unit 16 to the outside.
  • a database (not shown) may be provided inside the machine learning device 1 and a model may be output to the database.
  • This database may be provided in the storage unit 14.
  • the machine learning device 1 may be a processing device that performs natural language processing or the like. In this case, the learned model may be stored in a necessary place as appropriate.
  • FIG. 3 is a block diagram showing the internal functions of the learning unit 16.
  • the function of the learning unit 16 is not limited to RNNs in natural language processing, and can be applied equally to other models whose loss calculation is performed on relatively large mini-batches.
  • the learning unit 16 includes a data selection unit 160, a forward propagation unit 162, an error output unit 164, an error calculation unit 166, and a back propagation unit 168.
  • the data selection unit 160 receives a control signal from the control unit 12 and selects the training data used for learning in the mini-batch. For example, the training data is randomly assigned to mini-batches of B elements. The randomly assigned data is output to the forward propagation unit 162. Alternatively, the labels of the randomly assigned data may be communicated to the forward propagation unit 162, and the forward propagation unit 162 may fetch the mini-batch's training data from the storage unit 14 when performing the computation.
  • the forward propagation unit 162 performs forward propagation in model generation and calculates a numerical value for calculating a loss in each layer.
  • the error output unit 164 refers to the output of each layer obtained by the forward propagation unit 162. This output is computed as a vector whose dimension depends on the layer, or as a matrix aggregating a predetermined number of such vectors as rows.
  • the error output unit 164 calculates the difference (first difference) from the correct labels for the output vector or matrix of each layer output by the forward propagation unit 162. By applying, for example, a softmax function to the calculated first difference, it computes the loss (first error) for a predetermined number (b) of input data items within the mini-batch, that is, within at least a portion (B items) of the training data.
  • the error calculation unit 166 calculates the mini-batch loss (second error), which is the second difference, by calculating the sum or average of the first errors. That is, the first error indicates a difference between the model output value and the label value of b pieces of data, and the second error indicates a loss in the mini-batch (B pieces).
  • FIG. 4 is a diagram showing an outline of the mini-batch processing in the present embodiment.
  • the input for one batch is divided into N groups of a predetermined size b smaller than B.
  • when B is divisible by b, B = N × b and all sub-matrices X_n are b × D matrices.
  • b may be defined as a predetermined number smaller than B that does not depend on B.
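  • a minimal sketch of this division, assuming plain row slicing (names and sizes are illustrative):

```python
import numpy as np

B, D, b = 4096, 256, 512      # b: predetermined group size, independent of B
X = np.random.randn(B, D)

# Split the mini-batch into sub-matrices X_1 ... X_N of at most b rows each.
subbatches = [X[i:i + b] for i in range(0, B, b)]
N = len(subbatches)           # ceil(B / b); the last group may have fewer rows
```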
  • V is the vocabulary size and takes values of roughly 50,000 to 100,000. Consequently, when the batch size B is enlarged, it is difficult to hold in memory at the same time all the elements of the matrices shown in FIG. 1B, in particular the matrix Y of B × V elements to be obtained in the output layer, and it becomes difficult to make the batch size B sufficiently large.
  • by performing this computation sequentially for n = 1, 2, ..., N, the error output unit 164 calculates the first errors L_1, L_2, ..., L_N.
  • the error calculation unit 166 calculates a second error that is a loss in the mini-batch by calculating the sum or average of the first errors.
  • each time a first error L_n has been calculated, the sub-matrix Y_n is discarded; the second error corresponding to the B × V matrix Y can thus be calculated without ever obtaining Y directly, even when the mini-batch size B is large.
  • the sub-matrix Y_n need not be discarded; for example, an area of b × V × (bits per element) may be allocated once and reused. Furthermore, it is not necessary to store all the first errors L_1 and so on: a variable for the loss L may be prepared, and the first errors L_n computed from the sub-matrices Y_n summed into it sequentially, as in the sketch below.
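  • such sequential accumulation might look like the following sketch, in which only one b × V block is ever alive; the function name and the mean-loss convention are assumptions.

```python
import numpy as np

def minibatch_loss(X, W, labels, b):
    """Accumulate the per-group losses L_n so the full B x V matrix Y never exists.

    Only one b x V block of logits is alive at a time; each Y_n is dropped
    (or could be overwritten in a reused buffer) once its L_n is taken.
    """
    B = X.shape[0]
    L = 0.0                                      # the running loss variable
    for i in range(0, B, b):
        Xn, yn = X[i:i + b], labels[i:i + b]
        Zn = Xn @ W.T                            # b x V, not B x V
        Zn -= Zn.max(axis=1, keepdims=True)
        Yn = Zn - np.log(np.exp(Zn).sum(axis=1, keepdims=True))
        L += -Yn[np.arange(len(yn)), yn].sum()   # first error L_n
    return L / B                                 # second error: mean over the mini-batch
```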
  • the loss L may also be obtained by computing each sub-matrix Y_n on a separate computation core such as a GPU, obtaining the first errors L_n in parallel, and taking their sum.
  • the back propagation unit 168 optimizes the model by performing error back propagation based on the second error calculated by the forward propagation unit 162.
  • the error calculation unit 166 recalculates and outputs, at whatever timing it is required during error back propagation, the first error (the partial error for sub-matrix X_n) for each group of the predetermined number b within the mini-batch.
  • for example, when the back propagation unit 168 performs back propagation for each sub-matrix X_n, the sub-matrix Y_n is again needed in order to calculate the first error, that is, the error for that sub-matrix X_n.
  • by recalculating Y_n, the error calculation unit 166 obtains the Y_n that the error output unit 164 had calculated, and calculates the first error.
  • the memory needed for the recalculation only requires securing an area in which one group can be computed, so, as with the computation described above, it can be carried out if an area of b × V × (bits per element) is secured. Considering that b < B, it can be seen that the amount of memory used is reduced.
  • for example, when an accelerator such as a GPU is used for learning, the overall computational cost of not being able to make the batch size B sufficiently large exceeds the cost of recomputing Y_n. Therefore, by recomputing Y_n, that is, by recomputing the difference over the b data items as a partial error (first error), the computational cost of the mini-batch as a whole, and hence of training as a whole, is reduced, even though extra computation is spent obtaining the first errors within the mini-batch.
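  • a sketch of this recomputation during back propagation, hand-written for the output layer under the same assumptions as the earlier sketches; the softmax cross-entropy gradient used here is the standard one, which the text does not spell out.

```python
import numpy as np

def backward_with_recompute(X, W, labels, b):
    """Accumulate dL/dW group by group, recomputing each Y_n on demand.

    The b x V softmax output is rebuilt from X_n in the backward pass
    instead of being kept from the forward pass, so memory stays
    O(b * V) rather than O(B * V).
    """
    B = X.shape[0]
    dW = np.zeros_like(W)
    for i in range(0, B, b):
        Xn, yn = X[i:i + b], labels[i:i + b]
        Zn = Xn @ W.T                      # recomputation of the forward pass
        Zn -= Zn.max(axis=1, keepdims=True)
        Pn = np.exp(Zn)
        Pn /= Pn.sum(axis=1, keepdims=True)
        Pn[np.arange(len(yn)), yn] -= 1.0  # dL_n/dZ_n for mean cross-entropy
        dW += (Pn / B).T @ Xn              # accumulate this group's contribution
    return dW
```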
  • the error calculation unit 166 does not need to be provided independently, and the error output unit 164 may have a function of the error calculation unit 166 and may not include the error calculation unit 166.
  • the forward propagation unit 162 and the back propagation unit 168 may function as a single unit, and may function as a model generation unit that generates a model by performing forward propagation and back propagation. Further, this back propagation of error can be similarly applied not only to the output layer but also to an intermediate layer.
  • the second error may be calculated for the sub-matrix X_n using the result of forward propagation up to an intermediate layer and the result of back propagation down to that layer, and the back propagation through that layer may then be executed. In this case, a softmax function or the like is not necessarily needed to calculate the first error.
  • the error calculation of the present embodiment can be executed for at least one layer constituting the network.
  • thereafter, when computation continues with the same mini-batch, the forward propagation unit 162 takes over; when moving on to the next mini-batch or to the next epoch, the data selection unit 160 takes over.
  • FIG. 5 is a flowchart illustrating an example of a processing flow according to the present embodiment.
  • training data is input to the machine learning device 1 via the input unit 10 (S100).
  • the input training data is stored in the storage unit 14 as necessary.
  • Necessary information such as the number of data is output to the control unit 12. Thereafter, the control unit 12 controls processing necessary for learning, such as learning of the learning unit 16 and data transmission of the storage unit 14.
  • the data selection unit 160 of the learning unit 16 randomly assigns data to mini-batches of the predetermined size B and selects the data from which a mini-batch is generated (S102).
  • the mini-batch size B may be a preset parameter, or may be input to the control unit 12 via the input unit 10 as a hyperparameter.
  • when input as a hyperparameter, the control unit 12 generates mini-batches so that the data selection unit 160 selects data for each mini-batch size B.
  • This step may be a step of distributing data in advance for each mini-batch instead of selecting data for each mini-batch.
  • the data selection may be, for example, a process of reading the selected data into an easily accessible memory, or a process of outputting the index of the selected data to the forward propagation unit 162 and the back propagation unit 168.
  • the data necessary for the calculation may be read into an easily accessible memory at the timing when the processing is performed by the forward propagation unit 162 or the like.
  • other optimization processes such as loop unrolling and software pipelining may be performed.
  • the forward propagation unit 162 forward-propagates the training data in the mini-batch group by group, the error output unit 164 calculates a first error for each forward-propagated group of partial data, and the second error for the mini-batch as a whole is calculated from the per-group first errors (S104).
  • the back propagation unit 168 performs error back propagation using the first error or the second error calculated in S104, and the model is optimized (S106).
  • when the outputs of each layer for the partial data are needed during back propagation, they may be recalculated for each back propagation step over the partial data.
  • the first error may be calculated via this recalculation, and error back propagation performed using that first error. That is, in each layer, the second difference between the layer's output and the error back-propagated from the following layer may be obtained as a first error and back-propagated to the preceding layer as such. In this way, a first error may be obtained in each layer.
  • in back propagation, likewise, the second error for the mini-batch as a whole may be calculated. The back propagation unit 168 thus performs back propagation using the second error and the process (function) by which the first errors corresponding to the second error were obtained in forward propagation.
  • the output unit 18 outputs the learned model and finishes the process (S112).
  • the end of learning is judged according to conditions such as the value of the loss, for example the second error at the output layer, falling below a predetermined value; the computation of a predetermined number of epochs being completed; or the evaluation value on validation data exceeding a predetermined value.
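  • putting the flowchart together, the following is a simplified training-loop sketch; it reuses the minibatch_loss and backward_with_recompute sketches above, and the SGD update and the concrete stopping test are assumptions, not the patent's prescription.

```python
import numpy as np

def train(X_all, y_all, W, B, b, lr=0.1, max_epochs=10, tol=1e-3):
    """Simplified training loop following the flowchart S100 to S112.

    minibatch_loss and backward_with_recompute are the group-wise
    sketches given earlier in this document.
    """
    n = len(X_all)
    for epoch in range(max_epochs):                        # S110: epoch budget
        order = np.random.permutation(n)                   # S102: random assignment
        for s in range(0, n, B):
            idx = order[s:s + B]
            X, y = X_all[idx], y_all[idx]
            loss = minibatch_loss(X, W, y, b)              # S104: group-wise forward
            W -= lr * backward_with_recompute(X, W, y, b)  # S106: backprop with recompute
            if loss < tol:                                 # one possible end condition
                return W
    return W
```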
  • the model may be stored in the storage unit 14, and the machine learning device 1 may then function as, for example, a natural language processing device using the model. Further, the output unit 18 may output the model stored in the storage unit 14.
  • FIG. 6 is a diagram illustrating a hardware implementation example of the present embodiment.
  • the machine learning device 1 includes a CPU 200, an accelerator 202, a main storage device 204, an auxiliary storage device 206, a network interface 208, and a device interface 210. Each of these devices is connected by a bus 212.
  • a CPU (Central Processing Unit) 200 is a processor that operates the machine learning device 1 and operates the machine learning device 1 based on a program stored in the main storage device 204, for example.
  • the accelerator 202 is a device for assisting arithmetic processing, and includes, for example, a GPU.
  • the GPU speeds up the numerical calculation by GPGPU (General-Purpose computing on GPU).
  • the accelerator 202 may be provided with a memory of its own, as an auxiliary storage device in itself, whose stored data it can access at high speed. When exchanging data with the main storage device 204 or the auxiliary storage device 206, necessary data may be prefetched from these storage devices to the memory on the accelerator 202 so that this high-speed access can be exploited to the fullest.
  • the main storage device 204 is directly connected to the CPU 200 via a main bus or the like, and mainly stores programs and the like necessary for the operation of the machine learning device 1.
  • the main storage device 204 includes, for example, DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory).
  • the auxiliary storage device 206 is slower in throughput than the main storage device 204 but has a larger-capacity memory.
  • the auxiliary storage device 206 does not need to be in the same computer as the computer in which the machine learning device 1 is configured, and may be installed outside.
  • the training data may be stored in the auxiliary storage device 206, transferred to the CPU 200, the accelerator 202, and the main storage device 204 via the bus 212 and used.
  • the network interface 208 is an interface that connects the external network 300 and the machine learning device 1.
  • the device interface 210 is an interface that connects the external device 400 and the machine learning device 1.
  • the external device 400 may be connected to the machine learning device 1 via the network 300 and the network interface 208.
  • the calculation in this embodiment is mainly executed on the accelerator 202.
  • although the arithmetic throughput of the accelerator 202 is higher than that of the CPU 200, the capacity of the memory mounted on the accelerator 202 is often smaller than that of the main storage device 204 and the auxiliary storage device 206.
  • the data in that memory can be accessed at high speed from the processor in the accelerator 202, while access from the processor in the accelerator 202 to the main storage device 204 and the auxiliary storage device 206 is often slow.
  • the number of vectors to be processed at the same timing in the partial data may be determined based on the memory capacity on the accelerator 202.
  • b may be set so that the memory remaining after excluding the program needed to operate the accelerator 202's processor and the buffers for input data and the like can hold b × V elements.
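  • one way such a setting of b might be computed, as a sketch; the reserve size and the helper name are assumptions:

```python
def choose_b(free_bytes, V, bytes_per_element=4, reserve_bytes=512 * 2**20):
    """Pick the group size b from the accelerator's memory budget (a sketch).

    reserve_bytes stands in for the program and input buffers the text says
    must be excluded; the remainder is spent on one b x V logit buffer.
    """
    usable = free_bytes - reserve_bytes
    return max(1, usable // (V * bytes_per_element))

# e.g. a 16 GB accelerator, V = 100_000, float32:
# choose_b(16 * 2**30, 100_000) -> roughly 41,000 rows
```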
  • an accurate model is learned by calculating a partial loss (first error or second error) for each predetermined number of data items that does not depend on the mini-batch size. A high calculation speed can therefore be secured without reducing the mini-batch size. As described above, the loss calculation and error back propagation can be performed within a memory capacity that does not depend on the mini-batch size, so the computing resources of the computer can be used efficiently.
  • in the above description, the data selection unit 160 inputs mini-batch-sized data to the model generation means comprising the forward propagation unit 162 and the back propagation unit 168, and the forward propagation unit 162 and the back propagation unit 168 process the data in groups of the predetermined size.
  • however, the present invention is not limited to this.
  • for example, the data selection unit 160 may include a data division unit that divides the data into groups of b items, and this data division unit may operate as input means that inputs the b items of data to the model generation means.
  • the forward propagation unit 162 obtains the output from the output layer by calculation for the b pieces of data input from the input unit to the network.
  • the error output unit 164 outputs a first error based on the first difference between the output corresponding to the b pieces of data and the correct answer label. This error is stored, and the first error is similarly output for the next b pieces inputted from the input means.
  • the error calculation unit 166 calculates a second error, which is an error of the mini-batch.
  • the back propagation unit 168 outputs a first error via the error output unit 164 based on the b items of data input from the input means, the output of each layer, and the back-propagated output. Then, when the output of first errors has been completed for the data of the mini-batch, the error calculation unit 166 calculates the second error at the layer of interest.
  • a data dividing unit that divides the data into b pieces of data may be provided as input means.
  • the division may, for example, allow a certain amount of fluctuation around the predetermined number b. As another example, it may be changed dynamically according to the memory usage rate. Thus, the division is not limited to groups of a fixed predetermined number; it may be changed appropriately according to the situation, and various other division methods may be used.
  • RNN learning in natural language processing has been described as an example.
  • the present invention is not limited to this, and can also be applied to learning in other neural networks that require large data areas when performing loss calculations.
  • it can be used not only for MLP and CNN but also for LSTM (Long Short-Term Memory) networks.
  • the generated model is a model that performs natural language processing.
  • the present invention is not limited to this, and the machine learning device 1 may generate models that process various other kinds of data for other purposes.
  • the softmax function and the like appearing in the above description are shown as examples, and other implementations may be used.
  • instead of the softmax function, another function suited to obtaining gradients, such as a sigmoid function or ReLU (Rectified Linear Unit), may be used. An appropriate function can likewise be selected wherever functions are used elsewhere.
  • the control unit 12 may be a control circuit implemented as an analog or digital circuit, or by an FPGA (Field Programmable Gate Array), an ASIC (Application Specific Integrated Circuit), or the like.
  • learning unit 16 may be implemented by a circuit.
  • the machine learning device 1 may be implemented in hardware, or in software with a CPU or the like carrying out the information processing of the software.
  • a program that realizes at least part of the functions of the machine learning device 1 may be stored in a storage medium such as a flexible disk or CD-ROM, and read and executed by a computer.
  • the storage medium is not limited to a removable medium such as a magnetic disk or an optical disk, but may be a fixed storage medium such as a hard disk device or a memory. That is, information processing by software may be specifically implemented using hardware resources.
  • the processing by software may be implemented in a circuit such as an FPGA and executed by hardware.
  • the generation of the model and the processing after inputting the model may be performed using an accelerator such as a GPU, for example.
  • a processing circuit such as a CPU, a storage device such as a memory, and other necessary hardware may each be provided singly, or a plurality of at least one of them may be provided.
  • the model generated by the machine learning device 1 can be used as a program module that is a part of the artificial intelligence software.
  • the CPU of the computer operates based on the model stored in the storage unit so as to perform an operation and output the result.
  • the present invention can be applied to natural language processing as an example.
  • besides natural language processing, it is also possible to handle a neural network that takes an image as input.
  • a plurality of input images can be divided into mini-batches and applied in the same manner.
  • the pixels in the image may be divided and the calculation may be performed for each of the divided pixels as described above.
  • the present invention can be applied to other types of data as long as mini-batch processing and division within the data can be appropriately performed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention realizes machine learning in memory usage amounts independent of mini-batch size. This model generation device generates a pre-trained model comprising a neural network model, and comprises an input means, an error output means, an error calculation means, and a model generation means. The input means divides training data and inputs the same into the model. The error output means outputs first errors representing the difference between data acquired by inputting the divided training data into the model and correct answer labels of the divided training data. On the basis of the first errors, the error calculation means calculates second errors representing the difference between data acquired by inputting the training data into the model and correct answer labels of the training data. On the basis of the second errors, the model generation means generates the pre-trained model in which the weight of at least one layer of the neural network has been updated by backward propagation of errors.

Description

Model generation apparatus, model generation method and program

The present invention relates to a model generation device, a model generation method, and a program.
In deep learning, the amount of computation, and the memory usage that accompanies the data volume, increase or decrease with the training data. Deep learning in the field of natural language processing often reduces to word prediction problems in tasks such as machine translation, summarization, or language modeling. Such word prediction problems are large-scale prediction problems over vocabularies of tens of thousands to hundreds of thousands of words, and therefore require enormous memory capacity. On the other hand, to improve computational efficiency when operating on mini-batches, the number of data items processed at once (the mini-batch size) must be made large. Since memory consumption is generally proportional to both, the mini-batch size ends up being reduced and computational efficiency sacrificed in order for the computation to run at all. That is, computational efficiency can be improved by increasing the batch size, but memory capacity forces the batch size down, while reducing the batch size lowers the computational efficiency of training as a whole.

Methods that perform recomputation and methods that compute the loss one item at a time have been studied, but because they do not solve the problem that memory consumption is proportional to the vocabulary size, their memory-reduction effect is limited. Methods that raise computational efficiency even at small batch sizes by reducing memory reads have also been studied, but they are not generally applicable and their range of application is limited.
Therefore, an embodiment of the present invention proposes a model generation device, a model generation method, and a program that perform machine learning with a memory usage that does not depend on the mini-batch size.

A model generation apparatus according to one embodiment includes input means, error output means, error calculation means, and model generation means. The input means divides training data and inputs it to the model. The error output means outputs a first error representing the difference between data obtained by inputting the divided training data into the model and the correct labels of the divided training data. Based on the first errors, the error calculation means calculates a second error representing the difference between data obtained by inputting the training data into the model and the correct labels of the training data. Based on the second error, the model generation means generates a trained model in which the weights of at least one layer of the neural network have been updated by error back propagation.

According to one embodiment, machine learning can be performed with a memory usage that does not depend on the mini-batch size.
Fig. 1A: a diagram showing the concept of a prediction problem.
Fig. 1B: a diagram showing the concept of a prediction problem.
Fig. 2: a block diagram showing the functions of a machine learning device according to one embodiment.
Fig. 3: a block diagram showing the functions of a learning unit according to one embodiment.
Fig. 4: a diagram showing the concept of the arithmetic processing according to one embodiment.
Fig. 5: a flowchart showing the flow of processing according to one embodiment.
Fig. 6: a diagram showing an example of a hardware configuration according to one embodiment.
First, an example of the learning to which this embodiment is applied will be described. In this embodiment, learning is performed, for example, as deep machine learning in the natural language processing field using an RNN (Recurrent Neural Network) technique. Processing in the natural language processing field often reduces to large-scale word prediction problems. The following description takes as an example the computation of a network reduced to a prediction problem; however, this embodiment is not limited to this case and can also be applied when machine learning that handles big data is performed with other techniques such as MLP (Multilayer Perceptron) or CNN (Convolutional Neural Network).
An outline of the computation performed in this embodiment will now be given. FIG. 1A is a diagram showing the concept of an example of the input/output state in forward propagation for a word prediction problem. The input x is a D-dimensional vector, and the product xW^T of the input x and the weight matrix W of the hidden layer is computed. Here, W^T denotes the transpose of W. The weight matrix W is a V × D matrix, where V is the vocabulary size of the model, that is, the number of words the model predicts over.

When the product of the input layer and the hidden layer has been calculated, the output layer feeds the result into, for example, a softmax function to obtain the output elements predicted by the model at that point. The predicted result is compared with the teacher data to calculate a loss. By calculating the loss and back-propagating the error, the elements of the hidden layer are optimized. By repeating this forward propagation and back propagation to learn the hidden layer W, the model is optimized. Although only one hidden layer is illustrated, there may be multiple hidden layers.
FIG. 1A described above uses a single item of the training data; during error back propagation, the loss is calculated over one or more subsets of the training data. Batch learning uses all the training data for a single error back propagation pass, but model convergence can be slow and good results may not be obtained, so mini-batch learning, which uses a portion of the data at a time, is generally performed. The number of data items computed at once is called the batch size. In particular, when a highly parallel processor such as a GPU is used, a small batch size prevents efficient parallel computation and computational efficiency drops markedly. Computation is therefore performed with a reasonably large batch size to raise processor utilization.
FIG. 1B is a diagram showing the concept of an example of the input/output state in forward propagation with mini-batch learning. In mini-batch learning, for a mini-batch size B, the number of items in the mini-batch, the input-layer data is taken as a B × D input matrix X, the B × V matrix XW^T is computed, the softmax function is applied per batch item (per row), and the logarithm is taken, yielding Y = log(softmax(XW^T)).

In this case, each row of X is an input x fed to the input layer of FIG. 1B; for example, X is a matrix whose rows are the B inputs x_1, x_2, ..., x_i, ..., x_B. The outputs for the inputs x_i are likewise collected into Y; for example, Y is a matrix whose rows are the B outputs y_1, ..., y_i, ..., y_B, where y_i is the output for input x_i.

In the output layer, applying the softmax function to each row of the hidden layer's output allows the computation for mini-batch size B to be processed efficiently. By averaging the rows of Y, calculating the loss from the error against the correct labels, and back-propagating that loss to optimize, the convergence and accuracy of the model can be improved.
In learning with a large number of words, such as natural language processing, or in learning on big data, the mini-batch size B can be enlarged to further raise computational efficiency. By devising the computation between the input and the hidden layer, this embodiment makes it possible to compute on data with a large mini-batch size, for example a size that would ordinarily be difficult to store in accelerator memory. The memory usage and the loss calculation method are described in detail below.
FIG. 2 is a block diagram showing the functions of the machine learning device according to this embodiment. The machine learning device 1 includes an input unit 10, a control unit 12, a storage unit 14, a learning unit 16, and an output unit 18, and is a device that performs machine learning. The machine learning device 1 functions as a model generation device that generates a model by machine learning.

The input unit 10 is an interface through which data from outside the machine learning device 1 is input, and receives input of training data, hyperparameters, and the like. The input data is transmitted to the control unit 12 and processed. The input data may also be transmitted to the storage unit 14 and stored temporarily.

The control unit 12 controls the learning of the model by the learning unit 16 and controls the storing of data in the storage unit 14.

The learning unit 16 learns the model based on the input training data. The learning unit 16 carries out its processing by sending and receiving the data stored in the storage unit 14 at appropriate timings.

The storage unit 14 stores the training data and the like input from the input unit 10. As a hardware configuration, the storage unit 14 may comprise a main storage device and an auxiliary storage device. When a main storage device is provided, programs for operating the control unit 12, the learning unit 16, and so on may be stored in it. The training data may first be stored in the main storage device, which has a large capacity but performs data input/output slowly, and then be transferred as needed to the auxiliary storage device, whose capacity is smaller than the main storage device but which performs data input/output quickly.

The output unit 18 outputs the model learned by the learning unit 16 to the outside. As another example, a database (not shown) may be provided inside the machine learning device 1 and the model output to that database; this database may be provided in the storage unit 14. As yet another example, the machine learning device 1 may itself be a processing device that performs natural language processing or the like, in which case the learned model may be stored wherever it is needed.
FIG. 3 is a block diagram showing the internal functions of the learning unit 16. As described above, the functions of the learning unit 16 are not limited to RNNs in natural language processing and can be applied equally to other models whose loss calculation is performed on relatively large mini-batches.

The learning unit 16 includes a data selection unit 160, a forward propagation unit 162, an error output unit 164, an error calculation unit 166, and a back propagation unit 168.

The data selection unit 160 receives control signals and the like from the control unit 12 and selects the training data used for learning in the mini-batch. For example, it randomly assigns the training data to mini-batches of B elements. The randomly assigned data is output to the forward propagation unit 162. Alternatively, the labels of the randomly assigned data may be communicated to the forward propagation unit 162, and the forward propagation unit 162 may fetch the mini-batch's training data from the storage unit 14 when performing the computation.
The forward propagation unit 162 performs forward propagation in model generation and computes the numerical values used to calculate the loss in each layer. The output of each layer obtained by the forward propagation unit 162 is referred to by the error output unit 164. This output is computed as a vector whose dimension depends on the layer, or as a matrix aggregating a predetermined number of such vectors as rows.

The error output unit 164 calculates the difference (first difference) from the correct labels for the output vector or matrix of each layer output by the forward propagation unit 162. By applying, for example, a softmax function to the calculated first difference, it computes the loss (first error) for a predetermined number (b) of input data items within the mini-batch, that is, within at least a portion (B items) of the training data.

The error calculation unit 166 computes the mini-batch loss (second error), which is the second difference, by taking the sum or average of the first errors. That is, a first error indicates the difference between the model's output values and the label values for b data items, and the second error indicates the loss over the mini-batch (B items).
Here, the mini-batch processing in this embodiment will be described with reference to FIG. 4. FIG. 4 is a diagram showing an outline of the mini-batch processing in this embodiment.

For the input X, as described above, the input for one batch is divided into N groups of a predetermined size b smaller than B. For example, the predetermined size b is floor(B/N). That is, the input matrix X is divided into sub-matrices X_n such that X = [X_1^T X_2^T ... X_N^T]^T. When B is divisible by b, B = N × b and all sub-matrices X_n are b × D matrices. When B is not divisible by b, writing the remainder as b', B = (N - 1) × b + b', the sub-matrices X_1 through X_{N-1} are b × D matrices, and the sub-matrix X_N is a b' × D matrix. b may also be defined as a predetermined number smaller than B that does not depend on B.

As described above, taking a large batch size B can improve the accuracy and efficiency of learning. On the other hand, the memory capacity usable by the learning unit 16 is limited by the memory usage of the learning unit 16 and the other units. V is the vocabulary size and takes values of roughly 50,000 to 100,000. Consequently, when the batch size B is enlarged, it is difficult to hold in memory at the same time all the elements of the matrices shown in FIG. 1B, in particular the matrix Y of B × V elements to be obtained in the output layer, and it becomes difficult to make the batch size B sufficiently large.

As shown in FIG. 4, by dividing the input matrix X into N groups of the predetermined size b, the loss can be calculated within the above memory limit even when the batch size B is enlarged.
In the forward propagation unit 162, for each input matrix X_n obtained by splitting X every b rows, X_n W^T is computed as the hidden-layer output shown in FIG. 1B. By applying the softmax function to the computed X_n W^T, the matrix Y_n with b × V elements, a sub-matrix of the matrix Y shown in FIG. 1B, is computed. Writing y_i for the teacher-vector label of each input vector x_i, the first error L_n of the output for the partial input X_n is computed by taking the sum or average of Y_n[i, y_i] over the sub-matrix. That is, viewed from the mini-batch, a first error represents the partial loss for the partial input X_n.
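As a sketch, the formation of one first error L_n from Y_n[i, y_i] can be isolated as follows; the function name and the negative-mean convention are assumptions, since the text itself only specifies the sum or average of Y_n[i, y_i].

```python
import numpy as np

def first_error(Xn, W, yn):
    """One first error L_n from a sub-matrix X_n (a sketch).

    Y_n = log(softmax(X_n W^T)) row-wise; L_n gathers Y_n[i, y_i] and
    averages, here with the usual negative sign so lower is better.
    """
    Zn = Xn @ W.T                              # b x V hidden-layer output
    Zn -= Zn.max(axis=1, keepdims=True)
    Yn = Zn - np.log(np.exp(Zn).sum(axis=1, keepdims=True))
    return -Yn[np.arange(len(yn)), yn].mean()  # gather Y_n[i, y_i], then average
```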
By performing this computation sequentially for n = 1, 2, ..., N, the error output unit 164 computes the first errors L_1, L_2, ..., L_N. The error calculation unit 166 computes the second error, the loss within the mini-batch, by taking the sum or average of these first errors. By discarding the sub-matrix Y_n each time a first error L_n has been computed, the second error corresponding to the B × V matrix Y can be computed without ever obtaining Y directly, even when the mini-batch size B is large.

The sub-matrix Y_n need not be discarded; for example, an area of b × V × (bits per element) may be allocated once and reused. Furthermore, it is not necessary to store all the first errors L_1 and so on: a variable for the loss L may be prepared, and the first errors L_n computed from the sub-matrices Y_n summed into it sequentially. Of course, the loss L may also be obtained by computing each sub-matrix Y_n on a separate computation core such as a GPU, obtaining the first errors L_n in parallel, and taking their sum.
Returning to FIG. 3, the back propagation unit 168 optimizes the model by performing error back propagation based on the second error calculated by the forward propagation unit 162.

The error calculation unit 166 recalculates and outputs, at whatever timing it is required during error back propagation, the first error (the partial error for sub-matrix X_n) for each group of the predetermined number b within the mini-batch. For example, when the back propagation unit 168 performs back propagation for each sub-matrix X_n, the sub-matrix Y_n is again needed in order to calculate the first error, that is, the error for that sub-matrix X_n. By recalculating Y_n, the error calculation unit 166 obtains the Y_n that the error output unit 164 had calculated, and calculates the first error.

The memory needed for the recalculation only requires securing an area in which each group can be computed, so, as with the computation described above, it can be carried out if an area of b × V × (bits per element) is secured. Considering that b < B, it can be seen that the amount of memory used can be reduced.

For example, when an accelerator typified by a GPU (Graphics Processing Unit) is used for learning, the overall computational cost of not being able to make the batch size B sufficiently large exceeds the cost of recomputing Y_n. Therefore, by recomputing Y_n, that is, by recomputing the difference over the b data items as a partial error (first error), the computational cost of the mini-batch as a whole, and hence of training as a whole, can be reduced, even though extra computation is spent obtaining the first errors within the mini-batch.
Note that the error calculation unit 166 need not be provided independently; the error output unit 164 may incorporate the functions of the error calculation unit 166, with no separate error calculation unit 166. The forward propagation unit 162 and the back propagation unit 168 may together function as a single means, a model generation unit that generates the model by performing forward propagation and back propagation. This error back propagation can be applied not only to the output layer but equally to intermediate layers. For example, the second error may be computed for the sub-matrix X_n using the result of forward propagation up to an intermediate layer and the result of back propagation down to that layer, and the back propagation through that layer may then be executed. In this case, a softmax function or the like is not necessarily needed to compute the first error. Thus, the error calculation of this embodiment can be carried out for at least one layer of the network.

Thereafter, when computation continues with the same mini-batch, the forward propagation unit 162 takes over; when moving on to the next mini-batch or to the next epoch, the data selection unit 160 takes over.
 図5は、本実施形態に係る処理の流れの一例を示すフローチャートである。 FIG. 5 is a flowchart illustrating an example of a processing flow according to the present embodiment.
 まず、入力部10を介して機械学習装置1に訓練データが入力される(S100)。入力された訓練データは、必要に応じて記憶部14に記憶される。データ数等、必要となる情報は、制御部12へと出力される。その後、制御部12は、学習部16の学習及び記憶部14のデータ送信等、学習に必要な処理を制御する。 First, training data is input to the machine learning device 1 via the input unit 10 (S100). The input training data is stored in the storage unit 14 as necessary. Necessary information such as the number of data is output to the control unit 12. Thereafter, the control unit 12 controls processing necessary for learning, such as learning of the learning unit 16 and data transmission of the storage unit 14.
 次に、学習部16のデータ選択部160は、所定のミニバッチサイズBごとにデータをランダムに振り分け、ミニバッチが生成されるデータを選択する(S102)。ミニバッチサイズBは、あらかじめ設定されているパラメータであってもよくハイパーパラメータとして入力部10を介して制御部12へと入力されるものであってもよい。ハイパーパラメータとして入力された場合、制御部12は、データ選択部160が当該ミニバッチサイズBごとにデータを選択するようにミニバッチを生成する。 Next, the data selection unit 160 of the learning unit 16 randomly distributes data for each predetermined mini-batch size B and selects data for which a mini-batch is generated (S102). The mini-batch size B may be a preset parameter or may be input to the control unit 12 via the input unit 10 as a hyper parameter. When input as a hyperparameter, the control unit 12 generates a mini-batch so that the data selection unit 160 selects data for each mini-batch size B.
 This step may distribute the data into mini-batches in advance rather than selecting data each time a mini-batch is needed. Data selection may, for example, load the selected data into easily accessible memory, or may output the indices of the selected data to the forward propagation unit 162 and the back propagation unit 168; in the latter case, the data needed for a computation may be loaded into easily accessible memory at the time the forward propagation unit 162 or the like performs its processing. At execution time, other optimizations such as loop unrolling and software pipelining may also be applied.
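 A hedged sketch of this index-based selection, with the function name and the use of NumPy as assumptions rather than anything taken from the specification, might look as follows; returning indices defers the actual memory loads until the forward propagation unit 162 or the back propagation unit 168 needs each portion:

```python
import numpy as np

def iter_minibatch_indices(n_samples, B, seed=0):
    """Yield arrays of at most B randomly assigned sample indices (S102)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)  # random distribution of the data
    for i in range(0, n_samples, B):
        yield order[i:i + B]            # indices only; loading happens later
```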
 Next, the forward propagation unit 162 forward-propagates the training data in the mini-batch portion by portion, and the error output unit 164 computes a first error for each forward-propagated portion of the data; based on the first errors of the individual portions, the second error for the mini-batch as a whole is computed (S104).
 Next, the back propagation unit 168 performs error back propagation using the first errors or the second error computed in S104, and the model is optimized (S106). If the outputs of the partial data at each layer are needed while back propagation is executed, they may be recomputed for each back propagation step over the partial data. The first error may be computed through this recomputation, and error back propagation may be performed using that first error. That is, at each layer, a second difference, between the output of that layer and the error back-propagated from the following layer, may be obtained as the first error and back-propagated to the preceding layer. In this way, a first error may be obtained at each layer. Likewise, in back propagation, the second error for the mini-batch as a whole may be computed. The back propagation unit 168 thus executes back propagation using the second error together with the course of computation (the function) by which the first errors corresponding to that second error were obtained through forward propagation.
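 One way to realize this recomputation during back propagation, sketched here with invented names and under the assumption of a softmax cross-entropy output layer rather than as the device's actual code, is to accumulate the weight gradient block by block while recomputing each b x V block of probabilities instead of retaining it from the forward pass:

```python
import numpy as np

def minibatch_grad(X, labels, W, b=64):
    """Gradient of the mean cross-entropy w.r.t. W, accumulated in blocks
    of b rows; the (b, V) probabilities are recomputed here instead of
    being stored during the forward pass."""
    B = X.shape[0]
    grad_W = np.zeros_like(W)                           # (V, D)
    for i in range(0, B, b):
        Xc = X[i:i + b]                                 # (b, D)
        z = Xc @ W.T
        z -= z.max(axis=1, keepdims=True)               # numerical stability
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)               # recomputed (b, V) block
        p[np.arange(len(Xc)), labels[i:i + b]] -= 1.0   # dL/dlogits for this block
        grad_W += p.T @ Xc                              # accumulate (V, D) contribution
    return grad_W / B
```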
 Next, it is determined whether processing of the current mini-batch is complete (S108). If processing of the mini-batch is not complete (S108: NO), that is, if further learning is performed using the same mini-batch, the flow returns to S104 and processing continues (S104 to S106).
 On the other hand, if processing of the mini-batch is complete (S108: YES), it is determined whether learning is complete (S110). If learning is not complete (S110: NO), data for the next mini-batch is selected and the next mini-batch is processed (S102 to S108).
 On the other hand, if learning is complete (S110: YES), the output unit 18 outputs the learned model and the processing ends (S112). The end of learning is judged according to conditions such as: the loss, for example the second error at the output layer, has fallen below a predetermined value; computation for a predetermined number of epochs has finished; or the evaluation value in validation has exceeded a predetermined value. Instead of outputting the model externally via the output unit 18, the model may be stored in the storage unit 14 so that the machine learning device 1 functions, for example, as a natural language processing device using that model. The output unit 18 may also output the model stored in the storage unit 14.
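 Purely as an illustrative outline of the flow of FIG. 5, with a hypothetical callable `step` standing in for the chunked forward and backward processing of S104 to S106, the loop and its termination conditions could be sketched as:

```python
import numpy as np

def train(step, n_samples, B, max_epochs, loss_threshold, seed=0):
    """Outer loop for S102-S110; `step` is a hypothetical helper that takes
    one batch of indices and returns that mini-batch's second error."""
    rng = np.random.default_rng(seed)
    second_error = float("inf")
    for _ in range(max_epochs):                    # epoch-count condition
        order = rng.permutation(n_samples)         # S102: random assignment
        for i in range(0, n_samples, B):
            second_error = step(order[i:i + B])    # S104-S106
        if second_error < loss_threshold:          # loss condition (S110)
            break
    return second_error                            # learning ends; model output follows (S112)
```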
 FIG. 6 is a diagram illustrating an example hardware implementation of the present embodiment. The machine learning device 1 includes a CPU 200, an accelerator 202, a main storage device 204, an auxiliary storage device 206, a network interface 208, and a device interface 210. These devices are connected by a bus 212.
 The CPU (Central Processing Unit) 200 is a processor that operates the machine learning device 1, for example on the basis of programs stored in the main storage device 204.
 The accelerator 202 is a device that assists with arithmetic processing and includes, for example, a GPU. The GPU accelerates numerical computation through GPGPU (General-Purpose computing on GPU). The accelerator 202 may itself include memory as auxiliary storage and may be able to access data stored in that memory at high speed. For exchanges with the main storage device 204 or the auxiliary storage device 206 over the bus, necessary data may be prefetched from those storage devices into the memory on the accelerator 202 so that this high-speed access can be exploited to the fullest.
 The main storage device 204 is directly connected to the CPU 200 via a main bus or the like and mainly stores the programs and the like required for the operation of the machine learning device 1. The main storage device 204 includes, for example, DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory).
 The auxiliary storage device 206 has lower throughput than the main storage device 204 but provides large-capacity memory. The auxiliary storage device 206 need not reside in the same computer as the machine learning device 1 and may be installed externally. For example, the training data may be stored in the auxiliary storage device 206 and transferred via the bus 212 to the CPU 200, the accelerator 202, and the main storage device 204 for use.
 The network interface 208 connects the machine learning device 1 to an external network 300. The device interface 210 connects the machine learning device 1 to an external device 400. The external device 400 may also be connected to the machine learning device 1 via the network 300 and the network interface 208.
 The computation in this embodiment is executed mainly on the accelerator 202. The arithmetic throughput of the accelerator 202 is higher than that of the CPU 200, but the capacity of the memory mounted on the accelerator 202 is often smaller than that of the main storage device 204 and the auxiliary storage device 206. Data in this memory can be accessed at high speed from the processor within the accelerator 202, whereas access from that processor to the main storage device 204 and the auxiliary storage device 206 is often slow.
 In such a case, it is difficult to hold a B x V matrix in the memory on the accelerator 202 all at once, so the matrix must be stored in the auxiliary storage device 206 or the like; however, access between the accelerator 202 and these storage devices is generally slow compared with access between the processor and the memory provided within the accelerator 202. If data of size B x V is handled by the accelerator 202 at once, transfers over the slower bus become necessary, so the proportion of data transfer relative to arithmetic processing rises and the transfer is likely to become a bottleneck.
 Therefore, as described above, by performing the arithmetic processing on the accelerator 202 for each portion of the data, high-speed computation can be achieved while learning proceeds without reducing the mini-batch size B.
 Note that the number of vectors processed at the same time within the partial data, that is, the predetermined size b described above, may be determined based on the memory capacity on the accelerator 202. For example, b may be set so that as much as possible of the capacity remaining after excluding the capacity needed for the programs that operate the processor of the accelerator 202 and the buffer capacity for input data and the like can be used for the b x V elements.
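 As a rough sketch of this sizing rule, with the accounting of overheads being an assumption on our part, b could be derived from the accelerator memory left after program and buffer requirements are subtracted:

```python
def choose_b(accel_mem_bytes, program_bytes, buffer_bytes, V, bytes_per_elem=4):
    """Largest b such that a (b, V) block of elements fits in the accelerator
    memory remaining after program and I/O buffer capacity is excluded."""
    usable = accel_mem_bytes - program_bytes - buffer_bytes
    return max(1, usable // (V * bytes_per_elem))
```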
 As described above, according to the present embodiment, by computing a partial loss (a first error or second error) for each predetermined number of data items independent of the mini-batch size, a high computation speed can be secured without reducing the mini-batch size needed to learn an accurate model. Since the loss can thus be computed and error back propagation performed within a memory footprint that does not depend on the mini-batch size, the computational resources of the machine can be used efficiently.
 In the description above, the data selection unit 160 inputs mini-batch-sized data to the model generation device including the forward propagation unit 162 and the back propagation unit 168, and the forward propagation unit 162 and the back propagation unit 168 operate on data of the predetermined size; however, the configuration is not limited to this.
 For example, after the data selection unit 160 selects B items of data as a mini-batch, a data division unit may divide the data into groups of b items, and this data division unit may operate as input means that inputs the b items of input data to the model generation unit.
 In model generation, the forward propagation unit 162 computes the output of the output layer for the b items of data input to the network by the input means. The error output unit 164 outputs a first error based on a first difference between the outputs corresponding to those b items of data and the correct labels. This error is stored, and a first error is likewise output for the next b items input by the input means. Once a full mini-batch, that is, B items of data, has been input, the error calculation unit 166 computes the second error, which is the error of the mini-batch.
 Similarly, the back propagation unit 168 outputs, via the error output unit 164, a first error based on the b items of data input by the input means, for the output of each layer and the back-propagated output. When first errors have been output for the whole mini-batch, the error calculation unit 166 computes the second error at the layer of interest. In this way, a data division unit that divides the data into groups of b items may be provided as the input means.
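 The division of roles among the input means, the error output unit 164, and the error calculation unit 166 described in this variation might be organized as in the following schematic sketch; the class names are invented for illustration, and the loss is assumed to be a summed negative log-likelihood:

```python
import numpy as np

class ErrorOutput:
    """Error output means: first error (summed loss) for one block of b items."""
    def first_error(self, probs, labels):
        return -np.log(probs[np.arange(len(labels)), labels]).sum()

class ErrorCalc:
    """Error calculation means: aggregates the stored first errors into the
    mini-batch's second error once all B items have been processed."""
    def __init__(self):
        self.partials = []
    def store(self, first_error):
        self.partials.append(first_error)
    def second_error(self, B):
        return sum(self.partials) / B   # mean over the mini-batch
```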
 The division may also allow some fluctuation around the predetermined number b, for example, or, as another example, be changed dynamically according to memory utilization. Thus, the division is not limited to fixed groups of a predetermined number; it may change appropriately with the situation, and various other division methods may be used.
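 A fluctuating or memory-driven choice of the division size could be sketched as follows; the back-off rule and the jitter range are illustrative assumptions only:

```python
import numpy as np

def next_b(b_nominal, mem_used_frac, jitter=0.1, rng=None):
    """Vary the division size around its nominal value, shrinking it as
    accelerator memory fills; both rules are illustrative heuristics."""
    scale = max(0.25, 1.0 - mem_used_frac)     # back off under memory pressure
    b = int(b_nominal * scale)
    if rng is not None:
        b = int(b * (1.0 + rng.uniform(-jitter, jitter)))  # small random fluctuation
    return max(1, b)
```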
 The above description used RNN training in natural language processing as an example, but the method is not limited to this and can also be applied to training other neural networks that require large data areas when computing the loss. For example, it can be used not only with MLPs and CNNs but also with LSTMs (Long Short-Term Memory) and the like.
 Although the generated model has been described as one that performs natural language processing, the model is not limited to this; the machine learning device 1 may generate models that process various other kinds of data for other purposes.
 The softmax function and the other functions in the above description are given as examples, and other implementations may be used. In place of the softmax function, another function suited to obtaining gradients, such as a sigmoid function or ReLU (Rectified Linear Unit), may be used, and appropriate functions may likewise be chosen for the other places where functions are used.
 In the machine learning device 1 of each embodiment shown in FIG. 2, the control unit 12 may be a control circuit implemented in analog or digital form, or with an FPGA (Field Programmable Gate Array), an ASIC (Application Specific Integrated Circuit), or the like. The learning unit 16 may likewise be implemented as a circuit.
 In all of the above description, at least part of the machine learning device 1 may be implemented in hardware, or it may be implemented in software with a CPU or the like carrying out the software's information processing. When implemented in software, a program realizing the machine learning device 1 and at least part of its functions may be stored on a storage medium such as a flexible disk or CD-ROM and read and executed by a computer. The storage medium is not limited to removable media such as magnetic or optical disks; it may be a fixed storage medium such as a hard disk device or memory. That is, the information processing by software may be concretely implemented using hardware resources. Furthermore, the processing by software may be implemented in a circuit such as an FPGA and executed by hardware. Model generation and the processing performed after input to the model may be carried out using an accelerator such as a GPU, for example. A processing circuit such as a CPU, a storage device such as memory, and other necessary hardware may each be provided singly, or a plurality of at least one of them may be provided.
 The model generated by the machine learning device 1 according to the present embodiment can be used as a program module forming part of artificial intelligence software. That is, the CPU of a computer operates based on the model stored in the storage unit so as to perform computation and output the result.
 Based on all of the above description, those skilled in the art may conceive of additions to, effects of, or various modifications of the present invention, but aspects of the present invention are not limited to the individual embodiments described above. Various additions, changes, and partial deletions are possible without departing from the conceptual idea and spirit of the present invention derived from the content defined in the claims and its equivalents. For example, in all of the embodiments described above, the numerical values used in the description are given as examples and are not limiting.
 In the present embodiment, applicability to natural language processing has been shown as an example, but the method is not limited to this. For example, it can also handle neural networks that take images as input; in this case, a plurality of input images can be divided into mini-batches and the method applied in the same way. As another example, classification may be performed by applying the division of this embodiment to each pixel within an image; in that case, the pixels within the image may be divided and the computation performed for each divided group of pixels as described above. Furthermore, division within a mini-batch and division of the pixels within an image can be used in combination. The method can also be applied to other forms of data as long as mini-batch processing and division within the data can be performed appropriately.
1: machine learning device, 10: input unit, 12: control unit, 14: storage unit, 16: learning unit, 160: data selection unit, 162: forward propagation unit, 164: error output unit, 166: error calculation unit, 168: back propagation unit, 18: output unit, 200: CPU, 202: accelerator, 204: main storage device, 206: auxiliary storage device, 208: network interface, 210: device interface, 212: bus, 300: network, 400: external device

Claims (10)

  1.  A model generation device for generating a learned model comprising a neural network model, the device comprising:
     input means for dividing training data and inputting the divided training data into the model;
     error output means for outputting a first error representing a difference between data obtained by inputting the divided training data into the model and correct labels of the divided training data;
     error calculation means for calculating, based on the first error, a second error representing a difference between data obtained by inputting the training data into the model and correct labels of the training data; and
     model generation means for updating a weight of at least one layer of the neural network by error back propagation based on the second error.
  2.  The model generation device according to claim 1, wherein the error output means outputs the first error for each portion of the divided training data, and
     the error calculation means calculates the second error by taking the sum or the average of the first errors for the portions of the divided training data.
  3.  The model generation device according to claim 1 or 2, wherein the error output means, after calculating the first error, discards the matrix used for calculating the first error.
  4.  The model generation device according to any one of claims 1 to 3, wherein the input means divides the training data into portions of a predetermined number of items.
  5.  The model generation device according to claim 4, wherein the predetermined number does not depend on the number of items in the training data.
  6.  The model generation device according to any one of claims 1 to 5, wherein the error output means recalculates the first error at the timing at which error back propagation is executed, and
     the model generation means executes error back propagation using the recalculated first error and the second error that the error calculation means computed based on the course of computation by which the first error is obtained.
  7.  The model generation device according to any one of claims 1 to 6, wherein the model is based on an RNN (Recurrent Neural Network) model.
  8.  The model generation device according to any one of claims 1 to 7, wherein the data input to the model is data used for natural language processing.
  9.  A model generation method for generating a learned model comprising a neural network model, the method comprising:
     a step in which input means divides training data and inputs the divided training data into the model;
     a step in which error output means outputs a first error representing a difference between data obtained by inputting the divided training data into the model and correct labels of the divided training data;
     a step in which error calculation means calculates, based on the first error, a second error representing a difference between data obtained by inputting the training data into the model and the correct labels of the training data; and
     a step in which model generation means updates a weight of at least one layer of the neural network by error back propagation based on the second error.
  10.  A program for causing a computer to function, in a model generation device that generates a learned model comprising a neural network model, as:
     input means for dividing training data and inputting the divided training data into the model;
     error output means for outputting a first error representing a difference between data obtained by inputting the divided training data into the model and correct labels of the divided training data;
     error calculation means for calculating, based on the first error, a second error representing a difference between data obtained by inputting the training data into the model and correct labels of the training data; and
     model generation means for updating a weight of at least one layer of the neural network by error back propagation based on the second error.
PCT/JP2019/011865 2018-03-22 2019-03-20 Model generation device, model generation method, and program WO2019182059A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-055087 2018-03-22
JP2018055087A JP2021119425A (en) 2018-03-22 2018-03-22 Model generation device, model generation method and program

Publications (1)

Publication Number Publication Date
WO2019182059A1 true WO2019182059A1 (en) 2019-09-26

Family

ID=67986223

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/011865 WO2019182059A1 (en) 2018-03-22 2019-03-20 Model generation device, model generation method, and program

Country Status (2)

Country Link
JP (1) JP2021119425A (en)
WO (1) WO2019182059A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151510A1 (en) * 2018-11-12 2020-05-14 Advanced Micro Devices, Inc. Adaptive batch reuse on deep memories
CN114118449B (en) * 2022-01-28 2022-10-04 深圳佑驾创新科技有限公司 Image label identification method, medium and equipment based on bias label learning model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018160180A (en) * 2017-03-23 2018-10-11 富士通株式会社 Information processing system, information processor, and method for controlling information processing system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018160180A (en) * 2017-03-23 2018-10-11 富士通株式会社 Information processing system, information processor, and method for controlling information processing system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
* Cited by examiner, † Cited by third party
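ASHITANI, TATSUJI: "To try Data-Parallel with Distributed TensorFlow", DISTRIBUTED TENSORFLOW QIITA, 2016, XP055640683, Retrieved from the Internet <URL:https://qiita.com/ashitani/items/dbe76cb9194d60ead9de> [retrieved on 20190613] *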
MORINAGA, YUYA ET AL.: "Development of hybrid type operation method of mathematical programming and machine learning for a thermal grid", DOCUMENTS OF RESEARCH GROUP OF THE INSTITUTE OF ELECTRICAL ENGINEERING OF JAPAN, 11 June 2017 (2017-06-11), pages 7 - 12 *

Also Published As

Publication number Publication date
JP2021119425A (en) 2021-08-12

Similar Documents

Publication Publication Date Title
US11308398B2 (en) Computation method
KR101959376B1 (en) Systems and methods for a multi-core optimized recurrent neural network
US20180260709A1 (en) Calculating device and method for a sparsely connected artificial neural network
US20170061279A1 (en) Updating an artificial neural network using flexible fixed point representation
US11983616B2 (en) Methods and apparatus for constructing digital circuits for performing matrix operations
US20190130268A1 (en) Tensor radix point calculation in a neural network
JP7410395B2 (en) Optimization device and optimization method
WO2019182059A1 (en) Model generation device, model generation method, and program
KR102290531B1 (en) Apparatus for Reorganizable neural network computing
CN114662646A (en) Method and device for realizing neural network
US20190130276A1 (en) Tensor manipulation within a neural network
US20210294784A1 (en) Method and apparatus with softmax approximation
US11709783B1 (en) Tensor data distribution using grid direct-memory access (DMA) controller
WO2023125857A1 (en) Model training method based on machine learning framework system and related device
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
KR20230132369A (en) Reducing resources in quantum circuits
JP2020080048A (en) Parallel processing apparatus and program
US11704562B1 (en) Architecture for virtual instructions
CN114330682A (en) Hardware architecture applied to Fastformer neural network and computing method thereof
Pochelu et al. An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks
JP2020191017A (en) Information processing device, information processing method, and information processing program
KR20200023155A (en) A method for accelerating training process of neural network and a neural network system
JP7470019B2 (en) Information Processing System
US20240095493A1 (en) Desparsified convolution for sparse tensors
TWI844228B (en) Training a neural network to perform a machine learning task

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19771724

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19771724

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP