WO2019182059A1 - Model generation device, model generation method, and program - Google Patents

Model generation device, model generation method, and program

Info

Publication number
WO2019182059A1
Authority
WO
WIPO (PCT)
Prior art keywords
error
model
training data
data
model generation
Application number
PCT/JP2019/011865
Other languages
French (fr)
Japanese (ja)
Inventor
裕也 海野 (Yuya Unno)
Original Assignee
Preferred Networks, Inc. (株式会社Preferred Networks)
Application filed by Preferred Networks, Inc.
Publication of WO2019182059A1 publication Critical patent/WO2019182059A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • the present invention relates to a model generation device, a model generation method, and a program.
  • an embodiment of the present invention proposes a model generation device, a model generation method, and a program that perform machine learning with a memory usage amount that does not depend on the mini-batch size.
  • a model generation apparatus includes an input unit, an error output unit, an error calculation unit, and a model generation unit.
  • the input means divides the training data and inputs it to the model.
  • the error output means outputs a first error representing a difference between the data acquired by inputting the divided training data into the model and the correct answer label of the divided training data.
  • the error calculation means calculates a second error representing a difference between the data acquired by inputting the training data into the model and the correct answer label of the training data.
  • the model generation means generates the learned model in which the weight of at least one layer of the neural network is updated by back propagation based on the second error.
  • diagrams showing the concept of a prediction problem (Figs. 1A and 1B).
  • a block diagram showing the functions of a machine learning device according to one embodiment (Fig. 2).
  • a flowchart showing the flow of processing according to one embodiment (Fig. 5).
  • FIG. 1A is a diagram illustrating an example of a concept of an input / output state in forward propagation for a word prediction problem.
  • the input x is a D-dimensional vector, and the product xW^T of the input x and the weight matrix W of the hidden layer is computed.
  • W^T denotes the transpose of W.
  • the weight matrix W is a V × D matrix.
  • V indicates the vocabulary size of the model, that is, the number of words to be predicted in the model.
  • an output element predicted in the model at that time point is obtained by inputting the calculation result to, for example, a softmax function in the output layer.
  • the predicted result is compared with teacher data to calculate a loss.
  • the hidden layer elements are optimized.
  • the model is optimized by repeating this forward propagation and back propagation to learn the hidden layer W, as in the sketch below.
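  • for concreteness, the forward pass of FIG. 1A can be sketched in a few lines of numpy. This is an illustrative sketch, not the patented implementation; the sizes, the loss convention, and every name other than x, W, D, and V are assumptions.

```python
import numpy as np

# Minimal sketch of the forward pass in Fig. 1A (illustrative only).
D, V = 256, 50_000                  # input dimension and vocabulary size (assumed)
W = np.random.randn(V, D) * 0.01    # hidden-layer weight matrix, V x D
x = np.random.randn(D)              # one D-dimensional input vector

logits = x @ W.T                    # xW^T, a V-dimensional score vector
p = np.exp(logits - logits.max())   # softmax, stabilized by subtracting the max
p /= p.sum()

label = 123                         # index of the correct word (teacher data)
loss = -np.log(p[label])            # loss against the teacher label
```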
  • FIG. 1A described above uses a single item of the training data; during error back propagation, the loss is calculated over one or more subsets of the training data.
  • in batch learning, all the training data is used for a single error back propagation pass, but model convergence can be slow and good results may not be obtained.
  • therefore, mini-batch learning, which uses a portion of the data at a time, is generally performed.
  • the number of data items computed at one time is called the batch size.
  • in particular, when a highly parallel processor such as a GPU is used, a small batch size prevents efficient parallel computation, so computational efficiency drops markedly. Computation is therefore performed with a reasonably large batch size to raise processor utilization.
  • in FIG. 1B, each row of X is an input x fed to the input layer.
  • for example, X is a matrix whose rows are the B inputs x_1, x_2, ..., x_i, ..., x_B.
  • the outputs for the inputs x_i are likewise collected into Y; for example, Y is a matrix whose rows are the B outputs y_1, ..., y_i, ..., y_B, where y_i is the output for input x_i.
  • in the output layer, applying the softmax function to each row of the hidden layer's output allows the computation for mini-batch size B to be processed efficiently. By averaging the rows of Y, calculating the loss from the error against the correct labels, and back-propagating that loss to optimize, the convergence and accuracy of the model can be improved.
  • the mini-batch size B can be increased to further increase the calculation efficiency.
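  • to illustrate why this formulation strains memory, the following is a sketch of the computation Y = log(softmax(XW^T)) of FIG. 1B; the sizes are assumptions, deliberately kept small so the example runs comfortably.

```python
import numpy as np

# Sketch of the mini-batch forward pass of Fig. 1B (sizes assumed).
B, D, V = 512, 256, 10_000
X = np.random.randn(B, D)               # mini-batch input, one example per row
W = np.random.randn(V, D) * 0.01

Z = X @ W.T                             # B x V logits
Z -= Z.max(axis=1, keepdims=True)       # row-wise stabilization
Y = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))  # Y = log(softmax(XW^T)), per row

labels = np.random.randint(0, V, size=B)
loss = -Y[np.arange(B), labels].mean()  # average loss over the mini-batch

# At realistic sizes (say B = 4096, V = 50_000, float64), Y alone occupies
# B * V * 8 bytes, about 1.6 GB, which is why B cannot simply be made large.
```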
  • by devising the computation between the input and the hidden layer, this embodiment makes it possible to compute on data with a large mini-batch size, for example a size that would ordinarily be difficult to store in accelerator memory.
  • the memory usage and the loss calculation method are described in detail below.
  • FIG. 2 is a block diagram illustrating functions of the machine learning device according to the present embodiment.
  • the machine learning device 1 includes an input unit 10, a control unit 12, a storage unit 14, a learning unit 16, and an output unit 18, and performs machine learning.
  • the machine learning device 1 functions as a model generation device that generates a model by machine learning.
  • the input unit 10 is an interface through which data from outside the machine learning device 1 is input, and receives input of training data, hyperparameters, and the like.
  • the input data is transmitted to the control unit 12 and processed. Further, the input data may be transmitted to the storage unit 14 and temporarily stored.
  • the control unit 12 performs control for the learning unit 16 to learn the model and control for storing data in the storage unit 14.
  • the learning unit 16 learns the model based on the input training data.
  • the learning unit 16 performs processing by transmitting and receiving data stored in the storage unit 14 at an appropriate timing.
  • the storage unit 14 stores training data input from the input unit 10.
  • the storage unit 14 may include a main storage device and an auxiliary storage device as a hardware configuration.
  • a program for operating the control unit 12, the learning unit 16, and the like may be stored.
  • the training data may first be stored in the main storage device, which has a large capacity but performs data input/output slowly, and then be transferred as needed to the auxiliary storage device, whose capacity is smaller than the main storage device but which performs data input/output quickly.
  • the output unit 18 outputs the model learned by the learning unit 16 to the outside.
  • a database (not shown) may be provided inside the machine learning device 1 and a model may be output to the database.
  • This database may be provided in the storage unit 14.
  • the machine learning device 1 may be a processing device that performs natural language processing or the like. In this case, the learned model may be stored in a necessary place as appropriate.
  • FIG. 3 is a block diagram showing the internal functions of the learning unit 16.
  • the function of the learning unit 16 is not limited to RNNs in natural language processing, and can be applied equally to other models whose loss calculation is performed on relatively large mini-batches.
  • the learning unit 16 includes a data selection unit 160, a forward propagation unit 162, an error output unit 164, an error calculation unit 166, and a back propagation unit 168.
  • the data selection unit 160 receives a control signal from the control unit 12 and selects the training data used for learning in the mini-batch. For example, the training data is randomly assigned to mini-batches of B elements. The randomly assigned data is output to the forward propagation unit 162. Alternatively, the labels of the randomly assigned data may be communicated to the forward propagation unit 162, and the forward propagation unit 162 may fetch the mini-batch's training data from the storage unit 14 when performing the computation.
  • the forward propagation unit 162 performs forward propagation in model generation and calculates a numerical value for calculating a loss in each layer.
  • the error output unit 164 refers to the output of each layer obtained by the forward propagation unit 162. This output is computed as a vector whose dimension depends on the layer, or as a matrix aggregating a predetermined number of such vectors as rows.
  • the error output unit 164 calculates the difference (first difference) from the correct labels for the output vector or matrix of each layer output by the forward propagation unit 162. By applying, for example, a softmax function to the calculated first difference, it computes the loss (first error) for a predetermined number (b) of input data items within the mini-batch, that is, within at least a portion (B items) of the training data.
  • the error calculation unit 166 calculates the mini-batch loss (second error), which is the second difference, by calculating the sum or average of the first errors. That is, the first error indicates a difference between the model output value and the label value of b pieces of data, and the second error indicates a loss in the mini-batch (B pieces).
  • FIG. 4 is a diagram showing an outline of the mini-batch processing in the present embodiment.
  • the input for one batch is divided into N groups of a predetermined size b smaller than B.
  • when B is divisible by b, B = N × b and all sub-matrices X_n are b × D matrices.
  • b may be defined as a predetermined number smaller than B that does not depend on B.
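  • a minimal sketch of this division, assuming plain row slicing (names and sizes are illustrative):

```python
import numpy as np

B, D, b = 4096, 256, 512      # b: predetermined group size, independent of B
X = np.random.randn(B, D)

# Split the mini-batch into sub-matrices X_1 ... X_N of at most b rows each.
subbatches = [X[i:i + b] for i in range(0, B, b)]
N = len(subbatches)           # ceil(B / b); the last group may have fewer rows
```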
  • V is the vocabulary size and takes values of roughly 50,000 to 100,000. Consequently, when the batch size B is enlarged, it is difficult to hold in memory at the same time all the elements of the matrices shown in FIG. 1B, in particular the matrix Y of B × V elements to be obtained in the output layer, and it becomes difficult to make the batch size B sufficiently large.
  • by performing this computation sequentially for n = 1, 2, ..., N, the error output unit 164 calculates the first errors L_1, L_2, ..., L_N.
  • the error calculation unit 166 calculates a second error that is a loss in the mini-batch by calculating the sum or average of the first errors.
  • each time a first error L_n has been calculated, the sub-matrix Y_n is discarded; the second error corresponding to the B × V matrix Y can thus be calculated without ever obtaining Y directly, even when the mini-batch size B is large.
  • the sub-matrix Y_n need not be discarded; for example, an area of b × V × (bits per element) may be allocated once and reused. Furthermore, it is not necessary to store all the first errors L_1 and so on: a variable for the loss L may be prepared, and the first errors L_n computed from the sub-matrices Y_n summed into it sequentially, as in the sketch below.
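  • such sequential accumulation might look like the following sketch, in which only one b × V block is ever alive; the function name and the mean-loss convention are assumptions.

```python
import numpy as np

def minibatch_loss(X, W, labels, b):
    """Accumulate the per-group losses L_n so the full B x V matrix Y never exists.

    Only one b x V block of logits is alive at a time; each Y_n is dropped
    (or could be overwritten in a reused buffer) once its L_n is taken.
    """
    B = X.shape[0]
    L = 0.0                                      # the running loss variable
    for i in range(0, B, b):
        Xn, yn = X[i:i + b], labels[i:i + b]
        Zn = Xn @ W.T                            # b x V, not B x V
        Zn -= Zn.max(axis=1, keepdims=True)
        Yn = Zn - np.log(np.exp(Zn).sum(axis=1, keepdims=True))
        L += -Yn[np.arange(len(yn)), yn].sum()   # first error L_n
    return L / B                                 # second error: mean over the mini-batch
```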
  • the loss L may also be obtained by computing each sub-matrix Y_n on a separate computation core such as a GPU, obtaining the first errors L_n in parallel, and taking their sum.
  • the back propagation unit 168 optimizes the model by performing error back propagation based on the second error calculated by the forward propagation unit 162.
  • the error calculation unit 166 recalculates and outputs, at whatever timing it is required during error back propagation, the first error (the partial error for sub-matrix X_n) for each group of the predetermined number b within the mini-batch.
  • for example, when the back propagation unit 168 performs back propagation for each sub-matrix X_n, the sub-matrix Y_n is again needed in order to calculate the first error, that is, the error for that sub-matrix X_n.
  • by recalculating Y_n, the error calculation unit 166 obtains the Y_n that the error output unit 164 had calculated, and calculates the first error.
  • the memory needed for the recalculation only requires securing an area in which one group can be computed, so, as with the computation described above, it can be carried out if an area of b × V × (bits per element) is secured. Considering that b < B, it can be seen that the amount of memory used is reduced.
  • for example, when an accelerator such as a GPU is used for learning, the overall computational cost of not being able to make the batch size B sufficiently large exceeds the cost of recomputing Y_n. Therefore, by recomputing Y_n, that is, by recomputing the difference over the b data items as a partial error (first error), the computational cost of the mini-batch as a whole, and hence of training as a whole, is reduced, even though extra computation is spent obtaining the first errors within the mini-batch.
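  • a sketch of this recomputation during back propagation, hand-written for the output layer under the same assumptions as the earlier sketches; the softmax cross-entropy gradient used here is the standard one, which the text does not spell out.

```python
import numpy as np

def backward_with_recompute(X, W, labels, b):
    """Accumulate dL/dW group by group, recomputing each Y_n on demand.

    The b x V softmax output is rebuilt from X_n in the backward pass
    instead of being kept from the forward pass, so memory stays
    O(b * V) rather than O(B * V).
    """
    B = X.shape[0]
    dW = np.zeros_like(W)
    for i in range(0, B, b):
        Xn, yn = X[i:i + b], labels[i:i + b]
        Zn = Xn @ W.T                      # recomputation of the forward pass
        Zn -= Zn.max(axis=1, keepdims=True)
        Pn = np.exp(Zn)
        Pn /= Pn.sum(axis=1, keepdims=True)
        Pn[np.arange(len(yn)), yn] -= 1.0  # dL_n/dZ_n for mean cross-entropy
        dW += (Pn / B).T @ Xn              # accumulate this group's contribution
    return dW
```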
  • the error calculation unit 166 does not need to be provided independently, and the error output unit 164 may have a function of the error calculation unit 166 and may not include the error calculation unit 166.
  • the forward propagation unit 162 and the back propagation unit 168 may function as a single unit, and may function as a model generation unit that generates a model by performing forward propagation and back propagation. Further, this back propagation of error can be similarly applied not only to the output layer but also to an intermediate layer.
  • the second error may be calculated for the sub-matrix X_n using the result of forward propagation up to an intermediate layer and the result of back propagation down to that layer, and the back propagation through that layer may then be executed. In this case, a softmax function or the like is not necessarily needed to calculate the first error.
  • the error calculation of the present embodiment can be executed for at least one layer constituting the network.
  • thereafter, when computation continues with the same mini-batch, the forward propagation unit 162 takes over; when moving on to the next mini-batch or to the next epoch, the data selection unit 160 takes over.
  • FIG. 5 is a flowchart illustrating an example of a processing flow according to the present embodiment.
  • training data is input to the machine learning device 1 via the input unit 10 (S100).
  • the input training data is stored in the storage unit 14 as necessary.
  • Necessary information such as the number of data is output to the control unit 12. Thereafter, the control unit 12 controls processing necessary for learning, such as learning of the learning unit 16 and data transmission of the storage unit 14.
  • the data selection unit 160 of the learning unit 16 randomly assigns data to mini-batches of the predetermined size B and selects the data from which a mini-batch is generated (S102).
  • the mini-batch size B may be a preset parameter, or may be input to the control unit 12 via the input unit 10 as a hyperparameter.
  • when input as a hyperparameter, the control unit 12 generates mini-batches so that the data selection unit 160 selects data for each mini-batch size B.
  • This step may be a step of distributing data in advance for each mini-batch instead of selecting data for each mini-batch.
  • the data selection may be, for example, a process of reading the selected data into an easily accessible memory, or a process of outputting the index of the selected data to the forward propagation unit 162 and the back propagation unit 168.
  • the data necessary for the calculation may be read into an easily accessible memory at the timing when the processing is performed by the forward propagation unit 162 or the like.
  • other optimization processes such as loop unrolling and software pipelining may be performed.
  • the forward propagation unit 162 forward-propagates the training data in the mini-batch group by group, the error output unit 164 calculates a first error for each forward-propagated group of partial data, and the second error for the mini-batch as a whole is calculated from the per-group first errors (S104).
  • the back propagation unit 168 performs error back propagation using the first error or the second error calculated in S104, and the model is optimized (S106).
  • when the outputs of each layer for the partial data are needed during back propagation, they may be recalculated for each back propagation step over the partial data.
  • the first error may be calculated via this recalculation, and error back propagation performed using that first error. That is, in each layer, the second difference between the layer's output and the error back-propagated from the following layer may be obtained as a first error and back-propagated to the preceding layer as such. In this way, a first error may be obtained in each layer.
  • in back propagation, likewise, the second error for the mini-batch as a whole may be calculated. The back propagation unit 168 thus performs back propagation using the second error and the process (function) by which the first errors corresponding to the second error were obtained in forward propagation.
  • the output unit 18 outputs the learned model and finishes the process (S112).
  • the end of learning is judged according to conditions such as the value of the loss, for example the second error at the output layer, falling below a predetermined value; the computation of a predetermined number of epochs being completed; or the evaluation value on validation data exceeding a predetermined value.
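  • putting the flowchart together, the following is a simplified training-loop sketch; it reuses the minibatch_loss and backward_with_recompute sketches above, and the SGD update and the concrete stopping test are assumptions, not the patent's prescription.

```python
import numpy as np

def train(X_all, y_all, W, B, b, lr=0.1, max_epochs=10, tol=1e-3):
    """Simplified training loop following the flowchart S100 to S112.

    minibatch_loss and backward_with_recompute are the group-wise
    sketches given earlier in this document.
    """
    n = len(X_all)
    for epoch in range(max_epochs):                        # S110: epoch budget
        order = np.random.permutation(n)                   # S102: random assignment
        for s in range(0, n, B):
            idx = order[s:s + B]
            X, y = X_all[idx], y_all[idx]
            loss = minibatch_loss(X, W, y, b)              # S104: group-wise forward
            W -= lr * backward_with_recompute(X, W, y, b)  # S106: backprop with recompute
            if loss < tol:                                 # one possible end condition
                return W
    return W
```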
  • the model may be stored in the storage unit 14, and the machine learning device 1 may then function as, for example, a natural language processing device using the model. Further, the output unit 18 may output the model stored in the storage unit 14.
  • FIG. 6 is a diagram illustrating a hardware implementation example of the present embodiment.
  • the machine learning device 1 includes a CPU 200, an accelerator 202, a main storage device 204, an auxiliary storage device 206, a network interface 208, and a device interface 210. Each of these devices is connected by a bus 212.
  • a CPU (Central Processing Unit) 200 is a processor that operates the machine learning device 1 and operates the machine learning device 1 based on a program stored in the main storage device 204, for example.
  • the accelerator 202 is a device for assisting arithmetic processing, and includes, for example, a GPU.
  • the GPU speeds up the numerical calculation by GPGPU (General-Purpose computing on GPU).
  • the accelerator 202 may be provided with a memory of its own, as an auxiliary storage device in itself, whose stored data it can access at high speed. When exchanging data with the main storage device 204 or the auxiliary storage device 206, necessary data may be prefetched from these storage devices to the memory on the accelerator 202 so that this high-speed access can be exploited to the fullest.
  • the main storage device 204 is directly connected to the CPU 200 via a main bus or the like, and mainly stores programs and the like necessary for the operation of the machine learning device 1.
  • the main storage device 204 includes, for example, DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory).
  • the auxiliary storage device 206 is slower in throughput than the main storage device 204 but has a larger-capacity memory.
  • the auxiliary storage device 206 does not need to be in the same computer as the computer in which the machine learning device 1 is configured, and may be installed outside.
  • the training data may be stored in the auxiliary storage device 206, transferred to the CPU 200, the accelerator 202, and the main storage device 204 via the bus 212 and used.
  • the network interface 208 is an interface that connects the external network 300 and the machine learning device 1.
  • the device interface 210 is an interface that connects the external device 400 and the machine learning device 1.
  • the external device 400 may be connected to the machine learning device 1 via the network 300 and the network interface 208.
  • the calculation in this embodiment is mainly executed on the accelerator 202.
  • although the arithmetic throughput of the accelerator 202 is higher than that of the CPU 200, the capacity of the memory mounted on the accelerator 202 is often smaller than that of the main storage device 204 and the auxiliary storage device 206.
  • the data in that memory can be accessed at high speed from the processor in the accelerator 202, while access from the processor in the accelerator 202 to the main storage device 204 and the auxiliary storage device 206 is often slow.
  • the number of vectors to be processed at the same timing in the partial data may be determined based on the memory capacity on the accelerator 202.
  • b may be set so that the memory remaining after excluding the program needed to operate the accelerator 202's processor and the buffers for input data and the like can hold b × V elements.
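  • one way such a setting of b might be computed, as a sketch; the reserve size and the helper name are assumptions:

```python
def choose_b(free_bytes, V, bytes_per_element=4, reserve_bytes=512 * 2**20):
    """Pick the group size b from the accelerator's memory budget (a sketch).

    reserve_bytes stands in for the program and input buffers the text says
    must be excluded; the remainder is spent on one b x V logit buffer.
    """
    usable = free_bytes - reserve_bytes
    return max(1, usable // (V * bytes_per_element))

# e.g. a 16 GB accelerator, V = 100_000, float32:
# choose_b(16 * 2**30, 100_000) -> roughly 41,000 rows
```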
  • an accurate model is learned by calculating a partial loss (first error or second error) for each predetermined number of data items that does not depend on the mini-batch size. A high calculation speed can therefore be secured without reducing the mini-batch size. As described above, the loss calculation and error back propagation can be performed within a memory capacity that does not depend on the mini-batch size, so the computing resources of the computer can be used efficiently.
  • in the above description, the data selection unit 160 inputs mini-batch-sized data to the model generation means comprising the forward propagation unit 162 and the back propagation unit 168, and the forward propagation unit 162 and the back propagation unit 168 process the data in groups of the predetermined size.
  • however, the present invention is not limited to this.
  • for example, the data selection unit 160 may include a data division unit that divides the data into groups of b items, and this data division unit may operate as input means that inputs the b items of data to the model generation means.
  • the forward propagation unit 162 obtains the output from the output layer by calculation for the b pieces of data input from the input unit to the network.
  • the error output unit 164 outputs a first error based on the first difference between the output corresponding to the b pieces of data and the correct answer label. This error is stored, and the first error is similarly output for the next b pieces inputted from the input means.
  • the error calculation unit 166 calculates a second error, which is an error of the mini-batch.
  • the back propagation unit 168 outputs a first error via the error output unit 164 based on the b items of data input from the input means, the output of each layer, and the back-propagated output. Then, when the output of first errors has been completed for the data of the mini-batch, the error calculation unit 166 calculates the second error at the layer of interest.
  • a data dividing unit that divides the data into b pieces of data may be provided as input means.
  • the division may, for example, allow a certain amount of fluctuation around the predetermined number b. As another example, it may be changed dynamically according to the memory usage rate. Thus, the division is not limited to groups of a fixed predetermined number; it may be changed appropriately according to the situation, and various other division methods may be used.
  • RNN learning in natural language processing has been described as an example.
  • the present invention is not limited to this, and can also be applied to learning in other neural networks that require large data areas when performing loss calculations.
  • it can be used not only for MLP and CNN but also for LSTM (Long Short-Term Memory) networks.
  • the generated model is a model that performs natural language processing.
  • the present invention is not limited to this, and the machine learning device 1 may generate models that process various other kinds of data for other purposes.
  • the softmax function and the like appearing in the above description are shown as examples, and other implementations may be used.
  • instead of the softmax function, another function suited to obtaining gradients, such as a sigmoid function or ReLU (Rectified Linear Unit), may be used. An appropriate function can likewise be selected wherever functions are used elsewhere.
  • the control unit 12 may be a control circuit implemented as an analog or digital circuit, or by an FPGA (Field Programmable Gate Array), an ASIC (Application Specific Integrated Circuit), or the like.
  • learning unit 16 may be implemented by a circuit.
  • the machine learning device 1 may be implemented in hardware, or in software with a CPU or the like carrying out the information processing of the software.
  • a program that realizes at least part of the functions of the machine learning device 1 may be stored in a storage medium such as a flexible disk or CD-ROM, and read and executed by a computer.
  • the storage medium is not limited to a removable medium such as a magnetic disk or an optical disk, but may be a fixed storage medium such as a hard disk device or a memory. That is, information processing by software may be specifically implemented using hardware resources.
  • the processing by software may be implemented in a circuit such as an FPGA and executed by hardware.
  • the generation of the model and the processing after inputting the model may be performed using an accelerator such as a GPU, for example.
  • a processing circuit such as a CPU, a storage device such as a memory, and other necessary hardware may each be provided singly, or a plurality of at least one of them may be provided.
  • the model generated by the machine learning device 1 can be used as a program module that is a part of the artificial intelligence software.
  • the CPU of the computer operates based on the model stored in the storage unit so as to perform an operation and output the result.
  • the present invention can be applied to natural language processing as an example.
  • besides natural language processing, it is also possible to handle a neural network that takes an image as input.
  • a plurality of input images can be divided into mini-batches and applied in the same manner.
  • the pixels in the image may be divided and the calculation may be performed for each of the divided pixels as described above.
  • the present invention can be applied to other types of data as long as mini-batch processing and division within the data can be appropriately performed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention realizes machine learning in memory usage amounts independent of mini-batch size. This model generation device generates a pre-trained model comprising a neural network model, and comprises an input means, an error output means, an error calculation means, and a model generation means. The input means divides training data and inputs the same into the model. The error output means outputs first errors representing the difference between data acquired by inputting the divided training data into the model and correct answer labels of the divided training data. On the basis of the first errors, the error calculation means calculates second errors representing the difference between data acquired by inputting the training data into the model and correct answer labels of the training data. On the basis of the second errors, the model generation means generates the pre-trained model in which the weight of at least one layer of the neural network has been updated by backward propagation of errors.

Description

Model generation apparatus, model generation method and program

The present invention relates to a model generation device, a model generation method, and a program.
In deep learning, the amount of computation, and the memory usage that accompanies the data volume, increase or decrease with the training data. Deep learning in the field of natural language processing often reduces to word prediction problems in tasks such as machine translation, summarization, or language modeling. Such word prediction problems are large-scale prediction problems over vocabularies of tens of thousands to hundreds of thousands of words, and therefore require enormous memory capacity. On the other hand, to improve computational efficiency when operating on mini-batches, the number of data items processed at once (the mini-batch size) must be made large. Since memory consumption is generally proportional to both, the mini-batch size ends up being reduced and computational efficiency sacrificed in order for the computation to run at all. That is, computational efficiency can be improved by increasing the batch size, but memory capacity forces the batch size down, while reducing the batch size lowers the computational efficiency of training as a whole.

Methods that perform recomputation and methods that compute the loss one item at a time have been studied, but because they do not solve the problem that memory consumption is proportional to the vocabulary size, their memory-reduction effect is limited. Methods that raise computational efficiency even at small batch sizes by reducing memory reads have also been studied, but they are not generally applicable and their range of application is limited.
Therefore, an embodiment of the present invention proposes a model generation device, a model generation method, and a program that perform machine learning with a memory usage that does not depend on the mini-batch size.

A model generation apparatus according to one embodiment includes input means, error output means, error calculation means, and model generation means. The input means divides training data and inputs it to the model. The error output means outputs a first error representing the difference between data obtained by inputting the divided training data into the model and the correct labels of the divided training data. Based on the first errors, the error calculation means calculates a second error representing the difference between data obtained by inputting the training data into the model and the correct labels of the training data. Based on the second error, the model generation means generates a trained model in which the weights of at least one layer of the neural network have been updated by error back propagation.

According to one embodiment, machine learning can be performed with a memory usage that does not depend on the mini-batch size.
Fig. 1A: a diagram showing the concept of a prediction problem.
Fig. 1B: a diagram showing the concept of a prediction problem.
Fig. 2: a block diagram showing the functions of a machine learning device according to one embodiment.
Fig. 3: a block diagram showing the functions of a learning unit according to one embodiment.
Fig. 4: a diagram showing the concept of the arithmetic processing according to one embodiment.
Fig. 5: a flowchart showing the flow of processing according to one embodiment.
Fig. 6: a diagram showing an example of a hardware configuration according to one embodiment.
First, an example of the learning to which this embodiment is applied will be described. In this embodiment, learning is performed, for example, as deep machine learning in the natural language processing field using an RNN (Recurrent Neural Network) technique. Processing in the natural language processing field often reduces to large-scale word prediction problems. The following description takes as an example the computation of a network reduced to a prediction problem; however, this embodiment is not limited to this case and can also be applied when machine learning that handles big data is performed with other techniques such as MLP (Multilayer Perceptron) or CNN (Convolutional Neural Network).
An outline of the computation performed in this embodiment will now be given. FIG. 1A is a diagram showing the concept of an example of the input/output state in forward propagation for a word prediction problem. The input x is a D-dimensional vector, and the product xW^T of the input x and the weight matrix W of the hidden layer is computed. Here, W^T denotes the transpose of W. The weight matrix W is a V × D matrix, where V is the vocabulary size of the model, that is, the number of words the model predicts over.

When the product of the input layer and the hidden layer has been calculated, the output layer feeds the result into, for example, a softmax function to obtain the output elements predicted by the model at that point. The predicted result is compared with the teacher data to calculate a loss. By calculating the loss and back-propagating the error, the elements of the hidden layer are optimized. By repeating this forward propagation and back propagation to learn the hidden layer W, the model is optimized. Although only one hidden layer is illustrated, there may be multiple hidden layers.
FIG. 1A described above uses a single item of the training data; during error back propagation, the loss is calculated over one or more subsets of the training data. Batch learning uses all the training data for a single error back propagation pass, but model convergence can be slow and good results may not be obtained, so mini-batch learning, which uses a portion of the data at a time, is generally performed. The number of data items computed at once is called the batch size. In particular, when a highly parallel processor such as a GPU is used, a small batch size prevents efficient parallel computation and computational efficiency drops markedly. Computation is therefore performed with a reasonably large batch size to raise processor utilization.
FIG. 1B is a diagram showing the concept of an example of the input/output state in forward propagation with mini-batch learning. In mini-batch learning, for a mini-batch size B, the number of items in the mini-batch, the input-layer data is taken as a B × D input matrix X, the B × V matrix XW^T is computed, the softmax function is applied per batch item (per row), and the logarithm is taken, yielding Y = log(softmax(XW^T)).

In this case, each row of X is an input x fed to the input layer of FIG. 1B; for example, X is a matrix whose rows are the B inputs x_1, x_2, ..., x_i, ..., x_B. The outputs for the inputs x_i are likewise collected into Y; for example, Y is a matrix whose rows are the B outputs y_1, ..., y_i, ..., y_B, where y_i is the output for input x_i.

In the output layer, applying the softmax function to each row of the hidden layer's output allows the computation for mini-batch size B to be processed efficiently. By averaging the rows of Y, calculating the loss from the error against the correct labels, and back-propagating that loss to optimize, the convergence and accuracy of the model can be improved.
In learning with a large number of words, such as natural language processing, or in learning on big data, the mini-batch size B can be enlarged to further raise computational efficiency. By devising the computation between the input and the hidden layer, this embodiment makes it possible to compute on data with a large mini-batch size, for example a size that would ordinarily be difficult to store in accelerator memory. The memory usage and the loss calculation method are described in detail below.
FIG. 2 is a block diagram showing the functions of the machine learning device according to this embodiment. The machine learning device 1 includes an input unit 10, a control unit 12, a storage unit 14, a learning unit 16, and an output unit 18, and is a device that performs machine learning. The machine learning device 1 functions as a model generation device that generates a model by machine learning.

The input unit 10 is an interface through which data from outside the machine learning device 1 is input, and receives input of training data, hyperparameters, and the like. The input data is transmitted to the control unit 12 and processed. The input data may also be transmitted to the storage unit 14 and stored temporarily.

The control unit 12 controls the learning of the model by the learning unit 16 and controls the storing of data in the storage unit 14.

The learning unit 16 learns the model based on the input training data. The learning unit 16 carries out its processing by sending and receiving the data stored in the storage unit 14 at appropriate timings.

The storage unit 14 stores the training data and the like input from the input unit 10. As a hardware configuration, the storage unit 14 may comprise a main storage device and an auxiliary storage device. When a main storage device is provided, programs for operating the control unit 12, the learning unit 16, and so on may be stored in it. The training data may first be stored in the main storage device, which has a large capacity but performs data input/output slowly, and then be transferred as needed to the auxiliary storage device, whose capacity is smaller than the main storage device but which performs data input/output quickly.

The output unit 18 outputs the model learned by the learning unit 16 to the outside. As another example, a database (not shown) may be provided inside the machine learning device 1 and the model output to that database; this database may be provided in the storage unit 14. As yet another example, the machine learning device 1 may itself be a processing device that performs natural language processing or the like, in which case the learned model may be stored wherever it is needed.
FIG. 3 is a block diagram showing the internal functions of the learning unit 16. As described above, the functions of the learning unit 16 are not limited to RNNs in natural language processing and can be applied equally to other models whose loss calculation is performed on relatively large mini-batches.

The learning unit 16 includes a data selection unit 160, a forward propagation unit 162, an error output unit 164, an error calculation unit 166, and a back propagation unit 168.

The data selection unit 160 receives control signals and the like from the control unit 12 and selects the training data used for learning in the mini-batch. For example, it randomly assigns the training data to mini-batches of B elements. The randomly assigned data is output to the forward propagation unit 162. Alternatively, the labels of the randomly assigned data may be communicated to the forward propagation unit 162, and the forward propagation unit 162 may fetch the mini-batch's training data from the storage unit 14 when performing the computation.
The forward propagation unit 162 performs forward propagation in model generation and computes the numerical values used to calculate the loss in each layer. The output of each layer obtained by the forward propagation unit 162 is referred to by the error output unit 164. This output is computed as a vector whose dimension depends on the layer, or as a matrix aggregating a predetermined number of such vectors as rows.

The error output unit 164 calculates the difference (first difference) from the correct labels for the output vector or matrix of each layer output by the forward propagation unit 162. By applying, for example, a softmax function to the calculated first difference, it computes the loss (first error) for a predetermined number (b) of input data items within the mini-batch, that is, within at least a portion (B items) of the training data.

The error calculation unit 166 computes the mini-batch loss (second error), which is the second difference, by taking the sum or average of the first errors. That is, a first error indicates the difference between the model's output values and the label values for b data items, and the second error indicates the loss over the mini-batch (B items).
Here, the mini-batch processing in this embodiment will be described with reference to FIG. 4. FIG. 4 is a diagram showing an outline of the mini-batch processing in this embodiment.

For the input X, as described above, the input for one batch is divided into N groups of a predetermined size b smaller than B. For example, the predetermined size b is floor(B/N). That is, the input matrix X is divided into sub-matrices X_n such that X = [X_1^T X_2^T ... X_N^T]^T. When B is divisible by b, B = N × b and all sub-matrices X_n are b × D matrices. When B is not divisible by b, writing the remainder as b', B = (N - 1) × b + b', the sub-matrices X_1 through X_{N-1} are b × D matrices, and the sub-matrix X_N is a b' × D matrix. b may also be defined as a predetermined number smaller than B that does not depend on B.

As described above, taking a large batch size B can improve the accuracy and efficiency of learning. On the other hand, the memory capacity usable by the learning unit 16 is limited by the memory usage of the learning unit 16 and the other units. V is the vocabulary size and takes values of roughly 50,000 to 100,000. Consequently, when the batch size B is enlarged, it is difficult to hold in memory at the same time all the elements of the matrices shown in FIG. 1B, in particular the matrix Y of B × V elements to be obtained in the output layer, and it becomes difficult to make the batch size B sufficiently large.

As shown in FIG. 4, by dividing the input matrix X into N groups of the predetermined size b, the loss can be calculated within the above memory limit even when the batch size B is enlarged.
In the forward propagation unit 162, for each input matrix X_n obtained by splitting X every b rows, X_n W^T is computed as the hidden-layer output shown in FIG. 1B. By applying the softmax function to the computed X_n W^T, the matrix Y_n with b × V elements, a sub-matrix of the matrix Y shown in FIG. 1B, is computed. Writing y_i for the teacher-vector label of each input vector x_i, the first error L_n of the output for the partial input X_n is computed by taking the sum or average of Y_n[i, y_i] over the sub-matrix. That is, viewed from the mini-batch, a first error represents the partial loss for the partial input X_n.
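As a sketch, the formation of one first error L_n from Y_n[i, y_i] can be isolated as follows; the function name and the negative-mean convention are assumptions, since the text itself only specifies the sum or average of Y_n[i, y_i].

```python
import numpy as np

def first_error(Xn, W, yn):
    """One first error L_n from a sub-matrix X_n (a sketch).

    Y_n = log(softmax(X_n W^T)) row-wise; L_n gathers Y_n[i, y_i] and
    averages, here with the usual negative sign so lower is better.
    """
    Zn = Xn @ W.T                              # b x V hidden-layer output
    Zn -= Zn.max(axis=1, keepdims=True)
    Yn = Zn - np.log(np.exp(Zn).sum(axis=1, keepdims=True))
    return -Yn[np.arange(len(yn)), yn].mean()  # gather Y_n[i, y_i], then average
```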
By performing this computation sequentially for n = 1, 2, ..., N, the error output unit 164 computes the first errors L_1, L_2, ..., L_N. The error calculation unit 166 computes the second error, the loss within the mini-batch, by taking the sum or average of these first errors. By discarding the sub-matrix Y_n each time a first error L_n has been computed, the second error corresponding to the B × V matrix Y can be computed without ever obtaining Y directly, even when the mini-batch size B is large.

The sub-matrix Y_n need not be discarded; for example, an area of b × V × (bits per element) may be allocated once and reused. Furthermore, it is not necessary to store all the first errors L_1 and so on: a variable for the loss L may be prepared, and the first errors L_n computed from the sub-matrices Y_n summed into it sequentially. Of course, the loss L may also be obtained by computing each sub-matrix Y_n on a separate computation core such as a GPU, obtaining the first errors L_n in parallel, and taking their sum.
Returning to FIG. 3, the back propagation unit 168 optimizes the model by performing error back propagation based on the second error calculated by the forward propagation unit 162.

The error calculation unit 166 recalculates and outputs, at whatever timing it is required during error back propagation, the first error (the partial error for sub-matrix X_n) for each group of the predetermined number b within the mini-batch. For example, when the back propagation unit 168 performs back propagation for each sub-matrix X_n, the sub-matrix Y_n is again needed in order to calculate the first error, that is, the error for that sub-matrix X_n. By recalculating Y_n, the error calculation unit 166 obtains the Y_n that the error output unit 164 had calculated, and calculates the first error.

The memory needed for the recalculation only requires securing an area in which each group can be computed, so, as with the computation described above, it can be carried out if an area of b × V × (bits per element) is secured. Considering that b < B, it can be seen that the amount of memory used can be reduced.

For example, when an accelerator typified by a GPU (Graphics Processing Unit) is used for learning, the overall computational cost of not being able to make the batch size B sufficiently large exceeds the cost of recomputing Y_n. Therefore, by recomputing Y_n, that is, by recomputing the difference over the b data items as a partial error (first error), the computational cost of the mini-batch as a whole, and hence of training as a whole, can be reduced, even though extra computation is spent obtaining the first errors within the mini-batch.
Note that the error calculation unit 166 need not be provided independently; the error output unit 164 may incorporate the functions of the error calculation unit 166, with no separate error calculation unit 166. The forward propagation unit 162 and the back propagation unit 168 may together function as a single means, a model generation unit that generates the model by performing forward propagation and back propagation. This error back propagation can be applied not only to the output layer but equally to intermediate layers. For example, the second error may be computed for the sub-matrix X_n using the result of forward propagation up to an intermediate layer and the result of back propagation down to that layer, and the back propagation through that layer may then be executed. In this case, a softmax function or the like is not necessarily needed to compute the first error. Thus, the error calculation of this embodiment can be carried out for at least one layer of the network.

Thereafter, when computation continues with the same mini-batch, the forward propagation unit 162 takes over; when moving on to the next mini-batch or to the next epoch, the data selection unit 160 takes over.
 図5は、本実施形態に係る処理の流れの一例を示すフローチャートである。 FIG. 5 is a flowchart illustrating an example of a processing flow according to the present embodiment.
 まず、入力部10を介して機械学習装置1に訓練データが入力される(S100)。入力された訓練データは、必要に応じて記憶部14に記憶される。データ数等、必要となる情報は、制御部12へと出力される。その後、制御部12は、学習部16の学習及び記憶部14のデータ送信等、学習に必要な処理を制御する。 First, training data is input to the machine learning device 1 via the input unit 10 (S100). The input training data is stored in the storage unit 14 as necessary. Necessary information such as the number of data is output to the control unit 12. Thereafter, the control unit 12 controls processing necessary for learning, such as learning of the learning unit 16 and data transmission of the storage unit 14.
 次に、学習部16のデータ選択部160は、所定のミニバッチサイズBごとにデータをランダムに振り分け、ミニバッチが生成されるデータを選択する(S102)。ミニバッチサイズBは、あらかじめ設定されているパラメータであってもよくハイパーパラメータとして入力部10を介して制御部12へと入力されるものであってもよい。ハイパーパラメータとして入力された場合、制御部12は、データ選択部160が当該ミニバッチサイズBごとにデータを選択するようにミニバッチを生成する。 Next, the data selection unit 160 of the learning unit 16 randomly distributes data for each predetermined mini-batch size B and selects data for which a mini-batch is generated (S102). The mini-batch size B may be a preset parameter or may be input to the control unit 12 via the input unit 10 as a hyper parameter. When input as a hyperparameter, the control unit 12 generates a mini-batch so that the data selection unit 160 selects data for each mini-batch size B.
 This step may distribute the data into mini-batches in advance rather than selecting data each time a mini-batch is needed. Data selection may, for example, load the selected data into easily accessible memory, or may output the indices of the selected data to the forward propagation unit 162 and the back propagation unit 168; in the latter case, the data needed for a computation may be loaded into easily accessible memory at the time the forward propagation unit 162 or the like performs its processing. At execution time, other optimizations such as loop unrolling and software pipelining may also be applied.
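 A hedged sketch of this index-based selection, with the function name and the use of NumPy as assumptions rather than anything taken from the specification, might look as follows; returning indices defers the actual memory loads until the forward propagation unit 162 or the back propagation unit 168 needs each portion:

```python
import numpy as np

def iter_minibatch_indices(n_samples, B, seed=0):
    """Yield arrays of at most B randomly assigned sample indices (S102)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)  # random distribution of the data
    for i in range(0, n_samples, B):
        yield order[i:i + B]            # indices only; loading happens later
```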
 Next, the forward propagation unit 162 forward-propagates the training data in the mini-batch portion by portion, and the error output unit 164 computes a first error for each forward-propagated portion of the data; based on the first errors of the individual portions, the second error for the mini-batch as a whole is computed (S104).
 Next, the back propagation unit 168 performs error back propagation using the first errors or the second error computed in S104, and the model is optimized (S106). If the outputs of the partial data at each layer are needed while back propagation is executed, they may be recomputed for each back propagation step over the partial data. The first error may be computed through this recomputation, and error back propagation may be performed using that first error. That is, at each layer, a second difference, between the output of that layer and the error back-propagated from the following layer, may be obtained as the first error and back-propagated to the preceding layer. In this way, a first error may be obtained at each layer. Likewise, in back propagation, the second error for the mini-batch as a whole may be computed. The back propagation unit 168 thus executes back propagation using the second error together with the course of computation (the function) by which the first errors corresponding to that second error were obtained through forward propagation.
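 One way to realize this recomputation during back propagation, sketched here with invented names and under the assumption of a softmax cross-entropy output layer rather than as the device's actual code, is to accumulate the weight gradient block by block while recomputing each b x V block of probabilities instead of retaining it from the forward pass:

```python
import numpy as np

def minibatch_grad(X, labels, W, b=64):
    """Gradient of the mean cross-entropy w.r.t. W, accumulated in blocks
    of b rows; the (b, V) probabilities are recomputed here instead of
    being stored during the forward pass."""
    B = X.shape[0]
    grad_W = np.zeros_like(W)                           # (V, D)
    for i in range(0, B, b):
        Xc = X[i:i + b]                                 # (b, D)
        z = Xc @ W.T
        z -= z.max(axis=1, keepdims=True)               # numerical stability
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)               # recomputed (b, V) block
        p[np.arange(len(Xc)), labels[i:i + b]] -= 1.0   # dL/dlogits for this block
        grad_W += p.T @ Xc                              # accumulate (V, D) contribution
    return grad_W / B
```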
 Next, it is determined whether processing of the current mini-batch is complete (S108). If processing of the mini-batch is not complete (S108: NO), that is, if further learning is performed using the same mini-batch, the flow returns to S104 and processing continues (S104 to S106).
 On the other hand, if processing of the mini-batch is complete (S108: YES), it is determined whether learning is complete (S110). If learning is not complete (S110: NO), data for the next mini-batch is selected and the next mini-batch is processed (S102 to S108).
 On the other hand, if learning is complete (S110: YES), the output unit 18 outputs the learned model and the processing ends (S112). The end of learning is judged according to conditions such as: the loss, for example the second error at the output layer, has fallen below a predetermined value; computation for a predetermined number of epochs has finished; or the evaluation value in validation has exceeded a predetermined value. Instead of outputting the model externally via the output unit 18, the model may be stored in the storage unit 14 so that the machine learning device 1 functions, for example, as a natural language processing device using that model. The output unit 18 may also output the model stored in the storage unit 14.
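 Purely as an illustrative outline of the flow of FIG. 5, with a hypothetical callable `step` standing in for the chunked forward and backward processing of S104 to S106, the loop and its termination conditions could be sketched as:

```python
import numpy as np

def train(step, n_samples, B, max_epochs, loss_threshold, seed=0):
    """Outer loop for S102-S110; `step` is a hypothetical helper that takes
    one batch of indices and returns that mini-batch's second error."""
    rng = np.random.default_rng(seed)
    second_error = float("inf")
    for _ in range(max_epochs):                    # epoch-count condition
        order = rng.permutation(n_samples)         # S102: random assignment
        for i in range(0, n_samples, B):
            second_error = step(order[i:i + B])    # S104-S106
        if second_error < loss_threshold:          # loss condition (S110)
            break
    return second_error                            # learning ends; model output follows (S112)
```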
 FIG. 6 is a diagram illustrating an example hardware implementation of the present embodiment. The machine learning device 1 includes a CPU 200, an accelerator 202, a main storage device 204, an auxiliary storage device 206, a network interface 208, and a device interface 210. These devices are connected by a bus 212.
 The CPU (Central Processing Unit) 200 is a processor that operates the machine learning device 1, for example on the basis of programs stored in the main storage device 204.
 The accelerator 202 is a device that assists with arithmetic processing and includes, for example, a GPU. The GPU accelerates numerical computation through GPGPU (General-Purpose computing on GPU). The accelerator 202 may itself include memory as auxiliary storage and may be able to access data stored in that memory at high speed. For exchanges with the main storage device 204 or the auxiliary storage device 206 over the bus, necessary data may be prefetched from those storage devices into the memory on the accelerator 202 so that this high-speed access can be exploited to the fullest.
 The main storage device 204 is directly connected to the CPU 200 via a main bus or the like and mainly stores the programs and the like required for the operation of the machine learning device 1. The main storage device 204 includes, for example, DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory).
 The auxiliary storage device 206 has lower throughput than the main storage device 204 but provides large-capacity memory. The auxiliary storage device 206 need not reside in the same computer as the machine learning device 1 and may be installed externally. For example, the training data may be stored in the auxiliary storage device 206 and transferred via the bus 212 to the CPU 200, the accelerator 202, and the main storage device 204 for use.
 The network interface 208 connects the machine learning device 1 to an external network 300. The device interface 210 connects the machine learning device 1 to an external device 400. The external device 400 may also be connected to the machine learning device 1 via the network 300 and the network interface 208.
 The computation in this embodiment is executed mainly on the accelerator 202. The arithmetic throughput of the accelerator 202 is higher than that of the CPU 200, but the capacity of the memory mounted on the accelerator 202 is often smaller than that of the main storage device 204 and the auxiliary storage device 206. Data in this memory can be accessed at high speed from the processor within the accelerator 202, whereas access from that processor to the main storage device 204 and the auxiliary storage device 206 is often slow.
 In such a case, it is difficult to hold a B x V matrix in the memory on the accelerator 202 all at once, so the matrix must be stored in the auxiliary storage device 206 or the like; however, access between the accelerator 202 and these storage devices is generally slow compared with access between the processor and the memory provided within the accelerator 202. If data of size B x V is handled by the accelerator 202 at once, transfers over the slower bus become necessary, so the proportion of data transfer relative to arithmetic processing rises and the transfer is likely to become a bottleneck.
 Therefore, as described above, by performing the arithmetic processing on the accelerator 202 for each portion of the data, high-speed computation can be achieved while learning proceeds without reducing the mini-batch size B.
 Note that the number of vectors processed at the same time within the partial data, that is, the predetermined size b described above, may be determined based on the memory capacity on the accelerator 202. For example, b may be set so that as much as possible of the capacity remaining after excluding the capacity needed for the programs that operate the processor of the accelerator 202 and the buffer capacity for input data and the like can be used for the b x V elements.
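 As a rough sketch of this sizing rule, with the accounting of overheads being an assumption on our part, b could be derived from the accelerator memory left after program and buffer requirements are subtracted:

```python
def choose_b(accel_mem_bytes, program_bytes, buffer_bytes, V, bytes_per_elem=4):
    """Largest b such that a (b, V) block of elements fits in the accelerator
    memory remaining after program and I/O buffer capacity is excluded."""
    usable = accel_mem_bytes - program_bytes - buffer_bytes
    return max(1, usable // (V * bytes_per_elem))
```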
 As described above, according to the present embodiment, by computing a partial loss (a first error or second error) for each predetermined number of data items independent of the mini-batch size, a high computation speed can be secured without reducing the mini-batch size needed to learn an accurate model. Since the loss can thus be computed and error back propagation performed within a memory footprint that does not depend on the mini-batch size, the computational resources of the machine can be used efficiently.
 In the description above, the data selection unit 160 inputs mini-batch-sized data to the model generation device including the forward propagation unit 162 and the back propagation unit 168, and the forward propagation unit 162 and the back propagation unit 168 operate on data of the predetermined size; however, the configuration is not limited to this.
 For example, after the data selection unit 160 selects B items of data as a mini-batch, a data division unit may divide the data into groups of b items, and this data division unit may operate as input means that inputs the b items of input data to the model generation unit.
 In model generation, the forward propagation unit 162 computes the output of the output layer for the b items of data input to the network by the input means. The error output unit 164 outputs a first error based on a first difference between the outputs corresponding to those b items of data and the correct labels. This error is stored, and a first error is likewise output for the next b items input by the input means. Once a full mini-batch, that is, B items of data, has been input, the error calculation unit 166 computes the second error, which is the error of the mini-batch.
 Similarly, the back propagation unit 168 outputs, via the error output unit 164, a first error based on the b items of data input by the input means, for the output of each layer and the back-propagated output. When first errors have been output for the whole mini-batch, the error calculation unit 166 computes the second error at the layer of interest. In this way, a data division unit that divides the data into groups of b items may be provided as the input means.
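 The division of roles among the input means, the error output unit 164, and the error calculation unit 166 described in this variation might be organized as in the following schematic sketch; the class names are invented for illustration, and the loss is assumed to be a summed negative log-likelihood:

```python
import numpy as np

class ErrorOutput:
    """Error output means: first error (summed loss) for one block of b items."""
    def first_error(self, probs, labels):
        return -np.log(probs[np.arange(len(labels)), labels]).sum()

class ErrorCalc:
    """Error calculation means: aggregates the stored first errors into the
    mini-batch's second error once all B items have been processed."""
    def __init__(self):
        self.partials = []
    def store(self, first_error):
        self.partials.append(first_error)
    def second_error(self, B):
        return sum(self.partials) / B   # mean over the mini-batch
```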
 The division may also allow some fluctuation around the predetermined number b, for example, or, as another example, be changed dynamically according to memory utilization. Thus, the division is not limited to fixed groups of a predetermined number; it may change appropriately with the situation, and various other division methods may be used.
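 A fluctuating or memory-driven choice of the division size could be sketched as follows; the back-off rule and the jitter range are illustrative assumptions only:

```python
import numpy as np

def next_b(b_nominal, mem_used_frac, jitter=0.1, rng=None):
    """Vary the division size around its nominal value, shrinking it as
    accelerator memory fills; both rules are illustrative heuristics."""
    scale = max(0.25, 1.0 - mem_used_frac)     # back off under memory pressure
    b = int(b_nominal * scale)
    if rng is not None:
        b = int(b * (1.0 + rng.uniform(-jitter, jitter)))  # small random fluctuation
    return max(1, b)
```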
 The above description used RNN training in natural language processing as an example, but the method is not limited to this and can also be applied to training other neural networks that require large data areas when computing the loss. For example, it can be used not only with MLPs and CNNs but also with LSTMs (Long Short-Term Memory) and the like.
 Although the generated model has been described as one that performs natural language processing, the model is not limited to this; the machine learning device 1 may generate models that process various other kinds of data for other purposes.
 The softmax function and the other functions in the above description are given as examples, and other implementations may be used. In place of the softmax function, another function suited to obtaining gradients, such as a sigmoid function or ReLU (Rectified Linear Unit), may be used, and appropriate functions may likewise be chosen for the other places where functions are used.
 In the machine learning device 1 of each embodiment shown in FIG. 2, the control unit 12 may be a control circuit implemented in analog or digital form, or with an FPGA (Field Programmable Gate Array), an ASIC (Application Specific Integrated Circuit), or the like. The learning unit 16 may likewise be implemented as a circuit.
 In all of the above description, at least part of the machine learning device 1 may be implemented in hardware, or it may be implemented in software with a CPU or the like carrying out the software's information processing. When implemented in software, a program realizing the machine learning device 1 and at least part of its functions may be stored on a storage medium such as a flexible disk or CD-ROM and read and executed by a computer. The storage medium is not limited to removable media such as magnetic or optical disks; it may be a fixed storage medium such as a hard disk device or memory. That is, the information processing by software may be concretely implemented using hardware resources. Furthermore, the processing by software may be implemented in a circuit such as an FPGA and executed by hardware. Model generation and the processing performed after input to the model may be carried out using an accelerator such as a GPU, for example. A processing circuit such as a CPU, a storage device such as memory, and other necessary hardware may each be provided singly, or a plurality of at least one of them may be provided.
 The model generated by the machine learning device 1 according to the present embodiment can be used as a program module forming part of artificial intelligence software. That is, the CPU of a computer operates based on the model stored in the storage unit so as to perform computation and output the result.
 Based on all of the above description, those skilled in the art may conceive of additions to, effects of, or various modifications of the present invention, but aspects of the present invention are not limited to the individual embodiments described above. Various additions, changes, and partial deletions are possible without departing from the conceptual idea and spirit of the present invention derived from the content defined in the claims and its equivalents. For example, in all of the embodiments described above, the numerical values used in the description are given as examples and are not limiting.
 In the present embodiment, applicability to natural language processing has been shown as an example, but the method is not limited to this. For example, it can also handle neural networks that take images as input; in this case, a plurality of input images can be divided into mini-batches and the method applied in the same way. As another example, classification may be performed by applying the division of this embodiment to each pixel within an image; in that case, the pixels within the image may be divided and the computation performed for each divided group of pixels as described above. Furthermore, division within a mini-batch and division of the pixels within an image can be used in combination. The method can also be applied to other forms of data as long as mini-batch processing and division within the data can be performed appropriately.
1: machine learning device, 10: input unit, 12: control unit, 14: storage unit, 16: learning unit, 160: data selection unit, 162: forward propagation unit, 164: error output unit, 166: error calculation unit, 168: back propagation unit, 18: output unit, 200: CPU, 202: accelerator, 204: main storage device, 206: auxiliary storage device, 208: network interface, 210: device interface, 212: bus, 300: network, 400: external device

Claims (10)

  1.  A model generation device for generating a learned model comprising a neural network model, the device comprising:
     input means for dividing training data and inputting the divided training data into the model;
     error output means for outputting a first error representing a difference between data obtained by inputting the divided training data into the model and correct labels of the divided training data;
     error calculation means for calculating, based on the first error, a second error representing a difference between data obtained by inputting the training data into the model and correct labels of the training data; and
     model generation means for updating a weight of at least one layer of the neural network by error back propagation based on the second error.
  2.  The model generation device according to claim 1, wherein the error output means outputs the first error for each portion of the divided training data, and
     the error calculation means calculates the second error by taking the sum or the average of the first errors for the portions of the divided training data.
  3.  The model generation device according to claim 1 or 2, wherein the error output means, after calculating the first error, discards the matrix used for calculating the first error.
  4.  The model generation device according to any one of claims 1 to 3, wherein the input means divides the training data into portions of a predetermined number of items.
  5.  The model generation device according to claim 4, wherein the predetermined number does not depend on the number of items in the training data.
  6.  The model generation device according to any one of claims 1 to 5, wherein the error output means recalculates the first error at the timing at which error back propagation is executed, and
     the model generation means executes error back propagation using the recalculated first error and the second error that the error calculation means computed based on the course of computation by which the first error is obtained.
  7.  The model generation device according to any one of claims 1 to 6, wherein the model is based on an RNN (Recurrent Neural Network) model.
  8.  The model generation device according to any one of claims 1 to 7, wherein the data input to the model is data used for natural language processing.
  9.  A model generation method for generating a learned model comprising a neural network model, the method comprising:
     a step in which input means divides training data and inputs the divided training data into the model;
     a step in which error output means outputs a first error representing a difference between data obtained by inputting the divided training data into the model and correct labels of the divided training data;
     a step in which error calculation means calculates, based on the first error, a second error representing a difference between data obtained by inputting the training data into the model and the correct labels of the training data; and
     a step in which model generation means updates a weight of at least one layer of the neural network by error back propagation based on the second error.
  10.  A program for causing a computer to function, in a model generation device that generates a learned model comprising a neural network model, as:
     input means for dividing training data and inputting the divided training data into the model;
     error output means for outputting a first error representing a difference between data obtained by inputting the divided training data into the model and correct labels of the divided training data;
     error calculation means for calculating, based on the first error, a second error representing a difference between data obtained by inputting the training data into the model and correct labels of the training data; and
     model generation means for updating a weight of at least one layer of the neural network by error back propagation based on the second error.
PCT/JP2019/011865 2018-03-22 2019-03-20 Model generation device, model generation method, and program WO2019182059A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-055087 2018-03-22
JP2018055087A JP2021119425A (en) 2018-03-22 2018-03-22 Model generation device, model generation method and program

Publications (1)

Publication Number Publication Date
WO2019182059A1 true WO2019182059A1 (en) 2019-09-26

Family

ID=67986223

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/011865 WO2019182059A1 (en) 2018-03-22 2019-03-20 Model generation device, model generation method, and program

Country Status (2)

Country Link
JP (1) JP2021119425A (en)
WO (1) WO2019182059A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151510A1 (en) * 2018-11-12 2020-05-14 Advanced Micro Devices, Inc. Adaptive batch reuse on deep memories
CN114118449B (en) * 2022-01-28 2022-10-04 深圳佑驾创新科技有限公司 Image label identification method, medium and equipment based on bias label learning model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018160180A (en) * 2017-03-23 2018-10-11 富士通株式会社 Information processing system, information processor, and method for controlling information processing system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018160180A (en) * 2017-03-23 2018-10-11 富士通株式会社 Information processing system, information processor, and method for controlling information processing system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
* Cited by examiner, † Cited by third party
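ASHITANI, TATSUJI: "To try Data-Parallel with Distributed TensorFlow", DISTRIBUTED TENSORFLOW QIITA, 2016, XP055640683, Retrieved from the Internet <URL:https://qiita.com/ashitani/items/dbe76cb9194d60ead9de> [retrieved on 20190613] *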
MORINAGA, YUYA ET AL.: "Development of hybrid type operation method of mathematical programming and machine learning for a thermal grid", DOCUMENTS OF RESEARCH GROUP OF THE INSTITUTE OF ELECTRICAL ENGINEERING OF JAPAN, 11 June 2017 (2017-06-11), pages 7 - 12 *

Also Published As

Publication number Publication date
JP2021119425A (en) 2021-08-12

Similar Documents

Publication Publication Date Title
US11308398B2 (en) Computation method
KR101959376B1 (en) Systems and methods for a multi-core optimized recurrent neural network
US20180260709A1 (en) Calculating device and method for a sparsely connected artificial neural network
US20170061279A1 (en) Updating an artificial neural network using flexible fixed point representation
US11983616B2 (en) Methods and apparatus for constructing digital circuits for performing matrix operations
US20190130268A1 (en) Tensor radix point calculation in a neural network
JP7410395B2 (en) Optimization device and optimization method
WO2019182059A1 (en) Model generation device, model generation method, and program
KR102290531B1 (en) Apparatus for Reorganizable neural network computing
CN114662646A (en) Method and device for realizing neural network
US20190130276A1 (en) Tensor manipulation within a neural network
US20210294784A1 (en) Method and apparatus with softmax approximation
US11709783B1 (en) Tensor data distribution using grid direct-memory access (DMA) controller
WO2023125857A1 (en) Model training method based on machine learning framework system and related device
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
KR20230132369A (en) Reducing resources in quantum circuits
JP2020080048A (en) Parallel processing apparatus and program
US11704562B1 (en) Architecture for virtual instructions
CN114330682A (en) Hardware architecture applied to Fastformer neural network and computing method thereof
Pochelu et al. An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks
JP2020191017A (en) Information processing device, information processing method, and information processing program
KR20200023155A (en) A method for accelerating training process of neural network and a neural network system
JP7470019B2 (en) Information Processing System
US20240095493A1 (en) Desparsified convolution for sparse tensors
TWI844228B (en) Training a neural network to perform a machine learning task

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19771724

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19771724

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP