WO2021100818A1 - Learning method and learning device employing augmentation - Google Patents

Learning method and learning device employing augmentation

Info

Publication number
WO2021100818A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
learning
padding
error
multidimensional
Prior art date
Application number
PCT/JP2020/043248
Other languages
French (fr)
Japanese (ja)
Inventor
剛 岡留
敦也 井手
Original Assignee
学校法人関西学院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 学校法人関西学院
Priority to JP2021558450A (granted as JP7160416B2)
Publication of WO2021100818A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis

Definitions

  • The present invention relates to a technique that enables highly accurate recognition in machine learning, particularly in deep neural networks (DNNs), even when little training data is available.
  • Abbreviations: DNN (Deep Neural Network, a deepened neural network), NN (Neural Network), AI (Artificial Intelligence)
  • The DNN classifier is required to have generalization performance; that is, it must predict accurately even on unknown data. Therefore, the error (cross-entropy) between the probability distribution of the predicted label and the true probability distribution is computed, and the parameters are optimized so that this error becomes small.
  • To improve prediction accuracy, data augmentation is generally performed: the data at hand are processed at training time, by scaling, flipping, rotation, contrast adjustment, and the like, to increase the amount of training data (a minimal sketch follows).
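  • A minimal sketch of such label-preserving augmentations, assuming images as NumPy arrays; the function name and parameter ranges are illustrative, not taken from the patent:

    import numpy as np

    def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        """Apply one random label-preserving transform to an HxWxC image."""
        choice = rng.integers(3)
        if choice == 0:                              # horizontal flip
            return img[:, ::-1].copy()
        if choice == 1:                              # rotate by 90/180/270 degrees
            return np.rot90(img, k=int(rng.integers(1, 4))).copy()
        factor = rng.uniform(0.8, 1.2)               # contrast adjustment
        mean = img.mean()
        return np.clip((img - mean) * factor + mean, 0, 255).astype(img.dtype)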
  • A method called ensemble learning, in which multiple predictors are trained and their prediction results are integrated at inference time, is also known.
  • Non-Patent Document 1 discloses a technique relating to a convolutional neural network for image recognition that focuses on ensemble learning.
  • Non-Patent Document 2 discloses a technique relating to a neural network with a residual structure.
  • Patent Document 1 discloses a learning processing method for identifying whether a disease is present based on skin image data.
  • However, all of the above documents utilize output results predicted by multiple learners; none utilizes the output obtained by feeding multiple augmented training data into a single learner.
  • Non-Patent Document 1 and Patent Document 1 disclose techniques for augmenting training data by rotating and flipping images. However, some of the augmented training data can degrade recognition accuracy when used for learning, so if they are used as-is as post-augmentation training data, recognition accuracy cannot be improved sufficiently.
  • Non-Patent Document 1: Atsushi Takeda, "GridNet: Convolutional Neural Network for Image Recognition Focusing on Ensemble Learning", FIT2017 (16th Information Science and Technology Forum), 2017. Non-Patent Document 2: Andreas Veit et al., "Residual Networks Behave Like Ensembles of Relatively Shallow Networks", Advances in Neural Information Processing Systems, pp. 550-558, 2016.
  • In view of this situation, an object of the present invention is to provide a learning method and a learning device that can learn efficiently by excluding, from the augmented training data, data that degrade recognition accuracy, thereby improving the generalization ability to predict unknown data accurately and improving recognition accuracy in machine learning.
  • According to a first aspect of the present invention, the learning method feeds a plurality of augmented training data, generated from a multidimensional training datum before augmentation, into a single classifier. From among the probability distributions of the prediction labels output for each of the plurality of training data, at least one probability distribution is selected using the error from the correct answer of the pre-augmentation training data as a measure, and learning is performed based on the error between the selected distribution and that correct answer.
  • Here, the augmentation may be augmentation without information degradation, such as flipping, rotation, and translation of the pre-augmentation training data; augmentation that degrades the information of the pre-augmentation training data; or a combination of both.
  • The correct answer of the pre-augmentation training data is the same as the correct answer of the plurality of augmented training data derived from it: degrading the information does not change the correct label. That is, the same correct label as the pre-augmentation training data is assigned to the plurality of information-degraded training data, and the probability distribution of the correct answer is also the same.
  • As a measure of the error from the correct answer of the pre-augmentation training data, cross-entropy, an index of how far apart two probability distributions are, is preferably used; using cross-entropy as the loss function enables efficient learning.
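  • In standard notation (the patent text itself states no formula), the cross-entropy between the correct distribution p and the predicted distribution q over labels x is:

    H(p, q) = -\sum_{x} p(x) \log q(x)

  With a one-hot correct distribution concentrated on the correct label y, this reduces to H(p, q) = -\log q(y).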
  • Further, by selecting from the probability distributions of the prediction labels output for the plurality of training data, using the error from the correct answer of the pre-augmentation training data as a measure, data unsuited to learning can be avoided, or used only sparingly, which improves generalization ability. In the present invention, data unsuited to learning are those whose error from the correct answer is small.
  • According to a second aspect of the present invention, the learning method feeds a plurality of augmented training data, generated from a multidimensional training datum before augmentation, into a single classifier, and performs learning using at least one augmented training datum selected, based on the probability distributions of the prediction labels output for each of the plurality of training data, using the error from the correct answer of the pre-augmentation training data as a measure.
  • According to a third aspect of the present invention, the learning method feeds a plurality of augmented training data, generated from a multidimensional training datum before augmentation, into a single classifier, and, based on the probability distributions of the prediction labels output for each of the plurality of training data, replaces the pre-augmentation training datum with one augmented training datum selected using the error from the correct answer as a measure, and learns.
  • When mini-batch learning is used, the multidimensional pre-augmentation training datum is one original datum in the mini-batch; the original datum is augmented into multiple extended data, each extended datum is input to the classifier, and, based on the output probability distributions of the prediction labels, one extended datum selected using the error from the correct answer of the original datum as a measure replaces the original datum for learning.
  • Degrading information means rewriting part of the data to lose information: adding noise to the multidimensional original data, or deleting part of the multidimensional quantity or setting it to a predetermined value, thereby reducing the amount of information the original data carries. For example, information can be degraded by filling a specific subregion of the pre-augmentation image with black (setting the pixel values of that subregion to 0) or by setting the pixel values of the subregion to some specific value (see the sketch below).
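  • A minimal sketch of such information-degrading augmentation (a cutout-style blanked patch), assuming a NumPy image array and a patch size no larger than the image's height and width; names are illustrative:

    import numpy as np

    def degrade(img: np.ndarray, size: int,
                rng: np.random.Generator, fill: float = 0.0) -> np.ndarray:
        """Blank out a random size x size square (fill = 0 paints it black),
        reducing the information the image carries."""
        h, w = img.shape[:2]
        top = int(rng.integers(0, h - size + 1))
        left = int(rng.integers(0, w - size + 1))
        out = img.copy()
        out[top:top + size, left:left + size] = fill
        return out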
  • In the present invention, when the training data is an image, one image is represented by a multidimensional quantity of the pixel dimensions constituting that image, or by a multidimensional quantity representing the features of the image. When the training data is one-dimensional time-series data, it is represented by a multidimensional quantity in which the data values at each time are arranged over the duration, or by a multidimensional quantity representing the features of the one-dimensional time series. When the training data is multidimensional time-series data, it is represented by a multidimensional quantity in which the multidimensional data values at each time are arranged over the duration, or by a multidimensional quantity representing the features of the multidimensional time series.
  • In the learning method of the present invention, the data selected using the error from the correct answer as a measure are preferably data sampled with a selection probability that increases with the magnitude of the error.
  • The data selected using the error from the correct answer as a measure refers to probability-distribution data or training data. The error between the probability-distribution data (categorical distribution) of the prediction label output for each of the augmented training data and the probability-distribution data (one-hot representation) of the correct label is calculated, and each output prediction's distribution is sampled with a selection probability increased according to the magnitude of the calculated error: the larger the error from the correct answer, the more likely the distribution is to be selected.
  • As shown in the examples described later, data with a larger error from the correct answer yield higher learning efficiency. Such data are difficult for the classifier, and it is presumed that learning from difficult data improves learning efficiency more than learning from easy data (data with a small error from the correct answer).
  • On the other hand, learning from a mixture of different data tends to be more efficient than learning only from difficult data. Therefore, the larger the error from the correct answer, the higher the selection probability, and the smaller the error, the lower the selection probability.
  • The number of sampled data may be one, or two or more.
  • The data selected using the error from the correct answer as a measure may be the data whose error from the correct answer is the maximum.
  • The number of data with the maximum error may be one, or two or more; when two or more exist, any one may be selected at random, or multiple data may be used.
  • The data selected using the error from the correct answer as a measure may be the data whose difference from the mean error is the minimum.
  • The probability-distribution data and training data selected using the error as a measure can be chosen based on the mean, the median, the quartile (1/4 and 3/4) values, or the weighted mean of the errors; it is preferable to select in order starting from the probability distribution closest to the mean error.
  • The number of data minimizing the difference from the mean error may be one, or two or more; when two or more exist, any one may be selected at random, or multiple data may be used. (The three selection rules are sketched below.)
  • The number of probability distributions selected using the above error as a measure is preferably two or more; selecting two or more further improves recognition accuracy. Likewise, when two or more training data are selected using the error as a measure, recognition accuracy can be further improved.
  • Time-series data are data that change with time; an example is time-series acceleration data from an accelerometer worn by a person, from which the person's daily activities can be predicted and classified. Dividing time-series data into smaller series means splitting the obtained acceleration data into segments of a fixed duration; augmented data are then produced by adding random noise to the acceleration data of some segments, or by setting the acceleration data of some segments to zero, thereby degrading them, and these are used as augmented data to train the classifier model of daily human behavior (see the sketch below).
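  • A minimal sketch of such time-series augmentation, assuming a (time, axes) NumPy array of acceleration samples; the segment count, noise scale, and names are illustrative:

    import numpy as np

    def augment_series(acc: np.ndarray, n_seg: int,
                       rng: np.random.Generator) -> np.ndarray:
        """Split a (time, axes) acceleration series into n_seg segments and
        degrade one randomly chosen segment by noise or by zeroing it."""
        out = acc.astype(float)                 # copy; adding noise needs floats
        seg = len(acc) // n_seg
        i = int(rng.integers(n_seg))
        sl = slice(i * seg, (i + 1) * seg)
        if rng.random() < 0.5:
            out[sl] += rng.normal(0.0, 0.1, size=out[sl].shape)  # random noise
        else:
            out[sl] = 0.0                                        # zero the segment
        return out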
  • When the training data is image data, at least one of rotation, flipping, zooming, translation, cropping, division, blurring, and noise may be applied to the image data to obtain a plurality of augmented training data.
  • When the training data is time-series data, at least one of inversion, shifting, cutting, division, and noise may be applied to the time-series data to obtain a plurality of augmented training data.
  • According to another aspect of the present invention, the learning device includes an augmentation processing unit that generates a plurality of augmented training data from a multidimensional training datum before augmentation, a label prediction unit that inputs the plurality of augmented training data into one classifier and predicts labels, a selection unit that selects at least one probability distribution, from the probability distributions of the prediction labels output for each of the plurality of training data, using the error from the correct answer of the pre-augmentation training data as a measure, and a learning unit that learns based on the error between the selected probability distribution and that correct answer.
  • Alternatively, the learning device includes an augmentation processing unit that generates a plurality of augmented training data from a multidimensional training datum before augmentation, a label prediction unit that inputs the plurality of augmented training data into one classifier and predicts labels, a selection unit that selects at least one augmented training datum, based on the probability distributions of the prediction labels output for each of the plurality of training data, using the error from the correct answer of the pre-augmentation training data as a measure, and a learning unit that learns using the selected augmented training data.
  • Alternatively, the learning device includes an augmentation processing unit that generates a plurality of augmented training data from a multidimensional training datum before augmentation, a label prediction unit that inputs the plurality of augmented training data into one classifier and predicts labels, a selection unit that selects one augmented training datum, based on the probability distributions of the prediction labels output for each of the plurality of training data, using the error from the correct answer of the pre-augmentation training data as a measure, and a learning unit that learns by replacing the pre-augmentation training datum with the selected augmented training datum.
  • In the learning device of the present invention, the data selected using the error from the correct answer as a measure are preferably data sampled with a selection probability that increases with the magnitude of the error.
  • The data selected using the error from the correct answer as a measure refers to probability-distribution data or training data; the error between the probability-distribution data of the prediction label output for each of the augmented training data and the probability-distribution data of the correct label is calculated, and each output prediction's distribution is sampled with a selection probability that increases with the magnitude of the calculated error.
  • The data selected using the error from the correct answer as a measure may be the data whose error from the correct answer is the maximum, or the data whose difference from the mean error is the minimum.
  • The augmentation processing unit degrades the multidimensional pre-augmentation training data by rewriting part of the data. When the training data is an image, one image is represented by a multidimensional quantity of the pixel dimensions constituting the image or by a multidimensional quantity representing the image's features; when the training data is one-dimensional time-series data, by a multidimensional quantity in which the data values at each time are arranged over the duration or by a multidimensional quantity representing the features of the series; and when the training data is multidimensional time-series data, by a multidimensional quantity in which the multidimensional data values at each time are arranged over the duration or by a multidimensional quantity representing its features.
  • The information degradation may be performed by adding noise to the multidimensional original data, or by deleting part of the multidimensional quantity or setting it to a predetermined value, thereby reducing the amount of information contained in the original data.
  • According to the learning method and learning device of the present invention, data that degrade recognition accuracy can be excluded from the augmented training data, learning can be performed efficiently, and generalization ability and recognition accuracy can be improved.
  • Brief description of the drawings: FIG. 1 is a functional block diagram of the learning device of Example 1; FIG. 2 is a schematic flow chart of the learning method of Example 1; FIG. 3 shows an image before augmentation; FIG. 4 is an explanatory drawing of cutting out from the image before augmentation; FIG. 5 shows the images after cutting out; FIG. 6 shows output probability distributions; FIG. 7 is an explanatory diagram of the harmonization process when M items around the mean are selected; FIG. 8 is an explanatory diagram of how the training data are augmented; FIGS. 9 and 10 are graphs (1) and (2) showing the relationship between prediction accuracy and model size; FIG. 11 is a schematic flow chart of the learning method of Example 2; FIG. 12 is a schematic flow chart of the learning method of Example 3.
  • FIG. 13 is a functional block diagram of the learning device of Example 4; FIG. 14 illustrates the expansion and selection of training data in a mini-batch; FIG. 15 is an explanatory diagram of typical data extensions; FIG. 16 is an explanatory diagram of how to select the extended image whose probability distribution is closest to the mean error; FIG. 17 is an explanatory diagram of how to select the extended image whose probability distribution has the maximum error; FIGS. 18 and 19 are explanatory drawings (parts 1 and 2) of the method of sampling with a selection probability that increases with the magnitude of the error.
  • FIG. 1 shows a functional block diagram of an embodiment of the learning device of the present invention.
  • The learning device 10 includes an augmentation processing unit 1, a label prediction unit 2, a selection unit 3, and a learning unit 4.
  • The augmentation processing unit 1 receives the pre-augmentation training data 11, performs augmentation without information degradation, such as flipping, rotation, and translation, or augmentation that degrades the information held by the training data 11, and outputs the augmented training data 21.
  • The label prediction unit 2 inputs the augmented training data 21 into the classifier 22 and outputs a prediction label for each of the plurality of training data.
  • The selection unit 3 receives the probability distributions 15 of the output prediction labels, calculates the errors 31 from the correct answer of the pre-augmentation training data, and selects one of the probability distributions 15 using the error as a measure: specifically, the probability distribution closest to the mean of the errors 31, the probability distribution with the largest error 31, or one sampled with a selection probability that increases with the magnitude of the error 31.
  • The learning unit 4 receives the probability distribution 32 selected using the error from the correct answer as a measure, further performs the error calculation 41 against the correct answer, and adjusts the weight parameters 42 of the classifier 22.
  • The present invention is also useful for augmentation without information degradation, such as flipping, rotation, and translation.
  • FIG. 2 shows a processing flow diagram of one embodiment of the learning method of the present invention.
  • First, the training data is augmented (step S01).
  • FIG. 3 shows the image before augmentation.
  • The image 5 shown in FIG. 3 is used as one pre-augmentation training datum.
  • Image 5 has a black background, with the number "7" displayed in white.
  • FIG. 4 shows an example of cutting out, from one original pre-augmentation image, a plurality of augmented images whose information has been degraded.
  • Here, an example of cutting out four augmented images 51 to 54 from the pre-augmentation image 5 is shown.
  • The augmentation is not limited to cutting out four images; even more images can be cut out.
  • Each of the four augmented images 51 to 54 is an image whose information is degraded by cutting out only part of the "7".
  • The image 51 is a cutout of the upper-right part of the "7", the image 52 of the upper-left part, the image 53 of a position slightly to the lower right of the center, and the image 54 of a lower part of the "7".
  • FIGS. 5(1) to 5(4) show the images after cutting out.
  • The image 51 shown in FIG. 5(1) can easily be identified by the human eye as part of the number "7", whereas for the image 52 shown in FIG. 5(2), the image 53 shown in FIG. 5(3), and the image 54 shown in FIG. 5(4), it is difficult for the human eye to determine whether the content of the image is part of the number "7".
  • FIG. 6 shows the probability distributions output when the classifier classifies the images by image recognition.
  • FIG. 6(1) shows the probability distributions of the prediction labels output for each of the plurality of training data; here the number of augmented training data is N = 4, corresponding to the four cut-out images.
  • FIG. 6(2) shows the probability distribution representing the correct answer of the pre-augmentation training data: among the prediction labels "0" to "9", the probability of "7" is 1, and the probabilities of "0" to "6", "8", and "9" are 0 (one-hot representation).
  • FIG. 7 is an explanatory diagram of the harmonization process when M items (M a natural number) around the mean are selected. The input image 6 is denoted x_i, and its true label 7 is denoted y_i; here, "Frog" is set as the correct label 7.
  • The augmented image data group 60 contains M image data; four of them (6a to 6d) are shown in FIG. 7.
  • In the leftmost image data 6a the frog's face is hidden, so it does not look like a frog, and its point on the one-dimensional line lies far from the other image data points.
  • The horizontal axis of the graph 9 shows the errors calculated from the augmented image data group 60 and its label 7.
  • The leftmost image data 6a, with the frog's face hidden, is hard to classify correctly and therefore has a large error; its point is plotted at the rightmost position.
  • The broken line 14 in the graph 9 shows the mean of the outputs of the cross-entropy error function (loss function). If most of the augmented image data group 60 holds enough information for correct classification, their loss outputs lie close to this mean, and data far from the mean can be judged to be unclassifiable and not useful for learning.
  • Image data close to the mean are useful for learning and can improve the accuracy of the trained classifier, so they are actively used for learning.
  • The two image data (6b, 6c) close to the mean loss output indicated by the broken line 14 in the graph 9 are extracted as data useful for learning and will be used for learning.
  • The image data group 60 also includes image data other than (6a to 6d); here, two further images (6e, 6f) are used for learning.
  • In this way, probability distributions whose error is close to the mean of the cross-entropy loss outputs are selected (step S04). Then the cross-entropy error function (loss function) is optimized so that the error between the probability distributions of the prediction labels selected in step S04 and the correct answer becomes small, and the weight parameters of the classifier are adjusted by the error backpropagation method (step S05).
  • Example 1 using image data sets: the effectiveness of the learning device and learning method of this example was verified using the image data sets CIFAR-10 and CIFAR-100 as training data.
  • CIFAR-10 and CIFAR-100 are composed of 32 × 32-pixel RGB images.
  • Dropout, a method of disabling neurons in the hidden layers with a certain probability during training, is useful for improving generalization performance.
  • FIG. 8 illustrates methods of augmenting the training data: (1) the image data before augmentation, (2) zero padding, (3) random horizontal flip, (4) random crop, and (5) cutout.
  • The image 61 in FIG. 8(2) is obtained by applying zero padding to the image 6 in FIG. 8(1): the border of the image is filled with zeros over a range of 4 pixels in all directions.
  • The image 62 in FIG. 8(3) is the image 6 flipped horizontally with a predetermined probability.
  • The image 63 in FIG. 8(4) is a randomly cropped version of image 6; the crop size is 32 × 32 pixels, the same as image 61.
  • The image 64 in FIG. 8(5) has a cutout portion 64a placed in image 6; even in a large image, an important part of the image may disappear depending on where the cutout portion 64a is placed.
  • In this example, the batch size is 64 and the number of epochs is 100.
  • For optimization, stochastic gradient descent (SGD) was used, with a learning rate of 0.1, a weight decay of 0.0005, and a momentum of 0.9; the learning rate was multiplied by 0.2 at epochs 40, 60, and 80. (A sketch of this setup follows.)
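  • A minimal sketch of this optimization setup in PyTorch, assuming some classifier model; the placeholder model and the training-loop body are illustrative, not from the patent:

    import torch

    model = torch.nn.Linear(32 * 32 * 3, 10)   # placeholder classifier

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=0.0005)
    # multiply the learning rate by 0.2 at epochs 40, 60 and 80
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[40, 60, 80], gamma=0.2)

    for epoch in range(100):
        ...  # one epoch over mini-batches of size 64 goes here
        scheduler.step()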
  • For each sample, 32 augmented image data were prepared, their errors were calculated, and M samples close to the mean of the errors were selected and used for learning. The specific harmonization process is as described above.
  • The experimental results in Tables 1 and 2 below represent the top-1 error rate (%).
  • As shown in Table 1, with CIFAR-10 the error rate was 5.43% ± 0.174 in Comparative Example 1 and 4.97% ± 0.095 in Example A, in which the sample closest to the mean error was selected; the error rate was lower in Example A. With CIFAR-100, it was 24.63% ± 0.280 in Comparative Example 1 and 23.93% ± 0.225 in Example A; again the error rate was lower in Example A.
  • As shown in Table 2, with CIFAR-10 the error rate was 4.34% ± 0.108 in Comparative Example 2, 3.85% ± 0.039 in Example B, in which four samples closest to the mean error were selected, and 4.02% ± 0.158 in Example C, in which eight such samples were selected; the error rates of Examples B and C were lower than that of Comparative Example 2. With CIFAR-100, the rates were 20.64% ± 0.095 in Comparative Example 2, 20.05% ± 0.140 in Example B, and 19.32% ± 0.295 in Example C, so the error rates of Examples B and C were again lower than that of Comparative Example 2.
  • FIGS. 9 and 10 are graphs showing the relationship between prediction accuracy and model size.
  • The graph in FIG. 9 plots Example D, in which the 32 × 32-pixel pre-augmentation image data were degraded to 27 × 27 pixels after augmentation, against Comparative Example 3a (augmentation without information degradation) and Comparative Example 4a (no augmentation).
  • FIG. 9(1) shows the top-1 prediction accuracy with CIFAR-10, FIG. 9(2) the top-1 prediction accuracy with CIFAR-100, and FIG. 9(3) the top-5 prediction accuracy with CIFAR-100.
  • Compared with Comparative Examples 3a and 4a, Example D showed that not only the top-1 but also the top-5 prediction accuracy is less likely to decrease even when the learning model is made smaller.
  • FIG. 10 plots Example E, in which the 32 × 32-pixel pre-augmentation image data were degraded to 23 × 23 pixels after augmentation, against Comparative Example 3b (with augmentation) and Comparative Example 4b (without augmentation).
  • FIG. 10(1) shows the top-1 prediction accuracy with CIFAR-10, FIG. 10(2) the top-1 prediction accuracy with CIFAR-100, and FIG. 10(3) the top-5 prediction accuracy with CIFAR-100. In all of FIGS. 10(1) to 10(3), the prediction accuracy of Example E was higher than those of Comparative Examples 3b and 4b. As in Example D, neither the top-1 nor the top-5 prediction accuracy of Example E decreased easily as the model size became small.
  • FIG. 11 shows a processing flow diagram of another embodiment of the learning method of the present invention.
  • The training-data images are augmented in the same manner as in Example 1 (step S11).
  • The augmented images may be images whose information is degraded relative to the original pre-augmentation image, or images without information degradation relative to the original, such as rotations and translations.
  • The plurality of augmented images are input into one classifier (step S12).
  • The probability distributions of the prediction labels output by the classifier through image recognition, and the probability distribution representing the correct answer of the pre-augmentation training data, are the same as those shown in FIG. 6.
  • The cross-entropy errors between the probability distributions of the prediction labels output for each of the augmented images and the correct answer of the original pre-augmentation image are calculated (step S13).
  • The probability distribution whose cross-entropy loss output is the maximum is selected (step S14).
  • Then the cross-entropy error function (loss function) is optimized so that the error between the probability distribution of the prediction label selected in step S14 and the correct answer becomes small, and the weight parameters of the classifier are adjusted by the error backpropagation method (step S15).
  • FIG. 12 shows a processing flow diagram of another embodiment of the learning method of the present invention.
  • The training-data images are augmented in the same manner as in Example 1 (step S21).
  • The augmented images may be images whose information is degraded relative to the original pre-augmentation image, or images without information degradation relative to the original, such as rotations and translations.
  • The plurality of augmented images are input into one classifier (step S22).
  • The probability distributions of the prediction labels output by the classifier, and the probability distribution representing the correct answer of the pre-augmentation training data, are the same as those shown in FIG. 6.
  • The cross-entropy errors between the probability distributions of the prediction labels output for each of the augmented images and the correct answer of the original image are calculated (step S23). From these probability distributions, one is selected by sampling, with the selection probability increased according to the magnitude of the error (step S24). Then the cross-entropy error function (loss function) is optimized so that the error between the sampled probability distribution and the correct answer becomes small, and the weight parameters of the classifier are adjusted by the error backpropagation method (step S25).
  • FIG. 13 shows a functional block diagram of another embodiment of the learning device of the present invention.
  • The learning device 10a includes an augmentation processing unit 1, a label prediction unit 2, a selection unit 3, and a learning unit 4a.
  • The augmentation processing unit 1 receives the pre-augmentation training data 11 and outputs the augmented training data 21.
  • The label prediction unit 2 inputs the augmented training data 21 into the classifier 22 and outputs a prediction label for each of the plurality of training data.
  • The selection unit 3 receives the probability distributions 15 of the output prediction labels, calculates the errors 31 from the correct answer of the pre-augmentation training data, and selects training data using the error as a measure: the augmented training datum whose probability distribution has the largest error 31, or one augmented training datum chosen by sampling with a selection probability that increases with the magnitude of the error 31.
  • The classifier 22 receives the augmented training datum selected using the error from the correct answer as a measure, the error calculation 41 between the probability distribution of the prediction label output by the classifier 22 and the correct answer is performed, and the weight-parameter adjustment 42 of the classifier 22 is carried out.
  • FIG. 14 shows a conceptual diagram of a learning method using mini-batches. For each original datum in the mini-batch, (a) the original datum is expanded into multiple extended data; (b) each extended datum is input to the classifier and the probability distribution of its prediction label is output; (c) one extended datum is selected using the error from the correct answer of the original datum as a measure (see the "selection" arrow in FIG. 14), by one of three methods: selecting the extended datum whose probability distribution is closest to the mean error, selecting the one whose error is largest, or sampling with a selection probability that increases with the magnitude of the error; and (d) the selected extended datum replaces the original datum as training data of the mini-batch, on which the classifier is trained. A sketch of one such training step follows.
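  • A minimal PyTorch sketch of one such mini-batch step under the max-error selection rule; "expand" (the augmentation function) and "m" (the number of variants) are assumed, illustrative names:

    import torch
    import torch.nn.functional as F

    def train_step(model, images, labels, expand, m, optimizer):
        """One mini-batch step: expand each original image into m variants,
        replace the original with the variant whose cross-entropy error is
        largest, then train on the replaced batch."""
        model.eval()
        replaced = []
        with torch.no_grad():
            for img, y in zip(images, labels):
                variants = torch.stack([expand(img) for _ in range(m)])
                logits = model(variants)                       # (m, n_classes)
                errors = F.cross_entropy(logits, y.repeat(m), reduction="none")
                replaced.append(variants[errors.argmax()])     # max-error rule
        model.train()
        optimizer.zero_grad()
        loss = F.cross_entropy(model(torch.stack(replaced)), labels)
        loss.backward()
        optimizer.step()
        return loss.item()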
  • FIG. 15 shows typical data extensions: flipping, cropping, cutout, and mixup.
  • Flipping, like rotation and translation, extends the data while largely preserving the information of the original.
  • Cropping cuts away unnecessary parts such as margins while keeping the important parts of the original image; flipping, rotation, translation, and cropping do not degrade the original data.
  • Cutout cuts away (hides) part of the important region, and mixup fuses two images; both degrade the information of the original data.
  • Random extended data can be created by randomly choosing the size and position of the portion to cut from the original data.
  • The extension methods are not limited to the above; other methods can also be applied. (A minimal mixup sketch follows.)
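  • A minimal mixup sketch, assuming float NumPy images of equal shape; note that, as described above, the patent keeps the original image's correct label for the fused image, so only the pixels are mixed here (canonical mixup also mixes labels):

    import numpy as np

    def mixup(img_a: np.ndarray, img_b: np.ndarray,
              rng: np.random.Generator, alpha: float = 1.0) -> np.ndarray:
        """Fuse two images by a convex combination of their pixels."""
        lam = rng.beta(alpha, alpha)
        return lam * img_a + (1.0 - lam) * img_b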
  • FIG. 16 is an explanatory diagram of the method of selecting the extended datum whose probability distribution is closest to the mean error, FIG. 17 of the method of selecting the extended datum whose probability distribution has the maximum error, and FIGS. 18 and 19 of the method of selecting extended data by sampling with a selection probability that increases with the magnitude of the error.
  • First, as shown in FIG. 16, in the method of selecting the extended datum whose probability distribution is closest to the mean error from the correct answer, each original image in the mini-batch is expanded into M images during training; the M extended images are input to the classifier (DNN), and the probability distribution of the prediction label of each of the M extended images is output. The correct label of an extended image is the same as that of its original image. The cross-entropy error between each predicted distribution and the correct answer is calculated, and the extended image closest to the mean error replaces the original image as training data of the mini-batch.
  • As shown in FIG. 17, in the method of selecting the extended datum whose probability distribution has the largest error from the correct answer, each original image in the mini-batch is likewise expanded into M images during training; the M extended images are input to the classifier (DNN), and the probability distribution of the prediction label of each is output. The correct label of an extended image is the same as that of its original image. The cross-entropy error between each predicted distribution and the correct answer is calculated, and the extended image with the largest error replaces the original image as training data of the mini-batch.
  • As shown in FIGS. 18 and 19, in the method of selecting extended data by sampling with a selection probability that increases with the magnitude of the error, each original image in the mini-batch is expanded into M images; the M extended images are input to the classifier (DNN) during training, and the probability distribution of the prediction label of each is output. The correct label of an extended image is the same as that of its original image. The cross-entropy error between each predicted distribution and the correct answer is calculated, and an extended image selected by sampling, with a selection probability that increases with the magnitude of the error, replaces the original image as training data of the mini-batch.
  • When increasing the selection probability according to the magnitude of the error, as shown in FIG. 19, the errors may be normalized and the selection probabilities set using relative errors measured from the minimum error.
  • In that case, the selection probability of the extended image with the minimum error becomes 0, and it can be excluded from sampling.
  • Data with a small error from the correct answer easily reach the correct answer and are not suited to learning; therefore the selection probability is increased according to the magnitude of the relative error, data unsuited to learning are excluded from sampling, and the extended data are selected (see the sketch below).
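  • A minimal sketch of this relative-error sampling, assuming a NumPy array of per-variant errors; the values and names are illustrative:

    import numpy as np

    def sampling_probs(errors: np.ndarray) -> np.ndarray:
        """Selection probabilities from relative errors measured from the
        minimum: the minimum-error variant gets probability 0."""
        rel = errors - errors.min()
        if rel.sum() == 0.0:                 # all errors equal: uniform fallback
            return np.full(len(errors), 1.0 / len(errors))
        return rel / rel.sum()

    rng = np.random.default_rng(0)
    errors = np.array([0.2, 1.5, 0.9, 2.4])  # hypothetical per-variant errors
    idx = rng.choice(len(errors), p=sampling_probs(errors))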
  • In this example as well, the batch size is 64 and the number of epochs is 100. For optimization, stochastic gradient descent (SGD: Stochastic Gradient Descent) was used, with a learning rate of 0.1, a weight decay of 0.0005, and a momentum of 0.9; the learning rate was multiplied by 0.2 at epochs 40, 60, and 80.
  • Augmentation 1 augments the original image to 8 images using only ordinary augmentation without information degradation, such as flipping.
  • Augmentation 2 augments the original image to 8 images using ordinary augmentation plus cutout, which degrades information.
  • Augmentation 3 augments the original image to 8 images using ordinary augmentation plus mixup, which degrades information.
  • As shown in the experimental results in Table 4, with CIFAR-10, Examples F and G showed a higher accuracy rate than Comparative Example 5, a conventional learning method, in all cases: augmentation 1 without information degradation as well as augmentations 2 and 3 with information degradation. In Example E, however, the accuracy rate was higher than that of Comparative Example 5 for augmentation 1, but roughly the same or lower for augmentations 2 and 3. Further, as shown in the experimental results in Table 5, with CIFAR-100, Examples E, F, and G, corresponding to the three methods of selecting images after augmentation, all showed a higher accuracy rate than Comparative Example 5, regardless of whether the augmentation involved information degradation. Comparing the three augmentations, those with information degradation (augmentations 2 and 3) showed a higher accuracy rate than augmentation 1 without information degradation in almost all cases.
  • The present invention is useful as a technique that enables highly accurate recognition in machine learning, particularly in DNNs.
  • Reference numerals: 1 augmentation processing unit; 2 label prediction unit; 3 selection unit; 4, 4a learning unit; 5, 6, 6a to 6f, 51 to 54, 61 to 64 image; 7 label; 8, 9 graph; 10, 10a learning device; 11 pre-augmentation training data; 15 probability distribution of output prediction labels; 21 augmented training data; 22 classifier; 31 error from the correct answer of the pre-augmentation training data; 32 probability distribution selected based on the error from the correct answer; 41 error calculation against the correct answer; 42 weight-parameter adjustment of the classifier; 60 image data group; 64a cutout portion

Abstract

In relation to machine learning, the objective of the present invention is to provide a learning method and device with which it is possible to improve generalization capability and recognition accuracy. A plurality of items of augmented learning data, generated from multidimensional learning data before augmentation, are input into one classifier, and from among the probability distributions of the prediction labels output for each of the plurality of items of learning data, at least one probability distribution, selected using the error from the correct answer of the pre-augmentation learning data as a measure, is used to perform learning on the basis of that error. The data selected using the error as a measure are data sampled with a selection probability increased in accordance with the magnitude of the error, data for which the error is greatest, or data for which the difference from the average error is smallest.

Description

Learning method and learning device using augmentation
The present invention relates to a technique that enables highly accurate recognition in machine learning, particularly in deep neural networks (DNN; Deep Neural Network), even when little training data is available.
In recent years, AI (Artificial Intelligence) has evolved remarkably thanks to technological advances in computing. Its central technology is the neural network (NN; Neural Network), and in particular DNNs, deepened neural networks, dominate the field. In general, the deeper a DNN's layers, the greater its expressive power; accordingly, much recent research has addressed the design of deep DNNs and methods for training them successfully.
However, deepening the layers requires a large amount of training data, and in the real world the cost of preparing data is high. Deep layers also tend to cause overfitting: the model is so expressive that it fits the data too closely. Furthermore, deep layers make the number of parameters enormous, so the model becomes large, which makes it difficult to use on mobile devices and increases the amount of computation at prediction time.
On the other hand, a DNN classifier is required to have generalization performance; that is, it must predict accurately even on unknown data. Therefore, the error (cross-entropy) between the probability distribution of the predicted label and the true probability distribution is computed, and the parameters are optimized so that this error becomes small.
To improve prediction accuracy, data augmentation is generally performed: the data at hand are processed at training time by scaling, flipping, rotation, contrast adjustment, and so on, to increase the amount of data. A method called ensemble learning, in which multiple predictors are trained and their prediction results are integrated at inference time, is also known.
As techniques for performing an ensemble during learning, a technique relating to a convolutional neural network for image recognition focusing on ensemble learning (see Non-Patent Document 1), a technique relating to a neural network with a residual structure (see Non-Patent Document 2), and a learning processing method for identifying whether a disease is present based on skin image data (see Patent Document 1) have been disclosed.
However, all of the above documents utilize output results predicted by multiple learners; none utilizes the output obtained by feeding multiple augmented training data into a single learner.
Further, Non-Patent Document 1 and Patent Document 1 disclose techniques for augmenting training data by rotating and flipping images, but some of the augmented training data can degrade recognition accuracy when used for learning; if they are used as-is as post-augmentation training data, recognition accuracy cannot be improved sufficiently.
Patent Document 1: Japanese Unexamined Patent Publication No. 2017-45341
In view of this situation, an object of the present invention is to provide a learning method and a learning device that can learn efficiently by excluding, from the augmented training data, data that degrade recognition accuracy, in order to improve the generalization ability to predict unknown data accurately and to improve recognition accuracy in machine learning.
To solve the above problem, according to a first aspect of the present invention, the learning method feeds a plurality of augmented training data, generated from a multidimensional training datum before augmentation, into a single classifier; from among the probability distributions of the prediction labels output for each of the plurality of training data, at least one probability distribution selected using the error from the correct answer of the pre-augmentation training data as a measure is used, and learning is performed based on the error between it and that correct answer.
Here, the augmentation may be augmentation without information degradation, such as flipping, rotation, and translation of the pre-augmentation training data, or augmentation that degrades the information of the pre-augmentation training data, or both. The correct answer of the pre-augmentation training data is the same as that of the plurality of augmented training data derived from it; degrading the information does not change the correct label. That is, the same correct label as the pre-augmentation training data is assigned to the information-degraded augmented training data, and the probability distribution of the correct answer is also the same.
As a measure of the error from the correct answer of the pre-augmentation training data, cross-entropy, an index of how far apart two probability distributions are, is preferably used; using cross-entropy as the loss function enables efficient learning.
Further, by selecting, from the probability distributions of the prediction labels output for the plurality of training data, using the error from the correct answer of the pre-augmentation training data as a measure, data unsuited to learning can be avoided or used only sparingly, which improves generalization ability. In the present invention, data unsuited to learning are those whose error from the correct answer is small.
According to a second aspect of the present invention, the learning method feeds a plurality of augmented training data, generated from a multidimensional training datum before augmentation, into a single classifier, and performs learning using at least one augmented training datum selected, based on the probability distributions of the prediction labels output for each of the plurality of training data, using the error from the correct answer of the pre-augmentation training data as a measure.
According to a third aspect of the present invention, the learning method feeds a plurality of augmented training data, generated from a multidimensional training datum before augmentation, into a single classifier, and, based on the probability distributions of the prediction labels output for each of the plurality of training data, replaces the pre-augmentation training datum with one augmented training datum selected using the error from the correct answer as a measure, and learns.
When mini-batch learning is used, the multidimensional pre-augmentation training datum is one original datum in the mini-batch; the original datum is augmented into multiple extended data, each extended datum is input to the classifier, and, based on the output probability distributions of the prediction labels, one extended datum selected using the error from the correct answer of the original datum as a measure replaces the original datum for learning.
In the learning method of the present invention described above, using training data augmented with information degradation further improves generalization ability and recognition accuracy. Here, degrading information means rewriting part of the data to lose information: adding noise to the multidimensional original data, or deleting part of the multidimensional quantity or setting it to a predetermined value, thereby reducing the amount of information the original data carries. For example, information can be degraded by filling a specific subregion of the pre-augmentation image with black (setting the pixel values of that subregion to 0), or by setting the pixel values of the subregion to some specific value.
In the present invention, when the training data is an image, one image is represented by a multidimensional quantity of the pixel dimensions constituting that image, or by a multidimensional quantity representing the features of the image. When the training data is one-dimensional time-series data, it is represented by a multidimensional quantity in which the data values at each time are arranged over the duration, or by a multidimensional quantity representing the features of the one-dimensional time series. When the training data is multidimensional time-series data, it is represented by a multidimensional quantity in which the multidimensional data values at each time are arranged over the duration, or by a multidimensional quantity representing the features of the multidimensional time series.
In the learning method of the present invention, the data selected with the error from the correct answer as the measure is preferably data sampled with a selection probability that increases with the magnitude of the error from the correct answer. The data selected by this measure refers to probability distribution data or to training data. The error between the probability distribution (categorical distribution) of the prediction label output for each padded training datum and the probability distribution of the correct label (one-hot representation) is calculated, and each output prediction-label distribution is sampled with a selection probability raised according to the magnitude of the calculated error. Among the probability distributions selected by sampling, those with a larger error from the correct answer have a higher selection probability and are thus more likely to be chosen.
As shown in the examples described later, data with a larger error from the correct answer is known to raise learning efficiency. Data with a large error from the correct answer is difficult data for the classifier, and it is presumed that training on difficult data improves learning efficiency more than training on easy data (data with a small error from the correct answer).
On the other hand, learning efficiency tends to be higher when different kinds of data are mixed in than when only difficult data is learned. Therefore, the selection probability is raised for data with a larger error from the correct answer and lowered for data with a smaller error, and the padded, information-degraded training data are sampled and selected on that basis.
The number of sampled data items may be one, or two or more.
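One plausible realization of this error-weighted sampling, assuming the per-example cross-entropy errors are already collected in a PyTorch tensor (a sketch, not the patent's reference implementation):

```python
import torch

def sample_by_error(losses, num_selected=1):
    """Sample indices of padded examples with probability proportional
    to their error, so harder (larger-error) examples are chosen more
    often while easier ones are still occasionally mixed in."""
    probs = losses / losses.sum()   # normalize errors into probabilities
    return torch.multinomial(probs, num_selected, replacement=False)
```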
In the learning method of the present invention, the data selected with the error from the correct answer as the measure may be the data whose error from the correct answer is largest.
As described above, data with a larger error from the correct answer is known to raise learning efficiency. Such data is difficult data for the classifier, and training on difficult data improves learning efficiency more than training on easy data (data with a small error from the correct answer).
The number of data items with the maximum error may be one, or two or more; two or more items may share the maximum. In that case, any one of them may be selected at random, or a plurality of them may be used.
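A one-line sketch of this maximum-error selection over the same loss tensor (note that torch.argmax returns the first index on ties, so the random tie-breaking mentioned above would need an explicit check):

```python
import torch

def select_max_error(losses):
    """Index of the padded example whose prediction deviates most
    from the correct label."""
    return torch.argmax(losses).item()
```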
In the learning method of the present invention, the data selected with the error from the correct answer as the measure may be the data whose difference from the mean of the errors from the correct answer is smallest.
The probability distribution data or training data selected with the error from the correct answer as the measure can be chosen on the basis of the mean, the median, the first quartile, the third quartile, or a weighted mean of the errors, but it is preferable to select distributions in order of proximity to the mean error.
The number of data items whose difference from the mean error is smallest may be one, or two or more; two or more items may share the minimum difference. In that case, any one of them may be selected at random, or a plurality of them may be used.
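A sketch of selecting the M examples closest to the mean error (the tensor layout and the parameter m are assumptions for illustration):

```python
import torch

def select_closest_to_mean(losses, m=1):
    """Indices of the m padded examples whose errors lie closest to
    the mean error over all padded examples."""
    distance = (losses - losses.mean()).abs()
    return torch.topk(distance, m, largest=False).indices
```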
In the learning method of the present invention, the number of probability distributions selected with the above error as the measure is preferably two or more; selecting two or more distributions further improves recognition accuracy.
Similarly, selecting two or more training data items with the above error as the measure further improves recognition accuracy.
In the learning method of the present invention, time-series data is data that changes with time, for example time-series acceleration data obtained from an acceleration sensor worn by a person. Based on such acceleration data, the person's daily activities can be predicted and classified. Dividing the time series into smaller time series means dividing the obtained acceleration data into intervals of a fixed length and then modifying some of the divided segments into degraded acceleration data, for example by adding random noise to them or by setting their acceleration values to zero; these modified copies are used as padded data to train a classifier model of human daily activities.
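A sketch of this segment-wise degradation for one-dimensional acceleration data (the segment length, segment-selection probabilities, and noise scale are illustrative assumptions; `signal` is assumed to be a 1-D float numpy array):

```python
import numpy as np

def degrade_segments(signal, segment_len=50, p_zero=0.2, p_noise=0.2,
                     noise_std=0.1):
    """Split a 1-D time series into fixed-length segments and, per
    segment, either zero it out, add Gaussian noise, or keep it,
    producing one degraded (padded) copy of the signal."""
    out = signal.copy()
    for start in range(0, len(signal), segment_len):
        seg = slice(start, start + segment_len)
        r = np.random.rand()
        if r < p_zero:
            out[seg] = 0.0                                   # drop the segment
        elif r < p_zero + p_noise:
            out[seg] += np.random.normal(0.0, noise_std, out[seg].shape)
    return out
```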
In the learning method of the present invention, when the training data is image data, at least one of rotation, flipping, zooming, translation, cropping, division, blurring, and noise may be applied to the image data; when the training data is time-series data, at least one of inversion, shifting, cropping, division, and noise may be applied to the time-series data, to obtain the plurality of training data after padding.
Next, the learning device of the present invention will be described.
According to a fourth aspect of the present invention, a learning device comprises: a padding processing unit that pads multidimensional training data before padding into a plurality of training data; a label prediction unit that inputs the plurality of training data after padding and predicts labels using a single classifier; a selection unit that selects, from the probability distributions of the prediction labels output for the respective training data, at least one probability distribution with the error from the correct answer of the training data before padding as the measure; and a learning unit that learns on the basis of the error between the selected probability distribution and the correct answer of the training data before padding.
According to a fifth aspect of the present invention, a learning device comprises: a padding processing unit that pads multidimensional training data before padding into a plurality of training data; a label prediction unit that inputs the plurality of training data after padding and predicts labels using a single classifier; a selection unit that selects, on the basis of the probability distributions of the prediction labels output for the respective training data, at least one item of training data after padding with the error from the correct answer of the training data before padding as the measure; and a learning unit that learns using the selected training data after padding.
According to a sixth aspect of the present invention, a learning device comprises: a padding processing unit that pads multidimensional training data before padding into a plurality of training data; a label prediction unit that inputs the plurality of training data after padding and predicts labels using a single classifier; a selection unit that selects, on the basis of the probability distributions of the prediction labels output for the respective training data, one item of training data after padding with the error from the correct answer of the training data before padding as the measure; and a learning unit that learns by replacing the training data before padding with the selected training data after padding.
In the selection unit of the learning device of the present invention, the data selected with the error from the correct answer as the measure is preferably data sampled with a selection probability that increases with the magnitude of the error. The data selected by this measure refers to probability distribution data or to training data: the error between the prediction-label distribution output for each padded training datum and the correct-label distribution is calculated, and each output distribution is sampled with a selection probability raised according to the magnitude of the calculated error.
Alternatively, in the selection unit of the learning device of the present invention, the data selected with the error from the correct answer as the measure may be the data whose error from the correct answer is largest, or the data whose difference from the mean of the errors from the correct answer is smallest.
Further, in the padding processing unit of the learning device of the present invention, it is preferable to rewrite part of the data so that the multidimensional training data before padding loses information. Using training data padded with information degradation further improves generalization ability and recognition accuracy.
In the learning device of the present invention, the padding processing unit rewrites part of the data to degrade the information of the multidimensional training data before padding. When the training data is an image, one image is represented by a multidimensional quantity of the pixel dimensions constituting it, or by a multidimensional quantity representing its features; when the training data is one-dimensional time-series data, it is represented by a multidimensional quantity in which the data values at each time are arranged over the duration, or by a multidimensional quantity representing the features of that time series; and when the training data is multidimensional time-series data, it is represented by a multidimensional quantity in which the multidimensional data values at each time are arranged over the duration, or by a multidimensional quantity representing the features of that multidimensional time series. The information degradation may also be performed by adding noise to the multidimensional original data, or by deleting part of the multidimensional quantity or setting it to a predetermined value, thereby reducing the amount of information the original data carries.
According to the learning method and learning device of the present invention, data that degrades recognition accuracy can be excluded from the training data after padding, enabling efficient learning and improving generalization ability and recognition accuracy.
Brief description of the drawings:
FIG. 1: Functional block diagram of the learning device of Example 1
FIG. 2: Schematic flow chart of the learning method of Example 1
FIG. 3: Image before padding
FIG. 4: Explanatory drawing of cropping the image before padding
FIG. 5: Images after cropping
FIG. 6: Output of the probability distributions
FIG. 7: Explanatory drawing of the harmonization process when selecting the M samples around the mean
FIG. 8: Explanatory drawing of methods of padding training data
FIG. 9: Graph (1) showing the relation between prediction accuracy and model size
FIG. 10: Graph (2) showing the relation between prediction accuracy and model size
FIG. 11: Schematic flow chart of the learning method of Example 2
FIG. 12: Schematic flow chart of the learning method of Example 3
FIG. 13: Functional block diagram of the learning device of Example 4
FIG. 14: Explanatory drawing of expansion and selection of the training data in a mini-batch
FIG. 15: Explanatory drawing of typical data extensions
FIG. 16: Explanatory drawing of selecting the extended data image whose probability distribution is closest to the mean error
FIG. 17: Explanatory drawing of selecting the extended data image whose probability distribution has the maximum error
FIG. 18: Explanatory drawing (1) of sampling with a selection probability raised according to the magnitude of the error
FIG. 19: Explanatory drawing (2) of sampling with a selection probability raised according to the magnitude of the error
Hereinafter, an example of an embodiment of the present invention will be described in detail with reference to the drawings. The scope of the present invention is not limited to the following examples and illustrated examples, and many changes and modifications are possible.
FIG. 1 shows a functional block diagram of one embodiment of the learning device of the present invention. As shown in FIG. 1, the learning device 10 comprises a padding processing unit 1, a label prediction unit 2, a selection unit 3, and a learning unit 4. The padding processing unit 1 receives the training data 11 before padding, performs padding either without information degradation (such as flipping, rotation, or translation) or while degrading the information the training data 11 carries, and outputs the training data 21 after padding. The label prediction unit 2 inputs the padded training data 21 to the classifier 22 and outputs a prediction label for each of the plurality of training data. The selection unit 3 receives the probability distributions 15 of the output prediction labels, computes the error 31 from the correct answer of the training data before padding, and selects one of the probability distributions 15 with the error as the measure: specifically, the distribution closest to the mean of the errors 31, the distribution with the largest error 31, or a distribution selected by sampling with a selection probability raised according to the magnitude of the error 31. The learning unit 4 receives the probability distribution 32 selected with the error from the correct answer as the measure, the error calculation 41 against the correct answer is performed, and the weight parameters of the classifier 22 are adjusted (42).
In this example, padding with information degradation of the training data before padding is described, but the present invention is also useful for padding without information degradation, such as flipping, rotation, and translation.
FIG. 2 shows a processing flow diagram of one embodiment of the learning method of the present invention. As shown in FIG. 2, the training data is first padded (step S01). FIG. 3 shows the image before padding. In this example, the image 5 shown in FIG. 3 is used as one item of training data before padding. Image 5 has a black background on which the digit "7" is displayed in white.
FIG. 4 shows an example of extracting a plurality of information-degraded padded images from one original image before padding: specifically, four padded images 51 to 54 are cropped from the image 5 before padding. Padding is not limited to cropping four images; many more images can be cropped. As shown in FIG. 4, each of the four padded images 51 to 54 has part of the "7" cut away and is thus an information-degraded image. For example, image 51 is cropped from the upper right of the "7" and image 52 from its upper left, while image 53 is cropped from slightly below and to the right of the center of the "7" and image 54 from below it. By cropping parts of the image in this way, the training data before padding is information-degraded and a plurality of training data after padding can be created.
FIGS. 5(1) to 5(4) show the images after cropping. The image 51 shown in FIG. 5(1) can easily be recognized by the human eye as part of the digit "7", whereas for the image 52 shown in FIG. 5(2), the image 53 shown in FIG. 5(3), and the image 54 shown in FIG. 5(4) it is difficult for the human eye to tell whether the content is part of the digit "7".
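The extraction of such partial views could be sketched as follows (the crop size is an assumption, and random positions are used for illustration, whereas the figures show specific fixed regions):

```python
import numpy as np

def crop_partial_views(image, crop_size, n_crops=4):
    """Extract n_crops sub-images of crop_size x crop_size at random
    positions; each sub-image keeps only part of the original digit,
    i.e. it is an information-degraded padded example."""
    h, w = image.shape[:2]
    views = []
    for _ in range(n_crops):
        top = np.random.randint(0, h - crop_size + 1)
        left = np.random.randint(0, w - crop_size + 1)
        views.append(image[top:top + crop_size, left:left + crop_size].copy())
    return views
```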
These four padded images 51 to 54 are input as training data to a single classifier (step S02). FIG. 6 shows the probability distributions output by the classifier's image recognition. FIG. 6(1) shows the probability distributions of the prediction labels output for the respective training data, and FIG. 6(2) shows the probability distribution representing the correct answer of the training data before padding. As shown in FIG. 6(1), there are ten prediction labels, the digits "0" to "9", and the distributions for n = 1 to n = N are displayed; here, since the four images 51 to 54 are input to one classifier as training data, N = 4. Because the image of the training data before padding is the digit "7", the probability distribution representing the correct answer assigns 1 to the label "7" and 0 to the labels "0" to "6", "8", and "9" (a one-hot representation).
Next, the cross-entropy error between each of the four prediction-label distributions output for the four padded training data and the correct answer of the training data before padding is calculated, and the mean of the resulting errors is computed (step S03). Since the training data after padding are merely information-degraded versions of the training data before padding, they inherit the correct label of the training data before padding.
FIG. 7 illustrates the harmonization process when selecting the M samples (M a natural number) around the mean. The input image 6 is denoted x_i and the correct label 7 is denoted y_i; here, "Frog" is set as the correct label 7. As an example of the padded image data group 60 there are M image data, of which four (6a to 6d) are shown in FIG. 7. In the graph 9 of FIG. 7, each image datum is drawn as a single point on a one-dimensional line for clarity, but actual image data is extremely high-dimensional; for example, 28 × 28-pixel image data has 784 (= 28 × 28) dimensions. Of the four image data (6a to 6d), the leftmost image 6a hides the frog's face and does not look like a frog, so its point on the one-dimensional line is far from the points of the other image data.
The horizontal axis of the graph 9 shows the error computed from the padded image data group 60 and its label 7. For example, the leftmost image 6a, in which the frog's face is hidden, is hard to classify correctly, so its error is large and its point is drawn furthest to the right. The broken line 14 in the graph 9 indicates the mean of the outputs of the cross-entropy error function (loss function). If a padded image carries enough information to be classified correctly, as most of the padded image data group 60 does, it lies near the mean of the loss-function output; data that cannot be classified correctly lies far from the mean and can be judged not useful for learning. In other words, image data close to the mean is useful for learning and can improve the accuracy of the trained classifier, so it is actively used for learning.
Of the four image data (6a to 6d), the two image data (6b, 6c) close to the mean loss indicated by the broken line 14 in the graph 9 are extracted as data useful for learning and are used for training. The image data group 60 also contains image data other than (6a to 6d); here, two further images (6e, 6f) are used for learning.
From the probability distributions of the prediction labels output for the four training data, the distributions whose error is close to the mean of the outputs of the cross-entropy error function (loss function) are selected (step S04). Then, so that the error between the probability distribution of the prediction label selected in step S04 and the correct answer becomes small, the cross-entropy error function (loss function) is optimized and the weight parameters of the classifier are adjusted using error backpropagation (step S05).
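Steps S01 to S05 could be sketched as a single PyTorch training step as below; `classifier`, `augment`, and the hyperparameters `n_aug` and `m` are assumptions for illustration, not the reference implementation, and `label` is assumed to be a scalar LongTensor:

```python
import torch
import torch.nn.functional as F

def training_step(classifier, optimizer, image, label, augment, n_aug=4, m=1):
    """One update: pad the image into n_aug degraded copies (S01),
    classify them (S02), compute per-copy cross-entropy errors (S03),
    keep the m copies whose error is closest to the mean (S04), and
    backpropagate the loss of the kept copies (S05)."""
    batch = torch.stack([augment(image) for _ in range(n_aug)])   # S01
    logits = classifier(batch)                                    # S02
    targets = label.repeat(n_aug)
    losses = F.cross_entropy(logits, targets, reduction='none')   # S03
    distance = (losses.detach() - losses.detach().mean()).abs()
    keep = torch.topk(distance, m, largest=False).indices         # S04
    loss = losses[keep].mean()                                    # S05
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Replacing the selection line with a maximum-error or error-weighted selector yields the variants of Examples 2 and 3 described later.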
(Experiment 1 using image datasets)
Using the image datasets CIFAR-10 and CIFAR-100 as training data, the effectiveness of the learning device and learning method of this example was verified. CIFAR-10 and CIFAR-100 consist of 32 × 32-pixel RGB images. The classifier model was WideResNet28-4 with Dropout (p = 0.3). Dropout is a technique that disables neurons in the hidden layers of the network with a fixed probability during training, and it helps improve generalization performance.
The training data were padded by zero-padding, random horizontal flip, random crop, and cutout. FIG. 8 illustrates these padding methods: (1) the image data before padding, (2) zero-padding, (3) random horizontal flip, (4) random crop, and (5) cutout.
The image 61 shown in FIG. 8(2) is the image 6 of FIG. 8(1) with zero-padding applied; the surroundings of the image are filled with zeros over a range of 4 pixels on each of the top, bottom, left, and right. The image 62 shown in FIG. 8(3) is the image 6 horizontally flipped with a given probability. The image 63 shown in FIG. 8(4) is a random crop of the image 6; the crop size is 32 × 32 pixels, the same as image 61. The image 64 shown in FIG. 8(5) has a cutout region 64a placed in the image 6. Even for a large image, an important part of it may disappear depending on where the cutout region 64a is placed.
The batch size was 64 and the number of epochs was 100. Stochastic gradient descent (SGD) was used as the optimization method, with a learning rate of 0.1, a weight decay of 0.0005, and a momentum of 0.9; the learning rate was multiplied by 0.2 at epochs 40, 60, and 80.
As the harmonization method, 32 padded image data were prepared, their errors were computed, and the M samples closest to the mean error were selected and used for learning. The specific harmonization process is as described above.
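In PyTorch terms, the optimizer schedule described above corresponds to a configuration like the following sketch (`model` is a placeholder standing in for WideResNet28-4, and the epoch body is elided):

```python
import torch
import torch.nn as nn

model = nn.Linear(32 * 32 * 3, 10)  # placeholder for WideResNet28-4
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0005)
# Multiply the learning rate by 0.2 at epochs 40, 60, and 80.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 60, 80], gamma=0.2)

for epoch in range(100):
    # ... one epoch of mini-batch training with batch size 64 ...
    scheduler.step()
```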
Table 1 below compares, for CIFAR-10 and CIFAR-100, Comparative Example 1, in which one image was selected at random from the 32 padded image data, with Example A, in which the image closest to the mean error was selected (M = 1) from the 32 padded image data.
Table 2 below compares, for CIFAR-10 and CIFAR-100, Comparative Example 2, in which the data were padded to 8 images and all 8 were used for learning without selection, with Example B, in which the 4 images closest to the mean error were selected (M = 4) from the 32 padded image data, and Example C, in which the 8 images closest to the mean error were selected (M = 8). The experimental results in Tables 1 and 2 are top-1 error rates (%).
Table 1: Top-1 error rate (%)
                                            CIFAR-10        CIFAR-100
Comparative Example 1 (random selection)    5.43 ± 0.174    24.63 ± 0.280
Example A (closest to mean error, M = 1)    4.97 ± 0.095    23.93 ± 0.225
Table 2: Top-1 error rate (%)
                                              CIFAR-10        CIFAR-100
Comparative Example 2 (all 8 padded images)   4.34 ± 0.108    20.64 ± 0.095
Example B (4 closest to mean error, M = 4)    3.85 ± 0.039    20.05 ± 0.140
Example C (8 closest to mean error, M = 8)    4.02 ± 0.158    19.32 ± 0.295
As shown in Table 1 above, with CIFAR-10 the error rate was 5.43% ± 0.174 for Comparative Example 1 and 4.97% ± 0.095 for Example A, which selects the image closest to the mean error, so Example A had the lower error rate. Likewise with CIFAR-100, Comparative Example 1 gave 24.63% ± 0.280 and Example A gave 23.93% ± 0.225, so Example A again had the lower error rate.
As shown in Table 2 above, with CIFAR-10 the error rate was 4.34% ± 0.108 for Comparative Example 2, 3.85% ± 0.039 for Example B (the 4 images closest to the mean error), and 4.02% ± 0.158 for Example C (the 8 closest), so both Example B and Example C had lower error rates than Comparative Example 2. With CIFAR-100 the rates were 20.64% ± 0.095 for Comparative Example 2, 20.05% ± 0.140 for Example B, and 19.32% ± 0.295 for Example C, so again both Example B and Example C outperformed Comparative Example 2.
(Experiment 2 using image datasets)
As in Experiment 1, the image datasets CIFAR-10 and CIFAR-100 were used as training data to verify the effectiveness of the learning device and learning method of this example. Four convolutional neural networks (CNNs) were used: VGG6, VGG9, VGG13, and VGG16. Table 3 below outlines the classifier models used in this experiment.
[Table 3: outline of the classifier models (VGG6, VGG9, VGG13, VGG16) used in this experiment]
50,000 teacher images were used for training and 10,000 evaluation images for testing. For comparison with this example, experiments were also run with padding but without information degradation (Comparative Example 3) and without padding (Comparative Example 4). Padding without information degradation means ordinary padding, for example flipping the original image.
FIGS. 9 and 10 show graphs of the relation between prediction accuracy and model size. The graph in FIG. 9 plots Example D, in which the 32 × 32-pixel image data before padding were information-degraded to 27 × 27 pixels after padding, against Comparative Example 3a (padding without information degradation) and Comparative Example 4a (no padding). FIG. 9(1) shows the top-1 prediction accuracy with CIFAR-10, FIG. 9(2) the top-1 prediction accuracy with CIFAR-100, and FIG. 9(3) the top-5 prediction accuracy with CIFAR-100. In all of FIGS. 9(1) to 9(3), the prediction accuracy of Example D was confirmed to be equal to or higher than that of Comparative Example 3a or Comparative Example 4a. Compared with Comparative Examples 3a and 4a, the accuracy of Example D, not only for the top candidate but also for the top five candidates, degraded little even as the learning model was made smaller.
The graph in FIG. 10 plots Example E, in which the 32 × 32-pixel image data before padding were information-degraded to 23 × 23 pixels after padding, against Comparative Example 3b (with padding) and Comparative Example 4b (without padding). FIG. 10(1) shows the top-1 prediction accuracy with CIFAR-10, FIG. 10(2) the top-1 prediction accuracy with CIFAR-100, and FIG. 10(3) the top-5 prediction accuracy with CIFAR-100. In all of FIGS. 10(1) to 10(3), the prediction accuracy of Example E was confirmed to be higher than that of Comparative Examples 3b and 4b. As with Example D, the accuracy of Example E, both for the top candidate and for the top five candidates, degraded little even as the model size was reduced.
FIG. 11 shows a processing flow diagram of another embodiment of the learning method of the present invention. As in Example 1, the training data (images) are padded (step S11). Each padded image may be an information-degraded version of the original image before padding, or an image without information degradation relative to the original, such as a rotation or translation. The plurality of padded images are input to a single classifier (step S12). The prediction-label probability distributions output by the classifier's image recognition, and the probability distribution representing the correct answer of the training data before padding, are as shown in FIG. 6.
Next, the cross-entropy error between each of the prediction-label distributions output for the padded images and the correct answer of the original image before padding is calculated (step S13). Among these distributions, the one for which the output of the cross-entropy error function (loss function) is largest is selected (step S14). Then, so that the error between the probability distribution of the prediction label selected in step S14 and the correct answer becomes small, the cross-entropy error function (loss function) is optimized and the weight parameters of the classifier are adjusted using error backpropagation (step S15).
FIG. 12 shows a processing flow diagram of another embodiment of the learning method of the present invention. As in Example 1, the training data (images) are padded (step S21). Each padded image may be an information-degraded version of the original image before padding, or an image without information degradation relative to the original, such as a rotation or translation. The plurality of padded images are input to a single classifier (step S22). The prediction-label probability distributions output by the classifier's image recognition, and the probability distribution representing the correct answer of the training data before padding, are as shown in FIG. 6.
Next, the cross-entropy error between each of the prediction-label distributions output for the padded images and the correct answer of the original image before padding is calculated (step S23). From these distributions, one is selected by sampling with a selection probability raised according to the magnitude of the error (step S24). Then, so that the error between the probability distribution of the prediction label selected by sampling in step S24 and the correct answer becomes small, the cross-entropy error function (loss function) is optimized and the weight parameters of the classifier are adjusted using error backpropagation (step S25).
FIG. 13 shows a functional block diagram of another embodiment of the learning device of the present invention. As shown in FIG. 13, the learning device 10a comprises a padding processing unit 1, a label prediction unit 2, a selection unit 3, and a learning unit 4a. As in the learning device 10 of Example 1, the padding processing unit 1 receives the training data 11 before padding and outputs the training data 21 after padding. The label prediction unit 2 inputs the padded training data 21 to the classifier 22 and outputs a prediction label for each of the plurality of training data. The selection unit 3 receives the probability distributions 15 of the output prediction labels, computes the error 31 from the correct answer of the training data before padding, and selects training data with the error as the measure: specifically, the padded training data whose distribution is closest to the mean of the errors 31, the padded training data whose distribution has the largest error 31, or padded training data selected by sampling with a selection probability raised according to the magnitude of the error 31.
In the learning unit 4a, the classifier 22 receives the padded training data selected with the error from the correct answer as the measure, the error calculation 41 between the probability distribution of the prediction label output by the classifier 22 and the correct answer is performed, and the weight parameters of the classifier 22 are adjusted (42).
Depending on how the data for training a classifier are handled, learning techniques known as batch learning, mini-batch learning, and online learning exist. In deep learning (DNN), to suppress overfitting and make learning feasible on huge amounts of training data, the full dataset is usually divided into units called mini-batches; the data contained in one mini-batch are input to the classifier, and the classifier's parameters are updated so as to reduce the error between the output prediction-label probability distributions and the correct answers. One parameter update is performed per batch-size of training data. Learning is carried out over all mini-batches and then repeated.
When mini-batch learning is used in the learning method of the present invention, the following processes a) to d) are performed. FIG. 14 shows a conceptual diagram of the learning method using mini-batches; a code sketch follows the list.
a) First, for one item of original data in the mini-batch (the training data before padding), a plurality of extended data (the training data after padding) are created by information-degrading the original data (see the "expand" arrow in FIG. 14).
b) Next, each extended datum is input to the classifier, which outputs the probability distribution of the prediction label.
c) On the basis of the output prediction-label distributions, one extended datum is selected with the error from the correct answer of the original data as the measure (see the "select" arrow in FIG. 14). As described later, the selection takes either the extended datum whose distribution is closest to the mean error, the extended datum whose error is largest, or an extended datum sampled with a selection probability raised according to the magnitude of the error.
d) The selected extended datum replaces the original datum, and the classifier is trained on it as the mini-batch training data.
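A minimal sketch of steps a) to d) for one mini-batch, assuming a PyTorch classifier, an `augment` function, and an index-returning selector (none of these names are from the source; the defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def minibatch_step(classifier, optimizer, images, labels, augment,
                   n_aug=8, select=torch.argmax):
    """For each original example in the mini-batch: expand it into n_aug
    degraded copies (a), obtain their prediction distributions (b),
    select one copy by its error from the correct label (c), then train
    on the selected copies in place of the originals (d)."""
    replaced = []
    with torch.no_grad():                                              # selection pass only
        for img, y in zip(images, labels):
            cand = torch.stack([augment(img) for _ in range(n_aug)])   # a)
            losses = F.cross_entropy(classifier(cand),                 # b)
                                     y.repeat(n_aug), reduction='none')
            replaced.append(cand[select(losses)])                      # c)
    batch = torch.stack(replaced)                                      # d)
    loss = F.cross_entropy(classifier(batch), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Swapping `select` for the mean-proximity or error-weighted selectors sketched earlier switches among the three selection rules described next.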
Here, the extension techniques that create a plurality of extended data by information-degrading the original data are described. FIG. 15 shows typical data extensions: flipping, cropping, Cutout, and Mixup. Flipping, like rotation and translation, extends the data while largely preserving the information of the original. Cropping cuts away unneeded parts such as margins while keeping the important part of the original image. Flipping, rotation, translation, and cropping do not information-degrade the original data.
Cutout, by contrast, cuts away (hides) part of the important region, unlike cropping, and Mixup fuses two images; both information-degrade the original data. With Cutout, for example, random extended data can be created by randomly choosing the size and position of the part cut from the original data.
The extension techniques are not limited to the above; other techniques are also applicable.
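Cutout corresponds to the region-masking sketch given earlier; Mixup could be sketched as follows (the Beta parameter alpha is an illustrative assumption):

```python
import numpy as np

def mixup(image_a, image_b, alpha=1.0):
    """Fuse two images with a Beta(alpha, alpha)-distributed weight;
    the labels of the two images are usually blended with the same
    weight, so the information of both originals is degraded."""
    lam = np.random.beta(alpha, alpha)
    mixed = lam * image_a + (1.0 - lam) * image_b
    return mixed, lam
```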
The three ways of selecting extended data with the error from the correct answer as the measure are described with reference to FIGS. 16 to 18. FIG. 16 illustrates selecting the extended datum whose probability distribution is closest to the mean error, FIG. 17 selecting the extended datum whose distribution has the maximum error, and FIG. 18 selecting extended data by sampling with a selection probability raised according to the magnitude of the error.
First, to select the extended datum whose distribution is closest to the mean error, as shown in FIG. 16, each original image in the mini-batch is extended into M images, the M extended data images are input to the classifier (DNN) partway through training, and the prediction-label probability distribution for each of the M images is output. The correct label of an extended data image is the same as that of the original image. The cross-entropy error between each prediction-label distribution output by the classifier and the correct answer is calculated, and the extended data image closest to the mean error replaces the original image as the mini-batch training data.
Next, to select the extended datum whose distribution has the maximum error from the correct answer, as shown in FIG. 17, each original image in the mini-batch is extended into M images, the M extended data images are input to the classifier (DNN) partway through training, and the prediction-label distribution for each is output. The correct label of an extended data image is the same as that of the original image. The cross-entropy error between each output distribution and the correct answer is calculated, and the extended data image with the largest error replaces the original image as the mini-batch training data.
Next, to select extended data by sampling with a selection probability raised according to the magnitude of the error, as shown in FIGS. 18 and 19, each original image in the mini-batch is extended into M images, the M extended data images are input to the classifier (DNN) partway through training, and the prediction-label distribution for each is output. The correct label of an extended data image is the same as that of the original image. The cross-entropy error between each output distribution and the correct answer is calculated, and an extended data image selected by sampling, with a selection probability raised according to the magnitude of the error, replaces the original image as the mini-batch training data.
When raising the selection probability according to the magnitude of the error, as shown in FIG. 19, the errors may be normalized and the selection probabilities set using the relative error based on the minimum error. Using the relative error with the minimum as the baseline gives the extended data image with the minimum error a selection probability of 0, so it can be excluded from sampling. Data with a small error from the correct answer is data for which the correct answer is easily reached and is not suited to learning; therefore the selection probability is raised according to the magnitude of the relative error, data unsuited to learning are excluded from the sampling targets, and the extended data are selected.
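A sketch of this relative-error sampling, assuming the per-example errors are in a tensor (the uniform fallback for all-equal errors is an added safeguard, not from the source):

```python
import torch

def sample_by_relative_error(losses, num_selected=1):
    """Sample in proportion to each example's error above the minimum;
    the minimum-error (easiest) example gets probability 0 and is
    thereby excluded from sampling."""
    rel = losses - losses.min()
    if rel.sum() == 0:          # all errors equal: fall back to uniform
        rel = torch.ones_like(losses)
    probs = rel / rel.sum()
    return torch.multinomial(probs, num_selected)
```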
(Experiment 3 using image datasets)
Using the image datasets CIFAR-10 and CIFAR-100 as training data, the effectiveness of the learning method of this example was verified. CIFAR-10 and CIFAR-100 consist of 32 × 32-pixel RGB images. PreAct ResNet18 was used as the classifier model.
The training data were padded using zero-padding, random horizontal flip, and random crop as ordinary padding, and Cutout and Mixup as information-degrading data extensions.
The batch size was 64 and the number of epochs was 100. Stochastic gradient descent (SGD) was used as the optimization method, with a learning rate of 0.1, a weight decay of 0.0005, and a momentum of 0.9; the learning rate was multiplied by 0.2 at epochs 40, 60, and 80.
Tables 4 and 5 below show, for CIFAR-10 and CIFAR-100 respectively, the accuracy obtained when each original image in the mini-batch was padded to 8 images and learning was then performed with the following four methods. All results are top-1 accuracy rates (%).
- The original image was used for training the classifier without selecting from the padded images (Comparative Example 5).
- The one image closest to the mean error was selected and replaced the original image for training (Example E).
- The one image with the largest error was selected and replaced the original image for training (Example F).
- One image was selected by sampling with a selection probability raised according to the magnitude of the error and replaced the original image for training (Example G).
Here, padding 1 applies ordinary padding without information degradation, such as flipping, to the original image to produce 8 images; padding 2 uses ordinary padding together with information-degrading Cutout to pad the original image to 8 images; and padding 3 uses ordinary padding together with information-degrading Mixup to pad the original image to 8 images.
[Table 4]

[Table 5]
 As shown in Table 4, when CIFAR-10 was used, Examples F and G achieved a higher accuracy than Comparative Example 5, the conventional learning method, in all cases: Augmentation 1 without information degradation as well as Augmentations 2 and 3 with information degradation. Example E, however, exceeded Comparative Example 5 only with Augmentation 1; with Augmentations 2 and 3 its accuracy was roughly equal to or lower than that of Comparative Example 5.
 As shown in Table 5, when CIFAR-100 was used, all three selection methods (Examples E, F, and G) achieved a higher accuracy than Comparative Example 5, regardless of whether the augmentation involved information degradation. Comparing the three augmentations, those involving information degradation (Augmentations 2 and 3) yielded a higher accuracy than Augmentation 1, which does not degrade information, in almost all cases.
 The present invention is useful as a technique that enables highly accurate recognition in machine learning, in particular in deep neural networks (DNNs).
1 Augmentation processing unit
2 Label prediction unit
3 Selection unit
4, 4a Learning unit
5, 6, 6a–6f, 51–54, 61–64 Image
7 Label
8, 9 Graph
10, 10a Learning device
11 Training data before augmentation
15 Probability distribution of the output prediction label
21 Training data after augmentation
22 Classifier
31 Error from the correct answer of the training data before augmentation
32 Probability distribution selected using the error from the correct answer as a measure
41 Error calculation against the correct answer
42 Adjustment of the classifier's weight parameters
60 Image data group
64a Cutout portion

Claims (21)

1.  A learning method characterized in that a plurality of post-augmentation training data generated from multidimensional pre-augmentation training data are input to one classifier, and learning is performed based on the error from the correct answer of the pre-augmentation training data, using at least one probability distribution selected, with the error from the correct answer of the pre-augmentation training data as a measure, from among the probability distributions of the prediction labels output for each of the plurality of training data.
2.  A learning method characterized in that a plurality of post-augmentation training data generated from multidimensional pre-augmentation training data are input to one classifier, and learning is performed using at least one item of post-augmentation training data selected, based on the probability distributions of the prediction labels output for each of the plurality of training data, with the error from the correct answer of the pre-augmentation training data as a measure.
3.  A learning method characterized in that a plurality of post-augmentation training data generated from multidimensional pre-augmentation training data are input to one classifier, and one item of post-augmentation training data, selected based on the probability distributions of the prediction labels output for each of the plurality of training data with the error from the correct answer of the pre-augmentation training data as a measure, is substituted for the pre-augmentation training data for learning.
4.  The learning method according to claim 3, wherein, when mini-batch learning is used, the multidimensional pre-augmentation training data is one item of original data in the mini-batch; the original data is augmented into a plurality of extended data; each item of extended data is input to the classifier; and, based on the probability distributions of the output prediction labels, one item of extended data selected with the error from the correct answer of the original data as a measure is substituted for the original data for learning.
5.  The learning method according to any one of claims 1 to 4, wherein the data selected with the error from the correct answer as a measure is data sampled with a selection probability that increases with the magnitude of the error from the correct answer.
6.  The learning method according to any one of claims 1 to 4, wherein the data selected with the error from the correct answer as a measure is the data whose error from the correct answer is largest.
7.  The learning method according to any one of claims 1 to 4, wherein the data selected with the error from the correct answer as a measure is the data whose difference from the mean of the errors from the correct answer is smallest.
8.  The learning method according to claim 1, wherein two or more probability distributions are selected with the error as a measure.
9.  The learning method according to any one of claims 1 to 8, wherein, when the training data is an image, each image is represented by a multidimensional quantity in the pixel dimensions constituting that image, or by a multidimensional quantity representing features of the image; when the training data is one-dimensional time-series data, the data is represented by a multidimensional quantity in which the data values at each time are arranged over the duration, or by a multidimensional quantity representing features of the one-dimensional time-series data; and when the training data is multidimensional time-series data, the data is represented by a multidimensional quantity in which the multidimensional data values at each time are arranged over the duration, or by a multidimensional quantity representing features of the multidimensional time-series data.
10.  The learning method according to any one of claims 1 to 9, wherein the plurality of post-augmentation training data are data whose information has been degraded relative to the multidimensional pre-augmentation training data.
11.  The learning method according to claim 10, wherein the information degradation rewrites part of the data so that information is lost.
12.  The learning method according to any one of claims 1 to 9, wherein the plurality of post-augmentation training data are data whose information has been degraded relative to the multidimensional pre-augmentation training data; when the training data is an image, each image is represented by a multidimensional quantity in the pixel dimensions constituting that image, or by a multidimensional quantity representing features of the image; when the training data is one-dimensional time-series data, the data is represented by a multidimensional quantity in which the data values at each time are arranged over the duration, or by a multidimensional quantity representing features of the one-dimensional time-series data; when the training data is multidimensional time-series data, the data is represented by a multidimensional quantity in which the multidimensional data values at each time are arranged over the duration, or by a multidimensional quantity representing features of the multidimensional time-series data; and the information degradation adds noise to the multidimensional original data, or deletes part of the multidimensional quantity or sets it to a predetermined value, thereby reducing the amount of information contained in the original data.
13.  The learning method according to any one of claims 1 to 12, wherein, when the training data is image data, at least one of rotation, flipping, zooming, translation, cropping, division, blurring, and noise is applied to the image data, and when the training data is time-series data, at least one of inversion, shifting, clipping, division, and noise is applied to the time-series data, to yield the plurality of post-augmentation training data.
14.  A learning device comprising: an augmentation processing unit that augments multidimensional pre-augmentation training data into a plurality of training data; a label prediction unit that receives the plurality of post-augmentation training data and predicts labels using one classifier; a selection unit that selects at least one probability distribution, from among the probability distributions of the prediction labels output for each of the plurality of training data, with the error from the correct answer of the pre-augmentation training data as a measure; and a learning unit that learns based on the error between the selected probability distribution and the correct answer of the pre-augmentation training data.
15.  A learning device comprising: an augmentation processing unit that augments multidimensional pre-augmentation training data into a plurality of training data; a label prediction unit that receives the plurality of post-augmentation training data and predicts labels using one classifier; a selection unit that selects at least one item of post-augmentation training data, based on the probability distributions of the prediction labels output for each of the plurality of training data, with the error from the correct answer of the pre-augmentation training data as a measure; and a learning unit that learns using the selected post-augmentation training data.
16.  A learning device comprising: an augmentation processing unit that augments multidimensional pre-augmentation training data into a plurality of training data; a label prediction unit that receives the plurality of post-augmentation training data and predicts labels using one classifier; a selection unit that selects one item of post-augmentation training data, based on the probability distributions of the prediction labels output for each of the plurality of training data, with the error from the correct answer of the pre-augmentation training data as a measure; and a learning unit that substitutes the selected post-augmentation training data for the pre-augmentation training data for learning.
17.  The learning device according to any one of claims 14 to 16, wherein the data selected by the selection unit with the error from the correct answer as a measure is data sampled with a selection probability that increases with the magnitude of the error from the correct answer.
18.  The learning device according to any one of claims 14 to 16, wherein the data selected by the selection unit with the error from the correct answer as a measure is the data whose error from the correct answer is largest.
19.  The learning device according to any one of claims 14 to 16, wherein the data selected by the selection unit with the error from the correct answer as a measure is the data whose difference from the mean of the errors from the correct answer is smallest.
20.  The learning device according to any one of claims 14 to 19, wherein the augmentation processing unit rewrites part of the data so that information is lost from the multidimensional pre-augmentation training data.
21.  The learning device according to claim 20, wherein the augmentation processing unit rewrites part of the data to degrade the information of the multidimensional pre-augmentation training data; when the training data is an image, each image is represented by a multidimensional quantity in the pixel dimensions constituting that image, or by a multidimensional quantity representing features of the image; when the training data is one-dimensional time-series data, the data is represented by a multidimensional quantity in which the data values at each time are arranged over the duration, or by a multidimensional quantity representing features of the one-dimensional time-series data; when the training data is multidimensional time-series data, the data is represented by a multidimensional quantity in which the multidimensional data values at each time are arranged over the duration, or by a multidimensional quantity representing features of the multidimensional time-series data; and the information degradation adds noise to the multidimensional original data, or deletes part of the multidimensional quantity or sets it to a predetermined value, thereby reducing the amount of information contained in the original data.
PCT/JP2020/043248 2019-11-19 2020-11-19 Learning method and learning device employing augmentation WO2021100818A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2021558450A JP7160416B2 (en) 2019-11-19 2020-11-19 LEARNING METHOD AND LEARNING DEVICE USING PADDING

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019209179 2019-11-19
JP2019-209179 2019-11-19

Publications (1)

Publication Number Publication Date
WO2021100818A1 true WO2021100818A1 (en) 2021-05-27

Family

ID=75980143

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/043248 WO2021100818A1 (en) 2019-11-19 2020-11-19 Learning method and learning device employing augmentation

Country Status (2)

Country Link
JP (1) JP7160416B2 (en)
WO (1) WO2021100818A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7346767B1 (en) 2023-07-21 2023-09-19 修二 奥野 Learning device and reasoning device
WO2023248948A1 (en) * 2022-06-24 2023-12-28 株式会社東京ウエルズ Learning device, learning method, and learning program
JP7468472B2 (en) 2021-07-08 2024-04-16 Jfeスチール株式会社 Trained model generation method, recognition method, and information processing device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018018350A (en) * 2016-07-28 2018-02-01 富士通株式会社 Image recognition device, image recognition program, image recognition method and recognition device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7124404B2 (en) 2018-04-12 2022-08-24 富士通株式会社 Machine learning program, machine learning method and machine learning apparatus

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018018350A (en) * 2016-07-28 2018-02-01 富士通株式会社 Image recognition device, image recognition program, image recognition method and recognition device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HIRUKAWA, SHOYA ET AL.: "Implementation and Evaluation of Optical Character Recognition Considering Color of Papers and Distortion of Spread Books", LECTURE PROCEEDINGS OF 18TH FORUM ON INFORMATION TECHNOLOGY FIT2019, vol. 3, 20 August 2019 (2019-08-20), pages 25 - 30, XP055825762 *
IDE, ATSUYA ET AL.: "Learning method by harmonizing predictive output for multiple samples of learning input", IBIS2019 THE 22ND INFORMATION-BASED INDUCTION SCIENCES WORKSHOP: POSTER SESSION PREVIEW SLIDE, 20 November 2019 (2019-11-20), Retrieved from the Internet <URL:https://drive.google.com/file/d/1f_FRq182b-RdRC2SRAla3D6fEctqYvyR/view?usp=sharing> [retrieved on 20210125] *
KONO, YOHEI ET AL.: "Data expansion using GAN", IPSJ SIG TECHNICAL REPORTS, vol. 2017 -CV, no. 14, 3 May 2017 (2017-05-03), pages 2 - 5, ISSN: 2188-8701, Retrieved from the Internet <URL:https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_uri&item_id=178747&file_id=1&file_no=1> [retrieved on 20210115] *
SEKIZAWA, AKIRA ET AL.: "Training of Traffic Sign Detector and Classifier Using Synthetic Road Scenes", IEICE TECHNICAL REPORT, vol. 118, no. 362, 6 December 2018 (2018-12-06), pages 73 - 78, ISSN: 2432-6380, Retrieved from the Internet <URL:https://www.ieice.org/ken/user/index.php?cmd=download&p=fEOj&t=IEICE-PRMU&1=35412853852cf393d627f8785553370589207eb8ba1197ddb29472a33c3a30d8&lang> [retrieved on 20210125] *

Also Published As

Publication number Publication date
JPWO2021100818A1 (en) 2021-05-27
JP7160416B2 (en) 2022-10-25

Similar Documents

Publication Publication Date Title
WO2021100818A1 (en) Learning method and learning device employing augmentation
US11468262B2 (en) Deep network embedding with adversarial regularization
CN109840531B (en) Method and device for training multi-label classification model
CN107836000A (en) For Language Modeling and the improved artificial neural network of prediction
CN110570433B (en) Image semantic segmentation model construction method and device based on generation countermeasure network
US20200134463A1 (en) Latent Space and Text-Based Generative Adversarial Networks (LATEXT-GANs) for Text Generation
US20230222353A1 (en) Method and system for training a neural network model using adversarial learning and knowledge distillation
CN110969086B (en) Handwritten image recognition method based on multi-scale CNN (CNN) features and quantum flora optimization KELM
JP6965206B2 (en) Clustering device, clustering method and program
CN115335830A (en) Neural architecture search with weight sharing
WO2002054757A1 (en) Data coding method and device, and data coding program
EP4092555A1 (en) Control method, information processing device, and control program
CN113963165A (en) Small sample image classification method and system based on self-supervision learning
CN111292349B (en) Data enhancement method for target detection based on fusion of recommendation candidate boxes
CN113886626A (en) Visual question-answering method of dynamic memory network model based on multiple attention mechanism
Calder et al. Use and misuse of machine learning in anthropology
JP6988995B2 (en) Image generator, image generator and image generator
CN114091597A (en) Countermeasure training method, device and equipment based on adaptive group sample disturbance constraint
WO2020230777A1 (en) Training method for machine learning model, data generation device, and trained machine learning model
Lima et al. Automatic design of deep neural networks applied to image segmentation problems
Chen et al. Mixing high-dimensional features for JPEG steganalysis with ensemble classifier
KR20240034804A (en) Evaluating output sequences using an autoregressive language model neural network
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium
KR102305981B1 (en) Method for Training to Compress Neural Network and Method for Using Compressed Neural Network
Wang Two-dimensional entropy method based on genetic algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20891335

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021558450

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20891335

Country of ref document: EP

Kind code of ref document: A1