WO2021059388A1 - Learning device, image processing device, learning method, and learning program - Google Patents

Learning device, image processing device, learning method, and learning program

Info

Publication number
WO2021059388A1
Authority
WO
WIPO (PCT)
Prior art keywords: learning, conversion, neural network, feature extraction, unit
Application number: PCT/JP2019/037552
Other languages: French (fr), Japanese (ja)
Inventors: 真弥 山口, 美樹 境, 哲哉 塩田, 足立 一樹
Original Assignee: 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Application filed by 日本電信電話株式会社
Priority to PCT/JP2019/037552
Publication of WO2021059388A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis

Definitions

  • The present invention relates to a learning device, an image processing device, a learning method, and a learning program.
  • Deep learning (deep neural network: DNN) has been very successful in image recognition and related tasks. For example, in image recognition using a DNN, when an image is input to the DNN model, the model outputs a classification result indicating what the image shows.
  • A deep learning model requires a large amount of labeled learning data to achieve high accuracy. Teacher labels must be attached to the collected image data, and the cost is particularly high when only a specialist can assign the labels, as with medical images. Creating learning data is a significant cost and a major impediment to the use of deep learning.
  • Several methods have been proposed for training a DNN model even when the amount of labeled data is small. Examples include transfer learning, which reuses a model already trained on another data set; data augmentation, which processes the original data to increase the number of samples; and self-supervised learning (see, for example, Non-Patent Document 1), which creates labels from the data itself and acquires features by solving subtasks.
  • The method described in Non-Patent Document 1 acquires representations useful in subsequent tasks (classification, etc.) by solving a pre-task of predicting a rotation angle. Because this method is simple yet powerful, it has been adopted for training generative adversarial networks (GAN: Generative Adversarial Networks) and for learning in combination with other methods.
  • The method described in Non-Patent Document 1 reuses representations acquired by unsupervised learning. Specifically, the rotation angle of the image is learned in the pre-task, and class classification is learned in the subsequent task. Predicting the rotation angle lets this method capture the geometric features of the image data, but the pre-task is of little benefit for data for which rotation cannot be defined, or for data whose rotation angle can be predicted trivially. For example, in a landscape image the angle can be predicted from the position of the sky alone, so the pre-task has little effect.
  • The present invention has been made in view of the above, and an object of the present invention is to provide a learning device, an image processing device, a learning method, and a learning program capable of capturing the features of an image and improving the accuracy of image processing.
  • To solve the above problems and achieve the object, the learning device according to the present invention includes: a first learning unit that takes image data subjected to a plurality of conversions as input and, by performing multi-task learning with a first neural network so as to estimate the conversion content for each conversion type, updates parameters including shared parameters of a feature extraction layer of the first neural network; and a second learning unit that takes arbitrary image data as input and, using a second neural network in which the shared parameters learned by the first learning unit are applied to the feature extraction layer, learns parameters of the second neural network so as to perform predetermined processing. The first neural network has the feature extraction layer and a plurality of pre-learning neural networks corresponding to the respective conversion types of the plurality of conversions, and the plurality of pre-learning neural networks share the feature extraction layer and each estimate the conversion content for its conversion type.
  • The image processing apparatus according to the present invention has a processing unit that processes input image data using a model having a deep neural network in which trained parameters are set. The trained parameters are based on shared parameters of a feature extraction layer of a first neural network, updated by taking image data subjected to a plurality of conversions as input and performing multi-task learning with the first neural network so as to estimate the conversion content for each conversion type. The first neural network has the feature extraction layer and a plurality of pre-learning neural networks corresponding to the conversion types of the plurality of conversions, and the plurality of pre-learning neural networks share the feature extraction layer and estimate the conversion content for each conversion type.
  • The learning method according to the present invention is a learning method executed by a learning device, and includes: a first learning step of taking image data subjected to a plurality of conversions as input and, by performing multi-task learning with a first neural network so as to estimate the conversion content for each conversion type, updating parameters including shared parameters of a feature extraction layer of the first neural network; and a second learning step of taking arbitrary image data as input and, using a second neural network in which the shared parameters learned in the first learning step are applied to the feature extraction layer, learning parameters of the second neural network so as to perform predetermined processing. The first neural network has the feature extraction layer and a plurality of pre-learning neural networks corresponding to the respective conversion types of the plurality of conversions, and the plurality of pre-learning neural networks share the feature extraction layer and estimate the conversion content for each conversion type.
  • The learning program according to the present invention causes a computer to execute: a first learning step of taking image data subjected to a plurality of conversions as input and, by performing multi-task learning with a first neural network so as to estimate the conversion content for each conversion type, updating parameters including shared parameters of a feature extraction layer of the first neural network; and a second learning step of taking arbitrary image data as input and, using a second neural network in which the shared parameters learned in the first learning step are applied to the feature extraction layer, learning parameters of the second neural network so as to perform predetermined processing. The first neural network has the feature extraction layer and a plurality of pre-learning neural networks corresponding to the respective conversion types of the plurality of conversions, and the plurality of pre-learning neural networks share the feature extraction layer and estimate the conversion content for each conversion type.
  • FIG. 1 is a diagram showing an example of a configuration of an image processing system according to an embodiment.
  • FIG. 2 is a diagram showing an example of the configuration of the learning device shown in FIG.
  • FIG. 3 is a diagram showing the estimation accuracy of the converted content for the image data subjected to the conversion process.
  • FIG. 4 is a diagram illustrating a flow of learning processing according to the embodiment.
  • FIG. 5 is a diagram showing an example of the configuration of the first task learning unit shown in FIG.
  • FIG. 6 is a flowchart showing a processing procedure of image processing performed by the image processing system according to the embodiment.
  • FIG. 7 is a flowchart showing the processing procedure of the learning process shown in FIG. 6.
  • FIG. 8 is a diagram comparing the classification accuracy of a DNN model trained using the learning method according to the embodiment with that of a DNN trained using the learning method described in Non-Patent Document 1.
  • FIG. 9 is a diagram showing an example of a computer in which a learning device or an image processing device is realized by executing a program.
  • In the following, the learning device performs, as a pre-task (first task), a plurality of conversions on image data and carries out self-supervised learning, namely multi-task learning that estimates the content of each conversion; as a subsequent task (second task), it reuses the learning result of the first task to learn image data classification. An example of this will be described. The present invention is not limited to the embodiments described below.
  • FIG. 1 is a diagram showing an example of a configuration of an image processing system according to an embodiment.
  • As shown in FIG. 1, the image processing system 1 in the embodiment has an image processing device 20 that classifies image data using a deep learning (deep neural network: DNN) model, and a learning device 10 that sets the parameters of the DNN model used by the image processing device 20 by learning the features of image data for learning.
  • The learning device 10 has a self-supervised learning unit 11 (first learning unit) and a second task learning unit 13 (second learning unit) provided after the self-supervised learning unit 11.
  • The self-supervised learning unit 11 uses a first neural network (NN) to perform self-supervised learning on image data for learning, applying a plurality of types of conversion to the image data.
  • The first NN has a feature extraction layer that extracts conversion-related features from image data subjected to a plurality of conversions, and a plurality of pre-learning NNs that each estimate the conversion content for one conversion type applied to the image data.
  • The self-supervised learning unit 11 takes the image data subjected to the plurality of conversions as input and, using the first NN, performs multi-task learning so as to estimate the conversion content for each conversion type, thereby carrying out the first task of updating parameters including the shared parameters of the feature extraction layer of the first NN.
  • The self-supervised learning unit 11 has a learning data generation unit 14 and a first task learning unit 15.
  • The learning data generation unit 14 generates self-teacher data based on the input image data 30 for learning, and applies a plurality of types of conversion to the image data for learning.
  • The first task learning unit 15 learns the features of the plurality of conversion contents by multi-task learning, using the self-teacher data generated by the learning data generation unit 14 and the image data subjected to the plurality of types of conversion.
  • Concretely, the first task learning unit 15 shares the shared parameters of the feature extraction layer, which extracts conversion-related features from the image data subjected to the plurality of conversions, across a plurality of pre-learning NNs corresponding to the respective conversions, and performs multi-task learning so as to estimate the conversion content for each conversion type, thereby updating the parameters of the plurality of NNs and the shared parameters of the feature extraction layer.
  • The feature extraction layer is composed of a DNN model containing many non-linear functions.
  • The first task learning unit 15 outputs the DNN model with updated parameters (first learned DNN 16) to the second task learning unit 13. A minimal sketch of this shared-encoder, multi-head structure follows.
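  • The following is a minimal PyTorch-style sketch of the first NN under the description above: a shared feature extraction layer (encoder) and one small four-class head per conversion type. The layer sizes, module names, and the choice of PyTorch are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of the first NN: a shared feature extraction layer
# (encoder, parameters theta_sh) plus one pre-learning head per conversion
# type (parameters theta_t). Layer sizes and names are illustrative.
import torch
import torch.nn as nn

class FirstNN(nn.Module):
    def __init__(self, task_names, num_classes_per_task=4):
        super().__init__()
        # Shared feature extraction layer: a small convolutional DNN.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One four-class classification head per conversion type; PyTorch
        # initializes these with random values (cf. the random initial
        # values described for the pre-learning NNs).
        self.heads = nn.ModuleDict(
            {t: nn.Linear(64, num_classes_per_task) for t in task_names})

    def forward(self, x):
        z = self.encoder(x)                # shared feature vector
        return {t: head(z) for t, head in self.heads.items()}

model = FirstNN(["rotation", "solarize", "sharpness"])
logits = model(torch.randn(8, 3, 32, 32))  # one 4-class output per task
```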
  • The second task learning unit 13 uses the learning result of the first task to perform the second task of learning the predetermined processing.
  • Specifically, the second task learning unit 13 takes arbitrary image data as input and, using a second NN in which the shared parameters learned by the self-supervised learning unit 11 are applied to the feature extraction layer, learns the parameters of the second NN so as to perform the predetermined processing.
  • The learning data 40 for supervised learning consists of pairs of learning data and teacher data indicating the class of each learning data.
  • The second task learning unit 13 outputs the DNN whose parameters have been updated by learning (second learned DNN 17) to the image processing device 20.
  • In this embodiment the predetermined processing is class classification of image data, but the predetermined processing is not limited to class classification; it may be, for example, segmentation or object range detection.
  • The image processing device 20 has an analysis unit 21 that applies the parameters of the second trained DNN 17 to its DNN model and performs the predetermined processing (class classification, segmentation, object range detection, etc.) on the image data to be processed.
  • The image processing device 20 outputs the classification result produced by the analysis unit 21 as the estimation result for the image data.
  • FIG. 2 is a diagram showing an example of the configuration of the learning device shown in FIG.
  • The learning data generation unit 14 has a self-teacher data generation unit 141 and a data conversion unit 142.
  • The self-teacher data generation unit 141 generates self-teacher data 32 based on the input image data for learning, and outputs it to the first parameter update unit 153 (described later).
  • The self-teacher data generation unit 141 generates a class corresponding to each image data as the self-teacher data.
  • The data conversion unit 142 performs a plurality of conversions on the image data 30 for learning.
  • The learning device 10 executes any two or more of Rotation, ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness as conversions for the image data.
  • The data conversion unit 142 outputs the image data 31 that has undergone two or more conversions to the feature extraction unit 151 (described later).
  • Rotation is a rotation process.
  • ShearX is a horizontal shearing process.
  • ShearY is a vertical shearing process.
  • Solarize is an inversion process that inverts pixels whose value exceeds a threshold.
  • Brightness is a brightness adjustment process: a factor of 0 converts the image to black, while a factor of 1 leaves it unchanged.
  • Color is a color tone adjustment process: as in a black-and-white television image, a factor of 0 converts the image to black and white, while a factor of 1 leaves it unchanged.
  • Contrast is a contrast adjustment process: a factor of 0 converts the image to solid gray, while a factor of 1 leaves it unchanged.
  • Sharpness is an edge sharpening process.
  • The 12 types of conversion examined are ShearX, ShearY, TranslateX, TranslateY, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, CutOut (size prediction), and CutOut (number).
  • In the following, a category such as ShearX or ShearY is called the conversion type, and the concrete setting within each conversion type (for Rotation, the degree of rotation, set in four stages) is called the conversion content.
  • FIG. 3 is a diagram showing the estimation accuracy of the subsequent task (classification) when image data subjected to each conversion process is used for pre-learning.
  • In FIG. 3, the estimation accuracy of the model under the Random condition, in which the feature extraction layer is initialized with random numbers, is shown at the far right of the graph, and the estimation accuracy of the model whose feature extraction layer was pre-trained with the Rotation estimation task is shown second from the right.
  • As shown in FIG. 3, the estimation accuracy for ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness was higher than under the Random condition. Therefore, in the present embodiment, any two or more of Rotation, ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness are set as the conversion types of the conversion processing performed by the data conversion unit 142.
  • The data conversion unit 142 performs a plurality of conversions with differing properties on the input image data 30 for learning. For example, when adding data conversions beyond Rotation, the data conversion unit 142 starts with conversions whose properties differ from those already chosen; in other words, conversions that are unlikely to affect each other, or whose features are orthogonal.
  • The data conversion unit 142 sets the plurality of conversions to be executed on the input image data 30 according to the content of the second task (the predetermined processing) executed by the second task learning unit 13.
  • For example, in the first task the data conversion unit 142 converts in the order Rotation, Sharpness, Solarize. Converting in this order makes it possible to emphasize the edges in the image correctly before the subsequent conversion is applied.
  • The first task learning unit 15 has a feature extraction unit 151, a first task classification unit 152 (estimation unit), and a first parameter update unit 153.
  • The feature extraction unit 151 has a feature extraction layer composed of a DNN, and extracts features (specifically, for example, a feature vector) from the image data 31 subjected to the plurality of conversions by the data conversion unit 142, based on the shared parameters of the DNN constituting the feature extraction layer.
  • In the feature extraction layer of the feature extraction unit 151, shared parameters, which are the parameters updated by the first parameter update unit 153, are set. Through pre-learning that estimates the conversions applied to the input image, the shared parameters are trained to extract features useful for capturing each conversion (for example, features related to the shape, texture, and color of objects present in the image).
  • The first task classification unit 152 has a plurality of pre-learning NNs corresponding to the respective conversions, and estimates the conversion content for each conversion type applied to the image data 31 based on the features extracted by the feature extraction unit 151.
  • In other words, the first task classification unit 152 classifies the image data 31 that has undergone the plurality of conversions based on the feature vector extracted by the feature extraction unit 151.
  • Initial values are set for the plurality of pre-learning NNs; for example, values generated from uniformly distributed random numbers are used. The parameters updated by the first parameter update unit 153 are then set in the plurality of pre-learning NNs.
  • The first parameter update unit 153 executes multi-task learning of the first neural network based on each conversion content estimated by the first task classification unit 152 and the self-teacher data 32 generated by the self-teacher data generation unit 141, updating the parameters of the pre-learning NNs and the shared parameters of the feature extraction layer. As a result, the DNN parameters of the feature extraction layer reflect the features of the image data subjected to the plurality of conversions by the data conversion unit 142.
  • The first task learning unit 15 outputs the DNN model whose parameters have been updated (first learned DNN 16) to the second task learning unit 13.
  • The pre-learning NN parameters correspond to the classification layers for the respective conversion types in the first task classification unit 152.
  • The second task learning unit 13 has a feature extraction unit 131, a second task classification unit 132, and a second parameter update unit 133.
  • The second task learning unit 13 accepts, as the learning data 40 for supervised learning, the input of the image data 41 for learning and the teacher data 42.
  • The feature extraction unit 131 has a feature extraction layer composed of a DNN.
  • The feature extraction unit 131 applies the parameters of the first trained DNN 16 to this DNN and extracts features (for example, feature vectors) from the image data 41 for learning.
  • The features extracted here are of the same kind as those extracted by the feature extraction unit 151.
  • The second task classification unit 132 has an NN, classifies the image data 41 for learning based on the features extracted by the feature extraction unit 131, and outputs the classified class to the second parameter update unit 133 as the estimation result.
  • The second parameter update unit 133 updates the parameters of the feature extraction layer of the feature extraction unit 131 and the parameters of the NN of the second task classification unit 132, based on the estimation result of the second task classification unit 132 and the teacher data 42.
  • Once learning is complete (parameter updates by learning have finished), the second task learning unit 13 outputs the trained NN and DNN (second learned DNN 17) to the image processing device 20.
  • FIG. 4 is a diagram illustrating a flow of learning processing according to the embodiment.
  • FIG. 5 is a diagram showing an example of the configuration of the first task learning unit 15 shown in FIG.
  • As shown in FIG. 4, multi-task learning is performed by sharing the parameters of the feature extraction layer.
  • In the example of FIG. 4, Rotation, Solarize, and Sharpness are executed in succession on the image data 311 for learning by the data converter (self-teacher data generation unit 141).
  • A class generated for the image data 311 is used as the self-teacher data 32.
  • A specific example of the classes follows; a code sketch of this label generation is given after these examples.
  • For Rotation, the class is the four-stage angle {0, 90, 180, 270}.
  • For Sharpness, the class is the four-stage emphasis level {0.0, 1.0, 1.5, 2.0}.
  • For Solarize, the class is the four-stage inversion threshold {0, 96, 192, 256}. The learning data generation unit 14 thus applies a four-stage conversion to the image and generates the class corresponding to the stage.
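  • The sketch below illustrates this label generation under the description above: one of four stages is drawn per conversion type, the conversions are applied in the order Rotation, Sharpness, Solarize, and the stage indices are kept as the self-teacher classes. It uses PIL; the helper names and structure are hypothetical.

```python
# Sketch of self-teacher data generation: pick one of four stages per
# conversion type, apply all conversions to the image, and keep the stage
# indices as the self-teacher classes. Helper names are assumptions.
import random
from PIL import Image, ImageOps, ImageEnhance

STAGES = {
    "rotation":  [0, 90, 180, 270],     # degrees
    "sharpness": [0.0, 1.0, 1.5, 2.0],  # enhancement factors
    "solarize":  [0, 96, 192, 256],     # inversion thresholds
}

def apply_conversion(img, name, value):
    if name == "rotation":
        return img.rotate(value)
    if name == "sharpness":
        return ImageEnhance.Sharpness(img).enhance(value)
    if name == "solarize":
        # PIL thresholds are 0..255; 256 effectively leaves pixels as-is.
        return ImageOps.solarize(img, threshold=min(value, 255))
    raise ValueError(name)

def generate_example(img):
    labels = {}
    # Apply in the order Rotation -> Sharpness -> Solarize (see above).
    for name in ["rotation", "sharpness", "solarize"]:
        cls = random.randrange(4)       # self-teacher class 0..3
        img = apply_conversion(img, name, STAGES[name][cls])
        labels[name] = cls
    return img, labels
```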
  • The feature extraction layer 1511 (feature extraction unit 151), in which the shared parameter θsh is set, extracts the feature vector from the image data 311.
  • The feature vector extracted by the feature extraction layer 1511 is output to the Rotation classification layer (NN) 1521, the Solarize classification layer (NN) 1522, and the Sharpness classification layer (NN) 1523 in the first task classification unit 152.
  • The Rotation classification layer 1521, the Solarize classification layer 1522, and the Sharpness classification layer 1523 each apply their own parameters (θ1 to θ3) to perform four-class classification, estimating the four-stage degree of the corresponding conversion. That is, the Rotation classification layer 1521 classifies the Rotation class, the Solarize classification layer 1522 classifies the Solarize class, and the Sharpness classification layer 1523 classifies the Sharpness class.
  • The first parameter update unit 153 updates the parameters of the Rotation classification layer 1521, the Solarize classification layer 1522, and the Sharpness classification layer 1523 based on their classification results and the self-teacher data 32.
  • Specifically, the first parameter update unit 153 calculates the loss of the four-class classification output by each classification layer using the softmax cross entropy. Using equation (1), it calculates, for each of the Rotation, Solarize, and Sharpness conversions (tasks), the loss between the class output by the corresponding classification layer and the class indicated by the self-teacher data 32.
  • In equation (1), Lt is the loss of each task, and the number of classes of the self-teacher data is 4 in this example. Task 1 is Rotation, task 2 is Solarize, and task 3 is Sharpness. θsh is the shared parameter of the feature extraction layer 1511, θt is the parameter specific to task t, and ct is the output when the feature vector is input to the classification layer to which θt is applied; f corresponds to θsh. A hedged reconstruction of equation (1) follows.
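  • Equation (1) itself is not reproduced in this text. Assuming the standard softmax cross-entropy form and the symbols defined above, a plausible reconstruction is:

```latex
% Hedged reconstruction of equation (1): softmax cross-entropy of task t over
% N training images and C = 4 self-teacher classes; f is the feature
% extraction layer (parameters \theta_{sh}), c_t the classification layer of
% task t (parameters \theta_t), and y_{i,k}^{(t)} the one-hot self-teacher
% class of image x_i for task t.
L_t(\theta_{sh}, \theta_t) =
  -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{C}
  y_{i,k}^{(t)} \,
  \log \operatorname{softmax}\!\bigl( c_t\bigl( f(x_i; \theta_{sh}) \bigr) \bigr)_k
```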
  • The first parameter update unit 153 updates each of the parameters θ1 to θ3 by the error backpropagation method based on the loss and gradient of each task.
  • Specifically, the Rotation parameter update unit 1531 updates the parameter θ1 of the Rotation classification layer 1521, the Solarize parameter update unit 1532 updates the parameter θ2 of the Solarize classification layer 1522, and the Sharpness parameter update unit 1533 updates the parameter θ3 of the Sharpness classification layer 1523.
  • Based on the loss and gradient of each task, the first parameter update unit 153 calculates the total loss with respect to the shared parameter θsh using the Frank-Wolfe method (see Reference 2), and optimizes θsh so that the total loss is minimized. By optimizing the shared parameter θsh with the Frank-Wolfe method for multi-task learning, the first parameter update unit 153 avoids a complicated parameter search and improves accuracy at a small additional computation cost.
  • Reference 2: Ozan Sener and Vladlen Koltun, "Multi-Task Learning as Multi-Objective Optimization", Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
  • The first parameter update unit 153 minimizes the total loss with respect to the shared parameter θsh using equation (2).
  • In equation (2), αt is a weight for coordinating the tasks, and zi is g(xi; θsh). A hedged reconstruction of equation (2) follows.
  • Using equation (2), the feature extraction parameter update unit 1534 of the first parameter update unit 153 computes the weights from the gradients of the task parameters θ1 to θ3, calculates the total loss and gradient over the shared parameter θsh and the task parameters θ1 to θ3, and updates the shared parameter θsh by error backpropagation of the total loss.
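  • Equation (2) is likewise not reproduced in this text. Following Reference 2, whose Frank-Wolfe-based multi-objective formulation the text cites, and the definitions of αt and zi above, it plausibly corresponds to the min-norm problem over per-task gradients taken at the shared representations:

```latex
% Hedged reconstruction of equation (2), after Sener & Koltun (Reference 2):
% choose task weights \alpha_t on the probability simplex so that the
% weighted sum of per-task gradients with respect to the shared
% representations z_i = g(x_i; \theta_{sh}) has minimum norm.
\min_{\alpha_1, \dots, \alpha_T}
  \left\| \sum_{t=1}^{T} \alpha_t \, \nabla_{Z} L_t \right\|_2^2
  \quad \text{s.t.} \quad \sum_{t=1}^{T} \alpha_t = 1, \;\; \alpha_t \ge 0
```

The shared parameter θsh is then updated by backpropagating the weighted total loss Σt αt Lt.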
  • The self-supervised learning unit 11 repeats the processing of the learning data generation unit 14, the feature extraction layer 1511, the first task classification unit 152, and the first parameter update unit 153 until a predetermined end condition is reached; a code sketch of one such iteration follows.
  • The end condition is, for example, that the number of learning steps reaches a preset maximum number of learning steps.
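  • Continuing the FirstNN sketch above, the following illustrates one first-task iteration under these update rules. The min-norm Frank-Wolfe solver follows Reference 2; the single weighted backward pass is a simplification of the separate per-head updates described above, and all names are illustrative.

```python
# Sketch of one first-task iteration (cf. steps S11-S18 in FIG. 7): per-task
# cross-entropy losses, Frank-Wolfe task weights for the shared encoder, and
# one backward pass over the weighted total loss.
import torch
import torch.nn.functional as F

def min_norm_weights(grads, iters=20):
    """Frank-Wolfe iterations for min_alpha ||sum_t alpha_t g_t||^2 on the
    probability simplex (after Sener & Koltun, Reference 2)."""
    T = len(grads)
    alpha = torch.full((T,), 1.0 / T)
    G = torch.stack(grads)           # T x P matrix of flattened gradients
    M = G @ G.t()                    # T x T Gram matrix of gradient dot products
    for k in range(iters):
        t_star = torch.argmin(M @ alpha)  # vertex minimizing the linearization
        step = 2.0 / (k + 2.0)            # standard Frank-Wolfe step size
        e = torch.zeros_like(alpha)
        e[t_star] = 1.0
        alpha = (1.0 - step) * alpha + step * e
    return alpha

def first_task_step(model, opt, x, labels):
    logits = model(x)                # dict: conversion type -> 4-class logits
    losses = [F.cross_entropy(logits[t], labels[t]) for t in logits]
    shared = list(model.encoder.parameters())
    # Per-task gradients with respect to the shared parameters theta_sh.
    grads = [torch.cat([g.flatten() for g in
                        torch.autograd.grad(l, shared, retain_graph=True)])
             for l in losses]
    alpha = min_norm_weights(grads)
    total = sum(a * l for a, l in zip(alpha, losses))
    opt.zero_grad()
    total.backward()                 # updates heads (theta_t) and encoder (theta_sh)
    opt.step()
    return total.item()
```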
  • In the second task, the second task learning unit 13 applies the parameters of the first learned DNN 16 to the feature extraction layer 1311 of the feature extraction unit 131, and performs learning on the learning data 40 for supervised learning.
  • The learning data 40 for supervised learning is a set of pairs each composed of image data and the correct output of the predetermined processing.
  • The feature extraction layer 1311 extracts a feature vector from the image data 41 of the supervised learning data, and the classification layer 1321 of the second task classification unit classifies the image data 41 based on that feature vector.
  • The second parameter update unit 133 updates the parameters of the DNN model of the feature extraction layer 1311 and the parameters of the NN of the classification layer 1321, based on the loss between the class output by the classification layer 1321 and the class indicated by the teacher data 42.
  • The second task learning unit 13 outputs the NN and DNN whose parameters have been updated by learning (second learned DNN 17) to the image processing device 20; a sketch of this second-task update follows.
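  • Continuing the sketch above, a minimal second-task step might look as follows, assuming a 100-class classification task; the optimizer, learning rate, and layer sizes are assumptions.

```python
# Sketch of the second task: copy the pre-trained shared encoder into the
# feature extraction unit 131, attach a fresh classification layer, and
# train on labeled pairs (image data 41, teacher data 42).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = model.encoder              # feature extraction layer 1311, with
                                     # parameters from the first trained DNN 16
classifier = nn.Linear(64, 100)      # classification layer 1321 (e.g. 100 classes)

opt = torch.optim.SGD(
    list(encoder.parameters()) + list(classifier.parameters()),
    lr=0.01, momentum=0.9)

def second_task_step(x, y):
    loss = F.cross_entropy(classifier(encoder(x)), y)
    opt.zero_grad()
    loss.backward()                  # updates both encoder and classifier
    opt.step()
    return loss.item()
```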
  • The second task learning unit 13 may perform arbitrary processing as the predetermined processing (subsequent task, second task). In the example of FIG. 4 the second task learning unit 13 performs supervised learning, but unsupervised learning or reinforcement learning may be used instead.
  • In this way, the learning device 10 acquires, in a first task that performs conversions other than Rotation, features beyond those obtained from Rotation alone. The learning device 10 uses data conversions other than Rotation together with Rotation and performs multi-task learning in the first task. As a result, the learning device 10 can set, in the first task, a first trained DNN 16 that appropriately captures the features of a plurality of conversions of the image data, not only Rotation, and can therefore raise the accuracy of the classification task in the second task.
  • It suffices to execute any two or more of Rotation, ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness according to the processing content of the image processing apparatus 20 and then perform multi-task learning.
  • FIG. 6 is a flowchart showing a processing procedure of image processing performed by the image processing system 1 in the embodiment.
  • As shown in FIG. 6, the learning device 10 performs a learning process that learns the features of the image data for learning in order to set the parameters of the DNN model used by the image processing device 20 (step S1).
  • The image processing device 20 then performs image analysis processing that classifies image data using the second trained DNN 17 whose parameters were set by the learning process of the learning device 10 (step S2).
  • FIG. 7 is a flowchart showing a processing procedure of the learning process shown in FIG.
  • First, the learning device 10 sets the learning step count to its initial value and executes the first learning step.
  • The learning data generation unit 14 of the self-supervised learning unit 11 generates self-teacher data based on the input image data 30 for learning, and performs a plurality of conversions on the image data for learning (step S11).
  • The feature extraction layer 1511, in which the shared parameter θsh is set, performs feature extraction processing that extracts a feature vector from the image data 311 subjected to the plurality of conversions (step S12).
  • The first task classification unit 152 performs the first task classification processing, classifying each conversion applied to the image data based on the feature vector extracted by the feature extraction unit 151, using the plurality of pre-learning NNs corresponding to the plurality of conversions (step S13).
  • The first parameter update unit 153 calculates the loss and gradient from the classification result of each NN of the first task classification unit 152 (step S14).
  • Specifically, the first parameter update unit 153 uses equation (1) to calculate, for each NN, the loss between the class output by that NN and the class indicated by the self-teacher data.
  • The first parameter update unit 153 updates the parameters of each NN by error backpropagation based on the calculated loss and gradient of each NN (step S15).
  • The first parameter update unit 153 then performs the weight calculation based on equation (2) using the loss and gradient of each NN (step S16), applying the Frank-Wolfe method. It calculates the total loss and gradient over the shared parameter of the feature extraction layer 1511 and the parameters of each NN (step S17), and updates the shared parameter θsh by error backpropagation of the total loss (step S18).
  • The first parameter update unit 153 determines whether the number of learning steps is smaller than the maximum number of learning steps (step S19). When it is smaller (step S19: Yes), 1 is added to the learning step count and processing returns to step S11 to execute the first task on the learning data for the next iteration.
  • When the number of learning steps is not smaller than the maximum (step S19: No), that is, when the maximum number of learning steps has been reached, the DNN model with updated parameters (first trained DNN 16) is passed to the second task.
  • The second task learning unit 13 uses the first learned DNN 16 to learn the second task (classification of image data) (step S20).
  • The second task learning unit 13 outputs the DNN whose parameters have been updated by learning (second learned DNN 17) to the image processing device 20.
  • The learning of the second task learning unit 13 may use any general method for learning the second task. In the present embodiment the second task is classification, so a general method for learning classification with a neural network may be used.
  • [Evaluation experiment] An evaluation experiment was conducted with image classification as the second task, comparing the classification accuracy in the second task of a DNN model pre-trained with the method according to the present embodiment and of a DNN pre-trained with the method described in Non-Patent Document 1.
  • As the image data set, CIFAR-100 is used: a data set for supervised learning consisting of pairs of images and teacher information (50,000 images for learning and 10,000 for testing).
  • The second task is to classify the 100 classes using the image data of CIFAR-100.
  • Of the learning data set, 45,000 images are used during learning, and 5,000 images with the classes evenly divided are used as a validation data set for cross-validation.
  • In the pre-learning, the teacher information of the data set is not used.
  • The feature extraction layer of the feature extraction unit 151 is trained for 100 epochs.
  • In the present embodiment, the DNN model is pre-trained using image data converted with Rotation, Sharpness, and Solarize; with the learning method described in Non-Patent Document 1, the DNN model is pre-trained using image data converted with Rotation only. The parameters of the trained feature extraction layer are then fixed, and classification is trained as the second task with a Linear-Regression model (5,000 iterations). Test classification accuracy (Top-1 Accuracy) is measured with the trained Linear-Regression model. The second task was run 3 times for each model, and the mean and standard deviation were calculated; a sketch of this protocol follows.
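  • The following is a minimal sketch of this evaluation protocol under the description above: frozen features, a linear probe trained for a fixed number of iterations, and Top-1 accuracy averaged over 3 runs. Loader construction and hyperparameters are assumptions.

```python
# Sketch of the evaluation protocol: freeze the pre-trained feature
# extraction layer, train a linear (logistic-regression-style) probe, and
# report mean and standard deviation of Top-1 accuracy over 3 runs.
import statistics
import torch
import torch.nn.functional as F

def train_linear_probe(encoder, probe, loader, iters=5000):
    opt = torch.optim.SGD(probe.parameters(), lr=0.1)
    encoder.eval()
    done = 0
    while done < iters:
        for x, y in loader:
            with torch.no_grad():
                z = encoder(x)               # frozen features
            loss = F.cross_entropy(probe(z), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            done += 1
            if done >= iters:
                break

def top1_accuracy(encoder, probe, loader):
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            pred = probe(encoder(x)).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return 100.0 * correct / total

def evaluate(encoder, make_probe, train_loader, test_loader, runs=3):
    accs = []
    for _ in range(runs):
        probe = make_probe()                 # fresh linear classifier each run
        train_linear_probe(encoder, probe, train_loader)
        accs.append(top1_accuracy(encoder, probe, test_loader))
    return statistics.mean(accs), statistics.stdev(accs)
```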
  • FIG. 8 is a diagram comparing the classification accuracy of the DNN model trained using the learning method according to the embodiment with that of the DNN trained using the learning method described in Non-Patent Document 1.
  • The classification accuracy of the DNN model trained using the learning method described in Non-Patent Document 1 was 42.97%.
  • The classification accuracy of the DNN model trained using the learning method according to the embodiment was 48.88%. Thus it was confirmed that, according to the embodiment, higher classification accuracy than that of the conventional method can be achieved.
  • A more robust classifier can therefore be realized by adding other conversion predictions in addition to Rotation.
  • As described above, in the present embodiment a plurality of types of conversion are performed on a single image as the first task, and self-supervised learning is performed by multi-task learning, so that diverse features of the conversion contents can be acquired, relating not only to geometry but also to color tone and edges. Because the first task combines Rotation with data conversions from which useful features can be acquired, the accuracy of the second task can be improved compared with a single conversion alone. Therefore, according to the present embodiment, the accuracy of image processing can be improved by capturing the features of the image. Furthermore, since learning can be performed with almost the same scheme as conventional self-supervised learning, additional implementation is easy.
  • In addition, since a single image is subjected to multiple conversions in the first task, an effect similar to data augmentation is obtained. Specifically, higher accuracy is obtained than when each conversion is applied one at a time to separate inputs. Moreover, by sharing the input from the feature extraction unit 151 across the first task classification unit 152, an approximate expression can be used when updating the gradient, improving computational efficiency by dozens of times.
  • In the first task, it is also possible to choose, as conversions of the image data, two or more conversions from Rotation, ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness according to the processing content of the second task.
  • The case where the second task is class classification has been described, but the second task is not limited to this; it may be any task, such as object detection or segmentation, that produces some output from an input image.
  • For example, the input image data may be converted in the order Solarize, Color, and Sharpness.
  • A strong effect can be expected by combining conversions with differing properties and thereby learning the conversion contents from various aspects.
  • Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as shown in the figures. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one; all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Further, each processing function performed by each device may be realized, in whole or in part, by a CPU and a program analyzed and executed by the CPU, or as hardware by wired logic.
  • All or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by known methods.
  • The processing procedures, control procedures, specific names, and information including the various data and parameters shown in the above document and drawings can be changed arbitrarily unless otherwise specified. That is, the processes described for the learning method are not only executed in chronological order according to the order of description, but may also be executed in parallel or individually according to the processing capacity of the executing device or as required.
  • FIG. 9 is a diagram showing an example of a computer in which the learning device 10 or the image processing device 20 is realized by executing the program.
  • The computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These parts are connected by a bus 1080.
  • The memory 1010 includes a ROM 1011 and a RAM 1012.
  • The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
  • The hard disk drive interface 1030 is connected to the hard disk drive 1031.
  • The disk drive interface 1040 is connected to the disk drive 1100.
  • A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
  • The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • The video adapter 1060 is connected to, for example, a display 1130.
  • The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the learning device 10 or the image processing device 20 is implemented as a program module 1093 in which code executable by the computer 1000 is described.
  • The program module 1093 is stored in, for example, the hard disk drive 1031.
  • Specifically, a program module 1093 for executing processing equivalent to the functional configuration of the learning device 10 or the image processing device 20 is stored in the hard disk drive 1031.
  • The hard disk drive 1031 may be replaced by an SSD (Solid State Drive).
  • The setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as needed, and executes them.
  • The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1031; they may, for example, be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like, or be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.) and read by the CPU 1020 via the network interface 1070.


Abstract

A learning device (10) has: a self-supervised learning unit (11) that takes image data subjected to a plurality of conversions as input and updates parameters, including shared parameters of a feature extraction layer of a first NN, by using the first NN and performing multi-task learning so as to estimate the conversion content for each conversion type; and a second task learning unit (13) that takes arbitrary image data as input, uses a second neural network in which the shared parameters learned by the self-supervised learning unit (11) are applied to the feature extraction layer, and learns the parameters of the second NN so as to perform prescribed processing. The first NN has the feature extraction layer and a plurality of pre-learning NNs, each corresponding to one of the conversion types of the plurality of conversions. The plurality of pre-learning NNs share the feature extraction layer and estimate the conversion content for each conversion type.

Description

学習装置、画像処理装置、学習方法及び学習プログラムLearning device, image processing device, learning method and learning program
 本発明は、学習装置、画像処理装置、学習方法及び学習プログラムに関する。 The present invention relates to a learning device, an image processing device, a learning method, and a learning program.
 深層学習(ディープニューラルネットワーク:DNN)は、画像認識などで大きな成功を収めている。例えば、DNNを使った画像認識では、DNNモデルに画像を入力すると、その画像が何を写しているのかという分類結果を出力する。 Deep learning (deep neural network: DNN) has been very successful in image recognition and so on. For example, in image recognition using DNN, when an image is input to the DNN model, a classification result of what the image reflects is output.
 この深層学習モデルでは、高い精度を実現するために大量のラベル付き学習データが必要となる。このラベル付き学習データは、収集した画像データに対して教師ラベルを付加する必要があり、例えば、医療画像等、専門家でなければラベルを付与することができない場合では、特にコストが高くなる。学習データの作成は、重大なコストであり、深層学習の利用を阻む大きな要因となっている。 This deep learning model requires a large amount of labeled learning data to achieve high accuracy. It is necessary to attach a teacher label to the collected image data of this labeled learning data, and the cost is particularly high when the label cannot be attached only to a specialist such as a medical image. Creating learning data is a significant cost and is a major impediment to the use of deep learning.
 ここで、ラベルつきデータが少量であってもDNNモデルを訓練する技術としていくつかの方法が提案されている。例えば、データが少量であってもDNNモデルを訓練する技術として、別のデータセットで学習済みのモデルを流用する転移学習(Transfer Learning)、元データを加工してデータ数を増やすデータ拡張(Data Augmentation)、データからラベルを作成し、サブタスクを解くことで特徴を獲得する自己教師学習(Self-Supervised Learning)(例えば、非特許文献1参照)が提案されている。 Here, some methods have been proposed as a technique for training the DNN model even if the amount of labeled data is small. For example, as a technique for training a DNN model even if the amount of data is small, transfer learning that diverts a model that has been trained in another data set, and data expansion that processes the original data to increase the number of data (Data). Augmentation), self-supervised learning (see, for example, Non-Patent Document 1) in which features are acquired by creating labels from data and solving subtasks has been proposed.
 非特許文献1に記載の方法は、回転角度を予測する事前タスクを解くことによって後段タスク(分類など)で役立つ表現を獲得する。この方法は、簡易でありながら強力な効果を得られるため、敵対的生成ネットワーク(GAN:Generative Adversarial Networks)の学習や、他の方法と組み合わせた学習に採用されている。 The method described in Non-Patent Document 1 acquires expressions useful in subsequent tasks (classification, etc.) by solving a pre-task for predicting a rotation angle. Since this method is simple yet has a powerful effect, it is adopted for learning of hostile generative networks (GAN: Generative Adversarial Networks) and learning in combination with other methods.
 非特許文献1に記載の方法は、教師なし学習によって獲得される表現の転用を行っている。具体的には、非特許文献1に記載の方法は、事前タスクで画像の回転角度を学習し、後段タスクでクラス分類を学習する。この方法では、回転角度を予測することによって、画像データの幾何的特徴を捉えることができるが、回転が定義できないようなデータや、簡単に回転角度を予測できる特徴を有するデータには、事前タスクの効果が低い。例えば、風景画像は、空の位置だけで画像の角度が予測できるため、この方法では事前タスクの効果が低い。 The method described in Non-Patent Document 1 diverts expressions acquired by unsupervised learning. Specifically, in the method described in Non-Patent Document 1, the rotation angle of the image is learned in the pre-task, and the classification is learned in the post-task. In this method, the geometrical features of the image data can be captured by predicting the rotation angle, but for data for which rotation cannot be defined or for data with features that can easily predict the rotation angle, a preliminary task is required. The effect of is low. For example, in a landscape image, the angle of the image can be predicted only by the position of the sky, so that the effect of the pre-task is low in this method.
 本発明は、上記に鑑みてなされたものであって、画像の特徴を捉えて画像処理の精度を高めることができる学習装置、画像処理装置、学習方法及び学習プログラムを提供することを目的とする。 The present invention has been made in view of the above, and an object of the present invention is to provide a learning device, an image processing device, a learning method, and a learning program capable of capturing the features of an image and improving the accuracy of image processing. ..
 上述した課題を解決し、目的を達成するために、本発明に係る学習装置は、複数の変換が行なわれた画像データを入力とし、第1ニューラルネットワークを用いて、変換の変換種別ごとの変換内容をそれぞれ推定するようマルチタスク学習を行うことで、第1ニューラルネットワークの特徴抽出層の共有パラメータを含むパラメータを更新する第1学習部と、任意の画像データを入力とし、第1学習部によって学習済みの共有パラメータを特徴抽出層に適用した第2ニューラルネットワークを用いて、所定処理を行うよう第2ニューラルネットワークのパラメータを学習する第2学習部と、を有し、第1ニューラルネットワークは、特徴抽出層と、複数の変換の変換種別にそれぞれ対応した複数の事前学習用ニューラルネットワークとを有し、複数の事前学習用ニューラルネットワークは、特徴抽出層を共有し、変換種別ごとの変換内容を推定することを特徴とする。 In order to solve the above-mentioned problems and achieve the object, the learning device according to the present invention takes image data in which a plurality of conversions have been performed as an input, and uses a first neural network to perform conversion for each conversion type. By performing multi-task learning so as to estimate the contents, the first learning unit that updates the parameters including the shared parameters of the feature extraction layer of the first neural network and the first learning unit that inputs arbitrary image data The first neural network has a second learning unit that learns the parameters of the second neural network so as to perform predetermined processing by using the second neural network in which the learned shared parameters are applied to the feature extraction layer. It has a feature extraction layer and a plurality of pre-learning neural networks corresponding to each of a plurality of conversion conversion types. The plurality of pre-learning neural networks share the feature extraction layer and display the conversion contents for each conversion type. It is characterized by estimating.
 また、本発明に係る画像処理装置は、学習済みのパラメータが設定されたディープニューラルネットワークを有するモデルを用いて、入力された画像データに対して処理を行う処理部を有し、前記学習済みのパラメータは、複数の変換が行なわれた画像データを入力とし、第1ニューラルネットワークを用いて、前記変換の変換種別ごとの変換内容をそれぞれ推定するようマルチタスク学習を行うことで更新された、前記第1ニューラルネットワークの特徴抽出層の共有パラメータ基づき、前記第1ニューラルネットワークは、前記特徴抽出層と、前記複数の変換の変換種別にそれぞれ対応した複数の事前学習用ニューラルネットワークとを有し、前記複数の事前学習用ニューラルネットワークは、前記特徴抽出層を共有し、前記変換種別ごとの変換内容を推定することを特徴とする。 Further, the image processing apparatus according to the present invention has a processing unit that processes input image data using a model having a deep neural network in which trained parameters are set, and has been trained. The parameters were updated by inputting image data in which a plurality of conversions were performed and performing multitask learning so as to estimate the conversion contents for each conversion type of the conversions using the first neural network. Based on the shared parameters of the feature extraction layer of the first neural network, the first neural network has the feature extraction layer and a plurality of pre-learning neural networks corresponding to the conversion types of the plurality of transformations. A plurality of pre-learning neural networks share the feature extraction layer and estimate the conversion content for each conversion type.
 また、本発明に係る学習方法は、学習装置が実行する学習方法であって、複数の変換が行なわれた画像データを入力とし、第1ニューラルネットワークを用いて、変換の変換種別ごとの変換内容をそれぞれ推定するようマルチタスク学習を行うことで、第1ニューラルネットワークの特徴抽出層の共有パラメータを含むパラメータを更新する第1学習工程と、任意の画像データを入力とし、第1学習工程において学習済みの共有パラメータを特徴抽出層に適用した第2ニューラルネットワークを用いて、所定処理を行うよう第2ニューラルネットワークのパラメータを学習する第2学習工程と、を含み、第1ニューラルネットワークは、特徴抽出層と、複数の変換の変換種別にそれぞれ対応した複数の事前学習用ニューラルネットワークとを有し、複数の事前学習用ニューラルネットワークは、特徴抽出層を共有し、変換種別ごとの変換内容を推定することを特徴とする。 Further, the learning method according to the present invention is a learning method executed by a learning device, in which image data subjected to a plurality of conversions is input and conversion contents for each conversion type of conversion are used by using a first neural network. By performing multi-task learning so as to estimate each of the above, the first learning step of updating the parameters including the shared parameters of the feature extraction layer of the first neural network and the learning in the first learning step by inputting arbitrary image data. The first neural network includes a second learning step of learning the parameters of the second neural network so as to perform a predetermined process using the second neural network in which the already shared parameters are applied to the feature extraction layer. It has a layer and a plurality of pre-learning neural networks corresponding to each of a plurality of conversion conversion types, and the plurality of pre-learning neural networks share a feature extraction layer and estimate the conversion content for each conversion type. It is characterized by that.
 また、本発明に係る学習プログラムは、複数の変換が行なわれた画像データを入力とし、第1ニューラルネットワークを用いて、変換の変換種別ごとの変換内容をそれぞれ推定するようマルチタスク学習を行うことで、第1ニューラルネットワークの特徴抽出層の共有パラメータを含むパラメータを更新する第1学習ステップと、任意の画像データを入力とし、第1学習ステップにおいて学習済みの共有パラメータを特徴抽出層に適用した第2ニューラルネットワークを用いて、所定処理を行うよう第2ニューラルネットワークのパラメータを学習する第2学習ステップと、をコンピュータに実行させ、第1ニューラルネットワークは、特徴抽出層と、複数の変換の変換種別にそれぞれ対応した複数の事前学習用ニューラルネットワークとを有し、複数の事前学習用ニューラルネットワークは、特徴抽出層を共有し、変換種別ごとの変換内容を推定する。 Further, the learning program according to the present invention takes image data in which a plurality of conversions have been performed as an input, and uses a first neural network to perform multitasking learning so as to estimate the conversion contents for each conversion type of conversion. Then, the first learning step of updating the parameters including the shared parameters of the feature extraction layer of the first neural network and the shared parameters learned in the first learning step were applied to the feature extraction layer by inputting arbitrary image data. Using the second neural network, a computer is made to execute a second learning step of learning the parameters of the second neural network so as to perform a predetermined process, and the first neural network is a feature extraction layer and conversion of a plurality of transformations. It has a plurality of pre-learning neural networks corresponding to each type, and the plurality of pre-learning neural networks share a feature extraction layer and estimate the conversion content for each conversion type.
 本発明によれば、画像の特徴を捉えて画像処理の精度を高めることができる。 According to the present invention, it is possible to improve the accuracy of image processing by capturing the features of an image.
図1は、実施の形態に係る画像処理システムの構成の一例を示す図である。FIG. 1 is a diagram showing an example of a configuration of an image processing system according to an embodiment. 図2は、図1に示す学習装置の構成の一例を示す図である。FIG. 2 is a diagram showing an example of the configuration of the learning device shown in FIG. 図3は、変換処理を実施した画像データに対する変換内容の推定精度を示す図である。FIG. 3 is a diagram showing the estimation accuracy of the converted content for the image data subjected to the conversion process. 図4は、実施の形態に係る学習処理の流れを説明する図である。FIG. 4 is a diagram illustrating a flow of learning processing according to the embodiment. 図5は、図1に示す第1タスク学習部の構成の一例を示す図である。FIG. 5 is a diagram showing an example of the configuration of the first task learning unit shown in FIG. 図6は、実施の形態における画像処理システムが実施する画像処理の処理手順を示すフローチャートである。FIG. 6 is a flowchart showing a processing procedure of image processing performed by the image processing system according to the embodiment. 図7は、図6に示す学習処理の処理手順を示すフローチャートである。FIG. 7 is a flowchart showing a processing procedure of the learning process shown in FIG. 図8は、実施の形態に係る学習方法を用いて訓練したDNNモデルと、非特許文献1に記載の学習方法とを用いて訓練したDNNとの各分類精度を示す図である。FIG. 8 is a diagram showing the accuracy of each classification between the DNN model trained using the learning method according to the embodiment and the DNN trained using the learning method described in Non-Patent Document 1. 図9は、プログラムが実行されることにより、学習装置或いは画像処理装置が実現されるコンピュータの一例を示す図である。FIG. 9 is a diagram showing an example of a computer in which a learning device or an image processing device is realized by executing a program.
 以下に、本願に係る学習装置、画像処理装置、学習方法及び学習プログラムの実施の形態を図面に基づいて詳細に説明する。なお、本発明は、事前タスク(第1タスク)として、画像データに複数の変換を行ない、自己教師学習を行って各変換内容を推定するマルチタスク学習を行い、後段タスク(第2タスク)として、第1タスクの学習結果を流用し、画像データの分類処理を学習する例について説明する。また、本発明は、以下に説明する実施の形態により限定されるものではない。 Hereinafter, the learning device, the image processing device, the learning method, and the embodiment of the learning program according to the present application will be described in detail based on the drawings. In the present invention, as a pre-task (first task), a plurality of conversions are performed on the image data, self-supervised learning is performed, and multi-task learning for estimating each conversion content is performed, and as a latter-stage task (second task). , An example of learning the image data classification process by diverting the learning result of the first task will be described. Further, the present invention is not limited to the embodiments described below.
[Embodiment]
 First, an image processing system according to the embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram showing an example of the configuration of the image processing system according to the embodiment.
 As shown in FIG. 1, the image processing system 1 according to the embodiment includes an image processing device 20 that classifies image data using a deep learning (deep neural network: DNN) model, and a learning device 10 that sets the parameters of the DNN model used by the image processing device 20 by learning the features of image data for learning.
 The learning device 10 has a self-supervised learning unit 11 (first learning unit) and a second task learning unit 13 (second learning unit) provided downstream of the self-supervised learning unit 11.
 The self-supervised learning unit 11 performs self-supervised learning on image data for learning using a first neural network (NN). The self-supervised learning unit 11 applies a plurality of types of conversion to the image data. Here, the first NN has a feature extraction layer that extracts conversion-related features from image data to which a plurality of conversions have been applied, and a plurality of pre-training NNs that each estimate the conversion content of one conversion type applied to the image data. The self-supervised learning unit 11 performs the first task: it takes the converted image data as input and performs multitask learning with the first NN so as to estimate the conversion content for each conversion type, thereby updating parameters including the shared parameters of the feature extraction layer of the first NN. The self-supervised learning unit 11 has a learning data generation unit 14 and a first task learning unit 15.
 The learning data generation unit 14 generates self-teacher data based on the input image data 30 for learning, and applies a plurality of types of conversion to the image data for learning.
 The first task learning unit 15 learns the features of the plurality of conversion contents by multitask learning, using the self-teacher data generated by the learning data generation unit 14 and the image data to which the plurality of types of conversion have been applied.
 The first task learning unit 15 performs multitask learning so as to estimate the conversion content for each conversion type, using a plurality of pre-training NNs corresponding to the respective conversions while sharing the shared parameters of the feature extraction layer, which extracts conversion-related features from the converted image data; this updates the parameters of each pre-training NN and the shared parameters of the feature extraction layer. The feature extraction layer is a DNN model containing many nonlinear functions. The first task learning unit 15 outputs the DNN model with the updated parameters (first trained DNN 16) to the second task learning unit 13.
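 A minimal sketch of this first NN is given below, assuming PyTorch: a shared feature extraction layer feeds one small classification head per conversion type, so the heads share the parameters θ_sh while keeping their own parameters θ_t. The module names, the toy CNN standing in for the Wide-ResNet used in the experiments, and the feature dimension are illustrative assumptions, not the patent's exact implementation.

```python
# A sketch of the first NN: shared feature extractor + per-transform heads.
import torch
import torch.nn as nn

class FirstTaskNetwork(nn.Module):
    def __init__(self, feature_dim: int = 256,
                 tasks=("rotation", "solarize", "sharpness"),
                 num_levels: int = 4):
        super().__init__()
        # Shared feature extraction layer (theta_sh); a small CNN stands in
        # for the Wide-ResNet used in the experiments.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feature_dim), nn.ReLU(),
        )
        # One pre-training head (theta_t) per conversion type; each head
        # predicts the 4-level conversion content of its own transform.
        self.heads = nn.ModuleDict(
            {t: nn.Linear(feature_dim, num_levels) for t in tasks})

    def forward(self, x):
        z = self.features(x)                      # shared representation
        return {t: head(z) for t, head in self.heads.items()}
```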
 The second task learning unit 13 performs the second task of learning a predetermined process by reusing the learning result of the first task. The second task learning unit 13 takes arbitrary image data as input and learns the parameters of a second NN, in which the shared parameters learned by the self-supervised learning unit 11 are applied to the feature extraction layer, so that the second NN performs the predetermined process. The learning data 40 for supervised learning consists of pairs of learning data and teacher data indicating the class of each piece of learning data. The second task learning unit 13 outputs the DNN whose parameters have been updated by learning (second trained DNN 17) to the image processing device 20. In the following, an example in which the second task learning unit 13 performs classification of image data (class classification) as the predetermined process is described, but the predetermined process is not limited to class classification and may be segmentation, object range detection, or the like.
 The image processing device 20 has an analysis unit 21 that applies the parameters of the second trained DNN 17 to a DNN model and performs the predetermined process (class classification, segmentation, object range detection, etc.) on the image data to be processed. The image processing device 20 outputs the classification result of the analysis unit 21 as the estimation result for the image data.
[Learning device]
 Next, the learning device 10 shown in FIG. 1 will be described. FIG. 2 is a diagram showing an example of the configuration of the learning device shown in FIG. 1.
 As shown in FIG. 2, in the learning device 10, the learning data generation unit 14 has a self-teacher data generation unit 141 and a data conversion unit 142.
 The self-teacher data generation unit 141 generates self-teacher data 32 based on the input image data for learning and outputs it to a first parameter update unit 153 (described later). The self-teacher data generation unit 141 generates, as the self-teacher data, a class corresponding to each piece of image data.
 The data conversion unit 142 applies a plurality of conversions to the image data 30 for learning. Here, the learning device 10 executes two or more of Rotation, ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness as conversions of the image data. The data conversion unit 142 outputs the image data 31, to which two or more conversions have been applied, to a feature extraction unit 151 (described later).
 Rotation is a rotation process. ShearX is a horizontal shear process. ShearY is a vertical shear process. Solarize is an inversion process that inverts pixels whose values exceed a threshold. Brightness is a brightness adjustment process: for example, a factor of 0 produces a black image, and a factor of 1 leaves the image unchanged. Color is a color adjustment process, similar to the controls on a color TV set: a factor of 0 produces a black-and-white image, and a factor of 1 leaves the image unchanged. Contrast is a contrast adjustment process: a factor of 0 produces a gray image, and a factor of 1 leaves the image unchanged. Sharpness is an edge sharpening process.
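 As a concrete reference, these eight conversion types can be realized with Pillow roughly as follows. This is a hedged sketch in the spirit of the AutoAugment operations cited below; the function names and magnitude arguments are illustrative.

```python
# A sketch of the eight conversion types using Pillow. The concrete
# magnitudes passed in by the caller are illustrative, not the patent's
# exact settings.
from PIL import Image, ImageEnhance, ImageOps

def rotation(img: Image.Image, angle: float) -> Image.Image:
    return img.rotate(angle)                                    # Rotation

def shear_x(img: Image.Image, m: float) -> Image.Image:
    return img.transform(img.size, Image.AFFINE, (1, m, 0, 0, 1, 0))  # ShearX

def shear_y(img: Image.Image, m: float) -> Image.Image:
    return img.transform(img.size, Image.AFFINE, (1, 0, 0, m, 1, 0))  # ShearY

def solarize(img: Image.Image, threshold: int) -> Image.Image:
    return ImageOps.solarize(img, threshold)   # invert pixels above threshold

def brightness(img, f): return ImageEnhance.Brightness(img).enhance(f)
def color(img, f):      return ImageEnhance.Color(img).enhance(f)
def contrast(img, f):   return ImageEnhance.Contrast(img).enhance(f)
def sharpness(img, f):  return ImageEnhance.Sharpness(img).enhance(f)
```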
 These processes were selected based on data conversions used in the context of data augmentation. Specifically, from the conversions used in AutoAugment (see Reference 1), those whose degree of conversion can be set as a four-level discrete value were extracted.
Reference 1: Ekin D. Cubuk et al., "AutoAugment: Learning Augmentation Policies from Data", arXiv preprint arXiv:1805.09501 (ICML 2019).
 Here, in setting the conversion types to be executed by the learning device 10, a preliminary experiment was conducted on a total of 12 types of conversion: ShearX, ShearY, TranslateX, TranslateY, Solarize, Posterize, Contrast, Color, Brightness, Sharpness, CutOut (size prediction), and CutOut (count). In the following, the kind of conversion, such as ShearX or ShearY, is called the conversion type, and the content of the conversion for each conversion type (for Rotation, the degree of rotation set in four levels) is called the conversion content.
 FIG. 3 shows the estimation accuracy of the subsequent task (class classification) when image data subjected to each conversion process is used for pre-training. For reference, the rightmost bar and the second bar from the right of the graph show the estimation accuracy of a model under the Random condition, in which the feature extraction layer is initialized with random numbers, and of a model whose feature extraction layer is pre-trained with the Rotation estimation task. As shown in FIG. 3, ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness were found to exceed the estimation accuracy of the Random condition. Therefore, in the present embodiment, two or more of Rotation, ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness are set as the conversion types of the conversion process performed by the data conversion unit 142.
 The data conversion unit 142 also applies, to the input image data 30 for learning, a plurality of conversions whose properties differ. For example, when adding data conversions on top of Rotation, the data conversion unit 142 adds conversions with different properties, in other words, conversions that are unlikely to affect one another or whose feature quantities are orthogonal.
 The data conversion unit 142 also sets the plurality of conversions to be executed on the input image data 30 according to the content of the second task (predetermined process) executed by the second task learning unit 13. When the second task is class classification, the data conversion unit 142 performs the conversions in the first task in the order Rotation, Sharpness, Solarize. Converting in this order allows the edges in the image to be correctly emphasized before the conversion that follows.
 The first task learning unit 15 has a feature extraction unit 151, a first task classification unit 152 (estimation unit), and a first parameter update unit 153.
 The feature extraction unit 151 has a feature extraction layer composed of a DNN, and extracts features (specifically, feature vectors, for example) from the image data 31 converted by the data conversion unit 142, based on the shared parameters of the DNN constituting the feature extraction layer. The shared parameters set in the feature extraction layer of the feature extraction unit 151 are the parameters updated by the first parameter update unit 153. Through pre-training that estimates the conversion content applied to the input image, the shared parameters are learned so as to extract features useful for capturing each conversion content (for example, features related to the shape, texture, and color of objects present in the image).
 The first task classification unit 152 has a plurality of pre-training NNs corresponding to the respective conversions, and estimates the conversion content of each conversion type for the image data 31 based on the features extracted by the feature extraction unit 151. The first task classification unit 152 classifies the image data 31, to which the plurality of conversions have been applied, based on the feature vector extracted by the feature extraction unit 151. Initial values are set in the pre-training NNs; for example, values generated from uniformly distributed random numbers are used. The parameters updated by the first parameter update unit 153 are then set in the pre-training NNs.
 The first parameter update unit 153 performs multitask learning on the first neural network based on the conversion contents estimated by the first task classification unit 152 and the self-teacher data 32 generated by the self-teacher data generation unit 141, updating the parameters of the pre-training NNs and the shared parameters of the feature extraction layer. As a result, the DNN parameters of the feature extraction layer come to reflect the features of the image data converted in multiple ways by the data conversion unit 142. The first task learning unit 15 outputs the DNN model whose parameter update is complete (first trained DNN 16) to the second task learning unit 13. The parameters of the pre-training NNs correspond to the classification layers of the first task classification unit 152, one for each conversion type.
 Next, the second task learning unit 13 will be described. The second task learning unit 13 has a feature extraction unit 131, a second task classification unit 132, and a second parameter update unit 133. The second task learning unit 13 accepts, as the learning data 40 for supervised learning, the input of image data 41 for learning and teacher data 42.
 The feature extraction unit 131 has a feature extraction layer composed of a DNN. The feature extraction unit 131 applies the parameters of the first trained DNN 16 to the DNN and extracts features (for example, feature vectors) from the image data 41 for learning. The features extracted here are the same as those of the feature extraction unit 151.
 The second task classification unit 132 has an NN, classifies the image data 41 for learning based on the features extracted by the feature extraction unit 131, and outputs the resulting class to the second parameter update unit 133 as the estimation result.
 The second parameter update unit 133 updates the parameters of the feature extraction layer of the feature extraction unit 131 and the parameters of the NN of the second task classification unit 132, based on the estimation result of the second task classification unit 132 and the teacher data 42. The second task learning unit 13 outputs the trained NN and DNN (second trained DNN 17), whose parameter updates by learning are complete, to the image processing device 20.
[Flow of learning process]
 Next, the flow of the learning process executed by the learning device 10 will be described. FIG. 4 is a diagram illustrating the flow of the learning process according to the embodiment. FIG. 5 is a diagram showing an example of the configuration of the first task learning unit 15 shown in FIG. 1.
 In the first task, multitask learning is performed while sharing the parameters of the feature extraction layer. Specifically, as shown in (1) of FIG. 4 and in FIG. 5, the learning data generation unit 14 applies Rotation, Solarize, and Sharpness in succession to the image data 311 for learning (data conversion unit 142), and generates the class of the image data 311 as the self-teacher data 32 (self-teacher data generation unit 141). Concrete examples of the classes are as follows. For Rotation, the four angle levels {0, 90, 180, 270} form the classes. For Sharpness, the four enhancement levels {0.0, 1.0, 1.5, 2.0} form the classes. For Solarize, the four inversion thresholds {0, 96, 192, 256} form the classes. The learning data generation unit 14 therefore applies a four-level conversion to the image and generates a class according to the level.
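 A minimal sketch of this self-teacher data generation is shown below, assuming the four-level magnitudes just listed and the Rotation, Sharpness, Solarize chaining described earlier for class classification; the level index drawn for each conversion type serves as that task's 4-class self-teacher label. The names and the uniform random sampling policy are illustrative assumptions.

```python
# A sketch of self-teacher data generation with 4-level conversions.
import random
from PIL import Image, ImageEnhance, ImageOps

LEVELS = {
    "rotation":  [0, 90, 180, 270],      # degrees
    "sharpness": [0.0, 1.0, 1.5, 2.0],   # enhancement factors
    "solarize":  [0, 96, 192, 256],      # inversion thresholds
}

def make_self_supervised_sample(img: Image.Image):
    # Draw one level per conversion type; the indices are the class labels.
    labels = {t: random.randrange(4) for t in LEVELS}
    out = img.rotate(LEVELS["rotation"][labels["rotation"]])
    out = ImageEnhance.Sharpness(out).enhance(
        LEVELS["sharpness"][labels["sharpness"]])
    out = ImageOps.solarize(out, LEVELS["solarize"][labels["solarize"]])
    return out, labels   # converted image and per-task 4-class targets
```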
 In the first task, the feature extraction layer 1511 (feature extraction unit 151), in which the shared parameters θ_sh are set, then extracts a feature vector from the image data 311. The feature vector extracted by the feature extraction layer 1511 is output to the Rotation classification layer (NN) 1521, the Solarize classification layer (NN) 1522, and the Sharpness classification layer (NN) 1523 in the first task classification unit 152.
 The Rotation classification layer 1521, the Solarize classification layer 1522, and the Sharpness classification layer 1523 each apply their own parameters (θ_1 to θ_3) to perform 4-class classification, estimating the four-level degree of the corresponding conversion. That is, the Rotation classification layer 1521 performs class classification for Rotation, the Solarize classification layer 1522 for Solarize, and the Sharpness classification layer 1523 for Sharpness.
 The first parameter update unit 153 updates the parameters of the Rotation classification layer 1521, the Solarize classification layer 1522, and the Sharpness classification layer 1523 based on their respective classification results and the self-teacher data 32.
 The first parameter update unit 153 computes the loss of the 4-class classification output from each of the Rotation classification layer 1521, the Solarize classification layer 1522, and the Sharpness classification layer 1523 using softmax cross entropy. Specifically, using Equation (1), the first parameter update unit 153 computes, for each of the Rotation, Solarize, and Sharpness conversions (tasks), the loss between the class predicted by the corresponding classification layer and the class indicated by the self-teacher data 32.
$$ L_t\left(\theta_{sh}, \theta_t\right) = -\sum_{c=1}^{\tau} y^{(t)}_{c} \log \operatorname{softmax}\left(c_t\right)_{c}, \qquad c_t = f_{\theta_t}\!\left(F\left(x; \theta_{sh}\right)\right) \tag{1} $$
 In Equation (1), L_t is the loss of each task, and y^(t) is the one-hot self-teacher label of task t. τ is the number of classes of the self-teacher data, 4 in this example. t (= 1, ..., T) is the task number; in this example, task 1 is Rotation, task 2 is Solarize, and task 3 is Sharpness. θ_sh is the shared parameter of the feature extraction layer 1511. θ_t is the parameter specific to task t. c_t is the output obtained when the feature vector is input to the classification layer to which θ_t is applied. F corresponds to θ_sh, that is, the feature extractor.
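 As a sketch, the per-task loss of Equation (1) can be computed as follows; `outputs` is the head-output dict from the FirstTaskNetwork sketch above and `targets` holds the self-teacher labels, both illustrative names.

```python
# Per-task softmax cross-entropy losses, one per classification head.
import torch
import torch.nn.functional as F

def per_task_losses(outputs: dict, targets: dict) -> dict:
    # F.cross_entropy applies log-softmax internally, matching Eq. (1).
    return {t: F.cross_entropy(outputs[t], targets[t]) for t in outputs}
```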
 The first parameter update unit 153 updates each of the parameters θ_1 to θ_3 by backpropagation based on the loss and gradient of each task. The Rotation parameter update unit 1531 updates the parameter θ_1 of the Rotation classification layer 1521, the Solarize parameter update unit 1532 updates the parameter θ_2 of the Solarize classification layer 1522, and the Sharpness parameter update unit 1533 updates the parameter θ_3 of the Sharpness classification layer 1523.
 Subsequently, the first parameter update unit 153 computes the overall loss for the shared parameters θ_sh using the Frank-Wolfe method (see Reference 2), based on the loss and gradient of each task, and optimizes the shared parameters θ_sh so that the overall loss is minimized. By optimizing the shared parameters θ_sh with the multitask Frank-Wolfe method, the first parameter update unit 153 avoids a cumbersome parameter search and improves accuracy at a small additional computation cost.
Reference 2: Ozan Sener and Vladlen Koltun, "Multi-Task Learning as Multi-Objective Optimization", Advances in Neural Information Processing Systems (NIPS 2018).
 First, the first parameter update unit 153 minimizes the overall loss with respect to the shared parameters θ_sh using Equation (2).
$$ \min_{\alpha_1,\ldots,\alpha_T} \left\| \sum_{t=1}^{T} \alpha_t \, \nabla_{Z} L_t\left(Z, \theta_t\right) \right\|_2^2 \quad \text{subject to} \quad \sum_{t=1}^{T} \alpha_t = 1,\ \alpha_t \ge 0 \tag{2} $$
 In Equation (2), α_t is a weight for balancing the tasks. Z (= z_1, ..., z_t) is the set of representations used to update θ_sh, where z_i = g(x_i; θ_sh).
 Specifically, using Equation (2), the first parameter update unit 153 computes the weights in the feature extraction parameter update unit 1534 from the gradients with respect to the task parameters θ_1 to θ_3, computes the overall loss and gradient for the shared parameters θ_sh and the task parameters θ_1 to θ_3, and updates the shared parameters θ_sh by backpropagation of the overall loss.
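 The Frank-Wolfe weighting step of Equation (2) can be sketched as below, following the min-norm formulation of Reference 2: given each task's gradient with respect to the shared parameters (flattened to one vector per task), find simplex weights α minimizing the squared norm of the combined gradient. This is an illustrative re-implementation under those assumptions, not the patent's exact code.

```python
# A sketch of the Frank-Wolfe solver for the min-norm problem of Eq. (2).
import torch

def frank_wolfe_weights(grads: list, iters: int = 20) -> torch.Tensor:
    T = len(grads)
    G = torch.stack([g.flatten() for g in grads])     # (T, P) task gradients
    M = G @ G.t()                                     # Gram matrix
    alpha = torch.full((T,), 1.0 / T)                 # start at simplex center
    for _ in range(iters):
        # Frank-Wolfe vertex: the task with the smallest inner product
        # against the current combined gradient.
        t_hat = torch.argmin(M @ alpha)
        v = torch.zeros(T)
        v[t_hat] = 1.0
        # Exact line search for min_gamma ||(1 - gamma) a + gamma b||^2,
        # where a and b are the current and vertex combined gradients.
        a_vec, b_vec = alpha @ G, v @ G
        diff = a_vec - b_vec
        denom = diff.dot(diff)
        if denom <= 1e-12:
            break
        gamma = torch.clamp(diff.dot(a_vec) / denom, 0.0, 1.0)
        alpha = (1 - gamma) * alpha + gamma * v
    return alpha   # weights for combining the task losses
```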
 The self-supervised learning unit 11 repeats the processing of the learning data generation unit 14, the feature extraction layer 1511, the first task classification unit 152, and the first parameter update unit 153 until a predetermined end condition is reached. The end condition is, for example, that the number of learning steps reaches a preset maximum number of learning steps.
 Subsequently, as shown in (2) of FIG. 4, the second task learning unit 13 applies the parameters of the first trained DNN 16 to the feature extraction layer 1311 of the feature extraction unit 131 and performs learning on the learning data 40 for supervised learning. The learning data 40 for supervised learning is a set of pairs, each consisting of image data and the correct output of the predetermined process.
 Specifically, the feature extraction layer 1311 extracts a feature vector from the image data 41 of the learning data for supervised learning. The classification layer 1321 of the second task classification unit then classifies the image data 41 based on the feature vector. The second parameter update unit 133 updates the parameters of the DNN model of the feature extraction layer 1311 and the parameters of the NN of the classification layer 1321, based on the loss between the class predicted by the classification layer 1321 and the class indicated by the teacher data 42. The second task learning unit 13 outputs the NN and DNN whose parameters have been updated by learning (second trained DNN 17) to the image processing device 20. The second task learning unit 13 may perform any processing as the predetermined process (subsequent task, second task). In the example of FIG. 4, the second task learning unit 13 performs supervised learning, but learning that does not use teacher labels, such as reinforcement learning, may also be used.
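 A minimal sketch of this second task is shown below, under the assumption that the pre-trained shared feature extractor from the FirstTaskNetwork sketch is reused and a classification head is trained with ordinary supervised cross entropy, updating both the extractor and the head as described above. The 100-class head anticipates the CIFAR-100 experiment below; all names are illustrative.

```python
# A sketch of the second task: reuse theta_sh, train a supervised classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_second_network(pretrained, feature_dim=256, num_classes=100):
    # The shared feature extraction layer is transferred as-is.
    return nn.Sequential(pretrained.features,
                         nn.Linear(feature_dim, num_classes))

def second_task_step(model, optimizer, images, labels):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)   # supervised loss
    loss.backward()                                 # updates extractor + head
    optimizer.step()
    return loss.item()
```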
 In this way, in the first task, which performs conversions other than Rotation, the learning device 10 acquires features other than those obtained from Rotation. The learning device 10 uses data conversions other than Rotation simultaneously with Rotation and performs multitask learning in the first task. As a result, the learning device 10 can set, in the first task, a first trained DNN 16 that appropriately captures the features of a plurality of conversions of the image data, not just Rotation, and can therefore raise the accuracy of the classification task in the second task.
 Note that FIGS. 4 and 5 show one example; in the first task, it suffices to execute two or more of Rotation, ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness in accordance with the processing content of the image processing device 20 and perform multitask learning on them.
[Image processing procedure]
 Next, the flow of the image processing performed by the image processing system 1 according to the embodiment will be described. FIG. 6 is a flowchart showing the processing procedure of the image processing performed by the image processing system 1 according to the embodiment.
 As shown in FIG. 6, in the image processing system 1, the learning device 10 performs a learning process of learning the features of image data for learning in order to set the parameters of the DNN model used by the image processing device 20 (step S1). The image processing device 20 then performs an image analysis process of classifying image data using the second trained DNN 17 whose parameters were set by the learning process of the learning device 10 (step S2).
[Procedure of learning process]
 Next, the procedure of the learning process (step S1) will be described. FIG. 7 is a flowchart showing the processing procedure of the learning process shown in FIG. 6.
 As shown in FIG. 7, the learning device 10 initializes the learning step counter and executes the first learning step. Specifically, the learning data generation unit 14 of the self-supervised learning unit 11 generates self-teacher data based on the input image data 30 for learning and applies a plurality of conversions to the image data for learning (step S11).
 Subsequently, in the feature extraction unit 151, the feature extraction layer 1511, in which the shared parameters θ_sh are set, performs feature extraction processing to extract a feature vector from the converted image data 311 (step S12). The first task classification unit 152 performs first task classification processing: using the pre-training NNs corresponding to the respective conversions, it classifies each conversion of the image data based on the feature vector extracted by the feature extraction unit 151 (step S13).
 The first parameter update unit 153 computes the loss and gradient from the classification result of each NN of the first task classification unit 152 (step S14). Using Equation (1), the first parameter update unit 153 computes, for each NN, the loss between the class output by the NN and the class indicated by the self-teacher data. The first parameter update unit 153 then updates the parameters of each NN by backpropagation based on the computed per-NN losses and gradients (step S15).
 Subsequently, the first parameter update unit 153 computes the weights from the loss and gradient of each NN based on Equation (2) (step S16); the Frank-Wolfe method is used here. The first parameter update unit 153 then computes the overall loss and gradient of the shared parameters of the feature extraction layer 1511 and the parameters of each NN (step S17), and updates the shared parameters θ_sh by backpropagation of the overall loss (step S18).
 The first parameter update unit 153 determines whether the number of learning steps is smaller than the maximum number of learning steps (step S19). If so (step S19: Yes), 1 is added to the learning step counter, and the first parameter update unit 153 returns to step S11 and executes the first task on the next learning data.
 On the other hand, if the number of learning steps is not smaller than the maximum number of learning steps (step S19: No), that is, if the number of learning steps has reached the maximum, the DNN model with updated parameters (first trained DNN 16) is output to the second task learning unit 13. The second task learning unit 13 then reuses the first trained DNN 16 and learns the second task (classification of image data) (step S20). The second task learning unit 13 outputs the DNN whose parameters have been updated by learning (second trained DNN 17) to the image processing device 20. The learning of the second task learning unit 13 may use a general method for learning the second task; since the second task is described here as class classification, a general method for learning class classification with a neural network may be used.
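 Tying the pieces together, one first-task learning step (steps S12 to S18) might look like the following sketch, which assumes the FirstTaskNetwork, per_task_losses, and frank_wolfe_weights sketches above. As a simplification relative to the separate per-head updates described earlier, the α-weighted overall loss is backpropagated once, updating both the heads and the shared parameters.

```python
# A sketch of one first-task learning step, assuming the earlier sketches.
import torch

def first_task_step(model, optimizer, images, targets):
    outputs = model(images)                      # S12-S13: extract + classify
    losses = per_task_losses(outputs, targets)   # S14: per-task CE losses

    # S16: per-task gradients with respect to the shared parameters only.
    shared = list(model.features.parameters())
    grads = []
    for loss in losses.values():
        g = torch.autograd.grad(loss, shared, retain_graph=True)
        grads.append(torch.cat([x.flatten() for x in g]))
    alpha = frank_wolfe_weights(grads)           # Eq. (2) weights

    # S15, S17-S18: backpropagate the weighted overall loss.
    total = sum(a * losses[t] for a, t in zip(alpha, losses))
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```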
[Evaluation experiment]
 An evaluation experiment was performed with the second task set to image class classification, and the classification accuracy in the second task was determined for a DNN model trained using the pre-training method according to the present embodiment and for a DNN trained using the pre-training method described in Non-Patent Document 1.
 The experimental settings are as follows. As the image dataset, CIFAR-100 was used: a dataset for supervised learning consisting of pairs of images and teacher information (50,000 images for training and 10,000 images for testing, with teacher information). The second task classifies the CIFAR-100 image data into 100 classes. For both the first task and the second task, 45,000 images of the training dataset were used during learning, and cross-validation was performed using, as the validation dataset, 5,000 images whose classes were evenly divided. The teacher information of the dataset was not used during first-task learning. At test time for the second task (image analysis), which performs class classification, the 10,000 images of the test dataset were used. Wide-ResNet-40-10 (see Reference 3) was used for feature extraction in the feature extraction unit 151 and the feature extraction unit 131.
Reference 3: Sergey Zagoruyko and Nikos Komodakis, "Wide Residual Networks", arXiv preprint arXiv:1605.07146 (BMVC 2016).
 The experimental procedure is as follows. In the evaluation experiment, the feature extraction layer of the feature extraction unit 151 is trained as the first task (100 epochs). In the first task, the learning method according to the present embodiment pre-trains the DNN model using image data converted with Rotation, Sharpness, and Solarize, while the learning method described in Non-Patent Document 1 pre-trains the DNN model using image data converted with Rotation only. The parameters of the trained feature extraction layer are then fixed, and as the learning of the second task, class classification is trained with a linear regression model (5,000 iterations). In the second task, the test classification accuracy (Top-1 accuracy) is measured with the trained linear regression model. The second task was run three times for each condition, and the mean and standard deviation were obtained.
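 A sketch of this evaluation protocol is given below, under the assumption that the linear model is trained with cross entropy on frozen features; the optimizer, learning rate, and names are illustrative rather than the exact experimental settings.

```python
# A sketch of the linear-probe evaluation: frozen extractor, linear head.
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(features: nn.Module, loader, feature_dim=256,
                 num_classes=100, iterations=5000, lr=1e-3, device="cpu"):
    features.eval()
    for p in features.parameters():
        p.requires_grad_(False)                  # fix pre-trained parameters
    clf = nn.Linear(feature_dim, num_classes).to(device)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    it = 0
    while it < iterations:
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                z = features(x)                  # frozen feature vectors
            loss = F.cross_entropy(clf(z), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            it += 1
            if it >= iterations:
                break
    return clf

def top1_accuracy(features, clf, loader, device="cpu"):
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            pred = clf(features(x.to(device))).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.numel()
    return correct / total
```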
 FIG. 8 shows the classification accuracy of the DNN model trained using the learning method according to the embodiment and of the DNN trained using the learning method described in Non-Patent Document 1. As shown in FIG. 8, the classification accuracy of the DNN model trained using the learning method described in Non-Patent Document 1 (see Rotation in FIG. 8) was 42.97%, whereas the classification accuracy of the DNN model trained using the learning method according to the embodiment (see Rotation+Solarize+Sharpness in FIG. 8) was 48.88%. It was thus confirmed that the embodiment can achieve classification accuracy exceeding that of the conventional method. In the present embodiment, adding other conversion predictions in addition to Rotation realizes an even more robust classifier.
[Effects of the embodiment]
 In the conventional method, only Rotation was used for self-supervised learning in the pre-task (first task), so there was the problem that the improvement in prediction performance of the subsequent task (second task) was small for datasets whose data structure makes the rotation angle easy to predict or for which rotation cannot be defined.
 In contrast, in the present embodiment, a plurality of types of conversion are applied to a single piece of image data in the first task, and self-supervised learning is performed as multitask learning, so that various features of the conversion contents, relating not only to geometry but also to color tone, edges, and the like, can be acquired. In the first task, self-supervised learning combines Rotation with data conversions from which useful features can be acquired, so the accuracy of the second task is higher than with a single conversion alone. Therefore, according to the present embodiment, the accuracy of image processing can be improved by capturing the features of images. Furthermore, since learning is possible with almost the same scheme as conventional self-supervised learning, additional implementation is easy.
 In the present embodiment, since multiple conversions are applied to a single piece of image data in the first task, an effect similar to data augmentation is obtained. Specifically, the accuracy is higher than when inputs each subjected to a single conversion are used. Moreover, since the input from the feature extraction unit 151 to the first task classification unit 152 is shared, an approximation can be used when updating the gradients, improving computational efficiency by tens of times.
 In the present embodiment, the conversions applied to the image data in the first task are not limited to Rotation; two or more conversions can be selected from ShearX, ShearY, Solarize, Brightness, Color, Contrast, and Sharpness according to the processing content of the second task. Although the present embodiment has been described for the case where the second task is class classification, the second task is not limited to this and may be any task that takes an image as input and produces some output, such as object detection or segmentation. For example, in the present embodiment, when the second task is segmentation, the input image data is converted in the first task in the order Solarize, Color, Sharpness. In this way, a high effect can be expected in the embodiment by combining conversions with different properties and learning the conversion contents from multiple perspectives.
[System configuration, etc.]
 The components of the illustrated devices are functional concepts and do not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one; all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU and a program analyzed and executed by the CPU, or realized as hardware by wired logic.
 Among the processes described in the present embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified. That is, the processes described for the above learning method are not only executed in time series in the order described, but may also be executed in parallel or individually according to the processing capability of the device executing the processes or as needed.
[Program]
 FIG. 9 is a diagram showing an example of a computer that realizes the learning device 10 or the image processing device 20 by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
 The memory 1010 includes a ROM 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1100; a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
 The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the learning device 10 or the image processing device 20 is implemented as a program module 1093 in which code executable by the computer 1000 is written. The program module 1093 is stored, for example, in the hard disk drive 1031; for example, a program module 1093 for executing processing similar to the functional configuration of the learning device 10 or the image processing device 20 is stored in the hard disk drive 1031. The hard disk drive 1031 may be replaced by an SSD (Solid State Drive).
 The setting data used in the processing of the embodiment described above is stored as program data 1094, for example, in the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as needed and executes them.
 The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.) and read by the CPU 1020 via the network interface 1070.
 Although an embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the description and drawings that form part of this disclosure. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art based on this embodiment are all included within the scope of the present invention.
1 Image processing system
10 Learning device
11 Self-supervised learning unit
13 Second task learning unit
14 Learning data generation unit
15 First task learning unit
16 First trained DNN
17 Second trained DNN
20 Image processing device
21 Analysis unit
31, 41 Image data
32 Self-teacher data
42 Teacher data
131, 151 Feature extraction unit
132 Second task classification unit
133 Second parameter update unit
141 Self-teacher data generation unit
142 Data conversion unit
152 First task classification unit
153 First parameter update unit

Claims (8)

  1. A learning device comprising:
     a first learning unit that takes, as input, image data to which a plurality of conversions have been applied and performs multitask learning with a first neural network so as to estimate the conversion content for each conversion type of the conversions, thereby updating parameters including shared parameters of a feature extraction layer of the first neural network; and
     a second learning unit that takes arbitrary image data as input and learns parameters of a second neural network, in which the shared parameters learned by the first learning unit are applied to a feature extraction layer, so that the second neural network performs a predetermined process,
     wherein the first neural network has the feature extraction layer and a plurality of pre-training neural networks corresponding to the respective conversion types of the plurality of conversions, and
     the plurality of pre-training neural networks share the feature extraction layer and estimate the conversion content for each conversion type.
  2. The learning device according to claim 1, wherein the first learning unit comprises:
     a conversion unit that applies a plurality of conversions to the input image data and generates self-teacher data according to the conversion content for each conversion type of the conversions;
     a feature extraction unit that has the feature extraction layer and extracts features, based on the shared parameters, from the image data to which the plurality of conversions have been applied;
     an estimation unit that has the plurality of pre-training neural networks and estimates the conversion content of each conversion type for the image data based on the features extracted by the feature extraction unit; and
     a first parameter update unit that performs the multitask learning on the first neural network based on the conversion contents estimated by the estimation unit and the self-teacher data, and updates the shared parameters.
  3. The learning device according to claim 2, wherein the conversion unit applies, to the input image data, the plurality of conversions whose conversion properties differ.
  4. The learning device according to claim 2 or 3, wherein the conversion unit sets the plurality of conversions to be executed on the input image data according to the content of the predetermined process.
  5. The learning device according to any one of claims 1 to 4, wherein the conversion types are any two or more of rotation, horizontal or vertical shear, inversion, brightness adjustment, color tone adjustment, contrast adjustment, and edge sharpening.
6.  An image processing device comprising a processing unit that performs predetermined processing on input image data using a model having a neural network in which trained parameters are set,
    wherein the trained parameters are based on shared parameters of a feature extraction layer of a first neural network, the shared parameters having been updated by receiving, as input, image data subjected to a plurality of conversions and performing multitask learning with the first neural network so as to estimate the conversion content for each conversion type of the conversions,
    the first neural network has the feature extraction layer and a plurality of pre-training neural networks respectively corresponding to the conversion types of the plurality of conversions, and
    the plurality of pre-training neural networks share the feature extraction layer and estimate the conversion content for each conversion type.
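    As an informal reading of claim 6, the trained shared parameters can be carried into the downstream model by reusing the pre-trained trunk and attaching a head for the predetermined processing. The 64-dimensional feature size and all names continue the assumptions of the earlier sketches.

    import torch.nn as nn

    class SecondNetwork(nn.Module):
        """Illustrative second neural network for the predetermined processing."""
        def __init__(self, pretrained_trunk: nn.Module, num_classes: int):
            super().__init__()
            self.trunk = pretrained_trunk           # shared parameters applied here
            self.head = nn.Linear(64, num_classes)  # task-specific output layer

        def forward(self, x):
            return self.head(self.trunk(x))

    # e.g. model = SecondNetwork(net.trunk, num_classes=10)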
7.  A learning method executed by a learning device, the learning method comprising:
    a first learning step of receiving, as input, image data subjected to a plurality of conversions and updating parameters, including shared parameters of a feature extraction layer of a first neural network, by performing multitask learning with the first neural network so as to estimate the conversion content for each conversion type of the conversions; and
    a second learning step of receiving arbitrary image data as input and learning parameters of a second neural network so as to perform predetermined processing, using the second neural network in which the shared parameters learned in the first learning step are applied to a feature extraction layer,
    wherein the first neural network has the feature extraction layer and a plurality of pre-training neural networks respectively corresponding to the conversion types of the plurality of conversions, and
    the plurality of pre-training neural networks share the feature extraction layer and estimate the conversion content for each conversion type.
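    Putting the two steps of claim 7 together, a minimal training loop might run the self-supervised pre-training first and then fine-tune on the labeled task. Loader formats, optimizers, epoch counts, and learning rates below are editorial assumptions; first_learning_step, FirstNetwork, and SecondNetwork are the hypothetical helpers from the earlier sketches.

    import torch

    def learning_method(net, model, pretrain_loader, task_loader, epochs=10):
        # First learning step: multitask self-supervised pre-training.
        opt1 = torch.optim.Adam(net.parameters(), lr=1e-3)
        for _ in range(epochs):
            for images, _ in pretrain_loader:   # labels unused; self-teacher data is generated
                first_learning_step(net, opt1, images)

        # Second learning step: supervised learning with the shared parameters applied.
        opt2 = torch.optim.Adam(model.parameters(), lr=1e-4)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in task_loader:
                opt2.zero_grad()
                loss_fn(model(x), y).backward()
                opt2.step()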
8.  A learning program for causing a computer to execute:
    a first learning step of receiving, as input, image data subjected to a plurality of conversions and updating parameters, including shared parameters of a feature extraction layer of a first neural network, by performing multitask learning with the first neural network so as to estimate the conversion content for each conversion type of the conversions; and
    a second learning step of receiving arbitrary image data as input and learning parameters of a second neural network so as to perform predetermined processing, using the second neural network in which the shared parameters learned in the first learning step are applied to a feature extraction layer,
    wherein the first neural network has the feature extraction layer and a plurality of pre-training neural networks respectively corresponding to the conversion types of the plurality of conversions, and
    the plurality of pre-training neural networks share the feature extraction layer and estimate the conversion content for each conversion type.
PCT/JP2019/037552 2019-09-25 2019-09-25 Learning device, image processing device, learning method, and learning program WO2021059388A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/037552 WO2021059388A1 (en) 2019-09-25 2019-09-25 Learning device, image processing device, learning method, and learning program

Publications (1)

Publication Number Publication Date
WO2021059388A1 true WO2021059388A1 (en) 2021-04-01

Family

ID=75164884

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/037552 WO2021059388A1 (en) 2019-09-25 2019-09-25 Learning device, image processing device, learning method, and learning program

Country Status (1)

Country Link
WO (1) WO2021059388A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022249415A1 (en) * 2021-05-27 2022-12-01 日本電信電話株式会社 Information provision device, information provision method, and information provision program
WO2023000872A1 (en) * 2021-07-22 2023-01-26 腾讯科技(深圳)有限公司 Supervised learning method and apparatus for image features, device, and storage medium
WO2023139760A1 (en) * 2022-01-21 2023-07-27 日本電気株式会社 Data augmentation device, data augmentation method, and non-transitory computer-readable medium
WO2023181222A1 (en) * 2022-03-23 2023-09-28 日本電信電話株式会社 Training device, training method, and training program
WO2023238258A1 (en) * 2022-06-07 2023-12-14 日本電信電話株式会社 Information provision device, information provision method, and information provision program

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002300417A (en) * 2001-03-30 2002-10-11 Fuji Photo Film Co Ltd Method for setting image processing conditions and image processor


Similar Documents

Publication Publication Date Title
WO2021059388A1 (en) Learning device, image processing device, learning method, and learning program
US10289909B2 (en) Conditional adaptation network for image classification
US10909455B2 (en) Information processing apparatus using multi-layer neural network and method therefor
US8331655B2 (en) Learning apparatus for pattern detector, learning method and computer-readable storage medium
US11270124B1 (en) Temporal bottleneck attention architecture for video action recognition
CN111507993A (en) Image segmentation method and device based on generation countermeasure network and storage medium
EP3664019A1 (en) Information processing device, information processing program, and information processing method
US9443287B2 (en) Image processing method and apparatus using trained dictionary
JP6943291B2 (en) Learning device, learning method, and program
US20220092407A1 (en) Transfer learning with machine learning systems
WO2020260862A1 (en) Facial behaviour analysis
CN113128478B (en) Model training method, pedestrian analysis method, device, equipment and storage medium
CN109242097B (en) Visual representation learning system and method for unsupervised learning
US20190362226A1 (en) Facilitate Transfer Learning Through Image Transformation
EP4200762A1 (en) Method and system for training a neural network model using gradual knowledge distillation
Wang et al. JPEG artifacts removal via contrastive representation learning
Radman et al. BiLSTM regression model for face sketch synthesis using sequential patterns
JPWO2016125500A1 (en) Feature conversion device, recognition device, feature conversion method, and computer-readable recording medium
WO2017188048A1 (en) Preparation apparatus, preparation program, and preparation method
JP2010009517A (en) Learning equipment, learning method and program for pattern detection device
CN114492581A (en) Method for classifying small sample pictures based on transfer learning and attention mechanism element learning application
CN113723587A (en) Differential learning for learning networks
CN110717402B (en) Pedestrian re-identification method based on hierarchical optimization metric learning
JP2010086466A (en) Data classification device and program
US20230073175A1 (en) Method and system for processing image based on weighted multiple kernels

Legal Events

Code  Title / Description
121   EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19946382; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122   EP: PCT application non-entry in European phase (Ref document number: 19946382; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: JP)