WO2022162839A1 - Learning device, learning method, and recording medium - Google Patents

Learning device, learning method, and recording medium Download PDF

Info

Publication number
WO2022162839A1
Authority
WO
WIPO (PCT)
Prior art keywords
teacher
data
learning
label
models
Prior art date
Application number
PCT/JP2021/003058
Other languages
French (fr)
Japanese (ja)
Inventor
学 中野
裕一 中谷
遊哉 石井
哲夫 井下
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 filed Critical 日本電気株式会社
Priority to JP2022577920A priority Critical patent/JPWO2022162839A5/en
Priority to PCT/JP2021/003058 priority patent/WO2022162839A1/en
Publication of WO2022162839A1 publication Critical patent/WO2022162839A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • the present invention relates to a neural network learning method using distillation.
  • In machine learning, a highly accurate learning model can be constructed by building a neural network with deep layers. Such a model is called deep learning and consists of several million to hundreds of millions of neural network units. In deep learning, it is known that the more complex the model and the deeper its layers, that is, the larger the number of units, the higher the accuracy. On the other hand, bloated models require more computer memory, so methods have been proposed for building smaller models while maintaining the performance of huge models.
  • Non-Patent Document 1 and Patent Document 1 describe a learning method called Knowledge Distillation (hereinafter "distillation"), in which a trained huge model (the "teacher model") is imitated by a small model (the "student model").
  • In this method, the data used to train the teacher model is input to both the teacher model and the student model, and the student model is trained so that its output approaches the weighted average of the predicted label output by the teacher model and the true label given in the training data. Because the learning method of Non-Patent Document 1 uses this weighted-average label, the same data used to train the teacher model is required when training the student model. However, deep learning requires a large amount of training data, and it can be difficult to retain the training data itself because of storage capacity limits, protection of privacy information contained in the data, and data copyright.
  • Non-Patent Document 2 describes distillation learning that, instead of the data used to train the teacher model, uses data unknown to the teacher model, that is, data whose true label is unknown. This learning method trains the student model so that its output approaches the predicted label of the teacher model for the unknown data.
  • In the learning method of Non-Patent Document 2, images generated using a GAN (Generative Adversarial Network) are used to perform distillation learning from the teacher model to the student model. However, if the generated images are far from the images of the target domain, no improvement in the performance of the student model can be expected.
  • One purpose of the present invention is to realize distillation learning that generates high-performance student models using unknown data.
  • In one aspect, the learning device comprises: a plurality of trained teacher models; data generation means for generating generated data based on an input pseudo-correct label, the generated data being data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label; and learning means for performing distillation learning of a student model using the generated data as input and using the plurality of teacher models.
  • In another aspect, a learning method comprises: acquiring a plurality of trained teacher models; generating generated data based on an input pseudo-correct label; and performing distillation learning of a student model using the generated data as input and using the plurality of teacher models, wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
  • In still another aspect, the recording medium records a program that causes a computer to execute processing of: acquiring a plurality of trained teacher models; generating generated data based on an input pseudo-correct label; and performing distillation learning of a student model using the generated data as input and using the plurality of teacher models, wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
  • FIG. 1 shows the hardware configuration of a learning device according to the first embodiment.
  • FIG. 2 is a flowchart showing the overall flow of the learning process.
  • FIG. 3 is a diagram illustrating an example of discrimination boundaries of teacher models.
  • FIG. 4 schematically shows the method of training the teacher models.
  • FIG. 5 shows the functional configuration of the learning device when the data generation unit is trained.
  • FIG. 6 shows a configuration example of the label distribution determination unit.
  • FIG. 7 shows the functional configuration of the learning device when the student model is trained.
  • FIG. 8 is a flowchart of the student model learning process.
  • FIG. 9 shows the functional configuration of a learning device according to the second embodiment.
  • FIG. 10 is a flowchart of the learning process according to the second embodiment.
  • FIG. 1 is a block diagram showing the hardware configuration of the learning device according to the first embodiment.
  • the learning device 10 includes an interface (I/F) 12, a processor 13, a memory 14, a recording medium 15, and a database (DB) 16.
  • the interface 12 inputs and outputs data with an external device. Specifically, the interface 12 acquires learning data and unknown data used by the learning device 10 from an external device.
  • the processor 13 is a computer such as a CPU (Central Processing Unit), and controls the learning device 10 as a whole by executing a program prepared in advance.
  • the processor 13 may be a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array).
  • the processor 13 executes learning processing, which will be described later.
  • the memory 14 is composed of ROM (Read Only Memory), RAM (Random Access Memory), and the like.
  • the memory 14 stores neural network models used by the learning apparatus 10, specifically, teacher models, student models, and the like.
  • the memory 14 is also used as a working memory while the processor 13 is executing various processes.
  • the recording medium 15 is a non-volatile, non-transitory recording medium such as a disk-shaped recording medium or a semiconductor memory, and is configured to be detachable from the learning device 10 .
  • the recording medium 15 records various programs executed by the processor 13 .
  • programs recorded in the recording medium 15 are loaded into the memory 14 and executed by the processor 13 .
  • Database 16 stores data entered via interface 12 .
  • FIG. 2 is a flowchart showing the overall flow of learning processing.
  • the learning process is roughly divided into teacher model learning (step S10), data generator learning (step S20), and student model learning (step S30).
  • Training of a teacher model is to learn multiple teacher models using data obtained from multiple sites (domains). As a result, a plurality of trained teacher models are obtained.
  • the learning of the data generation unit is to learn the data generation unit that generates data used for learning the student model using a plurality of trained teacher models. Note that the data generation unit generates an image using a GAN.
  • the student model is learned by distillation using a plurality of trained teacher models and a trained data generator. A detailed description will be given below in order.
  • Objective A: Achieve high performance on images of the target domain. This is the same as ordinary training.
  • Objective B: For images outside the target domain, make the output distributions of the teacher models differ from one another as much as possible. That is, each teacher model is trained to deliberately increase the degree of disagreement of its outputs for images outside the target domain.
  • FIG. 3 shows an example of a distribution map of feature quantities.
  • class X and class Y are to be classified in a certain target domain. It is assumed that feature amounts belonging to area 1 on the distribution map are classified into class X, and feature amounts belonging to area 2 are classified into class Y.
  • the discrimination boundaries of teacher models 1 and 2 that simultaneously satisfy the above objectives A and B are indicated by F1 and F2, respectively.
  • Since the discrimination boundaries F1 and F2 both divide area 1 and area 2 into different regions, classes X and Y can be classified correctly. Therefore, teacher models 1 and 2 both satisfy objective A described above.
  • Furthermore, the discrimination boundaries F1 and F2 classify most of the regions other than areas 1 and 2 (the white regions) into different classes. Therefore, teacher models 1 and 2 also satisfy objective B above. That is, the discrimination boundaries F1 and F2 correctly classify classes X and Y of the target domain and classify most of the remaining regions into different classes; teacher models 1 and 2 therefore satisfy objectives A and B simultaneously.
  • If another teacher model 3 were generated in addition to teacher models 1 and 2, its discrimination boundary F3 would, as shown in FIG. 3, divide areas 1 and 2 into different regions in the same way as discrimination boundaries F1 and F2, while dividing the regions other than areas 1 and 2 into two regions different from those of F1 and F2.
  • a plurality of teacher models learned in this way are used in the learning of the data generation unit and in the learning of the student models, which will be described later.
  • FIG. 4 schematically shows the learning method of the teacher model.
  • N teacher models 20-1 to 20-N are learned.
  • Each teacher model 20-1 to 20-N is a model using a neural network.
  • In the following description, when the individual teacher models 20-1 to 20-N need not be distinguished, they may simply be referred to as the "teacher model 20".
  • elements to be learned are shown in gray.
  • learning data is input to each teacher model 20-1 to 20-N.
  • This learning data is learning data of the target domain, and a correct label is prepared. That is, this training data includes images obtained in the target domain and correct labels for the images.
  • Each teacher model 20-1 to 20-N outputs predicted labels 1 to N for the input image, respectively.
  • the learning device 10 learns the teacher model 20-1 so that the error between the predicted label 1 output by the teacher model 20-1 and the correct label prepared as learning data is minimized.
  • the learning device 10 also performs the same processing for the other teacher models 20-2 to 20-N to learn each of the teacher models 20-2 to 20-N.
  • each of the teacher models 20-1 to 20-N is learned to correctly predict the image data of the target domain.
  • Unknown data is data that is unknown to the teacher model, that is, data that is not used for learning of the teacher model.
  • the unknown data is an image other than the image of the target domain, and no correct label is prepared.
  • Each teacher model 20-1 to 20-N outputs prediction labels 1 to N, respectively, for the input unknown data.
  • the learning device 10 learns the teacher model 20-1 so that the degree of mismatch between the predicted label 1 and the other predicted labels 2 to N is maximized.
  • the learning device 10 also performs the same processing for the other teacher models 20-2 to 20-N to learn each of the teacher models 20-2 to 20-N.
  • As a result, each teacher model 20-1 to 20-N is trained so that its predicted labels for unknown data, that is, images of domains other than the target domain (hereinafter also referred to as "non-target domains"), disagree with those of the other teacher models as much as possible. This satisfies objective B above.
  • a data generator generates an image using a GAN.
  • the GAN is trained so that the image generated by the data generation unit is close to the image of the target domain.
  • Specifically, a consistency loss is added as a loss function when training the GAN. That is, when an image generated by the GAN is input to the plurality of teacher models, a loss is added that becomes smaller as the output distributions of the teacher models agree more closely.
  • Through the teacher model training described above, each teacher model outputs prediction labels that agree closely for images of the target domain and disagree for images of non-target domains.
  • the learning device 10 inputs images generated by the GAN to each teacher model, and calculates consistency loss based on the predicted label output by each teacher model. Then, the learning device 10 learns the GAN so as to generate an image with a small consistency loss. As a result, the GAN is trained to output an image close to the image of the target domain.
  • FIG. 5 shows the functional configuration of the learning device 10 when learning the data generator.
  • the learning device 10 includes a random number generator 31, a data generation unit 32, teacher models 20-1 to 20-N, a label error minimization unit 33, and a label distribution determination unit 34.
  • the data generation unit 32 shown in gray is the object of learning.
  • the teacher models 20-1 to 20-N have already been trained by the method described above.
  • the random number generator 31 generates random number vectors and outputs them to the data generator 32 .
  • By using random number vectors, the data generator 32 can generate images with various variations.
  • the data generator 32 is implemented as a GAN. Unknown data is input to the data generator 32; as described above, this is image data of non-target domains. The unknown data is used to make the GAN learn what natural images look like, and images from a general-purpose image dataset such as ImageNet can be used. By using images from such a dataset as the unknown data, the data generator 32 can generate natural-looking images. In the sense that the GAN learns what natural images look like, the unknown data can also be regarded as auxiliary data or proxy data.
  • a pseudo-correct label D3 is input to the data generation unit 32.
  • the pseudo-correct label D3 is data specifying the class of the image generated by the data generation unit 32, and can be, for example, a class number.
  • the data generator 32 generates an image D1 of the class indicated by the pseudo-correct label D3 based on the input random number vector and the pseudo-correct label D3, and outputs it to the teacher models 20-1 to 20-N.
  • the data generator (GAN) 32 includes a generator and a discriminator. As its basic operation, the generator receives a random number vector and the pseudo-correct label D3 and generates the image D1. The image D1 or the unknown data is input to the discriminator. The discriminator is trained with the goal of distinguishing the images D1 generated by the generator from the unknown data, and the generator is trained with the goal of generating images D1 that the discriminator cannot distinguish. In this embodiment, in addition to this training, the generator is also trained using the label error minimizing unit 33 as described later.
  • the teacher models 20-1 to 20-N each perform prediction on the image D1 and output the predicted label D2 to the label error minimizing unit 33 and the label distribution determining unit 34.
  • the predicted label output by the teacher model 20 is hereinafter referred to as a "teacher predicted label”.
  • the label distribution determination unit 34 calculates the label distribution from the teacher prediction labels D2 input from the teacher models 20-1 to 20-N, determines the pseudo-correct label D3 so that the calculated distribution becomes uniform, and outputs it to the data generator 32. For example, when the teacher model 20 performs 10-class classification, each of the teacher models 20-1 to 20-N outputs a 10-class classification result as its teacher prediction label D2.
  • the label distribution determination unit 34 aggregates the teacher prediction labels D2 output by the teacher models 20-1 to 20-N and generates a pseudo-correct label D3 indicating the class of the image that the data generation unit 32 should generate next so that the distribution becomes uniform, and outputs it to the data generator 32.
  • the data generation unit 32 generates images so that the teacher prediction labels D2 output from the teacher models 20-1 to 20-N are evenly distributed.
  • the label distribution determining unit 34 outputs the pseudo-correct label D3 to the label error minimizing unit 33.
  • the label error minimizing unit 33 trains the data generating unit 32 using the teacher prediction labels D2 input from the teacher models 20-1 to 20-N and the pseudo-correct label D3. Specifically, the label error minimizing unit 33 calculates the error between the teacher prediction label D2 output by each of the teacher models 20-1 to 20-N and the pseudo-correct label D3, and optimizes the parameters of the neural network that constitutes the data generator 32 so that the sum of these errors is minimized.
  • In addition, the label error minimization unit 33 trains the data generation unit 32 based on the consistency loss described above. Specifically, the label error minimizing unit 33 calculates the consistency loss from the teacher prediction labels D2 output by the teacher models 20-1 to 20-N.
  • the consistency loss becomes smaller as the distributions of the teacher prediction labels D2 output by the teacher models 20 agree more closely. Therefore, the label error minimizing unit 33 trains the generator of the data generation unit 32 so that the consistency loss becomes small, that is, so that the distributions of the teacher prediction labels D2 become closer.
  • As a result, the data generating unit 32 is trained to generate images for which the distributions of the teacher prediction labels D2 output by the teacher models 20-1 to 20-N agree, that is, images close to those of the target domain.
  • FIG. 6 shows a configuration example of the label distribution determination unit 34.
  • the label distribution determining section 34 includes a cumulative probability density calculating section 35 , a weight calculating section 36 and a multiplier 37 .
  • a teacher prediction label D2 output from each teacher model 20-1 to 20-N is input to a cumulative probability density calculator 35 and a multiplier 37.
  • The cumulative probability density calculation unit 35 calculates the cumulative probability distribution of each class from the teacher prediction labels D2, obtains the cumulative probability density, and inputs it to the weight calculation unit 36.
  • the weight calculator 36 calculates a weight for each class so that the cumulative probability density of each class is uniform. For example, the weight calculator 36 may use the reciprocal of the cumulative probability density as the weight, or the user may arbitrarily determine the weight for some classes.
  • the multiplier 37 then multiplies the teacher prediction label D2 by a weight to determine a pseudo-correct label D3 for each piece of unknown data.
  • FIG. 7 shows the functional configuration of the learning device 10 when learning a student model.
  • the learning device 10 includes a random number generator 31 , a data generator 32 , teacher models 20 - 1 to 20 -N, a label distribution determiner 34 , a student model 40 and a distillation learning section 41 .
  • the student model 40 is the object of learning.
  • the teacher models 20-1 to 20-N and the data generator 32 have been trained by the methods described above. The random number generator 31 and the label distribution determining unit 34 are the same as those used when training the data generating unit shown in FIG. 5.
  • the data generation unit 32 uses the pseudo-correct label D3 and the random number vector from the random number generator 31 to generate the image D1, and outputs it to the teacher models 20-1 to 20-N and to the student model 40.
  • the student model 40 is constructed using a neural network like the teacher model.
  • Each of the teacher models 20-1 to 20-N outputs a teacher prediction label D2 for the image D1 to the distillation learning unit 41.
  • the student model 40 also outputs a predicted label (hereinafter also referred to as a “student predicted label”) D5 for the image D1 to the distillation learning unit 41 .
  • the distillation learning unit 41 trains the student model 40 so that the student model 40 approaches the teacher models 20. Specifically, the distillation learning unit 41 optimizes the parameters of the neural network that constitutes the student model 40 so that the sum of the errors between the student prediction label D5 and each teacher prediction label D2 and the pseudo-correct label D3 is minimized, as sketched below. In this way, the student model is trained by distillation.
  • the data generation unit 32 has been trained so that it can generate an image D1 close to the images of the target domain from unknown data. Therefore, even if the training data of the teacher models cannot be obtained, the student model 40 undergoes distillation learning using images D1 close to those of the target domain generated from unknown data, so the performance of the teacher models can be appropriately inherited.
  • the data generation unit 32 is an example of data generation means
  • the image D1 is an example of generated data.
  • the distillation learning unit 41 is an example of learning means
  • the label distribution determination unit 34 is an example of label distribution determination means.
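A minimal sketch of the distillation loss used by the distillation learning unit 41 is given below, assuming PyTorch logits. KL divergence to each teacher and cross entropy to the pseudo-correct label are illustrative choices of the error measure; the patent does not fix the specific error functions.

```python
# Hedged sketch of the distillation loss of the distillation learning unit 41:
# the sum of the errors between the student prediction label D5 and each
# teacher prediction label D2, plus the error to the pseudo-correct label D3.
import torch
import torch.nn.functional as F

def student_distillation_loss(student_logits, teacher_logits_list, pseudo_label):
    log_student = F.log_softmax(student_logits, dim=-1)
    # Error between the student prediction D5 and each teacher prediction D2.
    teacher_term = sum(
        F.kl_div(log_student, F.softmax(z, dim=-1), reduction="batchmean")
        for z in teacher_logits_list
    )
    # Error between the student prediction D5 and the pseudo-correct label D3.
    label_term = F.cross_entropy(student_logits, pseudo_label)
    return teacher_term + label_term
```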
  • FIG. 8 is a flowchart of the student model learning process performed by the learning device 10 shown in FIG. 7. This processing is realized by the processor 13 shown in FIG. 1 executing a program prepared in advance.
  • the label distribution determination unit 34 generates a pseudo-correct label D3 and outputs it to the data generation unit 32 (step S31).
  • the data generator 32 uses the random number vector to generate the image D1 of the class indicated by the input pseudo-correct label D3, and outputs it to the teacher model 20 and the student model 40 (step S32).
  • each teacher model 20 and student model 40 predict the image D1 and output a teacher prediction label D2 and a student prediction label D5 to the distillation learning unit 41 (step S33).
  • the distillation learning unit 41 learns the student model so that the error between the student prediction label D5 and each teacher prediction label D2 and pseudo-correct label D3 is minimized (step S34).
  • the processing of steps S31 to S34 is repeatedly executed until a predetermined end condition is satisfied, and when the predetermined end condition is satisfied (step S35: Yes), the processing ends.
  • Distillation learning is thus performed using images close to those of the target domain generated by the trained data generation unit 32. Therefore, even when unknown data is used, it is possible to obtain a student model that appropriately inherits the performance of the teacher models.
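The flow of FIG. 8 (steps S31 to S35) could be realized as a loop like the sketch below, reusing `student_distillation_loss` from the sketch above. The label distribution determination is represented by a hypothetical callable `sample_pseudo_labels`, and a fixed iteration count stands in for the end condition; both are assumptions for illustration.

```python
# Sketch of the student training loop of FIG. 8; the generator, teachers,
# student, and sample_pseudo_labels are assumed to exist as in the earlier
# sketches, and the end condition is simplified to a fixed iteration count.
import torch

def train_student(generator, teachers, student, sample_pseudo_labels, optimizer,
                  n_iters=10_000, z_dim=128, batch_size=64, device="cpu"):
    for _ in range(n_iters):                                   # end condition (S35)
        pseudo_label = sample_pseudo_labels(batch_size).to(device)  # S31: pseudo-correct labels D3
        z = torch.randn(batch_size, z_dim, device=device)
        images = generator(z, pseudo_label)                    # S32: generated images D1
        with torch.no_grad():
            teacher_logits = [t(images) for t in teachers]     # S33: teacher prediction labels D2
        student_logits = student(images)                       # S33: student prediction label D5
        loss = student_distillation_loss(student_logits, teacher_logits, pseudo_label)  # S34
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```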
  • FIG. 9 shows the functional configuration of a learning device 50 according to the second embodiment.
  • the hardware configuration of the learning device 50 is the same as that shown in FIG. 1.
  • the learning device 50 performs distillation learning using unknown data that the teacher models have not learned, and includes data generating means 52, learning means 53, and a student model 54 in addition to a plurality of teacher models 51.
  • a plurality of teacher models have already been trained, and the student model 54 is the subject of learning.
  • the data generating means 52 generates generated data based on the input pseudo-correct label. Specifically, the data generating means 52 generates data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label as the generated data.
  • the learning means 53 receives the generated data and performs distillation learning of the student model 54 using a plurality of teacher models 51 . Distillation learning can thus be performed using unknown data.
  • FIG. 10 is a flowchart of learning processing according to the second embodiment.
  • a plurality of trained teacher models are acquired (step S51).
  • generation data is generated based on the input pseudo-correct label (step S52).
  • the generated data is data such that each of a plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
  • distillation learning of the student model is performed using a plurality of teacher models (step S53).
  • The learning device described above, wherein the learning means inputs the generated data to the plurality of teacher models and the student model, and trains the student model using the teacher prediction labels output by the plurality of teacher models as correct labels.
  • Appendix 4 The learning device according to appendix 3, wherein the known input data is data used for learning the teacher model, and the unknown input data is data not used for learning the teacher model.
  • (Appendix 7) The learning device according to any one of appendices 1 to 6, wherein the learning means trains the student model so as to minimize the sum of the errors between the student prediction label output by the student model and the teacher prediction labels output by the plurality of teacher models and the error between the student prediction label and the pseudo-correct label.
  • (Appendix 8) The learning device according to any one of appendices 1 to 7, further comprising label distribution determination means for adjusting the values of the pseudo-correct labels so that the teacher prediction labels output by the plurality of teacher models are evenly distributed among the classes.
  • (Appendix 9) A learning method comprising: acquiring a plurality of trained teacher models; generating generated data based on an input pseudo-correct label; and performing distillation learning of a student model using the generated data as input and using the plurality of teacher models, wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
  • (Appendix 10) A recording medium recording a program that causes a computer to execute processing of: acquiring a plurality of trained teacher models; generating generated data based on an input pseudo-correct label; and performing distillation learning of a student model using the generated data as input and using the plurality of teacher models, wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.

Abstract

This learning device performs distillation learning using unknown data that a teacher model has not learned. A data generation means generates generated data on the basis of an input pseudo-correct label. Specifically, the data generation means generates, as the generated data, data such that each of a plurality of teacher models outputs a teacher prediction label close to the pseudo-correct label when the data is input into the teacher model. A learning means uses the plurality of teacher models to perform distillation learning of a student model using the generated data as input.

Description

LEARNING DEVICE, LEARNING METHOD, AND RECORDING MEDIUM
The present invention relates to a neural network learning method using distillation.
In machine learning, a highly accurate learning model can be constructed by building a neural network with deep layers. Such a model is called deep learning and consists of several million to hundreds of millions of neural network units. In deep learning, it is known that the more complex the model and the deeper its layers, that is, the larger the number of units, the higher the accuracy. On the other hand, bloated models require more computer memory, so methods have been proposed for building smaller models while maintaining the performance of huge models.
Non-Patent Document 1 and Patent Document 1 describe a learning method called Knowledge Distillation (hereinafter "distillation"), in which a trained huge model (hereinafter the "teacher model") is imitated by a small model (hereinafter the "student model"). In this method, the data used to train the teacher model is input to both the teacher model and the student model, and the student model is trained so that its output approaches the weighted average of the predicted label output by the teacher model and the true label given in the training data. Because the learning method of Non-Patent Document 1 uses this weighted-average label, the same data used to train the teacher model is required when training the student model. However, deep learning requires a large amount of training data, and it can be difficult to retain the training data itself because of storage capacity limits, protection of privacy information contained in the data, and data copyright.
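As a concrete illustration, the weighted-average target described above can be written as a short loss function. The sketch below assumes PyTorch tensors of class logits; the mixing weight `alpha` is an illustrative parameter, not taken from the cited documents.

```python
# Hedged sketch of distillation with a weighted-average label: the student is
# pulled toward a mix of the teacher's predicted label distribution and the
# true (one-hot) label. `alpha` is illustrative.
import torch
import torch.nn.functional as F

def weighted_average_distillation_loss(student_logits, teacher_logits, true_label, alpha=0.5):
    teacher_prob = F.softmax(teacher_logits, dim=-1)               # teacher's predicted label
    true_prob = F.one_hot(true_label, teacher_prob.size(-1)).float()
    target = alpha * teacher_prob + (1.0 - alpha) * true_prob      # weighted-average label
    log_student = F.log_softmax(student_logits, dim=-1)
    return -(target * log_student).sum(dim=-1).mean()              # cross-entropy to the soft target
```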
Non-Patent Document 2 describes distillation learning that, instead of the data used to train the teacher model, uses data unknown to the teacher model, that is, data whose true label associated with the input data is unknown. This learning method trains the student model so that its output approaches the predicted label of the teacher model for the unknown data.
JP 2019-046380 A
In the learning method described in Non-Patent Document 2, images generated using a GAN (Generative Adversarial Network) are used to perform distillation learning from the teacher model to the student model. However, if the images generated with the GAN are far from the images of the target domain, no improvement in the performance of the student model can be expected.
One object of the present invention is to realize distillation learning that produces a high-performance student model using unknown data.
In one aspect of the present invention, a learning device comprises:
a plurality of trained teacher models;
data generation means for generating generated data based on an input pseudo-correct label, the generated data being data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label; and
learning means for performing distillation learning of a student model using the generated data as input and using the plurality of teacher models.
In another aspect of the present invention, a learning method comprises:
acquiring a plurality of trained teacher models;
generating generated data based on an input pseudo-correct label; and
performing distillation learning of a student model using the generated data as input and using the plurality of teacher models,
wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
In still another aspect of the present invention, a recording medium records a program that causes a computer to execute processing of:
acquiring a plurality of trained teacher models;
generating generated data based on an input pseudo-correct label; and
performing distillation learning of a student model using the generated data as input and using the plurality of teacher models,
wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
According to the present invention, it is possible to realize distillation learning that produces a high-performance student model using unknown data.
FIG. 1 shows the hardware configuration of a learning device according to the first embodiment.
FIG. 2 is a flowchart showing the overall flow of the learning process.
FIG. 3 is a diagram illustrating an example of discrimination boundaries of teacher models.
FIG. 4 schematically shows the method of training the teacher models.
FIG. 5 shows the functional configuration of the learning device when the data generation unit is trained.
FIG. 6 shows a configuration example of the label distribution determination unit.
FIG. 7 shows the functional configuration of the learning device when the student model is trained.
FIG. 8 is a flowchart of the student model learning process.
FIG. 9 shows the functional configuration of a learning device according to the second embodiment.
FIG. 10 is a flowchart of the learning process according to the second embodiment.
Preferred embodiments of the present invention are described below with reference to the drawings.
<First Embodiment>
[Basic Concept]
In general, when a student model is trained with the distillation technique (hereinafter also called "distillation learning"), the training data used to train the teacher model is used to train the student model. When that training data is not available, the student model is trained with images generated using a GAN or the like. However, if the images generated with the GAN are far from the images of the target domain, no performance improvement of the student model by distillation learning can be expected. In this embodiment, therefore, the performance of the student model obtained by distillation learning is improved by bringing the images generated by the GAN close to the domain in which the teacher models were trained, that is, the target domain.
[Hardware Configuration]
FIG. 1 is a block diagram showing the hardware configuration of the learning device according to the first embodiment. As illustrated, the learning device 10 includes an interface (I/F) 12, a processor 13, a memory 14, a recording medium 15, and a database (DB) 16.
The interface 12 inputs and outputs data to and from external devices. Specifically, the interface 12 acquires the training data and the unknown data used by the learning device 10 from an external device.
The processor 13 is a computer such as a CPU (Central Processing Unit) and controls the learning device 10 as a whole by executing a program prepared in advance. The processor 13 may instead be a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array). The processor 13 executes the learning process described later.
The memory 14 is composed of a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The memory 14 stores the neural network models used by the learning device 10, specifically the teacher models and the student model. The memory 14 is also used as working memory while the processor 13 executes various processes.
The recording medium 15 is a non-volatile, non-transitory recording medium such as a disk-shaped recording medium or a semiconductor memory, and is configured to be detachable from the learning device 10. The recording medium 15 records the various programs executed by the processor 13. When the learning device 10 executes various processes, a program recorded in the recording medium 15 is loaded into the memory 14 and executed by the processor 13. The database 16 stores the data input via the interface 12.
[Overview of the Learning Process]
Next, an overview of the learning process performed by the learning device 10 is given. FIG. 2 is a flowchart showing the overall flow of the learning process. The learning process consists broadly of teacher model training (step S10), data generation unit training (step S20), and student model training (step S30).
In teacher model training, a plurality of teacher models are trained using data obtained at a plurality of sites (domains), which yields a plurality of trained teacher models. In data generation unit training, the trained teacher models are used to train the data generation unit, which generates the data used to train the student model; the data generation unit generates images using a GAN. In student model training, the student model is trained by distillation using the trained teacher models and the trained data generation unit. Each step is described in detail below.
[Teacher Model Training]
First, the training of the teacher models is described.
(Basic concept)
In teacher model training, a teacher model is trained at each site (target domain) using the images obtained at that site. That is, a teacher model is trained for each target domain, giving a plurality of teacher models corresponding to a plurality of target domains. Each teacher model is trained to satisfy the following two objectives simultaneously.
Objective A: Achieve high performance on images of the target domain. This is the same as ordinary training.
Objective B: For images outside the target domain, make the output distributions of the teacher models differ from one another as much as possible. That is, each teacher model is trained to deliberately increase the degree of disagreement of its outputs for images outside the target domain.
An example of teacher models that satisfy objectives A and B simultaneously is described next. FIG. 3 shows an example of a feature distribution map. In this example, class X and class Y are to be classified in a certain target domain. Features belonging to area 1 on the distribution map are classified into class X, and features belonging to area 2 are classified into class Y.
Here, the discrimination boundaries of teacher models 1 and 2 that simultaneously satisfy objectives A and B are denoted F1 and F2, respectively. First, since discrimination boundaries F1 and F2 both separate area 1 and area 2 into different regions, classes X and Y are classified correctly, so teacher models 1 and 2 both satisfy objective A. Furthermore, discrimination boundaries F1 and F2 classify most of the regions other than areas 1 and 2 (the white regions) into different classes, so teacher models 1 and 2 also satisfy objective B. That is, discrimination boundaries F1 and F2 correctly classify classes X and Y of the target domain and classify most of the remaining regions into different classes; teacher models 1 and 2 therefore satisfy objectives A and B simultaneously.
If another teacher model 3 were generated in addition to teacher models 1 and 2, its discrimination boundary F3 would, as shown in FIG. 3, divide areas 1 and 2 into different regions in the same way as discrimination boundaries F1 and F2, while dividing the regions other than areas 1 and 2 into two regions different from those of F1 and F2. The plurality of teacher models trained in this way are used in the training of the data generation unit and of the student model described later.
(Teacher model training method)
FIG. 4 schematically shows the method of training the teacher models. In this example, N teacher models 20-1 to 20-N are trained. Each teacher model 20-1 to 20-N is a neural network model. In the following description, when the individual teacher models 20-1 to 20-N need not be distinguished, they may simply be referred to as the "teacher model 20". In the drawings, the elements being trained are shown in gray.
First, as shown in FIG. 4(A), training data is input to each teacher model 20-1 to 20-N. This training data is training data of the target domain, and correct labels are provided; that is, it consists of images obtained in the target domain and the correct labels for those images. Each teacher model 20-1 to 20-N outputs a predicted label 1 to N for the input image.
The learning device 10 trains the teacher model 20-1 so that the error between the predicted label 1 output by the teacher model 20-1 and the correct label provided in the training data is minimized. The learning device 10 performs the same processing for the other teacher models 20-2 to 20-N. As a result, each teacher model 20-1 to 20-N is trained to make correct predictions for image data of the target domain, satisfying objective A above.
Next, as shown in FIG. 4(B), unknown data is input to each teacher model 20-1 to 20-N. Unknown data is data that is unknown to the teacher models, that is, data not used for training them. Specifically, the unknown data consists of images other than those of the target domain, and no correct labels are provided. Each teacher model 20-1 to 20-N outputs a predicted label 1 to N for the input unknown data. The learning device 10 trains the teacher model 20-1 so that the degree of disagreement between the predicted label 1 and the other predicted labels 2 to N is maximized, and performs the same processing for the other teacher models 20-2 to 20-N. As a result, each teacher model 20-1 to 20-N is trained so that its predicted labels for unknown data, that is, images of domains other than the target domain (hereinafter also called "non-target domains"), disagree with those of the other teacher models as much as possible. This satisfies objective B above.
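As an illustration of how objectives A and B might be combined in one update, the sketch below trains N teacher models with a cross-entropy term on target-domain data and a disagreement term on unknown data. The mean pairwise L1 discrepancy between predicted distributions and the weight `lam` are assumptions for illustration; the patent leaves the exact discrepancy measure open (cf. the Saito et al. reference cited below).

```python
# Hedged sketch of two-objective teacher training (objectives A and B),
# assuming PyTorch classifier models that map images to class logits.
import itertools
import torch
import torch.nn.functional as F

def teacher_training_loss(teachers, labeled_x, labels, unknown_x, lam=1.0):
    # Objective A: each teacher should predict the target-domain labels correctly.
    loss_a = sum(F.cross_entropy(t(labeled_x), labels) for t in teachers)
    # Objective B: maximize disagreement of the predictions on unknown data,
    # implemented by subtracting the pairwise discrepancy from the loss.
    probs = [F.softmax(t(unknown_x), dim=-1) for t in teachers]
    pairs = [(p - q).abs().mean() for p, q in itertools.combinations(probs, 2)]
    discrepancy = torch.stack(pairs).mean()
    return loss_a - lam * discrepancy
```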
As a method of training with unknown data in this way, for example, the technique described in the following document can be used:
"Maximum Classifier Discrepancy for Unsupervised Domain Adaptation", Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, Tatsuya Harada; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 3723-3732.
Alternatively, a loss function expressing the degree of disagreement among the predicted labels output by the teacher models may be defined and added to the loss function used in ordinary training with labeled data.
In the description above, two kinds of training are performed in sequence: objective A is satisfied by training with the labeled data of the target domain, and objective B is then satisfied by training with the unknown data of the non-target domains. Alternatively, the labeled data and the unknown data may be mixed and input to each teacher model 20, and each teacher model 20 may be trained to satisfy objectives A and B at the same time.
[Training of the Data Generation Unit]
Next, the training of the data generation unit is described.
(Basic concept)
The data generation unit generates images using a GAN. In this embodiment, the GAN is trained so that the images it generates become close to the images of the target domain. Specifically, a consistency loss is added as a loss function when training the GAN: when an image generated by the GAN is input to the plurality of teacher models, a loss is added that becomes smaller as the output distributions of the teacher models agree more closely. Through the teacher model training described above, each teacher model outputs prediction labels that agree closely for images of the target domain and disagree for images of non-target domains. Therefore, when the prediction labels output by the teacher models for a given image agree closely (the consistency loss is small), the image is considered close to the target domain; conversely, when the prediction labels disagree (the consistency loss is large), the image is considered far from the target domain.
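One possible form of the consistency loss described above is sketched below: the mean pairwise squared distance between the teachers' output distributions for a generated image, which is small when the distributions agree. The exact functional form is not specified in the patent and is an assumption here.

```python
# Hedged sketch of a consistency loss over the teacher models' outputs.
import itertools
import torch
import torch.nn.functional as F

def consistency_loss(teacher_logits_list):
    probs = [F.softmax(z, dim=-1) for z in teacher_logits_list]
    pair_losses = [((p - q) ** 2).sum(dim=-1).mean()
                   for p, q in itertools.combinations(probs, 2)]
    return torch.stack(pair_losses).mean()   # small when the teachers agree
```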
Accordingly, the learning device 10 inputs the images generated by the GAN to the teacher models and computes the consistency loss from the prediction labels they output. The learning device 10 then trains the GAN to generate images for which the consistency loss is small. As a result, the GAN is trained to output images close to those of the target domain.
(Functional configuration)
FIG. 5 shows the functional configuration of the learning device 10 when the data generation unit is trained. The learning device 10 includes a random number generator 31, a data generation unit 32, the teacher models 20-1 to 20-N, a label error minimization unit 33, and a label distribution determination unit 34. Here, the data generation unit 32, shown in gray, is the object of training; the teacher models 20-1 to 20-N have already been trained by the method described above.
The random number generator 31 generates random number vectors and outputs them to the data generation unit 32. By using random number vectors, the data generation unit 32 can generate images with various variations. The data generation unit 32 is implemented as a GAN. Unknown data is input to the data generation unit 32; as described above, this is image data of non-target domains. The unknown data is used to make the GAN learn what natural images look like, and images from a general-purpose image dataset such as ImageNet can be used. By using images from such a dataset as the unknown data, the data generation unit 32 can generate natural-looking images. In the sense that the GAN learns what natural images look like, the unknown data can also be regarded as auxiliary data or proxy data.
A pseudo-correct label D3 is also input to the data generation unit 32. The pseudo-correct label D3 is data specifying the class of the image to be generated by the data generation unit 32, for example a class number. Based on the input random number vector and the pseudo-correct label D3, the data generation unit 32 generates an image D1 of the class indicated by the pseudo-correct label D3 and outputs it to the teacher models 20-1 to 20-N.
The data generation unit (GAN) 32 includes a generator and a discriminator. As its basic operation, the generator receives a random number vector and the pseudo-correct label D3 and generates the image D1. The image D1 or the unknown data is input to the discriminator. The discriminator is trained with the goal of distinguishing the images D1 generated by the generator from the unknown data, and the generator is trained with the goal of generating images D1 that the discriminator cannot distinguish. In this embodiment, in addition to this training, the generator is also trained using the label error minimization unit 33 as described later.
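A minimal sketch of the generator of the data generation unit 32 as a class-conditional GAN generator in PyTorch follows; the layer sizes and the label embedding are illustrative assumptions, not taken from the patent.

```python
# Hedged sketch of a conditional generator: it takes a random number vector
# and the pseudo-correct label D3 and produces an image D1 of that class.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=128, n_classes=10, img_dim=3 * 32 * 32):
        super().__init__()
        self.embed = nn.Embedding(n_classes, z_dim)  # embeds the pseudo-correct label D3
        self.net = nn.Sequential(
            nn.Linear(z_dim * 2, 512),
            nn.ReLU(),
            nn.Linear(512, img_dim),
            nn.Tanh(),
        )

    def forward(self, z, pseudo_label):
        # Concatenate the random number vector with the label embedding and
        # generate an image D1 of the class indicated by the pseudo-correct label.
        h = torch.cat([z, self.embed(pseudo_label)], dim=-1)
        return self.net(h)
```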
 教師モデル20-1~20-Nは、それぞれ画像D1に対して予測を行い、予測ラベルD2をラベル誤差最小化部33及びラベル分布決定部34へ出力する。以下、教師モデル20が出力する予測ラベルを「教師予測ラベル」と呼ぶ。ラベル分布決定部34は、教師モデル20-1~20-Nから入力される教師予測ラベルD2に基づいてラベルの分布を算出し、算出された分布が均等となるように疑似正解ラベルD3を決定してデータ生成部32へ出力する。例えば、教師モデル20が10クラスの分類を行う場合、各教師モデル20-1~20-Nは10クラスの分類結果を教師予測ラベルD2として出力する。ラベル分布決定部34は、教師モデル20-1~20-Nが出力した教師予測ラベルD2を集計し、その分布が均等となるように、次にデータ生成部32が生成すべき画像のクラスを示す疑似正解ラベルD3を生成してデータ生成部32へ出力する。これにより、データ生成部32は、教師モデル20-1~20-Nが出力する教師予測ラベルD2の分布が均等となるように画像を生成するようになる。 The teacher models 20-1 to 20-N each perform prediction on the image D1 and output the predicted label D2 to the label error minimizing unit 33 and the label distribution determining unit 34. The predicted label output by the teacher model 20 is hereinafter referred to as a "teacher predicted label". The label distribution determination unit 34 calculates the label distribution based on the teacher prediction labels D2 input from the teacher models 20-1 to 20-N, and determines the pseudo-correct label D3 so that the calculated distribution is uniform. and output to the data generator 32 . For example, when the teacher model 20 performs 10-class classification, each of the teacher models 20-1 to 20-N outputs the 10-class classification result as the teacher prediction label D2. The label distribution determination unit 34 aggregates the teacher prediction labels D2 output by the teacher models 20-1 to 20-N, and determines the class of the image to be generated by the data generation unit 32 next so that the distribution is uniform. A pseudo-correct label D3 shown is generated and output to the data generator 32 . As a result, the data generation unit 32 generates images so that the teacher prediction labels D2 output from the teacher models 20-1 to 20-N are evenly distributed.
The label distribution determination unit 34 also outputs the pseudo-correct label D3 to the label error minimization unit 33. The label error minimization unit 33 trains the data generator 32 using the teacher prediction labels D2 input from the teacher models 20-1 to 20-N and the pseudo-correct label D3. Specifically, the label error minimization unit 33 calculates the error between the teacher prediction label D2 output by each of the teacher models 20-1 to 20-N and the pseudo-correct label D3, and optimizes the parameters of the neural network constituting the data generator 32 so that the sum of these errors is minimized.
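The sum-of-errors objective described here can be sketched as follows, assuming the teacher outputs are logits, the pseudo-correct label D3 is a class index, and cross-entropy is used as the per-teacher error; the disclosure does not fix the error measure.

```python
import torch.nn.functional as F

def label_error_loss(teacher_logits_list, pseudo_label):
    # Sum over teachers of the error between the teacher prediction label D2 and the
    # pseudo-correct label D3; minimized with respect to the generator's parameters only.
    return sum(F.cross_entropy(logits, pseudo_label) for logits in teacher_logits_list)
```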
In addition, the label error minimization unit 33 trains the data generator 32 based on the consistency loss described above. Specifically, the label error minimization unit 33 calculates the consistency loss based on the teacher prediction labels D2 output by the teacher models 20-1 to 20-N. The consistency loss is a loss that becomes smaller as the distributions of the teacher prediction labels D2 output by the plurality of teacher models 20 agree with each other. The label error minimization unit 33 therefore trains the generator of the data generator 32 so that the consistency loss becomes small, that is, so that the distributions of the teacher prediction labels D2 output by the teacher models 20-1 to 20-N become closer to each other. As a result, the data generator 32 is trained to generate images for which the distributions of the teacher prediction labels D2 output by the teacher models 20-1 to 20-N agree, that is, images close to those of the target domain.
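One possible form of such a consistency loss is sketched below; measuring agreement as the variance of the teachers' softmax outputs is an assumption, since the disclosure only requires a loss that shrinks as the distributions coincide.

```python
import torch
import torch.nn.functional as F

def consistency_loss(teacher_logits_list):
    # Stack the N teachers' predicted distributions and penalize their spread:
    # the loss is zero when all teachers output identical distributions.
    probs = torch.stack([F.softmax(l, dim=1) for l in teacher_logits_list])  # (N, batch, classes)
    return probs.var(dim=0, unbiased=False).mean()
```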
FIG. 6 shows a configuration example of the label distribution determination unit 34. The label distribution determination unit 34 includes a cumulative probability density calculation unit 35, a weight calculation unit 36, and a multiplier 37. The teacher prediction labels D2 output from the teacher models 20-1 to 20-N are input to the cumulative probability density calculation unit 35 and the multiplier 37. The cumulative probability density calculation unit 35 calculates the cumulative probability distribution of each class from the teacher prediction labels D2, obtains the cumulative probability density, and inputs it to the weight calculation unit 36. The weight calculation unit 36 calculates a weight for each class so that the cumulative probability densities of the classes become uniform. For example, the weight calculation unit 36 may use the reciprocal of the cumulative probability density as the weight, or the user may arbitrarily determine the weights for some classes. The multiplier 37 then multiplies the teacher prediction labels D2 by the weights to determine the pseudo-correct label D3 for each piece of unknown data.
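A minimal sketch of this unit is given below, assuming the teacher outputs are logits and the cumulative probability density is accumulated over past batches; the exact accumulation scheme and the reciprocal weighting are assumptions consistent with, but not mandated by, the description above.

```python
import torch
import torch.nn.functional as F

class LabelDistributionDeterminer:
    """Sketch of units 35-37: cumulative density -> per-class weights -> weighted pseudo-correct label D3."""
    def __init__(self, num_classes=10):
        self.cumulative = torch.full((num_classes,), 1e-6)  # cumulative probability density per class

    def pseudo_label(self, teacher_logits_list):
        # unit 35: average the teachers' predicted distributions and accumulate per-class probability mass
        probs = torch.stack([F.softmax(l, dim=1) for l in teacher_logits_list]).mean(dim=0)
        self.cumulative += probs.mean(dim=0)
        # unit 36: reciprocal of the cumulative density, so under-represented classes receive larger weights
        weight = 1.0 / self.cumulative
        # unit 37 (multiplier): weight the teacher predictions and renormalize into a pseudo-correct label D3
        weighted = probs * weight
        return weighted / weighted.sum(dim=1, keepdim=True)
```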
[Student model learning]
Next, the learning of the student model will be described.
(Functional configuration)
FIG. 7 shows the functional configuration of the learning device 10 when training the student model. The learning device 10 includes the random number generator 31, the data generator 32, the teacher models 20-1 to 20-N, the label distribution determination unit 34, a student model 40, and a distillation learning unit 41. Here, the student model 40 is the object of learning. The teacher models 20-1 to 20-N and the data generator 32 have already been trained by the learning methods described above. The random number generator 31 and the label distribution determination unit 34 are the same as those used when training the data generator shown in FIG. 5.
When the pseudo-correct label D3 is input from the label distribution determination unit 34, the data generator 32 generates an image D1 using the pseudo-correct label D3 and the random number vector from the random number generator 31, and outputs it to the teacher models 20-1 to 20-N and the student model 40. Like the teacher models, the student model 40 is constructed using a neural network.
Each of the teacher models 20-1 to 20-N outputs a teacher prediction label D2 for the image D1 to the distillation learning unit 41. The student model 40 outputs a predicted label (hereinafter also referred to as a "student prediction label") D5 for the image D1 to the distillation learning unit 41. The distillation learning unit 41 trains the student model 40 so that it approaches the teacher models 20. Specifically, the distillation learning unit 41 optimizes the parameters of the neural network constituting the student model 40 so that the sum of the errors between the student prediction label D5 and each teacher prediction label D2 and the error between the student prediction label D5 and the pseudo-correct label D3 is minimized. In this way, the student model is trained by distillation.
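This distillation objective can be sketched as follows; the use of KL divergence for the teacher terms, cross-entropy for the pseudo-correct-label term, and a temperature parameter are assumptions, since the disclosure only requires minimizing the sum of both kinds of error (here the pseudo-correct label D3 is assumed to be a class index).

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, pseudo_label, temperature=1.0):
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    # error between the student prediction label D5 and each teacher prediction label D2
    teacher_term = sum(
        F.kl_div(log_p_student, F.softmax(t / temperature, dim=1), reduction="batchmean")
        for t in teacher_logits_list
    )
    # error between the student prediction label D5 and the pseudo-correct label D3
    pseudo_term = F.cross_entropy(student_logits, pseudo_label)
    return teacher_term + pseudo_term
```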
As described above, the data generator 32 has been trained so that it can generate images D1 close to the target-domain images based on the unknown data. Therefore, even when the training data of the teacher models is not available, the student model 40 is trained by distillation using images D1 close to the target-domain images generated from the unknown data, and can thus appropriately inherit the performance of each teacher model 20.
In the above configuration, the data generator 32 is an example of data generation means, and the image D1 is an example of generated data. The distillation learning unit 41 is an example of learning means, and the label distribution determination unit 34 is an example of label distribution determination means.
(Student model learning process)
FIG. 8 is a flowchart of the student model learning process performed by the learning device 10 shown in FIG. 7. This process is realized by the processor 13 shown in FIG. 1 executing a program prepared in advance.
First, the label distribution determination unit 34 generates a pseudo-correct label D3 and outputs it to the data generator 32 (step S31). Using a random number vector, the data generator 32 generates an image D1 of the class indicated by the input pseudo-correct label D3 and outputs it to the teacher models 20 and the student model 40 (step S32). Next, each teacher model 20 and the student model 40 perform prediction on the image D1 and output the teacher prediction labels D2 and the student prediction label D5 to the distillation learning unit 41 (step S33).
Next, the distillation learning unit 41 trains the student model so that the errors between the student prediction label D5 and each teacher prediction label D2 and the pseudo-correct label D3 are minimized (step S34). The processing of steps S31 to S34 is repeated until a predetermined end condition is satisfied; when the end condition is satisfied (step S35: Yes), the process ends.
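The loop of steps S31 to S35 can be sketched as follows; `distillation_loss` refers to the sketch above, `label_determiner.sample` is a hypothetical helper returning pseudo-correct class indices, and a fixed number of iterations stands in for the unspecified end condition.

```python
import torch

def train_student(generator, teachers, student, label_determiner, optimizer,
                  max_steps=10000, noise_dim=128, batch_size=64):
    for _ in range(max_steps):                                    # S35: repeat until the end condition holds
        pseudo_label = label_determiner.sample(batch_size)        # S31: pseudo-correct label D3
        z = torch.randn(batch_size, noise_dim)
        with torch.no_grad():
            image = generator(z, pseudo_label)                    # S32: generated image D1
            teacher_logits = [t(image) for t in teachers]         # S33: teacher prediction labels D2
        student_logits = student(image)                           # S33: student prediction label D5
        loss = distillation_loss(student_logits, teacher_logits, pseudo_label)  # S34
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```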
As described above, in the student model learning process, distillation learning is performed using images close to the target-domain images generated by the trained data generator 32. Therefore, even when unknown data is used, a student model that appropriately inherits the performance of the teacher models can be obtained.
[Second embodiment]
Next, a second embodiment of the present invention will be described. FIG. 9 shows the functional configuration of a learning device 50 according to the second embodiment. The hardware configuration of the learning device 50 is the same as that shown in FIG. 1.
The learning device 50 performs distillation learning using unknown data that the teacher models have not learned and, as illustrated, includes a plurality of teacher models 51, a data generation means 52, a learning means 53, and a student model 54. The plurality of teacher models have already been trained, and the student model 54 is the object of learning. The data generation means 52 generates generated data based on an input pseudo-correct label. Specifically, the data generation means 52 generates, as the generated data, data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label. The learning means 53 receives the generated data as input and performs distillation learning of the student model 54 using the plurality of teacher models 51. In this way, distillation learning can be performed using unknown data.
FIG. 10 is a flowchart of the learning process according to the second embodiment. First, a plurality of trained teacher models are acquired (step S51). Next, generated data is generated based on an input pseudo-correct label (step S52). Here, the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label. Then, with the generated data as input, distillation learning of the student model is performed using the plurality of teacher models (step S53).
Some or all of the above embodiments can also be described as in the following appendices, but are not limited to the following.
(Appendix 1)
A learning device comprising:
a plurality of trained teacher models;
data generation means for generating generated data based on an input pseudo-correct label, the generated data being such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label; and
learning means for performing distillation learning of a student model using the generated data as input and using the plurality of teacher models.
(Appendix 2)
The learning device according to Appendix 1, wherein the learning means inputs the generated data to the plurality of teacher models and the student model, and trains the student model using the teacher prediction labels output by the plurality of teacher models as correct labels.
(Appendix 3)
The learning device according to Appendix 1 or 2, wherein the plurality of teacher models have been trained so that the teacher prediction label each of them outputs for known input data becomes close to the correct label and the degree of disagreement among the teacher prediction labels each of them outputs for unknown input data is maximized.
(Appendix 4)
The learning device according to Appendix 3, wherein the known input data is data used for training the teacher models, and the unknown input data is data not used for training the teacher models.
(Appendix 5)
The learning device according to Appendix 3 or 4, wherein the known input data is data of a target domain, and the unknown input data is data other than the data of the target domain.
(Appendix 6)
The learning device according to any one of Appendices 1 to 5, wherein the data generation means has been trained to minimize a loss function that becomes smaller as the distributions of the teacher prediction labels output by the plurality of teacher models agree with each other when the generated data is input to the plurality of teacher models.
(Appendix 7)
The learning device according to any one of Appendices 1 to 6, wherein the learning means trains the student model so as to minimize the sum of the error between the student prediction label output by the student model and the teacher prediction labels output by the plurality of teacher models and the error between the student prediction label and the pseudo-correct label.
(Appendix 8)
The learning device according to any one of Appendices 1 to 7, further comprising label distribution determination means for adjusting the value of the pseudo-correct label so that the teacher prediction labels output by the plurality of teacher models are evenly distributed over the classes.
(Appendix 9)
A learning method comprising:
acquiring a plurality of trained teacher models;
generating generated data based on an input pseudo-correct label; and
performing distillation learning of a student model using the generated data as input and using the plurality of teacher models,
wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
(Appendix 10)
A recording medium recording a program for causing a computer to execute a process comprising:
acquiring a plurality of trained teacher models;
generating generated data based on an input pseudo-correct label; and
performing distillation learning of a student model using the generated data as input and using the plurality of teacher models,
wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
Although the present invention has been described above with reference to the embodiments and examples, the present invention is not limited to the above embodiments and examples. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
REFERENCE SIGNS LIST
10 Learning device
20 Teacher model
31 Random number generator
32 Data generator
33 Label error minimization unit
34 Label distribution determination unit
40 Student model
41 Distillation learning unit

Claims (10)

  1.  A learning device comprising:
     a plurality of trained teacher models;
     data generation means for generating generated data based on an input pseudo-correct label, the generated data being such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label; and
     learning means for performing distillation learning of a student model using the generated data as input and using the plurality of teacher models.
  2.  The learning device according to claim 1, wherein the learning means inputs the generated data to the plurality of teacher models and the student model, and trains the student model using the teacher prediction labels output by the plurality of teacher models as correct labels.
  3.  The learning device according to claim 1 or 2, wherein the plurality of teacher models have been trained so that the teacher prediction label each of them outputs for known input data becomes close to the correct label and the degree of disagreement among the teacher prediction labels each of them outputs for unknown input data is maximized.
  4.  The learning device according to claim 3, wherein the known input data is data used for training the teacher models, and the unknown input data is data not used for training the teacher models.
  5.  The learning device according to claim 3 or 4, wherein the known input data is data of a target domain, and the unknown input data is data other than the data of the target domain.
  6.  The learning device according to any one of claims 1 to 5, wherein the data generation means has been trained to minimize a loss function that becomes smaller as the distributions of the teacher prediction labels output by the plurality of teacher models agree with each other when the generated data is input to the plurality of teacher models.
  7.  The learning device according to any one of claims 1 to 6, wherein the learning means trains the student model so as to minimize the sum of the error between the student prediction label output by the student model and the teacher prediction labels output by the plurality of teacher models and the error between the student prediction label and the pseudo-correct label.
  8.  The learning device according to any one of claims 1 to 7, further comprising label distribution determination means for adjusting the value of the pseudo-correct label so that the teacher prediction labels output by the plurality of teacher models are evenly distributed over the classes.
  9.  A learning method comprising:
     acquiring a plurality of trained teacher models;
     generating generated data based on an input pseudo-correct label; and
     performing distillation learning of a student model using the generated data as input and using the plurality of teacher models,
     wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
  10.  A recording medium recording a program for causing a computer to execute a process comprising:
     acquiring a plurality of trained teacher models;
     generating generated data based on an input pseudo-correct label; and
     performing distillation learning of a student model using the generated data as input and using the plurality of teacher models,
     wherein the generated data is data such that each of the plurality of teacher models to which the generated data is input outputs a teacher prediction label close to the pseudo-correct label.
PCT/JP2021/003058 2021-01-28 2021-01-28 Learning device, learning method, and recording medium WO2022162839A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022577920A JPWO2022162839A5 (en) 2021-01-28 Learning devices, learning methods, and programs
PCT/JP2021/003058 WO2022162839A1 (en) 2021-01-28 2021-01-28 Learning device, learning method, and recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/003058 WO2022162839A1 (en) 2021-01-28 2021-01-28 Learning device, learning method, and recording medium

Publications (1)

Publication Number Publication Date
WO2022162839A1 true WO2022162839A1 (en) 2022-08-04

Family

ID=82652722

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/003058 WO2022162839A1 (en) 2021-01-28 2021-01-28 Learning device, learning method, and recording medium

Country Status (1)

Country Link
WO (1) WO2022162839A1 (en)
