WO2023080509A1

WO2023080509A1 - Method and device for learning noisy labels through efficient transition matrix estimation

Info

Publication number: WO2023080509A1
Application number: PCT/KR2022/016182
Authority: WO
Inventors: 계성민; 안상일; 이준영; 장부루; 최광희
Original assignee: 주식회사 하이퍼커넥트
Priority date: 2021-11-04
Filing date: 2022-10-21
Publication date: 2023-05-11

Abstract

A method for learning noisy labels according to a first mode comprises the steps of: estimating a transition matrix on the basis of output data of a noise classifier for first input data and a first label corresponding to the first input data; calculating the loss of a clean classifier on the basis of output data of the clean classifier for the first input data, output data of the clean classifier for second input data, the first label, a second label corresponding to the second input data, and the transition matrix; calculating the loss of the noise classifier on the basis of output data of the noise classifier for the second input data, and the second label; and training the clean classifier and the noise classifier on the basis of the loss of the clean classifier and the loss of the noise classifier.

Description

Method and apparatus for learning noise labels through efficient transition matrix estimation

The present disclosure relates to a method and apparatus for learning noise labels through efficient transition matrix estimation.

Recently, a technique of classifying input data into a predetermined class by combining with a deep learning technique has been developed. The classifier (or classification model) determines which class the input data belongs to, and classifies the input data into the most similar class among the preset classes even if the input data does not belong to any of the preset classes.

On the other hand, the classifier is trained based on the result of classifying the input data and the label of the input data. When the labeling of the input data is incorrect, that is, when a noise label is learned, the performance of the classification model may degrade. Since training a classification model with only clean labels is limited by, for example, the quantity of clean labels available, various studies on how to learn noise labels have been conducted.

An object of the present disclosure is to provide a method and apparatus for learning noise labels through efficient transition matrix estimation. The problems to be solved by the present disclosure are not limited to the above-mentioned problems, and other problems and advantages of the present disclosure that are not mentioned can be understood from the following description and will be more clearly understood from the embodiments of the present disclosure. In addition, it will be appreciated that the problems and advantages to be solved by the present disclosure can be realized by the means and combinations indicated in the claims.

A method for learning a noise label according to a first aspect of the present disclosure includes: estimating a transition matrix based on output data of a noise classifier for first input data and a first label corresponding to the first input data; calculating a clean classifier loss based on output data of the clean classifier for the first input data, output data of the clean classifier for the second input data, the first label, a second label corresponding to the second input data, and the transition matrix; calculating a noise classifier loss based on the output data of the noise classifier for the second input data and the second label; and causing the clean classifier and the noise classifier to learn a noise label based on the clean classifier loss and the noise classifier loss.

A computer-readable recording medium according to a second aspect of the present disclosure includes a recording medium on which a program for executing the method described above is recorded on a computer.

An apparatus for learning a noise label according to a third aspect of the present disclosure includes a memory in which at least one program is stored; and a processor operating by executing the at least one program, wherein the processor estimates a transition matrix based on output data of a noise classifier for first input data and a first label corresponding to the first input data; calculates a clean classifier loss based on output data of the clean classifier for the first input data, output data of the clean classifier for the second input data, the first label, a second label corresponding to the second input data, and the transition matrix; calculates a noise classifier loss based on the output data of the noise classifier for the second input data and the second label; and causes the clean classifier and the noise classifier to learn noise labels based on the clean classifier loss and the noise classifier loss.

A label transition matrix can be efficiently estimated, and a classifier can be effectively trained using not only clean labels but also noise labels through the estimated transition matrix. The transition matrix is estimated adaptively, so that the classifier does not blindly trust samples that may contain already corrected labels, avoiding problems related to miscorrection. In addition, by introducing the two-head architecture, it is possible to efficiently estimate the label transition matrix through a single backpropagation and to train the classifier at high speed. In addition, the noise labels can be continuously corrected in real time while they are being learned at every iteration.

FIG. 1 is a diagram for explaining an example of classifying input data into at least one class.

FIG. 2 is a flowchart illustrating an example of a method of learning a noise label according to an embodiment.

FIG. 3 is a schematic diagram illustrating a processor forming a clean batch and a noise batch according to an embodiment.

FIG. 4 is a schematic diagram illustrating a process of estimating a transition matrix by a processor according to an embodiment.

FIG. 5 is a schematic diagram illustrating a process of calculating a clean classifier loss by a processor according to an embodiment.

FIG. 6 is a schematic diagram illustrating a process of calculating a noise classifier loss by a processor according to an embodiment.

FIG. 7 is a schematic diagram illustrating a process of optimizing parameters of a clean classifier and a noise classifier by a processor according to an embodiment.

FIG. 8 is a schematic diagram illustrating a two-head architecture according to one embodiment.

FIG. 9 shows an algorithm summarizing the noise label learning method through efficient transition matrix estimation of the present disclosure.

FIG. 10 is a block diagram illustrating an apparatus for learning a noise label according to an exemplary embodiment.

FIG. 11 is a diagram for explaining an example in which a final model is utilized according to an embodiment.

A method for learning a noise label according to the first aspect includes: estimating a transition matrix based on output data of a noise classifier for first input data and a first label corresponding to the first input data; calculating a clean classifier loss based on output data of the clean classifier for the first input data, output data of the clean classifier for the second input data, the first label, a second label corresponding to the second input data, and the transition matrix; calculating a noise classifier loss based on the output data of the noise classifier for the second input data and the second label; and training the clean classifier and the noise classifier based on the clean classifier loss and the noise classifier loss.

Advantages and features of the present disclosure, and methods of achieving them, will become clear with reference to the detailed description of embodiments taken in conjunction with the accompanying drawings. However, it should be understood that the present disclosure is not limited to the embodiments presented below, but may be implemented in various different forms, and includes all modifications, equivalents, and substitutes included in the spirit and technical scope of the present disclosure. The embodiments presented below are provided to make this disclosure complete, and to fully inform those skilled in the art of the scope of the invention to which this disclosure belongs. In describing the present disclosure, if it is determined that a detailed description of related known technologies may obscure the gist of the present disclosure, the detailed description will be omitted.

The terms used in the embodiments have been selected from general terms that are currently widely used as much as possible, but they may vary depending on the intention of a person skilled in the art or a precedent, the emergence of new technologies, and the like. In addition, in a specific case, there are also terms arbitrarily selected by the applicant, and in this case, the meaning will be described in detail in the corresponding description. Therefore, terms used in the specification should be defined based on the meaning of the term and the overall content of the specification, not simply the name of the term.

Terms used in this application are only used to describe specific embodiments, and are not intended to limit the present disclosure. Singular expressions include plural expressions unless the context clearly dictates otherwise. In this application, the terms "include" or "have" are intended to designate that there is a feature, number, step, operation, component, part, or combination thereof described in the specification, and it should be understood that the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not precluded.

Also, terms including ordinal numbers such as “first” or “second” used in the specification may be used to describe various components, but the components should not be limited by the terms. The terms may be used for the purpose of distinguishing one component from another.

Some embodiments of the present disclosure may be represented as functional block structures and various processing steps. Some or all of these functional blocks may be implemented as a varying number of hardware and/or software components that perform specific functions. For example, functional blocks of the present disclosure may be implemented by one or more microprocessors or circuit configurations for a predetermined function. Also, for example, the functional blocks of this disclosure may be implemented in various programming or scripting languages. Functional blocks may be implemented as an algorithm running on one or more processors. In addition, the present disclosure may employ prior art for electronic environment setting, signal processing, and/or data processing. Terms such as "mechanism", "element", "means" and "component" may be used broadly and are not limited to mechanical and physical components.

This application is a continuation of Korean Patent Application No. 10-2021-0150892, and the contents described in this specification are based on the contents of that application. Accordingly, the contents described in Korean Patent Application No. 10-2021-0150892 can serve as a reference for understanding the invention described in this specification, and, even where omitted below, those contents can be employed in the invention described in this specification.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings.

FIG. 1 shows an example of input data 110, a classifier (classification model) 120, and a classification result 130. Although FIG. 1 shows that the input data 110 is classified into a total of three classes, the number of classes is not limited to the example of FIG. 1.

The type of input data 110 is not limited. For example, the input data 110 may correspond to various types of data such as images, texts, and voices.

The classifier 120 may calculate a probability that the input data 110 is classified into a specific class. For example, the classifier 120 may calculate a probability that input data is classified for each class using a softmax function and cross entropy.
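As a rough illustration of the softmax and cross-entropy computation mentioned above, the following minimal sketch converts raw class scores into per-class probabilities and scores them against a label. The function names and the example scores are assumptions made for this illustration and do not come from the disclosure.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the labeled class index."""
    return -math.log(probs[label])

logits = [2.0, 0.5, -1.0]  # hypothetical scores for three classes
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # most probable class
loss = cross_entropy(probs, label=0)
```

The classifier always returns the most probable of the preset classes, which is why an input outside every class is still assigned to one of them.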

For example, assuming that the input data 110 is an image, the first class is male, and the second class is female, the classifier 120 classifies the input image into a first class or a second class. Even if the input data 110 is an animal image, the classifier 120 classifies the input image into a class determined to be more similar among the first class and the second class.

Meanwhile, the classifier 120 may be trained with training data. The classifier 120 may be trained in a direction in which the error between the result output from the classifier 120 and the actual correct answer is reduced.

Conventionally, supervised learning has achieved much by using vast amounts of annotated data, and can solve various classification tasks such as image classification, object detection, and face recognition. It has been theoretically and empirically shown that the performance of supervised learning-based classification models improves as the size of the annotated data increases. However, noise labels cannot be avoided, since not all data can be finely annotated.

Many methods have been proposed to design classifiers that are robust to noise labels. Unlike traditional methods that assume that every given label is potentially contaminated, recently proposed methods further improve the performance of the classifier by utilizing a small, inexpensively acquired clean data set. Based on a clean data set, the loss correction method reduces the influence of noise labels by modifying the loss function, and the re-weighting method penalizes samples that are likely to have noise labels. In particular, recent label correction methods can achieve remarkable performance based on model-agnostic meta-learning (MAML). These methods directly reduce the noise level by re-labeling the noise labels, raising the theoretical upper limit of prediction performance.

However, the MAML-based label correction method has two problems. One is that MAML-based label correction methods blindly trust previously miscorrected labels. Miscorrected labels are often retained during training, which causes the model to learn miscorrected labels as ground-truth labels. The other is that MAML-based label correction methods are inherently slow to train, resulting in excessive computational overhead. This inefficiency comes from having multiple training steps per iteration: the MAML-based label correction method performs a virtual update with a noisy data set, finds optimal parameters using a clean data set, and updates the real parameters with the found parameters.

To alleviate this problem, the present disclosure presents a robust and efficient method called Fast Transition Matrix Estimation for Learning with Noisy Labels (FasTEN). FasTEN learns noise labels by efficiently estimating the transition matrix, while continuously correcting them in real time.

Hereinafter, noise label learning through efficient transition matrix estimation based on the method called FasTEN proposed in the present disclosure will be described in detail with reference to FIGS. 2 to 8.

The operations shown in FIG. 2 may be executed by a noise label learning apparatus to be described later. Specifically, the method of learning the noise label shown in FIG. 2 may be executed by the processor shown in FIG. 10 .

Before step 210, the processor may form a clean batch composed of one or more first input data included in the clean data set and one or more first labels corresponding thereto, and a noise batch composed of one or more second input data included in the noise data set and one or more second labels corresponding thereto.

The clean data set includes a plurality of first input data and a plurality of first labels respectively corresponding to the plurality of first input data. In other words, the clean data set includes a plurality of first input data and first labels matched thereto. Here, all of the plurality of first labels included in the clean data set are accurately labeled for each of the plurality of first input data.

The noise data set includes a plurality of second input data and a plurality of second labels respectively corresponding to the plurality of second input data. In other words, the noise data set includes a plurality of second input data and second labels matched thereto. Here, at least some of the plurality of second labels included in the noise data set are incorrectly labeled for each of the plurality of second input data. In other words, some of the plurality of second labels are incorrectly labeled for each of the plurality of second input data.

A clean batch and a noise batch may be configured by randomly selecting from the clean data set and the noise data set, respectively. For example, the size of the clean batch and the size of the noise batch may be set to be the same, but are not limited thereto.

A specific example of how the processor forms a clean batch and a noise batch will be described later with reference to FIG. 3 .

In operation 210, the processor may estimate a transition matrix based on the output data of the noise classifier for the first input data and the first label.

A noise classifier refers to a classifier that is learned using a noisy data set. A noise classifier may include a linear classifier and a feature extractor. A noise classifier is used to estimate a transition matrix, and the estimated transition matrix is used to learn noise labels.

A specific example of how the processor estimates the transition matrix will be described later with reference to FIG. 4.

In step 220, the processor may calculate a clean classifier loss based on output data of the clean classifier for the first input data, output data of the clean classifier for the second input data, the first label, the second label, and the transition matrix.

As will be described later, the clean classifier is trained using not only the clean data set but also the noise data set. The clean classifier loss is a value used to train the clean classifier and, in an embodiment, may be calculated based on a loss based on clean data and a loss based on noise data. A clean classifier, like a noise classifier, may include a linear classifier and a feature extractor.

A specific example of how the processor calculates the clean classifier loss will be described later with reference to FIG. 5 .

In operation 230, the processor may calculate a noise classifier loss based on the second label and output data of the noise classifier for the second input data.

As described above, the noise classifier is a classifier that is learned using a noisy data set, and the noise classifier loss is used to train the noise classifier. Specifically, the noise classifier loss may be calculated based on the output data of the noise classifier for the second input data and the second label.

A specific example of how the processor calculates the noise classifier loss will be described later with reference to FIG. 6 .

In step 240, the processor may cause the clean classifier and the noise classifier to learn a noise label based on the clean classifier loss and the noise classifier loss.

A clean classifier and a noise classifier may be trained based on the clean classifier loss and the noise classifier loss calculated in the previous step. In the present disclosure, a clean classifier and a noise classifier may constitute a two-head architecture sharing a feature extractor, and through the two-head architecture, efficient learning through a single backpropagation may be realized.

A specific example of how the processor causes the clean classifier and the noise classifier to learn noise labels will be described later with reference to FIGS. 7 and 8.

Referring to FIG. 3 , a batch 320 used for one iteration may be formed from a data set 310 . In other words, the data set 310 can be divided into multiple batches, and one batch 320 can be used for one iteration. The data set 310 is a set of data used for learning a classifier (classification model). The data set 310 may include a plurality of input data and a plurality of labels corresponding thereto. Accordingly, batch 320 may also consist of one or more input data and one or more labels corresponding thereto.

The data set 310 may be a clean data set or a noise data set. That is, a clean batch may be formed from a clean data set, and a noise batch may be formed from a noise data set.

In one embodiment, for efficient estimation of the label transition matrix, the processor may form batches that have the same number of samples per class. In one embodiment, the processor may construct a clean batch by randomly selecting K samples from each of the N classes in the clean data set (where N and K are natural numbers greater than or equal to 1). Each sample consists of input data and the clean label corresponding to the input data. For example, a clean batch may be constructed according to Equation 1 below.

$$d = \{(x_i, y_i)\}_{i=1}^{KN} \tag{1}$$

In Equation 1 above, $d$ denotes a clean batch, $x_i$ denotes an input, and $y_i$ refers to the clean label of $x_i$.

In one embodiment, the processor may configure the noise batch by randomly selecting M samples from the noise data set. Each sample consists of input data and the noise label corresponding to the input data. For example, the noise batch may be configured according to Equation 2 below.

$$\tilde{d} = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{M} \tag{2}$$

In Equation 2 above, $\tilde{d}$ denotes the noise batch, $\tilde{x}_i$ denotes an input, and $\tilde{y}_i$ refers to the noise label of $\tilde{x}_i$. In one embodiment, M may be equal to KN.
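The batch formation described above can be sketched as follows: K samples are drawn from each of the N classes of the clean data set, and M = KN samples are drawn from the noise data set without regard to class. The helper names `make_clean_batch` and `make_noise_batch` and the toy data sets are hypothetical and serve only as an illustration under those assumptions.

```python
import random

def make_clean_batch(clean_set, num_classes, k, rng):
    """Pick k samples from each class, giving k * num_classes samples in total."""
    batch = []
    for c in range(num_classes):
        same_class = [s for s in clean_set if s[1] == c]
        batch.extend(rng.sample(same_class, k))  # sampling without replacement
    return batch

def make_noise_batch(noise_set, m, rng):
    """Pick m samples at random, ignoring their (possibly wrong) labels."""
    return rng.sample(noise_set, m)

rng = random.Random(0)
# Hypothetical toy data: (input, label) pairs; the inputs are only identifiers here.
clean_set = [(f"clean_{i}", i % 3) for i in range(30)]
noise_set = [(f"noisy_{i}", rng.randrange(3)) for i in range(300)]

N, K = 3, 4
clean_batch = make_clean_batch(clean_set, N, K, rng)
noise_batch = make_noise_batch(noise_set, K * N, rng)  # M = KN in this sketch
```

Balancing the clean batch per class means every row of the transition matrix is estimated from the same number of clean samples.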

Referring to FIG. 4, the input $x$ included in the clean batch 410 (corresponding to the first input data) is input to the noise classifier 420. The noise classifier 420 outputs $\tilde{f}(x)$ as output data. The calculator 430 outputs the estimated transition matrix $\hat{T}$ based on $\tilde{f}(x)$ and the label $y$ (corresponding to the first label) that corresponds to the input $x$ included in the clean batch 410. In this specification, $T$ denotes the transition matrix, and the hat in $\hat{T}$ means that the transition matrix is estimated.

Each component $T_{jk}$ of the label transition matrix $T$ can be defined as the probability that a clean label $j$ is contaminated into a noise label $k$. For example, suppose there are four classes in the data set: 'cat', 'dog', 'bird', and 'airplane'. The probability that input data for which the label 'cat' is inappropriate is nevertheless classified and labeled as 'cat' will be greater when the class of the input data is actually 'dog' than when it is 'bird' or 'airplane'. Similarly, the probability that input data for which the label 'airplane' is inappropriate is classified and labeled as 'airplane' will be greater when the class of the input data is actually 'bird' than when it is 'cat' or 'dog'. As such, the transition matrix is a matrix representing the probability of misclassification between classes.
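To make the four-class example concrete, the sketch below writes down one possible transition matrix and uses it to corrupt a clean label. The numerical values and the `corrupt` helper are hypothetical, chosen only so that 'dog' is confused with 'cat' more often than 'bird' or 'airplane' is, and 'bird' with 'airplane' more often than 'cat' or 'dog' is.

```python
import random

classes = ["cat", "dog", "bird", "airplane"]

# Hypothetical values: T[j][k] is the probability that clean label j
# is recorded as noise label k. Each row is a probability distribution.
T = [
    [0.80, 0.15, 0.03, 0.02],  # clean 'cat'
    [0.15, 0.80, 0.03, 0.02],  # clean 'dog'
    [0.02, 0.03, 0.80, 0.15],  # clean 'bird'
    [0.02, 0.03, 0.15, 0.80],  # clean 'airplane'
]

def corrupt(clean_label, rng):
    """Draw a noise label for a sample whose clean label index is clean_label."""
    return rng.choices(range(len(classes)), weights=T[clean_label])[0]

rng = random.Random(0)
noisy = corrupt(classes.index("dog"), rng)
```

With this matrix, an input whose true class is 'dog' is labeled 'cat' with probability 0.15, while a 'bird' or an 'airplane' is labeled 'cat' with probability 0.02.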

In this disclosure, a small clean data set can be utilized to estimate the transition matrix. In general, building a clean data set costs more than building a noisy data set. Meanwhile, according to the present disclosure, it is possible to accurately estimate a transition matrix even when only a small clean data set is used. Accordingly, according to the present disclosure, accurate transition matrix estimation is possible at a lower cost than in the prior art.

In this disclosure, a robust and efficient method for learning noise labels while correcting them on-the-fly by learning a transition matrix is proposed. This may be a novel approach to improving label correction with transition matrix estimation. In addition, as will be described later, according to the method proposed in the present disclosure, the training speed can be increased by using a two-head architecture so that the transition matrix can be learned with a single backpropagation.

That is, this disclosure proposes a method that is efficient in terms of both training speed and prediction performance, as verified through extensive experiments.

In this disclosure, the noise classifier 420 may be used to estimate the transition matrix. As will be described later, the noise classifier 420 of the present disclosure refers to a classifier that is trained with a noise batch constructed from a noise data set. Because the noise classifier 420 is trained on noise data, it may classify clean data incorrectly. For example, a classifier trained to classify an image of a cat as having the class 'dog' will output the class 'dog' even when another cat image is input. Based on these characteristics of the noise classifier 420 and the clean data, a transition matrix can be estimated using "how the noise classifier 420 misclassifies".

It is theoretically proven that a correctly estimated label transition matrix is useful for obtaining classifiers that are statistically consistent with noise labels, i.e. robust to noise labels.

In this disclosure, a simple but accurate method of directly estimating the posterior with a clean data set can be chosen. For an input $x$ included in the batch, assume that $\tilde{y}$ and $y$ are conditionally independent.

In this disclosure, the transition matrix can be designed to be class-dependent and instance-independent, i.e. $p(\tilde{y} \mid y, x) = p(\tilde{y} \mid y)$. In the present disclosure, by parameterizing with the feature extractor $h_{\tilde{\phi}}$ and a linear classifier $g_{\tilde{\theta}}$, $p(\tilde{y} \mid x)$ can be expressed as $\tilde{f}(x) = g_{\tilde{\theta}}(h_{\tilde{\phi}}(x))$, where $\tilde{f}$ denotes a noise classifier trained only with noise labels. In this disclosure, a noise classifier may include a linear classifier and a feature extractor. If the noise classifier $\tilde{f}$ gives a perfect prediction for noisy data, the transition probability $\hat{T}_{jk}$ can be estimated using the $x$ and $y$ included in the clean batch, as shown in Equation 4 below.

$$\hat{T}_{jk} = \frac{1}{K} \sum_{(x, y) \in d,\; y = j} \left[\tilde{f}(x)\right]_k \tag{4}$$

In Equation 4 above, $d$ denotes a clean batch, $x$ denotes an input, and $y$ refers to the clean label of $x$. In Equation 4 above, $h_{\tilde{\phi}}$ is the feature extractor included in the noise classifier, $g_{\tilde{\theta}}$ means the linear classifier included in the noise classifier, and $\tilde{f}$ denotes the noise classifier.
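The estimation step can be sketched as follows, under the assumption that Equation 4 amounts to averaging the noise classifier's output probabilities over the clean samples of each class. The function name `estimate_transition_matrix` and the toy probabilities are hypothetical.

```python
def estimate_transition_matrix(noise_probs, clean_labels, num_classes):
    """Average the noise classifier's outputs over the clean samples of each class.

    noise_probs[i][k] is the noise classifier's probability of noise label k
    for clean input i, and clean_labels[i] is the clean label of that input.
    Assumes every class appears at least once, as in a class-balanced batch.
    """
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for probs, j in zip(noise_probs, clean_labels):
        counts[j] += 1
        for k in range(num_classes):
            sums[j][k] += probs[k]
    return [[s / counts[j] for s in sums[j]] for j in range(num_classes)]

# Hypothetical noise classifier outputs for a clean batch with K = 2, N = 2.
noise_probs = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]]
clean_labels = [0, 0, 1, 1]
T_hat = estimate_transition_matrix(noise_probs, clean_labels, num_classes=2)
```

In this toy batch the first row of `T_hat` is approximately [0.8, 0.2], meaning the noise classifier assigns about 20% of the probability mass for clean class 0 to noise label 1.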

The transition matrix estimated through the process described above may then be used to calculate the clean classifier loss.

In FIG. 4, the noise classifier 420 and the calculator 430 are shown as separate functional blocks for convenience of description and to divide the operation of the processor and the flow of data by function according to an embodiment of the present disclosure, but the noise classifier 420 and the calculator 430 may be configured as one piece of hardware.

Referring to FIG. 5, the input $x$ included in the clean batch 510 is input to the clean classifier 530. The clean classifier 530 outputs $f(x)$ as output data. In addition, the input $\tilde{x}$ included in the noise batch 520 (corresponding to the second input data) is input to the clean classifier 530. The clean classifier 530 outputs $f(\tilde{x})$ as output data. The calculator 540 outputs the calculated clean classifier loss $\mathcal{L}_{\text{clean}}$ based on $f(x)$, $f(\tilde{x})$, the label $y$ corresponding to the input $x$ included in the clean batch 510, the label $\tilde{y}$ (corresponding to the second label) corresponding to the input $\tilde{x}$ included in the noise batch 520, and the transition matrix $\hat{T}$.

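The clean classifier loss is calculated in order to train the clean classifier 530. For example, the clean classifier loss $\mathcal{L}_{\text{clean}}$ can be calculated according to Equation 5 below.

$$\mathcal{L}_{\text{clean}} = \frac{1}{|d|} \sum_{(x, y) \in d} \ell\big(f(x), y\big) + \frac{1}{|\tilde{d}|} \sum_{(\tilde{x}, \tilde{y}) \in \tilde{d}} \ell\big(\hat{T}^{\top} f(\tilde{x}), \tilde{y}\big) \tag{5}$$

In Equation 5 above, $d$ denotes a clean batch, $x$ denotes the input included in the clean batch, and $y$ refers to the clean label of $x$. In Equation 5 above, $\tilde{d}$ denotes the noise batch, $\tilde{x}$ denotes the input included in the noise batch, and $\tilde{y}$ refers to the noise label of $\tilde{x}$. In Equation 5 above, $\ell$ means the cross-entropy loss function, $h_{\phi}$ is the feature extractor included in the clean classifier, $g_{\theta}$ means the linear classifier included in the clean classifier, and $f = g_{\theta} \circ h_{\phi}$ denotes the clean classifier.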

That is, through the calculator 540, a clean data based loss is calculated based on the output data $f(x)$ of the clean classifier 530 for $x$ and the label $y$ corresponding to $x$; a noise data based loss is calculated based on the output data $f(\tilde{x})$ of the clean classifier 530 for $\tilde{x}$, the transition matrix $\hat{T}$, and the label $\tilde{y}$ corresponding to $\tilde{x}$; and the clean classifier loss is calculated based on the calculated clean data based loss and noise data based loss.

If the transition matrix $\hat{T}$ is estimated correctly, then the clean classifier $f$ can be statistically consistent. This approach avoids the miscorrection problem because the clean classifier does not blindly trust the corrected label.
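The two-part loss described above can be sketched as follows. This is a minimal illustration that assumes the noise data based term applies the estimated transition matrix to the clean classifier's output before taking cross-entropy against the noise label, and that the two terms are batch averages that are simply added; the function names and toy values are hypothetical.

```python
import math

def forward_correct(T_hat, clean_probs):
    """Map clean-class probabilities to noise-label probabilities:
    q[k] = sum over j of T_hat[j][k] * clean_probs[j]."""
    n = len(T_hat)
    return [sum(T_hat[j][k] * clean_probs[j] for j in range(n)) for k in range(n)]

def clean_classifier_loss(clean_out, clean_labels, noisy_out, noisy_labels, T_hat):
    """Cross-entropy on the clean batch plus corrected cross-entropy on the noise batch."""
    clean_term = sum(-math.log(p[y]) for p, y in zip(clean_out, clean_labels))
    noisy_term = sum(
        -math.log(forward_correct(T_hat, p)[y]) for p, y in zip(noisy_out, noisy_labels)
    )
    return clean_term / len(clean_out) + noisy_term / len(noisy_out)

# Hypothetical outputs of the clean classifier for one clean and one noisy sample.
T_hat = [[0.8, 0.2], [0.3, 0.7]]
clean_out, clean_labels = [[0.9, 0.1]], [0]
noisy_out, noisy_labels = [[0.9, 0.1]], [1]
loss = clean_classifier_loss(clean_out, clean_labels, noisy_out, noisy_labels, T_hat)
```

Here the noisy sample is labeled 1 although the clean classifier favors class 0. Because the corrected probability of noise label 1 is about 0.25 rather than 0.1, the classifier is penalized less than it would be if the noise label were trusted directly.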

As such, the clean classifier 530 of the present disclosure may be trained by utilizing not only clean data (clean batch) but also noise data (noise batch). Although noise data includes labels with noise, the amount of noise data is larger than that of clean data, so it is efficient to use noise data to train the clean classifier 530. It should be noted that such noisy data can be exploited only if the transition matrix is estimated correctly.

The clean classifier loss calculated in this way can then be used to optimize the parameters of the clean classifier.

In FIG. 5, the clean classifier 530 and the calculator 540 are shown as separate functional blocks for convenience of description and to divide the operation of the processor and the flow of data by function according to an embodiment of the present disclosure, but the clean classifier 530 and the calculator 540 may be configured as one piece of hardware.

Referring to FIG. 6, the input $\tilde{x}$ included in the noise batch 610 is input to the noise classifier 620. The noise classifier 620 outputs $\tilde{f}(\tilde{x})$ as output data. The calculator 630 outputs the calculated noise classifier loss $\mathcal{L}_{\text{noise}}$ based on $\tilde{f}(\tilde{x})$ and the label $\tilde{y}$ corresponding to the input $\tilde{x}$ included in the noise batch 610.

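As described above, the noise classifier 620 is trained with noise data. Because of this, if the noise classifier 620 classifies clean data, the data may be classified incorrectly. The noise classifier loss is calculated in order to train the noise classifier 620. For example, the noise classifier loss $\mathcal{L}_{\text{noise}}$ can be calculated through Equation 6 below.

$$\mathcal{L}_{\text{noise}} = \frac{1}{|\tilde{d}|} \sum_{(\tilde{x}, \tilde{y}) \in \tilde{d}} \ell\big(\tilde{f}(\tilde{x}), \tilde{y}\big) \tag{6}$$

In Equation 6 above, $\tilde{d}$ denotes the noise batch, $\tilde{x}$ denotes an input, and $\tilde{y}$ refers to the noise label of $\tilde{x}$. In Equation 6 above, $\ell$ means the cross-entropy loss function, $h_{\tilde{\phi}}$ is the feature extractor included in the noise classifier, $g_{\tilde{\theta}}$ means the linear classifier included in the noise classifier, and $\tilde{f} = g_{\tilde{\theta}} \circ h_{\tilde{\phi}}$ denotes the noise classifier.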

In the present disclosure, it may be important to update the noise classifier $\tilde{f}$ at every iteration, so that the constantly changing noise label distribution can be adaptively modeled in real time. In one embodiment, the noise label distribution may shift continuously as the label correction step dynamically corrects the noise labels to reduce the noise level.

In FIG. 6, the noise classifier 620 and the calculator 630 are shown as separate functional blocks for convenience of description and to divide the operation of the processor and the flow of data by function according to an embodiment of the present disclosure, but the noise classifier 620 and the calculator 630 may be configured as one piece of hardware.

Referring to FIG. 7, through the calculator 710, the parameters of the clean classifier 720 and the noise classifier 730 are optimized based on the clean classifier loss $\mathcal{L}_{\text{clean}}$ and the noise classifier loss $\mathcal{L}_{\text{noise}}$ calculated in the above process.

Through the process of optimizing the parameters of the clean classifier 720 based on the clean classifier loss $\mathcal{L}_{\text{clean}}$, the clean classifier 720 can learn not only clean data but also noise data. Through the process of optimizing the parameters of the noise classifier 730 based on the noise classifier loss $\mathcal{L}_{\text{noise}}$, the noise classifier 730 for estimating the transition matrix may be trained.

In FIG. 7 , a calculator 710, a clean classifier 720, and a noise classifier 730 are shown as individual functional blocks for convenience of description and functional division of the operation of a processor and the flow of data according to an embodiment of the present disclosure. However, the calculator 710, the clean classifier 720, and the noise classifier 730 may be composed of a single piece of hardware.

Referring to FIG. 8 , the clean classifier 810 includes a first linear classifier and a feature extractor, and the noise classifier 820 includes a second linear classifier and a feature extractor. At this time, the clean classifier 810 and the noise classifier 820 share the same feature extractor.

As mentioned above, the clean classifier can include a linear classifier and a feature extractor, and the noise classifier can likewise include a linear classifier and a feature extractor.

In the present disclosure, the clean classifier 810 and the noise classifier 820 may constitute a two-head architecture. To estimate the transition matrix efficiently, the present disclosure adopts a two-head architecture composed of two classifiers sharing a feature extractor, i.e., a clean classifier 810 and a noise classifier 820. In the two-head architecture, the feature extractor of the clean classifier 810 and the feature extractor of the noise classifier 820 are the same. In other words, the clean classifier 810 and the noise classifier 820 may share the same feature extractor.

In this disclosure, efficient learning can be realized through a two-head architecture. In one embodiment, the parameters of the feature extractor, the linear classifier of the clean classifier 810, and the linear classifier of the noise classifier 820 may be optimized through a single back-propagation pass over the two-head architecture. The clean classifier 810 and the noise classifier 820 may share weights by sharing the same feature extractor.

Unlike the feature extractor, the linear classifier is not shared in this embodiment, since it is impractical to model both the clean data distribution and the noise data distribution with a single linear classifier.
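The two-head arrangement can be pictured with a small forward-pass sketch. The class name `TwoHeadModel`, the single-layer ReLU extractor, and all weights below are hypothetical and chosen only to make the data flow visible; the disclosure does not specify the extractor's structure.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

class TwoHeadModel:
    """Shared feature extractor feeding a clean head and a noise head."""

    def __init__(self, extractor, clean_head, noise_head):
        self.extractor = extractor      # (W, b) shared by both heads
        self.clean_head = clean_head    # (W, b) of the first linear classifier
        self.noise_head = noise_head    # (W, b) of the second linear classifier

    def forward(self, x):
        features = relu(linear(*self.extractor, x))   # computed once per input
        clean_logits = linear(*self.clean_head, features)
        noise_logits = linear(*self.noise_head, features)
        return clean_logits, noise_logits

# Hypothetical weights for a 2-feature, 2-class toy model.
model = TwoHeadModel(
    extractor=([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    clean_head=([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    noise_head=([[0.5, 0.5], [0.5, -0.5]], [0.0, 0.0]),
)
clean_logits, noise_logits = model.forward([2.0, 1.0])
```

Since both heads read the same feature vector, gradients from both losses reach the extractor in one backward pass when such a model is trained with an automatic differentiation framework.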

In the present disclosure, when the clean classifier and the noise classifier are each defined as described above, the final objective function can be expressed as shown in Equation 7 below.

In Equation 7 above, the hyperparameter is a loss balancing factor. It is introduced into the objective function to prevent the classifier from overfitting, and its optimal value is found through experiments.
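As a rough illustration of a loss balancing factor, the sketch below combines the two losses as a weighted sum. The weighted-sum form, the choice to scale the noise classifier loss rather than the clean classifier loss, and the name `balance` are all assumptions; the disclosure only states that a balancing hyperparameter appears in the objective.

```python
def total_objective(clean_loss, noise_loss, balance):
    """Combine the two losses; `balance` plays the role of the loss balancing factor."""
    if balance < 0:
        raise ValueError("balance must be non-negative")
    return clean_loss + balance * noise_loss

# Sweep a few candidate values, as one might when tuning the factor experimentally.
candidates = {b: total_objective(0.8, 1.2, b) for b in (0.0, 0.5, 1.0)}
```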

The prior art requires a plurality of learning steps, which lowers learning efficiency. In particular, the aforementioned MAML has been widely used for sample re-weighting, label correction, and label transition matrix estimation. In this approach, a virtual update is first performed with a noisy data set, optimal parameters are then found using a clean data set, and the real parameters are finally updated with the found parameters. This virtual update process requires backpropagation three times per iteration, which at least triples the computational cost.

On the other hand, compared to conventional learning methods that assume all labels are clean, the present disclosure adds only a single linear classifier, with only N additional parameters. Also, the present disclosure requires only a single backpropagation, and the additional linear classifier is negligible in terms of computational burden. Therefore, according to the present disclosure, it is possible to greatly improve the learning speed while showing performance higher than that of the prior art.

In one embodiment, the processor may correct the noise labels included in the noise batch. In the present disclosure, the processor may correct the noise labels in each iteration, so the noise classifier that learns the noise data may also be updated in each iteration. Since the noise classifier is updated at every iteration, the transition matrix used to train the clean classifier can also be estimated differently at every iteration. That is, the present disclosure can solve one of the problems of the prior art, namely that the model learns an erroneously corrected label as a ground truth label.

In one embodiment, the processor may correct the label based on a result of comparing a threshold value with the values of the elements included in the output data of the clean classifier, represented as a probability vector, for an input included in the noise batch.

Specifically, in the present disclosure, the clean classifier produces output data, represented by a probability vector, for an input included in a noise batch formed from the noise data set. If the largest of the elements in this probability vector is greater than the threshold, the label of that input is corrected to the label corresponding to that element, which is the more probable label. Since this approach relies only on the most recent predictions of the model being trained, the decision to correct a label can change at each iteration. As a formula, the corrected label can be expressed as in Equation 8 below.

In Equation 8 above, the terms denote, respectively, the output data of the clean classifier, a floor function, and the original label of an input included in the noise batch.
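The thresholding rule described above can be sketched as follows. The function name, the use of a strict inequality, and the example threshold of 0.9 are assumptions made for illustration; the disclosure expresses the rule with a floor function, which is replaced here by an explicit comparison.

```python
def correct_label(clean_probs, original_label, threshold):
    """Return the predicted class if the clean classifier is confident enough,
    otherwise keep the original (possibly noisy) label."""
    best = max(range(len(clean_probs)), key=lambda i: clean_probs[i])
    if clean_probs[best] > threshold:   # strict '>' is an assumption
        return best
    return original_label

# First sample is confidently class 0, so its label 2 is replaced;
# second sample is not confident, so its label 2 is kept.
corrected = [
    correct_label(p, y, threshold=0.9)
    for p, y in [([0.95, 0.03, 0.02], 2), ([0.50, 0.30, 0.20], 2)]
]
```

Because the decision depends only on the current output of the clean classifier, a label corrected in one iteration may be left unchanged in the next.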

It should be noted that, in this disclosure, the probability vector used for the label correction decision is the output data of the clean classifier for the input data included in the noise batch, which has already been calculated in the process of calculating the clean classifier loss. That is, according to the present disclosure, more efficient label correction is possible by reusing an already calculated value.

In FIG. 8, the first linear classifier, the second linear classifier, and the feature extractor are shown as individual functional blocks for convenience of description and to divide functionally the operation of the processor and the flow of data according to an embodiment of the present disclosure, but the first linear classifier and feature extractor included in the clean classifier 810, the second linear classifier and feature extractor included in the noise classifier 820, the clean classifier 810, and the noise classifier 820 may be configured as one piece of hardware.

The noise label learning method through efficient conversion matrix estimation according to an embodiment of the present disclosure described with reference to FIGS. 2 to 8 can be summarized as the algorithm shown in FIG. 9 .

The noise label learning apparatus 1000 performs the noise label learning method described above with reference to FIGS. 1 to 8. Therefore, even if omitted below, those skilled in the art can easily understand that the above description of the method for learning a noise label with reference to FIGS. 1 to 8 can be equally applied to the noise label learning apparatus 1000.

Referring to FIG. 10 , a noise label learning apparatus 1000 may include a processor 1010 and a memory 1020.

The memory 1020 is operably connected to the processor 1010 and stores at least one program for the processor 1010 to operate. In addition, the memory 1020 stores all data related to the contents described above with reference to FIGS. 1 to 8 , such as learning data, input data, and class information.

For example, the memory 1020 may temporarily or permanently store data processed by the processor 1010. The memory 1020 may include magnetic storage media or flash storage media, but is not limited thereto. The memory 1020 may include built-in memory and/or external memory, and may include volatile memory such as DRAM, SRAM, or SDRAM; non-volatile memory such as one-time programmable ROM (OTPROM), PROM, EPROM, EEPROM, mask ROM, flash ROM, NAND flash memory, or NOR flash memory; flash drives such as an SSD, compact flash (CF) card, SD card, Micro-SD card, Mini-SD card, xD card, or memory stick; or a storage device such as an HDD.

The processor 1010 performs the method of classifying the input data described above with reference to FIGS. 1 to 8 according to a program stored in the memory 1020 .

The processor 1010 may form a clean batch composed of one or more first input data included in the clean data set and one or more first labels corresponding thereto, and may form a noise batch composed of one or more second input data included in the noise data set and one or more second labels corresponding thereto. Here, the first input data corresponds to an input included in the clean batch, the first label to the clean label of that input, the second input data to an input included in the noise batch, and the second label to the noise label of that input.

Specifically, the clean batch includes (N * K) samples selected from the clean data set, where N is the number of classes included in the clean data set, K is the number of samples selected from each of the classes, and N and K may be natural numbers greater than or equal to 1. Also, the noise batch is composed of (N * K) samples selected from the noise data set, where N and K may be natural numbers greater than or equal to 1.
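The batch construction can be sketched as below. Sampling without replacement, the helper names `form_clean_batch` and `form_noise_batch`, and the toy data sets are assumptions; the disclosure specifies only the batch sizes and the per-class selection for the clean batch.

```python
import random
from collections import defaultdict

def form_clean_batch(clean_data, k, rng):
    """Pick K samples from every class of the clean set (N * K samples in total)."""
    by_class = defaultdict(list)
    for x, y in clean_data:
        by_class[y].append((x, y))
    batch = []
    for y in sorted(by_class):
        batch.extend(rng.sample(by_class[y], k))
    return batch

def form_noise_batch(noise_data, m, rng):
    """Pick M samples from the noise set without looking at their labels."""
    return rng.sample(noise_data, m)

rng = random.Random(0)
clean_data = [(f"c{i}", i % 3) for i in range(12)]   # 3 classes, 4 samples each
noise_data = [(f"n{i}", i % 3) for i in range(30)]
clean_batch = form_clean_batch(clean_data, k=2, rng=rng)
noise_batch = form_noise_batch(noise_data, m=len(clean_batch), rng=rng)
```

Drawing the same number of samples from every class guarantees that each class is represented in the clean batch, which is what a per-class estimate computed from that batch would need.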

Also, the processor 1010 may estimate a transition matrix based on the output data of the noise classifier for the first input data and the first label. Here, the output data of the noise classifier for the first input data corresponds to the output of the noise classifier for an input included in the clean batch.
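One plausible way to turn these two inputs into a transition matrix is to average the noise classifier's probability vectors over the clean samples of each class, so that row i approximates the distribution of noisy labels given clean label i. The per-class averaging, the identity-row fallback for a class absent from the batch, and the function name are assumptions for this sketch, not a statement of the disclosed equation.

```python
def estimate_transition_matrix(noise_probs, clean_labels, num_classes):
    """Row i is the average noise-classifier output over clean samples of class i."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for probs, y in zip(noise_probs, clean_labels):
        counts[y] += 1
        for j, p in enumerate(probs):
            sums[y][j] += p
    matrix = []
    for i in range(num_classes):
        if counts[i] == 0:
            # No clean sample of class i in this batch: fall back to an identity row.
            matrix.append([1.0 if j == i else 0.0 for j in range(num_classes)])
        else:
            matrix.append([s / counts[i] for s in sums[i]])
    return matrix

# Noise-classifier outputs for three clean samples (made-up numbers).
T = estimate_transition_matrix(
    noise_probs=[[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]],
    clean_labels=[0, 0, 1],
    num_classes=2,
)
```

Each row of the result sums to one whenever the inputs are probability vectors, so the matrix can be applied directly to a clean-class probability vector.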

Also, the processor 1010 may calculate a clean classifier loss based on output data of the clean classifier for the first input data, output data of the clean classifier for the second input data, the first label, the second label, and the transition matrix. Here, the output data of the clean classifier for the first input data corresponds to the output of the clean classifier for an input included in the clean batch, and the output data of the clean classifier for the second input data corresponds to the output of the clean classifier for an input included in the noise batch.

Specifically, the processor 1010 calculates a clean data-based loss based on the output data of the clean classifier for the first input data and the first label, calculates a noise data-based loss based on the output data of the clean classifier for the second input data, the transition matrix, and the second label, and calculates the clean classifier loss based on the clean data-based loss and the noise data-based loss.
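The two-part loss can be illustrated with a forward-correction sketch, in which the clean classifier's probabilities for a noisy sample are mapped through the transition matrix before being compared with the noisy label. Forward correction itself, the unweighted sum of the two parts, and all names and numbers below are assumptions for illustration.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one probability vector against an integer class label."""
    return -math.log(probs[label] + 1e-12)

def forward_corrected(clean_probs, transition):
    """Map clean-class probabilities to noisy-class ones: q_j = sum_i p_i * T[i][j]."""
    n = len(transition)
    return [sum(clean_probs[i] * transition[i][j] for i in range(n)) for j in range(n)]

def clean_classifier_loss(clean_out, clean_labels, noise_out, noise_labels, transition):
    """Clean-data loss plus transition-corrected noise-data loss (both batch means)."""
    clean_part = sum(cross_entropy(p, y) for p, y in zip(clean_out, clean_labels)) / len(clean_out)
    noise_part = sum(
        cross_entropy(forward_corrected(p, transition), y)
        for p, y in zip(noise_out, noise_labels)
    ) / len(noise_out)
    return clean_part + noise_part

identity = [[1.0, 0.0], [0.0, 1.0]]   # labels assumed noise-free
flipped = [[0.0, 1.0], [1.0, 0.0]]    # every label assumed swapped
loss_identity = clean_classifier_loss([[0.9, 0.1]], [0], [[0.9, 0.1]], [1], identity)
loss_flipped = clean_classifier_loss([[0.9, 0.1]], [0], [[0.9, 0.1]], [1], flipped)
```

In this toy case the noisy sample is predicted as class 0 but labeled 1. With the flipped matrix the correction explains the disagreement, so the loss is lower than with the identity matrix, which treats the noisy label as correct.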

Also, the processor 1010 may calculate a noise classifier loss based on the output data of the noise classifier for the second input data and the second label. Here, the output data of the noise classifier for the second input data corresponds to the output of the noise classifier for an input included in the noise batch.

Specifically, the processor 1010 may calculate a noise classifier loss based on the output of the noise classifier for the second input data and the second label.

Further, the processor 1010 may cause the clean classifier and the noise classifier to learn noise labels based on the clean classifier loss and the noise classifier loss.

Specifically, the clean classifier may include a first linear classifier, the noise classifier may include a second linear classifier, and the clean classifier and the noise classifier may share the same feature extractor. Also, the processor 1010 may optimize the parameters of the feature extractor, the parameters of the first linear classifier, and the parameters of the second linear classifier through single backpropagation.

Also, the processor 1010 may correct a second label corresponding to the second input data. Specifically, correcting the second label by the processor 1010 may include generating a probability vector based on the output data of the clean classifier for the second input data, and correcting the second label based on a result of comparing the values of one or more elements included in the probability vector with a threshold value.

For example, the processor 1010 may refer to a data processing device embedded in hardware having a physically structured circuit to perform functions expressed by codes or instructions included in a program. Here, examples of the data processing device embedded in hardware may include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA), but are not limited thereto.

FIG. 11 shows a network configuration diagram including a server 1110 and a plurality of terminals 1121 to 1124 according to an embodiment.

The server 1110 may be a mediation device that connects the plurality of terminals 1121 to 1124. The server 1110 may provide a mediation service to transmit/receive data between a plurality of terminals 1121 to 1124 . The server 1110 and the plurality of terminals 1121 to 1124 may be connected to each other through a communication network. The server 1110 may transmit data to or receive data from a plurality of terminals 1121 to 1124 through a communication network.

Here, the communication network may be implemented as one of a wired communication network, a wireless communication network, and a complex communication network. For example, the communication network may include mobile communication networks such as 3G, Long Term Evolution (LTE), LTE-A, and 5G. In addition, the communication network may include a wired or wireless communication network such as Wi-Fi, Universal Mobile Telecommunications System (UMTS)/General Packet Radio Service (GPRS), or Ethernet.

The communication network may include a short-range network such as Magnetic Secure Transmission (MST), Radio Frequency IDentification (RFID), Near Field Communication (NFC), ZigBee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or infrared (IR) communication. The communication network may include a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN).

Each of the plurality of terminals 1121 to 1124 may be implemented as one of a desktop computer, a laptop computer, a smart phone, a smart tablet, a smart watch, a mobile terminal, a digital camera, a wearable device, or a portable electronic device. Also, the plurality of terminals 1121 to 1124 may execute programs or applications.

For example, the plurality of terminals 1121 to 1124 may execute an application capable of receiving mediation services. Here, the mediation service means that users of the plurality of terminals 1121 to 1124 perform video calls and/or voice calls with each other.

In order to provide mediation services, the server 1110 may perform various classification tasks. For example, the server 1110 may classify users into predetermined classes based on information provided by users of the plurality of terminals 1121 to 1124. In particular, when the server 1110 receives individual face images from people subscribed to the mediation service (i.e., users of the plurality of terminals 1121 to 1124), the server 1110 may classify the face images into predetermined classes for various purposes. For example, the predetermined class may be a class set based on a person's gender or a class set based on a person's age.

As another example, the classification model of the present disclosure may be used by the server 1110 to filter abusive elements. Specifically, when the server 1110 receives an image from a user subscribed to the mediation service (i.e., a user of one of the plurality of terminals 1121 to 1124), the server 1110 may classify the received image into a predetermined class. At this time, the predetermined class may be set based on the abusive element. For example, in response to determining that the received image includes an abusive element, the server 1110 may classify and label the received image as an 'abusive class'. In the present disclosure, an abusive element may refer to an element that needs to be prevented from being displayed, such as a harmful element, an element against public order and morals, a sadistic element, a sexual element, an element inappropriate for minors, and the like. In one embodiment, the server 1110 may stop data transmission/reception for an image classified as containing an abusive element, or may block service use by the terminal that transmitted the image. In another embodiment, the server 1110 may request an additional authentication procedure before transmitting/receiving data for an image classified as containing an abusive element. At this time, the final model generated according to the method described above with reference to FIGS. 1 to 8 is stored in the server 1110, and the server 1110 can accurately classify the image into a predetermined class.

According to the foregoing, a classification model capable of accurately classifying the input data into a predetermined class regardless of the distribution of the input data can be created.

On the other hand, the above-described method can be written as a program that can be executed on a computer, and can be implemented in a general-purpose digital computer that operates the program using a computer-readable recording medium. In addition, the structure of data used in the above-described method can be recorded on a computer-readable recording medium through various means. The computer-readable recording medium includes storage media such as magnetic storage media (e.g., ROM, RAM, USB, floppy disk, hard disk, etc.) and optical reading media (e.g., CD-ROM, DVD, etc.).

Those skilled in the art related to the present embodiment will be able to understand that it may be implemented in a modified form within a range that does not deviate from the essential characteristics of the above description. Therefore, the disclosed methods should be considered from an explanatory point of view rather than a limiting point of view, and the scope of rights is shown in the claims rather than the foregoing description, and should be construed to include all differences within the equivalent range.

Claims

A method comprising:

estimating a transition matrix based on output data of a noise classifier for first input data and a first label corresponding to the first input data;

calculating a clean classifier loss based on output data of a clean classifier for the first input data, output data of the clean classifier for second input data, the first label, a second label corresponding to the second input data, and the transition matrix;

calculating a noise classifier loss based on output data of the noise classifier for the second input data and the second label; and

causing the clean classifier and the noise classifier to learn a noise label based on the clean classifier loss and the noise classifier loss.
According to claim 1,

The first input data and the first label are included in a clean data set;

The second input data and the second label are included in a noise data set;

The clean data set includes a plurality of first input data and a plurality of first labels corresponding to each of the plurality of first input data, wherein all of the plurality of first labels are correctly labeled for the respective first input data,

The noise data set includes a plurality of second input data and a plurality of second labels corresponding to each of the plurality of second input data, wherein some of the plurality of second labels are incorrectly labeled for the respective second input data,

method.
According to claim 1,

The clean classifier includes a first linear classifier,

The noise classifier includes a second linear classifier,

The clean classifier and the noise classifier share the same feature extractor,

method.
According to claim 3,

The learning step includes optimizing parameters of the feature extractor, parameters of the first linear classifier, and parameters of the second linear classifier through single back-propagation,

method.
According to claim 1,

Calculating the clean classifier loss includes:

calculating a clean data-based loss based on output data of the clean classifier for the first input data and the first label;

calculating a noise data-based loss based on output data of the clean classifier for the second input data, the transition matrix, and the second label; and

calculating the clean classifier loss based on the clean data-based loss and the noise data-based loss,

method.
According to claim 1,

further including correcting the second label corresponding to the second input data,

method.
According to claim 6,

The step of correcting the second label includes:

correcting the second label based on a result of comparing a threshold value with values of one or more elements included in output data of the clean classifier for the second input data,

method.
According to claim 1,

The method further includes forming a clean batch composed of one or more first input data included in the clean data set and one or more first labels corresponding to the one or more first input data,

The clean batch includes (N * K) samples selected from the clean data set,

N is the number of classes included in the clean data set,

K is the number of samples selected from each of the classes, and

N and K are natural numbers of 1 or more,

method.
According to claim 8,

The method further includes forming a noise batch composed of one or more second input data included in the noise data set and one or more second labels corresponding to the one or more second input data,

The noise batch includes M samples selected from the noise data set, and

M is equal to (N * K),

method.
A computer-readable recording medium recording a program for executing the method of claim 1 on a computer.
A device comprising:

a memory in which at least one program is stored; and

a processor operating by executing the at least one program,

wherein the processor:

estimates a transition matrix based on output data of a noise classifier for first input data and a first label corresponding to the first input data;

calculates a clean classifier loss based on output data of a clean classifier for the first input data, output data of the clean classifier for second input data, the first label, a second label corresponding to the second input data, and the transition matrix;

calculates a noise classifier loss based on output data of the noise classifier for the second input data and the second label; and

causes the clean classifier and the noise classifier to learn a noise label based on the clean classifier loss and the noise classifier loss.
According to claim 11,

The first input data and the first label are included in a clean data set;

The second input data and the second label are included in a noise data set;

The clean data set includes a plurality of first input data and a plurality of first labels corresponding to each of the plurality of first input data, wherein all of the plurality of first labels are correctly labeled for the respective first input data,

The noise data set includes a plurality of second input data and a plurality of second labels corresponding to each of the plurality of second input data, wherein some of the plurality of second labels are incorrectly labeled for the respective second input data,

Device.
According to claim 11,

The clean classifier includes a first linear classifier,

The noise classifier includes a second linear classifier,

The clean classifier and the noise classifier share the same feature extractor,

Device.
According to claim 13,

The processor, in causing the clean classifier and the noise classifier to learn noise labels, optimizes parameters of the feature extractor, parameters of the first linear classifier, and parameters of the second linear classifier through single backpropagation,

Device.
According to claim 11,

The processor, in calculating the clean classifier loss:

calculates a clean data-based loss based on output data of the clean classifier for the first input data and the first label;

calculates a noise data-based loss based on output data of the clean classifier for the second input data, the transition matrix, and the second label; and

calculates the clean classifier loss based on the clean data-based loss and the noise data-based loss,

Device.
According to claim 11,

the processor corrects the second label corresponding to the second input data,

Device.
According to claim 16,

The processor, in correcting the second label, corrects the second label based on a result of comparing a threshold value with values of one or more elements included in output data of the clean classifier for the second input data,

Device.
According to claim 11,

the processor forms a clean batch composed of one or more first input data included in the clean data set and one or more first labels corresponding thereto,

The clean batch includes (N * K) samples selected from the clean data set,

N is the number of classes included in the clean data set,

K is the number of samples selected from each of the classes, and

N and K are natural numbers of 1 or more,

Device.
According to claim 18,

the processor forms a noise batch composed of one or more second input data included in the noise data set and one or more second labels corresponding thereto,

The noise batch includes M samples selected from the noise data set, and

M is equal to (N * K),

Device.