CN115730656A - Out-of-distribution sample detection method using mixed unlabeled data - Google Patents

Out-of-distribution sample detection method using mixed unlabeled data

Info

Publication number
CN115730656A
CN115730656A (application CN202211434819.XA)
Authority
CN
China
Prior art keywords
distribution
sample
data
samples
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211434819.XA
Other languages
Chinese (zh)
Inventor
Wang Wei (王魏)
Sun Yixuan (孙一轩)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhigu Artificial Intelligence Research Institute Co., Ltd.
Nanjing University
Original Assignee
Nanjing Zhigu Artificial Intelligence Research Institute Co., Ltd.
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhigu Artificial Intelligence Research Institute Co., Ltd. and Nanjing University
Priority to CN202211434819.XA
Publication of CN115730656A
Legal status: Pending

Abstract

The invention discloses an out-of-distribution sample detection method using mixed unlabeled data, comprising the following steps. First, the user prepares an object library and provides class labels for a small number of objects in the library through manual annotation; the objects with class labels are called labeled training data, covering K classes in total, while the remaining objects without class labels are called unlabeled training data. Since these unlabeled training data may be contaminated with both in-distribution and out-of-distribution samples, they are also referred to as mixed unlabeled data. By using techniques such as adaptive temperature and dynamic confidence thresholds, the invention effectively distinguishes the in-distribution samples from the out-of-distribution samples in the unlabeled training data, so that the trained model can detect out-of-distribution samples more accurately while maintaining classification accuracy on in-distribution samples.

Description

Out-of-distribution sample detection method using mixed unlabeled data
Technical field:
The invention relates to an out-of-distribution sample detection method using mixed unlabeled data.
Background art:
In conventional machine learning there is a very important i.i.d. assumption: the samples of the training set and the test set are independently sampled from the same distribution. However, in real-world application scenarios, out-of-distribution samples may cause the model to give completely incorrect predictions, which is unacceptable for applications with high safety requirements, such as autonomous driving or medical diagnosis. Out-of-distribution detection therefore requires that the model correctly detect such out-of-distribution samples at inference time, so that they can be handled appropriately before any subsequent processing.
Conventional out-of-distribution detection methods generally train a classifier on a large number of labeled samples and then use a model statistic such as the output confidence to decide whether a sample is out-of-distribution. Such methods require a large amount of labeling and thus incur a high annotation cost. Some improved approaches therefore attempt to exploit unlabeled samples to enhance out-of-distribution detection performance: one class of methods performs self-supervised learning on a purely in-distribution unlabeled sample set, while another attempts to exploit a purely out-of-distribution unlabeled sample set. However, all of these methods assume that the unlabeled samples are either purely in-distribution or purely out-of-distribution, which does not match real-world scenarios, since a set of unlabeled samples is typically a mixture of in-distribution and out-of-distribution samples. How to train an out-of-distribution sample detector from a small amount of labeled data and a large amount of mixed unlabeled data is therefore an urgent problem.
Summary of the invention:
The present invention aims to solve the above problems of the prior art and provides an out-of-distribution sample detection method using mixed unlabeled data.
The technical scheme adopted by the invention is as follows:
an out-of-distribution sample detection method using mixed unlabeled data, comprising the steps of:
1) A library of objects is created as a training data set; class labels are assigned to a small number of objects in the library to construct labeled data $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x$ denotes a sample and $y$ denotes the label of the sample;
the rest form unlabeled data $U = \{x_1, \dots, x_m\}$, where $n$ denotes the number of labeled objects, $m$ denotes the number of unlabeled objects, and $K$ is the number of classes;
wherein $U_{in}$ denotes the in-distribution samples in the unlabeled data U, with $m_1$ the number of in-distribution samples in U, and $U_{out}$ denotes the out-of-distribution samples in the unlabeled data U, with $m_2$ the number of out-of-distribution samples in U;
2) Cross-entropy loss is used for training on the labeled data L, and consistency regularization training is used on the unlabeled data U, finally yielding a base model;
3) An adaptive temperature technique is used in the consistency regularization training of step 2) to distinguish in-distribution from out-of-distribution samples on the unlabeled data U;
4) For each training round $t$, two confidence thresholds $\tau^t_{in}$ and $\tau^t_{out}$ are obtained, corresponding to the in-distribution and out-of-distribution samples to be screened from the unlabeled data U;
5) The training data set is expanded with two data augmentation methods, RandAugment and mixup, so that the model generalizes better;
6) The in-distribution samples screened in step 4) are trained with the entropy-minimization principle, and the out-of-distribution samples screened in step 4) are trained with the entropy-maximization principle;
7) In the testing stage, whether the sample is an out-of-distribution sample is determined according to the output confidence of the sample.
Further, in step 2), a base model is obtained by training with the labeled data L and the unlabeled data U, the training method comprising: computing the cross-entropy loss on the labeled data $L$, denoted $\mathcal{L}_{CE}$; training with a consistency regularization loss on the unlabeled data $U$, denoted $\mathcal{L}_U$; then computing the total loss $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_U$ and backpropagating to update the parameter $\theta$ of the base model. The base model adopts a Wide ResNet-28-2 network with about 1.4M parameters.
Further, the base model is optimized with an SGD optimizer during training.
Further, in step 3), an adaptive temperature technique is used to distinguish in-distribution from out-of-distribution samples on the unlabeled data U, specifically: at the t-th training round, minimizing the negative log-likelihood of the model on an in-distribution validation set V can be expressed as
$$T_t = \operatorname*{arg\,min}_{T} \sum_{(x,y) \in V} -\log q_\theta(y \mid x; T),$$
where $T_t$ denotes the target temperature to be computed at the t-th training round; $\operatorname{arg\,min}_T$ denotes finding the $T$ that minimizes the right-hand side; $V$ is the validation set; and $q_\theta(y \mid x; T)$ denotes the posterior probability, a value between 0 and 1, that the model with parameter $\theta$ and temperature $T$ assigns to the true label $y$ of sample $x$.
Further, the two confidence thresholds $\tau^t_{in}$ and $\tau^t_{out}$ in step 4) are used to screen in-distribution and out-of-distribution samples on the unlabeled data U at the t-th training round, specifically:
(a) A two-component Gaussian mixture model is given, with components denoted $g_1$ and $g_2$; the EM algorithm is used to fit it to the output confidence distribution $\{C_\theta(x, T_t) \mid x \in U\}$ over the unlabeled data U, where $C_\theta(x, T_t) = \max_i q_\theta^i(x; T_t)$ is the output confidence of sample $x$ after it is input to the model, and $\hat{y}_x = \arg\max_i q_\theta^i(x; T_t)$ is the model's predicted class for sample $x$;
(b) The unlabeled data U is separated according to the posterior probability that each sample belongs to $g_1$ or $g_2$:
$$U_1 = \{x \in U \mid P(g_1 \mid C_\theta(x, T_t)) \ge P(g_2 \mid C_\theta(x, T_t))\}, \qquad U_2 = U \setminus U_1,$$
where $P$ is the probability symbol; for example, $P(a \mid b)$ denotes the posterior probability of $a$ given $b$, a value between 0 and 1;
(c) The two confidence thresholds are the averages of the output confidences of the samples in the two sets from step (b); specifically, taking $g_1$ as the component with the higher mean confidence,
$$\tau^t_{in} = \frac{1}{|U_1|} \sum_{x \in U_1} C_\theta(x, T_t), \qquad \tau^t_{out} = \frac{1}{|U_2|} \sum_{x \in U_2} C_\theta(x, T_t);$$
(d) Finally, the in-distribution sample set $\hat{U}_{in}$ and the out-of-distribution sample set $\hat{U}_{out}$ on the unlabeled data U are screened according to the two confidence thresholds, expressed as:
$$\hat{U}_{in} = \{x \in U \mid C_\theta(x, T_t) \ge \tau^t_{in}\}, \qquad \hat{U}_{out} = \{x \in U \mid C_\theta(x, T_t) \le \tau^t_{out}\}.$$
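A minimal sketch of this screening step, assuming scikit-learn's GaussianMixture as the EM implementation and taking $g_1$ to be the component with the higher mean confidence:

import numpy as np
from sklearn.mixture import GaussianMixture

def screen_unlabeled(confidences):
    # confidences: 1-D array of C_theta(x, T_t) for every x in U.
    c = np.asarray(confidences).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2).fit(c)  # EM fit of g_1, g_2
    post = gmm.predict_proba(c)                   # posteriors P(g_k | confidence)
    hi = int(np.argmax(gmm.means_.ravel()))       # index of higher-mean component g_1
    u1 = post[:, hi] >= post[:, 1 - hi]           # samples assigned to g_1
    tau_in = float(c[u1].mean())                  # mean confidence over U_1
    tau_out = float(c[~u1].mean())                # mean confidence over U_2
    in_mask = c.ravel() >= tau_in                 # screened in-distribution set
    out_mask = c.ravel() <= tau_out               # screened out-of-distribution set
    return in_mask, out_mask, tau_in, tau_out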
Further, in step 5), the two data augmentation methods RandAugment and mixup are used to improve the generalization performance of the model; RandAugment applies strong random perturbations to a given sample, while mixup enhances generalization through data augmentation that linearly combines sample pairs and label pairs. When the mixup data augmentation method is adopted, the augmented version of a sample $x$ is denoted $\tilde{x}$.
Further, training on the screened in-distribution samples in step 6) uses the entropy-minimization principle, with the loss function expressed as
$$\mathcal{L}_{in} = \frac{1}{|\hat{U}_{in}|} \sum_{x \in \hat{U}_{in}} \mathrm{CE}\big(\hat{\mathbf{y}}, q_\theta(\tilde{x})\big),$$
where $\hat{\mathbf{y}}$ denotes a K-dimensional one-hot pseudo label that is 1 at the index given by the pseudo label and 0 elsewhere;
training on the screened out-of-distribution samples uses the entropy-maximization principle, with the loss function expressed as
$$\mathcal{L}_{out} = -\frac{1}{|\hat{U}_{out}|} \sum_{x \in \hat{U}_{out}} H\big(q_\theta(\tilde{x})\big),$$
where $H(\cdot)$ denotes the entropy of the given distribution $q_\theta(\tilde{x})$.
After training with these two objectives, the model outputs higher confidence on in-distribution samples and lower confidence on out-of-distribution samples, making the two easier to distinguish.
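A minimal sketch of the two objectives, assuming the entropy-minimization term is realized as cross-entropy against the model's own one-hot pseudo labels and that both terms are averaged over their respective screened batches:

import torch
import torch.nn.functional as F

def entropy_losses(model, x_in_aug, x_out_aug, T=1.0):
    # Entropy minimization on screened in-distribution samples: cross-entropy
    # against the model's own (detached) one-hot pseudo labels.
    logits_in = model(x_in_aug)
    pseudo = logits_in.detach().argmax(dim=1)
    loss_in = F.cross_entropy(logits_in / T, pseudo)

    # Entropy maximization on screened out-of-distribution samples: push the
    # predictive distribution toward uniform by minimizing negative entropy.
    p_out = F.softmax(model(x_out_aug) / T, dim=1)
    entropy = -(p_out * torch.log(p_out + 1e-8)).sum(dim=1)
    loss_out = -entropy.mean()
    return loss_in, loss_out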
Further, in step 7), at the testing stage, a test sample $x$ is input into the model to obtain its output confidence $C_\theta(x, T)$, which is compared with a predetermined confidence threshold $\tau$; if $C_\theta(x, T) \ge \tau$, the sample is judged to be in-distribution and a specific predicted class is given; if $C_\theta(x, T) < \tau$, the sample is judged to be out-of-distribution.
The invention has the following beneficial effects:
the method can effectively utilize a small amount of marked samples and a large amount of mixed unmarked samples for training, and the training mode has practical significance. During testing, the method can detect not only the seen out-of-distribution categories, but also the unseen out-of-distribution categories to a certain extent, and has higher ductility.
Description of the drawings:
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of model training in the present invention.
FIG. 3 is a model test flow diagram of the present invention.
Detailed description of embodiments:
The invention will be further described below with reference to the accompanying drawings.
As shown in FIG. 1, the method for detecting out-of-distribution samples using mixed unlabeled data comprises the following steps:
step 1), establishing an object library as a training data set, giving class labels to a small number of objects in the object library to form labeled data L, and giving unlabeled data U as the rest, wherein the number of classes is K; wherein the content of the first and second substances,
by U in Representing the intra-distribution samples in unlabeled data U,
by U out Representing the out-of-distribution samples in unlabeled data U.
Take the cat-and-dog classification problem as an example: cat is the first class and dog the second. If the content of the i-th object is a cat, then $y_i = 1$, i.e. the object belongs to the first class; if it is a dog, then $y_i = 0$ and the object belongs to the second class. Assume that initially a total of n objects are labeled and the remaining m objects are unlabeled. The labeled samples are all in-distribution, while the unlabeled data may mix in-distribution and out-of-distribution samples. Again with the cat-and-dog example, the labeled samples are all cats or dogs, while the unlabeled samples may contain other content, such as tigers, trees, cups, and the like.
The first M training rounds constitute the first stage, where M is a hyperparameter. The goal of the first stage is to train a base model:
step 2), calculating cross entropy loss for samples on the marked data
Figure BDA0003946734630000061
Where CE represents the cross-entropy loss, y represents the K-dimensional one-hot label vector constructed from y,
q θ (x) Representing the output probability distribution of the sample x after passing through a model softmax layer, wherein the parameter of the model is theta; computing consistency regularization penalties for samples on unlabeled data U
Figure BDA0003946734630000062
Wherein
Figure BDA0003946734630000063
And
Figure BDA0003946734630000064
representing two different data enhancement methods, when embodied
Figure BDA0003946734630000065
The RandAugment method is employed, and
Figure BDA0003946734630000066
and adopting a standard data enhancement method such as cutting and turning.
Figure BDA0003946734630000067
Represents the output probability in class i after temperature scaling (temperature scaling), where z i Is the output logit value of class i, and T is a parameter, representing the temperature value.
Figure BDA0003946734630000071
Step 3), using the adaptive temperature technique, an appropriate temperature value $T_t$ for the current training round $t$ is computed on the in-distribution validation set V and substituted into the consistency regularization loss $\mathcal{L}_U$ to compute the specific loss value.
Step 4), the cross-entropy loss and the consistency regularization loss are summed to obtain the total loss $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_U$, which is backpropagated to update the parameters $\theta$.
The training rounds after the M-th constitute the second stage, whose goal is stronger out-of-distribution detection performance. The specific steps are as follows:
step 5), the in-distribution and out-distribution samples in the unlabeled data were screened as shown in FIG. 2. Specifically, an unlabeled sample is input into a model to obtain a sample output confidence set
{C θ (x,T_t)|x∈U}.
Secondly, fitting a binary Gaussian mixture model on the sample output confidence set by adopting an EM algorithm,
the result of the fitting is recorded as g 1 And g 2 . Next, the passing sample belongs to g 1 And g 2 To separate the unlabeled data U,
and calculating two confidence degree threshold values of the current t-th training round number
Figure BDA0003946734630000074
And
Figure BDA0003946734630000075
finally, the screening of the samples in and out of distribution is completed according to the comparison of the sample output confidence and the two thresholds.
For example, if a sample's output confidence is greater than $\tau^t_{in}$, it is screened as an in-distribution sample; if it is less than $\tau^t_{out}$, it is screened as an out-of-distribution sample.
Step 6), for a sample $x$ in the training set, an augmented sample, denoted $\tilde{x}$, is obtained with the RandAugment data augmentation method.
Step 7), another sample $x'$ is selected from the training set, and an augmented sample is obtained with the mixup data augmentation method:
$$\tilde{x} = \lambda' x + (1 - \lambda') x',$$
where $\lambda' = \max(\lambda, 1 - \lambda)$, $\lambda$ is sampled from $\mathrm{Beta}(\alpha, \alpha)$, and $\alpha$ is a hyperparameter of the Beta distribution; in the specific implementation $\alpha = 0.2$.
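A small sketch of this mixup step under the definitions above; pairing each sample x with a partner x' drawn by shuffling the batch is an assumed implementation detail:

import torch

def mixup_batch(x, alpha=0.2):
    # Mix each sample in the batch with a randomly chosen partner; returns the
    # mixed batch, the permutation (so labels can be mixed identically), and lambda'.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)            # lambda' = max(lambda, 1 - lambda)
    perm = torch.randperm(x.size(0))     # choose partners x' by shuffling the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, perm, lam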
Step 8), for the screened in-distribution samples, the entropy-minimization loss is computed, which can be expressed as
$$\mathcal{L}_{in} = \frac{1}{|\hat{U}_{in}|} \sum_{x \in \hat{U}_{in}} \mathrm{CE}\big(\hat{\mathbf{y}}, q_\theta(\tilde{x})\big),$$
where $\hat{\mathbf{y}}$ denotes the K-dimensional one-hot pseudo label.
Step 9), for the screened out-of-distribution samples, the entropy-maximization loss is computed, which can be expressed as
$$\mathcal{L}_{out} = -\frac{1}{|\hat{U}_{out}|} \sum_{x \in \hat{U}_{out}} H\big(q_\theta(\tilde{x})\big),$$
where $H(\cdot)$ denotes the entropy of the given distribution $q_\theta(\tilde{x})$.
Step 10), the three losses are summed to obtain the total loss $\mathcal{L}$, which is backpropagated to update the model parameters $\theta$.
As shown in FIG. 3, at the testing stage a test sample $x$ is input into the model to obtain its output confidence $C_\theta(x, T)$, which is compared against a predefined confidence threshold $\tau$. If $C_\theta(x, T) \ge \tau$, the sample is judged to be in-distribution and a specific predicted class is given; if $C_\theta(x, T) < \tau$, the sample is judged to be out-of-distribution.
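A sketch of this decision rule, assuming x is a single test sample carrying a batch dimension and that the threshold τ is supplied by the application:

import torch
import torch.nn.functional as F

def detect(model, x, T, tau):
    # Returns (is_in_distribution, predicted_class_or_None) for one test sample.
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x) / T, dim=1)  # temperature-scaled softmax
        conf, pred = probs.max(dim=1)           # C_theta(x, T) and predicted class
    if conf.item() >= tau:
        return True, pred.item()                # in-distribution: give the prediction
    return False, None                          # judged out-of-distribution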
The foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make modifications without departing from the principle of the invention, and such modifications should also fall within the protection scope of the invention.

Claims (8)

1. An out-of-distribution sample detection method using mixed unlabeled data, characterized in that the method comprises the following steps:
1) Establishing an object library as a training data set, assigning class labels to a small number of objects in the library to form labeled data $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$, the rest forming unlabeled data $U = \{x_1, \dots, x_m\}$, where $n$ denotes the number of labeled objects, $m$ the number of unlabeled objects, and $K$ the number of classes;
wherein $U_{in}$ denotes the in-distribution samples in the unlabeled data U, with $m_1$ the number of in-distribution samples in U, and $U_{out}$ denotes the out-of-distribution samples in the unlabeled data U, with $m_2$ the number of out-of-distribution samples in U;
2) Cross-entropy loss is used for training on the labeled data L, and consistency regularization training is used on the unlabeled data U, finally yielding a base model;
3) An adaptive temperature technique is used in the consistency regularization training of step 2) to distinguish in-distribution from out-of-distribution samples on the unlabeled data U;
4) For each training round $t$ over the labeled data L and unlabeled data U, two confidence thresholds $\tau^t_{in}$ and $\tau^t_{out}$ are obtained, corresponding to the in-distribution and out-of-distribution samples to be screened from the unlabeled data U;
5) Expanding the training data set with the two data augmentation methods RandAugment and mixup;
6) Training the in-distribution samples screened in step 4) with the entropy-minimization principle, and the out-of-distribution samples screened in step 4) with the entropy-maximization principle;
7) In the testing stage, whether the sample is an out-of-distribution sample is determined according to the output confidence of the sample.
2. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: in step 2), a base model is obtained by training with the labeled data L and the unlabeled data U, the training method comprising: computing the cross-entropy loss on the labeled data $L$, denoted $\mathcal{L}_{CE}$; training with a consistency regularization loss on the unlabeled data $U$, denoted $\mathcal{L}_U$; then computing the total loss $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_U$ and backpropagating to update the parameter $\theta$ of the base model, wherein the base model adopts a Wide ResNet-28-2 network with about 1.4M parameters.
3. The out-of-distribution sample detection method using mixed unlabeled data of claim 2, wherein: an SGD optimizer is used for optimization when training the base model.
4. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: in step 3), an adaptive temperature technique is used to distinguish in-distribution from out-of-distribution samples on the unlabeled data U, specifically: at the t-th training round, minimizing the negative log-likelihood of the model on an in-distribution validation set V can be expressed as
$$T_t = \operatorname*{arg\,min}_{T} \sum_{(x,y) \in V} -\log q_\theta(y \mid x; T),$$
where $T_t$ denotes the target temperature to be computed at the t-th training round; $\operatorname{arg\,min}_T$ denotes finding the $T$ that minimizes the right-hand side; $V$ is the validation set; and $q_\theta(y \mid x; T)$ denotes the posterior probability, a value between 0 and 1, that the model with parameter $\theta$ and temperature $T$ assigns to the true label $y$ of sample $x$.
5. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: the two confidence thresholds $\tau^t_{in}$ and $\tau^t_{out}$ in step 4) are used to screen in-distribution and out-of-distribution samples on the unlabeled data U at the t-th training round, specifically:
(a) A two-component Gaussian mixture model is given, with components denoted $g_1$ and $g_2$; the EM algorithm is used to fit it to the output confidence distribution $\{C_\theta(x, T_t) \mid x \in U\}$ over the unlabeled data U, where $C_\theta(x, T_t) = \max_i q_\theta^i(x; T_t)$ is the output confidence of sample $x$ after it is input to the model, and $\hat{y}_x = \arg\max_i q_\theta^i(x; T_t)$ is the model's predicted class for sample $x$;
(b) The unlabeled data U is separated according to the posterior probability that each sample belongs to $g_1$ or $g_2$:
$$U_1 = \{x \in U \mid P(g_1 \mid C_\theta(x, T_t)) \ge P(g_2 \mid C_\theta(x, T_t))\}, \qquad U_2 = U \setminus U_1;$$
(c) The two confidence thresholds are the averages of the output confidences of the samples in the two sets from step (b); specifically, taking $g_1$ as the component with the higher mean confidence,
$$\tau^t_{in} = \frac{1}{|U_1|} \sum_{x \in U_1} C_\theta(x, T_t), \qquad \tau^t_{out} = \frac{1}{|U_2|} \sum_{x \in U_2} C_\theta(x, T_t);$$
(d) Finally, the in-distribution sample set $\hat{U}_{in}$ and the out-of-distribution sample set $\hat{U}_{out}$ on the unlabeled data U are screened according to the two confidence thresholds, expressed as:
$$\hat{U}_{in} = \{x \in U \mid C_\theta(x, T_t) \ge \tau^t_{in}\}, \qquad \hat{U}_{out} = \{x \in U \mid C_\theta(x, T_t) \le \tau^t_{out}\}.$$
6. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: in step 5), the RandAugment and mixup methods are used to improve the generalization performance of the model; RandAugment applies strong random perturbations to a given sample, while mixup enhances generalization through data augmentation that linearly combines sample pairs and label pairs; when the mixup data augmentation method is adopted, the augmented version of a sample $x$ is denoted $\tilde{x}$.
7. The out-of-distribution sample detection method using mixed unlabeled data of claim 6, wherein: training on the screened in-distribution samples in step 6) uses the entropy-minimization principle, with the loss function expressed as
$$\mathcal{L}_{in} = \frac{1}{|\hat{U}_{in}|} \sum_{x \in \hat{U}_{in}} \mathrm{CE}\big(\hat{\mathbf{y}}, q_\theta(\tilde{x})\big),$$
where $\hat{\mathbf{y}}$ denotes a K-dimensional one-hot pseudo label that is 1 at the index given by the pseudo label and 0 elsewhere;
training on the screened out-of-distribution samples uses the entropy-maximization principle, with the loss function expressed as
$$\mathcal{L}_{out} = -\frac{1}{|\hat{U}_{out}|} \sum_{x \in \hat{U}_{out}} H\big(q_\theta(\tilde{x})\big),$$
where $H(\cdot)$ denotes the entropy of the given distribution $q_\theta(\tilde{x})$.
8. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: in step 7), at the testing stage, a test sample $x$ is input into the model to obtain its output confidence $C_\theta(x, T)$, which is compared with a predetermined confidence threshold $\tau$; if $C_\theta(x, T) \ge \tau$, the sample is judged to be in-distribution and a specific predicted class is given; if $C_\theta(x, T) < \tau$, the sample is judged to be out-of-distribution.
CN202211434819.XA 2022-11-16 2022-11-16 Out-of-distribution sample detection method using mixed unlabeled data Pending CN115730656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211434819.XA CN115730656A (en) 2022-11-16 2022-11-16 Out-of-distribution sample detection method using mixed unlabeled data


Publications (1)

Publication Number Publication Date
CN115730656A true CN115730656A (en) 2023-03-03

Family

ID=85296025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211434819.XA Pending CN115730656A (en) 2022-11-16 2022-11-16 Out-of-distribution sample detection method using mixed unmarked data

Country Status (1)

Country Link
CN (1) CN115730656A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116776248A * 2023-06-21 2023-09-19 Harbin Institute of Technology Virtual logarithm-based out-of-distribution detection method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination