CN115730656A - Out-of-distribution sample detection method using mixed unlabeled data - Google Patents

Out-of-distribution sample detection method using mixed unlabeled data

Info

Publication number
CN115730656A
CN115730656A (application CN202211434819.XA)
Authority
CN
China
Prior art keywords
distribution
sample
data
samples
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211434819.XA
Other languages
Chinese (zh)
Inventor
Wang Wei (王魏)
Sun Yixuan (孙一轩)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhigu Artificial Intelligence Research Institute Co., Ltd.
Nanjing University
Original Assignee
Nanjing Zhigu Artificial Intelligence Research Institute Co., Ltd.
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhigu Artificial Intelligence Research Institute Co., Ltd. and Nanjing University
Priority to CN202211434819.XA
Publication of CN115730656A
Legal status: Pending

Abstract

The invention discloses an out-of-distribution sample detection method using mixed unlabeled data, comprising the following steps. First, the user prepares an object library and provides class labels for a small number of objects in the library through manual annotation; the objects with class labels are called labeled training data, covering K classes in total, while the remaining objects without class labels are called unlabeled training data. Since these unlabeled training data may be contaminated with both in-distribution and out-of-distribution samples, they are also referred to as mixed unlabeled data. By using techniques such as adaptive temperature and dynamic confidence thresholds, the invention effectively distinguishes the in-distribution samples from the out-of-distribution samples in the unlabeled training data, so that the trained model can detect out-of-distribution samples more accurately while maintaining classification accuracy on in-distribution samples.

Description

Out-of-distribution sample detection method using mixed unlabeled data
Technical field:
The invention relates to an out-of-distribution sample detection method using mixed unlabeled data.
Background art:
In conventional machine learning there is a very important i.i.d. assumption: the samples of the training set and the test set are independently sampled from the same distribution. However, in real-world application scenarios, out-of-distribution samples may cause the model to give completely incorrect predictions, which is unacceptable for applications with high safety requirements, such as autonomous driving or medical diagnosis. Out-of-distribution detection therefore requires that the model correctly detect such out-of-distribution samples at inference time, so that they can be handled appropriately before any subsequent processing.
Conventional out-of-distribution detection methods generally train a classifier on a large number of labeled samples and then use a model statistic such as the output confidence to decide whether a sample is out-of-distribution. Such methods require a large amount of labeling and thus incur a high annotation cost. Some improved approaches therefore attempt to exploit unlabeled samples to enhance out-of-distribution detection performance: one class of methods performs self-supervised learning on a purely in-distribution unlabeled sample set, while another attempts to exploit a purely out-of-distribution unlabeled sample set. However, all of these methods assume that the unlabeled samples are either purely in-distribution or purely out-of-distribution, which does not match real-world scenarios, since a set of unlabeled samples is typically a mixture of in-distribution and out-of-distribution samples. How to train an out-of-distribution sample detector from a small amount of labeled data and a large amount of mixed unlabeled data is therefore an urgent problem.
Summary of the invention:
The present invention aims to solve the above problems of the prior art and provides an out-of-distribution sample detection method using mixed unlabeled data.
The technical scheme adopted by the invention is as follows:
an out-of-distribution sample detection method using mixed unlabeled data, comprising the steps of:
1) A library of objects is created as a training data set; class labels are assigned to a small number of objects in the library to construct labeled data $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x$ denotes a sample and $y$ denotes the label of the sample;
the rest form unlabeled data $U = \{x_1, \dots, x_m\}$, where $n$ denotes the number of labeled objects, $m$ denotes the number of unlabeled objects, and $K$ is the number of classes;
wherein $U_{in}$ denotes the in-distribution samples in the unlabeled data U, with $m_1$ the number of in-distribution samples in U, and $U_{out}$ denotes the out-of-distribution samples in the unlabeled data U, with $m_2$ the number of out-of-distribution samples in U;
2) Cross-entropy loss is used for training on the labeled data L, and consistency regularization training is used on the unlabeled data U, finally yielding a base model;
3) An adaptive temperature technique is used in the consistency regularization training of step 2) to distinguish in-distribution from out-of-distribution samples on the unlabeled data U;
4) For each training round $t$, two confidence thresholds $\tau^t_{in}$ and $\tau^t_{out}$ are obtained, corresponding to the in-distribution and out-of-distribution samples to be screened from the unlabeled data U;
5) The training data set is expanded with two data augmentation methods, RandAugment and mixup, so that the model generalizes better;
6) The in-distribution samples screened in step 4) are trained with the entropy-minimization principle, and the out-of-distribution samples screened in step 4) are trained with the entropy-maximization principle;
7) In the testing stage, whether the sample is an out-of-distribution sample is determined according to the output confidence of the sample.
Further, in step 2), a base model is obtained by training with the labeled data L and the unlabeled data U, the training method comprising: computing the cross-entropy loss on the labeled data $L$, denoted $\mathcal{L}_{CE}$; training with a consistency regularization loss on the unlabeled data $U$, denoted $\mathcal{L}_U$; then computing the total loss $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_U$ and backpropagating to update the parameter $\theta$ of the base model. The base model adopts a Wide ResNet-28-2 network with about 1.4M parameters.
Further, the base model is optimized with an SGD optimizer during training.
Further, in step 3), an adaptive temperature technique is used to distinguish in-distribution from out-of-distribution samples on the unlabeled data U, specifically: at the t-th training round, minimizing the negative log-likelihood of the model on an in-distribution validation set V can be expressed as
$$T_t = \operatorname*{arg\,min}_{T} \sum_{(x,y) \in V} -\log q_\theta(y \mid x; T),$$
where $T_t$ denotes the target temperature to be computed at the t-th training round; $\operatorname{arg\,min}_T$ denotes finding the $T$ that minimizes the right-hand side; $V$ is the validation set; and $q_\theta(y \mid x; T)$ denotes the posterior probability, a value between 0 and 1, that the model with parameter $\theta$ and temperature $T$ assigns to the true label $y$ of sample $x$.
Further, the two confidence thresholds $\tau^t_{in}$ and $\tau^t_{out}$ in step 4) are used to screen in-distribution and out-of-distribution samples on the unlabeled data U at the t-th training round, specifically:
(a) A two-component Gaussian mixture model is given, with components denoted $g_1$ and $g_2$; the EM algorithm is used to fit it to the output confidence distribution $\{C_\theta(x, T_t) \mid x \in U\}$ over the unlabeled data U, where $C_\theta(x, T_t) = \max_i q_\theta^i(x; T_t)$ is the output confidence of sample $x$ after it is input to the model, and $\hat{y}_x = \arg\max_i q_\theta^i(x; T_t)$ is the model's predicted class for sample $x$;
(b) The unlabeled data U is separated according to the posterior probability that each sample belongs to $g_1$ or $g_2$:
$$U_1 = \{x \in U \mid P(g_1 \mid C_\theta(x, T_t)) \ge P(g_2 \mid C_\theta(x, T_t))\}, \qquad U_2 = U \setminus U_1,$$
where $P$ is the probability symbol; for example, $P(a \mid b)$ denotes the posterior probability of $a$ given $b$, a value between 0 and 1;
(c) The two confidence thresholds are the averages of the output confidences of the samples in the two sets from step (b); specifically, taking $g_1$ as the component with the higher mean confidence,
$$\tau^t_{in} = \frac{1}{|U_1|} \sum_{x \in U_1} C_\theta(x, T_t), \qquad \tau^t_{out} = \frac{1}{|U_2|} \sum_{x \in U_2} C_\theta(x, T_t);$$
(d) Finally, the in-distribution sample set $\hat{U}_{in}$ and the out-of-distribution sample set $\hat{U}_{out}$ on the unlabeled data U are screened according to the two confidence thresholds, expressed as:
$$\hat{U}_{in} = \{x \in U \mid C_\theta(x, T_t) \ge \tau^t_{in}\}, \qquad \hat{U}_{out} = \{x \in U \mid C_\theta(x, T_t) \le \tau^t_{out}\}.$$
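A minimal sketch of this screening step, assuming scikit-learn's GaussianMixture as the EM implementation and taking $g_1$ to be the component with the higher mean confidence:

import numpy as np
from sklearn.mixture import GaussianMixture

def screen_unlabeled(confidences):
    # confidences: 1-D array of C_theta(x, T_t) for every x in U.
    c = np.asarray(confidences).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2).fit(c)  # EM fit of g_1, g_2
    post = gmm.predict_proba(c)                   # posteriors P(g_k | confidence)
    hi = int(np.argmax(gmm.means_.ravel()))       # index of higher-mean component g_1
    u1 = post[:, hi] >= post[:, 1 - hi]           # samples assigned to g_1
    tau_in = float(c[u1].mean())                  # mean confidence over U_1
    tau_out = float(c[~u1].mean())                # mean confidence over U_2
    in_mask = c.ravel() >= tau_in                 # screened in-distribution set
    out_mask = c.ravel() <= tau_out               # screened out-of-distribution set
    return in_mask, out_mask, tau_in, tau_out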
Further, in step 5), the two data augmentation methods RandAugment and mixup are used to improve the generalization performance of the model; RandAugment applies strong random perturbations to a given sample, while mixup enhances generalization through data augmentation that linearly combines sample pairs and label pairs. When the mixup data augmentation method is adopted, the augmented version of a sample $x$ is denoted $\tilde{x}$.
Further, training on the screened in-distribution samples in step 6) uses the entropy-minimization principle, with the loss function expressed as
$$\mathcal{L}_{in} = \frac{1}{|\hat{U}_{in}|} \sum_{x \in \hat{U}_{in}} \mathrm{CE}\big(\hat{\mathbf{y}}, q_\theta(\tilde{x})\big),$$
where $\hat{\mathbf{y}}$ denotes a K-dimensional one-hot pseudo label that is 1 at the index given by the pseudo label and 0 elsewhere;
training on the screened out-of-distribution samples uses the entropy-maximization principle, with the loss function expressed as
$$\mathcal{L}_{out} = -\frac{1}{|\hat{U}_{out}|} \sum_{x \in \hat{U}_{out}} H\big(q_\theta(\tilde{x})\big),$$
where $H(\cdot)$ denotes the entropy of the given distribution $q_\theta(\tilde{x})$.
After training with these two objectives, the model outputs higher confidence on in-distribution samples and lower confidence on out-of-distribution samples, making the two easier to distinguish.
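A minimal sketch of the two objectives, assuming the entropy-minimization term is realized as cross-entropy against the model's own one-hot pseudo labels and that both terms are averaged over their respective screened batches:

import torch
import torch.nn.functional as F

def entropy_losses(model, x_in_aug, x_out_aug, T=1.0):
    # Entropy minimization on screened in-distribution samples: cross-entropy
    # against the model's own (detached) one-hot pseudo labels.
    logits_in = model(x_in_aug)
    pseudo = logits_in.detach().argmax(dim=1)
    loss_in = F.cross_entropy(logits_in / T, pseudo)

    # Entropy maximization on screened out-of-distribution samples: push the
    # predictive distribution toward uniform by minimizing negative entropy.
    p_out = F.softmax(model(x_out_aug) / T, dim=1)
    entropy = -(p_out * torch.log(p_out + 1e-8)).sum(dim=1)
    loss_out = -entropy.mean()
    return loss_in, loss_out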
Further, in step 7), at the testing stage, a test sample $x$ is input into the model to obtain its output confidence $C_\theta(x, T)$, which is compared with a predetermined confidence threshold $\tau$; if $C_\theta(x, T) \ge \tau$, the sample is judged to be in-distribution and a specific predicted class is given; if $C_\theta(x, T) < \tau$, the sample is judged to be out-of-distribution.
The invention has the following beneficial effects:
the method can effectively utilize a small amount of marked samples and a large amount of mixed unmarked samples for training, and the training mode has practical significance. During testing, the method can detect not only the seen out-of-distribution categories, but also the unseen out-of-distribution categories to a certain extent, and has higher ductility.
Description of the drawings:
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of model training in the present invention.
FIG. 3 is a model test flow diagram of the present invention.
Detailed description of embodiments:
The invention will be further described below with reference to the accompanying drawings.
As shown in FIG. 1, the method for detecting out-of-distribution samples using mixed unlabeled data comprises the following steps:
step 1), establishing an object library as a training data set, giving class labels to a small number of objects in the object library to form labeled data L, and giving unlabeled data U as the rest, wherein the number of classes is K; wherein the content of the first and second substances,
by U in Representing the intra-distribution samples in unlabeled data U,
by U out Representing the out-of-distribution samples in unlabeled data U.
Take the cat-and-dog classification problem as an example: cat is the first class and dog the second. If the content of the i-th object is a cat, then $y_i = 1$, i.e. the object belongs to the first class; if it is a dog, then $y_i = 0$ and the object belongs to the second class. Assume that initially a total of n objects are labeled and the remaining m objects are unlabeled. The labeled samples are all in-distribution, while the unlabeled data may mix in-distribution and out-of-distribution samples. Again with the cat-and-dog example, the labeled samples are all cats or dogs, while the unlabeled samples may contain other content, such as tigers, trees, cups, and the like.
The first M training rounds constitute the first stage, where M is a hyperparameter. The goal of the first stage is to train a base model:
step 2), calculating cross entropy loss for samples on the marked data
Figure BDA0003946734630000061
Where CE represents the cross-entropy loss, y represents the K-dimensional one-hot label vector constructed from y,
q θ (x) Representing the output probability distribution of the sample x after passing through a model softmax layer, wherein the parameter of the model is theta; computing consistency regularization penalties for samples on unlabeled data U
Figure BDA0003946734630000062
Wherein
Figure BDA0003946734630000063
And
Figure BDA0003946734630000064
representing two different data enhancement methods, when embodied
Figure BDA0003946734630000065
The RandAugment method is employed, and
Figure BDA0003946734630000066
and adopting a standard data enhancement method such as cutting and turning.
Figure BDA0003946734630000067
Represents the output probability in class i after temperature scaling (temperature scaling), where z i Is the output logit value of class i, and T is a parameter, representing the temperature value.
Figure BDA0003946734630000071
Step 3), using the adaptive temperature technique, an appropriate temperature value $T_t$ for the current training round $t$ is computed on the in-distribution validation set V and substituted into the consistency regularization loss $\mathcal{L}_U$ to compute the specific loss value.
Step 4), the cross-entropy loss and the consistency regularization loss are summed to obtain the total loss $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_U$, which is backpropagated to update the parameters $\theta$.
The training rounds after the M-th constitute the second stage, whose goal is stronger out-of-distribution detection performance. The specific steps are as follows:
step 5), the in-distribution and out-distribution samples in the unlabeled data were screened as shown in FIG. 2. Specifically, an unlabeled sample is input into a model to obtain a sample output confidence set
{C θ (x,T_t)|x∈U}.
Secondly, fitting a binary Gaussian mixture model on the sample output confidence set by adopting an EM algorithm,
the result of the fitting is recorded as g 1 And g 2 . Next, the passing sample belongs to g 1 And g 2 To separate the unlabeled data U,
and calculating two confidence degree threshold values of the current t-th training round number
Figure BDA0003946734630000074
And
Figure BDA0003946734630000075
finally, the screening of the samples in and out of distribution is completed according to the comparison of the sample output confidence and the two thresholds.
For example, if a sample's output confidence is greater than $\tau^t_{in}$, it is screened as an in-distribution sample; if it is less than $\tau^t_{out}$, it is screened as an out-of-distribution sample.
Step 6), for a sample $x$ in the training set, an augmented sample, denoted $\tilde{x}$, is obtained with the RandAugment data augmentation method.
Step 7), another sample $x'$ is selected from the training set, and an augmented sample is obtained with the mixup data augmentation method:
$$\tilde{x} = \lambda' x + (1 - \lambda') x',$$
where $\lambda' = \max(\lambda, 1 - \lambda)$, $\lambda$ is sampled from $\mathrm{Beta}(\alpha, \alpha)$, and $\alpha$ is a hyperparameter of the Beta distribution; in the specific implementation $\alpha = 0.2$.
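A small sketch of this mixup step under the definitions above; pairing each sample x with a partner x' drawn by shuffling the batch is an assumed implementation detail:

import torch

def mixup_batch(x, alpha=0.2):
    # Mix each sample in the batch with a randomly chosen partner; returns the
    # mixed batch, the permutation (so labels can be mixed identically), and lambda'.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)            # lambda' = max(lambda, 1 - lambda)
    perm = torch.randperm(x.size(0))     # choose partners x' by shuffling the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, perm, lam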
Step 8), for the screened in-distribution samples, the entropy-minimization loss is computed, which can be expressed as
$$\mathcal{L}_{in} = \frac{1}{|\hat{U}_{in}|} \sum_{x \in \hat{U}_{in}} \mathrm{CE}\big(\hat{\mathbf{y}}, q_\theta(\tilde{x})\big),$$
where $\hat{\mathbf{y}}$ denotes the K-dimensional one-hot pseudo label.
Step 9), for the screened out-of-distribution samples, the entropy-maximization loss is computed, which can be expressed as
$$\mathcal{L}_{out} = -\frac{1}{|\hat{U}_{out}|} \sum_{x \in \hat{U}_{out}} H\big(q_\theta(\tilde{x})\big),$$
where $H(\cdot)$ denotes the entropy of the given distribution $q_\theta(\tilde{x})$.
Step 10), the three losses are summed to obtain the total loss $\mathcal{L}$, which is backpropagated to update the model parameters $\theta$.
As shown in FIG. 3, at the testing stage a test sample $x$ is input into the model to obtain its output confidence $C_\theta(x, T)$, which is compared against a predefined confidence threshold $\tau$. If $C_\theta(x, T) \ge \tau$, the sample is judged to be in-distribution and a specific predicted class is given; if $C_\theta(x, T) < \tau$, the sample is judged to be out-of-distribution.
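A sketch of this decision rule, assuming x is a single test sample carrying a batch dimension and that the threshold τ is supplied by the application:

import torch
import torch.nn.functional as F

def detect(model, x, T, tau):
    # Returns (is_in_distribution, predicted_class_or_None) for one test sample.
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x) / T, dim=1)  # temperature-scaled softmax
        conf, pred = probs.max(dim=1)           # C_theta(x, T) and predicted class
    if conf.item() >= tau:
        return True, pred.item()                # in-distribution: give the prediction
    return False, None                          # judged out-of-distribution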
The foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make modifications without departing from the principle of the invention, and such modifications should also fall within the protection scope of the invention.

Claims (8)

1. An out-of-distribution sample detection method using mixed unlabeled data, characterized in that the method comprises the following steps:
1) Establishing an object library as a training data set, assigning class labels to a small number of objects in the library to form labeled data $L = \{(x_1, y_1), \dots, (x_n, y_n)\}$, the rest forming unlabeled data $U = \{x_1, \dots, x_m\}$, where $n$ denotes the number of labeled objects, $m$ the number of unlabeled objects, and $K$ the number of classes;
wherein $U_{in}$ denotes the in-distribution samples in the unlabeled data U, with $m_1$ the number of in-distribution samples in U, and $U_{out}$ denotes the out-of-distribution samples in the unlabeled data U, with $m_2$ the number of out-of-distribution samples in U;
2) Cross-entropy loss is used for training on the labeled data L, and consistency regularization training is used on the unlabeled data U, finally yielding a base model;
3) An adaptive temperature technique is used in the consistency regularization training of step 2) to distinguish in-distribution from out-of-distribution samples on the unlabeled data U;
4) For each training round $t$ over the labeled data L and unlabeled data U, two confidence thresholds $\tau^t_{in}$ and $\tau^t_{out}$ are obtained, corresponding to the in-distribution and out-of-distribution samples to be screened from the unlabeled data U;
5) Expanding the training data set with the two data augmentation methods RandAugment and mixup;
6) Training the in-distribution samples screened in step 4) with the entropy-minimization principle, and the out-of-distribution samples screened in step 4) with the entropy-maximization principle;
7) In the testing stage, whether the sample is an out-of-distribution sample is determined according to the output confidence of the sample.
2. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: in step 2), a base model is obtained by training with the labeled data L and the unlabeled data U, the training method comprising: computing the cross-entropy loss on the labeled data $L$, denoted $\mathcal{L}_{CE}$; training with a consistency regularization loss on the unlabeled data $U$, denoted $\mathcal{L}_U$; then computing the total loss $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_U$ and backpropagating to update the parameter $\theta$ of the base model, wherein the base model adopts a Wide ResNet-28-2 network with about 1.4M parameters.
3. The out-of-distribution sample detection method using mixed unlabeled data of claim 2, wherein: an SGD optimizer is used for optimization when training the base model.
4. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: in step 3), an adaptive temperature technique is used to distinguish in-distribution from out-of-distribution samples on the unlabeled data U, specifically: at the t-th training round, minimizing the negative log-likelihood of the model on an in-distribution validation set V can be expressed as
$$T_t = \operatorname*{arg\,min}_{T} \sum_{(x,y) \in V} -\log q_\theta(y \mid x; T),$$
where $T_t$ denotes the target temperature to be computed at the t-th training round; $\operatorname{arg\,min}_T$ denotes finding the $T$ that minimizes the right-hand side; $V$ is the validation set; and $q_\theta(y \mid x; T)$ denotes the posterior probability, a value between 0 and 1, that the model with parameter $\theta$ and temperature $T$ assigns to the true label $y$ of sample $x$.
5. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: the two confidence thresholds $\tau^t_{in}$ and $\tau^t_{out}$ in step 4) are used to screen in-distribution and out-of-distribution samples on the unlabeled data U at the t-th training round, specifically:
(a) A two-component Gaussian mixture model is given, with components denoted $g_1$ and $g_2$; the EM algorithm is used to fit it to the output confidence distribution $\{C_\theta(x, T_t) \mid x \in U\}$ over the unlabeled data U, where $C_\theta(x, T_t) = \max_i q_\theta^i(x; T_t)$ is the output confidence of sample $x$ after it is input to the model, and $\hat{y}_x = \arg\max_i q_\theta^i(x; T_t)$ is the model's predicted class for sample $x$;
(b) The unlabeled data U is separated according to the posterior probability that each sample belongs to $g_1$ or $g_2$:
$$U_1 = \{x \in U \mid P(g_1 \mid C_\theta(x, T_t)) \ge P(g_2 \mid C_\theta(x, T_t))\}, \qquad U_2 = U \setminus U_1;$$
(c) The two confidence thresholds are the averages of the output confidences of the samples in the two sets from step (b); specifically, taking $g_1$ as the component with the higher mean confidence,
$$\tau^t_{in} = \frac{1}{|U_1|} \sum_{x \in U_1} C_\theta(x, T_t), \qquad \tau^t_{out} = \frac{1}{|U_2|} \sum_{x \in U_2} C_\theta(x, T_t);$$
(d) Finally, the in-distribution sample set $\hat{U}_{in}$ and the out-of-distribution sample set $\hat{U}_{out}$ on the unlabeled data U are screened according to the two confidence thresholds, expressed as:
$$\hat{U}_{in} = \{x \in U \mid C_\theta(x, T_t) \ge \tau^t_{in}\}, \qquad \hat{U}_{out} = \{x \in U \mid C_\theta(x, T_t) \le \tau^t_{out}\}.$$
6. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: in step 5), the RandAugment and mixup methods are used to improve the generalization performance of the model; RandAugment applies strong random perturbations to a given sample, while mixup enhances generalization through data augmentation that linearly combines sample pairs and label pairs; when the mixup data augmentation method is adopted, the augmented version of a sample $x$ is denoted $\tilde{x}$.
7. The out-of-distribution sample detection method using mixed unlabeled data of claim 6, wherein: training on the screened in-distribution samples in step 6) uses the entropy-minimization principle, with the loss function expressed as
$$\mathcal{L}_{in} = \frac{1}{|\hat{U}_{in}|} \sum_{x \in \hat{U}_{in}} \mathrm{CE}\big(\hat{\mathbf{y}}, q_\theta(\tilde{x})\big),$$
where $\hat{\mathbf{y}}$ denotes a K-dimensional one-hot pseudo label that is 1 at the index given by the pseudo label and 0 elsewhere;
training on the screened out-of-distribution samples uses the entropy-maximization principle, with the loss function expressed as
$$\mathcal{L}_{out} = -\frac{1}{|\hat{U}_{out}|} \sum_{x \in \hat{U}_{out}} H\big(q_\theta(\tilde{x})\big),$$
where $H(\cdot)$ denotes the entropy of the given distribution $q_\theta(\tilde{x})$.
8. The out-of-distribution sample detection method using mixed unlabeled data of claim 1, wherein: in step 7), at the testing stage, a test sample $x$ is input into the model to obtain its output confidence $C_\theta(x, T)$, which is compared with a predetermined confidence threshold $\tau$; if $C_\theta(x, T) \ge \tau$, the sample is judged to be in-distribution and a specific predicted class is given; if $C_\theta(x, T) < \tau$, the sample is judged to be out-of-distribution.
CN202211434819.XA 2022-11-16 2022-11-16 Out-of-distribution sample detection method using mixed unlabeled data Pending CN115730656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211434819.XA CN115730656A (en) 2022-11-16 2022-11-16 Out-of-distribution sample detection method using mixed unlabeled data


Publications (1)

Publication Number Publication Date
CN115730656A true CN115730656A (en) 2023-03-03

Family

ID=85296025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211434819.XA Pending CN115730656A (en) 2022-11-16 2022-11-16 Out-of-distribution sample detection method using mixed unmarked data

Country Status (1)

Country Link
CN (1) CN115730656A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116776248A * 2023-06-21 2023-09-19 Harbin Institute of Technology Virtual logarithm-based out-of-distribution detection method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination