CN115730656A - Out-of-distribution sample detection method using mixed unmarked data - Google Patents
- Publication number
- CN115730656A CN115730656A CN202211434819.XA CN202211434819A CN115730656A CN 115730656 A CN115730656 A CN 115730656A CN 202211434819 A CN202211434819 A CN 202211434819A CN 115730656 A CN115730656 A CN 115730656A
- Authority
- CN
- China
- Prior art keywords
- distribution
- sample
- data
- samples
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses an out-of-distribution sample detection method using mixed unlabeled data, comprising the following steps: first, the user prepares an object library and provides category labels for a small number of objects in the library by manual annotation. The objects with category labels are called labeled training data and cover K categories in total; the remaining objects without category labels are called unlabeled training data. Since the unlabeled training data may mix both in-distribution and out-of-distribution samples, it is also referred to as mixed unlabeled data. Using techniques such as adaptive temperature and dynamic confidence thresholds, the invention effectively distinguishes the in-distribution and out-of-distribution samples in the unlabeled training data, so that the trained model detects out-of-distribution samples more accurately while maintaining classification accuracy on in-distribution samples.
Description
The technical field is as follows:
The invention relates to an out-of-distribution sample detection method using mixed unlabeled data.
Background art:
In conventional machine learning there is an important i.i.d. assumption: the samples of the training set and the test set are drawn independently from the same distribution. However, in real-world application scenarios, out-of-distribution samples may cause the model to give completely incorrect predictions, which is unacceptable for safety-critical applications such as autonomous driving or medical diagnosis. Out-of-Distribution Detection therefore requires that the model correctly detect such out-of-distribution samples in the inference phase, before any further processing of those samples.
Conventional out-of-distribution detection methods generally train a classifier on a large number of labeled samples and then use a model statistic such as the confidence to decide whether a sample is out-of-distribution. Such methods, however, require a large amount of labeling and thus incur a high labeling cost. Some improved approaches therefore attempt to use unlabeled samples to enhance out-of-distribution detection performance. One class of methods performs self-supervised learning on a purely in-distribution unlabeled sample set, while another class attempts to utilize a purely out-of-distribution unlabeled sample set. However, all of these methods assume the unlabeled samples are purely in-distribution or purely out-of-distribution, which does not match real-world scenarios, since an unlabeled sample set is typically a mixture of in-distribution and out-of-distribution samples. How to train an out-of-distribution sample detector from a small amount of labeled data and a large amount of mixed unlabeled data is a problem urgently awaiting a solution.
The invention content is as follows:
the present invention is directed to solving the above-mentioned problems of the prior art and provides an out-of-distribution sample detection method using mixed unlabeled data.
The technical scheme adopted by the invention is as follows:
an out-of-distribution sample detection method using mixed unlabeled data, comprising the steps of:
1) A library of objects is created as a training data set; class labels are assigned to a small number of objects in the object library, forming the labeled data L = {(x_1, y_1), …, (x_n, y_n)}, where x denotes a sample and y denotes the label of the sample;
the rest form the unlabeled data U = {x_1, …, x_m}; n denotes the number of labeled objects, m denotes the number of unlabeled objects, and the number of classes is K;
U_in denotes the in-distribution samples in the unlabeled data U, and m_1 is the number of in-distribution samples in U;
U_out denotes the out-of-distribution samples in the unlabeled data U, and m_2 is the number of out-of-distribution samples in U;
2) Cross-entropy loss training is used on the labeled data L and consistency regularization training is used on the unlabeled data U, finally yielding a base model;
3) An adaptive temperature technique is used in the consistency regularization training of step 2) to distinguish in-distribution and out-of-distribution samples in the unlabeled data U;
4) For each training round t, two confidence thresholds τ_in^t and τ_out^t are obtained, used to screen the in-distribution and out-of-distribution samples in the unlabeled data U;
5) The training data set is expanded with the RandAugment and mixup data enhancement methods, giving the model better generalization performance;
6) The minimum-entropy principle is used to train on the in-distribution samples screened in step 4), and the maximum-entropy principle is used to train on the out-of-distribution samples screened in step 4);
7) In the testing stage, whether a sample is out-of-distribution is determined from its output confidence.
Further, in the step 2), a basic model is obtained by training with labeled data L and unlabeled data U, and the training method includes:
calculating the cross-entropy loss on the labeled data L, denoted L_ce, and training with a consistency regularization loss on the unlabeled data U, denoted L_u.
The total loss L = L_ce + L_u is then computed and back-propagated to update the parameters θ of the base model; the base model adopts a Wide ResNet-28-2 network with about 1.4M parameters.
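The first-stage loss computation described above can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation: the toy logits, the mean-squared form of the consistency penalty, and all function names are assumptions, and the SGD gradient update on θ is left to an autodiff framework.

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled softmax over the last axis
    s = z / T
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(one_hot, probs, eps=1e-12):
    # L_ce: mean cross-entropy against one-hot labels
    return -np.sum(one_hot * np.log(probs + eps), axis=-1).mean()

def consistency_loss(probs_weak, probs_strong):
    # L_u (assumed form): mean squared difference between the model's
    # predictions under two different data enhancements
    return np.mean((probs_weak - probs_strong) ** 2)

# toy logits standing in for Wide ResNet-28-2 outputs (hypothetical numbers)
rng = np.random.default_rng(0)
logits_labeled = rng.normal(size=(4, 3))
labels = np.eye(3)[[0, 1, 2, 0]]                              # K = 3, one-hot
logits_weak = rng.normal(size=(8, 3))                         # weak augmentation
logits_strong = logits_weak + 0.1 * rng.normal(size=(8, 3))   # strong augmentation

l_ce = cross_entropy(labels, softmax(logits_labeled))
l_u = consistency_loss(softmax(logits_weak), softmax(logits_strong))
total = l_ce + l_u   # L = L_ce + L_u, back-propagated to update theta
```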
Further, the basic model is optimized by using an SGD optimizer during training.
Further, in the step 3), an adaptive temperature technique is used to distinguish between an in-distribution sample and an out-distribution sample on the unlabeled data U, specifically:
at the t-th training round, the temperature is obtained by minimizing the negative log-likelihood of the model on an in-distribution validation set V:
T_t = argmin_T Σ_{(x,y)∈V} −log q_θ(y | x; T)
in the formula: T_t represents the target temperature to be computed at the t-th training round;
argmin_T denotes finding the T that minimizes the right-hand side; V is the validation set;
q_θ(y | x; T) represents the posterior probability value, between 0 and 1, that the model with parameters θ and temperature T assigns to the true label y of sample x.
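The adaptive-temperature step can be illustrated with a simple grid search for T_t on the validation set. This is a hedged numpy sketch: the grid range, the toy validation data, and the function names are assumptions, since the patent does not specify how the argmin is solved.

```python
import numpy as np

def softmax(z, T):
    s = z / T
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    # negative log-likelihood of the true labels under temperature T
    p = softmax(logits, T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    # T_t = argmin_T sum over validation set V of -log q_theta(y | x; T),
    # solved here by brute-force search over a candidate grid
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# synthetic validation set V standing in for real model outputs
rng = np.random.default_rng(1)
val_labels = rng.integers(0, 3, size=64)
val_logits = 4.0 * np.eye(3)[val_labels] + rng.normal(size=(64, 3))
T_t = fit_temperature(val_logits, val_labels)
```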
Further, the two confidence thresholds τ_in^t and τ_out^t in the step 4) are used to screen the in-distribution and out-of-distribution samples in the unlabeled data U at the t-th training round, specifically:
(a) A two-component Gaussian mixture model with components g_1 and g_2 is fitted by the EM algorithm to the distribution of output confidences on the unlabeled data U, {C_θ(x, T_t) | x ∈ U},
where C_θ(x, T_t) is the output confidence of sample x after it is input into the model, and the class attaining this maximum confidence is the model's predicted class for x;
(b) The unlabeled data U is separated into two sets according to the posterior probability that each sample belongs to g_1 or g_2;
in the formula, p is the probability symbol; for example, p(a | b) denotes the posterior probability of a given b, a value between 0 and 1;
(c) The two confidence thresholds are the averages of the output confidences of the samples in the two sets of step (b): τ_in^t over the higher-confidence set and τ_out^t over the lower-confidence set;
(d) Finally, the in-distribution sample set U_in^t = {x ∈ U | C_θ(x, T_t) ≥ τ_in^t} and the out-of-distribution sample set U_out^t = {x ∈ U | C_θ(x, T_t) ≤ τ_out^t} are screened from the unlabeled data U according to the two confidence thresholds.
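The threshold computation in steps (a)–(d) can be sketched with scikit-learn's `GaussianMixture` (an assumed tooling choice; the patent only specifies a two-component Gaussian mixture fitted by EM). The beta-distributed confidences below are synthetic stand-ins for {C_θ(x, T_t) | x ∈ U}.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic confidences: in-distribution samples tend to score high, OOD low
rng = np.random.default_rng(2)
conf = np.concatenate([rng.beta(8, 2, 300), rng.beta(2, 8, 300)])

# (a) fit a two-component GMM (g_1, g_2) to the 1-D confidence distribution by EM
gmm = GaussianMixture(n_components=2, random_state=0).fit(conf.reshape(-1, 1))

# (b) separate U by the posterior probability of each component
post = gmm.predict_proba(conf.reshape(-1, 1))
hi = int(np.argmax(gmm.means_.ravel()))   # component with larger mean = in-distribution side
assign = post.argmax(axis=1)

# (c) thresholds = mean confidence of each separated group
tau_in = conf[assign == hi].mean()
tau_out = conf[assign != hi].mean()

# (d) screen the in- and out-of-distribution sets
u_in = conf >= tau_in     # screened as in-distribution
u_out = conf <= tau_out   # screened as out-of-distribution
```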
Further, in the step 5), the RandAugment and mixup data enhancement methods are used to improve the generalization performance of the model;
RandAugment applies random high-intensity perturbations to a given sample to improve the generalization performance of the model,
while mixup enhances generalization by linearly combining sample pairs and label pairs; when the mixup data enhancement method is adopted, the enhanced version of a sample x is denoted x̃.
Further, the minimum-entropy principle is used to train on the screened in-distribution samples in the step 6), with the loss function L_in expressed as the cross entropy between the one-hot pseudo label and the model's output distribution on the enhanced sample,
where the pseudo label is a K-dimensional one-hot vector that is 1 at the index given by the label and 0 elsewhere;
the maximum-entropy principle is used to train on the screened out-of-distribution samples, with the loss function L_out expressed as the negative entropy of the model's output distribution on the enhanced sample.
After training based on the above two principles, the model outputs higher confidence on the samples inside the distribution and lower confidence on the samples outside the distribution, thus making it easier for the model to distinguish between the two.
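The two entropy principles can be sketched as follows; a minimal numpy illustration with assumed function names, showing that confident predictions minimize L_in while uniform predictions minimize L_out.

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy of a probability distribution over the last axis
    return -np.sum(p * np.log(p + eps), axis=-1)

def loss_in(probs, eps=1e-12):
    # minimum-entropy principle (L_in): cross-entropy against the
    # one-hot pseudo label taken from the model's own prediction
    pseudo = np.eye(probs.shape[-1])[probs.argmax(axis=-1)]
    return -np.mean(np.sum(pseudo * np.log(probs + eps), axis=-1))

def loss_out(probs):
    # maximum-entropy principle (L_out): negate entropy so that
    # minimizing the loss pushes the entropy up
    return -np.mean(entropy(probs))

sharp = np.array([[0.9, 0.05, 0.05]])   # confident prediction, low entropy
flat = np.array([[1/3, 1/3, 1/3]])      # uniform prediction, maximum entropy
```

Under these losses the model is pushed toward high confidence on screened in-distribution samples and low confidence on screened out-of-distribution samples, which is exactly the separation the paragraph above describes.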
Further, in the step 7), in the testing stage, a test sample x is input into the model to obtain its output confidence C_θ(x, T), which is compared with a predetermined confidence threshold τ; if C_θ(x, T) ≥ τ, the sample is judged in-distribution and a specific predicted class is given; if C_θ(x, T) < τ, the sample is judged out-of-distribution.
The invention has the following beneficial effects:
the method can effectively utilize a small amount of marked samples and a large amount of mixed unmarked samples for training, and the training mode has practical significance. During testing, the method can detect not only the seen out-of-distribution categories, but also the unseen out-of-distribution categories to a certain extent, and has higher ductility.
Description of the drawings:
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of model training in the present invention.
FIG. 3 is a model test flow diagram of the present invention.
The specific implementation mode is as follows:
the invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the method for detecting the sample outside the distribution by using the mixed unlabeled data comprises the following steps:
Step 1), an object library is established as the training data set; class labels are given to a small number of objects in the library to form the labeled data L, and the rest form the unlabeled data U, with K classes in total;
U_in represents the in-distribution samples in the unlabeled data U,
and U_out represents the out-of-distribution samples in the unlabeled data U.
Take the cat-versus-dog classification problem as an example, with cat as the first category and dog as the second. If the content of the i-th object is a cat, then y_i = 1, i.e. the object belongs to the first class; if the content of the object is a dog, then y_i = 0 and the object belongs to the second class. Assume that initially n objects in total are labeled and the remaining m objects are unlabeled. The labeled samples are all in-distribution, while the unlabeled data may mix in-distribution and out-of-distribution samples: in the cat-versus-dog example, the labeled samples are all cats or dogs, while the unlabeled samples may contain other content, such as tigers, trees, or cups.
The first M training rounds constitute the first stage, where M is a hyperparameter. The goal of the first stage is to train a base model:
Step 2), the cross-entropy loss is computed for samples on the labeled data L: L_ce = (1/n) Σ_{(x,y)∈L} CE(y, q_θ(x)),
where CE denotes the cross-entropy loss and y here stands for the K-dimensional one-hot label vector constructed from the label y;
q_θ(x) denotes the output probability distribution of sample x after the model's softmax layer, the model parameters being θ. The consistency regularization loss L_u is computed for samples on the unlabeled data U, penalizing the difference between the model's output distributions for a sample under two different data enhancement methods A_w and A_s;
in the implementation, A_s adopts the RandAugment method and A_w adopts standard data enhancements such as cropping and flipping.
q_i = exp(z_i / T) / Σ_j exp(z_j / T) represents the output probability in class i after temperature scaling, where z_i is the output logit value of class i and T is a parameter representing the temperature value.
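The temperature-scaled softmax can be sketched directly from the formula above; a minimal numpy example (the logit values are arbitrary) showing that a higher temperature flattens the distribution and lowers the output confidence.

```python
import numpy as np

def temp_softmax(z, T):
    # q_i = exp(z_i / T) / sum_j exp(z_j / T), computed stably
    s = z / T
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

z = np.array([3.0, 1.0, 0.2])     # hypothetical logits for K = 3 classes
c1 = temp_softmax(z, 1.0).max()   # confidence at T = 1
c2 = temp_softmax(z, 2.0).max()   # higher temperature -> flatter distribution
```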
Step 3), using the adaptive temperature technique, the appropriate temperature value T_t for the current training round t is computed on an in-distribution validation set V.
Step 4), the cross-entropy loss and the consistency regularization loss are summed to obtain the total loss L = L_ce + L_u, which is back-propagated to update the parameters θ.
The training rounds after the M-th constitute the second stage, which gives the model stronger out-of-distribution sample detection performance. The specific steps are as follows:
Step 5), the in-distribution and out-of-distribution samples in the unlabeled data are screened, as shown in FIG. 2. Specifically, the unlabeled samples are first input into the model to obtain the set of sample output confidences
{C_θ(x, T_t) | x ∈ U}.
Next, a two-component Gaussian mixture model is fitted to this confidence set with the EM algorithm,
the fitted components being denoted g_1 and g_2. The unlabeled data U is then separated according to the posterior probability that each sample belongs to g_1 or g_2,
and finally the screening of in-distribution and out-of-distribution samples is completed by comparing each sample's output confidence with the two thresholds:
a sample whose output confidence is greater than τ_in^t is screened as an in-distribution sample; one whose confidence is less than τ_out^t is screened as an out-of-distribution sample.
Step 6), for a sample x in the training set, an enhanced sample is obtained with the RandAugment data enhancement method.
Step 7), another sample x′ in the training set is selected, and the mixup data enhancement method yields the enhanced sample x̃ = λ′x + (1 − λ′)x′,
where λ′ = max(λ, 1 − λ), λ is sampled from Beta(α, α), and α is a hyperparameter of the Beta distribution; α = 0.2 is taken in the implementation.
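The mixup step with λ′ = max(λ, 1 − λ) can be sketched as follows; a minimal numpy illustration in which the exact linear-combination form applied to both the sample pair and the label pair is an assumption based on standard mixup.

```python
import numpy as np

def mixup(x, y, x2, y2, alpha=0.2, rng=None):
    # lambda ~ Beta(alpha, alpha); lam' = max(lam, 1 - lam) keeps x dominant
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    # linearly combine the sample pair and the label pair
    return lam * x + (1 - lam) * x2, lam * y + (1 - lam) * y2

rng = np.random.default_rng(3)
x, x2 = np.ones(4), np.zeros(4)                      # toy sample pair
y, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # one-hot label pair
xm, ym = mixup(x, y, x2, y2, alpha=0.2, rng=rng)
```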
Step 8), for the screened in-distribution samples, the entropy-minimization loss L_in is computed.
Step 9), for the screened out-of-distribution samples, the entropy-maximization loss L_out is computed,
where L_out is the negative of the entropy H(q_θ(x̃)) of the model's output distribution, H denoting the entropy of a given distribution.
Step 10), the three losses are summed to obtain the total loss, which is back-propagated to update the model parameters θ.
As shown in FIG. 3, in the testing stage a test sample x is input into the model to obtain its output confidence C_θ(x, T), which is compared against a predefined confidence threshold τ. If C_θ(x, T) ≥ τ, the sample is judged in-distribution and a specific predicted class is given.
If C_θ(x, T) < τ, the sample is judged out-of-distribution.
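The test-stage decision rule reduces to a single threshold check; a minimal sketch in which the threshold value and function name are arbitrary.

```python
def detect(conf, tau=0.5):
    # conf = C_theta(x, T), the model's output confidence for test sample x;
    # conf >= tau -> in-distribution (a predicted class is then reported),
    # conf <  tau -> out-of-distribution
    return "in-distribution" if conf >= tau else "out-of-distribution"

label_hi = detect(0.93)   # high-confidence test sample
label_lo = detect(0.12)   # low-confidence test sample
```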
The foregoing is only a preferred embodiment of this invention and it should be noted that modifications can be made by those skilled in the art without departing from the principle of the invention and these modifications should also be considered as the protection scope of the invention.
Claims (8)
1. An out-of-distribution sample detection method using mixed unlabeled data, characterized by: the method comprises the following steps:
1) Establishing an object library as a training data set, giving class labels to a small number of objects in the object library to form the labeled data L = {(x_1, y_1), …, (x_n, y_n)}; the rest form the unlabeled data U = {x_1, …, x_m}; n denotes the number of labeled objects, m denotes the number of unlabeled objects, and the number of classes is K;
U_in denotes the in-distribution samples in the unlabeled data U, and m_1 is the number of in-distribution samples in U;
U_out denotes the out-of-distribution samples in the unlabeled data U, and m_2 is the number of out-of-distribution samples in U;
2) Cross-entropy loss training is used on the labeled data L and consistency regularization training is used on the unlabeled data U, finally yielding a base model;
3) An adaptive temperature technique is used in the consistency regularization training of step 2) to distinguish in-distribution and out-of-distribution samples in the unlabeled data U;
4) For each training round t on the labeled data L and unlabeled data U, two confidence thresholds τ_in^t and τ_out^t are obtained, used to screen the in-distribution and out-of-distribution samples in the unlabeled data U;
5) Expanding the training data set by using two data enhancement methods of RandAugment and mixup;
6) Training the samples in the distribution screened out in the step 4) by using a minimum entropy principle, and training the samples out of the distribution screened out in the step 4) by using a maximum entropy principle;
7) In the testing stage, whether the sample is an out-of-distribution sample is determined according to the output confidence of the sample.
2. The method of out-of-distribution sample detection using mixed unlabeled data of claim 1, wherein: in the step 2), a basic model is obtained by training labeled data L and unlabeled data U, and the training method comprises the following steps:
calculating the cross-entropy loss on the labeled data L, denoted L_ce, and training with a consistency regularization loss on the unlabeled data U, denoted L_u.
3. The method of out-of-distribution sample detection using mixed unlabeled data of claim 2, wherein: and optimizing by using an SGD optimizer during training of the basic model.
4. The method for out-of-distribution sample detection using mixed unlabeled data of claim 1, wherein: in the step 3), an adaptive temperature technique is used to distinguish in-distribution and out-of-distribution samples in the unlabeled data U, specifically:
at the t-th training round, the temperature is obtained by minimizing the negative log-likelihood of the model on an in-distribution validation set V:
T_t = argmin_T Σ_{(x,y)∈V} −log q_θ(y | x; T)
in the formula: T_t represents the target temperature to be computed at the t-th training round; argmin_T denotes finding the T that minimizes the right-hand side; V is the validation set.
5. The method for out-of-distribution sample detection using mixed unlabeled data of claim 1, wherein: the two confidence thresholds τ_in^t and τ_out^t in the step 4) are used to screen the in-distribution and out-of-distribution samples in the unlabeled data U at the t-th training round, specifically:
(a) A two-component Gaussian mixture model with components g_1 and g_2 is fitted by the EM algorithm to the distribution of output confidences on the unlabeled data U, {C_θ(x, T_t) | x ∈ U},
where C_θ(x, T_t) is the output confidence of sample x after it is input into the model, and the class attaining this maximum confidence is the model's predicted class for x;
(b) The unlabeled data U is separated into two sets according to the posterior probability that each sample belongs to g_1 or g_2;
(c) The two confidence thresholds are the averages of the output confidences of the samples in the two sets of step (b);
(d) Finally, the in-distribution sample set U_in^t = {x ∈ U | C_θ(x, T_t) ≥ τ_in^t} and the out-of-distribution sample set U_out^t = {x ∈ U | C_θ(x, T_t) ≤ τ_out^t} are screened from the unlabeled data U according to the two confidence thresholds.
6. the method of out-of-distribution sample detection using mixed unlabeled data of claim 1, wherein:
in the step 5), the RandAugment and mixup methods are used to improve the generalization performance of the model;
RandAugment applies random high-intensity perturbations to a given sample to improve the generalization performance of the model, while mixup linearly combines sample pairs and label pairs.
7. The method of out-of-distribution sample detection using mixed unlabeled data of claim 6, wherein:
the minimum-entropy principle is used to train on the screened in-distribution samples in the step 6), with the loss function L_in expressed as the cross entropy between the one-hot pseudo label and the model's output distribution on the enhanced sample,
where the pseudo label is a K-dimensional one-hot vector that is 1 at the index given by the label and 0 elsewhere;
the maximum-entropy principle is used to train on the screened out-of-distribution samples, with the loss function L_out expressed as the negative entropy of the model's output distribution on the enhanced sample.
8. The method of out-of-distribution sample detection using mixed unlabeled data of claim 1, wherein: in the step 7), in the testing stage, a test sample x is input into the model to obtain its output confidence C_θ(x, T), which is compared with a predetermined confidence threshold τ; if C_θ(x, T) ≥ τ, the sample is judged in-distribution and a specific predicted class is given; if C_θ(x, T) < τ, the sample is judged out-of-distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211434819.XA CN115730656A (en) | 2022-11-16 | 2022-11-16 | Out-of-distribution sample detection method using mixed unmarked data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115730656A true CN115730656A (en) | 2023-03-03 |
Family
ID=85296025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211434819.XA Pending CN115730656A (en) | 2022-11-16 | 2022-11-16 | Out-of-distribution sample detection method using mixed unmarked data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115730656A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116776248A (en) * | 2023-06-21 | 2023-09-19 | 哈尔滨工业大学 | Virtual logarithm-based out-of-distribution detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||