CN114549910A - Noise identification method based on clean data set and key feature detection - Google Patents

Noise identification method based on clean data set and key feature detection

Info

Publication number
CN114549910A
CN114549910A (application CN202210259878.1A)
Authority
CN
China
Prior art keywords
clean
noise
sample
data set
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210259878.1A
Other languages
Chinese (zh)
Inventor
袁春 (Yuan Chun)
王子啸 (Wang Zixiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202210259878.1A priority Critical patent/CN114549910A/en
Publication of CN114549910A publication Critical patent/CN114549910A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 — Classification techniques based on distances to training or reference patterns
    • G06F 18/24133 — Distances to prototypes
    • G06F 18/24137 — Distances to cluster centroids
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

A method of identifying noise in video data, comprising the steps of: S1, establishing a clean data set against which the unknown data of a noisy data set is compared, and reducing the dimensionality of the features of both the clean and noisy data sets using inter-frame information to obtain a reduced feature set; S2, computing, in the reduced feature space, the cosine similarity between an undetermined sample of the noisy data set and the clean-sample class center of the clean data set; S3, comparing the cosine similarity between the undetermined sample and the clean-sample class center, computing from it the probability that the undetermined sample is clean, and classifying samples whose probability is greater than a preset probability threshold as clean. The invention introduces a clean data set against which unknown data is compared and reduces the feature dimensionality to obtain a reduced feature set, so the similarity computation only needs to be carried out in the reduced feature space, which relieves the curse of dimensionality caused by excessively high dimensions.

Description

Noise identification method based on clean data set and key feature detection
Technical Field
The invention relates to image recognition, in particular to a noise recognition method based on a clean data set and key feature detection in a video classification task.
Background
Abbreviations and terms:
Learning with noise: this task involves training a high-accuracy deep neural network on a data set containing noisy labels. Training deep neural networks generally depends on a large number of manually labeled samples, but in practice acquiring many clean samples is time-consuming and labor-intensive, is impractical in certain scenarios (such as medical ones), and noisy samples are inevitably introduced into the data set. For example, in a crowd-sourcing scenario, a researcher or business typically entrusts several annotators with labeling a particular data set, but because annotators are inconsistent, among other reasons, the final labeled data always contains some noise. There is also a low-cost data acquisition method that searches keywords directly with a search engine, but the samples a search engine returns include a large number of noisy samples.
Noise detection method: a noise detection method cleans a data set containing noise using a specific index or method and obtains a cleaned data set, which is then used for subsequent model training.
Video classification: video classification assigns videos to categories according to their semantic content; it is one of the basic tasks of computer vision and the basis of many downstream tasks (such as video understanding). Unlike classification on image data sets, the object to be classified is not a single-frame image but a continuous video composed of multiple frames with temporal causal relationships, so semantic information linking the contents of preceding and succeeding frames is needed to understand the video.
Approaches to learning with noisy labels can be broadly divided into two categories. One trains a robust model directly in the presence of noisy labels, reducing the negative impact of overfitting noisy samples by designing a network structure robust to label noise or introducing a noise-robust loss function. The other first detects and removes potential noise samples from the training set and then trains the model on the filtered training set. In practical applications the latter is more useful in industry, because it not only learns a robust deep learning model but also yields a relatively clean data set.
Most previous methods target the image classification task with noise. Relative to images, the temporal relationship between different frames of a video also contains information useful for noise detection, so how to exploit this temporal relationship for noise detection is a problem to be solved.
Several prior works identify noise based on feature similarity [1][2][3]. [1] detects noise samples by computing the cosine similarity between the sample under test and a class prototype. [2] builds a neighborhood graph for each class with a K-nearest-neighbor method and takes the samples in the dominant subgraph as clean. [3] also uses K nearest neighbors, with a voting mechanism to judge whether a sample is clean. However, these methods are all based on image classification tasks, and the noisy-learning problem differs in the video scenario.
[1] Lee, K.-H.; He, X.; Zhang, L.; and Yang, L. 2018. CleanNet: Transfer learning for scalable image classifier training with label noise. In CVPR.
[2] Wu, P.; Zheng, S.; Goswami, M.; Metaxas, D.; and Chen, C. 2020. A Topological Filter for Learning with Label Noise. NeurIPS, 33.
[3] Ortego, D.; Arazo, E.; Albert, P.; O'Connor, N. E.; and McGuinness, K. 2021. Multi-Objective Interpolation Training for Robustness to Label Noise. In CVPR, 6606–6615.
It is to be noted that the information disclosed in the above background section is only for understanding the background of the present application and thus may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
The main objective of the present invention is to overcome the above-mentioned drawbacks of the background art, and to provide a noise identification method based on clean data set and key feature detection in a video classification task.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method of identifying noise in video data, comprising the steps of:
s1, establishing a clean data set for comparing unknown data of the noise data set, and completing dimensionality reduction on the characteristics of the clean data set and the noise data set by utilizing interframe information to obtain a dimensionality reduced characteristic set;
s2, calculating cosine similarity between the undetermined sample of the noise data set and a clean sample class center of the clean data set in the feature space after dimension reduction;
s3, comparing the cosine similarity of the undetermined sample and the clean sample class center, calculating the probability that the undetermined sample is the clean sample according to the cosine similarity, and dividing the sample with the probability greater than a preset probability threshold value into the clean sample.
Further:
inter-frame timing information is utilized to detect noise samples of a noisy data set.
The feature set of the noisy data set is D = {x_i, i = 1, …, M}, where x_i ∈ R^d is the representation of a segment in a video and M is the number of samples; the feature set of the clean data set is D^L = {x_i^L, i = 1, …, C×K}, where C is the number of clean samples of each category in the clean data set and K is the number of categories. The objective of the classification task is to determine the class to which a representation x_i belongs.
The label a_i of x_i is represented as a one-hot code y_i ∈ {0,1}^K, in which the k-th element y_k is assigned 1 and the remaining elements 0; a classification head g(·) with a Softmax operation is used after the feature extractor f(·) with a consensus function to predict the probability that x_i belongs to class k: p(k | x_i) = g(x_i; k).
For a representation x_i, its salient feature set l_i is determined as the top-m feature channels of the vector x_i with the largest inter-frame variance. For the feature set D^L of the clean data set, each representation x_j^L belonging to class k is considered clean and its corresponding salient feature set l_j^L is computed; the salient feature set L_k of class k is then the m feature channels that occur most frequently among all the l_j^L.
After the salient feature sets of all classes are computed, for every sample x_i in the feature set D labeled as class k, only the dimensions contained in the salient feature set L_k of the corresponding class are retained, yielding the reduced representation x̃_i; the current value of the class center is then computed as ĉ_k = (1/m_k) Σ_{a_i = k} x̃_i, where m_k is the number of representations with label k in the feature set D.
The accumulated class-center value c_k is updated by a momentum update: c_k ← m·c_k + (1 − m)·ĉ_k, where m is a momentum constant, taken as 0.999. The reliability r_i of a representation x_i to be examined, carrying label k, in the feature set D of the noisy data set is then r_i = (x̃_i · c_k)/(‖x̃_i‖ ‖c_k‖), where · denotes the inner product; the reliability indicates the confidence that x_i is a clean sample of class k.
For each class, a two-component beta mixture model π(·) is fitted to the reliabilities of all representations in the feature set D carrying that class label, and the probability that a representation x_i belongs to the clean cluster of its class a_i is defined as p_i = π(clean | r_i), where r_i is the reliability of the sample x_i with label a_i. All representations whose probability p_i is greater than the preset probability threshold are considered clean; the rest are considered noisy.
After the noise detection is finished, training is performed on clean samples until the next noise detection is started.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the noise identification method.
The invention has the following beneficial effects:
the invention provides a method for identifying noise in video data, which can screen out a correctly labeled sample in a video data set with wrong labeling by using a small clean data set in a video scene to finish high-quality noise sample detection. In the invention, a small clean data set is introduced to compare unknown data, and in addition, the dimension reduction of the features is completed by utilizing interframe information, so that a feature set after dimension reduction is obtained. Therefore, the similarity calculation only needs to be completed in the feature space after the dimensionality reduction, and the problem of dimensionality disaster caused by overhigh dimensionality is solved. When the method is used, the cosine similarity between the class centers of the undetermined sample and the clean sample is only needed to be compared, and the sample with the larger similarity is divided into the clean samples. The detection rate for noisy samples is improved compared to previous methods due to the introduction of a known clean data set. Different from the traditional method, the method can specially solve the problem of detecting the noise sample in the video, and can complete the learning task with noise in the video scene. Moreover, the method takes the characteristics of the video data into consideration, and utilizes the inter-frame time sequence information to detect the noise sample.
Drawings
Fig. 1 is a flowchart illustrating a method for identifying noise in video data according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary in nature and is not intended to limit the scope of the invention or its application.
Referring to fig. 1, an embodiment of the present invention provides a method for identifying noise in video data, including the following steps:
s1, establishing a clean data set for comparing unknown data of the noise data set, and completing dimensionality reduction on the characteristics of the clean data set and the noise data set by utilizing interframe information to obtain a dimensionality reduced characteristic set;
s2, calculating cosine similarity between the undetermined sample of the noise data set and a clean sample class center of the clean data set in the feature space after dimension reduction;
s3, comparing the cosine similarity of the undetermined sample and the clean sample class center, calculating the probability that the undetermined sample is the clean sample according to the cosine similarity, and dividing the sample with the probability greater than a preset probability threshold value into the clean sample.
Specific examples of the present invention are described in further detail below.
In an embodiment of the invention, after a sample passes through a convolutional neural network, the feature vector output by the last layer of the network serves as the sample feature; it records the sample's feature information and can therefore be used for sample identification and noise-sample detection.
The feature set of the noisy data set is D = {x_i, i = 1, …, M}, where x_i ∈ R^d is the representation of a segment in a video and M is the number of samples; the feature set of the clean data set is D^L = {x_i^L, i = 1, …, C×K}, where C is the number of clean samples of each category and K is the number of categories. The goal of the classification task is to determine the class to which a representation x_i belongs. The label a_i of x_i can be expressed as a one-hot code y_i ∈ {0,1}^K: the k-th element y_k of y_i is assigned 1 and the remaining elements 0. Under the noisy-learning setting, a label a may be wrongly assigned to a sample that does not belong to class a. A classification head g(·) with a Softmax operation is defined after the feature extractor f(·) with a consensus function to predict the probability that x_i belongs to the k-th class, i.e., p(k | x_i) = g(x_i; k).
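The one-hot labels and the Softmax classification head above can be sketched as follows (a minimal NumPy sketch; the weight matrix W and bias b stand in for assumed learned parameters and are not part of the patent):

```python
import numpy as np

def one_hot(label, num_classes):
    # y_i in {0,1}^K with the label-th element set to 1, the rest 0.
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

def softmax_head(x, W, b):
    # A linear classification head g(.) with Softmax applied to a clip
    # feature x; returns the vector of probabilities p(k | x_i).
    z = W @ x + b
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```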
Selection of salient features:
for one representation xiWe define this characterization xiOf a salient feature of (a)iIs a vector xiThe first m characteristic channels with the largest inter-frame variance. Characterization set D for a known clean sample setLEach of the tokens belonging to class k being considered clean
Figure BDA0003549719390000052
A corresponding set of salient features can be computed
Figure BDA0003549719390000053
Here, we define the salient feature set L of the class k by using the time sequence relation information between frameskFor all that is
Figure BDA0003549719390000054
The m characteristic channels with the largest occurrence number.
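The per-clip salient set l_i and the per-class set L_k can be sketched as follows (a NumPy sketch under the stated definitions; the tie-breaking among equally frequent channels is an assumption):

```python
import numpy as np

def clip_salient_channels(frames, m):
    # l_i: the m feature channels of one clip with the largest
    # inter-frame variance (frames: (T, d) per-frame features).
    return set(np.argsort(frames.var(axis=0))[-m:])

def class_salient_set(clean_clips, m):
    # L_k: the m channels that occur most often among the per-clip
    # salient sets of the trusted (clean) clips of class k.
    counts = np.zeros(clean_clips[0].shape[1], dtype=int)
    for frames in clean_clips:
        for ch in clip_salient_channels(frames, m):
            counts[ch] += 1
    return set(np.argsort(counts)[-m:])
```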
Estimating the reliability of a representation using the salient features:
after the significant feature sets of all classes are calculated, all samples labeled as class k in the feature set D are processed
Figure BDA0003549719390000061
We only keep the set L of salient features in the corresponding classkThe dimension contained in the representation is marked as the representation after the dimension reduction by utilizing the interframe information
Figure BDA0003549719390000062
And calculates the current value of class center
Figure BDA0003549719390000063
Figure BDA0003549719390000064
Wherein m iskIs the number of tokens with label k in token set D.
We update the accumulated class-center value by momentum: c_k ← m·c_k + (1 − m)·ĉ_k, where m is the momentum constant, taken as 0.999.
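The class-center computation and momentum update can be sketched as follows (the exponential-moving-average blending form is our reading of the momentum rule; m = 0.999 as stated in the text):

```python
import numpy as np

def class_center(reduced_feats):
    # \hat{c}_k: current class-centre value, the mean of the m_k
    # dimension-reduced representations currently labelled k.
    return reduced_feats.mean(axis=0)

def momentum_update(c_k, c_hat, m=0.999):
    # Accumulated centre update: c_k <- m*c_k + (1-m)*\hat{c}_k.
    return m * c_k + (1.0 - m) * c_hat
```

The large momentum constant means the accumulated centre drifts slowly, smoothing out per-round fluctuations of the current class mean.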
We define the reliability r_i of a representation x_i to be examined, carrying label k, in the feature set D of the noisy data set as r_i = (x̃_i · c_k)/(‖x̃_i‖ ‖c_k‖), where · denotes the inner-product computation; this computation yields the cosine similarity.
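The reliability is plain cosine similarity between the reduced representation and the accumulated class centre, as a one-liner (the small epsilon guarding against a zero norm is our addition):

```python
import numpy as np

def reliability(x_reduced, c_k, eps=1e-12):
    # r_i: cosine similarity between the dimension-reduced candidate
    # representation and the accumulated class centre c_k.
    return float(x_reduced @ c_k /
                 (np.linalg.norm(x_reduced) * np.linalg.norm(c_k) + eps))
```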
Detecting noise samples using the reliability:
The reliability indicates the confidence that the representation x_i is a clean sample of class k. To separate clean and noisy representations, for each class we fit a two-component beta mixture model π(·) to the reliabilities of all representations in D carrying that class label; the similarity comparison is realized by fitting this beta mixture model. The probability that a representation x_i belongs to the clean cluster of its class a_i is defined as p_i = π(clean | r_i), where r_i is the reliability of the sample x_i with label a_i. Samples with greater similarity yield a greater probability p_i, and samples with high probability are regarded as clean; in this way, the samples with relatively high similarity are classified as clean. Preferably, we consider all representations whose probability p_i exceeds the preset probability threshold clean and the rest noisy. After noise detection is finished, the model is trained on the clean samples until the next round of noise detection starts.
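A two-component beta mixture over reliabilities can be fitted with a small EM loop. The sketch below is an assumption-laden simplification: the patent does not specify the fitting procedure, and the M-step here moment-matches each component rather than maximizing likelihood; only NumPy and the standard library are used.

```python
import numpy as np
from math import lgamma

def beta_pdf(x, a, b):
    # Beta density, written with log-gamma to stay in the standard library.
    logB = lgamma(a) + lgamma(b) - lgamma(a + b)
    return np.exp((a - 1) * np.log(x) + (b - 1) * np.log(1 - x) - logB)

def fit_beta_mixture(r, iters=30):
    # Fit a two-component beta mixture to reliabilities r in (0,1) with EM.
    # Component 1 is initialised toward high r and plays the "clean" role.
    r = np.clip(np.asarray(r, dtype=float), 1e-4, 1 - 1e-4)
    params = [(2.0, 5.0), (5.0, 2.0)]
    weights = np.array([0.5, 0.5])
    for _ in range(iters):
        p0 = weights[0] * beta_pdf(r, *params[0])
        p1 = weights[1] * beta_pdf(r, *params[1])
        resp = p1 / (p0 + p1 + 1e-12)       # E-step: posterior P(clean | r_i)
        new_params = []
        for w in (1.0 - resp, resp):        # M-step: moment-match each component
            mu = np.sum(w * r) / (np.sum(w) + 1e-12)
            var = np.sum(w * (r - mu) ** 2) / (np.sum(w) + 1e-12) + 1e-6
            common = mu * (1.0 - mu) / var - 1.0
            new_params.append((max(mu * common, 1e-2),
                               max((1.0 - mu) * common, 1e-2)))
        params = new_params
        weights = np.array([np.mean(1.0 - resp), np.mean(resp)])

    def p_clean(x):
        # pi(clean | r): posterior probability of the clean component.
        x = np.clip(np.asarray(x, dtype=float), 1e-4, 1 - 1e-4)
        num = weights[1] * beta_pdf(x, *params[1])
        den = weights[0] * beta_pdf(x, *params[0]) + num + 1e-12
        return num / den
    return p_clean
```

Thresholding the returned posterior at the preset probability value then splits the class into clean and noisy representations.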
Experimental results
We chose TSM-ResNet50 [1] as the base architecture for all experiments, keeping the same hyperparameters throughout. We performed experiments on two large video classification data sets, Kinetics (K400) and Something-Something-V1 (SthV1), under two noise settings: symmetric noise and equivalent noise.
Symmetric noise is generated by independently assigning each selected sample in the data set a random label other than its true one, with the flip probability distributed uniformly over the classes; in this embodiment, the noise ratio is set to 40%, 60%, and 80%.
Equivalent noise is generated by assigning all mislabeled samples of a class to one particular class other than the true label. The probability of a sample in a class being mislabeled is set to 10%, 20%, and 40%.
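The two noise models can be sketched as label-corruption routines (the fixed pairing k → (k+1) mod K used for equivalent noise is an illustrative choice, not taken from the patent):

```python
import numpy as np

def add_symmetric_noise(labels, num_classes, ratio, rng):
    # Each sample is independently flipped with probability `ratio`
    # to a uniformly random label other than its true one.
    labels = labels.copy()
    flip = rng.random(len(labels)) < ratio
    offset = rng.integers(1, num_classes, flip.sum())   # 1..K-1, never 0
    labels[flip] = (labels[flip] + offset) % num_classes
    return labels

def add_equivalent_noise(labels, num_classes, ratio, rng):
    # Every mislabelled sample of class k goes to one fixed wrong class;
    # here k -> (k+1) mod K.
    labels = labels.copy()
    flip = rng.random(len(labels)) < ratio
    labels[flip] = (labels[flip] + 1) % num_classes
    return labels
```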
We compare the method of the present invention with several popular label-noise detection methods. Co-teaching [2] trains two networks that select small-loss samples as clean for each other; the noise ratio is assumed to be known in advance in Co-teaching. TopoFilter [3] detects noise samples with a clustering method based on the L2 norm. M-correction [4] models the sample loss with a Gaussian mixture model and selects the samples with smaller loss as clean. In addition, lower-bound and upper-bound baselines are constructed for comparison: in the lower-bound method all training data enters training, while in the upper-bound method the model is trained only on correctly labeled samples.
The experimental results on K400 and SthV1 are shown in Tables 1 and 2, which record the best accuracy achieved by each method on the test set over all training rounds. The method of the present invention far exceeds previous noise-sample detection methods in all settings.
Table 2: accuracy of testing of Kinetics data set (%)
Figure BDA0003549719390000071
Table 3: test accuracy of Something-Something-V1 data set (%)
Figure BDA0003549719390000072
Figure BDA0003549719390000081
References:
[1] Lin, J.; Gan, C.; and Han, S. 2019. TSM: Temporal shift module for efficient video understanding. In ICCV, 7083–7093.
[2] Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Hu, W.; Tsang, I.; and Sugiyama, M. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS.
[3] Wu, P.; Zheng, S.; Goswami, M.; Metaxas, D.; and Chen, C. 2020. A Topological Filter for Learning with Label Noise. NeurIPS, 33.
[4] Arazo, E.; Ortego, D.; Albert, P.; O'Connor, N.; and McGuinness, K. 2019. Unsupervised Label Noise Modeling and Loss Correction. In ICML.
the background of the present invention may contain background information related to the problem or environment of the present invention and does not necessarily describe the prior art. Accordingly, the inclusion in the background section is not an admission of prior art by the applicant.
The foregoing is a more detailed description of the invention in connection with specific/preferred embodiments and is not intended to limit the practice of the invention to those descriptions. It will be apparent to those skilled in the art that various substitutions and modifications can be made to the described embodiments without departing from the spirit of the invention, and these substitutions and modifications should be considered to fall within the scope of the invention. In the description herein, references to the description of the term "one embodiment," "some embodiments," "preferred embodiments," "an example," "a specific example," or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction. Although embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the claims.

Claims (10)

1. A method for identifying noise in video data, comprising the steps of:
s1, establishing a clean data set for comparing unknown data of the noise data set, and completing dimension reduction on sample features of the clean data set and the noise data set by utilizing interframe information to obtain a feature set after dimension reduction;
s2, calculating cosine similarity between the undetermined sample of the noise data set and a clean sample class center of the clean data set in the feature space after dimension reduction;
s3, comparing the cosine similarity of the undetermined sample and the clean sample class center, calculating the probability that the undetermined sample is the clean sample according to the cosine similarity, and dividing the sample with the probability greater than a preset probability threshold value into the clean sample.
2. The noise identification method of claim 1, wherein inter-frame timing information is utilized to detect noise samples of a noise data set.
3. A noise identification method as claimed in claim 1 or 2, characterized in that the feature set of the noisy data set is D = {x_i, i = 1, …, M}, where x_i ∈ R^d is the representation of a segment in a video and M is the number of samples; the feature set of the clean data set is D^L = {x_i^L, i = 1, …, C×K}, where C is the number of clean samples of each category in the clean data set and K is the number of categories; and the objective of the classification task is to determine the class to which a representation x_i belongs.
4. The noise identification method of claim 3, wherein the label a_i of x_i is represented as a one-hot code y_i ∈ {0,1}^K, in which the k-th element y_k is assigned 1 and the remaining elements 0; and wherein a classification head g(·) with a Softmax operation is used after the feature extractor f(·) with a consensus function to predict the probability that x_i belongs to class k: p(k | x_i) = g(x_i; k).
5. The noise identification method of any of claims 1 to 4, characterized in that, for a representation x_i, its salient feature set l_i is determined as the top-m feature channels of the vector x_i with the largest inter-frame variance; wherein, for the feature set D^L of the clean data set, each representation x_j^L belonging to class k is considered clean and its corresponding salient feature set l_j^L is computed, and the salient feature set L_k of class k is the m feature channels that occur most frequently among all the l_j^L.
6. The noise identification method of claim 5, wherein, after the salient feature sets of all classes are computed, for every sample x_i in the feature set D labeled as class k, only the dimensions contained in the salient feature set L_k of the corresponding class are retained, yielding the reduced representation x̃_i; and the current value of the class center is computed as ĉ_k = (1/m_k) Σ_{a_i = k} x̃_i, where m_k is the number of representations with label k in the feature set D.
7. The noise identification method of claim 6, wherein the accumulated class-center value c_k is updated by a momentum update, c_k ← m·c_k + (1 − m)·ĉ_k, where m is a momentum constant taken as 0.999; and wherein the reliability r_i of a representation x_i to be examined, carrying label k, in the feature set D of the noisy data set is r_i = (x̃_i · c_k)/(‖x̃_i‖ ‖c_k‖), where · denotes the inner product; the reliability indicates the confidence that x_i is a clean sample of class k.
8. The noise identification method of claim 7, wherein, for each class, a two-component beta mixture model π(·) is fitted to the reliabilities of all representations in the feature set D carrying that class label, and the probability that a representation x_i belongs to the clean cluster of its class a_i is defined as p_i = π(clean | r_i), where r_i is the reliability of the sample x_i with label a_i; all representations whose probability p_i is greater than the preset probability threshold are considered clean, and the rest are considered noisy.
9. The noise identification method of any one of claims 1 to 8, wherein after the noise detection is finished, training is performed on clean samples until the next noise detection is started.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the noise identification method according to any one of claims 1 to 9.
CN202210259878.1A 2022-03-16 2022-03-16 Noise identification method based on clean data set and key feature detection Pending CN114549910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210259878.1A CN114549910A (en) 2022-03-16 2022-03-16 Noise identification method based on clean data set and key feature detection


Publications (1)

Publication Number Publication Date
CN114549910A (en) 2022-05-27

Family

ID=81663569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210259878.1A Pending CN114549910A (en) 2022-03-16 2022-03-16 Noise identification method based on clean data set and key feature detection

Country Status (1)

Country Link
CN (1) CN114549910A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115511753A (en) * 2022-11-09 2022-12-23 南京码极客科技有限公司 Network image label denoising method based on dynamic sample selection



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination