CN113837238A

CN113837238A - Long-tail image identification method based on self-supervision and self-distillation

Info

Publication number: CN113837238A
Application number: CN202111026141.7A
Authority: CN
Inventors: 王利民; 李天昊; 武港山
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2021-12-24
Anticipated expiration: 2041-09-02
Also published as: CN113837238B

Abstract

A long-tail image recognition method based on self-supervision and self-distillation, in which a multi-stage training framework is constructed to train a feature extraction network. In the first stage, the feature extraction network is trained with self-supervision under long-tail distribution sampling. In the second stage, the weights of the first-stage feature extraction network are retained, the classifier is fine-tuned under class-balanced sampling, and soft labels for self-distillation are generated. In the third stage, the weights obtained before self-distillation are discarded, and the feature extraction network is trained jointly by self-distillation under the long-tail distribution with the soft labels as supervision. The resulting feature extraction network is used for image recognition and classification under long-tail distributions. The invention provides a multi-stage training method that applies self-supervision and self-distillation to a feature extraction network for long-tailed data: self-supervision yields adequate representations of the tail categories, and self-distillation effectively transfers knowledge of the head categories to the tail categories.

Description

Long-tail image identification method based on self-supervision and self-distillation

Technical Field

The invention belongs to the technical field of computer software, relates to an image classification technology, and particularly relates to a long-tail image identification method based on self-supervision and self-distillation.

Background

Recently, deep learning has made significant progress in visual recognition for images and video by training powerful neural networks on large-scale, class-balanced, carefully curated labeled datasets such as ImageNet and Kinetics. Unlike these artificially balanced datasets, real-world data usually follows a long-tailed distribution. This makes balanced datasets difficult to collect: for classes that naturally have few samples, gathering a large number of training samples is very costly and often practically impossible. However, because the data distribution is extremely imbalanced, learning directly from long-tailed data leads to a large drop in performance.

A common approach to mitigating the performance degradation caused by long-tailed training data is class rebalancing, which includes sampling strategies that rebalance the training data and loss functions that reweight samples according to class. These methods effectively reduce the dominance of the head classes during training and thereby produce more accurate classification decision boundaries. However, because the data distribution is artificially distorted, over-parameterized deep networks fit this synthetic distribution very easily, and such methods therefore often risk overfitting the tail classes. To address these problems, Kang et al. decoupled representation learning from classifier training and designed a two-stage training scheme (Kang B, Xie S, Rohrbach M, et al.). This scheme first learns the visual representation under the original data distribution and then trains a linear classifier on the frozen features under class-balanced sampling. This simple two-stage scheme was shown to address the overfitting problem and achieved the best results of its time on common long-tailed benchmarks. However, it does not handle the imbalanced label distribution well, especially in the representation learning stage, so the learned features do not represent tail-class samples well.

Based on the above analysis, the present invention aims to design a new learning paradigm for long-tailed visual recognition that combines the advantages of both kinds of long-tail recognition methods, namely robustness to overfitting and effective handling of the imbalanced label problem.

Disclosure of Invention

The problem to be solved by the invention is as follows: objects in nature follow a long-tailed distribution, and learning directly from long-tailed data causes a model to attend only to the head categories and ignore the tail categories.

The technical scheme of the invention is as follows: a long-tail image recognition method based on self-supervision and self-distillation, in which a multi-stage training framework is constructed for training the feature extraction network and the classifier of a deep neural network. In the first stage, the feature extraction network is trained with a self-supervised task under long-tail distribution sampling. In the second stage, the weights of the first-stage feature extraction network are retained, the classifier is fine-tuned under class-balanced sampling, and soft labels for self-distillation are generated. In the third stage, a deep neural network with the same structure is trained from scratch: under the long-tail distribution, the soft labels from the second stage are used as supervision and the deep neural network is trained jointly by self-distillation. The resulting deep neural network is used for image recognition and classification under long-tail distributions.

Further, the invention comprises the following steps:

1) a preparation stage: preparing a long-tailed image dataset and a deep neural network for training, wherein the deep neural network consists of a feature extraction network and a classifier, and randomly initializing the parameters of the deep neural network;

2) a feature training stage under self-supervised guidance: training the feature extraction network on the long-tailed data using a supervised task and a self-supervised task simultaneously;

3) a soft label generation stage: sampling the data in a class-balanced manner, fixing the weight parameters of the feature extraction network trained in step 2), and fine-tuning the classifier; after fine-tuning, the network serves as a teacher network whose predictions on the training samples are output as the soft labels for step 4);

4) a self-distillation stage: retraining a deep neural network on the original long-tailed data, the deep neural network having a feature extraction network with the same structure as that in step 1), with the soft labels obtained in step 3) and the true labels used together as supervision;

5) a classifier fine-tuning stage: sampling the data in a class-balanced manner, keeping the feature extraction network parameters trained in step 4) fixed, and fine-tuning the classifier to obtain the final deep neural network;

6) a testing stage: testing on a class-balanced dataset to check the picture recognition capability of the deep neural network.

As a preferred embodiment, the multi-stage training is specifically:

a self-supervised feature training stage: preparing a deep neural network D for training, the deep neural network D comprising a feature extraction network F and a classifier G_sup; sampling the long-tailed picture dataset to obtain a training picture, feeding the training picture into the feature extraction network to obtain the feature f of the picture, feeding the feature f into the classifier G_sup to obtain the class prediction, and computing the classification loss function from the true label; randomly initializing a network module for the self-supervised task, feeding the feature f into this module to obtain an output, and computing the self-supervised loss function from the output; adding the classification loss and the self-supervised loss to give the final loss function, optimizing the feature extraction network by stochastic gradient descent, and iterating this process until the set number of iterations is reached;

a soft label generation stage: sampling the training data with a class-balanced method and retraining the feature extraction network F and the classifier G_sup obtained in the self-supervised feature training stage, wherein the training task is a classification task and the loss function is the cross-entropy loss; the weights of the feature extraction network F are fixed, the weight of each category in the classifier is fine-tuned through a set of learnable parameters, and training is iterated until the set number of iterations is reached; the deep neural network trained at this stage is called network R;

a self-distillation stage: initializing a new deep neural network S consisting of a feature extraction network F_S and two linear classifiers H_hard and H_soft, the feature extraction network F_S having the same network structure as the feature extraction network F; sampling the long-tailed picture dataset to obtain training pictures; feeding each training picture into network R, whose output prediction is the soft label; feeding the training picture into the deep neural network S, whose two classifiers output two classification results; supervising the two classification results with the soft label and the original label of the picture respectively, the loss functions being the KL divergence and the cross-entropy loss respectively; and iterating the training until the set number of iterations is reached.

Further, a classifier fine-tuning stage is also provided: the classifier supervised by hard labels in the deep neural network S trained in the self-distillation stage is fine-tuned under class-balanced data sampling to obtain the final classification result.

The present invention proposes a conceptually simple but highly effective multi-stage training scheme consisting of two parts. First, the invention introduces a self-distillation framework for long-tail recognition that can automatically mine the relationships between labels. Second, it provides a new self-supervision-guided module for generating distillation labels; the distillation labels integrate information from both the label domain and the data domain, so that the long-tailed distribution can be modeled effectively.

Compared with the prior art, the invention has the following advantages:

The invention provides a multi-stage long tail image recognition training method utilizing self-supervision and self-distillation, which can fully characterize tail categories by utilizing a self-supervision method and effectively transfer the knowledge of head categories to tail categories by utilizing a self-distillation method.

Compared with prior-art methods that manually design class-balanced training strategies, the feature extraction network here is trained under the long-tailed distribution and the original distribution is not artificially altered, so overfitting to the tail classes and underfitting of the head classes can be avoided. Compared with the existing two-stage long-tail training method, the soft labels produced by network R are added in the self-distillation stage, which introduces class-balanced modeling into the feature training stage and yields a more robust representation.

The invention achieves results significantly better than the prior art on public long-tailed image recognition datasets.

Drawings

FIG. 1 is a system framework diagram used by the present invention.

Detailed Description

Deep learning has made significant progress in visual recognition on large-scale balanced datasets, but it still underperforms on real-world long-tailed data. The prior art typically employs class-rebalancing training strategies, which alleviate the imbalance problem but risk overfitting the tail classes. The decoupling methods recently proposed by researchers overcome the overfitting problem by using a multi-stage training scheme, but still fail to capture tail-class information in the feature learning stage. In the invention, soft labels serve as an effective solution: label correlation is incorporated into a multi-stage training scheme for long-tail recognition, and the inter-class relationships embodied in the soft labels benefit long-tail recognition by transferring knowledge from the head classes to the tail classes.

As shown in FIG. 1, the invention constructs a multi-stage training framework for training the feature extraction network and the classifier of a deep neural network. In the first stage, the feature extraction network is trained with a self-supervised task under long-tail distribution sampling. In the second stage, the weights of the first-stage feature extraction network are retained and the classifier is fine-tuned under class-balanced sampling to generate soft labels for self-distillation. In the third stage, a deep neural network with the same structure is trained from scratch: under the long-tail distribution, the soft labels from the second stage are used as supervision and the deep neural network is trained jointly by self-distillation. The resulting deep neural network is used for image recognition and classification under long-tail distributions. The method comprises the following steps:


5) classifier fine-tuning stage: sampling the data in a class-balanced manner, keeping the feature extraction network parameters trained in step 4) fixed, and fine-tuning the classifier to obtain the final deep neural network.

The following is a detailed description.

1) Preparation stage: a dataset is prepared for training and divided into a training set and a test set. The samples of the training set follow a long-tailed distribution: a few categories have many samples, while most categories have few. Every category of the test set has the same number of samples, and the number of samples per category is generally small. A deep neural network for training is prepared, denoted D; it consists of a feature extraction network and a classifier. The feature extraction network can be a common deep backbone such as ResNet, ResNeXt, or VGGNet, and the classifier is a fully connected layer. The feature extraction network is denoted as the transformation F and the classifier as the transformation G_sup, and the parameters of the deep neural network are randomly initialized.

2) Feature training stage under self-supervised guidance: a network module for the self-supervised task, denoted G_self, is randomly initialized; its specific form depends on the self-supervised task. A training picture x is sampled under the long-tailed distribution of the data. The picture is fed into the feature extraction network to obtain its feature f = F(x), and the feature f is fed into the classifier to obtain the class prediction z = G_sup(f) ∈ R^{1×c}, where c is the number of categories. Let the true category of the picture be y; the classification loss is computed from the true label as a cross-entropy loss:

$$L_{sup} = -\log \frac{\exp(z_y)}{\sum_{j=1}^{c} \exp(z_j)}$$

The feature is also fed into the network module of the self-supervised task to obtain the output u = G_self(f), from which the self-supervised loss L_self is computed. The two losses are added with weights α₁ and α₂ to give the final loss L:

$$L = \alpha_1 L_{sup} + \alpha_2 L_{self}$$

The network is optimized by stochastic gradient descent, and this process is iterated until the set number of iterations is reached.
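The weighted combination of the supervised and self-supervised losses can be illustrated with a minimal pure-Python sketch. It assumes that the classification loss is a softmax cross-entropy; the class scores, the placeholder value for L_self, and the function names `cross_entropy` and `stage_one_loss` are hypothetical and chosen only for illustration.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


def stage_one_loss(l_sup, l_self, alpha1=1.0, alpha2=1.0):
    """Weighted sum L = alpha1 * L_sup + alpha2 * L_self."""
    return alpha1 * l_sup + alpha2 * l_self


# Hypothetical class scores z = G_sup(f) for c = 3 categories, true label y = 0.
z = [2.0, 0.5, -1.0]
y = 0
l_sup = cross_entropy(z, y)
l_self = 0.7  # placeholder standing in for the self-supervised loss value
total = stage_one_loss(l_sup, l_self, alpha1=1.0, alpha2=1.0)
```

In an actual training loop the total loss would be back-propagated through both heads and the shared feature extraction network; the sketch only shows how the scalar is formed.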

The self-supervised task may be either rotation-angle prediction or instance discrimination. The specific steps are as follows:

2.1) Rotation-angle prediction task: a picture x is rotated by an angle chosen at random from {0°, 90°, 180°, 270°} to obtain the rotated picture x', and the network predicts the rotation angle. In this task the self-supervised network module G_self is a linear classifier implemented as a fully connected layer, whose output is u ∈ R^{1×4}. Let the rotation angle of the picture be the r-th of the four angles; the self-supervised loss function is then the cross-entropy over the four angle classes:

$$L_{self} = -\log \frac{\exp(u_r)}{\sum_{j=1}^{4} \exp(u_j)}$$
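How a training sample for the rotation-angle prediction task might be produced is sketched below. The picture is represented as a plain list of rows, and the helper names `rotate90` and `make_rotation_sample` are hypothetical; a real implementation would operate on image tensors.

```python
import random


def rotate90(image, k):
    """Rotate a 2-D list-of-rows image counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transposing and then reversing the row order is one 90-degree turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def make_rotation_sample(image, rng=random):
    """Return (rotated picture x', label r), r indexing {0, 90, 180, 270} degrees."""
    r = rng.randrange(4)
    return rotate90(image, r), r


picture = [[1, 2],
           [3, 4]]
rotated, r = make_rotation_sample(picture, random.Random(0))
```

The label r is what the four-way classifier G_self is trained to predict from the feature of the rotated picture.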

2.2) Instance discrimination task: the self-supervised network module G_self is implemented as a multi-layer perceptron. The structure and weights of the current deep neural network are cloned to generate a new deep neural network M. During training, the parameters of M are updated by momentum from the weights of network D; denoting the momentum coefficient by m, the update formula is:

M＝m·M+(1-m)·D
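The momentum update can be sketched in a few lines. Parameters are modelled here as flat lists of floats purely for illustration, and the function name `momentum_update` is hypothetical; the default m = 0.999 is the value given in the embodiment.

```python
def momentum_update(m_params, d_params, m=0.999):
    """In-place update M <- m * M + (1 - m) * D over flat parameter lists."""
    for i, (p_m, p_d) in enumerate(zip(m_params, d_params)):
        m_params[i] = m * p_m + (1.0 - m) * p_d
    return m_params


# Hypothetical parameter values for the momentum network M and the network D.
params_m = [1.0, 0.0]
params_d = [0.0, 1.0]
momentum_update(params_m, params_d, m=0.9)
```

With m close to 1 the momentum network changes slowly, so the features it produces stay consistent across training iterations.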

For the i-th picture, transformation T₁ (simple data augmentation) gives the input picture x_i, and transformation T₂ (complex data augmentation) gives the input picture x'_i. x_i is fed through networks F and G_self and the output is normalized by its L2 norm to obtain the feature v_i; x'_i is fed through the corresponding feature extraction network and self-supervised module of the momentum network M, and the output is normalized by its L2 norm to obtain the feature v'_i:

$$v_i = \frac{G_{self}(F(x_i))}{\lVert G_{self}(F(x_i)) \rVert_2}, \qquad v'_i = \frac{G^{M}_{self}(F^{M}(x'_i))}{\lVert G^{M}_{self}(F^{M}(x'_i)) \rVert_2}$$

where F^M and G^M_self denote the copies of F and G_self in network M. The self-supervised loss function is then:

$$L_{self} = -\log \frac{\exp(v_i \cdot v'_i / \tau)}{\exp(v_i \cdot v'_i / \tau) + \sum_{k=1}^{K} \exp(v_i \cdot v'_k / \tau)}$$

where v'_k are the features of other pictures output by network M, called negative samples, K is the number of negative samples, and τ is the temperature hyperparameter.
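The contrastive loss of the instance discrimination task can be illustrated as follows. The sketch assumes the common InfoNCE form (one positive pair against K negatives, scaled by the temperature τ), which matches the quantities described above but is not confirmed term by term by the text; the function names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def instance_discrimination_loss(v, v_pos, negatives, tau=0.2):
    """-log( e^{v.v_pos/tau} / (e^{v.v_pos/tau} + sum_k e^{v.v_k/tau}) )."""
    logits = [dot(v, v_pos) / tau] + [dot(v, n) / tau for n in negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


# Toy 2-D features: the positive matches v, the single negative is orthogonal.
v = l2_normalize([3.0, 0.0])
v_pos = l2_normalize([2.0, 0.0])
negatives = [l2_normalize([0.0, 5.0])]
loss = instance_discrimination_loss(v, v_pos, negatives, tau=0.2)
```

The loss is small when the two augmented views of the same picture are close and the negatives are far away, which is what pushes the features of individual instances apart.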

3) Soft label generation stage: the training data are sampled in a class-balanced manner using two-stage sampling: a category is first selected at random by a uniform sampler, and a sample is then selected at random by a uniform sampler from the samples belonging to the selected category. The feature extraction network F and the classifier G_sup of the deep neural network obtained in step 2) are retrained. During training the weights of the feature extraction network are fixed, and a parameter s_i that adjusts the scale of the classifier weight is introduced for the weight w_i of each category i in the classifier. The rescaled weight is:

$$\tilde{w}_i = s_i \cdot w_i$$

During training the original weights w_i are kept unchanged and only the values of the parameters s_i are updated by gradient-based optimization. For a picture x, the class prediction is obtained with the rescaled classifier:

$$f = F(x)$$

$$z_i = \tilde{w}_i^{\top} f, \quad i = 1, \dots, c$$

where c is the number of categories. The loss function is again the cross-entropy loss; if the correct category is y, the loss is:

$$L = -\log \frac{\exp(z_y)}{\sum_{j=1}^{c} \exp(z_j)}$$

The network is optimized by stochastic gradient descent, and this process is iterated until the set number of iterations is reached. The deep neural network trained at this stage is called network R.
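The two ingredients of this stage, two-stage class-balanced sampling and per-class rescaling of frozen classifier weights, can be sketched as below. The sketch assumes a bias-free linear classifier; the function names and toy data are hypothetical.

```python
import random
from collections import defaultdict


def build_class_index(labels):
    """Map each class to the list of sample indices that belong to it."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    return by_class


def class_balanced_sample(by_class, rng=random):
    """Pick a class uniformly, then pick a sample uniformly inside that class."""
    cls = rng.choice(sorted(by_class))
    return rng.choice(by_class[cls])


def scaled_logits(weights, scales, f):
    """z_i = s_i * (w_i . f); only the scales s_i would be trained."""
    return [s * sum(w_k * f_k for w_k, f_k in zip(w, f))
            for w, s in zip(weights, scales)]


# Toy long-tailed labels: 98 samples of class 0 and 2 samples of class 1.
labels = [0] * 98 + [1] * 2
by_class = build_class_index(labels)
rng = random.Random(0)
draws = [labels[class_balanced_sample(by_class, rng)] for _ in range(2000)]
tail_fraction = draws.count(1) / len(draws)  # close to 0.5 despite the imbalance

z = scaled_logits([[1.0, 0.0], [0.0, 2.0]], [2.0, 0.5], [3.0, 4.0])
```

Because only one scalar per class is learned while the features and original weights stay frozen, this stage adjusts the decision boundaries without disturbing the representation learned under the long-tailed distribution.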

4) Self-distillation stage: a new deep neural network S is randomly initialized, consisting of a feature extraction network F_S and two linear classifiers H_hard and H_soft; F_S has the same structure as the feature extraction network prepared in step 1). A training picture x is sampled from the original long-tailed data, data augmentation by image transformation gives x', and the original label of the picture is y. First, the pseudo label (soft label) of picture x, denoted ŷ, is obtained from the network R trained in step 3):

$$\hat{f} = F(x), \qquad \hat{z} = \hat{W}^{\top} \hat{f}, \qquad \hat{y}_n = \frac{\exp(\hat{z}_n / T)}{\sum_{k=1}^{c} \exp(\hat{z}_k / T)}$$

where \hat{f} is the feature extracted by network R, \hat{z} is the class prediction of network R, \hat{W} is the weight of the classifier of network R, \hat{z}_n is the prediction for the n-th class, and \hat{y}_n is the n-th element of the pseudo label. T is a temperature hyperparameter, set manually to 2, that adjusts the distribution given by the above formula: the larger T is, the smoother the distribution. c is the number of categories.
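The effect of the temperature on the soft label can be seen in a small sketch of a temperature-scaled softmax. The teacher scores are hypothetical, and `soft_label` is an illustrative name, not a function from the source.

```python
import math


def soft_label(logits, T=2.0):
    """Temperature-scaled softmax: y_n = exp(z_n / T) / sum_k exp(z_k / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical class scores produced by the teacher network R for c = 3.
teacher_scores = [4.0, 1.0, 0.0]
sharp = soft_label(teacher_scores, T=1.0)
smooth = soft_label(teacher_scores, T=2.0)
```

Raising T moves probability mass from the top class to the other classes, which exposes the inter-class similarities that the student network learns from.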

The augmented picture x' is fed into the deep neural network S re-initialized at this stage to obtain the predicted outputs z^hard and z^soft of the two classifiers:

f＝F_S(x′)

z^hard＝H_hard(f)

z^soft＝H_soft(f)

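Using z^soft and the soft label, denoted ŷ, the loss function of the distillation part is computed as a KL divergence:

$$L_{kd} = \sum_{n=1}^{c} \hat{y}_n \log \frac{\hat{y}_n}{q_n}, \qquad q_n = \frac{\exp(z^{soft}_n / T)}{\sum_{k=1}^{c} \exp(z^{soft}_k / T)}$$

where T is the temperature parameter that controls the smoothness of the distribution, and z^soft_n and z^soft_k are the predictions of the classifier H_soft for the n-th and k-th categories.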

Using z^hard and the original label y, the ordinary classification loss is computed as a cross-entropy loss:

$$L_{ce} = -\log \frac{\exp(z^{hard}_y)}{\sum_{j=1}^{c} \exp(z^{hard}_j)}$$

The two loss functions are fused with weights λ₁ and λ₂ to give the final loss function of this stage:

L＝λ₁L_kd+λ₂L_ce
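The fused loss of the self-distillation stage can be sketched in pure Python as a KL divergence on the soft branch plus a cross-entropy on the hard branch. Some distillation implementations additionally multiply the KL term by T²; the text does not say whether that factor is used, so it is omitted here as an assumption. All function names and numeric values are hypothetical.

```python
import math


def log_softmax(logits, T=1.0):
    """Log of the temperature-scaled softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def kl_divergence(soft_label, z_soft, T=2.0):
    """KL(soft_label || softmax(z_soft / T)); terms with zero mass are skipped."""
    log_q = log_softmax(z_soft, T)
    return sum(p * (math.log(p) - lq)
               for p, lq in zip(soft_label, log_q) if p > 0)


def cross_entropy(z_hard, y):
    """-log softmax(z_hard)[y] for the hard-label branch."""
    return -log_softmax(z_hard)[y]


def distillation_loss(soft_label, z_soft, z_hard, y, T=2.0, lam1=1.0, lam2=1.0):
    """L = lam1 * L_kd + lam2 * L_ce."""
    return (lam1 * kl_divergence(soft_label, z_soft, T)
            + lam2 * cross_entropy(z_hard, y))


# Hypothetical teacher soft label and student outputs for c = 3, true label 0.
teacher = [math.exp(v) for v in log_softmax([1.0, 2.0, 3.0], T=2.0)]
loss = distillation_loss(teacher, [3.0, 2.0, 1.0], [2.0, 0.0, 0.0], y=0)
```

Keeping two separate classifiers lets the soft branch follow the class-balanced teacher while the hard branch still fits the original labels; both gradients flow into the shared feature extraction network F_S.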

5) Classifier fine-tuning stage: the classifier H_hard supervised by hard labels in the deep neural network S obtained in the self-distillation stage is fine-tuned under class-balanced data sampling; the hard labels are the original labels. The training pictures are sampled with the same class-balanced method as in the soft label generation stage, and the feature extraction network F_S and the classifier H_hard of the deep neural network obtained in step 4) are retrained. During training the weights of the feature extraction network F_S are fixed, and a parameter s_i that adjusts the scale of the classifier weight is introduced for the weight h_i of each category i in the classifier. The rescaled weight is:

$$\tilde{h}_i = s_i \cdot h_i$$

As in step 3), the original weights h_i are kept unchanged and only the parameters s_i are updated. For a picture x, the class prediction is computed with the rescaled classifier from the feature

$$f = F_S(x)$$

and the cross-entropy loss with respect to the true label is minimized.

6) Testing stage: testing uses the test set constructed in step 1), in which every category has the same number of pictures, i.e. the test set is class-balanced. The pictures of the test set are fed into the network obtained in step 5) for prediction, the prediction accuracy is obtained by comparing the predictions with the correct categories of the pictures, and it is checked whether the picture recognition capability of the deep neural network meets the accuracy requirement.

The practice of the invention is illustrated by the following specific examples.

Training uses pictures from the ImageNet-LT dataset, and the method is implemented in the Python 3 programming language with the PyTorch 1.6 deep learning framework.

FIG. 1 is a system framework diagram for use with the present invention, the corresponding implementation is as follows:

1) In the preparation stage, the ImageNet-LT dataset used for training and testing is constructed. The dataset has 1000 categories, and the class distribution follows a Pareto distribution with coefficient 6. The training set contains 120,000 pictures, with the number of pictures per category ranging from 1280 down to 5; the test set contains 50,000 pictures, 50 for each category. The neural network required for training is prepared: ResNeXt-50 is selected as the feature extraction network, with an output feature dimension of 2048; the classifier is a fully connected layer with input dimension 2048 and output dimension 1000; and the parameters of the neural network are randomly initialized.

2) In the feature training stage under self-supervised guidance, instance discrimination is adopted as the self-supervised task. The network module of the self-supervised task is a multi-layer perceptron consisting of a fully connected layer, a ReLU nonlinear activation layer, and a fully connected layer; the input and hidden feature dimensions are 2048 and the output feature dimension is 128. A network with exactly the same structure and parameters as the current network is constructed and updated by momentum during training, with momentum parameter m = 0.999 and temperature parameter τ = 0.2. Each picture is passed through transformations T₁ and T₂ to obtain two augmented pictures, where T₁ consists of random resizing, random cropping, random horizontal flipping, and normalization, and T₂ consists of random resizing, random cropping, random color transformation, random grayscale conversion, random Gaussian blur, random horizontal flipping, and normalization. The picture transformed by T₁ is fed into the original network to obtain a 1000-dimensional classification prediction vector and a 128-dimensional feature vector, and the picture transformed by T₂ is fed into the momentum-updated network to obtain a 128-dimensional feature vector. The classification loss is computed from the classification prediction vector, the self-supervised loss is computed from the feature vectors, and the two losses are fused at a ratio of 1:1 to obtain the total loss. Training uses stochastic gradient descent on 8 TITAN Xp GPUs with a batch size of 256 for 135 epochs; the learning rate is 0.1 and is decayed with a cosine function.

3) In the soft label generation stage, the self-supervised module of step 2) is discarded, and the feature extraction network and the classifier are retained. The training data are sampled with the class-balanced method: a category is first selected at random by a uniform sampler, and a sample is then selected at random by a uniform sampler from the samples belonging to the selected category. The feature extraction network and the classifier of the deep neural network obtained in step 2) are retrained. Each sampled picture is augmented by random resizing, random cropping, random horizontal flipping, and normalization, and is then fed in turn into the feature extraction network and the classifier with adjusted coefficients to obtain the class prediction, and the loss is computed against the true category. Training uses stochastic gradient descent on 8 TITAN Xp GPUs with a batch size of 512 for 5 epochs; the learning rate of the coefficients that adjust the classifier is 0.2, the learning rate of the remaining parts (the feature extraction network and the original classifier parameters) is 0, and the learning rate is decayed with a cosine function.

4) In the self-distillation stage, a new deep neural network is randomly initialized. Its feature extraction network is again ResNeXt-50, followed by two linear classifiers, each a fully connected layer with input dimension 2048 and output dimension 1000. A training picture is sampled from the original long-tailed data and augmented by random resizing, random cropping, random horizontal flipping, and normalization. The picture is fed into the network trained in step 3) to obtain a prediction, which is scaled by the temperature T = 2 and normalized with softmax to give the pseudo label. The picture is also fed into the feature extraction network of the new network to obtain intermediate features, which are fed into the two classifiers to obtain two predictions. The self-distillation loss is computed from the first prediction and the pseudo label, the classification loss is computed from the second prediction and the true label, and the two losses are fused at a ratio of 1:1 to obtain the final loss. Training uses stochastic gradient descent on 8 TITAN Xp GPUs with a batch size of 256 for 135 epochs; the learning rate is 0.1 and is decayed with a cosine function.

5) In the classifier fine-tuning stage, the training data are sampled with the class-balanced method: a category is first selected at random by a uniform sampler, and a sample is then selected at random by a uniform sampler from the samples belonging to the selected category. The feature extraction network and the classifier of the deep neural network obtained in step 4) are retrained. Each sampled picture is augmented by random resizing, random cropping, random horizontal flipping, and normalization, and is then fed in turn into the feature extraction network and the classifier with adjusted coefficients to obtain the class prediction, and the loss is computed against the true category. Training uses stochastic gradient descent on 8 TITAN Xp GPUs with a batch size of 512 for 5 epochs; the learning rate of the coefficients that adjust the classifier is 0.2, the learning rate of the remaining parts (the feature extraction network and the original classifier parameters) is 0, and the learning rate is decayed with a cosine function.

6) In the testing stage, the test set constructed in step 1) is used, in which every category has the same number of pictures. The pictures are fed into the network obtained in step 5) for prediction, and the prediction accuracy is obtained by comparison with the correct categories. The accuracy on the whole test set is 56.0%; the accuracy is 66.8% for categories with many training samples, 53.1% for categories with a medium number of training samples, and 35.4% for categories with few training samples. Compared with the baseline method, these accuracies are improved by 3.9%, 3.4%, 4.5% and 3.1%, respectively.
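The cosine learning-rate decay mentioned in the training stages above can be sketched as follows. The sketch assumes a per-epoch schedule that decays from the base learning rate to 0 over the whole run; whether the original implementation steps per epoch or per iteration, and its final learning rate, are not stated, so both are assumptions, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(epoch, total_epochs=135, base_lr=0.1, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at epoch total_epochs."""
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_factor


# Learning rate at every epoch of a 135-epoch stage starting from 0.1.
schedule = [cosine_lr(e) for e in range(136)]

# The 5-epoch fine-tuning stages start from 0.2 for the scaling coefficients.
finetune_schedule = [cosine_lr(e, total_epochs=5, base_lr=0.2) for e in range(6)]
```

The schedule changes slowly at the start and end of training and fastest in the middle, which avoids the abrupt drops of a step schedule.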

Claims

1. A long-tail image recognition method based on self-supervision and self-distillation, characterized in that a multi-stage training framework is constructed for training the feature extraction network and the classifier of a deep neural network; in the first stage, the feature extraction network is trained with a self-supervised task under long-tail distribution sampling; in the second stage, the weights of the first-stage feature extraction network are retained, the classifier is fine-tuned under class-balanced sampling, and soft labels for self-distillation are generated; in the third stage, a deep neural network with the same structure is trained from scratch, wherein under the long-tail distribution the soft labels from the second stage are used as supervision and the deep neural network is trained jointly by self-distillation; and the resulting deep neural network is used for image recognition and classification under long-tail distributions.

2. The long tail image recognition method based on self-supervision and self-distillation as claimed in claim 1, characterized by comprising the following steps:

6) testing stage: testing is performed on a class-balanced data set to check whether the picture recognition capability of the deep neural network meets the requirement.

3. The method for identifying the long tail image based on the self-supervision and the self-distillation as claimed in claim 1 or 2, characterized in that the multi-stage training is specifically as follows:

a soft label generation stage: the training pictures are sampled with a class-balanced method, and the feature extraction network F and the classifier G_sup obtained in the self-supervised feature training stage are retrained; the training task is a classification task and the loss function is the cross-entropy loss function; the training method is to fix the weights of the feature extraction network F and fine-tune the classifier G_sup through a number of learnable parameters, iterating until the set number of iterations is reached; the deep neural network trained in this stage is called network R;

a self-distillation stage: a new deep neural network S is initialized, consisting of a feature extraction network F_S and two linear classifiers H_hard and H_soft, where F_S has the same structure as the feature extraction network F; training pictures are obtained by sampling the long-tail distributed picture data set; each training picture is first fed into network R, and the prediction output by network R is the soft label; the training picture is then fed into the deep neural network S, whose two classifiers each output a classification result; the two classification results are supervised by the soft label and by the original label of the picture respectively, with the KL divergence and the cross-entropy loss function as the respective loss functions, and training iterates until the set number of iterations is reached.

4. The method for identifying the long tail image based on the self-supervision and the self-distillation as claimed in claim 3, further comprising a classifier fine-tuning stage: under class-balanced data sampling, the hard-label-supervised classifier of the deep neural network S obtained in the self-distillation stage is fine-tuned to obtain the final classification result.

5. The method for identifying the long-tail image based on the self-supervision and the self-distillation as claimed in claim 3, wherein in the self-supervised feature training stage, the self-supervised tasks comprise picture rotation angle prediction and instance discrimination:

rotation angle prediction task: a picture x is randomly rotated by an angle in {0°, 90°, 180°, 270°} to obtain a rotated picture x', and the rotation angle is predicted by the network; in this task, the self-supervised network module G_self is a linear classifier implemented as a fully connected layer, whose output is u ∈ R^{1×4}; if the rotation angle of the picture is the r-th of the four angles, the self-supervised loss function is:

instance discrimination task: the self-supervised network module G_self is implemented as a multilayer perceptron model; the structure and weights of the current deep neural network D are cloned to generate a new deep neural network M; during training, the parameters of network M are updated from the weights of network D with a momentum denoted m, and the update formula is:

M = m·M + (1 − m)·D

for the i-th picture, transformation T₁ yields the input picture x_i after simple data augmentation, and transformation T₂ yields the input picture x'_i after complex data augmentation; x_i is fed into networks F and G_self and the output is normalized with the ℓ2 norm to obtain the feature v_i of the input picture; x'_i is fed into networks F and G_self and the output is normalized with the ℓ2 norm to obtain the feature v'_i of the input picture:

The self-supervised loss function is:

6. The method for identifying the long tail image based on the self-supervision and the self-distillation as claimed in claim 3, wherein the self-distillation stage uses a double-head self-distillation algorithm:

a training picture x is obtained by sampling the long-tail distributed picture data set, and a picture x' is obtained from it by data augmentation through image transformation; with the original label of the picture denoted y, the pseudo label of picture x is first obtained through network R:

wherein the formula is defined in terms of the features extracted by network R, the class prediction obtained by network R, the weights of the classifier of network R, the prediction for the n-th class, and the n-th element of the pseudo label; T denotes the temperature parameter controlling the smoothness of the distribution, and c denotes the number of classes;

the data-augmented picture x' is fed into the deep neural network S re-initialized in the present stage to obtain the prediction outputs z^hard and z^soft of the two classifiers:

f＝F_S(x′)

z^hard＝H_hard(f)

z^soft＝H_soft(f)

using z^soft and the soft label, the loss function of the distillation part is calculated:

wherein T is the temperature parameter controlling the smoothness of the distribution; using z^hard and the original label y, the loss function of the ordinary classification is calculated:

L＝λ₁L_kd+λ₂L_ce
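The double-head loss described above can be sketched as follows. The formula images for the pseudo label, L_kd and L_ce are not reproduced in this text, so the sketch relies on stated assumptions: the soft label is taken to be a temperature-scaled softmax of the logits of network R, L_kd is the KL divergence from the soft label to the temperature-scaled softmax of z^soft, and L_ce is the standard softmax cross entropy on z^hard. The function names, the default values of T, λ₁ and λ₂, and the absence of any T² scaling factor are illustrative choices, not values taken from the patent.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by the temperature T."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def kd_loss(teacher_logits, z_soft, temperature):
    """KL(soft label || soft-head prediction), both softened by T (assumed form)."""
    log_p = log_softmax(teacher_logits, temperature)  # soft label from network R
    log_q = log_softmax(z_soft, temperature)          # output of head H_soft
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def ce_loss(z_hard, y):
    """Cross entropy between the output of head H_hard and the original label y."""
    return -log_softmax(z_hard)[y]


def total_loss(teacher_logits, z_soft, z_hard, y,
               temperature=2.0, lam_kd=1.0, lam_ce=1.0):
    """L = lam_kd * L_kd + lam_ce * L_ce; default values are placeholders."""
    return (lam_kd * kd_loss(teacher_logits, z_soft, temperature)
            + lam_ce * ce_loss(z_hard, y))
```

When z^soft equals the logits of network R, the distillation term is zero, so only the hard-label head contributes to the loss; a larger T flattens both distributions and gives non-target classes more weight.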