CN113837238A - Long-tail image identification method based on self-supervision and self-distillation - Google Patents

Info

Publication number
CN113837238A
Authority
CN
China
Prior art keywords
self
network
training
stage
supervision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111026141.7A
Other languages
Chinese (zh)
Other versions
CN113837238B (en
Inventor
王利民
李天昊
武港山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A long-tail image recognition method based on self-supervision and self-distillation, in which a multi-stage training framework is constructed to train a feature extraction network. In the first stage, the feature extraction network is trained with self-supervision under long-tail distribution sampling. In the second stage, the weights of the first-stage feature extraction network are kept, the classifier of the feature extraction network is fine-tuned under class-balanced sampling, and soft labels for self-distillation are generated. In the third stage, the weights obtained before self-distillation are discarded, and the feature extraction network is trained by joint self-distillation training under the long-tail distribution with the soft labels as supervision. The resulting feature extraction network is used for image recognition and classification under long-tail distributions. The invention provides a multi-stage training method that applies self-supervision and self-distillation to a feature extraction network for long-tail data: the self-supervised method lets the network characterize the tail classes fully, and the self-distillation method effectively transfers knowledge of the head classes to the tail classes.

Description

Long-tail image identification method based on self-supervision and self-distillation
Technical Field
The invention belongs to the technical field of computer software, relates to an image classification technology, and particularly relates to a long-tail image identification method based on self-supervision and self-distillation.
Background
Recently, deep learning has made significant progress in visual recognition in the image and video domains by training powerful neural networks on large-scale, class-balanced, carefully curated labeled datasets such as ImageNet and Kinetics. Unlike these artificially balanced datasets, real-world data generally follows a long-tailed distribution, which makes balanced datasets hard to collect: for classes with few naturally occurring samples, gathering a large number of training samples is very costly and often infeasible. However, because the data distribution is extremely imbalanced, learning directly from long-tailed data leads to a large drop in performance.
A common approach to mitigating the performance drop caused by long-tailed training data is class rebalancing, which includes sampling strategies that rebalance the training data and loss functions that reweight samples according to class. These methods effectively reduce the dominance of the head classes during training and therefore produce more accurate classification decision boundaries. However, because the data distribution is artificially distorted, over-parameterized deep networks fit this synthetic distribution very easily, and so these methods often risk overfitting the tail classes. To address these problems, Kang et al. separated the tasks of representation learning and classifier training and designed a two-stage training scheme (Kang B, Xie S, Rohrbach M, et al.). This two-stage scheme first learns the visual representation under the original data distribution and then trains a linear classifier on the frozen features under class-balanced sampling. This simple scheme was shown to address the overfitting problem and achieved the best results of its time on common long-tailed benchmarks. However, it does not handle the imbalanced label distribution well, especially in the representation learning stage, so the learned features do not represent tail-class samples well.
Based on the above analysis, the present invention aims to design a new learning paradigm for long-tail visual recognition that combines the advantages of both kinds of long-tail recognition method, namely robustness to overfitting and effective handling of the imbalanced label problem.
Disclosure of Invention
The problem the invention aims to solve is as follows: objects in nature follow a long-tail distribution, and learning directly from long-tail distributed data causes a model to attend only to the head classes and ignore the tail classes.
The technical scheme of the invention is as follows: a long-tail image recognition method based on self-supervision and self-distillation, in which a multi-stage training framework is constructed to train the feature extraction network and the classifier of a deep neural network. In the first stage, the feature extraction network is trained with a self-supervised task under long-tail distribution sampling. In the second stage, the weights of the first-stage feature extraction network are kept, the classifier is fine-tuned under class-balanced sampling, and soft labels for self-distillation are generated. In the third stage, a deep neural network of the same structure is trained from scratch under the long-tail distribution by joint self-distillation training, with the second-stage soft labels as supervision. The resulting deep neural network is used for image recognition and classification under long-tail distributions.
Further, the invention comprises the following steps:
1) Preparation stage: prepare a long-tail distributed image data set and a deep neural network for training, the deep neural network consisting of a feature extraction network and a classifier, and randomly initialize the parameters of the deep neural network;
2) Self-supervision-guided feature training stage: under the long-tail distributed data, train the feature extraction network with a supervised task and a self-supervised task simultaneously;
3) Soft label generation stage: sample data in a class-balanced manner, fix the weight parameters of the feature extraction network trained in step 2), and fine-tune the classifier; after fine-tuning, the network serves as the teacher network, and its predictions on the training samples are output as the soft labels for step 4);
4) Self-distillation stage: train a new deep neural network from scratch under the original long-tail distributed data, its feature extraction network having the same network structure as that in step 1), with the soft labels obtained in step 3) and the true labels supervising training simultaneously;
5) Classifier fine-tuning stage: sample data in a class-balanced manner, keep the feature extraction network parameters trained in step 4) fixed, and fine-tune the classifier to obtain the final deep neural network;
6) Testing stage: test on a class-balanced data set to check the picture recognition capability of the deep neural network.
As a preferred embodiment, the multi-stage training is specifically:
Self-supervised feature training stage: a deep neural network D is prepared for training, comprising a feature extraction network F and a classifier $G_{sup}$. A training picture is sampled from the long-tail distributed picture data set and fed into the feature extraction network to obtain the feature f of the picture; the feature f is fed into the classifier $G_{sup}$ to obtain the class prediction, and the classification loss is computed against the true label. A network module for the self-supervised task is randomly initialized; the feature f is fed into this module to obtain an output, from which the self-supervised loss is computed. The classification loss and the self-supervised loss are added to form the final loss, the feature extraction network is optimized by stochastic gradient descent, and the process is iterated until the set number of iterations is reached;
Soft label generation stage: the training data are sampled by a class-balanced method, and the feature extraction network F and the classifier $G_{sup}$ obtained in the self-supervised feature training stage are retrained. The training task is classification and the loss is the cross-entropy loss. The weights of the feature extraction network F are fixed, and the weight of each class in the classifier is fine-tuned through a set of learnable parameters; training is iterated until the set number of iterations is reached. The deep neural network trained at this stage is called network R;
Self-distillation stage: a new deep neural network S is initialized, consisting of a feature extraction network $F_S$ and two linear classifiers $H_{hard}$ and $H_{soft}$, where $F_S$ has the same network structure as the feature extraction network F. Training pictures are sampled from the long-tail distributed picture data set. Each training picture is fed into network R, whose prediction is the soft label; the training picture is also fed into the deep neural network S, whose two classifiers each output a classification result. The two results are supervised by the soft label and by the original label of the picture, with the KL divergence and the cross-entropy as the respective loss functions, and training is iterated until the set number of iterations is reached.
Further, a classifier fine-tuning stage is also provided: the hard-label-supervised classifier in the deep neural network S obtained from the self-distillation stage is fine-tuned under class-balanced data sampling to give the final classification result.
The present invention proposes a conceptually simple but particularly effective multi-stage training scheme consisting of two parts. First, the invention introduces a self-distillation framework for long-tail recognition that can automatically mine the relationships between labels. Second, it provides a new self-supervision-guided module for generating distillation labels; the distillation labels integrate information from both the label domain and the data domain, so the long-tail distribution can be modeled effectively.
Compared with the prior art, the invention has the following advantages:
The invention provides a multi-stage long-tail image recognition training method using self-supervision and self-distillation, which characterizes the tail classes fully through the self-supervised method and effectively transfers knowledge of the head classes to the tail classes through the self-distillation method.
Compared with prior-art methods that rely on manually designed class-balanced training strategies, the invention trains the feature extraction network under the long-tail distribution and does not artificially distort the original distribution, so overfitting to the tail classes and underfitting to the head classes can be avoided. Compared with the existing two-stage long-tail recognition training method, the invention adds the soft labels output by network R in the self-distillation stage, which introduces class-balanced modeling into the feature training stage and yields a more robust representation.
The invention achieves results significantly better than the prior art on public long-tailed image recognition datasets.
Drawings
FIG. 1 is a system framework diagram used by the present invention.
Detailed Description
Deep learning has made significant progress in visual recognition on large-scale balanced datasets, but still underperforms on real-world long-tailed data. The prior art typically employs a class-rebalancing training strategy to alleviate the imbalance problem, but this risks overfitting the tail classes. Recently proposed decoupling methods overcome the overfitting problem with a multi-stage training scheme, but still fail to capture tail-class information in the feature learning stage. In the invention, soft labels serve as an effective solution: label correlation is incorporated into a multi-stage training scheme for long-tail recognition, and the relationships between classes expressed by the soft labels benefit long-tail recognition by transferring knowledge from the head classes to the tail classes.
As shown in FIG. 1, the invention constructs a multi-stage training framework for training the feature extraction network and the classifier of a deep neural network. In the first stage, the feature extraction network is trained with a self-supervised task under long-tail distribution sampling. In the second stage, the weights of the first-stage feature extraction network are kept, the classifier is fine-tuned under class-balanced sampling, and soft labels for self-distillation are generated. In the third stage, a deep neural network of the same structure is trained from scratch under the long-tail distribution by joint self-distillation training, with the second-stage soft labels as supervision. The resulting deep neural network is used for image recognition and classification under long-tail distributions. The method comprises the following steps:
1) Preparation stage: prepare a long-tail distributed image data set and a deep neural network for training, the deep neural network consisting of a feature extraction network and a classifier, and randomly initialize the parameters of the deep neural network;
2) Self-supervision-guided feature training stage: under the long-tail distributed data, train the feature extraction network with a supervised task and a self-supervised task simultaneously;
3) Soft label generation stage: sample data in a class-balanced manner, fix the weight parameters of the feature extraction network trained in step 2), and fine-tune the classifier; after fine-tuning, the network serves as the teacher network, and its predictions on the training samples are output as the soft labels for step 4);
4) Self-distillation stage: train a new deep neural network from scratch under the original long-tail distributed data, its feature extraction network having the same network structure as that in step 1), with the soft labels obtained in step 3) and the true labels supervising training simultaneously;
5) Classifier fine-tuning stage: sample data in a class-balanced manner, keep the feature extraction network parameters trained in step 4) fixed, and fine-tune the classifier to obtain the final deep neural network.
The following is a detailed description.
1) Preparation stage: prepare a data set for training, divided into a training set and a test set. The training-set samples follow a long-tail distribution: a few classes have many samples and many classes have few samples. Every class of the test set has the same number of samples, and the number of samples per class is generally small. Prepare a deep neural network for training, denoted D, consisting of a feature extraction network and a classifier. The feature extraction network can be chosen from common deep backbone networks such as ResNet, ResNeXt, or VGGNet and is written as the transformation F; the classifier is a fully connected layer and is written as the transformation $G_{sup}$. The parameters of the deep neural network are randomly initialized.
2) Self-supervision-guided feature training stage: randomly initialize a network module for the self-supervised task, denoted $G_{self}$; the specific form of the module depends on the self-supervised task. Sample from the data under its long-tail distribution to obtain a training picture x. Feed the picture into the feature extraction network to obtain its feature $f = F(x)$, and feed the feature f into the classifier to obtain the class prediction $z = G_{sup}(f) \in \mathbb{R}^{1 \times c}$, where c is the number of classes. With y the true class of the picture, the classification loss is computed from the true label:
$$L_{sup} = -\log \frac{\exp(z_y)}{\sum_{j=1}^{c} \exp(z_j)}$$
Feeding the feature into the network module of the self-supervised task gives the output $u = G_{self}(f)$, from which the self-supervised loss $L_{self}$ is computed. The two losses are added with weights $\alpha_1$ and $\alpha_2$ to give the final loss L:
$$L = \alpha_1 L_{sup} + \alpha_2 L_{self}$$
The network is optimized by stochastic gradient descent, and this process is iterated until the set number of iterations is reached.
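The weighted loss of this stage can be illustrated with a small, framework-free sketch. It assumes the classification loss is the softmax cross-entropy and treats the self-supervised loss as an already computed number; the function names and the default weights of 1.0 are illustrative assumptions, not part of the patent text.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def stage1_loss(class_logits, label, self_sup_loss, alpha1=1.0, alpha2=1.0):
    """Weighted sum L = alpha1 * L_sup + alpha2 * L_self (weights are assumed values)."""
    return alpha1 * cross_entropy(class_logits, label) + alpha2 * self_sup_loss

# Example: three classes, true class 0, and a placeholder self-supervised loss.
loss = stage1_loss([2.0, 0.5, -1.0], label=0, self_sup_loss=0.7)
```

In an actual training loop the same sum would be formed on tensors so that gradients flow to the feature extraction network from both terms.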
The self-supervised task can be either rotation-angle prediction or instance discrimination, as follows:
2.1) Rotation-angle prediction task: a picture x is rotated by an angle chosen at random from {0°, 90°, 180°, 270°} to obtain the rotated picture x', and the network predicts the rotation angle. In this task the self-supervised network module $G_{self}$ is a linear classifier implemented as a fully connected layer, with output $u \in \mathbb{R}^{1 \times 4}$. Letting r be the index of the picture's rotation angle among the four angles, the self-supervised loss is:
$$L_{self} = -\log \frac{\exp(u_r)}{\sum_{j=1}^{4} \exp(u_j)}$$
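As a rough illustration of how a training pair for the rotation-angle prediction task could be produced, the sketch below rotates a picture stored as a list of rows and returns the rotation index as the label. The helper names and the counter-clockwise direction are assumptions made for the example.

```python
import random

def rotate90(img, k):
    """Rotate a 2-D list-of-rows image counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # transpose, then reverse the row order: one counter-clockwise quarter turn
        img = [list(row) for row in zip(*img)][::-1]
    return img

def make_rotation_sample(img, rng=random):
    """Pick r in {0, 1, 2, 3} (0, 90, 180, 270 degrees) and return (rotated image, r)."""
    r = rng.randrange(4)
    return rotate90(img, r), r

picture = [[1, 2],
           [3, 4]]
rotated, r = make_rotation_sample(picture, random.Random(0))
```

The label r would then be used as the target class of the four-way classifier.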
2.2) Instance discrimination task: the self-supervised network module $G_{self}$ is implemented as a multi-layer perceptron. The structure and weights of the current deep neural network are cloned to create a new deep neural network M. During training, the parameters of M are updated by momentum from the weights of network D; with the momentum coefficient denoted m, the update formula is:
$$M = m \cdot M + (1 - m) \cdot D$$
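The momentum update can be sketched on plain lists of parameter values. The flat-list representation and the function name are assumptions for illustration; the default m = 0.999 is the value used in the embodiment.

```python
def momentum_update(m_params, d_params, m=0.999):
    """Return the updated parameters of network M: m * M + (1 - m) * D, element-wise."""
    return [m * pm + (1.0 - m) * pd for pm, pd in zip(m_params, d_params)]

# Example: M moves only slightly towards D at each step.
m_params = [1.0, -2.0]
d_params = [0.0, 0.0]
m_params = momentum_update(m_params, d_params, m=0.9)
```

With m close to 1, network M changes slowly, so the features it produces for negative samples stay consistent across iterations.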
For the i-th picture, the transformation $T_1$ (simple data augmentation) gives the input picture $x_i$, and the transformation $T_2$ (complex data augmentation) gives the input picture $x'_i$. $x_i$ is fed through the networks F and $G_{self}$, and the output is normalized by its l-2 norm to give the feature $v_i$. $x'_i$ is fed through the corresponding networks of the momentum-updated network M, written here as $F^{M}$ and $G^{M}_{self}$, and the output is normalized by its l-2 norm to give the feature $v'_i$:
$$v_i = \frac{G_{self}(F(x_i))}{\lVert G_{self}(F(x_i)) \rVert_2}$$
$$v'_i = \frac{G^{M}_{self}(F^{M}(x'_i))}{\lVert G^{M}_{self}(F^{M}(x'_i)) \rVert_2}$$
The self-supervised loss is then:
$$L_{self} = -\log \frac{\exp(v_i \cdot v'_i / \tau)}{\exp(v_i \cdot v'_i / \tau) + \sum_{k=1}^{K} \exp(v_i \cdot v'_k / \tau)}$$
where $v'_k$ are the features of other pictures output by the network M, called negative samples, K is the number of negative samples, and $\tau$ is a hyper-parameter controlling the temperature.
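A minimal sketch of an instance discrimination loss of this kind is given below, assuming the common contrastive form in which the positive pair is scored against K negative samples through a temperature-scaled softmax. The function names are illustrative; tau = 0.2 is the value used in the embodiment.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit l-2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def instance_discrimination_loss(v, v_pos, negatives, tau=0.2):
    """-log of the softmax probability of the positive pair among positive + negatives."""
    logits = [dot(v, v_pos) / tau] + [dot(v, n) / tau for n in negatives]
    m = max(logits)  # stabilise the log-sum-exp
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]

v = l2_normalize([3.0, 4.0])            # feature of the simply augmented view
v_pos = l2_normalize([3.0, 4.1])        # feature of the strongly augmented view
negatives = [l2_normalize([-1.0, 2.0]), l2_normalize([4.0, -3.0])]
loss = instance_discrimination_loss(v, v_pos, negatives)
```

The loss is small when the two views of the same picture are close and the negatives are far away in feature space.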
3) Soft label generation stage: the training data are sampled in a class-balanced manner using two-step sampling: a class is first chosen at random by a uniform sampler, and a sample is then chosen at random by a uniform sampler from the samples belonging to the chosen class. The feature extraction network F and the classifier $G_{sup}$ of the deep neural network obtained in step 2) are retrained. During training the weights of the feature extraction network are fixed, and a parameter $s_i$ is introduced to adjust the scale of the weights in the classifier. With $w_i$ the original weight of the i-th class in the classifier, the rescaled weight is:
$$\tilde{w}_i = s_i \cdot w_i$$
During training the original weights $w_i$ are kept unchanged, and only the values of the parameters $s_i$ are updated by gradient-based optimization. For a picture x, the class prediction is obtained with the adjusted classifier:
$$f = F(x)$$
$$\tilde{z}_i = \tilde{w}_i^{\top} f, \quad i = 1, \dots, c$$
where c is the number of classes. The loss is again the cross-entropy loss; with y the correct class, it is:
$$L = -\log \frac{\exp(\tilde{z}_y)}{\sum_{j=1}^{c} \exp(\tilde{z}_j)}$$
The network is optimized by stochastic gradient descent, and this process is iterated until the set number of iterations is reached. The deep neural network trained at this stage is called network R.
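The classifier adjustment of this stage can be sketched as follows, assuming one scale parameter per class, no bias term, and a single-sample gradient step on the cross-entropy loss. The function names are illustrative, and the learning rate 0.2 is the value given in the embodiment for the scale coefficients.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_logits(weights, scales, feature):
    """Logit of class i is s_i * (w_i . f); the original weights w_i stay fixed."""
    return [s * dot(w, feature) for w, s in zip(weights, scales)]

def update_scales(weights, scales, feature, label, lr=0.2):
    """One gradient step on the scales only: dL/ds_i = (p_i - 1[i == label]) * (w_i . f)."""
    raw = [dot(w, feature) for w in weights]
    probs = softmax([s * a for s, a in zip(scales, raw)])
    grads = [(p - (1.0 if i == label else 0.0)) * a
             for i, (p, a) in enumerate(zip(probs, raw))]
    return [s - lr * g for s, g in zip(scales, grads)]

weights = [[1.0, 0.0], [0.0, 1.0]]   # frozen classifier weights, one row per class
scales = [1.0, 1.0]                  # learnable per-class scales
feature = [1.0, 1.0]                 # frozen feature f = F(x)
new_scales = update_scales(weights, scales, feature, label=0)
```

Because only the scales change, this stage can rebalance the decision boundaries between head and tail classes without altering the learned features.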
4) Self-distillation stage: a new deep neural network S is randomly initialized, consisting of a feature extraction network $F_S$ and two linear classifiers $H_{hard}$ and $H_{soft}$; the structure of $F_S$ is the same as that of the feature extraction network prepared in step 1). The data are sampled under the original long-tail distribution to obtain a training picture x, data augmentation by image transformation gives x', and the original label of the picture is y. First, the pseudo label $\tilde{y}$ of the picture x is obtained from the network R trained in step 3):
$$\tilde{f} = F(x)$$
$$\tilde{z}_n = \tilde{w}_n^{\top} \tilde{f}, \quad n = 1, \dots, c$$
$$\tilde{y}_n = \frac{\exp(\tilde{z}_n / T)}{\sum_{k=1}^{c} \exp(\tilde{z}_k / T)}$$
where $\tilde{f}$ is the feature extracted by the network R, $\tilde{z}$ is the class prediction of the network R, $\tilde{w}$ is the weight of the classifier of the network R, $\tilde{z}_n$ is the prediction for the n-th class, and $\tilde{y}_n$ is the n-th element of the pseudo label. T is the temperature, a hyper-parameter manually set to 2, which adjusts the distribution given by the formula above: the larger T is, the smoother the distribution. c is the number of classes.
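The effect of the temperature on the pseudo label can be seen in a short sketch, which assumes the pseudo label is a softmax over the teacher predictions divided by T. The function name is illustrative.

```python
import math

def soft_label(logits, T=2.0):
    """Temperature-scaled softmax: element n is exp(z_n / T) / sum_k exp(z_k / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 1.0, 0.0]
sharp = soft_label(teacher_logits, T=1.0)   # ordinary softmax
smooth = soft_label(teacher_logits, T=2.0)  # flatter distribution used as the pseudo label
```

A larger T moves probability mass from the top class to the other classes, which is what lets the soft label carry information about relationships between classes.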
The data-augmented picture x' is fed into the deep neural network S re-initialized at this stage, giving the predicted outputs $z_{hard}$ and $z_{soft}$ of the two classifiers:
$$f = F_S(x')$$
$$z_{hard} = H_{hard}(f)$$
$$z_{soft} = H_{soft}(f)$$
The self-distillation loss is computed from $z_{soft}$ and the soft label $\tilde{y}$ as a KL divergence:
$$L_{kd} = \sum_{n=1}^{c} \tilde{y}_n \log \frac{\tilde{y}_n}{\exp(z^{n}_{soft} / T) \big/ \sum_{k=1}^{c} \exp(z^{k}_{soft} / T)}$$
where T is the temperature parameter that controls the smoothness of the distribution, and $z^{n}_{soft}$ and $z^{k}_{soft}$ are the predictions output by the classifier $H_{soft}$ for the n-th and k-th classes.
The ordinary classification loss is computed from $z_{hard}$ and the original label y:
$$L_{ce} = -\log \frac{\exp(z^{y}_{hard})}{\sum_{j=1}^{c} \exp(z^{j}_{hard})}$$
The two losses are fused with weights $\lambda_1$ and $\lambda_2$ to give the final loss of this stage:
$$L = \lambda_1 L_{kd} + \lambda_2 L_{ce}$$
The network is optimized by stochastic gradient descent, and this process is iterated until the set number of iterations is reached.
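The joint loss of the self-distillation stage can be sketched as below. The sketch assumes the distillation term is the KL divergence from the soft label to the temperature-scaled softmax of the soft classifier output, without any additional scaling factor, and that the classification term is the ordinary cross-entropy; the function names and the default weights of 1.0 are illustrative assumptions.

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(soft_label, z_soft, T=2.0):
    """KL(soft_label || softmax(z_soft / T)), skipping zero-probability terms."""
    q = softmax(z_soft, T)
    return sum(p * math.log(p / qn) for p, qn in zip(soft_label, q) if p > 0.0)

def ce_loss(z_hard, y):
    """Cross-entropy of the hard classifier output against the original label y."""
    return -math.log(softmax(z_hard)[y])

def distill_loss(soft_label, z_soft, z_hard, y, T=2.0, lam1=1.0, lam2=1.0):
    """L = lam1 * L_kd + lam2 * L_ce."""
    return lam1 * kd_loss(soft_label, z_soft, T) + lam2 * ce_loss(z_hard, y)

teacher_logits = [3.0, 1.0, 0.2]
soft = softmax(teacher_logits, T=2.0)  # pseudo label from the teacher network
loss = distill_loss(soft, z_soft=[2.0, 1.5, 0.1], z_hard=[2.5, 0.3, 0.1], y=0)
```

Keeping two separate classifier heads means the soft-label supervision and the hard-label supervision shape the shared features without competing for the same output layer.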
5) Classifier fine-tuning stage: the classifier $H_{hard}$, supervised by hard labels (the original labels), in the deep neural network S obtained from the self-distillation stage is fine-tuned under class-balanced data sampling. The training pictures are sampled in the same class-balanced manner as in the soft label generation stage, and the feature extraction network $F_S$ and the classifier $H_{hard}$ of the deep neural network obtained in step 4) are retrained. During training the weights of the feature extraction network $F_S$ are fixed, and a parameter $s_i$ is introduced to adjust the scale of the weights in the classifier. With $h_i$ the original weight of the i-th class in the classifier, the rescaled weight is:
$$\tilde{h}_i = s_i \cdot h_i$$
During training the original weights $h_i$ are kept unchanged, and only the values of the parameters $s_i$ are updated by gradient-based optimization. For a picture x, the class prediction is obtained with the adjusted classifier:
$$f = F_S(x)$$
$$\tilde{z}_i = \tilde{h}_i^{\top} f, \quad i = 1, \dots, c$$
where c is the number of classes. The loss is again the cross-entropy loss; with y the correct class, it is:
$$L = -\log \frac{\exp(\tilde{z}_y)}{\sum_{j=1}^{c} \exp(\tilde{z}_j)}$$
The network is optimized by stochastic gradient descent, and this process is iterated until the set number of iterations is reached.
6) Testing stage: testing uses the test set constructed in step 1), in which every class has the same number of pictures, i.e. the test set is class-balanced. The test pictures are fed into the network obtained in step 5) for prediction, the predictions are compared with the correct classes of the pictures to obtain the prediction accuracy, and this accuracy is used to check whether the picture recognition capability of the deep neural network meets the accuracy requirement.
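A minimal sketch of the accuracy computation follows, assuming top-1 accuracy, i.e. the predicted class is the one with the highest score. The function names and the toy scores are illustrative.

```python
def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def top1_accuracy(predicted, labels):
    """Fraction of samples whose predicted class equals the correct class."""
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / len(labels)

scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]  # toy network outputs
labels = [1, 0, 0, 0]
preds = [argmax(s) for s in scores]
acc = top1_accuracy(preds, labels)
```

Because the test set is class-balanced, this overall accuracy weights every class equally, so head classes cannot dominate the score.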
The practice of the invention is illustrated by the following specific examples.
Training uses pictures from the ImageNet-LT dataset, and the implementation uses the Python 3 programming language and the PyTorch 1.6 deep learning framework.
FIG. 1 is the system framework diagram used by the invention; the corresponding implementation is as follows:
1) In the preparation stage, the data set ImageNet-LT used for training and testing is constructed. The data set has 1000 classes, and the class distribution follows a Pareto distribution with coefficient 6. The training set contains 120,000 pictures, with the number of pictures per class ranging from 1280 down to 5; the test set contains 50,000 pictures, 50 for each class. The neural network required for training is prepared: the feature extraction network is ResNeXt-50, whose output feature has 2048 dimensions; the classifier is a fully connected layer with input feature dimension 2048 and output dimension 1000. The parameters of the neural network are randomly initialized.
2) In the self-supervision-guided feature training stage, instance discrimination is adopted as the self-supervised task. The network module of the self-supervised task is a multi-layer perceptron, namely a fully connected layer, a ReLU non-linear activation layer, and a fully connected layer; the input feature dimension and the hidden feature dimension are both 2048, and the output feature dimension is 128. A network with exactly the same structure and parameters as the current network is constructed and updated by momentum during training, with momentum parameter m = 0.999 and temperature parameter τ = 0.2. Each picture is passed through the transformations $T_1$ and $T_2$ to obtain two augmented pictures, where $T_1$ consists of random resizing, random cropping, random horizontal flipping, and normalization, and $T_2$ consists of random resizing, random cropping, random color transformation, random grayscale conversion, random Gaussian blur, random horizontal flipping, and normalization. The picture transformed by $T_1$ is fed into the original network to obtain a 1000-dimensional classification prediction vector and a 128-dimensional feature vector, and the picture transformed by $T_2$ is fed into the momentum-updated network to obtain a 128-dimensional feature vector. The classification loss is computed from the classification prediction vector and the self-supervised loss from the feature vectors, and the two losses are fused in a 1:1 ratio to give the final loss. Training uses the stochastic gradient descent algorithm on 8 TITAN Xp GPUs, with a batch size of 256, 135 training epochs, and a learning rate of 0.1 decayed by a cosine function.
3) In the soft label generation stage, the self-supervised module from step 2) is discarded, and the feature extraction network and the classifier are retained. The training data is sampled by a class-balanced method: a category is first selected at random by a uniform sampler, and a sample is then selected at random by a uniform sampler from the samples belonging to that category. The feature extraction network and the classifier of the deep neural network obtained in step 2) are then retrained. Each sampled picture is augmented by random resizing, random cropping, random horizontal flipping and normalization, and is fed in turn through the feature extraction network and the coefficient-adjusted classifier to obtain a class prediction, from which the loss function is computed against the true class. Training uses the stochastic gradient descent algorithm on 8 TITAN Xp GPUs with a batch size of 512 for 5 epochs; the learning rate of the coefficients that adjust the classifier is 0.2, the learning rate of all other parts (the feature extraction network and the original parameters of the classifier) is 0, and the learning rate is decayed with a cosine function.
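The two-step class-balanced sampling used in this stage can be sketched as follows. This is a minimal sketch in plain Python; the helper names `build_class_index` and `class_balanced_sample` are hypothetical and do not come from the patent.

```python
import random
from collections import defaultdict

def build_class_index(labels):
    """Map each class label to the list of sample indices that carry it."""
    index = defaultdict(list)
    for i, y in enumerate(labels):
        index[y].append(i)
    return dict(index)

def class_balanced_sample(class_index, rng):
    """Pick a class uniformly, then pick a sample uniformly within that class."""
    cls = rng.choice(sorted(class_index))
    return rng.choice(class_index[cls])
```

Under this scheme every class is drawn with the same probability regardless of its size, so a class with 5 pictures is visited as often as a class with 1280 pictures, and its few pictures are simply repeated more often.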
4) In the self-distillation stage, a new deep neural network is randomly initialized. Its feature extraction network is again ResNeXt-50, and it has two linear classifiers, each a fully connected layer with an input dimension of 2048 and an output dimension of 1000. Training pictures are sampled from the original long-tailed data and augmented by random resizing, random cropping, random horizontal flipping and normalization. Each picture is fed into the network trained in step 3) to obtain a prediction, which is scaled by the temperature T = 2 and normalized with softmax to obtain the pseudo label. The picture is also fed into the new feature extraction network to obtain intermediate features, which are fed into the two classifiers to obtain two predictions. The self-distillation loss is computed from the first prediction and the pseudo label, the classification loss is computed from the second prediction and the true label, and the two losses are fused in a 1:1 ratio to obtain the final loss function. Training uses the stochastic gradient descent algorithm on 8 TITAN Xp GPUs with a batch size of 256 for 135 epochs; the learning rate is 0.1 and is decayed with a cosine function.
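The two-classifier loss of this stage can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the function names are hypothetical, the first classifier's prediction is assumed to be softened with the same temperature T as the pseudo label, and no extra scaling factor (such as T squared) is applied to the distillation term, since the text does not specify either point.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(teacher_logits, z_soft, z_hard, y, T=2.0, w_kd=1.0, w_ce=1.0):
    """Fuse a KL distillation loss and a cross-entropy loss (1:1 by default).

    teacher_logits, z_soft, z_hard: (B, C) arrays; y: (B,) integer labels.
    """
    eps = 1e-12
    soft_label = softmax(teacher_logits, T)   # pseudo label from the step 3) network
    student = softmax(z_soft, T)              # assumption: same temperature for the student
    l_kd = np.sum(soft_label * (np.log(soft_label + eps) - np.log(student + eps)), axis=-1).mean()
    p_hard = softmax(z_hard)
    l_ce = -np.log(p_hard[np.arange(len(y)), y] + eps).mean()
    return w_kd * l_kd + w_ce * l_ce, l_kd, l_ce
```

A temperature above 1 flattens the pseudo label, so the relative scores the step 3) network assigns to non-target classes contribute more to the distillation term than they would with T = 1.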
5) In the classifier fine-tuning stage, the training data is sampled by a class-balanced method: a category is first selected at random by a uniform sampler, and a sample is then selected at random by a uniform sampler from the samples belonging to that category. The feature extraction network and the classifier of the deep neural network obtained in step 4) are then retrained. Each sampled picture is augmented by random resizing, random cropping, random horizontal flipping and normalization, and is fed in turn through the feature extraction network and the coefficient-adjusted classifier to obtain a class prediction, from which the loss function is computed against the true class. Training uses the stochastic gradient descent algorithm on 8 TITAN Xp GPUs with a batch size of 512 for 5 epochs; the learning rate of the coefficients that adjust the classifier is 0.2, the learning rate of all other parts (the feature extraction network and the original parameters of the classifier) is 0, and the learning rate is decayed with a cosine function.
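Fine-tuning only the adjusting coefficients while all other learning rates are 0 can be sketched as follows. The text does not say what form the coefficients take; this sketch assumes one multiplicative scale per class applied to the frozen classifier output, and the function name `sgd_step_on_scale` is hypothetical.

```python
import numpy as np

def sgd_step_on_scale(features, labels, W, b, scale, lr=0.2):
    """One SGD step on the per-class scale; W and b stay frozen.

    features: (B, d), labels: (B,), W: (C, d), b: (C,), scale: (C,).
    Returns the updated scale and the cross-entropy loss before the step.
    """
    raw = features @ W.T + b                      # frozen classifier output, (B, C)
    z = scale * raw                               # coefficient-adjusted logits
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    rows = np.arange(len(labels))
    loss = -np.log(p[rows, labels]).mean()
    onehot = np.eye(W.shape[0])[labels]
    # dL/dscale_c = mean over the batch of (p_c - onehot_c) * raw_c
    grad_scale = ((p - onehot) * raw).mean(axis=0)
    return scale - lr * grad_scale, loss
```

Because only the C scale values receive gradients, the features learned in the previous stage are left untouched, which matches the learning rate of 0 given for the feature extraction network and the original classifier parameters.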
6) In the testing stage, the test set constructed in step 1) is used, in which every category has the same number of pictures. The pictures are fed into the network obtained in step 5) for prediction, and the predictions are compared with the correct categories to obtain the accuracy. The accuracy on the entire test set is 56.0%; the accuracy is 66.8% for categories with many training samples, 53.1% for categories with a medium number of training samples, and 35.4% for categories with few training samples. Compared with the baseline method, these accuracies are improved by 3.9%, 3.4%, 4.5% and 3.1%, respectively.
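The overall and per-group accuracies reported above can be computed as in the following sketch. The text does not state how categories are assigned to the many, medium and few groups; the thresholds used here (more than 100 training pictures, 20 to 100, fewer than 20) are an assumption, and the function name `split_accuracy` is hypothetical.

```python
import numpy as np

def split_accuracy(y_true, y_pred, train_counts, many_thr=100, few_thr=20):
    """Accuracy on all test pictures and on three groups of categories.

    train_counts[c] is the number of training pictures of category c.
    The group thresholds are assumed, not taken from the patent text.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    counts = np.asarray(train_counts)[y_true]   # training count of each test picture's category
    correct = y_true == y_pred
    groups = {
        "all": np.ones_like(correct, dtype=bool),
        "many": counts > many_thr,
        "medium": (counts <= many_thr) & (counts >= few_thr),
        "few": counts < few_thr,
    }
    return {k: float(correct[m].mean()) if m.any() else float("nan")
            for k, m in groups.items()}
```

Because every category has the same number of test pictures, the overall accuracy equals the mean of the per-category accuracies.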

Claims (6)

1. A long-tail image recognition method based on self-supervision and self-distillation, characterized in that a multi-stage training framework is constructed for training the feature extraction network and the classifier of a deep neural network: in the first stage, the feature extraction network is trained with a self-supervised task under long-tail distribution sampling; in the second stage, with the feature extraction network weights from the first stage retained, the classifier is fine-tuned under class-balanced sampling and soft labels for self-distillation are generated; in the third stage, a deep neural network with the same structure is retrained under the long-tail distribution, using the soft labels from the second stage as supervision for joint self-distillation training; the resulting deep neural network is used for image recognition and classification under long-tail distributions.
2. The long-tail image recognition method based on self-supervision and self-distillation according to claim 1, characterized by comprising the following steps:
1) a preparation stage: a long-tail distributed image data set and a deep neural network for training are prepared, the deep neural network consisting of a feature extraction network and a classifier, and the parameters of the deep neural network are randomly initialized;
2) a self-supervision-guided feature training stage: on the long-tail distributed data, the feature extraction network is trained with a supervised task and a self-supervised task simultaneously;
3) a soft label generation stage: data is sampled in a class-balanced manner, the weight parameters of the feature extraction network trained in step 2) are fixed, and the classifier is fine-tuned; after fine-tuning, the network serves as the teacher network, and its predictions on the training samples are output as soft labels for step 4);
4) a self-distillation stage: a deep neural network is retrained on the original long-tail distributed data, the deep neural network having a feature extraction network with the same structure as that in step 1), and training is supervised by both the soft labels obtained in step 3) and the true labels;
5) a classifier fine-tuning stage: data is sampled in a class-balanced manner, the feature extraction network parameters trained in step 4) are kept fixed, and the classifier is fine-tuned to obtain the final deep neural network;
6) a testing stage: testing is performed on a class-balanced data set to check whether the picture recognition capability of the deep neural network meets the requirement.
3. The long-tail image recognition method based on self-supervision and self-distillation according to claim 1 or 2, characterized in that the multi-stage training is specifically as follows:
a self-supervised feature training stage: a deep neural network D is prepared for training, comprising a feature extraction network F and a classifier G_sup; a training picture is sampled from the long-tail distributed picture data set and fed into the feature extraction network to obtain the feature f of the picture, and the feature f is fed into the classifier G_sup to obtain a class prediction, from which the classification loss function is computed against the true label; a network module for the self-supervised task is randomly initialized, the feature f is fed into this module to obtain an output, and the self-supervised loss function is computed from the output; the classification and self-supervised loss functions are added to form the final loss function, the feature extraction network is optimized by stochastic gradient descent, and the process is iterated until the set number of iterations is reached;
a soft label generation stage: training pictures are sampled by a class-balanced method, and the feature extraction network F and the classifier G_sup obtained in the self-supervised feature training stage are retrained; the training task is a classification task and the loss function is the cross-entropy loss function; the weights of the feature extraction network F are fixed and the classifier G_sup is fine-tuned through a number of learnable parameters, iterating until the set number of iterations is reached; the deep neural network trained in this stage is called network R;
a self-distillation stage: a new deep neural network S is initialized, consisting of a feature extraction network F_S and two linear classifiers H_hard and H_soft, where F_S has the same structure as the feature extraction network F; training pictures are sampled from the long-tail distributed picture data set; each training picture is first fed into network R, whose output prediction serves as the soft label, and is then fed into the deep neural network S, whose two classifiers each output a classification result; the two classification results are supervised by the soft label and by the original label of the picture, respectively, with the KL divergence and the cross-entropy loss as the respective loss functions; training is iterated until the set number of iterations is reached.
4. The long-tail image recognition method based on self-supervision and self-distillation according to claim 3, characterized by further comprising a classifier fine-tuning stage: the classifier supervised by hard labels in the deep neural network S obtained in the self-distillation stage is fine-tuned under class-balanced data sampling to obtain the final classification result.
5. The long-tail image recognition method based on self-supervision and self-distillation according to claim 3, characterized in that, in the self-supervised feature training stage, the self-supervised task comprises rotation angle prediction and instance discrimination:
rotation angle prediction task: a picture x is rotated by an angle chosen at random from {0°, 90°, 180°, 270°} to obtain a rotated picture x', and the network predicts the rotation angle; in this task the self-supervised network module G_self is a linear classifier implemented as a fully connected layer, whose output is u ∈ R^(1×4); if the rotation angle of the picture is the r-th of the four angles, the self-supervised loss function is:
Figure FDA0003243410600000021
instance discrimination task: the self-supervised network module G_self is implemented as a multilayer perceptron; the structure and weights of the current deep neural network are cloned to generate a new deep neural network M; during training, the parameters of network M are updated from the weights of network D with momentum m, and the update formula is:
M=m·M+(1-m)·D
for the i-th picture, the transformation T1 yields the input picture x_i with simple data augmentation, and the transformation T2 yields the input picture x'_i with complex data augmentation; x_i is fed into the networks F and G_self, and the output is normalized by its l2 norm to obtain the feature v_i of the input picture; x'_i is fed into the corresponding networks F and G_self of network M, and the output is normalized by its l2 norm to obtain the feature v'_i of the input picture:
Figure FDA0003243410600000031
Figure FDA0003243410600000032
The self-supervised loss function is:
Figure FDA0003243410600000033
where v'_k are the features of other pictures output by network M, called negative samples; K is the number of negative samples; and τ is a hyper-parameter controlling the temperature.
6. The long-tail image recognition method based on self-supervision and self-distillation according to claim 3, characterized in that the self-distillation stage uses a dual-head self-distillation algorithm:
a training picture x is sampled from the long-tail distributed picture data set and augmented by image transformation to obtain a picture x'; with the original label of the picture denoted y, the pseudo label [Figure FDA0003243410600000034] of picture x is first obtained through network R:
Figure FDA0003243410600000035
Figure FDA0003243410600000036
Figure FDA0003243410600000037
where [Figure FDA0003243410600000038] is the feature extracted by network R, [Figure FDA0003243410600000039] is the class prediction obtained by network R, [Figure FDA00032434106000000310] is the classifier weight of network R, [Figure FDA00032434106000000311] is the prediction for the n-th class, [Figure FDA00032434106000000312] is the n-th element of the pseudo label, T is a temperature parameter controlling the smoothness of the distribution, and c is the number of categories;
the picture x', likewise augmented, is fed into the deep neural network S re-initialized in this stage to obtain the prediction outputs z_hard and z_soft of the two classifiers:
f = F_S(x′)
z_hard = H_hard(f)
z_soft = H_soft(f)
using z_soft and the soft label [Figure FDA00032434106000000313], the loss function of the self-distillation part is calculated:
Figure FDA00032434106000000314
where T is the temperature parameter controlling the smoothness of the distribution; z_hard and the original label y are used to calculate the ordinary classification loss function:
Figure FDA0003243410600000041
the two loss functions are fused with weights λ_1 and λ_2 to obtain the final loss function of this stage:
L = λ_1·L_kd + λ_2·L_ce
the network is optimized by stochastic gradient descent, and the process is iterated until the set number of iterations is reached.
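The rotation angle prediction task of claim 5 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the function names are hypothetical, pictures are assumed to be square so that all rotated pictures share one shape, and the loss is assumed to be the standard four-way cross entropy on the output u, since the formula itself appears only as an image.

```python
import numpy as np

def rotate_batch(images, rng):
    """Rotate each picture by a random multiple of 90 degrees.

    images: (B, H, W, C) with H == W. Returns the rotated pictures and
    the index r in {0, 1, 2, 3} of the angle applied to each picture.
    """
    r = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, k=int(k), axes=(0, 1))
                        for img, k in zip(images, r)])
    return rotated, r

def rotation_loss(u, r):
    """Four-way cross entropy (assumed form) between logits u (B, 4) and indices r (B,)."""
    u = u - u.max(axis=1, keepdims=True)
    log_p = u - np.log(np.exp(u).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(r)), r].mean()
```

The rotation index r is generated together with the rotated picture, so this task needs no human annotation.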
CN202111026141.7A 2021-09-02 2021-09-02 Long-tail image recognition method based on self-supervision and self-distillation Active CN113837238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111026141.7A CN113837238B (en) 2021-09-02 2021-09-02 Long-tail image recognition method based on self-supervision and self-distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111026141.7A CN113837238B (en) 2021-09-02 2021-09-02 Long-tail image recognition method based on self-supervision and self-distillation

Publications (2)

Publication Number Publication Date
CN113837238A true CN113837238A (en) 2021-12-24
CN113837238B CN113837238B (en) 2023-09-01

Family

ID=78962069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111026141.7A Active CN113837238B (en) 2021-09-02 2021-09-02 Long-tail image recognition method based on self-supervision and self-distillation

Country Status (1)

Country Link
CN (1) CN113837238B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203530A (en) * 2016-07-21 2016-12-07 长安大学 Method is determined for the feature weight of uneven distributed data towards k nearest neighbor algorithm
US20200364542A1 (en) * 2019-05-16 2020-11-19 Salesforce.Com, Inc. Private deep learning
CN112348792A (en) * 2020-11-04 2021-02-09 广东工业大学 X-ray chest radiography image classification method based on small sample learning and self-supervision learning
CN112381116A (en) * 2020-10-21 2021-02-19 福州大学 Self-supervision image classification method based on contrast learning
US20210182618A1 (en) * 2018-10-29 2021-06-17 Hrl Laboratories, Llc Process to learn new image classes without labels
CN113177612A (en) * 2021-05-24 2021-07-27 同济大学 Agricultural pest image identification method based on CNN few samples

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
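T. LI: "Self Supervision to Distillation for Long-Tailed Visual Recognition", 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), pages 610 - 619 *
李徵: "Oil and gas reservoir identification method based on geological knowledge distillation learning", Scientia Sinica Informationis (《中国科学:信息科学》), vol. 51, no. 1, pages 40 - 55 *
秦晓明: "Research on classification of images with noisy labels based on deep learning", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》), no. 8, pages 138 - 562 *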

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114549904A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Visual processing and model training method, apparatus, storage medium, and program product
CN114863248A (en) * 2022-03-02 2022-08-05 武汉大学 Image target detection method based on deep supervision self-distillation
CN114863248B (en) * 2022-03-02 2024-04-26 武汉大学 Image target detection method based on deep supervision self-distillation
CN114595780A (en) * 2022-03-15 2022-06-07 百度在线网络技术(北京)有限公司 Image-text processing model training and image-text processing method, device, equipment and medium
CN114627348A (en) * 2022-03-22 2022-06-14 厦门大学 Intention-based picture identification method in multi-subject task
CN114627348B (en) * 2022-03-22 2024-05-31 厦门大学 Picture identification method based on intention in multi-subject task
CN114863193A (en) * 2022-07-07 2022-08-05 之江实验室 Long-tail learning image classification and training method and device based on mixed batch normalization
CN115272881A (en) * 2022-08-02 2022-11-01 大连理工大学 Long-tail remote sensing image target identification method based on dynamic relation distillation
CN116578913A (en) * 2023-03-31 2023-08-11 中国人民解放军陆军工程大学 Reliable unmanned aerial vehicle detection and recognition method oriented to complex electromagnetic environment
CN116578913B (en) * 2023-03-31 2024-05-24 中国人民解放军陆军工程大学 Reliable unmanned aerial vehicle detection and recognition method oriented to complex electromagnetic environment

Also Published As

Publication number Publication date
CN113837238B (en) 2023-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant