CN114972839A - Generalized continuous classification method based on online contrast distillation network

Generalized continuous classification method based on online contrast distillation network

Info

Publication number
CN114972839A
Authority
CN
China
Prior art keywords
model
feature
data
student
student model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210326319.8A
Other languages
Chinese (zh)
Inventor
冀中
黎晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210326319.8A priority Critical patent/CN114972839A/en
Publication of CN114972839A publication Critical patent/CN114972839A/en
Pending legal-status Critical Current

Classifications

    • G06F18/2431 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Classification techniques relating to the number of classes; Multiple classes
    • G06F18/214 — Physics; Computing; Electric digital data processing; Pattern recognition; Design or setup of recognition systems or techniques; Extraction of features in feature space; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415 — Physics; Computing; Electric digital data processing; Pattern recognition; Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 — Physics; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N3/047 — Physics; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Probabilistic or stochastic networks
    • G06N3/08 — Physics; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a generalized continuous classification method based on an online contrast distillation network. The method establishes a classification model based on knowledge distillation; establishes a buffer and updates it by reservoir sampling; randomly samples S samples from the buffer and feeds them to the teacher model and the student model to obtain each model's classification outputs and feature embeddings; computes a quality score from the teacher model's classification output, uses it to adjust the knowledge-distillation loss weights of different samples, and computes the online distillation loss $\mathcal{L}_{od}$; contrasts the feature embeddings of the two models and computes the contrastive relation distillation loss $\mathcal{L}_{crd}$; computes the student model's self-supervised loss $\mathcal{L}_{ss}$ and supervised contrastive learning loss $\mathcal{L}_{sc}$; and computes the student model's cross-entropy classification loss $\mathcal{L}_{ce}$. The weighted sum of these losses forms the total optimization objective used to optimize the student model's parameters, and the teacher model's parameters are then updated from the student model's parameters. The invention achieves good classification accuracy on both new and old tasks.

Description

Generalized continuous classification method based on online contrast distillation network
Technical Field
The invention relates to a generalized continuous classification method, in particular to a generalized continuous classification method based on an online contrast distillation network.
Background
In recent years, deep learning has achieved strong results on computer vision tasks such as image classification, object detection and semantic segmentation. However, when a neural network trained on an old task is trained directly on a new task, the new task severely interferes with performance on the old task, a phenomenon known as Catastrophic Forgetting. Retraining a neural network from scratch consumes additional time and computing resources, and the data of previous tasks may not be obtainable again due to privacy and other concerns. Humans, by contrast, learn continually: they quickly acquire new knowledge on the basis of old knowledge without destabilizing what was learned before. Neural networks are expected to have this ability, and continual learning (also called Incremental Learning) was proposed to overcome catastrophic forgetting. Much recent continual-learning work adopts the idea of Experience Replay: samples of old tasks are stored, and the stored samples are replayed while training new tasks, thereby alleviating catastrophic forgetting.
Existing continual-learning techniques often assume that the categories of different tasks are mutually disjoint, i.e. categories in a new task are not present in old tasks, and that clear task boundaries exist between tasks; such prior knowledge is unlikely to hold in real-world tasks. Many existing techniques exploit this unrealistic prior knowledge to simplify the continual-learning problem. For example, when the model's outputs on old samples at a past time are used to regularize its outputs on old samples at the present time to alleviate catastrophic forgetting, the arrival of new classes makes the output dimension of the old model inconsistent with that of the new model; only under the assumption that task categories are mutually disjoint can the new model's output be aligned with the corresponding part of the old model's output. Continual-learning methods that rely on mutually disjoint task categories therefore cannot be applied in the generalized continual-learning setting. For this reason, General Continual Learning, which addresses catastrophic forgetting in real-world scenarios, has been attracting attention. Its goal is to consolidate learned knowledge from a non-stationary, unbounded data stream while quickly learning new knowledge. Under generalized continual learning, the categories of different tasks may intersect and new samples of old categories may appear in new tasks, so conventional methods that depend on prior knowledge unavailable in the real world are difficult to apply.
Generalized continual learning is a general continual-learning scenario that subsumes the classic Class Incremental Learning, Task Incremental Learning and Domain Incremental Learning scenarios. However, when performing image classification under generalized continual learning, the specific prior knowledge available in these classic scenarios cannot be exploited to alleviate catastrophic forgetting. This means that experience replay must mine some intrinsic, scenario-independent information to consolidate old-task knowledge.
Disclosure of Invention
The invention provides a generalized continuous classification method based on an online contrast distillation network for solving the technical problems in the prior art.
The technical scheme adopted by the invention to solve the technical problems in the prior art is as follows: a generalized continuous classification method based on an online contrast distillation network comprises the following steps:
Step 1, establishing a knowledge-distillation-based classification model comprising a teacher model and a student model, each equipped with a feature encoder, a classifier and a feature mapper; setting the optimization objective of the student model; initializing the parameters of the teacher model and of the student model, and allocating a buffer of fixed size;
Step 2, when a batch data stream containing R samples arrives, counting the number of samples encountered so far and updating the buffer by reservoir sampling;
Step 3, randomly sampling S samples from the buffer and feeding them to the teacher model and the student model; the classification outputs of the teacher and student models for the S samples are obtained through each model's feature encoder and classifier, and the feature embeddings of the teacher and student models are obtained through each model's feature encoder and feature mapper;
Step 4, computing the quality scores of the teacher model's classification outputs, using them to adjust the online knowledge-distillation loss weights of different samples, and computing the online distillation loss $\mathcal{L}_{od}$ between the teacher and student models;
Step 5, comparing characteristic embedded data between the teacher model and the student models, and calculating the comparison relation distillation loss of the teacher model and the student models
Figure BDA0003573569660000022
Step 6, using the self-supervision learning and the supervision contrast learning to help the student model to extract the discriminant characteristics to calculate the self-supervision loss of the student model
Figure BDA0003573569660000023
And supervised contrast learning loss
Figure BDA0003573569660000024
Step 7, calculating the cross entropy classification loss of the student model based on experience playback
Figure BDA0003573569660000025
Step 8, calculating the total optimization target of the student model
Figure BDA0003573569660000026
Figure BDA0003573569660000027
α 1 To alpha 3 The hyperparameters of each corresponding loss function; optimizing parameters of the student model by using a random gradient descent algorithm;
Step 9, updating the teacher model's parameters directly from the student model's parameters.
Further, in step 2, assume the non-stationary data stream consists of n sample-disjoint tasks $\{T_1, T_2, \ldots, T_n\}$, where the training set of each task $T_n$ consists of labeled data $\{(x_i, y_i)\}_{i=1}^{m}$, m being the number of training samples of task $T_n$, $x_i$ the i-th image sample in the training set of $T_n$, and $y_i$ its labeled category. The buffer $\mathcal{B}$ has capacity $|\mathcal{B}|$; $x_j$ denotes the j-th image sample in the buffer and $y_j$ its labeled category. Reservoir sampling proceeds as follows:

Step A1, compare the number num of samples encountered so far with the buffer capacity $|\mathcal{B}|$; if num ≤ $|\mathcal{B}|$, store the sample $(x_i, y_i)$ directly in the buffer $\mathcal{B}$.

Step A2, if num > $|\mathcal{B}|$, generate a random integer rand_num with minimum value 0 and maximum value num−1; if rand_num < $|\mathcal{B}|$, replace the buffer sample $(x_{rand\_num}, y_{rand\_num})$ with $(x_i, y_i)$, where $x_{rand\_num}$ is the image sample at index rand_num in the buffer $\mathcal{B}$ and $y_{rand\_num}$ its label.
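The buffer update can be sketched as follows (a minimal Python sketch of reservoir sampling as described in steps A1-A2; the ReservoirBuffer class and its member names are illustrative assumptions, not part of the patent):

```python
import random

class ReservoirBuffer:
    """Fixed-capacity buffer updated by reservoir sampling (steps A1-A2)."""

    def __init__(self, capacity):
        self.capacity = capacity   # |B|
        self.data = []             # stored (x, y) pairs
        self.num_seen = 0          # num: samples encountered so far

    def update(self, x, y):
        self.num_seen += 1
        if len(self.data) < self.capacity:
            # Step A1: buffer not yet full, store the sample directly.
            self.data.append((x, y))
        else:
            # Step A2: draw rand_num uniformly from [0, num - 1];
            # replace an old sample only if rand_num falls inside the buffer.
            rand_num = random.randint(0, self.num_seen - 1)
            if rand_num < self.capacity:
                self.data[rand_num] = (x, y)

    def sample(self, s):
        """Randomly draw S stored samples for replay (step 3)."""
        return random.sample(self.data, min(s, len(self.data)))
```

This scheme retains every sample seen so far with equal probability $|\mathcal{B}|/\text{num}$, which is why the detailed description notes that the probability of all samples being stored in the buffer is equal.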
Further, in step 4, the quality score of the teacher model's classification output is computed as follows.

Set: $\mathcal{B}$ denotes the buffer of capacity $|\mathcal{B}|$, with $x_j$ the j-th image sample in the buffer and $y_j$ its labeled category; $r_t(x_j)$ denotes the classification output obtained after sample $x_j$ passes in turn through the teacher model's feature encoder and classifier; $\omega(x_j)$ is the quality score of the teacher model's classification output for sample $x_j$, computed as:

$$\omega(x_j) = \frac{\exp\big(r_t^{y_j}(x_j)/\rho\big)}{\sum_{c=1}^{C}\exp\big(r_t^{c}(x_j)/\rho\big)}$$

where ρ denotes a temperature coefficient; C the number of all possible categories; exp(·) the exponential function with base e; $r_t^{y_j}(x_j)$ the component of the classification output $r_t(x_j)$ corresponding to the labeled category $y_j$; and $r_t^{c}(x_j)$ the component of $r_t(x_j)$ corresponding to category c.
Further, in step 4, let $r_s(x_j)$ denote the classification output obtained after sample $x_j$ passes in turn through the student model's feature encoder and classifier. The online distillation loss between the teacher and student models is computed as:

$$\mathcal{L}_{od} = \mathbb{E}_{x_j\sim\mathcal{B}}\Big[\,\omega(x_j)\,\big\lVert r_t(x_j) - r_s(x_j)\big\rVert_2\,\Big]$$

where $\lVert\cdot\rVert_2$ denotes the $\ell_2$ norm and $\mathbb{E}[\cdot]$ the mathematical expectation.
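A minimal PyTorch sketch of this quality-weighted online distillation loss, following the reconstructed formulas above (tensor shapes and the function name are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def online_distillation_loss(r_t, r_s, y, rho=1.0):
    """Quality-weighted online distillation loss L_od (a sketch).

    r_t, r_s: teacher / student classification outputs, shape (S, C)
    y:        labeled categories of the replayed samples, shape (S,)
    rho:      temperature coefficient of the quality score
    """
    # Quality score: teacher softmax at temperature rho, evaluated at
    # each sample's labeled category y_j.
    probs = F.softmax(r_t / rho, dim=1)                  # (S, C)
    omega = probs.gather(1, y.unsqueeze(1)).squeeze(1)   # (S,)

    # l2 distance between teacher and student outputs, weighted by omega;
    # the teacher is detached since it receives no gradients (step 9).
    diff = torch.norm(r_t.detach() - r_s, p=2, dim=1)    # (S,)
    return (omega.detach() * diff).mean()

# Usage sketch with random tensors:
# r_t = torch.randn(8, 10); r_s = torch.randn(8, 10, requires_grad=True)
# y = torch.randint(0, 10, (8,)); loss = online_distillation_loss(r_t, r_s, y)
```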
Further, in step 5, set: $\mathcal{B}$ denotes the buffer of capacity $|\mathcal{B}|$, with $x_j$ the j-th image sample in the buffer and $y_j$ its labeled category; $z_j^t$ is the teacher model's feature embedding of sample $x_j$, obtained through the teacher's feature encoder and feature mapper after $x_j$ is input to the teacher model; $z_j^s$ is the student model's feature embedding of $x_j$, obtained through the student's feature encoder and feature mapper; $z^t$ is the set of all teacher embeddings $z_j^t$ of the current batch, and $z^s$ the set of all student embeddings $z_j^s$; $\tilde z^s$ denotes a feature embedding sampled from $z^s$; $z^{t+}$ denotes the set of teacher embeddings carrying the same class label as $\tilde z^s$; $\tilde z^{t+}$ denotes a feature embedding sampled from $z^{t+}$; and $\tilde z^t$ denotes a feature embedding sampled from $z^t$. The contrastive relation distillation loss between the teacher and student models is computed as:

$$\mathcal{L}_{crd} = -\,\mathbb{E}\left[\log\frac{h^{+}\big(\tilde z^s, \tilde z^{t+}\big)}{h^{+}\big(\tilde z^s, \tilde z^{t+}\big) + \sum h\big(\tilde z^s, \tilde z^{t}\big)}\right]$$

$$h^{+}\big(\tilde z^s, \tilde z^{t+}\big) = \exp\!\left(\frac{(\tilde z^s)^{\top}\tilde z^{t+}}{\lVert \tilde z^s\rVert_2\,\lVert \tilde z^{t+}\rVert_2\,\tau}\right)$$

$$h\big(\tilde z^s, \tilde z^{t}\big) = \exp\!\left(\frac{(\tilde z^s)^{\top}\tilde z^{t}}{\lVert \tilde z^s\rVert_2\,\lVert \tilde z^{t}\rVert_2\,\tau}\right)$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation; $\lVert\cdot\rVert_2$ the $\ell_2$ norm; log(·) the natural logarithm with base e; $h^{+}(\tilde z^s, \tilde z^{t+})$ a judgment function deciding whether the feature embeddings $\tilde z^s$ and $\tilde z^{t+}$ originate from their joint distribution; $h(\tilde z^s, \tilde z^{t})$ the analogous judgment function for $\tilde z^s$ and $\tilde z^{t}$; $(\cdot)^{\top}$ the transpose; exp(·) the exponential function with base e; and τ a temperature coefficient.
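A PyTorch sketch of this contrastive relation distillation, treating same-label teacher embeddings as positives and the rest of the teacher batch as negatives (an InfoNCE-style reading of the reconstructed formulas; all names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def contrast_relation_distillation_loss(z_s, z_t, y, tau=0.1):
    """Contrastive relation distillation loss L_crd (a sketch).

    z_s: student feature embeddings, shape (S, D)
    z_t: teacher feature embeddings, shape (S, D)
    y:   class labels of the replayed samples, shape (S,)
    tau: temperature coefficient
    """
    # Cosine similarities between every student anchor and every teacher
    # embedding; this realizes the judgment function
    # h(a, b) = exp(a^T b / (|a|_2 |b|_2 tau)) in log-space.
    z_s_n = F.normalize(z_s, dim=1)
    z_t_n = F.normalize(z_t.detach(), dim=1)   # teacher gets no gradients
    logits = z_s_n @ z_t_n.t() / tau           # (S, S)

    # Positives: teacher embeddings whose label matches the anchor's label
    # (including the teacher embedding of the same sample).
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float()   # (S, S)

    # InfoNCE over the batch: -log(sum_pos exp / sum_all exp).
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```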
Further, step 6 includes the following sub-steps:

Step B1, let $\Theta_t, \Phi_t, \Psi_t$ be the feature encoder, classifier and feature mapper of the teacher model, and $\Theta_s, \Phi_s, \Psi_s$ the feature encoder, classifier and feature mapper of the student model. Each training sample (x, y) of the student model undergoes one random geometric transformation, giving the augmented training sample $(\tilde x, \tilde y)$, where x denotes an image sample, y the category labeled for x, $\tilde x$ the geometrically transformed image sample, and $\tilde y$ the label of the applied geometric transformation. The augmented sample $\tilde x$ is input to the student model, and its feature encoder and feature mapper yield the student model feature data $F_s$ and feature embedding $\tilde z^s$, where:

$$F_s = \Theta_s(\tilde x); \qquad \tilde z^s = \Psi_s\big(\Theta_s(\tilde x)\big)$$

Step B2, the obtained student model feature data $F_s$ is input to a multi-layer perceptron η that judges which kind of geometric transformation the training sample $\tilde x$ underwent. Letting $S_s$ be the output of the multi-layer perceptron:

$$S_s = \eta(F_s)$$

Step B3, the self-supervised loss $\mathcal{L}_{ss}$ is computed as:

$$\mathcal{L}_{ss} = \mathbb{E}\big[\,\ell\big(\mathrm{softmax}(S_s),\, \tilde y\big)\,\big]$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation, softmax(·) the softmax function, and ℓ(·) the cross-entropy loss function.

Step B4, set: $\mathcal{B}$ denotes the buffer of capacity $|\mathcal{B}|$, with $x_j$ the j-th image sample in the buffer and $y_j$ its labeled category; $z_j^s$ is the student model's feature embedding of $x_j$, obtained through the student's feature encoder and feature mapper; $\hat z^s$ denotes the set collecting the whole batch's student embeddings, both the original $z_j^s$ and the augmented $\tilde z^s$; $\hat z^s_a$ denotes a feature embedding sampled from $\hat z^s$; $\hat z^{s+}$ denotes the set of student embeddings carrying the same class label as $\hat z^s_a$; $\tilde z^{s+}$ denotes a feature embedding sampled from $\hat z^{s+}$; and $\tilde z^{s'}$ denotes a feature embedding sampled from $\hat z^s$. Based on the original and augmented feature embeddings, supervised contrastive learning is performed within the student model, with loss:

$$\mathcal{L}_{sc} = -\,\mathbb{E}\left[\log\frac{\exp\big(d(\hat z^s_a, \tilde z^{s+})\big)}{\exp\big(d(\hat z^s_a, \tilde z^{s+})\big) + \sum \exp\big(d(\hat z^s_a, \tilde z^{s'})\big)}\right]$$

$$d\big(\hat z^s_a, \tilde z^{s+}\big) = \frac{(\hat z^s_a)^{\top}\tilde z^{s+}}{\lVert \hat z^s_a\rVert_2\,\lVert \tilde z^{s+}\rVert_2\,\tau}; \qquad d\big(\hat z^s_a, \tilde z^{s'}\big) = \frac{(\hat z^s_a)^{\top}\tilde z^{s'}}{\lVert \hat z^s_a\rVert_2\,\lVert \tilde z^{s'}\rVert_2\,\tau}$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation; $\lVert\cdot\rVert_2$ the $\ell_2$ norm; log(·) the natural logarithm with base e; $d(\hat z^s_a, \tilde z^{s+})$ the distance between embeddings $\hat z^s_a$ and $\tilde z^{s+}$, and $d(\hat z^s_a, \tilde z^{s'})$ the distance between $\hat z^s_a$ and $\tilde z^{s'}$; exp(·) the exponential function with base e; $(\cdot)^{\top}$ the transpose; and τ a temperature coefficient.

Step B5, the self-supervised loss $\mathcal{L}_{ss}$ and the supervised contrastive learning loss $\mathcal{L}_{sc}$ are combined into the collaborative contrast loss $\mathcal{L}_{cc}$, which helps the student model extract more discriminative features:

$$\mathcal{L}_{cc} = \mathcal{L}_{ss} + \mathcal{L}_{sc}$$
further, in step B1, the geometric transformation includes rotating, scaling and adjusting the aspect ratio of the image.
Further, in step 7, assume the non-stationary data stream consists of n sample-disjoint tasks $\{T_1, T_2, \ldots, T_n\}$, and let x denote an image sample drawn from task $T_n$ and from the buffer $\mathcal{B}$, with y the category labeled for x. The cross-entropy classification loss of the student model is computed as:

$$\mathcal{L}_{ce} = \mathbb{E}_{(x,y)}\big[\,\ell\big(\mathrm{softmax}(r_s(x)),\, y\big)\,\big]$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation; softmax(·) the softmax function; ℓ(·) the cross-entropy loss function; and $r_s(x)$ the classification output of image sample x after passing in turn through the student model's feature encoder and classifier.
Further, in step 9, the specific method for updating the teacher model's parameters from the student model's parameters is as follows:

Let $\Theta_t, \Phi_t, \Psi_t$ be the feature encoder, classifier and feature mapper of the teacher model, and $\Theta_s, \Phi_s, \Psi_s$ those of the student model. The teacher model's parameters are updated as:

$$\Theta_t \leftarrow m\Theta_t + (1-m)\big[(1-X)\Theta_t + X\Theta_s\big]$$
$$\Phi_t \leftarrow m\Phi_t + (1-m)\big[(1-X)\Phi_t + X\Phi_s\big]$$
$$\Psi_t \leftarrow m\Psi_t + (1-m)\big[(1-X)\Psi_t + X\Psi_s\big]$$

where m denotes a momentum factor and X obeys a Bernoulli (0-1) distribution, defined as:

$$P(X=k) = p^k(1-p)^{1-k},\quad k\in\{0,1\}$$

The Bernoulli probability p takes values in (0, 1) and controls the update frequency of the teacher model.
Further, the momentum factor m is computed as:

$$m = \min\big(\mathrm{itera}/(\mathrm{itera}+1),\ \eta\big)$$

where itera is the current number of student-model iterations, min(itera/(itera+1), η) takes the smaller of itera/(itera+1) and η, and η is a constant, typically set to 0.999.
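A PyTorch sketch of this stochastically gated momentum update (the helper name and the pairing of `parameters()` via `zip` are illustrative assumptions; note the rule reduces algebraically to $\theta_t \leftarrow [m + (1-m)(1-X)]\,\theta_t + (1-m)X\,\theta_s$):

```python
import random
import torch

@torch.no_grad()
def update_teacher(teacher, student, itera, p=0.5, eta=0.999):
    """Momentum update of the teacher from the student, gated by a
    Bernoulli draw X ~ Bernoulli(p); X = 0 leaves the teacher unchanged."""
    m = min(itera / (itera + 1), eta)        # momentum factor schedule
    x = 1.0 if random.random() < p else 0.0  # Bernoulli gate X
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        # theta_t <- m*theta_t + (1-m)*[(1-X)*theta_t + X*theta_s]
        new_val = m * t_param + (1 - m) * ((1 - x) * t_param + x * s_param)
        t_param.copy_(new_val)
```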
The invention has the following advantages and positive effects. The generalized continuous classification method based on the online contrast distillation network uses the teacher-student framework of online knowledge distillation to effectively consolidate old-task knowledge, so the model achieves good classification accuracy on both new and old tasks. In the training stage, the training strategy of contrastive learning is introduced into online knowledge distillation: the teacher model accumulates knowledge by integrating the student model's weights across time, while the student model distills the teacher's classification outputs and contrastive relations to alleviate catastrophic forgetting. Through this cooperation, the student model retains its performance on old tasks, the teacher model accumulates weights that are better balanced between old and new tasks, and the teacher in turn better guides the student to consolidate old knowledge while the student trains on new tasks. In the testing stage, the teacher model is used for prediction: because it integrates the strengths of student models that were good at distinguishing different classes at different times, it attains good classification performance on all classes. The invention thus effectively integrates the advantages of the student network and improves the classification accuracy of the teacher network at test time.
Drawings
FIG. 1 is a flow chart of the generalized continuous classification method based on an online contrast distillation network according to the present invention.
Detailed Description
For a further understanding of the contents, features and effects of the present invention, the following embodiment is described in detail with reference to the accompanying drawing:
Referring to FIG. 1, a generalized continuous classification method based on an online contrast distillation network comprises the following steps:
Step 1, establishing a knowledge-distillation-based classification model comprising a teacher model and a student model, each equipped with a feature encoder, a classifier and a feature mapper; setting the optimization objective of the student model; initializing the parameters of the teacher model and of the student model, and allocating a buffer of fixed size.
Step 2, when a batch data stream containing R samples arrives, counting the number of samples encountered so far and updating the buffer by reservoir sampling.
Step 3, randomly sampling S samples from the buffer and feeding them to the teacher model and the student model; the classification outputs of the teacher and student models for the S samples are obtained through each model's feature encoder and classifier, and the feature embeddings through each model's feature encoder and feature mapper. Namely: the S samples, processed in turn by the teacher model's feature encoder and classifier, yield the teacher's classification output set; processed in turn by the student model's feature encoder and classifier, they yield the student's classification output set; processed in turn by the teacher model's feature encoder and feature mapper, they yield the teacher's feature embedding set; and processed in turn by the student model's feature encoder and feature mapper, they yield the student's feature embedding set.
Step 4, computing the quality scores of the teacher model's classification outputs, using them to adjust the online knowledge-distillation loss weights of different samples, and computing the online distillation loss $\mathcal{L}_{od}$ between the teacher and student models.

Step 5, contrasting the feature embeddings of the teacher and student models and computing the contrastive relation distillation loss $\mathcal{L}_{crd}$ between them.

Step 6, using self-supervised learning and supervised contrastive learning to help the student model extract discriminative features, computing the student model's self-supervised loss $\mathcal{L}_{ss}$ and supervised contrastive learning loss $\mathcal{L}_{sc}$.

Step 7, computing the student model's experience-replay-based cross-entropy classification loss $\mathcal{L}_{ce}$.

Step 8, computing the student model's total optimization objective, of the form $\mathcal{L} = \mathcal{L}_{ce} + \alpha_1\mathcal{L}_{od} + \alpha_2\mathcal{L}_{crd} + \alpha_3\mathcal{L}_{cc}$, with $\alpha_1$ to $\alpha_3$ the hyperparameters of the corresponding loss terms, and optimizing the student model's parameters with stochastic gradient descent.
Step 9, updating the teacher model's parameters directly from the student model's parameters.
Preferably, in step 2, the non-stationary data stream may be assumed to consist of n sample-disjoint tasks $\{T_1, T_2, \ldots, T_n\}$, where the training set of each task $T_n$ consists of labeled data $\{(x_i, y_i)\}_{i=1}^{m}$, m being the number of training samples of task $T_n$, $x_i$ the i-th image sample in the training set of $T_n$, and $y_i$ its labeled category; the buffer $\mathcal{B}$ has capacity $|\mathcal{B}|$, with $x_j$ the j-th image sample in the buffer and $y_j$ its labeled category. Reservoir sampling may comprise the following steps:

Step A1, compare the number num of samples encountered so far with the buffer capacity $|\mathcal{B}|$; if num ≤ $|\mathcal{B}|$, store the sample $(x_i, y_i)$ directly in the buffer $\mathcal{B}$.

Step A2, if num > $|\mathcal{B}|$, generate a random integer rand_num with minimum value 0 and maximum value num−1; if rand_num < $|\mathcal{B}|$, replace the buffer sample $(x_{rand\_num}, y_{rand\_num})$ with $(x_i, y_i)$, where $x_{rand\_num}$ is the image sample at index rand_num in the buffer $\mathcal{B}$ and $y_{rand\_num}$ its label.
Preferably, in step 4, the quality score of the teacher model's classification output may be computed as follows.

Set: $\mathcal{B}$ denotes the buffer of capacity $|\mathcal{B}|$, with $x_j$ the j-th image sample in the buffer and $y_j$ its labeled category; $r_t(x_j)$ denotes the classification output obtained after sample $x_j$ passes in turn through the teacher model's feature encoder and classifier; $\omega(x_j)$ is the quality score of the teacher model's classification output for sample $x_j$, which may be computed as:

$$\omega(x_j) = \frac{\exp\big(r_t^{y_j}(x_j)/\rho\big)}{\sum_{c=1}^{C}\exp\big(r_t^{c}(x_j)/\rho\big)}$$

where ρ denotes a temperature coefficient; C the number of all possible categories; exp(·) the exponential function with base e; $r_t^{y_j}(x_j)$ the component of the classification output $r_t(x_j)$ corresponding to the labeled category $y_j$; and $r_t^{c}(x_j)$ the component of $r_t(x_j)$ corresponding to category c.
Preferably, in step 4, let $r_s(x_j)$ denote the classification output obtained after sample $x_j$ passes in turn through the student model's feature encoder and classifier. The online distillation loss between the teacher and student models may be computed as:

$$\mathcal{L}_{od} = \mathbb{E}_{x_j\sim\mathcal{B}}\Big[\,\omega(x_j)\,\big\lVert r_t(x_j) - r_s(x_j)\big\rVert_2\,\Big]$$

where $\lVert\cdot\rVert_2$ denotes the $\ell_2$ norm; $\mathbb{E}[\cdot]$ the mathematical expectation; and exp(·) the exponential function with base e.
Preferably, in step 5, set: $\mathcal{B}$ denotes the buffer of capacity $|\mathcal{B}|$, with $x_j$ the j-th image sample in the buffer and $y_j$ its labeled category; $z_j^t$ is the teacher model's feature embedding of sample $x_j$, obtained through the teacher's feature encoder and feature mapper; $z_j^s$ is the student model's feature embedding of $x_j$, obtained through the student's feature encoder and feature mapper; $z^t$ is the set of all teacher embeddings $z_j^t$ of the current batch, and $z^s$ the set of all student embeddings $z_j^s$; $\tilde z^s$ denotes a feature embedding sampled from $z^s$; $z^{t+}$ denotes the set of teacher embeddings carrying the same class label as $\tilde z^s$; $\tilde z^{t+}$ denotes a feature embedding sampled from $z^{t+}$; and $\tilde z^t$ denotes a feature embedding sampled from $z^t$. The contrastive relation distillation loss between the teacher and student models may be computed as:

$$\mathcal{L}_{crd} = -\,\mathbb{E}\left[\log\frac{h^{+}\big(\tilde z^s, \tilde z^{t+}\big)}{h^{+}\big(\tilde z^s, \tilde z^{t+}\big) + \sum h\big(\tilde z^s, \tilde z^{t}\big)}\right]$$

$$h^{+}\big(\tilde z^s, \tilde z^{t+}\big) = \exp\!\left(\frac{(\tilde z^s)^{\top}\tilde z^{t+}}{\lVert \tilde z^s\rVert_2\,\lVert \tilde z^{t+}\rVert_2\,\tau}\right)$$

$$h\big(\tilde z^s, \tilde z^{t}\big) = \exp\!\left(\frac{(\tilde z^s)^{\top}\tilde z^{t}}{\lVert \tilde z^s\rVert_2\,\lVert \tilde z^{t}\rVert_2\,\tau}\right)$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation; $\lVert\cdot\rVert_2$ the $\ell_2$ norm; log(·) the natural logarithm with base e; $h^{+}(\tilde z^s, \tilde z^{t+})$ a judgment function deciding whether the feature embeddings $\tilde z^s$ and $\tilde z^{t+}$ originate from their joint distribution; $h(\tilde z^s, \tilde z^{t})$ the analogous judgment function for $\tilde z^s$ and $\tilde z^{t}$; $(\cdot)^{\top}$ the transpose; exp(·) the exponential function with base e; and τ a temperature coefficient.
Preferably, step 6 may comprise the following sub-steps:

Step B1, let $\Theta_t, \Phi_t, \Psi_t$ be the feature encoder, classifier and feature mapper of the teacher model, and $\Theta_s, \Phi_s, \Psi_s$ those of the student model. Each training sample (x, y) of the student model undergoes one random geometric transformation, giving the augmented training sample $(\tilde x, \tilde y)$, where x denotes an image sample, y the category labeled for x, $\tilde x$ the geometrically transformed image sample, and $\tilde y$ the label of the applied geometric transformation. The augmented sample $\tilde x$ is input to the student model, and its feature encoder and feature mapper yield the student model feature data $F_s$ and feature embedding $\tilde z^s$, where:

$$F_s = \Theta_s(\tilde x); \qquad \tilde z^s = \Psi_s\big(\Theta_s(\tilde x)\big)$$

Step B2, the obtained student model feature data $F_s$ may be input to a multi-layer perceptron η that judges which kind of geometric transformation the training sample $\tilde x$ underwent. Letting $S_s$ be the output of the multi-layer perceptron, $S_s$ may be computed as:

$$S_s = \eta(F_s)$$

Step B3, the self-supervised loss $\mathcal{L}_{ss}$ may be computed as:

$$\mathcal{L}_{ss} = \mathbb{E}\big[\,\ell\big(\mathrm{softmax}(S_s),\, \tilde y\big)\,\big]$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation, softmax(·) the softmax function, and ℓ(·) the cross-entropy loss function.

Step B4, set: $\mathcal{B}$ denotes the buffer of capacity $|\mathcal{B}|$, with $x_j$ the j-th image sample in the buffer and $y_j$ its labeled category; $z_j^s$ is the student model's feature embedding of $x_j$, obtained through the student's feature encoder and feature mapper; $\hat z^s$ denotes the set collecting the whole batch's student embeddings, both the original $z_j^s$ and the augmented $\tilde z^s$; $\hat z^s_a$ denotes a feature embedding sampled from $\hat z^s$; $\hat z^{s+}$ denotes the set of student embeddings carrying the same class label as $\hat z^s_a$; $\tilde z^{s+}$ denotes a feature embedding sampled from $\hat z^{s+}$; and $\tilde z^{s'}$ denotes a feature embedding sampled from $\hat z^s$. Based on the original and augmented feature embeddings, supervised contrastive learning may be performed within the student model, with loss:

$$\mathcal{L}_{sc} = -\,\mathbb{E}\left[\log\frac{\exp\big(d(\hat z^s_a, \tilde z^{s+})\big)}{\exp\big(d(\hat z^s_a, \tilde z^{s+})\big) + \sum \exp\big(d(\hat z^s_a, \tilde z^{s'})\big)}\right]$$

$$d\big(\hat z^s_a, \tilde z^{s+}\big) = \frac{(\hat z^s_a)^{\top}\tilde z^{s+}}{\lVert \hat z^s_a\rVert_2\,\lVert \tilde z^{s+}\rVert_2\,\tau}; \qquad d\big(\hat z^s_a, \tilde z^{s'}\big) = \frac{(\hat z^s_a)^{\top}\tilde z^{s'}}{\lVert \hat z^s_a\rVert_2\,\lVert \tilde z^{s'}\rVert_2\,\tau}$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation; $\lVert\cdot\rVert_2$ the $\ell_2$ norm; log(·) the natural logarithm with base e; $d(\hat z^s_a, \tilde z^{s+})$ the distance between embeddings $\hat z^s_a$ and $\tilde z^{s+}$, and $d(\hat z^s_a, \tilde z^{s'})$ the distance between $\hat z^s_a$ and $\tilde z^{s'}$; exp(·) the exponential function with base e; $(\cdot)^{\top}$ the transpose; and τ a temperature coefficient.

Step B5, the self-supervised loss $\mathcal{L}_{ss}$ and the supervised contrastive learning loss $\mathcal{L}_{sc}$ may be combined into the collaborative contrast loss $\mathcal{L}_{cc}$, helping the student model extract more discriminative features; $\mathcal{L}_{cc}$ may be computed as:

$$\mathcal{L}_{cc} = \mathcal{L}_{ss} + \mathcal{L}_{sc}$$
preferably, in step B1, the geometric transformation may include rotating, scaling and adjusting the aspect ratio of the image.
Preferably, in step 7, the non-stationary data stream may be assumed to consist of n sample-disjoint tasks $\{T_1, T_2, \ldots, T_n\}$; let x denote an image sample drawn from task $T_n$ and from the buffer $\mathcal{B}$, with y the category labeled for x. The cross-entropy classification loss of the student model may be computed as:

$$\mathcal{L}_{ce} = \mathbb{E}_{(x,y)}\big[\,\ell\big(\mathrm{softmax}(r_s(x)),\, y\big)\,\big]$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation; softmax(·) the softmax function; ℓ(·) the cross-entropy loss function; and $r_s(x)$ the classification output of image sample x after passing in turn through the student model's feature encoder and classifier.
Preferably, in step 9, the specific method for updating the teacher model's parameters from the student model's parameters may be as follows:

Let $\Theta_t, \Phi_t, \Psi_t$ be the feature encoder, classifier and feature mapper of the teacher model, and $\Theta_s, \Phi_s, \Psi_s$ those of the student model. The teacher model's parameters may be updated as:

$$\Theta_t \leftarrow m\Theta_t + (1-m)\big[(1-X)\Theta_t + X\Theta_s\big]$$
$$\Phi_t \leftarrow m\Phi_t + (1-m)\big[(1-X)\Phi_t + X\Phi_s\big]$$
$$\Psi_t \leftarrow m\Psi_t + (1-m)\big[(1-X)\Psi_t + X\Psi_s\big]$$

where m denotes a momentum factor and X obeys a Bernoulli (0-1) distribution, which may be defined as:

$$P(X=k) = p^k(1-p)^{1-k},\quad k\in\{0,1\}$$

The Bernoulli probability p takes values in (0, 1) and may be used to control the update frequency of the teacher model.
Preferably, the momentum factor m may be computed as:

$$m = \min\big(\mathrm{itera}/(\mathrm{itera}+1),\ \eta\big)$$

where itera is the current number of student-model iterations, min(itera/(itera+1), η) takes the smaller of itera/(itera+1) and η, and η is a constant that may be set to 0.999.
The working process and principle of the present invention are further explained below through a preferred embodiment:
the generalized continuous learning is to consolidate old knowledge and accumulate new knowledge from a non-stationary data stream, and finally complete classification prediction of all the seen class images. Assume that a non-stationary data stream is formed by N sample disjoint tasks T 1 ,T 2 ,...,T N Are composed of, each task T n The training set of (A) is all marked data
Figure BDA0003573569660000131
Composition, where m is task T n Number of samples, x, of training set i For task T n Training set ith image sample,y i For task T n Ith image sample x in training set i The labeled category. In the testing stage, the generalized continuous learning method can complete classification tasks for all the classes currently seen. Each task T n The test sets of (1) are all tagged with data
Figure BDA0003573569660000132
Composition, wherein p is task T n Number of samples, x, of test set q For task T n The q image sample in the test set, y q For task T n The q image sample x in the test set q The labeled category. The generalized continuous learning task is all the tasks { T } trained at present 1 ,T 2 ,...,T n And performing category prediction on the test set.
FIG. 1 is a flow chart of the generalized continuous classification method based on an online contrast distillation network according to the present invention. In the figure, $\mathcal{B}$ denotes the buffer of capacity $|\mathcal{B}|$, with $x_j$ the j-th image sample in the buffer and $y_j$ its labeled category; $\mathcal{L}_{od}$ denotes the online distillation loss and $\mathcal{L}_{crd}$ the contrastive relation distillation loss. Let $\Theta_t, \Phi_t, \Psi_t$ be the feature encoder, classifier and feature mapper of the teacher model, and $\Theta_s, \Phi_s, \Psi_s$ the feature encoder, classifier and feature mapper of the student model.

The generalized continuous classification method based on an online contrast distillation network according to the invention comprises the following steps:
Step 1, before any task starts, initialize the parameters of the teacher and student models and allocate a fixed-size buffer: $\Theta_t = \Theta_s$, $\Phi_t = \Phi_s$, $\Psi_t = \Psi_s$, and $\mathcal{B} = \varnothing$.
Step 2, when a batch data stream containing bsz samples arrives, count the number num of samples encountered so far and update the buffer $\mathcal{B}$ by reservoir sampling; this ensures that all samples have an equal probability of being stored in the buffer. For a given sample, reservoir sampling comprises:

(1) compare the number num of samples encountered so far with the buffer capacity $|\mathcal{B}|$; if num ≤ $|\mathcal{B}|$, store the sample $(x_i, y_i)$ directly in the buffer;

(2) if num > $|\mathcal{B}|$, generate a random integer rand_num with minimum value 0 and maximum value num−1; if rand_num < $|\mathcal{B}|$, replace the buffer sample $(x_{rand\_num}, y_{rand\_num})$ with $(x_i, y_i)$, where $x_{rand\_num}$ denotes the image sample at index rand_num in the buffer $\mathcal{B}$ and $y_{rand\_num}$ its label.
Step 3, randomly sample S samples $x_j$ from the buffer $\mathcal{B}$ to consolidate old knowledge, and feed the S samples $x_j$ to both the teacher model and the student model. The classification outputs of the teacher and student models, obtained through each model's feature encoder and classifier, are respectively:

$$r_t(x_j) = \Phi_t\big(\Theta_t(x_j)\big) \quad (1); \qquad r_s(x_j) = \Phi_s\big(\Theta_s(x_j)\big) \quad (2)$$

The feature embeddings of the teacher and student models, obtained through each model's feature encoder and feature mapper, are respectively:

$$z_j^t = \Psi_t\big(\Theta_t(x_j)\big) \quad (3); \qquad z_j^s = \Psi_s\big(\Theta_s(x_j)\big) \quad (4)$$
Step 4, set: $r_t(x_j)$ denotes the classification output of sample $x_j$ after the teacher model's feature encoder and classifier in turn; $\omega(x_j)$ the quality score of that output; $r_t^{c}(x_j)$ the component of $r_t(x_j)$ for category c; $r_s(x_j)$ the classification output of $x_j$ after the student model's feature encoder and classifier in turn; and $r_t^{y_j}(x_j)$ the component of $r_t(x_j)$ for the labeled category $y_j$.

Compute the quality score $\omega(x_j)$ of each sample's classification output:

$$\omega(x_j) = \frac{\exp\big(r_t^{y_j}(x_j)/\rho\big)}{\sum_{c=1}^{C}\exp\big(r_t^{c}(x_j)/\rho\big)} \quad (5)$$

where ρ is a temperature coefficient, C the number of all possible categories, and exp(·) the exponential function with base e.

Compute the online distillation loss according to equations (1), (2) and (5):

$$\mathcal{L}_{od} = \mathbb{E}_{x_j\sim\mathcal{B}}\Big[\,\omega(x_j)\,\big\lVert r_t(x_j) - r_s(x_j)\big\rVert_2\,\Big] \quad (6)$$

where $\lVert\cdot\rVert_2$ denotes the $\ell_2$ norm and $\mathbb{E}[\cdot]$ the mathematical expectation. Weighting the discrepancy between the teacher and student outputs by $\omega(x_j)$ makes the student model pay more attention to samples with high quality scores.
Step 5, contrast the feature embeddings of the teacher and student models and compute the contrastive relation distillation loss according to equations (3) and (4):

$$\mathcal{L}_{crd} = -\,\mathbb{E}\left[\log\frac{h^{+}\big(\tilde z^s, \tilde z^{t+}\big)}{h^{+}\big(\tilde z^s, \tilde z^{t+}\big) + \sum h\big(\tilde z^s, \tilde z^{t}\big)}\right] \quad (7)$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation and log(·) the natural logarithm with base e; $z_j^t$ is the teacher model's feature embedding of sample $x_j$, obtained through the teacher's feature encoder and feature mapper; $z_j^s$ the student model's feature embedding of $x_j$, obtained through the student's feature encoder and feature mapper; $z^t$ the set of all teacher embeddings $z_j^t$ of the current batch and $z^s$ the set of all student embeddings $z_j^s$; $\tilde z^s$ an embedding sampled from $z^s$; $z^{t+}$ the set of teacher embeddings carrying the same class label as $\tilde z^s$; $\tilde z^{t+}$ an embedding sampled from $z^{t+}$; and $\tilde z^t$ an embedding sampled from $z^t$.

$h^{+}(\tilde z^s, \tilde z^{t+})$ is a judgment function deciding whether the feature embeddings $\tilde z^s$ and $\tilde z^{t+}$ originate from their joint distribution, computed as:

$$h^{+}\big(\tilde z^s, \tilde z^{t+}\big) = \exp\!\left(\frac{(\tilde z^s)^{\top}\tilde z^{t+}}{\lVert \tilde z^s\rVert_2\,\lVert \tilde z^{t+}\rVert_2\,\tau}\right) \quad (8)$$

$h(\tilde z^s, \tilde z^{t})$ is the analogous judgment function for $\tilde z^s$ and $\tilde z^{t}$, computed as:

$$h\big(\tilde z^s, \tilde z^{t}\big) = \exp\!\left(\frac{(\tilde z^s)^{\top}\tilde z^{t}}{\lVert \tilde z^s\rVert_2\,\lVert \tilde z^{t}\rVert_2\,\tau}\right) \quad (9)$$

where exp(·) denotes the exponential function with base e, $\lVert\cdot\rVert_2$ the $\ell_2$ norm, $(\cdot)^{\top}$ the transpose, and τ a temperature coefficient.
Step 6, use self-supervised learning and supervised contrastive learning to help the student model extract discriminative features. The specific steps comprise:

(1) Each training sample (x, y) of the student model undergoes one random geometric transformation, giving the augmented training sample $(\tilde x, \tilde y)$, where x denotes an image sample, y the category labeled for x, $\tilde x$ the geometrically transformed image sample, and $\tilde y$ the label of the applied transformation. The geometric transformations include rotating, scaling and adjusting the aspect ratio of the image; this doubles the number of training images for the student model. The randomly transformed images $\tilde x$ are input to the student network, giving the corresponding student model features and feature embeddings:

$$F_s = \Theta_s(\tilde x) \quad (10); \qquad \tilde z^s = \Psi_s\big(\Theta_s(\tilde x)\big) \quad (11)$$

(2) The obtained student model features are input to a multi-layer perceptron η that judges which geometric transformation the training sample $\tilde x$ underwent:

$$S_s = \eta(F_s) \quad (12)$$

(3) Compute the self-supervised loss:

$$\mathcal{L}_{ss} = \mathbb{E}\big[\,\ell\big(\mathrm{softmax}(S_s),\, \tilde y\big)\,\big] \quad (13)$$

where $\mathbb{E}[\cdot]$ denotes the mathematical expectation, softmax(·) the softmax function, and ℓ(·) the cross-entropy loss function.
(4) Let z_j^s be the feature embedding data of the student model obtained after sample x_j is input into the student model and processed by the feature encoder and the feature mapper; let Z_all^s denote the set of all obtained student model feature embedding data z^s and z̃^s; let z_i^s denote feature embedding data sampled from Z_all^s; let Z^{s+} denote the set of student model feature embeddings having the same class label as z_i^s; let z_j^{s+} denote feature embedding data sampled from Z^{s+}; and let z_k^s denote feature embedding data sampled from Z_all^s. Based on the original feature embedding data and the augmented feature embedding data, supervised contrastive learning is carried out with the feature embedding data in the student model. The loss function of supervised contrastive learning L_sc is calculated as follows:

L_sc = −E[log(h(z_i^s, z_j^{s+}) / Σ_k h(z_i^s, z_k^s))]

h(z_i^s, z_j^{s+}) = exp((z_i^s)^T z_j^{s+} / (‖z_i^s‖_2 · ‖z_j^{s+}‖_2 · τ))

h(z_i^s, z_k^s) = exp((z_i^s)^T z_k^s / (‖z_i^s‖_2 · ‖z_k^s‖_2 · τ))

in the formula: E represents the mathematical expectation; log(·) represents the natural logarithmic function with base e; h(z_i^s, z_j^{s+}) represents the distance between the feature embedding data z_i^s and z_j^{s+}; h(z_i^s, z_k^s) represents the distance between the feature embedding data z_i^s and z_k^s; exp(·) denotes an exponential function with base e; ‖·‖_2 represents the l_2 norm; (·)^T represents transposition; τ represents the temperature coefficient.
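A compact sketch of this supervised contrastive loss, assuming z is an (N, d) batch of l_2-normalized student embeddings (original plus augmented) and labels the matching (N,) class-label vector; the batched log-sum-exp form is an equivalent, numerically stable rewrite of the formula above:

```python
import torch

def supervised_contrast_loss(z, labels, tau=0.1):
    # pairwise cosine similarities divided by the temperature τ
    sim = z @ z.t() / tau
    # an anchor must not be contrasted with itself
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim.masked_fill_(self_mask, float('-inf'))
    # log( h(z_i, z_j) / Σ_k h(z_i, z_k) ), computed stably
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # positives: same class label, excluding the anchor itself
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    loss = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```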
(5) The self-supervision loss L_ss and the supervised contrastive learning loss L_sc are combined to obtain the collaborative contrast loss L_cc, which helps the student model extract more discriminative features:

L_cc = L_ss + L_sc
Step 7, calculating the cross-entropy classification loss of the student model based on experience replay:

L_ce = E_{(x,y)}[l(softmax(r_s(x)), y)]

wherein x represents an image sample from task T_n and the buffer; y is the class label of image sample x; E represents the mathematical expectation function; softmax(·) denotes the softmax function; l(·) represents the cross-entropy loss function; and r_s(x) represents the classification output data of image sample x after being sequentially processed by the feature encoder Θ_s and the classifier Φ_s of the student model:

r_s(x) = Φ_s(Θ_s(x))
Step 8, calculating the total optimization target L of the student model and optimizing the parameters of the student model with the stochastic gradient descent algorithm:

L = L_ce + α_1·L_od + α_2·L_rd + α_3·L_cc

wherein α_1, α_2 and α_3 represent the hyper-parameters of the corresponding loss terms, L_od is the online distillation loss of step 4, L_rd is the contrast relation distillation loss of step 5, and L_cc is the collaborative contrast loss of step 6.
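A minimal sketch of one optimization step over the total objective, assuming the individual loss terms have already been computed as differentiable tensors and a1-a3 stand for α_1-α_3 (the values shown are placeholders); optimizer would be, e.g., torch.optim.SGD(student.parameters(), lr=...):

```python
import torch
import torch.nn.functional as F

def train_step(student, optimizer, x, y, loss_od, loss_rd, loss_cc,
               a1=1.0, a2=1.0, a3=1.0):
    logits = student(x)                      # r_s(x) = Φ_s(Θ_s(x))
    loss_ce = F.cross_entropy(logits, y)     # experience-replay CE loss L_ce
    total = loss_ce + a1 * loss_od + a2 * loss_rd + a3 * loss_cc
    optimizer.zero_grad()
    total.backward()
    optimizer.step()                         # stochastic gradient descent step
    return total.item()
```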
Step 9, updating the parameters of the teacher model directly from the parameters of the student model, without gradient back-propagation. Let Θ_t, Φ_t, Ψ_t be the feature encoder, classifier and feature mapper of the teacher model, and Θ_s, Φ_s, Ψ_s be the corresponding feature encoder, classifier and feature mapper of the student model. The updating method comprises the following steps:

Θ_t ← mΘ_t + (1−m)[(1−X)Θ_t + XΘ_s] (21);

Φ_t ← mΦ_t + (1−m)[(1−X)Φ_t + XΦ_s] (22);

Ψ_t ← mΨ_t + (1−m)[(1−X)Ψ_t + XΨ_s] (23);

where m represents a momentum factor and X obeys a Bernoulli distribution (also known as a 0-1 distribution), defined as:

P(X = k) = p^k (1−p)^(1−k), k ∈ {0, 1} (24);

The value range of the Bernoulli probability p is (0, 1), and the update frequency of the teacher model is controlled through p.
In order for the teacher model to quickly learn new knowledge in the early stage of model training, the momentum factor m is designed as follows:
m=min(itera/(itera+1),η) (25);
where itera is the number of iterations of the current student model, min(itera/(itera+1), η) represents the smaller of itera/(itera+1) and η, and η is a constant, typically set to 0.999.
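A minimal sketch of this Bernoulli-gated momentum update, assuming teacher and student are PyTorch modules of identical architecture; p and eta stand for the Bernoulli probability and the constant η:

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, itera, p=0.5, eta=0.999):
    m = min(itera / (itera + 1), eta)                 # momentum factor, eq. (25)
    X = int(torch.bernoulli(torch.tensor(p)).item())  # X ~ Bernoulli(p), eq. (24)
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        # Θ_t ← mΘ_t + (1−m)[(1−X)Θ_t + XΘ_s], eqs. (21)-(23); when X = 0 the
        # teacher is left unchanged, so p controls the update frequency
        pt.copy_(m * pt + (1 - m) * ((1 - X) * pt + X * ps))
```

Since m = itera/(itera+1) is near 0 in the first iterations, the teacher copies the student almost entirely at the start of training and then stabilizes as m approaches η.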
The generalized continuous classification method based on the online contrast distillation network can be tested at any time. In the testing stage, the teacher model is used, because student models at different moments are good at classifying different categories, and the teacher model, which learns from the student models, accumulates their strengths. The teacher model therefore has a stronger ability to distinguish all seen categories than any single student model.
The above-mentioned embodiments are only for illustrating the technical ideas and features of the present invention; their purpose is to enable those skilled in the art to understand the contents of the present invention and implement them accordingly. The present invention shall not be limited to the embodiments: equivalent changes or modifications made within the spirit of the present invention shall fall within the scope of the present invention.

Claims (10)

1. A generalized continuous classification method based on an online contrast distillation network, characterized by comprising the following steps:
step 1, establishing a knowledge-distillation-based classification model, wherein the classification model comprises a teacher model and a student model; the teacher model and the student model are each provided with a feature encoder, a classifier and a feature mapper; setting an optimization target of the student model; initializing the parameters of the teacher model and the parameters of the student model, and setting a buffer of fixed size;
step 2, when a batch data stream containing R samples arrives, counting the number of samples encountered so far, and updating the buffer with the reservoir sampling method;
step 3, randomly sampling S samples from the buffer and inputting them into the teacher model and the student model respectively; obtaining the classification output data of the teacher model and of the student model for the S samples through the respective feature encoders and classifiers; obtaining the feature embedding data of the teacher model and of the student model for the S samples through the respective feature encoders and feature mappers;
step 4, calculating quality scores for the classification output data of the teacher model, adjusting the weights of the online knowledge distillation loss function for different samples according to these quality scores, and calculating the online distillation loss L_od between the teacher model and the student model;
step 5, contrasting the feature embedding data between the teacher model and the student model, and calculating the contrast relation distillation loss L_rd between the teacher model and the student model;
step 6, using self-supervised learning and supervised contrastive learning to help the student model extract discriminative features, and calculating the self-supervision loss L_ss and the supervised contrastive learning loss L_sc of the student model;
step 7, calculating the cross-entropy classification loss L_ce of the student model based on experience replay;
step 8, calculating the total optimization target of the student model, L = L_ce + α_1·L_od + α_2·L_rd + α_3·L_cc, wherein α_1 to α_3 are the hyper-parameters of the corresponding loss terms and L_cc is the collaborative contrast loss combining L_ss and L_sc; optimizing the parameters of the student model with the stochastic gradient descent algorithm;
step 9, updating the parameters of the teacher model directly from the parameters of the student model.
2. The generalized continuous classification method based on the online contrast distillation network as claimed in claim 1, wherein in step 2, the non-stationary data stream is assumed to consist of n sample-disjoint tasks {T_1, T_2, ..., T_n}, and the training set of each task T_n consists of the fully labeled data {(x_i, y_i)}_{i=1}^{m}, where m is the number of samples in the training set of task T_n, x_i is the i-th image sample in the training set of task T_n, and y_i is the class label of x_i; the buffer has capacity B, x_j is the j-th image sample in the buffer, and y_j is the class label of the j-th image sample x_j in the buffer; the reservoir sampling method comprises the following steps:
step A1, comparing the number num of samples encountered so far with the buffer capacity B; if num < B, the sample (x_i, y_i) is stored directly into the buffer;
step A2, if num ≥ B, a random integer rand_num is generated, with minimum value 0 and maximum value num−1; if rand_num < B, the sample (x_i, y_i) replaces the sample (x_{rand_num}, y_{rand_num}) in the buffer, where x_{rand_num} is the image sample with index rand_num in the buffer and y_{rand_num} is its label.
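For illustration, a minimal Python sketch of the buffer update in steps A1-A2; buffer is a list of (x, y) pairs with capacity B and num is the count of samples encountered so far (the names are assumptions of the sketch):

```python
import random

def reservoir_update(buffer, B, num, sample):
    # step A1: buffer not yet full, store the sample directly
    if num < B:
        buffer.append(sample)
    # step A2: draw rand_num uniformly from {0, ..., num-1} and replace
    # slot rand_num only if it falls inside the buffer
    else:
        rand_num = random.randint(0, num - 1)
        if rand_num < B:
            buffer[rand_num] = sample
    return num + 1   # every sample seen so far stays with probability B/num
```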
3. The generalized continuous classification method based on the online contrast distillation network as claimed in claim 1, wherein in step 4, the quality score of the classification output data of the teacher model is calculated as follows:
setting: the buffer has capacity B; x_j is the j-th image sample in the buffer and y_j is the class label of x_j; r^t(x_j) represents the classification output data obtained after sample x_j is sequentially processed by the feature encoder and the classifier of the teacher model; ω(x_j) is the quality score of the teacher model's classification output data for sample x_j; ω(x_j) is calculated as follows:

ω(x_j) = exp(r^t(x_j)_{y_j} / ρ) / Σ_{c=1}^{C} exp(r^t(x_j)_c / ρ)

in the formula: ρ represents a temperature coefficient; C represents the number of all possible categories; exp(·) represents an exponential function with base e; r^t(x_j)_{y_j} is the component of the classification output data r^t(x_j) corresponding to category y_j; r^t(x_j)_c is the component of the classification output data r^t(x_j) corresponding to category c.
4. The generalized continuous classification method based on the online contrast distillation network as claimed in claim 3, wherein in step 4, r^s(x_j) represents the classification output data obtained after sample x_j is sequentially processed by the feature encoder and the classifier of the student model; the online distillation loss L_od between the teacher model and the student model is calculated as follows:

L_od = E[ω(x_j) · ‖r^t(x_j) − r^s(x_j)‖_2]

in the formula: ‖·‖_2 represents the l_2 norm; E represents the mathematical expectation function.
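A sketch combining claims 3 and 4, assuming r_t and r_s are the (N, C) teacher and student classification outputs for buffer samples with labels y; the value of ρ and the use of the plain (not squared) l_2 norm are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

def online_distillation_loss(r_t, r_s, y, rho=2.0):
    with torch.no_grad():
        # ω(x_j): teacher softmax confidence on the correct class at temperature ρ
        probs = F.softmax(r_t / rho, dim=1)
        omega = probs.gather(1, y.unsqueeze(1)).squeeze(1)
    # ‖r_t(x_j) − r_s(x_j)‖_2, with the teacher treated as a fixed target
    dist = torch.norm(r_t.detach() - r_s, p=2, dim=1)
    return (omega * dist).mean()   # quality-weighted expectation
```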
5. The generalized continuous classification method based on the online contrast distillation network according to claim 1, wherein in step 5 the following are set:
the buffer has capacity B; x_j is the j-th image sample in the buffer and y_j is the class label of x_j; z_j^t is the feature embedding data of the teacher model obtained through the feature encoder and the feature mapper after sample x_j is input into the teacher model; z_j^s is the feature embedding data of the student model obtained through the feature encoder and the feature mapper after sample x_j is input into the student model; Z^t is the set of all teacher model feature embedding data z_j^t obtained from all samples x_j of the current batch; Z^s is the set of all student model feature embedding data z_j^s obtained from all samples x_j of the current batch; z_i^s denotes feature embedding data sampled from Z^s; Z^{t+} denotes the set of teacher model feature embeddings having the same class label as z_i^s; z_j^{t+} denotes feature embedding data sampled from Z^{t+}; z_k^t denotes feature embedding data sampled from Z^t; the contrast relation distillation loss L_rd between the teacher model and the student model is calculated as follows:

L_rd = −E[log(h(z_i^s, z_j^{t+}) / Σ_k h(z_i^s, z_k^t))]

h(z_i^s, z_j^{t+}) = exp((z_i^s)^T z_j^{t+} / (‖z_i^s‖_2 · ‖z_j^{t+}‖_2 · τ))

h(z_i^s, z_k^t) = exp((z_i^s)^T z_k^t / (‖z_i^s‖_2 · ‖z_k^t‖_2 · τ))

in the formula: E represents the mathematical expectation function; ‖·‖_2 represents the l_2 norm; log(·) represents the natural logarithmic function with base e; h(z_i^s, z_j^{t+}) is a judgment function that judges whether the feature embedding data z_i^s and z_j^{t+} originate from their joint distribution; h(z_i^s, z_k^t) is a judgment function that judges whether the feature embedding data z_i^s and z_k^t originate from their joint distribution; (·)^T represents transposition; exp(·) represents an exponential function with base e; τ represents the temperature coefficient.
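A compact sketch of this teacher-student contrastive term, assuming z_s and z_t are (N, d) l_2-normalized student and teacher embeddings of the same batch and labels the (N,) class labels; same-class teacher embeddings act as positives and the full teacher set forms the denominator:

```python
import torch

def relation_distillation_loss(z_s, z_t, labels, tau=0.1):
    z_t = z_t.detach()                 # no gradient flows into the teacher
    sim = z_s @ z_t.t() / tau          # (z_i^s)^T z_k^t / τ for normalized inputs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-class positives
    loss = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1)
    return loss.mean()
```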
6. The generalized continuous classification method based on the online contrast distillation network according to claim 5, wherein step 6 comprises the following sub-steps:
step B1, let Θ_t, Φ_t, Ψ_t be the feature encoder, classifier and feature mapper of the teacher model, and Θ_s, Φ_s, Ψ_s be the corresponding feature encoder, classifier and feature mapper of the student model; each training sample (x, y) of the student model is subjected to one random geometric transformation to obtain an augmented training sample (x̃, ỹ), where x represents an image sample, y is the class label of image sample x, x̃ is the geometrically transformed image sample, and ỹ is the label of the geometric transformation; the augmented training sample x̃ is input into the student model and processed by the feature encoder and the feature mapper of the student model to obtain the corresponding student model feature data F_s and feature embedding data z̃^s, wherein:

F_s = Θ_s(x̃);

z̃^s = Ψ_s(F_s);

step B2, the obtained student model feature data F_s is input into a multilayer perceptron φ, which judges the kind of geometric transformation performed on the training sample x̃; let the output of the multilayer perceptron be S_s; S_s is calculated as follows:

S_s = φ(F_s);

step B3, calculating the self-supervision loss L_ss; L_ss is calculated as follows:

L_ss = E[l(softmax(S_s), ỹ)]

wherein E represents the mathematical expectation function; softmax(·) represents the softmax function; l(·) represents the cross-entropy loss function;
step B4, setting: the buffer has capacity B; x_j is the j-th image sample in the buffer and y_j is the class label of x_j; z_j^s is the feature embedding data of the student model obtained through the feature encoder and the feature mapper after sample x_j is input into the student model; Z_all^s represents the set of all obtained student model feature embedding data z^s and z̃^s; z_i^s denotes feature embedding data sampled from Z_all^s; Z^{s+} denotes the set of student model feature embeddings having the same class label as z_i^s; z_j^{s+} denotes feature embedding data sampled from Z^{s+}; z_k^s denotes feature embedding data sampled from Z_all^s; based on the original feature embedding data and the augmented feature embedding data, supervised contrastive learning is carried out with the feature embedding data in the student model; the loss function of supervised contrastive learning L_sc is calculated as follows:

L_sc = −E[log(h(z_i^s, z_j^{s+}) / Σ_k h(z_i^s, z_k^s))]

h(z_i^s, z_j^{s+}) = exp((z_i^s)^T z_j^{s+} / (‖z_i^s‖_2 · ‖z_j^{s+}‖_2 · τ))

h(z_i^s, z_k^s) = exp((z_i^s)^T z_k^s / (‖z_i^s‖_2 · ‖z_k^s‖_2 · τ))

in the formula: E represents the mathematical expectation; ‖·‖_2 represents the l_2 norm; log(·) represents the natural logarithmic function with base e; h(z_i^s, z_j^{s+}) represents the distance between the feature embedding data z_i^s and z_j^{s+}; h(z_i^s, z_k^s) represents the distance between the feature embedding data z_i^s and z_k^s; exp(·) represents an exponential function with base e; (·)^T represents transposition; τ represents the temperature coefficient;
step B5, the self-supervision loss L_ss and the supervised contrastive learning loss L_sc are combined to obtain the collaborative contrast loss L_cc, which helps the student model extract more discriminative features; L_cc is calculated as follows:

L_cc = L_ss + L_sc.
7. The generalized continuous classification method based on the online contrast distillation network according to claim 6, wherein in step B1, the geometric transformations include rotating, scaling and adjusting the aspect ratio of the image.
8. The generalized continuous classification method based on the online contrast distillation network as claimed in claim 1, wherein in step 7, the non-stationary data stream is assumed to consist of n sample-disjoint tasks {T_1, T_2, ..., T_n}; let x denote an image sample from task T_n and the buffer, and let y be the class label of image sample x; the cross-entropy classification loss L_ce of the student model is calculated as follows:

L_ce = E_{(x,y)}[l(softmax(r_s(x)), y)]

in the formula: E represents the mathematical expectation function; softmax(·) denotes the softmax function; l(·) represents the cross-entropy loss function; r_s(x) represents the classification output data of image sample x after being sequentially processed by the feature encoder and the classifier of the student model.
9. The generalized continuous classification method based on the online contrast distillation network as claimed in claim 1, wherein the specific method for updating the parameters of the teacher model by using the parameters of the student model in step 9 is as follows:
let Θ_t, Φ_t, Ψ_t be the feature encoder, classifier and feature mapper of the teacher model, and Θ_s, Φ_s, Ψ_s be the corresponding feature encoder, classifier and feature mapper of the student model; the method for updating the parameters of the teacher model comprises:

Θ_t ← mΘ_t + (1−m)[(1−X)Θ_t + XΘ_s];

Φ_t ← mΦ_t + (1−m)[(1−X)Φ_t + XΦ_s];

Ψ_t ← mΨ_t + (1−m)[(1−X)Ψ_t + XΨ_s];

where m represents a momentum factor and X obeys a Bernoulli distribution (also known as a 0-1 distribution), defined as:

P(X = k) = p^k (1−p)^(1−k), k ∈ {0, 1};

the value range of the Bernoulli probability p is (0, 1), and the update frequency of the teacher model is controlled through p.
10. The generalized continuous classification method based on the online contrast distillation network according to claim 9, wherein the momentum factor m is calculated as follows:
m=min(itera/(itera+1),η);
where itera is the number of iterations of the current student model, min(itera/(itera+1), η) represents the smaller of itera/(itera+1) and η, and η is a constant, typically set to 0.999.
CN202210326319.8A 2022-03-30 2022-03-30 Generalized continuous classification method based on online contrast distillation network Pending CN114972839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210326319.8A CN114972839A (en) 2022-03-30 2022-03-30 Generalized continuous classification method based on online contrast distillation network


Publications (1)

Publication Number Publication Date
CN114972839A true CN114972839A (en) 2022-08-30

Family

ID=82976151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210326319.8A Pending CN114972839A (en) 2022-03-30 2022-03-30 Generalized continuous classification method based on online contrast distillation network

Country Status (1)

Country Link
CN (1) CN114972839A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115511059A (en) * 2022-10-12 2022-12-23 北华航天工业学院 Network lightweight method based on convolutional neural network channel decoupling
CN115511059B (en) * 2022-10-12 2024-02-09 北华航天工业学院 Network light-weight method based on convolutional neural network channel decoupling
CN115457042A (en) * 2022-11-14 2022-12-09 四川路桥华东建设有限责任公司 Method and system for detecting surface defects of thread bushing based on distillation learning
CN116502621A (en) * 2023-06-26 2023-07-28 北京航空航天大学 Network compression method and device based on self-adaptive comparison knowledge distillation
CN116502621B (en) * 2023-06-26 2023-10-17 北京航空航天大学 Network compression method and device based on self-adaptive comparison knowledge distillation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination