CN114972839A

CN114972839A - Generalized continuous classification method based on online contrast distillation network

Info

Publication number: CN114972839A
Application number: CN202210326319.8A
Authority: CN
Inventors: 冀中; 黎晋
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2022-08-30

Abstract

The invention discloses a generalized continuous classification method based on an online comparison distillation network, which is used for establishing a classification model based on knowledge distillation; establishing a buffer area and updating the buffer area by using a reservoir sampling method; randomly sampling S samples from the buffer area and respectively inputting the samples into the teacher model and the student model to obtain classification output and feature embedding corresponding to the teacher model and the student model; calculating and classifying and outputting the quality fraction according to the teacher model, adjusting knowledge distillation loss function weights of different samples and calculating distillation loss

Comparing the characteristic embedding between the two models, and calculating the distillation loss of the comparison relation of the two models

Calculating student model self-supervision loss

And supervised contrast learning loss

Calculating cross entropy classification loss of student model

And (4) weighting and adding the losses to determine parameters of the maximum optimization objective optimization student model. And updating the parameters of the teacher model by the parameters of the student model. The invention has good classification accuracy for both new tasks and old tasks.

Description

Generalized continuous classification method based on online contrast distillation network

Technical Field

The invention relates to a generalized continuous classification method, in particular to a generalized continuous classification method based on an online contrast distillation network.

Background

At present, in recent years, deep learning has achieved good effects on computer vision tasks such as image classification, target detection and semantic segmentation. However, when a neural network trained on an old task is trained directly on a new task, the new task can severely interfere with the performance of the old task, creating a Catastrophic Forgetting problem. It is obvious that training a neural network from scratch consumes more time and computing resources, and data of a previous task is not necessarily acquired again due to privacy and other problems. The human beings have the ability of continuously learning, and can quickly learn new knowledge on the basis of old knowledge without damaging the stability of the previously learned knowledge. It is expected that neural networks will have this ability to learn continuously, and continuous Learning (also called Incremental Learning) is proposed to overcome the problem of catastrophic forgetting. In recent years, a lot of continuous learning work adopts the idea of empirical Replay (Experience Replay), samples of a part of old tasks are stored, and the stored samples are replayed when training new tasks, so that the catastrophic forgetting problem is alleviated.

In the existing continuous learning technology, it is often necessary to assume that categories among tasks are mutually disjoint, that is, categories in a new task are not present in an old task, and clear task boundaries exist between tasks, and there is a high possibility that such a priori knowledge does not exist in real-world tasks. Many existing techniques exploit the unlikely a priori knowledge of such realistic tasks, simplifying the difficulty of continuous learning problems. For example, when the model output of old samples at the past time is used to output a regular model of old samples at the present time to alleviate catastrophic forgetting, the dimension of the old model output and the dimension of the new model output become inconsistent due to the arrival of new classes, and in the case where it is assumed that classes are complementary between tasks, the output of the new model can be made to coincide with only a part of the old model. In the past, a continuous method of utilizing the mutual disjointness of the task classes cannot be applied to the setting of the generalized continuous learning. For this reason, a General continuous Learning (General continuous Learning) technique for solving the catastrophic forgetting problem in a real scene is receiving attention. The goal of generalized continuous learning is to consolidate learned knowledge from a non-stationary infinite data stream and learn new knowledge quickly. Under the setting of generalized continuous learning, the categories of each task may be intersected, new samples of old categories may appear in new tasks, and the conventional method for solving the continuous learning by means of the prior knowledge which does not necessarily exist in the real world is difficult to apply to the generalized continuous learning.

The generalized continuous Learning is a general continuous Learning scenario, and can also be applied to the classic Class Incremental Learning (Class Incremental Learning), Task Incremental Learning (Task Incremental Learning) and Domain Incremental Learning (Domain Incremental Learning) scenarios. But specific a priori knowledge in these classical scenarios cannot be exploited to alleviate catastrophic forgetfulness when image classification is done in scenarios of broad continuous learning. This means that in empirical playback some inherent non-scene specific information must be mined to consolidate the old task knowledge.

Disclosure of Invention

The invention provides a generalized continuous classification method based on an online contrast distillation network for solving the technical problems in the prior art.

The technical scheme adopted by the invention for solving the technical problems in the prior art is as follows: a generalized continuous classification method based on an online comparative distillation network comprises the following steps:

step 1, establishing a knowledge distillation-based classification model, wherein the classification model comprises a teacher model and a student model; the teacher model and the student model are provided with a feature encoder, a classifier and a feature mapper; setting an optimization target of the student model; initializing parameters of a teacher model and parameters of a student model and endowing the parameters to a buffer area with a fixed size;

step 2, when a batch data stream containing R samples arrives, counting the number of the samples encountered currently, and updating a buffer area by using a reservoir sampling method;

step 3, randomly sampling S samples from the buffer area, respectively inputting the S samples into a teacher model and a student model, and respectively obtaining classification output data of the teacher model and the student model corresponding to the S samples through the processing of respective feature encoders and classifiers of the S samples; respectively obtaining feature embedded data of the teacher model and the student model corresponding to the S samples through the processing of the feature encoder and the feature mapper of the teacher model and the feature mapper;

step 4, calculating the quality scores of the teacher model classified output data, adjusting the weights of the online knowledge distillation loss functions of different samples according to the quality scores of the teacher model classified output data, and further calculating the online distillation loss of the teacher model and the student model

Step 5, comparing characteristic embedded data between the teacher model and the student models, and calculating the comparison relation distillation loss of the teacher model and the student models

Step 6, using the self-supervision learning and the supervision contrast learning to help the student model to extract the discriminant characteristics to calculate the self-supervision loss of the student model

And supervised contrast learning loss

Step 7, calculating the cross entropy classification loss of the student model based on experience playback

Step 8, calculating the total optimization target of the student model

α ₁ To alpha ₃ The hyperparameters of each corresponding loss function; optimizing parameters of the student model by using a random gradient descent algorithm;

and 9, directly utilizing the parameters of the student model to update the parameters of the teacher model.

Further, in step 2, assume that the non-stationary data stream is composed of n sample disjoint tasks { T } ₁ ，T ₂ ，...，T _n Are composed of, each task T _n The training set of (A) is all marked data

Composition, where m is task T _n Number of samples, x, of training set _i For task T _n I-th image sample in training set, y _i For task T _n Ith image sample x in training set _i The labeled category; buffer zone

Has a capacity of

x _j For the jth image sample in the buffer, y _j Is the j image sample x in the buffer _j The labeled category; the method for sampling the water reservoir comprises the following steps:

step A1, judging the current encounteredNumber of samples num and buffer capacity

The size between, if

Sample (x) _i ，y _i ) Direct storage to buffer

Performing the following steps; x is the number of _i For task T _n I image sample in training set, y _i For task T _n Ith image sample x in training set _i The labeled category;

step A2, if

Generating a random integer rand _ num, wherein the minimum value of the random integer is 0, and the maximum value is num-1; if it is

Using samples (x) _i ，y _i ) Replacing samples (x) in the buffer _{rand_num} ，y _{rand_num} )；x _{rand_num} Buffer with index rand _ num

Of (2) an image sample, y _{rand_num} Buffer with index rand _ num

The image sample label in (1).

Further, in step 4, the method for calculating the quality score of the teacher model classification output data is as follows:

setting:

indicates a capacity of

The buffer area of (2); x is the number of _j For the jth image sample in the buffer, y _j Is the j image sample x in the buffer _j The labeled category;

represents a sample x _j Classified output data are obtained after the characteristic encoder and the classifier of the teacher model are sequentially processed; omega (x) _j ) For corresponding sample x _j The teacher model of (1) classifies the quality score of the output data; omega (x) _j ) The calculation formula of (a) is as follows:

in the formula:

ρ represents a temperature coefficient;

c represents the number of all possible categories;

exp (·) represents an exponential function based on a natural constant e;

outputting data for classification

Middle corresponding category y _j The classification output data of (1);

representing sorted output data

The classification of each class outputs data.

Further, in step 4, let

Represents a sample x _j Sequentially pass through the student modelThe feature encoder and the classifier of (2) to obtain classified output data; calculating on-line distillation loss for teacher and student models

The method of (1) is as follows:

in the formula: l. capillary ₂ Is represented by ₂ A norm;

representing a mathematical expectation function.

Further, in step 5, setting:

indicates a capacity of

is a sample x _j After the input data is input into the teacher model, the characteristic embedded data of the teacher model is obtained through a characteristic encoder and a characteristic mapper;

is a sample x _j After the data is input into the student model, feature embedded data of the student model is obtained through a feature encoder and a feature mapper; z is a radical of ^t All samples x for the current batch _j All teacher model feature embedded data obtained by processing the teacher model feature embedded data through a feature encoder and a feature mapper after being input into the teacher model

A set of (a); z is a radical of ^s All samples x for the current batch _j All student model feature embedded data obtained by the feature encoder and the feature mapper after being input into the student model

A set of (a);

denotes from z ^s Sampling the obtained characteristic embedded data; z is a radical of ^t+ Is represented by

Embedding teacher model characteristics with the same class label into a data set;

denotes from z ^t+ Sampling the obtained characteristic embedded data;

denotes from z ^t Sampling the obtained characteristic embedded data; calculating comparative distillation loss between teacher model and student model

The method (2) is as follows:

in the formula:

representing a mathematical expectation function;

|·| ₂ is represented by ₂ A norm;

log (-) represents a natural logarithmic function based on a natural constant e;

representing a judgment function for judging the feature-embedded data

And

whether or not they originate from their joint distribution

Representing a judgment function for judging the feature-embedded data

And

whether or not they originate from their joint distribution

(·) ^T Representing a transpose;

exp (·) represents an exponential function based on a natural constant e;

τ represents a temperature coefficient.

Further, step 6 includes the following sub-steps:

step B1, let Θ be ^t ，Φ ^t ，Ψ ^t Feature encoder and classification corresponding to teacher modelDevice and feature mapper ^s ，Φ ^s ，Ψ ^s The characteristic encoder, the classifier and the characteristic mapper correspond to the teacher model; each training sample (x, y) of the student model is subjected to one-time random geometric transformation to obtain an amplified training sample

Where x represents an image sample, y is the class labeled by image sample x,

for the image samples after the geometric transformation,

a label that is geometrically transformed; amplifying the training sample

Inputting the data into a student model, and processing the data by a feature coder and a feature mapper of the student model to obtain corresponding student model feature data F ^s And feature embedded data

Wherein:

step B2, obtaining student model characteristic data F ^s Input to a multi-layer sensor

In the method, the method is used for judging the training samples

The kind of geometric transformation to be performed; let the output of the multilayer perceptron be S ^s ，S ^s The calculation formula of (a) is as follows:

step B3, calculating the self-supervision loss

The calculation formula of (a) is as follows:

wherein

Representing a mathematical expectation function;

softmax (·) denotes a softmax function;

l (-) represents a cross entropy loss function;

step B4, setting:

indicates a capacity of

is a sample x _j After the data is input into the student model, feature embedded data of the student model is obtained through a feature encoder and a feature mapper;

feature embedded data representing acquired ensemble student models

And

a set of (a);

represents from

Sampling the obtained characteristic embedded data;

is shown and

embedding the characteristics of the student model with the same class label into a data set;

represents from

Sampling the obtained characteristic embedded data;

represents from

Embedding the sampled features into data; based on the original characteristic embedded data and the amplified characteristic embedded data, the characteristic embedded data in the student model is utilized to carry out supervised contrast learning, and the loss function of the supervised contrast learning

The calculation formula of (c) is as follows:

in the formula:

wherein

Represents a mathematical expectation;

|·| ₂ represents l ₂ A norm;

representing feature embedded data

And

the distance of (d);

representing feature embedded data

And

the distance of (d);

exp (·) represents an exponential function based on a natural constant e;

representing a transpose;

τ represents a temperature coefficient;

step B5, loss will be self-monitored

And supervised contrast learning loss

Performing the combination yields a collaborative contrast loss

Help the student model to better extract discriminant features,

the calculation formula of (a) is as follows:

further, in step B1, the geometric transformation includes rotating, scaling and adjusting the aspect ratio of the image.

Further, in step 7, assume that the non-stationary data stream is composed of n sample disjoint tasks { T } ₁ ，T ₂ ，...，T _n Is formed, let x denote task T _n And a buffer area

Y is the category marked by the image sample x; cross entropy classification loss for student models

The calculation formula of (a) is as follows:

in the formula:

representing a mathematical expectation function;

softmax (·) denotes a softmax function;

l (-) represents a cross entropy loss function;

r ^s (x) And the classification output data of the image sample x after being sequentially processed by a feature encoder and a classifier of the student model is represented.

Further, in step 9, a specific method for updating the parameters of the teacher model by using the parameters of the student model is as follows:

let Θ be ^t ，Φ ^t ，Ψ ^t Feature encoder, classifier and feature mapper for the teacher model ^s ，Φ ^s ，Ψ ^s The characteristic encoder, the classifier and the characteristic mapper correspond to the teacher model; the method for updating the parameters of the teacher model comprises the following steps:

Θ ^t ←mΘ ^t +(1-m)[(1-X)Θ ^t +XΘ ^s ]；

Φ ^t ←mΦ ^t +(1-m)[(1-X)Φ ^t +XΦ ^s ]；

Ψ ^t ←mΨ ^t +(1-m)[(1-X)Ψ ^t +XΨ ^s ]；

where m represents a momentum factor and X obeys a Bernoulli distribution (also known as a 0-1 distribution), defined as:

P(X＝k)＝p ^k (1-p) ^1-k ，k＝{0，1}；

the value range of the Bernoulli probability p is (0, 1), and the update frequency of the teacher model is controlled through the Bernoulli probability p.

Further, the calculation formula of the momentum factor m is as follows:

m＝min(itera/(itera+1)，η)；

where itera is the number of iterations of the current student model, min (itera/(itera +1), η) represents the smaller of itera/(itera +1) and η, and η is a constant and is typically set to 0.999.

The invention has the advantages and positive effects that: according to the generalized continuous classification method based on the online comparison distillation network, the teacher-student framework in online knowledge distillation is used for effectively consolidating the knowledge of the old task, so that the model has good classification accuracy on both the new task and the old task. In the training stage, the training strategy of comparison learning is introduced into online knowledge distillation, the teacher model integrates the weights of the student models at all times to realize the accumulation of knowledge, and the student models distill classified output data and comparison relations to the teacher model to relieve catastrophic forgetting. The teacher model and the student models are cooperated with each other, so that the student models can keep the performance of the old tasks, the teacher model can accumulate the weight which is more balanced to the old tasks and the new tasks when accumulating the weight, and the teacher model can better guide the student models to consolidate the knowledge of the old tasks when the student models train the new tasks. In the testing stage, the teacher model is used for testing, because the teacher model integrates the advantages of the student models which are good at distinguishing different classes at different moments, the teacher model can have good classification performance on all classes. Therefore, the invention can effectively integrate the advantages of the student network and improve the classification accuracy of the teacher network during testing.

Drawings

FIG. 1 is a flow chart of the broad continuous classification method based on an online contrast distillation network according to the present invention.

Detailed Description

For further understanding of the contents, features and effects of the present invention, the following embodiments are enumerated in conjunction with the accompanying drawings, and the following detailed description is given:

referring to fig. 1, a generalized continuous classification method based on an online comparative distillation network includes the following steps:

step 1, establishing a knowledge distillation-based classification model, wherein the classification model comprises a teacher model and a student model; the teacher model and the student model are provided with a feature encoder, a classifier and a feature mapper; setting an optimization target of the student model; parameters of the teacher model and the student model are initialized and a buffer area with a fixed size is given.

And 2, when a batch data stream containing R samples arrives, counting the number of the samples encountered currently, and updating the buffer area by using a reservoir sampling method.

Step 3, randomly sampling S samples from the buffer area, respectively inputting the S samples into a teacher model and a student model, and respectively obtaining classification output data of the teacher model and the student model corresponding to the S samples through the processing of respective feature encoders and classifiers of the S samples; respectively obtaining feature embedded data of the teacher model and the student model corresponding to the S samples through the processing of the feature encoder and the feature mapper of the teacher model and the feature mapper; namely: processing the S samples by a feature encoder and a classifier of the teacher model in sequence to obtain a classification output data set of the teacher model; after being processed by a feature encoder and a classifier of the student model in sequence, a classification output data set of the student model is obtained; after being processed by a feature encoder and a feature mapper of the teacher model in sequence, a feature embedded data set of the teacher model is obtained; and after being processed by a feature encoder and a feature mapper of the student model in sequence, the feature embedded data set of the student model is obtained.

Step 5, comparing the characteristic embedded data between the teacher model and the student models, and calculating the comparison relationship distillation loss of the teacher model and the student models

And supervised contrast learning loss

Step 8, calculating the total optimization target of the student model

α ₁ To alpha ₃ The hyperparameters of each corresponding loss function; and optimizing parameters of the student model by using a random gradient descent algorithm.

Preferably, in step 2, it can be assumed that the non-stationary data stream is formed by n sample disjoint tasks { T } ₁ ，T ₂ ，...，T _n Are composed of, each task T _n The training set of (A) is all marked data

Composition, where m is task T _n Number of samples, x, of training set _i For task T _n I image sample in training set, y _i For task T _n Ith image sample x in training set _i The labeled category; buffer zone

Has a capacity of

x _j For the j image sample in the buffer，y _j Is the j image sample x in the buffer _j The labeled category; a method of sampling a water reservoir may comprise the steps of:

step A1, judging the size between the number num of samples encountered currently and the buffer area capacity B, if num is less than or equal to B, sampling (x) _i ，y _i ) Direct storage to buffer

Performing the following steps; x is the number of _i For task T _n I image sample in training set, y _i For task T _n Ith image sample x in training set _i The labeled category.

Step A2, if

Image sample of (1), y _{rand_num} Buffer with index rand _ num

The image sample label in (1).

Preferably, in step 4, the method for calculating the quality score of the teacher model classification output data may be as follows:

can be provided with:

indicates a capacity of

represents a sample x _j Classifying output data is obtained after the processing of the characteristic encoder and the classifier of the teacher model is carried out in sequence; omega (x) _j ) For corresponding sample x _j The teacher model of (1) classifies and outputs the quality scores of the data; omega (x) _j ) The calculation formula of (c) can be as follows:

in the formula: ρ represents a temperature coefficient; c represents the number of all possible categories; exp (·) represents an exponential function based on a natural constant e;

outputting data for classification

Middle corresponding category y _j The classification output data of (2);

representing sorted output data

The classification of each class outputs data.

Preferably, in step 4, let

Represents a sample x _j Classifying output data is obtained after the data are sequentially processed by a feature encoder and a classifier of the student model; calculating on-line distillation loss for teacher and student models

The method of (2) is as follows:

in the formula: l. capillary ₂ Is represented by ₂ A norm;

representing a mathematical expectation function; exp (. cndot.) represents an exponential function based on a natural constant e.

Preferably, in step 5, there may be:

indicates a capacity of

is a sample x _j After the data is input into the student model, feature embedded data of the student model is obtained through a feature encoder and a feature mapper; z is a radical of formula ^t All samples x for the current batch _j All teacher model feature embedded data obtained by processing the teacher model feature embedded data through a feature encoder and a feature mapper after being input into the teacher model

A set of (a); z is a radical of ^s All samples x for the current batch _j After being input into the student model, all student model characteristic embedded data obtained through the characteristic encoder and the characteristic mapper

A set of (a);

represents from z ^s Sampling the obtained characteristic embedded data; z is a radical of ^t+ Is shown and

embedding teacher model features with the same class label into a data set;

denotes from z ^t+ Embedding the sampled features into data;

denotes from z ^t Embedding the sampled features into data; calculating the distillation loss of the comparison relationship between the teacher model and the student model

The method of (a) can be as follows:

in the formula:

representing a mathematical expectation function; l. capillary ₂ Is represented by ₂ A norm; log (-) represents a natural logarithmic function based on a natural constant e;

representing a judgment function for judging the feature-embedded data

And

whether or not they originate from their joint distribution

Representing a judgment function for judging the feature-embedded data

And

whether or not they originate from their joint distribution

Representing a transpose; exp (·) represents an exponential function based on a natural constant e; τ represents a temperature coefficient.

Preferably, step 6 may comprise the following sub-steps:

in step B1, Θ can be set ^t ，Φ ^t ，Ψ ^t Feature encoder, classifier and feature mapper for the teacher model ^s ，Φ ^s ，Ψ ^s Corresponding to the feature encoder, the classifier and the feature mapper of the teacher model, each training sample (x, y) of the student model is subjected to one-time random geometric transformation to obtain an amplified training sample

Where x represents an image sample, y is the class labeled by image sample x,

for the image samples after the geometric transformation,

a label that is geometrically transformed; amplifying the training sample

Inputting the data into the student model, processing the data by the feature encoder and the feature mapper of the student model to obtain corresponding student model feature data F ^s And feature embedded data

Wherein:

step B2, the obtained student model characteristic data F can be used ^s Input to a multi-layer sensor

In the method, the method is used for judging the training samples

The kind of geometric transformation to be performed; the output of the multilayer perceptron can be set as S ^s ，S ^s The calculation formula of (c) can be as follows:

step B3, countingCalculated self-supervision loss

The calculation formula of (c) can be as follows:

wherein

Representing a mathematical expectation function; softmax (·) denotes a softmax function; l (-) represents a cross entropy loss function;

step B4, may include:

indicates a capacity of

feature embedded data representing derived universe student model

And

a set of (a);

represents from

Embedding the sampled features into data;

is shown and

represents from

Sampling the obtained characteristic embedded data;

represents from

Sampling the obtained characteristic embedded data; based on the original characteristic embedded data and the amplified characteristic embedded data, the original characteristic embedded data and the amplified characteristic embedded data in the student model can be utilized to carry out supervised contrast learning, and the loss function of the supervised contrast learning

The calculation formula of (c) can be as follows:

in the formula:

wherein

Represents a mathematical expectation; l. capillary ₂ Is represented by ₂ A norm; log (-) represents a natural logarithmic function based on a natural constant e;

representing feature embedded data

And

the distance of (d);

representing feature embedded data

And

the distance of (d); exp (·) represents an exponential function based on a natural constant e;

representing a transpose; τ represents a temperature coefficient;

step B5, self-supervision loss can be reduced

And supervised contrast learning loss

Combining to obtain collaborative contrast loss

Help the student model to better extract discriminant features,

the calculation formula of (c) can be as follows:

preferably, in step B1, the geometric transformation may include rotating, scaling and adjusting the aspect ratio of the image.

Preferably, in step 7, it can be assumed that the non-stationary data stream is formed by n sample disjoint tasks { T } ₁ ，T ₂ ，…，T _n Is formed, let x denote task T _n And a buffer area

The calculation formula of (c) can be as follows:

in the formula:

representing a mathematical expectation function; softmax (·) denotes a softmax function; l (-) represents a cross entropy loss function; r is ^s (x) And the classification output data of the image sample x after being sequentially processed by a feature encoder and a classifier of the student model is represented.

Preferably, in step 9, a specific method for updating the parameters of the teacher model by using the parameters of the student model may be as follows:

let Θ be ^t ，Φ ^t ，Ψ ^t Corresponding characteristic compiling for teacher modelCoder, classifier and feature mapper ^s ，Φ ^s ，Ψ ^s The characteristic encoder, the classifier and the characteristic mapper correspond to the teacher model; the method for updating the parameters of the teacher model can be as follows:

Θ ^t ←mΘ ^t +(1-m)[(1-X)Θ ^t +XΘ ^s ]；

Φ ^t ←mΦ ^t +(1-m)[(1-X)Φ ^t +XΦ ^s ]；

Ψ ^t ←mΨ ^t +(1-m)[(1-X)Ψ ^t +XΨ ^s ]；

where m represents a momentum factor and X obeys a Bernoulli distribution (also known as a 0-1 distribution), which can be defined as:

P(X＝k)＝p ^k (1-p) ^1-k ，k＝{0，1}；

the value range of the Bernoulli probability p is (0, 1), and the updating frequency of the teacher model can be controlled through the Bernoulli probability p.

Preferably, the calculation formula of the momentum factor m can be as follows:

m＝min(itera/(itera+1)，η)；

where itera is the number of iterations of the current student model, min (itera/(itera +1), η) represents the smaller of itera/(itera +1) and η, η is a constant and can be set to 0.999.

The working process and working principle of the present invention are further explained by a preferred embodiment of the present invention as follows:

the generalized continuous learning is to consolidate old knowledge and accumulate new knowledge from a non-stationary data stream, and finally complete classification prediction of all the seen class images. Assume that a non-stationary data stream is formed by N sample disjoint tasks T ₁ ，T ₂ ，...，T _N Are composed of, each task T _n The training set of (A) is all marked data

Composition, where m is task T _n Number of samples, x, of training set _i For task T _n Training set ith image sample，y _i For task T _n Ith image sample x in training set _i The labeled category. In the testing stage, the generalized continuous learning method can complete classification tasks for all the classes currently seen. Each task T _n The test sets of (1) are all tagged with data

Composition, wherein p is task T _n Number of samples, x, of test set _q For task T _n The q image sample in the test set, y _q For task T _n The q image sample x in the test set _q The labeled category. The generalized continuous learning task is all the tasks { T } trained at present ₁ ，T ₂ ，...，T _n And performing category prediction on the test set.

FIG. 1 is a flow chart of the broad continuous classification method based on an online contrast distillation network according to the present invention. Wherein the content of the first and second substances,

indicates a capacity of

Buffer of x _j For the jth image sample in the buffer, y _j Is the j image sample x in the buffer _j The labeled category.

Which represents the loss of the on-line distillation,

indicating comparative distillation losses. Let Θ be ^t ，Φ ^t ，Ψ ^t Setting theta as a feature encoder, a classifier and a feature mapper of the teacher model respectively ^s ，Φ ^s ，Ψ ^s A feature encoder, a classifier and a feature mapper respectively of the student model,

the invention relates to a generalized continuous classification method based on an online contrast distillation network, which comprises the following steps:

step 1, before a task starts, firstly initializing parameters of a teacher model and a student model and endowing a buffer area with a fixed size: theta ^t ＝Θ ^s ，Φ ^t ＝Φ ^s ，Ψ ^t ＝Ψ ^s ，

Step 2, when a batch data stream containing bsz samples arrives, counting the number num of the samples met currently, and updating a buffer area by utilizing a reservoir sampling method

This ensures that the probability of all samples being stored in the buffer is equal. For a specific sample, the specific steps of sampling by using the water reservoir comprise:

(1) judging the number num of samples encountered currently and the capacity of a buffer area

The size between, if

Sample (x) _i ，y _i ) Directly storing the data into a buffer area;

(2) if it is

A random integer rand _ num is generated, the minimum value of the random integer is 0, and the maximum value is num-1. If it is

Using samples (x) _i ，Y _i ) Replacing samples (x) in the buffer _{rand_num} ，y _{rand_num} )；x _rand-num The representation index is

Of the buffer of (2), y _{rand_num} The representation index is

Of the buffer area.

Step 3, from the buffer area

Randomly sampling S samples x _j To consolidate the old knowledge, the S samples x _j Respectively inputting the data into a teacher model and a student model, wherein the classified output data of the teacher model and the classified output data of the student model obtained by the feature encoder and the classifier are respectively as follows:

the characteristic embedded data of the teacher model and the student model obtained by the characteristic encoder and the characteristic mapper are respectively as follows:

step 4, setting:

represents a sample x _j Classified output data are obtained after the characteristic encoder and the classifier of the teacher model are sequentially processed; omega (x) _j ) For corresponding sample x _j The teacher model of (1) classifies the quality score of the output data;

representing sorted output data

The classification output data of each category; is provided with

Represents a sample x _j Classifying output data are obtained after the processing of a feature encoder and a classifier of the student model is carried out in sequence;

outputting data for classification

Middle corresponding category y _j The classification of (2) outputs data.

Calculating the quality fraction omega (x) of the classified output data of each sample by the perceptron _j )：

Where ρ is the temperature coefficient, C represents the number of all possible classes, and exp (·) represents an exponential function based on the natural constant e.

Calculating the on-line distillation loss according to equations (1), (2) and (5)

Wherein |. non ₂ Is represented by ₂ The norm of the number of the first-order-of-arrival,

representing a mathematical expectation function. By giving the difference between the outputs of the teacher model and the student model

Weight ω (x) _j ) To let the student model focus more attention on samples with high quality scores.

And 5, comparing the characteristic embedded data between the teacher model and the student model, and calculating the distillation loss of the comparison relation according to the formulas (3) and (4)

Wherein

Represents a mathematical expectation function, log (-) represents a natural logarithmic function based on a natural constant e,

A set of (a);

denotes from z ^s Sampling the obtained characteristic embedded data; z is a radical of ^t+ Is shown and

represents from z ^t+ Sampling the obtained characteristic embedded data;

denotes from z ^t The sampled features are embedded in the data.

Representing a judgment function for judging the feature-embedded data

And

whether or not they originate from their joint distribution

The calculation formula is as follows:

wherein exp (·) represents an exponential function with a natural constant e as a base, | · non · ₂ Is represented by ₂ Norm, (.) ^T Denotes transposition and τ denotes temperature coefficient.

Representing a judgment function for judging the feature-embedded data

And

whether or not they originate from their joint distribution

The calculation formula is as follows:

And 6, using self-supervised learning and supervised contrast learning to help the student model to extract discriminant features, wherein the specific steps comprise:

(1) each training sample (x, y) of the student model is subjected to one-time random geometric transformation to obtain an amplified training sample

Where x represents an image sample, y is the class labeled by image sample x,

for the image samples after the geometric transformation,

is a geometrically transformed label. The geometric transformation includes rotating, scaling and adjusting the aspect ratio of the image. This doubles the number of training images for the student model. For the set of images subjected to random geometric transformation

Inputting them into the student network, and getting the corresponding student model characteristics and characteristic embedding data:

(2) inputting the obtained characteristics of the student model into a multi-layer perceptron

In the method, the training samples are judged and matched

The kind of geometric transformation performed:

(3) calculating self-supervision loss

Wherein

Represents the mathematical expectation function, softmax (·) represents the softmax function, and l (·) represents the cross-entropy loss function.

(4) Is provided with

Is a sample x _j After being input into the student model, the characteristics are codedCharacteristic embedding data of the student model obtained by the device and the characteristic mapper;

feature embedded data representing acquired ensemble student models

And

a set of (a);

represents from

Sampling the obtained characteristic embedded data;

is shown and

embedding the student model characteristics with the same category label into a data set;

represents from

Sampling the obtained characteristic embedded data;

represents from

The calculation formula of (a) is as follows:

in the formula:

represents a mathematical expectation; log (-) represents a natural logarithmic function based on a natural constant e;

representing feature embedded data

And

the distance of (d);

representing feature embedded data

And

the distance of (d); exp (·) denotes an exponential function based on a natural constant e; l. capillary ₂ Is represented by ₂ A norm;

representing a transpose; τ represents a temperature coefficient.

(5) For self-supervision loss

And supervised contrast learning loss

Performing the combination yields a collaborative contrast loss

Help student's model to extract the characteristic of discriminant better:

and 7, calculating the cross entropy classification loss of the student model based on experience playback:

wherein x represents a task T _n And a buffer area

Y is the category marked by the image sample x;

representing a mathematical expectation function; softmax (·) denotes a softmax function; l (-) represents a cross entropy loss function; r is a radical of hydrogen ^s (x) And the classification output data of the image sample x after being sequentially processed by a feature encoder and a classifier of the student model is represented.

r ^s (x) Feature encoder Θ representing the passage of an image sample x through a student model ^s And a classifier phi ^s The resulting output:

step 8, calculating the total optimization target of the student model

And (3) optimizing parameters of the student model by using a random gradient descent algorithm:

wherein alpha is ₁ 、α ₂ And alpha ₃ Representing a hyper-parameter.

And 9, directly updating the parameters of the teacher model by using the parameters of the student model without involving gradient return, and setting theta ^t ，Φ ^t ，Ψ ^t Feature encoder, classifier and feature mapper for the teacher model ^s ，Φ ^s ，Ψ ^s The characteristic encoder, the classifier and the characteristic mapper correspond to a teacher model. The updating method comprises the following steps:

Θ ^t ←mΘ ^t +(1-m)[(1-X)Θ ^t +XΘ ^s ] (21)；

Φ ^t ←mΦ ^t +(1-m)[(1-X)Φ ^t +XΦ ^s ] (22)；

Ψ ^t ←mΨ ^t +(1-m)[(1-X)Ψ ^t +XΨ ^s ] (23)；

P(X＝k)＝p ^k (1-p) ^1-k ，k＝{0，1} (24)；

the value range of the Bernoulli probability p is (0, 1), and the updating frequency of the teacher model is controlled through the Bernoulli probability p.

In order for the teacher model to quickly learn new knowledge in the early stage of model training, the momentum factor m is designed as follows:

m＝min(itera/(itera+1)，η) (25)；

The generalized continuous classification method based on the online contrast distillation network can be used for testing at any time. In the testing stage, a teacher model is used for testing. The reason is that student models at different times are good at classifying different categories, and teacher models that learn from student models can accumulate their merits for learning. Thus, the teacher model has a greater ability to distinguish between all the classes seen than the student models.

The above-mentioned embodiments are only for illustrating the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and to carry out the same, and the present invention shall not be limited to the embodiments, i.e. the equivalent changes or modifications made within the spirit of the present invention shall fall within the scope of the present invention.

Claims

1. A generalized continuous classification method based on an online comparative distillation network is characterized by comprising the following steps:

And supervised contrast learning loss

Step 8, calculating the total optimization target of the student model

α ₁ To alpha ₃ The hyperparameters of each corresponding loss function; using random gradient descentOptimizing parameters of the student model by an algorithm;

2. The method for the generalized continuous classification based on the online contrast distillation network as claimed in claim 1, wherein in step 2, the non-stationary data stream is assumed to be formed by n sample disjoint tasks { T } ₁ ，T ₂ ，...，T _n Are composed of, each task T _n The training set of (A) is all marked data

Has a capacity of B, x _j For the jth image sample in the buffer, y _j Is the j image sample x in the buffer _j The labeled category; the method for sampling the water reservoir comprises the following steps:

step A1, determining the number num of samples currently encountered and the buffer capacity

The size between, if

Sample (x) _i ，y _i ) Direct storage to buffer

Performing the following steps; x is a radical of a fluorine atom _i For task T _n I-th image sample in training set, y _i For task T _n Ith image sample x in training set _i MarkedA category;

step A2, if

Generating a random integer rand _ num, wherein the minimum value of the random integer is 0, and the maximum value is num-1; if rand _ num < B, use the sample (x) _i ，y _i ) Replacing samples (x) in the buffer _{rand_num} ，y _{rand_num} )；x _{rand_num} Buffer with index rand _ num

Of (2) an image sample, y _{rand_num} Buffer with index rand _ num

The image sample label in (1).

3. The generalized continuous classification method based on the online contrast distillation network as claimed in claim 1, wherein in step 4, the method for calculating the quality score of the teacher model classification output data is as follows:

setting:

indicates a capacity of

representing a sample x _j Classified output data are obtained after the characteristic encoder and the classifier of the teacher model are sequentially processed; omega (x) _j ) For corresponding sample x _j The teacher model of (1) classifies the quality score of the output data; omega (x) _j ) Is calculated as followsThe following steps:

in the formula:

ρ represents a temperature coefficient;

c represents the number of all possible categories;

exp (·) represents an exponential function based on a natural constant e;

outputting data for classification

Middle corresponding category y _j The classification output data of (1);

representing sorted output data

The classification of each class outputs data.

4. The generalized continuous classification method based on online comparative distillation network as claimed in claim 3, wherein in step 4, setting

Represents a sample x _j Classifying output data are obtained after the processing of a feature encoder and a classifier of the student model is carried out in sequence; calculating on-line distillation loss for teacher and student models

The method of (2) is as follows:

in the formula: l. capillary ₂ To represent

A norm;

representing a mathematical expectation function.

5. The generalized continuous classification method based on the online comparative distillation network according to claim 1, wherein in step 5, the following steps are provided:

indicates a capacity of

The buffer area of (2); x is a radical of a fluorine atom _j For the jth image sample in the buffer, y _j Is the j image sample x in the buffer _j The labeled category;

A set of (a);

denotes from z ^t+ Sampling the obtained characteristic embedded data;

The method of (2) is as follows:

in the formula:

representing a mathematical expectation function;

|·| ₂ to represent

A norm;

representing a judgment function for judging the feature-embedded data

And

whether or not they originate from their joint distribution

Representing a judgment function for judging the feature-embedded data

And

whether or not they originate from their joint distribution

(·) ^T To representTransposition is carried out;

exp (·) represents an exponential function based on a natural constant e;

indicating the temperature coefficient.

6. The broad continuous classification method based on the online contrast distillation network according to claim 5, wherein the step 6 comprises the following sub-steps:

step B1, let Θ be ^t ，Φ ^t ，Ψ ^t Feature encoder, classifier and feature mapper for the teacher model ^s ，Φ ^s ，Ψ ^s The characteristic encoder, the classifier and the characteristic mapper correspond to the teacher model; each training sample (x, y) of the student model is subjected to one-time random geometric transformation to obtain an amplified training sample

Where x represents an image sample, y is the class labeled by image sample x,

for the image samples after the geometric transformation,

a label that is geometrically transformed; amplifying the training sample

Wherein:

In the method, the training samples are judged and matched

The kind of geometric transformation to be performed; let the output of the multilayer perceptron be S ^s ，S ^s The calculation formula of (c) is as follows:

step B3, calculating the self-supervision loss

The calculation formula of (a) is as follows:

wherein

Representing a mathematical expectation function;

softmax (·) represents a softmax function;

representing a cross entropy loss function;

step B4, setting:

indicates a capacity of

The buffer area of (2); x is the number of _j For the jth image sample in the buffer, y _j For the jth image sample x in the buffer _j The labeled category;

feature embedded data representing derived universe student model

And

a set of (a);

represents from

Sampling the obtained characteristic embedded data;

is shown and

represents from

Embedding the characteristics obtained by the fruit sample into data;

represents from

Sampling the obtained characteristic embedded data; based on the original characteristic embedded data and the amplified characteristic embedded data, the characteristic embedded data in the student model is utilized to carry out supervised contrast learning, and the loss function of the supervised contrast learning

The calculation formula of (a) is as follows:

in the formula:

wherein

Represents a mathematical expectation;

|·| ₂ to represent

A norm;

representing feature embedded data

And

the distance of (d);

representing feature embedded data

And

the distance of (a);

exp (·) represents an exponential function based on a natural constant e;

(·) ^T representing a transpose;

represents a temperature coefficient;

step B5, loss self-supervision

And supervised contrast learning loss

Performing the combination yields a collaborative contrast loss

Help the student model to better extract discriminant features,

the calculation formula of (a) is as follows:

7. the generalized continuous classification method based on online contrast distillation network according to claim 1, wherein in step B1, the geometric transformation includes rotating, scaling and adjusting aspect ratio of the image.

8. The method for the generalized continuous classification based on the online contrast distillation network as claimed in claim 1, wherein in step 7, the non-stationary data stream is assumed to be formed by n sample disjoint tasks { T } ₁ ，T ₂ ，...，T _n Is formed, let x denote task T _n And a buffer area

The calculation formula of (c) is as follows:

in the formula:

representing a mathematical expectation function;

softmax (·) denotes a softmax function;

representing a cross entropy loss function;

9. The generalized continuous classification method based on the online contrast distillation network as claimed in claim 1, wherein the specific method for updating the parameters of the teacher model by using the parameters of the student model in step 9 is as follows:

Θ ^t ←mΘ ^t +(1-m)[(1-X)Θ ^t +XΘ ^s ]；

Φ ^t ←mΦ ^t +(1-m)[(1-X)Φ ^t +XΦ ^s ]；

Ψ ^t ←mΨ ^t +(1-m)[(1-X)Ψ ^t +XΨ ^s ]；

P(X＝k)＝p ^k (1-p) ^1-k ，k＝{0，1}；

10. The generalized continuous classification method based on the online contrast distillation network according to claim 9, wherein the momentum factor m is calculated as follows:

m＝min(itera/(itera+1)，η)；