CN114782742A - Output regularization method based on teacher model classification layer weight - Google Patents

Output regularization method based on teacher model classification layer weight

Info

Publication number
CN114782742A
Authority
CN
China
Prior art keywords
model
training
teacher model
matrix
data
Prior art date
Legal status
Pending
Application number
CN202210357826.8A
Other languages
Chinese (zh)
Inventor
梅建萍
仇文豪
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202210357826.8A priority Critical patent/CN114782742A/en
Publication of CN114782742A publication Critical patent/CN114782742A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention relates to an output regularization method based on the classification layer weight of a teacher model. The weight of the classification layer of a teacher model that has completed supervised training is converted into a correlation matrix among classes; each row of the matrix serves as the soft label of the corresponding class, providing extra information for the student model and participating in its training, and the student model with the highest accuracy is selected as the final target model. The method makes full use of the information provided by the teacher model and reduces the problems of the teacher model occupying too many training resources and the overall training time being too long. Even when a neural network model can only provide the weight of the teacher's classifier layer, the method can still be used to train the student model. It achieves higher classification accuracy and wider model applicability, trains faster, requires fewer training resources, and can further regularize a network model when resources are limited.

Description

Output regularization method based on teacher model classification layer weight
Technical Field
The present invention relates to the technical field of computing, and in particular to an output regularization method based on teacher model classification layer weights for image classification in deep learning.
Background
In deep neural networks, models with large numbers of parameters achieve excellent performance on supervised image classification tasks; however, such models usually overfit the labeled training samples, resulting in poor generalization ability. Overfitting is a major and common problem when a modern deep neural network with millions of parameters is trained on a labeled data set. To address it, researchers at home and abroad have proposed different solutions, including regularization methods.
Regularization methods operate at the input end or the output end. Input-end regularization improves the generalization ability of the network model by increasing the diversity of the training samples; because labeled data sets are limited, researchers have proposed methods such as data augmentation and label mixing to add virtual samples to the limited data and improve generalization. Output-end regularization methods include the maximum entropy model, label smoothing, and knowledge distillation. The maximum entropy model and label smoothing optimize the training loss function from the perspective of information entropy to achieve regularization. Knowledge distillation, originally used in the field of model compression, trains two neural network models jointly: the model with the larger number of parameters is called the teacher model, and the model with the smaller number of parameters is called the student model.
Among the output-end regularization methods, knowledge distillation has wider applicability and stronger regularization capability than the maximum entropy model and label smoothing, but the whole teacher model must be available during training, and the required training time is much longer than that of the latter methods.
Disclosure of Invention
The invention addresses the problems in the prior art by providing an optimized output regularization method based on teacher model classification layer weights, which solves the problems of excessive resources and excessive training time required when a teacher model is used to train a student model in the prior art.
The method converts the weight of the classification layer of a teacher model that has completed supervised training into a correlation matrix among classes, and takes each row of the matrix as the soft label of the corresponding class to provide additional information for the student model and participate in its training; the student model with the highest accuracy is selected as the final target model.
Preferably, the method comprises the steps of:
step 1: acquiring a data set with a label, and dividing the data set into a training set and a verification set;
step 2: preprocessing all data of the data set;
step 3: acquiring a preset teacher model or constructing a teacher model;
step 4: obtaining a soft label matrix based on the classification layer weight of the teacher model;
step 5: training a target model based on the soft label matrix;
step 6: adjusting parameters for a plurality of times, repeating step 4 and step 5, and selecting the student model with the highest accuracy as a final target model.
Preferably, in step 1, the number of elements in the verification set is min (0.1 × n,5000), where n is the number of elements in the data set.
Preferably, in step 2, the preprocessing includes normalization processing and enhancement operation on the data, and the size of the preprocessed data is consistent.
Preferably, in step 3, the teacher model is constructed by training the teacher model on the training set prepared in step 2 through a cross-entropy loss function, wherein the cross-entropy loss function is expressed by formula (1),
L_ce = −Σ_{c=1..C} y_c log(p_c)   (1)
where y represents the data label, p represents the prediction distribution of the teacher model for the data, C is the number of categories in the current data set, c is a positive integer between 1 and C, y_c is the data label for the current c-th category, and p_c is the prediction distribution of the teacher model for the data of the current c-th category.
Preferably, the step 4 comprises the following steps:
step 4.1: for the teacher model, reserving a weight matrix W of a classification layer of the teacher model, wherein the dimension of the weight matrix W is k multiplied by C, k is the characteristic dimension of input data of the classification layer, and C is the category size of a current data set;
step 4.2: transforming the weight matrix W into a class-dependent soft label matrix Q by equation (2),
Q = σ(WᵀW)   (2)
where σ(·) is the softmax function and the dimension of Q is C × C;
step 4.3: using a restart random walk algorithm to perform iterative computation on the current soft label matrix according to the formula (3) so as to obtain S,
s = (1 − u)·Q·s + u·q   (3)
where s denotes a row of S (a probability distribution of dimension 1 × C), q denotes the corresponding row of Q, u is the probability of restarting from the original vector, u is adjustable, and u ∈ (0, 1);
step 4.4: if the number of iterations reaches a preset maximum or ε < 1e-6 is satisfied, stop the iteration and output the stabilized matrix S; otherwise, add 1 to the iteration count and return to step 4.3;
where ε = ||S_i − S_{i−1}||_1 is the L1 norm of the difference between the matrix S in the current iteration and the S of the previous iteration, i is used for counting and is not greater than the preset maximum number of iterations, S_i is the S calculated in the current iteration, S_{i−1} is the S calculated in the previous iteration, and S_{i−1} in the first iteration is Q.
Preferably, the step 5 comprises the steps of:
step 5.1: selecting a network model with parameter quantity smaller than that of the teacher model, and training on the current data set; the loss function is set as in equation (6),
L_tsr = α·L_ce + β·L_t   (6)
where α and β are weight coefficients, α ∈ (0, 3], β ∈ (0, 3], L_ce is the cross-entropy loss function, and L_t is the KL divergence function shown in equation (5),
L_t = Σ_{c=1..C} s_{c,l} log(s_{c,l} / p_c)   (5)
where s_l is the label vector in the l-th row of the soft label matrix S (the row corresponding to the class of the current training data), s_{c,l} is its entry for the c-th category, and p_c is the prediction probability of the model for the c-th category;
step 5.2: and selecting a random gradient descent algorithm optimizer, performing preset number of iterations on the model, and storing the model of the last iteration.
Preferably, the classification is an image classification.
The invention provides an optimized output regularization method based on teacher model classification layer weights: the classification layer weight of a teacher model that has completed supervised training is converted into a correlation matrix among classes, each row of the matrix is used as the soft label of the corresponding class to provide additional information for the student model and participate in its training, and the student model with the highest accuracy is selected as the final target model.
The invention has the beneficial effects that:
(1) the information provided by the teacher model is fully utilized, which alleviates the problems of the teacher model occupying too many training resources and the overall training time being too long;
(2) even when a neural network model can only provide the weights of the teacher's classifier layer, the method can still be used to train the student model;
(3) compared with the label smoothing technique, the method achieves higher classification accuracy and wider model applicability;
(4) compared with the knowledge distillation technique, the method trains faster, requires fewer training resources, and can further regularize the network model when resources are limited.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic view of a model of the present invention;
fig. 3 is a schematic diagram of soft labels of corresponding training data taken by a soft label matrix in training a student model according to the present invention.
Detailed Description
The present invention is described in further detail with reference to the following examples, but the scope of the present invention is not limited thereto.
The invention relates to an output regularization method based on teacher model classification layer weights. The method converts the weight of the classification layer of a teacher model that has completed supervised training into a correlation matrix among classes, and takes each row of the matrix as the soft label of the corresponding class to provide additional information for the student model and participate in its training; the student model with the highest accuracy is selected as the final target model.
The classification is an image classification.
The output regularization method based on the classification layer weight in the teacher model comprises six parts, namely data set preparation, data preprocessing, teacher model preparation, soft label obtaining based on the classification layer weight of the teacher model, network training model construction and parameter adjustment, wherein the soft label obtaining and the network training model construction are repeated after each parameter adjustment.
In the invention, data set preparation is the first step before training the target network. The data set is usually a labeled data set in the current task domain, and it is divided into a training set and a verification set. Data preprocessing crops and normalizes the pictures and applies a certain degree of data augmentation.
In the invention, during the teacher model preparation stage, a teacher model with a larger number of parameters than the target network model (the student model) is usually selected. If the teacher model can be obtained directly, i.e. it already exists or can simply be downloaded, no training is needed; otherwise the teacher model is trained with the cross-entropy loss. In the stage of obtaining soft labels based on the teacher model classification layer weight, the weight matrix of the classification layer in the teacher model is retained and the soft label matrix is obtained with the restarted random walk algorithm. In the stage of constructing and training the network model, the network model is trained by combining the soft label matrix with the proposed cross-entropy loss function. In the parameter adjustment stage, the model parameters are tuned on the verification set: the pictures in the verification set are predicted and classified, the classification accuracy is calculated, and the model with the highest classification accuracy is retained.
The invention is applicable to any image classified data set.
Taking the cifar100 dataset as an example, the method comprises the following steps:
step 1: acquiring a data set with a label, and dividing the data set into a training set and a verification set;
in step 1, the number of elements in the verification set is min (0.1 × n,5000), where n is the number of elements in the data set.
In the invention, a labeled data set of the current task domain is prepared; a picture data set from any task domain can be used. The currently adopted cifar100 data set is an open-source data set, so a stage of manually labeling samples is not needed. If the data set used has no label information, the picture samples in the current data set need to be labeled, generally manually. The data set is then divided into a training set and a verification set, where the number of elements in the verification set is min(0.1 × n, 5000), meaning that the smaller of 0.1 × n and 5000 is taken, and the remaining picture data are used as the training set, as in the sketch below.
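A minimal sketch of this split rule; the function name, the fixed seed, and the use of torch.utils.data.random_split are illustrative assumptions, not prescribed by the description:

```python
import torch
from torch.utils.data import random_split

def split_dataset(dataset, seed=0):
    """Split a labeled dataset into a training set and a verification set.

    The verification set size follows the rule min(0.1 * n, 5000), where n is
    the number of elements in the whole data set; the rest forms the training set.
    """
    n = len(dataset)
    n_val = int(min(0.1 * n, 5000))
    generator = torch.Generator().manual_seed(seed)
    train_set, val_set = random_split(dataset, [n - n_val, n_val], generator=generator)
    return train_set, val_set
```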
step 2: preprocessing all data of the data set;
in the step 2, the preprocessing includes normalization processing and enhancement operation on the data, and the size of the preprocessed data is consistent.
In the invention, normalization generally means scaling all pictures to the same size, such as a × b, where a and b can be specified manually and are generally close to the original picture sizes, so as to avoid over-cropping and damaging the core content of the images. Data augmentation of image data includes, but is not limited to, horizontal flipping and random cropping of an image. Since random cropping changes the picture size, the picture space needs to be supplemented after augmentation by padding, usually with black, so that the picture size is kept at a × b.
In the embodiment of the invention, after cropping, the images in the cifar100 data set are uniformly adjusted to 32 × 32 pixel color images, or all images are cropped to the same size required for network training. Data augmentation consists of randomly cropping the picture at its edges and then padding with black at random so that the picture size remains 32 × 32 pixels; at the same time, random horizontal flipping is used for additional augmentation.
In the invention, these operations can be performed on the data set with the torch (PyTorch) deep learning framework: operations such as picture cropping, normalization, and data augmentation are provided in the torchvision package, and after they are applied, the pictures are loaded into a DataLoader through the framework interface for subsequent training, as in the sketch below.
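A minimal preprocessing and loading sketch for cifar100 under these conventions; the normalization statistics are the commonly used CIFAR-100 per-channel means and standard deviations, and the batch size and worker count are assumptions:

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Random crop with black padding and random horizontal flip, keeping 32 x 32 pixels,
# followed by tensor conversion and per-channel normalization.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4, fill=0),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])

train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=train_transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
```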
step 3: acquiring a preset teacher model or constructing the teacher model;
in the step 3, the teacher model is constructed by training the teacher model through a cross entropy loss function on the training set prepared in the step 2, wherein the cross entropy loss function is shown as a formula (1),
L_ce = −Σ_{c=1..C} y_c log(p_c)   (1)
where y represents the data label, p represents the prediction distribution of the teacher model for the data, C is the number of categories in the current data set, c is a positive integer between 1 and C, y_c is the data label for the current c-th category, and p_c is the prediction distribution of the teacher model for the data of the current c-th category.
In the invention, the teacher model has two acquisition modes:
if the corresponding teacher model exists in the current data domain, for example, the current data domain is ready to be realized, or the corresponding teacher model exists on the network, the corresponding teacher model is directly downloaded;
if not, selecting and constructing a network structure of the teacher model, and training the teacher model on the training set prepared in the step 2 through cross entropy loss; as shown in formula (1), if the classification of the current picture classification task is 3 classifications, y takes a value of {0,1,2}, where 0,1, or 2 indicates that the current picture belongs to the fourth class, and p is a 3-dimensional vector and indicates the prediction output distribution of the current model to the picture.
In the embodiment of the invention, the resnet101 network model is selected as the teacher model. The teacher model is trained with the stochastic gradient descent optimizer provided by pytorch according to formula (1), and after 200 iterations the model from the last iteration is used as the teacher model, as sketched below.
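A minimal sketch of this teacher training; resnet101 from torchvision is used with its classification head set to 100 classes, and the learning rate, momentum, and weight decay are assumptions rather than values given in the description:

```python
import torch
import torch.nn as nn
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher = torchvision.models.resnet101(num_classes=100).to(device)

criterion = nn.CrossEntropyLoss()                      # cross-entropy loss of formula (1)
optimizer = torch.optim.SGD(teacher.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

for epoch in range(200):                               # 200 passes over the training set
    teacher.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        loss = criterion(teacher(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(teacher.state_dict(), "teacher_resnet101.pt")  # model of the last iteration
```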
step 4: obtaining a soft label matrix based on the classification layer weight of the teacher model;
the step 4 comprises the following steps:
step 4.1: for the teacher model, reserving a weight matrix W of a classification layer of the teacher model, wherein the dimension of the weight matrix W is k multiplied by C, k is the characteristic dimension of input data of the classification layer, and C is the category size of a current data set;
and 4.2: transforming the weight matrix W into a class-dependent soft label matrix Q by equation (2),
Q = σ(WᵀW)   (2)
where σ(·) is the softmax function and the dimension of Q is C × C;
step 4.3: using a restart random walk algorithm to perform iterative computation on the current soft label matrix according to the formula (3) so as to obtain S,
s = (1 − u)·Q·s + u·q   (3)
where s denotes a row of S (a probability distribution of dimension 1 × C), q denotes the corresponding row of Q, u is the probability of restarting from the original vector, u is adjustable, and u ∈ (0, 1);
step 4.4: if the number of iterations reaches a preset maximum or ε < 1e-6 is satisfied, stop the iteration and output the stabilized matrix S; otherwise, add 1 to the iteration count and return to step 4.3;
where ε = ||S_i − S_{i−1}||_1 is the L1 norm of the difference between the matrix S in the current iteration and the S of the previous iteration, i is used for counting and is not greater than the preset maximum number of iterations, S_i is the S calculated in the current iteration, S_{i−1} is the S calculated in the previous iteration, and S_{i−1} in the first iteration is Q.
In the invention, after the teacher model is obtained, the weight matrix W of the classification layer in the teacher model is retained, with dimension k × C, and converted into a class-related soft label matrix through equation (2): WᵀW is obtained by multiplying the transpose of W with W and has dimension C × C, and the softmax function σ(·) maps the weight correlations of each row into the (0, 1) probability space, giving a probability matrix Q whose rows each sum to 1, i.e. the required soft label matrix, with dimension C × C.
In the invention, if Q were used directly as the soft label matrix, the probability of each row would not yet be stable, so the current soft label matrix is iterated according to formula (3) with the restarted random walk algorithm until the probability of each row stabilizes, giving S, a probability matrix that tends to a stable state and is closer to the correlation among categories in a real environment. In general, u is set to 0.2. The purpose of the restarted random walk algorithm is to obtain, through iteration, the proportion of the original vector retained after each step, and the iteration terminates when the number of iterations exceeds 40 or ε is smaller than 1e-6.
In the embodiment of the invention, the weight matrix W of the classification (feature mapping) layer in the teacher model is selected for the next operation. In the resnet101 model, the W matrix has size 512 × 100, where 512 is the feature dimension after the feature extractor and 100 is the number of categories of the cifar100 data set. The probability matrix Q is obtained through formula (2), and the probability matrix S, in which the probability of each row tends to be stable, is obtained by executing the restarted random walk algorithm with the recursion of formula (3); u is set to 0.2, the maximum number of iterations to 40, and the threshold ε to 1e-6. A minimal sketch of this computation follows.
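A minimal sketch of steps 4.1 to 4.4 under these settings. The row/column conventions of the description are implicit, so the sketch assumes the row distributions of S are propagated by right-multiplication with the row-stochastic Q; the helper name and the transposition of PyTorch's (C, k) fc weight into the (k, C) layout of step 4.1 are likewise assumptions:

```python
import torch
import torch.nn.functional as F

def soft_label_matrix(W, u=0.2, max_iter=40, tol=1e-6):
    """Turn a (k x C) classification-layer weight matrix into a (C x C) soft label matrix."""
    # Formula (2): Q = softmax(W^T W), softmax applied to each row.
    Q = F.softmax(W.t() @ W, dim=1)

    # Formula (3): restarted random walk, starting from S = Q, until the rows stabilize.
    S = Q.clone()
    for _ in range(max_iter):
        S_new = (1.0 - u) * (S @ Q) + u * Q
        eps = torch.norm(S_new - S, p=1)   # L1 norm of the change between iterations
        S = S_new
        if eps < tol:
            break
    return S

# Example usage: PyTorch stores the fc weight as (C, k), so transpose it first.
# W = teacher.fc.weight.data.t()           # shape (k, C)
# S = soft_label_matrix(W, u=0.2)
```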
step 5: training a target model based on the soft label matrix;
the step 5 comprises the following steps:
step 5.1: selecting a network model with parameter quantity smaller than that of the teacher model, and training on the current data set; the loss function is set as in equation (6).
L_tsr = α·L_ce + β·L_t   (6)
where α and β are weight coefficients, α ∈ (0, 3], β ∈ (0, 3], L_ce is the cross-entropy loss function, and L_t is the KL divergence function shown in equation (5),
L_t = Σ_{c=1..C} s_{c,l} log(s_{c,l} / p_c)   (5)
where s_l is the label vector in the l-th row of the soft label matrix S (the row corresponding to the class of the current training data), s_{c,l} is its entry for the c-th category, and p_c is the prediction probability of the model for the c-th category;
step 5.2: and selecting a random gradient descent algorithm optimizer, performing preset number of iterations on the model, and storing the model of the last iteration.
In the invention, after the soft label matrix S is obtained, a target model, namely a student model, is trained.
In the present invention, s_l, the label vector related to the class of the current training data, is the l-th row of the soft label matrix S, and p is the prediction probability vector of the model for the current data.
In the present invention, α is generally set to 1 and β is set to 2.
In the invention, after the loss function is set, 200 iterations are performed on the model with the selected stochastic gradient descent optimizer to obtain the model of the last iteration, and that model is saved to the hard disk.
In the embodiment of the invention, the network training model is constructed as follows. After the soft label matrix S is obtained, training of the target model, i.e. the student model, begins. First, a network model with a smaller number of parameters, resnet18, is selected and trained on the cifar100 data set. The loss function required for training consists of two parts: the cross-entropy loss function L_ce, where y is the label in the cifar100 training set and p is the prediction probability vector of the model for the current data, and the KL divergence in formula (5), where s_l denotes the class-related label vector in the soft label matrix S corresponding to the current training data and p is the prediction probability vector of the model for the current data. From these, the loss function in formula (6) is obtained as the loss L_tsr for training the current student model, where α and β are the weight coefficients of the two loss terms, typically set to α = 1 and β = 2. After the loss function is set, 200 iterations are performed on the model with the stochastic gradient descent optimizer in pytorch to obtain the model of the last iteration, which is then persisted, as in the sketch below.
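A minimal sketch of this student training step; the soft label matrix S is assumed to come from the previous sketch, the optimizer hyperparameters are assumptions, and torch.nn.functional.kl_div is used to compute the KL term of formula (5) against the rows of S selected by the data labels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
student = torchvision.models.resnet18(num_classes=100).to(device)

ce_loss = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
alpha, beta = 1.0, 2.0                       # weight coefficients, as in the description
S = S.to(device)                             # soft label matrix from the previous sketch

for epoch in range(200):
    student.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        logits = student(images)

        # s_l: rows of S picked out by the class labels of the current batch.
        s_l = S[labels]                      # shape (batch, C)

        # Formula (5): KL(s_l || p), with p the student's softmax prediction.
        l_t = F.kl_div(F.log_softmax(logits, dim=1), s_l, reduction="batchmean")

        # Formula (6): L_tsr = alpha * L_ce + beta * L_t.
        loss = alpha * ce_loss(logits, labels) + beta * l_t
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(student.state_dict(), "student_resnet18.pt")   # model of the last iteration
```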
step 6: adjusting parameters for a plurality of times, repeating step 4 and step 5, and selecting the student model with the highest accuracy as a final target model.
In the invention, the parameter u is adjusted from 0.1 to 0.5 in steps of 0.1; for each value, the classification layer of the teacher model is used to calculate and store the corresponding soft label matrix, and several different student models are obtained by training through step 4 and step 5. The verification set is loaded with Dataset and DataLoader in pytorch, the data in the verification set are predicted and classified with each target model, the prediction results are compared with the verification set labels, the classification accuracy is calculated, and finally the model with the highest accuracy is selected as the final target model, as sketched below.
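A minimal sketch of this selection loop; soft_label_matrix and the teacher weight W come from the earlier sketches, and train_student is a hypothetical helper wrapping the student training loop shown above:

```python
import torch

def accuracy(model, loader, device="cuda"):
    """Classification accuracy of a model on the verification set."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

best_acc, best_model = 0.0, None
for u in (0.1, 0.2, 0.3, 0.4, 0.5):          # adjust u from 0.1 to 0.5 in steps of 0.1
    S = soft_label_matrix(W, u=u)            # step 4 with the current u
    student = train_student(S)               # step 5 (hypothetical helper)
    acc = accuracy(student, val_loader)
    if acc > best_acc:
        best_acc, best_model = acc, student

torch.save(best_model.state_dict(), "final_target_model.pt")
```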
In order to achieve the above, the present invention further relates to a computer readable storage medium, on which an output regularization program based on teacher model classification layer weights is stored, and when the program is executed by a processor, the output regularization method based on teacher model classification layer weights is implemented, so as to solve the problems of excessive resources and excessive training time required by training a student model by using a teacher model in the prior art.
In order to achieve the foregoing, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the foregoing output regularization method based on teacher model classification layer weights.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. An output regularization method based on teacher model classification layer weight is characterized in that: the method comprises the steps of converting the weight of a classification layer of a teacher model which completes supervised training into a correlation matrix among classes, taking each row in the matrix as a soft label of the corresponding class, providing additional information for a student model and participating in training of the student model; and selecting the student model with the highest accuracy as a final target model.
2. The teacher model classification level weight-based output regularization method according to claim 1, wherein: the method comprises the following steps:
step 1: acquiring a data set with a label, and dividing the data set into a training set and a verification set;
step 2: preprocessing all data of the data set;
step 3: acquiring a preset teacher model or constructing the teacher model;
step 4: obtaining a soft label matrix based on the classification layer weight of the teacher model;
step 5: training a target model based on the soft label matrix;
step 6: adjusting parameters for a plurality of times, repeating step 4 and step 5, and selecting the student model with the highest accuracy as a final target model.
3. The method of claim 2, wherein the step of regularizing the output based on teacher model classification layer weights comprises: in step 1, the number of elements in the verification set is min (0.1 × n,5000), where n is the number of elements in the data set.
4. The teacher model classification level weight-based output regularization method according to claim 2, wherein: in the step 2, the preprocessing includes normalization processing and enhancement operation on the data, and the size of the preprocessed data is consistent.
5. The method of claim 2, wherein the step of regularizing the output based on teacher model classification layer weights comprises: in the step 3, the teacher model is constructed by training the teacher model through a cross entropy loss function on the training set prepared in the step 2, wherein the cross entropy loss function is shown as a formula (1),
L_ce = −Σ_{c=1..C} y_c log(p_c)   (1)
where y represents the data label, p represents the prediction distribution of the teacher model for the data, C is the number of categories in the current data set, c is a positive integer between 1 and C, y_c is the data label for the current c-th category, and p_c is the prediction distribution of the teacher model for the data of the current c-th category.
6. The teacher model classification level weight-based output regularization method according to claim 2, wherein: the step 4 comprises the following steps:
step 4.1: for the teacher model, a weight matrix W of a classification layer of the teacher model is reserved, the dimensionality of the weight matrix W is k multiplied by C, k is the characteristic dimensionality of input data of the classification layer, and C is the category size of a current data set;
step 4.2: transforming the weight matrix W into a class-dependent soft label matrix Q by equation (2),
Q = σ(WᵀW)   (2)
where σ(·) is the softmax function and the dimension of Q is C × C;
step 4.3: using a restart random walk algorithm to perform iterative computation on the current soft label matrix according to the formula (3) so as to obtain S,
s = (1 − u)·Q·s + u·q   (3)
where s denotes a row of S (a probability distribution of dimension 1 × C), q denotes the corresponding row of Q, u is the probability of restarting from the original vector, u is adjustable, and u ∈ (0, 1);
step 4.4: if the number of iterations reaches a preset maximum or ε < 1e-6 is satisfied, stop the iteration and output the stabilized matrix S; otherwise, add 1 to the iteration count and return to step 4.3;
where ε = ||S_i − S_{i−1}||_1 is the L1 norm of the difference between the matrix S in the current iteration and the S of the previous iteration, i is used for counting and is not greater than the preset maximum number of iterations, S_i is the S calculated in the current iteration, S_{i−1} is the S calculated in the previous iteration, and S_{i−1} in the first iteration is Q.
7. The teacher model classification layer weight-based output regularization method according to claim 5, wherein: the step 5 comprises the following steps:
step 5.1: selecting a network model with parameter quantity smaller than that of the teacher model, and training on the current data set; the loss function is set as in equation (6),
L_tsr = α·L_ce + β·L_t   (6)
where α and β are weight coefficients, α ∈ (0, 3], β ∈ (0, 3], L_ce is the cross-entropy loss function, and L_t is the KL divergence function shown in equation (5),
L_t = Σ_{c=1..C} s_{c,l} log(s_{c,l} / p_c)   (5)
where s_l is the label vector in the l-th row of the soft label matrix S (the row corresponding to the class of the current training data), s_{c,l} is its entry for the c-th category, and p_c is the prediction probability of the model for the c-th category;
step 5.2: and selecting a random gradient descent algorithm optimizer, performing preset number of iterations on the model, and storing the model of the last iteration.
8. The teacher model classification level weight-based output regularization method according to claim 1, wherein: the classification is an image classification.
CN202210357826.8A 2022-04-06 2022-04-06 Output regularization method based on teacher model classification layer weight Pending CN114782742A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210357826.8A CN114782742A (en) 2022-04-06 2022-04-06 Output regularization method based on teacher model classification layer weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210357826.8A CN114782742A (en) 2022-04-06 2022-04-06 Output regularization method based on teacher model classification layer weight

Publications (1)

Publication Number Publication Date
CN114782742A true CN114782742A (en) 2022-07-22

Family

ID=82426992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210357826.8A Pending CN114782742A (en) 2022-04-06 2022-04-06 Output regularization method based on teacher model classification layer weight

Country Status (1)

Country Link
CN (1) CN114782742A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115511012A (en) * 2022-11-22 2022-12-23 南京码极客科技有限公司 Class soft label recognition training method for maximum entropy constraint
CN116861302A (en) * 2023-09-05 2023-10-10 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method
CN116861302B (en) * 2023-09-05 2024-01-23 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method

Similar Documents

Publication Publication Date Title
CN108491874B (en) Image list classification method based on generation type countermeasure network
CN107292352B (en) Image classification method and device based on convolutional neural network
CN114782742A (en) Output regularization method based on teacher model classification layer weight
CN113111979B (en) Model training method, image detection method and detection device
CN113554599B (en) Video quality evaluation method based on human visual effect
US20230245351A1 (en) Image style conversion method and apparatus, electronic device, and storage medium
CN111401374A (en) Model training method based on multiple tasks, character recognition method and device
US20220391611A1 (en) Non-linear latent to latent model for multi-attribute face editing
CN113128478A (en) Model training method, pedestrian analysis method, device, equipment and storage medium
CN113743474A (en) Digital picture classification method and system based on cooperative semi-supervised convolutional neural network
CN114065951A (en) Semi-supervised federal learning method based on non-IID data
CN115019173A (en) Garbage identification and classification method based on ResNet50
CN113705724B (en) Batch learning method of deep neural network based on self-adaptive L-BFGS algorithm
CN109886317B (en) General image aesthetic evaluation method, system and equipment based on attention mechanism
CN114529752A (en) Sample increment learning method based on deep neural network
CN114140641A (en) Image classification-oriented multi-parameter self-adaptive heterogeneous parallel computing method
CN110768864B (en) Method and device for generating images in batches through network traffic
CN112528077A (en) Video face retrieval method and system based on video embedding
Ding et al. Take a close look at mode collapse and vanishing gradient in GAN
CN115578593B (en) Domain adaptation method using residual attention module
CN116758379A (en) Image processing method, device, equipment and storage medium
CN113554104B (en) Image classification method based on deep learning model
CN111160161A (en) Self-learning face age estimation method based on noise elimination
Zhao et al. U-net for satellite image segmentation: Improving the weather forecasting
Yang et al. Using Generative Adversarial Networks Based on Dual Attention Mechanism to Generate Face Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination