CN114782742A - Output regularization method based on teacher model classification layer weight - Google Patents

Output regularization method based on teacher model classification layer weight

Info

Publication number
CN114782742A
Authority
CN
China
Prior art keywords
model
training
teacher model
matrix
data
Prior art date
Legal status
Pending
Application number
CN202210357826.8A
Other languages
Chinese (zh)
Inventor
梅建萍
仇文豪
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202210357826.8A priority Critical patent/CN114782742A/en
Publication of CN114782742A publication Critical patent/CN114782742A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention relates to an output regularization method based on the classification layer weight of a teacher model. The weight of the classification layer of a teacher model that has completed supervised training is converted into a correlation matrix among classes; each row of the matrix serves as the soft label of the corresponding class, providing extra information for the student model and participating in its training, and the student model with the highest accuracy is selected as the final target model. The method makes full use of the information provided by the teacher model and reduces the problems of the teacher model occupying too many training resources and the overall training time being too long. Even when a neural network model can only provide the weight of the teacher's classifier layer, the method can still be used to train the student model. It achieves higher classification accuracy and wider model applicability, trains faster, requires fewer training resources, and can further regularize a network model when resources are limited.

Description

Output regularization method based on teacher model classification layer weight
Technical Field
The present invention relates to the technical field of computing, and in particular to an output regularization method based on teacher model classification layer weights for image classification in deep learning.
Background
In deep neural networks, models with large numbers of parameters achieve excellent performance on supervised image classification tasks; however, such models usually overfit the labeled training samples, resulting in poor generalization ability. Overfitting is a major and common problem when a modern deep neural network with millions of parameters is trained on a labeled data set. To address it, researchers at home and abroad have proposed different solutions, including regularization methods.
Regularization methods operate at the input end or the output end. Input-end regularization improves the generalization ability of the network model by increasing the diversity of the training samples; because labeled data sets are limited, researchers have proposed methods such as data augmentation and label mixing to add virtual samples to the limited data and improve generalization. Output-end regularization methods include the maximum entropy model, label smoothing, and knowledge distillation. The maximum entropy model and label smoothing optimize the training loss function from the perspective of information entropy to achieve regularization. Knowledge distillation, originally used in the field of model compression, trains two neural network models jointly: the model with the larger number of parameters is called the teacher model, and the model with the smaller number of parameters is called the student model.
Among the output-end regularization methods, knowledge distillation has wider applicability and stronger regularization capability than the maximum entropy model and label smoothing, but the whole teacher model must be available during training, and the required training time is much longer than that of the latter methods.
Disclosure of Invention
The invention addresses the problems in the prior art by providing an optimized output regularization method based on teacher model classification layer weights, which solves the problems of excessive resources and excessive training time required when a teacher model is used to train a student model in the prior art.
The method converts the weight of the classification layer of a teacher model that has completed supervised training into a correlation matrix among classes, and takes each row of the matrix as the soft label of the corresponding class to provide additional information for the student model and participate in its training; the student model with the highest accuracy is selected as the final target model.
Preferably, the method comprises the steps of:
step 1: acquiring a data set with a label, and dividing the data set into a training set and a verification set;
step 2: preprocessing all data of the data set;
step 3: acquiring a preset teacher model or constructing a teacher model;
step 4: obtaining a soft label matrix based on the classification layer weight of the teacher model;
step 5: training a target model based on the soft label matrix;
step 6: adjusting parameters for a plurality of times, repeating step 4 and step 5, and selecting the student model with the highest accuracy as a final target model.
Preferably, in step 1, the number of elements in the verification set is min (0.1 × n,5000), where n is the number of elements in the data set.
Preferably, in step 2, the preprocessing includes normalization processing and enhancement operation on the data, and the size of the preprocessed data is consistent.
Preferably, in step 3, the teacher model is constructed by training the teacher model on the training set prepared in step 2 through a cross-entropy loss function, wherein the cross-entropy loss function is expressed by formula (1),
L_ce = −Σ_{c=1..C} y_c log(p_c)   (1)
where y represents the data label, p represents the prediction distribution of the teacher model for the data, C is the number of categories in the current data set, c is a positive integer between 1 and C, y_c is the data label for the current c-th category, and p_c is the prediction distribution of the teacher model for the data of the current c-th category.
Preferably, the step 4 comprises the following steps:
step 4.1: for the teacher model, reserving a weight matrix W of a classification layer of the teacher model, wherein the dimension of the weight matrix W is k multiplied by C, k is the characteristic dimension of input data of the classification layer, and C is the category size of a current data set;
step 4.2: transforming the weight matrix W into a class-dependent soft label matrix Q by equation (2),
Q = σ(WᵀW)   (2)
where σ(·) is the softmax function and the dimension of Q is C × C;
step 4.3: using a restart random walk algorithm to perform iterative computation on the current soft label matrix according to the formula (3) so as to obtain S,
s = (1 − u)·Q·s + u·q   (3)
where s denotes a row of S (a probability distribution of dimension 1 × C), q denotes the corresponding row of Q, u is the probability of restarting from the original vector, u is adjustable, and u ∈ (0, 1);
step 4.4: if the number of iterations reaches a preset maximum or ε < 1e-6 is satisfied, stop the iteration and output the stabilized matrix S; otherwise, add 1 to the iteration count and return to step 4.3;
where ε = ||S_i − S_{i−1}||_1 is the L1 norm of the difference between the matrix S in the current iteration and the S of the previous iteration, i is used for counting and is not greater than the preset maximum number of iterations, S_i is the S calculated in the current iteration, S_{i−1} is the S calculated in the previous iteration, and S_{i−1} in the first iteration is Q.
Preferably, the step 5 comprises the steps of:
step 5.1: selecting a network model with parameter quantity smaller than that of the teacher model, and training on the current data set; the loss function is set as in equation (6),
L_tsr = α·L_ce + β·L_t   (6)
where α and β are weight coefficients, α ∈ (0, 3], β ∈ (0, 3], L_ce is the cross-entropy loss function, and L_t is the KL divergence function shown in equation (5),
L_t = Σ_{c=1..C} s_{c,l} log(s_{c,l} / p_c)   (5)
where s_l is the label vector in the l-th row of the soft label matrix S (the row corresponding to the class of the current training data), s_{c,l} is its entry for the c-th category, and p_c is the prediction probability of the model for the c-th category;
step 5.2: and selecting a random gradient descent algorithm optimizer, performing preset number of iterations on the model, and storing the model of the last iteration.
Preferably, the classification is an image classification.
The invention provides an optimized output regularization method based on teacher model classification layer weights: the classification layer weight of a teacher model that has completed supervised training is converted into a correlation matrix among classes, each row of the matrix is used as the soft label of the corresponding class to provide additional information for the student model and participate in its training, and the student model with the highest accuracy is selected as the final target model.
The invention has the beneficial effects that:
(1) the information provided by the teacher model is fully utilized, which alleviates the problems of the teacher model occupying too many training resources and the overall training time being too long;
(2) even when a neural network model can only provide the weights of the teacher's classifier layer, the method can still be used to train the student model;
(3) compared with the label smoothing technique, the method achieves higher classification accuracy and wider model applicability;
(4) compared with the knowledge distillation technique, the method trains faster, requires fewer training resources, and can further regularize the network model when resources are limited.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic view of a model of the present invention;
fig. 3 is a schematic diagram of soft labels of corresponding training data taken by a soft label matrix in training a student model according to the present invention.
Detailed Description
The present invention is described in further detail with reference to the following examples, but the scope of the present invention is not limited thereto.
The invention relates to an output regularization method based on teacher model classification layer weights. The method converts the weight of the classification layer of a teacher model that has completed supervised training into a correlation matrix among classes, and takes each row of the matrix as the soft label of the corresponding class to provide additional information for the student model and participate in its training; the student model with the highest accuracy is selected as the final target model.
The classification is an image classification.
The output regularization method based on the classification layer weight in the teacher model comprises six parts, namely data set preparation, data preprocessing, teacher model preparation, soft label obtaining based on the classification layer weight of the teacher model, network training model construction and parameter adjustment, wherein the soft label obtaining and the network training model construction are repeated after each parameter adjustment.
In the invention, data set preparation is the first step before training the target network. The data set is usually a labeled data set in the current task domain, and it is divided into a training set and a verification set. Data preprocessing crops and normalizes the pictures and applies a certain degree of data augmentation.
In the invention, during the teacher model preparation stage, a teacher model with a larger number of parameters than the target network model (the student model) is usually selected. If the teacher model can be obtained directly, i.e. it already exists or can simply be downloaded, no training is needed; otherwise the teacher model is trained with the cross-entropy loss. In the stage of obtaining soft labels based on the teacher model classification layer weight, the weight matrix of the classification layer in the teacher model is retained and the soft label matrix is obtained with the restarted random walk algorithm. In the stage of constructing and training the network model, the network model is trained by combining the soft label matrix with the proposed cross-entropy loss function. In the parameter adjustment stage, the model parameters are tuned on the verification set: the pictures in the verification set are predicted and classified, the classification accuracy is calculated, and the model with the highest classification accuracy is retained.
The invention is applicable to any image classified data set.
Taking the cifar100 dataset as an example, the method comprises the following steps:
step 1: acquiring a data set with a label, and dividing the data set into a training set and a verification set;
in step 1, the number of elements in the verification set is min (0.1 × n,5000), where n is the number of elements in the data set.
In the invention, a labeled data set of the current task domain is prepared; a picture data set from any task domain can be used. The currently adopted cifar100 data set is an open-source data set, so a stage of manually labeling samples is not needed. If the data set used has no label information, the picture samples in the current data set need to be labeled, generally manually. The data set is then divided into a training set and a verification set, where the number of elements in the verification set is min(0.1 × n, 5000), meaning that the smaller of 0.1 × n and 5000 is taken, and the remaining picture data are used as the training set, as in the sketch below.
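A minimal sketch of this split rule; the function name, the fixed seed, and the use of torch.utils.data.random_split are illustrative assumptions, not prescribed by the description:

```python
import torch
from torch.utils.data import random_split

def split_dataset(dataset, seed=0):
    """Split a labeled dataset into a training set and a verification set.

    The verification set size follows the rule min(0.1 * n, 5000), where n is
    the number of elements in the whole data set; the rest forms the training set.
    """
    n = len(dataset)
    n_val = int(min(0.1 * n, 5000))
    generator = torch.Generator().manual_seed(seed)
    train_set, val_set = random_split(dataset, [n - n_val, n_val], generator=generator)
    return train_set, val_set
```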
step 2: preprocessing all data of the data set;
in the step 2, the preprocessing includes normalization processing and enhancement operation on the data, and the size of the preprocessed data is consistent.
In the invention, normalization generally means scaling all pictures to the same size, such as a × b, where a and b can be specified manually and are generally close to the original picture sizes, so as to avoid over-cropping and damaging the core content of the images. Data augmentation of image data includes, but is not limited to, horizontal flipping and random cropping of an image. Since random cropping changes the picture size, the picture space needs to be supplemented after augmentation by padding, usually with black, so that the picture size is kept at a × b.
In the embodiment of the invention, after cropping, the images in the cifar100 data set are uniformly adjusted to 32 × 32 pixel color images, or all images are cropped to the same size required for network training. Data augmentation consists of randomly cropping the picture at its edges and then padding with black at random so that the picture size remains 32 × 32 pixels; at the same time, random horizontal flipping is used for additional augmentation.
In the invention, these operations can be performed on the data set with the torch (PyTorch) deep learning framework: operations such as picture cropping, normalization, and data augmentation are provided in the torchvision package, and after they are applied, the pictures are loaded into a DataLoader through the framework interface for subsequent training, as in the sketch below.
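A minimal preprocessing and loading sketch for cifar100 under these conventions; the normalization statistics are the commonly used CIFAR-100 per-channel means and standard deviations, and the batch size and worker count are assumptions:

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# Random crop with black padding and random horizontal flip, keeping 32 x 32 pixels,
# followed by tensor conversion and per-channel normalization.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4, fill=0),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])

train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=train_transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
```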
step 3: acquiring a preset teacher model or constructing the teacher model;
in the step 3, the teacher model is constructed by training the teacher model through a cross entropy loss function on the training set prepared in the step 2, wherein the cross entropy loss function is shown as a formula (1),
L_ce = −Σ_{c=1..C} y_c log(p_c)   (1)
where y represents the data label, p represents the prediction distribution of the teacher model for the data, C is the number of categories in the current data set, c is a positive integer between 1 and C, y_c is the data label for the current c-th category, and p_c is the prediction distribution of the teacher model for the data of the current c-th category.
In the invention, the teacher model has two acquisition modes:
if the corresponding teacher model exists in the current data domain, for example, the current data domain is ready to be realized, or the corresponding teacher model exists on the network, the corresponding teacher model is directly downloaded;
if not, selecting and constructing a network structure of the teacher model, and training the teacher model on the training set prepared in the step 2 through cross entropy loss; as shown in formula (1), if the classification of the current picture classification task is 3 classifications, y takes a value of {0,1,2}, where 0,1, or 2 indicates that the current picture belongs to the fourth class, and p is a 3-dimensional vector and indicates the prediction output distribution of the current model to the picture.
In the embodiment of the invention, the resnet101 network model is selected as the teacher model. The teacher model is trained with the stochastic gradient descent optimizer provided by pytorch according to formula (1), and after 200 iterations the model from the last iteration is used as the teacher model, as sketched below.
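A minimal sketch of this teacher training; resnet101 from torchvision is used with its classification head set to 100 classes, and the learning rate, momentum, and weight decay are assumptions rather than values given in the description:

```python
import torch
import torch.nn as nn
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher = torchvision.models.resnet101(num_classes=100).to(device)

criterion = nn.CrossEntropyLoss()                      # cross-entropy loss of formula (1)
optimizer = torch.optim.SGD(teacher.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

for epoch in range(200):                               # 200 passes over the training set
    teacher.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        loss = criterion(teacher(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(teacher.state_dict(), "teacher_resnet101.pt")  # model of the last iteration
```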
step 4: obtaining a soft label matrix based on the classification layer weight of the teacher model;
the step 4 comprises the following steps:
step 4.1: for the teacher model, reserving a weight matrix W of a classification layer of the teacher model, wherein the dimension of the weight matrix W is k multiplied by C, k is the characteristic dimension of input data of the classification layer, and C is the category size of a current data set;
and 4.2: transforming the weight matrix W into a class-dependent soft label matrix Q by equation (2),
Q = σ(WᵀW)   (2)
where σ(·) is the softmax function and the dimension of Q is C × C;
step 4.3: using a restart random walk algorithm to perform iterative computation on the current soft label matrix according to the formula (3) so as to obtain S,
s = (1 − u)·Q·s + u·q   (3)
where s denotes a row of S (a probability distribution of dimension 1 × C), q denotes the corresponding row of Q, u is the probability of restarting from the original vector, u is adjustable, and u ∈ (0, 1);
step 4.4: if the number of iterations reaches a preset maximum or ε < 1e-6 is satisfied, stop the iteration and output the stabilized matrix S; otherwise, add 1 to the iteration count and return to step 4.3;
where ε = ||S_i − S_{i−1}||_1 is the L1 norm of the difference between the matrix S in the current iteration and the S of the previous iteration, i is used for counting and is not greater than the preset maximum number of iterations, S_i is the S calculated in the current iteration, S_{i−1} is the S calculated in the previous iteration, and S_{i−1} in the first iteration is Q.
In the invention, after the teacher model is obtained, the weight matrix W of the classification layer in the teacher model is retained, with dimension k × C, and converted into a class-related soft label matrix through equation (2): WᵀW is obtained by multiplying the transpose of W with W and has dimension C × C, and the softmax function σ(·) maps the weight correlations of each row into the (0, 1) probability space, giving a probability matrix Q whose rows each sum to 1, i.e. the required soft label matrix, with dimension C × C.
In the invention, if Q were used directly as the soft label matrix, the probability of each row would not yet be stable, so the current soft label matrix is iterated according to formula (3) with the restarted random walk algorithm until the probability of each row stabilizes, giving S, a probability matrix that tends to a stable state and is closer to the correlation among categories in a real environment. In general, u is set to 0.2. The purpose of the restarted random walk algorithm is to obtain, through iteration, the proportion of the original vector retained after each step, and the iteration terminates when the number of iterations exceeds 40 or ε is smaller than 1e-6.
In the embodiment of the invention, the weight matrix W of the classification (feature mapping) layer in the teacher model is selected for the next operation. In the resnet101 model, the W matrix has size 512 × 100, where 512 is the feature dimension after the feature extractor and 100 is the number of categories of the cifar100 data set. The probability matrix Q is obtained through formula (2), and the probability matrix S, in which the probability of each row tends to be stable, is obtained by executing the restarted random walk algorithm with the recursion of formula (3); u is set to 0.2, the maximum number of iterations to 40, and the threshold ε to 1e-6. A minimal sketch of this computation follows.
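A minimal sketch of steps 4.1 to 4.4 under these settings. The row/column conventions of the description are implicit, so the sketch assumes the row distributions of S are propagated by right-multiplication with the row-stochastic Q; the helper name and the transposition of PyTorch's (C, k) fc weight into the (k, C) layout of step 4.1 are likewise assumptions:

```python
import torch
import torch.nn.functional as F

def soft_label_matrix(W, u=0.2, max_iter=40, tol=1e-6):
    """Turn a (k x C) classification-layer weight matrix into a (C x C) soft label matrix."""
    # Formula (2): Q = softmax(W^T W), softmax applied to each row.
    Q = F.softmax(W.t() @ W, dim=1)

    # Formula (3): restarted random walk, starting from S = Q, until the rows stabilize.
    S = Q.clone()
    for _ in range(max_iter):
        S_new = (1.0 - u) * (S @ Q) + u * Q
        eps = torch.norm(S_new - S, p=1)   # L1 norm of the change between iterations
        S = S_new
        if eps < tol:
            break
    return S

# Example usage: PyTorch stores the fc weight as (C, k), so transpose it first.
# W = teacher.fc.weight.data.t()           # shape (k, C)
# S = soft_label_matrix(W, u=0.2)
```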
step 5: training a target model based on the soft label matrix;
the step 5 comprises the following steps:
step 5.1: selecting a network model with parameter quantity smaller than that of the teacher model, and training on the current data set; the loss function is set as in equation (6).
L_tsr = α·L_ce + β·L_t   (6)
where α and β are weight coefficients, α ∈ (0, 3], β ∈ (0, 3], L_ce is the cross-entropy loss function, and L_t is the KL divergence function shown in equation (5),
L_t = Σ_{c=1..C} s_{c,l} log(s_{c,l} / p_c)   (5)
where s_l is the label vector in the l-th row of the soft label matrix S (the row corresponding to the class of the current training data), s_{c,l} is its entry for the c-th category, and p_c is the prediction probability of the model for the c-th category;
step 5.2: and selecting a random gradient descent algorithm optimizer, performing preset number of iterations on the model, and storing the model of the last iteration.
In the invention, after the soft label matrix S is obtained, a target model, namely a student model, is trained.
In the present invention, s_l, the label vector related to the class of the current training data, is the l-th row of the soft label matrix S, and p is the prediction probability vector of the model for the current data.
In the present invention, α is generally set to 1 and β is set to 2.
In the invention, after the loss function is set, 200 iterations are performed on the model with the selected stochastic gradient descent optimizer to obtain the model of the last iteration, and that model is saved to the hard disk.
In the embodiment of the invention, the network training model is constructed as follows. After the soft label matrix S is obtained, training of the target model, i.e. the student model, begins. First, a network model with a smaller number of parameters, resnet18, is selected and trained on the cifar100 data set. The loss function required for training consists of two parts: the cross-entropy loss function L_ce, where y is the label in the cifar100 training set and p is the prediction probability vector of the model for the current data, and the KL divergence in formula (5), where s_l denotes the class-related label vector in the soft label matrix S corresponding to the current training data and p is the prediction probability vector of the model for the current data. From these, the loss function in formula (6) is obtained as the loss L_tsr for training the current student model, where α and β are the weight coefficients of the two loss terms, typically set to α = 1 and β = 2. After the loss function is set, 200 iterations are performed on the model with the stochastic gradient descent optimizer in pytorch to obtain the model of the last iteration, which is then persisted, as in the sketch below.
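A minimal sketch of this student training step; the soft label matrix S is assumed to come from the previous sketch, the optimizer hyperparameters are assumptions, and torch.nn.functional.kl_div is used to compute the KL term of formula (5) against the rows of S selected by the data labels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
student = torchvision.models.resnet18(num_classes=100).to(device)

ce_loss = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
alpha, beta = 1.0, 2.0                       # weight coefficients, as in the description
S = S.to(device)                             # soft label matrix from the previous sketch

for epoch in range(200):
    student.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        logits = student(images)

        # s_l: rows of S picked out by the class labels of the current batch.
        s_l = S[labels]                      # shape (batch, C)

        # Formula (5): KL(s_l || p), with p the student's softmax prediction.
        l_t = F.kl_div(F.log_softmax(logits, dim=1), s_l, reduction="batchmean")

        # Formula (6): L_tsr = alpha * L_ce + beta * L_t.
        loss = alpha * ce_loss(logits, labels) + beta * l_t
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(student.state_dict(), "student_resnet18.pt")   # model of the last iteration
```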
step 6: adjusting parameters for a plurality of times, repeating step 4 and step 5, and selecting the student model with the highest accuracy as a final target model.
In the invention, the parameter u is adjusted from 0.1 to 0.5 in steps of 0.1; for each value, the classification layer of the teacher model is used to calculate and store the corresponding soft label matrix, and several different student models are obtained by training through step 4 and step 5. The verification set is loaded with Dataset and DataLoader in pytorch, the data in the verification set are predicted and classified with each target model, the prediction results are compared with the verification set labels, the classification accuracy is calculated, and finally the model with the highest accuracy is selected as the final target model, as sketched below.
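A minimal sketch of this selection loop; soft_label_matrix and the teacher weight W come from the earlier sketches, and train_student is a hypothetical helper wrapping the student training loop shown above:

```python
import torch

def accuracy(model, loader, device="cuda"):
    """Classification accuracy of a model on the verification set."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

best_acc, best_model = 0.0, None
for u in (0.1, 0.2, 0.3, 0.4, 0.5):          # adjust u from 0.1 to 0.5 in steps of 0.1
    S = soft_label_matrix(W, u=u)            # step 4 with the current u
    student = train_student(S)               # step 5 (hypothetical helper)
    acc = accuracy(student, val_loader)
    if acc > best_acc:
        best_acc, best_model = acc, student

torch.save(best_model.state_dict(), "final_target_model.pt")
```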
In order to achieve the above, the present invention further relates to a computer readable storage medium, on which an output regularization program based on teacher model classification layer weights is stored, and when the program is executed by a processor, the output regularization method based on teacher model classification layer weights is implemented, so as to solve the problems of excessive resources and excessive training time required by training a student model by using a teacher model in the prior art.
In order to achieve the foregoing, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the foregoing output regularization method based on teacher model classification layer weights.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. An output regularization method based on teacher model classification layer weight is characterized in that: the method comprises the steps of converting the weight of a classification layer of a teacher model which completes supervised training into a correlation matrix among classes, taking each row in the matrix as a soft label of the corresponding class, providing additional information for a student model and participating in training of the student model; and selecting the student model with the highest accuracy as a final target model.
2. The teacher model classification level weight-based output regularization method according to claim 1, wherein: the method comprises the following steps:
step 1: acquiring a data set with a label, and dividing the data set into a training set and a verification set;
step 2: preprocessing all data of the data set;
step 3: acquiring a preset teacher model or constructing the teacher model;
step 4: obtaining a soft label matrix based on the classification layer weight of the teacher model;
step 5: training a target model based on the soft label matrix;
step 6: adjusting parameters for a plurality of times, repeating step 4 and step 5, and selecting the student model with the highest accuracy as a final target model.
3. The method of claim 2, wherein the step of regularizing the output based on teacher model classification layer weights comprises: in step 1, the number of elements in the verification set is min (0.1 × n,5000), where n is the number of elements in the data set.
4. The teacher model classification level weight-based output regularization method according to claim 2, wherein: in the step 2, the preprocessing includes normalization processing and enhancement operation on the data, and the size of the preprocessed data is consistent.
5. The method of claim 2, wherein the step of regularizing the output based on teacher model classification layer weights comprises: in the step 3, the teacher model is constructed by training the teacher model through a cross entropy loss function on the training set prepared in the step 2, wherein the cross entropy loss function is shown as a formula (1),
L_ce = −Σ_{c=1..C} y_c log(p_c)   (1)
where y represents the data label, p represents the prediction distribution of the teacher model for the data, C is the number of categories in the current data set, c is a positive integer between 1 and C, y_c is the data label for the current c-th category, and p_c is the prediction distribution of the teacher model for the data of the current c-th category.
6. The teacher model classification level weight-based output regularization method according to claim 2, wherein: the step 4 comprises the following steps:
step 4.1: for the teacher model, a weight matrix W of a classification layer of the teacher model is reserved, the dimensionality of the weight matrix W is k multiplied by C, k is the characteristic dimensionality of input data of the classification layer, and C is the category size of a current data set;
step 4.2: transforming the weight matrix W into a class-dependent soft label matrix Q by equation (2),
Q = σ(WᵀW)   (2)
where σ(·) is the softmax function and the dimension of Q is C × C;
step 4.3: using a restart random walk algorithm to perform iterative computation on the current soft label matrix according to the formula (3) so as to obtain S,
s = (1 − u)·Q·s + u·q   (3)
where s denotes a row of S (a probability distribution of dimension 1 × C), q denotes the corresponding row of Q, u is the probability of restarting from the original vector, u is adjustable, and u ∈ (0, 1);
step 4.4: if the number of iterations reaches a preset maximum or ε < 1e-6 is satisfied, stop the iteration and output the stabilized matrix S; otherwise, add 1 to the iteration count and return to step 4.3;
where ε = ||S_i − S_{i−1}||_1 is the L1 norm of the difference between the matrix S in the current iteration and the S of the previous iteration, i is used for counting and is not greater than the preset maximum number of iterations, S_i is the S calculated in the current iteration, S_{i−1} is the S calculated in the previous iteration, and S_{i−1} in the first iteration is Q.
7. The teacher model classification layer weight-based output regularization method according to claim 5, wherein: the step 5 comprises the following steps:
step 5.1: selecting a network model with parameter quantity smaller than that of the teacher model, and training on the current data set; the loss function is set as in equation (6),
L_tsr = α·L_ce + β·L_t   (6)
where α and β are weight coefficients, α ∈ (0, 3], β ∈ (0, 3], L_ce is the cross-entropy loss function, and L_t is the KL divergence function shown in equation (5),
L_t = Σ_{c=1..C} s_{c,l} log(s_{c,l} / p_c)   (5)
where s_l is the label vector in the l-th row of the soft label matrix S (the row corresponding to the class of the current training data), s_{c,l} is its entry for the c-th category, and p_c is the prediction probability of the model for the c-th category;
step 5.2: and selecting a random gradient descent algorithm optimizer, performing preset number of iterations on the model, and storing the model of the last iteration.
8. The teacher model classification level weight-based output regularization method according to claim 1, wherein: the classification is an image classification.
CN202210357826.8A 2022-04-06 2022-04-06 Output regularization method based on teacher model classification layer weight Pending CN114782742A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210357826.8A CN114782742A (en) 2022-04-06 2022-04-06 Output regularization method based on teacher model classification layer weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210357826.8A CN114782742A (en) 2022-04-06 2022-04-06 Output regularization method based on teacher model classification layer weight

Publications (1)

Publication Number Publication Date
CN114782742A true CN114782742A (en) 2022-07-22

Family

ID=82426992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210357826.8A Pending CN114782742A (en) 2022-04-06 2022-04-06 Output regularization method based on teacher model classification layer weight

Country Status (1)

Country Link
CN (1) CN114782742A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115511012A (en) * 2022-11-22 2022-12-23 南京码极客科技有限公司 Class soft label recognition training method for maximum entropy constraint
CN116861302A (en) * 2023-09-05 2023-10-10 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method
CN116861302B (en) * 2023-09-05 2024-01-23 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method

Similar Documents

Publication Publication Date Title
CN108491874B (en) Image list classification method based on generation type countermeasure network
CN107292352B (en) Image classification method and device based on convolutional neural network
CN114782742A (en) Output regularization method based on teacher model classification layer weight
CN113111979B (en) Model training method, image detection method and detection device
CN113554599B (en) Video quality evaluation method based on human visual effect
US20230245351A1 (en) Image style conversion method and apparatus, electronic device, and storage medium
CN111401374A (en) Model training method based on multiple tasks, character recognition method and device
US20220391611A1 (en) Non-linear latent to latent model for multi-attribute face editing
CN113128478A (en) Model training method, pedestrian analysis method, device, equipment and storage medium
CN113743474A (en) Digital picture classification method and system based on cooperative semi-supervised convolutional neural network
CN114065951A (en) Semi-supervised federal learning method based on non-IID data
CN115019173A (en) Garbage identification and classification method based on ResNet50
CN113705724B (en) Batch learning method of deep neural network based on self-adaptive L-BFGS algorithm
CN109886317B (en) General image aesthetic evaluation method, system and equipment based on attention mechanism
CN114529752A (en) Sample increment learning method based on deep neural network
CN114140641A (en) Image classification-oriented multi-parameter self-adaptive heterogeneous parallel computing method
CN110768864B (en) Method and device for generating images in batches through network traffic
CN112528077A (en) Video face retrieval method and system based on video embedding
Ding et al. Take a close look at mode collapse and vanishing gradient in GAN
CN115578593B (en) Domain adaptation method using residual attention module
CN116758379A (en) Image processing method, device, equipment and storage medium
CN113554104B (en) Image classification method based on deep learning model
CN111160161A (en) Self-learning face age estimation method based on noise elimination
Zhao et al. U-net for satellite image segmentation: Improving the weather forecasting
Yang et al. Using Generative Adversarial Networks Based on Dual Attention Mechanism to Generate Face Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination