CN114626504A - Model compression method based on group relation knowledge distillation - Google Patents

Model compression method based on group relation knowledge distillation

Info

Publication number
CN114626504A
CN114626504A (application CN202210030247.2A)
Authority
CN
China
Prior art keywords
image
network
group
sample
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210030247.2A
Other languages
Chinese (zh)
Inventor
杨赛
杨慧
周伯俊
胡彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202210030247.2A
Publication of CN114626504A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a model compression method based on group relation knowledge distillation. After a data set is preprocessed, a large-capacity convolutional neural network is randomly initialized as the teacher network and pre-trained with a cross-entropy loss function. In the knowledge distillation stage, a small-capacity convolutional neural network is randomly initialized as the student network; the teacher and student networks each cluster the image sample features with K-means, the relations among the groups are computed with the maximum mean discrepancy, and the student network is trained with a weighted sum of the cross entropy and a group relation loss function. Finally, the trained network makes classification decisions on test images. The method guides the student network to imitate the teacher's ability to group samples, so that the performance of the student network approaches that of the teacher network.

Description

Model compression method based on group relation knowledge distillation
Technical Field
The invention relates to a model compression method based on group relation knowledge distillation, and belongs to the technical field of computer vision.
Background
In recent years, deep convolutional neural networks have achieved unprecedented success in many artificial intelligence fields, such as computer vision, natural language processing, and speech recognition. However, these successes depend heavily on powerful computing capability and large memory resources, so deep convolutional networks cannot be widely deployed in embedded and mobile systems. To reduce computation cost while maintaining excellent performance, knowledge distillation was proposed to transfer the knowledge learned by a large-scale teacher network with many parameters to a small-scale student network with few parameters, so that a model of lower complexity approaches or even exceeds the performance of the complex network.
Knowledge distillation is an effective model compression technique that aims to improve the performance of a lightweight student network by transferring knowledge from a pre-trained large-capacity teacher network. The core idea is to train a complex teacher network in advance and then train a smaller student network using both the teacher's outputs and the true data labels. However, the output of the pre-trained teacher network contains information similar to the true labels, so the knowledge exploited during distillation is limited. Subsequently, several distillation methods based on the feature knowledge implicit in intermediate network layers were proposed. For example, Romero et al. propose directly matching the feature outputs of the student and teacher networks by minimizing the L2 distance between peer-level features. However, requiring the student network to imitate the teacher's features in full is too strict a condition and harms the model's performance and convergence. Follow-up work on feature knowledge therefore studies how to encode features efficiently so as to close the gap between the student and teacher networks. Zagoruyko et al. convert feature maps into spatial attention maps and guide the student network to learn the teacher network's attention regions; Yim et al. propose using the Gram matrix between adjacent-layer features within the same residual block as knowledge. Although these works achieve good performance, they ignore the relationships between different samples. Recent work has therefore focused on using the relationships between sample features for knowledge distillation.
For example, Tung et al. compute the similarity between different sample features to obtain a similarity matrix and, through training, keep the similarities between sample features consistent between the student and teacher networks. Zhu et al. propose using both the features and their gradient information to obtain better relational knowledge.
Knowledge distillation methods based on sample feature relationships achieve good performance when compressing models to small capacities, and several works have modeled the relationships between sample features. However, current work only establishes relationships between individual samples and neglects the group relationships among samples. The group relationships established according to the similarity between samples are an important form of knowledge and matter greatly for improving model performance.
Disclosure of Invention
In view of the problems in the prior art, the present invention provides a model compression method based on group relation knowledge distillation, so as to solve the above technical problems.
In order to achieve the purpose, the invention adopts the technical scheme that: a model compression method based on group relation knowledge distillation comprises the following steps:
step one: processing image data;
step two: training a teacher network;
step three: constructing teacher network group relation knowledge;
step four: constructing student network group relation knowledge;
step five: training a student network;
step six: testing the student network.
Further, the specific steps of processing the image data in the first step are as follows:
s11: randomly dividing a given image data set into three subsets, namely a training set, a verification set and a test set, used respectively for training the model, validating the hyper-parameters and testing the model's performance;
s12: the number of image classification classes is C; the m-th image sample in the training set is denoted x_m, with corresponding label y_m; the i-th image sample in the test set is denoted x_i.
Further, the training of the teacher network in the second step includes the specific steps of:
s21: randomly initializing a large-capacity convolutional neural network consisting of a plurality of convolutional layers, a plurality of fully-connected layers and a Softmax layer; the convolutional layers are represented as f^t(·; θ^t) and extract the features of image samples, where θ^t denotes the convolutional-layer parameters; the fully-connected layers are represented as g^t(·; W^t) and transform and classify the features, where W^t is the parameter matrix of the fully-connected layer; the Softmax layer converts the classification scores into classification probability output values;
S22: randomly taking M image samples from the training set and inputting them into the teacher network, the feature of the m-th image sample after the convolutional layers is:

$$F_m^t = f^t(x_m;\, \theta^t)$$
s23: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_m^t = g^t(F_m^t;\, W^t) = (W^t)^{\mathrm T} F_m^t$$

where W^t = [w_1^t, …, w_C^t] collects the weight vectors of the teacher network's fully-connected layer; there are C of them, consistent with the number of image classification categories;
s24: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^t(y_m = c \mid x_m) = \frac{\exp(z_{m,c}^t)}{\sum_{j=1}^{C} \exp(z_{m,j}^t)}$$

where z_{m,j}^t is the j-th component of the classification score z_m^t, produced by the j-th weight vector w_j^t of the classifier parameter matrix W^t;
S25: the cross-entropy loss function between the classification probability outputs and the true labels is:

$$L_{\mathrm{CE}}^t = -\frac{1}{M}\sum_{m=1}^{M} \log p^t(y_m \mid x_m)$$

where p^t(y_m | x_m) is the classification probability output for the true label y_m, w_j^t is the j-th weight vector of the fully-connected layer parameter matrix W^t, and M is the number of image samples.
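The forward pass and cross-entropy loss of steps S22 to S25 can be sketched in plain NumPy. This is only an illustration: the convolutional feature extractor is stood in for by random features, and all array names (F_t, W_t, z_t) are hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-probability of the true class over the batch.
    m = len(labels)
    return -np.log(probs[np.arange(m), labels] + 1e-12).mean()

rng = np.random.default_rng(0)
M, D, C = 8, 16, 5             # batch size, feature dimension, number of classes
F_t = rng.normal(size=(M, D))  # stand-in for convolutional features f^t(x_m; theta^t)
W_t = rng.normal(size=(D, C))  # fully-connected layer parameter matrix
z_t = F_t @ W_t                # classification scores (S23)
p_t = softmax(z_t)             # class probabilities (S24)
labels = rng.integers(0, C, size=M)
loss = cross_entropy(p_t, labels)  # teacher pre-training loss (S25)
```

The same two helper functions also serve the student network's forward pass in step five.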
Further, the specific steps of the construction of teacher network group relation knowledge in the third step are as follows:
s31: randomly extracting M image samples in a training set, inputting the M image samples into a teacher network to extract features, and clustering the features of the M image samples into K groups by using a K-means algorithm;
s32: suppose the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^t, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^t; then the relation coefficient between the k-th and l-th groups of sample features, based on the maximum mean discrepancy, is computed as:

$$R_{kl}^t = \left\| \frac{1}{N_k}\sum_{n=1}^{N_k} F_{k,n}^t \;-\; \frac{1}{N_l}\sum_{n=1}^{N_l} F_{l,n}^t \right\|_2^2$$
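Steps S31 and S32 can be sketched as follows, assuming the linear-kernel empirical form of the maximum mean discrepancy (the squared distance between group feature means). The small K-means routine and every name here (kmeans, relation_matrix, F_t) are illustrative stand-ins, not the patent's implementation.

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    # Plain Lloyd's algorithm: assign each point to its nearest centroid,
    # then recompute each non-empty cluster's centroid as the mean.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for k in range(K):
            if (assign == k).any():
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids

def relation_matrix(group_means):
    # R[k, l] = || mean_k - mean_l ||^2, the linear-kernel empirical
    # maximum mean discrepancy between the features of groups k and l.
    diff = group_means[:, None, :] - group_means[None, :, :]
    return (diff ** 2).sum(-1)

rng = np.random.default_rng(1)
F_t = rng.normal(size=(32, 8))          # stand-in teacher features of M = 32 samples
K = 4
assign_t, means_t = kmeans(F_t, K)      # group the features into K clusters (S31)
R_t = relation_matrix(means_t)          # K x K group relation matrix (S32)
```

After the final assignment step, each centroid equals the mean feature of its group, so `R_t` matches the formula above; the matrix is symmetric with a zero diagonal.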
further, the concrete steps of the construction of the student network group relation knowledge in the fourth step are as follows:
s41: randomly initializing a small-capacity convolutional neural network;
s42: the same batch of M image samples randomly extracted from the training set is input into the student network to extract features, and the features of the M image samples are clustered into K groups using the K-means algorithm;
s43: suppose the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^s, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^s; then the relation coefficient between the k-th and l-th groups of sample features, based on the maximum mean discrepancy, is computed as:

$$R_{kl}^s = \left\| \frac{1}{N_k}\sum_{n=1}^{N_k} F_{k,n}^s \;-\; \frac{1}{N_l}\sum_{n=1}^{N_l} F_{l,n}^s \right\|_2^2$$
further, the concrete steps of training the student network in the fifth step are as follows:
s51: the same batch of M image samples randomly drawn from the training set is input into the student network; the feature of the m-th image sample after the convolutional layers is:

$$F_m^s = f^s(x_m;\, \theta^s)$$
s52: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_m^s = g^s(F_m^s;\, W^s) = (W^s)^{\mathrm T} F_m^s$$

where W^s = [w_1^s, …, w_C^s] collects the weight vectors of the fully-connected layer; there are C of them, consistent with the number of image classification categories;
s53: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^s(y_m = c \mid x_m) = \frac{\exp(z_{m,c}^s)}{\sum_{j=1}^{C} \exp(z_{m,j}^s)}$$

where z_{m,j}^s is the j-th component of the classification score z_m^s, produced by the j-th weight vector w_j^s of the classifier parameter matrix W^s;
s54: the cross-entropy loss function between the classification probability outputs and the true labels is:

$$L_{\mathrm{CE}}^s = -\frac{1}{M}\sum_{m=1}^{M} \log p^s(y_m \mid x_m)$$

where p^s(y_m | x_m) is the student network's classification probability output, w_j^s is the j-th weight vector of the fully-connected layer parameter matrix W^s, and M is the number of image samples;
s55: the group relation distillation loss function is obtained from the F-norm between the relation matrices R^t and R^s:

$$L_{\mathrm{GR}} = \left\| R^t - R^s \right\|_F^2$$
s56: the total loss function for optimizing the student network is:

$$L = \alpha L_{\mathrm{CE}}^s + \beta L_{\mathrm{GR}}$$

where α and β are adjustable parameters; optimizing the parameters θ^s and W^s of the student network with this total loss completes the model compression process.
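Steps S55 and S56 reduce to a squared Frobenius norm between the two relation matrices plus a weighted sum with the student's cross entropy. A minimal numeric sketch, where the matrices R_t and R_s and the cross-entropy value are illustrative stand-ins:

```python
import numpy as np

def group_relation_loss(R_t, R_s):
    # Squared Frobenius norm of the difference between the teacher's and
    # the student's group relation matrices (S55).
    return np.linalg.norm(R_t - R_s, ord="fro") ** 2

rng = np.random.default_rng(2)
K = 4
R_t = rng.normal(size=(K, K)); R_t = R_t + R_t.T  # symmetric stand-in matrices
R_s = rng.normal(size=(K, K)); R_s = R_s + R_s.T

alpha, beta = 1.0, 0.5     # adjustable weights alpha and beta from S56 (example values)
L_ce_s = 1.7               # stand-in value of the student cross-entropy loss
L_gr = group_relation_loss(R_t, R_s)
L_total = alpha * L_ce_s + beta * L_gr  # total loss optimized w.r.t. theta^s and W^s
```

In training, `L_total` would be minimized by gradient descent over the student parameters while the teacher's relation matrix is held fixed.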
Further, the specific steps of testing the student network in step six are as follows:
s61: fix the parameters θ^s and W^s of the student network; randomly extract any image from the test set and input it into the student network; the i-th image sample, denoted x_i, has convolutional-layer feature:

$$F_i^s = f^s(x_i;\, \theta^s)$$
s62: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_i^s = g^s(F_i^s;\, W^s) = (W^s)^{\mathrm T} F_i^s$$

where W^s = [w_1^s, …, w_C^s] collects the weight vectors of the fully-connected layer; there are C of them, consistent with the number of image classification categories;
s63: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^s(y_i = c \mid x_i) = \frac{\exp(z_{i,c}^s)}{\sum_{j=1}^{C} \exp(z_{i,j}^s)}$$

where z_{i,j}^s is the j-th component of the classification score z_i^s, produced by the j-th weight vector w_j^s of the classifier parameter matrix W^s; the test image is assigned to the class with the largest probability output value.
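At test time (S61 to S63), the frozen student simply performs a forward pass and picks the class with the highest Softmax probability. A small sketch with stand-in features and weights (F_i, W_s, and predict are illustrative names):

```python
import numpy as np

def predict(F, W_s):
    # Scores via the fully-connected layer, probabilities via Softmax,
    # then an argmax classification decision.
    z = F @ W_s
    z = z - z.max(axis=1, keepdims=True)   # stabilize before exponentiating
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p, p.argmax(axis=1)

rng = np.random.default_rng(3)
D, C = 16, 5
F_i = rng.normal(size=(1, D))   # stand-in feature of one test image x_i
W_s = rng.normal(size=(D, C))   # frozen student classifier parameters W^s
p, pred = predict(F_i, W_s)     # class probabilities and predicted label
```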
The beneficial effects of the invention are as follows: the invention discloses a model compression method based on group relation knowledge distillation, in which the features the teacher network extracts from samples are clustered with the K-means algorithm and the student network is then guided to imitate the teacher's ability to group samples, so that the performance of the student network approaches that of the teacher network.
Drawings
FIG. 1 is a schematic structural diagram of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood, however, that the description herein of specific embodiments is only intended to illustrate the invention and not to limit the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, and the terms used herein in the specification of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention.
As shown in fig. 1, the present invention is a method for compressing a model based on group relation knowledge distillation, comprising the following steps:
step one: processing image data;
s11: randomly dividing a given image data set into three subsets, namely a training set, a verification set and a test set, used respectively for training the model, validating the hyper-parameters and testing the model's performance;
s12: the number of image classification classes is C; the m-th image sample in the training set is denoted x_m, with corresponding label y_m; the i-th image sample in the test set is denoted x_i.
Step two: training a teacher network;
s21: randomly initializing a large-capacity convolutional neural network consisting of a plurality of convolutional layers, a plurality of fully-connected layers and a Softmax layer; the convolutional layers are represented as f^t(·; θ^t) and extract the features of image samples, where θ^t denotes the convolutional-layer parameters; the fully-connected layers are represented as g^t(·; W^t) and transform and classify the features, where W^t is the parameter matrix of the fully-connected layer; the Softmax layer converts the classification scores into classification probability output values;
S22: randomly taking M image samples from the training set and inputting them into the teacher network, the feature of the m-th image sample after the convolutional layers is:

$$F_m^t = f^t(x_m;\, \theta^t)$$
s23: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_m^t = g^t(F_m^t;\, W^t) = (W^t)^{\mathrm T} F_m^t$$

where W^t = [w_1^t, …, w_C^t] collects the weight vectors of the teacher network's fully-connected layer; there are C of them, consistent with the number of image classification categories;
s24: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^t(y_m = c \mid x_m) = \frac{\exp(z_{m,c}^t)}{\sum_{j=1}^{C} \exp(z_{m,j}^t)}$$

where z_{m,j}^t is the j-th component of the classification score z_m^t, produced by the j-th weight vector w_j^t of the classifier parameter matrix W^t;
S25: the cross-entropy loss function between the classification probability outputs and the true labels is:

$$L_{\mathrm{CE}}^t = -\frac{1}{M}\sum_{m=1}^{M} \log p^t(y_m \mid x_m)$$

where p^t(y_m | x_m) is the classification probability output for the true label y_m, w_j^t is the j-th weight vector of the fully-connected layer parameter matrix W^t, and M is the number of image samples.
Step three: constructing teacher network group relation knowledge;
s31: randomly extracting M image samples in a training set, inputting the M image samples into a teacher network to extract features, and clustering the features of the M image samples into K groups by using a K-means algorithm;
s32: suppose the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^t, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^t; then the relation coefficient between the k-th and l-th groups of sample features, based on the maximum mean discrepancy, is computed as:

$$R_{kl}^t = \left\| \frac{1}{N_k}\sum_{n=1}^{N_k} F_{k,n}^t \;-\; \frac{1}{N_l}\sum_{n=1}^{N_l} F_{l,n}^t \right\|_2^2$$
step four: constructing student network group relation knowledge;
s41: randomly initializing a small-capacity convolutional neural network;
s42: the same batch of M image samples randomly extracted from the training set is input into the student network to extract features, and the features of the M image samples are clustered into K groups using the K-means algorithm;
s43: suppose the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^s, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^s; then the relation coefficient between the k-th and l-th groups of sample features, based on the maximum mean discrepancy, is computed as:

$$R_{kl}^s = \left\| \frac{1}{N_k}\sum_{n=1}^{N_k} F_{k,n}^s \;-\; \frac{1}{N_l}\sum_{n=1}^{N_l} F_{l,n}^s \right\|_2^2$$
step five: training a student network;
s51: the same batch of M image samples randomly drawn from the training set is input into the student network; the feature of the m-th image sample after the convolutional layers is:

$$F_m^s = f^s(x_m;\, \theta^s)$$
s52: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_m^s = g^s(F_m^s;\, W^s) = (W^s)^{\mathrm T} F_m^s$$

where W^s = [w_1^s, …, w_C^s] collects the weight vectors of the fully-connected layer; there are C of them, consistent with the number of image classification categories;
s53: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^s(y_m = c \mid x_m) = \frac{\exp(z_{m,c}^s)}{\sum_{j=1}^{C} \exp(z_{m,j}^s)}$$

where z_{m,j}^s is the j-th component of the classification score z_m^s, produced by the j-th weight vector w_j^s of the classifier parameter matrix W^s;
s54: the cross-entropy loss function between the classification probability outputs and the true labels is:

$$L_{\mathrm{CE}}^s = -\frac{1}{M}\sum_{m=1}^{M} \log p^s(y_m \mid x_m)$$

where p^s(y_m | x_m) is the student network's classification probability output, w_j^s is the j-th weight vector of the fully-connected layer parameter matrix W^s, and M is the number of image samples;
s55: the group relation distillation loss function is obtained from the F-norm between the relation matrices R^t and R^s:

$$L_{\mathrm{GR}} = \left\| R^t - R^s \right\|_F^2$$
s56: the total loss function for optimizing the student network is:

$$L = \alpha L_{\mathrm{CE}}^s + \beta L_{\mathrm{GR}}$$

where α and β are adjustable parameters; optimizing the parameters θ^s and W^s of the student network with this total loss completes the model compression process.
Step six: and (5) testing the student network.
S61: parameter θ in fixed student networksAnd Ws(ii) a Randomly extracting any image in the test set and inputting the image into the student network, wherein the ith image sample is represented as xi, and the characteristic expression of passing through the convolution layer in the student network is as follows:
Figure BDA00034642084000000911
s62: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_i^s = g^s(F_i^s;\, W^s) = (W^s)^{\mathrm T} F_i^s$$

where W^s = [w_1^s, …, w_C^s] collects the weight vectors of the fully-connected layer; there are C of them, consistent with the number of image classification categories;
s63: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^s(y_i = c \mid x_i) = \frac{\exp(z_{i,c}^s)}{\sum_{j=1}^{C} \exp(z_{i,j}^s)}$$

where z_{i,j}^s is the j-th component of the classification score z_i^s, produced by the j-th weight vector w_j^s of the classifier parameter matrix W^s; the test image is assigned to the class with the largest probability output value.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents or improvements made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A model compression method based on group relation knowledge distillation is characterized by comprising the following steps:
step one: processing image data;
step two: training a teacher network;
randomly initializing a large-capacity convolutional neural network consisting of a plurality of convolutional layers, a plurality of fully-connected layers and a Softmax layer, wherein the convolutional layers and the fully-connected layers are represented as f^t(·; θ^t) and g^t(·; W^t) respectively, θ^t being the convolutional-layer parameters and W^t the parameter matrix of the fully-connected layer; randomly extracting M image samples from the training set and inputting them into the teacher network; the m-th image sample is represented after the convolutional layers as F_m^t = f^t(x_m; θ^t) and after the fully-connected layer as z_m^t = g^t(F_m^t; W^t); finally, the Softmax layer converts the score into the probability output value of belonging to the c-th category, p^t(y_m = c | x_m); the parameters of the teacher network are then optimized with the cross-entropy objective function L_CE^t between the probability output value p^t(y_m = c | x_m) and the true label y_m of the sample;
step three: constructing teacher network group relation knowledge;
randomly extracting M image samples from the training set, inputting them into the teacher network to extract features, and clustering the features of the M image samples into K groups using the K-means algorithm, wherein the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^t, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^t; the relation coefficient R_{kl}^t between any two groups is calculated using the maximum mean discrepancy, finally yielding the relation matrix R^t;
Step four: constructing student network group relation knowledge;
randomly initializing a small-capacity convolutional neural network consisting of a small number of convolutional layers, a small number of fully-connected layers and a Softmax layer, wherein the convolutional layers and the fully-connected layers are represented as f^s(·; θ^s) and g^s(·; W^s) respectively, θ^s being the convolutional-layer parameters and W^s the parameter matrix of the fully-connected layer; the same batch of M image samples randomly extracted from the training set is input into the student network to extract features, and the features of the M image samples are clustered into K groups using the K-means algorithm, wherein the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^s, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^s; the relation coefficient R_{kl}^s between any two groups is calculated using the maximum mean discrepancy, finally yielding the relation matrix R^s;
Step five: training a student network;
for the same batch of M image samples randomly extracted from the training set, the m-th image sample is input into the student network to obtain the probability output value of belonging to the c-th category, denoted p^s(y_m = c | x_m); the cross-entropy objective function L_CE^s between the probability output value p^s(y_m = c | x_m) and the true label y_m of the sample is calculated, and at the same time the F-norm between the relation matrices R^t and R^s is calculated as the group relation distillation loss function L_GR; the weighted sum of L_CE^s and L_GR is taken as the total loss function to optimize the parameters of the student network;
step six: testing a student network;
fixing the parameters $\theta_s$ and $W_s$ in the student network, any image is randomly extracted from the test set and input into the student network, the i-th image sample being denoted $x_i$; through the convolutional layers, the fully-connected layers and the Softmax layer of the student network, the probability output value of belonging to the c-th category is finally obtained, denoted $p_s(y_i=c\,|\,x_i)$.
2. The method for compressing a model based on knowledge distillation of group relationships according to claim 1, wherein the image data in the first step is processed by the following specific steps:
s11: for a given image data set, randomly dividing the given image data set into three subsets, namely a training set, a verification set and a test set, and respectively using the three subsets for training a model, verifying a hyper-parameter and testing the performance of the model;
s12: the number of image classification categories is C; the m-th image sample in the training set is denoted $x_m$ and its corresponding label $y_m$; the i-th image sample in the test set is denoted $x_i$.
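The split in s11 can be sketched as follows. This is a minimal NumPy sketch; the 80/10/10 ratio, the seed, and the function name are illustrative assumptions, not part of the claim:

```python
import numpy as np

def split_dataset(n_samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly split sample indices into train / validation / test subsets.

    Ratios are illustrative; the claim only requires three random subsets.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # random ordering of all samples
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    return (idx[:n_train],                    # training set indices
            idx[n_train:n_train + n_val],     # validation set indices
            idx[n_train + n_val:])            # test set indices

train_idx, val_idx, test_idx = split_dataset(1000)
```

The three subsets are disjoint and together cover the whole dataset, matching the roles named in s11 (model training, hyper-parameter validation, performance testing).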
3. The group-relationship-knowledge-distillation-based model compression method as claimed in claim 1, wherein the training of the teacher network in the second step comprises the specific steps of:
s21: randomly initializing a large-capacity convolutional neural network consisting of several convolutional layers, several fully-connected layers and a Softmax layer; the convolutional layers, denoted $f_t(\cdot;\theta_t)$, extract the features of the image samples, where $\theta_t$ is the parameter of the convolutional layers; the fully-connected layers, denoted $g_t(\cdot;W_t)$, transform and classify the features, where $W_t$ is the parameter matrix of the fully-connected layers; the Softmax layer converts the classification scores into classification probability output values;
s22: M image samples are randomly taken from the training set and input into the teacher network; the feature of the m-th image sample after the convolutional layers is expressed as $F_t^{m}=f_t(x_m;\theta_t)$;
s23: the classification score obtained after the feature of the image sample passes through the fully-connected layers is expressed as $z_t^{m}=W_t^{\mathrm{T}}F_t^{m}$, where the weight vectors $w_t^{j}$ of the teacher network fully-connected layer number C in total, consistent with the number of image classification categories;
s24: the calculation formula by which the Softmax layer converts the classification score into the probability output value of belonging to the c-th category is $p_t(y_m=c\,|\,x_m)=\frac{\exp(z_{t,c}^{m})}{\sum_{j=1}^{C}\exp(z_{t,j}^{m})}$, where $z_{t,j}^{m}=(w_t^{j})^{\mathrm{T}}F_t^{m}$ is the j-th component of the classification score $z_t^{m}$ and $w_t^{j}$ is the j-th weight vector of the classifier parameter matrix $W_t$;
s25: the cross-entropy loss function between the classification output probability values and the ground-truth labels is $\mathcal{L}_{CE}^{t}=-\frac{1}{M}\sum_{m=1}^{M}\sum_{c=1}^{C}\mathbb{1}[y_m=c]\log p_t(y_m=c\,|\,x_m)$, where $p_t(y_m=c\,|\,x_m)$ is the classification output probability value, $w_t^{j}$ is the j-th weight vector of the fully-connected layer parameter matrix $W_t$, and M is the number of image samples.
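Steps s24 and s25 amount to a standard softmax followed by cross-entropy. A minimal NumPy sketch, where `scores` stands in for the fully-connected outputs $W_t^{\mathrm{T}}F_t^{m}$ and the function names are illustrative:

```python
import numpy as np

def softmax(scores):
    """Convert per-sample classification scores into class probabilities (s24)."""
    z = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class over M samples (s25)."""
    m = probs.shape[0]
    return -np.log(probs[np.arange(m), labels]).mean()

# Toy batch: M=2 samples, C=3 categories, true labels 0 and 1.
scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 3.0, 0.4]])
probs = softmax(scores)
loss = cross_entropy(probs, np.array([0, 1]))
```

Each row of `probs` sums to 1, and the loss decreases as the probability assigned to the true class grows, which is exactly what the teacher-network training objective in s25 rewards.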
4. The group relationship knowledge distillation-based model compression method as claimed in claim 1, wherein the construction of teacher network group relationship knowledge in the third step comprises the following specific steps:
s31: randomly extracting M image samples in a training set, inputting the M image samples into a teacher network to extract features, and clustering the features of the M image samples into K groups by using a K-means algorithm;
s32: the k-th group contains $N_k$ samples, the feature of its i-th sample being denoted $F_t^{k,i}$; the l-th group contains $N_l$ samples, the feature of its j-th sample being denoted $F_t^{l,j}$; the calculation formula of the relation coefficient between the k-th and l-th groups of sample features based on the maximum mean difference is then $r_{kl}^{t}=\left\|\frac{1}{N_k}\sum_{i=1}^{N_k}F_t^{k,i}-\frac{1}{N_l}\sum_{j=1}^{N_l}F_t^{l,j}\right\|_{2}^{2}$
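Under the formula in s32, the relation coefficient between two groups is the squared distance between their feature means, and collecting all pairs gives the relation matrix. A minimal NumPy sketch, assuming the group assignments have already been produced by K-means as in s31 (the function and variable names are illustrative):

```python
import numpy as np

def relation_matrix(features, groups, K):
    """R[k, l] = || mean(group k) - mean(group l) ||_2^2 (maximum mean difference)."""
    means = np.stack([features[groups == k].mean(axis=0) for k in range(K)])
    diff = means[:, None, :] - means[None, :, :]   # (K, K, D) pairwise mean gaps
    return (diff ** 2).sum(axis=-1)                # squared L2 norm per group pair

# Toy example: 4 two-dimensional features, already clustered into K=2 groups.
feats = np.array([[0.0, 0.0],
                  [0.0, 2.0],
                  [3.0, 0.0],
                  [3.0, 2.0]])
grp = np.array([0, 0, 1, 1])
R_t = relation_matrix(feats, grp, K=2)
```

Group 0 has mean (0, 1) and group 1 has mean (3, 1), so the off-diagonal coefficients are 9 and the diagonal is 0; the same routine applied to student features would give the $R_s$ of the fourth step.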
5. the group relationship knowledge distillation-based model compression method as claimed in claim 1, wherein the concrete steps of the construction of the group relationship knowledge of the student network in the fourth step are as follows:
s41: randomly initializing a small-capacity convolutional neural network;
s42: the same batch of M image samples randomly extracted from the training set is fed into the student network to extract features, and the features of the M image samples are clustered into K groups using the K-means algorithm;
s43: the k-th group contains $N_k$ samples, the feature of its i-th sample being denoted $F_s^{k,i}$; the l-th group contains $N_l$ samples, the feature of its j-th sample being denoted $F_s^{l,j}$; the calculation formula of the relation coefficient between the k-th and l-th groups of sample features based on the maximum mean difference is then $r_{kl}^{s}=\left\|\frac{1}{N_k}\sum_{i=1}^{N_k}F_s^{k,i}-\frac{1}{N_l}\sum_{j=1}^{N_l}F_s^{l,j}\right\|_{2}^{2}$
6. the group relationship knowledge distillation-based model compression method according to claim 1, wherein the training of the student network in the fifth step comprises the following specific steps:
s51: for the same batch of M image samples randomly drawn from the training set and input into the student network, the feature of the m-th image sample after the convolutional layers is expressed as $F_s^{m}=f_s(x_m;\theta_s)$;
s52: the classification score obtained after the feature of the image sample passes through the fully-connected layers is expressed as $z_s^{m}=W_s^{\mathrm{T}}F_s^{m}$, where the weight vectors $w_s^{j}$ of the fully-connected layer number C in total, consistent with the number of image classification categories;
s53: the calculation formula by which the Softmax layer converts the classification score into the probability output value of belonging to the c-th category is $p_s(y_m=c\,|\,x_m)=\frac{\exp(z_{s,c}^{m})}{\sum_{j=1}^{C}\exp(z_{s,j}^{m})}$, where $z_{s,j}^{m}=(w_s^{j})^{\mathrm{T}}F_s^{m}$ is the j-th component of the classification score $z_s^{m}$ and $w_s^{j}$ is the j-th weight vector of the classifier parameter matrix $W_s$;
s54: the cross-entropy loss function between the classification output probability values and the ground-truth labels is $\mathcal{L}_{CE}^{s}=-\frac{1}{M}\sum_{m=1}^{M}\sum_{c=1}^{C}\mathbb{1}[y_m=c]\log p_s(y_m=c\,|\,x_m)$, where $p_s(y_m=c\,|\,x_m)$ is the classification output probability value of the student network, $w_s^{j}$ is the j-th weight vector of the fully-connected layer parameter matrix $W_s$, and M is the number of image samples;
s55: the calculation formula of the group relation distillation loss function $\mathcal{L}_{GR}$, obtained from the F norm between the relation matrices $R_t$ and $R_s$, is $\mathcal{L}_{GR}=\left\|R_t-R_s\right\|_{F}^{2}$;
s56: the total loss function for the student network optimization is $\mathcal{L}=\alpha\,\mathcal{L}_{CE}^{s}+\beta\,\mathcal{L}_{GR}$, where α and β are adjustable parameters; the parameters $\theta_s$ and $W_s$ in the student network are optimized with this total loss function, completing the model compression process.
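The objective in s55 and s56 combines the student's cross-entropy with the squared Frobenius norm of the gap between the two relation matrices. A minimal NumPy sketch with illustrative α and β values (real training would backpropagate through both terms; here they are only evaluated):

```python
import numpy as np

def group_relation_loss(R_t, R_s):
    """L_GR = || R_t - R_s ||_F^2, the group relation distillation loss (s55)."""
    return float(((R_t - R_s) ** 2).sum())

def total_loss(ce_loss, R_t, R_s, alpha=1.0, beta=0.1):
    """L = alpha * L_CE + beta * L_GR, the student objective (s56).

    alpha and beta are the adjustable weighting parameters from the claim;
    the values here are arbitrary examples.
    """
    return alpha * ce_loss + beta * group_relation_loss(R_t, R_s)

R_t = np.array([[0.0, 9.0], [9.0, 0.0]])   # teacher relation matrix (toy values)
R_s = np.array([[0.0, 8.0], [8.0, 0.0]])   # student relation matrix (toy values)
L = total_loss(ce_loss=0.5, R_t=R_t, R_s=R_s)
```

With these toy matrices the distillation term is (9−8)² + (9−8)² = 2, so the total is 1.0·0.5 + 0.1·2 = 0.7; driving this term to zero forces the student's group structure toward the teacher's.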
7. The group relationship knowledge distillation-based model compression method as claimed in claim 1, wherein the concrete steps of the test of the student network in the sixth step are as follows:
s61: fixing the parameters $\theta_s$ and $W_s$ in the student network, any image is randomly extracted from the test set and input into the student network; the i-th image sample is denoted $x_i$, and its feature after the convolutional layers of the student network is expressed as $F_s^{i}=f_s(x_i;\theta_s)$;
s62: the classification score obtained after the feature of the image sample passes through the fully-connected layers is expressed as $z_s^{i}=W_s^{\mathrm{T}}F_s^{i}$, where the weight vectors $w_s^{j}$ of the fully-connected layer number C in total, consistent with the number of image classification categories;
s63: the calculation formula by which the Softmax layer converts the classification score into the probability output value of belonging to the c-th category is $p_s(y_i=c\,|\,x_i)=\frac{\exp(z_{s,c}^{i})}{\sum_{j=1}^{C}\exp(z_{s,j}^{i})}$, where $z_{s,j}^{i}=(w_s^{j})^{\mathrm{T}}F_s^{i}$ is the j-th component of the classification score $z_s^{i}$ and $w_s^{j}$ is the j-th weight vector of the classifier parameter matrix $W_s$.
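At test time (s61 to s63) the parameters stay fixed and the predicted category is simply the most probable class under the Softmax output. A minimal NumPy sketch, with an illustrative score vector standing in for $W_s^{\mathrm{T}}F_s^{i}$:

```python
import numpy as np

def predict(scores):
    """Softmax over a single sample's classification scores, then the argmax (s63)."""
    e = np.exp(scores - scores.max())   # shift for numerical stability
    probs = e / e.sum()
    return probs, int(probs.argmax())

probs, pred = predict(np.array([0.2, 2.5, 0.3]))   # toy scores for C=3 categories
```

Here the second category gets the highest score, so it is the predicted class; no gradients or relation matrices are involved at inference.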
CN202210030247.2A 2022-01-11 2022-01-11 Model compression method based on group relation knowledge distillation Withdrawn CN114626504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210030247.2A CN114626504A (en) 2022-01-11 2022-01-11 Model compression method based on group relation knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210030247.2A CN114626504A (en) 2022-01-11 2022-01-11 Model compression method based on group relation knowledge distillation

Publications (1)

Publication Number Publication Date
CN114626504A true CN114626504A (en) 2022-06-14

Family

ID=81899061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210030247.2A Withdrawn CN114626504A (en) 2022-01-11 2022-01-11 Model compression method based on group relation knowledge distillation

Country Status (1)

Country Link
CN (1) CN114626504A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115511059A (en) * 2022-10-12 2022-12-23 北华航天工业学院 Network lightweight method based on convolutional neural network channel decoupling
CN115511059B (en) * 2022-10-12 2024-02-09 北华航天工业学院 Network light-weight method based on convolutional neural network channel decoupling
CN117058437A (en) * 2023-06-16 2023-11-14 江苏大学 Flower classification method, system, equipment and medium based on knowledge distillation
CN117058437B (en) * 2023-06-16 2024-03-08 江苏大学 Flower classification method, system, equipment and medium based on knowledge distillation

Similar Documents

Publication Publication Date Title
CN114626504A (en) Model compression method based on group relation knowledge distillation
CN109801621A (en) A kind of audio recognition method based on residual error gating cycle unit
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN108345866B (en) Pedestrian re-identification method based on deep feature learning
CN111985581A (en) Sample-level attention network-based few-sample learning method
CN111695456A (en) Low-resolution face recognition method based on active discriminability cross-domain alignment
CN114157539B (en) Data-aware dual-drive modulation intelligent identification method
CN110598552A (en) Expression recognition method based on improved particle swarm optimization convolutional neural network optimization
CN112766378B (en) Cross-domain small sample image classification model method focusing on fine granularity recognition
CN106971180A (en) A kind of micro- expression recognition method based on the sparse transfer learning of voice dictionary
Chen et al. A semisupervised deep learning framework for tropical cyclone intensity estimation
CN112633154A (en) Method and system for converting heterogeneous face feature vectors
Huang et al. Design and Application of Face Recognition Algorithm Based on Improved Backpropagation Neural Network.
CN116452862A (en) Image classification method based on domain generalization learning
CN115731595A (en) Fuzzy rule-based multi-level decision fusion emotion recognition method
CN109034192B (en) Track-vehicle body vibration state prediction method based on deep learning
CN114387474A (en) Small sample image classification method based on Gaussian prototype classifier
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN116543269B (en) Cross-domain small sample fine granularity image recognition method based on self-supervision and model thereof
CN109165576A (en) A kind of moving state identification method and device
CN116246305A (en) Pedestrian retrieval method based on hybrid component transformation network
CN112784800B (en) Face key point detection method based on neural network and shape constraint
CN114997331A (en) Small sample relation classification method and system based on metric learning
CN112001222B (en) Student expression prediction method based on semi-supervised learning
CN114357166A (en) Text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220614