CN113505719A - Gait recognition model compression system and method based on a local-global joint knowledge distillation algorithm
Gait recognition model compression system and method based on a local-global joint knowledge distillation algorithm
- Publication number: CN113505719A
- Application number: CN202110824459.3A
- Authority: CN (China)
- Prior art keywords: layer, convolution, network, model, pooling
- Legal status: Granted
Classifications
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045: Neural networks; combinations of networks
- G06N3/08: Neural networks; learning methods
- G06N5/00: Computing arrangements using knowledge-based models
- Y02T10/40: Engine management systems
Abstract
The invention discloses a gait recognition model compression system and method based on a local-global joint knowledge distillation algorithm. The system uses depthwise separable convolutions to design a compact, lightweight gait model network: its convolutional network retains only the backbone convolutional network, each convolution module of which is simplified, and it adopts a 16-layer lightweight fully connected network, greatly compressing the model parameters, simplifying the computation of the recognition model, and improving recognition efficiency. The method performs joint knowledge distillation using both the local feature vectors output by the convolutional networks of the teacher and student models and the global feature vectors output by their fully connected networks; the convolution operations preserve the local features of pedestrian gait while the fully connected operations extract its global features, increasing the information content of the knowledge distillation and improving pedestrian gait recognition.
Description
Technical Field
The invention belongs to the technical field of image recognition, and particularly relates to a gait recognition model compression system and method based on a local-global joint knowledge distillation algorithm.
Background
Gait recognition is an emerging biometric recognition technology that aims to find and extract the characteristic differences between pedestrians from a series of walking postures, so as to identify pedestrians automatically. Compared with other biometric technologies, gait recognition has many advantages: it requires no active cooperation from the identified subject, has low demands on image resolution, is not restricted to particular viewing angles, is difficult to disguise, and works at long recognition distances. For these reasons, gait recognition is very widely applicable in fields such as video surveillance and intelligent security.
At present, most gait recognition technology is designed around standard convolutional neural networks: a recognition model is trained on collected gait video samples with pedestrian labels, so that the model learns useful gait appearance and motion features from the samples and recognizes pedestrians from those features. Depending on whether a human body model is established, existing gait recognition techniques can be divided into model-based methods and appearance-based methods.
Model-based methods extract gait features by building a human skeleton or pose structure; they are computationally expensive and structurally complicated. Appearance-based methods, currently the most common, extract gait features directly from raw video captured by a camera, and can be subdivided into feature-template-based, gait-video-based, and set-based methods.
Feature-template-based methods (e.g. the gait energy image) extract gait features from a feature template for recognition; they are simple to implement but easily lose temporal features. Gait-video-based methods extract spatiotemporal gait features for recognition with a three-dimensional standard convolutional neural network, but the model is large and difficult to train. Set-based methods extract single-frame gait silhouette features with a two-dimensional standard convolutional neural network and aggregate spatiotemporal gait features with a set pooling structure, achieving highly efficient recognition performance, yet the model scale is still large.
In summary, conventional gait recognition methods rely mainly on high-capacity neural network models; they suffer from large parameter counts, long training times, and difficulty of deployment, and are unsuitable for practical applications with strict real-time requirements.
Although traditional model compression methods can reduce model capacity and parameter counts to some extent, they are too simple to preserve the key information in the model, so the recognition performance of the model degrades severely; traditional model compression methods are therefore unsuitable for the compression of gait recognition models.
Disclosure of Invention
The invention aims to provide a gait recognition model compression method based on a local-global joint knowledge distillation algorithm, which effectively safeguards the gait recognition accuracy of a model while reducing the scale of the model parameters.
In order to achieve the purpose, the invention adopts the following technical scheme:
A gait recognition model compression system based on the local-global joint knowledge distillation algorithm comprises a teacher model Mt and a student model Ms, wherein:
the teacher model Mt consists of a convolution network, an aggregation pooling structure, a horizontal pyramid pooling structure and a full-connection network;
the convolution network consists of a backbone network and a plurality of layers of global channels;
the backbone network consists of a first convolution module, a second convolution module and a third convolution module;
the first convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer, using a 5 × 5 convolution kernel; the second layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the third layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2;
the second convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the third layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2;
the third convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the second layer is a standard convolutional layer, using a convolution kernel of 3 × 3;
the multilayer global channel consists of a fourth convolution module and a fifth convolution module;
the fourth convolution module consists of three layers, wherein: the first layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the third layer is a pooling layer, using a 2 × 2 pooling kernel;
the fifth convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the second layer is a standard convolutional layer, using a convolution kernel of 3 × 3;
the teacher model Mt contains four set pooling structures, two horizontal pyramid pooling structures, and one fully connected network;
the four set pooling structures are defined as the first, second, third, and fourth set pooling structures respectively; the two horizontal pyramid pooling structures are defined as the first and second horizontal pyramid pooling structures respectively;
the fully connected network comprises a first fully connected sub-network and a second fully connected sub-network;
the output of the first convolution module is connected with the input of the second convolution module;
the output of the first convolution module is also connected with the input of the fourth convolution module through the first set pooling structure;
the output of the second convolution module is connected with the input of the third convolution module;
the output of the second convolution module is connected with the input of the second set pooling structure, and the output of the second set pooling structure is added with the output of the fourth convolution module at the corresponding position and then connected with the input of the fifth convolution module;
the output of the third convolution module is connected with the input of the third set pooling structure, and the output of the third set pooling structure is added element-wise with the output of the fifth convolution module at corresponding positions and then connected with the input of the first horizontal pyramid pooling structure;
the output of the third convolution module is also connected with the input of the second horizontal pyramid pooling structure through a fourth set pooling structure;
the output of the first horizontal pyramid pooling structure is connected to the input of the first fully-connected sub-network;
the output of the second horizontal pyramid pooling structure is connected to the input of a second fully-connected sub-network;
the outputs of the first and second fully connected sub-networks serve as the output of the teacher model Mt;
the first horizontal pyramid pooling structure and the second horizontal pyramid pooling structure have five scales;
the first fully-connected sub-network and the second fully-connected sub-network respectively comprise 31 independent fully-connected neural network layers;
the student model Ms consists of a simplified convolutional network, a set pooling structure, a simplified horizontal pyramid pooling structure, and a simplified fully connected network, wherein:
the simplified convolution network consists of a sixth convolution module, a seventh convolution module and an eighth convolution module;
the sixth convolution module consists of two layers, wherein: the first layer is a standard convolutional layer, using a 5 × 5 convolution kernel; the second layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2;
the seventh convolution module consists of three layers, wherein:
the first layer is a depthwise separable convolutional layer, using a 5 × 5 convolution kernel; the second layer is a pointwise convolutional layer, using a 1 × 1 convolution kernel; the third layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2;
the eighth convolution module consists of two layers, wherein:
the first layer is a depthwise separable convolutional layer, using a 3 × 3 convolution kernel; the second layer is a pointwise convolutional layer, using a 1 × 1 convolution kernel;
the sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence;
the set pooling structure in the student model Ms is defined as the fifth set pooling structure;
the output of the eighth convolution module is connected with the input of the reduced horizontal pyramid pooling structure through a fifth set pooling structure;
the output of the simplified horizontal pyramid pooling structure is connected with the input of the simplified fully-connected network;
the simplified horizontal pyramid pooling structure has only one scale; the simplified fully connected network comprises 16 independent fully connected neural network layers; the output of the simplified fully connected network serves as the output of the student model Ms.
In addition, the invention provides a gait recognition model compression method based on the local-global joint knowledge distillation algorithm, built on the gait recognition model compression system described above; the specific technical scheme is as follows:
The gait recognition model compression method based on the local-global joint knowledge distillation algorithm comprises the following steps:
Step 1. Extract the gait silhouette sequences in the gait videos using a background subtraction method, uniformly crop them into picture sets of the same size to form a data set X, and divide the data set X into a training set Xtrain and a test set Xtest.
Step 2. Train the teacher model Mt with the training set Xtrain, setting the learning rate and the number of iterations, with the Adam optimizer as the optimizer and the triplet loss function Ltri shown in equation (1) as the loss function;
In the formula, Ntri+ denotes the total number of sample pairs in a training sample subset whose Euclidean distance is non-zero; a training sample subset is a set consisting of a plurality of sample images randomly selected from the training set Xtrain at each training iteration;
n represents the number of the fully-connected neural network layers of the teacher network, and t represents the serial number of the fully-connected neural network layers of the teacher network;
p represents the number of pedestrians contained in each subset of training samples;
i and p respectively represent the serial numbers of the pedestrian samples to be trained in each training sample subset;
k represents the number of video sequences of each pedestrian in each subset of training samples;
a, j and k respectively represent the serial numbers of the pedestrian video sequences in each training sample subset;
m represents a boundary threshold of the loss function;
F^t_{i,a} denotes the anchor, i.e. the a-th video sequence sample of the i-th pedestrian in the training sample subset, and F^t_{i,j} denotes any sample with the same pedestrian identity as the anchor;
F^t_{p,k} denotes any sample whose pedestrian identity differs from that of the anchor;
the symbol ‖·‖2 denotes the 2-norm of a matrix;
[·]+ denotes the ReLU operation, computed as [x]+ = max{0, x}, where max is the maximum-value operation;
Step 3. Input the training set Xtrain into the trained teacher model Mt and the untrained student model Ms, obtaining for the same data the multidimensional feature matrix Fc^t output by the convolutional network in the teacher model Mt, the multidimensional feature matrix Fc^s output by the simplified convolutional network in the student model Ms, the multidimensional feature matrix Ff^t output by the fully connected network in the teacher model Mt, and the multidimensional feature matrix Ff^s output by the simplified fully connected network in the student model Ms;
the multidimensional feature matrices Fc^t and Fc^s have dimension b × s × c × h × w, and the multidimensional feature matrices Ff^t and Ff^s have dimension b × n × d, where b denotes the number of samples in each training sample subset, s the number of frames, c the number of feature maps output by the convolutional layers, h the height and w the width of the convolutional output feature maps, and d the dimension of the feature matrix output by the fully connected network;
Step 4. Use the difference metric function Lc_dis to compute the difference between the multidimensional feature matrices Fc^t and Fc^s, where Lc_dis is calculated as in equation (2);
in the formula, the difference metric function Lc_dis denotes the local distillation loss, and the symbol ‖·‖F² denotes the squared F-norm of a matrix;
Step 5. Use equation (3) to compute the pairwise differences between the samples in the multidimensional feature matrices Ff^t and Ff^s, denoting the results D+^t, D−^t, D+^s and D−^s respectively;
where D+^t denotes the distances between all same-class samples in the feature matrix output by the teacher model, D−^t the distances between all different-class samples in the feature matrix output by the teacher model, D+^s the distances between all same-class samples in the feature matrix output by the student model, and D−^s the distances between all different-class samples in the feature matrix output by the student model;
in these computations, F^t_{i,j} and F^t_{p,k} denote, when training the teacher model, the j-th sample to be trained in the i-th training sample subset and the k-th sample to be trained in the p-th training sample subset, taken over same-class and different-class samples respectively; F^s_{i,j} and F^s_{p,k} denote the corresponding samples when training the student model;
Step 6. Use the triplet loss function Ltri of equation (1) to compute the triplet losses Ltri^t and Ltri^s over these distances, with the specific formulas given in equations (4) and (5);
where Ntri+^t denotes the total number of sample pairs with non-zero Euclidean distance in a training sample subset in the teacher model, and Ntri+^s the corresponding total in the student model;
then compute the global distillation loss Lf_dis from the triplet losses Ltri^t and Ltri^s using the Smooth L1 loss function in equation (6);
Step 7. Combine the local distillation loss Lc_dis, the global distillation loss Lf_dis, and the triplet losses Ltri^t and Ltri^s to obtain the total loss Ltotal, with the specific formula given in equation (7);
Step 8. Set the number of iterations of the student model Ms, select the Adam optimizer, and transfer the knowledge of the teacher model to the student model by reducing the loss value Ltotal;
Step 9. Input the pedestrian video sequences of the test set Xtest into the student network Ms for recognition, obtaining the recognition results.
The invention has the following advantages:
As described above, the invention provides a gait recognition model compression system based on a local-global joint knowledge distillation algorithm. A compact, lightweight gait model network (the student model) is designed around depthwise separable convolutions: its convolutional network retains only the backbone convolutional network, each convolution module of which is simplified; specifically, depthwise separable convolutional layers and 1 × 1 pointwise convolutional layers replace the standard convolutional layers of the existing scheme. In addition, the model network adopts a 16-layer lightweight fully connected network in place of the 31-layer fully connected network of the existing scheme, greatly compressing the model parameters, simplifying the computation of the recognition model, and improving recognition efficiency. On top of this compression system, the invention further provides a local-global joint knowledge distillation algorithm suited to gait recognition tasks. Compared with the prior art, the designed method performs joint knowledge distillation using both the local feature vectors output by the convolutional networks and the global feature vectors output by the fully connected networks: the convolution operations preserve the local features of pedestrian gait while the fully connected operations extract its global features, increasing the information content of the distillation, improving pedestrian gait recognition, and safeguarding the gait recognition accuracy of the compressed model.
Drawings
FIG. 1 is a block diagram of the gait recognition model compression system based on the local-global joint knowledge distillation algorithm according to the present invention;
FIG. 2 is a schematic diagram of a structure of aggregate pooling in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a horizontal pyramid pooling configuration in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a simplified horizontal pyramid pooling configuration in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a multi-layer global channel according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a deep separation convolution module according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of the gait recognition model compression method based on the local-global joint knowledge distillation algorithm.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
example 1
This embodiment describes a gait recognition model compression system based on the local-global joint knowledge distillation algorithm.
As shown in FIG. 1, the gait recognition model compression system constructs two recognition models based on deep neural networks, recorded respectively as the large-capacity teacher model Mt and the lightweight student model Ms.
The addition symbol in FIG. 1 denotes element-wise addition.
The teacher model Mt is composed of a convolution network, an aggregation pooling structure, a horizontal pyramid pooling structure and a full-connection network.
The convolutional network consists of a backbone network and a plurality of layers of global channels.
The backbone network is composed of a first convolution module, a second convolution module and a third convolution module.
The first convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer, using a 5 × 5 convolution kernel; its input is s (frames) × 1 × 64 × 44 data and its output is an s (frames) × 32 × 64 × 44 feature map;
the second layer is a standard convolutional layer, using a 3 × 3 convolution kernel; its input is s (frames) × 32 × 64 × 44 data and its output is an s (frames) × 32 × 64 × 44 feature map;
the third layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2; its input is s (frames) × 32 × 64 × 44 and its output is s (frames) × 32 × 32 × 22.
The second convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer, using a 3 × 3 convolution kernel; its input is s (frames) × 32 × 32 × 22 data and its output is an s (frames) × 64 × 32 × 22 feature map;
the second layer is a standard convolutional layer, using a 3 × 3 convolution kernel; its input is s (frames) × 64 × 32 × 22 data and its output is an s (frames) × 64 × 32 × 22 feature map;
the third layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2; its input is s (frames) × 64 × 32 × 22 and its output is s (frames) × 64 × 16 × 11.
The third convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer, using a 3 × 3 convolution kernel; its input is s (frames) × 64 × 16 × 11 data and its output is an s (frames) × 128 × 16 × 11 feature map;
the second layer is a standard convolutional layer, using a 3 × 3 convolution kernel; its input is s (frames) × 128 × 16 × 11 data and its output is an s (frames) × 128 × 16 × 11 feature map.
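For concreteness, a minimal PyTorch-style sketch of the three backbone modules follows. The class name is illustrative, and ReLU activations and "same" padding are assumptions (the patent does not state them), chosen so that the feature-map sizes match those listed above:

```python
import torch
import torch.nn as nn

# Sketch of the teacher backbone (modules 1-3). Frames are folded into the
# batch dimension, so the input is the s silhouette frames of one sequence.
class TeacherBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.module1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),    # s x 32 x 64 x 44
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),   # s x 32 x 64 x 44
            nn.MaxPool2d(kernel_size=2, stride=2),                    # s x 32 x 32 x 22
        )
        self.module2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),   # s x 64 x 32 x 22
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),   # s x 64 x 32 x 22
            nn.MaxPool2d(kernel_size=2, stride=2),                    # s x 64 x 16 x 11
        )
        self.module3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),   # s x 128 x 16 x 11
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),  # s x 128 x 16 x 11
        )

    def forward(self, x):            # x: (s, 1, 64, 44) silhouette frames
        f1 = self.module1(x)
        f2 = self.module2(f1)
        f3 = self.module3(f2)
        return f1, f2, f3            # intermediate outputs also feed the global channel

f1, f2, f3 = TeacherBackbone()(torch.randn(30, 1, 64, 44))
assert f3.shape == (30, 128, 16, 11)
```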
As shown in fig. 5, the multi-layered global channel is composed of a fourth convolution module and a fifth convolution module.
The fourth convolution module consists of three layers, wherein:
the first layer is a standard convolutional layer, using a 3 × 3 convolution kernel; its input is 32 × 32 × 22 data and its output is a 64 × 32 × 22 feature map;
the second layer is a standard convolution layer, a 3 × 3 convolution kernel is used, the input of the second layer is data of 64 × 32 × 22, and a feature map of 64 × 32 × 22 is output;
the third layer is a pooling layer, and a 2 × 2 pooling kernel is used to input 64 × 32 × 22 data and output a 64 × 16 × 11 feature map.
The fifth convolution module consists of two layers, wherein:
the first layer is a standard convolution layer, a 3 x 3 convolution kernel is used, the input of the first layer is data of 64 x 16 x 11, and the output is a feature map of 128 x 16 x 11;
the second layer is a standard convolutional layer, a 3 × 3 convolutional kernel is used, the input of the second layer is 128 × 16 × 11 data, and the output is a 128 × 16 × 11 feature map.
There are four set pooling structures, two horizontal pyramid pooling structures, and one fully connected network in the teacher model Mt.
The four set pooling structures are defined as the first, second, third, and fourth set pooling structures, respectively.
One of the set pooling structures is illustrated in FIG. 2.
The input of a set pooling structure is the group of feature matrices corresponding to s video frames, each of dimension 128 × 16 × 11; the output is a single processed feature matrix of dimension 128 × 16 × 11, in which the value of each element is the maximum extracted from the corresponding positions of the input feature matrices.
A characteristic of set pooling is that it is invariant to frame order: the feature matrices corresponding to the input video frames may be arranged in any order.
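A minimal sketch of this max-based set pooling follows, assuming the s frame-level feature maps are stacked along the leading tensor dimension; the function name is illustrative:

```python
import torch

def set_pooling(frame_feats: torch.Tensor) -> torch.Tensor:
    """Aggregate s frame-level feature maps into one set-level feature map.

    frame_feats: tensor of shape (s, 128, 16, 11), one feature matrix per frame.
    Returns a (128, 16, 11) tensor whose every element is the maximum over the
    s corresponding positions, so the result is invariant to frame order.
    """
    return frame_feats.max(dim=0).values

# Permutation invariance: shuffling the frame order leaves the output unchanged.
feats = torch.randn(30, 128, 16, 11)
perm = torch.randperm(30)
assert torch.equal(set_pooling(feats), set_pooling(feats[perm]))
```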
The two horizontal pyramid pooling structures are respectively defined as a first horizontal pyramid pooling structure and a second horizontal pyramid pooling structure, and one of the horizontal pyramid pooling structures is taken as an example, as shown in fig. 3.
The input of a horizontal pyramid pooling structure is a 128 × 16 × 11 feature matrix. The matrix is decomposed at 5 scales into intermediate feature matrices: 1 feature matrix of dimension 128 × 16 × 11, 2 of dimension 128 × 8 × 11, 4 of dimension 128 × 4 × 11, 8 of dimension 128 × 2 × 11, and 16 of dimension 128 × 1 × 11, giving 31 feature matrices in total.
The invention adopts global maximum pooling operation and global average pooling operation to compress the second dimension and the third dimension of each intermediate feature matrix into 128-dimension vectors. The above process is illustrated by way of example:
When the global max pooling operation is applied to a 128 × 16 × 11 feature matrix, the matrix is decomposed into 128 sub-matrices of dimension 16 × 11, the maximum value of each 16 × 11 sub-matrix is computed, 128 maximum values are obtained in total, and these 128 values are combined into a 128-dimensional output feature vector; similarly, when global average pooling is applied to a 128 × 16 × 11 feature matrix, the matrix is decomposed into 128 sub-matrices of dimension 16 × 11, the average of each 16 × 11 sub-matrix is computed, and the 128 averages are combined into a 128-dimensional output feature vector.
The final output of one horizontal pyramid pooling structure is 31 128-dimensional vectors.
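A sketch of the five-scale horizontal pyramid pooling follows. The text lists global max pooling and global average pooling but does not state how their 128-dimensional outputs are combined, so summing them (as in common horizontal pyramid implementations) is an assumption here; the function name is illustrative:

```python
import torch

def horizontal_pyramid_pooling(x: torch.Tensor, scales=(1, 2, 4, 8, 16)):
    """Sketch of the 5-scale horizontal pyramid pooling.

    x: (128, 16, 11) set-level feature matrix. At each scale the height axis
    is split into equal strips (1 + 2 + 4 + 8 + 16 = 31 strips in total), and
    each strip is reduced to a 128-dimensional vector. Combining max and
    average pooling by summation is an assumption, not stated in the patent.
    """
    c, h, w = x.shape                                     # 128, 16, 11
    vectors = []
    for n_strips in scales:
        strips = x.view(c, n_strips, h // n_strips, w)    # split along height
        gmp = strips.amax(dim=(2, 3))                     # (128, n_strips) max per strip
        gap = strips.mean(dim=(2, 3))                     # (128, n_strips) mean per strip
        vectors.append((gmp + gap).t())                   # n_strips vectors of dim 128
    return torch.cat(vectors, dim=0)                      # (31, 128)

out = horizontal_pyramid_pooling(torch.randn(128, 16, 11))
assert out.shape == (31, 128)
```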
The fully connected network comprises a first fully connected sub-network and a second fully connected sub-network; the first fully-connected sub-network and the second fully-connected sub-network respectively comprise 31 independent fully-connected neural network layers.
The fully connected network contains 62 fully connected layers in total; each layer takes a 128-dimensional input and outputs a 256-dimensional feature.
The outputs of the first fully-connected sub-network and the second fully-connected sub-network are taken as the outputs of the teacher model Mt.
The output of the first convolution module is connected to the input of the second convolution module. The output of the first convolution module is also connected to the input of the fourth convolution module through a first aggregate pooling structure.
The output of the second convolution module is connected to the input of the third convolution module. And the output of the second convolution module is connected with the input of the second set pooling structure, and the output of the second set pooling structure is added with the output of the fourth convolution module at the corresponding position and then connected with the input of the fifth convolution module.
And the output of the third convolution module is connected with the input of the third set pooling structure, and the output of the third set pooling structure is added with the output of the fifth convolution module at the corresponding position and then connected with the input of the first horizontal pyramid pooling structure.
The output of the third convolution module is also connected to the input of the second horizontal pyramid pooling structure via a fourth set pooling structure.
The output of the first horizontal pyramid pooling structure is connected to the input of the first fully-connected sub-network; the output of the second horizontal pyramid pooling structure is connected to the input of a second fully-connected sub-network.
The student model Ms consists of a simplified convolutional network, a set pooling structure, a simplified horizontal pyramid pooling structure, and a simplified fully connected network.
The reduced convolutional network only comprises one backbone convolutional network, and specifically, as shown in fig. 1, the reduced convolutional network comprises a sixth convolutional module, a seventh convolutional module, and an eighth convolutional module.
Compared with a first convolution module, a second convolution module and a third convolution module in a backbone network in a teacher model, the sixth convolution module, the seventh convolution module and the eighth convolution module are simplified respectively.
Compared with the first convolution module, the sixth convolution module omits the standard convolutional layer with a 3 × 3 convolution kernel.
The seventh convolution module replaces the two standard convolutional layers of the second convolution module with a depthwise separable convolutional layer with a 5 × 5 convolution kernel and a pointwise convolution with a 1 × 1 convolution kernel, respectively.
Similarly, the eighth convolution module replaces the two standard convolutional layers of the third convolution module with a depthwise separable convolutional layer with a 3 × 3 convolution kernel and a pointwise convolution with a 1 × 1 convolution kernel, respectively.
Specifically, the sixth convolution module is composed of two layers, wherein:
the first layer is a standard convolutional layer, using a 5 × 5 convolution kernel; its input is s (frames) × 1 × 64 × 44 data and its output is an s (frames) × 32 × 64 × 44 feature map;
the second layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2; its input is s (frames) × 32 × 64 × 44 data and its output is an s (frames) × 32 × 32 × 22 feature map.
The seventh convolution module consists of three layers, wherein:
the first layer is a depthwise separable convolutional layer, using a 5 × 5 convolution kernel; its input is s (frames) × 32 × 32 × 22 data and its output is an s (frames) × 32 × 32 × 22 feature map;
the second layer is a pointwise convolutional layer, using a 1 × 1 convolution kernel; its input is s (frames) × 32 × 32 × 22 data and its output is an s (frames) × 64 × 32 × 22 feature map;
the third layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2; its input is s (frames) × 64 × 32 × 22 data and its output is an s (frames) × 64 × 16 × 11 feature map;
the eighth convolution module is comprised of two layers, wherein:
the first layer is a depthwise separable convolutional layer, using a 3 × 3 convolution kernel; its input is s (frames) × 64 × 16 × 11 data and its output is an s (frames) × 64 × 16 × 11 feature map;
the second layer is a pointwise convolutional layer, using a 1 × 1 convolution kernel; its input is s (frames) × 64 × 16 × 11 data and its output is an s (frames) × 128 × 16 × 11 feature map.
The structure of the depthwise separable convolutional layer is shown in FIG. 6; it is a known structure and is not described in detail here.
The sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence.
In a preferred embodiment, a pointwise convolutional layer is additionally placed before the first depthwise separable convolutional layer of the seventh convolution module, and similarly before the first depthwise separable convolutional layer of the eighth convolution module.
This design improves the network performance of the lightweight student model while leaving the model capacity almost unchanged (see the sketch below).
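A sketch of the seventh convolution module under this preferred embodiment follows; it assumes a PyTorch-style implementation (the depthwise convolution expressed through the `groups` argument), with the class name and ReLU placement as additional assumptions:

```python
import torch
import torch.nn as nn

# Sketch of the seventh convolution module of the student model: a 5x5
# depthwise convolution (groups equal to the channel count) followed by a
# 1x1 pointwise convolution and 2x2 max pooling. The leading 1x1 layer is
# the optional pointwise convolution of the preferred embodiment.
class SeventhConvModule(nn.Module):
    def __init__(self, in_ch=32, out_ch=64, use_leading_pointwise=True):
        super().__init__()
        layers = []
        if use_leading_pointwise:
            layers += [nn.Conv2d(in_ch, in_ch, kernel_size=1), nn.ReLU()]
        layers += [
            # depthwise: one 5x5 filter per input channel
            nn.Conv2d(in_ch, in_ch, kernel_size=5, padding=2, groups=in_ch), nn.ReLU(),
            # pointwise: 1x1 convolution mixes channels and expands 32 -> 64
            nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):          # x: (s, 32, 32, 22)
        return self.block(x)       # -> (s, 64, 16, 11)

y = SeventhConvModule()(torch.randn(30, 32, 32, 22))
assert y.shape == (30, 64, 16, 11)
```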
The set pooling structure in the student model Ms is defined as the fifth set pooling structure; the output of the eighth convolution module is connected through the fifth set pooling structure to the input of the simplified horizontal pyramid pooling structure.
The fifth set pooling structure likewise consists of a statistical (maximum) function; its input is the s (frames) × 128 × 16 × 11 feature matrix and its output is a 128 × 16 × 11 feature matrix.
The reduced horizontal pyramid pooling structure is composed of a global maximum pooling and a global average pooling, and the structure is shown in fig. 4.
The input of the simplified horizontal pyramid pooling is a 128 × 16 × 11 feature matrix; the intermediate features are 16 three-dimensional matrices of dimension 128 × 1 × 11, and 16 feature vectors of 128 dimensions are output through global max pooling and global average pooling.
The output of the simplified horizontal pyramid pooling structure is connected with the input of a simplified fully-connected network, the simplified fully-connected network comprises 16 independent fully-connected neural network layers, the input of each layer is a 128-dimensional vector, and the output is a 128-dimensional vector.
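A minimal sketch of this student head (single-scale horizontal pooling plus the 16 independent fully connected layers) follows; as in the earlier pyramid-pooling sketch, summing the max- and average-pooled vectors is an assumption, and the class name is illustrative:

```python
import torch
import torch.nn as nn

# Sketch of the student model head: the single-scale simplified horizontal
# pyramid pooling (16 strips of the 128 x 16 x 11 set-level feature) followed
# by 16 independent 128 -> 128 fully connected layers.
class StudentHead(nn.Module):
    def __init__(self, channels=128, strips=16):
        super().__init__()
        self.strips = strips
        self.fc = nn.ModuleList(nn.Linear(channels, channels) for _ in range(strips))

    def forward(self, x):                                    # x: (128, 16, 11)
        c, h, w = x.shape
        parts = x.view(c, self.strips, h // self.strips, w)  # 16 horizontal strips
        vecs = parts.amax(dim=(2, 3)) + parts.mean(dim=(2, 3))  # (128, 16)
        # one independent fully connected layer per horizontal strip
        return torch.stack([fc(vecs[:, i]) for i, fc in enumerate(self.fc)])

out = StudentHead()(torch.randn(128, 16, 11))
assert out.shape == (16, 128)
```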
Compared with conventional gait recognition models built from standard convolutional layers, the invention designs a compact, lightweight gait recognition model (the student model) using low-cost depthwise separable convolutions, structurally reducing the number of model parameters.
Example 2
This embodiment describes a gait recognition model compression method based on the local-global joint knowledge distillation algorithm; it builds on the gait recognition model compression system of Embodiment 1.
In this Embodiment 2, gait recognition is achieved by training the two models of Embodiment 1.
As shown in fig. 7, the gait recognition model compression method based on the local-global joint knowledge distillation algorithm includes the following steps:
Step 1. Extract the gait silhouette sequences in the gait videos using a background subtraction method and uniformly crop them into picture sets of the same size; the picture sets form a data set X, which is divided into a training set Xtrain and a test set Xtest.
Taking a deep convolutional neural network as the basic structure, two gait recognition models are constructed, recorded respectively as the large-capacity teacher model Mt and the lightweight student model Ms; the model structures are described in Embodiment 1 above and are not repeated here.
Step 2. Train the teacher model Mt with the training set Xtrain. The learning rate is set to 0.0001 and the number of iterations to 80000; the optimizer is the Adam optimizer, and each iteration inputs 16 sequences for each of 8 objects (128 sequences in total, with each sequence randomly selecting 30 frames of images, scaled to a size of 64 × 44 pixels).
The loss function is the triplet loss function Ltri shown in equation (1);
In the formula, Ntri+ denotes the total number of sample pairs in a training sample subset whose Euclidean distance is non-zero; a training sample subset is a set consisting of a plurality of sample images randomly selected from the training set Xtrain at each training iteration;
n represents the number of the fully-connected neural network layers of the teacher network, and t represents the serial number of the fully-connected neural network layers of the teacher network;
p represents the number of pedestrians contained in each subset of training samples;
i and p respectively represent the serial numbers of the pedestrian samples to be trained in each training sample subset;
k represents the number of video sequences of each pedestrian in each subset of training samples;
a, j and k respectively represent the serial numbers of the pedestrian video sequences in each training sample subset;
m represents a boundary threshold of the loss function;
F^t_{i,a} denotes the anchor, i.e. the a-th video sequence sample of the i-th pedestrian in the training sample subset, and F^t_{i,j} denotes any sample with the same pedestrian identity as the anchor;
F^t_{p,k} denotes any sample whose pedestrian identity differs from that of the anchor;
the symbol ‖·‖2 denotes the 2-norm of a matrix;
[·]+ denotes the ReLU operation, computed as [x]+ = max{0, x}, where max is the maximum-value operation.
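Formula (1) is printed as an image in the original and is not reproduced in this text. Under the definitions above (n fully connected layers indexed by t, P pedestrians indexed by i and p, K video sequences indexed by a, j and k, margin m, and the anchor/positive/negative notation F^t just introduced), a standard triplet-loss reconstruction consistent with those definitions would read:

```latex
L_{tri} = \frac{1}{N_{tri+}} \sum_{t=1}^{n} \sum_{i=1}^{P} \sum_{a=1}^{K}
          \sum_{\substack{j=1 \\ j \neq a}}^{K}
          \sum_{\substack{p=1 \\ p \neq i}}^{P} \sum_{k=1}^{K}
          \left[\, m + \left\| F^{t}_{i,a} - F^{t}_{i,j} \right\|_{2}
                    - \left\| F^{t}_{i,a} - F^{t}_{p,k} \right\|_{2} \,\right]_{+}
```

This reconstruction is an editorial assumption, not a verbatim transcription of the printed formula; only the pairs contributing a non-zero term inside [·]+ are counted by the normalization Ntri+.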
By reducing the loss value, samples of the same object are drawn closer together while samples of different objects are pushed apart; the boundary threshold m of the loss function is set to 0.2, and the training target is to make the recognition performance of the teacher model Mt as good as possible.
Step 3. Input the training set Xtrain into the trained teacher model Mt and the untrained student model Ms, obtaining for the same data the multidimensional feature matrix Fc^t output by the convolutional network in the teacher model Mt, the multidimensional feature matrix Fc^s output by the simplified convolutional network in the student model Ms, the multidimensional feature matrix Ff^t output by the fully connected network in the teacher model Mt, and the multidimensional feature matrix Ff^s output by the simplified fully connected network in the student model Ms.
The multidimensional feature matrices Fc^t and Fc^s have dimension b × s × c × h × w, and the multidimensional feature matrices Ff^t and Ff^s have dimension b × n × d, where b denotes the number of samples in each training sample subset, s the number of frames, c the number of feature maps output by the convolutional layers, h the height and w the width of the convolutional output feature maps, and d the dimension of the feature matrix output by the fully connected network.
Step 4. Use the difference metric function Lc_dis to compute the difference between the multidimensional feature matrices Fc^t and Fc^s, where Lc_dis is calculated as in equation (2).
In the formula, the difference metric function Lc_dis denotes the local distillation loss, and the symbol ‖·‖F² denotes the squared F-norm of a matrix.
When computing the similarity-matrix differences of the output features, the feature-matrix difference is further normalized with the L2 regularization method, which guides the student model to learn from the teacher model more effectively.
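Formula (2) is likewise printed as an image. One reading consistent with the surrounding text (similarity matrices computed from the output features, L2-normalized, and compared under a squared F-norm, which also explains why, as noted in the next paragraph, the channel counts of the two models need not match) is sketched below; it is an interpretation, not a verbatim transcription of the printed formula:

```python
import torch
import torch.nn.functional as F

def local_distillation_loss(fc_t: torch.Tensor, fc_s: torch.Tensor) -> torch.Tensor:
    """Sketch of the local distillation loss Lc_dis between convolutional
    features. fc_t: (b, s, c_t, h, w) teacher features; fc_s: (b, s, c_s, h, w)
    student features; c_t and c_s need not match. Each sample's features are
    flattened, a b x b similarity matrix is built per model, its rows are
    L2-normalized, and the squared F-norm of the difference is taken. This
    similarity-based formulation is an assumption consistent with the text.
    """
    b = fc_t.shape[0]
    t = fc_t.reshape(b, -1)
    s = fc_s.reshape(b, -1)
    g_t = F.normalize(t @ t.T, p=2, dim=1)      # row-normalized teacher similarities
    g_s = F.normalize(s @ s.T, p=2, dim=1)      # row-normalized student similarities
    return ((g_t - g_s) ** 2).sum() / (b * b)   # squared F-norm, averaged over b^2

loss = local_distillation_loss(torch.randn(8, 30, 128, 16, 11),
                               torch.randn(8, 30, 128, 16, 11))
```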
Under this approach, the channel counts of the convolutional output features of the teacher model Mt and the student model Ms need not remain consistent; that is, a larger- or smaller-capacity model may be designed for knowledge distillation.
Step 5. Use equation (3) to compute the pairwise differences between the samples in the multidimensional feature matrices Ff^t and Ff^s, denoting the results D+^t, D−^t, D+^s and D−^s respectively;
where D+^t denotes the distances between all same-class samples in the feature matrix output by the teacher model, D−^t the distances between all different-class samples in the feature matrix output by the teacher model, D+^s the distances between all same-class samples in the feature matrix output by the student model, and D−^s the distances between all different-class samples in the feature matrix output by the student model. Same-class samples are picture samples with the same pedestrian identity label; different-class samples are picture samples with different pedestrian identity labels.
In these computations, F^t_{i,j} and F^t_{p,k} denote, when training the teacher model, the j-th sample to be trained in the i-th training sample subset and the k-th sample to be trained in the p-th training sample subset, taken over same-class and different-class samples respectively; F^s_{i,j} and F^s_{p,k} denote the corresponding samples when training the student model.
Step 6. Use the triplet loss function Ltri of equation (1) to compute the triplet losses Ltri^t and Ltri^s over these distances; the specific formulas are given in equations (4) and (5);
where m represents the boundary threshold of the loss function, set to 0.2;
Ntri+^t denotes the total number of sample pairs with non-zero Euclidean distance in a training sample subset in the teacher model;
and Ntri+^s denotes the total number of sample pairs with non-zero Euclidean distance in a training sample subset in the student model.
The global distillation loss Lf_dis is then computed from the triplet losses Ltri^t and Ltri^s using the Smooth L1 loss function in equation (6).
Step 7. Combine the local distillation loss Lc_dis, the global distillation loss Lf_dis, and the triplet losses Ltri^t and Ltri^s to obtain the total loss Ltotal, with the specific formula given in equation (7).
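Formulas (4) to (7) are also printed as images. A sketch of how steps 6 and 7 assemble the joint loss is given below; the Smooth L1 combination follows equation (6) as described, while the unit weighting of the terms in Ltotal is an explicit assumption, since formula (7) is not reproduced:

```python
import torch
import torch.nn.functional as F

def joint_total_loss(l_c_dis: torch.Tensor,
                     l_tri_t: torch.Tensor,
                     l_tri_s: torch.Tensor,
                     w_local: float = 1.0,
                     w_global: float = 1.0) -> torch.Tensor:
    """Sketch of steps 6-7: the global distillation loss Lf_dis is the
    Smooth L1 difference between the teacher and student triplet losses
    (equation (6)); the total loss combines the local distillation loss,
    the global distillation loss, and the student triplet loss. The equal
    unit weights are an assumption, not a transcription of formula (7).
    """
    l_f_dis = F.smooth_l1_loss(l_tri_s, l_tri_t)             # equation (6)
    return w_local * l_c_dis + w_global * l_f_dis + l_tri_s  # assumed form of (7)

total = joint_total_loss(torch.tensor(0.3), torch.tensor(0.5), torch.tensor(0.8))
```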
Step 8. Set the number of iterations of the student model Ms to 30000; the optimizer is the Adam optimizer, with the learning rate set to 0.005 for the first 10000 iterations and 0.001 for the last 20000 iterations. By reducing the loss value Ltotal, the knowledge of the teacher model is transferred into the student model Ms, improving the recognition performance of the student model.
Step 9. Input the pedestrian video sequences of the test set Xtest into the student model Ms for recognition, obtaining the recognition results.
As can be seen from this compression procedure, the invention uses the joint knowledge distillation algorithm to perform local and global knowledge distillation from the large-capacity gait recognition model (the teacher model); this guides the student model to learn more knowledge from the teacher model, reducing model capacity while preserving the original recognition performance as far as possible.
By adopting lightweight model compression together with the joint knowledge distillation technique, the method effectively safeguards the gait recognition accuracy of the model while reducing the scale of its parameters, thereby lowering operating cost, reducing training iterations and inference time, and improving model efficiency; it is therefore better suited to practical scenarios with high real-time requirements and large data volumes.
Experiments show that, compared with prior-art deep neural network models, the gait recognition model compression method reduces the number of model parameters by a factor of 9 and the computation by a factor of 19, while performance on the public dataset CASIA-B drops by only 2.2%; actual inference time is shortened, effectively addressing the model-efficiency problem.
It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
Claims (2)
1. A gait recognition model compression system based on a local-global joint knowledge distillation algorithm, characterized in that
it comprises a teacher model Mt and a student model Ms, wherein:
the teacher model Mt consists of a convolution network, an aggregation pooling structure, a horizontal pyramid pooling structure and a full-connection network;
the convolution network consists of a backbone network and a plurality of layers of global channels;
the backbone network consists of a first convolution module, a second convolution module and a third convolution module;
the first convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer, using a 5 × 5 convolution kernel; the second layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the third layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2;
the second convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the third layer is a max pooling layer, with a 2 × 2 pooling kernel and a stride of 2;
the third convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the second layer is a standard convolutional layer, using a convolution kernel of 3 × 3;
the multilayer global channel consists of a fourth convolution module and a fifth convolution module;
the fourth convolution module consists of three layers, wherein: the first layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer, using a 3 × 3 convolution kernel; the third layer is a pooling layer, using a 2 × 2 pooling kernel;
the fifth convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the second layer is a standard convolutional layer, using a convolution kernel of 3 × 3;
the teacher model Mt contains four set pooling structures, two horizontal pyramid pooling structures, and one fully connected network;
the four set pooling structures are defined as the first, second, third, and fourth set pooling structures respectively; the two horizontal pyramid pooling structures are defined as the first and second horizontal pyramid pooling structures respectively;
the fully connected network comprises a first fully connected sub-network and a second fully connected sub-network;
the output of the first convolution module is connected with the input of the second convolution module;
the output of the first convolution module is also connected with the input of the fourth convolution module through the first set pooling structure;
the output of the second convolution module is connected with the input of the third convolution module;
the output of the second convolution module is connected with the input of the second set pooling structure, and the output of the second set pooling structure is added with the output of the fourth convolution module at the corresponding position and then connected with the input of the fifth convolution module;
the output of the third convolution module is connected with the input of the third set pooling structure, and the output of the third set pooling structure is added element-wise with the output of the fifth convolution module at corresponding positions and then connected with the input of the first horizontal pyramid pooling structure;
the output of the third convolution module is also connected with the input of the second horizontal pyramid pooling structure through a fourth set pooling structure;
the output of the first horizontal pyramid pooling structure is connected to the input of the first fully-connected sub-network;
the output of the second horizontal pyramid pooling structure is connected to the input of a second fully-connected sub-network;
the outputs of the first and second fully-connected sub-networks serve as the outputs of the teacher model Mt;
the first horizontal pyramid pooling structure and the second horizontal pyramid pooling structure have five scales;
the first fully-connected sub-network and the second fully-connected sub-network each comprise 31 independent fully-connected neural network layers;
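The wiring of the teacher model described above can be sketched as follows, reusing BackboneNetwork from the previous sketch; set pooling is assumed to be a frame-wise max (a common choice for aggregating a gait sequence), the HPP scales are assumed to be 1/2/4/8/16 (which yields exactly 31 strips, matching the 31 fully connected layers per sub-network), and channel widths remain illustrative:

```python
import torch
import torch.nn as nn

def set_pooling(seq):
    # Aggregate frame-level feature maps into one set-level feature map;
    # a frame-wise max over dim 1 is assumed. seq: (b, s, c, h, w).
    return seq.max(dim=1).values

class HorizontalPyramidPooling(nn.Module):
    """Split a feature map into horizontal strips at several scales and pool
    each strip with max + average pooling, one feature vector per strip."""
    def __init__(self, scales=(1, 2, 4, 8, 16)):   # 1+2+4+8+16 = 31 strips
        super().__init__()
        self.scales = scales

    def forward(self, x):                          # x: (b, c, h, w)
        feats = []
        for s in self.scales:
            for strip in x.chunk(s, dim=2):        # s horizontal strips
                feats.append(strip.amax(dim=(2, 3)) + strip.mean(dim=(2, 3)))
        return torch.stack(feats, dim=1)           # (b, n_strips, c)

class TeacherModel(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = BackboneNetwork()
        self.module4 = nn.Sequential(              # fourth convolution module
            nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(inplace=True),
            nn.MaxPool2d(2))
        self.module5 = nn.Sequential(              # fifth convolution module
            nn.Conv2d(64, 128, 3, padding=1), nn.LeakyReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.LeakyReLU(inplace=True))
        self.hpp1 = HorizontalPyramidPooling()
        self.hpp2 = HorizontalPyramidPooling()
        # 31 independent fully connected layers per sub-network.
        self.fc1 = nn.ModuleList(nn.Linear(128, feat_dim) for _ in range(31))
        self.fc2 = nn.ModuleList(nn.Linear(128, feat_dim) for _ in range(31))

    def forward(self, seq):                        # seq: (b, s, 1, h, w)
        b, s = seq.shape[:2]
        f1, f2, f3 = self.backbone(seq.flatten(0, 1))      # per-frame features
        unflat = lambda f: f.view(b, s, *f.shape[1:])
        g = self.module4(set_pooling(unflat(f1)))          # 1st set pooling
        g = self.module5(set_pooling(unflat(f2)) + g)      # 2nd SP, element-wise add
        local = self.hpp1(set_pooling(unflat(f3)) + g)     # 3rd SP, add, HPP 1
        glob = self.hpp2(set_pooling(unflat(f3)))          # 4th SP, HPP 2
        out1 = torch.stack([fc(local[:, i]) for i, fc in enumerate(self.fc1)], 1)
        out2 = torch.stack([fc(glob[:, i]) for i, fc in enumerate(self.fc2)], 1)
        return torch.cat([out1, out2], dim=1)              # teacher output
```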
The student model Ms consists of a simplified convolutional network, a set pooling structure, a simplified horizontal pyramid pooling structure and a simplified fully-connected network, wherein:
the simplified convolution network consists of a sixth convolution module, a seventh convolution module and an eighth convolution module;
the sixth convolution module consists of two layers, wherein: the first layer is a standard convolutional layer, using a convolution kernel of 5 × 5; the second layer is a pooling layer, using a max pooling layer with a pooling kernel of 2 × 2 and a stride of 2;
the seventh convolution module consists of three layers, wherein:
the first layer is a depthwise convolution layer, using a convolution kernel of 5 × 5; the second layer is a pointwise convolution layer, using a convolution kernel of 1 × 1; the third layer is a pooling layer, using a max pooling layer with a pooling kernel of 2 × 2 and a stride of 2;
the eighth convolution module consists of two layers, wherein:
the first layer is a depthwise convolution layer, using a convolution kernel of 3 × 3; the second layer is a pointwise convolution layer, using a convolution kernel of 1 × 1;
the sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence;
the set pooling structure in the student model Ms is defined as the fifth set pooling structure;
the output of the eighth convolution module is connected with the input of the reduced horizontal pyramid pooling structure through a fifth set pooling structure;
the output of the simplified horizontal pyramid pooling structure is connected with the input of the simplified fully-connected network;
the simplified horizontal pyramid pooling structure has only one scale; the simplified fully-connected network comprises 16 independent fully-connected neural network layers; the output of the simplified fully-connected network serves as the output of the student model Ms.
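A corresponding sketch of the student model, under the same assumptions; the seventh and eighth modules replace standard convolutions with depthwise + pointwise pairs, which is where most of the parameter savings come from, and the single-scale HPP is assumed to use 16 strips to match the 16 fully connected layers:

```python
import torch
import torch.nn as nn

class StudentModel(nn.Module):
    """Student model Ms: simplified convolutional network with depthwise
    separable convolutions, one set pooling, single-scale HPP, 16 FC layers.
    Channel widths and activations are illustrative assumptions."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.convnet = nn.Sequential(
            # Sixth module: 5x5 standard conv + 2x2 max pooling, stride 2.
            nn.Conv2d(1, 32, 5, padding=2), nn.LeakyReLU(inplace=True),
            nn.MaxPool2d(2, stride=2),
            # Seventh module: 5x5 depthwise conv, 1x1 pointwise conv, pooling.
            nn.Conv2d(32, 32, 5, padding=2, groups=32),
            nn.Conv2d(32, 64, 1), nn.LeakyReLU(inplace=True),
            nn.MaxPool2d(2, stride=2),
            # Eighth module: 3x3 depthwise conv, 1x1 pointwise conv.
            nn.Conv2d(64, 64, 3, padding=1, groups=64),
            nn.Conv2d(64, 128, 1), nn.LeakyReLU(inplace=True))
        self.fc = nn.ModuleList(nn.Linear(128, feat_dim) for _ in range(16))

    def forward(self, seq):                        # seq: (b, s, 1, h, w)
        b, s = seq.shape[:2]
        f = self.convnet(seq.flatten(0, 1))
        f = f.view(b, s, *f.shape[1:]).max(dim=1).values   # fifth set pooling
        strips = f.chunk(16, dim=2)                # single-scale HPP, 16 strips
        feats = [st.amax(dim=(2, 3)) + st.mean(dim=(2, 3)) for st in strips]
        return torch.stack([fc(v) for fc, v in zip(self.fc, feats)], dim=1)
```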
2. A gait recognition model compression method based on a local-global joint knowledge distillation algorithm, applied to the gait recognition model compression system based on the local-global joint knowledge distillation algorithm of claim 1, characterized in that
the gait recognition model compression method comprises the following steps:
Step 1. Extract the gait contour sequences from the gait videos using a background subtraction method, uniformly crop them into picture sets of the same size to form a data set X, and divide the data set X into a training set Xtrain and a test set Xtest;
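A sketch of step 1 using OpenCV's MOG2 background subtractor; the claim only requires "a background subtraction method", so the choice of MOG2, the shadow threshold, and the 64 × 64 output size are all assumptions:

```python
import cv2
import numpy as np

def extract_gait_silhouettes(video_path, size=64):
    """Extract a gait contour sequence from one video by background
    subtraction, cropping each silhouette to its bounding box and
    resizing it to a uniform size."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
    silhouettes = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        # Shadows are marked 127 by MOG2; keep only confident foreground.
        _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            continue
        crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        silhouettes.append(cv2.resize(crop, (size, size)))
    cap.release()
    return silhouettes
```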
Step 2. Train the teacher model Mt with the training set Xtrain, setting the learning rate and the number of iterations; the Adam optimizer is adopted as the optimizer, and the triplet loss function Ltri shown in formula (1) is adopted as the loss function:

$$L_{tri}=\frac{1}{N_{tri+}}\sum_{t=1}^{n}\sum_{i=1}^{P}\sum_{a=1}^{K}\sum_{\substack{j=1\\ j\neq a}}^{K}\sum_{\substack{p=1\\ p\neq i}}^{P}\sum_{k=1}^{K}\Big[m+\big\|f_{i,a}^{t}-f_{i,j}^{t}\big\|_{2}-\big\|f_{i,a}^{t}-f_{p,k}^{t}\big\|_{2}\Big]_{+}\qquad(1)$$
In the formula, N_tri+ represents the total number of sample pairs in a training sample subset whose Euclidean distance is not 0; a training sample subset is a set of sample images randomly selected from the training set Xtrain for each round of training;
n represents the number of fully-connected neural network layers of the teacher network, and t represents the index of a fully-connected neural network layer;
P represents the number of pedestrians contained in each training sample subset;
i and p represent the indices of the pedestrian samples to be trained in each training sample subset;
K represents the number of video sequences of each pedestrian in each training sample subset;
a, j and k represent the indices of the pedestrian video sequences in each training sample subset;
m represents the boundary (margin) threshold of the loss function;
f_{i,j}^t represents any sample in the i-th training sample subset with the same pedestrian identity as the anchor sample f_{i,a}^t;
f_{p,k}^t represents any sample in the p-th training sample subset with a pedestrian identity different from that of f_{i,a}^t;
the symbol ||·||_2 represents the 2-norm of a matrix;
[·]_+ represents the ReLU operation, calculated as [x]_+ = max{0, x}, where max is the maximum-value operation;
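A sketch of formula (1) in its batch-all triplet form, consistent with the symbol definitions above; the margin value and the exact counting rule for N_tri+ are assumptions:

```python
import torch

def batch_all_triplet_loss(features, labels, margin=0.2):
    """Batch-all triplet loss over the outputs of every fully connected
    layer; assumed form of formula (1). The margin value is illustrative.
    features: (n_fc, b, d) - one embedding per FC layer t and sample
    labels:   (b,) pedestrian identities"""
    b = labels.numel()
    dist = torch.cdist(features, features)                  # (n_fc, b, b)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # (b, b)
    eye = torch.eye(b, dtype=torch.bool, device=labels.device)
    # loss[t, a, j, k] = [m + d(a, j) - d(a, k)]_+ for positive j, negative k.
    ap = dist.unsqueeze(3)                                  # d(anchor, positive)
    an = dist.unsqueeze(2)                                  # d(anchor, negative)
    loss = torch.relu(margin + ap - an)
    valid = ((same & ~eye).unsqueeze(2) & (~same).unsqueeze(1)).unsqueeze(0)
    loss = loss * valid
    # Normalize by the number of non-zero terms, the role played by N_tri+.
    return loss.sum() / (loss > 0).sum().clamp(min=1)
```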
Step 3. Input the training set Xtrain into the trained teacher model Mt and the untrained student model Ms, obtaining, for the same data set: the multi-dimensional feature matrix F_c^t output by the convolutional network of the teacher model Mt, the multi-dimensional feature matrix F_c^s output by the simplified convolutional network of the student model Ms, the multi-dimensional feature matrix F_f^t output by the fully-connected network of the teacher model Mt, and the multi-dimensional feature matrix F_f^s output by the simplified fully-connected network of the student model Ms;
The multi-dimensional feature matrices F_c^t and F_c^s have dimension b × s × c × h × w; the multi-dimensional feature matrices F_f^t and F_f^s have dimension b × n × d; where b represents the number of samples in each training sample subset; s represents the number of frames; c represents the number of feature maps output by the convolution layers; h and w represent the height and width of the feature maps output by the convolutional network; and d represents the dimensionality of the feature vectors output by the fully-connected network;
Step 4. Use the difference metric function L_c_dis to compute the difference between the multi-dimensional feature matrices F_c^t and F_c^s; the difference metric function L_c_dis is calculated as in formula (2):

$$L_{c\_dis}=\left\|F_{c}^{t}-F_{c}^{s}\right\|_{F}\qquad(2)$$

In the formula, the difference metric function L_c_dis represents the local (partial) distillation loss, and the symbol ||·||_F represents the F-norm of a matrix;
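A minimal sketch of step 4, assuming the two feature matrices already share a shape (in practice the teacher and student layer widths would need to match, or a small adapter would be inserted; neither detail is specified here):

```python
import torch

def local_distillation_loss(f_c_t, f_c_s):
    """L_c_dis: F-norm of the difference between the convolutional feature
    matrices of teacher and student, both of shape (b, s, c, h, w)."""
    return torch.norm(f_c_t - f_c_s)   # Frobenius norm over all elements
```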
Step 5. Using formula (3), separately compute the distances between the samples in the multi-dimensional feature matrices F_f^t and F_f^s, denoting the results D_same^t, D_diff^t, D_same^s and D_diff^s respectively;
where D_same^t represents the distances between all same-class samples in the feature matrix output by the teacher model; D_diff^t represents the distances between all different-class samples in the feature matrix output by the teacher model; D_same^s represents the distances between all same-class samples in the feature matrix output by the student model; and D_diff^s represents the distances between all different-class samples in the feature matrix output by the student model;
when training the teacher model: x_{i,j}^{t+} denotes the j-th sample to be trained of the same class in the i-th training sample subset; x_{p,k}^{t+} denotes the k-th sample to be trained of the same class in the p-th training sample subset; x_{i,j}^{t-} denotes the j-th sample to be trained of a different class in the i-th training sample subset; and x_{p,k}^{t-} denotes the k-th sample to be trained of a different class in the p-th training sample subset;
when training the student model, the corresponding samples x_{i,j}^{s+}, x_{p,k}^{s+}, x_{i,j}^{s-} and x_{p,k}^{s-} are defined analogously;
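A sketch of step 5, assuming formula (3) is the Euclidean distance between fully connected feature vectors, with same-class and different-class pairs selected by an identity mask:

```python
import torch

def class_split_distances(features, labels):
    """Return D_same (distances between all same-class samples) and D_diff
    (distances between all different-class samples).
    features: (b, n*d) flattened FC outputs; labels: (b,) identities."""
    dist = torch.cdist(features, features)              # (b, b) Euclidean
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(labels.numel(), dtype=torch.bool, device=labels.device)
    d_same = dist[same & ~eye]   # pairs with the same pedestrian identity
    d_diff = dist[~same]         # pairs with different pedestrian identities
    return d_same, d_diff
```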
Step 6. Using the triplet loss function Ltri shown in formula (1), compute the triplet loss L_tri^t of (D_same^t, D_diff^t) and the triplet loss L_tri^s of (D_same^s, D_diff^s) respectively; the specific calculations are shown in formulas (4) and (5);
where N_tri+^t represents the total number of sample pairs in a training sample subset whose Euclidean distance is not 0 in the teacher model;
and N_tri+^s represents the total number of sample pairs in a training sample subset whose Euclidean distance is not 0 in the student model;
Then calculate the total distillation loss L_f_dis between the triplet losses L_tri^t and L_tri^s using the Smooth L1 loss function shown in formula (6);
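Step 6 and the Smooth L1 comparison of formula (6) might look as follows, reusing batch_all_triplet_loss from the earlier sketch; the dummy tensors stand in for the FC feature matrices F_f^t and F_f^s of step 3, and detaching the teacher's loss is a design assumption (the teacher is already trained and should not receive gradients):

```python
import torch
import torch.nn.functional as F

# Placeholders for the FC outputs of teacher and student: (n_fc, b, d).
teacher_fc = torch.randn(31, 8, 256)
student_fc = torch.randn(16, 8, 256)
labels = torch.randint(0, 4, (8,))

# Formulas (4) and (5): triplet losses of teacher and student.
l_tri_t = batch_all_triplet_loss(teacher_fc, labels)
l_tri_s = batch_all_triplet_loss(student_fc, labels)

# Formula (6): total distillation loss L_f_dis, Smooth L1 between the
# two scalar triplet losses.
l_f_dis = F.smooth_l1_loss(l_tri_s, l_tri_t.detach())
```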
Step 7. Integrate the local distillation loss L_c_dis, the total distillation loss L_f_dis, and the triplet losses L_tri^t and L_tri^s to obtain the total loss L_total, calculated as in formula (7);
Step 8. Set the number of iterations for the student model Ms, select the Adam optimizer, and transfer the knowledge of the teacher model to the student model by minimizing the loss value L_total;
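Steps 7 and 8 combine the losses and optimize only the student; since formula (7)'s coefficients are not reproduced in this text, equal weights are assumed purely for illustration (l_tri_s, l_c_dis and l_f_dis as returned by the previous sketches, and student a StudentModel instance):

```python
import torch

# Formula (7): total loss; the weights alpha and beta are assumptions,
# and the teacher's triplet loss enters through l_f_dis in this sketch.
alpha, beta = 1.0, 1.0
l_total = l_tri_s + alpha * l_c_dis + beta * l_f_dis

# Step 8: minimize L_total with Adam over the student's parameters only,
# transferring the teacher's knowledge into the student.
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)   # lr assumed
optimizer.zero_grad()
l_total.backward()
optimizer.step()
```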
Step 9. Input the pedestrian video sequences of the test set Xtest into the student network Ms for recognition, obtaining the recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110824459.3A CN113505719B (en) | 2021-07-21 | 2021-07-21 | Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113505719A true CN113505719A (en) | 2021-10-15 |
CN113505719B CN113505719B (en) | 2023-11-24 |
Family
ID=78014088
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110824459.3A Active CN113505719B (en) | 2021-07-21 | 2021-07-21 | Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113505719B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108596039A (en) * | 2018-03-29 | 2018-09-28 | 南京邮电大学 | A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks |
CN108764462A (en) * | 2018-05-29 | 2018-11-06 | 成都视观天下科技有限公司 | A kind of convolutional neural networks optimization method of knowledge based distillation |
CN109034219A (en) * | 2018-07-12 | 2018-12-18 | 上海商汤智能科技有限公司 | Multi-tag class prediction method and device, electronic equipment and the storage medium of image |
CN110097084A (en) * | 2019-04-03 | 2019-08-06 | 浙江大学 | Pass through the knowledge fusion method of projection feature training multitask student network |
CN112560631A (en) * | 2020-12-09 | 2021-03-26 | 昆明理工大学 | Knowledge distillation-based pedestrian re-identification method |
CN112784964A (en) * | 2021-01-27 | 2021-05-11 | 西安电子科技大学 | Image classification method based on bridging knowledge distillation convolution neural network |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116246349A (en) * | 2023-05-06 | 2023-06-09 | 山东科技大学 | Single-source domain generalization gait recognition method based on progressive subdomain mining |
CN116246349B (en) * | 2023-05-06 | 2023-08-15 | 山东科技大学 | Single-source domain generalization gait recognition method based on progressive subdomain mining |
CN116824640A (en) * | 2023-08-28 | 2023-09-29 | 江南大学 | Leg identification method, system, medium and equipment based on MT and three-dimensional residual error network |
CN116824640B (en) * | 2023-08-28 | 2023-12-01 | 江南大学 | Leg identification method, system, medium and equipment based on MT and three-dimensional residual error network |
CN117237984A (en) * | 2023-08-31 | 2023-12-15 | 江南大学 | MT leg identification method, system, medium and equipment based on label consistency |
CN117237984B (en) * | 2023-08-31 | 2024-06-21 | 江南大学 | MT leg identification method, system, medium and equipment based on label consistency |
Also Published As
Publication number | Publication date |
---|---|
CN113505719B (en) | 2023-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107273800B (en) | Attention mechanism-based motion recognition method for convolutional recurrent neural network | |
CN108154194B (en) | Method for extracting high-dimensional features by using tensor-based convolutional network | |
CN111814661B (en) | Human body behavior recognition method based on residual error-circulating neural network | |
CN106778604B (en) | Pedestrian re-identification method based on matching convolutional neural network | |
CN113505719B (en) | Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm | |
CN105956560B (en) | A kind of model recognizing method based on the multiple dimensioned depth convolution feature of pondization | |
CN113239784B (en) | Pedestrian re-identification system and method based on space sequence feature learning | |
CN111325165B (en) | Urban remote sensing image scene classification method considering spatial relationship information | |
CN110414432A (en) | Training method, object identifying method and the corresponding device of Object identifying model | |
CN108830157A (en) | Human bodys' response method based on attention mechanism and 3D convolutional neural networks | |
CN107451565B (en) | Semi-supervised small sample deep learning image mode classification and identification method | |
CN110728183A (en) | Human body action recognition method based on attention mechanism neural network | |
CN108090472A (en) | Pedestrian based on multichannel uniformity feature recognition methods and its system again | |
WO2022227292A1 (en) | Action recognition method | |
CN113920581A (en) | Method for recognizing motion in video by using space-time convolution attention network | |
CN113505856B (en) | Non-supervision self-adaptive classification method for hyperspectral images | |
CN111881716A (en) | Pedestrian re-identification method based on multi-view-angle generation countermeasure network | |
CN114419732A (en) | HRNet human body posture identification method based on attention mechanism optimization | |
CN116543269B (en) | Cross-domain small sample fine granularity image recognition method based on self-supervision and model thereof | |
CN116246338B (en) | Behavior recognition method based on graph convolution and transducer composite neural network | |
CN115731579A (en) | Terrestrial animal individual identification method based on cross attention transducer network | |
CN112396036A (en) | Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction | |
CN114429646A (en) | Gait recognition method based on deep self-attention transformation network | |
CN114612718B (en) | Small sample image classification method based on graph structural feature fusion | |
CN115641525A (en) | Multi-user behavior analysis method based on video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||