CN113505719A

CN113505719A - Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm

Info

Publication number: CN113505719A
Application number: CN202110824459.3A
Authority: CN
Inventors: 单彩峰; 宋旭; 陈宇; 黄岩
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2021-07-21
Filing date: 2021-07-21
Publication date: 2021-10-15
Anticipated expiration: 2041-07-21
Also published as: CN113505719B

Abstract

The invention discloses a gait recognition model compression system and method based on a local-integral joint knowledge distillation algorithm. The system adopts deep separation convolution to design a compact lightweight gait model network, the convolution network in the model network only reserves a backbone convolution network, and each convolution module in the backbone convolution network is simplified; the model network adopts a 16-layer lightweight fully-connected network, so that model parameters are greatly compressed, calculation of the recognition model is simplified, and the recognition efficiency is improved; the method of the invention simultaneously utilizes the local feature vectors output by the convolution network in the teacher model and the student model and the global feature vectors output by the full-connection network to carry out the joint knowledge distillation, not only retains the local features of the gait of the pedestrian by convolution operation, but also extracts the global features of the gait of the pedestrian by the full-connection operation, increases the information content of the knowledge distillation and improves the effect of identifying the gait of the pedestrian.

Description

Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm

Technical Field

The invention belongs to the technical field of image recognition, and particularly relates to a gait recognition model compression system and method based on a local-integral joint knowledge distillation algorithm.

Background

Gait recognition is an emerging biological feature recognition technology, and aims to find and extract feature changes among different pedestrians from a series of walking postures so as to realize automatic identification of the identity of the pedestrians. Gait recognition has many advantages over other biometric technologies, such as: the method has the advantages of no need of active matching of the identified objects, low requirement on image pixels, no limitation of visual angles, easiness in distinguishing camouflage, long identification distance and the like. Based on this, the gait recognition technology has very wide application in the fields of video monitoring, intelligent security and the like.

At present, gait recognition technology is mostly designed based on a standard convolutional neural network, and a recognition model is trained by collecting gait video samples containing pedestrian labels, so that the model learns useful gait appearance and motion characteristics from the samples and can recognize according to the characteristics. Existing gait recognition techniques can be divided into model-based methods and appearance-based methods depending on whether or not a human body model is established.

The gait features need to be extracted by establishing a human skeleton or a posture structure based on the model method, the calculation amount is large, and the network structure is complicated. The appearance-based method is the most common method at present, can directly extract gait features from a raw video captured by a camera, and can be subdivided into a feature template-based method, a gait video-based method and a set-based method.

The method based on the characteristic template (such as a gait energy map) is simple to realize but is easy to lose time sequence characteristics by extracting gait characteristics in the characteristic template for identification; the method based on the gait video sequence effectively improves gait space-time characteristics for recognition through a three-dimensional standard convolutional neural network, but the model is large in scale and difficult to train; the method is based on a set, single-frame gait contour features and set pooling structure aggregation gait space-time features in a gait set are extracted through a two-dimensional standard convolutional neural network, high-efficiency identification performance is achieved, and the model scale is still large.

In conclusion, the conventional gait recognition method is mainly realized by adopting a high-capacity neural network model, has the defects of more model parameters, long training time and difficult application and popularization, and is not suitable for practical application occasions with higher real-time requirements.

Although the traditional model compression method can reduce the capacity of the model and reduce the model parameters to a certain extent, the traditional model compression method is simpler and cannot keep the key information in the model, so that the identification performance of the model is seriously reduced, and therefore, the traditional model compression method is not suitable for solving the compression problem of the gait identification model.

Disclosure of Invention

The invention aims to provide a gait recognition model compression method based on a local-integral joint knowledge distillation algorithm, which can effectively ensure the gait recognition accuracy of a model while reducing the scale of model parameters.

In order to achieve the purpose, the invention adopts the following technical scheme:

gait recognition model compression system based on local-global joint knowledge distillation algorithm comprises:

comprises a teacher model Mt and a student model M_s(ii) a Wherein:

the teacher model Mt consists of a convolution network, an aggregation pooling structure, a horizontal pyramid pooling structure and a full-connection network;

the convolution network consists of a backbone network and a plurality of layers of global channels;

the backbone network consists of a first convolution module, a second convolution module and a third convolution module;

the first convolution module is comprised of three layers, wherein:

the first layer is a standard convolutional layer, using a 5 × 5 convolutional kernel; the second layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the third layer is a pooling layer, and a largest pooling layer with a pooling core of 2 multiplied by 2 and a step length of 2 is adopted;

the second convolution module is comprised of three layers, wherein:

the first layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the second layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the third layer is a pooling layer, and a maximum pooling layer with a pooling core of 2 × 2 and a step length of 2 is used;

the third convolution module consists of two layers, wherein:

the first layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the second layer is a standard convolutional layer, using a convolution kernel of 3 × 3;

the multilayer global channel consists of a fourth convolution module and a fifth convolution module;

the fourth convolution module consists of three layers, wherein: the first layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the second layer is a standard convolutional layer, using a convolution kernel of 3 × 3; the third layer is a pooling layer, and 2 multiplied by 2 pooling cores are adopted;

the fifth convolution module consists of two layers, wherein:

four collection pooling structures are arranged in the teacher model Mt, two horizontal pyramid pooling structures are arranged, and one full-connection network is arranged;

defining four collection pooling structures as a first, a second and a third collection pooling structure respectively; defining two horizontal pyramid pooling structures as a first horizontal pyramid pooling structure and a second horizontal pyramid pooling structure respectively;

the fully connected network comprises a first fully connected sub-network and a second fully connected sub-network;

the output of the first convolution module is connected with the input of the second convolution module;

the output of the first convolution module is also connected with the input of the fourth convolution module through the first set pooling structure;

the output of the second convolution module is connected with the input of the third convolution module;

the output of the second convolution module is connected with the input of the second set pooling structure, and the output of the second set pooling structure is added with the output of the fourth convolution module at the corresponding position and then connected with the input of the fifth convolution module;

the output of the third convolution module is connected with the input of the third collection pooling structure, and the output of the third collection pooling structure is added with the output of the fifth convolution module at corresponding positions and then connected with the input of the first horizontal pyramid pooling structure;

the output of the third convolution module is also connected with the input of the second horizontal pyramid pooling structure through a fourth set pooling structure;

the output of the first horizontal pyramid pooling structure is connected to the input of the first fully-connected sub-network;

the output of the second horizontal pyramid pooling structure is connected to the input of a second fully-connected sub-network;

the outputs of the first and second fully-connected sub-networks are used as teacher models M_tAn output of (d);

the first horizontal pyramid pooling structure and the second horizontal pyramid pooling structure have five scales;

the first fully-connected sub-network and the second fully-connected sub-network respectively comprise 31 independent fully-connected neural network layers;

student model M_sThe system consists of a simplified convolutional network, a set pooling structure, a simplified horizontal pyramid pooling structure and a simplified full-connection network, wherein:

the simplified convolution network consists of a sixth convolution module, a seventh convolution module and an eighth convolution module;

the sixth convolution module consists of two layers, wherein: the first layer is a standard convolutional layer, using a 5 × 5 convolutional kernel; the second layer is a pooling layer, and a maximum pooling layer with a pooling core of 2 × 2 and a step length of 2 is used;

the seventh convolution module consists of three layers, wherein:

the first layer is a deep separation convolutional layer, using a convolution kernel of 5 × 5; the second layer is a point convolution layer, using a convolution kernel of 1 × 1; the third layer is a pooling layer, and a largest pooling layer with a pooling core of 2 multiplied by 2 and a step length of 2 is adopted;

the eighth convolution module is comprised of two layers, wherein:

the first layer uses a 3 × 3 convolution kernel for the depth-separated convolution layer; the second layer is a point convolution layer, using a convolution kernel of 1 × 1;

the sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence;

defining a student model M_sThe middle collection pooling structure is a fifth collection pooling structure;

the output of the eighth convolution module is connected with the input of the reduced horizontal pyramid pooling structure through a fifth set pooling structure;

the output of the simplified horizontal pyramid pooling structure is connected with the input of the simplified fully-connected network;

the simplified horizontal pyramid pooling structure has only one scale; the simplified fully-connected network comprises 16 independent fully-connected neural network layers; the output of the streamlined fully-connected network is used as a student model M_sTo output of (c).

In addition, the invention also provides a gait recognition model compression method based on the local-integral joint knowledge distillation algorithm, which is based on the gait recognition model compression system mentioned above, and the specific technical scheme is as follows:

the gait recognition model compression method based on the local-integral joint knowledge distillation algorithm comprises the following steps:

step 1, extracting a gait contour sequence in a gait video by using a background subtraction method, uniformly cutting the gait contour sequence into picture sets with the same size to form a data set X, and dividing the data set X into a training set X_trainAnd test set X_test；

Step 2. Using training set X_trainTeacher training model M_tSetting learning rate and iteration times, wherein an Adam optimizer is adopted as the optimizer, and a triple loss function L shown in formula (1) is adopted as the loss function_tri；

In the formula, N_tri+Representing a subset of training samplesThe total number of sample pairs consisting of two samples in which all Euclidean distances are not 0; the training sample subset is a training set X from each training_trainA set consisting of a plurality of randomly selected sample images;

n represents the number of the fully-connected neural network layers of the teacher network, and t represents the serial number of the fully-connected neural network layers of the teacher network;

p represents the number of pedestrians contained in each subset of training samples;

i and p respectively represent the serial numbers of the pedestrian samples to be trained in each training sample subset;

k represents the number of video sequences of each pedestrian in each subset of training samples;

a. j and k respectively represent the serial numbers of the pedestrian video sequences in each training sample subset;

m represents a boundary threshold of the loss function;

representing the jth sample to be trained in the ith training sample subset;

represents the ith subset of training samples

Any sample with the same identity of the represented pedestrian;

represents the p-th subset of training samples and

any sample with different represented pedestrian identities;

symbol | | | purple₂A 2-norm representing a matrix;

[]₊representing the ReLU operation, the calculation method is as follows: [ x ] of]₊Max {0, x }, max being the max-valued operation;

step 3, training set X_trainInput to trained teacher model M_tAnd untrained student model M_sRespectively obtaining the same data set in the teacher model M_tThe multi-dimensional feature matrix of the convolution network output

Model M for students_sThe multi-dimensional characteristic matrix of the simplified convolution network output

And in the teacher model M_tMulti-dimensional feature matrix of full-connection network output

Model M for students_sSimplified full-connection network output multidimensional characteristic matrix

Multidimensional feature matrix

And a multi-dimensional feature matrix

The dimension of (a) is b × s × c × h × w; multidimensional feature matrix

And a multi-dimensional feature matrix

Dimension of (b) is b × n × d; wherein b represents the number of samples in each training sample subset; s represents the number of frames; c represents the number of convolution layer output characteristic matrixes, h represents the height of the convolution network output characteristic matrixes, w represents the width of the convolution network output characteristic matrixes, and d represents the characteristic moment of the full-connection network outputThe dimension of the array;

step 4. make the difference metric function L_{c_dis}Computing a multidimensional feature matrix

And a multi-dimensional feature matrix

The difference between them; wherein the difference metric function L_{c_dis}The calculation formula of (a) is as follows:

in the formula, the difference metric function L_{c_dis}Representing a loss of partial distillation, the symbol | | | | non-conducting phosphor² _FRepresents the F-norm of the matrix;

step 5, respectively calculating the multidimensional characteristic matrix by using the formula (3)

And a multi-dimensional feature matrix

The difference between each sample in (1) and the result thereof are respectively

And

represents;

wherein,

representing the distance between all samples of the same category in the feature matrix output by the teacher model;

representing the distance between all the samples of different types in the feature matrix output by the teacher model;

representing the distance between all samples of the same category in a feature matrix output by the student model;

representing the distance between all the samples of different types in the feature matrix output by the student model;

when the training teacher model is represented, the jth sample to be trained with the same category in the ith training sample subset;

when the training teacher model is represented, the kth sample to be trained with the same category in the pth training sample subset;

when the training teacher model is represented, the jth sample to be trained with different categories in the ith training sample subset is represented;

when the training teacher model is represented, the kth sample to be trained with different classes in the pth training sample subset;

when the training student model is expressed, the jth sample to be trained with the same category in the ith training sample subset is represented;

when the student model is trained, the kth sample to be trained with the same category in the pth training sample subset is represented;

when the training student model is represented, the ith training sample subset is the jth sample to be trained with different categories;

when the student model is trained, the kth sample to be trained with different classes in the pth training sample subset is represented;

step 6, using the triple loss function L shown in the formula (1)_triRespectively calculate

And

triple loss of

And

the specific calculation formula is shown in formulas (4) and (5);

wherein,

representing the total number of sample pairs consisting of two samples with Euclidean distance not being 0 in a training sample subset in the teacher model;

wherein,

representing the total number of sample pairs consisting of two samples with Euclidean distance not being 0 in a training sample subset in the student model;

calculating the triplet loss using the Smooth L1 loss function in equation (6)

And

total distillation loss L of_{f_dis}；

Step 7. partial distillation loss L_{c_dis}Total distillation loss L_{f_dis}Triple loss

And

integrating and calculating to obtain the total loss L_totalThe specific calculation formula is as follows:

wherein, alpha is a distillation loss weight value;

step 8, setting a student model M_sThe number of iterations is determined by selecting Adam optimizer and reducing the loss value L_totalTransferring the knowledge of the teacher model to the student model;

step 9, test set X_testThe pedestrian video sequence is input into the student network M_sAnd (4) carrying out identification to obtain an identification result.

The invention has the following advantages:

as described above, the present invention describes a gait recognition model compression system based on local-global joint knowledge distillation algorithm, which designs a compact and lightweight gait model network (i.e. student model) based on deep separation convolution, wherein the convolution network in the model network only retains the backbone convolution network, and simplifies each convolution module in the backbone convolution network, specifically, 3 × 3 deep separation convolution layers and 1 × 1 point convolution layers are adopted to replace the standard convolution layers in the existing scheme; in addition, the model network adopts a 16-layer lightweight fully-connected network to replace a 31-layer fully-connected network in the existing scheme, so that the model parameters are greatly compressed, the calculation of the recognition model is simplified, and the recognition efficiency is improved. In addition, the invention also provides a local-overall joint knowledge distillation algorithm suitable for gait recognition tasks on the basis of the gait recognition model compression system, compared with the prior art, the local-overall joint knowledge distillation method designed by the invention simultaneously utilizes the local feature vector output by the convolution network and the global feature vector output by the full-connection network to carry out joint knowledge distillation, not only retains the local feature of the gait of the pedestrian by convolution operation, but also extracts the global feature of the gait of the pedestrian by the full-connection operation, increases the information content of knowledge distillation, improves the effect of the gait recognition of the pedestrian, and ensures the gait recognition accuracy of the model.

Drawings

FIG. 1 is a block diagram of a partial-global joint knowledge distillation algorithm based gait recognition model compression system according to the present invention;

FIG. 2 is a schematic diagram of a structure of aggregate pooling in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a horizontal pyramid pooling configuration in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a simplified horizontal pyramid pooling configuration in an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a multi-layer global channel according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a deep separation convolution module according to an embodiment of the present invention;

FIG. 7 is a schematic flow chart of the gait recognition model compression method based on the local-global joint knowledge distillation algorithm.

Detailed Description

The invention is described in further detail below with reference to the following figures and detailed description:

example 1

The embodiment describes a gait recognition model compression system of a local-global joint knowledge distillation algorithm.

As shown in fig. 1, the gait recognition model compression system constructs two recognition models based on a deep neural network, which are respectively designated as a teacher model Mt with large capacity and a student model M with light weight_s。

Where the symbol [ ] in fig. 1 represents addition by element.

The teacher model Mt is composed of a convolution network, an aggregation pooling structure, a horizontal pyramid pooling structure and a full-connection network.

The convolutional network consists of a backbone network and a plurality of layers of global channels.

The backbone network is composed of a first convolution module, a second convolution module and a third convolution module.

The first convolution module is comprised of three layers, wherein:

the first layer is a standard convolution layer, a convolution kernel of 5 multiplied by 5 is used, the input of the first layer is data of s (frame number) multiplied by 1 multiplied by 64 multiplied by 44, and the output is a characteristic diagram of s (frame number) multiplied by 32 multiplied by 64 multiplied by 44;

the second layer is a standard convolution layer, a convolution kernel of 3 multiplied by 3 is used, the second layer inputs data of s (frame number) multiplied by 32 multiplied by 64 multiplied by 44, and outputs a characteristic diagram of s (frame number) multiplied by 32 multiplied by 64 multiplied by 44;

the third layer is a pooling layer, and a maximum pooling layer with a pooling kernel of 2 × 2 and a step size of 2 is used, and the third layer has an input of s (number of frames) × 32 × 64 × 44 and an output of s (number of frames) × 32 × 22.

The second convolution module is comprised of three layers, wherein:

the first layer is a standard convolution layer, a convolution kernel of 3 multiplied by 3 is used, the input of the first layer is data of s (frame number) multiplied by 32 multiplied by 22, and a characteristic diagram of s (frame number) multiplied by 64 multiplied by 32 multiplied by 22 is output;

the second layer is a standard convolution layer, a convolution kernel of 3 multiplied by 3 is used, the input of the second layer is data of s (frame number) multiplied by 64 multiplied by 32 multiplied by 22, and a characteristic diagram of s (frame number) multiplied by 64 multiplied by 32 multiplied by 22 is output;

the third layer is a pooling layer, and a maximum pooling layer having a pooling kernel of 2 × 2 and a step size of 2 is used, and the third layer has an input of (number of frames) × 32 × 32 × 22 and an output of s (number of frames) × 32 × 16 × 11.

The third convolution module consists of two layers, wherein:

the first layer is a standard convolution layer, a convolution kernel of 3 x 3 is used, the input of the first layer is data of s (frame number) x 64 x 32 x 22, and the output is a characteristic diagram of s (frame number) x 128 x 16 x 11;

the second layer is a standard convolutional layer, and uses a 3 × 3 convolutional kernel, and the input of the second layer is data of s (number of frames) × 128 × 16 × 11, and a feature map of s (number of frames) × 128 × 16 × 11 is output.

As shown in fig. 5, the multi-layered global channel is composed of a fourth convolution module and a fifth convolution module.

The fourth convolution module consists of three layers, wherein:

the first layer is a standard convolution layer, a convolution kernel of 3 multiplied by 3 is used, the input of the first layer is data of 32 multiplied by 22, and a feature map of 64 multiplied by 32 is output;

the second layer is a standard convolution layer, a 3 × 3 convolution kernel is used, the input of the second layer is data of 64 × 32 × 22, and a feature map of 64 × 32 × 22 is output;

the third layer is a pooling layer, and a 2 × 2 pooling kernel is used to input 64 × 32 × 22 data and output a 64 × 16 × 11 feature map.

The fifth convolution module consists of two layers, wherein:

the first layer is a standard convolution layer, a 3 x 3 convolution kernel is used, the input of the first layer is data of 64 x 16 x 11, and the output is a feature map of 128 x 16 x 11;

the second layer is a standard convolutional layer, a 3 × 3 convolutional kernel is used, the input of the second layer is 128 × 16 × 11 data, and the output is a 128 × 16 × 11 feature map.

There are four pooling structures in the teacher model Mt, two horizontal pyramid pooling structures, and one full-connection network.

The four set pooling are defined as a first, second, third, and fourth set pooling structure, respectively.

One of the aggregate pooling structures is illustrated in fig. 2.

The input of one set of pooling is a feature matrix corresponding to s video frames, each dimension is 128 × 16 × 11, the output is a processed single feature matrix, the dimension is 128 × 16 × 11, and the value of each element in the feature matrix is the maximum value element extracted from the corresponding position of the feature matrices by taking the maximum value operation.

The characteristic of the set pooling is that the sequence of the feature matrixes corresponding to a plurality of input video frames can be randomly arranged.

The two horizontal pyramid pooling structures are respectively defined as a first horizontal pyramid pooling structure and a second horizontal pyramid pooling structure, and one of the horizontal pyramid pooling structures is taken as an example, as shown in fig. 3.

One horizontal pyramid pooling input is a feature matrix of 128 × 16 × 11, the matrix is decomposed according to 5 scales to obtain intermediate feature matrices, which are respectively 1 feature matrix of 128 × 16 × 11 in dimension, 2 feature matrices of 128 × 8 × 11 in dimension, 4 feature matrices of 128 × 4 × 11 in dimension, 8 feature matrices of 128 × 2 × 11 in dimension, and 16 feature matrices of 128 × 1 × 11 in dimension, and the total number of 31 feature matrices is obtained.

The invention adopts global maximum pooling operation and global average pooling operation to compress the second dimension and the third dimension of each intermediate feature matrix into 128-dimension vectors. The above process is illustrated by way of example:

when the global maximum pooling operation is carried out on the 128 x 16 x 11 feature matrix, decomposing the matrix into 128 x 11 sub-matrixes, calculating the maximum value of each 16 x 11 sub-matrix, obtaining 128 maximum value calculation results in total, and combining the 128 maximum value calculation results into a 128-dimensional output feature vector; similarly, when performing global average pooling on a 128 × 16 × 11 feature matrix, the matrix is decomposed into 128 × 11 sub-matrices, an average value of each 16 × 11 sub-matrix is calculated, 128 average value calculation results are obtained in total, and the 128 average value calculation results are combined into a 128-dimensional output feature vector.

The final output of one horizontal pyramid pooling structure is 31 128-dimensional vectors.

The fully connected network comprises a first fully connected sub-network and a second fully connected sub-network; the first fully-connected sub-network and the second fully-connected sub-network respectively comprise 31 independent fully-connected neural network layers.

The fully connected network contains a total of 62 fully connected layers, each layer inputting 128 data and outputting 256 features.

The outputs of the first fully-connected sub-network and the second fully-connected sub-network are taken as the outputs of the teacher model Mt.

The output of the first convolution module is connected to the input of the second convolution module. The output of the first convolution module is also connected to the input of the fourth convolution module through a first aggregate pooling structure.

The output of the second convolution module is connected to the input of the third convolution module. And the output of the second convolution module is connected with the input of the second set pooling structure, and the output of the second set pooling structure is added with the output of the fourth convolution module at the corresponding position and then connected with the input of the fifth convolution module.

And the output of the third convolution module is connected with the input of the third set pooling structure, and the output of the third set pooling structure is added with the output of the fifth convolution module at the corresponding position and then connected with the input of the first horizontal pyramid pooling structure.

The output of the third convolution module is also connected to the input of the second horizontal pyramid pooling structure via a fourth set pooling structure.

The output of the first horizontal pyramid pooling structure is connected to the input of the first fully-connected sub-network; the output of the second horizontal pyramid pooling structure is connected to the input of a second fully-connected sub-network.

Student model M_sThe system consists of a simplified convolutional network, a set pooling structure, a simplified horizontal pyramid pooling structure and a simplified full-connection network.

The reduced convolutional network only comprises one backbone convolutional network, and specifically, as shown in fig. 1, the reduced convolutional network comprises a sixth convolutional module, a seventh convolutional module, and an eighth convolutional module.

Compared with a first convolution module, a second convolution module and a third convolution module in a backbone network in a teacher model, the sixth convolution module, the seventh convolution module and the eighth convolution module are simplified respectively.

Wherein the sixth convolution module deletes a standard convolution layer having a convolution kernel of 3 × 3 compared to the first convolution module.

The seventh convolution module replaces the two layers of standard convolution layers in the second convolution module with a depth-separated convolution layer with a convolution kernel of 3 x 3 and a point convolution with a convolution kernel of 1 x 1, respectively.

Similarly, the eighth convolution module replaces the two standard convolution layers in the third convolution module with a depth-separated convolution layer with convolution kernel 3 × 3 and a point convolution with convolution kernel 1 × 1, respectively.

Specifically, the sixth convolution module is composed of two layers, wherein:

the second layer is a pooling layer, and a maximum pooling layer having a pooling kernel of 2 × 2 and a step size of 2 is used, and the second layer has a characteristic map of 32 × 32 × 22 as input and output of data of 32 × 64 × 44.

The seventh convolution module consists of three layers, wherein:

the first layer is a deep separation convolution layer, a convolution kernel of 5 multiplied by 5 is used, the input of the first layer is data of s (frame number) multiplied by 32 multiplied by 22, and a characteristic diagram of s (frame number) multiplied by 32 multiplied by 22 is output;

the second layer is a point convolution layer, a convolution kernel of 1 multiplied by 1 is used, the input of the second layer is data of s (frame number) multiplied by 64 multiplied by 32 multiplied by 22, and a characteristic diagram of s (frame number) multiplied by 64 multiplied by 32 multiplied by 22 is output;

the third layer is a pooling layer, a maximum pooling layer with a pooling core of 2 × 2 and a step length of 2 is adopted, the third layer inputs data of 64 × 32 × 22 and outputs a feature map of 64 × 16 × 11;

the eighth convolution module is comprised of two layers, wherein:

the first layer is a depth separation convolution layer, a convolution kernel of 3 x 3 is used, the input of the first layer is data of s (frame number) × 64 x 16 x 11, and a characteristic diagram of s (frame number) × 64 x 16 x 11 is output;

the second layer is a dot convolution layer, and uses a 1 × 1 convolution kernel, and the input of the second layer is data of s (frame number) × 64 × 16 × 11, and a feature map of s (frame number) × 128 × 16 × 11 is output.

The structure of the deep separation convolutional layer is shown in fig. 6, and the structure is a known structure and will not be described in detail.

The sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence.

In a preferred embodiment, a dot convolution layer is further added in front of the first depth-separation convolution layer of the seventh convolution module, and similarly, a dot convolution layer is further added in front of the first depth-separation convolution layer of the eighth convolution module.

The design realizes the improvement of the network performance of the lightweight student model on the premise of hardly changing the model capacity.

Defining a student model M_sThe middle collection pooling structure is a fifth collection pooling structure; the output of the eighth convolution module is connected to the input of the reduced horizontal pyramid pooling structure via a fifth set pooling structure.

The fifth set pooling structure is also composed of a statistical (maximum) function, which has as input s (number of frames) x 128 x 16 x 11 feature matrix and outputs a 128 x 16 x 11 feature matrix.

The reduced horizontal pyramid pooling structure is composed of a global maximum pooling and a global average pooling, and the structure is shown in fig. 4.

The input of the simplified horizontal golden sub-tower pooling is a 128 multiplied by 16 multiplied by 11 feature matrix, the middle feature matrix is 16 three-dimensional matrices of 128 multiplied by 1 multiplied by 11, and 16 feature vectors of 128 dimensions are output through global maximum pooling and global average pooling.

The output of the simplified horizontal pyramid pooling structure is connected with the input of a simplified fully-connected network, the simplified fully-connected network comprises 16 independent fully-connected neural network layers, the input of each layer is a 128-dimensional vector, and the output is a 128-dimensional vector.

Compared with the conventional gait recognition model based on the standard convolution layer structure, the invention designs a compact lightweight gait recognition model (called as a student model) by adopting low-cost deep separation convolution, thereby reducing the parameter number of the model structurally.

Example 2

The embodiment describes a gait recognition model compression method based on a local-global joint knowledge distillation algorithm, which is based on the gait recognition model compression system based on the local-global joint knowledge distillation algorithm in the embodiment 1.

In this embodiment 2, the purpose of gait recognition is achieved by training the two models in the above embodiment 1.

As shown in fig. 7, the gait recognition model compression method based on the local-global joint knowledge distillation algorithm includes the following steps:

step 1, extracting a gait contour sequence in the gait video by using a background subtraction method (the method is a conventional method), and uniformly cutting the gait contour sequence into a picture set with the same size, such as an image set of 64 x 64 pixels.

The image sets form a data set X, and the data set X is divided into a training set X_trainAnd test set X_test。

Taking a deep convolutional neural network as a basic structure, constructing two gait recognition models which are respectively recorded as a large-capacity teacher model M_tAnd a lightweight student model M_sThe model structure is described in embodiment 1 above, and is not described here.

Step 2. Using training set X_trainTeacher training model M_t(ii) a The learning rate was set to 0.0001, the number of iterations was 80000, and the optimizer used an Adam optimizer that entered 16 sequences of 8 objects per iteration (128 sequences total, each sequence randomly selected 30 frames of images, scaled to a size of 64 × 44 pixels).

The loss function is as given in equation (1)Shown triplet loss function L_tri；

In the formula, N_tri+Representing the total number of sample pairs consisting of two samples with Euclidean distance not being 0 in a training sample subset; the training sample subset is a training set X from each training_trainA set consisting of a plurality of randomly selected sample images;

m represents a boundary threshold of the loss function;

representing the jth sample to be trained in the ith training sample subset;

represents the ith subset of training samples

Any sample with the same identity of the represented pedestrian;

represents the p-th subset of training samples and

any sample with different represented pedestrian identities;

symbol | | | purple₂A 2-norm representing a matrix;

[]₊representing the ReLU operation, the calculation method is as follows: [ x ] of]₊Max {0, x }, max is the max-valued operation.

By reducing the loss value, the points of the same object are made to approach each other, while the points of different objects are made to depart from each other, the boundary threshold M of the loss function is taken to be 0.2, and the training target is to make the teacher model M_tThe recognition performance of (2) is the best.

In teacher model M_tMulti-dimensional feature matrix of full-connection network output

And in the student model M_sSimplified full-connection network output multidimensional characteristic matrix

Multidimensional feature matrix

And a multi-dimensional feature matrix

The dimension of (a) is b × s × c × h × w; multidimensional feature matrix

And a multi-dimensional feature matrix

Dimension of (b) is b × n × d; wherein b represents the number of samples in each training sample subset; s represents the number of frames; c represents the number of convolution layer output characteristic matrixes, h represents the height of the convolution network output characteristic matrixes, w represents the width of the convolution network output characteristic matrixes, and d represents the dimension of the characteristic matrixes output by the fully-connected network.

And a multi-dimensional feature matrix

in the formula, the difference metric function L_{c_dis}Representing a loss of partial distillation, the symbol | | | | non-conducting phosphor² _FRepresenting the F-norm of the matrix.

When calculating the similarity matrix differences of the output features, the difference of the feature matrix is further normalized by using an L2 regularization method, so that the learning of the student model from the teacher model can be guided to be more effective.

Under this approach, the teacher model M_tAnd student model M_sThe number of channels of the convolution output features may not remain consistent, that is, a larger or smaller capacity model may be designed for knowledge distillation.

And a multi-dimensional feature matrix

And

represents;

wherein,

and the distance between all the samples in different categories in the feature matrix output by the student model is represented. The same-class samples refer to picture samples with the same pedestrian identity labels, and the different-class samples refer to picture samples with different pedestrian identity labels.

When representing the teacher model, the jth training sample with the same category in the ith training sample subsetA sample to be trained;

and when the training student model is shown, the kth sample to be trained with different classes in the pth training sample subset.

And

triple loss of

And

the specific calculation formula is shown in formulas (4) and (5);

where m represents the boundary threshold of the loss function, set to 0.2;

wherein,

calculating the triplet loss using the Smooth L1 loss function in equation (6)

And

total distillation loss L of_{f_dis}；

And

wherein alpha is a distillation loss weight value.

Step 8, setting a student model M_sThe iteration number is 30000 times, the optimizer selects Adam optimizer, the previous 10000 times learning rate is set to 0.005, the later 20000 times learning rate is set to 0.001, the Adam optimizer selects Adam optimizer, and loss value L is reduced_totalTransferring the knowledge of the teacher model to the student model M_sIn addition, the recognition performance of the student model is improved.

As can be seen from the process of the gait recognition model compression method, the invention utilizes the joint knowledge distillation algorithm to carry out local and overall knowledge distillation on the large-capacity gait recognition model (called as a teacher model), can guide the student model to learn more knowledge from the teacher model, and can reduce the model capacity and maintain the original recognition effect as much as possible.

The method of the invention adopts the lightweight model compression and the joint knowledge distillation technology, so that the gait recognition accuracy of the model can be effectively ensured while the scale of the model parameters is reduced, thereby reducing the operation cost, reducing the training times and the reasoning time, improving the efficiency of the model, and being more suitable for the actual scene with high real-time requirement and large data volume.

Experiments prove that compared with a deep neural network model in the prior art, the gait recognition model compression method has the advantages that the number of model parameters is reduced by 9 times, the calculated amount is reduced by 19 times, the performance of the model in the public data set CASIA-B is reduced by only 2.2%, the actual reasoning time is shortened, and the problem of model efficiency is effectively solved.

It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A gait recognition model compression system based on a local-integral joint knowledge distillation algorithm is characterized in that,

comprises a teacher model Mt and a student model M_s(ii) a Wherein:

the first convolution module is comprised of three layers, wherein:

the second convolution module is comprised of three layers, wherein:

the third convolution module consists of two layers, wherein:

the fifth convolution module consists of two layers, wherein:

the seventh convolution module consists of three layers, wherein:

the eighth convolution module is comprised of two layers, wherein:

defining a student model M_sThe middle set pooling structure is a fifth set pooling nodeStructuring;

2. A gait recognition model compression method based on a local-global joint knowledge distillation algorithm, a gait recognition model compression system based on the local-global joint knowledge distillation algorithm of claim 1; it is characterized in that the preparation method is characterized in that,

the gait recognition model compression method comprises the following steps:

m represents a boundary threshold of the loss function;

representing the jth sample to be trained in the ith training sample subset;

represents the ith subset of training samples

Any sample with the same identity of the represented pedestrian;

represents the p-th subset of training samples and

any sample with different represented pedestrian identities;

symbol | | | purple₂A 2-norm representing a matrix;

Multidimensional feature matrix

And a multi-dimensional feature matrix

The dimension of (a) is b × s × c × h × w; multidimensional feature matrix

And a multi-dimensional feature matrix

Dimension of (b) is b × n × d; wherein b represents the number of samples in each training sample subset; s represents the number of frames; c represents the number of the convolution layer output characteristic matrixes, h represents the height of the convolution network output characteristic matrixes, w represents the width of the convolution network output characteristic matrixes, and d represents the dimensionality of the characteristic matrixes output by the fully-connected network;