CN113505719B

CN113505719B - Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm

Info

Publication number: CN113505719B
Application number: CN202110824459.3A
Authority: CN
Inventors: 单彩峰; 宋旭; 陈宇; 黄岩
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2021-07-21
Filing date: 2021-07-21
Publication date: 2023-11-24
Anticipated expiration: 2041-07-21
Also published as: CN113505719A

Abstract

The invention discloses a gait recognition model compression system and method based on a local-integral combined knowledge distillation algorithm. The system designs a compact lightweight gait model network by adopting deep separation convolution, wherein the convolution network in the model network only keeps a backbone convolution network and simplifies all convolution modules in the backbone convolution network; the model network adopts a 16-layer lightweight full-connection network, so that model parameters are greatly compressed, calculation of an identification model is simplified, and identification efficiency is improved; the method of the invention utilizes the local feature vector output by the convolution network and the global feature vector output by the full-connection network in the teacher model and the student model to carry out combined knowledge distillation, thereby not only retaining the local feature of the gait of the pedestrian by convolution operation, but also extracting the global feature of the gait of the pedestrian by full-connection operation, increasing the information quantity of knowledge distillation and improving the recognition effect of the gait of the pedestrian.

Description

Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm

Technical Field

The invention belongs to the technical field of image recognition, and particularly relates to a gait recognition model compression system and method based on a local-integral combined knowledge distillation algorithm.

Background

Gait recognition is an emerging biological feature recognition technology, and aims to find and extract feature changes among different pedestrians from a series of walking gestures so as to realize automatic recognition of the pedestrians. Gait recognition has many advantages over other biometric technologies, such as: the method has the advantages of no need of active matching of the identified objects, low requirement on image pixels, no limitation of viewing angles, easy identification of camouflage, long identification distance and the like. Based on the above, gait recognition technology is widely applied in the fields of video monitoring, intelligent security and the like.

At present, gait recognition technology is mostly designed based on a standard convolutional neural network, and a recognition model is trained by collecting gait video samples containing pedestrian labels, so that the model learns useful gait appearance and action characteristics from the samples, and can be recognized according to the characteristics. Existing gait recognition techniques can be divided into model-based methods and appearance-based methods depending on the presence or absence of a human model build.

The method based on the model needs to extract gait characteristics by building a human skeleton or a gesture structure, and has large calculation amount and redundant network structure. The appearance-based method is the most commonly used method at present, can directly extract gait features from the original video captured by a camera, and can be subdivided into feature template-based, gait video-based and set-based methods.

The method based on the characteristic template (such as a gait energy diagram) is simple to realize, but the time sequence characteristics are easy to lose by extracting gait characteristics in the characteristic template for identification; the gait video sequence-based method effectively improves gait space-time characteristics through a three-dimensional standard convolutional neural network for recognition, but the model is large in scale and difficult to train; based on the method of aggregation, the single-frame gait outline features and the aggregate-pooling structure aggregate gait space-time features in the gait aggregate are extracted through the two-dimensional standard convolutional neural network, so that the efficient identification performance is realized, and the model scale is still large.

In conclusion, the current gait recognition method is mainly realized by adopting a large-capacity neural network model, has the advantages of multiple model parameters, long training time and difficult application and popularization, and is not suitable for practical application occasions with high real-time requirements.

While the conventional model compression method can reduce the capacity of the model and reduce the parameters of the model to a certain extent, the conventional model compression method is simpler, key information in the model cannot be reserved, and the recognition performance of the model is seriously reduced, so that the conventional model compression method is not suitable for solving the compression problem of the gait recognition model.

Disclosure of Invention

The invention aims to provide a gait recognition model compression method based on a local-whole combined knowledge distillation algorithm, which can effectively ensure the gait recognition accuracy of a model while reducing the scale of model parameters.

In order to achieve the above purpose, the invention adopts the following technical scheme:

gait recognition model compression system based on local-global joint knowledge distillation algorithm includes:

comprises a teacher model Mt and a student model M _s The method comprises the steps of carrying out a first treatment on the surface of the Wherein:

the teacher model Mt consists of a convolution network, a collection pooling structure, a horizontal pyramid pooling structure and a full-connection network;

the convolution network consists of a backbone network and a plurality of layers of global channels;

the backbone network consists of a first convolution module, a second convolution module and a third convolution module;

the first convolution module is composed of three layers, wherein:

The first layer is a standard convolution layer, and a convolution kernel of 5×5 is used; the second layer is a standard convolution layer, and a convolution kernel of 3×3 is used; the third layer is a pooling layer, and adopts a maximum pooling layer with pooling core of 2 multiplied by 2 and step length of 2;

the second convolution module consists of three layers, wherein:

the first layer is a standard convolution layer, and a convolution kernel of 3×3 is used; the second layer is a standard convolution layer, and a convolution kernel of 3×3 is used; the third layer is a pooling layer, and the largest pooling layer with pooling core of 2 multiplied by 2 and step length of 2 is used;

the third convolution module consists of two layers, wherein:

the first layer is a standard convolution layer, and a convolution kernel of 3×3 is used; the second layer is a standard convolution layer, and a convolution kernel of 3×3 is used;

the multi-layer global channel consists of a fourth convolution module and a fifth convolution module;

the fourth convolution module consists of three layers, wherein: the first layer is a standard convolution layer, and a convolution kernel of 3×3 is used; the second layer is a standard convolution layer, and a convolution kernel of 3×3 is used; the third layer is a pooling layer, and a pooling core of 2 multiplied by 2 is adopted;

the fifth convolution module consists of two layers, wherein:

four aggregation pooling structures are arranged in the teacher model Mt, two horizontal pyramid pooling structures are arranged, and one full-connection network is arranged;

Defining four collection pooling structures as a first collection pooling structure, a second collection pooling structure and a third collection pooling structure and a fourth collection pooling structure respectively; defining two horizontal pyramid pooling structures as a first horizontal pyramid pooling structure and a second horizontal pyramid pooling structure respectively;

the fully-connected network comprises a first fully-connected sub-network and a second fully-connected sub-network;

the output of the first convolution module is connected with the input of the second convolution module;

the output of the first convolution module is also connected with the input of the fourth convolution module through a first collection pooling structure;

the output of the second convolution module is connected with the input of the third convolution module;

the output of the second convolution module is connected with the input of the second aggregation pooling structure, and the output of the second aggregation pooling structure is connected with the input of the fifth convolution module after corresponding position addition is carried out on the output of the fourth convolution module;

the output of the third convolution module is connected with the input of the third aggregation pooling structure, and the output of the third aggregation pooling structure is connected with the input of the first horizontal pyramid pooling structure after corresponding position addition is carried out on the output of the fifth convolution module;

the output of the third convolution module is also connected with the input of the second horizontal pyramid pooling structure through the fourth aggregate pooling structure;

The output of the first horizontal pyramid pooling structure is connected with the input of the first fully-connected subnetwork;

the output of the second horizontal pyramid pooling structure is connected with the input of the second fully-connected subnetwork;

the outputs of the first fully-connected subnetwork and the second fully-connected subnetwork are used as a teacher model M _t An output of (2);

the first and second horizontal pyramid pooling structures have five scales;

the first full-connection sub-network and the second full-connection sub-network respectively comprise 31 independent full-connection neural network layers;

student model M _s The method comprises the steps of simplifying a convolution network, assembling a pooling structure, simplifying a horizontal pyramid pooling structure and simplifying a full-connection network, wherein:

the simplified convolution network consists of a sixth convolution module, a seventh convolution module and an eighth convolution module;

the sixth convolution module consists of two layers, wherein: the first layer is a standard convolution layer, and a convolution kernel of 5×5 is used; the second layer is a pooling layer, and the largest pooling layer with pooling core of 2 multiplied by 2 and step length of 2 is used;

the seventh convolution module consists of three layers, wherein:

the first layer is a depth separation convolution layer, and a convolution kernel of 5×5 is used; the second layer is a point convolution layer, and a convolution kernel of 1 multiplied by 1 is used; the third layer is a pooling layer, and adopts a maximum pooling layer with pooling core of 2 multiplied by 2 and step length of 2;

The eighth convolution module consists of two layers, wherein:

the first layer uses a 3 x 3 convolution kernel for the depth-separated convolution layer; the second layer is a point convolution layer, and a convolution kernel of 1 multiplied by 1 is used;

the sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence;

definition of student model M _s The medium aggregate pooling structure is a fifth aggregate pooling structure;

the output of the eighth convolution module is connected with the input of the simplified horizontal pyramid pooling structure through the fifth aggregate pooling structure;

the output of the simplified horizontal pyramid pooling structure is connected with the input of the simplified fully-connected network;

the simplified horizontal pyramid pooling structure has only one scale; the simplified fully-connected network comprises 16 independent fully-connected neural network layers; output of simplified fully-connected network as student model M _s Is provided.

In addition, the invention also provides a gait recognition model compression method based on the local-integral combined knowledge distillation algorithm, which is based on the gait recognition model compression system, and the specific technical scheme is as follows:

a gait recognition model compression method based on a local-whole combined knowledge distillation algorithm comprises the following steps:

step 1, extracting gait contour sequences in a gait video by using a background subtraction method, uniformly cutting the gait contour sequences into picture sets with the same size to form a data set X, and dividing the data set X into a training set X _train And test set X _test ；

Step 2, training set X _train Training teacher model M _t Setting a learning rate and iteration times, wherein an optimizer adopts an Adam optimizer, and a loss function adopts a triplet loss function L shown in a formula (1) _tri ；

Wherein N is _tri+ Representing the total number of sample pairs consisting of two samples of a training sample subset for which all Euclidean distances are not 0; the training sample subset is the secondary training set X when each training _train A set formed by a plurality of sample images selected randomly;

n represents the number of fully connected neural network layers of the teacher network, and t represents the serial number of the fully connected neural network layers of the teacher network;

p represents the number of pedestrians contained in each training sample subset;

i and p respectively represent the serial numbers of the pedestrian samples to be trained in each training sample subset;

k represents the number of video sequences for each pedestrian in each training sample subset;

a. j and k respectively represent the sequence numbers of the pedestrian video sequences in each training sample subset;

m represents the boundary threshold of the loss function;

representing a j-th sample to be trained in the i-th training sample subset;

representing the sum +.>Any sample with the same identity of the represented pedestrian;

representing the sum +. >Any sample with different identities of the represented pedestrians;

sign | I ₂ Representing the 2-norm of the matrix;

[] ₊ representing ReLU operation, the calculation method is as follows: [ x ]] ₊ =max {0, x }, max being the maximum value operation;

step 3, training set X _train Input to a trained teacher model M _t And untrained student model M _s Respectively obtain the same data set in the teacher model M _t Multi-dimensional feature matrix of convolution network output of (a)In student model M _s Multi-dimensional characteristic matrix of the output of the reduced convolution network ∈>In teacher model M _t Is a multi-dimensional feature matrix of the full-connection network output>In student model M _s Multi-dimensional characteristic matrix of the output of the simplified fully-connected network>

Multi-dimensional feature matrixAnd multidimensional feature matrix->Is b x s x c x h x w; multidimensional feature matrix->And multidimensional feature matrix->Is b x n x d; wherein b represents the number of samples in each training sample subset; s represents the number of frames; c represents the number of the output feature matrixes of the convolution layer, h represents the height of the output feature matrixes of the convolution network, w represents the width of the output feature matrixes of the convolution network, and d represents the dimension of the feature matrixes output by the full-connection network;

step 4, making the difference measurement function L _{c_dis} Calculating a multi-dimensional feature matrixAnd multidimensional feature matrix->Differences between; wherein,difference metric function L _{c_dis} The calculation formula of (2) is as follows:

in the difference measurement function L _{c_dis} Representing the partial distillation loss, the sign is ² _F Representing the F-norm of the matrix;

step 5, respectively calculating the multi-dimensional feature matrix by using the formula (3)And multidimensional feature matrix->The difference between each sample of (a) is used as a result of +.>And->A representation;

wherein,representing the distance between all the samples of the same class in the feature matrix output by the teacher model; />Representing the distance between all different types of samples in the feature matrix output by the teacher model; />Representing the distance between all the samples of the same class in the feature matrix output by the student model; />Representing the distance between all different types of samples in the feature matrix output by the student model;

when the teacher model is trained, the jth sample to be trained with the same category in the ith training sample subset is represented;

when the teacher model is trained, the kth sample to be trained with the same category in the kth training sample subset is represented; />Representing the jth sample to be trained with different categories in the ith training sample subset when training the teacher model;

representing a kth sample to be trained with different categories in a p-th training sample subset when training a teacher model; / >Representing the jth sample to be trained with the same category in the ith training sample subset when training the student model;

representing the kth sample to be trained with the same category in the kth training sample subset when training the student model; />Representing the jth sample to be trained with different categories in the ith training sample subset when training the student model;

representing the kth sample to be trained with different categories in the p-th training sample subset when training the student model;

step 6, using the triplet loss function L shown in the formula (1) _tri Respectively calculateAnd->Triple loss->And->The specific calculation formulas are shown in formulas (4) and (5);

wherein,representing the total number of sample pairs consisting of two samples with all Euclidean distances other than 0 in one training sample subset in the teacher model;

wherein,representing the total number of sample pairs consisting of two samples with all Euclidean distances other than 0 in one training sample subset in the student model;

calculating the triplet loss by using the Smooth L1 loss function in the formula (6)And->Total distillation loss L of (2) _{f_dis} ；

Step 7, partial distillation loss L _{c_dis} Total distillation loss L _{f_dis} Loss of tripletAnd->Integrating and calculating to obtain total loss L _total The specific calculation formula is as follows:

Wherein alpha is a distillation loss weight;

step 8, setting a student model M _s The iteration times are counted, the optimizer is an Adam optimizer, and the loss value L is reduced _total Transferring the knowledge of the teacher model to the student model;

step 9. Test set X _test The pedestrian video sequence is input to the student network M _s And (3) performing recognition to obtain a recognition result.

The invention has the following advantages:

as described above, the present invention describes a gait recognition model compression system based on a local-global joint knowledge distillation algorithm, which designs a compact lightweight gait model network (i.e., student model) based on deep separation convolution, wherein the convolution network in the model network only retains a backbone convolution network, and simplifies each convolution module in the backbone convolution network, specifically, adopts a 3×3 deep separation convolution layer and a 1×1 point convolution layer, instead of the standard convolution layer in the existing scheme; in addition, the model network adopts a 16-layer lightweight full-connection network to replace a 31-layer full-connection network in the existing scheme, so that model parameters are greatly compressed, calculation of an identification model is simplified, and identification efficiency is improved. In addition, the invention also provides a local-whole combined knowledge distillation algorithm suitable for gait recognition tasks on the basis of the gait recognition model compression system, and compared with the existing method, the local-whole combined knowledge distillation method designed by the invention simultaneously utilizes the local feature vector output by a convolution network and the global feature vector output by a full-connection network to carry out combined knowledge distillation, thereby not only retaining the local feature of the gait of the pedestrian by convolution operation, but also extracting the global feature of the gait of the pedestrian by full-connection operation, increasing the information quantity of knowledge distillation, improving the effect of the gait recognition of the pedestrian, namely ensuring the gait recognition accuracy of the model.

Drawings

FIG. 1 is a block diagram of a gait recognition model compression system based on a local-global joint knowledge distillation algorithm of the present invention;

FIG. 2 is a schematic diagram of a structure of pooling in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a horizontal pyramid pooling structure in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a simplified horizontal pyramid pooling structure in accordance with an embodiment of the present invention;

FIG. 5 is a schematic diagram of a multi-layer global channel according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a depth separation convolution module according to an embodiment of the present disclosure;

fig. 7 is a flow chart of a gait recognition model compression method based on a local-global joint knowledge distillation algorithm.

Detailed Description

The invention is described in further detail below with reference to the attached drawings and detailed description:

example 1

The embodiment describes a gait recognition model compression system of a local-global joint knowledge distillation algorithm.

As shown in fig. 1, the gait recognition model compression system constructs two recognition models based on a deep neural network, namely a high-capacity teacher model Mt and a light student model M _s 。

Wherein the symbol "# indicates addition by element" in fig. 1.

The teacher model Mt consists of a convolution network, a collection pooling structure, a horizontal pyramid pooling structure and a full-connection network.

The convolutional network consists of a backbone network and multiple layers of global channels.

The backbone network is composed of a first convolution module, a second convolution module and a third convolution module.

The first convolution module is composed of three layers, wherein:

the first layer is a standard convolution layer, and uses a convolution kernel of 5×5, and has s (frame number) ×1×64×44 data as input and s (frame number) ×32×64×44 feature map as output;

the second layer is a standard convolution layer, and uses a convolution kernel of 3×3, and inputs data of s (frame number) ×32×64×44 and outputs a feature map of s (frame number) ×32×64×44;

the third layer is a pooling layer, and adopts a maximum pooling layer with a pooling core of 2×2 and a step length of 2, and the input of the third layer is s (frame number) ×32×64×44, and the output is s (frame number) ×32×32×22.

The second convolution module consists of three layers, wherein:

the first layer is a standard convolution layer, and uses a convolution kernel of 3×3, and has s (frame number) ×32×32×22 data as input and outputs a feature map of s (frame number) ×64×32×22;

the second layer is a standard convolution layer, and uses a convolution kernel of 3×3, and the input of the second layer is s (frame number) ×64×32×22 data, and a feature map of s (frame number) ×64×32×22 is output;

The third layer was a pooling layer, and the largest pooling layer with a pooling kernel of 2×2 and a step size of 2 was used, and the input of the third layer was (frame number) ×32×32×22, and the output was s (frame number) ×32×16×11.

The third convolution module consists of two layers, wherein:

the first layer is a standard convolution layer, and uses a convolution kernel of 3×3, and has s (frame number) ×64×32×22 data as input and s (frame number) ×128×16×11 feature map as output;

the second layer is a standard convolution layer, and uses a convolution kernel of 3×3, and has s (frame number) ×128×16×11 data as input, and outputs a feature map of s (frame number) ×128×16×11.

As shown in fig. 5, the multi-layer global channel is composed of a fourth convolution module and a fifth convolution module.

The fourth convolution module consists of three layers, wherein:

the first layer is a standard convolution layer, and uses a convolution kernel of 3×3, and the input of the first layer is 32×32×22 data, and a 64×32×32 feature map is output;

the second layer is a standard convolution layer, and uses a convolution kernel of 3×3, and the input of the second layer is 64×32×22 data, and a characteristic diagram of 64×32×22 is output;

the third layer is a pooling layer, and uses a 2×2 pooling core, inputs 64×32×22 data, and outputs a 64×16×11 feature map.

The fifth convolution module consists of two layers, wherein:

the first layer is a standard convolution layer, and uses a convolution kernel of 3×3, and the input of the first layer is 64×16×11 data, and the output of the first layer is a feature map of 128×16×11;

the second layer is a standard convolution layer, and uses a convolution kernel of 3×3, and has 128×16×11 data as an input and 128×16×11 feature maps as an output.

Four aggregation pooling structures are arranged in the teacher model Mt, two horizontal pyramid pooling structures are arranged, and one full-connection network is arranged.

The four aggregate pooling are defined as a first, second, third, and fourth aggregate pooling structure, respectively.

Taking one of the aggregate pooling structures as an example, as shown in fig. 2.

The input of a set pool is a feature matrix corresponding to s video frames, each dimension is 128 multiplied by 16 multiplied by 11, the output is a processed single feature matrix, the dimension is 128 multiplied by 16 multiplied by 11, and the value of each element in the feature matrix is the maximum value element extracted from the corresponding positions of a plurality of feature matrices through maximum value operation.

The pooling is characterized in that the sequence of the feature matrixes corresponding to the plurality of input video frames can be arranged randomly.

The two horizontal pyramid pooling structures are respectively defined as a first horizontal pyramid pooling structure and a second horizontal pyramid pooling structure, and one of the two horizontal pyramid pooling structures is taken as an example, as shown in fig. 3.

The input of one horizontal pyramid pool is a 128×16×11 feature matrix, the matrix is decomposed according to 5 scales to obtain an intermediate feature matrix, and the intermediate feature matrix is respectively 1 feature matrix with 128×16×11 dimensions, 2 feature matrices with 128×8×11 dimensions, 4 feature matrices with 128×4×11 dimensions, 8 feature matrices with 128×2×11 dimensions, and 16 feature matrices with 128×1×11 dimensions, and the total is 31 feature matrices.

The invention adopts global maximum pooling operation and global average pooling operation to compress the second and third dimensions of each intermediate feature matrix, and changes the second and third dimensions into 128-dimension vectors. The above procedure is illustrated:

when global maximum pooling operation is carried out on the 128 multiplied by 16 multiplied by 11 feature matrix, the matrix is decomposed into 128 multiplied by 11 submatrices, the maximum value of each 16 multiplied by 11 submatrices is calculated, 128 maximum value calculation results are obtained in total, and the 128 maximum value calculation results are combined into an output feature vector with 128 dimensions; similarly, when global average pooling is performed on a 128×16×11 feature matrix, the matrix is decomposed into 128 16×11 sub-matrices, the average value of each 16×11 sub-matrix is calculated, 128 average value calculation results are obtained in total, and the 128 average value calculation results are combined into an output feature vector of 128 dimensions.

The final output of one horizontal pyramid pooling structure is 31 128-dimensional vectors.

The fully-connected network comprises a first fully-connected sub-network and a second fully-connected sub-network; the first full-connection sub-network and the second full-connection sub-network respectively comprise 31 independent full-connection neural network layers.

The fully connected network contains a total of 62 fully connected layers, each layer inputting 128 data and outputting 256 features.

The outputs of the first fully-connected subnetwork and the second fully-connected subnetwork serve as outputs of the teacher model Mt.

The output of the first convolution module is connected to the input of the second convolution module. The output of the first convolution module is also connected to the input of the fourth convolution module through a first set pooling structure.

The output of the second convolution module is connected to the input of the third convolution module. The output of the second convolution module is connected with the input of the second aggregation pooling structure, and the output of the second aggregation pooling structure is connected with the input of the fifth convolution module after corresponding position addition is carried out on the output of the fourth convolution module.

The output of the third convolution module is connected with the input of the third aggregation pooling structure, and the output of the third aggregation pooling structure is connected with the input of the first horizontal pyramid pooling structure after corresponding position addition is carried out on the output of the fifth convolution module.

The output of the third convolution module is also connected to the input of the second horizontal pyramid pooling structure through a fourth aggregate pooling structure.

The output of the first horizontal pyramid pooling structure is connected with the input of the first fully-connected subnetwork; the output of the second horizontal pyramid pooling structure is connected to the input of the second fully connected subnetwork.

Student model M _s The method consists of a simplified convolution network, a collection pooling structure, a simplified horizontal pyramid pooling structure and a simplified full-connection network.

The simplified convolutional network only retains one backbone convolutional network, and specifically, as shown in fig. 1, the simplified convolutional network is composed of a sixth convolutional module, a seventh convolutional module and an eighth convolutional module.

Compared with a first convolution module, a second convolution module and a third convolution module in a backbone network in a teacher model, the sixth convolution module, the seventh convolution module and the eighth convolution module are respectively simplified.

Wherein the sixth convolution module eliminates a standard convolution layer having a convolution kernel of 3 x 3 as compared to the first convolution module.

The seventh convolution module replaces the two standard convolution layers in the second convolution module with a 3 x 3 depth separation convolution layer and a 1 x 1 point convolution, respectively.

Similarly, the eighth convolution module replaces the two standard convolution layers in the third convolution module with a 3×3 depth separation convolution layer and a 1×1 point convolution, respectively.

Specifically, the sixth convolution module is composed of two layers, wherein:

the second layer is a pooling layer, and the largest pooling layer with a pooling kernel of 2×2 and a step size of 2 is used, and the second layer has 32×64×44 data as input and 32×32×22 feature map as output.

The seventh convolution module consists of three layers, wherein:

the first layer is a depth-separated convolution layer, and uses a convolution kernel of 5×5, and has s (frame number) ×32×32×22 data as input and outputs a feature map of s (frame number) ×32×32×22;

the second layer is a point convolution layer, and uses a convolution kernel of 1×1, and has s (frame number) ×64×32×22 data as input and outputs a feature map of s (frame number) ×64×32×22;

the third layer is a pooling layer, adopts a maximum pooling layer with a pooling core of 2 multiplied by 2 and a step length of 2, and has 64 multiplied by 32 multiplied by 22 data input and 64 multiplied by 16 multiplied by 11 feature images output;

The eighth convolution module consists of two layers, wherein:

the first layer is a depth-separated convolution layer, and uses a convolution kernel of 3×3, and has s (frame number) ×64×16×11 data as input and outputs a feature map of s (frame number) ×64×16×11;

the second layer is a point convolution layer, and uses a convolution kernel of 1×1, and has s (frame number) ×64×16×11 data as input, and outputs a feature map of s (frame number) ×128×16×11.

The structure of the depth-separated convolution layer is shown in fig. 6, and this structure is a known structure and will not be described in detail.

The sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence.

In a preferred embodiment, a dot convolution layer is added in front of the first layer of the depth separation convolution layer of the seventh convolution module, and similarly, a dot convolution layer is added in front of the first layer of the depth separation convolution layer of the eighth convolution module.

The design realizes the improvement of the network performance of the lightweight chemical green model on the premise of almost not changing the capacity of the model.

Definition of student model M _s The medium aggregate pooling structure is a fifth aggregate pooling structure; the output of the eighth convolution module is connected to the input of the reduced horizontal pyramid pooling structure through a fifth aggregate pooling structure.

The fifth aggregate pooling structure is also composed of a statistical (maximum) function, which inputs a feature matrix of s (frame number) ×128×16×11, and outputs a feature matrix of 128×16×11.

The reduced horizontal pyramid pooling structure consists of global maximum pooling and global average pooling, and is structured as shown in fig. 4.

The input of the reduced level golden tower pooling is a 128 multiplied by 16 multiplied by 11 feature matrix, the middle feature matrix is 16 128 multiplied by 1 multiplied by 11 three-dimensional matrices, and 16 128-dimensional feature vectors are output through global maximum pooling and global average pooling.

The output of the reduced horizontal pyramid pooling structure is connected with the input of a reduced full-connection network, the reduced full-connection network comprises 16 independent full-connection neural network layers, each layer of input is 128-dimensional vectors, and the output is 128-dimensional vectors.

Compared with the current gait recognition model based on the standard convolution layer structure, the invention designs a compact lightweight gait recognition model (called a student model) by adopting low-cost depth separation convolution, and structurally reduces the parameter quantity of the model.

Example 2

The present embodiment describes a gait recognition model compression method based on the local-global joint knowledge distillation algorithm, which is based on the gait recognition model compression system based on the local-global joint knowledge distillation algorithm in the above embodiment 1.

In this embodiment 2, the two models in the above embodiment 1 are trained to achieve the purpose of gait recognition.

As shown in fig. 7, the gait recognition model compression method based on the local-global joint knowledge distillation algorithm comprises the following steps:

step 1, extracting gait contour sequences in the gait video by using a background subtraction method (the conventional method), and uniformly cutting into image sets with the same size, such as image sets with 64 multiplied by 64 pixels.

The image sets form a data set X, and the data set X is divided into a training set X _train And test set X _test 。

With the deep convolutional neural network as a basic structure, two gait recognition models are constructed and respectively recorded as a large-capacity teacher model M _t And a light student model M _s The model structure is already described in the above embodiment 1, and will not be described here again.

Step 2, training set X _train Training teacher model M _t The method comprises the steps of carrying out a first treatment on the surface of the The learning rate is set to 0.0001, the iteration number is 80000, the optimizer adopts an Adam optimizer, and 16 sequences of 8 objects are input once per iteration (128 sequences are total, 30 frames of images are randomly selected for each sequence, and the images are scaled to 64×44 pixels).

The loss function adopts a triplet loss function L as shown in a formula (1) _tri ；

Wherein N is _tri+ Representing two sample sets of all Euclidean distances other than 0 in a training sample subsetTotal number of sample pairs formed; the training sample subset is the secondary training set X when each training _train A set formed by a plurality of sample images selected randomly;

m represents the boundary threshold of the loss function;

representing a j-th sample to be trained in the i-th training sample subset;

representing the sum +.>Any sample with different identities of the represented pedestrians;

sign | I ₂ Representing the 2-norm of the matrix;

[] ₊ representing ReLU operation, the calculation method is as follows: [ x ]] ₊ =max {0, x }, max is the maximum valueAnd (3) operating.

By reducing the loss value, the points of the same object are mutually close, the points of different objects are mutually far away, the boundary threshold value M of the loss function is 0.2, and the training aim is to enable the teacher model M _t The recognition performance of (a) is best.

Step 3, training set X _train Input to a trained teacher model M _t And untrained complete student model M _s Respectively obtain the same data set in the teacher model M _t Multi-dimensional feature matrix of convolution network output of (a)In student model M _s Multi-dimensional characteristic matrix of the output of the reduced convolution network ∈>In teacher model M _t Multi-dimensional feature matrix of full-connection network outputIn student model M _s Multi-dimensional characteristic matrix of the output of the simplified fully-connected network>

Multi-dimensional feature matrixAnd multidimensional feature matrix->Is b x s x c x h x w; multidimensional feature matrix->And multidimensional feature matrix->Is b x n x d; wherein b represents each training sampleThe number of samples in this subset; s represents the number of frames; c represents the number of the output feature matrixes of the convolution layer, h represents the height of the output feature matrixes of the convolution network, w represents the width of the output feature matrixes of the convolution network, and d represents the dimension of the feature matrixes output by the full-connection network.

Step 4, making the difference measurement function L _{c_dis} Calculating a multi-dimensional feature matrixAnd multidimensional feature matrix->Differences between; wherein the difference metric function L _{c_dis} The calculation formula of (2) is as follows:

in the difference measurement function L _{c_dis} Representing the partial distillation loss, the sign is ² _F Representing the F-norm of the matrix.

When the similarity matrix difference of the output characteristics is calculated, the L2 regularization method is further used for normalizing the difference of the characteristic matrices, so that the student model can be guided to learn more effective knowledge from the teacher model.

In this way, the teacher model M _t And student model M _s The number of channels of the convolved output features may not be consistent, that is, a model of greater or lesser capacity may be designed for knowledge distillation.

wherein,representing the distance between all the samples of the same class in the feature matrix output by the teacher model; />Representing the distance between all different types of samples in the feature matrix output by the teacher model; />Representing the distance between all the samples of the same class in the feature matrix output by the student model; />And representing the distance between all different types of samples in the feature matrix output by the student model. The same class of samples refers to picture samples with the same pedestrian identity tag, and the different class of samples refers to picture samples with different pedestrian identity tags.

representing a kth sample to be trained with different categories in a p-th training sample subset when training a teacher model; />Representing the jth sample to be trained with the same category in the ith training sample subset when training the student model;

representing the kth sample to be trained with different categories in the p-th training sample subset when training the student model.

/>

wherein m represents the boundary threshold of the loss function, and is set to 0.2;

Representing the total number of sample pairs consisting of two samples with all Euclidean distances other than 0 in one training sample subset in the teacher model;

wherein alpha is a distillation loss weight.

Step 8, setting a student model M _s The iteration times are 30000, the optimizer is an Adam optimizer, the learning rate of the former 10000 times is set to be 0.005, the learning rate of the latter 20000 times is set to be 0.001, the optimizer is an Adam optimizer, and the loss value L is reduced _total Transferring the knowledge of the teacher model to the student model M _s And the recognition performance of the student model is improved.

The compression method of the gait recognition model has the advantages that the combined knowledge distillation algorithm is utilized to carry out local and whole knowledge distillation on the high-capacity gait recognition model (called a teacher model), the student model can be guided to learn more knowledge from the teacher model, and the original recognition effect is maintained as much as possible while the model capacity is reduced.

The method adopts the lightweight model compression and combined knowledge distillation technology, so that the gait recognition accuracy of the model can be effectively ensured while the scale of the model parameters is reduced, the operation cost is reduced, the training times and the reasoning time are reduced, the efficiency of the model is improved, and the method is more suitable for actual scenes with high real-time requirements and large data volume.

Experiments prove that compared with a deep neural network model in the prior art, the gait recognition model compression method disclosed by the invention has the advantages that the model parameter quantity is reduced by 9 times, the calculated quantity is reduced by 19 times, the performance of the model in the public data set CASIA-B is reduced by 2.2%, the actual reasoning time is reduced, and the problem of model efficiency is effectively solved.

The foregoing description is, of course, merely illustrative of preferred embodiments of the present invention, and it should be understood that the present invention is not limited to the above-described embodiments, but is intended to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

Claims

1. A gait recognition model compression system based on a local-global combined knowledge distillation algorithm is characterized in that,

the first convolution module is composed of three layers, wherein:

the second convolution module consists of three layers, wherein:

the third convolution module consists of two layers, wherein:

The fifth convolution module consists of two layers, wherein:

the first and second horizontal pyramid pooling structures have five scales;

the seventh convolution module consists of three layers, wherein:

The eighth convolution module consists of two layers, wherein:

2. A gait recognition model compression method based on a local-global joint knowledge distillation algorithm, the gait recognition model compression system based on the local-global joint knowledge distillation algorithm described in claim 1; it is characterized in that the method comprises the steps of,

the gait recognition model compression method comprises the following steps:

m represents the boundary threshold of the loss function;

representing a j-th sample to be trained in the i-th training sample subset;

sign | I ₂ Representing the 2-norm of the matrix;

in the difference measurement function L _{c_dis} Representing the loss of partial distillation, signRepresenting the F-norm of the matrix;

step 5, respectively calculating the multi-dimensional feature matrix by using the formula (3)And multidimensional feature matrix->The difference between each sample of (a) is used as a result of +.>A representation;

wherein,representing the distance between all the samples of the same class in the feature matrix output by the teacher model; />Representing the distance between all different types of samples in the feature matrix output by the teacher model; />Representing studentsThe distances among all the samples of the same category in the feature matrix output by the model; />Representing the distance between all different types of samples in the feature matrix output by the student model;

when the teacher model is trained, the kth sample to be trained with the same category in the kth training sample subset is represented;

representing the jth sample to be trained with different categories in the ith training sample subset when training the teacher model;

representing a kth sample to be trained with different categories in a p-th training sample subset when training a teacher model;

Representing the jth sample to be trained with the same category in the ith training sample subset when training the student model;

representing the kth sample to be trained with the same category in the kth training sample subset when training the student model;

representing the jth sample to be trained with different categories in the ith training sample subset when training the student model;

wherein,representing two samples of a training sample subset of the teacher model with all Euclidean distances other than 0Total number of sample pairs composed;

Wherein alpha is a distillation loss weight;