CN107085716B - Cross-view gait recognition method based on multi-task generative adversarial network - Google Patents

Cross-view gait recognition method based on multi-task generative adversarial network

Info

Publication number
CN107085716B
CN107085716B (application CN201710373017.5A)
Authority
CN
China
Prior art keywords
gait
template
view
loss
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710373017.5A
Other languages
Chinese (zh)
Other versions
CN107085716A (en)
Inventor
何逸炜 (Yiwei He)
张军平 (Junping Zhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201710373017.5A priority Critical patent/CN107085716B/en
Publication of CN107085716A publication Critical patent/CN107085716A/en
Application granted granted Critical
Publication of CN107085716B publication Critical patent/CN107085716B/en
Legal status: Active


Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00: Pattern recognition
            • G06F 18/20: Analysing
              • G06F 18/285: Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/048: Activation functions
              • G06N 3/08: Learning methods
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00: Arrangements for image or video recognition or understanding
            • G06V 10/40: Extraction of image or video features
              • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
          • G06V 20/00: Scenes; Scene-specific elements
            • G06V 20/40: Scenes; Scene-specific elements in video content
              • G06V 20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Abstract

The invention belongs to the fields of computer vision and machine learning, and specifically relates to a cross-view gait recognition method based on a multi-task generative adversarial network. The invention mainly addresses the degradation of model generalization under large view changes in gait recognition. For an original pedestrian video frame sequence, each frame image is preprocessed and gait template features are extracted; a neural network then encodes a hidden gait representation, and a view transformation is performed in the hidden space; a multi-task generative adversarial network then reconstructs the gait template features at other views; finally, recognition is performed using the hidden gait representation. Compared with classification-based or other reconstruction-based methods, the method is more interpretable and improves recognition performance.

Description

Cross-view gait recognition method based on multi-task generative adversarial network
Technical Field
The invention belongs to the technical fields of computer vision and machine learning, and specifically relates to a video-based cross-view gait recognition method.
Background
Video-based cross-view gait recognition is one of the research problems of computer vision and machine learning. Given gait video frame sequences captured at different views, the task is to determine, using a computer vision or machine learning algorithm, whether the subjects of the sequences are the same person. There is considerable prior work in this field; the main methods fall into three categories: reconstruction-based methods, subspace-based methods, and deep-learning-based methods. Some references for these three categories of methods follow:
[1] W. Kusakunniran, Q. Wu, J. Zhang, and H. Li, "Support vector regression for multi-view gait recognition based on local motion feature selection," in Conference on Computer Vision and Pattern Recognition, pp. 974–981, 2010.
[2] M. Hu, Y. Wang, Z. Zhang, J. J. Little, and D. Huang, "View-invariant discriminative projection for multi-view gait-based human identification," IEEE Transactions on Information Forensics and Security, vol. 8, no. 12, pp. 2034–2045, 2013.
[3] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, and L. Wang, "Recognizing gaits across views through correlated motion co-clustering," IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 696–709, 2014.
[4] S. Yu, H. Chen, Q. Wang, L. Shen, and Y. Huang, "Invariant feature extraction for gait recognition using only one uniform model," Neurocomputing, vol. 239, pp. 81–93, 2017.
[5] Z. Wu, Y. Huang, L. Wang, X. Wang, and T. Tan, "A comprehensive study on cross-view gait based human identification with deep CNNs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 2, pp. 209–226, 2017.
[6] Y. Feng, Y. Li, and J. Luo, "Learning effective gait features using LSTM," in International Conference on Pattern Recognition, pp. 320–325, 2016.
[7] H. Iwama, M. Okumura, Y. Makihara, and Y. Yagi, "The OU-ISIR gait database comprising the large population dataset and performance evaluation of gait recognition," IEEE Transactions on Information Forensics and Security, vol. 7, no. 5, pp. 1511–1521, 2012.
[8] Y. Makihara, R. Sagawa, Y. Mukaigawa, T. Echigo, and Y. Yagi, "Gait recognition using a view transformation model in the frequency domain," in European Conference on Computer Vision, pp. 151–163, Springer, 2006.
[9] W. Kusakunniran, Q. Wu, H. Li, and J. Zhang, "Multiple views gait recognition using view transformation model based on optimized gait energy image," in International Conference on Computer Vision, pp. 1058–1064, 2009.
[10] X. Xing, K. Wang, T. Yan, and Z. Lv, "Complete canonical correlation analysis with application to multi-view gait recognition," Pattern Recognition, vol. 50, pp. 107–117, 2016.
The first category, reconstruction-based methods, mainly reconstructs the gait template. For example, the VTM models in [1,3,8,9] train a separate model to predict the value of each pixel of the template features at the target view. To reduce this computational overhead, the auto-encoder-based method [4] reconstructs all pixels of the template features simultaneously and extracts a view-invariant feature representation.
Subspace-based methods [2,10] project gait features from different views into a common subspace and compute similarity in that subspace. However, subspace methods generally model only the linear correlations among features, neglect nonlinear correlations, and achieve low recognition rates.
Recently, models based on deep neural networks have been widely applied in computer vision. In [5,6], deep networks are used for gait recognition to automatically extract view-invariant feature representations. Although recognition accuracy improves greatly over earlier methods, such deep models lack interpretability.
Disclosure of Invention
The invention aims to provide a video-based cross-view gait recognition method with a high recognition rate and low computational overhead.
The basic contents of the present invention will be described first.
1. Periodic energy image (PEI)
(a) To build more effective gait template features, the original pedestrian video sequence is first preprocessed. For each frame of the video sequence, the foreground is separated from the background and a gait silhouette is extracted; the silhouette is then translated and aligned to the image center, yielding a normalized (centered) gait silhouette;
(b) From the normalized silhouette sequence, the normalized period position of each frame is first obtained using a period detection technique from gait recognition. A gait cycle is defined as the time span between two adjacent local minima of the normalized period position. The silhouette sequence is then divided into one or more gait cycles, and frames that do not form a complete gait cycle are discarded. From the centered gait silhouette sequence $B_t$ and the corresponding normalized period positions $r_t$, a periodic energy image (PEI) template feature with $n_c$ channels is constructed. The value at coordinate $(x, y)$ of the $k$-th PEI channel is calculated as

$$\mathrm{PEI}_k(x, y) = \frac{1}{|\Omega_k|} \sum_{t \in \Omega_k} B_t(x, y),$$

wherein

$$\Omega_k = \{\, t \mid T(k) - m \le r_t \le T(k) + m \,\},$$

$m$ represents the size of the time-domain window covered by each channel, and $T(k)$ determines the time-domain range covered by each channel. In this way, the spatial information of the silhouette sequence is divided into $n_c$ time-domain channels according to the period position of each frame, encoding the temporal and spatial gait information simultaneously. A code sketch of this construction follows below.
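For illustration only, here is a minimal NumPy sketch of the PEI construction described above. The helper name build_pei, the uniform channel-center placement T(k) = k/(n_c + 1), and the default window size m are assumptions made for the example, not values fixed by the invention.

```python
import numpy as np

def build_pei(silhouettes, r, n_c=10, m=0.1):
    """Periodic energy image from centered gait silhouettes.

    silhouettes: (T, H, W) array, centered silhouettes B_t with values in [0, 1]
    r:           (T,) array, normalized period position r_t of each frame
    Returns:     (n_c, H, W) array, one averaged image per channel
    """
    _, H, W = silhouettes.shape
    pei = np.zeros((n_c, H, W))
    for k in range(1, n_c + 1):
        center = k / (n_c + 1)                # assumed channel center T(k)
        in_window = np.abs(r - center) <= m   # frames covered by this channel
        if in_window.any():
            # average the silhouettes whose period position falls in the window
            pei[k - 1] = silhouettes[in_window].mean(axis=0)
    return pei
```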
2. Multi-task generative adversarial network
As shown in Figure 1, the multi-task generative adversarial network for gait recognition consists mainly of an encoder, a view transformation layer, a generator, and a discriminator; a structural sketch in code follows this list. Wherein:
(a) Encoder E: to obtain a view-specific hidden representation, a convolutional neural network is used as the encoder in the model. The encoder input $x_u$ is a PEI template at view $u$, with input size $64 \times 64 \times n_c$. Each channel of the PEI template is fed into the encoder independently; temporal pooling (average pooling is used) followed by a fully connected layer then extracts the effective hidden feature representation $z_u$ as the encoder output;
(b) View transformation layer V: the distribution of the gait data is assumed to lie on a high-dimensional manifold along which the view changes; moving a sample on the manifold in a particular direction changes the view while preserving identity information. Given the hidden representation $z_u$ in the hidden space, the view transformation can be described as

$$z_v = z_u + \sum_{i=u+1}^{v} h_i,$$

wherein $h_i$ represents the transformation vector from view $i-1$ to view $i$, and $z_v$ is the hidden representation at view $v$;
(c) Generator G: taking the hidden representation $z_v$ as input, a gait template at view $v$ can be generated. The generator consists of five deconvolution layers and uses the ReLU as the activation function. Because directly reconstructing the full PEI template is difficult, one channel of the PEI template is randomly selected for reconstruction, which forces the hidden representation to preserve the information of all channels. The generator input is defined as $[z_v, c]$, where $c$ is a one-hot encoding of the channel;
(d) Discriminator D: the discriminator has the same structure as the encoder, except that no temporal pooling is used and the output layer dimension differs. The output dimension of the discriminator is defined as $n_v + n_c + n_d$, where $n_v$ is the number of views, $n_c$ is the number of PEI channels, and $n_d$ is the number of identities in the training set. The discriminator outputs correspond to $n_v + n_c + n_d$ sub-discriminators, which share all parameters except the last layer. Each sub-discriminator is responsible for a different discrimination task, judging whether a generated sample belongs to a particular distribution. For example, in the model the first $n_v + n_c$ sub-discriminators push the generator to produce samples matching the distributions of particular views and channels, and the remaining $n_d$ sub-discriminators push generated samples to match the distributions of particular identities.
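As a structural illustration, the following is a minimal PyTorch sketch of the four components. The patent fixes the $64 \times 64 \times n_c$ input, the five deconvolution layers with ReLU, and the $n_v + n_c + n_d$ discriminator output; the layer widths, kernel sizes, latent dimension, default counts, and the learnable step vectors h_i are assumptions made for this example.

```python
import torch
import torch.nn as nn

def conv_trunk():
    """Conv stack used by both encoder and discriminator (widths assumed)."""
    return nn.Sequential(
        nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),    # 64 -> 32
        nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),   # 32 -> 16
        nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),  # 16 -> 8
        nn.Flatten())                             # -> 128 * 8 * 8 features

class Encoder(nn.Module):
    """Each PEI channel encoded independently, then temporal (average) pooling + FC."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv, self.fc = conv_trunk(), nn.Linear(128 * 8 * 8, dim)

    def forward(self, x):                          # x: (B, n_c, 64, 64)
        B, n_c, H, W = x.shape
        f = self.conv(x.reshape(B * n_c, 1, H, W)).reshape(B, n_c, -1)
        return self.fc(f.mean(dim=1))              # temporal pooling, then z_u

class ViewTransform(nn.Module):
    """z_v = z_u + sum_{i=u+1}^{v} h_i with learnable step vectors h_i (assumes u < v)."""
    def __init__(self, n_v, dim=256):
        super().__init__()
        self.h = nn.Parameter(0.01 * torch.randn(n_v, dim))

    def forward(self, z_u, u, v):
        return z_u + self.h[u + 1 : v + 1].sum(dim=0)

class Generator(nn.Module):
    """Five deconvolution layers with ReLU; input is [z_v, c], c a channel one-hot."""
    def __init__(self, dim=256, n_c=10):
        super().__init__()
        self.fc = nn.Linear(dim + n_c, 256 * 2 * 2)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),  # 2 -> 4
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),   # 4 -> 8
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),    # 8 -> 16
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(16, 1, 4, 2, 1), nn.Sigmoid())  # 32 -> 64

    def forward(self, z_v, c):
        h = self.fc(torch.cat([z_v, c], dim=1)).reshape(-1, 256, 2, 2)
        return self.deconv(h)                      # one reconstructed PEI channel

class Discriminator(nn.Module):
    """Encoder-like net without temporal pooling; the n_v + n_c + n_d
    sub-discriminators share every layer except the last one."""
    def __init__(self, n_v=11, n_c=10, n_d=62):
        super().__init__()
        self.conv = conv_trunk()
        self.out = nn.Linear(128 * 8 * 8, n_v + n_c + n_d)

    def forward(self, x):                          # x: (B, 1, 64, 64)
        return torch.sigmoid(self.out(self.conv(x)))
```

With these pieces, `z_u = Encoder()(x_u)`, `z_v = ViewTransform(n_v)(z_u, u, v)`, and `Generator()(z_v, c)` correspond to the encoder, view transformation, and generation stages described above.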
3. Loss functions
The present invention uses two loss functions to train the multi-task generative adversarial network.
(a) Pixel-wise loss: to strengthen the ability of the hidden representation to retain identity information, the pixel-wise loss between the generated template and the real template is minimized first. From the template feature $x_v$ and the pseudo feature $\hat{x}_v$, the pixel-wise loss is calculated as

$$L_p = \mathbb{E}\big[\, \| x_v - \hat{x}_v \|_1 \,\big].$$
(b) Multi-task adversarial loss: from the template feature $x_v$ and the pseudo feature $\hat{x}_v$, the multi-task adversarial loss is computed as

$$L_a = \mathbb{E}\big[\, \| s \odot \log D(x_v) \|_1 \,\big] + \mathbb{E}\big[\, \| s \odot \log\big(1 - D(\hat{x}_v)\big) \|_1 \,\big],$$

where $\mathbb{E}$ denotes the expectation over the corresponding sample set, $\|\cdot\|_1$ denotes the L1 norm, $\odot$ is the element-wise product, and the vector $s$ is the one-hot encoding of the identity, channel, and angle information. $D(\cdot)$ denotes the output of the discriminator, whose dimension equals that of $s$; the non-zero elements of $s$ determine the distributions to which the pseudo-template features should belong;
The final loss function is defined as

$$L = L_p + \alpha L_a \qquad (6)$$

where $\alpha$ is a hyper-parameter balancing the pixel-wise loss and the multi-task adversarial loss. After the final loss function is defined, the parameters of the encoder, view transformation layer, generator, and discriminator are updated alternately using the back-propagation algorithm.
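A minimal sketch of the two losses under the masked form reconstructed above; the helper names, the clamping constant eps, and the mean reduction over the batch are implementation assumptions.

```python
import torch

def masked_log(p, s, eps=1e-8):
    """||s (x) log p||_1 per sample, averaged over the batch; s is the 0/1 mask."""
    return (s * p.clamp(eps, 1.0).log()).abs().sum(dim=1).mean()

def pixelwise_loss(x_real, x_fake):
    """L_p: L1 distance between the real and the generated template channel."""
    return (x_real - x_fake).abs().mean()

def multitask_adv_loss(D, x_real, x_fake, s):
    """L_a = ||s (x) log D(x_v)||_1 + ||s (x) log(1 - D(x_hat_v))||_1.

    Written this way, L_a is minimized by the discriminator; the generator side
    in the training sketch below uses the standard non-saturating term instead.
    """
    return masked_log(D(x_real), s) + masked_log(1.0 - D(x_fake), s)
```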
The invention provides a cross-view gait recognition method based on a multi-task generative adversarial network; the specific steps are as follows:
(1) inputting pedestrian video frame sequences at different views and constructing the gait template features

$$\{x_1, x_2, \dots, x_{n_v}\},$$

where the vector $x_i$ is the gait template feature at view $i$ and $n_v$ is the number of views;
(2) for any two different views $u, v$, encoding the corresponding gait template feature $x_u$ into the hidden space with a convolutional neural network, obtaining the hidden representation $z_u$;
(3) applying the view transformation to the hidden representation $z_u$ in the hidden space, transforming it to angle $v$ to obtain the hidden representation $z_v$;
(4) taking the hidden representation $z_v$ and the channel one-hot code as input, and obtaining the pseudo-template feature $\hat{x}_v$ at view $v$ from the generator network of the multi-task generative adversarial network;
(5) calculating the pixel-wise loss $L_p$ from the template feature $x_u$ and the pseudo feature $\hat{x}_v$;
(6) calculating the multi-task adversarial loss $L_a$ from the template feature $x_u$ and the pseudo feature $\hat{x}_v$, using the discriminator network of the multi-task generative adversarial network;
(7) weighting the pixel-wise loss and the multi-task adversarial loss, and training the multi-task generative adversarial network with the total loss $L = L_p + \alpha L_a$, where $\alpha$ is a hyper-parameter balancing the two losses; a training-step sketch follows this list.
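To tie steps (1) through (7) together, here is a minimal sketch of one alternating update, reusing the modules and losses from the sketches above. The optimizer choice, learning rate, value of α, and the non-saturating generator term are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

# modules from the earlier sketches (assumed to be defined)
E, V, G, D = Encoder(), ViewTransform(n_v=11), Generator(), Discriminator()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(
    list(E.parameters()) + list(V.parameters()) + list(G.parameters()), lr=2e-4)
alpha = 0.1   # assumed value of the balancing hyper-parameter

def train_step(x_u, x_v, u, v, k, s):
    """x_u, x_v: PEI templates at views u, v; k: sampled channel index; s: task mask."""
    c = F.one_hot(torch.full((x_u.shape[0],), k, dtype=torch.long), 10).float()
    z_v = V(E(x_u), u, v)                     # steps (2)-(3): encode, transform view
    x_fake = G(z_v, c)                        # step (4): generate pseudo template channel
    x_real = x_v[:, k : k + 1]                # the matching real PEI channel

    # discriminator update: minimize L_a (step (6))
    d_loss = multitask_adv_loss(D, x_real, x_fake.detach(), s)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # encoder / view layer / generator update: L = L_p + alpha * L_a, steps (5) and (7)
    # (non-saturating adversarial term for the generator side, an assumption)
    g_loss = pixelwise_loss(x_real, x_fake) + alpha * masked_log(D(x_fake), s)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```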
In the invention, the gait template is constructed by the following steps:
(1) separating the foreground and background of each frame of the original video frame sequence and extracting the gait silhouette; translating and scaling the silhouette to the image center to obtain the centered gait silhouette sequence $\{B_1, B_2, B_3, \dots, B_n\}$;
(2) for each frame of the centered gait silhouette sequence, calculating $r_t$, the normalized period position of the $t$-th frame;
(3) constructing the $n_c$-channel periodic energy image (PEI) template features from the centered gait silhouette sequence $B_t$ and the corresponding normalized period positions $r_t$; the value at coordinate $(x, y)$ of the $k$-th PEI channel is calculated as

$$\mathrm{PEI}_k(x, y) = \frac{1}{|\Omega_k|} \sum_{t \in \Omega_k} B_t(x, y),$$

wherein

$$\Omega_k = \{\, t \mid T(k) - m \le r_t \le T(k) + m \,\},$$

$m$ represents the size of the time-domain window covered by each channel, and $T(k)$ determines the time-domain range covered by each channel.
In the invention, the view transformation step is as follows: given the hidden representation $z_u$ at view $u$, the transformation to view $v$ can be described as

$$z_v = z_u + \sum_{i=u+1}^{v} h_i,$$

wherein $h_i$ represents the transformation vector from view $i-1$ to view $i$.
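A small NumPy check of the additive form above: because the transformation is a cumulative sum of step vectors, transforming u → v and then v → w gives the same result as transforming u → w directly. The function name and the vector values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(11, 4))                 # step vectors h_i (values illustrative)

def transform(z, u, v):
    """z_v = z_u + sum over i = u+1 .. v of h_i (assumes u < v)."""
    return z + h[u + 1 : v + 1].sum(axis=0)

z_u = rng.normal(size=4)
assert np.allclose(transform(z_u, 2, 7),
                   transform(transform(z_u, 2, 5), 5, 7))  # composition holds
```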
In the present invention, the losses $L_p$ and $L_a$ are calculated as follows:
(1) from the template feature $x_v$ and the pseudo feature $\hat{x}_v$, calculate the pixel-wise loss

$$L_p = \mathbb{E}\big[\, \| x_v - \hat{x}_v \|_1 \,\big];$$

(2) from the template feature $x_v$ and the pseudo feature $\hat{x}_v$, calculate the multi-task adversarial loss

$$L_a = \mathbb{E}\big[\, \| s \odot \log D(x_v) \|_1 \,\big] + \mathbb{E}\big[\, \| s \odot \log\big(1 - D(\hat{x}_v)\big) \|_1 \,\big],$$

where $\mathbb{E}$ denotes the expectation over the corresponding sample set, $\|\cdot\|_1$ denotes the L1 norm, the vector $s$ is the one-hot encoding of the identity, channel, and angle information, and $D(\cdot)$ denotes the output of the discriminator.
The method of the invention uses the nonlinear modeling capability of a convolutional neural network to extract view-specific hidden representations. Performing the view transformation in the hidden space reduces computational overhead; and by exploiting the distribution-modeling capability of generative adversarial networks, expressive features are extracted, which can greatly improve recognition performance.
Drawings
FIG. 1: detailed model flow diagram of the invention.
FIG. 2: sample frames from the OU-ISIR, CASIA-B, and USF datasets.
FIG. 3: average accuracy on the CASIA-B dataset under different walking conditions.
Detailed Description
Having described the specific steps and model of the invention, the following demonstrates its test performance on several gait datasets.
The experiments use three datasets: OU-ISIR, CASIA-B, and USF. Fig. 2 shows some samples from each of the three datasets.
The OU-ISIR dataset contains 4007 subjects in total (2135 male, 1872 female), with ages ranging from 1 to 94 years. The data cover 4 different angles: 55°, 65°, 75°, and 85°. PEI templates are extracted with the number of channels set to 3, interpolated to 64 × 64 pixels, and fed into the multi-task generative adversarial network for training.
The CASIA-B dataset contains 124 subjects and 11 different views. At each view, every subject has 6 normal-walking gait sequences, 2 sequences walking with a bag, and 2 sequences wearing a coat. CASIA-B covers a wider range of views than OU-ISIR but has relatively few subjects. The number of PEI channels is set to 10, and recognition accuracy is tested separately under the different walking conditions.
USF is another common gait dataset, with 122 subjects in total; each subject has gait sequences under 5 different conditions, close to real scenes. In the experiments we used only the gait sequences at different views for testing, and set the number of PEI channels to 10.
The experiments use Rank-1 recognition accuracy as the performance metric. Recognition is performed with a nearest-neighbor classifier in the hidden space of the view transformation layer.
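For reference, a minimal sketch of the Rank-1 nearest-neighbor evaluation in the hidden space; the function name rank1_accuracy and the choice of Euclidean distance are assumptions, since the text does not name the metric.

```python
import numpy as np

def rank1_accuracy(gallery_z, gallery_ids, probe_z, probe_ids):
    """Rank-1 accuracy of a nearest-neighbor classifier in the hidden space.

    gallery_z, probe_z: (N, dim) hidden representations (probes already
    transformed to the gallery view with the view transformation layer).
    """
    # pairwise Euclidean distances between probe and gallery representations
    d = np.linalg.norm(probe_z[:, None, :] - gallery_z[None, :, :], axis=2)
    predicted = gallery_ids[d.argmin(axis=1)]   # identity of nearest gallery sample
    return float((predicted == probe_ids).mean())
```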
Experimental example 1: recognition performance of multitask generation countermeasure networks
This experiment compares the cross-view recognition accuracy of different models. For comparison we chose an auto-encoder, canonical correlation analysis, linear discriminant analysis, a convolutional neural network, and a local tensor discriminant model. Table 1 compares the method of the invention with the other methods on the three datasets; the invention improves substantially over the other methods.
Experimental example 2: effect of different loss functions on model Performance
Table 2 shows how the model's performance on the CASIA-B dataset varies with the loss functions used. Combining the multi-task adversarial loss with the pixel-wise loss improves recognition performance, while using either loss function alone degrades the model's performance.
Experimental example 3: effect of different Walking states on model Performance
Fig. 3 shows the cross-view recognition accuracy on CASIA-B under three different walking conditions: normal walking, walking with a bag, and walking in a coat. Accuracy is highest for normal walking, and the drop in model performance is more pronounced for the coat-wearing sequences than for the bag-carrying sequences.
Experimental example 4: influence of different templates and PEI (polyetherimide) channel numbers on identification accuracy
Table 3 shows the effect of different templates and of the number of PEI channels on recognition accuracy. We compared against the gait energy image (GEI) and the chrono-gait image (CGI), and analyzed the average recognition accuracy with the number of PEI channels set to 3, 5, and 10. The PEI template of the invention achieves higher recognition accuracy than GEI and CGI, and accuracy improves further as the number of PEI channels increases.
Table 1: recognition accuracy (%) under different methods
Table 2: model identification accuracy (%) under different loss functions
Loss function                           54°     90°     126°
Pixel-wise + multi-task adversarial     82.4    73.1    81.3
Pixel-wise loss only                    81.5    71.7    83.5
Multi-task adversarial loss only        74.6    68.6    75.4
Table 3: influence of different templates and the number of PEI channels on recognition accuracy

Claims (3)

1. A cross-view gait recognition method based on a multi-task generative adversarial network, characterized by comprising the following specific steps:
(1) inputting pedestrian video frame sequences at different views and constructing the gait template features

$$\{x_1, x_2, \dots, x_{n_v}\},$$

where the vector $x_i$ is the gait template feature at view $i$ and $n_v$ is the number of views;
(2) for any two different views $u, v$, encoding the corresponding gait template feature $x_u$ into the hidden space with a convolutional neural network, obtaining the hidden representation $z_u$;
(3) applying the view transformation to the hidden representation $z_u$ in the hidden space, transforming it to angle $v$ to obtain the hidden representation $z_v$;
(4) taking the hidden representation $z_v$ and the channel one-hot code as input, and obtaining the pseudo-template feature $\hat{x}_v$ at view $v$ from the generator network of the multi-task generative adversarial network;
(5) calculating the pixel-wise loss $L_p$ from the template feature $x_u$ and the pseudo feature $\hat{x}_v$;
(6) calculating the multi-task adversarial loss $L_a$ from the template feature $x_u$ and the pseudo feature $\hat{x}_v$, using the discriminator network of the multi-task generative adversarial network;
(7) weighting the pixel-wise loss and the multi-task adversarial loss, and training the multi-task generative adversarial network with the total loss $L = L_p + \alpha L_a$;
wherein $\alpha$ is a hyper-parameter balancing the pixel-wise loss and the multi-task adversarial loss; after the final loss function is defined, the parameters of the encoder, view transformation layer, generator, and discriminator are updated alternately using a back-propagation algorithm;
the gait template is constructed by the following steps:
(1) separating the foreground and background of each frame of the original video frame sequence and extracting the gait silhouette; translating and scaling the silhouette to the image center to obtain the centered gait silhouette sequence $\{B_1, B_2, B_3, \dots, B_n\}$;
(2) for each frame of the centered gait silhouette sequence, calculating $r_t$, the normalized period position of the $t$-th frame;
(3) constructing the $n_c$-channel periodic energy image (PEI) template features from the centered gait silhouette sequence $B_t$ and the corresponding normalized period positions $r_t$; the value at coordinate $(x, y)$ of the $k$-th PEI channel is calculated as

$$\mathrm{PEI}_k(x, y) = \frac{1}{|\Omega_k|} \sum_{t \in \Omega_k} B_t(x, y),$$

wherein

$$\Omega_k = \{\, t \mid T(k) - m \le r_t \le T(k) + m \,\},$$

$m$ represents the size of the time-domain window covered by each channel, and $T(k)$ determines the time-domain range covered by each channel;
the multi-task generative adversarial network mainly comprises four parts: an encoder, a view transformation layer, a generator, and a discriminator; wherein:
(a) an encoder: a convolutional neural network is used as the encoder in the model; the encoder input $x_u$ is a PEI template at view $u$, with input size $64 \times 64 \times n_c$; each channel of the PEI template is fed into the encoder independently, and temporal pooling followed by a fully connected layer extracts the effective hidden feature representation $z_u$ as the encoder output;
(b) a view transformation layer: the gait data are assumed to be distributed on a high-dimensional manifold along which the view changes, so that moving a sample on the manifold in a particular direction changes the view while preserving identity information; given the hidden representation $z_u$ in the hidden space, the view transformation can be described as

$$z_v = z_u + \sum_{i=u+1}^{v} h_i,$$

wherein $h_i$ represents the transformation vector from view $i-1$ to view $i$, and $z_v$ is the hidden representation at view $v$;
(c) a generator: taking the hidden representation $z_v$ as input, a gait template at view $v$ is generated; the generator consists of five deconvolution layers and uses the ReLU as the activation function; one channel of the PEI template is randomly selected for reconstruction, so that the hidden representation preserves the information of all channels; the generator input is defined as $[z_v, c]$, where $c$ is a one-hot encoding of the channel;
(d) a discriminator: the discriminator has the same structure as the encoder, except that no temporal pooling is used and the output layer dimension differs; the output dimension of the discriminator is defined as $n_v + n_c + n_d$, where $n_v$ is the number of views, $n_c$ is the number of PEI channels, and $n_d$ is the number of different identities in the training set; the discriminator outputs correspond to $n_v + n_c + n_d$ sub-discriminators, which share all parameters except the last layer; each sub-discriminator is responsible for a different discrimination task, judging whether a generated sample belongs to a particular distribution.
2. The gait recognition method according to claim 1, characterized in that the view transformation step is:
given the hidden representation $z_u$ at view $u$, the hidden representation $z_v$ at view $v$ is calculated as

$$z_v = z_u + \sum_{i=u+1}^{v} h_i,$$

wherein $h_i$ represents the transformation vector from view $i-1$ to view $i$.
3. The gait recognition method according to claim 1, characterized in that the losses $L_p$ and $L_a$ are calculated as follows:
(1) from the template feature $x_v$ and the pseudo feature $\hat{x}_v$, calculate the pixel-wise loss

$$L_p = \mathbb{E}\big[\, \| x_v - \hat{x}_v \|_1 \,\big];$$
(2) from the template feature $x_v$ and the pseudo feature $\hat{x}_v$, calculate the multi-task adversarial loss

$$L_a = \mathbb{E}\big[\, \| s \odot \log D(x_v) \|_1 \,\big] + \mathbb{E}\big[\, \| s \odot \log\big(1 - D(\hat{x}_v)\big) \|_1 \,\big],$$

where $\mathbb{E}$ denotes the expectation over the corresponding sample set, $\|\cdot\|_1$ denotes the L1 norm, the vector $s$ is the one-hot encoding of the identity, channel, and angle information, and $D(\cdot)$ denotes the output of the discriminator.
CN201710373017.5A 2017-05-24 2017-05-24 Cross-view gait recognition method based on multi-task generative adversarial network Active CN107085716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710373017.5A CN107085716B (en) 2017-05-24 2017-05-24 Cross-view gait recognition method based on multi-task generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710373017.5A CN107085716B (en) 2017-05-24 2017-05-24 Cross-view gait recognition method based on multi-task generative adversarial network

Publications (2)

Publication Number Publication Date
CN107085716A CN107085716A (en) 2017-08-22
CN107085716B true CN107085716B (en) 2021-06-04

Family

ID=59607349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710373017.5A Active CN107085716B (en) 2017-05-24 2017-05-24 Cross-view gait recognition method based on multi-task generative adversarial network

Country Status (1)

Country Link
CN (1) CN107085716B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563428B (en) * 2017-08-25 2019-07-02 西安电子科技大学 Based on the Classification of Polarimetric SAR Image method for generating confrontation network
CN107609587B (en) * 2017-09-11 2020-08-18 浙江工业大学 Multi-class multi-view data generation method for generating countermeasure network based on deep convolution
WO2019056257A1 (en) * 2017-09-21 2019-03-28 Nokia Technologies Oy Apparatus, method and computer program product for biometric recognition
CN109697389B (en) * 2017-10-23 2021-10-01 北京京东尚科信息技术有限公司 Identity recognition method and device
CN108009568A (en) * 2017-11-14 2018-05-08 华南理工大学 A kind of pedestrian detection method based on WGAN models
CN107993210A (en) * 2017-11-30 2018-05-04 北京小米移动软件有限公司 Image repair method, device and computer-readable recording medium
CN108256627A (en) * 2017-12-29 2018-07-06 中国科学院自动化研究所 The mutual generating apparatus of audio-visual information and its training system that generation network is fought based on cycle
CN108171701B (en) * 2018-01-15 2021-06-22 复旦大学 Significance detection method based on U network and counterstudy
CN108491380B (en) * 2018-03-12 2021-11-23 思必驰科技股份有限公司 Anti-multitask training method for spoken language understanding
CN108596026B (en) * 2018-03-16 2020-06-30 中国科学院自动化研究所 Cross-view gait recognition device and training method based on double-flow generation countermeasure network
CN108629823B (en) * 2018-04-10 2022-09-06 北京京东尚科信息技术有限公司 Method and device for generating multi-view image
CN110443232B (en) 2018-04-12 2022-03-25 腾讯科技(深圳)有限公司 Video processing method and related device, image processing method and related device
CN108846355B (en) * 2018-06-11 2020-04-28 腾讯科技(深圳)有限公司 Image processing method, face recognition device and computer equipment
CN109583298B (en) * 2018-10-26 2023-05-02 复旦大学 Cross-view gait recognition method based on set
CN109543546B (en) * 2018-10-26 2022-12-20 复旦大学 Gait age estimation method based on depth sequence distribution regression
CN111144165B (en) * 2018-11-02 2024-04-12 银河水滴科技(宁波)有限公司 Gait information identification method, system and storage medium
CN111144171A (en) * 2018-11-02 2020-05-12 银河水滴科技(北京)有限公司 Abnormal crowd information identification method, system and storage medium
CN111144167A (en) * 2018-11-02 2020-05-12 银河水滴科技(北京)有限公司 Gait information identification optimization method, system and storage medium
CN111144170A (en) * 2018-11-02 2020-05-12 银河水滴科技(北京)有限公司 Gait information registration method, system and storage medium
CN109726654A (en) * 2018-12-19 2019-05-07 河海大学 A kind of gait recognition method based on generation confrontation network
EP3888017A4 (en) * 2019-01-04 2022-08-03 Sony Corporation of America Multi-forecast networks
CN110119780B (en) * 2019-05-10 2020-11-27 西北工业大学 Hyper-spectral image super-resolution reconstruction method based on generation countermeasure network
CN110381268B (en) * 2019-06-25 2021-10-01 达闼机器人有限公司 Method, device, storage medium and electronic equipment for generating video
CN110516525B (en) * 2019-07-01 2021-10-08 杭州电子科技大学 SAR image target recognition method based on GAN and SVM
CN110659586B (en) * 2019-08-31 2022-03-15 电子科技大学 Gait recognition method based on identity-preserving cyclic generation type confrontation network
CN112488984A (en) * 2019-09-11 2021-03-12 中信戴卡股份有限公司 Method and device for acquiring defect picture generation network and defect picture generation method
CN111639580B (en) * 2020-05-25 2023-07-18 浙江工商大学 Gait recognition method combining feature separation model and visual angle conversion model
CN112001254B (en) * 2020-07-23 2021-07-13 浙江大华技术股份有限公司 Pedestrian identification method and related device
CN112733704B (en) * 2021-01-07 2023-04-07 浙江大学 Image processing method, electronic device, and computer-readable storage medium
CN112975968B (en) * 2021-02-26 2022-06-28 同济大学 Mechanical arm imitation learning method based on third visual angle variable main body demonstration video
CN113112572B (en) * 2021-04-13 2022-09-06 复旦大学 Hidden space search-based image editing method guided by hand-drawn sketch
CN113111797B (en) * 2021-04-19 2024-02-13 杭州电子科技大学 Cross-view gait recognition method combining self-encoder and view transformation model
CN113657463B (en) * 2021-07-28 2023-04-07 浙江大华技术股份有限公司 Gait recognition model training method, gait recognition method and related device
CN114120076B (en) * 2022-01-24 2022-04-29 武汉大学 Cross-view video gait recognition method based on gait motion estimation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102426645A (en) * 2011-08-30 2012-04-25 北京航空航天大学 Multi-view and multi-state gait recognition method
CN104299012A (en) * 2014-10-28 2015-01-21 中国科学院自动化研究所 Gait recognition method based on deep learning
CN105574510A (en) * 2015-12-18 2016-05-11 北京邮电大学 Gait identification method and device
CN106096532A (en) * 2016-06-03 2016-11-09 山东大学 A kind of based on tensor simultaneous discriminant analysis across visual angle gait recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Comprehensive Study on Cross-View Gait Based Human Identification with Deep CNNs; Zifeng Wu et al.; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2017-02-28; full text *
Design and Research of Multi-View Gait Recognition Algorithms Based on Deep Learning; Wang Qing; China Master's Theses Full-text Database, Information Science and Technology; 2017-05-15; full text *

Also Published As

Publication number Publication date
CN107085716A (en) 2017-08-22

Similar Documents

Publication Publication Date Title
CN107085716B (en) Cross-view gait recognition method based on multi-task generative adversarial network
CN108537743B (en) Face image enhancement method based on generative adversarial network
CN110084156B (en) Gait feature extraction method and pedestrian identity recognition method based on gait features
Hu et al. Incremental tensor subspace learning and its applications to foreground segmentation and tracking
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
Guo et al. JointPruning: Pruning networks along multiple dimensions for efficient point cloud processing
Javed et al. OR-PCA with dynamic feature selection for robust background subtraction
CN112488205A (en) Neural network image classification and identification method based on optimized KPCA algorithm
Khan et al. A customized Gabor filter for unsupervised color image segmentation
Wang et al. Incremental MPCA for color object tracking
CN113095149A (en) Full-head texture network structure based on single face image and generation method
CN112418041A (en) Multi-pose face recognition method based on face orthogonalization
CN104077742B (en) Human face sketch synthetic method and system based on Gabor characteristic
Basavaiah et al. Robust Feature Extraction and Classification Based Automated Human Action Recognition System for Multiple Datasets.
Pini et al. Learning to generate facial depth maps
Tin et al. Gender and age estimation based on facial images
CN113887509B (en) Rapid multi-modal video face recognition method based on image set
Pakulich et al. Age recognition from facial images using convolutional neural networks
Langs et al. Modeling the structure of multivariate manifolds: Shape maps
Ali et al. Deep multi view spatio temporal spectral feature embedding on skeletal sign language videos for recognition
Chen et al. Video foreground detection algorithm based on fast principal component pursuit and motion saliency
Rajapakse et al. NMF vs ICA for face recognition
Maghari et al. Quantitative analysis on PCA-based statistical 3D face shape modeling.
CN114360058A (en) Cross-view gait recognition method based on walking view prediction
CN116543419B (en) Hotel health personnel wearing detection method and system based on embedded platform

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant