CN116704612A - Cross-view gait recognition method based on adversarial domain adaptive learning - Google Patents

Cross-view gait recognition method based on adversarial domain adaptive learning

Info

Publication number
CN116704612A
Authority
CN
China
Prior art keywords
gait
feature
cross
formula
aggregation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310711768.9A
Other languages
Chinese (zh)
Inventor
贲晛烨
黄天欢
王亮
庄兆意
单彩峰
黄永祯
郝敬全
辛国茂
郑其荣
刘大扬
李玉军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Watrix Technology Beijing Co ltd
Shenzhen Research Institute Of Shandong University
Institute of Automation of Chinese Academy of Science
Shandong University
Shandong University of Science and Technology
Shandong Jianzhu University
Taihua Wisdom Industry Group Co Ltd
Original Assignee
Watrix Technology Beijing Co ltd
Shenzhen Research Institute Of Shandong University
Institute of Automation of Chinese Academy of Science
Shandong University
Shandong University of Science and Technology
Shandong Jianzhu University
Taihua Wisdom Industry Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Watrix Technology Beijing Co ltd, Shenzhen Research Institute Of Shandong University, Institute of Automation of Chinese Academy of Science, Shandong University, Shandong University of Science and Technology, Shandong Jianzhu University, Taihua Wisdom Industry Group Co Ltd filed Critical Watrix Technology Beijing Co ltd
Priority to CN202310711768.9A priority Critical patent/CN116704612A/en
Publication of CN116704612A publication Critical patent/CN116704612A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G06V40/25 Recognition of walking or running movements, e.g. gait recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a cross-view gait recognition method based on adversarial domain adaptive learning, which comprises the following steps: (1) constructing and training the whole network based on adversarial domain adaptive learning: preprocessing gait contours; dividing gait view-level subdomains; constructing a feature extractor embedded with a hierarchical feature aggregation strategy; constructing a view-change adversarial elimination module; constructing a metric learning module; and adversarially training the whole network based on adversarial domain adaptive learning; (2) cross-view gait recognition: sending the gait contour sequence of the identity to be identified into the trained feature extractor embedded with the hierarchical feature aggregation strategy to obtain gait features, and comparing feature similarity with the registration dataset to complete identification of the sample to be identified. The invention can fully mine the spatio-temporal information in gait sequences while effectively eliminating the interference of view changes; it achieves fuller and more comprehensive spatio-temporal feature extraction from gait contour sequences; and it effectively enhances the network's ability to extract discriminative gait features.

Description

Cross-view gait recognition method based on adversarial domain adaptive learning
Technical Field
The invention relates to a cross-view gait recognition method based on adversarial domain adaptive learning, and belongs to the technical field of deep learning and pattern recognition.
Background
Gait is a physical and behavioral biometric that describes a person's walking pattern. Compared with other traditional biometrics such as the face, fingerprint and iris, gait can easily be captured in remote, non-cooperative scenes and is difficult to disguise. Gait therefore has high application potential in surveillance and security, and gait recognition technology is receiving more and more attention. However, in real scenes gait recognition is affected by many external factors, such as changes in clothing, carrying conditions and camera view angle. Among these, view change is one of the most important factors affecting gait recognition performance, because different views produce large differences in appearance.
Conventional cross-view gait recognition is mainly based on transformations between views, and such methods have produced encouraging performance improvements. However, this direct view transformation is constrained by a transformation model built on the currently known camera views, and it does not scale well to gait recognition under unknown views.
In recent years, cross-view gait recognition methods based on deep learning have emerged, aiming to eliminate the influence of view changes. These methods can be divided into two types: methods based on forced learning and methods based on decoupled learning. The first class aims to learn, regardless of the view differences, a discriminative gait representation independent of view change from gait training data mixed across different views, by means of robust spatio-temporal modeling. The second class aims to separate the view information from the other gait features by decoupling, eliminating its interference so that the model can best learn gait features independent of the camera view. Compared with the traditional transformation-based methods, these elimination-based methods have clear performance advantages, are more flexible, and generalize well to different views. However, current view-elimination methods still suffer from the following drawbacks: 1) in forced learning, the view itself, i.e. explicit view estimation or modeling of a particular view, is to some extent ignored and underestimated; 2) decoupled learning relies on the decomposition and synthesis of gait sequences. Current decoupled-learning methods either decompose and synthesize single frames of a gait sequence, or first compress the whole gait sequence into a single-frame template (such as a GEI) and then decompose and synthesize that template, which inevitably destroys the spatio-temporal correlations in the gait sequence and limits gait recognition performance.
Therefore, how to better exploit view information for elimination while fully mining the spatio-temporal information in gait sequences is the key to improving the performance of cross-view gait recognition based on deep view elimination.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-view gait recognition method based on adversarial domain adaptive learning.
Summary of the invention:
A cross-view gait recognition method based on adversarial domain adaptive learning comprises gait sample preprocessing, gait view-level subdomain division, construction of a feature extractor embedded with a hierarchical feature aggregation strategy, design of a view-change adversarial elimination module, design of a metric learning module, adversarial training of the overall framework, and cross-view gait recognition.
To avoid interference caused by different gait sequence scales, the gait samples are used as input data after contour correction. To fully utilize and eliminate view-change information, gait samples at different views are regarded as subdomains at different view levels, and view change is regarded as a domain shift between subdomains. To fully mine the spatio-temporal information of gait samples from different subdomains, a hierarchical feature aggregation strategy is designed and embedded into a backbone network to construct the gait feature extractor. By organically aggregating the spatio-temporal features of each network stage, a more comprehensive and complementary gait spatio-temporal representation is obtained, effectively enhancing the feature expression capability of the network. To eliminate the disturbance of view change and obtain a more robust local gait feature representation, a view-change adversarial elimination module is designed to perform view-adversarial subdomain adaptive learning on the gait features of different subdomains produced by the feature extractor, so as to reduce the differences between subdomains. Meanwhile, to guarantee the discriminative feature representation capability of the whole framework, a metric learning module is designed; adversarial training of the whole framework is performed with triplet loss and cross-entropy loss on top of the subdomain adversarial loss, and the trained model is finally used for cross-view gait recognition.
Term interpretation:
1. CASIA-B database: a widely used gait dataset with 124 subjects in total. Each subject has 110 gait sequences, collected under three walking conditions, namely normal (NM) (6 groups, indices NM#01-06), with a bag (BG) (2 groups, indices BG#01-02), and with a coat worn (CL) (2 groups, indices CL#01-02). Each group was simultaneously captured from 11 different views (0°-180°, spaced 18° apart). Thus, the CASIA-B database contains 124 × (6+2+2) × 11 = 13,640 gait silhouette sequences in total.
2. OUISIR database: a gait dataset with a larger number of subjects. It consists of 4,007 subjects, 2,135 men and 1,872 women, aged between 1 and 94 years. Under normal walking conditions, 2 groups of gait sequences were collected for each subject, one group (index #01) used as the registration dataset and the other (index #02) used as the query dataset; each group contains gait silhouette sequences captured from 4 different views (55°, 65°, 75°, 85°).
3. Registration dataset (gallery): refers to a dataset made up of gait samples with known labels entered in advance.
4. Query dataset (probe): refers to the dataset used as test input, made up of the gait samples to be identified.
5. Gait recognition: refers to comparing each sample in the query dataset with all samples in the registration dataset, and assigning the query sample the label of the closest sample in the registration dataset.
6. Cross-view gait recognition: the gait samples in the registration dataset and the query dataset are collected at different views; the known single-view gait samples in the registration dataset are used for comparison, so that query samples collected at different views can be identified.
The technical scheme of the invention is as follows:
A cross-view gait recognition method based on adversarial domain adaptive learning comprises the following steps:
(1) Constructing and training the whole network based on adversarial domain adaptive learning, comprising the following steps:
A. gait contour preprocessing
Performing contour correction on the gait contour maps so that the human contour is located at the center of each image, and resizing the corrected gait contour maps;
B. gait view level subdomain partitioning
Dividing the gait samples collected at different views in a single gait database into subdomains at different view levels according to their views; the statistical distribution differences caused by view changes are regarded as domain shifts between subdomains.
C. Constructing the feature extractor embedded with the hierarchical feature aggregation strategy
The feature extractor comprises a basic branch, a hierarchical feature aggregation branch and a feature mapping head;
Gait sequence inputs from different subdomains first enter the basic branch for basic spatio-temporal feature extraction; the outputs of the different network stages of the basic branch are fed into the hierarchical feature aggregation branch for comprehensive global spatio-temporal feature extraction; finally, the output of the hierarchical feature aggregation branch enters the feature mapping head to obtain the final part-based fine-grained gait representation;
D. Constructing the view-change adversarial elimination module
The view-change adversarial elimination module comprises several view-specific discriminators embedded with gradient reversal layers and an adversarial optimization objective defined over the feature extractor and the view-specific discriminators;
An adversarial optimization objective is designed for the output of the feature extractor embedded with the hierarchical feature aggregation strategy, and the feature extractor and the view-specific discriminators in the view-change adversarial elimination module are trained simultaneously in a min-max manner;
E. Constructing the metric learning module
The metric learning module comprises a grouped triplet loss and a cross-entropy loss;
For the output of the feature extractor embedded with the hierarchical feature aggregation strategy, the feature extractor is trained with the grouped triplet loss and the cross-entropy loss simultaneously;
F. Adversarial training of the whole network based on adversarial domain adaptive learning
The whole network based on adversarial domain adaptive learning comprises the feature extractor embedded with the hierarchical feature aggregation strategy, the view-change adversarial elimination module and the metric learning module;
For the output of the feature extractor embedded with the hierarchical feature aggregation strategy, the adversarial optimization objective in step D is combined with the grouped triplet loss and cross-entropy loss in step E to train the feature extractor;
(2) Cross-view gait recognition
Acquiring the gait contour sequence of the identity to be identified, preprocessing the gait contours as in step A, sending the acquired gait contour sequence of the identity to be identified into the trained feature extractor embedded with the hierarchical feature aggregation strategy to obtain gait features, and comparing feature similarity with the registration dataset to complete identification of the sample to be identified.
According to a preferred embodiment of the invention, in step A the gait contour sequences are obtained from the CASIA-B database and the OUISIR database.
According to a preferred embodiment of the present invention, step A, gait contour preprocessing, means performing the following for each gait contour sequence:
a. Read each gait contour sequence, place the part containing the pedestrian at the center of each frame, and correct the image so that the pedestrian's head touches the upper edge of the image and the feet touch the lower edge;
b. Resize the corrected gait contour sequences obtained in step a to the same image size H × W to obtain the final processed pedestrian contour sequences, where H and W denote the height and width of the resized image, respectively.
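As an illustration of steps a and b, the following sketch (using NumPy and OpenCV; function and variable names are ours, not the patent's) centers a binary silhouette frame and resizes it to H × W:

```python
import cv2
import numpy as np

def preprocess_silhouette(frame: np.ndarray, out_h: int = 64, out_w: int = 44) -> np.ndarray:
    """Center a binary gait silhouette and resize it to (out_h, out_w).

    `frame` is a single-channel image whose non-zero pixels belong to the pedestrian.
    """
    ys, xs = np.nonzero(frame)
    if len(ys) == 0:                       # empty frame: nothing to correct
        return np.zeros((out_h, out_w), dtype=frame.dtype)
    # crop vertically so the head touches the top edge and the feet the bottom edge
    body = frame[ys.min():ys.max() + 1, :]
    # scale to the target height while keeping the aspect ratio
    scale = out_h / body.shape[0]
    body = cv2.resize(body, (max(1, int(body.shape[1] * scale)), out_h),
                      interpolation=cv2.INTER_NEAREST)
    # place the horizontal center of mass of the silhouette at the image center
    col_sum = body.sum(axis=0)
    cx = int(round((col_sum * np.arange(body.shape[1])).sum() / max(col_sum.sum(), 1)))
    canvas = np.zeros((out_h, out_w), dtype=frame.dtype)
    left = out_w // 2 - cx                 # shift so that cx lands at out_w // 2
    src_lo, src_hi = max(0, -left), min(body.shape[1], out_w - left)
    dst_lo = max(0, left)
    canvas[:, dst_lo:dst_lo + max(src_hi - src_lo, 0)] = body[:, src_lo:src_hi]
    return canvas
```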
According to a preferred embodiment of the present invention, step B, gait view-level subdomain division, includes:
Dividing the gait silhouette dataset obtained in step A into a training set and a test set, and dividing the labeled training set X into V view-level subdomains $\{X_v\}_{v=1}^{V}$ according to view, where each gait sample $x_i^v$ in subdomain $X_v$ has a corresponding identity label $y_i^v$, and $N_v$ and $P_v$ denote the total number of gait samples and the total number of identities (IDs) contained in subdomain $X_v$, respectively;
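A minimal sketch of the view-level subdomain division, assuming each training sample carries a (subject_id, view) annotation as in CASIA-B (the field names are illustrative only):

```python
from collections import defaultdict

def split_into_view_subdomains(samples):
    """Group labeled training samples into view-level subdomains X_v.

    `samples` is an iterable of dicts such as
    {"seq": ..., "subject_id": 5, "view": 90}; the returned mapping sends each
    view v to the list of samples that form subdomain X_v.
    """
    subdomains = defaultdict(list)
    for s in samples:
        subdomains[s["view"]].append(s)
    return dict(subdomains)   # e.g. {0: [...], 18: [...], ..., 180: [...]} for CASIA-B
```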
According to a preferred embodiment of the present invention, step C, constructing the feature extractor embedded with the hierarchical feature aggregation strategy, includes:
The basic branch includes an initial layer and 3 stacked 3D convolution blocks; the initial layer comprises two 3D convolution layers and performs preliminary spatio-temporal feature extraction on the input gait contour sequence; the 3 stacked 3D convolution blocks take the output features of the initial layer as input and extract spatio-temporal information at different stages;
The hierarchical feature aggregation branch comprises several cross-stage attention aggregation blocks and a low-stage-to-high-stage feature fusion; the cross-stage attention aggregation blocks alleviate the distribution differences and semantic misalignment of heterogeneous features from different network stages, and the low-stage-to-high-stage feature fusion fuses low-stage local detail information with high-stage semantic representations to obtain a more complementary and comprehensive spatio-temporal feature representation;
The feature mapping head includes a horizontal slicing operation (HS), a generalized mean pooling operation (GeM), and a grouped fully-connected mapping operation.
The cross-stage attention aggregation block includes two learnable parameters $\sigma_1$ and $\sigma_2$, a cross-stage attention derivation operation $W_m$, and a cross-stage attention aggregation operation $W_a$. For an input $x$ from any subdomain, $x \in X_v$, the general spatio-temporal features of two adjacent stages of the basic branch are first combined through the learnable parameters, as shown in formula (I):

$\hat{F}_l = \sigma_1 \cdot F_l + \sigma_2 \cdot A_{l-1}$   (I)

In formula (I), $\hat{F}_l \in \mathbb{R}^{C_l \times T_l \times H_l \times W_l}$ is the output of the element-wise weighted addition; $F_l$ denotes the features extracted by the basic branch at stage $l$, $A_{l-1}$ denotes the features produced by the cross-stage attention aggregation at stage $l-1$ of the hierarchical feature aggregation branch, $1 < l \le n$, and $C_l$, $T_l$, $H_l$, $W_l$ denote the number of channels, frames, height and width of the feature, respectively;
Then, a soft attention mask $m_l$ that indicates the importance of each location in $\hat{F}_l$ is generated by the cross-stage attention derivation operation, as shown in formula (II):

$m_l = W_m(\hat{F}_l)$   (II)

In formula (II), $W_m$ comprises a channel pooling, a 3D convolution layer with kernel size 3 × 3 and a sigmoid layer. The generated soft attention mask $m_l$ further guides $W_a$ to perform deep cross-stage attention aggregation, yielding the aggregated feature $A_l \in \mathbb{R}^{C_l^{a} \times T_l \times H_l \times W_l}$, where $C_l^{a}$ denotes the number of channels after cross-stage attention aggregation, as shown in formula (III):

$A_l = W_a(\hat{F}_l \odot m_l)$   (III)

In formula (III), $\odot$ denotes element-wise multiplication, and $W_a$ comprises a 3D convolution layer with kernel size 3 × 3 and a LeakyReLU layer;
The output of each cross-stage attention aggregation block is channel-adjusted by a lattice neg layer and concatenated channel by channel, and finally a max pooling operation generates the final global spatio-temporal feature representation $F_c \in \mathbb{R}^{C' \times H' \times W'}$, where $C'$, $H'$, $W'$ denote the number of channels, height and width of the global spatio-temporal feature $F_c$, respectively, as shown in formula (IV):

$F_c = \mathrm{MP}(\mathrm{cat}(A_1, A_2, \dots, A_n))$   (IV)

In formula (IV), MP denotes the max pooling operation, cat denotes channel-wise concatenation, and $n$ denotes the number of cross-stage attention aggregation blocks;
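A PyTorch sketch of one cross-stage attention aggregation block as we read formulas (I)-(III), together with the channel-wise concatenation and max pooling of formula (IV); the layer names, kernel sizes, channel-pooling choice and the temporal max pooling are assumptions, not the patent's reference implementation:

```python
import torch
import torch.nn as nn

class CrossStageAttentionAggregation(nn.Module):
    """One cross-stage attention aggregation block (formulas (I)-(III))."""

    def __init__(self, channels: int, out_channels: int):
        super().__init__()
        self.sigma1 = nn.Parameter(torch.ones(1))   # learnable weight for the basic-branch feature
        self.sigma2 = nn.Parameter(torch.ones(1))   # learnable weight for the previous aggregated feature
        # W_m: channel pooling -> 3D conv -> sigmoid (attention derivation)
        self.w_m = nn.Sequential(
            nn.Conv3d(1, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # W_a: 3D conv -> LeakyReLU (attention aggregation)
        self.w_a = nn.Sequential(
            nn.Conv3d(channels, out_channels, kernel_size=3, padding=1),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, f_l: torch.Tensor, a_prev: torch.Tensor) -> torch.Tensor:
        # formula (I): element-wise weighted addition of adjacent-stage features
        fused = self.sigma1 * f_l + self.sigma2 * a_prev
        # channel pooling (mean over channels), then formula (II): soft attention mask m_l
        pooled = fused.mean(dim=1, keepdim=True)            # (N, 1, T, H, W)
        m_l = self.w_m(pooled)
        # formula (III): mask-guided aggregation
        return self.w_a(fused * m_l)


def fuse_stages(aggregated: list) -> torch.Tensor:
    """Formula (IV): concatenate the block outputs channel-wise, then max-pool over time."""
    f_cat = torch.cat(aggregated, dim=1)                    # (N, C', T, H', W')
    return f_cat.max(dim=2).values                          # temporal max pooling -> (N, C', H', W')
```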
In the feature mapping head, the global spatio-temporal feature $F_c$ output by the hierarchical feature aggregation branch is first split along the horizontal direction into $H'$ part-based strip spaces, and refined features are extracted from each horizontal strip with generalized mean pooling, as shown in formula (V):

$F_{GeM} = W_{GeM}(F_c')$   (V)

In formula (V), $F_{GeM} = \{f^{GeM}_h\}_{h=1}^{H'}$ is the part-based fine-grained feature after the generalized mean pooling operation, where $f^{GeM}_h$ denotes the fine-grained feature of each part-based strip space, $W_{GeM}(\cdot)$ denotes generalized mean pooling, and $F_c' \in \mathbb{R}^{C' \times H' \times W'}$ denotes the horizontally split feature;
Subsequently, each horizontal strip feature $f^{GeM}_h$ in $F_{GeM}$ is mapped to a more discriminative representation space using a grouped fully-connected layer, as shown in formula (VI):

$F_{FM} = W_{SFC}(F_{GeM})$   (VI)

In formula (VI), $F_{FM}$ denotes the output after the grouped feature mapping, and $W_{SFC}$ denotes the grouped fully-connected mapping operation.
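A PyTorch sketch of the feature mapping head (horizontal slicing, the GeM pooling of formula (V), and the grouped fully-connected mapping of formula (VI)); the per-strip parameterization and the learnable GeM exponent are our assumptions:

```python
import torch
import torch.nn as nn

class FeatureMappingHead(nn.Module):
    """Horizontal slicing + generalized mean (GeM) pooling + grouped FC mapping."""

    def __init__(self, in_channels: int, out_channels: int, num_strips: int, p: float = 3.0):
        super().__init__()
        self.num_strips = num_strips                          # H' part-based strips
        self.p = nn.Parameter(torch.tensor(p))                # learnable GeM exponent
        # one fully-connected mapping per horizontal strip ("grouped" FC), formula (VI)
        self.sfc = nn.Parameter(torch.randn(num_strips, in_channels, out_channels) * 0.01)

    def forward(self, f_c: torch.Tensor) -> torch.Tensor:
        # f_c: (N, C', H', W') global spatio-temporal feature from the aggregation branch
        n, c, h, w = f_c.shape
        assert h == self.num_strips
        # formula (V): GeM pooling inside each horizontal strip
        f_gem = f_c.clamp(min=1e-6).pow(self.p).mean(dim=3).pow(1.0 / self.p)   # (N, C', H')
        f_gem = f_gem.permute(0, 2, 1)                                           # (N, H', C')
        # formula (VI): strip-wise (grouped) fully-connected mapping
        return torch.einsum("nhc,hco->nho", f_gem, self.sfc)                     # (N, H', out_channels)
```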
According to a preferred embodiment of the present invention, step D, constructing the view-change adversarial elimination module, includes:
Constructing a binary discriminator embedded with a gradient reversal layer for each view-specific subdomain, i.e. a view-specific discriminator embedded with a gradient reversal layer, to judge whether each of the acquired part-based fine-grained features $F_{GeM}$ comes from this subdomain or from the other subdomains; therefore V binary discriminators are used in total, and the parameters of all gradient reversal layers are shared;
For all acquired part-based fine-grained features $F_{GeM}$, an adversarial optimization objective is constructed over the feature extractor embedded with the hierarchical feature aggregation strategy and the view-specific discriminators: by minimizing the loss of the view-specific discriminators, each discriminator is made able to judge whether each part-based fine-grained gait feature input comes from its specific subdomain; meanwhile, the loss of the view-specific discriminators is maximized to tune the feature extractor to produce gait features that confuse the view discriminators;
Specifically, for the view-specific discriminator $D_v$ with parameters $\theta_d^v$, each part-based fine-grained feature $f^{GeM}_h$ of a sample input $x$ is normalized and then fed into $D_v$ through the gradient reversal layer (GRL); the corresponding output is sent into a softmax layer to obtain a two-dimensional probability output $z \in \mathbb{R}^2$, as shown in formula (VII):

$z_h = \delta\big(D_v(\mathrm{GRL}(\bar{f}^{GeM}_h))\big)$   (VII)

In formula (VII), $\delta(\cdot)$ denotes the softmax function and $\bar{f}^{GeM}_h$ denotes the normalized strip feature; the sample input $x$ has $H'$ probability outputs in total;
The view-specific discriminator is then trained with a binary cross-entropy loss defined over all part-based fine-grained features, as shown in formula (VIII):

$L_{adv}^{v}(G_F, D_v) = \langle L_{BCE}(z_h(x),\, d_v(x)) \rangle_{x,h}$   (VIII)

In formula (VIII), $G_F$ denotes the feature extractor embedded with the hierarchical feature aggregation strategy, $L_{BCE}$ denotes the binary cross-entropy loss, $d_v(x)$ indicates whether sample $x$ belongs to subdomain $X_v$, and $\langle \cdot \rangle$ denotes averaging over the subscripted set; the joint minimization objective over all view-specific discriminators is shown in formula (IX):

$\min_{D_1, \dots, D_V} \sum_{v=1}^{V} L_{adv}^{v}(G_F, D_v)$   (IX)

The maximization objective for optimizing the feature extractor $G_F$ is shown in formula (X):

$\max_{G_F} \sum_{v=1}^{V} L_{adv}^{v}(G_F, D_v)$   (X)
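A hedged PyTorch sketch of the view-specific discriminators and the adversarial objective of formulas (VII)-(X); the discriminator architecture and the way the subdomain label is formed are assumptions, and instead of a gradient reversal layer the two opposite-signed objectives of formulas (IX) and (X) are returned explicitly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewDiscriminator(nn.Module):
    """Binary (this-subdomain vs. other) discriminator D_v for one view-level subdomain."""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.LeakyReLU(inplace=True),
            nn.Linear(hidden, 2),                      # two-dimensional output, formula (VII)
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.net(F.normalize(f, dim=-1))        # normalized strip feature -> logits


def adversarial_losses(discriminators, f_fm: torch.Tensor, views: torch.Tensor):
    """Formulas (VIII)-(X): per-subdomain BCE losses averaged over strips and summed.

    f_fm: (N, H', C) part-based fine-grained features; views: (N,) subdomain index per sample.
    Returns (loss_D, loss_G): minimize loss_D w.r.t. the discriminators (IX), and
    minimize loss_G (the negated sum) w.r.t. the feature extractor, which maximizes (X).
    """
    n, num_strips, _ = f_fm.shape
    total = f_fm.new_zeros(())
    for v, d_v in enumerate(discriminators):
        target = (views == v).long()                               # 1 if the sample is from subdomain v
        logits = d_v(f_fm.reshape(n * num_strips, -1))             # all H' strips go through D_v
        total = total + F.cross_entropy(logits, target.repeat_interleave(num_strips))
    return total, -total
```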
according to the preferred embodiment of the present invention, step E, a metric learning module is constructed, which enhances the discriminant of gait representations, including grouped triplet loss and cross entropy loss, by minimizing the distance between gait samples from different view level subfields but belonging to the same ID; comprising the following steps:
c. Use the triplet loss to apply a discriminative constraint to all acquired part-based fine-grained features $F_{GeM}$, taking the reduction of the triplet loss as the training objective; the loss function is shown in formulas (XI) and (XII):

$L_{tri} = \frac{1}{N_{tri}} \sum_{h=1}^{H'} \sum \big[\, m + d^{+} - d^{-} \,\big]_{+}$   (XI)

$d^{+} = \big\| f^{h}_{p,u} - f^{h}_{p,a} \big\|_2 , \quad d^{-} = \big\| f^{h}_{p,u} - f^{h}_{b,c} \big\|_2$   (XII)

In formulas (XI) and (XII), the inner sum runs over all valid (anchor, positive, negative) combinations in a mini-batch, $[\cdot]_{+} = \max(\cdot, 0)$, $(P, U)$ denote the number of subjects in one mini-batch and the number of gait contour sequences per subject, $N_{tri}$ denotes the number of non-zero terms in the loss, $H'$ denotes the number of part-based fine-grained features after horizontal splitting, $m$ denotes the margin of the triplet loss, $f^{h}_{p,u}$ denotes the $h$-th horizontal feature of the $u$-th gait contour sequence of the $p$-th subject, $f^{h}_{p,a}$ denotes the $h$-th horizontal feature of the $a$-th gait contour sequence of the $p$-th subject, $f^{h}_{b,c}$ denotes the $h$-th horizontal feature of the $c$-th gait contour sequence of the $b$-th subject, and $d^{+}$ and $d^{-}$ are the Euclidean distances to the positive and negative samples, respectively;
d. Use the cross-entropy loss to apply a discriminative constraint to all acquired part-based fine-grained features $F_{GeM}$, taking the reduction of the cross-entropy loss as the training objective; the loss function is shown in formula (XIII):

$L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{h=1}^{H'} y^{h}_{i} \log \hat{y}^{h}_{i}$   (XIII)

In formula (XIII), $N = PU$ denotes the total number of gait contour sequences in one mini-batch, and $y^{h}_{i}$ and $\hat{y}^{h}_{i}$ denote the ground-truth identity label and the predicted identity label of the $h$-th horizontal feature of the $i$-th sequence, respectively.
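A sketch of the grouped (per-strip) triplet loss of formulas (XI)-(XII) and the cross-entropy loss of formula (XIII) in PyTorch; the batch-all triplet mining strategy is our reading of the formulas, and the classifier head producing the logits is assumed:

```python
import torch
import torch.nn.functional as F

def grouped_triplet_loss(f_fm: torch.Tensor, labels: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Batch-all triplet loss applied independently to each of the H' strip features.

    f_fm: (N, H', C) part-based fine-grained features; labels: (N,) subject IDs.
    """
    n, num_strips, _ = f_fm.shape
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)          # (N, N) same-ID pairs
    losses = []
    for h in range(num_strips):
        d = torch.cdist(f_fm[:, h], f_fm[:, h])                    # (N, N) Euclidean distances
        # every (anchor, positive, negative) combination: m + d+ - d-
        trip = d.unsqueeze(2) - d.unsqueeze(1) + margin            # trip[a, p, neg]
        valid = pos_mask.unsqueeze(2) & (~pos_mask).unsqueeze(1)   # positive same ID, negative different ID
        losses.append(F.relu(trip[valid]))
    all_terms = torch.cat(losses)
    n_nonzero = (all_terms > 0).sum().clamp(min=1)                 # N_tri in formula (XI)
    return all_terms.sum() / n_nonzero


def grouped_cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Formula (XIII): ID classification loss averaged over all strips and sequences.

    logits: (N, H', num_ids) per-strip ID predictions from a classifier head.
    """
    n, num_strips, num_ids = logits.shape
    return F.cross_entropy(logits.reshape(n * num_strips, num_ids),
                           labels.repeat_interleave(num_strips))
```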
According to a preferred embodiment of the present invention, step F, training the whole network based on adversarial domain adaptive learning, includes:
Combining the adversarial optimization objective with the grouped triplet loss and the cross-entropy loss, the total objective function for training the whole network based on adversarial domain adaptive learning is shown in formula (XIV):

$L_{total} = L_{tri} + L_{ce} - \beta \sum_{v=1}^{V} L_{adv}^{v}(G_F, D_v)$   (XIV)

In formula (XIV), $\beta$ is a trade-off parameter that balances the two optimization goals of discriminative model learning and view-change elimination.
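Putting the pieces together, a minimal sketch of the total training objective of formula (XIV), using the loss helpers sketched above (β and the explicit sign handling follow our reading; in the patent the sign flip is realized implicitly by the gradient reversal layer):

```python
def total_loss(l_tri, l_ce, l_adv_sum, beta: float = 0.01):
    """Formula (XIV): discriminative losses minus beta times the summed adversarial losses.

    Minimizing this with respect to the feature extractor maximizes the view-discriminator
    loss, which is what the gradient reversal layer achieves during back-propagation.
    """
    return l_tri + l_ce - beta * l_adv_sum
```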
According to the invention, in step (2) the gait contour sequences are obtained by segmenting pedestrian videos captured by cameras in the actual scene.
According to a preferred embodiment of the present invention, step (2), cross-view gait recognition, includes:
e. Send the registration dataset, after the preprocessing of step A, into the trained whole network based on adversarial domain adaptive learning; the $H'$ part-based fine-grained features output by the trained feature extractor embedded with the hierarchical feature aggregation strategy are concatenated as the overall feature representation of each gait contour sequence, finally obtaining the feature database of the registration dataset;
f. Send each sample of the query dataset to be identified, after the preprocessing of step A, into the trained feature extractor embedded with the hierarchical feature aggregation strategy to obtain the features of the query dataset; compute the Euclidean distance between each gait sample feature in the query dataset and all features of the registration dataset obtained in step e, identify the query sample with the label of the registration feature that has the smallest Euclidean distance, and output the identity label of the query sample to complete the recognition.
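A sketch of steps e and f (gallery feature extraction and nearest-neighbour matching by Euclidean distance); `extractor` stands for the trained feature extractor and is assumed to return the H' part features of one sequence, which are concatenated as described above:

```python
import torch

@torch.no_grad()
def build_gallery(extractor, gallery_loader):
    """Step e: concatenate the H' part features of every registered sequence."""
    feats, labels = [], []
    for seq, label in gallery_loader:      # label is a (1,) tensor with the subject ID
        f = extractor(seq)                 # (1, H', C) features of one gait contour sequence
        feats.append(f.flatten(1))         # concatenate strips -> (1, H' * C)
        labels.append(label)
    return torch.cat(feats), torch.cat(labels)

@torch.no_grad()
def identify(extractor, probe_seq, gallery_feats, gallery_labels):
    """Step f: assign the probe the label of the gallery feature with the smallest distance."""
    q = extractor(probe_seq).flatten(1)                     # (1, H' * C)
    dists = torch.cdist(q, gallery_feats)                   # Euclidean distances to all gallery entries
    return gallery_labels[dists.argmin(dim=1)]              # predicted identity label
```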
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the cross-view gait recognition method based on adversarial domain adaptive learning when executing the computer program.
A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the steps of the cross-view gait recognition method based on adversarial domain adaptive learning.
The beneficial effects of the invention are as follows:
1. The cross-view gait recognition based on adversarial domain adaptive learning provided by the invention differs from previous methods based on view transformation or view elimination: the interference problem of view change in cross-view gait recognition is converted into a view-level subdomain adaptation problem, and a novel subdomain alignment method is used, so that the network can fully mine the spatio-temporal information in gait sequences and effectively eliminate the interference of view changes.
2. The feature extractor embedded with the hierarchical feature aggregation strategy achieves fuller and more comprehensive spatio-temporal feature extraction from gait contour sequences. By introducing a multi-stage attention aggregation mechanism, the local detail information extracted in the shallow layers of the network and the high-level semantic representations can be effectively aggregated, effectively enhancing the network's ability to extract discriminative gait features.
3. The view-change adversarial elimination module effectively alleviates the influence of view change in cross-view gait recognition through adversarial learning among multiple view-level subdomains. Meanwhile, the module's simple adversarial optimization scheme allows the whole network to be implemented and trained end-to-end, which helps further improve the robustness of the gait features.
Drawings
FIG. 1 is a schematic diagram of the feature extractor embedded with the hierarchical feature aggregation strategy in the present invention;
FIG. 2 is a schematic diagram of a cross-stage attention aggregation block in the present invention;
FIG. 3 is an overall framework diagram of cross-view gait recognition based on adversarial domain adaptive learning in the present invention;
FIG. 4 is a schematic diagram of the structure of the view-change adversarial elimination module in the present invention.
Detailed Description
The invention is further described by, but not limited to, the following examples in conjunction with the drawings of the specification.
Example 1
A cross-view gait recognition method based on adversarial domain adaptive learning, as shown in FIG. 3, comprises the following steps:
(1) Constructing and training the whole network based on adversarial domain adaptive learning, comprising the following steps:
A. gait contour preprocessing
Performing contour correction on the gait contour maps so that the human contour is located at the center of each image, and resizing the corrected gait contour maps;
B. gait view level subdomain partitioning
Dividing the gait samples collected at different views in a single gait database into subdomains at different view levels according to their views; the statistical distribution differences caused by view changes are regarded as domain shifts between subdomains.
C. Constructing the feature extractor embedded with the hierarchical feature aggregation strategy
The feature extractor embedded with the hierarchical feature aggregation strategy takes a general spatio-temporal feature network as its basic branch, and by introducing a multi-level attention aggregation mechanism it effectively aggregates the local detail information extracted in the shallow layers of the network with the high-level semantic representations, so as to extract fuller and more comprehensive spatio-temporal information from gait contour sequences.
The specific structure of the feature extractor embedded with the hierarchical feature aggregation strategy is shown in FIG. 1; it comprises a basic branch, a hierarchical feature aggregation branch and a feature mapping head;
Gait sequence inputs from different subdomains first enter the basic branch for basic spatio-temporal feature extraction; the outputs of the different network stages of the basic branch are fed into the hierarchical feature aggregation branch for comprehensive global spatio-temporal feature extraction; finally, the output of the hierarchical feature aggregation branch enters the feature mapping head to obtain the final part-based fine-grained gait representation;
D. Constructing the view-change adversarial elimination module
The view-change adversarial elimination module is the key domain-adaptation component of the whole cross-view gait recognition network based on adversarial domain adaptive learning. Its structure is shown in FIG. 4: the module comprises several view-specific discriminators embedded with gradient reversal layers and an adversarial optimization objective defined over the feature extractor and the view-specific discriminators;
The view-change adversarial elimination module takes the gait features generated by the feature extractor embedded with the hierarchical feature aggregation strategy and, through an adversarial learning process, tries to distinguish whether they come from different subdomains. That is, the feature extractor aims to generate gait representations that confuse the view discriminators, while the view-change adversarial elimination module feeds back to the feature extractor through adversarial learning so that it generates better subdomain-invariant gait features that confuse the view discriminators. Through the adversarial optimization of the feature extractor embedded with the hierarchical feature aggregation strategy and the view-change adversarial elimination module, the whole cross-view gait recognition network based on adversarial domain adaptive learning can extract gait features that are simultaneously highly discriminative and subdomain-invariant;
An adversarial optimization objective is designed for the output of the feature extractor embedded with the hierarchical feature aggregation strategy, and the feature extractor and the view-specific discriminators in the view-change adversarial elimination module are trained simultaneously in a min-max manner, so as to promote the view robustness of the part-based fine-grained gait features;
E. Constructing the metric learning module
The metric learning module comprises a grouped triplet loss and a cross-entropy loss;
For the output of the feature extractor embedded with the hierarchical feature aggregation strategy, the feature extractor is trained with the grouped triplet loss and the cross-entropy loss simultaneously, so as to further ensure the high discriminability of the gait representation in the feature space;
F. Adversarial training of the whole network based on adversarial domain adaptive learning
The whole network based on adversarial domain adaptive learning comprises the feature extractor embedded with the hierarchical feature aggregation strategy, the view-change adversarial elimination module and the metric learning module;
For the output of the feature extractor embedded with the hierarchical feature aggregation strategy, the adversarial optimization objective in step D is combined with the grouped triplet loss and cross-entropy loss in step E to train the feature extractor, so as to obtain gait feature representations that are simultaneously highly discriminative and robust to view changes;
(2) Cross-view gait recognition
Acquiring the gait contour sequence of the identity to be identified, preprocessing the gait contours as in step A, sending the acquired gait contour sequence of the identity to be identified into the trained feature extractor embedded with the hierarchical feature aggregation strategy to obtain gait features, and comparing feature similarity with the registration dataset to complete identification of the sample to be identified.
Example 2
The cross-view gait recognition method based on adversarial domain adaptive learning according to embodiment 1 differs in that:
In step A, the gait contour sequences are obtained from the CASIA-B database and the OUISIR database.
Step A, gait contour preprocessing: given a dataset $X = \{x_i \mid i = 1, 2, \dots, N\}$ containing N gait contour sequences, each gait contour sequence (one gait contour sequence is one gait contour video, comprising multiple frames of gait contours) is processed as follows:
a. Read each gait contour sequence and place the part containing the pedestrian at the center of each frame, so as to avoid interference caused by different distances between the pedestrian and the camera; correct the image so that the pedestrian's head touches the upper edge of the image and the feet touch the lower edge;
b. Resize the corrected gait contour sequences obtained in step a to the same image size H × W to obtain the final processed pedestrian contour sequences, which serve as the input of the cross-view gait recognition network based on adversarial domain adaptive learning; H and W denote the height and width of the resized image, respectively.
Step B, gait view-level subdomain division, comprising:
Dividing the gait silhouette dataset obtained in step A into a training set and a test set, and dividing the labeled training set X into V view-level subdomains $\{X_v\}_{v=1}^{V}$ according to view, where each gait sample $x_i^v$ in subdomain $X_v$ has a corresponding identity label $y_i^v$, and $N_v$ and $P_v$ denote the total number of gait samples and the total number of identities (IDs) contained in subdomain $X_v$, respectively;
Step C, constructing the feature extractor embedded with the hierarchical feature aggregation strategy, comprising:
As shown in FIG. 1, the basic branch includes an initial layer and 3 stacked 3D convolution blocks; the initial layer comprises two 3D convolution layers and performs preliminary spatio-temporal feature extraction on the input gait contour sequence; it is also referred to as "stage 0"; the 3 stacked 3D convolution blocks take the output features of the initial layer as input and extract conventional spatio-temporal information at different stages; for convenience of the following description, these 3 consecutive 3D convolution blocks are also referred to as "stage 1", "stage 2" and "stage 3"; the specific structure of the whole basic branch is shown in Table 1 (where the LeakyReLU layer after each 3D convolution layer is omitted).
TABLE 1
The hierarchical feature aggregation branch comprises several cross-stage attention aggregation blocks and a low-stage-to-high-stage feature fusion; the cross-stage attention aggregation blocks alleviate the distribution differences and semantic misalignment of heterogeneous features from different network stages, and the low-stage-to-high-stage feature fusion fuses low-stage local detail information with high-stage semantic representations to obtain a more complementary and comprehensive spatio-temporal feature representation;
The feature mapping head includes a horizontal slicing operation (HS), a generalized mean pooling operation (GeM), and a grouped fully-connected mapping operation.
As shown in fig. 2, the cross-phase attention aggregation block includes two learnable parameters σ 1 Sum sigma 2 A cross-stage attention derivative operation W m And a cross-phase attention aggregation operation W a The method comprises the steps of carrying out a first treatment on the surface of the For input X from any subdomain, X ε X v The method comprises the steps of carrying out a first treatment on the surface of the Firstly, general space-time characteristics of two adjacent stages in a basic branch are initially combined through a learnable parameter, and the general space-time characteristics are expressed as shown in a formula (I):
in the formula (I),is the output of the element-by-element weighted addition; />Representing the features extracted by the basic branch in stage I, < >>Features representing first-order attention aggregation in first-1 stage of hierarchical feature aggregation branch, 1<l≤n,C l 、T l 、H l 、W l Respectively express characteristic->Channel number, frame number, height and width; it should be noted that the first cross-stage attention aggregation block of the hierarchical feature aggregation branch takes as input the features extracted by the basic branch (stage 1), and the subsequent first cross-stage attention aggregation block takes as input the ∈branch ∈ ->And features of the previous order attention aggregation in hierarchical feature aggregation branches->As input.
Then, a soft attention mask $m_l$ that indicates the importance of each location in $\hat{F}_l$ is generated by the cross-stage attention derivation operation, as shown in formula (II):

$m_l = W_m(\hat{F}_l)$   (II)

In formula (II), $W_m$ comprises a channel pooling, a 3D convolution layer with kernel size 3 × 3 and a sigmoid layer. Unlike the commonly used direct global feature aggregation along multiple stages, the cross-stage attention aggregation block takes the differences between the features of different stages into account, so as to regenerate more accurate attention weights. The generated soft attention mask $m_l$ further guides $W_a$ to perform deep cross-stage attention aggregation, yielding the aggregated feature $A_l \in \mathbb{R}^{C_l^{a} \times T_l \times H_l \times W_l}$, where $C_l^{a}$ denotes the number of channels after cross-stage attention aggregation, as shown in formula (III):

$A_l = W_a(\hat{F}_l \odot m_l)$   (III)

In formula (III), $\odot$ denotes element-wise multiplication, and $W_a$ comprises a 3D convolution layer with kernel size 3 × 3 and a LeakyReLU layer. Through this attention aggregation, the misalignment of heterogeneous features at different stages is effectively alleviated, and more discriminative cross-stage spatio-temporal features can be extracted.
To further encourage semantic complementarity of the high-stage features, feature fusion from low stage to high stage is performed. The output of each cross-stage attention aggregation block is channel-adjusted by a lattice neg layer and concatenated channel by channel, and finally a max pooling operation generates the final global spatio-temporal feature representation $F_c \in \mathbb{R}^{C' \times H' \times W'}$, where $C'$, $H'$, $W'$ denote the number of channels, height and width of the global spatio-temporal feature $F_c$, respectively, as shown in formula (IV):

$F_c = \mathrm{MP}(\mathrm{cat}(A_1, A_2, \dots, A_n))$   (IV)

In formula (IV), MP denotes the max pooling operation, cat denotes channel-wise concatenation, and $n$ denotes the number of cross-stage attention aggregation blocks;
In the feature mapping head, the global spatio-temporal feature $F_c$ output by the hierarchical feature aggregation branch is first split along the horizontal direction into $H'$ part-based strip spaces, and refined features are extracted from each horizontal strip with generalized mean pooling, as shown in formula (V):

$F_{GeM} = W_{GeM}(F_c')$   (V)

In formula (V), $F_{GeM} = \{f^{GeM}_h\}_{h=1}^{H'}$ is the part-based fine-grained feature after the generalized mean pooling operation, where $f^{GeM}_h$ denotes the fine-grained feature of each part-based strip space, $W_{GeM}(\cdot)$ denotes generalized mean pooling, and $F_c' \in \mathbb{R}^{C' \times H' \times W'}$ denotes the horizontally split feature;
Subsequently, each horizontal strip feature $f^{GeM}_h$ in $F_{GeM}$ is mapped to a more discriminative representation space using a grouped fully-connected layer, as shown in formula (VI):

$F_{FM} = W_{SFC}(F_{GeM})$   (VI)

In formula (VI), $F_{FM}$ denotes the output after the grouped feature mapping, and $W_{SFC}$ denotes the grouped fully-connected mapping operation.
Step D, constructing the view-change adversarial elimination module, comprising:
Constructing a binary (one vs. other) discriminator embedded with a gradient reversal layer for each view-specific subdomain, i.e. a view-specific discriminator embedded with a gradient reversal layer, to judge whether each of the acquired part-based fine-grained features $F_{GeM}$ comes from this subdomain or from the other subdomains; therefore V binary discriminators are used in total, and the parameters of all gradient reversal layers are shared;
For all acquired part-based fine-grained features $F_{GeM}$, an adversarial optimization objective is constructed over the feature extractor embedded with the hierarchical feature aggregation strategy and the view-specific discriminators: by minimizing the loss of the view-specific discriminators, each discriminator is made able to judge whether each part-based fine-grained gait feature input comes from its specific subdomain; meanwhile, the loss of the view-specific discriminators is maximized to tune the feature extractor to produce gait features that confuse the view discriminators, thereby reducing the differences between subdomains;
The view-change adversarial elimination module aims to reduce the distribution differences among multiple view-level subdomains simultaneously, without designating any particular source or target domain; within the module, each subdomain can be regarded in turn as a temporary target domain with the other subdomains as source domains, adversarial optimization is used to reduce the differences between source and target domains, and finally, through iterative training, the gait information under different camera views is mapped into a common feature embedding space;
Specifically, for the view-specific discriminator $D_v$ with parameters $\theta_d^v$, each part-based fine-grained feature $f^{GeM}_h$ of a sample input $x$ is normalized and then fed into $D_v$ through the gradient reversal layer (GRL); the corresponding output is sent into a softmax layer to obtain a two-dimensional probability output $z \in \mathbb{R}^2$, as shown in formula (VII):

$z_h = \delta\big(D_v(\mathrm{GRL}(\bar{f}^{GeM}_h))\big)$   (VII)

In formula (VII), $\delta(\cdot)$ denotes the softmax function and $\bar{f}^{GeM}_h$ denotes the normalized strip feature; the sample input $x$ has $H'$ probability outputs in total;
The view-specific discriminator is then trained with a binary cross-entropy loss defined over all part-based fine-grained features, as shown in formula (VIII):

$L_{adv}^{v}(G_F, D_v) = \langle L_{BCE}(z_h(x),\, d_v(x)) \rangle_{x,h}$   (VIII)

In formula (VIII), $G_F$ denotes the feature extractor embedded with the hierarchical feature aggregation strategy, $L_{BCE}$ denotes the binary cross-entropy loss, $d_v(x)$ indicates whether sample $x$ belongs to subdomain $X_v$, and $\langle \cdot \rangle$ denotes averaging over the subscripted set; the joint minimization objective over all view-specific discriminators is shown in formula (IX):

$\min_{D_1, \dots, D_V} \sum_{v=1}^{V} L_{adv}^{v}(G_F, D_v)$   (IX)

The gradient reversal layer (GRL) automatically converts the maximization problem into minimizing a negated loss, thereby keeping the network optimization consistent. Thus, using the gradient reversal layer converts maximizing the subdomain view-discriminator loss (i.e., formula (IX)) into minimizing the negated discriminator loss, so as to reduce the distribution differences among the subdomains. In this way, the maximization objective for optimizing the feature extractor $G_F$ is shown in formula (X):

$\max_{G_F} \sum_{v=1}^{V} L_{adv}^{v}(G_F, D_v)$   (X)
During forward propagation, the GRL is just an ordinary layer and introduces no additional operations. During back-propagation, the GRL reverses the gradient of the optimization objective (formula (IX)) with respect to the parameters of the feature extractor and multiplies it by a negative weight α before passing it back. With the GRL, view-level subdomain adversarial alignment can be achieved in an end-to-end fashion, without the alternating training used by generative adversarial networks in which the generator and discriminator are fixed in turn; this greatly simplifies the implementation of the whole network and facilitates mining the spatio-temporal information in gait sequences. As the network is optimized, robust gait feature representations that all the view-specific discriminators find hard to distinguish are finally extracted, effectively reducing the gap between subdomains.
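A minimal PyTorch sketch of the gradient reversal layer described here: an identity mapping in the forward pass that multiplies the incoming gradient by -α in the backward pass (the name `GradReverse` is ours):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient multiplied by -alpha in the backward pass."""

    @staticmethod
    def forward(ctx, x, alpha: float = 1.0):
        ctx.alpha = alpha
        return x.view_as(x)                      # behaves like an ordinary layer going forward

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None    # reverse (and scale) the gradient; None for alpha

def grad_reverse(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, alpha)
```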
Step E, constructing the metric learning module, comprising:
c. Use the triplet loss to apply a discriminative constraint to all acquired part-based fine-grained features $F_{GeM}$, taking the reduction of the triplet loss as the training objective; the loss function is shown in formulas (XI) and (XII):

$L_{tri} = \frac{1}{N_{tri}} \sum_{h=1}^{H'} \sum \big[\, m + d^{+} - d^{-} \,\big]_{+}$   (XI)

$d^{+} = \big\| f^{h}_{p,u} - f^{h}_{p,a} \big\|_2 , \quad d^{-} = \big\| f^{h}_{p,u} - f^{h}_{b,c} \big\|_2$   (XII)

In formulas (XI) and (XII), the inner sum runs over all valid (anchor, positive, negative) combinations in a mini-batch, $[\cdot]_{+} = \max(\cdot, 0)$, $(P, U)$ denote the number of subjects in one mini-batch and the number of gait contour sequences per subject, $N_{tri}$ denotes the number of non-zero terms in the loss, $H'$ denotes the number of part-based fine-grained features after horizontal splitting, $m$ denotes the margin of the triplet loss, $f^{h}_{p,u}$ denotes the $h$-th horizontal feature of the $u$-th gait contour sequence of the $p$-th subject, $f^{h}_{p,a}$ denotes the $h$-th horizontal feature of the $a$-th gait contour sequence of the $p$-th subject, $f^{h}_{b,c}$ denotes the $h$-th horizontal feature of the $c$-th gait contour sequence of the $b$-th subject, and $d^{+}$ and $d^{-}$ are the Euclidean distances to the positive and negative samples, respectively;
d. Use the cross-entropy loss to apply a discriminative constraint to all acquired part-based fine-grained features $F_{GeM}$, taking the reduction of the cross-entropy loss as the training objective; the loss function is shown in formula (XIII):

$L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{h=1}^{H'} y^{h}_{i} \log \hat{y}^{h}_{i}$   (XIII)

In formula (XIII), $N = PU$ denotes the total number of gait contour sequences in one mini-batch, and $y^{h}_{i}$ and $\hat{y}^{h}_{i}$ denote the ground-truth identity label and the predicted identity label of the $h$-th horizontal feature of the $i$-th sequence, respectively.
Step F, training the whole network based on the adaptive learning of the reactance domain, comprising the following steps:
Combining the adversarial optimization objective with the grouped triplet loss and the cross-entropy loss, the total objective function for training the whole network based on adversarial domain adaptive learning is shown in formula (XIV):

$L_{total} = L_{tri} + L_{ce} - \beta \sum_{v=1}^{V} L_{adv}^{v}(G_F, D_v)$   (XIV)

In formula (XIV), $\beta$ is a trade-off parameter that balances the two optimization goals of discriminative model learning and view-change elimination. As can be seen from formula (XIV), the purpose of model optimization is to minimize the triplet loss and the cross-entropy loss, i.e. the smaller the distance between sample features of the same person from different views, and the larger the distance between sample features of different persons, the better, so that all gait samples can be correctly classified by ID; at the same time, the feature extractor is trained to maximize the loss of the view discriminators in the view-change adversarial elimination module, i.e. the feature extractor is tuned to generate gait features that confuse the view discriminators, thereby weakening the influence of view changes. In this way, gait features that are highly ID-discriminative and view-invariant can be extracted, realizing cross-view gait recognition.
In step (2), the gait contour sequences are obtained by segmenting pedestrian videos captured by cameras in the actual scene.
Step (2), cross-view gait recognition, comprising:
e. Send the registration dataset, after the preprocessing of step A, into the trained whole network based on adversarial domain adaptive learning; the $H'$ part-based fine-grained features output by the trained feature extractor embedded with the hierarchical feature aggregation strategy are concatenated as the overall feature representation of each gait contour sequence, finally obtaining the feature database of the registration dataset;
f. Send each sample of the query dataset to be identified, after the preprocessing of step A, into the trained feature extractor embedded with the hierarchical feature aggregation strategy to obtain the features of the query dataset; compute the Euclidean distance between each gait sample feature in the query dataset and all features of the registration dataset obtained in step e, identify the query sample with the label of the registration feature that has the smallest Euclidean distance, and output the identity label of the query sample to complete the recognition.
In this embodiment, the gait contour sequence is first preprocessed and the size H×W of the input gait contour map is set to 64×44. All experiments in this example were trained with the Adam optimizer, with the momentum set to 0.9 and the learning rate set to 1e-4. The margin of the triplet loss is set to 0.2. In the training stage, 30 consecutive frames are randomly selected from each preprocessed gait contour sequence as the model input. In the test stage, all frames of the preprocessed gait contour sequence are used to obtain the final feature representation, and the Rank-1 accuracy is selected to measure the gait recognition performance of the model.
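The random selection of 30 consecutive frames during training and the use of all frames at test time can be sketched as follows; the array shape and function name are illustrative assumptions:

import random

def sample_clip(silhouettes, clip_len=30, training=True):
    # silhouettes: (T, 64, 44) preprocessed gait contour sequence
    # the embodiments use Adam with beta1 = 0.9 and learning rate 1e-4,
    # e.g. torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999))
    T = silhouettes.shape[0]
    if not training or T <= clip_len:
        return silhouettes                          # test stage: every frame is used
    start = random.randint(0, T - clip_len)         # training: one random run of 30 consecutive frames
    return silhouettes[start:start + clip_len]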
In order to verify the advancement of the cross-view gait recognition method based on contrast domain self-adaptive learning, the present invention is first compared on the CASIA-B gait database with 9 existing advanced gait recognition methods, including CNN-LB, GaitNet, group-supervision DRL, GaitSet, GaitPart, GaitSlice, MT3D, ESNet and GaitGL.
Because the CASIA-B database covers a comprehensive range of viewing angles, the present invention conducts extensive experiments on the cross-view recognition task on the CASIA-B data set. CASIA-B is a widely used gait data set comprising 13,640 video sequences of 124 subjects. Each subject has 10 types of gait contour sequences, including 6 types collected under normal walking conditions (indexes NM#01-06), 2 types collected under backpack conditions (indexes BG#01-02) and 2 types collected under coat-wearing conditions (indexes CL#01-02). Each type comprises gait contour sequences from 11 different viewing angles (0°-180°, spaced 18° apart).
In this example, all gait contour sequences of the first 74 subjects in the CASIA-B database were used for model training, and the gait contour sequences of the remaining 50 subjects were reserved for testing. In one mini-batch, the number of subjects and the number of sequences per subject were set to (8, 8), the number of model iterations was set to 100K, and β in formula (XIV) was set to 0.01. During the test phase, the first four types (i.e., NM#01-04) of the 6 types of gait contour sequences collected under normal conditions are used as the registration data set, and the remaining NM#05-06, BG#01-02 and CL#01-02 are used as the query data sets, respectively.
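The (P, U) = (8, 8) mini-batch described above can be drawn with an identity-balanced sampler such as the sketch below; the index structure and the with-replacement fallback for subjects with fewer than U sequences are assumptions for illustration:

import random
from collections import defaultdict

def sample_pu_batch(index, p=8, u=8):
    # index: list of (subject_id, sequence_path) pairs describing the training set
    by_subject = defaultdict(list)
    for i, (sid, _) in enumerate(index):
        by_subject[sid].append(i)
    batch = []
    for sid in random.sample(list(by_subject), p):          # P subjects
        pool = by_subject[sid]
        picks = random.sample(pool, u) if len(pool) >= u else random.choices(pool, k=u)
        batch.extend(picks)                                  # U sequences per subject
    return batch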
Table 2 lists the cross-view gait recognition rates (%) on the CASIA-B database of the present invention and the other 6 advanced gait recognition methods under three different walking conditions (normal, backpack and coat-wearing). The results in Table 2 are the average Rank-1 accuracy (%) of cross-view identification, averaged over the other 10 registration views for each query view.
TABLE 2
As can be seen from Table 2, the method of the present invention achieves the best recognition performance under all walking conditions. The average recognition rates of the method of the present invention reach 97.8%, 95.2% and 86.0% under the normal, backpack and coat-wearing walking conditions, respectively, strongly illustrating the advantages of the method of the present invention.
To further verify the generalization of the method of the present invention, the method was evaluated on the OU-ISIR dataset. OU-ISIR is a gait dataset consisting of 4007 subjects. The database has four viewing angles (55°, 65°, 75°, 85°); compared with CASIA-B, the OU-ISIR database contains fewer viewing angles but more subjects, and can therefore be used to verify the generalization performance of each gait recognition method. Each subject has gait sequences collected under two normal walking conditions (indexes #01 and #02). In this example, 3836 subjects in the OU-ISIR database were used for training with five-fold cross-validation. Likewise, the gait contour sequences are preprocessed and the size H×W of the input gait contour map is set to 64×44. In one mini-batch, the number of subjects and the number of sequences per subject were set to (32, 4), the number of model iterations was set to 60K, and β in formula (XIV) was set to 0.03. In the test phase, the sequences with index #01 are used as the registration data set, and the sequences with index #02 are used as the query data set.
The cross-view gait recognition results at each viewing angle of the method of the present invention and other advanced methods, including MGAN, CNNs and MT3D, are shown in Table 3. The results in Table 3 are the Rank-1 accuracy (%) of cross-view identification for the four different query views of the OU-ISIR database.
TABLE 3
As can be seen from Table 3, the method of the present invention achieves the highest accuracy under all cross-view conditions and has obvious performance advantages. Further, as the viewing angle difference between the query data set and the registration data set increases, the recognition accuracy of the CNNs, MGAN and MT3D methods drops sharply; for example, the recognition rates when the query and registration viewing angles are (55°, 85°) are significantly lower than when they are (55°, 65°) or (55°, 75°). In contrast, the method of the present invention still obtains excellent and stable recognition performance even when the query and registration viewing angles differ greatly, which shows that the method is more robust to viewing angle changes and has better generalization capability.
Example 3
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the cross-view gait recognition method based on contrast domain adaptive learning of embodiments 1 or 2 when executing the computer program.
Example 4
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the cross-view gait recognition method based on contrast domain adaptive learning of embodiments 1 or 2.

Claims (10)

1. A cross-visual-angle gait recognition method based on contrast domain self-adaptive learning, characterized by comprising the following steps:
(1) Constructing and training the whole network based on contrast domain self-adaptive learning, comprising the following steps:
A. gait contour preprocessing
Performing contour correction on the gait contour map to enable the human body contour to be located in the center of the image, and performing size adjustment on the corrected gait contour map;
B. gait view level subdomain partitioning
Dividing the gait samples under different viewing angles in a single gait database into subdomains of different view levels according to their viewing angles; the statistical distribution difference caused by viewing angle changes is regarded as the domain shift between subdomains.
C. Feature extractor for constructing embedded hierarchical feature aggregation strategy
The feature extractor comprises a basic branch, a hierarchical feature aggregation branch and a feature mapping head;
the gait sequence input of different subdomains firstly enters a basic branch to carry out basic space-time feature extraction, the output of different network stages of the basic branch is sent into a hierarchical feature aggregation branch to carry out comprehensive global space-time feature extraction, and finally the output of the hierarchical feature aggregation branch enters a feature mapping head to obtain the final fine-grained gait representation based on part;
D. Building view angle variation countermeasure elimination module
The visual angle change countermeasure eliminating module comprises a plurality of specific visual angle discriminators embedded with gradient inversion layers and a countermeasure optimizing target related to the feature extractor and the specific visual angle discriminators;
designing an antagonism optimization target for the output of the feature extractor embedded with the hierarchical feature aggregation strategy, and simultaneously training the feature extractor embedded with the hierarchical feature aggregation strategy and a specific view angle discriminator in the view angle variation antagonism elimination module in a minimum-maximum mode;
E. building a metric learning module
The metric learning module comprises a triplet loss and a cross entropy loss of the packet;
training the feature extractor by using the grouped triplet loss and cross entropy loss at the same time for the output of the feature extractor embedded with the hierarchical feature aggregation strategy;
F. Countermeasure training of the whole network based on contrast domain self-adaptive learning
The whole network based on the adaptive learning of the contrast domain comprises a feature extractor embedded with a hierarchical feature aggregation strategy, a visual angle change contrast eliminating module and a measurement learning module;
combining the antagonism optimization target in the step D with the triple loss and the cross entropy loss grouped in the step E for the output of the feature extractor embedded with the hierarchical feature aggregation strategy, and training the feature extractor;
(2) Cross-view gait recognition
And C, acquiring a gait contour sequence of the identity to be identified, preprocessing the gait contour by the step A, sending the acquired gait contour sequence of the identity to be identified into a trained feature extractor embedded with a hierarchical feature aggregation strategy to acquire gait features, and comparing the feature similarity with a registration data set to finish the identity identification of the sample to be identified.
2. The method for cross-view gait recognition based on contrast domain self-adaptive learning according to claim 1, wherein in step A, the gait contour sequence is obtained from the CASIA-B database and the OU-ISIR database.
3. The method for identifying the gait across the visual angle based on the adaptive learning of the contrast domain according to claim 1, wherein the step A, the gait contour preprocessing, means: the following is performed for each gait profile sequence:
a. reading each gait contour sequence, placing a part comprising a pedestrian at the center of each frame of image, and correcting the image so that the head of the pedestrian is placed at the uppermost edge of the image and the feet are placed at the lowermost edge of the image;
b. Adjusting the corrected gait contour sequence obtained in step a to the same image size H×W to obtain the final processed pedestrian contour sequence, wherein H and W respectively represent the height and width of the adjusted image.
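Steps a and b can be sketched as follows with OpenCV and NumPy; the exact centering rule (bounding-box based) and the padding policy are assumptions beyond what the claim itself recites, and 64 × 44 is the size used in the embodiments:

import cv2
import numpy as np

def preprocess_silhouette(mask, out_h=64, out_w=44):
    # mask: one binary gait contour frame (uint8, pedestrian pixels > 0)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return np.zeros((out_h, out_w), dtype=np.uint8)      # keep empty frames empty
    # step a: crop so the head touches the top edge and the feet the bottom edge,
    # and keep the pedestrian horizontally centred
    body = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = body.shape
    target_w = max(w, int(round(h * out_w / out_h)))
    pad = target_w - w
    body = np.pad(body, ((0, 0), (pad // 2, pad - pad // 2)))
    # step b: resize every corrected frame to the common size H x W
    return cv2.resize(body, (out_w, out_h), interpolation=cv2.INTER_LINEAR)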
4. The method for cross-view gait recognition based on adaptive learning of the contrast domain according to claim 1, wherein the step B, gait view level sub-domain division, comprises:
Dividing the gait silhouette data set obtained by the processing of step A into a training set and a test set, and dividing the labeled training set X into V view-level subdomains {X_v}, v = 1, ..., V, according to the different viewing angles; each gait sample in subdomain X_v has a corresponding identity label, and N_v and P_v respectively represent the total number of gait samples and the total number of IDs contained in subdomain X_v;
According to a preferred embodiment of the present invention, step C, constructing the feature extractor embedded with the hierarchical feature aggregation strategy, includes:
the basic branch includes an initial layer and 3 stacked 3D convolution blocks; the initial layer comprises two 3D convolution layers and is used for carrying out preliminary space-time feature extraction on an input gait contour sequence; the 3 stacked 3D convolution blocks take the output characteristics of the initial layer as input and are used for extracting space-time information in different stages;
the hierarchical feature aggregation branch comprises a plurality of cross-stage attention aggregation blocks and a feature fusion from low stage to high stage, wherein the cross-stage attention aggregation blocks are used for relieving distribution differences and semantic misalignments of heterogeneous features of different network stages, and the feature fusion from low stage to high stage is used for fusing local detail information of the low stage and semantic representation of the high stage so as to acquire more complementary and comprehensive space-time feature representation;
The feature mapping head includes a horizontal slicing operation, a generalized mean pooling operation and a grouped fully-connected mapping operation.
The cross-stage attention aggregation block includes two learnable parameters σ_1 and σ_2, a cross-stage attention derivation operation W_m and a cross-stage attention aggregation operation W_a; for an input x from any subdomain, x ∈ X_v, the general spatio-temporal features of two adjacent stages in the basic branch are first preliminarily combined through the learnable parameters, as shown in formula (I):
In formula (I), the left-hand side is the output of the element-wise weighted addition; the two terms on the right-hand side are the feature extracted by the basic branch at stage l and the feature after cross-stage attention aggregation at stage l−1 of the hierarchical feature aggregation branch, where 1 < l ≤ n and C_l, T_l, H_l, W_l respectively denote the channel number, frame number, height and width of the combined feature;
Then, a soft attention mask m_l representing the importance of each location in the combined feature is generated by the cross-stage attention derivation operation, as shown in formula (II):
In formula (II), W_m comprises a channel pooling layer, a 3D convolution layer with a kernel size of 3 × 3 and a sigmoid layer; the generated soft attention mask m_l further guides W_a to perform deep cross-stage attention aggregation, yielding the aggregated feature of stage l, as shown in formula (III):
In formula (III), W_a comprises a 3D convolution layer with a kernel size of 3 × 3 and a LeakyReLU layer;
Channel adjustment is carried out on the output of each cross-stage attention aggregation block by a lattice neg layer, the adjusted outputs are concatenated along the channel dimension, and finally a max pooling operation is used to generate the final global spatio-temporal feature representation F_c ∈ R^{C′×H′×W′}, where C′, H′ and W′ respectively denote the channel number, height and width of the global spatio-temporal feature F_c, as shown in formula (IV):
in the formula (IV), MP represents the maximum pooling operation, and n represents the number of cross-stage attention aggregation blocks;
In the feature mapping head, the global spatio-temporal feature F_c output by the hierarchical feature aggregation branch is first divided along the horizontal direction into H′ part-based strip spaces, and refined features are extracted from each horizontal strip space by generalized mean pooling, as shown in formula (V):
F_GeM = W_GeM(F′_c)    (V)
In formula (V), F_GeM is the part-based fine-grained feature obtained after the generalized mean pooling operation, each of its components being the fine-grained feature of one part-based strip space, W_GeM(·) denotes generalized mean pooling, and F′_c ∈ R^{C′×H′×W′} denotes the feature after horizontal division;
Subsequently, each horizontal strip feature of F_GeM is mapped to a more discriminative representation space by a grouped fully-connected layer, as shown in formula (VI):
F_FM = W_SFC(F_GeM)    (VI)
In formula (VI), F_FM denotes the output after the grouped feature mapping, and W_SFC denotes the grouped fully-connected mapping operation.
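For illustration only, the operations recited in formulas (I)-(III), (V) and (VI) can be sketched in PyTorch as below; the kernel sizes, channel handling, GeM exponent and weight initialization are assumptions, since the claim does not fix them:

import torch
import torch.nn as nn

class CrossStageAttentionAggregation(nn.Module):
    # Sketch of formulas (I)-(III): weighted fusion of adjacent-stage features,
    # soft attention mask derivation, and mask-guided aggregation.
    def __init__(self, channels, out_channels):
        super().__init__()
        self.sigma1 = nn.Parameter(torch.ones(1))                 # learnable sigma_1
        self.sigma2 = nn.Parameter(torch.ones(1))                 # learnable sigma_2
        self.w_m = nn.Conv3d(1, 1, kernel_size=(1, 3, 3), padding=(0, 1, 1))   # conv part of W_m
        self.w_a = nn.Sequential(                                  # W_a: 3D conv + LeakyReLU
            nn.Conv3d(channels, out_channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, f_basic, f_prev_agg):
        # both inputs are assumed to share the shape (N, C_l, T_l, H_l, W_l)
        f = self.sigma1 * f_basic + self.sigma2 * f_prev_agg      # formula (I)
        m = torch.sigmoid(self.w_m(f.mean(dim=1, keepdim=True)))  # formula (II): channel pool + conv + sigmoid
        return self.w_a(f * m)                                    # formula (III)

def gem_part_features(f_c, num_parts, p=6.4):
    # Sketch of formula (V): horizontal slicing + generalized mean pooling.
    # f_c: (N, C', H', W') global feature with a batch dimension; p is an assumed GeM exponent.
    strips = f_c.chunk(num_parts, dim=2)
    pooled = [s.clamp(min=1e-6).pow(p).mean(dim=(2, 3)).pow(1.0 / p) for s in strips]
    return torch.stack(pooled)                                     # (H', N, C')

class GroupedFC(nn.Module):
    # Sketch of formula (VI): an independent fully-connected mapping for every horizontal part.
    def __init__(self, num_parts, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_parts, in_dim, out_dim) * 0.01)

    def forward(self, f_gem):                                      # (H', N, in_dim)
        return torch.einsum('hnc,hco->hno', f_gem, self.weight)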
5. The method for cross-view gait recognition based on adaptive learning of a contrast domain according to claim 1, wherein step D, constructing a view variation contrast elimination module comprises:
Constructing, for each specific-view subdomain, a binary discriminator embedded with a gradient inversion layer, i.e. a specific view discriminator embedded with a gradient inversion layer, to discriminate whether each acquired part-based fine-grained feature F_GeM comes from this subdomain or from other subdomains; therefore, V binary discriminators are used in total, and the parameters of all gradient inversion layers are shared;
For all acquired part-based fine-grained features F_GeM, a countermeasure optimization objective is constructed over the feature extractor embedded with the hierarchical feature aggregation strategy and the specific view discriminators: by minimizing the loss of each specific view discriminator, the discriminator learns to judge whether each part-based fine-grained gait feature input comes from its specific subdomain; at the same time, the loss of the specific view discriminators is maximized to fine-tune the feature extractor so that it generates gait features capable of confusing the view discriminators;
Specifically, for a specific view discriminator D_v with its own parameters, each part-based fine-grained feature of a sample input x is normalized and then fed into D_v through the gradient inversion layer, and the corresponding output is sent into a softmax layer to obtain a two-dimensional probability output z ∈ R^2, as shown in formula (VII):
In formula (VII), the softmax function is applied to the discriminator output, and a sample input x has H′ probability outputs in total;
The specific view discriminator is then trained by defining a binary cross entropy loss over all part-based fine-grained features, as shown in formula (VIII):
In formula (VIII), G_F denotes the feature extractor embedded with the hierarchical feature aggregation strategy, L_BCE denotes the binary cross entropy loss, and ⟨·⟩ denotes averaging over the subscript set; the joint minimization objective over all specific view discriminators is shown in formula (IX):
The maximization objective function for optimizing the feature extractor G_F is shown in formula (X):
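The gradient inversion layer and one specific view discriminator can be sketched as follows; the discriminator architecture and the use of a two-class cross entropy in place of the equivalent binary cross entropy over the softmax output are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    # Identity in the forward pass; multiplies the gradient by -lambda in the backward pass,
    # so minimizing the discriminator loss simultaneously maximizes it w.r.t. the extractor.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class ViewDiscriminator(nn.Module):
    # One binary discriminator D_v judging whether a part-based feature comes from subdomain X_v.
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, f):
        return self.net(f)                       # two-dimensional output z of formula (VII)

def view_adversarial_loss(part_feats, view_ids, discriminators, lam=1.0):
    # part_feats: (H', N, C); view_ids: (N,) subdomain index of every sample in the mini-batch
    total = 0.0
    for v, d_v in enumerate(discriminators):
        target = (view_ids == v).long()          # 1 if the sample belongs to subdomain X_v
        for f in part_feats:                     # one (N, C) feature per horizontal part
            f = GradientReversal.apply(F.normalize(f, dim=1), lam)
            total = total + F.cross_entropy(d_v(f), target)
    return total / (len(discriminators) * part_feats.shape[0])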
6. the method for cross-view gait recognition based on the adaptive learning of the contrast domain according to claim 1, wherein in the step E, a metric learning module is constructed, and the metric learning module enhances the discrimination of gait representation by minimizing the distance between gait samples from different view-level subfields and belonging to the same ID, including the triplet loss and the cross entropy loss of the grouping; comprising the following steps:
c. Using the triplet loss, discriminant constraints are respectively imposed on all acquired part-based fine-grained features F_GeM, with the reduction of the triplet loss taken as the training target; the loss function is specifically shown in formula (XI) and formula (XII):
In formulas (XI) and (XII), (P, U) denotes the number of subjects in one mini-batch and the number of gait contour sequences of each subject, N_tri denotes the number of non-zero terms in the loss, H′ denotes the number of part-based fine-grained features after horizontal division, m denotes the margin of the triplet loss, the three feature terms denote the h-th horizontal feature of the u-th gait contour sequence of the p-th subject, the h-th horizontal feature of the a-th gait contour sequence of the p-th subject and the h-th horizontal feature of the c-th gait contour sequence of the b-th subject, respectively, and d+ and d− denote the Euclidean distances to the positive and negative samples, respectively;
d. Using the cross entropy loss, discriminant constraints are respectively imposed on all acquired part-based fine-grained features F_GeM, with the reduction of the cross entropy loss taken as the training target; the loss function is specifically shown in formula (XIII):
In formula (XIII), N = PU denotes the total number of gait contour sequences in one mini-batch, and the two label terms denote the real identity label and the predicted identity label of each part-based feature, respectively.
7. The method for cross-view gait recognition based on the adaptive learning of the contrast domain according to claim 1, wherein the step F, training the whole network based on the adaptive learning of the contrast domain, comprises:
Combining the countermeasure optimization objective with the grouped triplet loss and the cross entropy loss, the whole network based on contrast domain self-adaptive learning is trained; the total objective function is shown in formula (XIV):
In formula (XIV), β is a trade-off parameter that balances the two optimization goals of discriminant model learning and view angle change elimination.
8. The method for identifying the gait across the visual angle based on the adaptive learning of the contrast domain according to any one of claims 1 to 7, wherein in the step (2), the gait contour sequence is obtained by dividing the pedestrian video acquired by the camera in the actual scene;
further preferably, step (2), cross-view gait recognition, comprises:
e. After the preprocessing of step A, the registration data set is sent to the trained whole network based on contrast domain self-adaptive learning; the H′ part-based fine-grained features output by the trained feature extractor embedded with the hierarchical feature aggregation strategy are concatenated as the overall feature representation of each gait contour sequence, and the feature database of the registration data set is finally obtained;
f. After the preprocessing of step A, the samples in the query data set to be identified are sent to the trained feature extractor embedded with the hierarchical feature aggregation strategy to obtain the features of the query data set; the Euclidean distance between each gait sample feature in the query data set and all features in the registration data set obtained in step e is then computed, the query sample is finally identified as the label of the feature with the smallest Euclidean distance in the registration data set, and the identity label of the query sample is output, completing identification.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor when executing the computer program performs the steps of the cross-view gait recognition method based on contrast domain adaptive learning as claimed in any one of claims 1-8.
10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the cross-view gait recognition method based on contrast domain adaptive learning of any of claims 1-8.
CN202310711768.9A 2023-06-15 2023-06-15 Cross-visual-angle gait recognition method based on contrast domain self-adaptive learning Pending CN116704612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310711768.9A CN116704612A (en) 2023-06-15 2023-06-15 Cross-visual-angle gait recognition method based on contrast domain self-adaptive learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310711768.9A CN116704612A (en) 2023-06-15 2023-06-15 Cross-visual-angle gait recognition method based on contrast domain self-adaptive learning

Publications (1)

Publication Number Publication Date
CN116704612A true CN116704612A (en) 2023-09-05

Family

ID=87827316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310711768.9A Pending CN116704612A (en) 2023-06-15 2023-06-15 Cross-visual-angle gait recognition method based on contrast domain self-adaptive learning

Country Status (1)

Country Link
CN (1) CN116704612A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912949A (en) * 2023-09-12 2023-10-20 山东科技大学 Gait recognition method based on visual angle perception part intelligent attention mechanism
CN116912949B (en) * 2023-09-12 2023-12-22 山东科技大学 Gait recognition method based on visual angle perception part intelligent attention mechanism

Similar Documents

Publication Publication Date Title
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
Zhong et al. An adaptive artificial immune network for supervised classification of multi-/hyperspectral remote sensing imagery
CN107085716B (en) Cross-view gait recognition method based on multi-task generation countermeasure network
CN100461204C (en) Method for recognizing facial expression based on 2D partial least square method
CN105574505A (en) Human body target re-identification method and system among multiple cameras
Ocegueda et al. Which parts of the face give out your identity?
CN110852152B (en) Deep hash pedestrian re-identification method based on data enhancement
CN113947814B (en) Cross-view gait recognition method based on space-time information enhancement and multi-scale saliency feature extraction
Feng et al. Marginal stacked autoencoder with adaptively-spatial regularization for hyperspectral image classification
CN103745205A (en) Gait recognition method based on multi-linear mean component analysis
CN112115881B (en) Image feature extraction method based on robust identification feature learning
CN111611877A (en) Age interference resistant face recognition method based on multi-temporal-spatial information fusion
CN111967325A (en) Unsupervised cross-domain pedestrian re-identification method based on incremental optimization
CN113052017B (en) Unsupervised pedestrian re-identification method based on multi-granularity feature representation and domain self-adaptive learning
CN116704612A (en) Cross-visual-angle gait recognition method based on contrast domain self-adaptive learning
CN117746260B (en) Remote sensing data intelligent analysis method and system
CN112380374B (en) Zero sample image classification method based on semantic expansion
CN114663986A (en) In-vivo detection method and system based on double-decoupling generation and semi-supervised learning
CN109902746A (en) Asymmetrical fine granularity IR image enhancement system and method
Shi et al. An active relearning framework for remote sensing image classification
CN110458064A (en) Combined data is driving and the detection of the low target of Knowledge driving type and recognition methods
CN109947960A (en) The more attribute Combined estimator model building methods of face based on depth convolution
Xin et al. Random part localization model for fine grained image classification
Zha et al. Intensifying the consistency of pseudo label refinement for unsupervised domain adaptation person re-identification
CN114973305B (en) Accurate human body analysis method for crowded people

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination