CN114550291A - Gait feature extraction method, device and equipment


Info

Publication number
CN114550291A
Authority
CN
China
Prior art keywords
feature extraction
features
gait
network
target object
Prior art date
Legal status
Pending
Application number
CN202210156625.1A
Other languages
Chinese (zh)
Inventor
郑新想
Current Assignee
Chongqing Unisinsight Technology Co Ltd
Original Assignee
Chongqing Unisinsight Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Unisinsight Technology Co Ltd filed Critical Chongqing Unisinsight Technology Co Ltd
Priority to CN202210156625.1A
Publication of CN114550291A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a gait feature extraction method, device and equipment. The method comprises: preprocessing multiple frames of images of a target object to obtain preprocessed images of the corresponding frames; inputting the preprocessed images into a gait feature extraction model comprising a convolutional neural network and a transformer network; using the convolutional neural network to extract, frame by frame in the temporal order of the frames, first gait features that represent the spatial features of the target object's gait; and using the transformer network to extract second gait features that fuse temporal and spatial gait features. By combining a deep convolutional network with a visual transformer network, the method makes full use of the characteristics of each network and achieves fast gait feature extraction while ensuring the accuracy of the extracted gait features.

Description

Gait feature extraction method, device and equipment
Technical Field
The present application relates to the field of target recognition technologies, and in particular, to a gait feature extraction method, device, and equipment.
Background
Gait recognition technology identifies a target object by its walking posture. It is a relatively new branch of biometric recognition, has the advantages of long-distance recognition, difficulty of camouflage, and non-contact operation, and has important research and application value in fields such as national public security and financial security.
A gait recognition system generally comprises three steps: gait preprocessing (pedestrian segmentation and joint point detection), gait feature extraction, and gait feature comparison and recognition. The intermediate step, gait feature extraction, largely determines the success rate of the whole process. To obtain effective and reliable pedestrian gait features, most existing gait feature extraction methods use deep models with strong expressive power; for example, Liao et al. use a deep pose model to obtain the posture information of the human body in a video sequence and then use a long short-term memory (LSTM) model to fuse the temporal and spatial information of the gait and obtain gait features for recognition. However, when a neural network is used to organize or fuse the temporal and spatial information of pedestrians, these algorithms do not fully exploit the characteristics of deep networks, and the network computation is usually time-consuming.
Disclosure of Invention
The invention provides a gait feature extraction method, device, and equipment to solve the problems that existing gait feature extraction algorithms cannot fully exploit the characteristics of deep networks and that network computation is usually time-consuming when a neural network is used to organize or fuse the temporal and spatial information of pedestrians.
In a first aspect, the present invention provides a gait feature extraction method, including:
preprocessing multiple frames of images of a target object to obtain preprocessed images of the corresponding frames, wherein the preprocessed image of a single frame comprises a human body contour map and a human body joint map;
inputting the obtained preprocessed images into a gait feature extraction model, wherein the gait feature extraction model comprises a convolutional neural network and a transformer network;
using the convolutional neural network to extract first gait features from the single-frame preprocessed images in the temporal order of the frames, wherein the first gait features represent the spatial features of the target object's gait;
inputting the first gait features obtained for each frame of preprocessed image into the transformer network, and using the attention mechanism, semantic feature extraction capability, and long-range feature capture capability of the transformer network to extract, based on the first gait features and the temporal order of their frames, second gait features that fuse the temporal and spatial features of the gait.
In a second aspect, the present invention provides a gait feature extraction device, including:
an acquisition module, configured to preprocess multiple frames of images of a target object to obtain preprocessed images of the corresponding frames, wherein the preprocessed image of a single frame comprises a human body contour map and a human body joint map;
a gait feature extraction module, configured to input the obtained preprocessed images into a gait feature extraction model, wherein the gait feature extraction model comprises a convolutional neural network and a transformer network;
a first feature extraction module, configured to use the convolutional neural network to extract first gait features from the single-frame preprocessed images in the temporal order of the frames, wherein the first gait features represent the spatial features of the target object's gait;
and a second feature extraction module, configured to input the first gait features obtained for each frame of preprocessed image into the transformer network, which uses its attention mechanism, semantic feature extraction capability, and long-range feature capture capability to extract, based on the first gait features and the temporal order of their frames, second gait features that fuse the temporal and spatial features of the gait.
In a third aspect, the present invention provides gait feature extraction equipment, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor executing the computer program to perform:
preprocessing multiple frames of images of a target object to obtain preprocessed images of the corresponding frames, wherein the preprocessed image of a single frame comprises a human body contour map and a human body joint map;
inputting the obtained preprocessed images into a gait feature extraction model, wherein the gait feature extraction model comprises a convolutional neural network and a transformer network;
using the convolutional neural network to extract first gait features from the single-frame preprocessed images in the temporal order of the frames, wherein the first gait features represent the spatial features of the target object's gait;
inputting the first gait features obtained for each frame of preprocessed image into the transformer network, and using the attention mechanism, semantic feature extraction capability, and long-range feature capture capability of the transformer network to extract, based on the first gait features and the temporal order of their frames, second gait features that fuse the temporal and spatial features of the gait.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement any of the steps of the gait feature extraction method described above.
The gait feature extraction method first uses the translation invariance, scale invariance, and hierarchical characteristics of the convolutional neural network to obtain the spatial gait features of the target object, reducing the resolution of the feature maps and thereby reducing the amount of computation for the subsequent network; it then uses the advantages of the transformer network in sequence processing to extract second gait features that fuse the temporal and spatial features of the gait. The networks thus complement each other structurally and functionally while the inference speed is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a gait feature extraction method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a gait feature extraction model according to an embodiment of the invention;
Fig. 3 is a schematic diagram of a third gait feature extraction process according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a second gait feature extraction process according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a gait feature extraction device according to an embodiment of the invention;
Fig. 6 is a schematic diagram of gait feature extraction equipment according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein, and that the embodiments described in the following exemplary embodiments are not intended to represent all embodiments consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The application scenario described in the embodiment of the present invention is for more clearly illustrating the technical solution of the embodiment of the present invention, and does not form a limitation on the technical solution provided in the embodiment of the present invention, and it can be known by a person skilled in the art that with the occurrence of a new application scenario, the technical solution provided in the embodiment of the present invention is also applicable to similar technical problems.
Gait recognition has the unique advantages of long-distance recognition, difficulty of camouflage, and non-contact operation, and is widely applied in fields such as national public security and financial security.
A gait recognition system is roughly divided into three steps: gait preprocessing (pedestrian segmentation and joint point detection), gait feature extraction, and gait feature comparison and recognition. Gait feature extraction largely determines the success rate of the whole process; it inevitably involves the spatial information within pedestrian frames and the temporal information between frames, so effectively combining or fusing the spatio-temporal gait information into reliable and stable gait features is particularly important. However, existing gait feature extraction methods cannot fully exploit the characteristics of deep networks when a neural network is used to organize or fuse the temporal and spatial information of pedestrians, and the computation of a typical network is time-consuming.
The invention provides a gait feature extraction method that combines a deep convolutional network with a visual transformer network, makes full use of the characteristics of each network, and achieves fast gait feature extraction while ensuring the accuracy of the extracted gait features.
In addition, in the field of image classification, LeViT, a recent visual transformer, supports fast and efficient inference; the transformer network in the embodiment of the invention therefore adopts LeViT.
The gait feature extraction method provided by the invention is explained in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a gait feature extraction method according to an embodiment of the present invention, and as shown in fig. 1, the method includes:
Step 101: preprocessing multiple frames of images of a target object to obtain preprocessed images of the corresponding frames, wherein the preprocessed image of a single frame comprises a human body contour map and a human body joint map.
in the embodiment of the invention, the multi-frame images of the target object are multi-frame images in the same tracking track of the target object.
In the implementation, a pedestrian detection and tracking algorithm is used for acquiring multi-frame images which are arranged in the same tracking track of a target object according to the time sequence of frames from a video of the target object acquired by a video acquisition device, and a pedestrian segmentation model and a pedestrian joint model are used for respectively carrying out preprocessing of human body contour extraction and human body joint extraction on the acquired multi-frame images so as to acquire processed multi-frame preprocessed images.
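A minimal sketch of this preprocessing stage is given below, assuming Python. The segmentation and joint models are passed in as callables; the names segment_person and detect_joints are placeholders introduced for this example, not taken from the patent.

    # Illustrative preprocessing sketch, not the patented implementation.
    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class PreprocessedFrame:
        contour_map: np.ndarray   # binary human-contour (silhouette) image
        joint_map: np.ndarray     # image encoding the detected body joints

    def preprocess_track(frames: List[np.ndarray],
                         segment_person,
                         detect_joints) -> List[PreprocessedFrame]:
        """Turn the frames of one tracking track into per-frame contour and joint maps."""
        processed = []
        for frame in frames:                   # frames are already ordered by time
            contour = segment_person(frame)    # e.g. a binary silhouette mask
            joints = detect_joints(frame)      # e.g. a joint heat-map / skeleton image
            processed.append(PreprocessedFrame(contour_map=contour, joint_map=joints))
        return processed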
Step 102: inputting the obtained preprocessed images into a gait feature extraction model, wherein the gait feature extraction model comprises a convolutional neural network and a transformer network.
Step 103: using the convolutional neural network to extract first gait features from the single-frame preprocessed images in the temporal order of the frames, wherein the first gait features represent the spatial features of the target object's gait.
Step 104: inputting the first gait features obtained for each frame into the transformer network, which uses its attention mechanism, semantic feature extraction capability, and long-range feature capture capability to extract, based on the first gait features and the temporal order of their frames, second gait features that fuse the temporal and spatial features of the gait.
In implementation, after the multiple frames of preprocessed images of the target object are obtained, they are input into the gait feature extraction model. As shown in Fig. 2, the gait feature extraction model includes two parts, a convolutional neural network and a transformer network, and the preprocessed frames are arranged in the temporal order of the frames. After the preprocessed frames are input, the convolutional neural network first processes the single-frame preprocessed images in turn and extracts the spatial gait features of the target object, i.e. the first gait features, from the human body contour map and the human body joint map of the single-frame target object; in the embodiment of the invention, stages 1 and 2 of the convolutional neural network ResNet50 are used. The first gait features of the frames, output in turn by the convolutional neural network, are then input into the transformer network; the transformer network combines the first gait features of the frames into a fused feature according to the temporal order of the frames and, using its semantic feature extraction capability and long-range feature capture capability, extracts the second gait feature from the fused feature.
It should be noted that although the embodiment of the present invention uses stages 1 and 2 of the convolutional neural network ResNet50, other types of convolutional neural networks may be used in a specific implementation according to actual requirements.
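For illustration only, the sketch below shows how such a model could be assembled in PyTorch: stages 1 and 2 of ResNet50 (conv1 through layer2) extract per-frame spatial features, and a standard TransformerEncoder stands in for the LeViT-style network that fuses them along the time axis. The feature dimension, layer counts, the 3-channel input assumption, and the use of nn.TransformerEncoder instead of LeViT are all assumptions, not the patented implementation.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    class GaitFeatureExtractor(nn.Module):
        def __init__(self, feat_dim: int = 256, num_layers: int = 4, num_heads: int = 8):
            super().__init__()
            backbone = resnet50(weights=None)
            # conv1 + bn1 + relu + maxpool + layer1 + layer2  (stages 1-2 of ResNet50)
            self.cnn = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                     backbone.maxpool, backbone.layer1, backbone.layer2)
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.proj = nn.Linear(512, feat_dim)   # layer2 of ResNet50 outputs 512 channels
            encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                                       batch_first=True)
            self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, time, 3, height, width), ordered by frame time;
            # contour and joint maps are assumed to be stacked into 3 channels.
            b, t, c, h, w = frames.shape
            x = self.cnn(frames.reshape(b * t, c, h, w))          # per-frame spatial (first) gait features
            x = self.proj(self.pool(x).flatten(1)).reshape(b, t, -1)
            x = self.temporal(x)                                  # fuse temporal and spatial information
            return x.mean(dim=1)                                  # one fused (second) gait feature per sequence

A call such as GaitFeatureExtractor()(torch.randn(2, 30, 3, 128, 88)) would return one fused gait feature vector per 30-frame sequence.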
According to the gait feature extraction method, the spatial gait information of the target object is obtained by using the translation invariance, scale invariance, and hierarchical characteristics of the convolutional neural network, the resolution of the feature maps is reduced, and the amount of computation for the subsequently introduced attention network is therefore reduced; the second gait features, which fuse the temporal and spatial gait features, are then extracted by using the advantages of the transformer network (such as LeViT) in sequence processing, so that the networks complement each other structurally and functionally and the inference speed is improved.
As an optional implementation, the transformer network includes a first transformer network and a second transformer network, and the extraction by the transformer network of second gait features that fuse temporal and spatial gait features, based on the first gait features and the temporal order of their frames and using the attention mechanism, semantic feature extraction capability, and long-range feature capture capability of the transformer network, includes:
using the first transformer network, with an attention mechanism, to extract from the first gait features of each frame output by the convolutional neural network third gait features that suppress the spatial gait features of non-target objects and reinforce those of the target object;
and using the semantic feature extraction capability and long-range feature capture capability of the second transformer network to extract, from the third gait features of the multiple frames obtained by the first transformer network and the temporal order of the frames, second gait features that fuse the temporal and spatial features of the gait.
In order to make the extracted gait features more discriminative, the transformer network in the embodiment of the invention includes two parts, a first transformer network and a second transformer network.
As shown in Fig. 3, the first transformer network uses an attention mechanism in the spatial dimension to optimize the first gait features of the single-frame preprocessed image output by the convolutional neural network, and outputs third gait features that distinguish the spatial gait features of the target object from those of non-target objects.
As shown in Fig. 4, after the third gait features corresponding to the multiple frames of preprocessed images of the target object are extracted by the first transformer network, they are input, in the order output by the first transformer network, into the second transformer network; the second transformer network combines the third gait features of the frames into a fused feature according to the temporal order of the frames and, using its semantic feature extraction capability and long-range feature capture capability, extracts the second gait feature from the fused feature.
By dividing the transformer network into a first transformer network and a second transformer network, the features of non-target objects are suppressed and the extraction of the third gait features of the target object is reinforced, further improving the reliability and accuracy of the gait features.
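The first transformer stage can be pictured as self-attention over the spatial positions of a single frame's feature map, so that background or non-target responses are down-weighted before temporal fusion. The sketch below is an assumption-laden illustration; the channel count, layer count, and the pooling at the end are not specified by the patent.

    import torch
    import torch.nn as nn

    class SpatialAttentionStage(nn.Module):
        """Sketch of the first transformer: attention over spatial positions of one frame."""
        def __init__(self, channels: int = 512, num_heads: int = 8, num_layers: int = 2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
            # feat_map: (batch, channels, height, width) -> tokens over spatial positions
            b, c, h, w = feat_map.shape
            tokens = feat_map.flatten(2).transpose(1, 2)   # (batch, h*w, channels)
            tokens = self.encoder(tokens)                  # attention re-weights spatial positions
            return tokens.mean(dim=1)                      # pooled per-frame (third) gait feature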
In an embodiment of the present invention, the first, second, and third gait features each include a global gait feature and local gait features; the global gait feature represents the gait of the target object as a whole, and the local gait features represent the gait corresponding to local parts of the target object.
The convolutional neural network extracts the first global gait feature of the target object from the global preprocessed image of the target object (i.e. the human body contour map and the human body joint map not divided by joint points); the preprocessed image is also divided, according to the human body joint map of the target object, into local preprocessed images corresponding to different parts, and the first local gait features of the target object are extracted from these local preprocessed images. The transformer network extracts the second global gait feature from the first global gait features of the multiple frames, and the second local gait features from the first local gait features of the multiple frames; the specific feature extraction process has been described in detail above and is not repeated here.
As an optional embodiment, the local parts of the target object include: the part from the head to the shoulders, the part from the shoulders to the hips, and the part from the hips to the feet; the global feature of the target object covers the body from head to foot.
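As an illustration of how such a split could be done, the contour map can be cut into three horizontal bands using the detected joint positions; the keypoint names and the exact split rule below are assumptions introduced for this example only, not taken from the patent.

    import numpy as np
    from typing import Dict, List, Tuple

    def split_into_parts(contour: np.ndarray,
                         joints: Dict[str, Tuple[float, float]]) -> List[np.ndarray]:
        """Return [head-to-shoulder, shoulder-to-hip, hip-to-foot] crops of the contour."""
        shoulder_y = int(min(joints["left_shoulder"][1], joints["right_shoulder"][1]))
        hip_y = int(max(joints["left_hip"][1], joints["right_hip"][1]))
        return [
            contour[:shoulder_y, :],        # head to shoulder
            contour[shoulder_y:hip_y, :],   # shoulder to hip
            contour[hip_y:, :],             # hip to foot
        ]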
Because the walking direction of the target object has a large influence on gait recognition, in order to further improve the accuracy of the gait recognition system, the embodiment of the invention adds, on the branch of the second global gait feature, a network for predicting the walking direction of the target object; it predicts whether the target walks toward the front, the back, the left, or the right. In the subsequent gait feature comparison and recognition, comparison is carried out within each direction separately, further improving the accuracy of the gait recognition system.
As an optional implementation, the method provided in the embodiment of the present invention further includes:
using the second transformer network, with an attention mechanism, to extract a direction feature describing the walking direction of the target object from the third gait features of the multiple frames obtained by the first transformer network and the temporal order of the frames.
It should be noted that, in order to make the extracted gait features more accurate, the walking direction of the target object should be substantially consistent across the multiple frames of images acquired in step 101.
In implementation, after the gait features of the target object are extracted with the gait feature extraction method, gait recognition is carried out on the extracted gait features of the target object to determine the identity of the target object.
As an optional implementation, after the transformer network extracts, using its attention mechanism, semantic feature extraction capability, and long-range feature capture capability and based on the first gait features and the temporal order of their frames, the second gait features that fuse temporal and spatial gait features, the method further includes:
comparing the extracted second gait features with the gait features of a plurality of target objects prestored in a database, and determining the target object corresponding to the second gait features according to the comparison result;
wherein the gait features of the plurality of target objects prestored in the database are obtained by performing feature extraction on historical preprocessed images of the plurality of objects using the gait feature extraction model.
After training of the gait feature extraction model is finished, the acquired preprocessed images of the plurality of target objects are input into the gait feature extraction model, the gait features of the plurality of target objects are obtained, and the gait features are stored in the database.
In implementation, when performing gait recognition of the target object, the walking direction of the target object predicted by the gait feature extraction model is first used to screen, from the database, reference gait features whose walking direction is consistent with that of the target object; the second global gait feature and the second local gait features extracted by the gait feature extraction model are then compared for similarity with the global and local gait features of the reference gait features, respectively, to carry out gait recognition of the target object.
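A hedged sketch of this comparison step follows. It assumes the features are plain vectors, uses cosine similarity, and simply averages the global and local scores; none of these choices is fixed by the patent.

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def identify(query_global, query_locals, query_direction, gallery):
        """gallery: iterable of dicts with keys 'id', 'direction', 'global', 'locals'."""
        best_id, best_score = None, -1.0
        for entry in gallery:
            if entry["direction"] != query_direction:      # keep only the same walking direction
                continue
            score = cosine(query_global, entry["global"])
            score += sum(cosine(q, r) for q, r in zip(query_locals, entry["locals"]))
            score /= 1 + len(query_locals)                 # average over global + local scores
            if score > best_score:
                best_id, best_score = entry["id"], score
        return best_id, best_score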
The training process of the gait feature extraction model is described in detail below.
First, multiple frames of image samples corresponding to different target objects are processed to obtain multiple frames of preprocessed image samples for each target object, and the preprocessed frames of the same target object are labeled with that target object to obtain a training sample.
In implementation, tracker sequence segments (image samples) of a plurality of target objects are obtained, using a pedestrian detection and tracking algorithm, from videos of different scenes and times captured by a video acquisition device, and are labeled and preprocessed. The same target object is labeled with only one object identifier, while sequences of the same pedestrian from different scenes and times are labeled with different tracker IDs (tracking track identifiers) to distinguish them; the walking direction of the target object (front, back, left, or right) is labeled at the same time. A plurality of training samples is thereby obtained.
The preprocessing of the image samples is realized by a pedestrian segmentation model and a pedestrian joint model, and the preprocessed image samples comprise human body contour image samples and human body joint image samples.
It should be noted that one training sample corresponds to one tracking track of one target object; different tracking tracks of the same target object form different training samples, and the same tracking track of a target object may also correspond to multiple training samples.
The training samples are input into a network model, wherein the network model comprises a feature extraction layer, and the feature extraction layer comprises a first feature extraction layer using a convolutional neural network and a second feature extraction layer using a transformer network.
The first feature extraction layer extracts first gait features from the single-frame preprocessed image samples in the temporal order of the frames, and the second feature extraction layer, using the attention mechanism, semantic feature extraction capability, and long-range feature capture capability of the transformer network, extracts, based on the first gait features of each frame output by the first feature extraction layer and the temporal order of the frames, second gait features that fuse the temporal and spatial features of the gait.
The value of a loss function is determined, and the parameters of the network model are adjusted according to the value of the loss function, wherein the loss function comprises a triplet loss function determined from a first distance between second gait features of the same target object extracted by the second feature extraction layer and a second distance between second gait features of different target objects.
When the condition for ending parameter adjustment is met, the current feature extraction layer is taken as the gait feature extraction model.
The triplet loss function is specifically:
L_triplet = [D_ap - D_an + α]_+
where D_ap = ||f_a - f_p||, D_an = ||f_a - f_n||, and f_a, f_p, f_n respectively denote a second gait feature extracted by the second feature extraction layer (the anchor), a second gait feature of the same target object, and a second gait feature of a different target object. "The same target object" means the target object to which the currently extracted second gait feature belongs, "a different target object" means a target object other than the one to which the currently extracted second gait feature belongs, and α is a hyper-parameter constant.
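Transcribed into PyTorch, the triplet loss reads as follows; the margin value of 0.3 is an arbitrary example, not taken from the patent.

    import torch

    def triplet_loss(f_a: torch.Tensor, f_p: torch.Tensor, f_n: torch.Tensor,
                     alpha: float = 0.3) -> torch.Tensor:
        """[D_ap - D_an + alpha]_+ averaged over the batch; inputs are second gait feature vectors."""
        d_ap = torch.norm(f_a - f_p, dim=-1)          # distance anchor <-> same identity
        d_an = torch.norm(f_a - f_n, dim=-1)          # distance anchor <-> different identity
        return torch.clamp(d_ap - d_an + alpha, min=0.0).mean()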
As an optional implementation, adjusting the parameters of the network model using the value of the loss function includes:
adjusting the parameters of the feature extraction layer in the network model using the value of the triplet loss function, with the goal of reducing the value of the loss function.
As an optional implementation, the network model further includes an output layer that predicts, from the second gait features output by the second feature extraction layer, the probabilities of the different recognized target objects, and the loss function further includes a cross-entropy loss for target object identification:
the probabilities of the different target objects predicted by the output layer are taken as the predicted values, and the probabilities of the different target objects expected to be output according to the target object labels of the preprocessed samples are taken as the target values;
the value of the cross-entropy loss function for target object identification is computed from the predicted values and the target values.
The cross-entropy loss function for target object identification is specifically:
L_id = - Σ_{i=1}^{N} y_i log(ŷ_i)
where N denotes the number of target objects in model training, y_i denotes the probability of target object i expected to be output according to the target object labels of the preprocessed samples, and ŷ_i denotes the probability of target object i predicted by the output layer.
As an optional implementation, adjusting the parameters of the network model using the value of the loss function includes:
adjusting the parameters of the network model, i.e. the parameters of the feature extraction layer and the output layer, using the value of the cross-entropy loss function for target object identification, with the goal of reducing the value of the loss function.
As an optional implementation, the preprocessed image samples are further labeled with the walking direction of the target object, and the second feature extraction layer includes a third feature extraction layer using the first transformer network and a fourth feature extraction layer using the second transformer network.
The third feature extraction layer, using an attention mechanism, extracts from the first gait features of each frame output by the first feature extraction layer third gait features that suppress non-target objects and reinforce the target object; the fourth feature extraction layer, using the semantic feature extraction capability and long-range feature capture capability of the transformer network, extracts, from the third gait features of the multiple frames output by the third feature extraction layer and the temporal order of the frames, second gait features that fuse the temporal and spatial features of the gait as well as the walking direction of the target object.
When the parameters of the fourth feature extraction layer in the network model are adjusted, the loss function further includes a cross-entropy loss for the walking direction:
the walking direction of the target object extracted by the fourth feature extraction layer is taken as the predicted value, the walking direction of the target object labeled in the preprocessed samples is taken as the target value, and the value of the cross-entropy loss function for the walking direction is computed.
The cross-entropy loss function for the walking direction is specifically:
L_dir = - Σ_{p} y_p log(ŷ_p)
where the sum runs over the walking-direction classes (front, back, left, right), y_p denotes the walking direction of the target object labeled in the preprocessed samples, and ŷ_p denotes the walking direction of the target object predicted by the fourth feature extraction layer.
As an optional implementation, adjusting the parameters of the network model using the value of the loss function includes:
adjusting the parameters of the fourth feature extraction layer in the network model using the value of the cross-entropy loss function for the walking direction, with the goal of reducing the value of the loss function.
It should be noted that, when the value of the loss function is determined and the parameters of the network model are adjusted according to it, the loss function may include only the triplet loss function, or any one of the following combinations (a minimal sketch of such a combined objective is given after this list):
the triplet loss function and the cross-entropy loss function for target object identification;
the triplet loss function and the cross-entropy loss function for the walking direction of the target object;
the triplet loss function, the cross-entropy loss function for target object identification, and the cross-entropy loss function for the walking direction of the target object.
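The following is a minimal sketch of such a combined objective, assuming the terms are summed with equal weight (the patent does not specify weights) and reusing the triplet_loss sketch given earlier; id_logits and dir_logits stand for the output-layer predictions for identity and walking direction, id_labels and dir_labels for the annotations.

    import torch
    import torch.nn.functional as F

    def combined_loss(f_a, f_p, f_n, id_logits=None, id_labels=None,
                      dir_logits=None, dir_labels=None, alpha: float = 0.3):
        loss = triplet_loss(f_a, f_p, f_n, alpha)                   # triplet term is always present
        if id_logits is not None:
            loss = loss + F.cross_entropy(id_logits, id_labels)     # identity cross-entropy term
        if dir_logits is not None:
            loss = loss + F.cross_entropy(dir_logits, dir_labels)   # walking-direction cross-entropy term
        return loss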
The gait feature extraction device in the embodiment of the present application is described below with reference to the accompanying drawings.
Fig. 5 is a schematic view of a gait feature extraction device according to an embodiment of the invention, as shown in fig. 5, the device includes:
an obtaining module 501, configured to preprocess multiple frames of images of a target object to obtain preprocessed images of the corresponding frames, wherein the preprocessed image of a single frame comprises a human body contour map and a human body joint map;
a feature extraction module 502, configured to input the obtained preprocessed images into a gait feature extraction model, wherein the gait feature extraction model comprises a convolutional neural network and a transformer network;
a first feature extraction module 503, configured to use the convolutional neural network to extract first gait features from the single-frame preprocessed images in the temporal order of the frames, wherein the first gait features represent the spatial features of the target object's gait;
a second feature extraction module 504, configured to input the first gait features obtained for each frame of preprocessed image into the transformer network, which uses its attention mechanism, semantic feature extraction capability, and long-range feature capture capability to extract, based on the first gait features and the temporal order of their frames, second gait features that fuse the temporal and spatial features of the gait.
Optionally, the transformer network includes a first transformer network and a second transformer network, and the second feature extraction module 504 is configured so that the extraction by the transformer network of second gait features fusing temporal and spatial gait features, based on the first gait features and the temporal order of their frames and using the attention mechanism, semantic feature extraction capability, and long-range feature capture capability of the transformer network, includes:
using the first transformer network, with an attention mechanism, to extract from the first gait features of each frame output by the convolutional neural network third gait features that suppress non-target objects and reinforce the target object;
and using the semantic feature extraction capability and long-range feature capture capability of the second transformer network to extract, from the third gait features of the multiple frames obtained by the first transformer network and the temporal order of the frames, second gait features that fuse the temporal and spatial features of the gait.
Optionally, the device provided in the embodiment of the present invention further includes:
a direction feature extraction module, configured to use the semantic feature extraction capability and long-range feature capture capability of the second transformer network to extract a direction feature describing the walking direction of the target object from the third gait features of the multiple frames obtained by the first transformer network and the temporal order of the frames.
Optionally, the feature extraction module 502 is further configured to:
process multiple frames of image samples corresponding to different target objects to obtain multiple frames of preprocessed image samples for each target object, and label the preprocessed frames of the same target object with that target object to obtain a training sample;
input the training samples into a network model, wherein the network model comprises a feature extraction layer, and the feature extraction layer comprises a first feature extraction layer using a convolutional neural network and a second feature extraction layer using a transformer network;
use the first feature extraction layer to extract first gait features from the single-frame preprocessed image samples in the temporal order of the frames, and use the second feature extraction layer, with the attention mechanism, semantic feature extraction capability, and long-range feature capture capability of the transformer network, to extract, based on the first gait features of each frame output by the first feature extraction layer and the temporal order of the frames, second gait features that fuse the temporal and spatial features of the gait;
determine the value of a loss function and adjust the parameters of the network model according to it, wherein the loss function comprises a triplet loss function determined from a first distance between second gait features of the same target object extracted by the second feature extraction layer and a second distance between second gait features of different target objects;
and, when the condition for ending parameter adjustment is met, take the current feature extraction layer as the gait feature extraction model.
Optionally, the network model further includes an output layer that predicts, from the second gait features output by the second feature extraction layer, the probabilities of the different recognized target objects, and the loss function further includes a cross-entropy loss for target object identification:
the probabilities of the different target objects predicted by the output layer are taken as the predicted values, and the probabilities of the different target objects expected to be output according to the target object labels of the preprocessed samples are taken as the target values;
the value of the cross-entropy loss function for target object identification is computed from the predicted values and the target values.
Optionally, the preprocessed image samples are further labeled with the walking direction of the target object, and the second feature extraction layer includes a third feature extraction layer using the first transformer network and a fourth feature extraction layer using the second transformer network;
the third feature extraction layer, using an attention mechanism, extracts from the first gait features of each frame output by the first feature extraction layer third gait features that suppress non-target objects and reinforce the target object, and the fourth feature extraction layer, using the semantic feature extraction capability and long-range feature capture capability of the transformer network, extracts, from the third gait features of the multiple frames output by the third feature extraction layer and the temporal order of the frames, second gait features that fuse the temporal and spatial features of the gait as well as the walking direction of the target object;
when the parameters of the fourth feature extraction layer in the network model are adjusted, the loss function further includes:
the value of the cross-entropy loss function for the walking direction, computed with the walking direction of the target object extracted by the fourth feature extraction layer as the predicted value and the walking direction labeled in the preprocessed samples as the target value.
Optionally, the first, second, and third gait features each include a global gait feature and local gait features;
the global gait feature represents the gait of the target object as a whole, and the local gait features represent the gait corresponding to local parts of the target object.
Optionally, after the transformer network extracts, using its attention mechanism, semantic feature extraction capability, and long-range feature capture capability and based on the first gait features and the temporal order of their frames, the second gait features that fuse temporal and spatial gait features, the second feature extraction module is further configured to:
compare the extracted second gait features for similarity with the gait features of a plurality of target objects prestored in a database, and determine the target object corresponding to the second gait features according to the comparison result;
wherein the gait features of the plurality of target objects prestored in the database are obtained by performing feature extraction on historical preprocessed images of the plurality of objects using the gait feature extraction model.
Based on the same application concept, the embodiment of the invention further provides gait feature extraction equipment, and the gait feature extraction equipment in the embodiment of the invention is described below with reference to the accompanying drawings.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method, or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
In some possible embodiments, an apparatus according to the present application may include at least one processor, and at least one memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the gait feature extraction method according to various exemplary embodiments of the present application described above in the present specification.
An apparatus 600 according to this embodiment of the present application is described below with reference to fig. 6. The apparatus 600 shown in fig. 6 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present application.
As shown in fig. 6, the device 600 is embodied in the form of a general purpose device. The components of device 600 may include, but are not limited to: the at least one processor 601, the at least one memory 602, the bus 603 connecting the different system components (including the memory 602 and the processor 601), wherein the memory stores program code that, when executed by the processor, causes the processor to perform the steps of:
preprocessing multiple frames of images of a target object to obtain preprocessed images of the corresponding frames, wherein the preprocessed image of a single frame comprises a human body contour map and a human body joint map;
inputting the obtained preprocessed images into a gait feature extraction model, wherein the gait feature extraction model comprises a convolutional neural network and a transformer network;
using the convolutional neural network to extract first gait features from the single-frame preprocessed images in the temporal order of the frames, wherein the first gait features represent the spatial features of the target object's gait;
inputting the first gait features obtained for each frame of preprocessed image into the transformer network, and using the attention mechanism, semantic feature extraction capability, and long-range feature capture capability of the transformer network to extract, based on the first gait features and the temporal order of their frames, second gait features that fuse the temporal and spatial features of the gait.
Bus 603 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The memory 602 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)6021 and/or cache memory 6022, and may further include read-only memory (ROM) 6023.
The memory 602 may also include a program/utility 6025 having a set (at least one) of program modules 6024, such program modules 6024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Device 600 may also communicate with one or more external devices 604 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with device 600, and/or with any devices (e.g., router, modem, etc.) that enable device 600 to communicate with one or more other devices. Such communication may occur via input/output (I/O) interfaces 605. Also, the device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 606. As shown, a network adapter 606 communicates with the other modules for the device 600 over the bus 603. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the device 600, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Optionally, the transformer network includes a first transformer network and a second transformer network, and the extraction by the transformer network of second gait features fusing temporal and spatial gait features, based on the first gait features and the temporal order of their frames and using the attention mechanism, semantic feature extraction capability, and long-range feature capture capability of the transformer network, includes:
using the first transformer network, with an attention mechanism, to extract from the first gait features of each frame output by the convolutional neural network third gait features that suppress non-target objects and reinforce the target object;
and using the semantic feature extraction capability and long-range feature capture capability of the second transformer network to extract, from the third gait features of the multiple frames obtained by the first transformer network and the temporal order of the frames, second gait features that fuse the temporal and spatial features of the gait.
Optionally, the processor is further configured to:
use the semantic feature extraction capability and long-range feature capture capability of the second transformer network to extract a direction feature describing the walking direction of the target object from the third gait features of the multiple frames obtained by the first transformer network and the temporal order of the frames.
Optionally, the processor is further configured to:
process multiple frames of image samples corresponding to different target objects to obtain multiple frames of preprocessed image samples for each target object, and label the preprocessed frames of the same target object with that target object to obtain a training sample;
input the training samples into a network model, wherein the network model comprises a feature extraction layer, and the feature extraction layer comprises a first feature extraction layer using a convolutional neural network and a second feature extraction layer using a transformer network;
use the first feature extraction layer to extract first gait features from the single-frame preprocessed image samples in the temporal order of the frames, and use the second feature extraction layer, with the attention mechanism, semantic feature extraction capability, and long-range feature capture capability of the transformer network, to extract, based on the first gait features of each frame output by the first feature extraction layer and the temporal order of the frames, second gait features that fuse the temporal and spatial features of the gait;
determine the value of a loss function and adjust the parameters of the network model according to it, wherein the loss function comprises a triplet loss function determined from a first distance between second gait features of the same target object extracted by the second feature extraction layer and a second distance between second gait features of different target objects;
and, when the condition for ending parameter adjustment is met, take the current feature extraction layer as the gait feature extraction model.
Optionally, the network model further includes an output layer that predicts, from the second gait features output by the second feature extraction layer, the probabilities of the different recognized target objects, and the loss function further includes a cross-entropy loss for target object identification:
the probabilities of the different target objects predicted by the output layer are taken as the predicted values, and the probabilities of the different target objects expected to be output according to the target object labels of the preprocessed samples are taken as the target values;
the value of the cross-entropy loss function for target object identification is computed from the predicted values and the target values.
Optionally, the preprocessed image samples are further labeled with the walking direction of the target object, and the second feature extraction layer includes a third feature extraction layer using the first transformer network and a fourth feature extraction layer using the second transformer network;
the third feature extraction layer, using an attention mechanism, extracts from the first gait features of each frame output by the first feature extraction layer third gait features that suppress non-target objects and reinforce the target object, and the fourth feature extraction layer, using the semantic feature extraction capability and long-range feature capture capability of the transformer network, extracts, from the third gait features of the multiple frames output by the third feature extraction layer and the temporal order of the frames, second gait features that fuse the temporal and spatial features of the gait as well as the walking direction of the target object;
when the parameters of the fourth feature extraction layer in the network model are adjusted, the loss function further includes:
the value of the cross-entropy loss function for the walking direction, computed with the walking direction of the target object extracted by the fourth feature extraction layer as the predicted value and the walking direction labeled in the preprocessed samples as the target value.
Optionally, the first, second, and third gait features each include a global gait feature and local gait features;
the global gait feature represents the gait of the target object as a whole, and the local gait features represent the gait corresponding to local parts of the target object.
Optionally, after the transformer network performs feature extraction of second-step features fusing gait temporal features and spatial features on the basis of the first-step features and the time sequence of their frames, by utilizing the attention mechanism, semantic feature extraction capability and long-distance feature capture capability of the transformer network, the processor is further configured to:
compare the extracted second-step features for similarity with gait features of a plurality of target objects pre-stored in a database, and determine the target object corresponding to the second-step features according to the comparison result;
wherein the gait features of the plurality of target objects pre-stored in the database are obtained by performing feature extraction on historical preprocessed images of the plurality of objects by using the gait feature extraction model.
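The similarity comparison with the database can be sketched as follows; the use of cosine similarity, the gallery size and the acceptance threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# query: second-step feature of the probe sequence; gallery: gait features of
# target objects previously extracted with the same model and stored in a database.
query = torch.randn(256)
gallery = torch.randn(1000, 256)          # 1000 enrolled target objects (assumed)

scores = F.cosine_similarity(query.unsqueeze(0), gallery, dim=1)  # similarity per entry
best_id = scores.argmax().item()          # index of the most similar enrolled object
# A match threshold (e.g. scores[best_id] > 0.7) would typically decide acceptance.
```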
In some possible embodiments, aspects of the gait feature extraction method provided by the present application may also be implemented in the form of a program product comprising program code; when the program product runs on a computer device, the program code causes the computer device to perform the steps of the gait feature extraction method according to the various exemplary embodiments of the present application described above in this specification.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a device. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device or server. In the case of a remote device, the remote device may be connected to the user device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external device (for example, through the Internet using an Internet service provider).
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and block diagrams, and combinations of flows and blocks in the flow diagrams and block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (11)

1. A gait feature extraction method is characterized by comprising the following steps:
preprocessing a plurality of frames of images of a target object to respectively obtain preprocessed images of corresponding frames, wherein the preprocessed images of a single frame comprise a human body contour image and a human body joint image;
inputting the obtained preprocessed image into a gait feature extraction model, wherein the gait feature extraction model comprises a convolutional neural network and a transformer network;
sequentially performing feature extraction of first-step features on the preprocessed image of a single frame by utilizing the convolutional neural network according to the time sequence of the frames, wherein the first-step features are used for representing the spatial features of the gait of the target object;
inputting the obtained first-step features of the preprocessed images of the frames into the transformer network, and performing, by the transformer network, feature extraction of second-step features fusing gait temporal features and spatial features on the basis of the first-step features and the time sequence of their frames, by utilizing the attention mechanism, semantic feature extraction capability and long-distance feature capture capability of the transformer network.
2. The method according to claim 1, wherein the transformer network comprises a first transformer network and a second transformer network, and the transformer network performing feature extraction of second-step features fusing gait temporal features and spatial features on the basis of the first-step features and the time sequence of their frames, by utilizing the attention mechanism, semantic feature extraction capability and long-distance feature capture capability of the transformer network, comprises:
performing, by utilizing the first transformer network with an attention mechanism, feature extraction of third-step features that suppress non-target objects and strengthen the target object on the first-step features of each frame output by the convolutional neural network;
and performing, by utilizing the semantic feature extraction capability and long-distance feature capture capability of the second transformer network, feature extraction of second-step features fusing gait temporal features and spatial features on the third-step features of the multiple frames obtained by the first transformer network and the time sequence of those frames.
3. The method of claim 2, further comprising:
and performing, by utilizing the semantic feature extraction capability and long-distance feature capture capability of the second transformer network, direction feature extraction of the walking direction of the target object on the third-step features of the multiple frames obtained by the first transformer network and the time sequence of those frames.
4. The method of any one of claims 1 to 3, further comprising:
processing multi-frame image samples respectively corresponding to different target objects to obtain multi-frame preprocessed image samples respectively corresponding to the different target objects, and labeling the multi-frame preprocessed images of the same target object with the target object to obtain training samples;
inputting the training samples into a network model, wherein the network model comprises a feature extraction layer, and the feature extraction layer comprises a first feature extraction layer adopting a convolutional neural network and a second feature extraction layer adopting a transformer network;
sequentially performing first-step feature extraction on the single-frame preprocessed image samples by using the first feature extraction layer according to the time sequence of the frames, and performing, by using the second feature extraction layer, second-step feature extraction fusing gait temporal features and spatial features on the basis of the first-step features of the frames output by the first feature extraction layer and the time sequence of those frames, by utilizing the attention mechanism, semantic feature extraction capability and long-distance feature capture capability of the transformer network;
determining the value of a loss function, and adjusting parameters of the network model according to the value of the loss function, wherein the loss function comprises a triplet loss function determined according to a first distance between second-step features of the same target object extracted by the second feature extraction layer and a second distance between second-step features of different target objects;
and when it is determined that the parameter adjustment end condition is met, taking the current feature extraction layer as the gait feature extraction model.
5. The method of claim 4, wherein the network model further comprises an output layer that predicts the probabilities of identifying different target objects based on the second-step features output by the second feature extraction layer, and wherein the loss function further comprises:
taking the probabilities of the different target objects predicted by the output layer as predicted values, and taking the expected output probabilities of the different target objects, determined from the target object labels of the preprocessed samples, as target values;
and calculating the cross-entropy loss value for target object identification from the predicted values and the target values.
6. The method of claim 4, wherein the preprocessed image samples are further labeled with the walking direction of the target object, and the second feature extraction layer comprises a third feature extraction layer adopting a first transformer network and a fourth feature extraction layer adopting a second transformer network;
performing, by using the third feature extraction layer with an attention mechanism, feature extraction of third-step features that suppress non-target objects and strengthen the target object on the basis of the first-step features of each frame output by the first feature extraction layer, and performing, by using the fourth feature extraction layer, feature extraction of the second-step features fusing gait temporal features and spatial features and of the walking direction of the target object, on the basis of the third-step features of the multiple frames output by the third feature extraction layer and the time sequence of those frames, by utilizing the semantic feature extraction capability and long-distance feature capture capability of the transformer network;
when the parameters of the fourth feature extraction layer in the network model are adjusted, the loss function further includes:
and calculating the cross-entropy loss value of the walking direction by taking the walking direction of the target object extracted by the fourth feature extraction layer as the predicted value and the walking direction of the target object labeled on the preprocessed samples as the target value.
7. The method according to any one of claims 1 to 3,
the first-step, second-step and third-step features each comprise a gait global feature and a gait local feature;
the gait global feature is used for representing the overall gait of the target object, and the gait local feature is used for representing the gait corresponding to a local body part of the target object.
8. The method according to any one of claims 1 to 3, wherein after the transformer network performs feature extraction of second-step features fusing gait temporal features and spatial features on the basis of the first-step features and the time sequence of their frames, by utilizing the attention mechanism, semantic feature extraction capability and long-distance feature capture capability of the transformer network, the method further comprises:
comparing the extracted second-step features for similarity with gait features of a plurality of target objects pre-stored in a database, and determining the target object corresponding to the second-step features according to the comparison result;
wherein the gait features of the plurality of target objects pre-stored in the database are obtained by performing feature extraction on historical preprocessed images of the plurality of objects by using the gait feature extraction model.
9. A gait feature extraction device, characterized by comprising:
the acquisition module is used for preprocessing a plurality of frames of images of a target object to respectively obtain preprocessed images of the corresponding frames, wherein the preprocessed image of a single frame comprises a human body contour image and a human body joint image;
the gait feature extraction module is used for inputting the obtained preprocessed image into a gait feature extraction model, and the gait feature extraction model comprises a convolutional neural network and a transformer network;
the first feature extraction module is used for sequentially performing feature extraction of first-step features on the preprocessed image of a single frame by utilizing the convolutional neural network according to the time sequence of the frames, wherein the first-step features are used for representing the spatial features of the gait of the target object;
and the second feature extraction module is used for inputting the obtained first-step features of the preprocessed images of the frames into the transformer network; the transformer network performs feature extraction of second-step features fusing gait temporal features and spatial features on the basis of the first-step features and the time sequence of their frames, by utilizing the attention mechanism, semantic feature extraction capability and long-distance feature capture capability of the transformer network.
10. A gait feature extraction device, comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and the computer program, when executed by the processor, implements the method of any one of claims 1 to 8.
11. A computer-readable storage medium comprising computer program instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 8.
CN202210156625.1A 2022-02-21 2022-02-21 Gait feature extraction method, device and equipment Pending CN114550291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210156625.1A CN114550291A (en) 2022-02-21 2022-02-21 Gait feature extraction method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210156625.1A CN114550291A (en) 2022-02-21 2022-02-21 Gait feature extraction method, device and equipment

Publications (1)

Publication Number Publication Date
CN114550291A true CN114550291A (en) 2022-05-27

Family

ID=81675212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210156625.1A Pending CN114550291A (en) 2022-02-21 2022-02-21 Gait feature extraction method, device and equipment

Country Status (1)

Country Link
CN (1) CN114550291A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050101A (en) * 2022-07-18 2022-09-13 Sichuan University Gait recognition method based on skeleton and contour feature fusion
CN115050101B (en) * 2022-07-18 2024-03-22 Sichuan University Gait recognition method based on fusion of skeleton and contour features

Similar Documents

Publication Publication Date Title
Gan et al. OFF-ApexNet on micro-expression recognition system
US11379696B2 (en) Pedestrian re-identification method, computer device and readable medium
CN113313736B (en) Online multi-target tracking method for unified target motion perception and re-identification network
CN111079646A (en) Method and system for positioning weak surveillance video time sequence action based on deep learning
CN112488073A (en) Target detection method, system, device and storage medium
US20080002856A1 (en) Tracking system with fused motion and object detection
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN109858358A (en) Personnel's trace tracking method and its system, computer readable storage medium between building
CN111523378B (en) Human behavior prediction method based on deep learning
KR102592551B1 (en) Object recognition processing apparatus and method for ar device
CN110555405A (en) Target tracking method and device, storage medium and electronic equipment
CN112560579A (en) Obstacle detection method based on artificial intelligence
CN114550291A (en) Gait feature extraction method, device and equipment
CN111008621A (en) Object tracking method and device, computer equipment and storage medium
Wang et al. Non-local attention association scheme for online multi-object tracking
Kamiński et al. Human activity recognition using standard descriptors of MPEG CDVS
CN111814588A (en) Behavior detection method and related equipment and device
Ahmed et al. Robust suspicious action recognition approach using pose descriptor
Li et al. Improved edge lightweight YOLOv4 and its application in on-site power system work
Das et al. Indian sign language recognition system for emergency words by using shape and deep features
CN113989741A (en) Method for detecting pedestrians sheltered in plant area of nuclear power station by combining attention mechanism and fast RCNN
Xu et al. Deep Learning Techniques for Video Instance Segmentation: A Survey
US20240054757A1 (en) Methods and systems for temporal action localization of video data
WO2021032295A1 (en) System and method for detecting person activity in video
CN117058595B (en) Video semantic feature and extensible granularity perception time sequence action detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination