CN114783069A - Method, device, terminal equipment and storage medium for identifying object based on gait - Google Patents

Method, device, terminal equipment and storage medium for identifying object based on gait

Info

Publication number
CN114783069A
CN114783069A (application number CN202210703368.9A)
Authority
CN
China
Prior art keywords
data
sample
feature
video
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210703368.9A
Other languages
Chinese (zh)
Other versions
CN114783069B (en)
Inventor
苏航
刘海亮
汤武惊
张怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute of Sun Yat Sen University
Original Assignee
Shenzhen Research Institute of Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute of Sun Yat Sen University filed Critical Shenzhen Research Institute of Sun Yat Sen University
Priority to CN202210703368.9A priority Critical patent/CN114783069B/en
Publication of CN114783069A publication Critical patent/CN114783069A/en
Application granted granted Critical
Publication of CN114783069B publication Critical patent/CN114783069B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of equipment management, and provides a method, an apparatus, a terminal device and a storage medium for identifying an object based on gait, wherein the method comprises the following steps: receiving target video data to be identified; importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data; importing the inter-frame action feature data into a pooling fusion network, and outputting fusion feature data corresponding to the target video data, wherein the fusion feature data comprises multi-channel feature data; generating a plurality of feature segments based on the fusion feature data, inputting all the feature segments to a voting time correlation module, and generating a classification result of the target object, wherein each feature segment corresponds to one classification category; and determining the behavior category of the target object according to the classification result. By adopting the method, the accuracy of behavior recognition for video data can be greatly improved.

Description

Method, device, terminal equipment and storage medium for identifying object based on gait
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for identifying an object based on gait, a terminal device, and a storage medium.
Background
With the continuous development of artificial intelligence technology, a computer can assist a user in executing various types of recognition operations so as to improve the processing efficiency of the user. For example, when a user analyzes video data, the behavior type of a target person in the video data can be determined through an artificial intelligence algorithm, so that the user can conveniently analyze the target person, for example, when the user performs behavior tracking on the target person or monitors dangerous actions in a key area, the workload of the user can be greatly reduced through artificial intelligence behavior recognition, and the analysis efficiency is improved.
Existing behavior recognition technology often uses optical flow information to determine the temporal information and spatial information of a target object in a video so as to determine the behavior type of the target object. However, the action change frequencies of different behavior types differ: videos of the same behavior type may show large frame-to-frame differences in pixel information, while videos of different behavior types may show only small frame-to-frame differences. When behavior types are recognized through optical flow information, recognition errors therefore occur easily, which reduces the accuracy of behavior recognition.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a terminal device and a storage medium for recognizing an object based on gait, which can solve the problem in the existing behavior recognition technology that, when behavior types are recognized through optical flow information, videos of the same behavior type may differ greatly in the pixel information of each frame while videos of different types may differ only slightly, so that recognition errors occur easily and the accuracy of behavior recognition is reduced.
In a first aspect, an embodiment of the present application provides a method for identifying an object based on gait, including:
receiving target video data to be identified;
importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
importing the inter-frame action feature data into a pooling fusion network, and outputting fusion feature data corresponding to the target video data; the fused feature data comprises multi-channel feature data;
generating a plurality of feature segments based on the fused feature data, inputting all the feature segments to a voting time correlation module, and generating a classification result of the target object; each feature fragment corresponds to one classification category;
and determining the behavior category of the target object according to the classification result.
In a possible implementation manner of the first aspect, the voting time correlation module includes a plurality of confidence units; the generating a plurality of feature segments based on the fused feature data and outputting the classification result corresponding to each feature segment through a voting time correlation module includes:
pooling the fused feature data in a spatial dimension through a one-dimensional space convolution layer to generate a plurality of feature fragments; each feature segment corresponds to one of the classification categories;
generating a plurality of convolution kernels according to the gait feature information of the target object; each convolution kernel corresponds to one motion capture frame rate; each motion capture frame rate is determined according to the classification category;
inputting each feature segment into the confidence coefficient unit corresponding to the classification category, and outputting confidence coefficient parameters related to the classification category; the confidence unit is generated from a convolution kernel of a motion capture frame rate associated with the classification category;
and generating the classification result according to all the confidence coefficient parameters.
In a possible implementation manner of the first aspect, before generating a plurality of feature segments based on the fused feature data and outputting a classification result corresponding to each feature segment through a voting time correlation module, the method further includes:
acquiring a plurality of first sample groups; the first sample group comprises a plurality of first sample behavior videos with the same behavior category;
training and learning an original correlation module according to the first sample group so as to enable the first similarity loss of the original correlation module to be smaller than a first loss threshold value, and generating a primary correction module; the first similarity loss is:
L_{sim}(Sam;\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log p\big(y_i \mid Sam_i;\theta\big) + \lambda \sum_{i \neq j} G(y_i, y_j)\, D\big(p(y \mid Sam_i;\theta),\ p(y \mid Sam_j;\theta)\big)
wherein L_{sim} is the first similarity loss value; Sam is the first sample group; Sam_i is the i-th first sample behavior video in the first sample group; p(y_i | Sam_i; \theta) is the conditional probability distribution of the i-th first sample behavior video; y_i is the behavior type of the i-th first sample behavior video; D(\cdot,\cdot) is the pairwise distance between any two first sample behavior videos; N is the total number of videos contained in the first sample group; \theta is a learning parameter of the original correlation module; G(\cdot,\cdot) is the behavior similarity function; and \lambda is a preset coefficient;
obtaining a second sample group; the second sample group comprises a plurality of second sample behavior videos of different behavior categories;
and training and learning the primary correction module according to the second sample group so that the second similarity loss of the primary correction module is smaller than a second loss threshold, and generating the voting time correlation module.
In a possible implementation manner of the first aspect, before the receiving target video data to be identified, the method further includes:
acquiring sample video data for training a behavior recognition module; the behavior recognition module comprises the inter-frame action extraction network, the pooling fusion network and the voting time correlation module;
generating positive sample data and negative sample data according to the sample video data; the positive sample data is obtained after interference processing is carried out on background information in the sample video data; the negative sample data is obtained by carrying out interference processing on a frame sequence of a sample video frame in the sample video data;
generating first spatial information and first optical flow information through the positive sample data, and generating second spatial information and second optical flow information through the negative sample data;
obtaining space enhancement information according to the first space information and the second space information;
obtaining optical flow enhancement information according to the second optical flow information and the first optical flow information;
importing the spatial enhancement information and the optical flow enhancement information into the behavior recognition module to obtain a training recognition result of the sample video data;
and pre-training the position learning parameters in the initial recognition module based on the training recognition results of all the sample video data to obtain the behavior recognition module.
In a possible implementation manner of the first aspect, the generating positive sample data and negative sample data according to the sample video data includes:
marking sample objects in each sample video frame of the sample video data, and identifying other regions except the sample objects as background regions;
performing interpolation processing on the background area through a preset thin plate spline to obtain a spatial interference image frame;
and packaging according to the frame sequence number of each spatial interference image frame in the sample video data to obtain the positive sample data.
In a possible implementation manner of the first aspect, the generating positive sample data and negative sample data according to the sample video data includes:
dividing the sample video data into a plurality of video segments according to a preset action time duration; the paragraph duration of each video segment is not greater than the action time duration;
respectively updating the frame sequence numbers of the sample video frames in the video segments according to a preset out-of-order processing algorithm;
and packaging each sample video frame based on the updated frame sequence number to obtain the negative sample data.
In a possible implementation manner of the first aspect, the importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data further includes:
determining image tensors of any two continuous video image frames in the target video data;
determining a plurality of feature point coordinates according to the key positions of the target object in the video image frame; the characteristic point coordinates are determined according to the gait behavior of the target object;
determining the tensor expression of each feature point coordinate in the image tensor, and generating a feature vector of the target object in the video image frame based on the tensor expressions of all the feature points;
constructing a displacement correlation matrix according to the characteristic vectors of any two continuous video image frames; the displacement correlation matrix is used for determining displacement correlation scores between the coordinates of each characteristic point in one video image frame and each coordinate point in the other video image frame;
determining the maximum displacement distance of each characteristic point coordinate between the two continuous video image frames according to the displacement correlation matrix, and determining the displacement matrix of the target object based on all the maximum displacement distances;
importing the displacement matrix into a preset feature transformation model to generate action feature subdata of any two continuous video image frames;
and obtaining the inter-frame action characteristic data based on the action characteristic subdata of all the video image frames.
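The displacement-correlation step described in this implementation can be illustrated with a short sketch. This is not the patent's implementation: the cosine-similarity correlation measure, tensor shapes and key-point count are assumptions made only for illustration.

```python
# Minimal sketch of building a displacement-correlation matrix between the
# feature points of two consecutive frames and taking the best match per point
# (assumed shapes and correlation measure).
import torch

def displacement_matrix(feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
    """feat_t, feat_t1: (K, C) feature vectors of K gait key points in two
    consecutive frames. Returns a (K, 2) tensor holding, for every key point
    in frame t, the index of the best-matching key point in frame t+1 and the
    corresponding correlation score."""
    # Displacement-correlation matrix: cosine similarity between every pair of
    # key-point features across the two frames, shape (K, K).
    corr = torch.nn.functional.normalize(feat_t, dim=1) @ \
           torch.nn.functional.normalize(feat_t1, dim=1).T
    # Maximum-correlation match for each key point in frame t.
    score, idx = corr.max(dim=1)
    return torch.stack([idx.float(), score], dim=1)

# Usage with random features for K = 17 key points and C = 64 channels.
K, C = 17, 64
disp = displacement_matrix(torch.randn(K, C), torch.randn(K, C))
print(disp.shape)  # torch.Size([17, 2])
```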
In a second aspect, an embodiment of the present application provides an apparatus for identifying an object based on gait, including:
the target video data receiving unit is used for receiving target video data to be identified;
the inter-frame action characteristic data extraction unit is used for importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
the fusion characteristic data unit is used for importing the inter-frame action characteristic data into a pooling fusion network and outputting fusion characteristic data corresponding to the target video data;
the gait behavior data identification unit is used for importing the target video data into a context attention network and determining the gait behavior data of a target object in the target video data; the contextual attention network is used for extracting a mutual position relation between the target object and an environmental object in the target video data;
and the behavior identification unit is used for obtaining the behavior category of the target object according to the gait behavior data and the fusion characteristic data.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor, when executing the computer program, implements the method according to any one of the above first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to any one of the above first aspects.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on a server, causes the server to perform the method of any one of the first aspect.
Compared with the prior art, the embodiments of the application have the following advantages. After target video data needing behavior recognition is received, the target video data is imported into the inter-frame action extraction network, the action feature information between video image frames is extracted, and action feature data is generated based on the action feature information between all the video image frames. The action feature data is then imported into the pooling fusion network for feature extraction to obtain corresponding fusion feature data, the fusion feature data is divided into a plurality of feature segments of different classification categories, the score values of the different classification categories are determined through the voting time correlation module to obtain a classification result, and finally the behavior category of the target object is determined from the classification result, thereby realizing automatic recognition of the behavior category of the target object. Compared with the existing behavior recognition technology, optical flow information is not used for recognition; instead, the inter-frame action feature data between video image frames is determined first and then pooled and fused, which improves the sensitivity to actions between video frames, and the fusion feature data is then divided into a plurality of feature segments of different classification categories. Because the voting time correlation module can determine the score value between a feature segment and a classification category by an algorithm associated with that classification category, fine recognition of behavior categories can be achieved, thereby improving recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a schematic diagram of an implementation of a method for identifying an object based on gait according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an inter-frame action extraction network according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a pooled fusion network according to an embodiment of the present application;
fig. 4(a) is a schematic diagram of an implementation manner of S104 of a method for identifying an object based on gait according to an embodiment of the present application;
fig. 4(b) is a schematic structural diagram of a voting time correlation module according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating an implementation of a method for identifying an object based on gait according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating an implementation manner of a method for identifying an object based on gait according to an embodiment of the present application;
fig. 7 is a schematic diagram illustrating an implementation manner of a method S102 for identifying an object based on gait according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an apparatus for identifying an object based on gait provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
The method for recognizing the object based on the gait, provided by the embodiment of the application, can be applied to terminal equipment capable of recognizing the behavior of the video data, such as a smart phone, a server, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook and the like. The embodiment of the present application does not set any limit to the specific type of the terminal device.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an implementation of a method for identifying an object based on gait according to an embodiment of the present application, where the method includes the following steps:
in S101, target video data to be recognized is received.
In this embodiment, the electronic device may be configured with a video database containing a plurality of video data. When behavior identification needs to be performed on certain video data in the video database, the terminal device identifies the video data as target video data and performs subsequent processing. The video data of the recognized behavior category contains the recognized behavior category, and the behavior flag of the video data of which the behavior category recognition is not performed is empty. In this case, the terminal device may read whether the behavior flag is empty, and recognize the video data whose behavior flag is empty as the target video data.
In one possible implementation, the execution subject may be a video server. When a user needs to identify the behavior in a certain video, the user can install a corresponding client program on a local user terminal, import the target video data to be identified into the client program, and initiate an identification request. After receiving the identification request, the user terminal establishes a communication connection with the video server through the client program and sends the target video data to the video server, and the video server performs the behavior identification.
In a possible implementation manner, in order to improve the efficiency of behavior recognition, the terminal device may set a corresponding video duration threshold, if the video duration of the original video data is greater than the video duration threshold, the original video data may be divided into more than two video segments, the video duration of each video segment is not greater than the video duration threshold, the divided video segments are recognized as target video data, and a subsequent behavior recognition operation is performed.
In S102, importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame motion feature data is used to determine motion feature information between adjacent video image frames in the target video data.
In this embodiment, in order to reduce the computational burden of behavior identification, an inter-frame action extraction network is configured in the action behavior identification module of the terminal device. The inter-frame action extraction network is specifically configured to determine action feature information between any two adjacent video image frames; that is, the key point of its identification is not the global motion of the user but the motion change between every two frames. By combing the motion changes between all frames, the complete behavior motion of the whole video can be obtained, which facilitates subsequent behavior identification. Compared with global optical flow information, the inter-frame action extraction network provided by the embodiment of the application is plug-and-play: the data input to it each time is only two video image frames, and the whole target video data does not need to be imported into the identification network to extract optical flow information, which reduces the occupancy of cache space and the requirement on the computing capacity of the computer.
In a possible implementation manner, the manner of determining the motion characteristic information between the video image frames may specifically be: and identifying an object region of a target object through the inter-frame action extraction network, then identifying an area deviation between the two object regions, determining action characteristic information of the target object according to the direction, the position and the size of the deviation area, then determining the number of each action characteristic information according to the frame number of each video image frame, and packaging all the action characteristic information according to the number to generate the action characteristic data.
Exemplarily, fig. 2 illustrates a schematic structural diagram of an inter-frame action extraction network provided in an embodiment of the present application. Referring to fig. 2, the input data of the inter-frame action extraction network is two video image frames, namely an image t and an image t +1, the two video image frames are two video image frames with adjacent frame numbers, the electronic device can perform vector conversion on the two video image frames through a vector conversion module, then perform dimension reduction processing through a pooling layer, determine displacement information between vector identifiers corresponding to the two video image frames through an activation layer and a displacement calculation module, and then determine action information between the two video image frames through an action identification unit. The motion recognition unit may be specifically configured with a plurality of convolution layers, and as shown in the drawing, the motion recognition unit may include a first convolution layer configured with a convolution kernel of 1 × 7, a second convolution layer configured with a convolution kernel of 1 × 3, a third convolution layer configured with a convolution kernel of 1 × 3, and a fourth convolution layer configured with a convolution kernel of 1 × 3.
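The action recognition unit of fig. 2 can be sketched roughly as follows. Only the 1 × 7 and 1 × 3 kernel sizes come from the description above; the channel count, activations and padding are assumptions made for illustration.

```python
# Rough PyTorch sketch of the action-recognition unit described above; the
# channel counts and exact wiring are assumptions, only the 1x7 / 1x3 kernel
# sizes follow the figure description.
import torch
import torch.nn as nn

class ActionRecognitionUnit(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
        )

    def forward(self, displacement: torch.Tensor) -> torch.Tensor:
        # displacement: (B, C, 1, W) displacement information between the
        # vector identifiers of image t and image t+1.
        return self.layers(displacement)

unit = ActionRecognitionUnit()
print(unit(torch.randn(2, 64, 1, 56)).shape)  # torch.Size([2, 64, 1, 56])
```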
In S103, the inter-frame motion feature data is imported into a pooling fusion network, and fusion feature data corresponding to the target video data is output.
In this embodiment, since each piece of motion feature information in the inter-frame motion extraction module is discrete, feature extraction is required to be performed on the basis of the discrete motion feature information to determine a continuous motion for subsequent motion recognition, based on which, the terminal device may import inter-frame motion feature data into the pooling fusion network, perform pooling dimension reduction processing, perform feature fusion, and output corresponding fusion feature data. Wherein, the fusion feature data can be expressed as:
Figure 22065DEST_PATH_IMAGE008
wherein, Maxpool is the fusion characteristic data;
Figure 234872DEST_PATH_IMAGE009
the inter-frame action information corresponding to the ith video image frame; n is the total number of frames in the target video data; and T is the feature transpose.
Further, as another embodiment of the present application, the pooled fusion network is specifically a homologous bilinear pooled network, and the homologous bilinear pooled network is a symmetric matrix generated by calculating an outer product of features at different spatial positions, and then performing an average pooling on the matrix to obtain bilinear features, which can provide a stronger feature representation than a linear model and can be optimized in an end-to-end manner. The traditional Global Average Pooling (GAP) only captures first-order statistical information, and ignores more detailed characteristics useful for behavior identification, and for the problem, a bilinear pooling method used in fine-grained classification is used for reference and is fused with the GAP method, so that more detailed characteristics can be extracted for behaviors with higher similarity, and a better identification result is obtained.
Illustratively, fig. 3 shows a schematic structural diagram of a pooled fusion network provided in an embodiment of the present application. Referring to fig. 3, the pooling fusion network includes bilinear pooling and a first-order pooling. And inserting a bilinear pooling module into the features extracted from the last layer of convolution layer before global average pooling to capture second-order statistics of the spatial feature map so as to obtain second-order classification output, and adding the first-order feature vectors obtained by global average pooling to obtain a classification output vector. By combining the first-order and second-order vectors, large context clues and fine-grained information of behaviors can be captured, and the classification layer of the existing behavior recognition network is enriched. Meanwhile, the original GAP branch is crucial to the back propagation in the end-to-end training process, and the training difficulty of the bilinear pool module can be reduced.
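A minimal sketch of the first-order/second-order fusion described above, assuming that the global-average-pooling branch and the bilinear branch are fused by adding their classification outputs; all layer sizes are illustrative.

```python
# Hedged sketch of combining global average pooling (first order) with
# homologous bilinear pooling (second order); shapes and the fusion by
# addition follow the description above, everything else is assumed.
import torch
import torch.nn as nn

class PoolingFusion(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.fc_first = nn.Linear(channels, num_classes)               # GAP branch
        self.fc_second = nn.Linear(channels * channels, num_classes)   # bilinear branch

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) feature map from the last convolution layer.
        b, c, h, w = feat.shape
        first = self.fc_first(feat.mean(dim=(2, 3)))        # first-order feature vector
        x = feat.flatten(2)                                  # (B, C, H*W)
        bilinear = (x @ x.transpose(1, 2)) / (h * w)         # outer-product (second-order) statistics
        second = self.fc_second(bilinear.flatten(1))
        return first + second                                # fused classification output

fusion = PoolingFusion(channels=64, num_classes=10)
print(fusion(torch.randn(2, 64, 7, 7)).shape)  # torch.Size([2, 10])
```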
In S104, generating a plurality of feature segments based on the fused feature data, and inputting all the feature segments to a voting time correlation module to generate a classification result of the target object; each of the feature segments corresponds to a classification category.
In this embodiment, the terminal device infers a channel attention map using the relationship between the feature channels. The channel attention module focuses on how to pick the channels of interest; the goal is to improve the learning capability of the network by adjusting the weight of each channel signal in the intermediate feature map. To efficiently capture the channel attention map in each slice tensor, the spatial dimensions of the fused feature data are first compressed by a global average pooling operation to generate a channel descriptor F, which represents the average-pooled feature. The terminal device then inputs the channel descriptor into a multi-layer perceptron with one hidden layer to fully capture the dependencies between channels. Finally, the input feature map U is multiplied channel-wise by the output Mc(F) of the multi-layer perceptron to obtain a channel attention feature map. Each channel of the multi-channel attention feature map corresponds to one classification category, and the terminal device can divide the channel attention feature map according to the number of channels to obtain a plurality of feature segments, each of which corresponds to one classification category.
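The channel-attention step can be sketched as follows; only the pool–perceptron–multiply structure follows the description above, while the hidden-layer reduction ratio and the sigmoid gating are assumptions.

```python
# Minimal channel-attention sketch (global average pool, a one-hidden-layer
# perceptron, channel-wise multiplication); reduction ratio is assumed.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (B, C, T, H, W) intermediate feature map U.
        f = u.mean(dim=(2, 3, 4))                    # channel descriptor F (average-pooled feature)
        mc = torch.sigmoid(self.mlp(f))              # Mc(F), per-channel weights
        return u * mc[:, :, None, None, None]        # channel attention feature map

att = ChannelAttention(channels=32)
print(att(torch.randn(2, 32, 8, 14, 14)).shape)  # torch.Size([2, 32, 8, 14, 14])
```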
It should be noted that the classification category is determined according to the identifiable behavior category.
In this embodiment, the terminal device is configured with a voting time correlation module. Different action categories (i.e., classification categories) often have different temporal characteristics; for example, for the same stride length, a fast stride corresponds to the running category while a slow stride corresponds to the walking category. Therefore, the score value corresponding to each channel, that is, the confidence of belonging to the corresponding classification category, is determined through the voting time correlation module, and the score values of all the channels are encapsulated to obtain the classification result.
In S105, determining a behavior class of the target object according to the classification result.
In this embodiment, after obtaining the classification result, the terminal device may determine the score corresponding to each classification category and select the classification category with the highest score as the behavior category of the target object.
In a possible implementation manner, the video length of the target video data may be long, so that the target object may perform multiple types of behavior actions over the whole video. In this case, the terminal device may output a behavior sequence according to the occurrence order of each behavior, where the behavior sequence includes a plurality of elements and each element corresponds to one behavior category.
As can be seen from the above, in the method for identifying an object based on gait provided in the embodiment of the present application, after target video data that needs behavior identification is received, the target video data is imported into the inter-frame action extraction network, the action feature information between video image frames is extracted, and action feature data is generated based on the action feature information between all the video image frames. The action feature data is then imported into the pooling fusion network for feature extraction to obtain corresponding fusion feature data, the fusion feature data is divided into a plurality of feature segments of different classification categories, the score values of the different classification categories are determined by the voting time correlation module to obtain a classification result, and finally the behavior category of the target object is determined from the classification result, so that automatic identification of the behavior category of the target object is achieved. Compared with the existing behavior recognition technology, optical flow information is not used for recognition; instead, the inter-frame action feature data between video image frames is determined first and then pooled and fused, which improves the sensitivity to actions between video frames, and the fusion feature data is then divided into a plurality of feature segments of different classification categories. Because the voting time correlation module can determine the score value between a feature segment and a classification category by an algorithm associated with that classification category, fine recognition of behavior categories can be achieved, thereby improving recognition accuracy.
Fig. 4(a) shows a flowchart of a specific implementation of the method S104 for identifying an object based on gait according to the second embodiment of the present invention. Referring to fig. 4(a), with respect to the embodiment shown in fig. 1, in the method for identifying an object based on gait provided in this embodiment, S104 includes: S1041-S1044, which are detailed in detail as follows:
further, the vote time correlation module comprises a plurality of confidence units; the generating a plurality of feature segments based on the fused feature data and outputting the classification result corresponding to each feature segment through a voting time correlation module includes:
in S1041, pooling the fused feature data in a spatial dimension by a one-dimensional spatial convolution layer to generate a plurality of feature fragments; each of the feature segments corresponds to one of the classification categories.
In this embodiment, the terminal device compresses the spatial dimension so that subsequent identification can focus on the time dimension. Based on this, the fused feature data is pooled in the spatial dimension by the one-dimensional spatial convolution layer. For example, the fused feature data is vector data of C' × T × H × W, where C', T, H and W respectively represent the number of channels, the time dimension, and the height and width of the feature map. To reduce the computational cost and the influence of spatial information, the features are first pooled spatially, and channel compression is performed using a 1 × 1 convolution kernel, resulting in an output with the shape C × T, where C represents the number of classification categories.
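A small sketch of S1041 under assumed tensor shapes: spatial pooling followed by 1 × 1 channel compression yields one feature segment per classification category.

```python
# Sketch of S1041 with assumed shapes: pool away the spatial dimension, then
# compress channels with a 1x1 convolution so the output has shape (C, T),
# C being the number of classification categories.
import torch
import torch.nn as nn

num_classes, c_in, t, h, w = 10, 256, 16, 7, 7
feat = torch.randn(1, c_in, t, h, w)                      # fused feature data C' x T x H x W

pooled = feat.mean(dim=(3, 4))                            # spatial pooling -> (1, C', T)
compress = nn.Conv1d(c_in, num_classes, kernel_size=1)    # 1x1 convolution for channel compression
segments = compress(pooled)                               # (1, C, T): one feature segment per category
print(segments.shape)                                     # torch.Size([1, 10, 16])
```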
In S1042, a plurality of convolution kernels are generated according to the gait feature information of the target object; each convolution kernel corresponds to one motion capture frame rate; each motion capture frame rate is determined according to the classification category.
In this embodiment, the terminal device may obtain the gait feature information corresponding to the target object so as to determine the possible gait types of the target object. For example, the stride and step frequency of a child differ greatly from those of an adult, and the strides of young and middle-aged adults also differ slightly; that is, age, sex, height and the like all have a certain influence on gait. Different convolution kernels are used for determining the score values of different gait types, and different convolution kernels capture motion at different time granularities: a smaller convolution kernel captures the time dimension at a fine granularity and can therefore identify faster behaviors, while a larger convolution kernel captures the time dimension at a coarse granularity and can identify slower types of behavior.
In S1043, inputting each feature segment to the confidence unit corresponding to the classification category, and outputting a confidence parameter related to the classification category; the confidence units are generated from convolution kernels for a motion capture frame rate associated with the classification category.
In this embodiment, the terminal device may import the feature segment obtained from each channel into the confidence unit associated with it, and calculate the confidence parameter, that is, the score value, corresponding to that feature segment and confidence unit, so as to determine the probability that the target object in the target video data belongs to the classification category. Sparse time sampling strategies with different sampling intervals are integrated into the feature-level video representation and serve as the input of one-dimensional temporal convolutions with different dilation rates. As the sampling stride grows, the proportion of fine-grained time information decreases. A branch with a higher motion capture frame rate is used to capture fast motion, and a branch with a lower motion capture frame rate is used to capture slow motion; each branch can be regarded as an independent voter.
In S1044, the classification result is generated according to all the confidence parameters.
In this embodiment, the terminal device may obtain an output of each confidence unit, that is, the confidence parameter is encapsulated to obtain the classification result.
Exemplarily, fig. 4(b) shows a schematic diagram of the voting time correlation module provided in an embodiment of the present application. As shown in fig. 4(b), the spatial dimension is compressed by the one-dimensional spatial convolution layer so that attention is focused on the time-dimension characteristics, and confidence units, such as confidence units 1-3, are constructed with different convolution kernels. Because, in behavior identification, the per-frame pixel information may differ greatly within the same category and differ only slightly between different categories, fine identification in the time dimension improves the identification accuracy.
In the embodiment of the application, confidence coefficient units with different time sensitivities can be configured by creating different convolution kernels, so that confidence coefficient parameters with different action speeds are determined, identification of different action types is realized, and the identification accuracy is improved.
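The voting time correlation module can be sketched as a set of one-dimensional temporal convolutions with different dilation rates acting as confidence units ("voters") over the per-category feature segments; the kernel sizes, dilation rates and score aggregation by averaging are assumptions made for illustration.

```python
# Hedged sketch of the voting time correlation module: several dilated 1D
# temporal convolutions vote a confidence per classification category.
import torch
import torch.nn as nn

class VotingTimeCorrelation(nn.Module):
    def __init__(self, num_classes: int, dilations=(1, 2, 4)):
        super().__init__()
        # One confidence unit per dilation rate (i.e. per motion capture frame rate).
        self.voters = nn.ModuleList(
            nn.Conv1d(num_classes, num_classes, kernel_size=3,
                      dilation=d, padding=d, groups=num_classes)
            for d in dilations
        )

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (B, C, T) feature segments, one channel per classification category.
        votes = [v(segments).mean(dim=2) for v in self.voters]   # confidence parameters per voter
        return torch.stack(votes).mean(dim=0)                    # fused classification result (B, C)

vtc = VotingTimeCorrelation(num_classes=10)
print(vtc(torch.randn(2, 10, 16)).shape)  # torch.Size([2, 10])
```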
Fig. 5 is a flowchart illustrating a specific implementation of a method for identifying a subject based on gait according to a second embodiment of the invention. Referring to fig. 5, in the method for identifying an object based on gait according to the embodiment, before generating a plurality of feature segments based on the fused feature data and outputting a classification result corresponding to each feature segment through a voting time correlation module, the method further includes: S501-S504, which are detailed as follows:
in S501, a plurality of first sample groups are acquired; the first sample group comprises a plurality of first sample behavior videos with the same behavior category;
in S502, training and learning an original correlation module according to the first sample group, so that a first similarity loss of the original correlation module is smaller than a first loss threshold, and a primary correction module is generated; the first similarity loss is:
Figure 81605DEST_PATH_IMAGE010
wherein the content of the first and second substances,
Figure 620034DEST_PATH_IMAGE002
is the first similarity loss value; sam is the first sample group;
Figure 552218DEST_PATH_IMAGE003
running a video for an ith first sample in the first sample group;
Figure 99874DEST_PATH_IMAGE004
a conditional probability distribution for the ith first sample behavior video;
Figure 801114DEST_PATH_IMAGE005
the behavior type of the ith first sample behavior video is set;
Figure 152854DEST_PATH_IMAGE006
the distance between pairs of the video is acted on any two first samples; n is the total number of videos contained in the first sample group; theta is a learning parameter of the original correlation module;
Figure 775596DEST_PATH_IMAGE011
is a behavior similarity function; λ is a preset coefficient.
In this embodiment, the voting time correlation module is trained on the basis of minimizing a cross-entropy loss. Minimizing only the cross-entropy loss forces the neural network to learn features that distinguish a video with high confidence, so it tends to learn sample-specific features, especially for easily confused motion classes. This effect is particularly pronounced in behavior recognition because, compared with the number of parameters of a three-dimensional convolutional network, the data set contains too few samples to learn generalizable class-specific features. Based on this, regularization is added to input descriptors of the same class, and a new "similarity loss" is proposed that introduces class-specific constraints on the output and helps to improve the characterization capability. First, the distance between two probability distributions is defined. The conditional probability distribution of an input video of the first sample group is denoted as p(y | Sam_i). For M classification categories, each distribution is an M-dimensional vector, and since all the first sample behavior videos in the first sample group belong to the same class, the output difference between them can be reduced. The pairwise distance of a pair of first sample behavior videos is defined using the model parameter θ, and the corresponding similarity loss, namely the first similarity loss, is determined from the pairwise distances.
For the behavior similarity function, since all videos in the first sample group belong to the same type, the similarity constraint is active for every pair within the group, while pairs of videos of different types contribute nothing to it. A pair of samples randomly drawn from the same class is thus encouraged to share a similar probability distribution, thereby partially preserving the intra-class similarity. The network is forced to learn common movement patterns, which prevents it from over-fitting to sample-specific information.
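A hedged sketch of the first similarity loss as described above: a cross-entropy term plus a λ-weighted pairwise distance between the conditional probability distributions of same-class sample videos. The choice of symmetric KL divergence as the pairwise distance D and the value of λ are assumptions; the published document presents the formula only as an image.

```python
# Sketch of the first similarity loss under stated assumptions.
import torch
import torch.nn.functional as F

def first_similarity_loss(logits: torch.Tensor, labels: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """logits: (N, M) network outputs for the N videos of one first sample group
    (all with the same behavior category); labels: (N,) behavior categories."""
    ce = F.cross_entropy(logits, labels)                   # classification (cross-entropy) term
    p = F.softmax(logits, dim=1)                           # conditional distributions p(y | Sam_i)
    log_p = F.log_softmax(logits, dim=1)
    n = logits.size(0)
    pair_dist, pairs = logits.new_zeros(()), 0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:                     # behavior similarity constraint active
                # Symmetric KL divergence as the pairwise distance D(.,.) (assumption).
                pair_dist = pair_dist + 0.5 * (
                    F.kl_div(log_p[i], p[j], reduction="sum")
                    + F.kl_div(log_p[j], p[i], reduction="sum"))
                pairs += 1
    return ce + lam * pair_dist / max(pairs, 1)

loss = first_similarity_loss(torch.randn(4, 8), torch.zeros(4, dtype=torch.long))
print(loss.item())
```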
In S503, a second sample group is acquired; a second sample behavior video containing a plurality of different behavior categories within the second sample;
in S504, the primary correction module is trained and learned according to the second sample group, so that the second similarity loss of the primary correction module is smaller than a second loss threshold, and the voting time correlation module is generated.
In this embodiment, as in the training process, after the time-related network can identify sample videos of the same behavior type, it is necessary to improve the degree of difference between sample videos of different behavior types, and for a specific training process, reference may be made to the above process.
In the embodiment of the application, the sample groups of the same type are trained for the first time, and then the sample groups of different types are trained for the second time, so that the identification accuracy of the behavior types of the same type can be improved, and the over-fitting condition is avoided.
Fig. 6 is a flowchart illustrating a specific implementation of a method for identifying an object based on gait according to a third embodiment of the present invention. Referring to fig. 6, in relation to the embodiment shown in fig. 1, before receiving target video data to be identified, the method for identifying an object based on gait according to the present embodiment further includes: S601-S607 are detailed as follows:
further, before the receiving the target video data to be identified, the method further includes:
In S601, sample video data for training a behavior recognition module is obtained; the behavior recognition module includes the inter-frame action extraction network, the pooling fusion network and the contextual attention network.
In this embodiment, before performing behavior recognition on target video data, the terminal device may train the local behavior recognition module so that the accuracy of subsequent behavior recognition can be improved. The behavior recognition module specifically comprises three networks: the inter-frame action extraction network, used for extracting inter-frame action data; the pooling fusion network, used for feature extraction and feature fusion of the inter-frame action data; and the contextual attention network, used for determining the relative position between the target object and environmental objects, so that the behavior category of the target object can be determined in the global dimension. Based on this, the terminal device can acquire sample video data from a video library. It should be noted that the sample video data is specifically video data that is not labeled with a behavior category, or weakly labeled video data. Training and learning can be performed in an adversarial learning manner, which reduces the time users spend on labeling, improves training efficiency, and improves training accuracy.
In this embodiment, a deep bidirectional Transformer is introduced to make better use of position embedding and a multi-head attention mechanism to automatically select key information in videos. A sequence self-supervised learning method oriented to video understanding is designed, and massive internet big data and existing public data sets are fully used to continuously optimize and train the behavior pre-training model, thereby obtaining a robust behavior pre-training model with cross-domain generality and task-sharing capability.
In S602, positive sample data and negative sample data are generated according to the sample video data; the positive sample data is obtained after interference processing is carried out on background information in the sample video data; the negative sample data is obtained by performing interference processing on the frame sequence of the sample video frame in the sample video data.
In this embodiment, after obtaining any sample video data, the terminal device may convert the sample video data into two different types of sample data, one of which is positive sample data obtained by interfering with background information, that is, by interfering with spatial dimensions, and the other is negative sample data obtained by interfering with frame sequences, that is, by interfering with temporal dimensions, thereby decoupling an action and a spatial scene, and further enhancing the sensitivity of a network to the action. This way of constructing positive and negative samples makes the network have to focus on global statistics to distinguish the positive and negative samples.
The process of generating the positive sample may specifically include the following steps:
step 1.1 marks sample objects in each sample video frame of the sample video data and identifies other regions than the sample objects as background regions.
And 1.2, carrying out interpolation processing on the background area through a preset thin plate spline to obtain a spatial interference image frame.
And step 1.3, packaging according to the frame serial number of each spatial interference image frame in the sample video data to obtain the positive sample data.
In this embodiment, the terminal device may locate the sample object in the sample video data through an object recognition algorithm (such as a face recognition algorithm or a human key point recognition algorithm), where the sample object may also be an entity person, and after the sample object in the sample video data is marked, identify another region except for the region where the sample object is located as a background region, and because the space needs to be interfered, the terminal device may perform interpolation processing in the background region through a thin-plate spline, so as to block part of the background region, eliminate correlation in the space between sample video frames, and repackage the spatial interference image frame after the thin-plate spline is added according to the frame number, thereby obtaining positive sample data.
In the embodiment of the application, the background area is subjected to interpolation processing through the thin-plate spline, the local scene information is damaged, so that a positive sample is constructed, the sensitivity of subsequent identification to the user action can be improved, and the training accuracy is improved.
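A simplified positive-sample sketch follows. The thin-plate-spline interpolation of the background is replaced here by a random grid warp (grid_sample), which stands in for the same idea of disturbing background information while keeping the frame order; the mask format and warp strength are assumptions.

```python
# Positive-sample sketch: disturb only the background region of every frame,
# keep the marked sample object and the original frame order intact.
import torch
import torch.nn.functional as F

def make_positive_sample(frames: torch.Tensor, object_mask: torch.Tensor, strength: float = 0.05) -> torch.Tensor:
    """frames: (T, C, H, W) sample video frames; object_mask: (T, 1, H, W) with 1
    on the marked sample object and 0 on the background region."""
    t, c, h, w = frames.shape
    # Identity sampling grid plus a small random offset (the spatial interference).
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).expand(t, h, w, 2)
    warped = F.grid_sample(frames, grid + strength * torch.randn_like(grid),
                           align_corners=True, padding_mode="border")
    # Keep the sample object untouched, disturb only the background region;
    # the frames remain in their original frame-number order.
    return object_mask * frames + (1 - object_mask) * warped

pos = make_positive_sample(torch.rand(8, 3, 64, 64), torch.zeros(8, 1, 64, 64))
print(pos.shape)  # torch.Size([8, 3, 64, 64])
```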
The process of generating the negative examples may specifically include the following steps:
step 2.1, dividing the sample video data into a plurality of video segments according to preset action time duration; the paragraph duration of each of the video segments is not greater than the action time duration.
And 2.2, respectively updating the frame sequence numbers of the sample video frames in the video segments according to a preset out-of-order processing algorithm.
And 2.3, packaging each sample video frame based on the updated frame serial number to obtain the negative sample data.
In this embodiment, to implement interference in the time dimension, the terminal device may divide the sample video data into a plurality of video segments, and perform out-of-order processing on the video image frames in each video segment. Because one action has a certain duration, different actions can be separated by dividing the video segments, and the sensitivity of each action to be identified subsequently can be improved. Wherein the action time duration is determined by determining an average duration of an action based on big data analysis. The terminal equipment reconfigures the frame sequence number of each sample video frame in the video frequency band through a random algorithm, and therefore encapsulation is carried out according to the sample video frames with the updated frame sequence numbers, and negative sample data are obtained.
Usually, the negative samples adopted in contrastive learning are taken directly from other videos. However, if other videos are used as negative samples, features other than the different action information are introduced that make the samples easier for the network to distinguish, so this way of selecting negative samples cannot ensure that the network focuses on the movement. In this application, local temporal interference is therefore used to disrupt the optical flow information and construct the negative samples. This way of constructing positive and negative samples forces the network to focus on global statistics in order to distinguish the positive and negative samples.
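Negative-sample construction by segment-wise frame shuffling can be sketched as follows; the segment length in frames is an assumed stand-in for the preset action time duration.

```python
# Negative-sample sketch: split the frame sequence into segments and shuffle
# the frame order independently inside every segment.
import torch

def make_negative_sample(frames: torch.Tensor, segment_len: int = 8) -> torch.Tensor:
    """frames: (T, C, H, W). Returns the same frames with the frame order
    disturbed inside every video segment."""
    order = torch.arange(frames.size(0))
    for start in range(0, frames.size(0), segment_len):
        end = min(start + segment_len, frames.size(0))
        perm = torch.randperm(end - start) + start      # out-of-order processing per segment
        order[start:end] = perm
    return frames[order]                                # re-packaged by updated frame numbers

neg = make_negative_sample(torch.rand(20, 3, 64, 64))
print(neg.shape)  # torch.Size([20, 3, 64, 64])
```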
In S603, first spatial information and first optical flow information are generated from the positive sample data, and second spatial information and second optical flow information are generated from the negative sample data.
In this embodiment, the terminal device may perform data conversion on the positive sample data through a coding algorithm to obtain the coded data of each image frame in the positive sample data, yielding a plurality of feature maps. The learned position codes are added to the extracted feature maps; after the position codes are fused, the deep bidirectional Transformer is used to model the temporal information of the positive sample data, that is, the first optical flow information, and to model the spatial information, that is, the first spatial information. Correspondingly, the same processing is performed on the negative sample data to obtain the second spatial information and the second optical flow information.
In S604, spatial enhancement information is obtained according to the first spatial information and the second spatial information.
In this embodiment, the background region of the first spatial information has been interfered with, so it lacks correlation in space, while the background region of the second spatial information has not been interfered with. Because the two kinds of sample data come from the same sample video data, fusing the two pieces of spatial information improves the sensitivity of spatial information capture, thereby obtaining the spatial enhancement information.
In S605, optical-flow enhancement information is obtained from the second optical-flow information and the first optical-flow information.
In this embodiment, the time sequence of the first optical flow information has not been interfered with, so it retains correlation in the time dimension, while the time sequence of the second optical flow information has been interfered with. Because the two kinds of sample data come from the same sample video data, fusing the two pieces of optical flow information improves the sensitivity of temporal information capture, thereby obtaining the optical flow enhancement information.
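The patent text does not spell out the fusion operator used in S604 and S605; one simple possibility, shown as a hedged sketch below, is a learnable weighted combination of the two paired feature tensors (the module name, tensor shapes, and sigmoid-weighted mixing are all assumptions).

import torch
import torch.nn as nn

class PairFusion(nn.Module):
    """Fuse two feature tensors of the same shape (first/second spatial
    information, or first/second optical-flow information) into one
    'enhanced' tensor with a learnable mixing weight."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable mixing coefficient

    def forward(self, feat_a, feat_b):
        w = torch.sigmoid(self.alpha)
        return w * feat_a + (1.0 - w) * feat_b

# Example: fuse the spatial information extracted from positive/negative sample data.
spatial_fuse = PairFusion()
first_spatial = torch.randn(2, 256, 7, 7)   # from the positive sample data
second_spatial = torch.randn(2, 256, 7, 7)  # from the negative sample data
spatial_enhanced = spatial_fuse(first_spatial, second_spatial)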
In S606, the spatial enhancement information and the optical flow enhancement information are imported into the behavior recognition module, so as to obtain a training recognition result of the sample video data.
In S607, the position learning parameter in the initial recognition module is pre-trained based on the training recognition results of all the sample video data, so as to obtain the behavior recognition module.
In this embodiment, behavior recognition relies on two key pieces of information: spatial information and temporal information. Spatial information belongs to the static information of the scene, such as objects and context information, and is easy to capture from a single frame of the video. Temporal information mainly captures the dynamic characteristics of motion and is obtained by integrating the spatial information between frames. For behavior recognition, how to better capture motion information is crucial to model performance, and the global average pooling layer used at the end of existing 3D convolutional neural networks limits the richness of the temporal information. To address this problem, a depth bidirectional Transformer is used instead of global average pooling. K frames sampled from the input video are encoded by a 3D convolutional encoder, and at the end of the network the feature vector is divided into tokens of fixed length to obtain a feature map. To preserve position information, a learned position code is added to the extracted features. After the position codes are fused, the Transformer blocks in the depth bidirectional converter model the temporal information; the feature vectors obtained through the multi-head attention mechanism of the depth bidirectional converter fuse the temporal information and are concatenated, feature dimension transformation is performed through a multilayer perceptron, and end-to-end training is completed by calculating the contrastive loss. A pre-training model with good generalization performance is thereby obtained.
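To make the pipeline just described concrete, the following hedged sketch strings together a stand-in 3D convolutional encoder, learned position codes, a bidirectional Transformer encoder over the frame tokens, a multilayer perceptron head, and an InfoNCE-style contrastive loss; the layer sizes, token pooling choice, and exact loss formulation are illustrative assumptions rather than the patent's specification.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoContrastiveEncoder(nn.Module):
    """3D convolutional encoder followed by a bidirectional Transformer over
    the frame tokens (in place of global average pooling) and an MLP head."""

    def __init__(self, in_ch=3, dim=256, n_heads=8, n_layers=2, proj_dim=128, max_frames=64):
        super().__init__()
        self.backbone = nn.Sequential(                    # stand-in 3D convolutional encoder
            nn.Conv3d(in_ch, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.AdaptiveAvgPool3d((None, 1, 1)),           # keep the temporal axis, pool space
        )
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames, dim))  # learned position codes
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(                        # multilayer perceptron for dimension transform
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, proj_dim))

    def forward(self, clip):                              # clip: (B, C, K, H, W), K sampled frames
        feat = self.backbone(clip)                        # (B, dim, K, 1, 1)
        tokens = feat.flatten(2).transpose(1, 2)          # (B, K, dim): one token per frame
        tokens = tokens + self.pos_embed[:, :tokens.size(1)]
        tokens = self.temporal(tokens)                    # bidirectional self-attention over time
        clip_vec = tokens.mean(dim=1)                     # fuse the temporal tokens
        return F.normalize(self.head(clip_vec), dim=-1)   # embedding used by the contrastive loss

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed formulation): pull each anchor
    towards its positive embedding and away from the negative embeddings."""
    pos_sim = (anchor * positive).sum(-1, keepdim=True) / temperature  # (B, 1)
    neg_sim = anchor @ negatives.t() / temperature                     # (B, B)
    logits = torch.cat([pos_sim, neg_sim], dim=1)                      # positive at column 0
    labels = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)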
In the embodiment of the application, determining the positive sample data and the negative sample data improves the sensitivity of recognizing actions and spatio-temporal information, so that training on behavior categories can be completed without labeling, which improves the pre-training effect.
Fig. 7 shows a flowchart of a specific implementation of S102 of the method for identifying an object based on gait according to the fourth embodiment of the present invention. Referring to fig. 7, with respect to the embodiments shown in fig. 1 to 6, in the method for identifying an object based on gait provided by this embodiment, S102 includes S1021 to S1027, which are detailed as follows:
further, the importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data includes:
in S1021, the image tensors of any two consecutive video image frames within the target video data are determined.
In this embodiment, before extracting the motion feature information between two video image frames, the terminal device needs to pre-process the video image frames by converting each video image frame, expressed as graphics, into a tensor expressed by vectors. The image tensor corresponding to each video image frame is determined according to the image size of the video image frame. Illustratively, the image tensor may be a tensor of size H × W × C, where H is determined according to the image length of the video image frame, W is determined according to the image width of the video image frame (that is, H × W represents the spatial resolution of the video image frame), and C is the number of channels of the video image frame. Illustratively, two consecutive video image frames may be identified as F(t) and F(t+1), that is, the image tensors corresponding to the t-th video image frame and the (t+1)-th video image frame.
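A minimal illustration of this preprocessing step is given below, assuming frames arrive as H × W × C uint8 arrays; the normalization constant and the example array shape are illustrative only.

import numpy as np
import torch

def frame_to_tensor(frame):
    """Convert one H x W x C video image frame (uint8 array) into a float tensor."""
    return torch.from_numpy(frame).float() / 255.0

# F(t) and F(t+1): image tensors of two consecutive video image frames.
f_t = frame_to_tensor(np.zeros((224, 224, 3), dtype=np.uint8))
f_t1 = frame_to_tensor(np.zeros((224, 224, 3), dtype=np.uint8))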
In S1022, a plurality of feature point coordinates are determined according to the key position of the target object in the video image frame; the feature point coordinates are determined according to the gait behavior of the target object.
In this embodiment, the terminal device may mark the position where the target object is located, i.e., the above-mentioned key position, in each video image frame. For example, the terminal device may slide a human body template over the video image frame and calculate the matching degree between the human body template and each framed region, so as to identify the region where a human body is located, that is, the region where the target object is located.
In this embodiment, after determining the key position, the terminal device may identify a plurality of key points of the target object based on the key position, where each key point corresponds to one feature point coordinate. Illustratively, the key points are those related to the gait behavior of the target object; after each key point is marked, its coordinates in the video image frame, namely the feature point coordinates, can be determined.
In S1023, tensor expressions of the coordinates of the respective feature points are determined in the image tensor, and feature vectors of the target object in the video image frame are generated based on coordinate expressions of all the feature points.
In this embodiment, after determining the coordinates of the plurality of feature points, the terminal device may locate the element in which each feature point lies in the image tensor, so as to obtain the expression of each feature point through the tensor, that is, the tensor expression, and finally encapsulate the tensor expressions of all the feature points to obtain the feature vector of the target object related to gait.
In S1024, a displacement correlation matrix is constructed according to the feature vectors of the two consecutive video image frames; the displacement correlation matrix is used for determining displacement correlation scores between each feature point coordinate in one video image frame and each coordinate point in the other video image frame.
In this embodiment, after determining the tensor expression corresponding to the feature point coordinate of each key point and obtaining the feature vectors formed from the tensor expressions of all the key points, the terminal device may calculate the vector deviation between the two video image frames, determine from the vector deviation the displacement of each key point of the target object between the two video image frames, and thereby determine the displacement correlation matrix.
In this embodiment, because in all probability no large displacement occurs between any two adjacent frames of the video, the displacement can be limited to a specific candidate area. Assuming that the candidate area is centered at position X and contains P² feature points, the correlation score matrix between position X and all the features in the candidate area can be obtained by taking the dot product of the feature at position X and the features in the corresponding candidate area of the adjacent video image frame. The dimension of the matrix is H × W × P²; that is, the displacement correlation matrix reflects the relationship between positions in adjacent frames.
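A hedged sketch of such a local correlation volume is given below; it assumes dense per-pixel feature maps for the two frames and a square candidate area of side P = 2·radius + 1, with the dot product taken over the channel dimension — details the patent leaves open.

import torch
import torch.nn.functional as F

def displacement_correlation(feat_t, feat_t1, radius=3):
    """Local correlation volume between two per-pixel feature maps.

    feat_t, feat_t1: (C, H, W) feature tensors of consecutive frames.
    Returns an (H, W, P*P) tensor with P = 2*radius + 1, where each entry is the
    dot-product score between the feature at position x in frame t and one
    candidate position in the P x P neighbourhood of x in frame t+1.
    """
    c, h, w = feat_t.shape
    p = 2 * radius + 1
    # Unfold gathers, for every position, its P x P neighbourhood in frame t+1.
    neighbours = F.unfold(feat_t1.unsqueeze(0), kernel_size=p, padding=radius)  # (1, C*P*P, H*W)
    neighbours = neighbours.view(c, p * p, h * w)
    # Dot product of the centre feature with each of its P*P candidates.
    centre = feat_t.view(c, 1, h * w)
    scores = (centre * neighbours).sum(dim=0)        # (P*P, H*W)
    return scores.permute(1, 0).view(h, w, p * p)    # H x W x P^2 correlation scores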
In S1025, the maximum displacement distance of each feature point coordinate between the two consecutive video image frames is determined according to the displacement correlation matrix, and the displacement matrix of the target object is determined based on all the maximum displacement distances.
In this embodiment, after determining the correlation scores between each feature point coordinate in the key area and the coordinate points of the other video image frame, the terminal device may select the value with the largest correlation score to determine the maximum displacement distance corresponding to that feature point coordinate, that is, to locate the coordinate point associated with the feature point coordinate in the other video image frame.
Further, as another embodiment of the present application, the step S1025 specifically includes the following steps:
step 1: determining a displacement correlation array corresponding to each characteristic point coordinate in the displacement correlation matrix;
and 2, step: determining a parameter value with the maximum correlation coefficient from the displacement correlation array as the maximum displacement distance of the characteristic coordinate point;
and step 3: constructing a displacement field of the target object on a two-dimensional space according to the maximum displacement distances of all the characteristic coordinate points;
and 4, step 4: performing pooling dimensionality reduction on the displacement field through an activation function softmax to obtain a one-dimensional confidence tensor;
and 5: and fusing the displacement field and the one-dimensional confidence tensor to construct a displacement matrix for expressing a three-dimensional space.
In this embodiment, based on the correlation score matrix, the displacement field of the motion information can be estimated as long as, for each feature point in one video image frame, the point in the other video image frame with the maximum score in the correlation score matrix is found. Since the correlation score is used to determine the correlation between two coordinate points, the correlation scores between a feature point coordinate and the individual coordinate points of the other video image frame can be separated out of the displacement correlation matrix, which yields the displacement correlation array. The parameter value with the largest correlation coefficient is then determined, so as to find the coordinate point corresponding to that feature point coordinate in the other video image frame, and the distance between the two points is taken as the maximum displacement distance, thereby constructing the displacement field of the target object in two-dimensional space; since the video image frame is a two-dimensional image, the constructed displacement field is also two-dimensional. Specifically, a softmax layer can be added to perform feature extraction, i.e., pooling dimensionality reduction, on the two-dimensional displacement field to obtain a confidence map of the target object, and finally the two-dimensional displacement field and the one-dimensional confidence map are combined to form a displacement matrix with three-dimensional features.
In the embodiment of the application, the motion of the target object is determined by constructing the two-dimensional displacement field, and the confidence of each point in the displacement field is determined by pooling dimensionality reduction, which makes it convenient to evaluate the displacement effectively, facilitates subsequent action identification, and improves the accuracy of action identification.
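Building on the correlation-volume sketch above, the following hedged snippet turns the (H, W, P²) score volume into a three-channel displacement matrix: the argmax of each score array gives a two-dimensional displacement field, and a softmax over the scores gives a confidence value; the channel layout is an assumption.

import torch
import torch.nn.functional as F

def displacement_matrix(corr, radius=3):
    """Turn an (H, W, P*P) correlation volume into a 3-channel displacement
    matrix: (dy, dx) of the best-scoring candidate plus a confidence value."""
    h, w, _ = corr.shape
    p = 2 * radius + 1

    best = corr.argmax(dim=-1)                                  # index of the maximum score per position
    dy = torch.div(best, p, rounding_mode='floor') - radius     # vertical component of the displacement
    dx = best % p - radius                                      # horizontal component of the displacement
    displacement_field = torch.stack([dy, dx], dim=0).float()   # two-dimensional displacement field (2, H, W)

    # Softmax over the candidate scores; the peak value serves as the confidence.
    confidence = F.softmax(corr, dim=-1).max(dim=-1).values     # one-dimensional confidence map (H, W)

    # Fuse the two-dimensional field with the confidence map into a 3-channel matrix.
    return torch.cat([displacement_field, confidence.unsqueeze(0)], dim=0)   # (3, H, W)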
In S1026, the displacement matrix is imported into a preset feature transformation model, and the motion feature sub-data of any two consecutive video image frames is generated.
In this embodiment, in order to match the features of the downstream layer, the displacement matrix needs to be converted into a motion feature matrix that matches the dimension of the downstream layer. The displacement matrix may be fed into four depthwise separable convolution layers, one 1 × 7 layer and three 1 × 3 layers, which convert it into motion features with the same number of channels C as the original input F(t), for input to the next layer of the network.
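A hedged sketch of such a feature transformation is shown below: four depthwise separable convolutions, one with a 1 × 7 kernel and three with 1 × 3 kernels, map the 3-channel displacement matrix to C output channels; the 2D tensor layout, padding, and channel count are assumptions made for illustration.

import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    """One depthwise-separable 2D convolution: per-channel spatial filtering
    followed by a 1x1 pointwise projection."""

    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        pad = (kernel_size[0] // 2, kernel_size[1] // 2)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=pad, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class FeatureTransform(nn.Module):
    """Map the 3-channel displacement matrix to C motion-feature channels so
    that the result matches the channel count of the original input F(t)."""

    def __init__(self, out_ch=64):
        super().__init__()
        self.layers = nn.Sequential(
            DepthwiseSeparable(3, out_ch, (1, 7)),       # one 1 x 7 layer
            DepthwiseSeparable(out_ch, out_ch, (1, 3)),  # three 1 x 3 layers
            DepthwiseSeparable(out_ch, out_ch, (1, 3)),
            DepthwiseSeparable(out_ch, out_ch, (1, 3)),
        )

    def forward(self, disp):                              # disp: (B, 3, H, W)
        return self.layers(disp)                          # (B, C, H, W) motion features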
In S1027, the inter-frame motion feature data is obtained based on the motion feature sub-data of all the video image frames.
In this embodiment, after determining the motion characteristic subdata corresponding to each video image frame with respect to the next video image frame, the terminal device may perform encapsulation according to the frame number of each video image frame, so as to obtain inter-frame motion characteristic data about the entire target video data.
In the embodiment of the application, a plurality of key point coordinates related to gait are marked on the target object, the corresponding displacement matrix is constructed from the displacements of the key point coordinates, and the action characteristic subdata of the target object is determined from the displacements of the key points, so that the number of points that need to be computed is reduced, which reduces the amount of computation and improves the computing efficiency.
Fig. 8 is a block diagram illustrating the structure of a device for identifying an object based on gait according to an embodiment of the present invention. The device for identifying an object based on gait includes units for executing the steps implemented by the terminal device in the embodiment corresponding to fig. 1. For the related description, please refer to fig. 1 and the embodiment corresponding to fig. 1. For convenience of explanation, only the portions related to the present embodiment are shown.
Referring to fig. 8, the apparatus for recognizing a subject based on gait includes:
a target video data receiving unit 81 for receiving target video data to be recognized;
the inter-frame action feature data extraction unit 82 is configured to import the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
a fusion feature data unit 83, configured to import the inter-frame action feature data into a pooling fusion network, and output fusion feature data corresponding to the target video data;
a classification result generating unit 84, configured to import the target video data into a contextual attention network, and determine gait behavior data of a target object in the target video data; the contextual attention network is used for extracting the mutual position relation between the target object and the environmental object in the target video data;
and the behavior recognition unit 85 is configured to obtain a behavior category of the target object according to the gait behavior data and the fusion feature data.
Optionally, the inter-frame motion feature data extraction unit 82 includes:
the image tensor conversion unit is used for determining the image tensors of any two continuous video image frames in the target video data;
the characteristic point coordinate determination unit is used for determining a plurality of characteristic point coordinates according to the key positions of the target object in the video image frame; the characteristic point coordinates are determined according to the gait behavior of the target object;
an eigenvector generating unit, configured to determine tensor expressions of coordinates of feature points in the image tensor, and generate eigenvectors of the target object in the video image frame based on coordinate expressions of all the feature points;
the displacement correlation matrix constructing unit is used for constructing a displacement correlation matrix according to the characteristic vectors of any two continuous video image frames; the displacement correlation matrix is used for determining displacement correlation scores between the coordinates of each characteristic point in one video image frame and each coordinate point in another video image frame;
a displacement matrix construction unit, configured to determine, according to the displacement correlation matrix, a maximum displacement distance between the two consecutive video image frames of each of the feature point coordinates, and determine a displacement matrix of the target object based on all the maximum displacement distances;
the action characteristic subdata determining unit is used for leading the displacement matrix into a preset characteristic transformation model and generating action characteristic subdata of any two continuous video image frames;
and the action characteristic subdata packaging unit is used for obtaining the interframe action characteristic data based on the action characteristic subdata of all the video image frames.
Optionally, the classification result generating unit 84 includes (see the sketch after this list):
a feature fragment generating unit, configured to perform pooling processing on the fused feature data in a spatial dimension through a one-dimensional spatial convolution layer, so as to generate a plurality of feature fragments; each feature segment corresponds to one classification category;
the convolution kernel generating unit is used for generating a plurality of convolution kernels according to the gait feature information of the target object; each convolution kernel corresponds to one motion capture frame rate; each motion capture frame rate is determined according to the classification category;
the confidence coefficient parameter calculation unit is used for respectively inputting each feature segment to the confidence coefficient unit corresponding to the classification category and outputting the confidence coefficient parameter related to the classification category; the confidence unit is generated from a convolution kernel of a motion capture frame rate associated with the classification category;
and the confidence coefficient parameter packaging unit is used for generating the classification result according to all the confidence coefficient parameters.
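The patent links each confidence unit to a convolution kernel chosen by the motion capture frame rate of its classification category but does not give the unit's internal form; the hedged sketch below assumes each unit is a small one-dimensional temporal convolution whose kernel size plays the role of that frame rate, and that each feature segment arrives as a (batch, channels, time) tensor.

import torch
import torch.nn as nn

class ConfidenceUnit(nn.Module):
    """Scores one feature segment for one classification category, using a
    temporal convolution whose kernel size stands in for that category's
    motion-capture frame rate (assumed interpretation)."""

    def __init__(self, channels, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(channels, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, segment):                 # segment: (B, channels, T)
        return self.conv(segment).mean(dim=-1)  # (B, 1) confidence parameter

class VotingTimeCorrelation(nn.Module):
    """One confidence unit per classification category; the concatenated
    confidence parameters form the classification result."""

    def __init__(self, channels, frame_rates):  # frame_rates: one kernel size per category
        super().__init__()
        self.units = nn.ModuleList(ConfidenceUnit(channels, k) for k in frame_rates)

    def forward(self, segments):                # segments: one (B, channels, T) tensor per category
        scores = [unit(seg) for unit, seg in zip(self.units, segments)]
        return torch.cat(scores, dim=1)         # (B, num_categories) classification result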
Optionally, the apparatus further comprises:
a first sample group acquisition unit configured to acquire a plurality of first sample groups; the first sample group comprises a plurality of first sample behavior videos with the same behavior category;
the primary training unit is used for training and learning the original correlation module according to the first sample group, so that the first similarity loss of the original correlation module is smaller than a first loss threshold value, and a primary correction module is generated; the first similarity loss is:

[first similarity loss formula, given only as an equation image in the published text]

wherein L1 is the first similarity loss value; Sam is the first sample group; x_i is the i-th first sample behavior video in the first sample group; p(y_i | x_i; θ) is the conditional probability distribution of the i-th first sample behavior video; y_i is the behavior type of the i-th first sample behavior video; d(x_i, x_j) is the pairwise distance between any two first sample behavior videos; N is the total number of videos contained in the first sample group; θ is the learning parameter of the original correlation module; sim(·, ·) is a behavior similarity function; λ is a preset coefficient;
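Since the formula itself is only available as an image, the display below is not the patent's equation but one hedged possibility for how the quantities defined above could combine, namely a conditional log-likelihood term plus a λ-weighted pairwise term:

$$
\mathcal{L}_1(\mathrm{Sam};\theta) \;=\; -\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(y_i \mid x_i;\theta\right) \;+\; \lambda \sum_{i \neq j} \mathrm{sim}\!\left(y_i, y_j\right)\, d\!\left(x_i, x_j\right)
$$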
a second sample group acquisition unit, configured to acquire a second sample group; the second sample group contains a plurality of second sample behavior videos of different behavior categories;
and the secondary training unit is used for training and learning the primary correction model according to the second sample group, so that the second similarity loss of the primary correction model is smaller than a second loss threshold value, and the voting time correlation module is generated.
Optionally, the apparatus further comprises:
the system comprises a sample video data acquisition unit, a behavior recognition module and a control unit, wherein the sample video data acquisition unit is used for acquiring sample video data used for training the behavior recognition module; the behavior recognition module comprises the inter-frame action extraction network, the pooling convergence network, and the contextual attention network;
the sample data conversion unit is used for generating positive sample data and negative sample data according to the sample video data; the positive sample data is obtained after interference processing is carried out on background information in the sample video data; the negative sample data is obtained by carrying out interference processing on a frame sequence of a sample video frame in the sample video data;
an information extraction unit configured to generate first spatial information and first optical flow information from the positive sample data, and generate second spatial information and second optical flow information from the negative sample data;
a spatial enhancement information generating unit, configured to obtain spatial enhancement information according to the first spatial information and the second spatial information;
an optical flow enhancement information extraction unit configured to obtain optical flow enhancement information from the second optical flow information and the first optical flow information;
a training recognition result output unit, configured to import the spatial enhancement information and the optical flow enhancement information into the behavior recognition module, to obtain a training recognition result of the sample video data;
and the module training unit is used for pre-training the position learning parameters in the initial recognition module based on the training recognition results of all the sample video data to obtain the behavior recognition module.
Optionally, the sample data conversion unit includes:
a background region identification unit, configured to mark a sample object in each sample video frame of the sample video data, and identify a region other than the sample object as a background region;
the background area processing unit is used for carrying out interpolation processing on the background area through a preset thin plate spline to obtain a space interference image frame;
and the positive sample generation unit is used for packaging the frame serial numbers of the space interference image frames in the sample video data to obtain the positive sample data.
Optionally, the sample data conversion unit includes:
the video dividing unit is used for dividing the sample video data into a plurality of video segments according to preset action time duration; the paragraph duration of each video segment is not greater than the action time duration;
the disorder processing unit is used for respectively updating the frame sequence numbers of the sample video frames in the video segments according to a preset disorder processing algorithm;
and the negative sample generation unit is used for packaging each sample video frame based on the updated frame sequence number to obtain the negative sample data.
Therefore, after receiving target video data to be subjected to behavior recognition, the device for identifying an object based on gait provided by the embodiment of the invention imports the target video data into the inter-frame action extraction network, extracts the action feature information between the video image frames, and generates action feature data based on the action feature information between all the video image frames. The action feature data is imported into the pooling fusion network for feature extraction to obtain the corresponding fusion feature data, the fusion feature data is divided into a plurality of feature segments of different classification categories, the scores of the different classification categories are determined through the voting time correlation module to obtain the classification result, and finally the behavior category of the target object is determined from the classification result, thereby realizing automatic identification of the behavior category of the target object. Compared with the existing behavior recognition technology, optical flow information is not used for recognition; instead, the inter-frame action feature data between video image frames is determined first and then pooled and fused, which improves the sensitivity to actions between video frames. The fusion feature data is then divided into feature segments of different classification categories, and because the voting time correlation module can determine the score between each feature segment and its classification category by using an algorithm associated with that classification category, fine-grained recognition of behavior categories can be achieved, thereby improving recognition accuracy.
It should be understood that, in the structural block diagram of the device for identifying an object based on gait shown in fig. 8, each module is used to execute the corresponding steps in the embodiments corresponding to fig. 1 to 7. Each of these steps has been explained in detail in the above embodiments; for specifics, refer to fig. 1 to 7 and the related descriptions in the corresponding embodiments, which are not repeated herein.
Fig. 9 is a block diagram of a terminal device according to another embodiment of the present application. As shown in fig. 9, the terminal device 900 of this embodiment includes: a processor 910, a memory 920, and a computer program 930 stored in the memory 920 and executable on the processor 910, such as a program for the method of identifying an object based on gait. The processor 910 implements the steps in the above embodiments of the method of identifying an object based on gait, such as S101 to S105 shown in fig. 1, when executing the computer program 930. Alternatively, when executing the computer program 930, the processor 910 implements the functions of the modules in the embodiment corresponding to fig. 8, for example, the functions of the units 81 to 85 shown in fig. 8; please refer to the related description in the embodiment corresponding to fig. 8.
Illustratively, the computer program 930 may be partitioned into one or more modules, which are stored in the memory 920 and executed by the processor 910 to complete the present application. One or more of the modules may be a series of computer program instruction segments capable of performing certain functions, which are used to describe the execution of computer program 930 in terminal device 900. For example, the computer program 930 may be divided into respective unit modules, and the respective modules may be specifically functioned as described above.
Terminal device 900 can include, but is not limited to, a processor 910, a memory 920. Those skilled in the art will appreciate that fig. 9 is merely an example of a terminal device 900 and does not constitute a limitation of terminal device 900 and may include more or fewer components than shown, or some of the components may be combined, or different components, e.g., the terminal device may also include input-output devices, network access devices, buses, etc.
The processor 910 may be a central processing unit, but may also be other general purpose processors, digital signal processors, application specific integrated circuits, off-the-shelf programmable gate arrays or other programmable logic devices, discrete hardware components, and so forth. A general purpose processor may be a microprocessor or any conventional processor or the like.
The storage 920 may be an internal storage unit of the terminal device 900, such as a hard disk or a memory of the terminal device 900. The memory 920 may also be an external storage device of the terminal device 900, such as a plug-in hard disk, a smart card, a flash memory card, etc. provided on the terminal device 900. Further, the memory 920 may also include both internal and external memory units of the terminal device 900.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method of identifying an object based on gait, comprising:
receiving target video data to be identified;
importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
importing the inter-frame action feature data into a pooling fusion network, and outputting fusion feature data corresponding to the target video data; the fused feature data comprises feature data in multiple channels;
generating a plurality of feature segments based on the fusion feature data, inputting all the feature segments to a voting time correlation module, and generating a classification result of the target object; each feature fragment corresponds to one classification category;
and determining the behavior category of the target object according to the classification result.
2. The method of claim 1, wherein the voting time correlation module comprises a plurality of confidence units; the generating a plurality of feature segments based on the fused feature data and outputting the classification result corresponding to each feature segment through a voting time correlation module includes:
pooling the fused feature data in a spatial dimension through a one-dimensional space convolution layer to generate a plurality of feature fragments; each feature segment corresponds to one of the classification categories;
generating a plurality of convolution kernels according to the gait feature information of the target object; each convolution kernel corresponds to one motion capture frame rate; each motion capture frame rate is determined according to the classification category;
inputting each feature segment into the confidence coefficient unit corresponding to the classification category, and outputting confidence coefficient parameters related to the classification category; the confidence unit is generated from a convolution kernel of a motion capture frame rate associated with the classification category;
and generating the classification result according to all the confidence coefficient parameters.
3. The method according to claim 1, before generating a plurality of feature segments based on the fused feature data and outputting a classification result corresponding to each feature segment through a voting time correlation module, further comprising:
acquiring a plurality of first sample groups; the first sample group comprises a plurality of first sample behavior videos with the same behavior category;
training and learning an original correlation module according to the first sample group so as to enable first similarity loss of the original correlation module to be smaller than a first loss threshold value, and generating a primary correction module; the first similarity loss is:
[first similarity loss formula, given only as an equation image in the published text]

wherein L1 is the first similarity loss value; Sam is the first sample group; x_i is the i-th first sample behavior video in the first sample group; p(y_i | x_i; θ) is the conditional probability distribution of the i-th first sample behavior video; y_i is the behavior type of the i-th first sample behavior video; d(x_i, x_j) is the pairwise distance between any two first sample behavior videos; N is the total number of videos contained in the first sample group; θ is the learning parameter of the original correlation module; sim(·, ·) is a behavior similarity function; λ is a preset coefficient;
obtaining a second sample group; the second sample group contains a plurality of second sample behavior videos of different behavior categories;
and training and learning a primary correction model according to the second sample group so as to enable the second similarity loss of the primary correction model to be smaller than a second loss threshold value, and generating the voting time correlation module.
4. The method of claim 1, further comprising, prior to said receiving target video data to be identified:
acquiring sample video data for training a behavior recognition module; the behavior recognition module comprises the inter-frame action extraction network, the pooling fusion network and the voting time correlation module;
generating positive sample data and negative sample data according to the sample video data; the positive sample data is obtained after interference processing is carried out on background information in the sample video data; the negative sample data is obtained by carrying out interference processing on a frame sequence of a sample video frame in the sample video data;
generating first spatial information and first optical flow information through the positive sample data, and generating second spatial information and second optical flow information through the negative sample data;
obtaining space enhancement information according to the first space information and the second space information;
obtaining optical flow enhancement information according to the second optical flow information and the first optical flow information;
importing the spatial enhancement information and the optical flow enhancement information into the behavior recognition module to obtain a training recognition result of the sample video data;
and pre-training the position learning parameters in the initial identification module based on the training identification results of all the sample video data to obtain the behavior identification module.
5. The method of claim 4, wherein generating positive sample data and negative sample data from the sample video data comprises:
marking sample objects in each sample video frame of the sample video data, and identifying other areas except the sample objects as background areas;
performing interpolation processing on the background area through a preset thin plate spline to obtain a spatial interference image frame;
and packaging according to the frame sequence number of each spatial interference image frame in the sample video data to obtain the positive sample data.
6. The method of claim 4, wherein generating positive sample data and negative sample data from the sample video data comprises:
dividing the sample video data into a plurality of video segments according to a preset action time duration; the paragraph duration of each video segment is not greater than the action time duration;
respectively updating the frame sequence numbers of the sample video frames in the video segments according to a preset out-of-order processing algorithm;
and packaging each sample video frame based on the updated frame sequence number to obtain the negative sample data.
7. The method according to any one of claims 1 to 5, wherein the importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data further comprises:
determining image tensors of any two continuous video image frames in the target video data;
determining a plurality of feature point coordinates according to the key positions of the target object in the video image frame; the characteristic point coordinates are determined according to the gait behavior of the target object;
determining tensor expressions of coordinates of all characteristic points in the image tensor, and generating characteristic vectors of the target object in the video image frame based on the coordinate expressions of all the characteristic points;
constructing a displacement correlation matrix according to the characteristic vectors of any two continuous video image frames; the displacement correlation matrix is used for determining displacement correlation scores between the coordinates of each characteristic point in one video image frame and each coordinate point in the other video image frame;
determining the maximum displacement distance of each characteristic point coordinate between the two continuous video image frames according to the displacement correlation matrix, and determining the displacement matrix of the target object based on all the maximum displacement distances;
importing the displacement matrix into a preset feature transformation model to generate action feature subdata of any two continuous video image frames;
and obtaining the inter-frame action characteristic data based on the action characteristic subdata of all the video image frames.
8. An apparatus for recognizing a subject based on gait, comprising:
the target video data receiving unit is used for receiving target video data to be identified;
the inter-frame action characteristic data extraction unit is used for importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
a fusion feature data unit, configured to import the inter-frame action feature data into a pooling fusion network, and output fusion feature data corresponding to the target video data; the fused feature data comprises multi-channel feature data;
a classification result generating unit, configured to generate a plurality of feature segments based on the fused feature data, input all the feature segments to a voting time correlation module, and generate a classification result of the target object; each feature segment corresponds to a classification category;
and the behavior type identification unit is used for determining the behavior type of the target object according to the classification result.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202210703368.9A 2022-06-21 2022-06-21 Method, device, terminal equipment and storage medium for identifying object based on gait Active CN114783069B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210703368.9A CN114783069B (en) 2022-06-21 2022-06-21 Method, device, terminal equipment and storage medium for identifying object based on gait

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210703368.9A CN114783069B (en) 2022-06-21 2022-06-21 Method, device, terminal equipment and storage medium for identifying object based on gait

Publications (2)

Publication Number Publication Date
CN114783069A true CN114783069A (en) 2022-07-22
CN114783069B CN114783069B (en) 2022-11-08

Family

ID=82421549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210703368.9A Active CN114783069B (en) 2022-06-21 2022-06-21 Method, device, terminal equipment and storage medium for identifying object based on gait

Country Status (1)

Country Link
CN (1) CN114783069B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115240120A (en) * 2022-09-21 2022-10-25 中山大学深圳研究院 Behavior identification method based on countermeasure network and electronic equipment
CN116188460A (en) * 2023-04-24 2023-05-30 青岛美迪康数字工程有限公司 Image recognition method and device based on motion vector and computer equipment
CN116503943A (en) * 2023-04-24 2023-07-28 哈尔滨工程大学 Interpretable gait recognition method based on optimized dynamic geometric features

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150324636A1 (en) * 2010-08-26 2015-11-12 Blast Motion Inc. Integrated sensor and video motion analysis method
CN107506756A (en) * 2017-09-26 2017-12-22 北京航空航天大学 A kind of human motion recognition method based on Gabor filter Three dimensional convolution neural network model
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108960207A (en) * 2018-08-08 2018-12-07 广东工业大学 A kind of method of image recognition, system and associated component
CN109829378A (en) * 2018-12-28 2019-05-31 歌尔股份有限公司 Road sheds recognition methods, device and the electronic equipment of behavior
CN109886104A (en) * 2019-01-14 2019-06-14 浙江大学 A kind of motion feature extracting method based on the perception of video before and after frames relevant information
CN110537922A (en) * 2019-09-09 2019-12-06 北京航空航天大学 Human body walking process lower limb movement identification method and system based on deep learning
CN111985367A (en) * 2020-08-07 2020-11-24 湖南大学 Pedestrian re-recognition feature extraction method based on multi-scale feature fusion
US20200394413A1 (en) * 2019-06-17 2020-12-17 The Regents of the University of California, Oakland, CA Athlete style recognition system and method
CN112489119A (en) * 2020-12-21 2021-03-12 北京航空航天大学 Monocular vision positioning method for enhancing reliability
CN113870311A (en) * 2021-09-27 2021-12-31 安徽清新互联信息科技有限公司 Single-target tracking method based on deep learning
CN114038058A (en) * 2021-11-05 2022-02-11 上海交通大学 Parallel human body posture detection tracking method based on posture guiding re-recognition features
CN114220061A (en) * 2021-12-28 2022-03-22 青岛科技大学 Multi-target tracking method based on deep learning
CN114332666A (en) * 2022-03-11 2022-04-12 齐鲁工业大学 Image target detection method and system based on lightweight neural network model
US20220148335A1 (en) * 2020-09-03 2022-05-12 Board Of Trustees Of Michigan State University Disentangled Representations For Gait Recognition
CN114494981A (en) * 2022-04-07 2022-05-13 之江实验室 Action video classification method and system based on multi-level motion modeling
CN114565856A (en) * 2022-02-25 2022-05-31 西安电子科技大学 Target identification method based on multiple fusion deep neural networks

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150324636A1 (en) * 2010-08-26 2015-11-12 Blast Motion Inc. Integrated sensor and video motion analysis method
CN107506756A (en) * 2017-09-26 2017-12-22 北京航空航天大学 A kind of human motion recognition method based on Gabor filter Three dimensional convolution neural network model
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108960207A (en) * 2018-08-08 2018-12-07 广东工业大学 A kind of method of image recognition, system and associated component
CN109829378A (en) * 2018-12-28 2019-05-31 歌尔股份有限公司 Road sheds recognition methods, device and the electronic equipment of behavior
CN109886104A (en) * 2019-01-14 2019-06-14 浙江大学 A kind of motion feature extracting method based on the perception of video before and after frames relevant information
US20200394413A1 (en) * 2019-06-17 2020-12-17 The Regents of the University of California, Oakland, CA Athlete style recognition system and method
CN110537922A (en) * 2019-09-09 2019-12-06 北京航空航天大学 Human body walking process lower limb movement identification method and system based on deep learning
CN111985367A (en) * 2020-08-07 2020-11-24 湖南大学 Pedestrian re-recognition feature extraction method based on multi-scale feature fusion
US20220148335A1 (en) * 2020-09-03 2022-05-12 Board Of Trustees Of Michigan State University Disentangled Representations For Gait Recognition
CN112489119A (en) * 2020-12-21 2021-03-12 北京航空航天大学 Monocular vision positioning method for enhancing reliability
CN113870311A (en) * 2021-09-27 2021-12-31 安徽清新互联信息科技有限公司 Single-target tracking method based on deep learning
CN114038058A (en) * 2021-11-05 2022-02-11 上海交通大学 Parallel human body posture detection tracking method based on posture guiding re-recognition features
CN114220061A (en) * 2021-12-28 2022-03-22 青岛科技大学 Multi-target tracking method based on deep learning
CN114565856A (en) * 2022-02-25 2022-05-31 西安电子科技大学 Target identification method based on multiple fusion deep neural networks
CN114332666A (en) * 2022-03-11 2022-04-12 齐鲁工业大学 Image target detection method and system based on lightweight neural network model
CN114494981A (en) * 2022-04-07 2022-05-13 之江实验室 Action video classification method and system based on multi-level motion modeling

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
CHAO FAN等: "GaitPart: Temporal Part-based Model for Gait Recognition", 《CVPR》 *
MINJIE LIU等: "Survey for person re-identification based on coarse-to-fine feature learning", 《MULTIMEDIA TOOLS AND APPLICATIONS》 *
XIUHUI WANG等: "Gait recognition using multichannel convolution neural networks", 《NEURAL COMPUTING AND APPLICATIONS》 *
施沫寒等: "一种基于时间序列特征的可解释步态识别方法", 《中国科学: 信息科学》 *
王丽: "基于帧间变化向量的步态识别", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》 *
王新年等: "姿态特征结合2维傅里叶变换的步态识别", 《中国图象图形学报》 *
赵志杰等: "多重图像轮廓特征结合的步态识别算法", 《哈尔滨工业大学学报》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115240120A (en) * 2022-09-21 2022-10-25 中山大学深圳研究院 Behavior identification method based on countermeasure network and electronic equipment
CN115240120B (en) * 2022-09-21 2022-12-13 中山大学深圳研究院 Behavior identification method based on countermeasure network and electronic equipment
CN116188460A (en) * 2023-04-24 2023-05-30 青岛美迪康数字工程有限公司 Image recognition method and device based on motion vector and computer equipment
CN116503943A (en) * 2023-04-24 2023-07-28 哈尔滨工程大学 Interpretable gait recognition method based on optimized dynamic geometric features
CN116188460B (en) * 2023-04-24 2023-08-25 青岛美迪康数字工程有限公司 Image recognition method and device based on motion vector and computer equipment

Also Published As

Publication number Publication date
CN114783069B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
Gao et al. Adaptive fusion and category-level dictionary learning model for multiview human action recognition
CN114783069B (en) Method, device, terminal equipment and storage medium for identifying object based on gait
Wu et al. Unsupervised deep video hashing via balanced code for large-scale video retrieval
Lin et al. A deep structured model with radius–margin bound for 3D human activity recognition
Sigal Human pose estimation
Soomro et al. Action recognition in realistic sports videos
Asadi-Aghbolaghi et al. Deep learning for action and gesture recognition in image sequences: A survey
Charalampous et al. On-line deep learning method for action recognition
Srinivas et al. Structured sparse priors for image classification
CN114818989B (en) Gait-based behavior recognition method and device, terminal equipment and storage medium
CN113095370B (en) Image recognition method, device, electronic equipment and storage medium
Fernando et al. Discriminatively learned hierarchical rank pooling networks
CN113255625B (en) Video detection method and device, electronic equipment and storage medium
Shan et al. Gpa-net: No-reference point cloud quality assessment with multi-task graph convolutional network
Dutta et al. Unsupervised deep metric learning via orthogonality based probabilistic loss
Singh et al. Progress of human action recognition research in the last ten years: a comprehensive survey
Chaudhuri Deep learning models for face recognition: A comparative analysis
Ming et al. Uniform local binary pattern based texture-edge feature for 3D human behavior recognition
Li et al. Future frame prediction based on generative assistant discriminative network for anomaly detection
CN115240120B (en) Behavior identification method based on countermeasure network and electronic equipment
Lin et al. The design of error-correcting output codes based deep forest for the micro-expression recognition
CN113761282A (en) Video duplicate checking method and device, electronic equipment and storage medium
CN111680618B (en) Dynamic gesture recognition method based on video data characteristics, storage medium and device
Ramachandra et al. Perceptual metric learning for video anomaly detection
Martı́nez Carrillo et al. A compact and recursive Riemannian motion descriptor for untrimmed activity recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant