CN113850135A - Dynamic gesture recognition method and system based on time shift frame - Google Patents

Dynamic gesture recognition method and system based on time shift frame

Info

Publication number
CN113850135A
CN113850135A (application CN202110973739.0A)
Authority
CN
China
Prior art keywords
dynamic gesture
gesture recognition
network model
module
dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110973739.0A
Other languages
Chinese (zh)
Inventor
吴心怡
胡超
李恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
709th Research Institute of CSIC
Original Assignee
709th Research Institute of CSIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 709th Research Institute of CSIC filed Critical 709th Research Institute of CSIC
Priority to CN202110973739.0A priority Critical patent/CN113850135A/en
Publication of CN113850135A publication Critical patent/CN113850135A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dynamic gesture recognition method and system based on a time shift frame. A dynamic gesture recognition network model is constructed whose basic network is a residual network; the residual network comprises an attention module and a time shift module, the attention module being used to optimize the intermediate features extracted by the residual blocks, and the time shift module being used to perform time-dimension modeling and fusion of the attention features between the residual blocks on each layer. A two-dimensional convolutional neural network of lower complexity thus replaces the three-dimensional convolutional neural network, and the recognition effect of the RGB-D image mode can be achieved in the RGB image mode. The dynamic gesture recognition network model is used to detect and recognize dynamic gesture videos, which solves the problems of large training data volume and high algorithm complexity in existing three-dimensional dynamic gesture recognition methods.

Description

Dynamic gesture recognition method and system based on time shift frame
Technical Field
The invention relates to the technical field of computer vision and human-computer interaction, in particular to a dynamic gesture recognition method and system based on a time shift frame.
Background
The main task of gesture recognition is to extract features from images or videos and classify them to obtain the corresponding labels and interpretations; its applications are very wide, such as human-computer interaction, visual monitoring, and video retrieval. Gesture recognition can be divided into static and dynamic gesture recognition according to whether the gesture moves. Compared with static gesture recognition, the recognition task for dynamic gestures may be harder: the spatio-temporal features of the gesture action often need to be learned and trained from a continuous video sequence so that different dynamic gestures can be classified and recognized. At present, the deep learning models with higher accuracy in dynamic gesture recognition generally use depth images or fused RGB-D data as input and are trained with a three-dimensional convolutional neural network to obtain a dynamic gesture recognition model; this entails a large data volume and high algorithm complexity, which brings certain difficulty to the training and testing of the dynamic gesture recognition model.
Currently, the field of three-dimensional dynamic gesture recognition includes the following methods:
the patent document "Dynamic gesture recognition method based on dual-channel deep convolutional neural network" (CN201710990519.2) proposes a dual-channel deep convolutional network in which the temporal and spatial features of a dynamic gesture in depth space and color space are respectively extracted and fused through multi-stage deep convolutional and pooling layers. The method obtains the motion information in the image sequence by subtracting the pixels of the previous frame from the pixels of the next frame, and removes noise by median filtering and erosion-before-dilation to obtain the foreground and the motion information separately. However, this method struggles to eliminate the influence of illumination and background: when the color of the hand is similar to the background, or the background is cluttered, the foreground and the motion information are easily confused, reducing the accuracy of dynamic gesture recognition;
the patent document "a dynamic gesture recognition method and system based on a deep neural network" (CN201810745350.9), the method collects dynamic gesture video segments of RGB images and depth information to generate a training sample data set, and designs a dynamic gesture recognition network model, the network model is composed of a feature extraction network, a previous and next frame association network and a classification recognition network, wherein the previous and next frame association network is used for performing previous and next time frame association mapping on feature vectors obtained by the sample of each gesture meaning through the feature extraction network, and merging the feature vectors into a fusion feature vector of each gesture meaning. The method does not belong to an end-to-end network, a calculation bottleneck possibly exists, and a certain limitation is placed on real-time performance;
the patent document "A multi-modal dynamic gesture recognition method based on a lightweight 3D residual network and a TCN" (CN202011467797.8) proposes a multi-modal dynamic gesture recognition method that takes an RGB-D image sequence as input and uses a lightweight 3D residual network and a temporal convolutional network as basic models to extract long- and short-term spatio-temporal features. Because the method uses RGB-D data as network input, the data volume is large, and the features of the two modes must be extracted and fused at a later stage, which increases the complexity of the network and the difficulty of training;
the patent document "A dynamic gesture recognition method and system based on a self-attention mechanism" (CN202010607626.4) proposes a multi-modal input strategy to describe the occurrence process of a dynamic gesture, and uses a spatial self-attention mechanism based on non-local information statistics to calculate the dependency between any two elements at any distance on a feature map, thereby directly obtaining the influence of global information on each element of the whole feature map.
Disclosure of Invention
The invention provides a dynamic gesture recognition method and system based on a time shift frame to overcome the above technical defects.
In order to achieve the above technical object, a first aspect of the present invention provides a dynamic gesture recognition method based on a time shift frame, which includes the following steps:
acquiring a dynamic gesture video sample, labeling the dynamic gesture video sample and manufacturing a dynamic gesture image data set;
constructing a dynamic gesture recognition network model, wherein a basic network of the dynamic gesture recognition network model is a residual error network, the residual error network comprises an attention module and a time shifting module, the attention module is used for optimizing intermediate features obtained by extracting residual error blocks, and the time shifting module is used for carrying out time dimension modeling fusion on the attention features among the residual error blocks of each layer;
training a dynamic gesture recognition network model by using a dynamic gesture image data set;
and detecting and recognizing the dynamic gesture video by adopting the trained dynamic gesture recognition network model.
A second aspect of the present invention provides a dynamic gesture recognition system based on a time shift frame, which comprises the following functional modules:
the data acquisition module is used for acquiring a dynamic gesture video sample, labeling the dynamic gesture video sample and manufacturing a dynamic gesture image data set;
the network construction module is used for constructing a dynamic gesture recognition network model, a basic network of the dynamic gesture recognition network model is a residual error network, the residual error network comprises an attention module and a time shift module, the attention module is used for optimizing intermediate features obtained by extracting residual error blocks, and the time shift module is used for carrying out time dimension modeling fusion on the attention features among the residual error blocks on each layer;
the network training module is used for training the dynamic gesture recognition network model by utilizing the dynamic gesture image data set;
and the detection and recognition module is used for detecting and recognizing the dynamic gesture video by adopting the trained dynamic gesture recognition network model.
A third aspect of the present invention provides a server, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above-mentioned dynamic gesture recognition method based on a time shift frame when executing the computer program.
A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of a dynamic gesture recognition method based on a time-shift framework as described above.
Compared with the prior art, the dynamic gesture recognition method and system based on the time shift frame adopt a residual network as the basic network of a two-dimensional convolutional neural network, improve the intermediate features of the residual network by adding an attention module, and perform time-dimension modeling and fusion of the improved features, based on the time shift frame, by adding a time shift module. A two-dimensional convolutional neural network of lower complexity thus replaces the three-dimensional convolutional neural network, the recognition effect of the RGB-D image mode can be achieved in the RGB image mode, and the problems of large training data volume and high algorithm complexity in existing three-dimensional dynamic gesture recognition methods are solved.
Drawings
FIG. 1 is a flow chart of a dynamic gesture recognition method based on a time shift frame according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a dynamic gesture recognition network model according to an embodiment of the present invention;
fig. 3 is a block diagram of a dynamic gesture recognition system based on a time shift framework according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Based on the above, an embodiment of the present invention provides a dynamic gesture recognition method based on a time shift frame, as shown in fig. 1, which includes the following steps:
and S1, acquiring a dynamic gesture video sample, labeling the dynamic gesture video sample and making a dynamic gesture image data set.
Namely, RGB video clips of C dynamic gestures with different meanings are recorded from a camera; at least 50 different video clips are collected for each dynamic gesture, and the video clips are stored and labeled according to gesture type.
The collected dynamic gesture video clips are extracted frame by frame to form an image sequence. Specifically, an input video clip {F1, F2, …, Fn} is extracted, where Fn represents each frame of image composing the video, and each frame is saved to form the image sequence. Inter-frame difference processing is then performed on the image sequence: the inter-frame difference method removes static frames that contain no gesture motion information, reducing the redundant information of the image sequence. Finally, the images in the sequence are randomly cropped to 224 × 224 to form the dynamic gesture image data set.
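The static-frame removal by inter-frame differencing can be sketched as follows. The flattened-pixel-list frame representation and the mean-absolute-difference threshold are illustrative assumptions, not values fixed by the method:

```python
def is_static(frame_a, frame_b, threshold=0.05):
    """True when two frames differ too little to carry gesture motion.
    Frames are flattened lists of pixel intensities; the threshold on the
    mean absolute difference is a hypothetical choice."""
    diffs = [abs(a - b) for a, b in zip(frame_a, frame_b)]
    return sum(diffs) / len(diffs) < threshold

def drop_static_frames(frames, threshold=0.05):
    """Remove static frames from the image sequence by inter-frame
    differencing, reducing redundant information."""
    kept = [frames[0]]
    for frame in frames[1:]:
        if not is_static(kept[-1], frame, threshold):
            kept.append(frame)
    return kept
```

In practice the same check would run on full-resolution frames before the random 224 × 224 crop.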
An image sample set of the dynamic gesture motion may also be sampled: the input image sequence is uniformly divided into n segments, and one frame is sampled from each segment, i.e. the input image sequence is sampled into n pictures, which serve as the input sample set of the dynamic gesture recognition network model.
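The segment-based sampling can be sketched as below; taking the centre frame of each segment is an assumption, since the text only requires one frame per segment:

```python
def uniform_sample(frames, n):
    """Divide the image sequence into n equal segments and sample one
    frame from each, yielding the n-picture input sample set."""
    seg = len(frames) / n
    # take the frame at the centre of each segment (hypothetical choice)
    return [frames[int(seg * i + seg / 2)] for i in range(n)]
```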
S2, constructing a dynamic gesture recognition network model, wherein a basic network of the dynamic gesture recognition network model is a residual error network, the residual error network comprises an attention module and a time shifting module, the attention module is used for optimizing intermediate features obtained by extracting residual error blocks, and the time shifting module is used for carrying out time dimension modeling fusion on the attention features among the residual error blocks on each layer.
Specifically, the basic network of the dynamic gesture recognition network model is the residual network resnet50, which is used for extracting gesture motion features from the image sequence. As shown in FIG. 2, an attention module is inserted between each pair of residual blocks of the resnet50 network; the attention module redistributes the original features, emphasizing the weights of important features and compressing unnecessary ones, without changing the size of the intermediate feature map and without destroying the network structure of resnet50. The attention module comprises a channel attention unit and a spatial attention unit. Specifically, as shown in fig. 2, the intermediate feature map F extracted by a residual block of resnet50 has size W × H × C. F first passes through the channel attention module Mc to obtain the attention weight Mc(F) of size 1 × 1 × C; Mc(F) is multiplied with the feature F to obtain the feature F'. The formula is as follows:

F' = Mc(F) ⊗ F        (1)

F' then passes through the spatial attention module Ms to obtain the attention weight Ms(F') of size W × H × 1; Ms(F') is multiplied with F' to obtain the final attention feature F'', of size W × H × C, which is the output feature of the attention module. The formula is as follows:

F'' = Ms(F') ⊗ F'        (2)

Since this feature map has the same size as the input, the attention feature is input directly into the next residual block of the resnet50 network.
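A minimal numerical sketch of the channel- and spatial-attention steps above. Each attention unit is reduced here to average pooling plus a sigmoid; the real CBAM-style units also use max pooling and a learned shared MLP / convolution, so this is an illustrative simplification:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_block(F):
    """Apply channel attention then spatial attention to a W x H x C
    feature map stored as nested lists; output keeps the same size."""
    W, H, C = len(F), len(F[0]), len(F[0][0])
    # channel attention Mc(F): a 1 x 1 x C weight from global average pooling
    mc = [sigmoid(sum(F[w][h][c] for w in range(W) for h in range(H)) / (W * H))
          for c in range(C)]
    Fp = [[[mc[c] * F[w][h][c] for c in range(C)] for h in range(H)]
          for w in range(W)]                       # F'  = Mc(F) (x) F
    # spatial attention Ms(F'): a W x H x 1 weight from channel-wise pooling
    ms = [[sigmoid(sum(Fp[w][h]) / C) for h in range(H)] for w in range(W)]
    Fpp = [[[ms[w][h] * Fp[w][h][c] for c in range(C)] for h in range(H)]
           for w in range(W)]                      # F'' = Ms(F') (x) F'
    return Fpp                                     # same W x H x C size as F
```

Because the output size equals the input size, the block can be dropped between residual blocks without altering the surrounding network, which is exactly the property the method relies on.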
As shown in fig. 2, the time shift module is arranged between the attention feature outputs of the residual blocks on each layer, and is specifically configured to: along the time dimension T, replace part of the channels of the image features of the current frame with part of the channels of the image features of the preceding and following frames. For example, the time shift module may, along the time dimension T, replace 1/4 of the channels of the current frame with 1/8 of the channels of the previous frame and 1/8 of the channels of the next frame. Specifically, let Xi denote the attention feature extracted from image frame Fi at time i in the input image sequence {F1, F2, …, Fn}; 1/8 of the channels of Xi-1 and 1/8 of the channels of Xi+1 are shifted into the channels of Xi. The 1/8 of the channels shifted from Xi-1 into Xi is denoted X(i-1)→i, the 1/8 of the channels shifted from Xi+1 into Xi is denoted X(i+1)→i, and the 3/4 of the channels retained by Xi itself is denoted Xi,keep. These three parts are multiplied by the corresponding weights w1, w2 and w3 and summed; the result Y is a new feature that fuses the information of the current frame with that of the preceding and following frames. The specific calculation formula is as follows:

Y = w1 · X(i-1)→i + w2 · X(i+1)→i + w3 · Xi,keep        (3)
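The channel shift and weighted fusion described above can be sketched per frame as follows. Each frame's attention feature is represented as a flat list of C channel values, the sequence ends are zero-padded, and the unit default weights are simplifying assumptions standing in for the learned fusion weights:

```python
def time_shift(features, w1=1.0, w2=1.0, w3=1.0):
    """Shift 1/8 of the channels from the previous and next frames into
    each frame, keeping 3/4 of the frame's own channels, then fuse the
    three parts with weights w1, w2, w3."""
    n = len(features)
    c = len(features[0])
    s = c // 8                                    # channels taken per neighbour
    out = []
    for i in range(n):
        prev = features[i - 1] if i > 0 else [0.0] * c      # zero-pad at ends
        nxt = features[i + 1] if i < n - 1 else [0.0] * c
        frame = (
            [w1 * v for v in prev[:s]] +            # 1/8 from frame i-1
            [w2 * v for v in nxt[s:2 * s]] +        # 1/8 from frame i+1
            [w3 * v for v in features[i][2 * s:]]   # 3/4 kept from frame i
        )
        out.append(frame)
    return out
```

This is how the two-dimensional backbone gains temporal modeling at essentially zero extra convolution cost.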
After the features optimized by the attention module have undergone time-dimension modeling and fusion, the new feature information is input into a fully connected layer. The fully connected layer multiplies the input vector by a weight matrix, adds a bias, and outputs a score for each of the C dynamic gesture categories. The calculation formula is as follows:

y = Softmax(Wz + b)        (4)

where W is the weight matrix, b is the bias term, and z is the input vector; the Softmax function maps the C scores to probabilities y in (0, 1), and the category with the highest probability is the dynamic gesture category predicted by the model.
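The fully connected layer plus Softmax can be written out directly as below; W is stored as one row of weights per class, and subtracting the maximum score before exponentiation is a numerical-stability detail not stated in the text:

```python
import math

def fc_softmax(z, W, b):
    """Compute scores Wz + b and map them to class probabilities;
    return the probabilities and the predicted class index."""
    scores = [sum(wr[k] * z[k] for k in range(len(z))) + br
              for wr, br in zip(W, b)]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, probs.index(max(probs))
```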
And S3, training the dynamic gesture recognition network model by using the dynamic gesture image data set.
Specifically, the step S3 includes the following sub-steps:
and S31, dividing the training set and the testing set in a 7:3 ratio for the image sequence of each type of gestures in the dynamic gesture image data set.
S32, training the dynamic gesture recognition network model with the training set, and calculating the loss function value of the dynamic gesture recognition network model using the cross-entropy loss function, calculated as follows:

L = -(1/m) Σ(i=1..m) Σ(j=1..n) y(i,j) · log ŷ(i,j)        (5)

where m is the number of samples processed by the neural network at one time, n is the number of dynamic gesture categories contained in the training set, y(i,j) is the true label of the ith sample on the jth class, and ŷ(i,j) is the predicted probability value of the ith sample on the jth class.
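The cross-entropy loss above can be computed directly as below; the small eps guarding log(0) is an implementation detail, not part of the stated formula:

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over the m samples of -sum_j y_ij * log(yhat_ij),
    where y_true holds one-hot labels and y_pred holds probabilities."""
    m = len(y_true)
    total = 0.0
    for yi, pi in zip(y_true, y_pred):
        total += -sum(t * math.log(p + eps) for t, p in zip(yi, pi))
    return total / m
```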
And S33, updating and optimizing all weight parameters of the dynamic gesture recognition network model by using a back propagation algorithm according to the obtained loss function value so as to obtain the optimized and updated dynamic gesture recognition network model.
S34, sub-steps S32 and S33 are repeated, and the updated dynamic gesture recognition network model is iteratively trained until the loss function reaches its minimum, yielding the iteratively trained dynamic gesture recognition network model. For the above training process, the initialization parameters are as follows: for the scenario studied by the invention, the number of training iterations is set to 100, the optimizer is Adam, the initial learning rate is set to 0.01 and is attenuated by a factor of 10 at each iteration, and a resnet50 pre-trained model is used. After each iteration, the test accuracy is compared with that of the previous iteration; if the current test accuracy is greater, the currently generated training model is saved as the optimal model, until all iterations are finished.
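The keep-the-best iteration scheme of this step can be reduced to the skeleton below; train_step and evaluate are hypothetical callables standing in for the real Adam training pass and the test-set evaluation:

```python
def train_loop(train_step, evaluate, iterations=100):
    """Run the given number of iterations; after each one, compare the
    test accuracy with the best so far and keep the current model as
    optimal only when the accuracy improves."""
    best_acc, best_model = -1.0, None
    for it in range(iterations):
        model = train_step(it)       # one training iteration (hypothetical)
        acc = evaluate(model)        # accuracy on the test set (hypothetical)
        if acc > best_acc:
            best_acc, best_model = acc, model
    return best_model, best_acc
```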
And S35, verifying the recognition accuracy of the dynamic gesture recognition network model after iterative training by using the test set until the recognition accuracy reaches the optimal value, and obtaining the trained dynamic gesture recognition network model.
And S4, detecting and recognizing the dynamic gesture video by adopting the trained dynamic gesture recognition network model.
The dynamic gesture recognition method based on a time shift frame of the invention adopts a residual network as the basic network of a two-dimensional convolutional neural network, improves the intermediate features of the residual network by adding an attention module, and performs time-dimension modeling and fusion of the improved features, based on the time shift frame, by adding a time shift module. A two-dimensional convolutional neural network of lower complexity thus replaces the three-dimensional convolutional neural network, the recognition effect of the RGB-D image mode can be achieved in the RGB image mode, and the problems of large training data volume and high algorithm complexity in existing three-dimensional dynamic gesture recognition methods are solved.
As shown in fig. 3, an embodiment of the present invention further provides a dynamic gesture recognition system based on a time shift frame, which includes the following functional modules:
the data acquisition module 10 is used for acquiring a dynamic gesture video sample, labeling the dynamic gesture video sample and making a dynamic gesture image data set;
the network construction module 20 is configured to construct a dynamic gesture recognition network model, a basic network of the dynamic gesture recognition network model is a residual network, the residual network includes an attention module and a time shift module, the attention module is configured to optimize intermediate features obtained by extracting residual blocks, and the time shift module is configured to perform time-dimension modeling fusion on the attention features between each layer of residual blocks;
a network training module 30, configured to train a dynamic gesture recognition network model using the dynamic gesture image dataset;
and the detection and recognition module 40 is configured to perform detection and recognition on the dynamic gesture video by using the trained dynamic gesture recognition network model.
The execution mode of the dynamic gesture recognition system based on the time shift frame of this embodiment is substantially the same as that of the above dynamic gesture recognition method based on the time shift frame, and therefore, detailed description thereof is omitted.
The server in this embodiment is a device that provides computing services, generally a computer with high computing power that serves multiple consumers over a network. The server of this embodiment comprises a memory containing an executable program stored thereon, a processor, and a system bus. It will be understood by those skilled in the art that the terminal device structure of this embodiment does not constitute a limitation of the terminal device, which may include more or fewer components than shown, combine some components, or use a different arrangement of components.
The memory may be used to store software programs and modules, and the processor may execute various functional applications of the terminal and data processing by operating the software programs and modules stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The memory contains an executable program of the dynamic gesture recognition method based on the time shift frame. The executable program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to complete the acquisition of information and implement the process; the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, used to describe the execution process of the computer program in the server. For example, the computer program may be divided into the data acquisition module 10, the network construction module 20, the network training module 30, and the detection recognition module 40.
The processor is a control center of the server, connects various parts of the whole terminal equipment by various interfaces and lines, and executes various functions of the terminal and processes data by running or executing software programs and/or modules stored in the memory and calling data stored in the memory, thereby performing overall monitoring of the terminal. Alternatively, the processor may include one or more processing units; preferably, the processor may integrate an application processor, which mainly handles operating systems, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor.
The system bus is used to connect functional units in the computer, and can transmit data information, address information and control information, and the types of the functional units can be PCI bus, ISA bus, VESA bus, etc. The system bus is responsible for data and instruction interaction between the processor and the memory. Of course, the system bus may also access other devices such as network interfaces, display devices, etc.
The server at least includes a CPU, a chipset, a memory, a disk system, and the like, and other components are not described herein again.
In the embodiment of the present invention, the executable program executed by the processor included in the terminal specifically includes: a dynamic gesture recognition method based on a time shift frame comprises the following steps:
acquiring a dynamic gesture video sample, labeling the dynamic gesture video sample and manufacturing a dynamic gesture image data set;
constructing a dynamic gesture recognition network model, wherein a basic network of the dynamic gesture recognition network model is a residual error network, the residual error network comprises an attention module and a time shifting module, the attention module is used for optimizing intermediate features obtained by extracting residual error blocks, and the time shifting module is used for carrying out time dimension modeling fusion on the attention features among the residual error blocks of each layer;
training a dynamic gesture recognition network model by using a dynamic gesture image data set;
and detecting and recognizing the dynamic gesture video by adopting the trained dynamic gesture recognition network model.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A dynamic gesture recognition method based on a time shift frame is characterized by comprising the following steps:
acquiring a dynamic gesture video sample, labeling the dynamic gesture video sample and manufacturing a dynamic gesture image data set;
constructing a dynamic gesture recognition network model, wherein a basic network of the dynamic gesture recognition network model is a residual error network, the residual error network comprises an attention module and a time shifting module, the attention module is used for optimizing intermediate features obtained by extracting residual error blocks, and the time shifting module is used for carrying out time dimension modeling fusion on the attention features among the residual error blocks of each layer;
training a dynamic gesture recognition network model by using a dynamic gesture image data set;
and detecting and recognizing the dynamic gesture video by adopting the trained dynamic gesture recognition network model.
2. The method for recognizing dynamic gestures based on time shift frames as claimed in claim 1, wherein after the video samples of dynamic gestures are labeled, the video segments are preprocessed, the frames are extracted one by one to form an image sequence, and the image sequence is subjected to inter-frame difference processing.
3. The dynamic gesture recognition method based on a time shift framework according to claim 2, wherein the image sample set of dynamic gesture actions is sampled to obtain the input sample set of the dynamic gesture recognition network model.
4. The dynamic gesture recognition method based on a time shift framework according to claim 1, wherein an attention module is disposed between adjacent residual blocks, and a time shift module is disposed between the attention feature outputs of the residual blocks of each layer.
5. The dynamic gesture recognition method based on a time shift framework according to claim 1, wherein the attention module comprises a channel attention unit and a spatial attention unit.
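To make the channel/spatial attention split of claim 5 concrete, here is a heavily simplified plain-Python sketch. Real attention units learn their gating weights (e.g. via small MLPs or convolutions, as in CBAM-style modules); the parameter-free sigmoid-of-mean gates below are an assumption used purely to show the two axes along which the features are reweighted.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_attention(feat):
    """feat: list of C channels, each a list of spatial values.
    Gate each whole channel by sigmoid(mean over its spatial positions)."""
    gates = [sigmoid(sum(ch) / len(ch)) for ch in feat]
    return [[g * v for v in ch] for g, ch in zip(gates, feat)]

def spatial_attention(feat):
    """Gate each spatial position by sigmoid(mean over all channels)."""
    C, N = len(feat), len(feat[0])
    gates = [sigmoid(sum(feat[c][i] for c in range(C)) / C) for i in range(N)]
    return [[gates[i] * ch[i] for i in range(N)] for ch in feat]

feat = [[1.0, -1.0],   # channel 0 over 2 spatial positions
        [2.0,  0.0]]   # channel 1
refined = spatial_attention(channel_attention(feat))
```

Applying the two units in sequence, as above, is one common composition; the patent text does not specify the ordering.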
6. The dynamic gesture recognition method based on a time shift framework according to claim 1, wherein the time shift module is specifically configured to: along the time dimension T, replace part of the channels of the current frame's image features with the corresponding channels of the preceding frame's image features, and another part with the corresponding channels of the following frame's image features.
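The temporal shift operation of claim 6 can be sketched as follows in plain Python; the `fold_div` parameter and the zero-padding at clip boundaries are assumptions borrowed from common temporal-shift implementations, not specifics of the patent, which operate on real feature tensors rather than nested lists.

```python
def temporal_shift(x, fold_div=4):
    """x: a clip of T frames, each a list of C channel values.
    The first C//fold_div channels take their value from the NEXT frame,
    the second C//fold_div from the PREVIOUS frame, and the remaining
    channels stay in place, mixing temporal context at zero extra cost.
    Missing neighbors at the clip boundaries are zero-padded."""
    T, C = len(x), len(x[0])
    fold = C // fold_div
    out = [list(frame) for frame in x]
    for t in range(T):
        for c in range(fold):              # shift left: pull from frame t+1
            out[t][c] = x[t + 1][c] if t + 1 < T else 0
        for c in range(fold, 2 * fold):    # shift right: pull from frame t-1
            out[t][c] = x[t - 1][c] if t - 1 >= 0 else 0
    return out

# T=3 frames, C=4 channels: channel 0 shifts left, channel 1 shifts right.
clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
shifted = temporal_shift(clip)
```

Because only data movement is involved, such a shift adds temporal modeling to a 2D residual network with no additional multiply-accumulate operations.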
7. The dynamic gesture recognition method based on a time shift framework according to claim 1, wherein training the dynamic gesture recognition network model with the dynamic gesture image dataset specifically comprises the following steps:
dividing the dynamic gesture image dataset into a training set and a test set in proportion;
training the dynamic gesture recognition network model with the training set, and calculating the loss value of the dynamic gesture recognition network model with a cross-entropy loss function;
updating and optimizing all weight parameters of the dynamic gesture recognition network model with a back propagation algorithm according to the obtained loss value, to obtain an optimized and updated dynamic gesture recognition network model;
performing iterative training on the updated dynamic gesture recognition network model until the loss function reaches its minimum value, to obtain an iteratively trained dynamic gesture recognition network model;
and verifying the recognition accuracy of the iteratively trained dynamic gesture recognition network model with the test set until the recognition accuracy reaches its optimum, thereby obtaining the trained dynamic gesture recognition network model.
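The cross-entropy loss and gradient-based update cycle of claim 7 can be illustrated with a toy example in plain Python. The "model" here is deliberately trivial (the class logits are themselves the trainable parameters) and the learning rate and step count are arbitrary assumptions; a real implementation would backpropagate through the full residual network.

```python
import math

def softmax(z):
    m = max(z)                             # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(logits, label):
    """Negative log-likelihood of the true class under softmax(logits)."""
    return -math.log(softmax(logits)[label])

logits = [0.0, 0.0, 0.0]                   # trainable parameters of the toy model
label = 2                                  # ground-truth gesture class
lr = 0.5
for step in range(100):                    # iterative training loop
    p = softmax(logits)
    # Gradient of cross-entropy w.r.t. logits: p_k - 1[k == label]
    grad = [p[k] - (1.0 if k == label else 0.0) for k in range(len(p))]
    logits = [w - lr * g for w, g in zip(logits, grad)]
loss = cross_entropy(logits, label)        # far below the initial ln(3) ≈ 1.10
```

The loop mirrors the claim's structure: compute the cross-entropy loss, derive gradients, update the weights, and iterate until the loss approaches its minimum.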
8. A dynamic gesture recognition system based on a time shift framework, characterized by comprising the following functional modules:
a data acquisition module, used for acquiring dynamic gesture video samples, labeling the dynamic gesture video samples, and creating a dynamic gesture image dataset;
a network construction module, used for constructing a dynamic gesture recognition network model, wherein the base network of the dynamic gesture recognition network model is a residual network, the residual network comprises an attention module and a time shift module, the attention module is used for refining the intermediate features extracted by the residual blocks, and the time shift module is used for performing temporal modeling and fusion of the attention features between the residual blocks of each layer;
a network training module, used for training the dynamic gesture recognition network model with the dynamic gesture image dataset;
and a detection and recognition module, used for detecting and recognizing dynamic gesture videos with the trained dynamic gesture recognition network model.
9. A server comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the dynamic gesture recognition method based on a time shift framework according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the dynamic gesture recognition method based on a time shift framework according to any one of claims 1 to 7.
CN202110973739.0A 2021-08-24 2021-08-24 Dynamic gesture recognition method and system based on time shift frame Pending CN113850135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110973739.0A CN113850135A (en) 2021-08-24 2021-08-24 Dynamic gesture recognition method and system based on time shift frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110973739.0A CN113850135A (en) 2021-08-24 2021-08-24 Dynamic gesture recognition method and system based on time shift frame

Publications (1)

Publication Number Publication Date
CN113850135A true CN113850135A (en) 2021-12-28

Family

ID=78976076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110973739.0A Pending CN113850135A (en) 2021-08-24 2021-08-24 Dynamic gesture recognition method and system based on time shift frame

Country Status (1)

Country Link
CN (1) CN113850135A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973049A (en) * 2022-01-05 2022-08-30 上海人工智能创新中心 Lightweight video classification method for unifying convolution and self attention
CN114973049B (en) * 2022-01-05 2024-04-26 上海人工智能创新中心 Lightweight video classification method with unified convolution and self-attention
CN116311951A (en) * 2023-05-25 2023-06-23 湖南工商大学 Traffic volume data processing method, device, equipment and storage medium
CN116311951B (en) * 2023-05-25 2023-08-22 湖南工商大学 Traffic volume data processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112560999B (en) Target detection model training method and device, electronic equipment and storage medium
EP3608844A1 (en) Methods for training a crnn and for semantic segmentation of an inputted video using said crnn
CN111126202A (en) Optical remote sensing image target detection method based on void feature pyramid network
CN111696110B (en) Scene segmentation method and system
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN111832453B (en) Unmanned scene real-time semantic segmentation method based on two-way deep neural network
CN113850135A (en) Dynamic gesture recognition method and system based on time shift frame
CN111079507B (en) Behavior recognition method and device, computer device and readable storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN112070040A (en) Text line detection method for video subtitles
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
CN114170570A (en) Pedestrian detection method and system suitable for crowded scene
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
Yu et al. Progressive refined redistribution pyramid network for defect detection in complex scenarios
CN111914596A (en) Lane line detection method, device, system and storage medium
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN111275694B (en) Attention mechanism guided progressive human body division analysis system and method
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
Zhao et al. Depth enhanced cross-modal cascaded network for RGB-D salient object detection
Gong et al. Research on an improved KCF target tracking algorithm based on CNN feature extraction
Wang et al. A multi-scale attentive recurrent network for image dehazing
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination