CN113449573A - Dynamic gesture recognition method and device

Info

Publication number: CN113449573A
Authority: CN (China)
Prior art keywords: layer, gesture, feature data, images, feature
Legal status: Pending
Application number: CN202010235859.6A
Other languages: Chinese (zh)
Inventors: 吴觊豪, 马杰延
Current Assignee: Huawei Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority applications: CN202010235859.6A; PCT/CN2021/079699 (WO2021190296A1)
Publication: CN113449573A

Classifications

    • G06F18/00 Pattern recognition
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06N3/04 Neural networks; architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Neural networks; learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a dynamic gesture recognition method and device in the field of artificial intelligence (AI). The dynamic gesture recognition method comprises the following steps: acquiring a plurality of images containing gesture actions; and recognizing the plurality of images through a gesture recognition model to obtain the types and attributes of the gesture actions in the plurality of images, wherein the attributes comprise a forward stroke and a return stroke. The method can improve recognition accuracy.

Description

Dynamic gesture recognition method and device
Technical Field
The present application relates to the field of Artificial Intelligence (AI), and in particular, to a dynamic gesture recognition method and apparatus.
Background
Gestures are expressive physical movements that convey a variety of meaningful information. Dynamic gesture recognition is one of the research hotspots in the field of deep learning. As a new human-computer interaction (HCI) mode, it has broad application prospects in fields such as virtual reality, smart home, early childhood education, and medical robots. In dynamic gesture recognition, when a user continuously waves a hand in one direction, the hand must return to the starting point by moving in the opposite direction; this return movement is the gesture return stroke, and it easily causes the terminal device to misjudge the gesture direction. For example, if the user needs to wave twice to the right in succession, the hand must return to the starting point of the first wave before waving a second time, and during this return movement the terminal device can easily misinterpret the motion as a leftward gesture.
Therefore, how to accurately recognize gesture actions is a problem to be solved urgently at present.
Disclosure of Invention
The embodiment of the application provides a dynamic gesture recognition method and device, which can improve recognition accuracy.
In a first aspect, an embodiment of the present application provides a dynamic gesture recognition method, including: acquiring a plurality of images containing gesture actions; and recognizing the plurality of images through a gesture recognition model to obtain the types and attributes of gesture actions in the plurality of images, wherein the attributes comprise a forward stroke and a return stroke.
In the scheme provided by the application, the gesture recognition model is used to recognize the plurality of images containing gesture actions, the types and attributes of the gesture actions in the plurality of images are obtained, and the corresponding operation is then performed according to the obtained types and attributes. In this way, misjudgment of the gesture by the terminal caused by the return stroke can be avoided.
In a possible implementation manner, a plurality of sample images carrying annotation information are obtained, the sample images are a plurality of images containing gesture actions, and the annotation information includes types and attributes of the gesture actions in the sample images; and training an initial gesture recognition model according to the plurality of sample images carrying the labeling information to obtain the gesture recognition model.
In the scheme provided by the application, the type and the attribute of the gesture motion recorded in the sample image can be obtained in advance, and then the initial gesture recognition model is trained by utilizing a plurality of sample images carrying the type and the attribute, so that the trained gesture recognition model has the capability of recognizing the type and the attribute of the gesture motion recorded in a plurality of images, and thus, a plurality of images input into the gesture recognition model can be recognized, and the type and the attribute of the gesture motion recorded in the plurality of images can be output.
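For illustration only, a minimal training-loop sketch (not the patent's implementation) is shown below in PyTorch; the model interface, the loader that yields (images, type label, attribute label), the learning rate, and the combination of the two losses are all assumptions.

```python
import torch
import torch.nn as nn

def train_gesture_model(model, loader, epochs=10, lr=1e-3):
    # Hypothetical loop: each sample carries a gesture-type label and a
    # forward/return-stroke attribute label (1 = forward stroke, 0 = return stroke).
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    type_loss_fn = nn.CrossEntropyLoss()   # multi-class gesture type
    attr_loss_fn = nn.BCEWithLogitsLoss()  # binary forward/return attribute
    for _ in range(epochs):
        for images, type_label, attr_label in loader:
            type_logits, attr_logit = model(images)
            loss = (type_loss_fn(type_logits, type_label)
                    + attr_loss_fn(attr_logit.squeeze(-1), attr_label.float()))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```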
In one possible implementation, the gesture recognition model includes a spatial feature module, a temporal feature module, and a classification module; the identifying the plurality of images through the gesture recognition model to obtain the types and the attributes of the gesture actions in the plurality of images comprises the following steps: inputting the multiple images into the spatial feature module to obtain first feature data, wherein the first feature data comprises spatial features of gesture actions in the multiple images; inputting the first feature data into the time domain feature module to obtain second feature data, wherein the second feature data comprises time domain features of the first feature data in a time dimension; and inputting the second characteristic data into the classification module to obtain the types and attributes of the gesture actions in the multiple images.
In the scheme provided by the application, after the spatial feature extraction is carried out on a plurality of input images through the spatial feature module of the gesture recognition model, first feature data is obtained, the time domain feature module of the gesture recognition model extracts the time domain features of the plurality of images on the time dimension aiming at the first feature data, and finally the classification module of the gesture recognition model is input to obtain the types and attributes of gesture actions in the plurality of images.
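As a rough sketch only (the patent does not prescribe an implementation), the composition of the three modules could be expressed in PyTorch as follows; the input layout (batch, frames, channels, height, width) and the sub-module interfaces are assumptions.

```python
import torch.nn as nn

class GestureRecognitionModel(nn.Module):
    """Illustrative composition: spatial module -> temporal module -> classification module."""
    def __init__(self, spatial, temporal, classifier):
        super().__init__()
        self.spatial = spatial        # per-frame CNN producing spatial features
        self.temporal = temporal      # temporal features over the frame sequence
        self.classifier = classifier  # gesture type + forward/return attribute

    def forward(self, images):
        # images: (batch, T, C, H, W), T frames per sample
        b, t = images.shape[:2]
        frames = images.flatten(0, 1)          # (b*T, C, H, W)
        first = self.spatial(frames)           # first feature data, (b*T, D)
        first = first.view(b, t, -1)           # restore the time dimension
        second = self.temporal(first)          # second feature data, (b, D')
        return self.classifier(second)         # (type output, attribute output)
```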
In one possible implementation, the time domain feature module includes a dimension transform layer, a convolution layer, a Batch Normalization (BN) layer, a rectified linear unit (ReLU) layer, a max pooling layer, and a feature combination layer; and the inputting the first feature data into the time domain feature module to obtain second feature data, where the second feature data includes a time domain feature of the first feature data in a time dimension, includes: determining, through the dimension transform layer according to the time information of the plurality of images, first time domain feature data corresponding to the first feature data in the time dimension; performing convolution processing on the first time domain feature data through the convolution layer to obtain second time domain feature data; and sequentially passing the second time domain feature data through the BN layer, the ReLU layer, the max pooling layer, and the feature combination layer to obtain the second feature data.
In the scheme provided by the application, the dimension transform layer of the time domain feature module determines, according to the time information of each of the plurality of images, the first time domain feature data corresponding, in the time dimension, to pixels at the same position in the first feature data of the plurality of images; the convolution layer of the time domain feature module performs convolution processing on the first time domain feature data to obtain the corresponding second time domain feature data; and the second time domain feature data sequentially passes through the BN layer, the ReLU layer, the max pooling layer, and the feature combination layer to obtain the second feature data.
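A minimal single-branch sketch of the layer chain described above (dimension transform, 1D convolution, BN, ReLU, max pooling); the channel counts and kernel size are assumptions, since this section does not specify them.

```python
import torch.nn as nn

class TemporalFeatureModule(nn.Module):
    """Sketch of the time domain feature module; sizes are illustrative assumptions."""
    def __init__(self, feat_dim, out_dim=256, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, out_dim, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveMaxPool1d(1)  # max pooling over the time axis

    def forward(self, first_features):
        # first_features: (batch, T, feat_dim), frames ordered by their time information
        x = first_features.permute(0, 2, 1)   # dimension transform: (batch, feat_dim, T)
        x = self.relu(self.bn(self.conv(x)))  # convolution over the time dimension
        return self.pool(x).squeeze(-1)       # second feature data, (batch, out_dim)
```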
In a possible implementation manner, the performing convolution processing on the first time domain feature data by using a convolution layer to obtain second time domain feature data includes: performing convolution processing on the first time domain feature data by using a first preset number of one-dimensional convolution layers with different convolution kernel sizes to obtain second time domain feature data, wherein the second time domain feature data comprises a first preset number of feature data with different scales; the step of sequentially passing the second time domain feature data through the BN layer, the ReLu layer, the maximum pooling layer, and the feature combining layer to obtain the second feature data includes: and sequentially passing the first preset number of feature data with different scales through the BN layer, the ReLu layer, the maximum pooling layer and the feature combination layer to obtain the second feature data.
In the scheme provided by the application, aiming at the first time domain characteristic data, the convolution layer of the time domain characteristic module performs convolution processing on the first time domain characteristic data by using a first preset number of one-dimensional convolution layers with different convolution kernel sizes to obtain a first preset number of characteristic data with different scales, the characteristic data passes through the BN layer, the ReLu layer, the maximum pooling layer and the characteristic combination layer, the characteristic combination layer fuses the first time domain characteristic data corresponding to the first preset number of characteristic data with different scales to obtain the second characteristic data corresponding to the first characteristic data. The use of the one-dimensional convolutional layer can effectively reduce the calculation amount and improve the processing efficiency of the convolutional layer of the time domain feature module.
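To make the multi-scale idea concrete, the sketch below assumes a first preset number of three one-dimensional convolution branches with kernel sizes 3, 5, and 7 (values chosen purely for illustration) whose pooled outputs are concatenated by the feature combination step.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Illustrative multi-branch 1D convolution; kernel sizes and channels are assumptions."""
    def __init__(self, feat_dim, out_dim=128, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(feat_dim, out_dim, k, padding=k // 2),
                nn.BatchNorm1d(out_dim),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            for k in kernel_sizes
        ])

    def forward(self, x):
        # x: (batch, feat_dim, T) -- first time-domain feature data
        scales = [branch(x).squeeze(-1) for branch in self.branches]  # different scales
        return torch.cat(scales, dim=1)  # feature combination: concatenate the scales
```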
In a possible implementation manner, the gesture recognition model further includes a first classifier and a second classifier, and the inputting the second feature data into the classification module to obtain the types and attributes of the gesture actions in the plurality of images includes: inputting the second feature data into the first classifier to obtain a first probability that the gesture actions in the plurality of images belong to each type; classifying a first gesture action, which is any one of the gesture actions in the plurality of images, into the type with the highest first probability corresponding to the first gesture action; inputting the second feature data into the second classifier to obtain a second probability that the gesture actions in the plurality of images belong to each attribute; and classifying the first gesture action into the attribute with the highest second probability corresponding to the first gesture action.
In the scheme provided by the application, the second feature data is input into the trained first classifier and second classifier to obtain the first probability that a gesture action belongs to each type and the second probability that it belongs to each attribute, and the first gesture action is classified into the type with the highest first probability and the attribute with the highest second probability corresponding to it.
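A minimal sketch of the two classifiers described above, assuming two linear heads on top of the second feature data: a softmax head for the gesture type and a sigmoid head for the forward/return-stroke attribute; the 0.5 decision threshold is an assumption.

```python
import torch
import torch.nn as nn

class ClassificationModule(nn.Module):
    """Illustrative classification heads; layer sizes and threshold are assumptions."""
    def __init__(self, feat_dim, num_types):
        super().__init__()
        self.type_head = nn.Linear(feat_dim, num_types)  # first classifier
        self.attr_head = nn.Linear(feat_dim, 1)          # second classifier

    def forward(self, second_features):
        type_prob = torch.softmax(self.type_head(second_features), dim=-1)  # first probabilities
        attr_prob = torch.sigmoid(self.attr_head(second_features))          # second probabilities
        gesture_type = type_prob.argmax(dim=-1)       # type with the highest first probability
        is_forward = attr_prob.squeeze(-1) > 0.5      # attribute with the higher second probability
        return gesture_type, is_forward
```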
In one possible implementation, the acquiring a plurality of images including gesture actions includes: acquiring a video to be identified; and extracting one image from the video to be recognized at intervals of a second preset number of images to obtain a plurality of images containing gesture actions.
In the scheme provided by the application, the video to be identified is obtained, one image is extracted from the video to be identified at intervals of a second preset number of images according to the time sequence of the images in the video to be identified, and the extracted third preset number of images is determined as the plurality of images under the condition that the number of the extracted images reaches a third preset number.
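The sampling rule can be sketched as follows (illustration only); here `interval` stands for the second preset number and `target_count` for the third preset number, neither of which is fixed by the text.

```python
def sample_frames(video_frames, interval, target_count):
    """Keep one frame every `interval` frames until `target_count` frames are collected."""
    sampled = []
    for i, frame in enumerate(video_frames):
        if i % (interval + 1) == 0:       # one image every `interval` images
            sampled.append(frame)
        if len(sampled) == target_count:  # third preset number reached
            break
    return sampled
```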
In one possible implementation, the method further includes: executing a function corresponding to the type of the gesture action when the attribute of the gesture action in the plurality of images is a forward stroke.
In the scheme provided by the application, after the types and attributes of the gesture actions in the plurality of images are recognized through the gesture recognition model, the terminal device executes the function corresponding to the recognized type of gesture action when the attribute is a forward stroke, and performs no processing when the attribute is a return stroke.
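As an illustrative sketch of this behaviour (not a prescribed API), a terminal-side dispatcher might ignore return strokes and act only on forward strokes; the `actions` mapping is hypothetical.

```python
def handle_gesture(gesture_type, is_forward, actions):
    """Execute the function mapped to the gesture type only for forward strokes."""
    if not is_forward:               # return stroke: do not process
        return None
    return actions[gesture_type]()   # function corresponding to the recognized type
```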
In a second aspect, an embodiment of the present application provides a dynamic gesture recognition apparatus, including: a first acquisition unit, configured to acquire a plurality of images containing gesture actions; and a recognition unit, configured to recognize the plurality of images through a gesture recognition model to obtain the types and attributes of gesture actions in the plurality of images, wherein the attributes comprise a forward stroke and a return stroke.
In one possible implementation, the apparatus further includes: the second acquisition unit is used for acquiring a plurality of sample images carrying annotation information, wherein the sample images are a plurality of images containing gesture actions, and the annotation information comprises types and attributes of the gesture actions in the sample images; and the training unit is used for training the initial gesture recognition model according to the plurality of sample images carrying the labeling information to obtain the gesture recognition model.
In one possible implementation, the gesture recognition model includes a spatial feature module, a temporal feature module, and a classification module; the identification unit is specifically configured to: inputting the multiple images into the spatial feature module to obtain first feature data, wherein the first feature data comprises spatial features of gesture actions in the multiple images; inputting the first feature data into the time domain feature module to obtain second feature data, wherein the second feature data comprises time domain features of the first feature data in a time dimension; and inputting the second characteristic data into the classification module to obtain the types and attributes of the gesture actions in the multiple images.
In one possible implementation, the time domain feature module includes a dimension transformation layer, a convolution layer, a BN layer, a ReLu layer, a max pooling layer, and a feature combination layer; the identification unit is configured to input the first feature data into the time domain feature module to obtain second feature data, where the second feature data includes a time domain feature of the first feature data in a time dimension, and specifically configured to: determining first time domain characteristic data corresponding to the first characteristic data in the time dimension through the dimension conversion layer according to the time information of the plurality of images; performing convolution processing on the first time domain characteristic data through the convolution layer to obtain second time domain characteristic data; and sequentially passing the second time domain feature data through the BN layer, the ReLu layer, the maximum pooling layer and the feature combination layer to obtain the second feature data.
In a possible implementation manner, when the identification unit is configured to perform convolution processing on the first time domain feature data through the convolution layer to obtain second time domain feature data, the identification unit is specifically configured to: performing convolution processing on the first time domain feature data by using a first preset number of one-dimensional convolution layers with different convolution kernel sizes to obtain second time domain feature data, wherein the second time domain feature data comprises a first preset number of feature data with different scales; the step of sequentially passing the second time domain feature data through the BN layer, the ReLu layer, the maximum pooling layer, and the feature combining layer to obtain the second feature data includes: and sequentially passing the first preset number of feature data with different scales through the BN layer, the ReLu layer, the maximum pooling layer and the feature combination layer to obtain the second feature data.
In a possible implementation manner, the gesture recognition model further includes a first classifier and a second classifier, and the recognition unit, when configured to input the second feature data into the classification module to obtain the types and attributes of the gesture actions in the plurality of images, is specifically configured to: input the second feature data into the first classifier to obtain a first probability that the gesture actions in the plurality of images belong to each type; classify a first gesture action, which is any one of the gesture actions in the plurality of images, into the type with the highest first probability corresponding to the first gesture action; input the second feature data into the second classifier to obtain a second probability that the gesture actions in the plurality of images belong to each attribute; and classify the first gesture action into the attribute with the highest second probability corresponding to the first gesture action.
In a possible implementation manner, the first obtaining unit is specifically configured to: acquiring a video to be identified; and extracting one image from the video to be recognized at intervals of a second preset number of images to obtain a plurality of images containing gesture actions.
In one possible implementation, the apparatus further includes: an execution unit, configured to execute the function corresponding to the type of the gesture action when the attribute of the gesture action in the plurality of images is a forward stroke.
In a third aspect, an embodiment of the present application provides a computing device, where the computing device includes a processor and a memory, where the memory is configured to store a program, and the processor executes the program stored in the memory, and when the program stored in the memory is executed, the computing device is enabled to implement the first aspect and the dynamic gesture recognition method provided in connection with any one implementation manner of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium for storing computer-executable instructions, which, when invoked by the computer, are configured to cause the computer to implement the first aspect and the dynamic gesture recognition method provided in connection with any one of the implementations of the first aspect.
In a fifth aspect, the present application provides a computer program product, where the computer program product includes instructions, which when executed by a computer, enable the computer to perform the first aspect and the flow of the dynamic gesture recognition method provided in connection with any implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings in the following description show merely some embodiments of the present invention, and other drawings can be obtained by those of ordinary skill in the art from these drawings without creative effort.
FIG. 1 is a scenario of dynamic gesture interaction in an embodiment of the present application;
FIG. 2 is a schematic diagram of an architecture of a dynamic gesture recognition system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a gesture recognition model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a CNN in an embodiment of the present application;
FIG. 5 is a schematic diagram of a chip hardware structure according to an embodiment of the present disclosure;
FIG. 6 is a flowchart illustrating a dynamic gesture recognition method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a spatial signature module in an embodiment of the present application;
FIG. 8 is a diagram illustrating a first feature data extraction according to an embodiment of the present application;
FIG. 9 is a diagram of a time domain feature module in an embodiment of the present application;
FIG. 10 is a diagram illustrating a Reshape conversion operation performed by a dimension transformation layer in an embodiment of the present application;
FIG. 11 is a flowchart illustrating another dynamic gesture recognition method according to an embodiment of the invention;
FIG. 12 is a flowchart illustrating another dynamic gesture recognition method according to an embodiment of the invention;
FIG. 13 is a schematic flowchart of a method for training a gesture recognition model according to an embodiment of the present invention;
FIG. 14 is a diagram illustrating feature extraction in gesture recognition according to an embodiment of the present invention;
FIG. 15 is a diagram illustrating a time domain feature extraction according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of a dynamic gesture recognition apparatus according to an embodiment of the present invention;
FIG. 17 is a schematic structural diagram of a computing device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solution in the present application will be described below with reference to the accompanying drawings.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Artificial intelligence
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
(2) Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the unit may be:

h_{W,b}(x) = f(W^{T} x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
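For concreteness, a tiny numerical sketch of a single neural unit with a sigmoid activation; the input and weight values are chosen arbitrarily for illustration.

```python
import numpy as np

def neural_unit(x, w, b):
    """Output of one neural unit: f(sum_s w_s * x_s + b), with f = sigmoid."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3]), b=0.2))
```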
(3) Deep neural network
Deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers; "many" here has no particular metric. Dividing a DNN by the positions of its layers, the neural network inside the DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is necessarily connected to any neuron in the (i+1)-th layer. Although the DNN appears complex, the work of each layer is actually not complex and is simply the following linear relational expression:

\vec{y} = \alpha(W\vec{x} + \vec{b})

where \vec{x} is the input vector, \vec{y} is the output vector, \vec{b} is the offset vector, W is the weight matrix (also called coefficients), and \alpha() is the activation function. Each layer simply performs this operation on the input vector \vec{x} to obtain the output vector \vec{y}. Because a DNN has many layers, the number of coefficient matrices W and offset vectors \vec{b} is also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^{3}_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^{L}_{jk}. Note that the input layer has no W parameter. In a deep neural network, more hidden layers make the network better able to describe complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final purpose is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
(4) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter and the convolution process may be viewed as convolving an input image or convolved feature plane (feature map) with a trainable filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The underlying principle is: the statistics of a certain part of the image are the same as the other parts. Meaning that image information learned in one part can also be used in another part. The same learned image information can be used for all positions on the image. In the same convolution layer, a plurality of convolution kernels can be used to extract different image information, and generally, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
(5) Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is truly desired to be predicted, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the truly desired target value (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted so that the prediction becomes lower, and the adjustment continues until the deep neural network can predict the truly desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
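As a hedged illustration of this idea, the snippet below computes a cross-entropy loss between an assumed prediction and the target class and backpropagates it; training would repeat this step to reduce the loss.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)  # assumed output for 3 classes
target = torch.tensor([0])                                     # the truly desired class
loss = nn.CrossEntropyLoss()(logits, target)                   # gap between prediction and target
loss.backward()                                                # gradients used to update weights
print(loss.item(), logits.grad)
```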
(6) Gesture return stroke
In dynamic gesture recognition, when a user continuously waves a hand in one direction, the hand must return to the starting point by moving in the opposite direction. This return movement is the gesture return stroke, and it easily causes the terminal device to misjudge the gesture direction.
With the rapid development of computer technology, dynamic gesture recognition has become one of the main human-computer interaction modes. Fig. 1 shows a dynamic gesture interaction scenario in an embodiment of the present application. As shown in fig. 1, in the field of interaction design, a user may use one hand or two hands to perform a contactless operation on a terminal device, and through dynamic gesture recognition the terminal device responds to the user's gesture and executes the related command. Currently, mainstream dynamic gesture recognition methods can be divided into two types. The first method realizes dynamic gesture recognition by combining a neural network with video input: based on multi-image input, it uses a convolutional neural network (CNN) to extract spatial features (features of the images), uses one-dimensional convolution (1D CONV) or a fully connected network (MLP) to extract time-domain features, and finally obtains the dynamic gesture recognition result for the video. This method can complete dynamic gesture recognition end to end (learning the action features during training), but suffers from a serious return-stroke problem. The second method performs static image recognition (detection and tracking, classification, or key point recognition) using a neural network, and estimates the dynamic motion from the combination of the classifications of consecutive frames, for example, the classification of positions and hand shapes. This method can adjust the accuracy of dynamic gesture recognition through classification results and classification thresholds, but the return-stroke problem remains difficult to solve.
In view of the above problems, the present application provides a dynamic gesture recognition method, which may acquire a plurality of images containing gesture actions, recognize the plurality of images through a gesture recognition model to obtain the types and attributes of the gesture actions in the plurality of images, where the attributes include a forward stroke and a return stroke, and then execute the corresponding operation commands according to the obtained types and attributes of the gesture actions. With this method, the return stroke in dynamic gesture recognition can be identified, and the accuracy of gesture action recognition is improved.
The system architecture provided by the embodiments of the present application is described below.
Referring to fig. 2, fig. 2 is a schematic diagram of a dynamic gesture recognition system according to an embodiment of the present disclosure. As shown in fig. 2, the dynamic gesture recognition system architecture 200 may include an execution device 210, a training device 220, a database 230, a user device 240, a data storage system 250, and a data collection device 260.
The data collection device 260 is configured to collect multiple pieces of image data containing gesture actions and store them in the database 230, and the training device 220 trains the gesture recognition model 201 based on the image data maintained in the database 230. The training process may include: the training device 220 inputs the data of the plurality of images into the initial gesture recognition model 221 to obtain the gesture recognition model 201, where the initial gesture recognition model 221 is a deep neural network. The operation of each layer in the deep neural network can be described by the mathematical expression \vec{y} = \alpha(W\vec{x} + \vec{b}). From a physical point of view, the work of each layer in the deep neural network can be understood as performing a transformation from the input space to the output space (that is, from the row space to the column space of a matrix) through five operations on the input space (the set of input vectors). The five operations include: 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by W\vec{x}, operation 4 is performed by +\vec{b}, and operation 5 is implemented by \alpha(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of such things. Here, W is a weight vector, and each value in the vector represents the weight value of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training the deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
Because it is desirable that the output of the deep neural network be as close as possible to the value that is actually desired to be predicted, the weight vector of each layer of the neural network can be updated by comparing the predicted value of the current network with the actually desired target value and adjusting the weight vector according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
The gesture recognition model 201 obtained by training according to the training device 220 may be applied to different systems or devices, for example, the execution device 210 shown in fig. 2, where the execution device 210 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR, a vehicle-mounted terminal, or may be a server or a cloud. The execution device 210 may execute the dynamic gesture recognition method in the embodiment of the present application. In fig. 2, the execution device 210 is configured with an I/O interface 212 for data interaction with an external device, and a user may input data to the I/O interface 212 through the user device 240, where the input data may be multiple pieces of image data including gesture actions in the embodiment of the present application, or may be a request for identifying a dynamic gesture from the execution device 210.
In the process of executing the relevant processing such as calculation by the calculation module 211 of the execution device 210, the execution device 210 may call data, codes, and the like in the data storage system 250 for corresponding processing, and may store data, instructions, and the like obtained by corresponding processing in the data storage system 250.
The calculation module 211 may process the input data of the multiple images including the gesture motion by using the gesture recognition model 201, specifically, first obtain the multiple images including the gesture motion, obtain first feature data through a spatial feature module in the gesture recognition model 201, input the first feature data into a time domain feature module in the gesture recognition model 201 to obtain second feature data, and input the second feature data into a classification module in the gesture recognition model 201 to obtain the type and attribute of the gesture motion in the multiple images.
Finally, the I/O interface 212 returns the processing result, such as the type and attribute of the gesture motion in the plurality of images obtained by the recognition method of the gesture recognition model 201, to the user device 240. The user device 240 may be a terminal, such as a mobile phone terminal, a notebook computer, an AR/VR terminal or a vehicle-mounted terminal, for responding to a corresponding requirement of a terminal user.
In the case shown in fig. 2, the user may manually give input data (e.g., multiple images containing gesture actions in the embodiment of the present application), which may be manipulated through an interface provided by the I/O interface 212. Alternatively, the user device 240 may automatically send the input data to the I/O interface 212, and if the user device 240 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the user device 240. The user may view the recognition results output by the execution device 210 at the user device 240, the recognition results including the type and attributes of the gesture actions in the plurality of images. Upon receiving the recognition result, the user device 240 may convert the recognition result into a corresponding instruction to respond to the dynamic gesture of the user.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210. The data collection device 260 may be an external device separate from the user device 240 or an internal device disposed in the user device 240.
As shown in fig. 2, what is obtained by training according to the training device 220 may be the gesture recognition model 201 in this embodiment, specifically, the gesture recognition model 201 provided in this embodiment may be a neural network model for dynamic gesture recognition.
Referring to fig. 3, fig. 3 is a schematic diagram of a gesture recognition model according to an embodiment of the present application. As shown in fig. 3, the gesture recognition model 300 may include a spatial feature module 301, a temporal feature module 302, and a classification module 303, and the temporal feature module 302 may be disposed after the spatial feature module 301. The spatial feature module 301 in fig. 3 extracts, layer by layer, first feature data from the input multiple images including the gesture motion, where the first feature data includes spatial features characterizing the gesture motion in the multiple images. The input data of the time domain feature module 302 is the first feature data output by the spatial feature module 301 at the previous stage. The time domain feature module 302 processes the first feature data to obtain second feature data. The input data of the classification module 303 is the second feature data output by the time domain feature module 302 located at the upper stage, and the classification module 303 classifies the second feature data to determine the types and attributes of the gesture actions in the multiple images. The output values of the classification module 303 may be passed to two outputs, one of which may be classified using softmax logistic regression (softmax regression) for characterizing the type of gesture action, and the other of which may employ a sigmoid function for characterizing the attributes of the gesture action.
In specific implementation, the gesture recognition model 300 may include a plurality of spatial feature modules and a plurality of time domain feature modules, and the structures of the plurality of spatial feature modules may be the same or different. A single spatial feature module may contain only one neural network layer, e.g., a single spatial feature module contains only one convolutional layer; a single spatial feature module may also include multiple layers of the same or different neural networks, e.g., a convolutional layer and a pooling layer in a single spatial feature module, or multiple different convolutional layers in a single spatial feature module. The gesture recognition model 300 illustrated in fig. 3 is only an example, and in practical applications, the number, the structure, and the position of the spatial feature modules and the number, the structure, and the position of the time domain feature modules included in the gesture recognition model 300 may be set according to actual requirements, which is not limited in the embodiment of the present application.
In this embodiment, the spatial feature module 301 may be a CNN architecture.
Referring to fig. 4, fig. 4 is a schematic diagram of a CNN according to an embodiment of the present disclosure. As described in the introduction of the basic concept, CNN is a deep neural network with a convolution structure, and is a deep learning (deep learning) architecture, which refers to learning at multiple levels at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to multiple images containing gesture actions input into it.
As shown in fig. 4, Convolutional Neural Network (CNN)400 may include an input layer 410, a convolutional/pooling layer 420 (where the pooling layer is optional), and a neural network layer 430.
Convolutional/pooling layers 420:
and (3) rolling layers:
the convolutional/pooling layer 420 shown in fig. 4 may include layers such as examples 421 and 426, for example: in one implementation, 421 layers are convolutional layers, 422 layers are pooling layers, 423 layers are convolutional layers, 424 layers are pooling layers, 425 are convolutional layers, 426 are pooling layers; in another implementation, 421, 422 are convolutional layers, 423 are pooling layers, 424, 425 are convolutional layers, and 426 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
The inner working principle of one convolution layer will be described below by taking convolution layer 421 as an example.
Convolution layer 421 may include a plurality of convolution operators, also called kernels, whose role in image processing is equivalent to a filter for extracting specific information from the input image matrix, and the convolution operator may be essentially a weight matrix, which is usually predefined, and during the convolution operation on the image, the weight matrix is usually processed on the input image pixel by pixel (or two pixels by two pixels … …, depending on the value of step size stride) in the horizontal direction, so as to complete the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image, and it should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix will produce a single depth dimension of the convolved output, but in most cases not a single weight matrix is used, but a plurality of weight matrices of the same size (row by column), i.e. a plurality of matrices of the same type, are applied. The outputs of each weight matrix are stacked to form the depth dimension of the convolved image, where the dimension is understood to be determined by "plurality" as described above. Different weight matrices may be used to extract different features in the image, e.g., one weight matrix to extract image edge information, another weight matrix to extract a particular color of the image, yet another weight matrix to blur unwanted noise in the image, etc. The plurality of weight matrices have the same size (row × column), the feature maps extracted by the plurality of weight matrices having the same size also have the same size, and the extracted feature maps having the same size are combined to form the output of the convolution operation. The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 400 can make correct prediction.
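A small illustrative example of this stacking behaviour (the channel counts and image size are assumptions, not values from the patent): 16 weight matrices convolved over a 3-channel input produce an output whose depth dimension is 16.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
image = torch.randn(1, 3, 224, 224)  # assumed input image size
feature_map = conv(image)            # 16 kernels -> 16 stacked feature maps
print(feature_map.shape)             # torch.Size([1, 16, 224, 224])
```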
When convolutional neural network 400 has multiple convolutional layers, the initial convolutional layer (e.g., 421) tends to extract more general features, which may also be referred to as low-level features; as the depth of convolutional neural network 400 increases, the more convolutional layers (e.g., 426) later extract more complex features, such as features with high levels of semantics, the more highly semantic features are suitable for the problem to be solved.
A pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. In the layers 421 to 426 illustrated by 420 in fig. 4, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator may calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling. The max pooling operator may take the pixel with the largest value within a specific range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
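A short sketch of the size reduction described above, assuming a 2x2 pooling window and an arbitrary feature-map size:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 224, 224)              # assumed feature map
print(nn.MaxPool2d(kernel_size=2)(x).shape)   # torch.Size([1, 16, 112, 112])
print(nn.AvgPool2d(kernel_size=2)(x).shape)   # torch.Size([1, 16, 112, 112])
```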
The neural network layer 430:
after processing by convolutional layer/pooling layer 420, convolutional neural network 400 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 520 only extracts features and reduces the parameters brought about by the input image. However, to generate the final output information (class information required or other relevant information), convolutional neural network 400 needs to utilize neural network layer 430 to generate one or a set of the number of required classes of output. Therefore, a plurality of hidden layers (431, 432 to 43n shown in fig. 4) and an output layer 440 may be included in the neural network layer 430, and parameters included in the hidden layers may be pre-trained according to the relevant training data of a specific task type.
After the multiple hidden layers in the neural network layer 430, that is, as the last layer of the whole convolutional neural network 400, comes the output layer 440. The output layer 440 has a loss function similar to the categorical cross entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 400 (i.e., the propagation from 410 to 440 in fig. 4) is completed, the backward propagation (i.e., the propagation from 440 to 410 in fig. 4) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 400, that is, the error between the result output by the convolutional neural network 400 through the output layer and the ideal result.
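To make the forward/backward cycle concrete, a minimal sketch on an assumed toy CNN (the architecture, optimizer, and data are illustrative placeholders, not the network of fig. 4):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 5))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(4, 3, 64, 64)
labels = torch.randint(0, 5, (4,))
loss = nn.CrossEntropyLoss()(model(images), labels)  # forward propagation + prediction error
optimizer.zero_grad()
loss.backward()                                      # backward propagation of the error
optimizer.step()                                     # update weights and biases to reduce loss
```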
It should be noted that the convolutional neural network 400 shown in fig. 4 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
A hardware structure of a chip provided in an embodiment of the present application is described below.
Referring to fig. 5, fig. 5 is a schematic diagram of a chip hardware structure according to an embodiment of the present disclosure. As shown in fig. 5, the chip includes a neural network processor 50. The chip may be disposed in the execution device 210 shown in fig. 2 to complete the calculation work of the calculation module 211. The chip may also be disposed in the training apparatus 220 as shown in fig. 2 to complete the training work of the training apparatus 220 and output the target model 201. The algorithm of each module in the gesture recognition model shown in fig. 3 can be implemented in a chip shown in fig. 5.
The neural network processor 50 may be any processor suitable for large-scale exclusive-or operation processing, such as an NPU, a TPU, or a GPU. Taking the NPU as an example: the NPU may be mounted as a coprocessor on a main CPU (host CPU), and the host CPU assigns tasks to it. The core part of the NPU is an arithmetic circuit 503; a controller 504 controls the arithmetic circuit 503 to extract matrix data from the memories (501 and 502) and perform multiply-add operations.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 503 fetches the weight data of matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit 503. The arithmetic circuit 503 then takes the input data of matrix A from the input memory 501, performs a matrix operation based on the input data of matrix A and the weight data of matrix B, and stores the partial result or the final result of the obtained matrix in an accumulator 508.
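To make the accumulation step concrete, here is an illustrative software model (plain Python with NumPy, not hardware code) of how partial matrix products can accumulate into C = A × B; the matrix sizes and the column-by-column streaming order are assumptions for illustration.

```python
# Software model of accumulating partial products, mirroring C = A @ B.
import numpy as np

A = np.random.rand(4, 6)   # input matrix A (conceptually from input memory 501)
B = np.random.rand(6, 3)   # weight matrix B (conceptually from weight memory 502)

accumulator = np.zeros((4, 3))                       # plays the role of accumulator 508
for k in range(A.shape[1]):                          # stream one column of A / row of B at a time
    accumulator += np.outer(A[:, k], B[k, :])        # accumulate the partial product

assert np.allclose(accumulator, A @ B)               # final result equals the full matrix product
```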
The unified memory 506 is used to store input data as well as output data. The weight data is directly transferred to the weight memory 502 by a Direct Memory Access Controller (DMAC) 505. The input data is also carried through the DMAC into the unified memory 506.
A bus interface unit (BIU) 510 is used for interaction between the DMAC and the instruction fetch buffer 509. The bus interface unit 510 is also used by the instruction fetch buffer 509 to obtain instructions from the external memory, and is further used by the direct memory access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 506, or transfer weight data to the weight memory 502, or transfer input data to the input memory 501.
The vector calculation unit 507 includes a plurality of operation processing units and, if necessary, further processes the output of the arithmetic circuit 503, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. The vector calculation unit 507 is mainly used for the calculation of non-convolutional layers or fully connected layers (FC) in the neural network, and may specifically process pooling, normalization, and the like. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both.
In some implementations, the vector calculation unit 507 stores the processed vectors in the unified memory 506. In some implementations, the vectors processed by the vector calculation unit 507 can be used as activation inputs to the arithmetic circuit 503, for example for use in subsequent layers of the neural network. As shown in fig. 4, if the current processing layer is the hidden layer 1 (431), the vectors processed by the vector calculation unit 507 can also be used for calculation in the hidden layer 2 (432).
An instruction fetch buffer 509 is connected to the controller 504 and is used to store instructions used by the controller 504. The unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip memories. The external memory is independent of the NPU hardware architecture.
The operations of the modules in the gesture recognition model shown in fig. 3 may be performed by the operation circuit 503 or the vector calculation unit 507.
A dynamic gesture recognition method based on the system architecture of fig. 2 is provided below. Referring to fig. 6, fig. 6 is a flowchart illustrating a dynamic gesture recognition method according to an embodiment of the present invention. As shown in fig. 6, the dynamic gesture recognition method may include the following steps.
601. A plurality of images containing gesture actions are acquired.
In an embodiment, when a gesture motion needs to be recognized, a plurality of images containing the gesture motion can be acquired. The plurality of images including the gesture motion may include only the hand object, or may include both the hand object and its background.
In a specific implementation, a video to be recognized may be obtained first, and one image may be extracted from the video to be recognized at intervals of a second preset number of images, so as to obtain a plurality of images including gesture actions. In one implementation, when the number of extracted images reaches a third preset number, the extracted third preset number of images are determined as the plurality of images including the gesture motion. For example, if the second preset number is 14 and the third preset number is 8, an image may be extracted every 14 frames starting from the 1st frame of the video to be recognized, finally yielding a first group of 8 images composed of the 1st, 15th, 29th, 43rd, 57th, 71st, 85th and 99th frames. Extraction of one image every 14 frames may then continue to obtain a second group of 8 images.
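A small sketch of this sampling rule follows, assuming the second preset number is 14 (the interval) and the third preset number is 8 (the group size); the function name and the 0-based indexing are assumptions for illustration.

```python
# Frame-index sampling sketch: one frame every `interval` frames, grouped by `group_size`.
def sample_frame_indices(total_frames, interval=14, group_size=8):
    indices = list(range(0, total_frames, interval))        # 0-based: frames 1, 15, 29, ... in 1-based numbering
    groups = [indices[i:i + group_size]
              for i in range(0, len(indices), group_size)]
    return [g for g in groups if len(g) == group_size]      # keep only complete groups

print(sample_frame_indices(250))
# [[0, 14, 28, 42, 56, 70, 84, 98], [112, 126, 140, 154, 168, 182, 196, 210]]
# i.e. frames 1, 15, ..., 99 and 113, ..., 211 in 1-based numbering
```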
602. And identifying the plurality of images through the gesture identification model to obtain the types and attributes of gesture actions in the plurality of images.
After the plurality of images containing the gesture actions are acquired, the plurality of images can be identified through the gesture identification model, and the types and attributes of the gesture actions in the plurality of images are obtained. The gesture recognition model can comprise a spatial feature module, a time domain feature module, a classification module, a first classifier and a second classifier.
In a specific implementation, the plurality of images input into the gesture recognition model may be a group of continuous images, or a plurality of discontinuous images captured from a section of video to be recognized and arranged in time order. The plurality of images are essentially a four-dimensional matrix (B × T, C, H, W), where B is the batch size, i.e., the number of image groups that the gesture recognition model can process at one time, T is the image length, i.e., the number of images included in the plurality of images, C is the number of channels of the images, H is the height of the images, and W is the width of the images; the images referred to here are video frames. Take as an example input information with a batch size B of 2, an image length T of 8, an RGB channel number C of 3, a height H of 224, and a width W of 224; that is, the plurality of images input into the gesture recognition model are a four-dimensional matrix (2 × 8, 3, 224, 224). B may be set to 1 if the gesture recognition model processes only one group of images at a time, i.e., the gesture recognition model processes the T images of one group at a time.
In a specific implementation, a plurality of images including gesture actions may be input into the spatial feature module to obtain first feature data, where the first feature data includes spatial features of the gesture actions in the plurality of images. In one implementation, please refer to fig. 7, which is a schematic diagram of a spatial feature module according to an embodiment of the present application. As shown in fig. 7, the spatial feature module 700 may include an input layer 710, a convolutional layer/pooling layer 720 (where the pooling layer is optional), an attention mechanism 730, and a neural network layer 740. In the network structure design of the spatial feature module 700, the backbone network may adopt a CNN architecture; the spatial feature module 700 is a lightweight network based on the CNN architecture with an attention mechanism added, and batch normalization (BatchNorm) is used in place of L2Norm, so that a better effect can be obtained. The attention mechanism 730 may be added after the convolutional layer/pooling layer 720 (where the pooling layer is optional).
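The following is a hedged sketch, assuming PyTorch, of a lightweight convolutional block followed by batch normalization and a simple channel-attention gate; the exact backbone, attention form, and channel counts of spatial feature module 700 are not specified by the text, so the choices below are illustrative assumptions.

```python
# Sketch of one spatial feature block: conv + BatchNorm + ReLU + channel attention.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate: re-weights the channels of a feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))              # global average pool -> per-channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)     # scale each channel

class SpatialFeatureBlock(nn.Module):
    """Convolution + BatchNorm (in place of L2Norm) + ReLU + attention added after the block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()
        self.attn = ChannelAttention(out_ch)

    def forward(self, x):
        return self.attn(self.act(self.bn(self.conv(x))))

x = torch.randn(16, 3, 224, 224)                     # (B*T, C, H, W) with B=2, T=8
features = SpatialFeatureBlock(3, 64)(x)
print(features.shape)                                # torch.Size([16, 64, 112, 112])
```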
Referring to fig. 8, fig. 8 is a schematic diagram illustrating first feature data extraction according to an embodiment of the present application. As shown in (1) of fig. 8, this embodiment adopts a multi-image input scheme with parameter sharing, that is, spatial features are extracted using the same parameters for each of the blocks 0 to 3. The blocks are feature blocks formed after the convolutional layer in the spatial feature module performs spatial feature extraction on a local area of the image, and an attention mechanism can be added after the blocks to enhance the spatial feature extraction of the local area. In this embodiment, an attention mechanism is added after each block. Specifically, as shown in (2) of fig. 8, when spatial feature extraction is performed on image 0, for example, features of a plurality of local regions can be obtained, and spatial feature 0 can be obtained by adding the features of the plurality of local regions. Similarly, spatial feature 1 can be obtained by performing spatial feature extraction on image 1, ..., and spatial feature n can be obtained by performing spatial feature extraction on image n.
In this embodiment of the application, the first feature data corresponding to each image includes a plurality of two-dimensional matrices (H, W), each of which is a feature map, and the number of feature maps included in the first feature data is equal to the number of corresponding channels. For example, if the dimensions of the data output by the spatial feature module are (16, 64, 112, 112), the first feature data corresponding to one image includes 64 feature maps, and the size of each feature map is 112 × 112. It should be noted that the dimensions and sizes of the first feature data corresponding to each image output by the same module are the same. Similarly, the second feature data corresponding to each image also includes a plurality of feature maps.
After the first feature data is obtained, the first feature data may be input into the time domain feature module to obtain second feature data, where the second feature data includes time domain features of the first feature data in the time dimension. In one implementation, the time domain feature module may be of a CNN architecture. Referring to fig. 9, fig. 9 is a schematic diagram of a time domain feature module in an embodiment of the present application. As shown in fig. 9, the time domain feature module 900 may include a dimension conversion layer 901, a convolutional layer 902, a batch normalization layer 903, an activation function layer 904, a max-pooling layer 905, and a feature combination layer 906. Specifically, the first time domain feature data corresponding to the first feature data in the time dimension is determined by the dimension conversion layer 901 according to the time information of the plurality of images. Referring to fig. 10, fig. 10 is a schematic diagram of the dimension conversion layer performing a Reshape operation according to an embodiment of the present application. Reshape is a function that can readjust the number of rows, columns, and dimensions of a matrix. As shown in fig. 10, the dimension conversion layer 901 may perform dimension conversion on the first feature data (B × T, C, H, W) output by the preceding spatial feature module, that is, merge the spatial dimensions (H, W) of the first feature data (B × T, C, H, W) into the batch dimension B and separate out the time dimension T, obtaining a three-dimensional matrix (B × H × W, C, T). The first time domain feature data is formed by arranging, in time order, the pixel points of the first feature data (C, H, W) of the plurality of images that have the same H, the same W, and the same C; each piece of first time domain feature data contains T data and is a one-dimensional vector formed by the T data. For example, when B is 1, T is 8, C is 64, H is 56, and W is 56, the Reshape operation yields 1 × 56 × 56 × 64 pieces of first time domain feature data, each containing 8 data.
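A minimal sketch of this Reshape step follows, assuming PyTorch: the spatial dimensions are merged into the batch dimension and the time dimension is separated out, turning (B × T, C, H, W) into (B × H × W, C, T); the concrete sizes are the ones used in the example above.

```python
# Reshape sketch: (B*T, C, H, W) -> (B*H*W, C, T), one length-T vector per (pixel, channel).
import torch

B, T, C, H, W = 1, 8, 64, 56, 56
first_feature_data = torch.randn(B * T, C, H, W)

x = first_feature_data.view(B, T, C, H, W)           # recover the separate B and T axes
x = x.permute(0, 3, 4, 2, 1)                          # (B, H, W, C, T)
first_time_domain = x.reshape(B * H * W, C, T)        # 1*56*56 = 3136 positions, 64 channels, 8 time steps

print(first_time_domain.shape)                        # torch.Size([3136, 64, 8])
```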
After the first time domain feature data is obtained, the convolutional layer 902 performs a convolution operation on the first time domain feature data to obtain second time domain feature data. Specifically, the convolutional layer 902 performs convolution processing on the first time domain feature data output by the dimension conversion layer 901. The convolutional layer 902 may include a first preset number of one-dimensional convolutional layers with different convolution kernels; these one-dimensional convolutional layers are respectively used to perform convolution processing on the first time domain feature data output by the dimension conversion layer 901, so as to obtain a first preset number of pieces of second time domain feature data of different scales corresponding to the first time domain feature data. The second time domain feature data sequentially passes through the BN layer 903, the ReLu layer 904, and the max-pooling layer 905, and the feature combination layer 906 then fuses the first preset number of pieces of second time domain feature data of different scales to obtain the corresponding second feature data. Specifically, the feature combination layer 906 may add the first preset number of pieces of second time domain feature data of different scales to obtain the corresponding second feature data. Because the convolutional layer 902 is provided with a plurality of one-dimensional convolutional layers with different convolution kernel sizes, time sequence features of different scales can be extracted from the same first time domain feature data; the feature combination layer 906 fuses the time sequence features of different scales to obtain the second feature data, so that the time sequence features of the gesture action are well preserved.
The BN layer 903, like the convolutional layer 902, is a network layer; it is used to accelerate training and improve the generalization capability of the network. The BN layer is essentially a normalization layer and may replace a local response normalization (LRN) layer. The ReLu layer 904 is used to increase the non-linearity between the neural network layers and alleviate the problem of vanishing gradients. The max pooling operator in the max-pooling layer 905 may take the pixel with the largest value within a specific range as the result of max pooling, and may sample the time domain feature data output by the convolutional layer 902 to obtain data of a smaller size.
In one implementation, the convolutional layer 902 may include 4 one-dimensional convolutional layers with different convolution kernels, the kernel sizes being k = 3, k = 4, k = 5, and k = 6. The 4 one-dimensional convolutional layers are respectively used to perform convolution processing on the first time domain feature data to obtain 4 pieces of feature data of different scales corresponding to the first time domain feature data; after these pass through the BN layer 903, the ReLu layer 904, and the max-pooling layer 905, the feature combination layer 906 adds the 4 pieces of feature data of different scales to obtain the second feature data corresponding to the first time domain feature data.
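Below is a hedged sketch, assuming PyTorch, of a temporal module with four one-dimensional convolutions of kernel sizes 3, 4, 5 and 6, each followed by BatchNorm, ReLU and max pooling, with the four branch outputs summed; the pooling size and the absence of padding are assumptions made so that the branches can be added element-wise.

```python
# Multi-scale 1-D temporal convolution sketch (illustrative, not the patent's exact module 900).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFeatureModule(nn.Module):
    def __init__(self, channels=64, kernel_sizes=(3, 4, 5, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(channels, channels, k),   # one-dimensional convolution, kernel size k
                          nn.BatchNorm1d(channels),           # BN layer
                          nn.ReLU())                          # ReLu layer
            for k in kernel_sizes
        ])

    def forward(self, x):                        # x: (B*H*W, C, T)
        outs = []
        for branch in self.branches:
            y = branch(x)                        # conv -> BN -> ReLU
            y = F.adaptive_max_pool1d(y, 1)      # max pooling over the remaining time axis
            outs.append(y)
        return sum(outs)                         # feature combination layer: element-wise addition

x = torch.randn(3136, 64, 8)                     # first time domain feature data from the Reshape step
print(TemporalFeatureModule()(x).shape)          # torch.Size([3136, 64, 1])
```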
In one implementation, please refer to fig. 11, which is a schematic diagram of a data assembling method according to an embodiment of the present application. The first feature data and the second feature data may be assembled as shown in fig. 11 and then input into the feature combination layer 906 for data fusion.
After the second feature data is obtained, the second feature data may be input into the first classifier to obtain a first probability that the gesture motion in the plurality of images belongs to each type, and a first gesture motion is classified into the type with the highest corresponding first probability, where the first gesture motion is any one of the gesture motions in the plurality of images. The second feature data is also input into the second classifier to obtain a second probability that the gesture motion in the plurality of images belongs to each attribute, and the first gesture motion is classified into the attribute with the highest corresponding second probability. The output values of the first classifier and the second classifier may be passed to two outputs: one may employ softmax logistic regression (softmax regression) for the classification characterizing the type, and the other may employ a sigmoid function for the classification characterizing the attribute.
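A small sketch of the two output heads follows: a softmax head for the gesture type and a sigmoid head for the forward/backward-stroke attribute. The feature size, the number of types, and the 0.5 threshold are illustrative assumptions; PyTorch is assumed.

```python
# Dual-head classification sketch: softmax for type, sigmoid for attribute.
import torch
import torch.nn as nn

num_types, feat_dim = 6, 64                      # illustrative values only
type_head = nn.Linear(feat_dim, num_types)       # first classifier
attr_head = nn.Linear(feat_dim, 1)               # second classifier

second_feature = torch.randn(2, feat_dim)        # one feature vector per image group

type_prob = torch.softmax(type_head(second_feature), dim=1)   # first probabilities over the types
attr_prob = torch.sigmoid(attr_head(second_feature))          # second probability of the attribute

predicted_type = type_prob.argmax(dim=1)                      # type with the highest first probability
predicted_attr = (attr_prob > 0.5).long()                     # e.g. 1 = forward stroke, 0 = backward stroke
```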
In the dynamic gesture recognition method described in fig. 6, a plurality of images including a gesture motion are obtained, and the plurality of images are recognized through the spatial feature module 301, the time domain feature module 302, and the classification module 303 in the gesture recognition model 300 shown in fig. 3, so as to obtain the type and the attribute of the gesture motion in the plurality of images. First feature data including spatial features of the gesture actions in the plurality of images is obtained through the spatial feature module 301, and second feature data including time domain features of the first feature data in the time dimension is obtained through the time domain feature module 302. The type and attribute of the gesture motion in the plurality of images are then obtained through the classification module 303. Therefore, based on the gesture recognition model 300 and the dynamic gesture recognition method, more comprehensive feature information of the gesture actions can be acquired from the plurality of images, and the recognition accuracy of gesture actions is improved.
Another dynamic gesture recognition method based on the system architecture of fig. 2 is provided below. Referring to fig. 12, fig. 12 is a flowchart illustrating another dynamic gesture recognition method according to an embodiment of the present invention. As shown in fig. 12, the dynamic gesture recognition method may include the following steps.
1201. A plurality of images containing gesture actions are acquired.
Step 1201 is the same as step 601, and please refer to step 601 for detailed description, which is not repeated herein.
1202. And identifying the plurality of images through the gesture identification model to obtain the types and attributes of gesture actions in the plurality of images.
Step 1202 is the same as step 602, and please refer to step 602 for detailed description, which is not repeated herein.
1203. And executing the function corresponding to the type of the gesture action when the attribute of the gesture action in the plurality of images is the forward stroke.
After the types and attributes of the gesture actions in the plurality of images are obtained, the function corresponding to the type of the gesture action may be executed when the attribute of the gesture action in the plurality of images is the forward stroke. For example, if the type and attribute of the gesture motion are leftward and forward stroke respectively, a leftward command is executed in response to the gesture motion; if the type and attribute of the gesture motion are leftward and backward stroke respectively, no command is executed.
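An illustrative dispatch rule for step 1203 is sketched below: a recognized gesture only triggers its function when the attribute is the forward stroke. The command table, attribute strings, and function name are hypothetical.

```python
# Hypothetical command dispatch based on recognized type and attribute.
def handle_gesture(gesture_type: str, attribute: str) -> str:
    commands = {"left": "page_left", "right": "page_right", "up": "scroll_up"}  # hypothetical table
    if attribute != "forward":        # backward stroke: treat as the hand returning, do nothing
        return "no_op"
    return commands.get(gesture_type, "no_op")

print(handle_gesture("left", "forward"))    # page_left
print(handle_gesture("left", "backward"))   # no_op
```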
In this embodiment of the application, the gesture recognition model may be an AI model, and the initial gesture recognition model needs to be trained before the gesture recognition model is used for recognition, so that the trained gesture recognition model has the capability of recognizing the types and attributes of gesture actions in a plurality of images. The gesture recognition model in the present application may have the capability of determining the type and the attribute (forward stroke or backward stroke) of a gesture action. A gesture recognition model training method based on the system architecture of fig. 2 is provided below. Referring to fig. 13, fig. 13 is a schematic flowchart illustrating a method for training a gesture recognition model according to an embodiment of the present invention. As shown in fig. 13, the gesture recognition model training method may include the following steps.
1301. And acquiring a plurality of sample images carrying the labeling information.
In the process of training the initial gesture recognition model, dedicated training data needs to be used. Based on an analysis of the model capability requirements, sample images carrying annotation information need to be used for training; gesture actions are recorded in the sample images, and the annotation information may include the type and the attribute of the gesture action in the sample images. The type information of the gesture motion is used to indicate the type of the gesture motion, for example, leftward. The attribute information is used to indicate the attribute of the gesture action in the plurality of images, which may be the forward stroke or the backward stroke; for example, the gesture action is the forward stroke in images 0 to 7 and the backward stroke in images 8 to 15. The annotation information may be saved in a file format such as extensible markup language (XML) or JavaScript object notation (JSON).
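The following is a hypothetical example of annotation information saved as JSON, with one record per sample image group; the field names and values are assumptions, since the text does not prescribe a schema.

```python
# Hypothetical JSON annotation record for one group of sample images.
import json

annotation = {
    "sample": "sample_0001",                    # hypothetical sample identifier
    "frames": [1, 15, 29, 43, 57, 71, 85, 99],  # frames extracted from the video
    "gesture_type": "left",                     # type of the gesture action
    "gesture_attribute": "forward",             # forward stroke or backward stroke
}
print(json.dumps(annotation, indent=2))
```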
The annotation information including the type and the attribute of the gesture action can be obtained by detecting the sample images with a gesture action detection algorithm to obtain the type information and the attribute information of the gesture actions recorded in the sample images, or can be obtained by manual annotation.
1302. And training the initial gesture recognition model according to the plurality of sample images carrying the labeling information to obtain the gesture recognition model.
After a plurality of sample images carrying annotation information are obtained, the plurality of sample images carrying annotation information form a training set, and model training is performed using the training samples in the training set; an initial gesture recognition model is determined first.
The initial gesture recognition model may include a spatial feature module, a temporal feature module, a loss function calculation module, and a classification module.
First, the parameters of the initial gesture recognition model are initialized. A sample image is then input into the spatial feature module of the initial gesture recognition model, and spatial feature extraction is performed on the input sample image to obtain abstract features. Hand features and hand key point features of the gesture motion can be detected by the spatial feature module. Time domain features corresponding to the hand key point features over time are then obtained through the time domain feature module. Referring to fig. 14, fig. 14 is a schematic diagram illustrating feature extraction in gesture recognition according to an embodiment of the present invention. As shown in fig. 14, for a sample image, the spatial feature module can detect the position of the hand in the image, mark it with a rectangular frame, and perform key point detection on the hand within the rectangular frame. The time domain features of the key points corresponding to the time information of the plurality of images can be obtained through the time domain feature module, and the time domain features of the key points in the preceding and following images can be obtained by a difference method. Specifically, please refer to fig. 15, which is a schematic diagram illustrating time domain feature extraction according to an embodiment of the present invention. As shown in fig. 15, the hand key points may include fingertip points and phalangeal joint points, from which displacement information and speed information of the hand key points can be extracted. The displacement of a hand key point relative to the image may be represented by Sx and Sy, where Sx denotes the displacement in the x direction and Sy denotes the displacement in the y direction. When the frame rates of the preceding and following image acquisitions are the same, Sx and Sy can also represent the moving speed of the hand key point. Extensive data experience shows that gesture actions move at different speeds depending on whether the attribute is the forward stroke or the backward stroke, so the moving speed of the hand key points can also be extracted when the frame rates of the preceding and following image acquisitions are different. For different key points, a plurality of pieces of displacement information and speed information can be extracted; combining the displacement information and speed information of a plurality of different hand key points into one vector yields a vector F = [f1, f2, f3, …, fn].
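A minimal sketch of extracting key point displacement and speed from two consecutive images and packing them into a feature vector F follows; the key point coordinates, the frame interval, and the vector layout are illustrative assumptions.

```python
# Displacement/speed feature sketch for hand key points across two frames.
import numpy as np

prev_keypoints = np.array([[100, 120], [110, 125], [95, 130]], dtype=float)  # (x, y) per key point, illustrative
curr_keypoints = np.array([[112, 118], [123, 122], [108, 128]], dtype=float)
dt = 1.0 / 30.0                                  # frame interval; matters when frame rates differ

displacement = curr_keypoints - prev_keypoints   # Sx, Sy per key point
speed = displacement / dt                        # moving speed per key point

F = np.concatenate([displacement.ravel(), speed.ravel()])   # feature vector F = [f1, f2, ..., fn]
print(F.shape)                                               # (12,)
```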
The features extracted by the spatial feature module and the time domain feature module are detected and recognized, the type and the attribute of the gesture action in the input sample image are predicted, and the prediction is output to the loss function calculation module. The annotation information corresponding to the sample image is also input into the loss function calculation module, which compares the predicted result with the annotation information corresponding to the sample image and calculates a loss function. With the loss function as the objective function, optimization algorithms such as back propagation (BP), gradient descent (GD), or stochastic gradient descent (SGD) are used to update and adjust the weight parameters in the initial gesture recognition model and the classifiers. Sample images carrying annotation information are input in turn, and the training process is iterated continuously until the probability that the prediction obtained based on the initial gesture recognition model and the classifiers is consistent with the annotation information corresponding to the sample images reaches an expected value, which indicates that a gesture recognition model meeting the requirements has been obtained. Training then ends, and the resulting gesture recognition model has the function of recognizing the types and attributes of gesture actions in a plurality of images and can be used for dynamic gesture recognition.
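The following is a hedged sketch of such a training loop: a forward pass through a stand-in model with two heads, a combined loss for type and attribute, and SGD updates via back propagation. The model architecture, input sizes, loss weighting, and number of iterations are placeholders, not the patent's concrete training procedure.

```python
# Training-loop sketch with a type head and an attribute head (all sizes illustrative).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 3 * 32 * 32, 64), nn.ReLU())  # stand-in feature extractor
type_head, attr_head = nn.Linear(64, 6), nn.Linear(64, 1)
params = list(model.parameters()) + list(type_head.parameters()) + list(attr_head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)                     # gradient descent optimizer
type_loss_fn, attr_loss_fn = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()

# One hypothetical batch: 4 samples of 8 frames at 3x32x32, with annotated labels.
images = torch.randn(4, 8, 3, 32, 32)
type_labels = torch.randint(0, 6, (4,))
attr_labels = torch.randint(0, 2, (4, 1)).float()                # 1 = forward stroke, 0 = backward stroke

for _ in range(5):                                               # iterate until the loss is acceptable
    feats = model(images)
    loss = type_loss_fn(type_head(feats), type_labels) + attr_loss_fn(attr_head(feats), attr_labels)
    optimizer.zero_grad()
    loss.backward()                                              # back propagation
    optimizer.step()                                             # update weight parameters
```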
Referring to fig. 16, fig. 16 is a schematic structural diagram of a dynamic gesture recognition apparatus according to an embodiment of the present invention. As shown in fig. 16, the dynamic gesture recognition apparatus 1600 may include:
a first acquiring unit 1601 configured to acquire a plurality of images including gesture motions;
the recognition unit 1602 is configured to recognize the multiple images through the gesture recognition model to obtain types and attributes of gesture actions in the multiple images, where the attributes include a forward stroke and a backward stroke.
In an alternative implementation, the apparatus 1600 may further include:
a second obtaining unit 1603, configured to obtain a plurality of sample images carrying annotation information, where the sample images are a plurality of images including gesture actions, and the annotation information includes types and attributes of the gesture actions in the sample images;
the training unit 1604 is configured to train the initial gesture recognition model according to the plurality of sample images carrying the labeling information to obtain a gesture recognition model.
In an optional implementation manner, the gesture recognition model comprises a spatial feature module, a time domain feature module and a classification module;
the identifying unit 1602 is specifically configured to:
inputting the multiple images into a spatial feature module to obtain first feature data, wherein the first feature data comprises spatial features of gesture actions in the multiple images;
inputting the first characteristic data into a time domain characteristic module to obtain second characteristic data, wherein the second characteristic data comprises time domain characteristics of the first characteristic data in a time dimension;
and inputting the second characteristic data into the classification module to obtain the types and attributes of the gesture actions in the multiple images.
In an optional implementation manner, the time domain feature module includes a dimension conversion layer, a convolution layer, a BN layer, a ReLu layer, a maximum pooling layer, and a feature combination layer;
the identifying unit 1602 is configured to input the first feature data into the time domain feature module to obtain second feature data, where the second feature data includes a time domain feature of the first feature data in a time dimension, and is specifically configured to:
determining first time domain characteristic data corresponding to the first characteristic data in the time dimension through a dimension conversion layer according to the time information of the plurality of images;
performing convolution processing on the first time domain characteristic data through the convolution layer to obtain second time domain characteristic data;
and sequentially passing the second time domain feature data through the BN layer, the ReLu layer, the maximum pooling layer and the feature combination layer to obtain second feature data.
In an optional implementation manner, the identifying unit 1602 is configured to, when performing convolution processing on the first time domain feature data by using a convolution layer to obtain second time domain feature data, specifically configured to:
performing convolution processing on the first time domain characteristic data by using a first preset number of one-dimensional convolution layers with different convolution kernel sizes to obtain second time domain characteristic data, wherein the second time domain characteristic data comprises a first preset number of characteristic data with different scales;
sequentially passing the second time domain feature data through a BN layer, a ReLu layer, a maximum pooling layer and a feature combination layer to obtain second feature data comprises the following steps:
and sequentially passing the first preset number of feature data with different scales through a BN layer, a ReLu layer, a maximum pooling layer and a feature combination layer to obtain second feature data.
In an optional implementation manner, the gesture recognition model further includes a first classifier and a second classifier, and the recognition unit 1602 is configured to input the second feature data into the classification module, and when obtaining the types and attributes of the gesture actions in the multiple images, is specifically configured to:
inputting the second characteristic data into the first classifier to obtain a first probability that the gesture actions in the multiple images belong to each type;
classifying the first gesture motion into a type with the highest first probability corresponding to the first gesture motion, wherein the first gesture motion is any one gesture motion in the gesture motions in the multiple images;
inputting the second characteristic data into a second classifier to obtain a second probability that the gesture actions in the multiple images belong to each attribute;
and classifying the first gesture action into the attribute with the maximum second probability corresponding to the first gesture action.
In an optional implementation manner, the first obtaining unit 1601 is specifically configured to:
acquiring a video to be identified;
and extracting one image from the video to be recognized at intervals of a second preset number of images to obtain a plurality of images containing gesture actions.
In an optional implementation, the apparatus 1600 may further include: an executing unit 1605, configured to execute the function corresponding to the type of the gesture motion when the attribute of the gesture motion in the plurality of images is the forward stroke.
Referring to fig. 17, fig. 17 is a schematic structural diagram of a computing device according to an embodiment of the present invention. As shown in fig. 17, the computing device 1700 may include: memory 1701, processor 1702, communication interface 1703, and bus 1704. The memory 1701, the processor 1702, and the communication interface 1703 are communicatively connected to each other via the bus 1704.
The memory 1701 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1701 may store a program; when the program stored in the memory 1701 is executed by the processor 1702, the processor 1702 and the communication interface 1703 are used to perform the steps of the dynamic gesture recognition method of the embodiments of the present application.
The processor 1702 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more Integrated circuits, and is configured to execute related programs to implement functions required to be executed by units in the dynamic gesture recognition apparatus according to the embodiment of the present disclosure, or to execute the dynamic gesture recognition method according to the embodiment of the present disclosure.
The processor 1702 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the dynamic gesture recognition method of the present application may be implemented by integrated logic circuits of hardware in the processor 1702 or by instructions in the form of software. The processor 1702 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read only memory, a programmable read only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1701, and the processor 1702 reads the information in the memory 1701 and, in combination with its hardware, completes the functions to be performed by the units included in the dynamic gesture recognition apparatus of the embodiments of the present application, or performs the dynamic gesture recognition method of the embodiments of the present application.
Communication interface 1703 enables communication between apparatus 1700 and other devices or a communication network using transceiver means, such as, but not limited to, a transceiver. The bus 1704 may include a pathway to transfer information between various components of the device 1700 (e.g., the memory 1701, the processor 1702, and the communication interface 1703). For specific implementation of each functional device, reference may be made to the related description of the dynamic gesture recognition method in the foregoing embodiment, and details are not described in the embodiment of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Those of skill in the art will appreciate that the functions described in connection with the various illustrative logical blocks, modules, and algorithm steps described in the disclosure herein may be implemented as hardware, software, firmware, or any combination thereof. If implemented in software, the functions described in the various illustrative logical blocks, modules, and steps may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. The computer-readable medium may include a computer-readable storage medium, which corresponds to a tangible medium, such as a data storage medium, or any communication medium including a medium that facilitates transfer of a computer program from one place to another (e.g., according to a communication protocol). In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described herein. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that the computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, Application Specific Integrated Circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the term "processor," as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. Additionally, in some aspects, the functions described by the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques may be fully implemented in one or more circuits or logic elements.
The techniques of this application may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this application to emphasize functional aspects of means for performing the disclosed techniques, but do not necessarily require realization by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit, in conjunction with suitable software and/or firmware, or provided by an interoperating hardware unit (including one or more processors as described above).
The above description is only an exemplary embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A dynamic gesture recognition method, comprising:
acquiring a plurality of images containing gesture actions;
and identifying the plurality of images through a gesture recognition model to obtain the types and attributes of gesture actions in the plurality of images, wherein the attributes comprise a forward stroke and a backward stroke.
2. The method of claim 1, further comprising:
acquiring a plurality of sample images carrying annotation information, wherein the sample images are a plurality of images containing gesture actions, and the annotation information comprises types and attributes of the gesture actions in the sample images;
and training an initial gesture recognition model according to the plurality of sample images carrying the labeling information to obtain the gesture recognition model.
3. The method of claim 1, wherein the gesture recognition model comprises a spatial feature module, a temporal feature module, and a classification module;
the identifying the plurality of images through the gesture recognition model to obtain the types and the attributes of the gesture actions in the plurality of images comprises the following steps:
inputting the multiple images into the spatial feature module to obtain first feature data, wherein the first feature data comprises spatial features of gesture actions in the multiple images;
inputting the first feature data into the time domain feature module to obtain second feature data, wherein the second feature data comprises time domain features of the first feature data in a time dimension;
and inputting the second characteristic data into the classification module to obtain the types and attributes of the gesture actions in the multiple images.
4. The method of claim 3, wherein the time domain feature module comprises a dimension conversion layer, a convolution layer, a batch normalization (BN) layer, a rectified linear unit (ReLu) layer, a maximum pooling layer, and a feature combination layer;
inputting the first feature data into the time domain feature module to obtain second feature data, where the second feature data includes a time domain feature of the first feature data in a time dimension, and the method includes:
determining first time domain characteristic data corresponding to the first characteristic data in the time dimension through the dimension conversion layer according to the time information of the plurality of images;
performing convolution processing on the first time domain characteristic data through the convolution layer to obtain second time domain characteristic data;
and sequentially passing the second time domain feature data through the BN layer, the ReLu layer, the maximum pooling layer and the feature combination layer to obtain the second feature data.
5. The method of claim 4, wherein the convolving the first time-domain feature data with the convolutional layer to obtain second time-domain feature data comprises:
performing convolution processing on the first time domain feature data by using a first preset number of one-dimensional convolution layers with different convolution kernel sizes to obtain second time domain feature data, wherein the second time domain feature data comprises a first preset number of feature data with different scales;
the step of sequentially passing the second time domain feature data through the BN layer, the ReLu layer, the maximum pooling layer, and the feature combining layer to obtain the second feature data includes:
and sequentially passing the first preset number of feature data with different scales through the BN layer, the ReLu layer, the maximum pooling layer and the feature combination layer to obtain the second feature data.
6. The method according to any one of claims 3-5, wherein the gesture recognition model further comprises a first classifier and a second classifier, and the inputting the second feature data into the classification module to obtain the type and the attribute of the gesture motion in the plurality of images comprises:
inputting the second characteristic data into the first classifier to obtain a first probability that the gesture actions in the multiple images belong to each type;
classifying a first gesture motion into a type having a highest first probability corresponding to the first gesture motion, wherein the first gesture motion is any one of the gesture motions in the plurality of images;
inputting the second characteristic data into the second classifier to obtain a second probability that the gesture actions in the multiple images belong to each attribute;
and classifying the first gesture action into the attribute with the maximum second probability corresponding to the first gesture action.
7. The method according to any one of claims 1-6, wherein said capturing a plurality of images containing gesture actions comprises:
acquiring a video to be identified;
and extracting one image from the video to be recognized at intervals of a second preset number of images to obtain a plurality of images containing gesture actions.
8. The method according to any one of claims 1-7, further comprising:
and executing a function corresponding to the type of the gesture action under the condition that the attribute of the gesture action in the plurality of images is the forward stroke.
9. A dynamic gesture recognition apparatus, comprising:
the first acquisition unit is used for acquiring a plurality of images containing gesture actions;
and the recognition unit is used for recognizing the multiple images through a gesture recognition model to obtain the types and attributes of gesture actions in the multiple images, wherein the attributes comprise a forward stroke and a backward stroke.
10. The apparatus of claim 9, further comprising:
the second acquisition unit is used for acquiring a plurality of sample images carrying annotation information, wherein the sample images are a plurality of images containing gesture actions, and the annotation information comprises types and attributes of the gesture actions in the sample images;
and the training unit is used for training the initial gesture recognition model according to the plurality of sample images carrying the labeling information to obtain the gesture recognition model.
11. The apparatus of claim 9, wherein the gesture recognition model comprises a spatial feature module, a temporal feature module, and a classification module;
the identification unit is specifically configured to:
inputting the multiple images into the spatial feature module to obtain first feature data, wherein the first feature data comprises spatial features of gesture actions in the multiple images;
inputting the first feature data into the time domain feature module to obtain second feature data, wherein the second feature data comprises time domain features of the first feature data in a time dimension;
and inputting the second characteristic data into the classification module to obtain the types and attributes of the gesture actions in the multiple images.
12. The apparatus of claim 11, wherein the time domain feature module comprises a dimension conversion layer, a convolution layer, a batch normalization (BN) layer, a rectified linear unit (ReLu) layer, a maximum pooling layer, and a feature combination layer;
the identification unit is configured to input the first feature data into the time domain feature module to obtain second feature data, where the second feature data includes a time domain feature of the first feature data in a time dimension, and specifically configured to:
determining first time domain characteristic data corresponding to the first characteristic data in the time dimension through the dimension conversion layer according to the time information of the plurality of images;
performing convolution processing on the first time domain characteristic data through the convolution layer to obtain second time domain characteristic data;
and sequentially passing the second time domain feature data through the BN layer, the ReLu layer, the maximum pooling layer and the feature combination layer to obtain the second feature data.
13. The apparatus according to claim 12, wherein the identifying unit, when performing convolution processing on the first time domain feature data by the convolutional layer to obtain second time domain feature data, is specifically configured to:
performing convolution processing on the first time domain feature data by using a first preset number of one-dimensional convolution layers with different convolution kernel sizes to obtain second time domain feature data, wherein the second time domain feature data comprises a first preset number of feature data with different scales;
the step of sequentially passing the second time domain feature data through the BN layer, the ReLu layer, the maximum pooling layer, and the feature combining layer to obtain the second feature data includes:
and sequentially passing the first preset number of feature data with different scales through the BN layer, the ReLu layer, the maximum pooling layer and the feature combination layer to obtain the second feature data.
14. The apparatus according to any one of claims 11 to 13, wherein the gesture recognition model further includes a first classifier and a second classifier, and the recognition unit is configured to input the second feature data into the classification module, and when obtaining the types and attributes of the gesture actions in the plurality of images, specifically configured to:
inputting the second characteristic data into the first classifier to obtain a first probability that the gesture actions in the multiple images belong to each type;
classifying a first gesture motion into a type having a highest first probability corresponding to the first gesture motion, wherein the first gesture motion is any one of the gesture motions in the plurality of images;
inputting the second characteristic data into the second classifier to obtain a second probability that the gesture actions in the multiple images belong to each attribute;
and classifying the first gesture action into the attribute with the maximum second probability corresponding to the first gesture action.
15. The apparatus according to any one of claims 9 to 14, wherein the first obtaining unit is specifically configured to:
acquiring a video to be identified;
and extracting one image from the video to be recognized at intervals of a second preset number of images to obtain a plurality of images containing gesture actions.
16. The apparatus of any one of claims 9 to 15, further comprising:
and the execution unit is used for executing the function corresponding to the type of the gesture motion under the condition that the attribute of the gesture motion in the plurality of images is the forward stroke.
17. A computing device comprising a processor and a memory, the memory for storing a program, the processor executing the memory-stored program, the memory-stored program when executed causing the computing device to implement the method of any of claims 1-8.
18. A computer-readable storage medium for storing computer-executable instructions which, when invoked by the computer, cause the computer to perform the method of any one of claims 1 to 8.
CN202010235859.6A 2020-03-27 2020-03-27 Dynamic gesture recognition method and device Pending CN113449573A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010235859.6A CN113449573A (en) 2020-03-27 2020-03-27 Dynamic gesture recognition method and device
PCT/CN2021/079699 WO2021190296A1 (en) 2020-03-27 2021-03-09 Dynamic gesture recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010235859.6A CN113449573A (en) 2020-03-27 2020-03-27 Dynamic gesture recognition method and device

Publications (1)

Publication Number Publication Date
CN113449573A true CN113449573A (en) 2021-09-28

Family

ID=77808237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010235859.6A Pending CN113449573A (en) 2020-03-27 2020-03-27 Dynamic gesture recognition method and device

Country Status (2)

Country Link
CN (1) CN113449573A (en)
WO (1) WO2021190296A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821801A (en) * 2022-05-10 2022-07-29 百度在线网络技术(北京)有限公司 Motion recognition method, model training method, device, electronic device and storage medium
US20230107097A1 (en) * 2021-10-06 2023-04-06 Fotonation Limited Method for identifying a gesture
CN116165911A (en) * 2023-04-19 2023-05-26 深圳市吉方工控有限公司 Smart home control method and device, embedded industrial control equipment and medium
CN116168334A (en) * 2023-04-26 2023-05-26 深圳金三立视频科技股份有限公司 Video behavior classification method and terminal

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120048B (en) * 2022-01-26 2022-05-13 中兴通讯股份有限公司 Image processing method, electronic device, and computer-readable storage medium
CN114626412B (en) * 2022-02-28 2024-04-02 长沙融创智胜电子科技有限公司 Multi-class target identification method and system for unattended sensor system
CN116954113B (en) * 2023-06-05 2024-02-09 深圳市机器时代科技有限公司 Intelligent robot driving sensing intelligent control system and method thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740823A (en) * 2016-02-01 2016-07-06 北京高科中天技术股份有限公司 Dynamic gesture trace recognition method based on depth convolution neural network
CN107808131A (en) * 2017-10-23 2018-03-16 华南理工大学 Dynamic gesture identification method based on binary channel depth convolutional neural networks
CN108537147A (en) * 2018-03-22 2018-09-14 东华大学 A kind of gesture identification method based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10157309B2 (en) * 2016-01-14 2018-12-18 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN107340861B (en) * 2017-06-26 2020-11-20 联想(北京)有限公司 Gesture recognition method and device thereof

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740823A (en) * 2016-02-01 2016-07-06 北京高科中天技术股份有限公司 Dynamic gesture trajectory recognition method based on a deep convolutional neural network
CN107808131A (en) * 2017-10-23 2018-03-16 华南理工大学 Dynamic gesture recognition method based on a dual-channel deep convolutional neural network
CN108537147A (en) * 2018-03-22 2018-09-14 东华大学 Gesture recognition method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUONG-GIANG DOAN et al.: "New Cyclical Pattern and Temporal-Spatial Representation for Robust Dynamic Hand Gesture Recognition", Doctoral Consortium of IEEE International Conference on Automatic Face and Gesture Recognition, pages 1-4 *
JINGHUA LI et al.: "Spatial-temporal dynamic hand gesture recognition via hybrid deep learning model", Journal on Multimodal User Interfaces, vol. 13, no. 4, pages 363-371, XP036918453, DOI: 10.1007/s12193-019-00304-z *
NAINA DHINGRA et al.: "Res3ATN - Deep 3D Residual Attention Network for Hand Gesture Recognition in Videos", 2019 International Conference on 3D Vision (3DV), pages 491-501 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230107097A1 (en) * 2021-10-06 2023-04-06 Fotonation Limited Method for identifying a gesture
CN114821801A (en) * 2022-05-10 2022-07-29 百度在线网络技术(北京)有限公司 Motion recognition method, model training method, device, electronic device and storage medium
CN114821801B (en) * 2022-05-10 2023-04-07 百度在线网络技术(北京)有限公司 Motion recognition method, model training method, device, electronic device and storage medium
CN116165911A (en) * 2023-04-19 2023-05-26 深圳市吉方工控有限公司 Smart home control method and device, embedded industrial control equipment and medium
CN116168334A (en) * 2023-04-26 2023-05-26 深圳金三立视频科技股份有限公司 Video behavior classification method and terminal

Also Published As

Publication number Publication date
WO2021190296A1 (en) 2021-09-30

Similar Documents

Publication Publication Date Title
Singh et al. A deeply coupled ConvNet for human activity recognition using dynamic and RGB images
Cheng et al. Jointly network: a network based on CNN and RBM for gesture recognition
WO2021190296A1 (en) Dynamic gesture recognition method and device
CN109740534B (en) Image processing method, device and processing equipment
CN111797893B (en) Neural network training method, image classification system and related equipment
WO2019228358A1 (en) Deep neural network training method and apparatus
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
CN111291809B (en) Processing device, method and storage medium
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN110222718B (en) Image processing method and device
CN113807399B (en) Neural network training method, neural network detection method and neural network training device
WO2022179581A1 (en) Image processing method and related device
CN111882031A (en) Neural network distillation method and device
CN113326930A (en) Data processing method, neural network training method, related device and equipment
US20220198836A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
WO2022012668A1 (en) Training set processing method and apparatus
Parashar et al. Deep learning pipelines for recognition of gait biometrics with covariates: a comprehensive review
US20220262093A1 (en) Object detection method and system, and non-transitory computer-readable medium
CN112487217A (en) Cross-modal retrieval method, device, equipment and computer-readable storage medium
CN113807183A (en) Model training method and related equipment
CN113536970A (en) Training method of video classification model and related device
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
Alam et al. Two dimensional convolutional neural network approach for real-time bangla sign language characters recognition and translation
CN113449548A (en) Method and apparatus for updating object recognition model
CN111104911A (en) Pedestrian re-identification method and device based on big data training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination