WO2021190296A1 - Dynamic gesture recognition method and device - Google Patents

Dynamic gesture recognition method and device

Info

Publication number
WO2021190296A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature data
layer
gesture
multiple images
feature
Prior art date
Application number
PCT/CN2021/079699
Other languages
English (en)
French (fr)
Inventor
吴觊豪
马杰延
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021190296A1

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00: Pattern recognition
            • G06F 18/20: Analysing
              • G06F 18/24: Classification techniques
                • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
          • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
            • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
              • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods
                • G06N 3/084: Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of artificial intelligence (AI), and in particular to a dynamic gesture recognition method and device.
  • Dynamic gesture recognition has always been one of the research hotspots in the field of deep learning.
  • As an important means of human-computer interaction (HCI), dynamic gesture recognition is used in many fields such as virtual reality, smart home, early childhood education, and medical robots, and has broad application prospects.
  • However, the return stroke of a gesture can easily cause the terminal device to misjudge the direction of the gesture. For example, when the user needs to swipe right twice in a row, for the second swipe the hand must first return to the starting point of the first swipe before swiping again; during this return movement, the terminal device can easily recognize it as a left-swipe gesture.
  • the embodiments of the present application provide a dynamic gesture recognition method and device, which can improve the recognition accuracy.
  • In a first aspect, an embodiment of the present application provides a dynamic gesture recognition method, including: acquiring multiple images containing gesture actions; and recognizing the multiple images through a gesture recognition model to obtain the types and attributes of the gesture actions in the multiple images, where the attributes include the forward stroke and the return stroke.
  • In this way, the gesture recognition model recognizes the multiple images containing gesture actions and obtains the types and attributes of the gesture actions in the multiple images, and the corresponding operation is then performed according to the obtained types and attributes, which avoids misjudgment of the gesture by the terminal caused by the return-stroke problem.
  • multiple sample images carrying annotation information are acquired, where the sample images are multiple images containing gesture actions, and the annotation information includes the types and attributes of gesture actions in the sample images;
  • the initial gesture recognition model is trained on the plurality of sample images carrying the annotation information to obtain the gesture recognition model.
  • In this way, the types and attributes of the gesture actions recorded in the sample images can be annotated in advance, and the multiple sample images carrying the types and attributes are then used to train the initial gesture recognition model. After training, the gesture recognition model is able to recognize the types and attributes of gesture actions recorded in multiple input images and output them (a minimal training sketch is given below).
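  • The following is a minimal, hypothetical sketch of what such a training step could look like in PyTorch, assuming the model outputs type logits and an attribute logit; the loss functions, their equal weighting, and all names are illustrative assumptions rather than details stated in the patent.

```python
import torch.nn as nn

def train_step(model, optimizer, images, type_labels, attr_labels):
    # images:      (B, T, C, H, W) clip of sampled frames
    # type_labels: (B,) integer gesture-type indices
    # attr_labels: (B,) 0 = forward stroke, 1 = return stroke (assumed encoding)
    type_logits, attr_logits = model(images)

    type_loss = nn.functional.cross_entropy(type_logits, type_labels)
    attr_loss = nn.functional.binary_cross_entropy_with_logits(
        attr_logits.squeeze(-1), attr_labels.float())
    loss = type_loss + attr_loss        # equal weighting is an assumption

    optimizer.zero_grad()
    loss.backward()                     # backpropagation (cf. G06N 3/084)
    optimizer.step()
    return loss.item()
```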
  • In a possible implementation, the gesture recognition model includes a spatial feature module, a temporal feature module, and a classification module. Recognizing the multiple images through the gesture recognition model to obtain the types and attributes of the gesture actions in the multiple images includes: inputting the multiple images into the spatial feature module to obtain first feature data, where the first feature data includes the spatial features of the gesture actions in the multiple images; inputting the first feature data into the temporal feature module to obtain second feature data, where the second feature data includes the temporal features of the first feature data in the time dimension; and inputting the second feature data into the classification module to obtain the types and attributes of the gesture actions in the multiple images.
  • In this way, the spatial feature module of the gesture recognition model performs spatial feature extraction on the multiple input images to obtain the first feature data, the temporal feature module extracts the temporal characteristics of the first feature data in the time dimension, and the result is finally input into the classification module of the gesture recognition model to obtain the types and attributes of the gesture actions in the multiple images (a minimal sketch of this pipeline follows).
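  • The skeleton below illustrates this three-module pipeline in PyTorch. It is only a structural sketch under the assumption that each module is an nn.Module; the internal layer choices and all names are not specified by the patent.

```python
import torch.nn as nn

class GestureRecognitionModel(nn.Module):
    """Illustrative skeleton of the spatial -> temporal -> classification pipeline."""
    def __init__(self, spatial, temporal, classifier):
        super().__init__()
        self.spatial = spatial          # per-frame spatial feature module
        self.temporal = temporal        # temporal feature module
        self.classifier = classifier    # classification module (type + attribute)

    def forward(self, images):
        # images: (B*T, C, H, W), i.e. the sampled frames of a clip
        first_feature = self.spatial(images)            # spatial features per frame
        second_feature = self.temporal(first_feature)   # features along the time axis
        return self.classifier(second_feature)          # (type output, attribute output)
```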
  • In a possible implementation, the temporal feature module includes a dimensional transformation layer, a convolutional layer, a batch normalization (BN) layer, a rectified linear unit (ReLU) layer, a maximum pooling layer, and a feature joint layer. Inputting the first feature data into the temporal feature module to obtain the second feature data, where the second feature data includes the temporal features of the first feature data in the time dimension, includes: determining, through the dimensional transformation layer and according to the time information of the multiple images, the first time-domain feature data corresponding to the first feature data in the time dimension; performing convolution processing on the first time-domain feature data through the convolutional layer to obtain second time-domain feature data; and passing the second time-domain feature data sequentially through the BN layer, the ReLU layer, the maximum pooling layer, and the feature joint layer to obtain the second feature data.
  • In this way, according to the time information of each of the multiple images, the dimensional transformation layer of the temporal feature module arranges the pixels at the same position in the first feature data of the multiple images into first time-domain feature data along the time dimension; the convolutional layer of the temporal feature module performs convolution processing on the first time-domain feature data to obtain the corresponding second time-domain feature data; and the second time-domain feature data sequentially passes through the BN layer, the ReLU layer, the maximum pooling layer, and the feature joint layer to obtain the second feature data.
  • In a possible implementation, performing convolution processing on the first time-domain feature data through the convolutional layer to obtain the second time-domain feature data includes: performing convolution processing on the first time-domain feature data using a first preset number of one-dimensional convolutional layers with different convolution kernel sizes to obtain the second time-domain feature data, where the second time-domain feature data includes a first preset number of feature data of different scales. Passing the second time-domain feature data sequentially through the BN layer, the ReLU layer, the maximum pooling layer, and the feature joint layer to obtain the second feature data then includes: passing the first preset number of feature data of different scales sequentially through the BN layer, the ReLU layer, the maximum pooling layer, and the feature joint layer to obtain the second feature data.
  • In this way, the convolutional layer of the temporal feature module applies a first preset number of one-dimensional convolutional layers with different convolution kernel sizes to the first time-domain feature data to obtain a first preset number of feature data of different scales. These feature data pass through the BN layer, the ReLU layer, and the maximum pooling layer, and the feature joint layer then fuses the first preset number of feature data of different scales corresponding to the first time-domain feature data to obtain the second feature data corresponding to the first feature data.
  • Using a one-dimensional convolutional layer can effectively reduce the amount of calculation and improve the processing efficiency of the convolutional layer of the temporal feature module.
  • In a possible implementation, the gesture recognition model further includes a first classifier and a second classifier. Inputting the second feature data into the classification module to obtain the types and attributes of the gesture actions in the multiple images includes: inputting the second feature data into the first classifier to obtain the first probability that a gesture action in the multiple images belongs to each type, and classifying a first gesture action into the type with the highest first probability corresponding to the first gesture action, where the first gesture action is any one of the gesture actions in the multiple images; and inputting the second feature data into the second classifier to obtain the second probability that the gesture action in the multiple images belongs to each attribute, and classifying the first gesture action into the attribute with the highest second probability corresponding to the first gesture action.
  • In this way, the second feature data is input into the trained first classifier and second classifier to obtain the first probability that a gesture action belongs to each type and the second probability that it belongs to each attribute, and the first gesture action is classified into the type with the highest first probability and the attribute with the highest second probability corresponding to the first gesture action.
  • In a possible implementation, acquiring the multiple images containing gesture actions includes: acquiring a video to be recognized, and extracting one image from the video to be recognized every second preset number of images to obtain the multiple images containing gesture actions.
  • In this way, the video to be recognized is obtained, and one image is extracted every second preset number of images according to the sequence of the images in the video to be recognized; when the number of extracted images reaches a third preset number, the extracted third preset number of images are determined as the multiple images.
  • In a possible implementation, the method further includes: when the attribute of the gesture action in the multiple images is the forward stroke, executing a function corresponding to the type of the gesture action.
  • In this way, after the gesture recognition model recognizes the types and attributes of the gesture actions in the multiple images, the terminal device executes the function corresponding to the recognized type if the attribute is the forward stroke, and performs no processing if the attribute is the return stroke (see the sketch below).
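  • A minimal, hypothetical sketch of this dispatch logic is shown below; the attribute strings and the handler mapping are illustrative assumptions, not part of the patent.

```python
def handle_gesture(gesture_type: str, attribute: str, handlers: dict) -> None:
    # Only the forward stroke triggers an action; the return stroke is ignored.
    if attribute != "forward":
        return
    handler = handlers.get(gesture_type)
    if handler is not None:
        handler()   # e.g. handlers = {"swipe_right": next_page} (hypothetical names)
```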
  • An embodiment of the present application further provides a dynamic gesture recognition device, including: a first acquiring unit configured to acquire multiple images containing gesture actions; and a recognition unit configured to recognize the multiple images through a gesture recognition model to obtain the types and attributes of the gesture actions in the multiple images, where the attributes include the forward stroke and the return stroke.
  • the device further includes: a second acquiring unit, configured to acquire a plurality of sample images carrying annotation information, the sample images are multiple images containing gesture actions, and the annotation information includes The type and attribute of the gesture action in the sample image; the training unit is used to train the initial gesture recognition model according to the plurality of sample images carrying annotation information to obtain the gesture recognition model.
  • the gesture recognition model includes a spatial feature module, a temporal feature module, and a classification module; the recognition unit is specifically configured to: input the multiple images into the spatial feature module to obtain First feature data, the first feature data includes the spatial features of gesture actions in the multiple images; the first feature data is input into the time domain feature module to obtain second feature data, the second feature The data includes the temporal characteristics of the first characteristic data in the time dimension; the second characteristic data is input to the classification module to obtain the types and attributes of gesture actions in the multiple images.
  • In a possible implementation, the temporal feature module includes a dimensional transformation layer, a convolutional layer, a BN layer, a ReLU layer, a maximum pooling layer, and a feature joint layer. When inputting the first feature data into the temporal feature module to obtain the second feature data, where the second feature data includes the temporal features of the first feature data in the time dimension, the recognition unit is specifically configured to: determine, through the dimensional transformation layer and according to the time information of the multiple images, the first time-domain feature data corresponding to the first feature data in the time dimension; perform convolution processing on the first time-domain feature data through the convolutional layer to obtain second time-domain feature data; and pass the second time-domain feature data sequentially through the BN layer, the ReLU layer, the maximum pooling layer, and the feature joint layer to obtain the second feature data.
  • In a possible implementation, when performing convolution processing on the first time-domain feature data through the convolutional layer to obtain the second time-domain feature data, the recognition unit is specifically configured to: perform convolution processing on the first time-domain feature data using a first preset number of one-dimensional convolutional layers with different convolution kernel sizes to obtain the second time-domain feature data, where the second time-domain feature data includes a first preset number of feature data of different scales; and when passing the second time-domain feature data sequentially through the BN layer, the ReLU layer, the maximum pooling layer, and the feature joint layer to obtain the second feature data, the recognition unit is specifically configured to pass the first preset number of feature data of different scales sequentially through the BN layer, the ReLU layer, the maximum pooling layer, and the feature joint layer to obtain the second feature data.
  • In a possible implementation, the gesture recognition model further includes a first classifier and a second classifier. When inputting the second feature data into the classification module to obtain the types and attributes of the gesture actions in the multiple images, the recognition unit is specifically configured to: input the second feature data into the first classifier to obtain the first probability that a gesture action in the multiple images belongs to each type, and classify a first gesture action into the type with the highest first probability corresponding to the first gesture action, where the first gesture action is any one of the gesture actions in the multiple images; and input the second feature data into the second classifier to obtain the second probability that the gesture action in the multiple images belongs to each attribute, and classify the first gesture action into the attribute with the highest second probability corresponding to the first gesture action.
  • In a possible implementation, the first acquiring unit is specifically configured to: acquire a video to be recognized, and extract one image from the video to be recognized every second preset number of images to obtain the multiple images containing gesture actions.
  • In a possible implementation, the device further includes: an execution unit configured to execute a function corresponding to the type of the gesture action when the attribute of the gesture action in the multiple images is the forward stroke.
  • An embodiment of the present application further provides a computing device. The computing device includes a processor and a memory, where the memory is used to store a program; when the processor executes the program stored in the memory, the computing device implements the dynamic gesture recognition method provided in the foregoing first aspect or any implementation of the first aspect.
  • An embodiment of the present application further provides a computer-readable storage medium used to store computer-executable instructions which, when executed, cause a computer to implement the dynamic gesture recognition method provided in the foregoing first aspect or any implementation of the first aspect.
  • The embodiments of the present application further provide a computer program product including instructions which, when the computer program product is executed by a computer, enable the computer to execute the dynamic gesture recognition method provided in the foregoing first aspect or any implementation of the first aspect.
  • FIG. 1 is a scene of dynamic gesture interaction in an embodiment of the application
  • FIG. 2 is a schematic diagram of the architecture of a dynamic gesture recognition system in an embodiment of the application
  • FIG. 3 is a schematic diagram of a gesture recognition model in an embodiment of the application.
  • Figure 4 is a schematic diagram of a CNN in an embodiment of the application.
  • FIG. 5 is a schematic diagram of a chip hardware structure in an embodiment of the application.
  • FIG. 6 is a schematic flowchart of a dynamic gesture recognition method in an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a spatial feature module in an embodiment of this application.
  • FIG. 8 is a schematic diagram of first feature data extraction in an embodiment of this application.
  • FIG. 9 is a schematic diagram of a time domain feature module in an embodiment of this application.
  • FIG. 10 is a schematic diagram of a dimensional transformation layer in an embodiment of the application performing a conversion and Reshape operation
  • FIG. 11 is a schematic flowchart of another dynamic gesture recognition method in an embodiment of the present invention.
  • FIG. 12 is a schematic flowchart of another dynamic gesture recognition method in an embodiment of the present invention.
  • FIG. 13 is a schematic flowchart of a method for training a gesture recognition model in an embodiment of the present invention.
  • FIG. 15 is a schematic diagram of temporal feature extraction in an embodiment of the present invention.
  • FIG. 16 is a schematic structural diagram of a dynamic gesture recognition device in an embodiment of the present invention.
  • FIG. 17 is a schematic structural diagram of a computing device disclosed in an embodiment of the present invention.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • a neural network can be composed of neural units.
  • A neural unit can refer to an arithmetic unit that takes inputs x_s and an intercept of 1 as inputs. The output of the arithmetic unit can be h_{W,b}(x) = f(W^T x + b) = f(∑_{s=1}^{n} W_s·x_s + b), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
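  • The snippet below is a small illustrative computation of such a neural unit in Python (NumPy), using a sigmoid as the activation; the numbers are arbitrary examples, not values from the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # inputs x_s
W = np.array([0.8, 0.1, -0.3])   # weights W_s
b = 0.2                          # bias

output = sigmoid(np.dot(W, x) + b)   # h_{W,b}(x) = f(sum_s W_s * x_s + b)
```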
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here. According to the positions of the different layers, the layers of a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated.
  • The coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W_jk^L. It should be noted that the input layer has no W parameter.
  • more hidden layers make the network more capable of portraying complex situations in the real world.
  • a model with more parameters is more complex and has a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • Convolutional neural network (convolutional neuron network, CNN) is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can be connected to only part of the neighboring neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of position. The underlying principle is that the statistics of one part of an image are the same as those of other parts, so image features learned in one part can also be used in another part, and the same learned features can be used at all positions in the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the more convolution kernels there are, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size.
  • the convolution kernel can obtain reasonable weights through learning.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
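  • As an illustration of weight sharing (not taken from the patent), the short PyTorch snippet below applies one convolutional layer with several kernels to an image; the channel counts and sizes are arbitrary example values.

```python
import torch
import torch.nn as nn

# Each of the 16 kernels is one 3x3 weight matrix that slides over every
# position of the input image, so the same weights extract the same kind of
# feature everywhere in the image.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
feature_maps = conv(image)            # (1, 16, 224, 224): one feature map per kernel
```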
  • FIG 1 is a scene of dynamic gesture interaction in an embodiment of this application.
  • a user can use one or two hands to perform non-contact operations on a terminal device.
  • Dynamic gesture recognition responds to user gestures and executes related commands.
  • At present, mainstream dynamic gesture recognition methods can be divided into two types. The first combines neural networks with video input: based on multiple input images, a convolutional neural network (CNN) extracts spatial features (image features), a one-dimensional convolution (1D CONV) or a fully connected network (multilayer perceptron, MLP) extracts temporal features, and the dynamic gesture recognition result for the video is finally obtained.
  • This method can perform dynamic gesture recognition end-to-end (learning the characteristics of the action during training), but it suffers from a serious return-stroke problem.
  • The second type performs static image recognition (detection and tracking, classification, or keypoint recognition) through neural networks and infers dynamic movements by combining the classification results of consecutive frames, such as the classification of positions and hand shapes. This method can adjust the accuracy of dynamic gesture recognition through classification results and classification thresholds, but the return-stroke problem remains difficult to solve.
  • In view of this, this application provides a dynamic gesture recognition method that acquires multiple images containing gesture actions, recognizes the multiple images through a gesture recognition model to obtain the types and attributes of the gesture actions in the multiple images, where the attributes include the forward stroke and the return stroke, and then executes the corresponding operation commands according to the type and attribute of the recognized gesture action.
  • FIG. 2 is a schematic diagram of the architecture of a dynamic gesture recognition system in an embodiment of the application.
  • the dynamic gesture recognition system architecture 200 may include an execution device 210, a training device 220, a database 230, a user device 240, a data storage system 250, and a data collection device 260.
  • the data collection device 260 is used to collect multiple image data including gesture actions and store the multiple image data in the database 230.
  • the training device 220 trains the gesture recognition model 201 based on the multiple image data maintained in the database 230.
  • the training process may include: the training device 220 inputs multiple pieces of image data into the initial gesture recognition model 221 to obtain the gesture recognition model 201, and the initial gesture recognition model 221 is a deep neural network.
  • The work of each layer in the deep neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer can be understood as completing a transformation from the input space to the output space (that is, from the row space of the matrix to its column space) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by W·x, operation 4 by +b, and operation 5 by a().
  • The purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the space transformation, and more specifically, learning the weight matrices.
  • Because the output of the deep neural network should be as close as possible to the value that is actually desired, the predicted value of the current network can be compared with the desired target value, and the weight vector of each layer of the network is updated according to the difference between them (of course, there is usually an initialization process before the first update, in which parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the target value that is actually desired. Therefore, it is necessary to predefine how to compare the difference between the predicted value and the target value; this is the loss function (or objective function), an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a greater difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
  • The gesture recognition model 201 trained by the training device 220 can be applied to different systems or devices, such as the execution device 210 shown in FIG. 2, which can be a terminal such as a mobile phone, a tablet computer, a notebook computer, an AR/VR device, or a vehicle-mounted terminal, and can also be a server or a cloud.
  • the execution device 210 may execute the dynamic gesture recognition method in the embodiment of the present application.
  • the execution device 210 is configured with an I/O interface 212 for data interaction with external devices.
  • the user can input data to the I/O interface 212 through the user device 240.
  • In the embodiment of the present application, the input data may be multiple pieces of image data containing gesture actions, or a request to the execution device 210 to recognize dynamic gestures.
  • The execution device 210 can call data, code, etc. in the data storage system 250 for corresponding processing, and can also store the data, instructions, etc. obtained from the corresponding processing in the data storage system 250.
  • The calculation module 211 can use the gesture recognition model 201 to process multiple pieces of input image data containing gesture actions. Specifically, multiple images containing gesture actions are first acquired and passed through the spatial feature module in the gesture recognition model 201 to obtain the first feature data; the first feature data is input into the temporal feature module in the gesture recognition model 201 to obtain the second feature data; and the second feature data is input into the classification module in the gesture recognition model 201 to obtain the types and attributes of the gesture actions in the multiple images.
  • the I/O interface 212 returns the processing results, such as the types and attributes of gesture actions in the multiple images obtained by the recognition method of the gesture recognition model 201 described above, to the user equipment 240.
  • the user equipment 240 may be a terminal, such as a mobile phone terminal, a notebook computer, an AR/VR terminal, or a vehicle-mounted terminal, etc., to respond to the corresponding needs of the terminal user.
  • the user can manually set input data (for example, multiple images including gesture actions in the embodiment of the present application), and the manual setting can be operated through the interface provided by the I/O interface 212.
  • the user equipment 240 can automatically send input data to the I/O interface 212. If the user equipment 240 is required to automatically send input data and the user's authorization is required, the user can set the corresponding authority in the user equipment 240.
  • the user can view the recognition result output by the execution device 210 on the user device 240, and the recognition result includes the types and attributes of gesture actions in multiple images. After receiving the recognition result, the user equipment 240 may convert the recognition result into a corresponding instruction to respond to the user's dynamic gesture.
  • FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present invention.
  • The positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • In FIG. 2, the data storage system 250 is an external memory relative to the execution device 210; in other cases, the data storage system 250 may also be placed in the execution device 210.
  • the data collection device 260 may be a separate external device from the user equipment 240, or may be an internal device placed in the user equipment 240.
  • the gesture recognition model 201 in this embodiment may be obtained by training according to the training device 220.
  • the gesture recognition model 201 provided in the embodiment of the present application may be a neural network model for dynamic gesture recognition.
  • the gesture recognition model 300 may include a spatial feature module 301, a temporal feature module 302, and a classification module 303, and the temporal feature module 302 may be arranged behind the spatial feature module 301.
  • the spatial feature module 301 in FIG. 3 extracts first feature data layer by layer from input multiple images containing gesture actions, and the first feature data contains spatial features that characterize the gesture actions in the multiple images.
  • the input data of the time domain feature module 302 is the first feature data output by the spatial feature module 301 at the upper level.
  • the time domain feature module 302 processes the first feature data to obtain the second feature data.
  • the input data of the classification module 303 is the second feature data output by the time domain feature module 302 located at the upper level.
  • the classification module 303 classifies the second feature data and determines the types and attributes of gesture actions in multiple images.
  • the output value of the classification module 303 can be passed to two outputs, one output can be classified using softmax logistic regression (softmax regression) to characterize the type of gesture action, and the other output can use a sigmoid function to characterize the attribute of the gesture action.
  • the gesture recognition model 300 may include multiple spatial feature modules and multiple temporal feature modules, and the structures of the multiple spatial feature modules may be the same or different.
  • A single spatial feature module can contain only one neural network layer, for example, only one convolutional layer; a single spatial feature module can also include multiple identical or different neural network layers, for example, a convolutional layer and a pooling layer, or multiple different convolutional layers.
  • The gesture recognition model 300 described in FIG. 3 is only an example. In practical applications, the number, structure, and positions of the spatial feature modules and of the temporal feature modules included in the gesture recognition model 300 can be set according to actual needs, which is not limited in the embodiments of this application.
  • the spatial feature module 301 may be a CNN architecture.
  • CNN is a deep neural network with a convolutional structure and a deep learning architecture.
  • The deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to input multiple images containing gesture actions.
  • a convolutional neural network (CNN) 400 may include an input layer 410, a convolutional layer/pooling layer 420 (the pooling layer is optional), and a neural network layer 430.
  • The convolutional layer/pooling layer 420 may include layers 421-426. In one example, layer 421 is a convolutional layer, layer 422 is a pooling layer, layer 423 is a convolutional layer, layer 424 is a pooling layer, layer 425 is a convolutional layer, and layer 426 is a pooling layer. In another example, layers 421 and 422 are convolutional layers, layer 423 is a pooling layer, layers 424 and 425 are convolutional layers, and layer 426 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 421 can include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator is essentially a weight matrix, which is usually pre-defined. During convolution over an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix and the depth dimension of the input image are the same.
  • During the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple homogeneous matrices, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to eliminate unwanted noise in the image.
  • the multiple weight matrices have the same size (row ⁇ column), and the feature maps extracted by the multiple weight matrices of the same size have the same size, and then the multiple extracted feature maps of the same size are combined to form a convolution operation. Output.
  • the weight values in these weight matrices need to be obtained through a lot of training in practical applications.
  • Each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network 400 can make correct predictions. .
  • The initial convolutional layers (such as 421) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network 400 increases, the features extracted by later convolutional layers (such as 426) become more and more complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
  • In the layers 421-426 illustrated by 420 in FIG. 4, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image with a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of the average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
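  • As a small illustration of the pooling operation (not taken from the patent), the PyTorch snippet below applies 2x2 max pooling and average pooling to a 4x4 single-channel input, halving its height and width.

```python
import torch
import torch.nn.functional as F

x = torch.arange(16.0).reshape(1, 1, 4, 4)   # a 4x4 single-channel "image"
max_pooled = F.max_pool2d(x, kernel_size=2)  # 2x2 output; each value is the
                                             # maximum of one 2x2 sub-region
avg_pooled = F.avg_pool2d(x, kernel_size=2)  # 2x2 output of sub-region means
```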
  • Neural network layer 430
  • After processing by the convolutional layer/pooling layer 420, the convolutional neural network 400 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 420 only extracts features and reduces the parameters brought by the input image. In order to generate the final output information (the required class information or other related information), the convolutional neural network 400 needs to use the neural network layer 430 to generate the output of one or a group of required classes. Therefore, the neural network layer 430 may include multiple hidden layers (431, 432 to 43n as shown in FIG. 4) and an output layer 440, and the parameters contained in the hidden layers may be obtained by pre-training on relevant training data for a specific task type.
  • After the multiple hidden layers in the neural network layer 430, the final layer of the entire convolutional neural network 400 is the output layer 440.
  • the output layer 440 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
  • the convolutional neural network 400 shown in FIG. 4 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
  • FIG. 5 is a schematic diagram of a chip hardware structure in an embodiment of the application.
  • the chip includes a neural network processor 50.
  • the chip can be set in the execution device 210 as shown in FIG. 2 to complete the calculation work of the calculation module 211.
  • the chip can also be set in the training device 220 shown in FIG. 2 to complete the training work of the training device 220 and output the target model 201.
  • the algorithms of each module in the gesture recognition model as shown in FIG. 3 can all be implemented in the chip as shown in FIG. 5.
  • the neural network processor 50 may be NPU, TPU, or GPU, etc., any processor suitable for large-scale XOR calculation processing. Take the NPU as an example: the NPU can be mounted on a host CPU (host CPU) as a coprocessor, and the host CPU assigns tasks to it. The core part of the NPU is the arithmetic circuit 503. The arithmetic circuit 503 is controlled by the controller 504 to extract matrix data in the memory (501 and 502) and perform multiplication and addition operations.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
  • the arithmetic circuit 503 fetches the weight data of the matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit 503.
  • the arithmetic circuit 503 fetches the input data of matrix A from the input memory 501, and performs matrix operations based on the input data of matrix A and the weight data of matrix B, and the partial or final result of the obtained matrix is stored in an accumulator 508 .
  • the unified memory 506 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 502 through the direct memory access controller (DMAC) 505 of the storage unit.
  • the input data is also transferred to the unified memory 506 through the DMAC.
  • The bus interface unit (BIU) 510 is used for interaction between the DMAC and the instruction fetch buffer 509; the bus interface unit 510 is also used by the instruction fetch buffer 509 to obtain instructions from the external memory, and by the storage unit access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 506, or to transfer the weight data to the weight memory 502, or to transfer the input data to the input memory 501.
  • the vector calculation unit 507 has multiple arithmetic processing units, if necessary, further processing the output of the arithmetic circuit 503, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 507 is mainly used for calculation of non-convolutional layers or fully connected layers (FC) in the neural network, and can specifically handle: pooling, normalization, etc. calculations.
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 507 generates a normalized value, a combined value, or both.
  • the vector calculation unit 507 stores the processed vector to the unified memory 506.
  • the vector processed by the vector calculation unit 507 can be used as the activation input of the arithmetic circuit 503, for example, for use in subsequent layers in a neural network, as shown in FIG. 4, if the current processing layer is a hidden layer 1 (431), the vector processed by the vector calculation unit 507 can also be used for calculation in the hidden layer 2 (432).
  • the instruction fetch buffer 509 connected to the controller 504 is used to store instructions used by the controller 504;
  • the unified memory 506, the input memory 501, the weight memory 502, and the fetch memory 509 are all On-Chip memories.
  • the external memory is independent of the NPU hardware architecture.
  • each module in the gesture recognition model shown in FIG. 3 may be executed by the arithmetic circuit 503 or the vector calculation unit 507.
  • FIG. 6 is a schematic flowchart of a dynamic gesture recognition method in an embodiment of the present invention.
  • the dynamic gesture recognition method may include the following steps.
  • multiple images containing the gesture action can be acquired.
  • The multiple images containing gesture actions may include only the hand object, or may also include the background in which the hand object is located.
  • Specifically, the video to be recognized may be acquired first, and one image is extracted every second preset number of images in the video to be recognized to obtain the multiple images containing gesture actions.
  • When the number of extracted images reaches the third preset number, the extracted third preset number of images are determined as the multiple images containing gesture actions. For example, if the second preset number is 14 and the third preset number is 8, extraction can start from the first frame of the video to be recognized, taking one image every 14 frames, so that frames 1, 15, 29, 43, 57, 71, 85, and 99 form the first group of 8 images; extraction can then continue every 14 frames to obtain the second group of 8 images.
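  • A minimal sketch of this sampling scheme is given below; the function name and the default stride and clip length simply mirror the example above and are otherwise assumptions.

```python
def sample_clips(num_frames: int, stride: int = 14, clip_len: int = 8):
    """Return groups of 0-based frame indices, one image every `stride` frames."""
    sampled = list(range(0, num_frames, stride))
    clips = [sampled[i:i + clip_len]
             for i in range(0, len(sampled), clip_len)
             if len(sampled[i:i + clip_len]) == clip_len]
    return clips

# sample_clips(200)[0] -> [0, 14, 28, 42, 56, 70, 84, 98]  (frames 1, 15, ..., 99)
```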
  • the multiple images can be recognized through the gesture recognition model, and the types and attributes of the gesture actions in the multiple images can be obtained.
  • the gesture recognition model may include a spatial feature module, a temporal feature module, a classification module, a first classifier, and a second classifier.
  • The multiple images input to the gesture recognition model may be continuous images, or discontinuous images extracted from the video to be recognized and arranged in time sequence.
  • The multiple images are essentially a four-dimensional matrix (B×T, C, H, W), where B is the batch size, that is, the number of image groups that the gesture recognition model can process at one time; T is the image length, that is, the number of images contained in the multiple images; C is the number of channels of the image; H is the height of the image; and W is the width of the image.
  • the image referred to is a video frame.
  • the spatial feature module 700 may include an input layer 710, a convolutional layer/pooling layer 720 (the pooling layer is optional), an attention mechanism 730, and a neural network layer 740.
  • the backbone network can use the CNN architecture.
  • In this embodiment, the spatial feature module 700 is a lightweight network based on the CNN architecture with an added attention mechanism; using batch normalization (BatchNorm) instead of L2Norm can achieve a better effect. The attention mechanism 730 can be added after the convolutional layer/pooling layer 720 (the pooling layer is optional).
  • FIG. 8 is a schematic diagram of first feature data extraction in an embodiment of the application.
  • this embodiment adopts the scheme of multiple image input, and the method of sharing parameters is adopted for each image.
  • Blocks 0 to 3 use the same parameters for each image.
  • the block is a feature block formed after the convolutional layer in the spatial feature module performs spatial feature extraction on a local area of the image.
  • an attention mechanism can be added to enhance the spatial feature extraction of the local area.
  • An attention mechanism is added after each block, as shown in (2) in FIG. 8.
  • When spatial feature extraction is performed on image 0, the features of multiple local regions can be obtained, and these features can be added to obtain spatial feature 0; similarly, performing spatial feature extraction on image 1 yields spatial feature 1, ..., and performing spatial feature extraction on image n yields spatial feature n.
  • the first feature data corresponding to each image includes multiple two-dimensional pictures (ie, two-dimensional matrix (H, W)), and each two-dimensional picture is a feature map (feature map).
  • the number of feature maps contained in the feature data is equal to the number of corresponding channels. For example, if the dimension of the data output by the spatial feature module is (16,64,112,112), the first feature data corresponding to an image contains 64 feature maps, and the size of each feature map is 112 ⁇ 112. It should be noted that the dimensions and sizes of the first feature data corresponding to each image output by the same module are the same.
  • the second feature data corresponding to each image also includes multiple feature maps.
  • the first feature data can be input into the time domain feature module to obtain the second feature data.
  • the second feature data includes the time domain features of the first feature data in the time dimension.
  • the temporal feature module may be a CNN architecture.
  • FIG. 9 is a schematic diagram of a time domain feature module in an embodiment of this application.
  • the time-domain feature module 900 may include a dimensional transformation layer 901, a convolutional layer 902, a batch normalization layer 903, an activation function layer 904, a maximum pooling layer 905, and a feature joint layer 906.
  • the dimensional transformation layer 901 determines the first time domain feature data corresponding to the first feature data in the time dimension.
  • FIG. 10 is a schematic diagram of a dimensional transformation layer in an embodiment of the present application performing a conversion and Reshape operation.
  • Reshape is a function that can re-adjust the number of rows, columns, and dimensions of a matrix. As shown in FIG. 10, the dimensional transformation layer 901 performs a dimensional transformation on the first feature data (B×T, C, H, W) output by the upper-level spatial feature module: the spatial dimensions (H, W) are merged into the batch dimension, and the time dimension T is separated out, yielding a three-dimensional matrix (B×H×W, C, T). Each piece of first time-domain feature data is composed of the pixels that have the same H, the same W, and the same C in the first feature data (C, H, W) of the multiple images, arranged in chronological order, so each piece of first time-domain feature data contains T values; for example, when T is 8, each piece contains 8 values.
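  • A minimal PyTorch sketch of this reshape, assuming the first feature data is laid out as (B×T, C, H, W), is shown below; the concrete sizes are the example values used in this description.

```python
import torch

B, T, C, H, W = 2, 8, 64, 112, 112
first_feature = torch.randn(B * T, C, H, W)        # output of the spatial module

x = first_feature.view(B, T, C, H, W)              # recover the time axis
x = x.permute(0, 3, 4, 2, 1)                       # (B, H, W, C, T)
first_time_domain = x.reshape(B * H * W, C, T)     # (B*H*W, C, T)
# each (C, T) slice holds, per channel, the T = 8 values of one pixel over time
```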
  • After the first time-domain feature data is obtained, the convolutional layer 902 performs a convolution operation on the first time-domain feature data to obtain the second time-domain feature data. Specifically, the first time-domain feature data output by the dimensional transformation layer 901 is subjected to convolution processing through the convolutional layer 902.
  • The convolutional layer 902 may include a first preset number of one-dimensional convolutional layers with different convolution kernel sizes. These one-dimensional convolutional layers respectively perform convolution processing on the first time-domain feature data output by the dimensional transformation layer 901 to obtain a first preset number of second time-domain feature data of different scales corresponding to the first time-domain feature data. The feature joint layer 906 fuses the first preset number of second time-domain feature data of different scales to obtain the corresponding second feature data; specifically, the feature joint layer 906 may add the first preset number of second time-domain feature data of different scales together to obtain the corresponding second feature data.
  • the convolution layer 902 multiple one-dimensional convolution layers with different sizes of convolution kernels are set, and time sequence features of different scales can be extracted from the same first time domain feature data, and the feature joint layer 906 merges these multiple time sequences of different scales. Feature, the second feature data is obtained, and the timing feature of the gesture action is better preserved.
  • Like the convolutional layer 902, the BN layer 903 is a network layer; it is used to accelerate training and improve the generalization ability of the network. The BN layer is essentially a normalization layer and can replace the local response normalization (LRN) layer.
  • The ReLu layer 904 is used to introduce non-linear relationships between the layers of the neural network and to mitigate the vanishing-gradient problem.
  • The maximum pooling operator in the maximum pooling layer 905 takes, within a specific range, the pixel with the largest value as the max-pooling result, and can downsample the time-domain feature data output by the convolutional layer 902 to obtain a smaller-sized representation.
  • In one implementation, the convolutional layer 902 may include four one-dimensional convolutional layers with different convolution kernel sizes, for example k=3, k=4, k=5, and k=6. The four one-dimensional convolutional layers each convolve the first time-domain feature data, producing four feature data of different scales corresponding to the first time-domain feature data. After these pass through the BN layer 903, the ReLu layer 904, and the maximum pooling layer 905, the feature joint layer 906 adds the four feature data of different scales together to obtain the second feature data corresponding to the first time-domain feature data.
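The following is a minimal, illustrative sketch of such a multi-scale temporal branch, assuming PyTorch; the channel count, kernel sizes, and the pooling-to-one-value step are assumptions chosen so that the four branches can be summed, not a definitive reproduction of the patented configuration.

```python
import torch
import torch.nn as nn

class TemporalFeatureSketch(nn.Module):
    # Illustrative sketch of the temporal feature module described above (assumed PyTorch).
    # Four 1-D convolutions with different kernel sizes extract multi-scale temporal
    # features, each followed by BN, ReLU, and max pooling; the results are summed.
    def __init__(self, channels=64, kernel_sizes=(3, 4, 5, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=k, padding=k // 2),  # conv layer 902
                nn.BatchNorm1d(channels),                                      # BN layer 903
                nn.ReLU(inplace=True),                                         # ReLu layer 904
                nn.AdaptiveMaxPool1d(1),                                       # max pooling layer 905
            )
            for k in kernel_sizes
        ])

    def forward(self, x):                            # x: (B*H*W, C, T) first time-domain feature data
        outs = [branch(x) for branch in self.branches]
        return torch.stack(outs, dim=0).sum(dim=0)   # feature joint layer 906: add the four scales

second_feature = TemporalFeatureSketch()(torch.randn(3136, 64, 8))
print(second_feature.shape)  # torch.Size([3136, 64, 1])
```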
  • FIG. 11 is a schematic diagram of a data assembly method in an embodiment of the application.
  • the first feature data and the second feature data can be assembled as shown in FIG. 11 and then input into the feature joint layer 906 for data fusion.
  • After the second feature data is obtained, it can be input into the first classifier to obtain, for the gesture actions in the multiple images, the first probability of belonging to each type, and a first gesture action is classified into the type with the largest first probability for that first gesture action, where the first gesture action is any one of the gesture actions in the multiple images.
  • The second feature data is also input into the second classifier to obtain the second probability of the gesture actions in the multiple images belonging to each attribute, and the first gesture action is classified into the attribute with the largest second probability for that first gesture action.
  • The output values of the first classifier and the second classifier can be passed to two outputs: one output can use softmax logistic regression (softmax regression) for classification to characterize the type, and the other output can use a sigmoid function to characterize the attribute.
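A minimal sketch of such a two-headed classification module is shown below, assuming PyTorch; the feature size, the number of gesture types, and the flattening of the second feature data into one vector per sequence are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ClassificationHeadsSketch(nn.Module):
    # Illustrative sketch of the classification module: a softmax head for the gesture type
    # and a sigmoid head for the attribute (outbound vs. return). Assumed PyTorch;
    # feature_dim and num_types are placeholders.
    def __init__(self, feature_dim=64, num_types=4):
        super().__init__()
        self.type_head = nn.Linear(feature_dim, num_types)   # first classifier
        self.attr_head = nn.Linear(feature_dim, 1)            # second classifier

    def forward(self, second_feature):                         # (N, feature_dim)
        type_prob = torch.softmax(self.type_head(second_feature), dim=-1)  # first probability
        attr_prob = torch.sigmoid(self.attr_head(second_feature))          # second probability
        return type_prob.argmax(dim=-1), (attr_prob > 0.5).long()          # predicted type / attribute

heads = ClassificationHeadsSketch()
types, attrs = heads(torch.randn(4, 64))   # e.g. 4 gesture sequences
```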
  • In the dynamic gesture recognition method described in FIG. 6, multiple images containing gesture actions are acquired, and the multiple images are recognized through the spatial feature module 301, the temporal feature module 302, and the classification module 303 of the gesture recognition model 300 shown in FIG. 3 to obtain the types and attributes of the gesture actions in the multiple images.
  • The spatial feature module 301 produces the first feature data, which includes the spatial features of the gesture actions in the multiple images; the temporal feature module 302 produces the second feature data, which includes the time-domain features of the first feature data in the time dimension.
  • The classification module 303 produces the types and attributes of the gesture actions in the multiple images. Therefore, based on the gesture recognition model 300 and the dynamic gesture recognition method described above, more comprehensive feature information about gesture actions can be obtained from the multiple images, thereby improving the recognition accuracy for hand actions.
  • FIG. 12 is a schematic flowchart of another dynamic gesture recognition method in an embodiment of the present invention.
  • the dynamic gesture recognition method may include the following steps.
  • Step 1201: acquire multiple images containing gesture actions. Step 1201 is the same as step 601; for a detailed description, refer to step 601, which is not repeated here.
  • Step 1202: recognize the multiple images through the gesture recognition model to obtain the types and attributes of the gesture actions in the multiple images. Step 1202 is the same as step 602; for a detailed description, refer to step 602, which is not repeated here.
  • Step 1203: when the attribute of the gesture action in the multiple images is the outbound journey, execute the function corresponding to the type of the gesture action, as illustrated in the sketch below. For example, if the type and attribute of the gesture action are leftward and outbound respectively, a leftward command is executed in response to the gesture action; if the type and attribute are leftward and return respectively, no command is executed.
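A minimal sketch of this response logic is given below; the `send_command` helper and the command names are hypothetical placeholders for the terminal device's actual command interface.

```python
def send_command(command: str) -> None:
    # Placeholder for the terminal device's actual command interface (hypothetical).
    print(f"executing {command}")

def respond_to_gesture(gesture_type: str, attribute: str) -> None:
    # Only the outbound journey triggers a response; the return journey is ignored,
    # which avoids misjudging the hand as it moves back to its starting point.
    if attribute != "outbound":
        return
    commands = {"left": "SWIPE_LEFT", "right": "SWIPE_RIGHT", "up": "SWIPE_UP", "down": "SWIPE_DOWN"}
    if gesture_type in commands:
        send_command(commands[gesture_type])

respond_to_gesture("left", "outbound")  # executes the leftward command
respond_to_gesture("left", "return")    # no command executed
```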
  • The gesture recognition model may be an AI model. Before the gesture recognition model is used for recognition, the initial gesture recognition model needs to be trained so that the trained gesture recognition model has the ability to recognize the types and attributes of gesture actions in multiple images.
  • The gesture recognition model in this application can thus determine both the type of a gesture action and its attribute (outbound journey or return journey).
  • FIG. 13 is a schematic flowchart of a method for training a gesture recognition model in an embodiment of the present invention. As shown in FIG. 13, the method for training a gesture recognition model may include the following steps.
  • In the process of training the initial gesture recognition model, sample images carrying annotation information are required. The sample images record gesture actions, and the annotation information can include the types and attributes of the gesture actions in the sample images.
  • The type information of a gesture action indicates the type of the gesture action, such as "continuous left", "continuous right", "continuous up", or "continuous down".
  • The attribute information indicates the attribute of the gesture action in the multiple images; the attribute can be either the outbound journey or the return journey.
  • For example, the gesture action is the outbound journey in image 0 to image 7 and the return journey in image 8 to image 15.
  • the annotation information can be saved in files such as extensible markup language (XML) or JavaScript object notation (JSON).
  • The annotation information of a gesture action, including its type and attribute, can be obtained by running a gesture detection algorithm on the sample images to extract the type information and attribute information of the gesture actions recorded in them; alternatively, the type information and attribute information can be obtained by manual labeling.
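As an illustration only, an annotation record for one sample sequence could be written as JSON along the following lines; the field names are hypothetical and not defined by the patent.

```python
import json

# Hypothetical annotation record for one sample image sequence (field names are illustrative).
annotation = {
    "sample_id": "sample_0001",
    "frames": ["image_0.jpg", "image_1.jpg", "image_7.jpg"],  # shortened list for brevity
    "gesture_type": "continuous_left",    # type of the gesture action
    "gesture_attribute": "outbound",      # "outbound" or "return"
}
with open("sample_0001.json", "w") as f:
    json.dump(annotation, f, ensure_ascii=False, indent=2)
```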
  • After the sample images are obtained, the initial gesture recognition model is determined.
  • The initial gesture recognition model can be an AI model; specifically, a deep neural network model can be selected. The network can recognize the types of gesture actions and can also recognize the attributes of gesture actions.
  • the initial gesture recognition model may include a spatial feature module, a temporal feature module, a loss function calculation module, and a classification module.
  • FIG. 14 is a schematic diagram of feature extraction in gesture recognition in an embodiment of the present invention.
  • For a sample image, the spatial feature module can detect the position information of the hand in the image, mark that position with a rectangular frame, and then perform key point detection on the hand inside the rectangular frame.
  • The temporal feature module can then obtain the time-domain features of the key points according to the time information of the multiple images; the time-domain features of the key points across consecutive images can be obtained by the difference method.
  • FIG. 15 is a schematic diagram of a temporal feature extraction in an embodiment of the present invention.
  • the key points of the hand can include fingertip points and key points of the phalanges, and the displacement information and speed information of the key points of the hand can be extracted.
  • The displacement of a hand key point relative to the image can be represented by S_x and S_y, where S_x denotes the displacement in the x direction and S_y denotes the displacement in the y direction.
  • When consecutive images are captured at the same frame rate, S_x and S_y can also represent the movement speed of the hand key points.
  • Empirically, gestures move at different speeds for the outbound and return attributes; therefore, when the frame rates of consecutive images differ, the movement speed of the hand key points can additionally be extracted. For different key points, multiple pieces of displacement and speed information can be extracted and combined into a vector F = [f_1, f_2, f_3, ..., f_n].
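The following is a minimal sketch of how such per-key-point displacement and speed features could be computed from two consecutive frames, assuming NumPy arrays of detected key point coordinates; the frame-interval normalization is an assumption.

```python
import numpy as np

def keypoint_motion_features(prev_pts, curr_pts, dt=1.0):
    # prev_pts, curr_pts: (num_keypoints, 2) arrays of (x, y) coordinates of hand key points.
    # dt: time between the two frames; when frame rates differ, dt normalizes the speed (assumed).
    disp = curr_pts - prev_pts          # S_x, S_y per key point (difference method)
    speed = disp / dt                   # movement speed of each key point
    return np.concatenate([disp.ravel(), speed.ravel()])  # feature vector F = [f_1, ..., f_n]

prev_pts = np.array([[100.0, 120.0], [110.0, 130.0]])   # e.g. fingertip and phalange key points
curr_pts = np.array([[112.0, 119.0], [123.0, 131.0]])
print(keypoint_motion_features(prev_pts, curr_pts, dt=1 / 30))
```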
  • The features extracted by the spatial feature module and the temporal feature module are detected and recognized to predict the type and attribute of the gesture action in the input sample image, and the prediction is output to the loss function calculation module together with the annotation information corresponding to that sample image. The loss function calculation module compares the predicted result with the annotation information corresponding to the sample image and calculates the loss function; using the loss function as the objective function, optimization algorithms such as backpropagation (BP), gradient descent (GD), or stochastic gradient descent (SGD) are used to update and adjust the weight parameters in the initial gesture recognition model and the classifiers.
  • Sample images carrying annotation information are input in a loop and the above training process is iterated until the probability that the prediction obtained from the initial gesture recognition model and the classifiers matches the annotation information of the sample images reaches the expected value. At that point a gesture recognition model meeting the requirements has been obtained and training can end; the resulting gesture recognition model has the ability to recognize the types and attributes of gesture actions in multiple images and can be used for dynamic gesture recognition. A sketch of such a training loop follows.
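Below is a minimal, assumption-laden sketch of this training loop in PyTorch; the model, the data loader, and the choice of cross-entropy plus binary cross-entropy as the type and attribute losses are placeholders, not the patent's prescribed configuration.

```python
import torch
import torch.nn as nn

# Illustrative training loop (assumed PyTorch). `model` is a stand-in for the initial
# gesture recognition model with a type head and an attribute head; `loader` yields
# (images, type_label, attr_label) tuples built from the annotated sample images.
def train(model, loader, epochs=10, lr=1e-3):
    type_loss_fn = nn.CrossEntropyLoss()       # loss for the gesture type (softmax head)
    attr_loss_fn = nn.BCEWithLogitsLoss()      # loss for the attribute (sigmoid head)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # SGD optimizer

    for _ in range(epochs):
        for images, type_label, attr_label in loader:
            type_logits, attr_logits = model(images)
            loss = type_loss_fn(type_logits, type_label) + attr_loss_fn(attr_logits, attr_label.float())
            optimizer.zero_grad()
            loss.backward()                    # backpropagation (BP)
            optimizer.step()                   # update the weight parameters
    return model
```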
  • FIG. 16 is a schematic structural diagram of a dynamic gesture recognition device disclosed in an embodiment of the present invention.
  • the dynamic gesture recognition apparatus 1600 may include:
  • the first acquiring unit 1601 is configured to acquire multiple images containing gesture actions
  • the recognition unit 1602 is used for recognizing multiple images through a gesture recognition model to obtain the types and attributes of gesture actions in the multiple images.
  • The attributes include the outbound journey and the return journey.
  • the apparatus 1600 may further include:
  • the second acquiring unit 1603 is configured to acquire multiple sample images carrying annotation information, the sample images are multiple images containing gesture actions, and the annotation information includes the types and attributes of the gesture actions in the sample images;
  • the training unit 1604 is configured to train the initial gesture recognition model to obtain the gesture recognition model according to a plurality of sample images carrying annotation information.
  • the gesture recognition model includes a spatial feature module, a temporal feature module, and a classification module;
  • the identification unit 1602 is specifically used for:
  • inputting the multiple images into the spatial feature module to obtain first feature data, where the first feature data includes the spatial features of the gesture actions in the multiple images;
  • inputting the first feature data into the temporal feature module to obtain second feature data, where the second feature data includes the time-domain features of the first feature data in the time dimension;
  • inputting the second feature data into the classification module to obtain the types and attributes of the gesture actions in the multiple images.
  • In an optional implementation, the temporal feature module includes a dimensional transformation layer, a convolutional layer, a BN layer, a ReLu layer, a maximum pooling layer, and a feature joint layer.
  • When the identification unit 1602 is configured to input the first feature data into the temporal feature module to obtain the second feature data, where the second feature data includes the time-domain features of the first feature data in the time dimension, it is specifically configured to: determine, through the dimensional transformation layer and according to the time information of the multiple images, the first time-domain feature data corresponding to the first feature data in the time dimension; perform convolution processing on the first time-domain feature data through the convolutional layer to obtain second time-domain feature data; and pass the second time-domain feature data through the BN layer, the ReLu layer, the maximum pooling layer, and the feature joint layer in sequence to obtain the second feature data.
  • When the identification unit 1602 is configured to perform convolution processing on the first time-domain feature data through the convolutional layer to obtain the second time-domain feature data, it is specifically configured to: use a first preset number of one-dimensional convolutional layers with different convolution kernel sizes to convolve the first time-domain feature data, obtaining second time-domain feature data that includes a first preset number of feature data of different scales.
  • In this case, passing the second time-domain feature data through the BN layer, the ReLu layer, the maximum pooling layer, and the feature joint layer in sequence to obtain the second feature data includes: passing the first preset number of feature data of different scales through the BN layer, the ReLu layer, the maximum pooling layer, and the feature joint layer in sequence to obtain the second feature data.
  • In an optional implementation, the gesture recognition model further includes a first classifier and a second classifier. When the recognition unit 1602 is configured to input the second feature data into the classification module to obtain the types and attributes of the gesture actions in the multiple images, it is specifically configured to: input the second feature data into the first classifier to obtain, for the gesture actions in the multiple images, the first probability of belonging to each type; classify a first gesture action into the type with the largest first probability for that first gesture action, where the first gesture action is any one of the gesture actions in the multiple images; input the second feature data into the second classifier to obtain the second probability of the gesture actions in the multiple images belonging to each attribute; and classify the first gesture action into the attribute with the largest second probability for that first gesture action.
  • the first obtaining unit 1601 is specifically configured to:
  • obtaining a video to be recognized; and extracting one image from the video to be recognized at intervals of a second preset number of images, to obtain the multiple images containing gesture actions, as sketched below.
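A minimal sketch of this sampling scheme is shown below, assuming OpenCV for video decoding; the interval of 14 frames and the group size of 8 follow the example given in the description, and are otherwise placeholders.

```python
import cv2

def sample_gesture_frames(video_path, interval=14, group_size=8):
    # Keep one frame every `interval` frames (e.g. frames 1, 15, 29, ... in 1-based counting)
    # and collect them into groups of `group_size` images containing the gesture action.
    cap = cv2.VideoCapture(video_path)
    frames, groups, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:              # 0-based indices 0, 14, 28, ...
            frames.append(frame)
            if len(frames) == group_size:
                groups.append(frames)
                frames = []
        idx += 1
    cap.release()
    return groups
```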
  • In an optional implementation, the apparatus 1600 may further include an execution unit 1605, configured to execute the function corresponding to the type of the gesture action when the attribute of the gesture action in the multiple images is the outbound journey.
  • FIG. 17 is a schematic structural diagram of a computing device disclosed in an embodiment of the present invention.
  • the computing device 1700 may include: a memory 1701, a processor 1702, a communication interface 1703, and a bus 1704.
  • the memory 1701, the processor 1702, and the communication interface 1703 implement communication connections between each other through the bus 1704.
  • the memory 1701 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 1701 may store a program. When the program stored in the memory 1701 is executed by the processor 1702, the processor 1702 and the communication interface 1703 are used to execute each step of the dynamic gesture recognition method of the embodiment of the present application.
  • The processor 1702 may be a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a graphics processing unit (graphics processing unit, GPU), or one or more integrated circuits, and is configured to execute related programs so as to implement the functions required by the units in the dynamic gesture recognition apparatus of the embodiments of this application, or to execute the dynamic gesture recognition method of the method embodiments of this application.
  • the processor 1702 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the dynamic gesture recognition method of the present application can be completed by the integrated logic circuit of hardware in the processor 1702 or instructions in the form of software.
  • The aforementioned processor 1702 may also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • The storage medium is located in the memory 1701; the processor 1702 reads the information in the memory 1701 and, in combination with its hardware, completes the functions required by the units included in the dynamic gesture recognition apparatus of the embodiments of this application, or executes the dynamic gesture recognition method of the method embodiments of this application.
  • the communication interface 1703 uses a transceiving device such as but not limited to a transceiver to implement communication between the device 1700 and other devices or a communication network.
  • the bus 1704 may include a path for transferring information between various components of the device 1700 (for example, the memory 1701, the processor 1702, and the communication interface 1703).
  • the computer-readable medium may include a computer-readable storage medium, which corresponds to a tangible medium, such as a data storage medium, or a communication medium that includes any medium that facilitates the transfer of a computer program from one place to another (for example, according to a communication protocol) .
  • a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementing the techniques described in this application.
  • the computer program product may include a computer-readable medium.
  • Such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium.
  • For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
  • the computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other temporary media, but are actually directed to non-transitory tangible storage media.
  • As used herein, disks and discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or any other structure suitable for implementing the techniques described herein.
  • In addition, in some aspects, the functions described by the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec.
  • the technology may be fully implemented in one or more circuits or logic elements.
  • The techniques of this application can be implemented in a wide variety of devices or apparatuses, including wireless handsets, integrated circuits (ICs), or a set of ICs (for example, a chipset).
  • Various components, modules, or units are described in this application to emphasize the functional aspects of the device for implementing the disclosed technology, but they do not necessarily need to be implemented by different hardware units.
  • In practice, as described above, the various units may be combined in a codec hardware unit in conjunction with suitable software and/or firmware, or provided by interoperating hardware units (including one or more processors as described above).

Abstract

The present invention discloses a dynamic gesture recognition method and device in the field of artificial intelligence (AI). The dynamic gesture recognition method includes: acquiring multiple images containing a gesture action; and recognizing the multiple images through a gesture recognition model to obtain the type and attribute of the gesture action in the multiple images, the attribute including an outbound journey and a return journey. The above method can improve recognition accuracy.

Description

一种动态手势识别方法及设备
本申请要求于2020年03月27日提交中国专利局、申请号为202010235859.6、申请名称为“一种动态手势识别方法及设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能(artificial intelligence,AI)领域,特别涉及一种动态手势识别方法及设备。
背景技术
手势是一种富有表现力的身体动作,能够传达各种有意义的信息。动态手势识别一直是深度学习领域的研究热点之一,作为一种新兴的人机交互(human-computer interaction,HCI)方式,动态手势识别在虚拟现实、智能家居、儿童早教、医用机器人等众多领域具有广阔的应用前景。在动态手势识别中,当用户要往一个方向连续挥动手势的时候,一定会往另一个方向回到起点,这就是手势回程,手势回程容易导致终端设备对手势方向的误判情况出现。例如,用户需要连续向右挥动两次手势,第二次挥动的时候需要将手返回至第一次挥动的起点重新挥动,在返回的过程中,终端设备很容易将其判断成向左的手势。
因此,如何实现对手势动作进行准确的识别是目前亟待解决的问题。
发明内容
本申请实施例提供一种动态手势识别方法及设备,可以提高识别精度。
第一方面,本申请实施例提供一种动态手势识别方法,包括:获取包含手势动作的多张图像;通过手势识别模型识别所述多张图像,得到所述多张图像中手势动作的类型和属性,所述属性包括去程和回程。
在本申请提供的方案中,利用手势识别模型识别包含手势动作的多张图像,可以得到所述多张图像中手势动作的类型和属性,进而根据得到的多张图像中手势动作的类型和属性进行对应的操作,可以避免因回程问题造成的终端对手势误判的情况。
在一种可能的实现方式中,获取多个携带标注信息的样本图像,所述样本图像为包含手势动作的多张图像,所述标注信息包括所述样本图像中手势动作的类型和属性;根据所述多个携带标注信息的样本图像对初始手势识别模型进行训练得到所述手势识别模型。
在本申请提供的方案中,可以提前获取样本图像中记录的手势动作在样本图像中的类型和属性,然后利用多个携带类型和属性的样本图像对初始手势识别模型进行训练,以使得训练完成的手势识别模型具备识别多张图像中记录的手势动作的类型和属性的能力,这样可以对输入手势识别模型的多张图像进行识别,从而可以输出多张图像中记录的手势动作的类型和属性。
在一种可能的实现方式中,所述手势识别模型包括空间特征模块、时域特征模块和分类模块;所述通过手势识别模型识别所述多张图像,得到所述多张图像中手势动作的类型和属性,包括:将所述多张图像输入所述空间特征模块,得到第一特征数据,所述第一特 征数据包括所述多张图像中手势动作的空间特征;将所述第一特征数据输入所述时域特征模块,得到第二特征数据,所述第二特征数据包括所述第一特征数据在时间维度上的时域特征;将所述第二特征数据输入所述分类模块,得到所述多张图像中手势动作的类型和属性。
在本申请提供的方案中,通过手势识别模型的空间特征模块对输入的多张图像进行空间特征提取之后,得到第一特征数据,手势识别模型的时域特征模块提取多张图像针对第一特征数据在时间维度上的时域特征,最后输入手势识别模型的分类模块,得到所述多张图像中手势动作的类型和属性。
在一种可能的实现方式中,所述时域特征模块包括维度变换层、卷积层、批量标准化(batch normalization,BN)层、线性修正单元(rectified linear unit,ReLu)层、最大池化层和特征联合层;所述将所述第一特征数据输入所述时域特征模块,得到第二特征数据,所述第二特征数据包括所述第一特征数据在时间维度上的时域特征,包括:按照所述多张图像的时间信息通过所述维度变换层确定所述第一特征数据在时间维度上对应的第一时域特征数据;通过所述卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据;将所述第二时域特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,获得所述第二特征数据。
在本申请提供的方案中,所述时域特征模块的维度变换层按照多张图像中各张图像的时间信息,确定所述多张图像对应的第一特征数据中位置相同的像素点在时间维度上对应的第一时域特征数据,所述时域特征模块的卷积层对所述第一特征数据进行卷积处理,得到对应的第二时域特征数据,将所述第二时域特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,获得所述第二特征数据。
在一种可能的实现方式中,所述通过卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据包括:使用第一预设数量个卷积核大小不同的一维卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据,所述第二时域特征数据包括第一预设数量个不同尺度的特征数据;所述将所述第二时域特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,获得所述第二特征数据包括:将所述第一预设数量个不同尺度的特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,得到所述第二特征数据。
在本申请提供的方案中,针对所述第一时域特征数据,所述时域特征模块的卷积层用第一预设数量个卷积核大小不同的一维卷积层对第一时域特征数据进行卷积处理,得到第一预设数量个不同尺度的特征数据,所述特征数据经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,所述特征联合层融合第一时域特征数据对应的第一预设数量个不同尺度的特征数据,得到所述第一特征数据对应的第二特征数据。使用一维卷积层可以有效降低计算量,提高时域特征模块的卷积层的处理效率。
在一种可能的实现方式中,所述手势识别模型还包括第一分类器和第二分类器,所述将所述第二特征数据输入所述分类模块,得到所述多张图像中手势动作的类型和属性包括:将所述第二特征数据输入所述第一分类器,得到所述多张图像中的手势动作属于每个类型的第一概率;将第一手势动作归类至所述第一手势动作对应的第一概率最大的类型,所述 第一手势动作为所述多张图像中的手势动作中的任一手势动作;将所述第二特征数据输入所述第二分类器,得到所述多张图像中的手势动作属于每个属性的第二概率;将所述第一手势动作归类至所述第一手势动作对应的第二概率最大的属性。
在本申请提供的方案中,将所述第二特征数据输入已训练好的第一分类器和第二分类器,得到手势动作属于每个类型的第一概率和属于每个属性的第二概率,将第一手势动作归类至所述第一手势动作对应的第一概率最大的类型和第二概率最大的属性。
在一种可能的实现方式中,所述获取包含手势动作的多张图像包括:获取待识别视频;从所述待识别视频中每间隔第二预设数量张图像抽取一张图像,得到包括包含手势动作的多张图像。
在本申请提供的方案中,获取待识别视频,按待识别视频中图像的时序,从待识别视频中每间隔第二预设数量张图像抽取一张图像,在抽取的图像张的数量达到第三预设数量的情况下,将抽取的第三预设数量张图像确定为所述多张图像。
在一种可能的实现方式中,所述方法还包括:在所述多张图像中手势动作的属性为去程的情况下,执行所述手势动作的类型对应的功能。
在本申请提供的方案中,通过手势识别模型识别出所述多张图像中手势动作的类型和属性之后,在所述属性为去程的情况下,终端设备执行识别出的手势动作的类型对应的功能,在所述属性为回程的情况下,终端设备不作处理。
第二方面,本申请实施例提供一种动态手势识别装置,包括:第一获取单元,用于获取包含手势动作的多张图像;识别单元,用于通过手势识别模型识别所述多张图像,得到所述多张图像中手势动作的类型和属性,所述属性包括去程和回程。
在一种可能的实现方式中,所述装置还包括:第二获取单元,用于获取多个携带标注信息的样本图像,所述样本图像为包含手势动作的多张图像,所述标注信息包括所述样本图像中手势动作的类型和属性;训练单元,用于根据所述多个携带标注信息的样本图像对初始手势识别模型进行训练得到所述手势识别模型。
在一种可能的实现方式中,所述手势识别模型包括空间特征模块、时域特征模块和分类模块;所述识别单元,具体用于:将所述多张图像输入所述空间特征模块,得到第一特征数据,所述第一特征数据包括所述多张图像中手势动作的空间特征;将所述第一特征数据输入所述时域特征模块,得到第二特征数据,所述第二特征数据包括所述第一特征数据在时间维度上的时域特征;将所述第二特征数据输入所述分类模块,得到所述多张图像中手势动作的类型和属性。
在一种可能的实现方式中,所述时域特征模块包括维度变换层、卷积层、BN层、ReLu层、最大池化层和特征联合层;所述识别单元用于将所述第一特征数据输入所述时域特征模块,得到第二特征数据,所述第二特征数据包括所述第一特征数据在时间维度上的时域特征时,具体用于:按照所述多张图像的时间信息通过所述维度变换层确定所述第一特征数据在时间维度上对应的第一时域特征数据;通过所述卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据;将所述第二时域特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,获得所述第二特征数据。
在一种可能的实现方式中,所述识别单元用于通过所述卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据时,具体用于:使用第一预设数量个卷积核大小不同的一维卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据,所述第二时域特征数据包括第一预设数量个不同尺度的特征数据;所述将所述第二时域特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,获得所述第二特征数据包括:将所述第一预设数量个不同尺度的特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,得到所述第二特征数据。
在一种可能的实现方式中,所述手势识别模型还包括第一分类器和第二分类器,所述识别单元用于将所述第二特征数据输入所述分类模块,得到所述多张图像中手势动作的类型和属性时,具体用于:将所述第二特征数据输入所述第一分类器,得到所述多张图像中的手势动作属于每个类型的第一概率;将第一手势动作归类至所述第一手势动作对应的第一概率最大的类型,所述第一手势动作为所述多张图像中的手势动作中的任一手势动作;将所述第二特征数据输入所述第二分类器,得到所述多张图像中的手势动作属于每个属性的第二概率;将所述第一手势动作归类至所述第一手势动作对应的第二概率最大的属性。
在一种可能的实现方式中,所述第一获取单元,具体用于:获取待识别视频;从所述待识别视频中每间隔第二预设数量张图像抽取一张图像,得到包括包含手势动作的多张图像。
在一种可能的实现方式中,所述装置还包括:执行单元,用于在所述多张图像中手势动作的属性为去程的情况下,执行所述手势动作的类型对应的功能。
第三方面,本申请实施例提供一种计算设备,所述计算设备包括处理器和存储器,所述存储器用于存储程序,所述处理器执行所述存储器存储的程序,当所述存储器存储的程序被执行时,使得所述计算设备实现上述第一方面以及结合上述第一方面中的任意一种实现方式所提供的动态手势识别方法。
第四方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读介质用于存储有计算机可执行指令,所述计算机可执行指令在被所述计算机调用时用于使所述计算机实现上述第一方面以及结合上述第一方面中的任意一种实现方式所提供的动态手势识别方法。
第五方面,本申请实施例提供一种计算机程序产品,该计算机程序产品包括指令,当该计算机程序产品被计算机执行时,使得计算机可以执行上述第一方面以及结合上述第一方面中的任意一种实现方式所提供的动态手势识别方法的流程。
附图说明
图1为本申请实施例中的一种动态手势交互的场景;
图2为本申请实施例中的一种动态手势识别系统架构示意图;
图3为本申请实施例中的一种手势识别模型的示意图;
图4为本申请实施例中的一种CNN的示意图;
图5为本申请实施例中的一种芯片硬件结构示意图;
图6为本发明实施例中的一种动态手势识别方法的流程示意图;
图7为本申请实施例中的一种空间特征模块的示意图;
图8为本申请实施例中的一种第一特征数据提取的示意图;
图9为本申请实施例中的一种时域特征模块的示意图;
图10为本申请实施例中的一种维度变换层进行转换Reshape操作的示意图;
图11为本发明实施例中的另一种动态手势识别方法的流程示意图;
图12为本发明实施例中的另一种动态手势识别方法的流程示意图;
图13为本发明实施例中的一种手势识别模型训练方法的流程示意图;
图14为本发明实施例中的一种手势动作识别中特征提取的示意图;
图15为本发明实施例中的一种时域特征提取的示意图;
图16为本发明实施例中的一种动态手势识别装置的结构示意图;
图17为本发明实施例公开的一种计算设备的结构示意图。
具体实施方式
下面结合附图对本申请实施例中的技术方案进行清楚、完整的描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
下面将结合附图,对本申请中的技术方案进行描述。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)人工智能
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
(2)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为:
Figure PCTCN2021079699-appb-000001
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(3)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
Figure PCTCN2021079699-appb-000002
其中,
Figure PCTCN2021079699-appb-000003
是输入向量,
Figure PCTCN2021079699-appb-000004
是输出向量,b是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量
Figure PCTCN2021079699-appb-000005
经过如此简单的操作得到输出向量
Figure PCTCN2021079699-appb-000006
由于DNN层数多,则系数W和偏移向量b的数量也就很多了。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为
Figure PCTCN2021079699-appb-000007
上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是:第L-1层的第k个神经元到第L层的第j个神经元的系数定义为
Figure PCTCN2021079699-appb-000008
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(4)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像 上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(5)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(6)手势回程
在动态手势识别中,当用户要往一个方向连续挥动手势的时候,一定会往另一个方向回到起点,这就是手势回程,手势回程容易导致终端设备对手势方向的误判情况出现。
随着计算机技术的飞速发展,动态手势识别已经成为人机交互方式之一。图1为本申请实施例中的一种动态手势交互的场景,如图1所示,在交互设计领域中,用户可以使用一只手或两只手对终端设备进行无接触操作,终端设备通过动态手势识别来响应用户的手势并且执行相关命令。目前主流的动态手势识别方法可分为两种:第一种为结合神经网络和视频输入的方法实现动态手势识别方法,该方法基于多图像输入,使用卷积神经网络(convolutional neuron network,CNN)抽取空间特征(图像的特征),使用一维卷积(one dimensional convolution,1DCONV)或者全连接网络(multilayer perceptron,MLP)抽取时域特征,最后得到视频中动态手势识别结果。该方法可以端到端完成动态手势识别的自识别(训练过程中学习动作的特征),但存在比较严重的回程问题。第二种为通过神经网络进行静态图像识别(检测跟踪、分类或关键点识别),通过连续帧的分类的组合,例如位置以及手部形态的分类,推测动态的动作。该方法可以通过一些分类结果以及分类门限调整动态手势识别的准确率,但回程问题也很难解决。
基于上述问题,本申请提供了一种动态手势识别的方法,可以获取包含手势动作的多张图像,通过手势识别模型识别多张图像,得到多张图像中手势动作的类型和属性,该属性包括去程和回程,然后根据得到的手势动作的类型和属性执行对应的操作命令。通过这种方法,可以解决动态手势识别中回程去程的识别,提高手势动作的识别精度。
下面介绍本申请实施例提供的系统架构。
请参见图2,图2为本申请实施例中的一种动态手势识别系统架构示意图。如图2所 示,动态手势识别系统架构200可以包括执行设备210、训练设备220、数据库230、用户设备240、数据存储系统250和数据采集设备260。
数据采集设备260用于采集包含手势动作的多张图像数据,并将多张图像数据存入数据库230,训练设备220基于数据库230中维护的多张图像数据训练得到手势识别模型201。训练过程可以包括:训练设备220将多张图像数据输入初始手势识别模型221,得到手势识别模型201,初始手势识别模型221为深度神经网络。深度神经网络中的每一层的工作可以用数学表达式
Figure PCTCN2021079699-appb-000009
来描述:从物理层面深度神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、“弯曲”。其中1、2、3的操作由
Figure PCTCN2021079699-appb-000010
完成,4的操作由+b完成,5的操作则由a()来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,W是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文所述的输入空间到输出空间的空间变换,即每一层的权重W控制着如何变换空间。训练深度神经网络的目的,也就是最终得到训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。
因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
根据训练设备220训练得到的手势识别模型201可以应用于不同的系统或设备中,如应用于图2所示的执行设备210,所述执行设备210可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。执行设备210可以执行本申请实施例中的动态手势识别方法。在图2中,执行设备210配置有I/O接口212,用于与外部设备进行数据交互,用户可以通过用户设备240向I/O接口212输入数据,所述输入数据在本申请实施例中可以为包含手势动作的多张图像数据,也可以为向执行设备210请求对动态手势进行识别的请求。
在执行设备210的计算模块211执行计算等相关的处理过程中,执行设备210可以调 用数据存储系统250中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统250中。
计算模块211可以使用手势识别模型201对输入的包含手势动作的多张图像数据进行处理,具体地,先获取包含手势动作的多张图像,通过手势识别模型201中的空间特征模块得到第一特征数据,将第一特征数据输入手势识别模型201中的时域特征模块得到第二特征数据,将第二特征数据输入手势识别模型201中的分类模块,得到多张图像中手势动作的类型和属性。
最后,I/O接口212将处理结果,如上述手势识别模型201的识别方法得到多张图像中手势动作的类型和属性返回给用户设备240。该用户设备240可以是终端,如手机终端、笔记本电脑、AR/VR终端或车载终端等,以用于响应与终端用户的相应需求。
在图2中所示的情况下,用户可以手动给定输入数据(如本申请实施例中包含手势动作的多张图像),该手动给定可以通过I/O接口212提供的界面进行操作。另一种情况下,用户设备240可以自动地向I/O接口212发送输入数据,如果要求用户设备240自动发送输入数据需要获得用户的授权,则用户可以在用户设备240中设置相应权限。用户可以在用户设备240查看执行设备210输出的识别结果,识别结果包括多张图像中手势动作的类型和属性。用户设备240在接收到识别结果后,可以将识别结果转换成相应的指令以响应于用户的动态手势。
值得注意的是,图2仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图2中,数据存储系统250相对执行设备210是外部存储器,在其它情况下,也可以将数据存储系统250置于执行设备210中。数据采集设备260可以是相对于用户设备240单独的外部设备,也可以是置于用户设备240中的内部设备。
如图2所示,根据训练设备220训练得到的可以是本实施例中的手势识别模型201,具体的,本申请实施例提供的手势识别模型201可以是用于动态手势识别的神经网络模型。
请参阅图3,图3为本申请实施例中的一种手势识别模型的示意图。如图3所示,手势识别模型300可以包括空间特征模块301、时域特征模块302和分类模块303,时域特征模块302可以设置在空间特征模块301后面。图3中的空间特征模块301从输入的包含手势动作的多张图像中逐层提取第一特征数据,第一特征数据中包含表征手势动作在多张图像中的空间特征。时域特征模块302的输入数据为位于其上一级的空间特征模块301输出的第一特征数据。时域特征模块302对第一特征数据进行处理,得到第二特征数据。分类模块303的输入数据为位于其上一级的时域特征模块302输出的第二特征数据,分类模块303对第二特征数据进行分类,确定多张图像中手势动作的类型和属性。分类模块303的输出值可以被传递给两个输出,一个输出可以采用softmax逻辑回归(softmax regression)进行分类用于表征手势动作的类型,另一个输出可以采用sigmoid函数用于表征手势动作的属性。
具体实施时,手势识别模型300可以包括多个空间特征模块和多个时域特征模块,多个空间特征模块的结构可以相同,也可以不同。单个空间特征模块可以仅包含一个神经网络层,例如,单个空间特征模块中仅包含一个卷积层;单个空间特征模块也可以包括多个 相同或不同的神经网络层,例如,单个空间特征模块中包含卷积层和池化层,或者单个空间特征模块中包含多个不同的卷积层。图3所述的手势识别模型300仅为一个示例,实际应用中,手势识别模型300包含的空间特征模块的数量、结构、位置和时域特征模块的数量、结构、位置均可根据实际需求设定,本申请实施例不作限定。
在本实施例中,空间特征模块301可以是CNN架构。
请参阅图4,图4为本申请实施例中的一种CNN的示意图。如前文的基础概念介绍所述,CNN是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的包含手势动作的多张图像作出响应。
如图4所示,卷积神经网络(CNN)400可以包括输入层410,卷积层/池化层420(其中池化层为可选的),以及神经网络层430。
卷积层/池化层420:
卷积层:
如图4所示卷积层/池化层420可以包括如示例421-426层,举例来说:在一种实现中,421层为卷积层,422层为池化层,423层为卷积层,424层为池化层,425为卷积层,426为池化层;在另一种实现方式中,421、422为卷积层,423为池化层,424、425为卷积层,426为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层421为例,介绍一层卷积层的内部工作原理。
卷积层421可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络400进行正确的预测。
当卷积神经网络400有多个卷积层的时候,初始的卷积层(例如421)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络400深度的加深,越往后的卷积层(例如426)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图4中420所示例的421-426各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层430:
在经过卷积层/池化层420的处理后,卷积神经网络400还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层520只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络400需要利用神经网络层430来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层430中可以包括多层隐含层(如图4所示的431、432至43n)以及输出层440,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到。
在神经网络层430中的多层隐含层之后,也就是整个卷积神经网络400的最后层为输出层440,该输出层440具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络400的前向传播(如图4由410至440方向的传播为前向传播)完成,反向传播(如图4由440至410方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络400的损失,及卷积神经网络400通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图4所示的卷积神经网络400仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。
下面介绍本申请实施例提供的一种芯片硬件结构。
请参阅图5,图5为本申请实施例中的一种芯片硬件结构示意图。如图5所示,该芯片包括神经网络处理器50。该芯片可以被设置在如图2所示的执行设备210中,用以完成计算模块211的计算工作。该芯片也可以被设置在如图2所示的训练设备220中,用以完成训练设备220的训练工作并输出目标模型201。如图3所示的手势识别模型中各模块的算法均可在如图5所示的芯片中得以实现。
神经网络处理器50可以是NPU,TPU,或者GPU等一切适合用于大规模异或运算处 理的处理器。以NPU为例:NPU可以作为协处理器挂载到主CPU(host CPU)上,由主CPU为其分配任务。NPU的核心部分为运算电路503,通过控制器504控制运算电路503提取存储器(501和502)中的矩阵数据并进行乘加运算。
在一些实现中,运算电路503内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路503是二维脉动阵列。运算电路503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路503是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路503从权重存储器502中取矩阵B的权重数据,并缓存在运算电路503中的每一个PE上。运算电路503从输入存储器501中取矩阵A的输入数据,根据矩阵A的输入数据与矩阵B的权重数据进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)508中。
统一存储器506用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(direct memory access controller,DMAC)505,被搬运到权重存储器502中。输入数据也通过DMAC被搬运到统一存储器506中。
总线接口单元(bus interface unit,BIU)510,用于DMAC和取指存储器(instruction fetch buffer)509的交互;总线接口单元501还用于取指存储器509从外部存储器获取指令;总线接口单元501还用于存储单元访问控制器505从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器506中,或将权重数据搬运到权重存储器502中,或将输入数据搬运到输入存储器501中。
向量计算单元507多个运算处理单元,在需要的情况下,对运算电路503的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。向量计算单元507主要用于神经网络中非卷积层或全连接层(fully connected layers,FC)的计算,具体可以处理:池化(pooling),归一化(normalization)等的计算。例如,向量计算单元507可以将非线性函数应用到运算电路503的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元507生成归一化的值、合并值,或二者均有。
在一些实现中,向量计算单元507将经处理的向量存储到统一存储器506。在一些实现中,经向量计算单元507处理过的向量能够用作运算电路503的激活输入,例如用于神经网络中后续层中的使用,如图4所示,若当前处理层是隐含层1(431),则经向量计算单元507处理过的向量还可以被用到隐含层2(432)中的计算。
控制器504连接的取指存储器(instruction fetch buffer)509,用于存储控制器504使用的指令;
统一存储器506,输入存储器501,权重存储器502以及取指存储器509均为On-Chip存储器。外部存储器独立于该NPU硬件架构。
其中,图3所示的手势识别模型中各模块的运算可以由运算电路503或向量计算单元507执行。
下面提供一种基于图2的系统架构的一种动态手势识别方法。请参阅图6,图6为本 发明实施例中的一种动态手势识别方法的流程示意图。如图6所示,该动态手势识别方法可以包括以下步骤。
601、获取包含手势动作的多张图像。
在实施例中,当需要对手势动作进行识别时,可以获取包含手势动作的多张图像。包含手势动作的多张图像可以只包括手部对象,也可以包括手部对象的背景。
具体实施时,可以先获取待识别视频,从待识别视频中每间隔第二预设数量张图像抽取一张图像,得到包括包含手势动作的多张图像。在一种实现中,在抽取的图像的数量达到第三预设数量的情况下,将抽取的第三预设数量张图像确定为包含手势动作的多张图像。例如,第二预设数量为14,第三预设数量为8,则可从待识别视频中的第1帧开始,每隔14帧抽取一张图像,最终得到第1、15、29、43、57、71、85、99帧组成的第一个8张图像。可继续每间隔14帧抽取一张图像,得到第二个8张图像。
602、通过手势识别模型识别多张图像,得到多张图像中手势动作的类型和属性。
在获取到包含手势动作的多张图像之后,可以通过手势识别模型识别多张图像,得到多张图像中手势动作的类型和属性。其中,手势识别模型可以包括空间特征模块、时域特征模块、分类模块、第一分类器和第二分类器。
具体实施时,输入手势识别模型的多张图像可以是一段连续的图像,也可以是从一段待识别视频中截取的、不连续的多个图像按照时序排列后得到的多张图像。多张图像本质上是一个四维矩阵(B×T,C,H,W),其中,B为批处理数目(batch size),即手势识别模型可以一次处理完的多张图像的数量,T为图像长度,即多张图像中包含的图像的数量,C为图像的通道数,H为图像的高,W为图像的宽,此时所指的图像为视频帧。以批处理数目B=2,图像长度T=8,RGB通道数C=3,高H=224,宽W=224的输入信息为例,即输入手势识别模型的多张图像为一个四维矩阵(2×8,3,224,224)。如果同一时间内,手势识别模型只处理一组图像,则B可以设置为1,即手势识别模型一次可处理一组图像中的T张图像。
具体实施时,可以先将包含手势动作的多张图像输入空间特征模块,得到第一特征数据,第一特征数据包括多张图像中手势动作的空间特征。在一种实现中,请参阅图7,图7为本申请实施例中的一种空间特征模块的示意图。如图7所示,空间特征模块700可以包括输入层710,卷积层/池化层720(其中池化层为可选的),注意力机制(attention mechanism)730,以及神经网络层740。空间特征模块700的网络结构设计中,主干网络可以采用CNN架构,空间特征模块700是基于CNN架构的增加了attention机制的轻量级网络,使用批归一化BatchNorm代替L2Norm,可以得到比较好的效果。其中,attention机制730可以加在卷积层/池化层720(其中池化层为可选的)的后面。
请参阅图8,图8为本申请实施例中的一种第一特征数据提取的示意图。如图8中的(1)所示,本实施例采用的是多张图像输入的方案,对于每一张采用共享参数的方法,区块0~区块3采用每一张都用相同的参数来提取空间特征。其中,区块是空间特征模块中卷积层对图像的一个局部区域进行空间特征提取之后形成的特征区块,在区块之后可以加入attention机制,用于增强局部区域的空间特征提取。在本实施例中,每一个区块之后都加入了attention机制。具体的如图8中的(2)所示,例如,对图像0进行空间特征提取时, 可以得到多个局部区域的特征,多个局部区域的特征进行相加,就可以得到空间特征0。相同地,对图像1进行空间特征提取,可以得到空间特征1,…,对图像n进行空间特征提取,可以得到空间特征n。
本申请实施例中,每张图像对应的第一特征数据包括多个二维图片(即二维矩阵(H,W)),每一个二维图片即为一个特征图(feature map),第一特征数据包含的特征图数量等于对应的通道数。例如,空间特征模块输出的数据的维度为(16,64,112,112),则一张图像对应的第一特征数据包含的特征图的数量为64,每个特征图的大小为112×112。需要说明的是,相同模块输出的每张图像对应的第一特征数据的维度、大小均相同。同样,每张图像对应的第二特征数据也包括多个特征图。
得到第一特征数据之后,可以将第一特征数据输入时域特征模块,得到第二特征数据,第二特征数据包括第一特征数据在时间维度上的时域特征。在一种实现中,时域特征模块可以是CNN架构。请参阅图9,图9为本申请实施例中的一种时域特征模块的示意图。如图9所示,时域特征模块900可以包括维度变换层901、卷积层902、批量归一化层903、激活函数层904、最大池化层905和特征联合层906。具体地,按照多张图像的时间信息通过维度变换层901确定第一特征数据在时间维度上对应的第一时域特征数据。请参阅图10,图10为本申请实施例中的一种维度变换层进行转换Reshape操作的示意图。Reshape是一种可以重新调整矩阵的行数、列数和维数的函数。如图10所示,维度变换层901可以实现对上一级空间特征模块输出的第一特征数据(B×T,C,H,W)的维度转换,即将第一特征数据(B×T,C,H,W)中的空间维度(H,W)合并到Batch批处理维度上,将时间维度T单独分离出来,得到三维矩阵(B×H×W,C,T),第一时域特征数据由多张图像对应的第一特征数据(C,H,W)中H相同、W相同、C相同的像素点按照时间顺序排列而成,每个第一时域特征数据中包含T个数据,第一时域特征数据为由这T个数据组成的一维向量。例如,当B=1,T=8,C=64,H=56,W=56时,Reshape操作后可得到1×56×56×64个第一时域特征数据,每个第一时域特征数据包含8个数据。
得到第一时域特征数据之后,通过卷积层902对第一时域特征数据进行卷积操作,得到第二时域特征数据。具体地,通过卷积层902对维度变换层901输出的第一时域特征数据进行卷积处理。卷积层902可以包含第一预设数量个卷积核大小不同的一维卷积层,针对维度变换层901输出的第一时域特征数据,分别用这第一预设数量个一维卷积层对第一时域特征数据进行卷积处理,得到第一时域特征数据对应的第一预设数量个不同尺度的第二时域特征数据。将第二时域特征数据依次经过BN层903、ReLu层904、最大池化层905,进一步地,特征联合层906融合第一预设数量个不同尺度的第二时域特征数据,得到对应的第二特征数据。具体地,特征联合层906可以将第一预设数量个不同尺度的第二时域特征数据相加,得到对应的第二特征数据。通过卷积层902设置多个卷积核大小不同的一维卷积层,可从同一第一时域特征数据中提取出不同尺度的时序特征,特征联合层906融合这多个不同尺度的时序特征,得到第二特征数据,较好保留了手势动作的时序特征。
其中,BN层903和卷积层902一样都是一个网络层,用于加快训练速度,提高网络的泛化能力。BN层本质上是一个归一化网络层,可以替代局部响应归一化层(LRN层),ReLu层904,用于增加神经网络各层之间的非线性关系,且减轻梯度消失问题,最大池化层905 中的最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果,可以对卷积层902输出的第一时域特征数据进行采样得到较小尺寸的图像。
在一种实现中,卷积层902可以包括4个卷积核大小不同的一维卷积层,卷积核分别为k=3,k=4,k=5,k=6,分别用这4个一维卷积层对第一时域特征数据进行卷积处理,得到第一时域特征数据对应的4个不同尺度的特征数据,经过BN层903、ReLu层904和最大池化层905之后,特征联合层906将4个不同尺度的特征数据相加,得到该第一时域特征数据对应的第二特征数据。
在一种实现中,请参阅图11,图11为本申请实施例中的一种数据组装方式的示意图。第一特征数据和第二特征数据可以按照如图11的方式进行组装之后输入进特征联合层906进行数据融合。
得到第二特征数据之后,可以将第二特征数据输入第一分类器,得到多张图像中手势动作属于每个类型的第一概率,将第一手势动作归类至第一手势动作对应的第一概率最大的类型,第一手势动作为多张图像中的手势动作中的任一手势动作。将第二特征数据输入第二分类器,得到多张图像中手势动作属于每个属性的第二概率,将第一手势动作归类至第一手势动作对应的第二概率最大的属性。第一分类器和第二分类器的输出值可以被传递给两个输出,一个输出可以采用softmax逻辑回归(softmax regression)进行分类用于表征类型分类,另一个输出可以采用sigmoid函数用于表征属性分类。
在图6所描述的动态手势识别方法中,获取包含手势动作的多张图像,通过图3所示的手势识别模型300中的空间特征模块301、时域特征模块302和分类模块303识别多张图像,得到多张图像中手势动作的类型和属性。其中,通过空间特征模块301得到包括多张图像中手势动作的空间特征的第一特征数据,通过时域特征模块302得到包括第一特征数据在时间维度上的时域特征的第二特征数据。通过分类模块303得到多张图像中手势动作的类型和属性。因此,基于上述的手势识别模型300和动态手势识别方法,能够从多张图像中获取到手势动作更加全面的特征信息,进而提高了针对手部动作的识别准确率。
下面提供一种基于图2的系统架构的另一种动态手势识别方法。请参阅图12,图12为本发明实施例中的另一种动态手势识别方法的流程示意图。如图12所示,该动态手势识别方法可以包括以下步骤。
1201、获取包含手势动作的多张图像。
其中,步骤1201与步骤601相同,详细描述请参考步骤601,在此不加赘述。
1202、通过手势识别模型识别多张图像,得到多张图像中手势动作的类型和属性。
其中,步骤1202与步骤602相同,详细描述请参考步骤602,在此不加赘述。
1203、在多张图像中手势动作的属性为去程的情况下,执行手势动作的类型对应的功能。
得到多张图像中手势动作的类型和属性之后,在多张图像中手势动作的属性为去程的情况下,可以执行手势动作的类型对应的功能。例如,得到手势动作的类型和属性分别为向左和去程,则执行向左的命令以响应于手势动作;得到手势动作的类型和属性分别为向左和回程,则不执行任何命令。
本申请实施例中,手势识别模型可以是一种AI模型,在利用手势识别模型进行识别之前需要对初始手势识别模型进行训练,以使得训练后的手势识别模型具备识别多张图像中手势动作的类型和属性和能力。本申请中的手势识别模型可以具有确定手势动作的类型和属性(去程回程)的能力。下面提供一种基于图2的系统架构的一种手势识别模型训练方法。请参阅图13,图13为本发明实施例中的一种手势识别模型训练方法的流程示意图。如图13所示,该手势识别模型训练方法可以包括以下步骤。
1301、获取多个携带标注信息的样本图像。
在训练初始手势识别模型的过程中,需要使用特别的训练数据进行训练,从模型能力需求出发进行分析,需要使用携带标注信息的样本图像进行训练,样本图像中记录了手势动作,标注信息可以包括手势动作在样本图像中的类型和属性。手势动作的类型信息用于表示手势动作的类型,例如:“连续左翻”、“连续右翻”、“连续上翻”、“连续下翻”等,属性信息用于表示手势动作在多张图像中的属性,属性信息可以是去程,也可以是回程,例如,手势动作在图像0-图像7中是去程,手势动作在图像8-图像15中是回程。需要说明的是,标注信息可以以可扩展标记语言(extensible markup language,XML)或JavaScript对象简谱(JavaScript object notation,JSON)等文件进行保存。
手势动作的包括类型和属性的标注信息可以利用手势动作检测算法对样本图像进行检测得到样本图像中记录的手势动作的类型信息和属性信息,也可以通过人工标注的方式得到类型信息和属性信息。
1302、根据多个携带标注信息的样本图像对初始手势识别模型进行训练得到手势识别模型。
获取到多个携带标注信息的样本图像之后,多个携带标注信息的样本图像构成了训练集,利用训练集中的训练样本进行模型训练,首先确定初始手势识别模型,本申请实施例中,初始手势识别模型可以为一种AI模型,具体可以选用一种深度神经网络模型,该网络可以对手势动作的类别进行识别,还可以对手势动作的属性进行识别。
该初始手势识别模型可以包括空间特征模块、时域特征模块、损失函数计算模块和分类模块。
首先,将初始手势识别模型的参数初始化,之后将样本图像输入至初始手势识别模型中的空间模块,对输入的样本图像进行空间特征提取,得到抽象的特征。通过空间特征模块可以检测到的手势动作中手部特征和手部关键点特征。再通过时域特征模型得到手部关键点特征针对时间信息对应的时域特征。请参阅图14,图14为本发明实施例中的一种手势动作识别中特征提取的示意图。如图14所示,对于样本图像,可以通过空间特征模块可以检测到手部在图像中的位置信息,将手部在图像中的位置信息用矩形框标注出来,再针对矩形框内的手部进行关键点检测。通过时域特征模块可以得到关键点根据多张图像的时间信息对应的时域特征,可以采用差分法得到前后张图像中关键点的时域特征。具体地,请参阅图15,图15为本发明实施例中的一种时域特征提取的示意图。如图15所示,手部关键点可以包括指尖点和指骨关键点,可以提取手部关键点的位移信息和速度信息。手部 关键点相对于图像的位移可以用S x和S y来表示,S x表示x方向的位移,S y表示y方向的位移。在前后张图像采集的帧率相同的情况下,S x和S y也可以代表手部关键点的移动速度。根据大量数据经验所得,手势动作针对去程和回程的不同属性,移动的速度不同,因此在前后张图像采集的帧率不相同的情况下,还可以提取手部关键点的移动速度。对于不同的关键点,可以提取到多个位移信息和速度信息,将多个不同手部关键点的位移信息和速度信息合成向量,可以得到向量F,F=[f 1,f 2,f 3,...,f n]。
对空间特征模块和时域特征模块提取到的特征进行检测和识别,预测出输入的样本图像中的手势动作的类型和属性,输出至损失函数计算模块,然后将该样本图像对应的标注信息也输入到损失函数计算模块,损失函数计算模块将预测得到的预测结果与该样本图像对应的标注信息进行比对,并计算出损失函数,以损失函数为目标函数使用反向传播(backpropagation,BP)、梯度下降(gradient descent,GD)或者随机梯度下降(stochastic gradient descent,SGD)等优化算法更新调整初始手势识别模型和分类器中的权重参数。依次循环输入携带标注信息的样本图像,不断迭代执行上述训练过程,直到基于初始手势识别模型和分类器得到的样本图像对应的预设概率与样本图像对应的标注信息一致的概率达到期望值,则表示已获得符合要求的手势识别模型,可结束训练得到手势识别模型,即手势识别模型已经具备识别多张图像中手势动作的类型和属性的功能,可以用于动态手势识别。
请参阅图16,图16为本发明实施例公开的一种动态手势识别装置的结构示意图。如图16所示,该动态手势识别装置1600可以包括:
第一获取单元1601,用于获取包含手势动作的多张图像;
识别单元1602,用于通过手势识别模型识别多张图像,得到多张图像中手势动作的类型和属性,属性包括去程和回程。
在一种可选的实现方式中,装置1600还可以包括:
第二获取单元1603,用于获取多个携带标注信息的样本图像,样本图像为包含手势动作的多张图像,标注信息包括样本图像中手势动作的类型和属性;
训练单元1604,用于根据多个携带标注信息的样本图像对初始手势识别模型进行训练得到手势识别模型。
在一种可选的实现方式中,手势识别模型包括空间特征模块、时域特征模块和分类模块;
识别单元1602,具体用于:
将多张图像输入空间特征模块,得到第一特征数据,第一特征数据包括多张图像中手势动作的空间特征;
将第一特征数据输入时域特征模块,得到第二特征数据,第二特征数据包括第一特征数据在时间维度上的时域特征;
将第二特征数据输入分类模块,得到多张图像中手势动作的类型和属性。
在一种可选的实现方式中,时域特征模块包括维度变换层、卷积层、BN层、ReLu层、最大池化层和特征联合层;
识别单元1602用于将第一特征数据输入时域特征模块,得到第二特征数据,第二特征数据包括第一特征数据在时间维度上的时域特征时,具体用于:
按照多张图像的时间信息通过维度变换层确定第一特征数据在时间维度上对应的第一时域特征数据;
通过卷积层对第一时域特征数据进行卷积处理,得到第二时域特征数据;
将第二时域特征数据依次经过BN层、ReLu层、最大池化层和特征联合层,获得第二特征数据。
在一种可选的实现方式中,识别单元1602用于通过卷积层对第一时域特征数据进行卷积处理,得到第二时域特征数据时,具体用于:
使用第一预设数量个卷积核大小不同的一维卷积层对第一时域特征数据进行卷积处理,得到第二时域特征数据,第二时域特征数据包括第一预设数量个不同尺度的特征数据;
将第二时域特征数据依次经过BN层、ReLu层、最大池化层和特征联合层,获得第二特征数据包括:
将第一预设数量个不同尺度的特征数据依次经过BN层、ReLu层、最大池化层和特征联合层,得到第二特征数据。
在一种可选的实现方式中,手势识别模型还包括第一分类器和第二分类器,识别单元1602用于将第二特征数据输入分类模块,得到多张图像中手势动作的类型和属性时,具体用于:
将第二特征数据输入第一分类器,得到多张图像中的手势动作属于每个类型的第一概率;
将第一手势动作归类至第一手势动作对应的第一概率最大的类型,第一手势动作为多张图像中的手势动作中的任一手势动作;
将第二特征数据输入第二分类器,得到多张图像中的手势动作属于每个属性的第二概率;
将第一手势动作归类至第一手势动作对应的第二概率最大的属性。
在一种可选的实现方式中,第一获取单元1601,具体用于:
获取待识别视频;
从待识别视频中每间隔第二预设数量张图像抽取一张图像,得到包括包含手势动作的多张图像。
在一种可选的实现方式中,装置1600还可以包括:执行单元1605,用于在多张图像中手势动作的属性为去程的情况下,执行手势动作的类型对应的功能。
请参阅图17,图17为本发明实施例公开的一种计算设备的结构示意图。如图17所示,该计算设备1700可以包括:存储器1701、处理器1702、通信接口1703以及总线1704。其中,存储器1701、处理器1702、通信接口1703通过总线1704实现彼此之间的通信连接。
存储器1701可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。存储器1701可以存储程序,当存储器1701中存储的程序被处理器1702执行时,处理器1702和通信接口1703用于执行本申请实施例的动态手势识别方法的各个步骤。
处理器1702可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的动态手势识别装置中的单元所需执行的功能,或者执行本申请方法实施例的动态手势识别方法。
处理器1702还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的动态手势识别方法的各个步骤可以通过处理器1702中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1702还可以是通用处理器、数字信号处理器(Digital Signal Processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1701,处理器1702读取存储器1701中的信息,结合其硬件完成本申请实施例的动态手势识别装置中包括的单元所需执行的功能,或者执行本申请方法实施例的动态手势识别方法。
通信接口1703使用例如但不限于收发器一类的收发装置,来实现装置1700与其他设备或通信网络之间的通信。总线1704可包括在装置1700各个部件(例如,存储器1701、处理器1702、通信接口1703)之间传送信息的通路。上述各个功能器件的具体实现可以参见上述实施例中动态手势识别方法的相关描述,本申请实施例不再赘述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本领域技术人员能够领会,结合本文公开描述的各种说明性逻辑框、模块和算法步骤所描述的功能可以硬件、软件、固件或其任何组合来实施。如果以软件来实施,那么各种说明性逻辑框、模块、和步骤描述的功能可作为一或多个指令或代码在计算机可读媒体上存储或传输,且由基于硬件的处理单元执行。计算机可读媒体可包含计算机可读存储媒体,其对应于有形媒体,例如数据存储媒体,或包括任何促进将计算机程序从一处传送到另一处的媒体(例如,根据通信协议)的通信媒体。以此方式,计算机可读媒体大体上可对应于(1)非暂时性的有形计算机可读存储媒体,或(2)通信媒体,例如信号或载波。数据存储媒体 可为可由一或多个计算机或一或多个处理器存取以检索用于实施本申请中描述的技术的指令、代码和/或数据结构的任何可用媒体。计算机程序产品可包含计算机可读媒体。
作为实例而非限制,此类计算机可读存储媒体可包括RAM、ROM、EEPROM、CD-ROM或其它光盘存储装置、磁盘存储装置或其它磁性存储装置、快闪存储器或可用来存储指令或数据结构的形式的所要程序代码并且可由计算机存取的任何其它媒体。并且,任何连接被恰当地称作计算机可读媒体。举例来说,如果使用同轴缆线、光纤缆线、双绞线、数字订户线(DSL)或例如红外线、无线电和微波等无线技术从网站、服务器或其它远程源传输指令,那么同轴缆线、光纤缆线、双绞线、DSL或例如红外线、无线电和微波等无线技术包含在媒体的定义中。但是,应理解,所述计算机可读存储媒体和数据存储媒体并不包括连接、载波、信号或其它暂时媒体,而是实际上针对于非暂时性有形存储媒体。如本文中所使用,磁盘和光盘包含压缩光盘(CD)、激光光盘、光学光盘、数字多功能光盘(DVD)和蓝光光盘,其中磁盘通常以磁性方式再现数据,而光盘利用激光以光学方式再现数据。以上各项的组合也应包含在计算机可读媒体的范围内。
可通过例如一或多个数字信号处理器(DSP)、通用微处理器、专用集成电路(ASIC)、现场可编程逻辑阵列(FPGA)或其它等效集成或离散逻辑电路等一或多个处理器来执行指令。因此,如本文中所使用的术语“处理器”可指前述结构或适合于实施本文中所描述的技术的任一其它结构中的任一者。另外,在一些方面中,本文中所描述的各种说明性逻辑框、模块、和步骤所描述的功能可以提供于经配置以用于编码和解码的专用硬件和/或软件模块内,或者并入在组合编解码器中。而且,所述技术可完全实施于一或多个电路或逻辑元件中。
本申请的技术可在各种各样的装置或设备中实施,包含无线手持机、集成电路(IC)或一组IC(例如,芯片组)。本申请中描述各种组件、模块或单元是为了强调用于执行所揭示的技术的装置的功能方面,但未必需要由不同硬件单元实现。实际上,如上文所描述,各种单元可结合合适的软件和/或固件组合在编码解码器硬件单元中,或者通过互操作硬件单元(包含如上文所描述的一或多个处理器)来提供。
以上所述,仅为本申请示例性的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应该以权利要求的保护范围为准。

Claims (18)

  1. 一种动态手势识别方法,其特征在于,包括:
    获取包含手势动作的多张图像;
    通过手势识别模型识别所述多张图像,得到所述多张图像中手势动作的类型和属性,所述属性包括去程和回程。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    获取多个携带标注信息的样本图像,所述样本图像为包含手势动作的多张图像,所述标注信息包括所述样本图像中手势动作的类型和属性;
    根据所述多个携带标注信息的样本图像对初始手势识别模型进行训练得到所述手势识别模型。
  3. 根据权利要求1所述的方法,其特征在于,所述手势识别模型包括空间特征模块、时域特征模块和分类模块;
    所述通过手势识别模型识别所述多张图像,得到所述多张图像中手势动作的类型和属性,包括:
    将所述多张图像输入所述空间特征模块,得到第一特征数据,所述第一特征数据包括所述多张图像中手势动作的空间特征;
    将所述第一特征数据输入所述时域特征模块,得到第二特征数据,所述第二特征数据包括所述第一特征数据在时间维度上的时域特征;
    将所述第二特征数据输入所述分类模块,得到所述多张图像中手势动作的类型和属性。
  4. 根据权利要求3所述的方法,其特征在于,所述时域特征模块包括维度变换层、卷积层、批量标准化BN层、修正线性单元ReLu层、最大池化层和特征联合层;
    所述将所述第一特征数据输入所述时域特征模块,得到第二特征数据,所述第二特征数据包括所述第一特征数据在时间维度上的时域特征,包括:
    按照所述多张图像的时间信息通过所述维度变换层确定所述第一特征数据在时间维度上对应的第一时域特征数据;
    通过所述卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据;
    将所述第二时域特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,获得所述第二特征数据。
  5. 根据权利要求4所述的方法,其特征在于,所述通过卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据包括:
    使用第一预设数量个卷积核大小不同的一维卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据,所述第二时域特征数据包括第一预设数量个不同尺度的特征数据;
    所述将所述第二时域特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,获得所述第二特征数据包括:
    将所述第一预设数量个不同尺度的特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,得到所述第二特征数据。
  6. 根据权利要求3-5任一项所述的方法,其特征在于,所述手势识别模型还包括第一分类器和第二分类器,所述将所述第二特征数据输入所述分类模块,得到所述多张图像中手势动作的类型和属性包括:
    将所述第二特征数据输入所述第一分类器,得到所述多张图像中的手势动作属于每个类型的第一概率;
    将第一手势动作归类至所述第一手势动作对应的第一概率最大的类型,所述第一手势动作为所述多张图像中的手势动作中的任一手势动作;
    将所述第二特征数据输入所述第二分类器,得到所述多张图像中的手势动作属于每个属性的第二概率;
    将所述第一手势动作归类至所述第一手势动作对应的第二概率最大的属性。
  7. 根据权利要求1-6任一项所述的方法,其特征在于,所述获取包含手势动作的多张图像包括:
    获取待识别视频;
    从所述待识别视频中每间隔第二预设数量张图像抽取一张图像,得到包括包含手势动作的多张图像。
  8. 根据权利要求1-7任一项所述的方法,其特征在于,所述方法还包括:
    在所述多张图像中手势动作的属性为去程的情况下,执行所述手势动作的类型对应的功能。
  9. 一种动态手势识别装置,其特征在于,包括:
    第一获取单元,用于获取包含手势动作的多张图像;
    识别单元,用于通过手势识别模型识别所述多张图像,得到所述多张图像中手势动作的类型和属性,所述属性包括去程和回程。
  10. 根据权利要求9所述的装置,其特征在于,所述装置还包括:
    第二获取单元,用于获取多个携带标注信息的样本图像,所述样本图像为包含手势动作的多张图像,所述标注信息包括所述样本图像中手势动作的类型和属性;
    训练单元,用于根据所述多个携带标注信息的样本图像对初始手势识别模型进行训练得到所述手势识别模型。
  11. 根据权利要求9所述的装置,其特征在于,所述手势识别模型包括空间特征模块、 时域特征模块和分类模块;
    所述识别单元,具体用于:
    将所述多张图像输入所述空间特征模块,得到第一特征数据,所述第一特征数据包括所述多张图像中手势动作的空间特征;
    将所述第一特征数据输入所述时域特征模块,得到第二特征数据,所述第二特征数据包括所述第一特征数据在时间维度上的时域特征;
    将所述第二特征数据输入所述分类模块,得到所述多张图像中手势动作的类型和属性。
  12. 根据权利要求11所述的装置,其特征在于,所述时域特征模块包括维度变换层、卷积层、批量标准化BN层、修正线性单元ReLu层、最大池化层和特征联合层;
    所述识别单元用于将所述第一特征数据输入所述时域特征模块,得到第二特征数据,所述第二特征数据包括所述第一特征数据在时间维度上的时域特征时,具体用于:
    按照所述多张图像的时间信息通过所述维度变换层确定所述第一特征数据在时间维度上对应的第一时域特征数据;
    通过所述卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据;
    将所述第二时域特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,获得所述第二特征数据。
  13. 根据权利要求12所述的装置,其特征在于,所述识别单元用于通过所述卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据时,具体用于:
    使用第一预设数量个卷积核大小不同的一维卷积层对所述第一时域特征数据进行卷积处理,得到第二时域特征数据,所述第二时域特征数据包括第一预设数量个不同尺度的特征数据;
    所述将所述第二时域特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,获得所述第二特征数据包括:
    将所述第一预设数量个不同尺度的特征数据依次经过所述BN层、所述ReLu层、所述最大池化层和所述特征联合层,得到所述第二特征数据。
  14. 根据权利要求11-13任一项所述的装置,其特征在于,所述手势识别模型还包括第一分类器和第二分类器,所述识别单元用于将所述第二特征数据输入所述分类模块,得到所述多张图像中手势动作的类型和属性时,具体用于:
    将所述第二特征数据输入所述第一分类器,得到所述多张图像中的手势动作属于每个类型的第一概率;
    将第一手势动作归类至所述第一手势动作对应的第一概率最大的类型,所述第一手势动作为所述多张图像中的手势动作中的任一手势动作;
    将所述第二特征数据输入所述第二分类器,得到所述多张图像中的手势动作属于每个属性的第二概率;
    将所述第一手势动作归类至所述第一手势动作对应的第二概率最大的属性。
  15. 根据权利要求9-14任一项所述的装置,其特征在于,所述第一获取单元,具体用于:
    获取待识别视频;
    从所述待识别视频中每间隔第二预设数量张图像抽取一张图像,得到包括包含手势动作的多张图像。
  16. 根据权利要求9-15所述的装置,其特征在于,所述装置还包括:
    执行单元,用于在所述多张图像中手势动作的属性为去程的情况下,执行所述手势动作的类型对应的功能。
  17. 一种计算设备,其特征在于,包括处理器和存储器,所述存储器用于存储程序,所述处理器执行所述存储器存储的程序,当所述存储器存储的程序被执行时,使得所述计算设备实现如权利要求1-8任一项所述的方法。
  18. 一种计算机可读存储介质,其特征在于,所述计算机可读介质用于存储有计算机可执行指令,所述计算机可执行指令在被所述计算机调用时用于使所述计算机实现如权利要求1-8任一项所述的方法。
PCT/CN2021/079699 2020-03-27 2021-03-09 一种动态手势识别方法及设备 WO2021190296A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010235859.6A CN113449573A (zh) 2020-03-27 2020-03-27 一种动态手势识别方法及设备
CN202010235859.6 2020-03-27

Publications (1)

Publication Number Publication Date
WO2021190296A1 true WO2021190296A1 (zh) 2021-09-30

Family

ID=77808237

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/079699 WO2021190296A1 (zh) 2020-03-27 2021-03-09 一种动态手势识别方法及设备

Country Status (2)

Country Link
CN (1) CN113449573A (zh)
WO (1) WO2021190296A1 (zh)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821801B (zh) * 2022-05-10 2023-04-07 百度在线网络技术(北京)有限公司 动作识别方法、模型训练方法、装置、电子设备和存储介质
CN116165911B (zh) * 2023-04-19 2023-07-11 深圳市吉方工控有限公司 智能家居控制方法、装置、嵌入式工控设备及介质

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170206405A1 (en) * 2016-01-14 2017-07-20 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN107340861A (zh) * 2017-06-26 2017-11-10 联想(北京)有限公司 手势识别方法及其设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740823B (zh) * 2016-02-01 2019-03-29 北京高科中天技术股份有限公司 基于深度卷积神经网络的动态手势轨迹识别方法
CN107808131B (zh) * 2017-10-23 2019-12-10 华南理工大学 基于双通路深度卷积神经网络的动态手势识别方法
CN108537147B (zh) * 2018-03-22 2021-12-10 东华大学 一种基于深度学习的手势识别方法

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170206405A1 (en) * 2016-01-14 2017-07-20 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN107340861A (zh) * 2017-06-26 2017-11-10 联想(北京)有限公司 手势识别方法及其设备

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DOAN HUONG-GIANG, VU HAI, TRAN THANH-HAI: "New Cyclical Pattern and Temporal-Spatial Representation for Robust Dynamic Hand Gesture Recognition", DOCTORAL CONSORTIUM OF IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, 30 June 2017 (2017-06-30), XP055853222 *
DOAN HUONG-GIANG; VU HAI; TRAN THANH-HAI: "Dynamic hand gesture recognition from cyclical hand pattern", 2017 FIFTEENTH IAPR INTERNATIONAL CONFERENCE ON MACHINE VISION APPLICATIONS (MVA), MVA ORGANIZATION, 8 May 2017 (2017-05-08), pages 97 - 100, XP033126540, DOI: 10.23919/MVA.2017.7986799 *
LI JINGHUA; HUAI HUARUI; GAO JUNBIN; KONG DEHUI; WANG LICHUN: "Spatial-temporal dynamic hand gesture recognition via hybrid deep learning model", JOURNAL ON MULTIMODAL USER INTERFACES, SPRINGER BERLIN HEIDELBERG, BERLIN/HEIDELBERG, vol. 13, no. 4, 14 May 2019 (2019-05-14), Berlin/Heidelberg, pages 363 - 371, XP036918453, ISSN: 1783-7677, DOI: 10.1007/s12193-019-00304-z *
NAINA DHINGRA; ANDREAS KUNZ: "Res3ATN -- Deep 3D Residual Attention Network for Hand Gesture Recognition in Videos", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 4 January 2020 (2020-01-04), 201 Olin Library Cornell University Ithaca, NY 14853, XP081572273, DOI: 10.1109/3DV.2019.00061 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920583A (zh) * 2021-10-14 2022-01-11 根尖体育科技(北京)有限公司 细粒度行为识别模型构建方法及系统
CN114120048A (zh) * 2022-01-26 2022-03-01 中兴通讯股份有限公司 图像处理方法、电子设备及计算存储介质
CN114626412A (zh) * 2022-02-28 2022-06-14 长沙融创智胜电子科技有限公司 用于无人值守传感器系统的多类别目标识别方法及系统
CN114626412B (zh) * 2022-02-28 2024-04-02 长沙融创智胜电子科技有限公司 用于无人值守传感器系统的多类别目标识别方法及系统
CN116168334A (zh) * 2023-04-26 2023-05-26 深圳金三立视频科技股份有限公司 一种视频行为分类的方法及终端
CN116954113A (zh) * 2023-06-05 2023-10-27 深圳市机器时代科技有限公司 智能机器人驱动传感智能控制系统及其方法
CN116954113B (zh) * 2023-06-05 2024-02-09 深圳市机器时代科技有限公司 智能机器人驱动传感智能控制系统及其方法

Also Published As

Publication number Publication date
CN113449573A (zh) 2021-09-28

Similar Documents

Publication Publication Date Title
WO2021190296A1 (zh) 一种动态手势识别方法及设备
WO2020238293A1 (zh) 图像分类方法、神经网络的训练方法及装置
US20210012198A1 (en) Method for training deep neural network and apparatus
Singh et al. A deeply coupled ConvNet for human activity recognition using dynamic and RGB images
WO2019228317A1 (zh) 人脸识别方法、装置及计算机可读介质
CN111797893B (zh) 一种神经网络的训练方法、图像分类系统及相关设备
WO2022042713A1 (zh) 一种用于计算设备的深度学习训练方法和装置
CN111291809B (zh) 一种处理装置、方法及存储介质
WO2021022521A1 (zh) 数据处理的方法、训练神经网络模型的方法及设备
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
CN113807399B (zh) 一种神经网络训练方法、检测方法以及装置
CN110222718B (zh) 图像处理的方法及装置
WO2021047587A1 (zh) 手势识别方法、电子设备、计算机可读存储介质和芯片
WO2021218517A1 (zh) 获取神经网络模型的方法、图像处理方法及装置
WO2021013095A1 (zh) 图像分类方法、图像分类模型的训练方法及其装置
WO2021018245A1 (zh) 图像分类方法及装置
WO2021018251A1 (zh) 图像分类方法及装置
CN113326930A (zh) 数据处理方法、神经网络的训练方法及相关装置、设备
WO2022012668A1 (zh) 一种训练集处理方法和装置
WO2021190433A1 (zh) 更新物体识别模型的方法和装置
CN113807183A (zh) 模型训练方法及相关设备
WO2023165361A1 (zh) 一种数据处理方法及相关设备
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
US20220327835A1 (en) Video processing method and apparatus
CN113536970A (zh) 一种视频分类模型的训练方法及相关装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21775417

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21775417

Country of ref document: EP

Kind code of ref document: A1