CN112232282A - Gesture recognition method and device, storage medium and electronic equipment - Google Patents

Gesture recognition method and device, storage medium and electronic equipment

Info

Publication number
CN112232282A
Authority
CN
China
Prior art keywords
hand
image
layer
region
gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011219731.7A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Zhendi Intelligent Technology Co Ltd
Original Assignee
Suzhou Zhendi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Zhendi Intelligent Technology Co Ltd filed Critical Suzhou Zhendi Intelligent Technology Co Ltd
Priority to CN202011219731.7A
Publication of CN112232282A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/113Recognition of static hand signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides a gesture recognition method and apparatus, a storage medium, and an electronic device. The gesture recognition method includes: acquiring an image to be recognized, where the image to be recognized includes a hand region and a background region; extracting the image to be recognized through a pre-trained hand extraction model to obtain a hand region image corresponding to the hand region, where the hand region image does not contain any part of the background region; processing the hand region image through a pre-trained hand skeleton construction model to obtain a hand Gaussian heatmap; and recognizing the hand Gaussian heatmap through a pre-trained gesture classification model to obtain a gesture classification result. In the embodiment of the present application, the hand extraction model removes background interference from the image to be recognized, yielding an interference-free hand region image and improving the accuracy of subsequent gesture recognition.

Description

Gesture recognition method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of gesture recognition technologies, and in particular, to a gesture recognition method and apparatus, a storage medium, and an electronic device.
Background
With the development of human-computer interaction technology, gesture recognition has become an important branch of the field and, owing to its vivid, expressive, and intuitive nature, an important technical means of human-computer interaction.
At present, existing gesture recognition methods capture an image containing a hand with an image capture device and then recognize the gesture from the captured image.
In the course of implementing the invention, the inventor found the following problem in the prior art: existing gesture recognition methods suffer from low gesture recognition accuracy because hand recognition is easily disturbed by the background.
Disclosure of Invention
An object of the embodiments of the present application is to provide a gesture recognition method and apparatus, a storage medium, and an electronic device, so as to improve the accuracy of gesture recognition.
In a first aspect, an embodiment of the present application provides a gesture recognition method, including: acquiring an image to be recognized, where the image to be recognized includes a hand region and a background region; extracting the image to be recognized through a pre-trained hand extraction model to obtain a hand region image corresponding to the hand region, where the hand region image does not contain any part of the background region; processing the hand region image through a pre-trained hand skeleton construction model to obtain a hand Gaussian heatmap; and recognizing the hand Gaussian heatmap through a pre-trained gesture classification model to obtain a gesture classification result.
In this way, the hand extraction model removes background interference from the image to be recognized, yielding an interference-free hand region image and improving the accuracy of subsequent gesture recognition.
The hand region image is processed by the hand skeleton construction model to obtain a hand Gaussian heatmap, which is then recognized by the gesture classification model to obtain the gesture classification result. Compared with existing schemes that classify image content directly from the image, this skeleton-based approach is more robust and adaptable; compared with existing schemes that recognize gestures directly with a convolutional neural network, it is more interpretable and flexible.
In a possible embodiment, the hand extraction model includes a first image annotation layer, an image processing layer, and a feature extraction layer, and extracting the image to be recognized through the pre-trained hand extraction model to obtain the hand region image corresponding to the hand region includes: annotating the hand region in the image to be recognized through the first image annotation layer to obtain an intermediate image containing an annotation frame; binarizing the image to be recognized through the image processing layer to obtain a hand mask image; and obtaining the hand region image from the intermediate image and the hand mask image through the feature extraction layer.
In a possible embodiment, the hand extraction model further includes a second image annotation layer, and obtaining the hand region image from the intermediate image and the hand mask image through the feature extraction layer includes: determining the hand region within the annotation region of the intermediate image through the second image annotation layer according to the coordinates of the hand region in the hand mask image; and extracting the hand region from the annotation region of the intermediate image through the feature extraction layer to obtain the hand region image.
In this way, the hand region in the image to be recognized can be extracted at the pixel level through the hand extraction model, and the extracted hand region image contains no background.
In a possible embodiment, the gesture classification model includes a convolutional layer and a Softmax layer, and recognizing the hand Gaussian heatmap through the pre-trained gesture classification model to obtain the gesture classification result includes: convolving the hand Gaussian heatmap through the convolutional layer to obtain a vector containing at least one first element; and classifying the vector through the Softmax layer to determine the gesture classification result, where the gesture classification result is the category corresponding to the first element with the largest value among the at least one first element.
Since the hand Gaussian heatmap is free of background and other interference, it is more robust, and obtaining the gesture classification result from it through the gesture classification model makes the gesture recognition result more accurate.
In a second aspect, an embodiment of the present application provides a gesture recognition apparatus, including an acquisition module, a processing module, and a recognition module. The acquisition module is used to acquire an image to be recognized, where the image to be recognized includes a hand region and a background region; the processing module is used to extract the image to be recognized through a pre-trained hand extraction model to obtain a hand region image corresponding to the hand region, where the hand region image does not contain any part of the background region; the processing module is further used to process the hand region image through a pre-trained hand skeleton construction model to obtain a hand Gaussian heatmap; and the recognition module is used to recognize the hand Gaussian heatmap through a pre-trained gesture classification model to obtain a gesture classification result.
In a possible embodiment, the hand extraction model includes a first image annotation layer, an image processing layer, and a feature extraction layer; the processing module is specifically configured to: annotate the hand region in the image to be recognized through the first image annotation layer to obtain an intermediate image containing an annotation frame; binarize the image to be recognized through the image processing layer to obtain a hand mask image; and obtain the hand region image from the intermediate image and the hand mask image through the feature extraction layer.
In a possible embodiment, the hand extraction model further includes a second image annotation layer; the processing module is specifically configured to: determine the hand region within the annotation region of the intermediate image through the second image annotation layer according to the coordinates of the hand region in the hand mask image; and extract the hand region from the annotation region of the intermediate image through the feature extraction layer to obtain the hand region image.
In a possible embodiment, the gesture classification model includes a convolutional layer and a Softmax layer; the recognition module is specifically configured to: convolve the hand Gaussian heatmap through the convolutional layer to obtain a vector containing at least one element; and classify the vector through the Softmax layer to determine the gesture classification result, where the gesture classification result is the category corresponding to the element with the largest value among the at least one element.
In a third aspect, an embodiment of the present application provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program performs the method according to the first aspect or any optional implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the method of the first aspect or any of the alternative implementations of the first aspect.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect or any possible implementation manner of the first aspect.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a flowchart illustrating a gesture recognition method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a structure of a hand extraction model provided by an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a structure of a gesture classification model according to an embodiment of the present application;
fig. 4 shows a block diagram of a gesture recognition apparatus provided in an embodiment of the present application;
fig. 5 shows a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Traditional human-computer interaction methods (such as the mouse and keyboard) have gradually shown the defects of cumbersome operation and unnaturalness in emerging application scenarios. Although the touch-based interaction enabled by touchscreen technology alleviates these problems to some extent, its usage scenarios remain limited. For example, touch interaction can hardly meet the requirements of unmanned aerial vehicle application scenarios.
With the development of human-computer interaction technology, the emergence of gesture recognition has eased the problem of scene limitation, but gesture recognition accuracy remains low because recognition is easily disturbed by background signals. For example, conventional gesture recognition methods crop the hand region out of the image, but the cropped portion contains not only the hand but also background, which leads to relatively low gesture recognition accuracy.
On this basis, an embodiment of the present application provides a gesture recognition scheme: an image to be recognized is acquired, where the image to be recognized includes a hand region and a background region; the image to be recognized is extracted through a pre-trained hand extraction model to obtain a hand region image corresponding to the hand region, where the hand region image does not contain any part of the background region; the hand region image is processed through a pre-trained hand skeleton construction model to obtain a hand Gaussian heatmap; and finally the hand Gaussian heatmap is recognized through a pre-trained gesture classification model to obtain a gesture classification result.
In this way, the hand extraction model removes background interference from the image to be recognized, yielding an interference-free hand region image and improving the accuracy of subsequent gesture recognition.
The hand region image is processed by the hand skeleton construction model to obtain a hand Gaussian heatmap, which is then recognized by the gesture classification model to obtain the gesture classification result. Compared with existing schemes that classify image content directly from the image, this skeleton-based approach is more robust and adaptable; compared with existing schemes that recognize gestures directly with a convolutional neural network, it is more interpretable and flexible.
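As a minimal sketch of the three-stage pipeline described above, the three pre-trained models can be chained as follows (in Python; the function and parameter names are illustrative assumptions, not the patent's actual implementation):

```python
def recognize_gesture(image_to_recognize, hand_extractor, skeleton_model, classifier):
    """Chain the three pre-trained models of the gesture recognition scheme (a sketch)."""
    hand_region = hand_extractor(image_to_recognize)  # step S120: hand region image, background removed
    heatmaps = skeleton_model(hand_region)            # step S130: hand Gaussian heatmap
    probs = classifier(heatmaps)                      # step S140: per-category scores from conv + Softmax
    return int(probs.argmax())                        # gesture classification result
```

Each stage is described in detail in steps S110 to S140 below.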
Referring to fig. 1, fig. 1 is a flowchart illustrating a gesture recognition method according to an embodiment of the present disclosure. It should be understood that the gesture recognition method shown in fig. 1 may be performed by a gesture recognition device, which may correspond to the gesture recognition apparatus 400 shown in fig. 4 below. The gesture recognition device may be any device capable of performing the method, for example, a drone or a computer, and the embodiments of the present application are not limited thereto. The gesture recognition method shown in fig. 1 specifically includes the following steps:
Step S110: acquire an image to be recognized. The image to be recognized includes a hand region and a background region.
It should be understood that the image to be recognized may be acquired by an image acquisition device on the drone.
Step S120: extract the image to be recognized through the pre-trained hand extraction model to obtain a hand region image corresponding to the hand region. The hand region image does not include any part of the background region.
It should be understood that the specific structure, the model type, and the like of the hand extraction model may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
Optionally, please refer to fig. 2, which shows a schematic structural diagram of a hand extraction model provided in an embodiment of the present application. The hand extraction model shown in fig. 2 includes a first image annotation layer, an image processing layer, a second image annotation layer, and a feature extraction layer. The first image annotation layer is used to annotate the image input to it, producing an intermediate image with an annotation frame; the image processing layer is used to binarize the image to be recognized to obtain a hand mask image; the second image annotation layer is used to determine the hand region within the annotation region of the intermediate image according to the coordinates of the hand region in the hand mask image; and the feature extraction layer is used to extract the hand region from the annotation region of the intermediate image to obtain the hand region image.
It should be noted that although fig. 2 is described with a four-layer structure of first image annotation layer, image processing layer, second image annotation layer, and feature extraction layer, those skilled in the art will understand that this structure may be adjusted according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the first image annotation layer and the image processing layer may be combined into a single image annotation processing layer that outputs both the intermediate image with the annotation frame and the hand mask image.
In order to facilitate understanding of the embodiments of the present application, the following description will be given by way of specific examples.
Specifically, the hand region in the image to be recognized may be annotated through the first image annotation layer to obtain an intermediate image containing an annotation frame, where the annotation region enclosed by the annotation frame may contain part of the background region in addition to the hand region. The image to be recognized may be binarized through the image processing layer to obtain a hand mask image, which distinguishes hand pixels from non-hand pixels. Because the annotation region contains part of the background in addition to the hand, the hand region can then be determined within the annotation region of the intermediate image through the second image annotation layer according to the coordinates of the hand region in the hand mask image, so that a hand region free of background is located accurately. Finally, the hand region is extracted from the annotation region of the intermediate image through the feature extraction layer to obtain the hand region image.
That is, the hand extraction model, a convolutional neural network, performs hand region extraction end to end by dividing it into several sub-tasks: the first image annotation layer handles generating the annotation frame; the image processing layer handles generating the hand mask image, which may be a pixel-level classification mask; the second image annotation layer handles determining the hand region from the annotation region; and the feature extraction layer handles extracting the hand region features, so that a pixel-level hand region image is obtained.
It should be noted that when several hands (for example, two hands) are close to each other in an image, the number of hands cannot be determined from the binarized image alone, so extracting features directly from the binarized image may treat two hands as one; likewise, extracting from the annotation region alone suffers from background interference. The method in the embodiment of the present application therefore ensures the accuracy of hand extraction from multiple dimensions, avoiding both the two-hands-as-one problem and the background-interference problem.
It should be understood that the specific process of labeling the hand region in the image to be recognized through the first image labeling layer may be set according to actual needs, and the embodiment of the present application is not limited thereto.
For example, the hand region in the image to be recognized may be annotated with a rectangular frame.
It should also be understood that the specific process of performing binarization processing on the image to be recognized through the image processing layer may also be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the hand area in the image may be set to white, and the non-hand area (including the background area) may be set to black.
It should also be understood that the specific process of determining the hand region within the annotation region of the intermediate image through the second image annotation layer, according to the coordinates of the hand region in the hand mask image, may also be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the pixel coordinates of the hand outline in the hand mask image may be determined, and the hand region determined within the annotation region according to those outline coordinates.
For another example, all pixel coordinates of the hand region in the hand mask image may be determined, and the hand region determined within the annotation region according to all of those coordinates.
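Putting the above together, the following is a minimal sketch of the pixel-level extraction, assuming an OpenCV/NumPy setting; detect_hand_box and predict_hand_mask are hypothetical stand-ins for the first image annotation layer and the image processing layer, whose internals the patent does not specify:

```python
import cv2
import numpy as np

def extract_hand_region(image_bgr, detect_hand_box, predict_hand_mask):
    # First image annotation layer: annotation frame around the hand
    # (the annotation region may still contain some background).
    x, y, w, h = detect_hand_box(image_bgr)

    # Image processing layer: binarize into a hand mask image
    # (hand pixels 255, non-hand pixels 0).
    mask = predict_hand_mask(image_bgr)  # uint8, same height/width as input
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

    # Second image annotation layer: keep only the mask pixels whose
    # coordinates fall inside the annotation region.
    region_mask = np.zeros_like(mask)
    region_mask[y:y + h, x:x + w] = mask[y:y + h, x:x + w]

    # Feature extraction layer: pixel-level hand region image with all
    # background pixels zeroed out.
    hand_only = cv2.bitwise_and(image_bgr, image_bgr, mask=region_mask)
    return hand_only[y:y + h, x:x + w]
```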
It should be further noted that the training process of the hand extraction model is similar to its use; reference may be made to the description above, and the repeated description is omitted below.
Correspondingly, the hand skeleton construction model and the gesture classification model are handled similarly to the hand extraction model in this respect, and the repeated description is likewise omitted below.
In this way, the hand region in the image to be recognized can be extracted at the pixel level through the hand extraction model, and the extracted hand region image contains no background.
Step S130: process the hand region image through the pre-trained hand skeleton construction model to obtain a hand Gaussian heatmap.
It should be understood that the hand Gaussian heatmap may also be referred to as a Gaussian heatmap, a Gaussian response heatmap, a hand skeleton map, or the like.
It should also be understood that the specific structure, model type, and the like of the hand skeleton construction model may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the hand skeleton construction model may be a model constructed using ResNet-18.
In order to facilitate understanding of the embodiments of the present application, the following description will be given by way of specific examples.
Specifically, the hand region image may be input into the pre-trained hand skeleton construction model to obtain N × M matrices (i.e., the hand Gaussian heatmap), where each second element of a matrix is a response value for the skeleton and joints: the larger the value, the closer the position of that second element is to a bone or joint of the hand. N and M are both positive integers.
It should be understood that the value of N and the value of M may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, N and M may both be 64.
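As an illustration of this step, below is a minimal PyTorch sketch of such a model: a ResNet-18 backbone (as suggested above) with a small upsampling head that outputs one 64 × 64 response map per keypoint. The number of keypoints (21) and the 256 × 256 input size are assumptions for illustration, not values fixed by the patent:

```python
import torch
import torch.nn as nn
import torchvision

class HandSkeletonModel(nn.Module):
    def __init__(self, num_keypoints: int = 21):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Keep everything up to the last conv stage (512 x H/32 x W/32).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Upsample the 8 x 8 features (for 256 x 256 input) to 64 x 64 maps.
        self.head = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, num_keypoints, 4, stride=2, padding=1),
        )

    def forward(self, hand_region_image: torch.Tensor) -> torch.Tensor:
        # hand_region_image: (B, 3, 256, 256) -> heatmaps: (B, K, 64, 64)
        return self.head(self.backbone(hand_region_image))

heatmaps = HandSkeletonModel()(torch.randn(1, 3, 256, 256))
print(heatmaps.shape)  # torch.Size([1, 21, 64, 64])
```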
In this approach, the embodiment of the present application first generates a response Gaussian heatmap through the hand skeleton construction model (for example, the response points in the response Gaussian heatmap may correspond to the bone and joint regions of the hand), and then processes the response Gaussian heatmap to generate the hand Gaussian heatmap (for example, by connecting the response points of the hand, such as the joint points), so that the network learns the correlation between the joints and bones of the hand instead of the hand Gaussian heatmap being produced by artificial connection.
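The patent does not spell out how the Gaussian responses themselves are formed; a standard construction consistent with the term "Gaussian response heatmap" places a 2D Gaussian at each joint position, as in the following sketch (the sigma value and example coordinates are assumptions):

```python
import numpy as np

def gaussian_response(size: int = 64, cx: float = 32.0, cy: float = 20.0,
                      sigma: float = 2.0) -> np.ndarray:
    """Render a size x size response map peaking at joint position (cx, cy)."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```

Maps of this kind can serve, for instance, as per-joint training targets for the skeleton construction model.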
Step S140: recognize the hand Gaussian heatmap through the pre-trained gesture classification model to obtain a gesture classification result.
It should be understood that the specific structure, the model type, and the like of the gesture classification model may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the gesture classification model may be a model constructed using ResNet-18.
For another example, please refer to fig. 3, which shows a schematic structural diagram of a gesture classification model provided in an embodiment of the present application. The gesture classification model shown in fig. 3 includes a convolutional layer and a Softmax layer connected in sequence. The convolutional layer is used to convolve the hand Gaussian heatmap to obtain a vector containing at least one first element; the Softmax layer is used to classify the vector. Since each first element of the vector represents a category, the category corresponding to the first element with the largest value can be taken as the category of the gesture represented by the hand Gaussian heatmap.
It should be understood that the Softmax layer may also be referred to as a classification layer.
In order to facilitate understanding of the embodiments of the present application, the following description will be given by way of specific examples.
Specifically, the hand Gaussian heatmap may first be pre-processed (for example, resized to a preset size), and the pre-processed image is then input into the pre-trained gesture classification model to obtain a gesture classification result, which is the category corresponding to the first element with the largest value among the at least one first element.
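A minimal PyTorch sketch of such a conv + Softmax head follows; the intermediate channel width, the pooling/flatten arrangement, and the number of gesture categories (10) are assumptions about details the patent leaves open:

```python
import torch
import torch.nn as nn

class GestureClassifier(nn.Module):
    def __init__(self, num_keypoints: int = 21, num_classes: int = 10):
        super().__init__()
        # Convolutional layer: map the stack of heatmaps to a vector with
        # one first element per gesture category.
        self.conv = nn.Sequential(
            nn.Conv2d(num_keypoints, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )
        # Softmax layer: normalize the vector into class probabilities.
        self.softmax = nn.Softmax(dim=1)

    def forward(self, heatmaps: torch.Tensor) -> torch.Tensor:
        # heatmaps: (B, K, 64, 64) -> probabilities: (B, num_classes)
        return self.softmax(self.conv(heatmaps))

probs = GestureClassifier()(torch.randn(1, 21, 64, 64))
gesture = int(probs.argmax(dim=1))  # category of the largest first element
```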
Since the hand Gaussian heatmap is free of background and other interference, it is more robust, and obtaining the gesture classification result from it through the gesture classification model makes the gesture recognition result more accurate. In addition, because factors such as lighting are factored out, sample collection is simpler and more convenient.
In this way, the hand extraction model removes background interference from the image to be recognized, yielding an interference-free hand region image and improving the accuracy of subsequent gesture recognition.
The hand region image is processed by the hand skeleton construction model to obtain a hand Gaussian heatmap, which is then recognized by the gesture classification model to obtain the gesture classification result. Compared with existing schemes that classify image content directly from the image, this skeleton-based approach is more robust and adaptable; compared with existing schemes that recognize gestures directly with a convolutional neural network, it is more interpretable and flexible.
It should be understood that the above-described gesture recognition method is only exemplary, and those skilled in the art may make various changes, modifications or alterations according to the above-described method within the protection scope of the present application.
Referring to fig. 4, fig. 4 shows a structural block diagram of a gesture recognition apparatus 400 provided in an embodiment of the present application. It should be understood that the gesture recognition apparatus 400 corresponds to the above method embodiment and can perform the steps involved in that embodiment; for its specific functions, reference may be made to the foregoing description, and detailed descriptions are omitted here where appropriate to avoid repetition. The gesture recognition apparatus 400 includes at least one software functional module that can be stored in a memory in the form of software or firmware, or solidified in an operating system (OS) of the gesture recognition apparatus 400. Specifically, the gesture recognition apparatus 400 includes:
an acquisition module 410, configured to acquire an image to be recognized, where the image to be recognized includes a hand region and a background region; a processing module 420, configured to extract the image to be recognized through a pre-trained hand extraction model to obtain a hand region image corresponding to the hand region, where the hand region image does not contain any part of the background region; the processing module 420 is further configured to process the hand region image through a pre-trained hand skeleton construction model to obtain a hand Gaussian heatmap; and a recognition module 430, configured to recognize the hand Gaussian heatmap through a pre-trained gesture classification model to obtain a gesture classification result.
In a possible embodiment, the hand extraction model includes a first image annotation layer, an image processing layer, and a feature extraction layer; the processing module 420 is specifically configured to: annotate the hand region in the image to be recognized through the first image annotation layer to obtain an intermediate image containing an annotation frame; binarize the image to be recognized through the image processing layer to obtain a hand mask image; and obtain the hand region image from the intermediate image and the hand mask image through the feature extraction layer.
In a possible embodiment, the hand extraction model further includes a second image annotation layer; the processing module 420 is specifically configured to: determine the hand region within the annotation region of the intermediate image through the second image annotation layer according to the coordinates of the hand region in the hand mask image; and extract the hand region from the annotation region of the intermediate image through the feature extraction layer to obtain the hand region image.
In a possible embodiment, the gesture classification model includes a convolutional layer and a Softmax layer; the recognition module 430 is specifically configured to: convolve the hand Gaussian heatmap through the convolutional layer to obtain a vector containing at least one element; and classify the vector through the Softmax layer to determine the gesture classification result, where the gesture classification result is the category corresponding to the element with the largest value among the at least one element.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
Referring to fig. 5, fig. 5 is a block diagram illustrating an electronic device 500 according to an embodiment of the present disclosure. The electronic device 500 may include a processor 510, a communication interface 520, a memory 530, and at least one communication bus 540. The communication bus 540 is used to realize direct connection and communication among these components. The communication interface 520 in the embodiment of the present application is used for communicating signaling or data with other devices. The processor 510 may be an integrated circuit chip having signal processing capabilities. The processor 510 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor 510 may be any conventional processor or the like.
The memory 530 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 530 stores computer-readable instructions which, when executed by the processor 510, enable the electronic device 500 to perform the steps of the above method embodiments.
The electronic device 500 may further include a memory controller, an input-output unit, an audio unit, and a display unit.
The memory 530, the memory controller, the processor 510, the peripheral interface, the input/output unit, the audio unit, and the display unit are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, these elements may be electrically coupled to each other via one or more communication buses 540. The processor 510 is used to execute executable modules stored in the memory 530. Also, the electronic device 500 is configured to perform the following method: acquiring an image to be recognized, wherein the image to be recognized comprises a hand area and a background area; extracting the image to be recognized through a pre-trained hand extraction model to obtain a hand region image corresponding to the hand region, wherein the hand region image does not contain any region in the background region; processing the hand region image through a pre-trained hand skeleton construction model to obtain a hand Gaussian heat map; and identifying the hand Gaussian heatmap through a pre-trained gesture classification model to obtain a gesture classification result.
The input/output unit is used to provide the user with input data, realizing interaction between the user and the server (or the local terminal). The input/output unit may be, but is not limited to, a mouse, a keyboard, and the like.
The audio unit provides an audio interface to the user, which may include one or more microphones, one or more speakers, and audio circuitry.
The display unit provides an interactive interface (e.g., a user interface) between the electronic device and the user, or displays image data for the user's reference. In this embodiment, the display unit may be a liquid crystal display or a touch display. In the case of a touch display, it may be a capacitive or resistive touchscreen supporting single-point and multi-point touch operations, meaning that the touch display can sense touch operations generated simultaneously at one or more positions on it and pass the sensed touch operations to the processor for calculation and processing.
It will be appreciated that the configuration shown in FIG. 5 is merely illustrative and that the electronic device 500 may include more or fewer components than shown in FIG. 5 or may have a different configuration than shown in FIG. 5. The components shown in fig. 5 may be implemented in hardware, software, or a combination thereof.
The present application also provides a storage medium having a computer program stored thereon, which, when executed by a processor, performs the method of the method embodiments.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the method of the method embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description covers only preferred embodiments of the present application and is not intended to limit the present application; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present application shall fall within its protection scope.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A gesture recognition method, comprising:
acquiring an image to be recognized, wherein the image to be recognized comprises a hand region and a background region;
extracting the image to be recognized through a pre-trained hand extraction model to obtain a hand region image corresponding to the hand region, wherein the hand region image does not contain any region in the background region;
processing the hand region image through a pre-trained hand skeleton construction model to obtain a hand Gaussian heat map;
and identifying the hand Gaussian heatmap through a pre-trained gesture classification model to obtain a gesture classification result.
2. The gesture recognition method according to claim 1, wherein the hand extraction model comprises a first image annotation layer, an image processing layer and a feature extraction layer, and the extracting the image to be recognized through a pre-trained hand extraction model to obtain a hand region image corresponding to the hand region comprises:
annotating the hand region in the image to be recognized through the first image annotation layer to obtain an intermediate image containing an annotation frame;
performing binarization processing on the image to be recognized through the image processing layer to obtain a hand mask image;
and obtaining the hand region image according to the intermediate image and the hand mask image through the feature extraction layer.
3. The gesture recognition method according to claim 2, wherein the hand extraction model further comprises a second image annotation layer, and the obtaining of the hand region image according to the intermediate image and the hand mask image through the feature extraction layer comprises:
determining the hand region within the annotation region of the intermediate image through the second image annotation layer according to the coordinates of the hand region in the hand mask image;
and extracting the hand region from the annotation region of the intermediate image through the feature extraction layer to obtain the hand region image.
4. The gesture recognition method according to claim 1, wherein the gesture classification model comprises a convolutional layer and a Softmax layer, and the recognizing the hand Gaussian heatmap through a pre-trained gesture classification model to obtain a gesture classification result comprises:
performing convolution processing on the hand Gaussian heatmap through the convolutional layer to obtain a vector containing at least one first element;
and classifying the vector through the Softmax layer to determine the gesture classification result, wherein the gesture classification result is the category corresponding to the first element with the largest value among the at least one first element.
5. A gesture recognition apparatus, comprising:
an acquisition module, a processing module and a recognition module, wherein the acquisition module is configured to acquire an image to be recognized, and the image to be recognized comprises a hand region and a background region;
the processing module is used for extracting the image to be recognized through a pre-trained hand extraction model so as to obtain a hand region image corresponding to the hand region, wherein the hand region image does not contain any region in the background region;
the processing module is further used for processing the hand region image through a pre-trained hand skeleton construction model to obtain a hand Gaussian heat map;
and the recognition module is used for recognizing the hand Gaussian heatmap through a pre-trained gesture classification model so as to obtain a gesture classification result.
6. The gesture recognition device of claim 5, wherein the hand extraction model comprises a first image annotation layer, an image processing layer, and a feature extraction layer;
the processing module is specifically configured to: annotate the hand region in the image to be recognized through the first image annotation layer to obtain an intermediate image containing an annotation frame; perform binarization processing on the image to be recognized through the image processing layer to obtain a hand mask image; and obtain the hand region image according to the intermediate image and the hand mask image through the feature extraction layer.
7. The gesture recognition device of claim 6, wherein the hand extraction model further comprises a second image annotation layer;
the processing module is specifically configured to: determine the hand region within the annotation region of the intermediate image through the second image annotation layer according to the coordinates of the hand region in the hand mask image; and extract the hand region from the annotation region of the intermediate image through the feature extraction layer to obtain the hand region image.
8. The gesture recognition device of claim 5, wherein the gesture classification model comprises a convolutional layer and a Softmax layer;
the recognition module is specifically configured to: perform convolution processing on the hand Gaussian heatmap through the convolutional layer to obtain a vector containing at least one element; and classify the vector through the Softmax layer to determine the gesture classification result, wherein the gesture classification result is the category corresponding to the element with the largest value among the at least one element.
9. A storage medium, having stored thereon a computer program which, when executed by a processor, performs a gesture recognition method according to any one of claims 1 to 4.
10. An electronic device, characterized in that the electronic device comprises: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the gesture recognition method of any of claims 1 to 4.
CN202011219731.7A 2020-11-04 2020-11-04 Gesture recognition method and device, storage medium and electronic equipment Pending CN112232282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011219731.7A CN112232282A (en) 2020-11-04 2020-11-04 Gesture recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011219731.7A CN112232282A (en) 2020-11-04 2020-11-04 Gesture recognition method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112232282A true CN112232282A (en) 2021-01-15

Family

ID=74122039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011219731.7A Pending CN112232282A (en) 2020-11-04 2020-11-04 Gesture recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112232282A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033290A (en) * 2021-02-01 2021-06-25 广州朗国电子科技有限公司 Image subregion identification method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination