CN114510142B - Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment - Google Patents

Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment

Info

Publication number
CN114510142B
CN114510142B (application CN202011180708.1A)
Authority
CN
China
Prior art keywords
frame image
current frame
palm
rectangle
information
Prior art date
Legal status
Active
Application number
CN202011180708.1A
Other languages
Chinese (zh)
Other versions
CN114510142A (en)
Inventor
徐诚
谢森栋
田文军
李程辉
Current Assignee
Sunny Optical Zhejiang Research Institute Co Ltd
Original Assignee
Sunny Optical Zhejiang Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Sunny Optical Zhejiang Research Institute Co Ltd filed Critical Sunny Optical Zhejiang Research Institute Co Ltd
Priority to CN202011180708.1A
Publication of CN114510142A
Application granted
Publication of CN114510142B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F 3/048: Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0487: Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
    • G06F 3/0488: Interaction techniques based on graphical user interfaces [GUI] using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • G06F 3/04883: Interaction techniques based on graphical user interfaces [GUI] using a touch-screen or digitiser for inputting data by handwriting, e.g. gesture or text
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/243: Classification techniques relating to the number of classes
    • G06F 18/24323: Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment. The gesture recognition method based on the two-dimensional image comprises the following steps: tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain the palm information of the current frame image; performing key point detection on the current frame image based on the palm information of the current frame image, so as to obtain the hand key point information in the current frame image; and performing single-frame gesture recognition on the hand key point information in the current frame image, so as to obtain a single-frame static gesture in the current frame image.

Description

Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment
Technical Field
The present invention relates to the field of gesture recognition technologies, and in particular, to a gesture recognition method based on a two-dimensional image, a gesture recognition system based on the two-dimensional image, and an electronic device.
Background
Currently, augmented reality (AR) technology is increasingly applied to mobile terminals such as AR glasses, and is favored by major mobile device manufacturers and users. As the most natural means of human-computer interaction, gesture recognition has great application value in augmented reality. Gesture recognition methods can be divided into two main categories according to the data they acquire: the first obtains data through sensor devices such as data gloves; the second is based on visual data, including two-dimensional images, three-dimensional point clouds, or depth images.
However, although the first method generally achieves high recognition accuracy, the sensor devices are expensive and inconvenient to operate; the second method has low equipment requirements and is convenient to operate, but often suffers from insufficient recognition speed and accuracy. For example, one existing end-to-end gesture recognition method uses two models: a lightweight 3D convolutional neural network (CNN) detector for detecting gestures, and a deep 3D CNN classifier for classifying the detected gestures. Two sliding windows implement the workflow of the recognition system; the sliding windows of the detector and the classifier have lengths n and m, respectively, where n is much smaller than m. The workflow starts from the detector, which acts as a switch for activating the classifier: when a gesture is detected, the classifier is activated and the frames in the classifier queue provide the data for gesture recognition. However, this scheme has very high computational complexity and high hardware requirements, and cannot meet real-time requirements on mobile devices.
Disclosure of Invention
An advantage of the present invention is to provide a gesture recognition method based on a two-dimensional image, a system thereof and an electronic device thereof, which can increase the speed of gesture recognition while maintaining the accuracy of gesture recognition, so as to meet the real-time requirement of a mobile terminal device.
Another advantage of the present invention is to provide a gesture recognition method based on a two-dimensional image, a system thereof and an electronic device thereof, wherein in an embodiment of the present invention, the gesture recognition method based on the two-dimensional image designs a gesture tracking method according to characteristics of a gesture motion, so as to maintain accuracy of gesture recognition and increase speed of gesture recognition.
Another advantage of the present invention is to provide a gesture recognition method based on a two-dimensional image, a system thereof and an electronic device thereof, wherein in an embodiment of the present invention, the gesture recognition method based on the two-dimensional image can run in real time on a mobile terminal device using only a common RGB camera, without additional sensor devices, and has high accuracy and robustness.
Another advantage of the present invention is to provide a gesture recognition method based on a two-dimensional image, a system thereof and an electronic device thereof, wherein in an embodiment of the present invention, the gesture recognition method based on the two-dimensional image can perform dynamic gesture recognition through the combination of single-frame gesture motions and hand key points, which helps to improve the robustness of the method.
Another advantage of the present invention is to provide a gesture recognition method based on a two-dimensional image, a system thereof and an electronic device thereof, wherein in an embodiment of the present invention, the gesture recognition method based on the two-dimensional image combines deep learning with traditional machine learning and provides an innovative tracking method and fault-tolerance strategy, which help to improve the speed and accuracy of gesture recognition.
The invention further provides a gesture recognition method based on a two-dimensional image, a gesture recognition system based on the two-dimensional image and electronic equipment, wherein in one embodiment of the invention, the gesture recognition method based on the two-dimensional image can accelerate extraction of key points of a hand through palm tracking, so that the speed of gesture recognition is improved, and the method can better meet the real-time requirement of the mobile equipment.
Another advantage of the present invention is to provide a gesture recognition method based on a two-dimensional image, a system thereof and an electronic device thereof, wherein, to achieve the above advantages, the present invention does not require complex structures or heavy computation, and has low requirements on software and hardware. Therefore, the present invention successfully and effectively provides a solution that not only provides a gesture recognition method based on a two-dimensional image, a system thereof and an electronic device, but also increases the practicability and reliability of the same.
To achieve at least one of the above or other advantages and objects, the present application provides a gesture recognition method based on a two-dimensional image, including the steps of:
tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain the palm information of the current frame image;
performing key point detection on the current frame image based on the palm information of the current frame image, so as to obtain the hand key point information in the current frame image; and
performing single-frame gesture recognition on the hand key point information in the current frame image, so as to obtain a single-frame static gesture in the current frame image.
According to an embodiment of the present application, the step of tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image to obtain the palm information of the current frame image includes the steps of:
calculating a minimum circumscribed rectangle by traversing the position information of part of the hand key points in the previous frame image, so as to obtain the center point, the length and the width of the minimum circumscribed rectangle;
expanding the minimum circumscribed rectangle according to a first expansion ratio while keeping the aspect ratio unchanged, so as to obtain an expanded rectangle;
judging whether the intersection ratio between the expanded rectangle and a key point detection frame in the previous frame image is smaller than a first threshold value;
in response to the intersection ratio being smaller than the first threshold value, starting a palm detection module to obtain the palm information of the current frame image; and
in response to the intersection ratio being greater than or equal to the first threshold value, performing orientation calculation on the expanded rectangle to obtain the palm information of the current frame image.
According to an embodiment of the present application, the palm detection module is a palm detection model based on deep learning.
According to an embodiment of the present application, the step of performing orientation calculation on the expanded rectangle to obtain the palm information of the current frame image in response to the intersection ratio being greater than or equal to the first threshold value includes the steps of:
judging whether the expanded rectangle is smaller than a rectangle frame threshold value, wherein the rectangle frame threshold value is dynamically updated by the palm detection module;
expanding the expanded rectangle according to a second expansion ratio in response to the expanded rectangle being smaller than the rectangle frame threshold value, such that the expanded rectangle is greater than or equal to the rectangle frame threshold value;
normalizing the expanded rectangle that is greater than or equal to the rectangle frame threshold value, so as to obtain the position information of the expanded rectangle in the current frame image; and
calculating the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
According to an embodiment of the present application, the step of performing key point detection on the current frame image based on the palm information of the current frame image to obtain the hand key point information in the current frame image includes the steps of:
performing affine transformation on the current frame image according to the palm information of the current frame image, so as to cut out a palm image corresponding to the palm information of the current frame image; and
detecting the position of each hand key point from the palm image through a key point detection network, so as to obtain the hand key point information in the current frame image.
According to an embodiment of the present application, the step of performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image includes the steps of:
calculating feature values from the hand key point information in the current frame image, so as to obtain features of a plurality of different dimensions; and
performing gesture recognition on the features of the plurality of different dimensions through a random forest, so as to obtain a static gesture recognition result in the current frame image.
According to an embodiment of the present application, the gesture recognition method based on two-dimensional image further includes the steps of:
based on single-frame static gestures and hand key point information in a buffer queue, detecting a single-frame gesture change process in the buffer queue so as to realize multi-frame dynamic gesture recognition.
According to an embodiment of the present application, the step of detecting a single-frame gesture change process in the buffer queue based on the single-frame static gesture and the hand key point information in the buffer queue to realize multi-frame dynamic gesture recognition includes the steps of:
filling the single-frame static gesture and the hand key point information in the current frame image into the buffer queue to update the buffer queue;
searching whether a single-frame gesture motion change matching the predefined setting exists in the updated buffer queue; and
in response to the presence of a single-frame gesture motion change matching the predefined setting, performing dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain a single-hand motion recognition result.
According to an embodiment of the present application, the step of detecting a single-frame gesture change process in the buffer queue based on the single-frame static gesture and the hand key point information in the buffer queue to realize multi-frame dynamic gesture recognition further includes the steps of:
judging whether the hand key points in the updated buffer queue contain the key points of both hands and persist for a predetermined number of frames; and
in response to the hand key points containing the key points of both hands for the predetermined number of frames, performing fusion recognition on the single-hand motion recognition results of the two hands, so as to output a two-hand motion recognition result.
According to another aspect of the present application, an embodiment of the present application provides a gesture recognition system based on two-dimensional images, including:
the palm tracking module is used for tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image so as to obtain the palm information of the current frame image;
the key point detection module is used for carrying out key point detection on the current frame image based on the palm information of the current frame image so as to obtain hand key point information in the current frame image; and
the single-frame gesture recognition module is used for performing single-frame gesture recognition on the hand key point information in the current frame image, so as to obtain a single-frame static gesture in the current frame image.
According to an embodiment of the application, the palm tracking module comprises:
a rectangle calculation module, used for calculating a minimum circumscribed rectangle by traversing the position information of part of the hand key points in the previous frame image, so as to obtain the center point, the length and the width of the minimum circumscribed rectangle;
a rectangle expansion module, used for expanding the minimum circumscribed rectangle according to a first expansion ratio while keeping the aspect ratio unchanged, so as to obtain an expanded rectangle;
an intersection ratio judging module, used for judging whether the intersection ratio between the expanded rectangle and the key point detection frame in the previous frame image is smaller than a first threshold value;
a palm detection starting module, which, in response to the intersection ratio being smaller than the first threshold value, starts the palm detection module to obtain the palm information of the current frame image; and
an orientation calculation module, used for performing orientation calculation on the expanded rectangle in response to the intersection ratio being greater than or equal to the first threshold value, so as to obtain the palm information of the current frame image.
According to an embodiment of the present application, the orientation calculation module is further configured to: judge whether the expanded rectangle is smaller than a rectangle frame threshold value, wherein the rectangle frame threshold value is dynamically updated by the palm detection module; expand the expanded rectangle according to a second expansion ratio in response to the expanded rectangle being smaller than the rectangle frame threshold value, such that the expanded rectangle is greater than or equal to the rectangle frame threshold value; normalize the expanded rectangle that is greater than or equal to the rectangle frame threshold value, so as to obtain the position information of the expanded rectangle in the current frame image; and calculate the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
According to an embodiment of the present application, the keypoint detection module includes an affine transformation module and a keypoint extraction module that are communicatively connected to each other, wherein the affine transformation module is configured to perform affine transformation on the current frame image according to the palm information of the current frame image, so as to cut out a palm image corresponding to the palm information of the current frame image; the key point extraction module is used for extracting the positions of the key points of each hand from the palm image through the key point detection network so as to obtain the information of the key points of the hand in the current frame image.
According to an embodiment of the present application, the single-frame gesture recognition module includes a feature value calculation module and a random forest module that are communicatively connected to each other, where the feature value calculation module is configured to perform feature value calculation through the hand key point information in the current frame image, so as to obtain features in a plurality of different dimensions; the random forest module is used for carrying out gesture recognition on the characteristics of the plurality of different dimensions through the random forest so as to obtain a static gesture recognition result in the current frame image.
According to an embodiment of the present application, the gesture recognition system based on two-dimensional images further includes a multi-frame gesture recognition module, configured to detect a process of changing a single-frame gesture in the buffer queue based on the single-frame static gesture and the hand key point information in the buffer queue, so as to implement recognition of multi-frame dynamic gestures.
According to an embodiment of the present application, the multi-frame gesture recognition module includes a filling update module, a search module and a single-hand motion recognition module that are communicatively connected to each other, wherein the filling update module is configured to fill the single-frame static gesture and the hand key point information in the current frame image into the buffer queue to update the buffer queue; the search module is configured to search whether a single-frame gesture motion change matching the predefined setting exists in the updated buffer queue; the single-hand motion recognition module is configured to, in response to the presence of a single-frame gesture motion change matching the predefined setting, perform dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain a single-hand motion recognition result.
According to an embodiment of the present application, the multi-frame gesture recognition module further includes a judging module and a two-hand motion recognition module that are communicatively connected to each other, wherein the judging module is configured to judge whether the hand key points in the updated buffer queue contain the key points of both hands and persist for a predetermined number of frames; the two-hand motion recognition module is configured to, in response to the hand key points containing the key points of both hands for the predetermined number of frames, perform fusion recognition on the single-hand motion recognition results of the two hands, so as to output a two-hand motion recognition result.
According to another aspect of the present application, an embodiment of the present application provides an electronic device, including:
at least one processor for executing instructions; and
a memory communicatively connected to the at least one processor, wherein the memory has at least one instruction, wherein the instruction is executed by the at least one processor to cause the at least one processor to perform some or all of the steps in a two-dimensional image-based gesture recognition method, wherein the two-dimensional image-based gesture recognition method comprises the steps of:
tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain the palm information of the current frame image;
performing key point detection on the current frame image based on the palm information of the current frame image, so as to obtain the hand key point information in the current frame image; and
performing single-frame gesture recognition on the hand key point information in the current frame image, so as to obtain a single-frame static gesture in the current frame image.
Further objects and advantages of the present application will become fully apparent from the following description and the accompanying drawings.
These and other objects, features and advantages of the present application will become more fully apparent from the following detailed description, the accompanying drawings and the appended claims.
Drawings
FIG. 1 is a schematic diagram of an algorithm framework of a gesture recognition method based on two-dimensional images according to an embodiment of the present application.
FIG. 2 is a flow chart of a gesture recognition method based on two-dimensional images according to an embodiment of the present application.
Fig. 3A and 3B are schematic flow diagrams illustrating a palm tracking step in the gesture recognition method based on two-dimensional images according to the above embodiment of the present application.
Fig. 4 shows an example of the palm tracking step in the two-dimensional image-based gesture recognition method according to the above embodiment of the present application.
Fig. 5A shows an example of hand keypoint information in a previous frame image according to the present application.
Fig. 5B shows an example of a minimum bounding rectangle in a current frame image according to the present application.
Fig. 5C shows an example of a spread rectangle in the current frame image according to the present application.
Fig. 5D shows an example of the cross-over ratio between the expanded rectangle in the current frame image and the keypoint detection frame in the previous frame image according to the present application.
Fig. 5E shows an example of the direction information of the palm in the current frame image according to the present application.
Fig. 6 shows a schematic flow chart of a key point detection step in the gesture recognition method based on two-dimensional images according to the above embodiment of the present application.
Fig. 7 shows an example of hand keypoint information in a current frame image according to the present application.
Fig. 8 is a flowchart illustrating a single-frame gesture recognition step in the gesture recognition method based on two-dimensional images according to the above embodiment of the present application.
Fig. 9 is a flowchart illustrating steps of multi-frame gesture recognition in the two-dimensional image-based gesture recognition method according to the above embodiment of the present application.
Fig. 10 shows an example of the multi-frame gesture recognition step in the two-dimensional image-based gesture recognition method according to the above embodiment of the present application.
Fig. 11 illustrates one example of a cache queue update in accordance with the present application.
FIG. 12 is a block diagram schematic of the two-dimensional image-based gesture recognition system according to an embodiment of the present application.
Fig. 13 shows a block diagram schematic of an electronic device according to an embodiment of the application.
Detailed Description
The following description is presented to enable one of ordinary skill in the art to make and use the invention. The preferred embodiments in the following description are by way of example only and other obvious variations will occur to those skilled in the art. The basic principles of the invention defined in the following description may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.
In the present invention, the terms "a" and "an" in the claims and specification should be understood as "one or more"; that is, in one embodiment the number of an element may be one, while in another embodiment the number of the element may be plural. The terms "a" and "an" are not to be construed as unique or singular, and the term "the" is not to be construed as limiting the number of the element, unless the disclosure of the present invention specifically indicates that the number of the element is only one.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Currently, augmented reality (AR) technology is increasingly applied to mobile terminals such as AR glasses, and is favored by major mobile device manufacturers and users. As the most natural means of human-computer interaction, gesture recognition has great application value in augmented reality. Gesture recognition methods can be divided into two main categories according to the data they acquire: the first obtains data through sensor devices such as data gloves; the second is based on visual data, including two-dimensional images, three-dimensional point clouds, or depth images. However, although the first method generally achieves high recognition accuracy, the sensor devices are expensive and inconvenient to operate; the second method has low equipment requirements and is convenient to operate, but often suffers from insufficient recognition speed and accuracy. Therefore, in order to solve the above problems, the present application provides a gesture recognition method based on a two-dimensional image, a system thereof and an electronic device, whose algorithm framework is shown in fig. 1 and mainly comprises the modules of palm detection, palm tracking, key point detection, single-frame gesture recognition and multi-frame gesture recognition. In particular, the palm detection and key point detection modules extract the hand key points from the input RGB image, palm tracking accelerates this extraction, and the key points are then fed into single-frame gesture recognition and multi-frame gesture recognition to obtain the final gesture result.
Schematic method
Referring to fig. 2 to 11 of the drawings, a gesture recognition method based on a two-dimensional image according to an embodiment of the present application is illustrated. Specifically, as shown in fig. 2, the gesture recognition method based on the two-dimensional image may include the steps of:
S100: tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain the palm information of the current frame image;
S200: performing key point detection on the current frame image based on the palm information of the current frame image, so as to obtain the hand key point information in the current frame image; and
S300: performing single-frame gesture recognition on the hand key point information in the current frame image, so as to obtain a single-frame static gesture in the current frame image.
It is worth noting that, in a specific AR application scenario, the frame rate of the gesture recognition images acquired by the mobile terminal generally needs to be greater than 30 frames per second for real-time gesture recognition, which means that the orientation of the palm changes slowly across consecutive frame images. The gesture recognition method based on the two-dimensional image therefore creatively proposes to predict the orientation of the palm in the current frame image according to the hand key point information in the previous frame image, so that the palm information of the current frame image can be obtained through palm tracking without starting the palm detection module. This reduces time overhead and memory consumption, thereby accelerating gesture recognition and improving its real-time performance. It is to be understood that the two-dimensional image referred to in the present application may be implemented as, but is not limited to, an RGB image.
More specifically, as shown in fig. 3A, the step S100 of the gesture recognition method based on a two-dimensional image of the present application may include the steps of:
S110: calculating a minimum circumscribed rectangle by traversing the position information of part of the hand key points in the previous frame image, so as to obtain the center point, the length and the width of the minimum circumscribed rectangle;
S120: expanding the minimum circumscribed rectangle according to a first expansion ratio while keeping the aspect ratio unchanged, so as to obtain an expanded rectangle;
S130: judging whether the intersection ratio between the expanded rectangle and a key point detection frame in the previous frame image is smaller than a first threshold value;
S140: in response to the intersection ratio being smaller than the first threshold value, starting a palm detection module to obtain the palm information of the current frame image; and
S150: in response to the intersection ratio being greater than or equal to the first threshold value, performing orientation calculation on the expanded rectangle to obtain the palm information of the current frame image.
It should be noted that, although the orientation of the palm changes slowly across consecutive frame images, the area or proportion occupied by the palm may change greatly. The gesture recognition method based on the two-dimensional image therefore expands the minimum circumscribed rectangle to ensure that the rectangular frame used in subsequent analysis covers the palm in the current frame image as completely as possible, which benefits the accuracy of gesture recognition. In addition, in the step S120 of the gesture recognition method based on the two-dimensional image of the present application, the first expansion ratio may be preset according to gesture recognition experience and the specific application scenario, but is not limited thereto.
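For illustration, a minimal Python sketch of steps S110 and S120 (not part of the patent): it assumes the hand key points are given as an (N, 2) array of pixel coordinates, and the expansion ratio of 1.3 is an illustrative value, since the patent leaves the first expansion ratio to be preset from experience.

```python
import numpy as np

def expand_min_bounding_rect(keypoints: np.ndarray, expand_ratio: float = 1.3):
    """Steps S110-S120 (sketch): minimum bounding rectangle of the previous
    frame's hand key points, expanded about its center with the aspect
    ratio kept unchanged.

    keypoints: (N, 2) array of (x, y) pixel coordinates.
    Returns (cx, cy, w, h) of the expanded rectangle.
    """
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    w, h = x_max - x_min, y_max - y_min
    # Scaling width and height by the same factor keeps the aspect ratio.
    return cx, cy, w * expand_ratio, h * expand_ratio
```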
According to the above-described embodiment of the present application, in the step S130 of the two-dimensional image-based gesture recognition method of the present application, the first threshold may be, but is not limited to being, implemented as 0.5. In this way, when the intersection ratio (IOU) between the expanded rectangle and the key point detection frame in the previous frame image is smaller than the first threshold value, the palm in the current frame image is considered to have a large displacement compared with the palm in the previous frame image; the assumption that the palm orientation changes slowly across consecutive frames no longer holds, so palm tracking cannot use the expanded rectangle as the key point detection frame of the current frame image, and the palm detection module must be started to detect the position information and direction information of the palm center directly from the current frame image, so as to obtain the palm information of the current frame image directly. Conversely, when the intersection ratio between the expanded rectangle and the key point detection frame in the previous frame image is greater than or equal to the first threshold value, the palm in the current frame image is considered to have a small displacement compared with the palm in the previous frame image and the palm orientation changes slowly across consecutive frames, so palm tracking can use the expanded rectangle as the key point detection frame of the current frame image, and the palm information of the current frame image can thereby be obtained.
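Steps S130 to S150 hinge on this intersection ratio. A standard IOU computation for axis-aligned boxes is sketched below; the (x1, y1, x2, y2) box layout and the helper name are assumptions for illustration, not prescribed by the patent.

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Step S130: the text implements the first threshold as 0.5 (not limited to it).
FIRST_THRESHOLD = 0.5
# iou(expanded_rect, prev_detection_frame) < FIRST_THRESHOLD -> start palm detection (S140)
# otherwise                                                  -> orientation calculation (S150)
```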
Preferably, the palm detection module of the present application may be implemented as, but not limited to, a palm detection model based on deep learning, wherein the input data is an RGB image, and the correspondingly output data is position information of a palm center and a palm direction in the RGB image, and is generally represented by a bounding box formed by an upper left boundary point and a lower right boundary point. It can be understood that the key point detection frame in the current frame image may be a bounding box output by the palm detection module or an expanded rectangle obtained by palm tracking.
More preferably, the backbone network of the palm detection model is constructed from residual modules, and the head network fuses the feature maps output by different layers; the model is thus an improvement that takes the SSD detection network as its prototype, which facilitates outputting palm information with higher precision.
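The patent specifies only the detector's interface: an RGB image in, and a bounding box (top-left and bottom-right points) plus palm direction out. A hypothetical container for that output might look as follows; the field names, the angle representation and the confidence score are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class PalmDetection:
    """Assumed output of the deep-learning palm detector: a bounding box
    given by its top-left and bottom-right points, plus a palm direction."""
    x1: float   # top-left x
    y1: float   # top-left y
    x2: float   # bottom-right x
    y2: float   # bottom-right y
    angle: float  # palm direction in degrees (representation is assumed)
    score: float  # detection confidence (not specified by the patent)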
According to an example of the present application, as shown in fig. 3B, the step S150 of the gesture recognition method based on two-dimensional image of the present application may include the steps of:
S151: judging whether the expanded rectangle is smaller than a rectangle frame threshold value, wherein the rectangle frame threshold value is dynamically updated by the palm detection module;
S152: expanding the expanded rectangle according to a second expansion ratio in response to the expanded rectangle being smaller than the rectangle frame threshold value, such that the expanded rectangle is greater than or equal to the rectangle frame threshold value;
S153: normalizing the expanded rectangle that is greater than or equal to the rectangle frame threshold value, so as to obtain the position information of the expanded rectangle in the current frame image; and
S154: calculating the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
It should be noted that, when the expanded rectangle is smaller than the rectangle frame threshold, the portion of the current frame image corresponding to the expanded rectangle may be difficult to cover the complete hand image, so that in order to enable the keypoint detection frame in the current frame image to cover the complete hand image as much as possible, the step S152 of the gesture recognition method based on two-dimensional image of the present application further expands the expanded rectangle when the expanded rectangle is smaller than the rectangle frame threshold, so as to ensure that the expanded rectangle is greater than or equal to the rectangle frame threshold.
Preferably, the second expansion ratio may be set in real time according to, but not limited to, the size relationship between the rectangle frame threshold and the expanded rectangle. Of course, the second expansion ratio may also be preset according to experience, so long as the further-expanded rectangle is guaranteed to be greater than or equal to the rectangle frame threshold value, which is not described in detail in the present application.
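A possible reading of step S152, sketched under the assumption that the rectangle frame threshold is a minimum width and height and that the second expansion ratio is derived in real time from the size relationship, as the preceding paragraph suggests:

```python
def second_expand(w: float, h: float, min_w: float, min_h: float):
    """Step S152 (sketch): grow (w, h) just enough to reach the dynamically
    updated rectangle-frame threshold (min_w, min_h), keeping the aspect
    ratio. The ratio is derived from the size relationship in real time."""
    ratio = max(min_w / w, min_h / h, 1.0)  # 1.0 means already large enough
    return w * ratio, h * ratio
```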
It should be noted that, between the step S153 and the step S154 of the present application, the step S150 of the gesture recognition method based on the two-dimensional image may further include the step of:
judging whether a palm has already been saved by calculating the intersection ratio between the expanded rectangle and each saved key point detection frame; if the intersection ratio is greater than or equal to 0.3, the palm is judged to be already saved and the palm detection module is started; otherwise, the step S154 is performed.
It should be noted that this check is performed before the step S154 mainly because the left palm and the right palm may appear in the current frame image at the same time; a large overlap with a saved palm (i.e., a saved key point detection frame) indicates that palm tracking has failed, and the palm detection module needs to be started to obtain the palm information of the current frame image.
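The saved-palm check could then reuse the iou() helper sketched above; this fragment assumes the saved palms are kept as a list of key point detection frames, which the text implies but does not spell out.

```python
def overlaps_saved_palm(expanded_box, saved_boxes, thresh: float = 0.3) -> bool:
    """True if the tracked rectangle overlaps an already-saved key point
    detection frame by IOU >= 0.3, in which case palm tracking is deemed
    to have failed and the palm detection module is restarted."""
    return any(iou(expanded_box, saved) >= thresh for saved in saved_boxes)
```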
Illustratively, as shown in fig. 4, the specific algorithm of step S100 of the gesture recognition method based on the two-dimensional image of the present application may be implemented as, but is not limited to, the following flow:
(1) Read in the key point detection information (shown in fig. 5A); if the number of palms is greater than 0, enter the next step; otherwise, exit the algorithm;
(2) Calculate the minimum circumscribed rectangle (shown in fig. 5B) by traversing the position information of the hand key points, compute its center point, length and width, and expand it while maintaining the aspect ratio to obtain an expanded rectangle (shown in fig. 5C);
(3) Calculate the intersection ratio (IOU) between the expanded rectangle and the detection frame of the previous frame (shown in fig. 5D); if the IOU is smaller than the threshold value and the palm detection module has not been started, consider that the current gesture has a large displacement compared with the previous frame, discard the result, start palm detection and exit the algorithm; otherwise, enter the next step;
(4) Judge whether the expanded rectangle is smaller than the rectangle frame threshold value (the threshold value is dynamically updated by the palm detection module); if so, expand it according to a certain ratio, and then enter step (5);
(5) Perform rectangle normalization, i.e., convert the rectangle coordinates into offsets relative to the top-left origin of the image, and convert the width and height into proportions of the width and height of the image;
(6) Calculate the IOU between the current palm information and the saved results to judge whether the palm has already been saved; if there is a large overlap, start palm detection and end the algorithm; otherwise, enter step (7);
(7) Calculate the angle information of the current palm (shown in fig. 5E), save the palm information into the result queue, and end the algorithm.
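Flow steps (5) and (7) admit a compact sketch. The normalization follows the text directly; the angle computation is an assumption, since the patent does not state which key points define the palm direction; the wrist-to-middle-finger vector used here is borrowed from common 21-point hand layouts.

```python
import numpy as np

def normalize_rect(x1: float, y1: float, w: float, h: float,
                   img_w: int, img_h: int):
    """Flow step (5): coordinates become offsets from the image's top-left
    origin; width and height become proportions of the image size."""
    return x1 / img_w, y1 / img_h, w / img_w, h / img_h

def palm_angle(keypoints: np.ndarray, wrist: int = 0, middle_mcp: int = 9) -> float:
    """Flow step (7), sketch: palm direction as the angle of the
    wrist-to-middle-finger vector, in degrees. The key point indices
    (0 = wrist, 9 = middle-finger MCP) are assumed, not from the patent."""
    v = keypoints[middle_mcp] - keypoints[wrist]
    return float(np.degrees(np.arctan2(v[1], v[0])))
```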
It can be appreciated that the gesture recognition method based on the two-dimensional image exploits a property of the specific application scenario, namely that the palm position changes slowly across consecutive frame images: by predicting the palm position of the current frame from the key point detection information of the previous frame, it reduces the number of times the palm detection module is started, and thereby lowers the time overhead and memory consumption of the algorithm.
According to the above embodiment of the present application, as shown in fig. 6, the step S200 of the gesture recognition method based on two-dimensional image may include the steps of:
S210: performing affine transformation on the current frame image according to the palm information of the current frame image, so as to cut out a palm image corresponding to the palm information of the current frame image; and
S220: extracting the position of each hand key point from the palm image through the key point detection network, so as to obtain the hand key point information in the current frame image.
Preferably, as shown in fig. 7, the hand key points in the current frame image may include the wrist center and the joints of the five fingers, for a total of 21 key points.
It should be noted that, when the previous frame image is the first frame image, the hand key point information in the previous frame image is obtained by first performing palm detection to obtain the position information and direction information of the palm center, and then performing key point detection; when the previous frame image is not the first frame image, the hand key point information in the previous frame image is obtained through the step S100 and the step S200 of the gesture recognition method based on the two-dimensional image of the present application.
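A hedged sketch of the affine crop in step S210, using OpenCV; the 256-pixel output size and the exact parameterization of the palm rectangle (center, side length, angle) are illustrative assumptions.

```python
import cv2
import numpy as np

def crop_palm(image: np.ndarray, center, size: float, angle_deg: float,
              out_size: int = 256) -> np.ndarray:
    """Step S210 (sketch): rotate the image so the palm is upright, then
    crop a square patch around the palm center for the key point network."""
    # Rotation about the palm center, scaled so `size` pixels map to `out_size`.
    m = cv2.getRotationMatrix2D(tuple(center), angle_deg, out_size / size)
    # Shift so the palm center lands in the middle of the output patch.
    m[0, 2] += out_size / 2.0 - center[0]
    m[1, 2] += out_size / 2.0 - center[1]
    return cv2.warpAffine(image, m, (out_size, out_size))
```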
According to the above embodiment of the present application, as shown in fig. 8, the step S300 of the gesture recognition method based on two-dimensional image may include the steps of:
S310: calculating feature values from the hand key point information in the current frame image, so as to obtain features of a plurality of different dimensions; and
S320: performing gesture recognition on the features of the plurality of different dimensions through a random forest, so as to obtain a static gesture recognition result in the current frame image.
Illustratively, in the step S310 of the gesture recognition method based on the two-dimensional image of the present application, 30 features of different dimensions may be calculated from the two-dimensional key points of the palm, and in the step S320, the static gesture recognition results 0-9 of the current frame are recognized by the random forest.
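The patent states that 30 features are computed from the 2D key points and classified by a random forest, but does not enumerate the features. A sketch with illustrative features follows (normalized fingertip-to-wrist distances and angles; the fingertip indices assume the 21-point layout of fig. 7):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def keypoint_features(kps: np.ndarray) -> np.ndarray:
    """Sketch of step S310: scale-invariant features from the 21 2D key
    points. These 10 features are an illustrative stand-in for the 30
    features the patent mentions but does not list."""
    wrist = kps[0]
    tips = kps[[4, 8, 12, 16, 20]]  # assumed fingertip indices
    palm_scale = np.linalg.norm(kps[9] - wrist) + 1e-6  # normalization length
    dists = np.linalg.norm(tips - wrist, axis=1) / palm_scale
    vecs = tips - wrist
    angles = np.arctan2(vecs[:, 1], vecs[:, 0])
    return np.concatenate([dists, angles])

# Step S320: a random forest maps the feature vector to static gestures 0-9.
clf = RandomForestClassifier(n_estimators=100)
# clf.fit(train_features, train_labels)  # labels in {0, ..., 9}
# gesture = clf.predict([keypoint_features(kps_current_frame)])[0]
```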
Notably, the main idea of existing dynamic gesture recognition schemes based on gesture detection is as follows: first, recognize the gesture of a single frame to obtain the static gesture of the current frame; then, perform dynamic gesture recognition according to the switching of static gestures. Although such methods can run in real time on mobile terminals, they have the following limitations: 1) since the detection result is only the outer rectangular outline of the hand, with no hand key points, many dynamic gestures, including two-hand motions and two-hand interactions, cannot be realized; 2) dynamic gesture recognition can only be performed on the basis of defined static gestures, so extensibility is poor; 3) undefined motions are prone to false detection, which reduces recognition accuracy.
Therefore, in order to solve the above-mentioned problem, as shown in fig. 2, the gesture recognition method based on a two-dimensional image according to the present application may further include, after the step S300, the steps of:
S400: based on single-frame static gestures and hand key point information in a buffer queue, detecting a single-frame gesture change process in the buffer queue so as to realize multi-frame dynamic gesture recognition.
Specifically, as shown in fig. 9, the step S400 of the gesture recognition method based on two-dimensional image of the present application may include the steps of:
S410: filling the single-frame static gesture and the hand key point information in the current frame image into the buffer queue to update the buffer queue;
S420: searching whether a single-frame gesture motion change matching the predefined setting exists in the updated buffer queue; and
S430: in response to the presence of a single-frame gesture motion change matching the predefined setting, performing dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain a single-hand motion recognition result.
It should be noted that, in the gesture recognition process, there are usually not only single-hand motions but also two-hand motions. Therefore, in order to enable the recognition of two-hand motions, as shown in fig. 9, the step S400 of the gesture recognition method based on the two-dimensional image of the present application may further include the steps of:
S440: judging whether the hand key points in the updated buffer queue contain the key points of both hands and persist for a predetermined number of frames; and
S450: in response to the hand key points containing the key points of both hands for the predetermined number of frames, performing fusion recognition on the single-hand motion recognition results of the two hands, so as to output a two-hand motion recognition result.
Illustratively, the multi-frame gesture recognition of the present application detects single-frame gesture changes within a period of time (i.e., within the buffer queue) based on the serialized single-frame static gestures and key points in the buffer queue, thereby realizing multi-frame dynamic gesture recognition. As shown in fig. 10, the main process is as follows: first, single-hand motion recognition is performed, and the recognition result and key point information are updated into the buffer queue, as shown in fig. 11; then, the buffer queue is searched for a single-frame motion change matching the predefined setting, for example a change from gesture 0 to gesture 5; if such a change exists, the motion is identified by combining the changes of the key point coordinates, and if identification succeeds, the single-hand motion result is saved, otherwise single-hand motion recognition fails; thereafter, the same procedure is used to recognize the single-hand motion of the other hand. It is noted that the present application can determine whether the current motion is a one-hand or a two-hand motion through the iTwohandsFlag parameter: iTwohandsFlag defaults to one-hand motion, and is modified to two-hand motion when the input key points contain 2 hands for 5 consecutive frames. If the current motion is a two-hand motion, the single-hand motions of the two hands are fused; otherwise, the multi-frame motion recognition result is output directly.
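A minimal sketch of the buffer-queue update and transition search described above; the queue length, the adjacent-frame check and the (0, 5) transition encoding are simplifications for illustration, not prescribed by the patent.

```python
from collections import deque

QUEUE_LEN = 16  # buffer length is not specified by the patent; illustrative
buffer = deque(maxlen=QUEUE_LEN)  # entries: (static_gesture, keypoints)

def update_and_search(gesture: int, keypoints, transition=(0, 5)) -> bool:
    """Push the newest single-frame result, then search the queue for a
    predefined gesture switch (e.g. gesture 0 followed by gesture 5).
    The adjacent-frame check is a simplification; the caller then confirms
    the dynamic gesture from the key point coordinate changes."""
    buffer.append((gesture, keypoints))
    gestures = [g for g, _ in buffer]
    return any(a == transition[0] and b == transition[1]
               for a, b in zip(gestures, gestures[1:]))

# iTwohandsFlag (per the text): defaults to one-hand mode and is switched to
# two-hand mode once the input key points contain 2 hands for 5 consecutive
# frames; in two-hand mode the two single-hand results are fused.
```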
Schematic System
Referring to FIG. 12 of the drawings, a two-dimensional image-based gesture recognition system in accordance with an embodiment of the present invention is illustrated. Specifically, as shown in fig. 12, the gesture recognition system 1 based on two-dimensional images may include, communicatively connected to each other: a palm tracking module 10, configured to track the orientation of the palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain the palm information of the current frame image; a keypoint detection module 20, configured to perform keypoint detection on the current frame image based on the palm information of the current frame image, so as to obtain the hand keypoint information in the current frame image; and a single-frame gesture recognition module 30, configured to perform single-frame gesture recognition on the hand keypoint information in the current frame image, so as to obtain a single-frame static gesture in the current frame image.
More specifically, as shown in fig. 12, the palm tracking module 10 includes, communicatively connected to each other: a rectangle calculation module 11, configured to calculate a minimum circumscribed rectangle by traversing the position information of part of the hand key points in the previous frame image, so as to obtain the center point, the length and the width of the minimum circumscribed rectangle; a rectangle expansion module 12, configured to expand the minimum circumscribed rectangle according to a first expansion ratio while keeping the aspect ratio unchanged, so as to obtain an expanded rectangle; an intersection ratio judging module 13, configured to judge whether the intersection ratio between the expanded rectangle and the key point detection frame in the previous frame image is smaller than a first threshold value; a palm detection starting module 14, which is started to obtain the palm information of the current frame image in response to the intersection ratio being smaller than the first threshold value; and an orientation calculation module 15, configured to perform orientation calculation on the expanded rectangle in response to the intersection ratio being greater than or equal to the first threshold value, so as to obtain the palm information of the current frame image.
Notably, in an example of the present application, the orientation calculation module 15 is further configured to: judge whether the expanded rectangle is smaller than a rectangle frame threshold value, wherein the rectangle frame threshold value is dynamically updated by the palm detection module; expand the expanded rectangle according to a second expansion ratio in response to the expanded rectangle being smaller than the rectangle frame threshold value, such that the expanded rectangle is greater than or equal to the rectangle frame threshold value; normalize the expanded rectangle that is greater than or equal to the rectangle frame threshold value, so as to obtain the position information of the expanded rectangle in the current frame image; and calculate the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
According to the above embodiment of the present application, as shown in fig. 12, the keypoint detection module 20 includes an affine transformation module 21 and a keypoint extraction module 22 that are communicatively connected to each other, wherein the affine transformation module 21 is configured to perform affine transformation on the current frame image according to the palm information of the current frame image so as to cut out a palm image corresponding to the palm information of the current frame image; the keypoint extraction module 22 is configured to extract, from the palm image, a position of each hand keypoint through the keypoint detection network, so as to obtain the hand keypoint information in the current frame image.
According to the above embodiment of the present application, as shown in fig. 12, the single-frame gesture recognition module 30 includes a feature value calculation module 31 and a random forest module 32 that are communicatively connected to each other, where the feature value calculation module is configured to perform feature value calculation through the hand key point information in the current frame image, so as to obtain features of multiple different dimensions; the random forest module is used for carrying out gesture recognition on the characteristics of the plurality of different dimensions through the random forest so as to obtain a static gesture recognition result in the current frame image.
According to the above embodiment of the present application, as shown in fig. 12, the gesture recognition system 1 based on a two-dimensional image further includes a multi-frame gesture recognition module 40, configured to detect a single-frame gesture change process in the buffer queue based on the single-frame static gestures and the hand key point information in the buffer queue, so as to realize recognition of multi-frame dynamic gestures.
It should be noted that, as shown in fig. 12, the multi-frame gesture recognition module 40 includes a filling update module 41, a search module 42 and a single-hand motion recognition module 43 that are communicatively connected to each other, wherein the filling update module 41 is configured to fill the single-frame static gesture and the hand key point information in the current frame image into the buffer queue, so as to update the buffer queue; the search module 42 is configured to search whether a preset single-frame gesture motion change exists in the updated buffer queue; and the single-hand motion recognition module 43 is configured to, in response to a preset single-frame gesture motion change being present, perform dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain a single-hand motion recognition result.
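The queue logic can be pictured with a fixed-length deque. In the sketch below, the queue depth, the "open palm held across the window" pattern, and the 40-pixel swipe threshold are all illustrative assumptions; the patent only requires that a preset single-frame gesture change be searched for and that keypoint coordinate changes drive the motion decision.

```python
from collections import deque

QUEUE_LEN = 16                       # buffer depth is our assumption
buffer = deque(maxlen=QUEUE_LEN)     # one (static_gesture, keypoints) entry per frame

def update_and_recognize(gesture, keypoints):
    # Fill/update the buffer queue (module 41).
    buffer.append((gesture, keypoints))
    if len(buffer) < QUEUE_LEN:
        return None
    # Search for a preset single-frame gesture pattern (module 42):
    # here, an open palm held across the whole window.
    if not all(g == "palm" for g, _ in buffer):
        return None
    # Classify the motion from keypoint coordinate change (module 43):
    # wrist (index 0) x-displacement across the window.
    dx = buffer[-1][1][0][0] - buffer[0][1][0][0]
    if dx > 40:
        return "swipe_right"
    if dx < -40:
        return "swipe_left"
    return None
```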
In addition, as shown in fig. 12, the multi-frame gesture recognition module 40 further includes a judging module 44 and a two-hand motion recognition module 45 that are communicatively connected to each other, wherein the judging module 44 is configured to judge whether the hand key points in the updated buffer queue contain the key points of both hands for a predetermined number of frames; and the two-hand motion recognition module 45 is configured to, in response to the hand key points containing the key points of both hands for the predetermined number of frames, perform fusion recognition on the single-hand motion recognition results of the two hands, so as to output a two-hand motion recognition result.
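Fusion can then be as simple as a lookup over the pair of single-hand results, gated by the two-hand persistence check. The pairing table and the eight-frame persistence value below are illustrative assumptions; the patent does not enumerate the two-hand actions.

```python
def fuse_two_hands(left_result, right_result, frames_with_two_hands, min_frames=8):
    # Only fuse once keypoints of both hands have persisted long enough
    # (module 44); otherwise report no two-hand action.
    if frames_with_two_hands < min_frames:
        return None
    # Map pairs of single-hand motions to two-hand actions (module 45);
    # this table is an illustrative assumption.
    two_hand_actions = {
        ("swipe_left", "swipe_right"): "zoom_out",
        ("swipe_right", "swipe_left"): "zoom_in",
    }
    return two_hand_actions.get((left_result, right_result))
```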
Exemplary Electronic Device
Next, an electronic device according to an embodiment of the present invention is described with reference to fig. 13. As shown in fig. 13, the electronic device 90 includes one or more processors 91 and memory 92.
The processor 91 may be a Central Processing Unit (CPU) or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 90 to perform desired functions. In other words, the processor 91 comprises one or more physical devices configured to execute instructions. For example, the processor 91 may be configured to execute instructions that are part of one or more applications, services, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more components, achieve a technical effect, or otherwise arrive at a desired result.
The processor 91 may include one or more processors configured to execute software instructions. Additionally or alternatively, the processor 91 may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. The processors of the processor 91 may be single-core or multi-core, and the instructions executed thereon may be configured for serial, parallel, and/or distributed processing. The various components of the processor 91 may optionally be distributed across two or more separate devices, which may be remotely located and/or configured for coordinated processing. Aspects of the processor 91 may be virtualized and executed by remotely accessible networked computing devices configured in a cloud computing configuration.
The memory 92 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 91 to perform some or all of the steps in the above-described exemplary methods of the present invention, and/or other desired functions.
In other words, the memory 92 includes one or more physical devices configured to hold machine-readable instructions executable by the processor 91 to implement the methods and processes described herein. In implementing these methods and processes, the state of the memory 92 may be transformed (e.g., different data is saved). The memory 92 may include removable and/or built-in devices. The memory 92 may include optical memory (e.g., CD, DVD, HD-DVD, Blu-ray Disc, etc.), semiconductor memory (e.g., RAM, EPROM, EEPROM, etc.), and/or magnetic memory (e.g., hard disk drive, floppy disk drive, tape drive, MRAM, etc.), among others. The memory 92 may include volatile, non-volatile, dynamic, static, read/write, read-only, random-access, sequential-access, location-addressable, file-addressable, and/or content-addressable devices.
It is to be appreciated that the memory 92 includes one or more physical devices. However, aspects of the instructions described herein may alternatively be propagated through a communication medium (e.g., an electromagnetic signal, an optical signal, etc.) that is not held by a physical device for a finite duration. Aspects of the processor 91 and the memory 92 may be integrated together into one or more hardware logic components. Such hardware logic components may include, for example, field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASICs/ASICs), program- and application-specific standard products (PSSPs/ASSPs), systems-on-a-chip (SOCs), and complex programmable logic devices (CPLDs).
In one example, as shown in FIG. 13, the electronic device 90 may further include an input device 93 and an output device 94, which are interconnected by a bus system and/or another form of connection mechanism (not shown). For example, the input device 93 may be a camera module for capturing image data or video data. As another example, the input device 93 may include or interface with one or more user input devices such as a keyboard, mouse, touch screen, or game controller. In some embodiments, the input device 93 may include or interface with selected Natural User Input (NUI) components. Such components may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on-board or off-board. Example NUI components may include microphones for speech and/or voice recognition; infrared, color, stereoscopic, and/or depth cameras for machine vision and/or gesture recognition; head trackers, eye trackers, accelerometers and/or gyroscopes for motion detection and/or intent recognition; electric-field sensing components for assessing brain activity and/or body movement; and/or any other suitable sensor.
The output device 94 may output various information, including classification results and the like, to the outside. The output device 94 may include, for example, a display, speakers, a printer, and a communication network and the remote output devices connected thereto.
Of course, the electronic device 90 may further comprise a communication device, which may be configured to communicatively couple the electronic device 90 with one or more other computer devices. The communication device may comprise wired and/or wireless communication equipment compatible with one or more different communication protocols. As a non-limiting example, the communication device may be configured for communication via a wireless telephone network, or a wired or wireless local or wide area network. In some embodiments, the communication device may allow the electronic device 90 to send messages to and/or receive messages from other devices via a network such as the Internet.
It will be appreciated that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Also, the order of the above-described processes may be changed.
Of course, for simplicity, only some of the components of the electronic device 90 that are relevant to the present invention are shown in fig. 13; components such as buses and input/output interfaces are omitted. In addition, the electronic device 90 may include any other suitable components depending on the particular application.
It should also be noted that, in the apparatus, devices and methods of the present invention, the components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be considered equivalent aspects of the present invention.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the invention. Thus, the present invention is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
It will be appreciated by persons skilled in the art that the embodiments of the invention described above and shown in the drawings are by way of example only and are not limiting. The objects of the present invention have been fully and effectively achieved. The functional and structural principles of the present invention have been shown and described in the examples, and the embodiments of the invention may be modified or practiced without departing from those principles.

Claims (14)

1. A two-dimensional image-based gesture recognition method, characterized by comprising the steps of:
tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain the palm information of the current frame image;
performing key point detection on the current frame image based on the palm information of the current frame image, so as to obtain the hand key point information in the current frame image; and
performing single-frame gesture recognition on the hand key point information in the current frame image, so as to obtain a single-frame static gesture in the current frame image;
wherein the step of tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image so as to obtain the palm information of the current frame image comprises the steps of:
calculating a minimum bounding rectangle by traversing the position information of some of the hand key points in the previous frame image, so as to obtain the center point, length and width of the minimum bounding rectangle;
expanding the minimum bounding rectangle by a first expansion ratio while keeping the aspect ratio unchanged, so as to obtain an expanded rectangle;
judging whether the intersection ratio between the expanded rectangle and a key point detection frame in the previous frame image is smaller than a first threshold;
in response to the intersection ratio being smaller than the first threshold, starting a palm detection module to obtain the palm information of the current frame image; and
in response to the intersection ratio being greater than or equal to the first threshold, performing orientation calculation on the expanded rectangle, so as to obtain the palm information of the current frame image;
and wherein the step of performing orientation calculation on the expanded rectangle so as to obtain the palm information of the current frame image in response to the intersection ratio being greater than or equal to the first threshold comprises the steps of:
judging whether the expanded rectangle is smaller than a rectangle frame threshold, wherein the rectangle frame threshold is dynamically updated by the palm detection module;
in response to the expanded rectangle being smaller than the rectangle frame threshold, expanding the expanded rectangle by a second expansion ratio, such that the expanded rectangle becomes greater than or equal to the rectangle frame threshold;
normalizing the expanded rectangle that is greater than or equal to the rectangle frame threshold, so as to obtain the position information of the expanded rectangle in the current frame image; and
calculating the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
2. The two-dimensional image-based gesture recognition method of claim 1, wherein the palm detection module is a deep learning-based palm detection model.
3. The two-dimensional image-based gesture recognition method of claim 1 or 2, wherein the step of performing key point detection on the current frame image based on the palm information of the current frame image to obtain the hand key point information in the current frame image comprises the steps of:
performing affine transformation on the current frame image according to the palm information of the current frame image, so as to cut out a palm image corresponding to the palm information of the current frame image; and
extracting the position of each hand key point from the palm image through a key point detection network, so as to obtain the hand key point information in the current frame image.
4. The two-dimensional image-based gesture recognition method of claim 1 or 2, wherein the step of performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image comprises the steps of:
calculating feature values from the hand key point information in the current frame image, so as to obtain features of a plurality of different dimensions; and
performing gesture recognition on the features of the plurality of different dimensions through a random forest, so as to obtain a static gesture recognition result in the current frame image.
5. The two-dimensional image-based gesture recognition method according to claim 1 or 2, further comprising the step of:
detecting a single-frame gesture change process in a buffer queue based on the single-frame static gestures and the hand key point information in the buffer queue, so as to realize recognition of multi-frame dynamic gestures.
6. The two-dimensional image-based gesture recognition method of claim 5, wherein the step of detecting the single-frame gesture change process in the buffer queue based on the single-frame static gestures and the hand key point information in the buffer queue to realize recognition of multi-frame dynamic gestures comprises the steps of:
filling the single-frame static gesture and the hand key point information in the current frame image into the buffer queue to update the buffer queue;
searching whether a preset single-frame gesture motion change exists in the updated buffer queue; and
in response to the preset single-frame gesture motion change being present, performing dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain a single-hand motion recognition result.
7. The two-dimensional image-based gesture recognition method of claim 6, wherein the step of detecting the single-frame gesture change process in the buffer queue based on the single-frame static gestures and the hand key point information in the buffer queue to realize recognition of multi-frame dynamic gestures further comprises the steps of:
judging whether the hand key points in the updated buffer queue contain the key points of both hands for a predetermined number of frames; and
in response to the hand key points containing the key points of both hands for the predetermined number of frames, performing fusion recognition on the single-hand motion recognition results of the two hands, so as to output a two-hand motion recognition result.
8. A two-dimensional image-based gesture recognition system, characterized by comprising:
a palm tracking module, configured to track the orientation of the palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain the palm information of the current frame image;
a key point detection module, configured to perform key point detection on the current frame image based on the palm information of the current frame image, so as to obtain the hand key point information in the current frame image; and
a single-frame gesture recognition module, configured to perform single-frame gesture recognition on the hand key point information in the current frame image, so as to obtain a single-frame static gesture in the current frame image;
wherein the palm tracking module comprises:
a rectangle calculation module, configured to calculate a minimum bounding rectangle by traversing the position information of some of the hand key points in the previous frame image, so as to obtain the center point, length and width of the minimum bounding rectangle;
a rectangle expansion module, configured to expand the minimum bounding rectangle by a first expansion ratio while keeping the aspect ratio unchanged, so as to obtain an expanded rectangle;
an intersection ratio judging module, configured to judge whether the intersection ratio between the expanded rectangle and the key point detection frame in the previous frame image is smaller than a first threshold;
a palm detection starting module, which, in response to the intersection ratio being smaller than the first threshold, starts the palm detection module to obtain the palm information of the current frame image; and
an orientation calculation module, configured to perform orientation calculation on the expanded rectangle in response to the intersection ratio being greater than or equal to the first threshold, so as to obtain the palm information of the current frame image;
wherein the orientation calculation module is further configured to: judge whether the expanded rectangle is smaller than a rectangle frame threshold, wherein the rectangle frame threshold is dynamically updated by the palm detection module; in response to the expanded rectangle being smaller than the rectangle frame threshold, expand the expanded rectangle by a second expansion ratio, such that the expanded rectangle becomes greater than or equal to the rectangle frame threshold; normalize the expanded rectangle that is greater than or equal to the rectangle frame threshold, so as to obtain the position information of the expanded rectangle in the current frame image; and calculate the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
9. The two-dimensional image-based gesture recognition system of claim 8, wherein the key point detection module comprises an affine transformation module and a key point extraction module communicatively connected to each other, wherein the affine transformation module is configured to perform affine transformation on the current frame image according to the palm information of the current frame image, so as to cut out a palm image corresponding to the palm information of the current frame image; and the key point extraction module is configured to extract the position of each hand key point from the palm image through a key point detection network, so as to obtain the hand key point information in the current frame image.
10. The two-dimensional image-based gesture recognition system of claim 8, wherein the single-frame gesture recognition module comprises a feature value calculation module and a random forest module communicatively connected to each other, wherein the feature value calculation module is configured to calculate feature values from the hand key point information in the current frame image, so as to obtain features of a plurality of different dimensions; and the random forest module is configured to perform gesture recognition on the features of the plurality of different dimensions through a random forest, so as to obtain a static gesture recognition result in the current frame image.
11. The two-dimensional image-based gesture recognition system of claim 8, further comprising a multi-frame gesture recognition module, configured to detect a single-frame gesture change process in a buffer queue based on the single-frame static gestures and the hand key point information in the buffer queue, so as to realize recognition of multi-frame dynamic gestures.
12. The two-dimensional image-based gesture recognition system of claim 11, wherein the multi-frame gesture recognition module comprises a filling update module, a search module and a single-hand motion recognition module communicatively connected to each other, wherein the filling update module is configured to fill the single-frame static gesture and the hand key point information in the current frame image into the buffer queue to update the buffer queue; the search module is configured to search whether a preset single-frame gesture motion change exists in the updated buffer queue; and the single-hand motion recognition module is configured to, in response to the preset single-frame gesture motion change being present, perform dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain a single-hand motion recognition result.
13. The two-dimensional image-based gesture recognition system of claim 12, wherein the multi-frame gesture recognition module further comprises a judging module and a two-hand motion recognition module communicatively connected to each other, wherein the judging module is configured to judge whether the hand key points in the updated buffer queue contain the key points of both hands for a predetermined number of frames; and the two-hand motion recognition module is configured to, in response to the hand key points containing the key points of both hands for the predetermined number of frames, perform fusion recognition on the single-hand motion recognition results of the two hands, so as to output a two-hand motion recognition result.
14. An electronic device, comprising:
at least one processor for executing instructions; and
a memory communicatively connected to the at least one processor, wherein the memory stores at least one instruction, and the instruction is executed by the at least one processor to cause the at least one processor to perform all of the steps in a two-dimensional image-based gesture recognition method, wherein the gesture recognition method comprises the steps of:
tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain the palm information of the current frame image;
performing key point detection on the current frame image based on the palm information of the current frame image, so as to obtain the hand key point information in the current frame image; and
performing single-frame gesture recognition on the hand key point information in the current frame image, so as to obtain a single-frame static gesture in the current frame image;
wherein the step of tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image so as to obtain the palm information of the current frame image comprises the steps of:
calculating a minimum bounding rectangle by traversing the position information of some of the hand key points in the previous frame image, so as to obtain the center point, length and width of the minimum bounding rectangle;
expanding the minimum bounding rectangle by a first expansion ratio while keeping the aspect ratio unchanged, so as to obtain an expanded rectangle;
judging whether the intersection ratio between the expanded rectangle and a key point detection frame in the previous frame image is smaller than a first threshold;
in response to the intersection ratio being smaller than the first threshold, starting a palm detection module to obtain the palm information of the current frame image; and
in response to the intersection ratio being greater than or equal to the first threshold, performing orientation calculation on the expanded rectangle, so as to obtain the palm information of the current frame image;
and wherein the step of performing orientation calculation on the expanded rectangle so as to obtain the palm information of the current frame image in response to the intersection ratio being greater than or equal to the first threshold comprises the steps of:
judging whether the expanded rectangle is smaller than a rectangle frame threshold, wherein the rectangle frame threshold is dynamically updated by the palm detection module;
in response to the expanded rectangle being smaller than the rectangle frame threshold, expanding the expanded rectangle by a second expansion ratio, such that the expanded rectangle becomes greater than or equal to the rectangle frame threshold;
normalizing the expanded rectangle that is greater than or equal to the rectangle frame threshold, so as to obtain the position information of the expanded rectangle in the current frame image; and
calculating the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
CN202011180708.1A 2020-10-29 2020-10-29 Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment Active CN114510142B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011180708.1A CN114510142B (en) 2020-10-29 2020-10-29 Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment

Publications (2)

Publication Number Publication Date
CN114510142A (en) 2022-05-17
CN114510142B (en) 2023-11-10

Family

ID=81546082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011180708.1A Active CN114510142B (en) 2020-10-29 2020-10-29 Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment

Country Status (1)

Country Link
CN (1) CN114510142B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863541B (en) * 2023-09-01 2023-11-21 芯原科技(上海)有限公司 Dynamic gesture recognition method and device, related equipment and handwriting recognition method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530613A (en) * 2013-10-15 2014-01-22 无锡易视腾科技有限公司 Target person hand gesture interaction method based on monocular video sequence
CN109145803A (en) * 2018-08-14 2019-01-04 京东方科技集团股份有限公司 Gesture identification method and device, electronic equipment, computer readable storage medium
CN109614922A (en) * 2018-12-07 2019-04-12 南京富士通南大软件技术有限公司 A kind of dynamic static gesture identification method and system
US10296102B1 (en) * 2018-01-31 2019-05-21 Piccolo Labs Inc. Gesture and motion recognition using skeleton tracking
CN110163055A (en) * 2018-08-10 2019-08-23 腾讯科技(深圳)有限公司 Gesture identification method, device and computer equipment
CN110458095A (en) * 2019-08-09 2019-11-15 厦门瑞为信息技术有限公司 A kind of recognition methods, control method, device and the electronic equipment of effective gesture
CN111158467A (en) * 2019-12-12 2020-05-15 青岛小鸟看看科技有限公司 Gesture interaction method and terminal
CN111753764A (en) * 2020-06-29 2020-10-09 济南浪潮高新科技投资发展有限公司 Gesture recognition method of edge terminal based on attitude estimation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8396252B2 (en) * 2010-05-20 2013-03-12 Edge 3 Technologies Systems and related methods for three dimensional gesture recognition in vehicles
RU2014108820A (en) * 2014-03-06 2015-09-20 ЭлЭсАй Корпорейшн IMAGE PROCESSOR CONTAINING A SYSTEM FOR RECOGNITION OF GESTURES WITH FUNCTIONAL FEATURES FOR DETECTING AND TRACKING FINGERS
EP3467707B1 (en) * 2017-10-07 2024-03-13 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view

Also Published As

Publication number Publication date
CN114510142A (en) 2022-05-17

Similar Documents

Publication Publication Date Title
US11354825B2 (en) Method, apparatus for generating special effect based on face, and electronic device
CN111259751B (en) Human behavior recognition method, device, equipment and storage medium based on video
US9912874B2 (en) Real-time visual effects for a live camera view
US20180211104A1 (en) Method and device for target tracking
CN108197589B (en) Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture
US9177224B1 (en) Object recognition and tracking
CN109598234B (en) Key point detection method and device
KR101534742B1 (en) System and method for gesture recognition of vehicle
KR20150108888A (en) Part and state detection for gesture recognition
CN106648078B (en) Multi-mode interaction method and system applied to intelligent robot
US20200233487A1 (en) Method of controlling device and electronic device
CN112106042A (en) Electronic device and control method thereof
US11978248B1 (en) Scene graph-based scene re-identification
CN110619656A (en) Face detection tracking method and device based on binocular camera and electronic equipment
WO2023168957A1 (en) Pose determination method and apparatus, electronic device, storage medium, and program
CN110069126B (en) Virtual object control method and device
US11755119B2 (en) Scene controlling method, device and electronic equipment
CN114510142B (en) Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment
CN111986229A (en) Video target detection method, device and computer system
EP3696715A1 (en) Pose recognition method and device
CN111258413A (en) Control method and device of virtual object
CN110941327A (en) Virtual object display method and device
CN114610155A (en) Gesture control method and device, display terminal and storage medium
CN114299615A (en) Key point-based multi-feature fusion action identification method, device, medium and equipment
US11308150B2 (en) Mobile device event control with topographical analysis of digital images inventors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220517

Assignee: Zhejiang Shunwei Technology Co.,Ltd.

Assignor: SUNNY OPTICAL (ZHEJIANG) RESEARCH INSTITUTE Co.,Ltd.

Contract record no.: X2024330000055

Denomination of invention: A gesture recognition method based on two-dimensional images and its system and electronic devices

Granted publication date: 20231110

License type: Common License

Record date: 20240515
