CN114510142A - Gesture recognition method based on two-dimensional image, system thereof and electronic equipment


Info

Publication number
CN114510142A
CN114510142A (application CN202011180708.1A; granted publication CN114510142B)
Authority
CN
China
Prior art keywords
frame image
current frame
palm
image
gesture recognition
Prior art date
Legal status
Granted
Application number
CN202011180708.1A
Other languages
Chinese (zh)
Other versions
CN114510142B (en)
Inventor
徐诚
谢森栋
田文军
李程辉
Current Assignee
Sunny Optical Zhejiang Research Institute Co Ltd
Original Assignee
Sunny Optical Zhejiang Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Sunny Optical Zhejiang Research Institute Co Ltd filed Critical Sunny Optical Zhejiang Research Institute Co Ltd
Priority to CN202011180708.1A priority Critical patent/CN114510142B/en
Publication of CN114510142A publication Critical patent/CN114510142A/en
Application granted granted Critical
Publication of CN114510142B publication Critical patent/CN114510142B/en
Legal status: Active (granted)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0487 - Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
    • G06F3/0488 - Interaction techniques based on graphical user interfaces [GUI] using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • G06F3/04883 - Interaction techniques based on graphical user interfaces [GUI] using a touch-screen or digitiser for inputting data by handwriting, e.g. gesture or text
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/24323 - Tree-organised classifiers

Abstract

Provided are a gesture recognition method based on a two-dimensional image, a system thereof, and electronic equipment. The gesture recognition method based on the two-dimensional image comprises the following steps: tracking the orientation of a palm in the current frame image according to the hand key point information in the previous frame image to obtain palm information of the current frame image; performing key point detection on the current frame image based on the palm information of the current frame image to obtain hand key point information in the current frame image; and performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image.

Description

Gesture recognition method based on two-dimensional image, system thereof and electronic equipment
Technical Field
The invention relates to the technical field of gesture recognition, in particular to a gesture recognition method based on a two-dimensional image, a system and electronic equipment thereof.
Background
Currently, Augmented Reality (AR) technology is increasingly applied to mobile terminals such as AR glasses, and is popular with mobile-device manufacturers and users alike. As the most natural means of human-computer interaction, gesture recognition has great application value in augmented reality. At present, gesture recognition methods can be divided into two main categories according to the data they acquire: the first obtains data from sensor devices such as data gloves; the second is based on visual data, including two-dimensional images, three-dimensional point clouds, or depth images.
However, although the first method generally achieves high recognition accuracy, the sensor devices are expensive and inconvenient to operate; the second method has low equipment requirements and is convenient to operate, but suffers from limited recognition speed and accuracy. For example, one existing end-to-end gesture recognition method uses two models: a lightweight 3D Convolutional Neural Network (CNN) detector for detecting gestures, and a deep 3D CNN classifier for classifying the detected gestures. The workflow of this recognition system is realized with two sliding windows, where the detector and classifier windows have lengths n and m respectively, with n far smaller than m. The workflow starts with the detector, which acts as a switch: when a gesture is detected, the classifier is activated and fed with the frames in the classifier queue for gesture recognition. However, this scheme is computationally very complex, has high hardware requirements, and cannot meet the real-time requirements of mobile devices.
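For illustration only, the following is a minimal sketch of such a prior-art two-sliding-window workflow. The functions detect_gesture and classify_gesture are hypothetical stand-ins for the lightweight 3D-CNN detector and the deep 3D-CNN classifier, and the window lengths are illustrative, not values from any cited system.

```python
from collections import deque

N_DETECTOR = 8     # detector window length n (illustrative; n is far smaller than m)
M_CLASSIFIER = 32  # classifier window length m (illustrative)

def run_workflow(frames, detect_gesture, classify_gesture):
    detector_queue = deque(maxlen=N_DETECTOR)
    classifier_queue = deque(maxlen=M_CLASSIFIER)
    results = []
    for frame in frames:
        detector_queue.append(frame)
        classifier_queue.append(frame)
        # The detector acts as a switch: only when it fires is the (expensive)
        # classifier run on the frames buffered in the classifier queue.
        if len(detector_queue) == N_DETECTOR and detect_gesture(list(detector_queue)):
            results.append(classify_gesture(list(classifier_queue)))
    return results
```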
Disclosure of Invention
An advantage of the present invention is to provide a gesture recognition method based on two-dimensional images, a system thereof, and an electronic device, which can improve the speed of gesture recognition while maintaining the accuracy of gesture recognition, so as to meet the real-time requirement of a mobile terminal device.
Another advantage of the present invention is to provide a gesture recognition method based on two-dimensional images, a system and an electronic device thereof, wherein, in an embodiment of the present invention, the gesture recognition method designs a gesture tracking method according to the characteristics of gesture actions, so that the speed of gesture recognition can be increased while its accuracy is maintained.
Another advantage of the present invention is to provide a gesture recognition method based on two-dimensional images, a system and an electronic device thereof, wherein in an embodiment of the present invention, the gesture recognition method based on two-dimensional images can achieve real-time performance on a mobile terminal device without additional sensor devices by only using a common RGB camera, and has high accuracy and robustness.
Another advantage of the present invention is to provide a gesture recognition method based on two-dimensional images, a system and an electronic device thereof, wherein in an embodiment of the present invention, the gesture recognition method based on two-dimensional images can perform recognition of dynamic gestures through combination of single-frame gesture actions and hand key points, which is helpful to improve robustness of the method.
Another advantage of the present invention is to provide a gesture recognition method based on two-dimensional images, a system and an electronic device thereof, wherein in an embodiment of the present invention, the gesture recognition method based on two-dimensional images combines deep learning and conventional machine learning to provide an innovative tracking method and fault-tolerant strategy, which is helpful for improving the speed and accuracy of gesture recognition.
Another advantage of the present invention is to provide a gesture recognition method based on a two-dimensional image, a system and an electronic device thereof, wherein in an embodiment of the present invention, the gesture recognition method based on a two-dimensional image can accelerate extraction of a hand key point by palm tracking, which is helpful for increasing a gesture recognition speed, so that the method can better meet a real-time requirement of a mobile device.
Another advantage of the present invention is to provide a gesture recognition method based on two-dimensional images, a system and an electronic device thereof, wherein, in order to achieve the above advantages, the present invention requires neither a complex structure nor a huge amount of computation, and has low requirements on software and hardware. The invention thus provides an effective solution that increases the practicability and reliability of the gesture recognition method based on two-dimensional images, the system thereof, and the electronic device.
To achieve at least one of the above advantages or other advantages and objects, the present invention provides a gesture recognition method based on two-dimensional images, including the steps of:
tracking the orientation of a palm in the current frame image according to the hand key point information in the previous frame image to obtain palm information of the current frame image;
performing key point detection on the current frame image based on the palm information of the current frame image to obtain hand key point information in the current frame image; and
performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image.
According to an embodiment of the present application, the step of tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image to obtain the palm information of the current frame image includes the steps of:
calculating a minimum circumscribed rectangle by traversing position information of a part of hand key points in the previous frame of image to obtain a central point, a length and a width of the minimum circumscribed rectangle;
under the condition of keeping the length-width ratio unchanged, expanding the minimum circumscribed rectangle outward according to a first expansion ratio to obtain an extended rectangle;
judging whether the intersection-over-union (IOU, also referred to below as the intersection ratio) between the extended rectangle and the key point detection frame in the previous frame image is smaller than a first threshold;
in response to the intersection ratio being smaller than the first threshold, starting a palm detection module to obtain the palm information of the current frame image; and
in response to the intersection ratio being greater than or equal to the first threshold, performing orientation calculation on the extended rectangle to obtain the palm information of the current frame image.
According to an embodiment of the application, the palm detection module is a palm center detection model based on deep learning.
According to an embodiment of the present application, the step of performing an orientation calculation on the extended rectangle in response to the intersection ratio being greater than or equal to the first threshold value to obtain the palm information of the current frame image includes the steps of:
judging whether the extended rectangle is smaller than a rectangle frame threshold, wherein the rectangle frame threshold is dynamically updated by the palm detection module;
in response to the extended rectangle being smaller than the rectangle frame threshold, expanding the extended rectangle according to a second expansion ratio so that it becomes greater than or equal to the rectangle frame threshold;
normalizing the extended rectangle that is greater than or equal to the rectangle frame threshold to obtain the position information of the extended rectangle in the current frame image; and
calculating the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
According to an embodiment of the present application, the step of performing key point detection on the current frame image based on the palm information of the current frame image to obtain hand key point information in the current frame image includes the steps of:
carrying out affine transformation on the current frame image according to the palm information of the current frame image so as to cut out a palm image corresponding to the palm information of the current frame image; and
detecting the position of each hand key point from the palm image through a key point detection network, so as to obtain the hand key point information in the current frame image.
According to an embodiment of the present application, the step of performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image includes the steps of:
calculating characteristic values through the hand key point information in the current frame image to obtain a plurality of characteristics with different dimensions; and
performing gesture recognition on the features with different dimensions through a random forest to obtain a static gesture recognition result in the current frame image.
According to an embodiment of the present application, the gesture recognition method based on two-dimensional images further includes the steps of:
detecting the process of single-frame gesture changes in a buffer queue based on the single-frame static gestures and hand key point information in the buffer queue, so as to realize recognition of multi-frame dynamic gestures.
According to an embodiment of the present application, the step of detecting a change of a single-frame gesture in the buffer queue based on the single-frame static gesture and the hand key point information in the buffer queue to recognize a multi-frame dynamic gesture includes:
filling the single-frame static gesture and the hand key point information in the current frame image into the buffer queue to update the buffer queue;
searching whether a single-frame gesture action change meeting the setting exists in the updated cache queue; and
in response to the existence of a single-frame gesture action change that meets the setting, performing dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain an action recognition result of a single hand.
According to an embodiment of the present application, the step of detecting a change of a single-frame gesture in the buffer queue based on the single-frame static gesture and the hand key point information in the buffer queue to recognize a multi-frame dynamic gesture further includes:
judging whether the hand key points in the updated cache queue include the key points of both hands for a predetermined number of consecutive frames; and
in response to the hand key points including the key points of both hands for the predetermined number of consecutive frames, performing fusion recognition on the single-hand action recognition results of the two hands, so as to output a two-hand action recognition result.
According to another aspect of the present application, an embodiment of the present application provides a two-dimensional image-based gesture recognition system, including:
the palm tracking module is used for tracking the orientation of a palm in the current frame image according to the hand key point information in the previous frame image so as to obtain the palm information of the current frame image;
a key point detection module, configured to perform key point detection on the current frame image based on the palm information of the current frame image, so as to obtain hand key point information in the current frame image; and
a single-frame gesture recognition module, configured to perform single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image.
According to an embodiment of the application, the palm tracking module comprises the following modules communicably connected to each other:
a rectangle calculation module, configured to calculate a minimum circumscribed rectangle by traversing the position information of the hand key points in the previous frame image, so as to obtain the central point, length and width of the minimum circumscribed rectangle;
a rectangle expansion module, configured to expand the minimum circumscribed rectangle outward according to a first expansion ratio while keeping the length-width ratio unchanged, so as to obtain an extended rectangle;
an intersection ratio judging module, configured to judge whether the intersection ratio between the extended rectangle and the key point detection frame in the previous frame image is smaller than a first threshold;
a palm detection starting module, configured to start the palm detection module in response to the intersection ratio being smaller than the first threshold, so as to obtain the palm information of the current frame image; and
an orientation calculation module, configured to perform orientation calculation on the extended rectangle in response to the intersection ratio being greater than or equal to the first threshold, so as to obtain the palm information of the current frame image.
According to an embodiment of the present application, the orientation calculation module is further configured to: judge whether the extended rectangle is smaller than a rectangle frame threshold, wherein the rectangle frame threshold is dynamically updated by the palm detection module; in response to the extended rectangle being smaller than the rectangle frame threshold, expand the extended rectangle according to a second expansion ratio so that it becomes greater than or equal to the rectangle frame threshold; normalize the extended rectangle that is greater than or equal to the rectangle frame threshold to obtain its position information in the current frame image; and calculate the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
According to an embodiment of the application, the keypoint detection module includes an affine transformation module and a keypoint extraction module communicably connected to each other, where the affine transformation module is configured to perform affine transformation on the current frame image according to the palm information of the current frame image to cut out a palm image corresponding to the palm information of the current frame image; the key point extracting module is used for extracting the positions of all hand key points from the palm image through the key point detection network so as to obtain the hand key point information in the current frame image.
According to an embodiment of the application, the single-frame gesture recognition module comprises a feature value calculation module and a random forest module which are mutually communicably connected, wherein the feature value calculation module is used for calculating feature values through the hand key point information in the current frame image so as to obtain a plurality of features with different dimensions; and the random forest module is used for performing gesture recognition on the features with different dimensions through a random forest to obtain a static gesture recognition result in the current frame image.
According to an embodiment of the application, the gesture recognition system based on the two-dimensional image further includes a multi-frame gesture recognition module, configured to detect a process of a change of a single-frame gesture in the cache queue based on a single-frame static gesture and hand key point information in the cache queue, so as to realize recognition of a multi-frame dynamic gesture.
According to an embodiment of the present application, the multi-frame gesture recognition module includes a filling and updating module, a searching module and a single-hand action recognition module, which are communicably connected to each other, wherein the filling and updating module is configured to fill the single-frame static gesture and the hand key point information of the current frame image into the buffer queue to update the buffer queue; the searching module is configured to search whether a single-frame gesture action change meeting the setting exists in the updated cache queue; and the single-hand action recognition module is configured to, in response to the existence of such a change, perform dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain a single-hand action recognition result.
According to an embodiment of the present application, the multi-frame gesture recognition module further includes a determination module and a two-hand action recognition module communicably connected to each other, where the determination module is configured to determine whether the hand key points in the updated cache queue include the key points of both hands for a predetermined number of consecutive frames; and the two-hand action recognition module is configured to, in response to the hand key points including the key points of both hands for the predetermined number of consecutive frames, perform fusion recognition on the single-hand action recognition results of the two hands, so as to output a two-hand action recognition result.
According to another aspect of the present application, an embodiment of the present application provides an electronic device, including:
at least one processor configured to execute instructions; and
a memory communicatively coupled to the at least one processor, wherein the memory stores at least one instruction executable by the at least one processor to cause the at least one processor to perform some or all of the steps of a two-dimensional-image-based gesture recognition method, wherein the gesture recognition method comprises the steps of:
tracking the orientation of a palm in the current frame image according to the hand key point information in the previous frame image to obtain palm information of the current frame image;
performing key point detection on the current frame image based on the palm information of the current frame image to obtain hand key point information in the current frame image; and
performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image.
Further objects and advantages of the invention will be fully apparent from the ensuing description and drawings.
These and other objects, features and advantages of the present invention will become more fully apparent from the following detailed description, the accompanying drawings and the claims.
Drawings
Fig. 1 is an algorithm framework diagram of a gesture recognition method based on two-dimensional images according to an embodiment of the invention.
Fig. 2 is a flowchart illustrating a gesture recognition method based on two-dimensional images according to an embodiment of the present invention.
Fig. 3A and 3B are schematic flow charts illustrating a palm tracking step in the two-dimensional image-based gesture recognition method according to the above embodiment of the present invention.
Fig. 4 shows an example of the palm tracking step in the two-dimensional image-based gesture recognition method according to the above-described embodiment of the present invention.
FIG. 5A shows an example of hand keypoint information in a previous frame of image according to the present application.
Fig. 5B illustrates an example of a minimum bounding rectangle in a current frame image according to the present application.
Fig. 5C shows an example of an extended rectangle in the current frame image according to the present application.
FIG. 5D shows an example of an intersection ratio between the extended rectangle in the current frame image and the keypoint detection box in the previous frame image according to the present application.
Fig. 5E shows an example of direction information of the palm in the current frame image according to the present application.
Fig. 6 is a schematic flow chart illustrating a key point detecting step in the two-dimensional image-based gesture recognition method according to the above embodiment of the invention.
Fig. 7 shows an example of hand keypoint information in a current frame image according to the present application.
Fig. 8 is a schematic flow chart illustrating a single-frame gesture recognition step in the two-dimensional image-based gesture recognition method according to the above embodiment of the invention.
Fig. 9 is a schematic flow chart illustrating a multi-frame gesture recognition step in the two-dimensional image-based gesture recognition method according to the above embodiment of the invention.
Fig. 10 shows an example of the multi-frame gesture recognition step in the two-dimensional image-based gesture recognition method according to the above-described embodiment of the present invention.
FIG. 11 illustrates an example of a cache queue update according to the present application.
FIG. 12 is a block diagram schematic of the two-dimensional image based gesture recognition system according to an embodiment of the invention.
FIG. 13 shows a block diagram schematic of an electronic device according to an embodiment of the invention.
Detailed Description
The following description is presented to disclose the invention so as to enable any person skilled in the art to practice the invention. The preferred embodiments in the following description are given by way of example only, and other obvious variations will occur to those skilled in the art. The basic principles of the invention, as defined in the following description, may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.
In the present invention, the terms "a" and "an" in the claims and the description should be understood as meaning "one or more"; that is, an element may be one in number in one embodiment and more than one in number in another embodiment. Unless the disclosure explicitly recites the number of an element as one, the terms "a" and "an" should not be construed as limiting that element to one in number.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Currently, Augmented Reality (AR) technology is increasingly applied to mobile terminals such as AR glasses, and is popular with mobile-device manufacturers and users alike. As the most natural means of human-computer interaction, gesture recognition has great application value in augmented reality. At present, gesture recognition methods can be divided into two main categories according to the data they acquire: the first obtains data from sensor devices such as data gloves; the second is based on visual data, including two-dimensional images, three-dimensional point clouds, or depth images. However, although the first method generally achieves high recognition accuracy, the sensor devices are expensive and inconvenient to operate; the second method has low equipment requirements and is convenient to operate, but suffers from limited recognition speed and accuracy. Therefore, in order to solve the above problems, the present application provides a gesture recognition method based on two-dimensional images, a system and an electronic device thereof, whose algorithm framework is shown in fig. 1 and mainly includes the modules of palm detection, palm tracking, key point detection, single-frame gesture recognition and multi-frame gesture recognition. In particular, the palm detection and key point detection modules extract hand key points from the input RGB image, a process that is accelerated by palm tracking; the key points are then fed into single-frame gesture recognition and multi-frame gesture recognition to obtain the final gesture result.
Illustrative method
Referring to fig. 2-11 of the drawings, a method for two-dimensional image based gesture recognition according to an embodiment of the present invention is illustrated. Specifically, as shown in fig. 2, the gesture recognition method based on two-dimensional images may include the steps of:
S100: tracking the orientation of a palm in the current frame image according to the hand key point information in the previous frame image to obtain palm information of the current frame image;
S200: performing key point detection on the current frame image based on the palm information of the current frame image to obtain hand key point information in the current frame image; and
S300: performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image.
It should be noted that, in a typical AR application scenario, the mobile terminal generally needs to acquire images at more than 30 frames per second to perform gesture recognition in real time, which means that the orientation of the palm changes slowly across consecutive frames. The two-dimensional-image-based gesture recognition method of the present application therefore predicts the orientation of the palm in the current frame image from the hand key point information in the previous frame image, so that the palm information of the current frame image can be obtained through palm tracking without starting the palm detection module. This helps reduce time overhead and memory consumption, speeds up gesture recognition, and improves its real-time performance. It is to be understood that the two-dimensional images referred to herein may be, but are not limited to being, implemented as RGB images.
More specifically, as shown in fig. 3A, the step S100 of the two-dimensional image-based gesture recognition method of the present application may include the steps of:
S110: calculating a minimum circumscribed rectangle by traversing the position information of a part of the hand key points in the previous frame image, so as to obtain the central point, length and width of the minimum circumscribed rectangle;
S120: under the condition of keeping the length-width ratio unchanged, expanding the minimum circumscribed rectangle outward according to a first expansion ratio to obtain an extended rectangle;
S130: judging whether the intersection-over-union (IOU) between the extended rectangle and the key point detection frame in the previous frame image is smaller than a first threshold;
S140: in response to the intersection ratio being smaller than the first threshold, starting a palm detection module to obtain the palm information of the current frame image; and
S150: in response to the intersection ratio being greater than or equal to the first threshold, performing orientation calculation on the extended rectangle to obtain the palm information of the current frame image.
It should be noted that although the orientation of the palm changes slowly across consecutive frames, the area or proportion occupied by the palm may change greatly. By expanding the minimum circumscribed rectangle outward, the gesture recognition method ensures that the rectangular frame used for subsequent analysis covers the palm in the current frame image as completely as possible, which helps improve the accuracy of gesture recognition. In addition, in step S120, the first expansion ratio may be, but is not limited to being, preset according to gesture recognition experience in combination with the specific application scenario.
According to the above-mentioned embodiment of the present application, in step S130 of the method, the first threshold may be, but is not limited to being, implemented as 0.5. In this way, when the intersection-over-union (IOU) between the extended rectangle and the key point detection frame in the previous frame image is smaller than the first threshold, the palm in the current frame image is considered to have a large displacement relative to the previous frame; the palm position is then not changing slowly across frames, and the extended rectangle cannot serve as the key point detection frame of the current frame image through palm tracking. Conversely, when the IOU is greater than or equal to the first threshold, the palm is considered to have only a small displacement, the palm position is changing slowly, and the extended rectangle can be used as the key point detection frame of the current frame image to obtain the palm information of the current frame image.
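Under these observations, the tracking decision of steps S110 to S150 can be sketched as follows. This is a non-authoritative sketch: the key point array layout, the function names and the first expansion ratio of 1.5 are assumptions, while the IOU threshold of 0.5 follows the text above.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def track_palm(prev_keypoints, prev_box, expand_ratio=1.5, iou_thresh=0.5):
    """Return the extended rectangle if tracking succeeds, else None
    (meaning the palm detection module must be started)."""
    xs, ys = prev_keypoints[:, 0], prev_keypoints[:, 1]
    # Minimum bounding rectangle of the previous frame's hand key points.
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Expand outward while keeping the aspect ratio unchanged.
    w, h = (x1 - x0) * expand_ratio, (y1 - y0) * expand_ratio
    extended = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    if iou(extended, prev_box) < iou_thresh:
        return None  # large displacement: fall back to palm detection
    return extended  # small displacement: reuse as the key point detection box
```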
Preferably, the palm detection module of the present application may be implemented as, but not limited to, a deep learning-based palm center detection model, the input data of which is an RGB image, and the corresponding output data is position information of a palm center and a palm direction in the RGB image, which are generally represented by a bounding box formed by an upper left boundary point and a lower right boundary point. It is understood that the keypoint detection frame in the current frame image in the present application may be a bounding box output by the palm detection module, or may be an extended rectangle obtained by palm tracking.
More preferably, the backbone network of the palm detection model of the present application is constructed from residual modules, and its head network performs feature fusion on the feature maps output by different layers; the network is an improvement that takes the SSD detection network as its prototype, which facilitates outputting palm information with higher precision.
According to an example of the present application, as shown in fig. 3B, the step S150 of the two-dimensional image-based gesture recognition method of the present application may include the steps of:
S151: judging whether the extended rectangle is smaller than a rectangle frame threshold, wherein the rectangle frame threshold is dynamically updated by the palm detection module;
S152: in response to the extended rectangle being smaller than the rectangle frame threshold, expanding the extended rectangle according to a second expansion ratio so that it becomes greater than or equal to the rectangle frame threshold;
S153: normalizing the extended rectangle that is greater than or equal to the rectangle frame threshold to obtain the position information of the extended rectangle in the current frame image; and
S154: calculating the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
It should be noted that when the extended rectangle is smaller than the rectangle frame threshold, the corresponding portion of the current frame image may not cover the complete hand. Therefore, in order for the key point detection frame of the current frame image to cover the complete hand as far as possible, step S152 further enlarges the extended rectangle in this case, ensuring that the enlarged rectangle is greater than or equal to the rectangle frame threshold.
Preferably, the second expansion ratio can be, but is not limited to being, set in real time according to the size relationship between the rectangle frame threshold and the extended rectangle. Of course, the second expansion ratio may also be preset according to experience, as long as the enlarged extended rectangle is guaranteed to be greater than or equal to the rectangle frame threshold, which is not detailed here.
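A minimal sketch of steps S151 to S153 under the same assumptions is given below; deriving the second expansion ratio from the size gap between the extended rectangle and the rectangle frame threshold is one of the setting strategies mentioned above, and the function signature is illustrative.

```python
def normalize_box(extended, min_w, min_h, img_w, img_h):
    """Grow the extended rectangle to at least the rectangle frame threshold
    (min_w x min_h), then normalize it to the image size."""
    x0, y0, x1, y1 = extended
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = max(x1 - x0, 1e-6), max(y1 - y0, 1e-6)
    # Second expansion ratio: just large enough to reach the threshold.
    scale = max(min_w / w, min_h / h, 1.0)
    w, h = w * scale, h * scale
    # Offsets relative to the top-left origin; width/height as ratios.
    return ((cx - w / 2) / img_w, (cy - h / 2) / img_h, w / img_w, h / img_h)
```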
It should be noted that, between the step S153 and the step S154 of the present application, the step S150 of the gesture recognition method based on two-dimensional images may further include the steps of:
determining whether the palm has already been saved by calculating the intersection ratio between the extended rectangle and the saved key point detection frames; if the overlap is large, the palm is judged to have been saved and the palm detection module is started; otherwise, step S154 is executed.
This check is performed before step S154 mainly because the left palm and the right palm may appear in the current frame image at the same time; if the extended rectangle overlaps heavily with a saved palm (i.e., a saved key point detection frame), palm tracking is deemed to have failed, and the palm detection module needs to be started to obtain the palm information of the current frame image.
Illustratively, as shown in fig. 4, the specific algorithm of step S100 of the two-dimensional image-based gesture recognition method of the present application may be, but is not limited to being, implemented as a process including:
(1) reading in the key point detection information (as shown in fig. 5A), and entering the next step if the number of palms is greater than 0; otherwise, exiting the algorithm;
(2) calculating a minimum circumscribed rectangle (as shown in fig. 5B) by traversing the position information of the hand key points, calculating the central point, length and width of the circumscribed rectangle, and expanding the circumscribed rectangle outward while maintaining the length-width ratio to obtain an extended rectangle (as shown in fig. 5C);
(3) calculating the intersection-over-union (IOU) between the extended rectangle and the detection frame of the previous frame (as shown in FIG. 5D); if the IOU is smaller than the threshold and the palm detection module has not been started, the current gesture is considered to have a large displacement relative to the previous frame, the rectangle is discarded, palm detection is started and the algorithm exits; otherwise, proceed to the next step;
(4) judging whether the extended rectangle is smaller than the rectangle frame threshold (the threshold is dynamically updated by the palm detection module); if so, expanding it according to a certain ratio, and then proceeding to step (5);
(5) performing rectangle normalization: the rectangle coordinates are converted into offsets relative to the top-left origin of the input image, and the width and height are converted into ratios relative to the width and height of the input image;
(6) calculating the IOU between the current palm information and the saved results to judge whether the palm has already been saved; if the overlap is large, starting palm detection and ending the algorithm; otherwise, proceeding to step (7);
(7) calculating the angle information of the current palm (as shown in fig. 5E), saving the palm information into the result queue, and ending the algorithm.
It can be understood that the gesture recognition method based on the two-dimensional image exploits a property of the specific application scenario, namely that the palm position changes slowly across consecutive frames, to predict the palm position of the current frame image from the previous frame's key point detection information. This helps reduce the number of times the palm detection module is started and lowers the algorithm's time overhead and memory resource consumption.
According to the above embodiment of the present application, as shown in fig. 6, the step S200 of the gesture recognition method based on two-dimensional images may include the steps of:
S210: performing affine transformation on the current frame image according to the palm information of the current frame image, so as to cut out a palm image corresponding to the palm information of the current frame image; and
S220: extracting the position of each hand key point from the palm image through the key point detection network, so as to obtain the hand key point information in the current frame image.
Preferably, as shown in fig. 7, the hand key points in the current frame image may include the wrist center point and the joint points of the five fingers, for a total of 21 key points.
It should be noted that when the previous frame image is the first frame image, the hand key point information in that image is obtained by first acquiring the position and direction information of the palm center through palm detection and then obtaining the hand key points through key point detection; when the previous frame image is not the first frame image, its hand key point information is obtained through steps S100 and S200 of the present method.
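As an illustration of steps S210 and S220, the affine cropping could be implemented with OpenCV as sketched below. The patch size, the rotation convention and run_keypoint_net (a hypothetical stand-in for the key point detection network) are assumptions rather than the patent's actual parameters.

```python
import cv2

def crop_palm_patch(frame, palm_center, palm_angle_deg, patch_size=256):
    """Rotate the frame so the palm is upright, then crop a square patch
    around the palm center for the key point detection network."""
    h, w = frame.shape[:2]
    rot = cv2.getRotationMatrix2D(palm_center, palm_angle_deg, 1.0)
    rotated = cv2.warpAffine(frame, rot, (w, h))
    x, y = int(palm_center[0]), int(palm_center[1])
    half = patch_size // 2
    return rotated[max(0, y - half):y + half, max(0, x - half):x + half]

# patch = crop_palm_patch(frame, (cx, cy), angle)
# keypoints_21 = run_keypoint_net(patch)   # hypothetical network call
```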
According to the above embodiment of the present application, as shown in fig. 8, the step S300 of the gesture recognition method based on two-dimensional images may include the steps of:
S310: calculating feature values from the hand key point information in the current frame image to obtain a plurality of features with different dimensions; and
S320: performing gesture recognition on the features with different dimensions through a random forest to obtain a static gesture recognition result in the current frame image.
For example, in step S310, 30 features of different dimensions may be calculated from the two-dimensional key points of the palm, and then in step S320 the static gesture recognition result (0 to 9) of the current frame is obtained through the random forest.
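Since the patent does not enumerate its 30 feature dimensions, the sketch below uses stand-in geometric features (normalized fingertip distances and angles, with conventional 21-key-point fingertip indices) and a scikit-learn random forest as a generic example of steps S310 and S320.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FINGERTIPS = [4, 8, 12, 16, 20]  # conventional indices in a 21-key-point hand model

def keypoint_features(kps):
    """kps: (21, 2) array of 2-D hand key points; returns a small feature vector."""
    wrist = kps[0]
    diffs = kps[FINGERTIPS] - wrist
    palm_scale = np.linalg.norm(kps[9] - wrist) + 1e-6  # middle-finger base
    dists = np.linalg.norm(diffs, axis=1) / palm_scale  # scale-invariant distances
    angles = np.arctan2(diffs[:, 1], diffs[:, 0])       # fingertip directions
    return np.concatenate([dists, angles])

# Train on labeled single-frame gestures 0-9, then classify the current frame:
clf = RandomForestClassifier(n_estimators=100)
# clf.fit(np.stack([keypoint_features(k) for k in train_kps]), train_labels)
# gesture = clf.predict(keypoint_features(current_kps).reshape(1, -1))[0]
```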
It is worth noting that existing dynamic gesture recognition schemes based on gesture detection mainly follow this idea: first, the gesture action of a single frame is recognized to obtain the static gesture of the current frame; then, dynamic gestures are recognized from the switching between static gestures. Although such methods can run in real time on mobile terminals, they have the following limitations: 1) because the detection result is only the circumscribed rectangular outline of the hand, without hand key points, many dynamic gestures, including two-hand actions and two-hand interactions, cannot be realized; 2) dynamic gestures can only be recognized on the basis of predefined static gestures, so extensibility is poor; 3) undefined actions are prone to false detection, reducing recognition accuracy.
Therefore, in order to solve the above problem, as shown in fig. 2, the gesture recognition method based on two-dimensional images according to the present application may further include, after the step S300, the steps of:
S400: detecting the process of single-frame gesture changes in the buffer queue based on the single-frame static gestures and hand key point information in the buffer queue, so as to realize recognition of multi-frame dynamic gestures.
Specifically, as shown in fig. 9, the step S400 of the gesture recognition method based on two-dimensional images of the present application may include the steps of:
S410: filling the single-frame static gesture and the hand key point information in the current frame image into the cache queue to update the cache queue;
S420: searching whether a single-frame gesture action change meeting the setting exists in the updated cache queue; and
S430: in response to the existence of a single-frame gesture action change that meets the setting, performing dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain an action recognition result of a single hand.
It should be noted that during gesture recognition there is usually only a one-handed action, but two-handed actions may also occur. Therefore, to realize the recognition of two-handed actions, as shown in fig. 9, step S400 of the method may further include the steps of:
S440: judging whether the hand key points in the updated cache queue include the key points of both hands for a predetermined number of consecutive frames; and
S450: in response to the hand key points including the key points of both hands for the predetermined number of consecutive frames, performing fusion recognition on the single-hand action recognition results of the two hands to output a two-hand action recognition result.
Illustratively, the multi-frame gesture recognition of the present application detects changes of the single-frame gesture within a period of time (i.e., within the cache queue) based on the serialized single-frame static gestures and key points in the cache, thereby realizing recognition of multi-frame dynamic gestures. As shown in fig. 10, the main process is as follows. First, the single-hand action is recognized, and the recognition result and key point information are updated into the cache queue, as shown in fig. 11. Then, the cache queue is searched for a single-frame action change that meets the setting, for example the gesture changing from gesture 0 to gesture 5; if such a change exists, the gesture is recognized in combination with the changes of the key point coordinates and the single-hand action result is saved; otherwise, single-hand recognition fails. Thereafter, the same single-hand recognition is performed for the other hand in the two-hand case. It should be noted that the present application may determine whether the current action is one-handed or two-handed according to the iTwohandsFlag parameter: iTwohandsFlag defaults to one-handed, and is changed to two-handed when the input key points include 2 hands for 5 consecutive frames. If the action is two-handed, the two single-hand actions are fused; otherwise, the multi-frame action recognition result is output directly.
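A minimal sketch of this multi-frame logic follows. The queue length is illustrative; the gesture 0 to gesture 5 transition and the 5-frame two-hand rule follow the examples above, while the data layout and function names are assumptions.

```python
from collections import deque

CACHE_LEN = 30        # cache queue length (illustrative)
TWO_HAND_FRAMES = 5   # frames of 2-hand input before iTwohandsFlag flips

cache = deque(maxlen=CACHE_LEN)

def update_and_recognize(static_gesture, keypoints, num_hands):
    """Fill the cache queue and look for a configured single-frame gesture
    change (here: gesture 0 followed by gesture 5)."""
    cache.append({"gesture": static_gesture, "kps": keypoints, "hands": num_hands})
    gestures = [e["gesture"] for e in cache]
    if 0 in gestures and 5 in gestures and gestures.index(0) < gestures.index(5):
        # A real system would also verify the key point coordinate changes
        # before accepting the dynamic gesture.
        return "single_hand_action"
    return None

def is_two_handed():
    """True once two-hand key points have been present for the last 5 frames."""
    recent = list(cache)[-TWO_HAND_FRAMES:]
    return (len(recent) == TWO_HAND_FRAMES
            and all(e["hands"] == 2 for e in recent))
```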
Illustrative System
Referring to FIG. 12 of the drawings, a two-dimensional image based gesture recognition system according to one embodiment of the present invention is illustrated. Specifically, as shown in fig. 12, the two-dimensional image-based gesture recognition system 1 may include: a palm tracking module 10, configured to track the orientation of a palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain palm information of the current frame image; a key point detecting module 20, configured to perform key point detection on the current frame image based on the palm information of the current frame image, so as to obtain hand key point information in the current frame image; and a single-frame gesture recognition module 30, configured to perform single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image.
More specifically, as shown in fig. 12, the palm tracking module 10 includes the following modules communicably connected to each other: a rectangle calculating module 11, configured to calculate a minimum circumscribed rectangle by traversing the position information of a part of the hand key points in the previous frame image, so as to obtain the central point, length and width of the minimum circumscribed rectangle; a rectangle expansion module 12, configured to expand the minimum circumscribed rectangle outward according to a first expansion ratio while keeping the length-width ratio unchanged, so as to obtain an extended rectangle; an intersection ratio judging module 13, configured to judge whether the intersection ratio between the extended rectangle and the key point detection frame in the previous frame image is smaller than a first threshold; a palm detection starting module 14, configured to start the palm detection module in response to the intersection ratio being smaller than the first threshold, so as to obtain the palm information of the current frame image; and an orientation calculation module 15, configured to perform orientation calculation on the extended rectangle in response to the intersection ratio being greater than or equal to the first threshold, so as to obtain the palm information of the current frame image.
Notably, in an example of the present application, the orientation calculation module 15 is further configured to: judging whether the outward-expanded rectangle is smaller than a rectangular frame threshold value or not, wherein the rectangular frame threshold value is dynamically updated through the palm detection module; in response to the expanded rectangle being smaller than the rectangular frame threshold, expanding the expanded rectangle according to a second expansion proportion so that the expanded rectangle is larger than or equal to the rectangular frame threshold; normalizing the extended rectangle which is greater than or equal to the rectangle frame threshold value to obtain the position information of the extended rectangle in the current frame image; and calculating the direction information of the palm in the current frame image according to the position information of the key points of the hand in the previous frame image.
According to the above-mentioned embodiment of the present application, as shown in fig. 12, the keypoint detection module 20 includes an affine transformation module 21 and a keypoint extraction module 22 communicably connected to each other, where the affine transformation module 21 is configured to perform affine transformation on the current frame image according to the palm information of the current frame image to cut out a palm image corresponding to the palm information of the current frame image; the key point extracting module 22 is configured to extract positions of each hand key point from the palm image through the key point detection network, so as to obtain the hand key point information in the current frame image.
According to the above embodiment of the present application, as shown in fig. 12, the single-frame gesture recognition module 30 includes a feature value calculation module 31 and a random forest module 32 communicably connected to each other, wherein the feature value calculation module is configured to perform feature value calculation through the hand keypoint information in the current frame image to obtain a plurality of features of different dimensions; and the random forest module is used for performing gesture recognition on the features with different dimensions through a random forest to obtain a static gesture recognition result in the current frame image.
According to the above embodiment of the present application, as shown in fig. 12, the two-dimensional image-based gesture recognition system 1 further includes a multi-frame gesture recognition module 40, configured to detect a process of a single-frame gesture change in the buffer queue based on a single-frame static gesture and hand key point information in the buffer queue, so as to realize recognition of a multi-frame dynamic gesture.
It should be noted that, as shown in fig. 12, the multi-frame gesture recognition module 40 includes a filling update module 41, a search module 42 and a single-hand action recognition module 43, which are communicably connected to each other, wherein the filling update module 41 is configured to fill the single-frame static gesture and the hand key point information of the current frame image into the buffer queue to update the buffer queue; the search module 42 is configured to search whether a single-frame gesture action change meeting the setting exists in the updated buffer queue; and the single-hand action recognition module 43 is configured to, in response to the existence of such a change, perform dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain a single-hand action recognition result.
In addition, as shown in fig. 12, the multi-frame gesture recognition module 40 further includes a judging module 44 and a two-hand motion recognition module 45 communicatively connected to each other, wherein the judging module 44 is configured to judge whether the hand keypoints in the updated buffer queue include the keypoints of both hands and persist for a predetermined number of frames; and the two-hand motion recognition module 45 is configured, when both conditions hold, to perform fusion recognition on the single-hand motion recognition results of the two hands, so as to output a two-hand motion recognition result.
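A toy fusion rule under the same caveat: the pairing table and the ten-frame persistence value below are invented for illustration; the application only requires that both hands' keypoints persist before the per-hand results are fused.

```python
def fuse_two_hands(left_result, right_result, both_hands_frames, min_frames=10):
    """Fuse per-hand motion results into a two-hand action (sketch);
    the pairing table and frame count are illustrative values."""
    if both_hands_frames < min_frames:
        return None  # keypoints of both hands must persist long enough
    pairs = {
        ("swipe_left", "swipe_right"): "spread",
        ("swipe_right", "swipe_left"): "pinch_together",
    }
    return pairs.get((left_result, right_result))
```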
Illustrative electronic device
Next, an electronic apparatus according to an embodiment of the present invention is described with reference to fig. 13. As shown in fig. 13, the electronic device 90 includes one or more processors 91 and memory 92.
The processor 91 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 90 to perform desired functions. In other words, the processor 91 comprises one or more physical devices configured to execute instructions. For example, the processor 91 may be configured to execute instructions that are part of: one or more applications, services, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more components, implement a technical effect, or otherwise arrive at a desired result.
The processor 91 may include one or more processors configured to execute software instructions. Additionally or alternatively, the processor 91 may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. Individual processors of the processor 91 may be single-core or multi-core, and the instructions executed thereon may be configured for serial, parallel, and/or distributed processing. The various components of the processor 91 may optionally be distributed over two or more separate devices, which may be remotely located and/or configured for coordinated processing. Aspects of the processor 91 may be virtualized and executed by remotely accessible networked computing devices configured in a cloud computing configuration.
The memory 92 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer readable storage medium and executed by the processor 91 to implement some or all of the steps of the above-described exemplary methods of the present invention, and/or other desired functions.
In other words, the memory 92 comprises one or more physical devices configured to hold machine-readable instructions executable by the processor 91 to implement the methods and processes described herein. In implementing these methods and processes, the state of the memory 92 may be transformed (e.g., to hold different data). The memory 92 may include removable and/or built-in devices. The memory 92 may include optical memory (e.g., CD, DVD, HD-DVD, blu-ray disc, etc.), semiconductor memory (e.g., RAM, EPROM, EEPROM, etc.), and/or magnetic memory (e.g., hard-disk drive, floppy-disk drive, tape drive, MRAM, etc.), among others. The memory 92 may include volatile, nonvolatile, dynamic, static, read/write, read-only, random-access, sequential-access, location-addressable, file-addressable, and/or content-addressable devices.
It is understood that the memory 92 comprises one or more physical devices. However, aspects of the instructions described herein may alternatively be propagated by a communication medium (e.g., an electromagnetic signal, an optical signal, etc.) that is not held by a physical device for a finite duration. Aspects of the processor 91 and the memory 92 may be integrated together into one or more hardware logic components. These hardware logic components may include, for example, Field Programmable Gate Arrays (FPGAs), program- and application-specific integrated circuits (PASIC/ASIC), program- and application-specific standard products (PSSP/ASSP), system-on-a-chip (SOC), and Complex Programmable Logic Devices (CPLDs).
In one example, as shown in FIG. 13, the electronic device 90 may also include an input device 93 and an output device 94, which may be interconnected via a bus system and/or other form of connection mechanism (not shown). For example, the input device 93 may be a camera module for capturing image data or video data. As another example, the input device 93 may include or interface with one or more user input devices such as a keyboard, mouse, touch screen, or game controller. In some embodiments, the input device 93 may include or interface with selected Natural User Input (NUI) components. Such components may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on-board or off-board. Example NUI components may include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition; an electric field sensing component for assessing brain activity and/or body movement; and/or any other suitable sensor.
The output device 94 may output various information including the classification result, etc. to the outside. The output devices 94 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, the electronic device 90 may further comprise a communication means, wherein the communication means may be configured to communicatively couple the electronic device 90 with one or more other computer devices. The communication means may comprise wired and/or wireless communication devices compatible with one or more different communication protocols. As a non-limiting example, the communication means may be configured for communication via a wireless telephone network or a wired or wireless local- or wide-area network. In some embodiments, the communication means may allow the electronic device 90 to send and/or receive messages to and/or from other devices via a network such as the Internet.
It will be appreciated that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Also, the order of the above-described processes may be changed.
Of course, for the sake of simplicity, only some of the components of the electronic device 90 relevant to the present invention are shown in fig. 13, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 90 may include any other suitable components depending on the particular application.
It should also be noted that in the apparatus, devices and methods of the present invention, the components or steps may be broken down and/or re-combined. These decompositions and/or recombinations are to be considered as equivalents of the present invention.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the invention. Thus, the present invention is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
It will be appreciated by persons skilled in the art that the embodiments of the invention described above and shown in the drawings are given by way of example only and are not limiting of the invention. The objects of the present invention have been fully and effectively accomplished. The functional and structural principles of the present invention have been shown and described in the examples, and embodiments of the present invention may be varied or modified without departing from these principles.

Claims (18)

1. A gesture recognition method based on two-dimensional images is characterized by comprising the following steps:
tracking the orientation of a palm in the current frame image according to the hand key point information in the previous frame image to obtain palm information of the current frame image;
performing key point detection on the current frame image based on the palm information of the current frame image to obtain hand key point information in the current frame image; and
performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image.
2. The two-dimensional image-based gesture recognition method according to claim 1, wherein the step of tracking the orientation of the palm in the current frame image according to the hand key point information in the previous frame image to obtain the palm information of the current frame image comprises the steps of:
calculating a minimum circumscribed rectangle by traversing the position information of a subset of the hand key points in the previous frame image, so as to obtain the center point, length, and width of the minimum circumscribed rectangle;
expanding the minimum circumscribed rectangle outward according to a first expansion ratio while keeping the aspect ratio unchanged, so as to obtain an expanded rectangle;
judging whether the intersection-over-union between the expanded rectangle and the key point detection frame in the previous frame image is smaller than a first threshold;
in response to the intersection-over-union being smaller than the first threshold, starting a palm detection module to obtain the palm information of the current frame image; and
in response to the intersection-over-union being greater than or equal to the first threshold, performing orientation calculation on the expanded rectangle to obtain the palm information of the current frame image.
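To make the geometry of claim 2 concrete, the following minimal sketch computes an axis-aligned bounding box over the previous-frame keypoints (a rotated minimum circumscribed rectangle is equally consistent with the claim), expands it while preserving the aspect ratio, and gates on the IoU; the ratio and threshold values are illustrative.

```python
import numpy as np

def track_palm_box(prev_keypoints, prev_detection_box,
                   first_ratio=1.5, iou_threshold=0.5):
    """Sketch of the tracking geometry recited in claim 2."""
    kps = np.asarray(prev_keypoints, dtype=np.float32)
    x0, y0 = kps.min(axis=0)
    x1, y1 = kps.max(axis=0)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Expand while keeping the aspect ratio unchanged.
    w, h = (x1 - x0) * first_ratio, (y1 - y0) * first_ratio
    box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

    # Intersection-over-union against the previous key point detection frame.
    ax0, ay0, ax1, ay1 = box
    bx0, by0, bx1, by1 = prev_detection_box
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    iou = inter / union if union > 0 else 0.0

    # Below the threshold the palm detector is re-run; otherwise the
    # expanded box proceeds to the orientation calculation of claim 4.
    return ("redetect" if iou < iou_threshold else "track"), box
```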
3. The two-dimensional image-based gesture recognition method of claim 2, wherein the palm detection module is a palm detection model based on deep learning.
4. The two-dimensional image-based gesture recognition method according to claim 2, wherein the step of performing orientation calculation on the expanded rectangle to obtain the palm information of the current frame image in response to the intersection-over-union being greater than or equal to the first threshold comprises the steps of:
judging whether the expanded rectangle is smaller than a rectangular-frame threshold, wherein the rectangular-frame threshold is dynamically updated by the palm detection module;
in response to the expanded rectangle being smaller than the rectangular-frame threshold, expanding the expanded rectangle according to a second expansion ratio so that it becomes greater than or equal to the rectangular-frame threshold;
normalizing the expanded rectangle that is greater than or equal to the rectangular-frame threshold to obtain the position information of the expanded rectangle in the current frame image; and
calculating the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
5. The two-dimensional image-based gesture recognition method according to any one of claims 1 to 4, wherein the step of performing key point detection on the current frame image based on the palm information of the current frame image to obtain hand key point information in the current frame image comprises the steps of:
carrying out affine transformation on the current frame image according to the palm information of the current frame image so as to cut out a palm image corresponding to the palm information of the current frame image; and
detecting the position of each hand key point from the palm image through the key point detection network, so as to obtain the hand key point information in the current frame image.
6. The two-dimensional image-based gesture recognition method according to any one of claims 1 to 4, wherein the step of performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image comprises the steps of:
calculating feature values from the hand key point information in the current frame image to obtain a plurality of features of different dimensions; and
performing gesture recognition on the features of different dimensions through a random forest to obtain a static gesture recognition result for the current frame image.
7. The two-dimensional image based gesture recognition method of any one of claims 1 to 4, further comprising the steps of:
detecting the progression of single-frame gesture changes in a buffer queue based on the single-frame static gesture and the hand key point information in the buffer queue, so as to recognize multi-frame dynamic gestures.
8. The gesture recognition method based on two-dimensional images as claimed in claim 7, wherein the step of detecting the process of single-frame gesture change in the buffer queue based on the single-frame static gesture and the hand key point information in the buffer queue to realize recognition of multi-frame dynamic gestures comprises the steps of:
filling the single-frame static gesture and the hand key point information in the current frame image into the buffer queue to update the buffer queue;
searching whether a single-frame gesture motion change meeting the setting exists in the updated buffer queue; and
in response to the existence of a single-frame gesture motion change meeting the setting, performing dynamic gesture recognition in combination with the coordinate changes of the hand key points, so as to obtain a single-hand motion recognition result.
9. The gesture recognition method based on two-dimensional images as claimed in claim 8, wherein the step of detecting the process of single-frame gesture change in the buffer queue based on the single-frame static gesture and the hand key point information in the buffer queue to realize recognition of multi-frame dynamic gestures further comprises the steps of:
judging whether the hand key points in the updated buffer queue include the key points of both hands and persist for a predetermined number of frames; and
in response to the hand key points including the key points of both hands for the predetermined number of frames, performing fusion recognition on the single-hand motion recognition results of the two hands, so as to output a two-hand motion recognition result.
10. A two-dimensional image-based gesture recognition system, comprising, communicatively connected to each other:
a palm tracking module, configured to track the orientation of a palm in the current frame image according to the hand key point information in the previous frame image, so as to obtain the palm information of the current frame image;
a key point detection module, configured to perform key point detection on the current frame image based on the palm information of the current frame image, so as to obtain hand key point information in the current frame image; and
a single-frame gesture recognition module, configured to perform single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image.
11. The two-dimensional image-based gesture recognition system of claim 10, wherein the palm tracking module comprises, communicatively connected to each other:
a rectangle calculation module, configured to calculate a minimum circumscribed rectangle by traversing the position information of the hand key points in the previous frame image, so as to obtain the center point, length, and width of the minimum circumscribed rectangle;
a rectangle expansion module, configured to expand the minimum circumscribed rectangle outward according to a first expansion ratio while keeping the aspect ratio unchanged, so as to obtain an expanded rectangle;
an intersection-over-union judging module, configured to judge whether the intersection-over-union between the expanded rectangle and the key point detection frame in the previous frame image is smaller than a first threshold;
a palm detection starting module, configured to start the palm detection module in response to the intersection-over-union being smaller than the first threshold, so as to obtain the palm information of the current frame image; and
an orientation calculation module, configured to perform orientation calculation on the expanded rectangle in response to the intersection-over-union being greater than or equal to the first threshold, so as to obtain the palm information of the current frame image.
12. The two-dimensional image-based gesture recognition system of claim 11, wherein the orientation calculation module is further configured to: judge whether the expanded rectangle is smaller than a rectangular-frame threshold, wherein the rectangular-frame threshold is dynamically updated by the palm detection module; in response to the expanded rectangle being smaller than the rectangular-frame threshold, expand the expanded rectangle according to a second expansion ratio so that it becomes greater than or equal to the rectangular-frame threshold; normalize the expanded rectangle that is greater than or equal to the rectangular-frame threshold to obtain its position information in the current frame image; and calculate the direction information of the palm in the current frame image according to the position information of the hand key points in the previous frame image.
13. The two-dimensional image-based gesture recognition system according to any one of claims 10 to 12, wherein the keypoint detection module includes an affine transformation module and a keypoint extraction module communicably connected to each other, wherein the affine transformation module is configured to perform affine transformation on the current frame image according to the palm information of the current frame image to cut out a palm image corresponding to the palm information of the current frame image; the key point extraction module is used for extracting the positions of all hand key points from the palm image through the key point detection network so as to obtain the hand key point information in the current frame image.
14. The two-dimensional image-based gesture recognition system according to any one of claims 10 to 12, wherein the single-frame gesture recognition module comprises a feature value calculation module and a random forest module communicably connected to each other, wherein the feature value calculation module is configured to compute feature values from the hand keypoint information in the current frame image to obtain a plurality of features of different dimensions; and the random forest module is configured to perform gesture recognition on the features of different dimensions through a random forest to obtain a static gesture recognition result for the current frame image.
15. The two-dimensional image-based gesture recognition system of any one of claims 10 to 12, further comprising a multi-frame gesture recognition module, configured to detect the progression of single-frame gesture changes in a buffer queue based on the single-frame static gesture and hand key point information in the buffer queue, so as to recognize multi-frame dynamic gestures.
16. The two-dimensional image-based gesture recognition system of claim 15, wherein the multi-frame gesture recognition module comprises a filling update module, a search module, and a single-hand motion recognition module communicatively connected to each other, wherein the filling update module is configured to fill the single-frame static gesture and the hand keypoint information of the current frame image into the buffer queue to update the buffer queue; the search module is configured to search whether a single-frame gesture motion change meeting the setting exists in the updated buffer queue; and the single-hand motion recognition module is configured to perform dynamic gesture recognition in combination with the coordinate changes of the hand keypoints in response to the existence of such a change, so as to obtain a single-hand motion recognition result.
17. The two-dimensional image-based gesture recognition system of claim 16, wherein the multi-frame gesture recognition module further comprises a judging module and a two-hand motion recognition module communicatively connected to each other, wherein the judging module is configured to judge whether the hand keypoints in the updated buffer queue include the keypoints of both hands and persist for a predetermined number of frames; and the two-hand motion recognition module is configured, in response to the hand keypoints including the keypoints of both hands for the predetermined number of frames, to perform fusion recognition on the single-hand motion recognition results of the two hands, so as to output a two-hand motion recognition result.
18. An electronic device, comprising:
at least one processor configured to execute instructions; and
a memory communicatively coupled to the at least one processor, wherein the memory stores at least one instruction executable by the at least one processor to cause the at least one processor to perform some or all of the steps of a two-dimensional image-based gesture recognition method, the method comprising the steps of:
tracking the orientation of a palm in the current frame image according to the hand key point information in the previous frame image to obtain palm information of the current frame image;
performing key point detection on the current frame image based on the palm information of the current frame image to obtain hand key point information in the current frame image; and
performing single-frame gesture recognition on the hand key point information in the current frame image to obtain a single-frame static gesture in the current frame image.
CN202011180708.1A 2020-10-29 2020-10-29 Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment Active CN114510142B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011180708.1A CN114510142B (en) 2020-10-29 2020-10-29 Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment

Publications (2)

Publication Number Publication Date
CN114510142A true CN114510142A (en) 2022-05-17
CN114510142B CN114510142B (en) 2023-11-10

Family

ID=81546082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011180708.1A Active CN114510142B (en) 2020-10-29 2020-10-29 Gesture recognition method based on two-dimensional image, gesture recognition system based on two-dimensional image and electronic equipment

Country Status (1)

Country Link
CN (1) CN114510142B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110286676A1 (en) * 2010-05-20 2011-11-24 Edge3 Technologies Llc Systems and related methods for three dimensional gesture recognition in vehicles
CN103530613A (en) * 2013-10-15 2014-01-22 无锡易视腾科技有限公司 Target person hand gesture interaction method based on monocular video sequence
US20150253864A1 (en) * 2014-03-06 2015-09-10 Avago Technologies General Ip (Singapore) Pte. Ltd. Image Processor Comprising Gesture Recognition System with Finger Detection and Tracking Functionality
US20190107894A1 (en) * 2017-10-07 2019-04-11 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
US10296102B1 (en) * 2018-01-31 2019-05-21 Piccolo Labs Inc. Gesture and motion recognition using skeleton tracking
CN110163055A (en) * 2018-08-10 2019-08-23 腾讯科技(深圳)有限公司 Gesture identification method, device and computer equipment
CN109145803A (en) * 2018-08-14 2019-01-04 京东方科技集团股份有限公司 Gesture identification method and device, electronic equipment, computer readable storage medium
US20200057886A1 (en) * 2018-08-14 2020-02-20 Boe Technology Group Co., Ltd. Gesture recognition method and apparatus, electronic device, and computer-readable storage medium
CN109614922A (en) * 2018-12-07 2019-04-12 南京富士通南大软件技术有限公司 A kind of dynamic static gesture identification method and system
CN110458095A (en) * 2019-08-09 2019-11-15 厦门瑞为信息技术有限公司 A kind of recognition methods, control method, device and the electronic equipment of effective gesture
CN111158467A (en) * 2019-12-12 2020-05-15 青岛小鸟看看科技有限公司 Gesture interaction method and terminal
CN111753764A (en) * 2020-06-29 2020-10-09 济南浪潮高新科技投资发展有限公司 Gesture recognition method of edge terminal based on attitude estimation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863541A (en) * 2023-09-01 2023-10-10 芯原科技(上海)有限公司 Dynamic gesture recognition method and device, related equipment and handwriting recognition method
CN116863541B (en) * 2023-09-01 2023-11-21 芯原科技(上海)有限公司 Dynamic gesture recognition method and device, related equipment and handwriting recognition method

Also Published As

Publication number Publication date
CN114510142B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN108062526B (en) Human body posture estimation method and mobile terminal
EP2877254B1 (en) Method and apparatus for controlling augmented reality
US8306267B1 (en) Object tracking
US20150177842A1 (en) 3D Gesture Based User Authorization and Device Control Methods
KR102285915B1 (en) Real-time 3d gesture recognition and tracking system for mobile devices
CN107533599A (en) A kind of gesture identification method, device and electronic equipment
US20170344104A1 (en) Object tracking for device input
US20150199592A1 (en) Contour-based classification of objects
WO2019000817A1 (en) Control method and electronic equipment for hand gesture recognition
CN113129249B (en) Depth video-based space plane detection method and system and electronic equipment
CN114937285B (en) Dynamic gesture recognition method, device, equipment and storage medium
CN107340861B (en) Gesture recognition method and device thereof
JP2016099643A (en) Image processing device, image processing method, and image processing program
CN111986229A (en) Video target detection method, device and computer system
CN114510142A (en) Gesture recognition method based on two-dimensional image, system thereof and electronic equipment
US11899848B2 (en) Method, mobile device, head-mounted display, and system for estimating hand pose
CN110069126B (en) Virtual object control method and device
CN114821630A (en) Static gesture recognition method and system and electronic equipment
US10139961B2 (en) Touch detection using feature-vector dictionary
CN113553877B (en) Depth gesture recognition method and system and electronic equipment thereof
KR101909326B1 (en) User interface control method and system using triangular mesh model according to the change in facial motion
US11340706B2 (en) Gesture recognition based on depth information and computer vision
CN112183155B (en) Method and device for establishing action posture library, generating action posture and identifying action posture
JP6144192B2 (en) Image recognition apparatus and image recognition method
CN112711324B (en) Gesture interaction method and system based on TOF camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant