CN111325179A - Gesture tracking method and device, electronic equipment and storage medium - Google Patents

Gesture tracking method and device, electronic equipment and storage medium

Info

Publication number
CN111325179A
Authority
CN
China
Prior art keywords
image
detection frame
hand
original image
human hand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010159687.9A
Other languages
Chinese (zh)
Other versions
CN111325179B (en)
Inventor
杨思远 (Yang Siyuan)
曲晓超 (Qu Xiaochao)
姜浩 (Jiang Hao)
刘岩 (Liu Yan)
万鹏飞 (Wan Pengfei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meitu Technology Co Ltd
Original Assignee
Xiamen Meitu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meitu Technology Co Ltd filed Critical Xiamen Meitu Technology Co Ltd
Priority to CN202010159687.9A priority Critical patent/CN111325179B/en
Publication of CN111325179A publication Critical patent/CN111325179A/en
Application granted granted Critical
Publication of CN111325179B publication Critical patent/CN111325179B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4038 Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/215 Motion-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00 Indexing scheme for image data processing or generation, in general
    • G06T2200/32 Indexing scheme for image data processing or generation, in general involving image mosaicing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20132 Image cropping

Abstract

An embodiment of the invention provides a gesture tracking method and device, electronic equipment and a storage medium, relating to the technical field of image processing. In the method, an original image is cropped based on a first hand detection frame to obtain a first image, and feature extraction is performed on the first image to obtain a first feature map. Whether a hand exists in the first feature map is then judged; if a hand exists, the first feature map is segmented to obtain a hand image, and detection frame regression is performed on the first feature map and the hand image to obtain a second hand detection frame. The second hand detection frame is used as the input for the next frame of image, realizing tracking of the next frame with good real-time performance.

Description

Gesture tracking method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to a gesture tracking method and device, electronic equipment and a storage medium.
Background
Gesture recognition is a technology for locating the hand region in an image and classifying its posture. As high-level semantic information, gesture information is widely used in human-computer interaction scenarios.
Gesture recognition can be subdivided into three sub-technologies: gesture detection, gesture classification and gesture tracking. Gesture detection is responsible for detecting the bounding box of the hand in the image, gesture classification is responsible for classifying the posture of the hand, and gesture tracking is responsible for tracking the detected hand to ensure that it belongs to the same person across frames.
At present, classification based on deep neural networks is a research hotspot, and many methods for improving network operating efficiency (such as network distillation and network quantization) have appeared, making real-time operation on mobile terminals possible. Tracking, however, often cannot be processed in real time, so its real-time performance is poor.
Disclosure of Invention
Based on the research, the invention provides a gesture tracking method and device, electronic equipment and a storage medium.
Embodiments of the invention may be implemented as follows:
in a first aspect, an embodiment of the present invention provides a gesture tracking method, which is applied to an electronic device, and the gesture tracking method includes:
based on the first human hand detection frame, cutting the original image to obtain a first image;
performing feature extraction on the first image to obtain a first feature map;
judging whether a hand exists in the first feature map, and if so, segmenting the first feature map to obtain a hand image;
and performing detection frame regression processing on the first feature map and the hand image to obtain a second hand detection frame, and tracking the next frame of image according to the second hand detection frame.
In an alternative embodiment, the method further comprises:
if no human hand exists in the first feature map, performing feature extraction on the original image to obtain a second feature map;
inputting the second feature map into a region proposal network to obtain a first detection frame;
cutting the features of the area corresponding to the first detection frame in the second feature map to obtain a cut feature map;
and refining the first detection frame according to the cut feature map to obtain a third hand detection frame, and tracking the next frame of image according to the third hand detection frame.
In an optional embodiment, the step of refining the first detection frame according to the cut feature map to obtain a third human hand detection frame includes:
inputting the cut feature map into a classification regression network to obtain detection frame refinement parameters;
and refining the first detection frame according to the detection frame refinement parameters to obtain the third hand detection frame.
In an optional embodiment, before performing feature extraction on the original image to obtain a second feature map, the method further includes:
carrying out skin color detection on the original image to obtain a human hand probability map of the original image;
and splicing the human hand probability map of the original image with the original image to obtain a spliced original image, wherein the second feature map is obtained by performing feature extraction on the spliced original image.
In an optional embodiment, the step of performing skin color detection on the original image to obtain a human hand probability map of the original image includes:
performing skin color detection on the original image according to the following formula to obtain a human hand probability map of the original image:
$$P(x)=\sum_{i=1}^{N} W_i\,\frac{1}{(2\pi)^{3/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i)\right)$$

wherein x represents the RGB vector of a pixel in the original image, N represents the number of sub-Gaussian models in the Gaussian mixture model, W_i represents the probability of x belonging to the i-th Gaussian model (the mixture weight), μ_i represents the mean of the i-th Gaussian model, Σ_i represents the covariance matrix of the i-th Gaussian model, and P(x) is the probability that x belongs to a human hand part.
In an optional embodiment, the step of cropping the original image based on the first human hand detection frame to obtain the first image includes:
cutting the original image according to the first human hand detection frame to obtain an image to be detected;
carrying out skin color detection on the image to be detected to obtain a human hand probability map of the image to be detected;
and splicing the human hand probability map of the image to be detected with the image to be detected to obtain the first image.
In an optional embodiment, the step of cropping the original image based on the first human hand detection frame includes:
and amplifying the first human hand detection frame based on a preset amplification factor, and cutting the original image according to the amplified first human hand detection frame.
In a second aspect, an embodiment of the present invention provides a gesture tracking apparatus, which is applied to an electronic device, and includes a classification tracking module; the classification tracking module is configured to:
based on the first human hand detection frame, cutting the original image to obtain a first image;
performing feature extraction on the first image to obtain a first feature map;
judging whether a hand exists in the first feature map, and if so, segmenting the first feature map to obtain a hand image;
and performing detection frame regression processing on the first feature map and the hand image to obtain a second hand detection frame, and tracking the next frame of image according to the second hand detection frame.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a non-volatile memory storing computer instructions, where when the computer instructions are executed by the processor, the electronic device performs the gesture tracking method described in any one of the foregoing embodiments.
In a fourth aspect, an embodiment of the present invention provides a storage medium, where a computer program is stored, and the computer program is executed to implement the gesture tracking method according to any one of the foregoing embodiments.
According to the gesture tracking method and device, the electronic equipment and the storage medium, the original image is cropped based on the first hand detection frame to obtain the first image, and feature extraction is performed on the first image to obtain the first feature map. Whether a hand exists in the first feature map is judged; if a hand exists, the first feature map is segmented to obtain the hand image, and detection frame regression is performed on the first feature map and the hand image to obtain the second hand detection frame. The second hand detection frame is used as the input for the next frame of image, realizing tracking of the next frame with good real-time performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention.
Fig. 2 is a schematic flowchart of a gesture tracking method according to an embodiment of the present invention.
Fig. 3 is another schematic flow chart diagram of a gesture tracking method according to an embodiment of the present invention.
Fig. 4 is a schematic flowchart of a gesture tracking method according to an embodiment of the present invention.
Fig. 5 is a schematic flowchart of a gesture tracking method according to an embodiment of the present invention.
Fig. 6 is a schematic flowchart of a gesture tracking method according to an embodiment of the present invention.
Fig. 7 is a block diagram illustrating a gesture tracking apparatus according to an embodiment of the present invention.
Icon: 100-an electronic device; 10-a gesture tracking device; 11-a classification tracking module; 12-a gesture detection module; 20-a memory; 30-a processor; 40-a communication unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that terms such as "upper", "lower", "inside" and "outside", if used, indicate orientations or positional relationships based on those shown in the drawings or on the usual placement of the product of the invention. They are used only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
At present, deep neural networks are the most effective approach to the two tasks of detection and classification. Classification based on deep neural networks is a research hotspot, and many methods for improving network operating efficiency (such as network distillation and network quantization) have been developed, making real-time operation of deep neural networks on mobile terminals possible. Detection, however, is often difficult to run in real time, especially for gesture recognition tasks. One very important reason is that the scale of a human hand changes greatly during motion, so the neural network needs more parameters to process hand images of different scales, which greatly increases the amount of computation. Tracking technology can be divided into traditional tracking and tracking based on deep neural networks: traditional tracking can run in real time (for example, the KCF tracking algorithm) but its tracking effect is poor, while tracking based on deep neural networks achieves a good tracking effect but cannot run in real time.
Based on the above research, the embodiment of the present invention provides a gesture tracking method to improve the above problem.
This embodiment provides a gesture tracking method, which is applied to the electronic device 100 shown in fig. 1; the electronic device 100 executes the gesture tracking method provided by this embodiment. The electronic device 100 may be, but is not limited to, a device with processing capability such as a Personal Computer (PC), a notebook computer, a Personal Digital Assistant (PDA) or a server.
The electronic device 100 comprises a gesture tracking apparatus 10, a memory 20, a processor 30 and a communication unit 40. The elements of the memory 20, the processor 30 and the communication unit 40 are electrically connected to one another, directly or indirectly, to enable the transfer and interaction of data; for example, these components may be electrically connected to each other via one or more communication buses or signal lines. The gesture tracking apparatus 10 includes at least one software function module that may be stored in the memory 20 in the form of software or firmware, and the processor 30 executes various function applications and data processing by running the software programs and modules stored in the memory 20.
The memory 20 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 30 may be an integrated circuit chip having signal processing capabilities. The processor 30 may be a general-purpose processor including a Central Processing Unit (CPU), a Network Processor (NP), and the like.
The communication unit 40 is configured to establish a communication connection between the electronic device 100 and another external device through a network, and perform data transmission through the network.
It is to be understood that the configuration shown in fig. 1 is merely exemplary, and that the electronic device 100 may include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
The gesture tracking method provided by this embodiment mainly comprises two parts: a hand detection frame acquisition part and a gesture classification tracking part. The hand detection frame acquisition part mainly uses the Faster R-CNN detection framework as its basic framework and comprises a basic neural network, a Region Proposal Network (RPN) and a classification regression network. The gesture classification tracking part runs in real time and consists of a multi-branch neural network, mainly comprising a basic neural network, a gesture classification branch, a gesture frame regression branch and a hand judging branch. The gesture classification branch may be a real-time classification network and is mainly used to predict the category of the gesture in the image. The hand judging branch may be a fully connected network and is mainly used to judge whether a hand is present in the image, so as to prevent accidental loss during tracking and stop tracking in time. The gesture frame regression branch can be subdivided into a segmentation network and a regression network and is mainly used to regress the gesture detection frame.
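As a concrete illustration of this multi-branch structure, a minimal PyTorch sketch follows. All channel counts, layer depths and the number of gesture classes are illustrative assumptions; the patent does not disclose a concrete architecture or its weights.

```python
import torch
import torch.nn as nn

class GestureTrackingNet(nn.Module):
    """Hedged sketch of the gesture classification tracking network:
    a shared basic backbone plus hand judging, gesture classification
    and gesture frame regression (segmentation + regression) branches."""

    def __init__(self, num_gestures: int = 10):
        super().__init__()
        # Basic neural network: shared backbone over the 4-channel input
        # (3-channel crop spliced with the 1-channel hand probability map).
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Hand judging branch: decides whether a hand is present at all,
        # so tracking can stop in time when the hand is lost.
        self.has_hand = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))
        # Gesture classification branch: predicts the gesture category.
        self.gesture = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_gestures))
        # Gesture frame regression branch: a segmentation sub-network
        # (hand mask) and a regression sub-network (detection frame).
        self.segment = nn.Conv2d(64, 1, 1)
        self.box_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64 + 1, 4))

    def forward(self, x):
        feat = self.backbone(x)                    # first feature map
        mask = torch.sigmoid(self.segment(feat))   # segmented hand image
        fused = torch.cat([feat, mask], dim=1)     # feature map + hand image
        return {
            "has_hand": torch.sigmoid(self.has_hand(feat)).squeeze(1),
            "gesture": self.gesture(feat),
            "mask": mask,
            "box": self.box_head(fused),           # regressed detection frame
        }
```

Regressing the box from the feature map concatenated with the mask mirrors the text above, where detection frame regression is performed on the first feature map and the hand image together.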
Referring to fig. 2, fig. 2 is a schematic flow chart of the gesture tracking method provided in this embodiment; the flow in fig. 2 is mainly executed by the gesture classification tracking part. The specific flow of the gesture tracking method shown in fig. 2 is described in detail below.
Step S10: and cutting the original image based on the first human hand detection frame to obtain a first image.
The first human hand detection frame is a detection frame containing a human hand part. The corresponding area (the human hand part) in the original image is cropped according to the first human hand detection frame to obtain the hand part of the original image.
Since the hand part in the first human hand detection frame may deviate somewhat from the position of the actual hand part in the original image, in order to ensure that the hand part of the image falls inside the detection frame, the step of cropping the original image based on the first human hand detection frame includes the following step:
and amplifying the first human hand detection frame based on a preset amplification factor, and cutting the original image according to the amplified first human hand detection frame.
As an alternative embodiment, the preset magnification factor may be two; that is, the first human hand detection frame is enlarged by a factor of two, and the original image is cropped according to the enlarged first human hand detection frame to obtain the first image.
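To make the enlargement concrete, a minimal NumPy sketch follows. The (x, y, w, h) top-left box convention and the clamping to image bounds are assumptions of the sketch; the patent only specifies the preset magnification factor.

```python
import numpy as np

def enlarge_and_crop(image: np.ndarray, box, scale: float = 2.0):
    """Enlarge an (x, y, w, h) hand detection frame about its centre by
    `scale`, clamp it to the image, and return the crop together with
    the enlarged box in full-image coordinates."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    nw, nh = w * scale, h * scale
    x0 = int(max(cx - nw / 2.0, 0))
    y0 = int(max(cy - nh / 2.0, 0))
    x1 = int(min(cx + nw / 2.0, image.shape[1]))
    y1 = int(min(cy + nh / 2.0, image.shape[0]))
    return image[y0:y1, x0:x1], (x0, y0, x1 - x0, y1 - y0)
```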
Optionally, in order to improve the working efficiency, in this embodiment, please refer to fig. 3 in combination, the step of cropping the original image based on the first human hand detection frame to obtain the first image may include steps S11 to S13.
Step S11: and cutting the original image according to the first human hand detection frame to obtain an image to be detected.
Step S12: and carrying out skin color detection on the image to be detected to obtain a human hand probability chart of the image to be detected.
Step S13: and splicing the human hand probability graph of the image to be detected and the image to be detected to obtain the first image.
Performing skin color detection on the image to be detected means calculating the probability that each pixel in the image to be detected belongs to a human hand.
As an optional implementation manner, in this embodiment, a gaussian mixture model is used to perform skin color detection on an image to be detected, and the formula is as follows:
$$P(x)=\sum_{i=1}^{N} W_i\,\frac{1}{(2\pi)^{3/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i)\right)$$

wherein x represents the RGB color vector of a pixel in the image to be detected, N represents the number of sub-Gaussian models in the Gaussian mixture model, W_i represents the probability of x belonging to the i-th Gaussian model (the mixture weight), μ_i represents the mean of the i-th Gaussian model, Σ_i represents the covariance matrix of the i-th Gaussian model, and P(x) is the probability that x belongs to a human hand part.
After each pixel in the image to be detected has been evaluated with the above formula, the hand probability map of the image to be detected is obtained. The hand probability map is then spliced with the image to be detected; that is, the three-channel image to be detected and the hand probability map form a four-channel first image. The first image provides rich prior information for subsequent operations, improves processing efficiency, and supplies more features, which benefits gesture classification.
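The skin color detection and splicing steps can be sketched in a few lines of NumPy. This is a minimal illustration that assumes pre-trained mixture parameters (weights, means, covariances); the patent does not specify how the Gaussian mixture model is trained.

```python
import numpy as np

def hand_probability_map(image_rgb, weights, means, covs):
    """Per-pixel hand probability P(x) under the Gaussian mixture model
    above; weights (N,), means (N, 3) and covs (N, 3, 3) are assumed
    pre-trained skin-colour parameters."""
    h, w, _ = image_rgb.shape
    x = image_rgb.reshape(-1, 3).astype(np.float64)  # pixels as RGB vectors
    p = np.zeros(len(x))
    for wi, mu, cov in zip(weights, means, covs):
        d = x - mu
        norm = 1.0 / np.sqrt(((2.0 * np.pi) ** 3) * np.linalg.det(cov))
        maha = np.einsum("ij,jk,ik->i", d, np.linalg.inv(cov), d)
        p += wi * norm * np.exp(-0.5 * maha)  # weighted Gaussian density
    return p.reshape(h, w)

def stitch(image_rgb, prob_map):
    """Splice the 3-channel crop with its hand probability map to form
    the 4-channel first image described above."""
    return np.dstack([image_rgb.astype(np.float64), prob_map])
```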
Referring back to fig. 2, after the first image is obtained, step S20 is performed.
Step S20: and performing feature extraction on the first image to obtain a first feature map.
After the first image is obtained, the features of the first image are extracted through convolution processing to obtain a first feature map, and optionally, the process can be realized through a basic neural network.
Step S30: and judging whether a hand exists in the first characteristic diagram, and if the hand exists in the first characteristic diagram, segmenting the first characteristic diagram to obtain a hand image.
Step S40: and performing detection frame regression processing on the first characteristic diagram and the hand image to obtain a second hand detection frame, and tracking the next frame of image according to the second hand detection frame.
After the first feature map is obtained, the hand judging branch is used for judging whether a hand exists in the first feature map, if so, the hand in the first feature map is segmented by the segmentation network in the gesture frame regression branch to obtain a hand image.
After the hand image is obtained, detection frame regression processing is performed on the first feature map and the hand image by the regression network in the gesture frame regression branch to obtain a second hand detection frame containing the hand part. The second hand detection frame is taken as the input for the next frame of image: the next frame is cropped with the second hand detection frame, and the processes of steps S10 to S40 are executed, thereby realizing real-time tracking of the hand in the next frame.
It should be noted that the hand detection frame regressed by the gesture frame regression branch is the gesture frame of the current image and therefore deviates somewhat from the actual position of the hand part in the next frame of image. To ensure that the hand part of the next frame falls inside the detection frame, the regressed hand detection frame is likewise enlarged by the preset magnification factor. By enlarging the hand detection frame at every step, the gesture tracking method provided by this embodiment converts the detection task into a single-scale detection task, which reduces the difficulty and allows a smaller real-time network to complete it.
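A hedged sketch of this per-frame loop (steps S10 to S40) follows, reusing the helper sketches above (enlarge_and_crop, hand_probability_map, stitch, GestureTrackingNet). Here `detect_hand` is a hypothetical stand-in for the hand detection frame acquisition part (steps S50 to S80); none of these names come from the patent itself.

```python
import torch

def track(frames, init_box, net, gmm_params, scale=2.0):
    """Yield (hand box, gesture class) per frame; `net` is a trained
    GestureTrackingNet and `gmm_params` the skin-colour mixture."""
    box = init_box
    for frame in frames:
        # Step S10: crop with the enlarged hand detection frame and
        # splice the crop with its hand probability map.
        crop, crop_box = enlarge_and_crop(frame, box, scale)
        prob = hand_probability_map(crop, *gmm_params)
        x = torch.from_numpy(stitch(crop, prob)).permute(2, 0, 1)[None].float()
        out = net(x)                       # steps S20 to S40
        if out["has_hand"].item() < 0.5:   # no hand: tracking interrupted,
            box = detect_hand(frame)       # re-acquire the detection frame
            continue
        dx, dy, w, h = out["box"].squeeze(0).tolist()
        # The regressed frame is relative to the crop; map it back to
        # full-frame coordinates before feeding it to the next frame.
        box = (crop_box[0] + dx, crop_box[1] + dy, w, h)
        yield box, int(out["gesture"].argmax(dim=1))
```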
As an optional implementation, after determining that a human hand exists in the first feature map, the gesture tracking method provided in this embodiment may also classify the gesture category of the hand (e.g., fist, finger, centroid, and other gesture categories) by using the gesture classification branch. Gesture classification and gesture tracking are thus combined: while the hand is being tracked, its gesture category is also classified, which improves the working efficiency.
The gesture tracking method provided by this embodiment therefore combines the tracking task with real-time classification, tracking the gesture while classifying the gesture category, so the real-time performance is greatly improved.
As another alternative, if there is no human hand in the first feature map, the tracking process is interrupted and the hand detection frame needs to be re-acquired to find the hand part in the original image. In this case, the hand detection frame is re-acquired by the hand detection frame acquisition part; its flow is shown in fig. 4 and described in detail below.
Step S50: and if the first feature map does not have the human hand, performing feature extraction on the original image to obtain a second feature map.
If the human hand is not present in the first feature map, the features of the original image are extracted by convolution processing. Alternatively, the process may be implemented by an underlying neural network.
In order to achieve higher accuracy with a smaller amount of calculation, referring to fig. 5, the method further includes steps S51 to S52 before feature extraction is performed on the original image to obtain the second feature map.
Step S51: and carrying out skin color detection on the original image to obtain a human hand probability map of the original image.
Step S52: and splicing the human hand probability graph of the original image with the original image to obtain a spliced original image, wherein the second characteristic graph is obtained by extracting the characteristics of the spliced original image.
In order to reduce the amount of calculation in obtaining the hand detection frame and improve the precision of the detection frame, this embodiment first performs skin color detection on the original image and calculates the probability that each pixel belongs to a human hand, obtaining a human hand probability map of the original image and thus prior information about hand skin color.
As an optional implementation manner, in this embodiment, the original image is also subjected to skin color detection by using a gaussian mixture model, and the formula is as follows:
$$P(x)=\sum_{i=1}^{N} W_i\,\frac{1}{(2\pi)^{3/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i)\right)$$

wherein x represents the RGB color vector of a pixel in the original image, N represents the number of sub-Gaussian models in the Gaussian mixture model, W_i represents the probability of x belonging to the i-th Gaussian model (the mixture weight), μ_i represents the mean of the i-th Gaussian model, Σ_i represents the covariance matrix of the i-th Gaussian model, and P(x) is the probability that x belongs to a human hand part.
After each pixel in the original image has been evaluated with the above formula, the human hand probability map of the original image is obtained. The probability map is then spliced with the original image to obtain the spliced original image, a four-channel image formed from the three-channel original image and the probability map. Extracting features from the spliced original image provides rich prior information for the subsequent network, further improving the precision of the hand detection frame while reducing the amount of calculation in obtaining it and improving the processing efficiency of the network.
Referring back to fig. 4, after the second feature map is obtained, step S60 is executed.
Step S60: and inputting the second feature map into a regional suggestion network to obtain a first detection frame.
Step S70: and cutting the characteristics of the area corresponding to the first detection frame in the second characteristic diagram to obtain a cut characteristic diagram.
Step S80: and performing fine trimming processing on the first detection frame according to the cut feature diagram to obtain a third hand detection frame, and tracking the next frame of image according to the third hand detection frame.
After the spliced original image is obtained and the second feature map is extracted from it, the second feature map is input to the region proposal network, which performs a preliminary detection of the hand part in the second feature map to obtain the first detection frame.
Since the first detection frame is rough, it needs to be refined. Therefore, after the first detection frame is obtained, the features of the area corresponding to the first detection frame in the second feature map are cut out to obtain a cut feature map, and the first detection frame is refined according to the cut feature map to obtain the third hand detection frame.
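A minimal sketch of the cutting step (S70) follows. The feature map stride of 8 is an assumed downsampling factor, and plain slicing is shown for brevity; a full Faster R-CNN pipeline would typically use RoI pooling or RoIAlign here to obtain a fixed-size feature.

```python
def crop_roi_features(feature_map, box, stride: int = 8):
    """Cut the features of the area corresponding to the first detection
    frame out of the second feature map; feature_map has shape
    (B, C, H, W) and box is (x, y, w, h) in original-image pixels."""
    x, y, w, h = box
    x0, y0 = int(x // stride), int(y // stride)
    x1 = max(x0 + 1, int((x + w) // stride))
    y1 = max(y0 + 1, int((y + h) // stride))
    return feature_map[:, :, y0:y1, x0:x1]
```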
Referring to fig. 6, in a further embodiment, the step of refining the first detection frame according to the cut feature map to obtain a third human hand detection frame includes steps S81 to S82.
Step S81: and inputting the cut feature graph into a classification regression network to obtain a detection frame fine modification parameter.
Step S82: and carrying out finishing treatment on the first detection frame according to the detection frame finishing parameters to obtain a third hand detection frame.
Since the first detection frame obtained above may contain parts other than the human hand, classification logic is required for further processing. The cut feature map is input into the classification regression network, which classifies it into foreground and background; based on the classification result, the background part within the first detection frame is removed and only a first detection frame containing the foreground is output. Here the foreground is the hand part in the cut feature map, and the background is everything else in the cut feature map.
While the classification regression network classifies the foreground and background of the cut feature map, it also regresses the offsets of the first detection frame relative to its original position, i.e. the detection frame refinement parameters. Refining the first detection frame means applying these parameters on top of the original frame: for example, if the first detection frame is $(x, y, w, h)$ and the refinement parameters output by the classification regression network are $(\alpha_x, \alpha_y, \beta_w, \beta_h)$, the final refined detection frame is $(x+\alpha_x x,\; y+\alpha_y y,\; w\cdot e^{\beta_w},\; h\cdot e^{\beta_h})$.
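The refinement itself is a direct transcription of the formula above; the tuple layouts follow the (x, y, w, h) and (α_x, α_y, β_w, β_h) conventions used in the text.

```python
import math

def refine_box(box, params):
    """Apply the detection frame refinement parameters regressed by the
    classification regression network to the first detection frame."""
    x, y, w, h = box
    ax, ay, bw, bh = params
    return (x + ax * x, y + ay * y, w * math.exp(bw), h * math.exp(bh))
```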
After the detection frame refinement parameters are obtained, the first detection frame is refined with them, yielding the third hand detection frame containing the hand part. The original image is then cropped again with the third hand detection frame, the hand part in the original image is located again, and the processes of steps S10 to S40 are executed, thereby realizing tracking of the hand in the image.
It should be noted that, because of the precision requirement, the process by which the hand detection frame acquisition part obtains the third hand detection frame may not run in real time, so the third hand detection frame may deviate somewhat from the actual position of the hand part in the original image. To ensure that the hand part of the original image falls inside the third hand detection frame, the third hand detection frame is likewise enlarged by the preset magnification factor, and the original image is cropped based on the enlarged frame.
In the gesture tracking method provided by the embodiment of the present invention, the gesture classification tracking part runs in real time as the main thread, and the hand detection frame acquisition part runs as a secondary thread. When a hand is judged to exist in the first feature map, the gesture classification tracking part supplies the hand detection frame, realizing real-time tracking while the hand gesture is classified. When no hand is judged to exist in the first feature map, the hand detection frame acquisition part supplies the hand detection frame, and the hand in the image is searched for again, realizing hand tracking. Multithreaded execution is thus achieved.
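A hedged sketch of this two-thread scheme follows: the gesture classification tracking part runs in real time on the main thread, while the slower hand detection frame acquisition part serves re-detection requests on a secondary thread. `detect_hand` is again a hypothetical stand-in for steps S50 to S80.

```python
import queue
import threading

requests: queue.Queue = queue.Queue()   # frames needing re-detection
results: queue.Queue = queue.Queue()    # re-acquired hand detection frames

def detection_worker():
    while True:
        frame = requests.get()
        results.put(detect_hand(frame))  # e.g. the third hand detection frame

threading.Thread(target=detection_worker, daemon=True).start()

# Main thread: when the hand judging branch reports no hand, submit the
# frame to the secondary thread and pick up the new detection frame.
# requests.put(current_frame)
# new_box = results.get()  # blocks until the detector has re-acquired
```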
Through multithreaded execution and the combination of the tracking task with a real-time classification network, the gesture tracking method provided by this embodiment improves the real-time performance of hand tracking on the one hand, and on the other hand avoids the time-consuming detection task, improving the working efficiency.
On the basis, please refer to fig. 7 in combination, an embodiment of the present invention further provides a gesture tracking apparatus 10 applied to an electronic device 100, where the gesture tracking apparatus 10 includes a classification tracking module 11. The classification tracking module 11 is configured to:
and cutting the original image based on the first human hand detection frame to obtain a first image.
And performing feature extraction on the first image to obtain a first feature map.
And judging whether a hand exists in the first characteristic diagram, and if the hand exists in the first characteristic diagram, segmenting the first characteristic diagram to obtain a hand image.
And performing detection frame regression processing on the first characteristic diagram and the hand image to obtain a second hand detection frame, and tracking the next frame of image according to the second hand detection frame.
The gesture tracking device 10 further comprises a gesture detection module 12, the gesture detection module 12 being configured to:
and if the first feature map does not have the human hand, performing feature extraction on the original image to obtain a second feature map.
And inputting the second feature map into a regional suggestion network to obtain a first detection frame.
And cutting the characteristics of the area corresponding to the first detection frame in the second characteristic diagram to obtain a cut characteristic diagram.
And performing fine trimming processing on the first detection frame according to the cut feature diagram to obtain a third hand detection frame, and tracking the next frame of image according to the third hand detection frame.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the gesture tracking apparatus 10 described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
On the basis, an embodiment of the present invention further provides an electronic device, which includes a processor and a non-volatile memory storing computer instructions, where when the computer instructions are executed by the processor, the electronic device executes the gesture tracking method according to any one of the foregoing embodiments.
On the basis of the foregoing, an embodiment of the present invention provides a storage medium, in which a computer program is stored, and the computer program, when executed, implements the gesture tracking method according to any one of the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the electronic device and the storage medium described above may refer to the corresponding processes in the foregoing method, and will not be described in too much detail herein.
To sum up, in the gesture tracking method and device, the electronic device and the storage medium provided by the embodiments of the present invention, an original image is cropped based on a first hand detection frame to obtain a first image, and feature extraction is performed on the first image to obtain a first feature map. Whether a hand exists in the first feature map is judged; if a hand exists, the hand in the first feature map is classified by gesture, and the first feature map is segmented to obtain a hand image. Detection frame regression processing is performed on the first feature map and the hand image to obtain a second hand detection frame, which is used as the input for the next frame of image. The tracking task is thereby converted into a single-scale detection problem and combined with the classification processing, realizing tracking of the next frame of image with good real-time performance.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A gesture tracking method is applied to an electronic device, and comprises the following steps:
based on the first human hand detection frame, cutting the original image to obtain a first image;
performing feature extraction on the first image to obtain a first feature map;
judging whether a hand exists in the first feature map, and if so, segmenting the first feature map to obtain a hand image;
and performing detection frame regression processing on the first feature map and the hand image to obtain a second hand detection frame, and tracking the next frame of image according to the second hand detection frame.
2. The gesture tracking method according to claim 1, further comprising:
if no human hand exists in the first feature map, performing feature extraction on the original image to obtain a second feature map;
inputting the second feature map into a region proposal network to obtain a first detection frame;
cutting the features of the area corresponding to the first detection frame in the second feature map to obtain a cut feature map;
and refining the first detection frame according to the cut feature map to obtain a third hand detection frame, and tracking the next frame of image according to the third hand detection frame.
3. The gesture tracking method according to claim 2, wherein the step of refining the first detection frame according to the cut feature map to obtain a third human hand detection frame comprises:
inputting the cut feature map into a classification regression network to obtain detection frame refinement parameters;
and refining the first detection frame according to the detection frame refinement parameters to obtain the third hand detection frame.
4. The gesture tracking method according to claim 2, wherein before the feature extraction of the original image to obtain the second feature map, the method further comprises:
carrying out skin color detection on the original image to obtain a human hand probability map of the original image;
and splicing the human hand probability map of the original image with the original image to obtain a spliced original image, wherein the second feature map is obtained by performing feature extraction on the spliced original image.
5. The gesture tracking method according to claim 4, wherein the step of performing skin color detection on the original image to obtain a human hand probability map of the original image comprises:
performing skin color detection on the original image according to the following formula to obtain a human hand probability map of the original image:
$$P(x)=\sum_{i=1}^{N} W_i\,\frac{1}{(2\pi)^{3/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu_i)^{\top}\Sigma_i^{-1}(x-\mu_i)\right)$$

wherein x represents the RGB vector of a pixel in the original image, N represents the number of sub-Gaussian models in the Gaussian mixture model, W_i represents the probability of x belonging to the i-th Gaussian model (the mixture weight), μ_i represents the mean of the i-th Gaussian model, Σ_i represents the covariance matrix of the i-th Gaussian model, and P(x) is the probability that x belongs to a human hand part.
6. The gesture tracking method according to claim 1, wherein the step of cropping the original image based on the first human hand detection frame to obtain the first image comprises:
cutting the original image according to the first human hand detection frame to obtain an image to be detected;
carrying out skin color detection on the image to be detected to obtain a human hand probability map of the image to be detected;
and splicing the human hand probability map of the image to be detected with the image to be detected to obtain the first image.
7. The gesture tracking method according to claim 1, wherein the step of cropping the original image based on the first human hand detection frame comprises:
and amplifying the first human hand detection frame based on a preset amplification factor, and cutting the original image according to the amplified first human hand detection frame.
8. A gesture tracking device, applied to an electronic device and comprising a classification tracking module, the classification tracking module being configured to:
based on the first human hand detection frame, cutting the original image to obtain a first image;
performing feature extraction on the first image to obtain a first feature map;
judging whether a hand exists in the first feature map, and if so, segmenting the first feature map to obtain a hand image;
and performing detection frame regression processing on the first feature map and the hand image to obtain a second hand detection frame, and tracking the next frame of image according to the second hand detection frame.
9. An electronic device comprising a processor and a non-volatile memory storing computer instructions that, when executed by the processor, perform the gesture tracking method of any of claims 1-7.
10. A storage medium having stored thereon a computer program which, when executed, implements a gesture tracking method according to any one of claims 1 to 7.
CN202010159687.9A 2020-03-09 2020-03-09 Gesture tracking method, gesture tracking device, electronic equipment and storage medium Active CN111325179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010159687.9A CN111325179B (en) 2020-03-09 2020-03-09 Gesture tracking method, gesture tracking device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010159687.9A CN111325179B (en) 2020-03-09 2020-03-09 Gesture tracking method, gesture tracking device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111325179A (en) 2020-06-23
CN111325179B CN111325179B (en) 2023-05-02

Family

ID=71171473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010159687.9A Active CN111325179B (en) 2020-03-09 2020-03-09 Gesture tracking method, gesture tracking device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111325179B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723776A (en) * 2020-07-03 2020-09-29 厦门美图之家科技有限公司 Human body outer contour point detection method and device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563350A (en) * 2017-09-21 2018-01-09 深圳市唯特视科技有限公司 A kind of method for detecting human face for suggesting network based on yardstick
US20180196419A1 (en) * 2015-09-24 2018-07-12 Beijing Zero Zero Infinity Technology Co., Ltd Method and device for controlling unmanned aerial vehicle with gesture
CN108509839A (en) * 2018-02-02 2018-09-07 东华大学 One kind being based on the efficient gestures detection recognition methods of region convolutional neural networks
CN109598234A (en) * 2018-12-04 2019-04-09 深圳美图创新科技有限公司 Critical point detection method and apparatus

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180196419A1 (en) * 2015-09-24 2018-07-12 Beijing Zero Zero Infinity Technology Co., Ltd Method and device for controlling unmanned aerial vehicle with gesture
CN107563350A (en) * 2017-09-21 2018-01-09 深圳市唯特视科技有限公司 A kind of method for detecting human face for suggesting network based on yardstick
CN108509839A (en) * 2018-02-02 2018-09-07 东华大学 One kind being based on the efficient gestures detection recognition methods of region convolutional neural networks
CN109598234A (en) * 2018-12-04 2019-04-09 深圳美图创新科技有限公司 Critical point detection method and apparatus

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723776A (en) * 2020-07-03 2020-09-29 厦门美图之家科技有限公司 Human body outer contour point detection method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN111325179B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
US9367766B2 (en) Text line detection in images
US10430649B2 (en) Text region detection in digital images using image tag filtering
US11256737B2 (en) Image retrieval methods and apparatuses, devices, and readable storage media
CN109033955B (en) Face tracking method and system
CN111666905B (en) Model training method, pedestrian attribute identification method and related device
US20190332858A1 (en) Method and device for identifying wrist, method for identifying gesture, electronic equipment and computer-readable storage medium
TWI776176B (en) Device and method for scoring hand work motion and storage medium
CN111340213B (en) Neural network training method, electronic device, and storage medium
CN110232381B (en) License plate segmentation method, license plate segmentation device, computer equipment and computer readable storage medium
CN113159026A (en) Image processing method, image processing apparatus, electronic device, and medium
CN116844177A (en) Table identification method, apparatus, device and storage medium
CN111325179B (en) Gesture tracking method, gesture tracking device, electronic equipment and storage medium
CN108960246B (en) Binarization processing device and method for image recognition
WO2021174688A1 (en) Facial detection method and system
CN113228105A (en) Image processing method and device and electronic equipment
CN114511862B (en) Form identification method and device and electronic equipment
CN108288023B (en) Face recognition method and device
EP3639203B1 (en) Out-of-bounds detection of a document in a live camera feed
CN113505763B (en) Key point detection method and device, electronic equipment and storage medium
CN115565103A (en) Dynamic target detection method and device, computer equipment and storage medium
CN108475339B (en) Method and system for classifying objects in an image
US11687886B2 (en) Method and device for identifying number of bills and multiple bill areas in image
CN113392820A (en) Dynamic gesture recognition method and device, electronic equipment and readable storage medium
CN114187448A (en) Document image recognition method and device, electronic equipment and computer readable medium
CN112101139A (en) Human shape detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant