CN110334576B - Hand tracking method and device

Info

Publication number: CN110334576B
Application number: CN201910359991.5A
Authority: CN
Inventors: 孙晨; 陈文科; 高源�; 姚聪
Original assignee: Beijing Kuangshi Technology Co Ltd
Current assignee: Beijing Kuangshi Technology Co Ltd
Priority date: 2019-04-30
Filing date: 2019-04-30
Publication date: 2021-09-24
Anticipated expiration: 2039-04-30
Also published as: CN110334576A

Abstract

The present invention relates to the field of gesture recognition, and in particular to a hand tracking method and apparatus. The hand tracking method comprises: a video frame acquisition step of acquiring consecutive video frames; a judging step of determining whether a hand image was detected in the previous video frame according to whether an annotation of a hand image exists in the previous video frame; a single-scale local detection step of, when the previous video frame contains a hand image, performing hand detection on a local range of the current video frame and annotating the position of the hand image in the current video frame; and a multi-scale global detection step of, when the previous video frame contains no hand image, performing hand detection on the full image range of the current video frame and annotating the position of the hand image in the current video frame. Detecting and annotating the hand position in the video with these two detection modes allows the continuous motion trajectory of the hand image to be tracked, which improves hand tracking efficiency and reduces the difficulty of hand detection and of real-time operation.

Description

Hand tracking method and device

Technical Field

The present invention relates generally to the field of gesture recognition technology, and more particularly, to a hand tracking method and apparatus.

Background

Gesture recognition is a supporting technology for touchless human-computer interaction. Without mechanical devices such as a touch screen, people can use simple gestures to control or interact with a device, and the computer interprets human behavior. It is therefore one of the important research directions in human-computer interaction.

However, continuously determining the hand position, i.e. hand tracking, is the most difficult and time-consuming step in gesture recognition. Unlike general object detection and tracking, hand tracking in gesture recognition is difficult in three main respects: first, the hand deforms heavily, which makes detection difficult; second, the hand moves quickly within the camera's field of view, which makes tracking difficult; third, computing resources are limited, which makes real-time computation difficult.

Disclosure of Invention

In order to solve the above problems in the prior art, the present invention provides a hand tracking method and device.

In a first aspect, an embodiment of the present invention provides a hand tracking method, including: a video frame acquisition step of acquiring consecutive video frames including a previous video frame and a current video frame; a judging step of determining whether a hand image was detected in the previous video frame according to whether an annotation of a hand image exists in the previous video frame; a single-scale local detection step of, when it is determined that the previous video frame contains a hand image, performing hand detection on a local range of the current video frame through single-scale local detection, and annotating, in the current video frame, the positions of one or more detected hand images; and a multi-scale global detection step of, when it is determined that the previous video frame contains no hand image, performing hand detection on the full image range of the current video frame through multi-scale global detection, and annotating, in the current video frame, the positions of one or more detected hand images.

In one embodiment, the hand tracking method further comprises: a main hand image selection step of selecting one hand image as the main hand image from the one or more hand images detected in the single-scale local detection step or the multi-scale global detection step, and deleting the annotations of the remaining hand images.

In another embodiment, the main hand image selection step further comprises: selecting the hand image with the largest area as the main hand image according to the areas of the plurality of hand images in the current video frame.

In a further embodiment, the main hand image selection step further comprises: selecting the hand image that matches a preset gesture in the video frame as the main hand image.

In one embodiment, the hand tracking method further comprises: a hand recognition step of determining, for one or more hand images in the current video frame, whether the hand image is a real hand, and deleting the annotation of the hand image if it is determined not to be a real hand.

In one embodiment, the hand recognition step further comprises: cropping the current video frame according to the position of the hand image to obtain a cropped image, scaling the cropped image to a fixed recognition size range, and then judging the scaled image.

In one embodiment, the single-scale local detection step further comprises: based on the position of the hand image in the previous video frame, cropping an enlarged region from the current video frame to obtain a local image of the local range, and performing hand detection in the local image.

In another embodiment, the single-scale local detection step further comprises: scaling the local image to a fixed detection size range for hand detection.

In a second aspect, an embodiment of the present invention provides a hand tracking device, including: a video frame acquisition module, configured to acquire consecutive video frames including a previous video frame and a current video frame; a judging module, configured to determine whether a hand image was detected in the previous video frame according to whether an annotation of a hand image exists in the previous video frame; a single-scale local detection module, configured to, when it is determined that the previous video frame contains a hand image, perform hand detection on a local range of the current video frame and annotate, in the current video frame, the positions of one or more detected hand images; and a multi-scale global detection module, configured to, when it is determined that the previous video frame contains no hand image, perform hand detection on the full image range of the current video frame and annotate, in the current video frame, the positions of one or more detected hand images.

In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes: a memory to store instructions; and a processor for invoking the memory-stored instructions to perform the hand tracking method of the first aspect.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, which, when executed by a processor, perform the hand tracking method of the first aspect.

According to the hand tracking method and device provided by the invention, the hand position in the video is detected by using two detection modes, namely single-scale local detection and multi-scale global detection, and the detected hand image is marked for tracking the continuous motion track of the hand image, so that the hand tracking efficiency is improved, and the hand detection difficulty and the real-time operation difficulty are reduced.

Drawings

The above and other objects, features and advantages of embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 is a schematic diagram illustrating a hand tracking method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating another hand tracking method provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating another hand tracking method provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating another hand tracking method provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of a hand tracking device according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an electronic device provided by an embodiment of the invention;

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way.

It should be noted that although the expressions "first", "second", etc. are used herein to describe different modules, steps, data, etc. of the embodiments of the present invention, the expressions "first", "second", etc. are merely used to distinguish between different modules, steps, data, etc. and do not indicate a particular order or degree of importance. Indeed, the terms "first," "second," and the like are fully interchangeable.

FIG. 1 is a flow chart illustrating one embodiment of a hand tracking method 10. As shown in FIG. 1, the method of this embodiment includes a video frame acquisition step 110, a judging step 120, a single-scale local detection step 130, and a multi-scale global detection step 140. The steps in FIG. 1 are explained in detail below.

The video frame acquisition step 110 acquires consecutive video frames, including a previous video frame and a current video frame.

In this embodiment, the video frames may be obtained from an image capturing device, such as a mobile phone camera or a computer camera, or consecutive video frames may be obtained by retrieving a video from a local database or from the cloud. In one example, an image capturing device captures images, starts a preview video stream, and supplies real-time video frames. In another example, video frames are obtained from videos stored in a local database or in the cloud. Hand tracking is performed on the consecutive video frames thus obtained.

The judging step 120 determines whether a hand image was detected in the previous video frame according to whether an annotation of a hand image exists in the previous video frame.

In this embodiment, the presence or absence of a hand image annotation in the previous video frame is the criterion for deciding whether the previous video frame contains a hand image. A different detection mode is then selected according to the result, which reduces the detection difficulty.

In the single-scale local detection step 130, when it is determined that the previous video frame contains a hand image, hand detection is performed on a local range of the current video frame through single-scale local detection, and the positions of one or more detected hand images are annotated in the current video frame.

Single-scale local detection detects hand images whose scale, i.e. pixel size, lies within a fixed range. In this embodiment, the current video frame is processed this way when the previous video frame has been judged to contain a hand image. Local detection is carried out on the current video frame using a single-scale hand detection model, which may be an object detection model based on a convolutional neural network. The hand bounding boxes used to train the model vary only within a relatively fixed pixel range, for example 56 × 56 to 96 × 96. The network structure is therefore simple, the amount of data involved in detection is small, and the model can quickly determine whether a hand image exists in the local range of the current video frame. If hand images are detected in the local range, all of them are annotated on the current video frame with a square or circular position box; if no hand image is detected, nothing is annotated. The current video frame, as annotated after single-scale local detection, then serves as the previous video frame when the next video frame is judged, and the detection mode for the next video frame is selected according to that judgment. Single-scale local detection is fast and reduces the difficulty of real-time operation.

In the multi-scale global detection step 140, when it is determined that the previous video frame contains no hand image, hand detection is performed on the full image range of the current video frame through multi-scale global detection, and the positions of one or more detected hand images are annotated in the current video frame.

Multi-scale global detection detects hand images of all scales in a video frame. In this embodiment, the current video frame is processed this way when the previous video frame has been judged to contain no hand image. Full-image detection is carried out on the current video frame using a multi-scale hand detection model, which may be an object detection model based on a convolutional neural network. The hand bounding boxes used to train the model cover both large and small pixel sizes, i.e. large hands and small hands, so the model can handle hands of multiple scales at the same time. If hand images are detected in the current video frame, all of them are annotated on the current video frame with a square or circular position box; if no hand image is detected, nothing is annotated. The current video frame, as annotated after multi-scale global detection, then serves as the previous video frame when the next video frame is judged, and the detection mode for the next video frame is selected according to that judgment. Multi-scale global detection detects the hand images contained in the current video frame more comprehensively. In one example, because the first of the acquired consecutive video frames has no previous video frame, its previous video frame is treated by default as containing no hand image, and the first video frame is detected through multi-scale global detection.

By alternating between two detection modes of different complexity, this embodiment detects and annotates the hand position in the video so that the continuous motion trajectory of the hand image can be tracked. This helps improve hand tracking efficiency and reduces the difficulty of hand detection and of real-time operation.
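The alternation between steps 130 and 140 can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the function names `track_hands`, `detect_local`, and `detect_global` are hypothetical, the two detectors are passed in as plain callables standing in for the single-scale and multi-scale models, and an annotation is assumed to be a list of `(x, y, width, height)` boxes.

```python
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels; assumed layout
Frame = Any  # whatever image type the detectors accept


def track_hands(
    frames: Sequence[Frame],
    detect_local: Callable[[Frame, List[Box]], List[Box]],
    detect_global: Callable[[Frame], List[Box]],
) -> List[List[Box]]:
    """Return the hand annotations for each frame, in frame order."""
    annotations: List[List[Box]] = []
    previous: List[Box] = []  # the first frame has no predecessor: treat as "no hand"
    for frame in frames:
        if previous:
            # Step 130: the previous frame is annotated, so search near those boxes.
            current = detect_local(frame, previous)
        else:
            # Step 140: the previous frame has no annotation, so search the full image.
            current = detect_global(frame)
        annotations.append(current)
        previous = current  # the current result is the basis for judging the next frame
    return annotations
```

Because `previous` starts empty, the first frame always goes through the global detector, which matches the default described for the first video frame.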

FIG. 2 shows a schematic flow diagram of another embodiment of the hand tracking method 10. As shown in FIG. 2, the method of this embodiment further includes a main hand image selection step 150, in which one hand image is selected as the main hand image from the one or more hand images detected in the single-scale local detection step 130 or the multi-scale global detection step 140, and the annotations of the remaining hand images are deleted.

In some cases, single-scale local detection or multi-scale global detection annotates several hand images in the current video frame at once, for example when several hand images appear within the local range of single-scale local detection, or when several hand images appear in the full image examined by multi-scale global detection.

In one embodiment, the main hand image selection step 150 further comprises: selecting the hand image with the largest area as the main hand image according to the areas of the plurality of hand images in the current video frame. When single-scale local detection or multi-scale global detection annotates several hand images in the current video frame at once, the hand with the largest area is usually the hand to be tracked, and hands with smaller areas usually belong to the background. Selecting the hand image with the largest area as the main hand image therefore meets common requirements and makes it easy to observe changes in the motion trajectory of the hand image, which facilitates tracking.
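A minimal sketch of the largest-area rule follows. The function name `select_main_hand` and the `(x, y, width, height)` box layout are assumptions made for illustration; the source does not specify how annotations are stored.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); assumed layout


def select_main_hand(boxes: List[Box]) -> Optional[Box]:
    """Keep the annotated hand with the largest area; None if nothing is annotated."""
    if not boxes:
        return None
    # Area of a box is width * height; all other annotations are discarded.
    return max(boxes, key=lambda b: b[2] * b[3])
```

For example, given boxes of 60 × 60, 90 × 80, and 56 × 56 pixels, the 90 × 80 box is kept and the other two annotations are dropped.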

In yet another embodiment, the main hand image selection step 150 further comprises: selecting the hand image that matches a preset gesture in the video frame as the main hand image. Selecting and tracking the hand image that matches the preset gesture allows the required hand image to be tracked accurately in applications that rely on gesture control, which improves control efficiency. For example, a combination lock is unlocked by a preset sequence of gestures. If a poster showing a hand appears in the background while the lock scans for the unlocking gesture, the hand image that matches the preset gesture in the video frame is selected as the main hand image and tracked.

FIGS. 3 and 4 show flow diagrams of further embodiments of the hand tracking method 10. As shown in FIGS. 3 and 4, the method of these embodiments further includes a hand recognition step 160, in which, for one or more hand images in the current video frame, it is determined whether the hand image is a real hand, and the annotation of the hand image is deleted if it is not a real hand. Hand recognition uses a hand classification model, which may be a classification model based on a convolutional neural network, to decide whether a hand image detected by single-scale local detection or multi-scale global detection is a real hand. If it is a real hand, its annotation on the video frame is kept; if it is not, its annotation is deleted. Hand recognition helps prevent the tracker from falling into an endless loop of continuously tracking a falsely detected hand.

In one example, the hand recognition step is performed before the main hand image selection step. It checks whether each hand image detected by single-scale local detection or multi-scale global detection is a real hand, deletes the annotations of those that are not, and keeps only the real hand images. If no real hand remains, so that the number of hand images in the current video frame is zero, it is determined that the current video frame contains no hand image. Performing hand recognition first helps shorten the selection of the main hand image and improves its accuracy.

In another example, the hand recognition step is performed after the main hand image selection step to check whether the main hand image is a real hand image. When the main hand image is not a real hand, it is determined that the current video frame contains no hand image. Checking the main hand image avoids tracking a false hand and improves hand tracking efficiency and accuracy.
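The two orderings of hand recognition and main hand selection behave differently when the largest detection is a false hand. The sketch below illustrates this under stated assumptions: `is_real_hand` is a hypothetical callable standing in for the hand classification model, the largest-area rule is used for selection, and boxes are `(x, y, width, height)` tuples.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); assumed layout


def area(box: Box) -> int:
    return box[2] * box[3]


def recognize_then_select(
    boxes: List[Box], is_real_hand: Callable[[Box], bool]
) -> Optional[Box]:
    """Recognition first: drop false hands, then pick the largest real hand."""
    real = [b for b in boxes if is_real_hand(b)]
    return max(real, key=area) if real else None


def select_then_recognize(
    boxes: List[Box], is_real_hand: Callable[[Box], bool]
) -> Optional[Box]:
    """Selection first: pick the largest box, then keep it only if it is real."""
    if not boxes:
        return None
    main = max(boxes, key=area)
    return main if is_real_hand(main) else None
```

With a large poster hand and a smaller real hand in the same frame, the first ordering keeps the real hand, while the second reports no hand for that frame, so the next frame falls back to multi-scale global detection. The second ordering runs the classifier only once per frame.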

In one embodiment, the hand recognition step 160 further comprises: cropping the current video frame according to the position of the hand image to obtain a cropped image, scaling the cropped image to a fixed recognition size range, and then judging the scaled image. Hand recognition is performed by the hand classification model on cropped images that have been scaled to a uniform, fixed recognition size range, for example 56 × 56 to 64 × 64 pixels, which saves classification time and improves hand recognition efficiency.
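One way to read "scaling to a fixed size range" is to resize the crop so that its longer side lands inside the range while the aspect ratio is preserved. The helper below is a sketch under that assumption; the name `target_size` and the clamping rule are illustrative and are not taken from the source, which gives only the example range.

```python
from typing import Tuple


def target_size(width: int, height: int, lo: int = 56, hi: int = 64) -> Tuple[int, int]:
    """Output size whose longer side is clamped into [lo, hi], aspect ratio preserved."""
    longest = max(width, height)
    clamped = min(max(longest, lo), hi)  # pull the longer side into the range
    scale = clamped / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under this rule a 128 × 128 crop is reduced to 64 × 64, a 60 × 60 crop is left unchanged, and a 28 × 14 crop is enlarged to 56 × 28. The same helper could be called with `lo=56, hi=96` for the detection size range mentioned for single-scale local detection.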

In one embodiment, the single-scale local detection step 130 further comprises: based on the position of the hand image in the previous video frame, cropping an enlarged region from the current video frame to obtain a local image of the local range, and performing hand detection in the local image. The local image of the current video frame is cropped around the position annotated for the hand image in the previous video frame. The hand may move and change shape between frames, but the distance it can move in such a short time is limited. Therefore, before cropping, the region is first enlarged to a preset range that single-scale local detection can handle, and the position of the hand after it has moved is then detected inside that region. For example, with a preset crop range of 128 × 128 and a main hand image of 56 × 56 in the previous video frame, the region is enlarged to 128 × 128 in the current video frame and cropped. This avoids detecting over the whole image, saves detection time, improves tracking efficiency, and reduces the difficulty of real-time operation.
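The enlarged crop can be sketched as a fixed-size window centered on the previous hand box and shifted, where necessary, to stay inside the frame. The function name `expand_crop`, the centering rule, and the boundary handling are assumptions for illustration; the source specifies only the 56 × 56 to 128 × 128 example.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); assumed layout


def expand_crop(prev_box: Box, frame_w: int, frame_h: int, crop: int = 128) -> Box:
    """Return a crop x crop window centered on prev_box, kept inside the frame."""
    x, y, w, h = prev_box
    cx, cy = x + w // 2, y + h // 2          # center of the previous hand box
    left, top = cx - crop // 2, cy - crop // 2
    # Shift the window back inside the frame instead of shrinking it.
    left = max(0, min(left, frame_w - crop))
    top = max(0, min(top, frame_h - crop))
    # If the frame itself is smaller than the window, fall back to the frame size.
    return left, top, min(crop, frame_w), min(crop, frame_h)
```

For a 56 × 56 box at (100, 100) in a 640 × 480 frame, the window is (64, 64, 128, 128), which leaves a 36-pixel margin on every side for the hand to move into.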

In another embodiment, the single-scale local detection step 130 further comprises: scaling the local image to a fixed detection size range for hand detection. The cropped local images are uniformly scaled to the same fixed detection size range, for example 56 × 56 to 96 × 96 pixels, so that the pixel size of the hand in the cropped local image is relatively fixed. This further reduces the detection load and saves detection time.
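Since detection runs on a cropped and scaled local image while the annotation is placed on the current video frame, an implementation would presumably need to convert detected boxes back to frame coordinates. The source does not describe this conversion; the sketch below is an assumption, with a hypothetical function name `to_frame_coords`, a uniform `scale` factor (scaled size divided by crop size), and `(x, y, width, height)` boxes.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); assumed layout


def to_frame_coords(local_box: Box, crop_origin: Tuple[int, int], scale: float) -> Box:
    """Map a box detected in the scaled local image back onto the full frame."""
    lx, ly, lw, lh = local_box
    ox, oy = crop_origin  # top-left corner of the crop within the frame
    # Undo the scaling first, then undo the crop offset.
    return (
        ox + round(lx / scale),
        oy + round(ly / scale),
        round(lw / scale),
        round(lh / scale),
    )
```

For instance, if a 128 × 128 crop taken at (64, 64) is scaled by 0.5 to 64 × 64 and a hand is detected at (10, 12) with size 28 × 28, the corresponding frame annotation is a 56 × 56 box at (84, 88).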

FIG. 5 shows an exemplary schematic of the hand tracking device 20. As shown in FIG. 5, the hand tracking device of this embodiment includes: a video frame acquisition module 210, configured to acquire consecutive video frames including a previous video frame and a current video frame; a judging module 220, configured to determine whether a hand image was detected in the previous video frame according to whether an annotation of a hand image exists in the previous video frame; a single-scale local detection module 230, configured to, when it is determined that the previous video frame contains a hand image, perform hand detection on a local range of the current video frame and annotate, in the current video frame, the positions of one or more detected hand images; and a multi-scale global detection module 240, configured to, when it is determined that the previous video frame contains no hand image, perform hand detection on the full image range of the current video frame and annotate, in the current video frame, the positions of one or more detected hand images.

In one embodiment, the hand tracking device further comprises: a main hand image selection module, configured to select one hand image as the main hand image from the one or more hand images detected by the single-scale local detection module or the multi-scale global detection module, and to delete the annotations of the remaining hand images.

In one embodiment, the main hand image selection module is further configured to select the hand image with the largest area as the main hand image according to the areas of the plurality of hand images in the current video frame.

In one embodiment, the hand tracking device further comprises: a hand recognition module, configured to determine, for one or more hand images in the current video frame, whether the hand image is a real hand, and to delete the annotation of the hand image if it is not a real hand.

In an embodiment, the hand recognition module is further configured to crop the current video frame according to the position of the hand image to obtain a cropped image, scale the cropped image to a fixed recognition size range, and then judge the scaled image.

In an embodiment, the single-scale local detection module is further configured to crop an enlarged region from the current video frame according to the position of the hand image in the previous video frame, obtain a local image of the local range, and perform hand detection in the local image.

In one embodiment, the single-scale local detection module is further configured to scale the local image to a fixed detection size range for hand detection.

The functions implemented by the modules of the device correspond to the steps of the method described above. For their concrete implementation and technical effects, refer to the description of the method steps above; they are not repeated here.

As shown in FIG. 6, one embodiment of the present invention provides an electronic device 30. The electronic device 30 includes a memory 310, a processor 320, and an Input/Output (I/O) interface 330. The memory 310 is used for storing instructions. The processor 320 is used for calling the instructions stored in the memory 310 to execute the hand tracking method of the present invention. The processor 320 is connected to the memory 310 and the I/O interface 330, respectively, for example via a bus system and/or another connection mechanism (not shown). The memory 310 may be used to store programs and data, including a program for hand tracking according to an embodiment of the present invention, and the processor 320 performs the various functional applications and data processing of the electronic device 30 by executing the programs stored in the memory 310.

In an embodiment of the present invention, the processor 320 may be implemented in at least one hardware form among a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA), and the processor 320 may be one or a combination of several Central Processing Units (CPUs) or other forms of processing units with data processing capability and/or instruction execution capability.

Memory 310 in embodiments of the present invention may comprise one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile Memory may include, for example, a Random Access Memory (RAM), a cache Memory (cache), and/or the like. The nonvolatile Memory may include, for example, a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk Drive (HDD), a Solid-State Drive (SSD), or the like.

In the embodiment of the present invention, the I/O interface 330 may be used to receive input (e.g., numeric or character information) and to generate key signal inputs related to user settings and function control of the electronic device 30, and may also output various information (e.g., images or sounds) to the outside. The I/O interface 330 may comprise one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a mouse, a joystick, a trackball, a microphone, a speaker, a touch panel, and the like.

In some embodiments, the invention provides a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, perform any of the methods described above.

Although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.

The methods and apparatus of the present invention can be accomplished with standard programming techniques, using rule-based logic or other logic to accomplish the various method steps. It should also be noted that the words "means" and "module," as used herein and in the claims, are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving inputs.

Any of the steps, operations, or procedures described herein may be performed or implemented using one or more hardware or software modules, alone or in combination with other devices. In one embodiment, the software modules are implemented using a computer program product comprising a computer readable medium having computer program code embodied therewith, the computer program product being capable of being executed by a computer processor to perform any or all of the described steps, operations or procedures.

The foregoing description of the implementation of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. The embodiments were chosen and described in order to explain the principles of the invention and its practical application to enable one skilled in the art to utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated.

Claims

1. A hand tracking method, comprising:

a video frame acquisition step of acquiring consecutive video frames including a previous video frame and a current video frame;

a judging step of determining whether a hand image was detected in the previous video frame according to whether an annotation of a hand image exists in the previous video frame;

a single-scale local detection step of, when it is determined that the previous video frame contains the hand image, performing hand detection on a local range of the current video frame through single-scale local detection, and annotating, in the current video frame, the positions of one or more detected hand images, wherein the single-scale local detection is used to detect hand images whose scale lies within a fixed range; and

a multi-scale global detection step of, when it is determined that the previous video frame does not contain the hand image, performing hand detection on the full image range of the current video frame through multi-scale global detection, and annotating, in the current video frame, the positions of one or more detected hand images, wherein the multi-scale global detection is used to detect hand images in multiple scale ranges simultaneously.

2. The method of claim 1, wherein the method further comprises:

a main hand image selection step of selecting one hand image as the main hand image from the one or more hand images detected in the single-scale local detection step or the multi-scale global detection step, and deleting the annotations of the remaining hand images.

3. The method of claim 2, wherein the main hand image selection step further comprises: selecting the hand image with the largest area as the main hand image according to the areas of the plurality of hand images in the current video frame.

4. The method of claim 2, wherein the main hand image selection step further comprises: selecting the hand image that matches a preset gesture in the video frame as the main hand image.

5. The method of any of claims 1 to 4, wherein the method further comprises: a hand recognition step of determining, for one or more hand images in the current video frame, whether the hand image is a real hand, and deleting the annotation of the hand image if the hand image is not a real hand.

6. The method of claim 5, wherein the hand recognition step further comprises: cropping the current video frame according to the position of the hand image to obtain a cropped image, scaling the cropped image to a fixed recognition size range, and then judging the scaled image.

7. The method of claim 1, wherein the single-scale local detection step further comprises: based on the position of the hand image in the previous video frame, cropping an enlarged region from the current video frame to obtain a local image of the local range, and performing hand detection in the local image.

8. The method of claim 7, wherein the single-scale local detection step further comprises: scaling the local image to a fixed detection size range to perform the hand detection.

9. A hand tracking device, comprising:

a video frame acquisition module, configured to acquire consecutive video frames including a previous video frame and a current video frame;

a judging module, configured to determine whether a hand image was detected in the previous video frame according to whether an annotation of a hand image exists in the previous video frame;

a single-scale local detection module, configured to, when it is determined that the previous video frame contains the hand image, perform hand detection on a local range of the current video frame through single-scale local detection and annotate, in the current video frame, the positions of one or more detected hand images, wherein the single-scale local detection is used to detect hand images whose scale lies within a fixed range; and

a multi-scale global detection module, configured to, when it is determined that the previous video frame does not contain the hand image, perform hand detection on the full image range of the current video frame through multi-scale global detection and annotate, in the current video frame, the positions of one or more detected hand images, wherein the multi-scale global detection is used to detect hand images in multiple scale ranges simultaneously.

10. An electronic device, wherein the electronic device comprises:

a memory to store instructions; and

a processor for invoking the memory-stored instructions to perform the hand tracking method of any of claims 1-8.

11. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions that, when executed by a processor, perform the hand tracking method of any of claims 1-8.