WO2019200785A1 - Fast hand tracking method, device, terminal, and storage medium - Google Patents


Info

Publication number
WO2019200785A1
Authority
WO
WIPO (PCT)
Prior art keywords
calibration frame
calibration
frame
standard calibration
human hand
Prior art date
Application number
PCT/CN2018/100227
Other languages
French (fr)
Chinese (zh)
Inventor
阮晓雯
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019200785A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/60: Analysis of geometric attributes
    • G06T 7/66: Analysis of geometric attributes of image moments or centre of gravity
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20092: Interactive image processing based on input by user
    • G06T 2207/20104: Interactive definition of region of interest [ROI]
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30196: Human being; Person
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42: Higher-level, semantic clustering, classification or understanding of sport video content
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G06V 40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language

Definitions

  • the present application relates to the field of hand tracking technology, and in particular, to a fast hand tracking method, device, terminal, and storage medium.
  • as an important means of natural interaction, gestures have important research value and broad application prospects.
  • the first and most important step in gesture recognition and hand tracking is to segment the hand region from the image.
  • the quality of hand segmentation directly affects the subsequent gesture recognition and gesture tracking results.
  • in human-robot interaction, when the video capture device mounted on the robot is at some distance from the human body, the captured pictures contain the whole human body. Since such pictures have a large amount of background and the hand region is only a small part of the picture, how to detect the hand among a large background area and segment it quickly and accurately is a problem worth studying.
  • a first aspect of the present application provides a fast hand tracking method, the method comprising:
  • tracking the hand image with a continuously adaptive mean shift (CamShift) operator, wherein the tracking of the hand image with the CamShift operator specifically includes:
  • a second aspect of the present application provides a fast hand tracking device, the device comprising:
  • a display module configured to display, on the display interface, a video that is collected by the imaging device and includes a human hand region
  • a calibration module configured to receive a calibration frame that is calibrated by the user on the video that includes the human hand region
  • a segmentation module configured to extract a gradient direction histogram feature of the area calibrated by the calibration frame, and to segment that area according to the gradient direction histogram feature to obtain a hand image
  • a tracking module configured to track the hand image with a continuously adaptive mean shift (CamShift) operator.
  • a third aspect of the present application provides a terminal comprising a processor and a memory, the processor implementing the fast hand tracking method when executing computer readable instructions stored in a memory.
  • a fourth aspect of the present application provides a non-volatile readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the fast hand tracking method.
  • the fast hand tracking method, device, terminal, and storage medium described in the present application first perform a rough calibration of the hand region to obtain a calibration frame, and then extract the HOG feature within the area calibrated by that frame;
  • the hand region is accurately segmented from the area calibrated by the calibration frame according to the HOG feature, which reduces the area over which the HOG feature is extracted and effectively shortens the extraction time, so the hand region can be segmented and tracked quickly;
  • acquiring the depth information of the video containing the hand further ensures the clarity of the hand contour; the gain in tracking efficiency is especially notable for hand region tracking against complex backgrounds.
  • FIG. 1 is a flowchart of a fast hand tracking method according to Embodiment 1 of the present application.
  • FIG. 2 is a flowchart of a fast hand tracking method according to Embodiment 2 of the present application.
  • FIG. 3 is a structural diagram of a fast hand tracking device according to Embodiment 3 of the present application.
  • FIG. 4 is a structural diagram of a fast hand tracking device according to Embodiment 4 of the present application.
  • FIG. 5 is a schematic diagram of a terminal provided in Embodiment 5 of the present application.
  • the fast hand tracking method of the embodiment of the present application is applied to one or more terminals.
  • the fast hand tracking method can also be applied to a hardware environment composed of a terminal and a server connected to the terminal through a network.
  • the fast hand tracking method of the embodiment of the present application may be executed by a server or by a terminal; or may be performed by a server and a terminal together.
  • the fast hand tracking function provided by the method of the present application may be directly integrated on the terminal, or the client for implementing the method of the present application may be installed.
  • the method provided by the present application can also run on a server in the form of a software development kit (SDK), exposing the fast hand tracking function as an SDK interface; a terminal or another device can then perform hand tracking through the provided interface.
  • FIG. 1 is a flowchart of a fast hand tracking method according to Embodiment 1 of the present application.
  • the terminal provides a display interface, and the display interface is used to synchronously display a video that is collected by the imaging device and includes a human hand region.
  • the imaging device is a 2D camera.
  • the hand information of interest is calibrated by adding a calibration frame on the display interface.
  • the user can touch the display interface with a finger, a stylus, or any other suitable object, preferably a finger, to add a calibration frame to the display interface.
  • the specific process of extracting the Histogram of Oriented Gradients (HOG) feature of the area calibrated by the calibration frame includes:
  • a first-order differential template is used to calculate the horizontal and vertical gradients of each pixel of the area calibrated by the calibration frame; the gradient magnitude and gradient direction of that area are then calculated from the horizontal and vertical gradients.
  • the gradient information of each pixel of the area calibrated by the calibration frame is calculated here taking the one-dimensional centered [1, 0, -1] template as an example.
  • the area calibrated by the calibration frame is denoted I(x, y); the gradients of a pixel point in the horizontal and vertical directions are computed as in formula (1-1): G_h(x, y) = I(x+1, y) - I(x-1, y), G_v(x, y) = I(x, y+1) - I(x, y-1).
  • G_h(x, y) and G_v(x, y) represent the gradient values of the pixel point (x, y) in the horizontal and vertical directions, respectively; the gradient magnitude and gradient direction follow as in formula (1-2): M(x, y) = sqrt(G_h(x, y)^2 + G_v(x, y)^2), θ(x, y) = arctan(G_v(x, y) / G_h(x, y)).
  • M(x, y) and θ(x, y) represent the gradient magnitude and gradient direction of the pixel point (x, y), respectively.
  • an unsigned range is generally used, that is, the sign of the gradient direction angle is ignored; the unsigned gradient direction can be expressed by formula (1-3): θ(x, y) = θ(x, y) + 180° if θ(x, y) < 0.
  • the gradient direction of each pixel of the area calibrated by the calibration frame is thus limited to 0 to 180 degrees.
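  • As an illustration of the gradient step above, the following minimal sketch (not code from the application; the function name and NumPy-based implementation are assumptions) applies the [1, 0, -1] template and folds the direction into the unsigned 0 to 180 degree range; interior pixels get the centered difference, while border pixels are left at zero, one common convention:

```python
import numpy as np

def gradients_unsigned(patch):
    """Per-pixel gradient magnitude and unsigned direction (0-180 deg)
    of a grayscale patch I(x, y) using the 1-D centered [1, 0, -1] template."""
    img = patch.astype(np.float64)
    gh = np.zeros_like(img)  # horizontal gradient G_h(x, y)
    gv = np.zeros_like(img)  # vertical gradient G_v(x, y)
    gh[:, 1:-1] = img[:, 2:] - img[:, :-2]   # I(x+1, y) - I(x-1, y)
    gv[1:-1, :] = img[2:, :] - img[:-2, :]   # I(x, y+1) - I(x, y-1)
    mag = np.sqrt(gh ** 2 + gv ** 2)         # M(x, y), formula (1-2)
    theta = np.degrees(np.arctan2(gv, gh))   # signed direction in (-180, 180]
    theta = np.mod(theta, 180.0)             # fold to the unsigned range [0, 180)
    return mag, theta
```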
  • the size of the cell unit is 8*8 pixels, and adjacent cell units do not overlap.
  • the area calibrated by the calibration frame can thus be divided into 105 blocks, each block comprising 4 cell units and each cell unit comprising 64 pixel points.
  • Dividing the cell units in a non-overlapping manner in this embodiment makes it possible to calculate the gradient direction histogram in each block faster.
  • the gradient direction of each pixel of each cell unit is first divided into 9 bins (9 direction channels), which form the horizontal axis of the gradient histogram: [0°, 20°], [20°, 40°], [40°, 60°], [60°, 80°], [80°, 100°], [100°, 120°], [120°, 140°], [140°, 160°], [160°, 180°]; the gradient magnitudes of the pixels falling into each bin are then accumulated to form the vertical axis of the gradient histogram.
  • the gradient histogram of each block can be normalized using a normalization function such as the L2 norm or the L1 norm.
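  • Continuing the sketch above (again an illustrative assumption, not the application's code), the 8x8 non-overlapping cells, 9-bin cell histograms, 4-cell blocks, and per-block L2 normalization combine as follows; with blocks of 2x2 cells stepped one cell at a time over a 64x128 calibration area, this yields 7*15 = 105 blocks, matching the count above:

```python
def hog_descriptor(mag, theta, cell=8, bins=9):
    """Build a HOG descriptor: 8x8-pixel non-overlapping cells, 9 direction
    bins over [0, 180), blocks of 2x2 cells, L2-normalized per block."""
    h, w = mag.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((theta // (180 / bins)).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            for k in range(bins):  # accumulate magnitudes per direction channel
                hist[i, j, k] = m[b == k].sum()
    blocks = []
    for i in range(ch - 1):        # 2x2-cell blocks, one-cell stride
        for j in range(cw - 1):
            v = hist[i:i+2, j:j+2].ravel()
            blocks.append(v / (np.linalg.norm(v) + 1e-6))  # L2 normalization
    return np.concatenate(blocks)
```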
  • the Continuously Adaptive Mean Shift (CamShift) algorithm is a method based on color information: it tracks the target by its characteristic color, automatically adjusting the size and position of the search window to locate the size and center of the tracked target, and takes the result of the previous frame (i.e., the search window size and centroid) as the initial size and centroid of the search window for the target in the next frame.
  • the tracking of the hand image with the continuously adaptive mean shift (CamShift) operator specifically includes:
  • the zeroth moment of the current search window is calculated according to equation (1-4), M00 = Σ_i Σ_j I(i, j), and the first moments of the current search window are calculated according to equation (1-5), M10 = Σ_i Σ_j i·I(i, j) and M01 = Σ_i Σ_j j·I(i, j).
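  • For concreteness, here is a hedged sketch of this loop built on OpenCV's stock CamShift implementation (the video path and the initial window seed are placeholders; the application's own operator may differ in details such as the termination criteria):

```python
import cv2
import numpy as np

def track_hand(video_path, x, y, w, h):
    """Track a hand with CamShift: build a hue histogram of the segmented
    hand ROI, back-project it, and let CamShift move/resize the window."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    roi = frame[y:y+h, x:x+w]
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)  # HSV space, hue component
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    window = (x, y, w, h)
    # stop after 10 iterations or when the centroid moves less than 1 px
    crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        # the previous frame's window size/centroid seeds this frame's search
        box, window = cv2.CamShift(back, window, crit)
        cv2.polylines(frame, [np.intp(cv2.boxPoints(box))], True, (0, 255, 0), 2)
        cv2.imshow('tracking', frame)
        if cv2.waitKey(30) == 27:  # Esc to quit
            break
    cap.release()
```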
  • in the fast hand tracking method described in the present application, the user first calibrates the hand information of interest in the video containing the human hand region, and the HOG feature of the area calibrated by the calibration frame is then extracted.
  • the hand region is segmented from the area calibrated by the calibration frame according to the HOG feature, so only the HOG feature within that area needs to be computed.
  • the present application reduces the area over which the HOG feature is extracted by receiving the user-calibrated calibration frame, thereby effectively shortening the time for extracting the HOG feature, so the hand region can be quickly separated from the video containing the human hand region.
  • first, the calculated HOG feature preserves the geometric and optical characteristics of the hand region; secondly, processing by cell units characterizes the relationships between pixel points in the hand area well; finally, the normalization step partially offsets the influence of illumination changes, ensuring the clarity of the extracted hand region and an accurate segmentation of the hand area.
  • FIG. 2 is a flowchart of a fast hand tracking method according to Embodiment 2 of the present application.
  • 201: Display a video containing the human hand region collected by the imaging device on the display interface, and display a preset standard calibration frame in a preset display manner.
  • the terminal provides a display interface
  • the display interface is used to synchronously display a video that is collected by the imaging device and includes a human hand region, and the display interface also displays a standard calibration frame.
  • the imaging device is a 3D depth camera, and the 3D depth camera is different from the 2D camera in that the 3D depth camera can simultaneously capture grayscale image information of the scene and 3-dimensional information including depth information.
  • the video containing the human hand region is acquired by the 3D depth camera, the video including the human hand region is synchronously displayed on the display interface of the terminal.
  • the preset standard calibration frame is provided for the user to perform calibration on the displayed video containing the human hand region to obtain the hand information of interest.
  • the preset display manner includes one or a combination of the following:
  • the display instruction corresponds to a display operation input by the user; the display operation includes, but is not limited to, clicking an arbitrary position of the display interface, touching an arbitrary position of the display interface for more than a first preset time period (for example, 1 second), or issuing a first preset voice command (for example, "calibration box").
  • the terminal determines that the display instruction is received, and displays the preset standard calibration frame.
  • the hidden instruction corresponds to a hidden operation input by the user; the hidden operation includes, but is not limited to, clicking an arbitrary position of the display interface, touching an arbitrary position of the display interface for more than a second preset time period (for example, 2 seconds), or issuing a second preset voice command (for example, "exit").
  • the terminal determines that a hidden command is received, and the preset standard calibration frame is hidden.
  • the hidden instruction may be the same as or different from the display instruction.
  • the first preset time period may be the same as or different from the second preset time period.
  • preferably, the first preset time period is shorter than the second preset time period: setting a shorter first preset time period lets the preset standard calibration frame be displayed quickly, while setting a longer second preset time period avoids accidentally hiding the displayed standard calibration frame through an unconscious touch or an operation error.
  • displaying the preset standard calibration frame upon receiving the display instruction enables the hand region of interest to be calibrated on the display interface while the video containing the human hand region is displayed; conversely, not displaying the frame until the display instruction is received, and hiding it upon the hidden instruction, prevents the displayed video containing the human hand region from being occluded by the preset standard calibration frame for a long time, which could cause important information to be missed or give the user visual discomfort when viewing the video.
  • after the preset standard calibration frame is displayed, if the user inputs no further operation for more than a third preset time period, the preset standard calibration frame is automatically hidden; this prevents the frame from remaining displayed for a long time after the user triggers the display instruction unconsciously.
  • the preset standard calibration frame is automatically hidden, which also helps to enhance the user's interactive experience.
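  • The show/hide timing described above can be summarized in a small sketch (all threshold values, class and method names are assumptions for illustration; the application only requires that the first period be shorter than the second):

```python
import time

SHOW_TOUCH_S = 1.0   # first preset time period (assumed 1 second)
HIDE_TOUCH_S = 2.0   # second preset time period (assumed 2 seconds)
AUTO_HIDE_S = 5.0    # third preset time period (assumed value)

class CalibrationFrameUI:
    """Show/hide a standard calibration frame from touch-duration commands."""
    def __init__(self):
        self.visible = False
        self.last_input = time.monotonic()

    def on_touch(self, duration_s):
        self.last_input = time.monotonic()
        if not self.visible and duration_s >= SHOW_TOUCH_S:
            self.visible = True       # display instruction received
        elif self.visible and duration_s >= HIDE_TOUCH_S:
            self.visible = False      # hidden instruction received

    def tick(self):
        # auto-hide after the third preset period with no user input
        if self.visible and time.monotonic() - self.last_input > AUTO_HIDE_S:
            self.visible = False
```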
  • the preset standard calibration frame may be a circle, an ellipse, a rectangle, a square, or the like.
  • the hand information of interest is calibrated by adding a standard calibration frame on the display interface.
  • receiving the standard calibration frame calibrated by the user on the video containing the human hand region covers the following two situations:
  • in the first case, a rough calibration frame drawn by the user in the video containing the human hand region is received; a preset standard calibration frame corresponding to the rough calibration frame is matched by a fuzzy matching method; and the matched standard calibration frame is calibrated onto the video containing the human hand region and displayed, wherein the geometric center of the rough calibration frame is the same as the geometric center of the matched standard calibration frame.
  • the shape of a calibration frame drawn by the user on the display interface with a finger is generally not standard; for example, a hand-drawn circular calibration frame is not very accurate. The terminal therefore receives the user's rough drawing and matches the shape of the corresponding standard calibration frame according to the approximate shape of the rough calibration frame.
  • matching the corresponding standard calibration frame by the fuzzy matching method facilitates the subsequent segmentation of the area calibrated by the calibration frame.
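  • The application does not fix a particular fuzzy matching algorithm; one plausible sketch (function and variable names are hypothetical) compares the drawn stroke with each preset standard shape by Hu-moment contour similarity and reuses the drawn frame's geometric center, as required above:

```python
import cv2
import numpy as np

def match_standard_frame(stroke_pts, standard_contours):
    """Match a hand-drawn rough calibration stroke to the closest preset
    standard frame shape via Hu-moment similarity (one plausible fuzzy
    matching criterion; the application does not specify the method)."""
    drawn = np.asarray(stroke_pts, dtype=np.int32).reshape(-1, 1, 2)
    scores = [cv2.matchShapes(drawn, c, cv2.CONTOURS_MATCH_I1, 0.0)
              for c in standard_contours]
    best = int(np.argmin(scores))
    # keep the drawn frame's geometric center for the matched standard frame
    m = cv2.moments(drawn)
    if m['m00'] != 0:
        center = (m['m10'] / m['m00'], m['m01'] / m['m00'])
    else:
        center = tuple(drawn.reshape(-1, 2).mean(axis=0))  # degenerate stroke
    return best, center
```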
  • in the second case, the standard calibration frame selected by the user is received directly, and calibration is performed on the video containing the human hand region according to that standard calibration frame, which is then displayed.
  • the user inputs a display operation to trigger the display instruction, whereupon a plurality of preset standard calibration frames are displayed; when the user touches a standard calibration frame and the terminal detects the touch signal on it, the terminal determines that this standard calibration frame is selected.
  • the user moves the selected standard calibration frame and drags it onto the video containing the human hand area, and the terminal displays the dragged standard calibration frame on the video containing the human hand area.
  • the step 202 may further include: zooming in on, zooming out of, moving, or deleting the displayed standard calibration frame when a corresponding zoom-in, zoom-out, move, or delete instruction is received.
  • the pre-processing may include a combination of one or more of the following: grayscale processing, correction processing.
  • the grayscale processing refers to converting the image of the area calibrated by the standard calibration frame into a grayscale image; because color information has little effect on extracting the gradient direction histogram feature, this conversion does not affect the subsequently calculated gradient information of each pixel of that area, while reducing the amount of computation for that gradient information.
  • the correction process may use gamma correction; since local surface exposure contributes heavily to the texture intensity of the image, gamma-corrected images can effectively reduce local shadow and illumination changes.
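  • A minimal sketch of this preprocessing step (the gamma value is an assumed example; the application does not specify one):

```python
import cv2
import numpy as np

def preprocess_roi(roi_bgr, gamma=0.5):
    """Grayscale then gamma-correct the calibrated area before HOG
    extraction; gamma=0.5 (square-root compression) is an assumed value."""
    gray = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)
    norm = gray.astype(np.float64) / 255.0
    corrected = np.power(norm, gamma)        # gamma correction
    return (corrected * 255.0).astype(np.uint8)
```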
  • the step 204 described in this embodiment is the same as the step 103 described in the first embodiment, and details are not described herein again.
  • the step 205 described in this embodiment is the same as the step 104 described in the first embodiment, and details are not described herein again.
  • the method further includes: acquiring the depth information, in the video containing the human hand region, that corresponds to the area calibrated by the calibration frame, and normalizing the hand image according to the depth information.
  • the depth information is obtained from the 3D depth camera.
  • the specific process of normalizing the hand image according to the depth information is as follows: the size of the hand image obtained by segmenting the area of the first standard calibration frame is recorded as the standard size S1, and the depth information corresponding to the area calibrated by the first calibration frame is recorded as the standard depth H1; the size of the hand image obtained by segmenting the area of the current standard calibration frame is recorded as S2, and the depth information corresponding to the area calibrated by the current calibration frame is recorded as H2.
  • the hand image obtained by the region division of the current calibration frame is normalized to S2*(H2/H1).
  • the size of the hand image is normalized so that the finally extracted HOG feature representations share a uniform measurement criterion, that is, have the same dimension, which improves the accuracy of hand tracking.
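  • As a sketch of this normalization rule (names are illustrative; S2*(H2/H1) is read here as rescaling the current crop by the depth ratio H2/H1, so hands farther from the camera are scaled up to the standard distance):

```python
import cv2

def normalize_hand_size(hand_img, depth_h2, std_depth_h1):
    """Rescale the current hand crop so its size becomes S2 * (H2 / H1),
    giving the HOG features a uniform dimension across depths."""
    scale = depth_h2 / std_depth_h1
    h, w = hand_img.shape[:2]
    new_size = (max(1, round(w * scale)), max(1, round(h * scale)))
    return cv2.resize(hand_img, new_size, interpolation=cv2.INTER_LINEAR)
```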
  • the fast hand tracking method described in the present application provides two ways of calibrating the video containing the human hand region with a standard calibration frame, so that the frame calibrated by the user is a standard calibration frame; the hand region obtained by segmentation then has a standard shape, and hand tracking based on the segmented standard region performs better.
  • the fast hand tracking method described in the present application can be applied to the tracking of a single hand as well as to the tracking of multiple hands.
  • for multiple hands, parallel tracking is used, which is in essence a set of single-hand tracking processes and is not described in detail here. Any method that applies the idea of this application to hand tracking shall fall within the scope of this application.
  • FIG. 3 is a functional block diagram of a preferred embodiment of the fast hand tracking device of the present application.
  • the fast hand tracking device 30 operates in a terminal.
  • the fast hand tracking device 30 can include a plurality of functional modules comprised of program code segments.
  • the program code of each program segment in the fast hand tracking device 30 can be stored in a memory and executed by at least one processor to perform tracking of the hand region (see FIG. 1 and its associated description).
  • the fast hand tracking device 30 of the terminal can be divided into multiple functional modules according to the functions performed by the terminal.
  • the function module may include: a display module 301, a calibration module 302, a segmentation module 303, and a tracking module 304.
  • the display module 301 is configured to display, on the display interface, a video that is collected by the imaging device and includes a human hand region.
  • the terminal provides a display interface, and the display interface is used to synchronously display a video that is collected by the imaging device and includes a human hand region.
  • the imaging device is a 2D camera.
  • the calibration module 302 is configured to receive a calibration frame that is calibrated by the user on the video including the human hand region.
  • the hand information of interest is calibrated by adding a calibration frame on the display interface.
  • the user can touch the display interface with a finger, a stylus, or any other suitable object, preferably a finger, to add a calibration frame to the display interface.
  • the segmentation module 303 is configured to extract a gradient direction histogram feature of the calibration frame calibration region, and divide the calibration frame calibration region according to the gradient direction histogram feature to obtain a hand image.
  • the segmentation module 303 extracts the Histogram of Oriented Gradients (HOG) feature of the area calibrated by the calibration frame, which specifically includes:
  • a first-order differential template is used to calculate the horizontal and vertical gradients of each pixel of the area calibrated by the calibration frame; the gradient magnitude and gradient direction of that area are then calculated from the horizontal and vertical gradients.
  • the gradient information of each pixel of the area calibrated by the calibration frame is calculated here taking the one-dimensional centered [1, 0, -1] template as an example.
  • the area calibrated by the calibration frame is denoted I(x, y); the gradients of a pixel point in the horizontal and vertical directions are computed as in formula (1-1): G_h(x, y) = I(x+1, y) - I(x-1, y), G_v(x, y) = I(x, y+1) - I(x, y-1).
  • G_h(x, y) and G_v(x, y) represent the gradient values of the pixel point (x, y) in the horizontal and vertical directions, respectively; the gradient magnitude and gradient direction follow as in formula (1-2): M(x, y) = sqrt(G_h(x, y)^2 + G_v(x, y)^2), θ(x, y) = arctan(G_v(x, y) / G_h(x, y)).
  • M(x, y) and θ(x, y) represent the gradient magnitude and gradient direction of the pixel point (x, y), respectively.
  • an unsigned range is generally used, that is, the sign of the gradient direction angle is ignored; the unsigned gradient direction can be expressed by formula (1-3): θ(x, y) = θ(x, y) + 180° if θ(x, y) < 0.
  • the gradient direction of each pixel of the area calibrated by the calibration frame is thus limited to 0 to 180 degrees.
  • the size of the cell unit is 8*8 pixels, and adjacent cell units do not overlap.
  • the area calibrated by the calibration frame can thus be divided into 105 blocks, each block comprising 4 cell units and each cell unit comprising 64 pixel points.
  • Dividing the cell units in a non-overlapping manner in this embodiment makes it possible to calculate the gradient direction histogram in each block faster.
  • the gradient direction of each pixel of each cell unit is first divided into 9 bins (9 direction channels), which form the horizontal axis of the gradient histogram: [0°, 20°], [20°, 40°], [40°, 60°], [60°, 80°], [80°, 100°], [100°, 120°], [120°, 140°], [140°, 160°], [160°, 180°]; the gradient magnitudes of the pixels falling into each bin are then accumulated to form the vertical axis of the gradient histogram.
  • the gradient histogram of each block can be normalized using a normalization function such as the L2 norm or the L1 norm.
  • the tracking module 304 is configured to track the hand image with a continuously adaptive mean shift (CamShift) operator.
  • the Continuously Adaptive Mean Shift (CamShift) algorithm is a method based on color information: it tracks the target by its characteristic color, automatically adjusting the size and position of the search window to locate the size and center of the tracked target, and takes the result of the previous frame (i.e., the search window size and centroid) as the initial size and centroid of the search window for the target in the next frame.
  • the tracking of the hand image with the continuously adaptive mean shift (CamShift) operator specifically includes:
  • the zeroth moment of the current search window is calculated according to equation (1-4), M00 = Σ_i Σ_j I(i, j), and the first moments of the current search window are calculated according to equation (1-5), M10 = Σ_i Σ_j i·I(i, j) and M01 = Σ_i Σ_j j·I(i, j).
  • with the fast hand tracking device 30 described in the present application, the user calibrates the hand information of interest in the video containing the human hand region, and the HOG feature of the area calibrated by the calibration frame is then extracted.
  • the hand region is segmented from the area calibrated by the calibration frame according to the HOG feature, so only the HOG feature within that area needs to be computed.
  • the present application reduces the area over which the HOG feature is extracted by receiving the user-calibrated calibration frame, thereby effectively shortening the time for extracting the HOG feature, so the hand region can be quickly separated from the video containing the human hand region.
  • first, the calculated HOG feature preserves the geometric and optical characteristics of the hand region; secondly, processing by cell units characterizes the relationships between pixel points in the hand area well; finally, the normalization step partially offsets the influence of illumination changes, ensuring the clarity of the extracted hand region and an accurate segmentation of the hand area.
  • FIG. 4 is a functional block diagram of a preferred embodiment of the fast hand tracking device of the present application.
  • the fast hand tracking device 40 operates in a terminal.
  • the fast hand tracking device 40 can include a plurality of functional modules comprised of program code segments.
  • the program code of each program segment in the fast hand tracking device 40 can be stored in a memory and executed by at least one processor to perform tracking of the hand region (see FIG. 2 and its associated description).
  • the fast hand tracking device of the terminal may be divided into multiple functional modules according to the functions performed by the terminal.
  • the function module may include: a display module 401, a calibration module 402, a pre-processing module 403, a segmentation module 404, a tracking module 405, and a normalization module 406.
  • the display module 401 includes: a first display sub-module 4010 and a second display sub-module 4012.
  • the first display sub-module 4010 is configured to display, on the display interface, a video that is collected by the imaging device and includes a human hand region
  • the second display sub-module 4012 is configured to display a preset standard calibration frame in a preset display manner.
  • the terminal provides a display interface
  • the display interface is used to synchronously display a video that is collected by the imaging device and includes a human hand region, and the display interface also displays a standard calibration frame.
  • the imaging device is a 3D depth camera, and the 3D depth camera is different from the 2D camera in that the 3D depth camera can simultaneously capture grayscale image information of the scene and 3-dimensional information including depth information.
  • the video containing the human hand region is acquired by the 3D depth camera, the video including the human hand region is synchronously displayed on the display interface of the terminal.
  • the preset standard calibration frame is provided for the user to perform calibration on the displayed video containing the human hand region to obtain the hand information of interest.
  • the preset display manner includes one or a combination of the following:
  • the display instruction corresponds to a display operation input by the user; the display operation includes, but is not limited to, clicking an arbitrary position of the display interface, touching an arbitrary position of the display interface for more than a first preset time period (for example, 1 second), or issuing a first preset voice command (for example, "calibration box").
  • the terminal determines that the display instruction is received, and displays the preset standard calibration frame.
  • the hidden instruction corresponds to a hidden operation input by the user; the hidden operation includes, but is not limited to, clicking an arbitrary position of the display interface, touching an arbitrary position of the display interface for more than a second preset time period (for example, 2 seconds), or issuing a second preset voice command (for example, "exit").
  • the terminal determines that a hidden command is received, and the preset standard calibration frame is hidden.
  • the hidden instruction may be the same as or different from the display instruction.
  • the first preset time period may be the same as or different from the second preset time period.
  • preferably, the first preset time period is shorter than the second preset time period: setting a shorter first preset time period lets the preset standard calibration frame be displayed quickly, while setting a longer second preset time period avoids accidentally hiding the displayed standard calibration frame through an unconscious touch or an operation error.
  • displaying the preset standard calibration frame upon receiving the display instruction enables the hand region of interest to be calibrated on the display interface while the video containing the human hand region is displayed; conversely, not displaying the frame until the display instruction is received, and hiding it upon the hidden instruction, prevents the displayed video containing the human hand region from being occluded by the preset standard calibration frame for a long time, which could cause important information to be missed or give the user visual discomfort when viewing the video.
  • after the preset standard calibration frame is displayed, if the user inputs no further operation for more than a third preset time period, the preset standard calibration frame is automatically hidden; this prevents the frame from remaining displayed for a long time after the user triggers the display instruction unconsciously.
  • the preset standard calibration frame is automatically hidden, which also helps to enhance the user's interactive experience.
  • the preset standard calibration frame may be a circle, an ellipse, a rectangle, a square, or the like.
  • the calibration module 402 is configured to receive a standard calibration frame that is calibrated by the user on the video including the human hand region.
  • the hand information of interest is calibrated by adding a standard calibration frame on the display interface.
  • the calibration module 402 further includes a first calibration sub-module 4020, a second calibration sub-module 4022, and a third calibration sub-module 4024.
  • the first calibration sub-module 4020 is configured to: receive a rough calibration frame drawn by the user in the video containing the human hand region; match a preset standard calibration frame corresponding to the rough calibration frame by a fuzzy matching method; and calibrate and display the matched standard calibration frame on the video containing the human hand region, wherein the geometric center of the rough calibration frame is the same as the geometric center of the matched standard calibration frame.
  • the shape of a calibration frame drawn by the user on the display interface with a finger is generally not standard; for example, a hand-drawn circular calibration frame is not very accurate. The terminal therefore receives the user's rough drawing and matches the shape of the corresponding standard calibration frame according to the approximate shape of the rough calibration frame.
  • matching the corresponding standard calibration frame by the fuzzy matching method facilitates the subsequent segmentation of the area calibrated by the calibration frame.
  • the second calibration sub-module 4022 is configured to directly receive a standard calibration frame selected by the user, perform calibration on the video containing the human hand region according to that standard calibration frame, and display the calibrated standard calibration frame.
  • the user inputs a display operation to trigger the display instruction, whereupon a plurality of preset standard calibration frames are displayed; when the user touches a standard calibration frame and the terminal detects the touch signal on it, the terminal determines that this standard calibration frame is selected.
  • the user moves the selected standard calibration frame and drags it onto the video containing the human hand area, and the terminal displays the dragged standard calibration frame on the video containing the human hand area.
  • the third calibration sub-module 4024 is configured to zoom in on, zoom out of, move, or delete the displayed standard calibration frame when a corresponding zoom-in, zoom-out, move, or delete instruction is received.
  • the pre-processing module 403 is configured to pre-process the area of the standard calibration frame.
  • the pre-processing may include a combination of one or more of the following: grayscale processing, correction processing.
  • the grayscale processing refers to converting the image of the area calibrated by the standard calibration frame into a grayscale image; because color information has little effect on extracting the gradient direction histogram feature, this conversion does not affect the subsequently calculated gradient information of each pixel of that area, while reducing the amount of computation for that gradient information.
  • the correction process may use gamma correction; since local surface exposure contributes heavily to the texture intensity of the image, gamma-corrected images can effectively reduce local shadow and illumination changes.
  • a segmentation module 404 is configured to extract a gradient direction histogram feature of the pre-processed area calibrated by the standard calibration frame, and to segment that area according to the gradient direction histogram feature to obtain a hand image.
  • the tracking module 405 is configured to track the hand image with a continuously adaptive mean shift (CamShift) operator.
  • the fast hand tracking device 40 further includes a normalization module 406, configured to acquire the depth information, in the video containing the human hand region, that corresponds to the area calibrated by the calibration frame, and to normalize the hand image according to the depth information.
  • the depth information is obtained from the 3D depth camera.
  • the specific process of normalizing the hand image according to the depth information is as follows: the size of the hand image obtained by segmenting the area of the first standard calibration frame is recorded as the standard size S1, and the depth information corresponding to the area calibrated by the first calibration frame is recorded as the standard depth H1; the size of the hand image obtained by segmenting the area of the current standard calibration frame is recorded as S2, and the depth information corresponding to the area calibrated by the current calibration frame is recorded as H2.
  • the hand image obtained by the region division of the current calibration frame is normalized to S2*(H2/H1).
  • the size of the hand image is normalized so that the finally extracted HOG feature representations share a uniform measurement criterion, that is, have the same dimension, which improves the accuracy of hand tracking.
  • the fast hand tracking device 40 described in the present application provides two ways of calibrating the video containing the human hand region with a standard calibration frame, so that the frame calibrated by the user is a standard calibration frame; the hand region obtained by segmentation then has a standard shape, and hand tracking based on the segmented standard region performs better.
  • the fast hand tracking devices 30 and 40 described in the present application can be applied to the tracking of a single hand as well as to the tracking of multiple hands.
  • for multiple hands, parallel tracking is used, which is in essence a set of single-hand tracking processes and is not described in detail here. Any device that applies the idea of this application to hand tracking shall fall within the scope of this application.
  • FIG. 5 is a schematic diagram of a terminal according to Embodiment 5 of the present application.
  • the terminal 5 includes a memory 51, at least one processor 52, computer readable instructions 53 stored in the memory 51 and operable on the at least one processor 52, at least one communication bus 54, and an imaging device 55.
  • the at least one processor 52 implements the steps in the fast hand tracking method embodiment when the computer readable instructions 53 are executed, such as steps 101 to 104 shown in FIG. 1 or steps 201 to 205 shown in FIG. 2.
  • the at least one processor 52 implements the functions of the modules/units in the above-described apparatus embodiments when the computer readable instructions 53 are executed, such as the modules 301 to 304 in FIG. 3 or the modules 401 to 406 in FIG.
  • the computer readable instructions 53 may be partitioned into one or more modules/units, the one or more modules/units being stored in the memory 51 and executed by the at least one processor 52 to complete the present application.
  • the one or more modules/units may be a series of computer readable instruction segments capable of performing particular functions, the instruction segments being used to describe the execution of the computer readable instructions 53 in the terminal 5.
  • the computer readable instructions 53 may be divided into the display module 301, the calibration module 302, the segmentation module 303, and the tracking module 304 in FIG. 3, or into the display module 401, the calibration module 402, the pre-processing module 403, the segmentation module 404, the tracking module 405, and the normalization module 406 in FIG. 4.
  • the display module 401 includes a first display sub-module 4010 and a second display sub-module 4012.
  • the calibration module 402 includes a first calibration sub-module 4020, a second calibration sub-module 4022, and a third calibration sub-module 4024. For their specific functions, see Embodiments 1 and 2 and their corresponding descriptions.
  • the imaging device 55 includes a 2D camera, a 3D depth camera, etc., and the imaging device 55 may be mounted on the terminal 5 or may be separated from the terminal 5 as an independent component.
  • the terminal 5 can be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. It can be understood by those skilled in the art that FIG. 5 is only an example of the terminal 5 and does not constitute a limitation of the terminal 5; the terminal may include more or fewer components than illustrated, combine some components, or have different components.
  • the terminal 5 may further include an input/output device, a network access device, a bus, and the like.
  • the at least one processor 52 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and the like.
  • the processor 52 may be a microprocessor or the processor 52 may be any conventional processor or the like.
  • the processor 52 is the control center of the terminal 5, and connects the various parts of the entire terminal 5 with various interfaces and lines.
  • the memory 51 can be used to store the computer readable instructions 53 and/or modules/units; the processor implements various functions of the terminal 5 by running or executing the computer readable instructions and/or modules/units stored in the memory 51 and calling the data stored in the memory 51.
  • the memory 51 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.); and the storage data area may be Data (such as audio data, phone book, etc.) created according to the use of the terminal 5 is stored.
  • the memory 51 may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash device, or another non-volatile solid state storage device.
  • the modules/units integrated in the terminal 5 can be stored in a non-volatile readable storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on this understanding, the present application implements all or part of the processes in the foregoing method embodiments, which may also be completed through computer readable instructions; the computer readable instructions may be stored in a non-volatile readable storage medium and, when executed by a processor, implement the steps of the various method embodiments described above. The computer readable instructions comprise computer readable instruction code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like.
  • the non-volatile readable medium may include: any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium.
  • the contents of the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, non-volatile readable media do not include electrical carrier signals and telecommunication signals.
  • the functional units in the various embodiments of the present application may be integrated in the same processing unit, or each unit may exist physically separately, or two or more units may be integrated in the same unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software function modules.
  • the term "comprising” does not exclude other elements or the singular does not exclude the plural.
  • a plurality of units or devices recited in the system claims can also be implemented by a unit or device by software or hardware.
  • the first, second, etc. words are used to denote names and do not denote any particular order.

Abstract

A fast hand tracking method comprises: displaying, on a display interface, a video containing a hand region acquired by an imaging apparatus; receiving a bounding box marked by a user on the video containing the hand region; extracting a histogram-of-oriented-gradient (HOG) feature of a region marked by the bounding box, and segmenting, according to the HOG feature, the region marked by the bounding box, so as to obtain a hand image; and tracking the hand image by means of a continuously adaptive mean shift operator. The present application further provides a fast hand tracking device, a terminal, and a storage medium. The present application enables fast extraction of a HOG feature in a bounding box marked by a user, and accurately performs segmentation to obtain a hand region according to the HOG feature, thereby achieving a better tracking result.

Description

Fast hand tracking method, device, terminal and storage medium

This application claims priority to Chinese patent application No. 201810349972.X, entitled "Fast hand tracking method, device, terminal and storage medium" and filed with the Chinese Patent Office on April 18, 2018, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of hand tracking technology, and in particular to a fast hand tracking method, device, terminal, and storage medium.

Background

As an important means of natural interaction, gestures have important research value and broad application prospects. The first and most important step in gesture recognition and hand tracking is to segment the hand region from the image. The quality of hand segmentation directly affects the subsequent gesture recognition and gesture tracking results.

In human-robot interaction, when the video capture device mounted on the robot is at some distance from the human body, the captured pictures contain the whole human body. Since such pictures have a large amount of background and the hand region is only a small part of the picture, how to detect the hand among a large background area and segment it quickly and accurately is a problem worth studying.
Summary

In view of the above, it is necessary to provide a fast hand tracking method, device, terminal, and storage medium that can shorten the time for extracting the hand region and improve the accuracy and efficiency of hand recognition and hand tracking, with particularly good tracking efficiency for hand region tracking against complex backgrounds.

A first aspect of the present application provides a fast hand tracking method, the method comprising:

displaying, on a display interface, a video containing a human hand region collected by an imaging device;

receiving a calibration frame calibrated by a user on the video containing the human hand region;

extracting a gradient direction histogram feature of the area calibrated by the calibration frame, and segmenting that area according to the gradient direction histogram feature to obtain a hand image; and

tracking the hand image with a continuously adaptive mean shift (CamShift) operator, wherein the tracking of the hand image with the CamShift operator specifically includes:
converting the color space of the hand image to the HSV color space and separating out the hand image of the hue component; based on the hand image I(i, j) of the hue component and the centroid position and size of the initialized search window, calculating the centroid position (M10/M00, M01/M00) of the current search window and the size s = 2*sqrt(M00/256) of the current search window,

where M10 = Σ_i Σ_j i·I(i, j) and M01 = Σ_i Σ_j j·I(i, j) are the first moments of the current search window, M00 = Σ_i Σ_j I(i, j) is the zeroth moment of the current search window, and i and j are the horizontal and vertical pixel coordinates in I(i, j).
A second aspect of the present application provides a fast hand tracking device, the device comprising:

a display module, configured to display, on a display interface, a video containing a human hand region collected by an imaging device; a calibration module, configured to receive a calibration frame calibrated by a user on the video containing the human hand region; a segmentation module, configured to extract a gradient direction histogram feature of the area calibrated by the calibration frame and segment that area according to the gradient direction histogram feature to obtain a hand image; and a tracking module, configured to track the hand image with a continuously adaptive mean shift (CamShift) operator.

A third aspect of the present application provides a terminal, the terminal comprising a processor and a memory, the processor implementing the fast hand tracking method when executing computer readable instructions stored in the memory.

A fourth aspect of the present application provides a non-volatile readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the fast hand tracking method.

According to the fast hand tracking method, device, terminal, and storage medium described in the present application, the hand region is first roughly calibrated to obtain a calibration frame, the HOG feature is then extracted within the area calibrated by that frame, and the hand region is accurately segmented from that area according to the HOG feature. This reduces the area over which the HOG feature is extracted and effectively shortens the extraction time, so the hand region can be segmented and tracked quickly. In addition, acquiring the depth information of the video containing the hand further ensures the clarity of the hand contour; the gain in tracking efficiency is especially notable for hand region tracking against complex backgrounds.
DRAWINGS
FIG. 1 is a flowchart of a fast hand tracking method according to Embodiment 1 of the present application.
FIG. 2 is a flowchart of a fast hand tracking method according to Embodiment 2 of the present application.
FIG. 3 is a structural diagram of a fast hand tracking device according to Embodiment 3 of the present application.
FIG. 4 is a structural diagram of a fast hand tracking device according to Embodiment 4 of the present application.
FIG. 5 is a schematic diagram of a terminal according to Embodiment 5 of the present application.
DETAILED DESCRIPTION
The fast hand tracking method of the embodiments of the present application is applied in one or more terminals. The fast hand tracking method may also be applied in a hardware environment composed of a terminal and a server connected to the terminal through a network. The fast hand tracking method of the embodiments of the present application may be executed by a server, by a terminal, or jointly by a server and a terminal.
For a terminal that needs the fast hand tracking method, the fast hand tracking function provided by the method of the present application may be integrated directly on the terminal, or a client implementing the method of the present application may be installed on it. Alternatively, the method provided by the present application may run on a server or similar device in the form of a software development kit (SDK); an interface to the fast hand tracking function is provided in the form of the SDK, and a terminal or other device can implement hand tracking through the provided interface.
Embodiment 1
FIG. 1 is a flowchart of a fast hand tracking method according to Embodiment 1 of the present application.
101: Display, on a display interface, a video containing a human hand region captured by an imaging device.
In this embodiment, the terminal provides a display interface for synchronously displaying the video containing the human hand region captured by the imaging device. The imaging device is a 2D camera.
102: Receive a calibration frame marked by a user on the video containing the human hand region.
In this embodiment, when the user finds hand information of interest in the video containing the human hand region displayed on the display interface, the user marks the hand information of interest by adding a calibration frame on the display interface.
The user may touch the display interface with a finger, a stylus, or any other suitable object, preferably a finger, and add a calibration frame on the display interface.
103: Extract a gradient direction histogram feature of the region marked by the calibration frame, and segment the region marked by the calibration frame according to the gradient direction histogram feature to obtain a hand image.
The process of extracting the gradient direction histogram (Histogram of Oriented Gradients, HOG) feature of the region marked by the calibration frame specifically includes the following steps (a consolidated code sketch is given after the list):
11) computing the gradient information of each pixel in the region marked by the calibration frame, the gradient information including a gradient magnitude and a gradient direction;
First-order differential templates such as the one-dimensional centered template [1,0,-1], the one-dimensional non-centered template [-1,1], the one-dimensional cubic-corrected template [1,-8,0,8,-1], or the Sobel operator may be used to compute the horizontal and vertical gradients of each pixel in the region marked by the calibration frame; the gradient magnitude and gradient direction of the region marked by the calibration frame are then computed from the horizontal and vertical gradients.
In this preferred embodiment, the one-dimensional centered template [1,0,-1] is taken as an example for computing the gradient information of each pixel in the region marked by the calibration frame. Denoting the region marked by the calibration frame as I(x,y), the horizontal and vertical gradients of a pixel are computed as in formula (1-1):

$$G_h(x,y)=I(x+1,y)-I(x-1,y),\qquad G_v(x,y)=I(x,y+1)-I(x,y-1) \qquad (1\text{-}1)$$

where $G_h(x,y)$ and $G_v(x,y)$ denote the gradient values of the pixel (x,y) in the horizontal and vertical directions, respectively.
The gradient magnitude (also called the gradient strength) and the gradient direction of the pixel (x,y) are computed as in formula (1-2):

$$M(x,y)=\sqrt{G_h(x,y)^2+G_v(x,y)^2},\qquad \theta(x,y)=\arctan\frac{G_v(x,y)}{G_h(x,y)} \qquad (1\text{-}2)$$

where $M(x,y)$ and $\theta(x,y)$ denote the gradient magnitude and the gradient direction of the pixel (x,y), respectively.
Further, the range of the gradient direction is generally limited to an unsigned range, i.e., the sign of the direction angle is ignored; the unsigned gradient direction can be expressed as in formula (1-3):

$$\theta(x,y)=\begin{cases}\theta(x,y)+180^\circ, & \theta(x,y)<0^\circ\\ \theta(x,y), & \text{otherwise}\end{cases} \qquad (1\text{-}3)$$

After the computation of formula (1-3), the gradient direction of each pixel in the region marked by the calibration frame is limited to between 0 and 180 degrees.
12) dividing the region marked by the calibration frame into a plurality of blocks, each block being divided into a plurality of cell units, each cell unit including a plurality of pixels;
In this embodiment, the size of a cell unit is 8*8 pixels, and adjacent cell units do not overlap.
For example, assuming that the region I(x,y) marked by the calibration frame is 64*128 in size, the size of each block is set to 16*16 and the size of each cell unit to 8*8; the region marked by the calibration frame can then be divided into 105 blocks (the blocks slide with a stride of one cell, i.e., 8 pixels, giving 7*15 = 105 block positions), each block including 4 cell units and each cell unit including 64 pixels.
Dividing the cell units in a non-overlapping manner in this embodiment makes the computation of the gradient direction histogram in each block faster.
13) quantizing the gradient information of the pixels in each cell unit to obtain the gradient histogram of the region marked by the calibration frame;
In this embodiment, the gradient directions of the pixels of each cell unit are first divided into 9 bins (9 direction channels), which serve as the horizontal axis of the gradient histogram: [0°,20°], [20°,40°], [40°,60°], [60°,80°], [80°,100°], [100°,120°], [120°,140°], [140°,160°], and [160°,180°]; the gradient magnitudes of the pixels falling into each bin are then accumulated to form the vertical axis of the gradient histogram.
14) normalizing the gradient histogram of each block to obtain the normalized gradient histogram of each block;
In this preferred embodiment, a normalization function, which may be the L2 norm or the L1 norm, can be used to normalize the gradient histogram of each block.
Because of local illumination changes and changes in foreground/background contrast, the gradient magnitudes of pixels vary over a very wide range. Normalization compresses illumination, shadows, and edges, making the gradient direction histogram feature vector space robust to changes in illumination, shadow, and edges.
15) concatenating the normalized gradient histograms of all blocks to obtain the final HOG feature of the region marked by the calibration frame;
16) segmenting the hand region out of the region marked by the calibration frame according to the final HOG feature.
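As a consolidated illustration of steps 11) to 16), the following Python sketch computes a HOG descriptor using the [1,0,-1] template, 9 unsigned bins per 8*8 cell, and L2-normalized 16*16 blocks sliding one cell at a time. This is a minimal sketch, not the definitive implementation of the present application; the function name, the epsilon guard in the normalization, and the float conversion are illustrative choices.

```python
import numpy as np

def hog_features(region, cell=8, block=2, bins=9):
    """HOG sketch for a grayscale region marked by the calibration frame,
    following steps 11)-16)."""
    img = region.astype(np.float64)
    gh = np.zeros_like(img)
    gv = np.zeros_like(img)
    gh[:, 1:-1] = img[:, 2:] - img[:, :-2]          # formula (1-1), horizontal
    gv[1:-1, :] = img[2:, :] - img[:-2, :]          # formula (1-1), vertical
    mag = np.hypot(gh, gv)                          # formula (1-2), magnitude
    ang = np.degrees(np.arctan2(gv, gh)) % 180.0    # formula (1-3), unsigned direction
    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):                           # step 13): per-cell quantization
        for c in range(cols):
            m = mag[r*cell:(r+1)*cell, c*cell:(c+1)*cell].ravel()
            a = ang[r*cell:(r+1)*cell, c*cell:(c+1)*cell].ravel()
            idx = np.minimum((a // (180.0 / bins)).astype(int), bins - 1)
            for k in range(bins):
                hist[r, c, k] = m[idx == k].sum()   # accumulate magnitudes per bin
    feats = []                                      # steps 14)-15): block normalization
    for r in range(rows - block + 1):               # 16*16 blocks, one-cell stride
        for c in range(cols - block + 1):
            v = hist[r:r+block, c:c+block].ravel()
            feats.append(v / (np.linalg.norm(v) + 1e-6))  # L2 normalization
    return np.concatenate(feats)                    # final HOG feature vector
```

On a 64*128 region this yields 7*15 = 105 blocks of 4 cells * 9 bins = 3780 values, matching the block count in the example above.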
104: Track the hand image with the continuously adaptive mean shift operator.
In this embodiment, the Continuously Adaptive Mean Shift (CamShift) algorithm is a color-information-based method that tracks a target by its characteristic color, automatically adjusts the size and position of the search window, locates the size and center of the tracked target, and uses the result of the previous frame (i.e., the search window size and centroid) as the size and centroid of the target in the next frame of the image.
Tracking the hand image with the continuously adaptive mean shift operator specifically includes the following steps (a code sketch is given after the steps):
21) converting the color space of the hand image to the HSV (Hue, Saturation, Value) color space and separating out the hand image of the hue (H) component;
22) initializing the centroid position and the size S of the search window W based on the hand image of the hue H component;
23) computing the moments of the current search window;
The zeroth-order moment of the current search window is computed according to formula (1-4), and the first-order moments according to formula (1-5):

$$M_{00}=\sum_{i}\sum_{j} I(i,j) \qquad (1\text{-}4)$$

$$M_{10}=\sum_{i}\sum_{j} i\,I(i,j),\qquad M_{01}=\sum_{i}\sum_{j} j\,I(i,j) \qquad (1\text{-}5)$$
24) computing the centroid position $(M_{10}/M_{00},\ M_{01}/M_{00})$ of the current search window from the moments of the current search window;
25) computing the size of the current search window from the moments of the current search window:

$$s = 2\sqrt{M_{00}/256}$$
The currently computed search window is compared with a preset search-window threshold. When the currently computed search window is greater than or equal to the preset search-window threshold, steps 21) to 25) above are repeated; when the currently computed search window is smaller than the preset search-window threshold, tracking ends, and the position of the centroid of the search window at that point is the current position of the tracked target.
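The following Python sketch follows steps 21) to 25) under stated assumptions: the search-window size is taken as the conventional CamShift value $2\sqrt{M_{00}/256}$ for an 8-bit hue image (the size formula appears only as an image placeholder in the publication), the window is kept square, and the iteration cap, threshold, and function name are illustrative.

```python
import cv2
import numpy as np

def track_hand(hand_bgr, window, win_thresh=4.0, max_iter=10):
    """One tracking pass per steps 21)-25): HSV conversion, hue
    separation, then iterative moment-based window updates."""
    hsv = cv2.cvtColor(hand_bgr, cv2.COLOR_BGR2HSV)      # step 21)
    hue = hsv[:, :, 0].astype(np.float64)                # I(i, j): hue component
    x, y, w, h = window                                  # step 22): initialized window
    for _ in range(max_iter):
        roi = hue[y:y + h, x:x + w]
        if roi.size == 0:
            break
        jj, ii = np.mgrid[0:roi.shape[0], 0:roi.shape[1]]
        m00 = roi.sum()                                  # formula (1-4)
        if m00 == 0:
            break
        m10 = (ii * roi).sum()                           # formula (1-5)
        m01 = (jj * roi).sum()
        cx, cy = m10 / m00, m01 / m00                    # step 24): centroid
        s = 2.0 * np.sqrt(m00 / 256.0)                   # step 25): size (assumed formula)
        x = int(np.clip(x + cx - s / 2, 0, hue.shape[1] - 1))   # recenter on centroid
        y = int(np.clip(y + cy - s / 2, 0, hue.shape[0] - 1))
        w = h = max(int(s), 1)
        if s < win_thresh:                               # window below preset threshold:
            break                                        # end tracking
    return (x, y, w, h)   # the centroid of this window is the target's current position
```

For a production tracker, OpenCV's built-in cv2.CamShift(probImage, window, criteria) performs an equivalent iteration on a back-projected probability image.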
In summary, with the fast hand tracking method described in the present application, the user marks the hand information of interest in the video containing the human hand region with a calibration frame, the HOG feature of the region marked by the calibration frame is then extracted, and the hand region is segmented out of that region according to the HOG feature. Only the HOG feature in the region marked by the calibration frame therefore needs to be computed; compared with computing it over the entire video image containing the human hand region, receiving the user-marked calibration frame reduces the area over which HOG features are extracted and effectively shortens the extraction time, so the hand region can be quickly segmented out of the video containing the human hand region.
In addition, because the gradient information of the pixels in the region marked by the calibration frame is processed with the cell unit as the processing unit, the computed HOG feature preserves the geometric and optical characteristics of the hand region. The block-and-cell computation scheme allows the relationships among the pixels of the hand region to be well characterized. Finally, the normalization step partially offsets the effects of illumination changes, which guarantees the sharpness of the extracted hand region and allows the hand region to be segmented accurately.
Embodiment 2
FIG. 2 is a flowchart of a fast hand tracking method according to Embodiment 2 of the present application.
201: Display, on a display interface, a video containing a human hand region captured by an imaging device, and at the same time display a preset standard calibration frame in a preset display manner.
In this embodiment, the terminal provides a display interface for synchronously displaying the video containing the human hand region captured by the imaging device; the display interface also displays a standard calibration frame at the same time.
The imaging device is a 3D depth camera. A 3D depth camera differs from a 2D camera in that it can simultaneously capture grayscale image information of a scene and 3-dimensional information that includes depth information. After the video containing the human hand region is captured by the 3D depth camera, the video containing the human hand region is synchronously displayed on the display interface of the terminal.
In this embodiment, the preset standard calibration frame is provided for the user to mark the displayed video containing the human hand region so as to obtain the hand information of interest.
The preset display manner includes one or a combination of the following:
1) displaying the preset standard calibration frame when a display instruction is received;
The display instruction corresponds to a display operation input by the user. The display operation input by the user includes, but is not limited to: clicking any position on the display interface, touching any position on the display interface for longer than a first preset time period (e.g., 1 second), or issuing a first preset voice command (e.g., "calibration frame").
When it is detected that the user has performed a click operation on the display interface, that a touch operation performed by the user on the display interface has lasted longer than the preset time, or that the user has issued the first preset voice command, the terminal determines that a display instruction has been received and displays the preset standard calibration frame.
2) hiding the preset standard calibration frame when a hide instruction is received;
The hide instruction corresponds to a hide operation input by the user. The hide operation input by the user includes, but is not limited to: clicking any position on the display interface, touching any position on the display interface for longer than a second preset time period (e.g., 2 seconds), or issuing a second preset voice command (e.g., "exit").
When it is detected that the user has performed a click operation on the display interface, that a touch operation performed by the user on the display interface has lasted longer than the second preset time period, or that the user has issued the second preset voice command, the terminal determines that a hide instruction has been received and hides the preset standard calibration frame.
The hide instruction may be the same as or different from the display instruction, and the first preset time period may be the same as or different from the second preset time period. Preferably, the first preset time period is shorter than the second preset time period: a shorter first preset time period allows the preset standard calibration frame to be displayed quickly, while a longer second preset time period prevents the preset standard calibration frame from being hidden because of an unconscious action or operating mistake by the user.
Displaying the preset standard calibration frame when a display instruction is received allows the user to mark the hand region of interest while the display interface is showing the video containing the human hand region. At the same time, not displaying the preset standard calibration frame when no display instruction has been received, or hiding it when a hide instruction is received, prevents the displayed video containing the human hand region from being blocked by the preset standard calibration frame for a long time, which would cause important information to be missed or cause visual discomfort to the user when viewing the video containing the human hand region.
3) automatically hiding the preset standard calibration frame when, after the preset standard calibration frame has been displayed in response to the display instruction, no further instruction is received for longer than a third preset time period.
After the preset standard calibration frame is displayed, automatically hiding it when the user no longer inputs any operation for longer than the third preset time period prevents the frame from being displayed for a long time after the user has unconsciously triggered a display instruction; automatically hiding the preset standard calibration frame also helps improve the user's interactive experience. (A sketch of this show/hide logic is given below.)
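The show/hide rules above amount to simple threshold logic. The following Python class is a minimal sketch, assuming touch-duration thresholds of 1 s, 2 s, and 5 s for the first, second, and third preset time periods; the class name, method names, and values are all illustrative, not part of the present application, and the voice-command and click paths are omitted.

```python
import time

class CalibrationFrameUI:
    """Show/hide logic for the preset standard calibration frame:
    a touch of at least t1 shows it, a touch of at least t2 (> t1)
    hides it, and it auto-hides after t3 seconds without input."""

    def __init__(self, t1=1.0, t2=2.0, t3=5.0):
        self.t1, self.t2, self.t3 = t1, t2, t3
        self.visible = False
        self.last_input = time.monotonic()

    def on_touch(self, duration):
        self.last_input = time.monotonic()
        if duration >= self.t2:
            self.visible = False       # hide instruction (second preset period)
        elif duration >= self.t1:
            self.visible = True        # display instruction (first preset period)

    def on_idle_check(self):
        # auto-hide after the third preset period without any instruction
        if self.visible and time.monotonic() - self.last_input > self.t3:
            self.visible = False
```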
In this embodiment, the preset standard calibration frame may be a circle, an ellipse, a rectangle, a square, or the like.
202: Receive a standard calibration frame marked by the user on the video containing the human hand region.
In this embodiment, when the user finds hand information of interest in the video containing the human hand region displayed on the display interface, the user marks the hand information of interest by adding a standard calibration frame on the display interface.
In this embodiment, receiving the standard calibration frame marked by the user on the video containing the human hand region covers the following two cases:
First case: receiving a rough calibration frame drawn by the user in the video containing the human hand region; matching, by fuzzy matching, the preset standard calibration frame corresponding to the rough calibration frame; and marking the video containing the human hand region according to the matched standard calibration frame and displaying the marked standard calibration frame, where the geometric center of the rough calibration frame is the same as the geometric center of the matched standard calibration frame.
In this embodiment, the shape of a calibration frame drawn by the user with a finger on the display interface is neither regular nor standard; for example, a circle drawn by the user is not very precise. After the terminal receives the shape of the approximate rough calibration frame drawn by the user, it therefore matches the shape of the corresponding preset standard calibration frame according to the approximate shape of the rough calibration frame. Matching the corresponding standard calibration frame by fuzzy matching makes it easier to subsequently crop the region marked by the frame. (A hypothetical matching heuristic is sketched below.)
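The publication names fuzzy matching but does not fix a metric. The following Python sketch is one hypothetical heuristic: it classifies the drawn stroke by how evenly its points are spread around the center of its bounding box; the function name and every threshold here are assumptions.

```python
import numpy as np

def match_standard_frame(stroke_pts):
    """Hypothetical fuzzy match of a hand-drawn stroke to a preset
    standard frame shape (circle / ellipse / square / rectangle)."""
    pts = np.asarray(stroke_pts, dtype=np.float64)
    (x0, y0), (x1, y1) = pts.min(axis=0), pts.max(axis=0)
    w, h = max(x1 - x0, 1e-6), max(y1 - y0, 1e-6)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0     # geometric center is preserved
    # Normalized distance of stroke points from the center: nearly
    # constant for circles/ellipses, spread out for rectangles.
    r = np.hypot((pts[:, 0] - cx) / w, (pts[:, 1] - cy) / h)
    roundness = r.std() / (r.mean() + 1e-6)
    if roundness < 0.15:                           # assumed threshold
        shape = "circle" if abs(w - h) / max(w, h) < 0.2 else "ellipse"
    else:
        shape = "square" if abs(w - h) / max(w, h) < 0.2 else "rectangle"
    return shape, (cx, cy), (w, h)
```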
Second case: directly receiving a standard calibration frame selected by the user, and marking the video containing the human hand region according to the standard calibration frame and displaying the marked standard calibration frame.
In this embodiment, a display operation input by the user triggers a display instruction, so that a plurality of preset standard calibration frames are displayed. The user touches a standard calibration frame; after detecting the touch signal on the standard calibration frame, the terminal determines that this standard calibration frame has been selected. The user then moves the selected standard calibration frame and drags it onto the video containing the human hand region, and the terminal displays the dragged standard calibration frame on the video containing the human hand region.
Preferably, step 202 may further include: enlarging, shrinking, moving, or deleting the displayed standard calibration frame when an enlarge, shrink, move, or delete instruction is received.
203: Preprocess the region marked by the standard calibration frame.
In this embodiment, the preprocessing may include one or a combination of the following: grayscale processing and correction processing.
Grayscale processing refers to converting the image of the region marked by the standard calibration frame into a grayscale image. Because color information has little influence on extracting the gradient direction histogram feature, converting the image of the region marked by the standard calibration frame into grayscale does not affect the subsequent computation of the gradient information of each pixel in that region, and it reduces the amount of computation required for the gradient information of each pixel.
The correction processing may use gamma correction. Because the local surface-exposure contribution carries a large weight in the texture intensity of an image, a gamma-corrected image can effectively reduce local shadow and illumination changes. (A preprocessing sketch is given below.)
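A minimal sketch of this preprocessing step, assuming OpenCV is available; the gamma value of 0.5 and the function name are illustrative, since the publication does not fix them.

```python
import cv2
import numpy as np

def preprocess_region(region_bgr, gamma=0.5):
    """Step 203: grayscale conversion followed by gamma correction
    of the region marked by the standard calibration frame."""
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    # Per-pixel gamma lookup table: out = (in / 255) ** gamma * 255
    table = (((np.arange(256) / 255.0) ** gamma) * 255).astype(np.uint8)
    return cv2.LUT(gray, table)
```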
204: Extract the gradient direction histogram feature of the preprocessed region marked by the standard calibration frame, and segment the region marked by the standard calibration frame according to the gradient direction histogram feature to obtain a hand image.
Step 204 of this embodiment is the same as step 103 of Embodiment 1 and is not described again in detail here.
205: Track the hand image with the continuously adaptive mean shift operator.
Step 205 of this embodiment is the same as step 104 of Embodiment 1 and is not described again in detail here.
Further, to make full use of the depth information, after step 204 and before step 205, the method further includes: acquiring the depth information in the video containing the human hand region that corresponds to the region marked by the calibration frame, and normalizing the hand image according to the depth information.
The depth information is acquired from the 3D depth camera. The specific process of normalizing the hand image according to the depth information is as follows: the size of the hand image obtained by segmenting the region marked by the standard calibration frame for the first time is recorded as a standard size S1, and the depth-of-field information corresponding to that first marked region is recorded as a standard depth of field H1; the size of the hand image obtained by segmenting the region marked by the current standard calibration frame is recorded as S2, and the depth-of-field information corresponding to the currently marked region is recorded as H2; the hand image obtained by segmenting the currently marked region is then normalized to S2*(H2/H1).
The size of the hand image is normalized so that the finally extracted HOG feature representations share a uniform criterion, i.e., the same dimension, which improves the accuracy of hand tracking. (A sketch follows.)
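A minimal sketch of this depth-based normalization, assuming OpenCV for resizing; the reading that S2*(H2/H1) means rescaling the current hand image by the depth ratio H2/H1 follows the text above, and the function and parameter names are illustrative.

```python
import cv2

def normalize_hand_image(hand_img, depth_cur, depth_ref):
    """Rescale the current hand image (size S2) by the depth ratio
    H2/H1 so hands at different depths share one reference scale."""
    scale = depth_cur / depth_ref                  # H2 / H1
    new_w = max(1, int(round(hand_img.shape[1] * scale)))
    new_h = max(1, int(round(hand_img.shape[0] * scale)))
    return cv2.resize(hand_img, (new_w, new_h))    # new size = S2 * (H2 / H1)
```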
In summary, the fast hand tracking method described in the present application provides two kinds of standard calibration frame marking for the video containing the human hand region. This ensures that the calibration frame marked by the user is a standard calibration frame, so that the shape of the segmented hand region is standard, and hand tracking based on this standard calibration frame works better.
It should be noted that the fast hand tracking method described in the present application can be applied to tracking a single hand as well as to tracking multiple hands. Multiple hands are tracked in parallel, which in essence amounts to multiple single-hand tracking processes and is not described in detail here; any method that uses the idea of the present application for hand tracking falls within the scope of the present application.
In the above description of the steps in FIG. 1 and FIG. 2, the order of execution in the flowcharts of FIG. 1 and FIG. 2 may be changed and some steps may be omitted according to different requirements.
The functional modules and hardware structure of a terminal implementing the above fast hand tracking method are described below with reference to FIG. 3 to FIG. 5.
Embodiment 3
FIG. 3 is a functional module diagram of a preferred embodiment of the fast hand tracking device of the present application.
In some embodiments, the fast hand tracking device 30 runs in a terminal. The fast hand tracking device 30 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the fast hand tracking device 30 may be stored in a memory and executed by at least one processor to perform tracking of the hand region (see FIG. 1 and the related description for details).
In this embodiment, the fast hand tracking device 30 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a display module 301, a calibration module 302, a segmentation module 303, and a tracking module 304.
The display module 301 is configured to display, on a display interface, a video containing a human hand region captured by an imaging device.
In this embodiment, the terminal provides a display interface for synchronously displaying the video containing the human hand region captured by the imaging device. The imaging device is a 2D camera.
The calibration module 302 is configured to receive a calibration frame marked by a user on the video containing the human hand region.
In this embodiment, when the user finds hand information of interest in the video containing the human hand region displayed on the display interface, the user marks the hand information of interest by adding a calibration frame on the display interface.
The user may touch the display interface with a finger, a stylus, or any other suitable object, preferably a finger, and add a calibration frame on the display interface.
The segmentation module 303 is configured to extract a gradient direction histogram feature of the region marked by the calibration frame and to segment the region marked by the calibration frame according to the gradient direction histogram feature to obtain a hand image.
The extraction, by the segmentation module 303, of the gradient direction histogram (Histogram of Oriented Gradients, HOG) feature of the region marked by the calibration frame specifically includes:
11) computing the gradient information of each pixel in the region marked by the calibration frame, the gradient information including a gradient magnitude and a gradient direction;
First-order differential templates such as the one-dimensional centered template [1,0,-1], the one-dimensional non-centered template [-1,1], the one-dimensional cubic-corrected template [1,-8,0,8,-1], or the Sobel operator may be used to compute the horizontal and vertical gradients of each pixel in the region marked by the calibration frame; the gradient magnitude and gradient direction of the region marked by the calibration frame are then computed from the horizontal and vertical gradients.
In this preferred embodiment, the one-dimensional centered template [1,0,-1] is taken as an example for computing the gradient information of each pixel in the region marked by the calibration frame. Denoting the region marked by the calibration frame as I(x,y), the horizontal and vertical gradients of a pixel are computed as in formula (1-1):

$$G_h(x,y)=I(x+1,y)-I(x-1,y),\qquad G_v(x,y)=I(x,y+1)-I(x,y-1) \qquad (1\text{-}1)$$

where $G_h(x,y)$ and $G_v(x,y)$ denote the gradient values of the pixel (x,y) in the horizontal and vertical directions, respectively.
The gradient magnitude (also called the gradient strength) and the gradient direction of the pixel (x,y) are computed as in formula (1-2):

$$M(x,y)=\sqrt{G_h(x,y)^2+G_v(x,y)^2},\qquad \theta(x,y)=\arctan\frac{G_v(x,y)}{G_h(x,y)} \qquad (1\text{-}2)$$

where $M(x,y)$ and $\theta(x,y)$ denote the gradient magnitude and the gradient direction of the pixel (x,y), respectively.
Further, the range of the gradient direction is generally limited to an unsigned range, i.e., the sign of the direction angle is ignored; the unsigned gradient direction can be expressed as in formula (1-3):

$$\theta(x,y)=\begin{cases}\theta(x,y)+180^\circ, & \theta(x,y)<0^\circ\\ \theta(x,y), & \text{otherwise}\end{cases} \qquad (1\text{-}3)$$

After the computation of formula (1-3), the gradient direction of each pixel in the region marked by the calibration frame is limited to between 0 and 180 degrees.
12) dividing the region marked by the calibration frame into a plurality of blocks, each block being divided into a plurality of cell units, each cell unit including a plurality of pixels;
In this embodiment, the size of a cell unit is 8*8 pixels, and adjacent cell units do not overlap.
For example, assuming that the region I(x,y) marked by the calibration frame is 64*128 in size, the size of each block is set to 16*16 and the size of each cell unit to 8*8; the region marked by the calibration frame can then be divided into 105 blocks (the blocks slide with a stride of one cell, i.e., 8 pixels, giving 7*15 = 105 block positions), each block including 4 cell units and each cell unit including 64 pixels.
Dividing the cell units in a non-overlapping manner in this embodiment makes the computation of the gradient direction histogram in each block faster.
13) quantizing the gradient information of the pixels in each cell unit to obtain the gradient histogram of the region marked by the calibration frame;
In this embodiment, the gradient directions of the pixels of each cell unit are first divided into 9 bins (9 direction channels), which serve as the horizontal axis of the gradient histogram: [0°,20°], [20°,40°], [40°,60°], [60°,80°], [80°,100°], [100°,120°], [120°,140°], [140°,160°], and [160°,180°]; the gradient magnitudes of the pixels falling into each bin are then accumulated to form the vertical axis of the gradient histogram.
14) normalizing the gradient histogram of each block to obtain the normalized gradient histogram of each block;
In this preferred embodiment, a normalization function, which may be the L2 norm or the L1 norm, can be used to normalize the gradient histogram of each block.
Because of local illumination changes and changes in foreground/background contrast, the gradient magnitudes of pixels vary over a very wide range. Normalization compresses illumination, shadows, and edges, making the gradient direction histogram feature vector space robust to changes in illumination, shadow, and edges.
15) concatenating the normalized gradient histograms of all blocks to obtain the final HOG feature of the region marked by the calibration frame;
16) segmenting the hand region out of the region marked by the calibration frame according to the final HOG feature.
The tracking module 304 is configured to track the hand image with a continuously adaptive mean shift operator.
In this embodiment, the Continuously Adaptive Mean Shift (CamShift) algorithm is a color-information-based method that tracks a target by its characteristic color, automatically adjusts the size and position of the search window, locates the size and center of the tracked target, and uses the result of the previous frame (i.e., the search window size and centroid) as the size and centroid of the target in the next frame of the image.
Tracking the hand image with the continuously adaptive mean shift operator specifically includes:
21) converting the color space of the hand image to the HSV (Hue, Saturation, Value) color space and separating out the hand image of the hue (H) component;
22) initializing the centroid position and the size S of the search window W based on the hand image of the hue H component;
23) computing the moments of the current search window;
The zeroth-order moment of the current search window is computed according to formula (1-4), and the first-order moments according to formula (1-5):

$$M_{00}=\sum_{i}\sum_{j} I(i,j) \qquad (1\text{-}4)$$

$$M_{10}=\sum_{i}\sum_{j} i\,I(i,j),\qquad M_{01}=\sum_{i}\sum_{j} j\,I(i,j) \qquad (1\text{-}5)$$
24) computing the centroid position $(M_{10}/M_{00},\ M_{01}/M_{00})$ of the current search window from the moments of the current search window;
25) computing the size of the current search window from the moments of the current search window:

$$s = 2\sqrt{M_{00}/256}$$
The currently computed search window is compared with a preset search-window threshold. When the currently computed search window is greater than or equal to the preset search-window threshold, steps 21) to 25) above are repeated; when the currently computed search window is smaller than the preset search-window threshold, tracking ends, and the position of the centroid of the search window at that point is the current position of the tracked target.
In summary, with the fast hand tracking device 30 described in the present application, the user marks the hand information of interest in the video containing the human hand region with a calibration frame, the HOG feature of the region marked by the calibration frame is then extracted, and the hand region is segmented out of that region according to the HOG feature. Only the HOG feature in the region marked by the calibration frame therefore needs to be computed; compared with computing it over the entire video image containing the human hand region, receiving the user-marked calibration frame reduces the area over which HOG features are extracted and effectively shortens the extraction time, so the hand region can be quickly segmented out of the video containing the human hand region.
In addition, because the gradient information of the pixels in the region marked by the calibration frame is processed with the cell unit as the processing unit, the computed HOG feature preserves the geometric and optical characteristics of the hand region. The block-and-cell computation scheme allows the relationships among the pixels of the hand region to be well characterized. Finally, the normalization step partially offsets the effects of illumination changes, which guarantees the sharpness of the extracted hand region and allows the hand region to be segmented accurately.
Embodiment 4
FIG. 4 is a functional module diagram of a preferred embodiment of the fast hand tracking device of the present application.
In some embodiments, the fast hand tracking device 40 runs in a terminal. The fast hand tracking device 40 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the fast hand tracking device 40 may be stored in a memory and executed by at least one processor to perform tracking of the hand region (see FIG. 2 and the related description for details).
In this embodiment, the fast hand tracking device of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a display module 401, a calibration module 402, a preprocessing module 403, a segmentation module 404, a tracking module 405, and a normalization module 406.
The display module 401 includes a first display submodule 4010 and a second display submodule 4012. The first display submodule 4010 is configured to display, on a display interface, a video containing a human hand region captured by an imaging device, and the second display submodule 4012 is configured to display a preset standard calibration frame in a preset display manner.
In this embodiment, the terminal provides a display interface for synchronously displaying the video containing the human hand region captured by the imaging device; the display interface also displays a standard calibration frame at the same time.
The imaging device is a 3D depth camera. A 3D depth camera differs from a 2D camera in that it can simultaneously capture grayscale image information of a scene and 3-dimensional information that includes depth information. After the video containing the human hand region is captured by the 3D depth camera, the video containing the human hand region is synchronously displayed on the display interface of the terminal.
In this embodiment, the preset standard calibration frame is provided for the user to mark the displayed video containing the human hand region so as to obtain the hand information of interest.
The preset display manner includes one or a combination of the following:
1) displaying the preset standard calibration frame when a display instruction is received;
The display instruction corresponds to a display operation input by the user. The display operation input by the user includes, but is not limited to: clicking any position on the display interface, touching any position on the display interface for longer than a first preset time period (e.g., 1 second), or issuing a first preset voice command (e.g., "calibration frame").
When it is detected that the user has performed a click operation on the display interface, that a touch operation performed by the user on the display interface has lasted longer than the preset time, or that the user has issued the first preset voice command, the terminal determines that a display instruction has been received and displays the preset standard calibration frame.
2) hiding the preset standard calibration frame when a hide instruction is received;
The hide instruction corresponds to a hide operation input by the user. The hide operation input by the user includes, but is not limited to: clicking any position on the display interface, touching any position on the display interface for longer than a second preset time period (e.g., 2 seconds), or issuing a second preset voice command (e.g., "exit").
When it is detected that the user has performed a click operation on the display interface, that a touch operation performed by the user on the display interface has lasted longer than the second preset time period, or that the user has issued the second preset voice command, the terminal determines that a hide instruction has been received and hides the preset standard calibration frame.
The hide instruction may be the same as or different from the display instruction, and the first preset time period may be the same as or different from the second preset time period. Preferably, the first preset time period is shorter than the second preset time period: a shorter first preset time period allows the preset standard calibration frame to be displayed quickly, while a longer second preset time period prevents the preset standard calibration frame from being hidden because of an unconscious action or operating mistake by the user.
Displaying the preset standard calibration frame when a display instruction is received allows the user to mark the hand region of interest while the display interface is showing the video containing the human hand region. At the same time, not displaying the preset standard calibration frame when no display instruction has been received, or hiding it when a hide instruction is received, prevents the displayed video containing the human hand region from being blocked by the preset standard calibration frame for a long time, which would cause important information to be missed or cause visual discomfort to the user when viewing the video containing the human hand region.
3) automatically hiding the preset standard calibration frame when, after the preset standard calibration frame has been displayed in response to the display instruction, no further instruction is received for longer than a third preset time period.
After the preset standard calibration frame is displayed, automatically hiding it when the user no longer inputs any operation for longer than the third preset time period prevents the frame from being displayed for a long time after the user has unconsciously triggered a display instruction; automatically hiding the preset standard calibration frame also helps improve the user's interactive experience.
In this embodiment, the preset standard calibration frame may be a circle, an ellipse, a rectangle, a square, or the like.
The calibration module 402 is configured to receive a standard calibration frame marked by the user on the video containing the human hand region.
In this embodiment, when the user finds hand information of interest in the video containing the human hand region displayed on the display interface, the user marks the hand information of interest by adding a standard calibration frame on the display interface.
In this embodiment, the calibration module 402 further includes a first calibration submodule 4020, a second calibration submodule 4022, and a third calibration submodule 4024.
The first calibration submodule 4020 is configured to receive a rough calibration frame drawn by the user in the video containing the human hand region; match, by fuzzy matching, the preset standard calibration frame corresponding to the rough calibration frame; and mark the video containing the human hand region according to the matched standard calibration frame and display the marked standard calibration frame, where the geometric center of the rough calibration frame is the same as the geometric center of the matched standard calibration frame.
In this embodiment, the shape of a calibration frame drawn by the user with a finger on the display interface is neither regular nor standard; for example, a circle drawn by the user is not very precise. After the terminal receives the shape of the approximate rough calibration frame drawn by the user, it therefore matches the shape of the corresponding preset standard calibration frame according to the approximate shape of the rough calibration frame. Matching the corresponding standard calibration frame by fuzzy matching makes it easier to subsequently crop the region marked by the frame.
The second calibration submodule 4022 is configured to directly receive a standard calibration frame selected by the user, and to mark the video containing the human hand region according to the standard calibration frame and display the marked standard calibration frame.
In this embodiment, a display operation input by the user triggers a display instruction, so that a plurality of preset standard calibration frames are displayed. The user touches a standard calibration frame; after detecting the touch signal on the standard calibration frame, the terminal determines that this standard calibration frame has been selected. The user then moves the selected standard calibration frame and drags it onto the video containing the human hand region, and the terminal displays the dragged standard calibration frame on the video containing the human hand region.
The third calibration submodule 4024 is configured to enlarge, shrink, move, or delete the displayed standard calibration frame when an enlarge, shrink, move, or delete instruction is received.
预处理模块403,用于对所述标准标定框标定的区域进行预处理。The pre-processing module 403 is configured to pre-process the area of the standard calibration frame.
本实施例中,所述预处理可以包括以下一种或多种的组合:灰度化处理,校正处理。In this embodiment, the pre-processing may include a combination of one or more of the following: grayscale processing, correction processing.
所述灰度化处理是指将所述标准标定框标定的区域图像转化为灰度图像,因为色彩信息对提取梯度方向直方图特征影响不大,因而所述标准标定框标定的区域图像转化为灰度图像,既不会影响后续计算所述标准标定框标定的区域的各个像素点的梯度信息,还可以减少各个像素点的梯度信息的计算量。The grayscale processing refers to converting the image of the area calibrated by the standard calibration frame into a grayscale image, because the color information has little effect on the extraction gradient direction histogram feature, and thus the image of the calibration of the standard calibration frame is converted into The grayscale image does not affect the gradient information of each pixel of the region where the standard calibration frame is subsequently calculated, and the calculation amount of the gradient information of each pixel is also reduced.
The correction processing may use gamma correction. In an image's texture intensity, local surface exposure accounts for a large share, so a gamma-corrected image effectively reduces local shadows and illumination changes.
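For illustration, both preprocessing steps fit in a few lines of Python with OpenCV; the gamma value of 0.5 is an assumed parameter, since the application does not fix one:

```python
import cv2
import numpy as np

def preprocess_region(region_bgr, gamma=0.5):
    """Grayscale then gamma-correct the calibrated region before HOG extraction."""
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    # Gamma correction via a lookup table: out = 255 * (in / 255) ** gamma.
    table = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
    return cv2.LUT(gray, table)
```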
The segmentation module 404 is configured to extract the histogram of oriented gradients (HOG) features of the preprocessed region calibrated by the standard calibration frame, and to segment that region according to the HOG features to obtain a hand image.
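A minimal sketch of the HOG extraction step using OpenCV's HOGDescriptor; the 64x128 window and the cell/block sizes are OpenCV defaults rather than values fixed by the application, and the rule that turns HOG features into a hand segmentation is not specified here:

```python
import cv2

def extract_hog(region_gray):
    """Compute HOG features of the preprocessed calibrated region."""
    resized = cv2.resize(region_gray, (64, 128))  # match the default window
    hog = cv2.HOGDescriptor()  # 8x8 cells, 16x16 blocks, 9 orientation bins
    return hog.compute(resized)  # 3780-dimensional feature vector
```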
The tracking module 405 is configured to track the hand image using a continuously adaptive mean-shift (CamShift) operator.
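The continuously adaptive mean-shift operator is available in OpenCV as cv2.CamShift. A sketch of the tracking loop under assumed seeding: a hue histogram is built from the segmented hand's bounding box and back-projected frame by frame, mirroring the hue-component scheme described above:

```python
import cv2

def track_hand(video, hand_bbox):
    """Track one hand with CamShift, seeded by the segmented hand's bounding box."""
    ok, frame = video.read()
    x, y, w, h = hand_bbox
    hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    # Histogram of the hue component; a fuller version would mask out
    # low-saturation pixels before building the histogram.
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    track_window = (x, y, w, h)
    while True:
        ok, frame = video.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        # CamShift recomputes the window centroid and size from image moments.
        rot_rect, track_window = cv2.CamShift(back, track_window, term)
        yield rot_rect
```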
Further, the fast hand tracking device 40 also includes a normalization module 406, configured to acquire the depth information, in the video containing the human hand region, that corresponds to the region calibrated by the calibration frame, and to normalize the hand image according to that depth information.
The depth information is acquired from the 3D depth camera. The normalization of the hand image according to the depth information proceeds as follows: the size of the hand image segmented from the region calibrated by the first standard calibration frame is recorded as the standard size S1, and the depth-of-field information corresponding to that first calibrated region as the standard depth H1; the size of the hand image segmented from the region calibrated by the current standard calibration frame is recorded as S2, and the depth-of-field information corresponding to the currently calibrated region as H2. The hand image segmented from the currently calibrated region is then normalized to the size S2*(H2/H1).
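This rule transcribes directly to code; a sketch, assuming H1 and H2 are depth values in the same units:

```python
import cv2

def normalize_hand_size(hand_img, h2_current, h1_standard):
    """Rescale the current hand image by H2/H1, giving the normalized size
    S2 * (H2 / H1) so hands at different depths share a common scale."""
    scale = h2_current / h1_standard
    return cv2.resize(hand_img, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_LINEAR)
```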
The size of the hand image is normalized so that the finally extracted HOG feature representations share a uniform yardstick, i.e., the same scale, which improves the accuracy of hand tracking.
In summary, the fast hand tracking device 40 described in this application provides two ways of placing a standard calibration frame on the video containing the human hand region, ensuring that the frame calibrated by the user is a standard calibration frame. The shape of the segmented hand region is therefore standard, and hand tracking based on that standard calibrated frame performs better.
It should be noted that the fast hand tracking devices 30 and 40 described in this application are applicable to tracking a single hand as well as multiple hands. Multiple hands are tracked in parallel, which is in essence several single-hand tracking processes and is not described in detail here; any device that performs hand tracking using the idea of this application falls within the scope of this application.
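Since multi-hand tracking reduces to several single-hand trackers run side by side, a minimal serial stand-in can reuse the hypothetical track_hand generator sketched above (a real implementation would share frames instead of opening the video once per hand):

```python
import cv2

def track_hands(video_path, hand_bboxes):
    """Drive one CamShift tracker per hand in lockstep."""
    trackers = [track_hand(cv2.VideoCapture(video_path), bbox)
                for bbox in hand_bboxes]
    while True:
        results = [next(t, None) for t in trackers]  # one rotated rect per hand
        if all(r is None for r in results):
            break
        yield results
```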
Embodiment 5
FIG. 5 is a schematic diagram of a terminal according to Embodiment 5 of the present application.
The terminal 5 includes: a memory 51, at least one processor 52, computer-readable instructions 53 stored in the memory 51 and executable on the at least one processor 52, at least one communication bus 54, and an imaging device 55.
When the at least one processor 52 executes the computer-readable instructions 53, the steps of the fast hand tracking method embodiments above are implemented, for example steps 101 to 104 shown in FIG. 1 or steps 201 to 205 shown in FIG. 2. Alternatively, when the at least one processor 52 executes the computer-readable instructions 53, the functions of the modules/units in the device embodiments above are implemented, for example modules 301 to 304 in FIG. 3 or modules 401 to 406 in FIG. 4.
Illustratively, the computer-readable instructions 53 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the at least one processor 52 to complete this application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing particular functions, the instruction segments describing the execution of the computer-readable instructions 53 in the terminal 5. For example, the computer-readable instructions 53 may be divided into the display module 301, the calibration module 302, the segmentation module 303, and the tracking module 304 in FIG. 3, or into the display module 401, the calibration module 402, the preprocessing module 403, the segmentation module 404, the tracking module 405, and the normalization module 406 in FIG. 4. The display module 401 includes a first display submodule 4010 and a second display submodule 4012; the calibration module 402 includes a first calibration submodule 4020, a second calibration submodule 4022, and a third calibration submodule 4024. For the specific functions of each module, see Embodiments 1 and 2 and their corresponding descriptions.
The imaging device 55 includes a 2D camera, a 3D depth camera, or the like. The imaging device 55 may be mounted on the terminal 5 or may exist separately from the terminal 5 as an independent component.
The terminal 5 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 5 is merely an example of the terminal 5 and does not constitute a limitation on it; the terminal may include more or fewer components than illustrated, combine certain components, or use different components. For example, the terminal 5 may also include input/output devices, network access devices, buses, and so on.
The at least one processor 52 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 52 may be a microprocessor or any conventional processor. The processor 52 is the control center of the terminal 5 and connects all parts of the entire terminal 5 through various interfaces and lines.
The memory 51 may be used to store the computer-readable instructions 53 and/or the modules/units. The processor 52 implements the various functions of the terminal 5 by running or executing the computer-readable instructions and/or modules/units stored in the memory 51 and by invoking the data stored in the memory 51. The memory 51 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created according to the use of the terminal 5 (such as audio data or a phone book). In addition, the memory 51 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If implemented in the form of software functional units and sold or used as an independent product, the modules/units integrated in the terminal 5 may be stored in a non-volatile readable storage medium. Based on this understanding, this application implements all or part of the processes in the method embodiments above, which may also be completed by instructing the relevant hardware through computer-readable instructions. The computer-readable instructions may be stored in a non-volatile readable storage medium, and when executed by a processor, they implement the steps of each method embodiment above. The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The non-volatile readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the non-volatile readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the non-volatile readable medium does not include electrical carrier signals and telecommunication signals.
The functional units in the embodiments of this application may be integrated in the same processing unit, each unit may exist physically alone, or two or more units may be integrated in the same unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules. In addition, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in the system claims may also be implemented by one unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of this application. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.

Claims (20)

1. A fast hand tracking method, wherein the method comprises:
    displaying, on a display interface, a video containing a human hand region collected by an imaging device;
    receiving a calibration frame calibrated by a user on the video containing the human hand region;
    extracting histogram of oriented gradients (HOG) features of the region calibrated by the calibration frame, and segmenting the region calibrated by the calibration frame according to the HOG features to obtain a hand image; and
    tracking the hand image using a continuously adaptive mean-shift (CamShift) operator, wherein tracking the hand image using the continuously adaptive mean-shift operator specifically comprises:
    converting the color space of the hand image to the HSV color space and separating out the hand image of the hue component; and, based on the hand image I(i,j) of the hue component and the centroid position and size of an initialized search box, calculating the centroid position (M10/M00, M01/M00) of the current search window and the size of the current search window
    s = 2 * sqrt(M00 / 256),
    where
    M10 = Σi Σj i·I(i,j) and M01 = Σi Σj j·I(i,j)
    are the first-order moments of the current search window,
    M00 = Σi Σj I(i,j)
    is the zeroth-order moment of the current search window, i is the pixel coordinate of I(i,j) in the horizontal direction, and j is the pixel coordinate of I(i,j) in the vertical direction.
2. The method of claim 1, wherein displaying, on the display interface, the video containing the human hand region collected by the imaging device further comprises:
    displaying a preset standard calibration frame in a preset display manner, the preset display manner comprising one or a combination of the following:
    displaying the preset standard calibration frame when a display instruction is received;
    hiding the preset standard calibration frame when a hiding instruction is received; and
    automatically hiding the preset standard calibration frame when, after the display instruction has been received and the preset standard calibration frame displayed, no further instruction is received for longer than a preset time period.
3. The method of claim 2, wherein receiving the calibration frame calibrated by the user on the video containing the human hand region comprises:
    receiving a standard calibration frame calibrated by the user on the video containing the human hand region, comprising:
    receiving a rough calibration frame drawn by the user in the video containing the human hand region;
    matching, by fuzzy matching, a preset standard calibration frame corresponding to the rough calibration frame; and
    calibrating the video containing the human hand region according to the matched standard calibration frame and displaying the calibrated standard calibration frame, wherein the geometric center of the rough calibration frame is the same as the geometric center of the matched standard calibration frame.
4. The method of claim 2, wherein receiving the calibration frame calibrated by the user on the video containing the human hand region comprises:
    receiving a standard calibration frame calibrated by the user on the video containing the human hand region, comprising:
    directly receiving a standard calibration frame selected by the user, calibrating the video containing the human hand region according to the standard calibration frame, and displaying the calibrated standard calibration frame.
5. The method of claim 3 or 4, wherein receiving the standard calibration frame calibrated by the user on the video containing the human hand region further comprises:
    enlarging, shrinking, moving, or deleting the displayed standard calibration frame when an enlarging, shrinking, moving, or deleting instruction is received.
6. The method of claim 5, wherein the method further comprises:
    preprocessing the region calibrated by the standard calibration frame, the preprocessing comprising one or a combination of the following: grayscale processing and correction processing.
7. The method of claim 6, wherein the method further comprises:
    acquiring depth information, in the video containing the human hand region, corresponding to the region calibrated by the calibration frame, and normalizing the hand image according to the depth information, the normalization being: S2*(H2/H1), where S1 is the size of the hand image segmented from the region calibrated by the first standard calibration frame, H1 is the depth-of-field information corresponding to the region calibrated the first time, S2 is the size of the hand image segmented from the region calibrated by the current standard calibration frame, and H2 is the depth-of-field information corresponding to the region calibrated by the current calibration frame.
8. A fast hand tracking device, wherein the device comprises:
    a display module, configured to display, on a display interface, a video containing a human hand region collected by an imaging device;
    a calibration module, configured to receive a calibration frame calibrated by a user on the video containing the human hand region;
    a segmentation module, configured to extract histogram of oriented gradients (HOG) features of the region calibrated by the calibration frame, and to segment the region calibrated by the calibration frame according to the HOG features to obtain a hand image; and
    a tracking module, configured to track the hand image using a continuously adaptive mean-shift (CamShift) operator, wherein tracking the hand image using the continuously adaptive mean-shift operator specifically comprises:
    converting the color space of the hand image to the HSV color space and separating out the hand image of the hue component; and, based on the hand image I(i,j) of the hue component and the centroid position and size of an initialized search box, calculating the centroid position (M10/M00, M01/M00) of the current search window and the size of the current search window
    s = 2 * sqrt(M00 / 256),
    where
    M10 = Σi Σj i·I(i,j) and M01 = Σi Σj j·I(i,j)
    are the first-order moments of the current search window,
    M00 = Σi Σj I(i,j)
    is the zeroth-order moment of the current search window, i is the pixel coordinate of I(i,j) in the horizontal direction, and j is the pixel coordinate of I(i,j) in the vertical direction.
9. A terminal, wherein the terminal comprises a processor and a memory, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
    displaying, on a display interface, a video containing a human hand region collected by an imaging device;
    receiving a calibration frame calibrated by a user on the video containing the human hand region;
    extracting histogram of oriented gradients (HOG) features of the region calibrated by the calibration frame, and segmenting the region calibrated by the calibration frame according to the HOG features to obtain a hand image; and
    tracking the hand image using a continuously adaptive mean-shift (CamShift) operator, wherein tracking the hand image using the continuously adaptive mean-shift operator specifically comprises:
    converting the color space of the hand image to the HSV color space and separating out the hand image of the hue component; and, based on the hand image I(i,j) of the hue component and the centroid position and size of an initialized search box, calculating the centroid position (M10/M00, M01/M00) of the current search window and the size of the current search window
    s = 2 * sqrt(M00 / 256),
    where
    M10 = Σi Σj i·I(i,j) and M01 = Σi Σj j·I(i,j)
    are the first-order moments of the current search window,
    M00 = Σi Σj I(i,j)
    is the zeroth-order moment of the current search window, i is the pixel coordinate of I(i,j) in the horizontal direction, and j is the pixel coordinate of I(i,j) in the vertical direction.
10. The terminal of claim 9, wherein displaying, on the display interface, the video containing the human hand region collected by the imaging device further comprises:
    displaying a preset standard calibration frame in a preset display manner, the preset display manner comprising one or a combination of the following:
    displaying the preset standard calibration frame when a display instruction is received;
    hiding the preset standard calibration frame when a hiding instruction is received; and
    automatically hiding the preset standard calibration frame when, after the display instruction has been received and the preset standard calibration frame displayed, no further instruction is received for longer than a preset time period.
11. The terminal of claim 10, wherein receiving the calibration frame calibrated by the user on the video containing the human hand region comprises:
    receiving a standard calibration frame calibrated by the user on the video containing the human hand region, comprising:
    receiving a rough calibration frame drawn by the user in the video containing the human hand region;
    matching, by fuzzy matching, a preset standard calibration frame corresponding to the rough calibration frame; and
    calibrating the video containing the human hand region according to the matched standard calibration frame and displaying the calibrated standard calibration frame, wherein the geometric center of the rough calibration frame is the same as the geometric center of the matched standard calibration frame.
12. The terminal of claim 10, wherein receiving the calibration frame calibrated by the user on the video containing the human hand region comprises:
    receiving a standard calibration frame calibrated by the user on the video containing the human hand region, comprising:
    directly receiving a standard calibration frame selected by the user, calibrating the video containing the human hand region according to the standard calibration frame, and displaying the calibrated standard calibration frame.
13. The terminal of claim 10, wherein the processor is further configured to execute the computer-readable instructions to implement the following steps:
    preprocessing the region calibrated by the standard calibration frame, the preprocessing comprising one or a combination of the following: grayscale processing and correction processing.
14. The terminal of claim 13, wherein the processor is further configured to execute the computer-readable instructions to implement the following steps:
    acquiring depth information, in the video containing the human hand region, corresponding to the region calibrated by the calibration frame, and normalizing the hand image according to the depth information, the normalization being: S2*(H2/H1), where S1 is the size of the hand image segmented from the region calibrated by the first standard calibration frame, H1 is the depth-of-field information corresponding to the region calibrated the first time, S2 is the size of the hand image segmented from the region calibrated by the current standard calibration frame, and H2 is the depth-of-field information corresponding to the region calibrated by the current calibration frame.
15. A non-volatile readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    displaying, on a display interface, a video containing a human hand region collected by an imaging device;
    receiving a calibration frame calibrated by a user on the video containing the human hand region;
    extracting histogram of oriented gradients (HOG) features of the region calibrated by the calibration frame, and segmenting the region calibrated by the calibration frame according to the HOG features to obtain a hand image; and
    tracking the hand image using a continuously adaptive mean-shift (CamShift) operator, wherein tracking the hand image using the continuously adaptive mean-shift operator specifically comprises:
    converting the color space of the hand image to the HSV color space and separating out the hand image of the hue component; and, based on the hand image I(i,j) of the hue component and the centroid position and size of an initialized search box, calculating the centroid position (M10/M00, M01/M00) of the current search window and the size of the current search window
    s = 2 * sqrt(M00 / 256),
    where
    M10 = Σi Σj i·I(i,j) and M01 = Σi Σj j·I(i,j)
    are the first-order moments of the current search window,
    M00 = Σi Σj I(i,j)
    is the zeroth-order moment of the current search window, i is the pixel coordinate of I(i,j) in the horizontal direction, and j is the pixel coordinate of I(i,j) in the vertical direction.
16. The storage medium of claim 15, wherein displaying, on the display interface, the video containing the human hand region collected by the imaging device further comprises:
    displaying a preset standard calibration frame in a preset display manner, the preset display manner comprising one or a combination of the following:
    displaying the preset standard calibration frame when a display instruction is received;
    hiding the preset standard calibration frame when a hiding instruction is received; and
    automatically hiding the preset standard calibration frame when, after the display instruction has been received and the preset standard calibration frame displayed, no further instruction is received for longer than a preset time period.
17. The storage medium of claim 16, wherein receiving the calibration frame calibrated by the user on the video containing the human hand region comprises:
    receiving a standard calibration frame calibrated by the user on the video containing the human hand region, comprising:
    receiving a rough calibration frame drawn by the user in the video containing the human hand region;
    matching, by fuzzy matching, a preset standard calibration frame corresponding to the rough calibration frame; and
    calibrating the video containing the human hand region according to the matched standard calibration frame and displaying the calibrated standard calibration frame, wherein the geometric center of the rough calibration frame is the same as the geometric center of the matched standard calibration frame.
18. The storage medium of claim 16, wherein receiving the calibration frame calibrated by the user on the video containing the human hand region comprises:
    receiving a standard calibration frame calibrated by the user on the video containing the human hand region, comprising:
    directly receiving a standard calibration frame selected by the user, calibrating the video containing the human hand region according to the standard calibration frame, and displaying the calibrated standard calibration frame.
19. The storage medium of claim 16, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    preprocessing the region calibrated by the standard calibration frame, the preprocessing comprising one or a combination of the following: grayscale processing and correction processing.
20. The storage medium of claim 19, wherein the computer-readable instructions, when executed by the processor, further implement the following steps:
    acquiring depth information, in the video containing the human hand region, corresponding to the region calibrated by the calibration frame, and normalizing the hand image according to the depth information, the normalization being: S2*(H2/H1), where S1 is the size of the hand image segmented from the region calibrated by the first standard calibration frame, H1 is the depth-of-field information corresponding to the region calibrated the first time, S2 is the size of the hand image segmented from the region calibrated by the current standard calibration frame, and H2 is the depth-of-field information corresponding to the region calibrated by the current calibration frame.
PCT/CN2018/100227 2018-04-18 2018-08-13 Fast hand tracking method, device, terminal, and storage medium WO2019200785A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810349972.XA CN108682021B (en) 2018-04-18 2018-04-18 Rapid hand tracking method, device, terminal and storage medium
CN201810349972.X 2018-04-18

Publications (1)

Publication Number Publication Date
WO2019200785A1 true WO2019200785A1 (en) 2019-10-24

Family

ID=63801123

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100227 WO2019200785A1 (en) 2018-04-18 2018-08-13 Fast hand tracking method, device, terminal, and storage medium

Country Status (2)

Country Link
CN (1) CN108682021B (en)
WO (1) WO2019200785A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886928B (en) * 2019-01-24 2023-07-14 平安科技(深圳)有限公司 Target cell marking method, device, storage medium and terminal equipment
SG10201913029SA (en) * 2019-12-23 2021-04-29 Sensetime Int Pte Ltd Target tracking method and apparatus, electronic device, and storage medium
CN115701873A (en) * 2021-07-19 2023-02-14 北京字跳网络技术有限公司 Image matching method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015139750A1 (en) * 2014-03-20 2015-09-24 Telecom Italia S.P.A. System and method for motion capture
CN105825524A (en) * 2016-03-10 2016-08-03 浙江生辉照明有限公司 Target tracking method and apparatus
CN105957107A (en) * 2016-04-27 2016-09-21 北京博瑞空间科技发展有限公司 Pedestrian detecting and tracking method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8963829B2 (en) * 2009-10-07 2015-02-24 Microsoft Corporation Methods and systems for determining and tracking extremities of a target
CN103390168A (en) * 2013-07-18 2013-11-13 重庆邮电大学 Intelligent wheelchair dynamic gesture recognition method based on Kinect depth information
CN105678809A (en) * 2016-01-12 2016-06-15 湖南优象科技有限公司 Handheld automatic follow shot device and target tracking method thereof
CN106157308A (en) * 2016-06-30 2016-11-23 北京大学 Rectangular target object detecting method
US20180047173A1 (en) * 2016-08-12 2018-02-15 Qualcomm Incorporated Methods and systems of performing content-adaptive object tracking in video analytics
CN107240117B (en) * 2017-05-16 2020-05-15 上海体育学院 Method and device for tracking moving object in video

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015139750A1 (en) * 2014-03-20 2015-09-24 Telecom Italia S.P.A. System and method for motion capture
CN105825524A (en) * 2016-03-10 2016-08-03 浙江生辉照明有限公司 Target tracking method and apparatus
CN105957107A (en) * 2016-04-27 2016-09-21 北京博瑞空间科技发展有限公司 Pedestrian detecting and tracking method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LAN, TIANTIAN ET AL.: "Study on Hand Gesture Recognition Used for Air Conditioning Control", CHINESE MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE & TECHNOLOGY, 1 May 2015 (2015-05-01), pages 1 - 70, XP055645970 *

Also Published As

Publication number Publication date
CN108682021A (en) 2018-10-19
CN108682021B (en) 2021-03-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18914953

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.12.2020)