CN103530892B - Two-hand tracking method and device based on a Kinect sensor - Google Patents

Two-hand tracking method and device based on a Kinect sensor

Info

Publication number
CN103530892B
Authority
CN
China
Prior art keywords
hand
tracking
Kinect sensor
depth
Prior art date
2013-10-21
Legal status
Active
Application number
CN201310497334.XA
Other languages
Chinese (zh)
Other versions
CN103530892A (en)
Inventor
朱艳敏 (Zhu Yanmin)
袁博 (Yuan Bo)
Current Assignee
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date
2013-10-21
Filing date
2013-10-21
Publication date
2016-06-22
Application filed by Shenzhen Graduate School Tsinghua University
Priority to CN201310497334.XA
Publication of CN103530892A
Application granted
Publication of CN103530892B
Legal status: Active
Anticipated expiration

Landscapes

  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a two-hand tracking method and device based on a Kinect sensor. The tracking method includes: S1, a video acquisition step; S2, a first detection step; S3, a single-hand tracking step; S4, a second detection step; and S5, a two-hand tracking step, in which the two detected hands are tracked. Because information about the first hand is used when detecting the second hand, the tracking method of the invention can track the motion of both of the user's hands quickly and accurately, with low computational complexity.

Description

Two-hand tracking method and device based on a Kinect sensor
Technical field
The present invention relates to gesture recognition and two-hand tracking in the field of human-computer interaction, and in particular to a two-hand tracking method and device based on a Kinect sensor.
Background art
With the rapid development of computer technology and the evolution of users' expectations, ever higher demands are placed on the perceptual abilities of computers. Traditional interaction relies on the physical keyboard and mouse, and its largely text-based input can no longer satisfy users' needs. Novel "human-centered" interaction modes break the constraints of the traditional approach, shifting input to richer and more natural forms such as images and sound, and substantially improving the user experience. In recent years, significant progress has been made in research fields such as face recognition, speech recognition, human posture recognition, and gesture recognition.
Gestures play an extremely important role in everyday life, and gesture recognition based on computer vision is a key enabling technology for the next generation of human-computer interaction. The crux, and most of the difficulty, of gesture recognition lies in tracking the human hand. The main challenges are: first, interference from complex backgrounds, such as the face and other skin-colored regions; second, the hand deforms during motion; third, changes in illumination strongly affect the hand's appearance; and fourth, the system must run in real time.
Existing gesture recognition techniques mostly locate and track a single hand, using methods such as the Camshift (Continuously Adaptive Mean-Shift) algorithm and feature-space matching. These methods only achieve good results under specific conditions. Compared with the single-hand case, tracking and recognizing two hands raises new challenges, such as how to accurately distinguish and track the two hands after they occlude each other. One prior-art approach requires that the shapes of the two hands remain unchanged before and after the occlusion, and uses the pre-occlusion shape information to re-identify the hands once they separate; but this places a heavy restriction on the user and is neither natural nor convenient. Microsoft's Kinect greatly facilitates gesture tracking and recognition: the depth information it provides enormously simplifies background removal. One prior-art method uses the depth and color images provided by the Kinect to build a 3D model of the two hands and track it, and can accurately locate details such as the hand joints; however, its computational complexity is very high, and it cannot reach real-time tracking even with GPU (Graphics Processing Unit) acceleration. Because the Kinect itself can extract and track the human skeleton, another prior-art method extracts the hand positions from the hand nodes of the skeleton and tracks them; but it requires the user to hold a standard sitting or standing posture, which restricts the user too much, and its recognition and tracking performance is not ideal.
Summary of the invention
The technical problem to be solved by the invention is to overcome the defects of the aforementioned prior art and to provide a two-hand tracking method based on a Kinect sensor that can handle complex backgrounds and different lighting conditions while imposing few restrictions on the user's hand posture, including:
S1, a video acquisition step: acquiring from the Kinect sensor a color video stream and a depth video stream with the same resolution and frame rate;
S2, a first detection step: detecting, from the images of the acquired color and depth video streams, a first hand making an initial gesture;
S3, a single-hand tracking step: locating and tracking the first hand using its position and size information in the previous frame or previous two frames;
S4, a second detection step: detecting a second hand using the position and size information of the first hand;
S5, a two-hand tracking step: tracking the two detected hands.
According to embodiments, the invention may also adopt the following preferred technical solutions:
Step S2 further includes: S2-1, a sample training step; S2-2, a mode selection step; and S2-3, an initial gesture determination step.
In step S2-1, an SVM (Support Vector Machine) classifier is selected to learn and train on the morphological information of the hand, with geometric invariant moments chosen as the training features.
In step S2-2, when the lighting is suitable, a skin-color mode is selected, i.e., the first hand is extracted by skin-color filtering combined with depth filtering; when the light is too dark or too bright, a shape mode is selected, i.e., the first hand is extracted by shape filtering combined with depth filtering.
In step S2-3, the initial gesture is defined as: the hand extends forward, at a distance of more than a threshold d from the body.
The locating in step S3 is: predicting the ROI (Region of Interest) of the first hand in the current frame from the position and bounding rectangle of the first hand in the previous frame or previous two frames, and performing depth filtering within this ROI to locate the first hand in the current frame.
The two-hand tracking in step S5 includes:
1) while the two hands are separated, before any mutual occlusion, the two targets are tracked separately: the ROI of each target in the current frame is predicted from the position and size information of the two hands in the previous frame or previous two frames, and detection is performed within each of the two regions;
2) while the two hands occlude each other, the two detected target trajectories coincide, and tracking is treated as tracking a single target;
3) when the two hands separate after the mutual occlusion, they are distinguished according to the invariance of their positional relationship in the depth direction before and after the occlusion, and are tracked separately.
The invention also provides a Kinect-based two-hand tracking device, including the following modules:
a video acquisition module, for acquiring from the Kinect sensor a color video stream and a depth video stream with the same resolution and frame rate;
a first detection module, for detecting, from the acquired color and depth images, a first hand making an initial gesture;
a single-hand tracking module, including a locating unit and a tracking unit, for locating and tracking the first hand using its position and size information in the previous frame or previous two frames;
a second detection module, for detecting a second hand using the position and size information of the first hand;
a two-hand tracking module, for tracking the two detected hands.
The following preferred technical solutions may also be adopted according to embodiments:
The first detection module includes a sample training unit, a mode selection unit, and an initial gesture determination unit.
The sample training unit selects an SVM classifier to learn and train on the morphological information of the hand, with geometric invariant moments chosen as the training features.
The mode selection unit is configured to: when the lighting is suitable, select a skin-color mode, i.e., extract the first hand by skin-color filtering combined with depth filtering; when the light is too dark or too bright, select a shape mode, i.e., extract the first hand by shape filtering combined with depth filtering.
In the initial gesture determination unit, the initial gesture is defined as: the hand extends forward, at a distance of more than a threshold d from the body.
The locating unit is configured to: predict the ROI of the first hand in the current frame from the position and bounding rectangle of the first hand in the previous frame or previous two frames, and perform depth filtering within this ROI to locate the first hand in the current frame.
The two-hand tracking module is configured to:
1) while the two hands are separated, before any mutual occlusion, track the two targets separately: predict the ROI of each target in the current frame from the position and size information of the two hands in the previous frame or previous two frames, and detect within each of the two regions;
2) while the two hands occlude each other, treat the two coinciding detected target trajectories as a single target and track it;
3) when the two hands separate after the mutual occlusion, distinguish them according to the invariance of their positional relationship in the depth direction before and after the occlusion, and track them separately.
Compared with the prior art, the invention provides the following benefits:
Because information about the first hand is used when detecting the second hand, the tracking method of the invention can track the motion of both of the user's hands quickly and accurately, with low computational complexity.
In one preferred technical solution, because depth filtering is adopted and a region of interest (ROI) is set around the target during tracking, the method is not subject to interference from complex backgrounds: depth filtering removes the influence of distractors behind the target, such as the user's face, while the ROI restricts detection in the next frame to the vicinity of the target, excluding objects outside the ROI, such as the hands and faces of onlookers. This also brings a further benefit: very few restrictions are placed on the user's posture.
In another preferred technical solution, the user can select a suitable detection mode according to the lighting conditions, so the method can adapt to different lighting.
In another preferred technical solution, the hand's shape information is not used when detecting the hand during tracking, so deformation of the hand during motion does not affect the tracking.
Brief description of the drawings
Fig. 1 is a flow chart of the two-hand tracking method of one embodiment of the invention.
Fig. 2 is a flow chart of the detection of the first hand's initial gesture in an embodiment.
Detailed description of the invention
The invention is explained in detail below with reference to the accompanying drawings and in conjunction with preferred embodiments.
Embodiment 1
A Kinect-based two-hand tracking method, including:
S1, a video acquisition step: acquiring from the Kinect sensor a color video stream and a depth video stream with the same resolution and frame rate;
S2, a first detection step: detecting, from the images of the acquired color and depth video streams, a first hand making an initial gesture;
S3, a single-hand tracking step: locating and tracking the first hand using its position and size information in the previous frame or previous two frames;
S4, a second detection step: detecting a second hand using the position and size information of the first hand;
S5, a two-hand tracking step: tracking the two detected hands.
Step S2 further includes: S2-1, a sample training step; S2-2, a mode selection step; and S2-3, an initial gesture determination step.
In step S2-1, an SVM classifier is selected to learn and train on the morphological information of the hand, with geometric invariant moments chosen as the training features.
In step S2-2, when the lighting is suitable, a skin-color mode is selected, i.e., the first hand is extracted by skin-color filtering combined with depth filtering; when the light is too dark or too bright, a shape mode is selected, i.e., the first hand is extracted by shape filtering combined with depth filtering.
In step S2-3, the initial gesture is defined as: the hand extends forward, at a distance of more than a threshold d from the body.
The locating in step S3 is: predicting the ROI of the first hand in the current frame from the position and bounding rectangle of the first hand in the previous frame or previous two frames, and performing depth filtering within this ROI to locate the first hand in the current frame.
The two-hand tracking in step S5 includes:
1) while the two hands are separated, before any mutual occlusion, the two targets are tracked separately: the ROI of each target in the current frame is predicted from the position and size information of the two hands in the previous frame or previous two frames, and detection is performed within each of the two regions;
2) while the two hands occlude each other, the two detected target trajectories coincide, and tracking is treated as tracking a single target;
3) when the two hands separate after the mutual occlusion, they are distinguished according to the invariance of their positional relationship in the depth direction before and after the occlusion, and are tracked separately.
Embodiment 2
As shown in Fig. 1, the flow of two-hand tracking in this embodiment includes:
Step 1) Acquire the video streams. For example, a Kinect sensor is used to acquire a color video stream and a depth video stream, each with a resolution of 640*480 and a frame rate of 30 fps.
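For illustration only, the sketch below shows one way to obtain synchronized color and depth frames in Python. It assumes an OpenCV build with the OpenNI backend (the cv2.CAP_OPENNI constants); the patent does not specify a driver, and any stack exposing aligned color and depth streams would serve equally well.

```python
import cv2

# Minimal acquisition sketch (an assumption, not the patent's implementation):
# OpenCV's OpenNI backend exposes the Kinect's 640x480, 30 fps streams.
cap = cv2.VideoCapture(cv2.CAP_OPENNI)

while cap.grab():  # grab one synchronized color+depth pair per iteration
    ok_d, depth_mm = cap.retrieve(flag=cv2.CAP_OPENNI_DEPTH_MAP)  # 16-bit depth map (mm)
    ok_c, color = cap.retrieve(flag=cv2.CAP_OPENNI_BGR_IMAGE)     # 8-bit BGR image
    if not (ok_d and ok_c):
        break
    # ... the detection and tracking steps below consume (color, depth_mm) pairs ...
    if cv2.waitKey(30) == 27:  # Esc quits
        break
cap.release()
```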
Step 2) Detect the first hand, i.e., detect the first hand from the acquired color and depth images. Specifically, the nearest sufficiently large target object is found as the first hand according to the depth information together with skin-color information or shape information.
Step 3) Judge whether the initial gesture is detected, i.e., judge whether the first hand found is making the initial gesture. When the hand extends forward and its distance from the body exceeds a threshold d, it is determined that this hand is making a valid initial gesture, and tracking begins. The hand is labeled Hand1, and its 3D position information is stored in this hand's trajectory traj1.
Here d is typically set to 15~25 cm.
Step 4) Track the single hand, i.e., track the detected first hand Hand1. Specifically, according to the position and size information of the hand in the previous two frames, the likely position region of the target in the current frame, i.e., the region of interest (ROI), is predicted. The most probable target is found within the ROI of the depth map, and its 3D position information is stored in Hand1's trajectory traj1. The "previous two frames" are the two frames before the current frame during tracking: if the current frame is f_t, they are f_{t-1} and f_{t-2}. Using the previous two frames is a preferred choice rather than a necessity; for example, when tracking has just started and only one earlier frame (the frame in which the first hand was detected) contains hand information, only the position and size information of the hand in the previous frame is used.
The main function of this step is to predict the position of the target in the current frame from the information of the previous N frames. The larger N is, the more information can be used, but the more complicated the prediction formula becomes. Moreover, the larger N is, the smaller the reference value of the N-th previous frame, so a larger N is not always better; N = 2 or 3 is likely appropriate.
Step 5) Detect the second hand. Because the natural initial state of a two-hand gesture is that both hands are stretched out at similar depths making similar gestures, the position and size information of the first hand Hand1 in the current frame can be used to detect the second hand over the whole image. The depth map is filtered to find a target whose area and depth are both similar to those of the first hand; this is simpler and faster than re-searching for the other hand using color or shape information.
Step 6) Judge whether the target is detected, i.e., judge whether step 5 found a second hand. If found, it is labeled as the second hand Hand2, and its 3D position information is stored in Hand2's trajectory traj2; if not, single-hand tracking continues.
Step 7) Track both hands, i.e., after the second hand is found, track both hands (Hand1 and Hand2). While the two hands do not occlude or overlap each other, this amounts to tracking two independent targets: the likely position regions of the two targets in the current frame are predicted from the targets' position and size information in the previous two frames, the respective regions of interest ROI1 and ROI2 are set, and depth filtering is performed within ROI1 and ROI2 of the depth map to find the most probable target in each. While the two hands occlude each other or overlap, the same target is detected in both ROI1 and ROI2, the position information of the two hands is identical, and their trajectories coincide. After the two hands separate following the occlusion, Hand1 and Hand2 are distinguished according to the invariance of the depth relationship between the two hands before and after the occlusion: if Hand1 was in front of Hand2 before the occlusion, then Hand1 should still be in front of Hand2 after they separate.
Step 8) Output the hand trajectories, i.e., after the two hands are distinguished, their respective 3D position information is stored in the corresponding trajectories.
As shown in Fig. 2, the flow of detecting the first hand's initial gesture in this embodiment mainly includes:
Step 201, acquire the color image and depth image: an RGB color image and a grayscale depth image, each with a resolution of 640*480 and a frame rate of 30 fps, are obtained from the Kinect.
Step 202, judge whether the light is suitable. The detection mode is selected according to the illumination: if the illumination is moderate, the skin-color mode is selected; if it is too weak or too bright, the shape mode is selected.
Step 203, skin-color filtering. The skin-color mode first converts the RGB picture to the YCbCr color space, in which skin color is more concentrated; Y represents luminance (luma), and Cb and Cr represent chrominance, Cb being the blue-difference component and Cr the red-difference component. The conversion formulas are:
Y=0.299*R+0.587*G+0.114*B
Cr=(R-Y)*0.713+128
Cb=(B-Y)*0.564+128
Skin-color filtering is then performed: the YCbCr color picture is filtered with an elliptical skin model to obtain the binary skin image mask1.
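As a minimal sketch of step 203 (not the patent's exact implementation), the function below converts a BGR frame to YCbCr with OpenCV and applies an axis-aligned elliptical skin model in the (Cb, Cr) plane; the ellipse center and axes are illustrative assumptions, since the patent does not list its ellipse parameters.

```python
import cv2
import numpy as np

def skin_mask(bgr):
    """Sketch of step 203: keep pixels inside an elliptical skin region
    of the (Cb, Cr) plane. Ellipse centre/axes are assumed values."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)  # OpenCV channel order: Y, Cr, Cb
    cr = ycrcb[:, :, 1].astype(np.float32)
    cb = ycrcb[:, :, 2].astype(np.float32)
    cx, cy, a, b = 109.0, 152.0, 25.0, 15.0         # assumed centre (Cb, Cr) and semi-axes
    inside = ((cb - cx) / a) ** 2 + ((cr - cy) / b) ** 2 <= 1.0
    return (inside * 255).astype(np.uint8)          # binary skin image mask1
```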
Step 204, depth filtering: the skin binary image mask1 is depth-filtered in conjunction with the depth map. All connected components with area greater than the minimum hand area Amin are found in mask1, and the 3 largest of them are selected, giving a new skin binary image mask2. From mask2, the connected component SRn with the smallest depth is chosen and depth-filtered: the region with depth in the range [Dmin, Dmin+l] is taken as the candidate target region Rc. Here Dmin is the minimum depth value of the connected component SRn, and l is the threshold for segmenting the hand, typically set to 5~8 cm (meaning the depth distance from the front of the hand, e.g., the fingertips, to the wrist; the "threshold for segmenting the hand" below has the same meaning).
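The following sketch outlines step 204 under stated assumptions: a_min is a placeholder for Amin (the patent does not give a pixel value) and l_mm encodes the 5~8 cm threshold l; connected components are taken with OpenCV.

```python
import cv2
import numpy as np

def candidate_hand_region(mask1, depth_mm, a_min=800, l_mm=60):
    """Sketch of step 204: keep the 3 largest skin components above Amin
    (mask2), pick the one nearest the camera (SRn), and retain only its
    pixels with depth in [Dmin, Dmin + l]. a_min and l_mm are assumed."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask1)
    comps = sorted(((stats[i, cv2.CC_STAT_AREA], i) for i in range(1, n)
                    if stats[i, cv2.CC_STAT_AREA] > a_min), reverse=True)[:3]
    best, best_dmin = None, np.inf
    for _, i in comps:                              # mask2: the 3 largest components
        comp = labels == i
        d = depth_mm[comp & (depth_mm > 0)]         # ignore invalid zero depths
        if d.size and d.min() < best_dmin:
            best, best_dmin = comp, float(d.min())  # SRn: smallest-depth component
    if best is None:
        return None
    rc = best & (depth_mm >= best_dmin) & (depth_mm <= best_dmin + l_mm)
    return rc.astype(np.uint8) * 255                # candidate target region Rc
```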
Step 205, sample training: the shape mode requires that no object interferes between the user and the camera, that the user is in the foreground, and that the palm is open when making the initial gesture. A database of positive and negative samples of the open hand is collected first, and then features and a classifier are selected for learning and training on the hand. Because the geometric invariant Hu moments are invariant to translation, rotation, and scale, they are selected as the features for training the classifier. The Hu moments are defined as follows, where ηpq is the normalized central moment of order (p+q):
I1 = η20 + η02
I2 = (η20 − η02)² + 4η11²
I3 = (η30 − 3η12)² + (3η21 − η03)²
I4 = (η30 + η12)² + (η21 + η03)²
I5 = (η30 − 3η12)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] + (3η21 − η03)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]
I6 = (η20 − η02)[(η30 + η12)² − (η21 + η03)²] + 4η11(η30 + η12)(η21 + η03)
I7 = (3η21 − η03)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] − (η30 − 3η12)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]
An SVM (Support Vector Machine) has distinct advantages for small-sample, nonlinear, and high-dimensional classification problems, so an SVM is selected to learn and train on the Hu moment features of the hand, producing a hand classifier.
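A sketch of the step 205 training pipeline, assuming scikit-learn's SVC as the SVM implementation; the log scaling of the Hu moments and the RBF kernel are common conventions assumed here, not stated in the patent.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def hu_features(binary_roi):
    """The seven Hu moment invariants of a binary hand silhouette
    (log-scaled for numeric stability; the scaling is an assumption)."""
    hu = cv2.HuMoments(cv2.moments(binary_roi, binaryImage=True)).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def train_hand_classifier(pos_rois, neg_rois):
    """Sketch of step 205: fit an SVM on Hu-moment features of positive
    (open hand) and negative silhouettes."""
    X = np.array([hu_features(r) for r in list(pos_rois) + list(neg_rois)])
    y = np.array([1] * len(pos_rois) + [0] * len(neg_rois))
    clf = SVC(kernel="rbf")  # kernel choice is an assumption
    clf.fit(X, y)
    return clf
```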
Step 206, depth filtering: the depth map is depth-filtered to locate the largest object contour with depth in the range [Dmin, Dmin+l]. If its area exceeds the minimum hand area Amin, the region enclosed by this contour is the probable target region Rn. Here Dmin is the minimum depth value of the depth map, and l is the threshold for segmenting the hand, typically set to 5~8 cm.
Step 207, shape discrimination: the trained SVM hand classifier classifies the probable target region Rn and determines whether it is an open hand. If so, this probable target region becomes the candidate target region Rc.
Step 208, judge whether the target hand is found: judge whether the candidate target region Rc is a human hand making the gesture, i.e., whether the hand is making the initial gesture. The depth image is first filtered, taking the region with depth in the range [Dmin, Dmin+d]; the region Rb that contains Rc is chosen, and the area ratio of the two regions is calculated:
ratio=area(Rc)/area(Rb)
Whether the target is a stretched-out hand is judged from the range of this ratio.
If the ratio falls within [0.5, 1], Rb is judged to contain no body, only the arm; the candidate target Rc is then judged to be a stretched-out hand, and the gesture is a valid initial gesture. Otherwise it is not a valid initial gesture.
Here Dmin is the minimum depth of the candidate target region Rc, and d is the threshold for judging whether the hand is stretched out, typically set to 15~25 cm.
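A sketch of the step 208 test under a simplifying assumption noted in the comments: Rb is approximated by the whole depth band [Dmin, Dmin+d] rather than by its connected region containing Rc, as in the patent.

```python
import numpy as np

def is_initial_gesture(depth_mm, rc_mask, d_mm=200):
    """Sketch of step 208. d_mm encodes the 15-25 cm threshold d; Rb is
    approximated by the whole depth band (a simplification)."""
    valid = depth_mm[(rc_mask > 0) & (depth_mm > 0)]
    if valid.size == 0:
        return False
    dmin = float(valid.min())                       # Dmin of the candidate Rc
    rb = (depth_mm >= dmin) & (depth_mm <= dmin + d_mm)
    ratio = np.count_nonzero(rc_mask) / max(np.count_nonzero(rb), 1)
    return 0.5 <= ratio <= 1.0                      # hand counts as stretched out
```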
Step 209, lock the target: if a valid initial gesture has been found, the target hand is labeled Hand1, and its 3D position information is stored in Hand1's trajectory traj1.
After the initial gesture is found, the later steps are implemented in detail as follows:
Single-hand tracking: after the first hand is found, the single-hand tracking module is entered, and the position and size information of Hand1 in the previous two frames f_{t-2} and f_{t-1} determines the rectangular region of interest ROI1 in the current frame f_t. ROI1 is defined by:
x = 2x_{t-1} − x_{t-2}
y = 2y_{t-1} − y_{t-2}
width = 1.5 × width(boundRect)
height = 1.5 × height(boundRect)
Here (x, y) is the center of ROI1, width and height are its width and height, (x_{t-2}, y_{t-2}) and (x_{t-1}, y_{t-1}) are the image-plane coordinates of Hand1's center in the previous two frames f_{t-2} and f_{t-1}, and boundRect is the bounding rectangle of the hand region in the previous frame f_{t-1}. If t < 2, i.e., the current frame is the first frame after the initial gesture was detected, only the previous frame contains hand information; the center of ROI1 is then x = x_{t-1}, y = y_{t-1}, with the same width and height as above. Within ROI1 of the depth map, the largest connected region whose depth lies in [Dmin, Dmin+l] and whose area exceeds Amin is found and determined to be the tracked target, and its 3D position information is stored in Hand1's trajectory traj1. Here Dmin is the minimum depth value within ROI1, and l is the threshold for segmenting the hand, typically set to 5~8 cm.
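The ROI prediction above reduces to a few lines; the sketch below assumes pixel coordinates and an (x, y, w, h) bounding-rectangle convention.

```python
def predict_roi(p_prev, p_prev2, bound_rect):
    """Sketch of the ROI1 prediction: linear extrapolation of the hand
    centre from frames t-1 (p_prev) and t-2 (p_prev2), with a window 1.5x
    the previous frame's bounding rectangle (x, y, w, h). For the first
    tracked frame, pass p_prev2 = p_prev."""
    x = 2 * p_prev[0] - p_prev2[0]        # x = 2*x_{t-1} - x_{t-2}
    y = 2 * p_prev[1] - p_prev2[1]        # y = 2*y_{t-1} - y_{t-2}
    w = 1.5 * bound_rect[2]
    h = 1.5 * bound_rect[3]
    return (x - w / 2, y - h / 2, w, h)   # ROI as (left, top, width, height)
```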
Detecting the second hand: a target whose depth value and area are both similar to Hand1's is sought in the current frame, i.e., a connected region whose depth lies in the range [z_t − 2 cm, z_t + l + 2 cm] and whose area lies in the range [area_t × 0.8, area_t × 1.2]. Such a region is labeled as the second hand Hand2, and its 3D position information is stored in its corresponding trajectory traj2. Here z_t is the minimum depth value of Hand1, area_t is the area of Hand1, and l is the threshold for segmenting the hand, typically set to 5~8 cm.
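A sketch of this global second-hand search; the thresholds follow the patent's ranges, while the exclusion of Hand1's own region (which would otherwise match itself) is noted but omitted.

```python
import cv2
import numpy as np

def detect_second_hand(depth_mm, z1_mm, area1, l_mm=60):
    """Sketch: find a connected region whose depth lies in
    [z1 - 20, z1 + l + 20] (mm) and whose area is within
    [0.8, 1.2] x area(Hand1). Excluding Hand1's own region
    is omitted here for brevity."""
    band = (depth_mm >= z1_mm - 20) & (depth_mm <= z1_mm + l_mm + 20)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(band.astype(np.uint8))
    for i in range(1, n):
        if 0.8 * area1 <= stats[i, cv2.CC_STAT_AREA] <= 1.2 * area1:
            return labels == i            # binary mask of the candidate Hand2
    return None
```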
Two-hand tracking: while the two hands are in the separated state before any occlusion, the respective regions of interest ROI1 and ROI2 in the current frame f_t are set according to the positions and sizes of Hand1 and Hand2 in the previous two frames f_{t-2} and f_{t-1}. Target Hand1 is sought in ROI1 and target Hand2 in ROI2, by the same method as in single-hand tracking. When the two hands occlude each other, ROI1 and ROI2 overlap, the two detected targets are actually the same target, and the trajectories of the two hands coincide. When the two hands separate after the occlusion, they are distinguished according to the invariance of their positional relationship in the depth direction before and after the occlusion:
z_s^1 < z_s^2 ⇒ z_t^1 < z_t^2
where s is a time before the occlusion, t is a time after the hands separate, z_s^1 and z_s^2 are the depth values of Hand1 and Hand2 at time s, and z_t^1 and z_t^2 are the depth values of Hand1 and Hand2 at time t. That is, if Hand1 was in front of Hand2 before the mutual occlusion, then after the occlusion ends Hand1 is still in front of Hand2. After the two hands are distinguished in this way, they are labeled and their 3D position information is stored in their respective trajectories.
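The depth-order rule can be expressed as a small re-labeling function; the candidate representation below is an assumption for illustration.

```python
def relabel_after_occlusion(z1_before, z2_before, cand_a, cand_b):
    """Sketch of the depth-order rule: the hand that was nearer the camera
    before the occlusion stays nearer after it. Each candidate is assumed
    to be a (mask, min_depth) pair; this representation is illustrative."""
    near, far = sorted((cand_a, cand_b), key=lambda c: c[1])
    if z1_before < z2_before:             # Hand1 was in front before the occlusion
        return {"Hand1": near, "Hand2": far}
    return {"Hand1": far, "Hand2": near}
```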
It should be appreciated that the specific operation of the modules and units in the device embodiment of the invention may be identical to the description in the method embodiment, and is not described in detail here.
The above content is a further detailed description of the invention in conjunction with specific preferred embodiments, and the specific implementation of the invention cannot be considered confined to these descriptions. For those of ordinary skill in the technical field of the invention, several equivalent substitutions or obvious modifications made without departing from the inventive concept, with identical performance or use, should all be considered to fall within the protection scope of the invention.

Claims (14)

1. A two-hand tracking method based on a Kinect sensor, characterized by including:
S1, a video acquisition step: acquiring from the Kinect sensor a color video stream and a depth video stream with the same resolution and frame rate;
S2, a first detection step: detecting, from the images of the acquired color and depth video streams, a first hand making an initial gesture;
S3, a single-hand tracking step: locating and tracking the first hand using its position and size information in the previous frame or previous two frames;
S4, a second detection step: filtering the depth map to find a target whose area and depth are both similar to those of the first hand, detecting a second hand using the position and size information of the first hand;
S5, a two-hand tracking step: tracking the two detected hands.
2. The two-hand tracking method based on a Kinect sensor of claim 1, characterized in that step S2 further includes:
S2-1, a sample training step;
S2-2, a mode selection step;
S2-3, an initial gesture determination step.
3. The two-hand tracking method based on a Kinect sensor of claim 2, characterized in that: in step S2-1, an SVM (Support Vector Machine) classifier is selected to learn and train on the morphological information of the hand, with geometric invariant moments chosen as the training features.
4. The two-hand tracking method based on a Kinect sensor of claim 2, characterized in that: in step S2-2, when the lighting is suitable, a skin-color mode is selected, i.e., the first hand is extracted by skin-color filtering combined with depth filtering; when the light is too dark or too bright, a shape mode is selected, i.e., the first hand is extracted by shape filtering combined with depth filtering.
5. The two-hand tracking method based on a Kinect sensor of claim 2, characterized in that: in step S2-3, the initial gesture is defined as: the hand extends forward, at a distance of more than a threshold d from the body.
6. The two-hand tracking method based on a Kinect sensor of claim 1, characterized in that the locating in step S3 is: predicting the ROI of the first hand in the current frame from the position and bounding rectangle of the first hand in the previous frame or previous two frames, and performing depth filtering within this ROI to locate the first hand in the current frame.
7. The two-hand tracking method based on a Kinect sensor of claim 1, characterized in that the two-hand tracking in step S5 includes:
1) while the two hands are separated, before any mutual occlusion, tracking the two targets separately: predicting the ROI of each target in the current frame from the position and size information of the two hands in the previous frame or previous two frames, and detecting within each of the two regions;
2) while the two hands occlude each other, treating the two coinciding detected target trajectories as a single target and tracking it;
3) when the two hands separate after the mutual occlusion, distinguishing them according to the invariance of their positional relationship in the depth direction before and after the occlusion, and tracking them separately.
8. A two-hand tracking device based on a Kinect sensor, characterized by including the following modules:
a video acquisition module, for acquiring from the Kinect sensor a color video stream and a depth video stream with the same resolution and frame rate;
a first detection module, for detecting, from the acquired color and depth images, a first hand making an initial gesture;
a single-hand tracking module, including a locating unit and a tracking unit, for locating and tracking the first hand using its position and size information in the previous frame or previous two frames;
a second detection module, for detecting a second hand using the position and size information of the first hand;
a two-hand tracking module, for tracking the two detected hands.
9. The two-hand tracking device based on a Kinect sensor of claim 8, characterized in that: the first detection module includes a sample training unit, a mode selection unit, and an initial gesture determination unit.
10. The two-hand tracking device based on a Kinect sensor of claim 9, characterized in that: the sample training unit selects an SVM classifier to learn and train on the morphological information of the hand, with geometric invariant moments chosen as the training features.
11. The two-hand tracking device based on a Kinect sensor of claim 9, characterized in that the mode selection unit is configured to:
when the lighting is suitable, select a skin-color mode, i.e., extract the first hand by skin-color filtering combined with depth filtering;
when the light is too dark or too bright, select a shape mode, i.e., extract the first hand by shape filtering combined with depth filtering.
12. The two-hand tracking device based on a Kinect sensor of claim 9, characterized in that: in the initial gesture determination unit, the initial gesture is defined as: the hand extends forward, at a distance of more than a threshold d from the body.
13. The two-hand tracking device based on a Kinect sensor of claim 8, characterized in that the locating unit is configured to: predict the ROI of the first hand in the current frame from the position and bounding rectangle of the first hand in the previous frame or previous two frames, and perform depth filtering within this ROI to locate the first hand in the current frame.
14. The two-hand tracking device based on a Kinect sensor of claim 8, characterized in that the two-hand tracking module is configured to:
1) while the two hands are separated, before any mutual occlusion, track the two targets separately: predict the ROI of each target in the current frame from the position and size information of the two hands in the previous frame or previous two frames, and detect within each of the two regions;
2) while the two hands occlude each other, treat the two coinciding detected target trajectories as a single target and track it;
3) when the two hands separate after the mutual occlusion, distinguish them according to the invariance of their positional relationship in the depth direction before and after the occlusion, and track them separately.
CN201310497334.XA 2013-10-21 2013-10-21 Two-hand tracking method and device based on a Kinect sensor Active CN103530892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310497334.XA CN103530892B (en) 2013-10-21 2013-10-21 Two-hand tracking method and device based on a Kinect sensor


Publications (2)

Publication Number Publication Date
CN103530892A CN103530892A (en) 2014-01-22
CN103530892B (en) 2016-06-22

Family

ID=49932870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310497334.XA Active CN103530892B (en) 2013-10-21 2013-10-21 Two-hand tracking method and device based on a Kinect sensor

Country Status (1)

Country Link
CN (1) CN103530892B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901052A (en) * 2010-05-24 2010-12-01 华南理工大学 Target control method based on mutual reference of both hands
CN103034851A (en) * 2012-12-24 2013-04-10 清华大学深圳研究生院 Device and method of self-learning skin-color model based hand portion tracking
CN103257713A (en) * 2013-05-31 2013-08-21 华南理工大学 Gesture control method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Oikonomidis, I. "Tracking the articulated motion of two strongly interacting hands." Computer Vision and Pattern Recognition (CVPR), 2012-06-21, pp. 1862-1869. *
Zhu, Yanmin et al. "Tracking the articulated motion of two strongly interacting hands." Service Sciences, 2013-04-13, pp. 260-265. *
Li, Yi. "Hand Gesture Recognition Using Kinect." Software Engineering and Service Science, 2012, pp. 196-199. *

Also Published As

Publication number Publication date
CN103530892A (en) 2014-01-22


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant