CN102324019A - Method and system for automatically extracting gesture candidate region in video sequence - Google Patents


Info

Publication number
CN102324019A
Authority
CN
China
Prior art keywords
gesture
image
candidate region
motion
value
Prior art date
Legal status
Granted
Application number
CN201110230698A
Other languages
Chinese (zh)
Other versions
CN102324019B (en)
Inventor
王维东
赵亚飞
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201110230698.2A priority Critical patent/CN102324019B/en
Publication of CN102324019A publication Critical patent/CN102324019A/en
Application granted granted Critical
Publication of CN102324019B publication Critical patent/CN102324019B/en
Legal status: Expired - Fee Related

Abstract

The invention discloses a method for automatically extracting a gesture candidate region from a video sequence. The method comprises the following steps: capturing video images with a gesture video image acquisition system; constructing a reference background image; computing a motion description image; computing a motion segmentation threshold and converting the motion description image into a binary motion image; performing skin color segmentation to obtain a binary skin color image; performing a pointwise logical AND of the binary motion image and the binary skin color image to obtain a binary fused image; and performing connected region analysis on the binary fused image to select the gesture candidate region. The method combines motion information with skin color information to find the gesture candidate region; the two kinds of information complement each other, improving detection accuracy. The method is practical and effective and provides a sound foundation for the segmentation, localization, and recognition of gestures.

Description

Method and system for automatically extracting a gesture candidate region in a video sequence
Technical field
The invention belongs to the field of intelligent information processing and relates to a method for automatically extracting a gesture candidate region from a video sequence, applicable to digital video image analysis and understanding.
Background technology
Traditional human-computer interaction devices such as the mouse, keyboard, and remote control all require humans to adapt to the computer and to complete interactive tasks according to pre-set conventions. With the continuous development of technology in recent years, the processing power of computers has grown steadily, and researchers have begun to study natural human-computer interaction techniques that match human communication habits, gradually shifting the focus from the computer to the human. This research includes speech recognition, face and expression recognition, head-movement tracking, gaze tracking, gesture recognition, body-posture recognition, and so on. Among these, vision-based gesture recognition is a focal point of natural human-computer interaction research.
Vision-based gesture recognition is roughly divided into the stages of detection and localization, tracking, segmentation, and recognition. Detection and localization is a critical step whose purpose is to determine where the gesture appears in the video image; the other stages of gesture recognition are all carried out on top of it. Some existing gesture recognition systems require manual assistance for detection and localization, i.e., the user must place the hand in a fixed region during system initialization, as in the HandVu gesture recognition system developed by Mathias et al. at the University of California in 2004. Most other gesture recognition techniques rely solely on skin color information for gesture detection and segmentation, and often assume that the hand is the only or the largest skin-color region in the entire image. Some researchers instead use motion information for detection and localization, likewise assuming that the hand is the only or the largest moving region in the image. However, both approaches are effective only in simple application scenarios; practical scenes are generally more complex and do not satisfy these assumptions, so the methods are difficult to apply. Some gesture recognition systems based on template matching have no separate detection and localization stage: a preset gesture template is used to traverse the entire image to find the best matching position, accomplishing detection, localization, and recognition simultaneously, but this traversal is computationally very expensive. In view of the above problems in the prior art, it is necessary to provide a practical and effective method for extracting the gesture candidate region in a video sequence.
Summary of the invention
The purpose of the embodiments of the invention is to provide a method and system for automatically extracting a gesture candidate region in a video sequence, which fuse motion information and skin color information to extract the gesture candidate region from video images automatically, practically, and effectively.
The embodiments of the invention are achieved as follows: a method for automatically extracting a gesture candidate region in a video sequence comprises the steps of:
starting the gesture video image acquisition system and capturing video images;
constructing a reference background image;
computing a motion description image;
computing a motion segmentation threshold and converting the motion description image into a binary motion image;
performing skin color segmentation to obtain a binary skin color image;
performing a pointwise logical AND of the binary motion image and the binary skin color image to obtain a binary fused image;
performing connected region analysis on the binary fused image and selecting the gesture candidate region.
Further, the step of starting the gesture video image acquisition system and capturing video images comprises:
after the image acquisition system starts, capturing M frames of video, converting each frame to a grayscale image, and using the mean of the M grayscale frames as the initial reference background image, where M is greater than or equal to 20.
Further, the step of constructing the reference background image comprises: constructing the reference background image at the current time t; after the initial background is established, the reference background image at time t+1 is always obtained by updating the reference background image of the previous time t.
Further, the step of computing the motion description image comprises:
converting the image captured at the current time t to a grayscale image, taking the pointwise difference with the reference background image, and taking the absolute value to generate the motion description image.
Further, the step of computing the motion segmentation threshold and converting the motion description image into a binary motion image comprises:
obtaining the motion segmentation threshold λ_M from the motion description image by the maximum between-class variance method, and using this threshold to convert the motion description image into the binary motion image.
Further, obtaining the segmentation threshold λ_M comprises the steps of:
computing the motion description image at the current time t and computing its amplitude distribution histogram;
selecting a threshold λ that divides the amplitude distribution histogram into two parts and computing the between-class variance of the two parts;
traversing all possible values of the threshold λ and selecting the λ that maximizes the between-class variance as the optimal threshold.
Further, the step of performing skin color segmentation to obtain the binary skin color image comprises:
performing skin color segmentation on the image captured at the current time using a prior skin color model, to obtain the binary skin color image of the current time.
Further, the step of performing connected region analysis on the binary fused image and selecting the gesture candidate region comprises:
performing connected region analysis on the binary fused image, dividing the moving skin-color regions in the image into a number of connected regions, computing the area of each connected region (i.e., the number of pixels it contains), and selecting the N connected regions with the largest areas as the gesture candidate regions.
Further, selecting the gesture candidate region comprises the steps of:
performing connected region analysis on the binary fused image, dividing the moving skin-color regions in the image into a number of connected regions, and computing the area of each connected region;
excluding connected regions whose area is less than a certain threshold from consideration as candidate regions, this threshold being proportional to the size of the video image;
if one of the gesture candidate regions in the previous frame was determined or recognized to contain a gesture, preferentially selecting, in the current frame, the connected region closest to the recognized gesture position of the previous frame as the gesture candidate region; if all gesture candidate regions in the previous frame were determined not to contain a gesture, selecting the N connected regions with the largest areas in the current frame as the new gesture candidate regions.
A gesture candidate region extraction system in a video sequence comprises a gesture video image acquisition system for capturing gesture images, and a gesture candidate region extraction system connected to it for generating the gesture candidate region. The gesture candidate region extraction system includes a motion detection unit connected to the gesture video image acquisition system, a background image construction and maintenance unit connected to the motion detection unit, a skin color detection unit connected to the gesture video image acquisition system, a motion and skin color information fusion unit connected to the motion detection unit and the skin color detection unit, and a gesture candidate region analysis and extraction unit connected to the motion and skin color information fusion unit.
The method of the invention for automatically extracting a gesture candidate region in a video sequence fuses motion information and skin color information to find the gesture candidate region; the two kinds of information complement each other and improve detection accuracy. The method also takes into account the temporal continuity of gesture motion and uses this continuity to guide the selection of the gesture candidate region. It places few restrictions and assumptions on the gesture application scenario and is applicable to practical scenes; it requires no manual assistance; it can provide automatic extraction of gesture candidate regions for vision-based gesture recognition systems; and it reduces the possibility of missed detections while narrowing the gesture candidate range. It is practical and effective and provides a sound basis for the segmentation, localization, and recognition of gestures.
Description of drawings
Fig. 1 is a flow diagram of the gesture candidate region extraction method in a video sequence of the invention;
Fig. 2 is an input raw video image of the invention;
Fig. 3 is a reference background image of the invention;
Fig. 4 is the motion description image obtained by converting Fig. 2 to grayscale, taking the pointwise difference with Fig. 3, and taking the absolute value;
Fig. 5 is the histogram distribution of Fig. 4 and a schematic diagram of the motion segmentation threshold determined by the maximum between-class variance method;
Fig. 6 is the binary motion image produced by thresholding Fig. 4 with the motion segmentation threshold;
Fig. 7 is the image obtained by applying median filtering and morphological dilation to Fig. 6;
Fig. 8 is the binary skin color image obtained by skin color segmentation of Fig. 2 in the YCrCb color space;
Fig. 9 is the image obtained by applying median filtering and morphological dilation to Fig. 8;
Fig. 10 is the binary fused image obtained by the pointwise logical AND of Fig. 7 and Fig. 9;
Fig. 11 is a schematic diagram of the gesture candidate region selected after connected region analysis of Fig. 10;
Fig. 12 is a block diagram of the gesture candidate region extraction system in a video sequence of the invention.
Embodiment
To make the purpose, technical solution, and advantages of the invention clearer, the invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the invention, not to limit it.
The method of the invention for automatically extracting a gesture candidate region in a video sequence comprises: starting the gesture video image acquisition system and capturing video images; constructing the reference background image B_t; computing the motion description image; computing the motion segmentation threshold and converting the motion description image into the binary motion image BM_t; performing skin color segmentation to obtain the binary skin color image BS_t; performing a pointwise logical AND of the binary motion image BM_t and the binary skin color image BS_t to obtain the binary fused image BF_t; and performing connected region analysis on the binary fused image BF_t to select the gesture candidate region.
Referring to Fig. 1, which shows the flow of the method for automatically extracting the gesture candidate region in a video sequence of the invention, the method comprises:
Step 1: start the gesture video image acquisition system and capture video images.
As shown in Fig. 2, the raw video image is an RGB color image captured by the gesture video image acquisition system.
Step 2: construct the reference background image B_t at the current time t.
The reference background image is constructed from the grayscale versions of the input video images, so each frame of the original input video must be converted from the RGB color space to a grayscale image using the following formula:
Y = 0.212671*R + 0.715160*G + 0.072169*B,
where Y is the converted grayscale value and R, G, B are the three components of the original RGB value.
After the image acquisition system starts, M frames of video are captured and converted to grayscale using the above formula, and the mean of the M grayscale frames is used as the initial reference background image B_0, where M is greater than or equal to 20 and its exact value is chosen as required.
After the initial background is established, the reference background image B_{t+1} at time t+1 is always obtained by updating the reference background image B_t of the previous time t with the following formula:
B_{t+1} = (1-a)×B_t + a×G_t
where G_t is the grayscale image of the video frame captured at time t, and a is an update coefficient with 0 < a < 1; the larger a is, the faster the background is updated. Fig. 3 shows a reference background image constructed by this method.
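As a concrete illustration of the background model above, the initialization and the running update can be sketched as follows. This is a minimal NumPy sketch assuming 8-bit RGB frames; the function names and the value of a are illustrative, not taken from the patent.

```python
import numpy as np

def rgb_to_gray(frame):
    # Y = 0.212671*R + 0.715160*G + 0.072169*B (the formula in Step 2)
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    return 0.212671 * r + 0.715160 * g + 0.072169 * b

def initial_background(frames):
    # B_0 = mean of the first M grayscale frames (M >= 20 in the document)
    grays = [rgb_to_gray(f.astype(np.float64)) for f in frames]
    return np.mean(grays, axis=0)

def update_background(b_t, gray_t, a=0.05):
    # B_{t+1} = (1 - a) * B_t + a * G_t, with 0 < a < 1
    return (1.0 - a) * b_t + a * gray_t
```

A larger a makes the background track scene changes faster, at the cost of absorbing slowly moving hands into the background sooner.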
Step 3: compute the motion description image.
The image F_t captured at the current time t is converted to the grayscale image G_t, the pointwise difference with the reference background image B_t is taken, and the absolute value is taken to generate the motion description image M_t:
M_t = |G_t - B_t|. As shown in Fig. 4, this is the motion description image obtained from the pointwise absolute difference of Fig. 2 (after grayscale conversion) and Fig. 3.
Step 4: compute the motion segmentation threshold and convert the motion description image into the binary motion image BM_t.
The motion segmentation threshold λ_M is obtained from the motion description image M_t by the maximum between-class variance method, and this threshold is used to convert the motion description image into the binary motion image BM_t, in which pixels with value "1" belong to moving regions and pixels with value "0" belong to non-moving regions.
The amplitude distribution histogram Hist_t of the motion description image M_t is computed; as shown in Fig. 5, which is the histogram distribution of Fig. 4.
The segmentation threshold λ_M is determined by the following steps:
1) compute the motion description image at the current time t: M_t = |G_t - B_t|, and compute its amplitude distribution histogram Hist_t;
2) select a threshold λ that divides Hist_t into two parts and compute the between-class variance of the two parts:
g = ω0*(μ0 - μ)^2 + ω1*(μ1 - μ)^2
where ω0 and μ0 are, respectively, the proportion of the total number of histogram pixels whose amplitude is less than λ and the mean amplitude of that part; ω1 and μ1 are, respectively, the proportion of pixels whose amplitude is greater than λ and the mean amplitude of that part; and μ is the mean amplitude of the whole histogram;
3) traverse all possible values of the threshold λ (e.g., if the image bit depth is 8, the possible values of λ are 0 to 255) and select the λ that maximizes the between-class variance g as the optimal threshold λ_M. As shown in Fig. 5, the gray vertical line in the figure marks the position of the optimal threshold λ_M.
The optimal threshold λ_M is then used to threshold the motion description image into the binary motion image BM_t:
BM_t(i, j) = 1 if M_t(i, j) ≥ λ_M, and BM_t(i, j) = 0 if M_t(i, j) < λ_M,
where i and j denote the row and column of a pixel in the image. As shown in Fig. 6, which is the result of thresholding Fig. 4, the white regions in Fig. 6 have value "1" and belong to moving regions, while the black regions have value "0" and belong to non-moving regions.
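The threshold determination in steps 1)-3) above is the maximum between-class variance method (Otsu's method) applied to the motion description image. A minimal NumPy sketch, assuming 8-bit amplitudes; the function names are illustrative:

```python
import numpy as np

def otsu_threshold(m):
    # m: motion description image |G_t - B_t|, assumed 8-bit amplitudes
    hist = np.bincount(m.astype(np.uint8).ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    levels = np.arange(256)
    mu = (hist * levels).sum() / total        # mean amplitude of whole histogram
    best_t, best_g = 0, -1.0
    for t in range(1, 256):                   # traverse all candidate thresholds
        w0 = hist[:t].sum() / total           # proportion with amplitude < t
        w1 = 1.0 - w0
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (hist[:t] * levels[:t]).sum() / (w0 * total)
        mu1 = (hist[t:] * levels[t:]).sum() / (w1 * total)
        g = w0 * (mu0 - mu) ** 2 + w1 * (mu1 - mu) ** 2   # between-class variance
        if g > best_g:
            best_g, best_t = g, t
    return best_t

def binarize(m, thresh):
    # BM_t(i, j) = 1 if M_t(i, j) >= lambda_M else 0
    return (m >= thresh).astype(np.uint8)
```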
Step 5: perform skin color segmentation to obtain the binary skin color image BS_t.
Skin color segmentation is performed on the image captured at the current time using a prior skin color model, yielding the binary skin color image BS_t of the current time; as shown in Fig. 8, pixels with value "1" belong to skin-color regions and pixels with value "0" belong to non-skin-color regions.
Human skin color has good clustering characteristics in the YCrCb color space. The image captured at the current time is first converted to the YCrCb color space; the conversion from RGB to YCrCb is:
Y = 0.299*R + 0.587*G + 0.114*B
Cr = 0.713*(R-Y) + 128
Cb = 0.564*(B-Y) + 128
In YCrCb, the Y component represents the luminance of the color, while the Cr and Cb components represent the red and blue chrominance, respectively. Only the Cr and Cb components are used when building the prior skin color model, which reduces the interference of illumination conditions with skin color segmentation. Skin color falls within a stable range in the Cr-Cb space. In the current image, pixels satisfying T1 ≤ Cr ≤ T2 and T3 ≤ Cb ≤ T4 are classified as skin color and the corresponding pixels in the binary skin color image BS_t are set to "1"; pixels not satisfying this condition are classified as non-skin color and the corresponding pixels in BS_t are set to "0". As shown in Fig. 8, which is the binary skin color image obtained by skin color segmentation of Fig. 2, the white regions have value "1" and belong to skin-color regions, while the black regions have value "0" and belong to non-skin-color regions.
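The skin color segmentation above can be sketched as follows. The patent leaves the bounds T1-T4 open, so the values below are illustrative Cr/Cb ranges commonly used for skin detection, not values from the document:

```python
import numpy as np

# Illustrative bounds T1..T4; the patent does not fix the exact values.
CR_MIN, CR_MAX = 133, 173
CB_MIN, CB_MAX = 77, 127

def skin_mask(rgb):
    # Convert RGB to YCrCb using the formulas in Step 5, then threshold Cr/Cb.
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    b = rgb[..., 2].astype(np.float64)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 0.713 * (r - y) + 128
    cb = 0.564 * (b - y) + 128
    # BS_t = 1 where T1 <= Cr <= T2 and T3 <= Cb <= T4, else 0
    return ((cr >= CR_MIN) & (cr <= CR_MAX) &
            (cb >= CB_MIN) & (cb <= CB_MAX)).astype(np.uint8)
```

Only Cr and Cb are thresholded, so (as the text notes) the luminance component does not directly affect the classification.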
Step 6: perform a pointwise logical AND of the binary motion image BM_t and the binary skin color image BS_t to obtain the binary fused image BF_t.
The binary motion image BM_t and the binary skin color image BS_t are combined by a pointwise logical AND, giving the binary fused image BF_t, as shown in Fig. 10. The white regions in Fig. 10 have value "1" and belong to moving skin-color regions, while the black regions have value "0" and belong to other regions.
Step 7: perform connected region analysis on the binary fused image BF_t and select the gesture candidate region.
Connected region analysis is performed on the binary fused image BF_t; the moving skin-color regions in the image are divided into a number of connected regions, the area of each connected region (i.e., the number of pixels it contains) is computed, and the N connected regions with the largest areas are selected as the gesture candidate regions, with N chosen as required.
Before the pointwise logical AND of BM_t and BS_t is carried out, BM_t and BS_t are first median filtered to remove isolated noise points and morphologically dilated to fill small holes. As shown in Fig. 7 and Fig. 9, these are the results of applying median filtering and morphological dilation to Fig. 6 and Fig. 8, respectively. Fig. 10 is the binary fused image obtained by the logical AND of Fig. 7 and Fig. 9; the white regions in Fig. 10 have value "1" and belong to moving skin-color regions, while the black regions have value "0" and belong to other regions. Fig. 11 is a schematic diagram of the gesture candidate region obtained from Fig. 10.
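The pre-fusion cleanup and the pointwise AND can be sketched as follows. This minimal NumPy sketch uses a 3x3 window for both the median filter (which on a binary image reduces to a majority vote) and the dilation; the window size is an assumption, since the patent does not fix it:

```python
import numpy as np

def neighborhood_sum(a):
    # Sum over each 3x3 neighborhood, zero-padded at the border.
    p = np.pad(a, 1, mode='constant')
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def median3(a):
    # 3x3 median of a binary image = majority vote over the 9 pixels;
    # removes isolated noise points.
    return (neighborhood_sum(a) >= 5).astype(np.uint8)

def dilate3(a):
    # 3x3 morphological dilation: any set neighbor sets the pixel;
    # fills small holes.
    return (neighborhood_sum(a) >= 1).astype(np.uint8)

def fuse(bm, bs):
    # Clean each mask, then pointwise logical AND -> binary fused image BF_t.
    bm = dilate3(median3(bm))
    bs = dilate3(median3(bs))
    return (bm & bs).astype(np.uint8)
```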
Selecting the gesture candidate region comprises the following steps:
perform connected region analysis on the binary fused image BF_t, divide the moving skin-color regions in the image into a number of connected regions, and compute the area of each connected region (i.e., the number of pixels it contains);
exclude connected regions whose area is less than a certain threshold T_Area from consideration as candidate regions; this threshold is proportional to the size of the video image, i.e., T_Area = H*W*β, where H and W are the height and width of the image and β is a scale factor, taken as 0.0025 in this embodiment but not limited to this value;
if the i-th gesture candidate region Cand_i among the gesture candidate regions in the previous frame was determined (recognized, by any suitable method) to contain a gesture, the connected region in the current frame closest to Cand_i is preferentially selected as the gesture candidate region; if all gesture candidate regions in the previous frame were determined not to contain a gesture, the N connected regions with the largest areas in the current frame are selected as the new gesture candidate regions, with N chosen as required.
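The connected region analysis and the area-based candidate selection above can be sketched as follows. This minimal sketch uses 8-connectivity (an assumption, since the patent does not specify the connectivity) with illustrative defaults N=2 and β=0.0025:

```python
import numpy as np
from collections import deque

def connected_regions(bf):
    # 8-connected component labelling of the binary fused image by BFS.
    h, w = bf.shape
    labels = np.zeros((h, w), dtype=np.int32)
    regions = []          # one list of (row, col) pixels per region
    nxt = 0
    for y in range(h):
        for x in range(w):
            if bf[y, x] and not labels[y, x]:
                nxt += 1
                labels[y, x] = nxt
                q = deque([(y, x)])
                pix = []
                while q:
                    cy, cx = q.popleft()
                    pix.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx2 = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx2 < w
                                    and bf[ny, nx2] and not labels[ny, nx2]):
                                labels[ny, nx2] = nxt
                                q.append((ny, nx2))
                regions.append(pix)
    return regions

def select_candidates(bf, n=2, beta=0.0025):
    # Drop regions with area < T_Area = H*W*beta, then keep the n largest.
    h, w = bf.shape
    t_area = h * w * beta
    big = [r for r in connected_regions(bf) if len(r) >= t_area]
    big.sort(key=len, reverse=True)
    return big[:n]
```

The temporal-continuity rule (preferring the region nearest the previously recognized gesture) would sit on top of this, replacing the purely area-based ranking when the previous frame contained a confirmed gesture.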
As shown in Fig. 12, a gesture candidate region extraction system in a video sequence comprises a gesture video image acquisition system for capturing gesture images, and a gesture candidate region extraction system connected to it for generating the gesture candidate region. The gesture candidate region extraction system includes a motion detection unit connected to the gesture video image acquisition system, a background image construction and maintenance unit connected to the motion detection unit, a skin color detection unit, a motion and skin color information fusion unit connected to the motion detection unit and the skin color detection unit, and a gesture candidate region analysis and extraction unit connected to the motion and skin color information fusion unit. The background image construction and maintenance unit generates the reference background image used by the motion detection unit; the motion detection unit converts the current input image into the binary motion image; the skin color detection unit converts the current input image into the binary skin color image; the motion and skin color information fusion unit fuses the outputs of the motion detection unit and the skin color detection unit to generate the binary fused image; and the gesture candidate region analysis and extraction unit extracts the gesture candidate region from the binary fused image.
The input of the background image construction and maintenance unit is connected to the output of the gesture video image acquisition system. The motion detection unit has two inputs: its first input is connected to the output of the background image construction and maintenance unit, and its second input is connected to the output of the gesture video image acquisition system. The input of the skin color detection unit is connected to the output of the gesture video image acquisition system. The information fusion unit has two inputs: its first input is connected to the output of the motion detection unit, and its second input is connected to the output of the skin color detection unit. The output of the motion and skin color information fusion unit is connected to the gesture candidate region analysis and extraction unit.
The method of the invention for automatically extracting a gesture candidate region in a video sequence fuses motion information and skin color information to find the gesture candidate region; the two kinds of information complement each other and improve detection accuracy. The method also takes into account the temporal continuity of gesture motion and uses it to guide the selection of the gesture candidate region. The invention places few restrictions and assumptions on the gesture application scenario and is applicable to practical scenes; it requires no manual assistance, can provide automatic extraction of gesture candidate regions for vision-based gesture recognition systems, and reduces the possibility of missed detections while narrowing the gesture candidate range, providing a sound basis for the segmentation, localization, and recognition of gestures.
The above are merely preferred embodiments of the invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall be included within the scope of protection of the invention.

Claims (10)

1. extract the method for gesture candidate region in the video sequence automatically, it is characterized in that, comprise the steps:
Video image is gathered by initiation gesture video image acquisition system;
Make up reference background image;
Calculate and generate motion description image;
Calculate and obtain the motion segmentation threshold value, the description image that will move changes into the two-value moving image;
Skin color segmentation obtains two-value broca scale picture;
Two-value moving image and two-value broca scale are done the logical and operation as pointwise, obtain the two-value fused images;
The two-value fused images is carried out connected component analysis, select the gesture candidate region.
2. extract the method for gesture candidate region according to claim 1 in the video sequence automatically, it is characterized in that: said initiation gesture image capturing system, the step of gathering video sequence comprises:
After image capturing system starts, to gather the M frame video image and convert gray-scale map respectively to, the mean value of using this M frame gray level image is as the initial reference background image, and wherein, M is more than or equal to 20.
3. The method for automatically extracting a gesture candidate region in a video sequence according to claim 1 or 2, characterized in that the step of constructing the reference background image comprises: constructing the reference background image at the current time t; after the initial background is established, the reference background image at time t+1 is always obtained by updating the reference background image of the previous time t.
4. The method for automatically extracting a gesture candidate region in a video sequence according to claim 3, characterized in that the step of computing the motion description image comprises:
converting the image captured at the current time t to a grayscale image, taking the pointwise difference with the reference background image, and taking the absolute value to generate the motion description image.
5. The method for automatically extracting a gesture candidate region in a video sequence according to claim 4, characterized in that the step of computing the motion segmentation threshold and converting the motion description image into a binary motion image comprises:
obtaining the motion segmentation threshold λ_M from the motion description image by the maximum between-class variance method, and using this threshold to convert the motion description image into the binary motion image.
6. The method for automatically extracting a gesture candidate region in a video sequence according to claim 5, characterized in that obtaining the segmentation threshold λ_M comprises the steps of:
computing the motion description image at the current time t and computing its amplitude distribution histogram;
selecting a threshold λ that divides the amplitude distribution histogram into two parts and computing the between-class variance of the two parts;
traversing all possible values of the threshold λ and selecting the λ that maximizes the between-class variance as the optimal threshold.
7. The method for automatically extracting a gesture candidate region in a video sequence according to claim 6, characterized in that the step of performing skin color segmentation to obtain the binary skin color image comprises:
Performing skin color segmentation on the image captured at the current time using a prior skin color model, to obtain the binary skin color image for the current time.
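The claim leaves the prior skin color model unspecified. A commonly used stand-in is a fixed box in the Cb/Cr plane of YCbCr space; the bounds below are illustrative assumptions, not values from the patent:

```python
import numpy as np

# Illustrative Cb/Cr bounds for skin; the patent only says a "prior
# skin color model" is used, without giving its parameters.
CB_RANGE = (77, 127)
CR_RANGE = (133, 173)

def skin_mask(frame_rgb: np.ndarray) -> np.ndarray:
    """Binary skin color image via a simple YCbCr box model."""
    r = frame_rgb[..., 0].astype(np.float64)
    g = frame_rgb[..., 1].astype(np.float64)
    b = frame_rgb[..., 2].astype(np.float64)
    # Standard RGB -> Cb/Cr conversion (BT.601, full range).
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    mask = ((cb >= CB_RANGE[0]) & (cb <= CB_RANGE[1]) &
            (cr >= CR_RANGE[0]) & (cr <= CR_RANGE[1]))
    return mask.astype(np.uint8)
```

Working in Cb/Cr discards most of the luminance component, which makes the box rule fairly robust to lighting changes.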
8. The method for automatically extracting a gesture candidate region in a video sequence according to claim 7, characterized in that the step of performing connected component analysis on the binary fused image and selecting the gesture candidate region comprises:
Performing connected component analysis on the binary fused image, dividing the moving skin color areas in the image into a plurality of connected regions, computing the area of each connected region (that is, the number of pixels it contains), and selecting the N connected regions with the largest areas as the gesture candidate regions.
9. The method for automatically extracting a gesture candidate region in a video sequence according to claim 8, characterized in that selecting the gesture candidate region comprises the steps of:
Performing connected component analysis on the binary fused image, dividing the moving skin color areas in the image into a plurality of connected regions, and computing the area of each connected region;
Excluding from candidacy any connected region whose area is smaller than a threshold, the threshold being proportional to the video image size;
If one of the gesture candidate regions in the previous frame has been determined or recognized to contain a gesture, preferentially selecting as the gesture candidate region the connected region in the current frame closest to the gesture position recognized in the previous frame; if none of the gesture candidate regions in the previous frame is determined to contain a gesture, selecting the N connected regions with the largest areas in the current frame as the new gesture candidate regions.
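Claims 8 and 9 together can be sketched as below. The pointwise AND fusion, area filtering, and centroid-distance tracking rule follow the claims; the 4-connectivity, the Euclidean centroid distance, and the concrete `min_area` and `n` defaults are illustrative assumptions:

```python
import numpy as np
from collections import deque

def connected_regions(binary: np.ndarray):
    """4-connected component labelling via BFS; returns a list of
    (area, centroid, pixel list) tuples."""
    h, w = binary.shape
    seen = np.zeros_like(binary, dtype=bool)
    regions = []
    for y in range(h):
        for x in range(w):
            if binary[y, x] and not seen[y, x]:
                q, pixels = deque([(y, x)]), []
                seen[y, x] = True
                while q:
                    cy, cx = q.popleft()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                ys, xs = zip(*pixels)
                centroid = (sum(ys) / len(ys), sum(xs) / len(xs))
                regions.append((len(pixels), centroid, pixels))
    return regions

def select_candidates(motion_bin, skin_bin, n=2, min_area=4,
                      prev_gesture_pos=None):
    """Fuse motion and skin masks (pointwise AND), label connected
    regions, drop small ones, then pick candidates as claim 9 describes."""
    fused = (motion_bin & skin_bin).astype(np.uint8)
    regions = [r for r in connected_regions(fused) if r[0] >= min_area]
    if prev_gesture_pos is not None and regions:
        # A gesture was found last frame: prefer the region whose
        # centroid is closest to that position (simplified to returning
        # only the nearest region).
        regions.sort(key=lambda r: (r[1][0] - prev_gesture_pos[0]) ** 2 +
                                   (r[1][1] - prev_gesture_pos[1]) ** 2)
        return regions[:1]
    # Otherwise keep the N largest regions as new candidates.
    return sorted(regions, key=lambda r: r[0], reverse=True)[:n]
```

In practice `min_area` would be scaled with the frame resolution, as claim 9 requires the exclusion threshold to be proportional to the video image size.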
10. A system for extracting a gesture candidate region in a video sequence, comprising a gesture video image capture system for capturing gesture images and, connected to it, a gesture candidate region extraction system for generating the gesture candidate region, characterized in that: the gesture candidate region extraction system comprises a motion detection unit connected to the gesture video image capture system, a background image construction and maintenance unit connected to the motion detection unit, a skin color detection unit connected to the background image construction and maintenance unit, a motion and skin color information fusion unit connected to the motion detection unit and the skin color detection unit, and a gesture candidate region analysis and extraction unit connected to the motion and skin color information fusion unit.
CN201110230698.2A 2011-08-12 2011-08-12 Method and system for automatically extracting gesture candidate region in video sequence Expired - Fee Related CN102324019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110230698.2A CN102324019B (en) 2011-08-12 2011-08-12 Method and system for automatically extracting gesture candidate region in video sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110230698.2A CN102324019B (en) 2011-08-12 2011-08-12 Method and system for automatically extracting gesture candidate region in video sequence

Publications (2)

Publication Number Publication Date
CN102324019A true CN102324019A (en) 2012-01-18
CN102324019B CN102324019B (en) 2014-03-05

Family

ID=45451758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110230698.2A Expired - Fee Related CN102324019B (en) 2011-08-12 2011-08-12 Method and system for automatically extracting gesture candidate region in video sequence

Country Status (1)

Country Link
CN (1) CN102324019B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719015A (en) * 2009-11-03 2010-06-02 上海大学 Method for positioning finger tips of directed gestures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAO Xinyan et al., "Monocular-vision hand gesture segmentation based on skin color and motion detection", Journal of Hunan University (Natural Sciences) *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104025154A (en) * 2012-02-07 2014-09-03 日本系统工具株式会社 Gesture recognition device using vehicle steering wheel, hand recognition method and program therefor
CN104025154B (en) * 2012-02-07 2016-06-22 日本系统工具株式会社 Gesture recognition device using vehicle steering wheel, hand recognition method and program therefor
CN102629314A (en) * 2012-02-17 2012-08-08 华南理工大学 Gesture recognition system based on infrared image and method thereof
CN102662460B (en) * 2012-03-05 2015-04-15 清华大学 Non-contact control device of mobile terminal and control method thereof
CN102662460A (en) * 2012-03-05 2012-09-12 清华大学 Non-contact control device of mobile terminal and control method thereof
CN102722239A (en) * 2012-05-17 2012-10-10 上海冠勇信息科技有限公司 Non-contact control method of mobile device
CN103488283A (en) * 2012-06-07 2014-01-01 佳能株式会社 Information processing apparatus and method for controlling the same, and background determination method
CN103488283B (en) * 2012-06-07 2016-12-07 佳能株式会社 Information processing apparatus and method for controlling the same, and background determination method
US9330304B2 (en) 2012-06-07 2016-05-03 Canon Kabushiki Kaisha Information processing apparatus and method for controlling the same
CN104395856A (en) * 2012-07-09 2015-03-04 高通股份有限公司 Computer implemented method and system for recognizing gestures
CN104395856B (en) * 2012-07-09 2018-03-23 高通股份有限公司 Computer-implemented method and system for recognizing gestures
US9697418B2 (en) 2012-07-09 2017-07-04 Qualcomm Incorporated Unsupervised movement detection and gesture recognition
WO2014011410A1 (en) * 2012-07-09 2014-01-16 Qualcomm Incorporated Computer - implemented method and system for recognizing gestures
CN102830800B (en) * 2012-08-03 2015-03-25 中国科学技术大学 Method and system for controlling digital signage by utilizing gesture recognition
CN102830800A (en) * 2012-08-03 2012-12-19 中国科学技术大学 Method and system for controlling digital signage by utilizing gesture recognition
CN104318558B (en) * 2014-10-17 2017-06-23 浙江大学 Multi-information fusion based gesture segmentation method under complex scenarios
CN104318558A (en) * 2014-10-17 2015-01-28 浙江大学 Multi-information fusion based gesture segmentation method under complex scenarios
CN104680127A (en) * 2014-12-18 2015-06-03 闻泰通讯股份有限公司 Gesture identification method and gesture identification system
CN104992171A (en) * 2015-08-04 2015-10-21 易视腾科技有限公司 Method and system for gesture recognition and man-machine interaction based on 2D video sequence
WO2017101380A1 (en) * 2015-12-15 2017-06-22 乐视控股(北京)有限公司 Method, system, and device for hand recognition
CN105893926A (en) * 2015-12-15 2016-08-24 乐视致新电子科技(天津)有限公司 Hand identification method, system and device
CN105825170A (en) * 2016-03-10 2016-08-03 浙江生辉照明有限公司 Detection method and apparatus for repeated gesture
CN105825170B (en) * 2016-03-10 2019-07-02 浙江生辉照明有限公司 Detection method and apparatus for repeated gestures
CN107491755A (en) * 2017-08-16 2017-12-19 京东方科技集团股份有限公司 Method and device for gesture identification
US10509948B2 (en) 2017-08-16 2019-12-17 Boe Technology Group Co., Ltd. Method and device for gesture recognition
CN107491755B (en) * 2017-08-16 2021-04-27 京东方科技集团股份有限公司 Method and device for gesture recognition
CN109961010A (en) * 2019-02-16 2019-07-02 天津大学 A kind of gesture identification method based on intelligent robot
CN111126279A (en) * 2019-12-24 2020-05-08 深圳市优必选科技股份有限公司 Gesture interaction method and gesture interaction device
CN111126279B (en) * 2019-12-24 2024-04-16 深圳市优必选科技股份有限公司 Gesture interaction method and gesture interaction device
CN111639641B (en) * 2020-04-30 2022-05-03 中国海洋大学 Method and device for acquiring clothing region not worn on human body
CN111639641A (en) * 2020-04-30 2020-09-08 中国海洋大学 Clothing area acquisition method and device
CN111986378A (en) * 2020-07-30 2020-11-24 湖南长城信息金融设备有限责任公司 Bill color fiber yarn detection method and system
CN112232217B (en) * 2020-10-16 2022-08-02 怀化新大地电脑有限公司 Gesture recognition system
CN112232217A (en) * 2020-10-16 2021-01-15 怀化新大地电脑有限公司 Gesture recognition system
CN112819843A (en) * 2021-01-20 2021-05-18 上海大学 Method and system for extracting power line at night
CN113128435A (en) * 2021-04-27 2021-07-16 南昌虚拟现实研究院股份有限公司 Hand region segmentation method, device, medium and computer equipment in image
CN113128435B (en) * 2021-04-27 2022-11-22 南昌虚拟现实研究院股份有限公司 Hand region segmentation method, device, medium and computer equipment in image
CN113674833A (en) * 2021-08-23 2021-11-19 成都拟合未来科技有限公司 Body-building video generation method, system, terminal and storage medium
CN113674833B (en) * 2021-08-23 2024-02-06 成都拟合未来科技有限公司 Body-building video generation method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN102324019B (en) 2014-03-05

Similar Documents

Publication Publication Date Title
CN102324019A (en) Method and system for automatically extracting gesture candidate region in video sequence
CN106648103B A gesture tracking method and VR headset
CN108416268B (en) Action recognition method based on double-robot visual communication
CN102663362B (en) Moving target detection method based on gray features
CN101236657A Method for tracking and recording the track of a single moving target
CN102568003B (en) Multi-camera target tracking method based on video structural description
CN101635031B (en) Method for extracting and identifying small sample character contour feature
CN102420985B (en) Multi-view video object extraction method
CN103413323B Object tracking method based on component-level appearance model
CN110032932B (en) Human body posture identification method based on video processing and decision tree set threshold
CN103020991B Method and system for moving target perception in video scenes
Song et al. Design of control system based on hand gesture recognition
CN102034247A (en) Motion capture method for binocular vision image based on background modeling
CN101650782A (en) Method for extracting front human face outline based on complexion model and shape constraining
Lu et al. Multi-task learning for single image depth estimation and segmentation based on unsupervised network
CN113792635A (en) Gesture recognition method based on lightweight convolutional neural network
CN107909599A Object detection and tracking system
CN102457724A (en) Image motion detecting system and method
CN106251348A Adaptive multi-cue fusion background subtraction method for depth cameras
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN104182976B (en) Field moving object fining extraction method
CN113808005A (en) Video-driving-based face pose migration method and device
CN111882581A (en) Multi-target tracking method for depth feature association
CN105374010A (en) A panoramic image generation method
CN114387610A Method for detecting arbitrarily-shaped scene text based on enhanced feature pyramid network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140305

Termination date: 20180812