CN104866825B - A sign language video frame sequence classification method based on Hu moments - Google Patents

A sign language video frame sequence classification method based on Hu moments

Info

Publication number
CN104866825B
CN104866825B · CN201510254121.3A · CN201510254121A
Authority
CN
China
Prior art keywords
frame
image
sign language
label
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510254121.3A
Other languages
Chinese (zh)
Other versions
CN104866825A (en)
Inventor
倪浩淼
徐向民
裘索
黄爱发
李兆海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN201510254121.3A
Publication of CN104866825A
Application granted
Publication of CN104866825B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language

Abstract

The present invention discloses a sign language video frame sequence classification method based on Hu moments, comprising the following steps: step 1: obtaining the sign language video to be identified; step 2: performing frame sampling on the sign language video to obtain a frame sequence; step 3: converting the frame sequence of colour images into a frame sequence of binary images; step 4: segmenting the hand from the background; step 5: extracting the Hu moments of the segmented hand images to obtain the feature vector of each hand image; step 6: calculating the Euclidean distance between the feature vectors of each frame and the previous frame, making an adaptive threshold decision, attaching digital labels, and performing a preliminary classification; step 7: further classifying the frame sequence according to the labels, removing transitional-movement frames, and outputting the sorted label sequence in order. The present invention has low computational complexity, is robust to rotation, scaling and translation of the sign language images, and can be applied directly to sign language video recognition systems.

Description

A sign language video frame sequence classification method based on Hu moments
Technical field
The present invention relates to the field of video processing in computer vision, and more particularly to a sign language video frame sequence classification method based on Hu moments.
Background technique
With the growing influence of computers on modern society, human-computer interaction is becoming an increasingly important part of people's daily lives. Most of today's human-computer interaction relies on relatively simple input tools such as the keyboard and mouse; compared with these, input through body language is faster and more convenient. Sign language recognition is an important component of the field of intelligent human-computer interaction. Its purpose is to provide an efficient and accurate mechanism for translating sign language into text or speech by computer. This technology not only makes communication between deaf people and hearing people more convenient, but also has positive significance and application value in the field of human-computer interaction.
Traditional sign language recognition comprises recognition based on data gloves and recognition based on vision. In data-glove-based recognition, the user wears a data glove; the computer obtains the motion information of the human hand through the glove, processes it in real time, and displays the result instantly to realize human-computer interaction. Vision-based recognition instead captures images of the hand with a digital photographing apparatus, processes the acquired images further, and then identifies the corresponding sign language information. In general, the advantage of the data-glove method is that the input data are few and the resolution is high, but data gloves are expensive and cumbersome to wear, which hinders natural human-computer interaction. The computer-vision method enables more natural interaction at a lower price and a smaller input cost, but it mainly faces the following two major problems:
(1) Hand gesture segmentation under monocular vision against a complex background is very difficult. Because backgrounds are diverse and environmental factors are unpredictable, not only is mature theory lacking as guidance, but existing methods are also hard to implement, their computational complexity is high, and their results are not ideal. The common workaround at present is to add constraints: simplifying the background with black or white walls, dark clothing and the like, or requiring the hand to wear gloves of a distinctive colour to emphasize the foreground, thereby simplifying the division between the hand region and the background region.
(2) While a sign is being performed, the hand, as a non-rigid object, deforms irregularly. A key issue in dynamic sign language video recognition is therefore how to classify the video frame sequence and recognize the semantics of the dynamic sign sequence. Conventional tracking algorithms, which record the motion trajectory of the hand, lose the target very easily during tracking and have difficulty recovering it.
Geometric moment invariants were proposed by Hu in 1962 ("Visual pattern recognition by moment invariants") and possess translation, rotation and scale invariance. Using second- and third-order central moments, Hu constructed seven invariant moments M1-M7 that remain invariant under translation, scaling and rotation of a continuous image; the related definitions are also adopted by the present invention. In practice, in the recognition of objects in pictures, only M1 and M2 maintain their invariance well; the errors introduced by the other invariant moments are comparatively large.
Summary of the invention
In view of the foregoing, it is necessary to provide a classification method for sign language video frame sequences that improves the correct recognition rate of sign language video, possesses good robustness, and effectively improves the recognition effect.
To achieve the above object, the present invention provides a sign language video frame sequence classification method based on Hu moments, comprising the following steps:
Step 1: obtaining the colour sign language video to be identified;
Step 2: performing frame sampling on the colour sign language video to obtain a frame sequence;
Step 3: converting the frame sequence of colour images into a frame sequence of binary images;
Step 4: segmenting the hand from the background;
Step 5: extracting the M1 and M2 moments among the Hu moments of the segmented hand image sequence to obtain the feature vector of each hand image;
Further, the invariant moments M1 and M2 are defined as follows:

$$M_1 = \eta_{20} + \eta_{02}, \qquad M_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$

where the normalized central moments are

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,(p+q)/2+1}}, \qquad \mu_{pq} = \sum_{x=1}^{N}\sum_{y=1}^{M}(x-\bar{x})^p\,(y-\bar{y})^q\,f(x,y)$$

N and M are the height and width of the image respectively, f(x, y) is the image function, and $(\bar{x}, \bar{y})$ is the image centroid, defined as

$$\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}}, \qquad m_{pq} = \sum_{x=1}^{N}\sum_{y=1}^{M}x^p\,y^q\,f(x,y)$$
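As a quick sanity check of the scale invariance (an added derivation, not part of the original disclosure): scaling the image by a factor $\lambda$ multiplies $\mu_{pq}$ by $\lambda^{p+q+2}$ and $\mu_{00}$ by $\lambda^{2}$, so

$$\eta_{pq} \;\to\; \frac{\lambda^{p+q+2}\,\mu_{pq}}{\left(\lambda^{2}\mu_{00}\right)^{(p+q)/2+1}} = \frac{\lambda^{p+q+2}\,\mu_{pq}}{\lambda^{p+q+2}\,\mu_{00}^{\,(p+q)/2+1}} = \eta_{pq}$$

and M1 and M2, being functions of the $\eta_{pq}$, are unchanged; their rotation invariance follows from Hu's construction.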
Step 6: calculating the Euclidean distance Sample_dis between the feature vector of each frame's hand image and that of the previous frame, performing an adaptive threshold decision, attaching digital labels, and carrying out a preliminary classification;
Step 7: carrying out fine classification of the frame sequence according to the labels, removing transition frames, and outputting the sorted label sequence in order.
Further, in step 5 each image is mapped from pixel space to feature-vector space using the M1 and M2 moments, reducing computational complexity:

$$f_i(x, y) \;\mapsto\; v_i = \left(M_1^{(i)},\, M_2^{(i)}\right)$$

where f_i(x, y) is the image function of the i-th frame and $v_i$ is the corresponding feature vector.
Further, in step 6 the Euclidean metric between the feature vectors of two consecutive frames is judged against an adaptive threshold thre to carry out the preliminary classification. The decision function ρ, which attaches the labels, is defined as follows:

$$L_i = \begin{cases} L_{i-1}, & \lVert v_i - v_{i-1} \rVert \le thre \\ L_{i-1} + 1, & \lVert v_i - v_{i-1} \rVert > thre \end{cases}$$

where $v_i$ is the feature vector of the i-th frame image, L_i is the digital label of the i-th frame image, and L_1 = 1.

Provided that the sampling rate makes the interval between the first two frames sufficiently small, the signs performed in those two frames can be assumed to belong to the same sign class; the threshold can therefore be set as

$$thre = \lVert v_2 - v_1 \rVert$$

which realizes a robust adaptive threshold decision.
Further, step 7 comprises: step 71, obtaining the label of a newly sampled frame image; step 72, judging from the label obtained in step 71 whether the frame is a transition gesture movement; if so, removing the frame image; if not, grouping the frame with the frames bearing the same label; step 73, outputting the classification results in frame-sequence order, the results being indicated by digital labels, with the frames sharing one digital label forming the set of sample frames of the video clip of one and the same sign.
Further, step 7 removes transition sign frames using the digital labels, specifically as follows: suppose the target video sequence has N frames in total; if the digital label L_i of the i-th frame image (1 < i < N) satisfies

$$L_i = L_{i-1} + 1 \;\;\text{and}\;\; L_i = L_{i+1} - 1$$

then that frame is a transition gesture movement and must be removed, that is: for every k-th frame image with k ≥ i, let $L_k \leftarrow L_k - 1$, where L_k is the digital label of the k-th frame image.
Further, step 3 comprises the following steps: step 31, obtaining a frame sampling image produced by step 2; step 32, traversing all pixels of the image and judging whether each may be a pixel of the hand region; if so, setting it to white, and if not, setting it to black; step 33, outputting the binary image.
Further, step 4 comprises the following steps: step 41, obtaining a frame of the binary image produced by step 3; step 42, traversing all contours of the image and judging whether each is the contour of the hand region; if so, continuing to step 43, and if not, repeating step 42 until the traversal ends and exits; step 43, segmenting the hand region from the background; step 44, outputting the hand region image.
Preferably, the sample rate Sample_rate in step 2 is one frame per second.
Compared with the prior art, the present invention has the following advantages and technical effects:
The sign language video frame sequence classification method based on Hu moments provided by the present invention has low computational complexity and good robustness to non-essential image changes, such as the rotation, scaling and translation of sign language images, that inevitably arise from differences in the external environment. The present invention offers a simple solution to the problem, present in traditional frame sequences, of how to remove transition gesture movements that interfere with sign language video recognition. The present invention can be applied directly to sign language video recognition systems: it supplies an ordered sequence of digital labels for the sign frames, and frames with the same label belong to the same class, so taking one frame from each class in order and performing simple static binary sign image recognition completes the semantic recognition of the sign language video.
Detailed description of the invention
Fig. 1 is the overall flow chart of the sign language video frame sequence classification based on Hu moments according to the invention;
Fig. 2 is the sub-flow chart of step S103 in Fig. 1;
Fig. 3 is the sub-flow chart of step S104 in Fig. 1;
Fig. 4 is the sub-flow chart of step S107 in Fig. 1.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific examples described herein serve only to explain the present invention and are not intended to limit it; in addition, any process or symbol not specially described in detail below can be realized or understood by those skilled in the art with reference to the prior art.
As shown in Fig. 1, which is the flow chart of a preferred embodiment of the sign language video frame sequence classification method of the present invention, in step S101 the sign language video to be identified is obtained using a digital photographing apparatus. In the present embodiment, considering that current computer vision technology still cannot directly extract fine free-hand gesture features from a two-dimensional space, a pair of blue gloves is used to assist the calibration and detection of the characteristic part (the hand), so as to describe gesture features finely and distinguish similar sign words; this does not limit the naturality of the interaction. The digital photographing apparatus employed in the present embodiment is a Sony DSC-W730 digital camera with 16.1 million effective pixels and a frame rate of 29.97 frames/s; while recording, the camera is positioned in front of the signer and shoots forward.
In step S102, frame sampling is performed on the sign language video at sample rate Sample_rate. In the present embodiment, Sample_rate is set to take 1 frame every 30 frames, i.e. approximately 1 frame is acquired per second.
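As a rough illustration of this sampling step (a sketch, not part of the original disclosure), the code below keeps every 30th frame of a ~30 fps video; OpenCV is assumed, and the file name is hypothetical.

```python
# Sketch only: sample roughly one frame per second from a ~30 fps video.
import cv2

def sample_frames(video_path, sample_rate=30):
    """Return every sample_rate-th frame of the video as a list of BGR images."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video (or read error)
            break
        if index % sample_rate == 0:    # keep 1 frame out of every 30
            frames.append(frame)
        index += 1
    cap.release()
    return frames

frames = sample_frames("sign_video.mp4")  # hypothetical input file
```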
Step S103 is the preprocessing part: the collected colour image sequence is converted into a black-and-white binary image sequence. As shown in Fig. 2, step S103 further comprises the following steps. Step S201: obtain a new frame sampling image. Step S202: traverse all pixels of the image and judge whether each may be a pixel of the hand region; if so, set it to white; if not, set it to black. Step S203: output the binary image. Specifically, the basic principle of the binarization is to use the colour of the gloves to set the hand part to white and the remaining background to black; the specific decision criteria are as follows:
1. The RGB values of the pixel satisfy: B = max(R, G, B);
2. The B value of the pixel satisfies: B > B_thre; preferably, B_thre = 50;
3. The RGB variance Var of the pixel satisfies: Var > Var_thre; preferably, Var_thre = 350;

where, with channel mean $\mu = (R + G + B)/3$,

$$Var = \frac{(R-\mu)^2 + (G-\mu)^2 + (B-\mu)^2}{3}$$
Criterion 1 holds because the gloves are blue, so the B value is the maximum of the RGB values. Criterion 2 is needed because a pixel satisfying criterion 1 is not necessarily blue; the B value must exceed a certain threshold before the pixel can be judged blue. Criterion 3 holds because the region is distinctly blue, so the variance of its RGB values exceeds a certain threshold.
Only a pixel that satisfies all three criteria simultaneously is judged to be a pixel of the hand region; its RGB values are all assigned 255 (i.e. white); otherwise its RGB values are all assigned 0 (i.e. black).
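The three criteria translate directly into array operations. A minimal sketch, assuming OpenCV's BGR channel order and NumPy; the thresholds are the preferred values from the text, and the per-pixel variance follows the formula reconstructed above.

```python
import numpy as np

def binarize_blue_glove(frame_bgr, b_thre=50, var_thre=350):
    """White (255) where a pixel is likely the blue glove, black (0) elsewhere."""
    b = frame_bgr[:, :, 0].astype(np.float32)
    g = frame_bgr[:, :, 1].astype(np.float32)
    r = frame_bgr[:, :, 2].astype(np.float32)
    is_blue_max = (b >= r) & (b >= g)        # criterion 1: B = max(R, G, B)
    is_blue_strong = b > b_thre              # criterion 2: B above threshold
    var = np.stack([r, g, b]).var(axis=0)    # criterion 3: per-pixel RGB variance
    mask = is_blue_max & is_blue_strong & (var > var_thre)
    return np.where(mask, 255, 0).astype(np.uint8)
```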
When a blue object appears in the shooting background, it is also binarized into a white area; therefore not all white areas in the binary image belong to the hand region. Step S104 distinguishes the hand region from the other white areas and segments the hand from the background. As shown in Fig. 3, step S104 further comprises the following steps. Step S301: obtain a new frame of the binary image. Step S302: traverse all contours of the image and judge whether each is the contour of the hand region; if so, continue to step S303; if not, repeat step S302 until the traversal ends and exits. Step S303: segment the hand region from the background. Step S304: output the hand region image. Specifically, the basic principle of the segmentation is to identify the hand region by the size of the minimal rectangle enclosing a region's contour; the specific decision criteria are as follows:
1. The length of the minimal rectangle enclosing the contour Contour satisfies: len_min < length < len_max; preferably, len_min = 150, len_max = 250;
2. The width of the minimal rectangle enclosing the contour Contour satisfies: hgh_min < height < hgh_max; preferably, hgh_min = 100, hgh_max = 200.

Only when criteria 1 and 2 are satisfied simultaneously is the region judged to be the hand region; the region is then cut out along its minimal enclosing rectangle, and the binary image after hand-background segmentation is output.
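A sketch of this contour filter under the stated rectangle limits, assuming OpenCV; cv2.boundingRect is used for the enclosing rectangle, which reads the "minimal rectangle" of the text as the upright bounding box.

```python
import cv2
import numpy as np

def segment_hand(binary_img, len_rng=(150, 250), hgh_rng=(100, 200)):
    """Keep only white regions whose enclosing rectangle matches the hand limits."""
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    out = np.zeros_like(binary_img)
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)   # minimal upright rectangle
        if len_rng[0] < w < len_rng[1] and hgh_rng[0] < h < hgh_rng[1]:
            out[y:y + h, x:x + w] = binary_img[y:y + h, x:x + w]
    return out
```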
Since the amount of data contained in an image is very large, in order to analyse these image data further and effectively, in step S105 (as shown in Fig. 1) the binary hand images are described with the Hu moment invariants, which possess scale invariance, translation invariance and rotation invariance; the analysis of the binary image sequence is thereby converted into the analysis of a set of feature vector values, reducing the complexity of the computation. In fact, among the seven Hu invariant moments only M1 and M2 maintain their invariance well in the recognition of objects in pictures; the errors introduced by the other invariant moments are comparatively large. Therefore only M1 and M2 are used in S105 to describe the segmented binary hand images. Specifically, the invariant moments M1 and M2 are defined as:

$$M_1 = \eta_{20} + \eta_{02}, \qquad M_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$

where

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,(p+q)/2+1}}, \qquad \mu_{pq} = \sum_{x=1}^{N}\sum_{y=1}^{M}(x-\bar{x})^p\,(y-\bar{y})^q\,f(x,y)$$

f(x, y) is the image function and $(\bar{x}, \bar{y})$ is the image centroid, defined as

$$\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}}, \qquad m_{pq} = \sum_{x=1}^{N}\sum_{y=1}^{M}x^p\,y^q\,f(x,y)$$

With the invariant moments M1 and M2, each segmented binary hand image can be mapped to the following feature vector, realizing the reduction of computational complexity:

$$v_i = \left(M_1^{(i)},\, M_2^{(i)}\right)$$
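A sketch of this feature extraction, assuming OpenCV: cv2.HuMoments computes all seven Hu invariants from the image moments, and only the first two are kept, as the text prescribes.

```python
import cv2
import numpy as np

def hu_feature(binary_hand):
    """Map a segmented binary hand image to the 2-D feature vector (M1, M2)."""
    moments = cv2.moments(binary_hand, binaryImage=True)
    hu = cv2.HuMoments(moments).flatten()   # hu[0] = M1, hu[1] = M2
    return np.array([hu[0], hu[1]])
```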
After the images are mapped from pixel space to feature-vector space, the difference between any two images can be measured by the difference between their feature vectors, and the difference between feature vectors can be characterized by the distance between the two vectors in feature space. As shown in Fig. 1, in step S106 this distance is expressed by the Euclidean metric, defined as follows: given the Hu feature vectors of two images, $v_1 = (M_1^{(1)}, M_2^{(1)})$ and $v_2 = (M_1^{(2)}, M_2^{(2)})$, their Euclidean metric is

$$d(v_1, v_2) = \sqrt{\left(M_1^{(1)} - M_1^{(2)}\right)^2 + \left(M_2^{(1)} - M_2^{(2)}\right)^2}$$
After the Euclidean metric between the feature vectors of two images has been obtained, an adaptive threshold must be set to judge whether they belong to the same class. If they are of the same class, the same label is attached; if not, different labels are attached. In fact, the signs performed within the 1st second can safely be assumed to belong to one and the same sign class, so the threshold is taken as the Euclidean metric of the feature vectors of the first two frame images:

$$thre = d(v_2, v_1)$$

The specific decision criterion is

$$L_i = \begin{cases} L_{i-1}, & d(v_i, v_{i-1}) \le thre \\ L_{i-1} + 1, & d(v_i, v_{i-1}) > thre \end{cases}$$

where $v_i$ is the feature vector of the i-th frame image and L_i is the label of the i-th frame image; in the present embodiment the labels are represented by numbers, with L_1 = 1.
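A sketch of this preliminary labelling, assuming at least two sampled frames and, as the text argues, that the first two frames show the same sign; labels start at 1 and increase by one whenever the distance to the previous frame exceeds the adaptive threshold.

```python
import numpy as np

def label_frames(features):
    """Assign digital labels to a list of (M1, M2) feature vectors."""
    thre = np.linalg.norm(features[1] - features[0])  # adaptive threshold
    labels = [1, 1]                                   # first two frames: same sign
    for i in range(2, len(features)):
        dist = np.linalg.norm(features[i] - features[i - 1])
        labels.append(labels[-1] if dist <= thre else labels[-1] + 1)
    return labels
```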
Step S106 yields a preliminary classification result. However, one difficulty of the traditional frame sequence classification problem remains: how to remove the frames of transitional movements from the collected frame sequence. In step S107 the frame sequence is finely classified according to the image labels. Specifically, as shown in Fig. 4, step S107 further comprises the following steps. Step S401: obtain the label of a new frame sampling image. Step S402: judge from the label whether the frame is a transition gesture movement; if so, remove it; if not, group it with the frames bearing the same label. Step S403: output the classification results in frame-sequence order. The specific decision criterion of step S402 is as follows: suppose the target video sequence has N frames in total; if the digital label L_i of the i-th frame image (1 < i < N) satisfies

$$L_i = L_{i-1} + 1 \;\;\text{and}\;\; L_i = L_{i+1} - 1$$

then that frame is a transition gesture movement and must be removed, that is: for every k-th frame image with k ≥ i, let $L_k \leftarrow L_k - 1$, where L_k is the digital label of the k-th frame image. If the criterion is not satisfied, return to step S401.
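A sketch of this fine classification step, under the reading that $L_k \leftarrow L_k - 1$ means the labels of all later frames shift down by one after a deletion: a frame whose label lies strictly between its neighbours' labels is treated as a transition pose and dropped.

```python
def remove_transitions(labels):
    """Delete transition frames and return the relabelled sequence."""
    labels = list(labels)
    i = 1
    while i < len(labels) - 1:
        if labels[i] == labels[i - 1] + 1 and labels[i] == labels[i + 1] - 1:
            del labels[i]                       # remove the transition frame
            for k in range(i, len(labels)):
                labels[k] -= 1                  # shift subsequent labels down
        else:
            i += 1
    return labels

print(remove_transitions([1, 1, 2, 3, 3]))      # -> [1, 1, 2, 2]
```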
Finally, the fine classification results are output in frame-sequence order. The classification results are indicated by digital labels; the frame sequence sharing one digital label represents the set of sample frames of the video clip of one and the same sign.

Claims (1)

1. A sign language video frame sequence classification method based on Hu moments, characterized by comprising the following steps:
Step S101: obtaining the sign language video to be identified using a digital photographing apparatus, a pair of blue gloves being used to assist the calibration and detection of the characteristic part, which does not limit the naturality of the interaction;
Step S102: performing frame sampling on the sign language video at sample rate SAMPLE_RATE; SAMPLE_RATE is set to take 1 frame every 30 frames, i.e. 1 frame is acquired per second;
Step S103, as the preprocessing part: converting the collected colour image sequence into a black-and-white binary image sequence; step S103 further comprises the following steps: step S201: obtaining a new frame sampling image; step S202: traversing all pixels of the image and judging whether each may be a pixel of the hand region; if so, setting it to white; if not, setting it to black; step S203: outputting the binary image; specifically, the basic principle of the binarization is to use the colour of the gloves to set the hand part to white and the remaining background to black, with the specific decision criteria as follows:
1. the RGB values of the pixel satisfy: B = max(R, G, B);
2. the B value of the pixel satisfies: B > B_thre;
3. the RGB variance Var of the pixel satisfies: Var > Var_thre, where, with $\mu = (R+G+B)/3$, $Var = \left[(R-\mu)^2 + (G-\mu)^2 + (B-\mu)^2\right]/3$;
criterion 1 holds because the gloves are blue, so the B value is the maximum of the RGB values; criterion 2 is needed because a pixel satisfying criterion 1 is not necessarily blue, and the B value must exceed a certain threshold before the pixel can be judged blue; criterion 3 holds because the region is distinctly blue, so the variance of its RGB values exceeds a certain threshold;
only a pixel satisfying all three criteria simultaneously is judged to be a pixel of the hand region, and its RGB values are all assigned 255, i.e. white; otherwise its RGB values are all assigned 0, i.e. black;
when a blue object appears in the shooting background it is also binarized into a white area, so not all white areas in the binary image belong to the hand region; step S104 distinguishes the hand region from the other white areas and segments the hand from the background; step S104 further comprises the following steps: step S301: obtaining a new frame of the binary image; step S302: traversing all contours of the image and judging whether each is the contour of the hand region; if so, continuing to step S303; if not, repeating step S302 until the traversal ends and exits; step S303: segmenting the hand region from the background; step S304: outputting the hand region image; specifically, the basic principle of the segmentation is to identify the hand region by the size of the minimal rectangle enclosing a region's contour, with the specific decision criteria as follows:
1. the length of the minimal rectangle enclosing the contour Contour satisfies: len_min < length < len_max;
2. the width of the minimal rectangle enclosing the contour Contour satisfies: hgh_min < height < hgh_max;
only when criteria 1 and 2 are satisfied simultaneously is the region judged to be the hand region; the region is then cut out along its minimal enclosing rectangle, and the binary image after hand-background segmentation is output;
since the amount of data contained in an image is very large, in order to analyse these image data further and effectively, in step S105 the binary hand images are described with the Hu moment invariants, which possess scale invariance, translation invariance and rotation invariance, converting the analysis of the binary image sequence into the analysis of a set of feature vector values and reducing the complexity of the computation; in fact, among the seven Hu invariant moments only M1 and M2 maintain their invariance well in the recognition of objects in pictures, the errors introduced by the other invariant moments being comparatively large; therefore only M1 and M2 are used in S105 to describe the segmented binary hand images; specifically, the invariant moments M1 and M2 are defined as:

$$M_1 = \eta_{20} + \eta_{02}, \qquad M_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$

where

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,(p+q)/2+1}}, \qquad \mu_{pq} = \sum_{x=1}^{N}\sum_{y=1}^{M}(x-\bar{x})^p\,(y-\bar{y})^q\,f(x,y)$$

f(x, y) is the image function and $(\bar{x}, \bar{y})$ is the image centroid, defined as

$$\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}}, \qquad m_{pq} = \sum_{x=1}^{N}\sum_{y=1}^{M}x^p\,y^q\,f(x,y);$$
with the invariant moments M1 and M2, each segmented binary hand image is mapped to the feature vector $v_i = (M_1^{(i)}, M_2^{(i)})$, realizing the reduction of computational complexity;
after the images are mapped from pixel space to feature-vector space, the difference between any two images can be measured by the difference between their feature vectors, and the difference between feature vectors can be characterized by the distance between the two vectors in feature space; in step S106 this distance is expressed by the Euclidean metric, defined as follows: given the Hu feature vectors of two images, $v_1 = (M_1^{(1)}, M_2^{(1)})$ and $v_2 = (M_1^{(2)}, M_2^{(2)})$, their Euclidean metric is

$$d(v_1, v_2) = \sqrt{\left(M_1^{(1)} - M_1^{(2)}\right)^2 + \left(M_2^{(1)} - M_2^{(2)}\right)^2};$$
after the Euclidean metric between the feature vectors of two images has been obtained, an adaptive threshold must be set to judge whether they belong to the same class; if they are of the same class, the same label is attached; if not, different labels are attached; in fact, the signs performed within the 1st second can safely be assumed to belong to one and the same sign class, so the threshold is taken as the Euclidean metric of the feature vectors of the first two frame images, $thre = d(v_2, v_1)$; the specific decision criterion is

$$L_i = \begin{cases} L_{i-1}, & d(v_i, v_{i-1}) \le thre \\ L_{i-1} + 1, & d(v_i, v_{i-1}) > thre \end{cases}$$

where $v_i$ is the feature vector of the i-th frame image and L_i is the label of the i-th frame image, the labels being represented by numbers;
a preliminary classification result is obtained through step S106; however, one difficulty of the traditional frame sequence classification problem is how to remove the frames of transitional movements from the collected frame sequence; in step S107 the frame sequence is finely classified according to the image labels; step S107 further comprises the following steps: step S401: obtaining the label of a new frame sampling image; step S402: judging from the label whether the frame is a transition gesture movement; if so, removing it; if not, grouping it with the frames bearing the same label; step S403: outputting the classification results in frame-sequence order; the specific decision criterion of step S402 is as follows:
suppose the target video sequence has N frames in total; if the digital label L_i of the i-th frame image satisfies $L_i = L_{i-1} + 1$ and $L_i = L_{i+1} - 1$, with 1 < i < N, then that frame is a transition gesture movement and must be removed, that is: for every k-th frame image with k ≥ i, let $L_k \leftarrow L_k - 1$, where L_k is the digital label of the k-th frame image; if the criterion is not satisfied, return to step S401;
finally, the fine classification results are output in frame-sequence order; the classification results are indicated by digital labels, and the frame sequence sharing one digital label represents the set of sample frames of the video clip of one and the same sign.
CN201510254121.3A 2015-05-17 2015-05-17 A sign language video frame sequence classification method based on Hu moments Active CN104866825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510254121.3A CN104866825B (en) 2015-05-17 2015-05-17 A sign language video frame sequence classification method based on Hu moments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510254121.3A CN104866825B (en) 2015-05-17 2015-05-17 A sign language video frame sequence classification method based on Hu moments

Publications (2)

Publication Number Publication Date
CN104866825A CN104866825A (en) 2015-08-26
CN104866825B true CN104866825B (en) 2019-01-29

Family

ID=53912646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510254121.3A Active CN104866825B (en) 2015-05-17 2015-05-17 A sign language video frame sequence classification method based on Hu moments

Country Status (1)

Country Link
CN (1) CN104866825B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191288A (en) * 2021-05-09 2021-07-30 刘明 Artificial intelligence cloud platform system for sign language communication and intelligent medicine box

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010032733A (en) * 2008-07-28 2010-02-12 Asutemu:Kk Finger language image generating system, server, terminal device, information processing method, and program
CN101661556A (en) * 2009-09-25 2010-03-03 哈尔滨工业大学深圳研究生院 Static gesture identification method based on vision
CN101719271A (en) * 2009-11-05 2010-06-02 浙江传媒学院 Video shot boundary detection method based on mixed projection function and support vector machine
CN101853071A (en) * 2010-05-13 2010-10-06 重庆大学 Gesture identification method and system based on visual sense
CN102289666A (en) * 2011-09-02 2011-12-21 广东中大讯通软件科技有限公司 Sign language identifying method based on median filter and Hu moment vector
CN103559498A (en) * 2013-09-24 2014-02-05 北京环境特性研究所 Rapid man and vehicle target classification method based on multi-feature fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9001036B2 (en) * 2007-12-20 2015-04-07 University Of Central Florida Research Foundation, Inc. Systems and methods of camera-based fingertip tracking

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010032733A (en) * 2008-07-28 2010-02-12 Asutemu:Kk Finger language image generating system, server, terminal device, information processing method, and program
CN101661556A (en) * 2009-09-25 2010-03-03 哈尔滨工业大学深圳研究生院 Static gesture identification method based on vision
CN101719271A (en) * 2009-11-05 2010-06-02 浙江传媒学院 Video shot boundary detection method based on mixed projection function and support vector machine
CN101853071A (en) * 2010-05-13 2010-10-06 重庆大学 Gesture identification method and system based on visual sense
CN102289666A (en) * 2011-09-02 2011-12-21 广东中大讯通软件科技有限公司 Sign language identifying method based on median filter and Hu moment vector
CN103559498A (en) * 2013-09-24 2014-02-05 北京环境特性研究所 Rapid man and vehicle target classification method based on multi-feature fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sign language recognition based on median filtering and Hu moment vectors; Hua Bin et al.; Computer Engineering and Design (《计算机工程与设计》); 2011-02-16; Vol. 32, No. 2; Sections 1 and 4

Also Published As

Publication number Publication date
CN104866825A (en) 2015-08-26

Similar Documents

Publication Publication Date Title
CN108710865B (en) Driver abnormal behavior detection method based on neural network
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
CN106022231A (en) Multi-feature-fusion-based technical method for rapid detection of pedestrian
CN111563452B (en) Multi-human-body gesture detection and state discrimination method based on instance segmentation
CN108647625A (en) A kind of expression recognition method and device
CN107480585B (en) Target detection method based on DPM algorithm
CN108960076B (en) Ear recognition and tracking method based on convolutional neural network
CN109886086B (en) Pedestrian detection method based on HOG (histogram of oriented gradient) features and linear SVM (support vector machine) cascade classifier
CN113252614B (en) Transparency detection method based on machine vision
CN107085729B (en) Bayesian inference-based personnel detection result correction method
Vishwakarma et al. Simple and intelligent system to recognize the expression of speech-disabled person
CN112101208A (en) Feature series fusion gesture recognition method and device for elderly people
Alksasbeh et al. Smart hand gestures recognition using K-NN based algorithm for video annotation purposes
Waqar et al. Meter digit recognition via Faster R-CNN
CN104866826A (en) Static gesture language identification method based on KNN algorithm and pixel ratio gradient features
Weerasekera et al. Robust asl fingerspelling recognition using local binary patterns and geometric features
Agrawal et al. A Tutor for the hearing impaired (developed using Automatic Gesture Recognition)
CN102663369B (en) Human motion tracking method on basis of SURF (Speed Up Robust Feature) high efficiency matching kernel
CN104866825B (en) A sign language video frame sequence classification method based on Hu moments
CN103824058A (en) Face recognition system and method based on locally distributed linear embedding algorithm
Işikdoğan et al. Automatic recognition of Turkish fingerspelling
Yamamoto et al. Algorithm optimizations for low-complexity eye tracking
CN109214279A (en) Online human face expression pre-detection method and device based on video
CN113139946A (en) Shirt stain positioning device based on vision

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant