CN104392223B

CN104392223B - Human posture recognition method in two-dimensional video image

Info

Publication number: CN104392223B
Application number: CN201410734845.3A
Authority: CN
Inventors: 王传旭; 刘云; 闫春娟; 崔雪红; 李辉
Original assignee: Qingdao University of Science and Technology
Current assignee: Haier Robotics Qingdao Co ltd
Priority date: 2014-12-05
Filing date: 2014-12-05
Publication date: 2017-07-11
Anticipated expiration: 2034-12-05
Also published as: CN104392223A

Abstract

The invention discloses a human posture recognition method for two-dimensional video images, comprising the following steps: grouping the original video image by scale; for each group, computing one sampled image at a specified scale and computing the HOG descriptor of that sampled image; using the HOG descriptor of the one sampled image in each group to predict the HOG descriptors corresponding to the sampled images at the other specified scales in the group; detecting, from the resulting multi-scale HOG descriptors and a trained SVM classifier, the human target regions at different scales in the original video image; classifying the pixels of the detected human target regions with a trained random forest classifier to determine the body part regions within them; and connecting the body parts to form a human contour, thereby recognizing the human posture. The method speeds up the computation of multi-scale low-level features without reducing detection accuracy, and improves both the speed and the accuracy of posture recognition.

Description

Human posture recognition method in two-dimensional video image

Technical field

The invention belongs to the technical field of image processing and, specifically, relates to a human posture recognition method for two-dimensional video images.

Background art

Human posture recognition can be applied to fields such as human activity analysis, human-computer interaction and visual surveillance, and has recently become a popular problem in computer vision. Human posture recognition refers to detecting the positions of the parts of the human body in an image and computing their orientation and scale. The result of posture recognition may be two-dimensional or three-dimensional, and estimation methods are either model-based or model-free.

Chinese patent application publication No. CN101350064A discloses a method and device for estimating two-dimensional human posture. The method first detects the human region in a two-dimensional image and determines the search ranges of the body parts in the image. Then, within those search ranges, it computes matching similarities against templates of the torso, head, hands, legs and feet to identify each part, and combines the constraint relations between adjacent parts to obtain the two-dimensional human posture. The implementation steps are as follows:

Step 1: Detect the human region in the two-dimensional image using an existing method such as optical flow, frame differencing or background subtraction.

Step 2: Determine the search ranges of the individual body parts within the human region.

(1) Perform face detection in the human region and take the position of the detected face as the search range of the head;

(2) Determine the search ranges of the left and right hands from the skin-color features of the detected face, and then determine the search ranges of the torso, left arm and right arm;

(3) Assign the remainder of the human region as the search range of the left leg, left foot, right leg and right foot.

Step 3: Compute the matching similarity of each body-part template within the corresponding search range to determine the optimal position of each body part, and combine the constraint relations between adjacent body parts to obtain the two-dimensional human posture.

The above method of estimating human posture has the following shortcomings:

First, detecting the human region with existing methods such as optical flow, frame differencing or background subtraction suffers from illumination changes, dynamic background changes and the slow speed of multi-scale optical flow computation. The detected human region therefore often has a large error, which jeopardizes the subsequent body-part detection and can cause the whole algorithm to fail;

Second, locating the head region by face detection fails when the face is partly or fully occluded; moreover, face detection algorithms usually achieve high accuracy only on frontal faces and perform poorly on profile faces;

Third, the method for template matches, which carries out human body identification positioning, can produce the problem of precision is not high, show and regard Human body in frequency image can be because scale size change, the not equal factor of clothing, cause the precision of match cognization algorithm to become Difference, causes human body Wrong localization, whole algorithm is failed.

Summary of the invention

The object of the invention is to provide a human posture recognition method for two-dimensional video images that offers high recognition accuracy and high recognition speed.

To achieve the above object, the invention adopts the following technical solution:

A human posture recognition method in a two-dimensional video image, the method comprising the following steps:

a. According to the scale-space layering principle, divide the original video image $I$ into $O$ groups, where $O = [\log_2(\min(M, N))] - 3$ and $(M, N)$ is the resolution of the original video image;

b. For each group of video images, compute one sampled image $I_S = R(I, S)$ at scale $S$, where $S$ is one of the scales in $2^{i-1}(\sigma, k\sigma, k^2\sigma, \ldots, k^{n-1}\sigma)$, $R(\cdot)$ denotes the sampling function, $i$ denotes the $i$-th group of video images, $i = 1, 2, \ldots, O$, $\sigma = (M, N)$ is the resolution of the original video image, $n$ is a set natural number greater than 1 that represents the number of sampled video images contained in each group, and $k = 2^{1/n}$;

c. For the sampled image $I_S = R(I, S)$ in each group, compute its HOG low-level feature descriptor $f_\Omega(I_S)$;

d. Taking the HOG low-level feature descriptor of the one sampled image in each group obtained in step c as the basis, compute, according to the prediction formula, the HOG low-level feature descriptors corresponding to the sampled video images at the remaining $(n-1)$ scales in $2^{i-1}(\sigma, k\sigma, k^2\sigma, \ldots, k^{n-1}\sigma)$ in each group, where $S_1$ and $S_2$ denote the scales of sampled images $I_{S_1}$ and $I_{S_2}$ respectively, and $\lambda_\Omega$ is a set value;

e. Using the HOG low-level feature descriptors of all sampled video images at the different scales from steps c and d, together with a trained SVM, detect the human target regions in the original video image;

f. Classify the pixels of the human target regions detected in step e using a trained random forest classifier, and determine the body part regions within the human target regions;

g. Connect the body parts determined in step f to form a human contour, thereby recognizing the human posture.

Preferably, in step b, each group of video images is sampled at the end scale of $2^{i-1}(\sigma, k\sigma, k^2\sigma, \ldots, k^{n-1}\sigma)$, yielding the sampled image $I_S = R(I, S)$ corresponding to the end scale.

In the human posture recognition method described above, the random forest classifier of step f is preferably trained as follows:

Obtain artificially synthesized video images containing human postures and real video images from the target detection scene, each video image serving as one training sample;

Label the background region and the human target region in each training sample according to the set body parts;

Compute the pixel features of each labeled region using the SURF operator; all labeled regions and their pixel feature data constitute the training data set;

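Train the random forest classifier using the training data set and the objective function $f(m) = \alpha\, E(h_l\{T_S(m)\}) + (1-\alpha)\, d_{\chi^2}(h_S\{T_S(m)\}, h_S\{T_R(m)\})$;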

where $m$ is a decision-tree classification node in the random forest, $\alpha$ is a weight, $E(\cdot)$ is the information entropy function, $T_S(m)$ is the pixel features of the labeled regions in the synthesized video image training samples, $T_R(m)$ is the pixel features of the labeled regions in the real video image training samples, $h_l\{T_S(m)\}$ is the statistical descriptor of the pixel features of the $l$-th labeled body part in the synthesized video image training samples, $h_S\{T_S(m)\}$ is the statistical descriptor of all pixel features in all labeled regions of the synthesized video image training samples, $h_S\{T_R(m)\}$ is the statistical descriptor of all pixel features in all labeled regions of the real video image training samples, and $d_{\chi^2}$ is the $\chi^2$ distance between $h_S\{T_S(m)\}$ and $h_S\{T_R(m)\}$.

Compared with the prior art, the advantages and positive effects of the invention are:

(1) When human targets are detected in the original video image with the multi-scale HOG low-level feature extraction method, only the HOG low-level feature descriptor of one sampled image needs to be computed in each group after grouping; the descriptors of the remaining sampled images are obtained by feature prediction. This speeds up the computation of multi-scale low-level features without reducing detection accuracy, and addresses the heavy computation and insufficient real-time performance that have kept multi-scale human target detection from practical application.

(2) Body parts are classified with a random forest classifier whose decision-tree nodes are trained with a new objective function, so that the weak classifiers keep a consistent spatial activation pattern when generalizing from the training sample space to the test sample space. The classifier can therefore be trained mainly on human posture video samples synthesized by computer graphics, supplemented by a small number of labeled real human posture videos. This achieves generalization from synthesized posture samples to real posture features and reduces the requirements on training samples.

Other features and advantages of the invention will become clearer after reading the specific embodiments in conjunction with the accompanying drawing.

Brief description of the drawings

Fig. 1 is a flow chart of one embodiment of the human posture recognition method in a two-dimensional video image according to the invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawing and embodiments.

First, the general approach by which the invention recognizes human posture is briefly explained:

Recognizing human posture from a two-dimensional video image is divided into two steps. The first step detects the human target region in the original video image. The second step classifies the human target region to identify body parts, namely joint parts such as the head, hands, elbows, shoulders, hips, knees and feet, and connects the body parts to form a human contour, thereby recognizing the posture. In the invention, the first step uses the multi-scale HOG low-level feature extraction method, which reduces the influence of background, illumination and the like and preserves scale invariance; the feature extraction method is also improved to raise real-time performance. The second step identifies body parts with random forest classification trees, which improves classification accuracy; the objective function of the classification trees is also improved, which raises the generalization ability of the classifier and reduces the complexity of the training samples required for classifier training. The specific implementation is described below.

Refer to Fig. 1, which shows the flow chart of one embodiment of the human posture recognition method in a two-dimensional video image according to the invention.

As shown in Fig. 1, this embodiment recognizes human posture through the following steps:

Step 101: Divide the original video image into multiple groups of images according to the scale-space layering principle.

According to the scale-space layering principle, the original video image $I$ is divided into $O$ groups, where $O = [\log_2(\min(M, N))] - 3$ and $(M, N)$ is the resolution of the original video image $I$. The principle and method of layering a video image by scale space are prior art and are not described in detail here.
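The group count can be illustrated with a short sketch. It assumes that the square brackets in the formula denote rounding down (floor), which the text does not state explicitly; the function name `num_groups` is hypothetical.

```python
import math

def num_groups(width: int, height: int) -> int:
    """Number of groups O = floor(log2(min(M, N))) - 3 (floor is an assumption)."""
    return math.floor(math.log2(min(width, height))) - 3

# Example: a 640 x 480 frame gives floor(log2(480)) - 3 = 8 - 3 = 5 groups.
groups_vga = num_groups(640, 480)
```

Under this reading, doubling the shorter side of the frame adds one group.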

Step 102: Compute one sampled image at a particular scale in each group, and compute the HOG low-level feature descriptor of that sampled image.

Each group of video images is sampled to compute one sampled image $I_S = R(I, S)$ at scale $S$. The scale $S$ is a particular scale: specifically, $S$ is one of the scales in $2^{i-1}(\sigma, k\sigma, k^2\sigma, \ldots, k^{n-1}\sigma)$, and preferably the end scale. Here $R(\cdot)$ denotes the sampling function, $i$ denotes the $i$-th group of video images, $i = 1, 2, \ldots, O$, $\sigma = (M, N)$ is the resolution of the original video image, $n$ is a set natural number greater than 1 that represents the number of sampled video images contained in each group, and $k = 2^{1/n}$. Typically $n$ takes a value of 5 to 8, meaning that each group of video images contains 5 to 8 layers of sampled video images.
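The scale layout of one group can be sketched as follows. This is only an illustration: it assumes that σ is treated as a single scalar base scale (the text defines σ through the resolution (M, N)), and the function name `group_scales` is hypothetical.

```python
def group_scales(i, sigma, n):
    """Scales 2^(i-1) * (sigma, k*sigma, ..., k^(n-1)*sigma) with k = 2^(1/n)."""
    k = 2.0 ** (1.0 / n)
    return [2.0 ** (i - 1) * sigma * k ** j for j in range(n)]

# Group 1 with n = 5 layers; sigma = 1.0 is an assumed scalar base scale.
scales = group_scales(i=1, sigma=1.0, n=5)
end_scale = scales[-1]  # the end scale, preferred for sampling in step 102
```

Because $k^n = 2$, multiplying the end scale of one group by $k$ gives the first scale of the next group, so consecutive groups tile the scale axis without gaps.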

Then, the HOG (Histogram of Oriented Gradients) low-level feature descriptor of the sampled image at the selected scale in each group is computed. The descriptor can be computed with prior-art methods, which are not described in detail here.
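For readers unfamiliar with HOG, its core building block, a magnitude-weighted histogram of gradient orientations over one cell, can be sketched as below. This is a deliberately simplified illustration and not the descriptor used in the patent: it covers a single cell, uses 9 unsigned orientation bins as an assumed setting, and omits block normalization and bin interpolation.

```python
import math

def cell_orientation_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell.

    `cell` is a list of rows of grayscale values; border pixels are skipped
    because central differences need both neighbours.
    """
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            index = min(int(angle // bin_width), bins - 1)
            hist[index] += magnitude
    return hist

# A horizontal intensity ramp has purely horizontal gradients (orientation 0 degrees).
ramp = [[float(x) for x in range(4)] for _ in range(4)]
ramp_hist = cell_orientation_histogram(ramp)
```

A full descriptor would concatenate such histograms over all cells of a detection window after normalizing them in overlapping blocks.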

Step 103: Compute the HOG low-level feature descriptors of the sampled video images at the other particular scales in each group by a prediction algorithm.

For each group of video images, the HOG low-level feature descriptor of one sampled image has been computed in step 102. Based on that computed descriptor, the HOG low-level feature descriptors of the sampled video images at the other particular scales are then predicted.

Specifically, the other particular scales are the remaining $(n-1)$ scales in $2^{i-1}(\sigma, k\sigma, k^2\sigma, \ldots, k^{n-1}\sigma)$ other than the scale whose HOG low-level feature descriptor was computed in step 102. The HOG low-level feature descriptors of the sampled video images at the other particular scales are predicted with a formula in which $S_1$ and $S_2$ denote the scales of sampled images $I_{S_1}$ and $I_{S_2}$ respectively, $\lambda_\Omega$ is a set value, $f_\Omega(I_{S_1})$ is the HOG low-level feature descriptor of sampled image $I_{S_1}$, and $f_\Omega(I_{S_2})$ is the HOG low-level feature descriptor of sampled image $I_{S_2}$.

Here $\lambda_\Omega$, which acts as a power exponent, is a set value that can be determined empirically by fitting and verification. In this embodiment, the preferred value of $\lambda_\Omega$ is 0.0042.

In the above formula the power exponent $\lambda_\Omega$ is a fixed value, and one scale and its corresponding HOG low-level feature descriptor have been computed in step 102. For any other specified scale, the corresponding HOG low-level feature descriptor can then be computed conveniently from the formula. In the same way, the HOG low-level feature descriptors for all remaining scales in the group are easily computed, giving the HOG low-level feature descriptors of the sampled video images contained in all groups.
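The prediction step can be sketched as follows. The exact prediction formula is not reproduced in this text, so the sketch assumes a power law of the form $f_\Omega(I_{S_1}) = f_\Omega(I_{S_2}) \cdot (S_1 / S_2)^{-\lambda_\Omega}$; this form, including the sign of the exponent, is an assumption borrowed from the power-law scaling commonly used for fast feature pyramids, and the function names are hypothetical.

```python
LAMBDA = 0.0042  # preferred value of the power exponent given in this embodiment

def predict_descriptor(f_ref, s_ref, s_target, lam=LAMBDA):
    """Predict a descriptor at s_target from the one computed at s_ref.

    ASSUMPTION: f(I_target) = f(I_ref) * (s_target / s_ref) ** (-lam).
    The source only states that lam is a power exponent relating the two scales.
    """
    ratio = (s_target / s_ref) ** (-lam)
    return [value * ratio for value in f_ref]

def predict_group(f_ref, s_ref, scales, lam=LAMBDA):
    """Predict descriptors for every scale in a group except the reference scale."""
    return {s: predict_descriptor(f_ref, s_ref, s, lam) for s in scales if s != s_ref}

# The descriptor computed at the end scale (4.0 here) yields the other two scales.
predicted = predict_group([1.0, 2.0, 3.0], s_ref=4.0, scales=[1.0, 2.0, 4.0])
```

Whatever the exact formula, the saving comes from replacing one image resampling and one full HOG computation per scale with a single multiplication per descriptor element.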

Step 104: Using the HOG low-level feature descriptors of all sampled video images at the different scales, together with a trained SVM, detect the human target regions in the video image.

With the HOG low-level feature descriptors of the sampled video images in all groups, computed in steps 102 and 103, the human target regions at different scales can be detected. The specific method of detecting human target regions with HOG low-level feature descriptors and a trained SVM can follow the prior art and is not detailed here.
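As an illustration of the prior-art detection step, the decision rule of a linear SVM applied to window descriptors can be sketched as below. The sketch assumes a linear kernel, which the text does not specify; the weights, bias and candidate windows are hypothetical values, not trained parameters.

```python
def detect_windows(candidates, weights, bias, threshold=0.0):
    """Keep the candidate windows that a linear SVM scores above the threshold.

    `candidates` is a list of (scale, box, descriptor) tuples, where `box` is
    (x, y, width, height) and `descriptor` is the window's HOG feature vector.
    """
    detections = []
    for scale, box, descriptor in candidates:
        score = sum(w * x for w, x in zip(weights, descriptor)) + bias
        if score > threshold:
            detections.append((scale, box, score))
    return detections

# Hypothetical 2-dimensional descriptors, for illustration only.
weights, bias = [1.0, -1.0], -0.5
candidates = [
    (1.0, (0, 0, 64, 128), [2.0, 0.5]),    # score 1.0, kept
    (2.0, (10, 10, 64, 128), [0.2, 0.4]),  # score -0.7, rejected
]
detections = detect_windows(candidates, weights, bias)
```

Running the same rule over the descriptors of every scale in every group gives detections of human targets of different sizes.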

Step 105: Classify the pixels of the human target region with a random forest classifier to determine the body part regions.

After step 104 has determined the human target region, a trained random forest classifier classifies the pixels of that region to determine the body part regions. The input of the random forest classifier is the features of a pixel. The classifier parameters are selected, including the number of decision trees in the forest, the number of attributes randomly selected at each internal node and the minimum number of samples at a terminal node. The pixel features of the human target region are fed to the classifier as input, and the classifier outputs the body part region to which each pixel belongs, thereby determining the body part regions. In this embodiment, the SURF (Speeded-Up Robust Features) operator is used to compute pixel features, and each pixel feature can be constructed as a 128-dimensional descriptor. The body part regions comprise seven joint parts of the human body: foot, knee, hip, shoulder, elbow, hand and head.
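The per-pixel classification can be sketched as a majority vote over the trees of the forest. This is a minimal illustration, not the trained classifier of the embodiment: each tree is represented by a hypothetical callable that maps a feature vector to a body part label, and the toy features have 2 dimensions instead of 128.

```python
from collections import Counter

PARTS = ["foot", "knee", "hip", "shoulder", "elbow", "hand", "head"]

def forest_predict(trees, feature):
    """Return the body part label that receives the most votes from the trees."""
    votes = Counter(tree(feature) for tree in trees)
    return votes.most_common(1)[0][0]

def label_region(trees, pixel_features):
    """Map every pixel (x, y) of a human target region to a body part label."""
    return {pixel: forest_predict(trees, feat) for pixel, feat in pixel_features.items()}

# Three hypothetical stand-in trees (single-threshold rules) for illustration.
trees = [
    lambda f: "head" if f[0] > 0.5 else "foot",
    lambda f: "head" if f[1] > 0.5 else "foot",
    lambda f: "head",
]
labels = label_region(trees, {(3, 1): [0.9, 0.1], (3, 9): [0.1, 0.1]})
```

Pixels that share a label form one body part region; the number of trees, the number of attributes tried per node and the minimum leaf size are the parameters the paragraph above refers to.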

Step 106: Connect the body parts to form a human contour, thereby recognizing the human posture.

After step 105 has determined the body parts, they are connected in the order head, shoulder, hip, knee, foot to form the trunk, and the elbows and hands are then connected on both sides. The human contour is thus identified, realizing human posture recognition based on a human joint model.
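The connection order described above can be sketched as a small skeleton builder. As simplifying assumptions, each body part is represented by the centroid of its labeled pixels and the left and right sides are not distinguished; the names `TRUNK_CHAIN`, `ARM_CHAIN` and `build_skeleton` are hypothetical.

```python
TRUNK_CHAIN = ["head", "shoulder", "hip", "knee", "foot"]
ARM_CHAIN = ["shoulder", "elbow", "hand"]

def centroid(pixels):
    """Mean (x, y) position of a list of pixel coordinates."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def build_skeleton(part_pixels):
    """Return part centers and the edges linking consecutive parts in each chain."""
    centers = {part: centroid(px) for part, px in part_pixels.items() if px}
    edges = []
    for chain in (TRUNK_CHAIN, ARM_CHAIN):
        for a, b in zip(chain, chain[1:]):
            if a in centers and b in centers:   # skip parts that were not detected
                edges.append((a, b))
    return centers, edges

# Hypothetical pixel sets for the seven body parts.
parts = {
    "head": [(5, 0), (5, 2)], "shoulder": [(5, 4)], "hip": [(5, 10)],
    "knee": [(5, 14)], "foot": [(5, 18)], "elbow": [(2, 7)], "hand": [(1, 10)],
}
centers, edges = build_skeleton(parts)
```

Drawing the edges between the centers gives the stick-figure contour from which the posture is read.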

In this embodiment, although HOG low-level feature descriptors are used to detect the human target region, the original video image is only grouped and the number of sampled video images in each group, that is, the number of layers per group, is fixed. The low-level feature computation function is used to compute the HOG low-level feature descriptor of only one sampled image in each group; the descriptors of the sampled images at the other scales in the group are obtained with the prediction algorithm of step 103, whose computational complexity and cost are far lower than those of the feature computation function. Moreover, the prediction algorithm yields the HOG low-level feature descriptor of a sampled video image directly, without computing the sampled video image for that scale, which further reduces computation. This improves the speed and real-time performance of HOG-based human target detection and addresses the heavy computation and insufficient real-time performance that have kept multi-scale human target detection from practical application.

In machine learning, a random forest is a classifier comprising multiple decision trees. The main reason for using it in posture recognition is its high classification accuracy, along with four further factors: first, its learning process is very fast; second, the complexity of the algorithm can be controlled adaptively through the depth of the internal decision trees; third, while the forest is being built, it produces an internal unbiased estimate of the generalization error; fourth, it tolerates outliers and noise well and is not prone to overfitting. Its main drawback is that it requires the training data to be similar to the test data, that is, both must follow the same distribution, which limits the generalization ability of the classifier. Obtaining a high-accuracy random forest classifier therefore requires the training samples to cover all possible variations of future test data. In real test scenes, however, factors such as viewpoint changes, twisting of the limbs, variations in clothing texture and illumination changes make it impossible to obtain sufficient training samples.

To address the above drawback of the random forest classifier, this embodiment improves the objective function used to train the decision-tree nodes of the classifier, so that the weak classifiers keep a consistent spatial activation pattern when generalizing from the training sample space to the test sample space. Training then needs only a few weakly labeled samples from the target detection space; the other training data can be human posture video samples synthesized by computer graphics, which reduces the requirements on training samples. The specific training process is as follows:

Obtain artificially synthesized video images containing human postures and real video images from the target detection scene, each video image serving as one training sample. The synthesized video images form the main body of the training set, supplemented by a small number of real video images from the target detection scene in which the body parts and background have been labeled.

The background region and the human target region in each training sample are labeled according to the set body parts. Specifically, labeling follows the positions of the human joints and yields eight parts: one part is the background, and the remaining seven are the foot, knee, hip, shoulder, elbow, hand and head.

Each pixel feature in each labeled region is computed with the SURF operator; all labeled regions and their corresponding pixel feature data constitute the training data set. Specifically, the SURF operator is used to compute each pixel feature in each labeled region of the synthesized video image training samples and the real video image training samples, and each pixel feature is constructed as a 128-dimensional descriptor. The pixel features of the labeled regions in the synthesized video image training samples are denoted $T_S(m)$, and those in the real video image training samples are denoted $T_R(m)$; $T_S(m)$ and $T_R(m)$ constitute the training data set, where $m$ is a classification node of a decision tree in the random forest. At the same time, the statistical descriptor $h_S\{T_S(m)\}$ of all 128-dimensional SURF descriptors in all labeled regions of the synthesized video image training samples and the statistical descriptor $h_S\{T_R(m)\}$ of all 128-dimensional SURF descriptors in all labeled regions of the real video image training samples are computed.

Finally, the random forest classifier is trained with the above training data set and the improved objective function $f(m)$, whose expression is:

$$f(m) = \alpha\, E(h_l\{T_S(m)\}) + (1-\alpha)\, d_{\chi^2}(h_S\{T_S(m)\}, h_S\{T_R(m)\})$$

In the above formula, $\alpha$ is a weight, a fixed value determined by experiment at which the recognition performance of the classifier is best. $E(\cdot)$ is the information entropy function, whose specific expression follows the prior art. $h_l\{T_S(m)\}$ is the statistical descriptor of all pixel features in the $l$-th labeled body part of the synthesized video image training samples, and $d_{\chi^2}$ is the $\chi^2$ distance between $h_S\{T_S(m)\}$ and $h_S\{T_R(m)\}$.

The objective function above considers both the entropy of the training samples, $E(h_l\{T_S(m)\})$, and the information difference between the training data and the target detection data, $d_{\chi^2}(h_S\{T_S(m)\}, h_S\{T_R(m)\})$. Their weighted sum serves as the objective function for training the decision trees, which improves the generalization ability of the trained classifier. A higher recognition accuracy is obtained when the trained classifier is used to identify body parts.
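The weighted objective $f(m) = \alpha E(h_l\{T_S(m)\}) + (1-\alpha)\, d_{\chi^2}(h_S\{T_S(m)\}, h_S\{T_R(m)\})$ can be illustrated with a short sketch. Several points are assumptions: the statistical descriptors are taken to be histograms, the entropy is Shannon entropy in bits, the $\chi^2$ distance uses the common definition with a factor of 1/2, and the value of `alpha` in the example is arbitrary because the preferred value is not given in this text.

```python
import math

def entropy(hist):
    """Shannon entropy (in bits) of a histogram with non-negative counts."""
    total = sum(hist)
    probs = [h / total for h in hist if h > 0]
    return -sum(p * math.log2(p) for p in probs)

def chi2_distance(p, q):
    """Chi-square distance 0.5 * sum((p_i - q_i)^2 / (p_i + q_i)), skipping empty bins."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def node_objective(h_l, h_synth, h_real, alpha):
    """Weighted sum of training-sample entropy and synthetic/real domain gap."""
    return alpha * entropy(h_l) + (1 - alpha) * chi2_distance(h_synth, h_real)

# Hypothetical normalized histograms; alpha = 0.5 is an arbitrary example value.
value = node_objective(h_l=[1, 1], h_synth=[1.0, 0.0], h_real=[0.0, 1.0], alpha=0.5)
```

With a larger `alpha` the node favors pure label distributions on the synthesized data; with a smaller `alpha` it favors splits on which synthesized and real features look alike.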

The above objective function uses the $\chi^2$ distance to represent the information difference between the training data and the target detection data, but is not limited to this; the Euclidean distance or another distance may also be used to represent the difference between the two.

The above embodiments merely illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art can still modify the technical solutions described in those embodiments or make equivalent replacements of some of their technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solution claimed by the invention.

Claims

1. A human posture recognition method in a two-dimensional video image, characterized in that the method comprises the following steps:

a. according to the scale-space layering principle, dividing the original video image $I$ into $O$ groups, where $O = [\log_2(\min(M, N))] - 3$ and $(M, N)$ is the resolution of the original video image;

b. sampling each group of video images to compute one sampled image $I_S = R(I, S)$ at scale $S$, where $S$ is one of the scales in $2^{i-1}(\sigma, k\sigma, k^2\sigma, \ldots, k^{n-1}\sigma)$, $i$ denotes the $i$-th group of video images, $i = 1, 2, \ldots, O$, $\sigma = (M, N)$ is the resolution of the original video image, $n$ is a set natural number greater than 1 that represents the number of sampled video images contained in each group, and $k = 2^{1/n}$;

c. for the sampled image $I_S = R(I, S)$ in each group, computing its HOG low-level feature descriptor $f_\Omega(I_S)$;

d. taking the HOG low-level feature descriptor of the one sampled image in each group obtained in step c as the basis, computing, according to the prediction formula, the HOG low-level feature descriptors corresponding to the sampled video images at the remaining $(n-1)$ scales in $2^{i-1}(\sigma, k\sigma, k^2\sigma, \ldots, k^{n-1}\sigma)$ in each group, where $S_1$ and $S_2$ denote the scales of sampled images $I_{S_1}$ and $I_{S_2}$ respectively, $\lambda_\Omega$ is a set value, $f_\Omega(I_{S_1})$ is the HOG low-level feature descriptor of sampled image $I_{S_1}$, and $f_\Omega(I_{S_2})$ is the HOG low-level feature descriptor of sampled image $I_{S_2}$;

e. using the HOG low-level feature descriptors of all sampled video images at the different scales from steps c and d, together with a trained SVM, detecting the human target regions at the different scales in the original video image;

f. classifying the pixels of the human target regions detected in step e using a trained random forest classifier, and determining the body part regions within the human target regions;

2. The human posture recognition method in a two-dimensional video image according to claim 1, characterized in that, in step b, each group of video images is sampled at the end scale of $2^{i-1}(\sigma, k\sigma, k^2\sigma, \ldots, k^{n-1}\sigma)$, yielding the sampled image $I_S = R(I, S)$ corresponding to the end scale.

3. The human posture recognition method in a two-dimensional video image according to claim 1, characterized in that the random forest classifier in step f is trained as follows:

computing the pixel features of each labeled region using the SURF operator, all labeled regions and their pixel feature data constituting the training data set;

training the random forest classifier using the training data set and the objective function $f(m) = \alpha\, E(h_l\{T_S(m)\}) + (1-\alpha)\, d_{\chi^2}(h_S\{T_S(m)\}, h_S\{T_R(m)\})$;

where $m$ is a decision-tree classification node in the random forest, $\alpha$ is a weight, $E(\cdot)$ is the information entropy function, $T_S(m)$ is the pixel features of the labeled regions in the synthesized video image training samples, $T_R(m)$ is the pixel features of the labeled regions in the real video image training samples, $h_l\{T_S(m)\}$ is the statistical descriptor of the pixel features of the $l$-th labeled body part in the synthesized video image training samples, $h_S\{T_S(m)\}$ is the statistical descriptor of all pixel features in all labeled regions of the synthesized video image training samples, $h_S\{T_R(m)\}$ is the statistical descriptor of all pixel features in all labeled regions of the real video image training samples, and $d_{\chi^2}$ is the $\chi^2$ distance between $h_S\{T_S(m)\}$ and $h_S\{T_R(m)\}$.