CN108154104B

CN108154104B - Human body posture estimation method based on depth image super-pixel combined features

Info

Publication number: CN108154104B
Application number: CN201711395472.1A
Authority: CN
Inventors: 孔德慧; 张雯晖; 王少帆; 王玉萍; 尹宝才
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-12-21
Filing date: 2017-12-21
Publication date: 2021-10-15
Anticipated expiration: 2037-12-21
Also published as: CN108154104A

Abstract

The invention discloses a human body posture estimation method based on a depth image superpixel combined feature. By adopting the technical scheme of the invention, the accuracy of the human body posture estimation is improved, and the real-time performance of the posture estimation method is improved.

Description

Human body posture estimation method based on depth image super-pixel combined features

Technical Field

The invention belongs to the field of computer vision and pattern recognition, and particularly relates to a human body posture estimation method based on a depth image superpixel combined feature.

Background

Human body part segmentation and skeletal joint localization together constitute the human pose estimation problem, a fundamental task in computer vision and human-computer interaction. Pose estimation is widely applied in action recognition, animation simulation, gait analysis, content-based video and image retrieval, intelligent video surveillance, and similar areas. With the development of depth acquisition devices such as Kinect sensors and time-of-flight (TOF) cameras, much research has shifted from conventional color or grayscale intensity images to depth images. Compared with color images, depth images are largely unaffected by variations in illumination, appearance, and background noise.

Because the human body is an articulated structure with many degrees of freedom, a constrained parameter space, self-similar parts, and self-occlusion, modeling the human skeleton directly is very difficult. The challenge of human pose estimation lies in constructing a sufficiently expressive representation of the human joints and computing joint positions from markerless data; real-time application requirements aggravate the difficulty further.

In terms of input data, human pose estimation methods are divided into single-frame methods and image-sequence methods. Compared with sequence-based methods, single-frame methods accumulate no error, need no error recovery, and obtain the required pose directly from a single image; however, lacking the temporal context of the motion, they are prone to misjudging ambiguous poses.

In terms of approach, early methods mainly model the human body and search the body state space to match and align the model with image features. Examples include the widely used iterative closest point method and methods that fit a human model to head, torso, and limb detectors using a Markov chain. Such fitting-and-matching methods usually require an initialization step and a model that fits the real human body, and they are computationally expensive. Machine-learning methods have gradually come into use: random decision forests (RDF), support vector machines (SVM), K-nearest-neighbor classification (KNN), deep learning, and others have all been applied to human pose estimation. Learning-based methods need no prior human body model, but they depend on a training set that is sufficiently large and diverse, which increases training time; whether effective, accurate, and stable feature descriptors can be extracted is a further major challenge.

The real-time human pose estimation method proposed by Shotton et al. in 2011 was applied successfully in Kinect. It combines depth difference features with random decision forests, classifies human body parts, and finally regresses the skeleton points by mean-shift clustering. One issue with random decision forests, however, is that the more trees a forest contains, the more stable its results, but as the number of trees grows the training and testing time also increase, which limits the size of the forest in real-time applications. Meanwhile, as technology advances, the resolution of depth images keeps increasing, so the number of pixels to be processed multiplies and the real-time demands on the processing method grow accordingly.

Superpixel segmentation is an image segmentation technique that groups semantically similar pixels into superpixel regions, so that an image can be processed block by block instead of pixel by pixel. This can improve the efficiency of complex image-processing programs by orders of magnitude and makes more complex feature computations feasible in applications with real-time requirements. Simple Linear Iterative Clustering (SLIC) is an excellent superpixel segmentation method: its segments are compact and similar in size, and they preserve neighborhood relationships better than those of other superpixel methods. Few methods perform SLIC-based superpixel segmentation on depth maps. Some measure the distance between pixels directly by the Euclidean distance in the three-dimensional point cloud space, which yields superpixel blocks that differ greatly from one another; some add gradient directions to improve semantic segmentation, at the cost of more computation; still others segment superpixels using a combination of color and depth, which requires additional input information.

Disclosure of Invention

The invention aims to extract human body features quickly and accurately and to compute with them efficiently, improving both the accuracy and the real-time performance of human pose estimation. The invention provides a human pose estimation method based on combined superpixel features. It extracts a novel superpixel-based combined feature that fuses a superpixel depth difference feature with a superpixel geodesic distance feature, applies this combined feature in a random decision forest to segment human body parts, and then estimates the human pose by applying sparse regression to the K-means cluster points of each segmented part. The technical problems to be solved are: fast and effective superpixel partition; human body part segmentation based on the superpixel combined feature; and sparse regression of the human pose based on body part clustering.

In order to achieve the purpose, the invention adopts the following technical scheme:

The human pose estimation method based on depth image superpixel combined features takes a single depth image containing a human body as input, extracts human pose features from the depth image, uses the features to segment the human body parts, clusters the segmented parts, and applies sparse regression to the result to estimate the positions of the human skeleton points. The overall framework comprises the following steps:

Step (1), μSLIC superpixel partition:

The invention uses a simple linear iterative clustering method with weight μ (μSLIC) to partition a depth map into superpixels. The method has two stages. First, initialization: for a depth image $I$, the depth space $(u, v, d(u, v))$ is converted into the corresponding three-dimensional point cloud space $(x, y, z)$, where $u$ and $v$ are the two-dimensional image coordinates, $d(u, v)$ is the depth value at position $(u, v)$, and $x$, $y$, $z$ are the three-dimensional coordinates. The depth image is divided uniformly by a $\delta \times \delta$ grid into $N_s$ seeds; the pixels in each grid cell are added to the cluster centered on that cell's seed point, and the new position of the seed point is updated to the geometric center obtained as the average of all pixels in the cluster. Then the iteration stage: for every seed point, within its $3\delta \times 3\delta$ neighborhood, the distance $D_s$ between each pixel and the seed point is measured, each pixel is assigned to the cluster of its nearest seed point, and the new positions of the $N_s$ seed points are updated. This is repeated until the process converges or the maximum number of iterations $N_i$ is reached.

The distance measure $D_s$ between pixel $X_k(x_k, y_k, z_k)$ and the $i$-th cluster center point is designed as follows:

where $\mu$ is the weight that adjusts compactness in the superpixel.
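The initialization stage can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a pinhole camera model for the conversion from depth space to point cloud space, and the intrinsics `fx`, `fy`, `cx`, `cy` as well as the helper names `depth_to_point_cloud` and `init_seeds` are hypothetical.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map d(u, v) to 3D points (x, y, z).

    Assumes a pinhole camera; fx, fy, cx, cy are illustrative intrinsics.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column, v: row
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # shape (h, w, 3)

def init_seeds(points, delta):
    """One seed per delta x delta grid cell, placed at the mean 3D point of the cell."""
    h, w, _ = points.shape
    seeds = []
    for r in range(0, h, delta):
        for c in range(0, w, delta):
            cell = points[r:r + delta, c:c + delta].reshape(-1, 3)
            seeds.append(cell.mean(axis=0))
    return np.array(seeds)

depth = np.full((24, 36), 2.0)  # toy flat depth map, 2 units everywhere
cloud = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=18.0, cy=12.0)
seeds = init_seeds(cloud, delta=12)  # 2 x 3 grid cells, so 6 seeds
```

The iteration stage would then repeatedly reassign pixels within each seed's 3δ × 3δ neighborhood and recompute the seed positions in the same way.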

Step (2), human body part segmentation with the SDDF + SGDF superpixel combined feature:

A combined feature (SDDF + SGDF), built from the Superpixel Depth Difference Feature (SDDF) and the Superpixel Geodesic Distance Feature (SGDF), is applied. For a superpixel $\chi_s$, a group of offsets is obtained by random uniform sampling within a circle of radius $\Theta$ centered on the geometric center of $\chi_s$. On image $I$, the $N_{SDDF}$ values $f_\theta$ of the SDDF and the $N_{SGDF}$ values $g_\theta$ of the SGDF are combined to obtain a feature of the superpixel $\chi_s$ with dimension $N_{SDDF} + N_{SGDF}$:

1) Superpixel-based depth difference feature $f_\theta$: for a superpixel $\chi_s$ segmented in the depth map $I$, an offset $\theta$ is generated in advance by random uniform sampling within the circle of radius $\Theta$ centered on its geometric center, and the depth difference feature value is:

where $d_I$ denotes the depth value at a given pixel position.

2) Superpixel-based geodesic distance feature $g_\theta$: first, an undirected graph is built from the segmented superpixels that contain foreground, with the foreground superpixels as vertices. Edges are then added between vertices according to two rules. ① If two superpixels $\chi_i$ and $\chi_j$ have directly adjacent pixels and the absolute depth difference between those adjacent pixels is less than $\delta_d$, an edge is added between the two superpixels, and its weight $\mathrm{dist}(\chi_i, \chi_j)$ is the Euclidean distance between the centers of the two superpixels. ② For an isolated vertex $\chi_i$ that has no edge to any other vertex under the first rule, the superpixel with the minimum distance to $\chi_i$, $\chi_c = \arg\min_{\chi} \mathrm{dist}(\chi_i, \chi)$, is found, and an edge of weight $\mathrm{dist}(\chi_i, \chi_c)$ is added between $\chi_i$ and $\chi_c$. Applying the Floyd-Warshall algorithm to the resulting connected undirected graph gives the shortest distance between every pair of superpixels.

For an offset $\theta$, sampled randomly in the same way as for the depth difference feature, the geodesic distance feature value of the superpixel $\chi_s$ is expressed as:

where $SP_I(X)$ denotes the superpixel to which pixel $X$ belongs in the image, and $d_{geodesic}$ denotes the shortest distance between two vertices on the undirected graph above, i.e., the geodesic distance within the set of superpixels.
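The graph construction and the all-pairs shortest-path step can be sketched as follows. This is an illustration under stated assumptions: superpixel centers are given as 3D tuples, the list `adjacent_pairs` is assumed to already contain the pairs that satisfy the first rule, and the helper names `build_graph` and `floyd_warshall` are hypothetical.

```python
import math

def build_graph(centers, adjacent_pairs):
    """Distance matrix of the superpixel graph; math.inf marks a missing edge."""
    n = len(centers)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    # Rule 1: edge weight is the Euclidean distance between superpixel centers.
    for i, j in adjacent_pairs:
        dist[i][j] = dist[j][i] = math.dist(centers[i], centers[j])
    # Rule 2: connect each isolated vertex to its nearest other superpixel.
    for i in range(n):
        others = [j for j in range(n) if j != i]
        if all(math.isinf(dist[i][j]) for j in others):
            c = min(others, key=lambda j: math.dist(centers[i], centers[j]))
            dist[i][c] = dist[c][i] = math.dist(centers[i], centers[c])
    return dist

def floyd_warshall(dist):
    """All-pairs shortest paths in O(n^3) for n vertices."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (5.0, 1.0, 0.0)]
adjacent_pairs = [(0, 1), (1, 2)]  # pairs assumed to pass rule 1
geo = floyd_warshall(build_graph(centers, adjacent_pairs))
```

In this toy graph, vertex 3 has no rule-1 edge and is attached to vertex 2 by the second rule, so the geodesic distance from vertex 0 to vertex 3 follows the chain 0, 1, 2, 3 rather than the straight line between the two centers.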

3) Random decision forest classification of superpixels: the combined feature $F_I(\chi)$ of each superpixel is classified with a random decision forest.

Training first generates a forest of $N_t$ trees. During training, the information entropy and information gain of each split node must be computed. The information entropy of a node containing the training set $S$ of sample points is:

$$H(S) = -\sum_{l} p(l \mid S) \log p(l \mid S)$$

where $l$ is a body part label to be classified and $p(l \mid S)$ is the probability of label $l$ in the set $S$. The random decision forest algorithm chooses the partition with the maximum information gain. If a partition $P = \{P_{left}, P_{right}\}$ of a node divides the training sample set $S$ contained in the node into $\{S_{left}, S_{right}\}$, the information gain of the partition is defined as:

$$\mathrm{InfGain}(S, P) = H(S) - H(S \mid P) = H(S) - p_{left} H(S_{left}) - p_{right} H(S_{right}).$$
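A small worked example of the split criterion, as a sketch. It assumes Shannon entropy with a base-2 logarithm and takes $p_{left}$ and $p_{right}$ to be the fractions of samples sent to each side; neither choice is stated explicitly in the text, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of part labels (base 2, an assumption)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, go_left):
    """InfGain(S, P) = H(S) - p_left * H(S_left) - p_right * H(S_right)."""
    left = [l for l, g in zip(labels, go_left) if g]
    right = [l for l, g in zip(labels, go_left) if not g]
    if not left or not right:
        return 0.0  # a split that sends everything to one side gains nothing
    p_left = len(left) / len(labels)
    p_right = len(right) / len(labels)
    return entropy(labels) - p_left * entropy(left) - p_right * entropy(right)

labels = ["head", "head", "torso", "torso"]
perfect = info_gain(labels, [True, True, False, False])  # separates the parts
useless = info_gain(labels, [True, False, True, False])  # mixes the parts
```

A split that separates the two parts recovers the full entropy of the node, while a split that leaves both children as mixed as the parent gains nothing, which is why the training procedure keeps the candidate with the largest gain.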

After training, a single tree $t$ in the forest gives, for image $I$, the probability $p_t(l \mid I, \chi)$ that superpixel $\chi$ belongs to part $l$; over the entire forest this gives

$$p(l \mid I, \chi) = \frac{1}{N_t} \sum_{t=1}^{N_t} p_t(l \mid I, \chi).$$

The part with the maximum probability is selected as the class of the superpixel and as the classification result for all pixels it contains.
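The final labeling step can be illustrated with a short sketch. It assumes that the forest output is the plain average of the per-tree probabilities, which is common practice for random decision forests; the function name `forest_predict` and the toy probabilities are hypothetical.

```python
def forest_predict(tree_probs):
    """Average per-tree distributions p_t(l | I, chi) and pick the most likely part.

    tree_probs: one dict per tree, mapping part label to probability.
    Averaging over trees is an assumption of this sketch.
    """
    n_t = len(tree_probs)
    parts = tree_probs[0].keys()
    avg = {l: sum(p[l] for p in tree_probs) / n_t for l in parts}
    best = max(avg, key=avg.get)
    return best, avg

tree_probs = [
    {"head": 0.7, "torso": 0.2, "arm": 0.1},
    {"head": 0.4, "torso": 0.5, "arm": 0.1},
    {"head": 0.6, "torso": 0.3, "arm": 0.1},
]
label, avg = forest_predict(tree_probs)
```

Although the second tree prefers the torso, the averaged distribution favors the head, and that label would be propagated to every pixel in the superpixel.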

Step (3), sparse regression of the human pose based on body part clustering features:

After the superpixels are classified by the random decision forest, all foreground pixels are classified. A feature that maps from parts to joints is then constructed, and the skeleton point positions are obtained from it by sparse regression.

1) Human pose representation based on part clustering features: for a part $l$, K-means clustering is performed on the part to obtain $N_k$ cluster points, which are sorted by their distance to a predetermined part to obtain a vector

Then the geometric center $c_0$ of all foreground pixels and the vectors of all $N_p$ parts are combined to obtain a new human pose representation based on the part clustering features:

2) Sparse regression: the goal of human pose estimation is to obtain the positions $y$ of $N_J$ three-dimensional skeleton joint points. Assume there are $N$ training images with known joint point information

and part features

where $N_q$ is the number of part feature points and $i$ indexes the $N$ training images, starting from 1.

where

maps the feature $c$ to the $j$-th skeleton point ($j = 1, \dots, N_q$). The sparse regression model is then $y_i \approx A c_i$, and the projection matrix $A$ can be obtained through the following $N_J$ optimizations:

where $y_i(3j-2 : 3j)$ denotes the subvector of $y_i$ from dimension $(3j-2)$ to dimension $3j$. With the trained matrix $A$ and the part clustering feature $c$, linear sparse regression gives the three-dimensional skeleton point positions $y = Ac$.
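The regression step can be sketched as follows. The optimization objective is not shown here, so this sketch assumes an L1-penalized least-squares (lasso) objective solved by iterative soft-thresholding (ISTA); the penalty weight `lam`, the solver, and the function names are assumptions, not the patent's formulation. Under this assumed objective the problem separates over the rows of $A$, so solving for all rows at once is equivalent to solving one problem per joint.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_regression(C, Y, lam=0.01, iters=500):
    """Minimize 0.5/N * ||Y - C A^T||_F^2 + lam * ||A||_1 by ISTA (assumed objective).

    C: (N, d) part clustering features, Y: (N, m) stacked joint coordinates.
    Returns A with shape (m, d), so that y is approximately A @ c.
    """
    n, d = C.shape
    A = np.zeros((Y.shape[1], d))
    step = n / (np.linalg.norm(C, 2) ** 2)  # inverse Lipschitz constant of the gradient
    for _ in range(iters):
        grad = (A @ C.T - Y.T) @ C / n
        A = soft_threshold(A - step * grad, step * lam)
    return A

rng = np.random.default_rng(0)
C = rng.normal(size=(200, 12))   # toy features for 200 training images
A_true = np.zeros((6, 12))       # two joints, three coordinates each
A_true[0, 1] = 2.0
A_true[3, 7] = -1.5
Y = C @ A_true.T
A = sparse_regression(C, Y)
y_hat = A @ C[0]                 # predicted joint coordinates for one sample
```

On this synthetic data the recovered matrix stays close to the sparse ground truth, with the small shrinkage that an L1 penalty introduces.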

In experiments it was observed that a skeleton point obtained by sparse regression sometimes does not lie on a foreground pixel. Human joint points are generally located at the center of a local body part, and even under viewpoint changes the location still contains foreground pixels of the body. Nearest-neighbor matching to foreground pixels is therefore applied to the sparse regression result to correct skeleton points that deviate from the foreground, further improving the accuracy of the final skeleton points over the raw regression result.
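The correction step can be sketched as a nearest-neighbor lookup. This is a minimal illustration that assumes both the regressed joints and the foreground pixels are available as 3D points; `snap_to_foreground` is a hypothetical helper, and a joint that already coincides with a foreground point is left unchanged because its nearest neighbor is itself.

```python
import numpy as np

def snap_to_foreground(joints, foreground):
    """Replace each regressed joint by its nearest foreground point.

    joints: (J, 3) regressed skeleton points, foreground: (M, 3) foreground points.
    """
    d = np.linalg.norm(joints[:, None, :] - foreground[None, :, :], axis=2)
    return foreground[d.argmin(axis=1)]

foreground = np.array([[0.0, 0.0, 2.0], [0.1, 0.0, 2.0], [0.0, 1.0, 2.1]])
joints = np.array([[0.09, 0.02, 2.0], [0.0, 0.8, 2.3]])
snapped = snap_to_foreground(joints, foreground)
```

For large foreground sets a spatial index would be preferable to the dense distance matrix used here.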

Drawings

FIG. 1 shows the human pose estimation framework;

FIG. 2 shows the part clustering result for K = 3;

FIG. 3(a) shows the effect of the depth and number of trees;

FIG. 3(b) shows the effect of the offset range and feature dimension;

FIG. 4 shows the superpixel parameters, where (a) is μ = 1.0, (b) is μ = 1.5, and (c) is μ = 2.0;

FIG. 5 shows the SDDF and SGDF feature combinations;

FIG. 6 shows the classification results of single-pixel features and superpixel features;

FIG. 7 shows qualitative classification results using single-pixel features (PDDF + PGDF) and superpixel features (SDDF + SGDF);

FIG. 8 shows the pose regression results on the CMUSD and XiDian datasets;

FIG. 9 shows the pose regression results on the EVAL dataset.

Detailed Description

As shown in FIG. 1, the invention provides a human pose estimation method for a single-frame depth image. It takes a single depth image containing a human body as input, extracts human pose features from the depth image, uses the features to segment the human body parts, clusters the segmented parts, and applies sparse regression to the result to estimate the positions of the human skeleton points. The overall framework comprises the following steps:

(1) μSLIC superpixel partition:

The invention uses a simple linear iterative clustering method with weight μ (μSLIC) to partition a depth map into superpixels. The method has two stages. First, initialization: for a depth image $I$, the depth space $(u, v, d(u, v))$ is converted into the corresponding three-dimensional point cloud space $(x, y, z)$, the depth image is divided uniformly by a $\delta \times \delta$ grid into $N_s$ seeds, the pixels in each grid cell are added to the cluster centered on that cell's seed point, and the new position of the seed point is updated to the geometric center of all pixels in the cluster. Then the iteration stage: for every seed point, within its $3\delta \times 3\delta$ neighborhood, each pixel is assigned to the cluster of its nearest seed point according to the distance $D_s$, and the new positions of the $N_s$ seed points are updated, until the process converges or the maximum number of iterations $N_i$ is reached.

To avoid the influence of noise or of regions with sharply changing depth values on the superpixels, and at the same time to speed up the superpixel stage, the distance measure $D_s$ between pixel $X_k$ and the $i$-th cluster center point is designed as follows:

where $\mu$ is the weight that adjusts compactness in the superpixel. When $\mu$ is small, the segmentation tends to be more uniform in the two-dimensional image space, but depth details are captured poorly. When $\mu$ is large, pixels with similar depths are more easily grouped into one block, but many elongated regions may be produced and compactness suffers.

Because $D_s$ is used in the algorithm only to compare distances, its exact value is not required, so the implementation uses $D_s^2$ in place of $D_s$. This removes the square-root operation and speeds up execution.
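The reason squared distances suffice can be shown with a short sketch: the square root is monotonic, so ranking seeds by squared distance selects the same nearest seed. A plain Euclidean distance stands in for $D_s$ here as an assumption (it is not the patent's weighted measure), and `sq_dist` and `nearest_seed` are hypothetical helpers.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance; a stand-in for D_s squared (assumed form)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_seed(pixel, seeds, squared=True):
    """Index of the closest seed, ranked by squared or by true distance."""
    if squared:
        key = lambda i: sq_dist(pixel, seeds[i])
    else:
        key = lambda i: math.sqrt(sq_dist(pixel, seeds[i]))
    return min(range(len(seeds)), key=key)

seeds = [(0.0, 0.0, 2.0), (0.3, 0.0, 2.0), (0.0, 0.4, 2.5)]
pixel = (0.2, 0.05, 2.1)
idx_sq = nearest_seed(pixel, seeds, squared=True)
idx_true = nearest_seed(pixel, seeds, squared=False)
```

Both calls select the same seed, while the squared variant skips one square root per pixel-seed comparison.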

(2) Human body part segmentation with the SDDF + SGDF superpixel combined feature:

The Kinect system demonstrated the effectiveness of the depth difference feature for representing single-pixel characteristics in pose estimation. The invention provides a combined feature (SDDF + SGDF) based on a superpixel depth difference feature (SDDF) and a superpixel geodesic distance feature (SGDF). For a superpixel $\chi_s$, the features are computed around its geometric center

where $d_I$ denotes the depth value at a given pixel position.

2) Superpixel-based geodesic distance feature $g_\theta$: first, an undirected graph is built from the segmented superpixels that contain foreground, with the foreground superpixels as vertices. Edges are then added between vertices according to two rules. ① If two superpixels $\chi_i$ and $\chi_j$ have directly adjacent pixels and the absolute depth difference between those adjacent pixels is less than $\delta_d$, an edge is added between the two superpixels, and its weight $\mathrm{dist}(\chi_i, \chi_j)$ is the Euclidean distance between the centers of the two superpixels. ② For an isolated vertex $\chi_i$ that has no edge to any other vertex under the first rule, the superpixel with the minimum distance to $\chi_i$, $\chi_c = \arg\min_{\chi} \mathrm{dist}(\chi_i, \chi)$, is found, and an edge of weight $\mathrm{dist}(\chi_i, \chi_c)$ is added between $\chi_i$ and $\chi_c$. The Floyd-Warshall algorithm is then applied to the resulting connected undirected graph to compute the shortest distance between every pair of superpixels.

where $SP_I(X)$ denotes the superpixel to which pixel $X$ belongs in the image, and $d_{geodesic}$ denotes the shortest distance between two vertices on the undirected graph above, i.e., the geodesic distance within the set of superpixels.

3) Random decision forest classification of superpixels: the combined feature $F_I(\chi)$ of each superpixel is classified with a random decision forest.

Training first generates a forest of $N_t$ trees. During training, the information entropy and information gain of each split node must be computed. The information entropy of a node containing the training set $S$ of sample points is:

$$H(S) = -\sum_{l} p(l \mid S) \log p(l \mid S)$$

where $l$ is a body part label to be classified and $p(l \mid S)$ is the probability of label $l$ in the set $S$. The random decision forest algorithm selects the partition with the maximum information gain. If a partition $P = \{P_{left}, P_{right}\}$ of a node divides the training sample set $S$ contained in the node into $\{S_{left}, S_{right}\}$, the information gain of the partition is defined as:

$$\mathrm{InfGain}(S, P) = H(S) - H(S \mid P) = H(S) - p_{left} H(S_{left}) - p_{right} H(S_{right}).$$

(3) Human posture sparse regression based on human body part clustering characteristics:

After the superpixels are classified by the random decision forest, all foreground pixels are classified and the image is divided into the predefined parts. The required skeleton points, however, relate to the parts in different ways: some joints (such as the head and chest joints) lie at the center of a part (the head and torso parts), while others (such as the elbow joint) lie where two parts adjoin (the upper arm and forearm). The invention therefore designs a feature that maps from parts to joints and obtains the skeleton point positions from it by sparse regression.

1) Human pose representation based on part clustering features: for a part $l$, K-means clustering is performed on the part to obtain $N_k$ cluster points, which are sorted by their distance to a predetermined part (the main part to which the part is connected) to obtain a vector

Then the geometric center $c_0$ of all foreground pixels and the vectors of all $N_p$ parts are combined to obtain a new human pose representation based on the part clustering features:

2) Sparse regression: the goal of human pose estimation is to obtain the positions $y$ of $N_J$ three-dimensional skeleton joint points. Assume there are $N$ training images with known joint point information

and part features

where $N_q$ is the number of part feature points and $i$ indexes the $N$ training images, starting from 1.

where

maps the feature $c$ to the $j$-th skeleton point ($j = 1, \dots, N_q$). The sparse regression model is then $y_i \approx A c_i$, and the projection matrix $A$ can be obtained through the following $N_J$ optimizations:

The invention uses a new human pose estimation framework. It proposes a new combined superpixel feature representation on depth images, segments body parts by applying a random decision forest to this combined feature, extracts clustering features from the segmented parts, and maps them to the final skeleton joint positions by sparse regression. Experiments on several datasets compare the method with other approaches and verify the effectiveness of the proposed features and framework.

Example 1:

Machine-learning methods typically require large datasets for training and validation. Three datasets are used in the experiments. The EVAL dataset contains 3 subjects, each corresponding to a different actor; each actor has 8 action sequences with about 10,000 frames, at a resolution of 320 × 240. The XiDian dataset contains 5 motion sequences, 2850 frames in total, at a resolution of 2048 × 2048; it was captured on a prototype built for high-resolution depth acquisition and is noisy. Since these datasets contain only depth data and cannot be used to train a random decision forest for body part classification, the CMU motion capture database is used to generate a CMU synthetic dataset (CMUSD) with depth data and part label data. Using the common Poser software, 113 subjects and 2549 motion sequences are synthesized; each pose is rendered from 8 random camera viewpoints, mainly frontal, yielding more than 820,000 depth images of 640 × 480 with part labels. CMUSD covers a richer range of motions.

Prediction accuracy in a random decision forest increases with tree depth. As FIG. 3(a) shows, however, the gain in accuracy diminishes as the trees get deeper, while the overhead of the trees keeps growing. The invention selects a tree depth of 20. Adding trees has a similar effect, and to balance efficiency a forest size of 8 trees is selected. As FIG. 3(b) shows, for the feature offset range and feature dimension, relatively good results are obtained with an offset range of 180 pixels and a feature dimension of 1000. This set of parameters is used in all subsequent experiments.

(1) μ SLIC superpixel partition:

In the superpixel stage, a CMUSD image of 640 × 480 is initially divided into a grid of 12 × 12 pixel cells, and the number of superpixels containing human foreground is about 120 on average. For datasets of different resolutions, the initial superpixel size is scaled according to the proportion of foreground pixels, so that the foreground yields a number of superpixels of the same order of magnitude. During superpixel partition, the pixels contained in each superpixel stabilize once the number of iterations reaches a certain value, and running time is directly proportional to the number of iterations, so the maximum number of iterations is set to $N_i = 10$. As for the parameter of the superpixel distance measure, FIG. 4(a) shows that the contour detail of the chin is not reflected in the superpixels of the head, whereas in FIG. 4(c) the contour details of each part are well reflected but many elongated superpixels appear at the body edges. The superpixels in FIG. 4(b) are relatively uniform in size and still capture detail. The subsequent experiments use μ = 1.5.

(2) Super-pixel feature extraction:

In this experiment, about 8 thousand superpixels from about 100 thousand images in 12 CMUSD sequences were used; 50% of each sequence was randomly selected for training, and the other 50% was used for testing and validation.

In FIG. 5, different bars represent different feature combinations. The figure shows that 0+1000, which uses only the geodesic distance feature (GDF), is less effective than 1000+0, which uses only the depth difference feature (DDF), although the GDF alone still achieves a certain accuracy. In particular, mixing the two features as 800+200 gives a better result than using the depth difference alone. This indicates that for certain poses and parts the geodesic distance feature usefully complements the depth difference feature, although most of the time it is the depth difference feature that does the work. The feature finally used is a 1000-dimensional superpixel feature combining depth difference and geodesic distance as 800+200.

(3) Comparing the classification effect of single pixel and super pixel:

An experimental comparison of single-pixel features (PDDF + PGDF) and superpixel features (SDDF + SGDF) was performed on CMUSD, using the same feature dimensions (800+200) and random forest settings. For PDDF + PGDF, 120 pixels were randomly sampled from each depth map as the training set. The final average accuracy was 92.105 for PDDF + PGDF and 92.468 for SDDF + SGDF. FIG. 6 shows that the accuracies of SDDF + SGDF and PDDF + PGDF fluctuate slightly across subjects, but given the randomness of the sampling points there is no significant difference between the two. As FIG. 7 shows, with SDDF + SGDF some pixels may be merged into a different adjacent region, and the classification of a superpixel affects all the pixels within it; nevertheless, superpixels make semantically identical regions more likely to be classified consistently and greatly reduce the cost of extracting and testing geodesic distance features.

(4) Posture regression:

In the part feature extraction for pose regression, as shown in FIG. 2, the number of K-means cluster points is set to 3. Each of the 10 parts is clustered into 3 classes, and the cluster points are sorted by Euclidean distance to the parent connected part (for example, the head by its distance to the torso), so that c has 93 dimensions.
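The 93 dimensions follow from 3 coordinates for the foreground center plus 10 parts × 3 cluster points × 3 coordinates. The construction can be sketched as follows; this is an illustration only, in which the basic K-means routine, the use of a single reference point per parent part, and the names `kmeans` and `pose_feature` are assumptions.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd's K-means; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return centers

def pose_feature(part_points, parent_refs, k=3):
    """Concatenate the foreground center c_0 with sorted cluster points of each part."""
    all_points = np.concatenate(part_points)
    chunks = [all_points.mean(axis=0)]  # c_0: geometric center of the foreground
    for pts, parent in zip(part_points, parent_refs):
        centers = kmeans(pts, k)
        order = np.argsort(np.linalg.norm(centers - parent, axis=1))
        chunks.append(centers[order].ravel())  # sorted by distance to the parent part
    return np.concatenate(chunks)

rng = np.random.default_rng(1)
part_points = [rng.normal(loc=i, scale=0.1, size=(50, 3)) for i in range(10)]
parent_refs = [np.zeros(3)] * 10  # toy parent reference points
c = pose_feature(part_points, parent_refs)
```

Sorting the cluster points gives every part a fixed ordering, so the same position in the vector refers to a comparable location across images.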

For CMUSD, 50% of the images in the dataset were used as the training set for the random forest and the sparse regression, and the rest as the test set. FIG. 8 shows that the limb joints are less accurate than the head and chest joints. This is because limb motion is relatively vigorous and covers a large range, and once occlusion occurs it strongly affects both the pixel classification and the final regression result.

In the experiment on the XiDian database, 4 sequences are used as the training set and 1 sequence as the test set for cross-validation, and the mean average precision (mAP) is 91.7, which shows that the method also produces good results on high-resolution, noisy data. Although the XiDian data are noisy, the motions are relatively simple, so the accuracy is higher than on CMUSD.

For the EVAL dataset, frames in which some joint points lie far from the body foreground, or in which the distance between joint points clearly exceeds what the human body structure allows, are removed, leaving about 3,000 frames per subject on average. In the pixel classification stage, the random forest model trained with all CMUSD images as the training set is used. In the regression stage, two EVAL subjects are used as the training set and the remaining subject as the test set, and the final result is the average over the three cross-validation folds.

FIG. 9 compares the method with the algorithms of Ye et al. and Jung et al. on the EVAL dataset. The Gaussian-mixture-model algorithm proposed by Ye et al. in 2014 must fit a body model and runs slowly; the random tree walk model proposed by Jung et al. in 2015 is much more efficient but was validated only on small datasets. The figure shows that joints near the center of a body part are relatively accurate while the limb end points are less accurate: the limb extremities are classified less accurately in the pixel classification stage, which directly affects the positions of the regression feature cluster points and hence the accuracy of the pose regression.

(5) Run time

The single-pixel method and the superpixel method are compared in running time using the same feature dimensions and the same random forest settings. The time complexity of the geodesic distance computation is $O(n^3)$, so for high-resolution images it is almost impossible to use as a per-pixel feature in a real-time setting. Although the superpixel-based method adds the time needed to compute superpixels, it greatly shortens feature extraction and also reduces the random forest classification time. Even when only the depth difference is used as the feature, the superpixel method speeds up the algorithm by a factor of 1.5 to 8 across resolutions while maintaining accuracy. The table shows that the method meets real-time requirements on datasets of various resolutions.
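The effect of the cubic complexity can be estimated from figures given in the experiments: about 120 foreground superpixels per image, each starting from a 12 × 12 pixel cell. The calculation below is a rough sketch that assumes the foreground pixel count is simply 120 × 144 and ignores constant factors and changes in superpixel size during iteration.

```python
n_super = 120                 # foreground superpixels per image (from the experiments)
pixels_per_super = 12 * 12    # initial grid cell size
n_pixels = n_super * pixels_per_super  # rough foreground pixel count (assumption)

ops_super = n_super ** 3      # order of Floyd-Warshall work on the superpixel graph
ops_pixels = n_pixels ** 3    # order of the same work on a per-pixel graph
ratio = ops_pixels // ops_super
```

Under these assumptions the per-pixel graph needs roughly three million times more work than the superpixel graph, which is consistent with the statement that per-pixel geodesic features are impractical in real time.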

TABLE 1 execution time (unit: ms)

This section demonstrates that, across a variety of resolutions and data qualities, the new combined feature effectively expresses body part characteristics, and the whole framework computes the human body posture effectively and in real time.

The superpixel generation method uses a pixel distance measure that takes both two-dimensional and three-dimensional information into account, realizes the down-sampling operation from pixels to superpixels, and reduces the order of magnitude of the data processed directly. The method extracts, on a single-frame depth map, a combined feature fusing depth difference and geodesic distance information, making comprehensive use of the global and local relationships between pixels and improving the classification accuracy of human body parts. Compared with previous work, human joint point estimation is realized by sparse regression on the part-position clustering feature points, which reduces the processing time and achieves higher human body posture estimation accuracy.

Claims

1. A human body posture estimation method based on depth image super-pixel combined features is characterized by comprising the following steps:

Step (1) μSLIC superpixel segmentation

The simple linear iterative clustering method with weight μ (μSLIC) is adopted to perform superpixel segmentation on the depth map, in two stages. First, the initialization stage: for a depth image I, the depth space (u, v, d(u, v)) is converted to obtain the corresponding three-dimensional point cloud space (x, y, z); the depth image is uniformly divided according to a δ×δ grid into N_s seeds; the pixels in each grid cell are added to the cluster centered on the corresponding seed point; and the new position of each seed point is updated according to the geometric center of all pixels in its cluster. Then the iteration stage: for every seed point, within its 3δ×3δ neighborhood, the distance between each pixel and the seed point is measured according to the distance D_s, each pixel is assigned to the cluster of the nearest seed point, and the new positions of the N_s seed points are updated; this is iterated until the whole process converges or the maximum number of iterations N_i is reached;
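The two stages of step (1) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and in particular the form of the distance (3-D Euclidean distance combined with a μ-weighted image-plane distance normalized by δ) are assumptions, since the formula for D_s is not reproduced in this text.

```python
import numpy as np


def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project (u, v, d(u, v)) to (x, y, z) with an assumed pinhole model."""
    v, u = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]
    return np.stack([(u - cx) * depth / fx, (v - cy) * depth / fy, depth], axis=-1)


def _centers(labels, feats, n_seeds, prev=None):
    """Geometric center of every cluster in (u, v, x, y, z) space."""
    flat = labels.ravel()
    counts = np.bincount(flat, minlength=n_seeds)
    sums = np.stack([np.bincount(flat, weights=feats[..., k].ravel(), minlength=n_seeds)
                     for k in range(feats.shape[-1])], axis=1)
    centers = sums / np.maximum(counts, 1)[:, None]
    if prev is not None:                      # keep the old center of an empty cluster
        centers[counts == 0] = prev[counts == 0]
    return centers


def mu_slic(depth, delta, mu, fx=1.0, fy=1.0, cx=0.0, cy=0.0, max_iter=10):
    h, w = depth.shape
    vv, uu = np.mgrid[0:h, 0:w]
    pts = depth_to_points(depth, fx, fy, cx, cy)
    feats = np.concatenate([np.stack([uu, vv], axis=-1).astype(float), pts], axis=-1)

    # Initialization: one seed per delta x delta grid cell, then move it to the center.
    labels = (vv // delta) * ((w + delta - 1) // delta) + (uu // delta)
    n_seeds = int(labels.max()) + 1
    centers = _centers(labels, feats, n_seeds)

    for _ in range(max_iter):
        best = np.full((h, w), np.inf)
        new_labels = labels.copy()
        for s in range(n_seeds):
            cu, cv = centers[s, 0], centers[s, 1]
            u0, u1 = max(int(cu - 1.5 * delta), 0), min(int(cu + 1.5 * delta) + 1, w)
            v0, v1 = max(int(cv - 1.5 * delta), 0), min(int(cv + 1.5 * delta) + 1, h)
            win = feats[v0:v1, u0:u1]         # 3*delta x 3*delta neighborhood
            d2 = np.sum((win[..., :2] - centers[s, :2]) ** 2, axis=-1) / delta ** 2
            d3 = np.sum((win[..., 2:] - centers[s, 2:]) ** 2, axis=-1)
            d = np.sqrt(d3 + mu * d2)         # ASSUMED stand-in for the distance D_s
            sub_best, sub_lab = best[v0:v1, u0:u1], new_labels[v0:v1, u0:u1]
            closer = d < sub_best
            sub_best[closer] = d[closer]      # views: writes go through to best/new_labels
            sub_lab[closer] = s
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        centers = _centers(labels, feats, n_seeds, prev=centers)
        if converged:
            break
    return labels, centers
```

In this sketch a larger `mu` gives more weight to image-plane proximity and therefore more compact superpixels, while a smaller `mu` lets clusters follow 3-D structure more closely.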

Step (2) human body part segmentation based on the SDDF + SGDF superpixel combined feature

A combined feature SDDF + SGDF, formed from the superpixel-based depth difference feature SDDF and the superpixel-based geodesic distance feature SGDF, is adopted and evaluated at the geometric center of each superpixel in the superpixel set χ_s:

1) superpixel-based depth difference feature f_θ: for a superpixel χ_s segmented from the depth map I, at its geometric center

where d_I(·) denotes the depth value at a given pixel position;

2) superpixel-based geodesic distance feature g_θ: first, an undirected graph is formed from the segmented superpixels in the image that contain foreground, with the foreground superpixels as vertices; then two rules determine whether an edge is added between vertices: ① if two superpixels χ_i and χ_j have directly adjacent pixels and the absolute depth difference between the adjacent pixels is less than δ_d, an edge is added between the two superpixels in the graph, with weight equal to the Euclidean distance between the centers of the two superpixels; ② for an outlier χ_i that has no edge to any other vertex under the first rule, there exists a superpixel χ_c at minimum distance from χ_i, χ_c = argmin_χ dist(χ_i, χ), and an edge to the superpixel χ_c with weight dist(χ_i, χ_c) is added; the Floyd-Warshall algorithm is then applied to the constructed connected undirected graph to calculate the shortest distances between all superpixels,

where SF_I(x) denotes the superpixel to which the pixel x belongs in the image I, and d_geodesic denotes the shortest distance between two vertices on the above undirected graph, i.e. the geodesic distance within the superpixel set;
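The graph construction and the all-pairs shortest-path step can be sketched as below. It is a minimal illustration under stated assumptions: the function name is hypothetical, `adjacent_pairs` is assumed to already contain the superpixel pairs that satisfy rule ① (directly adjacent pixels with depth difference below δ_d), and at least two superpixels are assumed.

```python
import numpy as np


def superpixel_geodesics(centers, adjacent_pairs):
    """All-pairs geodesic distances between superpixel centers (N x 3 array)."""
    centers = np.asarray(centers, dtype=float)
    n = len(centers)
    eucl = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)

    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0.0)

    # Rule 1: edge between adjacent superpixels, weighted by center distance.
    for i, j in adjacent_pairs:
        dist[i, j] = dist[j, i] = eucl[i, j]

    # Rule 2: a vertex with no edge is linked to its nearest superpixel.
    has_edge = np.isfinite(dist).sum(axis=1) > 1
    for i in np.flatnonzero(~has_edge):
        others = eucl[i].copy()
        others[i] = np.inf
        c = int(np.argmin(others))
        dist[i, c] = dist[c, i] = eucl[i, c]

    # Floyd-Warshall: relax every pair through intermediate vertex k, O(n^3) overall.
    for k in range(n):
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist


# Four superpixels along a U-shaped chain plus one isolated superpixel.
centers = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0), (5, 0, 0)]
d_geo = superpixel_geodesics(centers, [(0, 1), (1, 2), (2, 3)])
print(d_geo[0, 3], d_geo[0, 4])
```

In the usage example the geodesic distance between the two ends of the chain follows the graph rather than the straight line, which is the property that distinguishes this feature from a plain Euclidean distance. Pairs that remain unconnected after both rules keep an infinite distance in this sketch.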

where l is the part to be classified and P(l | S) is the probability of being classified as l in the set S; the random decision forest algorithm selects the partition that yields the maximum information gain; the partition P = {P_left, P_right} of a node divides the training sample set S contained in the node into {S_left, S_right}, and the information gain of the partition is defined as:

InfGain(S, P) = H(S) − H(S | P) = H(S) − p_left·H(S_left) − p_right·H(S_right);

The part with the maximum probability is selected as the class of the superpixel and as the classification result of the pixels contained in the superpixel;
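The information gain above can be evaluated for a candidate split as in the following sketch. The entropy H is assumed here to be the Shannon entropy of the part distribution P(l | S) with base-2 logarithm, because the entropy formula is not reproduced in this text; the function names are hypothetical.

```python
import numpy as np


def entropy(labels, n_parts):
    """Shannon entropy (assumed form of H) of the part labels in a sample set."""
    counts = np.bincount(labels, minlength=n_parts).astype(float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def info_gain(labels, go_left, n_parts):
    """InfGain(S, P) = H(S) - p_left * H(S_left) - p_right * H(S_right)."""
    labels = np.asarray(labels)
    go_left = np.asarray(go_left, dtype=bool)
    left, right = labels[go_left], labels[~go_left]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that sends every sample to one side gains nothing
    p_left = len(left) / len(labels)
    p_right = 1.0 - p_left
    return (entropy(labels, n_parts)
            - p_left * entropy(left, n_parts)
            - p_right * entropy(right, n_parts))


# Two parts, four samples: one split separates them perfectly, the other not at all.
parts = [0, 0, 1, 1]
print(info_gain(parts, [True, True, False, False], n_parts=2))
print(info_gain(parts, [True, False, True, False], n_parts=2))
```

A node would evaluate this quantity for each candidate partition and keep the one with the largest value.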

Step (3) human posture sparse regression based on human body part clustering features

After the superpixels are classified by the random decision forest, all foreground pixels are classified; part-to-joint features are adopted, and the positions of the skeleton points are mapped by a sparse regression method;

and the part features

where N_q is the number of part feature points, and i = 1

where

where y_i(3j−2 : 3j) denotes the subvector of the vector y_i from the (3j−2)-th dimension to the (3j)-th dimension; using the trained matrix A and the part clustering feature c, the three-dimensional positions of the skeleton points are obtained by linear sparse regression as y = Ac.
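The inference step y = Ac and the subvector notation y(3j−2 : 3j) can be illustrated as follows. This sketch covers only the application of an already trained matrix A; how A is trained is not shown, and the function name and array shapes are assumptions.

```python
import numpy as np


def regress_joints(A, c):
    """Apply y = A c and split y into one 3-D position per skeleton point."""
    y = np.asarray(A, dtype=float) @ np.asarray(c, dtype=float)
    if y.size % 3 != 0:
        raise ValueError("y must stack (x, y, z) triples, one per joint")
    # Row j-1 (0-based) holds y(3j-2 : 3j) in the 1-based notation of the text.
    return y.reshape(-1, 3)


# Toy example with a hypothetical 6 x 6 matrix A: two joints, six feature values.
A = np.eye(6)
c = np.arange(6.0)
print(regress_joints(A, c))
```

With a sparse A, each joint coordinate depends on only a few entries of the part clustering feature c, so the product stays cheap at run time.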

2. The human body posture estimation method based on depth image super-pixel combined features as claimed in claim 1, wherein in step (1), the distance measure D_s between the pixel X_k(x_k, y_k, z_k) and the i-th cluster center point is designed as follows:

where μ is the weight that adjusts the compactness of the superpixels.