CN108154104B - Human body posture estimation method based on depth image super-pixel combined features - Google Patents

Publication number
CN108154104B
Authority
CN
China
Prior art keywords
pixel
super
superpixels
depth
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711395472.1A
Other languages
Chinese (zh)
Other versions
CN108154104A (en)
Inventor
孔德慧
张雯晖
王少帆
王玉萍
尹宝才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201711395472.1A
Publication of CN108154104A
Application granted
Publication of CN108154104B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The invention discloses a human body posture estimation method based on a depth image superpixel combined feature. By adopting the technical scheme of the invention, the accuracy of the human body posture estimation is improved, and the real-time performance of the posture estimation method is improved.

Description

Human body posture estimation method based on depth image super-pixel combined features
Technical Field
The invention belongs to the field of computer vision and pattern recognition, and particularly relates to a human body posture estimation method based on a depth image superpixel combined feature.
Background
Human body part segmentation and skeletal joint localization together constitute the human body posture estimation problem, which is fundamental to computer vision and human-computer interaction. Posture estimation is widely applied in action recognition, animation simulation, gait analysis, content-based video and image retrieval, intelligent video surveillance, and related areas. With the development of depth acquisition devices such as Kinect sensors and TOF cameras, much research has gradually shifted from conventional color or grayscale intensity images to depth images. Compared with color images, depth images avoid the influence of varying illumination, appearance, and background noise.
Because the human body is an articulated structure with many degrees of freedom, a constrained parameter space, self-similar parts, and self-occlusion, modeling the human skeleton directly is very difficult. The difficulty of human posture estimation lies in constructing a complex representation model of the human joints and computing joint positions from unlabeled data; real-time application requirements aggravate this difficulty further.
In terms of input data, human body posture estimation methods are divided into single-frame methods and image-sequence methods. Compared with sequence-based methods, single-frame methods accumulate no error, need no error recovery, and obtain the required posture directly from a single image; however, because they lack the temporal context of the motion, they are prone to misjudging ambiguous poses.
In terms of technique, early methods mainly modeled the human body and searched the body state space to match and align the model with image features. Examples include the commonly used iterative closest point method and methods that fit a human model to head, body, and limb detectors using a Markov chain. Such fitting-and-matching methods usually require an initialization step and a model that fits the real human body, and they are computationally expensive. Machine learning methods have gradually been adopted, and Random Decision Forests (RDF), Support Vector Machines (SVM), K-nearest neighbor classification (KNN), deep learning, and similar methods have all been applied to human posture estimation. Learning-based methods need no prior human body model, but they depend on a training set that is sufficiently large and diverse, which increases training time; whether effective, accurate, and stable feature descriptors can be extracted is also a main challenge.
The real-time human body posture estimation method proposed by Shotton et al. in 2011 was applied successfully in Kinect. It combines depth difference features with a random decision forest, classifies human body parts, and finally obtains the skeleton points by mean-shift clustering regression. One issue with random decision forests is that the more trees a forest contains, the more stable its overall result becomes; however, as the number of trees grows, both training time and testing time increase, which limits the size of the forest in real-time applications. Meanwhile, as technology advances, the resolution of depth images keeps increasing, so the number of pixels to be processed multiplies and the real-time demands on the processing method rise accordingly.
Superpixel segmentation is an image segmentation approach that groups pixels with similar semantics into superpixel regions, so that pixel-by-pixel processing can be replaced by unified processing of whole superpixel blocks. This can improve the efficiency of a complex image processing program by orders of magnitude and makes more complex feature computation feasible in applications with real-time requirements. Simple Linear Iterative Clustering (SLIC) is an excellent superpixel segmentation method: its segments are compact and similar in size, and it preserves neighborhood relationships better than other superpixel segmentation methods. Few methods perform superpixel segmentation on depth maps on the basis of SLIC. Some measure the distance between pixels directly with the Euclidean distance in the three-dimensional point cloud space, which yields superpixel blocks that differ greatly from one another; some improve semantic segmentation by adding gradient directions, at the cost of higher computational complexity; others segment superpixels using a combination of color and depth, which requires additional input information.
Disclosure of Invention
The invention aims to extract human body features rapidly and accurately and to compute them efficiently, improving both the accuracy of human body posture estimation and the real-time performance of the estimation method. It provides a human body posture estimation method based on combined superpixel features: a novel superpixel-based combined feature that fuses a superpixel depth difference feature with a superpixel geodesic distance feature is extracted and applied to a random decision forest to segment human body parts, and the human body posture is then estimated by applying sparse regression to the K-means cluster points of each segmented part. The technical problems addressed by the invention are: fast and effective superpixel segmentation; human body part segmentation based on combined superpixel features; and sparse regression of the human posture based on body-part clustering.
In order to achieve the purpose, the invention adopts the following technical scheme:
The human body posture estimation method based on combined superpixel features of a depth image takes a single depth image containing a human body as input, extracts posture features from the depth image, uses the features to segment human body parts, clusters the segmented parts, and applies sparse regression to estimate the positions of the human skeleton points. The overall framework comprises the following steps:
Step (1), μSLIC superpixel segmentation:
The invention applies simple linear iterative clustering with a weight μ (μSLIC) to segment a depth map into superpixels. The method has two stages. First, the initialization stage: for a depth image I, the depth space (u, v, d(u, v)) is converted into the corresponding three-dimensional point cloud space (x, y, z), where u and v are the two-dimensional image coordinates, d(u, v) is the depth value at position (u, v), and x, y, z are three-dimensional coordinates. The depth image is divided uniformly on a δ × δ grid into N_s seed points, the pixels in each grid cell are added to the cluster centered on that cell's seed point, and the seed point is moved to the geometric center of all pixels in the cluster:
Figure GDA0003226726910000031
Then the iteration stage: for every seed point, the distance D_s between the seed point and each pixel within its 3δ × 3δ neighborhood is measured, each pixel is assigned to the cluster of its nearest seed point, and the positions of the N_s seed points are updated. The iteration repeats until the process converges or the maximum number of iterations N_i is reached.
The distance measure D_s between a pixel X_k(x_k, y_k, z_k) and the i-th cluster center point
Figure GDA0003226726910000032
is designed as:
Figure GDA0003226726910000033
where μ is the weight that adjusts the compactness of the superpixels.
Step (2), human body part segmentation with the SDDF + SGDF combined superpixel feature:
A combined feature (SDDF + SGDF) is built from a Superpixel Depth Difference Feature (SDDF) and a Superpixel Geodesic Distance Feature (SGDF). For a superpixel χ_s with geometric center
Figure GDA0003226726910000034
a group of offsets
Figure GDA0003226726910000035
is obtained by uniform random sampling within a circle of radius theta centered on that geometric center. On image I, N_SDDF SDDF values f_θ and N_SGDF SGDF values g_θ are combined into a feature of the superpixel χ_s with dimension N_SDDF + N_SGDF:
Figure GDA0003226726910000036
1) Superpixel-based depth difference feature f_θ: for a segmented superpixel χ_s in the depth map I, with geometric center
Figure GDA0003226726910000037
an offset θ is generated in advance by uniform random sampling within the circle of radius theta around that center, and the depth difference feature value is:
Figure GDA0003226726910000038
where d_I denotes the depth value at a given pixel position.
2) Superpixel-based geodesic distance feature g_θ: first, an undirected graph is built from the segmented superpixels that contain foreground, with the foreground superpixels as vertices. Edges between vertices are then added according to two rules. ① If two superpixels χ_i and χ_j contain directly adjacent pixels and the absolute depth difference between the adjacent pixels is less than δ_d, an edge between the two superpixels is added to the graph, with edge weight
Figure GDA0003226726910000039
which is the Euclidean distance between the centers of the two superpixels. ② For an isolated vertex χ_i that is not connected to any other vertex by the first rule, the superpixel with the minimum distance to χ_i, χ_c = argmin_χ dist(χ_i, χ), is found, and an edge between χ_i and χ_c with weight dist(χ_i, χ_c) is added. Applying the Floyd-Warshall algorithm to the resulting connected undirected graph yields the shortest distance between all pairs of superpixels.
For an offset θ, generated randomly in the same way as for the depth difference feature, the geodesic distance feature value of the superpixel χ_s is:
Figure GDA0003226726910000041
where SP_I(X) denotes the superpixel to which pixel X belongs in the image, and d_geodesic denotes the shortest distance between two vertices on the undirected graph above, i.e. the geodesic distance within the set of superpixels.
3) Random decision forest classification of superpixels: a random decision forest is applied to the combined superpixel feature F_I(X) for classification.
Training first generates a forest of N_t trees. During training of the random decision forest, the information entropy and information gain of each split node must be computed. The information entropy of a node containing the training sample set S is:
H(S) = -Σ_l p(l|S) log p(l|S)
where l is a body part to be classified and p(l|S) is the probability of class l in the set S. The random decision forest algorithm chooses the split that achieves the maximum information gain. A split P = {P_left, P_right} of a node divides the training sample set S contained in the node into {S_left, S_right}, and the information gain of the split is defined as:
InfGain(S, P) = H(S) - H(S|P)
= H(S) - p_left H(S_left) - p_right H(S_right).
After training, each tree t in the forest gives, for image I, the probability p_t(l | I, χ) that superpixel χ belongs to part l; over the entire forest one obtains
Figure GDA0003226726910000043
Figure GDA0003226726910000044
The part with the maximum probability is selected as the class of the superpixel and as the classification result for all pixels contained in the superpixel.
Step (3), sparse regression of the human posture based on body-part clustering features:
After the superpixels are classified by the random decision forest, all foreground pixels are classified. A feature that maps parts to joints is then used, and the positions of the skeleton points are obtained by sparse regression.
1) Human body posture representation based on part clustering features: for a part l, K-means clustering is applied to obtain N_k cluster points, which are sorted by their distance to a preset part to obtain a vector
Figure GDA0003226726910000047
The geometric center c_0 of all foreground pixels and the vectors of all N_p parts are then combined into a new human body posture representation based on part clustering features:
Figure GDA0003226726910000045
2) Sparse regression: the goal of human pose estimation is to obtain the positions y of N_J three-dimensional skeleton joint points. Assume there are N training images with known joint information
Figure GDA0003226726910000046
and part features
Figure GDA0003226726910000051
where N_q is the number of part feature points and i = 1, ...
Figure GDA0003226726910000052
where
Figure GDA0003226726910000053
maps the feature c to the j-th skeleton point (j = 1, ..., N_q). The sparse regression model is then y_i ≈ A c_i, and the projection matrix A can be obtained from the following N_J optimizations:
Figure GDA0003226726910000054
where y_i(3j-2 : 3j) denotes the subvector of y_i from dimension (3j-2) to dimension (3j). With the trained matrix A and the part clustering feature c, the three-dimensional positions of the skeleton points are obtained by linear sparse regression as y = Ac.
In experiments it was observed that a skeleton point obtained by sparse regression sometimes does not lie on a foreground pixel. Because human joints are generally located at the center of a local body part, a joint seen from the actual viewpoint should at least fall on a foreground pixel of the human body. Nearest-neighbor foreground pixel matching can therefore be applied to the sparse regression result to correct skeleton points that deviate from the foreground, further improving the accuracy of the final skeleton points.
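The nearest-neighbor correction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `snap_to_foreground` is hypothetical, and matching in three-dimensional point space (rather than in the image plane) is an assumption, since the text does not say which space is used.

```python
import numpy as np

def snap_to_foreground(joints, foreground_points):
    """Replace each regressed joint by its nearest foreground point.

    joints: (N_J, 3) regressed skeleton points.
    foreground_points: (M, 3) 3-D points of the foreground pixels.
    Assumption: distances are measured in 3-D point space.
    """
    # Pairwise squared distances, shape (N_J, M).
    d2 = ((joints[:, None, :] - foreground_points[None, :, :]) ** 2).sum(axis=2)
    # A joint that already coincides with a foreground point is left unchanged.
    return foreground_points[d2.argmin(axis=1)]
```

A joint that already lies on the foreground maps to itself, so applying the correction to every regressed point is harmless in this sketch.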
Drawings
FIG. 1 shows the human pose estimation framework;
FIG. 2 shows the part clustering result for K = 3;
FIG. 3(a) shows the number and depth of trees;
FIG. 3(b) shows the offset range and feature dimension;
FIG. 4 shows the superpixel parameters, where (a) is μ = 1.0, (b) is μ = 1.5, and (c) is μ = 2.0;
FIG. 5 shows the SDDF and SGDF feature combinations;
FIG. 6 shows the classification results of single-pixel and superpixel features;
FIG. 7 shows subjective classification results using single-pixel features (PDDF + PGDF) and superpixel features (SDDF + SGDF);
FIG. 8 shows pose regression results on the CMUSD and XiDian datasets;
FIG. 9 shows pose regression results on the EVAL dataset.
Detailed Description
As shown in FIG. 1, the invention provides a human body posture estimation method for a single-frame depth image. It takes a single depth image containing a human body as input, extracts posture features from the depth image, uses the features to segment human body parts, clusters the segmented parts, and applies sparse regression to estimate the positions of the human skeleton points. The overall framework comprises the following steps:
(1) μSLIC superpixel segmentation:
The invention applies simple linear iterative clustering with a weight μ (μSLIC) to segment a depth map into superpixels. The method has two stages. First, the initialization stage: for a depth image I, the depth space (u, v, d(u, v)) is converted into the corresponding three-dimensional point cloud space (x, y, z). The depth image is divided uniformly on a δ × δ grid into N_s seed points, the pixels in each grid cell are added to the cluster centered on that cell's seed point, and the seed point is moved to the geometric center of all pixels in the cluster:
Figure GDA0003226726910000061
Then the iteration stage: for every seed point, the distance D_s between the seed point and each pixel within its 3δ × 3δ neighborhood is measured, each pixel is assigned to the cluster of its nearest seed point, and the positions of the N_s seed points are updated. The iteration repeats until the process converges or the maximum number of iterations N_i is reached.
To limit the influence of noise and of regions with sharply changing depth values on the superpixels, and to speed up the superpixel stage, the distance measure D_s between a pixel X_k and the i-th cluster center point
Figure GDA0003226726910000062
is designed as:
Figure GDA0003226726910000063
where μ is the weight that adjusts the compactness of the superpixels. When μ is small, the segmentation tends to be more uniform in the two-dimensional image space, but depth details are captured poorly. When μ is large, pixels with similar depth are more readily grouped into one block, but many elongated regions may appear and compactness suffers.
Because the distance measure D_s is used in the algorithm only to compare distances, the implementation uses D_s^2 in place of D_s. This removes the square-root operation and speeds up execution.
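One assignment-and-update step of the clustering can be sketched as below. The exact formula for D_s is not reproduced in this text, so the form used here (planar squared distance plus μ times the squared depth difference) is only an assumption that is consistent with the described behavior of μ. The function names are hypothetical, and for brevity the sketch searches all centers instead of restricting each seed to its 3δ × 3δ neighborhood.

```python
import numpy as np

def squared_distance(points, center, mu):
    """Squared weighted distance D_s^2 between 3-D points and one center.

    Assumed form: dx^2 + dy^2 + mu * dz^2 (the patent's formula is not
    available in this text).
    """
    diff = points - center
    return diff[..., 0] ** 2 + diff[..., 1] ** 2 + mu * diff[..., 2] ** 2

def assign_pixels(points, centers, mu):
    """Assign every point to the center with the smallest squared distance."""
    # Shape (num_centers, num_points); comparing squares avoids the square root.
    d2 = np.stack([squared_distance(points, c, mu) for c in centers])
    return np.argmin(d2, axis=0)

def update_centers(points, labels, num_centers):
    """Move each center to the mean of the points assigned to it.

    Assumes every cluster is non-empty.
    """
    return np.array([points[labels == i].mean(axis=0) for i in range(num_centers)])
```

Because the square root is monotonic, comparing D_s^2 values selects the same nearest center as comparing D_s values, which is why dropping the root does not change the segmentation.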
(2) Human body part segmentation with the SDDF + SGDF combined superpixel feature:
The Kinect system demonstrated the effectiveness of the depth difference feature for representing single-pixel characteristics in pose estimation. The invention proposes a combined feature (SDDF + SGDF) built from a superpixel depth difference feature (SDDF) and a superpixel geodesic distance feature (SGDF). For a superpixel χ_s with geometric center
Figure GDA0003226726910000064
a group of offsets
Figure GDA0003226726910000065
is obtained by uniform random sampling within a circle of radius theta centered on that geometric center. On image I, N_SDDF SDDF values f_θ and N_SGDF SGDF values g_θ are combined into a feature of the superpixel χ_s with dimension N_SDDF + N_SGDF:
Figure GDA0003226726910000066
1) Superpixel-based depth difference feature f_θ: for a segmented superpixel χ_s in the depth map I, with geometric center
Figure GDA0003226726910000067
an offset θ is generated in advance by uniform random sampling within the circle of radius theta around that center, and the depth difference feature value is:
Figure GDA0003226726910000071
where d_I denotes the depth value at a given pixel position.
2) Superpixel-based geodesic distance feature g_θ: first, an undirected graph is built from the segmented superpixels that contain foreground, with the foreground superpixels as vertices. Edges between vertices are then added according to two rules. ① If two superpixels χ_i and χ_j contain directly adjacent pixels and the absolute depth difference between the adjacent pixels is less than δ_d, an edge between the two superpixels is added to the graph, with edge weight
Figure GDA0003226726910000072
which is the Euclidean distance between the centers of the two superpixels. ② For an isolated vertex χ_i that is not connected to any other vertex by the first rule, the superpixel with the minimum distance to χ_i, χ_c = argmin_χ dist(χ_i, χ), is found, and an edge between χ_i and χ_c with weight dist(χ_i, χ_c) is added. Applying the Floyd-Warshall algorithm to the resulting connected undirected graph yields the shortest distance between all pairs of superpixels.
For an offset θ, generated randomly in the same way as for the depth difference feature, the geodesic distance feature value of the superpixel χ_s is:
Figure GDA0003226726910000073
where SP_I(X) denotes the superpixel to which pixel X belongs in the image, and d_geodesic denotes the shortest distance between two vertices on the undirected graph above, i.e. the geodesic distance within the set of superpixels.
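The graph construction and the all-pairs shortest-path step can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `geodesic_distances` is hypothetical, the list of pairs that satisfy the first rule is assumed to be computed elsewhere, and pairs that remain unreachable keep an infinite distance.

```python
import numpy as np

def geodesic_distances(centers, adjacent_pairs):
    """All-pairs shortest paths over a superpixel graph (Floyd-Warshall).

    centers: (n, 3) NumPy array of superpixel centers.
    adjacent_pairs: iterable of (i, j) index pairs that passed rule 1
    (directly adjacent pixels, depth difference below the threshold).
    """
    n = len(centers)
    # Euclidean distance between every pair of centers.
    euclid = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0.0)
    for i, j in adjacent_pairs:
        dist[i, j] = dist[j, i] = euclid[i, j]
    # Rule 2: connect each isolated vertex to its nearest superpixel.
    for i in range(n):
        if np.all(np.isinf(np.delete(dist[i], i))):
            others = euclid[i].copy()
            others[i] = np.inf
            c = int(np.argmin(others))
            dist[i, c] = dist[c, i] = euclid[i, c]
    # Floyd-Warshall relaxation through each intermediate vertex k.
    for k in range(n):
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist
```

For four superpixel centers at the corners of a unit square connected as a chain 0-1-2-3, the geodesic distance between vertices 0 and 3 is 3 even though their Euclidean distance is 1. This gap is what lets the feature distinguish body parts that are close in space but far apart along the body.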
3) Random decision forest classification of superpixels: a random decision forest is applied to the combined superpixel feature F_I(X) for classification.
Training first generates a forest of N_t trees. During training of the random decision forest, the information entropy and information gain of each split node must be computed. The information entropy of a node containing the training sample set S is:
H(S) = -Σ_l p(l|S) log p(l|S)
where l is a body part to be classified and p(l|S) is the probability of class l in the set S. The random decision forest algorithm chooses the split that achieves the maximum information gain. A split P = {P_left, P_right} of a node divides the training sample set S contained in the node into {S_left, S_right}, and the information gain of the split is defined as:
InfGain(S, P) = H(S) - H(S|P)
= H(S) - p_left H(S_left) - p_right H(S_right).
After training, each tree t in the forest gives, for image I, the probability p_t(l | I, χ) that superpixel χ belongs to part l; over the entire forest one obtains
Figure GDA0003226726910000075
Figure GDA0003226726910000076
The part with the maximum probability is selected as the class of the superpixel and as the classification result for all pixels contained in the superpixel.
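The split criterion can be illustrated with a short sketch. The function names are hypothetical, the logarithm base (2) is an assumption, and p_left and p_right are taken to be the fractions of samples sent to each side, which is the usual convention but is not stated explicitly in the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of part labels (log base 2 assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, go_left):
    """InfGain(S, P) = H(S) - p_left * H(S_left) - p_right * H(S_right)."""
    left = [l for l, g in zip(labels, go_left) if g]
    right = [l for l, g in zip(labels, go_left) if not g]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    p_left, p_right = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - p_left * entropy(left) - p_right * entropy(right)
```

For example, with labels `["head", "head", "torso", "torso"]`, a split that separates the two parts perfectly has a gain of 1 bit, while a split that puts one sample of each part on each side has a gain of 0.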
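The classification step can be sketched as below. The formula that combines the per-tree probabilities is not reproduced in this text; averaging over the N_t trees is the common choice for random decision forests and is used here as an assumption. The function names and array layouts are hypothetical.

```python
import numpy as np

def classify_superpixels(tree_probs):
    """Combine per-tree probabilities and pick the most probable part.

    tree_probs: array of shape (N_t, num_superpixels, num_parts) holding
    p_t(l | I, chi) for every tree t. Averaging over trees is an assumption.
    """
    forest_probs = tree_probs.mean(axis=0)
    return forest_probs, forest_probs.argmax(axis=1)

def labels_to_pixels(superpixel_map, superpixel_labels):
    """Propagate each superpixel's label to the pixels it contains.

    superpixel_map: integer image giving the superpixel index of each pixel.
    """
    return superpixel_labels[superpixel_map]
```

Only one forest evaluation per superpixel is needed; the per-pixel labels follow from an index lookup, which is where the speed-up over per-pixel classification comes from.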
(3) Human posture sparse regression based on human body part clustering characteristics:
After the superpixels are classified by the random decision forest, all foreground pixels are classified and the image is divided into the predefined parts. The required skeleton points, however, do not correspond directly to the parts: some joints (such as the head and chest joints) lie at the center of a part (such as the head and trunk parts), while others (such as the elbow joint) lie where two parts meet (the upper arm and the forearm). The invention therefore designs a feature that maps parts to joints and obtains the positions of the skeleton points by sparse regression.
1) Human body posture representation based on part clustering features: for a part l, K-means clustering is applied to obtain N_k cluster points, which are sorted by their distance to a preset part (the main part connected to the part) to obtain a vector
Figure GDA0003226726910000087
The geometric center c_0 of all foreground pixels and the vectors of all N_p parts are then combined into a new human body posture representation based on part clustering features:
Figure GDA0003226726910000081
2) Sparse regression: the goal of human pose estimation is to obtain the positions y of N_J three-dimensional skeleton joint points. Assume there are N training images with known joint information
Figure GDA0003226726910000082
and part features
Figure GDA0003226726910000083
where N_q is the number of part feature points and i = 1, ...
Figure GDA0003226726910000084
where
Figure GDA0003226726910000085
maps the feature c to the j-th skeleton point (j = 1, ..., N_q). The sparse regression model is then y_i ≈ A c_i, and the projection matrix A can be obtained from the following N_J optimizations:
Figure GDA0003226726910000086
where y_i(3j-2 : 3j) denotes the subvector of y_i from dimension (3j-2) to dimension (3j). With the trained matrix A and the part clustering feature c, the three-dimensional positions of the skeleton points are obtained by linear sparse regression as y = Ac.
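The per-joint fitting of the projection matrix can be sketched as below. The optimization objective is not reproduced in this text, so the sketch assumes a common form of sparse regression: least squares with an L1 penalty, solved with iterative soft thresholding (ISTA). The function names, the penalty weight `lam`, and the solver are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink every entry of x toward zero by t (proximal step of the L1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fit_sparse_block(C, Y_j, lam=0.1, steps=500):
    """Fit A_j (3 x dim_c) so that Y_j is close to C @ A_j.T, with an L1 penalty.

    C: (N, dim_c) training features c_i as rows.
    Y_j: (N, 3) coordinates y_i(3j-2 : 3j) of joint j.
    """
    A_j = np.zeros((Y_j.shape[1], C.shape[1]))
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / (np.linalg.norm(C, 2) ** 2 + 1e-12)
    for _ in range(steps):
        grad = (A_j @ C.T - Y_j.T) @ C  # gradient of 0.5 * ||Y_j - C A_j^T||_F^2
        A_j = soft_threshold(A_j - step * grad, step * lam)
    return A_j

def fit_projection(C, Y, lam=0.1):
    """Stack the per-joint blocks into the full (3 * N_J, dim_c) matrix A."""
    n_joints = Y.shape[1] // 3
    return np.vstack([fit_sparse_block(C, Y[:, 3 * j:3 * j + 3], lam)
                      for j in range(n_joints)])
```

With a fitted matrix `A`, the skeleton points for a new feature vector `c` follow from `y = A @ c`, and `y.reshape(-1, 3)` gives one row per joint. Solving one small problem per joint keeps each optimization independent, matching the N_J separate optimizations described above.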
The invention uses a new human body posture estimation framework. It proposes a new combined superpixel feature representation on depth images, segments body parts by applying a random decision forest to the combined feature, extracts clustering features from the segmented parts, and maps them to the final skeleton joint positions by sparse regression. Experiments on several datasets compare the method with other methods and verify the effectiveness of the proposed features and framework.
Example 1:
Machine learning based methods typically require large datasets for training and validation. Three datasets were used in the experiments. The EVAL dataset contains 3 subjects, each corresponding to a different actor; each actor has 8 action sequences with about 10,000 frames, at a resolution of 320 × 240. The XiDian dataset contains 5 motion sequences with 2,850 frames at a resolution of 2048 × 2048; it was acquired on a prototype built for high-resolution depth data acquisition and is noisy. Since these datasets contain only depth data, they cannot be used to train the random decision forest for human body part classification. The invention therefore uses the CMU motion capture database to generate a CMU synthetic dataset (CMUSD) with depth data and part label data. Using the commonly used Poser software, 113 subjects and 2,549 motion sequences were synthesized; each pose is rendered from 8 random cameras placed mainly in front of the subject, giving more than 820,000 depth images at 640 × 480 together with part labels. CMUSD covers a richer range of motions.
The prediction accuracy of a random decision forest increases with tree depth. As FIG. 3(a) shows, however, the gain in accuracy becomes less and less noticeable as the trees get deeper, while the overhead of the trees themselves keeps growing. The invention therefore selects a tree depth of 20. Adding trees has the same effect, and to balance efficiency the forest size is set to 8 trees. As FIG. 3(b) shows, relatively good results are obtained with an offset range of 180 pixels and a feature dimension of 1000. This set of parameters is used in all subsequent experiments.
(1) μ SLIC superpixel partition:
In the superpixel step, CMUSD images at 640 × 480 are initially divided into 12 × 12 pixel grid cells, and the number of superpixels containing human foreground averages about 120. For data from datasets with different resolutions, the initial superpixel size is scaled by the ratio of foreground pixels, so that the foreground is covered by a number of superpixels of the same order of magnitude. During superpixel segmentation, the pixels contained in each superpixel stabilize after a certain number of iterations, and running time is proportional to the number of iterations, so the maximum number of iterations is set to N_i = 10. As for the superpixel distance parameter, FIG. 4(a) shows that the contour details of the chin are not captured by the superpixels of the head, while in FIG. 4(c) the contour details of every part are captured well but many elongated superpixels appear at the body edges. The superpixels in FIG. 4(b) are relatively uniform in size and still capture details. Subsequent experiments therefore use μ = 1.5.
(2) Super-pixel feature extraction:
In this experiment, about 8 thousand superpixels from about 100 thousand images in 12 CMUSD sequences were used. From each sequence, 50% of the data was randomly selected for training, and the other 50% was used for testing and validation.
In fig. 5, different bars represent different feature combinations. The figure shows that 0+1000, which uses only the geodesic distance feature (GDF), is less effective than 1000+0, which uses only the depth difference feature (DDF), although GDF on its own still reaches a useful level of accuracy. In particular, mixing the two features as 800+200 gives a better result than using the depth difference alone. This means that the geodesic distance feature beneficially complements the depth difference feature for certain poses and positions, although most of the time it is the depth difference feature that does the work. The feature finally used is a 1000-dimensional superpixel feature that combines depth difference and geodesic distance as 800+200.
(3) Comparing the classification effect of single pixel and super pixel:
Single-pixel features (PDDF + PGDF) and superpixel features (SDDF + SGDF) were compared experimentally on CMUSD, using the same feature dimensions (800+200) and the same random forest settings. For PDDF + PGDF, 120 pixels are randomly sampled from each depth map to form the training set. The final average accuracy of PDDF + PGDF was 92.105, and that of SDDF + SGDF was 92.468. Fig. 6 shows that the accuracy of SDDF + SGDF and of PDDF + PGDF varies slightly across subjects, but given the randomness of the sampling points there is no significant difference between the two. With the SDDF + SGDF method in fig. 7, some pixels may be merged into a different adjacent region, and the classification of a superpixel affects all the pixels inside it. Nevertheless, using superpixels makes semantically identical regions more likely to be classified uniformly, and greatly reduces the time needed to extract geodesic distance features and to run the test.
(4) Posture regression:
In the feature extraction of the posture regression part, as shown in fig. 2, the number of K-means cluster points is set to 3: each of the 10 parts is clustered into 3 classes, and the cluster points are sorted by their Euclidean distance from the parent connected part (for example, the head clusters are sorted by their distance to the trunk). The resulting feature c has 93 dimensions (3 for the geometric center of the foreground plus 10 × 3 × 3 for the cluster points).
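The assembly of the 93-dimensional feature can be sketched as follows. This is only an illustrative Python outline under stated assumptions: the plain Lloyd's K-means with a fixed random seed, the helper names `kmeans` and `pose_feature`, the synthetic part point sets, and the choice of the parent part's centroid as the sorting reference are all hypothetical and not specified by the patent.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on 3D points; returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def pose_feature(parts, parents, k=3):
    """Concatenate the foreground center c0 with k sorted cluster centers per part."""
    all_points = [p for pts in parts.values() for p in pts]
    c0 = tuple(sum(axis) / len(all_points) for axis in zip(*all_points))
    part_center = {
        name: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for name, pts in parts.items()
    }
    feature = list(c0)
    for name, pts in parts.items():
        anchor = part_center[parents[name]]  # assumed reference: parent part centroid
        centers = sorted(kmeans(pts, k), key=lambda c: math.dist(c, anchor))
        for c in centers:
            feature.extend(c)
    return feature


# Synthetic data: 10 hypothetical parts with 30 three-dimensional points each.
rng = random.Random(1)
names = [f"part{i}" for i in range(10)]
parts = {
    n: [(rng.random() + i, rng.random(), rng.random()) for _ in range(30)]
    for i, n in enumerate(names)
}
parents = {n: names[0] for n in names}  # placeholder hierarchy for the demo
c = pose_feature(parts, parents)
print(len(c))  # 3 + 10 * 3 * 3 = 93
```

Sorting the cluster centers gives each slot of the vector a stable meaning across frames, which a linear regression on the concatenated vector relies on.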
For CMUSD, the experiments used 50% of the pictures in the data set as the training set for both the random forest and the sparse regression, and the rest as the test set. Fig. 8 shows that the joint points of the limbs are less accurate than those of the head and the chest. This is because limb motion is more vigorous and covers a larger range, and any occlusion has a relatively large effect on the pixel classification and therefore on the final regression result.
In the experiment on the XiDian database, 4 sequences are used as the training set and 1 sequence as the test set for cross-validation. The average accuracy (mAP) is 91.7, which shows that the method also produces good results on data with high resolution and heavy noise. Although the XiDian data are noisy, the motions are relatively simple, so the result is higher than the accuracy on CMUSD.
For the EVAL data set, frames in which some joint points lie far from the body foreground, or in which the distance between joint points clearly exceeds what the human body structure allows, are removed, leaving about 3 thousand frames per subject on average. In the pixel classification stage, the random forest is the model trained with all CMUSD pictures as the training set. In the regression stage, two EVAL subjects are used as the training set and the rest as the test set, and the final result is the average over the three cross-validation groups.
Fig. 9 shows a comparison with the algorithm of Ye et al. and the algorithm of Jung et al. on the EVAL data set. The Gaussian mixture model based algorithm proposed by Ye et al. in 2014 needs to fit a body model and runs slowly; the random walk tree model proposed by Jung et al. in 2015 greatly improves execution efficiency, but was verified only on small data. The figure shows that the joint points near the center of the body are relatively accurate, whereas the end points of the limbs are less accurate. This is because the limb ends are classified less accurately in the pixel classification stage, which directly affects the positions of the regression feature cluster points and therefore the accuracy of the pose regression.
(5) Run time
The running times of the single-pixel method and the superpixel method are compared using the same feature dimensions and the same random forest settings. The time complexity of the geodesic distance calculation is O(n^3), so for high-resolution images it is almost impossible to use it as a per-pixel feature in a real-time environment. Although the superpixel-based method adds the time needed to compute the superpixels, it greatly shortens feature extraction and also reduces the random forest classification time. Even if only the depth difference is used as the feature, the superpixel method maintains accuracy while accelerating the algorithm by a factor of 1.5 to 8, depending on the resolution. The table shows that the method meets real-time requirements on data sets of various resolutions.
TABLE 1 execution time (unit: ms)
Figure GDA0003226726910000111
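To make the O(n^3) argument concrete, the short Python calculation below compares the number of relaxation steps of an all-pairs shortest-path pass over superpixels with the same pass over individual pixels. It is a back-of-the-envelope sketch, not a measurement: the foreground pixel count is an assumption derived from about 120 superpixels of roughly 12 × 12 pixels each, and the function name is hypothetical.

```python
def shortest_path_steps(n):
    """Number of inner-loop updates of an O(n^3) all-pairs shortest-path pass."""
    return n ** 3


superpixels = 120               # average foreground superpixels reported for CMUSD
pixels = superpixels * 12 * 12  # assumed: each superpixel covers about one 12 x 12 cell
ratio = shortest_path_steps(pixels) // shortest_path_steps(superpixels)
print(pixels, ratio)  # 17280 2985984
```

This ratio concerns only the graph computation for the geodesic feature. The overall speed-up reported for the whole pipeline is the 1.5 to 8 times stated above.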
This part demonstrates that, under a variety of resolutions and data qualities, the new combined features effectively express body part characteristics, and the whole framework computes the human body posture effectively and in real time.
The superpixel generation method uses a pixel distance measure that takes both two-dimensional and three-dimensional information into account, realizes down-sampling from pixels to superpixels, and reduces the order of magnitude of the data that must be processed directly. The method extracts, from a single-frame depth map, combined features that fuse depth difference and geodesic distance information, makes comprehensive use of the global and local relations between pixels, and improves the classification accuracy of human body parts. Compared with previous work, human joint points are estimated by sparse regression on feature points clustered from the part positions, which reduces processing time and yields higher pose estimation accuracy.

Claims (2)

1. A human body posture estimation method based on depth image super-pixel combined features is characterized by comprising the following steps:
Step (1): μSLIC superpixel division
Simple linear iterative clustering with weight μ (μSLIC) is used to perform the superpixel operation on the depth map, in two stages. First, an initialization stage: for a depth image I, the depth space (u, v, d(u, v)) is converted into the corresponding three-dimensional point cloud space (x, y, z); the depth image is uniformly divided by a δ × δ grid into N_s seeds, the pixels in each grid cell are added to the cluster centered on its seed point, and according to the geometric center of all the pixels in the cluster
Figure FDA0003226726900000011
the new position of the seed point is updated. Then an iteration stage: for every seed point, the distance between each pixel within its 3δ × 3δ neighborhood and the seed point is measured with the distance D_s, each pixel is assigned to the nearest seed point cluster, and the new positions of the N_s seed points are updated; this is iterated until the process converges or the maximum number of iterations N_i is reached;
Step (2): human body part segmentation with the SDDF + SGDF superpixel combined feature
A combined feature SDDF + SGDF, made up of the superpixel-based depth difference feature SDDF and the superpixel-based geodesic distance feature SGDF, is adopted. For a superpixel χ_s, with its geometric center
Figure FDA0003226726900000012
as the center of a circle of radius Θ, a group of offsets is obtained by random uniform sampling within the circle
Figure FDA0003226726900000013
On the image I, N_SDDF values f_θ of SDDF and N_SGDF values g_θ of SGDF are combined to obtain a feature of the superpixel χ_s with dimension N_SDDF + N_SGDF:
Figure FDA0003226726900000014
1) Superpixel-based depth difference feature f_θ: for a superpixel χ_s segmented from the depth map I, within the circle centered at its geometric center
Figure FDA0003226726900000017
with radius Θ, an offset θ is generated in advance by random uniform sampling, and the depth difference feature value is:
Figure FDA0003226726900000015
where d_I(·) denotes the depth value at a given pixel position;
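By way of example only, the offset sampling and the depth probe can be sketched in Python as follows. The exact expression of the feature is given by the formula image above and is not reproduced here; the sketch assumes the simplest unnormalized form, d_I(center + θ) − d_I(center), and the function names, the background depth constant and the toy depth map are hypothetical.

```python
import math
import random


def sample_offsets(count, radius, seed=0):
    """Draw pixel offsets uniformly from a disc of the given radius."""
    rng = random.Random(seed)
    offsets = []
    for _ in range(count):
        r = radius * math.sqrt(rng.random())  # sqrt keeps the density uniform over the disc
        phi = 2 * math.pi * rng.random()
        offsets.append((round(r * math.cos(phi)), round(r * math.sin(phi))))
    return offsets


def depth_difference(depth, center, offset, background=10_000):
    """Assumed form of the feature: d_I(center + offset) - d_I(center)."""
    u, v = center
    pu, pv = u + offset[0], v + offset[1]
    inside = 0 <= pv < len(depth) and 0 <= pu < len(depth[0])
    probe = depth[pv][pu] if inside else background  # off-image probes read as far background
    return probe - depth[v][u]


# Toy 4 x 4 depth map (rows indexed by v, columns by u), values in millimetres.
depth = [[1000] * 4 for _ in range(4)]
depth[1][2] = 1500
offsets = sample_offsets(800, 180)  # 800 SDDF dimensions, 180-pixel offset range
print(depth_difference(depth, (1, 1), (1, 0)))   # 500
print(depth_difference(depth, (1, 1), (10, 0)))  # 9000
```

Because the offsets are generated once in advance, the same probe pattern is applied to every superpixel, which keeps each feature dimension comparable across samples.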
2) Superpixel-based geodesic distance feature g_θ: first, an undirected graph is formed from the segmented superpixels in the image that contain foreground, the vertices being the foreground superpixels; then, edges between vertices are added according to two rules: ① if two superpixels χ_i and χ_j have directly adjacent pixels and the absolute value of the depth difference between the adjacent pixels is less than δ_d, an edge is added between the two superpixels in the graph, and the weight of the edge
Figure FDA0003226726900000016
is the Euclidean distance between the centers of the two superpixels; ② for an isolated vertex χ_i that has no edge to any other vertex under the first rule, there exists a superpixel χ_c = argmin_χ dist(χ_i, χ) with minimum distance to χ_i, and an edge between χ_i and χ_c with weight dist(χ_i, χ_c) is added; the Floyd-Warshall algorithm is then applied to the constructed connected undirected graph, so that the shortest distances between all superpixels can be calculated,
for an offset θ obtained randomly in the same way as for the depth difference feature, the geodesic distance feature value of the superpixel χ_s is expressed as:
Figure FDA0003226726900000021
where SF_I(X) denotes the superpixel to which the pixel X belongs in the image, and d_geodesic denotes the shortest distance between two vertices on the above undirected graph, i.e. the geodesic distance within the set of superpixels;
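The graph construction and shortest-path step described above can be sketched in Python as follows. This is an illustrative outline under assumptions: the adjacency test of the first rule (touching superpixels whose bordering depths differ by less than δ_d) is taken as already evaluated and passed in as index pairs, and the function name and sample centers are hypothetical.

```python
import math


def superpixel_geodesics(centers, adjacent_pairs):
    """All-pairs shortest distances over the superpixel graph.

    centers: list of 3D superpixel centers.
    adjacent_pairs: index pairs (i, j) that already passed rule 1.
    """
    n = len(centers)
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]

    def connect(i, j):
        w = math.dist(centers[i], centers[j])  # edge weight: Euclidean center distance
        dist[i][j] = dist[j][i] = min(dist[i][j], w)

    for i, j in adjacent_pairs:
        connect(i, j)

    # Rule 2: attach every isolated vertex to its nearest superpixel.
    linked = {i for pair in adjacent_pairs for i in pair}
    for i in range(n):
        if i not in linked and n > 1:
            nearest = min((j for j in range(n) if j != i),
                          key=lambda j: math.dist(centers[i], centers[j]))
            connect(i, nearest)

    # Floyd-Warshall relaxation over all intermediate vertices k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


centers = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 0)]
d = superpixel_geodesics(centers, [(0, 1), (1, 2)])
print(d[0][2])  # 2.0: the path follows the graph, not the straight line
```

In this example vertex 3 has no rule 1 neighbor, so it is attached to vertex 2, its nearest superpixel, and its geodesic distance to vertex 0 becomes 2 plus the length of that added edge.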
3) Random decision forest classification of superpixels: the joint feature F_I(χ) of each superpixel is classified with a random decision forest.
First, N_t random decision trees are generated by training. During training of the random decision forest, the information entropy and the information gain must be computed for each split node. The information entropy of a node containing the training sample set S is:
Figure FDA0003226726900000022
where l is a part to be classified and P(l | S) is the probability of being classified as l in the set S. The random decision forest algorithm selects the partition that yields the maximum information gain. A partition P = {P_left, P_right} of a node divides the training sample set S contained in the node into {S_left, S_right}, and the information gain of the partition is defined as:
InfGain(S, P) = H(S) - H(S | P) = H(S) - p_left H(S_left) - p_right H(S_right);
After the random decision forest has been trained, each tree t in the forest gives, for the image I, the probability p_t(l | I, χ) that the superpixel χ is classified as part l; over the whole forest this gives
Figure FDA0003226726900000023
The part with the maximum probability is selected as the class of the superpixel and as the classification result of the pixels contained in the superpixel;
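For illustration, the split criterion and the forest decision can be written in Python as below. The sketch assumes the standard Shannon entropy with base-2 logarithm, takes p_left and p_right as the fractions of samples sent to each side, and assumes that the forest posterior is the plain average of the per-tree posteriors; the two formula images above may differ in detail. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(S) of the part labels in a sample set."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(labels, left, right):
    """InfGain(S, P) = H(S) - p_left * H(S_left) - p_right * H(S_right)."""
    p_left = len(left) / len(labels)
    p_right = len(right) / len(labels)
    return entropy(labels) - p_left * entropy(left) - p_right * entropy(right)


def forest_vote(tree_posteriors):
    """Average the per-tree posteriors p_t(l | I, chi) and pick the most probable part."""
    parts = tree_posteriors[0].keys()
    mean = {l: sum(t[l] for t in tree_posteriors) / len(tree_posteriors)
            for l in parts}
    return max(mean, key=mean.get), mean


labels = ["head"] * 4 + ["torso"] * 4
gain = information_gain(labels, ["head"] * 4, ["torso"] * 4)
best, mean = forest_vote([{"head": 0.7, "torso": 0.3},
                          {"head": 0.4, "torso": 0.6}])
print(gain, best)  # a perfect split of two balanced classes gains 1 bit; vote is "head"
```

The label chosen for a superpixel would then be copied to every pixel it contains, as stated in the claim.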
Step (3): sparse regression of human posture based on human body part clustering features
The superpixels are classified by the random decision forest, which in turn classifies all foreground pixels; features going from parts to joints are then adopted, and the positions of the skeleton points are mapped by a sparse regression method;
1) Human body posture representation based on part clustering features: for a part l, K-means clustering is performed on it to obtain N_k cluster points, which are sorted by their distance to a preset part to obtain a vector
Figure FDA00032267269000000210
Then the geometric center c_0 of all foreground pixels and the vectors of all N_p parts are combined to obtain a new human body posture representation based on part clustering features:
Figure FDA0003226726900000024
2) Sparse regression: the goal of human pose estimation is to obtain the positions y of N_J three-dimensional skeleton joint points. Suppose there are N training pictures with known joint point information
Figure FDA0003226726900000025
and part features
Figure FDA0003226726900000026
where N_q is the number of part feature points and i = 1, ...
Figure FDA0003226726900000027
where
Figure FDA0003226726900000028
maps the feature c to the j-th skeleton point (j = 1, ..., N_q). The sparse regression model is then y_i ≈ A c_i, and the projection matrix A can be obtained through the following N_J optimizations:
Figure FDA0003226726900000029
where y_i(3j-2 : 3j) denotes the subvector of y_i from dimension (3j-2) to dimension (3j). With the trained matrix A and the part clustering feature c, the three-dimensional positions of the skeleton points are obtained by linear sparse regression as y = Ac.
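A minimal Python sketch of fitting such a projection matrix is given below. The optimization objective in the claim is contained in the formula image and is not reproduced here; the sketch assumes an L1-regularized least-squares (lasso) objective solved row by row with coordinate descent, which is one common way to obtain a sparse linear map. The function names, the regularization weight `lam` and the toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_row(features, targets, lam, sweeps=200):
    """Coordinate descent for min_a 0.5 * sum_i (y_i - a . c_i)^2 + lam * ||a||_1."""
    dims = len(features[0])
    a = [0.0] * dims
    residual = list(targets)  # y_i - a . c_i, with a = 0 initially
    for _ in range(sweeps):
        for d in range(dims):
            column = [c[d] for c in features]
            norm = sum(x * x for x in column)
            if norm == 0.0:
                continue
            # Correlation of column d with the residual that ignores a[d].
            rho = sum(x * (r + a[d] * x) for x, r in zip(column, residual))
            new = soft_threshold(rho, lam) / norm
            residual = [r - (new - a[d]) * x for x, r in zip(column, residual)]
            a[d] = new
    return a


def fit_projection(features, joints, lam):
    """Fit one sparse row of A per output coordinate, so that y is approximately A c."""
    outputs = len(joints[0])
    return [lasso_row(features, [y[r] for y in joints], lam)
            for r in range(outputs)]


def predict(A, c):
    """Evaluate y = A c for one feature vector."""
    return [sum(w * x for w, x in zip(row, c)) for row in A]


# Toy data: y = (2 * c[0], -c[2], 0) for four 3-dimensional feature vectors.
features = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
joints = [(2 * c[0], -c[2], 0.0) for c in features]
A = fit_projection(features, joints, lam=0.01)
print(predict(A, (1, 0, 0)))
```

With a small `lam` the recovered rows stay close to the generating coefficients while unused feature dimensions are driven to exactly zero, which is the behavior a sparse map from part cluster points to joints is meant to have.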
2. The human body posture estimation method based on depth image super-pixel combined features as claimed in claim 1, wherein in step (1) the distance measure D_s between a pixel X_k(x_k, y_k, z_k) and the i-th cluster center point
Figure FDA0003226726900000031
is designed as follows:
Figure FDA0003226726900000032
where μ is the weight that adjusts the compactness of the superpixels.
CN201711395472.1A 2017-12-21 2017-12-21 Human body posture estimation method based on depth image super-pixel combined features Active CN108154104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711395472.1A CN108154104B (en) 2017-12-21 2017-12-21 Human body posture estimation method based on depth image super-pixel combined features

Publications (2)

Publication Number Publication Date
CN108154104A CN108154104A (en) 2018-06-12
CN108154104B true CN108154104B (en) 2021-10-15

Family

ID=62464113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711395472.1A Active CN108154104B (en) 2017-12-21 2017-12-21 Human body posture estimation method based on depth image super-pixel combined features

Country Status (1)

Country Link
CN (1) CN108154104B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635783B (en) * 2019-01-02 2023-06-20 上海数迹智能科技有限公司 Video monitoring method, device, terminal and medium
CN110288677B (en) * 2019-05-21 2021-06-15 北京大学 Pedestrian image generation method and device based on deformable structure
CN112288798A (en) * 2019-07-24 2021-01-29 鲁班嫡系机器人(深圳)有限公司 Posture recognition and training method, device and system
CN110598675B (en) * 2019-09-24 2022-10-11 深圳度影医疗科技有限公司 Ultrasonic fetal posture identification method, storage medium and electronic equipment
CN110610505A (en) * 2019-09-25 2019-12-24 中科新松有限公司 Image segmentation method fusing depth and color information
CN111046733B (en) * 2019-11-12 2023-04-18 宁波大学 3D human body posture estimation method based on sparsity and depth
CN111428619B (en) * 2020-03-20 2022-08-05 电子科技大学 Three-dimensional point cloud head attitude estimation system and method based on ordered regression and soft labels
CN111860311A (en) * 2020-07-20 2020-10-30 南京智金科技创新服务中心 Method and system for prompting abnormal posture of human body
CN112070835A (en) * 2020-08-21 2020-12-11 达闼机器人有限公司 Mechanical arm pose prediction method and device, storage medium and electronic equipment
CN112766335B (en) * 2021-01-08 2023-12-01 四川九洲北斗导航与位置服务有限公司 Image processing method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103890752A (en) * 2012-01-11 2014-06-25 三星电子株式会社 Apparatus for recognizing objects, apparatus for learning classification trees, and method for operating same
CN105389569A (en) * 2015-11-17 2016-03-09 北京工业大学 Human body posture estimation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7995841B2 (en) * 2007-09-24 2011-08-09 Microsoft Corporation Hybrid graph model for unsupervised object segmentation

Also Published As

Publication number Publication date
CN108154104A (en) 2018-06-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant