CN111797692A - Depth image gesture estimation method based on semi-supervised learning - Google Patents

Depth image gesture estimation method based on semi-supervised learning

Info

Publication number
CN111797692A
CN111797692A (application CN202010503293.0A)
Authority
CN
China
Prior art keywords
point cloud
dimensional
feature
dimensional point
gesture
Prior art date
Legal status
Granted
Application number
CN202010503293.0A
Other languages
Chinese (zh)
Other versions
CN111797692B (en)
Inventor
涂志刚
陈雨劲
张宇昊
刘军
Current Assignee
Wuhan University WHU
Shenzhen Infinova Ltd
Original Assignee
Wuhan University WHU
Shenzhen Infinova Ltd
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU, Shenzhen Infinova Ltd filed Critical Wuhan University WHU
Priority to CN202010503293.0A priority Critical patent/CN111797692B/en
Publication of CN111797692A publication Critical patent/CN111797692A/en
Application granted granted Critical
Publication of CN111797692B publication Critical patent/CN111797692B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/60Rotation of whole images or parts thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a depth image gesture estimation method based on semi-supervised learning. Compared with RGB images, depth images allow the hand gesture to be estimated with higher precision. Existing gesture estimation methods based on deep learning perform well, but they rely too heavily on labelled training data, and annotating three-dimensional gestures in images is very laborious. The invention provides an efficient point cloud representation that effectively fuses local and global features, and realizes a new method for estimating three-dimensional hand poses from depth images with high precision. By reducing the dependence on annotated data during model training, the method lowers the cost of data annotation. Compared with existing semi-supervised learning methods, the invention achieves a breakthrough in hand pose estimation accuracy while maintaining runtime efficiency.

Description

Depth image gesture estimation method based on semi-supervised learning
Technical Field
The invention belongs to the technical field of digital image recognition, and particularly relates to a depth image gesture estimation method based on semi-supervised learning.
Background
Automated real-time three-dimensional gesture estimation has attracted considerable attention in recent years, with a wide range of application scenarios such as human-machine interaction, computer graphics, and virtual/augmented reality. After years of intensive research, three-dimensional gesture estimation has made remarkable progress in both accuracy and efficiency. Since convolutional neural networks perform well on images, most gesture estimation methods are based on them. Some methods use two-dimensional convolutions to process depth images, but because they lack an explicit representation of three-dimensional spatial information, the features extracted by two-dimensional convolutional networks are not well suited to direct three-dimensional pose estimation. To better capture the geometric structure of depth data, other methods convert the depth image into a three-dimensional voxel representation and then apply three-dimensional convolutions to obtain the hand pose, but three-dimensional convolution has considerable memory and computational requirements. While these approaches have made significant advances in estimation accuracy, they usually rely heavily on large amounts of annotated data for network training and rarely consider how to reduce the annotation cost.
Disclosure of Invention
In view of the above-identified deficiencies in the art or needs for improvement, the present invention provides a gesture estimation method based on semi-supervised learning. For the representation of three-dimensional data, the invention adopts the point cloud as the representation of hand depth image data, in order to retain the three-dimensional characteristics of the depth image while ensuring computational efficiency. To address the problem that training a gesture estimation model requires a large amount of labelled pose information while annotating three-dimensional gestures is costly, the invention designs a semi-supervised deep network architecture that trains the whole network with unlabelled data and a small amount of labelled data, so as to improve the precision of three-dimensional gesture estimation and reduce the annotation cost of training data.
In order to achieve the above object, the present invention provides a depth image gesture estimation method based on semi-supervised learning, comprising the following steps:
step 1: converting the multiple depth images into multiple groups of three-dimensional point clouds through a camera coordinate system;
step 2: randomly sampling each group of three-dimensional point clouds to obtain sampled three-dimensional point clouds, performing principal component analysis on the sampled three-dimensional point clouds to obtain a feature vector set of the group of three-dimensional point clouds, sequentially selecting a first feature vector, a second feature vector and a third feature vector from the feature vector set, further constructing the coordinate axis directions of the point cloud bounding box coordinate system, setting the coordinate average value of the sampled three-dimensional point clouds as the coordinate origin of the point cloud bounding box coordinate system, determining the value range of the point cloud bounding box coordinate system according to the coordinate range of the three-dimensional point clouds so as to determine the point cloud bounding box coordinate system, and converting the group of sampled three-dimensional point clouds into the point cloud bounding box coordinate system to obtain converted three-dimensional point clouds;
step 3: constructing a gesture estimation network through a feature extractor, a point cloud reconstructor and a gesture estimator;
step 4: taking the converted three-dimensional point clouds Xdata_k as training samples, constructing a gesture estimation network loss function model according to whether each training sample is annotated or not, and obtaining a trained gesture estimation network through optimization training;
Preferably, the number of depth images in step 1 is N_K;
The k-th depth image is:
d_k(u,v), k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein d_k(u,v) represents the depth, in the camera coordinate system, of the pixel on the u-th row and v-th column of the k-th depth image, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
Step 1, the camera coordinate system transformation is:
z_k(u,v) = d_k(u,v)
x_k(u,v) = (u - c_{k,1}) * z_k(u,v) / f_{k,1}
y_k(u,v) = (v - c_{k,2}) * z_k(u,v) / f_{k,2}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein c_{k,1} represents the first coordinate parameter of the k-th depth image, c_{k,2} the second coordinate parameter of the k-th depth image, f_{k,1} the first focal length parameter of the k-th depth image, f_{k,2} the second focal length parameter of the k-th depth image, N_K is the number of depth images, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
Step 1, the three-dimensional point cloud is:
data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein data_k represents the k-th group of three-dimensional point clouds, (x_k(u,v), y_k(u,v), z_k(u,v)) represents the coordinate point on the u-th row and v-th column of the k-th group of three-dimensional point clouds, N_K is the number of groups of three-dimensional point clouds, M represents the number of rows of the k-th group of three-dimensional point clouds, N represents the number of columns of the k-th group of three-dimensional point clouds, and M*N is the number of coordinate points in the k-th group of three-dimensional point clouds;
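As an illustration of this step, the back-projection can be written in a few lines of NumPy. This is a minimal sketch, assuming a depth map stored as an M×N array and pinhole intrinsics passed as plain scalars; the names depth_to_point_cloud, c1, c2, f1 and f2 are chosen for the example and do not come from the patent:

```python
import numpy as np

def depth_to_point_cloud(depth, c1, c2, f1, f2):
    """Back-project an M x N depth image d_k(u, v) into an (M*N, 3) point cloud."""
    M, N = depth.shape
    # u indexes rows and v indexes columns, matching d_k(u, v) above
    u, v = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    z = depth.astype(np.float64)          # z_k(u, v) = d_k(u, v)
    x = (u - c1) * z / f1                 # x_k(u, v)
    y = (v - c2) * z / f2                 # y_k(u, v)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```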
Preferably, the step 2 of randomly sampling the three-dimensional point cloud to obtain the sampled three-dimensional point cloud is as follows:
from data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}, k∈[1,N_K], u∈[1,M], v∈[1,N], L coordinate points are randomly selected out of the M*N coordinate points as the sampled three-dimensional point cloud, which is specifically defined as:
(x_{k,m}, y_{k,m}, z_{k,m}), k∈[1,N_K], m∈[1,L]
wherein (x_{k,m}, y_{k,m}, z_{k,m}) is the m-th coordinate point of the k-th group of sampled three-dimensional point clouds, L is the number of coordinate points in the k-th group of sampled three-dimensional point clouds, and N_K is the number of groups of sampled three-dimensional point clouds;
Step 2, principal component analysis is performed on the sampled three-dimensional point cloud to obtain the feature vector set of the group of three-dimensional point clouds:
V_k = {v_{k,1}, v_{k,2}, ..., v_{k,p}}
wherein p is the number of principal components taken by the principal component analysis, and V_k is the feature vector set of the k-th group of three-dimensional point clouds data_k;
the first feature vector, the second feature vector and the third feature vector in step 2 are three-dimensional space vectors, denoted in sequence as v_{k,1}, v_{k,2} and v_{k,3};
the first feature vector, the second feature vector and the third feature vector are three mutually orthogonal vectors;
step 2, the coordinate axis directions of the point cloud bounding box coordinate system are further constructed as follows:
the first three direction vectors v_{k,1}, v_{k,2} and v_{k,3} are taken as the three coordinate axis directions of the point cloud bounding box coordinate system of this group of point clouds;
Step 2, the coordinate average value of the sampled three-dimensional point cloud is:
x̄_k = (1/L) Σ_{m=1..L} x_{k,m}
ȳ_k = (1/L) Σ_{m=1..L} y_{k,m}
z̄_k = (1/L) Σ_{m=1..L} z_{k,m}
and the coordinate mean of the L sampled points is taken as the coordinate origin of the point cloud bounding box coordinate system:
T_k = (x̄_k, ȳ_k, z̄_k)
Step 2, determining the value ranges of the point cloud bounding box coordinate system in three directions according to the coordinate range of the three-dimensional point cloud, wherein the value ranges are respectively as follows:
sxk=max((xk,1,xk,2,...,xk,L))-min(xk,1,xk,2,...,xk,L)
syk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
szk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
wherein max represents taking the maximum value, and min represents taking the minimum value;
step 2, the sampled three-dimensional point cloud is converted into the point cloud bounding box coordinate system to obtain the converted three-dimensional point cloud:
for the m-th point p_{k,m} = (x_{k,m}, y_{k,m}, z_{k,m}) of the k-th group of point clouds, its coordinates in the point cloud bounding box coordinate system after conversion are
Xdata_{k,m} = R_k (p_{k,m} - T_k) / s_k
wherein s_k = (sx_k, sy_k, sz_k) and the division by s_k is taken component-wise;
R_k is determined as follows:
the original coordinate axes are rotated in sequence about the z axis, the y axis and the x axis by the angles yaw_k, pitch_k and roll_k so that they align with the axes of the point cloud bounding box; the rotation matrix from the original coordinate system to the point cloud bounding box coordinate system is then defined as:
R_{k,z} = [[cos(yaw_k), -sin(yaw_k), 0], [sin(yaw_k), cos(yaw_k), 0], [0, 0, 1]]
R_{k,y} = [[cos(pitch_k), 0, sin(pitch_k)], [0, 1, 0], [-sin(pitch_k), 0, cos(pitch_k)]]
R_{k,x} = [[1, 0, 0], [0, cos(roll_k), -sin(roll_k)], [0, sin(roll_k), cos(roll_k)]]
R_k = R_{k,z} R_{k,y} R_{k,x}
the translation matrix is T_k = (x̄_k, ȳ_k, z̄_k), i.e. the coordinate origin of the point cloud bounding box coordinate system;
The converted three-dimensional point cloud in step 2 is:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
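The whole of step 2 can be sketched as follows. This is only a sketch under stated assumptions: the PCA eigenvectors are used directly as the bounding-box axes and the rotated, centred coordinates are normalised component-wise by the extent s_k = (sx_k, sy_k, sz_k); the exact scaling used in the patent appears only in formulas rendered as images, so details may differ:

```python
import numpy as np

def normalize_point_cloud(points, L=1024, rng=None):
    """points: (P, 3) camera-space cloud; returns (L, 3) sampled, bounding-box-aligned cloud."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(points), size=L, replace=len(points) < L)   # random sampling
    sampled = points[idx]

    centroid = sampled.mean(axis=0)                 # coordinate origin T_k
    cov = np.cov((sampled - centroid).T)            # 3x3 covariance for PCA
    _, eigvecs = np.linalg.eigh(cov)                # columns are eigenvectors
    axes = eigvecs[:, ::-1]                         # order by decreasing variance

    local = (sampled - centroid) @ axes             # rotate into the box frame
    extent = local.max(axis=0) - local.min(axis=0)  # (sx_k, sy_k, sz_k)
    return local / np.maximum(extent, 1e-8)         # component-wise normalisation
```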
Preferably, the feature extractor in step 3 is connected in sequence to the point cloud reconstructor and to the gesture estimator respectively;
the input data of the feature extractor is the converted three-dimensional point cloud of step 2, namely:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
The feature extraction method of the feature extractor in step 3 is specifically as follows:
step 3.1, constructing a self-organizing map from the converted point cloud;
for the converted point cloud Xdata_k, a self-organizing map point cloud M_k is constructed;
M_k imitates the spatial distribution of Xdata_k with fewer points;
the self-organizing map is an artificial neural network that uses unsupervised learning to produce a low-dimensional discretized representation of the input space of the training samples;
the self-organizing map point cloud M_k preserves the topological properties of Xdata_k through a proximity function;
the self-organizing map point cloud is expressed as:
M_k = {(x_{k,m}, z_{k,m}, y_{k,m})}, m∈[1,M]
wherein M is the number of points in the self-organizing map point cloud, and M is smaller than the number L of points in the point cloud Xdata_k;
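A toy NumPy update loop makes the self-organizing map concrete. It is only illustrative: the patent fixes neither the SOM grid size nor the training schedule, so the hyper-parameters below (an 8×8 grid, exponential decay) are assumptions:

```python
import numpy as np

def fit_som(points, grid=8, epochs=20, lr=0.5, sigma=2.0, seed=0):
    """Fit a grid x grid SOM to an (L, 3) point cloud; returns the (grid*grid, 3) node cloud M_k."""
    rng = np.random.default_rng(seed)
    num_nodes = grid * grid
    nodes = points[rng.choice(len(points), num_nodes)]   # initialise nodes from the data
    coords = np.array([(i, j) for i in range(grid) for j in range(grid)], dtype=float)
    for epoch in range(epochs):
        decay = np.exp(-epoch / epochs)                  # shrink learning rate and neighbourhood
        for p in points[rng.permutation(len(points))]:
            winner = np.argmin(np.linalg.norm(nodes - p, axis=1))
            # proximity (neighbourhood) function on the node grid preserves the topology
            h = np.exp(-np.sum((coords - coords[winner]) ** 2, axis=1)
                       / (2 * (sigma * decay) ** 2))
            nodes += (lr * decay) * h[:, None] * (p - nodes)
    return nodes
```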
step 3.2, extracting node features from the converted point cloud;
for each point Xdata_{k,m} of the converted point cloud Xdata_k, a k-nearest-neighbour search is performed in the self-organizing map point cloud M_k;
each point Xdata_{k,m} of the point cloud Xdata_k is thus associated with its k nearest nodes, giving k×L point-node pairs, and the coordinates of these k×L points are input to the fully connected layer network module;
the input size of the fully connected layer network module is k×L×3 and its output size is k×L×F_n;
then, according to the correspondence between the M nodes found by the k-nearest-neighbour search and the k×L points, a max-pooling operation is applied to the k×L×F_n output of the fully connected module over the points belonging to each node, reducing it to the M dimension and yielding the node feature Feature_{k,n}, a matrix of size M×F_n;
step 3.3, obtaining the global feature from the node features;
the node feature Feature_{k,n} is input to a fully connected layer network module whose input size is M×F_n and whose output size is M×F_g, and a max-pooling operation is then taken over the M dimension to obtain the global feature Feature_{k,g}, a vector of size 1×F_g;
in summary, the feature extractor takes the converted point cloud Xdata_k as input and outputs the node feature Feature_{k,n} and the global feature Feature_{k,g};
the global feature is defined as: Feature_{k,g} is a vector of size 1×F_g, where F_g is a constant and N_L denotes the length of the overall feature vector;
the node feature is defined as: Feature_{k,n} is a matrix of size M×F_n, where F_n is a constant and M is the defined number of nodes.
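A PyTorch-style sketch of this extractor is given below. It is a simplification, not the patented network: the patent does not name a framework, the layer widths (64, 256) are assumptions, and one feature per point is shared across its k nearest nodes instead of computing k×L node-relative features:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, k=3, Fn=128, Fg=1024):
        super().__init__()
        self.k = k
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, Fn), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(Fn, 256), nn.ReLU(),
                                      nn.Linear(256, Fg), nn.ReLU())

    def forward(self, points, som_nodes):
        # points: (L, 3) converted cloud Xdata_k, som_nodes: (M, 3) SOM cloud M_k
        dist = torch.cdist(points, som_nodes)                  # (L, M) pairwise distances
        knn = dist.topk(self.k, dim=1, largest=False).indices  # (L, k) nearest nodes per point
        feat = self.point_mlp(points)                          # (L, Fn) per-point features
        node_feat = torch.stack([                              # max-pool points attached to each node
            feat[(knn == m).any(dim=1)].max(dim=0).values
            if bool((knn == m).any()) else feat.new_zeros(feat.size(1))
            for m in range(som_nodes.size(0))
        ])                                                     # (M, Fn) node feature Feature_{k,n}
        global_feat = self.node_mlp(node_feat).max(dim=0).values  # (Fg,) global feature Feature_{k,g}
        return node_feat, global_feat
```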
The point cloud reconstructor reconstructs the k-th group of reconstructed three-dimensional point clouds X_{k,r} from the global feature Feature_{k,g};
Step 3, the point cloud reconstructor is specifically defined as follows:
the global feature vector Feature_{k,g} is used to reconstruct the three-dimensional point cloud, and the reconstructed three-dimensional point cloud X_{k,r} is output;
the point cloud reconstructor adopts a reconstruction network structure with two branches; the fully connected branch predicts the position of each point independently, is good at describing complex structures, and consists of 4 fully connected layers;
the convolutional branch consists of 5 convolutions, each convolutional layer being followed by a deconvolution layer, which makes full use of spatial continuity;
the output point cloud module consists of two 1×1 convolution layers, and the prediction results of the two branches are merged to obtain the complete reconstructed point cloud X_{k,r};
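The two-branch decoder can be sketched as below. The patent fixes only the branch structure (a 4-layer fully connected branch, a convolutional branch with deconvolutions, and a 1×1-convolution output module); the widths, the 16×16 feature map and the number of points per branch are assumptions, and the convolutional branch here is shorter than the 5-convolution stack described above:

```python
import torch
import torch.nn as nn

class PointCloudReconstructor(nn.Module):
    def __init__(self, Fg=1024, n_fc_points=256):
        super().__init__()
        # fully connected branch: predicts each point position independently
        self.fc_branch = nn.Sequential(
            nn.Linear(Fg, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_fc_points * 3))
        # convolutional branch: exploits spatial continuity on a 2-D feature map
        self.conv_branch = nn.Sequential(
            nn.ConvTranspose2d(Fg, 256, 4), nn.ReLU(),                        # 1x1 -> 4x4
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4x4 -> 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8x8 -> 16x16
            nn.Conv2d(64, 3, 1))                                              # 1x1 conv -> xyz map

    def forward(self, global_feat):                           # global_feat: (Fg,)
        g = global_feat.unsqueeze(0)
        fc_points = self.fc_branch(g).view(-1, 3)             # (n_fc_points, 3)
        conv_points = self.conv_branch(g[:, :, None, None])   # (1, 3, 16, 16)
        conv_points = conv_points.permute(0, 2, 3, 1).reshape(-1, 3)
        return torch.cat([fc_points, conv_points], dim=0)     # merged reconstruction X_{k,r}
```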
The global feature Feature_{k,g} and the node feature Feature_{k,n} are passed through the gesture estimator to estimate the three-dimensional gesture;
Step 3, the gesture estimator is specifically defined as follows:
the global feature Feature_{k,g} and the local feature Feature_{k,n} obtained in the feature extraction stage are fused: the global feature vector of size 1×F_g is copied M times along the first dimension, raising it to a matrix of size M×F_g with the same first dimension as the local feature matrix of size M×F_n, and the two matrices are spliced (concatenated) along the second dimension to form the fused feature matrix Feature_{k,f} of size M×(F_n+F_g);
a new feature matrix is then obtained through a fully connected layer network module whose output width is defined as N_L, so that the new feature matrix has size M×N_L, the input size of the fully connected layer being M×(F_n+F_g) and its output size M×N_L;
the M×N_L feature matrix output by the fully connected layer is then average-pooled over the M dimension to obtain the overall feature vector V_{k,f} of length N_L;
the overall feature vector V_{k,f}, obtained by fusing the global feature Feature_{k,g} and the node feature Feature_{k,n}, is used to estimate the three-dimensional gesture Pose_k;
the method for estimating the three-dimensional gesture is:
the three-dimensional coordinates of the hand key points are regressed from the overall feature vector by several fully connected layer network modules;
the overall feature vector of length N_L passes through fully connected layer modules whose dimensions are (N_L, U, V, 3×N_joints), producing a 3×N_joints output that is reshaped into the three-dimensional coordinates Pose_k of the N_joints joint points, where N_L is the length of the overall feature vector input to the fully connected layers, and U and V are respectively the first and second hidden-layer dimensions of the fully connected layers;
the three-dimensional gesture is defined as:
Pose_k = {P_{k,num}}, num∈[1,N_joints]
wherein Pose_k represents the gesture of the k-th group, P_{k,num} is the three-dimensional coordinate of the num-th joint of the k-th group, and N_joints represents the number of gesture joints;
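The estimator itself reduces to a feature fusion followed by a small regression head, as in the sketch below. The default widths follow the example values given in the detailed embodiment (N_L = 1024, U = 512, V = 256, N_joints = 21); everything else, including the framework, is an assumption:

```python
import torch
import torch.nn as nn

class GestureEstimator(nn.Module):
    def __init__(self, Fn=128, Fg=1024, NL=1024, U=512, V=256, n_joints=21):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(Fn + Fg, NL), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(NL, U), nn.ReLU(),
                                  nn.Linear(U, V), nn.ReLU(),
                                  nn.Linear(V, 3 * n_joints))
        self.n_joints = n_joints

    def forward(self, node_feat, global_feat):
        # node_feat: (M, Fn) Feature_{k,n}, global_feat: (Fg,) Feature_{k,g}
        M = node_feat.size(0)
        tiled = global_feat.unsqueeze(0).expand(M, -1)      # copy the global feature M times
        fused = torch.cat([node_feat, tiled], dim=1)        # (M, Fn + Fg) fused matrix
        overall = self.fuse(fused).mean(dim=0)              # average pool -> overall vector V_{k,f}
        return self.head(overall).view(self.n_joints, 3)    # Pose_k: N_joints x 3 coordinates
```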
Preferably, the step 4 of constructing the gesture estimation network loss function model according to whether the training sample is annotated is as follows:
if the training sample is annotated, the gesture estimation network loss function model for the annotated sample is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}) + D_joints(Xdata_k)
wherein D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between Xdata_k and X_{k,r}:
D_Chamfer(Xdata_k, X_{k,r}) = (1/|X_{k,r}|) Σ_{x'∈X_{k,r}} min_{x∈Xdata_k} ||x' - x||_2 + (1/|Xdata_k|) Σ_{x∈Xdata_k} min_{x'∈X_{k,r}} ||x - x'||_2
k∈[1,N_K], m∈[1,L]
wherein |X_{k,r}| denotes the number of points of the k-th group of reconstructed point clouds, Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the first term computes, for each point of the reconstructed point cloud X_{k,r}, the distance to its nearest point in the point cloud Xdata_k and sums these distances; similarly, the second term computes, for each point of the point cloud Xdata_k, the distance to its nearest point in the reconstructed point cloud X_{k,r} and sums these distances.
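A compact PyTorch version of this bidirectional Chamfer distance is sketched below. Whether the patent averages or sums the nearest-neighbour distances, and whether they are squared, is not visible in the image formulas, so the mean-of-Euclidean-distances form here is an assumption:

```python
import torch

def chamfer_distance(x, y):
    """x: (L, 3) converted cloud Xdata_k, y: (R, 3) reconstructed cloud X_{k,r}."""
    d = torch.cdist(x, y)                       # (L, R) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```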
D_joints is computed from the estimated gesture and the annotated gesture as:
D_joints = Σ_{num=1..N_joints} ||P_{k,num} - P*_{k,num}||_2
wherein N_joints represents the number of gesture joints, num∈[1,N_joints], Pose_k represents the gesture of the k-th group, P_{k,num} is the three-dimensional coordinate of the num-th joint of the gesture of the k-th group, Pose*_k represents the annotated gesture of the k-th group, and P*_{k,num} is the three-dimensional coordinate of the num-th joint of the annotated gesture of the k-th group;
if the training sample is not annotated, the gesture estimation network loss function model for the unannotated sample is:
LOSS = D_Chamfer(Xdata_k, X_{k,r})
wherein D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between the point cloud Xdata_k and the reconstructed point cloud X_{k,r};
Step 4, the trained gesture estimation network obtained through optimization training comprises three modules: the optimized and trained feature extractor, the optimized and trained point cloud reconstructor, and the optimized and trained gesture estimator.
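The switch between the annotated and unannotated loss can be expressed as a single helper. This sketch reuses the chamfer_distance function from above and assumes the joint term is a sum of per-joint Euclidean errors, which the patent shows only as an image formula:

```python
def pose_loss(x, x_rec, pose=None, pose_gt=None):
    loss = chamfer_distance(x, x_rec)                       # D_Chamfer(Xdata_k, X_{k,r})
    if pose_gt is not None:                                 # annotated sample: add D_joints
        loss = loss + (pose - pose_gt).norm(dim=1).sum()    # per-joint L2 errors over N_joints joints
    return loss
```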
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
the application range is wide. The self-encoder based on the point cloud can be integrated into various networks for recovering a three-dimensional structure from a depth image, and the spatial representation capability of the extracted features is greatly improved.
The generalization ability is strong. Compared with the method of directly using the depth image as network input, the method only saves space position coordinates in a point cloud representation form, and therefore the method can be widely applied to various types of three-dimensional data.
The efficiency is high. Compared with other three-dimensional representation forms, the point cloud three-dimensional representation method has the advantages that the point cloud is used as network input, the orderless three-dimensional representation method only comprises the space coordinates of each point, and the network calculation amount can be reduced.
The precision is high. The design provides that a three-dimensional point cloud reconstruction part fully utilizes intermediate representation in a network, extracts multi-level features through an encoder guided by a self-organizing map, and models the spatial distribution of the point cloud. The decoder reconstructs hand point cloud from the encoded global features, so that the features learned by the encoder contain more hand space information, thereby improving the hand posture estimation effect.
The dependency on the annotation data is low. The invention designs a semi-supervised training strategy applied to a gesture estimation task, which trains the whole network by using a small amount of labeled data and optimizes the network by fully utilizing unannotated data.
Therefore, the three-dimensional gesture estimation method provided by the invention has high recognition precision and reduces the labeling cost of training data.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a structure diagram of the neural network of the present invention;
FIG. 3 is a schematic diagram of the effect of adapting the initialized self-organizing map to the input hand point cloud in an embodiment of the present invention;
FIG. 4 is a diagram of the network structure for fusing node features and global features in the present invention;
FIG. 5 is a schematic diagram of the network structure that recovers the reconstructed point cloud from the global features in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Self-organizing maps are artificial neural networks that use unsupervised learning to produce a low-dimensional discretized representation of the input space of training samples, which differs from other artificial neural networks in that they use a proximity function to preserve the topological properties of the input space. In the present invention, the low dimensional representation M models the spatial distribution of the point cloud X with fewer points.
The invention provides a depth image gesture estimation method based on semi-supervised learning; its overall structure is shown in FIG. 2. The system comprises: a data conversion module; a feature extraction module based on a point cloud processing network; a point cloud feature decoding module for reconstructing the hand three-dimensional point cloud; and a gesture estimation module based on multi-level features.
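For orientation, the sketches given earlier (FeatureExtractor, PointCloudReconstructor and GestureEstimator, all illustrative rather than the patented implementation) can be wired together as one module; the data conversion module runs outside the network:

```python
import torch.nn as nn

class GestureEstimationNetwork(nn.Module):
    def __init__(self, extractor, reconstructor, estimator):
        super().__init__()
        self.extractor = extractor          # SOM-guided point cloud encoder
        self.reconstructor = reconstructor  # decodes the global feature into a point cloud
        self.estimator = estimator          # fuses node + global features into a pose

    def forward(self, points, som_nodes):
        node_feat, global_feat = self.extractor(points, som_nodes)
        x_rec = self.reconstructor(global_feat)       # used by the Chamfer term
        pose = self.estimator(node_feat, global_feat)
        return x_rec, pose
```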
The depth image gesture estimation method based on semi-supervised learning provided by the present invention is specifically described below with reference to fig. 1 to 5, and specifically includes the following steps:
step 1: converting the multiple depth images into multiple groups of three-dimensional point clouds through a camera coordinate system;
step 1, the number of depth images is N_K;
The k-th depth image is:
d_k(u,v), k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein d_k(u,v) represents the depth, in the camera coordinate system, of the pixel on the u-th row and v-th column of the k-th depth image, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
step 1, the camera coordinate system transformation is:
z_k(u,v) = d_k(u,v)
x_k(u,v) = (u - c_{k,1}) * z_k(u,v) / f_{k,1}
y_k(u,v) = (v - c_{k,2}) * z_k(u,v) / f_{k,2}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein c_{k,1} represents the first coordinate parameter of the k-th depth image, c_{k,2} the second coordinate parameter of the k-th depth image, f_{k,1} the first focal length parameter of the k-th depth image, f_{k,2} the second focal length parameter of the k-th depth image, N_K is the number of depth images, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
step 1, the three-dimensional point cloud is:
data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein data_k represents the k-th group of three-dimensional point clouds, (x_k(u,v), y_k(u,v), z_k(u,v)) represents the coordinate point on the u-th row and v-th column of the k-th group of three-dimensional point clouds, N_K is the number of groups of three-dimensional point clouds, M represents the number of rows of the k-th group of three-dimensional point clouds, N represents the number of columns of the k-th group of three-dimensional point clouds, and M*N is the number of coordinate points in the k-th group of three-dimensional point clouds;
step 2: randomly sampling each group of three-dimensional point clouds to obtain sampled three-dimensional point clouds, performing principal component analysis on the sampled three-dimensional point clouds to obtain a feature vector set of the group of three-dimensional point clouds, sequentially selecting a first feature vector, a second feature vector and a third feature vector from the feature vector set, further constructing the coordinate axis directions of the point cloud bounding box coordinate system, setting the coordinate average value of the sampled three-dimensional point clouds as the coordinate origin of the point cloud bounding box coordinate system, determining the value range of the point cloud bounding box coordinate system according to the coordinate range of the three-dimensional point clouds so as to determine the point cloud bounding box coordinate system, and converting the group of sampled three-dimensional point clouds into the point cloud bounding box coordinate system to obtain converted three-dimensional point clouds;
step 2, the three-dimensional point cloud is randomly sampled to obtain the sampled three-dimensional point cloud:
from data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}, k∈[1,N_K], u∈[1,M], v∈[1,N], L coordinate points are randomly selected out of the M*N coordinate points as the sampled three-dimensional point cloud, which is specifically defined as:
(x_{k,m}, y_{k,m}, z_{k,m}), k∈[1,N_K], m∈[1,L]
wherein (x_{k,m}, y_{k,m}, z_{k,m}) is the m-th coordinate point of the k-th group of sampled three-dimensional point clouds, L is the number of coordinate points in the k-th group of sampled three-dimensional point clouds, and N_K is the number of groups of sampled three-dimensional point clouds;
step 2, principal component analysis is performed on the sampled three-dimensional point cloud to obtain the feature vector set of the group of three-dimensional point clouds:
V_k = {v_{k,1}, v_{k,2}, ..., v_{k,p}}
wherein p is the number of principal components taken by the principal component analysis, and V_k is the feature vector set of the k-th group of three-dimensional point clouds data_k;
the first feature vector, the second feature vector and the third feature vector in step 2 are three-dimensional space vectors, denoted in sequence as v_{k,1}, v_{k,2} and v_{k,3};
the first feature vector, the second feature vector and the third feature vector are three mutually orthogonal vectors;
step 2, the coordinate axis directions of the point cloud bounding box coordinate system are further constructed as follows:
the first three direction vectors v_{k,1}, v_{k,2} and v_{k,3} are taken as the three coordinate axis directions of the point cloud bounding box coordinate system of this group of point clouds;
Step 2, the coordinate average value of the sampled three-dimensional point cloud is:
x̄_k = (1/L) Σ_{m=1..L} x_{k,m}
ȳ_k = (1/L) Σ_{m=1..L} y_{k,m}
z̄_k = (1/L) Σ_{m=1..L} z_{k,m}
and the coordinate mean of the L sampled points is taken as the coordinate origin of the point cloud bounding box coordinate system:
T_k = (x̄_k, ȳ_k, z̄_k)
Step 2, determining the value ranges of the point cloud bounding box coordinate system in three directions according to the coordinate range of the three-dimensional point cloud, wherein the value ranges are respectively as follows:
sxk=max((xk,1,xk,2,...,xk,L))-min(xk,1,xk,2,...,xk,L)
syk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
szk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
wherein max represents taking the maximum value, and min represents taking the minimum value;
step 2, the sampled three-dimensional point cloud is converted into the point cloud bounding box coordinate system to obtain the converted three-dimensional point cloud:
for the m-th point p_{k,m} = (x_{k,m}, y_{k,m}, z_{k,m}) of the k-th group of point clouds, its coordinates in the point cloud bounding box coordinate system after conversion are
Xdata_{k,m} = R_k (p_{k,m} - T_k) / s_k
wherein s_k = (sx_k, sy_k, sz_k) and the division by s_k is taken component-wise;
R_k is determined as follows:
the original coordinate axes are rotated in sequence about the z axis, the y axis and the x axis by the angles yaw_k, pitch_k and roll_k so that they align with the axes of the point cloud bounding box; the rotation matrix from the original coordinate system to the point cloud bounding box coordinate system is then defined as:
R_{k,z} = [[cos(yaw_k), -sin(yaw_k), 0], [sin(yaw_k), cos(yaw_k), 0], [0, 0, 1]]
R_{k,y} = [[cos(pitch_k), 0, sin(pitch_k)], [0, 1, 0], [-sin(pitch_k), 0, cos(pitch_k)]]
R_{k,x} = [[1, 0, 0], [0, cos(roll_k), -sin(roll_k)], [0, sin(roll_k), cos(roll_k)]]
R_k = R_{k,z} R_{k,y} R_{k,x}
the translation matrix is T_k = (x̄_k, ȳ_k, z̄_k), i.e. the coordinate origin of the point cloud bounding box coordinate system;
The converted three-dimensional point cloud in step 2 is:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
step 3: constructing a gesture estimation network through a feature extractor, a point cloud reconstructor and a gesture estimator;
Step 3, the feature extractor is connected in sequence to the point cloud reconstructor and to the gesture estimator respectively;
the input data of the feature extractor is the converted three-dimensional point cloud of step 2, namely:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the feature extraction method of the feature extractor in step 3 is specifically as follows:
step 3.1, constructing a self-organizing map from the converted point cloud;
for the converted point cloud Xdata_k, a self-organizing map point cloud M_k is constructed;
M_k imitates the spatial distribution of Xdata_k with fewer points; the effect is shown in FIG. 3: its three panels show how a randomly initialized self-organizing map is adapted to the space of the point cloud Xdata_k, yielding the self-organizing map point cloud M_k;
the self-organizing map is an artificial neural network that uses unsupervised learning to produce a low-dimensional discretized representation of the input space of the training samples;
the self-organizing map point cloud M_k preserves the topological properties of Xdata_k through a proximity function;
the self-organizing map point cloud is expressed as:
M_k = {(x_{k,m}, z_{k,m}, y_{k,m})}, m∈[1,M]
wherein M is the number of points in the self-organizing map point cloud, and M is smaller than the number L of points in the point cloud Xdata_k; the relationship between the self-organizing map point cloud M_k and the converted point cloud Xdata_k is shown in FIG. 2.
Step 3.2, extracting node features from the converted point cloud;
for each point Xdata_{k,m} of the converted point cloud Xdata_k, a k-nearest-neighbour search is performed in the self-organizing map point cloud M_k;
each point Xdata_{k,m} of the point cloud Xdata_k is thus associated with its k nearest nodes, giving k×L point-node pairs, and the coordinates of these k×L points are input to the fully connected layer network module;
the input size of the fully connected layer network module is k×L×3 and its output size is k×L×F_n;
then, according to the correspondence between the M nodes found by the k-nearest-neighbour search and the k×L points, a max-pooling operation is applied to the k×L×F_n output of the fully connected module over the points belonging to each node, reducing it to the M dimension and yielding the node feature Feature_{k,n}, a matrix of size M×F_n;
step 3.3, obtaining the global feature from the node features;
the node feature Feature_{k,n} is input to a fully connected layer network module whose input size is M×F_n and whose output size is M×F_g, and a max-pooling operation is then taken over the M dimension to obtain the global feature Feature_{k,g}, a vector of size 1×F_g;
in summary, the feature extractor takes the converted point cloud Xdata_k as input and outputs the node feature Feature_{k,n} and the global feature Feature_{k,g};
the global feature is defined as: Feature_{k,g} is a vector of size 1×F_g, where F_g is a constant and N_L denotes the length of the overall feature vector, e.g. N_L = 1024;
the node feature is defined as: Feature_{k,n} is a matrix of size M×F_n, where F_n is a constant and M is the defined number of nodes.
The point cloud reconstructor reconstructs the k-th group of reconstructed three-dimensional point clouds X_{k,r} from the global feature Feature_{k,g};
step 3, the point cloud reconstructor is specifically defined as follows:
the global feature vector Feature_{k,g} is used to reconstruct the three-dimensional point cloud, and the reconstructed three-dimensional point cloud X_{k,r} is output;
the point cloud reconstructor adopts a reconstruction network structure with two branches; the fully connected branch predicts the position of each point independently, is good at describing complex structures, and consists of 4 fully connected layers, as shown in FIG. 5;
the convolutional branch consists of 5 convolutions, each convolutional layer being followed by a deconvolution layer, which makes full use of spatial continuity;
the output point cloud module consists of two 1×1 convolution layers, and the prediction results of the two branches are merged to obtain the complete reconstructed point cloud X_{k,r};
The global feature Feature_{k,g} and the node feature Feature_{k,n} are passed through the gesture estimator to estimate the three-dimensional gesture;
step 3, the gesture estimator is specifically defined as follows:
the global feature Feature_{k,g} and the local feature Feature_{k,n} obtained in the feature extraction stage are fused: the global feature vector of size 1×F_g is copied M times along the first dimension, raising it to a matrix of size M×F_g with the same first dimension as the local feature matrix of size M×F_n, and the two matrices are spliced (concatenated) along the second dimension to form the fused feature matrix Feature_{k,f} of size M×(F_n+F_g); this process is illustrated in FIG. 4.
A new feature matrix is then obtained through a fully connected layer network module whose output width is defined as N_L = 1024, so that the new feature matrix has size M×N_L, the input size of the fully connected layer being M×(F_n+F_g) and its output size M×N_L;
the M×N_L feature matrix output by the fully connected layer is then average-pooled over the M dimension to obtain the overall feature vector V_{k,f} of length N_L = 1024;
the overall feature vector V_{k,f}, obtained by fusing the global feature Feature_{k,g} and the node feature Feature_{k,n}, is used to estimate the three-dimensional gesture Pose_k;
the method for estimating the three-dimensional gesture is:
the three-dimensional coordinates of the hand key points are regressed from the overall feature vector by several fully connected layer network modules;
the overall feature vector of length N_L = 1024 passes through fully connected layer modules whose dimensions are (N_L, U, V, 3×N_joints), producing a 3×N_joints output that is reshaped into the three-dimensional coordinates Pose_k of the N_joints joint points, where N_L = 1024 is the length of the overall feature vector input to the fully connected layers, and U = 512 and V = 256 are respectively the first and second hidden-layer dimensions of the fully connected layers;
the three-dimensional gesture is defined as:
Pose_k = {P_{k,num}}, num∈[1,N_joints]
wherein Pose_k represents the gesture of the k-th group, P_{k,num} is the three-dimensional coordinate of the num-th joint of the k-th group, and N_joints represents the number of gesture joints; in this patent N_joints = 21 is taken;
Step 4: taking the converted three-dimensional point clouds Xdata_k as training samples, constructing a gesture estimation network loss function model according to whether each training sample is annotated or not, and obtaining a trained gesture estimation network through optimization training, as shown in FIG. 2.
Step 4, the gesture estimation network loss function model is constructed according to whether the training sample is annotated, as follows:
if the training sample is annotated, the gesture estimation network loss function model for the annotated sample is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}) + D_joints(Xdata_k)
wherein D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between Xdata_k and X_{k,r}:
D_Chamfer(Xdata_k, X_{k,r}) = (1/|X_{k,r}|) Σ_{x'∈X_{k,r}} min_{x∈Xdata_k} ||x' - x||_2 + (1/|Xdata_k|) Σ_{x∈Xdata_k} min_{x'∈X_{k,r}} ||x - x'||_2
k∈[1,N_K], m∈[1,L]
wherein |X_{k,r}| denotes the number of points of the k-th group of reconstructed point clouds, Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the first term computes, for each point of the reconstructed point cloud X_{k,r}, the distance to its nearest point in the point cloud Xdata_k and sums these distances; similarly, the second term computes, for each point of the point cloud Xdata_k, the distance to its nearest point in the reconstructed point cloud X_{k,r} and sums these distances.
D_joints is computed from the estimated gesture and the annotated gesture as:
D_joints = Σ_{num=1..N_joints} ||P_{k,num} - P*_{k,num}||_2
wherein N_joints represents the number of gesture joints, num∈[1,N_joints], Pose_k represents the gesture of the k-th group, P_{k,num} is the three-dimensional coordinate of the num-th joint of the gesture of the k-th group, Pose*_k represents the annotated gesture of the k-th group, and P*_{k,num} is the three-dimensional coordinate of the num-th joint of the annotated gesture of the k-th group;
if the training sample is not annotated, the gesture estimation network loss function model for the unannotated sample is:
LOSS = D_Chamfer(Xdata_k, X_{k,r})
wherein D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between the point cloud Xdata_k and the reconstructed point cloud X_{k,r};
Step 4, the trained gesture estimation network obtained through optimization training comprises three modules: the optimized and trained feature extractor, the optimized and trained point cloud reconstructor, and the optimized and trained gesture estimator.
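A minimal training-loop sketch for the semi-supervised strategy is given below, reusing the GestureEstimationNetwork and pose_loss helpers sketched earlier. Batch composition, the data loaders and the optimiser schedule are assumptions; the patent only specifies that annotated samples use both loss terms and unannotated samples use the Chamfer term alone:

```python
import torch

def train(network, optimizer, labelled_loader, unlabelled_loader, epochs=10):
    for _ in range(epochs):
        for (x, som, pose_gt), (xu, som_u) in zip(labelled_loader, unlabelled_loader):
            optimizer.zero_grad()
            x_rec, pose = network(x, som)
            loss = pose_loss(x, x_rec, pose, pose_gt)   # annotated: Chamfer + joint term
            xu_rec, _ = network(xu, som_u)
            loss = loss + pose_loss(xu, xu_rec)         # unannotated: Chamfer term only
            loss.backward()
            optimizer.step()
```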
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A depth image gesture estimation method based on semi-supervised learning is characterized by comprising the following steps:
step 1: converting the multiple depth images into multiple groups of three-dimensional point clouds through a camera coordinate system;
step 2: randomly sampling each group of three-dimensional point clouds to obtain sampled three-dimensional point clouds, performing principal component analysis on the sampled three-dimensional point clouds to obtain a feature vector set of the group of three-dimensional point clouds, sequentially selecting a first feature vector, a second feature vector and a third feature vector from the feature vector set, further constructing the coordinate axis directions of the point cloud bounding box coordinate system, setting the coordinate average value of the sampled three-dimensional point clouds as the coordinate origin of the point cloud bounding box coordinate system, determining the value range of the point cloud bounding box coordinate system according to the coordinate range of the three-dimensional point clouds so as to determine the point cloud bounding box coordinate system, and converting the group of sampled three-dimensional point clouds into the point cloud bounding box coordinate system to obtain converted three-dimensional point clouds;
step 3: constructing a gesture estimation network through a feature extractor, a point cloud reconstructor and a gesture estimator;
step 4: taking the converted three-dimensional point clouds Xdata_k as training samples, constructing a gesture estimation network loss function model according to whether each training sample is annotated or not, and obtaining a trained gesture estimation network through optimization training.
2. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: the number of depth images in step 1 is N_K;
the k-th depth image is:
d_k(u,v), k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein d_k(u,v) represents the depth, in the camera coordinate system, of the pixel on the u-th row and v-th column of the k-th depth image, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
step 1, the camera coordinate system transformation is:
z_k(u,v) = d_k(u,v)
x_k(u,v) = (u - c_{k,1}) * z_k(u,v) / f_{k,1}
y_k(u,v) = (v - c_{k,2}) * z_k(u,v) / f_{k,2}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein c_{k,1} represents the first coordinate parameter of the k-th depth image, c_{k,2} the second coordinate parameter of the k-th depth image, f_{k,1} the first focal length parameter of the k-th depth image, f_{k,2} the second focal length parameter of the k-th depth image, N_K is the number of depth images, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
step 1, the three-dimensional point cloud is:
data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein data_k represents the k-th group of three-dimensional point clouds, (x_k(u,v), y_k(u,v), z_k(u,v)) represents the coordinate point on the u-th row and v-th column of the k-th group of three-dimensional point clouds, N_K is the number of groups of three-dimensional point clouds, M represents the number of rows of the k-th group of three-dimensional point clouds, N represents the number of columns of the k-th group of three-dimensional point clouds, and M*N is the number of coordinate points in the k-th group of three-dimensional point clouds.
3. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: in step 2, the three-dimensional point cloud is randomly sampled to obtain the sampled three-dimensional point cloud:
from data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}, k∈[1,N_K], u∈[1,M], v∈[1,N], L coordinate points are randomly selected out of the M*N coordinate points as the sampled three-dimensional point cloud, which is specifically defined as:
(x_{k,m}, y_{k,m}, z_{k,m}), k∈[1,N_K], m∈[1,L]
wherein (x_{k,m}, y_{k,m}, z_{k,m}) is the m-th coordinate point of the k-th group of sampled three-dimensional point clouds, L is the number of coordinate points in the k-th group of sampled three-dimensional point clouds, and N_K is the number of groups of sampled three-dimensional point clouds;
step 2, principal component analysis is performed on the sampled three-dimensional point cloud to obtain the feature vector set of the group of three-dimensional point clouds:
V_k = {v_{k,1}, v_{k,2}, ..., v_{k,p}}
wherein p is the number of principal components taken by the principal component analysis, and V_k is the feature vector set of the k-th group of three-dimensional point clouds data_k;
the first feature vector, the second feature vector and the third feature vector in step 2 are three-dimensional space vectors, denoted in sequence as v_{k,1}, v_{k,2} and v_{k,3};
the first feature vector, the second feature vector and the third feature vector are three mutually orthogonal vectors;
step 2, the coordinate axis directions of the point cloud bounding box coordinate system are further constructed as follows:
the first three direction vectors v_{k,1}, v_{k,2} and v_{k,3} are taken as the three coordinate axis directions of the point cloud bounding box coordinate system of this group of point clouds;
Step 2, the coordinate average value of the sampled three-dimensional point cloud is:
x̄_k = (1/L) Σ_{m=1..L} x_{k,m}
ȳ_k = (1/L) Σ_{m=1..L} y_{k,m}
z̄_k = (1/L) Σ_{m=1..L} z_{k,m}
and the coordinate mean of the L sampled points is taken as the coordinate origin of the point cloud bounding box coordinate system:
T_k = (x̄_k, ȳ_k, z̄_k)
Step 2, determining the value ranges of the point cloud bounding box coordinate system in three directions according to the coordinate range of the three-dimensional point cloud, wherein the value ranges are respectively as follows:
sxk=max((xk,1,xk,2,...,xk,L))-min(xk,1,xk,2,...,xk,L)
syk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
szk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
wherein max represents taking the maximum value, and min represents taking the minimum value;
step 2, the sampled three-dimensional point cloud is converted into the point cloud bounding box coordinate system to obtain the converted three-dimensional point cloud:
for the m-th point p_{k,m} = (x_{k,m}, y_{k,m}, z_{k,m}) of the k-th group of point clouds, its coordinates in the point cloud bounding box coordinate system after conversion are
Xdata_{k,m} = R_k (p_{k,m} - T_k) / s_k
wherein s_k = (sx_k, sy_k, sz_k) and the division by s_k is taken component-wise;
R_k is determined as follows:
the original coordinate axes are rotated in sequence about the z axis, the y axis and the x axis by the angles yaw_k, pitch_k and roll_k so that they align with the axes of the point cloud bounding box; the rotation matrix from the original coordinate system to the point cloud bounding box coordinate system is then defined as:
R_{k,z} = [[cos(yaw_k), -sin(yaw_k), 0], [sin(yaw_k), cos(yaw_k), 0], [0, 0, 1]]
R_{k,y} = [[cos(pitch_k), 0, sin(pitch_k)], [0, 1, 0], [-sin(pitch_k), 0, cos(pitch_k)]]
R_{k,x} = [[1, 0, 0], [0, cos(roll_k), -sin(roll_k)], [0, sin(roll_k), cos(roll_k)]]
R_k = R_{k,z} R_{k,y} R_{k,x}
the translation matrix is T_k = (x̄_k, ȳ_k, z̄_k), i.e. the coordinate origin of the point cloud bounding box coordinate system;
the converted three-dimensional point cloud in step 2 is:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds.
4. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: in step 3, the feature extractor is connected in sequence to the point cloud reconstructor and to the gesture estimator respectively;
the input data of the feature extractor is the converted three-dimensional point cloud of step 2, namely:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the feature extraction method of the feature extractor in step 3 is specifically as follows:
step 3.1, constructing a self-organizing map from the converted point cloud;
for the converted point cloud Xdata_k, a self-organizing map point cloud M_k is constructed;
M_k imitates the spatial distribution of Xdata_k with fewer points;
the self-organizing map is an artificial neural network that uses unsupervised learning to produce a low-dimensional discretized representation of the input space of the training samples;
the self-organizing map point cloud M_k preserves the topological properties of Xdata_k through a proximity function;
the self-organizing map point cloud is expressed as:
M_k = {(x_{k,m}, z_{k,m}, y_{k,m})}, m∈[1,M]
wherein M is the number of points in the self-organizing map point cloud, and M is smaller than the number L of points in the point cloud Xdata_k;
step 3.2, extracting node features from the converted point cloud;
for each point Xdata_{k,m} of the converted point cloud Xdata_k, a k-nearest-neighbour search is performed in the self-organizing map point cloud M_k;
each point Xdata_{k,m} of the point cloud Xdata_k is thus associated with its k nearest nodes, giving k×L point-node pairs, and the coordinates of these k×L points are input to the fully connected layer network module;
the input size of the fully connected layer network module is k×L×3 and its output size is k×L×F_n;
then, according to the correspondence between the M nodes found by the k-nearest-neighbour search and the k×L points, a max-pooling operation is applied to the k×L×F_n output of the fully connected module over the points belonging to each node, reducing it to the M dimension and yielding the node feature Feature_{k,n}, a matrix of size M×F_n;
step 3.3, obtaining the global feature from the node features;
the node feature Feature_{k,n} is input to a fully connected layer network module whose input size is M×F_n and whose output size is M×F_g, and a max-pooling operation is then taken over the M dimension to obtain the global feature Feature_{k,g}, a vector of size 1×F_g;
in summary, the feature extractor takes the converted point cloud Xdata_k as input and outputs the node feature Feature_{k,n} and the global feature Feature_{k,g};
the global feature is defined as: Feature_{k,g} is a vector of size 1×F_g, where F_g is a constant and N_L denotes the length of the overall feature vector;
the node feature is defined as: Feature_{k,n} is a matrix of size M×F_n, where F_n is a constant and M is the defined number of nodes;
the point cloud reconstructor derives the global Feature from the point cloud reconstructork,gReconstructing to obtain the kth group of reconstructed three-dimensional point cloud Xk,r
X_{k,r} = {(x_{k,i}^r, y_{k,i}^r, z_{k,i}^r), i ∈ [1, |X_{k,r}|]}
in step 3, the point cloud reconstructor is specifically defined as follows:
the global feature vector Feature_{k,g} is used to reconstruct the three-dimensional point cloud, outputting the reconstructed three-dimensional point cloud X_{k,r};
the point cloud reconstructor adopts a reconstruction network comprising two branches: the fully connected branch, consisting of 4 fully connected layers, predicts the position of each point independently and describes complex structures well;
the convolutional branch consists of 5 convolutions, each convolutional layer followed by a deconvolution layer, which makes full use of spatial continuity;
the output point cloud module is composed of two 1 × 1 convolutional layers, and the predictions of the two branches are merged to obtain the complete reconstructed point cloud X_{k,r};
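A hedged PyTorch sketch of such a two-branch reconstructor follows; the layer widths, strides and point counts are assumptions, and the claimed alternation of convolution and deconvolution layers is simplified here to a stack of transposed convolutions.

```python
# Two-branch reconstructor sketch: an FC branch regresses one subset of points
# directly, an upsampling branch produces another subset, and the two are
# concatenated into the reconstructed cloud X_{k,r}.
import torch
import torch.nn as nn

class PointCloudReconstructor(nn.Module):
    def __init__(self, Fg=256, n_fc_points=256):
        super().__init__()
        # FC branch: 4 fully connected layers, each output point predicted independently.
        self.fc_branch = nn.Sequential(
            nn.Linear(Fg, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_fc_points * 3),
        )
        # Upsampling branch: transposed 1-D convolutions grow the feature map,
        # exploiting spatial continuity (simplified stand-in for conv+deconv pairs).
        self.conv_branch = nn.Sequential(
            nn.ConvTranspose1d(Fg, 256, kernel_size=4, stride=4), nn.ReLU(),   # 1 -> 4
            nn.ConvTranspose1d(256, 128, kernel_size=4, stride=4), nn.ReLU(),  # 4 -> 16
            nn.ConvTranspose1d(128, 64, kernel_size=4, stride=4), nn.ReLU(),   # 16 -> 64
            nn.ConvTranspose1d(64, 64, kernel_size=4, stride=4), nn.ReLU(),    # 64 -> 256
            nn.ConvTranspose1d(64, 32, kernel_size=3, stride=3), nn.ReLU(),    # 256 -> 768
        )
        # Output module: two 1x1 convolutions map features to xyz coordinates.
        self.out_conv = nn.Sequential(
            nn.Conv1d(32, 32, kernel_size=1), nn.ReLU(),
            nn.Conv1d(32, 3, kernel_size=1),
        )
        self.n_fc_points = n_fc_points

    def forward(self, feat_g):                           # feat_g: (B, Fg)
        pts_fc = self.fc_branch(feat_g).view(-1, self.n_fc_points, 3)
        x = self.conv_branch(feat_g.unsqueeze(-1))       # (B, 32, 768)
        pts_conv = self.out_conv(x).transpose(1, 2)      # (B, 768, 3)
        return torch.cat([pts_fc, pts_conv], dim=1)      # (B, 1024, 3) reconstructed cloud

# Usage: X_kr = PointCloudReconstructor()(torch.randn(2, 256))
```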
the global feature Feature_{k,g} and the node feature Feature_{k,n} are passed through the gesture estimator to estimate the three-dimensional gesture;
in step 3, the gesture estimator is specifically defined as follows:
the global feature Feature_{k,g} and the local (node) feature Feature_{k,n} obtained in the feature extraction stage are fused: the global feature vector of size 1 × F_g is copied M times along the first dimension, raising it to an M × F_g matrix with the same first dimension as the M × F_n local feature matrix; the two matrices are then spliced, i.e. concatenated, along the second dimension to form the fused feature matrix Feature_{k,f} of size M × (F_n + F_g);
a new feature matrix is obtained through a fully connected layer network module whose output length is defined as N_L; the fully connected layer has input size M × (F_n + F_g) and output size M × N_L, so the new feature matrix has size M × N_L;
the M × N_L feature matrix output by the fully connected layer is then average-pooled over the M dimension to obtain the overall feature vector V_{k,f} of length N_L;
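A minimal sketch of this fusion step (tile, concatenate, shared fully connected layer, average pool); the random projection stands in for the trained module and `fuse_features` is a hypothetical name.

```python
import numpy as np

def fuse_features(node_feat, global_feat, NL=512, seed=0):
    """node_feat: (M, Fn); global_feat: (Fg,) -> (NL,) overall feature V_{k,f}."""
    rng = np.random.default_rng(seed)
    M = node_feat.shape[0]
    tiled = np.tile(global_feat, (M, 1))                  # (M, Fg) copied global feature
    fused = np.concatenate([node_feat, tiled], axis=1)    # (M, Fn + Fg) Feature_{k,f}
    W = rng.standard_normal((fused.shape[1], NL))         # stand-in for the trained FC module
    per_node = np.maximum(fused @ W, 0.0)                 # (M, NL)
    return per_node.mean(axis=0)                          # average pool over M -> (NL,)
```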
the three-dimensional gesture Pose_k is estimated from the overall feature vector V_{k,f} obtained by fusing the global feature Feature_{k,g} and the node feature Feature_{k,n};
the method for estimating the three-dimensional gesture comprises the following steps:
the three-dimensional coordinates of the hand key points are regressed from the overall feature vector using a plurality of fully connected layer network modules;
the overall feature vector of length N_L passes through fully connected layer network modules whose dimensions are (N_L, U, V, 3 × N_joints), yielding a 3 × N_joints output that is reshaped into the three-dimensional coordinates Pose_k of the N_joints joint points, where N_L is the input length of the fully connected layers and U, V are the sizes of the first and second hidden layers (a sketch of this regression head follows the gesture definition below);
the three-dimensional gesture is defined as:
Pose_k = {pose_k^num, num ∈ [1, N_joints]}
wherein Pose_k represents the gesture of the k-th group, pose_k^num = (x_k^num, y_k^num, z_k^num) is the three-dimensional coordinate of the num-th joint of the k-th group, and N_joints represents the number of gesture joints.
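A hedged PyTorch sketch of the regression head described above; the hidden sizes U, V and the joint count N_joints = 21 are assumptions.

```python
# MLP regression head: overall feature vector V_{k,f} -> N_joints 3-D joints.
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    def __init__(self, NL=512, U=256, V=128, n_joints=21):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(NL, U), nn.ReLU(),
            nn.Linear(U, V), nn.ReLU(),
            nn.Linear(V, 3 * n_joints),
        )
        self.n_joints = n_joints

    def forward(self, v_kf):                       # v_kf: (B, NL)
        out = self.mlp(v_kf)                       # (B, 3 * N_joints)
        return out.view(-1, self.n_joints, 3)      # Pose_k: (B, N_joints, 3)

# Usage: PoseRegressor()(torch.randn(4, 512)).shape -> torch.Size([4, 21, 3])
```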
5. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: in step 4, the gesture estimation network loss function model is constructed according to whether the training samples are annotated, as follows:
if the training sample is annotated, the loss function model for annotated samples is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}) + D_joints(Xdata_k)
wherein D_Chamfer(Xdata_k, X_{k,r}) represents the Chamfer distance between Xdata_k and X_{k,r}:
D_Chamfer(Xdata_k, X_{k,r}) = Σ_{p ∈ X_{k,r}} min_{q ∈ Xdata_k} ||p - q||_2 + Σ_{q ∈ Xdata_k} min_{p ∈ X_{k,r}} ||q - p||_2
wherein |X_{k,r}| represents the number of points in the k-th group of reconstructed point cloud, Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the term Σ_{p ∈ X_{k,r}} min_{q ∈ Xdata_k} ||p - q||_2 means that, for each point of the k-th group of reconstructed point cloud X_{k,r}, the distance to the nearest point in the point cloud Xdata_k is computed, and these distances are summed;
similarly, the term Σ_{q ∈ Xdata_k} min_{p ∈ X_{k,r}} ||q - p||_2 means that, for each point of the point cloud Xdata_k, the distance to the nearest point in the reconstructed point cloud X_{k,r} is computed, and these distances are summed;
D_joints(Xdata_k) = Σ_{num=1}^{N_joints} ||pose_k^num - pose_k^{gt,num}||_2
wherein N_joints represents the number of gesture joints, num ∈ [1, N_joints], Pose_k represents the gesture of the k-th group, pose_k^num is the three-dimensional coordinate of the num-th joint of the k-th group gesture, Pose_k^gt represents the annotated gesture of the k-th group, and pose_k^{gt,num} is the three-dimensional coordinate of the num-th joint of the annotated gesture of the k-th group;
if the training sample is not annotated, the loss function model for unannotated samples is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}), wherein D_Chamfer(Xdata_k, X_{k,r}) represents the Chamfer distance between the point cloud Xdata_k and the reconstructed point cloud X_{k,r};
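A hedged Python sketch of the two loss branches; the exact form of D_joints (here a sum of per-joint Euclidean distances) is an assumption, and the Chamfer term follows the unnormalized definition above.

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (|A|, 3) and B (|B|, 3)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # pairwise distances
    return d.min(axis=1).sum() + d.min(axis=0).sum()

def joint_loss(pred_pose, gt_pose):
    """Sum of Euclidean distances over the N_joints predicted vs annotated joints."""
    return np.linalg.norm(pred_pose - gt_pose, axis=1).sum()

def semi_supervised_loss(Xdata_k, X_kr, pred_pose=None, gt_pose=None):
    loss = chamfer_distance(Xdata_k, X_kr)
    if gt_pose is not None:                  # annotated sample: add the joint term
        loss += joint_loss(pred_pose, gt_pose)
    return loss                              # unannotated sample: Chamfer term only
```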
in step 4, the trained gesture estimation network is obtained through optimization training; it comprises three modules, namely the optimized, trained feature extractor, the optimized, trained point cloud reconstructor, and the optimized, trained gesture estimator.
CN202010503293.0A 2020-06-05 2020-06-05 Depth image gesture estimation method based on semi-supervised learning Active CN111797692B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010503293.0A CN111797692B (en) 2020-06-05 2020-06-05 Depth image gesture estimation method based on semi-supervised learning

Publications (2)

Publication Number Publication Date
CN111797692A true CN111797692A (en) 2020-10-20
CN111797692B CN111797692B (en) 2022-05-17

Family

ID=72802891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010503293.0A Active CN111797692B (en) 2020-06-05 2020-06-05 Depth image gesture estimation method based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN111797692B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120119984A1 (en) * 2010-11-15 2012-05-17 Yogesh Sankarasubramaniam Hand pose recognition
CN103778407A (en) * 2012-10-23 2014-05-07 南开大学 Gesture recognition algorithm based on conditional random fields under transfer learning framework
US20190182415A1 (en) * 2015-04-27 2019-06-13 Snap-Aid Patents Ltd. Estimating and using relative head pose and camera field-of-view
CN105955473A (en) * 2016-04-27 2016-09-21 周凯 Computer-based static gesture image recognition interactive system
CN110222580A (en) * 2019-05-09 2019-09-10 中国科学院软件研究所 A kind of manpower 3 d pose estimation method and device based on three-dimensional point cloud
CN110222626A (en) * 2019-06-03 2019-09-10 宁波智能装备研究院有限公司 A kind of unmanned scene point cloud target mask method based on deep learning algorithm
CN110866969A (en) * 2019-10-18 2020-03-06 西北工业大学 Engine blade reconstruction method based on neural network and point cloud registration
CN111150175A (en) * 2019-12-05 2020-05-15 新拓三维技术(深圳)有限公司 Method, device and system for three-dimensional scanning of feet

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIUHAO G. et al.: "Hand PointNet: 3D Hand Pose Estimation Using Point Sets", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
YUJIN C. et al.: "SO-HandNet: Self-Organizing Network for 3D Hand Pose Estimation with Semi-supervised Learning", 2019 IEEE/CVF International Conference on Computer Vision (ICCV) *
ZHANG Hongyuan et al.: "Hand Gesture Pose Estimation Based on Pseudo-3D Convolutional Neural Network", Application Research of Computers *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112635064A (en) * 2020-12-31 2021-04-09 山西三友和智慧信息技术股份有限公司 Early diabetes risk prediction method based on deep PCA (principal component analysis) transformation
CN112635064B (en) * 2020-12-31 2022-08-09 山西三友和智慧信息技术股份有限公司 Early diabetes risk prediction method based on deep PCA (principal component analysis) transformation
CN113129370A (en) * 2021-03-04 2021-07-16 同济大学 Semi-supervised object pose estimation method combining generated data and label-free data
CN113129370B (en) * 2021-03-04 2022-08-19 同济大学 Semi-supervised object pose estimation method combining generated data and label-free data
CN113112607A (en) * 2021-04-19 2021-07-13 复旦大学 Method and device for generating three-dimensional grid model sequence with any frame rate
CN113239834A (en) * 2021-05-20 2021-08-10 中国科学技术大学 Sign language recognition system capable of pre-training sign model perception representation
CN113239834B (en) * 2021-05-20 2022-07-15 中国科学技术大学 Sign language recognition system capable of pre-training sign model perception representation

Also Published As

Publication number Publication date
CN111797692B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN111797692B (en) Depth image gesture estimation method based on semi-supervised learning
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
Chen et al. Alignsdf: Pose-aligned signed distance fields for hand-object reconstruction
CN114863573B (en) Category-level 6D attitude estimation method based on monocular RGB-D image
CN109815847B (en) Visual SLAM method based on semantic constraint
CN111161364B (en) Real-time shape completion and attitude estimation method for single-view depth map
CN114359509B (en) Multi-view natural scene reconstruction method based on deep learning
Tu et al. Consistent 3d hand reconstruction in video via self-supervised learning
CN110223382B (en) Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning
CN117218343A (en) Semantic component attitude estimation method based on deep learning
CN112232106A (en) Two-dimensional to three-dimensional human body posture estimation method
Wang et al. Adversarial learning for joint optimization of depth and ego-motion
CN111860651A (en) Monocular vision-based semi-dense map construction method for mobile robot
CN112183675A (en) Twin network-based tracking method for low-resolution target
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
Zhang et al. EANet: Edge-attention 6D pose estimation network for texture-less objects
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN117315169A (en) Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching
Wang et al. Unsupervised monocular depth estimation with channel and spatial attention
Zou et al. Gpt-cope: A graph-guided point transformer for category-level object pose estimation
CN117788544A (en) Image depth estimation method based on lightweight attention mechanism
Lin et al. Transpose: 6d object pose estimation with geometry-aware transformer
CN117351078A (en) Target size and 6D gesture estimation method based on shape priori
Zhang et al. Hvdistill: Transferring knowledge from images to point clouds via unsupervised hybrid-view distillation
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant