CN111797692A - Depth image gesture estimation method based on semi-supervised learning - Google Patents

Depth image gesture estimation method based on semi-supervised learning

Info

Publication number
CN111797692A
CN111797692A (application CN202010503293.0A)
Authority
CN
China
Prior art keywords
point cloud
dimensional
feature
dimensional point
gesture
Prior art date
Legal status
Granted
Application number
CN202010503293.0A
Other languages
Chinese (zh)
Other versions
CN111797692B (en)
Inventor
涂志刚
陈雨劲
张宇昊
刘军
Current Assignee
Wuhan University WHU
Shenzhen Infinova Ltd
Original Assignee
Wuhan University WHU
Shenzhen Infinova Ltd
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU, Shenzhen Infinova Ltd filed Critical Wuhan University WHU
Priority to CN202010503293.0A priority Critical patent/CN111797692B/en
Publication of CN111797692A publication Critical patent/CN111797692A/en
Application granted granted Critical
Publication of CN111797692B publication Critical patent/CN111797692B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/60Rotation of whole images or parts thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a depth image gesture estimation method based on semi-supervised learning. Compared with RGB images, depth images allow the hand gesture to be estimated with higher precision. Existing gesture estimation methods based on deep learning perform well, but they rely too heavily on labelled training data, and annotating three-dimensional gestures in images is very laborious. The invention provides an efficient point cloud representation that effectively fuses local and global features, and realizes a new method for estimating three-dimensional hand poses from depth images with high precision. By reducing the dependence on annotated data during model training, the method lowers the cost of data annotation. Compared with existing semi-supervised learning methods, the invention achieves a breakthrough in hand pose estimation accuracy while maintaining runtime efficiency.

Description

Depth image gesture estimation method based on semi-supervised learning
Technical Field
The invention belongs to the technical field of digital image recognition, and particularly relates to a depth image gesture estimation method based on semi-supervised learning.
Background
Automated real-time three-dimensional gesture estimation has attracted considerable attention in recent years, with a wide range of application scenarios such as human-machine interaction, computer graphics, and virtual/augmented reality. After years of intensive research, three-dimensional gesture estimation has made remarkable progress in both accuracy and efficiency. Since convolutional neural networks perform well on images, most gesture estimation methods are based on them. Some methods use two-dimensional convolutions to process depth images, but because they lack an explicit representation of three-dimensional spatial information, the features extracted by two-dimensional convolutional networks are not well suited to direct three-dimensional pose estimation. To better capture the geometric structure of depth data, other methods convert the depth image into a three-dimensional voxel representation and then apply three-dimensional convolutions to obtain the hand pose, but three-dimensional convolution has considerable memory and computational requirements. While these approaches have made significant advances in estimation accuracy, they usually rely heavily on large amounts of annotated data for network training and rarely consider how to reduce the annotation cost.
Disclosure of Invention
In view of the above-identified deficiencies in the art or needs for improvement, the present invention provides a gesture estimation method based on semi-supervised learning. For the representation of three-dimensional data, the invention adopts the point cloud as the representation of hand depth image data, in order to retain the three-dimensional characteristics of the depth image while ensuring computational efficiency. To address the problem that training a gesture estimation model requires a large amount of labelled pose information while annotating three-dimensional gestures is costly, the invention designs a semi-supervised deep network architecture that trains the whole network with unlabelled data and a small amount of labelled data, so as to improve the precision of three-dimensional gesture estimation and reduce the annotation cost of training data.
In order to achieve the above object, the present invention provides a depth image gesture estimation method based on semi-supervised learning, comprising the following steps:
step 1: converting the multiple depth images into multiple groups of three-dimensional point clouds through a camera coordinate system;
step 2: randomly sampling each group of three-dimensional point clouds to obtain sampled three-dimensional point clouds, performing principal component analysis on the sampled three-dimensional point clouds to obtain a feature vector set of the group of three-dimensional point clouds, sequentially selecting a first feature vector, a second feature vector and a third feature vector from the feature vector set, further constructing the coordinate axis directions of the point cloud bounding box coordinate system, setting the coordinate average value of the sampled three-dimensional point clouds as the coordinate origin of the point cloud bounding box coordinate system, determining the value range of the point cloud bounding box coordinate system according to the coordinate range of the three-dimensional point clouds so as to determine the point cloud bounding box coordinate system, and converting the group of sampled three-dimensional point clouds into the point cloud bounding box coordinate system to obtain converted three-dimensional point clouds;
step 3: constructing a gesture estimation network through a feature extractor, a point cloud reconstructor and a gesture estimator;
step 4: taking the converted three-dimensional point clouds Xdata_k as training samples, constructing a gesture estimation network loss function model according to whether each training sample is annotated or not, and obtaining a trained gesture estimation network through optimization training;
Preferably, the number of depth images in step 1 is N_K;
The k-th depth image is:
d_k(u,v), k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein d_k(u,v) represents the depth, in the camera coordinate system, of the pixel on the u-th row and v-th column of the k-th depth image, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
Step 1, the camera coordinate system transformation is:
z_k(u,v) = d_k(u,v)
x_k(u,v) = (u - c_{k,1}) * z_k(u,v) / f_{k,1}
y_k(u,v) = (v - c_{k,2}) * z_k(u,v) / f_{k,2}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein c_{k,1} represents the first coordinate parameter of the k-th depth image, c_{k,2} the second coordinate parameter of the k-th depth image, f_{k,1} the first focal length parameter of the k-th depth image, f_{k,2} the second focal length parameter of the k-th depth image, N_K is the number of depth images, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
Step 1, the three-dimensional point cloud is:
data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein data_k represents the k-th group of three-dimensional point clouds, (x_k(u,v), y_k(u,v), z_k(u,v)) represents the coordinate point on the u-th row and v-th column of the k-th group of three-dimensional point clouds, N_K is the number of groups of three-dimensional point clouds, M represents the number of rows of the k-th group of three-dimensional point clouds, N represents the number of columns of the k-th group of three-dimensional point clouds, and M*N is the number of coordinate points in the k-th group of three-dimensional point clouds;
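As an illustration of this step, the back-projection can be written in a few lines of NumPy. This is a minimal sketch, assuming a depth map stored as an M×N array and pinhole intrinsics passed as plain scalars; the names depth_to_point_cloud, c1, c2, f1 and f2 are chosen for the example and do not come from the patent:

```python
import numpy as np

def depth_to_point_cloud(depth, c1, c2, f1, f2):
    """Back-project an M x N depth image d_k(u, v) into an (M*N, 3) point cloud."""
    M, N = depth.shape
    # u indexes rows and v indexes columns, matching d_k(u, v) above
    u, v = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    z = depth.astype(np.float64)          # z_k(u, v) = d_k(u, v)
    x = (u - c1) * z / f1                 # x_k(u, v)
    y = (v - c2) * z / f2                 # y_k(u, v)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```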
Preferably, the step 2 of randomly sampling the three-dimensional point cloud to obtain the sampled three-dimensional point cloud is as follows:
from data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}, k∈[1,N_K], u∈[1,M], v∈[1,N], L coordinate points are randomly selected out of the M*N coordinate points as the sampled three-dimensional point cloud, which is specifically defined as:
(x_{k,m}, y_{k,m}, z_{k,m}), k∈[1,N_K], m∈[1,L]
wherein (x_{k,m}, y_{k,m}, z_{k,m}) is the m-th coordinate point of the k-th group of sampled three-dimensional point clouds, L is the number of coordinate points in the k-th group of sampled three-dimensional point clouds, and N_K is the number of groups of sampled three-dimensional point clouds;
Step 2, principal component analysis is performed on the sampled three-dimensional point cloud to obtain the feature vector set of the group of three-dimensional point clouds:
V_k = {v_{k,1}, v_{k,2}, ..., v_{k,p}}
wherein p is the number of principal components taken by the principal component analysis, and V_k is the feature vector set of the k-th group of three-dimensional point clouds data_k;
the first feature vector, the second feature vector and the third feature vector in step 2 are three-dimensional space vectors, denoted in sequence as v_{k,1}, v_{k,2} and v_{k,3};
the first feature vector, the second feature vector and the third feature vector are three mutually orthogonal vectors;
step 2, the coordinate axis directions of the point cloud bounding box coordinate system are further constructed as follows:
the first three direction vectors v_{k,1}, v_{k,2} and v_{k,3} are taken as the three coordinate axis directions of the point cloud bounding box coordinate system of this group of point clouds;
Step 2, the coordinate average value of the sampled three-dimensional point cloud is:
x̄_k = (1/L) Σ_{m=1..L} x_{k,m}
ȳ_k = (1/L) Σ_{m=1..L} y_{k,m}
z̄_k = (1/L) Σ_{m=1..L} z_{k,m}
and the coordinate mean of the L sampled points is taken as the coordinate origin of the point cloud bounding box coordinate system:
T_k = (x̄_k, ȳ_k, z̄_k)
Step 2, determining the value ranges of the point cloud bounding box coordinate system in three directions according to the coordinate range of the three-dimensional point cloud, wherein the value ranges are respectively as follows:
sxk=max((xk,1,xk,2,...,xk,L))-min(xk,1,xk,2,...,xk,L)
syk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
szk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
wherein max represents taking the maximum value, and min represents taking the minimum value;
step 2, the sampled three-dimensional point cloud is converted into the point cloud bounding box coordinate system to obtain the converted three-dimensional point cloud:
for the m-th point p_{k,m} = (x_{k,m}, y_{k,m}, z_{k,m}) of the k-th group of point clouds, its coordinates in the point cloud bounding box coordinate system after conversion are
Xdata_{k,m} = R_k (p_{k,m} - T_k) / s_k
wherein s_k = (sx_k, sy_k, sz_k) and the division by s_k is taken component-wise;
R_k is determined as follows:
the original coordinate axes are rotated in sequence about the z axis, the y axis and the x axis by the angles yaw_k, pitch_k and roll_k so that they align with the axes of the point cloud bounding box; the rotation matrix from the original coordinate system to the point cloud bounding box coordinate system is then defined as:
R_{k,z} = [[cos(yaw_k), -sin(yaw_k), 0], [sin(yaw_k), cos(yaw_k), 0], [0, 0, 1]]
R_{k,y} = [[cos(pitch_k), 0, sin(pitch_k)], [0, 1, 0], [-sin(pitch_k), 0, cos(pitch_k)]]
R_{k,x} = [[1, 0, 0], [0, cos(roll_k), -sin(roll_k)], [0, sin(roll_k), cos(roll_k)]]
R_k = R_{k,z} R_{k,y} R_{k,x}
the translation matrix is T_k = (x̄_k, ȳ_k, z̄_k), i.e. the coordinate origin of the point cloud bounding box coordinate system;
The converted three-dimensional point cloud in step 2 is:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
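The whole of step 2 can be sketched as follows. This is only a sketch under stated assumptions: the PCA eigenvectors are used directly as the bounding-box axes and the rotated, centred coordinates are normalised component-wise by the extent s_k = (sx_k, sy_k, sz_k); the exact scaling used in the patent appears only in formulas rendered as images, so details may differ:

```python
import numpy as np

def normalize_point_cloud(points, L=1024, rng=None):
    """points: (P, 3) camera-space cloud; returns (L, 3) sampled, bounding-box-aligned cloud."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(points), size=L, replace=len(points) < L)   # random sampling
    sampled = points[idx]

    centroid = sampled.mean(axis=0)                 # coordinate origin T_k
    cov = np.cov((sampled - centroid).T)            # 3x3 covariance for PCA
    _, eigvecs = np.linalg.eigh(cov)                # columns are eigenvectors
    axes = eigvecs[:, ::-1]                         # order by decreasing variance

    local = (sampled - centroid) @ axes             # rotate into the box frame
    extent = local.max(axis=0) - local.min(axis=0)  # (sx_k, sy_k, sz_k)
    return local / np.maximum(extent, 1e-8)         # component-wise normalisation
```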
Preferably, the feature extractor in step 3 is connected in sequence to the point cloud reconstructor and to the gesture estimator respectively;
the input data of the feature extractor is the converted three-dimensional point cloud of step 2, namely:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
The feature extraction method of the feature extractor in step 3 is specifically as follows:
step 3.1, constructing a self-organizing map from the converted point cloud;
for the converted point cloud Xdata_k, a self-organizing map point cloud M_k is constructed;
M_k imitates the spatial distribution of Xdata_k with fewer points;
the self-organizing map is an artificial neural network that uses unsupervised learning to produce a low-dimensional discretized representation of the input space of the training samples;
the self-organizing map point cloud M_k preserves the topological properties of Xdata_k through a proximity function;
the self-organizing map point cloud is expressed as:
M_k = {(x_{k,m}, z_{k,m}, y_{k,m})}, m∈[1,M]
wherein M is the number of points in the self-organizing map point cloud, and M is smaller than the number L of points in the point cloud Xdata_k;
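A toy NumPy update loop makes the self-organizing map concrete. It is only illustrative: the patent fixes neither the SOM grid size nor the training schedule, so the hyper-parameters below (an 8×8 grid, exponential decay) are assumptions:

```python
import numpy as np

def fit_som(points, grid=8, epochs=20, lr=0.5, sigma=2.0, seed=0):
    """Fit a grid x grid SOM to an (L, 3) point cloud; returns the (grid*grid, 3) node cloud M_k."""
    rng = np.random.default_rng(seed)
    num_nodes = grid * grid
    nodes = points[rng.choice(len(points), num_nodes)]   # initialise nodes from the data
    coords = np.array([(i, j) for i in range(grid) for j in range(grid)], dtype=float)
    for epoch in range(epochs):
        decay = np.exp(-epoch / epochs)                  # shrink learning rate and neighbourhood
        for p in points[rng.permutation(len(points))]:
            winner = np.argmin(np.linalg.norm(nodes - p, axis=1))
            # proximity (neighbourhood) function on the node grid preserves the topology
            h = np.exp(-np.sum((coords - coords[winner]) ** 2, axis=1)
                       / (2 * (sigma * decay) ** 2))
            nodes += (lr * decay) * h[:, None] * (p - nodes)
    return nodes
```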
step 3.2, extracting node features from the converted point cloud;
for each point Xdata_{k,m} of the converted point cloud Xdata_k, a k-nearest-neighbour search is performed in the self-organizing map point cloud M_k;
each point Xdata_{k,m} of the point cloud Xdata_k is thus associated with its k nearest nodes, giving k×L point-node pairs, and the coordinates of these k×L points are input to the fully connected layer network module;
the input size of the fully connected layer network module is k×L×3 and its output size is k×L×F_n;
then, according to the correspondence between the M nodes found by the k-nearest-neighbour search and the k×L points, a max-pooling operation is applied to the k×L×F_n output of the fully connected module over the points belonging to each node, reducing it to the M dimension and yielding the node feature Feature_{k,n}, a matrix of size M×F_n;
step 3.3, obtaining the global feature from the node features;
the node feature Feature_{k,n} is input to a fully connected layer network module whose input size is M×F_n and whose output size is M×F_g, and a max-pooling operation is then taken over the M dimension to obtain the global feature Feature_{k,g}, a vector of size 1×F_g;
in summary, the feature extractor takes the converted point cloud Xdata_k as input and outputs the node feature Feature_{k,n} and the global feature Feature_{k,g};
the global feature is defined as: Feature_{k,g} is a vector of size 1×F_g, where F_g is a constant and N_L denotes the length of the overall feature vector;
the node feature is defined as: Feature_{k,n} is a matrix of size M×F_n, where F_n is a constant and M is the defined number of nodes.
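A PyTorch-style sketch of this extractor is given below. It is a simplification, not the patented network: the patent does not name a framework, the layer widths (64, 256) are assumptions, and one feature per point is shared across its k nearest nodes instead of computing k×L node-relative features:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, k=3, Fn=128, Fg=1024):
        super().__init__()
        self.k = k
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, Fn), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(Fn, 256), nn.ReLU(),
                                      nn.Linear(256, Fg), nn.ReLU())

    def forward(self, points, som_nodes):
        # points: (L, 3) converted cloud Xdata_k, som_nodes: (M, 3) SOM cloud M_k
        dist = torch.cdist(points, som_nodes)                  # (L, M) pairwise distances
        knn = dist.topk(self.k, dim=1, largest=False).indices  # (L, k) nearest nodes per point
        feat = self.point_mlp(points)                          # (L, Fn) per-point features
        node_feat = torch.stack([                              # max-pool points attached to each node
            feat[(knn == m).any(dim=1)].max(dim=0).values
            if bool((knn == m).any()) else feat.new_zeros(feat.size(1))
            for m in range(som_nodes.size(0))
        ])                                                     # (M, Fn) node feature Feature_{k,n}
        global_feat = self.node_mlp(node_feat).max(dim=0).values  # (Fg,) global feature Feature_{k,g}
        return node_feat, global_feat
```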
The point cloud reconstructor reconstructs the k-th group of reconstructed three-dimensional point clouds X_{k,r} from the global feature Feature_{k,g};
Step 3, the point cloud reconstructor is specifically defined as follows:
the global feature vector Feature_{k,g} is used to reconstruct the three-dimensional point cloud, and the reconstructed three-dimensional point cloud X_{k,r} is output;
the point cloud reconstructor adopts a reconstruction network structure with two branches; the fully connected branch predicts the position of each point independently, is good at describing complex structures, and consists of 4 fully connected layers;
the convolutional branch consists of 5 convolutions, each convolutional layer being followed by a deconvolution layer, which makes full use of spatial continuity;
the output point cloud module consists of two 1×1 convolution layers, and the prediction results of the two branches are merged to obtain the complete reconstructed point cloud X_{k,r};
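The two-branch decoder can be sketched as below. The patent fixes only the branch structure (a 4-layer fully connected branch, a convolutional branch with deconvolutions, and a 1×1-convolution output module); the widths, the 16×16 feature map and the number of points per branch are assumptions, and the convolutional branch here is shorter than the 5-convolution stack described above:

```python
import torch
import torch.nn as nn

class PointCloudReconstructor(nn.Module):
    def __init__(self, Fg=1024, n_fc_points=256):
        super().__init__()
        # fully connected branch: predicts each point position independently
        self.fc_branch = nn.Sequential(
            nn.Linear(Fg, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_fc_points * 3))
        # convolutional branch: exploits spatial continuity on a 2-D feature map
        self.conv_branch = nn.Sequential(
            nn.ConvTranspose2d(Fg, 256, 4), nn.ReLU(),                        # 1x1 -> 4x4
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4x4 -> 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8x8 -> 16x16
            nn.Conv2d(64, 3, 1))                                              # 1x1 conv -> xyz map

    def forward(self, global_feat):                           # global_feat: (Fg,)
        g = global_feat.unsqueeze(0)
        fc_points = self.fc_branch(g).view(-1, 3)             # (n_fc_points, 3)
        conv_points = self.conv_branch(g[:, :, None, None])   # (1, 3, 16, 16)
        conv_points = conv_points.permute(0, 2, 3, 1).reshape(-1, 3)
        return torch.cat([fc_points, conv_points], dim=0)     # merged reconstruction X_{k,r}
```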
The global feature Feature_{k,g} and the node feature Feature_{k,n} are passed through the gesture estimator to estimate the three-dimensional gesture;
Step 3, the gesture estimator is specifically defined as follows:
the global feature Feature_{k,g} and the local feature Feature_{k,n} obtained in the feature extraction stage are fused: the global feature vector of size 1×F_g is copied M times along the first dimension, raising it to a matrix of size M×F_g with the same first dimension as the local feature matrix of size M×F_n, and the two matrices are spliced (concatenated) along the second dimension to form the fused feature matrix Feature_{k,f} of size M×(F_n+F_g);
a new feature matrix is then obtained through a fully connected layer network module whose output width is defined as N_L, so that the new feature matrix has size M×N_L, the input size of the fully connected layer being M×(F_n+F_g) and its output size M×N_L;
the M×N_L feature matrix output by the fully connected layer is then average-pooled over the M dimension to obtain the overall feature vector V_{k,f} of length N_L;
the overall feature vector V_{k,f}, obtained by fusing the global feature Feature_{k,g} and the node feature Feature_{k,n}, is used to estimate the three-dimensional gesture Pose_k;
the method for estimating the three-dimensional gesture is:
the three-dimensional coordinates of the hand key points are regressed from the overall feature vector by several fully connected layer network modules;
the overall feature vector of length N_L passes through fully connected layer modules whose dimensions are (N_L, U, V, 3×N_joints), producing a 3×N_joints output that is reshaped into the three-dimensional coordinates Pose_k of the N_joints joint points, where N_L is the length of the overall feature vector input to the fully connected layers, and U and V are respectively the first and second hidden-layer dimensions of the fully connected layers;
the three-dimensional gesture is defined as:
Pose_k = {P_{k,num}}, num∈[1,N_joints]
wherein Pose_k represents the gesture of the k-th group, P_{k,num} is the three-dimensional coordinate of the num-th joint of the k-th group, and N_joints represents the number of gesture joints;
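The estimator itself reduces to a feature fusion followed by a small regression head, as in the sketch below. The default widths follow the example values given in the detailed embodiment (N_L = 1024, U = 512, V = 256, N_joints = 21); everything else, including the framework, is an assumption:

```python
import torch
import torch.nn as nn

class GestureEstimator(nn.Module):
    def __init__(self, Fn=128, Fg=1024, NL=1024, U=512, V=256, n_joints=21):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(Fn + Fg, NL), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(NL, U), nn.ReLU(),
                                  nn.Linear(U, V), nn.ReLU(),
                                  nn.Linear(V, 3 * n_joints))
        self.n_joints = n_joints

    def forward(self, node_feat, global_feat):
        # node_feat: (M, Fn) Feature_{k,n}, global_feat: (Fg,) Feature_{k,g}
        M = node_feat.size(0)
        tiled = global_feat.unsqueeze(0).expand(M, -1)      # copy the global feature M times
        fused = torch.cat([node_feat, tiled], dim=1)        # (M, Fn + Fg) fused matrix
        overall = self.fuse(fused).mean(dim=0)              # average pool -> overall vector V_{k,f}
        return self.head(overall).view(self.n_joints, 3)    # Pose_k: N_joints x 3 coordinates
```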
Preferably, the step 4 of constructing the gesture estimation network loss function model according to whether the training sample is annotated is as follows:
if the training sample is annotated, the gesture estimation network loss function model for the annotated sample is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}) + D_joints(Xdata_k)
wherein D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between Xdata_k and X_{k,r}:
D_Chamfer(Xdata_k, X_{k,r}) = (1/|X_{k,r}|) Σ_{x'∈X_{k,r}} min_{x∈Xdata_k} ||x' - x||_2 + (1/|Xdata_k|) Σ_{x∈Xdata_k} min_{x'∈X_{k,r}} ||x - x'||_2
k∈[1,N_K], m∈[1,L]
wherein |X_{k,r}| denotes the number of points of the k-th group of reconstructed point clouds, Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the first term computes, for each point of the reconstructed point cloud X_{k,r}, the distance to its nearest point in the point cloud Xdata_k and sums these distances; similarly, the second term computes, for each point of the point cloud Xdata_k, the distance to its nearest point in the reconstructed point cloud X_{k,r} and sums these distances.
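A compact PyTorch version of this bidirectional Chamfer distance is sketched below. Whether the patent averages or sums the nearest-neighbour distances, and whether they are squared, is not visible in the image formulas, so the mean-of-Euclidean-distances form here is an assumption:

```python
import torch

def chamfer_distance(x, y):
    """x: (L, 3) converted cloud Xdata_k, y: (R, 3) reconstructed cloud X_{k,r}."""
    d = torch.cdist(x, y)                       # (L, R) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```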
D_joints is computed from the estimated gesture and the annotated gesture as:
D_joints = Σ_{num=1..N_joints} ||P_{k,num} - P*_{k,num}||_2
wherein N_joints represents the number of gesture joints, num∈[1,N_joints], Pose_k represents the gesture of the k-th group, P_{k,num} is the three-dimensional coordinate of the num-th joint of the gesture of the k-th group, Pose*_k represents the annotated gesture of the k-th group, and P*_{k,num} is the three-dimensional coordinate of the num-th joint of the annotated gesture of the k-th group;
if the training sample is not annotated, the gesture estimation network loss function model for the unannotated sample is:
LOSS = D_Chamfer(Xdata_k, X_{k,r})
wherein D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between the point cloud Xdata_k and the reconstructed point cloud X_{k,r};
Step 4, the trained gesture estimation network obtained through optimization training comprises three modules: the optimized and trained feature extractor, the optimized and trained point cloud reconstructor, and the optimized and trained gesture estimator.
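The switch between the annotated and unannotated loss can be expressed as a single helper. This sketch reuses the chamfer_distance function from above and assumes the joint term is a sum of per-joint Euclidean errors, which the patent shows only as an image formula:

```python
def pose_loss(x, x_rec, pose=None, pose_gt=None):
    loss = chamfer_distance(x, x_rec)                       # D_Chamfer(Xdata_k, X_{k,r})
    if pose_gt is not None:                                 # annotated sample: add D_joints
        loss = loss + (pose - pose_gt).norm(dim=1).sum()    # per-joint L2 errors over N_joints joints
    return loss
```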
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
the application range is wide. The self-encoder based on the point cloud can be integrated into various networks for recovering a three-dimensional structure from a depth image, and the spatial representation capability of the extracted features is greatly improved.
The generalization ability is strong. Compared with the method of directly using the depth image as network input, the method only saves space position coordinates in a point cloud representation form, and therefore the method can be widely applied to various types of three-dimensional data.
The efficiency is high. Compared with other three-dimensional representation forms, the point cloud three-dimensional representation method has the advantages that the point cloud is used as network input, the orderless three-dimensional representation method only comprises the space coordinates of each point, and the network calculation amount can be reduced.
The precision is high. The design provides that a three-dimensional point cloud reconstruction part fully utilizes intermediate representation in a network, extracts multi-level features through an encoder guided by a self-organizing map, and models the spatial distribution of the point cloud. The decoder reconstructs hand point cloud from the encoded global features, so that the features learned by the encoder contain more hand space information, thereby improving the hand posture estimation effect.
The dependency on the annotation data is low. The invention designs a semi-supervised training strategy applied to a gesture estimation task, which trains the whole network by using a small amount of labeled data and optimizes the network by fully utilizing unannotated data.
Therefore, the three-dimensional gesture estimation method provided by the invention has high recognition precision and reduces the labeling cost of training data.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a structure diagram of the neural network of the present invention;
FIG. 3 is a schematic diagram of the effect of adapting the initialized self-organizing map to the input hand point cloud in an embodiment of the present invention;
FIG. 4 is a diagram of the network structure for fusing node features and global features in the present invention;
FIG. 5 is a schematic diagram of the network structure that recovers the reconstructed point cloud from the global features in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Self-organizing maps are artificial neural networks that use unsupervised learning to produce a low-dimensional discretized representation of the input space of training samples, which differs from other artificial neural networks in that they use a proximity function to preserve the topological properties of the input space. In the present invention, the low dimensional representation M models the spatial distribution of the point cloud X with fewer points.
The invention provides a depth image gesture estimation method based on semi-supervised learning; its overall structure is shown in FIG. 2. The system comprises: a data conversion module; a feature extraction module based on a point cloud processing network; a point cloud feature decoding module for reconstructing the hand three-dimensional point cloud; and a gesture estimation module based on multi-level features.
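For orientation, the sketches given earlier (FeatureExtractor, PointCloudReconstructor and GestureEstimator, all illustrative rather than the patented implementation) can be wired together as one module; the data conversion module runs outside the network:

```python
import torch.nn as nn

class GestureEstimationNetwork(nn.Module):
    def __init__(self, extractor, reconstructor, estimator):
        super().__init__()
        self.extractor = extractor          # SOM-guided point cloud encoder
        self.reconstructor = reconstructor  # decodes the global feature into a point cloud
        self.estimator = estimator          # fuses node + global features into a pose

    def forward(self, points, som_nodes):
        node_feat, global_feat = self.extractor(points, som_nodes)
        x_rec = self.reconstructor(global_feat)       # used by the Chamfer term
        pose = self.estimator(node_feat, global_feat)
        return x_rec, pose
```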
The depth image gesture estimation method based on semi-supervised learning provided by the present invention is specifically described below with reference to fig. 1 to 5, and specifically includes the following steps:
step 1: converting the multiple depth images into multiple groups of three-dimensional point clouds through a camera coordinate system;
step 1, the number of depth images is N_K;
The k-th depth image is:
d_k(u,v), k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein d_k(u,v) represents the depth, in the camera coordinate system, of the pixel on the u-th row and v-th column of the k-th depth image, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
step 1, the camera coordinate system transformation is:
z_k(u,v) = d_k(u,v)
x_k(u,v) = (u - c_{k,1}) * z_k(u,v) / f_{k,1}
y_k(u,v) = (v - c_{k,2}) * z_k(u,v) / f_{k,2}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein c_{k,1} represents the first coordinate parameter of the k-th depth image, c_{k,2} the second coordinate parameter of the k-th depth image, f_{k,1} the first focal length parameter of the k-th depth image, f_{k,2} the second focal length parameter of the k-th depth image, N_K is the number of depth images, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
step 1, the three-dimensional point cloud is:
data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein data_k represents the k-th group of three-dimensional point clouds, (x_k(u,v), y_k(u,v), z_k(u,v)) represents the coordinate point on the u-th row and v-th column of the k-th group of three-dimensional point clouds, N_K is the number of groups of three-dimensional point clouds, M represents the number of rows of the k-th group of three-dimensional point clouds, N represents the number of columns of the k-th group of three-dimensional point clouds, and M*N is the number of coordinate points in the k-th group of three-dimensional point clouds;
step 2: randomly sampling each group of three-dimensional point clouds to obtain sampled three-dimensional point clouds, performing principal component analysis on the sampled three-dimensional point clouds to obtain a feature vector set of the group of three-dimensional point clouds, sequentially selecting a first feature vector, a second feature vector and a third feature vector from the feature vector set, further constructing the coordinate axis directions of the point cloud bounding box coordinate system, setting the coordinate average value of the sampled three-dimensional point clouds as the coordinate origin of the point cloud bounding box coordinate system, determining the value range of the point cloud bounding box coordinate system according to the coordinate range of the three-dimensional point clouds so as to determine the point cloud bounding box coordinate system, and converting the group of sampled three-dimensional point clouds into the point cloud bounding box coordinate system to obtain converted three-dimensional point clouds;
step 2, the three-dimensional point cloud is randomly sampled to obtain the sampled three-dimensional point cloud:
from data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}, k∈[1,N_K], u∈[1,M], v∈[1,N], L coordinate points are randomly selected out of the M*N coordinate points as the sampled three-dimensional point cloud, which is specifically defined as:
(x_{k,m}, y_{k,m}, z_{k,m}), k∈[1,N_K], m∈[1,L]
wherein (x_{k,m}, y_{k,m}, z_{k,m}) is the m-th coordinate point of the k-th group of sampled three-dimensional point clouds, L is the number of coordinate points in the k-th group of sampled three-dimensional point clouds, and N_K is the number of groups of sampled three-dimensional point clouds;
step 2, principal component analysis is performed on the sampled three-dimensional point cloud to obtain the feature vector set of the group of three-dimensional point clouds:
V_k = {v_{k,1}, v_{k,2}, ..., v_{k,p}}
wherein p is the number of principal components taken by the principal component analysis, and V_k is the feature vector set of the k-th group of three-dimensional point clouds data_k;
the first feature vector, the second feature vector and the third feature vector in step 2 are three-dimensional space vectors, denoted in sequence as v_{k,1}, v_{k,2} and v_{k,3};
the first feature vector, the second feature vector and the third feature vector are three mutually orthogonal vectors;
step 2, the coordinate axis directions of the point cloud bounding box coordinate system are further constructed as follows:
the first three direction vectors v_{k,1}, v_{k,2} and v_{k,3} are taken as the three coordinate axis directions of the point cloud bounding box coordinate system of this group of point clouds;
Step 2, the coordinate average value of the sampled three-dimensional point cloud is:
x̄_k = (1/L) Σ_{m=1..L} x_{k,m}
ȳ_k = (1/L) Σ_{m=1..L} y_{k,m}
z̄_k = (1/L) Σ_{m=1..L} z_{k,m}
and the coordinate mean of the L sampled points is taken as the coordinate origin of the point cloud bounding box coordinate system:
T_k = (x̄_k, ȳ_k, z̄_k)
Step 2, determining the value ranges of the point cloud bounding box coordinate system in three directions according to the coordinate range of the three-dimensional point cloud, wherein the value ranges are respectively as follows:
sxk=max((xk,1,xk,2,...,xk,L))-min(xk,1,xk,2,...,xk,L)
syk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
szk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
wherein max represents taking the maximum value, and min represents taking the minimum value;
step 2, the sampled three-dimensional point cloud is converted into the point cloud bounding box coordinate system to obtain the converted three-dimensional point cloud:
for the m-th point p_{k,m} = (x_{k,m}, y_{k,m}, z_{k,m}) of the k-th group of point clouds, its coordinates in the point cloud bounding box coordinate system after conversion are
Xdata_{k,m} = R_k (p_{k,m} - T_k) / s_k
wherein s_k = (sx_k, sy_k, sz_k) and the division by s_k is taken component-wise;
R_k is determined as follows:
the original coordinate axes are rotated in sequence about the z axis, the y axis and the x axis by the angles yaw_k, pitch_k and roll_k so that they align with the axes of the point cloud bounding box; the rotation matrix from the original coordinate system to the point cloud bounding box coordinate system is then defined as:
R_{k,z} = [[cos(yaw_k), -sin(yaw_k), 0], [sin(yaw_k), cos(yaw_k), 0], [0, 0, 1]]
R_{k,y} = [[cos(pitch_k), 0, sin(pitch_k)], [0, 1, 0], [-sin(pitch_k), 0, cos(pitch_k)]]
R_{k,x} = [[1, 0, 0], [0, cos(roll_k), -sin(roll_k)], [0, sin(roll_k), cos(roll_k)]]
R_k = R_{k,z} R_{k,y} R_{k,x}
the translation matrix is T_k = (x̄_k, ȳ_k, z̄_k), i.e. the coordinate origin of the point cloud bounding box coordinate system;
The converted three-dimensional point cloud in step 2 is:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
step 3: constructing a gesture estimation network through a feature extractor, a point cloud reconstructor and a gesture estimator;
Step 3, the feature extractor is connected in sequence to the point cloud reconstructor and to the gesture estimator respectively;
the input data of the feature extractor is the converted three-dimensional point cloud of step 2, namely:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the feature extraction method of the feature extractor in step 3 is specifically as follows:
step 3.1, constructing a self-organizing map from the converted point cloud;
for the converted point cloud Xdata_k, a self-organizing map point cloud M_k is constructed;
M_k imitates the spatial distribution of Xdata_k with fewer points; the effect is shown in FIG. 3: its three panels show how a randomly initialized self-organizing map is adapted to the space of the point cloud Xdata_k, yielding the self-organizing map point cloud M_k;
the self-organizing map is an artificial neural network that uses unsupervised learning to produce a low-dimensional discretized representation of the input space of the training samples;
the self-organizing map point cloud M_k preserves the topological properties of Xdata_k through a proximity function;
the self-organizing map point cloud is expressed as:
M_k = {(x_{k,m}, z_{k,m}, y_{k,m})}, m∈[1,M]
wherein M is the number of points in the self-organizing map point cloud, and M is smaller than the number L of points in the point cloud Xdata_k; the relationship between the self-organizing map point cloud M_k and the converted point cloud Xdata_k is shown in FIG. 2.
Step 3.2, extracting node features from the converted point cloud;
for each point Xdata_{k,m} of the converted point cloud Xdata_k, a k-nearest-neighbour search is performed in the self-organizing map point cloud M_k;
each point Xdata_{k,m} of the point cloud Xdata_k is thus associated with its k nearest nodes, giving k×L point-node pairs, and the coordinates of these k×L points are input to the fully connected layer network module;
the input size of the fully connected layer network module is k×L×3 and its output size is k×L×F_n;
then, according to the correspondence between the M nodes found by the k-nearest-neighbour search and the k×L points, a max-pooling operation is applied to the k×L×F_n output of the fully connected module over the points belonging to each node, reducing it to the M dimension and yielding the node feature Feature_{k,n}, a matrix of size M×F_n;
step 3.3, obtaining the global feature from the node features;
the node feature Feature_{k,n} is input to a fully connected layer network module whose input size is M×F_n and whose output size is M×F_g, and a max-pooling operation is then taken over the M dimension to obtain the global feature Feature_{k,g}, a vector of size 1×F_g;
in summary, the feature extractor takes the converted point cloud Xdata_k as input and outputs the node feature Feature_{k,n} and the global feature Feature_{k,g};
the global feature is defined as: Feature_{k,g} is a vector of size 1×F_g, where F_g is a constant and N_L denotes the length of the overall feature vector, e.g. N_L = 1024;
the node feature is defined as: Feature_{k,n} is a matrix of size M×F_n, where F_n is a constant and M is the defined number of nodes.
The point cloud reconstructor reconstructs the k-th group of reconstructed three-dimensional point clouds X_{k,r} from the global feature Feature_{k,g};
step 3, the point cloud reconstructor is specifically defined as follows:
the global feature vector Feature_{k,g} is used to reconstruct the three-dimensional point cloud, and the reconstructed three-dimensional point cloud X_{k,r} is output;
the point cloud reconstructor adopts a reconstruction network structure with two branches; the fully connected branch predicts the position of each point independently, is good at describing complex structures, and consists of 4 fully connected layers, as shown in FIG. 5;
the convolutional branch consists of 5 convolutions, each convolutional layer being followed by a deconvolution layer, which makes full use of spatial continuity;
the output point cloud module consists of two 1×1 convolution layers, and the prediction results of the two branches are merged to obtain the complete reconstructed point cloud X_{k,r};
The global feature Feature_{k,g} and the node feature Feature_{k,n} are passed through the gesture estimator to estimate the three-dimensional gesture;
step 3, the gesture estimator is specifically defined as follows:
the global feature Feature_{k,g} and the local feature Feature_{k,n} obtained in the feature extraction stage are fused: the global feature vector of size 1×F_g is copied M times along the first dimension, raising it to a matrix of size M×F_g with the same first dimension as the local feature matrix of size M×F_n, and the two matrices are spliced (concatenated) along the second dimension to form the fused feature matrix Feature_{k,f} of size M×(F_n+F_g); this process is illustrated in FIG. 4.
A new feature matrix is then obtained through a fully connected layer network module whose output width is defined as N_L = 1024, so that the new feature matrix has size M×N_L, the input size of the fully connected layer being M×(F_n+F_g) and its output size M×N_L;
the M×N_L feature matrix output by the fully connected layer is then average-pooled over the M dimension to obtain the overall feature vector V_{k,f} of length N_L = 1024;
the overall feature vector V_{k,f}, obtained by fusing the global feature Feature_{k,g} and the node feature Feature_{k,n}, is used to estimate the three-dimensional gesture Pose_k;
the method for estimating the three-dimensional gesture is:
the three-dimensional coordinates of the hand key points are regressed from the overall feature vector by several fully connected layer network modules;
the overall feature vector of length N_L = 1024 passes through fully connected layer modules whose dimensions are (N_L, U, V, 3×N_joints), producing a 3×N_joints output that is reshaped into the three-dimensional coordinates Pose_k of the N_joints joint points, where N_L = 1024 is the length of the overall feature vector input to the fully connected layers, and U = 512 and V = 256 are respectively the first and second hidden-layer dimensions of the fully connected layers;
the three-dimensional gesture is defined as:
Pose_k = {P_{k,num}}, num∈[1,N_joints]
wherein Pose_k represents the gesture of the k-th group, P_{k,num} is the three-dimensional coordinate of the num-th joint of the k-th group, and N_joints represents the number of gesture joints; in this patent N_joints = 21 is taken;
Step 4: taking the converted three-dimensional point clouds Xdata_k as training samples, constructing a gesture estimation network loss function model according to whether each training sample is annotated or not, and obtaining a trained gesture estimation network through optimization training, as shown in FIG. 2.
Step 4, the gesture estimation network loss function model is constructed according to whether the training sample is annotated, as follows:
if the training sample is annotated, the gesture estimation network loss function model for the annotated sample is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}) + D_joints(Xdata_k)
wherein D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between Xdata_k and X_{k,r}:
D_Chamfer(Xdata_k, X_{k,r}) = (1/|X_{k,r}|) Σ_{x'∈X_{k,r}} min_{x∈Xdata_k} ||x' - x||_2 + (1/|Xdata_k|) Σ_{x∈Xdata_k} min_{x'∈X_{k,r}} ||x - x'||_2
k∈[1,N_K], m∈[1,L]
wherein |X_{k,r}| denotes the number of points of the k-th group of reconstructed point clouds, Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the first term computes, for each point of the reconstructed point cloud X_{k,r}, the distance to its nearest point in the point cloud Xdata_k and sums these distances; similarly, the second term computes, for each point of the point cloud Xdata_k, the distance to its nearest point in the reconstructed point cloud X_{k,r} and sums these distances.
D_joints is computed from the estimated gesture and the annotated gesture as:
D_joints = Σ_{num=1..N_joints} ||P_{k,num} - P*_{k,num}||_2
wherein N_joints represents the number of gesture joints, num∈[1,N_joints], Pose_k represents the gesture of the k-th group, P_{k,num} is the three-dimensional coordinate of the num-th joint of the gesture of the k-th group, Pose*_k represents the annotated gesture of the k-th group, and P*_{k,num} is the three-dimensional coordinate of the num-th joint of the annotated gesture of the k-th group;
if the training sample is not annotated, the gesture estimation network loss function model for the unannotated sample is:
LOSS = D_Chamfer(Xdata_k, X_{k,r})
wherein D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between the point cloud Xdata_k and the reconstructed point cloud X_{k,r};
Step 4, the trained gesture estimation network obtained through optimization training comprises three modules: the optimized and trained feature extractor, the optimized and trained point cloud reconstructor, and the optimized and trained gesture estimator.
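A minimal training-loop sketch for the semi-supervised strategy is given below, reusing the GestureEstimationNetwork and pose_loss helpers sketched earlier. Batch composition, the data loaders and the optimiser schedule are assumptions; the patent only specifies that annotated samples use both loss terms and unannotated samples use the Chamfer term alone:

```python
import torch

def train(network, optimizer, labelled_loader, unlabelled_loader, epochs=10):
    for _ in range(epochs):
        for (x, som, pose_gt), (xu, som_u) in zip(labelled_loader, unlabelled_loader):
            optimizer.zero_grad()
            x_rec, pose = network(x, som)
            loss = pose_loss(x, x_rec, pose, pose_gt)   # annotated: Chamfer + joint term
            xu_rec, _ = network(xu, som_u)
            loss = loss + pose_loss(xu, xu_rec)         # unannotated: Chamfer term only
            loss.backward()
            optimizer.step()
```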
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A depth image gesture estimation method based on semi-supervised learning is characterized by comprising the following steps:
step 1: converting the multiple depth images into multiple groups of three-dimensional point clouds through a camera coordinate system;
step 2: randomly sampling each group of three-dimensional point clouds to obtain sampled three-dimensional point clouds, performing principal component analysis on the sampled three-dimensional point clouds to obtain a feature vector set of the group of three-dimensional point clouds, sequentially selecting a first feature vector, a second feature vector and a third feature vector from the feature vector set, further constructing the coordinate axis directions of the point cloud bounding box coordinate system, setting the coordinate average value of the sampled three-dimensional point clouds as the coordinate origin of the point cloud bounding box coordinate system, determining the value range of the point cloud bounding box coordinate system according to the coordinate range of the three-dimensional point clouds so as to determine the point cloud bounding box coordinate system, and converting the group of sampled three-dimensional point clouds into the point cloud bounding box coordinate system to obtain converted three-dimensional point clouds;
step 3: constructing a gesture estimation network through a feature extractor, a point cloud reconstructor and a gesture estimator;
step 4: taking the converted three-dimensional point clouds Xdata_k as training samples, constructing a gesture estimation network loss function model according to whether each training sample is annotated or not, and obtaining a trained gesture estimation network through optimization training.
2. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: the number of depth images in step 1 is N_K;
the k-th depth image is:
d_k(u,v), k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein d_k(u,v) represents the depth, in the camera coordinate system, of the pixel on the u-th row and v-th column of the k-th depth image, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
step 1, the camera coordinate system transformation is:
z_k(u,v) = d_k(u,v)
x_k(u,v) = (u - c_{k,1}) * z_k(u,v) / f_{k,1}
y_k(u,v) = (v - c_{k,2}) * z_k(u,v) / f_{k,2}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein c_{k,1} represents the first coordinate parameter of the k-th depth image, c_{k,2} the second coordinate parameter of the k-th depth image, f_{k,1} the first focal length parameter of the k-th depth image, f_{k,2} the second focal length parameter of the k-th depth image, N_K is the number of depth images, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
step 1, the three-dimensional point cloud is:
data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein data_k represents the k-th group of three-dimensional point clouds, (x_k(u,v), y_k(u,v), z_k(u,v)) represents the coordinate point on the u-th row and v-th column of the k-th group of three-dimensional point clouds, N_K is the number of groups of three-dimensional point clouds, M represents the number of rows of the k-th group of three-dimensional point clouds, N represents the number of columns of the k-th group of three-dimensional point clouds, and M*N is the number of coordinate points in the k-th group of three-dimensional point clouds.
3. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: in step 2, the three-dimensional point cloud is randomly sampled to obtain the sampled three-dimensional point cloud:
from data_k = {(x_k(u,v), y_k(u,v), z_k(u,v))}, k∈[1,N_K], u∈[1,M], v∈[1,N], L coordinate points are randomly selected out of the M*N coordinate points as the sampled three-dimensional point cloud, which is specifically defined as:
(x_{k,m}, y_{k,m}, z_{k,m}), k∈[1,N_K], m∈[1,L]
wherein (x_{k,m}, y_{k,m}, z_{k,m}) is the m-th coordinate point of the k-th group of sampled three-dimensional point clouds, L is the number of coordinate points in the k-th group of sampled three-dimensional point clouds, and N_K is the number of groups of sampled three-dimensional point clouds;
step 2, principal component analysis is performed on the sampled three-dimensional point cloud to obtain the feature vector set of the group of three-dimensional point clouds:
V_k = {v_{k,1}, v_{k,2}, ..., v_{k,p}}
wherein p is the number of principal components taken by the principal component analysis, and V_k is the feature vector set of the k-th group of three-dimensional point clouds data_k;
the first feature vector, the second feature vector and the third feature vector in step 2 are three-dimensional space vectors, denoted in sequence as v_{k,1}, v_{k,2} and v_{k,3};
the first feature vector, the second feature vector and the third feature vector are three mutually orthogonal vectors;
step 2, the coordinate axis directions of the point cloud bounding box coordinate system are further constructed as follows:
the first three direction vectors v_{k,1}, v_{k,2} and v_{k,3} are taken as the three coordinate axis directions of the point cloud bounding box coordinate system of this group of point clouds;
Step 2, the coordinate average value of the sampled three-dimensional point cloud is:
x̄_k = (1/L) Σ_{m=1..L} x_{k,m}
ȳ_k = (1/L) Σ_{m=1..L} y_{k,m}
z̄_k = (1/L) Σ_{m=1..L} z_{k,m}
and the coordinate mean of the L sampled points is taken as the coordinate origin of the point cloud bounding box coordinate system:
T_k = (x̄_k, ȳ_k, z̄_k)
Step 2, determining the value ranges of the point cloud bounding box coordinate system in three directions according to the coordinate range of the three-dimensional point cloud, wherein the value ranges are respectively as follows:
sxk=max((xk,1,xk,2,...,xk,L))-min(xk,1,xk,2,...,xk,L)
syk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
szk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
wherein max represents taking the maximum value, and min represents taking the minimum value;
step 2, the sampled three-dimensional point cloud is converted into the point cloud bounding box coordinate system to obtain the converted three-dimensional point cloud:
for the m-th point p_{k,m} = (x_{k,m}, y_{k,m}, z_{k,m}) of the k-th group of point clouds, its coordinates in the point cloud bounding box coordinate system after conversion are
Xdata_{k,m} = R_k (p_{k,m} - T_k) / s_k
wherein s_k = (sx_k, sy_k, sz_k) and the division by s_k is taken component-wise;
R_k is determined as follows:
the original coordinate axes are rotated in sequence about the z axis, the y axis and the x axis by the angles yaw_k, pitch_k and roll_k so that they align with the axes of the point cloud bounding box; the rotation matrix from the original coordinate system to the point cloud bounding box coordinate system is then defined as:
R_{k,z} = [[cos(yaw_k), -sin(yaw_k), 0], [sin(yaw_k), cos(yaw_k), 0], [0, 0, 1]]
R_{k,y} = [[cos(pitch_k), 0, sin(pitch_k)], [0, 1, 0], [-sin(pitch_k), 0, cos(pitch_k)]]
R_{k,x} = [[1, 0, 0], [0, cos(roll_k), -sin(roll_k)], [0, sin(roll_k), cos(roll_k)]]
R_k = R_{k,z} R_{k,y} R_{k,x}
the translation matrix is T_k = (x̄_k, ȳ_k, z̄_k), i.e. the coordinate origin of the point cloud bounding box coordinate system;
the converted three-dimensional point cloud in step 2 is:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds.
4. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: in step 3, the feature extractor is connected in sequence to the point cloud reconstructor and to the gesture estimator respectively;
the input data of the feature extractor is the converted three-dimensional point cloud of step 2, namely:
Xdata_k = {Xdata_{k,m}}, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the feature extraction method of the feature extractor in step 3 is specifically as follows:
step 3.1, constructing a self-organizing map from the converted point cloud;
for the converted point cloud Xdata_k, a self-organizing map point cloud M_k is constructed;
M_k imitates the spatial distribution of Xdata_k with fewer points;
the self-organizing map is an artificial neural network that uses unsupervised learning to produce a low-dimensional discretized representation of the input space of the training samples;
the self-organizing map point cloud M_k preserves the topological properties of Xdata_k through a proximity function;
the self-organizing map point cloud is expressed as:
M_k = {(x_{k,m}, z_{k,m}, y_{k,m})}, m∈[1,M]
wherein M is the number of points in the self-organizing map point cloud, and M is smaller than the number L of points in the point cloud Xdata_k;
step 3.2, extracting node features from the converted point cloud;
for each point Xdata_{k,m} of the converted point cloud Xdata_k, a k-nearest-neighbour search is performed in the self-organizing map point cloud M_k;
each point Xdata_{k,m} of the point cloud Xdata_k is thus associated with its k nearest nodes, giving k×L point-node pairs, and the coordinates of these k×L points are input to the fully connected layer network module;
the input size of the fully connected layer network module is k×L×3 and its output size is k×L×F_n;
then, according to the correspondence between the M nodes found by the k-nearest-neighbour search and the k×L points, a max-pooling operation is applied to the k×L×F_n output of the fully connected module over the points belonging to each node, reducing it to the M dimension and yielding the node feature Feature_{k,n}, a matrix of size M×F_n;
step 3.3, obtaining the global feature from the node features;
the node feature Feature_{k,n} is input to a fully connected layer network module whose input size is M×F_n and whose output size is M×F_g, and a max-pooling operation is then taken over the M dimension to obtain the global feature Feature_{k,g}, a vector of size 1×F_g;
in summary, the feature extractor takes the converted point cloud Xdata_k as input and outputs the node feature Feature_{k,n} and the global feature Feature_{k,g};
the global feature is defined as: Feature_{k,g} is a vector of size 1×F_g, where F_g is a constant and N_L denotes the length of the overall feature vector;
the node feature is defined as: Feature_{k,n} is a matrix of size M×F_n, where F_n is a constant and M is the defined number of nodes;
the point cloud reconstructor derives the global Feature from the point cloud reconstructork,gReconstructing to obtain the kth group of reconstructed three-dimensional point cloud Xk,r
X_{k,r} = {(x_{k,i}^r, y_{k,i}^r, z_{k,i}^r), i ∈ [1, |X_{k,r}|]}
in step 3, the point cloud reconstructor is specifically defined as follows:
the global feature vector Feature_{k,g} is used to reconstruct the three-dimensional point cloud, outputting the reconstructed three-dimensional point cloud X_{k,r};
the point cloud reconstructor adopts a reconstruction network comprising two branches: the fully connected branch, consisting of 4 fully connected layers, predicts the position of each point independently and describes complex structures well;
the convolutional branch consists of 5 convolutions, each convolutional layer followed by a deconvolution layer, which makes full use of spatial continuity;
the output point cloud module is composed of two 1 × 1 convolutional layers, and the predictions of the two branches are merged to obtain the complete reconstructed point cloud X_{k,r};
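A hedged PyTorch sketch of such a two-branch reconstructor follows; the layer widths, strides and point counts are assumptions, and the claimed alternation of convolution and deconvolution layers is simplified here to a stack of transposed convolutions.

```python
# Two-branch reconstructor sketch: an FC branch regresses one subset of points
# directly, an upsampling branch produces another subset, and the two are
# concatenated into the reconstructed cloud X_{k,r}.
import torch
import torch.nn as nn

class PointCloudReconstructor(nn.Module):
    def __init__(self, Fg=256, n_fc_points=256):
        super().__init__()
        # FC branch: 4 fully connected layers, each output point predicted independently.
        self.fc_branch = nn.Sequential(
            nn.Linear(Fg, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_fc_points * 3),
        )
        # Upsampling branch: transposed 1-D convolutions grow the feature map,
        # exploiting spatial continuity (simplified stand-in for conv+deconv pairs).
        self.conv_branch = nn.Sequential(
            nn.ConvTranspose1d(Fg, 256, kernel_size=4, stride=4), nn.ReLU(),   # 1 -> 4
            nn.ConvTranspose1d(256, 128, kernel_size=4, stride=4), nn.ReLU(),  # 4 -> 16
            nn.ConvTranspose1d(128, 64, kernel_size=4, stride=4), nn.ReLU(),   # 16 -> 64
            nn.ConvTranspose1d(64, 64, kernel_size=4, stride=4), nn.ReLU(),    # 64 -> 256
            nn.ConvTranspose1d(64, 32, kernel_size=3, stride=3), nn.ReLU(),    # 256 -> 768
        )
        # Output module: two 1x1 convolutions map features to xyz coordinates.
        self.out_conv = nn.Sequential(
            nn.Conv1d(32, 32, kernel_size=1), nn.ReLU(),
            nn.Conv1d(32, 3, kernel_size=1),
        )
        self.n_fc_points = n_fc_points

    def forward(self, feat_g):                           # feat_g: (B, Fg)
        pts_fc = self.fc_branch(feat_g).view(-1, self.n_fc_points, 3)
        x = self.conv_branch(feat_g.unsqueeze(-1))       # (B, 32, 768)
        pts_conv = self.out_conv(x).transpose(1, 2)      # (B, 768, 3)
        return torch.cat([pts_fc, pts_conv], dim=1)      # (B, 1024, 3) reconstructed cloud

# Usage: X_kr = PointCloudReconstructor()(torch.randn(2, 256))
```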
the global feature Feature_{k,g} and the node feature Feature_{k,n} are passed through the gesture estimator to estimate the three-dimensional gesture;
in step 3, the gesture estimator is specifically defined as follows:
the global feature Feature_{k,g} and the local (node) feature Feature_{k,n} obtained in the feature extraction stage are fused: the global feature vector of size 1 × F_g is copied M times along the first dimension, raising it to an M × F_g matrix with the same first dimension as the M × F_n local feature matrix; the two matrices are then spliced, i.e. concatenated, along the second dimension to form the fused feature matrix Feature_{k,f} of size M × (F_n + F_g);
a new feature matrix is obtained through a fully connected layer network module whose output length is defined as N_L; the fully connected layer has input size M × (F_n + F_g) and output size M × N_L, so the new feature matrix has size M × N_L;
the M × N_L feature matrix output by the fully connected layer is then average-pooled over the M dimension to obtain the overall feature vector V_{k,f} of length N_L;
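A minimal sketch of this fusion step (tile, concatenate, shared fully connected layer, average pool); the random projection stands in for the trained module and `fuse_features` is a hypothetical name.

```python
import numpy as np

def fuse_features(node_feat, global_feat, NL=512, seed=0):
    """node_feat: (M, Fn); global_feat: (Fg,) -> (NL,) overall feature V_{k,f}."""
    rng = np.random.default_rng(seed)
    M = node_feat.shape[0]
    tiled = np.tile(global_feat, (M, 1))                  # (M, Fg) copied global feature
    fused = np.concatenate([node_feat, tiled], axis=1)    # (M, Fn + Fg) Feature_{k,f}
    W = rng.standard_normal((fused.shape[1], NL))         # stand-in for the trained FC module
    per_node = np.maximum(fused @ W, 0.0)                 # (M, NL)
    return per_node.mean(axis=0)                          # average pool over M -> (NL,)
```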
the three-dimensional gesture Pose_k is estimated from the overall feature vector V_{k,f} obtained by fusing the global feature Feature_{k,g} and the node feature Feature_{k,n};
the method for estimating the three-dimensional gesture comprises the following steps:
the three-dimensional coordinates of the hand key points are regressed from the overall feature vector using a plurality of fully connected layer network modules;
the overall feature vector of length N_L passes through fully connected layer network modules whose dimensions are (N_L, U, V, 3 × N_joints), yielding a 3 × N_joints output that is reshaped into the three-dimensional coordinates Pose_k of the N_joints joint points, where N_L is the input length of the fully connected layers and U, V are the sizes of the first and second hidden layers (a sketch of this regression head follows the gesture definition below);
the three-dimensional gesture is defined as:
Pose_k = {pose_k^num, num ∈ [1, N_joints]}
wherein Pose_k represents the gesture of the k-th group, pose_k^num = (x_k^num, y_k^num, z_k^num) is the three-dimensional coordinate of the num-th joint of the k-th group, and N_joints represents the number of gesture joints.
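A hedged PyTorch sketch of the regression head described above; the hidden sizes U, V and the joint count N_joints = 21 are assumptions.

```python
# MLP regression head: overall feature vector V_{k,f} -> N_joints 3-D joints.
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    def __init__(self, NL=512, U=256, V=128, n_joints=21):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(NL, U), nn.ReLU(),
            nn.Linear(U, V), nn.ReLU(),
            nn.Linear(V, 3 * n_joints),
        )
        self.n_joints = n_joints

    def forward(self, v_kf):                       # v_kf: (B, NL)
        out = self.mlp(v_kf)                       # (B, 3 * N_joints)
        return out.view(-1, self.n_joints, 3)      # Pose_k: (B, N_joints, 3)

# Usage: PoseRegressor()(torch.randn(4, 512)).shape -> torch.Size([4, 21, 3])
```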
5. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: in step 4, the gesture estimation network loss function model is constructed according to whether the training samples are annotated, as follows:
if the training sample is annotated, the loss function model for annotated samples is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}) + D_joints(Xdata_k)
wherein D_Chamfer(Xdata_k, X_{k,r}) represents the Chamfer distance between Xdata_k and X_{k,r}:
D_Chamfer(Xdata_k, X_{k,r}) = Σ_{p ∈ X_{k,r}} min_{q ∈ Xdata_k} ||p - q||_2 + Σ_{q ∈ Xdata_k} min_{p ∈ X_{k,r}} ||q - p||_2
wherein |X_{k,r}| represents the number of points in the k-th group of reconstructed point cloud, Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group of converted three-dimensional point clouds, and N_K is the number of groups of converted three-dimensional point clouds;
the term Σ_{p ∈ X_{k,r}} min_{q ∈ Xdata_k} ||p - q||_2 means that, for each point of the k-th group of reconstructed point cloud X_{k,r}, the distance to the nearest point in the point cloud Xdata_k is computed, and these distances are summed;
similarly, the term Σ_{q ∈ Xdata_k} min_{p ∈ X_{k,r}} ||q - p||_2 means that, for each point of the point cloud Xdata_k, the distance to the nearest point in the reconstructed point cloud X_{k,r} is computed, and these distances are summed;
D_joints(Xdata_k) = Σ_{num=1}^{N_joints} ||pose_k^num - pose_k^{gt,num}||_2
wherein N_joints represents the number of gesture joints, num ∈ [1, N_joints], Pose_k represents the gesture of the k-th group, pose_k^num is the three-dimensional coordinate of the num-th joint of the k-th group gesture, Pose_k^gt represents the annotated gesture of the k-th group, and pose_k^{gt,num} is the three-dimensional coordinate of the num-th joint of the annotated gesture of the k-th group;
if the training sample is not annotated, the loss function model for unannotated samples is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}), wherein D_Chamfer(Xdata_k, X_{k,r}) represents the Chamfer distance between the point cloud Xdata_k and the reconstructed point cloud X_{k,r};
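A hedged Python sketch of the two loss branches; the exact form of D_joints (here a sum of per-joint Euclidean distances) is an assumption, and the Chamfer term follows the unnormalized definition above.

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (|A|, 3) and B (|B|, 3)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # pairwise distances
    return d.min(axis=1).sum() + d.min(axis=0).sum()

def joint_loss(pred_pose, gt_pose):
    """Sum of Euclidean distances over the N_joints predicted vs annotated joints."""
    return np.linalg.norm(pred_pose - gt_pose, axis=1).sum()

def semi_supervised_loss(Xdata_k, X_kr, pred_pose=None, gt_pose=None):
    loss = chamfer_distance(Xdata_k, X_kr)
    if gt_pose is not None:                  # annotated sample: add the joint term
        loss += joint_loss(pred_pose, gt_pose)
    return loss                              # unannotated sample: Chamfer term only
```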
in step 4, the trained gesture estimation network is obtained through optimization training; it comprises three modules, namely the optimized, trained feature extractor, the optimized, trained point cloud reconstructor, and the optimized, trained gesture estimator.
CN202010503293.0A 2020-06-05 2020-06-05 Depth image gesture estimation method based on semi-supervised learning Active CN111797692B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010503293.0A CN111797692B (en) 2020-06-05 2020-06-05 Depth image gesture estimation method based on semi-supervised learning

Publications (2)

Publication Number Publication Date
CN111797692A true CN111797692A (en) 2020-10-20
CN111797692B CN111797692B (en) 2022-05-17

Family

ID=72802891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010503293.0A Active CN111797692B (en) 2020-06-05 2020-06-05 Depth image gesture estimation method based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN111797692B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120119984A1 (en) * 2010-11-15 2012-05-17 Yogesh Sankarasubramaniam Hand pose recognition
CN103778407A (en) * 2012-10-23 2014-05-07 南开大学 Gesture recognition algorithm based on conditional random fields under transfer learning framework
US20190182415A1 (en) * 2015-04-27 2019-06-13 Snap-Aid Patents Ltd. Estimating and using relative head pose and camera field-of-view
CN105955473A (en) * 2016-04-27 2016-09-21 周凯 Computer-based static gesture image recognition interactive system
CN110222580A (en) * 2019-05-09 2019-09-10 中国科学院软件研究所 A kind of manpower 3 d pose estimation method and device based on three-dimensional point cloud
CN110222626A (en) * 2019-06-03 2019-09-10 宁波智能装备研究院有限公司 A kind of unmanned scene point cloud target mask method based on deep learning algorithm
CN110866969A (en) * 2019-10-18 2020-03-06 西北工业大学 Engine blade reconstruction method based on neural network and point cloud registration
CN111150175A (en) * 2019-12-05 2020-05-15 新拓三维技术(深圳)有限公司 Method, device and system for three-dimensional scanning of feet

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIUHAO G. et al.: "Hand PointNet: 3D Hand Pose Estimation Using Point Sets", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
YUJIN C. et al.: "SO-HandNet: Self-Organizing Network for 3D Hand Pose Estimation with Semi-supervised Learning", 2019 IEEE/CVF International Conference on Computer Vision (ICCV) *
ZHANG Hongyuan et al.: "Hand Gesture Pose Estimation Based on Pseudo-3D Convolutional Neural Network", Application Research of Computers *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112635064A (en) * 2020-12-31 2021-04-09 山西三友和智慧信息技术股份有限公司 Early diabetes risk prediction method based on deep PCA (principal component analysis) transformation
CN112635064B (en) * 2020-12-31 2022-08-09 山西三友和智慧信息技术股份有限公司 Early diabetes risk prediction method based on deep PCA (principal component analysis) transformation
CN113129370A (en) * 2021-03-04 2021-07-16 同济大学 Semi-supervised object pose estimation method combining generated data and label-free data
CN113129370B (en) * 2021-03-04 2022-08-19 同济大学 Semi-supervised object pose estimation method combining generated data and label-free data
CN113112607A (en) * 2021-04-19 2021-07-13 复旦大学 Method and device for generating three-dimensional grid model sequence with any frame rate
CN113239834A (en) * 2021-05-20 2021-08-10 中国科学技术大学 Sign language recognition system capable of pre-training sign model perception representation
CN113239834B (en) * 2021-05-20 2022-07-15 中国科学技术大学 Sign language recognition system capable of pre-training sign model perception representation

Also Published As

Publication number Publication date
CN111797692B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN111797692B (en) Depth image gesture estimation method based on semi-supervised learning
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
Chen et al. Alignsdf: Pose-aligned signed distance fields for hand-object reconstruction
CN114863573B (en) Category-level 6D attitude estimation method based on monocular RGB-D image
CN109815847B (en) Visual SLAM method based on semantic constraint
CN111161364B (en) Real-time shape completion and attitude estimation method for single-view depth map
CN114359509B (en) Multi-view natural scene reconstruction method based on deep learning
Tu et al. Consistent 3d hand reconstruction in video via self-supervised learning
CN110223382B (en) Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning
CN117218343A (en) Semantic component attitude estimation method based on deep learning
CN112232106A (en) Two-dimensional to three-dimensional human body posture estimation method
Wang et al. Adversarial learning for joint optimization of depth and ego-motion
CN111860651A (en) Monocular vision-based semi-dense map construction method for mobile robot
CN112183675A (en) Twin network-based tracking method for low-resolution target
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
Zhang et al. EANet: Edge-attention 6D pose estimation network for texture-less objects
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN117315169A (en) Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching
Wang et al. Unsupervised monocular depth estimation with channel and spatial attention
Zou et al. Gpt-cope: A graph-guided point transformer for category-level object pose estimation
CN117788544A (en) Image depth estimation method based on lightweight attention mechanism
Lin et al. Transpose: 6d object pose estimation with geometry-aware transformer
CN117351078A (en) Target size and 6D gesture estimation method based on shape priori
Zhang et al. Hvdistill: Transferring knowledge from images to point clouds via unsupervised hybrid-view distillation
CN115496859A (en) Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant