CN111797692A - Depth image gesture estimation method based on semi-supervised learning - Google Patents
Depth image gesture estimation method based on semi-supervised learning

Info
- Publication number: CN111797692A (application CN202010503293.0A)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- dimensional
- feature
- dimensional point
- gesture
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V20/64: Scenes; type of objects; three-dimensional objects
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045: Neural networks; architecture; combinations of networks
- G06T3/60: Geometric image transformations; rotation of whole images or parts thereof
- G06T7/50: Image analysis; depth or shape recovery
- G06T7/55: Depth or shape recovery from multiple images
- G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
- G06T2207/10028: Range image; depth image; 3D point clouds
Abstract
The invention discloses a depth image gesture estimation method based on semi-supervised learning. Compared with RGB images, hand gestures can be estimated with higher precision from depth images. Existing gesture estimation methods based on deep learning perform well, but they rely too heavily on labeled training data, and annotating three-dimensional gestures in images is very laborious. The invention provides an efficient point cloud representation that effectively fuses local and global features, realizing a new method for estimating three-dimensional hand poses from depth images with high precision. The method reduces the cost of data annotation by reducing the dependence on labeled data during model training. Compared with prior semi-supervised learning methods, the invention achieves a breakthrough in hand pose estimation accuracy while ensuring operational efficiency.
Description
Technical Field
The invention belongs to the technical field of digital image recognition, and particularly relates to a depth image gesture estimation method based on semi-supervised learning.
Background
Automated real-time three-dimensional gesture estimation has gained much attention in recent years, with a wide range of application scenarios including human-machine interaction, computer graphics, and virtual/augmented reality. Through years of intensive research, three-dimensional gesture estimation has made remarkable progress in accuracy and efficiency. Since convolutional neural networks perform well in processing images, most gesture estimation methods are based on them. Some methods use two-dimensional convolutions to process depth images, but due to the lack of three-dimensional spatial information, the features extracted by two-dimensional convolutional neural networks are not suitable for direct three-dimensional pose estimation. To better capture the geometric features of depth data, some methods convert the depth image into a three-dimensional voxel representation and then use three-dimensional convolutions to obtain the hand pose, but three-dimensional convolution has considerable memory and computational requirements. While these approaches have made significant advances in estimation accuracy, they often rely heavily on large amounts of annotated data for network training and rarely consider reducing annotation cost.
Disclosure of Invention
In view of the above-identified deficiencies in the art and needs for improvement, the present invention provides a gesture estimation method based on semi-supervised learning. Regarding the specific representation of three-dimensional data, the invention adopts the point cloud as the representation form of hand depth image data, aiming to retain the three-dimensional characteristics of the depth image while ensuring the efficiency of network operations. Aiming at the problem that training a gesture estimation model requires a large amount of labeled pose information while the cost of three-dimensional gesture annotation is high, the invention designs a semi-supervised deep network architecture that trains the whole network with unlabeled data and a small amount of labeled data, so as to improve the precision of three-dimensional gesture estimation and reduce the labeling cost of training data.
In order to achieve the above object, the present invention provides a depth image gesture estimation method based on semi-supervised learning, comprising the following steps:
step 1: converting the multiple depth images into multiple groups of three-dimensional point clouds through the camera coordinate transformation;
step 2: randomly sampling each group of three-dimensional point clouds to obtain sampled three-dimensional point clouds; performing principal component analysis on the sampled point clouds to obtain the feature vector set of the group; selecting the first, second, and third feature vectors from the set to construct the coordinate axis directions of the point cloud bounding box coordinate system; setting the coordinate mean of the sampled point cloud as the coordinate origin of the bounding box coordinate system; determining the value ranges of the bounding box coordinate system from the coordinate ranges of the point cloud; and converting the group of sampled three-dimensional point clouds into the bounding box coordinate system to obtain the converted three-dimensional point clouds;
step 3: constructing a gesture estimation network from a feature extractor, a point cloud reconstructor and a gesture estimator;
step 4: taking the converted three-dimensional point clouds Xdata_k as training samples, constructing a gesture estimation network loss function model according to whether each training sample is labeled, and obtaining a trained gesture estimation network through optimization training;
preferably, the number of depth images in step 1 is N_K;

the kth depth image is:

d_k(u,v), k∈[1,N_K], u∈[1,M], v∈[1,N]

where d_k(u,v) represents the depth, in the camera coordinate system, of the pixel in row u and column v of the kth depth image, M represents the number of rows of the kth depth image, and N represents the number of columns of the kth depth image;

step 1 converts through the camera coordinate transformation:

z_k(u,v) = d_k(u,v)
x_k(u,v) = (u - c_{k,1}) * z_k(u,v) / f_{k,1}
y_k(u,v) = (v - c_{k,2}) * z_k(u,v) / f_{k,2}
k∈[1,N_K], u∈[1,M], v∈[1,N]

where c_{k,1} and c_{k,2} represent the first and second principal point coordinate parameters of the kth depth image, f_{k,1} and f_{k,2} represent its first and second focal length parameters, N_K is the number of depth images, M represents the number of rows of the kth depth image, and N represents the number of columns;

step 1, the three-dimensional point cloud is:

data_k = (x_k(u,v), y_k(u,v), z_k(u,v)), k∈[1,N_K], u∈[1,M], v∈[1,N]

where data_k represents the kth group of three-dimensional point clouds, (x_k(u,v), y_k(u,v), z_k(u,v)) represents the coordinate point in row u and column v of the kth group, N_K is the number of point cloud groups, M and N are the numbers of rows and columns of the kth point cloud, and M × N is the number of coordinate points in the kth point cloud;
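The step 1 back-projection can be sketched as follows; the function name and the example intrinsic values are illustrative, but the arithmetic follows the three formulas above (1-indexed rows u and columns v, principal point (c1, c2), focal lengths (f1, f2)):

```python
import numpy as np

def depth_to_point_cloud(depth, c1, c2, f1, f2):
    # depth: (M, N) array of d_k(u, v); returns an (M, N, 3) array of (x, y, z)
    M, N = depth.shape
    u = np.arange(1, M + 1)[:, None]  # row indices u in [1, M]
    v = np.arange(1, N + 1)[None, :]  # column indices v in [1, N]
    z = depth.astype(np.float64)      # z_k(u, v) = d_k(u, v)
    x = (u - c1) * z / f1             # x_k(u, v)
    y = (v - c2) * z / f2             # y_k(u, v)
    return np.stack([x, y, z], axis=-1)
```

Each pixel thus maps to one coordinate point of data_k, giving M × N points per depth image.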
preferably, step 2 randomly samples the three-dimensional point cloud to obtain the sampled three-dimensional point cloud as follows:

from data_k = (x_k(u,v), y_k(u,v), z_k(u,v)), k∈[1,N_K], u∈[1,M], v∈[1,N], randomly select L of the M × N coordinate points as the sampled three-dimensional point cloud, specifically defined as:

(x_{k,m}, y_{k,m}, z_{k,m}), k∈[1,N_K], m∈[1,L]

where (x_{k,m}, y_{k,m}, z_{k,m}) is the mth coordinate point of the kth group of sampled point clouds, L is the number of coordinate points in each sampled group, and N_K is the number of sampled point cloud groups;
step 2 performs principal component analysis on the sampled three-dimensional point cloud to obtain the feature vector set of the point cloud: {v_{k,1}, v_{k,2}, ..., v_{k,P}}, where P is the number of principal components taken by the analysis and the set belongs to the kth group of point clouds data_k;

the first, second, and third feature vectors in step 2 are three-dimensional space vectors, in sequence v_{k,1}, v_{k,2}, v_{k,3}; these three feature vectors are mutually orthogonal;

step 2 further constructs the coordinate axis directions of the point cloud bounding box coordinate system: the first three direction vectors v_{k,1}, v_{k,2}, v_{k,3} are taken as the three coordinate axis directions of the bounding box coordinate system of this group of point clouds;

step 2, the coordinate mean of the sampled three-dimensional point cloud: the coordinate mean o_k of the L sampled points is taken as the coordinate origin of the point cloud bounding box coordinate system;
Step 2, determining the value ranges of the point cloud bounding box coordinate system in three directions according to the coordinate range of the three-dimensional point cloud, wherein the value ranges are respectively as follows:
sxk=max((xk,1,xk,2,...,xk,L))-min(xk,1,xk,2,...,xk,L)
syk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
szk=max(xk,1,xk,2,...,xk,L)-min(xk,1,xk,2,...,xk,L)
wherein max represents taking the maximum value, and min represents taking the minimum value;
step 2 converts the sampled three-dimensional point cloud into the point cloud bounding box coordinate system to obtain the converted three-dimensional point cloud:

for the mth point p_{k,m} of the kth group of point clouds, the converted coordinate in the bounding box coordinate system is:

X_{k,m} = R_k (p_{k,m} - o_k) / s_k

where s_k = (sx_k, sy_k, sz_k) and the division is performed per axis;

R_k is determined as follows: the original coordinate axes are rotated in sequence about the z axis, the y axis, and the x axis by the angles yaw_k, pitch_k, and roll_k to align with the bounding box axes; the rotation matrix from the original coordinate system to the point cloud bounding box coordinate system is then defined as:

R_k = R_{k,z} R_{k,y} R_{k,x}

the converted three-dimensional point cloud in step 2 is:

Xdata_k = {X_{k,m}}, k∈[1,N_K], m∈[1,L]

where Xdata_k is the kth group of converted three-dimensional point clouds, L is the number of coordinate points in each converted group, and N_K is the number of converted point cloud groups;
preferably, the feature extractor in step 3 is connected in sequence to the point cloud reconstructor and to the gesture estimator, respectively;

the input data of the feature extractor is the converted three-dimensional point cloud of step 2, namely:

Xdata_k = {X_{k,m}}, k∈[1,N_K], m∈[1,L]

where Xdata_k is the kth group of converted three-dimensional point clouds, L is the number of coordinate points in each converted group, and N_K is the number of converted point cloud groups;
the feature extraction method of the feature extractor in step 3 is specifically as follows:

step 3.1: construct a self-organizing map from the converted point cloud;

for the converted point cloud Xdata_k, construct a self-organizing map point cloud M_k;

M_k imitates the spatial distribution of Xdata_k with fewer points;

here, the self-organizing map is an artificial neural network that uses unsupervised learning to produce a low-dimensional discretized representation of the input space of training samples;

the self-organizing map point cloud M_k preserves the topological properties of Xdata_k through a neighborhood function;

the self-organizing map point cloud is represented as:

M_k = {(x_{k,m}, y_{k,m}, z_{k,m})}, m∈[1,M]

where M is the number of points in the self-organizing map point cloud, and M is smaller than the number of points L in the point cloud Xdata_k;
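A minimal self-organizing map update of the kind described in step 3.1 might look like this; the grid layout, learning rate schedule, and neighborhood width are illustrative assumptions, not the patent's exact parameters:

```python
import numpy as np

def fit_som(points, n_nodes=64, iters=1000, lr0=0.5, sigma0=4.0, rng=0):
    # Fit an n_nodes-point self-organizing map M_k to a point cloud Xdata_k.
    # Nodes live on a 1D index grid; the Gaussian neighborhood function pulls
    # grid-adjacent nodes toward the same input, preserving topology.
    rng = np.random.default_rng(rng)
    nodes = points[rng.choice(len(points), n_nodes, replace=False)].copy()
    grid = np.arange(n_nodes)
    for t in range(iters):
        frac = t / iters
        lr = lr0 * (1 - frac)                       # decaying learning rate
        sigma = sigma0 * (1 - frac) + 1e-3          # shrinking neighborhood
        p = points[rng.integers(len(points))]       # random training point
        winner = np.argmin(((nodes - p) ** 2).sum(axis=1))  # best matching unit
        h = np.exp(-((grid - winner) ** 2) / (2 * sigma**2))
        nodes += lr * h[:, None] * (p - nodes)      # move winner and neighbors
    return nodes
```

Since every update is a convex combination of a node and a data point, the fitted nodes stay inside the bounding box of the input cloud, mimicking its spatial distribution with M < L points.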
step 3.2: extract node features from the converted point cloud;

for each point of the converted point cloud Xdata_k, perform a k-nearest-neighbor search among the nodes of the self-organizing map point cloud M_k;

each point of Xdata_k corresponds to its k nearest nodes, yielding k × L point-node pairs; the coordinates of these k × L points are input to the fully connected layer network module;

the input size of the fully connected layer network module is k × L × 3 and the output size is k × L × F_n;

further, according to the correspondence between the M nodes found by the k-nearest-neighbor search and the k × L points, a maximum pooling operation is applied over the points associated with each node, reducing the k × L × F_n output of the fully connected module to the node feature Feature_{k,n}, a matrix of size M × F_n;

step 3.3: obtain the global feature from the node features;

the node feature Feature_{k,n} is input to a fully connected layer network module with input size M × F_n and output size M × F_g; a maximum pooling operation over the dimension M then yields the global feature Feature_{k,g}, a vector of size 1 × F_g;

the converted point cloud Xdata_k is thus the input of the feature extractor, which outputs the node feature Feature_{k,n} and the global feature Feature_{k,g};

the global feature is defined as: Feature_{k,g} is a vector of size 1 × F_g, where F_g is a constant;

the node feature is defined as: Feature_{k,n} is a matrix of size M × F_n, where F_n is a constant and M is the defined number of nodes.
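The grouping-and-pooling pattern of steps 3.2 and 3.3 can be sketched as follows; the learned fully connected layers are replaced here by fixed random linear maps with ReLU, purely to show the tensor shapes (an M × F_n node feature matrix and an F_g global feature vector), so the weights and function name are illustrative:

```python
import numpy as np

def extract_features(points, nodes, k=8, Fn=16, Fg=32, rng=0):
    # points: (L, 3) converted cloud Xdata_k; nodes: (M, 3) SOM nodes M_k
    rng = np.random.default_rng(rng)
    L, M = len(points), len(nodes)
    # k nearest SOM nodes for every point -> (L, k) node indices
    d2 = ((points[:, None, :] - nodes[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]
    # stand-in for the shared fully connected layers: (L, Fn) point features
    W = rng.normal(size=(3, Fn))
    feats = np.maximum(points @ W, 0.0)
    # node features: max-pool the features of the points assigned to each node
    node_feat = np.full((M, Fn), -np.inf)
    for i in range(L):
        for j in knn[i]:
            node_feat[j] = np.maximum(node_feat[j], feats[i])
    node_feat[np.isinf(node_feat)] = 0.0     # nodes with no assigned point
    # global feature: second linear map, then max-pool over the M nodes
    Wg = rng.normal(size=(Fn, Fg))
    global_feat = np.maximum(node_feat @ Wg, 0.0).max(axis=0)
    return node_feat, global_feat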
The point cloud reconstructor reconstructs the kth group of reconstructed three-dimensional point clouds X_{k,r} from the global feature Feature_{k,g}:

step 3, the point cloud reconstructor is specifically defined as follows:

using the global feature vector Feature_{k,g}, reconstruct the three-dimensional point cloud and output the reconstructed three-dimensional point cloud X_{k,r};

the point cloud reconstructor adopts a reconstruction network structure comprising two branches; the fully connected branch, composed of 4 fully connected layers, independently predicts the position of each point and describes complex structures well;

the convolutional branch consists of 5 convolutional layers, each followed by a deconvolution layer, which makes full use of spatial continuity;

the output module consists of two 1 × 1 convolution layers; the prediction results of the two branches are merged to obtain the complete reconstructed point cloud X_{k,r};
The global feature Feature_{k,g} and node feature Feature_{k,n} are passed through the gesture estimator to estimate the three-dimensional gesture;

step 3, the gesture estimator is specifically defined as follows:

the global feature Feature_{k,g} obtained in the feature extraction stage is fused with the local (node) feature Feature_{k,n}: the global feature vector of size 1 × F_g is copied M times along the first dimension, raising it to a matrix of size M × F_g with the same first dimension as the M × F_n local feature matrix; the two matrices are then spliced, i.e. concatenated, along the second dimension to form the fused feature matrix Feature_{k,f} of size M × (F_n + F_g);

a new feature matrix is obtained through a fully connected layer network module whose output length is defined as N_L; the input size of the fully connected layer is M × (F_n + F_g) and the output size is M × N_L;

the M × N_L feature matrix output by the fully connected layer is then average-pooled over the dimension M to obtain the overall feature vector V_{k,f} of length N_L;

the three-dimensional gesture Pose_k is estimated from the overall feature vector V_{k,f} obtained by fusing the global feature Feature_{k,g} and node feature Feature_{k,n};

the method for estimating the three-dimensional gesture is as follows:

regress the three-dimensional coordinates of the hand key points from the overall feature vector using several fully connected layer network modules;

the overall feature vector of length N_L passes through fully connected layer network modules whose dimensions are (N_L, U, V, 3 × N_joints), yielding a vector of length 3 × N_joints, which is then represented as the three-dimensional coordinates of the N_joints joint points Pose_k, where N_L is the input length of the fully connected layers and U and V are the first and second hidden layer dimensions;

the three-dimensional gesture is defined as:

Pose_k = {P_{k,num}}, num∈[1, N_joints]

where Pose_k represents the kth group's gesture, P_{k,num} is the three-dimensional coordinate of the num-th joint, and N_joints represents the number of gesture joints;
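The fusion and regression pipeline of the gesture estimator can be sketched as follows; the learned layers are again stood in for by random linear maps to demonstrate the shapes (tile, concatenate, project to N_L, average-pool over M, regress 3 × N_joints coordinates), so the weights and function name are illustrative:

```python
import numpy as np

def estimate_pose(node_feat, global_feat, NL=64, n_joints=21, rng=0):
    # node_feat: (M, Fn) matrix Feature_{k,n}; global_feat: (Fg,) Feature_{k,g}
    rng = np.random.default_rng(rng)
    M, Fn = node_feat.shape
    Fg = global_feat.shape[0]
    # copy the global feature M times and concatenate -> (M, Fn + Fg)
    fused = np.concatenate([node_feat, np.tile(global_feat, (M, 1))], axis=1)
    # stand-in fully connected layer -> (M, NL), then average-pool over M
    W1 = rng.normal(size=(Fn + Fg, NL))
    V = np.maximum(fused @ W1, 0.0).mean(axis=0)   # overall vector V_{k,f}
    # stand-in regression head -> 3 * n_joints joint coordinates
    W2 = rng.normal(size=(NL, 3 * n_joints))
    return (V @ W2).reshape(n_joints, 3)           # Pose_k
```

Average pooling (rather than max pooling) at this stage matches the description above and keeps every node's contribution in the overall vector.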
preferably, step 4 constructs the gesture estimation network loss function model according to whether the training sample is labeled, as follows:

if the training sample is labeled, the loss function model for a labeled sample is:

LOSS = D_Chamfer(Xdata_k, X_{k,r}) + D_joints(Xdata_k)

where D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between Xdata_k and X_{k,r}:

D_Chamfer(Xdata_k, X_{k,r}) = (1/|X_{k,r}|) Σ_{p∈X_{k,r}} min_{q∈Xdata_k} ||p - q||² + (1/L) Σ_{q∈Xdata_k} min_{p∈X_{k,r}} ||p - q||², k∈[1,N_K]

where |X_{k,r}| denotes the number of points in the kth group of reconstructed point clouds, Xdata_k is the kth group of converted three-dimensional point clouds, L is the number of coordinate points in each converted group, and N_K is the number of converted point cloud groups; the first term computes, for each point of the reconstructed point cloud X_{k,r}, its distance to the nearest point of the point cloud Xdata_k, and sums these distances; likewise, the second term computes, for each point of the point cloud Xdata_k, its distance to the nearest point of X_{k,r}, and sums these distances.

D_joints is the joint supervision term, where N_joints represents the number of gesture joints, num∈[1, N_joints], Pose_k represents the kth group's gesture, P_{k,num} is the three-dimensional coordinate of the num-th joint of the kth group's gesture, and P̂_{k,num} is the three-dimensional coordinate of the num-th joint of the kth group's annotated gesture; the calculation formula is:

D_joints(Xdata_k) = Σ_{num=1}^{N_joints} ||P_{k,num} - P̂_{k,num}||²

if the training sample is unlabeled, the loss function model for an unlabeled sample is:

LOSS = D_Chamfer(Xdata_k, X_{k,r})

where D_Chamfer(Xdata_k, X_{k,r}) denotes the Chamfer distance between the point cloud Xdata_k and the reconstructed point cloud X_{k,r};
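The Chamfer term shared by both branches of the loss can be computed directly in numpy as follows; this is a minimal sketch of the symmetric nearest-neighbor distance described above, with each directed sum averaged over its set size:

```python
import numpy as np

def chamfer_distance(X, Y):
    # Symmetric Chamfer distance between point sets X (P, 3) and Y (Q, 3):
    # for each point, the squared distance to its nearest neighbor in the
    # other set, averaged over each set and summed.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # (P, Q) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

Because this term needs no pose annotation, it is the entire loss for unlabeled samples, which is what lets unannotated depth images contribute to training.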
step 4 obtains the trained gesture estimation network through optimization training; the trained network comprises three modules: the optimized feature extractor, the optimized point cloud reconstructor, and the optimized gesture estimator.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
the application range is wide. The self-encoder based on the point cloud can be integrated into various networks for recovering a three-dimensional structure from a depth image, and the spatial representation capability of the extracted features is greatly improved.
The generalization ability is strong. Compared with the method of directly using the depth image as network input, the method only saves space position coordinates in a point cloud representation form, and therefore the method can be widely applied to various types of three-dimensional data.
The efficiency is high. Compared with other three-dimensional representation forms, the point cloud three-dimensional representation method has the advantages that the point cloud is used as network input, the orderless three-dimensional representation method only comprises the space coordinates of each point, and the network calculation amount can be reduced.
The precision is high. The design provides that a three-dimensional point cloud reconstruction part fully utilizes intermediate representation in a network, extracts multi-level features through an encoder guided by a self-organizing map, and models the spatial distribution of the point cloud. The decoder reconstructs hand point cloud from the encoded global features, so that the features learned by the encoder contain more hand space information, thereby improving the hand posture estimation effect.
The dependency on the annotation data is low. The invention designs a semi-supervised training strategy applied to a gesture estimation task, which trains the whole network by using a small amount of labeled data and optimizes the network by fully utilizing unannotated data.
Therefore, the three-dimensional gesture estimation method provided by the invention has high recognition precision and reduces the labeling cost of training data.
Drawings
FIG. 1: a flow chart of the method of the invention;
FIG. 2: a structure diagram of the neural network of the invention;
FIG. 3: a schematic diagram of the effect of adapting the initialized self-organizing map to the input hand point cloud in an embodiment of the invention;
FIG. 4: a network structure diagram of the fusion of node features and global features in the invention;
FIG. 5: a schematic diagram of the network structure that recovers the reconstructed point cloud from the global features in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Self-organizing maps are artificial neural networks that use unsupervised learning to produce a low-dimensional discretized representation of the input space of training samples, which differs from other artificial neural networks in that they use a proximity function to preserve the topological properties of the input space. In the present invention, the low dimensional representation M models the spatial distribution of the point cloud X with fewer points.
The invention provides a depth image gesture estimation method based on semi-supervised learning; the overall structure is shown in FIG. 2. The system comprises: a data conversion module; a feature extraction module based on a point cloud processing network; a point cloud feature decoding module for reconstructing the hand three-dimensional point cloud; and a gesture estimation module based on multi-level features.
The depth image gesture estimation method based on semi-supervised learning provided by the present invention is specifically described below with reference to fig. 1 to 5, and specifically includes the following steps:
step 1: converting the multiple depth images into multiple groups of three-dimensional point clouds through a camera coordinate system;
step 1, the number of depth images is N_K;

the kth depth image is:

d_k(u,v), k∈[1,N_K], u∈[1,M], v∈[1,N]

where d_k(u,v) represents the depth, in the camera coordinate system, of the pixel in row u and column v of the kth depth image, M represents the number of rows of the kth depth image, and N represents the number of columns of the kth depth image;

step 1 converts through the camera coordinate transformation:

z_k(u,v) = d_k(u,v)
x_k(u,v) = (u - c_{k,1}) * z_k(u,v) / f_{k,1}
y_k(u,v) = (v - c_{k,2}) * z_k(u,v) / f_{k,2}
k∈[1,N_K], u∈[1,M], v∈[1,N]

where c_{k,1} and c_{k,2} represent the first and second principal point coordinate parameters of the kth depth image, f_{k,1} and f_{k,2} represent its first and second focal length parameters, N_K is the number of depth images, and M and N are the numbers of rows and columns of the kth depth image;

step 1, the three-dimensional point cloud is:

data_k = (x_k(u,v), y_k(u,v), z_k(u,v)), k∈[1,N_K], u∈[1,M], v∈[1,N]

where data_k represents the kth group of three-dimensional point clouds, (x_k(u,v), y_k(u,v), z_k(u,v)) represents the coordinate point in row u and column v of the kth group, N_K is the number of point cloud groups, M and N are the numbers of rows and columns of the kth point cloud, and M × N is the number of coordinate points in the kth point cloud;
step 2: randomly sampling each group of three-dimensional point clouds to obtain sampled three-dimensional point clouds; performing principal component analysis on the sampled point clouds to obtain the feature vector set of the group; selecting the first, second, and third feature vectors from the set to construct the coordinate axis directions of the point cloud bounding box coordinate system; setting the coordinate mean of the sampled point cloud as the coordinate origin of the bounding box coordinate system; determining the value ranges of the bounding box coordinate system from the coordinate ranges of the point cloud; and converting the group of sampled three-dimensional point clouds into the bounding box coordinate system to obtain the converted three-dimensional point clouds;
step 2, the three-dimensional point cloud is randomly sampled to obtain the sampled three-dimensional point cloud:
From data_k = (x_k(u,v), y_k(u,v), z_k(u,v)), k∈[1,N_K], u∈[1,M], v∈[1,N], L coordinate points are randomly selected from the M × N coordinate points as the sampled three-dimensional point cloud, specifically defined as:
(x_{k,m}, y_{k,m}, z_{k,m}), k∈[1,N_K], m∈[1,L]
wherein (x_{k,m}, y_{k,m}, z_{k,m}) is the m-th coordinate point in the k-th group of sampled three-dimensional point clouds, L is the number of coordinate points in the k-th group after sampling, and N_K is the number of sampled point-cloud groups;
step 2, principal component analysis is performed on the sampled three-dimensional point cloud to obtain the feature vector set of the group of point clouds, wherein p is the number of principal components taken by the principal component analysis and data_k is the k-th group of three-dimensional point clouds;
the first feature vector, the second feature vector and the third feature vector in step 2 are three-dimensional space vectors, and the three of them are mutually orthogonal;
step 2, the coordinate-axis directions of the point cloud bounding box coordinate system are further constructed as follows: the first three feature vectors are taken as the three coordinate-axis directions of the point cloud bounding box coordinate system of the group of point clouds;
step 2, the coordinate average of the sampled three-dimensional point cloud: the coordinate mean of the L sampled points is taken as the coordinate origin of the point cloud bounding box coordinate system;
Step 2, determining the value ranges of the point cloud bounding box coordinate system in three directions according to the coordinate range of the three-dimensional point cloud, wherein the value ranges are respectively as follows:
sx_k = max(x_{k,1}, x_{k,2}, ..., x_{k,L}) - min(x_{k,1}, x_{k,2}, ..., x_{k,L})
sy_k = max(y_{k,1}, y_{k,2}, ..., y_{k,L}) - min(y_{k,1}, y_{k,2}, ..., y_{k,L})
sz_k = max(z_{k,1}, z_{k,2}, ..., z_{k,L}) - min(z_{k,1}, z_{k,2}, ..., z_{k,L})
wherein max represents taking the maximum value, and min represents taking the minimum value;
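A minimal NumPy sketch of this bounding-box construction (PCA axes, mean origin, per-axis extents from coordinate ranges); the function and variable names are illustrative, and the random cloud is only a stand-in for a sampled hand point cloud:

```python
import numpy as np

def bounding_box_frame(points):
    """Step-2 bounding-box coordinate system for one sampled cloud (L x 3):
    PCA eigenvectors as the axis directions, the coordinate mean as the
    origin, and max - min coordinate ranges as the extents (sx, sy, sz)."""
    origin = points.mean(axis=0)                        # coordinate mean as origin
    cov = np.cov((points - origin).T)                   # 3 x 3 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, np.argsort(eigvals)[::-1]]        # first/second/third axis
    extents = points.max(axis=0) - points.min(axis=0)   # sx_k, sy_k, sz_k
    return origin, axes, extents

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])  # anisotropic cloud
origin, axes, extents = bounding_box_frame(pts)
```

The columns of `axes` are orthonormal, so they can serve directly as the three coordinate-axis directions of the bounding box coordinate system.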
step 2, the sampled three-dimensional point cloud is converted into the point cloud bounding box coordinate system to obtain the converted three-dimensional point cloud:
for the m-th point in the k-th group of point clouds, the coordinates after conversion into the point cloud bounding box coordinate system are obtained by translating by the coordinate origin, rotating by R_k and scaling by s_k, wherein s_k = (sx_k, sy_k, sz_k);
R_k is determined as follows:
the original coordinate axes are rotated in turn about the z axis, the y axis and the x axis, through angles yaw_k, pitch_k and roll_k respectively, to align with the bounding box axes; the rotation matrix from the original coordinate system to the point cloud bounding box coordinate system is then defined as:
R_k = R_{k,z} R_{k,y} R_{k,x}
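The composition R_k = R_{k,z} R_{k,y} R_{k,x} can be written out as a standard z-y-x Euler construction; the angle value below is a placeholder for illustration:

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    """Compose R = Rz @ Ry @ Rx from rotations about the z axis (yaw),
    y axis (pitch) and x axis (roll), as in step 2."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

R = rotation_matrix(np.pi / 2, 0.0, 0.0)  # pure 90-degree yaw about z
```

A pure 90° yaw maps the x axis onto the y axis, and the composed matrix is orthogonal, as any rotation matrix must be.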
the converted three-dimensional point cloud in step 2 is:
Xdata_k, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group after conversion, and N_K is the number of converted point-cloud groups;
and step 3: constructing a gesture estimation network through a feature extractor, a point cloud reconstructor and a gesture estimator;
step 3, the feature extractor is connected to the point cloud reconstructor and to the gesture estimator respectively;
the input data of the feature extractor is the converted three-dimensional point cloud of step 2, namely:
Xdata_k, k∈[1,N_K], m∈[1,L]
wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group after conversion, and N_K is the number of converted point-cloud groups;
the feature extraction method of the feature extractor in step 3 specifically comprises the following steps:
step 3.1, constructing a self-organizing map from the converted point cloud;
for the converted point cloud Xdata_k, a self-organizing map point cloud M_k is constructed;
M_k imitates the spatial distribution of Xdata_k with fewer points; as shown in the effect diagram of FIG. 3, the randomly initialized map nodes are fitted into the space of the point cloud Xdata_k, yielding the self-organizing map point cloud M_k.
wherein the self-organizing map is an artificial neural network that uses unsupervised learning to produce a low-dimensional discretized representation of the input space of the training samples;
the self-organizing map point cloud M_k retains the topological properties of Xdata_k through a neighborhood function;
the self-organizing map point cloud is represented as:
M_k = {(x_{k,m}, y_{k,m}, z_{k,m})}, m∈[1,M]
wherein M is the number of points in the self-organizing map point cloud and is smaller than the number L of points in the point cloud Xdata_k; the self-organizing map point cloud M_k and the converted point cloud Xdata_k are shown in FIG. 2.
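The node-fitting idea of step 3.1 can be sketched as a small generic SOM in NumPy; this is a textbook SOM update, not the patent's exact training procedure, and the grid size, learning rate and neighborhood width are illustrative choices:

```python
import numpy as np

def fit_som(points, grid=(8, 8), iters=50, lr=0.5, sigma=2.0, seed=0):
    """Minimal self-organizing map: M = grid[0]*grid[1] nodes are pulled
    toward the point cloud, with a Gaussian neighborhood function on the
    2-D lattice preserving the grid topology (step 3.1, sketched)."""
    rng = np.random.default_rng(seed)
    M = grid[0] * grid[1]
    gy, gx = np.mgrid[0:grid[0], 0:grid[1]]
    lattice = np.stack([gy.ravel(), gx.ravel()], axis=1).astype(float)
    nodes = points[rng.choice(len(points), M, replace=False)].copy()
    for t in range(iters):
        s = sigma * (1.0 - t / iters) + 0.5   # shrinking neighborhood width
        decay = 1.0 - t / iters               # decaying learning rate
        for p in points:
            winner = np.argmin(((nodes - p) ** 2).sum(axis=1))
            d2 = ((lattice - lattice[winner]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * s * s))     # neighborhood weights
            nodes += lr * decay * h[:, None] * (p - nodes)
    return nodes

pts = np.random.default_rng(1).normal(size=(200, 3))
som = fit_som(pts, grid=(4, 4), iters=10)   # 16 nodes imitate 200 points
```

Every update is a convex combination of a node and a data point, so the fitted nodes stay inside the data's bounding region, which is exactly the "fewer points imitating the spatial distribution" behavior described above.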
Step 3.2, extracting node characteristics from the converted point cloud;
each point of the converted point cloud Xdata_k is matched against the self-organizing map point cloud M_k by a k-nearest-neighbor search;
each point of Xdata_k corresponds to its k nearest nodes, giving k × L point–node pairs, whose coordinates are input to a fully connected layer network module;
the input size of the fully connected layer network module is k × L × 3 and the output size is k × L × F_n;
further, according to the correspondence between the M nodes found by the k-nearest-neighbor search and the k × L points, a max-pooling operation is taken over the k × L × F_n output of the fully connected layer module, pooling the features associated with each node, to obtain the node feature Feature_{k,n}, a matrix of size M × F_n;
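A simplified NumPy sketch of step 3.2: here each point is grouped with its single nearest node (rather than the full k-nearest-neighbor scheme), lifted by one shared fully connected layer with random illustrative weights, and max-pooled per node:

```python
import numpy as np

def node_features(points, nodes, W, b):
    """Simplified step 3.2: assign each of the L points to its nearest
    of the M SOM nodes, lift coordinates with a shared FC layer + ReLU
    (W: 3 x Fn), then max-pool per node into an M x Fn feature matrix.
    W, b are illustrative random weights, not trained parameters."""
    M, Fn = len(nodes), W.shape[1]
    d2 = ((points[:, None, :] - nodes[None, :, :]) ** 2).sum(-1)  # L x M
    assign = d2.argmin(axis=1)                      # nearest node per point
    feat = np.maximum(points @ W + b, 0.0)          # L x Fn, shared FC + ReLU
    out = np.full((M, Fn), -np.inf)
    for m in range(M):
        sel = feat[assign == m]
        if len(sel):
            out[m] = sel.max(axis=0)                # max pooling per node
    out[np.isinf(out)] = 0.0                        # empty nodes -> zeros
    return out

rng = np.random.default_rng(2)
pts = rng.normal(size=(64, 3))
nds = rng.normal(size=(8, 3))
F = node_features(pts, nds, rng.normal(size=(3, 16)), np.zeros(16))
```

The output has one F_n-dimensional feature row per node, matching the M × F_n shape of Feature_{k,n}.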
step 3.3, obtaining global characteristics from the node characteristics;
the node feature Feature_{k,n} is input to a fully connected layer network module with input size M × F_n and output size M × F_g; a max-pooling operation is then taken over the M dimension to obtain the global feature Feature_{k,g}, a vector of size 1 × F_g;
the processed point cloud Xdata_k is thus taken as input to the feature extractor, which outputs the node feature Feature_{k,n} and the global feature Feature_{k,g};
the global feature is defined as: Feature_{k,g} is a vector of size 1 × F_g, where F_g is a constant; N_L denotes the length of the global feature vector, e.g. N_L = 1024;
the node feature is defined as: Feature_{k,n} is a matrix of size M × F_n, where F_n is a constant and M is the defined number of nodes.
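Step 3.3 then reduces to one shared layer plus a max-pool over the node dimension; the weights below are again random placeholders rather than trained parameters:

```python
import numpy as np

def global_feature(node_feat, W, b):
    """Step 3.3, sketched: push the M x Fn node features through a shared
    FC layer + ReLU (Fn -> Fg), then max-pool over the M dimension to get
    a single 1 x Fg global feature vector."""
    hidden = np.maximum(node_feat @ W + b, 0.0)   # M x Fg
    return hidden.max(axis=0, keepdims=True)      # 1 x Fg, max over nodes

rng = np.random.default_rng(3)
nf = rng.normal(size=(8, 16))                     # M=8 nodes, Fn=16
g = global_feature(nf, rng.normal(size=(16, 32)), np.zeros(32))  # Fg=32
```

Max-pooling over nodes makes the global feature invariant to the ordering of the M nodes, which is why this reduction is standard for point-set inputs.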
the point cloud reconstructor reconstructs, from the global feature Feature_{k,g}, the k-th group of reconstructed three-dimensional point clouds X_{k,r}:
step 3, the point cloud reconstructor is specifically defined as follows:
the global feature vector Feature_{k,g} is used to reconstruct the three-dimensional point cloud, and the reconstructed three-dimensional point cloud X_{k,r} is output;
the point cloud reconstructor adopts a reconstruction network structure comprising two branches; the fully connected layer branch, composed of 4 fully connected layers, independently predicts the position of each point and is good at describing complex structures, as shown in FIG. 5;
the convolutional branch consists of 5 convolutional layers, each followed by a deconvolution layer, which makes full use of spatial continuity;
the output point cloud module is composed of two 1 × 1 convolution layers; the prediction results of the two branches are merged together to obtain the complete reconstructed point cloud X_{k,r};
the global feature Feature_{k,g} and the node feature Feature_{k,n} are passed through the gesture estimator to estimate the three-dimensional gesture;
step 3, the gesture estimator is specifically defined as:
the global feature Feature_{k,g} obtained in the feature extraction stage and the local feature Feature_{k,n} are fused: the global feature vector of size 1 × F_g is copied M times along the first dimension, raising it to a matrix of size M × F_g whose first dimension matches that of the local feature matrix M × F_n; the two matrices are then spliced, i.e. concatenated, along the second dimension into the fused feature matrix Feature_{k,f} of size M × (F_n + F_g); this process is illustrated in FIG. 4.
A new feature matrix is obtained through a fully connected layer network module whose output length is defined as N_L = 1024; the resulting matrix has size M × N_L, the input size of the fully connected layer being M × (F_n + F_g) and the output size M × N_L;
the M × N_L feature matrix output by the fully connected layer is then average-pooled over the M dimension to obtain the overall feature vector V_{k,f} of length N_L = 1024;
the overall feature vector V_{k,f}, obtained by fusing the global feature Feature_{k,g} and the node feature Feature_{k,n}, is used to estimate the three-dimensional gesture Pose_k;
The method for estimating the three-dimensional gesture comprises the following steps:
the three-dimensional coordinates of the hand key points are regressed from the overall feature vector by a plurality of fully connected layer network modules;
the overall feature vector of length N_L = 1024 passes through fully connected layer network modules of dimensions (N_L, U, V, 3 × N_joints) to obtain a 3 × N_joints output, which is then reshaped into the three-dimensional coordinates Pose_k of the N_joints joint points, wherein N_L = 1024 is the length of the overall feature vector input to the fully connected layers, and U = 512 and V = 256 are respectively the first and second hidden-layer dimensions of the fully connected layers;
the three-dimensional gesture is defined as Pose_k, wherein Pose_k represents the k-th group gesture, its num-th element is the three-dimensional coordinate of the num-th joint, and N_joints denotes the number of gesture joints; this patent takes N_joints = 21;
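The regression head described above can be sketched as a plain MLP (1024 → 512 → 256 → 3 × 21); the weight matrices are random placeholders, so the output is only shape-correct, not a meaningful pose:

```python
import numpy as np

def estimate_pose(v, Ws, n_joints=21):
    """Sketch of the pose regression head: map the length-1024 overall
    feature vector through 1024 -> 512 -> 256 -> 3 * n_joints and reshape
    into n_joints three-dimensional joint coordinates."""
    h = np.maximum(v @ Ws[0], 0.0)        # 1024 -> 512, ReLU
    h = np.maximum(h @ Ws[1], 0.0)        # 512 -> 256, ReLU
    return (h @ Ws[2]).reshape(n_joints, 3)  # 256 -> 63, no activation

rng = np.random.default_rng(4)
Ws = [0.01 * rng.normal(size=(1024, 512)),   # illustrative random weights
      0.01 * rng.normal(size=(512, 256)),
      0.01 * rng.normal(size=(256, 63))]
pose = estimate_pose(rng.normal(size=1024), Ws)
```

With N_joints = 21 the final layer outputs 63 values, reshaped into a 21 × 3 joint-coordinate matrix Pose_k.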
and step 4: the converted three-dimensional point clouds Xdata_k are taken as training samples; a gesture estimation network loss function model is constructed according to whether each training sample is annotated, and the trained gesture estimation network is obtained through optimization training, as shown in FIG. 2.
Step 4, constructing a gesture estimation network loss function model according to whether the training samples are marked or not is as follows:
if the training sample is annotated, the gesture estimation network loss function model for annotated samples is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}) + D_joints(Xdata_k)
wherein D_Chamfer(Xdata_k, X_{k,r}) represents the Chamfer distance between Xdata_k and X_{k,r}, k∈[1,N_K], m∈[1,L],
wherein |X_{k,r}| represents the number of points in the k-th group of reconstructed point clouds, Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group after conversion, and N_K is the number of converted point-cloud groups; one term computes, for each point of the k-th reconstructed point cloud X_{k,r}, the distance to the nearest point in the point cloud Xdata_k, and then sums these distances; likewise, the other term computes, for each point of the point cloud Xdata_k, the distance to the nearest point in X_{k,r}, and then sums these distances.
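The symmetric Chamfer distance can be sketched as follows; the patent's exact formula is an image lost in extraction, so the normalization below (mean of nearest-neighbor distances in each direction) is one common convention, not necessarily the patent's:

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (|A| x 3) and
    B (|B| x 3): for every point of each cloud, the distance to the
    nearest point of the other cloud, averaged per cloud and summed."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))  # |A| x |B|
    return d.min(axis=1).mean() + d.min(axis=0).mean()

A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
B = A.copy()
```

Identical clouds give distance zero, and any displacement makes it strictly positive, which is the property the reconstruction loss relies on.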
wherein N_joints represents the number of gesture joints, num∈[1, N_joints]; Pose_k represents the k-th group gesture, whose num-th element is the three-dimensional coordinate of the num-th joint of the gesture; correspondingly, the annotated gesture of the k-th group has the annotated three-dimensional coordinates of the num-th joint, and D_joints is calculated from the estimated and annotated joint coordinates as follows:
if the training sample is not annotated, the gesture estimation network loss function model for unannotated samples is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}), wherein D_Chamfer(Xdata_k, X_{k,r}) represents the Chamfer distance between the point cloud Xdata_k and the reconstructed point cloud X_{k,r};
step 4, the trained gesture estimation network obtained through optimization training comprises three modules: the optimization-trained feature extractor, the optimization-trained point cloud reconstructor and the optimization-trained gesture estimator.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (5)
1. A depth image gesture estimation method based on semi-supervised learning is characterized by comprising the following steps:
step 1: converting the multiple depth images into multiple groups of three-dimensional point clouds through a camera coordinate system;
step 2: randomly sample each group of three-dimensional point clouds to obtain a sampled three-dimensional point cloud; perform principal component analysis on the sampled point cloud to obtain the feature vector set of the group; select the first, second and third feature vectors in turn from that set to construct the coordinate-axis directions of a point cloud bounding box coordinate system; take the coordinate mean of the sampled point cloud as the coordinate origin of the bounding box coordinate system; determine the value ranges of the bounding box coordinate system from the coordinate range of the point cloud; and convert the group of sampled three-dimensional point clouds into the bounding box coordinate system to obtain the converted three-dimensional point cloud;
and step 3: constructing a gesture estimation network through a feature extractor, a point cloud reconstructor and a gesture estimator;
and step 4: the converted three-dimensional point clouds Xdata_k are taken as training samples; a gesture estimation network loss function model is constructed according to whether each training sample is annotated, and the trained gesture estimation network is obtained through optimization training.
2. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: in step 1, the number of depth images is N_K;
The kth depth image is:
d_k(u,v), k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein d_k(u,v) represents the depth, in the camera coordinate system, of the pixel point in the u-th row and the v-th column of the k-th depth image, M represents the number of rows of the k-th depth image, and N represents the number of columns of the k-th depth image;
step 1, the conversion through the camera coordinate system transformation is:
z_k(u,v) = d_k(u,v)
x_k(u,v) = (u - c_{k,1}) * z_k(u,v) / f_{k,1}
y_k(u,v) = (v - c_{k,2}) * z_k(u,v) / f_{k,2}
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein c_{k,1} represents the first coordinate parameter of the k-th depth image, c_{k,2} the second coordinate parameter of the k-th depth image, f_{k,1} the first focal length parameter of the k-th depth image, f_{k,2} the second focal length parameter of the k-th depth image, N_K the number of depth images, M the number of rows of the k-th depth image, and N the number of columns of the k-th depth image;
step 1, the three-dimensional point cloud is:
(x_k(u,v), y_k(u,v), z_k(u,v))
k∈[1,N_K], u∈[1,M], v∈[1,N]
wherein data_k represents the k-th group of three-dimensional point clouds, (x_k(u,v), y_k(u,v), z_k(u,v)) represents the coordinate point in the u-th row and the v-th column of the k-th group of three-dimensional point clouds, N_K is the number of point-cloud groups, M represents the number of rows of the k-th group of three-dimensional point clouds, N represents the number of columns of the k-th group of three-dimensional point clouds, and M × N represents the number of coordinate points in the k-th group of three-dimensional point clouds.
3. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: in step 2, the three-dimensional point cloud is randomly sampled to obtain the sampled three-dimensional point cloud:
From data_k = (x_k(u,v), y_k(u,v), z_k(u,v)), k∈[1,N_K], u∈[1,M], v∈[1,N], L coordinate points are randomly selected from the M × N coordinate points as the sampled three-dimensional point cloud, specifically defined as:
(x_{k,m}, y_{k,m}, z_{k,m}), k∈[1,N_K], m∈[1,L]
wherein (x_{k,m}, y_{k,m}, z_{k,m}) is the m-th coordinate point in the k-th group of sampled three-dimensional point clouds, L is the number of coordinate points in the k-th group after sampling, and N_K is the number of sampled point-cloud groups;
step 2, principal component analysis is performed on the sampled three-dimensional point cloud to obtain the feature vector set of the group of point clouds, wherein p is the number of principal components taken by the principal component analysis and data_k is the k-th group of three-dimensional point clouds;
the first feature vector, the second feature vector and the third feature vector in step 2 are three-dimensional space vectors, and the three of them are mutually orthogonal;
step 2, the coordinate-axis directions of the point cloud bounding box coordinate system are further constructed as follows: the first three feature vectors are taken as the three coordinate-axis directions of the point cloud bounding box coordinate system of the group of point clouds;
step 2, the coordinate average of the sampled three-dimensional point cloud: the coordinate mean of the L sampled points is taken as the coordinate origin of the point cloud bounding box coordinate system;
Step 2, determining the value ranges of the point cloud bounding box coordinate system in three directions according to the coordinate range of the three-dimensional point cloud, wherein the value ranges are respectively as follows:
sx_k = max(x_{k,1}, x_{k,2}, ..., x_{k,L}) - min(x_{k,1}, x_{k,2}, ..., x_{k,L})
sy_k = max(y_{k,1}, y_{k,2}, ..., y_{k,L}) - min(y_{k,1}, y_{k,2}, ..., y_{k,L})
sz_k = max(z_{k,1}, z_{k,2}, ..., z_{k,L}) - min(z_{k,1}, z_{k,2}, ..., z_{k,L})
wherein max represents taking the maximum value, and min represents taking the minimum value;
step 2, the sampled three-dimensional point cloud is converted into the point cloud bounding box coordinate system to obtain the converted three-dimensional point cloud:
for the m-th point in the k-th group of point clouds, the coordinates after conversion into the point cloud bounding box coordinate system are obtained by translating by the coordinate origin, rotating by R_k and scaling by s_k, wherein s_k = (sx_k, sy_k, sz_k);
R_k is determined as follows:
the original coordinate axes are rotated in turn about the z axis, the y axis and the x axis, through angles yaw_k, pitch_k and roll_k respectively, to align with the bounding box axes; the rotation matrix from the original coordinate system to the point cloud bounding box coordinate system is then defined as:
R_k = R_{k,z} R_{k,y} R_{k,x}
the converted three-dimensional point cloud in step 2 is Xdata_k, wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group after conversion, and N_K is the number of converted point-cloud groups.
4. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: in step 3, the feature extractor is connected to the point cloud reconstructor and to the gesture estimator respectively;
the input data of the feature extractor is the converted three-dimensional point cloud of step 2, namely Xdata_k, wherein Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group after conversion, and N_K is the number of converted point-cloud groups;
the feature extraction method of the feature extractor in step 3 specifically comprises the following steps:
step 3.1, constructing a self-organizing map from the converted point cloud;
for the converted point cloud Xdata_k, a self-organizing map point cloud M_k is constructed;
M_k imitates the spatial distribution of Xdata_k with fewer points;
wherein the self-organizing map is an artificial neural network that uses unsupervised learning to produce a low-dimensional discretized representation of the input space of the training samples;
the self-organizing map point cloud M_k retains the topological properties of Xdata_k through a neighborhood function;
the self-organizing map point cloud is represented as:
M_k = {(x_{k,m}, y_{k,m}, z_{k,m})}, m∈[1,M]
wherein M is the number of points in the self-organizing map point cloud and is smaller than the number L of points in the point cloud Xdata_k;
step 3.2, extracting node characteristics from the converted point cloud;
each point of the converted point cloud Xdata_k is matched against the self-organizing map point cloud M_k by a k-nearest-neighbor search;
each point of Xdata_k corresponds to its k nearest nodes, giving k × L point–node pairs, whose coordinates are input to a fully connected layer network module;
the input size of the fully connected layer network module is k × L × 3 and the output size is k × L × F_n;
further, according to the correspondence between the M nodes found by the k-nearest-neighbor search and the k × L points, a max-pooling operation is taken over the k × L × F_n output of the fully connected layer module, pooling the features associated with each node, to obtain the node feature Feature_{k,n}, a matrix of size M × F_n;
step 3.3, obtaining global characteristics from the node characteristics;
the node feature Feature_{k,n} is input to a fully connected layer network module with input size M × F_n and output size M × F_g; a max-pooling operation is then taken over the M dimension to obtain the global feature Feature_{k,g}, a vector of size 1 × F_g;
the processed point cloud Xdata_k is thus taken as input to the feature extractor, which outputs the node feature Feature_{k,n} and the global feature Feature_{k,g};
the global feature is defined as: Feature_{k,g} is a vector of size 1 × F_g, where F_g is a constant; N_L represents the length of the global feature vector;
the node feature is defined as: Feature_{k,n} is a matrix of size M × F_n, where F_n is a constant and M is the defined number of nodes;
the point cloud reconstructor reconstructs, from the global feature Feature_{k,g}, the k-th group of reconstructed three-dimensional point clouds X_{k,r}:
step 3, the point cloud reconstructor is specifically defined as follows:
the global feature vector Feature_{k,g} is used to reconstruct the three-dimensional point cloud, and the reconstructed three-dimensional point cloud X_{k,r} is output;
the point cloud reconstructor adopts a reconstruction network structure comprising two branches; the fully connected layer branch, composed of 4 fully connected layers, independently predicts the position of each point and is good at describing complex structures;
the convolutional branch consists of 5 convolutional layers, each followed by a deconvolution layer, which makes full use of spatial continuity;
the output point cloud module is composed of two 1 × 1 convolution layers; the prediction results of the two branches are merged together to obtain the complete reconstructed point cloud X_{k,r};
the global feature Feature_{k,g} and the node feature Feature_{k,n} are passed through the gesture estimator to estimate the three-dimensional gesture;
step 3, the gesture estimator is specifically defined as:
the global feature Feature_{k,g} obtained in the feature extraction stage and the local feature Feature_{k,n} are fused: the global feature vector of size 1 × F_g is copied M times along the first dimension, raising it to a matrix of size M × F_g whose first dimension matches that of the local feature matrix M × F_n; the two matrices are then spliced, i.e. concatenated, along the second dimension into the fused feature matrix Feature_{k,f} of size M × (F_n + F_g);
a new feature matrix is obtained through a fully connected layer network module whose output length is defined as N_L; the resulting matrix has size M × N_L, the input size of the fully connected layer being M × (F_n + F_g) and the output size M × N_L;
the M × N_L feature matrix output by the fully connected layer is then average-pooled over the M dimension to obtain the overall feature vector V_{k,f} of length N_L;
the overall feature vector V_{k,f}, obtained by fusing the global feature Feature_{k,g} and the node feature Feature_{k,n}, is used to estimate the three-dimensional gesture Pose_k;
The method for estimating the three-dimensional gesture comprises the following steps:
the three-dimensional coordinates of the hand key points are regressed from the overall feature vector by a plurality of fully connected layer network modules;
the overall feature vector of length N_L passes through fully connected layer network modules of dimensions (N_L, U, V, 3 × N_joints) to obtain a 3 × N_joints output, which is then reshaped into the three-dimensional coordinates Pose_k of the N_joints joint points, wherein N_L is the length of the overall feature vector input to the fully connected layers, and U and V are respectively the first and second hidden-layer dimensions of the fully connected layers;
the three-dimensional gesture is defined as:
5. The semi-supervised learning based depth image gesture estimation method of claim 1, wherein: step 4, constructing a gesture estimation network loss function model according to whether the training samples are marked or not is as follows:
if the training sample is annotated, the gesture estimation network loss function model for annotated samples is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}) + D_joints(Xdata_k)
wherein D_Chamfer(Xdata_k, X_{k,r}) represents the Chamfer distance between Xdata_k and X_{k,r}:
wherein |X_{k,r}| represents the number of points in the k-th group of reconstructed point clouds, Xdata_k is the k-th group of converted three-dimensional point clouds, L is the number of coordinate points in the k-th group after conversion, and N_K is the number of converted point-cloud groups; one term computes, for each point of the k-th reconstructed point cloud X_{k,r}, the distance to the nearest point in the point cloud Xdata_k, and then sums these distances;
likewise, the other term computes, for each point of the point cloud Xdata_k, the distance to the nearest point in X_{k,r}, and then sums these distances;
wherein N_joints represents the number of gesture joints, num∈[1, N_joints]; Pose_k represents the k-th group gesture, whose num-th element is the three-dimensional coordinate of the num-th joint of the gesture; correspondingly, the annotated gesture of the k-th group has the annotated three-dimensional coordinates of the num-th joint, and D_joints is calculated from the estimated and annotated joint coordinates as follows:
if the training sample is not annotated, the gesture estimation network loss function model for unannotated samples is:
LOSS = D_Chamfer(Xdata_k, X_{k,r}), wherein D_Chamfer(Xdata_k, X_{k,r}) represents the Chamfer distance between the point cloud Xdata_k and the reconstructed point cloud X_{k,r};
step 4, the trained gesture estimation network obtained through optimization training comprises three modules: the optimization-trained feature extractor, the optimization-trained point cloud reconstructor and the optimization-trained gesture estimator.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010503293.0A CN111797692B (en) | 2020-06-05 | 2020-06-05 | Depth image gesture estimation method based on semi-supervised learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111797692A true CN111797692A (en) | 2020-10-20 |
CN111797692B CN111797692B (en) | 2022-05-17 |
Family
ID=72802891
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010503293.0A Active CN111797692B (en) | 2020-06-05 | 2020-06-05 | Depth image gesture estimation method based on semi-supervised learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111797692B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112635064A (en) * | 2020-12-31 | 2021-04-09 | 山西三友和智慧信息技术股份有限公司 | Early diabetes risk prediction method based on deep PCA (principal component analysis) transformation |
CN113112607A (en) * | 2021-04-19 | 2021-07-13 | 复旦大学 | Method and device for generating three-dimensional grid model sequence with any frame rate |
CN113129370A (en) * | 2021-03-04 | 2021-07-16 | 同济大学 | Semi-supervised object pose estimation method combining generated data and label-free data |
CN113239834A (en) * | 2021-05-20 | 2021-08-10 | 中国科学技术大学 | Sign language recognition system capable of pre-training sign model perception representation |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120119984A1 (en) * | 2010-11-15 | 2012-05-17 | Yogesh Sankarasubramaniam | Hand pose recognition |
CN103778407A (en) * | 2012-10-23 | 2014-05-07 | 南开大学 | Gesture recognition algorithm based on conditional random fields under transfer learning framework |
CN105955473A (en) * | 2016-04-27 | 2016-09-21 | 周凯 | Computer-based static gesture image recognition interactive system |
US20190182415A1 (en) * | 2015-04-27 | 2019-06-13 | Snap-Aid Patents Ltd. | Estimating and using relative head pose and camera field-of-view |
CN110222626A (en) * | 2019-06-03 | 2019-09-10 | 宁波智能装备研究院有限公司 | A kind of unmanned scene point cloud target mask method based on deep learning algorithm |
CN110222580A (en) * | 2019-05-09 | 2019-09-10 | 中国科学院软件研究所 | A kind of manpower 3 d pose estimation method and device based on three-dimensional point cloud |
CN110866969A (en) * | 2019-10-18 | 2020-03-06 | 西北工业大学 | Engine blade reconstruction method based on neural network and point cloud registration |
CN111150175A (en) * | 2019-12-05 | 2020-05-15 | 新拓三维技术(深圳)有限公司 | Method, device and system for three-dimensional scanning of feet |
- 2020-06-05: application CN202010503293.0A filed; granted as patent CN111797692B (status: Active)
Non-Patent Citations (3)
Title |
---|
LIUHAO G. et al.: "Hand PointNet: 3D Hand Pose Estimation Using Point Sets", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) * |
YUJIN C. et al.: "SO-HandNet: Self-Organizing Network for 3D Hand Pose Estimation with Semi-supervised Learning", 2019 IEEE/CVF International Conference on Computer Vision (ICCV) * |
ZHANG Hongyuan et al.: "Hand gesture pose estimation based on a pseudo-3D convolutional neural network", Application Research of Computers (《计算机应用研究》) * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112635064A (en) * | 2020-12-31 | 2021-04-09 | 山西三友和智慧信息技术股份有限公司 | Early diabetes risk prediction method based on deep PCA (principal component analysis) transformation |
CN112635064B (en) * | 2020-12-31 | 2022-08-09 | 山西三友和智慧信息技术股份有限公司 | Early diabetes risk prediction method based on deep PCA (principal component analysis) transformation |
CN113129370A (en) * | 2021-03-04 | 2021-07-16 | 同济大学 | Semi-supervised object pose estimation method combining generated data and label-free data |
CN113129370B (en) * | 2021-03-04 | 2022-08-19 | 同济大学 | Semi-supervised object pose estimation method combining generated data and label-free data |
CN113112607A (en) * | 2021-04-19 | 2021-07-13 | 复旦大学 | Method and device for generating a three-dimensional mesh model sequence at an arbitrary frame rate |
CN113239834A (en) * | 2021-05-20 | 2021-08-10 | 中国科学技术大学 | Sign language recognition system capable of pre-training sign model perception representation |
CN113239834B (en) * | 2021-05-20 | 2022-07-15 | 中国科学技术大学 | Sign language recognition system capable of pre-training sign model perception representation |
Also Published As
Publication number | Publication date |
---|---|
CN111797692B (en) | 2022-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111797692B (en) | Depth image gesture estimation method based on semi-supervised learning | |
CN108416840B (en) | Three-dimensional scene dense reconstruction method based on monocular camera | |
Chen et al. | AlignSDF: Pose-aligned signed distance fields for hand-object reconstruction | |
CN114863573B (en) | Category-level 6D attitude estimation method based on monocular RGB-D image | |
CN109815847B (en) | Visual SLAM method based on semantic constraint | |
CN111161364B (en) | Real-time shape completion and attitude estimation method for single-view depth map | |
CN114359509B (en) | Multi-view natural scene reconstruction method based on deep learning | |
Tu et al. | Consistent 3D hand reconstruction in video via self-supervised learning | |
CN110223382B (en) | Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning | |
CN117218343A (en) | Semantic component attitude estimation method based on deep learning | |
CN112232106A (en) | Two-dimensional to three-dimensional human body posture estimation method | |
Wang et al. | Adversarial learning for joint optimization of depth and ego-motion | |
CN111860651A (en) | Monocular vision-based semi-dense map construction method for mobile robot | |
CN112183675A (en) | Twin network-based tracking method for low-resolution target | |
CN111368733B (en) | Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal | |
Zhang et al. | EANet: Edge-attention 6D pose estimation network for texture-less objects | |
CN104463962B (en) | Three-dimensional scene reconstruction method based on GPS information video | |
CN117315169A (en) | Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching | |
Wang et al. | Unsupervised monocular depth estimation with channel and spatial attention | |
Zou et al. | GPT-COPE: A graph-guided point transformer for category-level object pose estimation | |
CN117788544A (en) | Image depth estimation method based on lightweight attention mechanism | |
Lin et al. | TransPose: 6D object pose estimation with geometry-aware transformer | |
CN117351078A (en) | Target size and 6D gesture estimation method based on shape priori | |
Zhang et al. | HVDistill: Transferring knowledge from images to point clouds via unsupervised hybrid-view distillation | |
CN115496859A (en) | Three-dimensional scene motion trend estimation method based on scattered point cloud cross attention learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||