CN114882524A - Monocular three-dimensional gesture estimation method based on full convolution neural network

Monocular three-dimensional gesture estimation method based on full convolution neural network

Info

Publication number
CN114882524A
Authority
CN
China
Prior art keywords
dimensional
network
convolution
gesture estimation
dimensional gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210397216.0A
Other languages
Chinese (zh)
Inventor
刘星言
康文雄
林亿鸿
邓飞其
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210397216.0A priority Critical patent/CN114882524A/en
Publication of CN114882524A publication Critical patent/CN114882524A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/11Hand-related biometrics; Hand pose recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular three-dimensional gesture estimation method based on a full convolution neural network, which comprises the following steps: acquiring a hand image and preprocessing the image; constructing a full convolution three-dimensional gesture estimation network and training it; inputting the preprocessed image into the full convolution three-dimensional gesture estimation network to predict the final two-dimensional coordinates of the hand key points and the relative depth of each key point; and post-processing the predicted two-dimensional coordinates and relative depths to calculate the three-dimensional coordinates of the hand key points. By decoupling hand scale information from the network's depth prediction for the hand key points, the method effectively alleviates the scale-uncertainty problem in monocular three-dimensional gesture estimation; in practical applications, accurate prior scale information can be acquired to accurately restore the actual depth of the hand key points in the scene relative to the imaging device, effectively raising the accuracy upper bound of the three-dimensional gesture estimation method and its generalization ability across scenes.

Description

Monocular three-dimensional gesture estimation method based on full convolution neural network
Technical Field
The invention relates to the field of computer vision, in particular to a monocular three-dimensional gesture estimation method based on a full convolution neural network.
Background
The hand's high degree of freedom and natural, intuitive character make it an important research object for intelligent human-computer interaction. A vision-based gesture interaction system frees the user from dependence on interaction middleware: gestures can be used directly to interact with real or virtual scenes and to operate intelligent terminals, greatly improving continuity and convenience of use. In recent years, technology giants such as Google, Microsoft and Facebook have invested substantial research resources in intelligent interactive wearable terminals based on Augmented Reality (AR), Virtual Reality (VR) and Mixed Reality (MR), in which vision-based gesture interaction is one of the core technologies.
Gesture estimation refers to predicting the positions of hand key points from hand images. Monocular three-dimensional gesture estimation requires predicting the three-dimensional positions of the hand key points in imaging space from a single hand color image or depth map, which is a very challenging task. Gesture estimation is an important link in visual gesture interaction: it helps a computer capture the relative positions of the hand and other real or virtual objects in a scene, so that changes in a real scene can be analyzed and predicted, or corresponding feedback can be given in a virtual world. Monocular three-dimensional gesture estimation places low requirements on the input image, and the overall approach offers high flexibility, broad application prospects and practical value.
The most similar prior art to the present invention:
Monocular three-dimensional gesture estimation: the task requires predicting the three-dimensional spatial coordinates of hand key points from a single input hand image, and is a high-level visual understanding task. The high degree of freedom of hand pose, self-occlusion, and the three-dimensional scale uncertainty of a single image make the task difficult. Existing deep-learning-based monocular three-dimensional gesture estimation methods can be divided by input type into color-image-based and depth-map-based methods. Depth-map-based methods can exploit the depth information of the hand surface, which alleviates the scale-uncertainty problem to some extent and achieves good results. Color-image-based methods can be further divided, by how they are realized, into learning-based and model-based methods. The former designs a complex neural network and trains it on a large amount of image data with accurate three-dimensional key point labels, so that the network regresses the key point coordinates directly from the image; the latter introduces a parameterized hand model, uses a neural network to regress the model parameters, and performs semi-supervised training on hand images, which can achieve a higher accuracy upper bound.
The prior art has the following disadvantages:
1. Existing depth-map-based monocular three-dimensional gesture estimation methods achieve high overall accuracy, but they depend on high-quality depth maps, whose imaging conditions are demanding and easily disturbed by the scene, which greatly limits the application scenarios of monocular gesture estimation.
2. The neural network structures used by existing learning-based monocular color-image three-dimensional gesture estimation methods are complex, some of their operators are not compatible with mainstream deployment frameworks, they depend heavily on training with a large amount of accurately labeled data, their results are easily affected by the scale-uncertainty problem, and their generalization ability is poor. (Yang L, Li S, Lee D, et al. Aligning latent spaces for 3D hand pose estimation [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 2335-.)
3. Existing model-based monocular color-image three-dimensional gesture estimation methods rely on a neural network to regress hand model parameters and supervise it indirectly through the hand color image. This makes the whole method harder to train, makes the training result sensitive to model parameter initialization and to the indirect image supervision scheme, increases the development difficulty of practical applications, and is not conducive to application expansion. (Boukhayma A, de Bem R, Torr P H S. 3D hand shape and pose from images in the wild [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 10843-.)
4. Existing monocular color-image three-dimensional gesture estimation methods consider only accuracy improvements in the network structure design, without considering operating requirements under different resource conditions, and therefore struggle to run in real time in low-resource scenarios.
Disclosure of Invention
The invention provides a monocular three-dimensional gesture estimation method based on a full convolution neural network, aiming to solve the problems that monocular three-dimensional gesture estimation from a hand color image is easily affected by scale uncertainty, that existing three-dimensional gesture estimation networks have complex structures and poor extensibility, and that their real-time operating efficiency is low.
The invention is realized by at least one of the following technical schemes.
A monocular three-dimensional gesture estimation method based on a full convolution neural network comprises the following steps:
s1, acquiring a hand image, and preprocessing the hand image;
s2, constructing a full convolution three-dimensional gesture estimation network, and training the full convolution three-dimensional gesture estimation network;
s3, inputting the preprocessed image into a full-convolution three-dimensional gesture estimation network to predict the final two-dimensional coordinates of key points and the relative depth of each key point;
and S4, performing post-processing on the two-dimensional coordinates and the relative depth of the predicted key points, and calculating the three-dimensional coordinates of the key points of the hand.
Further, the pre-processing includes image scaling and image filling.
Further, the full convolution three-dimensional gesture estimation network comprises the following units:
an input compression unit, for extracting coarse local features from the input image while reducing its resolution;
a basic convolution unit, for extracting basic features;
a down-sampling unit, for spatially down-sampling the feature map and enlarging its receptive field;
an up-sampling unit, for spatially up-sampling the feature map, restoring its spatial information with the help of lateral connections from the down-sampling units, and further enriching semantic features;
and an output unit, for extracting information from the final high-resolution feature map and predicting the final two-dimensional coordinates and normalized relative depths of the hand key points.
Further, the basic convolution unit includes:
1) a 7 × 7 channel-by-channel convolution: a large-kernel depthwise (channel-by-channel) convolution that keeps the number of channels of the input feature map unchanged;
2) a layer normalization operation: normalizes the input feature map using the mean and variance of all its pixel values, with learnable parameters that participate in training;
3) a 1 × 1 convolution: used to expand or reduce the number of feature-map channels by a factor of 4;
4) an activation function: a GELU function is used to calculate the output activation value of the convolution layer;
5) a residual connection: the unit input is added to the activated output to form the final unit output.
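This unit has the same structure as a ConvNeXt-style block. The following is a minimal PyTorch sketch of one reading of the unit; the module name and the way layer normalization is applied over the channel dimension are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class BasicConvUnit(nn.Module):
    """Hypothetical sketch of the basic convolution unit:
    7x7 depthwise conv -> LayerNorm -> 1x1 conv (C -> 4C) -> GELU -> 1x1 conv (4C -> C) -> residual add."""
    def __init__(self, channels: int):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=7, padding=3, groups=channels)
        self.norm = nn.LayerNorm(channels)                      # normalizes over the channel dimension
        self.expand = nn.Conv2d(channels, 4 * channels, kernel_size=1)
        self.act = nn.GELU()
        self.reduce = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        # LayerNorm expects the normalized dimension last, so move channels to the end and back
        x = x.permute(0, 2, 3, 1)
        x = self.norm(x)
        x = x.permute(0, 3, 1, 2)
        x = self.reduce(self.act(self.expand(x)))
        return x + residual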
Further, the output unit includes:
1) a two-dimensional coordinate prediction branch: features are extracted from the feature map with a 3 × 3 convolution kernel to obtain hand key point heatmaps, and the highest-response position in each heatmap is obtained through a differentiable soft-argmax operation, giving the predicted two-dimensional coordinates of the key points;
2) a relative depth prediction branch: features are extracted from the feature map with a 3 × 3 convolution kernel to obtain a latent depth map for each hand key point, and the normalized relative depth value of each key point is obtained through global spatial average pooling.
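As a concrete illustration of these two branches, the sketch below implements a soft-argmax over per-keypoint heatmaps and a globally average-pooled relative-depth head in PyTorch. The module name, tensor shapes, and the absence of a softmax temperature are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputUnit(nn.Module):
    """Hypothetical sketch: heatmap branch (soft-argmax -> 2D coords) and
    latent depth branch (global spatial average pooling -> normalized relative depth)."""
    def __init__(self, in_channels: int, num_keypoints: int):
        super().__init__()
        self.heatmap_conv = nn.Conv2d(in_channels, num_keypoints, kernel_size=3, padding=1)
        self.depth_conv = nn.Conv2d(in_channels, num_keypoints, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor):
        n, _, h, w = feat.shape
        heatmaps = self.heatmap_conv(feat)                          # (N, K, H, W)
        probs = F.softmax(heatmaps.flatten(2), dim=-1).view(n, -1, h, w)
        # soft-argmax: expected (u, v) position under the per-keypoint softmax distribution
        us = torch.linspace(0, w - 1, w, device=feat.device)
        vs = torch.linspace(0, h - 1, h, device=feat.device)
        u = (probs.sum(dim=2) * us).sum(dim=-1)                     # (N, K)
        v = (probs.sum(dim=3) * vs).sum(dim=-1)                     # (N, K)
        coords_2d = torch.stack([u, v], dim=-1)                     # (N, K, 2), heatmap-pixel units
        depth_maps = self.depth_conv(feat)                          # (N, K, H, W) latent depth maps
        rel_depth = depth_maps.mean(dim=(2, 3))                     # (N, K) normalized relative depths
        return coords_2d, rel_depth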
Further, the training of the full-convolution three-dimensional gesture estimation network comprises the following steps:
21) image preprocessing: scaling and filling input images participating in training, and performing data enhancement;
22) label preprocessing: during training, the hand key point labels are processed mainly as follows: the two-dimensional key point labels are modified to match the image-rotation data enhancement, and the absolute depth labels are converted into normalized relative depth labels; specifically, the depth label of the root key point is subtracted from the depth label of each key point, and the result is divided by the hand reference length;
23) network forward reasoning: inputting the processed image into a full-convolution three-dimensional gesture estimation network to obtain a predicted two-dimensional coordinate and a predicted relative depth of a key point;
24) loss calculation: the output of the network forward reasoning and the preprocessed labels are fed into the loss function to obtain the loss function value;
25) gradient back propagation: calculating a gradient value of the loss function value relative to a network parameter of the full-convolution three-dimensional gesture estimation network, and updating the network parameter by using a back propagation algorithm;
26) iterative training: and repeating the steps in batches on the input images until the loss function value is not reduced any more, and finishing the full convolution three-dimensional gesture estimation network training.
Further, in the full convolution three-dimensional gesture estimation network training process, a Smooth-L1 loss function is adopted to calculate errors between two-dimensional coordinates and relative depths of key points predicted by the network and labels, the gradients of the errors relative to all network parameters are calculated, and the network parameters are updated through a back propagation algorithm.
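A minimal PyTorch sketch of this loss computation is given below; the equal weighting of the two terms, the tensor shapes, and the function name are assumptions not specified in the text.

```python
import torch
import torch.nn.functional as F

def pose_loss(pred_uv, pred_rel_z, gt_uv, gt_rel_z, depth_weight: float = 1.0):
    """Smooth-L1 loss on predicted 2D coordinates and normalized relative depths.
    pred_uv / gt_uv: (N, K, 2); pred_rel_z / gt_rel_z: (N, K)."""
    loss_uv = F.smooth_l1_loss(pred_uv, gt_uv)
    loss_z = F.smooth_l1_loss(pred_rel_z, gt_rel_z)
    return loss_uv + depth_weight * loss_z
```

In training, gradients of this scalar with respect to all network parameters are obtained by calling backward() on it and applying an optimizer step, as described above.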
Further, data enhancement adopted in image preprocessing in the full convolution three-dimensional gesture estimation network training process comprises the following steps:
1) random image rotation: randomly rotating the input image by-180 degrees to 180 degrees around the central point;
2) random image flipping: randomly turning an input image horizontally and vertically;
3) random color transformation: carrying out random value scaling on HSV channels of the input image within the ranges of 75% -125%, 50% -150% and 50% -150% respectively;
4) random noise enhancement: applying a Gaussian noise with a mean value of 0 and a variance of 0-0.1 to each position in the input image with a probability of 50%.
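The four augmentations above can be sketched with OpenCV and NumPy as follows. The assumption of an 8-bit BGR input, the 50% probability for each flip, and treating the noise variance relative to the normalized intensity range are illustrative choices; keypoint labels would have to be transformed consistently (omitted here).

```python
import cv2
import numpy as np

def augment(img: np.ndarray) -> np.ndarray:
    """Sketch of random rotation, random flips, random HSV scaling and random Gaussian noise."""
    h, w = img.shape[:2]
    # 1) random rotation about the image centre, -180..180 degrees
    angle = np.random.uniform(-180, 180)
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    img = cv2.warpAffine(img, m, (w, h))
    # 2) random horizontal / vertical flips
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 1)
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 0)
    # 3) random HSV scaling: H in 75%-125%, S and V in 50%-150%
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] *= np.random.uniform(0.75, 1.25)
    hsv[..., 1] *= np.random.uniform(0.50, 1.50)
    hsv[..., 2] *= np.random.uniform(0.50, 1.50)
    img = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    # 4) zero-mean Gaussian noise (variance drawn from 0-0.1), applied with 50% probability
    if np.random.rand() < 0.5:
        sigma = np.sqrt(np.random.uniform(0.0, 0.1))
        noisy = img.astype(np.float32) / 255.0 + np.random.normal(0.0, sigma, img.shape)
        img = (np.clip(noisy, 0.0, 1.0) * 255.0).astype(np.uint8)
    return img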
Further, the post-processing comprises:
1) converting the normalized relative depths of the key points output by the full convolution three-dimensional gesture estimation network into absolute depths of the key points;
2) calculating the three-dimensional coordinates of the key points by using the two-dimensional coordinates of the key points in the image output by the full convolution three-dimensional gesture estimation network and the converted absolute depths of the key points.
Further, the specific process of calculating the three-dimensional coordinates of the key points comprises:
[X_i, Y_i, Z_i]^T = z_i \cdot K^{-1} [u_i, v_i, 1]^T

wherein (X_i, Y_i, Z_i) are the three-dimensional coordinates of the i-th key point, (u_i, v_i) are the two-dimensional coordinates of the i-th key point output by the full convolution three-dimensional gesture estimation network, z_i is the absolute depth of the i-th key point, and K is the intrinsic parameter matrix of the imaging device.
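A minimal NumPy sketch of this back-projection is given below, assuming the intrinsic matrix is known and the absolute depths have already been recovered in the post-processing step; the function name is illustrative.

```python
import numpy as np

def backproject(uv: np.ndarray, z: np.ndarray, intrinsics: np.ndarray) -> np.ndarray:
    """uv: (N, 2) pixel coordinates, z: (N,) absolute depths, intrinsics: (3, 3) camera matrix K.
    Returns (N, 3) camera-space coordinates: [X_i, Y_i, Z_i]^T = z_i * K^-1 [u_i, v_i, 1]^T."""
    ones = np.ones((uv.shape[0], 1))
    homo = np.concatenate([uv, ones], axis=1)        # (N, 3) homogeneous pixel coordinates
    rays = homo @ np.linalg.inv(intrinsics).T        # (N, 3) normalized camera rays
    return rays * z[:, None]                         # scale each ray by its absolute depth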
Compared with the prior art, the invention has the beneficial effects that:
1. according to the method, the hand scale information and the depth prediction process of the neural network on the hand key points are decoupled, the scale uncertainty problem in monocular three-dimensional gesture estimation can be effectively solved, and the actual depth of the hand key points in a scene relative to the imaging equipment can be accurately restored by acquiring accurate prior scale information in practical application, so that the precision upper limit of the three-dimensional gesture estimation method and the generalization capability of the three-dimensional gesture estimation method on the scene are effectively improved.
2. The invention constructs the three-dimensional gesture estimation network from basic convolution operations in a fully convolutional structure, simplifying the network design process for deep-learning-based three-dimensional gesture estimation applications and allowing the network to be optimized and accelerated by most existing deployment frameworks. Moreover, the number of modules and the number of convolution-layer channels can be scaled proportionally, so that the network adapts to application scenarios with different resource conditions and the difficulty of subsequent maintenance and optimization of the algorithm is reduced.
3. According to the invention, a common lightweight design is introduced into the three-dimensional gesture estimation network structure, so that the operation efficiency of the three-dimensional gesture estimation network is greatly improved, and the real-time three-dimensional gesture estimation can be realized in a low-resource scene.
Drawings
FIG. 1 is a diagram illustrating a network structure for full-convolution three-dimensional gesture estimation according to the present embodiment;
FIG. 2 is a diagram of an input compression unit in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
FIG. 3 is a diagram of a basic convolution unit structure in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
FIG. 4 is a diagram of an upsampling unit structure in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
FIG. 5 is a block diagram of a downsampling unit in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
FIG. 6 is a diagram of an output unit structure in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
fig. 7 is a training and reasoning flowchart of the fully-convoluted three-dimensional gesture estimation network according to the embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Example 1
The full convolution three-dimensional gesture estimation network provided by the invention supports the complete implementation of a monocular three-dimensional gesture estimation application, which consists mainly of a training process and an inference process for the neural network; once training is completed, the network can be used for testing and inference. As shown in fig. 7, a monocular three-dimensional gesture estimation method based on a full convolution neural network includes the following steps:
S1, a full convolution three-dimensional gesture estimation network is constructed and trained; the full convolution three-dimensional gesture estimation network comprises the following units:
An input compression unit: its composition is shown in fig. 2. The input image is first processed by a convolution layer with a 4 × 4 kernel and stride 4 to obtain a compressed feature map, and an intra-layer normalization operation then adjusts the mean and variance of the compressed feature map through learnable parameters.
Basic convolution unit: the basic unit for feature extraction, composed as shown in fig. 3. The input C-channel feature map is first processed by a 7 × 7 channel-by-channel convolution, followed by intra-layer normalization; a 1 × 1 convolution then increases the number of channels to 4C, a GELU activation function is applied, another 1 × 1 convolution reduces the number of channels back to C, and the result is summed with the unit input to form the output.
A down-sampling unit: its structure is shown in fig. 5. The input feature map is first normalized within the layer, and then processed by a convolution layer with a 2 × 2 kernel and stride 2 to obtain a feature map with half the spatial resolution.
An up-sampling unit: it spatially up-samples the feature map; its structure is shown in fig. 4. The input feature map is first normalized within the layer, and a deconvolution with a 2 × 2 kernel and stride 2 then doubles the spatial resolution, realizing spatial up-sampling of the feature map.
An output unit: its structure is shown in fig. 6. One branch converts the feature map into a heatmap for each hand key point through a convolution with a 3 × 3 kernel and stride 1, and the highest-response position in each heatmap is obtained through a differentiable soft-argmax operation, giving the predicted two-dimensional coordinates of the key points; the other branch converts the feature map into a latent depth map for each key point through a convolution with the same 3 × 3 kernel and stride 1, and the normalized relative depth value of each key point is obtained through global spatial average pooling.
As shown in fig. 1, the overall structure of the full convolution three-dimensional gesture estimation network is divided into a down-sampling stage and an up-sampling stage. In the down-sampling stage, the input image is first compressed by the input compression unit and then processed by three serially connected down-sampling feature extraction modules; each down-sampling feature extraction module consists of N_i basic convolution units with C_i feature channels followed by a down-sampling unit, where i = 1, 2, 3 in order of decreasing feature-map spatial resolution. The resulting feature map is further processed by N_4 basic convolution units with C_4 channels and then enters the up-sampling stage. In the up-sampling stage, the feature map first passes through N_4 basic convolution units and is then processed by three serially connected up-sampling feature extraction modules; each up-sampling feature extraction module consists of N_i basic convolution units with C_i feature channels and an up-sampling unit in series, and it receives and sums the output of the down-sampling feature extraction module with the same spatial resolution (lateral connection). Finally, the output unit produces the predicted two-dimensional coordinates and normalized relative depth values of the hand key points. In this example, N_1, N_2, N_3, N_4, C_1, C_2, C_3 and C_4 are set to 1, 48, 96, 192 and 384 respectively, and the resulting full convolution three-dimensional gesture estimation network is suitable for application scenarios with high requirements on operating efficiency. A minimal sketch of how these units can be assembled is given below.
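The following PyTorch sketch wires the units together, reusing the hypothetical BasicConvUnit and OutputUnit sketches given earlier in this description. The single-group GroupNorm stand-in for layer normalization, the reading N_1 = N_2 = N_3 = N_4 = 1 with (C_1, C_2, C_3, C_4) = (48, 96, 192, 384), and the default of 21 hand key points are assumptions for illustration, not the patented implementation.

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Normalization followed by a 2x2, stride-2 convolution (halves spatial resolution)."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, c_in)        # simple stand-in for the intra-layer normalization
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=2, stride=2)

    def forward(self, x):
        return self.conv(self.norm(x))

class Upsample(nn.Module):
    """Normalization followed by a 2x2, stride-2 deconvolution (doubles spatial resolution)."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, c_in)
        self.deconv = nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)

    def forward(self, x):
        return self.deconv(self.norm(x))

class FullConvPoseNet(nn.Module):
    """Hypothetical assembly of the full convolution three-dimensional gesture estimation network."""
    def __init__(self, num_keypoints: int = 21, n=(1, 1, 1, 1), c=(48, 96, 192, 384)):
        super().__init__()
        # input compression: 4x4 convolution with stride 4, then normalization
        self.stem = nn.Sequential(nn.Conv2d(3, c[0], kernel_size=4, stride=4), nn.GroupNorm(1, c[0]))
        self.enc = nn.ModuleList([nn.Sequential(*[BasicConvUnit(c[i]) for _ in range(n[i])]) for i in range(3)])
        self.down = nn.ModuleList([Downsample(c[i], c[i + 1]) for i in range(3)])
        self.bottleneck = nn.Sequential(*[BasicConvUnit(c[3]) for _ in range(n[3])])
        self.up = nn.ModuleList([Upsample(c[i + 1], c[i]) for i in (2, 1, 0)])
        self.dec = nn.ModuleList([nn.Sequential(*[BasicConvUnit(c[i]) for _ in range(n[i])]) for i in (2, 1, 0)])
        self.head = OutputUnit(c[0], num_keypoints)

    def forward(self, x):
        x = self.stem(x)
        skips = []
        for enc, down in zip(self.enc, self.down):
            x = enc(x)
            skips.append(x)              # lateral connection kept at this spatial resolution
            x = down(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = up(x) + skip             # sum with the matching-resolution down-sampling output
            x = dec(x)
        # returns (2D coords in feature-map pixels, normalized relative depths);
        # scaling coordinates back to the input image resolution is omitted here
        return self.head(x)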
The training of the full convolution three-dimensional gesture estimation network comprises the following steps:
1) Image preprocessing: the input images used for training are padded to a 1:1 aspect ratio and scaled to 256 × 256, and data enhancement operations including random image rotation, random image flipping, random color transformation and random noise enhancement are applied.
2) Label preprocessing: during training, the hand key point labels are processed mainly as follows: the two-dimensional key point labels are modified to match the image-rotation data enhancement, and the absolute depth labels are converted into normalized relative depth labels; specifically, the depth label of the root key point is subtracted from the depth label of each key point and the result is divided by the hand reference length (a sketch of this conversion is given after this list).
3) Network forward reasoning: the processed images are input into the full convolution three-dimensional gesture estimation network to obtain the predicted two-dimensional coordinates and relative depths of the key points.
4) Loss calculation: the output of the network forward reasoning and the preprocessed labels are fed into the Smooth L1 loss function to obtain the loss function value.
5) Gradient back propagation: the gradients of the loss value with respect to the network parameters of the full convolution three-dimensional gesture estimation network are calculated, and the network parameters are updated with the back propagation algorithm.
6) Iterative training: the above steps are repeated over batches of input images until the loss value no longer decreases, at which point training of the full convolution three-dimensional gesture estimation network is complete.
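The normalized relative depth conversion in step 2) can be written as in the sketch below; the function name, argument layout and the use of a precomputed hand reference length are assumptions for illustration.

```python
import numpy as np

def normalize_depth_labels(z_abs: np.ndarray, root_idx: int, ref_len: float) -> np.ndarray:
    """z_abs: (K,) absolute keypoint depth labels; root_idx: index of the root key point;
    ref_len: hand reference length taken from the annotation.
    Returns the normalized relative depth labels (z_i - z_root) / l_ref."""
    return (z_abs - z_abs[root_idx]) / ref_len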
S2, a hand image is acquired and preprocessed: the input image is padded to a 1:1 aspect ratio and scaled to a resolution of 256 × 256.
S3, the preprocessed image is input into the trained full convolution three-dimensional gesture estimation network to obtain the two-dimensional coordinates and normalized relative depth values of the hand key points.
S4, the two-dimensional coordinates and relative depth values of the hand key points predicted by the full convolution three-dimensional gesture estimation network are post-processed, and the three-dimensional coordinates of the key points are calculated, completing the gesture estimation task. The post-processing comprises the following steps:
1) The normalized relative depth is converted into absolute depth. The provided hand reference length (which in practice can be obtained by other means within the scene and is related to scene factors) is first substituted into the following equation:

|| z_r K^{-1} [u_r, v_r, 1]^T - (z_r + d_m \cdot l_ref) K^{-1} [u_m, v_m, 1]^T ||_2 = l_ref

wherein l_ref denotes the hand reference length, K is the intrinsic parameter matrix of the imaging device, z_r is the absolute depth of the root key point, (u_r, v_r) are the two-dimensional coordinates of the root key point predicted by the network, (u_m, v_m) are the two-dimensional coordinates of the middle-finger metacarpophalangeal (MCP) joint predicted by the network (i.e. the other joint related to the hand reference length), and d_m is the normalized relative depth value of that joint predicted by the network. Solving this equation yields the absolute depth z_r of the root key point; combined with the normalized relative depth values predicted by the network, the absolute depths of all key points can then be restored:

z_i = z_r + d_i \cdot l_ref

wherein d_i denotes the normalized relative depth value of the i-th key point predicted by the network, and z_i is the corresponding absolute depth value. (A numerical sketch of this recovery step is given after these post-processing steps.)
2) The three-dimensional coordinates of the key points are calculated:

[X_i, Y_i, Z_i]^T = z_i \cdot K^{-1} [u_i, v_i, 1]^T

wherein (X_i, Y_i, Z_i) are the three-dimensional coordinates of the i-th key point, (u_i, v_i) are the two-dimensional coordinates of the i-th key point output by the network, z_i is the corresponding absolute depth, and K is the intrinsic parameter matrix of the imaging device.
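A NumPy sketch of the depth-recovery step is given below. It assumes the recovery is based on the constraint that the reconstructed 3D distance between the root key point and the middle-finger MCP joint equals the hand reference length, which leads to a quadratic equation in z_r whose positive root is kept; the function names are illustrative.

```python
import numpy as np

def recover_root_depth(uv_root, uv_mcp, d_mcp, ref_len, K):
    """Solve || z_r * K^-1 [u_r, v_r, 1]^T - (z_r + d_m * l_ref) * K^-1 [u_m, v_m, 1]^T || = l_ref
    for the absolute root depth z_r (quadratic in z_r, positive root kept)."""
    Kinv = np.linalg.inv(K)
    p_r = Kinv @ np.array([uv_root[0], uv_root[1], 1.0])
    p_m = Kinv @ np.array([uv_mcp[0], uv_mcp[1], 1.0])
    diff = p_r - p_m
    # squared norm of (z_r * diff - d_m * l_ref * p_m) must equal l_ref^2
    a = diff @ diff
    b = -2.0 * d_mcp * ref_len * (diff @ p_m)
    c = (d_mcp * ref_len) ** 2 * (p_m @ p_m) - ref_len ** 2
    disc = max(b * b - 4.0 * a * c, 0.0)
    return (-b + np.sqrt(disc)) / (2.0 * a)

def recover_absolute_depths(rel_depths, z_root, ref_len):
    """z_i = z_r + d_i * l_ref for every key point."""
    return z_root + rel_depths * ref_len
```

The recovered absolute depths can then be passed, together with the predicted 2D coordinates and the intrinsic matrix, to the back-projection of step 2).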
The full convolution three-dimensional gesture estimation network has the following characteristic: according to the requirements of the application scenario, the number N of serially connected basic convolution units and the number of channels C inside the basic convolution units can be adjusted, changing the depth and width of the full convolution three-dimensional gesture estimation network and thus the performance of the whole method, so that it can be deployed in application scenarios with different resource conditions.
Example 2
The network structure design parameters N_1, N_2, N_3, N_4, C_1, C_2, C_3 and C_4 are changed to 2, 96, 144, 216 and 324 respectively; the remaining steps are the same as in embodiment 1. The resulting full convolution three-dimensional gesture estimation network is suitable for application scenarios with balanced requirements on operating efficiency and accuracy.
Example 3
The network structure design parameters N_1, N_2, N_3, N_4, C_1, C_2, C_3 and C_4 are changed to 3, 2, 96, 192, 384 and 768 respectively; the remaining steps are the same as in embodiment 1. The resulting full convolution three-dimensional gesture estimation network is suitable for application scenarios with high requirements on operating accuracy and low requirements on efficiency.
The above embodiments are only for explaining the details to help understanding the technical solution of the present invention, and it is obvious to those skilled in the art that any modifications and substitutions made without departing from the principle of the present invention belong to the protection scope of the present invention.

Claims (10)

1. A monocular three-dimensional gesture estimation method based on a full convolution neural network is characterized by comprising the following steps:
s1, acquiring a hand image, and preprocessing the hand image;
s2, constructing a full convolution three-dimensional gesture estimation network, and training the full convolution three-dimensional gesture estimation network;
s3, inputting the preprocessed image into a full-convolution three-dimensional gesture estimation network to predict the final two-dimensional coordinates of key points and the relative depth of each key point;
and S4, performing post-processing on the two-dimensional coordinates and the relative depth of the predicted key points, and calculating the three-dimensional coordinates of the key points of the hand.
2. The method of claim 1, wherein the preprocessing comprises image scaling and image filling.
3. The monocular three-dimensional gesture estimation method based on full convolution neural network according to claim 1, wherein the full convolution three-dimensional gesture estimation network comprises the following units:
an input compression unit for extracting local features in an input image while reducing a resolution of the input image;
the basic convolution unit is used for extracting basic features of the image;
the down-sampling unit is used for carrying out spatial down-sampling on the feature map and improving the receptive field of the feature map;
the up-sampling unit is used for carrying out spatial up-sampling on the feature map, recovering the feature map spatial information by matching with the transverse connection output by the down-sampling unit and further enriching semantic features;
and the output unit is used for extracting information from the final high-resolution feature map and predicting the final two-dimensional coordinates and normalized relative depth of the hand key points.
4. The monocular three-dimensional gesture estimation method based on full convolution neural network of claim 3, wherein the basic convolution unit comprises:
1) a 7 × 7 channel-by-channel convolution: a large-kernel depthwise (channel-by-channel) convolution that keeps the number of channels of the input feature map unchanged;
2) a layer normalization operation: normalizes the input feature map using the mean and variance of all its pixel values, with learnable parameters that participate in training;
3) a 1 × 1 convolution: used to expand or reduce the number of feature-map channels by a factor of 4;
4) an activation function: a GELU function is used to calculate the output activation value of the convolution layer;
5) a residual connection: the unit input is added to the activated output to form the final unit output.
5. The monocular three-dimensional gesture estimation method based on full convolution neural network of claim 3, wherein the output unit comprises:
1) a two-dimensional coordinate prediction branch: features are extracted from the feature map with a 3 × 3 convolution kernel to obtain hand key point heatmaps, and the highest-response position in each heatmap is obtained through a differentiable soft-argmax operation, giving the predicted two-dimensional coordinates of the key points;
2) a relative depth prediction branch: features are extracted from the feature map with a 3 × 3 convolution kernel to obtain a latent depth map for each hand key point, and the normalized relative depth value of each key point is obtained through global spatial average pooling.
6. The monocular three-dimensional gesture estimation method based on full convolution neural network of claim 1, wherein the training of the full convolution three-dimensional gesture estimation network comprises the following steps:
21) image preprocessing: scaling and filling input images participating in training, and performing data enhancement;
22) label preprocessing: during training, the hand key point labels are processed mainly as follows: the two-dimensional key point labels are modified to match the image-rotation data enhancement, and the absolute depth labels are converted into normalized relative depth labels; specifically, the depth label of the root key point is subtracted from the depth label of each key point, and the result is divided by the hand reference length;
23) network forward reasoning: inputting the processed image into a full-convolution three-dimensional gesture estimation network to obtain a predicted two-dimensional coordinate and a predicted relative depth of a key point;
24) loss calculation: the output of the network forward reasoning and the preprocessed labels are fed into the loss function to obtain the loss function value;
25) gradient back propagation: calculating a gradient value of the loss function value relative to a network parameter of the full-convolution three-dimensional gesture estimation network, and updating the network parameter by using a back propagation algorithm;
26) iterative training: and repeating the steps in batches on the input images until the loss function value is not reduced any more, and finishing the full convolution three-dimensional gesture estimation network training.
7. The monocular three-dimensional gesture estimation method based on the full convolution neural network as recited in claim 6, wherein the full convolution three-dimensional gesture estimation network training process adopts a Smooth-L1 loss function to calculate errors between a two-dimensional coordinate of a key point predicted by the network and a relative depth and a label, calculates gradients of the errors relative to each network parameter, and updates the network parameters through a back propagation algorithm.
8. The monocular three-dimensional gesture estimation method based on the full convolution neural network of claim 6, wherein data enhancement adopted in image preprocessing in the full convolution three-dimensional gesture estimation network training process comprises:
1) random image rotation: randomly rotating the input image by-180 degrees to 180 degrees around the central point;
2) random image flipping: randomly turning an input image horizontally and vertically;
3) random color transformation: carrying out random value scaling on HSV channels of the input image within the ranges of 75% -125%, 50% -150% and 50% -150% respectively;
4) random noise enhancement: applying a Gaussian noise with a mean value of 0 and a variance of 0-0.1 to each position in the input image with a probability of 50%.
9. The monocular three-dimensional gesture estimation method based on the full convolution neural network according to any one of claims 1 to 8, wherein the post-processing includes:
1) converting the normalized relative depths of the key points output by the full convolution three-dimensional gesture estimation network into absolute depths of the key points;
2) calculating the three-dimensional coordinates of the key points by using the two-dimensional coordinates of the key points in the image output by the full convolution three-dimensional gesture estimation network and the converted absolute depths of the key points.
10. The monocular three-dimensional gesture estimation method based on the full convolution neural network of claim 9, wherein the specific process of calculating the three-dimensional coordinates of the key points is as follows:
[X_i, Y_i, Z_i]^T = z_i \cdot K^{-1} [u_i, v_i, 1]^T

wherein (X_i, Y_i, Z_i) are the three-dimensional coordinates of the i-th key point, (u_i, v_i) are the two-dimensional coordinates of the i-th key point output by the full convolution three-dimensional gesture estimation network, z_i is the absolute depth of the i-th key point, and K is the intrinsic parameter matrix of the imaging device.
CN202210397216.0A 2022-04-15 2022-04-15 Monocular three-dimensional gesture estimation method based on full convolution neural network Pending CN114882524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210397216.0A CN114882524A (en) 2022-04-15 2022-04-15 Monocular three-dimensional gesture estimation method based on full convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210397216.0A CN114882524A (en) 2022-04-15 2022-04-15 Monocular three-dimensional gesture estimation method based on full convolution neural network

Publications (1)

Publication Number Publication Date
CN114882524A true CN114882524A (en) 2022-08-09

Family

ID=82669988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210397216.0A Pending CN114882524A (en) 2022-04-15 2022-04-15 Monocular three-dimensional gesture estimation method based on full convolution neural network

Country Status (1)

Country Link
CN (1) CN114882524A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023116620A1 (en) * 2021-12-22 2023-06-29 北京字跳网络技术有限公司 Gesture data annotation method and apparatus
CN115840507A (en) * 2022-12-20 2023-03-24 北京帮威客科技有限公司 Large-screen equipment interaction method based on 3D image control
CN115840507B (en) * 2022-12-20 2024-05-24 北京帮威客科技有限公司 Large-screen equipment interaction method based on 3D image control
CN115953839A (en) * 2022-12-26 2023-04-11 广州紫为云科技有限公司 Real-time 2D gesture estimation method based on loop architecture and coordinate system regression
CN115953839B (en) * 2022-12-26 2024-04-12 广州紫为云科技有限公司 Real-time 2D gesture estimation method based on loop architecture and key point regression

Similar Documents

Publication Publication Date Title
CN111339903B (en) Multi-person human body posture estimation method
CN108710830B (en) Human body 3D posture estimation method combining dense connection attention pyramid residual error network and isometric limitation
CN108399419B (en) Method for recognizing Chinese text in natural scene image based on two-dimensional recursive network
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN110246181B (en) Anchor point-based attitude estimation model training method, attitude estimation method and system
CN111652892A (en) Remote sensing image building vector extraction and optimization method based on deep learning
CN114882524A (en) Monocular three-dimensional gesture estimation method based on full convolution neural network
CN114863573B (en) Category-level 6D attitude estimation method based on monocular RGB-D image
CN113807355A (en) Image semantic segmentation method based on coding and decoding structure
CN113159232A (en) Three-dimensional target classification and segmentation method
CN113516693A (en) Rapid and universal image registration method
CN111914595B (en) Human hand three-dimensional attitude estimation method and device based on color image
Wang et al. Paccdu: pyramid attention cross-convolutional dual unet for infrared and visible image fusion
Wu et al. Meta transfer learning-based super-resolution infrared imaging
CN111414988B (en) Remote sensing image super-resolution method based on multi-scale feature self-adaptive fusion network
CN113240584A (en) Multitask gesture picture super-resolution method based on picture edge information
CN115115860A (en) Image feature point detection matching network based on deep learning
AU2021104479A4 (en) Text recognition method and system based on decoupled attention mechanism
CN114821192A (en) Remote sensing image elevation prediction method combining semantic information
CN113435398B (en) Signature feature identification method, system, equipment and storage medium based on mask pre-training model
Song et al. Spatial-Aware Dynamic Lightweight Self-Supervised Monocular Depth Estimation
CN113486718A (en) Fingertip detection method based on deep multitask learning
CN113450364A (en) Tree-shaped structure center line extraction method based on three-dimensional flux model
CN113239835A (en) Model-aware gesture migration method
CN115511968B (en) Two-dimensional hand posture estimation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination