CN114882524A - Monocular three-dimensional gesture estimation method based on full convolution neural network

Monocular three-dimensional gesture estimation method based on full convolution neural network

Info

Publication number
CN114882524A
Authority
CN
China
Prior art keywords
dimensional
network
convolution
gesture estimation
dimensional gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210397216.0A
Other languages
Chinese (zh)
Inventor
刘星言
康文雄
林亿鸿
邓飞其
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210397216.0A priority Critical patent/CN114882524A/en
Publication of CN114882524A publication Critical patent/CN114882524A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • G06V40/11Hand-related biometrics; Hand pose recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular three-dimensional gesture estimation method based on a full convolution neural network, which comprises the following steps: acquiring a hand image and preprocessing the image; constructing a full convolution three-dimensional gesture estimation network and training it; inputting the preprocessed image into the full convolution three-dimensional gesture estimation network to predict the final two-dimensional coordinates of the hand key points and the relative depth of each key point; and post-processing the predicted two-dimensional coordinates and relative depths to calculate the three-dimensional coordinates of the hand key points. By decoupling hand scale information from the network's depth prediction for the hand key points, the method effectively alleviates the scale-uncertainty problem in monocular three-dimensional gesture estimation; in practical applications, accurate prior scale information can be acquired to accurately restore the actual depth of the hand key points in the scene relative to the imaging device, effectively raising the accuracy upper bound of the three-dimensional gesture estimation method and its generalization ability across scenes.

Description

Monocular three-dimensional gesture estimation method based on full convolution neural network
Technical Field
The invention relates to the field of computer vision, in particular to a monocular three-dimensional gesture estimation method based on a full convolution neural network.
Background
The hand's high degree of freedom and natural, intuitive character make it an important research object for intelligent human-computer interaction. A vision-based gesture interaction system frees the user from dependence on interaction middleware: gestures can be used directly to interact with real or virtual scenes and to operate intelligent terminals, greatly improving continuity and convenience of use. In recent years, technology giants such as Google, Microsoft and Facebook have invested substantial research resources in intelligent interactive wearable terminals based on Augmented Reality (AR), Virtual Reality (VR) and Mixed Reality (MR), in which vision-based gesture interaction is one of the core technologies.
Gesture estimation refers to predicting the positions of hand key points from hand images. Monocular three-dimensional gesture estimation requires predicting the three-dimensional positions of the hand key points in imaging space from a single hand color image or depth map, which is a very challenging task. Gesture estimation is an important link in visual gesture interaction: it helps a computer capture the relative positions of the hand and other real or virtual objects in a scene, so that changes in a real scene can be analyzed and predicted, or corresponding feedback can be given in a virtual world. Monocular three-dimensional gesture estimation places low requirements on the input image, and the overall approach offers high flexibility, broad application prospects and practical value.
The most similar prior art to the present invention:
Monocular three-dimensional gesture estimation: the task requires predicting the three-dimensional spatial coordinates of hand key points from a single input hand image, and is a high-level visual understanding task. The high degree of freedom of hand pose, self-occlusion, and the three-dimensional scale uncertainty of a single image make the task difficult. Existing deep-learning-based monocular three-dimensional gesture estimation methods can be divided by input type into color-image-based and depth-map-based methods. Depth-map-based methods can exploit the depth information of the hand surface, which alleviates the scale-uncertainty problem to some extent and achieves good results. Color-image-based methods can be further divided, by how they are realized, into learning-based and model-based methods. The former designs a complex neural network and trains it on a large amount of image data with accurate three-dimensional key point labels, so that the network regresses the key point coordinates directly from the image; the latter introduces a parameterized hand model, uses a neural network to regress the model parameters, and performs semi-supervised training on hand images, which can achieve a higher accuracy upper bound.
The prior art has the following disadvantages:
1. Existing depth-map-based monocular three-dimensional gesture estimation methods achieve high overall accuracy, but they depend on high-quality depth maps, whose imaging conditions are demanding and easily disturbed by the scene, which greatly limits the application scenarios of monocular gesture estimation.
2. The neural network structures used by existing learning-based monocular color-image three-dimensional gesture estimation methods are complex, some of their operators are not compatible with mainstream deployment frameworks, they depend heavily on training with a large amount of accurately labeled data, their results are easily affected by the scale-uncertainty problem, and their generalization ability is poor. (Yang L, Li S, Lee D, et al. Aligning latent spaces for 3D hand pose estimation [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 2335-.)
3. Existing model-based monocular color-image three-dimensional gesture estimation methods rely on a neural network to regress hand model parameters and supervise it indirectly through the hand color image. This makes the whole method harder to train, makes the training result sensitive to model parameter initialization and to the indirect image supervision scheme, increases the development difficulty of practical applications, and is not conducive to application expansion. (Boukhayma A, de Bem R, Torr P H S. 3D hand shape and pose from images in the wild [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 10843-.)
4. Existing monocular color-image three-dimensional gesture estimation methods consider only accuracy improvements in the network structure design, without considering operating requirements under different resource conditions, and therefore struggle to run in real time in low-resource scenarios.
Disclosure of Invention
The invention provides a monocular three-dimensional gesture estimation method based on a full convolution neural network, aiming to solve the problems that monocular three-dimensional gesture estimation from a hand color image is easily affected by scale uncertainty, that existing three-dimensional gesture estimation networks have complex structures and poor extensibility, and that their real-time operating efficiency is low.
The invention is realized by at least one of the following technical schemes.
A monocular three-dimensional gesture estimation method based on a full convolution neural network comprises the following steps:
s1, acquiring a hand image, and preprocessing the hand image;
s2, constructing a full convolution three-dimensional gesture estimation network, and training the full convolution three-dimensional gesture estimation network;
s3, inputting the preprocessed image into a full-convolution three-dimensional gesture estimation network to predict the final two-dimensional coordinates of key points and the relative depth of each key point;
and S4, performing post-processing on the two-dimensional coordinates and the relative depth of the predicted key points, and calculating the three-dimensional coordinates of the key points of the hand.
Further, the pre-processing includes image scaling and image filling.
Further, the full convolution three-dimensional gesture estimation network comprises the following units:
an input compression unit, for extracting coarse local features from the input image while reducing its resolution;
a basic convolution unit, for extracting basic features;
a down-sampling unit, for spatially down-sampling the feature map and enlarging its receptive field;
an up-sampling unit, for spatially up-sampling the feature map, restoring its spatial information with the help of lateral connections from the down-sampling units, and further enriching semantic features;
and an output unit, for extracting information from the final high-resolution feature map and predicting the final two-dimensional coordinates and normalized relative depths of the hand key points.
Further, the basic convolution unit includes:
1) a 7 × 7 channel-by-channel convolution: a large-kernel depthwise (channel-by-channel) convolution that keeps the number of channels of the input feature map unchanged;
2) a layer normalization operation: normalizes the input feature map using the mean and variance of all its pixel values, with learnable parameters that participate in training;
3) a 1 × 1 convolution: used to expand or reduce the number of feature-map channels by a factor of 4;
4) an activation function: a GELU function is used to calculate the output activation value of the convolution layer;
5) a residual connection: the unit input is added to the activated output to form the final unit output.
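This unit has the same structure as a ConvNeXt-style block. The following is a minimal PyTorch sketch of one reading of the unit; the module name and the way layer normalization is applied over the channel dimension are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class BasicConvUnit(nn.Module):
    """Hypothetical sketch of the basic convolution unit:
    7x7 depthwise conv -> LayerNorm -> 1x1 conv (C -> 4C) -> GELU -> 1x1 conv (4C -> C) -> residual add."""
    def __init__(self, channels: int):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=7, padding=3, groups=channels)
        self.norm = nn.LayerNorm(channels)                      # normalizes over the channel dimension
        self.expand = nn.Conv2d(channels, 4 * channels, kernel_size=1)
        self.act = nn.GELU()
        self.reduce = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        # LayerNorm expects the normalized dimension last, so move channels to the end and back
        x = x.permute(0, 2, 3, 1)
        x = self.norm(x)
        x = x.permute(0, 3, 1, 2)
        x = self.reduce(self.act(self.expand(x)))
        return x + residual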
Further, the output unit includes:
1) a two-dimensional coordinate prediction branch: features are extracted from the feature map with a 3 × 3 convolution kernel to obtain hand key point heatmaps, and the highest-response position in each heatmap is obtained through a differentiable soft-argmax operation, giving the predicted two-dimensional coordinates of the key points;
2) a relative depth prediction branch: features are extracted from the feature map with a 3 × 3 convolution kernel to obtain a latent depth map for each hand key point, and the normalized relative depth value of each key point is obtained through global spatial average pooling.
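As a concrete illustration of these two branches, the sketch below implements a soft-argmax over per-keypoint heatmaps and a globally average-pooled relative-depth head in PyTorch. The module name, tensor shapes, and the absence of a softmax temperature are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputUnit(nn.Module):
    """Hypothetical sketch: heatmap branch (soft-argmax -> 2D coords) and
    latent depth branch (global spatial average pooling -> normalized relative depth)."""
    def __init__(self, in_channels: int, num_keypoints: int):
        super().__init__()
        self.heatmap_conv = nn.Conv2d(in_channels, num_keypoints, kernel_size=3, padding=1)
        self.depth_conv = nn.Conv2d(in_channels, num_keypoints, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor):
        n, _, h, w = feat.shape
        heatmaps = self.heatmap_conv(feat)                          # (N, K, H, W)
        probs = F.softmax(heatmaps.flatten(2), dim=-1).view(n, -1, h, w)
        # soft-argmax: expected (u, v) position under the per-keypoint softmax distribution
        us = torch.linspace(0, w - 1, w, device=feat.device)
        vs = torch.linspace(0, h - 1, h, device=feat.device)
        u = (probs.sum(dim=2) * us).sum(dim=-1)                     # (N, K)
        v = (probs.sum(dim=3) * vs).sum(dim=-1)                     # (N, K)
        coords_2d = torch.stack([u, v], dim=-1)                     # (N, K, 2), heatmap-pixel units
        depth_maps = self.depth_conv(feat)                          # (N, K, H, W) latent depth maps
        rel_depth = depth_maps.mean(dim=(2, 3))                     # (N, K) normalized relative depths
        return coords_2d, rel_depth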
Further, the training of the full-convolution three-dimensional gesture estimation network comprises the following steps:
21) image preprocessing: scaling and filling input images participating in training, and performing data enhancement;
22) label preprocessing: during training, the hand key point labels are processed mainly as follows: the two-dimensional key point labels are modified to match the image-rotation data enhancement, and the absolute depth labels are converted into normalized relative depth labels; specifically, the depth label of the root key point is subtracted from the depth label of each key point, and the result is divided by the hand reference length;
23) network forward reasoning: inputting the processed image into a full-convolution three-dimensional gesture estimation network to obtain a predicted two-dimensional coordinate and a predicted relative depth of a key point;
24) loss calculation: the output of the network forward reasoning and the preprocessed labels are fed into the loss function to obtain the loss function value;
25) gradient back propagation: calculating a gradient value of the loss function value relative to a network parameter of the full-convolution three-dimensional gesture estimation network, and updating the network parameter by using a back propagation algorithm;
26) iterative training: and repeating the steps in batches on the input images until the loss function value is not reduced any more, and finishing the full convolution three-dimensional gesture estimation network training.
Further, in the full convolution three-dimensional gesture estimation network training process, a Smooth-L1 loss function is adopted to calculate errors between two-dimensional coordinates and relative depths of key points predicted by the network and labels, the gradients of the errors relative to all network parameters are calculated, and the network parameters are updated through a back propagation algorithm.
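A minimal PyTorch sketch of this loss computation is given below; the equal weighting of the two terms, the tensor shapes, and the function name are assumptions not specified in the text.

```python
import torch
import torch.nn.functional as F

def pose_loss(pred_uv, pred_rel_z, gt_uv, gt_rel_z, depth_weight: float = 1.0):
    """Smooth-L1 loss on predicted 2D coordinates and normalized relative depths.
    pred_uv / gt_uv: (N, K, 2); pred_rel_z / gt_rel_z: (N, K)."""
    loss_uv = F.smooth_l1_loss(pred_uv, gt_uv)
    loss_z = F.smooth_l1_loss(pred_rel_z, gt_rel_z)
    return loss_uv + depth_weight * loss_z
```

In training, gradients of this scalar with respect to all network parameters are obtained by calling backward() on it and applying an optimizer step, as described above.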
Further, data enhancement adopted in image preprocessing in the full convolution three-dimensional gesture estimation network training process comprises the following steps:
1) random image rotation: randomly rotating the input image by-180 degrees to 180 degrees around the central point;
2) random image flipping: randomly turning an input image horizontally and vertically;
3) random color transformation: carrying out random value scaling on HSV channels of the input image within the ranges of 75% -125%, 50% -150% and 50% -150% respectively;
4) random noise enhancement: applying a Gaussian noise with a mean value of 0 and a variance of 0-0.1 to each position in the input image with a probability of 50%.
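The four augmentations above can be sketched with OpenCV and NumPy as follows. The assumption of an 8-bit BGR input, the 50% probability for each flip, and treating the noise variance relative to the normalized intensity range are illustrative choices; keypoint labels would have to be transformed consistently (omitted here).

```python
import cv2
import numpy as np

def augment(img: np.ndarray) -> np.ndarray:
    """Sketch of random rotation, random flips, random HSV scaling and random Gaussian noise."""
    h, w = img.shape[:2]
    # 1) random rotation about the image centre, -180..180 degrees
    angle = np.random.uniform(-180, 180)
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    img = cv2.warpAffine(img, m, (w, h))
    # 2) random horizontal / vertical flips
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 1)
    if np.random.rand() < 0.5:
        img = cv2.flip(img, 0)
    # 3) random HSV scaling: H in 75%-125%, S and V in 50%-150%
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] *= np.random.uniform(0.75, 1.25)
    hsv[..., 1] *= np.random.uniform(0.50, 1.50)
    hsv[..., 2] *= np.random.uniform(0.50, 1.50)
    img = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    # 4) zero-mean Gaussian noise (variance drawn from 0-0.1), applied with 50% probability
    if np.random.rand() < 0.5:
        sigma = np.sqrt(np.random.uniform(0.0, 0.1))
        noisy = img.astype(np.float32) / 255.0 + np.random.normal(0.0, sigma, img.shape)
        img = (np.clip(noisy, 0.0, 1.0) * 255.0).astype(np.uint8)
    return img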
Further, the post-processing comprises:
1) converting the normalized relative depths of the key points output by the full convolution three-dimensional gesture estimation network into absolute depths of the key points;
2) calculating the three-dimensional coordinates of the key points by using the two-dimensional coordinates of the key points in the image output by the full convolution three-dimensional gesture estimation network and the converted absolute depths of the key points.
Further, the specific process of calculating the three-dimensional coordinates of the key points comprises:
[X_i, Y_i, Z_i]^T = z_i \cdot K^{-1} [u_i, v_i, 1]^T

wherein (X_i, Y_i, Z_i) are the three-dimensional coordinates of the i-th key point, (u_i, v_i) are the two-dimensional coordinates of the i-th key point output by the full convolution three-dimensional gesture estimation network, z_i is the absolute depth of the i-th key point, and K is the intrinsic parameter matrix of the imaging device.
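A minimal NumPy sketch of this back-projection is given below, assuming the intrinsic matrix is known and the absolute depths have already been recovered in the post-processing step; the function name is illustrative.

```python
import numpy as np

def backproject(uv: np.ndarray, z: np.ndarray, intrinsics: np.ndarray) -> np.ndarray:
    """uv: (N, 2) pixel coordinates, z: (N,) absolute depths, intrinsics: (3, 3) camera matrix K.
    Returns (N, 3) camera-space coordinates: [X_i, Y_i, Z_i]^T = z_i * K^-1 [u_i, v_i, 1]^T."""
    ones = np.ones((uv.shape[0], 1))
    homo = np.concatenate([uv, ones], axis=1)        # (N, 3) homogeneous pixel coordinates
    rays = homo @ np.linalg.inv(intrinsics).T        # (N, 3) normalized camera rays
    return rays * z[:, None]                         # scale each ray by its absolute depth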
Compared with the prior art, the invention has the beneficial effects that:
1. according to the method, the hand scale information and the depth prediction process of the neural network on the hand key points are decoupled, the scale uncertainty problem in monocular three-dimensional gesture estimation can be effectively solved, and the actual depth of the hand key points in a scene relative to the imaging equipment can be accurately restored by acquiring accurate prior scale information in practical application, so that the precision upper limit of the three-dimensional gesture estimation method and the generalization capability of the three-dimensional gesture estimation method on the scene are effectively improved.
2. The invention constructs the three-dimensional gesture estimation network from basic convolution operations in a fully convolutional structure, simplifying the network design process for deep-learning-based three-dimensional gesture estimation applications and allowing the network to be optimized and accelerated by most existing deployment frameworks. Moreover, the number of modules and the number of convolution-layer channels can be scaled proportionally, so that the network adapts to application scenarios with different resource conditions and the difficulty of subsequent maintenance and optimization of the algorithm is reduced.
3. According to the invention, a common lightweight design is introduced into the three-dimensional gesture estimation network structure, so that the operation efficiency of the three-dimensional gesture estimation network is greatly improved, and the real-time three-dimensional gesture estimation can be realized in a low-resource scene.
Drawings
FIG. 1 is a diagram illustrating a network structure for full-convolution three-dimensional gesture estimation according to the present embodiment;
FIG. 2 is a diagram of an input compression unit in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
FIG. 3 is a diagram of a basic convolution unit structure in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
FIG. 4 is a diagram of an upsampling unit structure in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
FIG. 5 is a block diagram of a downsampling unit in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
FIG. 6 is a diagram of an output unit structure in the fully-convolved three-dimensional gesture estimation network according to this embodiment;
fig. 7 is a training and reasoning flowchart of the fully-convoluted three-dimensional gesture estimation network according to the embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Example 1
The full convolution three-dimensional gesture estimation network provided by the invention supports the complete implementation of a monocular three-dimensional gesture estimation application, which consists mainly of a training process and an inference process for the neural network; once training is completed, the network can be used for testing and inference. As shown in fig. 7, a monocular three-dimensional gesture estimation method based on a full convolution neural network includes the following steps:
S1, a full convolution three-dimensional gesture estimation network is constructed and trained; the full convolution three-dimensional gesture estimation network comprises the following units:
An input compression unit: its composition is shown in fig. 2. The input image is first processed by a convolution layer with a 4 × 4 kernel and stride 4 to obtain a compressed feature map, and an intra-layer normalization operation then adjusts the mean and variance of the compressed feature map through learnable parameters.
Basic convolution unit: the basic unit for feature extraction, composed as shown in fig. 3. The input C-channel feature map is first processed by a 7 × 7 channel-by-channel convolution, followed by intra-layer normalization; a 1 × 1 convolution then increases the number of channels to 4C, a GELU activation function is applied, another 1 × 1 convolution reduces the number of channels back to C, and the result is summed with the unit input to form the output.
A down-sampling unit: its structure is shown in fig. 5. The input feature map is first normalized within the layer, and then processed by a convolution layer with a 2 × 2 kernel and stride 2 to obtain a feature map with half the spatial resolution.
An up-sampling unit: it spatially up-samples the feature map; its structure is shown in fig. 4. The input feature map is first normalized within the layer, and a deconvolution with a 2 × 2 kernel and stride 2 then doubles the spatial resolution, realizing spatial up-sampling of the feature map.
An output unit: its structure is shown in fig. 6. One branch converts the feature map into a heatmap for each hand key point through a convolution with a 3 × 3 kernel and stride 1, and the highest-response position in each heatmap is obtained through a differentiable soft-argmax operation, giving the predicted two-dimensional coordinates of the key points; the other branch converts the feature map into a latent depth map for each key point through a convolution with the same 3 × 3 kernel and stride 1, and the normalized relative depth value of each key point is obtained through global spatial average pooling.
As shown in fig. 1, the overall structure of the full convolution three-dimensional gesture estimation network is divided into a down-sampling stage and an up-sampling stage. In the down-sampling stage, the input image is first compressed by the input compression unit and then processed by three serially connected down-sampling feature extraction modules; each down-sampling feature extraction module consists of N_i basic convolution units with C_i feature channels followed by a down-sampling unit, where i = 1, 2, 3 in order of decreasing feature-map spatial resolution. The resulting feature map is further processed by N_4 basic convolution units with C_4 channels and then enters the up-sampling stage. In the up-sampling stage, the feature map first passes through N_4 basic convolution units and is then processed by three serially connected up-sampling feature extraction modules; each up-sampling feature extraction module consists of N_i basic convolution units with C_i feature channels and an up-sampling unit in series, and it receives and sums the output of the down-sampling feature extraction module with the same spatial resolution (lateral connection). Finally, the output unit produces the predicted two-dimensional coordinates and normalized relative depth values of the hand key points. In this example, N_1, N_2, N_3, N_4, C_1, C_2, C_3 and C_4 are set to 1, 48, 96, 192 and 384 respectively, and the resulting full convolution three-dimensional gesture estimation network is suitable for application scenarios with high requirements on operating efficiency. A minimal sketch of how these units can be assembled is given below.
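The following PyTorch sketch wires the units together, reusing the hypothetical BasicConvUnit and OutputUnit sketches given earlier in this description. The single-group GroupNorm stand-in for layer normalization, the reading N_1 = N_2 = N_3 = N_4 = 1 with (C_1, C_2, C_3, C_4) = (48, 96, 192, 384), and the default of 21 hand key points are assumptions for illustration, not the patented implementation.

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Normalization followed by a 2x2, stride-2 convolution (halves spatial resolution)."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, c_in)        # simple stand-in for the intra-layer normalization
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=2, stride=2)

    def forward(self, x):
        return self.conv(self.norm(x))

class Upsample(nn.Module):
    """Normalization followed by a 2x2, stride-2 deconvolution (doubles spatial resolution)."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, c_in)
        self.deconv = nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)

    def forward(self, x):
        return self.deconv(self.norm(x))

class FullConvPoseNet(nn.Module):
    """Hypothetical assembly of the full convolution three-dimensional gesture estimation network."""
    def __init__(self, num_keypoints: int = 21, n=(1, 1, 1, 1), c=(48, 96, 192, 384)):
        super().__init__()
        # input compression: 4x4 convolution with stride 4, then normalization
        self.stem = nn.Sequential(nn.Conv2d(3, c[0], kernel_size=4, stride=4), nn.GroupNorm(1, c[0]))
        self.enc = nn.ModuleList([nn.Sequential(*[BasicConvUnit(c[i]) for _ in range(n[i])]) for i in range(3)])
        self.down = nn.ModuleList([Downsample(c[i], c[i + 1]) for i in range(3)])
        self.bottleneck = nn.Sequential(*[BasicConvUnit(c[3]) for _ in range(n[3])])
        self.up = nn.ModuleList([Upsample(c[i + 1], c[i]) for i in (2, 1, 0)])
        self.dec = nn.ModuleList([nn.Sequential(*[BasicConvUnit(c[i]) for _ in range(n[i])]) for i in (2, 1, 0)])
        self.head = OutputUnit(c[0], num_keypoints)

    def forward(self, x):
        x = self.stem(x)
        skips = []
        for enc, down in zip(self.enc, self.down):
            x = enc(x)
            skips.append(x)              # lateral connection kept at this spatial resolution
            x = down(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = up(x) + skip             # sum with the matching-resolution down-sampling output
            x = dec(x)
        # returns (2D coords in feature-map pixels, normalized relative depths);
        # scaling coordinates back to the input image resolution is omitted here
        return self.head(x)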
The training of the full convolution three-dimensional gesture estimation network comprises the following steps:
1) Image preprocessing: the input images used for training are padded to a 1:1 aspect ratio and scaled to 256 × 256, and data enhancement operations including random image rotation, random image flipping, random color transformation and random noise enhancement are applied.
2) Label preprocessing: during training, the hand key point labels are processed mainly as follows: the two-dimensional key point labels are modified to match the image-rotation data enhancement, and the absolute depth labels are converted into normalized relative depth labels; specifically, the depth label of the root key point is subtracted from the depth label of each key point and the result is divided by the hand reference length (a sketch of this conversion is given after this list).
3) Network forward reasoning: the processed images are input into the full convolution three-dimensional gesture estimation network to obtain the predicted two-dimensional coordinates and relative depths of the key points.
4) Loss calculation: the output of the network forward reasoning and the preprocessed labels are fed into the Smooth L1 loss function to obtain the loss function value.
5) Gradient back propagation: the gradients of the loss value with respect to the network parameters of the full convolution three-dimensional gesture estimation network are calculated, and the network parameters are updated with the back propagation algorithm.
6) Iterative training: the above steps are repeated over batches of input images until the loss value no longer decreases, at which point training of the full convolution three-dimensional gesture estimation network is complete.
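The normalized relative depth conversion in step 2) can be written as in the sketch below; the function name, argument layout and the use of a precomputed hand reference length are assumptions for illustration.

```python
import numpy as np

def normalize_depth_labels(z_abs: np.ndarray, root_idx: int, ref_len: float) -> np.ndarray:
    """z_abs: (K,) absolute keypoint depth labels; root_idx: index of the root key point;
    ref_len: hand reference length taken from the annotation.
    Returns the normalized relative depth labels (z_i - z_root) / l_ref."""
    return (z_abs - z_abs[root_idx]) / ref_len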
S2, a hand image is acquired and preprocessed: the input image is padded to a 1:1 aspect ratio and scaled to a resolution of 256 × 256.
S3, the preprocessed image is input into the trained full convolution three-dimensional gesture estimation network to obtain the two-dimensional coordinates and normalized relative depth values of the hand key points.
S4, the two-dimensional coordinates and relative depth values of the hand key points predicted by the full convolution three-dimensional gesture estimation network are post-processed, and the three-dimensional coordinates of the key points are calculated, completing the gesture estimation task. The post-processing comprises the following steps:
1) The normalized relative depth is converted into absolute depth. The provided hand reference length (which in practice can be obtained by other means within the scene and is related to scene factors) is first substituted into the following equation:

|| z_r K^{-1} [u_r, v_r, 1]^T - (z_r + d_m \cdot l_ref) K^{-1} [u_m, v_m, 1]^T ||_2 = l_ref

wherein l_ref denotes the hand reference length, K is the intrinsic parameter matrix of the imaging device, z_r is the absolute depth of the root key point, (u_r, v_r) are the two-dimensional coordinates of the root key point predicted by the network, (u_m, v_m) are the two-dimensional coordinates of the middle-finger metacarpophalangeal (MCP) joint predicted by the network (i.e. the other joint related to the hand reference length), and d_m is the normalized relative depth value of that joint predicted by the network. Solving this equation yields the absolute depth z_r of the root key point; combined with the normalized relative depth values predicted by the network, the absolute depths of all key points can then be restored:

z_i = z_r + d_i \cdot l_ref

wherein d_i denotes the normalized relative depth value of the i-th key point predicted by the network, and z_i is the corresponding absolute depth value. (A numerical sketch of this recovery step is given after these post-processing steps.)
2) The three-dimensional coordinates of the key points are calculated:

[X_i, Y_i, Z_i]^T = z_i \cdot K^{-1} [u_i, v_i, 1]^T

wherein (X_i, Y_i, Z_i) are the three-dimensional coordinates of the i-th key point, (u_i, v_i) are the two-dimensional coordinates of the i-th key point output by the network, z_i is the corresponding absolute depth, and K is the intrinsic parameter matrix of the imaging device.
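A NumPy sketch of the depth-recovery step is given below. It assumes the recovery is based on the constraint that the reconstructed 3D distance between the root key point and the middle-finger MCP joint equals the hand reference length, which leads to a quadratic equation in z_r whose positive root is kept; the function names are illustrative.

```python
import numpy as np

def recover_root_depth(uv_root, uv_mcp, d_mcp, ref_len, K):
    """Solve || z_r * K^-1 [u_r, v_r, 1]^T - (z_r + d_m * l_ref) * K^-1 [u_m, v_m, 1]^T || = l_ref
    for the absolute root depth z_r (quadratic in z_r, positive root kept)."""
    Kinv = np.linalg.inv(K)
    p_r = Kinv @ np.array([uv_root[0], uv_root[1], 1.0])
    p_m = Kinv @ np.array([uv_mcp[0], uv_mcp[1], 1.0])
    diff = p_r - p_m
    # squared norm of (z_r * diff - d_m * l_ref * p_m) must equal l_ref^2
    a = diff @ diff
    b = -2.0 * d_mcp * ref_len * (diff @ p_m)
    c = (d_mcp * ref_len) ** 2 * (p_m @ p_m) - ref_len ** 2
    disc = max(b * b - 4.0 * a * c, 0.0)
    return (-b + np.sqrt(disc)) / (2.0 * a)

def recover_absolute_depths(rel_depths, z_root, ref_len):
    """z_i = z_r + d_i * l_ref for every key point."""
    return z_root + rel_depths * ref_len
```

The recovered absolute depths can then be passed, together with the predicted 2D coordinates and the intrinsic matrix, to the back-projection of step 2).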
The full convolution three-dimensional gesture estimation network has the following characteristic: according to the requirements of the application scenario, the number N of serially connected basic convolution units and the number of channels C inside the basic convolution units can be adjusted, changing the depth and width of the full convolution three-dimensional gesture estimation network and thus the performance of the whole method, so that it can be deployed in application scenarios with different resource conditions.
Example 2
The network structure design parameters N_1, N_2, N_3, N_4, C_1, C_2, C_3 and C_4 are changed to 2, 96, 144, 216 and 324 respectively; the remaining steps are the same as in embodiment 1. The resulting full convolution three-dimensional gesture estimation network is suitable for application scenarios with balanced requirements on operating efficiency and accuracy.
Example 3
The network structure design parameters N_1, N_2, N_3, N_4, C_1, C_2, C_3 and C_4 are changed to 3, 2, 96, 192, 384 and 768 respectively; the remaining steps are the same as in embodiment 1. The resulting full convolution three-dimensional gesture estimation network is suitable for application scenarios with high requirements on operating accuracy and low requirements on efficiency.
The above embodiments are only for explaining the details to help understanding the technical solution of the present invention, and it is obvious to those skilled in the art that any modifications and substitutions made without departing from the principle of the present invention belong to the protection scope of the present invention.

Claims (10)

1. A monocular three-dimensional gesture estimation method based on a full convolution neural network is characterized by comprising the following steps:
s1, acquiring a hand image, and preprocessing the hand image;
s2, constructing a full convolution three-dimensional gesture estimation network, and training the full convolution three-dimensional gesture estimation network;
s3, inputting the preprocessed image into a full-convolution three-dimensional gesture estimation network to predict the final two-dimensional coordinates of key points and the relative depth of each key point;
and S4, performing post-processing on the two-dimensional coordinates and the relative depth of the predicted key points, and calculating the three-dimensional coordinates of the key points of the hand.
2. The method of claim 1, wherein the preprocessing comprises image scaling and image filling.
3. The monocular three-dimensional gesture estimation method based on full convolution neural network according to claim 1, wherein the full convolution three-dimensional gesture estimation network comprises the following units:
an input compression unit for extracting local features in an input image while reducing a resolution of the input image;
the basic convolution unit is used for extracting basic features of the image;
the down-sampling unit is used for carrying out spatial down-sampling on the feature map and improving the receptive field of the feature map;
the up-sampling unit is used for carrying out spatial up-sampling on the feature map, recovering the feature map spatial information by matching with the transverse connection output by the down-sampling unit and further enriching semantic features;
and the output unit is used for extracting information from the final high-resolution feature map and predicting the final two-dimensional coordinates and normalized relative depth of the hand key points.
4. The monocular three-dimensional gesture estimation method based on full convolution neural network of claim 3, wherein the basic convolution unit comprises:
1) a 7 × 7 channel-by-channel convolution: a large-kernel depthwise (channel-by-channel) convolution that keeps the number of channels of the input feature map unchanged;
2) a layer normalization operation: normalizes the input feature map using the mean and variance of all its pixel values, with learnable parameters that participate in training;
3) a 1 × 1 convolution: used to expand or reduce the number of feature-map channels by a factor of 4;
4) an activation function: a GELU function is used to calculate the output activation value of the convolution layer;
5) a residual connection: the unit input is added to the activated output to form the final unit output.
5. The monocular three-dimensional gesture estimation method based on full convolution neural network of claim 3, wherein the output unit comprises:
1) a two-dimensional coordinate prediction branch: features are extracted from the feature map with a 3 × 3 convolution kernel to obtain hand key point heatmaps, and the highest-response position in each heatmap is obtained through a differentiable soft-argmax operation, giving the predicted two-dimensional coordinates of the key points;
2) a relative depth prediction branch: features are extracted from the feature map with a 3 × 3 convolution kernel to obtain a latent depth map for each hand key point, and the normalized relative depth value of each key point is obtained through global spatial average pooling.
6. The monocular three-dimensional gesture estimation method based on full convolution neural network of claim 1, wherein the training of the full convolution three-dimensional gesture estimation network comprises the following steps:
21) image preprocessing: scaling and filling input images participating in training, and performing data enhancement;
22) label preprocessing: during training, the hand key point labels are processed mainly as follows: the two-dimensional key point labels are modified to match the image-rotation data enhancement, and the absolute depth labels are converted into normalized relative depth labels; specifically, the depth label of the root key point is subtracted from the depth label of each key point, and the result is divided by the hand reference length;
23) network forward reasoning: inputting the processed image into a full-convolution three-dimensional gesture estimation network to obtain a predicted two-dimensional coordinate and a predicted relative depth of a key point;
24) loss calculation: the output of the network forward reasoning and the preprocessed labels are fed into the loss function to obtain the loss function value;
25) gradient back propagation: calculating a gradient value of the loss function value relative to a network parameter of the full-convolution three-dimensional gesture estimation network, and updating the network parameter by using a back propagation algorithm;
26) iterative training: and repeating the steps in batches on the input images until the loss function value is not reduced any more, and finishing the full convolution three-dimensional gesture estimation network training.
7. The monocular three-dimensional gesture estimation method based on the full convolution neural network as recited in claim 6, wherein the full convolution three-dimensional gesture estimation network training process adopts a Smooth-L1 loss function to calculate errors between a two-dimensional coordinate of a key point predicted by the network and a relative depth and a label, calculates gradients of the errors relative to each network parameter, and updates the network parameters through a back propagation algorithm.
8. The monocular three-dimensional gesture estimation method based on the full convolution neural network of claim 6, wherein data enhancement adopted in image preprocessing in the full convolution three-dimensional gesture estimation network training process comprises:
1) random image rotation: randomly rotating the input image by-180 degrees to 180 degrees around the central point;
2) random image flipping: randomly turning an input image horizontally and vertically;
3) random color transformation: carrying out random value scaling on HSV channels of the input image within the ranges of 75% -125%, 50% -150% and 50% -150% respectively;
4) random noise enhancement: applying a Gaussian noise with a mean value of 0 and a variance of 0-0.1 to each position in the input image with a probability of 50%.
9. The monocular three-dimensional gesture estimation method based on the full convolution neural network according to any one of claims 1 to 8, wherein the post-processing includes:
1) converting the normalized relative depths of the key points output by the full convolution three-dimensional gesture estimation network into absolute depths of the key points;
2) calculating the three-dimensional coordinates of the key points by using the two-dimensional coordinates of the key points in the image output by the full convolution three-dimensional gesture estimation network and the converted absolute depths of the key points.
10. The monocular three-dimensional gesture estimation method based on the full convolution neural network of claim 9, wherein the specific process of calculating the three-dimensional coordinates of the key points is as follows:
[X_i, Y_i, Z_i]^T = z_i \cdot K^{-1} [u_i, v_i, 1]^T

wherein (X_i, Y_i, Z_i) are the three-dimensional coordinates of the i-th key point, (u_i, v_i) are the two-dimensional coordinates of the i-th key point output by the full convolution three-dimensional gesture estimation network, z_i is the absolute depth of the i-th key point, and K is the intrinsic parameter matrix of the imaging device.
CN202210397216.0A 2022-04-15 2022-04-15 Monocular three-dimensional gesture estimation method based on full convolution neural network Pending CN114882524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210397216.0A CN114882524A (en) 2022-04-15 2022-04-15 Monocular three-dimensional gesture estimation method based on full convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210397216.0A CN114882524A (en) 2022-04-15 2022-04-15 Monocular three-dimensional gesture estimation method based on full convolution neural network

Publications (1)

Publication Number Publication Date
CN114882524A true CN114882524A (en) 2022-08-09

Family

ID=82669988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210397216.0A Pending CN114882524A (en) 2022-04-15 2022-04-15 Monocular three-dimensional gesture estimation method based on full convolution neural network

Country Status (1)

Country Link
CN (1) CN114882524A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023116620A1 (en) * 2021-12-22 2023-06-29 北京字跳网络技术有限公司 Gesture data annotation method and apparatus
CN115840507A (en) * 2022-12-20 2023-03-24 北京帮威客科技有限公司 Large-screen equipment interaction method based on 3D image control
CN115840507B (en) * 2022-12-20 2024-05-24 北京帮威客科技有限公司 Large-screen equipment interaction method based on 3D image control
CN115953839A (en) * 2022-12-26 2023-04-11 广州紫为云科技有限公司 Real-time 2D gesture estimation method based on loop architecture and coordinate system regression
CN115953839B (en) * 2022-12-26 2024-04-12 广州紫为云科技有限公司 Real-time 2D gesture estimation method based on loop architecture and key point regression

Similar Documents

Publication Publication Date Title
CN111339903B (en) Multi-person human body posture estimation method
CN108710830B (en) Human body 3D posture estimation method combining dense connection attention pyramid residual error network and isometric limitation
CN108399419B (en) Method for recognizing Chinese text in natural scene image based on two-dimensional recursive network
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN110246181B (en) Anchor point-based attitude estimation model training method, attitude estimation method and system
CN111652892A (en) Remote sensing image building vector extraction and optimization method based on deep learning
CN114882524A (en) Monocular three-dimensional gesture estimation method based on full convolution neural network
CN114863573B (en) Category-level 6D attitude estimation method based on monocular RGB-D image
CN113807355A (en) Image semantic segmentation method based on coding and decoding structure
CN113159232A (en) Three-dimensional target classification and segmentation method
CN113516693A (en) Rapid and universal image registration method
CN111914595B (en) Human hand three-dimensional attitude estimation method and device based on color image
Wang et al. Paccdu: pyramid attention cross-convolutional dual unet for infrared and visible image fusion
Wu et al. Meta transfer learning-based super-resolution infrared imaging
CN111414988B (en) Remote sensing image super-resolution method based on multi-scale feature self-adaptive fusion network
CN113240584A (en) Multitask gesture picture super-resolution method based on picture edge information
CN115115860A (en) Image feature point detection matching network based on deep learning
AU2021104479A4 (en) Text recognition method and system based on decoupled attention mechanism
CN114821192A (en) Remote sensing image elevation prediction method combining semantic information
CN113435398B (en) Signature feature identification method, system, equipment and storage medium based on mask pre-training model
Song et al. Spatial-Aware Dynamic Lightweight Self-Supervised Monocular Depth Estimation
CN113486718A (en) Fingertip detection method based on deep multitask learning
CN113450364A (en) Tree-shaped structure center line extraction method based on three-dimensional flux model
CN113239835A (en) Model-aware gesture migration method
CN115511968B (en) Two-dimensional hand posture estimation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination