CN114998520A - Three-dimensional interactive hand reconstruction method and system based on implicit expression


Info

Publication number
CN114998520A
CN114998520A (application CN202210619894.7A)
Authority
CN
China
Prior art keywords
implicit
query
reconstruction
region
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210619894.7A
Other languages
Chinese (zh)
Inventor
王雁刚 (Yangang Wang)
谢薇 (Wei Xie)
赵子萌 (Zimeng Zhao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202210619894.7A
Publication of CN114998520A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Graphics (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a three-dimensional interactive hand reconstruction method and system based on implicit expression. The method comprises the following steps: 1. constructing a feature extraction neural network, acquiring global and regional features from a single input color image, and further acquiring instance and joint features from the input image and the regional features; 2. selecting query points according to a query strategy using the acquired regional, instance and joint features, and constructing query conditions; 3. constructing a parameterized implicit neural network and performing implicit reconstruction based on the query points and query conditions; 4. physically optimizing the reconstructed model, penalizing physically implausible penetration, and adjusting and updating the reconstructed model; 5. iteratively optimizing the reconstructed model until the maximum penetration depth is less than 2 mm, and taking the optimized result as the final reconstruction result of the interacting hands. The method requires only a single color image, can reconstruct hands of any chirality and number in the image, realizes end-to-end modeling, and improves the quality of three-dimensional pose and shape reconstruction.

Description

Three-dimensional interactive hand reconstruction method and system based on implicit expression
Technical Field
The invention relates to the fields of computer vision and computer graphics, and in particular to a three-dimensional interactive hand reconstruction method and system based on implicit expression.
Background
In recent years, three-dimensional pose and shape reconstruction has become a popular research direction, with broad application prospects in virtual reality, robot control, motion-sensing games, and other fields. Interacting hands are everywhere in daily life and play an important role in the exchange of information between people, for example in individual emotional expression and multi-person collaboration. As with human activities in real life, interactions between hands are also important content in virtual reality. It is therefore necessary to study the reconstruction of interacting hands. Compared with a single hand, interacting hands are often heavily occluded, in close contact with each other, and share similar texture features. In addition, the solution space of interacting hands has more degrees of freedom.
Early methods tended to take additional depth information or multiple viewpoints as input. However, they always fit the input data to a specific hand template, which admits no personalized variation and does not adapt to existing learning-based methods. Because monocular color images are ubiquitous, methods using them are preferable to methods requiring multiple cameras or depth cameras. The construction of the InterHand2.6M dataset provided data support for this problem and subsequently prompted a series of methods for reconstructing interacting hands from monocular color images. However, these interactive hand reconstruction efforts still suffer from several problems:
1) They typically rely on the assumption that there is exactly one left hand and one right hand in the image, recasting the problem as a regression of the MANO parameters of the two hands. This limits the chirality and number of hands in the image. 2) The reconstructed interacting hands exhibit physically implausible artifacts such as spatial entanglement and mutual penetration, so the reconstruction quality is low.
Disclosure of Invention
The invention aims to solve the technical problem of reconstructing physically plausible hand shapes and poses from a single color image without limiting the chirality or number of hands.
In order to solve this technical problem, the invention provides a three-dimensional interactive hand reconstruction method based on implicit surface representation. The method constructs a parameterized implicit neural network that classifies 3D query points in the canonical space of the interacting hands, using image features to form the query strategy and query conditions. Guided by image cues, a new query strategy and new query conditions are designed, which improves query efficiency and reduces the query space. In addition, an effective physical optimization scheme is proposed to resolve mutual penetration between instances under the implicit representation, effectively ensuring the physical plausibility of the reconstruction results.
The three-dimensional interactive hand reconstruction method based on implicit expression provided by the invention comprises the following steps (a schematic sketch of the pipeline follows the list):
step 1, building a feature extraction neural network, acquiring global features from a single input color image, dividing the global features into several regional features based on connected components, and further acquiring instance features and joint features from the input image and the regional features;
step 2, selecting query points according to a query strategy using the global features, regional features and joint features acquired in step 1, and constructing query conditions;
step 3, building a parameterized implicit neural network and performing implicit reconstruction based on the query points and the query conditions;
step 4, physically optimizing the reconstructed model, penalizing physically implausible penetration, and adjusting and updating the reconstructed model;
step 5, iteratively optimizing the reconstructed model until the maximum penetration depth is less than 2 mm, and taking the optimized result as the final reconstruction result of the interacting hands.
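Taken together, the five steps form a feed-forward pipeline closed by an optimization loop. The following is a minimal schematic sketch of that control flow, assuming each stage is supplied as a callable; every name here is a hypothetical placeholder, not an identifier from the filing:

```python
# Schematic sketch of the five-step pipeline; all callables are
# hypothetical placeholders. The tolerance is in meters (2 mm).
def reconstruct_interacting_hands(image, extract_features, build_queries,
                                  implicit_reconstruct, physical_optimize,
                                  max_penetration_depth, tol=0.002):
    feats = extract_features(image)                   # step 1
    points, conditions = build_queries(feats)         # step 2
    model = implicit_reconstruct(points, conditions)  # step 3
    while max_penetration_depth(model) >= tol:        # steps 4-5
        model = physical_optimize(model)
    return model
```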
Further, step 1 adopts two encoder-decoder neural networks with ResNet18 as a backbone, and the acquired features are supervised in the network training process.
Further, the training constrains the features by the following loss function:

L_feature = L_Global + L_instance + L_joint

The first term L_Global constrains the global mask and the global Z-map, the second term L_instance constrains the instance masks, and the third term L_joint supervises the localization map and Z-map of the joints. (The explicit formulas of L_Global, L_instance and L_joint are given as equation images in the original filing.) In L_Global, the network predictions of the global mask and the global Z-map are compared against the corresponding ground truths; in L_instance, the network prediction of the k-th instance mask in region i is compared against its ground truth; in L_joint, the network predictions of the localization map and Z-map of the joints of region i are compared against the corresponding ground truths.
Further, in step 1, for an input color image I containing interacting hands, a first encoder-decoder network is used to obtain the global mask G_M and the global Z-map G_Z. Then the bounding-box coordinates of each region are used to crop I, G_M and G_Z, yielding the region information, namely the region image, the region mask and the region Z-map. Then, from these regional features, a second encoder-decoder network extracts in parallel, for each region i, the localization map and Z-map of the visible joints and the visible instance masks E^(k), where k ranges over the instance masks visible in region i.
Further, in step 2, for each region a sampling space d^(i) is constructed according to the query strategy and points are sampled within it. All sampled points are combined as the query points and normalized into a unit cube. In addition, for each region, a region embedding r^(i) is obtained from the feature extraction of step 1, and each instance mask E^(k) in the region is fed into a multi-layer perceptron to obtain an instance embedding e^(k). r^(i) and e^(k) serve as the query conditions for implicit reconstruction.
Further, in step 3, implicit reconstruction is performed with the parameterized implicit neural network based on the query points and the query conditions. The parameterized implicit neural network can be viewed as an implicit function:

f(p, r, e) = τ

where p is a query point, r and e are the region embedding and the instance embedding serving as the query conditions, and the output τ is the occupancy value.

The network is supervised with a cross-entropy loss between the predicted occupancy and the ground-truth occupancy. In addition, a penetration loss is defined to penalize collisions of the hand surfaces. Since a query point belongs to at most one single-hand instance, the penetration loss penalizes any query point whose occupancies, summed over the hand instances, exceed 1:

L_pen = Σ_{p ∈ Ω} max(Σ_k τ_k(p) - 1, 0)

where Ω is the set of all query points and k ranges over the hands in each region.
Further, in step 4, after the hand surfaces are implicitly reconstructed, a physical plausibility check is performed on the reconstructed interacting hands to judge whether the reconstructed model exhibits penetration; if so, the reconstructed model is optimized using the physical optimization method.
Further, in step 5, considering that the hand is not rigid, penetration within 2 mm is allowed. The reconstructed model is iteratively optimized according to step 4 until the maximum penetration depth d is less than 2 mm.
Further, the system comprises a feature extraction module, an implicit reconstruction module and a physical optimization module. The feature extraction module comprises two units: a global and regional feature extraction unit, which obtains the global mask G_M, the global Z-map G_Z, the region image, the region mask and the region Z-map from a single input image; and an instance and joint feature estimation unit, which extracts in parallel, for each region i and according to the regional features, the localization map and Z-map of the visible joints and the visible instance masks E^(k). The implicit reconstruction module comprises two units: a query point and query condition acquisition unit, which selects query points for each region according to the query strategy, obtains the region embedding r^(i) from the feature extraction module, and encodes the region's instance masks E^(k) into instance embeddings {e^(k)} through a multi-layer perceptron, taking r^(i) and {e^(k)} as the query conditions for implicit reconstruction; and an implicit surface reconstruction unit, which obtains the occupancy value of each query point through the parameterized implicit neural network and further extracts the reconstructed surface. The physical optimization module judges whether the reconstructed model exhibits penetration; if so, it optimizes the reconstructed model using the physical optimization method, iteratively optimizing until the maximum penetration depth is less than 2 mm, and takes the optimized result as the final reconstruction result of the interacting hands.
Compared with the prior art, the invention has the following advantages:
the invention provides a three-dimensional interactive hand reconstruction method based on implicit expression, which can reconstruct hands with any chirality and quantity in an image only by a single color image, realizes an end-to-end modeling mode, and improves the quality of three-dimensional gesture and shape reconstruction.
In addition, according to the image clue, a new query strategy and query conditions are designed, so that the query efficiency is improved, and the query space is reduced.
In addition, aiming at the unreasonable reconstruction result in physics, an effective physical optimization scheme is provided to solve the problem of mutual permeation between examples under implicit expression, and the physical authenticity of the reconstruction result is effectively ensured.
Drawings
FIG. 1 is a flow chart of a three-dimensional interactive hand reconstruction method in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a three-dimensional interactive hand reconstruction system in an embodiment of the present invention;
FIG. 3 is a diagram of the physical optimization scheme in an embodiment of the present invention, wherein (a) shows the force computation for each pixel and (b) shows the force computation for each instance.
FIG. 4 shows reconstruction results achieved by the present invention, wherein column (a) is the input image, column (b) the ground-truth mask, column (c) the predicted mask, column (d) the implicit reconstruction result, and column (e) the optimized result shown from two views.
Detailed Description
The embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples, so that the methods and effects of the invention can be clearly understood. It should be noted that, where no conflict arises, the features of the embodiments may be combined with one another, and the resulting technical solutions fall within the scope of the present invention.
Furthermore, the flowcharts shown in the drawings may be executed as a series of sequential instructions on a computer, and in some cases the order of the steps may be appropriately modified.
Example one
Fig. 1 is a flowchart of a three-dimensional interactive hand reconstruction method based on implicit representation according to a first embodiment of the present invention, and each step is described in detail below with reference to fig. 1.
Step S110: build the feature extraction neural network and acquire the corresponding features from the input color image.
To obtain the corresponding features from a single color image, a feature extraction network is built. As shown in FIG. 2 (S1), the feature extraction network consists of two encoder-decoder neural networks with ResNet18 as the backbone. For an input color image I containing interacting hands, a first encoder-decoder network is used to obtain the global mask G_M and the global Z-map G_Z. Then the bounding-box coordinates of each region are used to crop I, G_M and G_Z, yielding the region information, namely the region image, the region mask and the region Z-map. Then, from these regional features, a second encoder-decoder network extracts in parallel, for each region i, the localization map and Z-map of the visible joints and the visible instance masks E^(k), where k ranges over the instance masks visible in region i.
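A minimal sketch of one such encoder-decoder branch is given below. The filing fixes only the ResNet18 encoder; the bilinear-upsampling decoder, channel counts, and the 256x256 input size are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EncoderDecoder(nn.Module):
    """Encoder-decoder branch with a ResNet18 encoder, as used for both
    the global branch (mask + Z-map) and the per-region branch."""
    def __init__(self, out_channels):
        super().__init__()
        backbone = resnet18(weights=None)
        # keep everything up to the last residual stage (1/32 resolution)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.decoder = nn.Sequential(  # assumed decoder: upsample + conv
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(128, out_channels, 1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# first branch: one channel for the global mask G_M, one for the
# global Z-map G_Z
global_net = EncoderDecoder(out_channels=2)
image = torch.randn(1, 3, 256, 256)          # a single color image I
g_mask, g_zmap = global_net(image).split(1, dim=1)
```

The second branch would analogously output the joint localization map, joint Z-map, and instance-mask channels for each cropped region.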
To achieve the above functionality, the feature extraction network is pre-trained using the following loss function:

L_feature = L_Global + L_instance + L_joint

The first term L_Global supervises the global mask and the global Z-map, the second term L_instance supervises the instance masks, and the third term L_joint supervises the localization map and Z-map of the joints. (The explicit formulas of the three terms are given as equation images in the original filing.) In L_Global, the network predictions of the global mask and the global Z-map are compared against the corresponding ground truths; in L_instance, the network prediction of the k-th instance mask in region i is compared against its ground truth; in L_joint, the network predictions of the localization map and Z-map of the joints of region i are compared against the corresponding ground truths.
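The combined supervision can be sketched as follows. Since the per-term formulas appear only as equation images in the filing, a per-pixel mean-squared error over every map is assumed here purely for illustration; it is one common choice, not necessarily the filing's.

```python
import torch.nn.functional as F

def feature_loss(pred, gt):
    """pred / gt: dicts of predicted and ground-truth maps.
    'G_M', 'G_Z': global mask and Z-map tensors; 'E': list with one
    (K, H, W) instance-mask tensor per region; 'J_L', 'J_Z': lists with
    one joint localization map / joint Z-map tensor per region."""
    l_global = (F.mse_loss(pred["G_M"], gt["G_M"]) +
                F.mse_loss(pred["G_Z"], gt["G_Z"]))
    l_instance = sum(F.mse_loss(p, g) for p, g in zip(pred["E"], gt["E"]))
    l_joint = sum(F.mse_loss(pl, gl) + F.mse_loss(pz, gz)
                  for pl, gl, pz, gz in zip(pred["J_L"], gt["J_L"],
                                            pred["J_Z"], gt["J_Z"]))
    return l_global + l_instance + l_joint
```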
Step S120: select query points according to the query strategy using the acquired features, and construct the query conditions.

The query points are selected as follows: for each region, the corresponding query silhouette and maximum query depth are first found; using the query silhouette as the base and the maximum query depth as the height, the sampling space d^(i) is constructed, and points are sampled uniformly within this space. In addition, each visible joint is regarded as an anchor, and a further number of sampling points is selected near each anchor. Specifically, a Gaussian mixture distribution N(x^(j), σ) is used, where each visible joint coordinate x^(j) serves as a center and σ is the variance in each dimension; in the experiments of the invention, σ is set to 2.5 cm. All sampled points are combined as the query points and normalized into a unit cube.

The query conditions are constructed as follows: for each region, the region embedding r^(i) is obtained from the feature extraction of FIG. 2 (S1), and each instance mask E^(k) in the region is fed into a multi-layer perceptron to obtain an instance embedding e^(k). r^(i) and e^(k) serve as the query conditions for implicit reconstruction.
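The sampling strategy can be sketched as below. The uniform and per-joint sample counts, the metric units (meters, so σ = 2.5 cm becomes 0.025), and the min-max normalization into the unit cube are illustrative assumptions:

```python
import numpy as np

def sample_query_points(box_min, box_max, joints,
                        n_uniform=2048, n_per_joint=128, sigma=0.025):
    """box_min / box_max: corners of the region's sampling space d^(i);
    joints: (J, 3) visible joint coordinates used as anchors."""
    # uniform samples inside the sampling space
    uniform = np.random.uniform(box_min, box_max, size=(n_uniform, 3))
    # Gaussian samples centered on each visible joint (sigma = 2.5 cm)
    anchored = np.concatenate([
        np.random.normal(loc=j, scale=sigma, size=(n_per_joint, 3))
        for j in joints])
    points = np.concatenate([uniform, anchored])
    # combine all sampled points and normalize into the unit cube
    lo, hi = points.min(axis=0), points.max(axis=0)
    return (points - lo) / (hi - lo)

joints = np.random.rand(21, 3) * 0.2   # e.g. 21 visible joints, in meters
query_pts = sample_query_points(np.zeros(3), np.full(3, 0.3), joints)
```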
Step S130: build the parameterized implicit neural network and perform implicit reconstruction based on the query points and the query conditions.

As shown in FIG. 2 (S2), a parameterized implicit neural network is built over the query points and the query conditions and treated as an implicit function:

f(p, r, e) = τ

where p is a query point, r and e are the region embedding and the instance embedding serving as the query conditions, and the output τ is the occupancy value.

In the experiments of the invention, the hand surface is implicitly represented by the iso-surface τ = 0.5: if the occupancy of a query point is less than 0.5, the point lies outside the surface; if it is greater than 0.5, the point lies inside the surface.

The network is supervised with a cross-entropy loss between the predicted occupancy and the ground-truth occupancy. In addition, a penetration loss is defined to penalize collisions of the hand surfaces. Since a query point belongs to at most one single-hand instance, the penetration loss is defined as:

L_pen = Σ_{p ∈ Ω} max(Σ_k τ_k(p) - 1, 0)

where Ω is the set of all query points and k ranges over the hands in each region.

The parameterized implicit neural network consists of three convolutional layers and nine fully-connected layers. It is trained with the SGD optimizer with a batch size of 64. The query points are not fixed; they are resampled in each iteration.
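A sketch of the implicit function and the penetration penalty is given below. The filing specifies three convolutional and nine fully-connected layers; only the fully-connected trunk conditioned on (p, r, e) is sketched, the embedding widths are assumptions, and the hinge form of the penetration loss is reconstructed from the stated constraint that a point may belong to at most one hand.

```python
import torch
import torch.nn as nn

class ImplicitHand(nn.Module):
    """f(p, r, e) -> occupancy tau in [0, 1]; nine linear layers as per
    the filing (the three convolutional layers that produce the image
    features are omitted here)."""
    def __init__(self, p_dim=3, r_dim=64, e_dim=64, width=256):
        super().__init__()
        dims = [p_dim + r_dim + e_dim] + [width] * 8 + [1]
        layers = []
        for i in range(9):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < 8:
                layers.append(nn.ReLU(inplace=True))
        self.mlp = nn.Sequential(*layers)

    def forward(self, p, r, e):
        # p: (N, 3) query points; r, e: region / instance embeddings
        return torch.sigmoid(self.mlp(torch.cat([p, r, e], dim=-1)))

def penetration_loss(occupancy):
    """occupancy: (N, K) predicted tau for K hand instances at N query
    points; a summed occupancy above 1 means two hands claim the same
    point and is penalized."""
    return torch.clamp(occupancy.sum(dim=-1) - 1.0, min=0.0).mean()

net = ImplicitHand()
p, r, e = torch.rand(64, 3), torch.rand(64, 64), torch.rand(64, 64)
tau = net(p, r, e)  # occupancy; the surface is the tau = 0.5 iso-surface
```

The mesh itself can then be extracted from the τ = 0.5 iso-surface of a dense occupancy grid, for example with a marching-cubes routine.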
Step S140: perform a physical plausibility check on the reconstructed interacting hands, and optimize and update the reconstructed model.

After the hand surfaces are obtained through implicit reconstruction, a physical plausibility check is performed on the reconstructed interacting hands, and implausible penetration is penalized so as to better match the true pose. As shown in FIG. 3, the penetration depth is computed first. A ray cast along the positive Z direction from a pixel of the region mask should pass through a hand an even number of times. Suppose a ray passes through the implicitly reconstructed regions of two interacting hands A and B, and record all intersections along this ray as x_1, ..., x_N, where N is the number of intersections. For an intersection x_n on the surface of hand A, if its neighboring intersections x_{n-1} and x_{n+1} do not belong to the same hand surface, penetration occurs at that location. In this case, the next intersection on the surface of hand A is found, and the penetration depth is defined as the distance between x_n and that next intersection. Then the force on each pixel is computed. In the optimization stage, the hand pose should be adjustable toward any orientation, not only parallel to the rays. Considering that the force is directional, that its magnitude is directly related to the penetration depth, and that the location where penetration occurs acts like a spring producing a repulsive force, the hand is treated as a rigid body in this case. The relationship between the penetration depths and the corresponding repulsive force is defined as:

f_(u,v) = λ Σ_t d_t

where f_(u,v) denotes the force generated, due to penetration, by the ray cast through pixel (u, v), t indexes the penetrations along the ray, d_t is the t-th penetration depth, and λ is the associated weight.

The repulsive forces of all rays are then summed; because the repulsive force differs from ray to ray, the direction of the resultant force F may deviate from the Z axis. The resultant force is:

F = Σ_(u,v) f_(u,v)

From the resultant force F, the average penetration depth d̄ is computed; each hand is then moved by d̄/2 in a direction parallel to F. To penalize extreme adjustments, a projection loss based on orthogonal projection is used:

L_proj = || π(H_pose) - R_M ||

where H_pose denotes the optimized interacting hands, π(H_pose) is their 2D projection, and R_M is the region mask estimated by the encoder-decoder.
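The per-ray bookkeeping of this check can be sketched as follows. The filing's pairing rule is described only in prose, so the interpretation below (an intersection of one hand immediately followed by the other hand's surface marks a penetration, measured to the next same-hand intersection) and the example numbers are assumptions:

```python
import numpy as np

def penetration_depths(depths, labels, hand="A"):
    """depths: sorted intersection depths along one ray; labels: which
    hand's surface ('A' or 'B') each intersection lies on."""
    out = []
    for n in range(len(depths) - 1):
        if labels[n] == hand and labels[n + 1] != hand:
            # penetration: measure to the next same-hand intersection
            nxt = next((m for m in range(n + 1, len(depths))
                        if labels[m] == hand), None)
            if nxt is not None:
                out.append(depths[nxt] - depths[n])
    return out

def pixel_repulsion(depths, labels, lam=1.0):
    # spring-like force magnitude f_(u,v) = lambda * sum_t d_t
    return lam * sum(penetration_depths(depths, labels))

# one ray entering hand A, then hand B, before A exits: a penetration
ray_dir = np.array([0.0, 0.0, 1.0])
depths, labels = [0.10, 0.12, 0.13, 0.15], list("ABAB")

F_total = ray_dir * pixel_repulsion(depths, labels)  # resultant force
d_bar = np.mean(penetration_depths(depths, labels))  # average depth
# each hand is then translated by d_bar / 2 along +/- F_total direction
```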
Step S150: iteratively optimize the reconstructed model until the maximum penetration depth is less than 2 mm, and take the optimized result as the final reconstruction result.

Since the hand is not rigid, penetration within 2 mm is allowed. The reconstructed model is iteratively optimized according to step S140 until the maximum penetration depth d is less than 2 mm.

In this embodiment, the input color image may contain hands of any chirality and number; experimental results are shown in FIG. 4. The first column of FIG. 4 shows the input color image, the second and third columns show the ground-truth mask and the predicted mask respectively, the fourth column shows the implicit reconstruction result, and the fifth column shows the optimized reconstruction result.
Those skilled in the art will appreciate that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of computing devices, and may optionally be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated-circuit modules, or fabricated with several of the modules or steps as a single integrated-circuit module. Thus, the invention is not limited to any specific combination of hardware and software.

Claims (9)

1. A three-dimensional interactive hand reconstruction method based on implicit expression, characterized in that the method comprises the following steps:
step 1, building a feature extraction neural network, acquiring global features from a single input color image, dividing the global features into several regional features based on connected components, and further acquiring instance features and joint features from the input image and the regional features;
step 2, selecting query points according to a query strategy using the regional features, instance features and joint features acquired in step 1, and constructing query conditions;
step 3, building a parameterized implicit neural network and performing implicit reconstruction based on the query points and the query conditions;
step 4, physically optimizing the reconstructed model, penalizing physically implausible penetration, and adjusting and updating the reconstructed model;
step 5, iteratively optimizing the reconstructed model until the maximum penetration depth is less than 2 mm, and taking the optimized result as the final reconstruction result of the interacting hands.
2. The implicit representation based three-dimensional interactive hand reconstruction method according to claim 1, wherein: step 1 adopts two encoder-decoder neural networks using ResNet18 as a backbone, and supervises the acquired features in the network training process.
3. The implicit representation based three-dimensional interactive hand reconstruction method according to claim 2, wherein the training constrains the features by the following loss function:

L_feature = L_Global + L_instance + L_joint

wherein the first term L_Global constrains the global mask and the global Z-map, the second term L_instance constrains the instance masks, and the third term L_joint supervises the localization map and Z-map of the joints (the explicit formulas of the three terms are given as equation images in the original filing); in L_Global, the network predictions of the global mask and the global Z-map are compared against the corresponding ground truths; in L_instance, the network prediction of the k-th instance mask in region i is compared against its ground truth; in L_joint, the network predictions of the localization map and Z-map of the joints of region i are compared against the corresponding ground truths.
4. The implicit representation based three-dimensional interactive hand reconstruction method according to claim 1, wherein in step 1, for an input color image I containing interacting hands, a first encoder-decoder network is used to obtain the global mask G_M and the global Z-map G_Z; then the bounding-box coordinates of each region are used to crop I, G_M and G_Z, yielding the region information, namely the region image, the region mask and the region Z-map; then, from these regional features, a second encoder-decoder network extracts in parallel, for each region i, the localization map and Z-map of the visible joints and the visible instance masks E^(k), where k ranges over the instance masks visible in region i.
5. The implicit representation based three-dimensional interactive hand reconstruction method according to claim 1, characterized in that in step 2, for each region a sampling space d^(i) is constructed according to the query strategy and points are sampled within it; all sampled points are combined as the query points and normalized into a unit cube; in addition, for each region, a region embedding r^(i) is obtained from the feature extraction of step 1, and each instance mask E^(k) in the region is fed into a multi-layer perceptron to obtain an instance embedding e^(k); r^(i) and e^(k) serve as the query conditions for implicit reconstruction.
6. The implicit representation based three-dimensional interactive hand reconstruction method according to claim 5, wherein in step 3, implicit reconstruction is performed with the parameterized implicit neural network based on the query points and the query conditions; the parameterized implicit neural network is treated as an implicit function:

f(p, r, e) = τ

wherein p is a query point, r and e are the region embedding and the instance embedding serving as the query conditions, and the output τ is the occupancy value;

the network is supervised with a cross-entropy loss between the predicted occupancy and the ground-truth occupancy; in addition, a penetration loss is defined to penalize collisions of the hand surfaces; since a query point belongs to at most one single-hand instance, the penetration loss is defined as:

L_pen = Σ_{p ∈ Ω} max(Σ_k τ_k(p) - 1, 0)

wherein Ω is the set of all query points and k ranges over the hands in each region.
7. The implicit representation based three-dimensional interactive hand reconstruction method according to claim 1, characterized in that in step 4, after the hand surfaces are obtained through implicit reconstruction, a physical plausibility check is performed on the reconstructed interacting hands to judge whether the reconstructed model exhibits penetration, and if so, the reconstructed model is optimized using the physical optimization method.
8. The implicit representation based three-dimensional interactive hand reconstruction method according to claim 1, wherein in step 5, considering that the hand is not rigid, penetration within 2 mm is allowed; the reconstructed model is iteratively optimized according to step 4 until the maximum penetration depth d is less than 2 mm.
9. A three-dimensional interactive hand reconstruction system based on implicit expression, characterized in that the system comprises a feature extraction module, an implicit reconstruction module and a physical optimization module;

the feature extraction module comprises two units: a global and regional feature extraction unit, which obtains the global mask G_M, the global Z-map G_Z, the region image, the region mask and the region Z-map from a single input image; and an instance and joint feature estimation unit, which extracts in parallel, for each region i and according to the regional features, the localization map and Z-map of the visible joints and the visible instance masks E^(k);

the implicit reconstruction module comprises two units: a query point and query condition acquisition unit, which selects query points for each region according to the query strategy, obtains the region embedding r^(i) from the feature extraction module, and encodes the region's instance masks E^(k) into instance embeddings {e^(k)} through a multi-layer perceptron, taking r^(i) and {e^(k)} as the query conditions for implicit reconstruction; and an implicit surface reconstruction unit, which obtains the occupancy value of each query point through the parameterized implicit neural network and further extracts the reconstructed surface;

the physical optimization module judges whether the reconstructed model exhibits penetration; if so, it optimizes the reconstructed model using the physical optimization method, iteratively optimizing until the maximum penetration depth is less than 2 mm, and takes the optimized result as the final reconstruction result of the interacting hands.
CN202210619894.7A 2022-06-02 2022-06-02 Three-dimensional interactive hand reconstruction method and system based on implicit expression Pending CN114998520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210619894.7A CN114998520A (en) 2022-06-02 2022-06-02 Three-dimensional interactive hand reconstruction method and system based on implicit expression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210619894.7A CN114998520A (en) 2022-06-02 2022-06-02 Three-dimensional interactive hand reconstruction method and system based on implicit expression

Publications (1)

Publication Number Publication Date
CN114998520A 2022-09-02

Family

ID=83030222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210619894.7A Pending CN114998520A (en) 2022-06-02 2022-06-02 Three-dimensional interactive hand reconstruction method and system based on implicit expression

Country Status (1)

Country Link
CN (1) CN114998520A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740290A (en) * 2023-08-15 2023-09-12 江西农业大学 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN116740290B (en) * 2023-08-15 2023-11-07 江西农业大学 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination