CN117911609A - Three-dimensional hand modeling method based on neural radiance field - Google Patents
Three-dimensional hand modeling method based on neural radiance field
- Publication number
- CN117911609A CN117911609A CN202311730771.1A CN202311730771A CN117911609A CN 117911609 A CN117911609 A CN 117911609A CN 202311730771 A CN202311730771 A CN 202311730771A CN 117911609 A CN117911609 A CN 117911609A
- Authority
- CN
- China
- Prior art keywords
- hand
- model
- potential
- code
- mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Graphics (AREA)
- Human Computer Interaction (AREA)
- Social Psychology (AREA)
- Psychiatry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Geometry (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a three-dimensional hand modeling method based on a neural radiance field, which comprises the following steps: step (1), performing two-hand modal segmentation through a two-hand separation model; step (2), obtaining single-hand pictures through a hand de-occlusion and removal module; step (3), structuring the latent codes of the single-hand skinned model; step (4), diffusing the latent codes of the single-hand skinned model; step (5), regressing the density and color of the hand of the single-hand skinned model; and step (6), performing hand volume rendering on the single-hand skinned model to obtain a two-hand model. The invention uses a new implicit neural representation of the dynamic human body, which enables the method to effectively aggregate observations across video frames. The method achieves high-quality reconstruction under various viewpoints and different levels of hand occlusion.
Description
Technical Field
The invention relates to the field of computer image processing and three-dimensional reconstruction, and in particular to a method that combines image data with a neural radiance field and uses computer vision and deep-learning algorithms to generate a high-precision three-dimensional model of the hand.
Background
In recent years, neural implicit representations have developed rapidly in three-dimensional modeling and image synthesis. A neural implicit representation models a scene with a neural network that is spatially continuous and exhibits higher fidelity and flexibility than classical discrete counterparts such as meshes, point clouds, and voxels. As the most popular implicit representation in neural rendering, Neural Radiance Fields (NeRF) have shown striking results on a variety of tasks since their introduction. The original NeRF is by design overfitted to a static scene, so it cannot model time-varying content.
Image rendering of animated articulated objects, i.e. human bodies, hands, etc., can be seen as a special case of dynamic scene modeling. Most early efforts performed reconstruction with skeleton-driven meshes, which generally relied on expensive calibration and a large number of samples to produce high-quality results.
Modeling and reconstruction of the human hand, the dexterous tool with which we interact with the physical world and convey rich semantic information, has attracted extensive attention from the research community. Synthesizing realistic hand images or videos with different poses in motion has wide applications, such as human-machine interaction, sign language generation, and virtual and augmented reality techniques such as telepresence.
Classical hand modeling work is mainly built on top of parameterized mesh models such as MANO. These methods fit the hand geometry to a polygonal mesh driven by shape and pose parameters and then complete shading by texture mapping. Although widely used, such models have the following limitations. On the one hand, high-frequency details are difficult to present on the polygonal mesh, preventing the generation of realistic images. On the other hand, no dedicated design has been developed for interacting hands, an important scenario involving complex gestures and severe self-occlusion.
Disclosure of Invention
The technical problem to be solved is as follows: previous free-viewpoint video systems either rely on dense camera arrays for image-based novel view synthesis or require depth sensors for high-quality three-dimensional reconstruction to produce realistic rendering. The complex hardware makes free-viewpoint video systems expensive and suitable only for controlled environments. The invention addresses the challenge of novel view synthesis for human performers from a very sparse set of camera views.
To address the shortcomings of existing three-dimensional hand reconstruction methods, the invention provides a three-dimensional hand modeling method based on a neural radiance field.
According to research on latent variable models, a latent variable model obtains the distribution of the observed variables by defining the joint distribution of the visible and latent variables and then marginalizing. Inspired by this, implicit 3D representations of the human hand on different video frames are generated from the same set of latent codes, which are anchored to the vertices of a deformable mesh. For each frame, the spatial positions of the codes are transformed according to the human hand pose, and a network regresses the density and color of any 3D position from the structured latent codes. An image from an arbitrary viewpoint can then be synthesized by volume rendering. At the same time, when the input views are highly sparse, the performance of NeRF drops sharply, because learning a neural representation from very sparse observations is ill-posed. The key to resolving this ill-posedness is therefore to aggregate all observations over different video frames. This idea is realized by regressing the 3D representation of each frame with the same network, taking different latent codes as inputs.
A three-dimensional hand modeling method based on a neural radiance field comprises the following steps:
Step (1), performing two-hand modal segmentation through a two-hand separation model;
Step (2), obtaining single-hand pictures through a hand de-occlusion and removal module;
Step (3), structuring the latent codes of the single-hand skinned model;
Step (4), diffusing the latent codes of the single-hand skinned model;
Step (5), regressing the density and color of the hand of the single-hand skinned model;
Step (6), performing hand volume rendering on the single-hand skinned model to obtain a two-hand model.
Further, the specific method of the step (1) is as follows:
First, the video is divided into frames {I_t^c | c = 1, ..., N_c; t = 1, ..., N_t}, where c is the camera index, N_c is the number of cameras, t is the frame index, and N_t is the number of frames. The two-hand video data shot by the cameras with marked serial numbers are processed, and the two-hand pictures corresponding to the video are obtained and stored with the camera serial numbers. The two-hand separation model adapts the existing semantic segmentation model SegFormer to the two-hand modal segmentation task. Specifically, the number of decoder outputs is increased from 1 to 4 to predict four segmentation masks, namely a right-hand full-region mask M_ra, a right-hand visible-region mask M_rv, a left-hand full-region mask M_la, and a left-hand visible-region mask M_lv, and left- and right-hand segmentation is performed on the input two-hand image through the four segmentation masks to obtain 4 segmented single-hand pictures. These segmentation masks contain spatial location information to roughly locate the left/right hand, as well as information about occlusion regions for de-occluding and removing interfering regions. The segmentation model is supervised with the binary cross-entropy loss L_BCE(x), and the final segmentation loss function is:
L_seg = L_BCE(M_ra, M̂_ra) + L_BCE(M_rv, M̂_rv) + L_BCE(M_la, M̂_la) + L_BCE(M_lv, M̂_lv)
where M_ra, M_rv, M_la and M_lv are the predicted segmentation masks, and M̂_ra, M̂_rv, M̂_la and M̂_lv are the corresponding ground-truth masks.
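As an illustration only, the four-mask supervision described above can be sketched in PyTorch as follows; the wrapper class `TwoHandSegHead`, the channel ordering of the four masks, and the tensor shapes are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHandSegHead(nn.Module):
    """Hypothetical 4-channel decoder head added on top of a SegFormer-style backbone.
    Assumed channel order: [M_ra, M_rv, M_la, M_lv]."""
    def __init__(self, backbone_channels: int = 256):
        super().__init__()
        self.head = nn.Conv2d(backbone_channels, 4, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) features from the segmentation backbone
        return self.head(feats)  # (B, 4, H, W) mask logits

def segmentation_loss(logits: torch.Tensor, gt_masks: torch.Tensor) -> torch.Tensor:
    """Sum of binary cross-entropy losses over the four predicted masks.

    logits:   (B, 4, H, W) raw mask predictions
    gt_masks: (B, 4, H, W) ground-truth masks in {0, 1}
    """
    return sum(
        F.binary_cross_entropy_with_logits(logits[:, i], gt_masks[:, i])
        for i in range(4)
    )

# usage sketch
head = TwoHandSegHead()
feats = torch.randn(2, 256, 64, 64)
gt = torch.randint(0, 2, (2, 4, 64, 64)).float()
loss = segmentation_loss(head(feats), gt)
```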
Further, the specific method of the step (2) is as follows:
Taking the right hand as an example, first, according to the masks predicted in step (1), a picture is cropped centered on the right-hand full-region mask M_ra. The input of the hand de-occlusion and removal module is concatenated from the following four parts: 1. the picture with the occluded right-hand area blacked out; 2. the right-hand visible-part mask M_rv; 3. the picture with the redundant left-hand area blacked out; 4. the mask M_bv of the background area other than the left and right hands. For the right hand, the right-hand full-region mask M_ra is first used to locate the right hand, then the original image and the segmentation masks are cropped around the center of the right hand. The newly cropped image and masks are denoted as: the cropped image I', the right-hand center-cropped mask M'_ra, the right-hand visible-part center-cropped mask M'_rv, the left-hand center-cropped mask M'_la, and the left-hand visible-part center-cropped mask M'_lv. M_D denotes the area where the target hand is occluded by the other hand, and M_R denotes the area occupied by the distracting hand. Their calculation formulas are as follows:
M_D = M'_ra ⊙ (1 − M'_rv),  M_R = M'_lv ⊙ (1 − M'_ra)
I_D and I_R are the pictures obtained by erasing the cropped original image I' with the masks M_D and M_R, respectively. They guide the hand de-occlusion and removal module where to focus and how to use partial convolution (Partial Convolution) to fill in the two pictures. In addition, the right-hand visible-part mask M_rv and the mask M_bv of the background area other than the left and right hands direct the hand de-occlusion and removal module to distinguish the parts of the picture to be de-occluded from those to be removed. I_D, I_R and M_bv are calculated as follows:
I_D = I' ⊙ (1 − M_D),  I_R = I' ⊙ (1 − M_R),  M_bv = 1 − (M'_ra ∪ M'_la)
I_D, M_rv, I_R and M_bv are input to the hand de-occlusion and removal module, which uses these data to restore the appearance of the occluded part and remove the distracting hand to avoid ambiguity: the pixels of the occluded right-hand area and the pixels of the background behind the redundant left-hand area are predicted, and finally a restored single right-hand image is output. The output single-hand pictures are classified according to camera serial number, and the left-hand and right-hand pictures are stored separately.
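The following PyTorch-style sketch illustrates, under the mask formulas reconstructed above, how the inputs of the de-occlusion and removal module could be assembled; the exact expressions for M_D, M_R and M_bv, as well as the function name and tensor shapes, are assumptions rather than text taken from the patent.

```python
import torch

def build_hdr_inputs(img_crop, m_ra, m_rv, m_la, m_lv):
    """Assemble the inputs of the hand de-occlusion and removal module.

    img_crop: (3, H, W) image cropped around the right hand
    m_ra, m_rv, m_la, m_lv: (1, H, W) cropped masks in {0, 1}
    Returns the erased images I_D, I_R and the masks M_D, M_R, M_bv.
    """
    m_d = m_ra * (1.0 - m_rv)                          # right-hand region hidden by the other hand (assumed)
    m_r = m_lv * (1.0 - m_ra)                          # redundant left-hand region to be removed (assumed)
    m_bv = 1.0 - torch.clamp(m_ra + m_la, max=1.0)     # background outside both hands (assumed)

    i_d = img_crop * (1.0 - m_d)                       # image with the occluded right-hand area blacked out
    i_r = img_crop * (1.0 - m_r)                       # image with the distracting left-hand area blacked out
    return i_d, i_r, m_d, m_r, m_bv

# usage sketch: the module would take the concatenation [I_D, M_rv, I_R, M_bv]
img = torch.rand(3, 256, 256)
masks = [torch.randint(0, 2, (1, 256, 256)).float() for _ in range(4)]
i_d, i_r, m_d, m_r, m_bv = build_hdr_inputs(img, *masks)
hdr_input = torch.cat([i_d, masks[1], i_r, m_bv], dim=0)  # (8, 256, 256)
```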
Further, the specific method of the step (3) is as follows:
The single-hand images are preprocessed with the EasyMocap tools. For each single-hand image obtained in step (2), the hand pose is estimated, and hand pose joints are obtained using yolov and HRNet. The intrinsic and extrinsic camera parameters are obtained by setting the camera parameters and performing chessboard calibration when shooting the video. In addition, Self-Correction for Human Parsing is adopted to segment the hand region of the single-hand image, yielding a picture that reveals the contour position of the hand in the single-hand image, i.e. a hand contour position image. The corresponding parameters and vertex information are obtained from the camera intrinsics and extrinsics and the estimated hand pose. Next, the single-hand image from step (2) and the EasyMocap-preprocessed data (camera intrinsics and extrinsics, hand pose joints, and hand contour position image) are used to obtain a skinned linear model (SMPL) of the hand. In addition, latent codes are introduced to realize the conversion from the skinned model to the real model, i.e. the EasyMocap-preprocessed data of each single-hand image are stored with latent codes. To control the spatial locations of the latent codes, they are anchored to a deformable human hand skinned linear model (SMPL), a skinned vertex-based model defined as a function of shape parameters, pose parameters, and a rigid transformation relative to the SMPL coordinate system. The SMPL model outputs a posed 3D mesh with 6890 vertices. A set of latent codes Z = {z_1, z_2, ..., z_6890} is defined on the vertices of the initial human hand skinned linear model. For frame t, the SMPL parameters S_t are estimated from the provided multi-view video images {I_t^c}. The spatial positions of the latent codes are then transformed according to the human hand pose S_t to perform density and color regression. The latent codes represent the local geometry and appearance of the human hand via a neural network.
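A minimal sketch of how per-vertex latent codes could be anchored to the posed mesh is given below, assuming a PyTorch embedding of dimension 16 and random vertex positions standing in for the posed SMPL vertices; the class name, the code dimension, and the stand-in vertices are all assumptions.

```python
import torch
import torch.nn as nn

NUM_VERTICES = 6890   # number of SMPL mesh vertices
CODE_DIM = 16         # assumed dimensionality of each latent code z_i

class StructuredLatentCodes(nn.Module):
    """Learnable latent codes Z = {z_1, ..., z_6890} anchored to SMPL vertices."""
    def __init__(self):
        super().__init__()
        self.codes = nn.Embedding(NUM_VERTICES, CODE_DIM)

    def forward(self, posed_vertices: torch.Tensor):
        """posed_vertices: (6890, 3) vertex positions of the mesh posed with S_t.
        The codes keep their identity across frames; only their anchor positions
        move with the hand pose."""
        idx = torch.arange(NUM_VERTICES, device=posed_vertices.device)
        return posed_vertices, self.codes(idx)

# usage sketch with random positions standing in for the posed SMPL vertices of frame t
latents = StructuredLatentCodes()
posed_vertices = torch.rand(NUM_VERTICES, 3)
positions, codes = latents(posed_vertices)   # (6890, 3), (6890, 16)
```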
Further, the specific method of the step (4) is as follows:
Each point in space is represented as a feature vector. NeRF assigns a density and a color to every point in 3D space, which requires querying the latent code at continuous 3D locations by trilinear interpolation. The latent codes defined on the surface are therefore diffused into the nearby 3D space. A sparse convolutional network (SparseConvNet) is selected to efficiently process the structured latent codes. Specifically, based on the parameters of the SMPL skinned model, a 3D bounding box of the human hand is computed and divided into small voxels of size 5 mm × 5 mm × 5 mm, referred to as latent code voxels. The latent code of a non-empty voxel is the average of the latent codes of the SMPL vertices inside the voxel. SparseConvNet processes the input volume with 3D sparse convolutions and outputs latent code volumes at 2×, 4×, 8× and 16× downsampled sizes. Through convolution and downsampling, the input codes are diffused into the nearby space. For any point in 3D space, latent codes are interpolated from the multi-scale code volumes at network layers 5, 9, 13 and 17 and concatenated into the final latent code. Since code diffusion should not be affected by the position and orientation of the human hand in the world coordinate system, code positions are converted into the SMPL coordinate system. For any point x in 3D space, its latent code is queried from the latent code voxels. Specifically, point x is first converted into the SMPL coordinate system, which aligns the points and the latent code volumes in 3D space. The latent code is then computed using trilinear interpolation. For SMPL skinned-model parameters S_t, the latent code at point x is denoted ψ(x, Z, S_t). The code vector ψ(x, Z, S_t) is passed into the density model and the color model to predict the density and color of point x.
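As a rough illustration of latent code diffusion and the trilinear query, the sketch below voxelizes per-vertex codes, diffuses them with a dense 3D convolution standing in for the sparse convolutional network, and queries arbitrary points with `grid_sample`; the voxel resolution, the single-scale (rather than multi-scale) output, and all function names are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def voxelize_codes(verts, codes, grid_res=32):
    """Scatter per-vertex latent codes into a (C, D, H, W) voxel grid.
    verts: (N, 3) vertex positions normalized to [-1, 1]; codes: (N, C)."""
    c = codes.shape[1]
    vol = torch.zeros(c, grid_res, grid_res, grid_res)
    cnt = torch.zeros(1, grid_res, grid_res, grid_res)
    idx = ((verts * 0.5 + 0.5) * (grid_res - 1)).long().clamp(0, grid_res - 1)
    for (i, j, k), z in zip(idx, codes):
        vol[:, k, j, i] += z          # average the codes of vertices falling in each voxel
        cnt[:, k, j, i] += 1.0
    return vol / cnt.clamp(min=1.0)

class CodeDiffusion(nn.Module):
    """Dense 3D convolution used here only as a stand-in for SparseConvNet."""
    def __init__(self, c=16):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv3d(c, c, 3, padding=1), nn.ReLU(),
                                  nn.Conv3d(c, c, 3, padding=1))

    def forward(self, vol):
        return self.conv(vol.unsqueeze(0))   # (1, C, D, H, W)

def query_code(vol, x):
    """Trilinear interpolation of the diffused code volume at points x in [-1, 1]^3."""
    grid = x.view(1, -1, 1, 1, 3)             # (1, N, 1, 1, 3)
    out = F.grid_sample(vol, grid, align_corners=True)
    return out.view(vol.shape[1], -1).t()     # (N, C): psi(x, Z, S_t)

verts, codes = torch.rand(6890, 3) * 2 - 1, torch.rand(6890, 16)
vol = CodeDiffusion()(voxelize_codes(verts, codes))
psi = query_code(vol, torch.rand(4, 3) * 2 - 1)   # codes for 4 query points
```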
Further, the specific method in the step (5) is as follows:
Hand density and color regression is achieved through NeRF, implemented as MLP networks.
Density model: for frame t, the volume density at point x is predicted as a function of the latent code ψ(x, Z, S_t), defined as σ_t(x) = M_σ(ψ(x, Z, S_t)), where M_σ is a four-layer MLP network that follows the density model of NeRF.
Color model: the latent code ψ(x, Z, S_t) and the viewing direction d are taken as inputs for color regression. The color model also takes the point x as input in order to model position-dependent incident light. A latent embedding l_t is assigned to each video frame t to encode time-varying factors. Specifically, for frame t, the color at spatial position x is predicted as a function of the latent code ψ(x, Z, S_t), the viewing direction d, the spatial position x, and the latent embedding l_t. The color model of the t-th frame is defined as:
c_t(x) = M_c(ψ(x, Z, S_t), γ_d(d), γ_x(x), l_t)
where M_c is a two-layer MLP network, and γ_d and γ_x are positional encoding functions of the viewing direction and the spatial position, respectively. The dimension of l_t is set to 128.
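The sketch below instantiates the density head M_σ and the color head M_c; the layer counts follow the text, while the hidden widths, positional-encoding frequency counts, the assumed code dimension of 16 (matching the earlier sketch), and the class names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def positional_encoding(v: torch.Tensor, num_freqs: int) -> torch.Tensor:
    """NeRF-style encoding gamma(v) = (sin(2^k v), cos(2^k v)) for k = 0..num_freqs-1."""
    out = []
    for k in range(num_freqs):
        out += [torch.sin((2.0 ** k) * v), torch.cos((2.0 ** k) * v)]
    return torch.cat(out, dim=-1)

class DensityColorField(nn.Module):
    """Density head M_sigma (four layers) and color head M_c (two layers)."""
    def __init__(self, code_dim=16, hidden=256, n_freq_x=10, n_freq_d=4, embed_dim=128):
        super().__init__()
        self.sigma = nn.Sequential(                       # four-layer MLP M_sigma
            nn.Linear(code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
        color_in = code_dim + 6 * n_freq_d + 6 * n_freq_x + embed_dim
        self.color = nn.Sequential(                       # two-layer MLP M_c
            nn.Linear(color_in, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))
        self.n_freq_x, self.n_freq_d = n_freq_x, n_freq_d

    def forward(self, psi, x, d, frame_embedding):
        sigma = torch.relu(self.sigma(psi))               # sigma_t(x) >= 0
        feat = torch.cat([psi,
                          positional_encoding(d, self.n_freq_d),
                          positional_encoding(x, self.n_freq_x),
                          frame_embedding], dim=-1)
        rgb = torch.sigmoid(self.color(feat))             # c_t(x) in [0, 1]
        return sigma, rgb

# usage sketch
field = DensityColorField()
psi, x, d = torch.rand(4, 16), torch.rand(4, 3), torch.rand(4, 3)
ell_t = torch.rand(128).expand(4, 128)                    # per-frame embedding l_t
sigma, rgb = field(psi, x, d, ell_t)
```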
Further, the specific method in the step (6) is as follows:
Taking the right-hand skinned model as an example, given a pixel, its camera ray r is first computed using the camera parameters, and N_k points {x_k} are sampled along the ray r between the near and far scene boundaries, which are estimated from the single-hand skinned model. Then, NeRF is used to predict the volume density and color at these points, and all predicted points are mapped back into 3D space to render a three-dimensional model, i.e. the generated volume rendering model. Next, the generated volume rendering model is optimized and refined based on a comparison with the hand images (ground truth) taken from the original video, specifically using the two comparison metrics PSNR and SSIM.
For video frame t, the rendered color C̃_t(r) of the corresponding pixel is given by:
C̃_t(r) = Σ_{k=1}^{N_k} T_k · (1 − exp(−σ_t(x_k) · δ_k)) · c_t(x_k),  with  T_k = exp(−Σ_{j=1}^{k−1} σ_t(x_j) · δ_j)
where δ_k = ||x_{k+1} − x_k||_2 is the distance between adjacent sampling points. N_k is set to 64.
Finally, the parameters of the volume rendering models generated for the left and right hands are projected into the same dimensional space to obtain the two-hand volume rendering model.
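A short sketch of the volume rendering quadrature used in this step is given below; the accumulation follows the standard NeRF formulation reconstructed above, while the padding of the last interval and the usage example with random predictions are assumptions.

```python
import torch

def render_ray(sigma: torch.Tensor, rgb: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Accumulate N_k samples along one ray into a pixel color.

    sigma:  (N_k, 1) predicted densities sigma_t(x_k)
    rgb:    (N_k, 3) predicted colors c_t(x_k)
    points: (N_k, 3) sample positions x_k along the ray, ordered near to far
    """
    delta = (points[1:] - points[:-1]).norm(dim=-1)            # delta_k = ||x_{k+1} - x_k||_2
    delta = torch.cat([delta, delta[-1:]])                     # pad the last interval (assumption)
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)        # opacity of each sample
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)  # accumulated transmittance T_k
    weights = trans * alpha                                    # T_k * (1 - exp(-sigma * delta))
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)            # rendered pixel color

# usage sketch with N_k = 64 random samples standing in for NeRF predictions
n_k = 64
points = torch.linspace(0, 1, n_k)[:, None].repeat(1, 3)
color = render_ray(torch.rand(n_k, 1), torch.rand(n_k, 3), points)
```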
The invention has the following beneficial effects:
The invention provides a method for synthesizing realistic novel views of a performer in complex motion from sparse multi-view video. The invention uses a new implicit neural representation of the dynamic human body, which enables the method to effectively aggregate observations across video frames. The method achieves high-quality reconstruction under various viewpoints and different levels of hand occlusion.
Drawings
FIG. 1 is a schematic flow chart of acquiring a single-hand picture according to an embodiment of the present invention, wherein HSM represents a two-hand separation model, and HDRM represents a hand de-occlusion and removal module;
FIG. 2 is a schematic diagram of a reconstruction process of a human body single-hand model according to the present invention;
FIG. 3 is a comparative schematic of the error rate at InterHand2.6M for the method of the present invention and other methods;
Fig. 4 and 5 are qualitative comparisons of the method of the present invention and existing one-hand reconstruction methods.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and the embodiments.
To address the shortcomings of existing three-dimensional hand reconstruction methods, the invention provides a method for three-dimensional reconstruction of a hand model based on NeRF.
According to research on latent variable models, a latent variable model obtains the distribution of the observed variables by defining the joint distribution of the visible and latent variables and then marginalizing. Inspired by this, implicit 3D representations of the human hand on different video frames are generated from the same set of latent codes, which are anchored to the vertices of a deformable mesh. For each frame, the spatial positions of the codes are transformed according to the human hand pose, and a network regresses the density and color of any 3D position from the structured latent codes. An image from an arbitrary viewpoint can then be synthesized by volume rendering. At the same time, when the input views are highly sparse, the performance of NeRF drops sharply, because learning a neural representation from very sparse observations is ill-posed. The key to resolving this ill-posedness is therefore to aggregate all observations over different video frames. This idea is realized by regressing the 3D representation of each frame with the same network, taking different latent codes as inputs. FIG. 1 is a schematic flow chart of acquiring a single-hand picture according to an embodiment of the present invention, wherein HSM represents the two-hand separation model and HDRM represents the hand de-occlusion and removal module; FIG. 2 is a schematic diagram of the reconstruction process of the human single-hand model according to the present invention.
A three-dimensional hand modeling method based on a neural radiance field comprises the following steps:
Step (1), performing two-hand modal segmentation through a two-hand separation model.
First, the video is divided into frames {I_t^c | c = 1, ..., N_c; t = 1, ..., N_t}, where c is the camera index, N_c is the number of cameras, t is the frame index, and N_t is the number of frames. The two-hand video data shot by the cameras with marked serial numbers are processed, and the two-hand pictures corresponding to the video are obtained and stored with the camera serial numbers. The two-hand separation model adapts the existing semantic segmentation model SegFormer to the two-hand modal segmentation task. Specifically, the number of decoder outputs is increased from 1 to 4 to predict four segmentation masks, namely a right-hand full-region mask M_ra, a right-hand visible-region mask M_rv, a left-hand full-region mask M_la, and a left-hand visible-region mask M_lv, and left- and right-hand segmentation is performed on the input two-hand image through the four segmentation masks to obtain 4 segmented single-hand pictures. These segmentation masks contain spatial location information to roughly locate the left/right hand, as well as information about occlusion regions for de-occluding and removing interfering regions. The segmentation model is supervised with the binary cross-entropy loss L_BCE(x), and the final segmentation loss function is:
L_seg = L_BCE(M_ra, M̂_ra) + L_BCE(M_rv, M̂_rv) + L_BCE(M_la, M̂_la) + L_BCE(M_lv, M̂_lv)
where M_ra, M_rv, M_la and M_lv are the predicted segmentation masks, and M̂_ra, M̂_rv, M̂_la and M̂_lv are the corresponding ground-truth masks.
Step (2), obtaining single-hand pictures through the hand de-occlusion and removal module.
Taking the right hand as an example, first, according to the masks predicted in step (1), a picture is cropped centered on the right-hand full-region mask M_ra. The input of the hand de-occlusion and removal module is concatenated from the following four parts: 1. the picture with the occluded right-hand area blacked out; 2. the right-hand visible-part mask M_rv; 3. the picture with the redundant left-hand area blacked out; 4. the mask M_bv of the background area other than the left and right hands. For the right hand, the right-hand full-region mask M_ra is first used to locate the right hand, then the original image and the segmentation masks are cropped around the center of the right hand. The newly cropped image and masks are denoted as: the cropped image I', the right-hand center-cropped mask M'_ra, the right-hand visible-part center-cropped mask M'_rv, the left-hand center-cropped mask M'_la, and the left-hand visible-part center-cropped mask M'_lv. M_D denotes the area where the target hand is occluded by the other hand, and M_R denotes the area occupied by the distracting hand. Their calculation formulas are as follows:
M_D = M'_ra ⊙ (1 − M'_rv),  M_R = M'_lv ⊙ (1 − M'_ra)
I_D and I_R are the pictures obtained by erasing the cropped original image I' with the masks M_D and M_R, respectively. They guide the hand de-occlusion and removal module where to focus and how to use partial convolution (Partial Convolution) to fill in the two pictures. In addition, the right-hand visible-part mask M_rv and the mask M_bv of the background area other than the left and right hands direct the hand de-occlusion and removal module to distinguish the parts of the picture to be de-occluded from those to be removed. I_D, I_R and M_bv are calculated as follows:
I_D = I' ⊙ (1 − M_D),  I_R = I' ⊙ (1 − M_R),  M_bv = 1 − (M'_ra ∪ M'_la)
I_D, M_rv, I_R and M_bv are input to the hand de-occlusion and removal module, which uses these data to restore the appearance of the occluded part and remove the distracting hand to avoid ambiguity: the pixels of the occluded right-hand area and the pixels of the background behind the redundant left-hand area are predicted, and finally a restored single right-hand image is output. The output single-hand pictures are classified according to camera serial number, and the left-hand and right-hand pictures are stored separately.
Step (3), structuring the latent codes of the single-hand skinned model.
The single-hand images are preprocessed with the EasyMocap tools. For each single-hand image obtained in step (2), the hand pose is estimated, and hand pose joints are obtained using yolov and HRNet. The intrinsic and extrinsic camera parameters are obtained by setting the camera parameters and performing chessboard calibration when shooting the video. In addition, Self-Correction for Human Parsing is adopted to segment the hand region of the single-hand image, yielding a picture that reveals the contour position of the hand in the single-hand image, i.e. a hand contour position image. The corresponding parameters and vertex information are obtained from the camera intrinsics and extrinsics and the estimated hand pose. Next, the single-hand image from step (2) and the EasyMocap-preprocessed data (camera intrinsics and extrinsics, hand pose joints, and hand contour position image) are used to obtain a skinned linear model (SMPL) of the hand. In addition, latent codes are introduced to realize the conversion from the skinned model to the real model, i.e. the EasyMocap-preprocessed data of each single-hand image are stored with latent codes. To control the spatial locations of the latent codes, they are anchored to a deformable human hand skinned linear model (SMPL), a skinned vertex-based model defined as a function of shape parameters, pose parameters, and a rigid transformation relative to the SMPL coordinate system. The SMPL model outputs a posed 3D mesh with 6890 vertices. A set of latent codes Z = {z_1, z_2, ..., z_6890} is defined on the vertices of the initial human hand skinned linear model. For frame t, the SMPL parameters S_t are estimated from the provided multi-view video images {I_t^c}. The spatial positions of the latent codes are then transformed according to the human hand pose S_t to perform density and color regression. The latent codes represent the local geometry and appearance of the human hand via a neural network.
Step (4), diffusing the latent codes of the single-hand skinned model.
Each point in space is represented as a feature vector. NeRF assigns a density and a color to every point in 3D space, which requires querying the latent code at continuous 3D locations by trilinear interpolation. However, since the structured latent codes are relatively sparse in 3D space, directly interpolating them yields zero vectors over a large portion of space. To address this problem, the latent codes defined on the surface are diffused into the nearby 3D space. A sparse convolutional network (SparseConvNet) is selected to efficiently process the structured latent codes. Specifically, based on the parameters of the SMPL skinned model, a 3D bounding box of the human hand is computed and divided into small voxels of size 5 mm × 5 mm × 5 mm, referred to as latent code voxels. The latent code of a non-empty voxel is the average of the latent codes of the SMPL vertices inside the voxel. SparseConvNet processes the input volume with 3D sparse convolutions and outputs latent code volumes at 2×, 4×, 8× and 16× downsampled sizes. Through convolution and downsampling, the input codes are diffused into the nearby space. For any point in 3D space, latent codes are interpolated from the multi-scale code volumes at network layers 5, 9, 13 and 17 and concatenated into the final latent code. Since code diffusion should not be affected by the position and orientation of the human hand in the world coordinate system, code positions are converted into the SMPL coordinate system. For any point x in 3D space, its latent code is queried from the latent code voxels. Specifically, point x is first converted into the SMPL coordinate system, which aligns the points and the latent code volumes in 3D space. The latent code is then computed using trilinear interpolation. For SMPL skinned-model parameters S_t, the latent code at point x is denoted ψ(x, Z, S_t). The code vector ψ(x, Z, S_t) is passed into the density model and the color model to predict the density and color of point x.
Step (5), regressing the density and color of the hand of the single-hand skinned model.
Hand density and color regression is achieved through NeRF, implemented as MLP networks.
Density model: for frame t, the volume density at point x is predicted as a function of the latent code ψ(x, Z, S_t), defined as σ_t(x) = M_σ(ψ(x, Z, S_t)), where M_σ is a four-layer MLP network that follows the density model of NeRF.
Color model: the latent code ψ(x, Z, S_t) and the viewing direction d are taken as inputs for color regression. The color model also takes the point x as input in order to model position-dependent incident light. We observe that time-varying factors, such as secondary exposure and self-shadowing, affect the appearance of the hand. We therefore assign a latent embedding l_t to each video frame t to encode these time-varying factors. Specifically, for frame t, the color at spatial position x is predicted as a function of the latent code ψ(x, Z, S_t), the viewing direction d, the spatial position x, and the latent embedding l_t. The color model of the t-th frame is defined as:
c_t(x) = M_c(ψ(x, Z, S_t), γ_d(d), γ_x(x), l_t)
where M_c is a two-layer MLP network, and γ_d and γ_x are positional encoding functions of the viewing direction and the spatial position, respectively. The dimension of l_t is set to 128.
Step (6), performing hand volume rendering on the single-hand skinned model to obtain the two-hand model.
Taking the right-hand skinned model as an example, given a pixel, its camera ray r is first computed using the camera parameters, and N_k points {x_k} are sampled along the ray r between the near and far scene boundaries, which are estimated from the single-hand skinned model. Then, NeRF is used to predict the volume density and color at these points, and all predicted points are mapped back into 3D space to render a three-dimensional model, i.e. the generated volume rendering model. Next, the generated volume rendering model is optimized and refined based on a comparison with the hand images (ground truth) taken from the original video, specifically using the two conventional comparison metrics PSNR and SSIM.
For video frame t, the rendered color C̃_t(r) of the corresponding pixel is given by:
C̃_t(r) = Σ_{k=1}^{N_k} T_k · (1 − exp(−σ_t(x_k) · δ_k)) · c_t(x_k),  with  T_k = exp(−Σ_{j=1}^{k−1} σ_t(x_j) · δ_j)
where δ_k = ||x_{k+1} − x_k||_2 is the distance between adjacent sampling points. N_k is set to 64.
Finally, the parameters of the volume rendering models generated for the left and right hands are projected into the same dimensional space to obtain the two-hand volume rendering model.
TABLE 1
Table 1 above shows the experimental results of the model of the invention; lower MPJPE and MPVPE values indicate better results. Single-hand and two-hand predictions are evaluated separately; the upper part lists the results of other networks and the lower part the results of our network. It can be seen that our model improves the single-hand and average MPJPE scores by 0.16 and 0.10 respectively, and the corresponding MPVPE scores by 0.22 and 0.12 respectively.
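For context, the two evaluation metrics referenced by the table can be sketched as follows; the joint and vertex counts in the usage example (21 hand joints, 778 MANO vertices) are assumptions for illustration, not values from the patent.

```python
import torch

def mpjpe(pred_joints: torch.Tensor, gt_joints: torch.Tensor) -> torch.Tensor:
    """Mean Per-Joint Position Error: average Euclidean distance over joints.
    pred_joints, gt_joints: (J, 3) predicted and ground-truth 3D joints in the same units."""
    return (pred_joints - gt_joints).norm(dim=-1).mean()

def mpvpe(pred_verts: torch.Tensor, gt_verts: torch.Tensor) -> torch.Tensor:
    """Mean Per-Vertex Position Error: the same distance averaged over mesh vertices."""
    return (pred_verts - gt_verts).norm(dim=-1).mean()

# usage sketch with random tensors
err_j = mpjpe(torch.rand(21, 3), torch.rand(21, 3))    # 21 hand joints (assumed)
err_v = mpvpe(torch.rand(778, 3), torch.rand(778, 3))  # 778 mesh vertices (assumed)
```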
FIG. 3 is a comparison of the error rates of the method of the present invention and other methods at InterHand2.6M, with lower error values being better.
FIG. 4 is a qualitative comparison of the interacting-hand reconstruction of the present method with the most advanced existing single-hand reconstruction methods (Boukhayma et al. and Zhou et al.) on InterHand2.6M. FIG. 5 shows qualitative results of the prior methods and of our network for interacting-hand reconstruction on InterHand2.6M (row 1) and Haggling (row 3). Our method achieves high-quality reconstruction performance at various viewpoints and different levels of inter-hand occlusion.
The foregoing is a further detailed description of the invention in connection with specific/preferred embodiments, and it is not intended that the invention be limited to such description. It will be apparent to those skilled in the art that several alternatives or modifications can be made to the described embodiments without departing from the spirit of the invention, and these alternatives or modifications should be considered to be within the scope of the invention.
Aspects of the invention not described in detail herein are within the ordinary skill of those skilled in the art.
Claims (7)
1. A three-dimensional hand modeling method based on a neural radiance field, characterized by comprising the following steps:
Step (1), performing two-hand modal segmentation through a two-hand separation model;
Step (2), obtaining single-hand pictures through a hand de-occlusion and removal module;
Step (3), structuring the latent codes of the single-hand skinned model;
Step (4), diffusing the latent codes of the single-hand skinned model;
Step (5), regressing the density and color of the hand of the single-hand skinned model;
Step (6), performing hand volume rendering on the single-hand skinned model to obtain a two-hand model.
2. The three-dimensional hand modeling method based on a neural radiance field according to claim 1, wherein the specific method of step (1) is as follows:
First, the video is divided into frames {I_t^c | c = 1, ..., N_c; t = 1, ..., N_t}, where c is the camera index, N_c is the number of cameras, t is the frame index, and N_t is the number of frames; the two-hand video data shot by the cameras with marked serial numbers are processed, and the two-hand pictures of the corresponding video are obtained and stored with the camera serial numbers; the two-hand separation model adapts the existing semantic segmentation model SegFormer to the two-hand modal segmentation task; specifically, the number of decoder outputs is increased from 1 to 4 to predict four segmentation masks, namely a right-hand full-region mask M_ra, a right-hand visible-region mask M_rv, a left-hand full-region mask M_la, and a left-hand visible-region mask M_lv, and left- and right-hand segmentation is performed on the input two-hand image through the four segmentation masks to obtain 4 segmented single-hand pictures; these segmentation masks contain spatial location information to roughly locate the left/right hand, as well as information about occlusion regions for de-occluding and removing interfering regions; the segmentation model is supervised with the binary cross-entropy loss L_BCE(x), and the final segmentation loss function is:
L_seg = L_BCE(M_ra, M̂_ra) + L_BCE(M_rv, M̂_rv) + L_BCE(M_la, M̂_la) + L_BCE(M_lv, M̂_lv)
where M_ra, M_rv, M_la and M_lv are the predicted segmentation masks, and M̂_ra, M̂_rv, M̂_la and M̂_lv are the corresponding ground-truth masks.
3. The three-dimensional hand modeling method based on a neural radiance field according to claim 2, wherein the specific method of step (2) is as follows:
Taking the right hand as an example, first, according to the masks predicted in step (1), a picture is cropped centered on the right-hand full-region mask M_ra; the input of the hand de-occlusion and removal module is concatenated from the following four parts: 1. the picture with the occluded right-hand area blacked out; 2. the right-hand visible-part mask M_rv; 3. the picture with the redundant left-hand area blacked out; 4. the mask M_bv of the background area other than the left and right hands; for the right hand, the right-hand full-region mask M_ra is first used to locate the right hand, then the original image and the segmentation masks are cropped around the center of the right hand; the newly cropped image and masks are denoted as the cropped image I', the right-hand center-cropped mask M'_ra, the right-hand visible-part center-cropped mask M'_rv, the left-hand center-cropped mask M'_la, and the left-hand visible-part center-cropped mask M'_lv; M_D denotes the area where the target hand is occluded by the other hand, and M_R denotes the area occupied by the distracting hand; their calculation formulas are as follows:
M_D = M'_ra ⊙ (1 − M'_rv),  M_R = M'_lv ⊙ (1 − M'_ra)
I_D and I_R are the pictures obtained by erasing the cropped original image I' with the masks M_D and M_R, respectively; they guide the hand de-occlusion and removal module where to focus and how to use partial convolution (Partial Convolution) to fill in the two pictures; in addition, the right-hand visible-part mask M_rv and the mask M_bv of the background area other than the left and right hands direct the hand de-occlusion and removal module to distinguish the parts of the picture to be de-occluded from those to be removed; I_D, I_R and M_bv are calculated as follows:
I_D = I' ⊙ (1 − M_D),  I_R = I' ⊙ (1 − M_R),  M_bv = 1 − (M'_ra ∪ M'_la)
I_D, M_rv, I_R and M_bv are input to the hand de-occlusion and removal module, which uses these data to restore the appearance of the occluded part and remove the distracting hand to avoid ambiguity: the pixels of the occluded right-hand area and the pixels of the background behind the redundant left-hand area are predicted, and finally a restored single right-hand image is output; the output single-hand pictures are classified according to camera serial number, and the left-hand and right-hand pictures are stored separately.
4. The three-dimensional hand modeling method based on a neural radiance field according to claim 3, wherein the specific method of step (3) is as follows:
The single-hand images are preprocessed with the EasyMocap tools; for each single-hand image obtained in step (2), the hand pose is estimated, and hand pose joints are obtained using yolov and HRNet; the intrinsic and extrinsic camera parameters are obtained by setting the camera parameters and performing chessboard calibration when shooting the video; in addition, Self-Correction for Human Parsing is adopted to segment the hand region of the single-hand image, yielding a picture that reveals the contour position of the hand in the single-hand image, i.e. a hand contour position image; the corresponding parameters and vertex information are obtained from the camera intrinsics and extrinsics and the estimated hand pose; next, the single-hand image from step (2) and the EasyMocap-preprocessed data are used to obtain a skinned linear model SMPL of the hand; in addition, latent codes are introduced to realize the conversion from the skinned model to the real model, i.e. the EasyMocap-preprocessed data of each single-hand image are stored with latent codes; to control the spatial locations of the latent codes, they are anchored to a deformable human hand skinned linear model, where SMPL is a skinned vertex-based model defined as a function of shape parameters, pose parameters, and a rigid transformation relative to the SMPL coordinate system; the SMPL model outputs a posed 3D mesh with 6890 vertices; a set of latent codes Z = {z_1, z_2, ..., z_6890} is defined on the vertices of the SMPL model; for frame t, the SMPL parameters S_t are estimated from the provided multi-view video images {I_t^c}; then, the spatial positions of the latent codes are transformed according to the human hand pose S_t to perform density and color regression; the latent codes represent the local geometry and appearance of the human hand via a neural network.
5. The three-dimensional hand modeling method based on a neural radiance field according to claim 4, wherein the specific method of step (4) is as follows:
Each point in space is represented as a feature vector; NeRF assigns a density and a color to every point in 3D space, which requires querying the latent code at continuous 3D locations by trilinear interpolation; the latent codes defined on the surface are diffused into the nearby 3D space; a sparse convolutional network is selected to efficiently process the structured latent codes; specifically, based on the parameters of the SMPL skinned model, a 3D bounding box of the human hand is computed and divided into small voxels of size 5 mm × 5 mm × 5 mm, referred to as latent code voxels; the latent code of a non-empty voxel is the average of the latent codes of the SMPL vertices inside the voxel; SparseConvNet processes the input volume with 3D sparse convolutions and outputs latent code volumes at 2×, 4×, 8×, 16× downsampled sizes; through convolution and downsampling, the input codes are diffused into the nearby space; for any point in 3D space, latent codes are interpolated from the multi-scale code volumes at network layers 5, 9, 13, 17 and concatenated into the final latent code; since code diffusion should not be affected by the position and orientation of the human hand in the world coordinate system, code positions are converted into the SMPL coordinate system; for any point x in 3D space, its latent code is queried from the latent code voxels; specifically, point x is first converted into the SMPL coordinate system, which aligns the points and the latent code volumes in 3D space; then, the latent code is computed using trilinear interpolation; for SMPL skinned-model parameters S_t, the latent code at point x is denoted ψ(x, Z, S_t); the code vector ψ(x, Z, S_t) is passed into the density model and the color model to predict the density and color of point x.
6. The three-dimensional hand modeling method based on a neural radiance field according to claim 5, wherein the specific method of step (5) is as follows:
Hand density and color regression is achieved through NeRF, implemented as MLP networks;
Density model: for frame t, the volume density at point x is predicted as a function of the latent code ψ(x, Z, S_t), defined as σ_t(x) = M_σ(ψ(x, Z, S_t)), where M_σ is a four-layer MLP network that follows the density model of NeRF;
Color model: the latent code ψ(x, Z, S_t) and the viewing direction d are taken as inputs for color regression; to model position-dependent incident light, the color model also takes the point x as input; a latent embedding l_t is assigned to each video frame t to encode time-varying factors; specifically, for frame t, the color at spatial position x is predicted as a function of the latent code ψ(x, Z, S_t), the viewing direction d, the spatial position x and the latent embedding l_t; the color model of the t-th frame is defined as:
c_t(x) = M_c(ψ(x, Z, S_t), γ_d(d), γ_x(x), l_t)
where M_c is a two-layer MLP network, and γ_d and γ_x are positional encoding functions of the viewing direction and the spatial position, respectively; the dimension of l_t is set to 128.
7. The three-dimensional hand modeling method based on a neural radiance field according to claim 6, wherein the specific method of step (6) is as follows:
Taking the right-hand skinned model as an example, given a pixel, its camera ray r is first computed using the camera parameters, and N_k points {x_k} are sampled along the ray r between the near and far scene boundaries, which are estimated from the single-hand skinned model; then, NeRF is used to predict the volume density and color at these points, and all predicted points are mapped back into 3D space to render a three-dimensional model, i.e. the generated volume rendering model; next, the generated volume rendering model is compared with the hand pictures taken from the original video, and is optimized and refined specifically using the two comparison metrics PSNR and SSIM;
for video frame t, the rendered color C̃_t(r) of the corresponding pixel is given by:
C̃_t(r) = Σ_{k=1}^{N_k} T_k · (1 − exp(−σ_t(x_k) · δ_k)) · c_t(x_k),  with  T_k = exp(−Σ_{j=1}^{k−1} σ_t(x_j) · δ_j)
where δ_k = ||x_{k+1} − x_k||_2 is the distance between adjacent sampling points; N_k is set to 64;
finally, the parameters of the volume rendering models generated for the left and right hands are projected into the same dimensional space to obtain the two-hand volume rendering model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311730771.1A CN117911609A (en) | 2023-12-15 | 2023-12-15 | Three-dimensional hand modeling method based on neural radiance field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311730771.1A CN117911609A (en) | 2023-12-15 | 2023-12-15 | Three-dimensional hand modeling method based on neural radiance field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117911609A true CN117911609A (en) | 2024-04-19 |
Family
ID=90693050
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311730771.1A Pending CN117911609A (en) | 2023-12-15 | 2023-12-15 | Three-dimensional hand modeling method based on neural radiance field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117911609A (en) |
-
2023
- 2023-12-15 CN CN202311730771.1A patent/CN117911609A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Grassal et al. | Neural head avatars from monocular rgb videos | |
Yuan et al. | Star: Self-supervised tracking and reconstruction of rigid objects in motion with neural rendering | |
CN115082639B (en) | Image generation method, device, electronic equipment and storage medium | |
Alatan et al. | Scene representation technologies for 3DTV—A survey | |
CN111986307A (en) | 3D object reconstruction using photometric grid representation | |
CN113628348B (en) | Method and equipment for determining viewpoint path in three-dimensional scene | |
US9747668B2 (en) | Reconstruction of articulated objects from a moving camera | |
Lin et al. | Deep multi depth panoramas for view synthesis | |
CN114863038B (en) | Real-time dynamic free visual angle synthesis method and device based on explicit geometric deformation | |
US11704853B2 (en) | Techniques for feature-based neural rendering | |
CN116310076A (en) | Three-dimensional reconstruction method, device, equipment and storage medium based on nerve radiation field | |
US11961266B2 (en) | Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture | |
CN116134491A (en) | Multi-view neuro-human prediction using implicit differentiable renderers for facial expression, body posture morphology, and clothing performance capture | |
CN116681838A (en) | Monocular video dynamic human body three-dimensional reconstruction method based on gesture optimization | |
CN115951784A (en) | Dressing human body motion capture and generation method based on double nerve radiation fields | |
Choi et al. | Balanced spherical grid for egocentric view synthesis | |
CN117218246A (en) | Training method and device for image generation model, electronic equipment and storage medium | |
JP2023527438A (en) | Geometry Recognition Augmented Reality Effect Using Real-time Depth Map | |
Ehret et al. | Regularization of NeRFs using differential geometry | |
CN116228986A (en) | Indoor scene illumination estimation method based on local-global completion strategy | |
CN109816765A (en) | Texture towards dynamic scene determines method, apparatus, equipment and medium in real time | |
CN117911609A (en) | Three-dimensional hand modeling method based on nerve radiation field | |
CN116883524A (en) | Image generation model training, image generation method and device and computer equipment | |
Jian et al. | Realistic face animation generation from videos | |
CN118314271B (en) | 3D Gaussian rasterization-based rapid high-precision dense reconstruction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |