CN117911609A - Three-dimensional hand modeling method based on neural radiance field - Google Patents
Three-dimensional hand modeling method based on neural radiance field
- Publication number
- CN117911609A CN117911609A CN202311730771.1A CN202311730771A CN117911609A CN 117911609 A CN117911609 A CN 117911609A CN 202311730771 A CN202311730771 A CN 202311730771A CN 117911609 A CN117911609 A CN 117911609A
- Authority
- CN
- China
- Prior art keywords
- hand
- model
- potential
- code
- mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Graphics (AREA)
- Human Computer Interaction (AREA)
- Social Psychology (AREA)
- Psychiatry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Geometry (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a three-dimensional hand modeling method based on a neural radiance field, which comprises the following steps: step (1), performing two-hand modal segmentation through a two-hand separation model; step (2), obtaining single-hand pictures through a hand de-occlusion and removal module; step (3), structuring the latent codes of the single-hand skinned model; step (4), diffusing the latent codes of the single-hand skinned model; step (5), regressing the density and color of the hand of the single-hand skinned model; and step (6), performing hand volume rendering on the single-hand skinned model to obtain a two-hand model. The invention uses a new implicit neural representation of the dynamic human body, which enables the method to effectively aggregate observations across video frames. The method achieves high-quality reconstruction under various viewpoints and different levels of hand occlusion.
Description
Technical Field
The invention relates to the field of computer image processing and three-dimensional reconstruction, and in particular to a method that combines image data with a neural radiance field and uses computer vision and deep-learning algorithms to generate a high-precision three-dimensional model of the hand.
Background
In recent years, neural implicit representations have developed rapidly in three-dimensional modeling and image synthesis. A neural implicit representation models a scene with a neural network that is spatially continuous and exhibits higher fidelity and flexibility than classical discrete counterparts such as meshes, point clouds, and voxels. As the most popular implicit representation in neural rendering, Neural Radiance Fields (NeRF) have shown striking results on a variety of tasks since their introduction. The original NeRF is by design overfitted to a static scene, so it cannot model time-varying content.
Image rendering of animated articulated objects, i.e. human bodies, hands, etc., can be seen as a special case of dynamic scene modeling. Most early efforts performed reconstruction with skeleton-driven meshes, which generally relied on expensive calibration and a large number of samples to produce high-quality results.
Modeling and reconstruction of the human hand, the dexterous tool with which we interact with the physical world and convey rich semantic information, has attracted extensive attention from the research community. Synthesizing realistic hand images or videos with different poses in motion has wide applications, such as human-machine interaction, sign language generation, and virtual and augmented reality techniques such as telepresence.
Classical hand modeling work is mainly built on top of parameterized mesh models such as MANO. These methods fit the hand geometry to a polygonal mesh driven by shape and pose parameters and then complete shading by texture mapping. Although widely used, such models have the following limitations. On the one hand, high-frequency details are difficult to present on the polygonal mesh, preventing the generation of realistic images. On the other hand, no dedicated design has been developed for interacting hands, an important scenario involving complex gestures and severe self-occlusion.
Disclosure of Invention
The technical problem to be solved is as follows: previous free-viewpoint video systems either rely on dense camera arrays for image-based novel view synthesis or require depth sensors for high-quality three-dimensional reconstruction to produce realistic rendering. The complex hardware makes free-viewpoint video systems expensive and suitable only for controlled environments. The invention addresses the challenge of novel view synthesis for human performers from a very sparse set of camera views.
To address the shortcomings of existing three-dimensional hand reconstruction methods, the invention provides a three-dimensional hand modeling method based on a neural radiance field.
According to research on latent variable models, a latent variable model obtains the distribution of the observed variables by defining the joint distribution of the visible and latent variables and then marginalizing. Inspired by this, implicit 3D representations of the human hand on different video frames are generated from the same set of latent codes, which are anchored to the vertices of a deformable mesh. For each frame, the spatial positions of the codes are transformed according to the human hand pose, and a network regresses the density and color of any 3D position from the structured latent codes. An image from an arbitrary viewpoint can then be synthesized by volume rendering. At the same time, when the input views are highly sparse, the performance of NeRF drops sharply, because learning a neural representation from very sparse observations is ill-posed. The key to resolving this ill-posedness is therefore to aggregate all observations over different video frames. This idea is realized by regressing the 3D representation of each frame with the same network, taking different latent codes as inputs.
A three-dimensional hand modeling method based on a neural radiance field comprises the following steps:
Step (1), performing two-hand modal segmentation through a two-hand separation model;
Step (2), obtaining single-hand pictures through a hand de-occlusion and removal module;
Step (3), structuring the latent codes of the single-hand skinned model;
Step (4), diffusing the latent codes of the single-hand skinned model;
Step (5), regressing the density and color of the hand of the single-hand skinned model;
Step (6), performing hand volume rendering on the single-hand skinned model to obtain a two-hand model.
Further, the specific method of the step (1) is as follows:
First, the video is divided into frames {I_t^c | c = 1, ..., N_c; t = 1, ..., N_t}, where c is the camera index, N_c is the number of cameras, t is the frame index, and N_t is the number of frames. The two-hand video data shot by the cameras with marked serial numbers are processed, and the two-hand pictures corresponding to the video are obtained and stored with the camera serial numbers. The two-hand separation model adapts the existing semantic segmentation model SegFormer to the two-hand modal segmentation task. Specifically, the number of decoder outputs is increased from 1 to 4 to predict four segmentation masks, namely a right-hand full-region mask M_ra, a right-hand visible-region mask M_rv, a left-hand full-region mask M_la, and a left-hand visible-region mask M_lv, and left- and right-hand segmentation is performed on the input two-hand image through the four segmentation masks to obtain 4 segmented single-hand pictures. These segmentation masks contain spatial location information to roughly locate the left/right hand, as well as information about occlusion regions for de-occluding and removing interfering regions. The segmentation model is supervised with the binary cross-entropy loss L_BCE(x), and the final segmentation loss function is:
L_seg = L_BCE(M_ra, M̂_ra) + L_BCE(M_rv, M̂_rv) + L_BCE(M_la, M̂_la) + L_BCE(M_lv, M̂_lv)
where M_ra, M_rv, M_la and M_lv are the predicted segmentation masks, and M̂_ra, M̂_rv, M̂_la and M̂_lv are the corresponding ground-truth masks.
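As an illustration only, the four-mask supervision described above can be sketched in PyTorch as follows; the wrapper class `TwoHandSegHead`, the channel ordering of the four masks, and the tensor shapes are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHandSegHead(nn.Module):
    """Hypothetical 4-channel decoder head added on top of a SegFormer-style backbone.
    Assumed channel order: [M_ra, M_rv, M_la, M_lv]."""
    def __init__(self, backbone_channels: int = 256):
        super().__init__()
        self.head = nn.Conv2d(backbone_channels, 4, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) features from the segmentation backbone
        return self.head(feats)  # (B, 4, H, W) mask logits

def segmentation_loss(logits: torch.Tensor, gt_masks: torch.Tensor) -> torch.Tensor:
    """Sum of binary cross-entropy losses over the four predicted masks.

    logits:   (B, 4, H, W) raw mask predictions
    gt_masks: (B, 4, H, W) ground-truth masks in {0, 1}
    """
    return sum(
        F.binary_cross_entropy_with_logits(logits[:, i], gt_masks[:, i])
        for i in range(4)
    )

# usage sketch
head = TwoHandSegHead()
feats = torch.randn(2, 256, 64, 64)
gt = torch.randint(0, 2, (2, 4, 64, 64)).float()
loss = segmentation_loss(head(feats), gt)
```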
Further, the specific method of the step (2) is as follows:
Taking the right hand as an example, first, according to the masks predicted in step (1), a picture is cropped centered on the right-hand full-region mask M_ra. The input of the hand de-occlusion and removal module is concatenated from the following four parts: 1. the picture with the occluded right-hand area blacked out; 2. the right-hand visible-part mask M_rv; 3. the picture with the redundant left-hand area blacked out; 4. the mask M_bv of the background area other than the left and right hands. For the right hand, the right-hand full-region mask M_ra is first used to locate the right hand, then the original image and the segmentation masks are cropped around the center of the right hand. The newly cropped image and masks are denoted as: the cropped image I', the right-hand center-cropped mask M'_ra, the right-hand visible-part center-cropped mask M'_rv, the left-hand center-cropped mask M'_la, and the left-hand visible-part center-cropped mask M'_lv. M_D denotes the area where the target hand is occluded by the other hand, and M_R denotes the area occupied by the distracting hand. Their calculation formulas are as follows:
M_D = M'_ra ⊙ (1 − M'_rv),  M_R = M'_lv ⊙ (1 − M'_ra)
I_D and I_R are the pictures obtained by erasing the cropped original image I' with the masks M_D and M_R, respectively. They guide the hand de-occlusion and removal module where to focus and how to use partial convolution (Partial Convolution) to fill in the two pictures. In addition, the right-hand visible-part mask M_rv and the mask M_bv of the background area other than the left and right hands direct the hand de-occlusion and removal module to distinguish the parts of the picture to be de-occluded from those to be removed. I_D, I_R and M_bv are calculated as follows:
I_D = I' ⊙ (1 − M_D),  I_R = I' ⊙ (1 − M_R),  M_bv = 1 − (M'_ra ∪ M'_la)
I_D, M_rv, I_R and M_bv are input to the hand de-occlusion and removal module, which uses these data to restore the appearance of the occluded part and remove the distracting hand to avoid ambiguity: the pixels of the occluded right-hand area and the pixels of the background behind the redundant left-hand area are predicted, and finally a restored single right-hand image is output. The output single-hand pictures are classified according to camera serial number, and the left-hand and right-hand pictures are stored separately.
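The following PyTorch-style sketch illustrates, under the mask formulas reconstructed above, how the inputs of the de-occlusion and removal module could be assembled; the exact expressions for M_D, M_R and M_bv, as well as the function name and tensor shapes, are assumptions rather than text taken from the patent.

```python
import torch

def build_hdr_inputs(img_crop, m_ra, m_rv, m_la, m_lv):
    """Assemble the inputs of the hand de-occlusion and removal module.

    img_crop: (3, H, W) image cropped around the right hand
    m_ra, m_rv, m_la, m_lv: (1, H, W) cropped masks in {0, 1}
    Returns the erased images I_D, I_R and the masks M_D, M_R, M_bv.
    """
    m_d = m_ra * (1.0 - m_rv)                          # right-hand region hidden by the other hand (assumed)
    m_r = m_lv * (1.0 - m_ra)                          # redundant left-hand region to be removed (assumed)
    m_bv = 1.0 - torch.clamp(m_ra + m_la, max=1.0)     # background outside both hands (assumed)

    i_d = img_crop * (1.0 - m_d)                       # image with the occluded right-hand area blacked out
    i_r = img_crop * (1.0 - m_r)                       # image with the distracting left-hand area blacked out
    return i_d, i_r, m_d, m_r, m_bv

# usage sketch: the module would take the concatenation [I_D, M_rv, I_R, M_bv]
img = torch.rand(3, 256, 256)
masks = [torch.randint(0, 2, (1, 256, 256)).float() for _ in range(4)]
i_d, i_r, m_d, m_r, m_bv = build_hdr_inputs(img, *masks)
hdr_input = torch.cat([i_d, masks[1], i_r, m_bv], dim=0)  # (8, 256, 256)
```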
Further, the specific method of the step (3) is as follows:
The single-hand images are preprocessed with the EasyMocap tools. For each single-hand image obtained in step (2), the hand pose is estimated, and hand pose joints are obtained using yolov and HRNet. The intrinsic and extrinsic camera parameters are obtained by setting the camera parameters and performing chessboard calibration when shooting the video. In addition, Self-Correction for Human Parsing is adopted to segment the hand region of the single-hand image, yielding a picture that reveals the contour position of the hand in the single-hand image, i.e. a hand contour position image. The corresponding parameters and vertex information are obtained from the camera intrinsics and extrinsics and the estimated hand pose. Next, the single-hand image from step (2) and the EasyMocap-preprocessed data (camera intrinsics and extrinsics, hand pose joints, and hand contour position image) are used to obtain a skinned linear model (SMPL) of the hand. In addition, latent codes are introduced to realize the conversion from the skinned model to the real model, i.e. the EasyMocap-preprocessed data of each single-hand image are stored with latent codes. To control the spatial locations of the latent codes, they are anchored to a deformable human hand skinned linear model (SMPL), a skinned vertex-based model defined as a function of shape parameters, pose parameters, and a rigid transformation relative to the SMPL coordinate system. The SMPL model outputs a posed 3D mesh with 6890 vertices. A set of latent codes Z = {z_1, z_2, ..., z_6890} is defined on the vertices of the initial human hand skinned linear model. For frame t, the SMPL parameters S_t are estimated from the provided multi-view video images {I_t^c}. The spatial positions of the latent codes are then transformed according to the human hand pose S_t to perform density and color regression. The latent codes represent the local geometry and appearance of the human hand via a neural network.
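A minimal sketch of how per-vertex latent codes could be anchored to the posed mesh is given below, assuming a PyTorch embedding of dimension 16 and random vertex positions standing in for the posed SMPL vertices; the class name, the code dimension, and the stand-in vertices are all assumptions.

```python
import torch
import torch.nn as nn

NUM_VERTICES = 6890   # number of SMPL mesh vertices
CODE_DIM = 16         # assumed dimensionality of each latent code z_i

class StructuredLatentCodes(nn.Module):
    """Learnable latent codes Z = {z_1, ..., z_6890} anchored to SMPL vertices."""
    def __init__(self):
        super().__init__()
        self.codes = nn.Embedding(NUM_VERTICES, CODE_DIM)

    def forward(self, posed_vertices: torch.Tensor):
        """posed_vertices: (6890, 3) vertex positions of the mesh posed with S_t.
        The codes keep their identity across frames; only their anchor positions
        move with the hand pose."""
        idx = torch.arange(NUM_VERTICES, device=posed_vertices.device)
        return posed_vertices, self.codes(idx)

# usage sketch with random positions standing in for the posed SMPL vertices of frame t
latents = StructuredLatentCodes()
posed_vertices = torch.rand(NUM_VERTICES, 3)
positions, codes = latents(posed_vertices)   # (6890, 3), (6890, 16)
```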
Further, the specific method of the step (4) is as follows:
Each point in space is represented as a feature vector. NeRF assigns a density and a color to every point in 3D space, which requires querying the latent code at continuous 3D locations by trilinear interpolation. The latent codes defined on the surface are therefore diffused into the nearby 3D space. A sparse convolutional network (SparseConvNet) is selected to efficiently process the structured latent codes. Specifically, based on the parameters of the SMPL skinned model, a 3D bounding box of the human hand is computed and divided into small voxels of size 5 mm × 5 mm × 5 mm, referred to as latent code voxels. The latent code of a non-empty voxel is the average of the latent codes of the SMPL vertices inside the voxel. SparseConvNet processes the input volume with 3D sparse convolutions and outputs latent code volumes at 2×, 4×, 8× and 16× downsampled sizes. Through convolution and downsampling, the input codes are diffused into the nearby space. For any point in 3D space, latent codes are interpolated from the multi-scale code volumes at network layers 5, 9, 13 and 17 and concatenated into the final latent code. Since code diffusion should not be affected by the position and orientation of the human hand in the world coordinate system, code positions are converted into the SMPL coordinate system. For any point x in 3D space, its latent code is queried from the latent code voxels. Specifically, point x is first converted into the SMPL coordinate system, which aligns the points and the latent code volumes in 3D space. The latent code is then computed using trilinear interpolation. For SMPL skinned-model parameters S_t, the latent code at point x is denoted ψ(x, Z, S_t). The code vector ψ(x, Z, S_t) is passed into the density model and the color model to predict the density and color of point x.
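As a rough illustration of latent code diffusion and the trilinear query, the sketch below voxelizes per-vertex codes, diffuses them with a dense 3D convolution standing in for the sparse convolutional network, and queries arbitrary points with `grid_sample`; the voxel resolution, the single-scale (rather than multi-scale) output, and all function names are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def voxelize_codes(verts, codes, grid_res=32):
    """Scatter per-vertex latent codes into a (C, D, H, W) voxel grid.
    verts: (N, 3) vertex positions normalized to [-1, 1]; codes: (N, C)."""
    c = codes.shape[1]
    vol = torch.zeros(c, grid_res, grid_res, grid_res)
    cnt = torch.zeros(1, grid_res, grid_res, grid_res)
    idx = ((verts * 0.5 + 0.5) * (grid_res - 1)).long().clamp(0, grid_res - 1)
    for (i, j, k), z in zip(idx, codes):
        vol[:, k, j, i] += z          # average the codes of vertices falling in each voxel
        cnt[:, k, j, i] += 1.0
    return vol / cnt.clamp(min=1.0)

class CodeDiffusion(nn.Module):
    """Dense 3D convolution used here only as a stand-in for SparseConvNet."""
    def __init__(self, c=16):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv3d(c, c, 3, padding=1), nn.ReLU(),
                                  nn.Conv3d(c, c, 3, padding=1))

    def forward(self, vol):
        return self.conv(vol.unsqueeze(0))   # (1, C, D, H, W)

def query_code(vol, x):
    """Trilinear interpolation of the diffused code volume at points x in [-1, 1]^3."""
    grid = x.view(1, -1, 1, 1, 3)             # (1, N, 1, 1, 3)
    out = F.grid_sample(vol, grid, align_corners=True)
    return out.view(vol.shape[1], -1).t()     # (N, C): psi(x, Z, S_t)

verts, codes = torch.rand(6890, 3) * 2 - 1, torch.rand(6890, 16)
vol = CodeDiffusion()(voxelize_codes(verts, codes))
psi = query_code(vol, torch.rand(4, 3) * 2 - 1)   # codes for 4 query points
```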
Further, the specific method in the step (5) is as follows:
Hand density and color regression is achieved through NeRF, implemented as MLP networks.
Density model: for frame t, the volume density at point x is predicted as a function of the latent code ψ(x, Z, S_t), defined as σ_t(x) = M_σ(ψ(x, Z, S_t)), where M_σ is a four-layer MLP network that follows the density model of NeRF.
Color model: the latent code ψ(x, Z, S_t) and the viewing direction d are taken as inputs for color regression. The color model also takes the point x as input in order to model position-dependent incident light. A latent embedding l_t is assigned to each video frame t to encode time-varying factors. Specifically, for frame t, the color at spatial position x is predicted as a function of the latent code ψ(x, Z, S_t), the viewing direction d, the spatial position x, and the latent embedding l_t. The color model of the t-th frame is defined as:
c_t(x) = M_c(ψ(x, Z, S_t), γ_d(d), γ_x(x), l_t)
where M_c is a two-layer MLP network, and γ_d and γ_x are positional encoding functions of the viewing direction and the spatial position, respectively. The dimension of l_t is set to 128.
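The sketch below instantiates the density head M_σ and the color head M_c; the layer counts follow the text, while the hidden widths, positional-encoding frequency counts, the assumed code dimension of 16 (matching the earlier sketch), and the class names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def positional_encoding(v: torch.Tensor, num_freqs: int) -> torch.Tensor:
    """NeRF-style encoding gamma(v) = (sin(2^k v), cos(2^k v)) for k = 0..num_freqs-1."""
    out = []
    for k in range(num_freqs):
        out += [torch.sin((2.0 ** k) * v), torch.cos((2.0 ** k) * v)]
    return torch.cat(out, dim=-1)

class DensityColorField(nn.Module):
    """Density head M_sigma (four layers) and color head M_c (two layers)."""
    def __init__(self, code_dim=16, hidden=256, n_freq_x=10, n_freq_d=4, embed_dim=128):
        super().__init__()
        self.sigma = nn.Sequential(                       # four-layer MLP M_sigma
            nn.Linear(code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
        color_in = code_dim + 6 * n_freq_d + 6 * n_freq_x + embed_dim
        self.color = nn.Sequential(                       # two-layer MLP M_c
            nn.Linear(color_in, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))
        self.n_freq_x, self.n_freq_d = n_freq_x, n_freq_d

    def forward(self, psi, x, d, frame_embedding):
        sigma = torch.relu(self.sigma(psi))               # sigma_t(x) >= 0
        feat = torch.cat([psi,
                          positional_encoding(d, self.n_freq_d),
                          positional_encoding(x, self.n_freq_x),
                          frame_embedding], dim=-1)
        rgb = torch.sigmoid(self.color(feat))             # c_t(x) in [0, 1]
        return sigma, rgb

# usage sketch
field = DensityColorField()
psi, x, d = torch.rand(4, 16), torch.rand(4, 3), torch.rand(4, 3)
ell_t = torch.rand(128).expand(4, 128)                    # per-frame embedding l_t
sigma, rgb = field(psi, x, d, ell_t)
```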
Further, the specific method in the step (6) is as follows:
Taking the right-hand skinned model as an example, given a pixel, its camera ray r is first computed using the camera parameters, and N_k points {x_k} are sampled along the ray r between the near and far scene boundaries, which are estimated from the single-hand skinned model. Then, NeRF is used to predict the volume density and color at these points, and all predicted points are mapped back into 3D space to render a three-dimensional model, i.e. the generated volume rendering model. Next, the generated volume rendering model is optimized and refined based on a comparison with the hand images (ground truth) taken from the original video, specifically using the two comparison metrics PSNR and SSIM.
For video frame t, the rendered color C̃_t(r) of the corresponding pixel is given by:
C̃_t(r) = Σ_{k=1}^{N_k} T_k · (1 − exp(−σ_t(x_k) · δ_k)) · c_t(x_k),  with  T_k = exp(−Σ_{j=1}^{k−1} σ_t(x_j) · δ_j)
where δ_k = ||x_{k+1} − x_k||_2 is the distance between adjacent sampling points. N_k is set to 64.
Finally, the parameters of the volume rendering models generated for the left and right hands are projected into the same dimensional space to obtain the two-hand volume rendering model.
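A short sketch of the volume rendering quadrature used in this step is given below; the accumulation follows the standard NeRF formulation reconstructed above, while the padding of the last interval and the usage example with random predictions are assumptions.

```python
import torch

def render_ray(sigma: torch.Tensor, rgb: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Accumulate N_k samples along one ray into a pixel color.

    sigma:  (N_k, 1) predicted densities sigma_t(x_k)
    rgb:    (N_k, 3) predicted colors c_t(x_k)
    points: (N_k, 3) sample positions x_k along the ray, ordered near to far
    """
    delta = (points[1:] - points[:-1]).norm(dim=-1)            # delta_k = ||x_{k+1} - x_k||_2
    delta = torch.cat([delta, delta[-1:]])                     # pad the last interval (assumption)
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)        # opacity of each sample
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)  # accumulated transmittance T_k
    weights = trans * alpha                                    # T_k * (1 - exp(-sigma * delta))
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)            # rendered pixel color

# usage sketch with N_k = 64 random samples standing in for NeRF predictions
n_k = 64
points = torch.linspace(0, 1, n_k)[:, None].repeat(1, 3)
color = render_ray(torch.rand(n_k, 1), torch.rand(n_k, 3), points)
```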
The invention has the following beneficial effects:
The invention provides a method for synthesizing realistic novel views of a performer in complex motion from sparse multi-view video. The invention uses a new implicit neural representation of the dynamic human body, which enables the method to effectively aggregate observations across video frames. The method achieves high-quality reconstruction under various viewpoints and different levels of hand occlusion.
Drawings
FIG. 1 is a schematic flow chart of acquiring a single-hand picture according to an embodiment of the present invention, wherein HSM represents a two-hand separation model, and HDRM represents a hand de-occlusion and removal module;
FIG. 2 is a schematic diagram of a reconstruction process of a human body single-hand model according to the present invention;
FIG. 3 is a comparative schematic of the error rate at InterHand2.6M for the method of the present invention and other methods;
Fig. 4 and 5 are qualitative comparisons of the method of the present invention and existing one-hand reconstruction methods.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and the embodiments.
To address the shortcomings of existing three-dimensional hand reconstruction methods, the invention provides a method for three-dimensional reconstruction of a hand model based on NeRF.
According to research on latent variable models, a latent variable model obtains the distribution of the observed variables by defining the joint distribution of the visible and latent variables and then marginalizing. Inspired by this, implicit 3D representations of the human hand on different video frames are generated from the same set of latent codes, which are anchored to the vertices of a deformable mesh. For each frame, the spatial positions of the codes are transformed according to the human hand pose, and a network regresses the density and color of any 3D position from the structured latent codes. An image from an arbitrary viewpoint can then be synthesized by volume rendering. At the same time, when the input views are highly sparse, the performance of NeRF drops sharply, because learning a neural representation from very sparse observations is ill-posed. The key to resolving this ill-posedness is therefore to aggregate all observations over different video frames. This idea is realized by regressing the 3D representation of each frame with the same network, taking different latent codes as inputs. FIG. 1 is a schematic flow chart of acquiring a single-hand picture according to an embodiment of the present invention, wherein HSM represents the two-hand separation model and HDRM represents the hand de-occlusion and removal module; FIG. 2 is a schematic diagram of the reconstruction process of the human single-hand model according to the present invention.
A three-dimensional hand modeling method based on a neural radiance field comprises the following steps:
Step (1), performing two-hand modal segmentation through a two-hand separation model.
First, the video is divided into frames {I_t^c | c = 1, ..., N_c; t = 1, ..., N_t}, where c is the camera index, N_c is the number of cameras, t is the frame index, and N_t is the number of frames. The two-hand video data shot by the cameras with marked serial numbers are processed, and the two-hand pictures corresponding to the video are obtained and stored with the camera serial numbers. The two-hand separation model adapts the existing semantic segmentation model SegFormer to the two-hand modal segmentation task. Specifically, the number of decoder outputs is increased from 1 to 4 to predict four segmentation masks, namely a right-hand full-region mask M_ra, a right-hand visible-region mask M_rv, a left-hand full-region mask M_la, and a left-hand visible-region mask M_lv, and left- and right-hand segmentation is performed on the input two-hand image through the four segmentation masks to obtain 4 segmented single-hand pictures. These segmentation masks contain spatial location information to roughly locate the left/right hand, as well as information about occlusion regions for de-occluding and removing interfering regions. The segmentation model is supervised with the binary cross-entropy loss L_BCE(x), and the final segmentation loss function is:
L_seg = L_BCE(M_ra, M̂_ra) + L_BCE(M_rv, M̂_rv) + L_BCE(M_la, M̂_la) + L_BCE(M_lv, M̂_lv)
where M_ra, M_rv, M_la and M_lv are the predicted segmentation masks, and M̂_ra, M̂_rv, M̂_la and M̂_lv are the corresponding ground-truth masks.
Step (2), obtaining single-hand pictures through the hand de-occlusion and removal module.
Taking the right hand as an example, first, according to the masks predicted in step (1), a picture is cropped centered on the right-hand full-region mask M_ra. The input of the hand de-occlusion and removal module is concatenated from the following four parts: 1. the picture with the occluded right-hand area blacked out; 2. the right-hand visible-part mask M_rv; 3. the picture with the redundant left-hand area blacked out; 4. the mask M_bv of the background area other than the left and right hands. For the right hand, the right-hand full-region mask M_ra is first used to locate the right hand, then the original image and the segmentation masks are cropped around the center of the right hand. The newly cropped image and masks are denoted as: the cropped image I', the right-hand center-cropped mask M'_ra, the right-hand visible-part center-cropped mask M'_rv, the left-hand center-cropped mask M'_la, and the left-hand visible-part center-cropped mask M'_lv. M_D denotes the area where the target hand is occluded by the other hand, and M_R denotes the area occupied by the distracting hand. Their calculation formulas are as follows:
M_D = M'_ra ⊙ (1 − M'_rv),  M_R = M'_lv ⊙ (1 − M'_ra)
I_D and I_R are the pictures obtained by erasing the cropped original image I' with the masks M_D and M_R, respectively. They guide the hand de-occlusion and removal module where to focus and how to use partial convolution (Partial Convolution) to fill in the two pictures. In addition, the right-hand visible-part mask M_rv and the mask M_bv of the background area other than the left and right hands direct the hand de-occlusion and removal module to distinguish the parts of the picture to be de-occluded from those to be removed. I_D, I_R and M_bv are calculated as follows:
I_D = I' ⊙ (1 − M_D),  I_R = I' ⊙ (1 − M_R),  M_bv = 1 − (M'_ra ∪ M'_la)
I_D, M_rv, I_R and M_bv are input to the hand de-occlusion and removal module, which uses these data to restore the appearance of the occluded part and remove the distracting hand to avoid ambiguity: the pixels of the occluded right-hand area and the pixels of the background behind the redundant left-hand area are predicted, and finally a restored single right-hand image is output. The output single-hand pictures are classified according to camera serial number, and the left-hand and right-hand pictures are stored separately.
Step (3), structuring the latent codes of the single-hand skinned model.
The single-hand images are preprocessed with the EasyMocap tools. For each single-hand image obtained in step (2), the hand pose is estimated, and hand pose joints are obtained using yolov and HRNet. The intrinsic and extrinsic camera parameters are obtained by setting the camera parameters and performing chessboard calibration when shooting the video. In addition, Self-Correction for Human Parsing is adopted to segment the hand region of the single-hand image, yielding a picture that reveals the contour position of the hand in the single-hand image, i.e. a hand contour position image. The corresponding parameters and vertex information are obtained from the camera intrinsics and extrinsics and the estimated hand pose. Next, the single-hand image from step (2) and the EasyMocap-preprocessed data (camera intrinsics and extrinsics, hand pose joints, and hand contour position image) are used to obtain a skinned linear model (SMPL) of the hand. In addition, latent codes are introduced to realize the conversion from the skinned model to the real model, i.e. the EasyMocap-preprocessed data of each single-hand image are stored with latent codes. To control the spatial locations of the latent codes, they are anchored to a deformable human hand skinned linear model (SMPL), a skinned vertex-based model defined as a function of shape parameters, pose parameters, and a rigid transformation relative to the SMPL coordinate system. The SMPL model outputs a posed 3D mesh with 6890 vertices. A set of latent codes Z = {z_1, z_2, ..., z_6890} is defined on the vertices of the initial human hand skinned linear model. For frame t, the SMPL parameters S_t are estimated from the provided multi-view video images {I_t^c}. The spatial positions of the latent codes are then transformed according to the human hand pose S_t to perform density and color regression. The latent codes represent the local geometry and appearance of the human hand via a neural network.
Step (4), diffusing the latent codes of the single-hand skinned model.
Each point in space is represented as a feature vector. NeRF assigns a density and a color to every point in 3D space, which requires querying the latent code at continuous 3D locations by trilinear interpolation. However, since the structured latent codes are relatively sparse in 3D space, directly interpolating them yields zero vectors over a large portion of space. To address this problem, the latent codes defined on the surface are diffused into the nearby 3D space. A sparse convolutional network (SparseConvNet) is selected to efficiently process the structured latent codes. Specifically, based on the parameters of the SMPL skinned model, a 3D bounding box of the human hand is computed and divided into small voxels of size 5 mm × 5 mm × 5 mm, referred to as latent code voxels. The latent code of a non-empty voxel is the average of the latent codes of the SMPL vertices inside the voxel. SparseConvNet processes the input volume with 3D sparse convolutions and outputs latent code volumes at 2×, 4×, 8× and 16× downsampled sizes. Through convolution and downsampling, the input codes are diffused into the nearby space. For any point in 3D space, latent codes are interpolated from the multi-scale code volumes at network layers 5, 9, 13 and 17 and concatenated into the final latent code. Since code diffusion should not be affected by the position and orientation of the human hand in the world coordinate system, code positions are converted into the SMPL coordinate system. For any point x in 3D space, its latent code is queried from the latent code voxels. Specifically, point x is first converted into the SMPL coordinate system, which aligns the points and the latent code volumes in 3D space. The latent code is then computed using trilinear interpolation. For SMPL skinned-model parameters S_t, the latent code at point x is denoted ψ(x, Z, S_t). The code vector ψ(x, Z, S_t) is passed into the density model and the color model to predict the density and color of point x.
Step (5), regressing the density and color of the hand of the single-hand skinned model.
Hand density and color regression is achieved through NeRF, implemented as MLP networks.
Density model: for frame t, the volume density at point x is predicted as a function of the latent code ψ(x, Z, S_t), defined as σ_t(x) = M_σ(ψ(x, Z, S_t)), where M_σ is a four-layer MLP network that follows the density model of NeRF.
Color model: the latent code ψ(x, Z, S_t) and the viewing direction d are taken as inputs for color regression. The color model also takes the point x as input in order to model position-dependent incident light. We observe that time-varying factors, such as secondary exposure and self-shadowing, affect the appearance of the hand. We therefore assign a latent embedding l_t to each video frame t to encode these time-varying factors. Specifically, for frame t, the color at spatial position x is predicted as a function of the latent code ψ(x, Z, S_t), the viewing direction d, the spatial position x, and the latent embedding l_t. The color model of the t-th frame is defined as:
c_t(x) = M_c(ψ(x, Z, S_t), γ_d(d), γ_x(x), l_t)
where M_c is a two-layer MLP network, and γ_d and γ_x are positional encoding functions of the viewing direction and the spatial position, respectively. The dimension of l_t is set to 128.
Step (6), performing hand volume rendering on the single-hand skinned model to obtain the two-hand model.
Taking the right-hand skinned model as an example, given a pixel, its camera ray r is first computed using the camera parameters, and N_k points {x_k} are sampled along the ray r between the near and far scene boundaries, which are estimated from the single-hand skinned model. Then, NeRF is used to predict the volume density and color at these points, and all predicted points are mapped back into 3D space to render a three-dimensional model, i.e. the generated volume rendering model. Next, the generated volume rendering model is optimized and refined based on a comparison with the hand images (ground truth) taken from the original video, specifically using the two conventional comparison metrics PSNR and SSIM.
For video frame t, the rendered color C̃_t(r) of the corresponding pixel is given by:
C̃_t(r) = Σ_{k=1}^{N_k} T_k · (1 − exp(−σ_t(x_k) · δ_k)) · c_t(x_k),  with  T_k = exp(−Σ_{j=1}^{k−1} σ_t(x_j) · δ_j)
where δ_k = ||x_{k+1} − x_k||_2 is the distance between adjacent sampling points. N_k is set to 64.
Finally, the parameters of the volume rendering models generated for the left and right hands are projected into the same dimensional space to obtain the two-hand volume rendering model.
TABLE 1
Table 1 above shows the experimental results of the model of the invention; lower MPJPE and MPVPE values indicate better results. Single-hand and two-hand predictions are evaluated separately; the upper part lists the results of other networks and the lower part the results of our network. It can be seen that our model improves the single-hand and average MPJPE scores by 0.16 and 0.10 respectively, and the corresponding MPVPE scores by 0.22 and 0.12 respectively.
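For context, the two evaluation metrics referenced by the table can be sketched as follows; the joint and vertex counts in the usage example (21 hand joints, 778 MANO vertices) are assumptions for illustration, not values from the patent.

```python
import torch

def mpjpe(pred_joints: torch.Tensor, gt_joints: torch.Tensor) -> torch.Tensor:
    """Mean Per-Joint Position Error: average Euclidean distance over joints.
    pred_joints, gt_joints: (J, 3) predicted and ground-truth 3D joints in the same units."""
    return (pred_joints - gt_joints).norm(dim=-1).mean()

def mpvpe(pred_verts: torch.Tensor, gt_verts: torch.Tensor) -> torch.Tensor:
    """Mean Per-Vertex Position Error: the same distance averaged over mesh vertices."""
    return (pred_verts - gt_verts).norm(dim=-1).mean()

# usage sketch with random tensors
err_j = mpjpe(torch.rand(21, 3), torch.rand(21, 3))    # 21 hand joints (assumed)
err_v = mpvpe(torch.rand(778, 3), torch.rand(778, 3))  # 778 mesh vertices (assumed)
```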
FIG. 3 is a comparison of the error rates of the method of the present invention and other methods at InterHand2.6M, with lower error values being better.
FIG. 4 is a qualitative comparison of the interacting-hand reconstruction of the present method with the most advanced existing single-hand reconstruction methods (Boukhayma et al. and Zhou et al.) on InterHand2.6M. FIG. 5 shows qualitative results of the prior methods and of our network for interacting-hand reconstruction on InterHand2.6M (row 1) and Haggling (row 3). Our method achieves high-quality reconstruction performance at various viewpoints and different levels of inter-hand occlusion.
The foregoing is a further detailed description of the invention in connection with specific/preferred embodiments, and it is not intended that the invention be limited to such description. It will be apparent to those skilled in the art that several alternatives or modifications can be made to the described embodiments without departing from the spirit of the invention, and these alternatives or modifications should be considered to be within the scope of the invention.
Aspects of the invention not described in detail herein are within the ordinary skill of those skilled in the art.
Claims (7)
1. A three-dimensional hand modeling method based on a neural radiance field, characterized by comprising the following steps:
Step (1), performing two-hand modal segmentation through a two-hand separation model;
Step (2), obtaining single-hand pictures through a hand de-occlusion and removal module;
Step (3), structuring the latent codes of the single-hand skinned model;
Step (4), diffusing the latent codes of the single-hand skinned model;
Step (5), regressing the density and color of the hand of the single-hand skinned model;
Step (6), performing hand volume rendering on the single-hand skinned model to obtain a two-hand model.
2. The three-dimensional hand modeling method based on a neural radiance field according to claim 1, wherein the specific method of step (1) is as follows:
First, the video is divided into frames {I_t^c | c = 1, ..., N_c; t = 1, ..., N_t}, where c is the camera index, N_c is the number of cameras, t is the frame index, and N_t is the number of frames; the two-hand video data shot by the cameras with marked serial numbers are processed, and the two-hand pictures of the corresponding video are obtained and stored with the camera serial numbers; the two-hand separation model adapts the existing semantic segmentation model SegFormer to the two-hand modal segmentation task; specifically, the number of decoder outputs is increased from 1 to 4 to predict four segmentation masks, namely a right-hand full-region mask M_ra, a right-hand visible-region mask M_rv, a left-hand full-region mask M_la, and a left-hand visible-region mask M_lv, and left- and right-hand segmentation is performed on the input two-hand image through the four segmentation masks to obtain 4 segmented single-hand pictures; these segmentation masks contain spatial location information to roughly locate the left/right hand, as well as information about occlusion regions for de-occluding and removing interfering regions; the segmentation model is supervised with the binary cross-entropy loss L_BCE(x), and the final segmentation loss function is:
L_seg = L_BCE(M_ra, M̂_ra) + L_BCE(M_rv, M̂_rv) + L_BCE(M_la, M̂_la) + L_BCE(M_lv, M̂_lv)
where M_ra, M_rv, M_la and M_lv are the predicted segmentation masks, and M̂_ra, M̂_rv, M̂_la and M̂_lv are the corresponding ground-truth masks.
3. The three-dimensional hand modeling method based on a neural radiance field according to claim 2, wherein the specific method of step (2) is as follows:
Taking the right hand as an example, first, according to the masks predicted in step (1), a picture is cropped centered on the right-hand full-region mask M_ra; the input of the hand de-occlusion and removal module is concatenated from the following four parts: 1. the picture with the occluded right-hand area blacked out; 2. the right-hand visible-part mask M_rv; 3. the picture with the redundant left-hand area blacked out; 4. the mask M_bv of the background area other than the left and right hands; for the right hand, the right-hand full-region mask M_ra is first used to locate the right hand, then the original image and the segmentation masks are cropped around the center of the right hand; the newly cropped image and masks are denoted as the cropped image I', the right-hand center-cropped mask M'_ra, the right-hand visible-part center-cropped mask M'_rv, the left-hand center-cropped mask M'_la, and the left-hand visible-part center-cropped mask M'_lv; M_D denotes the area where the target hand is occluded by the other hand, and M_R denotes the area occupied by the distracting hand; their calculation formulas are as follows:
M_D = M'_ra ⊙ (1 − M'_rv),  M_R = M'_lv ⊙ (1 − M'_ra)
I_D and I_R are the pictures obtained by erasing the cropped original image I' with the masks M_D and M_R, respectively; they guide the hand de-occlusion and removal module where to focus and how to use partial convolution (Partial Convolution) to fill in the two pictures; in addition, the right-hand visible-part mask M_rv and the mask M_bv of the background area other than the left and right hands direct the hand de-occlusion and removal module to distinguish the parts of the picture to be de-occluded from those to be removed; I_D, I_R and M_bv are calculated as follows:
I_D = I' ⊙ (1 − M_D),  I_R = I' ⊙ (1 − M_R),  M_bv = 1 − (M'_ra ∪ M'_la)
I_D, M_rv, I_R and M_bv are input to the hand de-occlusion and removal module, which uses these data to restore the appearance of the occluded part and remove the distracting hand to avoid ambiguity: the pixels of the occluded right-hand area and the pixels of the background behind the redundant left-hand area are predicted, and finally a restored single right-hand image is output; the output single-hand pictures are classified according to camera serial number, and the left-hand and right-hand pictures are stored separately.
4. The three-dimensional hand modeling method based on a neural radiance field according to claim 3, wherein the specific method of step (3) is as follows:
The single-hand images are preprocessed with the EasyMocap tools; for each single-hand image obtained in step (2), the hand pose is estimated, and hand pose joints are obtained using yolov and HRNet; the intrinsic and extrinsic camera parameters are obtained by setting the camera parameters and performing chessboard calibration when shooting the video; in addition, Self-Correction for Human Parsing is adopted to segment the hand region of the single-hand image, yielding a picture that reveals the contour position of the hand in the single-hand image, i.e. a hand contour position image; the corresponding parameters and vertex information are obtained from the camera intrinsics and extrinsics and the estimated hand pose; next, the single-hand image from step (2) and the EasyMocap-preprocessed data are used to obtain a skinned linear model SMPL of the hand; in addition, latent codes are introduced to realize the conversion from the skinned model to the real model, i.e. the EasyMocap-preprocessed data of each single-hand image are stored with latent codes; to control the spatial locations of the latent codes, they are anchored to a deformable human hand skinned linear model, where SMPL is a skinned vertex-based model defined as a function of shape parameters, pose parameters, and a rigid transformation relative to the SMPL coordinate system; the SMPL model outputs a posed 3D mesh with 6890 vertices; a set of latent codes Z = {z_1, z_2, ..., z_6890} is defined on the vertices of the SMPL model; for frame t, the SMPL parameters S_t are estimated from the provided multi-view video images {I_t^c}; then, the spatial positions of the latent codes are transformed according to the human hand pose S_t to perform density and color regression; the latent codes represent the local geometry and appearance of the human hand via a neural network.
5. The three-dimensional hand modeling method based on a neural radiance field according to claim 4, wherein the specific method of step (4) is as follows:
Each point in space is represented as a feature vector; NeRF assigns a density and a color to every point in 3D space, which requires querying the latent code at continuous 3D locations by trilinear interpolation; the latent codes defined on the surface are diffused into the nearby 3D space; a sparse convolutional network is selected to efficiently process the structured latent codes; specifically, based on the parameters of the SMPL skinned model, a 3D bounding box of the human hand is computed and divided into small voxels of size 5 mm × 5 mm × 5 mm, referred to as latent code voxels; the latent code of a non-empty voxel is the average of the latent codes of the SMPL vertices inside the voxel; SparseConvNet processes the input volume with 3D sparse convolutions and outputs latent code volumes at 2×, 4×, 8×, 16× downsampled sizes; through convolution and downsampling, the input codes are diffused into the nearby space; for any point in 3D space, latent codes are interpolated from the multi-scale code volumes at network layers 5, 9, 13, 17 and concatenated into the final latent code; since code diffusion should not be affected by the position and orientation of the human hand in the world coordinate system, code positions are converted into the SMPL coordinate system; for any point x in 3D space, its latent code is queried from the latent code voxels; specifically, point x is first converted into the SMPL coordinate system, which aligns the points and the latent code volumes in 3D space; then, the latent code is computed using trilinear interpolation; for SMPL skinned-model parameters S_t, the latent code at point x is denoted ψ(x, Z, S_t); the code vector ψ(x, Z, S_t) is passed into the density model and the color model to predict the density and color of point x.
6. The three-dimensional hand modeling method based on a neural radiance field according to claim 5, wherein the specific method of step (5) is as follows:
Hand density and color regression is achieved through NeRF, implemented as MLP networks;
Density model: for frame t, the volume density at point x is predicted as a function of the latent code ψ(x, Z, S_t), defined as σ_t(x) = M_σ(ψ(x, Z, S_t)), where M_σ is a four-layer MLP network that follows the density model of NeRF;
Color model: the latent code ψ(x, Z, S_t) and the viewing direction d are taken as inputs for color regression; to model position-dependent incident light, the color model also takes the point x as input; a latent embedding l_t is assigned to each video frame t to encode time-varying factors; specifically, for frame t, the color at spatial position x is predicted as a function of the latent code ψ(x, Z, S_t), the viewing direction d, the spatial position x and the latent embedding l_t; the color model of the t-th frame is defined as:
c_t(x) = M_c(ψ(x, Z, S_t), γ_d(d), γ_x(x), l_t)
where M_c is a two-layer MLP network, and γ_d and γ_x are positional encoding functions of the viewing direction and the spatial position, respectively; the dimension of l_t is set to 128.
7. The three-dimensional hand modeling method based on a neural radiance field according to claim 6, wherein the specific method of step (6) is as follows:
Taking the right-hand skinned model as an example, given a pixel, its camera ray r is first computed using the camera parameters, and N_k points {x_k} are sampled along the ray r between the near and far scene boundaries, which are estimated from the single-hand skinned model; then, NeRF is used to predict the volume density and color at these points, and all predicted points are mapped back into 3D space to render a three-dimensional model, i.e. the generated volume rendering model; next, the generated volume rendering model is compared with the hand pictures taken from the original video, and is optimized and refined specifically using the two comparison metrics PSNR and SSIM;
for video frame t, the rendered color C̃_t(r) of the corresponding pixel is given by:
C̃_t(r) = Σ_{k=1}^{N_k} T_k · (1 − exp(−σ_t(x_k) · δ_k)) · c_t(x_k),  with  T_k = exp(−Σ_{j=1}^{k−1} σ_t(x_j) · δ_j)
where δ_k = ||x_{k+1} − x_k||_2 is the distance between adjacent sampling points; N_k is set to 64;
finally, the parameters of the volume rendering models generated for the left and right hands are projected into the same dimensional space to obtain the two-hand volume rendering model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311730771.1A CN117911609A (en) | 2023-12-15 | 2023-12-15 | Three-dimensional hand modeling method based on neural radiance field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311730771.1A CN117911609A (en) | 2023-12-15 | 2023-12-15 | Three-dimensional hand modeling method based on neural radiance field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117911609A true CN117911609A (en) | 2024-04-19 |
Family
ID=90693050
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311730771.1A Pending CN117911609A (en) | 2023-12-15 | 2023-12-15 | Three-dimensional hand modeling method based on neural radiance field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117911609A (en) |
-
2023
- 2023-12-15 CN CN202311730771.1A patent/CN117911609A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Grassal et al. | Neural head avatars from monocular rgb videos | |
Yuan et al. | Star: Self-supervised tracking and reconstruction of rigid objects in motion with neural rendering | |
CN115082639B (en) | Image generation method, device, electronic equipment and storage medium | |
Alatan et al. | Scene representation technologies for 3DTV—A survey | |
CN111986307A (en) | 3D object reconstruction using photometric grid representation | |
CN113628348B (en) | Method and equipment for determining viewpoint path in three-dimensional scene | |
US9747668B2 (en) | Reconstruction of articulated objects from a moving camera | |
Lin et al. | Deep multi depth panoramas for view synthesis | |
CN114863038B (en) | Real-time dynamic free visual angle synthesis method and device based on explicit geometric deformation | |
US11704853B2 (en) | Techniques for feature-based neural rendering | |
CN116310076A (en) | Three-dimensional reconstruction method, device, equipment and storage medium based on nerve radiation field | |
US11961266B2 (en) | Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture | |
CN116134491A (en) | Multi-view neuro-human prediction using implicit differentiable renderers for facial expression, body posture morphology, and clothing performance capture | |
CN116681838A (en) | Monocular video dynamic human body three-dimensional reconstruction method based on gesture optimization | |
CN115951784A (en) | Dressing human body motion capture and generation method based on double nerve radiation fields | |
Choi et al. | Balanced spherical grid for egocentric view synthesis | |
CN117218246A (en) | Training method and device for image generation model, electronic equipment and storage medium | |
JP2023527438A (en) | Geometry Recognition Augmented Reality Effect Using Real-time Depth Map | |
Ehret et al. | Regularization of NeRFs using differential geometry | |
CN116228986A (en) | Indoor scene illumination estimation method based on local-global completion strategy | |
CN109816765A (en) | Texture towards dynamic scene determines method, apparatus, equipment and medium in real time | |
CN117911609A (en) | Three-dimensional hand modeling method based on nerve radiation field | |
CN116883524A (en) | Image generation model training, image generation method and device and computer equipment | |
Jian et al. | Realistic face animation generation from videos | |
CN118314271B (en) | 3D Gaussian rasterization-based rapid high-precision dense reconstruction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |