CN116758202A - Human hand image synthesis method, device, electronic equipment and storage medium - Google Patents

Human hand image synthesis method, device, electronic equipment and storage medium

Info

Publication number
CN116758202A
CN116758202A (application CN202310321253.8A)
Authority
CN
China
Prior art keywords
human hand
vertex
image
alignment
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310321253.8A
Other languages
Chinese (zh)
Inventor
陈庆
石武
乔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Publication of CN116758202A
Legal status: Pending


Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/0464: Convolutional networks [CNN, ConvNet]
              • G06N 3/08: Learning methods
                • G06N 3/0895: Weakly supervised learning, e.g. semi-supervised or self-supervised learning
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 5/00: Image enhancement or restoration
            • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
          • G06T 7/00: Image analysis
            • G06T 7/60: Analysis of geometric attributes
              • G06T 7/62: Analysis of geometric attributes of area, perimeter, diameter or volume
          • G06T 15/00: 3D [Three Dimensional] image rendering
            • G06T 15/04: Texture mapping
            • G06T 15/08: Volume rendering
          • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
            • G06T 17/20: Finite element generation, e.g. wire-frame surface description, tessellation
          • G06T 2207/00: Indexing scheme for image analysis or image enhancement
            • G06T 2207/20: Special algorithmic details
              • G06T 2207/20081: Training; Learning
              • G06T 2207/20084: Artificial neural networks [ANN]
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00: Arrangements for image or video recognition or understanding
            • G06V 10/70: Arrangements using pattern recognition or machine learning
              • G06V 10/82: Arrangements using neural networks
          • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
              • G06V 40/107: Static hand or arm
              • G06V 40/11: Hand-related biometrics; hand pose recognition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
          • Y02T 10/00: Road transport of goods or passengers
            • Y02T 10/10: Internal combustion engine [ICE] based vehicles
              • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Graphics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a human hand image synthesis method, apparatus, electronic device and storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring a human hand image sequence, wherein the sequence comprises multiple frames of hand images and each frame is captured from a different viewing angle of the hand; calculating pose parameters and shape parameters for each frame of hand image, and estimating a three-dimensional hand mesh through a parameterized mesh model according to the pose parameters and shape parameters; performing vertex alignment and patch alignment based on each vertex of the three-dimensional hand mesh and the patch corresponding to each vertex, to obtain a volumetric representation distribution of the hand in each hand image; and performing image rendering according to the volumetric representation distribution of the hand in each hand image to obtain a synthesized hand image. The human hand image synthesis method provided by the application can be driven to generate new poses, and the rendered synthesized hand image is highly faithful, sharp and realistic.

Description

Human hand image synthesis method, device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a human hand image synthesis method and apparatus, an electronic device, and a storage medium.
Background
Synthesizing realistic images of drivable avatars is a core task in emerging computer vision and graphics applications. Since the hands are the body's principal means of interaction, the fine reconstruction and driving of realistic hand models is of great value.
The characteristics of the human hand differ greatly from those of other body parts (face, head and body): the hand is highly articulated and exhibits severe self-occlusion and self-contact. Despite the great success of three-dimensional reconstruction for other body parts (e.g., face, body), research on hand image synthesis remains limited. Although some attempts at hand image synthesis exist, the related art either cannot drive the hand (cannot animate new poses) or produces blurred pixels in hand regions, so the rendered synthesized hand images are not sufficiently realistic.
As can be seen from the above, the prior art hand image synthesis methods cannot be driven to generate new poses, and realistic hand images are difficult to render.
Disclosure of Invention
In view of the defects of the related art, the application provides a human hand image synthesis method, apparatus, electronic device and storage medium, aiming to solve the problems that hand image synthesis methods in the related art cannot be driven to generate new poses and have difficulty rendering vivid hand images.
The technical scheme is as follows:
according to one aspect of the application, a method of human hand image synthesis, the method comprising: acquiring a human hand image sequence, wherein the human hand image sequence comprises a plurality of frames of human hand images, and each frame of human hand image is obtained by shooting different visual angles of human hands; calculating the posture parameters and the shape parameters of the hand image of each frame, and estimating the three-dimensional grid of the hand through a parameterized grid model according to the posture parameters and the shape parameters; based on each vertex and the corresponding surface patch of each vertex in the three-dimensional grid of the human hand, performing vertex alignment and surface patch alignment to obtain volume expression distribution of the human hand in each human hand image; and performing image rendering according to the volume expression distribution of the human hand in each human hand image to obtain a synthetic image of the human hand.
According to one aspect of the present application, a human hand image synthesis apparatus comprises: an image sequence acquisition module, configured to acquire a human hand image sequence, wherein the sequence comprises multiple frames of hand images, each frame captured from a different viewing angle of the hand; a parameter calculation module, configured to calculate pose parameters and shape parameters for each frame of hand image and to estimate a three-dimensional hand mesh through a parameterized mesh model according to the pose parameters and shape parameters; an alignment module, configured to perform vertex alignment and patch alignment based on each vertex of the three-dimensional hand mesh and the patch corresponding to each vertex, to obtain a volumetric representation distribution of the hand in each hand image; and a rendering module, configured to perform image rendering according to the volumetric representation distribution of the hand in each hand image to obtain a synthesized hand image.
In an exemplary embodiment, the alignment module includes: a vertex alignment unit, configured to align all vertices of the three-dimensional hand mesh using a graph convolutional network, to obtain vertex alignment features for each vertex; a primitive representation estimation unit, configured to estimate, for the patch corresponding to each vertex, the volumetric primitive representation of the patch according to the vertex alignment features of the patch's vertices, to obtain patch alignment features for each patch; and a transformation unit, configured to transform the volumetric primitive representation of each patch onto the surface of the corresponding mesh patch according to the patch alignment features, to obtain the volumetric representation distribution of the hand in each hand image.
In an exemplary embodiment, the vertex alignment unit includes: a coordinate localization subunit, configured to localize the vertices of the three-dimensional mesh to obtain localized vertices of the three-dimensional mesh; a hidden variable calculation subunit, configured to calculate a learnable hidden variable for each vertex of the three-dimensional mesh through an embedding layer; a rotation angle calculation subunit, configured to calculate, based on the pose parameters of each frame of hand image, the rotation angle of the joint to which each vertex belongs relative to its parent joint; a first stitching subunit, configured to stitch together the localized vertices of the three-dimensional mesh, the learnable hidden variable of each vertex, the rotation angle of each vertex's joint relative to its parent, and the pose parameters, to obtain an input feature vector for each vertex; and a vertex alignment feature acquisition subunit, configured to input the input feature vector of each vertex into the graph convolutional network to obtain the vertex alignment features of each vertex.
In an exemplary embodiment, the vertex alignment feature acquisition subunit includes: an identity coding subunit, configured to encode the identity features of the hand according to the identity to which the hand belongs, to obtain an identity feature code of the hand; and a second stitching subunit, configured to embed the identity feature code of the hand into a middle layer of the graph convolutional network while the input feature vectors of the vertices are being vertex-aligned by the graph convolutional network, to obtain the vertex alignment features of each vertex.
In an exemplary embodiment, the primitive representation estimation unit includes: a feature estimation subunit, configured to input the vertex alignment features of a patch's vertices into a color-branch multi-layer perceptron, a density-branch multi-layer perceptron and a motion-branch multi-layer perceptron, respectively, to estimate the color, density and motion features of the patch; and a feature fusion subunit, configured to fuse the estimated color, density and motion features to obtain the patch alignment features.
In an exemplary embodiment, the rendering module includes: a pixel value calculation unit, configured to calculate, according to the volumetric representation distribution of the hand in each hand image, the pixel value at each pixel position of the synthesized image under the rendering viewpoint through a differentiable neural rendering equation, and to generate the synthesized hand image.
In an exemplary embodiment, human hand image synthesis is achieved by a synthetic image model, the synthetic image model being a trained neural network model. The training process of the synthetic image model comprises: acquiring a training image; calculating a loss value of a set loss function based on the training image, wherein the set loss function includes at least one of: an image-reconstruction self-supervised loss function for optimizing graph convolutional network parameters, a geometric reconstruction loss function for optimizing feature fusion parameters, and a hidden-variable regularization constraint loss function for optimizing identity latent-coding parameters; if the loss value satisfies the convergence condition, the trained neural network model is taken as the synthetic image model; otherwise, the parameters of the neural network model are optimized, the parameters including at least the graph convolutional network parameters, the feature fusion parameters and the identity latent-coding parameters.
According to one aspect of the application, an electronic device comprises at least one processor and at least one memory, wherein the memory has program instructions or code stored thereon; the program instructions or code are loaded and executed by the processor to cause the electronic device to implement the method of human hand image synthesis as described above.
According to one aspect of the present application, a storage medium has stored thereon program instructions or code that are loaded and executed by a processor to implement a human hand image synthesis method as described above.
According to one aspect of the present application, an application program product includes program instructions or code stored in a storage medium, and a processor of an electronic device reads the program instructions or code from the storage medium, loads and executes the program instructions or code, so that the electronic device implements the human hand image synthesizing method as described above.
The application has the following beneficial effects:
According to the above technical scheme, the corresponding pose parameters and shape parameters are calculated from the acquired hand image sequence, a parameterized mesh model is used to generate a three-dimensional hand mesh based on those parameters, a volumetric representation distribution is estimated from the three-dimensional mesh, and the volumetric representation distribution is then used to render a synthesized hand image. That is, combining volumetric primitives with a parameterized mesh model (e.g., MANO) can represent the high-frequency texture characteristics of the human hand and thus achieve high-quality rendering and reconstruction; the parameterized mesh model makes it possible to drive the generation of new poses, while the volumetric primitives enable faithful rendering. The human hand image synthesis method provided by the application can therefore be driven to generate new poses, and the rendered synthesized hand image is highly faithful, sharp and realistic.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of an implementation environment of a human hand image synthesis method provided by an embodiment of the present application;
FIG. 2 is a flow chart of a method of human hand image synthesis provided in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram of predicting pose parameters, shape parameters from images of a human hand from multiple perspectives in accordance with an exemplary embodiment;
FIG. 4 is a flow chart of step 220 in one embodiment of the corresponding embodiment of FIG. 2;
FIG. 5 is a flow chart of step 400 in one embodiment of the corresponding embodiment of FIG. 4;
FIG. 6 is a flow chart of step 580 in one embodiment in the corresponding embodiment of FIG. 5;
FIG. 7 is a flowchart of a training process for synthesizing an image model in accordance with an exemplary embodiment;
FIG. 8 is a schematic diagram of a specific implementation of a method for synthesizing a human hand image in an application scenario;
FIG. 9 is a schematic representation of a synthesized hand image generated using the method of the present application, together with a control-group image and a ground-truth image;
FIG. 10 is a block diagram of a human hand image synthesizing device according to an exemplary embodiment;
FIG. 11 is a hardware block diagram of a server shown in accordance with an exemplary embodiment;
fig. 12 is a block diagram illustrating a configuration of an electronic device according to an exemplary embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. In addition, the technical features of the embodiments of the present application described below may be combined with each other as long as they do not collide with each other.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Before explaining the various embodiments of the present application, a description is given first of several concepts to which the present application relates.
Mesh: a polygonal mesh consisting of multiple geometric vertices; a data structure used in computer graphics for modeling various irregular objects, comprising patches and vertices. A mesh can approximate the surface of a complex object but cannot describe its interior. Among the patch types of a polygonal mesh, the triangular patch is the smallest unit of subdivision; it is simple and flexible to represent and convenient for topological description, so it is widely used, and "mesh" often refers to a triangular-patch mesh.
Graph convolutional network (GCN): a network that passes information between adjacent vertices of a patch mesh; after a graph convolution operation, each geometric vertex carries information from its neighboring vertices.
Parameterized mesh model: a model whose reconstruction can be described by a set of low-dimensional vectors (e.g., shape parameters, pose parameters). Parameterized mesh models are used for parameterized reconstruction, such as three-dimensional human body reconstruction, three-dimensional hand reconstruction, and so forth.
Volumetric primitive: analogous to the pixel, the smallest unit of an image in two-dimensional space, the volumetric primitive (cf. voxel, or volume pixel) is the smallest unit of three-dimensional space and can be understood as a small cube.
Volumetric representation distribution: also called a voxel grid; a data structure that represents a three-dimensional object using fixed-size cubes (volumetric primitives) as its smallest unit.
As described above, the prior art has the problems that hand image synthesis methods cannot be driven to generate new poses and have difficulty rendering realistic hand images.
Unlike other parts of the human body (e.g., the face, limbs, torso), the hands are highly articulated, and complex hand movements make neural rendering difficult. First, the deformation of hand geometry is hard to model: when dealing with large and complex hand deformations (e.g., self-contact), previous skinning-based methods struggle to find accurate skinning weights, while part-aware methods often suffer from cross-part inconsistencies. Second, due to the highly articulated structure, hand texture is hard to model: articulated hand motion causes severe self-occlusion, so different hand poses produce significant changes in illumination and shadow patterns. Illumination is important for realistic rendering, but illumination changes caused by articulated self-occlusion have not yet been studied or estimated.
At present, the classical approach to modeling an animated avatar uses a mesh-based model; for example, the hand is represented as a rigged mesh and animated through skinning. However, meshes focus on shape representation and, because mesh resolution is limited and high-definition texture maps are hard to obtain, they are ill-suited to realistic rendering. Taking the parametric hand mesh model MANO, commonly used to represent different hand poses and shape variations, as an example: MANO can only represent 778 vertices and 1538 patches, so its expressive capacity is extremely limited.
Second, meshes suffer from a discontinuous and fixed topology. To address this, recent studies have tended to explore implicit human representations, which offer flexibility and continuity. However, compared with explicit meshes, implicit geometry performs poorly under free pose driving, so the articulated driving of implicit human geometry has been widely studied. Some methods use linear blend skinning with inverse skinning weights to map pose-space queries back to canonical space, but the inverse-skinning paradigm cannot handle self-contact: a query may match multiple canonical-space points, easily causing ambiguity. Other work designs forward skinning deformation that maps canonical-space points into pose space using iterative root-finding, but the iterative optimization may compromise end-to-end network training. In general, the rigid transformations of all the bones form a large motion space, making it difficult to optimize accurate skinning weights for an arbitrary 3D query point. This poses a significant obstacle to accurate hand pose driving.
As can be seen from the above, the synthesized hand images rendered by the prior art lack texture detail, are insufficiently sharp and vivid, are of low image quality, and cannot be driven to generate new poses.
The human hand image synthesis method provided by the application, in contrast, can be driven to generate new poses and can render highly faithful, sharp and realistic synthesized hand images. Accordingly, the method is suited to a human hand image synthesis apparatus that can be deployed on an electronic device; for example, the electronic device may be a computer device with a von Neumann architecture, including but not limited to a desktop computer, a notebook computer, a server, and the like.
Referring to fig. 1, a schematic diagram of an implementation environment related to a human hand image synthesis method is shown. It should be noted that this implementation environment is only one example adapted to the present application and should not be considered as providing any limitation to the scope of use of the present application.
As shown in fig. 1, the implementation environment includes an acquisition end 110 and a server 130.
Specifically, the acquisition end 110 may be regarded as an image capture device, including but not limited to a video camera or another electronic device with a photographing function. For example, the acquisition end 110 is a smartphone with a shooting function.
The server 130 may be a desktop computer, a notebook computer, a server, or other electronic devices, or may be a computer cluster formed by multiple servers, or even a cloud computing center formed by multiple servers. The server 130 is configured to provide a background service, for example, the background service includes, but is not limited to, human hand image synthesis, and the like.
The server 130 and the acquisition end 110 are pre-connected by wired or wireless network communication, and data transmission between the server 130 and the acquisition end 110 is realized through the network communication. The data transmitted includes, but is not limited to: a multi-view human hand image, a human hand synthetic image rendered by volume expression distribution, and the like.
In an application scenario, through interaction between the acquisition end 110 and the server 130, the acquisition end 110 captures multi-view hand images of hands of different identities and uploads them to the server 130, requesting the server 130 to perform human hand image synthesis.
For the server 130, after receiving the multi-view hand images uploaded by the acquisition end 110, hand image synthesis is performed to obtain a volumetric representation distribution (a three-dimensional hand model), and the volumetric representation distribution is rendered into a synthesized hand image. Because the volumetric representation distribution is based on a parameterized mesh model, it can be driven to generate new poses and rendered from any viewpoint; a synthesized hand image with any pose and/or any viewpoint can thus be rendered with high fidelity, solving the problems that hand image synthesis methods in the related art cannot be driven to generate new poses and that the generated hand images are not lifelike.
Referring to fig. 2, an embodiment of the present application provides a method for synthesizing a human hand image, which is suitable for an electronic device, and the electronic device may be the server 130 in the implementation environment shown in fig. 1.
In the following method embodiments, for convenience of description, the execution subject of each step is described as the electronic device, but this does not constitute a specific limitation.
As shown in fig. 2, the method may include the steps of:
step 200, a sequence of images of a human hand is acquired.
The human hand image sequence comprises multiple frames of hand images, each frame captured from a different viewing angle of the hand.
The viewing angle refers to the camera shooting angle. Optionally, the hand image sequence contains 4 to 10 hand images from different viewing angles, and the overlap between camera views should not exceed 50%.
The camera parameters include intrinsic and extrinsic parameters. The intrinsic parameters relate to the characteristics of the camera itself, such as its focal length and pixel size; the extrinsic parameters are parameters in the world coordinate system, such as the camera's position and rotation.
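For illustration only (the numeric values below are assumptions, not taken from the patent), the following sketch shows how the intrinsic and extrinsic parameters relate a 3D hand point to a pixel:

```python
import numpy as np

K = np.array([[1200.0,    0.0, 256.0],
              [   0.0, 1200.0, 256.0],
              [   0.0,    0.0,   1.0]])   # intrinsics: focal length, principal point
R = np.eye(3)                             # extrinsics: rotation (world to camera)
t = np.array([0.0, 0.0, 0.5])             # extrinsics: translation

x_world = np.array([0.01, -0.02, 0.10])   # a 3D point on the hand surface
x_cam = R @ x_world + t                   # transform into the camera frame
u, v = (K @ x_cam)[:2] / (K @ x_cam)[2]   # project and divide by depth
```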
Step 210, pose parameters and shape parameters are calculated for each frame of hand image, and the three-dimensional hand mesh is estimated by the parameterized mesh model according to the pose parameters and shape parameters.
The pose parameters and shape parameters control the variation of the hand's shape and pose and are the basis for generating the three-dimensional mesh.
Parameterized mesh models include, but are not limited to, the human body model SMPL, the hand triangular-patch mesh model MANO, SCAPE, and the like. Taking MANO as an example: MANO is a parameterized mesh model for hand reconstruction, which can be understood as a base model plus deformations applied on top of it. PCA (Principal Component Analysis) is performed on the deformations to obtain low-dimensional shape parameters describing the hand's shape. Meanwhile, a kinematic tree represents the hand's pose: the rotation of each joint relative to its parent in the kinematic tree can be expressed as a three-dimensional vector, and the local rotation vectors of all the joints form the pose parameters of the MANO model. The pose parameters comprise 48 values representing the rotation angles of 16 joints, and the shape parameters comprise 10 values representing, for example, the length and thickness of the fingers. In general, in the MANO model, the variation of the hand's shape and pose can be controlled through a suitable combination of these parameters.
In one possible implementation, each frame of hand image is fed into the parameterized mesh model, the pose parameters and shape parameters are obtained by prediction, and the three-dimensional hand mesh is then generated from those parameters. Taking MANO as an example, the parameters required by the MANO parameterized mesh model, including but not limited to the 10 shape parameters and 48 pose parameters, can be predicted from each frame of hand image, and a MANO three-dimensional hand mesh can then be generated from these parameters.
Fig. 3 is a schematic diagram of predicting the pose parameters and shape parameters from multi-view hand images.
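As an illustration of this step, the following is a minimal sketch assuming the open-source manopth implementation of the MANO layer; the package, model path and tensor shapes are assumptions rather than part of the patent:

```python
import torch
from manopth.manolayer import ManoLayer  # open-source manopth package (assumed)

# use_pca=False exposes the full 45 joint-rotation parameters, which together
# with the 3 global-rotation parameters give the 48 pose parameters.
mano = ManoLayer(mano_root='mano/models', use_pca=False, flat_hand_mean=True)

pose = torch.zeros(1, 48)     # 48 pose parameters: rotations of 16 joints
shape = torch.zeros(1, 10)    # 10 shape parameters: finger length/thickness etc.

verts, joints = mano(pose, shape)   # 778 mesh vertices and the hand joints
print(verts.shape)                  # torch.Size([1, 778, 3])
```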
Step 220, vertex alignment and patch alignment are performed based on each vertex of the three-dimensional hand mesh and the patch corresponding to each vertex, to obtain the volumetric representation distribution of the hand in each hand image.
In one possible implementation, as shown in fig. 4, step 220 includes the steps of:
and 400, aligning each vertex in the three-dimensional network of the human hand by using the graph rolling network to obtain vertex alignment characteristics of each vertex.
In one possible implementation, the vertex alignment features are obtained by stitching together the features related to each vertex of the three-dimensional mesh according to the vertex sequence numbers. The features related to each vertex include, but are not limited to: the localized vertices of the three-dimensional mesh, the learnable hidden variable of each vertex, the rotation angle of the joint to which each vertex belongs relative to its parent joint, and the like.
Step 420, for the patch corresponding to each vertex, the volumetric primitive representation of the patch is estimated according to the vertex alignment features of the patch's vertices, obtaining patch alignment features for each patch.
The patch alignment features of a patch characterize the patch's color, density and motion features.
In one possible implementation, the vertex alignment features of each patch's vertices are input into a feature fusion branch network to obtain the patch alignment features of each patch. The feature fusion branch network comprises several branch multi-layer perceptrons. Specifically, the vertex alignment features of a patch's vertices are input into a color-branch multi-layer perceptron, a density-branch multi-layer perceptron and a motion-branch multi-layer perceptron, respectively, to estimate the patch's color, density and motion features; the estimated color, density and motion features are then fused to obtain the patch alignment features.
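A minimal sketch of such a three-branch estimation follows; all layer sizes, channel counts and the motion-code dimension are assumptions for illustration, not values fixed by the patent:

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

feat_dim = 3 * 64                              # 3 vertices x 64-dim alignment feature
color_branch = mlp(feat_dim, 4 * 4 * 4 * 3)    # RGB cells of a 4x4x4 primitive
density_branch = mlp(feat_dim, 4 * 4 * 4 * 1)  # density cells of a 4x4x4 primitive
motion_branch = mlp(feat_dim, 9)               # e.g. a small per-patch motion code

def patch_align(vertex_feats):             # vertex_feats: [num_patches, feat_dim]
    color = color_branch(vertex_feats)
    density = density_branch(vertex_feats)
    motion = motion_branch(vertex_feats)
    return torch.cat([color, density, motion], dim=-1)  # fused patch feature
```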
Step 440, the volumetric primitive representation of each patch is transformed onto the surface of the corresponding mesh patch according to the patch alignment features, to obtain the volumetric representation distribution of the hand in each hand image.
As described above, the patch alignment features characterize a patch's color, density, motion and so on; on this basis, the volumetric primitive representation of the patch is obtained by deforming these features into a three-dimensional volumetric representation according to the set primitive volume size.
It should be noted that the primitive volume should not be too large, as this would slow model training convergence and rendering; for example, the primitive volume may be 2×2×2 or 8×8×8, with 4×4×4 giving the best balance between convergence speed and model accuracy.
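The following sketch illustrates deforming a fused per-patch feature into primitives of the set volume size; the channel layout (3 color channels plus 1 density channel per cell) is an assumption:

```python
import torch

num_patches, s = 1538, 4                      # MANO has 1538 patches; 4x4x4 primitives
fused = torch.randn(num_patches, 4 * s ** 3)  # assumed: 3 color + 1 density per cell
primitives = fused.view(num_patches, 4, s, s, s)
color_volume = primitives[:, :3]              # V_col: one RGB cube per patch
density_volume = primitives[:, 3:]            # V_alpha: one density cube per patch
```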
In one possible implementation, the volumetric representation distribution of the hand in a hand image is obtained by transforming the volumetric primitive representation of each patch onto the surface of the corresponding mesh patch through a transformation matrix.
Optionally, the transformation matrix may be a TBN (tangent, bitangent, normal) matrix. The transformation matrix is calculated as follows: according to the UV unwrapping of the three-dimensional hand mesh, the transformation matrix at the center point of each patch is computed using the patch-aligned sampling strategy.
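For reference, a sketch of the standard TBN construction from a triangle's vertex positions and UV coordinates follows; the patent does not spell out this computation, so this is the conventional form:

```python
import numpy as np

def tbn_matrix(p0, p1, p2, uv0, uv1, uv2):
    """Standard TBN frame of a triangle from its 3D vertices and UVs."""
    e1, e2 = p1 - p0, p2 - p0                  # 3D edge vectors
    d1, d2 = uv1 - uv0, uv2 - uv0              # UV edge vectors
    r = 1.0 / (d1[0] * d2[1] - d1[1] * d2[0])  # inverse UV-area term
    tangent = (e1 * d2[1] - e2 * d1[1]) * r
    bitangent = (e2 * d1[0] - e1 * d2[0]) * r
    normal = np.cross(e1, e2)
    cols = [v / np.linalg.norm(v) for v in (tangent, bitangent, normal)]
    return np.stack(cols, axis=1)              # maps primitive-local axes into 3D
```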
Here, the inventors recognized that the UV unwrapping (2D) of the three-dimensional hand mesh has unevenly distributed patches and discontinuous semantics: patches that are close in Euclidean distance on the UV map are not necessarily semantically consistent, and a large part of the UV map consists of invalid regions, which causes ambiguity problems for a 2D decoder. This embodiment therefore adopts a patch-aligned sampling strategy to align the decoded volumetric primitives with the patches of the three-dimensional mesh, avoiding the inherent semantic-ambiguity problem introduced by the UV map.
In essence, the sensitivity of the graph convolutional neural network to geometric-topological features is exploited to transfer the UV-aligned (2D) decoding problem to a 3D geometry-aligned decoding problem, so that 3D geometric information is fused in as a prior on the network's representation. Directly learning a UV-aligned volumetric primitive representation of the hand with a 2D codec would not yield reasonable results when driven by new poses; this stems from the limited ability of 2D convolutional neural networks to model semantically discontinuous UV unwrappings, so conventional training with a conventional 2D convolutional model cannot achieve the expected effect.
Step 230, image rendering is performed according to the volumetric representation distribution of the hand in each hand image, to obtain a synthesized hand image.
It should be understood that the volumetric representation distribution of the hand in each hand image constitutes a three-dimensional model of the hand; image rendering based on the volumetric representation distribution therefore essentially renders this three-dimensional model into the two-dimensional image space, yielding the synthesized hand image.
In one possible implementation, step 230 may include the following step: according to the volumetric representation distribution of the hand in each hand image, the pixel value at each pixel position of the synthesized image under the rendering viewpoint is calculated through a differentiable neural rendering equation, generating the synthesized hand image.
Specifically, for each pixel position, the density volume distribution and the color volume distribution along the camera ray are accumulated through an integral equation to obtain the pixel value. The camera ray is computed from the camera parameters as

$r_p(t) = o_p + t\,d_p$,

where $o_p$ is the ray origin and $d_p$ is the ray direction.

For each pixel of the image, the differentiable volume renderer accumulates the volume density and color along the camera ray through an integral equation, yielding the pixel color. In one possible implementation, the pixel value is approximated by numerical integration; for example, the approximate value at pixel position $p$ is calculated by the formula

$I(p) \approx \sum_{k=1}^{K} V_{col}(x_k)\,\alpha_k \prod_{j<k}\left(1-\alpha_j\right)$, with $\alpha_k = 1 - \exp\left(-V_\alpha(x_k)\,\delta_k\right)$,

where $x_k = r_p(t_k)$ are the sample points along the ray, $\delta_k$ is the spacing between adjacent samples, and $V_{col}$ and $V_\alpha$ are the color volume distribution and the density volume distribution, respectively.
By traversing all pixel positions, a synthesized image of the hand with the specified shape and pose is rendered.
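A minimal sketch of this ray-marching accumulation follows; sample_volume is a hypothetical lookup into the volumetric representation distribution (e.g. trilinear interpolation of the primitives), and the sampling scheme is an assumption:

```python
import torch

def render_pixel(o_p, d_p, sample_volume, t_near=0.0, t_far=1.0, steps=64):
    """Numerically integrate density/color along the ray r_p(t) = o_p + t*d_p."""
    delta = (t_far - t_near) / steps
    color, transmittance = torch.zeros(3), 1.0
    for t in torch.linspace(t_near, t_far, steps):
        x = o_p + t * d_p                        # sample point on the ray
        sigma, c = sample_volume(x)              # density V_alpha, color V_col
        alpha = 1.0 - torch.exp(-sigma * delta)  # opacity of this segment
        color = color + transmittance * alpha * c
        transmittance = transmittance * (1.0 - alpha)
    return color
```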
Through the above process, combining volumetric primitives with a parameterized mesh model (e.g., MANO) can represent the high-frequency texture characteristics of the human hand, and high-quality rendering and reconstruction can therefore be achieved; the human hand image synthesis method can be driven to generate new poses, and the rendered synthesized hand image is highly faithful, sharp and realistic.
Referring to fig. 5, in an exemplary embodiment, step 400 may include the steps of:
step 500, the vertexes of the three-dimensional grid are localized, and localized vertexes of the three-dimensional grid are obtained.
Each vertex of the three-dimensional mesh is uniquely represented by a coordinate, and correspondingly each localized vertex is also uniquely represented by a coordinate; vertex localization is therefore, in essence, the localization of vertex coordinates.
Taking the hand triangular-patch mesh MANO as the three-dimensional hand mesh: for each frame, the vertices of the hand mesh are transformed back to the canonical-pose MANO template through an inverse linear blend skinning operation, yielding normalized global vertex coordinates; the coordinate of the joint to which each vertex belongs is then subtracted from that vertex's global coordinate, giving localized vertex coordinates expressed in the coordinate frame of the owning joint.

The localized vertex coordinate is calculated as

$\bar{v}_i = \hat{v}_i - B_S(\beta)_i - B_P(\theta)_i - J(\beta)_j$,

where $\hat{v}_i$ is the $i$-th vertex after the inverse linear blend skinning transform, $J(\beta)_j$ denotes the parent joint of the $j$-th bone to which the $i$-th vertex belongs, and $B_S(\beta)_i$ and $B_P(\theta)_i$ respectively denote the shape blend-shape and pose blend-shape offsets.
At step 520, a learnable hidden variable for each vertex of the three-dimensional mesh is calculated by the embedding layer.
The embedding layer maps each vertex of the three-dimensional mesh to a vector representation, i.e., its learnable hidden variable.
Step 540, the rotation angle of the joint to which each vertex belongs relative to its parent joint is calculated based on the pose parameters of each frame of hand image.
In one possible implementation, based on the pose parameters of each frame of hand image, the rotation angle of each vertex's joint relative to its parent is calculated according to the maximal linear blend skinning weight.
Step 560, the localized vertices of the three-dimensional mesh, the learnable hidden variable of each vertex, the rotation angle of each vertex's joint relative to its parent, and the pose parameters are stitched together to obtain the input feature vector of each vertex.
In one possible implementation, the localized vertex coordinates of the three-dimensional mesh, the learnable hidden variables of the vertices, the rotation angles of the vertices' joints relative to their parents, and the pose parameters are stitched according to the sequence numbers of the mesh vertices.
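A sketch of assembling the per-vertex input feature follows; the embedding dimension and tensor shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn

num_verts, embed_dim = 778, 32                         # MANO has 778 vertices
vertex_embedding = nn.Embedding(num_verts, embed_dim)  # learnable hidden variables

def build_input_features(local_coords, joint_rotations, pose_params):
    # local_coords: [778, 3] localized vertex coordinates
    # joint_rotations: [778, 3] rotation of each vertex's joint w.r.t. its parent
    # pose_params: [48] global pose parameters, broadcast to every vertex
    hidden = vertex_embedding(torch.arange(num_verts))     # [778, 32]
    pose = pose_params.unsqueeze(0).expand(num_verts, -1)  # [778, 48]
    return torch.cat([local_coords, hidden, joint_rotations, pose], dim=-1)
```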
Step 580, the input feature vector of each vertex is input into the graph convolutional network to obtain the vertex alignment features of each vertex.
As shown in fig. 6, in one possible implementation, step 580 may include the steps of:
Step 600, identity feature coding is performed on the hand according to the identity to which it belongs, obtaining the identity feature code of the hand.
That is, the identity code uniquely represents the identity to which the hand belongs; it will be understood that the identity code remains consistent for the same person and differs between different persons.
In one possible implementation, the inventors recognized that an identity code of too high a dimension would be difficult to optimize, so the dimension of the identity code is set to 128 or 256.
Step 620, while the input feature vectors of the vertices are being vertex-aligned by the graph convolutional network, the identity feature code of the hand is embedded into a middle layer of the graph convolutional network to obtain the vertex alignment features of each vertex.
Specifically, the input feature vector of each vertex is passed through the layers of the graph convolutional network for feature extraction; at the middle layer of the network, the extracted features are first stitched together with the identity feature code, the result then continues through the remaining layers for further feature extraction, and the vertex alignment features of each vertex are finally obtained.
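A sketch of this injection is given below, assuming PyTorch Geometric's GCNConv as the graph convolution layer; the depth and layer widths are assumptions:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv  # assumes PyTorch Geometric is installed

class VertexAlignGCN(nn.Module):
    """Toy two-stage GCN with the identity code injected at the middle layer."""
    def __init__(self, in_dim, id_dim=128, hidden=256, out_dim=64):
        super().__init__()
        self.front = GCNConv(in_dim, hidden)            # layers before injection
        self.back = GCNConv(hidden + id_dim, out_dim)   # layers after injection

    def forward(self, x, edge_index, id_code):
        # x: [num_verts, in_dim]; edge_index: mesh connectivity; id_code: [id_dim]
        h = torch.relu(self.front(x, edge_index))
        id_feat = id_code.unsqueeze(0).expand(h.size(0), -1)  # same code per vertex
        h = torch.cat([h, id_feat], dim=-1)             # embed identity mid-network
        return self.back(h, edge_index)                 # vertex alignment features
```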
It is worth noting that the graph convolutional network can be replaced by any other neural network capable of modeling geometric meshes or point-cloud signals, thereby likewise introducing geometric prior information; that is, the graph convolutional network is only an example, and equivalent embodiments not explicitly described here are also intended to fall within the scope of the present application.
Under this embodiment, the fusion of several different features not only enriches the expressive power of the features but also provides flexibility and continuity, avoiding the problem of representational ambiguity.
In an exemplary embodiment, human hand image synthesis is achieved by a synthetic image model, the synthetic image model being a trained neural network model.
As shown in fig. 7, the training process of the composite image model may include the steps of:
step 800, acquiring a training image.
In one possible implementation, the training images comprise hand images from 4 to 10 viewing angles, with the overlap between camera views not exceeding 50%. The training images should cover every reasonable rotation angle of each finger joint relative to its parent joint, so as to ensure that every skeletal rotation angle of the driving pose is contained in the training data.
Step 820, the loss value of the set loss function is calculated based on the training image, and whether the loss value satisfies the convergence condition is determined.
The set loss function includes at least one of: an image-reconstruction self-supervised loss function for optimizing the graph convolutional network parameters, a geometric reconstruction loss function for optimizing the feature fusion parameters, and a hidden-variable regularization constraint loss function for optimizing the identity latent-coding parameters.
The convergence condition can be flexibly set according to the actual requirement of the application scene, for example, the loss value is smaller than a specific value.
The image-reconstruction self-supervised loss function is

$L_{pho} = \lambda_{pho} \sum_{p=1}^{N_P} \left\| \hat{I}(p) - I(p) \right\|^2$,

where $\hat{I}(p)$ is the pixel value at position $p$ of the synthesized image, $I(p)$ is the pixel value at position $p$ of the ground-truth image, the sum traverses $N_P$ pixel positions (typically all pixel positions of an image), and $\lambda_{pho}$ is the weight of the loss function.

The geometric reconstruction loss function is

$L_{geo} = \lambda_{geo} \sum_{i=1}^{N_{vert}} \left\| v_i - \hat{v}_i \right\|^2$,

where $N_{vert}$ is the total number of mesh vertices, $v_i$ and $\hat{v}_i$ are respectively the ground-truth and predicted positions of the $i$-th vertex, and $\lambda_{geo}$ is the weight of the loss function.

The hidden-variable regularization constraint loss function is

$L_{reg} = \lambda_{id} \left\| l_{id} \right\|^2 + \lambda_{index} \left\| l_{index} \right\|^2$,

where $l_{id}$ is the identity feature code, $l_{index}$ denotes the learnable hidden variables of the vertices, and $\lambda_{id}$ and $\lambda_{index}$ are the weights of the loss function.
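A sketch of combining the three losses into one training objective follows; the weights shown are illustrative, as the patent leaves the lambda values unspecified:

```python
import torch

def total_loss(pred_img, gt_img, pred_verts, gt_verts, l_id, l_index,
               w_pho=1.0, w_geo=0.1, w_id=1e-3, w_index=1e-3):
    # Illustrative weights; the patent does not fix the lambda values.
    loss_pho = w_pho * ((pred_img - gt_img) ** 2).mean()              # L_pho
    loss_geo = w_geo * ((pred_verts - gt_verts) ** 2).sum(-1).mean()  # L_geo
    loss_reg = w_id * (l_id ** 2).sum() + w_index * (l_index ** 2).sum()
    return loss_pho + loss_geo + loss_reg
```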
If the loss value satisfies the convergence condition, step 840 is performed.
If the loss value does not meet the convergence condition, step 860 is performed.
Step 840, the trained neural network model is taken as the synthetic image model.
In step 860, parameters of the neural network model are optimized.
The parameters include at least the graph convolutional network parameters, the feature fusion parameters and the identity latent-coding parameters.
After the training process is completed, a human hand image synthesis model with the capability of synthesizing human hand images is obtained.
Fig. 8 is a schematic diagram of a specific implementation of the human hand image synthesis method in an application scenario. In fig. 8, the synthesis process comprises a vertex alignment part, a patch alignment part and an image rendering part. The vertex alignment part comprises feature stitching and fusion and the graph convolutional network; the patch alignment part comprises the several branch multi-layer perceptrons and the TBN matrix transform.
Fig. 9 shows a synthesized hand image generated using the human hand image synthesis method of the present application, together with a control-group image and a ground-truth image. As can be seen from fig. 9, the method of the present application is better than the previous methods at driving new poses and in visual quality: the synthesized hand image generated by the method of the present application has sharper finger joints and knuckles, and the veins and hair texture of the back of the hand are better preserved than in the control group.
To quantitatively analyze the quality of the synthesized hand images generated by the method, peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and learned perceptual image patch similarity (LPIPS) are used to characterize image quality. A larger PSNR indicates higher image quality; SSIM lies in [0,1], and a larger value indicates less distortion, i.e., higher image quality; a smaller LPIPS indicates a smaller difference between the synthesized image and the ground truth, i.e., higher image quality.
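A sketch of computing the three metrics, assuming the scikit-image and lpips packages are available:

```python
import torch
import lpips                                      # pip package `lpips` (assumed)
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    # pred, gt: HxWx3 float numpy arrays in [0, 1]
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    net = lpips.LPIPS(net='alex')                 # learned perceptual metric
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = net(to_t(pred), to_t(gt)).item()         # LPIPS expects [-1, 1] tensors
    return psnr, ssim, lp
```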
Table 1 below quantitatively compares the performance of the method of the present application with the prior art on the new-view rendering and new-pose driving tasks.
TABLE 1
As can be seen from table 1, the method of the present application performs far better than the prior art (the NB and AMVP methods) on both the new-view rendering and new-pose driving tasks.
The prior art referenced in table 1 comprises the NB and AMVP methods. NB (Neural Body) is a deformable neural radiance field applied to human reconstruction: by assuming that the neural representations learned in different frames share the same set of latent codes anchored to a deformable mesh, observations across frames can be integrated naturally. AMVP is a method for dynamic scene reconstruction that applies MVP (Mixture of Volumetric Primitives) to an animatable setting; by changing the geometry estimation layer into a MANO estimation layer as a parameter-driven decoupled control mechanism, the characteristics of the MVP method are retained.
The following is an embodiment of the apparatus of the present application, which may be used to perform the method of human hand image synthesis according to the present application. For details not disclosed in the embodiment of the apparatus of the present application, please refer to a method embodiment of the method for synthesizing a human hand image according to the present application.
Referring to fig. 10, in an embodiment of the present application, a human hand image synthesizing apparatus 900 is provided, including but not limited to: an image sequence acquisition module 910, a parameter calculation module 920, an alignment module 930, and a rendering module 940.
The image sequence acquisition module 910 is configured to acquire a human hand image sequence, wherein the sequence comprises multiple frames of hand images, each frame captured from a different viewing angle of the hand.

The parameter calculation module 920 is configured to calculate pose parameters and shape parameters for each frame of hand image and to estimate the three-dimensional hand mesh through the parameterized mesh model according to the pose parameters and shape parameters.

The alignment module 930 is configured to perform vertex alignment and patch alignment based on each vertex of the three-dimensional hand mesh and the patch corresponding to each vertex, to obtain the volumetric representation distribution of the hand in each hand image.

The rendering module 940 is configured to perform image rendering according to the volumetric representation distribution of the hand in each hand image, to obtain a synthesized hand image.
In an exemplary embodiment, the alignment module 930 includes: a vertex alignment unit 931, configured to align each vertex of the three-dimensional hand mesh using the graph convolutional network to obtain the vertex alignment features of each vertex; a primitive representation estimation unit 932, configured to estimate, for the patch corresponding to each vertex, the volumetric primitive representation of the patch according to the vertex alignment features of the patch's vertices, to obtain the patch alignment features of each patch; and a transformation unit 933, configured to transform the volumetric primitive representation of each patch onto the surface of the corresponding mesh patch according to the patch alignment features, to obtain the volumetric representation distribution of the hand in each hand image.
In an exemplary embodiment, the vertex alignment unit 931 includes: a coordinate localization subunit 9311, configured to localize the vertices of the three-dimensional mesh to obtain the localized vertices; a hidden variable calculation subunit 9312, configured to obtain the learnable hidden variable of each vertex of the three-dimensional mesh through the embedding layer; a rotation angle calculation subunit 9313, configured to calculate the rotation angle of each vertex's joint relative to its parent joint based on the pose parameters of each frame of hand image; a first stitching subunit 9314, configured to stitch together the localized vertices of the three-dimensional mesh, the learnable hidden variable of each vertex, the rotation angle of each vertex's joint relative to its parent, and the pose parameters, to obtain the input feature vector of each vertex; and a vertex alignment feature acquisition subunit 9315, configured to input the input feature vector of each vertex into the graph convolutional network to obtain the vertex alignment features of each vertex.
In an exemplary embodiment, the vertex alignment feature acquisition subunit 9315 includes: an identity coding subunit 9316, configured to perform identity feature coding on the hand according to the identity to which it belongs, obtaining the identity feature code of the hand; and a second stitching subunit 9317, configured to embed the identity feature code of the hand into the middle layer of the graph convolutional network while the input feature vectors of the vertices are being vertex-aligned by the graph convolutional network, to obtain the vertex alignment features of each vertex.
In an exemplary embodiment, the principal component expression estimation unit 932 includes: a feature estimation subunit 9321, configured to input the vertex alignment features of the vertices corresponding to a patch into a color-branch multi-layer perceptron, a density-branch multi-layer perceptron, and a motion-branch multi-layer perceptron, respectively, to estimate the color feature, density feature, and motion feature of the patch; and a feature fusion subunit, configured to fuse the estimated color, density, and motion features to obtain the patch alignment feature of each patch.
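The following is a minimal sketch of such a three-branch estimation. The mean-pooling of a patch's three corner vertices and the fusion by concatenation are assumptions, and the output dimensions (color 3, density 1, motion 3) are illustrative.

import torch
import torch.nn as nn

def branch(in_dim, out_dim, hid=64):
    # One multi-layer perceptron branch.
    return nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, out_dim))

class PatchAlign(nn.Module):
    def __init__(self, feat=64):
        super().__init__()
        self.color = branch(feat, 3)     # color feature of the patch
        self.density = branch(feat, 1)   # density feature of the patch
        self.motion = branch(feat, 3)    # motion feature of the patch

    def forward(self, vert_feats, faces):
        # vert_feats: (V, feat) vertex alignment features; faces: (F, 3) vertex indices.
        f = vert_feats[faces].mean(dim=1)   # pool each patch's corner vertices -> (F, feat)
        # Feature fusion: concatenate the three estimated branches per patch.
        return torch.cat([self.color(f), self.density(f), self.motion(f)], dim=-1)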
In an exemplary embodiment, the rendering module 940 includes: a pixel value calculation unit 941, configured to calculate, according to the volume expression distribution of the human hand in each human hand image, the pixel value at each pixel position of the composite image under the rendering view angle through a differentiable neural rendering equation, so as to generate the composite image of the human hand.
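As a point of reference, a minimal sketch of one differentiable volume-rendering quadrature (the standard alpha-compositing form used by neural radiance fields) is given below; the actual rendering equation of this application may differ.

import torch

def render_pixel(colors, sigmas, deltas):
    # colors: (S, 3) color at S samples along the ray through one pixel
    # sigmas: (S,)   volume density at each sample
    # deltas: (S,)   spacing between consecutive samples
    alpha = 1.0 - torch.exp(-sigmas * deltas)            # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)    # survival probability
    trans = torch.cat([torch.ones(1), trans[:-1]])       # transmittance up to each sample
    weights = alpha * trans
    return (weights[:, None] * colors).sum(dim=0)        # (3,) pixel value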
It should be noted that the human hand image synthesizing device provided in the foregoing embodiment is illustrated only with the above division of functional modules; in practical applications, these functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
In addition, the human hand image synthesizing device provided in the foregoing embodiments and the embodiments of the human hand image synthesizing method belong to the same concept; the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not repeated here.
Fig. 11 shows a schematic structural diagram of a server according to an exemplary embodiment. This server is suitable for use as the server side 130 in the implementation environment shown in Fig. 1.
It should be noted that this server is merely an example adapted to the present application and should not be construed as limiting the scope of use of the application in any way; nor should the server be construed as needing to rely on, or needing to include, one or more components of the exemplary server 2000 shown in Fig. 11.
The hardware structure of the server 2000 may vary widely depending on its configuration or performance. As shown in Fig. 11, the server 2000 includes: a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (CPU, Central Processing Unit) 270.
Specifically, the power supply 210 is configured to provide an operating voltage for each hardware device on the server 2000.
The interface 230 includes at least one wired or wireless network interface 231 for interacting with external devices, for example, to perform the interaction between the acquisition side 110 and the server side 130 in the implementation environment shown in Fig. 1.
Of course, in other examples adapted to the present application, the interface 230 may further include at least one serial-parallel conversion interface 233, at least one input-output interface 235, at least one USB interface 237, and the like, as shown in Fig. 11, which is not particularly limited herein.
The memory 250 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, where the resources stored include an operating system 251, application programs 253, and data 255, and the storage mode may be transient storage or permanent storage.
The operating system 251 is used for managing and controlling the various hardware devices and the applications 253 on the server 2000, so as to implement the central processing unit 270's computation and processing of the massive data 255 in the memory 250; it may be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The application 253 is a computer program that performs at least one specific task based on the operating system 251, and may include at least one module (not shown in fig. 11), each of which may respectively include a computer program for the server 2000. For example, a human hand image synthesizing device may be considered as the application 253 deployed on the server 2000.
The data 255 may be a photograph, a picture, etc. stored in a disk, or may be a synthesized image of a human hand generated by the server 130, a sequence of images of human hands with multiple perspectives acquired by the acquisition terminal 110, etc., and stored in the memory 250.
The central processing unit 270 may include one or more processors and is configured to communicate with the memory 250 via at least one communication bus to read the computer programs stored in the memory 250, thereby implementing the computation and processing of the massive data 255 in the memory 250. For example, the human hand image synthesizing method is accomplished by the central processing unit 270 reading a series of computer programs stored in the memory 250.
Furthermore, the present application can be realized by hardware circuitry or by a combination of hardware circuitry and software, and thus, the implementation of the present application is not limited to any specific hardware circuitry, software, or combination of the two.
Referring to fig. 12, in an embodiment of the present application, an electronic device 4000 is provided, where the electronic device 4000 may include: desktop computers, notebook computers, servers, etc.
In fig. 12, the electronic device 4000 includes at least one processor 4001, at least one communication bus 4002, and at least one memory 4003.
The processor 4001 is coupled to the memory 4003, for example via the communication bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The communication bus 4002 may include a pathway for transferring information between the aforementioned components. It may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 12, but this does not mean that there is only one bus or only one type of bus.
The memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 4003 has stored thereon a computer program, and the processor 4001 reads the computer program stored in the memory 4003 through the communication bus 4002.
The computer program, when executed by the processor 4001, implements the human hand image synthesizing method in each of the embodiments described above.
Further, in the embodiments of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the human hand image synthesizing method in the above-described embodiments.
In an embodiment of the application, a computer program product is provided, which comprises a computer program stored in a storage medium. The processor of the computer device reads the computer program from the storage medium, and the processor executes the computer program so that the computer device executes the human hand image synthesizing method in each of the above embodiments.
Compared with the related art, the present application has the following advantages:
1. Considering the characteristics of the hybrid volume expression and the discontinuous, unevenly distributed UV unwrapping of the human hand, the application introduces a graph convolutional neural network as the backbone encoder-decoder, thereby combining the geometric driving signal with the geometric topology and shifting UV-aligned decoding to geometry-aligned decoding, so that the geometric signal is used more reasonably.
2. The scheme fuses the pose and shape parameters obtained from the parameterized grid model as driving parameters; combined with the MANO geometric information extracted through graph convolution, it can effectively learn common local dynamic characteristics from hand data of different poses and identities, thereby enhancing the generalization ability for driving different poses, so that arbitrary poses can be driven and generated.
3. A graph convolutional network having the same grid topology as the MANO model is introduced as the backbone, providing good structural awareness.
4. By implementing a highly optimized, data-parallel BVH (bounding volume hierarchy), the BVH can be rebuilt on a per-frame basis, so that dynamic scenes can be handled efficiently (a minimal sketch follows this list); the estimation space of the rendering stage is also reduced, so that the neural network only needs to learn texture-aligned volume principal component expressions, shrinking the meaningless parameter space and improving training and rendering speed.
5. The scheme of the application is a hybrid expression that finds an optimal balance between volume-based and primitive-based neural scene expressions. It can therefore produce high-quality results with fine-scale detail, render quickly, remain drivable, and reduce memory requirements.
6. The corresponding pose parameters and shape parameters are calculated from the acquired human hand image sequence; a three-dimensional grid of the human hand is generated by the parameterized grid model based on the pose and shape parameters; the volume expression distribution, i.e., a neural parametric volume principal component model, is estimated from the three-dimensional grid; and a composite image of the human hand is rendered from the volume expression distribution.
Combining the volume principal components with a parameterized grid model (e.g., MANO) makes it possible to represent the high-frequency texture characteristics of the human hand and thereby achieve high-quality rendering reconstruction; this human hand image synthesis method can drive and generate new poses, and the rendered composite hand image is highly faithful, clear, and realistic.
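As referenced in advantage 4 above, a minimal serial sketch of a per-frame BVH rebuild over patch centroids follows. The highly optimized data-parallel (e.g., GPU) construction is not reproduced here; the median-split strategy, function names, and leaf size are assumptions for illustration only.

import numpy as np

def build_bvh(centroids, ids=None, leaf_size=8):
    # centroids: (F, 3) centroid of each patch of the current frame's hand grid.
    ids = np.arange(len(centroids)) if ids is None else ids
    lo, hi = centroids[ids].min(axis=0), centroids[ids].max(axis=0)
    if len(ids) <= leaf_size:
        return {"aabb": (lo, hi), "leaf": ids}
    axis = int(np.argmax(hi - lo))                  # split along the widest axis
    order = ids[np.argsort(centroids[ids, axis])]
    mid = len(order) // 2
    return {"aabb": (lo, hi),
            "left": build_bvh(centroids, order[:mid], leaf_size),
            "right": build_bvh(centroids, order[mid:], leaf_size)}

# Per-frame rebuild: as the grid deforms, recompute patch centroids and rebuild, e.g.
#   bvh = build_bvh(verts[faces].mean(axis=1))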
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be executed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the application and is not intended to limit the application, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the application are intended to be included within the scope of the application.

Claims (10)

1. A method of human hand image synthesis, the method comprising:
acquiring a human hand image sequence, wherein the human hand image sequence comprises a plurality of frames of human hand images, and each frame of human hand image is obtained by photographing a human hand from a different viewing angle;
calculating pose parameters and shape parameters of each frame of human hand image, and estimating a three-dimensional grid of the human hand through a parameterized grid model according to the pose parameters and the shape parameters;
based on each vertex and the corresponding surface patch of each vertex in the three-dimensional grid of the human hand, performing vertex alignment and surface patch alignment to obtain volume expression distribution of the human hand in each human hand image;
and performing image rendering according to the volume expression distribution of the human hand in each human hand image to obtain a composite image of the human hand.
2. The method of claim 1, wherein the performing vertex alignment and surface patch alignment based on each vertex and the surface patch corresponding to each vertex in the three-dimensional grid of the human hand to obtain the volume expression distribution of the human hand in each human hand image comprises:
aligning each vertex in the three-dimensional grid of the human hand by using a graph convolutional network to obtain a vertex alignment feature of each vertex;
for the surface patch corresponding to each vertex, estimating the volume principal component expression of the surface patch according to the vertex alignment features of the vertices corresponding to the surface patch, to obtain a surface patch alignment feature of each surface patch;
and transforming, according to the surface patch alignment features of the surface patches, the volume principal component expression of each surface patch onto the surface of the corresponding surface patch of the three-dimensional grid, to obtain the volume expression distribution of the human hand in each human hand image.
3. The method of claim 2, wherein said aligning each of said vertices in said three-dimensional grid of said human hand with a graph convolutional network to obtain vertex alignment features for each of said vertices comprises:
localizing the vertices of the three-dimensional grid to obtain localized vertices of the three-dimensional grid;
obtaining a learnable hidden variable for each vertex of the three-dimensional grid through computation by an embedding layer;
calculating, based on the pose parameters of each frame of human hand image, the rotation angle of the joint point to which each vertex belongs relative to its parent node;
concatenating the localized vertices of the three-dimensional grid, the learnable hidden variable of each vertex, the rotation angle of the joint point to which each vertex belongs relative to its parent node, and the pose parameters, to obtain an input feature vector for each vertex;
and inputting the input feature vector of each vertex into the graph convolution network to obtain the vertex alignment feature of each vertex.
4. A method as claimed in claim 3, wherein said inputting said input feature vector for each vertex into said graph convolution network results in said vertex alignment feature for each vertex, comprising:
performing identity feature coding on the human hand according to the identity to which the human hand belongs, to obtain an identity feature code of the human hand;
and embedding the identity feature code of the human hand into the middle layer of the graph convolutional network in the process of performing vertex alignment on the input feature vector of each vertex through the graph convolutional network, to obtain the vertex alignment feature of each vertex.
5. The method of claim 2, wherein the estimating, for the surface patch corresponding to each vertex, the volume principal component expression of the surface patch according to the vertex alignment features of the vertices corresponding to the surface patch, to obtain the surface patch alignment feature of each surface patch, comprises:
respectively inputting the vertex alignment features of the vertices corresponding to the surface patch into a color-branch multi-layer perceptron, a density-branch multi-layer perceptron, and a motion-branch multi-layer perceptron, to estimate the color features, density features, and motion features of the surface patch;
and performing feature fusion on the estimated color features, density features, and motion features, to obtain the surface patch alignment feature of each surface patch.
6. The method of claim 1, wherein said performing image rendering based on the volumetric expression distribution of the human hand in each of the human hand images to obtain a composite image of the human hand comprises:
according to the volume expression distribution of the human hand in each human hand image, calculating the pixel value of each pixel position of the composite image under the rendering view angle through a differentiable neural rendering equation, and generating the composite image of the human hand.
7. The method of any one of claims 1 to 6, wherein the human hand image synthesis is achieved by a synthetic image model; the synthetic image model is a trained neural network model;
the training process of the synthetic image model comprises the following steps:
acquiring a training image;
calculating a loss value of a set loss function based on the training image, the set loss function comprising at least one of: an image reconstruction self-supervised loss function for optimizing graph convolutional network parameters, a geometric reconstruction loss function for optimizing feature fusion parameters, and a hidden variable regularization constraint loss function for optimizing identity hidden coding parameters;
if the loss value meets the convergence condition, training of the neural network model is completed, and the synthetic image model is obtained;
otherwise, optimizing the parameters of the neural network model, the parameters comprising at least the graph convolutional network parameters, the feature fusion parameters, and the identity hidden coding parameters.
8. A human hand image synthesizing apparatus, comprising:
the image sequence acquisition module is used for acquiring a human hand image sequence, wherein the human hand image sequence comprises a plurality of frames of human hand images, and each frame of human hand image is obtained by photographing a human hand from a different viewing angle;
the parameter calculation module is used for calculating pose parameters and shape parameters of each frame of human hand image and estimating a three-dimensional grid of the human hand through the parameterized grid model according to the pose parameters and the shape parameters;
the alignment module is used for carrying out vertex alignment and surface patch alignment based on each vertex and the surface patch corresponding to each vertex in the three-dimensional grid of the human hand to obtain volume expression distribution of the human hand in each human hand image;
and the rendering module is used for performing image rendering according to the volume expression distribution of the human hand in each human hand image to obtain a composite image of the human hand.
9. An electronic device, comprising: at least one processor, and at least one memory, wherein,
program instructions or codes are stored on the memory;
the program instructions or code are loaded and executed by the processor to cause the electronic device to implement the human hand image synthesis method of any one of claims 1 to 7.
10. A storage medium having stored thereon program instructions or code that are loaded and executed by a processor to implement the human hand image synthesis method of any one of claims 1 to 7.
CN202310321253.8A 2023-03-14 2023-03-23 Human hand image synthesis method, device, electronic equipment and storage medium Pending CN116758202A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310282564 2023-03-14
CN2023102825648 2023-03-14

Publications (1)

Publication Number Publication Date
CN116758202A true CN116758202A (en) 2023-09-15

Family

ID=87952083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310321253.8A Pending CN116758202A (en) 2023-03-14 2023-03-23 Human hand image synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116758202A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination