CN116188704A - Hand image generation method and device, electronic equipment and readable storage medium


Info

Publication number
CN116188704A
CN116188704A (application number CN202310487866.9A)
Authority
CN
China
Prior art keywords
hand
information
coordinate point
spatial coordinate
point
Prior art date
Legal status
Withdrawn
Application number
CN202310487866.9A
Other languages
Chinese (zh)
Inventor
陈星宇 (Chen Xingyu)
王宝元 (Wang Baoyuan)
沈向洋 (Shen Xiangyang)
Current Assignee
Beijing Hongmian Xiaoice Technology Co Ltd
Original Assignee
Beijing Hongmian Xiaoice Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Hongmian Xiaoice Technology Co Ltd filed Critical Beijing Hongmian Xiaoice Technology Co Ltd
Priority to CN202310487866.9A
Publication of CN116188704A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T15/00: 3D [Three Dimensional] image rendering
    • G06T15/005: General purpose rendering architectures
    • G06T15/06: Ray-tracing
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)
  • Image Generation (AREA)

Abstract

The application provides a hand image generation method and device, an electronic device and a readable storage medium, relating to the field of hand image generation. The method includes the following steps: constructing an occupancy field for each bone of the hand and a hand coloring field based on a hand mesh model; for any target pixel of the hand image, predicting the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction based on the occupancy field of each bone of the hand, and predicting the albedo information and the illumination information of each spatial coordinate point based on the hand coloring field; and calculating the color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point, and synthesizing the hand image by a volume rendering method based on the color value of each pixel. The hand image generation method renders highly realistic hand images at low cost.

Description

Hand image generation method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of hand image generation, and in particular, to a hand image generation method, device, electronic apparatus, and readable storage medium.
Background
Human body rendering is a fundamental technology in virtual human production. The hand is the most flexible moving part of the human body and exhibits complex pose variations, so hand appearance modeling is extremely difficult.
In the related art, hand appearance modeling mainly relies on expensive scanning equipment to reconstruct hand geometry and texture, after which considerable manual effort is required to paint accurate hand textures. Moreover, conventional graphics-based rendering struggles to reproduce realistic illumination, so the rendered images lack realism.
There is therefore an urgent need for a hand image generation method that reduces the cost of hand appearance modeling and improves the realism of the generated hand images.
Disclosure of Invention
The invention aims to provide a hand image generation method and device, electronic equipment and a readable storage medium for rendering highly realistic hand images at low cost.
The application provides a hand image generation method, which comprises the following steps:
acquiring hand gesture information; inputting the hand gesture information into a parameterized model to obtain a hand mesh model, and constructing an occupancy field for each bone of the hand and a hand coloring field based on the hand mesh model; for any target pixel of a hand image, predicting the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction based on the occupancy field of each bone of the hand, and predicting the albedo information and the illumination information of each spatial coordinate point based on the hand coloring field; and calculating a color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point, and synthesizing the hand image by a volume rendering method based on the color value of each pixel.
Optionally, constructing the occupancy field of each bone of the hand based on the hand mesh model includes: part-wise sampling the vertices of the hand mesh model according to hand activity information and preset hand bone division information to obtain a first point cloud; inputting the position information of the sampling points corresponding to a target bone in the first point cloud into the part geometry encoder corresponding to the target bone to obtain the part geometry code corresponding to the target bone, where each bone corresponds to one part geometry encoder; and concatenating the part geometry codes corresponding to adjacent bones and inputting the result into a local pair decoder to obtain the occupancy field of each bone of the hand. The preset hand bone division information divides the hand into a plurality of bones; the target bone is any one of the plurality of bones.
Optionally, predicting, based on the occupancy field of each bone of the hand, the occupancy information of each of the plurality of spatial coordinate points sampled along the observation direction for any target pixel of the hand image includes: taking the maximum over the occupancy fields of the bones of the hand to obtain the hand occupancy field of the whole hand, and determining a target relationship based on the hand occupancy field, the target relationship being the relative positional relationship between each spatial coordinate point and the geometric space of the hand; and determining the occupancy information of a plurality of sampling points in the first point cloud adjacent to each spatial coordinate point based on the hand occupancy field, and predicting the occupancy information of each spatial coordinate point based on the target relationship and the occupancy information of those adjacent sampling points.
Optionally, constructing the hand coloring field based on the hand mesh model includes: constructing the hand coloring field according to the hand mesh model and the occupancy field of each bone of the hand. The hand coloring field is used to predict the coloring information of each spatial coordinate point; the coloring information is the product of the illumination information and the albedo information of each spatial coordinate point.
Optionally, predicting the albedo information and the illumination information of each spatial coordinate point based on the hand coloring field includes: sampling the barycenter of each mesh face of the hand mesh model to obtain a second point cloud; calculating the position code and the albedo code of each spatial coordinate point through a point interpolation algorithm based on the position information, the position code and the albedo code of each of a plurality of sampling points in the second point cloud adjacent to each spatial coordinate point; and inputting the target relationship, the position code of each spatial coordinate point, the occupancy value of each spatial coordinate point and the hand gesture information into a first multi-layer perceptron to predict the illumination information of each spatial coordinate point, and inputting the albedo code of each spatial coordinate point into a second multi-layer perceptron to predict the albedo information of each spatial coordinate point. The target relationship is the relative positional relationship between each spatial coordinate point and the geometric space of the hand; the position code of a sampling point describes the relative positional relationship between the sampling point and the hand surface; the position code of a spatial coordinate point describes the relative positional relationship between the spatial coordinate point and the hand surface.
Optionally, calculating the color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point includes: performing an integral operation over the plurality of spatial coordinate points based on geometric information of the hand geometric space to obtain the color value of the target pixel.
Optionally, the parameterized model, the occupancy fields of the bones of the hand and the hand coloring field are integrated into a unified system for end-to-end training. The training is constrained by the following losses: the intersection-over-union loss between the hand silhouette generated by the parameterized model and the ground-truth image in the training sample, the L1 loss between the rendered image and the ground-truth image in the training sample, and the perceptual loss between the rendered image and the ground-truth image in the training video.
The application also provides a hand image generation device, comprising:
an acquisition module for acquiring hand gesture information; an execution module for inputting the hand gesture information into a parameterized model to obtain a hand mesh model, and constructing an occupancy field for each bone of the hand and a hand coloring field based on the hand mesh model; a prediction module for predicting, for any target pixel of the hand image, the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction based on the occupancy field of each bone of the hand, and predicting the albedo information and the illumination information of each spatial coordinate point based on the hand coloring field; a calculation module for calculating the color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point; and an image rendering module for synthesizing the hand image by a volume rendering method based on the color value of each pixel.
Optionally, the device further includes a sampling module. The sampling module is used to part-wise sample the vertices of the hand mesh model according to hand activity information and preset hand bone division information to obtain a first point cloud. The execution module is specifically used to input the position information of the sampling points corresponding to a target bone in the first point cloud into the part geometry encoder corresponding to the target bone to obtain the part geometry code corresponding to the target bone, where each bone corresponds to one part geometry encoder; and is further used to concatenate the part geometry codes corresponding to adjacent bones and input the result into a local pair decoder to obtain the occupancy field of each bone of the hand. The preset hand bone division information divides the hand into a plurality of bones; the target bone is any one of the plurality of bones.
Optionally, the execution module is further used to take the maximum over the occupancy fields of the bones of the hand to obtain the hand occupancy field of the whole hand, and to determine a target relationship based on the hand occupancy field, the target relationship being the relative positional relationship between each spatial coordinate point and the geometric space of the hand. The prediction module is specifically used to determine the occupancy information of a plurality of sampling points in the first point cloud adjacent to each spatial coordinate point based on the hand occupancy field, and to predict the occupancy information of each spatial coordinate point based on the target relationship and the occupancy information of those adjacent sampling points.
Optionally, the execution module is specifically used to construct the hand coloring field according to the hand mesh model and the occupancy field of each bone of the hand. The hand coloring field is used to predict the coloring information of each spatial coordinate point; the coloring information is the product of the illumination information and the albedo information of each spatial coordinate point.
Optionally, the sampling module is further used to sample the barycenter of each mesh face of the hand mesh model to obtain a second point cloud. The calculation module is further used to calculate the position code and the albedo code of each spatial coordinate point through a point interpolation algorithm based on the position information, the position code and the albedo code of each of a plurality of sampling points in the second point cloud adjacent to each spatial coordinate point. The prediction module is specifically further used to input the target relationship, the position code of each spatial coordinate point, the occupancy value of each spatial coordinate point and the hand gesture information into a first multi-layer perceptron to predict the illumination information of each spatial coordinate point, and to input the albedo code of each spatial coordinate point into a second multi-layer perceptron to predict the albedo information of each spatial coordinate point. The target relationship is the relative positional relationship between each spatial coordinate point and the geometric space of the hand; the position code of a sampling point describes the relative positional relationship between the sampling point and the hand surface; the position code of a spatial coordinate point describes the relative positional relationship between the spatial coordinate point and the hand surface.
Optionally, the image rendering module is specifically used to perform an integral operation over the plurality of spatial coordinate points based on geometric information of the hand geometric space to obtain the color value of the target pixel.
Optionally, the parameterized model, the occupancy fields of the bones of the hand and the hand coloring field are integrated into a unified system for end-to-end training. The training is constrained by the following losses: the intersection-over-union loss between the hand silhouette generated by the parameterized model and the ground-truth image in the training sample, the L1 loss between the rendered image and the ground-truth image in the training sample, and the perceptual loss between the rendered image and the ground-truth image in the training video.
The present application also provides a computer program product comprising computer program/instructions which, when executed by a processor, implement the steps of a hand image generation method as described in any one of the above.
The application also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the hand image generation method as described in any one of the above when executing the program.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of a hand image generation method as described in any of the above.
With the hand image generation method and device, electronic equipment and readable storage medium provided by the application, a target video containing single-view hand images is first acquired, and hand gesture information is extracted from the target video by a pose estimation method. The hand gesture information is then input into a parameterized model to obtain a hand mesh model, and an occupancy field for each bone of the hand and a hand coloring field are constructed based on the hand mesh model. For any target pixel of the hand image, the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction is predicted based on the occupancy field of each bone of the hand, and the albedo information and the illumination information of each spatial coordinate point are predicted based on the hand coloring field. Finally, the color value of each pixel is calculated based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point, and the hand image is synthesized by a volume rendering method based on the color value of each pixel. In this way, the geometry and texture under any pose can be predicted by the neural network, pose-dependent illumination information is generated at the same time, and the image is generated by volume rendering, which greatly reduces the rendering cost of hand images and improves their realism.
Drawings
To more clearly illustrate the technical solutions of the present application or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below illustrate some embodiments of the present application, and a person skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a hand image generating method provided in the present application;
FIG. 2 is a schematic illustration of a hand mesh model provided herein;
FIG. 3 is a second flow chart of the hand image generating method provided in the present application;
fig. 4 is a schematic structural diagram of a hand image generating device provided in the present application;
fig. 5 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is apparent that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms first, second and the like in the description and in the claims are used to distinguish between similar objects and do not necessarily describe a particular sequence or chronological order. It is to be understood that the terms so used are interchangeable where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein. Moreover, the objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
In order to solve the technical problems in the related art that hand appearance modeling is costly and the resulting images lack realism, the embodiments of the present application provide a hand image generation method based on neural rendering, which can reconstruct the geometry and texture of the hand from a single-view hand video and generate photo-realistic hand images. After the neural network model is trained, only hand gesture information is needed as input: the model predicts the hand geometry and texture information under any pose, generates pose-dependent illumination information at the same time, and finally generates an image by a volume rendering method.
The hand image generation method provided by the embodiments of the present application is described in detail below through specific embodiments and their application scenarios with reference to the accompanying drawings.
As shown in fig. 1, the method for generating a hand image according to the embodiment of the present application may include the following steps 101 to 104:
Step 101, acquiring hand gesture information.
Illustratively, the hand gesture information may be extracted from a target video containing single-view hand images by a pose estimation method, or existing hand gesture information may be used directly.
It should be noted that any mature pose estimation method in the related art may be used here; this is not limited in the embodiments of the present application.
Step 102, inputting the hand gesture information into a parameterized model to obtain a hand mesh model, and constructing an occupancy field for each bone of the hand and a hand coloring field based on the hand mesh model.
Illustratively, after the hand gesture information is obtained, it may be input into a parameterized model, from which a hand mesh model is obtained.
It should be noted that the embodiments of the present application may use MANO-HD as the driving model for hand motion. MANO-HD is an extension of the MANO hand parameterized model: by subdividing the MANO template mesh and re-optimizing the kinematic parameters, it provides a drivable mesh model with higher resolution.
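Illustratively, the pose-to-mesh step can be sketched as follows in Python. This is a minimal, non-limiting sketch: the `ManoHD` class, its vertex count and the 48-dimensional pose vector (16 joints with 3 axis-angle parameters each, as in MANO) are illustrative assumptions rather than details disclosed by this application.

```python
# Hedged sketch: driving a MANO-style parameterized hand model.
import numpy as np

class ManoHD:
    """Stand-in for a MANO-HD-style drivable hand mesh model."""
    def __init__(self, n_vertices: int = 12337):
        # A subdivided template mesh; MANO's original template has 778
        # vertices, and MANO-HD subdivides it for higher resolution.
        # The vertex count here is a placeholder.
        self.template = np.zeros((n_vertices, 3))

    def forward(self, pose: np.ndarray, shape: np.ndarray) -> np.ndarray:
        # A real model applies pose/shape blend shapes and linear blend
        # skinning; this placeholder only validates the inputs and
        # returns the undeformed template.
        assert pose.shape == (48,)   # 16 joints x 3 axis-angle params
        assert shape.shape == (10,)  # identity shape coefficients
        return self.template.copy()

model = ManoHD()
pose = np.zeros(48)    # hand gesture information (step 101)
shape = np.zeros(10)
vertices = model.forward(pose, shape)  # hand mesh model (step 102)
```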
As shown in fig. 2, fig. 2 (a) shows a hand mesh model generated by the parameterized model MANO, and fig. 2 (b) shows a hand mesh model generated by the parameterized model MANO-HD used in the embodiments of the present application. As can be seen from fig. 2, the parameterized model MANO-HD used in the embodiments of the present application generates a finer hand mesh model.
After the hand mesh model is obtained, it can be sampled by feature extraction, and the occupancy field of each bone of the hand and the hand coloring field can be constructed from the sampled data.
Step 103, predicting, based on the occupancy field of each bone of the hand, the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction for any target pixel of the hand image, and predicting the albedo information and the illumination information of each spatial coordinate point based on the hand coloring field.
Illustratively, the plurality of spatial coordinate points are obtained by uniform sampling along the observation direction for any target pixel of the hand image to be rendered. One pixel corresponds to a plurality of spatial coordinate points.
Step 104, calculating a color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point, and synthesizing the hand image by a volume rendering method based on the color value of each pixel.
In the embodiments of the present application, for each of the plurality of spatial coordinate points the model predicts its geometric information, i.e. the occupancy value α, together with its albedo value and its illumination value; the information of the plurality of spatial coordinate points is then integrated to obtain the color value of the target pixel.
Illustratively, the color values in the embodiments of the present application may be represented as red-green-blue (RGB) values or as cyan-magenta-yellow (CMY) values; RGB values are used in the embodiments of the present application to represent the color values of the hand image.
Illustratively, after the color value of each pixel of the hand image is calculated in the manner described for the target pixel, the hand image may be rendered by volume rendering.
For example, as shown in fig. 3, after the hand gesture information is obtained, it may be input into a parameterized model to obtain a hand mesh model, and a hand bone occupancy field (i.e., the occupancy field of each bone of the hand) and a hand coloring field are constructed based on the hand mesh model. Each ray is then uniformly sampled along the observation direction to obtain a plurality of spatial coordinate points per ray; since each ray corresponds to one pixel, these can be regarded as the spatial coordinate points corresponding to each pixel. The occupancy value of each spatial coordinate point of each pixel (i.e., the sampling points in fig. 3) is predicted by the hand bone occupancy field, and the albedo value and the illumination value of each spatial coordinate point of each pixel are predicted by the hand coloring field. Finally, the RGB value of each pixel is calculated from the occupancy value, the albedo value and the illumination value of each of its spatial coordinate points, and the hand image is synthesized by a volume rendering method.
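Illustratively, the per-pixel sampling described above can be sketched as follows. This is a minimal, non-limiting Python example assuming one viewing ray per pixel; the function name, the near/far bounds and the sample count are assumptions made for illustration only.

```python
# Hedged sketch of per-pixel ray sampling (the setup for step 103).
import numpy as np

def sample_points_along_ray(origin, direction, near, far, n_samples=64):
    """Uniformly sample n_samples 3D points between near and far on one ray."""
    direction = direction / np.linalg.norm(direction)
    t = np.linspace(near, far, n_samples)            # depths along the ray
    points = origin[None, :] + t[:, None] * direction[None, :]
    return points, t                                  # (N, 3), (N,)

# One pixel -> one ray -> N spatial coordinate points, each later
# assigned an occupancy value, an albedo value and an illumination
# value before volume rendering.
pts, depths = sample_points_along_ray(
    origin=np.array([0.0, 0.0, 0.0]),
    direction=np.array([0.0, 0.0, 1.0]),
    near=0.1, far=1.0)
```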
Optionally, in the embodiments of the present application, the occupancy field of each bone of the hand and the hand coloring field may be constructed based on point clouds obtained by sampling features of the hand mesh model.
Specifically, the step 102 may include the following steps 102a1 to 102a3:
step 102a1, partially sampling the vertices of the hand grid model according to the hand activity information indicated by the target video and the preset hand skeleton division information to obtain a first point cloud.
Step 102a2, inputting the position information of the sampling point corresponding to the target skeleton in the first point cloud into a part geometry encoder corresponding to the target skeleton, and obtaining the part geometry code corresponding to the target skeleton.
Wherein a bone corresponds to a part geometry encoder.
Step 102a3, the part geometric codes corresponding to adjacent bones are input into a local pair decoder after being cascaded, and the occupation field of each bone of the hand is obtained.
The preset hand skeleton division information is used for dividing the hand into a plurality of skeletons; the target bone is any one of the plurality of bones.
Illustratively, the first point cloud may be obtained by partially sampling (part-wise sampling) vertices of the hand grid model. Each sampling point in the first point cloud includes the following information: sample point normal (sampled point normal), sample point position (sampled point position).
Illustratively, the sampling points in the first point cloud may be converted into sampling points on bones of the hand indicated by the preset hand bone division information based on a bone conversion matrix (bone transformation matrix). Then, the position information of the sampling points belonging to the same skeleton is input into a part geometry encoder corresponding to the skeleton, and the part geometry code corresponding to each skeleton is obtained (part geometr encoding). And finally, inputting the parts corresponding to adjacent bones into a local pair decoder after the geometric codes of the parts are cascaded, so as to obtain the occupation field of each bone of the hand.
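Illustratively, a minimal sketch of this part-wise pipeline is given below in Python (PyTorch). The encoder and decoder architectures, the code dimension and the choice of 16 bones (following MANO's 16 hand joints) are assumptions made for illustration; the application itself does not fix these details.

```python
# Hedged sketch: one small encoder per bone consumes that bone's
# sampled points; codes of adjacent bones are concatenated and decoded
# into an occupancy value for a query point.
import torch
import torch.nn as nn

class PartGeometryEncoder(nn.Module):
    """PointNet-style encoder for one bone's sampled points."""
    def __init__(self, code_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, code_dim))

    def forward(self, points):            # points: (P, 3)
        feats = self.mlp(points)          # per-point features
        return feats.max(dim=0).values    # (code_dim,) part geometry code

class LocalPairDecoder(nn.Module):
    """Decodes the concatenated codes of two adjacent bones, plus a
    query point, into an occupancy logit for that point."""
    def __init__(self, code_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * code_dim + 3, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, code_a, code_b, query):   # query: (3,)
        x = torch.cat([code_a, code_b, query])  # 2*code_dim + 3
        return self.mlp(x)                       # occupancy logit

encoders = [PartGeometryEncoder() for _ in range(16)]  # one per bone
decoder = LocalPairDecoder()
```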
It should be noted that taking the maximum over the occupancy fields of the bones yields the hand occupancy field of the whole hand, which is used to describe the relative positional relationship between each spatial coordinate point and the geometric space of the hand.
Illustratively, following the step 102a3, the step 102 may further include the following steps 102a4 and 102a5:
step 102a4, obtaining a hand occupation field of the whole hand after taking the maximum value of the occupation field of each skeleton of the hand, and determining a target relationship based on the hand occupation field.
Wherein, the target relationship is: and the relative position relation between each space coordinate point and the geometrical space of the hand.
Step 102a5, determining occupancy information of a plurality of sampling points adjacent to each spatial coordinate point in the first point cloud based on the hand occupancy field, and predicting occupancy information of each spatial coordinate point based on the target relationship and the occupancy information of the plurality of sampling points adjacent to each spatial coordinate point.
The hand occupation field is used for describing the relative position relation between each space coordinate point and the hand geometric space.
For example, based on the hand occupancy field, occupancy information of each sampling point in the first point cloud may be determined, and then, based on occupancy information of a plurality of sampling points adjacent to each spatial coordinate point and a relative positional relationship between each spatial coordinate point and a hand geometric space, occupancy information of each spatial coordinate point may be predicted.
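Illustratively, one simple way to realize this neighborhood-based prediction is inverse-distance interpolation over the k nearest sampling points, sketched below. The inverse-distance weighting and the value of k are assumptions, since the application does not specify the interpolation scheme.

```python
# Hedged sketch of step 102a5: blend a query point's occupancy from
# the occupancy of its k nearest sampling points in the first point cloud.
import numpy as np

def interpolate_occupancy(query, cloud_xyz, cloud_occ, k=5, eps=1e-8):
    """query: (3,); cloud_xyz: (M, 3); cloud_occ: (M,) -> scalar occupancy."""
    d = np.linalg.norm(cloud_xyz - query[None, :], axis=1)  # (M,) distances
    idx = np.argsort(d)[:k]                                 # k nearest neighbors
    w = 1.0 / (d[idx] + eps)                                # inverse-distance weights
    w /= w.sum()
    return float(np.dot(w, cloud_occ[idx]))                 # blended occupancy
```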
Specifically, the step 102 may further include the following step 102b:
step 102b, constructing the hand coloring field according to the hand grid model and the occupation field of each skeleton of the hand.
The hand coloring field is used for predicting coloring information of each space coordinate point; the coloring information is obtained by multiplying the illumination information of each space coordinate point and the albedo information of each space coordinate point.
Illustratively, after the occupancy field of each skeleton of the hand is obtained, a hand-coloring field for predicting the coloring information of each spatial coordinate point may be constructed based on the hand mesh model and the occupancy field of each skeleton of the hand.
Specifically, the step 103 may include the following steps 103a1 to 103a3:
Step 103a1, barycentric sampling of each mesh face of the hand mesh model to obtain a second point cloud.
Step 103a2, calculating the position code and the albedo code of each spatial coordinate point through a point interpolation algorithm, based on the position information, the position code and the albedo code of each of a plurality of sampling points in the second point cloud adjacent to each spatial coordinate point.
Step 103a3, inputting the target relationship, the position code of each spatial coordinate point, the occupancy value of each spatial coordinate point and the hand gesture information into a first multi-layer perceptron to predict the illumination information of each spatial coordinate point, and inputting the albedo code of each spatial coordinate point into a second multi-layer perceptron to predict the albedo information of each spatial coordinate point.
The target relationship is the relative positional relationship between each spatial coordinate point and the geometric space of the hand; the position code of a sampling point describes the relative positional relationship between the sampling point and the hand surface; the position code of a spatial coordinate point describes the relative positional relationship between the spatial coordinate point and the hand surface.
Illustratively, after the hand coloring field is constructed, the albedo information and the illumination value of each spatial coordinate point may be predicted from it. First, barycentric sampling is performed on each mesh face of the hand mesh model to obtain a second point cloud. Each sampling point in the second point cloud carries the following information: the sampled point position, a position encoding and an albedo encoding.
Illustratively, the albedo code and the position code of each spatial coordinate point are predicted by a point interpolation algorithm from the position information, the position codes and the albedo codes of the plurality of sampling points in the second point cloud adjacent to that spatial coordinate point.
Illustratively, the position code of each spatial coordinate point, the relative positional relationship between each spatial coordinate point and the geometric space of the hand, and the hand gesture information are input into the first multi-layer perceptron, from which the illumination information of each spatial coordinate point can be predicted.
It should be noted that, to make the prediction of the illumination information more accurate, the occupancy value of each spatial coordinate point fed into the first multi-layer perceptron may be obtained with a Soft-Sigmoid, while the occupancy value used in volume rendering is obtained with a Sigmoid.
Illustratively, the albedo code of each spatial coordinate point is input into the second multi-layer perceptron, from which the albedo information of each spatial coordinate point can be predicted. The albedo information may be represented as RGB values, i.e. an albedo value in RGB.
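Illustratively, the two perceptrons can be sketched as below in Python (PyTorch). The layer widths, input dimensions and the concrete form of the Soft-Sigmoid (here read as a temperature-scaled sigmoid) are assumptions made for illustration; the application does not disclose these specifics.

```python
# Hedged sketch of step 103a3: one MLP for illumination, one for albedo.
import torch
import torch.nn as nn

def soft_sigmoid(x, temperature=4.0):
    # Assumed form of "Soft-Sigmoid": a flatter sigmoid that keeps
    # gradients alive when feeding occupancy into the illumination MLP.
    return torch.sigmoid(x / temperature)

class IlluminationMLP(nn.Module):
    # Inputs: target relation (relative position, 3), position code,
    # occupancy value (1) and hand pose -> illumination value (1).
    def __init__(self, pos_code_dim=32, pose_dim=48):
        super().__init__()
        in_dim = 3 + pos_code_dim + 1 + pose_dim
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, rel_pos, pos_code, occ, pose):
        return self.net(torch.cat([rel_pos, pos_code, occ, pose], dim=-1))

class AlbedoMLP(nn.Module):
    # Input: albedo code -> RGB albedo value in [0, 1].
    def __init__(self, albedo_code_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(albedo_code_dim, 128),
                                 nn.ReLU(), nn.Linear(128, 3),
                                 nn.Sigmoid())

    def forward(self, albedo_code):
        return self.net(albedo_code)
```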
Optionally, in the embodiment of the present application, after the occupancy information, the illumination information, and the albedo information of each spatial coordinate point are obtained, image rendering may be performed based on the above information, so as to obtain a hand image under a specified viewing angle.
Specifically, the step 104 may include the following step 104a:
Step 104a, performing an integral operation over the plurality of spatial coordinate points based on geometric information of the hand geometric space to obtain the color value of the target pixel.
Illustratively, the embodiments of the present application integrate hand geometry, texture and illumination using a volume rendering method and render them into an image. A plurality of spatial coordinate points are first sampled along the observation direction and the geometric information, texture information and illumination information of each spatial coordinate point are predicted; the texture and the illumination are multiplied to obtain the coloring information of each spatial coordinate point, and the plurality of spatial coordinate points are integrated according to the geometric information to obtain the RGB value of the pixel.
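Illustratively, the discrete form of this integral can be sketched with standard alpha compositing, as below. Treating the per-sample occupancy value directly as the compositing alpha is an assumption; the application only states that the points are integrated according to the geometric information.

```python
# Hedged sketch of step 104a: composite one ray's samples into a pixel color.
import torch

def composite_ray(occ, albedo, illum):
    """occ: (N,) in [0,1]; albedo: (N, 3); illum: (N, 1) -> RGB color (3,)."""
    shading = albedo * illum                        # per-point coloring info
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - occ[:-1]]), dim=0)  # (N,)
    weights = trans * occ                           # per-sample weight
    return (weights[:, None] * shading).sum(dim=0)  # pixel color value
```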
Optionally, in the embodiments of the present application, the parameterized model, the occupancy fields of the bones of the hand and the hand coloring field may be integrated into a unified system for end-to-end training.
Illustratively, during end-to-end training, the training may be constrained by the following losses: the intersection-over-union (IoU) loss between the hand silhouette generated by the parameterized model and the ground-truth image in the training sample, the L1 loss between the rendered image and the ground-truth image in the training sample, and the perceptual loss between the rendered image and the ground-truth image in the training video. The ground-truth image here may be a binary image rendered from the hand mesh model.
Illustratively, during training, the hand gesture information may be extracted from the target video of single-view hand images by a mature hand pose estimation method, and the hand silhouette may be extracted at the same time.
It should be noted that the advantage of a high-resolution mesh is that vertex displacements can be used to fit personalized geometry. Therefore, during training, the hand mesh model is rendered into a binary image by a differentiable rendering method, and this binary image is constrained to be as similar as possible to the hand silhouette.
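Illustratively, the combined objective can be sketched as follows. The loss weights and the use of a VGG-style feature distance for the perceptual term are assumptions, as the application does not specify them.

```python
# Hedged sketch of the end-to-end training objective: IoU silhouette
# loss + L1 image loss + perceptual loss, with assumed weights.
import torch

def iou_loss(pred_mask, gt_mask, eps=1e-6):
    inter = (pred_mask * gt_mask).sum()
    union = (pred_mask + gt_mask - pred_mask * gt_mask).sum()
    return 1.0 - inter / (union + eps)

def total_loss(pred_img, gt_img, pred_mask, gt_mask, perceptual_loss,
               w_iou=1.0, w_l1=1.0, w_perc=0.1):
    # perceptual_loss is a stand-in callable, e.g. a VGG feature distance.
    l_iou = iou_loss(pred_mask, gt_mask)
    l_l1 = torch.abs(pred_img - gt_img).mean()
    l_perc = perceptual_loss(pred_img, gt_img)
    return w_iou * l_iou + w_l1 * l_l1 + w_perc * l_perc
```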
According to the hand image generation method provided by the application, a target video containing single-view hand images is first acquired, and hand gesture information is extracted from the target video by a pose estimation method. The hand gesture information is then input into a parameterized model to obtain a hand mesh model, and an occupancy field for each bone of the hand and a hand coloring field are constructed based on the hand mesh model; for any target pixel of the hand image, the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction is predicted based on the occupancy field of each bone of the hand, and the albedo information and the illumination information of each spatial coordinate point are predicted based on the hand coloring field. Finally, the color value of each pixel is calculated based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point, and the hand image is synthesized by a volume rendering method based on the color value of each pixel. In this way, the geometry and texture under any pose can be predicted by the neural network, pose-dependent illumination information is generated at the same time, and the image is generated by volume rendering, which greatly reduces the rendering cost of hand images and improves their realism.
In the hand image generation method provided in the embodiments of the present application, the execution subject may be a hand image generation device, or a control module of the hand image generation device for executing the hand image generation method. In the embodiments of the present application, the hand image generation device is described by taking as an example a hand image generation device executing the hand image generation method.
The hand image generation method is illustrated with reference to the drawings in combination with the embodiments of the present application. In specific implementations, the hand image generation method shown in the above method drawings may also be implemented in combination with any other combinable drawing illustrated in the above embodiments, and details are not repeated here.
The hand image generation device provided in the present application is described below; the hand image generation device described below and the hand image generation method described above may be referred to in correspondence with each other.
Fig. 4 is a schematic structural diagram of a hand image generation device according to an embodiment of the present application. As shown in fig. 4, the device specifically includes:
an acquisition module 401 for acquiring hand gesture information; an execution module 402 for inputting the hand gesture information into a parameterized model to obtain a hand mesh model, and constructing an occupancy field for each bone of the hand and a hand coloring field based on the hand mesh model; a prediction module 403 for predicting, for any target pixel of the hand image, the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction based on the occupancy field of each bone of the hand, and predicting the albedo information and the illumination information of each spatial coordinate point based on the hand coloring field; a calculation module 404 for calculating the color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point; and an image rendering module 405 for synthesizing the hand image by a volume rendering method based on the color value of each pixel.
Optionally, the device further includes a sampling module. The sampling module is used to part-wise sample the vertices of the hand mesh model according to hand activity information and preset hand bone division information to obtain a first point cloud. The execution module 402 is specifically used to input the position information of the sampling points corresponding to a target bone in the first point cloud into the part geometry encoder corresponding to the target bone to obtain the part geometry code corresponding to the target bone, where each bone corresponds to one part geometry encoder; and is further used to concatenate the part geometry codes corresponding to adjacent bones and input the result into a local pair decoder to obtain the occupancy field of each bone of the hand. The preset hand bone division information divides the hand into a plurality of bones; the target bone is any one of the plurality of bones.
Optionally, the execution module 402 is further used to take the maximum over the occupancy fields of the bones of the hand to obtain the hand occupancy field of the whole hand, and to determine a target relationship based on the hand occupancy field, the target relationship being the relative positional relationship between each spatial coordinate point and the geometric space of the hand. The prediction module 403 is specifically used to determine the occupancy information of a plurality of sampling points in the first point cloud adjacent to each spatial coordinate point based on the hand occupancy field, and to predict the occupancy information of each spatial coordinate point based on the target relationship and the occupancy information of those adjacent sampling points.
Optionally, the execution module 402 is specifically used to construct the hand coloring field according to the hand mesh model and the occupancy field of each bone of the hand. The hand coloring field is used to predict the coloring information of each spatial coordinate point; the coloring information is the product of the illumination information and the albedo information of each spatial coordinate point.
Optionally, the sampling module is further used to sample the barycenter of each mesh face of the hand mesh model to obtain a second point cloud. The calculation module 404 is further used to calculate the position code and the albedo code of each spatial coordinate point through a point interpolation algorithm based on the position information, the position code and the albedo code of each of a plurality of sampling points in the second point cloud adjacent to each spatial coordinate point. The prediction module 403 is specifically further used to input the target relationship, the position code of each spatial coordinate point, the occupancy value of each spatial coordinate point and the hand gesture information into a first multi-layer perceptron to predict the illumination information of each spatial coordinate point, and to input the albedo code of each spatial coordinate point into a second multi-layer perceptron to predict the albedo information of each spatial coordinate point. The target relationship is the relative positional relationship between each spatial coordinate point and the geometric space of the hand; the position code of a sampling point describes the relative positional relationship between the sampling point and the hand surface; the position code of a spatial coordinate point describes the relative positional relationship between the spatial coordinate point and the hand surface.
Optionally, the image rendering module 405 is specifically used to perform an integral operation over the plurality of spatial coordinate points based on geometric information of the hand geometric space to obtain the color value of the target pixel.
Optionally, the parameterized model, the occupancy fields of the bones of the hand and the hand coloring field are integrated into a unified system for end-to-end training. The training is constrained by the following losses: the intersection-over-union loss between the hand silhouette generated by the parameterized model and the ground-truth image in the training sample, the L1 loss between the rendered image and the ground-truth image in the training sample, and the perceptual loss between the rendered image and the ground-truth image in the training video.
The hand image generation device provided by the application first acquires a target video containing single-view hand images and extracts hand gesture information from the target video by a pose estimation method. The hand gesture information is then input into a parameterized model to obtain a hand mesh model, and an occupancy field for each bone of the hand and a hand coloring field are constructed based on the hand mesh model; for any target pixel of the hand image, the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction is predicted based on the occupancy field of each bone of the hand, and the albedo information and the illumination information of each spatial coordinate point are predicted based on the hand coloring field. Finally, the color value of each pixel is calculated based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point, and the hand image is synthesized by a volume rendering method based on the color value of each pixel. In this way, the geometry and texture under any pose can be predicted by the neural network, pose-dependent illumination information is generated at the same time, and the image is generated by volume rendering, which greatly reduces the rendering cost of hand images and improves their realism.
Fig. 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 5, the electronic device may include: a processor 510, a communication interface (Communications Interface) 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the hand image generation method, which includes: acquiring hand gesture information; inputting the hand gesture information into a parameterized model to obtain a hand mesh model, and constructing an occupancy field for each bone of the hand and a hand coloring field based on the hand mesh model; for any target pixel of a hand image, predicting the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction based on the occupancy field of each bone of the hand, and predicting the albedo information and the illumination information of each spatial coordinate point based on the hand coloring field; and calculating a color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point, and synthesizing the hand image by a volume rendering method based on the color value of each pixel.
Further, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially, or in the part contributing to the prior art, in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present application also provides a computer program product including a computer program stored on a computer-readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to perform the hand image generation method provided by the methods described above, the method including: acquiring hand gesture information; inputting the hand gesture information into a parameterized model to obtain a hand mesh model, and constructing an occupancy field for each bone of the hand and a hand coloring field based on the hand mesh model; for any target pixel of a hand image, predicting the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction based on the occupancy field of each bone of the hand, and predicting the albedo information and the illumination information of each spatial coordinate point based on the hand coloring field; and calculating a color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point, and synthesizing the hand image by a volume rendering method based on the color value of each pixel.
In yet another aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the hand image generation method provided above, the method including: acquiring hand gesture information; inputting the hand gesture information into a parameterized model to obtain a hand mesh model, and constructing an occupancy field for each bone of the hand and a hand coloring field based on the hand mesh model; for any target pixel of a hand image, predicting the occupancy information of each of a plurality of spatial coordinate points sampled along the observation direction based on the occupancy field of each bone of the hand, and predicting the albedo information and the illumination information of each spatial coordinate point based on the hand coloring field; and calculating a color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point, and synthesizing the hand image by a volume rendering method based on the color value of each pixel.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the above technical solution may be embodied essentially, or in the part contributing to the prior art, in the form of a software product. The software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A hand image generation method, comprising:
acquiring hand gesture information;
inputting the hand gesture information into a parameterized model to obtain a hand mesh model, and constructing occupation fields of the bones of the hand and a hand coloring field based on the hand mesh model;
predicting, based on the occupation field of each bone of the hand, occupancy information for each of a plurality of spatial coordinate points sampled along the observation direction of any target pixel of a hand image, and predicting albedo information and illumination information for each spatial coordinate point based on the hand coloring field;
and calculating a color value for each pixel based on the occupancy value of each spatial coordinate point, the albedo information of each spatial coordinate point and the illumination information of each spatial coordinate point, and synthesizing the hand image by a volume rendering method based on the color value of each pixel.
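For illustration only and not as part of the claims, the per-pixel computation recited in claim 1 can be sketched as follows. This is a minimal volume-rendering sketch under assumed interfaces: the three predictor callables stand in for queries of the occupation fields and the hand coloring field, and the sampling range along the ray is an arbitrary assumption.

```python
import torch

def render_pixel_color(ray_origin, ray_dir, predict_occupancy,
                       predict_albedo, predict_illumination, n_samples=64):
    """Hypothetical sketch of claim 1's last two steps for one target pixel."""
    # Sample spatial coordinate points along the observation direction.
    t = torch.linspace(0.1, 2.0, n_samples).unsqueeze(-1)         # (N, 1)
    points = ray_origin + t * ray_dir                             # (N, 3)

    occ = predict_occupancy(points).clamp(0.0, 1.0)               # (N,)
    albedo = predict_albedo(points)                               # (N, 3)
    illum = predict_illumination(points)                          # (N, 1)
    color = albedo * illum          # coloring = albedo x illumination

    # Volume rendering: transmittance up to each sample times its opacity.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - occ[:-1]]), dim=0)
    weights = trans * occ                                         # (N,)
    return (weights.unsqueeze(-1) * color).sum(dim=0)             # RGB (3,)
```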
2. The method of claim 1, wherein constructing the occupation fields of the bones of the hand based on the hand mesh model comprises:
sampling a subset of the vertices of the hand mesh model according to hand activity information and preset hand skeleton division information, to obtain a first point cloud;
inputting the position information of the sampling points corresponding to a target bone in the first point cloud into the part geometry encoder corresponding to the target bone, to obtain the part geometry code corresponding to the target bone, wherein each bone corresponds to one part geometry encoder;
concatenating the part geometry codes corresponding to adjacent bones and inputting the result into a local pair decoder, to obtain the occupation fields of the bones of the hand;
wherein the preset hand skeleton division information is used for dividing the hand into a plurality of bones, and the target bone is any one of the plurality of bones.
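A hedged sketch of the encoder/decoder structure in claim 2 follows, for illustration only. The network widths, the PointNet-style max pooling, and the code dimension are assumptions not taken from the patent.

```python
import torch
import torch.nn as nn

class PartGeometryEncoder(nn.Module):
    """Encodes the first-point-cloud samples belonging to one bone
    (one encoder per bone, as claim 2 requires)."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                 nn.Linear(128, code_dim))

    def forward(self, points):                     # points: (M, 3)
        return self.mlp(points).max(dim=0).values  # pooled part geometry code

class LocalPairDecoder(nn.Module):
    """Maps the concatenated codes of two adjacent bones, plus a query
    point, to an occupancy value for that point."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * code_dim + 3, 128), nn.ReLU(),
                                 nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, code_a, code_b, query):      # query: (3,)
        return self.mlp(torch.cat([code_a, code_b, query]))
```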
3. The method of claim 2, wherein predicting occupancy information for each of a plurality of spatial coordinate points sampled along the observation direction of any target pixel of the hand image based on the occupation fields of the bones of the hand comprises:
taking the maximum value over the occupation fields of the bones of the hand to obtain the hand occupation field of the whole hand, and determining a target relationship based on the hand occupation field, the target relationship being the relative positional relationship between each spatial coordinate point and the geometric space of the hand;
and determining, based on the hand occupation field, the occupancy information of a plurality of sampling points in the first point cloud adjacent to each spatial coordinate point, and predicting the occupancy information of each spatial coordinate point based on the target relationship and the occupancy information of the adjacent sampling points.
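For illustration, the two operations of claim 3 (a per-point maximum over the bone occupation fields, and interpolating occupancy from neighboring first-point-cloud samples) might look like the following sketch; inverse-distance weighting and k = 8 neighbors are assumptions.

```python
import torch

def hand_occupancy(point, bone_fields):
    """Whole-hand occupation field: maximum over the per-bone fields,
    each assumed queryable as a callable on a 3D point."""
    return torch.stack([field(point) for field in bone_fields]).max()

def interpolate_occupancy(query, cloud_xyz, cloud_occ, k=8):
    """Predict a spatial coordinate point's occupancy from the occupancy
    of its k nearest sampling points in the first point cloud."""
    dists = torch.cdist(query.unsqueeze(0), cloud_xyz).squeeze(0)  # (P,)
    knn = dists.topk(k, largest=False)
    w = 1.0 / (knn.values + 1e-8)        # inverse-distance weights
    w = w / w.sum()
    return (w * cloud_occ[knn.indices]).sum()
```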
4. The method of claim 2, wherein constructing the hand coloring field based on the hand mesh model comprises:
constructing the hand coloring field according to the hand mesh model and the occupation fields of the bones of the hand;
wherein the hand coloring field is used for predicting coloring information for each spatial coordinate point, the coloring information being obtained by multiplying the illumination information of each spatial coordinate point by the albedo information of each spatial coordinate point.
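The multiplicative shading model of claim 4 reduces to an element-wise product per spatial coordinate point; a one-line sketch:

```python
def coloring(illumination, albedo):
    # Per-point shading as recited in claim 4; inputs are tensors
    # of matching shape (e.g. RGB triples).
    return illumination * albedo
```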
5. The method of claim 4, wherein predicting the albedo information of each spatial coordinate point and the illumination information of each spatial coordinate point based on the hand coloring field comprises:
sampling the barycenter of each face of the hand mesh model to obtain a second point cloud;
calculating the position code and the albedo code of each spatial coordinate point through a point interpolation algorithm, based on the position information, position codes and albedo codes of a plurality of sampling points in the second point cloud adjacent to each spatial coordinate point;
inputting the target relationship, the position code of each spatial coordinate point, the occupancy value of each spatial coordinate point and the hand gesture information into a first multi-layer perceptron to predict the illumination information of each spatial coordinate point, and inputting the albedo code of each spatial coordinate point into a second multi-layer perceptron to predict the albedo information of each spatial coordinate point;
wherein the target relationship is the relative positional relationship between each spatial coordinate point and the geometric space of the hand, the position code of a sampling point describes the relative positional relationship between the sampling point and the hand surface, and the position code of a spatial coordinate point describes the relative positional relationship between the spatial coordinate point and the hand surface.
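A sketch of claim 5's two-branch prediction, for illustration only. The code dimensions (32-dimensional position and albedo codes, a 45-dimensional gesture vector, scalar target relation and occupancy) are assumptions, as are the inverse-distance point interpolation and the MLP widths.

```python
import torch
import torch.nn as nn

def face_barycenters(verts, faces):
    """Second point cloud: barycenter of each face of the hand mesh model.
    verts: (V, 3) float tensor; faces: (F, 3) long tensor of vertex indices."""
    return verts[faces].mean(dim=1)                         # (F, 3)

def interpolate_codes(query, cloud_xyz, cloud_codes, k=8):
    """Point interpolation of per-sample codes to the query point."""
    d = torch.cdist(query.unsqueeze(0), cloud_xyz).squeeze(0)
    knn = d.topk(k, largest=False)
    w = 1.0 / (knn.values + 1e-8)
    w = (w / w.sum()).unsqueeze(-1)
    return (w * cloud_codes[knn.indices]).sum(dim=0)        # (C,)

# First MLP: target relation (1) + position code (32) + occupancy (1)
# + hand gesture vector (45) -> illumination.
illumination_mlp = nn.Sequential(nn.Linear(79, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
# Second MLP: albedo code (32) -> RGB albedo.
albedo_mlp = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                           nn.Linear(128, 3))
```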
6. The method of claim 1, wherein calculating the color value of each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point comprises:
performing an integral operation over the plurality of spatial coordinate points based on geometric information of the geometric space of the hand, to obtain the color value of the target pixel.
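One standard discrete realization of this integral, stated here as an assumption consistent with the volume rendering of claim 1 rather than as the patent's exact formula, accumulates occupancy-weighted shading along the ray:

```latex
C(p) \;=\; \sum_{i=1}^{N} \Big(\prod_{j<i} \big(1 - o_j\big)\Big)\, o_i \,\big(a_i \odot l_i\big)
```

where $o_i$, $a_i$ and $l_i$ denote the occupancy value, albedo information and illumination information of the $i$-th sampled spatial coordinate point.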
7. The method of claim 1, wherein the parameterized model, the occupation fields of the bones of the hand and the hand coloring field are integrated into a unified system for end-to-end training;
wherein the training process of the end-to-end training is constrained by the following: an intersection-over-union (IoU) loss between the hand silhouette generated by the parameterized model and the ground-truth image in the training sample, an L1 loss between the rendered image and the ground-truth image in the training sample, and a perceptual loss between the rendered image and the ground-truth image in the training video.
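For illustration, the three constraints of claim 7 might be combined as below; the loss weights and the perceptual backbone (left as a callable) are assumptions, not values from the patent.

```python
import torch.nn.functional as F

def silhouette_iou_loss(pred_mask, gt_mask, eps=1e-6):
    """Intersection-over-union loss between soft silhouettes in [0, 1]."""
    inter = (pred_mask * gt_mask).sum()
    union = (pred_mask + gt_mask - pred_mask * gt_mask).sum()
    return 1.0 - inter / (union + eps)

def total_loss(pred_mask, gt_mask, rendered, gt_image, perceptual_fn,
               w_iou=1.0, w_l1=1.0, w_perc=0.1):
    return (w_iou * silhouette_iou_loss(pred_mask, gt_mask)
            + w_l1 * F.l1_loss(rendered, gt_image)
            + w_perc * perceptual_fn(rendered, gt_image))
```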
8. A hand image generation apparatus, comprising:
an acquisition module, configured to acquire hand gesture information;
an execution module, configured to input the hand gesture information into a parameterized model to obtain a hand mesh model, and to construct occupation fields of the bones of the hand and a hand coloring field based on the hand mesh model;
a prediction module, configured to predict, based on the occupation field of each bone of the hand, occupancy information for each of a plurality of spatial coordinate points sampled along the observation direction of any target pixel of the hand image, and to predict albedo information and illumination information for each spatial coordinate point based on the hand coloring field;
a calculation module, configured to calculate a color value for each pixel based on the occupancy value, the albedo information and the illumination information of each spatial coordinate point;
and an image rendering module, configured to synthesize the hand image by a volume rendering method based on the color value of each pixel.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the hand image generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the hand image generation method according to any one of claims 1 to 7.
CN202310487866.9A 2023-05-04 2023-05-04 Hand image generation method and device, electronic equipment and readable storage medium Withdrawn CN116188704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310487866.9A CN116188704A (en) 2023-05-04 2023-05-04 Hand image generation method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116188704A 2023-05-30

Family

ID=86436920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310487866.9A Withdrawn CN116188704A (en) 2023-05-04 2023-05-04 Hand image generation method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116188704A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021195916A1 (en) * 2020-03-31 2021-10-07 西门子股份公司 Dynamic hand simulation method, apparatus and system
CN113362452A (en) * 2021-06-07 2021-09-07 中南大学 Hand gesture three-dimensional reconstruction method and device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XINGXU CHEN et al.: "Hand Avatar: Free-Pose Hand Animation and Rendering from Monocular Video", pages 3-4, retrieved from the Internet <URL:https://arxiv.org/abs/2211.12782> *

Similar Documents

Publication Publication Date Title
CN109255831B (en) Single-view face three-dimensional reconstruction and texture generation method based on multi-task learning
CN110390638B (en) High-resolution three-dimensional voxel model reconstruction method
CN111553968B (en) Method for reconstructing animation of three-dimensional human body
CN108898630A (en) A kind of three-dimensional rebuilding method, device, equipment and storage medium
CN114782634B (en) Monocular image dressing human body reconstruction method and system based on surface hidden function
CN110599395A (en) Target image generation method, device, server and storage medium
CN110517352B (en) Three-dimensional reconstruction method, storage medium, terminal and system of object
CN113762147B (en) Facial expression migration method and device, electronic equipment and storage medium
CN116310076A (en) Three-dimensional reconstruction method, device, equipment and storage medium based on nerve radiation field
CN110210204B (en) Verification code generation method and device, storage medium and electronic equipment
CN106127818A (en) A kind of material appearance based on single image obtains system and method
CN116778045A (en) Digital human generation method, system and device for nerve radiation field
CN115115805A (en) Training method, device and equipment for three-dimensional reconstruction model and storage medium
CN116778063A (en) Rapid virtual viewpoint synthesis method and device based on characteristic texture grid and hash coding
CN114758070A (en) Single-image three-dimensional human body fine reconstruction method based on cross-domain multitask
CN116134491A (en) Multi-view neuro-human prediction using implicit differentiable renderers for facial expression, body posture morphology, and clothing performance capture
CN114429531A (en) Virtual viewpoint image generation method and device
CN116993926B (en) Single-view human body three-dimensional reconstruction method
CN109658508A (en) A kind of landform synthetic method of multiple dimensioned details fusion
CN117218300A (en) Three-dimensional model construction method, three-dimensional model construction training method and device
CN116188704A (en) Hand image generation method and device, electronic equipment and readable storage medium
CN116863069A (en) Three-dimensional light field face content generation method, electronic equipment and storage medium
Kim et al. Deep Transformer based Video Inpainting Using Fast Fourier Tokenization
CN116385577A (en) Virtual viewpoint image generation method and device
US20230104702A1 (en) Transformer-based shape models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20230530)