CN115994966B - Multi-view image generation method and device, readable storage medium and electronic equipment - Google Patents



Publication number
CN115994966B
Authority
CN
China
Prior art keywords
image
dimensional
granularity
coarse
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310283837.0A
Other languages
Chinese (zh)
Other versions
CN115994966A (en)
Inventor
邓誉
王宝元
沈向洋
Current Assignee
Beijing Hongmian Xiaoice Technology Co Ltd
Original Assignee
Beijing Hongmian Xiaoice Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Hongmian Xiaoice Technology Co Ltd filed Critical Beijing Hongmian Xiaoice Technology Co Ltd
Priority to CN202310283837.0A
Publication of CN115994966A
Application granted
Publication of CN115994966B
Legal status: Active


Abstract

The application provides a multi-view image generation method and apparatus, a readable storage medium, and an electronic device, relating to the field of computer technology. The method comprises: acquiring a first image containing a target person at an arbitrary viewing angle; inputting the first image into a target model to obtain a coarse-grained three-dimensional representation of the first image, and inputting the first image into a fine-grained image encoder to obtain three-dimensional detail features of the first image; and combining the coarse-grained three-dimensional representation with the three-dimensional detail features to obtain a fine-grained three-dimensional representation of the first image, and rendering, from the fine-grained three-dimensional representation, a second image containing the target person at a given camera viewing angle. The method and apparatus enrich the reconstructed details of the input image and ensure three-dimensional consistency across the generated multi-view images.

Description

Multi-view image generation method and device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a multi-view image, a readable storage medium, and an electronic device.
Background
With the development of virtual-human technology, photorealistic portrait generation at freely chosen camera viewpoints has become important for producing virtual-human video content. In the related art, new images of a given person under different camera viewpoints can be generated with a generative adversarial network (GAN).
However, multi-view image generation methods in the related art cannot accurately reconstruct the detail information of a given person image, so the person image generated at a new camera viewpoint is inconsistent with the details of the input image and has poor fidelity.
There is therefore an urgent need for a multi-view image generation method that accurately reconstructs the details of a given image and generates images efficiently.
Disclosure of Invention
The purpose of this application is to provide a multi-view image generation method and apparatus, a readable storage medium, and an electronic device that enrich the reconstructed details of the input image and ensure three-dimensional consistency across the generated multi-view images.
The application provides a multi-view image generation method, which comprises the following steps:
acquiring a first image containing a target person at an arbitrary viewing angle; inputting the first image into a target model to obtain a coarse-grained three-dimensional representation of the first image, and inputting the first image into a fine-grained image encoder to obtain three-dimensional detail features of the first image; combining the coarse-grained three-dimensional representation with the three-dimensional detail features to obtain a fine-grained three-dimensional representation of the first image, and rendering, from the fine-grained three-dimensional representation, a second image containing the target person at a given camera viewing angle; wherein the target model comprises a coarse-grained image encoder and a three-dimensional generative adversarial network (3D GAN); the coarse-grained image encoder maps the first image to a latent-space vector of an intermediate layer of the three-dimensional generative adversarial network; and the three-dimensional generative adversarial network obtains the coarse-grained three-dimensional representation from the latent-space vector output by the coarse-grained image encoder.
Optionally, before acquiring the first image containing the target person at an arbitrary viewing angle, the method further includes cyclically executing the following steps until pre-training of the three-dimensional generative adversarial network is complete: inputting multi-dimensional Gaussian noise into the three-dimensional generative adversarial network to output a three-dimensional representation of a person image, and rendering the three-dimensional representation into a two-dimensional person image at a random camera viewing angle using a differentiable rendering method matched to that representation; judging, with a discriminator, whether the two-dimensional person image at the random camera viewing angle is a real image, and adjusting the parameters of the three-dimensional generative adversarial network according to the judgment; wherein the discriminator judges the two-dimensional person image at the random camera viewing angle against real two-dimensional person images.
Optionally, after pre-training of the three-dimensional generative adversarial network is complete, the method further includes: training the coarse-grained image encoder with real two-dimensional person images as its input, such that when the latent-space vector output by the coarse-grained image encoder is fed into the three-dimensional generative adversarial network, the three-dimensional representation output by the network can be rendered to reconstruct the encoder's input image; wherein the latent-space vector output by the coarse-grained image encoder is the vector into which the encoder maps its input image, in the latent space of an intermediate layer of the network; the first loss function used to train the coarse-grained image encoder comprises at least one of the target loss terms; the target loss terms include: the pixel difference between the input image and the reconstructed image, a perceptual loss, the difference in person features between the input image and the reconstructed image, and the difference between the latent-space vector output by the coarse-grained image encoder and the mean latent-space vector of the network; the mean latent-space vector is the average of multiple latent-space vectors output by the intermediate layer during pre-training of the network.
Optionally, after training the coarse-grained image encoder on real two-dimensional person images, the method further includes: fine-tuning the model parameters of the three-dimensional generative adversarial network based on the trained coarse-grained image encoder; wherein the fine-tuning uses real two-dimensional person images as training data; the second loss function used for fine-tuning comprises at least one of the target loss terms, and/or a similarity loss between the two-dimensional person images generated by the network during pre-training and the network's generation results before fine-tuning; the generation results before fine-tuning are the two-dimensional person images rendered from the three-dimensional representations output by the network.
Optionally, after training the coarse-grained image encoder on real two-dimensional person images, the method further includes training the fine-grained image encoder by cyclically executing the following steps: taking a real two-dimensional person image as the input of the fine-grained image encoder, extracting the three-dimensional detail features of the input image with the fine-grained image encoder, and combining the extracted three-dimensional detail features with the three-dimensional representation that the three-dimensional generative adversarial network outputs for the same input image, to obtain a fine-grained three-dimensional representation; rendering a two-dimensional person image at an arbitrary camera viewing angle from the fine-grained three-dimensional representation, and adjusting the model parameters of the fine-grained image encoder according to the loss value of a third loss function; wherein the third loss function comprises at least one of the target loss terms; during training of the fine-grained image encoder, the plausibility of a fourth image is constrained by a third image; the third image is the two-dimensional person image rendered from the three-dimensional representation obtained after the latent-space vector mapped by the coarse-grained image encoder is fed into the three-dimensional generative adversarial network; the fourth image is the two-dimensional person image rendered from the fine-grained three-dimensional representation; and the third and fourth images are obtained from the same input image.
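As a concrete illustration of the constraint above, the sketch below combines a reconstruction term with a coarse-to-fine consistency penalty. The L2 form of both terms, the function name, and the weights are assumptions made for illustration; the text above only specifies that the fourth image's plausibility is constrained by the third image.

```python
import numpy as np

def fine_training_loss(input_img, fourth_img, third_img, w_rec=1.0, w_consist=0.5):
    """Sketch of a fine-grained encoder objective (assumed form): reconstruct
    the input image (one of the target loss terms) while keeping the
    fine-grained render (fourth image) close to the coarse render (third
    image) obtained from the same input."""
    l_rec = np.mean((input_img - fourth_img) ** 2)       # reconstruction term
    l_consist = np.mean((fourth_img - third_img) ** 2)   # coarse-to-fine consistency
    return w_rec * l_rec + w_consist * l_consist

x = np.full((3, 8, 8), 0.5)
loss_same = fine_training_loss(x, x, x)          # perfect reconstruction, no drift
loss_drift = fine_training_loss(x, x + 1.0, x)   # fine render drifts from both
```

With these assumed weights, a fine render that drifts from both the input and the coarse render is penalized on both terms, which is the effect the constraint is meant to have.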
The application also provides a multi-view image generation device, comprising:
the acquisition module is configured to acquire a first image containing a target person at an arbitrary viewing angle; the processing module is configured to input the first image into a target model to obtain a coarse-grained three-dimensional representation of the first image, and to input the first image into a fine-grained image encoder to obtain three-dimensional detail features of the first image; the generation module is configured to combine the coarse-grained three-dimensional representation with the three-dimensional detail features to obtain a fine-grained three-dimensional representation of the first image, and to render, from the fine-grained three-dimensional representation, a second image containing the target person at a given camera viewing angle; wherein the target model comprises a coarse-grained image encoder and a three-dimensional generative adversarial network; the coarse-grained image encoder maps the first image to a latent-space vector of an intermediate layer of the three-dimensional generative adversarial network; and the three-dimensional generative adversarial network obtains the coarse-grained three-dimensional representation from the latent-space vector output by the coarse-grained image encoder.
Optionally, the apparatus further comprises a training module. The training module is configured to input multi-dimensional Gaussian noise into the three-dimensional generative adversarial network to output a three-dimensional representation of a person image, and to render the three-dimensional representation into a two-dimensional person image at a random camera viewing angle using a differentiable rendering method matched to that representation; the training module is further configured to judge, with a discriminator, whether the two-dimensional person image at the random camera viewing angle is a real image, and to adjust the parameters of the three-dimensional generative adversarial network according to the judgment; wherein the discriminator judges the two-dimensional person image at the random camera viewing angle against real two-dimensional person images.
Optionally, the training module is configured to train the coarse-grained image encoder with real two-dimensional person images as its input, such that when the latent-space vector output by the coarse-grained image encoder is fed into the three-dimensional generative adversarial network, the three-dimensional representation output by the network can be rendered to reconstruct the encoder's input image; wherein the latent-space vector output by the coarse-grained image encoder is the vector into which the encoder maps its input image, in the latent space of an intermediate layer of the network; the first loss function used to train the coarse-grained image encoder comprises at least one of the target loss terms; the target loss terms include: the pixel difference between the input image and the reconstructed image, a perceptual loss, the difference in person features between the input image and the reconstructed image, and the difference between the latent-space vector output by the coarse-grained image encoder and the mean latent-space vector of the network; the mean latent-space vector is the average of multiple latent-space vectors output by the intermediate layer during pre-training of the network.
Optionally, the training module is further configured to fine-tune the model parameters of the three-dimensional generative adversarial network based on the trained coarse-grained image encoder; wherein the fine-tuning uses real two-dimensional person images as training data; the second loss function used for fine-tuning comprises at least one of the target loss terms, and/or a similarity loss between the two-dimensional person images generated by the network during pre-training and the network's generation results before fine-tuning; the generation results before fine-tuning are the two-dimensional person images rendered from the three-dimensional representations output by the network.
Optionally, the training module is further configured to take a real two-dimensional person image as the input of the fine-grained image encoder, extract the three-dimensional detail features of the input image with the fine-grained image encoder, and combine the extracted three-dimensional detail features with the three-dimensional representation that the three-dimensional generative adversarial network outputs for the same input image, to obtain a fine-grained three-dimensional representation; the training module is further configured to render a two-dimensional person image at an arbitrary camera viewing angle from the fine-grained three-dimensional representation, and to adjust the model parameters of the fine-grained image encoder according to the loss value of a third loss function; wherein the third loss function comprises at least one of the target loss terms; during training of the fine-grained image encoder, the plausibility of a fourth image is constrained by a third image; the third image is the two-dimensional person image rendered from the three-dimensional representation obtained after the latent-space vector mapped by the coarse-grained image encoder is fed into the network; the fourth image is the two-dimensional person image rendered from the fine-grained three-dimensional representation; and the third and fourth images are obtained from the same input image.
The present application also provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the steps of the multi-view image generation method described in any of the above.
The application also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the multi-view image generation method as described in any one of the above when executing the program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a multi-view image generation method as described in any of the above.
The multi-view image generation method and apparatus, readable storage medium, and electronic device provided by the present application first acquire a first image containing a target person at an arbitrary viewing angle; then input the first image into a target model to obtain a coarse-grained three-dimensional representation of the first image, and input the first image into a fine-grained image encoder to obtain three-dimensional detail features of the first image; and finally combine the coarse-grained three-dimensional representation with the three-dimensional detail features to obtain a fine-grained three-dimensional representation of the first image, and render, from the fine-grained three-dimensional representation, a second image containing the target person at a given camera viewing angle. An image of the person at any given camera viewing angle can thus be obtained from a single person image, which enriches the reconstructed details of the input image and ensures three-dimensional consistency across the generated multi-view images.
Drawings
To illustrate the technical solutions of the present application or the prior art more clearly, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a multi-view image generation method model framework structure provided by the application;
FIG. 2 is a flow chart of a multi-view image generation method provided by the present application;
fig. 3 is a schematic structural diagram of a multi-view image generating apparatus provided in the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described below clearly and completely with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without inventive effort fall within the scope of the present application.
The terms "first", "second", and the like in the description and claims distinguish between similar objects and do not necessarily describe a particular sequence or chronological order. It should be understood that data so labeled may be interchanged where appropriate, so that embodiments of the present application can be implemented in orders other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, without limiting their number; for example, the first object may be one or more objects. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
A three-dimensional generative adversarial network (3D GAN) combines a neural radiance field (NeRF) with a generative adversarial network (GAN). By introducing a neural radiance field into the generative adversarial network, a 3D GAN retains the strong realism of a 2D GAN, and as long as the geometry of the constructed neural radiance field is reasonable, a 3D GAN can in principle render virtual-human images from any viewing angle.
A 3D-GAN-based multi-view image generation method reconstructs a three-dimensional representation of a given image for image generation and offers good three-dimensional consistency across generated multi-view images. However, the three-dimensional representation obtained by this scheme cannot accurately reconstruct the detail information of the given image, so the generated novel-view image is inconsistent with the details of the original image and has poor fidelity.
To address these technical problems in the related art, the present application provides a multi-view image generation method that guarantees strict three-dimensional consistency across generated multi-view images while accurately reconstructing the details of a given image. At inference time, the method generates high-quality multi-view images with a single forward pass through the network, effectively improving image generation efficiency.
The multi-view image generation method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
As shown in fig. 1, which is a schematic diagram of the model framework used by the multi-view image generation method of the embodiments of the present application, the framework includes: a coarse-grained image encoder and a three-dimensional generative adversarial network (i.e., the 3D GAN), which together obtain a coarse-grained three-dimensional representation of the input image; and a fine-grained image encoder, which obtains three-dimensional detail features of the input image. After the coarse-grained three-dimensional representation is combined with the three-dimensional detail features, a fine-grained three-dimensional representation of the input image is obtained, from which a person image at any specified camera viewing angle can be rendered.
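The dataflow of the framework in fig. 1 can be sketched as follows. Every stub, name, and tensor size below is a hypothetical placeholder for a trained model (a real system would use learned encoders and a 3D GAN generator); only the wiring between components follows the description above.

```python
import numpy as np

# Hypothetical stand-ins for the trained networks in fig. 1; each stub
# returns random values in place of learned outputs.
rng = np.random.default_rng(0)
LATENT_DIM, FEAT_SHAPE = 64, (3, 32, 32, 8)  # assumed sizes

def coarse_encoder(image):
    """Map an input image to a latent vector of the 3D GAN's intermediate layer."""
    return rng.standard_normal(LATENT_DIM)   # placeholder for a ResNet/FPN encoder

def gan_generator(latent):
    """3D GAN generator: latent vector -> coarse-grained 3D representation."""
    return rng.standard_normal(FEAT_SHAPE)   # e.g. a radiance-field feature volume

def fine_encoder(image):
    """Extract 3D detail features directly from the input image."""
    return 0.1 * rng.standard_normal(FEAT_SHAPE)

def combine(coarse_rep, detail):
    """Fuse coarse representation with detail features (additive fusion assumed)."""
    return coarse_rep + detail

def render(rep, camera_pose):
    """Differentiable-renderer placeholder: 3D representation + camera -> 2D image."""
    return rep.mean(axis=-1)                 # collapse the depth axis as a mock projection

first_image = np.zeros((3, 256, 256))
fine_rep = combine(gan_generator(coarse_encoder(first_image)), fine_encoder(first_image))
second_image = render(fine_rep, camera_pose=np.eye(4))
print(second_image.shape)  # (3, 32, 32)
```

The point of the sketch is the topology: the input image feeds two encoders in parallel, their outputs are fused, and only the fused representation is rendered at the requested camera pose.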
As shown in fig. 2, based on the model framework shown in fig. 1, a multi-view image generating method provided in an embodiment of the present application may include the following steps 201 to 203:
step 201, a first image including a target person at any view angle is acquired.
Illustratively, the first image is an image containing the target person at an arbitrary viewing angle, and may be the input image of the model framework shown in fig. 1. In the embodiments of the present application, a single-view person image can serve as the input image from which multi-view person images are generated.
Step 202, inputting the first image into a target model to obtain a coarse-grained three-dimensional representation of the first image, and inputting the first image into a fine-grained image encoder to obtain three-dimensional detail features of the first image.
Wherein the target model comprises a coarse-grained image encoder and a three-dimensional generative adversarial network; the coarse-grained image encoder maps the first image to a latent-space vector of an intermediate layer of the three-dimensional generative adversarial network; and the three-dimensional generative adversarial network obtains the coarse-grained three-dimensional representation from the latent-space vector output by the coarse-grained image encoder.
Illustratively, as shown in fig. 1, in the embodiments of the present application the first image is fed into two encoders separately. After the first image is input to the coarse-grained image encoder, the encoder maps it to a latent-space vector (the latent expression in fig. 1) of an intermediate layer of the three-dimensional generative adversarial network (the 3D GAN in fig. 1). The three-dimensional generative adversarial network then derives the coarse-grained three-dimensional representation of the first image from this latent vector.
Illustratively, the intermediate layer may be any designated intermediate layer of the three-dimensional generative adversarial network; that is, the coarse-grained image encoder maps the input first image to a latent expression of the network.
Illustratively, once the latent-space vector output by the coarse-grained image encoder is obtained, it can be fed into the three-dimensional generative adversarial network to obtain the three-dimensional representation of the first image.
It should be noted that the three-dimensional representations in the embodiments of the present application (both coarse-grained and fine-grained) may be represented with neural radiance fields and their sparse variants.
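For reference, a neural radiance field turns such a three-dimensional representation into pixels by compositing density/color samples along each camera ray. The following is a minimal sketch of the standard NeRF volume-rendering quadrature; the function name and the toy inputs are illustrative, not taken from the patent.

```python
import numpy as np

def volume_render(densities, colors, deltas):
    """Composite samples along one ray with the standard NeRF quadrature:
    alpha_i = 1 - exp(-sigma_i * delta_i), T_i = prod_{j<i} (1 - alpha_j),
    pixel = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - np.exp(-densities * deltas)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])  # accumulated transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0), weights

# Toy ray: two empty samples, then an opaque red sample that occludes
# the green sample behind it.
densities = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
rgb, w = volume_render(densities, colors, deltas=np.full(4, 0.1))
```

Because every operation is differentiable, gradients from a 2D image loss can flow back into the 3D representation, which is what makes the adversarial pre-training and encoder training described below possible.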
Illustratively, while the coarse-grained three-dimensional representation of the first image is generated by the coarse-grained image encoder and the three-dimensional generative adversarial network (i.e., the target model described above), three-dimensional detail features are also extracted from the first image by the fine-grained image encoder.
It should be noted that the coarse-grained three-dimensional representation cannot accurately reconstruct the detail information of the person image at a given camera viewing angle, so the three-dimensional detail features output by the fine-grained encoder are combined with it to improve the accuracy with which the details of the input image are reconstructed.
Step 203, combining the coarse-grained three-dimensional representation with the three-dimensional detail features to obtain a fine-grained three-dimensional representation of the first image, and rendering, from the fine-grained three-dimensional representation, a second image containing the target person at a given camera viewing angle.
Illustratively, as shown in fig. 1, after the coarse-grained three-dimensional representation and the three-dimensional detail features are obtained, the two are combined into a fine-grained three-dimensional representation, from which a two-dimensional person image at any given camera viewing angle, i.e., the second image, can be rendered.
In the embodiments of the present application, the reconstructed two-dimensional person image (the second image) shows the same person as the input two-dimensional person image (the first image), namely the target person, differing only in viewing angle. In special cases, the input image and the reconstructed image may share the same viewing angle.
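Rendering the second image at a given camera viewing angle amounts to casting rays from the specified camera pose through the fine-grained representation. A minimal NeRF-style ray-generation sketch follows; the function name and the pinhole-camera convention (camera looking down -z, camera-to-world pose as a 4x4 matrix) are assumptions for illustration.

```python
import numpy as np

def get_rays(H, W, focal, c2w):
    """Generate ray origins and directions for an H x W image from a
    camera-to-world pose `c2w` (4x4) and focal length, as in NeRF-style renderers."""
    i, j = np.meshgrid(np.arange(W), np.arange(H), indexing="xy")
    dirs = np.stack([(i - W / 2) / focal,
                     -(j - H / 2) / focal,
                     -np.ones_like(i, dtype=float)], axis=-1)
    rays_d = dirs @ c2w[:3, :3].T                    # rotate directions into world space
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)  # all rays start at the camera center
    return rays_o, rays_d

# With the identity pose, rays originate at the origin and look down -z.
rays_o, rays_d = get_rays(4, 4, focal=2.0, c2w=np.eye(4))
```

Changing `c2w` and re-running only this step is what gives the "any given camera viewing angle" property: the fine-grained 3D representation stays fixed while the rays move.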
Optionally, in the embodiments of the present application, the coarse-grained image encoder, the three-dimensional generative adversarial network, and the fine-grained image encoder described above may be trained by the following training methods.
Illustratively, before the step 201, the multi-view image generating method provided in the embodiment of the present application may further include the following steps 204 and 205:
and 204, inputting the Gaussian noise distributed in multiple dimensions into the three-dimensional generation countermeasure network, outputting the three-dimensional expression of the character image, and rendering the three-dimensional expression into a two-dimensional character image under a random camera view angle by utilizing a differentiable rendering method corresponding to the three-dimensional expression of the character image.
Step 205, judging with a discriminator whether the two-dimensional person image at the random camera viewing angle is a real image, and adjusting the parameters of the three-dimensional generative adversarial network according to the judgment.
Wherein the discriminator judges the two-dimensional person image at the random camera viewing angle against real two-dimensional person images.
Illustratively, the three-dimensional generative adversarial network must first be pre-trained before the coarse-grained and fine-grained image encoders are trained. Specifically, steps 204 and 205 are repeated until pre-training of the network is complete.
Illustratively, the three-dimensional generative adversarial network takes random noise as input and outputs a specific three-dimensional representation of a portrait; its network structure may use the StyleGAN2 architecture. Given a randomly sampled camera viewing angle, the representation is rendered into a two-dimensional person image with the differentiable rendering method matched to that representation, and the network is trained adversarially with a discriminator against a large collection of real two-dimensional person images (frontal images, profile images, and so on), finally yielding a three-dimensional generative adversarial network capable of producing plausible three-dimensional portrait representations.
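The pre-training loop of steps 204-205 can be sketched as a standard GAN objective. The stubs below replace the 3D GAN generator, the differentiable renderer, and the discriminator with toy functions; only the loss structure (the discriminator separates real images from rendered fakes, and the generator is then updated to fool it) reflects the description above. All names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def generator(z):
    """Placeholder for the 3D GAN generator plus differentiable renderer:
    Gaussian noise -> rendered fake 2D image."""
    return np.tanh(z.reshape(8, 8))

def discriminator(img, w):
    """Placeholder discriminator: a single logit that the image is real."""
    return float((img * w).sum())

def bce_logit(logit, label):
    """Binary cross-entropy on one logit (label 1 = real, 0 = fake)."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return -(label * np.log(p + 1e-9) + (1 - label) * np.log(1 - p + 1e-9))

w = 0.01 * rng.standard_normal((8, 8))        # toy discriminator weights
real = rng.uniform(-1, 1, (8, 8))             # stands in for a real person image
fake = generator(rng.standard_normal(64))     # the random camera view is implicit here

# Discriminator objective: classify real as real and rendered fake as fake.
d_loss = bce_logit(discriminator(real, w), 1) + bce_logit(discriminator(fake, w), 0)
# Generator (3D GAN) objective: make the rendered fake look real to the discriminator.
g_loss = bce_logit(discriminator(fake, w), 1)
```

In an actual implementation both losses would be minimized alternately with gradient descent; the sketch only shows how the rendered fake image enters both objectives.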
Illustratively, after the pre-training of the three-dimensional generation countermeasure network is completed, the coarse-granularity image encoder may be trained based on the three-dimensional generation countermeasure network.
Illustratively, after the step 205, the multi-view image generating method provided in the embodiment of the present application may further include the following step 206:
Step 206, taking the real two-dimensional character image as an input image of the coarse-granularity image encoder, and training the coarse-granularity image encoder so that after the hidden space vector output by the coarse-granularity image encoder is input into the three-dimensional generation countermeasure network, the three-dimensional expression output by the three-dimensional generation countermeasure network can be reconstructed into the input image of the coarse-granularity image encoder.
Wherein, the hidden space vector output by the coarse-granularity image encoder is: the hidden space vector output by the middle layer of the three-dimensional generation countermeasure network, into which the coarse-granularity image encoder maps an input image; the first loss function used by the training process of the coarse-granularity image encoder comprises at least one of the target loss functions; the target loss functions include: the pixel difference between the input image and the reconstructed image, the perception loss, the character feature difference between the input image and the reconstructed image, and the difference between the hidden space vector output by the coarse-granularity image encoder and the mean value of the hidden space vectors of the three-dimensional generation countermeasure network; the mean value of the hidden space vectors is: the average value of the plurality of hidden space vectors output by the middle layer during the pre-training process of the three-dimensional generation countermeasure network.
Illustratively, based on the pre-trained three-dimensional generation countermeasure network, an image encoder that maps a single person image into a latent expression (i.e., the hidden space vector described above) of the three-dimensional generation countermeasure network is trained, such that the coarse-granularity three-dimensional expression obtained from the latent expression via the three-dimensional generation countermeasure network can be rendered to reconstruct the input image. The training process of the coarse-granularity image encoder also uses the collected large number of real two-dimensional character images as training data. The coarse-granularity image encoder may employ a residual network (ResNet) or a feature pyramid network (Feature Pyramid Networks, FPN) structure, and the loss function includes the pixel difference between the input image and the reconstructed image, the Learned Perceptual Image Patch Similarity (LPIPS), the character feature difference between the input image and the reconstructed image (also referred to as a perception loss), and the difference between the mapped latent expression and the mean value of the hidden space vectors of the three-dimensional generation countermeasure network.
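The first loss function described above combines up to four target-loss terms. A minimal numeric sketch follows; the function name, the lambda weights, and the L1 stand-in for LPIPS are all illustrative assumptions, not values from the patent.

```python
import numpy as np

def coarse_encoder_loss(input_img, recon_img, w_latent, w_mean,
                        id_feat_in, id_feat_rec,
                        lambdas=(1.0, 0.8, 0.1, 0.005)):
    """Weighted sum of the four target-loss terms; weights are illustrative."""
    l_pixel = np.mean((input_img - recon_img) ** 2)     # pixel difference
    l_lpips = np.mean(np.abs(input_img - recon_img))    # stand-in for LPIPS
    cos_sim = (id_feat_in @ id_feat_rec) / (
        np.linalg.norm(id_feat_in) * np.linalg.norm(id_feat_rec))
    l_id = 1.0 - cos_sim                                # character feature difference
    l_reg = np.mean((w_latent - w_mean) ** 2)           # distance to latent mean
    a, b, c, d = lambdas
    return a * l_pixel + b * l_lpips + c * l_id + d * l_reg

rng = np.random.default_rng(1)
img = rng.random((8, 8))
recon = img + 0.01                      # near-perfect reconstruction
w = rng.standard_normal(16)             # encoder's mapped latent expression
w_mean = np.zeros(16)                   # mean latent of the pre-trained network
feat = rng.standard_normal(32)          # identity feature (same for both images)
loss = coarse_encoder_loss(img, recon, w, w_mean, feat, feat)
```

The latent-mean regularizer is what keeps the encoder's output close to the well-behaved region of the hidden space, so that the generated coarse expression stays plausible.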
Illustratively, after the training of the coarse-granularity image encoder is completed, parameters of the pre-trained three-dimensional generation countermeasure network need to be optimized based on the trained coarse-granularity image encoder.
Illustratively, after the step 206, the multi-view image generating method provided in the embodiment of the present application may further include the following step 207:
Step 207, optimizing model parameters of the three-dimensional generation countermeasure network based on the trained coarse-granularity image encoder.
Wherein, the tuning process uses the real two-dimensional character image as training data; the second loss function used by the tuning process includes: at least one of the target loss functions, and/or a similarity loss between the two-dimensional character image generated by the three-dimensional generation countermeasure network in the pre-training process and the generation result of the three-dimensional generation countermeasure network before tuning; the generation result of the three-dimensional generation countermeasure network before tuning is: the two-dimensional character image rendered from the three-dimensional expression output by the three-dimensional generation countermeasure network.
For example, to improve the accuracy of image reconstruction in the coarse-granularity stage, the model parameters of the three-dimensional generation countermeasure network may be further tuned based on the trained coarse-granularity image encoder. The training data and the loss function used in the tuning process are the same as those in step 206, and a similarity loss between character images generated by random sampling from the three-dimensional generation countermeasure network and the generation results before tuning may be added, so as to improve the accuracy of image reconstruction while ensuring the rationality of the coarse-granularity three-dimensional expression generated by the tuned three-dimensional generation countermeasure network.
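The tuning objective sketched above adds a similarity penalty that keeps randomly sampled generations of the tuned network close to the pre-tuning outputs. A minimal sketch of that second loss function; the function name and the 0.5 weight are assumptions for illustration only.

```python
import numpy as np

def tuning_loss(recon_loss, gen_before, gen_after, sim_weight=0.5):
    # similarity loss between the pre-tuning generation result and the
    # tuned network's generation for the same random sample
    l_sim = np.mean((gen_before - gen_after) ** 2)
    return recon_loss + sim_weight * l_sim

gen_before = np.ones((4, 4))          # image generated before tuning
gen_after = np.ones((4, 4)) * 1.1     # same sample, after a tuning step
total = tuning_loss(0.2, gen_before, gen_after)
```

The regularizer anchors the generator: reconstruction can pull parameters toward the specific input image, while the similarity term prevents the network from forgetting how to generate plausible images for other samples.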
Illustratively, after the training of the three-dimensional generation countermeasure network and the coarse-granularity image encoder described above is completed, the fine-granularity image encoder may be trained based on the trained three-dimensional generation countermeasure network and the coarse-granularity image encoder.
Illustratively, after the step 207, the multi-view image generating method provided in the embodiment of the present application may further include the following steps 208 and 209:
Step 208, taking the real two-dimensional character image as an input image of the fine-granularity image encoder, extracting three-dimensional detail features of the input image through the fine-granularity image encoder, and combining the extracted three-dimensional detail features with the three-dimensional expression output by the three-dimensional generation countermeasure network based on the same input image to obtain the fine-granularity three-dimensional expression.
Step 209, generating a two-dimensional character image under any camera view angle based on the fine-granularity three-dimensional expression rendering, and adjusting model parameters of the fine-granularity image encoder according to a loss value of a third loss function.
Wherein the third loss function comprises: at least one of the target loss functions; during the training of the fine-granularity image encoder, the rationality of a fourth image is constrained based on the third image; the third image is: the two-dimensional character image rendered from the three-dimensional expression obtained after the hidden space vector mapped by the coarse-granularity image encoder is input into the three-dimensional generation countermeasure network; the fourth image is: the two-dimensional character image rendered based on the fine-granularity three-dimensional expression; the third image and the fourth image are obtained based on the same input image.
Illustratively, during the training process of the fine-granularity image encoder, the trained coarse-granularity image encoder and the three-dimensional generation countermeasure network need to be utilized to synchronously output the corresponding coarse-granularity three-dimensional expression according to the step 202.
Illustratively, the fine-granularity image encoder is used to reconstruct, on the coarse-granularity three-dimensional expression output by the three-dimensional generation countermeasure network, the image details that cannot be effectively expressed by the latent expression, i.e., the three-dimensional detail features described above. The network structure of the fine-granularity image encoder may adopt the structure of the three-dimensional semantic segmentation network 3D U-Net to extract the detail features of an input image in three-dimensional space, and the detail features are added to the coarse-granularity three-dimensional expression output by the three-dimensional generation countermeasure network, thereby obtaining the fine-granularity three-dimensional expression.
Illustratively, the fine-granularity image encoder is also trained using the reconstruction loss function on real two-dimensional portraits. Meanwhile, the rationality of the new-view-angle image rendered from the fine-granularity three-dimensional expression may be constrained using the coarse-granularity reconstruction result (namely, the two-dimensional character image reconstructed based on the coarse-granularity three-dimensional expression), so that the finally obtained fine-granularity three-dimensional expression can be rendered into realistic multi-view character images while accurately reconstructing the input image.
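The combination step of the fine-granularity stage — detail features lifted from the input image are added onto the coarse three-dimensional expression — can be sketched as follows. The `detail_encoder` stand-in replaces the 3D U-Net, and all shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def detail_encoder(image):
    # stand-in for the 3D U-Net that lifts 2D image details into a
    # feature volume with the same shape as the coarse 3D expression
    return np.stack([image] * 4) * 0.1

coarse_volume = rng.random((4, 8, 8))   # coarse 3D expression from the tuned network
input_image = rng.random((8, 8))        # real two-dimensional character image
detail = detail_encoder(input_image)    # three-dimensional detail features
fine_volume = coarse_volume + detail    # fine-granularity 3D expression
```

Additive combination is the key design choice here: the coarse expression keeps its multi-view consistency, and the detail volume only refines it, which is why the coarse reconstruction can serve as a rationality constraint on the refined result.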
For example, as shown in fig. 1, after a single-view character image is input into a network model, a two-dimensional character image at a given camera view angle can be rendered based on a fine-grained three-dimensional representation output by the network model.
In the multi-view image generation method provided by the application, the multi-view images generated from a single portrait have strict three-dimensional consistency, and the image sequence rendered under freely moving camera views has a strong sense of realism; the generated multi-view character images have high fidelity, the fine details of the character in the original image are effectively restored, and the whole inference process can be realized with only a single forward pass of the network, so that the method is very efficient.
The multi-view image generation method provided by the embodiment of the application includes the steps that first, a first image containing a target person at any view angle is acquired; then, inputting the first image into a target model to obtain a coarse-granularity three-dimensional expression of the first image, and inputting the first image into a fine-granularity image encoder to obtain three-dimensional detail characteristics of the first image; and finally, combining the coarse-grain three-dimensional expression with the three-dimensional detail feature to obtain a fine-grain three-dimensional expression of the first image, and rendering a second image containing the target person under a given camera view angle based on the fine-grain three-dimensional expression. Therefore, the character image of the character under any given camera view angle can be obtained based on the single character image, so that the details of the reconstructed input image can be enriched, and the three-dimensional consistency under the generation of the multi-view image can be ensured.
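The single-forward-pass inference summarized above (coarse encoder plus generation network for the coarse expression, fine-granularity encoder for details, then rendering at a given camera view angle) can be traced end to end in a toy sketch; every function body below is an illustrative placeholder for the trained components, not the real networks.

```python
import numpy as np

rng = np.random.default_rng(3)

def coarse_pipeline(image):
    # stand-in for coarse encoder + 3D generation network:
    # single image -> coarse 3D expression
    return np.stack([image * s for s in (0.5, 1.0, 1.5)])

def fine_encoder(image):
    # stand-in for the 3D U-Net detail encoder
    return np.stack([image] * 3) * 0.1

def render(volume, camera_angle):
    # stand-in for the differentiable renderer at a given camera view
    return volume.mean(axis=0) * np.cos(camera_angle)

first_image = rng.random((8, 8))                      # single-view input
fine_volume = coarse_pipeline(first_image) + fine_encoder(first_image)
second_image = render(fine_volume, camera_angle=0.3)  # image at the given view
```

Because the fine volume is built once, rendering at another camera angle only repeats the last line, which is what gives the method its three-dimensional consistency across views.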
It should be noted that, in the multi-view image generating method provided in the embodiment of the present application, the execution subject may be a multi-view image generating apparatus, or a control module in the multi-view image generating apparatus for executing the multi-view image generating method. In the embodiment of the present application, a multi-view image generating device is described by taking an example of a multi-view image generating method performed by the multi-view image generating device.
In the embodiments of the present application, the multi-view image generation method is described by way of example with reference to one of the drawings. In specific implementation, the multi-view image generation method shown in the foregoing method drawings may also be implemented in combination with any other drawing illustrated in the foregoing embodiments, which is not repeated here.
The multi-view image generating apparatus provided in the present application will be described below, and the multi-view image generating method described below and the multi-view image generating method described above may be referred to in correspondence with each other.
Fig. 3 is a schematic structural diagram of a multi-view image generating apparatus according to an embodiment of the present application, as shown in fig. 3, specifically including:
An acquiring module 301, configured to acquire a first image including a target person at any view angle; the processing module 302 is configured to input the first image into a target model to obtain a coarse-granularity three-dimensional representation of the first image, and input the first image into a fine-granularity image encoder to obtain three-dimensional detail features of the first image; the generating module 303 is configured to combine the coarse-grain three-dimensional expression with the three-dimensional detail feature to obtain a fine-grain three-dimensional expression of the first image, and render a second image including the target person at a given camera view angle based on the fine-grain three-dimensional expression; wherein the object model comprises: coarse-granularity image encoders and three-dimensional generation countermeasure networks; the coarse-granularity image encoder is used for mapping the first image into hidden space vectors output by the middle layer of the three-dimensional generation countermeasure network; the three-dimensional generation countermeasure network is used for obtaining the coarse-granularity three-dimensional expression based on the hidden space vector output by the coarse-granularity image encoder.
Optionally, the apparatus further comprises: a training module; the training module is used for inputting the Gaussian noise with multidimensional distribution into the three-dimensional generation countermeasure network, outputting the three-dimensional expression of the character image, and rendering the three-dimensional expression into a two-dimensional character image under a random camera view angle by utilizing a differentiable rendering method corresponding to the three-dimensional expression of the character image; the training module is further used for identifying whether the two-dimensional character image under the random camera visual angle is a real image or not by utilizing the discriminator, and adjusting the parameters of the three-dimensional generation countermeasure network according to the identification result; wherein the discriminator identifies the two-dimensional character image at the random camera view angle based on the true two-dimensional character image.
Optionally, the training module is configured to train the coarse-granularity image encoder by using the real two-dimensional character image as an input image of the coarse-granularity image encoder, so that after the hidden space vector output by the coarse-granularity image encoder is input to the three-dimensional generation countermeasure network, the three-dimensional expression output by the three-dimensional generation countermeasure network can be reconstructed into the input image of the coarse-granularity image encoder; wherein, the hidden space vector output by the coarse-granularity image encoder is: the hidden space vector output by the middle layer of the three-dimensional generation countermeasure network, into which the coarse-granularity image encoder maps an input image; the first loss function used by the training process of the coarse-granularity image encoder comprises at least one of the target loss functions; the target loss functions include: the pixel difference between the input image and the reconstructed image, the perception loss, the character feature difference between the input image and the reconstructed image, and the difference between the hidden space vector output by the coarse-granularity image encoder and the mean value of the hidden space vectors of the three-dimensional generation countermeasure network; the mean value of the hidden space vectors is: the average value of the plurality of hidden space vectors output by the middle layer during the pre-training process of the three-dimensional generation countermeasure network.
Optionally, the training module is further configured to tune model parameters of the three-dimensional generation countermeasure network based on the trained coarse-granularity image encoder; wherein, the tuning process uses the real two-dimensional character image as training data; the second loss function used by the tuning process includes: at least one of the target loss functions, and/or a similarity loss between the two-dimensional character image generated by the three-dimensional generation countermeasure network in the pre-training process and the generation result of the three-dimensional generation countermeasure network before tuning; the generation result of the three-dimensional generation countermeasure network before tuning is: the two-dimensional character image rendered from the three-dimensional expression output by the three-dimensional generation countermeasure network.
Optionally, the training module is further configured to use the real two-dimensional character image as an input image of the fine-granularity image encoder, extract three-dimensional detail features of the input image through the fine-granularity image encoder, and combine the extracted three-dimensional detail features with the three-dimensional expression output by the three-dimensional generation countermeasure network based on the same input image to obtain the fine-granularity three-dimensional expression; the training module is further configured to generate a two-dimensional character image under any camera view angle based on the fine-granularity three-dimensional expression rendering, and adjust model parameters of the fine-granularity image encoder according to a loss value of a third loss function; wherein the third loss function comprises: at least one of the target loss functions; during the training of the fine-granularity image encoder, the rationality of a fourth image is constrained based on the third image; the third image is: the two-dimensional character image rendered from the three-dimensional expression obtained after the hidden space vector mapped by the coarse-granularity image encoder is input into the three-dimensional generation countermeasure network; the fourth image is: the two-dimensional character image rendered based on the fine-granularity three-dimensional expression; the third image and the fourth image are obtained based on the same input image.
The multi-view image generating device provided by the application firstly acquires a first image containing a target person at any view angle; then, inputting the first image into a target model to obtain a coarse-granularity three-dimensional expression of the first image, and inputting the first image into a fine-granularity image encoder to obtain three-dimensional detail characteristics of the first image; and finally, combining the coarse-grain three-dimensional expression with the three-dimensional detail feature to obtain a fine-grain three-dimensional expression of the first image, and rendering a second image containing the target person under a given camera view angle based on the fine-grain three-dimensional expression. Therefore, the character image of the character under any given camera view angle can be obtained based on the single character image, so that the details of the reconstructed input image can be enriched, and the three-dimensional consistency under the generation of the multi-view image can be ensured.
Fig. 4 illustrates a physical schematic diagram of an electronic device, as shown in fig. 4, which may include: processor 410, communication interface (Communications Interface) 420, memory 430 and communication bus 440, wherein processor 410, communication interface 420 and memory 430 communicate with each other via communication bus 440. Processor 410 may invoke logic instructions in memory 430 to perform a multi-view image generation method comprising: acquiring a first image containing a target person at any view angle; inputting the first image into a target model to obtain coarse-granularity three-dimensional expression of the first image, and inputting the first image into a fine-granularity image encoder to obtain three-dimensional detail characteristics of the first image; combining the coarse-grain three-dimensional expression with the three-dimensional detail feature to obtain a fine-grain three-dimensional expression of the first image, and rendering a second image containing the target person under a given camera view angle based on the fine-grain three-dimensional expression; wherein the object model comprises: coarse-granularity image encoders and three-dimensional generation countermeasure networks; the coarse-granularity image encoder is used for mapping the first image into hidden space vectors output by the middle layer of the three-dimensional generation countermeasure network; the three-dimensional generation countermeasure network is used for obtaining the coarse-granularity three-dimensional expression based on the hidden space vector output by the coarse-granularity image encoder.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program codes.
In another aspect, the present application also provides a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the multi-view image generation method provided by the methods described above, the method comprising: acquiring a first image containing a target person at any view angle; inputting the first image into a target model to obtain coarse-granularity three-dimensional expression of the first image, and inputting the first image into a fine-granularity image encoder to obtain three-dimensional detail characteristics of the first image; combining the coarse-grain three-dimensional expression with the three-dimensional detail feature to obtain a fine-grain three-dimensional expression of the first image, and rendering a second image containing the target person under a given camera view angle based on the fine-grain three-dimensional expression; wherein the object model comprises: coarse-granularity image encoders and three-dimensional generation countermeasure networks; the coarse-granularity image encoder is used for mapping the first image into hidden space vectors output by the middle layer of the three-dimensional generation countermeasure network; the three-dimensional generation countermeasure network is used for obtaining the coarse-granularity three-dimensional expression based on the hidden space vector output by the coarse-granularity image encoder.
In yet another aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-view image generation method provided above, the method comprising: acquiring a first image containing a target person at any view angle; inputting the first image into a target model to obtain coarse-granularity three-dimensional expression of the first image, and inputting the first image into a fine-granularity image encoder to obtain three-dimensional detail characteristics of the first image; combining the coarse-grain three-dimensional expression with the three-dimensional detail feature to obtain a fine-grain three-dimensional expression of the first image, and rendering a second image containing the target person under a given camera view angle based on the fine-grain three-dimensional expression; wherein the object model comprises: coarse-granularity image encoders and three-dimensional generation countermeasure networks; the coarse-granularity image encoder is used for mapping the first image into hidden space vectors output by the middle layer of the three-dimensional generation countermeasure network; the three-dimensional generation countermeasure network is used for obtaining the coarse-granularity three-dimensional expression based on the hidden space vector output by the coarse-granularity image encoder.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A multi-view image generation method, comprising:
acquiring a first image containing a target person at any view angle;
inputting the first image into a target model to obtain coarse-granularity three-dimensional expression of the first image, and inputting the first image into a fine-granularity image encoder to obtain three-dimensional detail characteristics of the first image;
combining the coarse-grain three-dimensional expression with the three-dimensional detail feature to obtain a fine-grain three-dimensional expression of the first image, and rendering a second image containing the target person under a given camera view angle based on the fine-grain three-dimensional expression;
Wherein the object model comprises: coarse-granularity image encoders and three-dimensional generation countermeasure networks; the coarse-granularity image encoder is used for mapping the first image into hidden space vectors output by the middle layer of the three-dimensional generation countermeasure network; the three-dimensional generation countermeasure network is used for obtaining the coarse-granularity three-dimensional expression based on the hidden space vector output by the coarse-granularity image encoder; the coarse-grained three-dimensional representation and the fine-grained three-dimensional representation are both represented using a neural radiance field; the three-dimensional detail features are represented using feature vectors.
2. The method of claim 1, wherein prior to the acquiring the first image comprising the target person at any viewing angle, the method further comprises:
the following steps are circularly executed until the pre-training of the three-dimensional generation countermeasure network is completed:
inputting the Gaussian noise with multidimensional distribution into the three-dimensional generation countermeasure network, outputting the three-dimensional expression of the character image, and rendering the three-dimensional expression into a two-dimensional character image under a random camera view angle by utilizing a differentiable rendering method corresponding to the three-dimensional expression of the character image;
identifying whether the two-dimensional character image under the random camera view angle is a real image or not by utilizing a discriminator, and adjusting the parameters of the three-dimensional generation countermeasure network according to the identification result;
Wherein the discriminator identifies the two-dimensional character image at the random camera view angle based on the true two-dimensional character image.
3. The method of claim 2, wherein after the pre-training of the three-dimensional generation of the countermeasure network is completed, the method further comprises:
training the coarse-granularity image encoder by taking a real two-dimensional character image as an input image of the coarse-granularity image encoder, so that after the hidden space vector output by the coarse-granularity image encoder is input into the three-dimensional generation countermeasure network, the three-dimensional expression output by the three-dimensional generation countermeasure network can be reconstructed into the input image of the coarse-granularity image encoder;
wherein, the hidden space vector output by the coarse-granularity image encoder is: the hidden space vector output by the middle layer of the three-dimensional generation countermeasure network, into which the coarse-granularity image encoder maps an input image; the first loss function used by the training process of the coarse-granularity image encoder comprises at least one of the target loss functions; the target loss functions include: the pixel difference between the input image and the reconstructed image, the perception loss, the character feature difference between the input image and the reconstructed image, and the difference between the hidden space vector output by the coarse-granularity image encoder and the mean value of the hidden space vectors of the three-dimensional generation countermeasure network; the mean value of the hidden space vectors is: the average value of the plurality of hidden space vectors output by the middle layer during the pre-training process of the three-dimensional generation countermeasure network.
4. The method of claim 3, wherein the training of the coarse-granularity image encoder using the real two-dimensional character image as the input image to the coarse-granularity image encoder further comprises:
optimizing model parameters of the three-dimensional generation countermeasure network based on the trained coarse-granularity image encoder;
wherein, the tuning process uses the real two-dimensional character image as training data; the second loss function used by the tuning process includes: at least one of the target loss functions, and/or a similarity loss between the two-dimensional character image generated by the three-dimensional generation countermeasure network in the pre-training process and the generation result of the three-dimensional generation countermeasure network before tuning; the generation result of the three-dimensional generation countermeasure network before tuning is: the two-dimensional character image rendered from the three-dimensional expression output by the three-dimensional generation countermeasure network.
5. The method of claim 3 or 4, wherein after training the coarse-granularity image encoder using a real two-dimensional character image as its input image, the method further comprises:
training the fine-granularity image encoder by iteratively performing the following steps:
taking a real two-dimensional character image as the input image of the fine-granularity image encoder, extracting three-dimensional detail features of the input image with the fine-granularity image encoder, and combining the extracted three-dimensional detail features with the three-dimensional representation output by the three-dimensional generative adversarial network for the same input image, to obtain a fine-granularity three-dimensional representation;
rendering a two-dimensional character image at an arbitrary camera view angle from the fine-granularity three-dimensional representation, and adjusting model parameters of the fine-granularity image encoder according to the loss value of a third loss function;
wherein the third loss function comprises at least one of the target loss terms; during training of the fine-granularity image encoder, the plausibility of a fourth image is constrained based on a third image; the third image is the two-dimensional character image rendered from the three-dimensional representation obtained after the hidden space vector mapped by the coarse-granularity image encoder is input into the three-dimensional generative adversarial network; the fourth image is the two-dimensional character image rendered from the fine-granularity three-dimensional representation; and the third image and the fourth image are obtained from the same input image.
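A minimal numeric sketch of the third loss of claim 5, assuming flat vectors: a target loss term on the fine-granularity render (the fourth image) plus a plausibility term tying it to the coarse render (the third image) from the same input. Names and weights are illustrative, not from the patent:

```python
def third_loss(input_img, fine_render, coarse_render, w_target=1.0, w_plaus=0.5):
    """Third loss of claim 5: a target loss term on the fine-granularity
    render plus a plausibility constraint keeping the fourth image
    (fine render) close to the third image (coarse render)."""
    def l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    target_term = l2(input_img, fine_render)        # e.g. pixel difference
    plausibility = l2(fine_render, coarse_render)   # fourth image vs third image
    return w_target * target_term + w_plaus * plausibility
```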
6. A multi-view image generation apparatus, the apparatus comprising:
the acquisition module is used for acquiring a first image containing a target person at any view angle;
the processing module is used for inputting the first image into a target model to obtain a coarse-granularity three-dimensional representation of the first image, and inputting the first image into a fine-granularity image encoder to obtain three-dimensional detail features of the first image;
the generation module is used for combining the coarse-granularity three-dimensional representation with the three-dimensional detail features to obtain a fine-granularity three-dimensional representation of the first image, and rendering, based on the fine-granularity three-dimensional representation, a second image containing the target person at a given camera view angle;
wherein the target model comprises a coarse-granularity image encoder and a three-dimensional generative adversarial network; the coarse-granularity image encoder is used for mapping the first image into a hidden space vector of the intermediate layer of the three-dimensional generative adversarial network; the three-dimensional generative adversarial network is used for obtaining the coarse-granularity three-dimensional representation based on the hidden space vector output by the coarse-granularity image encoder; the coarse-granularity three-dimensional representation and the fine-granularity three-dimensional representation are both represented using a neural radiance field; and the three-dimensional detail features are represented using feature vectors.
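The data flow through the three modules of the claim-6 apparatus can be sketched with each stage as a pluggable callable; everything here is an illustrative stub, not the patented implementation:

```python
def generate_multi_view(first_image, coarse_encoder, gan_3d, fine_encoder,
                        combine, render, camera_pose):
    """Pipeline of the claim-6 apparatus:
    encoder -> latent -> 3D GAN -> coarse 3D representation;
    fine encoder -> 3D detail features; combine; render at a pose."""
    latent = coarse_encoder(first_image)      # map image into GAN hidden space
    coarse_repr = gan_3d(latent)              # coarse-granularity 3D representation
    detail = fine_encoder(first_image)        # 3D detail feature vector
    fine_repr = combine(coarse_repr, detail)  # fine-granularity 3D representation
    return render(fine_repr, camera_pose)     # second image at the given view
```

With real components, `coarse_encoder` and `fine_encoder` would be neural networks, `gan_3d` the generator of the 3D GAN producing a neural-radiance-field representation, and `render` a differentiable volume renderer.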
7. The apparatus of claim 6, wherein the apparatus further comprises: a training module;
the training module is used for inputting multi-dimensional Gaussian noise into the three-dimensional generative adversarial network to output a three-dimensional representation of a character image, and for rendering the three-dimensional representation into a two-dimensional character image at a random camera view angle using a differentiable rendering method corresponding to the three-dimensional representation of the character image;
the training module is further used for discriminating, with a discriminator, whether the two-dimensional character image at the random camera view angle is a real image, and adjusting the parameters of the three-dimensional generative adversarial network according to the discrimination result;
wherein the discriminator discriminates the two-dimensional character image at the random camera view angle against real two-dimensional character images.
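One pre-training iteration described by claim 7 might be organized as below; the components are stand-in callables, no gradient updates are modelled, and the pose sampling is a simplification (a single random yaw angle):

```python
import random

def pretraining_step(generator, discriminator, render, real_images, noise_dim=4):
    """One pre-training iteration of claim 7: sample Gaussian noise, have
    the 3D GAN produce a 3D representation, differentiably render it at a
    random camera pose, and score it against real images with the
    discriminator. All components are illustrative stubs."""
    z = [random.gauss(0.0, 1.0) for _ in range(noise_dim)]
    representation = generator(z)                 # 3D representation of a character
    pose = random.uniform(0.0, 360.0)             # random camera yaw angle
    fake = render(representation, pose)           # 2D render at that view
    d_fake = discriminator(fake)                  # realism score of the render
    d_real = sum(discriminator(x) for x in real_images) / len(real_images)
    # a real trainer would update the generator to raise d_fake and the
    # discriminator to widen the gap d_real - d_fake
    return d_fake, d_real
```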
8. The apparatus of claim 7, wherein:
the training module is further used for training the coarse-granularity image encoder using a real two-dimensional character image as the input image of the coarse-granularity image encoder, such that, after the hidden space vector output by the coarse-granularity image encoder is input into the three-dimensional generative adversarial network, the three-dimensional representation output by the three-dimensional generative adversarial network can be reconstructed into the input image of the coarse-granularity image encoder;
wherein the hidden space vector output by the coarse-granularity image encoder is the vector obtained by the coarse-granularity image encoder mapping the input image into the hidden space of the intermediate layer of the three-dimensional generative adversarial network; the first loss function used in training the coarse-granularity image encoder comprises at least one of the following target loss terms: the pixel difference between the input image and the reconstructed image; a perceptual loss; the character feature difference between the input image and the reconstructed image; and the difference between the hidden space vector output by the coarse-granularity image encoder and the mean of the hidden space vectors of the three-dimensional generative adversarial network, wherein the mean of the hidden space vectors is the average of the plurality of hidden space vectors output by the intermediate layer during pre-training of the three-dimensional generative adversarial network.
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the multi-view image generation method according to any one of claims 1 to 5.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the multi-view image generation method according to any one of claims 1 to 5 when the program is executed.
CN202310283837.0A 2023-03-22 2023-03-22 Multi-view image generation method and device, readable storage medium and electronic equipment Active CN115994966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310283837.0A CN115994966B (en) 2023-03-22 2023-03-22 Multi-view image generation method and device, readable storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN115994966A CN115994966A (en) 2023-04-21
CN115994966B true CN115994966B (en) 2023-06-30

Family

ID=85992421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310283837.0A Active CN115994966B (en) 2023-03-22 2023-03-22 Multi-view image generation method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115994966B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836338A (en) * 2021-07-21 2021-12-24 北京邮电大学 Fine-grained image classification method and device, storage medium and terminal

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107958246A (en) * 2018-01-17 2018-04-24 深圳市唯特视科技有限公司 A kind of image alignment method based on new end-to-end human face super-resolution network
CN109902746A (en) * 2019-03-01 2019-06-18 中南大学 Asymmetrical fine granularity IR image enhancement system and method
CN112395929A (en) * 2019-08-19 2021-02-23 扬州盛世云信息科技有限公司 Face living body detection method based on infrared image LBP histogram characteristics
US20220335250A1 (en) * 2021-04-19 2022-10-20 Kwai Inc. Methods and apparatuses for fine-grained style-based generative neural networks
CN115393480A (en) * 2022-06-20 2022-11-25 清华大学 Speaker synthesis method, device and storage medium based on dynamic nerve texture
CN115457308B (en) * 2022-08-18 2024-03-12 苏州浪潮智能科技有限公司 Fine granularity image recognition method and device and computer equipment
CN115631274B (en) * 2022-11-18 2023-03-28 北京红棉小冰科技有限公司 Face image generation method, device, equipment and storage medium



Similar Documents

Publication Publication Date Title
Wang et al. Laplacian pyramid adversarial network for face completion
Sun et al. Lightweight image super-resolution via weighted multi-scale residual network
CN111047548A (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN114339409B (en) Video processing method, device, computer equipment and storage medium
CN116664782B (en) Neural radiation field three-dimensional reconstruction method based on fusion voxels
WO2022205755A1 (en) Texture generation method and apparatus, device, and storage medium
CN111797891A (en) Unpaired heterogeneous face image generation method and device based on generation countermeasure network
Xiao et al. Multi-scale attention generative adversarial networks for video frame interpolation
CN115049556A (en) StyleGAN-based face image restoration method
Le-Tien et al. Deep learning based approach implemented to image super-resolution
Liu et al. Decompose to manipulate: manipulable object synthesis in 3D medical images with structured image decomposition
CN112184547A (en) Super-resolution method of infrared image and computer readable storage medium
Liu et al. Facial image inpainting using attention-based multi-level generative network
Do et al. 7T MRI super-resolution with Generative Adversarial Network
CN116912148B (en) Image enhancement method, device, computer equipment and computer readable storage medium
Feng et al. U 2-Former: Nested U-shaped Transformer for Image Restoration via Multi-view Contrastive Learning
CN113538254A (en) Image restoration method and device, electronic equipment and computer readable storage medium
CN115994966B (en) Multi-view image generation method and device, readable storage medium and electronic equipment
CN116664407A (en) Face fusion super-resolution method and system based on triplet unpaired learning
Kim et al. Restoring spatially-heterogeneous distortions using mixture of experts network
Chen et al. Multi-view Self-supervised Disentanglement for General Image Denoising
CN116703719A (en) Face super-resolution reconstruction device and method based on face 3D priori information
CN115375839A (en) Multi-view hair modeling method and system based on deep learning
CN114494387A (en) Data set network generation model and fog map generation method
DE102021124428A1 (en) TRAIN ENERGY-BASED VARIATIONAL AUTOENCODERS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant