CN117372631A

CN117372631A - Training method and application method of multi-view image generation model

Info

Publication number: CN117372631A
Application number: CN202311673946.XA
Authority: CN
Inventors: 王宏升; 林峰
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-01-09
Anticipated expiration: 2043-12-07
Also published as: CN117372631B

Abstract

The specification discloses a training method and an application method of a multi-view image generation model, wherein the generation model to be trained at least comprises a noise adding layer, a cross attention layer and a noise removing layer, an initial feature image, a time parameter and a noise image of each view angle are input into the noise adding layer, the noise adding feature image of each view angle can be obtained, the noise adding feature image of each view angle and the initial feature image of each view angle are input into the cross attention layer, each second fusion feature image with two-dimensional space semantically enhanced of each view angle is obtained, and each second fusion feature image is input into the noise removing layer, so that a predicted noise image of each view angle can be obtained. The denoising layer of the trained generation model can be used for generating the generated image of each view angle corresponding to the target image according to the target image, each randomly generated noise image and the time parameter, and the generated image of each view angle has strong consistency constraint.

Description

Training method and application method of multi-view image generation model

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a training method and an application method for a multi-view image generation model.

Background

Three-dimensional modeling is widely applied in the fields of games, animations, virtual fitting and the like, and generally, three-dimensional space information in a scene is obtained according to two-dimensional space information of images under multiple view angles of the same scene, and the scene is subjected to three-dimensional modeling.

Currently, images at multiple perspectives may be generated from one single perspective image by generating a model. However, in the existing generation model, because the key parts of some people or objects may be blocked or hidden in the input image, the model cannot sense the three-dimensional space information of the blocked or hidden people or objects, and the presence of the blocked or hidden people or objects in the generated multi-view image is inconsistent, so that the consistency constraint in different view images is not strong.

How to enhance the consistency constraint in the generated images with different viewing angles is a problem to be solved.

Disclosure of Invention

The present disclosure provides a training method, apparatus, storage medium and electronic device for a multi-view image generation model, so as to at least partially solve the above-mentioned problems in the prior art.

The technical scheme adopted in the specification is as follows:

The specification provides a training method of a multi-view image generation model, wherein the generation model to be trained at least comprises a noise adding layer, a cross attention layer and a denoising layer, and the method comprises the following steps:

acquiring sample images of a plurality of view angles, and determining an initial feature map of each sample image;

determining a time parameter and a noise image of each view angle, inputting each initial feature image, each noise image and the time parameter into the noise adding layer, and combining the noise image of each view angle with the initial feature image of the view angle according to the time parameter for each view angle to obtain the noise adding feature image of the view angle;

inputting the initial feature images and the noisy feature images into the cross attention layer, splicing key vectors corresponding to the initial feature images and key vectors corresponding to the noisy feature images to obtain first spliced key vectors, splicing value vectors corresponding to the initial feature images and value vectors corresponding to the noisy feature images to obtain first spliced value vectors, and carrying out cross attention calculation on the first spliced key vectors and the first spliced value vectors and query vectors corresponding to the noisy feature images respectively to determine first fusion feature images;

Splicing key vectors corresponding to the first fusion feature graphs to obtain second spliced key vectors, splicing value vectors corresponding to the first fusion feature graphs to obtain second spliced value vectors, and respectively carrying out cross attention calculation on the second spliced key vectors and the second spliced value vectors and query vectors corresponding to the noise-added feature graphs to determine the second fusion feature graphs;

inputting the second fusion feature images, the time parameters and the initial feature images of any view angle into the denoising layer, determining each prediction noise image, and training the generation model according to the difference between the prediction noise image of the view angle and the noise image of the view angle for each view angle;

after the training of the generation model is completed, responding to a multi-view image generation request carrying a target image, inputting an initial feature image of the target image, randomly generated noise images of all view angles and time parameters into a denoising layer of the generation model after the training is completed, and obtaining the generation image of all view angles corresponding to the target image.

Optionally, determining a time parameter and a noise image of each view, inputting each initial feature image, each noise image and the time parameter into the noise adding layer, and combining the noise image of each view with the initial feature image of the view according to the time parameter for each view to obtain the noise adding feature image of the view, which specifically includes:

Randomly generating a time step within a preset time range as the time parameter;

for each view angle, randomly sampling in standard Gaussian distribution, and determining a noise image of the view angle;

defining a degree parameter corresponding to each time step in the time range, determining the degree parameter corresponding to the time parameter, determining the synthesis weight of the initial feature map of the view angle and the noise image of the view angle according to the degree parameter, and synthesizing the initial feature map of the view angle and the noise image of the view angle according to the synthesis weight to obtain the noise-added feature map of the view angle.

Optionally, the generative model further comprises a self-attention layer;

inputting the second fusion feature map, the time parameter and the initial feature map of any view angle into the denoising layer, wherein the method specifically comprises the following steps:

inputting the second fusion feature graphs into the self-attention layer, and aiming at each view angle, carrying out self-attention calculation on query vectors, key vectors and value vectors corresponding to the second fusion feature graphs of the view angle to determine an enhancement feature graph of the view angle;

and inputting each enhancement characteristic diagram, the time parameter and the initial characteristic diagram of any view angle into the denoising layer.

Optionally, the generating model further comprises a dimension reduction layer;

inputting the second fusion feature graphs into the dimension reduction layer, and respectively downsampling the second fusion features to obtain dimension reduction feature graphs;

and inputting the dimension reduction feature graphs, the time parameters and the initial feature graph of any view angle into the denoising layer.

Optionally, the denoising layer comprises a predictor;

inputting the second fusion feature map, the time parameter and the initial feature map of any view angle into the denoising layer, and determining each prediction noise image, wherein the method specifically comprises the following steps:

and inputting the second fusion feature map of the view, the time parameter and the initial feature map of any view into a predictor of the view for each view, and determining a prediction noise image of the view.

The specification provides an application method of a multi-view image generation model, which comprises the following steps:

acquiring a target image and determining an initial feature map of the target image;

determining a time step corresponding to the time parameter, randomly sampling in standard Gaussian distribution for each view angle, and determining a noise image of the view angle;

According to the descending order of time steps, starting from the time step corresponding to the time parameter, determining a noise image of the visual angle corresponding to each time step, inputting the noise image of the visual angle corresponding to the time step, the time step and the initial feature map into a denoising layer of a trained generation model to obtain a prediction noise image of the visual angle corresponding to the time step, wherein the generation model is obtained by training according to the multi-visual angle image generation model training method;

according to the time step, removing the predicted noise image of the view angle corresponding to the time step from the noise image of the view angle corresponding to the time step to obtain the noise image of the view angle corresponding to the next time step of the time step;

and determining the noise image of each view angle obtained in the last time step as a generated image of each view angle corresponding to the target image.

The specification provides a multi-view image generation model training device, and a generation model to be trained at least comprises a noise adding layer, a cross attention layer and a denoising layer, wherein the device comprises:

the acquisition module acquires sample images of a plurality of view angles and determines an initial feature map of each sample image;

The noise adding module is used for determining time parameters and noise images of all the visual angles, inputting all the initial feature images, all the noise images and the time parameters into the noise adding layer, and combining the noise images of the visual angles with the initial feature images of the visual angles according to the time parameters for each visual angle to obtain the noise adding feature images of the visual angles;

the first cross attention module inputs the initial feature images and the noise feature images into the cross attention layer, splices the key vectors corresponding to the initial feature images and the key vectors corresponding to the noise feature images to obtain first spliced key vectors, splices the value vectors corresponding to the initial feature images and the value vectors corresponding to the noise feature images to obtain first spliced value vectors, carries out cross attention calculation on the first spliced key vectors and the first spliced value vectors and query vectors corresponding to the noise feature images respectively, and determines the first fusion feature images;

the second cross attention module is used for splicing the key vectors corresponding to the first fusion feature graphs to obtain second spliced key vectors, splicing the value vectors corresponding to the first fusion feature graphs to obtain second spliced value vectors, and carrying out cross attention calculation on the second spliced key vectors and the second spliced value vectors and the query vectors corresponding to the noise adding feature graphs respectively to determine the second fusion feature graphs;

And the denoising module inputs the second fusion feature images, the time parameters and the initial feature images of any view angle into the denoising layer, determines each prediction noise image, trains the generation model according to the difference between the prediction noise image of the view angle and the noise image of the view angle for each view angle, responds to a multi-view image generation request carrying a target image after the generation model is trained, inputs the initial feature images of the target image, the randomly generated noise images of each view angle and the time parameters into the denoising layer of the trained generation model, and obtains the generated image of each view angle corresponding to the target image.

The present specification provides an application apparatus of a multi-view image generation model, the apparatus comprising:

the acquisition module acquires a target image and determines an initial feature map of the target image;

the sampling module is used for determining a time step corresponding to the time parameter, randomly sampling in standard Gaussian distribution aiming at each view angle, and determining a noise image of the view angle;

the prediction noise image determining module is used for determining a noise image of the visual angle corresponding to each time step from the time step corresponding to the time parameter according to the descending order of the time steps, inputting the noise image of the visual angle corresponding to the time step, the time step and the initial feature map into a denoising layer of a trained generation model to obtain a prediction noise image of the visual angle corresponding to the time step, wherein the generation model is obtained by training according to the multi-visual angle image generation model training method;

The noise image determining module is used for removing the predicted noise image of the visual angle corresponding to the time step from the noise image of the visual angle corresponding to the time step according to the time step to obtain the noise image of the visual angle corresponding to the next time step of the time step;

and the generated image determining module is used for determining the noise image of each view angle obtained in the last time step and taking the noise image as a generated image of each view angle corresponding to the target image.

The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described multi-view image generation model training method or application method of a multi-view image generation model.

The present specification provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the above-described multi-view image generation model training method or application method of the multi-view image generation model when executing the program.

The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:

in the multi-view image generation model training method provided by the specification, a generation model to be trained at least comprises a noise adding layer, a cross attention layer and a noise removing layer, an initial feature image of each view angle, a time parameter and a noise image of each view angle are input into the noise adding layer, the noise adding feature image of each view angle can be obtained, the noise adding feature image of each view angle and the initial feature image of each view angle are input into the cross attention layer, a second fusion feature image of each view angle is obtained, and the second fusion feature image of each view angle is input into the noise removing layer, so that a prediction noise image of each view angle can be obtained.

In the cross attention layer, first, each first fusion feature image with enhanced consistency constraint among all view angles is obtained through the cross attention of the initial feature image of each view angle and the noise feature image of each view angle. And obtaining each second fusion feature map with further enhanced consistency constraint among the view angles through the cross attention of the first fusion feature map of each view angle and the noise adding feature map of each view angle. In this way, at the denoising layer, the generated image with enhanced consistency constraint of each view angle is obtained by determining the prediction noise image of each view angle and removing each prediction noise image from the second fusion characteristic corresponding to each view angle. The denoising layer of the trained generation model can be used for generating the generated image of each view angle corresponding to the target image according to the target image, each randomly generated noise image and the time parameter, and the generated image of each view angle has strong consistency constraint.

In this way, at the denoising layer, the generated image with enhanced consistency constraint of each view angle is obtained by determining the prediction noise image of each view angle and removing each prediction noise image from the second fusion characteristic corresponding to each view angle. The denoising layer of the trained generation model can be used for generating the generated image of each view angle corresponding to the target image according to the target image, each randomly generated noise image and the time parameter, and the generated image of each view angle has strong consistency constraint.

Drawings

The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:

FIG. 1 is a schematic flow chart of a multi-view image generation model training method in the present specification;

FIG. 2 is a schematic structural diagram of a model to be trained provided in the present specification;

FIG. 3 is a schematic flow chart of an algorithm of a cross-attention layer provided in the present specification;

FIG. 4 is a schematic flow chart of an algorithm of a denoising layer provided in the present specification;

FIG. 5 is a schematic structural diagram of a model to be trained provided in the present specification;

FIG. 6 is a flowchart of an application method of a multi-view image generation model provided in the present specification;

FIG. 7 is a schematic diagram of a multi-view image generation model training apparatus provided in the present specification;

FIG. 8 is a schematic diagram of an application apparatus of a multi-view image generation model provided in the present specification;

fig. 9 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present application based on the embodiments herein.

In three-dimensional modeling, images of multiple view angles of the same scene are input into a reconstruction model, and depth of each pixel point in the image in a three-dimensional space is determined according to two-dimensional space information contained in the images of the view angles, so that three-dimensional space information in the scene is obtained. Since, in practical applications, images of multiple views of the same scene are not readily available, multi-view images are typically obtained by generating multiple view images from a single view image by generating models, such as a countermeasure network (Generative adversarial network, GAN), a variance Auto-Encoders (VAE), a denoising probability diffusion model (Denoising Diffusion Probabilistic Model, DDPM), and the like.

However, the images of a single view angle do not contain enough three-dimensional space information, so that two-dimensional semantics of the same three-dimensional position in space in the generated multi-view angle images are inconsistent, and consistency constraint among the images of different view angles is not strong. The strength of the consistency constraint among the view images directly influences the effect of three-dimensional modeling, and the multi-view images with weak consistency constraint are used for three-dimensional modeling, so that in the obtained three-dimensional scene, normal images can only be rendered at certain angles, and images which do not accord with objective facts can be rendered at other angles. For example, a three-dimensional model of a human body obtained in this way is normal for a person looking at one angle and may see a face or body depression at another angle.

In order to enhance the consistency constraint among multi-view images generated by a generating model, the specification provides a training method of the multi-view image generating model, and the two-dimensional semantic consistency among generated different view images is enhanced by combining a cross attention mechanism.

The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.

Fig. 1 is a flow chart of a multi-view image generation model training method in the present specification, which specifically includes the following steps:

s100: acquiring sample images of a plurality of view angles, and determining an initial feature map of each sample image;

all steps in the multi-view image generation model training method provided in the present specification may be implemented by any electronic device having a computing function, such as a terminal, a server, and the like. For convenience of description, the multi-view image generation model training method provided in the present specification will be described below with only a server as an execution subject.

The server obtains a set of images corresponding to multiple perspectives of the same scene from the training dataset as sample images of the multiple perspectives to train the generative model.

In order to enable the generated model to learn the characteristics of each sample image better, before each sample image is input into the generated model, a trained encoder is used for encoding each sample image, and an initial characteristic diagram of each sample image is obtained. The encoder may use a residual network (ResidualNetwork, resNet), a deep convolutional network (very deep convolutional networks for large-scale image recognition, VGGNet) for large-scale image recognition, or a densely connected convolutional network (Densely Connected Convolutional Network, denseNet), etc., and the network structure of the encoder is not particularly limited in this specification.

S102: determining a time parameter and a noise image of each view angle, inputting each initial feature image, each noise image and the time parameter into the noise adding layer, and combining the noise image of each view angle with the initial feature image of the view angle according to the time parameter for each view angle to obtain the noise adding feature image of the view angle.

The generation model provided by the specification adopts the training idea of DDPM, according to the principle of DDPM, random sampling is carried out in standard Gaussian distribution at each time step in a preset time range, and noise images obtained at each time step are added to the input images in sequence according to the ascending sequence of the time steps. However, in the training process of the model, if the generated model is trained by using the same training data at each time step, the diversity of the training data is affected, the stability of the generated model obtained by training is not strong, and the generated result obtained when the generated model is applied may be good or bad.

Therefore, in order to train the generated model by using different training data at different time steps and improve the stability of the generated model, in the training method of the generated model provided in the present specification, the server randomly generates a time step as a time parameter within a preset time range.

Because training is performed on only one time step for each view angle sample image in a set of training data, the time parameter is used to determine the noise adding level of the noise image of the randomly generated time step for training, so that the weight of the noise image in the noise adding feature map corresponding to the larger time step is larger, and the effect of accumulating and adding the noise image in each time step is achieved.

Then, the server needs to determine the noise image for each view. Specifically, for each view, the server performs random sampling in a standard gaussian distribution to determine a noise image for the view.

Fig. 2 is a schematic structural diagram of a to-be-trained generating model provided in the present specification, as shown in fig. 2, where the to-be-trained generating model includes a noise adding layer, a cross attention layer and a noise removing layer, where N view initial feature images, time parameters and N view noise images are input into the noise adding layer, so that N view noise adding feature images can be obtained, N view noise adding feature images and N view initial feature images are input into the cross attention layer, so as to obtain N view second fused feature images, and each view second fused feature image is input into the noise removing layer, so that N view predicted noise images can be obtained.

As shown in fig. 2, the server inputs each initial feature map, each noise image and time parameter into the noise adding layer, and synthesizes the noise image of the view angle with the initial feature map of the view angle according to the time parameter for each view angle to obtain the noise adding feature map of the view angle.

Specifically, the server defines the corresponding degree parameters of each time step in the time range in advance. For example, the preset time range is 0~T, which includes T time steps, and the corresponding degree parameter of each time step can be set as [ []. And determining a degree parameter corresponding to the time parameter through the noise adding layer, determining the synthesis weight of the initial feature map of the visual angle and the noise image of the visual angle according to the degree parameter, and synthesizing the initial feature map of the visual angle and the noise image of the visual angle according to the synthesis weight to obtain the noise adding feature map of the visual angle.

The noise signature may be determined according to the following formula:

wherein,representing a time parameter of +.>Time-corresponding->Noise signature for individual views, +.>Indicate->Initial feature map for individual views, +.>Representing a time parameter of +.>Time->Noise image of individual view angle->Representing the time parameter +.>Corresponding degree parameter, ++ >Representing the composite weights.

The process of determining the noisy feature map corresponding to the time parameter is also the process of determining the label of the generated model, the noisy image of each view angle of the noisy layer is input and used as the label of each view angle, and the generated model is trained.

In order to ensure that the larger the time step corresponding to the determined time parameter in the noise adding feature map obtained according to the above formula is, the larger the weight occupied by the noise image in the noise adding feature map corresponding to the time parameter is, the degree parameter corresponding to each time step can be defined to be gradually reduced along with the increase of the time step. Of course, it is also possible to define a determination formula for the noise-adding feature map by changing the positions of the upper and lower elements in the synthesized weight of the above formula only when the degree parameter corresponding to each time step is gradually increased with the increase of the time step.

S104: inputting the initial feature images and the noise feature images into the cross attention layer, splicing key vectors corresponding to the initial feature images and the noise feature images to obtain first spliced key vectors, splicing value vectors corresponding to the initial feature images and the noise feature images to obtain first spliced value vectors, and carrying out cross attention calculation on the first spliced key vectors and the first spliced value vectors and query vectors corresponding to the noise feature images respectively to determine the first fusion feature images.

As shown in fig. 2, the generated model includes a cross-attention layer, and the cross-attention calculation is performed on each initial feature map and each noisy feature map to obtain each second fused feature map with enhanced consistency constraint.

Fig. 3 is a schematic flow chart of an algorithm of a cross-attention layer provided in the present specification, wherein,indicate->Query vector of initial feature map of individual view, +.>Indicate->Key vector of initial feature map of individual view, +.>Represent the firstValue vector of initial feature map for individual view angles, < ->Indicate->Query vector of noise image of individual view, +.>Represent the firstKey vector of noise image of individual view, +.>Indicate->Value vector of noise image of individual view, +.>Indicate->Query vector of first fused feature map of individual view,/->Indicate->Key vectors of the first fused feature map for each view,indicate->A value vector of a first fused feature map for each view.

As shown in fig. 3, the cross-attention layer also contains a linear sub-layer. And in the cross attention layer, obtaining each first fusion feature map with enhanced consistency constraint among all view angles through the cross attention of each initial feature map and each noise feature map. And obtaining each second fusion feature map with further enhanced consistency constraint among all view angles through the cross attention of each first fusion feature map and each noise feature map.

Firstly, the server inputs each initial feature map and each noise-adding feature map into the cross attention layer, obtains the query vector, the key vector and the value vector corresponding to the initial feature map of each view through the linear sub-layer of the initial feature map of each view, and obtains the query vector, the key vector and the value vector corresponding to the noise-adding feature map of each view through the linear sub-layer of the noise-adding feature map of each view.

And then, the server splices the key vector corresponding to each initial feature map and the key vector corresponding to each noise feature map to obtain a first spliced key vector. And splicing the value vector corresponding to each initial feature map and the value vector corresponding to each noise-added feature map to obtain a first spliced value vector.

And finally, the server carries out cross attention calculation on the first splicing key vector and the first splicing value vector and the query vector of each noise-added feature map respectively to determine a first fusion feature map of each view angle.

Specifically, the first fused feature map may be determined according to the following formula:

wherein,representing a time parameter of +.>Time-corresponding->First fused feature map of individual views, +.>Representing a time parameter of +.>Time-corresponding- >Query vector of noisy feature map for individual view angles, < ->Representing a time parameter of +.>Transpose of the first splice key vector at time, < >>Representing a time parameter of +.>First splice value vector at time, +.>Representing the dimension of the first splice key vector, +.>To activate the function.

Because, in the first splice key vector and the first splice value vector, the initial feature map of each view angle and the two-dimensional space semantics of the noise-added feature map of each view angle are integrated, so that the three-dimensional space semantics among the view angles are formed. The query vector of the noisy feature map of each view angle is subjected to cross attention calculation with the first splicing key vector and the first splicing value vector, and the two-dimensional space semantics in the noisy feature map of each view angle are subjected to semantic enhancement according to the three-dimensional space semantics, so that the enhancement of consistency constraint among the first fusion feature maps of each view angle is realized.

S106: and splicing the key vectors corresponding to the first fusion feature graphs to obtain second spliced key vectors, splicing the value vectors corresponding to the first fusion feature graphs to obtain second spliced value vectors, and respectively carrying out cross attention calculation on the second spliced key vectors and the second spliced value vectors and the query vectors corresponding to the noise-added feature graphs to determine the second fusion feature graphs.

As shown in fig. 3, the server continues to perform cross attention calculation on each first fusion feature map and each noisy feature map, so as to obtain each second fusion feature map with further enhanced consistency constraint among all view angles.

Firstly, the server obtains query vectors, key vectors and value vectors corresponding to the first fusion feature graphs through linear sublayers of the first fusion feature graphs.

And then, the server splices the key vectors corresponding to the first fusion feature graphs to obtain a second spliced key vector. And splicing the value vectors corresponding to the first fusion feature graphs to obtain a second spliced value vector.

And finally, the server carries out cross attention calculation on the second splicing key vector and the second splicing value vector and the query vector of each noise adding feature map respectively to determine each second fusion feature map.

Specifically, the second fused feature map may be determined according to the following formula:

wherein,representing a time parameter of +.>Time-corresponding->Second fused feature map of individual views, +.>Representing a time parameter of +.>Transpose of the second splice key vector at time, < >>Representing a time parameter of +.>A second vector of splice values at the time,representing the dimensions of the second splice key vector.

On the premise of enhancing the consistency constraint realized between the first fusion feature graphs of each view angle, the two-dimensional space semantics in the first fusion feature graphs of each view angle are integrated by splicing key vectors of the first fusion feature graphs of each view angle and splicing value vectors of the first fusion feature graphs of each view angle again to form three-dimensional space semantics among the view angles, and the three-dimensional space semantics at the moment are enhancement of the three-dimensional space semantics determined in S104. Through respectively carrying out cross attention calculation on the query vector of the noisy feature map of each view angle, the second spliced key vector and the second spliced value vector, carrying out semantic enhancement on the two-dimensional spatial semantics in the noisy feature map of each view angle again according to the three-dimensional spatial semantics enhanced by the semantics, and realizing the further enhancement of consistency constraint among the first fused feature maps of each view angle.

S108: and inputting the second fusion feature images, the time parameters and the initial feature images of any view angle into the denoising layer, determining each prediction noise image, and training the generation model according to the difference between the prediction noise image of the view angle and the noise image of the view angle for each view angle.

As shown in fig. 2, the server inputs each second fused feature map, a time parameter and an initial feature map of any view angle into a denoising layer to obtain each prediction noise image.

Fig. 4 is a schematic flow chart of an algorithm of the denoising layer provided in the present specification, and as shown in fig. 4, the denoising layer further includes a predictor. For each view, the server inputs the second fusion feature map of the view, the time parameter and the initial feature map of any view into a predictor of the view to determine a prediction noise image of the view. And determining loss according to the difference between the predicted noise image of the visual angle and the noise image of the visual angle, and training a generating model with the minimum loss as a target.

The initial feature map of any view angle of the denoising layer is input, and the prediction noise of each view angle obtained by the denoising layer is generated by predicting according to the input sample image corresponding to the initial feature map, so that the finally obtained generated image is the image corresponding to each view angle of the sample image.

The loss function of the generative model may be determined according to the following formula:

wherein,representing the number of viewing angles>Representing a time parameter of +.>Time->Prediction noise image of individual view.

As shown in fig. 4, after obtaining each prediction noise image, the server removes, for each view, the prediction noise image of the view from the second fusion feature map of the view at the denoising layer, and obtains a generated image of the view. The generated images of all the visual angles are obtained by denoising in the second fusion feature images with enhanced consistency constraint, so that the two-dimensional space semantics of the generated images of all the visual angles also keep higher consistency, and compared with the input sample images of all the visual angles, the generated images are obtained by carrying out semantic enhancement on the two-dimensional space semantics of all the sample images through cross attention calculation according to the three-dimensional space semantics.

In step S108, the trained denoising layer may generate the image of each view angle corresponding to any sample image in the training dataset according to the initial feature map of the sample image. The initial feature map is a result of feature extraction on the sample image, and includes features of each sample image. That is, the trained denoising layer can identify the characteristics of the sample image, determine what the image of each view angle corresponding to the sample image should be, further obtain the predicted noise image in each second fused feature image, and remove the predicted noise image of each view angle from the second fused feature image of the view angle for each view angle, so as to obtain the generated image of each view angle corresponding to the sample image.

Then, for the denoising layer of the trained generation model, when an initial feature map of the target image which does not appear in the training dataset is input, the denoising layer can still identify the features in the initial feature map, so as to generate the generated image of each view angle corresponding to the target image.

In the training process, the denoising layer removes predicted noise images corresponding to respective view angles from second fusion feature images of the view angles, the second fusion feature images of the view angles are the result of carrying out semantic enhancement on the noise adding feature images of the view angles, and the noise images contained in the noise adding feature images conform to standard Gaussian distribution. Therefore, in the present specification, the trained generation model can generate the generated image for each view angle from the noise image conforming to the standard high distribution.

Therefore, after the model generation training is completed, the server inputs the initial feature image of one target image which does not belong to the training data set, the noise image of each view angle and the time parameter into the denoising layer, and the prediction noise image of each view angle can be obtained, wherein the noise image of each view angle is obtained by randomly sampling from standard Gaussian distribution. And then removing the predicted noise image of each view angle from the noise image of the corresponding view angle, and obtaining the generated image of each view angle corresponding to the target image.

In the multi-view image generation model training method provided by the specification, a generation model to be trained at least comprises a noise adding layer, a cross attention layer and a noise removing layer. The initial feature images, time parameters and noise images of all the visual angles are input into a noise adding layer, so that the noise adding feature images of all the visual angles can be obtained, the noise adding feature images of all the visual angles and the initial feature images of all the visual angles are input into a cross attention layer, the second fusion feature images of all the visual angles are obtained, the second fusion feature images of all the visual angles are input into a noise removing layer, and the prediction noise images of all the visual angles can be obtained.

In the step S106, the second fused feature map is a result of performing semantic enhancement on the noisy feature map of each view according to the three-dimensional spatial semantics, and in order to continue performing semantic enhancement on the second fused feature map of each view according to the two-dimensional spatial semantics of each view, a self-attention layer may be set in the generation model.

In one or more embodiments of the present description, the generative model to be trained further comprises a self-attention layer. And the server inputs each second fusion feature map into the self-attention layer, and obtains query vectors, key vectors and value vectors corresponding to each second fusion feature map through the linear sub-layers of the self-attention layer.

For each view, the server performs self-attention calculation on the query vector, the key vector and the value vector of the second fusion feature map of the view, and determines an enhancement feature map of the view. And then, inputting the enhancement characteristic diagram, the time parameter and the initial characteristic diagram of any view angle into a denoising layer.

In the step S106, vector stitching is performed in the computation process of the first fused feature and the second fused feature in the cross attention layer, so that the dimension of the second fused feature output by the cross attention layer is multiplied compared with the dimension of the initial feature map, and the resolution of the second fused feature map is higher than the resolution of the initial feature map. And the noise image of each view angle, which is initially sampled from the standard gaussian distribution, is synthesized with the initial feature image, i.e., the dimension of the noise image is the same as the dimension of the initial noise image. So in order for the denoising layer to obtain a more accurate prediction noise image, a dimension reduction layer may be provided after the cross-attention layer.

In one or more embodiments of the present description, the generative model to be trained further comprises a dimension reduction layer. And the server inputs the second fusion feature graphs into a dimension reduction layer, and respectively downsamples the second fusion feature graphs to obtain the dimension reduction feature graphs. And then, inputting the dimension reduction feature map, the time parameter and the initial feature map of any view angle into a denoising layer.

Of course, the generative model to be trained may contain both the self-attention layer and the dimension reduction layer. Fig. 5 is a schematic structural diagram of a model to be trained provided in the present specification, where, as shown in fig. 5, the model to be trained further includes a self-attention layer and a dimension-reduction layer.

And the server inputs the enhancement feature graphs of the various visual angles into a dimension reduction layer, and respectively downsamples the enhancement feature graphs of the various visual angles to obtain the dimension reduction feature graphs of the various visual angles. And then, inputting the dimension reduction feature map, the time parameter and the initial feature map of any view angle into a denoising layer.

The foregoing describes a method for training a multi-view image generation model, and in the following, describes how to use the generation model to generate multi-view images after the generation model is trained.

Fig. 6 is a flowchart of an application method of the multi-view image generation model provided in the present specification.

S200: and acquiring a target image, and determining an initial feature map of the target image.

As described in step S108, the denoising layer of the trained generated model may generate the image of each view angle corresponding to the target image according to the initial feature map of the target image that has not appeared in the training dataset.

The server obtains an arbitrary target image, which does not belong to the training dataset. Then, the target image is encoded by an encoder, and an initial feature map of the target image is obtained. As described in step S100, the network structure of the encoder is not particularly limited in this specification.

S202: and determining a time step corresponding to the time parameter, and randomly sampling in standard Gaussian distribution for each view angle to determine a noise image of the view angle.

For a specific method for determining the time parameter, reference may be made to the description of the corresponding content in S102, which is not repeated herein.

Although the generated model only randomly generates one time parameter in each training, the denoising layer of the generated model after the training is completed, which is obtained through a large number of training, can remove the prediction noise image corresponding to each view angle from the noise image of each view angle in any time step corresponding to the time parameter, so as to obtain the generated image of each view angle.

Therefore, during the application of the generative model, for each view, the server randomly samples from the standard gaussian distribution, determining the noise image for that view.

S204: and according to the descending order of the time steps, starting from the time step corresponding to the time parameter, determining the noise image of the visual angle corresponding to each time step, inputting the noise image of the visual angle corresponding to the time step, the time step and the initial feature map into a denoising layer of a trained generation model to obtain a predicted noise image of the visual angle corresponding to the time step, wherein the generation model is obtained by training according to the multi-visual angle image generation model training method.

In the training process, the denoising layer can obtain the predicted noise image in the second fusion feature map at one time, but attention is paid to the training result of the denoising layer on the premise of labeling, namely the noise image applied by each view angle. However, in the application process, the denoising layer may determine the predicted noise image of each view according to the input target image according to learning the training data set, and then remove the predicted noise image of each view from the noise image of the corresponding view, so as to obtain the generated image of each view. However, the prediction noise image in the application process is only a prediction of the generated image of each viewing angle assumed to exist, and it is difficult to eliminate noise at one time, and a clean generated image of each viewing angle is obtained.

Therefore, in the application, an iteration mode is adopted to determine a prediction noise image in each iteration process, noise in the noise image of each view angle obtained by random sampling is gradually removed through each iteration, and finally a clean generated image of each view angle is obtained.

For each view, the server determines a prediction noise image in descending order of time steps, starting from the time step corresponding to the determined time parameter, and iterating once in each time step. For each time step, a noise image of the view corresponding to the time step is determined. And inputting the noise image of the visual angle corresponding to the time step, the initial feature images of the time step and the target image into a denoising layer to obtain a predicted noise image of the visual angle corresponding to the time step.

S206: and according to the time step, removing the predicted noise image of the view angle corresponding to the time step from the noise image of the view angle corresponding to the time step to obtain the noise image of the view angle corresponding to the next time step of the time step.

And the server removes the predicted noise image of the visual angle corresponding to the time step from the noise image of the visual angle corresponding to the time step according to the time step, and obtains the noise image of the visual angle corresponding to the next time step of the time step.

Specifically, the server determines a degree parameter of the time step according to the time step, and removes a predicted noise image of the view angle corresponding to the time step from a noise image of the view angle corresponding to the time step according to the degree parameter to obtain a noise image of the view angle corresponding to the next time step.

S208: and determining the noise image of each view angle obtained in the last time step as a generated image of each view angle corresponding to the target image.

And when the iteration is carried out to the last time step, the server takes the noise image of each view angle obtained in the last time step as a generated image of each view angle corresponding to the target image.

Specifically, from the time step corresponding to the time parameterInitially, the iterative process at each time step can be described by the following formula:

wherein,indicate->Corresponding ∈1->Noise image of individual view angle->Indicate->Corresponding ∈1->Prediction noise image for individual view,/>Indicate->The corresponding degree parameter of each time step +.>Is->Related parameters->=/>。

Note that, the firstThe noise images of each view angle corresponding to each time step are obtained by randomly sampling in standard Gaussian distribution. As the iterative process proceeds, the time steps gradually decrease, and the noise image of each view angle obtained in each time step can gradually show the appearance of the generated image of each view angle. In the last time step of the iteration, the +. >Namely, the generated image of each view angle corresponding to the target image.

The generated images of all the visual angles obtained through the denoising layer of the generated model after training are images with semantically enhanced two-dimensional space semanteme of all the visual angles, and the consistency constraint among all the generated images is strong. The generated images of all the visual angles are input into the trained reconstruction model, so that the three-dimensional scene in the target image can be realistically reconstructed, and then the required images of any visual angles are rendered from the three-dimensional scene, so that the images of any visual angles all present a realistic effect.

The above is a multi-view image generation model training method provided in the present specification, and based on the same thought, the present specification also provides a corresponding multi-view image generation model training device, as shown in fig. 7.

Fig. 7 is a schematic diagram of a multi-view image generation model training device provided in the present specification, where a generation model to be trained at least includes a noise adding layer, a cross attention layer, and a noise removing layer, and the device includes:

the acquisition module 300 acquires sample images of a plurality of view angles and determines an initial feature map of each sample image;

the noise adding module 302 determines a time parameter and a noise image of each view angle, inputs each initial feature image, each noise image and the time parameter into the noise adding layer, synthesizes the noise image of each view angle with the initial feature image of each view angle according to the time parameter for each view angle, and obtains the noise adding feature image of each view angle;

The first cross attention module 304 inputs the initial feature graphs and the noisy feature graphs into the cross attention layer, splices the key vectors corresponding to the initial feature graphs and the key vectors corresponding to the noisy feature graphs to obtain first spliced key vectors, splices the value vectors corresponding to the initial feature graphs and the value vectors corresponding to the noisy feature graphs to obtain first spliced value vectors, and performs cross attention calculation on the first spliced key vectors and the first spliced value vectors and query vectors corresponding to the noisy feature graphs to determine first fusion feature graphs;

the second cross attention module 306 is configured to splice key vectors corresponding to the first fusion feature graphs to obtain second spliced key vectors, splice value vectors corresponding to the first fusion feature graphs to obtain second spliced value vectors, and perform cross attention calculation on the second spliced key vectors and the second spliced value vectors and query vectors corresponding to the noise-added feature graphs respectively to determine second fusion feature graphs;

the denoising module 308 inputs the second fused feature graphs, the time parameter and the initial feature graph of any view angle into the denoising layer, determines each prediction noise image, trains the generating model according to the difference between the prediction noise image of the view angle and the noise image of the view angle for each view angle, and after the generating model is trained, inputs the initial feature graph of the target image, the noise image of each view angle randomly generated and the time parameter into the denoising layer of the trained generating model to obtain the generating image of each view angle corresponding to the target image.

Optionally, the noise adding module 302 is specifically configured to randomly generate a time step within a preset time range, perform random sampling in a standard gaussian distribution for each view angle as the time parameter, determine a noise image of the view angle, define a degree parameter corresponding to each time step within the time range, determine a degree parameter corresponding to the time parameter, determine a synthesis weight of an initial feature image of the view angle and the noise image of the view angle according to the degree parameter, and synthesize the initial feature image of the view angle and the noise image of the view angle according to the synthesis weight, so as to obtain the noise adding feature image of the view angle.

Optionally, the apparatus further includes a self-attention module 310, where the self-attention module 310 is specifically configured to input the second fusion feature maps into the self-attention layer, perform self-attention computation on a query vector, a key vector, and a value vector corresponding to the second fusion feature map of each view angle, determine an enhancement feature map of the view angle, and input each enhancement feature map, the time parameter, and an initial feature map of any view angle into the denoising layer.

Optionally, the apparatus further includes a dimension reduction module 312, where the dimension reduction module 312 is specifically configured to input the second fusion feature maps into the dimension reduction layer, respectively downsample the second fusion features to obtain dimension reduction feature maps, and input the dimension reduction feature maps, the time parameter, and an initial feature map of any view angle into the denoising layer.

Optionally, the denoising layer includes a predictor, and the denoising module 308 is specifically configured to, for each view, input the second fused feature map of the view, the time parameter, and the initial feature map of any view into the predictor of the view, and determine a prediction noise image of the view.

The present disclosure also provides an application apparatus of the multi-view image generation model, as shown in fig. 8.

Fig. 8 is a schematic diagram of an application apparatus of a multi-view image generation model provided in the present specification, where the apparatus includes:

the acquisition module 400 acquires a target image and determines an initial feature map of the target image;

the sampling module 402 determines a time step corresponding to the time parameter, performs random sampling in standard gaussian distribution for each view angle, and determines a noise image of the view angle;

The predicted noise image determining module 404 determines, for each time step, a noise image of the view corresponding to the time step from a time step corresponding to the time parameter according to a descending order of the time steps, and inputs the noise image of the view corresponding to the time step, the time step and the initial feature map into a denoising layer of a trained generation model to obtain a predicted noise image of the view corresponding to the time step, where the generation model is obtained by training according to the multi-view image generation model training method;

the noise image determining module 406 removes, according to the time step, the predicted noise image of the view angle corresponding to the time step from the noise image of the view angle corresponding to the time step, to obtain a noise image of the view angle corresponding to the next time step of the time step;

the generated image determining module 408 determines the noise image of each view angle obtained in the last time step as the generated image of each view angle corresponding to the target image.

The present specification also provides a computer-readable storage medium storing a computer program operable to execute the above-described multi-view image generation model training method provided in fig. 1 or the application method of the multi-view image generation model provided in fig. 6. The present specification also provides a schematic structural diagram of the electronic device shown in fig. 9. At the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, as described in fig. 9, although other hardware required by other services may be included. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the program to realize the multi-view image generation model training method provided in fig. 1 or the application method of the multi-view image generation model provided in fig. 6. Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.

Improvements to one technology can clearly distinguish between improvements in hardware (e.g., improvements to circuit structures such as diodes, transistors, switches, etc.) and software (improvements to the process flow). However, with the development of technology, many improvements of the current method flows can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain corresponding hardware circuit structures by programming improved method flows into hardware circuits. Therefore, an improvement of a method flow cannot be said to be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the programming of the device by a user. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate application-specific integrated circuit chips. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented by using "logic compiler" software, which is similar to the software compiler used in program development and writing, and the original code before the compiling is also written in a specific programming language, which is called hardware description language (Hardware Description Language, HDL), but not just one of the hdds, but a plurality of kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), lava, lola, myHDL, PALASM, RHDL (Ruby Hardware Description Language), etc., VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.

The controller may be implemented in any suitable manner, for example, the controller may take the form of, for example, a microprocessor or processor and a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable logic controllers, and embedded microcontrollers, examples of which include, but are not limited to, the following microcontrollers: ARC 625D, atmel AT91SAM, microchip PIC18F26K20, and Silicone Labs C8051F320, the memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in a pure computer readable program code, it is well possible to implement the same functionality by logically programming the method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller may thus be regarded as a kind of hardware component, and means for performing various functions included therein may also be regarded as structures within the hardware component. Or even means for achieving the various functions may be regarded as either software modules implementing the methods or structures within hardware components.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present application.

Claims

1. A multi-view image generation model training method, wherein a generation model to be trained at least comprises a noise adding layer, a cross attention layer and a noise removing layer, the method comprising:

2. The method of claim 1, wherein determining a time parameter and a noise image for each view, inputting each initial feature map, each noise image, and the time parameter into the noise adding layer, and combining, for each view, the noise image for the view with the initial feature map for the view according to the time parameter to obtain the noise adding feature map for the view, comprising:

3. The method of claim 1, wherein the generative model further comprises a self-attention layer;

4. The method of claim 1, wherein the generative model further comprises a dimension reduction layer;

5. The method of claim 1, wherein the denoising layer comprises a predictor;

6. A method for applying a multi-view image generation model, comprising:

according to the descending order of time steps, starting from the time step corresponding to the time parameter, determining a noise image of the visual angle corresponding to the time step for each time step, inputting the noise image of the visual angle corresponding to the time step, the time step and the initial feature map into a denoising layer of a trained generation model to obtain a predicted noise image of the visual angle corresponding to the time step, wherein the generation model is obtained by training according to the method of any one of claims 1-5;

7. A multi-view image generation model training apparatus, wherein a generation model to be trained includes at least a noise adding layer, a cross attention layer, and a noise removing layer, the apparatus comprising:

8. An application apparatus of a multi-view image generation model, the apparatus comprising:

9. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-6.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-6 when executing the program.