CN117853695B - 3D perception image synthesis method and device based on local spatial self-attention - Google Patents

3D perception image synthesis method and device based on local spatial self-attention

Info

Publication number
CN117853695B
CN117853695B
Authority
CN
China
Prior art keywords
background
module
representation
scene
characteristic field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410261885.4A
Other languages
Chinese (zh)
Other versions
CN117853695A (en)
Inventor
宋成刚
符颖
袁霞
谭诗瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202410261885.4A priority Critical patent/CN117853695B/en
Publication of CN117853695A publication Critical patent/CN117853695A/en
Application granted granted Critical
Publication of CN117853695B publication Critical patent/CN117853695B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Image Generation (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a 3D perception image synthesis method and device based on local spatial self-attention. First, the scene to be synthesized is decoupled by combining the pose information of each object with a predefined 3D bounding box, which effectively overcomes entangled representations between objects. Each object is then represented locally and implicitly by a self-attention local spatial scene characterization module, which achieves global perception within a local range, enhances the expression of fine details and greatly reduces computational complexity. A combination operator is then defined to obtain a complete scene representation. Finally, a rendering module volume-renders the scene into a low-resolution feature map, which a 2D neural renderer that accounts for both spatial and channel factors renders into the final result. The method yields finer synthesis results with low computational complexity, allows rendering from a wide range of camera poses, and can learn the feature field representation from a raw image set without any additional supervision.

Description

3D perception image synthesis method and device based on local spatial self-attention
Technical Field
The invention relates to the field of 3D (three-dimensional) perception image synthesis, in particular to a 3D perception image synthesis method and device based on local spatial self-attention.
Background
3D-aware image synthesis aims at generating 3D-consistent images with explicitly controllable camera poses. In contrast to 2D synthesis, 3D-aware image synthesis requires an understanding of the geometry behind the 2D image, which is typically achieved by incorporating 3D representations into a generative model such as a generative adversarial network (GAN). Given an unstructured set of two-dimensional images, a GAN can learn an unsupervised 3D scene representation from it and, via the learned representation, synthesize multi-view-consistent images from new camera poses.
On this basis, many studies have been developed around the 3D-aware image synthesis task and have achieved encouraging results, but their image quality still falls far behind that of conventional 2D image synthesis. Although voxel-based representation methods can generate interpretable, real 3D representations, they are limited by high memory and computational cost and can only produce low-resolution, coarse results; convolutional methods with deep voxel representations can create fine images, but their dependence on black-box rendering means they cannot guarantee multi-view consistency. Recently, methods based on the implicit representation of Neural Radiance Fields (NeRF) parameterize the geometry (shape) and appearance of a scene or object as the weights of a neural network and then render and composite with an explicit, physics-based volume rendering paradigm, which effectively ensures multi-view consistency of the synthesized image and explicit camera control. However, these methods generally rely on a multi-layer perceptron (MLP) network for parameterization, and the huge number of parameters of the MLP architecture causes high computational complexity and high memory requirements. This greatly limits their expressive power in reconstructing the scene. Due to the inherent limitations of the MLP, these methods struggle with detail expression, which degrades the quality of the synthesized image. In practice, relying on such a representation network architecture with limited expressive power leaves the generated images lacking in realism, granularity and consistency, and there is still a large gap compared with conventional 2D image synthesis. Therefore, further improving the expressive power is key to solving the current problem of poor image synthesis quality.
Further, a generative model should also offer simple and consistent control, rather than being limited to static, uncontrollable multi-view synthesis tasks. In other words, it is desirable to be able to control properties of interest (such as the geometry, size and pose of an object) while keeping other properties unchanged. To achieve this goal, some research has focused on learning decoupled representations without explicit supervision. However, most methods ignore the compositional nature of the scene, are limited to operations in the two-dimensional domain, and ignore the three-dimensional nature of the real world. Such approaches may lead to representation entanglement and require finding control mechanisms in a posterior latent space rather than providing built-in control. In addition, some studies have incorporated three-dimensional representations such as voxels, primitives and radiance fields directly into the generative model. While these methods can achieve impressive results through built-in control, they are generally limited to single-object scenes, and the consistency of the results still needs improvement when processing higher-resolution, more complex and more realistic images (e.g., scenes where objects are off-center or cluttered).
In summary, the prior art has the following disadvantages: 1. entangled representations that ignore the three-dimensional nature of the real world, so that the generated images are neither multi-view consistent nor controllable; 2. reliance on representation network architectures with insufficient expressive power, so that the quality of the generated images is poor.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a 3D perception image synthesis method based on local spatial self-attention, which takes a generative variant of the neural radiance field as the object-level representation in a scene and is trained with a generative adversarial network mechanism, wherein the generator consists of an object decoupling module, a local spatial scene characterization module, a feature field combination module and a rendering module. First, the object decoupling module combines the pose information of each object and its 3D bounding box to decouple the representation of the scene to be synthesized; then, the local spatial scene characterization module uses a self-attention mechanism to locally and implicitly represent each object; the feature field combination module integrates the feature fields of each object and the background by weighted averaging to obtain a 3D representation of the whole scene; finally, the rendering module volume-renders the scene into a low-resolution feature map and then renders it into the final synthesis result through a 2D neural renderer that accounts for both spatial and channel factors; finally, a discriminator performs discrimination. The image synthesis method specifically comprises the following steps:
Step 1: the object decoupling module performs a decoupling operation on each object in the scene, including the background:
Step 11: the object decoupling module acquires a camera intrinsic matrix and a camera extrinsic matrix, obtains a camera pose by randomly sampling from the dataset-dependent uniform distribution of camera elevation angles, and processes it with preset information including the field of view, rotation range, elevation range and radius range;
Step 12: defining 3D bounding boxes of objects: a 3D bounding box is defined for each object, including the background, to capture the extent, position and rotation of each object in camera space;
Step 13: converting between the camera coordinate system and the world coordinate system: the camera origin and each pixel point are converted into the camera coordinate system according to the camera intrinsic matrix obtained in step 11, and then into the world coordinate system according to the camera extrinsic matrix;
Step 14: obtaining a depth value t for each sampling point according to the preset number of sampling points and the near and far boundaries;
Step 15: according to the sampling formula r(x) = o + t·d, where o denotes the camera origin, d denotes the viewing direction and t denotes the depth values of the different sampling points, the spatial position r(x) of every sampling point in the world coordinate system and the corresponding viewing direction d are obtained from the different depth values t;
Step 16: combining the 3D bounding boxes of each object and the background predefined in step 12, each sampling point is converted into the 3D scene space corresponding to each object and to the background;
Step 2: the local spatial scene characterization module adopts a self-attention network as the conditional radiance field, creates a neural feature field for each object and for the background in a locally characterized manner, computes a view-independent volume density and a view-dependent feature map, and characterizes the object entities and the background entity by stacking self-attention blocks, comprising the following steps:
Step 21: positional encoding: the spatial position x of each sampling point and the corresponding viewing direction d are lifted to a higher dimension with a sine-cosine positional encoding function;
Step 22: acquiring the latent codes of each object and the background, wherein the latent codes of an object are the appearance code and geometry code required by its object feature field, randomly sampled from a 256-dimensional Gaussian distribution, and the latent codes of the background feature field are sampled from a 128-dimensional Gaussian distribution;
Step 23: 3D subspace field segmentation: a window of size p is defined to segment the 2D pixel plane into several non-overlapping, uniform pixel blocks, and the feature field generated by the projection of each local pixel block is then represented independently by a feature field network;
Step 24: constructing the object feature field network, which consists of 6 self-attention blocks with a hidden dimension of 128; a skip connection is added, concatenating the positionally encoded output to the output of the 3rd block; the first branch projects the features to a one-dimensional object density value through a fully connected layer; the second branch first lifts the positional encoding and the geometry code uniformly to 128 dimensions through a fully connected layer, fuses them with the intermediate features output by the fifth self-attention block, and then obtains the object feature field representation of each object subspace field after a further self-attention block and fully connected layer;
Step 25: constructing the background feature field network, which consists of 3 self-attention blocks with a hidden dimension of 64; the third branch projects the features to a one-dimensional background density value through a fully connected layer; the fourth branch first lifts the positional encoding and the appearance code to 64 dimensions through a fully connected layer, fuses them with the intermediate features output by the second self-attention block, and then obtains the background feature field representation of each background subspace field after a further self-attention block and fully connected layer;
Step 3: the feature field combination module processes the object density values, object feature field representations, background density value and background feature field representation obtained in step 2, and a combination operator combines the predicted outputs of the individual feature field representations to obtain a combined feature field, i.e., a complete 3D scene representation;
Step 4: the rendering module consists of a volume rendering module and a 2D neural renderer; the combined feature field is rendered into a low-resolution, high-dimensional feature map by the volume rendering module, and the 2D neural renderer then upsamples the low-resolution, high-dimensional feature map into a high-resolution predicted image, with the following specific steps:
Step 41: the volume rendering module uses numerical integration to obtain the overall low-resolution, high-dimensional feature map;
Step 42: the 2D neural renderer maps and upsamples the low-resolution, high-dimensional feature map to a higher-resolution predicted image, and is constructed by combining a two-dimensional convolutional neural network with upsampling techniques;
Step 5: the real images acquired from the real dataset and the predicted images produced by the generator are respectively input into the discriminator network for discrimination, and the loss is calculated to guide the parameter updates of the generator and the discriminator.
A 3D perception image synthesis apparatus based on local spatial self-attention, configured to implement the 3D perception image synthesis method of claim 1, comprises a generator and a discriminator, wherein the generator comprises an object decoupling module, a local spatial scene characterization module, a feature field combination module and a rendering module;
the object decoupling module is used to decouple the representation of the scene to be synthesized by combining the pose information of each object and the background with the 3D bounding boxes;
the local spatial scene characterization module comprises an object feature field module and a background feature field module and is used to locally and implicitly characterize each object and the background to obtain the feature field representation of each object and the background;
the feature field combination module is used to combine the feature field representations of each object and the background through a combination operator to obtain the complete 3D scene representation;
the rendering module is used to render the complete 3D scene representation to obtain a high-resolution predicted image;
and the discriminator discriminates between the predicted image and the real image, calculates their loss values and updates the parameters of the generator and the discriminator.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention introduces an implicit GAN generator based on a self-attention mechanism as a feasible alternative to the fully connected GAN architecture; such a self-attention-based neural radiance field representation encourages multi-view consistency and allows rendering from a wide range of camera poses. In addition, using the self-attention mechanism for local implicit scene representation expresses fine details better than a fully connected network, so the synthesis results are clearer.
2. The invention takes the implicit representation of local space as a key design; by applying the self-attention mechanism within each local space, global perception can be achieved within a local range, the expressive power for fine details is improved, and the computational complexity is greatly reduced.
3. The invention introduces a simple, reasonable and effective way of fusing feature fields directly into the generative model, namely combining the feature fields by density-weighted averaging, which not only effectively overcomes entangled representations between objects but also allows simple control of the synthesized image.
4. The invention constructs a neural renderer that accounts for both spatial and channel factors; this design enables the model to capture and learn image features more comprehensively, ensures stronger gradient flow and avoids unnecessary information loss. While the resolution is gradually increased, the model progressively refines image details through the synergy of convolution operations and upsampling techniques, generating more realistic and finer rendering results.
Drawings
FIG. 1 is a schematic diagram of the structure of an image synthesis network model of the present invention;
FIG. 2 is a schematic diagram of a generator model structure of the present invention;
FIG. 3 is a schematic diagram of the structure of the object feature field network of the present invention;
FIG. 4 is a schematic diagram of the structure of the background feature field network of the present invention;
FIG. 5 is a schematic diagram of the construction of a renderer of the present invention;
FIG. 6 shows the results of the method of the present invention on the CompCars dataset at 64×64 and 256×256 pixel resolutions, respectively;
FIG. 7 shows the decoupling results of the method of the present invention at a resolution of 64×64 pixels;
FIG. 8 is a comparison of controllable synthesis performed on the CompCars dataset by the method of the present invention;
FIG. 9 shows the results of controllable manipulation of an object by the method of the present invention at a resolution of 64×64 pixels;
FIG. 10 is a schematic structural diagram of the image synthesis apparatus of the present invention.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent from the following detailed description of the invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, descriptions of well-known structures and techniques are omitted below so as not to unnecessarily obscure the present invention.
The following detailed description refers to the accompanying drawings.
The object referred to in the present invention is a region of interest in the image, such as a chair in the Chairs dataset or a car in the CompCars dataset.
The background referred to in the invention is the image area outside the object, such as the white area outside the chair in the Chairs dataset, or the sky, trees and ground outside the church in the Churches dataset.
FIG. 1 is a schematic diagram of the image synthesis network model structure of the present invention: spatial point position and view information, geometry codes and appearance codes are input into the generator, and a predicted image is output. FIG. 2 is a schematic diagram of the generator model structure of the present invention. The invention provides a 3D perception image synthesis method based on local spatial self-attention, which takes a generative variant of the neural radiance field (NeRF) as the object-level representation in a scene and is trained with a generative adversarial network mechanism. The generator consists of an object decoupling module, a local spatial scene characterization module, a feature field combination module and a rendering module. First, the object decoupling module combines the pose information of each object with a 3D bounding box to decouple the representation of the scene to be synthesized, so as to overcome entangled representations between objects and allow simple control of the synthesized image. Then, the local spatial scene characterization module uses a self-attention mechanism to locally and implicitly represent each object and the background, achieving global perception within a local range and improving the expression of fine details. The feature field combination module integrates the feature fields of each object and the background by weighted averaging to obtain the three-dimensional characterization of the whole scene. Finally, the rendering module renders the three-dimensional characterization into a low-resolution feature map through the volume rendering module, reducing the computation and rendering time, and the low-resolution, high-dimensional features processed by the volume rendering module are then rendered into the final synthesis result by a 2D neural renderer that accounts for both spatial and channel factors. The specific steps are as follows:
Step 1: the object decoupling module performs a decoupling operation on each object in the scene, including the background, and obtains the appearance and geometry codes of the objects and the background, overcoming entangled representations between objects and thereby allowing simple control of the objects. The specific steps are:
Step 11: the object decoupling module acquires a camera intrinsic matrix and a camera extrinsic matrix; a camera pose is obtained by randomly sampling from the dataset-dependent uniform distribution of camera elevation angles and is then processed with preset information including the field of view, rotation range, elevation range and radius range.
The camera intrinsic matrix, which contains the focal length information, converts image information from the image coordinate system to the camera coordinate system; the camera extrinsic matrix, which contains the rotation, translation and scaling parameters, further converts it to the world coordinate system.
Step 12: defining 3D bounding boxes of objects: for each object, including the background, a 3D bounding box T = {s, t, R} is defined to capture the extent, position and rotation of each object in camera space.
The samples of the 3D bounding box T come from the relevant distribution of the dataset, where s and t represent the scaling and translation parameters, respectively, and R represents the rotation matrix.
Step 13: converting between the camera coordinate system and the world coordinate system: the camera origin and each pixel point are converted into the camera coordinate system according to the camera intrinsic matrix obtained in step 11, and then into the world coordinate system according to the camera extrinsic matrix.
Step 14: the depth value t of each sampling point is obtained according to the preset number of sampling points and the near and far boundaries.
Step 15: according to the sampling formula r(x) = o + t·d, where o denotes the camera origin, d denotes the viewing direction and t denotes the depth values of the different sampling points, the spatial position r(x) of every sampling point in the world coordinate system and the corresponding viewing direction d are obtained from the different depth values t.
Step 16: combining the 3D bounding boxes of each object and the background predefined in step 12, each sampling point is converted into the 3D scene space corresponding to each object and to the background.
A preliminary representation of the decoupled feature fields is thus obtained, in which each feature field simply consists of H×W×Ns spatial points, where H and W are the height and width of the image and Ns is the number of sampling points per ray. Each spatial point in turn consists of a spatial position and a viewing direction, and the scene is assumed to consist of M-1 objects and 1 background. As shown in FIG. 2, the 3D bounding boxes and pose information of object 1, object 2 and the background are input respectively, yielding the 3D scene space of object 1, the 3D scene space of object 2 and the 3D scene space of the background. These three results are input into the local spatial scene characterization module for further processing.
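As an illustration, the following PyTorch sketch shows one possible implementation of steps 13 to 16: a ray is cast through every pixel, depth values t are sampled between the near and far boundaries, sample positions are obtained as r(x) = o + t·d in world coordinates, and each point is then mapped into an object's canonical space using its bounding box T = {s, t, R}. The function names, the camera-to-world matrix layout and the order of the inverse bounding-box transform are assumptions and are not taken from the patent.

```python
import torch

def sample_rays(K_inv, cam2world, H, W, n_samples, near, far):
    """Cast one ray per pixel and sample n_samples depths between near and far.

    K_inv:      (3, 3) inverse camera intrinsic matrix
    cam2world:  (4, 4) camera extrinsic matrix (camera-to-world)
    Returns sample points (H*W, n_samples, 3) in world coordinates and
    ray directions (H*W, 3).
    """
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], -1).float().reshape(-1, 3)  # (H*W, 3)
    dirs_cam = pix @ K_inv.T                         # pixel -> camera coordinates
    dirs_world = dirs_cam @ cam2world[:3, :3].T      # rotate into world coordinates
    origin = cam2world[:3, 3].expand_as(dirs_world)  # camera origin o
    t = torch.linspace(near, far, n_samples)         # depth values t
    # r(x) = o + t * d for every pixel and every depth value
    points = origin[:, None, :] + t[None, :, None] * dirs_world[:, None, :]
    return points, dirs_world

def world_to_object(points, s, t_vec, R):
    """Map world-space points into one entity's canonical space using its
    bounding box T = {s, t, R}; the order of the inverse transform is an assumption."""
    return ((points - t_vec) @ R) / s  # R is orthonormal, so right-multiplying applies R^T
```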
Step 2: the local spatial scene characterization module adopts a self-attention network as the conditional radiance field, creates a neural feature field for each object and for the background in a locally characterized manner, computes a view-independent volume density value and a view-dependent feature map, and characterizes the different object entities and the background entity by stacking different numbers of self-attention blocks.
Each self-attention block consists of layer normalization, single-head self-attention, a multi-layer perceptron (MLP) and feed-forward connections. The network maps the input sampling points and viewing direction d, together with the latent geometry and appearance codes, to one-dimensional density values and feature maps. The specific steps are as follows:
Step 21: positional encoding is applied to the object neural feature field and the background neural feature field: the spatial positions of the sampling points and the corresponding viewing direction d are lifted to a higher dimension with a sine-cosine positional encoding function. This helps the neural network overcome its spectral bias and reproduce the high-frequency details of the input signal. Specifically, the numbers of frequencies are set to 10 and 4, respectively, embedding the inputs into 60-dimensional and 24-dimensional spaces.
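A minimal sketch of the sine-cosine positional encoding of Step 21 is given below: with 10 frequency bands a 3-D position is lifted to 60 dimensions, and with 4 frequency bands a 3-D viewing direction is lifted to 24 dimensions, matching the text. Whether the frequency bands are pure powers of two or include a factor of pi is an assumption.

```python
import torch

def positional_encoding(x, num_freqs):
    """Sine-cosine positional encoding: each coordinate of x is mapped to
    [sin(2^k * x), cos(2^k * x)] for k = 0..num_freqs-1."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)  # 1, 2, 4, ...
    angles = x[..., None] * freqs                              # (..., 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)      # (..., 3, 2*num_freqs)
    return enc.flatten(start_dim=-2)                           # (..., 3 * 2 * num_freqs)

# x_enc = positional_encoding(points, 10)   # 3-D position  -> 60 dimensions
# d_enc = positional_encoding(dirs, 4)      # 3-D direction -> 24 dimensions
```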
Step 22: the latent codes of each object and the background are acquired: the latent codes of an object are the appearance code and geometry code required by its object feature field, randomly sampled from a 256-dimensional Gaussian distribution, while the latent codes of the background feature field are sampled from a 128-dimensional Gaussian distribution.
Step 23: 3D subspace field segmentation: a window of size p is defined to divide the 2D pixel plane into several non-overlapping, uniform pixel blocks, and the feature field generated by the projection of each local pixel block is then represented independently by a feature field network.
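The block partition of Step 23 can be sketched as a simple tensor reshape, assuming that the per-pixel ray samples are stored in row-major image order and that H and W are divisible by p; the function name and tensor layout are illustrative assumptions.

```python
import torch

def partition_into_blocks(features, H, W, p):
    """Split per-pixel ray features into non-overlapping p x p pixel blocks so that
    self-attention can be run independently inside each local block.

    features: (H*W, n_samples, C) sample features in row-major image order.
    Returns:  (num_blocks, p*p*n_samples, C), one token sequence per local block.
    """
    n_samples, C = features.shape[1], features.shape[2]
    f = features.reshape(H // p, p, W // p, p, n_samples, C)  # split rows and columns into blocks
    f = f.permute(0, 2, 1, 3, 4, 5)                           # (H/p, W/p, p, p, n_samples, C)
    return f.reshape(-1, p * p * n_samples, C)
```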
FIG. 3 is a schematic diagram of the structure of the object feature field network of the present invention; fig. 4 is a schematic diagram of the structure of the background feature field network of the present invention.
Step 24: the object feature field network is constructed from 6 self-attention blocks with a hidden dimension of 128; a skip connection is added, concatenating the positionally encoded output to the output of the 3rd block to improve reconstruction quality. The first branch projects the features to a one-dimensional object density value through a fully connected layer; the second branch first lifts the positional encoding and geometry code uniformly to 128 dimensions through a fully connected layer, fuses them with the intermediate features output by the fifth self-attention block, and then obtains the object feature field representation of each object subspace field after a further self-attention block and fully connected layer.
Step 25: the background feature field network is constructed; because of the lower complexity of the background, 3 self-attention blocks are used with a hidden dimension of 64. The third branch projects the features to a one-dimensional background density value through a fully connected layer; the fourth branch first lifts the positional encoding and appearance code to 64 dimensions through a fully connected layer, fuses them with the intermediate features output by the third self-attention block, and then obtains the background feature field representation of each background subspace field after a further self-attention block and fully connected layer.
Processing flow of the self-attention block: each self-attention block consists of layer normalization, single-head self-attention, a multi-layer perceptron (MLP) and feed-forward connections. Specifically, the self-attention block performs self-attention over the channels of the features corresponding to each sampling point in the sub-radiance field. First, the input is layer-normalized and the single-head self-attention mechanism computes the attention scores. The first feed-forward connection then combines the input with the output of the single-head self-attention; the result is layer-normalized and processed by a fully connected layer; finally, the second feed-forward connection fuses the output of the single-head self-attention with the output of the fully connected layer to give the final output of the block. As shown in FIG. 2, the local spatial scene characterization module applies appearance and geometry encoding to the input decoupled information of object 1, object 2 and the background respectively, performs block-wise local spatial sampling, and then feeds the object feature field network and the background feature field network to obtain the object feature field representations and the background feature field representation.
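Below is a minimal PyTorch sketch of one self-attention block as described above. The two feed-forward connections are interpreted here as element-wise additions, and the MLP is reduced to a single fully connected layer; both choices are assumptions, since the patent does not spell out the exact fusion operation.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """One block of the feature-field network: layer normalization, single-head
    self-attention, a fully connected layer, and two feed-forward connections
    (interpreted as residual-style additions)."""
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (num_blocks, tokens, dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)     # single-head self-attention
        y = x + attn_out                     # first feed-forward connection
        z = self.mlp(self.norm2(y))          # layer normalization, then fully connected layer
        return attn_out + z                  # second feed-forward connection
```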
Step 3: the object density values, object feature field representations, background density value and background feature field representation obtained in step 2 are input into the feature field combination module, and the predicted outputs of the individual feature field representations are combined into a complete 3D scene representation by a combination operator. Specifically, the density values output by the different feature fields at the corresponding sampling points are directly summed to obtain the overall density value, and the object feature fields and the background feature field are combined by density-weighted averaging to obtain the complete feature field representation. This simple and intuitive combination ensures that the gradient can flow to every entity whose density value is greater than 0, thereby avoiding slow or stalled training caused by vanishing gradients.
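A minimal sketch of the combination operator of Step 3, assuming the densities and features of the M entities (objects plus background) are stacked along a leading dimension; the small epsilon that guards against division by zero is an implementation assumption.

```python
import torch

def combine_fields(densities, features):
    """Combination operator: per-sample densities are summed and the feature
    fields are merged by a density-weighted average.

    densities: (M, N, 1)  density of each of the M entities at N sample points
    features:  (M, N, C)  corresponding feature-field outputs
    """
    sigma = densities.sum(dim=0)                      # overall density per sample point
    weights = densities / (sigma.unsqueeze(0) + 1e-8) # avoid division by zero
    feat = (weights * features).sum(dim=0)            # density-weighted average of features
    return sigma, feat                                # (N, 1), (N, C)
```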
Step 4: the rendering module consists of a volume rendering module and a 2D neural renderer and is used to render the combined feature field into the predicted image: the combined feature field is first rendered into a low-resolution, high-dimensional feature map by a classical volume rendering module, and the 2D neural renderer then upsamples the low-resolution, high-dimensional feature map into the high-resolution predicted image.
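The volume rendering step can be sketched with the standard NeRF-style quadrature, i.e. alpha compositing of per-sample densities and features along each ray. The patent only states that numerical integration is used, so the exact quadrature below is an assumption.

```python
import torch

def volume_render(sigma, feat, t_vals):
    """Numerical integration along rays.

    sigma:  (R, S)    densities of S samples on R rays
    feat:   (R, S, C) high-dimensional features of the samples
    t_vals: (S,)      sample depths shared by all rays
    Returns a (R, C) feature vector per ray, which is reshaped into the
    low-resolution, high-dimensional feature map.
    """
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, torch.full((1,), 1e10)])  # last interval is open-ended
    alpha = 1.0 - torch.exp(-sigma * deltas)              # opacity of each interval
    trans = torch.cumprod(torch.cat(
        [torch.ones(sigma.shape[0], 1), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                               # contribution of each sample
    return (weights.unsqueeze(-1) * feat).sum(dim=1)      # (R, C)
```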
As shown in FIG. 5, the renderer is constructed by combining a two-dimensional convolutional neural network with upsampling techniques. In particular, the design fully integrates spatial-level and channel-level factors to progressively increase the image resolution. The secondary branch combines convolution with spatial upsampling and emphasizes sensitivity to the spatial information of the image: the convolution operations capture the local structure and the spatial relations between pixels, effectively ensuring the spatial enhancement and detail preservation of the spatial upsampling. The main branch highlights the processing of channel information and introduces the PixelShuffle upsampling technique proposed in the super-resolution field, achieving high-resolution conversion by rearranging channels. In the invention, a convolution operation maps the feature map of a given stage to an RGB image, and each channel is then rearranged independently by PixelShuffle upsampling to achieve 2x upsampling. At each spatial resolution, the secondary branch output is merged with the main branch output and added to the output of the next stage.
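The following sketch illustrates one possible reading of the two-branch renderer of FIG. 5: a main branch that maps features to RGB and doubles the resolution with PixelShuffle, and a secondary branch that refines the feature map with convolution and spatial upsampling, with the RGB outputs of successive stages accumulated. The number of stages, the channel widths and the exact wiring between stages are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralRenderer2D(nn.Module):
    """Sketch of a 2D neural renderer with a channel-oriented main branch
    (conv -> PixelShuffle) and a space-oriented secondary branch
    (conv -> bilinear upsampling), accumulating RGB across stages."""
    def __init__(self, feat_dim=128, n_stages=2):
        super().__init__()
        self.feat_convs = nn.ModuleList()
        self.rgb_convs = nn.ModuleList()
        dim = feat_dim
        for _ in range(n_stages):
            self.feat_convs.append(nn.Conv2d(dim, dim // 2, 3, padding=1))  # secondary branch conv
            self.rgb_convs.append(nn.Conv2d(dim, 3 * 4, 3, padding=1))      # main branch: RGB x (2x2 sub-pixels)
            dim = dim // 2
        self.shuffle = nn.PixelShuffle(2)  # channel rearrangement -> 2x upsampling

    def forward(self, feat):               # feat: (B, feat_dim, h, w)
        rgb = None
        for feat_conv, rgb_conv in zip(self.feat_convs, self.rgb_convs):
            up_rgb = self.shuffle(rgb_conv(feat))           # main branch: 2x RGB via PixelShuffle
            feat = F.interpolate(feat_conv(feat), scale_factor=2, mode="bilinear",
                                 align_corners=False)       # secondary branch: conv + spatial upsampling
            rgb = up_rgb if rgb is None else up_rgb + F.interpolate(
                rgb, scale_factor=2, mode="bilinear", align_corners=False)  # merge with previous stage
        return rgb
```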
Step 5: the real images acquired from the real dataset and the predicted images produced by the generator are respectively input into the discriminator network for discrimination, and the loss is calculated to guide the parameter updates of the generator and the discriminator.
The discriminator is a convolutional neural network with Leaky ReLU activations. For the experiments at 64×64 and 256×256 pixel resolutions, five and seven convolution layers are used respectively, each halving the spatial resolution while increasing the number of channels, until the final convolution reduces the high channel count directly to a single-channel value representing the prediction. The network is trained with a non-saturating GAN loss with R1 regularization.
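A hedged sketch of the discriminator described above: a stack of strided convolutions with Leaky ReLU activations that halve the spatial resolution and increase the channel count, followed by a final convolution that produces a single prediction value. The kernel sizes, channel counts and output head are assumptions.

```python
import torch.nn as nn

def make_discriminator(in_ch=3, base_ch=64, n_layers=5):
    """Convolutional discriminator sketch: n_layers strided convolutions with
    Leaky ReLU (5 layers for 64x64 inputs, 7 layers for 256x256), each halving
    the resolution, followed by a final convolution down to a single value."""
    layers, ch, out_ch = [], in_ch, base_ch
    for _ in range(n_layers):
        layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch, out_ch = out_ch, min(out_ch * 2, 512)
    layers.append(nn.Conv2d(ch, 1, 2))  # final conv: high channel count -> single prediction value
    return nn.Sequential(*layers)
```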
To further illustrate the effectiveness of the proposed method, the quantitative metric adopts the Fréchet Inception Distance (FID), which is widely used for quality assessment of generative networks and measures the distance between the feature vectors of real and predicted images. Specifically, 20000 real and generated samples are used to compute the FID score; the smaller the FID value, the better the synthesis.
The present invention synthesizes images at 64×64 and 256×256 pixel resolutions, respectively. In all experiments, the camera is placed on a sphere of radius 2.732, with near and far boundaries of 0.5 and 6, respectively; the window size p of the local representation is set to 4×4 and 16 points are sampled along each ray. To determine when to stop training, following common practice, the FID score is evaluated every 10000 iterations.
Publicly available single-object datasets such as Cats, Chairs, CelebA and FFHQ are of limited complexity: the background is pure white or occupies only a small part of the image. Therefore, the present invention makes experimental comparisons on the more challenging public single-object datasets CompCars and LSUN Churches. Since the object in the CompCars and LSUN Churches datasets is not always centered, the background is also more complex and variable and occupies a larger portion of the image, making the synthesis task more difficult. In particular, for the CompCars dataset, the present invention introduces more variation in the position of objects in the image by randomly cropping the images.
FIG. 6 shows partial results generated by randomly sampled rendering on the CompCars dataset at 64×64 (FIG. 6(a)) and 256×256 (FIG. 6(b)) pixel resolution, illustrating that the method of the present invention is effective and can render from arbitrary camera poses to synthesize high-quality images.
The invention is also compared with some existing image synthesis algorithms, using the FID benchmark scores evaluated by the different networks on each dataset. The baselines include the ResNet-based method 1: 2D-GAN; the voxel-based methods 2: BlockGAN and 3: HoloGAN; and the radiance-field-based methods 4: GRAF and 5: GIRAFFE. The results are shown in Tables 1 and 2.
Table 1 FID score comparison of different methods at 64×64 pixel resolution
Table 2 FID score comparison of different methods at 256×256 pixel resolution
FIG. 7 shows the decoupling results of the method of the present invention when synthesizing at 64×64 pixel resolution on the Cats, Chairs, CelebA, CompCars and Churches datasets. The first column of FIG. 7 shows the Cats dataset results, the second column the Chairs dataset results, the third column the CelebA dataset results, the fourth column the CompCars dataset results and the fifth column the Churches dataset results. The first to third rows of FIG. 7 present the objects, the backgrounds and the alpha maps of the objects in the scene, respectively, and the fourth row shows the final synthesis results. As can be seen from FIG. 7, the method of the present invention successfully achieves decoupled representations of object and background without any supervision; and even when the training data almost exclusively contain images with objects, the learned model can generate a plausible and reliable background, implicitly performing an inpainting role.
The invention is further compared with method 4 (GRAF) for controllable synthesis on the complex CompCars dataset, as shown in FIG. 8: FIG. 8(a) is the GRAF synthesis result and FIG. 8(b) is the result synthesized by the method of the invention. As can be seen from FIG. 8, on a complex scene with a cluttered background, the method of the present invention effectively rotates the object against a fixed background thanks to the decoupled representation of the object.
FIG. 9 illustrates the controllable synthesis of 64×64 pixel resolution images on the CompCars dataset by the method of the present invention: FIG. 9(a) shows changing the appearance of the object, FIG. 9(b) changing the geometry of the object, FIG. 9(c) rotating the object, FIG. 9(d) changing the camera elevation, FIG. 9(e) translating the object horizontally, FIG. 9(f) changing the background, FIG. 9(g) adding an object, and FIG. 9(h) translating the object longitudinally. As can be seen from FIG. 9, the method of the present invention effectively achieves appearance and geometry modification, rotation and camera elevation changes for objects in the scene. In addition, for high-complexity datasets with cluttered backgrounds, it further achieves lateral and longitudinal translation of an individual object of interest in 3D space, as well as replacement of the scene background, which indicates that the proposed model has excellent controllability. It is worth noting that when the background is changed while the object is kept unchanged, the corresponding object in the scene adaptively changes with the background brightness, which further indicates that the model is robust and realistic in complex scene processing. On the other hand, the invention further explores multi-object synthesis on the CompCars dataset; as shown in FIG. 9(g), merging the combined 3D representation into the generative model also enables the proposed model to add objects beyond the training distribution to the scene.
Unlike multi-view supervised training, the invention trains the generative model from an unstructured, pose-free monocular image dataset, alleviating problems such as overfitting and poor generalization caused by densely capturing a single scene. The invention generalizes to new viewpoints to complete novel-view synthesis tasks, and its strong expressive power also allows view-consistent, controllable and high-fidelity image synthesis under sparse sampling (16 sampling points). The invention performs the 3D perception image synthesis task well on simple datasets, and its expressive power is even more prominent on real datasets of higher complexity.
The invention also provides a 3D perception image synthesis device based on local spatial self-attention, the structure of which is shown in FIG. 10. The synthesis device is used to implement the above 3D perception image synthesis method and comprises a generator and a discriminator, wherein the generator comprises an object decoupling module, a local spatial scene characterization module, a feature field combination module and a rendering module. Specifically,
the object decoupling module is used to decouple the representation of the scene to be synthesized by combining the pose information of each object and the background with the 3D bounding boxes.
The local spatial scene characterization module comprises an object feature field module and a background feature field module and is used to locally and implicitly characterize each object and the background to obtain the feature field representation of each object and the background.
The feature field combination module is used to combine the feature field representations of each object and the background through a combination operator to obtain the complete 3D scene representation.
The rendering module is used to render the complete 3D scene representation to obtain a predicted image.
The discriminator discriminates between the predicted image and the real image, calculates their loss values, and updates the parameters of the generator and the discriminator.
It should be noted that the above-described embodiments are exemplary, and that a person skilled in the art, in light of the present disclosure, may devise various alternative solutions that fall within the scope of the present invention. It should be understood by those skilled in the art that the present description and drawings are illustrative and do not limit the claims. The scope of the invention is defined by the claims and their equivalents.

Claims (2)

1. A 3D perception image synthesis method based on local spatial self-attention, characterized in that the image synthesis method takes a generative variant of the neural radiance field as the object-level representation in a scene and is trained with a generative adversarial network mechanism, wherein the generator consists of an object decoupling module, a local spatial scene characterization module, a feature field combination module and a rendering module; first, the object decoupling module combines the pose information of each object and its 3D bounding box to decouple the representation of the scene to be synthesized; then, the local spatial scene characterization module uses a self-attention mechanism to locally and implicitly represent each object; the feature field combination module integrates the feature fields of each object and the background by weighted averaging to obtain a 3D representation of the whole scene; finally, the rendering module volume-renders the scene into a low-resolution feature map and then renders it into the final predicted image through a 2D neural renderer that accounts for both spatial and channel factors; finally, a discriminator performs discrimination; the image synthesis method specifically comprises the following steps:
Step 1: the object decoupling module performs a decoupling operation on each object in the scene, including the background:
Step 11: the object decoupling module acquires a camera intrinsic matrix and a camera extrinsic matrix; specifically, a camera pose is first obtained by randomly sampling from the dataset-dependent uniform distribution of camera elevation angles and is then processed with preset information including the field of view, rotation range, elevation range and radius range;
Step 12: defining 3D bounding boxes of objects: a 3D bounding box is defined for each object, including the background, thereby capturing the extent, position and rotation of each object in camera space;
Step 13: converting between the camera coordinate system and the world coordinate system: the camera origin and each pixel point are converted into the camera coordinate system according to the camera intrinsic matrix obtained in step 11, and then into the world coordinate system according to the camera extrinsic matrix;
Step 14: obtaining a depth value t for each sampling point according to the preset number of sampling points and the near and far boundaries;
Step 15: according to the sampling formula r(x) = o + t·d, where o denotes the camera origin, d denotes the viewing direction and t denotes the depth values of the different sampling points, the spatial position r(x) of every sampling point in the world coordinate system and the corresponding viewing direction d are obtained from the different depth values t;
Step 16: combining the 3D bounding boxes of each object and the background predefined in step 12, each sampling point is converted into the 3D scene space corresponding to each object and to the background;
Step 2: the local spatial scene characterization module adopts a self-attention network as the conditional radiance field, creates a neural feature field for each object and for the background in a locally characterized manner, computes a view-independent volume density and a view-dependent feature map, and characterizes the object entities and the background entity by stacking self-attention blocks, comprising the following steps:
Step 21: positional encoding: the spatial positions of the sampling points and the corresponding viewing direction d are lifted to a higher dimension with a sine-cosine positional encoding function;
Step 22: acquiring the latent codes of each object and the background, wherein the latent codes of an object are the appearance code and geometry code required by its object feature field, randomly sampled from a 256-dimensional Gaussian distribution, and the latent codes of the background feature field are sampled from a 128-dimensional Gaussian distribution;
Step 23: 3D subspace field segmentation: a window of size p is defined to segment the 2D pixel plane into several non-overlapping, uniform pixel blocks, and the feature field generated by the projection of each local pixel block is then represented independently by a feature field network;
Step 24: constructing the object feature field network, which consists of 6 self-attention blocks with a hidden dimension of 128; a skip connection is added, concatenating the positionally encoded output to the output of the 3rd block; the first branch projects the features to a one-dimensional object density value through a fully connected layer; the second branch first lifts the positional encoding and the geometry code uniformly to 128 dimensions through a fully connected layer, fuses them with the intermediate features output by the fifth self-attention block, and then obtains the object feature field representation of each object subspace field after a further self-attention block and fully connected layer;
Step 25: constructing the background feature field network, which consists of 3 self-attention blocks with a hidden dimension of 64; the third branch projects the features to a one-dimensional background density value through a fully connected layer; the fourth branch first lifts the positional encoding and the appearance code to 64 dimensions through a fully connected layer, fuses them with the intermediate features output by the second self-attention block, and then obtains the background feature field representation of each background subspace field after a further self-attention block and fully connected layer;
Step 3: the feature field combination module processes the object density values, object feature field representations, background density value and background feature field representation obtained in step 2, and a combination operator combines the predicted outputs of the individual feature field representations to obtain a combined feature field, i.e., a complete 3D scene representation;
Step 4: the rendering module consists of a volume rendering module and a 2D neural renderer; the combined feature field is rendered into a low-resolution, high-dimensional feature map by the volume rendering module, and the 2D neural renderer then upsamples the low-resolution, high-dimensional feature map into a high-resolution predicted image, with the following specific steps:
Step 41: the volume rendering module uses numerical integration to obtain the overall low-resolution, high-dimensional feature map;
Step 42: the 2D neural renderer maps and upsamples the low-resolution, high-dimensional feature map to a higher-resolution predicted image, and is constructed by combining a two-dimensional convolutional neural network with upsampling techniques;
Step 5: the real images acquired from the real dataset and the predicted images produced by the generator are respectively input into the discriminator for discrimination, and the loss is calculated to guide the parameter updates of the generator and the discriminator.
2. A 3D perception image synthesis device based on local spatial self-attention, characterized in that the image synthesis device is used to implement the 3D perception image synthesis method according to claim 1 and comprises a generator and a discriminator, wherein the generator comprises an object decoupling module, a local spatial scene characterization module, a feature field combination module and a rendering module;
the object decoupling module is used to decouple the representation of the scene to be synthesized by combining the pose information of each object and the background with the 3D bounding boxes;
the local spatial scene characterization module comprises an object feature field module and a background feature field module and is used to locally and implicitly characterize each object and the background to obtain the feature field representation of each object and the background;
the feature field combination module is used to combine the feature field representations of each object and the background through a combination operator to obtain the complete 3D scene representation;
the rendering module is used to render the complete 3D scene representation to obtain a high-resolution predicted image;
and the discriminator discriminates between the predicted image and the real image, calculates their loss values and updates the parameters of the generator and the discriminator.
CN202410261885.4A 2024-03-07 2024-03-07 3D perception image synthesis method and device based on local spatial self-attention Active CN117853695B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410261885.4A CN117853695B (en) 2024-03-07 2024-03-07 3D perception image synthesis method and device based on local spatial self-attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410261885.4A CN117853695B (en) 2024-03-07 2024-03-07 3D perception image synthesis method and device based on local spatial self-attention

Publications (2)

Publication Number Publication Date
CN117853695A CN117853695A (en) 2024-04-09
CN117853695B true CN117853695B (en) 2024-05-03

Family

ID=90534764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410261885.4A Active CN117853695B (en) 2024-03-07 2024-03-07 3D perception image synthesis method and device based on local spatial self-attention

Country Status (1)

Country Link
CN (1) CN117853695B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416840B (en) * 2018-03-14 2020-02-18 大连理工大学 Three-dimensional scene dense reconstruction method based on monocular camera
JP2023073231A (en) * 2021-11-15 2023-05-25 三星電子株式会社 Method and device for image processing

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069656A (en) * 2019-03-28 2019-07-30 天津大学 A method of threedimensional model is retrieved based on the two-dimension picture for generating confrontation network
CN110310345A (en) * 2019-06-11 2019-10-08 同济大学 A kind of image generating method generating confrontation network based on hidden cluster of dividing the work automatically
WO2021146497A1 (en) * 2020-01-15 2021-07-22 Pingkun Yan Trackerless 2d ultrasound frame to 3d image volume registration
CN111739168A (en) * 2020-06-30 2020-10-02 华东交通大学 Large-scale three-dimensional face synthesis method with suppressed sample similarity
CN112418074A (en) * 2020-11-20 2021-02-26 重庆邮电大学 Coupled posture face recognition method based on self-attention
CN113298047A (en) * 2021-06-29 2021-08-24 北京邮电大学 3D form and posture estimation method and device based on space-time correlation image
CN116152334A (en) * 2021-11-15 2023-05-23 北京三星通信技术研究有限公司 Image processing method and related equipment
CN114240753A (en) * 2021-12-17 2022-03-25 平安医疗健康管理股份有限公司 Cross-modal medical image synthesis method, system, terminal and storage medium
CN117333682A (en) * 2022-06-21 2024-01-02 天津大学 Multi-view three-dimensional reconstruction method based on self-attention mechanism
CN116129073A (en) * 2022-12-06 2023-05-16 闽江学院 Classroom scene three-dimensional reconstruction method based on GIRAFFE
CN116612211A (en) * 2023-05-08 2023-08-18 山东省人工智能研究院 Face image identity synthesis method based on GAN and 3D coefficient reconstruction
CN116682082A (en) * 2023-05-30 2023-09-01 浙江大学 Vehicle digital twin method suitable for automatic driving scene
CN116993788A (en) * 2023-06-16 2023-11-03 成都信息工程大学 Multi-frame intelligent agent-based multi-mode medical image flexible registration method
CN117372644A (en) * 2023-10-20 2024-01-09 成都信息工程大学 Three-dimensional content generation method based on period implicit representation
CN117409140A (en) * 2023-10-23 2024-01-16 北京大学 Controllable layout three-dimensional scene representation and generation method based on large language model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection; P. Bhattacharyya et al.; 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW); 2021-11-24; full text *
Model-Driven Fast Construction of 3D Scenes; Tan Xiaohui et al.; Journal of System Simulation; 2013-10-08 (No. 10); full text *
Image Inpainting Combining Feature Adjustment and Joint Self-Attention; Peng Hao et al.; Computer Engineering and Applications; 2022-06-14; Vol. 59, No. 19; full text *

Also Published As

Publication number Publication date
CN117853695A (en) 2024-04-09

Similar Documents

Publication Publication Date Title
Skorokhodov et al. Epigraf: Rethinking training of 3d gans
Wang et al. Nerf-sr: High quality neural radiance fields using supersampling
Olszewski et al. Transformable bottleneck networks
Wu et al. Learning sheared EPI structure for light field reconstruction
US11818327B2 (en) Layered scene decomposition codec with volume rendering
Alsaiari et al. Image denoising using a generative adversarial network
Wu et al. Revisiting light field rendering with deep anti-aliasing neural network
CN110381268B (en) Method, device, storage medium and electronic equipment for generating video
Zhang et al. Implicit neural representation learning for hyperspectral image super-resolution
Kwon et al. Resolution-enhancement for an integral imaging microscopy using deep learning
Zhu et al. Multi-stream fusion network with generalized smooth L 1 loss for single image dehazing
Hani et al. Continuous object representation networks: Novel view synthesis without target view supervision
Le-Tien et al. Deep learning based approach implemented to image super-resolution
Tran et al. 3DVSR: 3D EPI volume-based approach for angular and spatial light field image super-resolution
Gao A method for face image inpainting based on generative adversarial networks
CN117036581B (en) Volume rendering method, system, equipment and medium based on two-dimensional nerve rendering
Rohith et al. Super-resolution based deep learning techniques for panchromatic satellite images in application to pansharpening
Han et al. Inference-reconstruction variational autoencoder for light field image reconstruction
Lu et al. A lightweight generative adversarial network for single image super-resolution
CN116912148B (en) Image enhancement method, device, computer equipment and computer readable storage medium
Geng et al. Cervical cytopathology image refocusing via multi-scale attention features and domain normalization
Debbagh Neural Radiance Fields (NeRFs): A Review and Some Recent Developments
CN117372644A (en) Three-dimensional content generation method based on period implicit representation
CN117853695B (en) 3D perception image synthesis method and device based on local spatial self-attention
Gupta et al. A robust and efficient image de-fencing approach using conditional generative adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant