CN117058049A - New view image synthesis method, synthesis model training method and storage medium - Google Patents

New view image synthesis method, synthesis model training method and storage medium

Info

Publication number
CN117058049A
CN117058049A (Application CN202310489076.4A)
Authority
CN
China
Prior art keywords
image
depth
feature
features
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310489076.4A
Other languages
Chinese (zh)
Other versions
CN117058049B (en)
Inventor
黄晋
莫智雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Tuyu Information Technology Co ltd
Original Assignee
Guangzhou Tuyu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Tuyu Information Technology Co ltd filed Critical Guangzhou Tuyu Information Technology Co ltd
Priority to CN202310489076.4A priority Critical patent/CN117058049B/en
Publication of CN117058049A publication Critical patent/CN117058049A/en
Application granted granted Critical
Publication of CN117058049B publication Critical patent/CN117058049B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application provides a new view angle image synthesis method, a synthesis model training method and a storage medium. The new view angle image synthesis method constructs a first depth probability body and a geometric feature body based on the image features of the initial images, the camera parameters and the sampling depths, and applies a two-dimensional convolution to the first depth probability body to obtain a first uncertainty map. With the first depth probability body and the first uncertainty map, an uncertainty-guided sampling strategy based on depth prediction and uncertainty awareness can be realized. Combined with coarse-to-fine determination of sampling points and extraction of their corresponding features, the method can predict the scene depth probability and filter out accurate key points, avoiding interference from empty sampling points; this helps reduce rendering time and therefore the synthesis time of new view angle images. Meanwhile, based on reasonable sampling guided by the uncertainty map, the embodiment of the application can generate higher-quality new view angle images even under sparse views.

Description

New view image synthesis method, synthesis model training method and storage medium
Technical Field
The application relates to the technical field of image processing, in particular to a new view angle image synthesis method, a synthesis model training method and a storage medium.
Background
New view synthesis (Novel view synthesis, NVS) is an important area of research in computer graphics and computer vision, where NVS can synthesize realistic images at target viewing angles from a given set of input images and corresponding camera poses. In recent years, NVS has been widely used in various application fields including virtual tourism, television, sports broadcasting, and the like.
With the development of deep learning, learning-based NVS methods have significantly improved the quality of synthesized images. Among them, the Neural Radiance Field (NeRF), as a new implicit three-dimensional representation, has achieved great success in rendering high-quality new view angle images. However, NeRF needs to query a large number of sampling points during rendering, resulting in a slow synthesis speed for new view angle images.
Disclosure of Invention
The embodiment of the application provides a new view angle image synthesis method, a synthesis model training method and a storage medium, which are used to solve the problem of the slow synthesis speed of new view angle images in the related art.
In order to solve the technical problems, the application is realized as follows:
in a first aspect, an embodiment of the present application provides a new view angle image synthesis method, including:
Acquiring a plurality of initial images, camera parameters corresponding to the initial images and a plurality of sampling depths;
respectively extracting image features of each initial image, and constructing a first depth probability body and a geometric feature body based on the image features of the initial images, the camera parameters and the sampling depths, wherein each voxel in the first depth probability body corresponds to a predicted depth probability, and the voxel features of the geometric feature body are obtained by integrating the image features of the plurality of initial images;
performing two-dimensional convolution on the first depth probability body to obtain a first uncertainty diagram;
determining coarse sampling points according to the first depth probability body and the first uncertainty diagram;
acquiring coarse sampling point features corresponding to the coarse sampling points from the image features and the geometric feature body of the initial image;
determining a fine sampling point, a ray characteristic map and a depth map based on the coarse sampling point characteristics;
acquiring fine sampling point characteristics corresponding to the fine sampling points from the image characteristics of the initial image;
and synthesizing a new view angle image according to the ray characteristic diagram, the depth diagram and the fine sampling point characteristic.
In a second aspect, the embodiment of the application further provides a new view angle image synthesis model training method, which includes the following steps:
Constructing a new view angle image synthesis network architecture, wherein the new view angle image synthesis network architecture is configured with a loss function;
acquiring a training sample set, wherein the training sample set comprises a plurality of sample image sets and a real image serving as a labeling result of each sample image set;
inputting the image set into a new view angle image synthesis network architecture, and outputting a predicted image;
and updating network parameters of the new view angle image synthesis network architecture based on the loss value calculated by the real image and the predicted image through the loss function until the loss value of the loss function converges to obtain a new view angle image synthesis model after training, wherein the new view angle image synthesis model after training is used for realizing the new view angle image synthesis method shown in the first aspect.
In a third aspect, an embodiment of the present application further provides a new view angle image synthesis apparatus, including:
the first acquisition module is used for acquiring a plurality of initial images, camera parameters corresponding to the initial images and a plurality of sampling depths;
the extraction and construction module is used for respectively extracting the image features of each initial image, and constructing a first depth probability body and a geometric feature body based on the image features of the initial images, the camera parameters and the sampling depths, wherein each voxel in the first depth probability body corresponds to a predicted depth probability, and the voxel features of the geometric feature body are obtained by integrating the image features of the plurality of initial images;
The convolution module is used for carrying out two-dimensional convolution on the first depth probability body to obtain a first uncertainty diagram;
the first determining module is used for determining coarse sampling points according to the first depth probability body and the first uncertainty diagram;
the second acquisition module is used for acquiring coarse sampling point features corresponding to the coarse sampling points from the image features and the geometric feature bodies of the initial image;
the second determining module is used for determining a fine sampling point, a ray characteristic map and a depth map based on the coarse sampling point characteristics;
the third acquisition module is used for acquiring fine sampling point characteristics corresponding to the fine sampling points from the image characteristics of the initial image;
and the synthesis module is used for synthesizing the new view angle image according to the ray characteristic diagram, the depth diagram and the fine sampling point characteristic.
In a fourth aspect, an embodiment of the present application further provides a training device for a new view angle image synthesis model, including:
the construction module is used for constructing a new view angle image synthesis network architecture, and the new view angle image synthesis network architecture is configured with a loss function;
the fourth acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of sample image sets and real images serving as labeling results of the sample image sets;
The output module is used for inputting the image set into a new view angle image synthesis network architecture and outputting a predicted image;
the training module is used for updating network parameters of the new view angle image synthesis network architecture based on the loss value calculated by the real image and the predicted image through the loss function until the loss value of the loss function is converged to obtain a new view angle image synthesis model after training, and the new view angle image synthesis model after training is used for realizing the new view angle image synthesis method.
In a fifth aspect, an embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method described above when executing the computer program.
In a sixth aspect, embodiments of the present application further provide a computer readable storage medium storing a computer program, which when executed by a processor, implements the above method.
In the embodiment of the application, a first depth probability body and a geometric feature body are constructed based on the image features of the initial images, the camera parameters and the sampling depths, and a first uncertainty map is obtained by applying a two-dimensional convolution to the first depth probability body. With the first depth probability body and the first uncertainty map, an uncertainty-guided sampling strategy based on depth prediction and uncertainty awareness can be realized to obtain coarse sampling points serving as key points, and geometric prediction can be realized by obtaining the geometric feature body. Coarse sampling point features corresponding to the coarse sampling points are acquired from the image features of the initial images and from the geometric feature body, the fine sampling points can be determined using the coarse sampling point features, and a new view angle image is synthesized according to the ray feature map, the depth map and the fine sampling point features. By using the uncertainty-aware sampling strategy together with geometric prediction, the embodiment of the application predicts the scene depth probability and filters out accurate key points, avoiding interference from empty sampling points; these strategies help reduce rendering time and thus the synthesis time of new view angle images. Meanwhile, based on reasonable sampling guided by the uncertainty map, the embodiment of the application can generate higher-quality new view angle images even under sparse views.
Drawings
FIG. 1 is a flow chart of a new view angle image synthesizing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a new view angle image synthesizing method according to an embodiment of the present application;
FIG. 3 is a schematic illustration of the uncertainty-aware sampling strategy and the details of full-resolution rendering in an embodiment of the present application;
FIG. 4 is a flow chart of a new view angle image synthesis model training method provided by an embodiment of the application;
FIG. 5 is a qualitative comparison example of new view image composition quality for different new view composition models on different data sets;
fig. 6 is a diagram of a qualitative comparison example of generalization results based on qualitative ablation analysis.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments. In the following description, specific details such as specific configurations and components are provided merely to facilitate a thorough understanding of embodiments of the application. It will therefore be apparent to those skilled in the art that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the application. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a" or "an" and the like do not denote a limitation of quantity, but rather denote the presence of at least one.
As shown in fig. 1, the new view angle image synthesis method provided by the embodiment of the application includes:
step 101, acquiring a plurality of initial images, camera parameters corresponding to the initial images and a plurality of sampling depths;
Step 102, respectively extracting image features of each initial image, and constructing a first depth probability body and a geometric feature body based on the image features of the initial images, the camera parameters and the sampling depths, wherein each voxel in the first depth probability body corresponds to a predicted depth probability, and the voxel features of the geometric feature body are obtained by integrating the image features of the plurality of initial images;
step 103, carrying out two-dimensional convolution on the first depth probability body to obtain a first uncertainty diagram;
Step 104, determining coarse sampling points according to the first depth probability body and the first uncertainty diagram;
step 105, obtaining coarse sampling point features corresponding to the coarse sampling points from the image features and the geometric feature body of the initial image;
step 106, determining a fine sampling point, a radial characteristic map and a depth map based on the coarse sampling point characteristics;
step 107, acquiring fine sampling point features corresponding to the fine sampling points from the image features of the initial image;
and step 108, synthesizing a new view angle image according to the ray characteristic diagram, the depth diagram and the fine sampling point characteristic.
In the embodiment of the application, a first depth probability body and a geometric feature body are constructed based on the image features of the initial images, the camera parameters and the sampling depths, and a first uncertainty map is obtained by applying a two-dimensional convolution to the first depth probability body. With the first depth probability body and the first uncertainty map, an uncertainty-guided sampling strategy based on depth prediction and uncertainty awareness can be realized to obtain coarse sampling points serving as key points, and geometric prediction can be realized by obtaining the geometric feature body. Coarse sampling point features corresponding to the coarse sampling points are acquired from the image features of the initial images and from the geometric feature body, the fine sampling points can be determined using the coarse sampling point features, and a new view angle image is synthesized according to the ray feature map, the depth map and the fine sampling point features. By using the uncertainty-aware sampling strategy together with geometric prediction, the embodiment of the application predicts the scene depth probability and filters out accurate key points, avoiding interference from empty sampling points; these strategies help reduce rendering time and thus the synthesis time of new view angle images. Meanwhile, based on reasonable sampling guided by the uncertainty map, the embodiment of the application can generate higher-quality new view angle images even under sparse views.
The new view angle image synthesis method provided by the application is specifically described below with reference to some embodiments.
In step 101, each acquired initial image corresponds to a set of camera parameters, where the camera parameters may describe the viewing angle at which the initial image was captured and may specifically include the intrinsic and extrinsic parameters of the camera. In general, the plurality of initial images may include one reference view image, with the remaining initial images being source view images. For convenience of description, the set of source view images may be denoted {I_i}, i = 1, ..., Nv, where I_i is the i-th source view image and Nv is the total number of source view images; correspondingly, the initial image serving as the reference view image may be denoted I_0.
Feature extraction may be performed for each initial image to obtain corresponding image features, where such feature extraction may be performed using a convolutional neural network, or, in some specific application examples, may be performed using a feature pyramid network (Feature Pyramid Network, FPN) to extract a feature map of a preset size, where the image features of the initial image may be present in the feature map.
For ease of illustration, the feature map may be regarded as a feature image made of pixels, and by matching features, matched pixels on the feature images of different initial images can be determined. For example, for the initial images I_0 and I_1, a pixel A on the feature image I_F0 of I_0 and a pixel B on the feature image I_F1 of I_1 may be matched.
The position of pixel A in I_0 and the position of pixel B in I_1 may be different. If a homography transformation matrix is determined and used to transform I_F1 into a warped feature image I'_F1, and the position of the transformed pixel B in I'_F1 is the same as the position of pixel A in the feature image I_F0, then the homography transformation matrix is a relatively accurate transformation matrix. The homography transformation matrix is mainly determined by the camera parameters and the depth of the pixel, where the camera parameters are known and the depth of the pixel is to be obtained.
Based on the above analysis, a plurality of sampling depths are obtained in step 101. For each sampling depth, a homography transformation matrix can be determined from that depth and the camera parameters, and each homography transformation matrix can be used to warp the feature image I_F1 into a warped feature image I'_F1. The more similar I'_F1 and I_F0 are (for example, the matched pixel locations on the feature images are the same or close), the more accurate the homography matrix is and the more accurate the adopted sampling depth is. Conversely, the larger the difference between I'_F1 and I_F0, the less suitable the sampling depth. The difference between I'_F1 and I_F0 can therefore be expressed as the cost value of the adopted sampling depth.
The above description takes a single pixel as an example. Since I_F1 contains many pixels, a plurality of cost values are formed over I_F1 for each sampling depth, which together form a cost map.
Further, the cost map formed between I_F0 and I_F1 is a cost map between two feature images. When there are multiple initial images, the information of the multiple viewing angles also needs to be fused, for example by mean, direct concatenation or variance operations, so that the cost maps corresponding to the feature maps of all I_i are fused into a single fused cost map.
Still further, only one sampling depth is described above, one cost map may be constructed by one sampling depth, and a three-dimensional cost aggregate may be formed by constructing a plurality of cost maps corresponding to a plurality of sampling depths. The three-dimensional cost aggregate may be used to further construct a first depth probability volume.
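For illustration only, the following Python sketch shows one common way such a variance-based cost aggregate can be assembled from homography-warped source-view features; it is a simplified sketch under assumptions (the function names, the 3x4 reference-to-source projection matrices with intrinsics folded in, and the tensor shapes are not taken from the patent text).

```python
# Hypothetical sketch: variance-based cost volume from warped source-view features.
import torch
import torch.nn.functional as F

def homography_warp(src_feat, proj, depth_values):
    """Warp a source feature map (B,C,H,W) to the reference view for each sampling depth.

    proj: (B,3,4) assumed reference-to-source projection (intrinsics folded in).
    depth_values: (B,D) sampling depths.  Returns (B,C,D,H,W).
    """
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]
    dev = src_feat.device
    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32, device=dev),
                          torch.arange(W, dtype=torch.float32, device=dev), indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)]).view(1, 3, -1).expand(B, 3, H * W)
    pts = pix.unsqueeze(2) * depth_values.view(B, 1, D, 1)            # (B,3,D,HW)
    cam = proj[:, :, :3] @ pts.reshape(B, 3, -1) + proj[:, :, 3:4]    # (B,3,D*HW)
    xy = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)                     # perspective division
    gx = xy[:, 0] / ((W - 1) / 2) - 1                                 # normalize to [-1, 1]
    gy = xy[:, 1] / ((H - 1) / 2) - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)  # (B,C,D*H,W)
    return warped.view(B, C, D, H, W)

def build_cost_volume(src_feats, projs, depth_values):
    """Fuse warped source features into a variance-based cost volume (B,C,D,H,W)."""
    volumes = [homography_warp(f, p, depth_values) for f, p in zip(src_feats, projs)]
    stacked = torch.stack(volumes)               # (Nv,B,C,D,H,W)
    return stacked.var(dim=0, unbiased=False)    # variance across the Nv source views
```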
The construction process of the first depth probability volume is described in detail below.
Optionally, extracting the image feature of each initial image separately includes:
inputting the initial image into a feature pyramid network to extract a plurality of feature maps with different sizes, wherein the feature maps comprise a first feature map, a second feature map and a third feature map with sequentially increased sizes, and the image features of the initial image are positioned in the feature maps.
Each initial image I_i, according to its height H, width W and number of channels, can be expressed as I_i ∈ R^{H×W×3}, where the channel number 3 corresponds to the three RGB color channels.
As shown in FIG. 2, I_i is input to the FPN, and three feature maps of different sizes can be extracted: F_{i,1} ∈ R^{H/4×W/4×C1}, F_{i,2} ∈ R^{H/2×W/2×C2} and F_{i,3} ∈ R^{H×W×C3}.
The multi-scale characteristics can be obtained based on the FPN, so that the embodiment of the application can estimate the scene depth probability by using a multi-scale geometric predictor and sample the key points close to the scene surface, thereby avoiding the interference of empty sampling points.
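As a hedged illustration of this step, the sketch below builds a small FPN-style extractor that returns three feature maps at the sizes stated above; the backbone layout itself is an assumption, and only the channel numbers C1=32, C2=16, C3=8 follow the text.

```python
# Hypothetical sketch of multi-scale feature extraction with a small FPN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, c1=32, c2=16, c3=8):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, c3, 3, 1, 1), nn.ReLU(True))    # H x W
        self.enc2 = nn.Sequential(nn.Conv2d(c3, c2, 3, 2, 1), nn.ReLU(True))   # H/2 x W/2
        self.enc3 = nn.Sequential(nn.Conv2d(c2, c1, 3, 2, 1), nn.ReLU(True))   # H/4 x W/4
        self.lat2, self.lat1 = nn.Conv2d(c2, c2, 1), nn.Conv2d(c3, c3, 1)      # lateral convs
        self.red2, self.red1 = nn.Conv2d(c1, c2, 1), nn.Conv2d(c2, c3, 1)      # channel reducers

    def forward(self, img):                        # img: (B, 3, H, W)
        e1 = self.enc1(img)                        # (B, C3, H,   W)
        e2 = self.enc2(e1)                         # (B, C2, H/2, W/2)
        e3 = self.enc3(e2)                         # (B, C1, H/4, W/4)
        f1 = e3
        f2 = self.lat2(e2) + F.interpolate(self.red2(f1), size=e2.shape[-2:])  # top-down fusion
        f3 = self.lat1(e1) + F.interpolate(self.red1(f2), size=e1.shape[-2:])
        return f1, f2, f3                          # F_{i,1}, F_{i,2}, F_{i,3}
```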
Optionally, constructing the first depth probability volume and the geometric feature volume based on the image features of the initial image, the camera parameters, and the sampling depth includes:
downsampling is carried out from the first feature mapping corresponding to each initial image respectively to obtain a fourth feature mapping;
according to the fourth feature mapping corresponding to the plurality of initial images, determining the cost at each sampling depth to obtain a cost aggregate;
and regularizing the cost aggregate to obtain a first depth probability body.
As shown in fig. 2, in this embodiment, the acquisition of the first depth probability volume may be performed using a Multi-view stereo algorithm (MVS) or a MVS-like method.
Specifically, F_{i,1} is downsampled to obtain a fourth feature map, denoted F_wi ∈ R^{H/8×W/8×C1}. Coarse 3D scene geometry can be predicted using existing MVS methods: the 2D image features are mapped into a plane-sweep volume by homography to construct a cost volume, and the warped features F_wi from each input view are then aggregated using a variance-based approach. The resulting cost aggregate, referred to above as the three-dimensional cost aggregate, is essentially a cost volume after multi-view image fusion.
Similar to existing MVS methods, this embodiment regularizes the cost volume under the target view (corresponding to the reference view image) with a 3D UNet to predict a depth probability volume, which is the first depth probability volume mentioned above and can be denoted P_mvs ∈ R^{H/8×W/8×Nd}, where Nd is the number of sampling depths, also known as the number of depth planes.
The geometric feature body may store geometric features, such as rays and colors, by means of voxels, and may be implemented with prior-art techniques such as SurfaceNet, which are not described in detail here. Hereinafter the geometric feature body is denoted by F_voxel.
Unlike the existing MVS method, in this embodiment, the first depth probability body is further convolved in two dimensions to obtain a first uncertainty map.
As indicated above, the first depth probability volume may be denoted P_mvs. In one example, a shallow 2D CNN S_c is used to infer the first uncertainty map U_mvs, expressed as:
U_mvs = S_c(P_mvs)
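A minimal sketch of such a shallow 2D CNN is given below, assuming the Nd depth planes of P_mvs are treated as input channels; the layer sizes and the Softplus output are assumptions.

```python
# Hypothetical sketch: a shallow 2D CNN S_c mapping the depth probability volume
# P_mvs (B, Nd, H/8, W/8) to a single-channel uncertainty map U_mvs.
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    def __init__(self, num_depths=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_depths, 32, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Softplus(),   # keep the uncertainty non-negative
        )

    def forward(self, p_mvs):          # the Nd depth planes act as channels
        return self.net(p_mvs)         # (B, 1, H/8, W/8)
```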
according to the first depth probability volume P mvs And a first uncertainty map U mvs Coarse sampling points, simply referred to as coarse sampling points, may be estimated. The process of estimating coarse sample points may be considered as a process of inferring keypoints based on an uncertainty-aware sampling strategy, which may also be applicable in some embodiments to subsequent determinations of fine sample points. Therefore, the following description of the uncertainty-aware sampling strategy may be given with reference to some general formulas applicable to each sampling point stage, and it should be emphasized that the size of the image adopted in the expression of the general formulas is not representative of the limitation of the relative size between the feature image and the initial image in the embodiment of the present application.
The depth is usually estimated by unimodal distribution in existing methods, but they tend to ignore small objects and boundary areas of abrupt depth changes. To solve this problem, a depth probability volume P (e.g., a first depth probability volume P mvs ) And its corresponding depth plane L, the initial depth sampling point X E R is calculated according to the inverse transformation sampling strategy H×W×N Subsequently, two additional points X near each initial sampling point can be calculated using uncertainty un It is defined as follows:
X un =X±clamp(U,0,1)×d inter (2)
where U represents an uncertainty map, such as the first uncertainty map U described above mvs The clamp is a clamping function, which is used to limit the value of U to between 0 and 1, d in equation (2) inter Is the depth interval, and X un ∈R H×W×N×2 Is a set of sampling points based on uncertainty, hereinafter X un The coarse and fine sampling points may be separated according to different image processing stages.
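The following sketch illustrates equation (2) together with inverse transform sampling from the per-pixel depth distribution; the function name and the uniform-random draw are assumptions, since the patent text does not specify how the inverse transform is discretized.

```python
# Hypothetical sketch of uncertainty-guided sampling as in equation (2).
import torch

def uncertainty_guided_samples(prob, depth_planes, uncertainty, n_samples, d_inter):
    """prob: (B, Nd, H, W) depth probabilities; depth_planes: (Nd,) depth values;
    uncertainty: (B, 1, H, W); returns X_un of shape (B, H, W, n_samples, 2)."""
    B, Nd, H, W = prob.shape
    pdf = prob.permute(0, 2, 3, 1)                                   # (B,H,W,Nd)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = (cdf / cdf[..., -1:].clamp(min=1e-8)).contiguous()
    u = torch.rand(B, H, W, n_samples, device=prob.device)           # uniform draws
    idx = torch.searchsorted(cdf, u).clamp(max=Nd - 1)               # inverse-transform sampling
    x = depth_planes.to(prob.device)[idx]                            # initial depth samples X
    offset = torch.clamp(uncertainty, 0, 1).permute(0, 2, 3, 1) * d_inter   # (B,H,W,1)
    return torch.stack([x - offset, x + offset], dim=-1)             # the two extra points X_un
```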
After X_un is obtained, a feature fusion operation can further be performed by fusing the image features of each initial image so as to render the target view image. For example, using the method described in the existing IBRNet, each sampling point can be projected onto the image feature map F_{i,n} to extract the corresponding pixel-aligned features f_{i,n}; a pooling network φ_fusion(f_{1,n}, f_{2,n}, ..., f_{Nv,n}) is then used to aggregate these features and output the image point feature. In the above expression, n denotes the scale of the feature map; for example, n may be 2 when processing the coarse sampling points, in which case F_{i,n} corresponds to F_{i,2} above, and the value of u may be 1 or 2, corresponding to the two additional points near each initial sampling point.
Finally, according to the uncertainty points and the corresponding point features, a network based on the mean and variance can be adopted to fuse them:
f_img = φ_mv(μ(f), σ²(f)) (3)
where μ(·) and σ²(·) denote the mean and variance of the point features of the uncertainty points, and φ_mv is a two-layer perceptron; the output image feature f_img will be used in the rendering process.
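A possible reading of formula (3) is sketched below as mean/variance pooling over per-view point features followed by a two-layer perceptron; the class name and feature dimensions are assumptions.

```python
# Hypothetical sketch of a mean/variance based fusion of per-view point features.
import torch
import torch.nn as nn

class MeanVarFusion(nn.Module):
    def __init__(self, feat_dim=16, out_dim=16):
        super().__init__()
        self.phi_mv = nn.Sequential(               # two-layer perceptron
            nn.Linear(2 * feat_dim, out_dim), nn.ReLU(True),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, per_view_feats):
        """per_view_feats: (Nv, Npts, C) pixel-aligned features from Nv source views."""
        mean = per_view_feats.mean(dim=0)                     # (Npts, C)
        var = per_view_feats.var(dim=0, unbiased=False)       # (Npts, C)
        return self.phi_mv(torch.cat([mean, var], dim=-1))    # image point feature f_img
```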
Some general processing flows are briefly described above, and the description of the determination of coarse sampling points and the processing procedure is returned to below.
For example, applying equation (2) specifically to the step of determining the coarse sampling point according to the first depth probability volume and the first uncertainty map may be specifically described as:
based on the first depthProbability degree volume P mvs First depth probability volume P mvs The sampling depth (corresponding to the depth plane L) of each layer in the (a) is subjected to inverse transformation sampling to obtain an initial sampling point (corresponding to an initial depth sampling point X);
combining the first uncertainty map with a predetermined depth interval (corresponding to d inter ) Two neighboring points (corresponding to X are determined for each initial sampling point un ) The neighboring points serve as coarse sampling points. For purposes of distinction, the coarse sample point may be hereinafter denoted as X un,c
Optionally, acquiring the coarse sampling point feature corresponding to the coarse sampling point from the image feature and the geometric feature of the initial image includes:
Projecting the coarse sampling points to a second feature map to obtain image point features corresponding to the coarse sampling points;
based on the tri-linear interpolation, voxel features corresponding to the coarse sampling points are obtained from the geometric feature body;
the coarse sampling point features comprise image point features corresponding to the coarse sampling points and voxel features corresponding to the coarse sampling points.
In one example, the embodiment of the present application employs the uncertainty-aware sampling strategy, using P_mvs and U_mvs to calculate the coarse sampling points X_un,c ∈ R^{H/8×W/8×Nc×2}. These points X_un,c are then upsampled by a factor of two and projected onto the second feature map F_{i,2} to obtain the image point features corresponding to the coarse sampling points.
Furthermore, for each coarse sampling point, trilinear interpolation may be used to acquire the corresponding voxel feature f_voxel from the geometric feature body F_voxel; these voxel features carry the geometric information of the image.
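By way of illustration, trilinear sampling from the geometric feature body can be written with a 3D grid_sample call as in the sketch below; the normalization convention of the query points is an assumption.

```python
# Hypothetical sketch: trilinear interpolation of per-point features from a voxel grid.
import torch
import torch.nn.functional as F

def sample_voxel_features(feature_volume, points_norm):
    """feature_volume: (B, C, D, H, W) geometric feature body F_voxel.
    points_norm: (B, N, 3) sample positions normalized to [-1, 1] (x, y, z order).
    Returns per-point voxel features of shape (B, N, C)."""
    grid = points_norm.view(points_norm.shape[0], 1, 1, -1, 3)       # (B,1,1,N,3)
    sampled = F.grid_sample(feature_volume, grid,
                            mode="bilinear",         # trilinear for 5-D inputs
                            align_corners=True)      # (B,C,1,1,N)
    return sampled.squeeze(2).squeeze(2).permute(0, 2, 1)            # (B,N,C)
```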
To achieve coarse to fine sampling, optionally, determining fine sampling points based on coarse sampling point features includes:
adopting a network based on the mean value and the variance to perform feature perception on the image point features corresponding to the coarse sampling points to obtain image features for rendering;
performing feature perception on the image features and voxel features for rendering to obtain color features and densities corresponding to the coarse sampling points, wherein the color features are used for rendering a new view angle image;
Fine sampling points are determined from the density.
In combination with formula (3) above, applied specifically to the coarse sampling points, the feature fusion method is used to aggregate the image feature f_img,c of each coarse sampling point from the feature map F_{i,2}; f_img,c corresponds to the image feature for rendering and is the image feature f_img obtained at the coarse sampling points.
Feature perception is then performed on the image features for rendering and on the voxel features to obtain the color feature f_r and the density σ corresponding to each coarse sampling point.
For instance, in one example, based on f_img,c and f_voxel, an MLP network can be used to generate a color feature f_r and a density σ for each coarse sampling point:
(f_r, σ) = φ_MLP(f_img,c, f_voxel)
The color feature f_r is used for new view angle image rendering, while the density σ can be used to further determine the fine sampling points.
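A hedged sketch of such a per-point MLP is shown below; the hidden sizes and the non-negativity constraint on the density are assumptions.

```python
# Hypothetical sketch of the per-point MLP: fused image feature + voxel feature in,
# color feature f_r and density sigma out.
import torch
import torch.nn as nn

class PointHead(nn.Module):
    def __init__(self, img_dim=16, vox_dim=8, hidden=64, color_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + vox_dim, hidden), nn.ReLU(True),
            nn.Linear(hidden, color_dim + 1),
        )

    def forward(self, f_img_c, f_voxel):            # (N, img_dim), (N, vox_dim)
        out = self.mlp(torch.cat([f_img_c, f_voxel], dim=-1))
        f_r, sigma = out[..., :-1], torch.relu(out[..., -1])   # density kept non-negative
        return f_r, sigma
```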
In one embodiment, determining fine sampling points based on density includes:
tracking light rays passing through individual pixels of the initial image;
determining the depth probability distribution of each ray based on the number of coarse sampling points on each ray, the density corresponding to the coarse sampling points, the distance between two adjacent coarse sampling points and the cumulative transmittance;
constructing a second depth probability body according to the depth probability distribution of each ray;
performing two-dimensional convolution on the second depth probability body to obtain a second uncertainty diagram;
And determining fine sampling points according to the second depth probability body and the second uncertainty map.
In one specific application, the density estimates are used to build a fine-sampling prior in order to predict detailed scene geometry. Using existing ray tracing means in geometry, the ray passing through each pixel can be traced and the depth probability τ_k of each point on it calculated:
τ_k = T_k · (1 - exp(-σ_k · δ_k)), with T_k = exp(-Σ_{j<k} σ_j · δ_j)
where k indexes the coarse sampling points on each ray, δ_k represents the distance between two adjacent coarse sampling points, and T_k denotes the cumulative transmittance when reaching the k-th coarse sampling point. Following the same construction principle as the first depth probability volume P_mvs and the first uncertainty map U_mvs above, τ_k can be used to construct a depth probability volume, denoted the second depth probability volume P_nerf, and a second uncertainty map U_nerf can be inferred through a 2D CNN S_f. Subsequently, P_nerf and U_nerf are used to calculate the fine sampling points X_un,f ∈ R^{H/4×W/4×Nf×2} (where Nf may be equal to 1). The general procedure of the uncertainty-aware sampling strategy described above applies equally to the determination of the fine sampling points and is not repeated here.
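The sketch below shows the standard way such per-ray weights τ_k can be computed from densities and sample spacings; it assumes the conventional volume-rendering formulation and is not taken verbatim from the patent.

```python
# Hypothetical sketch: per-ray volume-rendering weights tau_k from densities and distances.
import torch

def ray_weights(sigma, delta):
    """sigma, delta: (num_rays, K) densities and distances between adjacent samples.
    Returns tau: (num_rays, K) depth probabilities along each ray."""
    alpha = 1.0 - torch.exp(-sigma * delta)                       # opacity of each segment
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)            # running transmittance
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)  # T_k
    return trans * alpha                                          # tau_k
```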
Accordingly, as shown in fig. 2, acquiring the fine sampling point feature corresponding to the fine sampling point from the image feature of the initial image includes:
The fine sampling points are projected onto the third feature map F_{i,3} to obtain the fine sampling point features corresponding to the fine sampling points.
In addition, in the embodiment of the present application, a ray feature map and a depth map are also determined. In one example, a ray feature map F_r(r) and a depth map D(r) are calculated for each ray r, defined as follows:
F_r(r) = Σ_k τ_k · f_r,k,  D(r) = Σ_k τ_k · δ'_k
where δ'_k represents the distance from the coarse sampling point k to the camera center, and f_r,k denotes the point feature of ray r at the coarse sampling point k.
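Continuing the illustration, the ray feature map and depth map can then be composed as τ-weighted sums along each ray, as in the following sketch.

```python
# Hypothetical sketch: ray feature map F_r(r) and depth map D(r) as tau-weighted sums.
import torch

def render_ray_feature_and_depth(tau, point_feats, dist_to_cam):
    """tau: (num_rays, K); point_feats: (num_rays, K, C); dist_to_cam: (num_rays, K)."""
    feat = (tau.unsqueeze(-1) * point_feats).sum(dim=1)   # F_r(r): (num_rays, C)
    depth = (tau * dist_to_cam).sum(dim=1)                # D(r):   (num_rays,)
    return feat, depth
```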
In addition, as shown in FIG. 2, in order for the network to accurately learn the low-resolution information, a rendering network may be used to perform volume rendering and generate a low-resolution image.
As shown in FIG. 2, in one example, the ray feature map F_r, the depth map D and the uncertainty-based fine sampling points X_un,f are first upsampled by a factor of four. Using the feature fusion method, the image feature f_img,f of each fine sampling point (corresponding to the fine sampling point feature) is also calculated from the feature map F_{i,3}. In order to provide guidance on the 3D structure, the depth map D may be encoded with a position embedding and fed into a 2D CNN to infer a depth feature map D', designed as follows:
D' = φ_depth(γ(D)) (7)
where γ(·) represents the position embedding, and φ_depth may be a four-layer ResNet.
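For illustration, a sinusoidal position embedding γ and a small stand-in for φ_depth could look as follows; the number of frequencies and the plain convolutional stack (in place of an actual four-layer ResNet) are assumptions.

```python
# Hypothetical sketch of formula (7): position embedding of the depth map followed by a CNN.
import math
import torch
import torch.nn as nn

def gamma(depth, num_freqs=4):
    """depth: (B, 1, H, W) -> (B, 2*num_freqs, H, W) sinusoidal embedding."""
    freqs = [2.0 ** i * math.pi for i in range(num_freqs)]
    feats = [fn(f * depth) for f in freqs for fn in (torch.sin, torch.cos)]
    return torch.cat(feats, dim=1)

phi_depth = nn.Sequential(                      # stand-in for the four-layer ResNet
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(True),   # 8 = 2 * num_freqs input channels
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(True),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(True),
    nn.Conv2d(32, 16, 3, padding=1),
)
# D_prime = phi_depth(gamma(D))   # depth feature map injected into the image decoder
```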
Optionally, synthesizing the new view angle image according to the ray characteristic diagram, the depth diagram and the fine sampling point characteristic comprises the following steps:
Inputting the fine sampling point characteristics and the ray characteristic diagram into a texture characteristic generation network to obtain image textures;
performing two-dimensional convolution on the depth map to obtain a depth feature map;
and inputting the image texture and the depth feature map to an image decoder, and outputting a new view angle image, wherein the image decoder comprises a plurality of residual nested dense blocks, and the depth feature map is injected into each residual nested dense block.
As shown in fig. 3, in combination with an application example, the new view angle image synthesis method provided by the application mainly includes (a) an uncertainty perception sampling strategy and (b) detailed information of full resolution rendering.
Wherein (a) key points are calculated from the depth volume using inverse transform sampling and two additional 3D points near each initial sampling point are calculated by uncertainty. (b) In rendering, the ray feature map and image features (corresponding fine sample point features) are fed into an image decoder.
The image decoder includes a plurality of Residual nested dense blocks (RRDB), and the method of the present application injects a depth map into each RRDB block.
In one example, the depth (corresponding to D), ray (corresponding to F_r) and image features (corresponding to f_img,f) are input to an image decoder to generate the target view image.
The image decoder is a network with two residual nested dense blocks (RRDB); instead of concatenating the depth features with the ray and image features, the depth features are injected into each block.
After the rough scene geometry and information are calculated, the application uses the image decoding network to render the final full resolution target view image. Rendering high quality images from low resolution features is a challenging task. The application can effectively recover the high-frequency details in rendering by integrating the local information near the pixels.
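A hedged sketch of such a decoder is given below: the depth feature map is injected into every block rather than concatenated once at the input. The block internals, channel sizes and residual scaling are assumptions and only mimic the RRDB idea.

```python
# Hypothetical sketch: residual nested dense blocks with depth-feature injection.
import torch
import torch.nn as nn

class DepthInjectedBlock(nn.Module):
    def __init__(self, ch=64, depth_ch=16):
        super().__init__()
        self.inject = nn.Conv2d(ch + depth_ch, ch, 1)      # fold depth features into the block
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x, d_feat):
        h = self.inject(torch.cat([x, d_feat], dim=1))
        return x + 0.2 * self.body(h)                      # residual scaling as in RRDB variants

class Decoder(nn.Module):
    def __init__(self, in_ch=32, depth_ch=16, ch=64, n_blocks=2):
        super().__init__()
        self.head = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.blocks = nn.ModuleList([DepthInjectedBlock(ch, depth_ch) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, ray_and_img_feat, depth_feat):
        x = self.head(ray_and_img_feat)
        for blk in self.blocks:
            x = blk(x, depth_feat)                         # depth injected into every block
        return torch.sigmoid(self.tail(x))                 # full-resolution RGB image
```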
As shown in fig. 4, the embodiment of the application further provides a new view angle image synthesis model training method, which includes:
step 401, constructing a new view angle image synthesis network architecture, wherein the new view angle image synthesis network architecture is configured with a loss function;
step 402, obtaining a training sample set, wherein the training sample set comprises a plurality of sample image sets and real images serving as labeling results of the sample image sets;
step 403, inputting the image set into a new view angle image synthesis network architecture, and outputting a predicted image;
and step 404, updating network parameters of the new view angle image synthesis network architecture based on the loss values calculated by the real image and the predicted image through the loss function until the loss values of the loss function are converged to obtain a new view angle image synthesis model after training, wherein the new view angle image synthesis model after training is used for realizing the new view angle image synthesis method.
The new view angle image synthesis network architecture may be considered an untrained new view angle image synthesis model, which in some embodiments may be trained from RGB images alone. For example, the NeRF training method can be followed, and the model is trained by minimizing the mean-square error (MSE) between the predicted image Ĉ(r) and the real image C(r):
L_MSE = (1/|R|) · Σ_{r∈R} ||Ĉ(r) - C(r)||²
where L_MSE denotes the mean-square-error loss function, R denotes the set of rays of the image (|R| their number), and Ĉ(r) is the above-mentioned predicted image.
In some implementations, L_MSE may instead be one component of a loss function that contains more terms. For example, the loss function includes a first loss function, a second loss function and a third loss function;
the first loss function calculates the loss value caused by the similarity between the real image and the predicted image, the second loss function calculates the loss value caused by the uncertainty of the rays in the predicted image, and the third loss function calculates the loss value caused by the feature similarity between the real image and the predicted image;
the loss value of the loss function is obtained by weighting the loss value of the first loss function, the loss value of the second loss function, and the loss value of the third loss function.
The first loss function may correspond to L_MSE described above. The second loss function, denoted L_un, can be used to learn the rendered colors and their uncertainties by minimizing the following negative log-likelihood function:
L_un = (1/|R|) · Σ_{r∈R} [ ||Ĉ(r) - C(r)||² / (2·U(r)²) + log(U(r)²) / 2 ]
where U(r) represents the ray uncertainty after the upsampling operation.
To address the phenomenon that models tend to produce blurred or excessively smooth images during training, a perceptual loss may be used to regularize details during rendering:
L_prec = ||Φ(Ĉ(r)) - Φ(C(r))||²
where L_prec, the third loss function, represents the perceptual loss, and Φ(·) is a pretrained Visual Geometry Group (VGG) network used to estimate the similarity between the predicted image Ĉ(r) and the real image C(r) in feature space.
The first loss function, the second loss function and the third loss function may be collectively referred to as sub-loss functions, and the sub-loss functions may be combined by weighting to obtain the loss function:
L = λ_MSE · L_MSE + λ_un · L_un + λ_prec · L_prec
where λ_MSE, λ_un and λ_prec are preset weights.
Furthermore, the model calculates these losses separately for the rendered image at each scale (the low-resolution and the full-resolution images) and defines the total loss function as L_total = λ_coarse · L_coarse + λ_fine · L_fine.
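For illustration, the weighted combination of the sub-losses and the per-scale total loss could be computed as in the following sketch; the uncertainty and perceptual terms follow common formulations and are assumptions where the text leaves the exact expressions open.

```python
# Hypothetical sketch of the weighted loss combination described above.
import torch
import torch.nn.functional as F

def scale_loss(pred, gt, uncertainty, vgg_feats, w_mse=1.0, w_un=0.1, w_prec=0.01):
    """pred, gt: (B,3,H,W); uncertainty: (B,H,W); vgg_feats: a feature-extractor callable."""
    u = uncertainty.clamp(min=1e-3)
    l_mse = F.mse_loss(pred, gt)
    l_un = (((pred - gt) ** 2).mean(dim=1) / (2 * u ** 2)
            + torch.log(u ** 2) / 2).mean()                  # negative log-likelihood term
    l_prec = F.mse_loss(vgg_feats(pred), vgg_feats(gt))      # perceptual term
    return w_mse * l_mse + w_un * l_un + w_prec * l_prec

def total_loss(loss_coarse, loss_fine, w_coarse=0.1, w_fine=1.0):
    return w_coarse * loss_coarse + w_fine * loss_fine
```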
The structure and training mode of the new view angle image synthesis model are exemplarily described below.
The experimental steps are as follows:
Model details: during training, an Adam optimizer is used with an initial learning rate of 0.0005. In the two-dimensional image feature extraction, the feature channel numbers are C1=32, C2=16 and C3=8, respectively. In the sampling process, the numbers of sampling points are Nd=64 and Nc=8. The loss weights λ_MSE, λ_un, λ_prec and the total-loss weights λ_coarse, λ_fine are set to 1, 0.1, 0.01 and 0.1, 1, respectively. Pre-training can be performed on an RTX 3090 GPU; it tends to converge after about 170K iterations, requiring about 14 hours. In single-scene optimization, training was performed for 15 minutes starting from the pre-trained model.
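A minimal sketch of this training configuration is shown below; the model object is a placeholder and only the stated hyperparameters are taken from the text.

```python
# Hypothetical sketch of the stated training configuration.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)                        # stand-in for the synthesis network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)    # Adam, initial learning rate 0.0005

hparams = dict(Nd=64, Nc=8, channels=(32, 16, 8),
               loss_weights=(1.0, 0.1, 0.01),                # lambda_MSE, lambda_un, lambda_prec
               scale_weights=(0.1, 1.0),                     # lambda_coarse, lambda_fine
               pretrain_iters=170_000)
```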
In terms of datasets, the DTU dataset may be used to train the pre-trained model; this dataset contains 124 scenes of different objects. The dataset is divided into 88 training scenes, 15 validation scenes and 16 test scenes, using the same dataset processing method as MVSNeRF and PixelNeRF. To demonstrate the generalization ability of the model of the present application, tests were also performed on the NeRF synthetic dataset and the Real Forward-Facing dataset. Following the protocol of MVSNeRF, three neighboring source views are selected as inputs to train the model and generate the target view.
Baseline method: the method of the application can be compared to PixelNeRF, IBRNet and MVSNeRF, which are the best generalizable NeRF variants in the prior art at sparse inputs, without single scene optimization. In the fine tuning process, the method of the application is compared and analyzed with other technologies.
Evaluation index aspect: the present application measures image quality using peak signal-to-noise ratio (Peak Signal to Noise Ratio, PSNR), structural similarity (Structural Similarity, SSIM), and learning perceived image block similarity (Learned Perceptual Image Patch Similarity, LPIPS) commonly used in NeRF, while measuring rendering speed using Frame Per Second (FPS), and compares the model of the present application with other methods to demonstrate its generalization performance and rendering speed.
Wherein a qualitative comparison of the quality of the new view angle image synthesis on the NeRF synthesis dataset, DTU dataset and LLFF (Real Forward Face) dataset can be seen in fig. 5, where our represents the new view angle image obtained based on the model or method provided by the present application.
Baseline method comparison:
the model of the present application was compared to PixelNeRF, MVSNeRF and IBRNet for generalization ability and rendering speed. In the pre-training phase, all generalizable methods pre-train on DTU datasets and test other unknown scene datasets using the same criteria as MVSNeRF.
The quantitative results in the above table (noted as Table 1) show that the method of the present application is consistently superior to the baseline methods in terms of generalization ability. In addition, to demonstrate the rendering efficiency of the method of the present application, 512×640-resolution images from the DTU dataset were rendered on a single RTX 3090 GPU using each of these models. Table 1 shows that the model of the present application renders at least 50 times faster than previous generalizable radiance field approaches. The slow rendering of the previous methods can be attributed to their need to infer a large number of sampling points for each pixel. In contrast, the model of the present application utilizes the uncertainty-aware sampling strategy and the multi-scale geometry predictor to estimate depth ranges and filter out accurate key points. These strategies help reduce rendering time. Specifically, the model first evaluates the scene geometry at low resolution using the MVS method, which takes 14 ms; acquiring the geometry information from coarse to fine then takes 17 ms, and finally rendering the full-resolution image with the image decoder takes 25 ms. The results of single-scene fine-tuning are also compared with other methods. After 15 minutes of fine-tuning training, the model of the application achieves high-quality results on several datasets, particularly with regard to LPIPS.
A qualitative comparison of the generalization results is provided in FIG. 6. It is clear that the other models tend to produce significant blurring and artifacts when generating new views under sparse inputs. This is because reflective regions and low-texture surfaces typically lead to erroneous density predictions at empty sampling points, which create artifacts during volume rendering. In contrast, the method provided by the application uses the multi-scale geometry predictor to estimate the scene depth probability and samples key points close to the scene surface, thereby avoiding interference from empty sampling points. The method samples reasonably in uncertain regions, so the model can generate higher-quality images even under sparse views. In addition, the 2D rendering network integrates depth information during full-resolution rendering, so the image better recovers scene details and exhibits fewer artifacts.
Ablation learning and analysis:
ablation learning and analysis ablation experiments were performed on the DTU test dataset to examine the effect of the individual modules. The quantitative and qualitative results were recorded separately and are shown in the following table (noted as table 2) and in fig. 6.
Here, Algorithm denotes the algorithm and w/o denotes the ablated component; for example, w/o uncertainty denotes removing the uncertainty prediction, w/o depth render denotes removing the depth injection in the image decoder, and w/o coarse loss denotes removing the low-resolution image loss.
First, the effect of the uncertainty in the model was analyzed. The uncertainty predicts the confidence of the depth estimate, which enables the model to infer more accurate depth information; through uncertainty estimation and sampling, the model is shown to improve rendering performance. Next, the depth injection in the image decoder was removed. The method of the application utilizes low-resolution ray features to generate full-resolution images, where detailed location information is critical to image rendering; missing depth information therefore causes image blurring, whereas injecting depth into the image decoder significantly improves the sharpness of complex and subtle object details. Finally, the important role of the low-resolution image loss is demonstrated. In coarse depth estimation, predicting the low-resolution depth probability of a scene without depth supervision is challenging; without the low-resolution image loss, the coarse geometry estimate contains errors and the image generation quality decreases.
In addition, the impact of different sampling points and input views on the speed and rendering quality was investigated. As shown in the following table (noted as table 3), an attempt was made to change the number of sampling points per ray.
Based on the analysis, the embodiment of the application can realize a new view angle image synthesis method with generalizability and efficient rendering. Specifically, given a set of multi-view images, keypoints are inferred from coarse to fine using a multi-scale scene geometry predictor consisting of MVS and NeRF. In addition, in order to obtain more accurate key point positions and features, an uncertainty guiding sampling strategy based on depth prediction and uncertainty perception is designed. With keypoints and scene geometry, a rendering network can be utilized to synthesize a full resolution image. This process is fully differentiable and the network can be trained using only RGB images. The experimental results prove that the new view image synthesis model has higher rendering efficiency and higher rendering quality on various synthesis data sets and real data sets through comparison with the most advanced base line. The method provided by the application can effectively infer geometric information and remarkably improve rendering speed by utilizing a multi-scale scene geometric predictor and an uncertainty perception sampling strategy.
The embodiment of the application also provides a new view angle image synthesis device, which comprises:
the first acquisition module is used for acquiring a plurality of initial images, camera parameters corresponding to the initial images and a plurality of sampling depths;
The extraction and construction module is used for respectively extracting the image features of each initial image, and constructing a first depth probability body and a geometric feature body based on the image features of the initial images, the camera parameters and the sampling depths, wherein each voxel in the first depth probability body corresponds to a predicted depth probability, and the voxel features of the geometric feature body are obtained by integrating the image features of the plurality of initial images;
the convolution module is used for carrying out two-dimensional convolution on the first depth probability body to obtain a first uncertainty diagram;
the first determining module is used for determining coarse sampling points according to the first depth probability body and the first uncertainty diagram;
the second acquisition module is used for acquiring coarse sampling point features corresponding to the coarse sampling points from the image features and the geometric feature bodies of the initial image;
the second determining module is used for determining a fine sampling point, a ray characteristic map and a depth map based on the coarse sampling point characteristics;
the third acquisition module is used for acquiring fine sampling point characteristics corresponding to the fine sampling points from the image characteristics of the initial image;
and the synthesis module is used for synthesizing the new view angle image according to the ray characteristic diagram, the depth diagram and the fine sampling point characteristic.
Alternatively, the extraction building module may be specifically configured to:
inputting the initial image into a feature pyramid network to extract a plurality of feature maps with different sizes, wherein the feature maps comprise a first feature map, a second feature map and a third feature map with sequentially increased sizes, and the image features of the initial image are positioned in the feature maps.
Optionally, the extraction building module may further be configured to:
downsampling is carried out from the first feature mapping corresponding to each initial image respectively to obtain a fourth feature mapping;
according to the fourth feature mapping corresponding to the plurality of initial images, determining the cost at each sampling depth to obtain a cost aggregate;
and regularizing the cost aggregate to obtain a first depth probability body.
Optionally, the first determining module may be specifically configured to:
performing inverse transform sampling based on the first depth probability volume and the sampling depth of each layer in the first depth probability volume to obtain initial sampling points;
and combining the first uncertainty map with a preset depth interval to determine two neighboring points for each initial sampling point respectively, the neighboring points serving as the coarse sampling points.
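A minimal sketch of this uncertainty-guided coarse sampling, assuming a discrete per-pixel distribution over the sampling depths and an offset around each drawn depth scaled by the uncertainty map and the preset interval; the exact offset rule of the embodiment may differ.

```python
import torch

def coarse_sample_depths(depth_prob, depth_values, uncertainty, delta=1.0):
    """depth_prob: (B, D, H, W) per-pixel probabilities over the D sampling depths;
    depth_values: (D,) depth of each layer; uncertainty: (B, 1, H, W) first uncertainty
    map; delta: preset depth interval. One depth per pixel is drawn by inverse transform
    sampling of the discrete distribution, then two neighboring depths are placed around
    it at a spacing scaled by the uncertainty."""
    B, D, H, W = depth_prob.shape
    cdf = torch.cumsum(depth_prob, dim=1)                      # (B, D, H, W)
    u = torch.rand(B, 1, H, W, device=depth_prob.device)
    idx = (u > cdf).sum(dim=1).clamp(max=D - 1)                # (B, H, W) sampled bin index
    d0 = depth_values[idx]                                     # initial sampling depth per pixel
    offset = delta * uncertainty.squeeze(1)                    # uncertainty-scaled half interval
    return torch.stack([d0 - offset, d0 + offset], dim=1)      # (B, 2, H, W) coarse depths
```

torch.searchsorted on the CDF would be an equivalent way to draw the initial samples.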
Optionally, the second acquisition module may be specifically configured to:
projecting the coarse sampling points to the second feature map to obtain image point features corresponding to the coarse sampling points;
acquiring, based on trilinear interpolation, voxel features corresponding to the coarse sampling points from the geometric feature volume;
the coarse sampling point features comprise image point features corresponding to the coarse sampling points and voxel features corresponding to the coarse sampling points.
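One possible way to gather the two kinds of coarse sampling point features, assuming a pinhole projection onto the second feature map and a geometric feature volume normalized to a known bounding box; the conventions (axis order, normalized grid in [-1, 1]) are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def gather_coarse_point_features(points, K, E, feat_map2, geo_volume, vol_min, vol_max):
    """points: (N, 3) coarse sampling points in world coordinates; K: (3, 3) intrinsics and
    E: (4, 4) world-to-camera extrinsics of one initial image; feat_map2: (1, C, H, W) second
    feature map; geo_volume: (1, Cg, Dz, Dy, Dx) geometric feature volume whose axes are
    assumed ordered (z, y, x) over the box [vol_min, vol_max]."""
    ones = torch.ones(points.shape[0], 1, dtype=points.dtype, device=points.device)
    cam = (E @ torch.cat([points, ones], dim=1).T)[:3]         # (3, N) camera-space points
    uv = K @ cam
    uv = uv[:2] / uv[2:].clamp(min=1e-6)                       # (2, N) pixel coordinates
    H, W = feat_map2.shape[-2:]
    grid = torch.stack([uv[0] / (W - 1), uv[1] / (H - 1)], dim=-1) * 2 - 1
    img_feat = F.grid_sample(feat_map2, grid.view(1, 1, -1, 2), align_corners=True)
    img_feat = img_feat[0, :, 0].T                             # (N, C) image point features

    norm = (points - vol_min) / (vol_max - vol_min) * 2 - 1    # normalized (x, y, z) in [-1, 1]
    vox_feat = F.grid_sample(geo_volume, norm.view(1, 1, 1, -1, 3), align_corners=True)
    vox_feat = vox_feat[0, :, 0, 0].T                          # (N, Cg) trilinear voxel features
    return img_feat, vox_feat
```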
Optionally, the second determining module may be specifically configured to:
performing feature perception on the image point features corresponding to the coarse sampling points by using a network based on the mean and the variance, to obtain image features for rendering;
performing feature perception on the image features for rendering and the voxel features to obtain color features and densities corresponding to the coarse sampling points, wherein the color features are used for rendering the new view angle image;
and determining the fine sampling points according to the density.
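A hedged sketch of such a mean-and-variance based perception step: per-view image point features are aggregated by their mean and variance, combined with the voxel features, and mapped to rendering features, color features and a density. Layer widths, activations and the exact aggregation are assumptions.

```python
import torch
import torch.nn as nn

class MeanVarPointNet(nn.Module):
    """Aggregates per-view image point features by mean and variance and maps them,
    together with the voxel features, to rendering features, color features and densities."""
    def __init__(self, c_img=128, c_vox=16, hidden=64):
        super().__init__()
        self.render_mlp = nn.Sequential(nn.Linear(2 * c_img, hidden), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(nn.Linear(hidden + c_vox, hidden), nn.ReLU(inplace=True))
        self.color_head = nn.Linear(hidden, hidden)
        self.sigma_head = nn.Linear(hidden, 1)

    def forward(self, view_feats, vox_feat):
        # view_feats: (V, N, C_img) image point features from the V initial images
        # vox_feat:   (N, C_vox)    voxel features of the same coarse sampling points
        mean = view_feats.mean(dim=0)
        var = view_feats.var(dim=0, unbiased=False)
        render_feat = self.render_mlp(torch.cat([mean, var], dim=-1))   # image features for rendering
        h = self.fuse(torch.cat([render_feat, vox_feat], dim=-1))
        color_feat = self.color_head(h)                                 # color features
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)              # densities
        return render_feat, color_feat, sigma
```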
Optionally, the second determining module may be further configured to:
tracing rays passing through individual pixels of the initial image;
determining the depth probability distribution of each ray based on the number of the coarse sampling points on each ray, the density corresponding to the coarse sampling points, the distance between two adjacent coarse sampling points and the cumulative transmittance;
constructing a second depth probability volume according to the depth probability distribution of each ray;
performing two-dimensional convolution on the second depth probability volume to obtain a second uncertainty map;
and determining the fine sampling points according to the second depth probability volume and the second uncertainty map;
Accordingly, the third acquisition module may be configured to:
projecting the fine sampling points to the third feature map to obtain fine sampling point features corresponding to the fine sampling points.
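The per-ray depth probability distribution described above can be computed with standard volume-rendering weights; the sketch below derives the cumulative transmittance and the resulting weights from the densities, together with the expected depth that would populate the depth map. Arranging these per-pixel distributions over the image gives the second depth probability volume, which is then convolved to obtain the second uncertainty map. This is a generic sketch of the step, not the patented code.

```python
import torch

def ray_depth_distribution(sigma, z_vals):
    """sigma: (R, S) densities of the S coarse sampling points on each of R rays;
    z_vals: (R, S) their depths along the ray. Returns the per-ray depth probability
    w_i = T_i * (1 - exp(-sigma_i * delta_i)), with T_i the cumulative transmittance."""
    deltas = z_vals[:, 1:] - z_vals[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                     # opacity of each segment
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]                                                    # cumulative transmittance T_i
    weights = alpha * trans                                      # depth probability along the ray
    expected_depth = (weights * z_vals).sum(dim=-1)              # one entry of the depth map
    return weights, expected_depth
```

The weights sum to at most one along each ray, so they can be read directly as a discrete depth distribution.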
Optionally, the synthesis module may be specifically configured to:
inputting the fine sampling point features and the ray feature map into a texture feature generation network to obtain an image texture;
performing two-dimensional convolution on the depth map to obtain a depth feature map;
and inputting the image texture and the depth feature map to an image decoder, and outputting the new view angle image, wherein the image decoder comprises a plurality of residual nested dense blocks, and the depth feature map is respectively injected into each residual nested dense block.
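As a rough illustration of injecting the depth feature map into every decoder block, the sketch below uses simplified residual blocks in place of the residual nested dense blocks; block structure, channel counts and the output activation are assumptions.

```python
import torch
import torch.nn as nn

class DepthInjectedBlock(nn.Module):
    """One illustrative residual block with the depth feature map injected at its input;
    the actual residual nested dense blocks are more elaborate, this shows only the pattern."""
    def __init__(self, c_feat=64, c_depth=16):
        super().__init__()
        self.fuse = nn.Conv2d(c_feat + c_depth, c_feat, 1)
        self.body = nn.Sequential(
            nn.Conv2d(c_feat, c_feat, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(c_feat, c_feat, 3, padding=1),
        )

    def forward(self, x, depth_feat):
        h = self.fuse(torch.cat([x, depth_feat], dim=1))
        return x + self.body(h)                      # residual connection

class ImageDecoder(nn.Module):
    """Stacks several such blocks and maps the result to an RGB image (a sketch)."""
    def __init__(self, c_feat=64, c_depth=16, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([DepthInjectedBlock(c_feat, c_depth) for _ in range(n_blocks)])
        self.to_rgb = nn.Conv2d(c_feat, 3, 3, padding=1)

    def forward(self, texture_feat, depth_feat):
        x = texture_feat
        for blk in self.blocks:                      # depth features injected into every block
            x = blk(x, depth_feat)
        return torch.sigmoid(self.to_rgb(x))         # new view angle image
```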
The new view angle image synthesis device provided in the embodiment of the present application is the device corresponding to the new view angle image synthesis method of the above embodiment; the method embodiment is applicable to the device embodiment and achieves the same technical effects, which are not described herein again.
The embodiment of the application also provides a new view angle image synthesis model training device, which comprises:
the construction module is used for constructing a new view angle image synthesis network architecture, and the new view angle image synthesis network architecture is configured with a loss function;
the fourth acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of sample image sets and real images serving as labeling results of the sample image sets;
the output module is used for inputting the sample image set into the new view angle image synthesis network architecture and outputting a predicted image;
the training module is used for updating network parameters of the new view angle image synthesis network architecture based on the loss value calculated by the loss function from the real image and the predicted image, until the loss value of the loss function converges, so as to obtain a trained new view angle image synthesis model, and the trained new view angle image synthesis model is used for implementing the new view angle image synthesis method.
Optionally, the loss function includes a first loss function, a second loss function, and a third loss function;
wherein the first loss function is used for calculating a loss value caused by the similarity between the real image and the predicted image, the second loss function is used for calculating a loss value caused by the uncertainty of rays in the predicted image, and the third loss function is used for calculating a loss value caused by the feature similarity between the real image and the predicted image;
the loss value of the loss function is obtained by weighting the loss value of the first loss function, the loss value of the second loss function, and the loss value of the third loss function.
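A minimal sketch of the weighted combination of the three loss terms, assuming an L2 image similarity term, a mean ray-uncertainty term and an L1 feature similarity term; the concrete loss forms, the feature extractor producing feat_pred and feat_gt, and the weights w1 to w3 are assumptions for illustration.

```python
import torch

def total_loss(pred, gt, ray_uncertainty, feat_pred, feat_gt,
               w1=1.0, w2=0.1, w3=0.05):
    """Weighted combination of the three loss values; terms and weights are illustrative."""
    l_sim = torch.mean((pred - gt) ** 2)                    # first loss: image similarity
    l_unc = torch.mean(ray_uncertainty)                     # second loss: ray uncertainty
    l_feat = torch.mean(torch.abs(feat_pred - feat_gt))     # third loss: feature similarity
    return w1 * l_sim + w2 * l_unc + w3 * l_feat
```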
The new view angle image synthesis model training device provided by the embodiment of the application is the device corresponding to the new view angle image synthesis model training method of the above embodiment; the method embodiment is applicable to the device embodiment and achieves the same technical effects, which are not described herein again.
The embodiment of the application also provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the new view angle image synthesis method or the new view angle image synthesis model training method when executing the computer program.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program implements the new view angle image synthesis method or the new view angle image synthesis model training method when executed by a processor.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, and the computer program may be stored in a computer readable storage medium; when executed by a processor, the computer program may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (11)

1. A new view angle image synthesis method, characterized by comprising:
acquiring a plurality of initial images, camera parameters corresponding to the initial images and a plurality of sampling depths;
respectively extracting image features of each initial image, and constructing a first depth probability volume and a geometric feature volume based on the image features of the initial images, the camera parameters and the sampling depths, wherein each voxel in the first depth probability volume corresponds to a predicted depth probability, and the voxel features of each voxel of the geometric feature volume are obtained by jointly extracting the image features of the plurality of initial images;
performing two-dimensional convolution on the first depth probability volume to obtain a first uncertainty map;
determining coarse sampling points according to the first depth probability volume and the first uncertainty map;
acquiring coarse sampling point features corresponding to the coarse sampling points from the image features of the initial image and the geometric feature volume;
determining fine sampling points, a ray feature map and a depth map based on the coarse sampling point features;
acquiring fine sampling point features corresponding to the fine sampling points from the image features of the initial image;
and synthesizing a new view angle image according to the ray feature map, the depth map and the fine sampling point features.
2. The method of claim 1, wherein the extracting image features of each of the initial images, respectively, comprises:
inputting the initial image into a feature pyramid network to extract a plurality of feature maps of different sizes, wherein the feature maps comprise a first feature map, a second feature map and a third feature map of sequentially increasing size, and the image features of the initial image are contained in the feature maps.
3. The method of claim 2, wherein the constructing a first depth probability volume and a geometric feature volume based on the image features of the initial image, the camera parameters, and the sampling depth comprises:
performing downsampling on the first feature map corresponding to each initial image respectively to obtain a fourth feature map;
determining the cost at each sampling depth according to the fourth feature maps corresponding to the plurality of initial images to obtain a cost aggregate;
and regularizing the cost aggregate to obtain the first depth probability volume.
4. A method according to any one of claims 1 to 3, wherein said determining coarse sampling points from said first depth probability volume and said first uncertainty map comprises:
performing inverse transform sampling based on the first depth probability volume and the sampling depths of the layers in the first depth probability volume to obtain initial sampling points;
and combining the first uncertainty map with a preset depth interval to determine two neighboring points for each initial sampling point respectively, the neighboring points serving as the coarse sampling points.
5. The method according to claim 2, wherein the acquiring the coarse sampling point feature corresponding to the coarse sampling point from the image feature of the initial image and the geometric feature volume includes:
projecting the coarse sampling points to the second feature map to obtain image point features corresponding to the coarse sampling points;
acquiring voxel features corresponding to the coarse sampling points from the geometric feature volume based on trilinear interpolation;
the coarse sampling point features comprise image point features corresponding to the coarse sampling points and voxel features corresponding to the coarse sampling points.
6. The method of claim 5, wherein the determining fine sampling points based on the coarse sampling point features comprises:
performing feature perception on the image point features corresponding to the coarse sampling points by using a network based on the mean and the variance, to obtain image features for rendering;
performing feature perception on the image features for rendering and the voxel features to obtain color features and densities corresponding to the coarse sampling points, wherein the color features are used for rendering new view angle images;
and determining the fine sampling point according to the density.
7. The method of claim 6, wherein said determining said fine sampling points from said density comprises:
tracing rays passing through individual pixels of the initial image;
determining a depth probability distribution of each ray based on the number of the coarse sampling points on each ray, the density corresponding to the coarse sampling points, the distance between two adjacent coarse sampling points and the cumulative transmittance;
constructing a second depth probability volume according to the depth probability distribution of each ray;
performing two-dimensional convolution on the second depth probability volume to obtain a second uncertainty map;
determining the fine sampling points according to the second depth probability volume and the second uncertainty map;
the obtaining the fine sampling point feature corresponding to the fine sampling point from the image feature of the initial image includes:
and projecting the fine sampling points to the third feature map to obtain fine sampling point features corresponding to the fine sampling points.
8. The method of claim 1, wherein the synthesizing a new view angle image according to the ray feature map, the depth map and the fine sampling point features comprises:
inputting the fine sampling point features and the ray feature map into a texture feature generation network to obtain an image texture;
performing two-dimensional convolution on the depth map to obtain a depth feature map;
and inputting the image texture and the depth feature map to an image decoder, and outputting the new view angle image, wherein the image decoder comprises a plurality of residual nested dense blocks, and the depth feature map is respectively injected into each residual nested dense block.
9. A new view angle image synthesis model training method, the method comprising:
constructing a new view angle image synthesis network architecture, wherein the new view angle image synthesis network architecture is configured with a loss function;
acquiring a training sample set, wherein the training sample set comprises a plurality of sample image sets and a real image serving as a labeling result of each sample image set;
inputting the sample image set into the new view angle image synthesis network architecture, and outputting a predicted image;
updating network parameters of the new view angle image synthesis network architecture based on the loss value calculated by the loss function from the real image and the predicted image, until the loss value of the loss function converges, so as to obtain a trained new view angle image synthesis model, wherein the trained new view angle image synthesis model is used for implementing the new view angle image synthesis method according to any one of claims 1 to 8.
10. The method of claim 9, wherein the loss function comprises a first loss function, a second loss function, and a third loss function;
the first loss function is used for calculating a loss value caused by the similarity between the real image and the predicted image, the second loss function is used for calculating a loss value caused by the uncertainty of rays in the predicted image, and the third loss function is used for calculating a loss value caused by the feature similarity between the real image and the predicted image;
The loss value of the loss function is obtained by weighting calculation of the loss value of the first loss function, the loss value of the second loss function and the loss value of the third loss function.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 10.
CN202310489076.4A 2023-05-04 2023-05-04 New view image synthesis method, synthesis model training method and storage medium Active CN117058049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310489076.4A CN117058049B (en) 2023-05-04 2023-05-04 New view image synthesis method, synthesis model training method and storage medium

Publications (2)

Publication Number Publication Date
CN117058049A true CN117058049A (en) 2023-11-14
CN117058049B CN117058049B (en) 2024-01-09

Family

ID=88663332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310489076.4A Active CN117058049B (en) 2023-05-04 2023-05-04 New view image synthesis method, synthesis model training method and storage medium

Country Status (1)

Country Link
CN (1) CN117058049B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613609A (en) * 2020-12-18 2021-04-06 中山大学 Nerve radiation field enhancement method based on joint pose optimization
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system
CN113706714A (en) * 2021-09-03 2021-11-26 中科计算技术创新研究院 New visual angle synthesis method based on depth image and nerve radiation field
CN114882158A (en) * 2022-05-31 2022-08-09 苏州浪潮智能科技有限公司 Method, device, equipment and readable medium for NERF optimization based on attention mechanism
CN115731336A (en) * 2023-01-06 2023-03-03 粤港澳大湾区数字经济研究院(福田) Image rendering method, image rendering model generation method and related device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BUMSUB HAM ET AL.: "Probability-Based Rendering for View Synthesis", 《IEEE TRANSACTIONS ON IMAGE PROCESSING》, vol. 23, no. 2, pages 870 - 884, XP011536811, DOI: 10.1109/TIP.2013.2295716 *
HAOTONG LIN ET AL.: "Efficient Neural Radiance Fields for Interactive Free-viewpoint", 《ARXIV》, pages 1 - 15 *
李宇翔 et al.: "Texture edge guided depth image super-resolution reconstruction", 《中国图象图形学报》 (Journal of Image and Graphics), vol. 23, no. 10, pages 1508 - 1517 *
王雪 et al.: "Saliency-guided and uncertainty-supervised deep encoder-decoder network", 《软件学报》 (Journal of Software), vol. 33, no. 9, pages 3165 *

Also Published As

Publication number Publication date
CN117058049B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
Wang et al. Nerf-sr: High quality neural radiance fields using supersampling
Kalantari et al. Learning-based view synthesis for light field cameras
CN110889852B (en) Liver segmentation method based on residual error-attention deep neural network
Huang et al. Adversarial texture optimization from rgb-d scans
Yuan et al. Image haze removal via reference retrieval and scene prior
Konrad et al. Automatic 2d-to-3d image conversion using 3d examples from the internet
CN111784620B (en) Light field camera full-focusing image fusion algorithm for guiding angle information by space information
CN111899295B (en) Monocular scene depth prediction method based on deep learning
Kwak et al. Geconerf: Few-shot neural radiance fields via geometric consistency
CN115147709B (en) Underwater target three-dimensional reconstruction method based on deep learning
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN116739899A (en) Image super-resolution reconstruction method based on SAUGAN network
CN115578262A (en) Polarization image super-resolution reconstruction method based on AFAN model
CN117115359B (en) Multi-view power grid three-dimensional space data reconstruction method based on depth map fusion
CN112598604A (en) Blind face restoration method and system
CN112862684A (en) Data processing method for depth map super-resolution reconstruction and denoising neural network
CN116091322B (en) Super-resolution image reconstruction method and computer equipment
CN117058049B (en) New view image synthesis method, synthesis model training method and storage medium
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN115690327A (en) Space-frequency decoupling weak supervision three-dimensional face reconstruction method
Chen et al. Improving neural radiance fields with depth-aware optimization for novel view synthesis
Chen et al. Bidirectional optical flow NeRF: high accuracy and high quality under fewer views
CN112991422A (en) Stereo matching method and system based on void space pyramid pooling
Chen et al. MoCo‐Flow: Neural Motion Consensus Flow for Dynamic Humans in Stationary Monocular Cameras
Su et al. PSDF: Prior-Driven Neural Implicit Surface Learning for Multi-view Reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant