CN115439388B - Free viewpoint image synthesis method based on multilayer neural surface representation - Google Patents
- Publication number: CN115439388B (application CN202211391996.4A)
- Authority: CN (China)
- Prior art keywords: viewpoint, image, layer, module, scale
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06N3/02, G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
- G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning with neural networks
- G06T2207/20221 — Image fusion; image merging (indexing scheme for image analysis or image enhancement)
- Y02T10/40 — Engine management systems (climate change mitigation technologies related to transportation)
Abstract
The invention discloses a free viewpoint image synthesis method based on multilayer neural surface representation, relating to the field of computer vision. The method comprises the following steps: S1, acquiring image data captured from sparse viewpoints and estimating the poses of the sparse viewpoints; S2, designing a sparse-viewpoint free viewpoint image synthesis network based on the multilayer neural surface representation; S3, training the sparse-viewpoint free viewpoint image synthesis network on a large-scale multi-viewpoint dataset; and S4, after the image synthesis network model parameters are obtained, applying them to the free viewpoint synthesis task on the sparse multi-viewpoint data acquired in the first step. By designing a multilayer neural surface representation model, the invention fully exploits the features of sparse multi-viewpoint images, achieves a high-quality and generalizable free viewpoint image synthesis algorithm, and is suitable for free viewpoint image synthesis tasks of multi-viewpoint acquisition systems.
Description
Technical Field
The invention relates to the field of computer vision, and in particular to a free viewpoint image synthesis method based on multilayer neural surface representation.
Background
Free viewpoint image synthesis is a major problem in the field of computer vision. With the arrival of the 5G era and the development and popularization of virtual reality and augmented reality technology, interactivity and immersion have become an inevitable trend for digital imagery.
Free viewpoint synthesis is widely applied in virtual reality, film and television production, live sports broadcasting, cultural and social interaction, and other fields, owing to its strong sense of three-dimensional immersion, large viewing freedom, and rich interactive experience.
However, current free viewpoint systems still require hundreds of cameras and are structurally complex and expensive; meanwhile, most deployed domestic systems adopt fixed imaging tracks, so the viewing viewpoints are limited, the sense of immersion is insufficient, and practicality and economy remain to be improved.
Disclosure of Invention
To achieve the above purpose, the invention studies the free viewpoint image synthesis problem under sparse viewpoints. It addresses two shortcomings of existing free viewpoint generation algorithms: either each group of multi-viewpoint images requires lengthy per-scene training, or geometric estimation errors under sparse viewpoints degrade the final viewpoint synthesis result. A framework based on multilayer neural surface representation is proposed that realizes, end to end, scene geometry estimation and texture synthesis for the viewpoint to be synthesized, achieving high-quality and efficient free viewpoint image generation and solving the problems noted in the background.
In this application, based on the novel multilayer neural surface representation, a workflow for free viewpoint image synthesis from sparse viewpoint input is designed. The network fully learns the scene structure information of the new viewpoint to be synthesized as well as accurate texture transfer and fusion, completing high-quality and efficient free viewpoint image synthesis.
The technical scheme adopted is as follows: the free viewpoint image synthesis method based on multilayer neural surface representation comprises the following steps:
S1, acquiring multi-viewpoint synchronized or static-scene image data captured from sparse viewpoints, and estimating the sparse viewpoint poses;
S2, designing a free viewpoint image synthesis network;
S3, training the free viewpoint image synthesis network on a large-scale multi-viewpoint dataset so that it generalizes to a variety of multi-viewpoint data;
and S4, after the trained free viewpoint image synthesis network model parameters are obtained, applying them to the free viewpoint synthesis task on the sparse multi-viewpoint data acquired in step S1.
Further, after step S4, there is also: S5, since the trained network has a certain generalization ability for data that did not appear in the training set, directly using the network model trained in step S3 for forward prediction, realizing high-quality free viewpoint image synthesis on the sparse multi-viewpoint data to be tested.
Further, in step S1, the acquisition method is a Structure-from-Motion method or a multi-viewpoint calibration method with a calibration object of known scale.
Further, the free viewpoint synthesis network comprises a multi-scale image feature extraction module, a target-oriented multi-scale refinable scene depth estimation MVS (Multi-View Stereo) module, a multilayer neural surface density estimation module, a reverse feature fusion and multilayer neural surface color decoding module, and a multilayer neural surface voxel rendering module;
the multi-scale image feature extraction module consists of convolution layers and skip connections, and is expressed as $\{F^{(1)}, F^{(2)}, F^{(3)}\} = f_{2D}(I)$, where $f_{2D}$ denotes the network of this module, $I$ is an arbitrary image input to the module, and the output of the module is the set of image features at three scales $\{F^{(1)}, F^{(2)}, F^{(3)}\}$;
the MVS module realizes scene geometry estimation for an arbitrary viewpoint by modifying a learning-based MVS network; the implementation is as follows:
the M source viewpoint images are passed through the multi-scale image feature extraction module to obtain M × 3 image features;
at each scale, the source viewpoint features are warped to each depth hypothesis of the target viewpoint; a variance-based cost volume is constructed and, after 3D convolution regularization, the probability of each pixel of the target image at each depth is output;
optimization proceeds gradually from the small scale to the large scale, the depth sampling is updated according to the depth probabilities of the previous level, and the module finally outputs a multilayer surface for the target viewpoint at the original image resolution, where the probability over depth is determined at the finally sampled depth values;
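The coarse-to-fine depth update described above can be sketched as follows; the helper name, the uniform probabilities, and the shrink factor are illustrative assumptions, not the patent's actual procedure:

```python
import numpy as np

def refine_depth_hypotheses(prev_depths, prev_probs, n_new, shrink=0.5):
    """Resample depth hypotheses around the expected depth of the previous
    (coarser) scale, narrowing the search range (hypothetical helper)."""
    # Expected depth under the previous probability distribution
    center = np.sum(prev_depths * prev_probs)
    # Shrink the search interval around that center
    half_range = shrink * (prev_depths.max() - prev_depths.min()) / 2.0
    return np.linspace(center - half_range, center + half_range, n_new)

depths = np.linspace(1.0, 10.0, 8)   # coarse hypotheses (illustrative range)
probs = np.full(8, 1.0 / 8)          # uniform probability for illustration
refined = refine_depth_hypotheses(depths, probs, n_new=16)
```

With a uniform distribution the new hypotheses stay centered on the old mean depth but cover half the range at twice the sample density, which is the intent of the update-and-resample step.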
the multi-layer nerve surface density estimation module samples depth probability bodies from the output of the MVS moduleIn recovering the density value +.>Providing for volume rendering to obtain a final output image corresponding to the opacity representing the multi-layer surface;
the reverse feature fusion and multi-layer nerve surface color decoding module uses the multi-layer surface sampling point set obtained in the MVS module to access the source viewpoint feature in a reverse wayCorresponding characteristic values in the code, and fusing and decoding the corresponding characteristics to formColor values of the multi-layer surface;
and the multi-layer nerve surface voxel rendering module performs voxel rendering after the density corresponding to the multi-layer nerve surface obtained by the multi-layer nerve surface density estimation module and the color corresponding to the multi-layer nerve surface obtained by the reverse feature fusion and multi-layer nerve surface color decoding module, so as to complete the synthesis of a final target image.
Further, in step S3, the training data are multi-viewpoint image data with camera poses, divided into a training set, a validation set, and a test set; training proceeds until the network converges and performance on the validation set stops improving.
Further, in step S3, the multi-viewpoint synchronized (or static-scene) image data captured from sparse viewpoints are denoted $\{I_i\}_{i=1}^{N}$, where $N$ is the number of input viewpoints; estimating the sparse viewpoint poses yields the pose of each viewpoint, $\{P_i\}_{i=1}^{N}$, where each $P_i$ contains the intrinsics $K_i$ and the extrinsics $[R_i \mid t_i]$ (rotation matrix and translation vector).
Further, in step S3, the pose of the target viewpoint is defined as $P_t$; according to the position and orientation of the target viewpoint, the images of the M source viewpoints closest to the target viewpoint among the input viewpoints, $\{I_j\}_{j=1}^{M}$, together with their camera poses $\{P_j\}_{j=1}^{M}$, are found and used as the input of the network.
Compared with the prior art, the invention has the following beneficial effects:
(1) Since the trained network has a certain generalization ability for data that did not appear in the training set, forward prediction with the network can directly complete the free viewpoint image synthesis task on the sparse multi-viewpoint data to be tested;
(2) By designing a multilayer neural surface representation model, the features of sparse multi-viewpoint images are fully exploited to achieve a high-quality, generalizable free viewpoint image synthesis algorithm, suitable for free viewpoint image synthesis tasks of multi-viewpoint acquisition systems;
(3) A network is designed for the multilayer neural surface representation, the reconstruction of the multilayer surfaces of a scene is realized within an end-to-end novel view synthesis framework, and high-quality fusion and generation of novel-view textures is completed based on the multilayer surface representation.
Drawings
Fig. 1 is a flowchart of the free viewpoint image synthesis method based on multilayer neural surface representation in the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
As shown in fig. 1, the free viewpoint image synthesis method based on multilayer neural surface representation described in this embodiment comprises the following steps:
S1, acquiring multi-viewpoint synchronized or static-scene image data captured from sparse viewpoints, and estimating the sparse viewpoint poses;
the acquisition method is a Structure-from-Motion method or a multi-viewpoint calibration method with a calibration object of known scale;
that is, this step collects multi-viewpoint data of the same static scene, or of a dynamic scene at the same moment; the viewpoints may be sparse (i.e., the viewpoint poses may vary greatly);
in use, the purpose of the acquisition is to capture different viewpoints of the scene to be observed with limited acquisition equipment, with the expectation that free viewpoint images of the scene can be recovered by the algorithm.
S2, designing a free viewpoint image synthesis network;
In use, this network is designed around the multilayer neural surface representation: the reconstruction of the multilayer surfaces of the scene is realized within the end-to-end novel view synthesis framework, and high-quality fusion and generation of novel-view textures is completed based on the multilayer surface representation.
The free viewpoint synthesis network comprises: a multi-scale image feature extraction module, a target-oriented multi-scale refinable scene depth estimation MVS (Multi-View Stereo) module, a multilayer neural surface density estimation module, a reverse feature fusion and multilayer neural surface color decoding module, and a multilayer neural surface voxel rendering module.
In use, the design of this deep neural network is the core part.
S3, training the free viewpoint image synthesis network on a large-scale multi-viewpoint dataset so that it generalizes to a variety of multi-viewpoint data;
the training data are multi-viewpoint image data with camera poses, divided into a training set, a validation set, and a test set; training proceeds until the network converges and validation performance stops improving.
Step S3 comprises the following steps:
S31, inputting the M viewpoints most similar to the viewpoint to be synthesized into the network, and outputting the predicted image at the viewpoint to be synthesized;
S32, supervising with a pixel-level loss function such as L1 or L2, or a perceptual loss function.
And S4, after the trained free viewpoint image synthesis network model parameters are obtained, applying them to the free viewpoint synthesis task on the sparse multi-viewpoint data acquired in step S1.
As shown in fig. 1, the free viewpoint image synthesis method based on multilayer neural surface representation in this embodiment specifically comprises the following steps:
Acquiring multi-viewpoint synchronized (or static-scene) image data captured from sparse viewpoints;
the data are denoted $\{I_i\}_{i=1}^{N}$, where $N$ is the number of input viewpoints; estimating the sparse viewpoint poses yields the pose of each viewpoint, $\{P_i\}_{i=1}^{N}$, where each $P_i$ contains the intrinsics $K_i$ and the extrinsics $[R_i \mid t_i]$ (rotation matrix and translation vector).
Specifically, this step acquires multi-viewpoint data of the same static scene, or of a dynamic scene at the same moment; the viewpoints may be sparse (i.e., the viewpoint poses may vary greatly). The purpose of the acquisition is to capture different viewpoints of the scene to be observed with limited acquisition equipment, with the expectation that free viewpoint images of the scene can be recovered by the algorithm.
The pose of the target viewpoint is defined as $P_t$; according to its position and orientation, the images of the M source viewpoints closest to the target viewpoint among the input viewpoints, $\{I_j\}_{j=1}^{M}$, together with their camera poses $\{P_j\}_{j=1}^{M}$, are taken as the input of the network.
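As a minimal sketch of the pose convention assumed above (intrinsics $K$, extrinsics $[R \mid t]$), the following projects a world point into one view; the numeric camera parameters are illustrative, not from the patent:

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X into pixel coordinates for a camera with
    intrinsics K and extrinsics [R | t] (assumed pinhole convention)."""
    x_cam = R @ X + t          # world -> camera coordinates
    x_img = K @ x_cam          # camera -> homogeneous image coordinates
    return x_img[:2] / x_img[2]

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                  # camera aligned with the world frame
t = np.zeros(3)
uv = project(K, R, t, np.array([0.0, 0.0, 2.0]))  # point on the optical axis
```

A point on the optical axis projects to the principal point, which is a quick sanity check of the convention.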
The network specifically comprises the following modules:
The multi-scale image feature extraction module is based on a U-Net model and consists of multi-scale convolution layers and skip connections; it can be expressed as $\{F^{(1)}, F^{(2)}, F^{(3)}\} = f_{2D}(I)$, where $f_{2D}$ denotes the network of this module, $I$ is an arbitrary image input to the module, and the output of the module is the set of image features at three scales $\{F^{(1)}, F^{(2)}, F^{(3)}\}$.
In use, a three-channel image passed through the feature extraction module yields multi-scale features of different resolutions and channel counts, extracted at different depths of the network; these cover image features of different receptive fields and are used for subsequent neural surface localization and reverse feature fusion.
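A toy illustration of the three-scale output, using plain average pooling as a stand-in for the learned encoder (the real module uses convolution layers and skip connections; this sketch only shows the resolution pyramid):

```python
import numpy as np

def avg_pool2(img):
    """Halve spatial resolution by 2x2 average pooling (stand-in for one
    downsampling stage of the U-Net-style extractor)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def three_scale_features(img):
    f1 = img            # full resolution
    f2 = avg_pool2(f1)  # 1/2 resolution
    f3 = avg_pool2(f2)  # 1/4 resolution
    return [f1, f2, f3]

feats = three_scale_features(np.ones((64, 48)))
```

The learned extractor would additionally change the channel count at each scale; the pyramid of spatial resolutions is the structural point here.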
The target-oriented multi-scale refinable scene depth estimation MVS module realizes scene geometry estimation for an arbitrary viewpoint by modifying a learning-based MVS network; the implementation is as follows:
(1) The M source viewpoint images are passed through the multi-scale image feature extraction module to obtain M × 3 image features;
(2) At each scale, the source viewpoint features are warped to each depth hypothesis of the target viewpoint; a variance-based cost volume is constructed and, after 3D convolution regularization, the probability of each pixel of the target image at each depth is output;
(3) Optimization proceeds gradually from the small scale to the large scale, the depth sampling is updated according to the depth probabilities of the previous level, and the module outputs a multilayer surface for the target viewpoint at the original image resolution, where the probability over depth is determined at the finally sampled depth values.
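The variance cost of step (2) can be sketched as follows; the tensor shapes and the softmax conversion to a per-pixel depth probability are illustrative assumptions (the patent regularizes the cost volume with 3D convolutions before producing probabilities):

```python
import numpy as np

def variance_cost(warped_feats):
    """Variance across M source-view features warped to the target view,
    one value per (depth, pixel); low variance suggests photo-consistency."""
    return np.var(warped_feats, axis=0)   # (M, D, H, W) -> (D, H, W)

def depth_probability(cost):
    """Turn negated cost into a per-pixel probability over depths via softmax
    (illustrative stand-in for the learned regularization)."""
    logits = -cost
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

M, D, H, W = 3, 4, 2, 2
# Identical features across views -> zero cost at every depth and pixel
consistent = np.ones((M, D, H, W))
cost = variance_cost(consistent)
probs = depth_probability(cost)
```

When all views agree, the cost is zero everywhere and the depth distribution is uniform; real features would concentrate probability at the photo-consistent depth.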
The MVS module can be expressed as $(\{x_k\}, P_d) = f_{MVS}(\{F_j\}, \{P_j\}, P_t)$, where $f_{MVS}$ is the target-oriented multi-scale refinable scene depth estimation MVS module, whose output is the sampling point set $\{x_k\}$ of the target viewpoint on the multilayer surfaces, together with the corresponding sampled depth probability volume $P_d$.
The multilayer neural surface density estimation module recovers density values $\sigma$ from the sampled depth probability volume $P_d$ output by the MVS module, corresponding to the opacity of the multilayer surfaces, in preparation for volume rendering of the final output image.
The reverse feature fusion and multilayer neural surface color decoding module uses the multilayer surface sampling point set obtained in the MVS module to look up, in reverse, the corresponding feature values in the source viewpoint features $\{F_j\}$, and fuses and decodes them into the color values of the multilayer surfaces.
For convenience of explanation, the processing of the reverse feature fusion and multilayer neural surface color decoding module is described:
(1) For a given depth $d$ and pixel $p$, the corresponding location in each source view is found through the camera poses, and the feature set corresponding to the M source views is taken out as $\{f_j(p, d)\}_{j=1}^{M}$;
(2) The M feature vectors are each encoded by an MLP (multi-layer perceptron), and the fused feature is obtained by a mean operation;
(3) Reverse feature fusion is performed for every pixel $p$ and depth $d$, obtaining the multilayer features $\{h_l\}$;
It may be noted that the feature fusion method may also take other forms;
(4) Decoding by a decoder yields the multilayer colors, $\{c_l\}_{l=1}^{L} = f_{dec}(\{h_l\})$, where $L$ is the number of layers of the multilayer neural surface and $f_{dec}$ is the image decoder, finally outputting the multilayer colors $\{c_l\}$.
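The per-view encoding and mean fusion of steps (2)–(3) can be sketched as below; the single linear-plus-tanh layer stands in for the per-view MLP, and all weights and shapes are illustrative:

```python
import numpy as np

def fuse_features(per_view_feats, W_enc):
    """Mean-fuse M per-view feature vectors after a shared encoding
    (one tanh layer as a stand-in for the MLP; W_enc is illustrative)."""
    encoded = np.tanh(per_view_feats @ W_enc)   # (M, C) -> (M, C') per view
    return encoded.mean(axis=0)                 # order-invariant fusion

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))                 # assumed C=8 -> C'=4
feats = rng.standard_normal((3, 8))             # M=3 views, C=8 channels
fused = fuse_features(feats, W)
```

Because the encoder is shared and the fusion is a mean, the result does not depend on the order of the source views, which is the property that makes this fusion applicable to an arbitrary set of input viewpoints.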
The multilayer neural surface voxel rendering module performs volume rendering with the densities of the multilayer neural surfaces obtained by the density estimation module and the colors obtained by the reverse feature fusion and color decoding module, completing the synthesis of the final target image; this can be expressed as $\hat{I}_t(p) = \sum_{l=1}^{L} T_l \alpha_l c_l$, with transmittance $T_l = \prod_{m<l} (1 - \alpha_m)$, where $\alpha_l$ is the opacity derived from the density of layer $l$.
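A minimal sketch of this layered volume rendering (front-to-back alpha compositing); the two-layer colors and opacities are illustrative values:

```python
import numpy as np

def composite(alphas, colors):
    """Front-to-back alpha compositing over the layered surfaces:
    C = sum_l T_l * a_l * c_l, with transmittance T_l = prod_{m<l} (1 - a_m)."""
    T = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = T * alphas
    return (weights[:, None] * colors).sum(axis=0)

alphas = np.array([0.5, 1.0])          # two layers; the back layer is opaque
colors = np.array([[1.0, 0.0, 0.0],    # front layer: red
                   [0.0, 0.0, 1.0]])   # back layer: blue
pixel = composite(alphas, colors)
```

A half-transparent front layer over an opaque back layer blends the two colors equally, which matches the intuition for the opacity values of the multilayer surfaces.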
the sparse view free view image synthesis network based on the multi-layer nerve surface expression is trained by utilizing a large-scale multi-view data set, so that the sparse view free view image synthesis network can be generalized to various multi-view data.
In particular, the training data may be multi-view image data with camera pose.
The input of the network is M viewpoint images most similar to the viewpoint to be synthesizedAnd corresponding camera poseOutput as predicted image +.>The supervision may be a pixel level loss function such as L1, L2 or a perceptual loss function, etc.; the loss function may be an L2 loss:
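A sketch of the L2 pixel loss on a toy prediction; shapes and values are illustrative:

```python
import numpy as np

def l2_loss(pred, target):
    """Mean squared (L2) pixel loss between predicted and ground-truth images."""
    return np.mean((pred - target) ** 2)

pred = np.zeros((2, 2, 3))    # toy predicted image (H, W, RGB)
target = np.ones((2, 2, 3))   # toy ground-truth image
loss = l2_loss(pred, target)  # every pixel channel is off by 1 -> loss = 1.0
```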
it is noted that relational terms such as first and second, and the like, if any, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Technical solutions that still solve the technical problems addressed by the invention remain consistent with the invention and are included in its protection scope.
Claims (6)
1. A free viewpoint image synthesis method based on multilayer neural surface representation, characterized by comprising the following steps:
S1, acquiring multi-viewpoint synchronized or static-scene image data captured from sparse viewpoints, and estimating the sparse viewpoint poses;
S2, designing a free viewpoint image synthesis network;
S3, training the free viewpoint image synthesis network on a large-scale multi-viewpoint dataset so that it generalizes to a variety of multi-viewpoint data;
S4, after the trained free viewpoint image synthesis network model parameters are obtained, applying them to the free viewpoint synthesis task on the sparse multi-viewpoint data acquired in step S1;
wherein the free viewpoint image synthesis network comprises a multi-scale image feature extraction module, a target-oriented multi-scale refinable scene depth estimation MVS module, a multilayer neural surface density estimation module, a reverse feature fusion and multilayer neural surface color decoding module, and a multilayer neural surface voxel rendering module;
the multi-scale image feature extraction module consists of convolution layers and skip connections and is expressed as $\{F^{(1)}, F^{(2)}, F^{(3)}\} = f_{2D}(I)$, where $f_{2D}$ denotes the network of this module, $I$ is an arbitrary image input to the module, and the output of the module is the set of image features at three scales $\{F^{(1)}, F^{(2)}, F^{(3)}\}$;
the target-oriented multi-scale refinable scene depth estimation MVS module realizes scene geometry estimation for an arbitrary viewpoint by modifying a learning-based MVS network, wherein:
the M source viewpoint images are passed through the multi-scale image feature extraction module to obtain M × 3 image features;
at each scale, the source viewpoint features are warped to each depth hypothesis of the target viewpoint; a variance-based cost volume is constructed and, after 3D convolution regularization, the probability of each pixel of the target image at each depth is output;
optimization proceeds gradually from the small scale to the large scale, the depth sampling is updated according to the depth probabilities of the previous level, and the module finally outputs a multilayer surface for the target viewpoint at the original image resolution, where the probability over depth is determined at the finally sampled depth values;
the multilayer neural surface density estimation module recovers density values $\sigma$ from the sampled depth probability volume $P_d$ output by the target-oriented multi-scale refinable scene depth estimation MVS module, corresponding to the opacity of the multilayer surfaces, in preparation for volume rendering of the final output image;
the reverse feature fusion and multilayer neural surface color decoding module uses the multilayer surface sampling point set obtained in the MVS module to look up, in reverse, the corresponding feature values in the source viewpoint features, and fuses and decodes them into the color values of the multilayer surfaces;
and the multilayer neural surface voxel rendering module performs volume rendering with the densities of the multilayer neural surfaces obtained by the density estimation module and the colors obtained by the reverse feature fusion and multilayer neural surface color decoding module, completing the synthesis of the final target image.
2. The free viewpoint image synthesis method based on multilayer neural surface representation according to claim 1, further comprising, after step S4:
S5, since the trained network has a certain generalization ability for data that did not appear in the training set, directly using the network model trained in step S3 for forward prediction, realizing high-quality free viewpoint image synthesis on the sparse multi-viewpoint data to be tested.
3. The method for synthesizing a free viewpoint image based on multi-layer neural surface expression according to claim 1, wherein the acquisition method in step S1 is a Structure-from-Motion method or a multi-viewpoint calibration method with given calibration.
4. The method of claim 1, wherein in step S3 the large-scale multi-view dataset consists of multi-view image data with camera poses and is divided into a training set, a validation set, and a test set.
5. The method for synthesizing a free viewpoint image based on multi-layer neural surface expression according to claim 1, wherein in step S3 the multi-viewpoint synchronized or static-scene image data acquired from sparse viewpoints is denoted {I_i}, i = 1, …, N, where N is the number of input views; sparse-viewpoint pose estimation yields the pose of each viewpoint {P_i}, i = 1, …, N, where P_i = K_i [R_i | t_i] contains the intrinsic parameters K_i and extrinsic parameters [R_i | t_i] of each view.
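The pose decomposition in this claim, intrinsics K_i and extrinsics [R_i | t_i], composes into the usual 3x4 projection matrix. A small sketch; the symbols follow the claim, while the helper names are illustrative:

```python
import numpy as np

def projection_matrix(K, R, t):
    """P = K [R | t]: 3x4 projection from intrinsics K and extrinsics (R, t)."""
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X):
    """Project a 3D world point X (3,) to pixel coordinates (2,)."""
    x = P @ np.append(X, 1.0)   # homogeneous projection
    return x[:2] / x[2]         # perspective divide
```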
6. The free viewpoint image synthesis method based on multi-layer neural surface expression according to claim 1, wherein in step S3 the pose of the target viewpoint is defined as P_t; according to the position and orientation of the target viewpoint, the images {I_j}, j = 1, …, M, of the M source viewpoints closest to the target viewpoint among the input viewpoints, together with their camera poses {P_j}, are taken as the input to the network.
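Selecting the M nearest source views can be done from the camera centres implied by the extrinsics. A position-only sketch (the claim also considers orientation, which could be added as an angular term; all names are illustrative):

```python
import numpy as np

def camera_center(R, t):
    """Optical centre C = -R^T t of a view with extrinsics [R | t]."""
    return -R.T @ t

def nearest_source_views(target_pose, source_poses, M):
    """Indices of the M source views whose camera centres lie closest
    to the target view's centre."""
    ct = camera_center(*target_pose)
    dists = [np.linalg.norm(camera_center(R, t) - ct)
             for R, t in source_poses]
    return list(np.argsort(dists)[:M])
```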
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211391996.4A CN115439388B (en) | 2022-11-08 | 2022-11-08 | Free viewpoint image synthesis method based on multilayer nerve surface expression |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115439388A CN115439388A (en) | 2022-12-06 |
CN115439388B true CN115439388B (en) | 2024-02-06 |
Family
ID=84252759
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211391996.4A Active CN115439388B (en) | 2022-11-08 | 2022-11-08 | Free viewpoint image synthesis method based on multilayer nerve surface expression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115439388B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105141956A (en) * | 2015-08-03 | 2015-12-09 | 西安电子科技大学 | Incremental rate distortion optimization method based on free viewpoint video depth map coding |
CN105247862A (en) * | 2013-04-09 | 2016-01-13 | 联发科技股份有限公司 | Method and apparatus of view synthesis prediction in three-dimensional video coding |
JP2019159840A (en) * | 2018-03-13 | 2019-09-19 | 萩原電気ホールディングス株式会社 | Image synthesizing apparatus and image synthesizing method |
CN111028273A (en) * | 2019-11-27 | 2020-04-17 | 山东大学 | Light field depth estimation method based on multi-stream convolution neural network and implementation system thereof |
CN111144214A (en) * | 2019-11-27 | 2020-05-12 | 中国石油大学(华东) | Hyperspectral image unmixing method based on multilayer stack type automatic encoder |
CN111951203A (en) * | 2020-07-01 | 2020-11-17 | 北京大学深圳研究生院 | Viewpoint synthesis method, apparatus, device and computer readable storage medium |
CN112637582A (en) * | 2020-12-09 | 2021-04-09 | 吉林大学 | Three-dimensional fuzzy surface synthesis method for monocular video virtual view driven by fuzzy edge |
CN114463408A (en) * | 2021-12-20 | 2022-05-10 | 北京邮电大学 | Free viewpoint image generation method, device, equipment and storage medium |
CN114627223A (en) * | 2022-03-04 | 2022-06-14 | 华南师范大学 | Free viewpoint video synthesis method and device, electronic equipment and storage medium |
CN114663543A (en) * | 2022-03-31 | 2022-06-24 | 西安交通大学 | Virtual view synthesis method based on deep learning and multi-view geometry |
CN114666564A (en) * | 2022-03-23 | 2022-06-24 | 南京邮电大学 | Method for synthesizing virtual viewpoint image based on implicit neural scene representation |
CN114820901A (en) * | 2022-04-08 | 2022-07-29 | 浙江大学 | Large-scene free viewpoint interpolation method based on neural network |
CN114820945A (en) * | 2022-05-07 | 2022-07-29 | 北京影数科技有限公司 | Sparse sampling-based method and system for generating image from ring shot image to any viewpoint image |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10950037B2 (en) * | 2019-07-12 | 2021-03-16 | Adobe Inc. | Deep novel view and lighting synthesis from sparse images |
Non-Patent Citations (6)
Title |
---|
Complete Multi-View Reconstruction of Dynamic Scenes from Probabilistic Fusion of Narrow and Wide Baseline Stereo; Tony Tung et al.; 2009 IEEE 12th International Conference on Computer Vision; 20091231; pp. 1709-1716 *
Neural Sparse Voxel Fields; Lingjie Liu et al.; arXiv:2007.11571v2; 20210131; pp. 1-22 *
VoxGRAF: Fast 3D-Aware Image Synthesis with Sparse Voxel Grids; Katja Schwarz et al.; arXiv:2206.07695v2; 20220630; pp. 1-20 *
Research on 3D Face Reconstruction and Free-Viewpoint Video Generation; Wang Yanru; China Master's Theses Full-text Database (Information Science and Technology); 20210415; vol. 2021, no. 04; pp. I138-552 *
Research on Image-Based Free-Viewpoint Synthesis Methods; Li Minghao; China Master's Theses Full-text Database (Information Science and Technology); 20220115; vol. 2022, no. 01; pp. I138-2246 *
Light-Field Image Depth Estimation Based on Multi-Stream Epipolar Convolutional Neural Networks; Wang Shuo et al.; Computer Applications and Software; 20200812; vol. 37, no. 08; pp. 194-201 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lee et al. | From big to small: Multi-scale local planar guidance for monocular depth estimation | |
CN110782490B (en) | Video depth map estimation method and device with space-time consistency | |
Wang et al. | Deep learning for hdr imaging: State-of-the-art and future trends | |
WO2021048607A1 (en) | Motion deblurring using neural network architectures | |
CN111901598B (en) | Video decoding and encoding method, device, medium and electronic equipment | |
CN114936605A (en) | Knowledge distillation-based neural network training method, device and storage medium | |
Gu et al. | Coupled real-synthetic domain adaptation for real-world deep depth enhancement | |
CN112288788A (en) | Monocular image depth estimation method | |
JP2022536381A (en) | MOTION TRANSITION METHOD, APPARATUS, DEVICE, AND STORAGE MEDIUM | |
Zhao et al. | Vcgan: Video colorization with hybrid generative adversarial network | |
Li et al. | Uphdr-gan: Generative adversarial network for high dynamic range imaging with unpaired data | |
CN115002379B (en) | Video frame inserting method, training device, electronic equipment and storage medium | |
CN114996814A (en) | Furniture design system based on deep learning and three-dimensional reconstruction | |
CN115496663A (en) | Video super-resolution reconstruction method based on D3D convolution intra-group fusion network | |
CN115170388A (en) | Character line draft generation method, device, equipment and medium | |
CN116091955A (en) | Segmentation method, segmentation device, segmentation equipment and computer readable storage medium | |
CN113379606A (en) | Face super-resolution method based on pre-training generation model | |
CN116342675B (en) | Real-time monocular depth estimation method, system, electronic equipment and storage medium | |
CN115439388B (en) | Free viewpoint image synthesis method based on multilayer nerve surface expression | |
Xiao et al. | Progressive motion boosting for video frame interpolation | |
Nie et al. | Context and detail interaction network for stereo rain streak and raindrop removal | |
CN116402908A (en) | Dense light field image reconstruction method based on heterogeneous imaging | |
CN116486009A (en) | Monocular three-dimensional human body reconstruction method and device and electronic equipment | |
Jung et al. | Depth image interpolation using confidence-based Markov random field | |
CN115830094A (en) | Unsupervised stereo matching method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||