CN115439388A

CN115439388A - Free viewpoint image synthesis method based on multilayer neural surface expression

Info

Publication number: CN115439388A
Application number: CN202211391996.4A
Authority: CN
Inventors: 戴翘楚; 吴翼天; 曹静萍
Original assignee: Hangzhou Yilan Technology Co ltd
Current assignee: Hangzhou Yilan Technology Co ltd
Priority date: 2022-11-08
Filing date: 2022-11-08
Publication date: 2022-12-06
Anticipated expiration: 2042-11-08
Also published as: CN115439388B

Abstract

The invention discloses a free viewpoint image synthesis method based on multilayer neural surface expression, which relates to the field of computer vision and comprises the following steps of S1, acquiring image data collected by sparse viewpoints, and estimating the pose of the sparse viewpoints; s2, designing a sparse viewpoint free viewpoint image synthesis network based on multilayer neural surface expression; s3, training the sparse viewpoint free viewpoint image synthesis network based on the multilayer neural surface expression by using a large-scale multi-viewpoint data set; and S4, after obtaining the image synthesis network model parameters, applying the image synthesis network model parameters to the free viewpoint synthesis task of the sparse multi-viewpoint data obtained in the first step. According to the method, a multilayer neural surface expression model is designed, the characteristics of sparse multi-viewpoint images are fully utilized, a high-quality and generalized free viewpoint image synthesis algorithm is completed, and the method is suitable for a free viewpoint image synthesis task of a multi-viewpoint acquisition system.

Description

Free viewpoint image synthesis method based on multilayer neural surface expression

Technical Field

The invention relates to the field of computer vision, in particular to a free viewpoint image synthesis method based on multilayer neural surface expression.

Background

Free viewpoint image synthesis is a key problem in the field of computer vision. With the advent of the 5G era and the development and popularization of virtual reality technology and augmented reality technology, digital images have developed toward the inevitable trend of interchangeability and immersion.

The free viewpoint synthesis has the characteristics of strong three-dimensional immersion, large viewing freedom and rich interactive experience, and is widely applied to various fields such as virtual reality, movie and television production, live sports, cultural social contact and the like.

However, the current free viewpoint system still needs hundreds of cameras, and has complex and expensive structure; meanwhile, most domestic landing systems adopt fixed imaging tracks, the viewing viewpoints are limited, the immersion feeling is insufficient, and the practicability and the economy are improved.

Disclosure of Invention

In order to achieve the purpose, the invention aims to research the problem of free viewpoint image synthesis under a sparse viewpoint, overcome the problem that the final viewpoint synthesis result is influenced by the defect that each group of multi-viewpoint images are required to be trained for a long time or the geometric estimation under the sparse viewpoint in the existing free viewpoint generation algorithm, and provide a framework based on multilayer neural surface expression to realize the scene geometric estimation and texture mapping synthesis of the viewpoint to be synthesized end to end, realize high-quality and efficient free viewpoint image generation, and solve the problems in the background technology.

In the method, based on innovative multilayer neural surface expression, a workflow for synthesizing the free viewpoint images input by sparse viewpoints is designed, scene structure information of a new viewpoint to be synthesized and accurate texture migration and fusion processes are fully learned in a network, and the synthesis of the free viewpoint images with high quality and high efficiency is completed.

The technical scheme is as follows: the free viewpoint image synthesis method based on the multilayer neural surface expression comprises the following steps:

s1, acquiring multi-view synchronous or static scene image data acquired by sparse views, and estimating the pose of the sparse views;

s2, designing a sparse viewpoint free viewpoint image synthesis network based on multilayer neural surface expression;

and S3, training the sparse viewpoint free viewpoint image synthetic network based on the multilayer neural surface expression by using a large-scale multi-viewpoint data set, so that the sparse viewpoint free viewpoint image synthetic network can be generalized to various multi-viewpoint data.

And S4, after the trained sparse viewpoint free viewpoint image synthesis network model parameters based on the multilayer neural surface expression are obtained, applying the trained sparse viewpoint free viewpoint image synthesis network model parameters to the free viewpoint synthesis task of the sparse multi-viewpoint data obtained in the first step.

Further, after step S4, there are: and S5, when the trained network has certain generalization on data which does not appear in the training set, directly utilizing the trained network model in the step S3 to perform forward prediction, and realizing high-quality free viewpoint image synthesis of the sparse multi-viewpoint data to be tested.

Further, in step S1, the acquisition method is a Structure-from-Motion method or a multi-view calibration method with a given calibration object scale.

Further, the free viewpoint synthesis network comprises a Multi-scale image feature extraction module, an MVS (Multi-ViewStereo) module for target-oriented Multi-scale refined scene depth (depth) estimation, a Multi-layer neural surface density estimation module, a reverse feature fusion and Multi-layer neural surface color decoding module, and a Multi-layer neural surface voxel rendering module.

Further, in step S3, the training data is multi-viewpoint image data with a camera pose, and is divided into a training set, a validation set, and a test set, and the training makes the network converge on the validation set.

Further, in step S3, setting is made

，

Is the number of viewpoints entered; estimating the pose of the sparse viewpoints to obtain the pose of each viewpoint

(ii) a Wherein

Respectively including the internal reference of each viewpoint

Root of external ginseng

(rotation matrix and translation matrix).

Further, in step S3, the pose of the target viewpoint is defined as

Finding out the images of M source viewpoints which are closest to the target viewpoint in the input viewpoints according to the position and the orientation of the target viewpoint

And camera pose

As input to the network.

Further, the multi-scale image feature extraction module is composed of a convolution layer and a jump connection layer, and the multi-scale image feature extraction module is represented as:

wherein, the first and the second end of the pipe are connected with each other,

on behalf of the network of the present module,

for any image input to the module, the output of the module can be three-scale image features

。

Further, the MVS module implements scene geometry estimation at any viewpoint by modifying a learning-based MVS network, and the implementation method includes the following steps:

m source viewpoint images pass through a multi-scale image feature extraction module to obtain M multiplied by 3 image features;

the method comprises the steps of realizing the transformation from source viewpoint characteristics to a certain depth of a target viewpoint corresponding to each scale, and outputting the probability of each pixel point of a target image on each depth after the regularization of 3D convolution through constructing a cost body of variance;

optimizing from small scale to large scale gradually, updating sampling according to the depth value probability of the previous layer, and finally outputting the depth probability of the target point corresponding to the multilayer surface and the curved surface under the resolution of the original image, wherein the depth probability is determined by the finally sampled depth value).

Further, the multi-layer neural surface density estimation module samples the depth probability volume from the output of the MVS module

To recover the density values on the multi-layer surface points

Correspondingly representing the opacity of the multilayer surface, and preparing for volume rendering to obtain a final output image;

the reverse feature fusion and multilayer neural surface color decoding module reversely accesses the source viewpoint features by utilizing a multilayer surface sampling point set obtained by an MVS module

Fusing and decoding the corresponding characteristic values to form color values of the multilayer surface;

and the multilayer neural surface voxel rendering module performs voxel rendering after acquiring the density corresponding to the multilayer neural surface through the multilayer neural surface density estimation module and acquiring the color corresponding to the multilayer neural surface through inverse feature fusion and the multilayer neural surface color decoding module, so as to complete the synthesis of the final target image.

Compared with the prior art, the invention has the following beneficial effects:

(1) In the invention, because the trained network has certain generalization on the data which does not appear in the training set, the free viewpoint image synthesis task of the sparse multi-viewpoint data to be tested can be completed by directly utilizing the forward prediction of the network;

(2) In the invention, by designing a multilayer neural surface expression model and fully utilizing the characteristics of sparse multi-viewpoint images, a high-quality and generalized free viewpoint image synthesis algorithm is completed, and the method is suitable for a free viewpoint image synthesis task of a multi-viewpoint acquisition system;

(3) In the invention, a network designs multilayer neural surface expression, aims to realize reconstruction of multilayer surfaces of a scene in an end-to-end new viewpoint synthesis framework, and completes high-quality new viewpoint texture fusion and generation based on the multilayer surface expression.

Drawings

Fig. 1 is a work flow chart of a free viewpoint image synthesis method based on multi-layer neural surface expression in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.

Examples

As shown in fig. 1, the method for synthesizing a free viewpoint image based on multi-layer neural surface expression in this embodiment includes the following steps:

wherein, the acquisition method is a Structure-from-Motion method or a multi-viewpoint calibration method for giving a calibration object scale;

that is, the step acquires multi-view data of the same static scene or dynamic scene at the same time, and the multi-view data can be sparse (namely, the change of the viewpoint pose is large);

when the system is used, the purpose of the acquisition is to acquire different viewpoints of a scene to be observed by using limited acquisition equipment, and a free viewpoint image of the scene is expected to be recovered through an algorithm.

when the method is used, the network designs multilayer neural surface expression, aims to realize reconstruction of multilayer surfaces of scenes in an end-to-end new viewpoint synthesis framework, and completes high-quality new viewpoint texture fusion and generation based on the multilayer surface expression.

Wherein the free viewpoint synthesis network includes: the system comprises a Multi-scale image feature extraction module, an MVS (Multi-ViewStereo) module for estimating depth (depth) of a target-oriented Multi-scale fined scene, a multilayer neural surface density estimation module, a reverse feature fusion and multilayer neural surface color decoding module and a multilayer neural surface voxel rendering module.

When in use, the deep neural network (DeepNeralNet) design of the part is a core part.

And S3, training the sparse viewpoint image synthesis network based on the multilayer neural surface expression by using a large-scale multi-viewpoint data set, so that the sparse viewpoint image synthesis network can be generalized to various multi-viewpoint data.

The training data is multi-viewpoint image data with a camera pose; the training data is divided into a training set, a validation set, and a test set, and the training converges the network on the validation set.

The step S3 comprises the following steps:

s31, inputting N viewpoints which are most similar to the viewpoints to be synthesized into the network, and outputting a predicted image under the viewpoints to be synthesized;

s32, supervision can be pixel-level loss functions such as L1 and L2 or perception loss functions;

As shown in fig. 1, the method for synthesizing a free viewpoint image based on multi-layer neural surface expression in this embodiment includes the following specific steps:

acquiring multi-view synchronous (or static scene) image data acquired by sparse view;

，

is the number of views input; estimating the pose of the sparse viewpoints to obtain the pose of each viewpoint

Wherein

Respectively including the internal reference of each viewpoint

Root of Redborne ginseng

(rotation matrix and translation matrix).

Specifically, the step acquires multi-view data of the same static scene or dynamic scene at the same time, and the multi-view data can be sparse (namely, the change of the viewpoint pose is large); the purpose of the acquisition is to acquire different viewpoints of a scene to be observed by using limited acquisition equipment, and it is expected that a free viewpoint image of the scene can be recovered through an algorithm.

Defining the pose of the target viewpoint as

Finding out the images of M source viewpoints closest to the target viewpoint from the input viewpoints according to the position and orientation of the target viewpoint

And camera pose

As input to the network.

The method specifically comprises the following steps:

the multi-scale image feature extraction module is based on a U-Net model and consists of a multi-scale convolution layer and a jump connection layer, and the multi-scale image feature extraction module can be expressed as follows:

wherein the content of the first and second substances,

on behalf of the network of the present module,

；

When the device is used, the three-channel image passes through the feature extraction module, so that multi-scale features with different scales and different channel numbers extracted at different depths in a network can be obtained, the multi-scale features comprise image features corresponding to different perception domains, and the multi-scale features are used for subsequent neural surface positioning and reverse feature fusion.

The target-oriented multi-scale MVS module capable of refining scene depth estimation realizes scene geometric estimation of any viewpoint by modifying a learning-based MVS network, and the realization method comprises the following steps:

(1) Enabling the M source viewpoint images to pass through a multi-scale image feature extraction module to obtain M multiplied by 3 image features;

(2) The method comprises the steps of realizing the transformation from source viewpoint characteristics to a certain depth of a target viewpoint corresponding to each scale, and outputting the probability of each pixel point of a target image on each depth after the regularization of 3D convolution through constructing a cost body of variance;

(3) Optimizing from small scale to large scale gradually, updating sampling according to the depth value probability of the previous layer, and finally outputting the depth probability of the target point corresponding to the multilayer surface (curved surface, determined by the final sampled depth value) under the resolution of the original image.

The MVS module can be expressed as:

wherein the content of the first and second substances,

the module is the MVS module for the target-oriented multi-scale scene depth estimation capable of being refined, and the output is a multi-layer surface sampling point set of a target viewpoint

Corresponding sampling depth probability body of

。

Multi-layer neural surface density estimation module, sampling depth probability volume from output of MVS module

To recover the density values on the multi-layer surface points

a reverse feature fusion and multilayer neural surface color decoding module, which utilizes the multilayer surface sampling point set obtained by the MVS module to reversely access the source viewpoint feature

And fusing and decoding the corresponding features to form color values of the multilayer surface.

For convenience of explanation, the processing procedure of the inverse feature fusion and multi-layer neural surface color decoding module is described as follows:

(1) Corresponding to a certain depth

And a certain pixel point

By passing

And

finding out the characteristics of the corresponding source viewpoint, and extracting the characteristics corresponding to the M source viewpoints into a characteristic set

；

(2) Respectively encoding the M feature vectors by MLP (multi-layer perceptron), and averagingOperate to obtain fused features

；

(3) Corresponding to each pixel point

And depth

All go backward to perform feature fusion to obtain multi-layer features

；

It is noted that the feature fusion method may also be in other feature fusion forms;

(4) Through decoding by a decoder, multi-layer colors are obtained as follows:

wherein the content of the first and second substances,

is the number of layers on the surface of the multi-layer nerve,

for image decoder, multi-layer color is finally output

。

The multilayer neural surface voxel rendering module performs voxel rendering after the density corresponding to the multilayer neural surface obtained by the multilayer neural surface density estimation module and the color corresponding to the multilayer neural surface obtained by the inverse feature fusion and multilayer neural surface color decoding module to complete the synthesis of a final target image, which can be expressed as:

the sparse viewpoint free viewpoint image synthesis network based on the multilayer neural surface expression is trained by utilizing a large-scale multi-viewpoint data set, so that the sparse viewpoint free viewpoint image synthesis network can be generalized to various multi-viewpoint data.

In particular, the training data may be multi-view image data with camera poses.

The input of the network is M viewpoint images most similar to the viewpoint to be synthesized

And corresponding camera pose

Output as a predicted image at the view point to be synthesized

Supervision may be a pixel level loss function such as L1, L2, or a perceptual loss function, etc.; the loss function may be the L2 loss:

after the trained sparse viewpoint free viewpoint image synthesis network model parameters based on the multilayer neural surface expression are obtained, the method can be applied to the free viewpoint synthesis task of the sparse multi-viewpoint data obtained in the first step. Because the trained network has certain generalization on data which does not appear in a training set, forward prediction can be directly carried out by using the trained network model, and high-quality free viewpoint image synthesis of sparse multi-viewpoint data to be tested is realized.

It is noted that, in this document, relational terms such as first and second, and the like, if any, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrases "comprising a," "8230," "8230," or "comprising" does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

The technical problems to be solved are still consistent with the present invention and should be included in the scope of the present invention.

Claims

1. The free viewpoint image synthesis method based on the multilayer neural surface expression is characterized by comprising the following steps of:

s3, training the sparse viewpoint free viewpoint image synthetic network based on the multilayer neural surface expression by using a large-scale multi-viewpoint data set, so that the sparse viewpoint free viewpoint image synthetic network can be generalized to various multi-viewpoint data;

and S4, after the trained sparse viewpoint free viewpoint image synthesis network model parameters based on the multilayer neural surface expression are obtained, applying the trained sparse viewpoint free viewpoint image synthesis network model parameters to the free viewpoint synthesis task of the sparse multi-viewpoint data obtained in the S1.

2. The free viewpoint image synthesis method based on multilayer neural surface expression as set forth in claim 1, wherein after step S4, there are further:

and S5, when the trained network has certain generalization on the data which does not appear in the training set, directly utilizing the network model trained in the step S3 to carry out forward prediction, and realizing the high-quality free viewpoint image synthesis of the sparse multi-viewpoint data to be tested.

3. The free viewpoint image synthesis method based on multi-layer neural surface expression as claimed in claim 1, wherein in step S1, the collection method is Structure-from-Motion method or multi-viewpoint calibration method with given calibration object dimension.

4. The method of claim 1, wherein the free viewpoint image synthesis network comprises a multi-scale image feature extraction module, an MVS module for object-oriented multi-scale refineable scene depth estimation, a multi-layer neural surface density estimation module, a reverse feature fusion and multi-layer neural surface color decoding module, and a multi-layer neural surface voxel rendering module.

5. The free viewpoint image synthesis method based on multi-layer neural surface expression as claimed in claim 1, wherein in step S3, the training data is multi-viewpoint image data with camera pose, and is divided into a training set, a validation set and a test set, and the training makes the network converge on the validation set.

6. The free viewpoint image synthesis method based on multi-layer neural surface expression as claimed in claim 1, wherein in step S3,

setting up

，

Is the number of viewpoints entered;

estimating the pose of the sparse viewpoints to obtain the pose of each viewpoint

；

Wherein

Respectively including the internal reference of each viewpoint

Root of external ginseng

。

7. The free viewpoint image synthesis method based on multi-layer neural surface expression as claimed in claim 1, wherein in step S3, the pose of the target viewpoint is defined as

And camera pose

As input to the network.

8. The free viewpoint image synthesis method based on multi-layer neural surface expression as claimed in claim 1, wherein the multi-scale image feature extraction module is composed of a convolutional layer and a jump connection layer, and the multi-scale image feature extraction module is represented as:

wherein the content of the first and second substances,

on behalf of the network of the present module,

。

9. The free viewpoint image synthesis method based on multilayer neural surface expression as claimed in claim 4, wherein the MVS module realizes scene geometric estimation of any viewpoint by modifying the learning-based MVS network, and the realization method comprises the following steps:

enabling the M source viewpoint images to pass through a multi-scale image feature extraction module to obtain M multiplied by 3 image features;

10. The free viewpoint map based on multi-layer neural surface representation of claim 4The image synthesis method is characterized in that the multi-layer neural surface density estimation module samples a depth probability volume from the output of the MVS module

To recover the density values on the multi-layer surface points

the reverse feature fusion and multilayer neural surface color decoding module reversely accesses the source viewpoint features by utilizing a multilayer surface sampling point set obtained by the MVS module