CN115457182A - An Interactive Viewpoint Image Synthesis Method Based on Multi-plane Image Scene Representation - Google Patents
- Publication number: CN115457182A (application CN202211191210.4A)
- Authority: CN (China)
- Prior art keywords: image, plane, network, scene representation, input
- Prior art date: 2022-09-28
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T15/005—General purpose rendering architectures (G06T15/00—3D [Three Dimensional] image rendering)
- G06N3/08—Learning methods (G06N3/02—Neural networks; G06N3/00—Computing arrangements based on biological models)
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration (G06T7/00—Image analysis)
- G06T9/002—Image coding using neural networks (G06T9/00—Image coding)
- G06T2207/20081—Training; Learning (G06T2207/20—Special algorithmic details)
- G06T2207/20084—Artificial neural networks [ANN] (G06T2207/20—Special algorithmic details)
- G06T2207/30244—Camera pose (G06T2207/30—Subject of image; Context of image processing)
Abstract
Description
Technical Field
The present invention relates to the field of image processing, and in particular to an interactive viewpoint image synthesis method based on multi-plane image scene representation.
Background Art
With the development of deep learning, view synthesis has attracted considerable attention in computer vision and computer graphics research because of its wide range of applications, for example providing users with real-time interactive experiences in immersive displays and in augmented and virtual reality (AR/VR). The challenge of view synthesis with explicit three-dimensional representations is that an accurate geometric understanding of the scene must be inferred from the existing viewpoints. The multi-plane image (MPI) has proven to be a convenient volumetric representation that can efficiently estimate the geometry of a scene for synthesizing novel-viewpoint images. An MPI is a set of semi-transparent images distributed at different depths; it can encode diffuse surfaces as well as non-Lambertian effects such as transparent and reflective regions. Given an MPI scene representation, a novel-viewpoint image of the target view can be rendered simply by applying inverse homography transformations and back-to-front alpha compositing.
At present, much work has focused on learning MPI scene representations from a single input or from multiple inputs. Tucker et al. (Tucker R, Snavely N. Single-view view synthesis with multiplane images.) predict the MPI representation of a scene from a single input image, while the MINE method extends the MPI to a continuum of depth planes by introducing a planar neural radiance field: given a single image as input, a plane at any depth of the MPI can be predicted. Although single-input methods achieve good results, they require point clouds as supervision to resolve the scale ambiguity of monocular depth estimation, even though predicting the MPI is itself a form of depth estimation; MPI prediction methods based on multiple inputs do not suffer from this scale ambiguity. Given an input stereo image pair captured by a narrow-baseline stereo camera, Zhou et al. (Zhou T, Tucker R, Flynn J, et al. Stereo magnification: Learning view synthesis using multiplane images.) use an end-to-end two-dimensional deep learning network to infer the MPI scene representation and generate novel-viewpoint images with a differentiable rendering module; this problem is known as stereo magnification. Building on this work, Srinivasan et al. (Srinivasan P P, Tucker R, Barron J T, et al. Pushing the boundaries of view extrapolation with multiplane images.) presented a theoretical analysis showing that the range of views that can be rendered from an MPI increases linearly with the disparity sampling frequency of the MPI, together with a two-stage framework based on three-dimensional convolutional neural networks that uses optical flow to predict occluded content. Although increasing the number of MPI layers can effectively extend the boundary of view extrapolation, the number of layers cannot be increased without limit because of GPU memory constraints. Flynn et al. (Flynn J, Broxton M, Debevec P, et al. DeepView: View synthesis with learned gradient descent.) treat MPI prediction as an inverse problem prone to over-fitting and solve it with learned gradient descent, but that method is computationally expensive.
Summary of the Invention
In view of the problem described in the background art that existing methods cannot capture the feature connections of the MPI across depth planes, so that the synthesized novel-view images often exhibit obvious distortions and artifacts, the present invention provides a novel view synthesis method based on the MPI scene representation. The method uses a three-dimensional convolutional neural network to capture spatial features across multiple depth planes, establishes the ability of the MPI to predict occluded and hidden regions on the depth planes, and improves the prediction accuracy of the plane transparency values. The method is validated on the Spaces dataset and the RealEstate10K dataset, and the experiments show that the present invention can effectively synthesize novel-viewpoint images with better performance than existing methods.
An interactive viewpoint image synthesis method based on multi-plane image scene representation comprises the following steps:
Step 1: obtaining training data and preprocessing them.
Step 2: inputting the training image data obtained in Step 1 into a three-dimensional convolutional neural network built on the multi-plane image scene representation for training.
The three-dimensional convolutional neural network comprises a homography transformation module, a three-dimensional convolutional encoder-decoder architecture, and a network output module.
The homography transformation module encodes the position information of the reference image pair acquired by the reference cameras and re-projects each input image into the target camera through a homography transformation matrix.
The three-dimensional convolutional encoder-decoder architecture comprises a preprocessing block and a four-level encoder-decoder structure, in which the encoder and decoder at each level are connected by skip connections to form a U-shaped network.
The network output module generates the multi-plane image scene representation from the blending weights and alpha images predicted by the network, and the image at the target viewpoint is rendered by compositing the multi-plane images from back to front.
Step 3: based on the trained network, inputting a test image for inference to obtain the final synthesized novel-viewpoint image.
Further, in Step 1, the data preprocessing comprises performing data augmentation on the training images.
Further, in Step 2, the homography transformations use the same set of depths, and the geometry of the scene is inferred by comparing the different input images.
Further, in the homography transformation module, based on the parameters of the reference cameras, the position information of the input reference image pair I1 and I2 is encoded and the corresponding plane sweep volumes (PSVs) are computed, the reference images being re-projected into the target camera at a fixed set of depths D, with the projected points in the reference views and in the target view related by homography matrices.
Further, in Step 2, the three-dimensional convolutional encoder-decoder architecture captures spatial features across multiple depth planes and predicts the occluded and hidden regions of the multi-plane image on the depth planes.
Further, in the three-dimensional convolutional encoder-decoder architecture, the convolution operations are performed over the 3N color channels of D consecutive images of resolution H×W, where N is the number of input images; the first convolutional layer of the preprocessing block uses a 7×7×7 kernel, and all remaining three-dimensional convolutions use 3×3×3 kernels.
Further, each level of the encoder-decoder structure contains one encoding block and one decoding block; an encoding block consists of four three-dimensional convolutions with a skip connection between every two convolutions, and all encoding blocks except the first downsample the input tensor; a decoding block consists of two three-dimensional convolutional layers with 3×3×3 kernels and one upsampling layer.
The beneficial effects achieved by the present invention are as follows:
1) A three-dimensional convolutional neural network with an encoder-decoder architecture is built to capture spatial features across multiple depth planes, which eliminates distortions and artifacts in novel-viewpoint images and improves their synthesis quality.
2) The ability to predict occluded regions on the depth planes is established.
3) The accuracy of novel-viewpoint image synthesis is improved.
Brief Description of the Drawings
Fig. 1 is a flowchart of the view synthesis algorithm using a three-dimensional convolutional neural network based on the MPI scene representation in an embodiment of the present invention.
Fig. 2 is a schematic diagram of the three-dimensional convolutional neural network architecture in an embodiment of the present invention.
Fig. 3 is a schematic diagram of the architecture of encoding block 2 in an embodiment of the present invention.
Fig. 4 is a schematic diagram of the architecture of decoding block 2 in an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings.
The overall structure of the present invention is shown in Fig. 1. A viewpoint synthesis framework based on the MPI scene representation is proposed; it uses a three-dimensional convolutional neural network to improve the extraction of multi-plane features and to blend spatial features across the planes, reconstructing a high-quality MPI scene representation, and it is trained end to end. The method specifically comprises the following steps.
Step 1: obtaining training image data and preprocessing.
Since the network requires many iterations of training and must be adapted to a variety of application scenarios, the amount of training data prepared must reach a certain scale. The Spaces dataset and the RealEstate10K dataset are used as training image data. The RealEstate10K dataset contains about 7,500 indoor and outdoor scenes extracted from YouTube videos, with calibrated camera intrinsics and relative poses. The Spaces dataset contains 100 indoor and outdoor scenes captured with 16 cameras spaced about 10 cm apart, whose intrinsic and extrinsic parameters were calibrated using structure-from-motion (SfM). Ninety scenes of this dataset are used for training and the remaining ten for evaluation, with the image resolution set to 800×480. These datasets allow deep learning methods to train their architectures on large-scale data.
Step 2: the view synthesis algorithm using a three-dimensional convolutional neural network based on the MPI scene representation, shown in Fig. 1, consists of three parts: a homography transformation module, the network framework, and a network output module. The homography transformation module encodes the position information of the input reference image pair and re-projects each input image into the target camera through homography transformation matrices. The network framework uses an encoder-decoder architecture based on three-dimensional convolutions, in which the encoder and decoder at each level are connected by skip connections to form a U-shaped network, which enhances the ability to capture contextual information. The network output module generates the MPI scene representation from the blending weights and alpha images predicted by the network, and the image at the target viewpoint is rendered by compositing the MPI from back to front.
Step 21: two input images I1 and I2 are given, with known camera parameters C1 = (A1, [R1, t1]) and C2 = (A2, [R2, t2]), where Ai denotes the camera intrinsics and [Ri, ti] (i = 1, 2) denotes the camera extrinsics (i.e., the rotation matrix and translation vector). As shown in Fig. 1, to encode the position information of the input reference image pair I1 and I2, a pair of plane sweep volumes (PSVs) is computed, that is, the reference images are re-projected into the target camera at a fixed set of depths D. Consider a pixel pi(ui, vi, 1) in the reference view Ii (i = 1, 2), where (ui, vi, 1) are the coordinates of pi, and the corresponding voxel at depth zi in the reference camera coordinate system. If the depth of this voxel in the target camera coordinate system is zv, the projection of pi(ui, vi, 1) onto the pixel pv(uv, vv, 1) in the target view It, where (uv, vv, 1) are the coordinates of pv, can be expressed as:
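A standard pinhole-projection form of this relation, given here as an assumed reconstruction of Eq. (1) (world-to-camera extrinsics, homogeneous pixel coordinates), is:

$$z_v\,p_v = A_v\!\left(R_v R_i^{-1}\!\left(z_i\,A_i^{-1}p_i - t_i\right) + t_v\right) \qquad (1)$$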
where Av denotes the intrinsics of the target camera and [Rv, tv] denotes its extrinsics (i.e., the rotation matrix and translation vector). A three-dimensional scene can be partitioned into multiple planes whose points lie at the same distance (i.e., the same disparity value) from the reference camera. For points on such a depth plane, their projection pi in the reference view and their projection pv in the target view are related by a homography matrix Hvi,z, which is obtained by simplifying Eq. (1):
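For a fronto-parallel plane at depth z in the reference camera frame (normal n = (0, 0, 1)ᵀ), the standard plane-induced homography, given here as an assumed form of Eq. (2), is:

$$H_{vi,z} \;\propto\; A_v\!\left(R_v R_i^{-1} + \frac{\left(t_v - R_v R_i^{-1} t_i\right)n^{T}}{z}\right) A_i^{-1} \qquad (2)$$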
Since a series of homography matrices Hvi,z is applied to each reference view, a set of homography-warped views Pi (i.e., the plane sweep volume, PSV) is obtained for each of them, corresponding to the re-projection results at the different depth planes. Each PSV tensor has size [3, D, H, W]; the two PSVs are concatenated along the color channel to obtain a [3N, D, H, W] tensor that serves as the input of the three-dimensional convolutional neural network, where H and W are the image height and width, D is the number of depth planes, and N is the number of input images. The three-dimensional convolutional neural network learns to infer the scene geometry by comparing the PSVs of the two different views.
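The PSV construction described in Step 21 can be sketched as follows. This is a minimal illustrative sketch rather than the patent's implementation: the helper names (plane_homography, build_psv), the use of OpenCV for warping, and the homography form from Eq. (2) with the relative pose [R | t] from reference to target camera are assumptions consistent with the text above.

```python
import cv2
import numpy as np

def plane_homography(A_i, A_v, R, t, z, n=np.array([0.0, 0.0, 1.0])):
    """Homography mapping reference-view pixels to target-view pixels for the
    fronto-parallel plane n . X = z in the reference camera frame; [R | t] is the
    relative pose from the reference camera to the target camera (assumed convention)."""
    H = A_v @ (R + np.outer(t, n) / z) @ np.linalg.inv(A_i)
    return H / H[2, 2]

def build_psv(image, A_i, A_v, R, t, depths):
    """Warp one reference image onto each of the depth planes; returns a [3, D, H, W] array."""
    h, w = image.shape[:2]
    planes = []
    for z in depths:
        H = plane_homography(A_i, A_v, R, t, z)
        warped = cv2.warpPerspective(image, H, (w, h))  # re-projection of the image at depth z
        planes.append(warped.transpose(2, 0, 1))        # HWC -> CHW
    return np.stack(planes, axis=1)                     # [3, D, H, W]

# Two reference views give two PSVs, concatenated along the color channel into the
# [3N, D, H, W] network input (here N = 2); I1, I2, A1, A2, Av, R1v, t1v, ... are placeholders.
# psv_input = np.concatenate([build_psv(I1, A1, Av, R1v, t1v, depths),
#                             build_psv(I2, A2, Av, R2v, t2v, depths)], axis=0)
```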
Step 22: as shown in Fig. 2, the three-dimensional convolutional neural network architecture consists of two parts: a preprocessing block and a four-level encoder-decoder structure. Using three-dimensional convolutions during training to extract spatial features across depths allows the spatial relationships between the planes to be learned effectively. The three-dimensional convolution operations are performed over the 3N color channels of D consecutive images of resolution H×W; taking D = 32 depth planes and N = 2 input images of resolution 480×800 as an example, the input of the three-dimensional convolutional neural network is denoted 6@32,480,800. The preprocessing block downsamples the input tensor to 32@16,240,400 while extracting spatial features across the 32 depth planes. Except for the first convolutional layer of the preprocessing block, which uses a 7×7×7 kernel, all three-dimensional convolutions use 3×3×3 kernels. The encoding block and the decoding block at each level are connected by skip connections to form a U-shaped network, which enhances the ability to capture contextual information.
Each level of the encoder-decoder architecture contains one encoding block and one decoding block, as shown in Figs. 3 and 4, which take encoding block 2 and decoding block 2 as examples. Encoding block 2 consists of four three-dimensional convolutions with a skip connection between every two convolutions; a three-dimensional convolution with a 1×1×1 kernel is applied on the first skip connection of encoding block 2 to downsample the input tensor. Note that only encoding block 1 does not downsample the input feature tensor (the other encoding blocks are essentially similar in structure to encoding block 2 shown in Fig. 3; in addition, in Fig. 2 the input of encoding block 1 is not downsampled while its output is). Decoding block 2 consists of two three-dimensional convolutional layers with 3×3×3 kernels and one upsampling layer, and the other decoding blocks are identical to decoding block 2. In Fig. 2, the parameters of each module indicate the change in size of the 3N@D,H,W tensor, and "upsampling 2×" in the figure indicates that the resolution and the depth-channel dimension are doubled.
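A minimal PyTorch sketch of one encoding block and one decoding block in the spirit of Figs. 3 and 4 is given below. The class names, channel counts, activation placement, and the use of strided convolution for downsampling are assumptions; the text above only fixes the kernel sizes (3×3×3, with a 1×1×1 convolution on the downsampling skip connection), the four-convolution/two-convolution structure, and the pairwise skip connections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncodingBlock(nn.Module):
    """Four 3D convolutions with a skip connection over every pair of convolutions;
    the first pair optionally downsamples, with a 1x1x1 convolution on its skip path."""
    def __init__(self, in_ch, out_ch, downsample=True):
        super().__init__()
        stride = 2 if downsample else 1
        self.conv1 = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv3d(out_ch, out_ch, 3, padding=1)
        self.skip1 = nn.Conv3d(in_ch, out_ch, 1, stride=stride)   # 1x1x1 conv on the first skip
        self.conv3 = nn.Conv3d(out_ch, out_ch, 3, padding=1)
        self.conv4 = nn.Conv3d(out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        y = F.relu(self.conv2(F.relu(self.conv1(x))) + self.skip1(x))
        return F.relu(self.conv4(F.relu(self.conv3(y))) + y)      # identity skip on the second pair

class DecodingBlock(nn.Module):
    """Two 3x3x3 3D convolutions followed by 2x upsampling of depth and resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv3d(out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        x = F.relu(self.conv2(F.relu(self.conv1(x))))
        return F.interpolate(x, scale_factor=2, mode="trilinear", align_corners=False)

# Illustration only: a downsampling block applied to a [1, 6, 32, 480, 800] tensor
# (3N@D,H,W with N = 2) halves D, H and W:
# y = EncodingBlock(6, 32)(torch.randn(1, 6, 32, 480, 800))   # -> [1, 32, 16, 240, 400]
```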
Step 23, network output module: as shown in Fig. 1, the three-dimensional convolutional neural network directly predicts the opacities αi of one MPI and two blending weights wi (i = 1, 2), while the RGB values of the MPI are well modeled by blending the weights with Pi, where Pi is the plane sweep volume (PSV) obtained through the homography matrices described in Step 21. For each plane of the MPI, its RGB image c is therefore computed as
c = Σ wi ⊙ Pi (i = 1, 2)    (3)
where ⊙ denotes element-wise multiplication.
Finally, the target image It can be rendered from the MPI scene representation M = {ci, αi} (i = 1, 2, ..., D, one pair per depth plane) by alpha compositing. The rendering process is differentiable, and the compositing of the target image It is defined as:
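The standard back-to-front "over" compositing used for MPI rendering, given here as an assumed form of Eq. (4) with the planes ordered from the farthest (i = 1) to the nearest (i = D), is:

$$I_t = \sum_{i=1}^{D} c_i\,\alpha_i \prod_{j=i+1}^{D}\left(1 - \alpha_j\right) \qquad (4)$$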
The present invention trains the MPI prediction network using view synthesis as supervision, with a VGG-19 perceptual loss whose features are unit-normalized along the channel dimension as the loss function:
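A perceptual loss consistent with this description, given here as an assumed form of Eq. (5) in which the ground-truth target image is written $I_t^{gt}$, is:

$$\mathcal{L} = \sum_{l} \lambda_l \left\lVert \phi_l\!\left(I_t^{gt}\right) - \phi_l\!\left(I_t\right) \right\rVert_1 \qquad (5)$$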
where I_t^gt is the ground truth of the target image It, φl is a set of layers of VGG-19, and the weight hyperparameters λl are set to the reciprocal of the number of neurons in the corresponding layer; VGG-19 here is the two-dimensional convolutional neural network used when computing the loss, whose layers φl extract the features of I_t^gt and It so that the loss between them can be computed.
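A sketch of such a channel-normalized VGG-19 perceptual loss is shown below. The chosen feature layers, the L1 distance, and the exact normalization constant are assumptions consistent with the description, not the patent's exact recipe; inputs are assumed to be already normalized to the VGG input statistics.

```python
import torch
import torch.nn as nn
import torchvision

class VGGPerceptualLoss(nn.Module):
    def __init__(self, layer_ids=(2, 7, 12, 21, 30)):   # assumed conv-layer indices of VGG-19
        super().__init__()
        features = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in features.parameters():
            p.requires_grad_(False)
        self.features = features
        self.layer_ids = set(layer_ids)

    def _extract(self, x):
        feats = []
        last = max(self.layer_ids)
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.layer_ids:
                # unit-normalize each feature vector along the channel dimension
                feats.append(x / (x.norm(dim=1, keepdim=True) + 1e-8))
            if i == last:
                break
        return feats

    def forward(self, pred, target):
        loss = 0.0
        for f_p, f_t in zip(self._extract(pred), self._extract(target)):
            lam = 1.0 / f_p[0].numel()                   # weight ~ 1 / number of neurons in the layer
            loss = loss + lam * (f_p - f_t).abs().sum()
        return loss
```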
Step 3: based on the trained network, a test image is input for inference to obtain the final synthesized novel-viewpoint image.
To evaluate the ability of the model to infer target views, experiments were conducted on the RealEstate10K dataset and the Spaces dataset. For a fair comparison, the method uses the same number of depth planes as the other methods; when experimenting on the Spaces dataset, the number of input views is set to 4. The experimental results show that the framework captures spatial features across multiple depth planes to infer the correct scene geometry and the content of occluded regions, thereby synthesizing high-quality target views; the method can also handle thin and complex structures, producing sharper object edges than previous methods.
In the ablation experiments, several numbers of depth planes (D = 8, 16, 24, 32, 40) were used to verify their importance to the method. Two baseline configurations were trained on the Spaces dataset, with input-view baselines of about 20 cm and about 40 cm, using 2 views as input, and the dataset was enlarged to 16 times its original size through data augmentation. The experimental results show that the performance of the model in synthesizing novel-viewpoint images improves as the number of depth planes increases, and this trend holds for both baseline distances. The improvement can be attributed to the denser depth sampling during training, which lets the model learn to infer the scene geometry more accurately. With the same number of depth planes, however, the quality of the synthesized novel-viewpoint images decreases as the baseline distance increases, because propagating volumetric visibility through the network becomes difficult at larger baselines, and the performance of the MPI suffers accordingly.
In summary, the present invention proposes a viewpoint synthesis method based on the multi-plane scene representation. To eliminate distortions and artifacts in novel-viewpoint images and improve their synthesis quality, the framework uses a three-dimensional convolutional neural network to capture spatial features across multiple depth planes while establishing the ability to predict occluded regions on those planes. The method also obtains high-quality synthesis results in special regions, such as specular-reflection regions of the scene. Experimental results on the two datasets show that the quality of novel view synthesis is better than that of previous algorithms, and as the number of depth planes increases in the ablation experiments, the rendering quality of the novel-viewpoint images improves as well.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; any equivalent modification or change made by a person of ordinary skill in the art according to the disclosure of the present invention shall fall within the protection scope set forth in the claims.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211191210.4A (CN115457182A, en) | 2022-09-28 | 2022-09-28 | An Interactive Viewpoint Image Synthesis Method Based on Multi-plane Image Scene Representation |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211191210.4A (CN115457182A, en) | 2022-09-28 | 2022-09-28 | An Interactive Viewpoint Image Synthesis Method Based on Multi-plane Image Scene Representation |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN115457182A | 2022-12-09 |
Family
ID=84307489
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211191210.4A (CN115457182A, en, Pending) | An Interactive Viewpoint Image Synthesis Method Based on Multi-plane Image Scene Representation | 2022-09-28 | 2022-09-28 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115457182A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116452758A * | 2023-06-20 | 2023-07-18 | 擎翌(上海)智能科技有限公司 | Neural radiation field model acceleration training method, device, equipment and medium |
| CN118413675A * | 2024-07-02 | 2024-07-30 | 中国矿业大学 | A context-based progressive three-plane coding image compression algorithm and terminal device |
- 2022-09-28: CN CN202211191210.4A patent/CN115457182A/en active Pending
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116452758A * | 2023-06-20 | 2023-07-18 | 擎翌(上海)智能科技有限公司 | Neural radiation field model acceleration training method, device, equipment and medium |
| CN116452758B * | 2023-06-20 | 2023-10-20 | 擎翌(上海)智能科技有限公司 | Neural radiation field model acceleration training method, device, equipment and medium |
| CN118413675A * | 2024-07-02 | 2024-07-30 | 中国矿业大学 | A context-based progressive three-plane coding image compression algorithm and terminal device |
| CN118413675B * | 2024-07-02 | 2024-09-24 | 中国矿业大学 | Context-based progressive three-plane coding image compression algorithm and terminal equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |