WO2021093584A1 - 基于深度卷积神经网络的自由视点视频生成及交互方法 - Google Patents
基于深度卷积神经网络的自由视点视频生成及交互方法 Download PDFInfo
- Publication number
- WO2021093584A1 WO2021093584A1 PCT/CN2020/124206 CN2020124206W WO2021093584A1 WO 2021093584 A1 WO2021093584 A1 WO 2021093584A1 CN 2020124206 W CN2020124206 W CN 2020124206W WO 2021093584 A1 WO2021093584 A1 WO 2021093584A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- viewpoints
- camera
- free
- convolutional neural
- calibration
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 27
- 238000013527 convolutional neural network Methods 0.000 title claims abstract description 19
- 230000003993 interaction Effects 0.000 title claims abstract description 8
- 239000011159 matrix material Substances 0.000 claims abstract description 15
- 238000004364 calculation method Methods 0.000 claims abstract description 13
- 230000001360 synchronised effect Effects 0.000 claims abstract description 6
- 230000003287 optical effect Effects 0.000 claims description 9
- 230000009466 transformation Effects 0.000 claims description 9
- 238000006073 displacement reaction Methods 0.000 claims description 8
- 238000001914 filtration Methods 0.000 claims description 5
- 230000006835 compression Effects 0.000 claims description 2
- 238000007906 compression Methods 0.000 claims description 2
- 238000007781 pre-processing Methods 0.000 claims description 2
- 230000002194 synthesizing effect Effects 0.000 claims 1
- 238000013135 deep learning Methods 0.000 abstract description 4
- 238000000605 extraction Methods 0.000 abstract 1
- 238000012549 training Methods 0.000 description 11
- 230000006870 function Effects 0.000 description 10
- 238000010586 diagram Methods 0.000 description 7
- 230000002452 interceptive effect Effects 0.000 description 7
- 238000005516 engineering process Methods 0.000 description 5
- 230000000694 effects Effects 0.000 description 4
- 238000012545 processing Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 230000015572 biosynthetic process Effects 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 238000003780 insertion Methods 0.000 description 2
- 230000037431 insertion Effects 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 238000011176 pooling Methods 0.000 description 2
- 238000009877 rendering Methods 0.000 description 2
- 238000003786 synthesis reaction Methods 0.000 description 2
- 238000013519 translation Methods 0.000 description 2
- 238000003491 array Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000009434 installation Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
Images
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/10—Processing, recording or transmission of stereoscopic or multi-view image signals
- H04N13/106—Processing image signals
- H04N13/111—Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
- H04N13/117—Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/10—Processing, recording or transmission of stereoscopic or multi-view image signals
- H04N13/106—Processing image signals
- H04N13/111—Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/20—Image signal generators
- H04N13/204—Image signal generators using stereoscopic image cameras
- H04N13/239—Image signal generators using stereoscopic image cameras using two 2D image sensors having a relative position equal to or related to the interocular distance
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/20—Image signal generators
- H04N13/204—Image signal generators using stereoscopic image cameras
- H04N13/246—Calibration of cameras
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/20—Image signal generators
- H04N13/257—Colour aspects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/20—Image signal generators
- H04N13/296—Synchronisation thereof; Control thereof
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/90—Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Definitions
- the present invention relates to the field of computer vision, and mainly relates to a method for generating and interacting with virtual viewpoints in Free Viewpoint Video (FVV).
- FVV Free Viewpoint Video
- the director shoots and broadcasts video programs to viewers from a limited number of camera angles.
- this output and acquisition of information is one-way.
- the audience can only watch the video from the specific viewpoints chosen by the director. Because the number of cameras at most shooting scenes is limited, the jump in the picture when the director actively switches viewing angles gives the audience a less than ideal viewing experience.
- the free view video technology has been developed rapidly in recent years, and the interactive video viewing method is becoming a new generation of new media.
- in the TV broadcast of typical staged scenes such as sports events, the broadcaster sets up as many cameras as possible in order to capture viewpoints that are as comprehensive as possible. As the number of cameras increases, viewpoint switching becomes smoother, but the data-transmission load also grows linearly.
- virtual viewpoint generation technology came into being to address this. It can generate virtual viewpoints between the physical viewpoints captured by the cameras, thereby shifting the data-transmission pressure at the physical acquisition end to a local or cloud server with high computing power; generating better-quality virtual viewpoints with the lowest possible amount of calculation has therefore become the core of free-viewpoint video technology.
- Deep learning for virtual viewpoint generation is currently carried out largely in the field of video frame interpolation (Frame Interpolation).
- some of these networks use information related to the optical flow between adjacent video frames, a deep network with a specific structure, and the real viewpoint images in a data set to predict and generate virtual viewpoints. If these frame-interpolation networks are used directly for multi-viewpoint shooting of large scenes, the wider baselines and larger displacements between adjacent physical viewpoints produce large-area ghosting artifacts.
- the purpose of the present invention is to provide a method for generating and interacting with virtual viewpoints in free-viewpoint video based on a deep-learning CNN, so as to improve the quality of the virtual viewpoints and reduce the amount of calculation.
- the free-view video generation and interaction method based on deep convolutional neural network includes the following steps:
- the acquisition system includes N cameras.
- the cameras are arranged evenly along a circular arc at the same height; at the center of the arc, the camera poses are calibrated against a reference object, and after calibration the camera positions are fixed; the color of the N cameras is calibrated with the gray-world white-balance algorithm;
- the stitched frames of all moments obtained in step (6) are synthesized into a free-viewpoint video at the shooting frame rate of the multi-camera system.
- the present invention performs pixel-level baseline calibration on the captured multi-viewpoint video sequences and uses a deep convolutional neural network to predict and generate virtual viewpoints. Compared with traditional geometric methods based on depth and parallax, no multi-camera calibration needs to be performed in advance, which overcomes the difficulty and low accuracy of multi-camera calibration in large scenes, while also reducing the amount of calculation and improving the efficiency of virtual viewpoint generation.
- the present invention performs baseline calibration, color calibration, and displacement-threshold filtering based on optical flow calculation on the binocular vision data set during deep convolutional neural network training; the virtual viewpoints synthesized under wide baselines and large displacements between adjacent viewpoints have better quality, and large-area ghosting artifacts are removed to a certain extent.
- Figure 1 is a schematic flow diagram of the method of the present invention
- Figure 2 is a topological diagram of a hardware acquisition system according to an embodiment of the present invention.
- FIG. 3 is a schematic diagram of a baseline calibration method according to an embodiment of the present invention.
- FIG. 4 is a flowchart of a deep convolutional neural network according to an embodiment of the present invention.
- Fig. 5 is an interactive display software interface of free-viewpoint video according to an embodiment of the present invention.
- a multi-camera array as shown in the topology diagram of Fig. 2 is set up on the program stage to synchronously collect video sequence information of the scene; an interactive free-viewpoint video is then synthesized through a series of data-processing steps and displayed through a purpose-built interactive display system for users to watch interactively, making two-way transmission of broadcast information possible.
- the processing flow of this embodiment is shown in Fig. 1 and includes the following steps:
- the topology diagram of the hardware acquisition system is shown in Figure 2.
- the number of cameras is N, and multiple cameras are evenly arranged according to arcs.
- the height of the cameras remains the same.
- the angle between the optical axes of adjacent cameras is controlled to about 30°.
- a flat panel bearing vertical and horizontal reference lines is placed at the center of the scene; the centers of all cameras are aligned with the center O of the panel, the central vertical direction of each camera image is made to coincide with the vertical reference line of the panel, and after calibration the camera positions are fixed.
- the color of the N cameras is calibrated using the gray-world white-balance algorithm (Gray World).
- all cameras are synchronized by an external trigger signal generator through the video data trigger line; the trigger frequency is adjusted so that each camera synchronously captures the shooting scene.
- the camera array set up in step (1) is used to shoot synchronized video sequences of the target scene; the video frames at a given moment are selected, baseline calibration is performed on the N-1 groups of adjacent viewpoints, and, based on the feature points of objects in the scene, the translation coefficients (x, y), rotation coefficient θ, and scaling coefficient k of the affine transformation are set manually so that the feature points at the center of the scene fall at the same reference positions, as shown in the schematic of the baseline calibration system used in this embodiment (Figure 3), where O is the central calibration point of the scene.
- Cam_L and Cam_R represent two cameras with identical parameters shooting the central object of the scene at the same time.
- the N cameras arranged along the arc are processed in turn, in the order of their spatial positions, using the affine matrices obtained in step (3).
- the data set is preprocessed with baseline calibration, color calibration, and displacement-threshold filtering based on optical flow calculation.
- the data set consists of image triplets of three viewpoints ('left, center, right') from many scenes. Each image triplet is first baseline-calibrated in batches, using the same method as step (3), so that several groups of feature points in the three images lie on the same horizontal line. Color calibration uses the gray-world white-balance algorithm so that the three images of the same scene have consistent white-balance parameters. Finally, by computing the pairwise optical-flow maps within each triplet, the pixel displacements of the same object in the same scene are averaged, a threshold is set, and the triplets exceeding the threshold are selected to form a new training data set.
- the structure of the deep convolutional neural network used in this embodiment is shown in FIG. 4, which specifically includes an encoding network and a decoding network (as shown by the two sub-network blocks in the dashed Encoder and Decoder boxes in FIG. 4).
- the images Image1 and Image2 of the left and right viewpoints pass through the encoding network and the decoding network in turn.
- the encoding network passes through the convolutional layer (Conv) and the average pooling layer (Pooling) of various sizes as shown in Figure 4 in turn.
- the decoding network passes through the convolutional layers (Upconv) and linear upsampling layers (upsampling) of the sizes shown in Figure 4, yielding the depth feature-map parameters S1 and S2 of the scene, which are then concatenated and summed with the input images Image1 and Image2 respectively.
- the two-dimensional output image of the virtual viewpoint between the left and right physical viewpoints is predicted.
- the difference between Output and the middle image of the ground-truth data-set triplet is used to quantify the training result, using a loss function composed of the following two terms:
- L_total = L_1 + α·L_2, where L_1 is the 2-norm error between the network-predicted image and the ground truth based on per-pixel RGB differences, L_2 is the difference in the feature structures extracted by the network, and S() denotes the feature extraction; this loss term is used to train the network model's perception of the deep feature structures in the scene.
- the total loss function L_total used for training is the linear weighted sum of L_1 and L_2.
- VVGN Virtual View Generation Network
- the present invention is based on a deep convolutional neural network to predict and generate a virtual viewpoint between two physical viewpoints.
- the input data are preprocessed with pixel-level baseline calibration, and the CNN directly learns the feature structures of the two input viewpoints to produce the output, without calibrating the cameras in advance.
- this step determines the quality of the generated virtual viewpoints.
- the binocular data set is preprocessed by baseline calibration, color calibration, and displacement threshold filtering based on optical flow calculation, and enters the CNN network as shown in Figure 4 for training.
- the training inputs are the two left and right binocular two-dimensional images.
- the training loss functions are:
- L_total = L_1 + α·L_2, where L_1 is the 2-norm error between the network-predicted image and the ground truth based on per-pixel RGB differences, L_2 is the difference in the feature structures extracted by the network, and S() denotes the feature extraction; this loss term is used to train the network model's perception of the deep feature structures in the scene.
- the total loss function L_total used for training is the linear weighted sum of L_1 and L_2. Under wide binocular baselines, this approach achieves better virtual viewpoint quality than existing deep-learning-based video frame-interpolation networks, and its amount of calculation is much lower than that of traditional methods.
- the cv2.VideoWriter() function in OpenCV, or FFmpeg, is used to synthesize the stitched frames of all moments obtained in the previous step into a free-viewpoint video (FVV) at the shooting frame rate of the multi-camera system, which is compressed and stored on the local server at a certain compression ratio.
- the interface of the free-viewpoint video interactive playback software system is shown in Figure 5. It reads the free-viewpoint video (FVV) synthesized in step (8); the user can use the Slider or Dial interactive controls to smoothly switch in real time to the video block of the viewpoint corresponding to a given block index Block_Index, realizing a freely interactive viewing experience.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The present invention discloses a free-viewpoint video generation and interaction method based on a deep convolutional neural network. The specific steps include: using a purpose-built multi-camera synchronous shooting array to collect multi-viewpoint data of the target scene, obtaining synchronized groups of video frame sequences from multiple viewpoints, and then performing pixel-level baseline calibration in batches; using a designed and trained deep convolutional neural network model with an encoding and decoding network structure to extract features from each group of input viewpoint images, obtain depth feature information of the scene, and, combined with the input images, generate a virtual viewpoint image between each pair of adjacent physical viewpoints at every moment; and stitching all viewpoints into free-viewpoint video frames by matrix tiling according to time and the spatial positions of the viewpoints. The method of the present invention requires neither camera calibration nor depth-map computation, which greatly reduces the amount of calculation needed for virtual viewpoint generation and, to a certain extent, improves the quality of the virtual viewpoint images.
Description
The present invention relates to the field of computer vision, and mainly relates to a method for generating and interacting with virtual viewpoints in Free Viewpoint Video (FVV).
In traditional television broadcasting, the director shoots and broadcasts a video program to the audience from a limited number of camera angles. This output and acquisition of information is one-way: the audience can only watch the video from the specific viewpoints chosen by the director. Because most shooting scenes are limited in the number of cameras available, the jump in the picture when the director actively switches viewing angles gives the audience a less than ideal viewing experience. To overcome this passive viewing experience, and aided by improvements in video acquisition equipment and the rapid growth of computing power, free-viewpoint video technology has developed rapidly in recent years, and interactive video viewing is becoming the direction of development for a new generation of new media.
In television broadcasts of typical staged scenes such as sports events, the broadcaster sets up as many cameras as possible in order to capture viewpoints that are as comprehensive as possible. As the number of cameras increases, viewpoint switching becomes smoother, but the data-transmission load also grows linearly. To achieve viewpoint switching that is as smooth as possible with a controllable number of cameras, virtual viewpoint generation technology came into being. This technology can generate virtual viewpoints between the physical viewpoints captured by the cameras, thereby shifting the data-transmission pressure at the physical acquisition end to a local or cloud server with high computing power. Generating better-quality virtual viewpoints with the lowest possible amount of computation has therefore become the core of free-viewpoint video technology.
Some existing virtual viewpoint generation techniques are based on traditional depth- and disparity-based image rendering, such as patent CN 102447932A. That method uses camera intrinsic and extrinsic parameters calibrated in advance to compute a depth map of the captured scene, maps the pixels of the corresponding reference image into three-dimensional space using the depth information in the depth map, then transforms the pixels of the reference image in three-dimensional space to the virtual camera position using the translation parameters and the camera intrinsic parameters, and finally displays the image formed on the virtual camera plane as the virtual viewpoint image. Because this method must traverse and compute every pixel of the image, the amount of calculation is large, and rendering cost grows exponentially with image resolution and the number of cameras. Moreover, this kind of virtual viewpoint generation requires the cameras to be calibrated in advance; in large-scale broadcast scenes such as sports events, the difficulty and accuracy of multi-camera calibration are severely affected, which degrades the quality of the synthesized virtual viewpoints.
A large part of current deep-learning work on virtual viewpoint generation is carried out in the field of video frame interpolation (Frame Interpolation). Some of these networks use information related to the optical flow between adjacent video frames, together with a deep network of a specific structure and the real viewpoint images in a data set, to predict and generate virtual viewpoints. If these frame-interpolation networks are applied directly to multi-viewpoint shooting of large scenes, the wide baselines and large displacements between adjacent physical viewpoints produce large-area ghosting artifacts.
Summary of the Invention
The purpose of the present invention is to provide a method for generating and interacting with virtual viewpoints in free-viewpoint video based on a deep-learning CNN, so as to improve the quality of the virtual viewpoints and reduce the amount of calculation.
The present invention adopts the following technical solution:
The free-viewpoint video generation and interaction method based on a deep convolutional neural network includes the following steps:
(1) Calibrate the pose and color of the cameras in the acquisition system.
The acquisition system includes N cameras arranged evenly along a circular arc at the same height; at the center of the arc, the pose of each camera is calibrated against a reference object, and after calibration the camera positions are fixed; the color of the N cameras is calibrated with the gray-world white-balance algorithm.
(2) Use the camera array of the acquisition system to shoot synchronized video sequences of the target scene, select the video frames at a given moment, perform baseline calibration on the N-1 groups of adjacent viewpoints in turn, and obtain N-1 image affine transformation matrices M_i, i = 1, 2, ..., n.
(3) Use the obtained affine transformation matrices M_i to perform baseline calibration on all frames of the corresponding adjacent viewpoints in turn.
(4) First preprocess the binocular data set with baseline calibration, gray-world color calibration, and displacement-threshold filtering based on optical flow calculation, then train the virtual viewpoint generation capability of the deep convolutional neural network.
(5) Feed the baseline-calibrated data of step (3) into the deep convolutional neural network pre-trained in step (4), and output the generated two-dimensional virtual viewpoint images according to the number of virtual viewpoints to be reconstructed.
(6) Stitch the physical viewpoints and the generated virtual viewpoints into an image matrix according to their physical spatial arrangement, and label the block index of each viewpoint in the image matrix in turn.
(7) Synthesize the stitched frames of all moments obtained in step (6) into a free-viewpoint video at the shooting frame rate of the multi-camera system.
Compared with the prior art, the beneficial effects of the present invention are:
(1) The present invention performs pixel-level baseline calibration on the captured multi-viewpoint video sequences and uses a deep convolutional neural network to predict and generate virtual viewpoints. Compared with traditional geometric methods based on depth and disparity, no multi-camera calibration is required in advance, which overcomes the difficulty and low accuracy of multi-camera calibration in large scenes, while also reducing the amount of calculation and improving the efficiency of virtual viewpoint generation.
(2) During training of the deep convolutional neural network, the present invention preprocesses the binocular vision data set with baseline calibration, color calibration, and displacement-threshold filtering based on optical flow calculation, so that the virtual viewpoints synthesized under wide baselines and large displacements between adjacent viewpoints have better quality, and large-area ghosting artifacts are removed to a certain extent.
Figure 1 is a schematic flow diagram of the method of the present invention;
Figure 2 is a topology diagram of the hardware acquisition system of an embodiment of the present invention;
Figure 3 is a schematic diagram of the baseline calibration method of an embodiment of the present invention;
Figure 4 is a flow diagram of the deep convolutional neural network of an embodiment of the present invention;
Figure 5 is the interactive display software interface for free-viewpoint video of an embodiment of the present invention.
The present invention is described in detail below with reference to the drawings and a specific embodiment.
In this embodiment, a multi-camera array as shown in the topology diagram of Figure 2 is set up on a program stage to synchronously collect video sequences of the scene. An interactive free-viewpoint video is then synthesized through a series of data-processing steps and presented to users through a purpose-built free-viewpoint interactive display system, making two-way transmission of broadcast information possible.
The processing flow of this embodiment is shown in Figure 1 and includes the following steps:
(1) Set up the arc-shaped camera array, and perform camera pose calibration and multi-camera color calibration.
The topology of the hardware acquisition system is shown in Figure 2. The number of cameras is N; the cameras are arranged evenly along a circular arc at the same height, and the angle between the optical axes of adjacent cameras is kept at about 30°. A flat panel bearing a horizontal and vertical "cross" reference pattern is placed at the center of the scene arc and used to calibrate the camera poses: as shown in Figure 3, the centers of all cameras are aligned with the center O of the reference panel, and the central vertical direction of each camera image is made to coincide with the vertical reference line of the panel. After calibration, the camera positions are fixed. At the same time, the color of the N cameras is calibrated with the gray-world white-balance algorithm (Gray World).
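A minimal sketch of the gray-world white-balance step applied to each camera frame is given below; it assumes frames are provided as OpenCV BGR arrays, and the per-channel gain normalization is one common formulation of the Gray World algorithm rather than code taken from the patent itself.

```python
import numpy as np

def gray_world_balance(frame_bgr: np.ndarray) -> np.ndarray:
    """Scale each color channel so its mean equals the overall gray mean."""
    img = frame_bgr.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)       # mean of B, G, R
    gray_mean = channel_means.mean()                       # target gray level
    gains = gray_mean / np.maximum(channel_means, 1e-6)    # per-channel gains
    return np.clip(img * gains, 0, 255).astype(np.uint8)
```

Applying this to every frame of every camera gives the N cameras consistent white-balance behavior before the later calibration steps.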
(2) Multi-camera synchronization.
All cameras are synchronized through their video trigger lines by an external trigger signal generator; the trigger frequency is adjusted so that each camera synchronously captures the shooting scene.
(3) Synchronously capture video sequences and obtain affine transformation matrices through baseline calibration.
The camera array set up in step (1) is used to shoot synchronized video sequences of the target scene. The video frames at a given moment are selected and baseline calibration is performed on the N-1 groups of adjacent viewpoints in turn: based on the feature points of objects in the scene, the translation coefficients (x, y), rotation coefficient θ, and scaling coefficient k of the affine transformation are set manually so that the feature points at the center of the scene fall at the same reference positions. In the schematic of the baseline calibration system used in this embodiment (Figure 3), O is the central calibration point of the scene, and Cam_L and Cam_R are two cameras with identical parameters shooting the central object of the scene at the same time. In the resulting left and right images Img_L and Img_R, three feature points of the object coincide with L1, L2, L3 and R1, R2, R3 respectively, which guarantees that the baselines of the left and right cameras lie on the same horizontal line and thus achieves the purpose of baseline calibration. Through this baseline calibration method, N-1 image affine transformation matrices M_i (i = 1, 2, ..., n) are obtained. The affine transformation matrix takes the standard 2×3 form (given as a figure in the source), where α = k·cos(θ) and β = k·sin(θ).
(4) Batch baseline calibration.
Using the obtained affine transformation matrices M_i, baseline calibration is applied to all frames of the corresponding adjacent viewpoints in turn through the warpAffine() function in OpenCV. For the N cameras arranged along the arc, the affine matrices M_i (i = 1, 2, ..., n) obtained in step (3) are applied pairwise to the N-1 groups of cameras in the order of the cameras' spatial positions, so that after calibration the baselines of the images of all N cameras lie on the same horizontal line.
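The batch calibration can be sketched as below. The construction of M_i from (x, y, θ, k) follows the standard 2×3 affine convention with α = k·cos(θ) and β = k·sin(θ) mentioned above; the exact placement of the translation terms is an assumption, since the explicit matrix in the source is rendered as an image.

```python
import cv2
import numpy as np

def build_affine_matrix(x: float, y: float, theta_deg: float, k: float) -> np.ndarray:
    """Assumed 2x3 affine form combining scale k, rotation theta, translation (x, y)."""
    theta = np.deg2rad(theta_deg)
    alpha = k * np.cos(theta)
    beta = k * np.sin(theta)
    return np.float32([[alpha, beta, x],
                       [-beta, alpha, y]])

def calibrate_sequence(frames, M):
    """Apply one affine matrix M_i to every frame of a viewpoint's sequence."""
    h, w = frames[0].shape[:2]
    return [cv2.warpAffine(f, M, (w, h)) for f in frames]

# Hypothetical usage: the pairwise matrices M_list[i] obtained in step (3) are
# applied camera by camera, in spatial order, so that all N baselines end up
# on the same horizontal line.
```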
(5) Train the virtual viewpoint generation network.
This step first preprocesses the data set with baseline calibration, color calibration, and displacement-threshold filtering based on optical flow calculation. The data set consists of image triplets of three viewpoints ("left, center, right") from many scenes. Each image triplet is first baseline-calibrated in batches, using the same method as step (3), so that several groups of feature points in the three images lie on the same horizontal line. Color calibration uses the gray-world white-balance algorithm so that the three images of the same scene have consistent white-balance parameters. Finally, the pairwise optical-flow maps within each triplet are computed, the pixel displacements of the same object in the same scene are averaged, a threshold is set, and the triplets exceeding the threshold are selected to form the new training data set.
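One way to realize this displacement-threshold filtering is sketched below, using OpenCV's Farneback dense optical flow as an assumed stand-in for the optical-flow computation; the threshold value and the averaging of the flow magnitude are illustrative choices, not values from the patent.

```python
import cv2
import numpy as np

def mean_displacement(img_a, img_b) -> float:
    """Average pixel displacement between two views, from dense optical flow."""
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.linalg.norm(flow, axis=2).mean())

def filter_triplets(triplets, threshold=8.0):
    """Keep only (left, center, right) triplets whose average pairwise
    displacement exceeds the threshold, i.e. wide-baseline samples."""
    kept = []
    for left, center, right in triplets:
        disp = np.mean([mean_displacement(left, center),
                        mean_displacement(center, right),
                        mean_displacement(left, right)])
        if disp > threshold:
            kept.append((left, center, right))
    return kept
```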
The structure of the deep convolutional neural network used in this embodiment is shown in Figure 4 and consists of an encoding network and a decoding network (the two sub-network blocks in the dashed Encoder and Decoder boxes in Figure 4). The left and right viewpoint images Image1 and Image2 pass through the encoding network and the decoding network in turn. The encoding network applies the convolutional layers (Conv) and average pooling layers (Pooling) of the sizes shown in Figure 4; the decoding network applies the convolutional layers (Upconv) and linear upsampling layers (upsampling) of the sizes shown in Figure 4, yielding the depth feature-map parameters S1 and S2 of the scene. These are then concatenated with the input images Image1 and Image2 respectively and summed, and the two-dimensional image Output of the virtual viewpoint between the left and right physical viewpoints is predicted. During training of this network, the difference between Output and the middle image of the ground-truth data-set triplet is used to quantify the training result, using a loss function composed of the following two terms:
L_total = L_1 + α·L_2, where L_1 is the 2-norm error between the network-predicted image and the ground truth based on per-pixel RGB differences, L_2 is the difference in the feature structures extracted by the network, and S() denotes the feature extraction; this loss term is used to train the network model's perception of the deep feature structures in the scene. The total loss function L_total used for training is the linear weighted sum of L_1 and L_2. The optimal parameter model of the virtual viewpoint generation network is obtained through a certain number of training iterations.
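A much-reduced PyTorch sketch of the encoder-decoder idea and of the loss L_total = L_1 + α·L_2 is given below. The layer sizes, the number of stages, the fusion of feature maps with the inputs, and the small convolutional stand-in for the feature extractor S() are all assumptions made for illustration; Figure 4 defines the actual architecture.

```python
import torch
import torch.nn as nn

class TinyVVGN(nn.Module):
    """Reduced encoder-decoder: predicts the middle view from left/right views."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # Conv + average pooling stages
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2))
        self.decoder = nn.Sequential(            # Upconv + linear upsampling stages
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 6, 3, padding=1))      # feature maps S1, S2 (3 ch each)
        self.fuse = nn.Conv2d(12, 3, 3, padding=1)

    def forward(self, img1, img2):
        feats = self.decoder(self.encoder(torch.cat([img1, img2], dim=1)))
        s1, s2 = feats[:, :3], feats[:, 3:]
        # combine the feature maps with the inputs and fuse to the output view
        return self.fuse(torch.cat([img1 + s1, img2 + s2, img1, img2], dim=1))

feature_net = nn.Sequential(                     # stand-in for the extractor S()
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1))

def total_loss(pred, target, alpha=0.1):
    l1 = torch.norm(pred - target, p=2)                             # pixel RGB 2-norm
    l2 = torch.norm(feature_net(pred) - feature_net(target), p=2)   # feature term
    return l1 + alpha * l2                                          # L_total
```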
(6) Generate the virtual viewpoints.
The baseline-calibrated data from step (4) are fed into the pre-trained deep convolutional neural network (Virtual View Generation Network, VVGN), the number of virtual viewpoints to be reconstructed is given as input, and the generated two-dimensional virtual viewpoint images are output. Unlike traditional virtual viewpoint generation methods, the present invention predicts the virtual viewpoint between two physical viewpoints with a deep convolutional neural network: the input data are preprocessed with pixel-level baseline calibration, and the CNN directly learns the feature structures of the two input viewpoints to produce the result, without calibrating the cameras in advance. This step determines the quality of the generated virtual viewpoints. The binocular data set is preprocessed in step (5) with baseline calibration, color calibration, and displacement-threshold filtering based on optical flow calculation, and then enters the CNN shown in Figure 4 for training. The training inputs are the two left and right binocular two-dimensional images, and the training loss functions are:
L_total = L_1 + α·L_2, where L_1 is the 2-norm error between the network-predicted image and the ground truth based on per-pixel RGB differences, L_2 is the difference in the feature structures extracted by the network, and S() denotes the feature extraction; this loss term is used to train the network model's perception of the deep feature structures in the scene. The total loss function L_total used for training is the linear weighted sum of L_1 and L_2. Under wide binocular baselines, this approach achieves better virtual viewpoint quality than existing deep-learning-based video frame-interpolation networks, and its amount of calculation is much lower than that of traditional methods.
(7) Matrix stitching of all viewpoint image frames.
The physical viewpoints and the virtual viewpoints generated in step (6) are stitched into an image matrix according to their physical spatial arrangement (the number of rows and columns of the matrix depends on the number of generated virtual viewpoints), and the block index Block_Index of each viewpoint in the image matrix is labeled in row-major order.
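This stitching amounts to tiling the per-viewpoint frames into one large frame in row-major block order; a NumPy sketch follows, where the grid shape and the viewpoint ordering are assumptions for illustration.

```python
import numpy as np

def stitch_views(views, rows, cols):
    """Tile viewpoint images (ordered by spatial position) into one frame.
    The block index of view k is k itself, counted row by row."""
    h, w, c = views[0].shape
    grid = np.zeros((rows * h, cols * w, c), dtype=views[0].dtype)
    block_index = {}
    for k, view in enumerate(views):
        r, col = divmod(k, cols)                 # row-major block placement
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = view
        block_index[k] = (r, col)
    return grid, block_index
```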
(8) Free-viewpoint video synthesis.
The stitched frames of all moments obtained in the previous step are synthesized into a free-viewpoint video (FVV) at the shooting frame rate of the multi-camera system using FFmpeg or the cv2.VideoWriter() function in OpenCV, and are compressed and stored on the local server at a certain compression ratio.
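Writing the stitched frames out with OpenCV can be sketched as below; the codec, file name and frame rate are placeholders, and compression could equally be delegated to FFmpeg as the text notes.

```python
import cv2

def write_fvv(stitched_frames, path="fvv_output.mp4", fps=25.0):
    """Encode the stitched frames at the multi-camera shooting frame rate."""
    h, w = stitched_frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")     # placeholder codec choice
    writer = cv2.VideoWriter(path, fourcc, fps, (w, h))
    for frame in stitched_frames:
        writer.write(frame)
    writer.release()
```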
(9) User-interactive viewing of the free-viewpoint video.
The interface of the free-viewpoint video interactive playback software system (FVV PLAYER) is shown in Figure 5. It reads the free-viewpoint video (FVV) synthesized in step (8); the user can use the Slider or Dial interactive controls to smoothly switch in real time to the video block of the viewpoint corresponding to a given block index Block_Index, realizing a freely interactive viewing experience.
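Purely as an illustration of how the Block_Index lookup drives interactive switching (the actual FVV PLAYER in Figure 5 uses its own Slider/Dial interface), a minimal OpenCV trackbar sketch is given below; the grid dimensions, window handling and crop geometry are assumptions.

```python
import cv2

def play_fvv(path="fvv_output.mp4", rows=3, cols=5):
    """Crop and show the viewpoint block selected by a slider, frame by frame."""
    cap = cv2.VideoCapture(path)
    cv2.namedWindow("FVV")
    cv2.createTrackbar("viewpoint", "FVV", 0, rows * cols - 1, lambda v: None)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[0] // rows, frame.shape[1] // cols
        k = cv2.getTrackbarPos("viewpoint", "FVV")   # Block_Index chosen by user
        r, c = divmod(k, cols)
        cv2.imshow("FVV", frame[r * h:(r + 1) * h, c * w:(c + 1) * w])
        if cv2.waitKey(40) & 0xFF == 27:             # Esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()
```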
Claims (5)
- A free-viewpoint video generation and interaction method based on a deep convolutional neural network, characterized by comprising the following steps: (1) calibrating the pose and color of the cameras in an acquisition system: the acquisition system includes N cameras arranged evenly along a circular arc at the same height; at the center of the arc, the pose of each camera is calibrated against a reference object, and after calibration the camera positions are fixed; the color of the N cameras is calibrated with the gray-world white-balance algorithm; (2) using the camera array of the acquisition system to shoot synchronized video sequences of the target scene, selecting the video frames at a given moment, performing baseline calibration on the N-1 groups of adjacent viewpoints in turn, and obtaining N-1 image affine transformation matrices M_i, i = 1, 2, ..., n; (3) using the obtained affine transformation matrices M_i to perform baseline calibration on all frames of the corresponding adjacent viewpoints in turn; (4) first preprocessing the binocular data set with baseline calibration, gray-world color calibration and displacement-threshold filtering based on optical flow calculation, and then training the virtual viewpoint generation capability of the deep convolutional neural network; (5) feeding the baseline-calibrated data of step (3) into the deep convolutional neural network pre-trained in step (4), and outputting the generated two-dimensional virtual viewpoint images according to the number of virtual viewpoints to be reconstructed; (6) stitching the physical viewpoints and the generated virtual viewpoints into an image matrix according to their physical spatial arrangement, and labeling the block index of each viewpoint in the image matrix in turn; (7) synthesizing the stitched frames of all moments obtained in step (6) into a free-viewpoint video at the shooting frame rate of the multi-camera system.
- The free-viewpoint video generation and interaction method based on a deep convolutional neural network according to claim 1, characterized in that in step (1), a reference panel bearing a horizontal and vertical "cross" is placed at the center of the arc, the centers of all cameras are aligned with the center of the reference panel, the central vertical direction of each camera image is made to coincide with the vertical reference line of the panel, and after calibration the camera positions are fixed.
- The free-viewpoint video generation and interaction method based on a deep convolutional neural network according to claim 1, characterized in that in step (1), the angle between the optical axes of adjacent cameras is controlled to 30°.
- The free-viewpoint video generation and interaction method based on a deep convolutional neural network according to claim 1, characterized in that after the free-viewpoint video is synthesized in step (7), it is compressed and stored on a local server at a certain compression ratio.
- The free-viewpoint video generation and interaction method based on a deep convolutional neural network according to claim 4, characterized in that the user reads the free-viewpoint video synthesized in step (7) with software and can smoothly switch in real time among the videos of different viewpoints according to the viewpoint block indices of step (6), realizing human-computer video interaction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/755,025 US20220394226A1 (en) | 2019-11-13 | 2020-10-28 | Free viewpoint video generation and interaction method based on deep convolutional neural network |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911106557.2A CN110798673B (zh) | 2019-11-13 | 2019-11-13 | 基于深度卷积神经网络的自由视点视频生成及交互方法 |
CN201911106557.2 | 2019-11-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021093584A1 true WO2021093584A1 (zh) | 2021-05-20 |
Family
ID=69444273
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/124206 WO2021093584A1 (zh) | 2019-11-13 | 2020-10-28 | 基于深度卷积神经网络的自由视点视频生成及交互方法 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20220394226A1 (zh) |
CN (1) | CN110798673B (zh) |
WO (1) | WO2021093584A1 (zh) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114666564A (zh) * | 2022-03-23 | 2022-06-24 | 南京邮电大学 | 一种基于隐式神经场景表示进行虚拟视点图像合成的方法 |
CN114972923A (zh) * | 2022-06-06 | 2022-08-30 | 中国人民解放军国防科技大学 | 基于自监督学习的虚拟数字人肢体交互动作生成方法 |
CN116723305A (zh) * | 2023-04-24 | 2023-09-08 | 南通大学 | 一种基于生成式对抗网络的虚拟视点质量增强方法 |
CN116996654A (zh) * | 2023-07-24 | 2023-11-03 | 京东方科技集团股份有限公司 | 新视点图像生成方法、新视点生成模型的训练方法与装置 |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110798673B (zh) * | 2019-11-13 | 2021-03-19 | 南京大学 | 基于深度卷积神经网络的自由视点视频生成及交互方法 |
CN113784148A (zh) * | 2020-06-10 | 2021-12-10 | 阿里巴巴集团控股有限公司 | 数据处理方法、系统、相关设备和存储介质 |
CN113473244A (zh) * | 2020-06-23 | 2021-10-01 | 青岛海信电子产业控股股份有限公司 | 一种自由视点视频播放控制方法及设备 |
CN114511596A (zh) * | 2020-10-23 | 2022-05-17 | 华为技术有限公司 | 一种数据处理方法及相关设备 |
KR20230035721A (ko) | 2021-09-06 | 2023-03-14 | 한국전자통신연구원 | 임의 시점의 다중평면영상을 생성하는 전자 장치 및 그것의 동작 방법 |
CN114900742A (zh) * | 2022-04-28 | 2022-08-12 | 中德(珠海)人工智能研究院有限公司 | 基于视频推流的场景旋转过渡方法以及系统 |
CN115512038B (zh) * | 2022-07-22 | 2023-07-18 | 北京微视威信息科技有限公司 | 自由视点合成的实时绘制方法、电子设备及可读存储介质 |
CN116320358B (zh) * | 2023-05-19 | 2023-12-01 | 成都工业学院 | 一种视差图像预测装置及方法 |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105659592A (zh) * | 2014-09-22 | 2016-06-08 | 三星电子株式会社 | 用于三维视频的相机系统 |
CN107396133A (zh) * | 2017-07-20 | 2017-11-24 | 深圳市佳创视讯技术股份有限公司 | 自由视点视频导播方法及系统 |
CN107545586A (zh) * | 2017-08-04 | 2018-01-05 | 中国科学院自动化研究所 | 基于光场极限平面图像局部的深度获取方法及系统 |
WO2018147329A1 (ja) * | 2017-02-10 | 2018-08-16 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ | 自由視点映像生成方法及び自由視点映像生成システム |
US20190174122A1 (en) * | 2017-12-04 | 2019-06-06 | Canon Kabushiki Kaisha | Method, system and apparatus for capture of image data for free viewpoint video |
CN107493465B (zh) * | 2017-09-18 | 2019-06-07 | 郑州轻工业学院 | 一种虚拟多视点视频生成方法 |
CN110113593A (zh) * | 2019-06-11 | 2019-08-09 | 南开大学 | 基于卷积神经网络的宽基线多视点视频合成方法 |
CN110443874A (zh) * | 2019-07-17 | 2019-11-12 | 清华大学 | 基于卷积神经网络的视点数据生成方法和装置 |
CN110798673A (zh) * | 2019-11-13 | 2020-02-14 | 南京大学 | 基于深度卷积神经网络的自由视点视频生成及交互方法 |
Family Cites Families (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101277454A (zh) * | 2008-04-28 | 2008-10-01 | 清华大学 | 一种基于双目摄像机的实时立体视频生成方法 |
US8106924B2 (en) * | 2008-07-31 | 2012-01-31 | Stmicroelectronics S.R.L. | Method and system for video rendering, computer program product therefor |
JP6672075B2 (ja) * | 2016-05-25 | 2020-03-25 | キヤノン株式会社 | 制御装置、制御方法、及び、プログラム |
JP6808357B2 (ja) * | 2016-05-25 | 2021-01-06 | キヤノン株式会社 | 情報処理装置、制御方法、及び、プログラム |
JP6429829B2 (ja) * | 2016-05-25 | 2018-11-28 | キヤノン株式会社 | 画像処理システム、画像処理装置、制御方法、及び、プログラム |
JP7054677B2 (ja) * | 2016-08-10 | 2022-04-14 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ | カメラワーク生成方法及び映像処理装置 |
JP6882868B2 (ja) * | 2016-08-12 | 2021-06-02 | キヤノン株式会社 | 画像処理装置、画像処理方法、システム |
US10846836B2 (en) * | 2016-11-14 | 2020-11-24 | Ricoh Company, Ltd. | View synthesis using deep convolutional neural networks |
US10762653B2 (en) * | 2016-12-27 | 2020-09-01 | Canon Kabushiki Kaisha | Generation apparatus of virtual viewpoint image, generation method, and storage medium |
US11665308B2 (en) * | 2017-01-31 | 2023-05-30 | Tetavi, Ltd. | System and method for rendering free viewpoint video for sport applications |
JP6948175B2 (ja) * | 2017-07-06 | 2021-10-13 | キヤノン株式会社 | 画像処理装置およびその制御方法 |
WO2019012817A1 (ja) * | 2017-07-14 | 2019-01-17 | ソニー株式会社 | 画像処理装置、画像処理装置の画像処理方法、プログラム |
JP6921686B2 (ja) * | 2017-08-30 | 2021-08-18 | キヤノン株式会社 | 生成装置、生成方法、及びプログラム |
US11024046B2 (en) * | 2018-02-07 | 2021-06-01 | Fotonation Limited | Systems and methods for depth estimation using generative models |
JP7271099B2 (ja) * | 2018-07-19 | 2023-05-11 | キヤノン株式会社 | ファイルの生成装置およびファイルに基づく映像の生成装置 |
US11064180B2 (en) * | 2018-10-15 | 2021-07-13 | City University Of Hong Kong | Convolutional neural network based synthesized view quality enhancement for video coding |
US11961205B2 (en) * | 2018-11-09 | 2024-04-16 | Samsung Electronics Co., Ltd. | Image resynthesis using forward warping, gap discriminators, and coordinate-based inpainting |
US11967092B2 (en) * | 2018-11-28 | 2024-04-23 | Sony Group Corporation | Detection-guided tracking of human dynamics |
JP2020129276A (ja) * | 2019-02-08 | 2020-08-27 | キヤノン株式会社 | 画像処理装置、画像処理方法、およびプログラム |
JP2020191598A (ja) * | 2019-05-23 | 2020-11-26 | キヤノン株式会社 | 画像処理システム |
US11030772B2 (en) * | 2019-06-03 | 2021-06-08 | Microsoft Technology Licensing, Llc | Pose synthesis |
JP7358078B2 (ja) * | 2019-06-07 | 2023-10-10 | キヤノン株式会社 | 情報処理装置、情報処理装置の制御方法、及び、プログラム |
CN110223382B (zh) * | 2019-06-13 | 2021-02-12 | 电子科技大学 | 基于深度学习的单帧图像自由视点三维模型重建方法 |
JP7423974B2 (ja) * | 2019-10-21 | 2024-01-30 | 株式会社Jvcケンウッド | 情報処理システム、情報処理方法及びプログラム |
- 2019
- 2019-11-13 CN CN201911106557.2A patent/CN110798673B/zh active Active
- 2020
- 2020-10-28 WO PCT/CN2020/124206 patent/WO2021093584A1/zh active Application Filing
- 2020-10-28 US US17/755,025 patent/US20220394226A1/en active Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105659592A (zh) * | 2014-09-22 | 2016-06-08 | 三星电子株式会社 | 用于三维视频的相机系统 |
WO2018147329A1 (ja) * | 2017-02-10 | 2018-08-16 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ | 自由視点映像生成方法及び自由視点映像生成システム |
CN107396133A (zh) * | 2017-07-20 | 2017-11-24 | 深圳市佳创视讯技术股份有限公司 | 自由视点视频导播方法及系统 |
CN107545586A (zh) * | 2017-08-04 | 2018-01-05 | 中国科学院自动化研究所 | 基于光场极限平面图像局部的深度获取方法及系统 |
CN107493465B (zh) * | 2017-09-18 | 2019-06-07 | 郑州轻工业学院 | 一种虚拟多视点视频生成方法 |
US20190174122A1 (en) * | 2017-12-04 | 2019-06-06 | Canon Kabushiki Kaisha | Method, system and apparatus for capture of image data for free viewpoint video |
CN110113593A (zh) * | 2019-06-11 | 2019-08-09 | 南开大学 | 基于卷积神经网络的宽基线多视点视频合成方法 |
CN110443874A (zh) * | 2019-07-17 | 2019-11-12 | 清华大学 | 基于卷积神经网络的视点数据生成方法和装置 |
CN110798673A (zh) * | 2019-11-13 | 2020-02-14 | 南京大学 | 基于深度卷积神经网络的自由视点视频生成及交互方法 |
Non-Patent Citations (3)
Title |
---|
DENG, BAOSONG ET AL.: "Wide Baseline Matching based on Affine Iterative Method", JOURNAL OF SIGNAL PROCESSING, vol. 23, no. 6, 31 December 2007 (2007-12-31), pages 823 - 828, XP055812189 * |
EMILIE BOSC ; ROMUALD PEPION ; PATRICK LE CALLET ; MARTIN KOPPEL ; PATRICK NDJIKI-NYA ; MURIEL PRESSIGOUT ; LUCE MORIN: "Towards a New Quality Metric for 3-D Synthesized View Assessment", IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, IEEE, US, vol. 5, no. 7, 1 November 2011 (2011-11-01), US, pages 1332 - 1343, XP011363102, ISSN: 1932-4553, DOI: 10.1109/JSTSP.2011.2166245 * |
WANG YANRU, HUANG ZHIHAO, ZHU HAO, LI WEI, CAO XUN, YANG RUIGANG: "Interactive free-viewpoint video generation", VIRTUAL REALITY & INTELLIGENT HARDWARE, vol. 2, no. 3, 1 June 2020 (2020-06-01), pages 247 - 260, XP055812177, ISSN: 2096-5796, DOI: 10.1016/j.vrih.2020.04.004 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114666564A (zh) * | 2022-03-23 | 2022-06-24 | 南京邮电大学 | 一种基于隐式神经场景表示进行虚拟视点图像合成的方法 |
CN114666564B (zh) * | 2022-03-23 | 2024-03-01 | 南京邮电大学 | 一种基于隐式神经场景表示进行虚拟视点图像合成的方法 |
CN114972923A (zh) * | 2022-06-06 | 2022-08-30 | 中国人民解放军国防科技大学 | 基于自监督学习的虚拟数字人肢体交互动作生成方法 |
CN116723305A (zh) * | 2023-04-24 | 2023-09-08 | 南通大学 | 一种基于生成式对抗网络的虚拟视点质量增强方法 |
CN116723305B (zh) * | 2023-04-24 | 2024-05-03 | 南通大学 | 一种基于生成式对抗网络的虚拟视点质量增强方法 |
CN116996654A (zh) * | 2023-07-24 | 2023-11-03 | 京东方科技集团股份有限公司 | 新视点图像生成方法、新视点生成模型的训练方法与装置 |
Also Published As
Publication number | Publication date |
---|---|
US20220394226A1 (en) | 2022-12-08 |
CN110798673A (zh) | 2020-02-14 |
CN110798673B (zh) | 2021-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021093584A1 (zh) | 基于深度卷积神经网络的自由视点视频生成及交互方法 | |
WO2021083176A1 (zh) | 数据交互方法及系统、交互终端、可读存储介质 | |
CN104301677B (zh) | 面向大场景的全景视频监控的方法及装置 | |
WO2021083178A1 (zh) | 数据处理方法及系统、服务器和存储介质 | |
CN112085659B (zh) | 一种基于球幕相机的全景拼接融合方法、系统及存储介质 | |
CN103337094A (zh) | 一种应用双目摄像机实现运动三维重建的方法 | |
CN102857739A (zh) | 分布式全景监控系统及其方法 | |
CN101276465A (zh) | 广角图像自动拼接方法 | |
CN107240147B (zh) | 图像渲染方法及系统 | |
WO2021083174A1 (zh) | 虚拟视点图像生成方法、系统、电子设备及存储介质 | |
CN108848354B (zh) | 一种vr内容摄像系统及其工作方法 | |
CN107197135B (zh) | 一种视频生成方法及视频生成装置 | |
KR101933037B1 (ko) | 360도 동영상에서의 가상현실 재생 장치 | |
Cao et al. | Ntire 2023 challenge on 360deg omnidirectional image and video super-resolution: Datasets, methods and results | |
CN111369443B (zh) | 光场跨尺度的零次学习超分辨率方法 | |
CN109618093A (zh) | 一种全景视频直播方法及系统 | |
Tang et al. | A universal optical flow based real-time low-latency omnidirectional stereo video system | |
CN202721763U (zh) | 一种全景视频采集装置 | |
Wang et al. | Interactive free-viewpoint video generation | |
CN111064945A (zh) | 一种裸眼3d图像采集及生成方法 | |
CN110798676A (zh) | 一种利用内镜镜头动态图像形成3d视觉的方法及装置 | |
WO2021083175A1 (zh) | 数据处理方法、设备、系统、可读存储介质及服务器 | |
Chai et al. | Super-Resolution Reconstruction for Stereoscopic Omnidirectional Display Systems via Dynamic Convolutions and Cross-View Transformer | |
CN112003999A (zh) | 基于Unity 3D的三维虚拟现实合成算法 | |
CN111818298A (zh) | 一种基于光场的高清视频监控系统及方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20888196 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20888196 Country of ref document: EP Kind code of ref document: A1 |