KR102555165B1

KR102555165B1 - Method and System for Light Field Synthesis from a Monocular Video using Neural Radiance Field

Info

Publication number: KR102555165B1
Application number: KR1020220126531A
Authority: KR
Inventors: 박인규; 백형선
Original assignee: 인하대학교 산학협력단
Priority date: 2022-10-04
Filing date: 2022-10-04
Publication date: 2023-07-12
Also published as: KR102555165B9

Abstract

A method and system for neural-radiance-based light field synthesis from monocular video are presented. The proposed method includes: preprocessing, by a preprocessing unit, training data using the RGBD images of the input video in order to enrich the data for visible regions; performing, by a modeling unit, NeRF (Neural Radiance Field)-based scene modeling using color images and depth images; augmenting, by a learning unit, the data using the input color images and depth images; and training, by the learning unit, so as to reduce errors of the NeRF-based scene modeling in regions where the depth values of the depth image change continuously and no texture is present.

Description

Method and System for Light Field Synthesis from a Monocular Video using Neural Radiance Field

The present invention relates to a method and system for neural-radiance-based light field synthesis from monocular video.

Synthesizing virtual-viewpoint images from one or more input images has attracted considerable interest and research because it has many applications in immersive media environments, alongside content such as VR/AR, 3D displays, and holograms.

Virtual-view generation typically represents a scene either by exploiting the relationships between pixels across images, given the input images and camera parameters, or by using additional information such as depth maps and point clouds to estimate the missing geometry. Existing systems for producing immersive media therefore collect and process data with multiple cameras arranged for a specific purpose. This approach is constrained by several problems: the cost grows with the number of cameras, the multi-view images must be synchronized, and estimating the relationships between the views increases the amount of computation.

To address these problems, hardware that captures multi-view images simultaneously has been proposed; representative examples are binocular (stereo) cameras and light field cameras. In addition, advances in graphics processing units (GPUs) and the rapid growth of deep learning architectures trained on large-scale data have led to research on generating images from many viewpoints for the same purposes while minimizing the number of cameras.

Many deep-learning-based studies have made virtual-view generation possible not only from multiple views but even from a single monocular image. Specifically, some studies model a scene by combining a learned network with an existing scene representation, while others propose new scene representations that outperform earlier ones in speed or image quality. These results suggest that the new scene representations can generalize better and be more robust to errors than traditional computer vision techniques that estimate and optimize the relationships between input images.

[1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," in Proc. of European Conference on Computer Vision, pp. 405-421, Springer, 2020.
[2] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, "Depth-supervised NeRF: Fewer views and faster training for free," in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12882-12891, 2022.

The technical problem addressed by the present invention is to provide a method and system for neural-radiance-based light field synthesis from monocular video, in which the frames of a forward-moving video are used like multi-view images so that virtual-view synthesis techniques can be applied to synthesize a light field. Because a light field consists of structured multi-view images, it can be used to build immersive media environments, which both underlines the need for virtual-view synthesis research and makes it effective for evaluating synthesized results. The present invention proposes generating a light field through two scene representations.

In one aspect, the proposed method for neural-radiance-based light field synthesis from monocular video includes: preprocessing, by a preprocessing unit, training data using the RGBD images of the input video in order to enrich the data for visible regions; performing, by a modeling unit, NeRF (Neural Radiance Field)-based scene modeling using color images and depth images; augmenting, by a learning unit, the data using the input color images and depth images; and training, by the learning unit, so as to reduce errors of the NeRF-based scene modeling in regions where the depth values of the depth image change continuously and no texture is present.

The preprocessing step obtains virtual-viewpoint images of the visible regions by DIBR (Depth Image Based Rendering) and generates a mask that determines whether each ray belongs to a visible region. The mask is fed to the NeRF that is trained afterwards, and masked rays are excluded from training.

The NeRF-based scene modeling step increases the sampling rate for visible regions and, for occluded regions, estimates the data from the inputs along the direction of video motion. A light field dataset is created by acquiring, on a large scale from rendering-based input images, data with a wide baseline between the SAIs (sub-aperture images) of the light field.

The data augmentation step uses depth-based rendering to generate viewpoints around each input viewpoint, adjusts the range of DIBR warping so as to obtain a wide baseline, and augments the data by selecting a subset of SAIs rather than generating all of them.

In another aspect, the proposed system for neural-radiance-based light field synthesis from monocular video includes: a preprocessing unit that preprocesses training data using the RGBD images of the input video in order to enrich the data for visible regions; a modeling unit that performs NeRF (Neural Radiance Field)-based scene modeling using color images and depth images; and a learning unit that augments the data using the input color images and depth images and trains so as to reduce errors of the NeRF-based scene modeling in regions where the depth values of the depth image change continuously and no texture is present.

With the method and system for neural-radiance-based light field synthesis from monocular video according to embodiments of the present invention, a light field can be synthesized by using the frames of a forward-moving video like multi-view images and applying virtual-view synthesis techniques. Because a light field consists of structured multi-view images, it can be used to build immersive media environments, which both underlines the need for virtual-view synthesis research and makes it effective for evaluating synthesized results. The present invention proposes generating a light field through two scene representations.

FIG. 1 is a diagram showing the overall framework for neural-radiance-based light field synthesis from monocular video according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method for neural-radiance-based light field synthesis from monocular video according to an embodiment of the present invention.
FIG. 3 is a diagram showing a color image and a depth image received as input in a real-environment experiment according to an embodiment of the present invention.
FIG. 4 is a diagram showing the configuration of a system for neural-radiance-based light field synthesis from monocular video according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example of the SAIs of a synthesized result according to an embodiment of the present invention.
FIG. 6 is a diagram showing a light field generated using a network according to an embodiment of the present invention.
FIG. 7 is a diagram showing qualitative wide-baseline results on KITTI data according to an embodiment of the present invention.

The present invention proposes a method for generating easily obtainable light field data by synthesizing light fields with a wide baseline, and a network that generates a light field from monocular video using the large-scale light field video acquired in this way. The proposed network is a framework that combines the characteristics of two scene representations by coupling MPIs (multiplane images), which are widely used in existing virtual-view synthesis, with depth estimation.

According to an embodiment of the present invention, light field video composed of 8×8 SAIs is generated using GTAV, a video game with photorealistic rendering.

According to an embodiment of the present invention, a light field dataset is created by acquiring, on a large scale from a photorealistically rendered video game, data with a wide baseline between the SAIs (sub-aperture images) of the light field.

In addition, multiplane-based rendering is combined with DIBR (depth-image-based rendering) to synthesize high-quality light fields while extending the renderable range, and NeRF (Neural Radiance Field)-based scene modeling is used to synthesize light fields with a wide baseline, unlike existing light field synthesis techniques. Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

NeRF represents a scene itself as a single network, based on information about the scene captured from various viewpoints. In NeRF modeling, the scene is modeled with a fully connected network that takes a single 5D coordinate as input and outputs the volume density at that point in space and the view-dependent emitted radiance. To synthesize a virtual view with the trained network, 5D coordinates are queried along camera rays, and classical volume rendering composites the colors and densities of the sampled points into an image. NeRF [1], which first proposed this scene representation, uses a multilayer perceptron to represent a static scene as a continuous 5D function whose outputs are the radiance emitted in direction (θ, φ) at each point (x, y, z) in space and a density at each point that determines how much light passes through it.
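In the notation of reference [1], the 5D function described above can be written as a single mapping. The symbols F_Θ (the network with weights Θ), c (emitted color), and σ (volume density) follow [1] and are not used elsewhere in this text:

```latex
F_\Theta : (x, y, z, \theta, \phi) \mapsto (c, \sigma),
\qquad c = (r, g, b), \quad \sigma \ge 0
```

The color c depends on both position and viewing direction, whereas in [1] the density σ depends on position only.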

$$ r(t) = o + t\,d \qquad (1) $$

A pixel of the input image corresponds to a ray that starts at the camera's center of projection o and travels in direction d. To render an image from a given viewpoint, 3D points are sampled along the camera rays through the scene; these points and the corresponding 2D viewing directions are fed to the neural network to obtain the color and density at each point; and classical volume rendering accumulates those colors and densities into a 2D image.
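The sampling and accumulation just described can be sketched for a single ray as follows. This is a minimal illustration rather than the patented implementation: the function names `sample_ray` and `composite`, the evenly spaced midpoint sampling, and the fixed step `delta` are assumptions made here, and the densities and colors that a trained network would supply are passed in as plain lists.

```python
import math

def sample_ray(o, d, t_near, t_far, n_samples):
    """Return (t, point) pairs at evenly spaced depths along r(t) = o + t*d."""
    step = (t_far - t_near) / n_samples
    samples = []
    for i in range(n_samples):
        t = t_near + (i + 0.5) * step  # midpoint of each depth bin (assumed scheme)
        point = tuple(oc + t * dc for oc, dc in zip(o, d))
        samples.append((t, point))
    return samples

def composite(ts, sigmas, colors, delta):
    """Accumulate color and expected depth along one ray.

    ts:     sample depths along the ray
    sigmas: volume density at each sample
    colors: (r, g, b) at each sample
    delta:  distance between adjacent samples (assumed constant)
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    for t, sigma, c in zip(ts, sigmas, colors):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha               # contribution weight of the sample
        rgb = [acc + w * ch for acc, ch in zip(rgb, c)]
        depth += w * t
        transmittance *= 1.0 - alpha
    return rgb, depth, transmittance
```

The same weights `w` yield both the rendered color and an expected depth per ray, which is what makes a rendered depth available for comparison with a depth image.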

Because every step of this process is differentiable, the model can be optimized by gradient descent, reducing the difference between each observed image and the corresponding rendered result. NeRF notes that training with this ray-based representation alone reconstructs images with high-frequency information lost and fine details removed, and therefore also proposes the following technique. Because the spatial position (x, y, z) and direction (θ, φ) together have only five dimensions, they do not give the MLP (multilayer perceptron) enough information; to compensate, positional encoding maps the input to a higher-dimensional space before it is fed to the network. This positional encoding steers training toward better handling of high-frequency content and can be written with periodic functions as follows.

$$ \gamma(x) = \left[\sin(x), \cos(x), \ldots, \sin(2^{L-1}x), \cos(2^{L-1}x)\right] \qquad (2) $$

The expression above applies sine and cosine functions to each dimension of a point x in 3D position space, at frequencies that are powers of 2 from 1 to 2^(L-1). L is a hyperparameter; NeRF embeds the position and the viewing angles through these periodic functions using a specific value of L for each. However, general road scenes are not easy to model this way. NeRF was restricted so that the network could easily learn the scene geometry from multiple views: its inputs are a large number of images all captured at similar distances. The form of the input images is therefore limited, and scenes containing a single object yield high performance and excellent view synthesis. Because such images are hard to acquire in ordinary situations, research is under way on reducing the number of input images and on training with general scenes.
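The positional encoding of Equation (2) can be sketched as below. The function names are hypothetical, and the value of L passed in the usage note is only an illustrative choice; the text above does not state which values are used.

```python
import math

def positional_encoding(x, num_freqs):
    """Encode a scalar x as [sin(2^0 x), cos(2^0 x), ..., sin(2^(L-1) x), cos(2^(L-1) x)].

    num_freqs plays the role of the hyperparameter L in Equation (2).
    """
    encoded = []
    for k in range(num_freqs):
        freq = 2.0 ** k
        encoded.append(math.sin(freq * x))
        encoded.append(math.cos(freq * x))
    return encoded

def encode_point(point, num_freqs):
    """Apply the encoding to every coordinate of a point and concatenate the results."""
    out = []
    for coord in point:
        out.extend(positional_encoding(coord, num_freqs))
    return out
```

For example, `encode_point((0.1, 0.2, 0.3), 10)` turns a 3D position into a vector of 3 × 2 × 10 = 60 values, which is the higher-dimensional input the MLP receives in place of the raw coordinates.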

The present invention departs from the data conventionally used and proposes a NeRF-based technique for modeling general road scenes in which occluded regions appear frequently. For such complex scenes, the existing NeRF does not converge well enough to produce high-resolution results, and it fails to estimate geometry accurately because the number of samples per camera ray is insufficient. To solve this, depth information is provided in addition to color information, which reduces the number of input images the model requires. This is similar to the method proposed in [2]. However, [2] identifies the relationships between images during pose estimation and uses only the matched 3D points for training, whereas the present invention trains the network with the entire depth image. This lets the MLP learn information that changes with camera motion and gives strong guidance for estimating 3D geometric relationships, which helps training converge. Next, RGBD images are used to separately model the parts that a single MLP alone has difficulty learning, and the rays are augmented to match the number of samples required during sampling.
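One way to picture supervision with the entire depth image is a loss that adds a per-ray depth term to the usual color term. The text does not give the loss, so everything in this sketch is an assumption: the squared-error form, the weight `depth_weight`, the function name `rgbd_loss`, and the convention that a ground-truth depth of 0 marks a pixel without a measurement.

```python
def rgbd_loss(pred_rgb, gt_rgb, pred_depth, gt_depth, depth_weight=0.1):
    """Color MSE plus weighted depth MSE over a batch of rays (illustrative only).

    pred_rgb, gt_rgb:     lists of (r, g, b) tuples, one per ray
    pred_depth, gt_depth: lists of per-ray depths; gt_depth <= 0 means "no measurement"
    depth_weight:         hypothetical balancing weight between the two terms
    """
    n = len(pred_rgb)
    color_term = sum(
        (p - g) ** 2
        for pr, gr in zip(pred_rgb, gt_rgb)
        for p, g in zip(pr, gr)
    ) / (3 * n)

    # Only rays with a valid ground-truth depth contribute to the depth term.
    valid = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    depth_term = sum((p - g) ** 2 for p, g in valid) / len(valid) if valid else 0.0

    return color_term + depth_weight * depth_term
```

With a rendered game depth map every ray is valid, so the depth term covers the whole image; with LiDAR-derived depth only the pixels that received a measurement would contribute under this convention.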

According to an embodiment of the present invention, a photorealistically rendered video game was used to acquire data for training the proposed light field synthesis network. The present invention differs from the prior art in that depth images are additionally acquired.

In the prior art, a large amount of data had to be acquired to train a CNN, whereas the present invention needs only enough data to cover several scenes, so the data generation is modified accordingly.

According to an embodiment of the present invention, training data were acquired from the game because it is a virtual environment modeled on the real world; the validity of this choice can be confirmed by the behavior on real-environment data. For the virtual data, the dataset was acquired as follows.

According to an embodiment of the present invention, the Script Hook V library was used to handle data in the virtual environment. With this library, a multi-camera array was set up in the virtual environment, and each camera was rendered to extract color and depth images, yielding one light field image and the corresponding depth images. For real-environment data, KITTI data was used, and the data acquired by the LiDAR sensor were processed into depth images for training.
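The text does not say how the LiDAR data are turned into depth images. A common approach, shown here purely as an assumed sketch, is to project each 3D point into the image with a pinhole camera model and keep the nearest depth per pixel. The function name and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical, and the points are assumed to be already expressed in the camera coordinate frame (with real KITTI data this requires the calibration transforms first).

```python
def lidar_to_depth(points_cam, fx, fy, cx, cy, width, height):
    """Project 3D points (camera frame, z pointing forward) into a sparse depth map.

    Pixels that receive no point keep the value 0.0, meaning "no measurement".
    """
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point is behind the camera
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height:
            # Keep the closest point when several fall on the same pixel.
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth
```

The result is sparse, since a LiDAR sweep covers far fewer pixels than the image contains; any densification step would be a further assumption beyond what the text states.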

The present invention aims to move beyond the data that existing NeRF-based methods mainly handle, namely a single object viewed from the front or over 360 degrees, and to generate light fields with a wider baseline than existing techniques through novel view synthesis that works on more general data.

FIG. 1 is a diagram showing the overall framework for neural-radiance-based light field synthesis from monocular video according to an embodiment of the present invention.

FIG. 1 shows the overall structure of the algorithm proposed in the present invention.

The proposed neural-radiance-based light field synthesis method represents a scene itself as a single network, based on information (110) about the scene input from various viewpoints.

First, the method broadly follows NeRF in treating the color images as supervision for modeling (121). However, NeRF was designed assuming dense input from many angles; since the input here is video, depth information is used as additional supervision (122) in order to produce high-quality results from limited input. With RGB-only input from multiple views, scene geometry can be inferred from the multi-view structure; with video input this is harder, and the depth supervision compensates for it. Second, a problem was identified in generating the neural radiance field for road regions, where depth varies continuously and there is almost no texture, and a measure is proposed to handle these road regions, which adversely affect training (130). In this way, the proposed technique aims to extend to data from more general environments.

FIG. 2 is a flowchart illustrating a method for neural-radiance-based light field synthesis from monocular video according to an embodiment of the present invention.

The proposed method for neural-radiance-based light field synthesis from monocular video includes: preprocessing training data using the RGBD images of the input video in order to enrich the data for visible regions (210); performing NeRF (Neural Radiance Field)-based scene modeling using color images and depth images (220); augmenting the data using the input color images and depth images (230); and training so as to reduce errors of the NeRF-based scene modeling in regions where the depth values of the depth image change continuously and no texture is present (240).

In step 210, the training data are preprocessed using the RGBD images of the input video in order to enrich the data for visible regions.

The road-driving data targeted by the present invention exposes several limitations of the existing NeRF technique. Unlike conventional NeRF inputs, the camera moves and the scene is not fixed, which makes training difficult overall. In particular, in road regions where texture is hard to distinguish and there are no salient features, the network has difficulty estimating depth or 3D structure. The problem is made harder by the fact that the road extends across the image while its pixels span a continuous range of depths.

The prior-art NeRF was designed around data captured at the same distance from the scene, so geometry can be estimated from many inputs, but this appears difficult to achieve from a small number of frames along a straight line. RGBD input images are therefore used to enrich the data for road regions, which are mostly visible. The biggest problem with synthesizing views by conventional DIBR is that information about occluded regions is hard to obtain. Roads, however, are rarely occluded and mostly in view, so DIBR is used to generate virtual-viewpoint data. As a result, DIBR yields virtual-viewpoint images of the visible regions together with a mask that determines whether each ray belongs to a visible region. The mask is fed to the neural radiance field trained afterwards, and masked rays are excluded from training.
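The DIBR step and its visibility mask can be illustrated with a simple forward warp. This is a sketch under stated assumptions, not the method of the embodiment: the virtual camera is assumed to be shifted purely horizontally by `baseline`, a pinhole model with focal length `focal` (in pixels) is assumed, target pixels are found by rounding, and the function names are hypothetical.

```python
def dibr_warp(color, depth, focal, baseline):
    """Forward-warp an RGBD image to a camera shifted horizontally by `baseline`.

    color, depth: 2D lists of equal size (rows of pixels)
    Returns (warped, mask); mask[v][u] is True where a source pixel landed,
    i.e. where the virtual view is covered by visible data.
    """
    height, width = len(depth), len(depth[0])
    warped = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            z = depth[v][u]
            if z <= 0.0:
                continue  # no depth available for this pixel
            disparity = focal * baseline / z  # closer pixels shift further
            u_new = int(round(u - disparity))
            # z-buffer test: the nearest surface wins when pixels collide
            if 0 <= u_new < width and z < zbuf[v][u_new]:
                zbuf[v][u_new] = z
                warped[v][u_new] = color[v][u]
    mask = [[px is not None for px in row] for row in warped]
    return warped, mask

def training_rays(warped, mask):
    """Keep only the rays whose pixel is covered; holes are excluded from training."""
    return [
        (v, u, warped[v][u])
        for v, row in enumerate(mask)
        for u, covered in enumerate(row)
        if covered
    ]
```

Pixels left uncovered by the warp correspond to disoccluded or out-of-view areas; dropping their rays matches the idea that only visible-region data from the virtual views should supervise the radiance field.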

In step 220, NeRF (Neural Radiance Field)-based scene modeling is performed using color images and depth images.

The overall framework according to an embodiment of the present invention proceeds similarly to the pipeline of the original NeRF. However, it additionally uses depth information, which the original method did not consider, and the preprocessing provides multiple data samples of the visible regions from a small number of inputs.

FIG. 3 is a diagram showing a color image and a depth image received as input in a real-environment experiment according to an embodiment of the present invention.

FIG. 3(a) shows the color image, and FIG. 3(b) shows the depth image.

NeRF (Neural radiance field)-based scene modeling using the color and depth images consequently increases the sampling rate for the visible region, while occluded regions are estimated from the inputs along the video direction. Generating and training on all 8 × 8 SAIs (Sub-aperture images) would greatly increase the amount of computation, so only selected SAIs were used.

In step 230, the data is expanded using the input color image and depth image.

The present invention proposes a NeRF-based network that synthesizes a light field with a wide baseline from an input monocular video. Data for the proposed network is also generated in order to verify the quality of the synthesized light field. The proposed method is divided into two stages. The first stage expands the data using the input color and depth images: a depth-based rendering method generates viewpoints around each input viewpoint. This helps the MLP perform geometric inference from the sparse inputs that result from straight-line driving. In addition, the DIBR warping range was adjusted to obtain a wider baseline than existing techniques, and the data was augmented by selecting a portion of the SAIs rather than generating all of them.
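As a rough illustration of selecting only part of the SAI grid and widening the baseline, the sketch below lists camera offsets for a subset of an 8 × 8 grid. The selection rule (every `step`-th row and column) and the parameters `spacing` and `step` are assumptions made for this example; the text only states that a portion of the SAIs is selected and that the warping range is adjusted.

```python
import numpy as np

def sai_offsets(grid=8, spacing=1.0, step=2):
    """Return grid indices and camera offsets for a subset of sub-aperture images.

    grid: number of SAIs per side (8 gives an 8 x 8 light field),
    spacing: distance between neighbouring SAIs (a larger value widens the baseline),
    step: keep every step-th row and column (hypothetical selection rule).
    """
    idx = np.arange(0, grid, step)
    centre = (grid - 1) / 2.0  # offsets are measured from the grid centre
    rows, cols = np.meshgrid(idx, idx, indexing="ij")
    indices = np.stack([rows, cols], axis=-1).reshape(-1, 2)
    offsets = np.stack([(cols - centre) * spacing,
                        (rows - centre) * spacing], axis=-1).reshape(-1, 2)
    return indices, offsets

# 16 of the 64 views, spaced 0.1 units apart between neighbouring grid positions
indices, offsets = sai_offsets(grid=8, spacing=0.1, step=2)
```

Each offset could then serve as the lateral (and vertical) camera shift for a depth-based warp of the input frame, so that increasing `spacing` directly increases the baseline of the generated views.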

In step 240, learning is performed to reduce errors of NeRF-based scene modeling in regions where the depth value of the depth image changes continuously and no texture exists.

In the overall training process according to the embodiment of the present invention, virtual viewpoints covering the visible region are generated and the NeRF is modeled from the input RGBD. Beyond this, the techniques proposed in [1] to facilitate NeRF training, such as positional embedding and hierarchical volume sampling, were used. The network is trained end-to-end, and only an L2 loss was used as the loss function during training. The proposed network uses the ADAM optimizer; the learning rate starts at 0.0005 and is decreased as training proceeds. Training took about one day on an NVIDIA RTX A6000 48GB GPU.
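The training ingredients named above can be sketched in a few lines. This is an illustration under stated assumptions, not the network of the invention: the positional embedding uses the sine/cosine frequency form commonly associated with NeRF (the ordering of the output channels here is arbitrary), the L2 loss is restricted to rays kept by a visibility mask, and the decay factor and decay length of the learning-rate schedule are hypothetical, since the text gives only the initial value of 0.0005. All function names are made up for this example.

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Map coordinates to sin/cos features at frequencies 2^k * pi, k = 0..num_freqs-1."""
    x = np.asarray(x, dtype=float)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi       # (L,)
    angles = x[..., None] * freqs                       # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)               # (..., D * 2L)

def masked_l2(pred, target, keep):
    """Mean squared error over the rays whose keep flag is True."""
    keep = np.asarray(keep, dtype=bool)
    diff = (np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))[keep]
    return float(np.mean(diff ** 2)) if diff.size else 0.0

def learning_rate(step, lr0=5e-4, decay=0.1, decay_steps=250_000):
    """Exponential decay from lr0; decay and decay_steps are assumed values."""
    return lr0 * decay ** (step / decay_steps)

# Example: encode two 3D points and compute the loss on the single kept ray
features = positional_encoding(np.zeros((2, 3)), num_freqs=4)   # shape (2, 24)
loss = masked_l2([[1, 1, 1], [5, 5, 5]], [[0, 0, 0], [0, 0, 0]], [True, False])
```

Restricting the loss to kept rays is what makes the visibility mask from pre-processing take effect during optimization: pixels with no warped sample contribute nothing to the gradient.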

FIG. 4 is a diagram showing the configuration of a neural radiance-based light field synthesis system for monocular video according to an embodiment of the present invention.

The light field synthesis system 400 according to this embodiment may include a processor 410, a bus 420, a network interface 430, a memory 440, and a database 450. The memory 440 may include an operating system 441 and a light field synthesis routine 442. The processor 410 may include a pre-processing unit 411, a modeling unit 412, and a learning unit 413. In other embodiments, the light field synthesis system 400 may include more components than those shown in FIG. 4. However, most conventional components need not be explicitly illustrated. For example, the light field synthesis system 400 may include other components such as a display or a transceiver.

The memory 440 is a computer-readable recording medium and may include a random access memory (RAM), a read only memory (ROM), and a permanent mass storage device such as a disk drive. Program code for the operating system 441 and the light field synthesis routine 442 may also be stored in the memory 440. These software components may be loaded from a computer-readable recording medium separate from the memory 440 using a drive mechanism (not shown). Such a separate computer-readable recording medium (not shown) may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, and the like. In another embodiment, the software components may be loaded into the memory 440 through the network interface 430 rather than from a computer-readable recording medium.

The bus 420 may enable communication and data transfer between the components of the light field synthesis system 400. The bus 420 may be configured using a high-speed serial bus, a parallel bus, a storage area network (SAN), and/or other suitable communication technology.

The network interface 430 may be a computer hardware component for connecting the light field synthesis system 400 to a computer network. The network interface 430 may connect the light field synthesis system 400 to the computer network through a wireless or wired connection.

The database 450 may store and maintain all information required for light field synthesis. Although FIG. 4 shows the database 450 as built into the light field synthesis system 400, the invention is not limited thereto: the database may be omitted depending on the system implementation or environment, or all or part of it may exist as an external database built on a separate system.

The processor 410 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations of the light field synthesis system 400. Instructions may be provided to the processor 410 by the memory 440 or the network interface 430 via the bus 420. The processor 410 may be configured to execute program code for the pre-processing unit 411, the modeling unit 412, and the learning unit 413. Such program code may be stored in a recording device such as the memory 440.

The pre-processing unit 411, the modeling unit 412, and the learning unit 413 may be configured to perform steps 210 to 240 of FIG. 2.

The light field synthesis system 400 may include the pre-processing unit 411, the modeling unit 412, and the learning unit 413.

The pre-processing unit 411 pre-processes the data for learning by utilizing the RGBD image of the input image in order to enhance the data for the visible region.

The pre-processing unit 411 obtains virtual viewpoint images of the visible region by DIBR (Depth Image Based Rendering) and generates a mask that determines whether each ray belongs to the visible region. These are input to the NeRF trained afterwards, and rays in the masked portion are excluded from training.

The modeling unit 412 performs NeRF (Neural radiance field)-based scene modeling using the color image and the depth image.

The modeling unit 412 increases the sampling rate for the visible region and, for occluded regions, estimates data from the inputs along the video direction. In this way it acquires, on a large scale and from the rendering-based input images, data with a wide baseline between the SAIs (Sub-aperture images) of the light field, and generates a light field dataset.

The learning unit 413 expands the data using the input color image and depth image, and learns to reduce errors of NeRF-based scene modeling in regions where the depth value of the depth image changes continuously and no texture exists.

The learning unit 413 generates viewpoints around each input viewpoint using a depth-based rendering method, adjusts the DIBR warping range to obtain a wide baseline, and expands the data by selecting a portion of the SAIs rather than generating all of them.

FIG. 5 is a diagram showing a light field generated using a network according to an embodiment of the present invention.

FIG. 5(a) shows the ground truth (G.T.), FIG. 5(b) Srinivasan, FIG. 5(c) NeRF, and FIG. 5(d) the result according to an embodiment of the present invention.

View synthesis generally relies on supervised learning, so GT images are required for quantitative evaluation. Quantitative evaluation is performed using light field video data generated directly from the road-driving light field dataset of the present invention. The proposed method generates SAIs by querying rays from viewpoints not used in training. Next, for verification in a real environment, results are generated using KITTI, a road-driving dataset. Because the KITTI dataset was captured with a stereo camera while driving on roads, quantitative evaluation is not possible, and only a qualitative comparison was performed. The experiments were run at the 480×640 spatial resolution used in training, and the KITTI data was also resized for use in the experiments.

FIG. 5 shows light fields generated using the proposed network. Because neural-radiance-based methods aim at per-scene optimization, it was confirmed that there is almost no performance difference between the real and virtual environments.

A qualitative evaluation of the framework proposed in the present invention was performed. FIG. 6 shows the SAIs generated by each method. FIG. 5(a) is an example SAI for the Srinivasan technique, which is based on estimating depth information from a monocular image, and in FIG. 5(b) the overall geometric structure is not accurately estimated. In FIG. 5(c), when the scene is geometrically complex or contains many different objects, depth values are not accurately estimated and many floating artifacts are produced. The qualitative evaluation was conducted in both virtual and real environments. Comparing the results generated by NeRF and by the proposed framework, the proposed method better preserves the overall geometric shape of the G.T. and also represents the road region accurately.

FIG. 6 is a diagram showing example SAIs of a synthesized result according to an embodiment of the present invention.

FIG. 7 is a diagram showing wide-baseline qualitative results on KITTI data according to an embodiment of the present invention.

FIG. 6 shows a portion of the SAIs of the light field. FIG. 6(a) shows virtual-environment data, and FIG. 6(b) shows KITTI data. As confirmed in the comparison, the overall geometric structure and high quality are maintained even as the baseline grows.

The devices described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but those skilled in the art will appreciate that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those skilled in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A light field synthesis method comprising:
pre-processing, through a pre-processing unit, data on the RGBD image of an input image for learning by utilizing the RGBD image of the input image in order to enhance the data for the visible region;
performing, through a modeling unit, NeRF (Neural radiance field)-based scene modeling using a color image and a depth image of the RGBD image of the pre-processed input image;
expanding, through a learning unit, data by using the color image and the depth image of the RGBD image of the pre-processed input image; and
learning, through the learning unit, to reduce errors of NeRF-based scene modeling in regions where the depth value of the data-expanded depth image changes continuously and no texture exists.

The light field synthesis method according to claim 1,
wherein the step of pre-processing, through the pre-processing unit, data on the RGBD image of the input image for learning by utilizing the RGBD image of the input image in order to enhance the data for the visible region comprises:
obtaining virtual viewpoint images of the visible region by DIBR (Depth Image Based Rendering) and generating a mask that determines whether each ray belongs to the visible region, these being input to the NeRF trained afterwards, and rays in the masked portion being excluded from training.

The light field synthesis method according to claim 1,
wherein the step of performing, through the modeling unit, NeRF (Neural radiance field)-based scene modeling using the color image and the depth image of the RGBD image of the pre-processed input image comprises:
increasing the sampling rate for the visible region and, for occluded regions, estimating data from the inputs along the video direction, thereby acquiring, on a large scale and from the rendering-based input images, data with a wide baseline between the SAIs (Sub-aperture images) of the light field and generating a light field dataset.

The light field synthesis method according to claim 1,
wherein the step of expanding, through the learning unit, data by using the color image and the depth image of the RGBD image of the pre-processed input image comprises:
generating viewpoints around each input viewpoint using a depth-based rendering method, adjusting the DIBR warping range to obtain a wide baseline, and expanding the data by selecting a portion of the SAIs rather than generating all of them.

A light field synthesis system comprising:
a pre-processing unit that pre-processes data on the RGBD image of an input image for learning by utilizing the RGBD image of the input image in order to enhance the data for the visible region;
a modeling unit that performs NeRF (Neural radiance field)-based scene modeling using a color image and a depth image of the RGBD image of the pre-processed input image; and
a learning unit that expands data by using the color image and the depth image of the RGBD image of the pre-processed input image, and learns to reduce errors of NeRF-based scene modeling in regions where the depth value of the data-expanded depth image changes continuously and no texture exists.

The light field synthesis system according to claim 5,
wherein the pre-processing unit
obtains virtual viewpoint images of the visible region by DIBR (Depth Image Based Rendering) and generates a mask that determines whether each ray belongs to the visible region, these being input to the NeRF trained afterwards, and rays in the masked portion being excluded from training.

The light field synthesis system according to claim 5,
wherein the modeling unit
increases the sampling rate for the visible region and, for occluded regions, estimates data from the inputs along the video direction, thereby acquiring, on a large scale and from the rendering-based input images, data with a wide baseline between the SAIs (Sub-aperture images) of the light field and generating a light field dataset.

The light field synthesis system according to claim 5,
wherein the learning unit
generates viewpoints around each input viewpoint using a depth-based rendering method, adjusts the DIBR warping range to obtain a wide baseline, and expands the data by selecting a portion of the SAIs rather than generating all of them.