KR102262832B1 - Device and Method for Estimating Depth of Monocular Video Image

Info

Publication number: KR102262832B1
Application number: KR1020190156863A
Authority: KR
Inventors: 함범섭; 엄찬호; 박현종
Original assignee: 연세대학교 산학협력단
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2021-06-08
Also published as: WO2021107254A1

Abstract

Disclosed are a method and an apparatus for estimating the depth of a monocular video image. The disclosed apparatus comprises: a spatial feature encoder network module that generates a spatial feature map through a neural network operation on a current frame image; a temporal feature encoder network module that generates a temporal feature map through a neural network operation on an optical flow image of the current frame image and a previous frame image; a flow guide memory module that generates a depth feature map for the current frame image through a neural network operation using the spatial feature map and the temporal feature map; and a decoder network module that generates a depth map through a neural network operation on the depth feature map, wherein the flow guide memory module uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network operation using the corrected previous state feature map in place of the previous state feature map. The disclosed apparatus and method can accurately estimate depth in a monocular video image by considering the inter-frame correlation.

Description

Device and Method for Estimating Depth of Monocular Video Image

The present invention relates to an apparatus and method for depth estimation and, more particularly, to an apparatus and method for estimating the depth of a monocular video image.

Depth estimation is an essential technology for autonomous driving and driver assistance systems. During autonomous driving, real-time depth estimation is required to determine the structure of the terrain and the exact positions of surrounding vehicles and obstacles.

Depth estimation is generally performed through stereo matching. Stereo matching estimates depth from left and right images acquired with two cameras, by computing the disparity between corresponding pixels of the left image and the right image.

However, stereo matching always requires images from two cameras, and accurate depth estimation is possible only when the images are acquired with the two cameras precisely aligned, which makes it difficult to use in practice.

Various methods have also been proposed for estimating depth using a monocular camera. The relative size of objects, the degree of texture variation, occluded regions, and the like provide depth cues even in a monocular image, and using this information makes depth estimation possible even when no stereo image is given.

Meanwhile, as deep learning research has advanced in recent years, various methods have been proposed for estimating the depth of a monocular image through neural network operations.

However, existing neural-network-based methods for monocular depth estimation estimate depth independently for each frame. Although consecutive frames are highly correlated with one another, the inter-frame correlation has not been well reflected in monocular depth estimation, which has been one of the main causes of inaccurate depth estimation in video images.

The present invention proposes a depth estimation apparatus and method capable of accurately estimating depth in a monocular video image by considering the inter-frame correlation.

To achieve the above object, according to one aspect of the present invention, there is provided an apparatus for estimating the depth of a monocular video image, comprising: a spatial feature encoder network module that generates a spatial feature map through a neural network operation on a current frame image; a temporal feature encoder network module that generates a temporal feature map through a neural network operation on an optical flow image of the current frame image and a previous frame image; a flow guide memory module that generates a depth feature map for the current frame image through a neural network operation using the spatial feature map and the temporal feature map; and a decoder network module that generates a depth map through a neural network operation on the depth feature map, wherein the flow guide memory module uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network operation using the corrected previous state feature map instead of the previous state feature map.

The apparatus includes an optical flow correction network module that generates a corrected temporal feature map through a neural network operation on the current frame image, the optical flow image, and the previous frame image.

The flow guide memory module corrects the previous state feature map through warping based on the corrected temporal feature map instead of the temporal feature map.

The value of the corrected previous state feature map is adjusted by a mask feature map, and the mask feature map is a feature map reflecting the reliability of the temporal feature map or the corrected temporal feature map.

The reliability is computed from the difference between the current frame image and the image obtained by warping the previous frame image based on the temporal feature map or the corrected temporal feature map.

The RNN of the flow guide memory module computes the current state feature map (h^t), the corrected previous state feature map, the reset gate (r^t), the update gate (z^t), and the candidate state feature map as in the following equation.

In the above equation, σ denotes the sigmoid function, * denotes convolution, and the remaining operator denotes element-wise multiplication. x^t is the input feature map, formed by combining the spatial feature map and the temporal feature map; W is a preset weight; b is a preset bias value; the warped feature map is the previous state feature map warped using the temporal feature map or the corrected temporal feature map; and M^t is the mask feature map.
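The recurrent update described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the patent's equation, which is not reproduced in this text: it assumes the standard GRU gate layout, replaces each convolution with a 1x1 convolution for brevity, assumes the corrected previous state is the mask multiplied element-wise with the warped previous state, and picks one of the two common conventions for blending the old state with the candidate. The names `flow_guided_gru_step`, `conv1x1`, and the keys of `params` are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def conv1x1(weight, bias, feat):
    # weight: (C_out, C_in), bias: (C_out,), feat: (C_in, H, W).
    # A 1x1 convolution stands in for the general convolution (assumption).
    return np.einsum("oc,chw->ohw", weight, feat) + bias[:, None, None]


def flow_guided_gru_step(x, h_warped, mask, params):
    """One GRU-style step using a corrected previous state.

    x:        (C_x, H, W) input features (spatial and temporal maps combined)
    h_warped: (C_h, H, W) previous state already warped by the flow
    mask:     (1, H, W)   reliability of the flow at each pixel
    """
    # Corrected previous state: warped state scaled by the mask (assumed form).
    h_bar = mask * h_warped
    xh = np.concatenate([x, h_bar], axis=0)
    z = sigmoid(conv1x1(params["Wz"], params["bz"], xh))  # update gate
    r = sigmoid(conv1x1(params["Wr"], params["br"], xh))  # reset gate
    xrh = np.concatenate([x, r * h_bar], axis=0)
    h_tilde = np.tanh(conv1x1(params["Wh"], params["bh"], xrh))  # candidate
    # Blend the corrected previous state and the candidate state.
    return (1.0 - z) * h_bar + z * h_tilde
```

Where the mask is close to zero, the corrected previous state vanishes and the new state depends only on the current input, which is the intended effect of down-weighting unreliable flow.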

The mask feature map is set as in the following equation.

In the above equation, p denotes a pixel, I^t_3(p) is the current frame image, the second image term is the warped previous frame image at the same pixel, and ε is an arbitrarily set constant.
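A reliability mask of this kind can be sketched as follows. Only the symbol definitions of the mask equation survive in this text, so the exponential form used here is an assumption chosen because it maps a zero photometric difference to a mask value of 1 and a large difference to a value near 0. The function name `confidence_mask` and the default `eps` are hypothetical.

```python
import numpy as np


def confidence_mask(current, warped_prev, eps=1.0):
    """Per-pixel reliability from the photometric difference.

    current, warped_prev: (C, H, W) images at the same resolution.
    Returns a (1, H, W) mask in (0, 1]; the exponential form is assumed.
    """
    # Sum the absolute difference over colour channels for each pixel.
    diff = np.abs(current - warped_prev).sum(axis=0, keepdims=True)
    return np.exp(-eps * diff)
```

A larger `eps` makes the mask fall off faster, so the memory is trusted only where the warped previous frame matches the current frame closely.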

The spatial feature encoder network module and the temporal feature encoder network module generate the spatial feature map and the temporal feature map, respectively, using a CNN.

According to another aspect of the present invention, there is provided a method for estimating the depth of a monocular video image, comprising: (a) generating a spatial feature map through a neural network operation on a current frame image; (b) generating a temporal feature map through a neural network operation on an optical flow image of the current frame image and a previous frame image; (c) generating a depth feature map for the current frame image through a neural network operation using the spatial feature map and the temporal feature map; and (d) generating a depth map through a neural network operation on the depth feature map, wherein step (c) uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network operation using the corrected previous state feature map instead of the previous state feature map.

According to the present invention, depth can be accurately estimated in a monocular video image by considering the inter-frame correlation.

FIG. 1 is a diagram illustrating a neural network structure constituting an apparatus for estimating the depth of a monocular video image according to a first embodiment of the present invention.
FIG. 2 is a diagram illustrating a neural network structure for estimating the depth of a monocular video image according to a second embodiment of the present invention.
FIG. 3 is a diagram illustrating the operation structure of an optical flow correction network module according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the operation structure of a flow guide memory module according to an embodiment of the present invention.
FIG. 5 is a diagram conceptually illustrating warping in a flow guide memory module according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the overall flow of a method for estimating the depth of a monocular video image according to the second embodiment of the present invention.

Hereinafter, the present invention will be described with reference to the accompanying drawings. However, the present invention may be embodied in several different forms and is therefore not limited to the embodiments described herein.

To describe the present invention clearly, parts irrelevant to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case of being "directly connected" but also the case of being "indirectly connected" with another member interposed between them.

Also, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The depth estimation apparatus and method of the present invention estimate the depth of each pixel of a monocular video image using the monocular video image itself. Unlike a stereo image, a monocular video image provides no binocular disparity information, so accurate depth estimation is difficult. However, stereo images are not always available, and acquiring precisely aligned stereo images is not easy either. Depth estimation for monocular video images therefore remains in demand, particularly in fields such as autonomous driving where depth information must be obtained in real time.

A video image consists of multiple frames, and adjacent frames (for example, frames t-1, t, and t+1) are highly correlated. Nevertheless, conventional depth estimation for video is performed independently for each frame, so the depth estimated in the previous frame (frame t-1) does not influence the depth estimation in the current frame (frame t). Because adjacent frames are correlated, the depths of the previous frame and the current frame are also correlated, but conventional monocular video depth estimation has not taken this correlation into account. This has been a cause of flicker in the depth maps produced by conventional monocular video depth estimation.

The apparatus and method proposed by the present invention provide a configuration in which this inter-frame depth correlation can be reflected. It should be noted, however, that because a video is a moving picture, objects may move between frames and the viewpoint may change due to camera motion. Considering only the depth correlation without accounting for such object or camera motion may instead cause inaccurate depth estimation. The present invention therefore proposes a depth estimation method that reflects the inter-frame correlation while also accounting for object and camera motion.

FIG. 1 is a diagram illustrating a neural network structure constituting an apparatus for estimating the depth of a monocular video image according to a first embodiment of the present invention.

Referring to FIG. 1, the apparatus for estimating the depth of a monocular video image according to the first embodiment of the present invention includes a spatial feature encoder network module 100, a temporal feature encoder network module 110, a flow guide memory module 120, and a decoder network module 130.

Two images are input to the monocular video depth estimation apparatus of the present invention. One is the input image of frame t (I^t), and the other is the optical flow image at frame t (O^t). Here, the optical flow image at frame t is an image reflecting the result of the optical flow computation between the previous frame (frame t-1) image and the current frame (frame t) image. The optical flow values of the input optical flow image may be computed in various ways, and the method used to generate the optical flow image does not affect the essence of the present invention. A well-known optical flow algorithm may be used, or a separate neural network may be used for the optical flow computation.

The spatial feature encoder network module 100 generates a spatial feature map for the current frame image through a neural network operation. The spatial feature encoder network module 100 may use any of various known neural networks to generate the spatial feature map. As an example, it may include, but is not limited to, a convolutional neural network (CNN) that generates a feature map by applying convolution kernels to the current frame.

The spatial feature encoder network module 100 may generate the final feature map while reducing the dimensions of the input image. For example, the spatial feature encoder may output a spatial feature map that is 1/4 the size of the input image. The neural network weights of the spatial feature encoder network module 100 are set through learning, and the learning method is described later.

The temporal feature encoder network module 110 generates a temporal feature map for the optical flow image through a neural network operation. The feature map output from the temporal feature encoder network module 110 is called a temporal feature map because optical flow is defined as the degree of motion of each pixel over time; the temporal feature map does not denote time information itself.

The temporal feature encoder network module 110 may also use any of various known neural networks to generate the temporal feature map, and may use a convolutional neural network (CNN) that generates a feature map by applying convolution kernels to the optical flow image.

The temporal feature encoder network module 110 may likewise generate the final feature map while reducing the dimensions of the optical flow image. For example, it may output a temporal feature map that is 1/4 the size of the input optical flow image. The neural network weights of the temporal feature encoder network module 110 are also set through learning.

The spatial feature map and the temporal feature map carry complementary feature information. The spatial feature map contains features describing the shapes of the objects present in the image and the layout of the background. The temporal feature map contains the individual motion trajectory of each pixel as the frames change.

According to a preferred embodiment of the present invention, the convolution operations applied in the spatial feature encoder network module 100 and the temporal feature encoder network module 110 preferably use dilated convolution to expand the receptive field. Using dilated convolution minimizes the loss of spatial resolution and of scene detail, which can make it more effective for a depth estimation neural network such as that of the present invention.

In the dilated convolution operation, the dilation rate and the receptive field size that depends on it may be adjusted as appropriate.
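The dependence of the receptive field on the dilation rate can be made concrete with a short calculation. The sketch below uses the standard relation that a kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions along one axis; the layer configurations are illustrative assumptions, not values taken from the patent.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride, dilation) layers."""
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for kernel_size, stride, dilation in layers:
        rf += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Three 3x3 layers without dilation versus dilation rates 1, 2, 4 (illustrative).
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]
print(receptive_field(plain), receptive_field(dilated))  # 7 15
```

Both stacks keep stride 1, so the feature map resolution is unchanged while the dilated stack roughly doubles the receptive field.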

The flow guide memory module 120 receives the spatial feature map output from the spatial feature encoder network module 100 and the temporal feature map output from the temporal feature encoder network module 110, and generates a depth feature map through a neural network operation. The flow guide memory module 120 receives the temporal feature maps and the spatial feature maps sequentially (t-1, t, t+1, ...) and generates the depth feature map.

Preferably, a feature map obtained by concatenating the temporal feature map and the spatial feature map is input to the flow guide memory module 120. For the feature maps to be combined, the dimensions of the temporal feature map and the dimensions of the depth feature map are preferably the same.
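The concatenation step can be illustrated with a small NumPy sketch. It assumes channel-first feature maps of shape (C, H, W) and checks that the two maps share the same height and width before stacking them along the channel axis; the function name `fuse_features` and the channel counts in the example are hypothetical.

```python
import numpy as np


def fuse_features(spatial, temporal):
    """Concatenate two (C, H, W) feature maps along the channel axis."""
    if spatial.shape[1:] != temporal.shape[1:]:
        raise ValueError("feature maps must share height and width")
    return np.concatenate([spatial, temporal], axis=0)


# Example: 64-channel spatial and 32-channel temporal maps at 1/4 resolution
# of a 128x256 input (channel counts are illustrative only).
spatial = np.zeros((64, 32, 64))
temporal = np.zeros((32, 32, 64))
fused = fuse_features(spatial, temporal)
print(fused.shape)  # (96, 32, 64)
```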

As described above, consecutive frames are not independent of one another but correlated. To generate a depth feature map that takes this correlation into account, the present invention uses a recurrent neural network (RNN) as the flow guide memory module 120. Compared with an ordinary CNN, an RNN can generate a depth feature map that more accurately reflects the correlation or dependency between frames.

There are various kinds of RNN. Basic RNNs include long short-term memory (LSTM) and the gated recurrent unit (GRU). More recently, ConvLSTM and ConvGRU, which incorporate convolution into LSTM and GRU respectively, have also been used.

According to a preferred embodiment of the present invention, ConvGRU may be used. ConvGRU is advantageous among RNNs because it incurs little loss of spatial resolution and is favorable in terms of memory usage. Of course, it will be apparent to those skilled in the art that other kinds of RNN may also be used.

The detailed operation structure of the flow guide memory module 120 will be described later with reference to a separate drawing.

The depth feature map output from the flow guide memory module 120 is input to the decoder network module 130. The decoder network module 130 generates the final depth map through a neural network operation on the input depth feature map. The decoder network module 130 may, for example, perform decoding using a CNN, but is not limited thereto. Like a typical decoder network, the decoder network module 130 may generate the depth map while expanding the dimensions of the depth feature map.

According to a preferred embodiment of the present invention, skip connections may be made from the feature maps generated at each layer of the spatial feature encoder network module 100 and at each layer of the temporal feature encoder network module 110, so that decoding reflects the feature information from the encoding process. A skip connection combines the feature maps generated during encoding with the decoder features and uses the result for decoding at the next layer. Because skip connections are widely used in the encoding and decoding of various neural networks, a detailed description is omitted.

In short, the depth estimation apparatus according to the first embodiment of the present invention consists of four neural networks, and the final depth map is output through the decoder network module 130.

The weights of the four neural networks constituting the depth estimation apparatus may be learned by computing a loss for the output depth map and backpropagating it. Backpropagation of the loss proceeds in reverse order, from the decoder network module 130 to the flow guide memory module 120 and then to the spatial feature encoder network module 100 and the temporal feature encoder network module 110, and the weights are updated in the direction that minimizes the loss.

Various known methods may be used to compute the loss for learning, and the loss computation applied in the present invention is described later.

FIG. 2 is a diagram illustrating a neural network structure for estimating the depth of a monocular video image according to a second embodiment of the present invention.

The apparatus for estimating the depth of a monocular video image according to the second embodiment of the present invention includes a spatial feature encoder network module 100, a temporal feature encoder network module 110, a flow guide memory module 120, a decoder network module 130, and an optical flow correction network module 200.

Compared with the first embodiment, the depth estimation apparatus according to the second embodiment additionally includes the optical flow correction network module 200. The operation of the modules other than the optical flow correction network module 200 is the same as in the first embodiment.

The optical flow correction network module 200 generates a corrected temporal feature map through a neural network operation. It is difficult to generate a temporal feature map that reflects accurate optical flow from the optical flow image alone. To generate a more accurate temporal feature map, the optical flow correction network module receives the current frame image (I^t), the previous frame image (I^{t-1}), and the optical flow image (O^t) and generates the corrected temporal feature map through a neural network operation.

The corrected temporal feature map is used in the flow guide memory module 120 to warp the previous state feature map. The warping of the previous state feature map is described in detail with reference to a separate drawing.
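Warping a feature map with a flow field can be sketched as follows. This is a minimal NumPy illustration under assumptions not stated in the source: the flow is taken to point from each pixel of the current frame to its location in the previous frame (backward warping), values are sampled bilinearly, and coordinates outside the map are clamped to the border. The function name `warp_backward` is hypothetical.

```python
import numpy as np


def warp_backward(feat, flow):
    """Sample feat (C, H, W) at positions displaced by flow (2, H, W).

    flow[0] is the horizontal and flow[1] the vertical displacement, in pixels,
    from each current-frame pixel to its source location (assumed convention).
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates, clamped to the valid range.
    src_x = np.clip(xs + flow[0], 0, w - 1)
    src_y = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = src_x - x0
    wy = src_y - y0
    # Bilinear interpolation between the four neighbouring samples.
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With a zero flow field the function returns the input unchanged, which is a convenient sanity check before plugging in a learned flow.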

FIG. 3 is a diagram illustrating the operation structure of an optical flow correction network module according to an embodiment of the present invention.

Referring to FIG. 3, the current frame image (I^t), the previous frame image (I^{t-1}), and the optical flow image (O^t) are input to the optical flow correction network module 200 in concatenated form.

A feature map is generated for each layer through convolutional encoding of the input. The feature map of the first layer has the same size as the concatenated input image. R^t_1, the output of the first layer, is obtained through convolutional encoding after the first feature map and the optical flow image (O^t) are concatenated.

The feature map of the second layer is generated through convolutional encoding of the image obtained by concatenating R^t_1, the output of the first layer, with the first-layer feature map. The second-layer feature map may be downsampled to 1/2 the size of the first-layer feature map. R^t_2, the output of the second layer, is obtained through convolutional encoding of the image obtained by concatenating the second-layer feature map with the optical flow image downsampled by 1/2.

The feature map of the third layer is generated through convolutional encoding of the image obtained by concatenating R^t_2, the output of the second layer, with the second-layer feature map. The third-layer feature map may be downsampled to 1/2 the size of the second-layer feature map.

R^t_3, the output of the third layer, is obtained through convolutional encoding of the image obtained by concatenating the third-layer feature map with the optical flow image downsampled by 1/4. R^t_3 is the final corrected temporal feature map, which may also be defined as the corrected optical flow.

According to an embodiment of the present invention, the optical flow correction network module 200 may be trained using two losses. The first loss is a photometric consistency loss and the second loss is a smoothness loss.

The photometric consistency loss is computed from the similarity between the current frame image (I^t) and the image obtained by applying the optical flow output from each layer (R^t_1, R^t_2, R^t_3) to I^(t-1). Here, W() denotes the warping function.

Specifically, the photometric consistency loss may be computed as in Equation 1 below.

In Equation 1, N_i is the total number of pixels, p denotes a pixel, SSIM is a function that computes structural similarity, and β is a balance constant chosen from the range 0 to 1.
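As an illustration of how a loss built from these ingredients can be evaluated, the following is a minimal single-scale sketch. It assumes one commonly used blend, β·(1 − SSIM)/2 + (1 − β)·|difference|, which may differ from the exact form of Equation 1. The function names, the default β, and the use of a single global SSIM window over the whole image are hypothetical simplifications.

```python
def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of two equal-length pixel lists in [0, 1], using one global window (simplification)."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def photometric_loss(current, warped_prev, beta=0.85):
    """Assumed blend of an SSIM term and a mean absolute difference, weighted by beta."""
    n = len(current)
    l1 = sum(abs(x - y) for x, y in zip(current, warped_prev)) / n
    dssim = (1.0 - ssim_global(current, warped_prev)) / 2.0
    return beta * dssim + (1.0 - beta) * l1
```

In the scheme described above, such a term would be evaluated for the warped image of each of the three layers and the results combined; the sketch covers one layer only.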

The second loss, the smoothness loss, measures how smooth the optical flow image is and is commonly used for training in the imaging field. As an example, the smoothness loss may be computed as in Equation 2 below.

In Equation 2, τ is an arbitrarily set constant.

Of course, the optical flow correction network module 200 may also be trained in various ways other than with the photometric consistency loss and smoothness loss described above, and those skilled in the art will understand that such a change in the training method does not affect the spirit of the present invention.

In short, the first and second embodiments differ in whether the flow guide memory module 120 warps the previous state feature map using the temporal feature map output from the temporal feature encoder network module 110 or the corrected temporal feature map output from the optical flow correction network module 200.

The first embodiment does not obtain a separate corrected temporal feature map; the flow guide memory module 120 warps the previous state feature map using the temporal feature map obtained from the temporal feature encoder network module 110. The second embodiment, in contrast, warps the previous state feature map in the flow guide memory module 120 using the corrected optical flow rather than the obtained temporal feature map.

However, even in the second embodiment, the input to the flow guide memory module is not the corrected temporal feature map but the temporal feature map output from the temporal feature encoder network module 110.

FIG. 4 is a diagram illustrating the operating structure of the flow guide memory module according to an embodiment of the present invention.

FIG. 4 shows the operating structure for an example in which the flow guide memory module 120 according to an embodiment of the present invention uses ConvGRU. However, as described above, various RNN networks other than ConvGRU may also be used.

ConvGRU uses five values: h^t is the current state feature map, h^(t-1) is the previous state feature map, r^t is the reset gate, z^t is the update gate, and the fifth value is the candidate state feature map.

In the conventional ConvGRU, the current state feature map, the previous state feature map, the reset gate, the update gate, and the candidate state feature map are computed as in Equation 3 below.

In Equation 3, σ denotes the sigmoid function, * denotes convolution, x^t denotes the input feature map, and a separate operator symbol denotes element-wise multiplication. In the present invention, x^t is the feature map formed by concatenating the spatial feature map and the temporal feature map.
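For reference, one widely used formulation of the ConvGRU update is sketched below. The weight names (W_z, U_z, and so on), the use of \odot for element-wise multiplication and \tilde{h}^t for the candidate state, and the convention for which of z^t and 1 - z^t multiplies the previous state are assumptions taken from common practice, so the notation may differ from Equation 3.

```latex
\begin{aligned}
z^t &= \sigma\left(W_z * x^t + U_z * h^{t-1} + b_z\right) \\
r^t &= \sigma\left(W_r * x^t + U_r * h^{t-1} + b_r\right) \\
\tilde{h}^t &= \tanh\left(W_h * x^t + U_h * \left(r^t \odot h^{t-1}\right) + b_h\right) \\
h^t &= \left(1 - z^t\right) \odot h^{t-1} + z^t \odot \tilde{h}^t
\end{aligned}
```

In this form the previous state h^{t-1} enters every line unchanged, which is why motion between frames is not compensated.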

As described above, when a conventional ConvGRU is used, it is difficult to reflect motion that occurs between frames. The previous frame and the current frame are correlated, and so are their depths, but the depth correlation is captured more accurately when the current frame and the previous frame are in the same state. Therefore, when an object or the camera moves between the previous frame and the current frame, it is preferable to compensate for that motion before considering the depth correlation.

In the present invention, the previous state feature map is corrected through warping using the obtained temporal feature map or the corrected temporal feature map, and the resulting corrected state feature map is used in place of the previous state feature map. The corrected state feature map is denoted by its own symbol in FIG. 4.

In other words, the flow guide memory module 120 of the present invention corrects the previous state feature map based on the temporal feature map or the corrected temporal feature map, and then applies the corrected state feature map to generate the current state feature map.

The current state feature map, the corrected state feature map, the reset gate, the update gate, and the candidate state feature map according to the present invention are computed as in Equation 4 below.

The warped feature map appearing in Equation 4 is obtained by warping the previous state feature map using the temporal feature map or the corrected temporal feature map. In addition, W is a preset weight and b is a preset bias value.

Meanwhile, a mask feature map M^t is applied to the warped feature map in Equation 4. The mask feature map can be defined as a feature map that indicates the reliability of the warping at each pixel p. If the difference between the current frame image I^t_3(p) and the previous frame image warped by the temporal feature map (optical flow) is large, the warping is unlikely to be reliable; if the difference is small, the warping is likely to be reliable. Based on this observation, the mask feature map M^t may be defined as in Equation 5 below.

In Equation 5, ε is an arbitrarily set constant that determines the width of the exponential function.
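The interaction between the reliability mask and the recurrent update can be sketched as follows. This is a toy illustration under several assumptions that are not stated in the text: the mask is taken as exp(-eps * |difference|), the mask multiplies the warped state element-wise, each map is a flat list of single-channel pixels, and scalar weights stand in for real convolutions. The function names and the parameter dictionary are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def reliability_mask(current, warped_prev, eps=10.0):
    """Assumed form: close to 1 where the warped previous frame matches the current frame."""
    return [math.exp(-eps * abs(c - w)) for c, w in zip(current, warped_prev)]


def gru_step(x, h_warped, mask, params):
    """One GRU update per pixel; scalar weights replace convolutions (simplification)."""
    h_new = []
    for xi, hw, m in zip(x, h_warped, mask):
        h_bar = m * hw  # corrected previous state: warped state scaled by its reliability
        z = sigmoid(params["wz"] * xi + params["uz"] * h_bar + params["bz"])
        r = sigmoid(params["wr"] * xi + params["ur"] * h_bar + params["br"])
        cand = math.tanh(params["wh"] * xi + params["uh"] * (r * h_bar) + params["bh"])
        h_new.append((1.0 - z) * h_bar + z * cand)
    return h_new
```

Where the mask is near zero, the corrected state contributes little and the update relies mostly on the current input, which is the intended effect of down-weighting unreliable warping.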

FIG. 5 is a diagram conceptually illustrating warping in the flow guide memory module according to an embodiment of the present invention.

FIG. 5 shows a case in which a vehicle moves between frame (t-1) and frame t. With the existing ConvGRU method, the previous state feature map h^(t-1) is used as is, so a depth correlation that reflects the motion of the vehicle is difficult to exploit for depth estimation.

To solve this problem, the present invention uses the optical flow (temporal feature map) to correct the previous state feature map h^(t-1) into the corrected state feature map, and computes the depth feature map using the corrected state feature map.
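Warping a state feature map with a flow field can be sketched as below. This is a minimal single-channel illustration that assumes backward warping with bilinear interpolation and clamping at the borders; the flow convention (each output pixel samples the source at its own position plus the flow vector) and the function name are assumptions, not details taken from the text.

```python
import math


def warp(feature, flow):
    """Backward-warp a 2D map: output[y][x] samples feature at (x + u, y + v)."""
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u, v = flow[y][x]
            # Clamp the sampling position to the valid image area.
            sx = min(max(x + u, 0.0), w - 1.0)
            sy = min(max(y + v, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            ax, ay = sx - x0, sy - y0
            # Bilinear interpolation between the four neighbouring pixels.
            top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
            bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bottom
    return out
```

With a zero flow field the map is returned unchanged, so the update reduces to the conventional recurrent behaviour when nothing moves.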

Meanwhile, to train the neural networks that constitute the depth estimation apparatus according to the embodiment of the present invention shown in FIG. 1, a loss corresponding to the difference between the ground truth depth map (G^t(p)) and the depth map output by the decoder network module may be used.

For more accurate training, the loss on the difference from the ground truth (L^D) and a smoothness loss (L^DS) may be used together.

According to an embodiment of the present invention, the loss on the difference from the ground truth depth map may be computed as in Equation 6 below.

In Equation 6, D^t(p) is the output depth map and G^t(p) is the ground truth depth map; the per-pixel term s(p) used in the equation is defined from these two maps.

In Equation 6, the first term represents the difference between the output depth map and the ground truth depth map. However, obtaining the ground truth depth of every pixel in a monocular video sequence is very difficult. The second term of Equation 6 is intended to mitigate this problem: the products of s(p) and s(q) are summed over pixel pairs p and q, α is a balance constant with a value between 0 and 1, and N is the total number of pixels.
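The two-term structure described above resembles the widely used scale-invariant log loss, and the following sketch evaluates that form. It assumes s(p) = log D^t(p) - log G^t(p) and the combination (1/N)·Σ s(p)² - (α/N²)·Σ s(p)·s(q), where the double sum over pixel pairs equals (Σ s(p))². Both the definition of s(p) and the exact weighting are assumptions, and the function name is hypothetical.

```python
import math


def depth_loss(pred, gt, alpha=0.5):
    """Assumed scale-invariant log loss over flat lists of positive depth values."""
    s = [math.log(d) - math.log(g) for d, g in zip(pred, gt)]
    n = len(s)
    first = sum(v * v for v in s) / n            # per-pixel difference term
    second = alpha * (sum(s) ** 2) / (n * n)     # pairwise term: sum over p, q of s(p) * s(q)
    return first - second
```

With alpha set to 1, a prediction that differs from the ground truth only by a global scale factor incurs no loss, which illustrates how the second term relaxes the dependence on absolute ground truth depth.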

In addition, the smoothness loss is computed to prevent discontinuities in depth and may be computed as in Equation 7 below.

In the example described above, the loss backpropagated to train the depth estimation apparatus shown in FIG. 1 is computed as the sum of the loss on the difference from the ground truth (L^D) and the smoothness loss (L^DS).

FIG. 6 is a flowchart illustrating the overall flow of a method for estimating the depth of a monocular video image according to the second embodiment of the present invention.

Referring to FIG. 6, the current frame image is first input to the spatial feature encoder network module 100 to generate a spatial feature map (step 600).

In addition, an optical flow image obtained from the current frame image and the previous frame image is input to the temporal feature encoder network module 110 to generate a temporal feature map (step 602).

Meanwhile, the current frame image, the previous frame image, and the optical flow image are input to the optical flow correction network module 200 to generate a corrected temporal feature map (step 604).

The spatial feature map generated in step 600 and the temporal feature map generated in step 602 are concatenated and input to the flow guide memory module 120, which generates a depth feature map through neural network computation (step 606). The flow guide memory module uses an RNN. The flow guide memory module 120 generates a corrected state feature map by warping the previous state feature map, which the RNN uses to update the current state feature map, with the optical flow of the corrected temporal feature map generated in step 604, and uses the corrected state feature map to update the current state feature map.

The depth feature map generated in step 606 is input to the decoder network module 130, which generates a depth map through neural network computation (step 608).
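The data flow of steps 600 to 608 can be summarized in a short sketch. The function name, the dictionary keys, and the callable signatures are hypothetical placeholders for the trained network modules; the tuple standing in for concatenation and the memory module returning both a depth feature map and an updated state are assumptions made for illustration.

```python
def estimate_depth(current, previous, flow_image, prev_state, modules):
    """Run one frame through the second-embodiment pipeline using placeholder modules."""
    spatial = modules["spatial_encoder"](current)                            # step 600
    temporal = modules["temporal_encoder"](flow_image)                        # step 602
    corrected = modules["flow_correction"](current, previous, flow_image)     # step 604
    x = (spatial, temporal)  # stands in for concatenating the two feature maps
    # The memory module warps prev_state with the corrected flow before its RNN update.
    depth_feature, state = modules["flow_guide_memory"](x, prev_state, corrected)  # step 606
    depth_map = modules["decoder"](depth_feature)                             # step 608
    return depth_map, state
```

Under the first embodiment, the same sketch would pass the temporal feature map instead of the corrected one as the warping input to the memory module.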

Although FIG. 6 describes the overall flow using the second embodiment as an example, the first embodiment differs, as explained above, only in that the temporal feature map output from the temporal feature encoder network module is used to warp the previous state feature map of the RNN.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical spirit or essential features of the present invention.

Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims below, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

An apparatus for estimating the depth of a monocular video image, comprising:
a spatial feature encoder network module that generates a spatial feature map through neural network computation on a current frame image;
a temporal feature encoder network module that generates a temporal feature map through neural network computation on an optical flow image of the current frame image and a previous frame image;
a flow guide memory module that generates a depth feature map for the current frame image through neural network computation using the spatial feature map and the temporal feature map; and
a decoder network module that generates a depth map through neural network computation on the depth feature map,
wherein the flow guide memory module uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network computation using the corrected previous state feature map in place of the previous state feature map,
the apparatus further comprising an optical flow correction network module that generates a corrected temporal feature map through neural network computation on the current frame image, the optical flow image, and the previous frame image.

(Deleted)

The apparatus of claim 1,
wherein the flow guide memory module corrects the previous state feature map through warping based on the corrected temporal feature map instead of the temporal feature map.

The apparatus of claim 3,
wherein the value of the corrected previous state feature map is adjusted by a mask feature map, and the mask feature map is a feature map reflecting the reliability of the temporal feature map or the corrected temporal feature map.

The apparatus of claim 4,
wherein the reliability is calculated based on the difference between the current frame image and the previous frame image warped based on the temporal feature map or the corrected temporal feature map.

The apparatus of claim 1,
wherein the RNN of the flow guide memory module computes the current state feature map (h^t) and the corrected previous state feature map by the following equation.

In the above equation, σ means a sigmoid function,

The apparatus of claim 4,
wherein the mask feature map is set by the following equation.

In the above equation, p means a pixel, I ^t ₃ (p) is the current frame image,

The apparatus of claim 1,
wherein the spatial feature encoder network module and the temporal feature encoder network module generate the spatial feature map and the temporal feature map, respectively, using a CNN.

A method for estimating the depth of a monocular video image, comprising:
(a) generating a spatial feature map through neural network computation on a current frame image;
(b) generating a temporal feature map through neural network computation on an optical flow image of the current frame image and a previous frame image;
(c) generating a depth feature map for the current frame image through neural network computation using the spatial feature map and the temporal feature map; and
(d) generating a depth map through neural network computation on the depth feature map,
wherein step (c) uses an RNN, corrects a previous state feature map used in the RNN through warping based on the temporal feature map, and performs the neural network computation using the corrected previous state feature map in place of the previous state feature map,
the method further comprising generating a corrected temporal feature map through neural network computation on the current frame image, the optical flow image, and the previous frame image.

(Deleted)

The method of claim 9,
wherein step (c) corrects the previous state feature map through warping based on the corrected temporal feature map instead of the temporal feature map.

The method of claim 11,
wherein the value of the corrected previous state feature map is adjusted by a mask feature map, and the mask feature map is a feature map reflecting the reliability of the temporal feature map or the corrected temporal feature map.

The method of claim 12,
wherein the reliability is calculated based on the difference between the current frame image and the previous frame image warped based on the temporal feature map or the corrected temporal feature map.

The method of claim 9,
wherein the RNN computes the current state feature map (h^t) and the corrected previous state feature map by the following equation.

In the above equation, σ means a sigmoid function,

The method of claim 12,
wherein the mask feature map is set by the following equation.