KR20240047701A

KR20240047701A - Real-time single-image depth estimation network system

Info

Publication number: KR20240047701A
Application number: KR1020220127039A
Authority: KR
Inventors: 강현수; 하템; 아메드
Original assignee: 충북대학교 산학협력단
Priority date: 2022-10-05
Filing date: 2022-10-05
Publication date: 2024-04-12

Abstract

The present invention relates to a real-time single-image depth estimation network system, and more specifically to a convolutional-neural-network-based technique that receives a single high-resolution image and predicts its depth in real time.

Description

Real-time single-image depth estimation network system

The present invention relates to a real-time single-image depth estimation network system and, more specifically, to a system that receives a single high-resolution image and, using a real-time convolutional neural network, predicts/estimates a high-resolution depth map quickly and with high accuracy.

In computer vision, depth estimation is one of the core tasks used in numerous applications, such as 3D scene construction and understanding, medical 3D imaging and scanning, background/foreground separation, depth perception for self-driving cars and robots, and 3D graphics.

Depth estimation systems typically use stereo cameras or IR depth cameras, and such systems require expensive equipment and high-speed GPU processors.

Recently, the need for high-speed computer vision has grown because embedded devices and mobile terminals, for example in self-driving cars and real-time 3D reconstruction, demand high-speed processing. Such processing is carried out by lightweight, memory-efficient algorithms based on recent convolutional neural networks.

Recent research on depth estimation has shown that convolutional-neural-network-based algorithms estimate depth with high accuracy. However, it does not consider the processing speed or the model optimization needed to deploy these models on devices such as embedded systems with limited memory and processing power or mobile terminals with few computing resources.

In other words, the recent depth estimation methods described above are built on an encoder-decoder architecture. They estimate depth with high accuracy, but because they require substantial computing resources, they cannot be applied in real time or on devices with limited hardware capability.

Accordingly, the real-time single-image depth estimation network system according to an embodiment of the present invention omits the decoder architecture and instead adopts the concept of spatially constructing the depth map image from small-scale deep feature maps, thereby resolving the problem that depth estimation requires substantial computing resources.

In this regard, Korean Registered Patent No. 10-1823314 ("Estimating depth from a single image") discloses a technique for generating depth information corresponding to a single query image.

Korean Registered Patent Publication No. 10-1823314 (registration date: 2018.01.23.)

The present invention was devised to solve the problems of the prior art described above. Its object is to provide a real-time single-image depth estimation network system that can efficiently construct a high-resolution depth map using a lightweight encoding architecture.

A real-time single-image depth estimation network system according to an embodiment of the present invention preferably includes: a pre-processing unit that includes a plurality of neural network layers and generates a number of low-resolution feature maps corresponding to the resolution of the input single image data; and a post-processing unit that includes one neural network layer and upscales the plurality of input low-resolution feature maps to generate a super pixel, wherein the super pixel is a high-resolution depth map.

Furthermore, the pre-processing unit preferably includes: a first layer unit that includes one convolution layer and applies a convolution filter to the input single image data to compute the feature map contained in it; a second layer unit that includes a plurality of linearly arranged convolutional neural network layer structures and expands the channels of the input feature map, recomputes the feature map for each channel, and projects the result; a third layer unit that includes one convolution layer and applies a convolution filter to the input feature map to compute a feature map; and a fourth layer unit that includes one convolution layer and computes depth maps whose pixel positions differ for each channel.

Furthermore, the post-processing unit preferably generates the super pixel from the plurality of input low-resolution feature maps using the following equation.

Furthermore, the post-processing unit preferably includes one sub-pixel convolution layer and operates by learning an upscaling filter that upscales the plurality of input low-resolution feature maps, wherein the loss function applied during training is the mean absolute error function defined by the following equation.

Furthermore, the post-processing unit preferably defines the relationship between the plurality of input low-resolution feature maps and the generated super pixel using the following equation.

With the configuration described above, the real-time single-image depth estimation network system of the present invention has the advantage of efficiently constructing a high-resolution depth map using a lightweight encoding architecture.

As a result, the image depth estimation network system can be applied in real time and on devices with limited hardware capability.

FIG. 1 is an exemplary diagram showing the neural network structure of a real-time single-image depth estimation network system according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram showing the convolutional neural network layer structure applied to a real-time single-image depth estimation network system according to an embodiment of the present invention.
FIGS. 3 and 4 are exemplary diagrams showing depth maps estimated by a real-time single-image depth estimation network system according to an embodiment of the present invention.
FIGS. 5 and 6 are exemplary diagrams comparing depth maps estimated by a real-time single-image depth estimation network system according to an embodiment of the present invention with depth maps estimated by conventional depth estimation network systems.

Hereinafter, the real-time single-image depth estimation network system of the present invention is described in detail with reference to the accompanying drawings. The drawings introduced below are provided as examples so that the idea of the present invention can be fully conveyed to those skilled in the art. Accordingly, the present invention is not limited to the drawings presented below and may be embodied in other forms. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, the technical and scientific terms used herein have the meanings commonly understood by those of ordinary skill in the art to which this invention pertains. In the following description and the accompanying drawings, descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the present invention are omitted.

In addition, a system means a set of components, including devices, mechanisms, and means, that are organized and interact regularly to perform a required function.

The real-time single-image depth estimation network system according to an embodiment of the present invention concerns monocular depth estimation (MDE), which predicts depth from a single image, rather than stereo depth estimation (SDE).

Briefly, considering that a decoder architecture is generally not required to construct deep feature maps, the invention excludes the decoder architecture. It thereby relates to a network system that achieves high-speed processing and can also operate in real time or on devices with limited hardware capability (a GPU, a CPU, or the like).

In other words, the real-time single-image depth estimation network system according to an embodiment of the present invention removes the decoder architecture commonly used in depth estimation network systems and replaces it with a lightweight, fast architecture, thereby providing a high-speed convolutional neural network for single-image depth estimation.

In particular, the DTS (Depth-To-Space) module used in image super-resolution systems is applied to generate the super pixel, which has the advantage of extending that module to the construction of semantic depth maps. Experiments confirmed that the system runs adequately in real time even on a CPU, which has lower computing power than the GPUs used for depth estimation network systems, so the system can be applied in real time and on devices with limited hardware capability.

As shown in FIG. 1, the real-time single-image depth estimation network system according to an embodiment of the present invention includes a pre-processing unit and a post-processing unit. Each component is, naturally, implemented as a neural network structure running on a computational processing means.

The real-time single-image depth estimation network system according to an embodiment of the present invention is a modified MobileNetV2 architecture and preferably uses a small, lightweight set of parameters and few MACs (multiply/add computations).

The conventional MobileNetV2 is a network optimized to run on devices with limited capability, such as mobile phones. The real-time single-image depth estimation network system according to an embodiment of the present invention is a modified MobileNetV2 architecture: the pre-processing unit removes the last two layers of the conventional MobileNetV2, namely the fully connected layer and the global average pooling layer, and adds a 1 × 1 convolution layer with 1024 filters. The post-processing unit then generates a depth map of the same size as the input image data.

Each component is described in detail below.

The pre-processing unit includes a plurality of neural network layers and preferably generates a number of low-resolution feature maps corresponding to the resolution of the input single image data.

As shown in FIG. 1, the pre-processing unit includes a first layer unit, a second layer unit, a third layer unit, and a fourth layer unit.

The first layer unit includes one convolution layer and preferably applies a convolution filter to the input single image data to compute the feature map contained in it. The first layer unit plays the same role as the layer at the same position in the conventional MobileNetV2, so a detailed description is omitted.

The second layer unit includes a plurality of linearly arranged convolutional neural network layer structures. As shown in FIG. 2, each such structure includes a 2-1 layer unit, a 2-2 layer unit, and a 2-3 layer unit.

The second layer unit preferably has the functional characteristics of a bottleneck inverted residual block. Because this structure is borrowed from the conventional MobileNetV2, a detailed description of the structure itself is omitted.

The bottleneck inverted residual block uses a depth-wise separable convolution layer and therefore requires little computation.

In addition, connections between residual blocks link feature maps that have many channels, whereas connections between inverted residual blocks link the bottlenecks, which minimizes the loss of feature information.

Thus, whereas an ordinary residual block has a middle layer with fewer channels than its input and output, an inverted residual block has a middle layer with more channels than its input and output, which has the advantage of minimizing the loss of feature information.

Accordingly, to form this bottleneck inverted residual block structure, the second layer unit includes an expansion layer (a 1 × 1 convolution layer with more output filters), a depth-wise separable convolution layer (a depth-wise spatial convolution layer that acts on each channel individually), and a projection layer (a 1 × 1 convolution layer with fewer outputs).

Specifically, the 2-1 layer unit includes one convolution layer and preferably expands the channels of the input feature map.

The 2-2 layer unit includes one depth-wise separable convolution layer and recomputes the feature map by applying a convolution filter to each channel of the data received from the 2-1 layer unit.

In addition, the 2-3 layer unit includes one convolution layer and projects the feature map of the data received from the 2-2 layer unit.
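The low computation cost of the expansion, depth-wise, and projection sequence can be illustrated with a simple multiply-accumulate count. The following is a minimal sketch, not the patent's implementation: the function names, the expansion factor of 6 (the MobileNetV2 default), the 3 × 3 kernel, and the 32 × 16 map with 64 channels are all assumptions chosen for illustration.

```python
def conv_macs(h: int, w: int, k: int, c_in: int, c_out: int) -> int:
    """MACs of a standard k x k convolution (stride 1, same-size output)."""
    return h * w * k * k * c_in * c_out


def depthwise_macs(h: int, w: int, k: int, channels: int) -> int:
    """MACs of a k x k depth-wise convolution: one filter per channel."""
    return h * w * k * k * channels


def inverted_residual_macs(h: int, w: int, c_in: int, c_out: int,
                           expansion: int = 6, k: int = 3) -> int:
    """MACs of expansion (1x1) -> depth-wise (k x k) -> projection (1x1)."""
    c_mid = c_in * expansion                        # widened middle layer
    expand = conv_macs(h, w, 1, c_in, c_mid)        # role of the 2-1 layer unit
    depthwise = depthwise_macs(h, w, k, c_mid)      # role of the 2-2 layer unit
    project = conv_macs(h, w, 1, c_mid, c_out)      # role of the 2-3 layer unit
    return expand + depthwise + project


# Hypothetical example: a 32 x 16 feature map with 64 input/output channels.
block = inverted_residual_macs(16, 32, 64, 64)
# For comparison: a standard 3 x 3 convolution at the expanded width (384 channels).
plain = conv_macs(16, 32, 3, 64 * 6, 64 * 6)
```

Under these assumed sizes the whole block needs far fewer multiply-accumulates than a single standard 3 × 3 convolution operating at the expanded channel width, which is the property the description relies on.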

The second layer unit plays the same role as the layers at the same positions in the conventional MobileNetV2, so a detailed description is omitted.

The third layer unit includes one convolution layer and preferably applies a convolution filter to the input feature map to compute a feature map.

The third layer unit likewise plays the same role as the layer at the same position in the conventional MobileNetV2, so a detailed description is omitted.

The fourth layer unit includes a 1 × 1 convolution layer with 1024 filters. This is the convolution layer added when constructing the modified MobileNetV2, and it preferably computes depth maps whose pixel positions differ for each channel.

Specifically, each channel of the fourth layer unit computes the depth information of a low-resolution image whose pixel positions differ slightly from those of the other channels. The post-processing unit integrates this low-resolution depth information to generate a depth image (the output image) with the same resolution as the input single image data.

The post-processing unit preferably includes one neural network layer, namely a sub-pixel convolution layer (also called a DTS module).

The conventional sub-pixel convolution layer operates so that the final feature maps of a low-resolution image can be output at high resolution, and it is a technique that yields very accurate super-resolution results. Specifically, it resolves the high complexity of image super-resolution by reducing the architecture to just three convolution layers of gradually increasing depth and then constructing the high-resolution output image by aggregating the low-resolution features.

In other words, replacing the decoding block used in conventional depth estimation networks with a structurally simpler 1 × 1 convolution layer (the fourth layer unit) and a sub-pixel convolution layer has the advantage of greatly reducing the complexity of the structure.

In the single-image depth estimation network system according to an embodiment of the present invention, this layer is applied to depth maps instead of images.

That is, the post-processing unit preferably upscales the plurality of input low-resolution feature maps to generate a super pixel.

The plurality of input low-resolution feature maps is upscaled by rearranging a tensor of shape H × W × r² into shape rH × rW. Here, H, W, and r are the height, width, and depth parameter of the feature maps, respectively (the tensor has r² channels, so r also acts as the upscaling factor).

In this way, the post-processing unit constructs, from the plurality of low-resolution feature maps, a depth map of the same size as the input image data (see the output depth map in FIG. 1), which can be expressed as Equation 1 below.
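The equation image itself is not reproduced in this text. As a hedged sketch only, and assuming the rearrangement follows the standard sub-pixel (periodic shuffling) form with a single output channel, the mapping could be written with the symbols defined for Equation 1 (D, T, r, mod) as follows; the exact index order is an assumption.

```latex
D_{x,y} = T_{\lfloor x/r \rfloor,\ \lfloor y/r \rfloor,\ r \cdot \operatorname{mod}(y,\, r) + \operatorname{mod}(x,\, r)}
```

Under this reading, the pixel at position (x, y) of the high-resolution depth map is taken from spatial position (⌊x/r⌋, ⌊y/r⌋) of the low-resolution tensor, and the channel is selected by the remainders of x and y modulo r.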

Here, D_{x, y} is the generated depth map, T is the feature map output by the last layer of the pre-processing unit, r is the depth of the feature map output by the last layer of the pre-processing unit, and mod is the modulus operation.

Through a learnable process, Equation 1 above maps the pixels of the low-resolution feature maps to the depth map when the condition mod(x, r) = 0 or mod(y, r) = 0 is true.
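The H × W × r² to rH × rW rearrangement can be sketched in a few lines of plain Python. This is an illustrative sketch under assumptions, not the patent's implementation: the function name `depth_to_space`, the nested-list tensor layout, and the channel ordering (row remainder times r plus column remainder) are hypothetical choices that follow the common depth-to-space convention.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [row][col][channel]


def depth_to_space(features: Tensor3D, r: int) -> List[List[float]]:
    """Rearrange an H x W x (r*r) tensor into an (r*H) x (r*W) map."""
    h = len(features)
    w = len(features[0])
    if any(len(px) != r * r for row in features for px in row):
        raise ValueError("each position must hold r*r channels")
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for y in range(h * r):
        for x in range(w * r):
            # Assumed channel order: remainder of the row index selects a
            # group of r channels, remainder of the column index picks one.
            channel = r * (y % r) + (x % r)
            out[y][x] = features[y // r][x // r][channel]
    return out


# Tiny example: H = 1, W = 2, r = 2, so each position holds 4 channels.
lowres = [[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]]
highres = depth_to_space(lowres, 2)
```

Each low-resolution position contributes one r × r tile of the output, so no values are computed here; they are only moved. The learnable part is the layer that produces the r² channels.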

The post-processing unit preferably operates by learning an upscaling filter that upscales the plurality of input low-resolution feature maps, and the loss function applied during training is preferably the mean absolute error function defined by Equation 2 below. Gradients are preferably allowed to flow through this layer by backpropagation during training.

Here, the two terms of the equation are the ground-truth depth map and the generated depth map, respectively.
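The equation image and its symbols are not reproduced in this text. As a sketch of the standard mean absolute error, with y_i for the ground-truth depth at pixel i, ŷ_i for the generated depth at pixel i, and n for the number of pixels (all three symbols are assumptions, since the original notation is lost), the loss can be written as:

```latex
L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```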

Through the training process, the post-processing unit updates its weights and biases.

The relationship between the plurality of input low-resolution feature maps and the generated super pixel is preferably defined by Equation 3 below.

Here, W_L and b_L are the updated weights and biases, H^LR is the low-resolution feature map, and f is the activation function of the layer.

FIG. 3 is an exemplary diagram showing a depth map estimated by the real-time single-image depth estimation network system according to an embodiment of the present invention, where a) is the input data (1024 × 512), b) shows 20 sample low-resolution feature maps (32 × 16) output by the fourth layer unit, the last layer of the pre-processing unit, and c) is the high-resolution depth map (1024 × 512) reconstructed by the post-processing unit.
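The sizes quoted for FIG. 3 are consistent with the 1024 filters of the fourth layer unit, as the following shape arithmetic shows. It is a minimal sketch: the function names are hypothetical, and the total encoder stride of 32 is an assumption inferred from the 1024 × 512 input and the 32 × 16 feature maps.

```python
import math
from typing import Tuple


def feature_map_shape(width: int, height: int, stride: int = 32) -> Tuple[int, int]:
    """Spatial size of the encoder output for an assumed total stride."""
    if width % stride or height % stride:
        raise ValueError("input size must be a multiple of the stride")
    return width // stride, height // stride


def upscaled_shape(width: int, height: int, channels: int) -> Tuple[int, int]:
    """Size after rearranging channels = r*r maps into one map r times larger."""
    r = math.isqrt(channels)
    if r * r != channels:
        raise ValueError("channel count must be a perfect square")
    return width * r, height * r


low = feature_map_shape(1024, 512)           # size of each low-resolution map
high = upscaled_shape(*low, channels=1024)   # size after the sub-pixel layer
```

With 1024 = 32² channels, the rearrangement enlarges each side by a factor of 32, which returns the 32 × 16 maps to the 1024 × 512 input size.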

The real-time single-image depth estimation network system according to an embodiment of the present invention was trained and evaluated on the widely used KITTI, Cityscapes, and NYUV2 datasets.

The KITTI dataset is a large-scale dataset labeled for several tasks related to autonomous vehicles, such as object detection, semantic and instance segmentation, stereo depth estimation, monocular depth estimation, and 3D object detection.

The Cityscapes dataset supports semantic understanding of urban street scenes and is labeled for several computer vision tasks, such as semantic and instance segmentation, depth estimation, and 3D vehicle detection.

Finally, the NYUV2 dataset is an indoor scene depth and segmentation dataset provided by a research group at New York University. It contains numerous indoor scenes collected in various indoor locations such as bedrooms, kitchens, basements, and bathrooms.

FIG. 4 is an exemplary diagram showing depth maps estimated by the real-time single-image depth estimation network system according to an embodiment of the present invention on the datasets described above.

a) shows the input image, ground-truth depth map, and generated depth map for a KITTI test sample; b) shows the same for a Cityscapes test sample; and c) shows the same for an NYUV2 test sample.

For the performance evaluation of the real-time single-image depth estimation network system according to an embodiment of the present invention, the aim is to achieve a low absolute relative error while processing high-resolution test images at 48 frames per second on a GPU and 20 frames per second on a CPU.

The metrics used for the performance evaluation are the mean absolute relative error (REL), the squared relative error (Sq_REL), the root mean square error (RMSE), and the threshold accuracy.

Each metric can be expressed as in Equation 4 below.
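The equation image is not reproduced in this text. The following is a sketch of the definitions commonly used for these four metrics in the depth estimation literature, assuming y_i and ŷ_i are the ground-truth and predicted values at pixel i and δ names the threshold accuracy; both ŷ and δ are assumed symbols, since the original notation is lost.

```latex
\begin{aligned}
\mathrm{REL} &= \frac{1}{n}\sum_{i=1}^{n}\frac{\lvert y_i - \hat{y}_i\rvert}{y_i}, \\
\mathrm{Sq\_REL} &= \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i - \hat{y}_i)^2}{y_i}, \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \\
\delta &= \frac{1}{n}\,\Bigl\lvert\Bigl\{\, i : \max\Bigl(\frac{y_i}{\hat{y}_i},\ \frac{\hat{y}_i}{y_i}\Bigr) < thr \,\Bigr\}\Bigr\rvert .
\end{aligned}
```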

Here, y and the corresponding predicted value are the ground-truth and predicted pixel values, n is the number of pixels in the depth map, and thr is a preset threshold (1.25, 1.25², 1.25³).
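The four metrics can be computed directly from paired pixel values. The sketch below uses the definitions commonly found in the depth estimation literature, which may differ in detail from the patent's Equation 4; the function name `depth_metrics`, the dictionary keys, and the sample values are hypothetical, and all depths are assumed to be strictly positive.

```python
import math
from typing import Dict, Sequence


def depth_metrics(truth: Sequence[float], pred: Sequence[float],
                  thr: float = 1.25) -> Dict[str, float]:
    """REL, Sq_REL, RMSE and threshold accuracy over paired depth values."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    pairs = list(zip(truth, pred))
    # Mean absolute relative error.
    rel = sum(abs(y - p) / y for y, p in pairs) / n
    # Squared difference, normalised by the ground-truth depth.
    sq_rel = sum((y - p) ** 2 / y for y, p in pairs) / n
    # Root mean square error.
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in pairs) / n)
    # Fraction of pixels whose ratio to the ground truth is below thr.
    accuracy = sum(max(y / p, p / y) < thr for y, p in pairs) / n
    return {"REL": rel, "Sq_REL": sq_rel, "RMSE": rmse, "accuracy": accuracy}


# Hypothetical two-pixel example: one exact prediction, one off by 1.
scores = depth_metrics([2.0, 4.0], [2.0, 5.0])
```

Calling the function with thr set to 1.25, 1.25 ** 2, and 1.25 ** 3 gives the three threshold accuracies listed in the text.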

The evaluation shows that, as presented in Table 1 below, the real-time single-image depth estimation network system according to an embodiment of the present invention achieves very low errors of 0.028, 0.167, and 0.069 on KITTI, Cityscapes, and NYUV2, respectively.

Consequently, a predicted/estimated depth map that is most similar to the actual depth map can be generated even in real time or on devices with limited hardware capability.

FIGS. 5 and 6 are exemplary diagrams comparing depth maps estimated by the real-time single-image depth estimation network system according to an embodiment of the present invention with depth maps estimated by conventional depth estimation network systems.

These figures show that the real-time single-image depth estimation network system according to an embodiment of the present invention generates the depth map most similar to the actual depth map.

The present invention has been described above with reference to specific details, such as particular components, and to drawings of limited embodiments. These are provided only to aid overall understanding of the present invention; the present invention is not limited to the embodiment described above, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Accordingly, the spirit of the present invention should not be limited to the described embodiments, and the claims set out below, together with everything that is equal or equivalent to those claims, fall within the scope of the spirit of the present invention.

Claims

1. A real-time single-image depth estimation network system comprising:
a pre-processing unit that includes a plurality of neural network layers and generates a number of low-resolution feature maps corresponding to the resolution of input single image data; and
a post-processing unit that includes one neural network layer and generates a super pixel by upscaling the plurality of input low-resolution feature maps,
wherein the super pixel is a high-resolution depth map.

2. The real-time single-image depth estimation network system of claim 1,
wherein the pre-processing unit includes:
a first layer unit that includes one convolution layer and computes a feature map by applying a convolution filter to the input single image data;
a second layer unit that includes a plurality of linearly arranged convolutional neural network layer structures, expands the channels of the input feature map, recomputes the feature map for each channel, and projects the result;
a third layer unit that includes one convolution layer and computes a feature map by applying a convolution filter to the input feature map; and
a fourth layer unit that includes one convolution layer and computes depth maps whose pixel positions differ for each channel.

3. The real-time single-image depth estimation network system of claim 2,
wherein the post-processing unit generates the super pixel from the plurality of input low-resolution feature maps using the following equation.

4. The real-time single-image depth estimation network system of claim 3,
wherein the post-processing unit includes one sub-pixel convolution layer and operates by learning an upscaling filter that upscales the plurality of input low-resolution feature maps, and
wherein the loss function applied during training is the mean absolute error function defined by the following equation.

5. The real-time single-image depth estimation network system of claim 4,
wherein the post-processing unit defines the relationship between the plurality of input low-resolution feature maps and the generated super pixel using the following equation.