KR20220079125A - System and method for semi-supervised single image depth estimation and computer program for the same

Info

Publication number: KR20220079125A
Application number: KR1020200168582A
Authority: KR
Inventors: 민동보; 최혜송
Original assignee: 이화여자대학교 산학협력단
Priority date: 2020-12-04
Filing date: 2020-12-04
Publication date: 2022-06-13
Also published as: KR102526415B1

Abstract

A single image depth estimation system includes: a depth estimation unit configured to calculate pseudo depth information for an input image; a confidence calculation unit configured to calculate confidence information for the pseudo depth information; and a threshold determination unit configured to determine, through learning that uses the confidence information, a confidence threshold for filtering out part of the pseudo depth information. The depth estimation unit is further configured to generate a depth estimation model for a single image using the pseudo depth information filtered by the threshold. The system outperforms systems based on known single image depth estimation methods. By using a threshold network, it prevents the performance degradation caused by errors in the pseudo ground truth depth image and improves on methods that use existing confidence estimation schemes.

Description

A semi-supervised single image depth estimation system and method, and a computer program for the same

Embodiments relate to a single image depth estimation system and method, and a computer program therefor. More specifically, the embodiments relate to a technique that provides a new framework for monocular depth estimation based on semi-supervised learning using a pseudo ground truth generated through stereo matching.

Monocular depth estimation, or single image depth estimation, predicts depth information for each pixel of a single image such as one RGB image, and plays an important role in fields such as robotics and autonomous driving. Early research on single image depth estimation relied mainly on supervised learning with training data that includes ground truth depth images. Building the very large training sets this requires, however, is expensive and labor-intensive.

For this reason, most recent studies are based on self-supervised learning, which computes a reconstruction loss from the pixel-wise similarity of images. Self-supervised learning appears to be an alternative when labeled training data is unavailable, but it blurs the depth map at object boundaries. In addition, because it does not account for pixels in the occluded regions of a stereo image, it cannot handle occlusions, which degrades the quality of the estimated depth information.

Korean Unexamined Patent Publication No. 10-2017-0082794

According to an aspect of the present invention, a new approach uses a depth map generated through stereo matching as a pseudo ground truth. The depth information of the pseudo ground truth is filtered by a thresholded confidence map, and the depth estimation network is trained on the filtered result. This makes it possible to provide a single image depth estimation system and method, and a computer program for the same, that prevent the performance degradation caused by errors in the pseudo ground truth.

A single image depth estimation system according to an aspect of the present invention includes: a depth estimation unit configured to calculate pseudo depth information for an input image; a confidence calculation unit configured to calculate confidence information for the pseudo depth information; and a threshold determination unit configured to determine, through learning that uses the confidence information, a confidence threshold for filtering out part of the pseudo depth information.

In this case, the depth estimation unit is further configured to generate a depth estimation model for a single image using the pseudo depth information filtered by the threshold.

In an embodiment, the depth estimation unit includes: a stereo matching unit configured to calculate the pseudo depth information from the input image using a pre-stored stereo matching model; and a depth learning unit configured to train a depth estimation network using the pseudo depth information filtered by the threshold.

In an embodiment, the depth estimation network includes one or more encoder layers for extracting feature values from an image and one or more decoder layers configured to convert the feature values into depth information. In this case, the threshold determination unit is further configured to determine the threshold through adaptive learning that uses the feature values extracted by the one or more encoder layers.

In an embodiment, the threshold determination unit is further configured to generate thresholded confidence information by a differentiable soft-thresholding function defined using the confidence information and the threshold.

In an embodiment, the threshold determination unit is further configured to determine the threshold by training a threshold network with a loss function defined by the thresholded confidence information and ground truth confidence information.

In an embodiment, the depth estimation unit is further configured to train the depth estimation network with a regression loss function defined using the thresholded confidence information and the pseudo depth information.

A single image depth estimation method according to an aspect of the present invention includes: calculating, by a single image depth estimation system, pseudo depth information for an input image; calculating, by the single image depth estimation system, confidence information for the pseudo depth information; determining, by the single image depth estimation system and through learning that uses the confidence information, a confidence threshold for filtering out part of the pseudo depth information; and generating, by the single image depth estimation system, a depth estimation model for a single image using the pseudo depth information filtered by the threshold.

In an embodiment, calculating the pseudo depth information includes calculating, by the single image depth estimation system, the pseudo depth information from the input image using a pre-stored stereo matching model.

In an embodiment, generating the depth estimation model includes training, by the single image depth estimation system, a depth estimation network using the pseudo depth information filtered by the threshold.

In an embodiment, the depth estimation network includes one or more encoder layers for extracting feature values from an image and one or more decoder layers configured to convert the feature values into depth information. In this case, determining the threshold includes determining, by the single image depth estimation system, the threshold through adaptive learning that uses the feature values extracted by the one or more encoder layers.

In an embodiment, determining the threshold includes generating, by the single image depth estimation system, thresholded confidence information by a differentiable soft-thresholding function defined using the confidence information and the threshold.

In an embodiment, determining the threshold further includes training, by the single image depth estimation system, a threshold network with a loss function defined by the thresholded confidence information and ground truth confidence information.

In an embodiment, generating the depth estimation model includes training, by the single image depth estimation system, the depth estimation network with a regression loss function defined using the thresholded confidence information and the pseudo depth information.

A computer program according to an aspect of the present invention is combined with hardware to execute the single image depth estimation method according to the above-described embodiments, and may be stored on a computer-readable recording medium.

The single image depth estimation system and method according to an aspect of the present invention use three sub-networks, namely a monocular depth network, a confidence network, and a threshold network, and are configured to train the depth estimation network by semi-supervised learning using a pseudo ground truth associated with the monocular depth network.

The single image depth estimation system and method according to an aspect of the present invention outperform known single image depth estimation methods. By using a threshold network, they also prevent the performance degradation caused by errors in the pseudo ground truth depth image and improve on methods that use existing confidence estimation schemes, thereby providing a base technology applicable to fields such as autonomous vehicles and virtual reality.

FIG. 1 is a schematic block diagram of a single image depth estimation system according to an embodiment.
FIG. 2 is a flowchart showing the steps of a single image depth estimation method according to an embodiment.
FIG. 3 is a conceptual diagram showing the sub-networks included in a single image depth estimation system according to an embodiment.
FIG. 4 is a graph showing confidence values thresholded by a single image depth estimation method according to an embodiment.
FIG. 5 shows images of depth information obtained by applying a single image depth estimation method according to an embodiment to an original image step by step.
FIGS. 6 and 7 are images comparing the performance of a single image depth estimation method according to an embodiment with the related art.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings.

FIG. 1 is a schematic block diagram of a single image depth estimation system according to an embodiment.

Referring to FIG. 1, the single image depth estimation system 1 according to the present embodiment includes a depth estimation unit 10, a confidence calculation unit 20, and a threshold determination unit 30. In an embodiment, the single image depth estimation system 1 further includes a database (DB) 40 that stores a depth estimation model and/or input images. In an embodiment, the single image depth estimation system 1 further includes an output unit 50 for providing a user with depth information generated from an input image. Further, in an embodiment, the depth estimation unit 10 includes a stereo matching unit 11 and a depth learning unit 12.

The devices described herein may be entirely hardware, or may be partly hardware and partly software. For example, each unit 10, 20, 30, 40, 50 included in the single image depth estimation system 1, and each of their sub-units, may collectively refer to a device for exchanging data of a specific format and content by electronic communication, together with the related software. As used herein, terms such as "unit", "module", "server", "system", "platform", "device", and "terminal" are intended to refer to a combination of hardware and the software driven by that hardware. For example, the hardware may be a data processing device including a CPU or another processor. The software driven by the hardware may refer to a running process, an object, an executable, a thread of execution, a program, and the like.

In addition, the units constituting the single image depth estimation system 1 are not necessarily intended to refer to physically separate components. That is, although the units 10, 20, 30, 40, 50 of the single image depth estimation system 1 are shown in FIG. 1 as separate blocks, this is only a functional division of the system according to the operations it performs. Depending on the embodiment, some or all of the units 10, 20, 30, 40, 50 may be integrated in a single device, or one or more units may be implemented as devices physically separate from the others. For example, the units of the single image depth estimation system 1 may be components communicatively connected to each other in a distributed computing environment.

The depth estimation unit 10 is configured to calculate pseudo ground truth depth information (also referred to herein as pseudo depth information) for an input image. The depth estimation unit 10 is also configured to generate, by training a depth estimation network, a depth estimation model capable of estimating depth information from a single image.

As used herein, a network refers to a machine learning model configured to extract feature values from an input image through one or more layers and/or to restore feature values into an image through one or more other layers, while updating the parameters involved in the extraction or restoration process using training data.

The confidence calculation unit 20 is configured to calculate confidence information from the pseudo depth information by training a confidence network. Here, confidence is a probability density function of the image; in this specification, confidence information means an assignment, to each unit region of the image (for example, a pixel), of the probability that the value of that region occurs over the entire object. For example, the pseudo depth information and the confidence information may each take the form of a map.

The threshold determination unit 30 determines the threshold used to filter out part of the pseudo depth information, and generates a thresholded confidence map through learning that uses the confidence information. The depth estimation unit 10 may then generate a depth estimation model for a single image using the pseudo depth information filtered by the thresholded confidence map.
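The filtering idea described above can be illustrated with a minimal sketch. It assumes the pseudo depth map and the confidence map are equally sized 2-D lists and that the threshold is a single scalar; the function name `filter_pseudo_depth` and the use of `None` to mark discarded pixels are illustrative assumptions, not part of the specification.

```python
def filter_pseudo_depth(pseudo_depth, confidence, threshold):
    """Keep pseudo depth values whose confidence exceeds the threshold.

    Pixels at or below the threshold are marked invalid with None
    (an illustrative convention, not taken from the specification).
    """
    return [
        [d if c > threshold else None for d, c in zip(depth_row, conf_row)]
        for depth_row, conf_row in zip(pseudo_depth, confidence)
    ]


# Toy 2x2 example with a hypothetical threshold of 0.3.
depth = [[1.0, 2.0], [3.0, 4.0]]
conf = [[0.9, 0.2], [0.31, 0.3]]
filtered = filter_pseudo_depth(depth, conf, 0.3)
print(filtered)
```

This hard filter is only a conceptual picture; the embodiments described later replace it with a soft, differentiable thresholding so that the threshold itself can be learned.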

The depth estimation unit 10 may also use the depth estimation model generated through this training to generate depth information for an unknown single image. The output unit 50 may process the generated depth information into a form the user can view and either transmit it to a user device (not shown) over a network or provide it through an output means (not shown) of the single image depth estimation system 1.

FIG. 2 is a flowchart showing the steps of a single image depth estimation method according to an embodiment, and FIG. 3 is a conceptual diagram showing the sub-networks included in a single image depth estimation system according to an embodiment. For convenience, the single image depth estimation method according to the present embodiment is described below with reference to FIGS. 1 to 3.

First, the single image depth estimation system 1 may receive training data for generating a depth estimation model (S1). The training data may be a stereo image consisting of a left-eye image and a right-eye image, and depth information corresponding to each region (for example, each pixel) of the image may be labeled in the training data in advance.

Next, the stereo matching unit 11 of the depth estimation unit 10 may use stereo matching to generate pseudo depth information 302 (d^Pgt) for one of the stereo images, for example the monocular image 301 (I^l) corresponding to the left eye (S2). The pseudo depth information 302 may be generated with a pre-trained stereo matching network by a known method, for example the method disclosed in Poggi, M. and Mattoccia, S., "Learning from scratch a confidence measure" (BMVC, 2016), so a detailed description is omitted in order to keep the gist of the invention clear.

The depth learning unit 12 of the depth estimation unit 10 may learn depth information by using the monocular image 301 (I^l) as the input image of a depth estimation network 305 that includes one or more encoding layers 351 and one or more decoding layers 352. For example, the depth estimation network 305 may have the encoder-decoder architecture known as U-net, disclosed in Ronneberger, O. et al., "U-net: Convolutional networks for biomedical image segmentation" (International Conference on Medical Image Computing and Computer-Assisted Intervention, 234-241, 2015), with an encoder network of 13 convolution layers and a decoder network symmetric to it. The form of the depth estimation network 305 is not limited to this, however.

In an embodiment, the feature values extracted from the monocular image 301 (I^l) through the encoding layers 351 may be used for the adaptive learning of the threshold network 306 (M_T) described below.

The confidence calculation unit 20 may generate confidence information 304 (c) through learning that uses the pseudo depth information 302 (d^Pgt) as the input image of the confidence network 303 (M_C) (S4). The confidence network 303 (M_C) may be built with any confidence estimation method that is known or developed in the future. For example, the confidence network 303 (M_C) may be built with the CCNN method proposed by M. Poggi and S. Mattoccia, but is not limited to it.

Next, through learning that uses the confidence information 304 (c) as the input image of the threshold network 306 (M_T), the threshold determination unit 30 may determine a threshold T such that only depth values whose confidence exceeds T are treated as reliable (S5). The threshold network 306 (M_T) may be configured identically or similarly to the one or more encoding layers 351 of the depth estimation network 305. How T should be set depends on the characteristics of the image: for example, T should be high for an image in which pseudo depth information is hard to obtain through stereo matching, and may be low for an image in which it is easy to obtain.

In an embodiment, the threshold T may be learned adaptively so as to reflect the characteristics of the image. To this end, the threshold determination unit 30 may train the threshold network 306 (M_T) using the feature values (for example, convolutional feature values) extracted from the monocular image 301 (I^l) through the encoding layers 351 (S3, S5).

In an embodiment, the threshold determination unit 30 may apply the threshold through a differentiable soft-thresholding function to generate thresholded confidence information (S6). For example, in an embodiment, the confidence map 307 (C^T) corresponding to the thresholded confidence information may be calculated as in Equation 1 below.
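The image of Equation 1 is not reproduced in this text. One form consistent with the surrounding description (a differentiable function of the per-pixel confidence c_p and the threshold T whose slope is set by ε and which approaches a 0/1 step as ε grows) is a sigmoid. The expression below is offered as an assumption of that kind, not as the equation actually filed:

```latex
% Assumed form of Equation 1 (sigmoid soft threshold); not taken from the source.
C^{T}_{p} = \frac{1}{1 + e^{-\varepsilon \,(c_p - T)}}
```

Under this assumed form, C^T_p equals 0.5 when c_p = T, and pixels with confidence above or below T are pushed toward 1 or 0 respectively.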

In Equation 1, p denotes a pixel of the image and c_p denotes the confidence of that pixel. The slope of the thresholded confidence map 307 (C^T) is controlled by the value of a hyperparameter ε set by the user. ε may be, for example, a positive constant.

FIG. 4 is a graph showing confidence values thresholded by a single image depth estimation method according to an embodiment. The four graphs 401 to 404 in FIG. 4 show the thresholded confidence C^T as a function of the confidence c when the parameter ε of Equation 1 is 5, 10, 25, and 90, respectively. As shown, the larger the value of ε, the more sharply the thresholded confidence C^T is mapped to 0 or 1. In the test examples described herein, ε was set to 10, but it is not limited to this value.
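The effect of ε can be explored numerically with a small sketch. It assumes a sigmoid-style soft threshold, which matches the behavior described for FIG. 4 but is not confirmed by the text as the exact function; the function name `soft_threshold` and the threshold value 0.5 used in the loop are illustrative assumptions.

```python
import math


def soft_threshold(c, t, eps=10.0):
    """Assumed sigmoid soft threshold: near 0 below t, near 1 above t.

    c   : confidence value in [0, 1]
    t   : threshold in [0, 1]
    eps : positive slope hyperparameter (larger means a sharper step)
    """
    return 1.0 / (1.0 + math.exp(-eps * (c - t)))


# Tabulate the curve for the four slope values mentioned for FIG. 4,
# using a hypothetical threshold of 0.5.
for eps in (5, 10, 25, 90):
    row = [round(soft_threshold(c / 10, 0.5, eps), 3) for c in range(0, 11, 2)]
    print(eps, row)
```

Because the function is smooth in both c and t, gradients can flow back to whatever produces the threshold, which is what allows the threshold to be learned rather than fixed by hand.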

Referring again to FIGS. 1 to 3, the threshold determination unit 30 may train the threshold network 306 (M_T) with a loss function L_T defined by the thresholded confidence map 307 (C^T) and preset ground truth confidence information 308 (C^gt). The ground truth confidence information 308 (C^gt) can be obtained from sparse ground truth depth data by known methods such as those disclosed in Tonioni, A. et al., "Unsupervised domain adaptation for depth prediction from images" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019) and Kim, S. et al., "Laf-net: Locally adaptive fusion networks for stereo confidence estimation" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 205-214, 2019).
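As a rough illustration of how ground truth confidence labels might be derived from sparse ground truth depth, the sketch below marks a pixel as reliable when the pseudo depth is close to the measured depth. The cited works define their own criteria; the absolute-error rule, the `tolerance` parameter, and the function name `ground_truth_confidence` are assumptions made only for illustration.

```python
def ground_truth_confidence(pseudo_depth, sparse_gt, tolerance=3.0):
    """Label each pixel 1.0 (reliable) or 0.0 (unreliable).

    pseudo_depth : flat list of pseudo depth values
    sparse_gt    : flat list of measured depths, None where no measurement exists
    tolerance    : hypothetical maximum absolute error for a reliable pixel

    Pixels without a measurement get None so they can be skipped in a loss.
    """
    labels = []
    for d, g in zip(pseudo_depth, sparse_gt):
        if g is None:
            labels.append(None)
        else:
            labels.append(1.0 if abs(d - g) <= tolerance else 0.0)
    return labels


labels = ground_truth_confidence([10.0, 20.0, 30.0], [11.0, None, 40.0])
print(labels)
```

Since the measurements are sparse, most pixels carry no label, so any loss built on these labels would be evaluated only where a measurement exists.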

In this process, the confidence calculation unit 20 may also train the confidence network 303 (M_C) with the loss function L_T. For example, in an embodiment, the loss function L_T may be defined as in Equation 2 below.
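The image of Equation 2 is not reproduced in this text, which states only that L_T is defined by the thresholded confidence C^T and the ground truth confidence C^gt. One common choice for comparing a soft prediction in [0, 1] with a binary label is binary cross-entropy, shown here purely as an assumption; the set Ω_gt of pixels that have a ground truth confidence label is likewise a hypothetical symbol introduced for this sketch:

```latex
% Assumed form of Equation 2 (binary cross-entropy); not taken from the source.
L_T = -\frac{1}{|\Omega_{gt}|} \sum_{p \in \Omega_{gt}}
      \left[ C^{gt}_{p} \log C^{T}_{p}
           + \left(1 - C^{gt}_{p}\right) \log\left(1 - C^{T}_{p}\right) \right]
```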

상기 수학식 2에 의하여 학습된 임계값 네트워크(306) M_T 및 신뢰도 네트워크(303) M_C를 통하여 임계화된 신뢰도 정보가 결정되면, 깊이 추정부(10)의 깊이 학습부(12)는 임계화된 신뢰도 정보(307) C_T를 이용하여 필터링된 의사 깊이 정보(309) d^Pgt를 이용하여 깊이 추정 네트워크(305) M_D를 학습시킴으로써 단안 이미지(301) I^l 에 대한 깊이 정보(310) d를 생성할 수 있다(S8). When the thresholded reliability information is determined through the threshold network 306 M _T and the reliability network 303 M _C learned according to Equation 2, the depth learning unit 12 of the depth estimator 10 is Depth information 310 for monocular image 301 I ^l by training a depth estimation network 305 M _D using pseudo depth information 309 d ^Pgt filtered using localized reliability information 307 C _T . d can be generated (S8).

For example, in one embodiment, the depth estimation network 305 M_D may be trained by computing the value of a confidence-guided regression loss function L_D, which may be defined as in Equation 3 below.

In Equation 3, d_p denotes the estimated depth information 310, d^Pgt denotes the pseudo depth information 309, and Ω denotes the set of all pixels in the monocular image 301 I^l. In one embodiment, the loss computed by Equation 3 may be normalized according to Equation 4 below.
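As a hedged illustration of a confidence-guided regression loss with normalization, and not a reproduction of Equations 3 and 4, the sketch below assumes an L1 error between the predicted depth d_p and the pseudo depth d^Pgt, weighted per pixel by the thresholded confidence C^T and divided by the sum of the weights. The function name and the flat-list pixel representation are hypothetical.

```python
def confidence_guided_l1(pred_depth, pseudo_depth, thresholded_conf, eps=1e-7):
    """Confidence-weighted L1 loss over all pixels, normalized by total weight.

    Assumptions: each argument is a flat list with one entry per pixel in the
    image, and `thresholded_conf` holds values in [0, 1]. Pixels with weight
    near 0 contribute almost nothing, so unreliable pseudo depth is ignored.
    """
    weighted_error = 0.0
    weight_sum = 0.0
    for d, d_pgt, c_t in zip(pred_depth, pseudo_depth, thresholded_conf):
        weighted_error += c_t * abs(d - d_pgt)
        weight_sum += c_t
    # Normalizing keeps the loss scale independent of how many pixels survive.
    return weighted_error / (weight_sum + eps)
```

For example, with predictions `[10.0, 20.0]`, pseudo depths `[12.0, 50.0]`, and weights `[1.0, 0.0]`, the second pixel's large error is filtered out and only the first pixel's error of 2.0 contributes.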

Through the training process described above, a depth estimation model for estimating depth information from a single image may be generated. The depth estimation model can then be applied to an unseen input image whose depth is unknown in order to estimate its depth information and provide it to the user (S9).

To train the threshold network 306 M_T and the confidence network 303 M_C, the present inventors used the stereo image set and the sparse depth maps provided by the KITTI dataset, where the sparse depth maps are LiDAR depth maps.

FIG. 5 shows the results. FIG. 5(a) shows the original image, and FIG. 5(b) shows depth information obtained by monocular depth estimation without the threshold network M_T and the confidence network M_C of the embodiment of the present invention. FIG. 5(c) shows depth information estimated by training only the confidence network M_C with the threshold T fixed at 0.3, and FIG. 5(d) shows depth information obtained by the depth estimation network M_D when both the threshold network M_T and the confidence network M_C are trained according to the embodiment of the present invention.

As shown, the depth estimation result for the image gradually improves from FIG. 5(a) through FIG. 5(d). In particular, the inaccuracy of the pseudo depth information shown in FIG. 5(b) indicates that training the depth estimation network M_D on the pseudo depth information alone is limited, whereas using the thresholded confidence information according to the embodiment of the present invention makes it possible to exclude the low-confidence pixels of FIG. 5(b) and improve depth estimation performance.

FIG. 6 compares the performance of the single image depth estimation method according to an embodiment with the prior art. FIG. 6(a) shows the original image, and FIG. 6(f) shows depth information estimated by the embodiment of the present invention. FIGS. 6(b) to 6(e) show depth information estimated by the prior art: FIG. 6(b) by the method disclosed in Kuznietsov, Y. et al., "Semisupervised deep learning for monocular depth map prediction" (Proceedings of the IEEE conference on computer vision and pattern recognition, 6647-6655, 2017); FIG. 6(c) by the method disclosed in Godard, C. et al., "Unsupervised monocular depth estimation with left-right consistency" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 270-279, 2017); FIG. 6(d) by the method disclosed in Godard, C. et al., "Digging into self-supervised monocular depth estimation" (Proceedings of the IEEE international conference on computer vision, 3828-3838, 2019); and FIG. 6(e) by the method disclosed in Watson, J. et al., "Self-supervised monocular depth hints" (Proceedings of the IEEE International Conference on Computer Vision, 2162-2171, 2019).

As shown, estimating depth information according to the embodiment of the present invention makes it possible to detect objects completely, down to fine boundaries, without the blurring of object boundaries seen in the prior art.

In addition, the present inventors compared the accuracy of the embodiments of the present invention with known prior art on the KITTI Eigen split dataset; the results are shown in Table 1. In the Data column, S and L denote stereo images and left-eye images, respectively; Sem denotes a supervised model trained through a semantic segmentation network; and PGT denotes a model using pseudo ground truth depth information according to an embodiment of the present invention.

Table 1 (Abs Rel, Sqr Rel, RMSE, RMSE log: evaluation metrics; δ columns: accuracy)

| No. | Method | Data | Supervision | Abs Rel | Sqr Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Monodepth | S | Self | 0.138 | 1.186 | 5.650 | 0.234 | 0.813 | 0.930 | 0.969 |
| 2 | Uncertainty | S | Self | 0.107 | 0.811 | 4.796 | 0.200 | 0.866 | 0.952 | 0.978 |
| 3 | MonoResMatch | S | Self | 0.111 | 0.867 | 4.714 | 0.199 | 0.864 | 0.954 | 0.979 |
| 4 | Kuznietsov | S+LiDAR | Supervised | 0.122 | 0.762 | 4.815 | 0.194 | 0.845 | 0.957 | 0.987 |
| 5 | Kuznietsov | S+LiDAR | Semi-supervised | 0.113 | 0.741 | 4.621 | 0.189 | 0.862 | 0.960 | 0.986 |
| 6 | Monodepth2 | S | Self | 0.108 | 0.842 | 4.891 | 0.207 | 0.866 | 0.949 | 0.976 |
| 7 | DepthHint | S | Self | 0.102 | 0.762 | 4.602 | 0.189 | 0.880 | 0.960 | 0.981 |
| 8 | Guizilimi | S+Sem | Self | 0.102 | 0.698 | 4.381 | 0.178 | 0.896 | 0.964 | 0.984 |
| 9 | Example 1 | L+PGT | Semi-supervised | 0.099 | 0.657 | 4.289 | 0.185 | 0.884 | 0.962 | 0.982 |
| 10 | Example 2 | L+PGT | Semi-supervised | 0.098 | 0.647 | 4.253 | 0.186 | 0.884 | 0.960 | 0.981 |

Example 1 in Table 1 corresponds to the result of (i) keeping the confidence network M_C fixed to a pre-trained confidence network while (ii) training the parameters of the depth estimation network M_D and the threshold network M_T using the loss functions L_D and L_T. Example 2 corresponds to an embodiment that, in addition to steps (i) and (ii) of Example 1, further performs (iii) with the depth estimation network M_D fixed, (iv) training the confidence network M_C and the threshold network M_T using the loss function L_T, and then (v) with the confidence network M_C and the threshold network M_T fixed, training the depth estimation network M_D using the loss function L_D.
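The staged schedules of Example 1 and Example 2 can be summarized as data. The sketch below is only an organizational aid under stated assumptions: the `Stage` class, the `run_schedule` helper, and the `train_step` callback are hypothetical names, and the actual optimizer, number of iterations per stage, and loss weighting are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    trainable: frozenset  # networks whose parameters are updated
    frozen: frozenset     # networks held fixed during this stage
    losses: tuple         # loss functions minimized in this stage


# Example 1: M_C fixed (pre-trained); M_D and M_T trained with L_D and L_T.
EXAMPLE_1 = [
    Stage(frozenset({"M_D", "M_T"}), frozenset({"M_C"}), ("L_D", "L_T")),
]

# Example 2: Example 1, then M_C and M_T with L_T (M_D fixed),
# then M_D with L_D (M_C and M_T fixed).
EXAMPLE_2 = EXAMPLE_1 + [
    Stage(frozenset({"M_C", "M_T"}), frozenset({"M_D"}), ("L_T",)),
    Stage(frozenset({"M_D"}), frozenset({"M_C", "M_T"}), ("L_D",)),
]


def run_schedule(stages, train_step):
    """Call train_step(stage_number, stage) for each stage in order."""
    for number, stage in enumerate(stages, start=1):
        train_step(number, stage)
```

A caller would supply a `train_step` that freezes the listed networks and runs its own optimization loop on the listed losses.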

In Table 1, higher values indicate better performance for the accuracy metrics, and lower values indicate better performance for the other evaluation metrics. Abs Rel and Sqr Rel denote the absolute relative error and the squared relative error between the prediction and the ground truth, respectively. RMSE and RMSE log denote the root mean square error and the root mean square error in log space, respectively. δ<1.25ⁿ denotes the fraction of pixels for which the ratio between the prediction and the ground truth is less than 1.25 to the n-th power. Bold and underlined values in Table 1 mark the best and second-best results for each metric, respectively. As shown, the embodiments of the present invention perform at least on par with the prior art while outperforming it on most metrics.
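The metrics described above can be sketched in plain Python. This assumes the definitions conventionally used in KITTI depth evaluation code, in particular that Sqr Rel divides the squared error by the ground truth and that the δ ratio is taken as max(prediction/ground truth, ground truth/prediction); the function name and the flat-list inputs of strictly positive depths are assumptions for this example.

```python
import math


def depth_metrics(pred, gt):
    """Compute Table 1 style metrics for paired positive depth values."""
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sqr_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    # Symmetric ratio, so over- and under-estimation are penalized alike.
    ratios = [max(p / g, g / p) for p, g in pairs]
    delta = {k: sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)}
    return {
        "abs_rel": abs_rel,
        "sqr_rel": sqr_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        "delta_1": delta[1],
        "delta_2": delta[2],
        "delta_3": delta[3],
    }
```

For predictions `[10.0, 15.0]` against ground truth `[10.0, 10.0]`, the second pixel has a ratio of 1.5, so it fails δ<1.25 but passes δ<1.25² (1.5625), giving accuracies of 0.5, 1.0, and 1.0.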

FIG. 7 is another comparison of the performance of the single image depth estimation method according to an embodiment with the prior art. FIG. 7(a) shows the original image, and FIG. 7(e) shows depth information estimated by the embodiment of the present invention. FIGS. 7(b) to 7(d) show depth information estimated by the prior art: FIG. 7(b) by the method disclosed in the aforementioned Godard, C. et al., "Unsupervised monocular depth estimation with left-right consistency"; FIG. 7(c) by the method disclosed in Tosi, F., Aleotti, F., Poggi, M., and Mattoccia, S., "Learning monocular depth estimation infusing traditional stereo knowledge" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9799-9809, 2019); and FIG. 7(d) by the method disclosed in the aforementioned Watson, J. et al., "Self-supervised monocular depth hints".

The results shown in FIG. 7 are depth estimation results obtained on the public Cityscapes dataset and represent a qualitative analysis on its 500-image validation set. As shown, the embodiments of the present invention perform well even when compared with recently published techniques.

The operations of the single image depth estimation method according to the embodiments described above may be implemented at least in part as a computer program and recorded on a computer-readable recording medium. A computer-readable recording medium on which a program for implementing the operations of the method according to the embodiments is recorded includes every type of recording device in which computer-readable data is stored. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner. In addition, functional programs, code, and code segments for implementing the present embodiment can be readily understood by those of ordinary skill in the art to which the present embodiment belongs.

Although the present invention has been described above with reference to the embodiments shown in the drawings, these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and variations of the embodiments are possible. Such modifications, however, should be regarded as falling within the technical scope of protection of the present invention.

Claims

1. A single image depth estimation system comprising:
a depth estimation unit configured to calculate pseudo depth information for an input image;
a confidence calculation unit configured to calculate confidence information for the pseudo depth information; and
a threshold determination unit configured to determine, through learning using the confidence information, a confidence threshold for filtering a part of the pseudo depth information,
wherein the depth estimation unit is further configured to generate a depth estimation model for a single image by using the pseudo depth information filtered by the threshold.

2. The single image depth estimation system of claim 1,
wherein the depth estimation unit comprises:
a stereo matching unit configured to calculate the pseudo depth information from the input image using a pre-stored stereo matching model; and
a depth learning unit configured to train a depth estimation network using the pseudo depth information filtered by the threshold.

3. The single image depth estimation system of claim 2,
wherein the depth estimation network includes one or more encoder layers configured to extract feature values from an image and one or more decoder layers configured to convert the feature values into depth information, and
wherein the threshold determination unit is further configured to determine the threshold through adaptive learning using the feature values extracted by the one or more encoder layers.

4. The single image depth estimation system of claim 1,
wherein the threshold determination unit is further configured to generate thresholded confidence information by a differentiable soft-thresholding function defined using the confidence information and the threshold.

5. The single image depth estimation system of claim 4,
wherein the threshold determination unit is further configured to determine the threshold by training a threshold network using a loss function defined by the thresholded confidence information and reference confidence information.

6. The single image depth estimation system of claim 5,
wherein the depth estimation unit is further configured to train the depth estimation network using a regression loss function defined using the thresholded confidence information and the pseudo depth information.

7. A single image depth estimation method comprising:
calculating, by a single image depth estimation system, pseudo depth information for an input image;
calculating, by the single image depth estimation system, confidence information for the pseudo depth information;
determining, by the single image depth estimation system, through learning using the confidence information, a confidence threshold for filtering a part of the pseudo depth information; and
generating, by the single image depth estimation system, a depth estimation model for a single image by using the pseudo depth information filtered by the threshold.

8. The single image depth estimation method of claim 7,
wherein calculating the pseudo depth information comprises calculating, by the single image depth estimation system, the pseudo depth information from the input image using a pre-stored stereo matching model.

9. The single image depth estimation method of claim 7,
wherein generating the depth estimation model comprises training, by the single image depth estimation system, a depth estimation network using the pseudo depth information filtered by the threshold.

10. The single image depth estimation method of claim 9,
wherein the depth estimation network includes one or more encoder layers configured to extract feature values from an image and one or more decoder layers configured to convert the feature values into depth information, and
wherein determining the threshold comprises determining, by the single image depth estimation system, the threshold through adaptive learning using the feature values extracted by the one or more encoder layers.

11. The single image depth estimation method of claim 7,
wherein determining the threshold comprises generating, by the single image depth estimation system, thresholded confidence information by a differentiable soft-thresholding function defined using the confidence information and the threshold.

12. The single image depth estimation method of claim 11,
wherein determining the threshold comprises training, by the single image depth estimation system, a threshold network using a loss function defined by the thresholded confidence information and reference confidence information.

13. The single image depth estimation method of claim 12,
wherein generating the depth estimation model comprises training, by the single image depth estimation system, a depth estimation network using a regression loss function defined using the thresholded confidence information and the pseudo depth information.

14. A computer program stored in a computer-readable recording medium which, in combination with hardware, executes the single image depth estimation method according to any one of claims 7 to 13.