KR20240040337A - Single infrared image-based monocular depth estimation method and apparatus

Info

Publication number: KR20240040337A
Application number: KR1020220119212A
Authority: KR
Inventors: 최유경; 한대찬
Original assignee: 세종대학교산학협력단
Priority date: 2022-09-21
Filing date: 2022-09-21
Publication date: 2024-03-28

Abstract

The present invention discloses a method and device for monocular depth estimation based on a single thermal image. According to the present invention, there is provided a single-thermal-image-based monocular depth estimation device comprising a processor and a memory connected to the processor, wherein the memory stores program instructions that, when executed by the processor: generate a first depth map by inputting one input color image of a stereo color image pair into a deep-learning-based first depth network; generate a second depth map by inputting the thermal image corresponding to the input color image into a deep-learning-based second depth network; repeat training of the first depth network by comparing the input color image with an estimated color image that is estimated using the first depth map; repeat training of the second depth network by comparing the second depth map with the first depth map obtained as training of the first depth network is repeated; and generate a second depth map from a newly input thermal image using the trained second depth network.

Description

Single infrared image-based monocular depth estimation method and apparatus

The present invention relates to a method and device for depth estimation based on a single thermal image.

In recent years, monocular depth estimation has become an important component of computer vision and robotics, with applications such as autonomous driving and augmented reality, because of its potential to replace LiDAR.

Many studies have shown that monocular depth estimation performance can be improved through supervised learning that uses LiDAR measurements as ground truth.

However, supervised learning has two problems: the high cost of LiDAR makes data collection difficult, and the sparsity of LiDAR depth measurements prevents accurate, dense depth from being estimated.

To solve these problems, researchers have proposed self-supervised learning methods, which do not require ground-truth depth at any point in training.

Many self-supervised monocular depth estimation methods have since been proposed, and the performance gap between supervised and self-supervised learning is narrowing.

However, because these methods take color (RGB) images as input, their performance is not guaranteed in low-light conditions such as at night, and the inherent limitations of RGB sensors make them sensitive to changes in the external environment such as rain or cloudy weather.

A practical alternative is a long-wavelength thermal imaging camera, which is robust to a wide range of environmental changes. Unlike a color camera, a thermal imaging camera records as an image the radiant energy of objects, which is not affected by changes in the external environment.

However, replacing the input of RGB-based depth estimation with thermal images remains a difficult problem because of the spectral difference between the two kinds of image.

Figure 1 is a diagram showing a thermal-image-based depth estimation process according to the prior art.

Referring to Figure 1, the prior art computes an appearance matching loss using color images and a depth map predicted from a thermal image.

However, in the prior art the domain gap between thermal images and color images limits depth estimation. In addition, the blurry edges and low contrast of thermal images make training difficult.

KR Registered Patent Publication No. 10-1947782

To solve the problems of the prior art described above, the present invention proposes a method and device for monocular depth estimation based on a single thermal image that apply several training schemes to compare depth information obtained from color images with depth information predicted from thermal images over both local and global regions, and that can thereby predict more accurate depth information by day and by night.

To achieve the above object, according to an embodiment of the present invention, there is provided a single-thermal-image-based monocular depth estimation device comprising: a processor; and a memory connected to the processor, wherein the memory stores program instructions that, when executed by the processor: generate a first depth map by inputting one input color image of a stereo color image pair into a deep-learning-based first depth network; generate a second depth map by inputting the thermal image corresponding to the input color image into a deep-learning-based second depth network; repeat training of the first depth network by comparing the input color image with an estimated color image that is estimated using the first depth map; repeat training of the second depth network by comparing the second depth map with the first depth map obtained as training of the first depth network is repeated; and generate a second depth map from a newly input thermal image using the trained second depth network.

The program instructions may repeat training of the second depth network by computing a self-guided loss between the first depth map and the second depth map.

The self-guided loss may include a first loss based on a combination of the structural similarity index measure (SSIM) and the L1 distance between the first depth map and the second depth map, a second loss computed from features obtained by inputting the first depth map and the second depth map into a VGG network, and a third loss computed from global descriptors and local descriptors extracted from each of the first depth map and the second depth map.

The third loss may be defined as a Patch-NetVLAD loss.

A preset weight may be applied to each of the first loss and the third loss.

The first depth map may be defined as a pseudo-label that is updated as training of the first depth network is repeated.

The program instructions may repeat training of the first depth network by computing an appearance matching loss and an image matching loss between the estimated color image and the input color image.

The image matching loss may include a perceptual loss computed from features obtained by inputting the estimated color image and the input color image into a VGG network, and a Patch-NetVLAD loss computed from global descriptors and local descriptors extracted from each of the estimated color image and the input color image.

According to another aspect of the present invention, there is provided a method of estimating monocular depth based on a single thermal image in a device including a processor and a memory, the method comprising: generating a first depth map by inputting one input color image of a stereo color image pair into a deep-learning-based first depth network; generating a second depth map by inputting the thermal image corresponding to the input color image into a deep-learning-based second depth network; repeating training of the first depth network by comparing the input color image with an estimated color image that is estimated using the first depth map; repeating training of the second depth network by comparing the second depth map with the first depth map obtained as training of the first depth network is repeated; and generating a second depth map from a newly input thermal image using the trained second depth network.

According to yet another aspect of the present invention, there is provided a computer program stored on a computer-readable recording medium for performing the above method.

According to this embodiment, the depth network is trained by comparing depth information obtained from high-precision color images with depth information predicted from thermal images, which improves the accuracy of depth estimation.

Figure 1 is a diagram showing a thermal-image-based depth estimation process according to the prior art.
Figure 2 is a diagram illustrating a single-thermal-image-based monocular depth estimation framework according to a preferred embodiment of the present invention.
Figure 3 is a diagram for explaining the Patch-NetVLAD loss according to this embodiment.
Figure 4 is a diagram showing the accuracy of depth estimation according to this embodiment.
Figure 5 is a diagram showing the configuration of a depth estimation device according to this embodiment.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

The terms used herein are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, the components of an embodiment described with reference to a drawing are not limited to that embodiment; they may be included in other embodiments within the scope in which the technical idea of the present invention is maintained, and, even where a separate description is omitted, a plurality of embodiments may be implemented again as a single integrated embodiment.

In addition, in the description with reference to the accompanying drawings, identical components are given identical or related reference numerals regardless of the figure, and redundant descriptions of them are omitted. In describing the present invention, detailed descriptions of related known technologies are omitted where they are judged to unnecessarily obscure the gist of the present invention.

This embodiment proposes a self-guided framework that uses depth information generated from color images as a pseudo-label.

Figure 2 is a diagram illustrating a single-thermal-image-based monocular depth estimation framework according to a preferred embodiment of the present invention.

Referring to Figure 2, one input color image of a stereo color image pair (for example, the left image of a left-right pair) is input into a deep-learning-based depth network N_R (the first depth network) to generate a color-image-based depth map (the first depth map).

According to this embodiment, training of the first depth network is repeated by comparing the input color image with the estimated color image that is estimated using the first depth map. This yields a precise color-image-based depth map, which is used as the pseudo-label.

In addition, according to this embodiment, the thermal image corresponding to the input color image is input into a second depth network N_T to generate a thermal-image-based depth map (the second depth map).

According to this embodiment, to train the second depth network, the self-guided loss between the first depth map and the second depth map is computed using the color-image-based depth map as the pseudo-label.

The self-guided loss according to this embodiment consists of three terms: a combination of the structural similarity index measure (SSIM) and the L1 distance, which is commonly used in image translation; a perceptual loss; and a Patch-NetVLAD loss.

In this loss, the weighting coefficient is a static value that rescales the loss and may be set to 10, and M is the auto-mask computed in the appearance matching loss.
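The formula for the self-guided loss did not survive in this text. One form consistent with the description above is sketched below. All symbols are assumed notation: D_R for the first depth map (the pseudo-label), D_T for the second depth map, and w_1, w_3 for the preset weights. The text gives only the value 10 for a rescaling weight and does not say where it or the mask M is applied, so their placement here is an assumption, not the patent's stated formula.

```latex
\mathcal{L}_{sg}
  = w_1 \, M \odot \mathcal{L}_{ssim+l1}(D_R, D_T)
  + \mathcal{L}_{per}(D_R, D_T)
  + w_3 \, \mathcal{L}_{pnv}(D_R, D_T)
```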

The SSIM and L1 distance loss, the most traditional loss in image translation tasks, is used to compare the first depth map, which is the pseudo-label, with the second depth map.

The formula for this loss is expressed in terms of a value and the variance computed from each of the compared depth maps, together with additional terms.

To measure the similarity of two images, SSIM compares two components, luminance (l) and contrast-structure (cs), whereas the L1 distance compares the values of the two images directly.
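As an illustration of how an SSIM term and an L1 term can be combined, the following is a minimal sketch rather than the patent's formula. It makes several simplifying assumptions: depth maps are flattened lists normalized to [0, 1], SSIM is computed from whole-image statistics instead of local windows, the constants c1 and c2 take the conventional values 0.01^2 and 0.03^2, and the mixing factor alpha = 0.85 is a common choice that the text does not specify. All function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from whole-image statistics, split into luminance (l) and contrast-structure (cs)."""
    mu_x, mu_y = mean(x), mean(y)
    var_x = mean([(a - mu_x) * (a - mu_x) for a in x])
    var_y = mean([(b - mu_y) * (b - mu_y) for b in y])
    cov_xy = mean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    l = (2 * mu_x * mu_y + c1) / (mu_x * mu_x + mu_y * mu_y + c1)
    cs = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return l * cs


def ssim_l1_loss(pseudo_label, predicted, alpha=0.85):
    """Weighted sum of an SSIM dissimilarity term and a mean L1 term (alpha is assumed)."""
    l1 = mean([abs(a - b) for a, b in zip(pseudo_label, predicted)])
    return alpha * (1.0 - ssim_global(pseudo_label, predicted)) / 2.0 + (1.0 - alpha) * l1


pseudo = [0.2, 0.4, 0.6, 0.8]                       # pseudo-label depth values
same = ssim_l1_loss(pseudo, pseudo)                 # identical maps
worse = ssim_l1_loss(pseudo, [0.8, 0.6, 0.4, 0.2])  # reversed structure
```

Identical maps give a loss of essentially zero, while the reversed map is penalized by both terms: the L1 term sees the value differences and the cs component sees the inverted structure.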

In addition, according to this embodiment, a perceptual loss is used to measure the high-level perceptual and semantic differences between the pseudo-label and the depth map predicted from the thermal image.

The perceptual loss uses features obtained by inputting the two depth maps into a VGG network trained for image classification.
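A common way to write such a perceptual loss is to sum feature differences over selected layers of the VGG network. The form below is an assumed sketch, not the patent's own formula: the symbols \phi_l (features of layer l), D_R (pseudo-label depth map), and D_T (thermal-image-based depth map) are notation introduced here, and the choice of the L1 norm is an assumption.

```latex
\mathcal{L}_{per}(D_R, D_T)
  = \sum_{l} \left\lVert \phi_l(D_R) - \phi_l(D_T) \right\rVert_1
```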

In addition, to further strengthen local and global information, a Patch-NetVLAD loss is applied, which estimates and then compares the local descriptors and global descriptors of the images.

Figure 3 is a diagram for explaining the Patch-NetVLAD loss according to this embodiment.

Referring to Figure 3, local descriptors and global descriptors are estimated by the Patch-NetVLAD model from the depth information predicted from the thermal image and from the depth information predicted from the color image (the pseudo-label), and the descriptors are then compared by a difference operation.

Through this, the depth information predicted from the thermal image comes to resemble the global and local depth information of the depth information predicted from the color image (the pseudo-label).

The Patch-NetVLAD loss is computed, using the NetVLAD network, from the global and local descriptors estimated from the first depth map and the global and local descriptors estimated from the second depth map.

Its formula is parameterized by the sizes of the several patches and by the total number of patches.
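The descriptor comparison can be sketched as follows. This is a hedged illustration, not the Patch-NetVLAD model or the patent's formula: the descriptors are assumed to have been extracted already and are passed in as plain lists, the difference operation is taken to be an L1 difference, and the per-size normalization by the number of patches is an assumption. The function name and the data layout (a dictionary from patch size to a list of per-patch descriptors) are hypothetical.

```python
def l1(a, b):
    """Sum of absolute differences between two descriptors."""
    return sum(abs(p - q) for p, q in zip(a, b))


def descriptor_difference_loss(global_ref, global_pred, local_ref, local_pred):
    """Compare one global descriptor and, per patch size, all local descriptors.

    local_ref / local_pred: dict mapping patch size -> list of per-patch descriptors.
    """
    loss = l1(global_ref, global_pred)
    for size, ref_patches in local_ref.items():
        pred_patches = local_pred[size]
        per_patch = [l1(r, p) for r, p in zip(ref_patches, pred_patches)]
        loss += sum(per_patch) / len(per_patch)  # average over the patches of this size
    return loss


# Toy descriptors: reference (pseudo-label) versus prediction.
g_ref, g_pred = [1.0, 0.0], [0.0, 0.0]
loc_ref = {2: [[1.0, 1.0], [0.0, 0.0]]}
loc_pred = {2: [[0.0, 1.0], [0.0, 0.0]]}

example_loss = descriptor_difference_loss(g_ref, g_pred, loc_ref, loc_pred)
zero_loss = descriptor_difference_loss(g_ref, g_ref, loc_ref, loc_ref)
```

The global term reacts to scene-level differences, while the per-patch terms localize where the prediction departs from the reference.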

As with the perceptual loss, comparing global descriptors improves the recognition and semantic information of the image.

In addition, comparing the local descriptors obtained from patches improves the local semantic information and fine detail of the predicted depth map.

In a self-guided framework that uses a pseudo-label, as in this embodiment, the quality of the pseudo-label determines the performance of the second depth network.

According to this embodiment, an image matching loss is proposed to improve the quality of the pseudo-label.

Because SSIM relies only on low-level differences between pixels, the image matching loss consists of a perceptual loss and a Patch-NetVLAD loss, which measure high-level similarity.

The perceptual loss and the comparison of global descriptors in the Patch-NetVLAD loss further strengthen the semantic information of the pseudo-label. In addition, the local descriptors in the Patch-NetVLAD loss improve accuracy in fine-detail regions.

The perceptual loss and the Patch-NetVLAD loss of the image matching loss are computed in the same way as described for the self-guided loss, except that the inputs are color images rather than depth maps.
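Written out, the image matching loss described above could take the following form. This is an assumed sketch: \hat{I} (the estimated color image) and I (the input color image) are notation introduced here, and the unweighted sum is an assumption, since the text does not state the relative weights of the two terms.

```latex
\mathcal{L}_{im}(\hat{I}, I)
  = \mathcal{L}_{per}(\hat{I}, I) + \mathcal{L}_{pnv}(\hat{I}, I)
```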

In conventional self-supervised learning, as in Figure 1, a new left image is synthesized from the predicted depth map and the right color image and then compared with the actual left image using an appearance matching loss. However, this loss has the drawback of using only SSIM and the L1 distance when comparing the two images.
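The view synthesis step can be illustrated on a single image row. The sketch below assumes a rectified stereo pair, works with disparity (which is inversely proportional to depth) rather than depth itself, samples the right view at x - d with linear interpolation, and uses only the L1 part of the appearance matching loss. The function names and the toy data are hypothetical.

```python
def synthesize_left_row(right_row, disparity_row):
    """Rebuild one row of the left view by sampling the right view at x - d."""
    width = len(right_row)
    synthesized = []
    for x, d in enumerate(disparity_row):
        src = min(max(x - d, 0.0), width - 1.0)  # clamp to the image border
        x0 = int(src)
        x1 = min(x0 + 1, width - 1)
        w = src - x0                              # linear interpolation weight
        synthesized.append((1.0 - w) * right_row[x0] + w * right_row[x1])
    return synthesized


def l1_distance(a, b):
    """Mean absolute difference between two rows."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


right_row = [0.0, 1.0, 2.0, 3.0, 4.0]
left_row = [0.0, 0.0, 1.0, 2.0, 3.0]  # the same scene shifted by one pixel

# A correct disparity of one pixel reproduces the left row; zero disparity does not.
loss_good = l1_distance(synthesize_left_row(right_row, [1.0] * 5), left_row)
loss_bad = l1_distance(synthesize_left_row(right_row, [0.0] * 5), left_row)
```

Because the loss is lowest when the predicted disparity aligns the two views, minimizing it trains the depth network without ground-truth depth.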

Because SSIM and the L1 distance look only at raw values, it is difficult to compare semantic information with them, and their ability to compare local information is also limited. To overcome these drawbacks, an image matching loss consisting of a VGG (perceptual) loss and a Patch-NetVLAD loss is added.

Adding the image matching loss improves the quality of the pseudo-label, which in turn improves the performance of the depth estimation model that takes thermal images as input.

The proposed self-guided framework yields a large performance improvement regardless of the depth estimation model to which it is applied and, as shown in Figure 4, achieves large improvements in both local and global regions even at night.

Figure 5 is a diagram showing the configuration of a depth estimation device according to this embodiment.

As shown in Figure 5, the device according to this embodiment may include a processor 500 and a memory 502.

The processor 500 may include a central processing unit (CPU) capable of executing a computer program, or another device such as a virtual machine.

The memory 502 may include a non-volatile storage device such as a fixed hard drive or a removable storage device. The removable storage device may include a CompactFlash unit, a USB memory stick, and the like. The memory 502 may also include volatile memory such as various kinds of random access memory, and may be defined as a computer-readable recording medium.

The memory 502 according to this embodiment stores program instructions for monocular depth estimation based on a single thermal image.

The program instructions according to this embodiment generate a first depth map by inputting one input color image of a stereo color image pair into a deep-learning-based first depth network; generate a second depth map by inputting the thermal image corresponding to the input color image into a deep-learning-based second depth network; repeat training of the first depth network by comparing the input color image with an estimated color image that is estimated using the first depth map; repeat training of the second depth network by comparing the second depth map with the first depth map obtained as training of the first depth network is repeated; and generate a second depth map from a newly input thermal image using the trained second depth network.

The program instructions according to this embodiment repeat training of the second depth network by computing a self-guided loss between the first depth map and the second depth map.

Here, the self-guided loss may include a first loss based on a combination of the structural similarity index measure (SSIM) and the L1 distance between the first depth map and the second depth map, a second loss computed from features obtained by inputting the first depth map and the second depth map into a VGG network, and a third loss computed from global descriptors and local descriptors extracted from each of the first depth map and the second depth map. The third loss is defined as the Patch-NetVLAD loss described above.
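The order of operations in the program instructions can be illustrated with a deliberately tiny stand-in. In this sketch each "network" is reduced to one scalar parameter, the losses are simple squared errors standing in for the appearance or image matching losses and the self-guided loss, and the updates are plain gradient descent. None of this is the patent's implementation; it only shows the pseudo-label being refreshed as the first network trains and being held constant while the second network is updated.

```python
def train(steps=200, lr=0.1, target=3.0):
    """Toy two-network schedule: r follows the scene, t follows the pseudo-label."""
    r = 0.0  # stand-in parameter of the first (color) depth network
    t = 0.0  # stand-in parameter of the second (thermal) depth network
    for _ in range(steps):
        # First network: gradient step on (r - target)^2, a stand-in for the
        # appearance matching and image matching losses.
        r -= lr * 2.0 * (r - target)
        # The first network's current output becomes the pseudo-label; it is
        # treated as a constant, so no gradient flows back into r from below.
        pseudo_label = r
        # Second network: gradient step on (t - pseudo_label)^2, a stand-in
        # for the self-guided loss.
        t -= lr * 2.0 * (t - pseudo_label)
    return r, t


r_final, t_final = train()
```

After training, only the second network is needed at inference time: a new thermal image is passed through it to obtain a depth map, with no color image required.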

The embodiments of the present invention described above are disclosed for illustrative purposes. Those of ordinary skill in the art will be able to make various modifications, changes, and additions within the spirit and scope of the present invention, and such modifications, changes, and additions should be regarded as falling within the scope of the claims below.

Claims

1. A single-thermal-image-based monocular depth estimation device comprising:
a processor; and
a memory connected to the processor,
wherein the memory stores program instructions that, when executed by the processor:
generate a first depth map by inputting one input color image of a stereo color image pair into a deep-learning-based first depth network;
generate a second depth map by inputting a thermal image corresponding to the input color image into a deep-learning-based second depth network;
repeat training of the first depth network by comparing the input color image with an estimated color image estimated using the first depth map;
repeat training of the second depth network by comparing the second depth map with the first depth map obtained as training of the first depth network is repeated; and
generate a second depth map from a newly input thermal image using the trained second depth network.

2. The device of claim 1,
wherein the program instructions
repeat training of the second depth network by computing a self-guided loss between the first depth map and the second depth map.

3. The device of claim 2,
wherein the self-guided loss includes a first loss based on a combination of the structural similarity index measure (SSIM) and the L1 distance between the first depth map and the second depth map, a second loss computed from features obtained by inputting the first depth map and the second depth map into a VGG network, and a third loss computed from global descriptors and local descriptors extracted from each of the first depth map and the second depth map.

4. The device of claim 3,
wherein the third loss is defined as a Patch-NetVLAD loss.

5. The device of claim 3,
wherein a preset weight is applied to each of the first loss and the third loss.

6. The device of claim 1,
wherein the first depth map is defined as a pseudo-label that is updated as training of the first depth network is repeated.

7. The device of claim 1,
wherein the program instructions
repeat training of the first depth network by computing an appearance matching loss and an image matching loss between the estimated color image and the input color image.

8. The device of claim 7,
wherein the image matching loss includes a perceptual loss computed from features obtained by inputting the estimated color image and the input color image into a VGG network, and a Patch-NetVLAD loss computed from global descriptors and local descriptors extracted from each of the estimated color image and the input color image.

9. A method of estimating monocular depth based on a single thermal image in a device including a processor and a memory, the method comprising:
generating a first depth map by inputting one input color image of a stereo color image pair into a deep-learning-based first depth network;
generating a second depth map by inputting a thermal image corresponding to the input color image into a deep-learning-based second depth network;
repeating training of the first depth network by comparing the input color image with an estimated color image estimated using the first depth map;
repeating training of the second depth network by comparing the second depth map with the first depth map obtained as training of the first depth network is repeated; and
generating a second depth map from a newly input thermal image using the trained second depth network.

10. A computer program stored on a computer-readable recording medium for performing the method according to claim 9.