KR102219561B1

KR102219561B1 - Unsupervised stereo matching apparatus and method using confidential correspondence consistency

Info

Publication number: KR102219561B1
Application number: KR1020180146709A
Authority: KR
Inventors: 손광훈; 정성훈
Original assignee: 연세대학교 산학협력단
Priority date: 2018-11-23
Filing date: 2018-11-23
Publication date: 2021-02-23
Also published as: KR20200063368A

Abstract

The present invention provides a stereo matching apparatus and method comprising: an encoder that includes two convolutional neural networks (CNNs) having the same structure and the same weights and pre-trained by unsupervised learning, and that extracts feature maps from an input stereo image; a matching cost calculator that calculates a matching cost volume between the feature maps; and a disparity map acquisition unit that obtains, for each pixel, the disparity that minimizes the matching cost volume among disparity candidates within a predetermined maximum disparity range, and generates a disparity map from the obtained disparities. During training, the two CNNs estimate positive samples from the disparity map obtained from the input stereo image, based on correspondence consistency under the epipolar constraint, propagate the estimated positive samples to adjacent pixels to generate training maps, and are trained by backpropagating the error between the training maps and the disparity map.

Description

Unsupervised learning stereo matching device and method based on correspondence point consistency {UNSUPERVISED STEREO MATCHING APPARATUS AND METHOD USING CONFIDENTIAL CORRESPONDENCE CONSISTENCY}

The present invention relates to a stereo matching apparatus and method, and more particularly to a stereo matching apparatus and method that use unsupervised learning based on correspondence consistency.

Stereo matching is a method for recognizing three-dimensional geometric structure from images and is used in a variety of fields, including stereo image reconstruction in computer vision systems, autonomous driving, advanced driver assistance systems (ADAS), and robotics.

Stereo matching estimates three-dimensional position information (depth information) from a stereo image consisting of two images taken from different viewpoints. Computing the matching cost, which measures the dissimilarity between corresponding points (correspondences) in the stereo image, is the core step of stereo matching.

However, matching cost computation is difficult because of the inherent matching ambiguity of the images themselves, which arises from occluded regions, textureless regions, illumination changes, and the like.

Various techniques have been proposed to overcome this difficulty. Recently, methods have been proposed that compute the matching cost from images using an artificial neural network trained by deep learning. In particular, convolutional neural networks (CNNs), which perform very well in image processing, are mainly used for matching cost computation.

When an artificial neural network is used to compute the matching cost, the network must be trained in advance so that it minimizes the matching cost between corresponding points in a stereo image. However, for the network to compute the matching cost reliably, it must undergo supervised learning, which requires a large number of training images with training labels.

Training labels can be obtained by measuring ground-truth depth values with 3D sensors such as structured light, LiDAR (Light Detection And Ranging), and laser scanners. However, measurement with 3D sensors is not only expensive but also subject to various errors depending on the operating conditions, so additional manual error correction is required.

As a result, too few training images with training labels are available, so the artificial neural network cannot be trained properly and the matching cost cannot be estimated properly.

Korean Patent Registration No. 10-1354387 (registered January 15, 2014)

An object of the present invention is to provide a stereo matching apparatus and method that use an artificial neural network trained by unsupervised learning, which requires no training labels.

Another object of the present invention is to provide a stereo matching apparatus and method that use an artificial neural network trained to minimize the matching cost by exploiting the epipolar constraint and correspondence consistency between the input stereo images.

Still another object of the present invention is to provide a stereo matching apparatus and method capable of reliable stereo matching in various environments, such as under illumination changes and outdoors.

To achieve the above objects, a stereo matching apparatus according to an embodiment of the present invention comprises: an encoder that includes two convolutional neural networks (CNNs) having the same structure and the same weights and pre-trained by unsupervised learning, and that extracts feature maps from an input stereo image; a matching cost calculator that calculates a matching cost volume between the feature maps; and a disparity map acquisition unit that obtains, for each pixel, the disparity that minimizes the matching cost volume among disparity candidates within a predetermined maximum disparity range, and generates a disparity map from the obtained disparities. During training, the two CNNs estimate positive samples from the disparity map obtained from the input stereo image, based on correspondence consistency under the epipolar constraint, propagate the estimated positive samples to adjacent pixels to generate training maps, and are trained by backpropagating the error between the training maps and the disparity map.

The two CNNs may be trained as follows. Based on correspondence consistency under the epipolar constraint, pixels of the disparity map for which the distance between corresponding points is less than a predetermined threshold are estimated as sparse positive samples, yielding a sparse positive sample map. The sparse positive samples in that map are propagated to adjacent pixels by interpolation to generate an interpolation map. Training samples are then estimated by applying correspondence consistency again to the interpolation map, and the training map is generated from the estimated training samples.

The interpolation map may be generated by propagating the sparse positive samples of the sparse positive sample map to adjacent pixels under a color similarity constraint, using the colors of the input stereo image as a guide.

The two CNNs may be trained by backpropagating the error between the disparity map input during training and the training map to update the weights, and may be trained iteratively by repeatedly generating training labels until the error falls within a predetermined reference error.

To achieve the above objects, a stereo matching method according to another embodiment of the present invention comprises: extracting feature maps from an input stereo image using two convolutional neural networks (CNNs) having the same structure and the same weights and pre-trained by unsupervised learning; calculating a matching cost volume between the feature maps; and obtaining, for each pixel, the disparity that minimizes the matching cost volume among disparity candidates within a predetermined maximum disparity range, and generating a disparity map from the obtained disparities. The two CNNs estimate positive samples from the disparity map obtained from the input stereo image, based on correspondence consistency under the epipolar constraint, propagate the estimated positive samples to adjacent pixels to generate training maps, and are trained by backpropagating the error between the training maps and the disparity map.

Accordingly, the stereo matching apparatus and method according to embodiments of the present invention estimate positive samples using the epipolar constraint and correspondence consistency between the input stereo images, and propagate the estimated positive samples to obtain a training sample map, so that the networks can be trained by unsupervised learning. In addition, reliable stereo matching can be performed in various environments, such as under illumination changes and outdoors.

FIG. 1 shows a schematic configuration of a stereo matching apparatus according to an embodiment of the present invention.
FIG. 2 shows a detailed configuration of the training map generator of FIG. 1.
FIG. 3 is a diagram for explaining the operation of each component of the stereo matching apparatus of FIG. 1.
FIG. 4 shows an example of a training map generated according to an embodiment of the present invention.
FIG. 5 shows a schematic configuration of a stereo matching method and its training method according to an embodiment of the present invention.
FIG. 6 shows the training map generation and training step of FIG. 5 in detail.
FIG. 7 shows the algorithm of the training map generation and training step of FIG. 6.
FIG. 8 shows examples of the maps produced while generating a training map according to an embodiment of the present invention.
FIG. 9 compares training samples according to an embodiment of the present invention with samples obtained by other unsupervised learning methods.
FIGS. 10 to 12 compare training samples for different CNN structures according to an embodiment of the present invention.
FIGS. 13 and 14 compare stereo matching results for different CNN structures according to an embodiment of the present invention.
FIGS. 15 and 16 show simulation results for the robustness to illumination of the unsupervised learning method according to the present embodiment.
FIGS. 17 and 18 show simulation results for stereo matching performance in an outdoor driving environment.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the invention, and to the content described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described. Parts unrelated to the description are omitted for clarity, and like reference numerals in the drawings denote like members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", "module", and "block" used in the specification refer to a unit that processes at least one function or operation, and such a unit may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 shows a schematic configuration of a stereo matching apparatus according to an embodiment of the present invention, FIG. 2 shows a detailed configuration of the training map generator of FIG. 1, and FIG. 3 is a diagram for explaining the operation of each component of the stereo matching apparatus of FIG. 1.

Referring to FIG. 1, the stereo matching apparatus 100 according to the present embodiment includes a stereo image input unit 110, an encoder 120, a matching cost calculator 130, a disparity map acquisition unit 140, and a training map generator 150.

The stereo image input unit 110 acquires the stereo image on which stereo matching is to be performed. Here, a stereo image is an image that can be acquired by a stereo camera and consists of two images with different viewpoints. Depending on the structure of the stereo camera, the stereo image may be acquired as top and bottom images or as left and right images; here, as an example, it is assumed that a left image (I^l) and a right image (I^r) are acquired, as shown in FIG. 3.

The encoder 120 encodes the left image (I^l) and the right image (I^r) acquired by the stereo image input unit 110 to obtain two feature maps (A^l, A^r) that allow the matching cost to be minimized.

As shown in FIG. 3, the encoder 120 includes two pre-trained convolutional neural networks (CNNs). The two CNNs have the same structure and are implemented as a Siamese CNN that performs the feed-forward process with the same weights (w). Because the two CNNs of the encoder 120 are trained in advance according to a designated pattern recognition technique, they obtain a left feature map (A^l) and a right feature map (A^r) from the input left image (I^l) and right image (I^r), respectively.

Because the two CNNs in the encoder 120 perform the feed-forward process (F_w) with the same weights (w), each pixel (i) of the two feature maps (A^l, A^r) is obtained as in Equation 1.
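As a rough illustration of the weight sharing described above, the sketch below applies one function with one set of weights to both images. The function `encode` is a hypothetical toy stand-in for the feed-forward process (a per-pixel linear map followed by ReLU), not the network of the embodiment; the image size and the number of feature channels are assumptions.

```python
import numpy as np

def encode(image, weights):
    """Toy stand-in for the feed-forward process: 1x1 convolution plus ReLU.

    image:   (H, W, 3) array
    weights: (3, K) array shared by the left and right branches
    returns: (H, W, K) feature map
    """
    return np.maximum(image @ weights, 0.0)

rng = np.random.default_rng(0)
w = rng.uniform(0.0, 1.0, size=(3, 8))   # one weight set, used by both branches
img_left = rng.random((4, 6, 3))          # hypothetical left image I^l
img_right = rng.random((4, 6, 3))         # hypothetical right image I^r

feat_left = encode(img_left, w)           # A^l
feat_right = encode(img_right, w)         # A^r
```

Because both branches call the same function with the same `w`, identical image patches always map to identical feature vectors, which is what makes the feature difference usable as a matching cost.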

The matching cost calculator 130 receives the feature maps (A^l, A^r) output from the encoder 120 and calculates the matching cost volumes (C^l, C^r) between the feature maps as in Equation 2.

(Here, ∥·∥₁ denotes the l₁-norm.)

Equation 2 gives the left matching cost volume (C^l) for the left feature map (A^l); the right matching cost volume (C^r) for the right feature map (A^r) can be calculated in a similar way.
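A minimal sketch of the left matching cost volume follows. It assumes that the cost for disparity d at pixel (i_x, i_y) is the l₁-norm of A^l(i_x, i_y) - A^r(i_x - d, i_y), which matches the description but is not copied from Equation 2. The function name, the array layout, and the use of +inf for positions that fall outside the right image are illustrative assumptions.

```python
import numpy as np

def l1_cost_volume(feat_left, feat_right, d_max):
    """Left cost volume with cost[d - 1, y, x] = ||A^l(x, y) - A^r(x - d, y)||_1.

    feat_left, feat_right: (H, W, K) feature maps
    d_max: largest disparity candidate; candidates are 1..d_max
    Positions where x - d lies outside the image keep the value +inf.
    """
    h, w, _ = feat_left.shape
    cost = np.full((d_max, h, w), np.inf)
    for d in range(1, d_max + 1):
        if d >= w:
            break  # no valid column remains for this disparity
        diff = np.abs(feat_left[:, d:, :] - feat_right[:, : w - d, :])
        cost[d - 1, :, d:] = diff.sum(axis=-1)
    return cost
```

A right cost volume can be built the same way by comparing A^r(i_x, i_y) with A^l(i_x + d, i_y).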

Once the matching cost calculator 130 has calculated the matching cost volumes (C^l, C^r), the disparity map acquisition unit 140 searches, for every pixel (i), for the disparity (d) that minimizes the matching cost volume (C^l, C^r) among the disparity candidates (d = {1, ..., d_max}) within the maximum disparity range (d_max), according to Equation 3, and thereby generates the disparity map.

Equation 3 likewise gives only the formula for the left disparity (d^l) as an example; the right disparity (d^r) can be calculated in a similar way.

According to Equation 3, the disparity map acquisition unit 140 obtains the disparity (d) with the minimum matching cost within the disparity search range in a winner-takes-all (WTA) manner.
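The winner-takes-all selection can be sketched as an argmin over the disparity axis of a cost volume. The layout (disparity index first, with index k standing for disparity k + 1) and the function name are assumptions made for this example.

```python
import numpy as np

def wta_disparity(cost_volume):
    """Pick, per pixel, the disparity candidate with the lowest matching cost.

    cost_volume: (d_max, H, W) array; slice k holds the cost of disparity k + 1.
    returns:     (H, W) integer disparity map with values in 1..d_max.
    Pixels whose costs are all equal (for example all +inf) fall back to disparity 1.
    """
    return np.argmin(cost_volume, axis=0) + 1
```

For example, a 3 x 2 x 2 cost volume whose minimum at pixel (0, 0) sits in slice 2 yields disparity 3 at that pixel.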

That is, by using a Siamese CNN with the same structure and the same weights (w), the stereo matching apparatus 100 according to the present embodiment finds, from the input stereo images (I^l, I^r), the dense correspondences that satisfy I^l(i_x, i_y) = I^r(i_x - d^l(i), i_y), and thereby estimates the left disparity (d^l(i)) for each pixel (i).

To estimate a suitable left disparity (d^l(i)) among the disparity candidates (d = {1, ..., d_max}) within the maximum disparity range (d_max), the stereo matching apparatus 100 calculates the matching cost (C^l(i)) between I^l(i_x, i_y) and I^r(i_x - d, i_y). The left disparity (d^l(i)) may be chosen to minimize the matching cost (C^l(i)) according to the WTA technique, and the left disparity map (d^l) can then be obtained from the left disparities (d^l(i)).

The right disparity map (d^r) can be obtained in a similar manner.

The stereo matching apparatus 100 has been described above as obtaining both the left disparity map (d^l) and the right disparity map (d^r). However, if both maps are obtained accurately for a noise-free stereo image, they are mutually symmetric. The stereo matching apparatus 100 may therefore output at least one of the left disparity map (d^l) and the right disparity map (d^r) as the disparity map (d) that is the stereo matching result.

Meanwhile, as described above, the CNN must be trained in advance before it can be used to compute the matching cost, and this requires a large number of training images with training labels. Such images are difficult to obtain, so training the CNN is not easy.

The present embodiment therefore additionally proposes a training map generator 150, so that training the CNNs of the encoder 120 does not require training images with training labels; instead, training labels are generated from stereo images that have no training labels.

The training map generator 150 generates the training labels (or training images with training labels) used to train the Siamese CNN of the encoder 120. That is, it is a component of the training device used to train the stereo matching apparatus 100 according to the present embodiment, and it may be omitted once the Siamese CNN of the encoder 120 has been trained.

During training, the stereo image input unit 110 may acquire a plurality of stereo images to train the Siamese CNN of the encoder 120. Here, as an example, it is assumed that the stereo image input unit 110 acquires T stereo images, where T is a natural number.

During training, the initial values of the weights (w) of the Siamese CNN of the encoder 120 are selected randomly from a predetermined range (for example, 0 < w ≤ 1).

During training, the Siamese CNN of the encoder 120 does not yet function properly, so the matching cost volume (C) and the disparity maps (d^l, d^r) obtained from the feature maps (A^l, A^r) output by the encoder 120 are also unreliable.

However, the training map generator 150 of the present invention estimates reliable positive samples from the disparity maps (d^l, d^r), which are obtained from the feature maps (A^l, A^r) output by the untrained encoder 120, based on correspondence consistency under the epipolar constraint. By propagating the estimated positive samples, it generates a training map with which the Siamese CNN of the encoder 120 can be trained.

The training map generator 150 repeatedly generates training maps for the same input stereo image, gradually increasing the reliability of the training map. By generating a training map for each of the plurality of stereo images, it enables the Siamese CNN to reliably extract feature maps (A^l, A^r) that minimize the matching cost for a variety of stereo images.

Referring to FIG. 3, the training map generator 150 may include a positive sample extractor 151, an interpolation map generator 153, and a training map acquisition unit 155.

The positive sample extractor 151 receives the left disparity map (d^l) and the right disparity map (d^r) obtained by the disparity map acquisition unit 140, and searches for and extracts positive samples among the disparities (d^l(i), d^r(i)) of the two maps, based on correspondence consistency under the epipolar constraint.

In stereo matching, each pixel of the left image has at most one matching pixel in the right image along the epipolar line, and vice versa. That is, when a pixel in the left image is matched to a pixel in the right image, that corresponding pixel in the right image must in turn match back to the same pixel in the left image.

Using this correspondence consistency under the epipolar constraint, reliable positive samples can be extracted from the left disparity map (d^l) and the right disparity map (d^r) as in Equation 4.

Here, t is a predetermined threshold.
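The consistency test can be sketched as a left-right check. The code below assumes the common form |d^l(i) - d^r(i_x - d^l(i), i_y)| < t, where a left pixel at column i_x corresponds to the right pixel at column i_x - d^l(i); this form, the function name, and the rounding of disparities to integer columns are assumptions rather than a transcription of Equation 4.

```python
import numpy as np

def left_consistency_mask(disp_left, disp_right, t=1.0):
    """Mark left-view pixels whose disparity agrees with the right view.

    disp_left, disp_right: (H, W) disparity maps d^l and d^r
    t: consistency threshold
    returns: (H, W) boolean mask, True for candidate positive samples
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :]
    ys = np.arange(h)[:, None]
    # Column of the corresponding pixel in the right view.
    x_right = xs - np.rint(disp_left).astype(int)
    in_bounds = (x_right >= 0) & (x_right < w)
    # Clip only so that indexing is safe; out-of-bounds pixels are rejected below.
    x_safe = np.clip(x_right, 0, w - 1)
    disp_back = disp_right[ys, x_safe]
    return in_bounds & (np.abs(disp_left - disp_back) < t)
```

Pixels that fail the test, or whose correspondence falls outside the image, are simply left out of the mask; the same check with the roles of the two maps swapped gives the mask for the right view.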

That is, according to Equation 4, pixels for which the distance between corresponding points in the left disparity map (d^l) and the right disparity map (d^r) is less than the threshold (t) are estimated as sparse positive samples.

The positive sample extractor 151 then judges all pixels other than the estimated sparse positive samples to be inconsistent, unreliable pixels and removes them.

The estimated sparse positive samples may be distributed over the entire image domain, but they may include erroneous samples because of ambiguity at object boundaries and in textureless regions, and this can degrade training performance.

The positive sample extractor 151 may therefore additionally apply a color similarity constraint, which states that positive samples estimated by correspondence consistency in the left disparity map (d^l) and the right disparity map (d^r) have similar color values. In this case, only those estimated positive samples whose color difference is less than a predetermined reference value are kept as sparse positive samples.

The interpolation map generator 153 then propagates the sparse positive samples to the surrounding pixels to generate an interpolation map (p). To do so, it applies an interpolation technique that propagates the sparse positive samples to adjacent pixels using the input color image as a guide.

In the repeated training process, the interpolation map generator 153 performs interpolation so that the total energy function (J(p)) of Equation 5 is minimized, and thereby obtains the interpolated pixels (p_i).

Here, λ is a constant that controls the balance between the sparse positive samples estimated from the two disparity maps (d^l, d^r); h_i is an indicator function that is 1 for a valid pixel and 0 otherwise; and N₄(i) denotes the set of pixels adjacent to the interpolated pixel (p_i). The smoothness constraint is enforced adaptively using the spatially varying weight function w_i,j(I), which is defined together with the range parameter (σ_c).
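Equation 5 and the weight function are not reproduced in this text. One form consistent with the definitions above is the usual weighted-least-squares energy, given here as an assumption rather than as the patent's exact expression. The symbol \tilde{d}_i (the sparse positive sample at pixel i) and the exponential form of the weight are introduced only for illustration.

```latex
J(p) = \sum_{i} \Big( h_i \,\big(p_i - \tilde{d}_i\big)^2
       + \lambda \sum_{j \in N_4(i)} w_{i,j}(I)\,\big(p_i - p_j\big)^2 \Big),
\qquad
w_{i,j}(I) = \exp\!\big( -\lVert I_i - I_j \rVert / \sigma_c \big)
```

Under this reading, the first term keeps p close to the sparse samples at valid pixels (h_i = 1), and the second term smooths p between neighbors except across color edges, where w_i,j(I) becomes small.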

The interpolation map (p), which is the interpolated disparity map made up of the interpolated pixels (p_i), can be computed by Equation 6.

Here, the two vectors appearing in Equation 6 are the S × 1 column vectors of the interpolation map (p) and of the sparse positive sample map, each having S pixels; h denotes the index vector of the sparse training samples; and m (m ∈ {0, ..., S-1}) denotes the scalar index corresponding to each pixel (i). I denotes the identity matrix, and L denotes the spatially varying Laplacian matrix defined by Equation 7.

Here, m and n denote scalar indices corresponding to pixels, and N₄(m) denotes the set of pixels adjacent to pixel m.
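The propagation step can be sketched in one dimension. Equations 6 and 7 are not reproduced in this text, so the sketch assumes the normalized form p(m) = ((I + λL)⁻¹ d̃)(m) / ((I + λL)⁻¹ h)(m), with L(m, m) equal to the sum of the neighbor weights and L(m, n) = -w_m,n for adjacent pixels. On a single scanline each pixel has only two neighbors, so I + λL is tridiagonal and can be solved directly. The function names, the exponential weight, and the requirement that the guide be normalized to [0, 1] are assumptions; λ = 20² and σ_c = 0.015 follow the simulation settings given later.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] must be 0."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def interpolate_scanline(sparse, valid, guide, lam=400.0, sigma_c=0.015):
    """Propagate sparse disparities along one scanline, guided by intensity.

    sparse: disparity values (ignored where valid is 0)
    valid:  1 where a sparse sample exists, else 0
    guide:  guide intensities, assumed normalized to [0, 1]
    """
    n = len(sparse)
    # Edge-aware weights between horizontally adjacent pixels.
    w = [math.exp(-abs(guide[i] - guide[i + 1]) / sigma_c) for i in range(n - 1)]
    lower = [0.0] + [-lam * w[i - 1] for i in range(1, n)]
    upper = [-lam * w[i] for i in range(n - 1)] + [0.0]
    diag = [1.0 - lower[i] - upper[i] for i in range(n)]  # I + lam * L
    num = solve_tridiagonal(lower, diag, upper,
                            [s * v for s, v in zip(sparse, valid)])
    den = solve_tridiagonal(lower, diag, upper, [float(v) for v in valid])
    return [a / b if b > 0 else 0.0 for a, b in zip(num, den)]
```

With a flat guide, a single sample spreads unchanged over the whole line; with an intensity step in the guide, samples on either side stay on their own side of the edge.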

The interpolation map generation unit 153 passes the obtained interpolation map (p) to the positive sample extraction unit 151, which again applies correspondence consistency to the interpolation map (p) and takes the resulting positive samples as training samples. The positive sample extraction unit 151 also obtains the spatial locations (Ω) of the training samples and passes both to the learning map acquisition unit 155.

The learning map acquisition unit 155 obtains a learning map from the training samples and locations (Ω) received from the positive sample extraction unit 151.

The learning map acquisition unit 155 then backpropagates the error between the learning map and the disparity map (d), which the disparity map acquisition unit 140 obtained from the feature maps (A^l, A^r) currently produced by the Siamese CNN of the encoder 120. As a result, the weights (w) of the Siamese CNN are updated according to an error estimated from training labels (or from training images containing training labels). In other words, the Siamese CNN is trained. This training may be repeated until the error falls within a predetermined reference error, and it is performed for each of a plurality of input stereo images.

Looking at the detailed operation of the learning map generation unit 150 in FIG. 3, the positive sample extraction unit 151 first extracts the sparse positive samples from the disparity map (d), and the interpolation map generation unit 153 then generates the interpolation map (p) and passes it back to the positive sample extraction unit 151. The positive sample extraction unit 151 extracts the training samples and their locations (Ω) from the interpolation map (p) and passes them to the learning map acquisition unit 155, which uses them to obtain the learning map.

As described above, when the learning map generation unit 150 extracts positive samples from the input stereo images using correspondence consistency under the epipolar constraint and propagates the extracted positive samples to obtain training labels, there is no need to acquire separate training images that contain training labels.

In addition, the network can be trained to perform stereo matching reliably even on stereo images captured in environments with strong illumination changes, such as outdoors, and to achieve high stereo matching performance despite occlusions and the ambiguity of textureless regions.

The images on the right side of FIG. 3 show the left disparity map, the left sparse positive sample map, and the left learning map, respectively. As shown in FIG. 3, objects are represented more accurately in the sparse positive sample map than in the initial disparity map. The learning map, in turn, fills in the pixels missing from the sparse positive sample map, so its image quality is improved.

Only the left disparity map, left sparse positive sample map, and left learning map are shown here as a simple example, but the learning map generation unit 150 obtains the right disparity map, right sparse positive sample map, and right learning map in the same way.

FIG. 4 shows an example of a learning map generated according to an embodiment of the present invention.

In FIG. 4, (a) shows the left image of the input stereo pair, (b) shows the learning map generated according to this embodiment, and (c) shows the disparity map generated from the ground truth provided by the KITTI benchmark.

As shown in FIG. 4, the learning map obtained according to this embodiment from stereo images without ground truth provides a disparity map for training that is of very good quality when compared with the ground-truth disparity map.

Meanwhile, the Siamese CNN of FIG. 3, which receives the left and right images of the stereo pair respectively and generates the feature maps (A^l, A^r), can be designed in various ways.

As an example, this embodiment proposes two Siamese CNN structures: a simple CNN structure and a precise CNN structure.

The simple CNN structure is shown in Table 1. For fast operation, the simple CNN includes four convolution layers (conv1 to conv4), each built from 3 × 3 convolution kernels. A ReLU (Rectified Linear Unit) is applied as the activation function to three of the convolution layers (conv1 to conv3). ReLU passes only the positive part of the encoding result of each of these layers on to the next layer.

In the last convolution layer (conv4), an L2 normalization layer (l₂-norm) is added in place of ReLU so that negative encoding results are preserved.

Correlation in Table 1 corresponds to the matching cost calculation unit 130 of FIG. 1. To obtain the matching cost volume (C) for an input stereo image of size h × w, the matching cost calculation unit 130 requires h × w × d_max operations over the disparity candidates (d = {1, ..., d_max}).
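The correlation step and the subsequent per-pixel selection can be sketched for one scanline as follows. This is an illustration under assumptions, not the patent's implementation: features are plain Python lists, the matching cost is taken to be the negative inner product of L2-normalized feature vectors, left pixel x is compared with right pixel x - d, and out-of-range candidates receive infinite cost. All function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (negative entries are kept)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cost_volume_scanline(feat_left, feat_right, d_max):
    """cost[x][d - 1] = -<A_l(x), A_r(x - d)> for d = 1..d_max."""
    fl = [l2_normalize(f) for f in feat_left]
    fr = [l2_normalize(f) for f in feat_right]
    volume = []
    for x in range(len(fl)):
        row = []
        for d in range(1, d_max + 1):
            xr = x - d
            if xr < 0:
                row.append(float("inf"))  # candidate lies outside the image
            else:
                row.append(-sum(a * b for a, b in zip(fl[x], fr[xr])))
        volume.append(row)
    return volume


def winner_takes_all(volume):
    """Pick the minimum-cost disparity per pixel (None if none is in range)."""
    result = []
    for row in volume:
        best = min(range(len(row)), key=row.__getitem__)
        result.append(None if row[best] == float("inf") else best + 1)
    return result
```

For a scanline of width w the double loop evaluates w × d_max candidates, so a full h × w image needs h × w × d_max evaluations, matching the count stated above.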

The simple CNN of Table 1 can achieve a reasonable level of performance, but because it is a single-scale structure that does not change the size of the feature maps output by the convolution layers (conv1 to conv4), its discriminative power and receptive field are limited.

The precise CNN structure shown in Table 2 uses a U-net structure that includes sub-sampling layers and skip connections. This increases the multi-scale information available through the feed-forward process.

The precise CNN structure contains a contracting path and an expansive path. In the contracting path, as in a typical CNN, two 3 × 3 convolution kernels are applied in succession, each followed by batch normalization and ReLU, and max pooling is then performed for downsampling. The stride may be set to 2. The number of channels is doubled at each downsampling step.
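The effect of the contracting path on feature map size can be tracked with a short helper. Table 2 is not reproduced in this text, so the base channel count (32) and the number of downsampling steps (3) are hypothetical placeholders; only the stride of 2 and the doubling of channels per step come from the description.

```python
def contracting_path_shapes(height, width, base_channels=32, levels=3):
    """Return (height, width, channels) after each downsampling step."""
    shapes = [(height, width, base_channels)]
    h, w, c = height, width, base_channels
    for _ in range(levels):
        # Stride-2 max pooling halves the resolution; channels are doubled.
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes
```

For a 256 × 512 training patch, three such steps would take the feature map from 256 × 512 × 32 down to 32 × 64 × 256 under these assumed values; the expansive path would then walk the same list in reverse.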

In the expansive path, by contrast, each step upsamples the feature map, applies an up-convolution that halves the number of channels, and then applies two 3 × 3 convolution kernels with batch normalization and ReLU. In the last layer, an L2 normalization layer (l₂-norm) is added in place of batch normalization and ReLU so that negative encoding results are preserved.

In Table 2 as well, Correlation corresponds to the matching cost calculation unit 130 of FIG. 1 and computes the matching cost volume (C).

Although two structures, a simple CNN structure and a precise CNN structure, have been proposed above as examples of the Siamese CNN, the present invention is not limited to them.

FIG. 5 shows a schematic configuration of a stereo matching method and its training method according to an embodiment of the present invention, FIG. 6 shows the learning map generation and training step of FIG. 5 in detail, and FIG. 7 shows the algorithm of the learning map generation and training step of FIG. 5.

The stereo matching method and its training method according to this embodiment are described with reference to FIGS. 5 to 7. First, to train the stereo matching apparatus, a plurality of training stereo images are input to the stereo image input unit 110 (S10). Here, the training stereo images are simply stereo images used for training; they are ordinary stereo images that contain no training labels. In other words, this embodiment does not require specially prepared training stereo images with training labels in order to train the stereo matching apparatus.

For each of the training stereo images, a training disparity map (d) is obtained in the same way as in stereo matching, and a learning map is then generated from the obtained disparity map (d) and used for training (S20).

The learning map generation and training step is described in detail with reference to FIGS. 6 and 7. In the encoder 120, two Siamese CNNs that have the same structure and the same weights (w), and whose training is not yet complete, extract training feature maps (A^l, A^r) from the training stereo images (S21). The matching cost calculation unit 130 then computes training matching cost volumes (C^l, C^r) from the feature maps (A^l, A^r) (S22).

For every pixel (i) of the obtained matching cost volumes (C^l, C^r), the disparity map acquisition unit 140 obtains, according to Equation 3, the disparity (d) among the disparity candidates (d = {1, ..., d_max}) that minimizes the matching cost volume (C^l, C^r), and thereby generates the training disparity map (d) (S23).

The steps from training feature map extraction (S21) to disparity map generation (S23) are the same as the procedure by which the stereo matching apparatus performs stereo matching. The only difference, as noted above, is that during training the Siamese CNN of the encoder 120 has not yet finished training.

Once the training disparity map (d) is obtained, the positive sample extraction unit 151 of the learning map generation unit 150 estimates sparse positive samples from the training disparity map (d) based on correspondence consistency under the epipolar constraint (S24). The positive sample extraction unit 151 removes all pixels other than the estimated sparse positive samples.

The interpolation map generation unit 153 then generates the interpolation map (p) by obtaining the interpolated pixels (p_i) with an interpolation technique that propagates the sparse positive samples to adjacent pixels, using the input color image as a guide (S25).

The positive sample extraction unit 151 again applies correspondence consistency to the interpolation map (p) and estimates the training samples and their spatial locations (Ω) (S26). The learning map acquisition unit 155 obtains the learning map from the training samples and locations (Ω) received from the positive sample extraction unit 151 (S27).

Once the learning map is obtained, the error between the disparity map (d) obtained by the disparity map acquisition unit 140 and the learning map is backpropagated so that the weights (w) of the Siamese CNN are updated (S28). In other words, the Siamese CNN is trained.

This training may be repeated until the error falls within a predetermined reference error, and it is repeated for each of a plurality of input stereo images.

Referring again to FIG. 5, after training of the Siamese CNN of the encoder 120 is complete, the stereo images on which stereo matching is to be performed are input to the stereo image input unit 110.

The two trained Siamese CNNs in the encoder 120 extract the feature maps (A^l, A^r) (S40). The matching cost calculation unit 130 computes the matching cost volumes (C^l, C^r) from the extracted feature maps (A^l, A^r) (S50).

The disparity map acquisition unit 140 performs stereo matching by obtaining, according to Equation 3, the disparity (d) that minimizes the matching cost volumes (C^l, C^r) and generating the disparity map (d) (S60).

Meanwhile, because the learning map generation unit 150 according to an embodiment of the present invention removes inconsistent pixels using correspondence consistency and the procedure includes error backpropagation, a loss can arise. This permits a per-pixel softmax loss over all possible disparity candidates.

In this embodiment, an entropy loss is computed for each pixel over the disparity candidates, and the entropy loss is defined by Equation 8.

Here, s is defined over all disparities as s_x = {i_x - 1, ..., i_x - d_max}; P_T(s;i) is a label that is 1 for the sampled training set and 0 otherwise; and P(s;i) is the softmax probability defined by Equation 9.

Here, v, like s, is defined over all disparities as v_x = {i_x - 1, ..., i_x - d_max}. In Equation 9, a negative sign is applied to convert the similarity score into a matching cost.
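Equations 8 and 9 are not reproduced in this text. A common reading of the description is a softmax over the disparity candidates followed by a cross-entropy against a one-hot label, evaluated only at pixels that have a training sample. The sketch below follows that reading as an assumption. It takes matching costs as input and negates them to obtain scores, and it represents the label P_T as the index of the target candidate, or None where no training sample exists. The function names are hypothetical.

```python
import math


def softmax_probabilities(costs):
    """P(s; i) over candidates; a lower matching cost gives a higher probability."""
    scores = [-c for c in costs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_loss(cost_rows, labels):
    """Sum of -log P(label; i) over pixels that carry a training sample.

    cost_rows[i] holds the matching costs of pixel i over all candidates;
    labels[i] is the index of the target candidate, or None to skip pixel i.
    """
    loss = 0.0
    for costs, label in zip(cost_rows, labels):
        if label is None:
            continue  # pixel lies outside the sampled training set
        loss -= math.log(softmax_probabilities(costs)[label])
    return loss
```

With four equally likely candidates the loss for one labeled pixel is log 4, and unlabeled pixels contribute nothing, which mirrors the restriction of the loss to the sampled locations.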

Even with the trained CNN, the raw matching cost volume may not be sufficient to produce an accurate disparity map. Errors can occur in textureless and occluded regions in particular. As in conventional stereo matching techniques, post-processing may therefore be added.

Post-processing may include global matching followed by a consistency check, sub-pixel enhancement, and median and bilateral filtering.

Semi-Global Matching reduces the matching cost by applying a smoothness constraint to the disparity map.

The energy function E(d) of the disparity map (d) can be defined as in Equation 10.

In Equation 10, the first term penalizes disparities with a high matching cost, and the second and third terms penalize discontinuous disparities, subject to the condition P₁ < P₂.
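Equation 10 itself is not reproduced in this text. The description matches the standard semi-global matching energy, shown here as an assumed form; the neighborhood symbol N(i) and the indicator notation are introduced only for illustration.

```latex
E(d) = \sum_{i} \Big( C\big(i, d(i)\big)
       + \sum_{j \in N(i)} P_1 \cdot \mathbf{1}\big\{ \lvert d(i) - d(j) \rvert = 1 \big\}
       + \sum_{j \in N(i)} P_2 \cdot \mathbf{1}\big\{ \lvert d(i) - d(j) \rvert > 1 \big\} \Big)
```

Under this reading, P₁ is charged for a disparity change of exactly one between neighbors and the larger P₂ for any bigger jump, so smooth surfaces are preferred while true depth edges remain affordable.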

Semi-global matching obtains an approximate solution by performing one-dimensional cost updates along several directions in a dynamic programming style, typically optimizing along two horizontal and two vertical directions. To minimize the energy function E(d) of Equation 10 along a direction r, the matching cost L_r(i, d(i)) along direction r is defined as in Equation 11.
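The one-dimensional update can be sketched for a single left-to-right path. Equation 11 is not reproduced in this text, so the sketch uses the standard semi-global matching recurrence as an assumption: L_r(i, d) = C(i, d) - min_k L_r(i - r, k) + min{L_r(i - r, d), L_r(i - r, d ± 1) + P₁, min_k L_r(i - r, k) + P₂}. The function name and the list-of-lists layout are hypothetical.

```python
def aggregate_path(costs, p1, p2):
    """Aggregate matching costs along one left-to-right scanline path.

    costs[x][d] is the matching cost of pixel x at disparity index d.
    Returns the aggregated costs with the same layout.
    """
    n_disp = len(costs[0])
    inf = float("inf")
    aggregated = [list(costs[0])]  # the first pixel has no predecessor
    for x in range(1, len(costs)):
        prev = aggregated[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            best = min(
                prev[d],                                  # same disparity
                prev[d - 1] + p1 if d > 0 else inf,       # change of one
                prev[d + 1] + p1 if d + 1 < n_disp else inf,
                prev_min + p2,                            # larger jump
            )
            # Subtracting prev_min keeps the values from growing along the path.
            row.append(costs[x][d] + best - prev_min)
        aggregated.append(row)
    return aggregated
```

Running the same routine on the reversed scanline and on columns gives the other three directions, whose results can then be combined.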

After the costs for the four directions are averaged using Equation 11, the initial disparity maps for the left and right views can be computed.

The initial disparity map can then be improved by post-processing such as interpolation, sub-pixel enhancement, and refinement. Interpolation performs a left-right consistency check to resolve conflicts between the left and right disparity maps, and sub-pixel enhancement applies a quadratic function to adjacent pixels to obtain a disparity map with sub-pixel precision.
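Sub-pixel enhancement is commonly done by fitting a parabola through the cost at the selected disparity and at its two neighbors, then taking the parabola's minimum. The sketch below shows that common technique as an assumption about what the quadratic function refers to; the function name and the convention that `costs[d]` is the cost at integer disparity d are hypothetical.

```python
def subpixel_disparity(costs, d):
    """Refine integer disparity d using a parabola through three costs."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)  # no neighbor on one side; keep the integer value
    c_minus, c_zero, c_plus = costs[d - 1], costs[d], costs[d + 1]
    denom = c_minus - 2 * c_zero + c_plus
    if denom <= 0:
        return float(d)  # not a strict minimum; the fit would be unreliable
    # Vertex of the parabola through (-1, c_minus), (0, c_zero), (1, c_plus).
    return d - (c_plus - c_minus) / (2 * denom)
```

For costs sampled from (x - 1.25)² at x = 0, 1, 2, 3, the integer minimum is at 1 and the refined value is 1.25, because the parabola fit is exact for a quadratic cost.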

Then, to smooth the disparity map without blurring edges, a 5 × 5 median filter and a bilateral filter can be applied in sequence.

The results of simulating the performance of the stereo matching apparatus and method according to this embodiment are presented below.

The simulation environment was built with the VLFeat MatConvNet toolbox. The stereo images used for training were patches randomly cropped to 256 × 512 from the well-known KITTI, Middlebury, HCI, and Yonsei benchmark stereo images, and the maximum disparity range (d_max) was set to 228, 116, 140, and 80.

Training was performed in an unsupervised manner on 194 and 200 stereo pairs from KITTI 2012 and KITTI 2015, respectively, and on 27 stereo pairs from the 2005 and 2006 Middlebury data. In particular, to evaluate robustness to illumination changes, Middlebury stereo images captured under three different conditions were used. To evaluate a variety of outdoor driving conditions, the HCI benchmark, which contains 11 challenging scenarios, and 53 Yonsei benchmark stereo images with differing time-of-day and seasonal conditions were also used.

The threshold (t) was set to 1, the balance constant (λ) to 20², and the range parameter (σ_c) to 0.015. For training optimization, the initial learning rate, momentum, and weight decay were set to 0.001, 0.9, and 0.005, respectively.

Table 3 compares, by simulation, the unsupervised training method that uses the learning map generated according to this embodiment with other unsupervised training methods, and gives a quantitative comparison of their error rates on the MIDDLEBURY, KITTI 2012, and KITTI 2015 training data.

Table 3 also compares the cases in which correspondence consistency and positive sample propagation are and are not performed (w/o PP) in this embodiment, using the simple CNN. Image sampling of a spatial transformer network (STN) and soft-argmin were applied to the final matching cost volume, and the cases in which the learning map is generated using color consistency (Col. consistency) or color and disparity consistency (Col. Disp. Consistency) were compared.

As shown in Table 3, the unsupervised training method according to this embodiment outperforms the other unsupervised training methods even without correspondence consistency and positive sample propagation, and performs better still when they are applied.

Table 4 shows a performance comparison according to the CNN structure used in the encoder 120 of this embodiment. Table 4 compares Census, MC-CNN fast, Content-CNN, and MC-CNN-WS with the simple CNN and precise CNN of Tables 1 and 2 and, as in Table 3, gives a quantitative comparison of error rates for unsupervised training on the MIDDLEBURY, KITTI 2012, and KITTI 2015 training data.

As shown in Table 4, with the CNN structures according to this embodiment, even the simple CNN performs well compared with the other CNN structures, and the precise CNN performs very well.

FIG. 8 shows examples of the maps produced during generation of the learning map according to an embodiment of the present invention.

In FIG. 8, (a) shows the left image of the input stereo pair supplied to the stereo image input unit 110, (b) shows the disparity map of the left image obtained by the disparity map acquisition unit 140 using the WTA technique, and (c) shows the sparse positive sample map obtained by the positive sample extraction unit 151 of the learning map generation unit 150 using correspondence consistency.

In addition, (d) shows the interpolation map (p) produced by the interpolation map generation unit 153, (e) shows the learning map obtained by the positive sample extraction unit 151 by again applying correspondence consistency to the interpolation map (p), and (f) shows the ground-truth disparity map.

FIG. 9 compares the training samples according to an embodiment of the present invention with samples obtained by other unsupervised training methods.

In FIG. 9, (a) shows the left image of the input stereo pair, (b) shows the sparse positive sample map obtained by applying correspondence consistency and a total variation threshold, and (c) shows the result of applying the MC-CNN stereo algorithm with a confidence threshold.

Panels (d) to (h) show the learning maps obtained in this embodiment with the initial random weights and with the number of iterations set to 1, 3, 5, and 7, and (i) shows the ground-truth disparity map.

FIGS. 8 and 9 show that the learning map generated according to this embodiment provides a training map of good quality even when compared with the ground-truth disparity map. They also show that repeating the procedure for obtaining the learning map yields progressively better learning maps.

도10 내지 도12 는 본 발명의 실시예에 따른 CNN의 구조에 따른 학습 샘플을 비교한 도면이다.10 to 12 are diagrams comparing training samples according to the structure of a CNN according to an embodiment of the present invention.

도10 내지 도12 는 각각 Middlebury, KITTI 2012 및 KITTI 2015 데이터 집합에 대해 디스패리티 맵을 획득한 결과를 나타낸다.10 to 12 show results of obtaining a disparity map for the Middlebury, KITTI 2012 and KITTI 2015 data sets, respectively.

도10 내지 도12 각각에서 에서 왼쪽 영상은 입력 스테레오 영상의 좌 영상을 나타내고, 가운데 영상은 표1 의 심플 CNN 구조를 이용하여 획득된 디스패리티 맵을 나타내고, 우측 영상은 표2 의 정밀 CNN 구조를 이용하여 획득된 디스패리티 맵을 나타낸다.In each of Figs. 10 to 12, the left image represents the left image of the input stereo image, the middle image represents the disparity map obtained using the simple CNN structure of Table 1, and the right image represents the precise CNN structure of Table 2. Represents a disparity map obtained by using.

도10 및 도12 를 참조하면, 비록 심플 CNN 구조보다 정밀 CNN 구조에서 더욱 정확한 디스패리티 맵이 획득됨을 알 수 있으나, 심플 CNN 구조 및 정밀 CNN 구조 모두에서 학습 맵으로 이용하기에 양호한 디스패리티 맵을 획득할 수 있음을 알 수 있다.10 and 12, although it can be seen that a more accurate disparity map is obtained in a precise CNN structure than in a simple CNN structure, a disparity map that is good for use as a learning map in both the simple CNN structure and the precise CNN structure is used. You can see that you can get it.

도13 및 도14 는 본 발명의 실시예에 따른 CNN의 구조에 따른 스테레오 매칭 결과를 비교한 도면이다.13 and 14 are views comparing stereo matching results according to the structure of a CNN according to an embodiment of the present invention.

도13 및 도14 는 각각 KITTI 2012 및 KITTI 2015 데이터 집합에 대해 스테레오 매칭 결과로 획득한 디스패리티 맵을 나타내었으며, 왼쪽 영상은 입력 스테레오 영상의 좌 영상을 나타내고, 가운데 영상은 표1 의 심플 CNN 구조를 이용하여 획득된 디스패리티 맵을 나타내고, 우측 영상은 표2 의 정밀 CNN 구조를 이용하여 획득된 디스패리티 맵을 나타낸다.13 and 14 show disparity maps obtained as a result of stereo matching for KITTI 2012 and KITTI 2015 data sets, respectively. The left image shows the left image of the input stereo image, and the middle image shows the simple CNN structure of Table 1. The disparity map obtained using is shown, and the image on the right shows the disparity map obtained using the precise CNN structure of Table 2.

Figs. 13 and 14 likewise show that the precise CNN structure yields a more accurate disparity map than the simple CNN structure, but that both structures produce good disparity maps as stereo matching results.

Table 5 compares stereo matching performance according to the CNN structure used in the encoder 120 of the present embodiment. Table 5 compares the existing Guided Filter, Census + SGM, DLP, MC-CNN fast, MC-CNN acrt, Content-CNN, and MC-CNN-WS methods with the simple CNN and precise CNN of Tables 1 and 2, and gives a quantitative comparison of the error rates of unsupervised learning on the KITTI 2012 and KITTI 2015 training data.
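The text does not state how the error rate in this comparison is computed. As an illustration only, the following sketch assumes the common bad-pixel convention: the percentage of valid ground-truth pixels whose disparity error exceeds a threshold. The function name `disparity_error_rate`, the `valid` mask, and the 3-pixel default threshold are assumptions, not details taken from this document.

```python
import numpy as np

def disparity_error_rate(pred, gt, valid, threshold=3.0):
    """Percentage of valid pixels whose absolute disparity error exceeds threshold.

    pred, gt: (H, W) disparity maps; valid: (H, W) boolean mask of pixels
    that have ground truth. The threshold is an assumed default.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = np.asarray(valid, dtype=bool)
    bad = (np.abs(pred - gt) > threshold) & valid
    return 100.0 * bad.sum() / valid.sum()

# Small example: three valid pixels, one of which is off by 4 px.
pred = np.array([[10.0, 20.0], [30.0, 40.0]])
gt = np.array([[10.0, 24.0], [30.0, 0.0]])
valid = np.array([[True, True], [True, False]])
rate = disparity_error_rate(pred, gt, valid)
```

Masking by `valid` matters for data sets with sparse ground truth, where pixels without a measured disparity must not count as errors.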

Figs. 15 and 16 show the results of simulating the robustness of the unsupervised learning method of the present embodiment to illumination changes.

To analyze the effect of illumination changes, the exposure index was set to 1 for all images and the illumination index was changed from 1 to 3. That is, the results are for the case in which the left and right illumination is 1/3.

Fig. 15 uses the Middlebury data set. The two leftmost images are the left and right images of the input stereo pair, followed, from left to right, by the disparity maps of feature-based DASC (Dense Adaptive Self-Correlation), color and disparity consistency (Col. Disp. Consistency), the present embodiment, and the ground truth.

As Fig. 15 shows, the stereo matching device trained with the unsupervised learning method of the present embodiment performs far better than the other learning methods even under illumination changes.

As shown in Fig. 16, under illumination changes the unsupervised learning method of the present embodiment has a matching cost similar to that of the other learning methods even when the simple CNN is used, and shows a very low error rate when the precise CNN is used.

Figs. 17 and 18 show the results of simulating stereo matching performance in an outdoor driving environment.

Fig. 17 shows the simulation results for the HCI data set, and Fig. 18 shows the simulation results for the Yonsei data set.

In Figs. 17 and 18, the left image is the left image of the input stereo pair, the middle image is the disparity map obtained with the simple CNN structure of Table 1, and the right image is the disparity map obtained with the precise CNN structure of Table 2.

Figs. 17 and 18 show that the stereo matching device and method trained with the unsupervised learning method of the present embodiment achieve excellent stereo matching performance even in a real outdoor driving environment.

Table 6 shows the results of simulating the stereo matching speed of the stereo matching device and method according to the present embodiment.

Table 6 compares the DLP technique with the encoders of the present embodiment that use the simple CNN structure and the precise CNN structure, and gives the simulation results for KITTI 2012, KITTI 2015, and Middlebury. It also gives the results simulated with patches of different resolutions for KITTI and Middlebury.

As shown in Table 6, the stereo matching method with the simple CNN structure according to the present embodiment runs slightly slower than DLP, but, as described above, its stereo matching performance is far superior to that of DLP. The precise CNN structure is slower still but delivers very good stereo matching performance, so the CNN structure can be chosen according to the required balance between efficiency and accuracy.

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

Claims

A stereo matching device comprising:
an encoder that includes two convolutional neural networks (CNNs) having the same structure and the same weights set by learning performed in advance, and that extracts feature maps from an input stereo image;
a matching cost calculator that calculates a matching cost volume between the feature maps;
a disparity map acquisition unit that obtains, for each pixel, the disparity that minimizes the matching cost volume among disparity candidates within a predetermined maximum disparity range, and generates a disparity map from the obtained disparities; and
a learning map generation unit that is coupled during the learning performed in advance to train the encoder, that estimates positive samples, based on correspondence consistency under the epipolar constraint, for a learning disparity map output from the disparity map acquisition unit in response to a learning stereo image input to the encoder during learning, and that backpropagates the error between the learning disparity map and learning maps generated by propagating the estimated positive samples to adjacent pixels,
wherein the learning map generation unit includes:
a positive sample extraction unit that obtains a sparse positive sample map by estimating, as sparse positive samples, pixels whose distance between corresponding points is less than a predetermined threshold, based on correspondence consistency under the epipolar constraint for the learning disparity map obtained by the disparity map acquisition unit during learning;
an interpolation map generation unit that generates an interpolation map by propagating the sparse positive samples of the sparse positive sample map to adjacent pixels using an interpolation technique; and
a learning map acquisition unit that again estimates learning samples according to correspondence consistency for the interpolation map, generates the learning map from the estimated learning samples, obtains the error between the generated learning map and the learning disparity map, and backpropagates the error,
wherein the interpolation map generation unit uses the color of the input stereo image as a guide to obtain interpolation pixels that minimize a predetermined energy function under a color smoothness constraint between the sparse positive samples and adjacent pixels in the sparse positive sample map, and generates the interpolation map using the obtained interpolation pixels.
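As an illustration of the matching cost volume, the per-pixel minimum-cost disparity selection, and the correspondence-consistency test recited in the claim above, the following is a minimal NumPy sketch. It is not the claimed implementation: the squared-difference cost, the function names, the `direction` argument, and the threshold value are all assumptions made for the example, and random arrays stand in for the CNN feature maps.

```python
import numpy as np

def cost_volume(feat_ref, feat_tgt, max_disp, direction):
    """Squared-difference cost for disparities 0..max_disp (requires max_disp < W).

    feat_ref, feat_tgt: (H, W, C) feature maps.
    direction=-1: ref is the left view, pixel x matches x - d in the right view.
    direction=+1: ref is the right view, pixel x matches x + d in the left view.
    Out-of-image candidates keep an infinite cost.
    """
    h, w, _ = feat_ref.shape
    vol = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        if direction < 0:
            diff = feat_ref[:, d:] - feat_tgt[:, :w - d]
            vol[d, :, d:] = (diff ** 2).sum(axis=-1)
        else:
            diff = feat_ref[:, :w - d] - feat_tgt[:, d:]
            vol[d, :, :w - d] = (diff ** 2).sum(axis=-1)
    return vol

def winner_take_all(vol):
    """Per-pixel disparity that minimizes the matching cost."""
    return vol.argmin(axis=0)

def positive_samples(disp_left, disp_right, threshold=1.0):
    """Mark left-view pixels whose left and right disparities agree.

    A pixel (x, y) with disparity d maps to (x - d, y) in the right view on the
    same scanline (rectified epipolar geometry); it is kept when the right-view
    disparity at that point differs from d by less than threshold.
    """
    h, w = disp_left.shape
    xs = np.broadcast_to(np.arange(w), (h, w))
    x_right = xs - disp_left
    inside = x_right >= 0
    d_back = np.take_along_axis(disp_right, np.clip(x_right, 0, w - 1), axis=1)
    return inside & (np.abs(disp_left - d_back) < threshold)

# Synthetic check: the left features are the right features shifted by 3 px.
rng = np.random.default_rng(0)
right = rng.random((6, 20, 8))
left = np.empty_like(right)
left[:, 3:] = right[:, :-3]
left[:, :3] = rng.random((6, 3, 8))

disp_l = winner_take_all(cost_volume(left, right, max_disp=5, direction=-1))
disp_r = winner_take_all(cost_volume(right, left, max_disp=5, direction=+1))
mask = positive_samples(disp_l, disp_r)
```

In this synthetic case every left-view pixel from column 3 onward recovers the shift of 3 and passes the consistency test, while the first three columns have no true match in the right view.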

delete

The stereo matching device of claim 1, wherein the learning map generation unit
performs the learning iteratively until the error falls within a predetermined reference error.

A stereo matching method comprising:
performing learning by extracting learning feature maps from a learning stereo image input during learning, calculating a learning matching cost volume between the extracted learning feature maps, obtaining a learning disparity map by searching for the disparity that minimizes the learning matching cost volume, estimating positive samples for the learning disparity map based on correspondence consistency under the epipolar constraint, and backpropagating the error between the learning disparity map and learning maps generated by propagating the estimated positive samples to adjacent pixels;
extracting feature maps from an input stereo image using two convolutional neural networks (CNNs) having the same structure and the same weights, the weights being set in advance according to the error backpropagated in the step of performing the learning;
calculating a matching cost volume between the feature maps; and
obtaining, for each pixel, the disparity that minimizes the matching cost volume among disparity candidates within a predetermined maximum disparity range, and generating a disparity map from the obtained disparities,
wherein the step of performing the learning includes:
obtaining a sparse positive sample map by estimating, as sparse positive samples, pixels whose distance between corresponding points is less than a predetermined threshold, based on correspondence consistency under the epipolar constraint for the learning disparity map;
generating an interpolation map by propagating the sparse positive samples of the sparse positive sample map to adjacent pixels using an interpolation technique; and
again estimating learning samples according to correspondence consistency for the interpolation map, generating the learning map from the estimated learning samples, obtaining the error between the generated learning map and the learning disparity map, and backpropagating the error,
wherein generating the interpolation map includes:
obtaining, using the color of the input stereo image as a guide, interpolation pixels that minimize a predetermined energy function under a color smoothness constraint between the sparse positive samples and adjacent pixels in the sparse positive sample map; and
generating the interpolation map using the obtained interpolation pixels.
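The claim above does not spell out the energy function used for color-guided interpolation. As a hedged illustration of the general idea, the sketch below assumes a simple quadratic energy, E = sum over 4-connected neighbor pairs (p, q) of w_pq (d_p - d_q)^2 with w_pq = exp(-|I_p - I_q|^2 / sigma^2), and minimizes it by Jacobi iterations while holding the sparse positive samples fixed. The energy, the function name `propagate_sparse_disparity`, and the parameters `sigma` and `iters` are assumptions for illustration, not the patented formulation.

```python
import numpy as np

def propagate_sparse_disparity(image, sparse_disp, seed_mask, sigma=0.1, iters=2000):
    """Spread seed disparities to other pixels, guided by color similarity.

    image: (H, W) or (H, W, C) guide image; sparse_disp: (H, W) disparities,
    meaningful only where seed_mask is True; seed_mask: (H, W) boolean.
    """
    image = np.asarray(image, dtype=np.float64)
    if image.ndim == 2:
        image = image[..., None]
    h, w = seed_mask.shape
    # Non-seed pixels start from the mean of the seeds.
    disp = np.where(seed_mask, sparse_disp, sparse_disp[seed_mask].mean()).astype(np.float64)

    # Color-similarity weights to the 4 neighbors (edge padding at the border).
    pad_img = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="edge")
    offsets = [(0, 1), (2, 1), (1, 0), (1, 2)]  # up, down, left, right
    weights = []
    for dy, dx in offsets:
        diff2 = ((image - pad_img[dy:dy + h, dx:dx + w]) ** 2).sum(axis=-1)
        # Floor the weight so the normalization below never divides by zero.
        weights.append(np.maximum(np.exp(-diff2 / sigma ** 2), 1e-6))
    wsum = sum(weights)

    for _ in range(iters):
        pad_d = np.pad(disp, 1, mode="edge")
        acc = np.zeros_like(disp)
        for (dy, dx), wgt in zip(offsets, weights):
            acc += wgt * pad_d[dy:dy + h, dx:dx + w]
        # Jacobi update: weighted neighbor average; seeds stay fixed.
        disp = np.where(seed_mask, sparse_disp, acc / wsum)
    return disp

# Two flat color regions, one seed in each.
img = np.zeros((4, 8))
img[:, 4:] = 1.0
sparse = np.zeros((4, 8))
mask = np.zeros((4, 8), dtype=bool)
sparse[1, 1], mask[1, 1] = 10.0, True
sparse[2, 6], mask[2, 6] = 20.0, True
dense = propagate_sparse_disparity(img, sparse, mask)
```

Because the weight across the color edge is close to zero, each seed value fills its own region and does not bleed into the other, which is the behavior a color smoothness constraint is meant to produce.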

delete

The stereo matching method of claim 6, wherein the step of performing the learning
is repeated until the error falls within a predetermined reference error.