KR20210055260A

KR20210055260A - Apparatus and method for estimating camera pose and depth using images continuously acquired at various viewpoints

Info

Publication number: KR20210055260A
Application number: KR1020190141508A
Authority: KR
Inventors: 이상윤; 황상원; 김우진; 이준협
Original assignee: 연세대학교 산학협력단
Priority date: 2019-11-07
Filing date: 2019-11-07
Publication date: 2021-05-17
Also published as: KR102310789B1

Abstract

The present invention provides an apparatus and a method for estimating camera pose and depth. The apparatus comprises: a frame providing unit that selects a key frame and a reference frame from a plurality of frames continuously acquired at various viewpoints; a depth determination unit that receives at least one of the key frame and the reference frame and obtains a depth map corresponding to the received frame according to a pre-learned pattern estimation method and pattern restoration method; and a pose determination unit that obtains a combined feature map by combining a key feature map and a reference feature map, which are obtained by extracting features from the key frame and the reference frame according to a pre-learned pattern estimation method, obtains a three-dimensional feature map by combining the combined feature map with a three-dimensional coordinate system feature map generated from the obtained depth map so that three-dimensional coordinates are included, and estimates the camera pose change from the key frame to the reference frame from the three-dimensional feature map according to a pre-learned pattern estimation method. Accordingly, the camera pose can be estimated accurately.

Description

Apparatus and method for estimating camera pose and depth using images continuously acquired at various viewpoints

The present invention relates to an apparatus and method for estimating camera pose and depth, and more particularly to an apparatus and method for estimating camera pose and depth using images continuously acquired at various viewpoints.

To obtain a depth image, one typically either uses an expensive LiDAR sensor, or searches for corresponding pairs in simultaneously acquired stereo images (a left image and a right image), obtains a disparity map from the corresponding pairs found, and then derives actual depth values from the known distance between the stereo cameras (the baseline). Because LiDAR is expensive, acquiring depth images from stereo images is the commonly used approach.

A method of obtaining distance information, that is, a depth image, from stereo images includes a camera calibration step that estimates the parameters of the two cameras and a stereo matching step that finds mutually corresponding positions in the images acquired by the two cameras, where the depth (z) is obtained according to the equation z = (b × f) / d.

Here, b is the distance between the stereo cameras (baseline), f is the focal length, which is an intrinsic parameter of the camera, and d is the disparity. Because the baseline (b) and the focal length (f) are fixed constants, the depth value (z) is computed from the disparity (d) alone to obtain the depth image. However, since the depth value (z) is calculated using only the disparity (d), there is a limitation in that a varied range of depth (depth of field) cannot be obtained.
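The relation z = (b × f) / d can be illustrated with a minimal NumPy sketch. The function name, the units (baseline in meters, focal length and disparity in pixels), and the choice to map non-positive disparities to infinity are assumptions made for illustration; they are not specified in the source.

```python
import numpy as np

def depth_from_disparity(disparity, baseline_m, focal_px):
    """Per-pixel depth z = (b * f) / d.

    Assumption: pixels with non-positive disparity have no valid match
    and are assigned an infinite depth.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = baseline_m * focal_px / disparity[valid]
    return depth

# Example: a 0.5 m baseline and a 700 px focal length.
depth = depth_from_disparity([[10.0, 0.0]], baseline_m=0.5, focal_px=700.0)
```

Because b and f are constants, the depth resolution in this formula is governed entirely by the disparity, which is the limitation the passage describes.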

In addition, because the depth values are predicted from only the left and right images, post-processing is required.

To solve these problems, a technique that uses images acquired simultaneously from multiple viewpoints has been proposed. Applying this technique, however, requires accurate knowledge of the position and pose of the camera that acquires each image.

Korean Laid-open Patent Publication No. 10-2019-0032532 (published March 27, 2019)

An object of the present invention is to provide an apparatus and method for estimating camera pose and depth that can accurately estimate the camera pose from images continuously acquired at various viewpoints.

Another object of the present invention is to provide an apparatus and method for estimating camera pose and depth that can obtain an accurate depth image by warping and comparing images continuously acquired at various viewpoints on the basis of the estimated camera pose.

Yet another object of the present invention is to provide an apparatus and method for estimating camera pose and depth that can obtain camera pose and depth information without accumulating errors, by selecting key frames from images continuously acquired at various viewpoints and comparing the difference between the selected key frame and a reference image.

To achieve the above objects, a camera pose and depth estimation apparatus according to an embodiment of the present invention includes: a frame providing unit that selects a key frame and a reference frame from images of multiple frames continuously acquired at various viewpoints; a depth determination unit that receives at least one of the key frame and the reference frame and obtains a depth map corresponding to the received frame according to a pre-learned pattern estimation method and pattern restoration method; and a pose determination unit that obtains a combined feature map by combining a key feature map and a reference feature map obtained by extracting features from the key frame and the reference frame according to a pre-learned pattern estimation method, obtains a three-dimensional feature map by combining the combined feature map with a three-dimensional coordinate system feature map generated from the obtained depth map so that three-dimensional coordinates are included, and estimates the camera pose change from the key frame to the reference frame from the three-dimensional feature map according to a pre-learned pattern estimation method.

The pose determination unit may include: a feature extraction unit that receives the key frame and the reference frame and extracts features of each according to a pre-learned pattern estimation method to obtain the key feature map and the reference feature map; a feature combining unit that receives and combines the key feature map and the reference feature map to obtain the combined feature map; a three-dimensional conversion unit that obtains the three-dimensional feature map by combining, with the combined feature map, the three-dimensional coordinate system feature map obtained by weighting a two-dimensional coordinate system feature map having a designated pattern with the depth map obtained by the depth determination unit, transformed according to the camera intrinsic parameters; and a pose change estimation unit that receives the three-dimensional feature map, separately estimates the rotation change and the translation change of the camera between the key frame and the reference frame, and determines the camera pose change from the estimated rotation change and translation change.

The feature extraction unit may include: a key frame feature extraction unit that receives the key frame and obtains the key feature map; and a reference frame feature extraction unit that is composed of the same neural network as the key frame feature extraction unit, is trained identically so as to have the same weights, and receives the reference frame to obtain the reference feature map.

The three-dimensional conversion unit may obtain the three-dimensional coordinate system feature map by taking a matrix that has a size corresponding to the combined feature map and combines a first channel having predetermined pattern values for setting coordinate values in the x-axis direction, a second channel having predetermined pattern values for setting coordinate values in the y-axis direction, and a third channel that is a homogeneous coordinate with a predetermined value, and multiplying this matrix by the inverse of a calibration matrix obtained in advance as the camera intrinsic parameters and by the depth map.
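The construction of the three-dimensional coordinate system feature map can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the x and y pattern channels are assumed to be pixel indices, the homogeneous channel is assumed to be all ones, the depth map is assumed to have the same spatial size as the combined feature map, and the function names are hypothetical.

```python
import numpy as np

def coordinate_feature_map(depth, K):
    """Return a (3, H, W) map of 3D points: depth * K^-1 * [x, y, 1]^T.

    depth: (H, W) depth map; K: (3, 3) calibration (intrinsic) matrix.
    Assumption: the x/y pattern channels hold pixel indices 0..W-1 and 0..H-1.
    """
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W, dtype=np.float64),
                         np.arange(H, dtype=np.float64))
    ones = np.ones_like(xs)                        # homogeneous channel
    pixels = np.stack([xs, ys, ones], axis=0).reshape(3, -1)
    rays = np.linalg.inv(K) @ pixels               # back-projected directions
    points = rays * depth.reshape(1, -1)           # scale each ray by its depth
    return points.reshape(3, H, W)

def append_coordinates(combined_features, depth, K):
    """Concatenate the 3D coordinate channels onto a (C, H, W) feature map."""
    coords = coordinate_feature_map(depth, K)
    return np.concatenate([combined_features, coords], axis=0)
```

For a combined feature map with C channels, the result has C + 3 channels, so every spatial location carries its estimated 3D position alongside the learned features.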

The pose change estimation unit may include: a rotation transformation extraction unit that receives the three-dimensional feature map and estimates the amount of rotation change of the camera from the key frame to the reference frame according to a pre-learned pattern estimation method; a translation vector extraction unit that receives the three-dimensional feature map and estimates the amount of translation of the camera from the key frame to the reference frame according to a pre-learned pattern estimation method; and a camera pose acquisition unit that determines the camera pose change from the key frame to the reference frame on the basis of the amount of rotation change and the amount of translation.

The rotation transformation extraction unit may obtain the amount of rotation change of the camera from the three-dimensional feature map as a rotation matrix of a predetermined size that represents the rotation about each of the x, y, and z axes.

The rotation transformation extraction unit may estimate the amount of rotation change of the camera from the three-dimensional feature map as three Euler angles (ψ, φ, θ) and convert the three estimated Euler angles into the rotation matrix according to a predetermined method.
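The conversion from three Euler angles to a rotation matrix, and its combination with a translation vector into a single pose change, might look like the sketch below. The source only states that the conversion follows a predetermined method, so the axis assignment (ψ about z, θ about y, φ about x), the composition order R = Rz · Ry · Rx, the 4 × 4 homogeneous representation, and the function names are all assumptions made for illustration.

```python
import numpy as np

def euler_to_rotation(psi, phi, theta):
    """Build a 3x3 rotation matrix from three Euler angles (radians).

    Assumed convention: psi rotates about z, theta about y, phi about x,
    composed as R = Rz(psi) @ Ry(theta) @ Rx(phi).
    """
    cz, sz = np.cos(psi), np.sin(psi)
    cy, sy = np.cos(theta), np.sin(theta)
    cx, sx = np.cos(phi), np.sin(phi)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

def pose_matrix(R, t):
    """Combine a rotation R (3x3) and a translation t (3,) into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(t, dtype=np.float64)
    return T

# Example: a 90-degree rotation about z plus a small forward translation.
T = pose_matrix(euler_to_rotation(np.pi / 2, 0.0, 0.0), [0.0, 0.0, 0.1])
```

Predicting three angles rather than nine matrix entries guarantees that the converted result is a valid rotation (orthonormal with determinant 1), which is one plausible reading of why the Euler-angle route is described as easier to learn.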

The frame providing unit may include: an image acquisition unit that acquires images of multiple frames continuously acquired at various viewpoints; and a frame selection unit that, taking as a basis a reference frame for which at least one of the camera pose difference or the depth difference from the previously selected key frame is determined to be greater than or equal to a predetermined reference value, selects the preceding frame as the key frame and selects a subsequent frame as the reference frame.

The camera pose and depth estimation apparatus may further include a warping unit that uses the depth map obtained by the depth determination unit and the camera pose change obtained by the pose determination unit to warp one of the key frame and the reference frame so that it corresponds to the other frame, and corrects the depth information of the depth map by comparing the warped frame with the other frame.
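One common way to realize such a warp is to back-project each key frame pixel with its depth, move it by the pose change, and re-project it into the reference frame. The NumPy sketch below illustrates that idea only; it assumes a grayscale image, a depth map for the key frame, a pose (R, t) that maps key camera coordinates to reference camera coordinates, and nearest-neighbor sampling. None of these choices, nor the function name, is confirmed by the source.

```python
import numpy as np

def warp_reference_to_key(ref_image, key_depth, K, R, t):
    """Resample a grayscale reference image into the key frame's view.

    Returns (warped, valid): the warped image and a mask of key pixels
    whose projection lands inside the reference image.
    """
    H, W = key_depth.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([xs, ys, np.ones_like(xs)], axis=0)
    pixels = pixels.reshape(3, -1).astype(np.float64)
    # Back-project key pixels to 3D, then move them into the reference camera.
    points_key = (np.linalg.inv(K) @ pixels) * key_depth.reshape(1, -1)
    points_ref = R @ points_key + np.asarray(t, dtype=np.float64).reshape(3, 1)
    proj = K @ points_ref
    z = proj[2]
    valid = z > 1e-6                        # keep points in front of the camera
    u = np.full(z.shape, -1, dtype=np.int64)
    v = np.full(z.shape, -1, dtype=np.int64)
    u[valid] = np.rint(proj[0, valid] / z[valid]).astype(np.int64)
    v[valid] = np.rint(proj[1, valid] / z[valid]).astype(np.int64)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    warped = np.zeros(H * W, dtype=ref_image.dtype)
    warped[valid] = ref_image[v[valid], u[valid]]   # nearest-neighbor lookup
    return warped.reshape(H, W), valid.reshape(H, W)
```

With an identity pose the warped image equals the reference image. A trainable version would normally use bilinear sampling so that gradients can flow to the depth and pose estimates.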

The camera pose and depth estimation apparatus may further include a learning unit that, during training, obtains the difference between the frame warped by the warping unit and the other frame as an error and trains the depth determination unit and the pose determination unit by backpropagating the obtained error to them.
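The error between the warped frame and the other frame could, for example, be measured as a masked mean absolute difference. The source only says that the difference is used as the error, so the L1 form, the validity mask, and the function name below are assumptions; the backpropagation step itself would require an automatic differentiation framework and is not shown.

```python
import numpy as np

def photometric_error(warped, target, valid):
    """Mean absolute intensity difference over valid pixels.

    Assumption: pixels outside the valid mask (for example, those warped
    from outside the image) are excluded; returns 0.0 if none are valid.
    """
    warped = np.asarray(warped, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    valid = np.asarray(valid, dtype=bool)
    if not valid.any():
        return 0.0
    return float(np.abs(warped - target)[valid].mean())
```

Because the error depends on both the depth map and the pose change through the warp, minimizing it trains both estimators without requiring ground-truth depth or pose labels.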

To achieve the above objects, a camera pose and depth estimation method according to another embodiment of the present invention includes: selecting a key frame and a reference frame from images of multiple frames continuously acquired at various viewpoints; receiving at least one of the key frame and the reference frame and obtaining a depth map corresponding to the received frame according to a pre-learned pattern estimation method and pattern restoration method; and obtaining a combined feature map by combining a key feature map and a reference feature map obtained by extracting features from the key frame and the reference frame according to a pre-learned pattern estimation method, obtaining a three-dimensional feature map by combining the combined feature map with a three-dimensional coordinate system feature map generated from the obtained depth map so that three-dimensional coordinates are included, and estimating the camera pose change from the key frame to the reference frame from the three-dimensional feature map according to a pre-learned pattern estimation method.

Accordingly, the apparatus and method for estimating camera pose and depth according to embodiments of the present invention select a key frame and a reference frame from images of multiple frames continuously acquired at various viewpoints, obtain the camera pose change between the selected key frame and reference frame and a depth map of at least one of the two frames, and, on the basis of the obtained camera pose change and depth map, warp one of the key frame and the reference frame to correspond to the other and compare them for training. In this way, the apparatus can be trained to estimate both camera pose and depth accurately. In particular, the camera pose change can be estimated accurately because it is estimated from the feature maps of the key frame (key) and the reference frame (ref) combined with the three-dimensional coordinate system feature map. In addition, training the apparatus to estimate Euler angles for the rotation component makes the camera pose change easier to estimate and more accurate. Furthermore, error accumulation can be minimized by selecting as the key frame (key) the frame immediately before or immediately after a reference frame (ref) for which at least one of the camera pose difference or the depth difference from the previously selected key frame is determined to be greater than or equal to a predetermined reference value.

FIG. 1 shows a schematic structure of a camera pose and depth estimation apparatus according to an embodiment of the present invention.
FIG. 2 shows the difference between camera pose determination schemes depending on whether the frame selection unit of FIG. 1 selects key frames.
FIG. 3 is a diagram for explaining the accumulation of errors over successive frames.
FIG. 4 shows the detailed structure of the camera pose determination unit of FIG. 1.
FIGS. 5 and 6 are diagrams for explaining the concept by which the three-dimensional conversion feature extraction unit of FIG. 4 adds a three-dimensional coordinate feature map to a two-dimensional feature map.
FIG. 7 shows an example of a key frame and a reference frame, a depth map for the key frame, and a reference frame.
FIG. 8 shows a camera pose and depth estimation method according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the contents described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in various different forms and is not limited to the described embodiments. To describe the present invention clearly, parts irrelevant to the description are omitted, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", "module", and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 shows a schematic structure of a camera pose and depth estimation apparatus according to an embodiment of the present invention, FIG. 2 shows the difference between camera pose determination schemes depending on whether the frame selection unit of FIG. 1 selects key frames, and FIG. 3 is a diagram for explaining the accumulation of errors over successive frames. FIG. 4 shows the detailed structure of the camera pose determination unit of FIG. 1, FIGS. 5 and 6 are diagrams for explaining the concept by which the three-dimensional conversion feature extraction unit of FIG. 4 adds a three-dimensional coordinate feature map to a two-dimensional feature map, and FIG. 7 shows an example of a key frame and a reference frame, a depth map for the key frame, and a reference frame.

Referring to FIG. 1, the camera pose and depth estimation apparatus according to this embodiment may include: a frame providing unit 100 that selects and provides a key frame and a reference frame from continuous images of various viewpoints; a camera pose determination unit 200 that determines the change in camera pose between the key frame and the reference frame provided by the frame providing unit 100; a depth determination unit 300 that obtains depth information; and a warping unit 400 that warps the reference frame into an image corresponding to the key frame on the basis of the camera pose and depth information determined by the camera pose determination unit 200 and the depth determination unit 300.

First, the frame providing unit 100 may include an image acquisition unit 110 and a frame selection unit 120.

The image acquisition unit 110 acquires images captured continuously at different viewpoints while the camera device moves. The image acquisition unit 110 may be implemented as a camera device, or as a storage device that stores images acquired by a camera, a communication device that receives images over a wired or wireless network, or the like.

In this embodiment, the images acquired by the image acquisition unit 110 are not stereo images from a stereo camera but images acquired continuously by a single camera. In particular, they may be images whose viewpoint changes as the single camera moves.

The frame selection unit 120 divides the multi-frame images, continuously acquired at various viewpoints by the image acquisition unit 110, into key frames (key) and reference frames (ref).

The frame selection unit 120 selects a key frame (key) from the multi-frame images and selects one of the remaining frames, excluding the key frame (key), as the reference frame (ref).

The frame selection unit 120 first selects a key frame (key) from the multi-frame images. It may select the first frame or a predetermined frame as the initial key frame (key). Thereafter, taking as a basis a reference frame (ref) for which at least one of the camera pose difference or the depth difference from the previously selected key frame is determined to be greater than or equal to a predetermined reference value, it may select either the frame immediately before or the frame immediately after that reference frame as the key frame (key). As one example, the frame selection unit 120 may select key frames on the basis of the warping level at which the warping unit 400 transforms the reference frame (ref) to correspond to the key frame (key).
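The selection rule described above can be sketched as a short loop. This is only an illustration under assumptions: `difference` is a hypothetical callable standing in for whatever pose or depth difference measure is used, the first frame is taken as the initial key frame, and when the frame immediately before the triggering reference frame is already the current key frame, the sketch falls back to the frame immediately after it so that the loop always advances.

```python
def select_frame_pairs(num_frames, difference, threshold, use_previous=True):
    """Return the (key, ref) index pairs produced by the selection rule.

    difference(key_idx, ref_idx) -> float is a hypothetical measure of the
    camera pose or depth difference between two frames.
    """
    pairs = []
    key, ref = 0, 1                      # assumption: frame 0 is the initial key
    while ref < num_frames:
        pairs.append((key, ref))
        if difference(key, ref) >= threshold:
            # Re-key on the frame just before the triggering reference frame,
            # or just after it if that is not possible or not requested.
            if use_previous and ref - 1 > key:
                key = ref - 1
            else:
                key = ref + 1
            ref = key + 1
        else:
            ref += 1
    return pairs

# Toy example: the "difference" is simply the frame index gap.
pairs = select_frame_pairs(7, lambda k, r: abs(r - k), threshold=3)
```

In the toy example the key frame advances from frame 0 to frame 2 and then to frame 4, so each reference frame is always compared against a key frame that is at most three frames away.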

Recently, techniques have been proposed that use artificial neural networks to predict the relationship between images, such as RGB images or optical flow. Conventionally, however, as shown in FIG. 2(a), the camera pose was estimated by comparing only the differences between adjacent frames in a sequence of consecutive frames. In this case, if all frames are captured normally, as in FIG. 3(a), the camera pose can be estimated normally without a large error. However, if a large change such as light bleeding occurs in a particular frame, as shown in FIG. 3(b), an error arises in the camera pose estimate for that frame. Because this error is carried into the camera pose estimates for subsequent frames, it keeps growing, as shown in FIG. 3(c). FIG. 3(c) compares the camera pose (est) estimated from the differences between consecutive frames, as in FIG. 2(a), with the actual camera pose (gt), and shows that when the camera pose is estimated from consecutive-frame differences, the error accumulates and gradually increases.

Accordingly, to reduce the accumulated error, in this embodiment the frame selection unit 120 selects a key frame (key) from the multi-frame images, as shown in FIG. 2(b), and the camera change in the frames acquired afterwards is estimated relative to the selected key frame (key). Errors may therefore still accumulate in the camera poses estimated across the selected key frames, but an error arising in any reference frame acquired between successively selected key frames is not carried forward once the next key frame (key) is selected, and so does not accumulate. That is, the accumulated error can be reduced.

The frame selection unit 120 repeatedly selects key frames (key) rather than using a single key frame for the whole sequence because, as the difference between the reference frame (ref) and the key frame (key) grows, the change between them caused by camera movement becomes large and camera pose and depth become difficult to estimate, even when no error arises from abnormal conditions such as light bleeding. In addition, the frame selection unit 120 selects as the key frame (key) the frame immediately before or after a reference frame (ref) whose camera pose or depth difference from the previously selected key frame (key) is determined to be greater than or equal to the predetermined reference value, so that the camera pose in later frames can be estimated stably from the frame before or after the one in which the large error occurred. That is, this minimizes the error that can accumulate between the selected key frames.

Once a key frame (key) is selected, the frame selection unit 120 selects one of the frames following the selected key frame (key) as the reference frame (ref) and passes the selected key frame (key) and reference frame (ref) to the camera pose determination unit 200.

The camera pose determination unit 200 receives the key frame (key) and the reference frame (ref) from the frame selection unit 120 and estimates how the camera pose has changed in the reference frame (ref) relative to the key frame (key). That is, it estimates the change in camera pose at the time the reference frame (ref) was captured, relative to the camera pose at the time the key frame (key) was captured.

The camera pose determination unit 200 includes a feature extraction unit 210, a feature combining unit 220, a three-dimensional conversion unit 230, and a pose change estimation unit 240.

The feature extraction unit 210 receives the key frame (key) and the reference frame (ref) selected by the frame selection unit 120 and extracts the features of each according to a pre-learned pattern estimation method to obtain the key feature map and the reference feature map.

Referring to FIG. 4, the feature extraction unit 210 may include a key frame feature extraction unit 211 that receives the key frame (key) and extracts its features, and a reference frame feature extraction unit 212 that receives the reference frame (ref) and extracts its features.

Here, the key frame feature extraction unit 211 and the reference frame feature extraction unit 212 may be implemented as a Siamese network, that is, two networks with the same structure and the same weights. The two units therefore extract features from the key frame (key) and the reference frame (ref) according to the same pattern estimation method to obtain the key feature map and the reference feature map.

Each of the key frame feature extraction unit 211 and the reference frame feature extraction unit 212 may be implemented, for example, as a convolutional neural network (CNN).

The feature combining unit 220 concatenates the key feature map and the reference feature map extracted by the feature extraction unit 210 and outputs the resulting combined feature map to the 3D conversion unit 230.

The 3D conversion unit 230 adds 3D coordinate information to the combined feature map received from the feature combining unit 220, converting it into a 3D feature map.

The key feature map and the reference feature map extracted by the feature extraction unit 210 merely express features of the key frame (key) and the reference frame (ref); they do not provide coordinate system information. Consequently, the combined feature map formed from them does not provide coordinate information, including depth information, either.

However, because estimating camera posture means estimating three-dimensional rather than two-dimensional information, it is very difficult to estimate the camera posture change directly from the combined feature map. In this embodiment, the 3D conversion unit 230 therefore adds to the combined feature map a 3D coordinate system feature map that represents 3D coordinate information. That is, adding the 3D coordinate system feature map converts the combined feature map so that it includes the three-dimensional world coordinate system. Including the three-dimensional world coordinate system in the combined feature map requires the camera intrinsic parameters and the depth information corresponding to each pixel.

The camera intrinsic parameters consist of the focal lengths (f_x, f_y) and the principal point (x₀, y₀), which is the pixel coordinate where the optical axis meets the image plane, and they are obtained in advance. Therefore, if the depth (Z) corresponding to each pixel of the combined feature map is known, three-dimensional world coordinates can be derived for the combined feature map. Accordingly, the 3D conversion unit 230 receives the combined feature map and converts it into a 3D feature map.

The 3D conversion unit 230 obtains the 3D coordinate system feature map based on the key depth information obtained for the key frame from the depth determination unit 300 or the warping unit 400, and combines the 3D coordinate system feature map with the combined feature map to obtain the 3D feature map.

The way in which the 3D conversion unit 230 adds the 3D coordinate system feature map is described below with reference to FIGS. 5 and 6.

In FIG. 5, (a) shows the relationship between the two-dimensional pixel coordinate system and the three-dimensional world coordinate system, and (b) shows the relationship between the camera coordinate system and the world coordinate system as determined by the camera intrinsic parameters and the depth value of each pixel.

Referring to FIG. 5, the camera intrinsic parameters consist of the focal lengths (f_x, f_y) and the principal point (x₀, y₀), which is the pixel coordinate where the optical axis meets the image plane, and they are obtained in advance. If the depth (Z) corresponding to each pixel of the 2D image is known, the 2D pixel coordinates (x, y) can be converted into camera coordinates (X_c, Y_c, Z_c), centered on the camera, according to Equation 1.

The camera intrinsic parameters are obtained in advance through calibration as a calibration matrix (K) in the form of Equation 2.
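The images for Equations 1 and 2 are not reproduced in this text. Under the standard pinhole camera model, and using only the symbols defined above, they would presumably take the following form. This is a reconstruction from the surrounding definitions, with zero skew assumed, and not the patent's own typesetting.

```latex
% Equation 1 (assumed form): back-projection of pixel (x, y) with depth Z
X_c = \frac{(x - x_0)\, Z}{f_x}, \qquad
Y_c = \frac{(y - y_0)\, Z}{f_y}, \qquad
Z_c = Z

% Equation 2 (assumed form): calibration matrix with zero skew
K = \begin{bmatrix}
      f_x & 0   & x_0 \\
      0   & f_y & y_0 \\
      0   & 0   & 1
    \end{bmatrix}
```

Written with the matrix, the same relation is (X_c, Y_c, Z_c)^T = Z K^{-1} (x, y, 1)^T, which matches the later description of multiplying the depth information by the inverse of K.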

In FIG. 6, (a) shows how the 3D coordinate system feature map (cof) is additionally combined with the combined feature map (fm), and (b) shows an example of the structure of the 3D coordinate system feature map (cof). Referring to FIG. 6, the 3D coordinate system feature map (cof) combined with the combined feature map (fm) may consist of three channels (ch1 to ch3). A 2D feature map is supplied as a matrix of a predetermined size, and the three channels (ch1 to ch3) of the 3D coordinate system feature map (cof) may be shaped to match the 2D feature map. As shown in (b), the first channel (ch1) is a feature map representing the x-axis coordinate and may be a matrix whose values increase along the x-axis direction, and the second channel (ch2) is a feature map representing the y-axis coordinate and may be a matrix whose values increase along the y-axis direction. The third channel (ch3) is a matrix representing the homogeneous coordinate, and, for example, all of its elements may be set to 1.

However, the three channels (ch1 to ch3) by themselves cannot express the z-axis coordinate. The 3D conversion unit 230 therefore multiplies the depth information (D) for the key frame (key) by the inverse (K^-1) of the calibration matrix (K) and by the three channels (ch1 to ch3) to obtain the 3D coordinate system feature map. In other words, the first and second channels (ch1, ch2) of the 3D coordinate system feature map (cof) can be regarded as a 2D coordinate system feature map that attaches a 2D coordinate system to the combined feature map (fm). Thus, in this embodiment, a channel corresponding to a homogeneous coordinate feature map is added to the 2D coordinate system feature map, and a depth map reflecting the camera intrinsic parameters is then applied as the z-axis coordinate to obtain the 3D coordinate system feature map (cof).
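The construction of the coordinate channels can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the nested-list layout (channel, row, column), the zero-skew form of K^-1, and the assumption that the feature map has the same resolution as the depth map are all choices made here for the example.

```python
def coordinate_feature_map(depth, fx, fy, x0, y0):
    """Return three H x W channels holding the camera-frame X, Y, Z of each pixel.

    depth: H x W nested list of per-pixel depths (the key frame depth map D).
    fx, fy, x0, y0: intrinsics taken from the calibration matrix K (zero skew assumed).
    """
    height, width = len(depth), len(depth[0])
    # ch1 increases along x, ch2 increases along y, ch3 is all ones (homogeneous).
    ch1 = [[float(x) for x in range(width)] for _ in range(height)]
    ch2 = [[float(y)] * width for y in range(height)]
    ch3 = [[1.0] * width for _ in range(height)]

    cof = [[[0.0] * width for _ in range(height)] for _ in range(3)]
    for y in range(height):
        for x in range(width):
            d = depth[y][x]
            # D * K^-1 * [ch1, ch2, ch3]^T, written out for a zero-skew K.
            cof[0][y][x] = d * (ch1[y][x] - x0 * ch3[y][x]) / fx
            cof[1][y][x] = d * (ch2[y][x] - y0 * ch3[y][x]) / fy
            cof[2][y][x] = d * ch3[y][x]
    return cof


def concat_channels(fm, cof):
    """Append the coordinate channels to a channel-first combined feature map."""
    return list(fm) + list(cof)
```

If the feature map is smaller than the input image, the intrinsics and the depth map would have to be rescaled to the feature-map resolution first; the text does not say how this is handled.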

Here, the depth information for the key frame (key) may be received from the depth determination unit 300 or the warping unit 400.

The depth information for the key frame (key) may be provided by the depth determination unit 300, which receives the key frame directly and obtains its depth map. However, as described above, in this embodiment the frame selection unit may select as the key frame (key) the frame immediately before or after a frame, among the reference frames (ref) for which camera posture information and depth information have already been obtained, whose difference from the previously selected key frame is greater than or equal to a predetermined reference value. In other words, the depth information for the key frame may already be available, and this previously obtained depth information may be provided to the 3D conversion unit 230.

That is, the 3D conversion unit 230 of this embodiment adds the 3D coordinate system feature map, built from the previously estimated depth information for the key frame (key), to the combined feature map, converting it into 3D so that the camera posture change can be estimated more easily.

The posture change estimation unit 240 receives the 3D feature map from the 3D conversion unit 230 and estimates the camera posture change from it according to a pre-learned pattern estimation method. As described above, the 3D feature map combines the key feature map, the reference feature map, and the coordinate system information for the key frame (key). The posture change estimation unit 240 therefore receives the 3D feature map and estimates the change in camera position and posture from the camera corresponding to the key frame (key) to the camera corresponding to the reference frame (ref).

The posture change estimation unit 240 may include a rotation transformation extraction unit 241, a motion vector extraction unit 242, an inverse Euler transform unit 243, and a camera posture acquisition unit 244.

A camera posture change can be divided into a rotation change and a position change. Accordingly, within the posture change estimation unit 240, the rotation transformation extraction unit 241 receives the 3D feature map and estimates the rotation change of the camera between the key frame (key) and the reference frame (ref) according to a pre-learned pattern estimation method, and the motion vector extraction unit 242 estimates the position change of the camera.

The rotation transformation extraction unit 241 may estimate the rotation change of the camera in the form of a matrix whose elements are Euler angles. In general, a rotation in a 3D coordinate system is expressed as a 3 x 3 rotation matrix that accounts for rotation about each of the x, y, and z axes. Because a 3 x 3 rotation matrix has nine elements, the rotation transformation extraction unit 241 would have to estimate the rotation change of the camera as nine values. Even with an artificial neural network, it is difficult to build a network that estimates these nine values, and they are subject to many constraints (a valid rotation matrix must be orthogonal with determinant 1). Training such a network is also not easy.

Accordingly, so that the rotation change of the camera can be estimated easily, the rotation transformation extraction unit 241 according to this embodiment may estimate the camera rotation change from the 3D feature map in the form of Euler angles.

The conversion from a rotation matrix (R) to Euler angles may be performed according to Equation 3.

Euler angles are a representation that expresses a rotation in 3D coordinates as a combination of three angles (ψ, φ, θ). If the rotation transformation extraction unit 241 is trained to estimate Euler angles from the 3D feature map, it needs to estimate only three angles, so the rotation change of the camera can be estimated easily.

If the rotation transformation extraction unit 241 is trained to estimate the camera rotation change in the form of Euler angles, it may further include an inverse Euler transform unit (not shown) that converts the Euler angles back into a rotation matrix.

The inverse Euler angle transform may be performed according to Equation 4.
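Equations 3 and 4 are not reproduced in this text, and the exact Euler convention they use cannot be recovered from it. The sketch below shows one common choice, assumed here purely for illustration: ψ about the z axis, θ about the y axis, φ about the x axis, composed as R = Rz(ψ) Ry(θ) Rx(φ). The function names are hypothetical.

```python
import math


def euler_to_matrix(psi, phi, theta):
    """Inverse Euler transform: angles (radians) to R = Rz(psi) Ry(theta) Rx(phi)."""
    cz, sz = math.cos(psi), math.sin(psi)
    cy, sy = math.cos(theta), math.sin(theta)
    cx, sx = math.cos(phi), math.sin(phi)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def matrix_to_euler(R):
    """Rotation matrix to (psi, phi, theta); valid away from gimbal lock (|theta| < 90 deg)."""
    # Clamp guards against values slightly outside [-1, 1] caused by rounding.
    theta = -math.asin(max(-1.0, min(1.0, R[2][0])))
    psi = math.atan2(R[1][0], R[0][0])
    phi = math.atan2(R[2][1], R[2][2])
    return psi, phi, theta
```

With this convention a network head only needs three outputs, and `euler_to_matrix` always yields a valid rotation matrix, which is the practical benefit the description attributes to the Euler angle representation.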

Meanwhile, the motion vector extraction unit 242 estimates, from the 3D feature map, the amount of parallel translation of the camera along the x, y, and z axes in 3D coordinates. Because it estimates the camera movement along the x, y, and z axes, the motion vector extraction unit 242 obtains a translation vector with three elements (t_x, t_y, t_z).

The camera posture acquisition unit 244 analyzes the rotation and translation of the camera from the rotation matrix obtained by the rotation transformation extraction unit 241 and the translation vector estimated by the motion vector extraction unit 242, and obtains the camera posture change value in the reference frame relative to the key frame (key).
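The text does not state how the camera posture change value is represented. One common representation, assumed here for illustration, is a 4 x 4 homogeneous transform that stacks the rotation matrix and the translation vector; the function names below are hypothetical.

```python
def compose_pose(R, t):
    """Stack a 3x3 rotation R and a translation (t_x, t_y, t_z) into a 4x4 transform."""
    return [
        [R[0][0], R[0][1], R[0][2], t[0]],
        [R[1][0], R[1][1], R[1][2], t[1]],
        [R[2][0], R[2][1], R[2][2], t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform_point(T, point):
    """Map a 3D point from the key frame camera coordinates to the reference frame camera coordinates."""
    x, y, z = point
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))
```

For example, a 90 degree rotation about the z axis combined with a translation of (1, 2, 3) maps the point (1, 0, 0) to (1, 3, 3).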

Meanwhile, the depth determination unit 300 receives the key frame (key) from the frame selection unit 120 and estimates its depth information to obtain a key frame depth map. The depth determination unit 300 may include a depth encoding unit 310 and a depth decoding unit 320.

The depth encoding unit 310 encodes the received key frame (key) according to a pre-learned pattern estimation method. Like the key frame feature extraction unit 211 and the reference frame feature extraction unit 212 of the feature extraction unit 210, the depth encoding unit 310 extracts features of the received key frame, but it is trained to extract features different from those of the feature extraction unit 210. The depth decoding unit 320 receives the key frame encoded by the depth encoding unit 310 and decodes it according to a pre-learned pattern restoration method to obtain a key frame depth map representing the depth value of each pixel of the key frame (key).

The warping unit 400 may warp the reference frame (ref) to the viewpoint of the key frame (key) using the camera posture change value obtained by the posture change estimation unit 240 and the depth map for the key frame (key) obtained by the depth determination unit 300.

The relationship between the image coordinates (p_K) of the key frame (key) and the image coordinates (p_R) of the reference frame (ref) is given by Equation 5.

Here, ~ denotes the non-homogeneous coordinate representation, T_K->R denotes the warping from the image coordinates (p_K) of the key frame (key) to the image coordinates (p_R) of the reference frame (ref), K denotes the calibration matrix representing the camera intrinsic parameters, and D_K denotes the depth map for the key frame (key).
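The image for Equation 5 is not reproduced in this text. Given the symbols listed above, it plausibly has the view-synthesis form commonly used in self-supervised depth and pose estimation, shown below as an assumed reconstruction rather than the patent's own typesetting.

```latex
% Equation 5 (assumed form)
p_R \sim K \, T_{K \to R} \, D_K(p_K) \, K^{-1} \, p_K
```

In this reading, p_K is the homogeneous pixel (x, y, 1)^T, D_K(p_K) K^{-1} p_K is the corresponding 3D point in the key frame camera coordinates (implicitly extended to homogeneous form before T_{K->R} is applied), and ~ is taken to mean that the right-hand side is divided by its third component to return to non-homogeneous pixel coordinates.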

In FIG. 7, (a) shows the key frame image (I_K), (b) shows the reference frame image (I_R), (c) shows the key frame depth map (D_K) estimated by the depth determination unit 300, and (d) shows the warped reference frame image.

As shown in FIG. 7, the image obtained by applying Equation 5 to warp the reference frame (ref) to the key frame (key) should be similar to the key frame image (I_K), as expressed in Equation 6.

Therefore, the key frame depth map obtained by the depth encoding unit 310 can be corrected based on the warped reference image. The warping unit 400 may then warp the corrected key frame depth map to the reference frame to obtain a reference frame depth map for the reference frame (ref).

Meanwhile, the camera posture and depth estimation apparatus according to this embodiment may further include a learning unit (not shown) for training the camera posture determination unit 200 and the depth determination unit 300.

The learning unit defines as the loss function the difference between the warped reference image and the key frame image (I_K), that is, the warped reference image evaluated at p_K minus I_K(p_K). It trains the camera posture determination unit 200 and the depth determination unit 300 by backpropagating the error, which is the value of the loss function, until the error is less than or equal to a predetermined reference error.
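The warping and the resulting loss can be sketched in plain Python as follows. Several details are assumptions made for this example and are not stated in the text: intrinsics are passed as a tuple (fx, fy, x0, y0), sampling is nearest-neighbour, pixels that project outside the reference image are ignored, and the per-pixel differences are averaged as absolute values. An actual training setup would need differentiable (for example bilinear) sampling inside an automatic differentiation framework.

```python
def project_to_reference(x, y, depth, K, R, t):
    """Map key frame pixel (x, y) with known depth to reference frame pixel coordinates."""
    fx, fy, x0, y0 = K
    # Back-project into key frame camera coordinates: D * K^-1 * (x, y, 1)^T.
    Xk = ((x - x0) * depth / fx, (y - y0) * depth / fy, depth)
    # Apply the estimated posture change (rotation R, translation t).
    Xr = [R[i][0] * Xk[0] + R[i][1] * Xk[1] + R[i][2] * Xk[2] + t[i] for i in range(3)]
    if Xr[2] <= 0:
        return None  # point lies behind the reference camera
    # Project with K and divide by the third component.
    return fx * Xr[0] / Xr[2] + x0, fy * Xr[1] / Xr[2] + y0


def warp_reference(ref_image, key_depth, K, R, t):
    """Build the warped reference image on the key frame pixel grid (None where invalid)."""
    height, width = len(key_depth), len(key_depth[0])
    warped = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            p = project_to_reference(x, y, key_depth[y][x], K, R, t)
            if p is None:
                continue
            u, v = round(p[0]), round(p[1])  # nearest-neighbour sampling
            if 0 <= u < width and 0 <= v < height:
                warped[y][x] = ref_image[v][u]
    return warped


def photometric_loss(warped, key_image):
    """Mean absolute difference between warped reference and key frame over valid pixels."""
    diffs = [
        abs(w - k)
        for w_row, k_row in zip(warped, key_image)
        for w, k in zip(w_row, k_row)
        if w is not None
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

When the estimated depth and posture change are correct, the warped reference image matches the key frame image on the valid pixels and the loss approaches zero, which is the condition the description expresses with Equation 6.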

The description above has the depth determination unit 300 estimating the key frame depth map (D_K) for the key frame (key). Once trained, however, the depth determination unit 300 may be configured to receive the reference frame (ref) and estimate the reference frame depth map (D_R) directly from it.

That is, during training the depth determination unit 300 estimates the key frame depth map (D_K) for the key frame (key), whereas when the camera posture and depth estimation apparatus is in actual use it may estimate the reference frame depth map (D_R) for the reference frame (ref). In that case the warping unit 400 may be omitted, because there is no need to derive the depth map (D_R) for the reference frame from the key frame depth map (D_K).

Alternatively, during training as well, the depth determination unit 300 may obtain the reference frame depth map (D_R) for the reference frame (ref), and the warping unit 400 may warp the key frame (key) to the reference frame (ref) to obtain the error.

FIG. 8 shows a camera posture and depth estimation method according to an embodiment of the present invention.

The camera posture and depth estimation method of FIG. 8 is described with reference to FIGS. 1 to 7. First, a key frame (key) and a reference frame (ref) are selected from a plurality of frames of images continuously acquired at various viewpoints (S10). The key frame (key) may be the first frame or a predetermined frame among the plurality of frames. Alternatively, when a reference frame (ref) is determined to differ from the previously selected key frame by at least a predetermined reference value in at least one of camera posture and depth, the frame immediately before or immediately after that reference frame may be selected as the key frame (key).
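The selection rule in step S10 can be sketched as follows. This is an illustrative reading, not the patent's implementation: `diff_fn` is a hypothetical callable that returns scalar posture and depth differences between the current key frame and a candidate reference frame, and the way new key frame indices are clamped to stay in range is a choice made for this example.

```python
def select_key_frames(num_frames, diff_fn, pose_thresh, depth_thresh, use_next=False):
    """Return the indices of the frames chosen as key frames.

    diff_fn(key_index, ref_index) -> (pose_diff, depth_diff), both scalars (hypothetical).
    use_next: pick the frame immediately after the triggering reference frame
              instead of the one immediately before it.
    """
    keys = [0]  # the first frame starts as the key frame
    ref = 1
    while ref < num_frames:
        pose_diff, depth_diff = diff_fn(keys[-1], ref)
        if pose_diff >= pose_thresh or depth_diff >= depth_thresh:
            new_key = ref + 1 if use_next else ref - 1
            # Keep the index in range and strictly after the previous key frame.
            new_key = min(max(new_key, keys[-1] + 1), num_frames - 1)
            keys.append(new_key)
            ref = new_key + 1
        else:
            ref += 1
    return keys
```

As a usage example, with ten frames, a difference that grows by one per frame of separation, and a posture threshold of 3, the frame just before each triggering reference frame becomes the next key frame.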

Next, features are extracted from each of the key frame (key) and the reference frame (ref) according to a pre-learned pattern estimation method to obtain a key feature map and a reference feature map (S20). Once the key feature map and the reference feature map have been obtained, they are combined to obtain a combined feature map (S30).

Separately from obtaining the combined feature map, at least one of the key frame (key) and the reference frame (ref) is received, and the received frame is encoded and decoded according to a pre-learned pattern estimation method and pattern restoration method to estimate a depth map for that frame (S40). For example, a key depth map for the key frame (key) may be estimated during training and a reference depth map for the reference frame (ref) during actual operation, but the method is not limited to this.

A 3D coordinate system feature map generated from the estimated depth map is then combined with the combined feature map, converting the combined feature map into 3D to obtain a 3D feature map (S50).

Once the 3D feature map has been obtained, the amount of camera rotation is estimated from it according to a pre-learned pattern estimation method (S60). The rotation may be estimated directly as a 3 x 3 rotation matrix, but in this embodiment it may instead be estimated as three Euler angles (ψ, φ, θ) and then converted into a rotation matrix through the inverse Euler angle transform of Equation 4.

Separately from estimating the camera rotation, the camera translation is estimated from the 3D feature map according to a pre-learned pattern estimation method to obtain a camera translation vector (S70).

Once the rotation matrix representing the camera rotation and the camera translation vector representing the parallel translation of the camera along the x, y, and z axes have been obtained, the amount of change in camera posture in the reference frame (ref) relative to the camera posture in the key frame (key) is determined from them (S80).

During training, the method may further include a frame image warping step (S90) and an error calculation and backpropagation step (S100).

In the frame image warping step (S90), the reference frame (ref) is warped to correspond to the key frame (key) based on the key depth map estimated for the key frame in the depth map estimation step (S40) and the determined amount of camera posture change. In the error calculation and backpropagation step (S100), the difference between the warped reference frame and the key frame is calculated as the error, and the error is backpropagated to the camera posture determination unit 200 and the depth determination unit 300 to perform training.

Training may be repeated until the error is less than or equal to a predetermined reference error or until a predetermined number of iterations is reached.

As a result, the camera posture and depth estimation apparatus and method according to this embodiment select a key frame (key) and a reference frame (ref) from a plurality of frames of images continuously acquired at various viewpoints, obtain the camera posture change between the selected key frame and reference frame together with a depth map of at least one of them, and, based on the obtained camera posture change and depth map, warp one of the two frames to correspond to the other and compare them for training. This allows the apparatus to be trained to estimate both camera posture and depth accurately. In particular, because the camera posture change is estimated after a 3D coordinate system feature map is combined with the feature maps of the key frame (key) and the reference frame (ref), the camera posture change can be estimated accurately. In addition, training the apparatus to estimate Euler angles for the rotational component makes the camera posture change easier to estimate and more accurate. Furthermore, the accumulation of error can be minimized by selecting as the new key frame (key) the frame immediately before or after a reference frame (ref) for which at least one of the camera posture difference and the depth difference from the previously selected key frame is determined to be greater than or equal to a predetermined reference value.

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer and may include all computer storage media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely examples, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Therefore, the true technical protection scope of the present invention should be determined by the technical spirit of the appended claims.

100: frame providing unit 110: image acquisition unit
120: frame selection unit 200: posture determination unit
210: feature extraction unit 211: key frame feature extraction unit
212: reference frame feature extraction unit 220: feature combining unit
230: 3D conversion unit 240: posture change estimation unit
241: rotation transformation extraction unit 242: motion vector extraction unit
243: camera posture acquisition unit 300: depth determination unit
310: depth encoding unit 320: depth decoding unit
400: warping unit

Claims

A camera posture and depth estimation apparatus comprising:
a frame providing unit that selects a key frame and a reference frame from multiple frames acquired continuously at various viewpoints;
a depth determination unit that receives at least one of the key frame and the reference frame and obtains a depth map corresponding to the received frame according to a pre-learned pattern estimation method and pattern restoration method; and
a posture determination unit that obtains a combined feature map by combining a key feature map and a reference feature map extracted from the key frame and the reference frame according to a pre-learned pattern estimation method, obtains a three-dimensional feature map by combining the combined feature map with a three-dimensional coordinate system feature map generated from the obtained depth map so that the combined feature map includes three-dimensional coordinates, and estimates a camera posture change from the key frame to the reference frame from the three-dimensional feature map according to a pre-learned pattern estimation method.

The apparatus of claim 1, wherein the posture determination unit comprises:
a feature extraction unit that receives the key frame and the reference frame and extracts features of each according to a pre-learned pattern estimation method to obtain the key feature map and the reference feature map;
a feature combining unit that receives and combines the key feature map and the reference feature map to obtain the combined feature map;
a three-dimensional conversion unit that obtains the three-dimensional coordinate system feature map by transforming a two-dimensional coordinate system feature map having a designated pattern according to camera intrinsic parameters and weighting it with the depth map obtained by the depth determination unit, and combines the three-dimensional coordinate system feature map with the combined feature map to obtain the three-dimensional feature map; and
a posture change estimation unit that receives the three-dimensional feature map, separately estimates a rotation change and a translation change of the camera between the key frame and the reference frame, and determines the camera posture change from the estimated rotation change and translation change.

The apparatus of claim 2, wherein the feature extraction unit comprises:
a key frame feature extraction unit that receives the key frame and obtains the key feature map; and
a reference frame feature extraction unit that is configured as the same neural network as the key frame feature extraction unit and trained to have the same weights, and that receives the reference frame and obtains the reference feature map.

The apparatus of claim 2, wherein the three-dimensional conversion unit
obtains the three-dimensional coordinate system feature map by multiplying a matrix, which has a size corresponding to the combined feature map and combines a first channel having predetermined pattern values that set coordinate values in the x-axis direction, a second channel having predetermined pattern values that set coordinate values in the y-axis direction, and a third channel having a predetermined value as a homogeneous coordinate, by the inverse of a calibration matrix obtained in advance as the camera intrinsic parameters and by the depth map.
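
The back-projection described in this claim can be illustrated with a minimal sketch. It assumes a skew-free pinhole calibration matrix K with focal lengths `fx`, `fy` and principal point `cx`, `cy`, and a depth map already at the resolution of the combined feature map; the function name and the nested-list representation are hypothetical and are not taken from the specification. For each pixel (u, v), the homogeneous coordinate [u, v, 1] is multiplied by the inverse of K and scaled by the depth.

```python
def coordinate_feature_map(depth, fx, fy, cx, cy):
    """Back-project every pixel of a depth map into 3D camera coordinates.

    depth is an H x W nested list; fx, fy, cx, cy are assumed pinhole
    intrinsics (no skew). Returns three H x W channels (X, Y, Z).
    """
    height, width = len(depth), len(depth[0])
    x_channel = [[0.0] * width for _ in range(height)]
    y_channel = [[0.0] * width for _ in range(height)]
    z_channel = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            # K^-1 [u, v, 1]^T for a skew-free pinhole calibration matrix K.
            ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
            d = depth[v][u]
            # Weight the normalized ray by the depth to get a 3D point.
            x_channel[v][u] = d * ray[0]
            y_channel[v][u] = d * ray[1]
            z_channel[v][u] = d * ray[2]
    return x_channel, y_channel, z_channel


if __name__ == "__main__":
    xs, ys, zs = coordinate_feature_map([[2.0, 2.0], [4.0, 4.0]], 2.0, 2.0, 0.0, 0.0)
    print(xs, ys, zs)
```

In a trained network the three channels would presumably be concatenated with the combined feature map along the channel axis; that step is omitted here.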

The apparatus of claim 2, wherein the posture change estimation unit comprises:
a rotation transformation extraction unit that receives the three-dimensional feature map and estimates a rotation change amount of the camera from the key frame to the reference frame according to a pre-learned pattern estimation method;
a motion vector extraction unit that receives the three-dimensional feature map and estimates a translation amount of the camera from the key frame to the reference frame according to a pre-learned pattern estimation method; and
a camera posture acquisition unit that determines the camera posture change from the key frame to the reference frame based on the rotation change amount and the translation amount.

The apparatus of claim 5, wherein the rotation transformation extraction unit
obtains the rotation change amount of the camera from the three-dimensional feature map as a rotation matrix of a predetermined size that represents rotation about each of the x, y, and z axes.

The apparatus of claim 6, wherein the rotation transformation extraction unit
estimates the rotation change amount of the camera from the three-dimensional feature map as three Euler angles (ψ, φ, θ) and converts the estimated three Euler angles into the rotation matrix according to a known method.
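
One known way to convert three Euler angles into a rotation matrix is sketched below. The claim does not fix an axis assignment or composition order, so the choices here are assumptions made only for illustration: ψ is taken as the rotation about the x axis, φ about the y axis, and θ about the z axis, composed as R = Rz(θ)·Ry(φ)·Rx(ψ). The function names are hypothetical.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def euler_to_rotation(psi, phi, theta):
    """Build a 3x3 rotation matrix from Euler angles in radians.

    Assumed convention (not stated in the claims): psi about x,
    phi about y, theta about z, composed as Rz @ Ry @ Rx.
    """
    c_x, s_x = math.cos(psi), math.sin(psi)
    c_y, s_y = math.cos(phi), math.sin(phi)
    c_z, s_z = math.cos(theta), math.sin(theta)
    rot_x = [[1.0, 0.0, 0.0], [0.0, c_x, -s_x], [0.0, s_x, c_x]]
    rot_y = [[c_y, 0.0, s_y], [0.0, 1.0, 0.0], [-s_y, 0.0, c_y]]
    rot_z = [[c_z, -s_z, 0.0], [s_z, c_z, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(rot_z, matmul3(rot_y, rot_x))


if __name__ == "__main__":
    # A quarter turn about z maps the x axis onto the y axis.
    for row in euler_to_rotation(0.0, 0.0, math.pi / 2):
        print([round(value, 6) for value in row])
```

Regressing three angles instead of nine matrix entries keeps the output on the set of valid rotations by construction, which is one plausible reading of why the description calls this estimation easier.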

The apparatus of claim 1, wherein the frame providing unit comprises:
an image acquisition unit that acquires multiple frames captured continuously at various viewpoints; and
a frame selection unit that, with respect to a reference frame among the multiple frames for which at least one of a camera posture difference or a depth difference from a previously selected key frame is determined to be greater than or equal to a predetermined reference value, selects the preceding frame as the key frame and the subsequent frame as the reference frame.

The apparatus of claim 1, wherein the camera posture and depth estimation apparatus
further comprises a warping unit that warps one of the key frame and the reference frame to correspond to the other frame by using the depth map obtained by the depth determination unit and the camera posture change obtained by the posture determination unit, and compares the warped frame with the other frame to correct depth information of the depth map.
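
The geometric core of such a warp can be sketched for a single pixel as follows. This is a sketch under stated assumptions rather than the warping unit itself: it assumes a skew-free pinhole camera with intrinsics (fx, fy, cx, cy) and assumes that the rotation matrix and translation vector map key-frame camera coordinates to reference-frame camera coordinates. The function name is hypothetical. A full warp would apply this to every pixel and sample the other frame at the resulting coordinates, for example by bilinear interpolation.

```python
def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Map pixel (u, v) with known depth from one frame into the other.

    intrinsics is (fx, fy, cx, cy); rotation is a 3x3 nested list and
    translation a length-3 sequence (assumed key-to-reference motion).
    Returns (u', v'), or None if the point falls behind the camera.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the source camera frame.
    point = (depth * (u - cx) / fx, depth * (v - cy) / fy, depth)
    # Apply the rigid camera motion.
    moved = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
             for i in range(3)]
    if moved[2] <= 0.0:
        return None
    # Project the moved point back onto the image plane.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(reproject_pixel(50.0, 50.0, 2.0, (100.0, 100.0, 50.0, 50.0),
                          identity, (1.0, 0.0, 0.0)))
```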

The apparatus of claim 9, wherein the camera posture and depth estimation apparatus
further comprises a learning unit that, during training, obtains the difference between the frame warped by the warping unit and the other frame as an error and backpropagates the obtained error to the depth determination unit and the posture determination unit.
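
The error between the warped frame and the other frame could, for example, be a mean absolute intensity difference. The sketch below assumes single-channel frames stored as nested lists and an optional validity mask for pixels that warp outside the image; the claims do not specify the exact error measure, so the L1 form and the function name are assumptions. The backpropagation step is not shown; in practice it would be handled by an automatic differentiation framework.

```python
def photometric_error(warped, target, valid=None):
    """Mean absolute difference between a warped frame and the target frame.

    warped and target are H x W nested lists of intensities; valid is an
    optional H x W mask of booleans marking pixels to include.
    """
    total, count = 0.0, 0
    for v, row in enumerate(warped):
        for u, value in enumerate(row):
            if valid is not None and not valid[v][u]:
                continue  # Skip pixels without a valid correspondence.
            total += abs(value - target[v][u])
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    print(photometric_error([[1.0, 2.0], [3.0, 4.0]], [[1.0, 4.0], [3.0, 0.0]]))
```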

A camera posture and depth estimation method comprising:
selecting a key frame and a reference frame from multiple frames acquired continuously at various viewpoints;
receiving at least one of the key frame and the reference frame and obtaining a depth map corresponding to the received frame according to a pre-learned pattern estimation method and pattern restoration method; and
obtaining a combined feature map by combining a key feature map and a reference feature map extracted from the key frame and the reference frame according to a pre-learned pattern estimation method, obtaining a three-dimensional feature map by combining the combined feature map with a three-dimensional coordinate system feature map generated from the obtained depth map so that the combined feature map includes three-dimensional coordinates, and estimating a camera posture change from the key frame to the reference frame from the three-dimensional feature map according to a pre-learned pattern estimation method.

The method of claim 11, wherein estimating the posture change comprises:
obtaining the key feature map and the reference feature map by receiving the key frame and the reference frame and extracting features of each according to a pre-learned pattern estimation method;
obtaining the combined feature map by receiving and combining the key feature map and the reference feature map;
obtaining the three-dimensional coordinate system feature map by transforming a two-dimensional coordinate system feature map having a designated pattern according to camera intrinsic parameters and weighting it with the obtained depth map, and combining the three-dimensional coordinate system feature map with the combined feature map to obtain the three-dimensional feature map; and
receiving the three-dimensional feature map, separately estimating a rotation change and a translation change of the camera between the key frame and the reference frame, and determining the camera posture change from the estimated rotation change and translation change.

The method of claim 12, wherein obtaining the key feature map and the reference feature map comprises
obtaining the key feature map corresponding to the key frame and the reference feature map corresponding to the reference frame using a Siamese neural network whose branches have the same structure and are trained to have the same weights.

The method of claim 12, wherein obtaining the three-dimensional feature map comprises
obtaining the three-dimensional coordinate system feature map by multiplying a matrix, which has a size corresponding to the combined feature map and combines a first channel having predetermined pattern values that set coordinate values in the x-axis direction, a second channel having predetermined pattern values that set coordinate values in the y-axis direction, and a third channel having a predetermined value as a homogeneous coordinate, by the inverse of a calibration matrix obtained in advance as the camera intrinsic parameters and by the depth map.

The method of claim 12, wherein estimating the posture change comprises:
receiving the three-dimensional feature map and estimating a rotation change amount of the camera from the key frame to the reference frame according to a pre-learned pattern estimation method;
receiving the three-dimensional feature map and estimating a translation amount of the camera from the key frame to the reference frame according to a pre-learned pattern estimation method; and
determining the camera posture change from the key frame to the reference frame based on the rotation change amount and the translation amount.

The method of claim 15, wherein estimating the rotation change amount comprises
obtaining the rotation change amount of the camera from the three-dimensional feature map as a rotation matrix of a predetermined size that represents rotation about each of the x, y, and z axes.

The method of claim 16, wherein estimating the rotation change amount comprises:
estimating the rotation change amount of the camera from the three-dimensional feature map as three Euler angles (ψ, φ, θ); and
converting the estimated three Euler angles into the rotation matrix according to a known method.

The method of claim 11, wherein selecting the key frame and the reference frame comprises:
acquiring multiple frames captured continuously at various viewpoints; and
with respect to a reference frame among the multiple frames for which at least one of a camera posture difference or a depth difference from a previously selected key frame is determined to be greater than or equal to a predetermined reference value, selecting the preceding frame as the key frame and the subsequent frame as the reference frame.

The method of claim 11, wherein the camera posture and depth estimation method further comprises:
warping one of the key frame and the reference frame to correspond to the other frame using the depth map and the camera posture change; and
comparing the warped frame with the other frame to correct depth information of the depth map.

The method of claim 19, wherein the camera posture and depth estimation method
further comprises obtaining a difference between the warped frame and the other frame as an error and backpropagating the obtained error for training.