KR20230106057A - Method and apparatus for 6 degree of freedom pose estimation using artificial neural network

Info

Publication number: KR20230106057A
Application number: KR1020220039536A
Authority: KR
Inventors: 채재민; 이수찬
Original assignee: 국민대학교산학협력단
Priority date: 2022-01-05
Filing date: 2022-03-30
Publication date: 2023-07-12

Abstract

A six-degree-of-freedom (6-DoF) pose estimation method and apparatus, performed by a computing device, are disclosed.
According to an embodiment, the 6-DoF pose estimation method performed by a computing device may include: obtaining a first image of a first frame and a second image of a second frame; inputting the first image and the second image to an artificial neural network; generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network; generating a combined set using six tokens corresponding to the six degrees of freedom and the feature map; generating a position embedding vector reflecting positional characteristics of patches included in the feature map, and generating an input vector using the position embedding vector and the combined set; inputting the input vector to a dimensionality reduction network of the artificial neural network; and determining the six degrees of freedom from an output of the dimensionality reduction network.

Description

Method and apparatus for 6 degree of freedom pose estimation using artificial neural network

Hereinafter, a method and apparatus for 6-DoF pose estimation using an artificial neural network are disclosed. According to embodiments, a method of training an artificial neural network to estimate a camera pose comprising six degrees of freedom, and a method and apparatus for estimating the camera pose using the trained artificial neural network, are disclosed.

Simultaneous Localization and Mapping (SLAM) is a key enabling technology for robot navigation and autonomous driving. Odometry-related techniques for controlling autonomous mobile platforms, such as SLAM, rely on a variety of sensors. Among these techniques, the approach that estimates six degrees of freedom (DoF) using only a monocular camera is called monocular visual odometry (VO). Compared with approaches that use multiple cameras or auxiliary sensors (inertial sensors, LiDAR), monocular VO has the advantages of a relatively simple structure, data that is easy to handle, and low cost.

Recently, as computing speed has increased dramatically, deep learning has been applied to achieve larger performance gains. For example, among methods that predict depth from a single image using a convolutional neural network, a method was proposed that builds a coarse-to-fine network and trains it using depth values obtained from a depth sensor as ground truth (GT). However, obtaining GT depth in a real environment is costly. To address this cost, a fully unsupervised learning method that trains without GT depth was proposed. This method estimates depth and relative camera pose from consecutive frames, synthesizes a new image from the two estimates, and learns from the difference between the real image and the synthesized image. However, because of scale ambiguity, a fundamental problem of monocular VO, the error of the camera pose estimation network accumulates over long videos. To address scale ambiguity, a geometry consistency loss was proposed. The geometry consistency loss normalizes the difference between the depth values of two consecutive images estimated by the depth estimation network and minimizes that difference.

David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural
Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
Jia-Wang Bian, Zhichao Li, Naiyan Wang, and Huangying Zhan. Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video. In Neural Information Processing Systems (NIPS), 2019

At least one embodiment disclosed below aims to solve the high cost and error accumulation problems of the prior art and to provide a method of training an artificial neural network to estimate six degrees of freedom using only a monocular camera. At least one embodiment aims to provide a method of estimating six degrees of freedom using the trained artificial neural network.

According to an embodiment, a 6-DoF pose estimation method performed by a computing device may include: obtaining a first image of a first frame and a second image of a second frame; inputting the first image and the second image to an artificial neural network; generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network; generating a combined set using six tokens corresponding to the six degrees of freedom and the feature map; generating a position embedding vector reflecting positional characteristics of patches included in the feature map, and generating an input vector using the position embedding vector and the combined set; inputting the input vector to a dimensionality reduction network of the artificial neural network; and determining the six degrees of freedom from an output of the dimensionality reduction network.

The feature extraction network may extract features from a 6-channel image obtained by concatenating the 3-channel first image and the 3-channel second image along the channel dimension.
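The channel-wise combination can be illustrated with a minimal sketch. It assumes each image is a nested list of shape height x width x 3; the function name `stack_frames` is hypothetical and does not come from the disclosure.

```python
def stack_frames(first_image, second_image):
    """Concatenate two H x W x 3 images into one H x W x 6 image."""
    if (len(first_image) != len(second_image)
            or len(first_image[0]) != len(second_image[0])):
        raise ValueError("both images must have the same height and width")
    # For every pixel, append the 3 channels of the second image
    # to the 3 channels of the first image.
    return [
        [list(p1) + list(p2) for p1, p2 in zip(row1, row2)]
        for row1, row2 in zip(first_image, second_image)
    ]
```

In an array library the same operation would typically be a single concatenation along the channel axis.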

Generating the combined set using the six tokens and the feature map may include reconstructing the generated feature map into a set of patches of a fixed size, and combining the set of patches with the six tokens to generate the combined set.

Each of the six tokens may have the same size as each of the patches.

The combined set may include n+6 patches, where n is the number of patches included in the feature map.

The position embedding vector may be a vector having the same size as the patches.

The input vector may be determined based on the sum of the element vectors constituting the combined set and the position embedding vector.
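The construction of the combined set and the input vector can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the feature map is taken to be a single-channel H x W nested list, the six tokens are placed before the patches (the source does not fix an order), and one position embedding vector is supplied per element of the combined set. The function names are hypothetical.

```python
def split_into_patches(feature_map, patch_size):
    """Split an H x W map into non-overlapping patches, each flattened to a vector."""
    height, width = len(feature_map), len(feature_map[0])
    if height % patch_size or width % patch_size:
        raise ValueError("feature map must divide evenly into patches")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                feature_map[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ])
    return patches


def build_input_vectors(patches, tokens, position_embedding):
    """Combine tokens with patches (n + 6 elements) and add the position embedding."""
    # Assumption: tokens come first in the combined set.
    combined = list(tokens) + list(patches)
    if len(position_embedding) != len(combined):
        raise ValueError("need one position embedding vector per element")
    # Element-wise sum of each vector and its position embedding.
    return [
        [value + offset for value, offset in zip(vector, position)]
        for vector, position in zip(combined, position_embedding)
    ]
```

With a 4 x 4 feature map and 2 x 2 patches, n is 4, so the combined set has 10 elements of length 4 each.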

The dimensionality reduction network may reduce the size of the patches constituting the input vector by repeatedly performing a self-attention operation on the input vector.

Determining the six degrees of freedom may include extracting, from among the patches of the input vector whose size has been reduced by the dimensionality reduction network, the patches corresponding to the six tokens, and estimating the six degrees of freedom by computing per-patch average pooling over the extracted patches.
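The read-out step can be sketched in a few lines. The sketch assumes that the output of the dimensionality reduction network is a list of vectors and that the six token positions are the first six entries; both the layout and the name `pose_from_token_outputs` are assumptions made for illustration.

```python
def pose_from_token_outputs(output_vectors, num_tokens=6):
    """Average-pool each token's output vector into one scalar per degree of freedom."""
    if len(output_vectors) < num_tokens:
        raise ValueError("output must contain at least num_tokens vectors")
    # Assumption: the token outputs occupy the first num_tokens positions.
    token_vectors = output_vectors[:num_tokens]
    # Per-patch average pooling: the mean of each vector's elements.
    return [sum(vector) / len(vector) for vector in token_vectors]
```

Each of the six returned scalars corresponds to one degree of freedom; the outputs of the n image patches are not used in this read-out.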

According to an embodiment, a method of training an artificial neural network for 6-DoF pose estimation for localization and mapping, performed by a computing device, may include: obtaining a first image of a first frame and a second image of a second frame; estimating a first depth map of the first image and a second depth map of the second image; inputting the first image and the second image to the artificial neural network to output 6-DoF information; calculating a transformation matrix between the first frame and the second frame based on the 6-DoF information; and calculating an output value of a loss function based on the first depth map, the second depth map, and the transformation matrix, and updating the artificial neural network based on the output value of the loss function.

Outputting the 6-DoF information may include: inputting the first image and the second image to the artificial neural network; generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network; generating a combined set using six tokens corresponding to the six degrees of freedom and the feature map; generating a position embedding vector reflecting positional characteristics of patches included in the feature map, and generating an input vector using the position embedding vector and the combined set; inputting the input vector to a dimensionality reduction network of the artificial neural network; and determining the six degrees of freedom from an output of the dimensionality reduction network.

The six tokens and the position embedding vector may be set to predetermined initial values and then updated so that the output value of the loss function decreases.

The output of the loss function may be determined based on the output of a first auxiliary function that sums, over valid pixels satisfying a predetermined condition, the difference between the first frame and a third frame obtained by reconstructing the second frame using the transformation matrix and the first depth map.

The first auxiliary function may further include a term representing a structural similarity index between the first frame and the third frame for the valid pixels.

According to another embodiment, the first auxiliary function may be a function computed by multiplying by a normalization function. The normalization function is defined for a third depth map, obtained by reconstructing the first depth map using the transformation matrix, and a fourth depth map, obtained by reconstructing the second depth map so that it lies on the same pixel grid as the first depth map; it divides the difference between the third depth map and the fourth depth map by the sum of the third depth map and the fourth depth map.

The structural similarity index may be based on comparisons of luminance, contrast, and structure between pixels.

The predetermined condition may be that, at the pixel, the difference between the first frame and the third frame is smaller than the difference between the first frame and the second frame.
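The valid-pixel condition and the summed difference of the first auxiliary function can be sketched as follows. The sketch assumes grayscale frames stored as nested lists and takes "difference" to mean the absolute intensity difference; the structural similarity term is omitted, and the function names are hypothetical.

```python
def valid_pixel_mask(frame1, frame2, frame3):
    """Mark pixels where frame3 (reconstructed) matches frame1 better than frame2 does."""
    return [
        [abs(p1 - p3) < abs(p1 - p2) for p1, p2, p3 in zip(r1, r2, r3)]
        for r1, r2, r3 in zip(frame1, frame2, frame3)
    ]


def masked_photometric_sum(frame1, frame3, mask):
    """Sum |frame1 - frame3| over the valid pixels only."""
    return sum(
        abs(p1 - p3)
        for r1, r3, row_mask in zip(frame1, frame3, mask)
        for p1, p3, is_valid in zip(r1, r3, row_mask)
        if is_valid
    )
```

Pixels that fail the condition are excluded from the sum, so regions where the reconstruction does not improve on the raw second frame contribute nothing to this term.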

According to an embodiment, the output of the loss function may be determined by further considering the output of a second auxiliary function that sums, over all pixels, the product of the spatial gradient component of the first frame and the gradient component of the first depth map.

According to another embodiment, the output of the loss function may be determined by further considering the output of a third auxiliary function that sums the normalization function over the valid pixels.
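The normalization function (the difference between the third and fourth depth maps divided by their sum) and its sum over valid pixels can be sketched as follows. The sketch assumes strictly positive depths stored as nested lists and uses the absolute difference, which the source does not state explicitly; the function names are hypothetical.

```python
def normalized_depth_difference(depth3, depth4):
    """Per-pixel |d3 - d4| / (d3 + d4); lies in [0, 1) for positive depths."""
    return [
        [abs(d3 - d4) / (d3 + d4) for d3, d4 in zip(row3, row4)]
        for row3, row4 in zip(depth3, depth4)
    ]


def geometry_consistency_sum(depth3, depth4, mask):
    """Sum the normalized depth difference over the valid pixels only."""
    diff = normalized_depth_difference(depth3, depth4)
    return sum(
        value
        for diff_row, mask_row in zip(diff, mask)
        for value, is_valid in zip(diff_row, mask_row)
        if is_valid
    )
```

Dividing by the sum makes the term insensitive to the absolute depth scale: scaling both depth maps by the same factor leaves the result unchanged.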

Generating the combined set using the six tokens corresponding to the six degrees of freedom and the feature map may include: dividing the generated feature map into non-overlapping patches of a fixed size; generating a patch set composed of the patches; generating six tokens that have the same size as the patches and correspond to the six degrees of freedom; and combining the patch set with the six tokens to generate the combined set.

Determining the six degrees of freedom may include extracting, from among the patches of the input vector whose size has been reduced by the dimensionality reduction network, the patches corresponding to the six tokens, and estimating the six degrees of freedom by computing per-patch average pooling over the extracted patches.

A computing device may include a processor, and the processor may perform: obtaining a first image of a first frame and a second image of a second frame; inputting the first image and the second image to an artificial neural network; generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network; generating a combined set using six tokens corresponding to the six degrees of freedom and the feature map; generating a position embedding vector reflecting positional characteristics of patches included in the feature map, and generating an input vector using the position embedding vector and the combined set; inputting the input vector to a dimensionality reduction network of the artificial neural network; and determining the six degrees of freedom from an output of the dimensionality reduction network.

According to an embodiment, a computing device is disclosed. The disclosed computing device includes a processor, and the processor may perform: obtaining a first image of a first frame and a second image of a second frame; inputting the first image and the second image to an artificial neural network; generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network; generating a combined set using six tokens corresponding to the six degrees of freedom and the feature map; generating a position embedding vector reflecting positional characteristics of patches included in the feature map, and generating an input vector using the position embedding vector and the combined set; inputting the input vector to a dimensionality reduction network of the artificial neural network; and determining the six degrees of freedom from an output of the dimensionality reduction network.

According to an embodiment, a computing device is disclosed. The disclosed computing device includes a processor, and the processor performs: obtaining a first image of a first frame and a second image of a second frame; estimating a first depth map of the first image and a second depth map of the second image; inputting the first image and the second image to the artificial neural network to output 6-DoF information; calculating a transformation matrix between the first frame and the second frame based on the 6-DoF information; and calculating an output value of a loss function based on the first depth map, the second depth map, and the transformation matrix, and updating the artificial neural network based on the output value of the loss function. Outputting the 6-DoF information may include: inputting the first image and the second image to the artificial neural network; generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network; generating a combined set using six tokens corresponding to the six degrees of freedom and the feature map; generating a position embedding vector reflecting positional characteristics of patches included in the feature map, and generating an input vector using the position embedding vector and the combined set; inputting the input vector to a dimensionality reduction network of the artificial neural network; and determining the six degrees of freedom from an output of the dimensionality reduction network.

Embodiments of the present invention may provide a method and apparatus for 6-DoF pose estimation using an artificial neural network.

By means of the 6-DoF pose estimation method using an artificial neural network, the embodiments may provide an unsupervised learning method for pose estimation that uses only a monocular camera.

Because the embodiments use a monocular camera without additional sensors and therefore have a simple structure, they may provide an economical 6-DoF pose estimation method whose data is easy to handle. This can reduce cost.

In addition, by using a hybrid artificial neural network, the embodiments may provide a 6-DoF pose estimation method with better performance than conventional convolutional neural networks.

Furthermore, through a configuration that uses the loss function, the embodiments may solve the problem of accumulating error and provide an accurate camera pose estimation method.

The effects of the present invention are not limited to those described above, and effects not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention pertains from this specification and the accompanying drawings.

The accompanying drawings, which are used to describe embodiments of the present invention, show only some of the embodiments of the present invention, and a person of ordinary skill in the art (hereinafter, "a person skilled in the art") may obtain other drawings from them without inventive effort.
FIGS. 1A to 1C are flowcharts illustrating a 6-DoF pose estimation method according to an embodiment.
FIGS. 2A and 2B are flowcharts illustrating a method of training an artificial neural network according to an embodiment.
FIG. 3 is a diagram for explaining the structure of the 6-DoF pose estimation method.
FIG. 4 is a diagram for explaining a hybrid structure composed of a feature extraction network and a dimensionality reduction network according to an embodiment.
FIG. 5 is an exemplary diagram of training data for training an artificial neural network for 6-DoF pose estimation according to an embodiment.
FIG. 6 is an exemplary diagram showing qualitative test results of the 6-DoF pose estimation method according to an embodiment.
FIG. 7 is a block diagram illustrating a 6-DoF pose estimation apparatus according to an embodiment.

The following detailed description of the present invention refers to the accompanying drawings, which show, by way of example, specific embodiments in which the present invention may be practiced, in order to make the objects, technical solutions, and advantages of the present invention clear. These embodiments are described in sufficient detail to enable a person skilled in the art to practice the present invention.

Throughout the detailed description and claims of the present invention, the word "comprise" and its variations are not intended to exclude other technical features, additions, components, or steps. In addition, "a" or "an" is used to mean one or more, and "another" is defined as at least a second or more.

In addition, terms such as "first" and "second" in the present invention are used to distinguish one component from another, and the scope of rights should not be limited by these terms unless they are understood to indicate an order. For example, a first component may be named a second component, and similarly, a second component may be named a first component.

When a component is referred to as being "connected" to another component, it should be understood that it may be directly connected to the other component, or that another component may be interposed between them. In contrast, when a component is referred to as being "directly connected" to another component, it should be understood that no other component is present between them. Other expressions describing the relationship between components, such as "between" and "directly between", or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The reference characters of the steps (for example, a, b, c) are used for convenience of description and do not describe the order of the steps. Unless an order necessarily follows as a matter of logic, the steps may occur in an order different from the one stated. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

Other objects, advantages, and characteristics of the present invention will become apparent to a person skilled in the art, partly from this description and partly from practice of the present invention. The examples and drawings below are provided by way of illustration and are not intended to limit the present invention. Accordingly, details disclosed herein regarding a particular structure or function are not to be construed as limiting, but only as a representative basis that guides a person skilled in the art to practice the present invention in various ways with any suitable detailed structure.

Moreover, the present invention covers all possible combinations of the embodiments shown in this specification. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, a specific shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the present invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the present invention. Accordingly, the detailed description below is not to be taken in a limiting sense, and the scope of the present invention, if properly described, is limited only by the appended claims, together with the full range of equivalents to which those claims are entitled. Like reference numerals in the drawings refer to the same or similar functions throughout the various aspects.

Unless otherwise indicated herein or clearly contradicted by context, an item referred to in the singular encompasses the plural unless the context requires otherwise. In describing the present invention, a detailed description of a related known configuration or function is omitted where it is judged that it could obscure the gist of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person skilled in the art can readily practice the present invention.

The present invention relates to a six-degree-of-freedom (DoF) pose estimation method for localization and mapping, in which images obtained using only a monocular camera are input to an artificial neural network to estimate a camera pose comprising six degrees of freedom. According to an embodiment of the present invention, the six degrees of freedom may include information on the change in camera position (x, y, z) and on the tilt and rotation (θ_Pitch, φ_Roll, ψ_Yaw) between two different frames.
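To illustrate how six such values describe the motion between two frames, the sketch below assembles them into a 4 x 4 homogeneous transformation matrix. The rotation order R = Rz(yaw) · Ry(pitch) · Rx(roll) and the axis assignment are assumptions chosen for the example; the source does not specify an Euler-angle convention, and the function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def pose_to_matrix(x, y, z, pitch, roll, yaw):
    """Build a 4 x 4 transform from a translation and three angles in radians."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    rot_x = [[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]]   # roll about x
    rot_y = [[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]]   # pitch about y
    rot_z = [[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]]   # yaw about z
    # Assumed composition order: roll first, then pitch, then yaw.
    rotation = matmul(rot_z, matmul(rot_y, rot_x))
    return [
        rotation[0] + [x],
        rotation[1] + [y],
        rotation[2] + [z],
        [0.0, 0.0, 0.0, 1.0],
    ]
```

With all angles set to zero the rotation block is the identity and only the translation column remains, which is a convenient sanity check for any chosen convention.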

According to an embodiment of the present invention, a method is disclosed for estimating a 6-DoF pose by acquiring images of a plurality of different frames and inputting them to an artificial neural network. The 6-DoF pose estimation method according to an embodiment has a relatively simple structure compared with approaches that use multiple cameras or auxiliary sensors (inertial sensors, LiDAR), and can therefore provide an economical means whose data are easy to handle. Accordingly, the method may be used to implement odometry-related technologies for controlling autonomous mobile bodies, such as localization and mapping.

FIGS. 1A to 1C are flowcharts illustrating a 6-DoF pose estimation method according to an embodiment.

According to an embodiment, the 6-DoF pose estimation method may determine the 6 degrees of freedom by inputting acquired images to an artificial neural network in the manner shown in FIG. 1A. Specifically, the 6-DoF pose estimation method performed by a computing device may include: obtaining a first image of a first frame and a second image of a second frame (S110); inputting the first image and the second image to an artificial neural network (S120); generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network (S130); generating a combined set using six tokens corresponding to the 6 degrees of freedom and the feature map (S140); generating a positional embedding vector reflecting positional characteristics of patches included in the feature map, and generating an input vector using the positional embedding vector and the combined set (S150); inputting the input vector to a dimension reduction network of the artificial neural network (S160); and determining the 6 degrees of freedom from the output of the dimension reduction network (S170).

In step S110, the computing device may obtain a first image of a first frame and a second image of a second frame. The first image and the second image may be images from different viewpoints, acquired in adjacent frames while a single camera moves. According to an embodiment, the first image may include the first frame. According to an embodiment, the first image and the second image may be color images having three channels. That is, the first image and the second image may be 3-channel RGB images. However, embodiments are not limited thereto. For example, the color corresponding to each channel may be set differently within a range that a person skilled in the art can readily change.

In step S120, the computing device may input the first image and the second image to the artificial neural network. According to an embodiment, the artificial neural network may have a hybrid structure composed of a feature extraction network and a dimension reduction network.

According to an embodiment, the feature extraction network may be based on a ResNet model, which builds on the convolutional neural network (CNN). A convolutional neural network is an artificial neural network model that uses convolution; it is specialized for processing multi-dimensional array data and can clearly detect the features of an image. A convolutional neural network can detect image features by repeatedly performing convolution operations with filters. Specifically, features may be extracted in the convolution layers and pooling layers of the convolutional neural network, and classification may be determined in the fully connected layer.

A residual neural network (hereinafter, ResNet) is a model that builds on the convolutional neural network. By adding the input used in an earlier layer to the output of a convolutional layer, it keeps features from being lost, which can resolve the problem that performance degrades as the network is stacked more deeply. Because ResNet only increases the number of addition operations and has no significant effect on the learnable parameters, even deep models with a larger number of layers can be trained more quickly.

According to an embodiment, the feature extraction network may be based on ResNet18, which has 18 layers. ResNet18 may consist of 18 layers in total: one convolutional layer at the very beginning, four convolutional layers in each of layer groups 1 to 4, and finally one fully connected layer. According to another embodiment, the feature extraction network may be based on ResNet50 or ResNet152. Counting the layers of ResNet50 in the same way, one Bottleneck block consists of three convolutional layers (1x1, 3x3, 1x1); across layer groups 1 to 4 this gives 3 * (3 + 4 + 6 + 3) = 48 layers, and adding one convolutional layer and one fully connected layer yields a model of 50 layers in total. Although specific examples of the neural network structure have been described above, the scope of the present invention is not limited thereto. The structure of the neural network may be changed, within a range readily conceivable by a person skilled in the art, so long as it serves the purpose of extracting a feature map of the first image and the second image.
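As a non-limiting illustration, the layer counts stated above can be checked with a few lines of arithmetic. The helper name `count_layers` and its arguments are hypothetical and used only for this sketch; the block counts per layer group ([2, 2, 2, 2] with two convolutions per block for ResNet18, [3, 4, 6, 3] with three per Bottleneck block for ResNet50) follow the standard ResNet configurations.

```python
def count_layers(blocks_per_group, convs_per_block):
    """Count weighted layers: first conv + convs in layer groups 1 to 4 + final FC."""
    first_conv = 1
    fully_connected = 1
    group_convs = convs_per_block * sum(blocks_per_group)
    return first_conv + group_convs + fully_connected


# ResNet18: two basic blocks per group, two convolutions each (four per group).
resnet18_layers = count_layers([2, 2, 2, 2], convs_per_block=2)
# ResNet50: Bottleneck blocks with 1x1, 3x3, 1x1 convolutions.
resnet50_layers = count_layers([3, 4, 6, 3], convs_per_block=3)

print(resnet18_layers, resnet50_layers)  # 18 50
```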

According to an embodiment, the feature extraction network may output features using a 6-channel image obtained by combining the 3-channel first image and the 3-channel second image along the channel direction. For example, the computing device may input a 6-channel image, formed by concatenating two different RGB images along the channel direction, to the feature extraction network and extract the features of the combined 6-channel image.
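As a non-limiting illustration, the channel-direction combination can be sketched with plain nested lists. The channel-first layout (channels x height x width) and the helper names `make_image` and `concat_channels` are assumptions made for this example only.

```python
def make_image(channels, height, width, value):
    """Build a channel-first image filled with a constant value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(first_image, second_image):
    """Stack two channel-first images along the channel axis."""
    same_height = len(first_image[0]) == len(second_image[0])
    same_width = len(first_image[0][0]) == len(second_image[0][0])
    if not (same_height and same_width):
        raise ValueError("both images must share the same height and width")
    return first_image + second_image


first_rgb = make_image(3, 4, 5, 0.1)   # first frame, 3 channels
second_rgb = make_image(3, 4, 5, 0.9)  # second frame, 3 channels
six_channel = concat_channels(first_rgb, second_rgb)

print(len(six_channel), len(six_channel[0]), len(six_channel[0][0]))  # 6 4 5
```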

According to an embodiment, the dimension reduction network may be based on a Dim Reduction Vision Transformer Network. The dimension reduction network may use linear projection to reduce the dimension of the hidden-layer output after each encoder layer.

According to another embodiment, the dimension reduction network may consist of Layer Norm, Multihead Self-Attention, an MLP (Multilayer Perceptron), and a Dim-reduction MLP, and may be configured to reduce the dimension progressively by repeating the attention operation over L layers.

In step S130, the computing device may generate a feature map from the first image and the second image using the feature extraction network. Specifically, the computing device may apply filters to the channels of the 6-channel image formed by combining the two 3-channel RGB images, obtain the result of the convolution, and generate the feature map from that result.

More specifically, referring to FIG. 1B, the computing device may reorganize the generated feature map into a set of patches of a fixed size (S141) and combine the patch set with the six tokens to generate the combined set (S142).

According to an embodiment, in step S141 the computing device may divide the three-dimensional feature map into non-overlapping patches of a fixed size so that a per-patch attention operation can be performed. The computing device may then generate a set whose elements are the resulting patches.
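As a non-limiting illustration of step S141, the sketch below divides a channel-first feature map into non-overlapping square patches and flattens each patch into one vector. The function name `split_into_patches`, the square patch shape, and the flattening order are assumptions for this example.

```python
def split_into_patches(feature_map, patch_size):
    """Split a C x H x W feature map into flattened, non-overlapping patches."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Each patch keeps all channels of its spatial window.
            patch = [
                feature_map[c][y][x]
                for c in range(channels)
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 2 x 4 x 4 feature map whose values encode (channel, row, column).
toy_map = [[[c * 100 + y * 10 + x for x in range(4)] for y in range(4)]
           for c in range(2)]
toy_patches = split_into_patches(toy_map, patch_size=2)
print(len(toy_patches), len(toy_patches[0]))  # 4 8
```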

According to an embodiment, the computing device may generate six tokens having the same size as the patches generated from the feature map of the first image and the second image. The six tokens may be learnable parameters, each corresponding to one of the 6 degrees of freedom to be obtained. That is, as described later, the six tokens may be updated while the computing device trains the artificial neural network. The computing device may generate the combined set by combining the six tokens with the patches.

According to an embodiment, the patches obtained by dividing the feature map extracted by the convolutional neural network may have lost their position information. Therefore, to take position information into account, in step S150 the computing device may generate a positional embedding vector that reflects the positional characteristics of the patches included in the feature map. The computing device may generate the input vector by adding the generated positional embedding vector to the elements of the combined set.
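As a non-limiting illustration of steps S142 and S150, the sketch below places six tokens in front of the patch vectors and adds a positional embedding element by element. The function name `build_input_vectors`, the small random initialization, the placement of the tokens before the patches, and the use of one embedding per element of the combined set are all assumptions for this example; in training, the tokens would be updated as learnable parameters.

```python
import random


def build_input_vectors(patches, num_tokens=6, seed=0):
    """Combine tokens with patches and add a positional embedding to each element."""
    rng = random.Random(seed)
    dim = len(patches[0])
    # Tokens have the same size as a patch (assumed random initialization).
    tokens = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]
    combined_set = tokens + patches
    # One embedding vector per element of the combined set (assumption).
    position_embedding = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                          for _ in combined_set]
    return [[value + offset for value, offset in zip(vector, embedding)]
            for vector, embedding in zip(combined_set, position_embedding)]


example_patches = [[0.0] * 8 for _ in range(4)]
input_vectors = build_input_vectors(example_patches)
print(len(input_vectors), len(input_vectors[0]))  # 10 8
```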

In step S160, the computing device may input the input vector to the dimension reduction network. As the self-attention operation is repeated over the L layers of the dimension reduction network, the computing device may progressively reduce the dimension of the patches constituting the input vector.
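As a non-limiting illustration of the progressive reduction, the sketch below applies one linear projection per layer so that the vector dimension shrinks layer by layer. The names `linear_projection` and `reduce_dimensions` and the fixed weight matrices are hypothetical; an actual encoder layer would also apply layer normalization, self-attention, and an MLP before the projection, which are omitted here.

```python
def linear_projection(vectors, weights):
    """Multiply each vector of length d_in by a d_in x d_out weight matrix."""
    d_out = len(weights[0])
    return [
        [sum(value * row[j] for value, row in zip(vector, weights)) for j in range(d_out)]
        for vector in vectors
    ]


def reduce_dimensions(vectors, layer_weights):
    """Apply one dimension-reducing projection per layer."""
    for weights in layer_weights:
        # A full layer would run attention and an MLP here (omitted in this sketch).
        vectors = linear_projection(vectors, weights)
    return vectors


# Two layers: dimension 4 -> 2 -> 1, with all-ones weights for easy checking.
schedule = [[[1.0, 1.0] for _ in range(4)], [[1.0] for _ in range(2)]]
reduced = reduce_dimensions([[1.0, 2.0, 3.0, 4.0]], schedule)
print(reduced)  # [[20.0]]
```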

Self-attention may be an operation that calculates similarity using the dot product of vectors. Using the dot product, the computing device may calculate the similarity between vectors, and may express that similarity as a probability using softmax. Softmax may be a function that maps each component to a value between 0 and 1, inclusive, such that the components sum to 1. The computing device may obtain weights from the similarities calculated in this way. In summary, self-attention may be an operation that calculates the relationships among the components of the input vector. That is, when the computing device receives the input vector, it outputs the components that are highly related to it, and by repeating this process the dimension of the patches can be reduced.
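As a non-limiting illustration, the dot-product similarity and softmax weighting described above can be sketched as follows. The function names are hypothetical, and the sketch deliberately omits the learned query, key, and value projections, the scaling factor, and the multiple heads that a transformer encoder would normally use.

```python
import math


def softmax(scores):
    """Map scores to values in [0, 1] that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(vectors):
    """Weight every vector by its dot-product similarity to each query vector."""
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * vector[i] for w, vector in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
    return outputs


attended = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(attended), len(attended[0]))  # 3 2
```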

According to an embodiment, from among the patches of the input vector whose size has been reduced by the dimension reduction network, the computing device may extract the patches output in response to the six input tokens (S171), and may estimate the 6 degrees of freedom by calculating per-patch average pooling over the extracted patches (S172). Average pooling is a down-sampling method that averages each patch; because it requires no parameters, it can reduce the amount of computation.
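As a non-limiting illustration of steps S171 and S172, the sketch below takes the outputs that correspond to the six tokens and averages each one to a single value. The function name `estimate_six_dof` and the assumption that the token outputs occupy the first six positions are hypothetical.

```python
def estimate_six_dof(output_vectors, num_tokens=6):
    """Average-pool each token output to obtain one value per degree of freedom."""
    token_outputs = output_vectors[:num_tokens]  # assumes tokens come first
    return [sum(vector) / len(vector) for vector in token_outputs]


# Toy network output: 10 vectors (6 tokens followed by 4 patches), dimension 2.
toy_outputs = [[float(i), float(i + 2)] for i in range(10)]
pose = estimate_six_dof(toy_outputs)
print(pose)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```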

FIGS. 2A and 2B are flowcharts illustrating a method of training the artificial neural network according to an embodiment.

Referring to FIG. 2A, a method of training an artificial neural network for 6-DoF pose estimation for localization and mapping, performed by a computing device, may include: obtaining a first image of a first frame and a second image of a second frame (S210); estimating a first depth map of the first image and a second depth map of the second image (S220); inputting the first image and the second image to the artificial neural network to output 6-DoF information (S230); calculating a transformation matrix between the first frame and the second frame based on the 6-DoF information (S240); and calculating an output value of a loss function based on the first depth map, the second depth map, and the transformation matrix, and updating the artificial neural network based on the output value of the loss function (S250).

The computing device may perform fully unsupervised learning, that is, training without ground-truth depth information. To this end, the computing device may estimate depth maps and the relative camera pose from consecutive frames, synthesize a new image from them, and learn from the difference between the synthesized image and the actual image.

Specifically, in step S210 the computing device may obtain the first image of the first frame and the second image of the second frame. In step S220, the computing device may input the obtained first image to a depth estimation network to estimate the first depth map. The first depth map may include depth information of the first image. Likewise, the computing device may input the second image to the depth estimation network to estimate the second depth map.

According to an embodiment, the depth estimation network may be based on a U-net structure having an encoder-decoder form. With the U-net structure, the overlap ratio between the patches used to recognize the image is small, which can resolve the inefficiency of the sliding-window approach widely used in earlier segmentation models. In particular, with the U-net structure, a region already verified in a previous patch need not be verified again in the next patch, which can improve computational efficiency and speed. The encoder part of the U-net structure may be composed of a typical convolutional neural network. For example, the encoder part of the depth estimation network may be based on the ResNet50 model. The decoder part of the U-net structure may be composed of up-sampling and convolutional layers. For example, the decoder part of the depth estimation network may be based on the DispResnet model.

According to an embodiment, the computing device may input a 3-channel image to the depth estimation network and output a 1-channel depth map. The output 1-channel depth map may have the same size as the input 3-channel image. In addition, the 1-channel depth map may represent the depth value of each pixel of the 3-channel image.

In step S230, the computing device may input the first image and the second image to the artificial neural network and output 6-DoF information. The 6-DoF information may be information on the change in camera pose between the first image and the second image. According to an embodiment of the present invention, the 6 degrees of freedom may include information on the change in camera position (x, y, z) between the first image and the second image and information on tilt and rotation (θ, φ, ψ).

Specifically, the computing device may input the first image and the second image to the artificial neural network (S231) and generate a feature map from the first image and the second image using the feature extraction network of the artificial neural network (S232). The computing device may then generate a combined set using the generated feature map and the tokens corresponding to the 6 degrees of freedom (S233), generate a positional embedding vector reflecting the positional characteristics of the feature map, and generate an input vector using the positional embedding vector and the combined set (S234). The computing device may input the input vector to the dimension reduction network of the artificial neural network (S235) and determine the 6 degrees of freedom from the output of the dimension reduction network.

According to an embodiment, the six generated tokens may have the same size as the patches included in the feature map. In step S233, the computing device may generate the combined set by combining the six tokens with the patches included in the feature map. As training proceeds, the tokens combined with the patches may acquire global information from the patches and, using the acquired information, serve to estimate and classify the class of the image.

In step S240, the computing device may calculate a transformation matrix between the first frame and the second frame based on the 6-DoF information output in step S230. The transformation matrix may include information on the translation and rotation of the camera between the first frame and the second frame. The transformation matrix may express the change in position due to camera translation with the (x, y, z) components, and the change due to camera rotation with the (θ, φ, ψ) components.
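As a non-limiting illustration of step S240, the sketch below builds a 4x4 homogeneous transformation matrix from the six pose values. The function name `pose_to_matrix`, the use of radians, and the rotation order R = Rz(yaw) Ry(pitch) Rx(roll) are assumptions; the description does not specify the angle convention.

```python
import math


def pose_to_matrix(x, y, z, pitch, roll, yaw):
    """Build a 4x4 transform, assuming R = Rz(yaw) * Ry(pitch) * Rx(roll) in radians."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, x],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, y],
        [-sp, cp * sr, cp * cr, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


# Pure translation: the rotation part is the identity.
translation_only = pose_to_matrix(1.0, 2.0, 3.0, 0.0, 0.0, 0.0)
print(translation_only[0])  # [1.0, 0.0, 0.0, 1.0]
```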

In step S250, the computing device may calculate the output value of the loss function based on the first depth map, the second depth map, and the transformation matrix. For example, the computing device may multiply the second frame by the transformation matrix and generate a third frame based on the depth information of the first depth map. According to an embodiment, if the third frame generated from the transformation matrix and the first depth map is identical to the first frame, the loss function may be null.

According to an embodiment, the loss function may be determined as a combination of first to third auxiliary functions described below. The loss function may include all of the first to third auxiliary functions. As another example, the loss function may include only some of the first to third auxiliary functions. The first auxiliary function may be calculated based on the transformation matrix and the difference between the third frame and the first frame. For example, the first auxiliary function may be expressed as Equation 1.

In Equation 1, V denotes the valid points of one frame that are successfully projected onto the image plane of the other frame, a second symbol denotes the number of those points, and the remaining three symbols denote, in order, the first frame, the third frame, and the second frame.
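As a non-limiting illustration of the first auxiliary function, the sketch below averages the per-pixel difference between the first frame and the synthesized third frame over the valid points V. The function name `photometric_loss`, the use of an absolute difference, and the representation of frames as 2-D lists of grayscale intensities are assumptions made for this example and are not taken from Equation 1 itself.

```python
def photometric_loss(first_frame, third_frame, valid_points):
    """Mean absolute intensity difference over the valid points V."""
    if not valid_points:
        raise ValueError("V must contain at least one valid point")
    total = sum(abs(first_frame[row][col] - third_frame[row][col])
                for row, col in valid_points)
    return total / len(valid_points)


first = [[0.5, 0.25], [1.0, 0.0]]
third = [[0.5, 0.75], [0.5, 0.0]]
valid = [(0, 0), (0, 1), (1, 0)]  # (row, column) pairs; pixel (1, 1) is excluded
print(photometric_loss(first, third, valid))
```

When the third frame equals the first frame at every valid point, the value is zero, which matches the case described for step S250.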

According to an embodiment, the first auxiliary function may be calculated on the assumption that a static scene is captured with a moving camera. Under this assumption, however, the network may behave so as to produce infinite distance values for moving objects or for scenes in which the camera is stationary.

According to another embodiment, the first auxiliary function based on Equation 1 may be unable to account for the loss that arises when the amount of light changes in real time in a real environment, so that pixel brightness values are not constant between the two frames. Because of this, the computing device may fail to train properly. To address this problem, the first auxiliary function may further include a term representing the structural similarity index between the first frame and the third frame for the valid pixels. The structural similarity index may be based on a comparison of luminance, contrast, and structure between pixels. The first auxiliary function including the structural similarity index may be expressed as Equation 2.

In Equation 2, V denotes the valid points of one frame that are successfully projected onto the image plane of the other frame, a second symbol denotes the number of those points, and the remaining three symbols denote, in order, the first frame, the third frame, and the second frame. SSIM is a function that normalizes pixel brightness values between images, and the SSIM term denotes the element-wise similarity between the first frame and the third frame computed with the SSIM function. One coefficient in Equation 2 is a hyperparameter that can be set directly.
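As a non-limiting illustration, one form that is consistent with the definitions above, but is not taken from Equation 2 itself, weights an absolute-difference term against a structural-similarity term. The symbols I_1 (first frame), I_3 (third frame), the weight λ, the L1 norm, and the (1 - SSIM)/2 scaling are all assumptions made for this sketch.

```latex
L = \frac{1}{|V|} \sum_{p \in V}
    \left(
      \lambda \, \bigl\lVert I_1(p) - I_3(p) \bigr\rVert_1
      + (1 - \lambda) \, \frac{1 - \mathrm{SSIM}_{13}(p)}{2}
    \right)
```

Under these assumptions, setting λ = 1 removes the SSIM term and leaves only the averaged absolute difference over V.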

According to an embodiment, the computing device may calculate the difference between the first depth map and the second depth map based on Equation 3.

In Equation 3, one term denotes the third depth map of the third frame, obtained by reconstructing the first depth map of the first frame using the transformation matrix, and the other term denotes the fourth depth map, obtained by reconstructing the second depth map using interpolation. That is, the difference between the first depth map and the second depth map may be expressed as a function that normalizes the difference between the third depth map and the fourth depth map. By using a normalized function, Equation 3 can reflect depth differences more equally than the absolute value of the difference between the first depth map and the second depth map. In addition, because the value of the function falls between 0 and 1, numerically stable training may be possible during optimization.
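As a non-limiting illustration, the normalized difference can be sketched as the difference between the third and fourth depth maps divided by their sum, computed per pixel. The function name `depth_difference`, the use of the absolute value in the numerator, and the assumption of strictly positive depths are choices made for this example.

```python
def depth_difference(third_depth, fourth_depth):
    """Per-pixel |D3 - D4| / (D3 + D4); values lie in [0, 1) for positive depths."""
    return [
        [abs(d3 - d4) / (d3 + d4) for d3, d4 in zip(row3, row4)]
        for row3, row4 in zip(third_depth, fourth_depth)
    ]


third_depth_map = [[1.0, 2.0], [10.0, 20.0]]
fourth_depth_map = [[1.0, 6.0], [30.0, 20.0]]
print(depth_difference(third_depth_map, fourth_depth_map))  # [[0.0, 0.5], [0.5, 0.0]]
```

In this toy example a gap of 4 at shallow depth and a gap of 20 at larger depth both yield 0.5, which illustrates why the normalized form treats differences more equally than a raw absolute difference.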

According to another embodiment, the computing device may acquire an image of a scene containing moving or occluded objects. In this case, the computing device may not be able to train smoothly. To address this problem, the computing device may multiply the first auxiliary function by a normalization function. The normalization function is defined on a third depth map, obtained by reconstructing the first depth map with the transformation matrix, and a fourth depth map, obtained by reconstructing the second depth map so that it lies on the same pixel grid as the first depth map; it divides the difference between the third depth map and the fourth depth map by their sum. Specifically, the first auxiliary function may be calculated based on Equation 4.

In Equation 4, one term is defined by an expression [not reproduced in this text], and the depth-difference term is the function based on Equation 3 that represents the difference between the first depth map and the second depth map. In addition, V denotes the valid points of one frame that are successfully projected onto the image plane of the other frame, and a further symbol denotes the number of those points.

According to another embodiment, the computing device may acquire an image that contains an object moving at the same speed as the camera, or an image captured by a stationary camera. In this case, the computing device may calculate the first auxiliary function only for pixels at which the difference between the first frame and the third frame is smaller than the difference between the first frame and the second frame. Specifically, the first auxiliary function may be calculated based on Equation 5.

In Equation 5, one term is defined by an expression [not reproduced in this text], and the depth-difference term is the function based on Equation 3 that represents the difference between the first depth map and the second depth map. The mask term in Equation 5 is likewise defined by an expression [not reproduced in this text], in which one symbol denotes the image obtained by reconstructing the first frame based on the second depth map and the transformation matrix, and the square brackets denote the Iverson bracket. That is, the mask term acts as a binary mask so that the computing device calculates the first auxiliary function only for valid pixels that satisfy a predetermined condition. The predetermined condition is that, at the pixel, the difference between the first frame and the third frame is smaller than the difference between the first frame and the second frame. In addition, V denotes the valid points of one frame that are successfully projected onto the image plane of the other frame, and a further symbol denotes the number of those points.
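As a non-limiting illustration of the binary mask, the sketch below evaluates the stated condition per pixel and returns 1 where it holds and 0 elsewhere, which is what an Iverson bracket expresses. The function name `binary_mask` and the use of absolute grayscale differences are assumptions for this example.

```python
def binary_mask(first_frame, second_frame, third_frame):
    """1 where |I1 - I3| < |I1 - I2| at a pixel, otherwise 0."""
    return [
        [1 if abs(i1 - i3) < abs(i1 - i2) else 0
         for i1, i2, i3 in zip(row1, row2, row3)]
        for row1, row2, row3 in zip(first_frame, second_frame, third_frame)
    ]


frame_1 = [[0.5, 0.5]]
frame_2 = [[1.0, 0.5]]   # second pixel is unchanged between frames 1 and 2
frame_3 = [[0.75, 1.0]]
print(binary_mask(frame_1, frame_2, frame_3))  # [[1, 0]]
```

The second pixel is masked out because the first and second frames already agree there, as happens for a stationary camera or an object moving with the camera.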

According to another embodiment, when the computing device acquires a low-quality image or an image having uniform regions, training may be difficult because accurate information cannot be obtained from the acquired image. The second auxiliary function may account for the loss associated with such low-quality images or images with uniform regions. The second auxiliary function may be calculated by summing, over all pixels in space, the product of the spatial gradient component of the first frame and the gradient component of the first depth map. Specifically, the computing device may calculate the second auxiliary function based on Equation 6.

In Equation 6, the image term denotes the first frame, the depth term denotes the first depth map, and ∇ denotes the directional derivative in space.
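As a rough illustration of the second auxiliary function, the sketch below sums a product of an image-gradient term and a depth-gradient term over all pixels. The exact form of Equation 6 is not recoverable from this text, so the sketch assumes the edge-aware weighting exp(-|∇I|)·|∇D| commonly used in self-supervised depth estimation. That weighting, the function name, and the single-channel list-of-lists layout are assumptions, not the patent's stated formula.

```python
import math


def smoothness_loss(image, depth):
    """Sum exp(-|dI|) * |dD| over horizontal and vertical neighbours.

    Assumed form: depth changes are penalised less where the image itself
    has a strong gradient (a likely object boundary).
    """
    height, width = len(image), len(image[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            # Forward difference towards the right-hand neighbour.
            if x + 1 < width:
                grad_image = abs(image[y][x + 1] - image[y][x])
                grad_depth = abs(depth[y][x + 1] - depth[y][x])
                total += math.exp(-grad_image) * grad_depth
            # Forward difference towards the neighbour below.
            if y + 1 < height:
                grad_image = abs(image[y + 1][x] - image[y][x])
                grad_depth = abs(depth[y + 1][x] - depth[y][x])
                total += math.exp(-grad_image) * grad_depth
    return total


# Hypothetical 1x3 image with an intensity edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0]]
depth_step_on_edge = [[1.0, 1.0, 2.0]]   # depth changes at the image edge
depth_step_in_flat = [[1.0, 2.0, 2.0]]   # depth changes in a uniform region

loss_on_edge = smoothness_loss(image, depth_step_on_edge)
loss_in_flat = smoothness_loss(image, depth_step_in_flat)
```

Under this assumed weighting, a depth discontinuity inside a uniform image region costs more than the same discontinuity at an image edge, which is the behaviour wanted for textureless areas.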

According to another embodiment, because the computing device acquires images captured by a monocular camera, scale ambiguity may cause errors to accumulate in the artificial neural network. To compensate for the loss caused by scale ambiguity, the computing device may calculate a third auxiliary function that normalizes the difference between the depth values of consecutive images and minimizes that difference. Through the third auxiliary function, the computing device may keep the scale of the depth information consistent between acquired frames, which makes camera pose estimation possible. Specifically, the third auxiliary function may be calculated based on Equation 7.

In Equation 7, V denotes the valid points of one frame that are successfully projected onto the image plane of the other frame, the count term denotes the number of points in V, and the depth-difference term is the function, based on Equation 3, that represents the difference between the first depth map and the second depth map.
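The third auxiliary function can be sketched as the mean of a normalized depth difference over the valid points. Following the description of the normalization function in the claims, the sketch assumes the form |D_a - D_b| / (D_a + D_b) for two depth values that have already been brought onto the same pixel grid. The function names and the flat-list layout are hypothetical.

```python
def depth_difference(depth_a, depth_b):
    """Normalized difference of two positive depth values, in [0, 1)."""
    return abs(depth_a - depth_b) / (depth_a + depth_b)


def geometry_consistency(depths_a, depths_b, valid):
    """Mean normalized depth difference over the points flagged as valid."""
    diffs = [
        depth_difference(a, b)
        for a, b, is_valid in zip(depths_a, depths_b, valid)
        if is_valid
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Hypothetical aligned depth samples; the third point failed to project.
depths_a = [2.0, 4.0, 1.0]
depths_b = [2.0, 12.0, 3.0]
valid = [True, True, False]

loss = geometry_consistency(depths_a, depths_b, valid)

# Scaling both depth maps by the same factor leaves the value unchanged.
scaled = geometry_consistency(
    [10 * d for d in depths_a], [10 * d for d in depths_b], valid
)
```

Because the difference is divided by the sum, the measure does not depend on the absolute scale of the two depth maps, only on how well they agree with each other.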

According to another embodiment, the computing device may acquire images that contain an object moving at the same speed as the camera, or images captured by a stationary camera. In such cases, the computing device may calculate the third auxiliary function only for pixels where the difference between the first frame and the third frame is smaller than the difference between the first frame and the second frame. Specifically, the third auxiliary function may be calculated based on Equation 8.

In Equation 8, V denotes the valid points of one frame that are successfully projected onto the image plane of the other frame, the count term denotes the number of points in V, and the depth-difference term is the function, based on Equation 3, that represents the difference between the first depth map and the second depth map. The reconstructed image in Equation 8 is the first frame reconstructed from the second depth map and the transformation matrix, and the square brackets denote the Iverson bracket. That is, the bracketed term acts as a binary mask, so that the computing device evaluates the auxiliary function only for valid pixels that satisfy a predetermined condition. A pixel satisfies the predetermined condition when the difference between the first frame and the third frame at that pixel is smaller than the difference between the first frame and the second frame.

According to an embodiment, the loss function may be expressed as the sum of the first to third auxiliary functions. The computing device may calculate the loss function based on Equation 9.

In Equation 9, the first term is the first auxiliary function based on Equation 5, the second term is the second auxiliary function based on Equation 6, and the third term is the third auxiliary function based on Equation 8; α, β, and γ are hyperparameters that set the relative proportions of the first to third auxiliary functions.
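Read together with the description of α, β, and γ, Equation 9 appears to be a weighted sum of the three auxiliary functions. The sketch below shows that combination under this reading. The function and argument names are hypothetical, and the default weights are the values this description reports for training (α = 1.0, β = 0.1, γ = 0.5).

```python
def total_loss(first_aux, second_aux, third_aux,
               alpha=1.0, beta=0.1, gamma=0.5):
    """Weighted sum of the three auxiliary function outputs."""
    return alpha * first_aux + beta * second_aux + gamma * third_aux


# Hypothetical auxiliary function outputs for one training batch.
loss = total_loss(first_aux=0.2, second_aux=0.5, third_aux=0.1)
```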

According to an embodiment, the computing device may generate six tokens corresponding to the six degrees of freedom and set them to predetermined initial values. After the six tokens are set to the predetermined initial values, the computing device may update them as training progresses so that the output value of the loss function decreases.

FIG. 3 is a diagram for explaining the overall network structure of the six-degree-of-freedom pose estimation method.

According to an embodiment, referring to FIG. 3, the computing device may acquire a first frame 310 and input it to a depth estimation network 330 to generate a first depth map 311. Likewise, the computing device may acquire a second frame 320 and input it to the depth estimation network 330 to generate a second depth map 321.

According to an embodiment, the depth estimation network may be based on a ResNet model, which builds on a convolutional neural network (CNN). Specifically, the encoder part of the depth estimation network, which follows a U-Net structure, may consist of a typical convolutional neural network. For example, the encoder part of the depth estimation network may be based on the ResNet50 model. The decoder part of the depth estimation network may consist of upsampling and convolutional neural network layers. For example, the decoder part of the depth estimation network may be based on the DispResNet model.

Referring to FIG. 3, the computing device may combine the first frame and the second frame and input the result to the artificial neural network 350 to determine the six degrees of freedom. The artificial neural network 350 may consist of a feature extraction network and a dimensionality reduction network. The feature extraction network may be based on a ResNet model, which builds on a convolutional neural network (CNN). The dimensionality reduction network may be based on a dimension-reduction vision transformer network (Dim Reduction Vision Transformer Network).

According to an embodiment, the computing device may perform training using the depth maps 311 and 321 of the first frame and the second frame. During training, the computing device may compute the loss function and update the six tokens corresponding to the six degrees of freedom and the position embedding vector so that the output value of the loss function decreases. Using the updated six tokens and position embedding vector, the computing device may input the image 340, obtained by combining the first frame and the second frame, to the artificial neural network 350 to determine the six degrees of freedom.

FIG. 4 is a diagram for explaining a hybrid structure composed of a feature extraction network and a dimensionality reduction network according to an embodiment.

Referring to FIG. 4A, the artificial neural network may have a hybrid structure composed of a feature extraction network 420 and a dimensionality reduction network 430. The computing device may combine the first frame and the second frame, concatenating the three-channel first image and the three-channel second image along the channel direction to obtain a six-channel image 410. The computing device may input the six-channel image 410 to the feature extraction network 420. The feature extraction network may be based on a ResNet model, which builds on a convolutional neural network (CNN).
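The channel-direction concatenation that produces the six-channel image can be sketched as follows. The height × width × channel nested-list layout and the function name are assumptions for illustration; a real implementation would typically use a tensor library.

```python
def concat_channels(image_a, image_b):
    """Stack two H x W x 3 images into one H x W x 6 image."""
    if len(image_a) != len(image_b) or len(image_a[0]) != len(image_b[0]):
        raise ValueError("images must have the same height and width")
    return [
        # Per pixel, the channels of image_b are appended to those of image_a.
        [pixel_a + pixel_b for pixel_a, pixel_b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


# Hypothetical 1x2 RGB frames.
first_image = [[[10, 20, 30], [40, 50, 60]]]
second_image = [[[11, 21, 31], [41, 51, 61]]]

six_channel = concat_channels(first_image, second_image)
```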

According to an embodiment, the computing device may use the feature extraction network 420 to apply per-channel filters to the six-channel image 410, obtain the convolution results, and generate a feature map from those results. The computing device may reconstruct the generated feature map into a set of patches 422 of a fixed size and generate six tokens 423 of the same size as the patches. The six tokens 423 may be learnable parameters, each corresponding to one of the six degrees of freedom to be estimated. The computing device may combine the six tokens 423 with the patches to generate a combination set. The computing device may generate an input vector by adding a position embedding vector, which contains the position information of the patches 422, to the elements of the combination set.

Referring to FIG. 4A, the computing device may input the input vector to the dimensionality reduction network 430. The dimensionality reduction network may include a dimension-reduction vision transformer network (Dim Reduction Vision Transformer Network). The dimensionality reduction network 430 may consist of Layer Norm, Multi-head Self-Attention, an MLP (multilayer perceptron), and a Dim-reduction MLP, and may repeat the self-attention operation over L layers to reduce the dimension gradually. From the patches of the input vector whose size has been reduced by the dimensionality reduction network 430, the computing device may extract the patches 440 that are output for the six input tokens, and may estimate the six degrees of freedom by computing average pooling for each of the extracted patches 440.

FIG. 4B is a diagram for explaining a method of generating an input vector 426 using a feature map 421 according to an embodiment.

According to an embodiment, the computing device may generate a feature map 421 of the six-channel image 410 using the feature extraction network 420. Specifically, the computing device may apply per-channel filters to the six-channel image formed by combining the two three-channel RGB images, obtain the convolution results, and generate the feature map 421 from them. To perform a patch-wise attention operation on the three-dimensional feature map 421, the computing device may divide the feature map 421 into non-overlapping patches 422 of a fixed size and reconstruct it as a set of patches 422.

Referring to FIG. 4B, the computing device may generate six tokens 423 of the same size as the patches generated from the feature map of the first image and the second image. The six tokens 423 may be learnable parameters, each corresponding to one of the six degrees of freedom to be estimated. The computing device may combine the six tokens 423 with the patches to generate a combination set 424.

Referring to FIG. 4B, the patches 422 obtained by dividing the feature map 421 extracted by the convolutional neural network no longer carry position information once they are reconstructed as a set. To take position information into account, the computing device may therefore generate a position embedding vector 425 that reflects the positional characteristics of the patches 422 included in the feature map 421. The computing device may generate the input vector 426 by adding the position embedding vector 425 to the elements of the combination set 424. The computing device may input the input vector to the dimensionality reduction network 430.

According to an embodiment, when the feature map 421 contains n patches, the combination set 424 may include n+6 patches.
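The construction of the input vector from the patches, the six tokens, and the position embedding can be sketched as below. The sketch assumes each patch has already been flattened to a vector, that the six tokens are placed before the patches in the combination set, and that there is one position embedding vector per element of the set. The ordering, the names, and the tiny dimensions are assumptions for illustration only.

```python
def build_input(patches, tokens, pos_embeddings):
    """Concatenate 6 tokens with n patches and add position embeddings."""
    if len(tokens) != 6:
        raise ValueError("expected one token per degree of freedom")
    combined = list(tokens) + list(patches)          # n + 6 elements
    if len(pos_embeddings) != len(combined):
        raise ValueError("need one position embedding per element")
    # Element-wise sum of each set element and its position embedding.
    return [
        [value + offset for value, offset in zip(element, embedding)]
        for element, embedding in zip(combined, pos_embeddings)
    ]


# Hypothetical sizes: n = 3 patches, each flattened to 2 values.
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
tokens = [[0.0, 0.0] for _ in range(6)]
pos_embeddings = [[float(i), float(i)] for i in range(len(patches) + 6)]

input_vector = build_input(patches, tokens, pos_embeddings)
```

In training, both the tokens and the position embeddings would be learnable parameters; here they are fixed numbers so that the shapes are easy to follow.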

According to an embodiment, the computing device may extract, from the patches of the input vector whose size has been reduced by the dimensionality reduction network 430, the patches 440 corresponding to the six tokens 423. The computing device may estimate the six degrees of freedom by computing average pooling for each of the extracted patches 440.
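The read-out step can be sketched as selecting the six token outputs and averaging each one to a single value. The sketch assumes the six tokens occupy the first six positions of the network output; that ordering and the function name are assumptions, and the numbers are arbitrary.

```python
def six_dof_from_output(output, num_tokens=6):
    """Average-pool each token output to one scalar per degree of freedom."""
    token_outputs = output[:num_tokens]   # assumes tokens come first
    return [sum(vector) / len(vector) for vector in token_outputs]


# Hypothetical reduced output: 6 token vectors followed by 2 patch vectors.
network_output = [
    [0.1, 0.3], [0.0, 0.2], [1.0, 3.0],
    [-0.5, 0.5], [0.25, 0.75], [2.0, 2.0],
    [9.0, 9.0], [8.0, 8.0],
]

pose = six_dof_from_output(network_output)
```

The remaining patch outputs are discarded at this stage; only the six pooled values are interpreted as the pose.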

FIG. 5 is an exemplary diagram of training data for training an artificial neural network for six-degree-of-freedom pose estimation according to an embodiment.

According to an embodiment, the computing device may acquire training data using a total of four cameras (two grayscale cameras and two color cameras), a rotating laser scanner, and GPS/IMU equipment mounted on a vehicle. As shown in FIG. 5, the training data may be images captured in various environments in the city of Karlsruhe, Germany, such as residential areas, campuses, and roads.

According to an embodiment, the computing device may perform training using images that have undergone preprocessing such as segmentation labels, object detection labels, and object tracking labels. For example, the training images may be preprocessed to a resolution of 832*256 and augmented with random resizing, cropping, flipping, and the like. Specifically, the computing device may train using the AdamW optimizer with a learning rate of 10^-4, photometric (optical) loss function hyperparameters λ_i = 0.15 and λ_s = 0.85, and overall loss function hyperparameters α = 1.0, β = 0.1, and γ = 0.5.

FIG. 6 is an exemplary diagram showing qualitative test results for the six-degree-of-freedom pose estimation method according to an embodiment.

Referring to FIG. 6, the computing device may run two experiments with different amounts of training data. The first experiment (K) uses a total of 8,146 images, and the second experiment (CS+K) uses a total of 65,688 images. The performance of the artificial neural network improved in the second experiment (CS+K), which used more data.

As the test results in Table 1 show, the artificial neural network according to an embodiment (rows 3 and 4) achieved lower translation error (t_err), rotation error (r_err), and absolute trajectory error (ATE) than the existing approach that uses only a convolutional network (rows 1 and 2).

Table 1

| Method | Seq. 09 t_err | Seq. 09 r_err | Seq. 09 ATE | Seq. 10 t_err | Seq. 10 r_err | Seq. 10 ATE |
|---|---|---|---|---|---|---|
| Prior art (K) | 12.571 | 3.339 | 56.832 | 10.054 | 4.911 | 20.706 |
| Prior art (CS+K) | 7.605 | 2.188 | 15.083 | 10.070 | 4.626 | 20.343 |
| Ours (K) | 7.323 | 2.087 | 31.040 | 8.102 | 3.667 | 15.325 |
| Ours (CS+K) | 4.363 | 1.322 | 13.932 | 6.608 | 2.986 | 12.356 |

FIG. 7 is a block diagram illustrating a six-degree-of-freedom pose estimation apparatus according to an embodiment.

Referring to FIG. 7, a computing device 700 according to an embodiment includes a processor 720. The computing device 700 may further include a memory 730 and a communication unit 710. The processor 720, the memory 730, and the communication unit 710 may communicate with each other through a communication bus (not shown).

The processor 720 may control the series of operations of the computing device 700 described above. More specifically, the processor 720 may input the first image and the second image to the artificial neural network, generate a feature map from the first image and the second image using the feature extraction network of the artificial neural network, generate a combination set using the six tokens corresponding to the six degrees of freedom and the feature map, generate a position embedding vector reflecting the positional characteristics of the patches included in the feature map, generate an input vector using the position embedding vector and the combination set, input the input vector to the dimensionality reduction network of the artificial neural network, and determine the six degrees of freedom from the output of the dimensionality reduction network.

The memory 730 may be a volatile memory or a non-volatile memory.

In addition, the processor 720 may execute a program and control the computing device 700. Program code executed by the processor 720 may be stored in the memory 730. The computing device 700 may be connected to an external device (for example, a personal computer or a network) through an input/output device (not shown) and exchange data with it. The computing device 700 may be mounted on a server.

Based on the description of the above embodiments, a person skilled in the art will clearly understand that the methods and/or processes of the present invention, and their steps, can be implemented in hardware, software, or any combination of hardware and software suited to a particular application. Furthermore, the subject matter of the technical solution of the present invention, or the parts that contribute over the prior art, may be implemented in the form of program instructions executable by various computer components and recorded on a machine-readable recording medium. The machine-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the machine-readable recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of machine-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROM, DVD, and Blu-ray; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code and bytecode but also high-level language code that a computer can execute using an interpreter or the like. Such instructions may be written in a structured programming language such as C, an object-oriented programming language such as C++, or another high-level or low-level programming language (assembly languages, hardware description languages, and database programming languages and techniques), and may be stored and compiled or interpreted to run on any of the devices described above, on a heterogeneous combination of processors, processor architectures, or combinations of different hardware and software, or on any other machine capable of executing program instructions.

Therefore, in one aspect of the present invention, when the methods described above and combinations thereof are performed by one or more computing devices, they may be implemented as executable code that performs each of the steps. In another aspect, the methods may be implemented as systems that perform the steps, and the methods may be distributed across devices in various ways, or all of the functions may be integrated into a single dedicated, stand-alone device or other hardware. In yet another aspect, the means for performing the steps associated with the processes described above may include any of the hardware and/or software described above. All such permutations and combinations are intended to fall within the scope of this disclosure.

For example, the hardware device may be configured to operate as one or more software modules in order to perform the processing according to the present invention, and vice versa. The hardware device may include a processor, such as an MPU, CPU, GPU, or TPU, that is coupled to a memory such as ROM/RAM for storing program instructions and is configured to execute the instructions stored in the memory, and may include an input/output unit capable of exchanging signals with external devices. In addition, the hardware device may include a keyboard, a mouse, and other external input devices for receiving instructions written by developers.

Although the present invention has been described above with reference to specific details, such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and a person of ordinary skill in the art to which the present invention pertains may make various modifications and variations based on this description.

Accordingly, the spirit of the present invention should not be limited to the embodiments described above; not only the claims set out below but also everything that is equal or equivalent to those claims falls within the scope of the spirit of the present invention.

Such equal or equivalent variations include, for example, logically equivalent methods that can produce the same result as carrying out the method according to the present invention. The true meaning and scope of the present invention should not be limited by the foregoing examples and should be understood in the broadest sense permitted by law.

Claims

A six-degree-of-freedom pose estimation method performed by a computing device, the method comprising:
obtaining a first image of a first frame and a second image of a second frame;
inputting the first image and the second image to an artificial neural network;
generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network;
generating a combination set using six tokens corresponding to the six degrees of freedom and the feature map;
generating a position embedding vector reflecting positional characteristics of patches included in the feature map, and generating an input vector using the position embedding vector and the combination set;
inputting the input vector to a dimensionality reduction network of the artificial neural network; and
determining the six degrees of freedom from an output of the dimensionality reduction network.

The method of claim 1,
wherein the feature extraction network extracts features using a six-channel image obtained by concatenating the three-channel first image and the three-channel second image along the channel direction.

The method of claim 1,
wherein generating the combination set using the six tokens and the feature map comprises:
reconstructing the generated feature map into a set of patches having a fixed size; and
combining the set of patches and the six tokens to generate the combination set,
wherein each of the six tokens has the same size as each of the patches.

The method of claim 3,
wherein the combination set includes n+6 patches, n being the number of patches included in the feature map.

The method of claim 1,
wherein the position embedding vector is a vector having the same size as the patches, and
wherein the input vector is determined based on the sum of the element vectors constituting the combination set and the position embedding vector.

The method of claim 1,
wherein the dimensionality reduction network reduces the size of the patches constituting the input vector by repeatedly performing a self-attention operation on the input vector.

The method of claim 1,
wherein determining the six degrees of freedom comprises:
extracting patches corresponding to the six tokens from among the patches of the input vector whose size has been reduced by the dimensionality reduction network; and
estimating the six degrees of freedom by calculating average pooling for each patch using the extracted patches.

In the learning method of an artificial neural network for estimating a six-degree-of-freedom posture, performed by a computing device,
obtaining a first image of a first frame and a second image of a second frame;
estimating a first depth map of the first image and a second depth map of the second image;
outputting six degree of freedom information by inputting the first image and the second image to the artificial neural network;
calculating a transformation matrix between the first frame and the second frame based on the six degree of freedom information; and
Calculating an output value of a loss function based on the first depth map, the second depth map, and the transformation matrix, and updating the artificial neural network based on the output value of the loss function,
In the step of outputting the six degree of freedom information,
inputting the first and second images to an artificial neural network;
generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network;
generating a combination set using 6 tokens corresponding to the 6 degrees of freedom and the feature map;
generating a positional embedding vector reflecting positional characteristics of the patches included in the feature map, and generating an input vector using the positional embedding vector and the combination set;
inputting the input vector to a dimensionality reduction network of the artificial neural network; and
determining the six degrees of freedom from the output of the dimensionality reduction network.
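
The step of turning the six-degree-of-freedom information into a transformation matrix can be illustrated as below. This is a sketch under assumptions the claims do not state: the six values are ordered as three translations followed by three Euler angles in radians, and the rotation is composed as Rz·Ry·Rx. The name `pose_to_matrix` is hypothetical.

```python
import numpy as np

def pose_to_matrix(pose):
    # pose: (tx, ty, tz, rx, ry, rz); ordering and Euler convention are assumed.
    tx, ty, tz, rx, ry, rz = pose
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    transform = np.eye(4)
    transform[:3, :3] = rot_z @ rot_y @ rot_x   # rotation block
    transform[:3, 3] = [tx, ty, tz]             # translation column
    return transform

T = pose_to_matrix([1.0, 2.0, 3.0, 0.0, 0.0, np.pi / 2])
```

With these inputs the rotation block maps the x axis onto the y axis and the last column carries the translation (1, 2, 3).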

The method of claim 8,
wherein the 6 tokens and the positional embedding vector are set to predetermined initial values and are then updated so that the output value of the loss function becomes smaller.

The method of claim 8,
wherein the output of the loss function is determined based on the output of a first auxiliary function that sums, over effective pixels satisfying a predetermined condition, the differences between the first frame and a third frame obtained by reconstructing the second frame using the transformation matrix and the first depth map,
wherein the first auxiliary function
further comprises a term indicating a structural similarity index between the first frame and the third frame with respect to the effective pixels, and
is calculated by multiplying by a normalization function defined on a third depth map, obtained by reconstructing the first depth map with the transformation matrix, and a fourth depth map, reconstructed such that the second depth map is positioned on the same pixel grid as the first depth map, the normalization function dividing a term formed from the third depth map and the fourth depth map by the sum of the third depth map and the fourth depth map,
wherein the structural similarity index is based on a comparison of luminance, contrast, and structure between pixels, and
wherein the predetermined condition is that, at a given pixel, the difference between the first frame and the third frame is smaller than the difference between the first frame and the second frame.
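
The effective-pixel condition and the summed difference can be sketched as follows. This is a simplified illustration on grayscale arrays that uses an absolute difference and deliberately omits the structural similarity term and the normalization weighting; the function name `masked_photometric_sum` is hypothetical.

```python
import numpy as np

def masked_photometric_sum(frame1, frame2, frame3):
    # frame3 is the second frame reconstructed into the first frame's view.
    diff_recon = np.abs(frame1 - frame3)
    diff_raw = np.abs(frame1 - frame2)
    # A pixel is effective when reconstruction lowers the difference.
    effective = diff_recon < diff_raw
    return float(diff_recon[effective].sum()), effective

frame1 = np.array([[0.5, 0.5], [0.5, 0.5]])
frame2 = np.array([[0.9, 0.5], [0.1, 0.5]])
frame3 = np.array([[0.6, 0.5], [0.5, 0.7]])
total, effective = masked_photometric_sum(frame1, frame2, frame3)
```

In this toy input only the left column is effective: the right column is excluded because reconstruction does not improve on the unwarped second frame there.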

The method of claim 10,
wherein the output of the loss function is determined by further taking into account the output of a second auxiliary function that sums, over all pixels, the products of the spatial gradient component of the first frame and the gradient component of the first depth map.
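
One possible reading of this second auxiliary function is an edge-aware smoothness term. The sketch below assumes the image-gradient factor enters as exp(-|∂I|), a form common in self-supervised depth estimation but not stated in the claim, and uses finite differences on grayscale arrays. The name `smoothness_term` is hypothetical.

```python
import numpy as np

def smoothness_term(image, depth):
    # Finite-difference gradients along x (columns) and y (rows).
    d_dx = np.abs(np.diff(depth, axis=1))
    d_dy = np.abs(np.diff(depth, axis=0))
    i_dx = np.abs(np.diff(image, axis=1))
    i_dy = np.abs(np.diff(image, axis=0))
    # Assumed weighting: depth gradients cost less where the image has edges.
    return float((d_dx * np.exp(-i_dx)).sum() + (d_dy * np.exp(-i_dy)).sum())

depth = np.array([[1.0, 2.0], [1.0, 2.0]])
flat_image = np.zeros((2, 2))
edge_image = np.array([[0.0, 1.0], [0.0, 1.0]])
loss_flat = smoothness_term(flat_image, depth)
loss_edge = smoothness_term(edge_image, depth)
```

A depth step that coincides with an image edge is penalized less than the same step in a textureless region, so `loss_edge` is smaller than `loss_flat`.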

The method of claim 11,
wherein the output of the loss function is determined by further taking into account the output of a third auxiliary function that sums the normalization function over all of the effective pixels.
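
A sketch of this third auxiliary function follows. The claims state only that the normalization function divides by the sum of the third and fourth depth maps; the numerator used here, the absolute difference of the two maps, is an assumption borrowed from common depth-consistency losses. The name `depth_consistency_term` and the `eps` guard are hypothetical.

```python
import numpy as np

def depth_consistency_term(depth3, depth4, effective, eps=1e-7):
    # Assumed normalization function: |D3 - D4| / (D3 + D4), per pixel.
    norm = np.abs(depth3 - depth4) / (depth3 + depth4 + eps)
    # Sum only over the effective pixels.
    return float(norm[effective].sum())

depth3 = np.full((2, 2), 3.0)
depth4 = np.full((2, 2), 1.0)
effective = np.array([[True, False], [True, False]])
value = depth_consistency_term(depth3, depth4, effective)
```

Each pixel contributes a value in [0, 1), so the term is insensitive to the absolute depth scale.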

The method of claim 8,
wherein the step of generating the combination set using the 6 tokens corresponding to the 6 degrees of freedom and the feature map comprises:
dividing the generated feature map into non-overlapping patches of a constant size;
generating a patch set composed of the patches;
generating 6 tokens having the same size as the patches and corresponding to the 6 degrees of freedom; and
combining the patch set and the 6 tokens to generate the combination set,
wherein the combination set
includes n+6 patches, n being the number of patches included in the feature map.
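
The patch splitting and token concatenation can be sketched as follows, assuming an (H, W, C) feature map whose height and width are multiples of the patch size, flattened patches, and tokens placed ahead of the patches; the ordering and the name `build_combination_set` are hypothetical.

```python
import numpy as np

def build_combination_set(feature_map, patch, tokens):
    h, w, c = feature_map.shape
    if h % patch or w % patch:
        raise ValueError("feature map must tile evenly into patches")
    # Split into non-overlapping patch x patch blocks and flatten each one.
    patches = (feature_map
               .reshape(h // patch, patch, w // patch, patch, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch * patch * c))
    if tokens.shape != (6, patches.shape[1]):
        raise ValueError("tokens must be six vectors of patch size")
    # Combination set: 6 tokens followed by the n patches (order assumed).
    return np.concatenate([tokens, patches], axis=0)

feature_map = np.arange(8 * 8 * 4, dtype=float).reshape(8, 8, 4)
tokens = np.zeros((6, 4 * 4 * 4))
combo = build_combination_set(feature_map, 4, tokens)
print(combo.shape)  # (10, 64)
```

Here n = 4, so the combination set holds n + 6 = 10 entries.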

The method of claim 8,
wherein the step of determining the six degrees of freedom comprises:
extracting patches corresponding to the six tokens from among the patches of the input vector whose size has been reduced by the dimensionality reduction network; and
estimating the six degrees of freedom by computing average pooling for each of the extracted patches.

A computing device
comprising a processor,
wherein the processor is configured to perform:
obtaining a first image of a first frame and a second image of a second frame;
inputting the first image and the second image to an artificial neural network;
generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network;
generating a combination set using 6 tokens corresponding to the 6 degrees of freedom and the feature map;
generating a positional embedding vector reflecting positional characteristics of the patches included in the feature map, and generating an input vector using the positional embedding vector and the combination set;
inputting the input vector to a dimensionality reduction network of the artificial neural network; and
determining the six degrees of freedom from the output of the dimensionality reduction network.
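
The step of generating the input vector from the positional embedding vector and the combination set can be illustrated as below. The claims say only that both are used; element-wise addition of an embedding with the same shape, as in typical vision transformers, is an assumption, and `make_input_vector` is a hypothetical name.

```python
import numpy as np

def make_input_vector(combination_set, pos_embedding):
    # combination_set: (n + 6, d); pos_embedding: same shape (assumed).
    if combination_set.shape != pos_embedding.shape:
        raise ValueError("positional embedding must match the combination set")
    # Assumed combination rule: element-wise addition.
    return combination_set + pos_embedding

rng = np.random.default_rng(0)
combination_set = rng.normal(size=(10, 64))
pos_embedding = np.full((10, 64), 0.5)   # stand-in for learned values
input_vector = make_input_vector(combination_set, pos_embedding)
```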

A computing device
comprising a processor,
wherein the processor is configured to perform:
obtaining a first image of a first frame and a second image of a second frame;
estimating a first depth map of the first image and a second depth map of the second image;
outputting six degree of freedom information by inputting the first image and the second image to an artificial neural network;
calculating a transformation matrix between the first frame and the second frame based on the six degree of freedom information; and
calculating an output value of a loss function based on the first depth map, the second depth map, and the transformation matrix, and updating the artificial neural network based on the output value of the loss function,
wherein the step of outputting the six degree of freedom information comprises:
inputting the first image and the second image to the artificial neural network;
generating a feature map from the first image and the second image using a feature extraction network of the artificial neural network;
generating a combination set using 6 tokens corresponding to the 6 degrees of freedom and the feature map;
generating a positional embedding vector reflecting positional characteristics of the patches included in the feature map, and generating an input vector using the positional embedding vector and the combination set;
inputting the input vector to a dimensionality reduction network of the artificial neural network; and
determining the six degrees of freedom from the output of the dimensionality reduction network.