KR20210138788A

KR20210138788A - Scenario depth and camera motion prediction method and device, device, medium and program

Info

Publication number: KR20210138788A
Application number: KR1020217036422A
Authority: KR
Inventors: Tao Han; Zhanpeng Zhang; Hui Cheng
Original assignee: Shenzhen SenseTime Technology Co., Ltd.
Priority date: 2020-04-28
Filing date: 2021-02-08
Publication date: 2021-11-19
Also published as: JP7178514B2; CN111540000B; TWI767596B; JP2022528012A; WO2021218282A1; TW202141428A; CN111540000A; KR102397268B1; CN113822918A

Abstract

The present invention relates to a scenario depth and camera motion prediction method and apparatus, an electronic device, a medium, and a program. The method includes: acquiring a target image frame at time t; and determining, by a scenario depth prediction network, a predicted depth map corresponding to the target image frame by performing scenario depth prediction on the target image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to the scenario depth, and the scenario depth prediction network is obtained through auxiliary training based on a camera motion prediction network.

Description

Scenario depth and camera motion prediction method and device, device, medium and program

This application is based on, and claims priority to, Chinese patent application No. 202010348872.2, filed on April 28, 2020, the entire contents of which are incorporated herein by reference.

The present invention relates to the field of computer technology, and relates to, but is not limited to, a scenario depth and camera motion prediction method and apparatus, an electronic device, a computer-readable storage medium, and a computer program.

Predicting scenario depth and camera motion from images captured by a monocular image acquisition device (for example, a monocular camera) has been an active and important research direction in computer vision over the past 20 years, and is widely applied in fields such as augmented reality, autonomous driving, and mobile robot positioning and navigation.

Embodiments of the present invention provide technical solutions for a scenario depth and camera motion prediction method and apparatus, an electronic device, a medium, and a program.

A scenario depth prediction method provided in an embodiment of the present invention includes: acquiring a target image frame at time t; and determining, by a scenario depth prediction network, a predicted depth map corresponding to the target image frame by performing scenario depth prediction on the target image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to the scenario depth, and the scenario depth prediction network is obtained through auxiliary training based on a camera motion prediction network.

In some embodiments of the present invention, determining the predicted depth map corresponding to the target image frame, by the scenario depth prediction network performing scenario depth prediction on the target image frame using the first hidden state information at time t-1, includes: performing feature extraction on the target image frame to determine a first feature map corresponding to the target image frame, the first feature map being a feature map related to the scenario depth; determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and determining the predicted depth map according to the first hidden state information at time t.
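The three steps above (feature extraction, hidden-state update, depth decoding) can be sketched as follows. This is a minimal illustration only: the frame is modeled as a flat list of pixel values, and the function names, the fixed blending gate, and the softplus decoder are all assumptions made for the example rather than details given in this description.

```python
import math
from typing import List, Tuple

Vector = List[float]


def extract_features(frame: Vector) -> Vector:
    """Toy stand-in for a feature extractor: centre the pixel values."""
    mean = sum(frame) / len(frame)
    return [p - mean for p in frame]


def update_hidden(feature: Vector, prev_hidden: Vector, gate: float = 0.5) -> Vector:
    """Blend the current feature map with the hidden state from time t-1.

    A fixed gate is an assumption; a trained recurrent cell would learn it.
    """
    return [gate * f + (1.0 - gate) * h for f, h in zip(feature, prev_hidden)]


def decode_depth(hidden: Vector) -> Vector:
    """Map the hidden state at time t to strictly positive depths (softplus)."""
    return [math.log1p(math.exp(h)) for h in hidden]


def predict_depth(frame: Vector, prev_hidden: Vector) -> Tuple[Vector, Vector]:
    """Return (predicted depth map, hidden state at time t)."""
    feature = extract_features(frame)               # first feature map
    hidden = update_hidden(feature, prev_hidden)    # hidden state at time t
    return decode_depth(hidden), hidden
```

The hidden state returned for time t would be passed back in as `prev_hidden` when the frame at time t+1 is processed.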

In some embodiments of the present invention, the first hidden state information at time t-1 includes the first hidden state information at different scales at time t-1. Performing feature extraction on the target image frame to determine the first feature map corresponding to the target image frame includes: performing multi-scale downsampling on the target image frame to determine the first feature maps at different scales corresponding to the target image frame. Determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1 includes: for any scale, determining the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1. Determining the predicted depth map according to the first hidden state information at time t includes: performing feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map.
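The multi-scale variant can be illustrated with a one-dimensional toy example: the frame is downsampled repeatedly, one hidden state is kept per scale, and the per-scale hidden states are fused after upsampling. Pair averaging, nearest-neighbour upsampling, and mean fusion are assumptions chosen for brevity; the description does not specify these operators.

```python
from typing import List, Tuple

Vector = List[float]


def downsample(x: Vector) -> Vector:
    """Halve the resolution by averaging neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample(x: Vector, size: int) -> Vector:
    """Nearest-neighbour upsampling to the requested length."""
    return [x[i * len(x) // size] for i in range(size)]


def multi_scale_features(frame: Vector, num_scales: int) -> List[Vector]:
    """Feature maps at successively coarser scales."""
    feats, current = [], frame
    for _ in range(num_scales):
        current = downsample(current)
        feats.append(current)
    return feats


def update_hidden(feature: Vector, prev_hidden: Vector, gate: float = 0.5) -> Vector:
    """Per-scale hidden-state update (fixed gate is an assumption)."""
    return [gate * f + (1.0 - gate) * h for f, h in zip(feature, prev_hidden)]


def fuse(hiddens: List[Vector], size: int) -> Vector:
    """Fuse all scales by upsampling to a common size and averaging."""
    up = [upsample(h, size) for h in hiddens]
    return [sum(col) / len(up) for col in zip(*up)]


def multi_scale_step(frame: Vector, prev_hiddens: List[Vector]) -> Tuple[Vector, List[Vector]]:
    """Return (fused output at frame resolution, hidden states at time t)."""
    feats = multi_scale_features(frame, len(prev_hiddens))
    hiddens = [update_hidden(f, h) for f, h in zip(feats, prev_hiddens)]
    return fuse(hiddens, len(frame)), hiddens
```

In this sketch the fused vector stands in for the predicted depth map; each scale only ever reads the hidden state of the same scale from time t-1, which mirrors the per-scale update described above.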

In some embodiments of the present invention, the method further includes: acquiring a sample image frame sequence corresponding to time t, the sample image frame sequence including a first sample image frame at time t and adjacent sample image frames of the first sample image frame; determining, by a camera motion prediction network, a sample predicted camera motion corresponding to the sample image frame sequence by performing camera pose prediction on the sample image frame sequence using second hidden state information at time t-1, the second hidden state information including feature information related to camera motion; determining, by a scenario depth prediction network to be trained, a sample predicted depth map corresponding to the first sample image frame by performing scenario depth prediction on the first sample image frame using the first hidden state information at time t-1, the first hidden state information including feature information related to the scenario depth; constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and training the scenario depth prediction network to be trained according to the loss function to obtain the scenario depth prediction network.

In some embodiments of the present invention, constructing the loss function according to the sample predicted depth map and the sample predicted camera motion includes: determining, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames of the first sample image frame in the sample image frame sequence with respect to the first sample image frame; determining a penalty function term according to the distribution continuity of the sample predicted depth map; and constructing the loss function according to the reprojection error term and the penalty function term.
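A loss of this shape can be sketched as the sum of a reprojection error term and a weighted continuity penalty. The sketch assumes the adjacent frames have already been reprojected (warped) into the view of the first sample image frame using the predicted depth and camera motion; the warping itself, the mean absolute error, the first-difference penalty, and the `smooth_weight` parameter are all illustrative assumptions.

```python
from typing import List

Vector = List[float]


def reprojection_error(target: Vector, warped_neighbours: List[Vector]) -> float:
    """Mean absolute difference between the target frame and each
    adjacent frame after reprojection into the target view."""
    total, count = 0.0, 0
    for warped in warped_neighbours:
        for a, b in zip(target, warped):
            total += abs(a - b)
            count += 1
    return total / count


def smoothness_penalty(depth: Vector) -> float:
    """Penalise discontinuities in the predicted depth map
    (mean absolute first difference, an assumed form)."""
    diffs = [abs(depth[i + 1] - depth[i]) for i in range(len(depth) - 1)]
    return sum(diffs) / len(diffs)


def total_loss(target: Vector, warped_neighbours: List[Vector],
               depth: Vector, smooth_weight: float = 0.1) -> float:
    """Loss = reprojection error term + weighted penalty term."""
    return (reprojection_error(target, warped_neighbours)
            + smooth_weight * smoothness_penalty(depth))
```

Because the reprojection depends on both the predicted depth and the predicted camera motion, minimizing such a loss couples the two networks without requiring ground-truth depth or pose labels.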

Embodiments of the present invention also provide a camera motion prediction method, including: acquiring an image frame sequence corresponding to time t, the image frame sequence including a target image frame at time t and adjacent image frames of the target image frame; and determining, by a camera motion prediction network, a predicted camera motion corresponding to the image frame sequence by performing camera pose prediction on the image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained through auxiliary training based on a scenario depth prediction network.

In some embodiments of the present invention, determining the predicted camera motion corresponding to the image frame sequence, by the camera motion prediction network performing camera pose prediction on the image frame sequence using the second hidden state information at time t-1, includes: performing feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, the second feature map being a feature map related to camera motion; determining the second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and determining the predicted camera motion according to the second hidden state information at time t.

In some embodiments of the present invention, the predicted camera motion includes relative poses between adjacent image frames in the image frame sequence.
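To illustrate what a relative pose between adjacent frames provides, the sketch below chains relative poses into a trajectory. For brevity it uses planar rigid transforms (rotation angle plus 2-D translation) as 3x3 homogeneous matrices instead of full 6-degree-of-freedom camera poses, and it assumes the convention that the pose at time t equals the pose at time t-1 composed with the relative pose; both choices are assumptions made for the example.

```python
import math
from typing import List

Matrix = List[List[float]]


def se2(theta: float, tx: float, ty: float) -> Matrix:
    """Planar rigid transform as a 3x3 homogeneous matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def compose(a: Matrix, b: Matrix) -> Matrix:
    """Matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def accumulate(relative_poses: List[Matrix]) -> List[Matrix]:
    """Chain relative poses between adjacent frames into absolute poses,
    starting from the identity at the first frame."""
    pose = se2(0.0, 0.0, 0.0)
    trajectory = [pose]
    for rel in relative_poses:
        pose = compose(pose, rel)
        trajectory.append(pose)
    return trajectory
```

For example, applying the same relative motion twice (a quarter turn followed by one unit forward in the local frame) yields a camera that has turned half a circle in total.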

In some embodiments of the present invention, the method further includes: acquiring a sample image frame sequence corresponding to time t, the sample image frame sequence including a first sample image frame at time t and adjacent sample image frames of the first sample image frame; determining, by a scenario depth prediction network, a sample predicted depth map corresponding to the first sample image frame by performing scenario depth prediction on the first sample image frame using first hidden state information at time t-1, the first hidden state information including feature information related to the scenario depth; determining, by a camera motion prediction network to be trained, a sample predicted camera motion corresponding to the sample image frame sequence by performing camera pose prediction on the sample image frame sequence using the second hidden state information at time t-1, the second hidden state information including feature information related to camera motion; constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and training the camera motion prediction network to be trained according to the loss function to obtain the camera motion prediction network.

Embodiments of the present invention also provide a scenario depth prediction apparatus, including: a first acquisition module configured to acquire a target image frame at time t; and a first scenario depth prediction module configured to determine, by a scenario depth prediction network, a predicted depth map corresponding to the target image frame by performing scenario depth prediction on the target image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to the scenario depth, and the scenario depth prediction network is obtained through auxiliary training based on a camera motion prediction network.

In some embodiments of the present invention, the first scenario depth prediction module includes: a first determination submodule configured to perform feature extraction on the target image frame to determine a first feature map corresponding to the target image frame, the first feature map being a feature map related to the scenario depth; a second determination submodule configured to determine the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and a third determination submodule configured to determine the predicted depth map according to the first hidden state information at time t.

In some embodiments of the present invention, the first hidden state information at time t-1 includes the first hidden state information at different scales at time t-1. The first determination submodule is specifically configured to perform multi-scale downsampling on the target image frame to determine the first feature maps at different scales corresponding to the target image frame. The second determination submodule is specifically configured to, for any scale, determine the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1. The third determination submodule is specifically configured to perform feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map.

In some embodiments of the present invention, the apparatus further includes a first training module, and the first training module is configured to:

acquire a sample image frame sequence corresponding to time t, the sample image frame sequence including a first sample image frame at time t and adjacent sample image frames of the first sample image frame;

determine, by a camera motion prediction network, a sample predicted camera motion corresponding to the sample image frame sequence by performing camera pose prediction on the sample image frame sequence using second hidden state information at time t-1, the second hidden state information including feature information related to camera motion;

determine, by a scenario depth prediction network to be trained, a sample predicted depth map corresponding to the first sample image frame by performing scenario depth prediction on the first sample image frame using the first hidden state information at time t-1, the first hidden state information including feature information related to the scenario depth;

construct a loss function according to the sample predicted depth map and the sample predicted camera motion; and

train the scenario depth prediction network to be trained according to the loss function to obtain the scenario depth prediction network.

In some embodiments of the present invention, the first training module is specifically configured to: determine, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames of the first sample image frame in the sample image frame sequence with respect to the first sample image frame; determine a penalty function term according to the distribution continuity of the sample predicted depth map; and construct the loss function according to the reprojection error term and the penalty function term.

Embodiments of the present invention also provide a camera motion prediction apparatus, including: a second acquisition module configured to acquire an image frame sequence corresponding to time t, the image frame sequence including a target image frame at time t and adjacent image frames of the target image frame; and a first camera motion prediction module configured to determine, by a camera motion prediction network, a predicted camera motion corresponding to the image frame sequence by performing camera pose prediction on the image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained through auxiliary training based on a scenario depth prediction network.

In some embodiments of the present invention, the first camera motion prediction module includes: a sixth determination submodule configured to perform feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, the second feature map being a feature map related to camera motion; a seventh determination submodule configured to determine the second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and an eighth determination submodule configured to determine the predicted camera motion according to the second hidden state information at time t.

In some embodiments of the present invention, the predicted camera motion includes relative poses between adjacent image frames in the image frame sequence.

In some embodiments of the present invention, the apparatus further includes a second training module, and the second training module is configured to:

acquire a sample image frame sequence corresponding to time t, the sample image frame sequence including a first sample image frame at time t and adjacent sample image frames of the first sample image frame;

determine, by a scenario depth prediction network, a sample predicted depth map corresponding to the first sample image frame by performing scenario depth prediction on the first sample image frame using first hidden state information at time t-1, the first hidden state information including feature information related to the scenario depth;

determine, by a camera motion prediction network to be trained, a sample predicted camera motion corresponding to the sample image frame sequence by performing camera pose prediction on the sample image frame sequence using the second hidden state information at time t-1, the second hidden state information including feature information related to camera motion;

construct a loss function according to the sample predicted depth map and the sample predicted camera motion; and

train the camera motion prediction network to be trained according to the loss function to obtain the camera motion prediction network.

In some embodiments of the present invention, the second training module is specifically configured to: determine, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames of the first sample image frame in the sample image frame sequence with respect to the first sample image frame; determine a penalty function term according to the distribution continuity of the sample predicted depth map; and construct the loss function according to the reprojection error term and the penalty function term.

Embodiments of the present invention also provide an electronic device, including: a processor; and a memory for storing processor-executable instructions, wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.

Embodiments of the present invention also provide a computer-readable medium storing computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above method.

Embodiments of the present invention further provide a computer program including computer-readable code, wherein when the computer-readable code runs on an electronic device, a processor in the electronic device executes it to implement any one of the above methods.

In embodiments of the present invention, a target image frame corresponding to time t is acquired. Because the scenario depths at adjacent times are correlated in the time series, performing scenario depth prediction on the target image frame through the scenario depth prediction network, using the first hidden state information related to the scenario depth at time t-1, yields a predicted depth map with higher prediction accuracy for the target image frame.

In embodiments of the present invention, an image frame sequence corresponding to time t is acquired, the sequence including the target image frame at time t and adjacent image frames of the target image frame. Because the camera poses at adjacent times are correlated in the time series, performing camera pose prediction on the image frame sequence through the camera motion prediction network, using the second hidden state information related to camera motion at time t-1, yields a predicted camera motion with high prediction accuracy.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and are not intended to limit the present invention. Other features and aspects of the present invention will become clearer from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

The accompanying drawings are incorporated in and constitute a part of this specification. They illustrate embodiments consistent with the present invention and, together with the specification, serve to explain its technical solutions.
FIG. 1 is a flowchart of a scenario depth prediction method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a scenario depth prediction network according to an embodiment of the present invention;
FIG. 3 is a block diagram of unsupervised network training according to an embodiment of the present invention;
FIG. 4 is a flowchart of a camera motion prediction method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a scenario depth prediction apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a camera motion prediction apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention; and
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Various exemplary embodiments, features, and aspects of the present invention are described in detail below with reference to the drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically stated otherwise.

The word "exemplary" as used herein means "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or D" may mean three cases: A exists alone, both A and D exist, or D exists alone. In addition, the term "at least one" herein means any one of a plurality of items or any combination of at least two of them. For example, at least one of A, C, and D may mean any one or more elements selected from the set consisting of A, C, and D.

In addition, numerous specific details are given in the embodiments below to better illustrate the present invention. Those skilled in the art should understand that the present invention can be practiced without some of these specific details. In some examples, methods, means, components and circuits well known to those skilled in the art are not described in detail, so as to emphasize the subject matter of the present invention.

FIG. 1 is a flowchart of a scenario depth prediction method according to an embodiment of the present invention. The scenario depth prediction method shown in FIG. 1 may be executed by a terminal device or another processing device. The terminal device may be a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. The other processing device may be a server, a cloud server, or the like. In some embodiments, the scenario depth prediction method may be implemented by a processor calling computer-readable instructions stored in a memory. As shown in FIG. 1, the method may include the following steps.

In step S11, a target image frame at time t is acquired.

In step S12, a scenario depth prediction network performs scenario depth prediction on the target image frame using first hidden state information at time t-1, thereby determining a predicted depth map corresponding to the target image frame. The first hidden state information includes feature information related to scenario depth, and the scenario depth prediction network is obtained through auxiliary training based on a camera motion prediction network.

In an embodiment of the present invention, a target image frame at time t is acquired. Because scenario depths at adjacent times are correlated in the time series, the scenario depth prediction network can use the first hidden state information related to scenario depth at time t-1 to perform scenario depth prediction on the target image frame, yielding a predicted depth map with higher prediction accuracy for the target image frame.

In some embodiments, the step in which the scenario depth prediction network performs scenario depth prediction on the target image frame using the first hidden state information at time t-1 to determine the predicted depth map corresponding to the target image frame may include: performing feature extraction on the target image frame to determine a first feature map corresponding to the target image frame, the first feature map being a feature map related to scenario depth; determining first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and determining the predicted depth map according to the first hidden state information at time t.
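The three sub-steps above can be sketched as a single function that threads the hidden state through one prediction step. This is only an illustration of the data flow: the names `depth_step`, `encode`, `update` and `decode` are hypothetical and do not appear in the source, and the toy callables below operate on plain numbers rather than images.

```python
from typing import Callable, Tuple, TypeVar

Frame = TypeVar("Frame")
Feature = TypeVar("Feature")
Hidden = TypeVar("Hidden")
Depth = TypeVar("Depth")


def depth_step(
    frame: Frame,
    prev_hidden: Hidden,
    encode: Callable[[Frame], Feature],
    update: Callable[[Feature, Hidden], Hidden],
    decode: Callable[[Hidden], Depth],
) -> Tuple[Depth, Hidden]:
    """One prediction step at time t (all three callables are hypothetical stand-ins)."""
    feature = encode(frame)                # first feature map of the frame at time t
    hidden = update(feature, prev_hidden)  # first hidden state at time t
    depth = decode(hidden)                 # predicted depth map at time t
    return depth, hidden


# Toy stand-ins operating on plain numbers, only to show the data flow.
depth, hidden = depth_step(
    frame=4.0,
    prev_hidden=1.0,
    encode=lambda x: x / 2.0,
    update=lambda f, h: 0.5 * f + 0.5 * h,
    decode=lambda h: 10.0 * h,
)
```

Returning the new hidden state alongside the depth map keeps the step stateless, so the caller decides where the state for time t is stored before the frame at time t+1 arrives.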

Because scenario depths at adjacent times are correlated in the time series, the scenario depth prediction network can determine the first hidden state information related to scenario depth at the current time (for example, time t) from the first feature map, related to scenario depth, of the target image frame at the current time and the first hidden state information related to scenario depth at the previous time (for example, time t-1). Scenario depth prediction is then performed on the target image frame based on the first hidden state information at the current time, yielding a predicted depth map with higher prediction accuracy for the target image frame at the current time.

For example, when the scenario depth prediction network is used to predict the depth map for each image frame in an image frame sequence (comprising the image frames from the first time to the t-th time), a preset initial value of the first hidden state information related to scenario depth is set during the initialization stage of the network. The first hidden state at the first time is determined from this preset initial value and the first feature map, related to scenario depth, of the image frame at the first time; scenario depth prediction is then performed on the image frame at the first time based on the first hidden state at the first time, yielding the predicted depth map corresponding to the image frame at the first time. The first hidden state at the second time is determined from the first hidden state at the first time and the first feature map of the image frame at the second time, and scenario depth prediction based on the first hidden state at the second time yields the predicted depth map corresponding to the image frame at the second time. Likewise, the first hidden state at the third time is determined from the first hidden state at the second time and the first feature map of the image frame at the third time, and scenario depth prediction based on the first hidden state at the third time yields the predicted depth map corresponding to the image frame at the third time. Proceeding in the same manner, the predicted depth map corresponding to each image frame in the image frame sequence is finally obtained.
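The frame-by-frame recurrence described above can be sketched as a loop that starts from a preset initial hidden state and carries the state forward. The helper `predict_sequence` and the toy `toy_step` are hypothetical names used only for illustration; a real step would run the encoder, the recurrent unit and the decoder.

```python
from typing import Callable, List, Sequence, Tuple


def predict_sequence(
    frames: Sequence[float],
    step: Callable[[float, float], Tuple[float, float]],
    initial_hidden: float,
) -> Tuple[List[float], float]:
    """Run `step` over the frames in time order, reusing the hidden state each time."""
    hidden = initial_hidden
    depths = []
    for frame in frames:
        depth, hidden = step(frame, hidden)  # state at time k feeds time k + 1
        depths.append(depth)
    return depths, hidden


def toy_step(frame: float, hidden: float) -> Tuple[float, float]:
    """Stand-in for one network step: blend the frame with the previous state."""
    new_hidden = 0.5 * frame + 0.5 * hidden
    return 2.0 * new_hidden, new_hidden


depths, final_hidden = predict_sequence([2.0, 4.0, 6.0], toy_step, initial_hidden=0.0)
```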

In some embodiments, the first hidden state information at time t-1 includes first hidden state information under different scales at time t-1. Performing feature extraction on the target image frame to determine the first feature map may include: performing multi-scale down-sampling on the target image frame to determine first feature maps under different scales corresponding to the target image frame. Determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1 may include: for any scale, determining the first hidden state information under that scale at time t according to the first feature map under that scale and the first hidden state information under that scale at time t-1. Determining the predicted depth map according to the first hidden state information at time t may include: performing feature fusion on the first hidden state information under the different scales at time t to determine the predicted depth map.

To better determine the predicted depth map corresponding to the target image frame at time t, the scenario depth prediction network may adopt a multi-scale feature fusion mechanism. FIG. 2 is a block diagram of a scenario depth prediction network according to an embodiment of the present invention. As shown in FIG. 2, the scenario depth prediction network includes a depth encoder 202, multi-scale convolutional gated recurrent units (ConvGRU), and a depth decoder 205. The target image frame 201 at time t is input into the depth encoder 202 and down-sampled at multiple scales to obtain first feature maps 203 under different scales corresponding to the target image frame: a first feature map under the first scale, a first feature map under the second scale, and a first feature map under the third scale. The multi-scale ConvGRUs correspond to the scales of the multi-scale first feature maps; that is, they include ConvGRU⁰ under the first scale, ConvGRU¹ under the second scale, and ConvGRU² under the third scale.
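As a rough illustration of multi-scale down-sampling, the sketch below builds three progressively smaller maps from one input. It assumes plain 2x2 average pooling and a halving of resolution per scale; the depth encoder in the source is a learned network and the source does not state the scale ratios, so `downsample_2x` and `feature_pyramid` are hypothetical stand-ins.

```python
from typing import List

Grid = List[List[float]]


def downsample_2x(grid: Grid) -> Grid:
    """Halve height and width by averaging each 2x2 block (even dimensions assumed)."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def feature_pyramid(image: Grid, num_scales: int = 3) -> List[Grid]:
    """Return one map per scale, each half the resolution of the previous one."""
    maps = []
    current = image
    for _ in range(num_scales):
        current = downsample_2x(current)
        maps.append(current)
    return maps


image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = feature_pyramid(image)
sizes = [(len(m), len(m[0])) for m in pyramid]  # one (height, width) pair per scale
```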

Taking FIG. 2 as an example, the first feature map under the first scale is input into ConvGRU⁰, the first feature map under the second scale is input into ConvGRU¹, and the first feature map under the third scale is input into ConvGRU². ConvGRU⁰ performs feature fusion on the first feature map under the first scale and the first hidden state information under the first scale at time t-1 stored in ConvGRU⁰ to obtain the first hidden state under the first scale at time t; ConvGRU⁰ stores this first hidden state and outputs it to the depth decoder. ConvGRU¹ performs feature fusion on the first feature map under the second scale and the first hidden state information under the second scale at time t-1 stored in ConvGRU¹ to obtain the first hidden state under the second scale at time t; ConvGRU¹ stores this first hidden state and outputs it to the depth decoder. ConvGRU² performs feature fusion on the first feature map under the third scale and the first hidden state information under the third scale at time t-1 stored in ConvGRU² to obtain the first hidden state under the third scale at time t; ConvGRU² stores this first hidden state and outputs it to the depth decoder. In FIG. 2, the multi-scale hidden states 204 include the first hidden state under the first scale, the first hidden state under the second scale, and the first hidden state under the third scale at time t.
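The store-then-output behaviour of the per-scale recurrent units can be sketched with one small stateful object per scale. The class `ScaleState` and the function `blend` are hypothetical; `blend` merely stands in for the learned feature fusion of a ConvGRU and works on plain numbers.

```python
from typing import Callable, List


class ScaleState:
    """Holds the hidden state of one scale and updates it with each new feature."""

    def __init__(self, initial_hidden: float, fuse: Callable[[float, float], float]):
        self.hidden = initial_hidden
        self.fuse = fuse

    def step(self, feature: float) -> float:
        # Fuse the new feature with the stored state, store the result, output it.
        self.hidden = self.fuse(feature, self.hidden)
        return self.hidden


def blend(feature: float, hidden: float) -> float:
    """Stand-in for learned feature fusion."""
    return 0.5 * feature + 0.5 * hidden


# One unit per scale, mirroring ConvGRU⁰, ConvGRU¹ and ConvGRU².
bank: List[ScaleState] = [ScaleState(0.0, blend) for _ in range(3)]
features = [2.0, 4.0, 8.0]
outputs_t1 = [unit.step(f) for unit, f in zip(bank, features)]
outputs_t2 = [unit.step(f) for unit, f in zip(bank, features)]
```

Because each unit keeps its own state, the hidden state of one scale never mixes with another scale until the decoder fuses them.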

The depth decoder 205 restores the scales of the first hidden states under the first, second and third scales at time t to the scale of the target image frame 201 (hereinafter, the target scale), obtaining three first hidden states under the target scale at time t. Because the first hidden state information includes feature information related to scenario depth, it also exists in the form of feature maps in the scenario depth prediction network. Feature map fusion is therefore performed on the three first hidden states under the target scale at time t to obtain the predicted depth map corresponding to the target image frame at time t.
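The restore-and-fuse step of the decoder can be sketched as follows. Two assumptions are made that the source does not state: scale restoration is done by nearest-neighbour up-sampling, and fusion is a pixel-wise average. The functions `upsample_nearest` and `fuse_to_depth` are hypothetical, and square maps are assumed for brevity.

```python
from typing import List

Grid = List[List[float]]


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Enlarge a map by an integer factor, repeating each value in a factor x factor block."""
    rows, cols = len(grid) * factor, len(grid[0]) * factor
    return [[grid[r // factor][c // factor] for c in range(cols)] for r in range(rows)]


def fuse_to_depth(hidden_states: List[Grid], target_size: int) -> Grid:
    """Bring every hidden state to the target scale, then average them pixel-wise."""
    restored = [upsample_nearest(h, target_size // len(h)) for h in hidden_states]
    n = len(restored)
    return [
        [sum(m[r][c] for m in restored) / n for c in range(target_size)]
        for r in range(target_size)
    ]


h_scale0 = [[2.0] * 4 for _ in range(4)]  # 4x4 hidden state
h_scale1 = [[1.0, 4.0], [7.0, 10.0]]      # 2x2 hidden state
h_scale2 = [[6.0]]                        # 1x1 hidden state
depth_map = fuse_to_depth([h_scale0, h_scale1, h_scale2], target_size=8)
```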

In some embodiments, the scenario depth prediction method may further include: acquiring a sample image frame sequence corresponding to time t, the sample image frame sequence including a first sample image frame at time t and adjacent sample image frames of the first sample image frame; determining, by a camera motion prediction network performing camera pose prediction on the sample image frame sequence using second hidden state information at time t-1, a sample predicted camera motion corresponding to the sample image frame sequence, the second hidden state information including feature information related to camera motion; determining, by a scenario depth prediction network to be trained performing scenario depth prediction on the first sample image frame using first hidden state information at time t-1, a sample predicted depth map corresponding to the first sample image frame, the first hidden state information including feature information related to scenario depth; constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and training the scenario depth prediction network to be trained according to the loss function to obtain the scenario depth prediction network.
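The order of operations in one training iteration can be sketched as below. The function `training_iteration` and the toy callables are hypothetical, and the parameter update itself is omitted: in practice an automatic differentiation framework would back-propagate the loss, which this sketch does not attempt.

```python
from typing import Callable, List, Tuple


def training_iteration(
    window: List[float],
    depth_net: Callable[[float, float], Tuple[float, float]],
    pose_net: Callable[[List[float], int], Tuple[float, int]],
    loss_fn: Callable[[List[float], float, float], float],
    depth_hidden: float,
    pose_hidden: int,
) -> Tuple[float, float, int]:
    """window = [I_t, I_{t-1}, I_{t+1}]; all callables are hypothetical stand-ins."""
    motion, pose_hidden = pose_net(window, pose_hidden)       # sample predicted camera motion
    depth, depth_hidden = depth_net(window[0], depth_hidden)  # sample predicted depth map
    loss = loss_fn(window, depth, motion)                     # loss built from both outputs
    return loss, depth_hidden, pose_hidden


# Toy stand-ins on plain numbers, only to exercise the data flow.
loss, depth_hidden, pose_hidden = training_iteration(
    window=[1.0, 0.5, 2.0],
    depth_net=lambda frame, h: (frame + h, frame),
    pose_net=lambda w, h: (w[2] - w[1], h + 1),
    loss_fn=lambda w, d, m: abs(d) + abs(m),
    depth_hidden=0.0,
    pose_hidden=0,
)
```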

In an embodiment of the present invention, the scenario depth prediction network is obtained through auxiliary training based on the camera motion prediction network, or the scenario depth prediction network and the camera motion prediction network are obtained through joint training. Exploiting the time-series correlation of scenario depth and camera pose at adjacent times, a sliding-window data fusion mechanism is introduced to extract and memorize the hidden state information related to scenario depth and camera motion at the target time (time t) within the sliding-window sequence, and unsupervised network training is then performed on the scenario depth prediction network and/or the camera motion prediction network.

In an embodiment of the present invention, a training set may be created in advance. The training set includes sample image frame sequences captured consecutively in time, and the scenario depth prediction network to be trained is trained on this training set. FIG. 3 is a block diagram of unsupervised network training according to an embodiment of the present invention. As shown in FIG. 3, the target time is time t, and the sample image frame sequence 301 corresponding to the target time (that is, the sample image frame sequence contained in the sliding window corresponding to the target time) includes the first sample image frame $I_t$ at time t, the adjacent sample image frame $I_{t-1}$ at time t-1, and the adjacent sample image frame $I_{t+1}$ at time t+1. The number of adjacent sample image frames of the first sample image frame in the sample image frame sequence may be determined according to the actual situation and is not limited in the present invention.
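Building the sliding-window sequences from a list of consecutively captured frames can be sketched as follows. The function `sliding_windows` and its `radius` parameter are hypothetical; a radius of 1 reproduces the window [$I_t$, $I_{t-1}$, $I_{t+1}$] used in FIG. 3, and frames without enough neighbours are skipped, which is an assumption of this sketch.

```python
from typing import List, Tuple


def sliding_windows(frames: List[str], radius: int = 1) -> List[Tuple[int, List[str]]]:
    """Return (t, [I_t, I_{t-1}, I_{t+1}, ...]) for every t that has `radius` neighbours on both sides."""
    windows = []
    for t in range(radius, len(frames) - radius):
        neighbours = []
        for k in range(1, radius + 1):
            neighbours += [frames[t - k], frames[t + k]]
        windows.append((t, [frames[t]] + neighbours))
    return windows


frames = ["I0", "I1", "I2", "I3", "I4"]
windows = sliding_windows(frames)
```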

The scenario depth prediction network to be trained shown in FIG. 3 adopts a single-scale feature fusion mechanism. During network training, the scenario depth prediction network to be trained may adopt the single-scale feature fusion mechanism shown in FIG. 3 or the multi-scale feature fusion mechanism shown in FIG. 2; the present invention does not limit this. As shown in FIG. 3, the scenario depth prediction network to be trained includes a depth encoder 202, a ConvGRU, and a depth decoder 205. The first sample image frame $I_t$ at time t is input into the depth encoder 202 for feature extraction, yielding the first feature map corresponding to $I_t$. The first feature map is input into the ConvGRU, which performs feature fusion on the first feature map and the first hidden state information at time t-1 stored in the ConvGRU to obtain the first hidden state at time t. The ConvGRU stores the first hidden state at time t and outputs it to the depth decoder 205, which produces the sample predicted depth map corresponding to the first sample image frame at time t.

Taking FIG. 3 as an example, the camera motion prediction network includes a pose encoder 302, a ConvGRU, and a pose decoder 303. The sample image frame sequence [$I_t$, $I_{t-1}$, $I_{t+1}$] corresponding to time t is input into the pose encoder 302 for feature extraction, yielding the second feature map corresponding to the sample image frame sequence. The second feature map is input into the ConvGRU, which performs feature fusion on the second feature map and the second hidden state information at time t-1 stored in the ConvGRU to obtain the second hidden state at time t. The ConvGRU stores the second hidden state at time t and outputs it to the pose decoder, which produces the sample predicted camera motion corresponding to the sample image frame sequence at time t; this sample predicted camera motion has two components.
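The source does not state how each component of the predicted camera motion is parameterised. A common choice, assumed here purely for illustration, is a six-degree-of-freedom vector (axis-angle rotation plus translation) that is converted into a 4x4 rigid transform with Rodrigues' rotation formula. The function `pose_to_matrix` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def pose_to_matrix(rotation: Sequence[float], translation: Sequence[float]) -> List[List[float]]:
    """Convert an axis-angle rotation vector and a translation into a 4x4 rigid transform."""
    rx, ry, rz = rotation
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)  # rotation angle in radians
    if theta > 1e-12:
        kx, ky, kz = rx / theta, ry / theta, rz / theta  # unit rotation axis
        c, s = math.cos(theta), math.sin(theta)
        v = 1.0 - c
        # Rodrigues' formula: R = c*I + s*[k]x + (1 - c)*k*k^T
        rot = [
            [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
            [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
            [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
        ]
    else:
        rot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    tx, ty, tz = translation
    return [rot[0] + [tx], rot[1] + [ty], rot[2] + [tz], [0.0, 0.0, 0.0, 1.0]]


# Quarter turn about the z axis combined with a translation.
quarter_turn = pose_to_matrix((0.0, 0.0, math.pi / 2), (1.0, 2.0, 3.0))
```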

Taking FIG. 3 as an example, a loss function may be constructed according to the sample predicted depth map and the sample predicted camera motion. Specifically, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames $I_{t-1}$ and $I_{t+1}$ in the sample image frame sequence with respect to the first sample image frame $I_t$ is determined; and according to the distribution continuity of the sample predicted depth map, a penalty function term is determined. The loss function is constructed through the following formula (1):

[Formula (1): equation not recoverable from the source text]

Here, the loss function includes a weighting parameter whose value may be determined according to the actual situation; the present invention does not specifically limit it.
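The expression of formula (1) is not legible in this text. One form consistent with the surrounding description, offered only as an assumption, is a weighted sum of the reprojection error term and the penalty function term. The symbols $L$, $L_{\mathrm{reproj}}$, $L_{\mathrm{smooth}}$ and $\lambda$ are hypothetical names introduced here, and attaching the weight to the penalty term is likewise an assumption:

```latex
L = L_{\mathrm{reproj}} + \lambda \, L_{\mathrm{smooth}}
```

Under this reading, a larger $\lambda$ favours smoother predicted depth maps at the expense of reprojection accuracy.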

In some embodiments, the penalty function term is determined from the distribution continuity of the sample predicted depth map as follows. The gradient value of each pixel in the first sample image frame $I_t$ is determined; these gradient values reflect the distribution continuity (also called smoothness) of $I_t$. According to the gradient values, a boundary region (composed of pixels whose gradient value is greater than or equal to a threshold) and a non-boundary region (composed of pixels whose gradient value is less than the threshold) in $I_t$ can be determined, and hence the boundary region and the non-boundary region in the sample predicted depth map corresponding to $I_t$. The gradient value of each pixel in the sample predicted depth map is then determined. To ensure distribution continuity in the non-boundary region and distribution discontinuity in the boundary region of the sample predicted depth map, a penalty factor directly proportional to the gradient value is set for each pixel in the non-boundary region, and a penalty factor inversely proportional to the gradient value is set for each pixel in the boundary region. The penalty function term is then constructed from the penalty factors of all pixels in the sample predicted depth map.
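The procedure above can be sketched on small grids as follows. Several details are assumptions of this sketch rather than statements of the source: gradients are forward differences summed in absolute value, the per-pixel penalty factors are averaged, and a constant `eps` keeps the inverse well defined when a depth gradient is zero. The function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def gradient_magnitude(grid: Grid) -> Grid:
    """Forward-difference gradient |dx| + |dy|; differences past the border count as zero."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dx = grid[r][c + 1] - grid[r][c] if c + 1 != cols else 0.0
            dy = grid[r + 1][c] - grid[r][c] if r + 1 != rows else 0.0
            out[r][c] = abs(dx) + abs(dy)
    return out


def smoothness_penalty(image: Grid, depth: Grid, threshold: float, eps: float = 1.0) -> float:
    """Average the per-pixel penalty factors over the predicted depth map."""
    image_grad = gradient_magnitude(image)
    depth_grad = gradient_magnitude(depth)
    total, count = 0.0, 0
    for r in range(len(depth)):
        for c in range(len(depth[0])):
            g = depth_grad[r][c]
            if image_grad[r][c] >= threshold:
                total += 1.0 / (g + eps)  # boundary pixel: a flat depth is penalised
            else:
                total += g                # non-boundary pixel: a depth jump is penalised
            count += 1
    return total / count


image = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]      # one vertical edge
aligned = [[1.0, 1.0, 5.0, 5.0], [1.0, 1.0, 5.0, 5.0]]    # depth edge at the image edge
flat = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]       # no depth edge at all
misplaced = [[1.0, 1.0, 1.0, 5.0], [1.0, 1.0, 1.0, 5.0]]  # depth edge in a smooth region
penalties = [smoothness_penalty(image, d, threshold=0.5) for d in (aligned, flat, misplaced)]
```

In this toy case the depth map whose edge coincides with the image edge receives the smallest penalty, and the one with an edge inside a smooth image region receives the largest.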

Because the sample predicted depth map and the sample predicted camera motion are obtained by exploiting the time-series correlation of scenario depth and camera motion at adjacent times, the scenario depth prediction network to be trained is trained with a loss function that combines the reprojection error term, determined from the predicted camera motion produced by the camera motion prediction network, with the penalty function term, determined from the predicted depth map produced by the scenario depth prediction network. The scenario depth prediction network obtained by this training achieves higher prediction accuracy in scenario depth prediction.
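The source does not spell out how the reprojection error term is computed. A common formulation, assumed here, back-projects a pixel of $I_t$ with its predicted depth under a pinhole camera model, moves the resulting 3D point with the predicted camera motion, and projects it into the adjacent frame; the error then compares image values at the two positions. The camera intrinsics and the function `reproject_pixel` are hypothetical and are not mentioned in the source.

```python
from typing import List, Sequence, Tuple


def reproject_pixel(
    u: float,
    v: float,
    depth: float,
    intrinsics: Sequence[float],
    transform: List[List[float]],
) -> Tuple[float, float]:
    """Map pixel (u, v) of frame t into an adjacent frame, given depth and a 4x4 transform."""
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the camera coordinates of frame t.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Move the point into the camera coordinates of the adjacent frame.
    t = transform
    xs = t[0][0] * x + t[0][1] * y + t[0][2] * z + t[0][3]
    ys = t[1][0] * x + t[1][1] * y + t[1][2] * z + t[1][3]
    zs = t[2][0] * x + t[2][1] * y + t[2][2] * z + t[2][3]
    # Project back onto the image plane of the adjacent frame.
    return fx * xs / zs + cx, fy * ys / zs + cy


intrinsics = (100.0, 100.0, 50.0, 40.0)  # assumed fx, fy, cx, cy
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
sideways = [[1.0, 0.0, 0.0, 0.2], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
same = reproject_pixel(30.0, 20.0, 2.0, intrinsics, identity)
moved = reproject_pixel(30.0, 20.0, 2.0, intrinsics, sideways)
```

With no camera motion the pixel maps to itself, so any error in that case would come from the images alone; a sideways motion shifts the pixel by an amount that depends on its depth, which is how the depth and motion predictions become coupled in the loss.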

In some embodiments, the camera motion prediction network in FIG. 3 may be a camera motion prediction network to be trained. The camera motion prediction network to be trained is trained according to the above loss function, so that the scenario depth prediction network to be trained and the camera motion prediction network to be trained are jointly trained, yielding a trained scenario depth prediction network and a trained camera motion prediction network.

Because the predicted depth map and the predicted camera motion are obtained by exploiting the time-series correlation of scenario depth and camera motion at adjacent times, the scenario depth prediction network and the camera motion prediction network are jointly trained with a loss function that combines the reprojection error term, determined from the predicted camera motion produced by the camera motion prediction network, with the penalty function term, determined from the predicted depth map produced by the scenario depth prediction network. The networks obtained by this training achieve higher prediction accuracy in scenario depth prediction and camera motion prediction.

In some embodiments, the depth encoder and the pose encoder may reuse the ResNet18 structure, the ResNet54 structure, or another structure; the present invention does not specifically limit this. The depth decoder and the pose decoder may adopt the U-Net network structure or another decoder network structure; the present invention does not specifically limit this.

In some embodiments, the ConvGRU includes convolution operations, and the activation function in the ConvGRU is the ELU activation function.

For example, a gated recurrent unit that can only process one-dimensional data may be improved: the linear operations in the unit are replaced with convolution operations, and the tanh activation function is replaced with the ELU activation function, yielding a ConvGRU that can process two-dimensional image data.
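A minimal single-channel sketch of such a unit is given below, using the standard GRU gate equations with 3x3 convolutions in place of linear maps and ELU in place of tanh for the candidate state. The kernel names, the omission of biases and the single-channel restriction are simplifications assumed here; the source does not give the exact equations.

```python
import math
from typing import Callable, Dict, List

Grid = List[List[float]]


def conv3x3(grid: Grid, kernel: Grid) -> Grid:
    """Single-channel 3x3 cross-correlation with zero padding (output keeps the input size)."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if rr in range(rows) and cc in range(cols):
                        acc += kernel[i][j] * grid[rr][cc]
            out[r][c] = acc
    return out


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def elu(x: float) -> float:
    return x if x > 0 else math.exp(x) - 1.0


def combine(a: Grid, b: Grid, fn: Callable[[float, float], float]) -> Grid:
    """Apply fn element-wise to two grids of equal size."""
    return [[fn(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def conv_gru_step(x: Grid, h: Grid, k: Dict[str, Grid]) -> Grid:
    """One ConvGRU update; k maps 'xz','hz','xr','hr','xn','hn' to hypothetical 3x3 kernels."""
    z = combine(conv3x3(x, k["xz"]), conv3x3(h, k["hz"]), lambda a, b: sigmoid(a + b))  # update gate
    r = combine(conv3x3(x, k["xr"]), conv3x3(h, k["hr"]), lambda a, b: sigmoid(a + b))  # reset gate
    rh = combine(r, h, lambda a, b: a * b)
    n = combine(conv3x3(x, k["xn"]), conv3x3(rh, k["hn"]), lambda a, b: elu(a + b))     # candidate
    keep = combine(z, h, lambda a, b: (1.0 - a) * b)
    new = combine(z, n, lambda a, b: a * b)
    return combine(keep, new, lambda a, b: a + b)  # h_t = (1 - z) * h_{t-1} + z * candidate


zero = [[0.0] * 3 for _ in range(3)]
center = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # identity kernel
kernels = {name: zero for name in ("xz", "hz", "xr", "hr", "xn", "hn")}
decayed = conv_gru_step([[9.0, 9.0], [9.0, 9.0]], [[2.0, 4.0], [6.0, 8.0]], kernels)
driven = conv_gru_step([[1.0, -1.0]], [[0.0, 0.0]], dict(kernels, xn=center))
```

With all kernels set to zero the gates sit at 0.5 and the candidate is zero, so the state simply halves; routing the input through an identity kernel shows the ELU acting on positive and negative inputs differently.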

Because scenario depth and/or camera motion are correlated in the time series, the ConvGRU can perform recurrent convolution processing, in time order, on the image frame sequences corresponding to different times, to obtain the first hidden states and/or second hidden states corresponding to the different times.

To implement the sliding-window data fusion mechanism, a convolutional long short-term memory (ConvLSTM) may be adopted instead of the ConvGRU, as may any other structure capable of implementing sliding-window data fusion; the present invention does not specifically limit this.

도 4는 본 발명의 실시예의 카메라 움직임 예측 방법에 따른 흐름도이다. 도 4에 도시된 바와 같이 카메라 움직임 예측 방법은 단말 기기 또는 다른 처리 기기에 의해 실행될 수 있고, 여기서, 단말 기기는 사용자 기기(User Equipment, UE), 모바일 기기, 사용자 단말, 단말, 셀룰러 폰, 무선 전화, 개인용 정보 단말기(Personal Digital Assistant, PDA), 핸드 헬드 기기, 컴퓨팅 기기, 차량용 기기, 웨어러블 기기 등일 수 있다. 다른 처리 기기는 서버 또는 클라우드 서버 등일 수 있다. 일부 가능한 구현 방식에 있어서, 상기 카메라 움직임 예측 방법은 프로세서가 메모리에 저장된 컴퓨터 판독 가능한 명령어를 호출하는 방식을 통해 구현될 수 있다. 도 4에 도시된 바와 같이, 상기 방법은 또한, 4 is a flowchart according to a method for predicting camera motion according to an embodiment of the present invention. As shown in FIG. 4 , the method for predicting camera motion may be executed by a terminal device or other processing device, where the terminal device is a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, or a wireless device. It may be a telephone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle device, a wearable device, and the like. The other processing device may be a server or a cloud server or the like. In some possible implementation manners, the camera motion prediction method may be implemented through a method in which a processor calls a computer readable instruction stored in a memory. As shown in Figure 4, the method also

Step S41: acquiring an image frame sequence corresponding to time t, wherein the image frame sequence includes a target image frame at time t and adjacent image frames of the target image frame; and

Step S42: determining a predicted camera motion corresponding to the image frame sequence by having a camera motion prediction network perform camera pose prediction on the image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained through auxiliary training based on a scenario depth prediction network.

In embodiments of the present invention, an image frame sequence including the target image frame at time t and the adjacent image frames of the target image frame is acquired. Because camera motion at adjacent times is correlated over time, the camera motion prediction network can use the second hidden state information at time t-1, which relates to camera motion, to perform camera pose prediction on the image frame sequence and thereby obtain a predicted camera motion of higher accuracy for that sequence.

In some embodiments, determining the predicted camera motion corresponding to the image frame sequence, by having the camera motion prediction network perform camera pose prediction on the image frame sequence using the second hidden state information at time t-1, may include: performing feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, wherein the second feature map is a feature map related to camera motion; determining second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and determining the predicted camera motion according to the second hidden state information at time t.

Because camera motion at adjacent times is correlated over time, the camera motion prediction network can determine the second hidden state information related to camera motion at time t from the second feature map, related to camera motion, that corresponds to the image frame sequence at time t, together with the second hidden state information related to camera motion at time t-1. Performing camera motion prediction on the image frame sequence at time t based on this second hidden state information at time t yields a predicted camera motion of higher accuracy for the image frame sequence at time t.

In some embodiments, the predicted camera motion may include the relative poses between adjacent image frames in the image frame sequence. Here, a relative pose is a six-dimensional parameter that includes three-dimensional rotation information and three-dimensional translation information.

For example, the predicted camera motion includes two relative poses: the relative pose from the adjacent image frame I_{t-1} to the target image frame I_t, and the relative pose from the target image frame I_t to the adjacent image frame I_{t+1}.
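The six-dimensional relative pose described above can be turned into a rigid transform for downstream use. The following is a minimal sketch, assuming the three rotation values form an axis-angle (rotation) vector and the parameter order is rotation first, then translation; the source does not specify the parameterization, and the name `pose_vec_to_matrix` is hypothetical.

```python
import numpy as np

def pose_vec_to_matrix(pose):
    """Convert a 6-D relative pose [rx, ry, rz, tx, ty, tz] to a 4x4 transform.

    Assumption (not stated in the source): the rotation part is an
    axis-angle vector whose norm is the rotation angle in radians.
    """
    pose = np.asarray(pose, dtype=float)
    rot_vec, trans = pose[:3], pose[3:]
    angle = np.linalg.norm(rot_vec)
    if angle < 1e-12:
        R = np.eye(3)  # no rotation
    else:
        k = rot_vec / angle
        # Skew-symmetric cross-product matrix of the unit axis
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        # Rodrigues' rotation formula
        R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = trans
    return T

# Two relative poses, as in the example: I_{t-1} -> I_t and I_t -> I_{t+1}
T_prev_to_t = pose_vec_to_matrix([0.0, 0.0, np.pi / 2, 1.0, 0.0, 0.0])
T_t_to_next = pose_vec_to_matrix([0.0, 0.0, 0.0, 0.0, 2.0, 0.0])

# Under the convention that each matrix maps points from the earlier frame
# into the later frame, the two poses chain by matrix multiplication.
T_prev_to_next = T_t_to_next @ T_prev_to_t
```

Other rotation parameterizations (Euler angles, quaternions) would work equally well here; only the conversion function would change.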

Taking FIG. 3 as an example, the camera motion prediction network includes a pose encoder, a ConvGRU, and a pose decoder. The image frame sequence [I_t, I_{t-1}, I_{t+1}] corresponding to time t is input to the pose encoder 302, which performs feature extraction to obtain the second feature map corresponding to the image frame sequence. The second feature map is then input to the ConvGRU, which fuses it with the second hidden state information at time t-1 stored in the ConvGRU to obtain the second hidden state at time t. The ConvGRU stores the second hidden state at time t and outputs it to the pose decoder, which produces the predicted camera motion corresponding to the image frame sequence at time t.

For example, when the camera motion prediction network is used to predict the camera motion corresponding to an image frame sequence, a preset initial value of the second hidden state information related to camera motion is set during the initialization stage of the network. The second hidden state at the first time is determined from this preset initial value and the second feature map, related to camera motion, that corresponds to the image frame sequence at the first time; camera motion prediction is then performed on the image frame sequence at the first time based on the second hidden state at the first time, yielding the predicted camera motion for that sequence. The second hidden state at the second time is determined from the second hidden state at the first time and the second feature map corresponding to the image frame sequence at the second time, and camera motion prediction based on it yields the predicted camera motion for the image frame sequence at the second time. The second hidden state at the third time is determined in the same way from the second hidden state at the second time and the second feature map corresponding to the image frame sequence at the third time, yielding the predicted camera motion for the image frame sequence at the third time. Proceeding in this manner, the predicted camera motion corresponding to the image frame sequence at each time is finally obtained.
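The time-stepped procedure above (initialize the hidden state, then alternate feature extraction, hidden state update, and decoding) can be sketched as a simple loop. This is only an illustration under stated assumptions: the recurrent cell uses 1x1 convolutions (per-pixel channel mixing) as a stand-in for a real ConvGRU, the encoder and decoder are trivial placeholders, the initial hidden state is assumed to be zeros, and all names (`TinyConvGRUCell`, `pose_encoder`, `pose_decoder`, `predict_camera_motion`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyConvGRUCell:
    """GRU cell with 1x1 convolutions; a simplified stand-in for a ConvGRU."""

    def __init__(self, in_ch, hid_ch):
        shape = (hid_ch, in_ch + hid_ch)
        self.Wz = rng.normal(0.0, 0.1, shape)  # update gate weights
        self.Wr = rng.normal(0.0, 0.1, shape)  # reset gate weights
        self.Wh = rng.normal(0.0, 0.1, shape)  # candidate state weights
        self.hid_ch = hid_ch

    def __call__(self, x, h_prev):
        # x: (in_ch, H, W); h_prev: (hid_ch, H, W)
        xh = np.concatenate([x, h_prev], axis=0)
        z = sigmoid(np.einsum("oc,chw->ohw", self.Wz, xh))
        r = sigmoid(np.einsum("oc,chw->ohw", self.Wr, xh))
        xrh = np.concatenate([x, r * h_prev], axis=0)
        h_cand = np.tanh(np.einsum("oc,chw->ohw", self.Wh, xrh))
        return (1.0 - z) * h_prev + z * h_cand

def pose_encoder(frames):
    # Placeholder encoder: stack the 3 frames along the channel axis.
    return frames.reshape(-1, *frames.shape[2:])

def pose_decoder(h, W_out):
    # Placeholder decoder: global average pool, then a linear map to
    # two 6-D relative poses (previous->target, target->next).
    pooled = h.mean(axis=(1, 2))
    return (W_out @ pooled).reshape(2, 6)

def predict_camera_motion(sequences, cell, W_out):
    H, W = sequences[0].shape[-2:]
    h = np.zeros((cell.hid_ch, H, W))  # preset initial value (assumed zeros)
    motions = []
    for frames in sequences:           # one image frame sequence per time step
        feat = pose_encoder(frames)    # second feature map
        h = cell(feat, h)              # fuse with the previous hidden state
        motions.append(pose_decoder(h, W_out))
    return motions, h

frame_seq = rng.random((3, 3, 8, 8))   # 3 frames, 3 channels, 8x8 pixels
sequences = [frame_seq, frame_seq, rng.random((3, 3, 8, 8))]
cell = TinyConvGRUCell(in_ch=9, hid_ch=4)
W_out = rng.normal(0.0, 0.1, (12, 4))
motions, h_last = predict_camera_motion(sequences, cell, W_out)
```

In this sketch the first two time steps receive identical input, yet their predicted motions differ because the hidden state carried over from the first step changes the second; that carry-over is the mechanism the paragraph describes.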

In some embodiments, the camera motion prediction method may further include: acquiring a sample image frame sequence corresponding to time t, wherein the sample image frame sequence includes a first sample image frame at time t and adjacent sample image frames of the first sample image frame; determining a sample predicted depth map corresponding to the first sample image frame by having the scenario depth prediction network perform scenario depth prediction on the first sample image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to scenario depth; determining a sample predicted camera motion corresponding to the sample image frame sequence by having the camera motion prediction network to be trained perform camera pose prediction on the sample image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion; constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and training the camera motion prediction network to be trained according to the loss function, so as to obtain the camera motion prediction network.

In some embodiments, constructing the loss function according to the sample predicted depth map and the sample predicted camera motion may include: determining, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames of the first sample image frame in the sample image frame sequence with respect to the first sample image frame; determining a penalty function term according to the distribution continuity of the sample predicted depth map; and constructing the loss function according to the reprojection error term and the penalty function term.
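A loss built from a reprojection error term and a depth continuity penalty could look like the following minimal sketch. Several details are assumptions not given in the source: a pinhole intrinsic matrix `K`, grayscale images, nearest-neighbor sampling, an L1 photometric difference, a first-order gradient penalty for continuity, and the weight `smooth_weight`. All function names are hypothetical.

```python
import numpy as np

def reproject_pixels(depth, K, T):
    """Map each target pixel into an adjacent frame using depth and a 4x4 pose T."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)   # back-project to 3-D
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])    # homogeneous coords
    proj = K @ (T @ cam_h)[:3]                              # move and project
    uv = proj[:2] / proj[2:3]                               # assumes positive depth
    return uv.reshape(2, H, W)

def reprojection_error(target, adjacent, depth, K, T):
    """Mean absolute difference between the target and the warped adjacent frame."""
    H, W = depth.shape
    uv = reproject_pixels(depth, K, T)
    u = np.rint(uv[0]).astype(int)   # nearest-neighbor sampling (simplification)
    v = np.rint(uv[1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    if not valid.any():
        return 0.0
    warped = adjacent[v[valid], u[valid]]
    return float(np.abs(warped - target[valid]).mean())

def smoothness_penalty(depth):
    """Penalize depth discontinuities via mean absolute first differences."""
    dx = np.abs(np.diff(depth, axis=1)).mean()
    dy = np.abs(np.diff(depth, axis=0)).mean()
    return float(dx + dy)

def total_loss(target, adjacents, poses, depth, K, smooth_weight=0.1):
    reproj = sum(reprojection_error(target, a, depth, K, T)
                 for a, T in zip(adjacents, poses))
    return reproj + smooth_weight * smoothness_penalty(depth)

# Sanity check: identical frames, identity poses, and flat depth give zero loss.
K = np.array([[10.0, 0.0, 2.0], [0.0, 10.0, 2.0], [0.0, 0.0, 1.0]])
img = np.arange(16, dtype=float).reshape(4, 4)
flat_depth = np.full((4, 4), 2.0)
loss = total_loss(img, [img, img], [np.eye(4), np.eye(4)], flat_depth, K)
```

A trainable version would use differentiable bilinear sampling in an automatic differentiation framework so that gradients flow to both the depth and pose networks; the NumPy version here only shows how the two terms are combined.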

In embodiments of the present invention, the camera motion prediction network is obtained through auxiliary training based on the scenario depth prediction network, or the scenario depth prediction network and the camera motion prediction network are obtained through joint training. In some embodiments, the camera motion prediction network to be trained may be trained based on FIG. 3. During training, the camera motion prediction network in FIG. 3 is the camera motion prediction network to be trained, and the scenario depth prediction network in FIG. 3 may be either a scenario depth prediction network to be trained (in which case the two networks to be trained are trained jointly) or an already trained scenario depth prediction network (in which case only the camera motion prediction network to be trained is trained). The specific training process is the same as that described for FIG. 3 and is not repeated here.

The predicted depth map and the predicted camera motion are obtained by exploiting the temporal correlation of scenario depth and camera motion between adjacent times. The loss function therefore combines the reprojection error term, determined from the predicted camera motion output by the camera motion prediction network, with the penalty function term, determined from the predicted depth map output by the scenario depth prediction network. Jointly training the scenario depth prediction network and the camera motion prediction network with this loss function improves the prediction accuracy of both scenario depth prediction and camera motion prediction.

In embodiments of the present invention, the scenario depth prediction network and the camera motion prediction network obtained through the network training method shown in FIG. 3 can perform depth prediction of the environment and three-dimensional scenario construction. For example, the scenario depth prediction network may be applied to navigation scenarios of indoor and outdoor mobile robots such as robotic vacuum cleaners and lawn mowers: an RGB (Red Green Blue) camera captures an RGB image, the scenario depth prediction network determines the predicted depth map corresponding to the RGB image, and the camera motion prediction network determines the camera motion of the RGB camera. This enables distance measurement to obstacles and three-dimensional scenario construction, so that obstacle avoidance and navigation tasks can be completed.

It should be understood that the method embodiments mentioned in the present invention may be combined with one another to form combined embodiments, provided that doing so does not violate their principles and logic; for reasons of space, these combinations are not described again here. Those skilled in the art will understand that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict order of execution; the specific order of execution of the steps is determined by their functions and possible internal logic.

In addition, the present invention further provides a scenario depth/camera motion prediction apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the scenario depth/camera motion prediction methods provided by the present invention. For the corresponding technical solutions and descriptions, refer to the corresponding passages in the method section; they are not repeated here.

FIG. 5 is a block diagram of a scenario depth prediction apparatus according to an embodiment of the present invention. As shown in FIG. 5, the scenario depth prediction apparatus 50 includes:

a first acquiring module 51, configured to acquire a target image frame at time t; and

a first scenario depth prediction module 52, configured to determine a predicted depth map corresponding to the target image frame by having a scenario depth prediction network perform scenario depth prediction on the target image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to scenario depth, and the scenario depth prediction network is obtained through auxiliary training based on a camera motion prediction network.

In some embodiments, the first scenario depth prediction module 52 includes:

a first determining submodule, configured to perform feature extraction on the target image frame to determine a first feature map corresponding to the target image frame, wherein the first feature map is a feature map related to scenario depth;

a second determining submodule, configured to determine first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and

a third determining submodule, configured to determine the predicted depth map according to the first hidden state information at time t.

In some embodiments, the first hidden state information at time t-1 includes first hidden state information at different scales at time t-1;

the first determining submodule is specifically configured to perform multi-scale downsampling on the target image frame to determine first feature maps at different scales corresponding to the target image frame;

the second determining submodule is specifically configured to determine, for any given scale, the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1; and

the third determining submodule is specifically configured to perform feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map.
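The division of work among the three submodules (multi-scale downsampling, per-scale hidden state update, fusion across scales) can be illustrated with the following sketch. Everything concrete in it is an assumption for illustration only: average pooling for downsampling, a fixed blend in place of the recurrent unit, nearest-neighbor upsampling followed by averaging as the fusion step, and the scale factors `(1, 2, 4)`. All names are hypothetical.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a (H, W) map by an integer factor that divides H and W."""
    H, W = x.shape
    return x.reshape(H // factor, factor, W // factor, factor).mean(axis=(1, 3))

def upsample(x, factor):
    """Nearest-neighbor upsampling by an integer factor."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def update_hidden(feature, h_prev, alpha=0.5):
    # Stand-in for the recurrent unit: a fixed blend of the new feature map
    # and the previous hidden state at the same scale.
    return alpha * feature + (1.0 - alpha) * h_prev

def predict_depth_multiscale(frame, hidden_prev, factors=(1, 2, 4)):
    hidden_t = {}
    for f in factors:
        feat = downsample(frame, f)                        # first determining submodule
        hidden_t[f] = update_hidden(feat, hidden_prev[f])  # second determining submodule
    # Third determining submodule: bring every scale back to full resolution
    # and fuse; a learned decoder would replace this plain average.
    fused = np.mean([upsample(hidden_t[f], f) for f in factors], axis=0)
    return fused, hidden_t

frame = np.ones((8, 8))
hidden_prev = {f: np.zeros((8 // f, 8 // f)) for f in (1, 2, 4)}
fused, hidden_t = predict_depth_multiscale(frame, hidden_prev)
```

The returned `hidden_t` dictionary would be passed in as `hidden_prev` at the next time step, so each scale keeps its own temporal state.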

In some embodiments, the scenario depth prediction apparatus 50 further includes a first training module, the first training module being configured to:

determine a sample predicted camera motion corresponding to the sample image frame sequence by having the camera motion prediction network perform camera pose prediction on the sample image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion; and

determine a sample predicted depth map corresponding to the first sample image frame by having the scenario depth prediction network to be trained perform scenario depth prediction on the first sample image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to scenario depth.

In some embodiments, the first training module is specifically configured to: determine, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames of the first sample image frame in the sample image frame sequence with respect to the first sample image frame; determine a penalty function term according to the distribution continuity of the sample predicted depth map; and construct the loss function according to the reprojection error term and the penalty function term.

FIG. 6 is a block diagram of a camera motion prediction apparatus according to an embodiment of the present invention. As shown in FIG. 6, the camera motion prediction apparatus 60 includes:

a second acquiring module 61, configured to acquire an image frame sequence corresponding to time t, wherein the image frame sequence includes a target image frame at time t and adjacent image frames of the target image frame; and

a first camera motion prediction module 62, configured to determine a predicted camera motion corresponding to the image frame sequence by having a camera motion prediction network perform camera pose prediction on the image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained through auxiliary training based on a scenario depth prediction network.

In some embodiments, the first camera motion prediction module 62 includes:

a sixth determining submodule, configured to perform feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, wherein the second feature map is a feature map related to camera motion;

a seventh determining submodule, configured to determine second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and

an eighth determining submodule, configured to determine the predicted camera motion according to the second hidden state information at time t.

In some embodiments, the predicted camera motion includes the relative poses between adjacent image frames in the image frame sequence.

In some embodiments, the camera motion prediction apparatus 60 further includes a second training module, the second training module being configured to:

determine a sample predicted depth map corresponding to the first sample image frame by having the scenario depth prediction network perform scenario depth prediction on the first sample image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to scenario depth;

determine a sample predicted camera motion corresponding to the sample image frame sequence by having the camera motion prediction network to be trained perform camera pose prediction on the sample image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion;

construct a loss function according to the sample predicted depth map and the sample predicted camera motion; and

train the camera motion prediction network to be trained according to the loss function, so as to obtain the camera motion prediction network.

In some embodiments, the second training module is specifically configured to: determine, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames of the first sample image frame in the sample image frame sequence with respect to the first sample image frame; determine a penalty function term according to the distribution continuity of the sample predicted depth map; and construct the loss function according to the reprojection error term and the penalty function term.

In some embodiments, the functions or modules of the apparatus provided in the embodiments of the present invention may be used to execute the methods described in the foregoing method embodiments. For their implementation, refer to the description of the foregoing method embodiments; for brevity, it is not repeated here.

Embodiments of the present invention further provide a computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the above methods. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.

Embodiments of the present invention further provide an electronic device including a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to call the instructions stored in the memory to execute any of the above scenario depth prediction methods or any of the above camera motion prediction methods.

Embodiments of the present invention further provide a computer program product including computer-readable code. When the computer-readable code runs on a device, a processor in the device executes instructions for implementing the scenario depth and/or camera motion prediction method provided in any of the foregoing embodiments.

Embodiments of the present invention further provide another computer program product for storing computer-readable instructions that, when executed, cause a computer to execute the scenario depth and/or camera motion prediction method provided in any of the foregoing embodiments.

The electronic device may be provided as a terminal, a server, or a device of another form.

FIG. 7 is a block diagram of an electronic device 800 according to an embodiment of the present invention. As shown in FIG. 7, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.

Referring to FIG. 7, the electronic device 800 may include one or more of the following components: a first processing component 802, a first memory 804, a first power component 806, a multimedia component 808, an audio component 810, a first input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The first processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The first processing component 802 may include one or more processors 820 for executing instructions to implement all or some of the steps of the above methods. In addition, the first processing component 802 may include one or more modules that facilitate interaction between the first processing component 802 and other components. For example, the first processing component 802 may include a multimedia module that facilitates interaction between the multimedia component 808 and the first processing component 802.

The first memory 804 is configured to store various types of data to support the operation of the electronic device 800. Examples of such data include instructions of any application program or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like. The first memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

The first power component 806 supplies power to the various components of the electronic device 800. The first power component 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). When the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 808 includes at least one of a front camera and a rear camera. When the electronic device 800 is in an operating mode, such as a photographing mode or a video mode, at least one of the front camera and the rear camera may receive external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in an operating mode such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the first memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.

The first input/output interface 812 provides an interface between the first processing component 802 and external interface modules, which may be a keyboard, a click wheel, buttons, or the like. The buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.

The sensor component 814 includes one or more sensors that provide status assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect the on/off state of the electronic device 800 and the relative positioning of components, such as the display and the keypad of the electronic device 800. The sensor component 814 may also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may further include an optical sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, in order to perform any of the scenario depth prediction methods or camera motion prediction methods described above.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the first memory 804 containing computer program instructions, which are executable by the processor 820 of the electronic device 800 to perform the method described above.

FIG. 8 shows a block diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 8, the electronic device 900 may be provided as a server. Referring to FIG. 8, the electronic device 900 includes a second processing component 922, which in turn includes one or more processors, and memory resources, represented by a second memory 932, for storing instructions executable by the second processing component 922, such as application programs. The application program stored in the second memory 932 may include one or more modules, each corresponding to a set of instructions. In addition, the second processing component 922 is configured to execute the instructions so as to perform any of the scenario depth prediction methods or camera motion prediction methods described above.

The electronic device 900 may further include a second power component 926 configured to perform power management of the electronic device 900, a wired or wireless network interface 950 configured to connect the electronic device 900 to a network, and a second input/output (I/O) interface 958. The electronic device 900 may run an operating system stored in the second memory 932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the second memory 932 containing computer program instructions, which are executable by the second processing component 922 of the electronic device 900 to perform the method described above.

The present invention may be at least one of a system, a method, and a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions that cause a processor to implement aspects of the present invention.

A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (for example, light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, such as at least one of the Internet, a local area network, a wide area network, and a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.

Computer program instructions for carrying out the operations of the present invention may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the scenario involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), may be personalized using state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions in order to implement aspects of the present invention.

The present invention is described with reference to at least one of flowcharts and block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/operations specified in one or more blocks of the flowcharts and block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium and may cause a computer, a programmable data processing apparatus, and other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing the functions/operations specified in one or more blocks of at least one of the flowcharts and block diagrams.

The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/operations specified in one or more blocks of at least one of the flowcharts and block diagrams.

The flowcharts and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to several embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of at least one of the block diagrams and flowcharts, and combinations of blocks in at least one of the block diagrams and flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The computer program product may be implemented in hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).

The embodiments of the present invention have been described above. The description is illustrative rather than exhaustive and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies found in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Industrial Applicability

Embodiments of the present invention provide a scenario depth and camera motion prediction method and apparatus, an electronic device, a medium, and a program. The method includes: obtaining a target image frame at time t; and determining a predicted depth map corresponding to the target image frame by having a scenario depth prediction network perform scenario depth prediction on the target image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to scenario depth, and the scenario depth prediction network is obtained through auxiliary training based on a camera motion prediction network. Embodiments of the present invention can obtain a predicted depth map with high prediction accuracy for the target image frame.
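The recurrent prediction step summarized above can be illustrated with a minimal sketch. It is not the patented network: the encoder, the GRU-style gate, and the softplus decoder below are hypothetical stand-ins chosen only to show how hidden state from time t-1 is carried into the prediction at time t, and the "frames" are short lists of pixel intensities rather than images.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def extract_features(frame: List[float]) -> List[float]:
    """Hypothetical encoder: centre each pixel on the frame mean."""
    mean = sum(frame) / len(frame)
    return [p - mean for p in frame]


def update_hidden(features: List[float], prev_hidden: List[float]) -> List[float]:
    """GRU-style gated blend of the hidden state at t-1 with new features (assumed form)."""
    new_hidden = []
    for f, h in zip(features, prev_hidden):
        z = sigmoid(f)                # update gate
        candidate = math.tanh(f + h)  # candidate state
        new_hidden.append((1.0 - z) * h + z * candidate)
    return new_hidden


def decode_depth(hidden: List[float]) -> List[float]:
    """Hypothetical decoder: softplus keeps every predicted depth positive."""
    return [math.log1p(math.exp(h)) for h in hidden]


def predict_depth(frame: List[float], prev_hidden: List[float]) -> Tuple[List[float], List[float]]:
    """Return the depth map for time t and the hidden state to reuse at t+1."""
    hidden = update_hidden(extract_features(frame), prev_hidden)
    return decode_depth(hidden), hidden


# Run over a two-frame sequence, threading the hidden state through time.
frames = [[0.1, 0.5, 0.9], [0.2, 0.5, 0.8]]
hidden = [0.0, 0.0, 0.0]
depth_maps = []
for frame in frames:
    depth, hidden = predict_depth(frame, hidden)
    depth_maps.append(depth)
```

Because the hidden state is threaded through the loop, the prediction for the second frame depends on the first frame as well as on its own pixels, which is the temporal coupling the method relies on.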

Claims

1. A scenario depth prediction method, comprising:
obtaining a target image frame at time t; and
determining a predicted depth map corresponding to the target image frame by having a scenario depth prediction network perform scenario depth prediction on the target image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to scenario depth, and the scenario depth prediction network is obtained through auxiliary training based on a camera motion prediction network.

2. The method of claim 1,
wherein determining the predicted depth map corresponding to the target image frame by having the scenario depth prediction network perform scenario depth prediction on the target image frame using the first hidden state information at time t-1 comprises:
performing feature extraction on the target image frame to determine a first feature map corresponding to the target image frame, wherein the first feature map is a feature map related to scenario depth;
determining first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and
determining the predicted depth map according to the first hidden state information at time t.

3. The method of claim 2,
wherein the first hidden state information at time t-1 includes first hidden state information at different scales at time t-1;
performing feature extraction on the target image frame to determine the first feature map corresponding to the target image frame comprises:
performing multi-scale downsampling on the target image frame to determine first feature maps at different scales corresponding to the target image frame;
determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1 comprises:
for any scale, determining the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1; and
determining the predicted depth map according to the first hidden state information at time t comprises:
performing feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map.
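The multi-scale flow of claim 3 can be sketched on a one-dimensional signal as follows. Everything here is an illustrative assumption rather than the claimed network: average pooling stands in for the learned downsampling, a fixed blend stands in for the hidden-state update, and nearest-neighbour upsampling followed by averaging stands in for feature fusion.

```python
from typing import Dict, List, Sequence, Tuple


def downsample(signal: List[float], factor: int) -> List[float]:
    """Average-pool the signal by `factor` (stand-in for learned downsampling)."""
    return [sum(signal[i:i + factor]) / factor for i in range(0, len(signal), factor)]


def upsample(signal: List[float], factor: int) -> List[float]:
    """Nearest-neighbour upsampling back to full resolution."""
    return [v for v in signal for _ in range(factor)]


def update_hidden(feature: List[float], prev_hidden: List[float], keep: float = 0.5) -> List[float]:
    """Blend the hidden state at t-1 with the new feature map (assumed update rule)."""
    return [keep * h + (1.0 - keep) * f for f, h in zip(feature, prev_hidden)]


def predict_multiscale(
    frame: List[float],
    prev_hiddens: Dict[int, List[float]],
    factors: Sequence[int] = (1, 2, 4),
) -> Tuple[List[float], Dict[int, List[float]]]:
    # One hidden state per scale, each updated only from its own scale.
    new_hiddens = {
        f: update_hidden(downsample(frame, f), prev_hiddens[f]) for f in factors
    }
    # Fuse by bringing every scale back to full resolution and averaging.
    upsampled = [upsample(new_hiddens[f], f) for f in factors]
    fused = [sum(values) / len(values) for values in zip(*upsampled)]
    return fused, new_hiddens


frame = [1.0, 3.0, 5.0, 7.0]
prev = {f: [0.0] * (len(frame) // f) for f in (1, 2, 4)}
fused_map, hiddens = predict_multiscale(frame, prev)
```

The dictionary returned as `hiddens` would be passed back in as `prev_hiddens` for the next frame, so each scale keeps its own temporal memory.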

4. The method of any one of claims 1 to 3,
further comprising:
obtaining a sample image frame sequence corresponding to time t, wherein the sample image frame sequence includes a first sample image frame at time t and an adjacent sample image frame of the first sample image frame;
determining a sample predicted camera motion corresponding to the sample image frame sequence by having the camera motion prediction network perform camera pose prediction on the sample image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion;
determining a sample predicted depth map corresponding to the first sample image frame by having a to-be-trained scenario depth prediction network perform scenario depth prediction on the first sample image frame using the first hidden state information at time t-1, wherein the first hidden state information includes feature information related to scenario depth;
constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and
training the to-be-trained scenario depth prediction network according to the loss function to obtain the scenario depth prediction network.

5. The method of claim 4,
wherein constructing the loss function according to the sample predicted depth map and the sample predicted camera motion comprises:
determining, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frame of the first sample image frame in the sample image frame sequence with respect to the first sample image frame;
determining a penalty function term according to the distribution continuity of the sample predicted depth map; and
constructing the loss function according to the reprojection error term and the penalty function term.
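A loss of the shape described in claim 5 can be sketched as below. This is a simplified illustration under stated assumptions, not the claimed loss: images are single rows of intensities, camera motion is a pure sideways translation `tx`, the pinhole model reduces to a horizontal shift of `focal * tx / depth` pixels (the sign convention is assumed), the penalty term is the mean absolute depth gradient, and `smooth_weight` is a hypothetical weighting.

```python
from typing import List


def sample_linear(row: List[float], u: float) -> float:
    """Linearly interpolate `row` at fractional position u (0 <= u <= len(row) - 1)."""
    left = int(u)
    right = min(left + 1, len(row) - 1)
    frac = u - left
    return (1.0 - frac) * row[left] + frac * row[right]


def reprojection_error(target: List[float], adjacent: List[float],
                       depth: List[float], tx: float, focal: float) -> float:
    """Mean photometric error after warping target pixels into the adjacent frame."""
    total, valid = 0.0, 0
    for u, (pixel, d) in enumerate(zip(target, depth)):
        u_adj = u + focal * tx / d  # where this pixel lands in the adjacent frame
        if 0.0 <= u_adj <= len(adjacent) - 1:  # skip pixels that leave the view
            total += abs(pixel - sample_linear(adjacent, u_adj))
            valid += 1
    return total / valid if valid else 0.0


def smoothness_penalty(depth: List[float]) -> float:
    """Mean absolute difference between neighbouring depths."""
    diffs = [abs(b - a) for a, b in zip(depth, depth[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


def total_loss(target: List[float], adjacent: List[float], depth: List[float],
               tx: float, focal: float, smooth_weight: float = 0.1) -> float:
    return (reprojection_error(target, adjacent, depth, tx, focal)
            + smooth_weight * smoothness_penalty(depth))


# The adjacent frame shows the same intensity ramp shifted right by one pixel,
# which is consistent with depth 2.0 when focal = 2.0 and tx = 1.0.
target = [0.0, 1.0, 2.0, 3.0, 4.0]
adjacent = [-1.0, 0.0, 1.0, 2.0, 3.0]
loss_true = total_loss(target, adjacent, [2.0] * 5, tx=1.0, focal=2.0)
loss_wrong = total_loss(target, adjacent, [4.0] * 5, tx=1.0, focal=2.0)
```

In this toy setup the depth that is geometrically consistent with the camera motion yields the lower loss, which is why depth and camera motion can supervise each other without ground-truth depth labels.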

6. A camera motion prediction method, comprising:
obtaining an image frame sequence corresponding to time t, wherein the image frame sequence includes a target image frame at time t and an adjacent image frame of the target image frame; and
determining a predicted camera motion corresponding to the image frame sequence by having a camera motion prediction network perform camera pose prediction on the image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained through auxiliary training based on a scenario depth prediction network.

7. The method of claim 6,
wherein determining the predicted camera motion corresponding to the image frame sequence by having the camera motion prediction network perform camera pose prediction on the image frame sequence using the second hidden state information at time t-1 comprises:
performing feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, wherein the second feature map is a feature map related to camera motion;
determining second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and
determining the predicted camera motion according to the second hidden state information at time t.

8. The method of claim 6 or 7,
wherein the predicted camera motion includes a relative pose between adjacent image frames in the sequence of image frames.
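One common use of relative poses between adjacent frames is to chain them into a camera trajectory. The sketch below illustrates that idea only; it is not taken from the claims. It assumes each relative pose is a 4x4 homogeneous transform that maps coordinates in frame k into frame k-1, and `pose_matrix` is a hypothetical helper restricted to yaw rotation for brevity.

```python
import math
from typing import List

Matrix = List[List[float]]


def pose_matrix(yaw: float, tx: float, ty: float, tz: float) -> Matrix:
    """Homogeneous transform: rotation about the z axis followed by a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def compose(a: Matrix, b: Matrix) -> Matrix:
    """Matrix product a @ b for 4x4 transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def accumulate(relative_poses: List[Matrix]) -> List[Matrix]:
    """Chain relative poses into absolute poses, starting from the identity."""
    trajectory = [pose_matrix(0.0, 0.0, 0.0, 0.0)]
    for rel in relative_poses:
        trajectory.append(compose(trajectory[-1], rel))
    return trajectory


# Two identical steps: move one unit along the local x axis and turn left 90 degrees.
step = pose_matrix(math.pi / 2, 1.0, 0.0, 0.0)
trajectory = accumulate([step, step])
final_position = [trajectory[-1][i][3] for i in range(3)]
```

After the two steps the camera sits at approximately (1, 1, 0) in the starting frame, since the second unit of travel is taken after the first 90 degree turn.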

9. The method of any one of claims 6 to 8,
further comprising:
obtaining a sample image frame sequence corresponding to time t, wherein the sample image frame sequence includes a first sample image frame at time t and an adjacent sample image frame of the first sample image frame;
determining a sample predicted depth map corresponding to the first sample image frame by having the scenario depth prediction network perform scenario depth prediction on the first sample image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to scenario depth;
determining a sample predicted camera motion corresponding to the sample image frame sequence by having a to-be-trained camera motion prediction network perform camera pose prediction on the sample image frame sequence using the second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion;
constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and
training the to-be-trained camera motion prediction network according to the loss function to obtain the camera motion prediction network.

10. The method of claim 9,
wherein constructing the loss function according to the sample prediction depth map and the sample prediction camera motion comprises:
determining, according to the sample prediction camera motion, a reprojection error term of an adjacent sample image frame of the first sample image frame in the sample image frame sequence with respect to the first sample image frame;
determining a penalty function term according to the distribution continuity of the sample prediction depth map; and
constructing the loss function according to the reprojection error term and the penalty function term.
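
A loss of this shape can be sketched as a photometric reprojection term plus a depth smoothness penalty. The sketch rests on several assumptions not stated in the claim: grayscale frames, a known pinhole intrinsics matrix, a pose that maps target-camera coordinates to adjacent-camera coordinates, nearest-neighbour sampling, an absolute-difference error, and a hypothetical weight of 0.1 on the penalty. The function names are illustrative.

```python
import numpy as np


def reprojection_error(target, adjacent, depth, pose, intrinsics):
    """Mean absolute difference after warping `adjacent` into the target view."""
    h, w = target.shape
    v, u = np.mgrid[0:h, 0:w]
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])      # 3 x N
    points = (np.linalg.inv(intrinsics) @ pixels) * depth.ravel()  # back-project
    points = pose[:3, :3] @ points + pose[:3, 3:4]                 # to adjacent view
    proj = intrinsics @ points
    z = proj[2]
    valid = z > 1e-6                      # keep points in front of the camera
    safe_z = np.where(valid, z, 1.0)
    u_adj = np.round(proj[0] / safe_z).astype(int)
    v_adj = np.round(proj[1] / safe_z).astype(int)
    valid &= (u_adj >= 0) & (u_adj < w) & (v_adj >= 0) & (v_adj < h)
    if not valid.any():
        return 0.0
    warped = adjacent[v_adj[valid], u_adj[valid]]  # nearest-neighbour sampling
    return float(np.mean(np.abs(warped - target.ravel()[valid])))


def smoothness_penalty(depth):
    """Penalize abrupt depth changes between neighbouring pixels."""
    dx = np.abs(np.diff(depth, axis=1))
    dy = np.abs(np.diff(depth, axis=0))
    return float(dx.mean() + dy.mean())


def total_loss(target, adjacent, depth, pose, intrinsics, weight=0.1):
    return (reprojection_error(target, adjacent, depth, pose, intrinsics)
            + weight * smoothness_penalty(depth))


rng = np.random.default_rng(0)
target = rng.random((8, 8))
adjacent = np.roll(target, 1, axis=1)  # scene shifted right by one pixel
depth = np.full((8, 8), 2.0)
intrinsics = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[0, 3] = 1.0  # with focal length 2 and depth 2 this is a one-pixel shift
loss_aligned = total_loss(target, adjacent, depth, pose, intrinsics)
loss_identity = total_loss(target, adjacent, depth, np.eye(4), intrinsics)
```

In this toy setup the correct pose and depth explain the shift between the two frames, so `loss_aligned` is lower than `loss_identity`. That gap is the signal a training procedure could use to improve both predictions.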

A scenario depth prediction device, comprising:
a first acquiring module, configured to acquire a target image frame at time t; and
a first scenario depth prediction module, configured to determine a predicted depth map corresponding to the target image frame through a scenario depth prediction network performing scenario depth prediction on the target image frame using first hidden state information at time t-1, wherein the first hidden state information includes feature information related to the scenario depth, and the scenario depth prediction network is obtained by auxiliary training based on a camera motion prediction network.

12. The device of claim 11,
wherein the first scenario depth prediction module includes:
a first determining submodule, configured to perform feature extraction on the target image frame to determine a first feature map corresponding to the target image frame, wherein the first feature map is a feature map related to the scenario depth;
a second determining submodule, configured to determine, according to the first feature map and the first hidden state information at time t-1, the first hidden state information at time t; and
a third determining submodule, configured to determine the predicted depth map according to the first hidden state information at time t.

13. The device of claim 12,
wherein the first hidden state information at time t-1 includes first hidden state information at different scales at time t-1;
the first determining submodule is specifically configured to perform multi-scale downsampling on the target image frame to determine first feature maps at different scales corresponding to the target image frame;
the second determining submodule is specifically configured to determine, for any scale, the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1; and
the third determining submodule is specifically configured to perform feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map.
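
The multi-scale flow recited above (downsample, update one hidden state per scale, fuse across scales) can be sketched as follows. All specifics are assumptions for illustration only: average pooling as the downsampling, a fixed blend followed by `tanh` as a placeholder for the learned recurrent update, nearest-neighbour upsampling plus averaging as the fusion, and an exponential to keep the output positive. The function names are hypothetical.

```python
import numpy as np


def downsample(image, factor):
    """Average pooling; height and width must be divisible by `factor`."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def upsample(feature, factor):
    """Nearest-neighbour upsampling back to the input resolution."""
    return np.repeat(np.repeat(feature, factor, axis=0), factor, axis=1)


def update_hidden(feature, h_prev, keep=0.5):
    """Placeholder for a learned recurrent update at a single scale."""
    return np.tanh(keep * h_prev + (1.0 - keep) * feature)


def predict_depth(frame, hidden_prev, factors=(2, 4, 8)):
    hidden_t = {}
    for s in factors:
        feat = downsample(frame, s)                         # feature map at scale s
        hidden_t[s] = update_hidden(feat, hidden_prev[s])   # hidden state at time t
    # Fuse the per-scale hidden states at full resolution.
    fused = np.mean([upsample(hidden_t[s], s) for s in factors], axis=0)
    depth = np.exp(fused)  # assumed mapping to a positive depth value
    return depth, hidden_t


rng = np.random.default_rng(0)
factors = (2, 4, 8)
hidden = {s: np.zeros((16 // s, 16 // s)) for s in factors}
for t in range(3):
    frame = rng.random((16, 16))
    depth_map, hidden = predict_depth(frame, hidden, factors)
```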

14. The device according to any one of claims 11 to 13,
wherein the device further includes a first training module, the first training module being configured to:
obtain a sequence of sample image frames corresponding to time t, wherein the sequence of sample image frames includes a first sample image frame at time t and an adjacent sample image frame of the first sample image frame;
determine, through a camera motion prediction network performing camera pose prediction on the sample image frame sequence using second hidden state information at time t-1, a sample prediction camera motion corresponding to the sample image frame sequence, wherein the second hidden state information includes feature information related to camera motion;
determine, through a scenario depth prediction network to be trained performing scenario depth prediction on the first sample image frame using the first hidden state information at time t-1, a sample prediction depth map corresponding to the first sample image frame, wherein the first hidden state information includes feature information related to the scenario depth;
construct a loss function according to the sample prediction depth map and the sample prediction camera motion; and
perform training on the to-be-trained scenario depth prediction network according to the loss function to obtain the scenario depth prediction network.

15. The device of claim 14,
wherein the first training module is specifically configured to:
determine, according to the sample prediction camera motion, a reprojection error term of an adjacent sample image frame of the first sample image frame in the sample image frame sequence with respect to the first sample image frame;
determine a penalty function term according to the distribution continuity of the sample prediction depth map; and
construct the loss function according to the reprojection error term and the penalty function term.

A camera motion prediction device comprising:
a second acquiring module, configured to acquire a sequence of image frames corresponding to time t, wherein the sequence of image frames includes a target image frame at time t and an adjacent image frame of the target image frame; and
a first camera motion prediction module, configured to determine a predicted camera motion corresponding to the image frame sequence through a camera motion prediction network performing camera pose prediction on the image frame sequence using second hidden state information at time t-1, wherein the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained by auxiliary training based on a scenario depth prediction network.

17. The device of claim 16,
wherein the first camera motion prediction module includes:
a sixth determining submodule, configured to perform feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, wherein the second feature map is a feature map related to camera motion;
a seventh determining submodule, configured to determine, according to the second feature map and the second hidden state information at time t-1, the second hidden state information at time t; and
an eighth determining submodule, configured to determine the predicted camera motion according to the second hidden state information at time t.

18. The device of claim 16 or 17,
wherein the predicted camera motion includes a relative pose between adjacent image frames in the image frame sequence.

19. The device according to any one of claims 16 to 18,
wherein the device further includes a second training module, the second training module being configured to:
obtain a sequence of sample image frames corresponding to time t, wherein the sequence of sample image frames includes a first sample image frame at time t and an adjacent sample image frame of the first sample image frame;
determine, through a scenario depth prediction network performing scenario depth prediction on the first sample image frame using first hidden state information at time t-1, a sample prediction depth map corresponding to the first sample image frame, wherein the first hidden state information includes feature information related to the scenario depth;
determine, through a camera motion prediction network to be trained performing camera pose prediction on the sample image frame sequence using the second hidden state information at time t-1, a sample prediction camera motion corresponding to the sample image frame sequence, wherein the second hidden state information includes feature information related to camera motion;
construct a loss function according to the sample prediction depth map and the sample prediction camera motion; and
perform training on the to-be-trained camera motion prediction network according to the loss function to obtain the camera motion prediction network.

20. The device of claim 19,
wherein the second training module is specifically configured to:
determine, according to the sample prediction camera motion, a reprojection error term of an adjacent sample image frame of the first sample image frame in the sample image frame sequence with respect to the first sample image frame;
determine a penalty function term according to the distribution continuity of the sample prediction depth map; and
construct the loss function according to the reprojection error term and the penalty function term.

An electronic device, comprising:
a processor; and
a memory configured to store processor-executable instructions,
wherein the processor is configured to execute the method according to any one of claims 1 to 10 by calling the instructions stored in the memory.

A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 10.

A computer program comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the method according to any one of claims 1 to 10.