TW202141428A - Scene depth and camera motion prediction method, electronic equipment and computer readable storage medium - Google Patents

Scene depth and camera motion prediction method, electronic equipment and computer readable storage medium

Info

Publication number
TW202141428A
Authority
TW
Taiwan
Prior art keywords
image frame
time
hidden state
state information
sample
Prior art date
Application number
TW110107767A
Other languages
Chinese (zh)
Other versions
TWI767596B (en)
Inventor
韓滔
張展鵬
成慧
Original Assignee
大陸商深圳市商湯科技有限公司 (Shenzhen SenseTime Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 大陸商深圳市商湯科技有限公司 (Shenzhen SenseTime Technology Co., Ltd.)
Publication of TW202141428A
Application granted
Publication of TWI767596B

Classifications

    • G06T 7/50 Depth or shape recovery (image analysis)
    • G06N 20/00 Machine learning
    • G06T 7/20 Analysis of motion (image analysis)
    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/10004 Still image; Photographic image (image acquisition modality)
    • G06T 2207/20081 Training; Learning (special algorithmic details)
    • G06T 2207/30244 Camera pose (subject of image)

Abstract

The invention relates to a scene depth and camera motion prediction method, an electronic device, and a computer-readable storage medium. The method includes: acquiring a target image frame at time t; and performing scene depth prediction on the target image frame through a scene depth prediction network using first hidden state information at time t-1, to determine a predicted depth map corresponding to the target image frame, where the first hidden state information includes feature information related to scene depth, and the scene depth prediction network is obtained by auxiliary training with a camera motion prediction network.

Description

Scene depth and camera motion prediction method, electronic device and computer-readable storage medium

The present invention relates to the field of computer technology, and relates to, but is not limited to, a scene depth and camera motion prediction method, an electronic device, and a computer-readable storage medium.

Predicting scene depth and camera motion from images captured by a monocular image acquisition device (for example, a monocular camera) has been an active and important research direction in computer vision for the past two decades, and is widely applied in many fields such as augmented reality, autonomous driving, and mobile robot localization and navigation.

Embodiments of the present invention provide a technical solution for a scene depth and camera motion prediction method, an electronic device, and a computer-readable storage medium.

An embodiment of the present invention provides a scene depth prediction method, including: acquiring a target image frame at time t; and performing scene depth prediction on the target image frame through a scene depth prediction network using first hidden state information at time t-1, to determine a predicted depth map corresponding to the target image frame, where the first hidden state information includes feature information related to scene depth, and the scene depth prediction network is obtained by auxiliary training with a camera motion prediction network.

In some embodiments of the present invention, performing scene depth prediction on the target image frame through the scene depth prediction network using the first hidden state information at time t-1 to determine the predicted depth map corresponding to the target image frame includes: performing feature extraction on the target image frame to determine a first feature map corresponding to the target image frame, where the first feature map is a feature map related to scene depth; determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and determining the predicted depth map according to the first hidden state information at time t.
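The recurrent step described in this embodiment can be summarized by the following minimal sketch (PyTorch-style; the patent does not publish an implementation, so the module names, signatures, and tensor shapes here are illustrative assumptions):

```python
import torch

def predict_depth_step(depth_encoder, conv_gru, depth_decoder, frame_t, h_prev):
    """One recurrent scene-depth prediction step.

    frame_t: target image frame at time t, shape (B, 3, H, W)
    h_prev:  first hidden state information at time t-1 (None at the first step)
    Returns the predicted depth map at time t and the updated hidden state.
    """
    feat_t = depth_encoder(frame_t)   # first feature map, related to scene depth
    h_t = conv_gru(feat_t, h_prev)    # fuse with the hidden state from time t-1
    depth_t = depth_decoder(h_t)      # predicted depth map at time t
    return depth_t, h_t
```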

In some embodiments of the present invention, the first hidden state information at time t-1 includes the first hidden state information at different scales at time t-1. Performing feature extraction on the target image frame to determine the first feature map corresponding to the target image frame includes: performing multi-scale downsampling on the target image frame to determine the first feature maps at different scales corresponding to the target image frame. Determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1 includes: for any scale, determining the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1. Determining the predicted depth map according to the first hidden state information at time t includes: performing feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map.

In some embodiments of the present invention, the method further includes: acquiring a sample image frame sequence corresponding to time t, where the sample image frame sequence includes a first sample image frame at time t and adjacent sample image frames of the first sample image frame; performing camera pose prediction on the sample image frame sequence through a camera motion prediction network using second hidden state information at time t-1, to determine a sample predicted camera motion corresponding to the sample image frame sequence, where the second hidden state information includes feature information related to camera motion; performing scene depth prediction on the first sample image frame through the scene depth prediction network to be trained using the first hidden state information at time t-1, to determine a sample predicted depth map corresponding to the first sample image frame, where the first hidden state information includes feature information related to scene depth; constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and training the scene depth prediction network to be trained according to the loss function, to obtain the scene depth prediction network.

In some embodiments of the present invention, constructing the loss function according to the sample predicted depth map and the sample predicted camera motion includes: determining, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames in the sample image frame sequence relative to the first sample image frame; determining a penalty function term according to the distribution continuity of the sample predicted depth map; and constructing the loss function according to the reprojection error term and the penalty function term.

An embodiment of the present invention further provides a camera motion prediction method, including: acquiring an image frame sequence corresponding to time t, where the image frame sequence includes a target image frame at time t and adjacent image frames of the target image frame; and performing camera pose prediction on the image frame sequence through a camera motion prediction network using second hidden state information at time t-1, to determine a predicted camera motion corresponding to the image frame sequence, where the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained by auxiliary training with a scene depth prediction network.

In some embodiments of the present invention, performing camera pose prediction on the image frame sequence through the camera motion prediction network using the second hidden state information at time t-1 to determine the predicted camera motion corresponding to the image frame sequence includes: performing feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, where the second feature map is a feature map related to camera motion; determining the second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and determining the predicted camera motion according to the second hidden state information at time t.

In some embodiments of the present invention, the predicted camera motion includes relative poses between adjacent image frames in the image frame sequence.

In some embodiments of the present invention, the method further includes: acquiring a sample image frame sequence corresponding to time t, where the sample image frame sequence includes a first sample image frame at time t and adjacent sample image frames of the first sample image frame; performing scene depth prediction on the first sample image frame through a scene depth prediction network using first hidden state information at time t-1, to determine a sample predicted depth map corresponding to the first sample image frame, where the first hidden state information includes feature information related to scene depth; performing camera pose prediction on the sample image frame sequence through the camera motion prediction network to be trained using the second hidden state information at time t-1, to determine a sample predicted camera motion corresponding to the sample image frame sequence, where the second hidden state information includes feature information related to camera motion; constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and training the camera motion prediction network to be trained according to the loss function, to obtain the camera motion prediction network.

In some embodiments of the present invention, constructing the loss function according to the sample predicted depth map and the sample predicted camera motion includes: determining, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames in the sample image frame sequence relative to the first sample image frame; determining a penalty function term according to the distribution continuity of the sample predicted depth map; and constructing the loss function according to the reprojection error term and the penalty function term.

An embodiment of the present invention further provides a scene depth prediction apparatus, including: a first acquisition module configured to acquire a target image frame at time t; and a first scene depth prediction module configured to perform scene depth prediction on the target image frame through a scene depth prediction network using first hidden state information at time t-1, to determine a predicted depth map corresponding to the target image frame, where the first hidden state information includes feature information related to scene depth, and the scene depth prediction network is obtained by auxiliary training with a camera motion prediction network.

In some embodiments of the present invention, the first scene depth prediction module includes: a first determining submodule configured to perform feature extraction on the target image frame to determine a first feature map corresponding to the target image frame, where the first feature map is a feature map related to scene depth; a second determining submodule configured to determine the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and a third determining submodule configured to determine the predicted depth map according to the first hidden state information at time t.

In some embodiments of the present invention, the first hidden state information at time t-1 includes the first hidden state information at different scales at time t-1. The first determining submodule is specifically configured to perform multi-scale downsampling on the target image frame to determine the first feature maps at different scales corresponding to the target image frame. The second determining submodule is specifically configured to, for any scale, determine the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1. The third determining submodule is specifically configured to perform feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map.

In some embodiments of the present invention, the apparatus further includes a first training module configured to: acquire a sample image frame sequence corresponding to time t, where the sample image frame sequence includes a first sample image frame at time t and adjacent sample image frames of the first sample image frame; perform camera pose prediction on the sample image frame sequence through a camera motion prediction network using second hidden state information at time t-1, to determine a sample predicted camera motion corresponding to the sample image frame sequence, where the second hidden state information includes feature information related to camera motion; perform scene depth prediction on the first sample image frame through the scene depth prediction network to be trained using the first hidden state information at time t-1, to determine a sample predicted depth map corresponding to the first sample image frame, where the first hidden state information includes feature information related to scene depth; construct a loss function according to the sample predicted depth map and the sample predicted camera motion; and train the scene depth prediction network to be trained according to the loss function, to obtain the scene depth prediction network.

In some embodiments of the present invention, the first training module is specifically configured to: determine, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames in the sample image frame sequence relative to the first sample image frame; determine a penalty function term according to the distribution continuity of the sample predicted depth map; and construct the loss function according to the reprojection error term and the penalty function term.

An embodiment of the present invention further provides a camera motion prediction apparatus, including: a second acquisition module configured to acquire an image frame sequence corresponding to time t, where the image frame sequence includes a target image frame at time t and adjacent image frames of the target image frame; and a first camera motion prediction module configured to perform camera pose prediction on the image frame sequence through a camera motion prediction network using second hidden state information at time t-1, to determine a predicted camera motion corresponding to the image frame sequence, where the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained by auxiliary training with a scene depth prediction network.

In some embodiments of the present invention, the first camera motion prediction module includes: a sixth determining submodule configured to perform feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, where the second feature map is a feature map related to camera motion; a seventh determining submodule configured to determine the second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and an eighth determining submodule configured to determine the predicted camera motion according to the second hidden state information at time t.

In some embodiments of the present invention, the predicted camera motion includes relative poses between adjacent image frames in the image frame sequence.

In some embodiments of the present invention, the apparatus further includes a second training module configured to: acquire a sample image frame sequence corresponding to time t, where the sample image frame sequence includes a first sample image frame at time t and adjacent sample image frames of the first sample image frame; perform scene depth prediction on the first sample image frame through a scene depth prediction network using first hidden state information at time t-1, to determine a sample predicted depth map corresponding to the first sample image frame, where the first hidden state information includes feature information related to scene depth; perform camera pose prediction on the sample image frame sequence through the camera motion prediction network to be trained using the second hidden state information at time t-1, to determine a sample predicted camera motion corresponding to the sample image frame sequence, where the second hidden state information includes feature information related to camera motion; construct a loss function according to the sample predicted depth map and the sample predicted camera motion; and train the camera motion prediction network to be trained according to the loss function, to obtain the camera motion prediction network.

In some embodiments of the present invention, the second training module is specifically configured to: determine, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames in the sample image frame sequence relative to the first sample image frame; determine a penalty function term according to the distribution continuity of the sample predicted depth map; and construct the loss function according to the reprojection error term and the penalty function term.

An embodiment of the present invention further provides an electronic device, including: a processor; and a memory configured to store processor-executable instructions, where the processor is configured to call the instructions stored in the memory to execute any one of the above methods.

An embodiment of the present invention further provides a computer-readable storage medium having computer program instructions stored thereon, where the computer program instructions, when executed by a processor, implement any one of the above methods.

An embodiment of the present invention further provides a computer program, including computer-readable code, where when the computer-readable code runs in an electronic device, a processor in the electronic device executes the code to implement any one of the above methods.

In the embodiments of the present invention, the target image frame corresponding to time t is acquired; since the scene depths at adjacent moments are temporally correlated, performing scene depth prediction on the target image frame through the scene depth prediction network using the first hidden state information related to scene depth at time t-1 can yield a predicted depth map with high prediction accuracy corresponding to the target image frame.

In the embodiments of the present invention, an image frame sequence corresponding to time t is acquired, including the target image frame at time t and adjacent image frames of the target image frame; since the camera poses at adjacent moments are temporally correlated, performing camera pose prediction on the image frame sequence through the camera motion prediction network using the second hidden state information related to camera motion at time t-1 can yield a predicted camera motion with high prediction accuracy.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present invention. Other features and aspects of the present invention will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Various exemplary embodiments, features, and aspects of the present invention will be described in detail below with reference to the drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.

The word "exemplary" here means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.

The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or D may indicate three cases: A alone, both A and D, and D alone. In addition, the term "at least one" herein indicates any one of multiple items or any combination of at least two of multiple items; for example, including at least one of A, C, and D may indicate including any one or more elements selected from the set consisting of A, C, and D.

In addition, in order to better illustrate the present invention, numerous specific details are given in the following detailed description. Those skilled in the art should understand that the present invention can also be implemented without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present invention.

Fig. 1 shows a flowchart of a scene depth prediction method according to an embodiment of the present invention. The scene depth prediction method shown in Fig. 1 may be executed by a terminal device or other processing device, where the terminal device may be user equipment (User Equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. The other processing device may be a server, a cloud server, or the like. In some embodiments, the scene depth prediction method may be implemented by a processor calling computer-readable instructions stored in a memory. As shown in Fig. 1, the method may include:
In step S11, a target image frame at time t is acquired.
In step S12, scene depth prediction is performed on the target image frame through a scene depth prediction network using first hidden state information at time t-1, to determine a predicted depth map corresponding to the target image frame, where the first hidden state information includes feature information related to scene depth, and the scene depth prediction network is obtained by auxiliary training with a camera motion prediction network.

In the embodiments of the present invention, the target image frame at time t is acquired; since the scene depths at adjacent moments are temporally correlated, performing scene depth prediction on the target image frame through the scene depth prediction network using the first hidden state information related to scene depth at time t-1 can yield a predicted depth map with high prediction accuracy corresponding to the target image frame.

In some embodiments, performing scene depth prediction on the target image frame through the scene depth prediction network using the first hidden state information at time t-1 to determine the predicted depth map corresponding to the target image frame may include: performing feature extraction on the target image frame to determine a first feature map corresponding to the target image frame, where the first feature map is a feature map related to scene depth; determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and determining the predicted depth map according to the first hidden state information at time t.

Since the scene depths at adjacent moments are temporally correlated, the scene depth prediction network can determine the first hidden state information related to scene depth at the current moment from the depth-related first feature map corresponding to the target image frame at the current moment (for example, time t) and the first hidden state information related to scene depth at the previous moment (for example, time t-1), and then perform scene depth prediction on the target image frame based on the first hidden state information at the current moment, yielding a predicted depth map with high prediction accuracy corresponding to the target image frame at the current moment.

For example, when the scene depth prediction network is used to predict the depth maps corresponding to the image frames in an image frame sequence (including image frames from time 1 to time t), a preset initial value of the first hidden state information related to scene depth is set in the initialization stage of the scene depth prediction network. The first hidden state at time 1 is determined based on the preset initial value of the first hidden state information and the depth-related first feature map corresponding to the image frame at time 1, and scene depth prediction is then performed on the image frame at time 1 based on the first hidden state at time 1, obtaining the predicted depth map corresponding to the image frame at time 1. The first hidden state at time 2 is determined based on the first hidden state at time 1 and the depth-related first feature map corresponding to the image frame at time 2, and scene depth prediction is then performed on the image frame at time 2 based on the first hidden state at time 2, obtaining the predicted depth map corresponding to the image frame at time 2. The first hidden state at time 3 is determined based on the first hidden state at time 2 and the depth-related first feature map corresponding to the image frame at time 3, and scene depth prediction is then performed on the image frame at time 3 based on the first hidden state at time 3, obtaining the predicted depth map corresponding to the image frame at time 3. By analogy, the predicted depth maps corresponding to all image frames in the image frame sequence (including image frames from time 1 to time t) are finally obtained.
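This sequential procedure reduces to a short loop over the frame sequence, reusing the hedged `predict_depth_step` sketch above (with `None` standing in for the preset initial value of the hidden state, which the patent leaves unspecified):

```python
def predict_sequence(depth_encoder, conv_gru, depth_decoder, frames):
    """Recurrent depth prediction over image frames at times 1, 2, ..., t."""
    h = None  # preset initial value of the first hidden state information
    depth_maps = []
    for frame in frames:
        depth, h = predict_depth_step(depth_encoder, conv_gru,
                                      depth_decoder, frame, h)
        depth_maps.append(depth)  # h carries over to the next time step
    return depth_maps
```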

In some embodiments, the first hidden state information at time t-1 includes the first hidden state information at different scales at time t-1. Performing feature extraction on the target image frame to determine the first feature map corresponding to the target image frame may include: performing multi-scale downsampling on the target image frame to determine the first feature maps at different scales corresponding to the target image frame. Determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1 may include: for any scale, determining the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1. Determining the predicted depth map according to the first hidden state information at time t may include: performing feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map.

To better determine the predicted depth map corresponding to the target image frame at time t, the scene depth prediction network may adopt a multi-scale feature fusion mechanism. Fig. 2 shows a block diagram of a scene depth prediction network according to an embodiment of the present invention. As shown in Fig. 2, the scene depth prediction network includes a depth encoder 202, multi-scale convolutional gated recurrent units (Convolutional Gated Recurrent Unit, ConvGRU), and a depth decoder 205. The target image frame 201 at time t is input into the depth encoder 202 for multi-scale downsampling, obtaining the first feature maps 203 at different scales corresponding to the target image frame: the first feature map $F_t^0$ at the first scale, the first feature map $F_t^1$ at the second scale, and the first feature map $F_t^2$ at the third scale. The multi-scale ConvGRUs correspond to the scales of the multi-scale first feature maps; that is, the multi-scale ConvGRUs include ConvGRU0 at the first scale, ConvGRU1 at the second scale, and ConvGRU2 at the third scale.

Still taking the above Fig. 2 as an example, the first feature map $F_t^0$ is input into ConvGRU0, the first feature map $F_t^1$ is input into ConvGRU1, and the first feature map $F_t^2$ is input into ConvGRU2. ConvGRU0 performs feature fusion on the first feature map $F_t^0$ and the first hidden state information $h_{t-1}^0$ at the first scale at time t-1 stored in ConvGRU0, obtaining the first hidden state $h_t^0$ at the first scale at time t; ConvGRU0 stores $h_t^0$ and outputs it to the depth decoder. ConvGRU1 performs feature fusion on the first feature map $F_t^1$ and the first hidden state information $h_{t-1}^1$ at the second scale at time t-1 stored in ConvGRU1, obtaining the first hidden state $h_t^1$ at the second scale at time t; ConvGRU1 stores $h_t^1$ and outputs it to the depth decoder. ConvGRU2 performs feature fusion on the first feature map $F_t^2$ and the first hidden state information $h_{t-1}^2$ at the third scale at time t-1 stored in ConvGRU2, obtaining the first hidden state $h_t^2$ at the third scale at time t; ConvGRU2 stores $h_t^2$ and outputs it to the depth decoder. In Fig. 2, the multi-scale hidden state 204 includes the first hidden states $h_t^0$, $h_t^1$, and $h_t^2$ at the first, second, and third scales at time t.

The depth decoder 205 restores the scales of the first hidden states $h_t^0$, $h_t^1$, and $h_t^2$ at time t to the same scale as the target image frame 201 (hereinafter, the scale of the target image frame is referred to as the target scale), obtaining three first hidden states at the target scale at time t. Since the first hidden state information includes feature information related to scene depth and exists in the form of feature maps within the scene depth prediction network, feature-map fusion is performed on the three first hidden states at the target scale at time t, thereby obtaining the predicted depth map $\hat{D}_t$ corresponding to the target image frame at time t.
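A hedged sketch of this multi-scale variant is given below (PyTorch-style; the three-scale layout and the upsample-then-fuse decoder follow the text, while the module internals, channel counts, and the bilinear upsampling choice are assumptions):

```python
import torch
import torch.nn.functional as F

class MultiScaleDepthNet(torch.nn.Module):
    """Fig. 2-style depth network: one ConvGRU per scale, fused in the decoder."""
    def __init__(self, encoder, gru_cells, decoder):
        super().__init__()
        self.encoder = encoder                           # multi-scale downsampling
        self.gru_cells = torch.nn.ModuleList(gru_cells)  # ConvGRU0, ConvGRU1, ConvGRU2
        self.decoder = decoder                           # fuses target-scale hidden states

    def forward(self, frame_t, h_prev=None):
        feats = self.encoder(frame_t)                    # [F_t^0, F_t^1, F_t^2]
        if h_prev is None:
            h_prev = [None] * len(self.gru_cells)        # initialization stage
        h_t = [gru(f, h) for gru, f, h in zip(self.gru_cells, feats, h_prev)]
        # restore each hidden state to the target (input) scale, then fuse
        size = frame_t.shape[-2:]
        up = [F.interpolate(h, size=size, mode='bilinear', align_corners=False)
              for h in h_t]
        depth_t = self.decoder(torch.cat(up, dim=1))
        return depth_t, h_t
```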

In some embodiments, the scene depth prediction method may further include: acquiring a sample image frame sequence corresponding to time t, where the sample image frame sequence includes a first sample image frame at time t and adjacent sample image frames of the first sample image frame; performing camera pose prediction on the sample image frame sequence through a camera motion prediction network using second hidden state information at time t-1, to determine a sample predicted camera motion corresponding to the sample image frame sequence, where the second hidden state information includes feature information related to camera motion; performing scene depth prediction on the first sample image frame through the scene depth prediction network to be trained using the first hidden state information at time t-1, to determine a sample predicted depth map corresponding to the first sample image frame, where the first hidden state information includes feature information related to scene depth; constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and training the scene depth prediction network to be trained according to the loss function, to obtain the scene depth prediction network.

In the embodiments of the present invention, the scene depth prediction network is obtained by auxiliary training with the camera motion prediction network, or the scene depth prediction network and the camera motion prediction network are obtained by joint training. Taking advantage of the temporal correlation between the scene depths and camera poses at adjacent moments, a sliding-window data fusion mechanism is introduced to extract and memorize, from the sliding-window sequence, the hidden state information related to the scene depth and camera motion at the target moment (time t), and unsupervised network training is then performed on the scene depth prediction network and/or the camera motion prediction network.

In the embodiments of the present invention, a training set may be created in advance, the training set including sample image frame sequences continuously collected in time order, and the scene depth prediction network to be trained is then trained based on the training set. Fig. 3 shows a block diagram of unsupervised network training according to an embodiment of the present invention. As shown in Fig. 3, the target moment is time t, and the sample image frame sequence 301 corresponding to the target moment (that is, the sample image frame sequence included in the sliding window corresponding to the target moment) includes: the first sample image frame $I_t$ at time t, the adjacent sample image frame $I_{t-1}$ at time t-1, and the adjacent sample image frame $I_{t+1}$ at time t+1. The number of adjacent sample image frames of the first sample image frame in the sample image frame sequence may be determined according to the actual situation, which is not specifically limited in the present invention.

The scene depth prediction network to be trained shown in Fig. 3 adopts a single-scale feature fusion mechanism. During network training, the scene depth prediction network to be trained may adopt the single-scale feature fusion mechanism shown in Fig. 3 or the multi-scale feature fusion mechanism shown in Fig. 2, which is not specifically limited in the present invention. As shown in Fig. 3, the scene depth prediction network to be trained includes a depth encoder 202, a ConvGRU, and a depth decoder 205. The first sample image frame $I_t$ at time t is input into the depth encoder 202 for feature extraction, obtaining the first feature map $F_t$ corresponding to the first sample image frame $I_t$; the first feature map $F_t$ is then input into the ConvGRU, so that feature fusion is performed on $F_t$ and the first hidden state information $h_{t-1}$ at time t-1 stored in the ConvGRU, obtaining the first hidden state $h_t$ at time t; the ConvGRU stores the first hidden state $h_t$ at time t and outputs it to the depth decoder 205, thereby obtaining the sample predicted depth map $\hat{D}_t$ corresponding to the first sample image frame at time t.

Still taking the above Fig. 3 as an example, as shown in Fig. 3, the camera motion prediction network includes a pose encoder 302, a ConvGRU, and a pose decoder 303. The sample image frame sequence $[I_t, I_{t-1}, I_{t+1}]$ corresponding to time t is input into the pose encoder 302 for feature extraction, obtaining the second feature map $V_t$ corresponding to the sample image frame sequence; the second feature map $V_t$ is then input into the ConvGRU, so that feature fusion is performed on $V_t$ and the second hidden state information $s_{t-1}$ at time t-1 stored in the ConvGRU, obtaining the second hidden state $s_t$ at time t; the ConvGRU stores the second hidden state $s_t$ at time t and outputs it to the pose decoder, thereby obtaining the sample predicted camera motion $[\hat{T}_{t \rightarrow t-1}, \hat{T}_{t \rightarrow t+1}]$ corresponding to the sample image frame sequence at time t.
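The corresponding recurrent pose step can be sketched as follows (again PyTorch-style; the channel-wise concatenation of the window frames and a decoder emitting two 6-DoF relative poses are assumptions):

```python
import torch

def predict_pose_step(pose_encoder, conv_gru, pose_decoder, window, s_prev):
    """One recurrent camera-motion prediction step over a sliding window.

    window: frames (I_t, I_{t-1}, I_{t+1}), each of shape (B, 3, H, W)
    s_prev: second hidden state information at time t-1 (None at the start)
    Returns the relative poses [T_{t->t-1}, T_{t->t+1}] and the new state.
    """
    seq = torch.cat(window, dim=1)  # sample image frame sequence as one tensor
    v_t = pose_encoder(seq)         # second feature map, related to camera motion
    s_t = conv_gru(v_t, s_prev)     # fuse with the hidden state from time t-1
    poses = pose_decoder(s_t)       # sample predicted camera motion
    return poses, s_t
```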

Still taking the above Fig. 3 as an example, a loss function $L$ can be constructed from the sample predicted depth map $\hat{D}_t$ and the sample predicted camera motion $[\hat{T}_{t \rightarrow t-1}, \hat{T}_{t \rightarrow t+1}]$. Specifically, the reprojection error term $L_{re}$ of the adjacent sample image frames $I_{t-1}$ and $I_{t+1}$ in the sample image frame sequence relative to the first sample image frame $I_t$ is determined according to the sample predicted camera motion $[\hat{T}_{t \rightarrow t-1}, \hat{T}_{t \rightarrow t+1}]$, and the penalty function term $L_s$ is determined according to the distribution continuity of the sample predicted depth map $\hat{D}_t$. The loss function $L$ is then constructed by the following formula (1):

$L = L_{re} + \lambda L_s$ (1)

where $\lambda$ is a weight coefficient; the value of $\lambda$ may be determined according to the actual situation, which is not specifically limited in the present invention.
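As a hedged illustration, formula (1) and the reprojection term can be written as follows; the patent does not specify the photometric metric, so the per-pixel L1 error is an assumption, and `warped_neighbors` is taken to be the neighbor frames already warped into the view of $I_t$ using the predicted depth and relative poses:

```python
import torch

def reprojection_error(frame_t, warped_neighbors):
    """L_re: photometric error of the warped neighbor frames against I_t.
    The view-synthesis warp itself (using predicted depth and relative
    pose) is assumed to have been applied already."""
    return sum(torch.abs(frame_t - w).mean() for w in warped_neighbors)

def total_loss(l_re, l_s, lam):
    """Formula (1): L = L_re + lambda * L_s, with weight coefficient lambda."""
    return l_re + lam * l_s
```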

In some embodiments, the specific process of determining the penalty function term $L_s$ according to the distribution continuity of the sample predicted depth map $\hat{D}_t$ is as follows. The gradient value of each pixel in the first sample image frame $I_t$ is determined; the gradient values of the pixels can reflect the distribution continuity (also called smoothness) of the first sample image frame $I_t$. Therefore, according to the gradient value of each pixel, the edge regions (regions formed by pixels whose gradient values are greater than or equal to a threshold) and non-edge regions (regions formed by pixels whose gradient values are less than the threshold) in the first sample image frame $I_t$ can be determined, and in turn the edge regions and non-edge regions in the sample predicted depth map $\hat{D}_t$ corresponding to $I_t$ can be determined. The gradient value of each pixel in the sample predicted depth map $\hat{D}_t$ is also determined. To ensure distribution continuity in the non-edge regions of $\hat{D}_t$ and distribution discontinuity in its edge regions, a penalty factor proportional to the gradient value is set for each pixel in the non-edge regions of $\hat{D}_t$, and a penalty factor inversely proportional to the gradient value is set for each pixel in the edge regions of $\hat{D}_t$; the penalty function term $L_s$ is then constructed based on the penalty factors of the pixels in $\hat{D}_t$.
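One common realization of such an edge-aware penalty is sketched below; the exponential weighting is an assumption borrowed from standard practice, since the patent only states that the penalty factor is proportional to the depth gradient in non-edge regions and inversely proportional in edge regions:

```python
import torch

def edge_aware_smoothness(depth, image):
    """L_s: penalize depth gradients strongly where the image is smooth
    (non-edge regions) and weakly across image edges."""
    dx_d = torch.abs(depth[:, :, :, 1:] - depth[:, :, :, :-1])
    dy_d = torch.abs(depth[:, :, 1:, :] - depth[:, :, :-1, :])
    dx_i = torch.abs(image[:, :, :, 1:] - image[:, :, :, :-1]).mean(1, keepdim=True)
    dy_i = torch.abs(image[:, :, 1:, :] - image[:, :, :-1, :]).mean(1, keepdim=True)
    # weight decays with the image gradient: strong penalty in smooth regions,
    # weak penalty across edges
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```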

Since the sample predicted depth map and the sample predicted camera motion are obtained by exploiting the temporal correlation of scene depth and camera motion between adjacent moments, the scene depth prediction network to be trained is trained with a loss function that combines the reprojection error term determined from the predicted camera motion output by the camera motion prediction network with the penalty function term determined from the predicted depth map output by the scene depth prediction network; the scene depth prediction network obtained by this training can improve the prediction accuracy of scene depth prediction.

In some embodiments, the camera motion prediction network in Fig. 3 may be a camera motion prediction network to be trained. According to the above loss function, the camera motion prediction network to be trained can also be trained, so as to realize joint training of the scene depth prediction network to be trained and the camera motion prediction network to be trained, obtaining a trained scene depth prediction network and a trained camera motion prediction network.

Since the predicted depth map and the predicted camera motion are obtained by exploiting the temporal correlation of scene depth and camera motion between adjacent moments, the scene depth prediction network and the camera motion prediction network are jointly trained with a loss function that combines the reprojection error term determined from the predicted camera motion output by the camera motion prediction network with the penalty function term determined from the predicted depth map output by the scene depth prediction network; the scene depth prediction network and camera motion prediction network obtained by this training can improve the prediction accuracy of both scene depth prediction and camera motion prediction.
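Putting the pieces together, one sliding-window step of this joint training might look as follows. This is a sketch under the same assumptions as the earlier snippets: `depth_net` and `pose_net` are single-scale recurrent modules wrapping the encoder/ConvGRU/decoder stages, `warp_neighbors` is a hypothetical view-synthesis helper, and detaching the hidden states between windows is an assumption not stated in the patent:

```python
def joint_training_step(depth_net, pose_net, optimizer, window, h_prev, s_prev, lam):
    frame_prev, frame_t, frame_next = window        # I_{t-1}, I_t, I_{t+1}
    depth_t, h_t = depth_net(frame_t, h_prev)       # sample predicted depth map
    poses, s_t = pose_net((frame_t, frame_prev, frame_next), s_prev)
    # hypothetical helper: warps I_{t-1} and I_{t+1} into the view of I_t
    warped = warp_neighbors(frame_prev, frame_next, depth_t, poses)
    loss = total_loss(reprojection_error(frame_t, warped),
                      edge_aware_smoothness(depth_t, frame_t), lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), h_t.detach(), s_t.detach()  # states carry to the next window
```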

In some embodiments, the depth encoder and the pose encoder may reuse the ResNet18 structure, may reuse the ResNet54 structure, or may reuse other structures; the present invention does not specifically limit this. The depth decoder and the pose decoder may adopt the Unet network structure, or may adopt other decoder network structures; the present invention does not specifically limit this.

In some embodiments, the ConvGRU includes convolution operations, and the activation function in the ConvGRU is the ELU activation function.

For example, a gated recurrent unit (GRU), which can only process one-dimensional data, may be improved by replacing its linear operations with convolution operations and replacing its tanh activation function with an ELU activation function, thereby obtaining a ConvGRU that can process two-dimensional image data.
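A minimal sketch of such a ConvGRU cell is given below; the channel counts, kernel size, and gate layout follow the standard GRU equations and are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGRUCell(nn.Module):
    """Sketch of a ConvGRU cell: the linear (fully connected) operations
    of a standard GRU are replaced with 2-D convolutions, and the tanh
    activation of the candidate state is replaced with ELU."""

    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        # Reset and update gates computed jointly from [input, hidden].
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, kernel, padding=pad)
        # Candidate hidden state.
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, kernel, padding=pad)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # x: (B, in_ch, H, W) feature map; h: (B, hid_ch, H, W) hidden state.
        r, z = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        n = F.elu(self.cand(torch.cat([x, r * h], 1)))  # ELU instead of tanh
        return (1 - z) * n + z * h  # fused hidden state for the current time
```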

By exploiting the temporal correlation of scene depth and/or camera motion, the ConvGRU can recurrently convolve the image frame sequences corresponding to different moments in temporal order, thereby obtaining the first hidden state and/or the second hidden state corresponding to each moment.

To realize the sliding-window data fusion mechanism, in addition to the above ConvGRU, a convolutional long short-term memory unit (Convolutional Long Short-Term Memory, ConvLSTM) may be used, and other structures capable of sliding-window data fusion may also be used, which is not specifically limited in the present invention.

Fig. 4 shows a flowchart of a camera motion prediction method according to an embodiment of the present invention. The camera motion prediction method shown in Fig. 4 may be executed by a terminal device or other processing device, where the terminal device may be user equipment (User Equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a wireless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like. The other processing device may be a server, a cloud server, or the like. In some possible implementations, the camera motion prediction method may be implemented by a processor calling computer-readable instructions stored in a memory. As shown in Fig. 4, the method may include the following steps.
In step S41, an image frame sequence corresponding to time t is acquired, where the image frame sequence includes the target image frame at time t and adjacent image frames of the target image frame.
In step S42, camera pose prediction is performed on the image frame sequence through the camera motion prediction network using the second hidden state information at time t-1, and the predicted camera motion corresponding to the image frame sequence is determined, where the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained based on auxiliary training with the scene depth prediction network.

In the embodiment of the present invention, an image frame sequence including the target image frame at time t and adjacent image frames of the target image frame is acquired. Since camera motion between adjacent moments is temporally correlated, performing camera pose prediction on the image frame sequence through the camera motion prediction network, using the second hidden state information related to camera motion at time t-1, can yield a predicted camera motion of high prediction accuracy for the image frame sequence.

In some embodiments, performing camera pose prediction on the image frame sequence through the camera motion prediction network using the second hidden state information at time t-1 and determining the predicted camera motion corresponding to the image frame sequence may include: performing feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, where the second feature map is a feature map related to camera motion; determining the second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and determining the predicted camera motion according to the second hidden state information at time t.

Since camera motion between adjacent moments is temporally correlated, the camera motion prediction network uses the second feature map related to camera motion corresponding to the image frame sequence at time t, together with the second hidden state information related to camera motion at time t-1, to determine the second hidden state information related to camera motion at time t. Camera motion prediction is then performed on the image frame sequence at time t based on this second hidden state information, yielding a predicted camera motion of high prediction accuracy for the image frame sequence at time t.

In some embodiments, the predicted camera motion may include the relative poses between adjacent image frames in the image frame sequence, where each relative pose is a six-dimensional parameter that includes three-dimensional rotation information and three-dimensional translation information.

For example, the predicted camera motion includes the relative pose between the adjacent image frame It-1 and the target image frame It, and the relative pose between the target image frame It and the adjacent image frame It+1.

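As an illustration of this six-dimensional parameterization, the sketch below converts a relative pose vector into a 4x4 rigid transformation; the axis-angle (Rodrigues) encoding of the rotation is an assumption of this sketch, since the patent only states that the pose contains three rotation and three translation components.

```python
import numpy as np

def pose_vec_to_mat(pose: np.ndarray) -> np.ndarray:
    """Convert a 6-D relative pose [rx, ry, rz, tx, ty, tz] into a
    4x4 homogeneous transformation matrix (axis-angle rotation assumed)."""
    r, t = pose[:3], pose[3:]
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        R = np.eye(3)  # negligible rotation
    else:
        k = r / theta  # unit rotation axis
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        # Rodrigues' rotation formula.
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```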
Taking the above Fig. 3 as an example, the camera motion prediction network includes a pose encoder, a ConvGRU, and a pose decoder. The image frame sequence [It, It-1, It+1] corresponding to time t is input into the pose encoder 302 for feature extraction to obtain the second feature map corresponding to the image frame sequence. The second feature map is then input into the ConvGRU, where it is fused with the second hidden state information at time t-1 stored in the ConvGRU to obtain the second hidden state at time t. The ConvGRU stores the second hidden state at time t and outputs it to the pose decoder, which produces the predicted camera motion corresponding to the image frame sequence at time t, namely the relative poses between It-1 and It and between It and It+1.

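A sketch of this data flow is shown below; `encoder`, `gru`, and `decoder` are stand-ins for the pose encoder 302, the ConvGRU, and the pose decoder, whose internal configurations the patent leaves open.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Illustrative wiring of pose encoder -> ConvGRU -> pose decoder."""

    def __init__(self, encoder: nn.Module, gru: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.gru, self.decoder = encoder, gru, decoder
        self.h = None  # stored second hidden state

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: concatenated image frame sequence [I_t, I_t-1, I_t+1].
        feat = self.encoder(frames)          # second feature map at time t
        if self.h is None:                   # preset initial value at start-up
            self.h = torch.zeros_like(feat)  # assumes hidden and feature channels match
        self.h = self.gru(feat, self.h)      # fuse with hidden state from t-1
        return self.decoder(self.h)          # two 6-D relative poses
```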
For example, when the camera motion prediction network is used to predict the camera motion corresponding to an image frame sequence, a preset initial value of the second hidden state information related to camera motion is set in the initialization stage of the network. Based on this preset initial value and the second feature map related to camera motion corresponding to the image frame sequence at time 1, the second hidden state at time 1 is determined, and camera motion prediction is performed on the image frame sequence at time 1 based on it, yielding the predicted camera motion corresponding to that sequence. Based on the second hidden state at time 1 and the second feature map related to camera motion corresponding to the image frame sequence at time 2, the second hidden state at time 2 is determined, and the predicted camera motion corresponding to the image frame sequence at time 2 is obtained. Based on the second hidden state at time 2 and the second feature map corresponding to the image frame sequence at time 3, the second hidden state at time 3 is determined, and the predicted camera motion corresponding to the image frame sequence at time 3 is obtained; and so on, until the predicted camera motions corresponding to the image frame sequences at all moments are obtained.
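Continuing the `PoseNet` sketch above, this recurrence over a video could look as follows, with `encoder`, `gru`, `decoder`, and `video_sequences` assumed to be defined elsewhere:

```python
# The hidden state produced at each step is stored inside the network
# and reused at the next step, so each prediction exploits the temporal
# correlation of camera motion between adjacent moments.
posenet = PoseNet(encoder, gru, decoder)
for frames_t in video_sequences:   # image frame sequences for t = 1, 2, 3, ...
    poses_t = posenet(frames_t)    # predicted relative poses at time t
```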

In some embodiments, the camera motion prediction method may further include: acquiring a sample image frame sequence corresponding to time t, where the sample image frame sequence includes the first sample image frame at time t and adjacent sample image frames of the first sample image frame; performing scene depth prediction on the first sample image frame through the scene depth prediction network using the first hidden state information at time t-1, and determining the sample predicted depth map corresponding to the first sample image frame, where the first hidden state information includes feature information related to scene depth; performing camera pose prediction on the sample image frame sequence through the camera motion prediction network to be trained using the second hidden state information at time t-1, and determining the sample predicted camera motion corresponding to the sample image frame sequence, where the second hidden state information includes feature information related to camera motion; constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and training the camera motion prediction network to be trained according to the loss function to obtain the camera motion prediction network.

In some embodiments, constructing the loss function according to the sample predicted depth map and the sample predicted camera motion may include: determining, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames relative to the first sample image frame in the sample image frame sequence; determining a penalty function term according to the distribution continuity of the sample predicted depth map; and constructing the loss function according to the reprojection error term and the penalty function term.
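A minimal sketch of such a loss is given below, reusing the `smoothness_penalty` sketch earlier in this document as the penalty function term; the photometric form of the reprojection error and the balancing weight are assumptions of this sketch, not specified by the patent.

```python
import torch

def training_loss(warped: torch.Tensor, target: torch.Tensor,
                  depth: torch.Tensor, weight: float = 1e-3) -> torch.Tensor:
    """warped: adjacent sample frames warped into the first sample frame
    using the sample predicted depth map and sample predicted camera motion;
    target: the first sample image frame; depth: sample predicted depth map."""
    reproj = (warped - target).abs().mean()      # reprojection error term
    penalty = smoothness_penalty(depth, target)  # penalty function term
    return reproj + weight * penalty             # total loss
```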

In the embodiment of the present invention, the camera motion prediction network is obtained based on auxiliary training with the scene depth prediction network, or the scene depth prediction network and the camera motion prediction network are obtained by joint training. In some embodiments, the camera motion prediction network to be trained may be trained based on the above Fig. 3. In this training process, the camera motion prediction network in Fig. 3 is the camera motion prediction network to be trained, and the scene depth prediction network in Fig. 3 may be a scene depth prediction network to be trained (joint training of the scene depth prediction network to be trained and the camera motion prediction network to be trained) or a trained scene depth prediction network (separate training of the camera motion prediction network to be trained). The specific training process is the same as that of Fig. 3 above and is not repeated here.

Since the predicted depth map and the predicted camera motion are obtained by exploiting the temporal correlation of scene depth and camera motion between adjacent moments, the scene depth prediction network and the camera motion prediction network are jointly trained with a loss function that combines the reprojection error term determined from the predicted camera motion output by the camera motion prediction network with the penalty function term determined from the predicted depth map output by the scene depth prediction network. The networks obtained by this joint training can improve the prediction accuracy of both scene depth prediction and camera motion prediction.

In the embodiment of the present invention, the scene depth prediction network and camera motion prediction network trained by the network training method shown in Fig. 3 above can perform depth prediction of the environment and three-dimensional scene construction. For example, when the scene depth prediction network is applied to indoor and outdoor mobile robot navigation scenes such as sweeping robots and lawn mowers, an RGB image is obtained through a red-green-blue (Red Green Blue, RGB) camera; the scene depth prediction network is then used to determine the predicted depth map corresponding to the RGB image, and the camera motion prediction network is used to determine the camera motion of the RGB camera, thereby realizing distance measurement of obstacles and three-dimensional scene construction to complete obstacle avoidance and navigation tasks.
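For illustration, the distance measurement step might look like the following back-projection of the predicted depth map into a point cloud; the intrinsic matrix `K` is assumed to come from camera calibration, which the patent does not detail.

```python
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) predicted depth map into camera-frame 3-D
    points of shape (H, W, 3); point norms give obstacle distances."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T   # normalized camera rays per pixel
    return rays * depth[..., None]    # scale each ray by its predicted depth
```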

It can be understood that the foregoing method embodiments mentioned in the present invention can be combined with each other to form combined embodiments without departing from principle and logic; due to space limitations, details are not repeated in the present invention. Those skilled in the art can understand that, in the above methods of the specific implementations, the specific execution order of the steps should be determined by their functions and possible internal logic.

In addition, the present invention also provides a scene depth/camera motion prediction apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any scene depth/camera motion prediction method provided by the present invention. For the corresponding technical solutions and descriptions, refer to the corresponding records in the method section, which are not repeated here.

Fig. 5 shows a block diagram of a scene depth prediction apparatus according to an embodiment of the present invention. As shown in Fig. 5, the scene depth prediction apparatus 50 includes:
a first acquisition module 51, configured to acquire the target image frame at time t; and
a first scene depth prediction module 52, configured to perform scene depth prediction on the target image frame through the scene depth prediction network using the first hidden state information at time t-1, and determine the predicted depth map corresponding to the target image frame, where the first hidden state information includes feature information related to scene depth, and the scene depth prediction network is obtained based on auxiliary training with the camera motion prediction network.

In some embodiments, the first scene depth prediction module 52 includes:
a first determining sub-module, configured to perform feature extraction on the target image frame and determine a first feature map corresponding to the target image frame, where the first feature map is a feature map related to scene depth;
a second determining sub-module, configured to determine the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and
a third determining sub-module, configured to determine the predicted depth map according to the first hidden state information at time t.

In some embodiments, the first hidden state information at time t-1 includes first hidden state information at different scales at time t-1;
the first determining sub-module is specifically configured to perform multi-scale down-sampling on the target image frame and determine the first feature maps at different scales corresponding to the target image frame;
the second determining sub-module is specifically configured to, for any scale, determine the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1; and
the third determining sub-module is specifically configured to perform feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map, as sketched below.
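The following sketch illustrates this multi-scale arrangement, building on the `ConvGRUCell` sketch earlier in this document; the number of scales, the per-scale cells, and the 1x1 fusion convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDepthHead(nn.Module):
    """Keeps one first hidden state per scale and fuses all scales
    into a single predicted depth map (sketch)."""

    def __init__(self, cells: nn.ModuleList, hid_ch: int):
        super().__init__()
        self.cells = cells                                # one ConvGRU cell per scale
        self.fuse = nn.Conv2d(len(cells) * hid_ch, 1, 1)  # feature fusion -> depth

    def forward(self, feats, hiddens):
        # feats / hiddens: lists of per-scale tensors, finest scale first.
        new_h = [cell(f, h) for cell, f, h in zip(self.cells, feats, hiddens)]
        size = new_h[0].shape[-2:]  # finest resolution
        up = [F.interpolate(h, size=size, mode="bilinear", align_corners=False)
              for h in new_h]
        depth = torch.sigmoid(self.fuse(torch.cat(up, 1)))  # normalized depth map
        return depth, new_h  # hidden states are kept for the next moment
```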

In some embodiments, the scene depth prediction apparatus 50 further includes a first training module, configured to:
acquire a sample image frame sequence corresponding to time t, where the sample image frame sequence includes the first sample image frame at time t and adjacent sample image frames of the first sample image frame;
perform camera pose prediction on the sample image frame sequence through the camera motion prediction network using the second hidden state information at time t-1, and determine the sample predicted camera motion corresponding to the sample image frame sequence, where the second hidden state information includes feature information related to camera motion;
perform scene depth prediction on the first sample image frame through the scene depth prediction network to be trained using the first hidden state information at time t-1, and determine the sample predicted depth map corresponding to the first sample image frame, where the first hidden state information includes feature information related to scene depth;
construct a loss function according to the sample predicted depth map and the sample predicted camera motion; and
train the scene depth prediction network to be trained according to the loss function to obtain the scene depth prediction network.

In some embodiments, the first training module is specifically configured to: determine, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames relative to the first sample image frame in the sample image frame sequence; determine a penalty function term according to the distribution continuity of the sample predicted depth map; and construct the loss function according to the reprojection error term and the penalty function term.

Fig. 6 shows a block diagram of a camera motion prediction apparatus according to an embodiment of the present invention. As shown in Fig. 6, the camera motion prediction apparatus 60 includes:
a second acquisition module 61, configured to acquire an image frame sequence corresponding to time t, where the image frame sequence includes the target image frame at time t and adjacent image frames of the target image frame; and
a first camera motion prediction module 62, configured to perform camera pose prediction on the image frame sequence through the camera motion prediction network using the second hidden state information at time t-1, and determine the predicted camera motion corresponding to the image frame sequence, where the second hidden state information includes feature information related to camera motion, and the camera motion prediction network is obtained based on auxiliary training with the scene depth prediction network.

In some embodiments, the first camera motion prediction module 62 includes:
a sixth determining sub-module, configured to perform feature extraction on the image frame sequence and determine a second feature map corresponding to the image frame sequence, where the second feature map is a feature map related to camera motion;
a seventh determining sub-module, configured to determine the second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and
an eighth determining sub-module, configured to determine the predicted camera motion according to the second hidden state information at time t.

In some embodiments, the predicted camera motion includes the relative poses between adjacent image frames in the image frame sequence.

In some embodiments, the camera motion prediction apparatus 60 further includes a second training module, configured to:
acquire a sample image frame sequence corresponding to time t, where the sample image frame sequence includes the first sample image frame at time t and adjacent sample image frames of the first sample image frame;
perform scene depth prediction on the first sample image frame through the scene depth prediction network using the first hidden state information at time t-1, and determine the sample predicted depth map corresponding to the first sample image frame, where the first hidden state information includes feature information related to scene depth;
perform camera pose prediction on the sample image frame sequence through the camera motion prediction network to be trained using the second hidden state information at time t-1, and determine the sample predicted camera motion corresponding to the sample image frame sequence, where the second hidden state information includes feature information related to camera motion;
construct a loss function according to the sample predicted depth map and the sample predicted camera motion; and
train the camera motion prediction network to be trained according to the loss function to obtain the camera motion prediction network.

In some embodiments, the second training module is specifically configured to: determine, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames relative to the first sample image frame in the sample image frame sequence; determine a penalty function term according to the distribution continuity of the sample predicted depth map; and construct the loss function according to the reprojection error term and the penalty function term.

In some embodiments, the functions or modules of the apparatus provided in the embodiments of the present invention can be used to execute the methods described in the above method embodiments. For their specific implementation, refer to the descriptions of the above method embodiments; for brevity, details are not repeated here.

An embodiment of the present invention further provides a computer-readable storage medium on which computer program instructions are stored, where the computer program instructions implement the above methods when executed by a processor. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.

An embodiment of the present invention further provides an electronic device, including: a processor; and a memory for storing processor-executable instructions, where the processor is configured to call the instructions stored in the memory to execute any one of the above scene depth prediction methods or any one of the above camera motion prediction methods.

An embodiment of the present invention further provides a computer program product, including computer-readable code. When the computer-readable code runs on a device, a processor in the device executes instructions for implementing the scene depth and/or camera motion prediction method provided by any of the above embodiments.

An embodiment of the present invention further provides another computer program product for storing computer-readable instructions, which, when executed, cause a computer to perform the operations of the scene depth and/or camera motion prediction method provided by any of the above embodiments.

The electronic device may be provided as a terminal, a server, or a device in another form.

Fig. 7 shows a block diagram of an electronic device 800 according to an embodiment of the present invention. As shown in Fig. 7, the electronic device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or another terminal.

Referring to Fig. 7, the electronic device 800 may include one or more of the following components: a first processing component 802, a first memory 804, a first power supply component 806, a multimedia component 808, an audio component 810, a first input/output (Input Output, I/O) interface 812, a sensor component 814, and a communication component 816.

The first processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The first processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above methods. In addition, the first processing component 802 may include one or more modules to facilitate interaction between the first processing component 802 and other components. For example, the first processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the first processing component 802.

The first memory 804 is configured to store various types of data to support the operation of the electronic device 800. Examples of such data include instructions for any application or method operated on the electronic device 800, contact data, phone book data, messages, pictures, videos, and the like. The first memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (Static Random-Access Memory, SRAM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read Only Memory, EEPROM), erasable programmable read-only memory (Electrical Programmable Read Only Memory, EPROM), programmable read-only memory (Programmable Read-Only Memory, PROM), read-only memory (Read-Only Memory, ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

The first power supply component 806 provides power for the various components of the electronic device 800. The first power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (Liquid Crystal Display, LCD) and a touch panel (Touch Pad, TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure related to the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC). When the electronic device 800 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the first memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.

The first input/output interface 812 provides an interface between the first processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.

The sensor component 814 includes one or more sensors for providing state evaluations of various aspects for the electronic device 800. For example, the sensor component 814 can detect the on/off state of the electronic device 800 and the relative positioning of components, for example, the display and keypad of the electronic device 800. The sensor component 814 can also detect a position change of the electronic device 800 or a component of the electronic device 800, the presence or absence of contact between the user and the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a complementary metal oxide semiconductor (Complementary Metal Oxide Semiconductor, CMOS) or charge-coupled device (Charge Coupled Device, CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (Near Field Communication, NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (Radio Frequency Identification, RFID) technology, Infrared Data Association (Infrared Data Association, IrDA) technology, ultra-wideband (Ultra Wide Band, UWB) technology, Bluetooth (Bluetooth, BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (Digital Signal Process, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field Programmable Gate Array, FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for executing any one of the above scene depth prediction methods or any one of the above camera motion prediction methods.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example, the first memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to complete any one of the above scene depth prediction methods or any one of the above camera motion prediction methods.

Fig. 8 shows a block diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 8, the electronic device 900 may be provided as a server. Referring to Fig. 8, the electronic device 900 includes a second processing component 922, which further includes one or more processors, and memory resources represented by a second memory 932 for storing instructions executable by the second processing component 922, such as applications. The applications stored in the second memory 932 may include one or more modules each corresponding to a set of instructions. In addition, the second processing component 922 is configured to execute instructions to execute any one of the above scene depth prediction methods or any one of the above camera motion prediction methods.

The electronic device 900 may also include a second power supply component 926 configured to perform power management of the electronic device 900, a wired or wireless network interface 950 configured to connect the electronic device 900 to a network, and a second input/output (I/O) interface 958. The electronic device 900 can operate based on an operating system stored in the second memory 932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example, the second memory 932 including computer program instructions, which can be executed by the second processing component 922 of the electronic device 900 to complete any one of the above scene depth prediction methods or any one of the above camera motion prediction methods.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for causing a processor to implement various aspects of the present invention.

The computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, random-access memory (Random-Access Memory, RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random-access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or an in-groove raised structure on which instructions are stored, and any suitable combination of the foregoing. The computer-readable storage medium used here is not to be interpreted as a transient signal itself, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.

The computer-readable program instructions described here can be downloaded from the computer-readable storage medium to each computing/processing device, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network interface card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in the computer-readable storage medium in each computing/processing device.

The computer program instructions used to perform the operations of the present invention may be assembly instructions, instruction set architecture (Instruction Set Architecture, ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (Local Area Network, LAN) or a wide area network (Wide Area Network, WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (Programmable Logic Array, PLA), is personalized by using the state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present invention.

Various aspects of the present invention are described here with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present invention. It should be understood that each block of the flowcharts and/or block diagrams and combinations of blocks in the flowcharts and/or block diagrams can be implemented by computer-readable program instructions.

These computer-readable program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, thereby producing a machine, so that when these instructions are executed by the processor of the computer or other programmable data processing apparatus, an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams is produced. These computer-readable program instructions may also be stored in a computer-readable storage medium, and these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operation steps is executed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or part of an instruction, and the module, program segment, or part of an instruction contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions marked in the blocks may also occur in an order different from that marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or can be implemented by a combination of dedicated hardware and computer instructions.

The computer program product can be specifically implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is specifically embodied as a computer storage medium; in another optional embodiment, the computer program product is specifically embodied as a software product, such as a software development kit (Software Development Kit, SDK), and the like.

The embodiments of the present invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and changes are obvious to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The choice of terms used herein is intended to best explain the principles, practical applications, or improvements to technologies in the market of the embodiments, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Industrial applicability
The embodiments of the present invention provide a scene depth and camera motion prediction method, an electronic device, and a computer-readable storage medium. The method includes: acquiring a target image frame at time t; and performing scene depth prediction on the target image frame through a scene depth prediction network using first hidden state information at time t-1, and determining a predicted depth map corresponding to the target image frame, where the first hidden state information includes feature information related to scene depth, and the scene depth prediction network is obtained based on auxiliary training with a camera motion prediction network. The embodiments of the present invention can obtain a predicted depth map of high prediction accuracy corresponding to the target image frame.

201: target image frame
202: depth encoder
203: first feature maps at different scales
204: multi-scale hidden states
205: depth decoder
301: sample image frame sequence
302: pose encoder
303: pose decoder
50: scene depth prediction apparatus
51: first acquisition module
52: first scene depth prediction module
60: camera motion prediction apparatus
61: second acquisition module
62: first camera motion prediction module
800: electronic device
802: first processing component
804: first memory
806: first power supply component
808: multimedia component
810: audio component
812: first input/output interface
814: sensor component
816: communication component
820: processor
1900: electronic device
1922: second processing component
1926: second power supply component
1932: second memory
1950: network interface
1958: second input/output interface
S11, S12: steps
S41, S42: steps

The accompanying drawings here are incorporated into and constitute a part of this specification. These drawings show embodiments in accordance with the present invention and are used together with the specification to illustrate the technical solutions of the present invention.
Fig. 1 is a flowchart of a scene depth prediction method according to an embodiment of the present invention;
Fig. 2 is a block diagram of a scene depth prediction network according to an embodiment of the present invention;
Fig. 3 is a block diagram of unsupervised network training according to an embodiment of the present invention;
Fig. 4 is a flowchart of a camera motion prediction method according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a scene depth prediction apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a camera motion prediction apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

S11, S12: steps

Claims (11)

A scene depth prediction method, comprising:
acquiring a target image frame at time t; and
performing scene depth prediction on the target image frame through a scene depth prediction network using first hidden state information at time t-1, and determining a predicted depth map corresponding to the target image frame, wherein the first hidden state information comprises feature information related to scene depth, and the scene depth prediction network is obtained based on auxiliary training with a camera motion prediction network.
The method according to claim 1, wherein the performing scene depth prediction on the target image frame through the scene depth prediction network using the first hidden state information at time t-1 and determining the predicted depth map corresponding to the target image frame comprises:
performing feature extraction on the target image frame, and determining a first feature map corresponding to the target image frame, wherein the first feature map is a feature map related to scene depth;
determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1; and
determining the predicted depth map according to the first hidden state information at time t.
The method according to claim 2, wherein the first hidden state information at time t-1 comprises the first hidden state information at different scales at time t-1;
the performing feature extraction on the target image frame and determining the first feature map corresponding to the target image frame comprises: performing multi-scale down-sampling on the target image frame, and determining the first feature maps at different scales corresponding to the target image frame;
the determining the first hidden state information at time t according to the first feature map and the first hidden state information at time t-1 comprises: for any scale, determining the first hidden state information at that scale at time t according to the first feature map at that scale and the first hidden state information at that scale at time t-1; and
the determining the predicted depth map according to the first hidden state information at time t comprises: performing feature fusion on the first hidden state information at the different scales at time t to determine the predicted depth map.
4. The method according to any one of claims 1 to 3, further comprising:
obtaining a sample image frame sequence corresponding to time t, the sample image frame sequence comprising a first sample image frame at time t and adjacent sample image frames of the first sample image frame;
performing camera pose prediction on the sample image frame sequence through a camera motion prediction network using second hidden state information at time t-1, and determining a sample predicted camera motion corresponding to the sample image frame sequence, the second hidden state information comprising feature information related to camera motion;
performing scene depth prediction on the first sample image frame through the scene depth prediction network to be trained using the first hidden state information at time t-1, and determining a sample predicted depth map corresponding to the first sample image frame, the first hidden state information comprising feature information related to scene depth;
constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and
training the scene depth prediction network to be trained according to the loss function, to obtain the scene depth prediction network.

5. The method according to claim 4, wherein constructing the loss function according to the sample predicted depth map and the sample predicted camera motion comprises:
determining, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames in the sample image frame sequence relative to the first sample image frame;
determining a penalty function term according to the distribution continuity of the sample predicted depth map; and
constructing the loss function according to the reprojection error term and the penalty function term.

6. A camera motion prediction method, comprising:
obtaining an image frame sequence corresponding to time t, the image frame sequence comprising a target image frame at time t and adjacent image frames of the target image frame; and
performing camera pose prediction on the image frame sequence through a camera motion prediction network using second hidden state information at time t-1, and determining a predicted camera motion corresponding to the image frame sequence,
wherein the second hidden state information comprises feature information related to camera motion, and the camera motion prediction network is obtained through training assisted by a scene depth prediction network.
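Claims 4-5 (mirrored by claims 8-9 below) leave the exact form of the loss open beyond a reprojection error term and a continuity penalty term. A common self-supervised formulation consistent with that description warps an adjacent frame into the reference view using the predicted depth and camera motion, takes the photometric error as the reprojection term, and uses an edge-aware first-order depth smoothness term as the penalty. The sketch below assumes pinhole intrinsics K, a 4x4 relative pose matrix, and an L1 photometric error; all of these are illustrative assumptions rather than the patent's definition:

```python
import torch
import torch.nn.functional as F

def reproject(src_img, depth, pose, K):
    """Warp src_img into the reference view using predicted depth and relative pose.

    depth: (B,1,H,W) reference-view depth; pose: (B,4,4) reference->source transform;
    K: (B,3,3) pinhole intrinsics. All names here are illustrative assumptions.
    """
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()      # (3,H,W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1).to(depth.device)          # (B,3,HW)
    cam = (torch.linalg.inv(K) @ pix) * depth.view(B, 1, -1)             # back-project
    cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)         # homogeneous coords
    src = K @ (pose @ cam_h)[:, :3]                                      # project into source
    uv = src[:, :2] / src[:, 2:3].clamp(min=1e-6)
    u = 2 * uv[:, 0] / (W - 1) - 1                                       # normalize to [-1,1]
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def training_loss(ref_img, src_img, depth, pose, K, alpha=0.1):
    """Reprojection error term + continuity penalty term (sketch of claims 4-5)."""
    warped = reproject(src_img, depth, pose, K)
    reproj = (warped - ref_img).abs().mean()                 # photometric reprojection error
    # penalty on depth distribution continuity: edge-aware first-order smoothness
    dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    ix = (ref_img[..., :, 1:] - ref_img[..., :, :-1]).abs().mean(1, keepdim=True)
    iy = (ref_img[..., 1:, :] - ref_img[..., :-1, :]).abs().mean(1, keepdim=True)
    smooth = (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
    return reproj + alpha * smooth
```

Down-weighting the smoothness term at image edges (the exp(-ix), exp(-iy) factors) is a standard way to penalize discontinuities in the predicted depth while still allowing sharp depth changes where the image itself has strong gradients.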
7. The method according to claim 6, wherein performing camera pose prediction on the image frame sequence through the camera motion prediction network using the second hidden state information at time t-1 and determining the predicted camera motion corresponding to the image frame sequence comprises:
performing feature extraction on the image frame sequence to determine a second feature map corresponding to the image frame sequence, the second feature map being a feature map related to camera motion;
determining the second hidden state information at time t according to the second feature map and the second hidden state information at time t-1; and
determining the predicted camera motion according to the second hidden state information at time t.

8. The method according to claim 6 or 7, further comprising:
obtaining a sample image frame sequence corresponding to time t, the sample image frame sequence comprising a first sample image frame at time t and adjacent sample image frames of the first sample image frame;
performing scene depth prediction on the first sample image frame through a scene depth prediction network using first hidden state information at time t-1, and determining a sample predicted depth map corresponding to the first sample image frame, the first hidden state information comprising feature information related to scene depth;
performing camera pose prediction on the sample image frame sequence through the camera motion prediction network to be trained using the second hidden state information at time t-1, and determining a sample predicted camera motion corresponding to the sample image frame sequence, the second hidden state information comprising feature information related to camera motion;
constructing a loss function according to the sample predicted depth map and the sample predicted camera motion; and
training the camera motion prediction network to be trained according to the loss function, to obtain the camera motion prediction network.

9. The method according to claim 8, wherein constructing the loss function according to the sample predicted depth map and the sample predicted camera motion comprises:
determining, according to the sample predicted camera motion, a reprojection error term of the adjacent sample image frames in the sample image frame sequence relative to the first sample image frame;
determining a penalty function term according to the distribution continuity of the sample predicted depth map; and
constructing the loss function according to the reprojection error term and the penalty function term.
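As with the depth network, one way to realize the recurrent pose prediction of claims 6-7 is to extract a motion-related feature from a stacked frame pair, fuse it with the previous hidden state in a GRU, and regress a 6-DoF relative pose. The sketch below is an illustrative assumption (the layer sizes, the 6-channel stacked input, and the 0.01 output scaling are example choices, not the patent's):

```python
import torch
import torch.nn as nn

class RecurrentPoseNet(nn.Module):
    """Recurrent camera motion predictor (illustrative sketch of claims 6-7).

    Takes a target frame stacked with one adjacent frame (6 input channels) and a
    hidden state from time t-1; regresses a 6-DoF relative pose (3 rotation + 3
    translation parameters).
    """
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                      # second (motion-related) feature map
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.gru = nn.GRUCell(128, hidden_dim)             # fuses time t with state of t-1
        self.head = nn.Linear(hidden_dim, 6)               # 6-DoF camera motion

    def forward(self, frame_pair, h_prev):                 # frame_pair: (B,6,H,W)
        feat = self.encoder(frame_pair).mean(dim=(2, 3))   # global average pool -> (B,128)
        h = self.gru(feat, h_prev)                         # second hidden state at time t
        pose = 0.01 * self.head(h)                         # small initial motions aid stability
        return pose, h

# usage: carry the hidden state across the image frame sequence
net = RecurrentPoseNet()
h = torch.zeros(1, 256)
frames = [torch.randn(1, 3, 128, 416) for _ in range(3)]
for prev, cur in zip(frames[:-1], frames[1:]):
    pose, h = net(torch.cat([cur, prev], dim=1), h)        # (1,6) relative motion t-1 -> t
```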
10. An electronic device, comprising:
a processor; and
a memory configured to store instructions executable by the processor,
wherein the processor is configured to invoke the instructions stored in the memory to perform the method according to any one of claims 1 to 9.

11. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 9.
TW110107767A 2020-04-28 2021-03-04 Scene depth and camera motion prediction method, electronic equipment and computer readable storage medium TWI767596B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010348872.2 2020-04-28
CN202010348872.2A CN111540000B (en) 2020-04-28 2020-04-28 Scene depth and camera motion prediction method and device, electronic device and medium

Publications (2)

Publication Number Publication Date
TW202141428A true TW202141428A (en) 2021-11-01
TWI767596B TWI767596B (en) 2022-06-11

Family

ID=71977213

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110107767A TWI767596B (en) 2020-04-28 2021-03-04 Scene depth and camera motion prediction method, electronic equipment and computer readable storage medium

Country Status (5)

Country Link
JP (1) JP7178514B2 (en)
KR (1) KR102397268B1 (en)
CN (2) CN111540000B (en)
TW (1) TWI767596B (en)
WO (1) WO2021218282A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI823491B (en) * 2022-07-22 2023-11-21 鴻海精密工業股份有限公司 Optimization method of a depth estimation model, device, electronic equipment and storage media

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111540000B (en) * 2020-04-28 2021-11-05 深圳市商汤科技有限公司 Scene depth and camera motion prediction method and device, electronic device and medium
CN112492230B (en) * 2020-11-26 2023-03-24 北京字跳网络技术有限公司 Video processing method and device, readable medium and electronic equipment
CN112767481B (en) * 2021-01-21 2022-08-16 山东大学 High-precision positioning and mapping method based on visual edge features
KR102559936B1 (en) * 2022-01-28 2023-07-27 포티투닷 주식회사 Method and apparatus of estimating depth information using monocular camera
WO2023155043A1 (en) * 2022-02-15 2023-08-24 中国科学院深圳先进技术研究院 Historical information-based scene depth reasoning method and apparatus, and electronic device
CN114612510B (en) * 2022-03-01 2024-03-29 腾讯科技(深圳)有限公司 Image processing method, apparatus, device, storage medium, and computer program product
CN114998403A (en) * 2022-06-13 2022-09-02 北京百度网讯科技有限公司 Depth prediction method, depth prediction device, electronic apparatus, and medium

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3709271B1 (en) * 2016-09-15 2022-11-02 Google LLC Image depth prediction neural networks
CN106780543B (en) * 2017-01-13 2019-06-28 深圳市唯特视科技有限公司 A kind of double frame estimating depths and movement technique based on convolutional neural networks
CN111386550A (en) * 2017-11-15 2020-07-07 谷歌有限责任公司 Unsupervised learning of image depth and ego-motion predictive neural networks
CN112639878A (en) * 2018-09-05 2021-04-09 谷歌有限责任公司 Unsupervised depth prediction neural network
US10860873B2 (en) 2018-09-17 2020-12-08 Honda Motor Co., Ltd. Driver behavior recognition and prediction
CN109978851B (en) * 2019-03-22 2021-01-15 北京航空航天大学 Method for detecting and tracking small and medium moving target in air by using infrared video
CN110060286B (en) * 2019-04-25 2023-05-23 东北大学 Monocular depth estimation method
CN110136185B (en) * 2019-05-23 2022-09-06 中国科学技术大学 Monocular depth estimation method and system
CN110264526B (en) * 2019-06-19 2023-04-07 华东师范大学 Scene depth and camera position and posture solving method based on deep learning
CN110310317A (en) * 2019-06-28 2019-10-08 西北工业大学 A method of the monocular vision scene depth estimation based on deep learning
CN110378250B (en) * 2019-06-28 2021-04-09 深圳先进技术研究院 Training method and device for neural network for scene cognition and terminal equipment
CN110503680B (en) * 2019-08-29 2023-08-18 大连海事大学 Unsupervised convolutional neural network-based monocular scene depth estimation method
CN110942484B (en) * 2019-11-26 2022-07-12 福州大学 Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN111028282A (en) * 2019-11-29 2020-04-17 浙江省北大信息技术高等研究院 Unsupervised pose and depth calculation method and system
CN111540000B (en) * 2020-04-28 2021-11-05 深圳市商汤科技有限公司 Scene depth and camera motion prediction method and device, electronic device and medium


Also Published As

Publication number Publication date
KR20210138788A (en) 2021-11-19
TWI767596B (en) 2022-06-11
KR102397268B1 (en) 2022-05-12
CN111540000B (en) 2021-11-05
JP2022528012A (en) 2022-06-07
CN111540000A (en) 2020-08-14
JP7178514B2 (en) 2022-11-25
WO2021218282A1 (en) 2021-11-04
CN113822918A (en) 2021-12-21

Similar Documents

Publication Publication Date Title
WO2021218282A1 (en) Scene depth prediction method and apparatus, camera motion prediction method and apparatus, device, medium, and program
TWI766286B (en) Image processing method and image processing device, electronic device and computer-readable storage medium
TWI706379B (en) Method, apparatus and electronic device for image processing and storage medium thereof
JP7262659B2 (en) Target object matching method and device, electronic device and storage medium
TWI769635B (en) Network training pedestrian re-identification method and storage medium
TWI753348B (en) Pose determination method, pose determination device, electronic device and computer readable storage medium
TWI759647B (en) Image processing method, electronic device, and computer-readable storage medium
WO2021035833A1 (en) Posture prediction method, model training method and device
WO2021082241A1 (en) Image processing method and apparatus, electronic device and storage medium
CN111401230B (en) Gesture estimation method and device, electronic equipment and storage medium
TWI769523B (en) Image processing method, electronic device and computer-readable storage medium
KR20210090238A (en) Video processing method and apparatus, electronic device, and storage medium
WO2021169132A1 (en) Imaging processing method and apparatus, electronic device, and storage medium
WO2022134475A1 (en) Point cloud map construction method and apparatus, electronic device, storage medium and program
WO2022151686A1 (en) Scene image display method and apparatus, device, storage medium, program and product
KR20220123218A (en) Target positioning method, apparatus, electronic device, storage medium and program
WO2022141969A1 (en) Image segmentation method and apparatus, electronic device, storage medium, and program
CN113052874A (en) Target tracking method and device, electronic equipment and storage medium
WO2022110801A1 (en) Data processing method and apparatus, electronic device, and storage medium
CN113506325A (en) Image processing method and device, electronic equipment and storage medium
CN112330721A (en) Three-dimensional coordinate recovery method and device, electronic equipment and storage medium
JP7261889B2 (en) Positioning method and device based on shared map, electronic device and storage medium
CN112967311A (en) Three-dimensional line graph construction method and device, electronic equipment and storage medium
CN112837361A (en) Depth estimation method and device, electronic equipment and storage medium
CN117156261A (en) Image processing method and related equipment