WO2023155043A1 - Historical information-based scene depth reasoning method and apparatus, and electronic device - Google Patents

Historical information-based scene depth reasoning method and apparatus, and electronic device Download PDF

Info

Publication number
WO2023155043A1
WO2023155043A1 PCT/CN2022/076348 CN2022076348W WO2023155043A1 WO 2023155043 A1 WO2023155043 A1 WO 2023155043A1 CN 2022076348 W CN2022076348 W CN 2022076348W WO 2023155043 A1 WO2023155043 A1 WO 2023155043A1
Authority
WO
WIPO (PCT)
Prior art keywords
image frame
feature map
error
depth
image
Prior art date
Application number
PCT/CN2022/076348
Other languages
French (fr)
Chinese (zh)
Inventor
王飞
程俊
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Priority to PCT/CN2022/076348 priority Critical patent/WO2023155043A1/en
Publication of WO2023155043A1 publication Critical patent/WO2023155043A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images

Definitions

  • the application belongs to the technical field of computer vision and image processing, and in particular relates to a scene depth reasoning method, device and electronic equipment based on historical information.
  • schemes that restore scene depth from 2D images by fully unsupervised learning must simultaneously predict the camera pose from adjacent frames; an inaccurate pose produces wrong affine transformation results, which directly degrades the quality of the synthesized image and therefore the quality of the recovered scene depth.
  • the purpose of the embodiments of this specification is to provide a scene depth reasoning method, device and electronic equipment based on historical information.
  • the present application provides a scene depth reasoning method based on historical information, the method comprising:
  • acquiring a first image frame and a second image frame of the image to be tested, the first image frame being the image frame at the moment before the second image frame;
  • calculating a first error with an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
  • calculating a second error with an error calculation module according to the first image frame, the second image frame, the second depth weight and the second motion weight;
  • determining, according to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame.
  • the present application provides a device for scene depth reasoning based on historical information, the device comprising:
  • the first acquisition module is used to acquire the first image frame and the second image frame of the image to be tested, and the first image frame is an image frame at a moment before the second image frame;
  • the second acquisition module is used to acquire the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
  • the first processing module is configured to calculate a first error with an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
  • An update module configured to use the first error as a guide signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network to obtain a second depth weight and a second motion weight;
  • the second processing module is configured to calculate a second error with an error calculation module according to the first image frame, the second image frame, the second depth weight and the second motion weight;
  • the determination module is configured to determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
  • the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • when the processor executes the program, the historical information-based scene depth reasoning method of the first aspect is implemented.
  • the solution recovers scene depth from two-dimensional images in a completely unsupervised form; through the temporal attention module, the historical frame information held in the memory unit is injected into the current input unit, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and reduce the influence of wrong affine transformations caused by an inaccurate pose; during inference, online decision reasoning is used to improve the generalization ability of the algorithm to unknown scenes;
  • FIG. 1 is a schematic flow diagram of a scene depth reasoning method based on historical information provided by the present application
  • Fig. 2 is a joint training block diagram of the depth estimation network and the camera motion network provided by the embodiment of the present application;
  • FIG. 3 is a schematic diagram of the principle of the temporal attention module provided in the embodiment of the present application.
  • FIG. 4 is a schematic diagram of the principle of the spatio-temporal correlation module provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of the scene depth inference process provided by the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a scene depth reasoning device based on historical information provided by the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by the present application.
  • FIG. 1 shows a schematic flow chart applicable to a method for scene depth reasoning based on historical information provided by an embodiment of the present application.
  • the scene depth reasoning method based on historical information may include:
  • the image to be tested is any image that requires reasoning and prediction of scene depth, and the image to be tested may be a two-dimensional image.
  • the first image frame and the second image frame of the image to be tested are two image frames at adjacent moments, and the first image frame is the image frame at the moment before the second image frame; for example, the first image frame is the image frame at time t-1 and the second image frame is the image frame at time t, or the first image frame is the image frame at time t and the second image frame is the image frame at time t+1, and so on.
  • the depth estimation network is used to estimate the scene depth of the two-dimensional image, and the depth estimation network may adopt a neural network with an encoder-decoder structure, and the type and network structure of the neural network adopted by the depth estimation network are not limited.
  • a camera motion network is used to predict the relative pose between adjacent image frames.
  • Both the depth estimation network and the camera motion network are pre-built and trained.
  • the training set data is obtained first.
  • the training set data consists of groups of image frames at three adjacent moments; for example, the image frame at time t-1, the image frame at time t and the image frame at time t+1 form one piece of data in the training data set, the image frame at time t-2, the image frame at time t-1 and the image frame at time t form another piece of data, and so on.
  • the image frame at time t-1, the image frame at time t and the image frame at time t+1 are used as the input of the camera motion network and the depth estimation network, and the depth estimation network and the camera motion network are jointly trained using the objective function, which guides the update of the depth weights of the depth estimation network and the motion weights of the camera motion network.
  • the camera motion network may include an encoder, a temporal attention module, and a spatiotemporal correlation module.
  • the encoder is used to extract the features of the stacked image frame to obtain the stacked feature map; the stacked image frame is obtained by stacking the first image frame and the second image frame according to the channel dimension.
  • the temporal attention module is used to establish a global dependency between the information of the historical memory unit and the information of the current input unit, to inject the globally correlated information of the historical memory unit into the current input unit through the update unit, and at the same time to store the globally correlated information of the current input unit in the historical memory unit as the historical memory unit at the next moment;
  • the current input unit includes the stacked feature map, which the update unit updates into an updated feature map; the historical memory unit includes the first memory feature map and the first time feature map, and the historical memory unit at the next moment includes the second memory feature map and the second time feature map.
  • the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into, respectively, a first spatio-temporal feature map and a second spatio-temporal feature map with spatial correlation.
  • the time attention module uses the shared time attention weight to establish a global dependency relationship between the information of the historical memory unit and the information of the current input unit, and through the update unit, the information in the globally related historical memory unit The information is injected into the current input unit, and at the same time, the global relevant information in the current input unit is stored in the historical memory unit as the historical memory unit at the next moment, including:
  • determining a time-attention-based input feature vector according to the first feature vector and the temporal attention feature vector;
  • determining a time-attention-based memory feature vector according to the second feature vector and the temporal attention feature vector;
  • updating the third time feature map into the second time feature map according to the updated feature map and the second memory feature map.
  • FIG. 3 shows a schematic diagram of the principle of the temporal attention module provided by the embodiment of the present application.
  • for convenience of description, the input feature map at time t (that is, the stacked feature map) is denoted X_t, the time feature map at time t-1 (that is, the first time feature map) is denoted X_(time,t-1), and the memory feature map at time t-1 (that is, the first memory feature map) is denoted X_(m,t-1); the calculation process is as follows:
  • the feature information of the input feature map X_t at time t and of the memory feature map X_(m,t-1) at time t-1 is injected into the time feature map X_(time,t-1) at time t-1 to obtain the third time feature map.
  • C denotes the number of feature map channels, H the height of the feature map, and W the width of the feature map;
  • δ_gelu(·) denotes the activation function;
  • W_(i,t), W_(time,t-1) and W_(m,t-1) denote the corresponding learned weights, and b_(i,t), b_(time,t-1) and b_(m,t-1) the corresponding bias terms;
  • s denotes a scalar scaling factor, and "*" denotes the product of corresponding elements;
  • the function F_split(·) slices a feature map along the channel dimension, the function F_reshape(·) adjusts a feature map or feature vector into a preset shape, and "T" denotes transpose.
  • the information selection gate is computed as G_s = δ_sig( W_(qk_ms,t-1) X_(qk_m,t-1) + b_(qk_ms,t-1) + W_(qk_is,t) X_(qk_i,t) + b_(qk_is,t) )   (6), where the function δ_sig(·) denotes the sigmoid activation function, W_(qk_ms,t-1) and W_(qk_is,t) denote the corresponding weights, and b_(qk_ms,t-1) and b_(qk_is,t) denote the corresponding bias terms;
  • the new feature map containing the memory information is computed as X_(im,t) = δ_tanh( W_(qk_imi,t) X_(qk_i,t) + b_(qk_imi,t) + G_s * ( W_(qk_imm,t-1) X_(qk_m,t-1) + b_(qk_imm,t-1) ) )   (7), where the function δ_tanh(·) denotes the tanh activation function, W_(qk_imi,t) and W_(qk_imm,t-1) denote the corresponding weights, and b_(qk_imi,t) and b_(qk_imm,t-1) denote the corresponding bias terms;
  • W_(qk_ir,t) and W_(qk_mr,t-1) denote the corresponding weights, and b_(qk_ir,t) and b_(qk_mr,t-1) the corresponding bias terms (memory gate, formula group (8));
  • W_(qk_io,t) and W_(qk_mo,t-1) denote the corresponding weights, and b_(qk_io,t) and b_(qk_mo,t-1) the corresponding bias terms (output gate, formula (9)).
  • the time feature map X_(time,t-1) at time t-1 is updated into the time feature map at time t (that is, the second time feature map) X_(time,t) according to formula (10);
  • W_(time,t-1), W_(io,t) and W_(m,t) denote the corresponding weights, and b_(time,t-1), b_(io,t) and b_(m,t) the corresponding bias terms.
  • the spatio-temporal correlation module shown in Figure 4 is constructed: global spatial correlation weights are used to model the spatial context information, and the dependencies between the feature map channels of the stacked frames are modeled to constrain the temporal information between the stacked frames.
  • the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into, respectively, a first spatio-temporal feature map and a second spatio-temporal feature map with spatial correlation.
  • the updated feature map is modeled as the first spatiotemporal feature map with spatial correlation, including:
  • the first spatio-temporal feature vector is adjusted to a first spatio-temporal feature map with spatial correlation.
  • the second memory feature map is modeled as a second spatiotemporal feature map with spatial correlation, including:
  • the second spatio-temporal feature vector is adjusted to a second spatio-temporal feature map with spatial correlation.
  • the input feature map (that is, the updated feature map or the second memory feature map) is transformed to obtain an intermediate feature map, which is divided equally into three sub-feature maps (namely the first/fourth sub-feature map, the second/fifth sub-feature map and the third/sixth sub-feature map), where the function F_split(·) slices the feature map along the channel dimension;
  • W_mid denotes the corresponding weight and b_mid the corresponding bias term;
  • each sub-feature map is adjusted into a corresponding feature vector, the function F_reshape(·) being used to adjust a feature map or feature vector into a preset shape;
  • F_C(·) consists of two layers of one-dimensional convolution and activation functions with a kernel size of 3 and a stride of 1.
  • an error calculation module is used to calculate the first error.
  • the total error is determined from image synthesis error, scene depth structure consistency error, feature perception loss error, smoothing loss error. Wherein, the total error includes the first error and the second error.
  • the total error is determined according to image synthesis error, scene depth structure consistency error, feature perception loss error, and smoothing loss error, including:
  • affine-transforming the first world coordinates of the first image frame to the plane of the second image frame, and determining the third world coordinates after the affine transformation;
  • determining the scene depth structure consistency error, the first depth structure inconsistency weight and the second depth structure inconsistency weight according to the first scene depth, the second scene depth, the image coordinates after the first affine transformation and the image coordinates after the second affine transformation;
  • the total error is determined based on the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
  • the scene depth is recovered by a fitting function, D_t = F_D(I_t; W_D), where I_t denotes the two-dimensional image whose scene depth is to be recovered at time t, W_D denotes the weight parameters learned in the scene depth fitting function F_D(·), and D_t denotes the recovered scene depth of the two-dimensional image I_t at time t;
  • X_(enc,t) denotes the feature map of the image frame I_t output by the encoder at time t;
  • the pose transforming the image frame I_t-1 at time t-1 to the image frame I_t at time t is predicted by the pose transformation function F_T(·), and W_T denotes the weight parameters learned in F_T(·);
  • the camera intrinsics are denoted K; P_(xy,t-1) denotes the image coordinates of the image frame I_t-1, P_(xyz,t-1) denotes the world coordinates of the image frame I_t-1, and P_(xy,t) denotes the image coordinates of the image frame I_t (a sketch of this projection and warping step is given after this list);
  • the sign "*" denotes the product of the corresponding elements of the matrices;
  • the feature-level error is E_X = ERF(X_(enc,t), X_(syn_enc,t)) + ERF(X_(enc,t-1), X_(syn_enc,t-1))   (21).
  • the first image frame, the second image frame, the second depth weight, and the second motion weight use an error calculation module to calculate the second error, which may include:
  • an error calculation module is used to calculate a second error.
  • for the specific calculation process, refer to step S130; only the first depth weight and the first motion weight of S130 are replaced by the second depth weight and the second motion weight, and details are not repeated here.
  • S160 Determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
  • the second scene depth is used as the scene depth of the second image frame
  • the first relative pose is used as the relative pose between the first image frame and the second image frame
  • the fourth scene depth is used as the scene depth of the second image frame
  • the second relative pose is used as the relative pose between the first image frame and the second image frame.
  • FIG. 5 shows a schematic diagram of the scene depth inference process.
  • the process of inferring scene depth is as follows:
  • the scene depth is recovered from the two-dimensional image in a completely unsupervised form; the temporal attention module injects the historical frame information held in the memory unit into the current input unit, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and to reduce the influence of wrong affine transformations caused by an inaccurate pose; during inference, online decision reasoning is used to improve the generalization ability of the algorithm to unknown scenes.
  • FIG. 6 shows a schematic structural diagram of an apparatus for scene depth inference based on historical information according to an embodiment of the present application.
  • the scene depth reasoning device 600 based on historical information may include:
  • the first acquisition module 610 is configured to acquire a first image frame and a second image frame of the image to be tested, and the first image frame is an image frame at a moment before the second image frame;
  • the second acquiring module 620 is configured to acquire the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
  • the first processing module 630 is configured to calculate a first error with an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
  • An update module 640 configured to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network by using the first error as a guide signal to obtain a second depth weight and a second motion weight;
  • the second processing module 650 is configured to calculate a second error with an error calculation module according to the first image frame, the second image frame, the second depth weight and the second motion weight;
  • the determination module 660 is configured to determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
  • the camera motion network includes an encoder, a temporal attention module and a spatiotemporal correlation module;
  • the encoder is used to extract the features of the stacked image frame to obtain the stacked feature map;
  • the stacked image frame is obtained by stacking the first image frame and the second image frame according to the channel dimension;
  • the temporal attention module is used to establish a global dependency between the information of the historical memory unit and the information of the current input unit, to inject the globally correlated information of the historical memory unit into the current input unit through the update unit, and at the same time to store the globally correlated information of the current input unit in the historical memory unit as the historical memory unit at the next moment;
  • the current input unit includes the stacked feature map, which the update unit updates into an updated feature map; the historical memory unit includes the first memory feature map and the first time feature map, and the historical memory unit at the next moment includes the second memory feature map and the second time feature map;
  • the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into, respectively, a first spatio-temporal feature map and a second spatio-temporal feature map with spatial correlation.
  • the scene depth reasoning device 600 based on historical information is also used for:
  • determining a time-attention-based input feature vector according to the first feature vector and the temporal attention feature vector;
  • determining a time-attention-based memory feature vector according to the second feature vector and the temporal attention feature vector;
  • updating the third time feature map into the second time feature map according to the updated feature map and the second memory feature map.
  • the scene depth reasoning device 600 based on historical information is also used for:
  • the second spatio-temporal feature vector is adjusted to the second spatio-temporal feature map with spatial correlation.
  • the first processing module 630 is also used for:
  • the second processing module 650 is also used for:
  • an error calculation module is used to calculate a second error.
  • the determining module 660 is also used for:
  • the second scene depth is used as the scene depth of the second image frame
  • the first relative pose is used as the relative pose between the first image frame and the second image frame
  • the fourth scene depth is used as the scene depth of the second image frame
  • the second relative pose is used as the relative pose between the first image frame and the second image frame.
  • the total error includes a first error and a second error; the total error is determined according to image synthesis error, scene depth structure consistency error, feature perception loss error, and smoothing loss error.
  • the first processing module 630 or the second processing module 650 is also used for:
  • affine-transforming the first world coordinates of the first image frame to the plane of the second image frame, and determining the third world coordinates after the affine transformation;
  • determining the scene depth structure consistency error, the first depth structure inconsistency weight and the second depth structure inconsistency weight according to the first scene depth, the second scene depth, the image coordinates after the first affine transformation and the image coordinates after the second affine transformation;
  • determining the smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
  • the total error is determined based on the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
  • the historical information-based scene depth reasoning device provided in this embodiment can execute the above-mentioned embodiment of the method, and its implementation principle and technical effect are similar, and will not be repeated here.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 7 , a schematic structural diagram of an electronic device 300 suitable for implementing the embodiments of the present application is shown.
  • an electronic device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage section 308 into a random access memory (RAM) 303.
  • ROM read-only memory
  • RAM random access memory
  • in the RAM 303, various programs and data necessary for the operation of the device 300 are also stored.
  • the CPU 301, ROM 302, and RAM 303 are connected to each other through a bus 304.
  • An input/output (I/O) interface 305 is also connected to the bus 304 .
  • the following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, etc.; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 308 including a hard disk, etc.; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like.
  • the communication section 309 performs communication processing via a network such as the Internet.
  • a drive 310 is also connected to the I/O interface 305 as needed.
  • a removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 310 as necessary so that a computer program read therefrom is installed into the storage section 308 as necessary.
  • the process described above with reference to FIG. 1 may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product, which includes a computer program tangibly contained on a machine-readable medium, the computer program including program codes for executing the above-mentioned scene depth reasoning method based on historical information.
  • the computer program may be downloaded and installed from a network via communication portion 309 and/or installed from removable media 311 .
  • each block in a flowchart or block diagram may represent a module, program segment, or part of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented by a dedicated hardware-based system that performs the specified functions or operations , or may be implemented by a combination of dedicated hardware and computer instructions.
  • the units or modules involved in the embodiments described in the present application may be implemented by means of software or by means of hardware.
  • the described units or modules may also be provided in a processor.
  • the names of these units or modules do not constitute limitations on the units or modules themselves in some cases.
  • a typical implementing device is a computer.
  • the computer can be, for example, a personal computer, a notebook computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.
  • the present application also provides a storage medium, which may be the storage medium contained in the aforementioned device in the above embodiment, or may be a storage medium that exists independently and is not assembled into the device.
  • the storage medium stores one or more programs, and the aforementioned programs are used by one or more processors to execute the scene depth reasoning method based on historical information described in this application.
  • Storage media includes permanent and non-permanent, removable and non-removable media.
  • Information storage can be realized by any method or technology.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridge, tape magnetic disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media excludes transitory computer-readable media, such as modulated data signals and carrier waves.
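A note on the geometry used by the error calculation module above: lifting pixels to world coordinates P_(xyz) with the depth map and the intrinsics K, moving them with the predicted relative pose, and re-projecting them to image coordinates P_(xy) is the standard differentiable view-synthesis chain in unsupervised depth learning. The NumPy sketch below illustrates only that generic chain under a pinhole camera model; the function names (backproject, warp_coords) are hypothetical and the exact error terms built on top of it are not taken from the patent.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel (u, v) of a depth map to camera-space 3-D points.

    depth: (H, W) scene depth; K: (3, 3) pinhole intrinsics.
    Returns (3, H*W) points P = D * K^-1 * [u, v, 1]^T.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])   # homogeneous pixel coordinates
    rays = np.linalg.inv(K) @ pix                            # unit-depth viewing rays
    return rays * depth.ravel()                              # scale each ray by its depth

def warp_coords(depth_t, K, T_t_to_tm1):
    """Project frame-t pixels into frame t-1 using depth and the predicted relative pose.

    T_t_to_tm1: (4, 4) homogeneous transform (camera motion network output, assumed).
    Returns (H, W, 2) sampling coordinates used to synthesise frame t from frame t-1.
    """
    H, W = depth_t.shape
    pts = backproject(depth_t, K)                            # (3, N) points in the frame-t camera
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])     # homogeneous (4, N)
    pts_tm1 = (T_t_to_tm1 @ pts_h)[:3]                       # points expressed in the t-1 camera
    proj = K @ pts_tm1                                       # re-project with the intrinsics
    uv = proj[:2] / np.clip(proj[2:], 1e-6, None)            # perspective divide (z kept positive)
    return uv.T.reshape(H, W, 2)
```

Bilinearly sampling the previous frame at the returned coordinates gives the synthesized view whose difference from the current frame drives the image synthesis error; the depth structure consistency, feature perception and smoothing terms are then added on top of it.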

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A historical information-based scene depth reasoning method and apparatus, and an electronic device. The method comprises: acquiring a first image frame and a second image frame of an image to be detected (S110); acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network (S120); calculating a first error on the basis of the first image frame, the second image frame, the first depth weight and the first motion weight and by using an error calculation module (S130); using the first error as a guiding signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network, so as to obtain a second depth weight and a second motion weight (S140); calculating a second error on the basis of the first image frame, the second image frame, the second depth weight and the second motion weight and by using an error calculation module (S150); and determining, according to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame (S160).

Description

一种基于历史信息的场景深度推理方法、装置及电子设备A method, device and electronic equipment for scene depth reasoning based on historical information 技术领域technical field
本申请属于计算机视觉与图像处理技术领域,特别涉及一种基于历史信息的场景深度推理方法、装置及电子设备。The application belongs to the technical field of computer vision and image processing, and in particular relates to a scene depth reasoning method, device and electronic equipment based on historical information.
背景技术Background technique
Accurately recovering scene depth from 2D images helps to better understand the 3D structure of a scene and thus to better complete various visual tasks. However, ordinary cameras capture two-dimensional images and lose the depth information of the scene, so recovering scene depth from two-dimensional images or video sequences has become a fundamental and extremely challenging task in the field of computer vision. Although competitive scene depth can already be recovered from two-dimensional images, a large amount of manually labeled data is required to train the neural network, which is time-consuming and laborious; and once model training is completed, the model weights are frozen, reducing the generalization ability of the algorithm to unknown scenes. In addition, schemes based on fully unsupervised learning that restore scene depth from 2D images must simultaneously predict the camera pose from adjacent frames, and an inaccurate pose produces wrong affine transformation results that directly affect the quality of the synthesized image and therefore the quality of the restored scene depth.
发明内容Contents of the invention
本说明书实施例的目的是提供一种基于历史信息的场景深度推理方法、装置及电子设备。The purpose of the embodiments of this specification is to provide a scene depth reasoning method, device and electronic equipment based on historical information.
为解决上述技术问题,本申请实施例通过以下方式实现的:In order to solve the above technical problems, the embodiments of the present application are implemented in the following ways:
第一方面,本申请提供一种基于历史信息的场景深度推理方法,该方法包括:In the first aspect, the present application provides a scene depth reasoning method based on historical information, the method comprising:
获取待测图像的第一图像帧和第二图像帧,第一图像帧为第二图像帧前一时刻的图像帧;Obtaining the first image frame and the second image frame of the image to be tested, the first image frame being the image frame at the moment before the second image frame;
获取预先构建的深度估计网络的第一深度权重及预先构建的相机运动网络的第一运动权重;Obtaining the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
第一图像帧、第二图像帧、第一深度权重和第一运动权重,采用误差计算模块,计算第一误差;The first image frame, the second image frame, the first depth weight and the first motion weight, using an error calculation module to calculate the first error;
将第一误差作为指导信号联合更新深度估计网络的第一深度权重和相机运动网络的第一运动权重,得到第二深度权重和第二运动权重;Using the first error as a guide signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network to obtain a second depth weight and a second motion weight;
第一图像帧、第二图像帧、第二深度权重和第二运动权重,采用误差计算模块,计算第二误差;The first image frame, the second image frame, the second depth weight and the second motion weight use an error calculation module to calculate a second error;
根据第一误差和第二误差,确定第二图像帧的场景深度及第一图像帧与第二图像帧之间的相对位姿。According to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame are determined.
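Read as pseudo-code, the first-aspect method is a small test-time adaptation loop: score the frame pair with the pre-built weights, take one error-guided joint update of both networks, score again, and output one of the two candidate depth/pose pairs depending on how the two errors compare. Below is a hedged PyTorch-style sketch; the helpers compute_error, depth_net and motion_net are hypothetical names, and since the exact rule for choosing between the two candidates is specified later in the description, the sketch simply keeps the lower-error one as an illustrative stand-in.

```python
import torch

def infer_depth_and_pose(frame_prev, frame_cur, depth_net, motion_net, compute_error, lr=1e-4):
    """One online-decision inference step over an adjacent frame pair (sketch only).

    compute_error(...) is assumed to return (total_error, scene_depth, relative_pose)
    for the current weights of the two networks.
    """
    # First error, computed with the pre-built (first) depth and motion weights.
    err1, depth1, pose1 = compute_error(frame_prev, frame_cur, depth_net, motion_net)

    # Use the first error as a guide signal to jointly update both networks,
    # yielding the second depth weight and the second motion weight.
    params = list(depth_net.parameters()) + list(motion_net.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    optimizer.zero_grad()
    err1.backward()
    optimizer.step()

    # Second error, computed with the updated (second) weights.
    err2, depth2, pose2 = compute_error(frame_prev, frame_cur, depth_net, motion_net)

    # Decide which candidate to output from the two errors (illustrative rule).
    if err2.item() < err1.item():
        return depth2, pose2
    return depth1, pose1
```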
第二方面,本申请提供一种基于历史信息的场景深度推理装置,该装置包括:In a second aspect, the present application provides a device for scene depth reasoning based on historical information, the device comprising:
第一获取模块,用于获取待测图像的第一图像帧和第二图像帧,第一图像帧为第二图像帧前一时刻的图像帧;The first acquisition module is used to acquire the first image frame and the second image frame of the image to be tested, and the first image frame is an image frame at a moment before the second image frame;
第二获取模块,用于获取预先构建的深度估计网络的第一深度权重及预先构建的相机运动网络的第一运动权重;The second acquisition module is used to acquire the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
第一处理模块,用于第一图像帧、第二图像帧、第一深度权重和第一运动权重,采用误差计算模块,计算第一误差;The first processing module is used for the first image frame, the second image frame, the first depth weight and the first motion weight, and uses an error calculation module to calculate the first error;
更新模块,用于将第一误差作为指导信号联合更新深度估计网络的第一深度权重和相机运动网络的第一运动权重,得到第二深度权重和第二运动权重;An update module, configured to use the first error as a guide signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network to obtain a second depth weight and a second motion weight;
第二处理模块,用于第一图像帧、第二图像帧、第二深度权重和第二运动权重,采用误差计算模块,计算第二误差;The second processing module is used for the first image frame, the second image frame, the second depth weight and the second motion weight, and uses an error calculation module to calculate the second error;
确定模块,用于根据第一误差和第二误差,确定第二图像帧的场景深度及第一图像帧与第二图像帧之间的相对位姿。The determination module is configured to determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
第三方面,本申请提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行程序时实现如第一方面的基于历史信息的场景深度推理方法。In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor. When the processor executes the program, the scene depth based on historical information as in the first aspect is realized. reasoning method.
It can be seen from the technical solution provided by the above embodiments of this specification that the solution recovers scene depth from two-dimensional images in a completely unsupervised form; through the temporal attention module, the historical frame information held in the memory unit is injected into the current input unit, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and reduce the influence of wrong affine transformations caused by an inaccurate pose; during inference, online decision reasoning is used to improve the generalization ability of the algorithm to unknown scenes.
附图说明Description of drawings
为了更清楚地说明本说明书实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中 的附图仅仅是本说明书中记载的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of this specification or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only These are some embodiments described in this specification. Those skilled in the art can also obtain other drawings based on these drawings without any creative effort.
图1为本申请提供的基于历史信息的场景深度推理方法的流程示意图;FIG. 1 is a schematic flow diagram of a scene depth reasoning method based on historical information provided by the present application;
图2为本申请实施例提供的深度估计网络和相机运动网络联合训练框图;Fig. 2 is a joint training block diagram of the depth estimation network and the camera motion network provided by the embodiment of the present application;
图3为本申请实施例提供的时间注意力模块的原理示意图;FIG. 3 is a schematic diagram of the principle of the temporal attention module provided in the embodiment of the present application;
图4为本申请实施例提供的时空相关性模块的原理示意图;FIG. 4 is a schematic diagram of the principle of the spatio-temporal correlation module provided by the embodiment of the present application;
图5为本申请实施例提供的场景深度推理过程示意图;FIG. 5 is a schematic diagram of the scene depth inference process provided by the embodiment of the present application;
图6为本申请提供的基于历史信息的场景深度推理装置的结构示意图;FIG. 6 is a schematic structural diagram of a scene depth reasoning device based on historical information provided by the present application;
图7为本申请提供的电子设备的结构示意图。FIG. 7 is a schematic structural diagram of an electronic device provided by the present application.
具体实施方式Detailed ways
为了使本技术领域的人员更好地理解本说明书中的技术方案,下面将结合本说明书实施例中的附图,对本说明书实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本说明书一部分实施例,而不是全部的实施例。基于本说明书中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都应当属于本说明书保护的范围。In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below in conjunction with the drawings in the embodiments of this specification. Obviously, the described The embodiments are only some of the embodiments in this specification, not all of them. Based on the embodiments in this specification, all other embodiments obtained by persons of ordinary skill in the art without creative efforts shall fall within the protection scope of this specification.
以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、技术之类的具体细节,以便透彻理解本申请实施例。然而,本领域的技术人员应当清楚,在没有这些具体细节的其它实施例中也可以实现本申请。在其它情况中,省略对众所周知的系统、装置、电路以及方法的详细说明,以免不必要的细节妨碍本申请的描述。In the following description, specific details such as specific system structures and technologies are presented for the purpose of illustration rather than limitation, so as to thoroughly understand the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
在不背离本申请的范围或精神的情况下,可对本申请说明书的具体实施方式做多种改进和变化,这对本领域技术人员而言是显而易见的。由本申请的说明书得到的其他实施方式对技术人员而言是显而易见得的。本申请说明书和实施例仅是示例性的。It will be apparent to those skilled in the art that various modifications and changes can be made to the specific embodiments described in the present application without departing from the scope or spirit of the present application. Other embodiments will be apparent to those skilled in the art from the description of this application. The specification and examples in this application are exemplary only.
关于本文中所使用的“包含”、“包括”、“具有”、“含有”等等,均为开放性的用语,即意指包含但不限于。As used herein, "comprising", "comprising", "having", "comprising" and so on are all open terms, meaning including but not limited to.
本申请中的“份”如无特别说明,均按质量份计。The "parts" in this application are by mass parts unless otherwise specified.
下面结合附图和实施例对本发明进一步详细说明。The present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments.
参照图1,其示出了适用于本申请实施例提供的基于历史信息的场景深度推理方法的流程示意图。Referring to FIG. 1 , it shows a schematic flow chart applicable to a method for scene depth reasoning based on historical information provided by an embodiment of the present application.
如图1所示,基于历史信息的场景深度推理方法,可以包括:As shown in Figure 1, the scene depth reasoning method based on historical information may include:
S110、获取待测图像的第一图像帧和第二图像帧,第一图像帧为第二图像帧前一时刻的图像帧。S110. Acquire a first image frame and a second image frame of the image to be tested, where the first image frame is an image frame at a moment before the second image frame.
其中,待测图像为需要推理预测场景深度的任意一幅图像,待测图像可以为二维图像。Wherein, the image to be tested is any image that requires reasoning and prediction of scene depth, and the image to be tested may be a two-dimensional image.
The image to be tested is clipped at equal time intervals into several adjacent image frames, where the first image frame and the second image frame of the image to be tested are two image frames at adjacent moments and the first image frame is the image frame at the moment before the second image frame; for example, the first image frame is the image frame at time t-1 and the second image frame is the image frame at time t, or the first image frame is the image frame at time t and the second image frame is the image frame at time t+1, and so on.
S120、获取预先构建的深度估计网络的第一深度权重及预先构建的相机运动网络的第一运动权重。S120. Acquire a first depth weight of a pre-built depth estimation network and a first motion weight of a pre-built camera motion network.
其中,深度估计网络用于估计二维图像的场景深度,该深度估计网络可以采用具有编码器解码器结构的神经网络,对于该深度估计网络采用的神经网络的类型和网络结构不做限定。Wherein, the depth estimation network is used to estimate the scene depth of the two-dimensional image, and the depth estimation network may adopt a neural network with an encoder-decoder structure, and the type and network structure of the neural network adopted by the depth estimation network are not limited.
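Since the patent leaves the backbone open, any encoder-decoder that maps an RGB frame to a per-pixel depth map can play this role. The deliberately small PyTorch sketch below only illustrates that structure and is not the network used in the embodiment; the channel widths and output scaling are assumptions.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Minimal encoder-decoder depth estimator: 3xHxW image -> 1xHxW depth map."""
    def __init__(self, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, image):
        feat = self.encoder(image)
        # Sigmoid keeps the output in (0, 1); a scale/shift can map it to metric depth.
        return torch.sigmoid(self.decoder(feat))

depth = TinyDepthNet()(torch.rand(1, 3, 256, 832))   # -> torch.Size([1, 1, 256, 832])
```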
相机运动网络用于预测相邻图像帧之间的相对位姿。A camera motion network is used to predict the relative pose between adjacent image frames.
深度估计网络和相机运动网络均是预先构建及训练好的。Both the depth estimation network and the camera motion network are pre-built and trained.
Referring to FIG. 2, which shows the joint training block diagram of the depth estimation network and the camera motion network provided by the embodiment of the present application (the leftmost original image, the scene depth picture and the synthesized view in FIG. 2 are all color images, rendered here as grayscale images). It can be understood that, to train the depth estimation network and the camera motion network, the training set data is obtained first. In this application, the training set data consists of groups of image frames at three adjacent moments; for example, the image frame at time t-1, the image frame at time t and the image frame at time t+1 form one piece of data in the training data set, the image frame at time t-2, the image frame at time t-1 and the image frame at time t form another piece of data, and so on. Following the training block diagram shown in FIG. 2, the image frame at time t-1, the image frame at time t and the image frame at time t+1 are used as the input of the camera motion network and the depth estimation network, and the depth estimation network and the camera motion network are jointly trained using the objective function, which guides the update of the depth weights of the depth estimation network and the motion weights of the camera motion network.
It can be understood that, before the image frame at time t-1, the image frame at time t and the image frame at time t+1 are input into the depth estimation network and the camera motion network, image preprocessing is performed first, for example random flipping, random cropping and data normalization, and the processed data are converted into tensor data of dimension C×H×W (the batch dimension is omitted here), where C represents the channel dimension of a sample: during training, C=3 for the depth estimation network and C=9 for the camera motion network (the camera motion network then takes 3 image frames at adjacent moments as input), while during test-time inference (that is, when the historical information-based scene depth reasoning method of this application is carried out) C=6 for the camera motion network (the input of the camera motion network during inference is 2 image frames at adjacent moments); H represents the height of the input sample image, for example H=256, and W represents the width of the input sample image, for example W=832.
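Concretely, the channel dimension C simply counts stacked RGB frames. A minimal sketch of the tensor shapes quoted above follows; the flip/crop/normalization transforms are omitted and random tensors stand in for preprocessed frames.

```python
import torch

H, W = 256, 832                                      # example input resolution quoted above
frames = [torch.rand(3, H, W) for _ in range(3)]     # I_{t-1}, I_t, I_{t+1} after preprocessing

depth_in = frames[1]                                 # C=3: single frame for the depth network
train_in = torch.cat(frames, dim=0)                  # C=9: three adjacent frames (training)
infer_in = torch.cat(frames[:2], dim=0)              # C=6: two adjacent frames (inference)

print(depth_in.shape, train_in.shape, infer_in.shape)
# torch.Size([3, 256, 832]) torch.Size([9, 256, 832]) torch.Size([6, 256, 832])
```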
继续参照图2,相机运动网络可以包括编码器、时间注意力模块和时空相关性模块。Continuing to refer to Figure 2, the camera motion network may include an encoder, a temporal attention module, and a spatiotemporal correlation module.
其中,编码器用于提取堆叠图像帧的特征,得到堆叠特征图;堆叠图像帧为第一图像帧和第二图像帧按通道维度堆叠得到的。Among them, the encoder is used to extract the features of the stacked image frame to obtain the stacked feature map; the stacked image frame is obtained by stacking the first image frame and the second image frame according to the channel dimension.
时间注意力模块用于将历史记忆单元的信息与当前输入单元的信息建立全局依赖关系,并通过更新单元,将全局相关的历史记忆单元中的信息注入到当前输入单元,同时将当前输入单元中的全局相关信息储存到历史记忆单元,作为下一时刻的历史记忆单元;当前输入单元包括堆叠特征图,堆叠特征图通过更新单元更新为更新后特征图,历史记忆单元包括第一记忆特征图和第一时间特征图,下一时刻的历史记忆单元包括第二记忆特征图和第二时间特征图。The time attention module is used to establish a global dependency relationship between the information of the historical memory unit and the information of the current input unit, and inject the information in the globally related historical memory unit into the current input unit through the update unit, and at the same time inject the information in the current input unit The global relevant information of is stored in the historical memory unit as the historical memory unit at the next moment; the current input unit includes a stacked feature map, and the stacked feature map is updated to an updated feature map by the update unit, and the historical memory unit includes the first memory feature map and The first time feature map, the historical memory unit at the next moment includes the second memory feature map and the second time feature map.
时空相关性模块用于将更新后特征图/第二记忆特征图分别建模成为具有空间相关性的第一/第二时空特征图。The spatio-temporal correlation module is used to model the updated feature map/the second memory feature map into the first/second spatio-temporal feature map with spatial correlation respectively.
在一个实施例中,时间注意力模块,利用共享的时间注意力权重,将历史记忆单元的信息与当前输入单元的信息建立全局依赖关系,并通过更新单元,将全局相关的历史记忆单元中的信息注入到当前输入单元,同时将当前输入单元中的全局相关信息储存到历史记忆单元,作为下一时刻的历史记忆单元,包括:In one embodiment, the time attention module uses the shared time attention weight to establish a global dependency relationship between the information of the historical memory unit and the information of the current input unit, and through the update unit, the information in the globally related historical memory unit The information is injected into the current input unit, and at the same time, the global relevant information in the current input unit is stored in the historical memory unit as the historical memory unit at the next moment, including:
将堆叠特征图中的特征信息和第一记忆特征图的特征信息注入到第一时间特征图,得到第三时间特征图;Injecting the feature information in the stacked feature map and the feature information of the first memory feature map into the first time feature map to obtain a third time feature map;
根据第三时间特征图,确定时间注意力特征向量;Determine the time attention feature vector according to the third time feature map;
根据堆叠特征图,确定第一特征向量;Determining a first feature vector according to the stacked feature map;
根据第一记忆特征图;确定第二特征向量According to the first memory feature map; determine the second feature vector
根据第一特征向量和时间注意力特征向量,确定基于时间注意力的输入特征向量;According to the first feature vector and the time attention feature vector, determine the input feature vector based on time attention;
根据第二特征向量和时间注意力特征向量,确定基于时间注意力的记忆特征向量;According to the second feature vector and the time attention feature vector, determine the memory feature vector based on time attention;
分别将基于时间注意力的输入特征向量和基于时间注意力的记忆特征向量,调整成对应的第一特征图和第二特征图;Respectively adjust the input feature vector based on time attention and the memory feature vector based on time attention into corresponding first feature map and second feature map;
根据第一特征图和第二特征图,确定更新后特征图和第二记忆特征图;Determine the updated feature map and the second memory feature map according to the first feature map and the second feature map;
根据更新后特征图和第二记忆特征图,将第三时间特征图更新为第二时间特征图。The third time feature map is updated to the second time feature map according to the updated feature map and the second memory feature map.
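The sub-steps above (detailed with formulas (1)-(10) below) describe a gated, attention-weighted exchange between the current input and the historical memory. Because several of those formulas are reproduced only as images in the publication, the following PyTorch sketch is a schematic stand-in for the described information flow rather than the patented equations: every learned weight is a placeholder 1x1 convolution, and the attention weighting of the reshaped vectors is approximated by a pooled channel attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionSketch(nn.Module):
    """Schematic stand-in for the temporal attention module (not the patented formulas).

    Inputs: stacked input map X_t, previous time map X_(time,t-1), previous memory map X_(m,t-1).
    Outputs: updated input map, second memory map X_(m,t), second time map X_(time,t).
    """

    def __init__(self, c: int):
        super().__init__()
        self.mix = nn.Conv2d(3 * c, c, 1)       # step: inject X_t and X_(m,t-1) into the time map
        self.gate_s = nn.Conv2d(2 * c, c, 1)    # information selection gate G_s
        self.fuse = nn.Conv2d(2 * c, c, 1)      # new map containing memory information
        self.gate_m = nn.Conv2d(2 * c, c, 1)    # memory gate
        self.gate_o = nn.Conv2d(2 * c, c, 1)    # output gate
        self.time_out = nn.Conv2d(3 * c, c, 1)  # time-map update

    def forward(self, x_in, x_time_prev, x_mem_prev):
        # Third time feature map via a GELU-activated mixture of all three inputs.
        time_mid = F.gelu(self.mix(torch.cat([x_in, x_time_prev, x_mem_prev], dim=1)))

        # Attention weighting of input and memory, approximated by pooled channel attention.
        attn = torch.softmax(time_mid.mean(dim=(2, 3)), dim=1)[:, :, None, None]
        x_i, x_m = x_in * attn, x_mem_prev * attn

        # Selection gate decides how much memory flows into the input.
        g_s = torch.sigmoid(self.gate_s(torch.cat([x_i, x_m], dim=1)))
        # New feature map containing the selected memory information.
        x_im = torch.tanh(self.fuse(torch.cat([x_i, g_s * x_m], dim=1)))
        # Memory gate blends old memory with the new map -> second memory map.
        g_m = torch.sigmoid(self.gate_m(torch.cat([x_i, x_m], dim=1)))
        x_mem = g_m * x_m + (1.0 - g_m) * x_im
        # Output gate produces the updated input map used at the next moment.
        g_o = torch.sigmoid(self.gate_o(torch.cat([x_i, x_m], dim=1)))
        x_out = g_o * x_im
        # Second time feature map from the intermediate, output and memory maps.
        x_time = torch.tanh(self.time_out(torch.cat([time_mid, x_out, x_mem], dim=1)))
        return x_out, x_mem, x_time
```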
示例性的,参照图3,其示出了本申请实施例提供的时间注意力模块的原理示意图。为描述方便,假设t时刻的输入特征图(即堆叠特征图)表示为
Figure PCTCN2022076348-appb-000001
t-1时刻时间特征图(即第一时间特征图)为
Figure PCTCN2022076348-appb-000002
t-1时刻时间记忆特征图(即第一记忆特征图)为
Figure PCTCN2022076348-appb-000003
其计算过程如下:
For example, refer to FIG. 3 , which shows a schematic diagram of the principle of the temporal attention module provided by the embodiment of the present application. For the convenience of description, it is assumed that the input feature map at time t (that is, the stacked feature map) is expressed as
Figure PCTCN2022076348-appb-000001
The time feature map at time t-1 (that is, the first time feature map) is
Figure PCTCN2022076348-appb-000002
The temporal memory feature map at time t-1 (namely, the first memory feature map) is
Figure PCTCN2022076348-appb-000003
Its calculation process is as follows:
1)根据公式(1)将t时刻的输入特征图X t中的特征信息和t-1时刻记忆特征图X (m,t-1)中的特征信息注入到t-1时刻时间特征图X (time,t-1)中,得到第三时间特征图
Figure PCTCN2022076348-appb-000004
1) According to the formula (1), the feature information in the input feature map X t at time t and the feature information in the memory feature map X (m,t-1) at time t-1 are injected into the time feature map X at time t-1 (time,t-1) , get the third time feature map
Figure PCTCN2022076348-appb-000004
Figure PCTCN2022076348-appb-000005
Figure PCTCN2022076348-appb-000005
其中,
Figure PCTCN2022076348-appb-000006
表示所属特征空间,C表示特征图通道数量,H表示特征图的高度,W表示特征图的宽度,δ gelu(·)表示激活函数,W (i,t),W (time,t-1),W (m,t-1)表示学习出的对应的权重,b (i,t),b (time,t-1),b (m,t-1)表示对应的偏执项。
in,
Figure PCTCN2022076348-appb-000006
Represents the feature space, C represents the number of feature map channels, H represents the height of the feature map, W represents the width of the feature map, δ gelu ( ) represents the activation function, W (i,t) ,W (time,t-1) , W (m, t-1) represents the corresponding learned weight, b (i, t) , b (time, t-1) , b (m, t-1) represents the corresponding paranoid item.
2)根据公式组(2)计算出时间注意力特征向量x (qk_time,t-1)2) Calculate the time attention feature vector x (qk_time,t-1) according to the formula group (2):
Figure PCTCN2022076348-appb-000007
Figure PCTCN2022076348-appb-000007
Figure PCTCN2022076348-appb-000008
Figure PCTCN2022076348-appb-000008
其中,s表示标量缩放因子,“*”表示对应元素的乘积,函数F split(·)表示按照通道维度对特征图进行切片,函数F reshape(·)用于将特征图或特征向量调整成预设形状,“T”表示转置,
Figure PCTCN2022076348-appb-000009
表示对应的权重,
Figure PCTCN2022076348-appb-000010
表示对应的偏执项。
Among them, s represents a scalar scaling factor, “*” represents the product of corresponding elements, the function F split (·) represents the slice of the feature map according to the channel dimension, and the function F reshape (·) is used to adjust the feature map or feature vector into a pre-set Let shape, "T" means transpose,
Figure PCTCN2022076348-appb-000009
Indicates the corresponding weight,
Figure PCTCN2022076348-appb-000010
Denotes the corresponding paranoid term.
3)根据公式组(3)将t时刻的输入特征图X (i,t)调整成特征向量(即第一特征向量)
Figure PCTCN2022076348-appb-000011
将t-1时刻记忆特征图X (m,t-1)调整成特征向量(即第二特征向量)
Figure PCTCN2022076348-appb-000012
3) According to the formula group (3), adjust the input feature map X (i, t) at time t into a feature vector (ie, the first feature vector)
Figure PCTCN2022076348-appb-000011
Adjust the memory feature map X (m,t-1) at time t-1 into a feature vector (ie, the second feature vector)
Figure PCTCN2022076348-appb-000012
x (i,t)=F reshape(X (i,t)) x (i,t) = F reshape (X (i,t) )
x (m,t-1)=F reshape(X (m,t-1))    (3) x (m,t-1) = F reshape (X (m,t-1) ) (3)
4)根据公式组(4)计算出基于时间注意力的输入特征向量
Figure PCTCN2022076348-appb-000013
以及基于时间注意力的记忆特征向量
Figure PCTCN2022076348-appb-000014
4) Calculate the input feature vector based on time attention according to formula group (4)
Figure PCTCN2022076348-appb-000013
and memory feature vectors based on temporal attention
Figure PCTCN2022076348-appb-000014
Figure PCTCN2022076348-appb-000015
Figure PCTCN2022076348-appb-000015
5)根据公式组(5)分别将特征向量x (qk_i,t)和x (qk_m,t-1)调整成对应的特征图
Figure PCTCN2022076348-appb-000016
(即第一特征图)和
Figure PCTCN2022076348-appb-000017
(即第二特征图):
5) Adjust the feature vectors x (qk_i,t) and x (qk_m,t-1) to the corresponding feature maps according to the formula group (5)
Figure PCTCN2022076348-appb-000016
(i.e. the first feature map) and
Figure PCTCN2022076348-appb-000017
(i.e. the second feature map):
Figure PCTCN2022076348-appb-000018
Figure PCTCN2022076348-appb-000018
6)根据公式(6)计算出信息选择门
Figure PCTCN2022076348-appb-000019
用于有选择的将t-1时刻记忆特征图X (qk_m,t-1)中的信息注入到t时刻的输入特征图X (qk_i,t)中:
6) Calculate the information selection gate according to formula (6)
Figure PCTCN2022076348-appb-000019
It is used to selectively inject the information in the memory feature map X (qk_m,t-1) at time t-1 into the input feature map X (qk_i,t) at time t:
G s=δ sig(W (qk_ms,t-1)X (qk_m,t-1)+b (qk_ms,t-1)+W (qk_is,t)X (qk_i,t)+b (qk_is,t))   (6) G s =δ sig (W (qk_ms,t-1) X (qk_m,t-1) +b (qk_ms,t-1) +W (qk_is,t) X (qk_i,t) +b (qk_is,t ) ) (6)
其中,函数δ sig(·)表示sigmoid激活函数,W (qk_ms,t-1)和W (qk_is,t)表示对应的权 重,b (qk_ms,t-1)和b (qk_is,t)表示对应的偏执项。 Among them, the function δ sig (·) represents the sigmoid activation function, W (qk_ms,t-1) and W (qk_is,t) represent the corresponding weights, b (qk_ms,t-1) and b (qk_is,t) represent the corresponding paranoia item.
7) According to formula (7), compute a new feature map X_(im,t) containing the memory feature map information:
X_(im,t) = δ_tanh(W_(qk_imi,t) X_(qk_i,t) + b_(qk_imi,t) + G_s * (W_(qk_imm,t-1) X_(qk_m,t-1) + b_(qk_imm,t-1)))    (7)
where the function δ_tanh(·) denotes the tanh activation function, W_(qk_imi,t) and W_(qk_imm,t-1) denote the corresponding weights, and b_(qk_imi,t) and b_(qk_imm,t-1) denote the corresponding bias terms.
8) According to formula group (8) (formula images PCTCN2022076348-appb-000021 to -000023), compute the memory gate, which is used to update the information of the memory feature map X_(qk_m,t-1) at time t-1 into the memory feature map at time t (i.e., the second memory feature map), where W_(qk_ir,t) and W_(qk_mr,t-1) denote the corresponding weights, and b_(qk_ir,t) and b_(qk_mr,t-1) denote the corresponding bias terms.
9) According to formula (9) (formula images PCTCN2022076348-appb-000024 to -000026), compute the output gate, which is used to update the input feature map X_(qk_i,t) and obtain an updated new feature map (i.e., the updated feature map) that serves as the input feature map at the next moment, where W_(qk_io,t) and W_(qk_mo,t-1) denote the corresponding weights, and b_(qk_io,t) and b_(qk_mo,t-1) denote the corresponding bias terms.
10) According to formula (10) (formula image PCTCN2022076348-appb-000028), update the temporal feature map at time t-1 (formula image PCTCN2022076348-appb-000027) to the temporal feature map at time t (i.e., the second temporal feature map) X_(time,t), where W_(time,t-1), W_(io,t) and W_(m,t) denote the corresponding weights, and b_(time,t-1), b_(io,t) and b_(m,t) denote the corresponding bias terms.
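To make the gating flow above easier to follow, a minimal PyTorch-style sketch is given below. Only formulas (6) and (7) are reproduced verbatim in this text; formulas (8)–(10) are published as images here, so the memory-gate and output-gate updates in the sketch use an assumed GRU-like form, the learned weights are modeled as 1×1 convolutions, and every class and variable name (TemporalAttentionGate, w_is, w_ms, ...) is illustrative rather than taken from the original.

```python
import torch
import torch.nn as nn

class TemporalAttentionGate(nn.Module):
    """Hypothetical sketch of injecting the t-1 memory feature map
    X_(qk_m,t-1) into the t input feature map X_(qk_i,t)."""

    def __init__(self, channels: int):
        super().__init__()
        def conv():
            # each learned weight W_* / bias b_* pair is modeled as a 1x1 convolution
            return nn.Conv2d(channels, channels, kernel_size=1)
        self.w_is, self.w_ms = conv(), conv()    # formula (6): information selection gate
        self.w_imi, self.w_imm = conv(), conv()  # formula (7): information injection
        self.w_ir, self.w_mr = conv(), conv()    # formula (8): memory gate (assumed form)
        self.w_io, self.w_mo = conv(), conv()    # formula (9): output gate (assumed form)

    def forward(self, x_i, x_m):
        # (6) information selection gate G_s
        g_s = torch.sigmoid(self.w_is(x_i) + self.w_ms(x_m))
        # (7) new feature map X_(im,t) containing memory information
        x_im = torch.tanh(self.w_imi(x_i) + g_s * self.w_imm(x_m))
        # (8) memory gate -> second memory feature map (assumed GRU-like update)
        g_r = torch.sigmoid(self.w_ir(x_i) + self.w_mr(x_m))
        x_m_new = g_r * x_m + (1.0 - g_r) * x_im
        # (9) output gate -> updated feature map, used as the input at the next moment (assumed)
        g_o = torch.sigmoid(self.w_io(x_i) + self.w_mo(x_m))
        x_i_new = g_o * x_im
        return x_i_new, x_m_new
```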
In order to exploit the global spatial structure information of the feature map and the dependencies between spatial structures when inferring camera motion, the spatio-temporal correlation module shown in Fig. 4 is constructed: global spatial correlation weights are used to model the spatial context information, and the dependencies between the feature map channels of the stacked frames are modeled to constrain the temporal information between the stacked frames.
In one embodiment, the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into the first and second spatio-temporal feature maps with spatial correlation, respectively.
Modeling the updated feature map as the first spatio-temporal feature map with spatial correlation includes:
slicing the updated feature map along the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
reshaping the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector, respectively;
calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weighting the third sub-feature vector with the first spatial correlation matrix to obtain a first spatially correlated feature vector;
determining a first spatio-temporal feature vector according to the first spatially correlated feature vector and the third feature vector;
reshaping the first spatio-temporal feature vector into the first spatio-temporal feature map with spatial correlation.
Modeling the second memory feature map as the second spatio-temporal feature map with spatial correlation includes:
slicing the second memory feature map along the channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
reshaping the second memory feature map, the fourth sub-feature map, the fifth sub-feature map and the sixth sub-feature map into a fourth feature vector, a fourth sub-feature vector, a fifth sub-feature vector and a sixth sub-feature vector, respectively;
calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weighting the sixth sub-feature vector with the second spatial correlation matrix to obtain a second spatially correlated feature vector;
determining a second spatio-temporal feature vector according to the second spatially correlated feature vector and the fourth feature vector;
reshaping the second spatio-temporal feature vector into the second spatio-temporal feature map with spatial correlation.
Exemplarily, referring to Fig. 4, the calculation is as follows:
1) According to formula group (11) (formula images PCTCN2022076348-appb-000032 and -000033), transform the input feature map (i.e., the updated feature map or the second memory feature map) to obtain an intermediate feature map and divide it equally into three sub-feature maps (i.e., the first/fourth, second/fifth and third/sixth sub-feature maps), where the function F_split(·) slices the feature map along the channel dimension, W_mid is the corresponding weight and b_mid denotes the corresponding bias term.
2) According to formula group (12) (formula image PCTCN2022076348-appb-000035), reshape the corresponding feature maps into feature vectors, where the function F_reshape(·) reshapes a feature map or feature vector into a preset shape.
3) According to formula (13) (formula image PCTCN2022076348-appb-000039), compute the spatial correlation matrix between the first/fourth sub-feature map and the second/fifth sub-feature map, where s denotes a scalar scaling factor.
4) Use the spatial correlation matrix computed above to weight the third/sixth sub-feature vector, obtaining the spatially correlated feature vectors (including the first spatially correlated feature vector and the second spatially correlated feature vector), as shown in formula (14) (formula image PCTCN2022076348-appb-000042).
5) According to formula (15) (formula image PCTCN2022076348-appb-000043), model the dependencies between the spatial structures of the feature map: by modeling the dependencies between the feature map channels of the stacked frames, the temporal information between the stacked frames is constrained and the first/second spatio-temporal feature vector x_time_corr is computed, where F_C(·) consists of two layers of one-dimensional convolution with kernel size 3 and stride 1 together with activation functions.
6) Finally, reshape the spatially correlated first/second spatio-temporal feature vector x_time_corr into the first/second spatio-temporal feature map with spatial correlation (formula image PCTCN2022076348-appb-000044).
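As a rough illustration of steps 1)–6), the following PyTorch-style sketch re-expresses the slicing, spatial correlation weighting and channel-wise modeling. Formulas (11)–(15) are published only as images here, so the 1×1 convolution standing in for W_mid, the softmax form of the correlation matrix, the per-position application of F_C and the residual combination are all assumptions, and every name in the sketch is illustrative.

```python
import torch
import torch.nn as nn

class SpatioTemporalCorrelation(nn.Module):
    """Hypothetical sketch: slice a feature map into three sub-maps, build a
    spatial correlation matrix from the first two, weight the third with it,
    then model channel-wise dependencies with two 1-D convolutions (F_C)."""

    def __init__(self, channels: int, scale: float = 1.0):
        super().__init__()
        # formula (11): W_mid / b_mid modeled as a 1x1 convolution producing three slices
        self.w_mid = nn.Conv2d(channels, 3 * channels, kernel_size=1)
        # F_C: two 1-D convolutions, kernel size 3, stride 1, with activations
        self.f_c = nn.Sequential(
            nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=1), nn.GELU(),
            nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=1), nn.GELU(),
        )
        self.scale = scale  # scalar scaling factor s

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = torch.chunk(self.w_mid(x), 3, dim=1)      # (11): three sub-feature maps
        q, k, v = (t.reshape(b, c, h * w) for t in (q, k, v))  # (12): maps -> vectors
        corr = torch.softmax(q.transpose(1, 2) @ k / self.scale, dim=-1)  # (13): (B, HW, HW), softmax assumed
        v_corr = v @ corr                                    # (14): weight the third sub-feature vector
        # (15): channel-wise dependencies, applied here one spatial position at a time
        seq = v_corr.permute(0, 2, 1).reshape(b * h * w, 1, c)
        seq = self.f_c(seq).reshape(b, h * w, c).permute(0, 2, 1)
        x_time = seq + x.reshape(b, c, h * w)                # assumed residual combination
        return x_time.reshape(b, c, h, w)                    # 6): back to a spatio-temporal feature map
```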
S130. Calculating the first error with the error calculation module based on the first image frame, the second image frame, the first depth weight and the first motion weight includes:
inputting the first image frame and the second image frame into the pre-built depth estimation network, obtaining the first scene depth of the first image frame and the first encoder feature map of the first image frame according to the first image frame and the first depth weight, and obtaining the second scene depth of the second image frame and the second encoder feature map of the second image frame according to the second image frame and the first depth weight;
inputting the first image frame and the second image frame into the pre-built camera motion network, and obtaining the first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the first motion weight;
calculating the first error with the error calculation module based on the first scene depth, the second scene depth and the first relative pose.
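A compact sketch of this call flow is given below (S150 reuses the same flow with the updated weights). The names depth_net, pose_net and error_module stand in for the pre-built depth estimation network, camera motion network and error calculation module; their signatures and the helper name compute_error are assumptions made for illustration only.

```python
def compute_error(frame_prev, frame_cur, depth_weights, motion_weights,
                  depth_net, pose_net, error_module, intrinsics):
    """Run one forward pass with a given weight set and return the quantities
    that S130/S150 feed into the error calculation module (a sketch)."""
    depth_net.load_state_dict(depth_weights)
    pose_net.load_state_dict(motion_weights)

    depth_prev, enc_feat_prev = depth_net(frame_prev)   # first scene depth + first encoder feature map
    depth_cur, enc_feat_cur = depth_net(frame_cur)      # second scene depth + second encoder feature map
    rel_pose = pose_net(frame_prev, frame_cur)          # first relative pose (t-1 -> t)

    error = error_module(frame_prev, frame_cur,
                         depth_prev, depth_cur,
                         enc_feat_prev, enc_feat_cur,
                         rel_pose, intrinsics)
    return error, depth_cur, rel_pose
```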
In one embodiment, the total error is determined according to an image synthesis error, a scene depth structure consistency error, a feature perception loss error and a smoothing loss error, where the total error includes the first error and the second error.
Specifically, determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error includes:
obtaining the first image coordinates of the first image frame and the second image coordinates of the second image frame;
determining the first world coordinates of the first image frame according to the first image coordinates, the camera intrinsics and the first scene depth;
determining the second world coordinates of the second image frame according to the second image coordinates, the camera intrinsics and the second scene depth;
affine-transforming the first world coordinates of the first image frame to the plane of the second image frame, and determining the affine-transformed third world coordinates;
affine-transforming the second world coordinates of the second image frame to the plane of the first image frame, and determining the affine-transformed fourth world coordinates;
projecting the third world coordinates and the fourth world coordinates onto the two-dimensional plane respectively, to obtain the first affine-transformed scene depth and the second affine-transformed scene depth as well as the corresponding first affine-transformed image coordinates and second affine-transformed image coordinates;
determining the scene depth structure consistency error, the first depth structure inconsistency weight and the second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining the first camera-flow consistency occlusion mask and the second camera-flow consistency occlusion mask according to the first image coordinates of the first image frame, the second affine-transformed image coordinates, the second image coordinates of the second image frame and the first affine-transformed image coordinates;
determining the image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera-flow consistency occlusion mask and the second camera-flow consistency occlusion mask;
determining the feature perception loss error according to the first image frame, the second image frame, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining the smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
Exemplarily, for convenience of description, assume that the trained scene depth fitting function is D_t = F_D(I_t | W_D), where I_t denotes the two-dimensional image whose scene depth at time t is to be recovered, W_D denotes the learned weight parameters of the scene depth fitting function F_D(·), D_t denotes the recovered scene depth of the two-dimensional image I_t at time t, and X_(enc,t) denotes the feature map of the image frame I_t at time t output by the encoder. The pose transforming the image frame I_(t-1) at time t-1 to the image frame I_t at time t is denoted by the symbol in formula image PCTCN2022076348-appb-000045, W_T denotes the learned weight parameters of the pose transformation function F_T(·), the camera intrinsics are denoted by K, P_(xy,t-1) denotes the image coordinates of the image frame I_(t-1), P_(xyz,t-1) the world coordinates of the image frame I_(t-1), P_(xy,t) the image coordinates of the image frame I_t, and P_(xyz,t) the world coordinates of the image frame I_t. The total error is calculated as follows:
1) According to formula (16) (formula image PCTCN2022076348-appb-000046), compute the world coordinates (i.e., the first world coordinates) P_(xyz,t-1) of the image frame I_(t-1) and the world coordinates (i.e., the second world coordinates) P_(xyz,t) of the image frame I_t, where "*" denotes the product of corresponding matrix elements.
2) According to formula (17) (formula image PCTCN2022076348-appb-000047), affine-transform the world coordinates P_(xyz,t-1) of the image frame I_(t-1) to the plane of the image frame I_t to obtain the affine-transformed world coordinates (i.e., the third world coordinates) P_(proj_xyz,t), and affine-transform the world coordinates P_(xyz,t) of the image frame I_t to the plane of the image frame I_(t-1) to obtain the affine-transformed world coordinates (i.e., the fourth world coordinates) P_(proj_xyz,t-1).
3) Project the world coordinates P_(proj_xyz,t) and P_(proj_xyz,t-1) obtained by the affine transformation onto the two-dimensional plane, obtaining the affine-transformed scene depths D_(proj,t) (i.e., the first affine-transformed scene depth) and D_(proj,t-1) (i.e., the second affine-transformed scene depth), as well as the corresponding affine-transformed image coordinates P_(proj_xy,t) (i.e., the first affine-transformed image coordinates) and P_(proj_xy,t-1) (i.e., the second affine-transformed image coordinates).
4) Synthesize the image I_(syn,t) from the image frame I_(t-1) and P_(proj_xy,t-1); synthesize the feature map X_(syn_enc,t) from the encoder feature map X_(enc,t-1) and P_(proj_xy,t-1); synthesize the depth map D_(syn,t) from the estimated depth map D_(t-1) and P_(proj_xy,t-1); synthesize the image I_(syn,t-1) from the image frame I_t and P_(proj_xy,t); synthesize the feature map X_(syn_enc,t-1) from the encoder feature map X_(enc,t) and P_(proj_xy,t); synthesize the depth map D_(syn,t-1) from the estimated depth map D_t and P_(proj_xy,t); compute the forward camera flow U_forward from the image coordinates of I_(t-1) and P_(proj_xy,t-1); compute the backward camera flow U_backward from the image coordinates of I_t and P_(proj_xy,t); synthesize the forward camera flow U_syn_forward from U_forward and P_(proj_xy,t); and synthesize the backward camera flow U_syn_backward from U_backward and P_(proj_xy,t-1).
5) According to formula group (18) (formula image PCTCN2022076348-appb-000048), compute the first camera-flow consistency occlusion mask M_(occ,t-1) and the second camera-flow consistency occlusion mask M_(occ,t).
6) According to formula group (19) (formula images PCTCN2022076348-appb-000049 to -000051), compute the scene depth structure consistency error E_D as well as the first depth structure inconsistency weight M_(D,t-1) and the second depth structure inconsistency weight M_(D,t).
7) According to formula (20) (formula images PCTCN2022076348-appb-000052 and -000053), compute the image synthesis error E_I.
8) According to formula (21), compute the feature perception loss error E_X:
E_X = ERF(X_(enc,t), X_(syn_enc,t)) + ERF(X_(enc,t-1), X_(syn_enc,t-1))    (21)
9) According to formula (22) (formula image PCTCN2022076348-appb-000054), compute the smoothing loss error E_S.
10) According to formula (23) (formula image PCTCN2022076348-appb-000055), compute the total error E.
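For readers who prefer code, the sketch below traces steps 1)–3) and the combination in step 10) in NumPy-style pseudocode. Only the back-projection and affine transformation follow directly from formulas (16)–(17) and the surrounding text; formulas (18)–(23) are published as images here, so the weighted combination of the four error terms (and its weights) is written in a commonly used fully-unsupervised form and should be read as an assumption, and all function names are illustrative.

```python
import numpy as np

def backproject(depth, intrinsics, pix_coords):
    """Formula (16): lift homogeneous pixel coordinates (3, H*W) to world
    coordinates using the scene depth; '*' is an element-wise product."""
    rays = np.linalg.inv(intrinsics) @ pix_coords           # (3, H*W)
    return depth.reshape(1, -1) * rays                      # P_xyz

def affine_transform(points_xyz, pose):
    """Formula (17): move the world points of one frame onto the other frame
    using the relative pose (4x4 homogeneous matrix)."""
    homog = np.vstack([points_xyz, np.ones((1, points_xyz.shape[1]))])
    return (pose @ homog)[:3]                                # P_proj_xyz

def project(points_xyz, intrinsics):
    """Step 3): project transformed world points back to the image plane,
    returning the affine-transformed depth and pixel coordinates."""
    cam = intrinsics @ points_xyz
    depth_proj = cam[2:3]
    pix_proj = cam[:2] / np.clip(depth_proj, 1e-6, None)
    return depth_proj, pix_proj

def total_error(e_image, e_depth_struct, e_feature, e_smooth,
                w_i=1.0, w_d=1.0, w_x=1.0, w_s=1e-3):
    """Step 10): combine the four error terms; the weights w_* are
    illustrative, since formula (23) is not reproduced in this text."""
    return w_i * e_image + w_d * e_depth_struct + w_x * e_feature + w_s * e_smooth
```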
S140. Use the first error as a guiding signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network, obtaining the second depth weight and the second motion weight.
S150. Calculating the second error with the error calculation module based on the first image frame, the second image frame, the second depth weight and the second motion weight may include:
inputting the first image frame and the second image frame into the pre-built depth estimation network, obtaining the third scene depth of the first image frame and the third encoder feature map of the first image frame according to the first image frame and the second depth weight, and obtaining the fourth scene depth of the second image frame and the fourth encoder feature map of the second image frame according to the second image frame and the second depth weight;
inputting the first image frame and the second image frame into the pre-built camera motion network, and obtaining the second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
calculating the second error with the error calculation module based on the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder feature map, the fourth encoder feature map and the second relative pose.
This step follows the specific calculation process of step S130, except that the first depth weight and the first motion weight in S130 are replaced by the second depth weight and the second motion weight, so it is not repeated here.
S160. Determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
Specifically, if the first error is greater than the second error, the second scene depth is taken as the scene depth of the second image frame, and the first relative pose is taken as the relative pose between the first image frame and the second image frame;
if the first error is less than or equal to the second error, the fourth scene depth is taken as the scene depth of the second image frame, and the second relative pose is taken as the relative pose between the first image frame and the second image frame.
Referring to Fig. 5, which shows a schematic diagram of the scene depth reasoning process, the process of inferring the scene depth is as follows:
1) Use the weight W_D of the trained depth estimation network and the weight W_T of the trained camera motion network as the model weights of the depth estimation network and the camera motion network during inference and, according to S130, compute the total error E, the pose transformation matrix from the historical frame to the current frame (formula image PCTCN2022076348-appb-000056), and the scene depth D_t of the current frame.
2) Use the total error computed in step 1) as a guiding signal to update the weights of the depth estimation network and the camera motion network, obtaining new model weights (formula images PCTCN2022076348-appb-000057 and -000058).
3) With the model weights obtained in step 2) (formula images PCTCN2022076348-appb-000059 and -000060) and according to S150, compute the total error at this point (formula image PCTCN2022076348-appb-000061), the pose transformation matrix from the historical frame to the current frame (formula image PCTCN2022076348-appb-000062), and the scene depth of the current frame (formula image PCTCN2022076348-appb-000063).
4) Decide the finally output scene depth of the current frame by comparing the magnitudes of the total error E and the total error obtained with the updated weights (formula image PCTCN2022076348-appb-000064).
In the embodiment of the present application, the scene depth is recovered from two-dimensional images in a fully unsupervised manner: the temporal attention module injects the historical frame information held in the memory unit into the current input unit, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and to reduce the influence of erroneous affine transformations caused by inaccurate poses; during inference, online decision reasoning is used to improve the generalization ability of the algorithm to unknown scenes.
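The online decision reasoning described above can be summarized by the following sketch, which reuses the hypothetical compute_error helper from the S130 sketch. The optimizer, the learning rate and the use of state_dict copies are assumptions made for illustration; the decision in step 4) mirrors the rule stated in S160.

```python
import copy
import torch

def infer_scene_depth(frame_prev, frame_cur, depth_net, pose_net, error_module,
                      intrinsics, trained_depth_w, trained_motion_w, lr=1e-4):
    """Hypothetical sketch of online decision reasoning for one frame pair."""
    # 1) total error E, pose and depth with the trained weights W_D, W_T (as in S130)
    err_1, depth_1, pose_1 = compute_error(frame_prev, frame_cur,
                                           trained_depth_w, trained_motion_w,
                                           depth_net, pose_net, error_module, intrinsics)

    # 2) use E as a guiding signal to jointly update both networks (optimizer is an assumption)
    optimizer = torch.optim.Adam(list(depth_net.parameters()) + list(pose_net.parameters()), lr=lr)
    optimizer.zero_grad()
    err_1.backward()
    optimizer.step()
    updated_depth_w = copy.deepcopy(depth_net.state_dict())
    updated_motion_w = copy.deepcopy(pose_net.state_dict())

    # 3) total error, pose and depth with the updated weights (as in S150)
    err_2, depth_2, pose_2 = compute_error(frame_prev, frame_cur,
                                           updated_depth_w, updated_motion_w,
                                           depth_net, pose_net, error_module, intrinsics)

    # 4) decision, following S160 as written: if the first error is greater than the
    #    second, output the depth/pose from the first weight set; otherwise output the
    #    depth/pose from the updated weight set
    if err_1.item() > err_2.item():
        return depth_1, pose_1
    return depth_2, pose_2
```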
Referring to Fig. 6, which shows a schematic structural diagram of a scene depth reasoning apparatus based on historical information according to an embodiment of the present application.
As shown in Fig. 6, the scene depth reasoning apparatus 600 based on historical information may include:
a first acquisition module 610, configured to acquire a first image frame and a second image frame of the image to be tested, the first image frame being the image frame at the moment before the second image frame;
a second acquisition module 620, configured to acquire the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
a first processing module 630, configured to calculate the first error with the error calculation module based on the first image frame, the second image frame, the first depth weight and the first motion weight;
an update module 640, configured to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network using the first error as a guiding signal, to obtain a second depth weight and a second motion weight;
a second processing module 650, configured to calculate the second error with the error calculation module based on the first image frame, the second image frame, the second depth weight and the second motion weight;
a determination module 660, configured to determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
Optionally, the camera motion network includes an encoder, a temporal attention module and a spatio-temporal correlation module;
the encoder is used to extract features of the stacked image frame to obtain a stacked feature map, the stacked image frame being obtained by stacking the first image frame and the second image frame along the channel dimension;
the temporal attention module is used to establish a global dependency between the information of the historical memory unit and the information of the current input unit, to inject the globally relevant information of the historical memory unit into the current input unit through the update unit, and at the same time to store the globally relevant information of the current input unit into the historical memory unit as the historical memory unit at the next moment; the current input unit includes the stacked feature map, the stacked feature map is updated to the updated feature map by the update unit, the historical memory unit includes the first memory feature map and the first temporal feature map, and the historical memory unit at the next moment includes the second memory feature map and the second temporal feature map;
the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into the first and second spatio-temporal feature maps with spatial correlation, respectively.
Optionally, the scene depth reasoning apparatus 600 based on historical information is further configured to:
inject the feature information of the stacked feature map and the feature information of the first memory feature map into the first temporal feature map to obtain a third temporal feature map;
determine a temporal attention feature vector according to the third temporal feature map;
determine a first feature vector according to the stacked feature map;
determine a second feature vector according to the first memory feature map;
determine a temporal-attention-based input feature vector according to the first feature vector and the temporal attention feature vector;
determine a temporal-attention-based memory feature vector according to the second feature vector and the temporal attention feature vector;
reshape the temporal-attention-based input feature vector and the temporal-attention-based memory feature vector into the corresponding first feature map and second feature map, respectively;
determine the updated feature map and the second memory feature map according to the first feature map and the second feature map;
update the third temporal feature map to the second temporal feature map according to the updated feature map and the second memory feature map.
Optionally, the scene depth reasoning apparatus 600 based on historical information is further configured to:
slice the updated feature map along the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
reshape the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector, respectively;
calculate a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weight the third sub-feature vector with the first spatial correlation matrix to obtain a first spatially correlated feature vector;
determine a first spatio-temporal feature vector according to the first spatially correlated feature vector and the third feature vector;
reshape the first spatio-temporal feature vector into the first spatio-temporal feature map with spatial correlation;
slice the second memory feature map along the channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
reshape the second memory feature map, the fourth sub-feature map, the fifth sub-feature map and the sixth sub-feature map into a fourth feature vector, a fourth sub-feature vector, a fifth sub-feature vector and a sixth sub-feature vector, respectively;
calculate a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weight the sixth sub-feature vector with the second spatial correlation matrix to obtain a second spatially correlated feature vector;
determine a second spatio-temporal feature vector according to the second spatially correlated feature vector and the fourth feature vector;
reshape the second spatio-temporal feature vector into the second spatio-temporal feature map with spatial correlation.
Optionally, the first processing module 630 is further configured to:
input the first image frame and the second image frame into the pre-built depth estimation network, obtain the first scene depth of the first image frame and the first encoder feature map of the first image frame according to the first image frame and the first depth weight, and obtain the second scene depth of the second image frame and the second encoder feature map of the second image frame according to the second image frame and the first depth weight;
input the first image frame and the second image frame into the pre-built camera motion network, and obtain the first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the first motion weight;
calculate the first error with the error calculation module based on the first image frame, the second image frame, the first scene depth, the second scene depth, the first encoder feature map, the second encoder feature map and the first relative pose.
Optionally, the second processing module 650 is further configured to:
input the first image frame and the second image frame into the pre-built depth estimation network, obtain the third scene depth of the first image frame and the third encoder feature map of the first image frame according to the first image frame and the second depth weight, and obtain the fourth scene depth of the second image frame and the fourth encoder feature map of the second image frame according to the second image frame and the second depth weight;
input the first image frame and the second image frame into the pre-built camera motion network, and obtain the second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
calculate the second error with the error calculation module based on the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder feature map, the fourth encoder feature map and the second relative pose.
Optionally, the determination module 660 is further configured to:
if the first error is greater than the second error, take the second scene depth as the scene depth of the second image frame and take the first relative pose as the relative pose between the first image frame and the second image frame;
if the first error is less than or equal to the second error, take the fourth scene depth as the scene depth of the second image frame and take the second relative pose as the relative pose between the first image frame and the second image frame.
Optionally, the total error includes the first error and the second error, and the total error is determined according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
Optionally, the first processing module 630 or the second processing module 650 is further configured to:
obtain the first image coordinates of the first image frame and the second image coordinates of the second image frame;
determine the first world coordinates of the first image frame according to the first image coordinates, the camera intrinsics and the first scene depth;
determine the second world coordinates of the second image frame according to the second image coordinates, the camera intrinsics and the second scene depth;
affine-transform the first world coordinates of the first image frame to the plane of the second image frame, and determine the affine-transformed third world coordinates;
affine-transform the second world coordinates of the second image frame to the plane of the first image frame, and determine the affine-transformed fourth world coordinates;
project the third world coordinates and the fourth world coordinates onto the two-dimensional plane respectively, to obtain the first affine-transformed scene depth and the second affine-transformed scene depth as well as the corresponding first affine-transformed image coordinates and second affine-transformed image coordinates;
determine the scene depth structure consistency error, the first depth structure inconsistency weight and the second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determine the first camera-flow consistency occlusion mask and the second camera-flow consistency occlusion mask according to the first image coordinates of the first image frame, the second affine-transformed image coordinates, the second image coordinates of the second image frame and the first affine-transformed image coordinates;
determine the image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera-flow consistency occlusion mask and the second camera-flow consistency occlusion mask;
determine the feature perception loss error according to the first image frame, the second image frame, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determine the smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
determine the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
The scene depth reasoning apparatus based on historical information provided in this embodiment can execute the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 7, a schematic structural diagram of an electronic device 300 suitable for implementing the embodiments of the present application is shown.
As shown in FIG. 7, the electronic device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage section 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the device 300. The CPU 301, the ROM 302 and the RAM 303 are connected to one another through a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse and the like; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a speaker; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card or a modem. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 310 as needed, so that a computer program read therefrom is installed into the storage section 308 as needed.
In particular, according to an embodiment of the present disclosure, the process described above with reference to FIG. 1 may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the above scene depth reasoning method based on historical information. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 309 and/or installed from the removable medium 311.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or by hardware. The described units or modules may also be provided in a processor, and in some cases the names of these units or modules do not constitute a limitation on the units or modules themselves.
The systems, apparatuses, modules or units described in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a notebook computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
As another aspect, the present application also provides a storage medium, which may be the storage medium contained in the apparatus of the above embodiments, or a storage medium that exists independently and is not assembled into the device. The storage medium stores one or more programs, and the programs are used by one or more processors to execute the scene depth reasoning method based on historical information described in the present application.
The storage medium includes permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should be noted that the terms "include", "comprise" or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a progressive manner; the same or similar parts of the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively simple, and reference may be made to the relevant description of the method embodiment.

Claims (10)

  1. A method for scene depth reasoning based on historical information, characterized in that the method comprises:
    acquiring a first image frame and a second image frame of an image to be tested, the first image frame being the image frame at the moment before the second image frame;
    acquiring a first depth weight of a pre-built depth estimation network and a first motion weight of a pre-built camera motion network;
    calculating a first error with an error calculation module based on the first image frame, the second image frame, the first depth weight and the first motion weight;
    jointly updating the first depth weight of the depth estimation network and the first motion weight of the camera motion network using the first error as a guiding signal, to obtain a second depth weight and a second motion weight;
    calculating a second error with the error calculation module based on the first image frame, the second image frame, the second depth weight and the second motion weight;
    determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
  2. The method according to claim 1, characterized in that the camera motion network comprises an encoder, a temporal attention module and a spatio-temporal correlation module;
    the encoder is used to extract features of the stacked image frame to obtain a stacked feature map, the stacked image frame being obtained by stacking the first image frame and the second image frame along the channel dimension;
    the temporal attention module is used to establish a global dependency between the information of the historical memory unit and the information of the current input unit, to inject the globally relevant information of the historical memory unit into the current input unit through the update unit, and at the same time to store the globally relevant information of the current input unit into the historical memory unit as the historical memory unit at the next moment; the current input unit comprises the stacked feature map, the stacked feature map is updated to an updated feature map by the update unit, the historical memory unit comprises a first memory feature map and a first temporal feature map, and the historical memory unit at the next moment comprises a second memory feature map and a second temporal feature map;
    the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into first and second spatio-temporal feature maps with spatial correlation, respectively.
  3. The method according to claim 2, characterized in that establishing a global dependency between the information of the historical memory unit and the information of the current input unit, injecting the globally relevant information of the historical memory unit into the current input unit through the update unit, and at the same time storing the globally relevant information of the current input unit into the historical memory unit as the historical memory unit at the next moment comprises:
    injecting the feature information of the stacked feature map and the feature information of the first memory feature map into the first temporal feature map to obtain a third temporal feature map;
    determining a temporal attention feature vector according to the third temporal feature map;
    determining a first feature vector according to the stacked feature map;
    determining a second feature vector according to the first memory feature map;
    determining a temporal-attention-based input feature vector according to the first feature vector and the temporal attention feature vector;
    determining a temporal-attention-based memory feature vector according to the second feature vector and the temporal attention feature vector;
    reshaping the temporal-attention-based input feature vector and the temporal-attention-based memory feature vector into a corresponding first feature map and second feature map, respectively;
    determining the updated feature map and the second memory feature map according to the first feature map and the second feature map;
    updating the third temporal feature map to the second temporal feature map according to the updated feature map and the second memory feature map.
  4. The method according to claim 2, wherein modeling the updated feature map into the first spatio-temporal feature map with spatial correlation comprises:
    slicing the updated feature map along the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
    reshaping the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector, respectively;
    calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
    weighting the third sub-feature vector with the first spatial correlation matrix to obtain a first spatially correlated feature vector;
    determining a first spatio-temporal feature vector according to the first spatially correlated feature vector and the third feature vector;
    reshaping the first spatio-temporal feature vector into the first spatio-temporal feature map with spatial correlation;
    and wherein modeling the second memory feature map into the second spatio-temporal feature map with spatial correlation comprises:
    slicing the second memory feature map along the channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
    reshaping the second memory feature map, the fourth sub-feature map, the fifth sub-feature map and the sixth sub-feature map into a fourth feature vector, a fourth sub-feature vector, a fifth sub-feature vector and a sixth sub-feature vector, respectively;
    calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
    weighting the sixth sub-feature vector with the second spatial correlation matrix to obtain a second spatially correlated feature vector;
    determining a second spatio-temporal feature vector according to the second spatially correlated feature vector and the fourth feature vector;
    reshaping the second spatio-temporal feature vector into the second spatio-temporal feature map with spatial correlation.
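Claim 4 slices a feature map along the channel dimension into three sub-feature maps and uses two of them to weight the third, which resembles the familiar query/key/value pattern of self-attention. A minimal sketch under that assumption follows; the even channel split, the softmax scaling, and the residual combination with the full feature map are assumptions rather than the claimed formulation.

```python
import torch

def spatial_correlation_sketch(feat):
    """Assumed reading of claim 4: slice the feature map into query/key/value
    sub-feature maps along channels, compute a spatial correlation matrix, use
    it to weight the value, and combine the result with the full feature map."""
    b, c, h, w = feat.shape
    assert c % 3 == 0, "sketch assumes the channel count splits evenly into three"
    q_map, k_map, v_map = torch.chunk(feat, 3, dim=1)        # first / second / third sub-feature maps
    q = q_map.flatten(2).transpose(1, 2)                     # (B, HW, C/3) first sub-feature vector
    k = k_map.flatten(2)                                     # (B, C/3, HW) second sub-feature vector
    v = v_map.flatten(2).transpose(1, 2)                     # (B, HW, C/3) third sub-feature vector
    corr = torch.softmax(q @ k / (c // 3) ** 0.5, dim=-1)    # spatial correlation matrix (B, HW, HW)
    weighted = (corr @ v).transpose(1, 2).reshape(b, c // 3, h, w)   # spatially correlated features
    # combining with the full feature vector via a residual is an assumption
    return feat + torch.cat([weighted, weighted, weighted], dim=1)   # spatio-temporal feature map

# usage on an updated feature map with an assumed channel count
spatio_temporal = spatial_correlation_sketch(torch.rand(1, 66, 32, 104))
```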
  5. The method according to claim 1, wherein
    calculating the first error by means of the error calculation module from the first image frame, the second image frame, the first depth weight and the first motion weight comprises:
    inputting the first image frame and the second image frame into the pre-built depth estimation network, obtaining a first scene depth of the first image frame and a first encoder feature map of the first image frame according to the first image frame and the first depth weight, and obtaining a second scene depth of the second image frame and a second encoder feature map of the second image frame according to the second image frame and the first depth weight;
    inputting the first image frame and the second image frame into the pre-built camera motion network, and obtaining a first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the first motion weight;
    calculating the first error by means of the error calculation module from the first image frame, the second image frame, the first scene depth, the second scene depth, the first encoder feature map, the second encoder feature map and the first relative pose;
    and wherein calculating the second error by means of the error calculation module from the first image frame, the second image frame, the second depth weight and the second motion weight comprises:
    inputting the first image frame and the second image frame into the pre-built depth estimation network, obtaining a third scene depth of the first image frame and a third encoder feature map of the first image frame according to the first image frame and the second depth weight, and obtaining a fourth scene depth of the second image frame and a fourth encoder feature map of the second image frame according to the second image frame and the second depth weight;
    inputting the first image frame and the second image frame into the pre-built camera motion network, and obtaining a second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
    calculating the second error by means of the error calculation module from the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder feature map, the fourth encoder feature map and the second relative pose.
  6. The method according to claim 5, wherein determining, according to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame comprises:
    if the first error is greater than the second error, taking the second scene depth as the scene depth of the second image frame and taking the first relative pose as the relative pose between the first image frame and the second image frame;
    if the first error is less than or equal to the second error, taking the fourth scene depth as the scene depth of the second image frame and taking the second relative pose as the relative pose between the first image frame and the second image frame.
  7. The method according to claim 5 or 6, wherein a total error comprises the first error and the second error;
    the total error is determined according to an image synthesis error, a scene depth structure consistency error, a feature perception loss error and a smoothing loss error.
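Claims 7 and 8 list the components of the total error but do not recite how they are combined. One common, non-limiting reading is a weighted sum; the weighting coefficients λ below are assumptions, since none are given in the claims:

$$E_{\text{total}} \;=\; \lambda_{\text{syn}}\,E_{\text{syn}} \;+\; \lambda_{\text{geo}}\,E_{\text{geo}} \;+\; \lambda_{\text{feat}}\,E_{\text{feat}} \;+\; \lambda_{\text{smooth}}\,E_{\text{smooth}},$$

where $E_{\text{syn}}$ denotes the image synthesis error, $E_{\text{geo}}$ the scene depth structure consistency error, $E_{\text{feat}}$ the feature perception loss error, and $E_{\text{smooth}}$ the smoothing loss error.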
  8. The method according to claim 7, wherein determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error comprises:
    acquiring first image coordinates of the first image frame and second image coordinates of the second image frame;
    determining first world coordinates of the first image frame according to the first image coordinates, camera intrinsic parameters and the first scene depth;
    determining second world coordinates of the second image frame according to the second image coordinates, the camera intrinsic parameters and the second scene depth;
    affine-transforming the first world coordinates of the first image frame onto the image plane of the second image frame to determine affine-transformed third world coordinates;
    affine-transforming the second world coordinates of the second image frame onto the image plane of the first image frame to determine affine-transformed fourth world coordinates;
    projecting the third world coordinates and the fourth world coordinates onto a two-dimensional plane, respectively, to obtain a first affine-transformed scene depth and a second affine-transformed scene depth together with corresponding first affine-transformed image coordinates and second affine-transformed image coordinates;
    determining the scene depth structure consistency error, a first depth structure inconsistency weight and a second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
    determining a first camera-flow-consistency occlusion mask and a second camera-flow-consistency occlusion mask according to the first image coordinates of the first image frame, the second affine-transformed image coordinates, the second image coordinates of the second image frame and the first affine-transformed image coordinates;
    determining the image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera-flow-consistency occlusion mask and the second camera-flow-consistency occlusion mask;
    determining the feature perception loss error according to the first image frame, the second image frame, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
    determining the smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
    determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
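The coordinate pipeline in claim 8 (image coordinates lifted to world coordinates via the intrinsics and the scene depth, affine-transformed to the other image plane, then projected back to 2D) is standard differentiable warping. The single-direction sketch below only illustrates that geometry with assumed tensor shapes; it omits the occlusion masks, inconsistency weights, and the individual loss terms of the claim.

```python
import torch

def backproject_and_warp(depth, K, T):
    """Sketch of claim 8's geometry for one direction: lift pixel coordinates
    to world (camera-frame) coordinates using the intrinsics K and the scene
    depth, apply the relative pose T (a 4x4 affine transform), and project
    back onto the other image plane."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(3, -1)           # homogeneous image coordinates
    cam = torch.inverse(K) @ pix                                      # viewing rays
    cam = cam.unsqueeze(0) * depth.reshape(b, 1, -1)                  # world coordinates
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)          # homogeneous 3D points
    warped = (T @ cam_h)[:, :3]                                       # affine-transformed world coordinates
    proj = K.unsqueeze(0) @ warped                                    # back to the other image plane
    warped_depth = proj[:, 2:3].clamp(min=1e-6)                       # affine-transformed scene depth
    warped_coords = proj[:, :2] / warped_depth                        # affine-transformed image coordinates
    return warped_coords.reshape(b, 2, h, w), warped_depth.reshape(b, 1, h, w)

# usage with assumed shapes: depth (1,1,H,W), intrinsics (3,3), pose (1,4,4)
coords, depth_w = backproject_and_warp(torch.rand(1, 1, 4, 5) + 0.5,
                                       torch.eye(3), torch.eye(4).unsqueeze(0))
```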
  9. A historical information-based scene depth reasoning apparatus, wherein the apparatus comprises:
    a first acquisition module, configured to acquire a first image frame and a second image frame of an image to be tested, the first image frame being the image frame at the moment preceding the second image frame;
    a second acquisition module, configured to acquire a first depth weight of a pre-built depth estimation network and a first motion weight of a pre-built camera motion network;
    a first processing module, configured to calculate a first error by means of an error calculation module from the first image frame, the second image frame, the first depth weight and the first motion weight;
    an update module, configured to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network, using the first error as a guide signal, to obtain a second depth weight and a second motion weight;
    a second processing module, configured to calculate a second error by means of the error calculation module from the first image frame, the second image frame, the second depth weight and the second motion weight;
    a determination module, configured to determine, according to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame.
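Read together with claims 5 and 6, the apparatus above describes a small test-time adaptation loop. The sketch below is a hypothetical end-to-end use of those modules: `depth_net`, `pose_net`, and `error_module` are placeholders for the pre-built depth estimation network, camera motion network, and error calculation module, and the optimizer, learning rate, and single update step are assumptions, since the claims only require that the first error guide a joint update of the two sets of weights.

```python
import copy
import torch

def forward_once(depth_net, pose_net, error_module, frame1, frame2):
    """One evaluation pass as in claim 5: per-frame scene depths and encoder
    feature maps from the depth network, a relative pose from the motion
    network, and the resulting error from the error calculation module."""
    depth1, feat1 = depth_net(frame1)
    depth2, feat2 = depth_net(frame2)
    pose = pose_net(frame1, frame2)
    err = error_module(frame1, frame2, depth1, depth2, feat1, feat2, pose)
    return depth2, pose, err

def infer_pair(depth_net, pose_net, error_module, frame1, frame2, lr=1e-4):
    # first error with the first depth weight and first motion weight
    depth2_a, pose_a, err1 = forward_once(depth_net, pose_net, error_module, frame1, frame2)

    # joint update of both networks guided by the first error -> second weights
    adapted_depth = copy.deepcopy(depth_net)
    adapted_pose = copy.deepcopy(pose_net)
    opt = torch.optim.Adam(list(adapted_depth.parameters()) + list(adapted_pose.parameters()), lr=lr)
    opt.zero_grad()
    _, _, guide = forward_once(adapted_depth, adapted_pose, error_module, frame1, frame2)
    guide.backward()
    opt.step()

    # second error with the second depth weight and second motion weight
    depth2_b, pose_b, err2 = forward_once(adapted_depth, adapted_pose, error_module, frame1, frame2)

    # selection as recited in claim 6
    if err1 > err2:
        return depth2_a, pose_a        # second scene depth, first relative pose
    return depth2_b, pose_b            # fourth scene depth, second relative pose
```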
  10. An electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the historical information-based scene depth reasoning method according to any one of claims 1 to 8.
PCT/CN2022/076348 2022-02-15 2022-02-15 Historical information-based scene depth reasoning method and apparatus, and electronic device WO2023155043A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/076348 WO2023155043A1 (en) 2022-02-15 2022-02-15 Historical information-based scene depth reasoning method and apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/076348 WO2023155043A1 (en) 2022-02-15 2022-02-15 Historical information-based scene depth reasoning method and apparatus, and electronic device

Publications (1)

Publication Number Publication Date
WO2023155043A1 true WO2023155043A1 (en) 2023-08-24

Family

ID=87577270

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/076348 WO2023155043A1 (en) 2022-02-15 2022-02-15 Historical information-based scene depth reasoning method and apparatus, and electronic device

Country Status (1)

Country Link
WO (1) WO2023155043A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200258249A1 (en) * 2017-11-15 2020-08-13 Google Llc Unsupervised learning of image depth and ego-motion prediction neural networks
CN108898630A (en) * 2018-06-27 2018-11-27 清华-伯克利深圳学院筹备办公室 A kind of three-dimensional rebuilding method, device, equipment and storage medium
US20200160546A1 (en) * 2018-11-16 2020-05-21 Nvidia Corporation Estimating depth for a video stream captured with a monocular rgb camera
CN111540000A (en) * 2020-04-28 2020-08-14 深圳市商汤科技有限公司 Scene depth and camera motion prediction method and device, electronic device and medium
CN113822918A (en) * 2020-04-28 2021-12-21 深圳市商汤科技有限公司 Scene depth and camera motion prediction method and device, electronic device and medium
CN111311685A (en) * 2020-05-12 2020-06-19 中国人民解放军国防科技大学 Motion scene reconstruction unsupervised method based on IMU/monocular image
CN112381868A (en) * 2020-11-13 2021-02-19 北京地平线信息技术有限公司 Image depth estimation method and device, readable storage medium and electronic equipment
CN113570658A (en) * 2021-06-10 2021-10-29 西安电子科技大学 Monocular video depth estimation method based on depth convolutional network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
UMMENHOFER BENJAMIN; ZHOU HUIZHONG; UHRIG JONAS; MAYER NIKOLAUS; ILG EDDY; DOSOVITSKIY ALEXEY; BROX THOMAS: "DeMoN: Depth and Motion Network for Learning Monocular Stereo", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE COMPUTER SOCIETY, US, 21 July 2017 (2017-07-21), US , pages 5622 - 5631, XP033249922, ISSN: 1063-6919, DOI: 10.1109/CVPR.2017.596 *
徐豪飞 (XU, HAOFEI): "基于图像的高效高精度深度恢复 (Efficient and Accurate Depth Estimation from Images)", CHINESE MASTER'S THESES FULL-TEXT DATABASE (INFORMATION SCIENCE AND TECHNOLOGY), no. 6, 15 June 2021 (2021-06-15) *
路昊 (LU, HAO): "基于深度学习的动态场景相机姿态估计方法 (A Method of Estimating Camera Pose in Dynamic Scene Based on DNN)", 高技术通讯 (HIGH TECHNOLOGY LETTERS), vol. 30, no. 1, 15 January 2020 (2020-01-15) *

Similar Documents

Publication Publication Date Title
CN109754417B (en) System and method for unsupervised learning of geometry from images
EP3698323B1 (en) Depth from motion for augmented reality for handheld user devices
TWI766175B (en) Method, device and apparatus for monocular image depth estimation, program and storage medium thereof
CN110782490B (en) Video depth map estimation method and device with space-time consistency
WO2019149206A1 (en) Depth estimation method and apparatus, electronic device, program, and medium
US20200273190A1 (en) Method for 3d scene dense reconstruction based on monocular visual slam
US20090067728A1 (en) Image matching method and image interpolation method using the same
CN111951372B (en) Three-dimensional face model generation method and equipment
US8098963B2 (en) Resolution conversion apparatus, method and program
KR101266362B1 (en) System and method of camera tracking and live video compositing system using the same
US20230401672A1 (en) Video processing method and apparatus, computer device, and storage medium
CN111798485B (en) Event camera optical flow estimation method and system enhanced by IMU
CN112639878A (en) Unsupervised depth prediction neural network
Jiao et al. Comparing representations in tracking for event camera-based slam
US20200134389A1 (en) Rolling shutter rectification in images/videos using convolutional neural networks with applications to sfm/slam with rolling shutter images/videos
US20130135430A1 (en) Method for adjusting moving depths of video
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
Guan et al. PoseGU: 3D human pose estimation with novel human pose generator and unbiased learning
US20200202563A1 (en) 3d image reconstruction processing apparatus, 3d image reconstruction processing method and computer-readable storage medium storing 3d image reconstruction processing program
WO2023155043A1 (en) Historical information-based scene depth reasoning method and apparatus, and electronic device
CN111460741A (en) Fluid simulation method based on data driving
Goldenstein et al. 3D facial tracking from corrupted movie sequences
CN112766120B (en) Three-dimensional human body posture estimation method and system based on depth point cloud
Šilar et al. Comparison of two optical flow estimation methods using Matlab
Nomura Spatio-temporal optimization method for determining motion vector fields under non-stationary illumination

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926381

Country of ref document: EP

Kind code of ref document: A1