CN114627176A - Scene depth reasoning method and device based on historical information and electronic equipment


Info

Publication number
CN114627176A
Authority
CN
China
Prior art keywords
image frame
error
feature map
depth
weight
Prior art date
Legal status
Pending
Application number
CN202210139037.7A
Other languages
Chinese (zh)
Inventor
王飞
程俊
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210139037.7A priority Critical patent/CN114627176A/en
Publication of CN114627176A publication Critical patent/CN114627176A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/50 - Depth or shape recovery
    • G06T 7/55 - Depth or shape recovery from multiple images
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06T 3/02
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a scene depth inference method and device based on historical information, and an electronic device. The method includes: acquiring a first image frame and a second image frame; acquiring a first depth weight of a depth estimation network and a first motion weight of a camera motion network; calculating a first error with an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight; jointly updating the first depth weight of the depth estimation network and the first motion weight of the camera motion network with the first error as a guidance signal to obtain a second depth weight and a second motion weight; calculating a second error with the error calculation module; and determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error. The scheme can reduce the influence of erroneous affine transformations caused by inaccurate poses.

Description

Scene depth reasoning method and device based on historical information and electronic equipment
Technical Field
The application belongs to the technical field of computer vision and image processing, and particularly relates to a scene depth inference method and device based on historical information and electronic equipment.
Background
Accurately recovering scene depth from two-dimensional images facilitates a better understanding of the three-dimensional structure of a scene and thereby helps to accomplish various visual tasks. However, an ordinary camera captures two-dimensional images and loses the depth information of the scene, so recovering scene depth from two-dimensional images or video sequences has become a fundamental and extremely challenging task in the field of computer vision. Although competitive scene depth can currently be recovered from two-dimensional images, a large amount of manually labeled data is needed to train the neural network, which is time-consuming and labor-intensive; moreover, once model training is completed the model weights are frozen, which reduces the generalization ability of the algorithm to unknown scenes. In addition, fully unsupervised schemes that recover scene depth from two-dimensional images must simultaneously predict the camera pose from adjacent frames, and an inaccurate pose produces erroneous affine transformation results, which directly affects the quality of the synthesized images and in turn the quality of the recovered scene depth.
Disclosure of Invention
The embodiment of the specification aims to provide a scene depth inference method and device based on historical information and electronic equipment.
In order to solve the above technical problem, the embodiments of the present application are implemented as follows:
in a first aspect, the present application provides a scene depth inference method based on historical information, including:
acquiring a first image frame and a second image frame of an image to be detected, wherein the first image frame is an image frame at the previous moment of the second image frame;
acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
calculating a first error by adopting an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
the first error is used as a guide signal to jointly update a first depth weight of a depth estimation network and a first motion weight of a camera motion network, so that a second depth weight and a second motion weight are obtained;
calculating a second error by adopting an error calculation module according to the first image frame, the second depth weight and the second motion weight;
and determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
In a second aspect, the present application provides a scene depth inference device based on historical information, the device including:
the first acquisition module is used for acquiring a first image frame and a second image frame of an image to be detected, wherein the first image frame is an image frame at the previous moment of the second image frame;
the second acquisition module is used for acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
the first processing module is used for calculating a first error by adopting the error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
the updating module is used for jointly updating a first depth weight of the depth estimation network and a first motion weight of the camera motion network by taking the first error as a guide signal to obtain a second depth weight and a second motion weight;
the second processing module is used for calculating a second error by adopting the error calculation module according to the first image frame, the second depth weight and the second motion weight;
and the determining module is used for determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the scene depth inference method based on historical information according to the first aspect.
As can be seen from the technical solutions provided in the embodiments of this specification, the solution recovers scene depth from two-dimensional images in a fully unsupervised manner, injects the historical frame information held in the memory unit into the current input unit through a temporal attention module, and models the spatial correlation of the spatio-temporal feature map, thereby improving the accuracy of the camera pose and reducing the influence of erroneous affine transformations caused by inaccurate poses; during inference, online decision inference is used to improve the generalization ability of the algorithm to unknown scenes.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.
FIG. 1 is a schematic flow chart of a scene depth inference method based on historical information according to the present application;
FIG. 2 is a block diagram of joint training of a depth estimation network and a camera motion network provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a temporal attention module provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a spatiotemporal correlation module provided in an embodiment of the present application;
fig. 5 is a schematic view of a scene depth inference process provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a scene depth inference apparatus based on historical information according to the present application;
fig. 7 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art without making creative efforts based on the embodiments in the present specification shall fall within the protection scope of the present specification.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system architectures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments described herein without departing from the scope or spirit of the application. Other embodiments will be apparent to the skilled person from the description of the present application. The specification and examples are exemplary only.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including but not limited to.
In the present application, "parts" are parts by mass unless otherwise specified.
The present invention will be described in further detail with reference to the drawings and examples.
Referring to fig. 1, a flowchart of a scene depth inference method based on historical information according to an embodiment of the present application is shown.
As shown in fig. 1, the scene depth inference method based on historical information may include:
s110, acquiring a first image frame and a second image frame of the image to be detected, wherein the first image frame is an image frame at the previous moment of the second image frame.
The image to be detected is any one image of which the scene depth needs to be inferred and predicted, and the image to be detected can be a two-dimensional image.
The image to be detected is cut into a plurality of adjacent image frames at equal time intervals. The first image frame and the second image frame of the image to be detected are two image frames at adjacent moments, and the first image frame is the image frame at the moment immediately before the second image frame; for example, the first image frame is the image frame at time t-1 and the second image frame is the image frame at time t, or the first image frame is the image frame at time t and the second image frame is the image frame at time t+1.
S120, acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network.
The depth estimation network is used to estimate the scene depth of a two-dimensional image. The depth estimation network may adopt a neural network with an encoder-decoder structure; the type and network structure of the neural network adopted by the depth estimation network are not limited here.
The camera motion network is used to predict the relative pose between adjacent image frames.
Both the depth estimation network and the camera motion network are pre-constructed and trained.
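For illustration only, a minimal encoder-decoder depth network of the kind referred to above might be sketched as follows in PyTorch; the layer sizes, activations and the output range are assumptions made for this sketch and are not part of the disclosed embodiment.

    import torch
    import torch.nn as nn

    class SimpleDepthNet(nn.Module):
        """Illustrative encoder-decoder depth estimation network (hypothetical sizes)."""
        def __init__(self):
            super().__init__()
            # Encoder: progressively downsample the RGB input (C = 3).
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
            # Decoder: upsample back to the input resolution and predict one depth channel.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            feat = self.encoder(x)      # encoder feature map, e.g. X_(enc,t)
            depth = self.decoder(feat)  # per-pixel scene depth (squashed to (0, 1) here, an assumption)
            return depth, feat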
Referring to fig. 2, a block diagram of the joint training of the depth estimation network and the camera motion network provided by an embodiment of the present application is shown (the leftmost original image, the scene depth picture and the synthesized view in fig. 2 are originally color images, shown here in grayscale). It can be understood that, to train the depth estimation network and the camera motion network, training set data is obtained first. In the present application, the training set data consists of groups of image frames at three adjacent moments in an image sequence; for example, the image frames at times t-1, t and t+1 form one sample of the training data set, the image frames at times t-2, t-1 and t form another sample, and so on. According to the training block diagram shown in fig. 2, the image frames at times t-1, t and t+1 are used as the input of the camera motion network and the depth estimation network, and the depth estimation network and the camera motion network are jointly trained with the objective function, that is, the objective function guides the update of the depth weight of the depth estimation network and the motion weight of the camera motion network.
It can be understood that before the image frames at times t-1, t and t+1 are input into the depth estimation network and the camera motion network, image preprocessing is performed: for example, the image data is randomly flipped, randomly cropped and normalized, and the processed data is converted into a tensor with dimensions C × H × W (the batch dimension is omitted). Here C denotes the channel dimension of the sample: for the depth estimation network C is 3; for the camera motion network C is 9 during training (the input is 3 stacked image frames at adjacent times) and C is 6 during test inference (that is, when the scene depth inference method based on historical information of the present application is performed), because the input is 2 stacked image frames at adjacent times. H denotes the height of the input sample image, for example H = 256, and W denotes the width of the input sample image, for example W = 832.
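A minimal sketch of the preprocessing described above (random flip, random crop, normalization, conversion to a C × H × W tensor and channel-wise stacking of adjacent frames) is given below; the use of torchvision, the normalization statistics and the helper name stack_frames are illustrative assumptions.

    import torch
    from torchvision import transforms

    # Illustrative preprocessing: flip / crop / normalize and convert to a C x H x W tensor.
    # H = 256 and W = 832 follow the example sizes in the text; the mean/std values are assumed.
    preprocess = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomCrop((256, 832)),
        transforms.ToTensor(),  # -> C x H x W, values in [0, 1]
        transforms.Normalize(mean=[0.45, 0.45, 0.45], std=[0.225, 0.225, 0.225]),
    ])

    # Stacking three adjacent frames along the channel dimension gives C = 9 for the camera
    # motion network during training; two adjacent frames give C = 6 at inference time.
    # Note: in practice the same random flip/crop should be applied consistently to all frames
    # of one group; applying `preprocess` per frame here is a simplification.
    def stack_frames(frames):
        return torch.cat([preprocess(f) for f in frames], dim=0)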
With continued reference to FIG. 2, the camera motion network may include an encoder, a temporal attention module, and a spatiotemporal correlation module.
The encoder is used for extracting the features of the stacked image frames to obtain a stacked feature map; the stacked image frames are obtained by stacking a first image frame and a second image frame according to a channel dimension.
The time attention module is used for establishing a global dependency relationship between the information of the history memory unit and the information of the current input unit, injecting the information in the globally relevant history memory unit into the current input unit through the updating unit, and storing the globally relevant information in the current input unit into the history memory unit to be used as the history memory unit at the next moment; the current input unit comprises a stacking feature map, the stacking feature map is updated into an updated feature map through an updating unit, the history storage unit comprises a first memory feature map and a first time feature map, and the history storage unit at the next moment comprises a second memory feature map and a second time feature map.
The spatiotemporal correlation module is used for modeling the updated/second memory feature maps into first/second spatiotemporal feature maps with spatial correlation, respectively.
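To make the composition of the camera motion network concrete, a minimal skeleton is sketched below; the interfaces of the sub-modules (encoder, temporal attention module, spatio-temporal correlation module), the pooling step and the 6-DoF pose head are assumptions made for illustration, not the exact disclosed implementation.

    import torch
    import torch.nn as nn

    class CameraMotionNet(nn.Module):
        """Illustrative skeleton: encoder -> temporal attention -> spatio-temporal correlation -> pose."""
        def __init__(self, encoder, temporal_attention, spatiotemporal_correlation, feat_dim=128):
            super().__init__()
            self.encoder = encoder                            # extracts the stacked feature map
            self.temporal_attention = temporal_attention      # assumed to return (updated map, memory map, time map)
            self.spatiotemporal_correlation = spatiotemporal_correlation
            self.pose_head = nn.Linear(feat_dim, 6)           # 6-DoF relative pose (3 rotation + 3 translation)

        def forward(self, stacked_frames, memory_map, time_map):
            x = self.encoder(stacked_frames)                  # stacked feature map from adjacent frames
            x_updated, memory_map, time_map = self.temporal_attention(x, memory_map, time_map)
            st_feat = self.spatiotemporal_correlation(x_updated)
            pooled = st_feat.mean(dim=(2, 3))                 # global average pool over H and W
            pose = self.pose_head(pooled)                     # relative pose between the two frames
            return pose, memory_map, time_map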
In one embodiment, the temporal attention module uses the shared temporal attention weights to establish a global dependency relationship between the information in the history memory unit and the information in the current input unit, injects the globally correlated information in the history memory unit into the current input unit via the update unit, and stores the globally correlated information in the current input unit into the history memory unit as the history memory unit at the next moment; this includes:
injecting the feature information in the stacking feature map and the feature information of the first memory feature map into the first time feature map to obtain a third time feature map;
determining a time attention feature vector according to the third time feature map;
determining a first feature vector according to the stacking feature map;
determining a second feature vector according to the first memory feature map;
Determining an input feature vector based on the time attention according to the first feature vector and the time attention feature vector;
determining a memory feature vector based on the time attention according to the second feature vector and the time attention feature vector;
respectively adjusting an input feature vector based on time attention and a memory feature vector based on time attention into a corresponding first feature map and a corresponding second feature map;
determining an updated feature map and a second memory feature map according to the first feature map and the second feature map;
and updating the third time characteristic diagram into the second time characteristic diagram according to the updated characteristic diagram and the second memory characteristic diagram.
For example, refer to FIG. 3, which shows a schematic diagram of the temporal attention module provided by an embodiment of the application. For convenience of description, assume that the input feature map at time t (i.e., the stacked feature map) is denoted X_(i,t) ∈ R^(C×H×W), the time feature map at time t-1 (i.e., the first time feature map) is denoted X_(time,t-1) ∈ R^(C×H×W), and the memory feature map at time t-1 (i.e., the first memory feature map) is denoted X_(m,t-1) ∈ R^(C×H×W).
The calculation process is as follows:
1) According to formula (1), the feature information in the input feature map X_(i,t) at time t and the feature information in the memory feature map X_(m,t-1) at time t-1 are injected into the time feature map X_(time,t-1) at time t-1 to obtain the third time feature map X_(qk_time,t-1):

X_(qk_time,t-1) = δ_gelu(W_(i,t) X_(i,t) + b_(i,t) + W_(time,t-1) X_(time,t-1) + b_(time,t-1) + W_(m,t-1) X_(m,t-1) + b_(m,t-1))    (1)

where R^(C×H×W) denotes the feature space, C denotes the number of feature map channels, H denotes the height of the feature map, W denotes the width of the feature map, δ_gelu(·) denotes the activation function, W_(i,t), W_(time,t-1), W_(m,t-1) denote the learned corresponding weights, and b_(i,t), b_(time,t-1), b_(m,t-1) denote the corresponding bias terms.
2) According to formula group (2), the temporal attention feature vector x_(qk_time,t-1) is calculated from the third time feature map X_(qk_time,t-1): the feature map is sliced along the channel dimension by F_split(·), the resulting parts are adjusted to feature vectors by F_reshape(·), and they are combined using the scalar scaling factor s, the element-wise product and the transpose together with learned weights and bias terms, where s denotes a scalar scaling factor, "*" denotes the product of corresponding elements, the function F_split(·) slices a feature map along the channel dimension, the function F_reshape(·) adjusts a feature map or feature vector to a predetermined shape, and T denotes the transpose.
3) According to formula group (3), the input feature map X_(i,t) at time t is adjusted to a feature vector x_(i,t) (i.e., the first feature vector), and the memory feature map X_(m,t-1) at time t-1 is adjusted to a feature vector x_(m,t-1) (i.e., the second feature vector):

x_(i,t) = F_reshape(X_(i,t))
x_(m,t-1) = F_reshape(X_(m,t-1))    (3)
4) According to formula group (4), the input feature vector based on temporal attention x_(qk_i,t) and the memory feature vector based on temporal attention x_(qk_m,t-1) are calculated by weighting the first feature vector x_(i,t) and the second feature vector x_(m,t-1) with the temporal attention feature vector x_(qk_time,t-1).
5) According to formula group (5), the feature vectors x_(qk_i,t) and x_(qk_m,t-1) are adjusted to the corresponding feature maps X_(qk_i,t) (i.e., the first feature map) and X_(qk_m,t-1) (i.e., the second feature map):

X_(qk_i,t) = F_reshape(x_(qk_i,t))
X_(qk_m,t-1) = F_reshape(x_(qk_m,t-1))    (5)
6) According to formula (6), the information selection gate G_s is calculated, which selectively injects the information in the memory feature map X_(qk_m,t-1) at time t-1 into the input feature map X_(qk_i,t) at time t:

G_s = δ_sig(W_(qk_ms,t-1) X_(qk_m,t-1) + b_(qk_ms,t-1) + W_(qk_is,t) X_(qk_i,t) + b_(qk_is,t))    (6)

where the function δ_sig(·) denotes the sigmoid activation function, W_(qk_ms,t-1) and W_(qk_is,t) denote the corresponding weights, and b_(qk_ms,t-1) and b_(qk_is,t) denote the corresponding bias terms.
7) According to formula (7), a new feature map X_(im,t) containing the information of the memory feature map is calculated:

X_(im,t) = δ_tanh(W_(qk_imi,t) X_(qk_i,t) + b_(qk_imi,t) + G_s * (W_(qk_imm,t-1) X_(qk_m,t-1) + b_(qk_imm,t-1)))    (7)

where the function δ_tanh(·) denotes the tanh activation function, W_(qk_imi,t) and W_(qk_imm,t-1) denote the corresponding weights, and b_(qk_imi,t) and b_(qk_imm,t-1) denote the corresponding bias terms.
8) According to formula group (8), the memory gate G_r is calculated, which updates the information of the memory feature map X_(qk_m,t-1) at time t-1 into the memory feature map X_(m,t) at time t (i.e., the second memory feature map):

G_r = δ_sig(W_(qk_ir,t) X_(qk_i,t) + b_(qk_ir,t) + W_(qk_mr,t-1) X_(qk_m,t-1) + b_(qk_mr,t-1))
X_(m,t) = (1 - G_r) * X_(im,t) + G_r * X_(qk_m,t-1)    (8)

where W_(qk_ir,t) and W_(qk_mr,t-1) denote the corresponding weights, and b_(qk_ir,t) and b_(qk_mr,t-1) denote the corresponding bias terms.
9) According to formula group (9), the output gate G_o is calculated, which updates the input feature map X_(qk_i,t) to obtain the updated new feature map X_(io,t) (i.e., the updated feature map), used as the input feature map at the next moment:

G_o = δ_sig(W_(qk_io,t) X_(qk_i,t) + b_(qk_io,t) + W_(qk_mo,t-1) X_(qk_m,t-1) + b_(qk_mo,t-1))
X_(io,t) = G_o * X_(im,t)    (9)

where W_(qk_io,t) and W_(qk_mo,t-1) denote the corresponding weights, and b_(qk_io,t) and b_(qk_mo,t-1) denote the corresponding bias terms.
10) According to formula (10), the third time feature map X_(qk_time,t-1) at time t-1 is updated to the time feature map X_(time,t) at time t (i.e., the second time feature map) according to the updated feature map X_(io,t) and the second memory feature map X_(m,t), where W_(time,t-1), W_(io,t) and W_(m,t) denote the corresponding weights, and b_(time,t-1), b_(io,t) and b_(m,t) denote the corresponding bias terms.
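A compact sketch of the gating computations in steps 6) to 9) above is given below; 1×1 convolutions stand in for the learned weights W_(·) and bias terms b_(·), which is an assumption made to obtain a runnable example rather than the exact disclosed implementation.

    import torch
    import torch.nn as nn

    class TemporalGating(nn.Module):
        """Sketch of formulas (6)-(9): information selection, memory and output gates
        over the attention-weighted input map X_(qk_i,t) and memory map X_(qk_m,t-1)."""
        def __init__(self, channels):
            super().__init__()
            # 1x1 convolutions stand in for the learned weights W_(.) and bias terms b_(.).
            self.w_is = nn.Conv2d(channels, channels, 1)
            self.w_ms = nn.Conv2d(channels, channels, 1)
            self.w_imi = nn.Conv2d(channels, channels, 1)
            self.w_imm = nn.Conv2d(channels, channels, 1)
            self.w_ir = nn.Conv2d(channels, channels, 1)
            self.w_mr = nn.Conv2d(channels, channels, 1)
            self.w_io = nn.Conv2d(channels, channels, 1)
            self.w_mo = nn.Conv2d(channels, channels, 1)

        def forward(self, x_qk_i, x_qk_m):
            g_s = torch.sigmoid(self.w_ms(x_qk_m) + self.w_is(x_qk_i))        # selection gate, formula (6)
            x_im = torch.tanh(self.w_imi(x_qk_i) + g_s * self.w_imm(x_qk_m))  # fused map X_(im,t), formula (7)
            g_r = torch.sigmoid(self.w_ir(x_qk_i) + self.w_mr(x_qk_m))        # memory gate, formula (8)
            x_m_new = (1 - g_r) * x_im + g_r * x_qk_m                         # second memory map X_(m,t)
            g_o = torch.sigmoid(self.w_io(x_qk_i) + self.w_mo(x_qk_m))        # output gate, formula (9)
            x_io = g_o * x_im                                                 # updated feature map X_(io,t)
            return x_io, x_m_new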
In order to infer camera motion using global spatial structure information of the feature map and the dependencies between the spatial structures, a spatiotemporal correlation module as shown in fig. 4 is constructed, modeling spatial context information using global spatial correlation weights, and constraining timing information between stacked frames by modeling the dependencies between feature map channels of the stacked frames.
In one embodiment, the spatiotemporal correlation module is to model the updated/second memory feature maps as first/second spatiotemporal feature maps having spatial correlations, respectively.
Wherein modeling the updated feature map as a first spatio-temporal feature map having spatial correlation comprises:
slicing the updated feature map in the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
respectively correspondingly adjusting the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector;
calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weighting the third sub-eigenvector by using the first spatial correlation matrix to obtain a first spatial correlation eigenvector;
determining a first space-time eigenvector according to the first space-related eigenvector and the third eigenvector;
and adjusting the first space-time feature vector into a first space-time feature map with spatial correlation.
Wherein modeling the second memory signature as a second spatiotemporal signature with spatial correlation comprises:
slicing the second memory feature map in a channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
respectively adjusting the second memory characteristic diagram, the fourth sub-characteristic diagram, the fifth sub-characteristic diagram and the sixth sub-characteristic diagram into a fourth characteristic vector, a fourth sub-characteristic vector, a fifth sub-characteristic vector and a sixth sub-characteristic vector;
calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weighting the sixth sub-eigenvector by using the second spatial correlation matrix to obtain a second spatial correlation eigenvector;
determining a second space-time feature vector according to the second space-related feature vector and the fourth feature vector;
and adjusting the second space-time feature vector into a second space-time feature map with spatial correlation.
Illustratively, referring to FIG. 4, the calculation is as follows:
1) According to formula group (11), the input feature map X_in (i.e., the updated feature map or the second memory feature map) is transformed with the weight W_mid and the corresponding bias term b_mid to obtain a feature map X_mid, which is then equally divided along the channel dimension into three sub-feature maps X_(mid,1), X_(mid,2) and X_(mid,3) (namely the first/fourth, second/fifth and third/sixth sub-feature maps, respectively) by the function F_split(·), where F_split(·) denotes the slicing of a feature map along the channel dimension.
2) According to formula group (12), the corresponding feature maps are adjusted to feature vectors, where the function F_reshape(·) adjusts a feature map or feature vector to a predetermined shape:

x_mid = F_reshape(X_mid)
x_(mid,1) = F_reshape(X_(mid,1))
x_(mid,2) = F_reshape(X_(mid,2))
x_(mid,3) = F_reshape(X_(mid,3))    (12)
3) According to formula (13), the spatial correlation matrix A_corr between the first/fourth sub-feature map X_(mid,1) and the second/fifth sub-feature map X_(mid,2) is calculated from the corresponding feature vectors x_(mid,1) and x_(mid,2), where s denotes a scalar scaling factor.
4) The calculated spatial correlation matrix A_corr is used to weight the third/sixth sub-feature vector x_(mid,3) to obtain the spatially correlated feature vector x_corr (i.e., the first spatial correlation feature vector or the second spatial correlation feature vector), as shown in formula (14).
5) According to formula (15), the dependency relationship between the spatial structures of the feature map is modeled and the timing information between the stacked frames is constrained by modeling the dependency relationship between the feature map channels of the stacked frames; the first/second spatio-temporal feature vector x_time_corr is calculated from the spatially correlated feature vector x_corr and the feature vector x_mid using F_C(·), where F_C(·) consists of two layers of one-dimensional convolution with kernel size 3 and step size 1 and an activation function.
6) Finally, the first/second spatio-temporal feature vector x_time_corr with spatial correlation is adjusted to the first/second spatio-temporal feature map X_time_corr with spatial correlation.
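The spatial correlation computation in steps 1) to 6) above can be sketched as follows; the softmax normalization of the correlation matrix, the exact tensor shapes and the class name are assumptions made to obtain a runnable example.

    import torch
    import torch.nn as nn

    class SpatioTemporalCorrelation(nn.Module):
        """Sketch of the spatio-temporal correlation module: channel split, spatial
        correlation matrix, weighting, and a two-layer 1-D convolution block F_C."""
        def __init__(self, channels):
            super().__init__()
            assert channels % 3 == 0, "the transformed map is split into three equal parts"
            self.transform = nn.Conv2d(channels, channels, 1)   # stands in for W_mid, b_mid
            c = channels // 3
            self.f_c = nn.Sequential(                           # F_C: two 1-D convs, kernel 3, stride 1
                nn.Conv1d(c, c, 3, stride=1, padding=1), nn.ReLU(inplace=True),
                nn.Conv1d(c, c, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            )
            self.scale = c ** -0.5                              # assumed form of the scalar scaling factor s

        def forward(self, x):                                   # x: (B, C, H, W)
            b, _, h, w = x.shape
            x1, x2, x3 = torch.chunk(self.transform(x), 3, dim=1)  # the three sub-feature maps
            q = x1.flatten(2)                                   # (B, C/3, H*W)
            k = x2.flatten(2)
            v = x3.flatten(2)
            corr = torch.softmax(self.scale * q.transpose(1, 2) @ k, dim=-1)  # spatial correlation matrix (B, HW, HW)
            weighted = v @ corr.transpose(1, 2)                 # spatially correlated feature vector
            out = self.f_c(weighted)                            # model dependencies across the flattened positions
            return out.view(b, -1, h, w)                        # spatio-temporal feature map with spatial correlation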
S130, calculating a first error by using the error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight, which includes:
inputting a first image frame and a second image frame into a depth estimation network which is constructed in advance, obtaining a first scene depth of the first image frame and a first encoder characteristic map of the first image frame according to the first image frame and a first depth weight, and obtaining a second scene depth of the second image frame and a second encoder characteristic map of the second image frame according to the second image frame and the first depth weight;
inputting a first image frame and a second image frame into a camera motion network which is constructed in advance, and obtaining a first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and a first motion weight;
and calculating a first error by adopting an error calculation module according to the first scene depth, the second scene depth and the first relative pose.
In one embodiment, the total error is determined from image synthesis errors, scene depth structure consistency errors, feature perception loss errors, and smoothing loss errors. Wherein the total error comprises the first error and the second error.
Specifically, the total error is determined according to an image synthesis error, a scene depth structure consistency error, a feature perception loss error and a smoothing loss error, and the method comprises the following steps:
acquiring a first image coordinate of a first image frame and a second image coordinate of a second image frame;
determining a first world coordinate of a first image frame according to the first image coordinate, camera internal parameters and a first scene depth;
determining a second world coordinate of a second image frame according to the second image coordinate, the camera internal parameter and the second scene depth;
affine transforming the first world coordinates of the first image frame to a second image frame panel, and determining third world coordinates after affine transformation;
affine transforming the second world coordinate of the second image frame to the first image frame panel, and determining a fourth world coordinate after affine transformation;
respectively projecting the third world coordinate and the fourth world coordinate to a two-dimensional plane to obtain the scene depth after the first affine transformation, the scene depth after the second affine transformation and the corresponding image coordinate after the first affine transformation and the image coordinate after the second affine transformation;
determining a scene depth structure consistency error, a first depth structure inconsistency weight and a second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining a first camera flow consistency occlusion mask and a second camera flow consistency occlusion mask according to a first image coordinate of a first image frame, a second affine-transformed image coordinate, a second image coordinate of a second image frame and a first affine-transformed image coordinate;
determining an image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera stream consistency occlusion mask and the second camera stream consistency occlusion mask;
determining a characteristic perception loss error according to the first image frame, the second image frame, the image coordinate after the first affine transformation and the image coordinate after the second affine transformation;
determining a smooth loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
and determining the total error according to the image synthesis error, the scene depth structure consistency error, the characteristic perception loss error and the smooth loss error.
Illustratively, for convenience of description, assume that the trained scene depth fitting function is D_t = F_D(I_t | W_D), where I_t denotes the two-dimensional image whose scene depth is to be recovered at time t, W_D denotes the learned weight parameters of the scene depth fitting function F_D, D_t denotes the scene depth recovered for the two-dimensional image I_t at time t, X_(enc,t) denotes the feature map of the image frame I_t at time t output by the encoder, T_(t-1→t) denotes the pose transforming the image frame I_(t-1) at time t-1 to the image frame I_t at time t, W_T denotes the learned weight parameters of the pose transformation function F_T, the camera intrinsic parameters are denoted K, P_(xy,t-1) denotes the image coordinates of the image frame I_(t-1), P_(xyz,t-1) denotes the world coordinates of the image frame I_(t-1), P_(xy,t) denotes the image coordinates of the image frame I_t, and P_(xyz,t) denotes the world coordinates of the image frame I_t. The total error is calculated as follows:
1) According to formula (16), the world coordinates P_(xyz,t-1) (i.e., the first world coordinates) of the image frame I_(t-1) and the world coordinates P_(xyz,t) (i.e., the second world coordinates) of the image frame I_t are calculated:

P_(xyz,t-1) = D_(t-1) * K^(-1) P_(xy,t-1)
P_(xyz,t) = D_t * K^(-1) P_(xy,t)    (16)

where "*" denotes the product of the corresponding elements of the matrices.
2) According to formula (17), the world coordinates P_(xyz,t-1) of the image frame I_(t-1) are affine transformed to the image frame I_t panel, yielding the affine-transformed world coordinates P_(proj_xyz,t) (i.e., the third world coordinates), and the world coordinates P_(xyz,t) of the image frame I_t are affine transformed to the image frame I_(t-1) panel, yielding the affine-transformed world coordinates P_(proj_xyz,t-1) (i.e., the fourth world coordinates).
3) The world coordinates P_(proj_xyz,t) and P_(proj_xyz,t-1) obtained by affine transformation are respectively projected onto the two-dimensional plane to obtain the affine-transformed scene depths D_(proj,t) (i.e., the scene depth after the first affine transformation) and D_(proj,t-1) (i.e., the scene depth after the second affine transformation), together with the corresponding affine-transformed image coordinates P_(proj_xy,t) (i.e., the image coordinates after the first affine transformation) and P_(proj_xy,t-1) (i.e., the image coordinates after the second affine transformation).
4) The image I_(syn,t) is synthesized from the image frame I_(t-1) and P_(proj_xy,t-1); the feature map X_(syn_enc,t) is synthesized from the encoder feature map X_(enc,t-1) and P_(proj_xy,t-1); the depth map D_(syn,t) is synthesized from the estimated depth map D_(t-1) and P_(proj_xy,t-1); the image I_(syn,t-1) is synthesized from the image frame I_t and P_(proj_xy,t); the feature map X_(syn_enc,t-1) is synthesized from the encoder feature map X_(enc,t) and P_(proj_xy,t); the depth map D_(syn,t-1) is synthesized from the estimated depth map D_t and P_(proj_xy,t); the forward camera flow U_forward is calculated from the image coordinates of I_(t-1) and P_(proj_xy,t-1); the backward camera flow U_backward is calculated from the image coordinates of I_t and P_(proj_xy,t); the synthesized forward camera flow U_syn_forward is obtained from U_forward and P_(proj_xy,t); and the synthesized backward camera flow U_syn_backward is obtained from U_backward and P_(proj_xy,t-1).
5) According to formula group (18), the first camera flow consistency occlusion mask M_(occ,t-1) and the second camera flow consistency occlusion mask M_(occ,t) are calculated separately:

M_(occ,t-1) = Γ(‖U_syn_forward + U_backward‖², α1(‖U_syn_forward‖² + ‖U_backward‖²) + α2)
M_(occ,t) = Γ(‖U_syn_backward + U_forward‖², α1(‖U_syn_backward‖² + ‖U_forward‖²) + α2)    (18)

where Γ(·,·) is an indicator-type function that compares its two arguments to decide the flow-consistency mask, and α1 and α2 are constant coefficients.
6) According to formula group (19), the scene depth structure consistency error E_D, the first depth structure inconsistency weight M_(D,t-1) and the second depth structure inconsistency weight M_(D,t) are calculated from the estimated scene depths D_(t-1), D_t and the affine-transformed image coordinates P_(proj_xy,t-1), P_(proj_xy,t).
7) According to formula (20), the image synthesis error E_I is calculated from the first depth structure inconsistency weight M_(D,t-1), the second depth structure inconsistency weight M_(D,t), the first camera flow consistency occlusion mask M_(occ,t-1) and the second camera flow consistency occlusion mask M_(occ,t).
8) According to formula (21), the feature perception loss error E_X is calculated:

E_X = E_RF(X_(enc,t), X_(syn_enc,t)) + E_RF(X_(enc,t-1), X_(syn_enc,t-1))    (21)
9) According to formula (22), the smoothing loss error E_S is calculated from the estimated scene depths D_(t-1), D_t and the image frames I_(t-1), I_t.
10) The total error E is calculated according to formula (23):

E = λ_I E_I + λ_D E_D + λ_X E_X + λ_S E_S    (23)
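As a hedged illustration of two of the building blocks above, the back-projection of formula (16) and the weighted combination of formula (23) might be written as follows; the function signatures and the coefficient values are placeholders assumed for illustration, not the disclosed settings.

    import torch

    def backproject(depth, intrinsics_inv, pix_coords):
        """Formula (16), as reconstructed above: P_xyz = D * K^(-1) P_xy.
        depth: (B, 1, H, W); intrinsics_inv: (B, 3, 3); pix_coords: (B, 3, H*W), homogeneous."""
        cam_rays = intrinsics_inv @ pix_coords   # back-project every pixel into a camera ray
        return depth.flatten(2) * cam_rays       # scale each ray by its estimated depth

    def total_error(e_image, e_depth, e_feat, e_smooth,
                    lambda_i=1.0, lambda_d=0.5, lambda_x=0.1, lambda_s=0.1):
        """Formula (23): E = lambda_I*E_I + lambda_D*E_D + lambda_X*E_X + lambda_S*E_S.
        The lambda values here are illustrative placeholders."""
        return lambda_i * e_image + lambda_d * e_depth + lambda_x * e_feat + lambda_s * e_smooth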
s140, the first error is used as a guide signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network, and a second depth weight and a second motion weight are obtained.
S150, calculating a second error by using an error calculation module, where the calculating includes:
inputting a first image frame and a second image frame into a depth estimation network which is constructed in advance, obtaining a third scene depth of the first image frame and a third encoder characteristic map of the first image frame according to the first image frame and a second depth weight, and obtaining a fourth scene depth of the second image frame and a fourth encoder characteristic map of the second image frame according to the second image frame and the second depth weight;
inputting the first image frame and the second image frame into a camera motion network which is constructed in advance, and obtaining a second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
and calculating a second error by adopting an error calculation module according to the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder characteristic diagram, the fourth encoder characteristic diagram and the second relative pose.
In this step, referring to the specific calculation process in step S130, only the first depth weight and the first motion weight in step S130 are replaced with the second depth weight and the second motion weight, which is not described herein again.
And S160, determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
Specifically, if the first error is larger than the second error, the second scene depth is used as the scene depth of the second image frame, and the first relative pose is used as the relative pose between the first image frame and the second image frame;
and if the first error is smaller than or equal to the second error, taking the fourth scene depth as the scene depth of the second image frame, and taking the second relative pose as the relative pose between the first image frame and the second image frame.
Referring to fig. 5, a schematic diagram of a scene depth inference process is shown. The process of inferring scene depth is as follows:
1) The depth estimation network weight W_D and the camera motion network weight W_T obtained by training are used as the model weights of the depth estimation network and the camera motion network during inference, and the total error E, the pose transformation matrix T_(t-1→t) from the historical frame to the current frame and the scene depth D_t of the current frame are calculated according to S130.
2) The weights of the depth estimation network and the camera motion network are updated with the total error E calculated in 1) as the guidance signal, yielding new model weights W'_D and W'_T.
3) With the model weights W'_D and W'_T obtained in 2), the total error E', the pose transformation matrix T'_(t-1→t) from the historical frame to the current frame and the scene depth D'_t of the current frame are calculated according to S150.
4) The scene depth of the current frame that is finally output is determined by comparing the total errors E and E'.
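The online decision inference in steps 1) to 4) above, together with the selection rule of S160, can be sketched as follows; the error_fn interface, the optimizer choice and the learning rate are placeholders assumed for illustration.

    import copy
    import torch

    def online_inference_step(depth_net, motion_net, error_fn, frame_prev, frame_cur, lr=1e-4):
        """One online decision-inference step (S130-S160), simplified.
        error_fn is a placeholder returning (total_error, scene_depth, relative_pose)."""
        params = list(depth_net.parameters()) + list(motion_net.parameters())
        optimizer = torch.optim.SGD(params, lr=lr)

        # S130: first error, scene depth and pose with the first (pre-update) weights.
        error_1, depth_1, pose_1 = error_fn(depth_net, motion_net, frame_prev, frame_cur)
        saved = (copy.deepcopy(depth_net.state_dict()), copy.deepcopy(motion_net.state_dict()))

        # S140: jointly update both networks with the first error as the guidance signal.
        optimizer.zero_grad()
        error_1.backward()
        optimizer.step()

        # S150: second error, scene depth and pose with the second (updated) weights.
        error_2, depth_2, pose_2 = error_fn(depth_net, motion_net, frame_prev, frame_cur)

        # S160: selection rule as stated in the description - if the first error is larger
        # than the second error, the pre-update prediction is output; otherwise the
        # post-update prediction is output.
        if error_1 > error_2:
            depth_net.load_state_dict(saved[0])
            motion_net.load_state_dict(saved[1])
            return depth_1, pose_1
        return depth_2, pose_2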
In the present application, scene depth is recovered from two-dimensional images in a fully unsupervised manner; the historical frame information in the memory unit is injected into the current input unit through the temporal attention module, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and reduce the influence of erroneous affine transformations caused by inaccurate poses; during inference, online decision inference is used to improve the generalization ability of the algorithm to unknown scenes.
Referring to fig. 6, a schematic structural diagram of a scene depth inference apparatus based on historical information is shown according to an embodiment of the present application.
As shown in fig. 6, the scene depth inference apparatus 600 based on historical information may include:
the first obtaining module 610 is configured to obtain a first image frame and a second image frame of an image to be detected, where the first image frame is an image frame of a previous moment of the second image frame;
a second obtaining module 620, configured to obtain a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
a first processing module 630, configured to calculate a first error by using the error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
an updating module 640, configured to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network by using the first error as a guidance signal, so as to obtain a second depth weight and a second motion weight;
a second processing module 650, configured to calculate a second error by using the error calculation module according to the first image frame, the second depth weight and the second motion weight;
the determining module 660 is configured to determine a scene depth of the second image frame and a relative pose between the first image frame and the second image frame according to the first error and the second error.
Optionally, the camera motion network comprises an encoder, a temporal attention module and a spatiotemporal correlation module;
the encoder is used for extracting the characteristics of the stacked image frames to obtain a stacked characteristic diagram; the stacked image frame is obtained by stacking a first image frame and a second image frame according to the channel dimension;
the time attention module is used for establishing a global dependency relationship between the information of the history memory unit and the information of the current input unit, injecting the information in the globally relevant history memory unit into the current input unit through the updating unit, and storing the globally relevant information in the current input unit into the history memory unit to be used as the history memory unit at the next moment; the current input unit comprises a stacking characteristic diagram, the stacking characteristic diagram is updated into an updated characteristic diagram through an updating unit, the history memory unit comprises a first memory characteristic diagram and a first time characteristic diagram, and the history memory unit at the next moment comprises a second memory characteristic diagram and a second time characteristic diagram;
the spatiotemporal correlation module is used for modeling the updated feature map/the second memory feature map into a first/second spatiotemporal feature map with spatial correlation respectively.
Optionally, the scene depth inference apparatus 600 based on historical information is further configured to:
injecting the feature information in the stacking feature map and the feature information of the first memory feature map into the first time feature map to obtain a third time feature map;
determining a time attention feature vector according to the third time feature map;
determining a first feature vector according to the stacking feature map;
determining a second feature vector according to the first memory feature map;
determining an input feature vector based on the time attention according to the first feature vector and the time attention feature vector;
determining a memory feature vector based on the time attention according to the second feature vector and the time attention feature vector;
respectively adjusting an input feature vector based on time attention and a memory feature vector based on time attention into a corresponding first feature map and a corresponding second feature map;
determining an updated feature map and a second memory feature map according to the first feature map and the second feature map;
and updating the third time characteristic diagram into the second time characteristic diagram according to the updated characteristic diagram and the second memory characteristic diagram.
Optionally, the scene depth inference apparatus 600 based on historical information is further configured to:
slicing the updated feature map in the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
correspondingly adjusting the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector respectively;
calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weighting the third sub-eigenvector by using the first spatial correlation matrix to obtain a first spatial correlation eigenvector;
determining a first space-time eigenvector according to the first space-related eigenvector and the third eigenvector;
adjusting the first space-time feature vector into a first space-time feature map with spatial correlation;
slicing the second memory feature map in a channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
respectively adjusting the second memory characteristic diagram, the fourth sub-characteristic diagram, the fifth sub-characteristic diagram and the sixth sub-characteristic diagram into a fourth characteristic vector, a fourth sub-characteristic vector, a fifth sub-characteristic vector and a sixth sub-characteristic vector;
calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weighting the sixth sub-eigenvector by using the second spatial correlation matrix to obtain a second spatial correlation eigenvector;
determining a second space-time feature vector according to the second space-related feature vector and the fourth feature vector;
adjusting a second spatio-temporal feature vector to the second spatio-temporal feature map with spatial correlation.
Optionally, the first processing module 630 is further configured to:
inputting a first image frame and a second image frame into a depth estimation network which is constructed in advance, obtaining a first scene depth of the first image frame and a first encoder characteristic map of the first image frame according to the first image frame and a first depth weight, and obtaining a second scene depth of the second image frame and a second encoder characteristic map of the second image frame according to the second image frame and the first depth weight;
inputting a first image frame and a second image frame into a camera motion network which is constructed in advance, and obtaining a first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and a first motion weight;
calculating a first error by adopting an error calculation module according to a first image frame, a second image frame, a first scene depth, a second scene depth, a first encoder characteristic diagram, a second encoder characteristic diagram and a first relative pose;
optionally, the second processing module 650 is further configured to:
inputting a first image frame and a second image frame into a depth estimation network which is constructed in advance, obtaining a third scene depth of the first image frame and a third encoder characteristic map of the first image frame according to the first image frame and a second depth weight, and obtaining a fourth scene depth of the second image frame and a fourth encoder characteristic map of the second image frame according to the second image frame and the second depth weight;
inputting the first image frame and the second image frame into a camera motion network which is constructed in advance, and obtaining a second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
and calculating a second error by adopting an error calculation module according to the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder characteristic diagram, the fourth encoder characteristic diagram and the second relative pose.
Optionally, the determining module 660 is further configured to:
if the first error is larger than the second error, taking the second scene depth as the scene depth of the second image frame, and taking the first relative pose as the relative pose between the first image frame and the second image frame;
and if the first error is smaller than or equal to the second error, taking the fourth scene depth as the scene depth of the second image frame, and taking the second relative pose as the relative pose between the first image frame and the second image frame.
Optionally, the total error includes a first error and a second error; the total error is determined according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smooth loss error.
Optionally, the first processing module 630 or the second processing module 650 is further configured to:
acquiring a first image coordinate of a first image frame and a second image coordinate of a second image frame;
determining a first world coordinate of a first image frame according to the first image coordinate, camera internal parameters and a first scene depth;
determining a second world coordinate of a second image frame according to the second image coordinate, the camera internal parameter and the second scene depth;
affine transformation is carried out on the first world coordinate of the first image frame to a second image frame panel, and a third world coordinate after affine transformation is determined;
affine transforming the second world coordinate of the second image frame to the first image frame panel, and determining a fourth world coordinate after affine transformation;
respectively projecting the third world coordinate and the fourth world coordinate to a two-dimensional plane to obtain the scene depth after the first affine transformation, the scene depth after the second affine transformation and the corresponding image coordinate after the first affine transformation and the image coordinate after the second affine transformation;
determining a scene depth structure consistency error, a first depth structure inconsistency weight and a second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining a first camera flow consistency occlusion mask and a second camera flow consistency occlusion mask according to a first image coordinate of a first image frame, a second affine-transformed image coordinate, a second image coordinate of a second image frame and a first affine-transformed image coordinate;
determining an image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera stream consistency occlusion mask and the second camera stream consistency occlusion mask;
determining a characteristic perception loss error according to the first image frame, the second image frame, the image coordinate after the first affine transformation and the image coordinate after the second affine transformation;
determining a smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
and determining the total error according to the image synthesis error, the scene depth structure consistency error, the characteristic perception loss error and the smooth loss error.
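For ease of understanding, the following non-limiting PyTorch sketch shows how a warping-based total error of this kind could be assembled. It is a simplification under several assumptions: K is a shared 3x3 camera intrinsic matrix, the depths have shape B x 1 x H x W, only the warping from the first image frame onto the second image frame plane is shown, the depth structure inconsistency weights, the camera flow consistency occlusion masks and the feature perception loss term are omitted, and the weighting coefficients are hypothetical.

```python
import torch
import torch.nn.functional as F


def backproject(depth, K):
    """Lift each pixel to a camera-frame 3D point using its depth and the intrinsics K."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)  # 3 x HW
    cam = torch.linalg.inv(K) @ pix                                         # 3 x HW
    return cam.unsqueeze(0) * depth.reshape(b, 1, -1)                       # B x 3 x HW


def project(points, K, h, w):
    """Project camera-frame 3D points to a normalized sampling grid and a warped depth map."""
    pix = K @ points                                 # B x 3 x HW
    z = pix[:, 2:].clamp(min=1e-6)
    u = 2.0 * (pix[:, 0:1] / z) / (w - 1) - 1.0
    v = 2.0 * (pix[:, 1:2] / z) / (h - 1) - 1.0
    grid = torch.cat([u, v], dim=1).reshape(-1, 2, h, w).permute(0, 2, 3, 1)  # B x H x W x 2
    return grid, pix[:, 2:].reshape(-1, 1, h, w)


def total_error(img1, img2, depth1, depth2, pose_1to2, K):
    """Simplified one-direction total error; the described method also warps in the
    opposite direction and weights the terms with the inconsistency weights and
    occlusion masks, which are omitted here."""
    b, _, h, w = img1.shape
    R, t = pose_1to2[:, :3, :3], pose_1to2[:, :3, 3:]
    cam1 = backproject(depth1, K)                    # frame-1 points in frame-1 coordinates
    cam1_in_2 = R @ cam1 + t                         # affine transform to the frame-2 plane
    grid_1to2, depth_1to2 = project(cam1_in_2, K, h, w)
    # Synthesize frame 1 from frame 2 and resample frame 2's depth at the same locations.
    img2_warped = F.grid_sample(img2, grid_1to2, align_corners=True)
    depth2_warped = F.grid_sample(depth2, grid_1to2, align_corners=True)
    # Image synthesis error (photometric difference against the synthesized view).
    synthesis = (img1 - img2_warped).abs().mean()
    # Scene depth structure consistency error (relative depth difference).
    depth_consistency = ((depth_1to2 - depth2_warped).abs()
                         / (depth_1to2 + depth2_warped).clamp(min=1e-6)).mean()
    # Smoothing loss error on the predicted depth (image-gradient weighting omitted).
    smoothness = depth1.diff(dim=-1).abs().mean() + depth1.diff(dim=-2).abs().mean()
    # The feature perception loss would compare encoder feature maps sampled at the
    # warped coordinates; it is omitted from this sketch.
    w_syn, w_geo, w_smooth = 1.0, 0.5, 0.1           # hypothetical weighting coefficients
    return w_syn * synthesis + w_geo * depth_consistency + w_smooth * smoothness
```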
The scene depth inference device based on historical information provided by this embodiment can implement the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, showing an electronic device 300 suitable for implementing the embodiments of the present application.
As shown in fig. 7, the electronic device 300 includes a Central Processing Unit (CPU) 301 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the device 300 are also stored. The CPU 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, and the like; an output section 307 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card or a modem. The communication section 309 performs communication processing via a network such as the internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 310 as necessary, so that the computer program read out therefrom is installed into the storage section 308 as necessary.
In particular, the process described above with reference to fig. 1 may be implemented as a computer software program, according to an embodiment of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the above-described historical information-based scene depth inference method. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 309, and/or installed from the removable medium 311.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor. The names of these units or modules do not in some cases constitute a limitation of the unit or module itself.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a gaming console, a tablet computer, a wearable device, or a combination of any of these devices.
As another aspect, the present application also provides a storage medium, which may be the storage medium included in the foregoing apparatus in the foregoing embodiment; or may be a storage medium that exists separately and is not assembled into the device. The storage medium stores one or more programs used by one or more processors to perform the scene depth inference method based on historical information described herein.
Storage media, including permanent and non-permanent, removable and non-removable media, may implement the information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and portions that are similar to each other in the embodiments are referred to each other, and each embodiment focuses on differences from other embodiments. In particular, as for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and the relevant points can be referred to the partial description of the method embodiment.

Claims (10)

1. A scene depth inference method based on historical information is characterized by comprising the following steps:
acquiring a first image frame and a second image frame of an image to be detected, wherein the first image frame is an image frame at the previous moment of the second image frame;
acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
calculating a first error by using an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
jointly updating a first depth weight of the depth estimation network and a first motion weight of the camera motion network by taking the first error as a guide signal to obtain a second depth weight and a second motion weight;
calculating a second error by using the error calculation module according to the first image frame, the second depth weight and the second motion weight;
determining a scene depth of the second image frame and a relative pose between the first image frame and the second image frame according to the first error and the second error.
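For ease of understanding, a non-limiting PyTorch sketch of the error-guided weight update in claim 1 is given below. The error_module is assumed to be a callable that returns a differentiable scalar error from the two image frames and the two networks, and a single Adam gradient step stands in for the joint update; neither assumption limits the claim.

```python
import copy
import torch


def refine_and_score(depth_net, motion_net, error_module, frame1, frame2, lr=1e-4):
    """One error-guided refinement step: compute the first error with the first
    weights, jointly update copies of both networks, then compute the second
    error with the updated (second) weights."""
    depth_net_2 = copy.deepcopy(depth_net)      # keeps the first depth weight untouched
    motion_net_2 = copy.deepcopy(motion_net)    # keeps the first motion weight untouched
    params = list(depth_net_2.parameters()) + list(motion_net_2.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)

    # First error, computed with the (copied) first depth weight and first motion weight.
    first_error = error_module(frame1, frame2, depth_net_2, motion_net_2)

    # Use the first error as the guide signal to jointly update both networks.
    optimizer.zero_grad()
    first_error.backward()
    optimizer.step()

    # Second error, computed with the second depth weight and second motion weight.
    second_error = error_module(frame1, frame2, depth_net_2, motion_net_2)
    return first_error.detach(), second_error.detach(), depth_net_2, motion_net_2
```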
2. The method of claim 1, wherein the camera motion network comprises an encoder, a temporal attention module, and a spatiotemporal correlation module;
the encoder is used for extracting the features of the stacked image frames to obtain a stacked feature map; the stacked image frames are obtained by stacking the first image frame and the second image frame according to a channel dimension;
the time attention module is used for establishing a global dependency relationship between the information of the history memory unit and the information of the current input unit, injecting, through the updating unit, the information in the history memory unit that is globally related to the current input unit, and storing the globally related information in the current input unit into the history memory unit as the history memory unit at the next moment; the current input unit comprises the stacking feature map, the stacking feature map is updated into an updated feature map by the updating unit, the history memory unit comprises a first memory feature map and a first time feature map, and the history memory unit at the next moment comprises a second memory feature map and a second time feature map;
the spatiotemporal correlation module is used for modeling the updated feature map and the second memory feature map into a first spatiotemporal feature map and a second spatiotemporal feature map with spatial correlation, respectively.
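For ease of understanding, the following non-limiting PyTorch sketch illustrates one possible wiring of the encoder, the temporal attention module and the spatiotemporal correlation module of claim 2; the pose regression head and the internals of the sub-modules are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class CameraMotionNet(nn.Module):
    """Wiring sketch: encoder -> temporal attention (with history memory)
    -> spatiotemporal correlation -> pose regression (hypothetical head)."""

    def __init__(self, encoder, temporal_attention, spatiotemporal_correlation, pose_head):
        super().__init__()
        self.encoder = encoder
        self.temporal_attention = temporal_attention
        self.spatiotemporal_correlation = spatiotemporal_correlation
        self.pose_head = pose_head   # assumed head that regresses the relative pose

    def forward(self, frame1, frame2, memory):
        # Stack the two image frames along the channel dimension and extract features.
        stacked = torch.cat([frame1, frame2], dim=1)
        stacked_feat = self.encoder(stacked)
        # memory = (first memory feature map, first time feature map) from the previous moment.
        updated_feat, next_memory = self.temporal_attention(stacked_feat, memory)
        # Model spatial correlation for the updated feature map and the second memory feature map.
        st_feat_1 = self.spatiotemporal_correlation(updated_feat)
        st_feat_2 = self.spatiotemporal_correlation(next_memory[0])
        relative_pose = self.pose_head(st_feat_1, st_feat_2)
        return relative_pose, next_memory
```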
3. The method according to claim 2, wherein the establishing a global dependency relationship between the information of the history memory unit and the information of the current input unit, injecting, through the updating unit, the information in the history memory unit that is globally related to the current input unit, and storing the globally related information in the current input unit into the history memory unit as the history memory unit at the next moment comprises:
injecting feature information in the stacking feature map and feature information of the first memory feature map into the first time feature map to obtain a third time feature map;
determining a temporal attention feature vector according to the third temporal feature map;
determining a first feature vector according to the stacking feature map;
determining a second feature vector according to the first memory feature map;
determining a temporal attention-based input feature vector from the first feature vector and the temporal attention feature vector;
determining a temporal attention-based memory feature vector from the second feature vector and the temporal attention feature vector;
adjusting the temporal attention-based input feature vector and the temporal attention-based memory feature vector into a corresponding first feature map and a corresponding second feature map, respectively;
determining the updated feature map and the second memory feature map according to the first feature map and the second feature map;
and updating the third time feature map into the second time feature map according to the updated feature map and the second memory feature map.
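For ease of understanding, a non-limiting PyTorch sketch of the temporal attention update of claim 3 is given below; the 1x1 convolutions, the sigmoid gating and the assumption that all feature maps share one channel count are illustrative choices, not the claimed formulation.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Sketch of the temporal-attention update order of operations."""

    def __init__(self, channels):
        super().__init__()
        self.inject = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.attn = nn.Conv2d(channels, channels, kernel_size=1)
        self.fuse = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.time_update = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, stacked_feat, memory):
        memory_feat, time_feat = memory              # first memory / first time feature map
        b, c, h, w = stacked_feat.shape
        # Inject stacked-feature and memory-feature information into the time feature map.
        time_feat_3 = self.inject(torch.cat([stacked_feat, memory_feat, time_feat], dim=1))
        # Temporal attention feature vector from the third time feature map.
        attn_vec = torch.sigmoid(self.attn(time_feat_3)).reshape(b, c, -1)
        # First / second feature vectors from the stacked and first memory feature maps.
        v_input = stacked_feat.reshape(b, c, -1)
        v_memory = memory_feat.reshape(b, c, -1)
        # Temporal attention-based input / memory feature vectors, adjusted back to maps.
        f_input = (v_input * attn_vec).reshape(b, c, h, w)
        f_memory = (v_memory * attn_vec).reshape(b, c, h, w)
        # Updated feature map and second memory feature map.
        fused = self.fuse(torch.cat([f_input, f_memory], dim=1))
        updated_feat, memory_feat_2 = fused.chunk(2, dim=1)
        # Update the third time feature map into the second time feature map.
        time_feat_2 = self.time_update(torch.cat([updated_feat, memory_feat_2], dim=1))
        return updated_feat, (memory_feat_2, time_feat_2)
```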
4. The method of claim 2, wherein modeling the updated feature map as a first spatio-temporal feature map having spatial correlation comprises:
slicing the updated feature map in a channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
respectively adjusting the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector;
calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weighting the third sub-feature vector by using the first spatial correlation matrix to obtain a first spatial correlation feature vector;
determining a first space-time feature vector according to the first space-related feature vector and the third feature vector;
adjusting the first spatiotemporal feature vector to the first spatiotemporal feature map with spatial correlation;
modeling the second memory signature as a second spatiotemporal signature with spatial correlation, comprising:
slicing the second memory feature map in a channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
respectively adjusting the second memory feature map, the fourth sub-feature map, the fifth sub-feature map and the sixth sub-feature map into a fourth feature vector, a fourth sub-feature vector, a fifth sub-feature vector and a sixth sub-feature vector;
calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weighting the sixth sub-feature vector by using the second spatial correlation matrix to obtain a second spatial correlation feature vector;
determining a second space-time feature vector according to the second space-related feature vector and the fourth feature vector;
adjusting the second spatio-temporal feature vector to the second spatio-temporal feature map with spatial correlation.
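For ease of understanding, a non-limiting PyTorch sketch of the channel-sliced spatial correlation of claim 4 is given below; the softmax normalization, the scaling factor and the residual combination with the full feature map are illustrative assumptions rather than the claimed formulation.

```python
import torch
import torch.nn as nn


class SpatioTemporalCorrelation(nn.Module):
    """Sketch of the channel-sliced spatial-correlation step."""

    def __init__(self, channels):
        super().__init__()
        assert channels % 3 == 0, "sketch assumes the channel count is divisible by 3"
        self.lift = nn.Conv2d(channels // 3, channels, kernel_size=1)

    def forward(self, feat):
        b, c, h, w = feat.shape
        cs = c // 3
        # Slice the feature map into three sub-feature maps along the channel dimension.
        sub1, sub2, sub3 = feat.chunk(3, dim=1)
        # Adjust the sub-feature maps into vectors (flatten the spatial dimensions).
        v1 = sub1.reshape(b, cs, -1)                 # B x C/3 x HW
        v2 = sub2.reshape(b, cs, -1)
        v3 = sub3.reshape(b, cs, -1)
        # Spatial correlation matrix between the first and second sub-feature maps.
        corr = torch.softmax(v1.transpose(1, 2) @ v2 / cs ** 0.5, dim=-1)   # B x HW x HW
        # Weight the third sub-feature vector with the correlation matrix.
        spatial = (v3 @ corr.transpose(1, 2)).reshape(b, cs, h, w)
        # Combine the spatially-correlated features with the full feature map and
        # return a feature map of the original shape (the spatio-temporal feature map).
        return feat + self.lift(spatial)
```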
5. The method of claim 1, wherein the calculating a first error by using an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight comprises:
inputting the first image frame and the second image frame into a depth estimation network which is constructed in advance, obtaining a first scene depth of the first image frame and a first encoder feature map of the first image frame according to the first image frame and the first depth weight, and obtaining a second scene depth of the second image frame and a second encoder feature map of the second image frame according to the second image frame and the first depth weight;
inputting the first image frame and the second image frame into a camera motion network which is constructed in advance, and obtaining a first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the first motion weight;
calculating the first error by using the error calculation module according to the first image frame, the second image frame, the first scene depth, the second scene depth, the first encoder feature map, the second encoder feature map and the first relative pose;
the calculating a second error by using the error calculation module according to the first image frame, the second depth weight and the second motion weight comprises:
inputting the first image frame and the second image frame into a depth estimation network which is constructed in advance, obtaining a third scene depth of the first image frame and a third encoder feature map of the first image frame according to the first image frame and the second depth weight, and obtaining a fourth scene depth of the second image frame and a fourth encoder feature map of the second image frame according to the second image frame and the second depth weight;
inputting the first image frame and the second image frame into a camera motion network which is constructed in advance, and obtaining a second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
calculating the second error by using the error calculation module according to the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder feature map, the fourth encoder feature map and the second relative pose.
6. The method of claim 5, wherein the determining a scene depth of the second image frame and a relative pose between the first image frame and the second image frame from the first error and the second error comprises:
if the first error is larger than the second error, taking the second scene depth as the scene depth of the second image frame, and taking the first relative pose as the relative pose between the first image frame and the second image frame;
if the first error is smaller than or equal to the second error, the fourth scene depth is used as the scene depth of the second image frame, and the second relative pose is used as the relative pose between the first image frame and the second image frame.
7. The method of claim 5 or 6, wherein the total error comprises the first error and the second error;
the total error is determined according to an image synthesis error, a scene depth structure consistency error, a feature perception loss error and a smooth loss error.
8. The method of claim 7, wherein the determining of the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error comprises:
acquiring a first image coordinate of the first image frame and a second image coordinate of the second image frame;
determining first world coordinates of the first image frame according to the first image coordinates, camera internal parameters and the first scene depth;
determining second world coordinates of the second image frame according to the second image coordinates, the camera internal parameters and the second scene depth;
affine transforming the first world coordinates of the first image frame to the second image frame plane, and determining third world coordinates after affine transformation;
affine transforming the second world coordinate of the second image frame to the first image frame plane, and determining a fourth world coordinate after affine transformation;
respectively projecting the third world coordinate and the fourth world coordinate to a two-dimensional plane to obtain a scene depth after first affine transformation, a scene depth after second affine transformation and corresponding image coordinates after first affine transformation and image coordinates after second affine transformation;
determining the scene depth structure consistency error, a first depth structure inconsistency weight and a second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining a first camera flow consistency occlusion mask and a second camera flow consistency occlusion mask according to a first image coordinate of the first image frame, the second affine-transformed image coordinate, a second image coordinate of the second image frame, and the first affine-transformed image coordinate;
determining the image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera stream consistency occlusion mask, and the second camera stream consistency occlusion mask;
determining the feature perception loss error according to the first image frame, the second image frame, the first affine-transformed image coordinate and the second affine-transformed image coordinate;
determining the smoothing loss error from the first scene depth, the second scene depth, the first image frame, and the second image frame;
and determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
9. An apparatus for scene depth inference based on historical information, the apparatus comprising:
the first acquisition module is used for acquiring a first image frame and a second image frame of an image to be detected, wherein the first image frame is an image frame at a previous moment of the second image frame;
the second acquisition module is used for acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
a first processing module, configured to calculate a first error by using an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
the updating module is used for jointly updating a first depth weight of the depth estimation network and a first motion weight of the camera motion network by taking the first error as a guide signal to obtain a second depth weight and a second motion weight;
a second processing module, configured to calculate a second error by using the error calculation module according to the first image frame, the second depth weight and the second motion weight;
a determining module, configured to determine a scene depth of the second image frame and a relative pose between the first image frame and the second image frame according to the first error and the second error.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for scene depth inference based on historical information as claimed in any one of claims 1-8 when executing the program.
CN202210139037.7A 2022-02-15 2022-02-15 Scene depth reasoning method and device based on historical information and electronic equipment Pending CN114627176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210139037.7A CN114627176A (en) 2022-02-15 2022-02-15 Scene depth reasoning method and device based on historical information and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210139037.7A CN114627176A (en) 2022-02-15 2022-02-15 Scene depth reasoning method and device based on historical information and electronic equipment

Publications (1)

Publication Number Publication Date
CN114627176A true CN114627176A (en) 2022-06-14

Family

ID=81898033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210139037.7A Pending CN114627176A (en) 2022-02-15 2022-02-15 Scene depth reasoning method and device based on historical information and electronic equipment

Country Status (1)

Country Link
CN (1) CN114627176A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination