CN114627176A - Scene depth reasoning method and device based on historical information and electronic equipment


Info

Publication number
CN114627176A
Authority
CN
China
Prior art keywords
image frame
error
feature map
depth
weight
Prior art date
Legal status
Pending
Application number
CN202210139037.7A
Other languages
Chinese (zh)
Inventor
王飞
程俊
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210139037.7A priority Critical patent/CN114627176A/en
Publication of CN114627176A publication Critical patent/CN114627176A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/50 - Depth or shape recovery
    • G06T 7/55 - Depth or shape recovery from multiple images
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06T 3/02
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a scene depth inference method and device based on historical information, and an electronic device. The method includes: acquiring a first image frame and a second image frame; acquiring a first depth weight of a depth estimation network and a first motion weight of a camera motion network; calculating a first error with an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight; jointly updating the first depth weight of the depth estimation network and the first motion weight of the camera motion network with the first error as a guidance signal to obtain a second depth weight and a second motion weight; calculating a second error with the error calculation module; and determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error. The scheme can reduce the influence of erroneous affine transformations caused by inaccurate poses.

Description

Scene depth reasoning method and device based on historical information and electronic equipment
Technical Field
The application belongs to the technical field of computer vision and image processing, and particularly relates to a scene depth inference method and device based on historical information and electronic equipment.
Background
Accurately recovering scene depth from two-dimensional images facilitates a better understanding of the three-dimensional structure of a scene and thereby helps to accomplish various visual tasks. However, an ordinary camera captures two-dimensional images and loses the depth information of the scene, so recovering scene depth from two-dimensional images or video sequences has become a fundamental and extremely challenging task in the field of computer vision. Although competitive scene depth can currently be recovered from two-dimensional images, a large amount of manually labeled data is needed to train the neural network, which is time-consuming and labor-intensive; moreover, once model training is completed the model weights are frozen, which reduces the generalization ability of the algorithm to unknown scenes. In addition, fully unsupervised schemes that recover scene depth from two-dimensional images must simultaneously predict the camera pose from adjacent frames, and an inaccurate pose produces erroneous affine transformation results, which directly affects the quality of the synthesized images and in turn the quality of the recovered scene depth.
Disclosure of Invention
The embodiment of the specification aims to provide a scene depth inference method and device based on historical information and electronic equipment.
In order to solve the above technical problem, the embodiments of the present application are implemented as follows:
in a first aspect, the present application provides a scene depth inference method based on historical information, including:
acquiring a first image frame and a second image frame of an image to be detected, wherein the first image frame is an image frame at the previous moment of the second image frame;
acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
calculating a first error by adopting an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
the first error is used as a guide signal to jointly update a first depth weight of a depth estimation network and a first motion weight of a camera motion network, so that a second depth weight and a second motion weight are obtained;
calculating a second error by adopting an error calculation module according to the first image frame, the second depth weight and the second motion weight;
and determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
In a second aspect, the present application provides a scene depth inference device based on historical information, the device including:
the first acquisition module is used for acquiring a first image frame and a second image frame of an image to be detected, wherein the first image frame is an image frame at the previous moment of the second image frame;
the second acquisition module is used for acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
the first processing module is used for calculating a first error by adopting the error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
the updating module is used for jointly updating a first depth weight of the depth estimation network and a first motion weight of the camera motion network by taking the first error as a guide signal to obtain a second depth weight and a second motion weight;
the second processing module is used for calculating a second error by adopting the error calculation module according to the first image frame, the second depth weight and the second motion weight;
and the determining module is used for determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the scene depth inference method based on historical information according to the first aspect.
As can be seen from the technical solutions provided in the embodiments of this specification, the solution recovers scene depth from two-dimensional images in a fully unsupervised manner, injects the historical frame information held in the memory unit into the current input unit through a temporal attention module, and models the spatial correlation of the spatio-temporal feature map, thereby improving the accuracy of the camera pose and reducing the influence of erroneous affine transformations caused by inaccurate poses; during inference, online decision inference is used to improve the generalization ability of the algorithm to unknown scenes.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.
FIG. 1 is a schematic flow chart of a scene depth inference method based on historical information according to the present application;
FIG. 2 is a block diagram of joint training of a depth estimation network and a camera motion network provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a temporal attention module provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a spatiotemporal correlation module provided in an embodiment of the present application;
fig. 5 is a schematic view of a scene depth inference process provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a scene depth inference apparatus based on historical information according to the present application;
fig. 7 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art without making creative efforts based on the embodiments in the present specification shall fall within the protection scope of the present specification.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system architectures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments described herein without departing from the scope or spirit of the application. Other embodiments will be apparent to the skilled person from the description of the present application. The specification and examples are exemplary only.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including but not limited to.
In the present application, "parts" are parts by mass unless otherwise specified.
The present invention will be described in further detail with reference to the drawings and examples.
Referring to fig. 1, a flowchart of a scene depth inference method based on historical information according to an embodiment of the present application is shown.
As shown in fig. 1, the scene depth inference method based on historical information may include:
s110, acquiring a first image frame and a second image frame of the image to be detected, wherein the first image frame is an image frame at the previous moment of the second image frame.
The image to be detected is any one image of which the scene depth needs to be inferred and predicted, and the image to be detected can be a two-dimensional image.
The image to be detected is cut into a plurality of adjacent image frames at equal time intervals. The first image frame and the second image frame of the image to be detected are two image frames at adjacent moments, and the first image frame is the image frame at the moment immediately before the second image frame; for example, the first image frame is the image frame at time t-1 and the second image frame is the image frame at time t, or the first image frame is the image frame at time t and the second image frame is the image frame at time t+1.
S120, acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network.
The depth estimation network is used to estimate the scene depth of a two-dimensional image. The depth estimation network may adopt a neural network with an encoder-decoder structure; the type and network structure of the neural network adopted by the depth estimation network are not limited here.
The camera motion network is used to predict the relative pose between adjacent image frames.
Both the depth estimation network and the camera motion network are pre-constructed and trained.
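For illustration only, a minimal encoder-decoder depth network of the kind referred to above might be sketched as follows in PyTorch; the layer sizes, activations and the output range are assumptions made for this sketch and are not part of the disclosed embodiment.

    import torch
    import torch.nn as nn

    class SimpleDepthNet(nn.Module):
        """Illustrative encoder-decoder depth estimation network (hypothetical sizes)."""
        def __init__(self):
            super().__init__()
            # Encoder: progressively downsample the RGB input (C = 3).
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
            # Decoder: upsample back to the input resolution and predict one depth channel.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            feat = self.encoder(x)      # encoder feature map, e.g. X_(enc,t)
            depth = self.decoder(feat)  # per-pixel scene depth (squashed to (0, 1) here, an assumption)
            return depth, feat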
Referring to fig. 2, a block diagram of the joint training of the depth estimation network and the camera motion network provided by an embodiment of the present application is shown (the leftmost original image, the scene depth picture and the synthesized view in fig. 2 are originally color images, shown here in grayscale). It can be understood that, to train the depth estimation network and the camera motion network, training set data is obtained first. In the present application, the training set data consists of groups of image frames at three adjacent moments in an image sequence; for example, the image frames at times t-1, t and t+1 form one sample of the training data set, the image frames at times t-2, t-1 and t form another sample, and so on. According to the training block diagram shown in fig. 2, the image frames at times t-1, t and t+1 are used as the input of the camera motion network and the depth estimation network, and the depth estimation network and the camera motion network are jointly trained with the objective function, that is, the objective function guides the update of the depth weight of the depth estimation network and the motion weight of the camera motion network.
It can be understood that before the image frames at times t-1, t and t+1 are input into the depth estimation network and the camera motion network, image preprocessing is performed: for example, the image data is randomly flipped, randomly cropped and normalized, and the processed data is converted into a tensor with dimensions C × H × W (the batch dimension is omitted). Here C denotes the channel dimension of the sample: for the depth estimation network C is 3; for the camera motion network C is 9 during training (the input is 3 stacked image frames at adjacent times) and C is 6 during test inference (that is, when the scene depth inference method based on historical information of the present application is performed), because the input is 2 stacked image frames at adjacent times. H denotes the height of the input sample image, for example H = 256, and W denotes the width of the input sample image, for example W = 832.
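A minimal sketch of the preprocessing described above (random flip, random crop, normalization, conversion to a C × H × W tensor and channel-wise stacking of adjacent frames) is given below; the use of torchvision, the normalization statistics and the helper name stack_frames are illustrative assumptions.

    import torch
    from torchvision import transforms

    # Illustrative preprocessing: flip / crop / normalize and convert to a C x H x W tensor.
    # H = 256 and W = 832 follow the example sizes in the text; the mean/std values are assumed.
    preprocess = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomCrop((256, 832)),
        transforms.ToTensor(),  # -> C x H x W, values in [0, 1]
        transforms.Normalize(mean=[0.45, 0.45, 0.45], std=[0.225, 0.225, 0.225]),
    ])

    # Stacking three adjacent frames along the channel dimension gives C = 9 for the camera
    # motion network during training; two adjacent frames give C = 6 at inference time.
    # Note: in practice the same random flip/crop should be applied consistently to all frames
    # of one group; applying `preprocess` per frame here is a simplification.
    def stack_frames(frames):
        return torch.cat([preprocess(f) for f in frames], dim=0)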
With continued reference to FIG. 2, the camera motion network may include an encoder, a temporal attention module, and a spatiotemporal correlation module.
The encoder is used for extracting the features of the stacked image frames to obtain a stacked feature map; the stacked image frames are obtained by stacking a first image frame and a second image frame according to a channel dimension.
The time attention module is used for establishing a global dependency relationship between the information of the history memory unit and the information of the current input unit, injecting the information in the globally relevant history memory unit into the current input unit through the updating unit, and storing the globally relevant information in the current input unit into the history memory unit to be used as the history memory unit at the next moment; the current input unit comprises a stacking feature map, the stacking feature map is updated into an updated feature map through an updating unit, the history storage unit comprises a first memory feature map and a first time feature map, and the history storage unit at the next moment comprises a second memory feature map and a second time feature map.
The spatiotemporal correlation module is used for modeling the updated/second memory feature maps into first/second spatiotemporal feature maps with spatial correlation, respectively.
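To make the composition of the camera motion network concrete, a minimal skeleton is sketched below; the interfaces of the sub-modules (encoder, temporal attention module, spatio-temporal correlation module), the pooling step and the 6-DoF pose head are assumptions made for illustration, not the exact disclosed implementation.

    import torch
    import torch.nn as nn

    class CameraMotionNet(nn.Module):
        """Illustrative skeleton: encoder -> temporal attention -> spatio-temporal correlation -> pose."""
        def __init__(self, encoder, temporal_attention, spatiotemporal_correlation, feat_dim=128):
            super().__init__()
            self.encoder = encoder                            # extracts the stacked feature map
            self.temporal_attention = temporal_attention      # assumed to return (updated map, memory map, time map)
            self.spatiotemporal_correlation = spatiotemporal_correlation
            self.pose_head = nn.Linear(feat_dim, 6)           # 6-DoF relative pose (3 rotation + 3 translation)

        def forward(self, stacked_frames, memory_map, time_map):
            x = self.encoder(stacked_frames)                  # stacked feature map from adjacent frames
            x_updated, memory_map, time_map = self.temporal_attention(x, memory_map, time_map)
            st_feat = self.spatiotemporal_correlation(x_updated)
            pooled = st_feat.mean(dim=(2, 3))                 # global average pool over H and W
            pose = self.pose_head(pooled)                     # relative pose between the two frames
            return pose, memory_map, time_map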
In one embodiment, the temporal attention module uses the shared temporal attention weights to establish a global dependency relationship between the information in the history memory unit and the information in the current input unit, injects the globally correlated information in the history memory unit into the current input unit via the update unit, and stores the globally correlated information in the current input unit into the history memory unit as the history memory unit at the next moment; this includes:
injecting the feature information in the stacking feature map and the feature information of the first memory feature map into the first time feature map to obtain a third time feature map;
determining a time attention feature vector according to the third time feature map;
determining a first feature vector according to the stacking feature map;
determining a second feature vector according to the first memory feature map;
Determining an input feature vector based on the time attention according to the first feature vector and the time attention feature vector;
determining a memory feature vector based on the time attention according to the second feature vector and the time attention feature vector;
respectively adjusting an input feature vector based on time attention and a memory feature vector based on time attention into a corresponding first feature map and a corresponding second feature map;
determining an updated feature map and a second memory feature map according to the first feature map and the second feature map;
and updating the third time characteristic diagram into the second time characteristic diagram according to the updated characteristic diagram and the second memory characteristic diagram.
For example, refer to FIG. 3, which shows a schematic diagram of the temporal attention module provided by an embodiment of the application. For convenience of description, assume that the input feature map at time t (i.e., the stacked feature map) is denoted X_(i,t) ∈ R^(C×H×W), the time feature map at time t-1 (i.e., the first time feature map) is denoted X_(time,t-1) ∈ R^(C×H×W), and the memory feature map at time t-1 (i.e., the first memory feature map) is denoted X_(m,t-1) ∈ R^(C×H×W).
The calculation process is as follows:
1) According to formula (1), the feature information in the input feature map X_(i,t) at time t and the feature information in the memory feature map X_(m,t-1) at time t-1 are injected into the time feature map X_(time,t-1) at time t-1 to obtain the third time feature map X_(qk_time,t-1):

X_(qk_time,t-1) = δ_gelu(W_(i,t) X_(i,t) + b_(i,t) + W_(time,t-1) X_(time,t-1) + b_(time,t-1) + W_(m,t-1) X_(m,t-1) + b_(m,t-1))    (1)

where R^(C×H×W) denotes the feature space, C denotes the number of feature map channels, H denotes the height of the feature map, W denotes the width of the feature map, δ_gelu(·) denotes the activation function, W_(i,t), W_(time,t-1), W_(m,t-1) denote the learned corresponding weights, and b_(i,t), b_(time,t-1), b_(m,t-1) denote the corresponding bias terms.
2) According to formula group (2), the temporal attention feature vector x_(qk_time,t-1) is calculated from the third time feature map X_(qk_time,t-1): the feature map is sliced along the channel dimension by F_split(·), the resulting parts are adjusted to feature vectors by F_reshape(·), and they are combined using the scalar scaling factor s, the element-wise product and the transpose together with learned weights and bias terms, where s denotes a scalar scaling factor, "*" denotes the product of corresponding elements, the function F_split(·) slices a feature map along the channel dimension, the function F_reshape(·) adjusts a feature map or feature vector to a predetermined shape, and T denotes the transpose.
3) According to formula group (3), the input feature map X_(i,t) at time t is adjusted to a feature vector x_(i,t) (i.e., the first feature vector), and the memory feature map X_(m,t-1) at time t-1 is adjusted to a feature vector x_(m,t-1) (i.e., the second feature vector):

x_(i,t) = F_reshape(X_(i,t))
x_(m,t-1) = F_reshape(X_(m,t-1))    (3)
4) According to formula group (4), the input feature vector based on temporal attention x_(qk_i,t) and the memory feature vector based on temporal attention x_(qk_m,t-1) are calculated by weighting the first feature vector x_(i,t) and the second feature vector x_(m,t-1) with the temporal attention feature vector x_(qk_time,t-1).
5) According to formula group (5), the feature vectors x_(qk_i,t) and x_(qk_m,t-1) are adjusted to the corresponding feature maps X_(qk_i,t) (i.e., the first feature map) and X_(qk_m,t-1) (i.e., the second feature map):

X_(qk_i,t) = F_reshape(x_(qk_i,t))
X_(qk_m,t-1) = F_reshape(x_(qk_m,t-1))    (5)
6) According to formula (6), the information selection gate G_s is calculated, which selectively injects the information in the memory feature map X_(qk_m,t-1) at time t-1 into the input feature map X_(qk_i,t) at time t:

G_s = δ_sig(W_(qk_ms,t-1) X_(qk_m,t-1) + b_(qk_ms,t-1) + W_(qk_is,t) X_(qk_i,t) + b_(qk_is,t))    (6)

where the function δ_sig(·) denotes the sigmoid activation function, W_(qk_ms,t-1) and W_(qk_is,t) denote the corresponding weights, and b_(qk_ms,t-1) and b_(qk_is,t) denote the corresponding bias terms.
7) According to formula (7), a new feature map X_(im,t) containing the information of the memory feature map is calculated:

X_(im,t) = δ_tanh(W_(qk_imi,t) X_(qk_i,t) + b_(qk_imi,t) + G_s * (W_(qk_imm,t-1) X_(qk_m,t-1) + b_(qk_imm,t-1)))    (7)

where the function δ_tanh(·) denotes the tanh activation function, W_(qk_imi,t) and W_(qk_imm,t-1) denote the corresponding weights, and b_(qk_imi,t) and b_(qk_imm,t-1) denote the corresponding bias terms.
8) According to formula group (8), the memory gate G_r is calculated, which updates the information of the memory feature map X_(qk_m,t-1) at time t-1 into the memory feature map X_(m,t) at time t (i.e., the second memory feature map):

G_r = δ_sig(W_(qk_ir,t) X_(qk_i,t) + b_(qk_ir,t) + W_(qk_mr,t-1) X_(qk_m,t-1) + b_(qk_mr,t-1))
X_(m,t) = (1 - G_r) * X_(im,t) + G_r * X_(qk_m,t-1)    (8)

where W_(qk_ir,t) and W_(qk_mr,t-1) denote the corresponding weights, and b_(qk_ir,t) and b_(qk_mr,t-1) denote the corresponding bias terms.
9) According to formula group (9), the output gate G_o is calculated, which updates the input feature map X_(qk_i,t) to obtain the updated new feature map X_(io,t) (i.e., the updated feature map), used as the input feature map at the next moment:

G_o = δ_sig(W_(qk_io,t) X_(qk_i,t) + b_(qk_io,t) + W_(qk_mo,t-1) X_(qk_m,t-1) + b_(qk_mo,t-1))
X_(io,t) = G_o * X_(im,t)    (9)

where W_(qk_io,t) and W_(qk_mo,t-1) denote the corresponding weights, and b_(qk_io,t) and b_(qk_mo,t-1) denote the corresponding bias terms.
10) According to formula (10), the third time feature map X_(qk_time,t-1) at time t-1 is updated to the time feature map X_(time,t) at time t (i.e., the second time feature map) according to the updated feature map X_(io,t) and the second memory feature map X_(m,t), where W_(time,t-1), W_(io,t) and W_(m,t) denote the corresponding weights, and b_(time,t-1), b_(io,t) and b_(m,t) denote the corresponding bias terms.
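A compact sketch of the gating computations in steps 6) to 9) above is given below; 1×1 convolutions stand in for the learned weights W_(·) and bias terms b_(·), which is an assumption made to obtain a runnable example rather than the exact disclosed implementation.

    import torch
    import torch.nn as nn

    class TemporalGating(nn.Module):
        """Sketch of formulas (6)-(9): information selection, memory and output gates
        over the attention-weighted input map X_(qk_i,t) and memory map X_(qk_m,t-1)."""
        def __init__(self, channels):
            super().__init__()
            # 1x1 convolutions stand in for the learned weights W_(.) and bias terms b_(.).
            self.w_is = nn.Conv2d(channels, channels, 1)
            self.w_ms = nn.Conv2d(channels, channels, 1)
            self.w_imi = nn.Conv2d(channels, channels, 1)
            self.w_imm = nn.Conv2d(channels, channels, 1)
            self.w_ir = nn.Conv2d(channels, channels, 1)
            self.w_mr = nn.Conv2d(channels, channels, 1)
            self.w_io = nn.Conv2d(channels, channels, 1)
            self.w_mo = nn.Conv2d(channels, channels, 1)

        def forward(self, x_qk_i, x_qk_m):
            g_s = torch.sigmoid(self.w_ms(x_qk_m) + self.w_is(x_qk_i))        # selection gate, formula (6)
            x_im = torch.tanh(self.w_imi(x_qk_i) + g_s * self.w_imm(x_qk_m))  # fused map X_(im,t), formula (7)
            g_r = torch.sigmoid(self.w_ir(x_qk_i) + self.w_mr(x_qk_m))        # memory gate, formula (8)
            x_m_new = (1 - g_r) * x_im + g_r * x_qk_m                         # second memory map X_(m,t)
            g_o = torch.sigmoid(self.w_io(x_qk_i) + self.w_mo(x_qk_m))        # output gate, formula (9)
            x_io = g_o * x_im                                                 # updated feature map X_(io,t)
            return x_io, x_m_new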
In order to infer camera motion using global spatial structure information of the feature map and the dependencies between the spatial structures, a spatiotemporal correlation module as shown in fig. 4 is constructed, modeling spatial context information using global spatial correlation weights, and constraining timing information between stacked frames by modeling the dependencies between feature map channels of the stacked frames.
In one embodiment, the spatiotemporal correlation module is to model the updated/second memory feature maps as first/second spatiotemporal feature maps having spatial correlations, respectively.
Wherein modeling the updated feature map as a first spatio-temporal feature map having spatial correlation comprises:
slicing the updated feature map in the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
respectively correspondingly adjusting the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector;
calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weighting the third sub-eigenvector by using the first spatial correlation matrix to obtain a first spatial correlation eigenvector;
determining a first space-time eigenvector according to the first space-related eigenvector and the third eigenvector;
and adjusting the first space-time feature vector into a first space-time feature map with spatial correlation.
Wherein modeling the second memory signature as a second spatiotemporal signature with spatial correlation comprises:
slicing the second memory feature map in a channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
respectively adjusting the second memory characteristic diagram, the fourth sub-characteristic diagram, the fifth sub-characteristic diagram and the sixth sub-characteristic diagram into a fourth characteristic vector, a fourth sub-characteristic vector, a fifth sub-characteristic vector and a sixth sub-characteristic vector;
calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weighting the sixth sub-eigenvector by using the second spatial correlation matrix to obtain a second spatial correlation eigenvector;
determining a second space-time feature vector according to the second space-related feature vector and the fourth feature vector;
and adjusting the second space-time feature vector into a second space-time feature map with spatial correlation.
Illustratively, referring to FIG. 4, the calculation is as follows:
1) According to formula group (11), the input feature map X_in (i.e., the updated feature map or the second memory feature map) is transformed with the weight W_mid and the corresponding bias term b_mid to obtain a feature map X_mid, which is then equally divided along the channel dimension into three sub-feature maps X_(mid,1), X_(mid,2) and X_(mid,3) (namely the first/fourth, second/fifth and third/sixth sub-feature maps, respectively) by the function F_split(·), where F_split(·) denotes the slicing of a feature map along the channel dimension.
2) According to formula group (12), the corresponding feature maps are adjusted to feature vectors, where the function F_reshape(·) adjusts a feature map or feature vector to a predetermined shape:

x_mid = F_reshape(X_mid)
x_(mid,1) = F_reshape(X_(mid,1))
x_(mid,2) = F_reshape(X_(mid,2))
x_(mid,3) = F_reshape(X_(mid,3))    (12)
3) According to formula (13), the spatial correlation matrix A_corr between the first/fourth sub-feature map X_(mid,1) and the second/fifth sub-feature map X_(mid,2) is calculated from the corresponding feature vectors x_(mid,1) and x_(mid,2), where s denotes a scalar scaling factor.
4) The calculated spatial correlation matrix A_corr is used to weight the third/sixth sub-feature vector x_(mid,3) to obtain the spatially correlated feature vector x_corr (i.e., the first spatial correlation feature vector or the second spatial correlation feature vector), as shown in formula (14).
5) According to formula (15), the dependency relationship between the spatial structures of the feature map is modeled and the timing information between the stacked frames is constrained by modeling the dependency relationship between the feature map channels of the stacked frames; the first/second spatio-temporal feature vector x_time_corr is calculated from the spatially correlated feature vector x_corr and the feature vector x_mid using F_C(·), where F_C(·) consists of two layers of one-dimensional convolution with kernel size 3 and step size 1 and an activation function.
6) Finally, the first/second spatio-temporal feature vector x_time_corr with spatial correlation is adjusted to the first/second spatio-temporal feature map X_time_corr with spatial correlation.
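The spatial correlation computation in steps 1) to 6) above can be sketched as follows; the softmax normalization of the correlation matrix, the exact tensor shapes and the class name are assumptions made to obtain a runnable example.

    import torch
    import torch.nn as nn

    class SpatioTemporalCorrelation(nn.Module):
        """Sketch of the spatio-temporal correlation module: channel split, spatial
        correlation matrix, weighting, and a two-layer 1-D convolution block F_C."""
        def __init__(self, channels):
            super().__init__()
            assert channels % 3 == 0, "the transformed map is split into three equal parts"
            self.transform = nn.Conv2d(channels, channels, 1)   # stands in for W_mid, b_mid
            c = channels // 3
            self.f_c = nn.Sequential(                           # F_C: two 1-D convs, kernel 3, stride 1
                nn.Conv1d(c, c, 3, stride=1, padding=1), nn.ReLU(inplace=True),
                nn.Conv1d(c, c, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            )
            self.scale = c ** -0.5                              # assumed form of the scalar scaling factor s

        def forward(self, x):                                   # x: (B, C, H, W)
            b, _, h, w = x.shape
            x1, x2, x3 = torch.chunk(self.transform(x), 3, dim=1)  # the three sub-feature maps
            q = x1.flatten(2)                                   # (B, C/3, H*W)
            k = x2.flatten(2)
            v = x3.flatten(2)
            corr = torch.softmax(self.scale * q.transpose(1, 2) @ k, dim=-1)  # spatial correlation matrix (B, HW, HW)
            weighted = v @ corr.transpose(1, 2)                 # spatially correlated feature vector
            out = self.f_c(weighted)                            # model dependencies across the flattened positions
            return out.view(b, -1, h, w)                        # spatio-temporal feature map with spatial correlation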
S130, calculating a first error by using the error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight, which includes:
inputting a first image frame and a second image frame into a depth estimation network which is constructed in advance, obtaining a first scene depth of the first image frame and a first encoder characteristic map of the first image frame according to the first image frame and a first depth weight, and obtaining a second scene depth of the second image frame and a second encoder characteristic map of the second image frame according to the second image frame and the first depth weight;
inputting a first image frame and a second image frame into a camera motion network which is constructed in advance, and obtaining a first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and a first motion weight;
and calculating a first error by adopting an error calculation module according to the first scene depth, the second scene depth and the first relative pose.
In one embodiment, the total error is determined from image synthesis errors, scene depth structure consistency errors, feature perception loss errors, and smoothing loss errors. Wherein the total error comprises the first error and the second error.
Specifically, the total error is determined according to an image synthesis error, a scene depth structure consistency error, a feature perception loss error and a smoothing loss error, and the method comprises the following steps:
acquiring a first image coordinate of a first image frame and a second image coordinate of a second image frame;
determining a first world coordinate of a first image frame according to the first image coordinate, camera internal parameters and a first scene depth;
determining a second world coordinate of a second image frame according to the second image coordinate, the camera internal parameter and the second scene depth;
affine transforming the first world coordinates of the first image frame to a second image frame panel, and determining third world coordinates after affine transformation;
affine transforming the second world coordinate of the second image frame to the first image frame panel, and determining a fourth world coordinate after affine transformation;
respectively projecting the third world coordinate and the fourth world coordinate to a two-dimensional plane to obtain the scene depth after the first affine transformation, the scene depth after the second affine transformation and the corresponding image coordinate after the first affine transformation and the image coordinate after the second affine transformation;
determining a scene depth structure consistency error, a first depth structure inconsistency weight and a second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining a first camera flow consistency occlusion mask and a second camera flow consistency occlusion mask according to a first image coordinate of a first image frame, a second affine-transformed image coordinate, a second image coordinate of a second image frame and a first affine-transformed image coordinate;
determining an image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera stream consistency occlusion mask and the second camera stream consistency occlusion mask;
determining a characteristic perception loss error according to the first image frame, the second image frame, the image coordinate after the first affine transformation and the image coordinate after the second affine transformation;
determining a smooth loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
and determining the total error according to the image synthesis error, the scene depth structure consistency error, the characteristic perception loss error and the smooth loss error.
Illustratively, for convenience of description, assume that the trained scene depth fitting function is D_t = F_D(I_t | W_D), where I_t denotes the two-dimensional image whose scene depth is to be recovered at time t, W_D denotes the learned weight parameters of the scene depth fitting function F_D, D_t denotes the scene depth recovered for the two-dimensional image I_t at time t, X_(enc,t) denotes the feature map of the image frame I_t at time t output by the encoder, T_(t-1→t) denotes the pose transforming the image frame I_(t-1) at time t-1 to the image frame I_t at time t, W_T denotes the learned weight parameters of the pose transformation function F_T, the camera intrinsic parameters are denoted K, P_(xy,t-1) denotes the image coordinates of the image frame I_(t-1), P_(xyz,t-1) denotes the world coordinates of the image frame I_(t-1), P_(xy,t) denotes the image coordinates of the image frame I_t, and P_(xyz,t) denotes the world coordinates of the image frame I_t. The total error is calculated as follows:
1) According to formula (16), the world coordinates P_(xyz,t-1) (i.e., the first world coordinates) of the image frame I_(t-1) and the world coordinates P_(xyz,t) (i.e., the second world coordinates) of the image frame I_t are calculated:

P_(xyz,t-1) = D_(t-1) * K^(-1) P_(xy,t-1)
P_(xyz,t) = D_t * K^(-1) P_(xy,t)    (16)

where "*" denotes the product of the corresponding elements of the matrices.
2) According to formula (17), the world coordinates P_(xyz,t-1) of the image frame I_(t-1) are affine transformed to the image frame I_t panel, yielding the affine-transformed world coordinates P_(proj_xyz,t) (i.e., the third world coordinates), and the world coordinates P_(xyz,t) of the image frame I_t are affine transformed to the image frame I_(t-1) panel, yielding the affine-transformed world coordinates P_(proj_xyz,t-1) (i.e., the fourth world coordinates).
3) The world coordinates P_(proj_xyz,t) and P_(proj_xyz,t-1) obtained by affine transformation are respectively projected onto the two-dimensional plane to obtain the affine-transformed scene depths D_(proj,t) (i.e., the scene depth after the first affine transformation) and D_(proj,t-1) (i.e., the scene depth after the second affine transformation), together with the corresponding affine-transformed image coordinates P_(proj_xy,t) (i.e., the image coordinates after the first affine transformation) and P_(proj_xy,t-1) (i.e., the image coordinates after the second affine transformation).
4) The image I_(syn,t) is synthesized from the image frame I_(t-1) and P_(proj_xy,t-1); the feature map X_(syn_enc,t) is synthesized from the encoder feature map X_(enc,t-1) and P_(proj_xy,t-1); the depth map D_(syn,t) is synthesized from the estimated depth map D_(t-1) and P_(proj_xy,t-1); the image I_(syn,t-1) is synthesized from the image frame I_t and P_(proj_xy,t); the feature map X_(syn_enc,t-1) is synthesized from the encoder feature map X_(enc,t) and P_(proj_xy,t); the depth map D_(syn,t-1) is synthesized from the estimated depth map D_t and P_(proj_xy,t); the forward camera flow U_forward is calculated from the image coordinates of I_(t-1) and P_(proj_xy,t-1); the backward camera flow U_backward is calculated from the image coordinates of I_t and P_(proj_xy,t); the synthesized forward camera flow U_syn_forward is obtained from U_forward and P_(proj_xy,t); and the synthesized backward camera flow U_syn_backward is obtained from U_backward and P_(proj_xy,t-1).
5) According to formula group (18), the first camera flow consistency occlusion mask M_(occ,t-1) and the second camera flow consistency occlusion mask M_(occ,t) are calculated separately:

M_(occ,t-1) = Γ(‖U_syn_forward + U_backward‖², α1(‖U_syn_forward‖² + ‖U_backward‖²) + α2)
M_(occ,t) = Γ(‖U_syn_backward + U_forward‖², α1(‖U_syn_backward‖² + ‖U_forward‖²) + α2)    (18)

where Γ(·,·) is an indicator-type function that compares its two arguments to decide the flow-consistency mask, and α1 and α2 are constant coefficients.
6) According to formula group (19), the scene depth structure consistency error E_D, the first depth structure inconsistency weight M_(D,t-1) and the second depth structure inconsistency weight M_(D,t) are calculated from the estimated scene depths D_(t-1), D_t and the affine-transformed image coordinates P_(proj_xy,t-1), P_(proj_xy,t).
7) According to formula (20), the image synthesis error E_I is calculated from the first depth structure inconsistency weight M_(D,t-1), the second depth structure inconsistency weight M_(D,t), the first camera flow consistency occlusion mask M_(occ,t-1) and the second camera flow consistency occlusion mask M_(occ,t).
8) According to formula (21), the feature perception loss error E_X is calculated:

E_X = E_RF(X_(enc,t), X_(syn_enc,t)) + E_RF(X_(enc,t-1), X_(syn_enc,t-1))    (21)
9) According to formula (22), the smoothing loss error E_S is calculated from the estimated scene depths D_(t-1), D_t and the image frames I_(t-1), I_t.
10) The total error E is calculated according to formula (23):

E = λ_I E_I + λ_D E_D + λ_X E_X + λ_S E_S    (23)
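As a hedged illustration of two of the building blocks above, the back-projection of formula (16) and the weighted combination of formula (23) might be written as follows; the function signatures and the coefficient values are placeholders assumed for illustration, not the disclosed settings.

    import torch

    def backproject(depth, intrinsics_inv, pix_coords):
        """Formula (16), as reconstructed above: P_xyz = D * K^(-1) P_xy.
        depth: (B, 1, H, W); intrinsics_inv: (B, 3, 3); pix_coords: (B, 3, H*W), homogeneous."""
        cam_rays = intrinsics_inv @ pix_coords   # back-project every pixel into a camera ray
        return depth.flatten(2) * cam_rays       # scale each ray by its estimated depth

    def total_error(e_image, e_depth, e_feat, e_smooth,
                    lambda_i=1.0, lambda_d=0.5, lambda_x=0.1, lambda_s=0.1):
        """Formula (23): E = lambda_I*E_I + lambda_D*E_D + lambda_X*E_X + lambda_S*E_S.
        The lambda values here are illustrative placeholders."""
        return lambda_i * e_image + lambda_d * e_depth + lambda_x * e_feat + lambda_s * e_smooth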
s140, the first error is used as a guide signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network, and a second depth weight and a second motion weight are obtained.
S150, calculating a second error by using an error calculation module, where the calculating includes:
inputting a first image frame and a second image frame into a depth estimation network which is constructed in advance, obtaining a third scene depth of the first image frame and a third encoder characteristic map of the first image frame according to the first image frame and a second depth weight, and obtaining a fourth scene depth of the second image frame and a fourth encoder characteristic map of the second image frame according to the second image frame and the second depth weight;
inputting the first image frame and the second image frame into a camera motion network which is constructed in advance, and obtaining a second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
and calculating a second error by adopting an error calculation module according to the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder characteristic diagram, the fourth encoder characteristic diagram and the second relative pose.
In this step, referring to the specific calculation process in step S130, only the first depth weight and the first motion weight in step S130 are replaced with the second depth weight and the second motion weight, which is not described herein again.
And S160, determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
Specifically, if the first error is larger than the second error, the second scene depth is used as the scene depth of the second image frame, and the first relative pose is used as the relative pose between the first image frame and the second image frame;
and if the first error is smaller than or equal to the second error, taking the fourth scene depth as the scene depth of the second image frame, and taking the second relative pose as the relative pose between the first image frame and the second image frame.
Referring to fig. 5, a schematic diagram of a scene depth inference process is shown. The process of inferring scene depth is as follows:
1) The depth estimation network weight W_D and the camera motion network weight W_T obtained by training are used as the model weights of the depth estimation network and the camera motion network during inference, and the total error E, the pose transformation matrix T_(t-1→t) from the historical frame to the current frame and the scene depth D_t of the current frame are calculated according to S130.
2) The weights of the depth estimation network and the camera motion network are updated with the total error E calculated in 1) as the guidance signal, yielding new model weights W'_D and W'_T.
3) With the model weights W'_D and W'_T obtained in 2), the total error E', the pose transformation matrix T'_(t-1→t) from the historical frame to the current frame and the scene depth D'_t of the current frame are calculated according to S150.
4) The scene depth of the current frame that is finally output is determined by comparing the total errors E and E'.
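The online decision inference in steps 1) to 4) above, together with the selection rule of S160, can be sketched as follows; the error_fn interface, the optimizer choice and the learning rate are placeholders assumed for illustration.

    import copy
    import torch

    def online_inference_step(depth_net, motion_net, error_fn, frame_prev, frame_cur, lr=1e-4):
        """One online decision-inference step (S130-S160), simplified.
        error_fn is a placeholder returning (total_error, scene_depth, relative_pose)."""
        params = list(depth_net.parameters()) + list(motion_net.parameters())
        optimizer = torch.optim.SGD(params, lr=lr)

        # S130: first error, scene depth and pose with the first (pre-update) weights.
        error_1, depth_1, pose_1 = error_fn(depth_net, motion_net, frame_prev, frame_cur)
        saved = (copy.deepcopy(depth_net.state_dict()), copy.deepcopy(motion_net.state_dict()))

        # S140: jointly update both networks with the first error as the guidance signal.
        optimizer.zero_grad()
        error_1.backward()
        optimizer.step()

        # S150: second error, scene depth and pose with the second (updated) weights.
        error_2, depth_2, pose_2 = error_fn(depth_net, motion_net, frame_prev, frame_cur)

        # S160: selection rule as stated in the description - if the first error is larger
        # than the second error, the pre-update prediction is output; otherwise the
        # post-update prediction is output.
        if error_1 > error_2:
            depth_net.load_state_dict(saved[0])
            motion_net.load_state_dict(saved[1])
            return depth_1, pose_1
        return depth_2, pose_2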
In the present application, scene depth is recovered from two-dimensional images in a fully unsupervised manner; the historical frame information in the memory unit is injected into the current input unit through the temporal attention module, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and reduce the influence of erroneous affine transformations caused by inaccurate poses; during inference, online decision inference is used to improve the generalization ability of the algorithm to unknown scenes.
Referring to fig. 6, a schematic structural diagram of a scene depth inference apparatus based on historical information is shown according to an embodiment of the present application.
As shown in fig. 6, the scene depth inference apparatus 600 based on historical information may include:
the first obtaining module 610 is configured to obtain a first image frame and a second image frame of an image to be detected, where the first image frame is an image frame of a previous moment of the second image frame;
a second obtaining module 620, configured to obtain a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
a first processing module 630, configured to calculate a first error by using the error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
an updating module 640, configured to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network by using the first error as a guidance signal, so as to obtain a second depth weight and a second motion weight;
a second processing module 650, configured to calculate a second error by using the error calculation module according to the first image frame, the second depth weight and the second motion weight;
the determining module 660 is configured to determine a scene depth of the second image frame and a relative pose between the first image frame and the second image frame according to the first error and the second error.
Optionally, the camera motion network comprises an encoder, a temporal attention module and a spatiotemporal correlation module;
the encoder is used for extracting the characteristics of the stacked image frames to obtain a stacked characteristic diagram; the stacked image frame is obtained by stacking a first image frame and a second image frame according to the channel dimension;
the time attention module is used for establishing a global dependency relationship between the information of the history memory unit and the information of the current input unit, injecting the information in the globally relevant history memory unit into the current input unit through the updating unit, and storing the globally relevant information in the current input unit into the history memory unit to be used as the history memory unit at the next moment; the current input unit comprises a stacking characteristic diagram, the stacking characteristic diagram is updated into an updated characteristic diagram through an updating unit, the history memory unit comprises a first memory characteristic diagram and a first time characteristic diagram, and the history memory unit at the next moment comprises a second memory characteristic diagram and a second time characteristic diagram;
the spatiotemporal correlation module is used for modeling the updated feature map/the second memory feature map into a first/second spatiotemporal feature map with spatial correlation respectively.
Optionally, the scene depth inference apparatus 600 based on historical information is further configured to:
injecting the feature information in the stacking feature map and the feature information of the first memory feature map into the first time feature map to obtain a third time feature map;
determining a time attention feature vector according to the third time feature map;
determining a first feature vector according to the stacking feature map;
determining a second feature vector according to the first memory feature map;
determining an input feature vector based on the time attention according to the first feature vector and the time attention feature vector;
determining a memory feature vector based on the time attention according to the second feature vector and the time attention feature vector;
respectively adjusting an input feature vector based on time attention and a memory feature vector based on time attention into a corresponding first feature map and a corresponding second feature map;
determining an updated feature map and a second memory feature map according to the first feature map and the second feature map;
and updating the third time characteristic diagram into the second time characteristic diagram according to the updated characteristic diagram and the second memory characteristic diagram.
Optionally, the scene depth inference apparatus 600 based on historical information is further configured to:
slicing the updated feature map in the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
correspondingly adjusting the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector respectively;
calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weighting the third sub-eigenvector by using the first spatial correlation matrix to obtain a first spatial correlation eigenvector;
determining a first space-time eigenvector according to the first space-related eigenvector and the third eigenvector;
adjusting the first space-time feature vector into a first space-time feature map with spatial correlation;
slicing the second memory feature map in a channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
respectively adjusting the second memory characteristic diagram, the fourth sub-characteristic diagram, the fifth sub-characteristic diagram and the sixth sub-characteristic diagram into a fourth characteristic vector, a fourth sub-characteristic vector, a fifth sub-characteristic vector and a sixth sub-characteristic vector;
calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weighting the sixth sub-eigenvector by using the second spatial correlation matrix to obtain a second spatial correlation eigenvector;
determining a second space-time feature vector according to the second space-related feature vector and the fourth feature vector;
adjusting a second spatio-temporal feature vector to the second spatio-temporal feature map with spatial correlation.
Optionally, the first processing module 630 is further configured to:
inputting a first image frame and a second image frame into a depth estimation network which is constructed in advance, obtaining a first scene depth of the first image frame and a first encoder characteristic map of the first image frame according to the first image frame and a first depth weight, and obtaining a second scene depth of the second image frame and a second encoder characteristic map of the second image frame according to the second image frame and the first depth weight;
inputting a first image frame and a second image frame into a camera motion network which is constructed in advance, and obtaining a first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and a first motion weight;
calculating a first error by adopting an error calculation module according to a first image frame, a second image frame, a first scene depth, a second scene depth, a first encoder characteristic diagram, a second encoder characteristic diagram and a first relative pose;
optionally, the second processing module 650 is further configured to:
inputting a first image frame and a second image frame into a depth estimation network which is constructed in advance, obtaining a third scene depth of the first image frame and a third encoder characteristic map of the first image frame according to the first image frame and a second depth weight, and obtaining a fourth scene depth of the second image frame and a fourth encoder characteristic map of the second image frame according to the second image frame and the second depth weight;
inputting the first image frame and the second image frame into a camera motion network which is constructed in advance, and obtaining a second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
and calculating a second error by adopting an error calculation module according to the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder characteristic diagram, the fourth encoder characteristic diagram and the second relative pose.
Optionally, the determining module 660 is further configured to:
if the first error is larger than the second error, taking the second scene depth as the scene depth of the second image frame, and taking the first relative pose as the relative pose between the first image frame and the second image frame;
and if the first error is smaller than or equal to the second error, taking the fourth scene depth as the scene depth of the second image frame, and taking the second relative pose as the relative pose between the first image frame and the second image frame.
Optionally, the total error includes a first error and a second error; the total error is determined according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smooth loss error.
Optionally, the first processing module 630 or the second processing module 650 is further configured to:
acquiring a first image coordinate of a first image frame and a second image coordinate of a second image frame;
determining a first world coordinate of a first image frame according to the first image coordinate, camera internal parameters and a first scene depth;
determining a second world coordinate of a second image frame according to the second image coordinate, the camera internal parameter and the second scene depth;
affine transformation is carried out on the first world coordinate of the first image frame to a second image frame panel, and a third world coordinate after affine transformation is determined;
affine transforming the second world coordinate of the second image frame to the first image frame panel, and determining a fourth world coordinate after affine transformation;
respectively projecting the third world coordinate and the fourth world coordinate to a two-dimensional plane to obtain the scene depth after the first affine transformation, the scene depth after the second affine transformation and the corresponding image coordinate after the first affine transformation and the image coordinate after the second affine transformation;
determining a scene depth structure consistency error, a first depth structure inconsistency weight and a second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining a first camera flow consistency occlusion mask and a second camera flow consistency occlusion mask according to a first image coordinate of a first image frame, a second affine-transformed image coordinate, a second image coordinate of a second image frame and a first affine-transformed image coordinate;
determining an image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera stream consistency occlusion mask and the second camera stream consistency occlusion mask;
determining a characteristic perception loss error according to the first image frame, the second image frame, the image coordinate after the first affine transformation and the image coordinate after the second affine transformation;
determining a smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
and determining the total error according to the image synthesis error, the scene depth structure consistency error, the characteristic perception loss error and the smooth loss error.
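For ease of understanding, the following non-limiting PyTorch sketch shows how a warping-based total error of this kind could be assembled. It is a simplification under several assumptions: K is a shared 3x3 camera intrinsic matrix, the depths have shape B x 1 x H x W, only the warping from the first image frame onto the second image frame plane is shown, the depth structure inconsistency weights, the camera flow consistency occlusion masks and the feature perception loss term are omitted, and the weighting coefficients are hypothetical.

```python
import torch
import torch.nn.functional as F


def backproject(depth, K):
    """Lift each pixel to a camera-frame 3D point using its depth and the intrinsics K."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)  # 3 x HW
    cam = torch.linalg.inv(K) @ pix                                         # 3 x HW
    return cam.unsqueeze(0) * depth.reshape(b, 1, -1)                       # B x 3 x HW


def project(points, K, h, w):
    """Project camera-frame 3D points to a normalized sampling grid and a warped depth map."""
    pix = K @ points                                 # B x 3 x HW
    z = pix[:, 2:].clamp(min=1e-6)
    u = 2.0 * (pix[:, 0:1] / z) / (w - 1) - 1.0
    v = 2.0 * (pix[:, 1:2] / z) / (h - 1) - 1.0
    grid = torch.cat([u, v], dim=1).reshape(-1, 2, h, w).permute(0, 2, 3, 1)  # B x H x W x 2
    return grid, pix[:, 2:].reshape(-1, 1, h, w)


def total_error(img1, img2, depth1, depth2, pose_1to2, K):
    """Simplified one-direction total error; the described method also warps in the
    opposite direction and weights the terms with the inconsistency weights and
    occlusion masks, which are omitted here."""
    b, _, h, w = img1.shape
    R, t = pose_1to2[:, :3, :3], pose_1to2[:, :3, 3:]
    cam1 = backproject(depth1, K)                    # frame-1 points in frame-1 coordinates
    cam1_in_2 = R @ cam1 + t                         # affine transform to the frame-2 plane
    grid_1to2, depth_1to2 = project(cam1_in_2, K, h, w)
    # Synthesize frame 1 from frame 2 and resample frame 2's depth at the same locations.
    img2_warped = F.grid_sample(img2, grid_1to2, align_corners=True)
    depth2_warped = F.grid_sample(depth2, grid_1to2, align_corners=True)
    # Image synthesis error (photometric difference against the synthesized view).
    synthesis = (img1 - img2_warped).abs().mean()
    # Scene depth structure consistency error (relative depth difference).
    depth_consistency = ((depth_1to2 - depth2_warped).abs()
                         / (depth_1to2 + depth2_warped).clamp(min=1e-6)).mean()
    # Smoothing loss error on the predicted depth (image-gradient weighting omitted).
    smoothness = depth1.diff(dim=-1).abs().mean() + depth1.diff(dim=-2).abs().mean()
    # The feature perception loss would compare encoder feature maps sampled at the
    # warped coordinates; it is omitted from this sketch.
    w_syn, w_geo, w_smooth = 1.0, 0.5, 0.1           # hypothetical weighting coefficients
    return w_syn * synthesis + w_geo * depth_consistency + w_smooth * smoothness
```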
The scene depth inference device based on historical information provided by this embodiment can implement the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, showing an electronic device 300 suitable for implementing the embodiments of the present application.
As shown in fig. 7, the electronic device 300 includes a Central Processing Unit (CPU) 301 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the device 300 are also stored. The CPU 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, and the like; an output section 307 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card or a modem. The communication section 309 performs communication processing via a network such as the internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 310 as necessary, so that the computer program read out therefrom is installed into the storage section 308 as necessary.
In particular, the process described above with reference to fig. 1 may be implemented as a computer software program, according to an embodiment of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the above-described historical information-based scene depth inference method. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 309, and/or installed from the removable medium 311.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor. The names of these units or modules do not in some cases constitute a limitation of the unit or module itself.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a gaming console, a tablet computer, a wearable device, or a combination of any of these devices.
As another aspect, the present application also provides a storage medium, which may be the storage medium included in the foregoing apparatus in the foregoing embodiment; or may be a storage medium that exists separately and is not assembled into the device. The storage medium stores one or more programs used by one or more processors to perform the scene depth inference method based on historical information described herein.
Storage media, including permanent and non-permanent, removable and non-removable media, may implement the information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and portions that are similar to each other in the embodiments are referred to each other, and each embodiment focuses on differences from other embodiments. In particular, as for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and the relevant points can be referred to the partial description of the method embodiment.

Claims (10)

1. A scene depth inference method based on historical information is characterized by comprising the following steps:
acquiring a first image frame and a second image frame of an image to be detected, wherein the first image frame is an image frame at the previous moment of the second image frame;
acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
calculating a first error by using an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
jointly updating a first depth weight of the depth estimation network and a first motion weight of the camera motion network by taking the first error as a guide signal to obtain a second depth weight and a second motion weight;
calculating a second error by using the error calculation module according to the first image frame, the second depth weight and the second motion weight;
determining a scene depth of the second image frame and a relative pose between the first image frame and the second image frame according to the first error and the second error.
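For ease of understanding, a non-limiting PyTorch sketch of the error-guided weight update in claim 1 is given below. The error_module is assumed to be a callable that returns a differentiable scalar error from the two image frames and the two networks, and a single Adam gradient step stands in for the joint update; neither assumption limits the claim.

```python
import copy
import torch


def refine_and_score(depth_net, motion_net, error_module, frame1, frame2, lr=1e-4):
    """One error-guided refinement step: compute the first error with the first
    weights, jointly update copies of both networks, then compute the second
    error with the updated (second) weights."""
    depth_net_2 = copy.deepcopy(depth_net)      # keeps the first depth weight untouched
    motion_net_2 = copy.deepcopy(motion_net)    # keeps the first motion weight untouched
    params = list(depth_net_2.parameters()) + list(motion_net_2.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)

    # First error, computed with the (copied) first depth weight and first motion weight.
    first_error = error_module(frame1, frame2, depth_net_2, motion_net_2)

    # Use the first error as the guide signal to jointly update both networks.
    optimizer.zero_grad()
    first_error.backward()
    optimizer.step()

    # Second error, computed with the second depth weight and second motion weight.
    second_error = error_module(frame1, frame2, depth_net_2, motion_net_2)
    return first_error.detach(), second_error.detach(), depth_net_2, motion_net_2
```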
2. The method of claim 1, wherein the camera motion network comprises an encoder, a temporal attention module, and a spatiotemporal correlation module;
the encoder is used for extracting the features of the stacked image frames to obtain a stacked feature map; the stacked image frames are obtained by stacking the first image frame and the second image frame according to a channel dimension;
the time attention module is used for establishing a global dependency relationship between the information of the history memory unit and the information of the current input unit, injecting, through the updating unit, the information in the history memory unit that is globally related to the current input unit, and storing the globally related information in the current input unit into the history memory unit as the history memory unit at the next moment; the current input unit comprises the stacking feature map, the stacking feature map is updated into an updated feature map by the updating unit, the history memory unit comprises a first memory feature map and a first time feature map, and the history memory unit at the next moment comprises a second memory feature map and a second time feature map;
the spatiotemporal correlation module is used for modeling the updated feature map and the second memory feature map into a first spatiotemporal feature map and a second spatiotemporal feature map with spatial correlation, respectively.
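For ease of understanding, the following non-limiting PyTorch sketch illustrates one possible wiring of the encoder, the temporal attention module and the spatiotemporal correlation module of claim 2; the pose regression head and the internals of the sub-modules are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class CameraMotionNet(nn.Module):
    """Wiring sketch: encoder -> temporal attention (with history memory)
    -> spatiotemporal correlation -> pose regression (hypothetical head)."""

    def __init__(self, encoder, temporal_attention, spatiotemporal_correlation, pose_head):
        super().__init__()
        self.encoder = encoder
        self.temporal_attention = temporal_attention
        self.spatiotemporal_correlation = spatiotemporal_correlation
        self.pose_head = pose_head   # assumed head that regresses the relative pose

    def forward(self, frame1, frame2, memory):
        # Stack the two image frames along the channel dimension and extract features.
        stacked = torch.cat([frame1, frame2], dim=1)
        stacked_feat = self.encoder(stacked)
        # memory = (first memory feature map, first time feature map) from the previous moment.
        updated_feat, next_memory = self.temporal_attention(stacked_feat, memory)
        # Model spatial correlation for the updated feature map and the second memory feature map.
        st_feat_1 = self.spatiotemporal_correlation(updated_feat)
        st_feat_2 = self.spatiotemporal_correlation(next_memory[0])
        relative_pose = self.pose_head(st_feat_1, st_feat_2)
        return relative_pose, next_memory
```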
3. The method according to claim 2, wherein the establishing a global dependency relationship between the information of the history memory unit and the information of the current input unit, injecting, through the updating unit, the information in the history memory unit that is globally related to the current input unit, and storing the globally related information in the current input unit into the history memory unit as the history memory unit at the next moment comprises:
injecting feature information in the stacking feature map and feature information of the first memory feature map into the first time feature map to obtain a third time feature map;
determining a temporal attention feature vector according to the third temporal feature map;
determining a first feature vector according to the stacking feature map;
determining a second feature vector according to the first memory feature map;
determining a temporal attention-based input feature vector from the first feature vector and the temporal attention feature vector;
determining a temporal attention-based memory feature vector from the second feature vector and the temporal attention feature vector;
adjusting the temporal attention-based input feature vector and the temporal attention-based memory feature vector into a corresponding first feature map and a corresponding second feature map, respectively;
determining the updated feature map and the second memory feature map according to the first feature map and the second feature map;
and updating the third time feature map into the second time feature map according to the updated feature map and the second memory feature map.
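For ease of understanding, a non-limiting PyTorch sketch of the temporal attention update of claim 3 is given below; the 1x1 convolutions, the sigmoid gating and the assumption that all feature maps share one channel count are illustrative choices, not the claimed formulation.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Sketch of the temporal-attention update order of operations."""

    def __init__(self, channels):
        super().__init__()
        self.inject = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.attn = nn.Conv2d(channels, channels, kernel_size=1)
        self.fuse = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.time_update = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, stacked_feat, memory):
        memory_feat, time_feat = memory              # first memory / first time feature map
        b, c, h, w = stacked_feat.shape
        # Inject stacked-feature and memory-feature information into the time feature map.
        time_feat_3 = self.inject(torch.cat([stacked_feat, memory_feat, time_feat], dim=1))
        # Temporal attention feature vector from the third time feature map.
        attn_vec = torch.sigmoid(self.attn(time_feat_3)).reshape(b, c, -1)
        # First / second feature vectors from the stacked and first memory feature maps.
        v_input = stacked_feat.reshape(b, c, -1)
        v_memory = memory_feat.reshape(b, c, -1)
        # Temporal attention-based input / memory feature vectors, adjusted back to maps.
        f_input = (v_input * attn_vec).reshape(b, c, h, w)
        f_memory = (v_memory * attn_vec).reshape(b, c, h, w)
        # Updated feature map and second memory feature map.
        fused = self.fuse(torch.cat([f_input, f_memory], dim=1))
        updated_feat, memory_feat_2 = fused.chunk(2, dim=1)
        # Update the third time feature map into the second time feature map.
        time_feat_2 = self.time_update(torch.cat([updated_feat, memory_feat_2], dim=1))
        return updated_feat, (memory_feat_2, time_feat_2)
```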
4. The method of claim 2, wherein modeling the updated feature map as a first spatio-temporal feature map having spatial correlation comprises:
slicing the updated feature map in a channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
respectively adjusting the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector;
calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weighting the third sub-feature vector by using the first spatial correlation matrix to obtain a first spatial correlation feature vector;
determining a first space-time feature vector according to the first space-related feature vector and the third feature vector;
adjusting the first spatiotemporal feature vector to the first spatiotemporal feature map with spatial correlation;
modeling the second memory signature as a second spatiotemporal signature with spatial correlation, comprising:
slicing the second memory feature map in a channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
respectively adjusting the second memory feature map, the fourth sub-feature map, the fifth sub-feature map and the sixth sub-feature map into a fourth feature vector, a fourth sub-feature vector, a fifth sub-feature vector and a sixth sub-feature vector;
calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weighting the sixth sub-feature vector by using the second spatial correlation matrix to obtain a second spatial correlation feature vector;
determining a second space-time feature vector according to the second space-related feature vector and the fourth feature vector;
adjusting the second spatio-temporal feature vector to the second spatio-temporal feature map with spatial correlation.
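For ease of understanding, a non-limiting PyTorch sketch of the channel-sliced spatial correlation of claim 4 is given below; the softmax normalization, the scaling factor and the residual combination with the full feature map are illustrative assumptions rather than the claimed formulation.

```python
import torch
import torch.nn as nn


class SpatioTemporalCorrelation(nn.Module):
    """Sketch of the channel-sliced spatial-correlation step."""

    def __init__(self, channels):
        super().__init__()
        assert channels % 3 == 0, "sketch assumes the channel count is divisible by 3"
        self.lift = nn.Conv2d(channels // 3, channels, kernel_size=1)

    def forward(self, feat):
        b, c, h, w = feat.shape
        cs = c // 3
        # Slice the feature map into three sub-feature maps along the channel dimension.
        sub1, sub2, sub3 = feat.chunk(3, dim=1)
        # Adjust the sub-feature maps into vectors (flatten the spatial dimensions).
        v1 = sub1.reshape(b, cs, -1)                 # B x C/3 x HW
        v2 = sub2.reshape(b, cs, -1)
        v3 = sub3.reshape(b, cs, -1)
        # Spatial correlation matrix between the first and second sub-feature maps.
        corr = torch.softmax(v1.transpose(1, 2) @ v2 / cs ** 0.5, dim=-1)   # B x HW x HW
        # Weight the third sub-feature vector with the correlation matrix.
        spatial = (v3 @ corr.transpose(1, 2)).reshape(b, cs, h, w)
        # Combine the spatially-correlated features with the full feature map and
        # return a feature map of the original shape (the spatio-temporal feature map).
        return feat + self.lift(spatial)
```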
5. The method of claim 1, wherein the calculating a first error by using an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight comprises:
inputting the first image frame and the second image frame into a depth estimation network which is constructed in advance, obtaining a first scene depth of the first image frame and a first encoder feature map of the first image frame according to the first image frame and the first depth weight, and obtaining a second scene depth of the second image frame and a second encoder feature map of the second image frame according to the second image frame and the first depth weight;
inputting the first image frame and the second image frame into a camera motion network which is constructed in advance, and obtaining a first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the first motion weight;
calculating the first error by using the error calculation module according to the first image frame, the second image frame, the first scene depth, the second scene depth, the first encoder feature map, the second encoder feature map and the first relative pose;
the calculating a second error by using the error calculation module according to the first image frame, the second depth weight and the second motion weight comprises:
inputting the first image frame and the second image frame into a depth estimation network which is constructed in advance, obtaining a third scene depth of the first image frame and a third encoder feature map of the first image frame according to the first image frame and the second depth weight, and obtaining a fourth scene depth of the second image frame and a fourth encoder feature map of the second image frame according to the second image frame and the second depth weight;
inputting the first image frame and the second image frame into a camera motion network which is constructed in advance, and obtaining a second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
calculating the second error by using the error calculation module according to the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder feature map, the fourth encoder feature map and the second relative pose.
6. The method of claim 5, wherein the determining a scene depth of the second image frame and a relative pose between the first image frame and the second image frame from the first error and the second error comprises:
if the first error is larger than the second error, taking the second scene depth as the scene depth of the second image frame, and taking the first relative pose as the relative pose between the first image frame and the second image frame;
if the first error is smaller than or equal to the second error, the fourth scene depth is used as the scene depth of the second image frame, and the second relative pose is used as the relative pose between the first image frame and the second image frame.
7. The method of claim 5 or 6, wherein the total error comprises the first error and the second error;
the total error is determined according to an image synthesis error, a scene depth structure consistency error, a feature perception loss error and a smooth loss error.
8. The method of claim 7, wherein the determining of the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error comprises:
acquiring a first image coordinate of the first image frame and a second image coordinate of the second image frame;
determining first world coordinates of the first image frame according to the first image coordinates, camera internal parameters and the first scene depth;
determining second world coordinates of the second image frame according to the second image coordinates, the camera internal parameters and the second scene depth;
affine transforming the first world coordinates of the first image frame to the second image frame plane, and determining third world coordinates after affine transformation;
affine transforming the second world coordinate of the second image frame to the first image frame plane, and determining a fourth world coordinate after affine transformation;
respectively projecting the third world coordinate and the fourth world coordinate to a two-dimensional plane to obtain a scene depth after first affine transformation, a scene depth after second affine transformation and corresponding image coordinates after first affine transformation and image coordinates after second affine transformation;
determining the scene depth structure consistency error, a first depth structure inconsistency weight and a second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining a first camera flow consistency occlusion mask and a second camera flow consistency occlusion mask according to a first image coordinate of the first image frame, the second affine-transformed image coordinate, a second image coordinate of the second image frame, and the first affine-transformed image coordinate;
determining the image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera stream consistency occlusion mask, and the second camera stream consistency occlusion mask;
determining the feature perception loss error according to the first image frame, the second image frame, the first affine-transformed image coordinate and the second affine-transformed image coordinate;
determining the smoothing loss error from the first scene depth, the second scene depth, the first image frame, and the second image frame;
and determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
9. An apparatus for scene depth inference based on historical information, the apparatus comprising:
the first acquisition module is used for acquiring a first image frame and a second image frame of an image to be detected, wherein the first image frame is an image frame at a previous moment of the second image frame;
the second acquisition module is used for acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network;
a first processing module, configured to calculate a first error by using an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
the updating module is used for jointly updating a first depth weight of the depth estimation network and a first motion weight of the camera motion network by taking the first error as a guide signal to obtain a second depth weight and a second motion weight;
a second processing module, configured to calculate a second error by using the error calculation module according to the first image frame, the second depth weight and the second motion weight;
a determining module, configured to determine a scene depth of the second image frame and a relative pose between the first image frame and the second image frame according to the first error and the second error.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for scene depth inference based on historical information as claimed in any one of claims 1-8 when executing the program.
CN202210139037.7A 2022-02-15 2022-02-15 Scene depth reasoning method and device based on historical information and electronic equipment Pending CN114627176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210139037.7A CN114627176A (en) 2022-02-15 2022-02-15 Scene depth reasoning method and device based on historical information and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210139037.7A CN114627176A (en) 2022-02-15 2022-02-15 Scene depth reasoning method and device based on historical information and electronic equipment

Publications (1)

Publication Number Publication Date
CN114627176A true CN114627176A (en) 2022-06-14

Family

ID=81898033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210139037.7A Pending CN114627176A (en) 2022-02-15 2022-02-15 Scene depth reasoning method and device based on historical information and electronic equipment

Country Status (1)

Country Link
CN (1) CN114627176A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination