WO2023155043A1 - Historical information-based scene depth reasoning method and apparatus, and electronic device - Google Patents

Historical information-based scene depth reasoning method and apparatus, and electronic device Download PDF

Info

Publication number
WO2023155043A1
WO2023155043A1 PCT/CN2022/076348 CN2022076348W WO2023155043A1 WO 2023155043 A1 WO2023155043 A1 WO 2023155043A1 CN 2022076348 W CN2022076348 W CN 2022076348W WO 2023155043 A1 WO2023155043 A1 WO 2023155043A1
Authority
WO
WIPO (PCT)
Prior art keywords
image frame
feature map
error
depth
image
Prior art date
Application number
PCT/CN2022/076348
Other languages
French (fr)
Chinese (zh)
Inventor
王飞
程俊
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Priority to PCT/CN2022/076348 priority Critical patent/WO2023155043A1/en
Publication of WO2023155043A1 publication Critical patent/WO2023155043A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images

Definitions

  • the application belongs to the technical field of computer vision and image processing, and in particular relates to a scene depth reasoning method, device and electronic equipment based on historical information.
  • schemes that restore scene depth from 2D images by fully unsupervised learning must simultaneously predict the camera pose from adjacent frames; an inaccurate pose produces wrong affine transformation results, which directly degrades the quality of the synthesized image and therefore the quality of the recovered scene depth.
  • the purpose of the embodiments of this specification is to provide a scene depth reasoning method, device and electronic equipment based on historical information.
  • the present application provides a scene depth reasoning method based on historical information, the method comprising:
  • acquiring a first image frame and a second image frame of the image to be tested, the first image frame being the image frame at the moment before the second image frame;
  • calculating a first error with an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
  • calculating a second error with an error calculation module according to the first image frame, the second image frame, the second depth weight and the second motion weight;
  • determining, according to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame.
  • the present application provides a device for scene depth reasoning based on historical information, the device comprising:
  • the first acquisition module is used to acquire the first image frame and the second image frame of the image to be tested, and the first image frame is an image frame at a moment before the second image frame;
  • the second acquisition module is used to acquire the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
  • the first processing module is configured to calculate a first error with an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
  • An update module configured to use the first error as a guide signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network to obtain a second depth weight and a second motion weight;
  • the second processing module is configured to calculate a second error with an error calculation module according to the first image frame, the second image frame, the second depth weight and the second motion weight;
  • the determination module is configured to determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
  • the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • when the processor executes the program, the historical information-based scene depth reasoning method of the first aspect is implemented.
  • the solution recovers scene depth from two-dimensional images in a completely unsupervised form; through the temporal attention module, the historical frame information held in the memory unit is injected into the current input unit, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and reduce the influence of wrong affine transformations caused by an inaccurate pose; during inference, online decision reasoning is used to improve the generalization ability of the algorithm to unknown scenes;
  • FIG. 1 is a schematic flow diagram of a scene depth reasoning method based on historical information provided by the present application
  • Fig. 2 is a joint training block diagram of the depth estimation network and the camera motion network provided by the embodiment of the present application;
  • FIG. 3 is a schematic diagram of the principle of the temporal attention module provided in the embodiment of the present application.
  • FIG. 4 is a schematic diagram of the principle of the spatio-temporal correlation module provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of the scene depth inference process provided by the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a scene depth reasoning device based on historical information provided by the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by the present application.
  • FIG. 1 shows a schematic flow chart applicable to a method for scene depth reasoning based on historical information provided by an embodiment of the present application.
  • the scene depth reasoning method based on historical information may include:
  • the image to be tested is any image that requires reasoning and prediction of scene depth, and the image to be tested may be a two-dimensional image.
  • the first image frame and the second image frame of the image to be tested are two image frames at adjacent moments, and the first image frame is the image frame at the moment before the second image frame; for example, the first image frame is the image frame at time t-1 and the second image frame is the image frame at time t, or the first image frame is the image frame at time t and the second image frame is the image frame at time t+1, and so on.
  • the depth estimation network is used to estimate the scene depth of the two-dimensional image, and the depth estimation network may adopt a neural network with an encoder-decoder structure, and the type and network structure of the neural network adopted by the depth estimation network are not limited.
  • a camera motion network is used to predict the relative pose between adjacent image frames.
  • Both the depth estimation network and the camera motion network are pre-built and trained.
  • the training set data is obtained first.
  • the training set data consists of groups of image frames at three adjacent moments; for example, the image frame at time t-1, the image frame at time t and the image frame at time t+1 form one piece of data in the training data set, the image frame at time t-2, the image frame at time t-1 and the image frame at time t form another piece of data, and so on.
  • the image frame at time t-1, the image frame at time t and the image frame at time t+1 are used as the input of the camera motion network and the depth estimation network, and the depth estimation network and the camera motion network are jointly trained using the objective function, which guides the update of the depth weights of the depth estimation network and the motion weights of the camera motion network.
  • the camera motion network may include an encoder, a temporal attention module, and a spatiotemporal correlation module.
  • the encoder is used to extract the features of the stacked image frame to obtain the stacked feature map; the stacked image frame is obtained by stacking the first image frame and the second image frame according to the channel dimension.
  • the temporal attention module is used to establish a global dependency between the information of the historical memory unit and the information of the current input unit, to inject the globally correlated information of the historical memory unit into the current input unit through the update unit, and at the same time to store the globally correlated information of the current input unit in the historical memory unit as the historical memory unit at the next moment;
  • the current input unit includes the stacked feature map, which the update unit updates into an updated feature map; the historical memory unit includes the first memory feature map and the first time feature map, and the historical memory unit at the next moment includes the second memory feature map and the second time feature map.
  • the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into, respectively, a first spatio-temporal feature map and a second spatio-temporal feature map with spatial correlation.
  • the time attention module uses the shared time attention weight to establish a global dependency relationship between the information of the historical memory unit and the information of the current input unit, and through the update unit, the information in the globally related historical memory unit The information is injected into the current input unit, and at the same time, the global relevant information in the current input unit is stored in the historical memory unit as the historical memory unit at the next moment, including:
  • determining a time-attention-based input feature vector according to the first feature vector and the temporal attention feature vector;
  • determining a time-attention-based memory feature vector according to the second feature vector and the temporal attention feature vector;
  • updating the third time feature map into the second time feature map according to the updated feature map and the second memory feature map.
  • FIG. 3 shows a schematic diagram of the principle of the temporal attention module provided by the embodiment of the present application.
  • for convenience of description, the input feature map at time t (that is, the stacked feature map) is denoted X_t, the time feature map at time t-1 (that is, the first time feature map) is denoted X_(time,t-1), and the memory feature map at time t-1 (that is, the first memory feature map) is denoted X_(m,t-1); the calculation process is as follows:
  • the feature information of the input feature map X_t at time t and of the memory feature map X_(m,t-1) at time t-1 is injected into the time feature map X_(time,t-1) at time t-1 to obtain the third time feature map.
  • C denotes the number of feature map channels, H the height of the feature map, and W the width of the feature map;
  • δ_gelu(·) denotes the activation function;
  • W_(i,t), W_(time,t-1) and W_(m,t-1) denote the corresponding learned weights, and b_(i,t), b_(time,t-1) and b_(m,t-1) the corresponding bias terms;
  • s denotes a scalar scaling factor, and "*" denotes the product of corresponding elements;
  • the function F_split(·) slices a feature map along the channel dimension, the function F_reshape(·) adjusts a feature map or feature vector into a preset shape, and "T" denotes transpose.
  • the information selection gate is computed as G_s = δ_sig( W_(qk_ms,t-1) X_(qk_m,t-1) + b_(qk_ms,t-1) + W_(qk_is,t) X_(qk_i,t) + b_(qk_is,t) )   (6), where the function δ_sig(·) denotes the sigmoid activation function, W_(qk_ms,t-1) and W_(qk_is,t) denote the corresponding weights, and b_(qk_ms,t-1) and b_(qk_is,t) denote the corresponding bias terms;
  • the new feature map containing the memory information is computed as X_(im,t) = δ_tanh( W_(qk_imi,t) X_(qk_i,t) + b_(qk_imi,t) + G_s * ( W_(qk_imm,t-1) X_(qk_m,t-1) + b_(qk_imm,t-1) ) )   (7), where the function δ_tanh(·) denotes the tanh activation function, W_(qk_imi,t) and W_(qk_imm,t-1) denote the corresponding weights, and b_(qk_imi,t) and b_(qk_imm,t-1) denote the corresponding bias terms;
  • W_(qk_ir,t) and W_(qk_mr,t-1) denote the corresponding weights, and b_(qk_ir,t) and b_(qk_mr,t-1) the corresponding bias terms (memory gate, formula group (8));
  • W_(qk_io,t) and W_(qk_mo,t-1) denote the corresponding weights, and b_(qk_io,t) and b_(qk_mo,t-1) the corresponding bias terms (output gate, formula (9)).
  • the time feature map X_(time,t-1) at time t-1 is updated into the time feature map at time t (that is, the second time feature map) X_(time,t) according to formula (10);
  • W_(time,t-1), W_(io,t) and W_(m,t) denote the corresponding weights, and b_(time,t-1), b_(io,t) and b_(m,t) the corresponding bias terms.
  • the spatio-temporal correlation module shown in Figure 4 is constructed: global spatial correlation weights are used to model the spatial context information, and the dependencies between the feature map channels of the stacked frames are modeled to constrain the temporal information between the stacked frames.
  • the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into, respectively, a first spatio-temporal feature map and a second spatio-temporal feature map with spatial correlation.
  • the updated feature map is modeled as the first spatiotemporal feature map with spatial correlation, including:
  • the first spatio-temporal feature vector is adjusted to a first spatio-temporal feature map with spatial correlation.
  • the second memory feature map is modeled as a second spatiotemporal feature map with spatial correlation, including:
  • the second spatio-temporal feature vector is adjusted to a second spatio-temporal feature map with spatial correlation.
  • the input feature map (that is, the updated feature map or the second memory feature map) is transformed to obtain an intermediate feature map, which is divided equally into three sub-feature maps (namely the first/fourth sub-feature map, the second/fifth sub-feature map and the third/sixth sub-feature map), where the function F_split(·) slices the feature map along the channel dimension;
  • W_mid denotes the corresponding weight and b_mid the corresponding bias term;
  • each sub-feature map is adjusted into a corresponding feature vector, the function F_reshape(·) being used to adjust a feature map or feature vector into a preset shape;
  • F_C(·) consists of two layers of one-dimensional convolution and activation functions with a kernel size of 3 and a stride of 1.
  • an error calculation module is used to calculate the first error.
  • the total error is determined from image synthesis error, scene depth structure consistency error, feature perception loss error, smoothing loss error. Wherein, the total error includes the first error and the second error.
  • the total error is determined according to image synthesis error, scene depth structure consistency error, feature perception loss error, and smoothing loss error, including:
  • affine-transforming the first world coordinates of the first image frame to the plane of the second image frame, and determining the third world coordinates after the affine transformation;
  • determining the scene depth structure consistency error, the first depth structure inconsistency weight and the second depth structure inconsistency weight according to the first scene depth, the second scene depth, the image coordinates after the first affine transformation and the image coordinates after the second affine transformation;
  • the total error is determined based on the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
  • the scene depth is recovered by a fitting function, D_t = F_D(I_t; W_D), where I_t denotes the two-dimensional image whose scene depth is to be recovered at time t, W_D denotes the weight parameters learned in the scene depth fitting function F_D(·), and D_t denotes the recovered scene depth of the two-dimensional image I_t at time t;
  • X_(enc,t) denotes the feature map of the image frame I_t output by the encoder at time t;
  • the pose transforming the image frame I_t-1 at time t-1 to the image frame I_t at time t is predicted by the pose transformation function F_T(·), and W_T denotes the weight parameters learned in F_T(·);
  • the camera intrinsics are denoted K; P_(xy,t-1) denotes the image coordinates of the image frame I_t-1, P_(xyz,t-1) denotes the world coordinates of the image frame I_t-1, and P_(xy,t) denotes the image coordinates of the image frame I_t (a sketch of this projection and warping step is given after this list);
  • the sign "*" denotes the product of the corresponding elements of the matrices;
  • the feature-level error is E_X = ERF(X_(enc,t), X_(syn_enc,t)) + ERF(X_(enc,t-1), X_(syn_enc,t-1))   (21).
  • the first image frame, the second image frame, the second depth weight, and the second motion weight use an error calculation module to calculate the second error, which may include:
  • an error calculation module is used to calculate a second error.
  • for the specific calculation process, refer to step S130; only the first depth weight and the first motion weight of S130 are replaced by the second depth weight and the second motion weight, and details are not repeated here.
  • S160 Determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
  • the second scene depth is used as the scene depth of the second image frame
  • the first relative pose is used as the relative pose between the first image frame and the second image frame
  • the fourth scene depth is used as the scene depth of the second image frame
  • the second relative pose is used as the relative pose between the first image frame and the second image frame.
  • FIG. 5 shows a schematic diagram of the scene depth inference process.
  • the process of inferring scene depth is as follows:
  • the scene depth is recovered from the two-dimensional image in a completely unsupervised form; the temporal attention module injects the historical frame information held in the memory unit into the current input unit, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and to reduce the influence of wrong affine transformations caused by an inaccurate pose; during inference, online decision reasoning is used to improve the generalization ability of the algorithm to unknown scenes.
  • FIG. 6 shows a schematic structural diagram of an apparatus for scene depth inference based on historical information according to an embodiment of the present application.
  • the scene depth reasoning device 600 based on historical information may include:
  • the first acquisition module 610 is configured to acquire a first image frame and a second image frame of the image to be tested, and the first image frame is an image frame at a moment before the second image frame;
  • the second acquiring module 620 is configured to acquire the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
  • the first processing module 630 is configured to calculate a first error with an error calculation module according to the first image frame, the second image frame, the first depth weight and the first motion weight;
  • An update module 640 configured to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network by using the first error as a guide signal to obtain a second depth weight and a second motion weight;
  • the second processing module 650 is configured to calculate a second error with an error calculation module according to the first image frame, the second image frame, the second depth weight and the second motion weight;
  • the determination module 660 is configured to determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
  • the camera motion network includes an encoder, a temporal attention module and a spatiotemporal correlation module;
  • the encoder is used to extract the features of the stacked image frame to obtain the stacked feature map;
  • the stacked image frame is obtained by stacking the first image frame and the second image frame according to the channel dimension;
  • the temporal attention module is used to establish a global dependency between the information of the historical memory unit and the information of the current input unit, to inject the globally correlated information of the historical memory unit into the current input unit through the update unit, and at the same time to store the globally correlated information of the current input unit in the historical memory unit as the historical memory unit at the next moment;
  • the current input unit includes the stacked feature map, which the update unit updates into an updated feature map; the historical memory unit includes the first memory feature map and the first time feature map, and the historical memory unit at the next moment includes the second memory feature map and the second time feature map;
  • the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into, respectively, a first spatio-temporal feature map and a second spatio-temporal feature map with spatial correlation.
  • the scene depth reasoning device 600 based on historical information is also used for:
  • determining a time-attention-based input feature vector according to the first feature vector and the temporal attention feature vector;
  • determining a time-attention-based memory feature vector according to the second feature vector and the temporal attention feature vector;
  • updating the third time feature map into the second time feature map according to the updated feature map and the second memory feature map.
  • the scene depth reasoning device 600 based on historical information is also used for:
  • the second spatio-temporal feature vector is adjusted to the second spatio-temporal feature map with spatial correlation.
  • the first processing module 630 is also used for:
  • the second processing module 650 is also used for:
  • an error calculation module is used to calculate a second error.
  • the determining module 660 is also used for:
  • the second scene depth is used as the scene depth of the second image frame
  • the first relative pose is used as the relative pose between the first image frame and the second image frame
  • the fourth scene depth is used as the scene depth of the second image frame
  • the second relative pose is used as the relative pose between the first image frame and the second image frame.
  • the total error includes a first error and a second error; the total error is determined according to image synthesis error, scene depth structure consistency error, feature perception loss error, and smoothing loss error.
  • the first processing module 630 or the second processing module 650 is also used for:
  • affine-transforming the first world coordinates of the first image frame to the plane of the second image frame, and determining the third world coordinates after the affine transformation;
  • determining the scene depth structure consistency error, the first depth structure inconsistency weight and the second depth structure inconsistency weight according to the first scene depth, the second scene depth, the image coordinates after the first affine transformation and the image coordinates after the second affine transformation;
  • determining the smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
  • the total error is determined based on the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
  • the historical information-based scene depth reasoning device provided in this embodiment can execute the above-mentioned embodiment of the method, and its implementation principle and technical effect are similar, and will not be repeated here.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 7 , a schematic structural diagram of an electronic device 300 suitable for implementing the embodiments of the present application is shown.
  • an electronic device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage section 308 into a random access memory (RAM) 303.
  • ROM read-only memory
  • RAM random access memory
  • in the RAM 303, various programs and data necessary for the operation of the device 300 are also stored.
  • the CPU 301, ROM 302, and RAM 303 are connected to each other through a bus 304.
  • An input/output (I/O) interface 305 is also connected to the bus 304 .
  • the following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse, etc.; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 308 including a hard disk, etc.; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like.
  • the communication section 309 performs communication processing via a network such as the Internet.
  • a drive 310 is also connected to the I/O interface 305 as needed.
  • a removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 310 as necessary so that a computer program read therefrom is installed into the storage section 308 as necessary.
  • the process described above with reference to FIG. 1 may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product, which includes a computer program tangibly contained on a machine-readable medium, the computer program including program codes for executing the above-mentioned scene depth reasoning method based on historical information.
  • the computer program may be downloaded and installed from a network via communication portion 309 and/or installed from removable media 311 .
  • each block in a flowchart or block diagram may represent a module, program segment, or part of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented by a dedicated hardware-based system that performs the specified functions or operations , or may be implemented by a combination of dedicated hardware and computer instructions.
  • the units or modules involved in the embodiments described in the present application may be implemented by means of software or by means of hardware.
  • the described units or modules may also be provided in a processor.
  • the names of these units or modules do not constitute limitations on the units or modules themselves in some cases.
  • a typical implementing device is a computer.
  • the computer can be, for example, a personal computer, a notebook computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.
  • the present application also provides a storage medium, which may be the storage medium contained in the aforementioned device in the above embodiment, or may be a storage medium that exists independently and is not assembled into the device.
  • the storage medium stores one or more programs, and the aforementioned programs are used by one or more processors to execute the scene depth reasoning method based on historical information described in this application.
  • Storage media includes permanent and non-permanent, removable and non-removable media.
  • Information storage can be realized by any method or technology.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridge, tape magnetic disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media excludes transitory computer-readable media, such as modulated data signals and carrier waves.
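A note on the geometry used by the error calculation module above: lifting pixels to world coordinates P_(xyz) with the depth map and the intrinsics K, moving them with the predicted relative pose, and re-projecting them to image coordinates P_(xy) is the standard differentiable view-synthesis chain in unsupervised depth learning. The NumPy sketch below illustrates only that generic chain under a pinhole camera model; the function names (backproject, warp_coords) are hypothetical and the exact error terms built on top of it are not taken from the patent.

```python
import numpy as np

def backproject(depth, K):
    """Lift every pixel (u, v) of a depth map to camera-space 3-D points.

    depth: (H, W) scene depth; K: (3, 3) pinhole intrinsics.
    Returns (3, H*W) points P = D * K^-1 * [u, v, 1]^T.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])   # homogeneous pixel coordinates
    rays = np.linalg.inv(K) @ pix                            # unit-depth viewing rays
    return rays * depth.ravel()                              # scale each ray by its depth

def warp_coords(depth_t, K, T_t_to_tm1):
    """Project frame-t pixels into frame t-1 using depth and the predicted relative pose.

    T_t_to_tm1: (4, 4) homogeneous transform (camera motion network output, assumed).
    Returns (H, W, 2) sampling coordinates used to synthesise frame t from frame t-1.
    """
    H, W = depth_t.shape
    pts = backproject(depth_t, K)                            # (3, N) points in the frame-t camera
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])     # homogeneous (4, N)
    pts_tm1 = (T_t_to_tm1 @ pts_h)[:3]                       # points expressed in the t-1 camera
    proj = K @ pts_tm1                                       # re-project with the intrinsics
    uv = proj[:2] / np.clip(proj[2:], 1e-6, None)            # perspective divide (z kept positive)
    return uv.T.reshape(H, W, 2)
```

Bilinearly sampling the previous frame at the returned coordinates gives the synthesized view whose difference from the current frame drives the image synthesis error; the depth structure consistency, feature perception and smoothing terms are then added on top of it.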

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A historical information-based scene depth reasoning method and apparatus, and an electronic device. The method comprises: acquiring a first image frame and a second image frame of an image to be detected (S110); acquiring a first depth weight of a pre-constructed depth estimation network and a first motion weight of a pre-constructed camera motion network (S120); calculating a first error on the basis of the first image frame, the second image frame, the first depth weight and the first motion weight and by using an error calculation module (S130); using the first error as a guiding signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network, so as to obtain a second depth weight and a second motion weight (S140); calculating a second error on the basis of the first image frame, the second image frame, the second depth weight and the second motion weight and by using an error calculation module (S150); and determining, according to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame (S160).

Description

一种基于历史信息的场景深度推理方法、装置及电子设备A method, device and electronic equipment for scene depth reasoning based on historical information 技术领域technical field
本申请属于计算机视觉与图像处理技术领域,特别涉及一种基于历史信息的场景深度推理方法、装置及电子设备。The application belongs to the technical field of computer vision and image processing, and in particular relates to a scene depth reasoning method, device and electronic equipment based on historical information.
背景技术Background technique
Accurately recovering scene depth from 2D images helps to better understand the 3D structure of a scene and thus to better complete various visual tasks. However, ordinary cameras capture two-dimensional images and lose the depth information of the scene, so recovering scene depth from two-dimensional images or video sequences has become a fundamental and extremely challenging task in the field of computer vision. Although competitive scene depth can already be recovered from two-dimensional images, a large amount of manually labeled data is required to train the neural network, which is time-consuming and laborious; and once model training is completed, the model weights are frozen, reducing the generalization ability of the algorithm to unknown scenes. In addition, schemes based on fully unsupervised learning that restore scene depth from 2D images must simultaneously predict the camera pose from adjacent frames, and an inaccurate pose produces wrong affine transformation results that directly affect the quality of the synthesized image and therefore the quality of the restored scene depth.
发明内容Contents of the invention
本说明书实施例的目的是提供一种基于历史信息的场景深度推理方法、装置及电子设备。The purpose of the embodiments of this specification is to provide a scene depth reasoning method, device and electronic equipment based on historical information.
为解决上述技术问题,本申请实施例通过以下方式实现的:In order to solve the above technical problems, the embodiments of the present application are implemented in the following ways:
第一方面,本申请提供一种基于历史信息的场景深度推理方法,该方法包括:In the first aspect, the present application provides a scene depth reasoning method based on historical information, the method comprising:
获取待测图像的第一图像帧和第二图像帧,第一图像帧为第二图像帧前一时刻的图像帧;Obtaining the first image frame and the second image frame of the image to be tested, the first image frame being the image frame at the moment before the second image frame;
获取预先构建的深度估计网络的第一深度权重及预先构建的相机运动网络的第一运动权重;Obtaining the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
第一图像帧、第二图像帧、第一深度权重和第一运动权重,采用误差计算模块,计算第一误差;The first image frame, the second image frame, the first depth weight and the first motion weight, using an error calculation module to calculate the first error;
将第一误差作为指导信号联合更新深度估计网络的第一深度权重和相机运动网络的第一运动权重,得到第二深度权重和第二运动权重;Using the first error as a guide signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network to obtain a second depth weight and a second motion weight;
第一图像帧、第二图像帧、第二深度权重和第二运动权重,采用误差计算模块,计算第二误差;The first image frame, the second image frame, the second depth weight and the second motion weight use an error calculation module to calculate a second error;
根据第一误差和第二误差,确定第二图像帧的场景深度及第一图像帧与第二图像帧之间的相对位姿。According to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame are determined.
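Read as pseudo-code, the first-aspect method is a small test-time adaptation loop: score the frame pair with the pre-built weights, take one error-guided joint update of both networks, score again, and output one of the two candidate depth/pose pairs depending on how the two errors compare. Below is a hedged PyTorch-style sketch; the helpers compute_error, depth_net and motion_net are hypothetical names, and since the exact rule for choosing between the two candidates is specified later in the description, the sketch simply keeps the lower-error one as an illustrative stand-in.

```python
import torch

def infer_depth_and_pose(frame_prev, frame_cur, depth_net, motion_net, compute_error, lr=1e-4):
    """One online-decision inference step over an adjacent frame pair (sketch only).

    compute_error(...) is assumed to return (total_error, scene_depth, relative_pose)
    for the current weights of the two networks.
    """
    # First error, computed with the pre-built (first) depth and motion weights.
    err1, depth1, pose1 = compute_error(frame_prev, frame_cur, depth_net, motion_net)

    # Use the first error as a guide signal to jointly update both networks,
    # yielding the second depth weight and the second motion weight.
    params = list(depth_net.parameters()) + list(motion_net.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    optimizer.zero_grad()
    err1.backward()
    optimizer.step()

    # Second error, computed with the updated (second) weights.
    err2, depth2, pose2 = compute_error(frame_prev, frame_cur, depth_net, motion_net)

    # Decide which candidate to output from the two errors (illustrative rule).
    if err2.item() < err1.item():
        return depth2, pose2
    return depth1, pose1
```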
第二方面,本申请提供一种基于历史信息的场景深度推理装置,该装置包括:In a second aspect, the present application provides a device for scene depth reasoning based on historical information, the device comprising:
第一获取模块,用于获取待测图像的第一图像帧和第二图像帧,第一图像帧为第二图像帧前一时刻的图像帧;The first acquisition module is used to acquire the first image frame and the second image frame of the image to be tested, and the first image frame is an image frame at a moment before the second image frame;
第二获取模块,用于获取预先构建的深度估计网络的第一深度权重及预先构建的相机运动网络的第一运动权重;The second acquisition module is used to acquire the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
第一处理模块,用于第一图像帧、第二图像帧、第一深度权重和第一运动权重,采用误差计算模块,计算第一误差;The first processing module is used for the first image frame, the second image frame, the first depth weight and the first motion weight, and uses an error calculation module to calculate the first error;
更新模块,用于将第一误差作为指导信号联合更新深度估计网络的第一深度权重和相机运动网络的第一运动权重,得到第二深度权重和第二运动权重;An update module, configured to use the first error as a guide signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network to obtain a second depth weight and a second motion weight;
第二处理模块,用于第一图像帧、第二图像帧、第二深度权重和第二运动权重,采用误差计算模块,计算第二误差;The second processing module is used for the first image frame, the second image frame, the second depth weight and the second motion weight, and uses an error calculation module to calculate the second error;
确定模块,用于根据第一误差和第二误差,确定第二图像帧的场景深度及第一图像帧与第二图像帧之间的相对位姿。The determination module is configured to determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
第三方面,本申请提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行程序时实现如第一方面的基于历史信息的场景深度推理方法。In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor. When the processor executes the program, the scene depth based on historical information as in the first aspect is realized. reasoning method.
It can be seen from the technical solution provided by the above embodiments of this specification that the solution recovers scene depth from two-dimensional images in a completely unsupervised form; through the temporal attention module, the historical frame information held in the memory unit is injected into the current input unit, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and reduce the influence of wrong affine transformations caused by an inaccurate pose; during inference, online decision reasoning is used to improve the generalization ability of the algorithm to unknown scenes.
附图说明Description of drawings
为了更清楚地说明本说明书实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中 的附图仅仅是本说明书中记载的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of this specification or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only These are some embodiments described in this specification. Those skilled in the art can also obtain other drawings based on these drawings without any creative effort.
图1为本申请提供的基于历史信息的场景深度推理方法的流程示意图;FIG. 1 is a schematic flow diagram of a scene depth reasoning method based on historical information provided by the present application;
图2为本申请实施例提供的深度估计网络和相机运动网络联合训练框图;Fig. 2 is a joint training block diagram of the depth estimation network and the camera motion network provided by the embodiment of the present application;
图3为本申请实施例提供的时间注意力模块的原理示意图;FIG. 3 is a schematic diagram of the principle of the temporal attention module provided in the embodiment of the present application;
图4为本申请实施例提供的时空相关性模块的原理示意图;FIG. 4 is a schematic diagram of the principle of the spatio-temporal correlation module provided by the embodiment of the present application;
图5为本申请实施例提供的场景深度推理过程示意图;FIG. 5 is a schematic diagram of the scene depth inference process provided by the embodiment of the present application;
图6为本申请提供的基于历史信息的场景深度推理装置的结构示意图;FIG. 6 is a schematic structural diagram of a scene depth reasoning device based on historical information provided by the present application;
图7为本申请提供的电子设备的结构示意图。FIG. 7 is a schematic structural diagram of an electronic device provided by the present application.
具体实施方式Detailed ways
为了使本技术领域的人员更好地理解本说明书中的技术方案,下面将结合本说明书实施例中的附图,对本说明书实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本说明书一部分实施例,而不是全部的实施例。基于本说明书中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都应当属于本说明书保护的范围。In order to enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below in conjunction with the drawings in the embodiments of this specification. Obviously, the described The embodiments are only some of the embodiments in this specification, not all of them. Based on the embodiments in this specification, all other embodiments obtained by persons of ordinary skill in the art without creative efforts shall fall within the protection scope of this specification.
以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、技术之类的具体细节,以便透彻理解本申请实施例。然而,本领域的技术人员应当清楚,在没有这些具体细节的其它实施例中也可以实现本申请。在其它情况中,省略对众所周知的系统、装置、电路以及方法的详细说明,以免不必要的细节妨碍本申请的描述。In the following description, specific details such as specific system structures and technologies are presented for the purpose of illustration rather than limitation, so as to thoroughly understand the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
在不背离本申请的范围或精神的情况下,可对本申请说明书的具体实施方式做多种改进和变化,这对本领域技术人员而言是显而易见的。由本申请的说明书得到的其他实施方式对技术人员而言是显而易见得的。本申请说明书和实施例仅是示例性的。It will be apparent to those skilled in the art that various modifications and changes can be made to the specific embodiments described in the present application without departing from the scope or spirit of the present application. Other embodiments will be apparent to those skilled in the art from the description of this application. The specification and examples in this application are exemplary only.
关于本文中所使用的“包含”、“包括”、“具有”、“含有”等等,均为开放性的用语,即意指包含但不限于。As used herein, "comprising", "comprising", "having", "comprising" and so on are all open terms, meaning including but not limited to.
本申请中的“份”如无特别说明,均按质量份计。The "parts" in this application are by mass parts unless otherwise specified.
下面结合附图和实施例对本发明进一步详细说明。The present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments.
参照图1,其示出了适用于本申请实施例提供的基于历史信息的场景深度推理方法的流程示意图。Referring to FIG. 1 , it shows a schematic flow chart applicable to a method for scene depth reasoning based on historical information provided by an embodiment of the present application.
如图1所示,基于历史信息的场景深度推理方法,可以包括:As shown in Figure 1, the scene depth reasoning method based on historical information may include:
S110、获取待测图像的第一图像帧和第二图像帧,第一图像帧为第二图像帧前一时刻的图像帧。S110. Acquire a first image frame and a second image frame of the image to be tested, where the first image frame is an image frame at a moment before the second image frame.
其中,待测图像为需要推理预测场景深度的任意一幅图像,待测图像可以为二维图像。Wherein, the image to be tested is any image that requires reasoning and prediction of scene depth, and the image to be tested may be a two-dimensional image.
The image to be tested is clipped at equal time intervals into several adjacent image frames, where the first image frame and the second image frame of the image to be tested are two image frames at adjacent moments and the first image frame is the image frame at the moment before the second image frame; for example, the first image frame is the image frame at time t-1 and the second image frame is the image frame at time t, or the first image frame is the image frame at time t and the second image frame is the image frame at time t+1, and so on.
S120、获取预先构建的深度估计网络的第一深度权重及预先构建的相机运动网络的第一运动权重。S120. Acquire a first depth weight of a pre-built depth estimation network and a first motion weight of a pre-built camera motion network.
其中,深度估计网络用于估计二维图像的场景深度,该深度估计网络可以采用具有编码器解码器结构的神经网络,对于该深度估计网络采用的神经网络的类型和网络结构不做限定。Wherein, the depth estimation network is used to estimate the scene depth of the two-dimensional image, and the depth estimation network may adopt a neural network with an encoder-decoder structure, and the type and network structure of the neural network adopted by the depth estimation network are not limited.
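Since the patent leaves the backbone open, any encoder-decoder that maps an RGB frame to a per-pixel depth map can play this role. The deliberately small PyTorch sketch below only illustrates that structure and is not the network used in the embodiment; the channel widths and output scaling are assumptions.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Minimal encoder-decoder depth estimator: 3xHxW image -> 1xHxW depth map."""
    def __init__(self, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, image):
        feat = self.encoder(image)
        # Sigmoid keeps the output in (0, 1); a scale/shift can map it to metric depth.
        return torch.sigmoid(self.decoder(feat))

depth = TinyDepthNet()(torch.rand(1, 3, 256, 832))   # -> torch.Size([1, 1, 256, 832])
```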
相机运动网络用于预测相邻图像帧之间的相对位姿。A camera motion network is used to predict the relative pose between adjacent image frames.
深度估计网络和相机运动网络均是预先构建及训练好的。Both the depth estimation network and the camera motion network are pre-built and trained.
Referring to FIG. 2, which shows the joint training block diagram of the depth estimation network and the camera motion network provided by the embodiment of the present application (the leftmost original image, the scene depth picture and the synthesized view in FIG. 2 are all color images, rendered here as grayscale images). It can be understood that, to train the depth estimation network and the camera motion network, the training set data is obtained first. In this application, the training set data consists of groups of image frames at three adjacent moments; for example, the image frame at time t-1, the image frame at time t and the image frame at time t+1 form one piece of data in the training data set, the image frame at time t-2, the image frame at time t-1 and the image frame at time t form another piece of data, and so on. Following the training block diagram shown in FIG. 2, the image frame at time t-1, the image frame at time t and the image frame at time t+1 are used as the input of the camera motion network and the depth estimation network, and the depth estimation network and the camera motion network are jointly trained using the objective function, which guides the update of the depth weights of the depth estimation network and the motion weights of the camera motion network.
It can be understood that, before the image frame at time t-1, the image frame at time t and the image frame at time t+1 are input into the depth estimation network and the camera motion network, image preprocessing is performed first, for example random flipping, random cropping and data normalization, and the processed data are converted into tensor data of dimension C×H×W (the batch dimension is omitted here), where C represents the channel dimension of a sample: during training, C=3 for the depth estimation network and C=9 for the camera motion network (the camera motion network then takes 3 image frames at adjacent moments as input), while during test-time inference (that is, when the historical information-based scene depth reasoning method of this application is carried out) C=6 for the camera motion network (the input of the camera motion network during inference is 2 image frames at adjacent moments); H represents the height of the input sample image, for example H=256, and W represents the width of the input sample image, for example W=832.
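Concretely, the channel dimension C simply counts stacked RGB frames. A minimal sketch of the tensor shapes quoted above follows; the flip/crop/normalization transforms are omitted and random tensors stand in for preprocessed frames.

```python
import torch

H, W = 256, 832                                      # example input resolution quoted above
frames = [torch.rand(3, H, W) for _ in range(3)]     # I_{t-1}, I_t, I_{t+1} after preprocessing

depth_in = frames[1]                                 # C=3: single frame for the depth network
train_in = torch.cat(frames, dim=0)                  # C=9: three adjacent frames (training)
infer_in = torch.cat(frames[:2], dim=0)              # C=6: two adjacent frames (inference)

print(depth_in.shape, train_in.shape, infer_in.shape)
# torch.Size([3, 256, 832]) torch.Size([9, 256, 832]) torch.Size([6, 256, 832])
```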
继续参照图2,相机运动网络可以包括编码器、时间注意力模块和时空相关性模块。Continuing to refer to Figure 2, the camera motion network may include an encoder, a temporal attention module, and a spatiotemporal correlation module.
其中,编码器用于提取堆叠图像帧的特征,得到堆叠特征图;堆叠图像帧为第一图像帧和第二图像帧按通道维度堆叠得到的。Among them, the encoder is used to extract the features of the stacked image frame to obtain the stacked feature map; the stacked image frame is obtained by stacking the first image frame and the second image frame according to the channel dimension.
时间注意力模块用于将历史记忆单元的信息与当前输入单元的信息建立全局依赖关系,并通过更新单元,将全局相关的历史记忆单元中的信息注入到当前输入单元,同时将当前输入单元中的全局相关信息储存到历史记忆单元,作为下一时刻的历史记忆单元;当前输入单元包括堆叠特征图,堆叠特征图通过更新单元更新为更新后特征图,历史记忆单元包括第一记忆特征图和第一时间特征图,下一时刻的历史记忆单元包括第二记忆特征图和第二时间特征图。The time attention module is used to establish a global dependency relationship between the information of the historical memory unit and the information of the current input unit, and inject the information in the globally related historical memory unit into the current input unit through the update unit, and at the same time inject the information in the current input unit The global relevant information of is stored in the historical memory unit as the historical memory unit at the next moment; the current input unit includes a stacked feature map, and the stacked feature map is updated to an updated feature map by the update unit, and the historical memory unit includes the first memory feature map and The first time feature map, the historical memory unit at the next moment includes the second memory feature map and the second time feature map.
时空相关性模块用于将更新后特征图/第二记忆特征图分别建模成为具有空间相关性的第一/第二时空特征图。The spatio-temporal correlation module is used to model the updated feature map/the second memory feature map into the first/second spatio-temporal feature map with spatial correlation respectively.
在一个实施例中,时间注意力模块,利用共享的时间注意力权重,将历史记忆单元的信息与当前输入单元的信息建立全局依赖关系,并通过更新单元,将全局相关的历史记忆单元中的信息注入到当前输入单元,同时将当前输入单元中的全局相关信息储存到历史记忆单元,作为下一时刻的历史记忆单元,包括:In one embodiment, the time attention module uses the shared time attention weight to establish a global dependency relationship between the information of the historical memory unit and the information of the current input unit, and through the update unit, the information in the globally related historical memory unit The information is injected into the current input unit, and at the same time, the global relevant information in the current input unit is stored in the historical memory unit as the historical memory unit at the next moment, including:
将堆叠特征图中的特征信息和第一记忆特征图的特征信息注入到第一时间特征图,得到第三时间特征图;Injecting the feature information in the stacked feature map and the feature information of the first memory feature map into the first time feature map to obtain a third time feature map;
根据第三时间特征图,确定时间注意力特征向量;Determine the time attention feature vector according to the third time feature map;
根据堆叠特征图,确定第一特征向量;Determining a first feature vector according to the stacked feature map;
根据第一记忆特征图;确定第二特征向量According to the first memory feature map; determine the second feature vector
根据第一特征向量和时间注意力特征向量,确定基于时间注意力的输入特征向量;According to the first feature vector and the time attention feature vector, determine the input feature vector based on time attention;
根据第二特征向量和时间注意力特征向量,确定基于时间注意力的记忆特征向量;According to the second feature vector and the time attention feature vector, determine the memory feature vector based on time attention;
分别将基于时间注意力的输入特征向量和基于时间注意力的记忆特征向量,调整成对应的第一特征图和第二特征图;Respectively adjust the input feature vector based on time attention and the memory feature vector based on time attention into corresponding first feature map and second feature map;
根据第一特征图和第二特征图,确定更新后特征图和第二记忆特征图;Determine the updated feature map and the second memory feature map according to the first feature map and the second feature map;
根据更新后特征图和第二记忆特征图,将第三时间特征图更新为第二时间特征图。The third time feature map is updated to the second time feature map according to the updated feature map and the second memory feature map.
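The sub-steps above (detailed with formulas (1)-(10) below) describe a gated, attention-weighted exchange between the current input and the historical memory. Because several of those formulas are reproduced only as images in the publication, the following PyTorch sketch is a schematic stand-in for the described information flow rather than the patented equations: every learned weight is a placeholder 1x1 convolution, and the attention weighting of the reshaped vectors is approximated by a pooled channel attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionSketch(nn.Module):
    """Schematic stand-in for the temporal attention module (not the patented formulas).

    Inputs: stacked input map X_t, previous time map X_(time,t-1), previous memory map X_(m,t-1).
    Outputs: updated input map, second memory map X_(m,t), second time map X_(time,t).
    """

    def __init__(self, c: int):
        super().__init__()
        self.mix = nn.Conv2d(3 * c, c, 1)       # step: inject X_t and X_(m,t-1) into the time map
        self.gate_s = nn.Conv2d(2 * c, c, 1)    # information selection gate G_s
        self.fuse = nn.Conv2d(2 * c, c, 1)      # new map containing memory information
        self.gate_m = nn.Conv2d(2 * c, c, 1)    # memory gate
        self.gate_o = nn.Conv2d(2 * c, c, 1)    # output gate
        self.time_out = nn.Conv2d(3 * c, c, 1)  # time-map update

    def forward(self, x_in, x_time_prev, x_mem_prev):
        # Third time feature map via a GELU-activated mixture of all three inputs.
        time_mid = F.gelu(self.mix(torch.cat([x_in, x_time_prev, x_mem_prev], dim=1)))

        # Attention weighting of input and memory, approximated by pooled channel attention.
        attn = torch.softmax(time_mid.mean(dim=(2, 3)), dim=1)[:, :, None, None]
        x_i, x_m = x_in * attn, x_mem_prev * attn

        # Selection gate decides how much memory flows into the input.
        g_s = torch.sigmoid(self.gate_s(torch.cat([x_i, x_m], dim=1)))
        # New feature map containing the selected memory information.
        x_im = torch.tanh(self.fuse(torch.cat([x_i, g_s * x_m], dim=1)))
        # Memory gate blends old memory with the new map -> second memory map.
        g_m = torch.sigmoid(self.gate_m(torch.cat([x_i, x_m], dim=1)))
        x_mem = g_m * x_m + (1.0 - g_m) * x_im
        # Output gate produces the updated input map used at the next moment.
        g_o = torch.sigmoid(self.gate_o(torch.cat([x_i, x_m], dim=1)))
        x_out = g_o * x_im
        # Second time feature map from the intermediate, output and memory maps.
        x_time = torch.tanh(self.time_out(torch.cat([time_mid, x_out, x_mem], dim=1)))
        return x_out, x_mem, x_time
```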
示例性的,参照图3,其示出了本申请实施例提供的时间注意力模块的原理示意图。为描述方便,假设t时刻的输入特征图(即堆叠特征图)表示为
Figure PCTCN2022076348-appb-000001
t-1时刻时间特征图(即第一时间特征图)为
Figure PCTCN2022076348-appb-000002
t-1时刻时间记忆特征图(即第一记忆特征图)为
Figure PCTCN2022076348-appb-000003
其计算过程如下:
For example, refer to FIG. 3 , which shows a schematic diagram of the principle of the temporal attention module provided by the embodiment of the present application. For the convenience of description, it is assumed that the input feature map at time t (that is, the stacked feature map) is expressed as
Figure PCTCN2022076348-appb-000001
The time feature map at time t-1 (that is, the first time feature map) is
Figure PCTCN2022076348-appb-000002
The temporal memory feature map at time t-1 (namely, the first memory feature map) is
Figure PCTCN2022076348-appb-000003
Its calculation process is as follows:
1)根据公式(1)将t时刻的输入特征图X t中的特征信息和t-1时刻记忆特征图X (m,t-1)中的特征信息注入到t-1时刻时间特征图X (time,t-1)中,得到第三时间特征图
Figure PCTCN2022076348-appb-000004
1) According to the formula (1), the feature information in the input feature map X t at time t and the feature information in the memory feature map X (m,t-1) at time t-1 are injected into the time feature map X at time t-1 (time,t-1) , get the third time feature map
Figure PCTCN2022076348-appb-000004
Figure PCTCN2022076348-appb-000005
Figure PCTCN2022076348-appb-000005
其中,
Figure PCTCN2022076348-appb-000006
表示所属特征空间,C表示特征图通道数量,H表示特征图的高度,W表示特征图的宽度,δ gelu(·)表示激活函数,W (i,t),W (time,t-1),W (m,t-1)表示学习出的对应的权重,b (i,t),b (time,t-1),b (m,t-1)表示对应的偏执项。
in,
Figure PCTCN2022076348-appb-000006
Represents the feature space, C represents the number of feature map channels, H represents the height of the feature map, W represents the width of the feature map, δ gelu ( ) represents the activation function, W (i,t) ,W (time,t-1) , W (m, t-1) represents the corresponding learned weight, b (i, t) , b (time, t-1) , b (m, t-1) represents the corresponding paranoid item.
2)根据公式组(2)计算出时间注意力特征向量x (qk_time,t-1)2) Calculate the time attention feature vector x (qk_time,t-1) according to the formula group (2):
Figure PCTCN2022076348-appb-000007
Figure PCTCN2022076348-appb-000007
Figure PCTCN2022076348-appb-000008
Figure PCTCN2022076348-appb-000008
其中,s表示标量缩放因子,“*”表示对应元素的乘积,函数F split(·)表示按照通道维度对特征图进行切片,函数F reshape(·)用于将特征图或特征向量调整成预设形状,“T”表示转置,
Figure PCTCN2022076348-appb-000009
表示对应的权重,
Figure PCTCN2022076348-appb-000010
表示对应的偏执项。
Among them, s represents a scalar scaling factor, “*” represents the product of corresponding elements, the function F split (·) represents the slice of the feature map according to the channel dimension, and the function F reshape (·) is used to adjust the feature map or feature vector into a pre-set Let shape, "T" means transpose,
Figure PCTCN2022076348-appb-000009
Indicates the corresponding weight,
Figure PCTCN2022076348-appb-000010
Denotes the corresponding paranoid term.
3)根据公式组(3)将t时刻的输入特征图X (i,t)调整成特征向量(即第一特征向量)
Figure PCTCN2022076348-appb-000011
将t-1时刻记忆特征图X (m,t-1)调整成特征向量(即第二特征向量)
Figure PCTCN2022076348-appb-000012
3) According to the formula group (3), adjust the input feature map X (i, t) at time t into a feature vector (ie, the first feature vector)
Figure PCTCN2022076348-appb-000011
Adjust the memory feature map X (m,t-1) at time t-1 into a feature vector (ie, the second feature vector)
Figure PCTCN2022076348-appb-000012
x (i,t)=F reshape(X (i,t)) x (i,t) = F reshape (X (i,t) )
x (m,t-1)=F reshape(X (m,t-1))    (3) x (m,t-1) = F reshape (X (m,t-1) ) (3)
4)根据公式组(4)计算出基于时间注意力的输入特征向量
Figure PCTCN2022076348-appb-000013
以及基于时间注意力的记忆特征向量
Figure PCTCN2022076348-appb-000014
4) Calculate the input feature vector based on time attention according to formula group (4)
Figure PCTCN2022076348-appb-000013
and memory feature vectors based on temporal attention
Figure PCTCN2022076348-appb-000014
Figure PCTCN2022076348-appb-000015
Figure PCTCN2022076348-appb-000015
5)根据公式组(5)分别将特征向量x (qk_i,t)和x (qk_m,t-1)调整成对应的特征图
Figure PCTCN2022076348-appb-000016
(即第一特征图)和
Figure PCTCN2022076348-appb-000017
(即第二特征图):
5) Adjust the feature vectors x (qk_i,t) and x (qk_m,t-1) to the corresponding feature maps according to the formula group (5)
Figure PCTCN2022076348-appb-000016
(i.e. the first feature map) and
Figure PCTCN2022076348-appb-000017
(i.e. the second feature map):
Figure PCTCN2022076348-appb-000018
Figure PCTCN2022076348-appb-000018
6)根据公式(6)计算出信息选择门
Figure PCTCN2022076348-appb-000019
用于有选择的将t-1时刻记忆特征图X (qk_m,t-1)中的信息注入到t时刻的输入特征图X (qk_i,t)中:
6) Calculate the information selection gate according to formula (6)
Figure PCTCN2022076348-appb-000019
It is used to selectively inject the information in the memory feature map X (qk_m,t-1) at time t-1 into the input feature map X (qk_i,t) at time t:
G s=δ sig(W (qk_ms,t-1)X (qk_m,t-1)+b (qk_ms,t-1)+W (qk_is,t)X (qk_i,t)+b (qk_is,t))   (6) G s =δ sig (W (qk_ms,t-1) X (qk_m,t-1) +b (qk_ms,t-1) +W (qk_is,t) X (qk_i,t) +b (qk_is,t ) ) (6)
其中,函数δ sig(·)表示sigmoid激活函数,W (qk_ms,t-1)和W (qk_is,t)表示对应的权 重,b (qk_ms,t-1)和b (qk_is,t)表示对应的偏执项。 Among them, the function δ sig (·) represents the sigmoid activation function, W (qk_ms,t-1) and W (qk_is,t) represent the corresponding weights, b (qk_ms,t-1) and b (qk_is,t) represent the corresponding paranoia item.
7) According to formula (7), compute a new feature map X_(im,t) containing the memory feature map information:
X_(im,t) = δ_tanh(W_(qk_imi,t) X_(qk_i,t) + b_(qk_imi,t) + G_s * (W_(qk_imm,t-1) X_(qk_m,t-1) + b_(qk_imm,t-1)))    (7)
where the function δ_tanh(·) denotes the tanh activation function, W_(qk_imi,t) and W_(qk_imm,t-1) denote the corresponding weights, and b_(qk_imi,t) and b_(qk_imm,t-1) denote the corresponding bias terms.
8) According to formula group (8) (formula images PCTCN2022076348-appb-000021 to -000023), compute the memory gate, which is used to update the information of the memory feature map X_(qk_m,t-1) at time t-1 into the memory feature map at time t (i.e., the second memory feature map), where W_(qk_ir,t) and W_(qk_mr,t-1) denote the corresponding weights, and b_(qk_ir,t) and b_(qk_mr,t-1) denote the corresponding bias terms.
9) According to formula (9) (formula images PCTCN2022076348-appb-000024 to -000026), compute the output gate, which is used to update the input feature map X_(qk_i,t) and obtain an updated new feature map (i.e., the updated feature map) that serves as the input feature map at the next moment, where W_(qk_io,t) and W_(qk_mo,t-1) denote the corresponding weights, and b_(qk_io,t) and b_(qk_mo,t-1) denote the corresponding bias terms.
10) According to formula (10) (formula image PCTCN2022076348-appb-000028), update the temporal feature map at time t-1 (formula image PCTCN2022076348-appb-000027) to the temporal feature map at time t (i.e., the second temporal feature map) X_(time,t), where W_(time,t-1), W_(io,t) and W_(m,t) denote the corresponding weights, and b_(time,t-1), b_(io,t) and b_(m,t) denote the corresponding bias terms.
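To make the gating flow above easier to follow, a minimal PyTorch-style sketch is given below. Only formulas (6) and (7) are reproduced verbatim in this text; formulas (8)–(10) are published as images here, so the memory-gate and output-gate updates in the sketch use an assumed GRU-like form, the learned weights are modeled as 1×1 convolutions, and every class and variable name (TemporalAttentionGate, w_is, w_ms, ...) is illustrative rather than taken from the original.

```python
import torch
import torch.nn as nn

class TemporalAttentionGate(nn.Module):
    """Hypothetical sketch of injecting the t-1 memory feature map
    X_(qk_m,t-1) into the t input feature map X_(qk_i,t)."""

    def __init__(self, channels: int):
        super().__init__()
        def conv():
            # each learned weight W_* / bias b_* pair is modeled as a 1x1 convolution
            return nn.Conv2d(channels, channels, kernel_size=1)
        self.w_is, self.w_ms = conv(), conv()    # formula (6): information selection gate
        self.w_imi, self.w_imm = conv(), conv()  # formula (7): information injection
        self.w_ir, self.w_mr = conv(), conv()    # formula (8): memory gate (assumed form)
        self.w_io, self.w_mo = conv(), conv()    # formula (9): output gate (assumed form)

    def forward(self, x_i, x_m):
        # (6) information selection gate G_s
        g_s = torch.sigmoid(self.w_is(x_i) + self.w_ms(x_m))
        # (7) new feature map X_(im,t) containing memory information
        x_im = torch.tanh(self.w_imi(x_i) + g_s * self.w_imm(x_m))
        # (8) memory gate -> second memory feature map (assumed GRU-like update)
        g_r = torch.sigmoid(self.w_ir(x_i) + self.w_mr(x_m))
        x_m_new = g_r * x_m + (1.0 - g_r) * x_im
        # (9) output gate -> updated feature map, used as the input at the next moment (assumed)
        g_o = torch.sigmoid(self.w_io(x_i) + self.w_mo(x_m))
        x_i_new = g_o * x_im
        return x_i_new, x_m_new
```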
In order to exploit the global spatial structure information of the feature map and the dependencies between spatial structures when inferring camera motion, the spatio-temporal correlation module shown in Fig. 4 is constructed: global spatial correlation weights are used to model the spatial context information, and the dependencies between the feature map channels of the stacked frames are modeled to constrain the temporal information between the stacked frames.
In one embodiment, the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into the first and second spatio-temporal feature maps with spatial correlation, respectively.
Modeling the updated feature map as the first spatio-temporal feature map with spatial correlation includes:
slicing the updated feature map along the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
reshaping the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector, respectively;
calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weighting the third sub-feature vector with the first spatial correlation matrix to obtain a first spatially correlated feature vector;
determining a first spatio-temporal feature vector according to the first spatially correlated feature vector and the third feature vector;
reshaping the first spatio-temporal feature vector into the first spatio-temporal feature map with spatial correlation.
Modeling the second memory feature map as the second spatio-temporal feature map with spatial correlation includes:
slicing the second memory feature map along the channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
reshaping the second memory feature map, the fourth sub-feature map, the fifth sub-feature map and the sixth sub-feature map into a fourth feature vector, a fourth sub-feature vector, a fifth sub-feature vector and a sixth sub-feature vector, respectively;
calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weighting the sixth sub-feature vector with the second spatial correlation matrix to obtain a second spatially correlated feature vector;
determining a second spatio-temporal feature vector according to the second spatially correlated feature vector and the fourth feature vector;
reshaping the second spatio-temporal feature vector into the second spatio-temporal feature map with spatial correlation.
Exemplarily, referring to Fig. 4, the calculation is as follows:
1) According to formula group (11) (formula images PCTCN2022076348-appb-000032 and -000033), transform the input feature map (i.e., the updated feature map or the second memory feature map) to obtain an intermediate feature map and divide it equally into three sub-feature maps (i.e., the first/fourth, second/fifth and third/sixth sub-feature maps), where the function F_split(·) slices the feature map along the channel dimension, W_mid is the corresponding weight and b_mid denotes the corresponding bias term.
2) According to formula group (12) (formula image PCTCN2022076348-appb-000035), reshape the corresponding feature maps into feature vectors, where the function F_reshape(·) reshapes a feature map or feature vector into a preset shape.
3) According to formula (13) (formula image PCTCN2022076348-appb-000039), compute the spatial correlation matrix between the first/fourth sub-feature map and the second/fifth sub-feature map, where s denotes a scalar scaling factor.
4) Use the spatial correlation matrix computed above to weight the third/sixth sub-feature vector, obtaining the spatially correlated feature vectors (including the first spatially correlated feature vector and the second spatially correlated feature vector), as shown in formula (14) (formula image PCTCN2022076348-appb-000042).
5) According to formula (15) (formula image PCTCN2022076348-appb-000043), model the dependencies between the spatial structures of the feature map: by modeling the dependencies between the feature map channels of the stacked frames, the temporal information between the stacked frames is constrained and the first/second spatio-temporal feature vector x_time_corr is computed, where F_C(·) consists of two layers of one-dimensional convolution with kernel size 3 and stride 1 together with activation functions.
6) Finally, reshape the spatially correlated first/second spatio-temporal feature vector x_time_corr into the first/second spatio-temporal feature map with spatial correlation (formula image PCTCN2022076348-appb-000044).
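As a rough illustration of steps 1)–6), the following PyTorch-style sketch re-expresses the slicing, spatial correlation weighting and channel-wise modeling. Formulas (11)–(15) are published only as images here, so the 1×1 convolution standing in for W_mid, the softmax form of the correlation matrix, the per-position application of F_C and the residual combination are all assumptions, and every name in the sketch is illustrative.

```python
import torch
import torch.nn as nn

class SpatioTemporalCorrelation(nn.Module):
    """Hypothetical sketch: slice a feature map into three sub-maps, build a
    spatial correlation matrix from the first two, weight the third with it,
    then model channel-wise dependencies with two 1-D convolutions (F_C)."""

    def __init__(self, channels: int, scale: float = 1.0):
        super().__init__()
        # formula (11): W_mid / b_mid modeled as a 1x1 convolution producing three slices
        self.w_mid = nn.Conv2d(channels, 3 * channels, kernel_size=1)
        # F_C: two 1-D convolutions, kernel size 3, stride 1, with activations
        self.f_c = nn.Sequential(
            nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=1), nn.GELU(),
            nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=1), nn.GELU(),
        )
        self.scale = scale  # scalar scaling factor s

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = torch.chunk(self.w_mid(x), 3, dim=1)      # (11): three sub-feature maps
        q, k, v = (t.reshape(b, c, h * w) for t in (q, k, v))  # (12): maps -> vectors
        corr = torch.softmax(q.transpose(1, 2) @ k / self.scale, dim=-1)  # (13): (B, HW, HW), softmax assumed
        v_corr = v @ corr                                    # (14): weight the third sub-feature vector
        # (15): channel-wise dependencies, applied here one spatial position at a time
        seq = v_corr.permute(0, 2, 1).reshape(b * h * w, 1, c)
        seq = self.f_c(seq).reshape(b, h * w, c).permute(0, 2, 1)
        x_time = seq + x.reshape(b, c, h * w)                # assumed residual combination
        return x_time.reshape(b, c, h, w)                    # 6): back to a spatio-temporal feature map
```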
S130. Calculating the first error with the error calculation module based on the first image frame, the second image frame, the first depth weight and the first motion weight includes:
inputting the first image frame and the second image frame into the pre-built depth estimation network, obtaining the first scene depth of the first image frame and the first encoder feature map of the first image frame according to the first image frame and the first depth weight, and obtaining the second scene depth of the second image frame and the second encoder feature map of the second image frame according to the second image frame and the first depth weight;
inputting the first image frame and the second image frame into the pre-built camera motion network, and obtaining the first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the first motion weight;
calculating the first error with the error calculation module based on the first scene depth, the second scene depth and the first relative pose.
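A compact sketch of this call flow is given below (S150 reuses the same flow with the updated weights). The names depth_net, pose_net and error_module stand in for the pre-built depth estimation network, camera motion network and error calculation module; their signatures and the helper name compute_error are assumptions made for illustration only.

```python
def compute_error(frame_prev, frame_cur, depth_weights, motion_weights,
                  depth_net, pose_net, error_module, intrinsics):
    """Run one forward pass with a given weight set and return the quantities
    that S130/S150 feed into the error calculation module (a sketch)."""
    depth_net.load_state_dict(depth_weights)
    pose_net.load_state_dict(motion_weights)

    depth_prev, enc_feat_prev = depth_net(frame_prev)   # first scene depth + first encoder feature map
    depth_cur, enc_feat_cur = depth_net(frame_cur)      # second scene depth + second encoder feature map
    rel_pose = pose_net(frame_prev, frame_cur)          # first relative pose (t-1 -> t)

    error = error_module(frame_prev, frame_cur,
                         depth_prev, depth_cur,
                         enc_feat_prev, enc_feat_cur,
                         rel_pose, intrinsics)
    return error, depth_cur, rel_pose
```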
In one embodiment, the total error is determined according to an image synthesis error, a scene depth structure consistency error, a feature perception loss error and a smoothing loss error, where the total error includes the first error and the second error.
Specifically, determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error includes:
obtaining the first image coordinates of the first image frame and the second image coordinates of the second image frame;
determining the first world coordinates of the first image frame according to the first image coordinates, the camera intrinsics and the first scene depth;
determining the second world coordinates of the second image frame according to the second image coordinates, the camera intrinsics and the second scene depth;
affine-transforming the first world coordinates of the first image frame to the plane of the second image frame, and determining the affine-transformed third world coordinates;
affine-transforming the second world coordinates of the second image frame to the plane of the first image frame, and determining the affine-transformed fourth world coordinates;
projecting the third world coordinates and the fourth world coordinates onto the two-dimensional plane respectively, to obtain the first affine-transformed scene depth and the second affine-transformed scene depth as well as the corresponding first affine-transformed image coordinates and second affine-transformed image coordinates;
determining the scene depth structure consistency error, the first depth structure inconsistency weight and the second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining the first camera-flow consistency occlusion mask and the second camera-flow consistency occlusion mask according to the first image coordinates of the first image frame, the second affine-transformed image coordinates, the second image coordinates of the second image frame and the first affine-transformed image coordinates;
determining the image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera-flow consistency occlusion mask and the second camera-flow consistency occlusion mask;
determining the feature perception loss error according to the first image frame, the second image frame, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determining the smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
Exemplarily, for convenience of description, assume that the trained scene depth fitting function is D_t = F_D(I_t | W_D), where I_t denotes the two-dimensional image whose scene depth at time t is to be recovered, W_D denotes the learned weight parameters of the scene depth fitting function F_D(·), D_t denotes the recovered scene depth of the two-dimensional image I_t at time t, and X_(enc,t) denotes the feature map of the image frame I_t at time t output by the encoder. The pose transforming the image frame I_(t-1) at time t-1 to the image frame I_t at time t is denoted by the symbol in formula image PCTCN2022076348-appb-000045, W_T denotes the learned weight parameters of the pose transformation function F_T(·), the camera intrinsics are denoted by K, P_(xy,t-1) denotes the image coordinates of the image frame I_(t-1), P_(xyz,t-1) the world coordinates of the image frame I_(t-1), P_(xy,t) the image coordinates of the image frame I_t, and P_(xyz,t) the world coordinates of the image frame I_t. The total error is calculated as follows:
1) According to formula (16) (formula image PCTCN2022076348-appb-000046), compute the world coordinates (i.e., the first world coordinates) P_(xyz,t-1) of the image frame I_(t-1) and the world coordinates (i.e., the second world coordinates) P_(xyz,t) of the image frame I_t, where "*" denotes the product of corresponding matrix elements.
2) According to formula (17) (formula image PCTCN2022076348-appb-000047), affine-transform the world coordinates P_(xyz,t-1) of the image frame I_(t-1) to the plane of the image frame I_t to obtain the affine-transformed world coordinates (i.e., the third world coordinates) P_(proj_xyz,t), and affine-transform the world coordinates P_(xyz,t) of the image frame I_t to the plane of the image frame I_(t-1) to obtain the affine-transformed world coordinates (i.e., the fourth world coordinates) P_(proj_xyz,t-1).
3) Project the world coordinates P_(proj_xyz,t) and P_(proj_xyz,t-1) obtained by the affine transformation onto the two-dimensional plane, obtaining the affine-transformed scene depths D_(proj,t) (i.e., the first affine-transformed scene depth) and D_(proj,t-1) (i.e., the second affine-transformed scene depth), as well as the corresponding affine-transformed image coordinates P_(proj_xy,t) (i.e., the first affine-transformed image coordinates) and P_(proj_xy,t-1) (i.e., the second affine-transformed image coordinates).
4) Synthesize the image I_(syn,t) from the image frame I_(t-1) and P_(proj_xy,t-1); synthesize the feature map X_(syn_enc,t) from the encoder feature map X_(enc,t-1) and P_(proj_xy,t-1); synthesize the depth map D_(syn,t) from the estimated depth map D_(t-1) and P_(proj_xy,t-1); synthesize the image I_(syn,t-1) from the image frame I_t and P_(proj_xy,t); synthesize the feature map X_(syn_enc,t-1) from the encoder feature map X_(enc,t) and P_(proj_xy,t); synthesize the depth map D_(syn,t-1) from the estimated depth map D_t and P_(proj_xy,t); compute the forward camera flow U_forward from the image coordinates of I_(t-1) and P_(proj_xy,t-1); compute the backward camera flow U_backward from the image coordinates of I_t and P_(proj_xy,t); synthesize the forward camera flow U_syn_forward from U_forward and P_(proj_xy,t); and synthesize the backward camera flow U_syn_backward from U_backward and P_(proj_xy,t-1).
5) According to formula group (18) (formula image PCTCN2022076348-appb-000048), compute the first camera-flow consistency occlusion mask M_(occ,t-1) and the second camera-flow consistency occlusion mask M_(occ,t).
6) According to formula group (19) (formula images PCTCN2022076348-appb-000049 to -000051), compute the scene depth structure consistency error E_D as well as the first depth structure inconsistency weight M_(D,t-1) and the second depth structure inconsistency weight M_(D,t).
7) According to formula (20) (formula images PCTCN2022076348-appb-000052 and -000053), compute the image synthesis error E_I.
8) According to formula (21), compute the feature perception loss error E_X:
E_X = ERF(X_(enc,t), X_(syn_enc,t)) + ERF(X_(enc,t-1), X_(syn_enc,t-1))    (21)
9) According to formula (22) (formula image PCTCN2022076348-appb-000054), compute the smoothing loss error E_S.
10) According to formula (23) (formula image PCTCN2022076348-appb-000055), compute the total error E.
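For readers who prefer code, the sketch below traces steps 1)–3) and the combination in step 10) in NumPy-style pseudocode. Only the back-projection and affine transformation follow directly from formulas (16)–(17) and the surrounding text; formulas (18)–(23) are published as images here, so the weighted combination of the four error terms (and its weights) is written in a commonly used fully-unsupervised form and should be read as an assumption, and all function names are illustrative.

```python
import numpy as np

def backproject(depth, intrinsics, pix_coords):
    """Formula (16): lift homogeneous pixel coordinates (3, H*W) to world
    coordinates using the scene depth; '*' is an element-wise product."""
    rays = np.linalg.inv(intrinsics) @ pix_coords           # (3, H*W)
    return depth.reshape(1, -1) * rays                      # P_xyz

def affine_transform(points_xyz, pose):
    """Formula (17): move the world points of one frame onto the other frame
    using the relative pose (4x4 homogeneous matrix)."""
    homog = np.vstack([points_xyz, np.ones((1, points_xyz.shape[1]))])
    return (pose @ homog)[:3]                                # P_proj_xyz

def project(points_xyz, intrinsics):
    """Step 3): project transformed world points back to the image plane,
    returning the affine-transformed depth and pixel coordinates."""
    cam = intrinsics @ points_xyz
    depth_proj = cam[2:3]
    pix_proj = cam[:2] / np.clip(depth_proj, 1e-6, None)
    return depth_proj, pix_proj

def total_error(e_image, e_depth_struct, e_feature, e_smooth,
                w_i=1.0, w_d=1.0, w_x=1.0, w_s=1e-3):
    """Step 10): combine the four error terms; the weights w_* are
    illustrative, since formula (23) is not reproduced in this text."""
    return w_i * e_image + w_d * e_depth_struct + w_x * e_feature + w_s * e_smooth
```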
S140. Use the first error as a guiding signal to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network, obtaining the second depth weight and the second motion weight.
S150. Calculating the second error with the error calculation module based on the first image frame, the second image frame, the second depth weight and the second motion weight may include:
inputting the first image frame and the second image frame into the pre-built depth estimation network, obtaining the third scene depth of the first image frame and the third encoder feature map of the first image frame according to the first image frame and the second depth weight, and obtaining the fourth scene depth of the second image frame and the fourth encoder feature map of the second image frame according to the second image frame and the second depth weight;
inputting the first image frame and the second image frame into the pre-built camera motion network, and obtaining the second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
calculating the second error with the error calculation module based on the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder feature map, the fourth encoder feature map and the second relative pose.
This step follows the specific calculation process of step S130, except that the first depth weight and the first motion weight in S130 are replaced by the second depth weight and the second motion weight, so it is not repeated here.
S160. Determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
Specifically, if the first error is greater than the second error, the second scene depth is taken as the scene depth of the second image frame, and the first relative pose is taken as the relative pose between the first image frame and the second image frame;
if the first error is less than or equal to the second error, the fourth scene depth is taken as the scene depth of the second image frame, and the second relative pose is taken as the relative pose between the first image frame and the second image frame.
Referring to Fig. 5, which shows a schematic diagram of the scene depth reasoning process, the process of inferring the scene depth is as follows:
1) Use the weight W_D of the trained depth estimation network and the weight W_T of the trained camera motion network as the model weights of the depth estimation network and the camera motion network during inference and, according to S130, compute the total error E, the pose transformation matrix from the historical frame to the current frame (formula image PCTCN2022076348-appb-000056), and the scene depth D_t of the current frame.
2) Use the total error computed in step 1) as a guiding signal to update the weights of the depth estimation network and the camera motion network, obtaining new model weights (formula images PCTCN2022076348-appb-000057 and -000058).
3) With the model weights obtained in step 2) (formula images PCTCN2022076348-appb-000059 and -000060) and according to S150, compute the total error at this point (formula image PCTCN2022076348-appb-000061), the pose transformation matrix from the historical frame to the current frame (formula image PCTCN2022076348-appb-000062), and the scene depth of the current frame (formula image PCTCN2022076348-appb-000063).
4) Decide the finally output scene depth of the current frame by comparing the magnitudes of the total error E and the total error obtained with the updated weights (formula image PCTCN2022076348-appb-000064).
In the embodiment of the present application, the scene depth is recovered from two-dimensional images in a fully unsupervised manner: the temporal attention module injects the historical frame information held in the memory unit into the current input unit, and the spatial correlation of the spatio-temporal feature map is modeled to improve the accuracy of the camera pose and to reduce the influence of erroneous affine transformations caused by inaccurate poses; during inference, online decision reasoning is used to improve the generalization ability of the algorithm to unknown scenes.
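The online decision reasoning described above can be summarized by the following sketch, which reuses the hypothetical compute_error helper from the S130 sketch. The optimizer, the learning rate and the use of state_dict copies are assumptions made for illustration; the decision in step 4) mirrors the rule stated in S160.

```python
import copy
import torch

def infer_scene_depth(frame_prev, frame_cur, depth_net, pose_net, error_module,
                      intrinsics, trained_depth_w, trained_motion_w, lr=1e-4):
    """Hypothetical sketch of online decision reasoning for one frame pair."""
    # 1) total error E, pose and depth with the trained weights W_D, W_T (as in S130)
    err_1, depth_1, pose_1 = compute_error(frame_prev, frame_cur,
                                           trained_depth_w, trained_motion_w,
                                           depth_net, pose_net, error_module, intrinsics)

    # 2) use E as a guiding signal to jointly update both networks (optimizer is an assumption)
    optimizer = torch.optim.Adam(list(depth_net.parameters()) + list(pose_net.parameters()), lr=lr)
    optimizer.zero_grad()
    err_1.backward()
    optimizer.step()
    updated_depth_w = copy.deepcopy(depth_net.state_dict())
    updated_motion_w = copy.deepcopy(pose_net.state_dict())

    # 3) total error, pose and depth with the updated weights (as in S150)
    err_2, depth_2, pose_2 = compute_error(frame_prev, frame_cur,
                                           updated_depth_w, updated_motion_w,
                                           depth_net, pose_net, error_module, intrinsics)

    # 4) decision, following S160 as written: if the first error is greater than the
    #    second, output the depth/pose from the first weight set; otherwise output the
    #    depth/pose from the updated weight set
    if err_1.item() > err_2.item():
        return depth_1, pose_1
    return depth_2, pose_2
```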
Referring to Fig. 6, which shows a schematic structural diagram of a scene depth reasoning apparatus based on historical information according to an embodiment of the present application.
As shown in Fig. 6, the scene depth reasoning apparatus 600 based on historical information may include:
a first acquisition module 610, configured to acquire a first image frame and a second image frame of the image to be tested, the first image frame being the image frame at the moment before the second image frame;
a second acquisition module 620, configured to acquire the first depth weight of the pre-built depth estimation network and the first motion weight of the pre-built camera motion network;
a first processing module 630, configured to calculate the first error with the error calculation module based on the first image frame, the second image frame, the first depth weight and the first motion weight;
an update module 640, configured to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network using the first error as a guiding signal, to obtain a second depth weight and a second motion weight;
a second processing module 650, configured to calculate the second error with the error calculation module based on the first image frame, the second image frame, the second depth weight and the second motion weight;
a determination module 660, configured to determine the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
Optionally, the camera motion network includes an encoder, a temporal attention module and a spatio-temporal correlation module;
the encoder is used to extract features of the stacked image frame to obtain a stacked feature map, the stacked image frame being obtained by stacking the first image frame and the second image frame along the channel dimension;
the temporal attention module is used to establish a global dependency between the information of the historical memory unit and the information of the current input unit, to inject the globally relevant information of the historical memory unit into the current input unit through the update unit, and at the same time to store the globally relevant information of the current input unit into the historical memory unit as the historical memory unit at the next moment; the current input unit includes the stacked feature map, the stacked feature map is updated to the updated feature map by the update unit, the historical memory unit includes the first memory feature map and the first temporal feature map, and the historical memory unit at the next moment includes the second memory feature map and the second temporal feature map;
the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into the first and second spatio-temporal feature maps with spatial correlation, respectively.
Optionally, the scene depth reasoning apparatus 600 based on historical information is further configured to:
inject the feature information of the stacked feature map and the feature information of the first memory feature map into the first temporal feature map to obtain a third temporal feature map;
determine a temporal attention feature vector according to the third temporal feature map;
determine a first feature vector according to the stacked feature map;
determine a second feature vector according to the first memory feature map;
determine a temporal-attention-based input feature vector according to the first feature vector and the temporal attention feature vector;
determine a temporal-attention-based memory feature vector according to the second feature vector and the temporal attention feature vector;
reshape the temporal-attention-based input feature vector and the temporal-attention-based memory feature vector into the corresponding first feature map and second feature map, respectively;
determine the updated feature map and the second memory feature map according to the first feature map and the second feature map;
update the third temporal feature map to the second temporal feature map according to the updated feature map and the second memory feature map.
Optionally, the scene depth reasoning apparatus 600 based on historical information is further configured to:
slice the updated feature map along the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
reshape the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector, respectively;
calculate a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
weight the third sub-feature vector with the first spatial correlation matrix to obtain a first spatially correlated feature vector;
determine a first spatio-temporal feature vector according to the first spatially correlated feature vector and the third feature vector;
reshape the first spatio-temporal feature vector into the first spatio-temporal feature map with spatial correlation;
slice the second memory feature map along the channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
reshape the second memory feature map, the fourth sub-feature map, the fifth sub-feature map and the sixth sub-feature map into a fourth feature vector, a fourth sub-feature vector, a fifth sub-feature vector and a sixth sub-feature vector, respectively;
calculate a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
weight the sixth sub-feature vector with the second spatial correlation matrix to obtain a second spatially correlated feature vector;
determine a second spatio-temporal feature vector according to the second spatially correlated feature vector and the fourth feature vector;
reshape the second spatio-temporal feature vector into the second spatio-temporal feature map with spatial correlation.
Optionally, the first processing module 630 is further configured to:
input the first image frame and the second image frame into the pre-built depth estimation network, obtain the first scene depth of the first image frame and the first encoder feature map of the first image frame according to the first image frame and the first depth weight, and obtain the second scene depth of the second image frame and the second encoder feature map of the second image frame according to the second image frame and the first depth weight;
input the first image frame and the second image frame into the pre-built camera motion network, and obtain the first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the first motion weight;
calculate the first error with the error calculation module based on the first image frame, the second image frame, the first scene depth, the second scene depth, the first encoder feature map, the second encoder feature map and the first relative pose.
Optionally, the second processing module 650 is further configured to:
input the first image frame and the second image frame into the pre-built depth estimation network, obtain the third scene depth of the first image frame and the third encoder feature map of the first image frame according to the first image frame and the second depth weight, and obtain the fourth scene depth of the second image frame and the fourth encoder feature map of the second image frame according to the second image frame and the second depth weight;
input the first image frame and the second image frame into the pre-built camera motion network, and obtain the second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
calculate the second error with the error calculation module based on the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder feature map, the fourth encoder feature map and the second relative pose.
Optionally, the determination module 660 is further configured to:
if the first error is greater than the second error, take the second scene depth as the scene depth of the second image frame and take the first relative pose as the relative pose between the first image frame and the second image frame;
if the first error is less than or equal to the second error, take the fourth scene depth as the scene depth of the second image frame and take the second relative pose as the relative pose between the first image frame and the second image frame.
Optionally, the total error includes the first error and the second error, and the total error is determined according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
Optionally, the first processing module 630 or the second processing module 650 is further configured to:
obtain the first image coordinates of the first image frame and the second image coordinates of the second image frame;
determine the first world coordinates of the first image frame according to the first image coordinates, the camera intrinsics and the first scene depth;
determine the second world coordinates of the second image frame according to the second image coordinates, the camera intrinsics and the second scene depth;
affine-transform the first world coordinates of the first image frame to the plane of the second image frame, and determine the affine-transformed third world coordinates;
affine-transform the second world coordinates of the second image frame to the plane of the first image frame, and determine the affine-transformed fourth world coordinates;
project the third world coordinates and the fourth world coordinates onto the two-dimensional plane respectively, to obtain the first affine-transformed scene depth and the second affine-transformed scene depth as well as the corresponding first affine-transformed image coordinates and second affine-transformed image coordinates;
determine the scene depth structure consistency error, the first depth structure inconsistency weight and the second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determine the first camera-flow consistency occlusion mask and the second camera-flow consistency occlusion mask according to the first image coordinates of the first image frame, the second affine-transformed image coordinates, the second image coordinates of the second image frame and the first affine-transformed image coordinates;
determine the image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera-flow consistency occlusion mask and the second camera-flow consistency occlusion mask;
determine the feature perception loss error according to the first image frame, the second image frame, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
determine the smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
determine the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
The scene depth reasoning apparatus based on historical information provided in this embodiment can execute the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 7, a schematic structural diagram of an electronic device 300 suitable for implementing the embodiments of the present application is shown.
As shown in FIG. 7, the electronic device 300 includes a central processing unit (CPU) 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage section 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the device 300. The CPU 301, the ROM 302 and the RAM 303 are connected to one another through a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
The following components are connected to the I/O interface 305: an input section 306 including a keyboard, a mouse and the like; an output section 307 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a speaker; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card or a modem. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 310 as needed, so that a computer program read therefrom is installed into the storage section 308 as needed.
In particular, according to an embodiment of the present disclosure, the process described above with reference to FIG. 1 may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the above scene depth reasoning method based on historical information. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 309 and/or installed from the removable medium 311.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or by hardware. The described units or modules may also be provided in a processor, and in some cases the names of these units or modules do not constitute a limitation on the units or modules themselves.
The systems, apparatuses, modules or units described in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a notebook computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
As another aspect, the present application also provides a storage medium, which may be the storage medium contained in the apparatus of the above embodiments, or a storage medium that exists independently and is not assembled into the device. The storage medium stores one or more programs, and the programs are used by one or more processors to execute the scene depth reasoning method based on historical information described in the present application.
The storage medium includes permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should be noted that the terms "include", "comprise" or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a progressive manner; the same or similar parts of the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively simple, and reference may be made to the relevant description of the method embodiment.

Claims (10)

  1. A method for scene depth reasoning based on historical information, characterized in that the method comprises:
    acquiring a first image frame and a second image frame of an image to be tested, the first image frame being the image frame at the moment before the second image frame;
    acquiring a first depth weight of a pre-built depth estimation network and a first motion weight of a pre-built camera motion network;
    calculating a first error with an error calculation module based on the first image frame, the second image frame, the first depth weight and the first motion weight;
    jointly updating the first depth weight of the depth estimation network and the first motion weight of the camera motion network using the first error as a guiding signal, to obtain a second depth weight and a second motion weight;
    calculating a second error with the error calculation module based on the first image frame, the second image frame, the second depth weight and the second motion weight;
    determining the scene depth of the second image frame and the relative pose between the first image frame and the second image frame according to the first error and the second error.
  2. The method according to claim 1, characterized in that the camera motion network comprises an encoder, a temporal attention module and a spatio-temporal correlation module;
    the encoder is used to extract features of the stacked image frame to obtain a stacked feature map, the stacked image frame being obtained by stacking the first image frame and the second image frame along the channel dimension;
    the temporal attention module is used to establish a global dependency between the information of the historical memory unit and the information of the current input unit, to inject the globally relevant information of the historical memory unit into the current input unit through the update unit, and at the same time to store the globally relevant information of the current input unit into the historical memory unit as the historical memory unit at the next moment; the current input unit comprises the stacked feature map, the stacked feature map is updated to an updated feature map by the update unit, the historical memory unit comprises a first memory feature map and a first temporal feature map, and the historical memory unit at the next moment comprises a second memory feature map and a second temporal feature map;
    the spatio-temporal correlation module is used to model the updated feature map and the second memory feature map into first and second spatio-temporal feature maps with spatial correlation, respectively.
  3. The method according to claim 2, characterized in that establishing a global dependency between the information of the historical memory unit and the information of the current input unit, injecting the globally relevant information of the historical memory unit into the current input unit through the update unit, and at the same time storing the globally relevant information of the current input unit into the historical memory unit as the historical memory unit at the next moment comprises:
    injecting the feature information of the stacked feature map and the feature information of the first memory feature map into the first temporal feature map to obtain a third temporal feature map;
    determining a temporal attention feature vector according to the third temporal feature map;
    determining a first feature vector according to the stacked feature map;
    determining a second feature vector according to the first memory feature map;
    determining a temporal-attention-based input feature vector according to the first feature vector and the temporal attention feature vector;
    determining a temporal-attention-based memory feature vector according to the second feature vector and the temporal attention feature vector;
    reshaping the temporal-attention-based input feature vector and the temporal-attention-based memory feature vector into a corresponding first feature map and second feature map, respectively;
    determining the updated feature map and the second memory feature map according to the first feature map and the second feature map;
    updating the third temporal feature map to the second temporal feature map according to the updated feature map and the second memory feature map.
  4. The method according to claim 2, wherein modeling the updated feature map into the first spatio-temporal feature map with spatial correlation comprises:
    slicing the updated feature map along the channel dimension to obtain a first sub-feature map, a second sub-feature map and a third sub-feature map;
    reshaping the updated feature map, the first sub-feature map, the second sub-feature map and the third sub-feature map into a third feature vector, a first sub-feature vector, a second sub-feature vector and a third sub-feature vector, respectively;
    calculating a first spatial correlation matrix between the first sub-feature map and the second sub-feature map according to the first sub-feature vector and the second sub-feature vector;
    weighting the third sub-feature vector with the first spatial correlation matrix to obtain a first spatially correlated feature vector;
    determining a first spatio-temporal feature vector according to the first spatially correlated feature vector and the third feature vector;
    reshaping the first spatio-temporal feature vector into the first spatio-temporal feature map with spatial correlation;
    and wherein modeling the second memory feature map into the second spatio-temporal feature map with spatial correlation comprises:
    slicing the second memory feature map along the channel dimension to obtain a fourth sub-feature map, a fifth sub-feature map and a sixth sub-feature map;
    reshaping the second memory feature map, the fourth sub-feature map, the fifth sub-feature map and the sixth sub-feature map into a fourth feature vector, a fourth sub-feature vector, a fifth sub-feature vector and a sixth sub-feature vector, respectively;
    calculating a second spatial correlation matrix between the fourth sub-feature map and the fifth sub-feature map according to the fourth sub-feature vector and the fifth sub-feature vector;
    weighting the sixth sub-feature vector with the second spatial correlation matrix to obtain a second spatially correlated feature vector;
    determining a second spatio-temporal feature vector according to the second spatially correlated feature vector and the fourth feature vector;
    reshaping the second spatio-temporal feature vector into the second spatio-temporal feature map with spatial correlation.
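Claim 4 slices a feature map along the channel dimension into three sub-feature maps and uses two of them to weight the third, which resembles the familiar query/key/value pattern of self-attention. A minimal sketch under that assumption follows; the even channel split, the softmax scaling, and the residual combination with the full feature map are assumptions rather than the claimed formulation.

```python
import torch

def spatial_correlation_sketch(feat):
    """Assumed reading of claim 4: slice the feature map into query/key/value
    sub-feature maps along channels, compute a spatial correlation matrix, use
    it to weight the value, and combine the result with the full feature map."""
    b, c, h, w = feat.shape
    assert c % 3 == 0, "sketch assumes the channel count splits evenly into three"
    q_map, k_map, v_map = torch.chunk(feat, 3, dim=1)        # first / second / third sub-feature maps
    q = q_map.flatten(2).transpose(1, 2)                     # (B, HW, C/3) first sub-feature vector
    k = k_map.flatten(2)                                     # (B, C/3, HW) second sub-feature vector
    v = v_map.flatten(2).transpose(1, 2)                     # (B, HW, C/3) third sub-feature vector
    corr = torch.softmax(q @ k / (c // 3) ** 0.5, dim=-1)    # spatial correlation matrix (B, HW, HW)
    weighted = (corr @ v).transpose(1, 2).reshape(b, c // 3, h, w)   # spatially correlated features
    # combining with the full feature vector via a residual is an assumption
    return feat + torch.cat([weighted, weighted, weighted], dim=1)   # spatio-temporal feature map

# usage on an updated feature map with an assumed channel count
spatio_temporal = spatial_correlation_sketch(torch.rand(1, 66, 32, 104))
```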
  5. The method according to claim 1, wherein
    calculating the first error by means of the error calculation module from the first image frame, the second image frame, the first depth weight and the first motion weight comprises:
    inputting the first image frame and the second image frame into the pre-built depth estimation network, obtaining a first scene depth of the first image frame and a first encoder feature map of the first image frame according to the first image frame and the first depth weight, and obtaining a second scene depth of the second image frame and a second encoder feature map of the second image frame according to the second image frame and the first depth weight;
    inputting the first image frame and the second image frame into the pre-built camera motion network, and obtaining a first relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the first motion weight;
    calculating the first error by means of the error calculation module from the first image frame, the second image frame, the first scene depth, the second scene depth, the first encoder feature map, the second encoder feature map and the first relative pose;
    and wherein calculating the second error by means of the error calculation module from the first image frame, the second image frame, the second depth weight and the second motion weight comprises:
    inputting the first image frame and the second image frame into the pre-built depth estimation network, obtaining a third scene depth of the first image frame and a third encoder feature map of the first image frame according to the first image frame and the second depth weight, and obtaining a fourth scene depth of the second image frame and a fourth encoder feature map of the second image frame according to the second image frame and the second depth weight;
    inputting the first image frame and the second image frame into the pre-built camera motion network, and obtaining a second relative pose between the first image frame and the second image frame according to the first image frame, the second image frame and the second motion weight;
    calculating the second error by means of the error calculation module from the first image frame, the second image frame, the third scene depth, the fourth scene depth, the third encoder feature map, the fourth encoder feature map and the second relative pose.
  6. The method according to claim 5, wherein determining, according to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame comprises:
    if the first error is greater than the second error, taking the second scene depth as the scene depth of the second image frame and taking the first relative pose as the relative pose between the first image frame and the second image frame;
    if the first error is less than or equal to the second error, taking the fourth scene depth as the scene depth of the second image frame and taking the second relative pose as the relative pose between the first image frame and the second image frame.
  7. The method according to claim 5 or 6, wherein a total error comprises the first error and the second error;
    the total error is determined according to an image synthesis error, a scene depth structure consistency error, a feature perception loss error and a smoothing loss error.
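Claims 7 and 8 list the components of the total error but do not recite how they are combined. One common, non-limiting reading is a weighted sum; the weighting coefficients λ below are assumptions, since none are given in the claims:

$$E_{\text{total}} \;=\; \lambda_{\text{syn}}\,E_{\text{syn}} \;+\; \lambda_{\text{geo}}\,E_{\text{geo}} \;+\; \lambda_{\text{feat}}\,E_{\text{feat}} \;+\; \lambda_{\text{smooth}}\,E_{\text{smooth}},$$

where $E_{\text{syn}}$ denotes the image synthesis error, $E_{\text{geo}}$ the scene depth structure consistency error, $E_{\text{feat}}$ the feature perception loss error, and $E_{\text{smooth}}$ the smoothing loss error.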
  8. The method according to claim 7, wherein determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error comprises:
    acquiring first image coordinates of the first image frame and second image coordinates of the second image frame;
    determining first world coordinates of the first image frame according to the first image coordinates, camera intrinsic parameters and the first scene depth;
    determining second world coordinates of the second image frame according to the second image coordinates, the camera intrinsic parameters and the second scene depth;
    affine-transforming the first world coordinates of the first image frame onto the image plane of the second image frame to determine affine-transformed third world coordinates;
    affine-transforming the second world coordinates of the second image frame onto the image plane of the first image frame to determine affine-transformed fourth world coordinates;
    projecting the third world coordinates and the fourth world coordinates onto a two-dimensional plane, respectively, to obtain a first affine-transformed scene depth and a second affine-transformed scene depth together with corresponding first affine-transformed image coordinates and second affine-transformed image coordinates;
    determining the scene depth structure consistency error, a first depth structure inconsistency weight and a second depth structure inconsistency weight according to the first scene depth, the second scene depth, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
    determining a first camera-flow-consistency occlusion mask and a second camera-flow-consistency occlusion mask according to the first image coordinates of the first image frame, the second affine-transformed image coordinates, the second image coordinates of the second image frame and the first affine-transformed image coordinates;
    determining the image synthesis error according to the first depth structure inconsistency weight, the second depth structure inconsistency weight, the first camera-flow-consistency occlusion mask and the second camera-flow-consistency occlusion mask;
    determining the feature perception loss error according to the first image frame, the second image frame, the first affine-transformed image coordinates and the second affine-transformed image coordinates;
    determining the smoothing loss error according to the first scene depth, the second scene depth, the first image frame and the second image frame;
    determining the total error according to the image synthesis error, the scene depth structure consistency error, the feature perception loss error and the smoothing loss error.
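The coordinate pipeline in claim 8 (image coordinates lifted to world coordinates via the intrinsics and the scene depth, affine-transformed to the other image plane, then projected back to 2D) is standard differentiable warping. The single-direction sketch below only illustrates that geometry with assumed tensor shapes; it omits the occlusion masks, inconsistency weights, and the individual loss terms of the claim.

```python
import torch

def backproject_and_warp(depth, K, T):
    """Sketch of claim 8's geometry for one direction: lift pixel coordinates
    to world (camera-frame) coordinates using the intrinsics K and the scene
    depth, apply the relative pose T (a 4x4 affine transform), and project
    back onto the other image plane."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(3, -1)           # homogeneous image coordinates
    cam = torch.inverse(K) @ pix                                      # viewing rays
    cam = cam.unsqueeze(0) * depth.reshape(b, 1, -1)                  # world coordinates
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)          # homogeneous 3D points
    warped = (T @ cam_h)[:, :3]                                       # affine-transformed world coordinates
    proj = K.unsqueeze(0) @ warped                                    # back to the other image plane
    warped_depth = proj[:, 2:3].clamp(min=1e-6)                       # affine-transformed scene depth
    warped_coords = proj[:, :2] / warped_depth                        # affine-transformed image coordinates
    return warped_coords.reshape(b, 2, h, w), warped_depth.reshape(b, 1, h, w)

# usage with assumed shapes: depth (1,1,H,W), intrinsics (3,3), pose (1,4,4)
coords, depth_w = backproject_and_warp(torch.rand(1, 1, 4, 5) + 0.5,
                                       torch.eye(3), torch.eye(4).unsqueeze(0))
```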
  9. A historical information-based scene depth reasoning apparatus, wherein the apparatus comprises:
    a first acquisition module, configured to acquire a first image frame and a second image frame of an image to be tested, the first image frame being the image frame at the moment preceding the second image frame;
    a second acquisition module, configured to acquire a first depth weight of a pre-built depth estimation network and a first motion weight of a pre-built camera motion network;
    a first processing module, configured to calculate a first error by means of an error calculation module from the first image frame, the second image frame, the first depth weight and the first motion weight;
    an update module, configured to jointly update the first depth weight of the depth estimation network and the first motion weight of the camera motion network, using the first error as a guide signal, to obtain a second depth weight and a second motion weight;
    a second processing module, configured to calculate a second error by means of the error calculation module from the first image frame, the second image frame, the second depth weight and the second motion weight;
    a determination module, configured to determine, according to the first error and the second error, the scene depth of the second image frame and the relative pose between the first image frame and the second image frame.
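Read together with claims 5 and 6, the apparatus above describes a small test-time adaptation loop. The sketch below is a hypothetical end-to-end use of those modules: `depth_net`, `pose_net`, and `error_module` are placeholders for the pre-built depth estimation network, camera motion network, and error calculation module, and the optimizer, learning rate, and single update step are assumptions, since the claims only require that the first error guide a joint update of the two sets of weights.

```python
import copy
import torch

def forward_once(depth_net, pose_net, error_module, frame1, frame2):
    """One evaluation pass as in claim 5: per-frame scene depths and encoder
    feature maps from the depth network, a relative pose from the motion
    network, and the resulting error from the error calculation module."""
    depth1, feat1 = depth_net(frame1)
    depth2, feat2 = depth_net(frame2)
    pose = pose_net(frame1, frame2)
    err = error_module(frame1, frame2, depth1, depth2, feat1, feat2, pose)
    return depth2, pose, err

def infer_pair(depth_net, pose_net, error_module, frame1, frame2, lr=1e-4):
    # first error with the first depth weight and first motion weight
    depth2_a, pose_a, err1 = forward_once(depth_net, pose_net, error_module, frame1, frame2)

    # joint update of both networks guided by the first error -> second weights
    adapted_depth = copy.deepcopy(depth_net)
    adapted_pose = copy.deepcopy(pose_net)
    opt = torch.optim.Adam(list(adapted_depth.parameters()) + list(adapted_pose.parameters()), lr=lr)
    opt.zero_grad()
    _, _, guide = forward_once(adapted_depth, adapted_pose, error_module, frame1, frame2)
    guide.backward()
    opt.step()

    # second error with the second depth weight and second motion weight
    depth2_b, pose_b, err2 = forward_once(adapted_depth, adapted_pose, error_module, frame1, frame2)

    # selection as recited in claim 6
    if err1 > err2:
        return depth2_a, pose_a        # second scene depth, first relative pose
    return depth2_b, pose_b            # fourth scene depth, second relative pose
```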
  10. An electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the historical information-based scene depth reasoning method according to any one of claims 1 to 8.
PCT/CN2022/076348 2022-02-15 2022-02-15 Historical information-based scene depth reasoning method and apparatus, and electronic device WO2023155043A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/076348 WO2023155043A1 (en) 2022-02-15 2022-02-15 Historical information-based scene depth reasoning method and apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/076348 WO2023155043A1 (en) 2022-02-15 2022-02-15 Historical information-based scene depth reasoning method and apparatus, and electronic device

Publications (1)

Publication Number Publication Date
WO2023155043A1 true WO2023155043A1 (en) 2023-08-24

Family

ID=87577270

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/076348 WO2023155043A1 (en) 2022-02-15 2022-02-15 Historical information-based scene depth reasoning method and apparatus, and electronic device

Country Status (1)

Country Link
WO (1) WO2023155043A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200258249A1 (en) * 2017-11-15 2020-08-13 Google Llc Unsupervised learning of image depth and ego-motion prediction neural networks
CN108898630A (en) * 2018-06-27 2018-11-27 清华-伯克利深圳学院筹备办公室 A kind of three-dimensional rebuilding method, device, equipment and storage medium
US20200160546A1 (en) * 2018-11-16 2020-05-21 Nvidia Corporation Estimating depth for a video stream captured with a monocular rgb camera
CN111540000A (en) * 2020-04-28 2020-08-14 深圳市商汤科技有限公司 Scene depth and camera motion prediction method and device, electronic device and medium
CN113822918A (en) * 2020-04-28 2021-12-21 深圳市商汤科技有限公司 Scene depth and camera motion prediction method and device, electronic device and medium
CN111311685A (en) * 2020-05-12 2020-06-19 中国人民解放军国防科技大学 Motion scene reconstruction unsupervised method based on IMU/monocular image
CN112381868A (en) * 2020-11-13 2021-02-19 北京地平线信息技术有限公司 Image depth estimation method and device, readable storage medium and electronic equipment
CN113570658A (en) * 2021-06-10 2021-10-29 西安电子科技大学 Monocular video depth estimation method based on depth convolutional network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
UMMENHOFER BENJAMIN; ZHOU HUIZHONG; UHRIG JONAS; MAYER NIKOLAUS; ILG EDDY; DOSOVITSKIY ALEXEY; BROX THOMAS: "DeMoN: Depth and Motion Network for Learning Monocular Stereo", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE COMPUTER SOCIETY, US, 21 July 2017 (2017-07-21), US , pages 5622 - 5631, XP033249922, ISSN: 1063-6919, DOI: 10.1109/CVPR.2017.596 *
徐豪飞 (XU, HAOFEI): "基于图像的高效高精度深度恢复 (Efficient and Accurate Depth Estimation from Images)", CHINESE MASTER'S THESES FULL-TEXT DATABASE (INFORMATION SCIENCE AND TECHNOLOGY), no. 6, 15 June 2021 (2021-06-15) *
路昊 (LU, HAO): "基于深度学习的动态场景相机姿态估计方法 (A Method of Estimating Camera Pose in Dynamic Scene Based on DNN)", 高技术通讯 (HIGH TECHNOLOGY LETTERS), vol. 30, no. 1, 15 January 2020 (2020-01-15) *

Similar Documents

Publication Publication Date Title
CN109754417B (en) System and method for unsupervised learning of geometry from images
EP3698323B1 (en) Depth from motion for augmented reality for handheld user devices
TWI766175B (en) Method, device and apparatus for monocular image depth estimation, program and storage medium thereof
CN110782490B (en) Video depth map estimation method and device with space-time consistency
WO2019149206A1 (en) Depth estimation method and apparatus, electronic device, program, and medium
US20200273190A1 (en) Method for 3d scene dense reconstruction based on monocular visual slam
US20090067728A1 (en) Image matching method and image interpolation method using the same
CN111951372B (en) Three-dimensional face model generation method and equipment
US8098963B2 (en) Resolution conversion apparatus, method and program
KR101266362B1 (en) System and method of camera tracking and live video compositing system using the same
US20230401672A1 (en) Video processing method and apparatus, computer device, and storage medium
CN111798485B (en) Event camera optical flow estimation method and system enhanced by IMU
CN112639878A (en) Unsupervised depth prediction neural network
Jiao et al. Comparing representations in tracking for event camera-based slam
US20200134389A1 (en) Rolling shutter rectification in images/videos using convolutional neural networks with applications to sfm/slam with rolling shutter images/videos
US20130135430A1 (en) Method for adjusting moving depths of video
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
Guan et al. PoseGU: 3D human pose estimation with novel human pose generator and unbiased learning
US20200202563A1 (en) 3d image reconstruction processing apparatus, 3d image reconstruction processing method and computer-readable storage medium storing 3d image reconstruction processing program
WO2023155043A1 (en) Historical information-based scene depth reasoning method and apparatus, and electronic device
CN111460741A (en) Fluid simulation method based on data driving
Goldenstein et al. 3D facial tracking from corrupted movie sequences
CN112766120B (en) Three-dimensional human body posture estimation method and system based on depth point cloud
Šilar et al. Comparison of two optical flow estimation methods using Matlab
Nomura Spatio-temporal optimization method for determining motion vector fields under non-stationary illumination

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926381

Country of ref document: EP

Kind code of ref document: A1