WO2022206020A1 - Method and apparatus for estimating depth of field of image, and terminal device and storage medium - Google Patents
- Publication number
- WO2022206020A1 (PCT application No. PCT/CN2021/137609)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- frame image
- coordinate
- scene
- target frame
- Prior art date
Classifications
- G06T7/50 — Image analysis; depth or shape recovery
- G06N3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06T7/85 — Analysis of captured images to determine intrinsic or extrinsic camera parameters (camera calibration); stereo camera calibration
Description
- the present application relates to the technical field of image processing, and in particular, to a method, apparatus, terminal device and storage medium for estimating the depth of an image scene.
- Scene depth estimation from images is an important research direction in the fields of robot navigation and autonomous driving.
- in practice, deep neural networks are usually used to predict the scene depth of images.
- a large amount of sample data is required when training the deep neural network, resulting in high data acquisition costs.
- the embodiments of the present application provide a method, apparatus, terminal device, and storage medium for estimating the depth of an image scene, which can reduce the cost of sample data collection.
- a first aspect of the embodiments of the present application provides a method for estimating the depth of an image scene, including:
- the parameters of the depth estimation network are updated in the following ways:
- the sample image sequence includes a target frame image and a reference frame image
- the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence
- the parameters of the depth estimation network are updated according to the objective function.
- the depth estimation network adopted in the embodiments of the present application is combined with a camera pose estimation network to predict the camera pose vector of the input sample image sequence, where the sample image sequence includes the target frame image and the reference frame image; then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the internal parameters of the corresponding camera; next, the corresponding loss function is calculated according to the target frame image and the reconstructed image; finally, an objective function is constructed based on the loss function and the parameters of the depth estimation network are updated based on the objective function.
- in this way, the potential image information contained in the target frame image and the reference frame image can be fully exploited; that is, enough image information can be obtained from fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data collection.
- after acquiring the sample image sequence, the method may further include:
- the constructing an objective function based on the first image reconstruction loss includes:
- the objective function is constructed based on the bidirectional image reconstruction loss.
- the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
- based on the reference frame image and taking the third coordinate as grid points, an image of the reference frame image after affine transformation is reconstructed through a bilinear sampling mechanism, and the reconstructed image is determined as the first reconstructed image.
- the generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
- based on the target frame image and taking the sixth coordinate as grid points, an image of the target frame image after affine transformation is reconstructed through a bilinear sampling mechanism, and the reconstructed image is determined as the second reconstructed image.
- the method may further include:
- a forward flow occlusion mask is calculated according to the first forward flow coordinate and the second forward flow coordinate, and the forward flow occlusion mask is used to measure the degree of matching between the first forward flow coordinate and the second forward flow coordinate;
- a backward flow occlusion mask is calculated according to the first backward flow coordinate and the second backward flow coordinate, and the backward flow occlusion mask is used to measure the degree of matching between the first backward flow coordinate and the second backward flow coordinate;
- the calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss includes:
- the bidirectional image reconstruction loss is calculated from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
- the occlusion mask is used to determine whether there are occluding objects in consecutive video frames. Adding the occlusion mask to the calculation of the loss of bidirectional image reconstruction can improve the depth estimation accuracy of the depth estimation network for images with occluded objects.
- the method may further include:
- the forward scene structure consistency loss is calculated according to the first scene depth value and the fifth scene depth value, and the forward scene structure consistency loss is used to measure the difference between the scene depth value of the target frame image calculated by multi-view geometric transformation and the scene depth value of the reconstructed target frame image;
- the backward scene structure consistency loss is calculated according to the second scene depth value and the sixth scene depth value, and the backward scene structure consistency loss is used to measure the difference between the scene depth value of the reference frame image calculated by multi-view geometric transformation and the scene depth value of the reconstructed reference frame image;
- the constructing the objective function based on the bidirectional image reconstruction loss may include:
- the objective function is constructed and obtained based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
- the depth estimation network includes an encoding network
- the method may further include:
- the bidirectional feature perception loss is calculated, and the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image.
- the constructing the objective function based on the bidirectional image reconstruction loss includes:
- the objective function is constructed based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
- the weak texture scene in the image to be tested can be effectively processed, thereby improving the accuracy of scene depth estimation.
- the method may further include:
- the smoothing loss is calculated, and the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained by the depth estimation network;
- the constructing of the objective function includes:
- the objective function is constructed based on the bidirectional image reconstruction loss, the bidirectional feature perception loss, and the smoothing loss.
- the gradients of scene depth images and feature images obtained by the depth estimation network can be regularized by introducing a smoothing loss into the objective function.
- a second aspect of the embodiments of the present application provides an apparatus for estimating the depth of an image scene, including:
- an image acquisition module, configured to acquire the image to be tested
- a scene depth estimation module configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested
- a sample acquisition module, configured to acquire a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence;
- a first scene depth prediction module configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image
- a camera pose estimation module configured to input the sample image sequence into a pre-built camera pose estimation network to obtain the predicted camera pose vector between the target frame image and the reference frame image;
- a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence;
- a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image;
- an objective function building module for constructing an objective function based on the first image reconstruction loss
- a network parameter updating module configured to update the parameters of the depth estimation network according to the objective function.
- a third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the method for estimating the depth of an image scene provided by the first aspect of the embodiments of the present application is implemented.
- a fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method for estimating the depth of an image scene provided by the first aspect of the embodiments of the present application is implemented.
- a fifth aspect of the embodiments of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the method for estimating the depth of an image scene described in the first aspect of the embodiments of the present application.
- FIG. 1 is a flowchart of an embodiment of a method for estimating the depth of an image scene provided by an embodiment of the present application
- FIG. 2 is a schematic flowchart of optimizing and updating depth estimation network parameters provided by an embodiment of the present application;
- FIG. 3 is a schematic structural diagram of a depth estimation network provided by an embodiment of the present application.
- FIG. 4 is a schematic structural diagram of a residual module in the network structure shown in FIG. 3;
- FIG. 5 is a schematic structural diagram of a camera pose estimation network according to an embodiment of the present application.
- FIG. 6 is a comparison diagram of the results of the depth prediction of a monocular image performed by the method for estimating the depth of an image scene provided by an embodiment of the present application and various algorithms in the prior art;
- FIG. 7 is a structural diagram of an embodiment of an apparatus for estimating the depth of an image scene provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
- the present application proposes a method, device, terminal device and storage medium for estimating the depth of an image scene, which can reduce the cost of sample data collection. It should be understood that the execution body of each method embodiment of the present application is various types of terminal devices or servers, such as mobile phones, tablet computers, notebook computers, desktop computers, and wearable devices.
- as shown in FIG. 1, a method for estimating the depth of an image scene provided by an embodiment of the present application includes:
- an image to be tested is acquired, and the image to be tested is any image whose depth of the scene needs to be predicted.
- the image to be tested is input into a pre-built depth estimation network to obtain a scene depth image of the image to be tested, thereby obtaining a scene depth estimation result of the image to be tested.
- the depth estimation network may be a neural network with an encoder-decoder architecture, and the present application does not make any limitations on the type and network structure of the neural network used in the depth estimation network.
- FIG. 2 shows a schematic flowchart of optimizing and updating depth estimation network parameters provided by an embodiment of the present application, including the following steps:
- the sample image sequence includes a target frame image and a reference frame image
- the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence
- the training set data can be obtained first, and preprocessing operations such as random flipping, random cropping, and data normalization can be performed on the training set data to convert it into tensor data of specified dimensions.
- the training set data may be composed of a large number of sample image sequences, wherein each sample image sequence includes a target frame image and a reference frame image, and the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence.
- for example, the sample image sequence can be a video clip containing 5 consecutive video frames, denoted I_0, I_1, I_2, I_3, I_4; then I_2 can be used as the target frame image, and I_0, I_1, I_3 and I_4 can be used as the corresponding reference frame images, as in the sketch below.
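- The following is a minimal, illustrative sketch of this sample construction step (not taken from the patent): it splits a 5-frame clip into the target frame and its reference frames and converts them to tensors of a specified dimension; the function name and the normalization used are assumptions.

```python
# Illustrative sketch: split a 5-frame clip into target and reference frames
# and convert them into tensors of a specified dimension (e.g. 3*256*832).
import numpy as np
import torch

def make_training_sample(frames):
    """frames: list of 5 consecutive RGB frames as HxWx3 uint8 arrays."""
    assert len(frames) == 5
    tgt = frames[2]                                       # I_2 is the target frame image
    refs = [frames[0], frames[1], frames[3], frames[4]]   # I_0, I_1, I_3, I_4 are reference frames

    def to_tensor(img):
        # Normalize to [0, 1] and reorder to C*H*W.
        t = torch.from_numpy(img).float() / 255.0
        return t.permute(2, 0, 1)

    return to_tensor(tgt), torch.stack([to_tensor(r) for r in refs])

# Example usage with random data standing in for real video frames:
frames = [np.random.randint(0, 256, (256, 832, 3), dtype=np.uint8) for _ in range(5)]
target, references = make_training_sample(frames)
print(target.shape, references.shape)  # torch.Size([3, 256, 832]) torch.Size([4, 3, 256, 832])
```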
- the target frame image in the sample image sequence is input to the depth estimation network to obtain the predicted first scene depth image, that is, the scene depth image corresponding to the target frame image.
- a schematic structural diagram of the depth estimation network is shown in FIG. 3, which includes an encoder part and a decoder part.
- the encoder part is used to extract abstract features of the input image data by layer-by-layer downsampling. Assuming that the target frame image is preprocessed into tensor data with a dimension of 3*256*832, after the first layer of the encoder applies convolution, normalization and activation function processing, a feature image with a dimension of 64*128*416 is obtained, completing the first downsampling.
- the feature image is processed by maximum pooling and multiple residual modules to obtain a feature image with a dimension of 256*64*208, and the second downsampling process is completed.
- a feature image with a dimension of 2048*8*26 is obtained.
- the decoder part is used to process the feature image obtained by the encoder by layer-by-layer upsampling. Specifically, a convolutional layer with a 3*3 convolution kernel, a nonlinear ELU processing layer and a nearest-neighbor upsampling layer can be used.
- the feature image obtained by the encoder is processed to obtain a feature image with a dimension of 512*16*52.
- the 512*16*52 feature image and the 1024*16*52 feature image obtained by the encoder are spliced in the channel dimension to obtain a feature image with a dimension of 1536*16*52, At this point, the processing of the first upsampling is completed.
- a feature image with a dimension of 32*256*832 is finally obtained.
- x represents the depth image obtained after mapping by the Sigmoid function
- F(x) represents the final scene depth image obtained.
- the value range of each pixel will be mapped between 0 and 1.
- a schematic structural diagram of the residual module in the network structure shown in FIG. 3 is shown in FIG. 4.
- the input is divided into two branches: one branch is processed in turn by the convolution layers, the batch normalization layer (BN) and the ReLU function, and is then superimposed with the other branch to obtain the output data of the residual module, as in the sketch below.
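- A minimal sketch of such a residual module follows; the layer sizes and the use of two convolutions per branch are assumptions for illustration, not taken from FIG. 4.

```python
# Illustrative residual module: one branch applies convolution, BN and ReLU in
# turn and is superimposed with the identity (shortcut) branch.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Superimpose the processed branch with the shortcut branch.
        return self.relu(self.branch(x) + x)

block = ResidualBlock(64)
out = block(torch.randn(1, 64, 128, 416))  # same shape in and out
```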
- the technical means of shortcut connection is adopted; that is, the feature image extracted by the encoder directly skips the intermediate convolution layers and is spliced in the channel dimension with the feature image of the same resolution obtained by the decoder.
- a fixed-size convolution kernel is used to continuously extract image features through a sliding window.
- the shallow network can only extract local features of the image.
- for the decoder part, it directly decodes the feature image finally output by the encoder, and performs multiple upsampling operations on the deep features to obtain deep feature images with different resolutions.
- for example, a feature image with a dimension of 512*16*52 is obtained.
- meanwhile, the feature image with a dimension of 1024*16*52 extracted by the encoder directly skips the corresponding convolutional layers and is fused with the 512*16*52 feature image obtained by the decoder.
- the feature images of each resolution extracted by the encoder are fused with the feature images obtained by the corresponding decoder to realize the fusion of local image features and depth feature information.
- a camera pose estimation network is also pre-built in this embodiment of the present application.
- a schematic structural diagram of this network is shown in FIG. 5, which includes a number of convolutional layers with different parameters. Specifically, it is assumed that the input sample image sequence is I_0, I_1, I_2, I_3, I_4, a total of 5 frames of images.
- these 5 frames of images are preprocessed into tensor data of a specified dimension, which is used as the input of the camera pose estimation network; the camera pose estimation network uses multiple convolutional layers with specified strides to extract image features and perform downsampling, sequentially obtaining the corresponding feature images.
- a 24-dimensional feature vector can then be obtained, and finally the feature vector is reshaped into a camera pose vector of size 6*N_ref, where 6 represents the six degrees of freedom of the camera pose (three rotation components and three translation components) and N_ref represents the number of reference frame images.
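- The following is an illustrative sketch of a camera pose estimation network of this kind: the 5 preprocessed frames are concatenated along the channel dimension, passed through strided convolutions, reduced to a 24-dimensional vector and reshaped to 6*N_ref (N_ref = 4 reference frames). The exact layer configuration is an assumption, not taken from FIG. 5.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self, num_frames=5, n_ref=4):
        super().__init__()
        chans = [16, 32, 64, 128, 256]
        layers, in_c = [], 3 * num_frames
        for c in chans:
            layers += [nn.Conv2d(in_c, c, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            in_c = c
        self.features = nn.Sequential(*layers)
        self.head = nn.Conv2d(in_c, 6 * n_ref, kernel_size=1)  # 6 DoF per reference frame
        self.n_ref = n_ref

    def forward(self, seq):
        # seq: B x (3*num_frames) x H x W, the concatenated preprocessed frames
        x = self.head(self.features(seq))
        x = x.mean(dim=[2, 3])                 # global average -> B x (6*n_ref)
        return x.view(-1, self.n_ref, 6)       # rotation (3) + translation (3) per reference frame

pose_net = PoseNet()
poses = pose_net(torch.randn(1, 15, 256, 832))
print(poses.shape)  # torch.Size([1, 4, 6])
```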
- the generating the first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
- the depth estimation network described above can be used to estimate the first scene depth image corresponding to I_tgt, denoted D_tgt.
- the camera pose estimation network described above estimates the camera pose between the two frames of images, and obtains a first transformation matrix T (consisting of a rotation vector and a translation vector) for converting from the target frame image I_tgt to the reference frame image I_ref.
- the coordinates (first coordinates) of the target frame image I_tgt in the world coordinate system can then be calculated. For example, suppose the image coordinates of a certain pixel in the target frame image I_tgt are given; according to the first scene depth image D_tgt, the depth of this pixel is determined as d_tgt, and the coordinates of the pixel in the world coordinate system can then be calculated from the camera internal parameter matrix and d_tgt.
- based on the first transformation matrix T, the first coordinate is transformed to obtain the second coordinate of the target frame image I_tgt after transformation in the world coordinate system. The transformation is parameterized by a rotation and a translation, where:
- (R_x, R_y, R_z, t) ∈ SE(3) denotes the 3D rotation angles and translation vector, which can be obtained from the first transformation matrix T.
- R_x, R_y and R_z represent the rotation about the x-axis, y-axis and z-axis in the world coordinate system, respectively.
- t represents the translation along the x-axis, y-axis and z-axis.
- SE(3) represents the special Euclidean group.
- T_tgt->ref represents the camera extrinsic parameter matrix composed of the rotation matrix and the translation vector.
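- The coordinate chain described above can be illustrated with the standard multi-view geometry below: image coordinates of I_tgt are back-projected into 3D with the depth D_tgt and the camera intrinsics K (first coordinate), transformed by the rigid transformation T_tgt->ref (second coordinate), and re-projected into the image plane to obtain the sampling grid (third coordinate). This is a sketch of the standard formulation, not code reproduced from the patent.

```python
import torch

def reproject(depth, K, T):
    """depth: B x 1 x H x W, K: B x 3 x 3 intrinsics, T: B x 4 x 4 rigid transform."""
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ones = torch.ones_like(u)
    pix = torch.stack([u, v, ones]).float().view(1, 3, -1).expand(B, -1, -1)  # homogeneous pixels

    cam = torch.inverse(K) @ pix * depth.view(B, 1, -1)        # first coordinate (3D points)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)    # homogeneous 3D points
    cam2 = (T @ cam_h)[:, :3]                                   # second coordinate after T_tgt->ref

    proj = K @ cam2                                             # project back to the image plane
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)             # third coordinate (pixel grid)
    return uv.view(B, 2, H, W)

# Toy example: identity pose, constant depth.
K = torch.tensor([[[100., 0., 416.], [0., 100., 128.], [0., 0., 1.]]])
uv = reproject(torch.ones(1, 1, 256, 832), K, torch.eye(4).unsqueeze(0))
print(uv.shape)  # torch.Size([1, 2, 256, 832])
```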
- the affine-transformed image of the reference frame image I_ref can then be reconstructed through the bilinear sampling mechanism, and the resulting reconstructed image is determined to be the first reconstructed image.
- the principle of the bilinear sampling mechanism may refer to the prior art, which will not be repeated here.
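- As an illustration of this sampling step, the sketch below warps the reference frame image with the third coordinate as the sampling grid, using torch.nn.functional.grid_sample as a stand-in for the bilinear sampling mechanism referred to above.

```python
import torch
import torch.nn.functional as F

def bilinear_warp(img, uv):
    """img: B x C x H x W source image; uv: B x 2 x H x W pixel coordinates."""
    B, _, H, W = img.shape
    # grid_sample expects coordinates normalized to [-1, 1] in (x, y) order.
    grid_x = 2.0 * uv[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)  # B x H x W x 2
    return F.grid_sample(img, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

# e.g. first_reconstructed = bilinear_warp(I_ref, third_coordinate)
```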
- α is a preset weight parameter; for example, it can be 0.85.
- SSIM(*) is the structural similarity measure function shown below:
- ERF(*) is the robustness error metric shown below:
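- The exact SSIM(*) and ERF(*) formulas are not reproduced in this text; the sketch below uses a standard windowed SSIM and an L1 error as stand-ins to illustrate how the first image reconstruction loss combines the two terms with the weight α = 0.85.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Windowed SSIM computed with 3x3 mean pooling.
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def image_reconstruction_loss(target, reconstructed, alpha=0.85):
    ssim_term = (1.0 - ssim(target, reconstructed)).clamp(0, 2) / 2.0
    robust_term = (target - reconstructed).abs()            # stand-in for ERF(*)
    return alpha * ssim_term + (1.0 - alpha) * robust_term  # per-pixel loss map

# loss_fwd = image_reconstruction_loss(I_tgt, first_reconstructed).mean()
```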
- an objective function may be constructed based on the first image reconstruction loss, so as to complete the parameter update of the depth estimation network.
- after acquiring the sample image sequence, the method may further include:
- the second reconstructed image corresponding to the reference frame image is generated according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence, which may include:
- the target frame image is I_tgt
- the reference frame image is I_ref
- the internal parameter matrix of the corresponding camera is K.
- the depth estimation network described above estimates the second scene depth image corresponding to I_ref as D_ref.
- the camera pose between the two frames of images is estimated by the camera pose estimation network described above to obtain a second transformation matrix T_inv for converting from the reference frame image I_ref to the target frame image I_tgt.
- the second transformation matrix T_inv is the inverse of the first transformation matrix T that converts from the target frame image I_tgt to the reference frame image I_ref.
- the coordinates (fourth coordinates) of the reference frame image I_ref in the world coordinate system can be calculated.
- the fourth coordinate is transformed based on the second transformation matrix T_inv to obtain the transformed fifth coordinate of the reference frame image I_ref in the world coordinate system, and the fifth coordinate is then converted into the sixth coordinate in the image coordinate system.
- for the specific coordinate transformation steps, reference may be made to the above-mentioned related content on calculating the first image reconstruction loss.
- the affine-transformed image of the target frame image I_tgt can then be reconstructed through the bilinear sampling mechanism, and the resulting reconstructed image is determined to be the second reconstructed image.
- the following formula can be used to calculate the second image reconstruction loss:
- the first image reconstruction loss can be defined as the forward image reconstruction loss
- the second image reconstruction loss can be defined as the backward image reconstruction loss
- the bidirectional image reconstruction loss can be constructed based on these two image reconstruction losses;
- the objective function can then be constructed based on this bidirectional image reconstruction loss.
- the method may further include:
- This process can be summarized as the check of bidirectional flow consistency, including forward flow consistency check and backward flow consistency check.
- a bilinear sampling mechanism is used to perform an affine transformation on the first backward flow coordinate to synthesize the second forward flow coordinate;
- ideally, the synthesized forward flow coordinate and the calculated forward flow coordinate are equal in magnitude and opposite in direction, which is the forward flow consistency.
- similarly, the bilinear sampling mechanism is used to perform an affine transformation on the first forward flow coordinate to synthesize the second backward flow coordinate;
- ideally, the synthesized backward flow coordinate and the calculated backward flow coordinate are equal in magnitude and opposite in direction, which is the backward flow consistency.
- the forward flow occlusion mask can be calculated according to the first forward flow coordinate and the second forward flow coordinate;
- this mask is used to measure the degree of matching between the first forward flow coordinate and the second forward flow coordinate.
- the backward flow occlusion mask can likewise be calculated according to the first backward flow coordinate and the second backward flow coordinate; this mask is used to measure the degree of matching between the first backward flow coordinate and the second backward flow coordinate.
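- The patent's exact mask formula is not reproduced in this text; the sketch below uses a common forward-backward consistency check as a stand-in. Since the two flow coordinates should be equal in magnitude and opposite in direction, their sum measures the mismatch; the thresholds alpha1/alpha2 are illustrative parameters.

```python
import torch

def flow_occlusion_mask(flow_a, flow_b, alpha1=0.01, alpha2=0.5):
    """flow_a, flow_b: B x 2 x H x W; returns a mask in [0, 1] (1 = well matched)."""
    mismatch = (flow_a + flow_b).pow(2).sum(dim=1, keepdim=True)
    magnitude = flow_a.pow(2).sum(dim=1, keepdim=True) + flow_b.pow(2).sum(dim=1, keepdim=True)
    return (mismatch < alpha1 * magnitude + alpha2).float()

# forward_mask  = flow_occlusion_mask(first_forward_flow,  second_forward_flow)
# backward_mask = flow_occlusion_mask(first_backward_flow, second_backward_flow)
```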
- calculating the bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss may include:
- the bidirectional image reconstruction loss is calculated from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
- the occlusion mask is used to determine whether there are occluding objects in consecutive video frames. Adding the occlusion mask to the calculation of the bidirectional image reconstruction loss can improve the depth estimation accuracy of the depth estimation network for images with occluded objects.
- the method can also include:
- the above steps are used to calculate the loss of bidirectional scene structure consistency.
- according to the second coordinate described above, the corresponding scene depth value can be obtained (first scene depth value); according to the fifth coordinate described above, the corresponding scene depth value can likewise be obtained (second scene depth value).
- in addition, the depth value d_tgt of the pixel point corresponding to the second coordinate in the first scene depth image (the third scene depth value) and the depth value d_ref of the pixel point corresponding to the fifth coordinate in the second scene depth image (the fourth scene depth value) can be obtained.
- the fifth scene depth value of the target frame image can be reconstructed through the bilinear sampling mechanism
- the sixth scene depth value of the reference frame image can be reconstructed through the bilinear sampling mechanism
- the forward scene structure error and the backward scene structure error can be calculated separately, which in turn imposes consistency constraints on the scene structure.
- based on these errors, the positions of moving objects and occluding objects in the image scene can be located: the larger the error value at a position, the more likely it is that there are moving objects or occluding objects at that position.
- the forward scene structure consistency loss is calculated, which is used to measure the difference between the scene depth value of the target frame image calculated by the multi-view geometric transformation and the scene depth value of the reconstructed target frame image.
- similarly, the backward scene structure consistency loss is calculated, which is used to measure the difference between the scene depth value of the reference frame image calculated by the multi-view geometric transformation and the scene depth value of the reconstructed reference frame image.
- N_ref represents the number of valid grid coordinates in the reference frame image.
- N_tgt represents the number of valid grid coordinates in the target frame image.
- the bidirectional scene structure consistency loss can be calculated as follows:
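- The patent's exact formula is not reproduced here; the sketch below uses a common normalized depth-difference form as a stand-in, comparing the depth values obtained by multi-view geometric transformation with the depth values reconstructed by bilinear sampling, in both directions.

```python
import torch

def scene_structure_error(depth_transformed, depth_reconstructed, eps=1e-6):
    """Per-pixel structure error; both inputs are B x 1 x H x W."""
    return (depth_transformed - depth_reconstructed).abs() / (
        depth_transformed + depth_reconstructed + eps)

def bidirectional_structure_loss(d1, d5, d2, d6):
    # d1/d5: target-frame depths (transformed vs. reconstructed);
    # d2/d6: reference-frame depths (transformed vs. reconstructed).
    forward_loss = scene_structure_error(d1, d5).mean()
    backward_loss = scene_structure_error(d2, d6).mean()
    return forward_loss + backward_loss
```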
- the constructing the objective function based on the bidirectional image reconstruction loss may include:
- the objective function is constructed and obtained based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
- the two occlusion masks and the two scene structure errors mentioned above can be introduced at the same time.
- the following formula can be used to calculate:
- the image reconstruction loss function can achieve the purpose of dealing with occlusion and moving objects.
- specifically, the first image reconstruction loss is weighted using the forward flow occlusion mask and the forward scene structure inconsistency weight, and the second image reconstruction loss is weighted using the backward flow occlusion mask and the backward scene structure inconsistency weight;
- the bidirectional image reconstruction loss is then constructed based on the weighted first image reconstruction loss and the weighted second image reconstruction loss, as in the sketch below.
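- The exact weighting formula is not reproduced in this text; the sketch below uses a simple per-pixel masked average, with (1 - structure error) as the inconsistency weight, to illustrate the combination described above.

```python
import torch

def weighted_bidirectional_loss(loss_fwd, loss_bwd,
                                mask_fwd, mask_bwd,
                                struct_err_fwd, struct_err_bwd, eps=1e-6):
    """All inputs are per-pixel maps of shape B x 1 x H x W."""
    w_fwd = mask_fwd * (1.0 - struct_err_fwd)   # down-weight occluded / inconsistent pixels
    w_bwd = mask_bwd * (1.0 - struct_err_bwd)
    fwd = (w_fwd * loss_fwd).sum() / (w_fwd.sum() + eps)
    bwd = (w_bwd * loss_bwd).sum() / (w_bwd.sum() + eps)
    return fwd + bwd
```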
- the depth estimation network includes an encoding network
- the method may further include:
- the bidirectional feature perception loss is calculated, and the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image.
- the above steps are used to calculate the perceptual loss of bidirectional features.
- the features extracted by the encoder have better discrimination in weak texture areas.
- the present application uses the highest-resolution feature map extracted by the encoding network to process weak texture areas; through the encoding network in the depth estimation network, the feature image f_tgt of the target frame image (first feature image) and the feature image f_ref of the reference frame image (second feature image) can be extracted.
- the feature image f_ref can be affinely transformed through a bilinear sampling mechanism to reconstruct the third feature image of the target frame image;
- the feature image f_tgt can be affinely transformed through a bilinear sampling mechanism to reconstruct the fourth feature image of the reference frame image.
- the bidirectional feature perception loss L_feat is used to measure the difference between the feature image of the target frame image obtained by the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained by the encoding network and the reconstructed feature image of the reference frame image.
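- As an illustration, the sketch below compares the encoder features with their warped (reconstructed) counterparts in both directions; an L1 difference is used as a stand-in for the patent's exact formula.

```python
import torch

def bidirectional_feature_loss(f_tgt, f_ref, f_tgt_recon, f_ref_recon):
    """All inputs are B x C x H x W feature images from the encoding network."""
    forward_term = (f_tgt - f_tgt_recon).abs().mean()   # target features vs. reconstruction
    backward_term = (f_ref - f_ref_recon).abs().mean()  # reference features vs. reconstruction
    return forward_term + backward_term
```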
- the constructing the objective function based on the bidirectional image reconstruction loss may include:
- the objective function is constructed based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
- the weak texture scene in the image to be tested can be effectively processed, thereby improving the accuracy of scene depth estimation.
- the method may further include:
- the smoothing loss is calculated, and the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained by the depth estimation network.
- the constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss may include:
- the objective function is constructed based on the bidirectional image reconstruction loss, the bidirectional feature perception loss, and the smoothing loss.
- a smoothing loss L_s can be introduced into the objective function, calculated over the scene depth images and feature images; an illustrative formulation is sketched below.
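- The sketch below is a common edge-aware first-order smoothness term used as a stand-in for L_s: it penalizes gradients of a scene depth image (or feature image) while weakening the penalty at image edges. The patent's exact formula is not reproduced here.

```python
import torch

def edge_aware_smoothness(depth_or_feat, image):
    """depth_or_feat: B x C x H x W; image: B x 3 x H x W (the corresponding frame)."""
    dx = (depth_or_feat[:, :, :, :-1] - depth_or_feat[:, :, :, 1:]).abs()
    dy = (depth_or_feat[:, :, :-1, :] - depth_or_feat[:, :, 1:, :]).abs()
    img_dx = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    img_dy = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    # Down-weight the smoothness penalty where the image itself has strong gradients.
    return (dx * torch.exp(-img_dx)).mean() + (dy * torch.exp(-img_dy)).mean()
```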
- for each reference frame image, the corresponding loss value can be calculated using the same method as described above; finally, the average of the loss values corresponding to these reference frame images can be used as the loss value in the final construction of the objective function.
- the parameters of the depth estimation network can be updated according to the objective function to achieve the purpose of optimizing and training the network.
- the AdamW optimizer can be used to solve the gradient of the objective function with respect to the weights of the depth estimation network, and this gradient is used to update the weights of the depth estimation network; this is iterated until the set maximum number of iterations is reached, completing the training of the depth estimation network.
- this objective function can be used to train the camera pose estimation network described above at the same time.
- similarly, the AdamW optimizer can be used to solve the gradient of the objective function with respect to the weights of the camera pose estimation network, and this gradient is used to update the weights of the camera pose estimation network; this is iterated until the set maximum number of iterations is reached, completing the training of the camera pose estimation network.
- the objective function can be used as a supervision signal to jointly guide the training of the depth estimation network and the camera pose estimation network.
- the AdamW optimizer can be used to solve the gradients of the objective function with respect to the weights of the depth estimation network and of the camera pose estimation network, and these gradients are used to update the weights of both networks simultaneously; this is iterated until the set maximum number of iterations is reached, completing the joint training of the depth estimation network and the camera pose estimation network, as in the training-loop sketch below.
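- The following is a minimal sketch of such a joint optimization loop; the dataloader and compute_objective are placeholders standing in for the sample construction and objective-function steps described in this text.

```python
import itertools
import torch

def train_jointly(depth_net, pose_net, dataloader, compute_objective,
                  lr=1e-4, max_iterations=200000):
    params = itertools.chain(depth_net.parameters(), pose_net.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    iteration = 0
    while iteration < max_iterations:
        for sample in dataloader:          # each sample: a preprocessed image sequence
            objective = compute_objective(depth_net, pose_net, sample)
            optimizer.zero_grad()
            objective.backward()           # gradients w.r.t. both networks' weights
            optimizer.step()               # update both networks simultaneously
            iteration += 1
            if iteration >= max_iterations:
                break
    return depth_net, pose_net
```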
- after training, a monocular image (such as the image to be tested) can be used as the input of the depth estimation network to directly calculate the corresponding scene depth image. It is also possible to use a continuous image sequence (for example, any 5 consecutive monocular frames) as the input of the camera pose estimation network to calculate the corresponding camera pose vector. It should be noted that the depth estimation network and the camera pose estimation network only need to be jointly optimized during training; after training is completed, the network weights are fixed, and only forward propagation (no backpropagation) is required at test time. Therefore, both networks can be used independently during testing.
- in summary, the depth estimation network adopted in the embodiments of the present application is combined with a camera pose estimation network to predict the camera pose vector of the input sample image sequence, where the sample image sequence includes the target frame image and the reference frame image; then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the internal parameters of the corresponding camera; next, the corresponding loss function is calculated according to the target frame image and the reconstructed image; finally, an objective function is constructed based on the loss function and the parameters of the depth estimation network are updated based on the objective function.
- in this way, the potential image information contained in the target frame image and the reference frame image can be fully exploited; that is, enough image information can be obtained from fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data collection.
- the potential information contained in the images can be further mined, the cost of sample data collection can be reduced, and moving objects, occlusions, etc. in video frames can be handled effectively, improving robustness in weakly textured environments.
- the following content illustrates the technical effects of the image scene depth estimation and camera pose estimation proposed in the present application through simulation results.
- the test set of the Eigen split is used as the evaluation data for the depth estimation network,
- and sequences 09-10 of the KITTI Odometry dataset are used as the evaluation data for the camera pose estimation network.
- the evaluation criteria used for the depth estimation network include the absolute relative error (AbsRel), root mean square error (Rmse), squared relative error (SqRel), logarithmic root mean square error (Rmselog) and threshold accuracy (δ_t); the evaluation metric used for the camera pose estimation network is the absolute trajectory error (ATE). After the simulation test, the test results comparing the method proposed in the present application with prior-art algorithms are shown in Tables 1 to 3 below.
- Table 1 is a comparison of the results of scene depth prediction for monocular images in the depth range of 80m.
- the absolute relative error (AbsRel), root mean square error (Rmse), squared relative error (SqRel) and logarithmic root mean square error (Rmselog) represent algorithm error values used to measure the accuracy of the algorithm: the smaller the error value, the higher the accuracy.
- the threshold accuracy (δ_t) represents how close the predicted scene depth is to the true value; the higher this value, the better the algorithm stability. From the test results in Table 1, it can be found that, compared with the prior-art algorithms, the method proposed in this application achieves higher scene depth prediction accuracy and better algorithm stability.
- Table 2 is a comparison of the results of scene depth prediction for monocular images in the depth range of 50m. From the test results in Table 2, it can also be found that, compared with the prior-art algorithms, the method proposed in this application achieves higher scene depth prediction accuracy and better algorithm stability, so it can more robustly predict scene depth and recover more detail from monocular images.
- the absolute trajectory error (ATE) in Table 3 represents the difference between the true value of the camera pose and the predicted camera pose. The smaller the error value, the more accurate the predicted camera pose.
- the simulation results show that, compared with various existing algorithms, the camera pose estimation method proposed in the present application predicts the camera pose more accurately.
- FIG. 6 is a comparison diagram of the results of monocular image depth prediction by the method for estimating the depth of the image scene proposed in the present application and various prior-art algorithms, wherein the Ground Truth depth map is a depth map obtained by visualizing lidar data.
- a method for estimating the depth of an image scene is mainly described above, and an apparatus for estimating the depth of an image scene will be described below.
- an embodiment of an apparatus for estimating the depth of an image scene in an embodiment of the present application includes:
- an image acquisition module 701, configured to acquire the image to be tested;
- a scene depth estimation module 702 configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested;
- a sample acquisition module 703, configured to acquire a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence;
- a first scene depth prediction module 704 configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image
- a camera pose estimation module 705, configured to input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image;
- a first image reconstruction module 706, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence;
- a first image reconstruction loss calculation module 707, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image;
- an objective function construction module 708, configured to construct an objective function based on the first image reconstruction loss
- a network parameter updating module 709 configured to update the parameters of the depth estimation network according to the objective function.
- the apparatus may further include:
- a second scene depth prediction module configured to input the reference frame image into the depth estimation network to obtain a predicted second scene depth image
- a second image reconstruction module configured to generate an image corresponding to the reference frame image according to the depth image of the second scene, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence the second reconstructed image;
- a second image reconstruction loss calculation module, configured to calculate a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure the difference between the reference frame image and the second reconstructed image;
- the objective function building block may include:
- a bidirectional image reconstruction loss calculation unit configured to calculate the bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss
- an objective function construction unit configured to construct the objective function based on the bidirectional image reconstruction loss.
- the first image reconstruction module may include:
- a first transformation matrix determining unit configured to determine a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector
- a first coordinate calculation unit configured to calculate the first coordinate of the target frame image in the world coordinate system according to the camera's internal parameters and the first scene depth image
- a first coordinate transformation unit configured to transform the first coordinate based on the first transformation matrix to obtain the second coordinate of the target frame image after the transformation in the world coordinate system
- a first coordinate conversion unit configured to convert the second coordinate into a third coordinate in the image coordinate system
- a first image reconstruction unit, configured to, based on the reference frame image and taking the third coordinate as grid points, reconstruct the affine-transformed image of the reference frame image through a bilinear sampling mechanism, and determine the reconstructed image as the first reconstructed image;
- the second image reconstruction module may include:
- a second transformation matrix determining unit configured to determine a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector
- a second coordinate calculation unit, configured to calculate the fourth coordinate of the reference frame image in the world coordinate system according to the internal parameters of the camera and the second scene depth image;
- a second coordinate transformation unit configured to transform the fourth coordinate based on the second transformation matrix to obtain the transformed fifth coordinate of the reference frame image in the world coordinate system
- a second coordinate conversion unit configured to convert the fifth coordinate into a sixth coordinate in the image coordinate system
- a second image reconstruction unit, configured to, based on the target frame image and taking the sixth coordinate as grid points, reconstruct the affine-transformed image of the target frame image through a bilinear sampling mechanism, and determine the reconstructed image as the second reconstructed image.
- the apparatus may further include:
- a first coordinate obtaining module used for obtaining the seventh coordinate of the target frame image in the image coordinate system
- a forward flow coordinate determination module configured to perform a difference processing of corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate
- a second coordinate obtaining module configured to obtain the eighth coordinate of the reference frame image in the image coordinate system
- a backward flow coordinate determination module configured to perform a difference processing of corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate
- a forward flow coordinate synthesis module configured to use the third coordinate as a grid point and perform affine transformation on the first backward flow coordinate using a bilinear sampling mechanism to synthesize the second forward flow coordinate;
- a backward flow coordinate synthesis module configured to use the sixth coordinate as a grid point, and use a bilinear sampling mechanism to perform affine transformation on the first forward flow coordinate to synthesize the second backward flow coordinate;
- a forward flow occlusion mask calculation module, configured to calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, where the forward flow occlusion mask is used to measure the degree of matching between the first forward flow coordinate and the second forward flow coordinate;
- a backward flow occlusion mask calculation module, configured to calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, where the backward flow occlusion mask is used to measure the degree of matching between the first backward flow coordinate and the second backward flow coordinate;
- the bidirectional image reconstruction loss calculation unit may be specifically configured to: calculate the bidirectional image according to the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask and the backward flow occlusion mask reconstruction loss.
- the apparatus may further include:
- a first scene depth value determination module configured to determine the first scene depth value of the target frame image according to the second coordinates
- a second scene depth value determining module configured to determine the second scene depth value of the reference frame image according to the fifth coordinate
- a third scene depth value determination module configured to obtain a third scene depth value of the pixel point corresponding to the second coordinate in the first scene depth image
- a fourth scene depth value determination module configured to acquire the fourth scene depth value of the pixel point corresponding to the fifth coordinate in the second scene depth image
- a first scene depth value reconstruction module configured to reconstruct a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
- a second scene depth value reconstruction module configured to reconstruct the sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
- a forward scene structure consistency loss calculation module, configured to calculate the forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, where the forward scene structure consistency loss is used to measure the difference between the scene depth value of the target frame image obtained by the multi-view geometric transformation calculation and the scene depth value of the reconstructed target frame image;
- a backward scene structure consistency loss calculation module, configured to calculate the backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, where the backward scene structure consistency loss is used to measure the difference between the scene depth value of the reference frame image obtained by the multi-view geometric transformation calculation and the scene depth value of the reconstructed reference frame image;
- a bidirectional scene structure consistency loss calculation module configured to calculate the bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss;
- the objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
- the depth estimation network includes an encoding network
- the apparatus may further include:
- a feature image acquisition module configured to obtain the first feature image of the target frame image and the second feature image of the reference frame image through the encoding network
- a first feature image reconstruction module configured to reconstruct a third feature image of the target frame image through a bilinear sampling mechanism based on the third coordinates and the second feature image;
- a second feature image reconstruction module configured to reconstruct a fourth feature image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first feature image;
- a bidirectional feature perception loss calculation module, configured to calculate the bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, where the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained by the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained by the encoding network and the reconstructed feature image of the reference frame image;
- the objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
- the device may also include:
- a smoothing loss calculation module, configured to calculate the smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, where the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained by the depth estimation network;
- the objective function construction unit may be specifically configured to: construct the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
- Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements any of the methods for estimating the depth of an image scene shown in FIG. 1.
- Embodiments of the present application also provide a computer program product, which, when the computer program product runs on a terminal device, enables the terminal device to execute any method for estimating the depth of an image scene as shown in FIG. 1 .
- FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
- the terminal device 8 of this embodiment includes: a processor 80 , a memory 81 , and a computer program 82 stored in the memory 81 and executable on the processor 80 .
- When the processor 80 executes the computer program 82, the steps in each of the above embodiments of the method for estimating the depth of an image scene are implemented, for example, steps 101 to 102 shown in FIG. 1.
- Alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the above apparatus embodiments are implemented, for example, the functions of the modules 701 to 709 shown in FIG. 7.
- the computer program 82 may be divided into one or more modules/units, which are stored in the memory 81 and executed by the processor 80 to complete the present application.
- the one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the terminal device 8 .
- the so-called processor 80 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the memory 81 may be an internal storage unit of the terminal device 8 , such as a hard disk or a memory of the terminal device 8 .
- the memory 81 can also be an external storage device of the terminal device 8, such as a plug-in hard disk equipped on the terminal device 8, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, Flash Card, etc.
- the memory 81 may also include both an internal storage unit of the terminal device 8 and an external storage device.
- the memory 81 is used to store the computer program and other programs and data required by the terminal device.
- the memory 81 can also be used to temporarily store data that has been output or will be output.
- the disclosed apparatus and method may be implemented in other manners.
- the system embodiments described above are only illustrative.
- the division of the modules or units is only a division by logical function; in actual implementation, there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
- the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, or as indirect coupling or communication connection between apparatuses or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software functional units.
- the integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- all or part of the processes in the methods of the above embodiments of the present application may be implemented by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented.
- the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form, and the like.
- the computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like.
- the content contained in the computer-readable medium may be appropriately added or removed in accordance with the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunication signals.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The present application relates to the technical field of image processing. Provided are a method and apparatus for estimating the depth of field of an image, and a terminal device and a storage medium. In the present application, during the optimization and update of parameters of a depth estimation network used, a camera pose estimation network is used to predict a camera pose vector of an input sample image sequence, wherein the sample image sequence comprises a target frame image and a reference frame image; then, according to a depth-of-field image of the target frame image that is predicted by the depth estimation network, the camera pose vector, the reference frame image, and internal parameters of a corresponding camera, a reconstructed image corresponding to the target frame image is generated; next, a corresponding loss function during image reconstruction is calculated according to the target frame image and the reconstructed image; and finally, an objective function is constructed on the basis of the loss function, and parameters of the depth estimation network are updated on the basis of the objective function. In this manner, image information included in a target frame image and a reference frame image can be fully mined, and the cost of sample data collection is reduced.
Description
The present application relates to the technical field of image processing, and in particular, to a method, an apparatus, a terminal device and a storage medium for estimating the depth of an image scene.
Scene depth estimation from images is an important research direction in the fields of robot navigation and autonomous driving. With the development of high-performance computing devices, deep neural networks are commonly used to predict the scene depth of an image. However, to ensure the accuracy of scene depth prediction, a large amount of sample data is required to train such a deep neural network, which leads to a high cost of data collection.
SUMMARY OF THE INVENTION
In view of this, the embodiments of the present application provide a method, an apparatus, a terminal device and a storage medium for estimating the depth of an image scene, which can reduce the cost of sample data collection.
A first aspect of the embodiments of the present application provides a method for estimating the depth of an image scene, including:
obtaining an image to be tested;
inputting the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested;
wherein the parameters of the depth estimation network are updated in the following manner:
obtaining a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is one or more frames of images before or after the target frame image in the sample image sequence;
inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
inputting the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image;
generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence;
calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image;
constructing an objective function based on the first image reconstruction loss;
updating the parameters of the depth estimation network according to the objective function.
When optimizing and updating its parameters, the depth estimation network adopted in the embodiments of the present application uses a camera pose estimation network to predict the camera pose vector of an input sample image sequence, where the sample image sequence includes a target frame image and a reference frame image; then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image, and the internal parameters of the corresponding camera; next, the loss function corresponding to the image reconstruction is calculated according to the target frame image and the reconstructed image; finally, an objective function is constructed based on the loss function, and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the potential image information contained in the target frame image and the reference frame image can be fully exploited, that is, sufficient image information can be obtained by sampling fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data collection.
In an embodiment of the present application, after the sample image sequence is obtained, the method may further include:
inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence;
calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure the difference between the reference frame image and the second reconstructed image;
the constructing an objective function based on the first image reconstruction loss includes:
calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss;
constructing the objective function based on the bidirectional image reconstruction loss.
By adding the bidirectional image reconstruction loss to the objective function of the depth estimation network, the potential information in the image data can be fully exploited, further improving the robustness of the depth estimation algorithm.
Further, the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
determining, according to the camera pose vector, a first transformation matrix for converting the target frame image to the reference frame image;
calculating the first coordinates of the target frame image in the world coordinate system according to the internal parameters of the camera and the first scene depth image;
transforming the first coordinates based on the first transformation matrix to obtain the second coordinates of the target frame image in the world coordinate system after the transformation;
converting the second coordinates into third coordinates in the image coordinate system;
based on the reference frame image, using the third coordinates as grid points, reconstructing an affine-transformed image of the reference frame image through a bilinear sampling mechanism, and determining the reconstructed image as the first reconstructed image;
the generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence includes:
determining, according to the camera pose vector, a second transformation matrix for converting the reference frame image to the target frame image;
calculating the fourth coordinates of the reference frame image in the world coordinate system according to the internal parameters of the camera and the second scene depth image;
transforming the fourth coordinates based on the second transformation matrix to obtain the fifth coordinates of the reference frame image in the world coordinate system after the transformation;
converting the fifth coordinates into sixth coordinates in the image coordinate system;
based on the target frame image, using the sixth coordinates as grid points, reconstructing an affine-transformed image of the target frame image through a bilinear sampling mechanism, and determining the reconstructed image as the second reconstructed image.
In an embodiment of the present application, the method may further include:
obtaining the seventh coordinates of the target frame image in the image coordinate system;
performing element-wise subtraction on the third coordinates and the seventh coordinates to obtain first forward flow coordinates;
obtaining the eighth coordinates of the reference frame image in the image coordinate system;
performing element-wise subtraction on the sixth coordinates and the eighth coordinates to obtain first backward flow coordinates;
using the third coordinates as grid points, performing an affine transformation on the first backward flow coordinates through a bilinear sampling mechanism to synthesize second forward flow coordinates;
using the sixth coordinates as grid points, performing an affine transformation on the first forward flow coordinates through a bilinear sampling mechanism to synthesize second backward flow coordinates;
calculating a forward flow occlusion mask according to the first forward flow coordinates and the second forward flow coordinates, where the forward flow occlusion mask is used to measure the degree of matching between the first forward flow coordinates and the second forward flow coordinates;
calculating a backward flow occlusion mask according to the first backward flow coordinates and the second backward flow coordinates, where the backward flow occlusion mask is used to measure the degree of matching between the first backward flow coordinates and the second backward flow coordinates;
the calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss includes:
calculating the bidirectional image reconstruction loss according to the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask and the backward flow occlusion mask.
The occlusion masks are used to determine whether there are occluding objects in consecutive video frames. Adding the occlusion masks to the calculation of the bidirectional image reconstruction loss can improve the accuracy of the depth estimation network when estimating depth for images containing occluded objects.
In an embodiment of the present application, the method may further include:
determining a first scene depth value of the target frame image according to the second coordinates;
determining a second scene depth value of the reference frame image according to the fifth coordinates;
obtaining a third scene depth value of the pixels in the first scene depth image corresponding to the second coordinates;
obtaining a fourth scene depth value of the pixels in the second scene depth image corresponding to the fifth coordinates;
reconstructing a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinates and the fourth scene depth value;
reconstructing a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinates and the third scene depth value;
calculating a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, where the forward scene structure consistency loss is used to measure the difference between the scene depth value of the target frame image obtained through the multi-view geometric transformation calculation and the reconstructed scene depth value of the target frame image;
calculating a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, where the backward scene structure consistency loss is used to measure the difference between the scene depth value of the reference frame image obtained through the multi-view geometric transformation calculation and the reconstructed scene depth value of the reference frame image;
calculating a bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss;
the constructing the objective function based on the bidirectional image reconstruction loss may include:
constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
When constructing the objective function, adding the bidirectional scene structure consistency loss makes it possible to handle occluded objects and moving objects in the scene of the image to be tested effectively, thereby improving the accuracy of scene depth estimation.
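As a rough, assumed illustration of a forward scene structure consistency term (the exact formula of this application is not reproduced here), the following sketch compares the depth values obtained through the geometric transformation with the depth values reconstructed by bilinear sampling, using a normalized absolute difference that is common in self-supervised depth estimation:

```python
import torch

def scene_structure_consistency(depth_transformed, depth_resampled, eps=1e-7):
    # depth_transformed: scene depth values of the target frame obtained through the
    #                    multi-view geometric transformation calculation
    # depth_resampled:   scene depth values of the target frame reconstructed through
    #                    the bilinear sampling mechanism
    diff = (depth_transformed - depth_resampled).abs()
    return (diff / (depth_transformed + depth_resampled + eps)).mean()
```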
In an embodiment of the present application, the depth estimation network includes an encoding network, and the method may further include:
obtaining a first feature image of the target frame image and a second feature image of the reference frame image through the encoding network;
reconstructing a third feature image of the target frame image through a bilinear sampling mechanism based on the third coordinates and the second feature image;
reconstructing a fourth feature image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinates and the first feature image;
calculating a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, where the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image;
the constructing the objective function based on the bidirectional image reconstruction loss includes:
constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
By introducing the bidirectional feature perception loss into the objective function, weakly textured scenes in the image to be tested can be handled effectively, thereby improving the accuracy of scene depth estimation.
Further, after obtaining the first feature image of the target frame image and the second feature image of the reference frame image through the encoding network, the method may further include:
calculating a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, where the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained through the depth estimation network;
the constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss includes:
constructing the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
By introducing the smoothing loss into the objective function, the gradients of the scene depth images and feature images obtained through the depth estimation network can be regularized.
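The exact form of the smoothing loss is not given above; as an assumed illustration, an edge-aware gradient penalty of the following kind is commonly used to regularize the gradients of the predicted scene depth images (and, analogously, the feature images):

```python
import torch

def edge_aware_smoothness(pred, image):
    # pred:  predicted scene depth image (or feature image), shape (B, C, H, W)
    # image: corresponding input frame, shape (B, 3, H, W)
    dp_x = (pred[:, :, :, :-1] - pred[:, :, :, 1:]).abs()
    dp_y = (pred[:, :, :-1, :] - pred[:, :, 1:, :]).abs()
    di_x = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    di_y = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    # Down-weight gradients at image edges so that depth discontinuities are preserved.
    return (dp_x * torch.exp(-di_x)).mean() + (dp_y * torch.exp(-di_y)).mean()
```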
A second aspect of the embodiments of the present application provides an apparatus for estimating the depth of an image scene, including:
an image acquisition module, configured to acquire an image to be tested;
a scene depth estimation module, configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested;
a sample acquisition module, configured to acquire a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is one or more frames of images before or after the target frame image in the sample image sequence;
a first scene depth prediction module, configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image;
a camera pose estimation module, configured to input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image;
a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence;
a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image;
an objective function construction module, configured to construct an objective function based on the first image reconstruction loss;
a network parameter updating module, configured to update the parameters of the depth estimation network according to the objective function.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for estimating the depth of an image scene provided by the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for estimating the depth of an image scene provided by the first aspect of the embodiments of the present application.
A fifth aspect of the embodiments of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the method for estimating the depth of an image scene described in the first aspect of the embodiments of the present application.
It can be understood that, for the beneficial effects of the second to fifth aspects, reference may be made to the relevant description in the first aspect, which will not be repeated here.
In order to describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments or the prior art. Obviously, the drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of an embodiment of a method for estimating the depth of an image scene provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of optimizing and updating the parameters of a depth estimation network provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a depth estimation network provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a residual module in the network structure shown in FIG. 3;
FIG. 5 is a schematic structural diagram of a camera pose estimation network provided by an embodiment of the present application;
FIG. 6 is a comparison diagram of monocular image depth prediction results obtained by the method for estimating the depth of an image scene provided by an embodiment of the present application and by various prior-art algorithms;
FIG. 7 is a structural diagram of an embodiment of an apparatus for estimating the depth of an image scene provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits and methods are omitted so that unnecessary details do not obscure the description of the present application. In addition, in the description of the specification of the present application and the appended claims, the terms "first", "second", "third", etc. are only used to distinguish the description, and should not be construed as indicating or implying relative importance.
The present application proposes a method, an apparatus, a terminal device and a storage medium for estimating the depth of an image scene, which can reduce the cost of sample data collection. It should be understood that the execution body of each method embodiment of the present application is any of various types of terminal devices or servers, such as mobile phones, tablet computers, notebook computers, desktop computers and wearable devices.
Referring to FIG. 1, a method for estimating the depth of an image scene provided by an embodiment of the present application is shown, including:
101. Obtain an image to be tested.
First, an image to be tested is obtained; the image to be tested is any image whose scene depth needs to be predicted.
102. Input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested.
After the image to be tested is obtained, it is input into a pre-built depth estimation network to obtain a scene depth image of the image to be tested, thereby obtaining a scene depth estimation result for the image to be tested. Specifically, the depth estimation network may be a neural network with an encoder-decoder architecture; the present application does not impose any limitation on the type or network structure of the neural network used as the depth estimation network.
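In practice, this step is a single forward pass through the pre-built depth estimation network; a minimal usage sketch (the DepthNet class and checkpoint file name below are hypothetical) might look as follows:

```python
import torch
from depth_model import DepthNet  # hypothetical module providing the pre-built network

net = DepthNet()
net.load_state_dict(torch.load("depth_net.pth", map_location="cpu"))
net.eval()

image = torch.rand(1, 3, 256, 832)   # image to be tested, preprocessed into a tensor
with torch.no_grad():
    scene_depth = net(image)         # scene depth image of the image to be tested
```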
Referring to FIG. 2, a schematic flowchart of optimizing and updating the parameters of the depth estimation network provided by an embodiment of the present application is shown, including the following steps:
2.1. Obtain a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is one or more frames of images before or after the target frame image in the sample image sequence.
To train and optimize the depth estimation network, training set data needs to be obtained first, and certain preprocessing operations may be performed on the training set data. For example, the autonomous driving dataset KITTI may be obtained as the training set data, and preprocessing operations such as random flipping, random cropping and data normalization may be performed on it to convert the training set data into tensor data of specified dimensions as the input of the depth estimation network. In the embodiments of the present application, the training set data may consist of a large number of sample image sequences, where each sample image sequence includes a target frame image and a reference frame image, and the reference frame image is one or more frames of images before or after the target frame image in the sample image sequence. For example, a sample image sequence may be a video clip containing 5 consecutive video frames, denoted I_0, I_1, I_2, I_3, I_4; then I_2 may be used as the target frame image, and I_0, I_1, I_3 and I_4 may be used as the corresponding reference frame images.
2.2. Input the target frame image into the depth estimation network to obtain a predicted first scene depth image.
The target frame image in the sample image sequence is input into the depth estimation network to obtain the predicted first scene depth image, that is, the scene depth image corresponding to the target frame image.
In an embodiment of the present application, a schematic structural diagram of the depth estimation network is shown in FIG. 3, which includes an encoder part and a decoder part. The encoder part is used to extract abstract features of the input image data by layer-by-layer downsampling. Assuming that the target frame image is preprocessed into tensor data with dimensions 3*256*832, a feature image with dimensions 64*128*416 is obtained after the convolution, normalization and activation function of the first encoder layer, completing the first downsampling. The feature image is then processed by max pooling and multiple residual modules to obtain a feature image with dimensions 256*64*208, completing the second downsampling. By analogy, after multiple downsampling operations, a feature image with dimensions 2048*8*26 is obtained. The decoder part is used to process the feature images obtained by the encoder by layer-by-layer upsampling. Specifically, a convolutional layer with a 3*3 kernel, a nonlinear ELU layer and a nearest-neighbor upsampling layer may be used to process the feature image obtained by the encoder to obtain a feature image with dimensions 512*16*52. Then, as shown in FIG. 3, this 512*16*52 feature image is concatenated with the 1024*16*52 feature image obtained by the encoder along the channel dimension to obtain a feature image with dimensions 1536*16*52, completing the first upsampling. By analogy, after multiple upsampling operations, a feature image with dimensions 32*256*832 is finally obtained. Next, the 32*256*832 feature image is sequentially processed by a convolutional layer with a 3*3 kernel, a Sigmoid function, and F(x) = 1/(10*x + 0.01) to obtain the final scene depth image, where x denotes the depth image obtained after the Sigmoid mapping and F(x) denotes the final scene depth image. Specifically, after the feature image is transformed by the Sigmoid function, the value of each pixel is mapped into the range from 0 to 1. Assuming that the actual scene depth ranges from 0.1 m to 100 m, the function F(x) = 1/(10*x + 0.01) establishes a mapping between the pixels of the estimated depth image and the actual scene depth; for example, x = 0 corresponds to 100 m in the real scene. Therefore, processing by the function F(x) = 1/(10*x + 0.01) constrains the estimated depth image to a reasonable range between 0.1 m and 100 m.
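The final depth head described above (a 3*3 convolution, a Sigmoid, and the mapping F(x) = 1/(10*x + 0.01)) can be sketched as follows; the 32-channel input is taken from the text, while the rest is an assumed minimal implementation rather than the exact network of FIG. 3:

```python
import torch
import torch.nn as nn

class DepthHead(nn.Module):
    # Maps a 32-channel decoder feature image to a scene depth image in [0.1 m, 100 m].
    def __init__(self, in_channels=32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, feat):
        x = torch.sigmoid(self.conv(feat))   # each pixel mapped into (0, 1)
        return 1.0 / (10.0 * x + 0.01)       # F(x): x = 0 -> 100 m, x = 1 -> about 0.1 m
```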
A schematic structural diagram of the residual module in the network structure shown in FIG. 3 is shown in FIG. 4. The input is split into two branches; one branch passes through the convolutional layers, the batch normalization layer BN and the ReLU function in sequence, and is then superimposed with the other branch to obtain the output data of the residual module.
In addition, the network structure shown in FIG. 3 uses shortcut connections, that is, the feature images extracted by the encoder skip the convolutional layers and are concatenated, along the channel dimension, with the feature images of the same resolution obtained by the decoder. When the encoder extracts features from the input image, a fixed-size convolution kernel continuously extracts image features in a sliding-window manner; however, due to the limited kernel size and the local nature of convolutional feature extraction, the shallow layers can only extract local features of the image. As the number of convolutional layers increases, the resolution of the extracted feature images keeps decreasing while the number of feature images keeps increasing, so that more abstract deep features with larger receptive fields can be extracted. The decoder directly decodes the feature image finally output by the encoder and upsamples these deep features multiple times to obtain deep feature images of different resolutions; for example, after the first upsampling a feature image with dimensions 512*16*52 is obtained, and at this point the 1024*16*52 feature image extracted by the encoder skips the corresponding convolutional layers and is fused, along the channel dimension, with the 512*16*52 feature image obtained by the decoder. As shown in FIG. 3, the feature image of each resolution extracted by the encoder is fused with the feature image obtained by the corresponding decoder layer, realizing the fusion of local image features and deep feature information.
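A residual module of the kind described for FIG. 4 (one branch passing through convolution, batch normalization BN and ReLU before being added to the other branch) can be sketched as follows; the channel sizes and layer counts are assumed for illustration:

```python
import torch.nn as nn

class ResidualModule(nn.Module):
    # One branch: conv -> BN -> ReLU -> conv -> BN; the other branch is the identity.
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + x)  # superimpose the two branches
```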
2.3. Input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image.
In order to obtain the camera pose vector between the target frame image and the reference frame image, an embodiment of the present application also pre-builds a camera pose estimation network, whose structure may be as shown in FIG. 5 and which contains multiple convolutional layers with different parameters. Specifically, assuming that the input sample image sequence consists of 5 frames I_0, I_1, I_2, I_3, I_4, these 5 frames are first preprocessed into tensor data of specified dimensions as the input of the camera pose estimation network; the camera pose estimation network uses multiple convolutional layers with specified strides to extract image features and perform downsampling, obtaining the corresponding feature images in sequence. For example, in FIG. 5, after the input tensor data passes through 8 convolutional layers, a 24-dimensional feature vector can be obtained, and finally this feature vector is reshaped into a 6*N_ref camera pose vector, where 6 indicates that each camera pose vector is a 6-dimensional vector composed of 3 translation components and 3 rotation components, and N_ref = 4 indicates that the number of reference frame images in the input sample image sequence is 4.
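The reshaping of the pose feature vector into a 6*N_ref camera pose vector (3 translation components and 3 rotation components per reference frame, with N_ref = 4) can be sketched as follows; the convolutional layers of FIG. 5 are omitted:

```python
import torch

def reshape_pose_vector(pose_features, n_ref=4):
    # pose_features: (B, 6 * n_ref) vector output by the pose network, e.g. the
    #                24-dimensional feature vector mentioned for FIG. 5.
    poses = pose_features.view(-1, n_ref, 6)
    translation = poses[..., :3]   # 3 translation components per reference frame
    rotation = poses[..., 3:]      # 3 rotation components per reference frame
    return translation, rotation
```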
2.4. Generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence.
After the estimated first scene depth image and camera pose vector are obtained, image reconstruction needs to be performed based on these data to obtain the first reconstructed image corresponding to the target frame image, so that the image reconstruction loss can be calculated subsequently.
Specifically, the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
(1) determining, according to the camera pose vector, a first transformation matrix for converting the target frame image to the reference frame image;
(2) calculating the first coordinates of the target frame image in the world coordinate system according to the internal parameters of the camera and the first scene depth image;
(3) transforming the first coordinates based on the first transformation matrix to obtain the second coordinates of the target frame image in the world coordinate system after the transformation;
(4) converting the second coordinates into third coordinates in the image coordinate system;
(5) based on the reference frame image, using the third coordinates as grid points, reconstructing an affine-transformed image of the reference frame image through a bilinear sampling mechanism, and determining the reconstructed image as the first reconstructed image.
Assume that the target frame image is I_tgt, the reference frame image is I_ref, and the intrinsic parameter matrix of the corresponding camera is K. The depth estimation network described above can then estimate the first scene depth image D_tgt corresponding to I_tgt, and the camera pose estimation network described above estimates the camera pose between the two frames, yielding the first transformation matrix T (composed of a rotation vector and a translation vector) for converting from the target frame image I_tgt to the reference frame image I_ref. Then, according to the camera intrinsic matrix K, the first scene depth image D_tgt and the target frame image I_tgt, the coordinates of the target frame image I_tgt in the world coordinate system (the first coordinates) can be calculated. For example, assume that the image coordinates of a certain pixel in the target frame image I_tgt are (u_tgt, v_tgt), and that the depth of this pixel determined from the first scene depth image D_tgt is d_tgt; then the coordinates of this pixel in the world coordinate system can be calculated by the following formulas:
X = (u_tgt - c_x) * Z / f
Y = (v_tgt - c_y) * Z / f
Z = d_tgt
where (X, Y, Z) denotes the coordinates of the pixel in the world coordinate system, (c_x, c_y, f) are parameters of the camera intrinsic matrix, c_x and c_y denote the principal point offsets, and f denotes the focal length.
Then, the first coordinates are transformed based on the first transformation matrix T to obtain the second coordinates of the target frame image I_tgt in the world coordinate system after the transformation, where (R_x, R_y, R_z, t) ∈ SE3 are the 3D rotation angles and the translation vector, which can be obtained from the first transformation matrix T; R_x, R_y and R_z denote the rotation amounts about the x-axis, y-axis and z-axis of the world coordinate system respectively, t denotes the translation along the x-axis, y-axis and z-axis, and SE3 denotes the special Euclidean group.
Next, the second coordinates are converted into the third coordinates in the image coordinate system by projecting them with the camera parameters, where T_tgt->ref denotes the camera extrinsic matrix composed of the rotation matrix and the translation matrix.
After the third coordinates are obtained, based on the reference frame image I_ref and using the third coordinates as grid points, the affine-transformed image of the reference frame image I_ref can be reconstructed through the bilinear sampling mechanism, and the reconstructed image is determined as the first reconstructed image. For the principle of the bilinear sampling mechanism, reference may be made to the prior art, and details are not repeated here.
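A compact sketch of the reconstruction just described, which back-projects the target pixels with D_tgt and K, transforms them with T, projects them into the reference view and bilinearly samples I_ref at the resulting grid, is given below. It uses the standard pinhole-camera equations and torch.nn.functional.grid_sample as the bilinear sampling mechanism, and is an illustrative implementation under those assumptions rather than the verbatim one of this application:

```python
import torch
import torch.nn.functional as F

def reconstruct_target(I_ref, D_tgt, K, T):
    # I_ref: (B, 3, H, W) reference frame image
    # D_tgt: (B, 1, H, W) first scene depth image of the target frame
    # K:     (B, 3, 3)    camera intrinsic matrix
    # T:     (B, 4, 4)    first transformation matrix (target frame -> reference frame)
    B, _, H, W = D_tgt.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # First coordinates: back-project the pixels into the world coordinate system.
    cam = torch.inverse(K) @ pix * D_tgt.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)        # homogeneous coordinates

    # Second coordinates: apply the rotation and translation of T.
    cam_ref = (T @ cam_h)[:, :3, :]

    # Third coordinates: project into the image coordinate system of the reference view.
    proj = K @ cam_ref
    uv = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)

    # Use the third coordinates as grid points for bilinear sampling of I_ref.
    grid = torch.stack([2.0 * uv[:, 0] / (W - 1) - 1.0,
                        2.0 * uv[:, 1] / (H - 1) - 1.0], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_ref, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```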
2.5. Calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image.
Specifically, the first image reconstruction loss can be calculated from the target frame image and the first reconstructed image by combining a structural similarity term and a robust error term, where α is a preset weight parameter, which may be 0.85 for example; SSIM(*) is the structural similarity measure function, in which μ and δ are the pixel mean and variance respectively, c_1 = 0.01^2 and c_2 = 0.03^2; and ERF(*) is the robust error metric, in which ε = 0.01.
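A sketch of such a reconstruction loss is shown below. The SSIM statistics are computed with simple 3*3 average pooling, and since the exact ERF(*) expression is not reproduced above, a Charbonnier-style penalty with ε = 0.01 is assumed for illustration; the weighting by α = 0.85 follows the text:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Structural similarity with 3x3 average pooling for the local means/variances.
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def image_reconstruction_loss(target, reconstructed, alpha=0.85, eps=0.01):
    ssim_term = (1.0 - ssim(target, reconstructed)).clamp(0, 2) / 2.0
    robust_term = torch.sqrt((target - reconstructed) ** 2 + eps ** 2)  # assumed ERF form
    return (alpha * ssim_term + (1.0 - alpha) * robust_term).mean()
```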
2.6. Construct an objective function based on the first image reconstruction loss.
After the first image reconstruction loss is obtained, an objective function can be constructed based on the first image reconstruction loss, so as to complete the parameter update of the depth estimation network.
In an embodiment of the present application, after the sample image sequence is obtained, the process may further include:
(1) inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
(2) generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence;
(3) calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure the difference between the reference frame image and the second reconstructed image.
具体的,所述根据所述第二场景深度图像、所述相机姿态向量、所述目标帧图像以及拍摄所述样本图像序列采用的相机的内参,生成与所述参考帧图像对应的第二重建图像,可以包括:Specifically, the second reconstruction corresponding to the reference frame image is generated according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence images, which can include:
(2.1)根据所述相机姿态向量确定所述参考帧图像转换到所述目标帧图像的第二变换矩阵;(2.1) Determine a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
(2.2)根据所述相机的内参和所述第二场景深度图像,计算所述参考帧图像在世界坐标系下的第四坐标;(2.2) According to the internal reference of the camera and the depth image of the second scene, calculate the fourth coordinate of the reference frame image in the world coordinate system;
(2.3)基于所述第二变换矩阵对所述第四坐标进行变换,得到所述参考帧 图像经转换后在世界坐标系下的第五坐标;(2.3) the 4th coordinate is transformed based on the described second transformation matrix to obtain the 5th coordinate of the reference frame image under the world coordinate system after conversion;
(2.4)将所述第五坐标转换为在图像坐标系下的第六坐标;(2.4) converting the fifth coordinate into the sixth coordinate under the image coordinate system;
(2.5)基于所述目标帧图像,以所述第六坐标作为网格点,通过双线性采样机制重建出所述目标帧图像经仿射变换后的图像,并将重建得到的图像确定为所述第二重建图像。(2.5) Based on the target frame image, using the sixth coordinate as a grid point, reconstruct the affine transformed image of the target frame image through the bilinear sampling mechanism, and determine the reconstructed image as the second reconstructed image.
与上述计算第一图像重建损失的方法类似,在计算第二图像重建损失时,假设目标帧图像为I
tgt,参考帧图像为I
ref,对应相机的内参矩阵为K,则可以通过上文所述的深度估计网络估计出I
ref对应的第二场景深度图像为D
ref,由上文所述的相机姿态估计网络估计出两帧图像之间的相机姿态,得到从参考帧图像I
ref转换到目标帧图像I
tgt的第二变换矩阵T
inv,该第二变换矩阵是从目标帧图像I
tgt转换到参考帧图像I
ref的第一变换矩阵T的逆变换矩阵。然后,根据该相机内参矩阵K、第二场景深度图像D
ref和该参考帧图像I
ref,可以计算出参考帧图像I
ref在世界坐标系下的坐标(第四坐标)。然后,基于该第二变换矩阵T
inv对该第四坐标进行转换,获得该参考帧图像I
ref经转换后在世界坐标系下的第五坐标,接着计算出此第五坐标在图像坐标系的第六坐标,具体的坐标变换步骤可以参照前文所述的计算第一图像重建损失的相关内容。最后,可以基于目标帧图像I
tgt,以该第六坐标为网格点,通过双线性采样机制重建出目标帧图像I
tgt经仿射变换后的图像
并将重建得到的图像
确定为该第二重建图像。计算第二图像重建损失可以采用以下公式:
Similar to the above method for calculating the first image reconstruction loss, when calculating the second image reconstruction loss, it is assumed that the target frame image is I tgt , the reference frame image is I ref , and the internal parameter matrix of the corresponding camera is K. The depth estimation network described above estimates that the depth image of the second scene corresponding to I ref is D ref , and the camera pose between the two frames of images is estimated by the camera pose estimation network described above to obtain the conversion from the reference frame image I ref to D ref . The second transformation matrix T inv of the target frame image It tgt is an inverse transformation matrix of the first transformation matrix T transformed from the target frame image It tgt to the reference frame image I ref . Then, according to the camera internal parameter matrix K, the second scene depth image D ref and the reference frame image I ref , the coordinates (fourth coordinates) of the reference frame image I ref in the world coordinate system can be calculated. Then, the fourth coordinate is transformed based on the second transformation matrix T inv to obtain the transformed fifth coordinate of the reference frame image I ref in the world coordinate system, and then the fifth coordinate in the image coordinate system is calculated. For the sixth coordinate, the specific coordinate transformation step may refer to the above-mentioned related content of calculating the reconstruction loss of the first image. Finally, based on the target frame image It tgt , and taking the sixth coordinate as the grid point, the affine transformed image of the target frame image It tgt can be reconstructed through the bilinear sampling mechanism and reconstruct the resulting image Determined to be the second reconstructed image. The following formula can be used to calculate the second image reconstruction loss:
关于上述公式中各个参数的定义,可以参照前文所述的计算第一图像重建损失的公式中的说明。For the definition of each parameter in the above formula, reference may be made to the description in the formula for calculating the reconstruction loss of the first image described above.
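As an illustration of the reconstruction procedure described in steps (2.1)-(2.5) above, the following PyTorch-style sketch shows one way the second reconstructed image could be generated. It is not the patent's code: the tensor shapes, the 4×4 pose matrix T_inv and the function name are assumptions made for the example.

```python
# Minimal sketch (assumed shapes: K is a 3x3 float tensor, depth_ref is (B,1,H,W),
# T_inv is a 4x4 reference->target pose matrix, img_tgt is (B,3,H,W)).
import torch
import torch.nn.functional as F

def reconstruct_reference(img_tgt, depth_ref, T_inv, K):
    B, _, H, W = depth_ref.shape
    # Pixel grid of the reference frame in homogeneous image coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(1, 3, -1)
    # (2.2) Back-project reference pixels to 3-D points (the "fourth coordinates").
    cam_pts = torch.inverse(K) @ pix * depth_ref.view(B, 1, -1)
    cam_pts = torch.cat([cam_pts, torch.ones(B, 1, H * W)], dim=1)  # homogeneous
    # (2.3) Transform with T_inv to obtain the "fifth coordinates".
    cam_pts_tgt = (T_inv @ cam_pts)[:, :3]
    # (2.4) Project into the target image plane to obtain the "sixth coordinates".
    proj = K @ cam_pts_tgt
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)
    grid = torch.stack([2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1], dim=-1)
    grid = grid.view(B, H, W, 2)  # normalised to [-1, 1] for grid_sample
    # (2.5) Bilinear sampling of the target frame at the projected grid points.
    return F.grid_sample(img_tgt, grid, mode="bilinear", align_corners=True)
```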
第一图像重建损失可以定义为前向图像重建损失，第二图像重建损失可以定义为后向图像重建损失，那么可以基于这两个图像重建损失构建得到双向图像重建损失，具体的计算公式可以如下：The first image reconstruction loss can be defined as the forward image reconstruction loss, and the second image reconstruction loss can be defined as the backward image reconstruction loss; the bidirectional image reconstruction loss can then be constructed based on these two image reconstruction losses, and the specific calculation formula can be as follows:
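Assuming the two directional losses are simply combined (one plausible form; the patent's exact weighting may differ, and a mask-weighted version is discussed further below):

$$L_{photo}=L_{photo}^{fwd}+L_{photo}^{bwd}$$

where $L_{photo}^{fwd}$ is the first (forward) image reconstruction loss and $L_{photo}^{bwd}$ the second (backward) one.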
之后，可以基于该双向图像重建损失构建目标函数。通过在深度估计网络的目标函数中添加双向图像重建损失，能够充分挖掘图像数据中的潜在信息，进一步提升深度估计算法的鲁棒性。Afterwards, the objective function can be constructed based on this bidirectional image reconstruction loss. By adding the bidirectional image reconstruction loss to the objective function of the depth estimation network, the latent information in the image data can be fully exploited, further improving the robustness of the depth estimation algorithm.
在本申请的一个实施例中,所述方法还可以包括:In an embodiment of the present application, the method may further include:
(1)获取所述目标帧图像在图像坐标系下的第七坐标;(1) obtaining the seventh coordinate of the target frame image under the image coordinate system;
(2)对所述第三坐标和所述第七坐标执行对应元素作差的处理,得到第一前向流坐标;(2) performing the process of making a difference between the corresponding elements on the third coordinate and the seventh coordinate to obtain the first forward flow coordinate;
(3)获取所述参考帧图像在图像坐标系下的第八坐标;(3) obtaining the eighth coordinate of the reference frame image under the image coordinate system;
(4)对所述第六坐标和所述第八坐标执行对应元素作差的处理,得到第一后向流坐标;(4) the processing of the difference of the corresponding elements is performed on the sixth coordinate and the eighth coordinate to obtain the first backward flow coordinate;
(5)以所述第三坐标作为网格点,采用双线性采样机制对所述第一后向流坐标进行仿射变换,以合成第二前向流坐标;(5) using the third coordinate as a grid point, adopting a bilinear sampling mechanism to perform affine transformation on the first backward flow coordinate to synthesize the second forward flow coordinate;
(6)以所述第六坐标作为网格点,采用双线性采样机制对所述第一前向流坐标进行仿射变换,以合成第二后向流坐标;(6) using the sixth coordinate as a grid point, using a bilinear sampling mechanism to perform affine transformation on the first forward flow coordinate to synthesize the second backward flow coordinate;
(7)根据所述第一前向流坐标和所述第二前向流坐标计算前向流遮挡掩码,所述前向流遮挡掩码用于衡量所述第一前向流坐标和所述第二前向流坐标之间的匹配程度;(7) Calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, and the forward flow occlusion mask is used to measure the first forward flow coordinate and all the the matching degree between the second forward flow coordinates;
(8)根据所述第一后向流坐标和所述第二后向流坐标计算后向流遮挡掩码,所述后向流遮挡掩码用于衡量所述第一后向流坐标和所述第二后向流坐标之间的匹配程度。(8) Calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, and the backward flow occlusion mask is used to measure the first backward flow coordinate and all the backward flow occlusion masks. The matching degree between the coordinates of the second backward flow.
这个过程可以概括为双向流一致性的检验，其中包括前向流一致性检验和后向流一致性检验。首先，获取目标帧图像在图像坐标系下的第七坐标，以及前文所述的第三坐标，然后对该第三坐标和该第七坐标执行对应元素作差的处理，得到第一前向流坐标，如以下公式所示：This process can be summarized as a bidirectional flow consistency check, which includes a forward flow consistency check and a backward flow consistency check. First, the seventh coordinates of the target frame image in the image coordinate system and the aforementioned third coordinates are obtained; then the element-wise difference between the third coordinates and the seventh coordinates is computed to obtain the first forward flow coordinates, as shown in the following formula:
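Writing the seventh coordinates (the original pixel grid of the target frame) as $p_{tgt}$ and the third coordinates (the target pixels projected into the reference view) as $\tilde{p}_{tgt\rightarrow ref}$ — notation introduced here only for illustration — the first forward flow coordinates presumably take the form:

$$F_{fwd}(p)=\tilde{p}_{tgt\rightarrow ref}(p)-p_{tgt}(p)$$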
类似的，获取该参考帧图像在图像坐标系下的第八坐标，以及前文所述的第六坐标，然后对该第六坐标和该第八坐标执行对应元素作差的处理，得到第一后向流坐标，如以下公式所示：Similarly, the eighth coordinates of the reference frame image in the image coordinate system and the aforementioned sixth coordinates are obtained; then the element-wise difference between the sixth coordinates and the eighth coordinates is computed to obtain the first backward flow coordinates, as shown in the following formula:
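With the eighth coordinates (the original pixel grid of the reference frame) written as $p_{ref}$ and the sixth coordinates (the reference pixels projected into the target view) as $\tilde{p}_{ref\rightarrow tgt}$, again as illustrative notation, the first backward flow coordinates presumably take the form:

$$F_{bwd}(p)=\tilde{p}_{ref\rightarrow tgt}(p)-p_{ref}(p)$$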
然后，以该第三坐标作为网格坐标，采用双线性采样机制对该第一后向流坐标进行仿射变换，以合成第二前向流坐标。在理想的情况下，合成的前向流坐标和计算得到的前向流坐标的大小相同，而方向相反，此为前向流一致性。Then, using the third coordinates as the grid coordinates, a bilinear sampling mechanism is used to affine-transform the first backward flow coordinates so as to synthesize the second forward flow coordinates. In the ideal case, the synthesized forward flow coordinates and the calculated forward flow coordinates have the same magnitude and opposite directions; this is forward flow consistency.
以该第六坐标作为网格坐标，采用双线性采样机制对该第一前向流坐标进行仿射变换，以合成第二后向流坐标。在理想的情况下，合成的后向流坐标和计算得到的后向流坐标的大小相同，而方向相反，此为后向流一致性。Using the sixth coordinates as the grid coordinates, a bilinear sampling mechanism is used to affine-transform the first forward flow coordinates so as to synthesize the second backward flow coordinates. In the ideal case, the synthesized backward flow coordinates and the calculated backward flow coordinates have the same magnitude and opposite directions; this is backward flow consistency.
接下来，可以根据该第一前向流坐标和该第二前向流坐标计算得到前向流遮挡掩码，该掩码用于衡量第一前向流坐标和第二前向流坐标之间的匹配程度，具体可以采用以下公式计算：Next, a forward flow occlusion mask can be calculated from the first forward flow coordinates and the second forward flow coordinates; this mask is used to measure the degree of matching between the first forward flow coordinates and the second forward flow coordinates, and can be calculated with the following formula:
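The exact mask expression is an assumption here; one plausible soft form, consistent with the requirement that the computed flow $F_{fwd}$ and the synthesized flow $\hat{F}_{fwd}$ be equal in magnitude and opposite in direction, is:

$$M_{fwd}(p)=\exp\!\left(-\alpha\,\frac{\left\|F_{fwd}(p)+\hat{F}_{fwd}(p)\right\|_{1}}{\left\|F_{fwd}(p)\right\|_{1}+\left\|\hat{F}_{fwd}(p)\right\|_{1}+\epsilon}\right)$$

where $\alpha$ and $\epsilon$ are hypothetical constants; the mask is close to 1 where the two flows are consistent and decreases towards 0 where they disagree, e.g. at occlusions.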
可以根据该第一后向流坐标和该第二后向流坐标计算得到后向流遮挡掩码，该掩码用于衡量第一后向流坐标和第二后向流坐标之间的匹配程度，具体可以采用以下公式计算：Similarly, a backward flow occlusion mask can be calculated from the first backward flow coordinates and the second backward flow coordinates; this mask is used to measure the degree of matching between the first backward flow coordinates and the second backward flow coordinates, and can be calculated with the following formula:
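Under the same assumption, the backward mask would mirror the forward one:

$$M_{bwd}(p)=\exp\!\left(-\alpha\,\frac{\left\|F_{bwd}(p)+\hat{F}_{bwd}(p)\right\|_{1}}{\left\|F_{bwd}(p)\right\|_{1}+\left\|\hat{F}_{bwd}(p)\right\|_{1}+\epsilon}\right)$$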
其中,各个参数的定义可以参照前文所述。The definition of each parameter may refer to the foregoing description.
在计算得到两个流遮挡掩码之后,所述根据所述第一图像重建损失和所述第二图像重建损失计算双向图像重建损失,可以包括:After two flow occlusion masks are obtained by calculation, calculating the bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss may include:
根据所述第一图像重建损失、所述第二图像重建损失、所述前向流遮挡掩码和所述后向流遮挡掩码计算所述双向图像重建损失。The bidirectional image reconstruction loss is calculated from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
遮挡掩码用于判断连续的视频帧中是否存在遮挡物体,将遮挡掩码添加到双向图像重建损失的计算中,能够提高该深度估计网络对带有遮挡物体的图像进行深度估计的准确率。The occlusion mask is used to determine whether there are occluding objects in consecutive video frames. Adding the occlusion mask to the calculation of the bidirectional image reconstruction loss can improve the depth estimation accuracy of the depth estimation network for images with occluded objects.
进一步的,所述方法还可以包括:Further, the method can also include:
(1)根据所述第二坐标确定所述目标帧图像的第一场景深度值;(1) determining the first scene depth value of the target frame image according to the second coordinate;
(2)根据所述第五坐标确定所述参考帧图像的第二场景深度值;(2) determining the second scene depth value of the reference frame image according to the fifth coordinate;
(3)获取所述第一场景深度图像中与所述第二坐标对应的像素点的第三场景深度值;(3) obtaining the third scene depth value of the pixel corresponding to the second coordinate in the first scene depth image;
(4)获取所述第二场景深度图像中与所述第五坐标对应的像素点的第四场景深度值;(4) acquiring the fourth scene depth value of the pixel corresponding to the fifth coordinate in the second scene depth image;
(5)基于所述第三坐标和所述第四场景深度值,通过双线性采样机制重建出所述目标帧图像的第五场景深度值;(5) based on the third coordinate and the fourth scene depth value, reconstruct the fifth scene depth value of the target frame image through a bilinear sampling mechanism;
(6)基于所述第六坐标和所述第三场景深度值,通过双线性采样机制重建出所述参考帧图像的第六场景深度值;(6) reconstructing the sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
(7)根据所述第一场景深度值和所述第五场景深度值计算前向场景结构一致性损失,所述前向场景结构一致性损失用于衡量通过多视图几何变换计算得到的所述目标帧图像的场景深度值与重建出的所述目标帧图像的场景深度值之间的差异;(7) Calculate the forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, and the forward scene structure consistency loss is used to measure the The difference between the scene depth value of the target frame image and the reconstructed scene depth value of the target frame image;
(8)根据所述第二场景深度值和所述第六场景深度值计算后向场景结构一致性损失,所述后向场景结构一致性损失用于衡量通过多视图几何变换计算得到的所述参考帧图像的场景深度值与重建出的所述参考帧图像的场景深度值之间的差异;(8) Calculate the backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, and the backward scene structure consistency loss is used to measure the the difference between the scene depth value of the reference frame image and the reconstructed scene depth value of the reference frame image;
(9)根据所述前向场景结构一致性损失和所述后向场景结构一致性损失,计算双向场景结构一致性损失。(9) Calculate the bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss.
上述步骤用于计算双向场景结构一致性损失。首先，根据前文所述的第二坐标可以得到对应场景的深度值（第一场景深度值）；根据前文所述的第五坐标可以得到对应场景的深度值（第二场景深度值）。然后，根据该第一场景深度图像，可以估计出目标帧图像I_tgt中对应图像坐标处的像素点的深度值d_tgt（第三场景深度值）；根据该第二场景深度图像，可以估计出参考帧图像I_ref中对应图像坐标处的像素点的深度值d_ref（第四场景深度值）。接着，基于第三坐标和深度值d_ref，可以通过双线性采样机制重建出目标帧图像的第五场景深度值；基于第六坐标和深度值d_tgt，可以通过双线性采样机制重建出参考帧图像的第六场景深度值。
The above steps are used to calculate the bidirectional scene structure consistency loss. First, from the aforementioned second coordinates, the depth value of the corresponding scene (the first scene depth value) can be obtained; from the aforementioned fifth coordinates, the depth value of the corresponding scene (the second scene depth value) can be obtained. Then, from the first scene depth image, the depth value d_tgt of the pixel at the corresponding image coordinates in the target frame image I_tgt (the third scene depth value) can be estimated; from the second scene depth image, the depth value d_ref of the pixel at the corresponding image coordinates in the reference frame image I_ref (the fourth scene depth value) can be estimated. Next, based on the third coordinates and the depth value d_ref, the fifth scene depth value of the target frame image can be reconstructed through the bilinear sampling mechanism; based on the sixth coordinates and the depth value d_tgt, the sixth scene depth value of the reference frame image can be reconstructed through the bilinear sampling mechanism.
在理论上，第一场景深度值和第五场景深度值应该相等，第二场景深度值和第六场景深度值应该相等。然而通过实验测试发现，它们之间并不总是相等，因此可以通过以下两个公式分别计算前向场景结构误差以及后向场景结构误差，进而对场景结构施加一致性约束：In theory, the first scene depth value and the fifth scene depth value should be equal, and the second scene depth value and the sixth scene depth value should be equal. However, experiments show that they are not always equal, so the forward scene structure error and the backward scene structure error can be calculated by the following two formulas, thereby imposing a consistency constraint on the scene structure:
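Writing the first and fifth scene depth values as $d_{tgt}^{geo}$ and $\hat{d}_{tgt}$, and the second and sixth as $d_{ref}^{geo}$ and $\hat{d}_{ref}$ (illustrative notation), one plausible normalized form of the two errors, similar to the geometry-consistency terms used in related self-supervised methods and not necessarily the patent's exact expressions, is:

$$e_{fwd}(p)=\frac{\left|d_{tgt}^{geo}(p)-\hat{d}_{tgt}(p)\right|}{d_{tgt}^{geo}(p)+\hat{d}_{tgt}(p)},\qquad e_{bwd}(p)=\frac{\left|d_{ref}^{geo}(p)-\hat{d}_{ref}(p)\right|}{d_{ref}^{geo}(p)+\hat{d}_{ref}(p)}$$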
通过对场景结构施加一致性约束，可以定位出图像场景中的运动对象和遮挡物体的位置。例如，前向场景结构误差和后向场景结构误差的值越大的位置，表示该位置越可能存在运动对象和遮挡物体。By imposing consistency constraints on the scene structure, the positions of moving objects and occluding objects in the image scene can be located. For example, positions where the forward and backward scene structure errors take larger values are more likely to contain moving objects and occluding objects.
然后，计算前向场景结构一致性损失，其用于衡量通过多视图几何变换计算得到的目标帧图像的场景深度值与重建出的目标帧图像的场景深度值之间的差异，具体可以采用以下公式计算：Then, the forward scene structure consistency loss is calculated, which is used to measure the difference between the scene depth value of the target frame image calculated through the multi-view geometric transformation and the reconstructed scene depth value of the target frame image; it can be calculated with the following formula:
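A plausible form, assuming a simple average of the forward scene structure error over the valid grid coordinates:

$$L_{dsc}^{fwd}=\frac{1}{N_{ref}}\sum_{p}e_{fwd}(p)$$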
其中，N_ref表示参考帧图像中有效网格坐标的数量。where N_ref represents the number of valid grid coordinates in the reference frame image.
计算后向场景结构一致性损失，其用于衡量通过多视图几何变换计算得到的参考帧图像的场景深度值与重建出的参考帧图像的场景深度值之间的差异，具体可以采用以下公式计算：The backward scene structure consistency loss is calculated in the same way; it is used to measure the difference between the scene depth value of the reference frame image calculated through the multi-view geometric transformation and the reconstructed scene depth value of the reference frame image, and can be calculated with the following formula:
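Under the same assumption, the backward loss averages the backward scene structure error:

$$L_{dsc}^{bwd}=\frac{1}{N_{tgt}}\sum_{p}e_{bwd}(p)$$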
其中，N_tgt表示目标帧图像中有效网格坐标的数量。where N_tgt represents the number of valid grid coordinates in the target frame image.
最后,根据前向场景结构一致性损失和后向场景结构一致性损失,可以计算双向场景结构一致性损失如下:Finally, according to the forward scene structure consistency loss and the backward scene structure consistency loss, the bidirectional scene structure consistency loss can be calculated as follows:
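Assumed here to be simply the sum of the two directional terms (the patent may instead use their average):

$$L_{dsc}=L_{dsc}^{fwd}+L_{dsc}^{bwd}$$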
所述基于所述双向图像重建损失构建所述目标函数,可以包括:The constructing the objective function based on the bidirectional image reconstruction loss may include:
基于所述双向图像重建损失和所述双向场景结构一致性损失,构建得到所述目标函数。The objective function is constructed and obtained based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
在构建目标函数时,添加双向场景结构一致性损失,能够有效处理待测图像场景中的遮挡物体与运动对象,从而提高场景深度估计的准确率。When constructing the objective function, adding the loss of bidirectional scene structure consistency can effectively deal with occluded objects and moving objects in the scene of the image to be tested, thereby improving the accuracy of scene depth estimation.
另一方面,在计算双向图像重建损失时,可以同时引入前文所述的两个遮挡掩码以及两个场景结构误差,例如可以采用以下公式计算:On the other hand, when calculating the bidirectional image reconstruction loss, the two occlusion masks and the two scene structure errors mentioned above can be introduced at the same time. For example, the following formula can be used to calculate:
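A plausible form of this weighted loss, using the occlusion masks $M_{fwd}, M_{bwd}$ and the scene-structure-inconsistency weights $(1-e_{fwd}), (1-e_{bwd})$ named in the following sentences (an assumption, not necessarily the patent's exact expression):

$$L_{photo}=\frac{1}{N}\sum_{p}\Big[M_{fwd}(p)\,\big(1-e_{fwd}(p)\big)\,\rho\big(I_{tgt}(p),\hat{I}_{tgt}(p)\big)+M_{bwd}(p)\,\big(1-e_{bwd}(p)\big)\,\rho\big(I_{ref}(p),\hat{I}_{ref}(p)\big)\Big]$$

where $\rho(\cdot,\cdot)$ is the per-pixel photometric error used in the forward and backward image reconstruction losses.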
通过使用遮挡掩码和场景结构不一致性权重对图像重建损失函数进行加权处理，能够达到处理遮挡和运动对象的目的。具体的，使用所述前向流遮挡掩码和所述前向场景结构不一致性权重对第一图像重建损失进行加权处理；使用所述后向流遮挡掩码和所述后向场景结构不一致性权重对第二图像重建损失进行加权处理；基于加权处理后的第一图像重建损失和加权处理后的第二图像重建损失构建所述双向图像重建损失。By weighting the image reconstruction loss function with the occlusion masks and the scene structure inconsistency weights, occlusion and moving objects can be handled. Specifically, the first image reconstruction loss is weighted using the forward flow occlusion mask and the forward scene structure inconsistency weight; the second image reconstruction loss is weighted using the backward flow occlusion mask and the backward scene structure inconsistency weight; and the bidirectional image reconstruction loss is constructed based on the weighted first image reconstruction loss and the weighted second image reconstruction loss.
在本申请的一个实施例中,所述深度估计网络包括编码网络,所述方法还可以包括:In an embodiment of the present application, the depth estimation network includes an encoding network, and the method may further include:
(1)通过所述编码网络获取所述目标帧图像的第一特征图像以及所述参考帧图像的第二特征图像;(1) obtaining the first characteristic image of the target frame image and the second characteristic image of the reference frame image through the encoding network;
(2)基于所述第三坐标和所述第二特征图像,通过双线性采样机制重建出所述目标帧图像的第三特征图像;(2) based on the third coordinate and the second feature image, reconstruct the third feature image of the target frame image through a bilinear sampling mechanism;
(3)基于所述第六坐标和所述第一特征图像,通过双线性采样机制重建出所述参考帧图像的第四特征图像;(3) based on the sixth coordinate and the first feature image, reconstruct the fourth feature image of the reference frame image through a bilinear sampling mechanism;
(4)根据所述第一特征图像、所述第二特征图像、所述第三特征图像和所述第四特征图像，计算得到双向特征感知损失，所述双向特征感知损失用于衡量通过编码网络获得的所述目标帧图像的特征图像与重建出的所述目标帧图像的特征图像之间的差异，以及通过编码网络获得的所述参考帧图像的特征图像与重建出的所述参考帧图像的特征图像之间的差异。(4) Calculate the bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, where the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, as well as the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image.
上述步骤用于计算双向特征感知损失。相比于原始RGB图像，经过编码器提取的特征在弱纹理区域具有更好的区分性。本申请利用编码网络提取的最高分辨率特征图来处理弱纹理区域，通过该深度估计网络中的编码网络，可以提取目标帧图像的特征图像f_tgt（第一特征图像）和参考帧图像的特征图像f_ref（第二特征图像）。然后，基于前文所述的第三坐标以及该参考帧图像的特征图像f_ref，可以通过双线性采样机制将该特征图像f_ref进行仿射变换，重建出该目标帧图像的第三特征图像；基于前文所述的第六坐标以及该目标帧图像的特征图像f_tgt，可以通过双线性采样机制将该特征图像f_tgt进行仿射变换，重建出该参考帧图像的第四特征图像。接着，可以采用以下公式计算得到双向特征感知损失：
The above steps are used to calculate the bidirectional feature perception loss. Compared with the original RGB image, the features extracted by the encoder are more discriminative in weakly textured regions. The present application uses the highest-resolution feature map extracted by the encoding network to handle weakly textured regions: through the encoding network in the depth estimation network, the feature image f_tgt of the target frame image (the first feature image) and the feature image f_ref of the reference frame image (the second feature image) can be extracted. Then, based on the aforementioned third coordinates and the feature image f_ref of the reference frame image, the feature image f_ref can be affine-transformed through the bilinear sampling mechanism to reconstruct the third feature image of the target frame image; based on the aforementioned sixth coordinates and the feature image f_tgt of the target frame image, the feature image f_tgt can be affine-transformed through the bilinear sampling mechanism to reconstruct the fourth feature image of the reference frame image. The bidirectional feature perception loss can then be calculated with the following formula:
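A plausible form, assuming a per-pixel L1 difference between the encoder features and their warped reconstructions (the patent's exact expression may differ):

$$L_{feat}=\frac{1}{N}\sum_{p}\Big(\left\|f_{tgt}(p)-\hat{f}_{tgt}(p)\right\|_{1}+\left\|f_{ref}(p)-\hat{f}_{ref}(p)\right\|_{1}\Big)$$

where $\hat{f}_{tgt}$ and $\hat{f}_{ref}$ denote the reconstructed third and fourth feature images.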
双向特征感知损失L_feat用于衡量通过编码网络获得的目标帧图像的特征图像与重建出的目标帧图像的特征图像之间的差异，以及通过编码网络获得的参考帧图像的特征图像与重建出的参考帧图像的特征图像之间的差异。The bidirectional feature perception loss L_feat is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, as well as the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image.
所述基于所述双向图像重建损失构建所述目标函数,可以包括:The constructing the objective function based on the bidirectional image reconstruction loss may include:
基于所述双向图像重建损失和所述双向特征感知损失,构建得到所述目标函数。The objective function is constructed based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
通过在目标函数中引入双向特征感知损失,能够有效处理待测图像中的弱纹理场景,从而提高场景深度估计的准确率。By introducing the bidirectional feature perception loss into the objective function, the weak texture scene in the image to be tested can be effectively processed, thereby improving the accuracy of scene depth estimation.
进一步的,在通过所述编码网络获取所述目标帧图像的第一特征图像以及所述参考帧图像的第二特征图像之后,还可以包括:Further, after obtaining the first feature image of the target frame image and the second feature image of the reference frame image through the encoding network, the method may further include:
根据所述目标帧图像、所述参考帧图像、所述第一场景深度图像、所述第二场景深度图像、所述第一特征图像和所述第二特征图像，计算得到平滑损失，所述平滑损失用于正则化通过所述深度估计网络获得的场景深度图像和特征图像的梯度。According to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, a smoothing loss is calculated; the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained through the depth estimation network.
所述基于所述双向图像重建损失和所述双向特征感知损失,构建得到所述目标函数,可以包括:The constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss may include:
基于所述双向图像重建损失、所述双向特征感知损失和所述平滑损失,构建得到所述目标函数。The objective function is constructed based on the bidirectional image reconstruction loss, the bidirectional feature perception loss, and the smoothing loss.
为了正则化通过所述深度估计网络获得的场景深度图像和特征图像的梯度，可以在目标函数中引入平滑损失L_s，具体可以采用以下公式计算：In order to regularize the gradients of the scene depth images and feature images obtained through the depth estimation network, a smoothing loss L_s can be introduced into the objective function, which can be calculated with the following formula:
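A plausible edge-aware form, consistent with the parameter descriptions that follow (partial derivatives of the estimated depth maps and feature maps, weighted by a natural exponential of the corresponding image gradients; this is an assumption and the patent's exact expression may differ):

$$L_{s}=\sum_{p}\Big(\left|\partial d_{tgt}\right|e^{-\left|\partial I_{tgt}\right|}+\left|\partial d_{ref}\right|e^{-\left|\partial I_{ref}\right|}+\left|\partial f_{tgt}\right|e^{-\left|\partial I_{tgt}\right|}+\left|\partial f_{ref}\right|e^{-\left|\partial I_{ref}\right|}\Big)$$

where $\partial$ denotes partial derivatives along both image directions.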
其中，|∂d_ref|表示对深度估计网络估计的参考帧深度图d_ref计算偏导数，然后在每个元素位置处取绝对值；|∂I_ref|表示对参考帧图像I_ref计算偏导数，然后在每个元素位置处取绝对值；e^(-|∂I_ref|)表示以-|∂I_ref|为幂的自然指数，以此类推。where |∂d_ref| denotes taking the partial derivatives of the reference frame depth map d_ref estimated by the depth estimation network and then the absolute value at each element position, |∂I_ref| denotes taking the partial derivatives of the reference frame image I_ref and then the absolute value at each element position, and e^(-|∂I_ref|) denotes the natural exponential with -|∂I_ref| as the exponent, and so on.
前文提出了四种类型的损失函数,分别为双向图像重建损失、平滑损失、双向场景结构一致性损失以及双向特征感知损失,可以基于这几个损失函数构建得到最终的目标函数。例如,某个目标函数L的表达式如下:Four types of loss functions are proposed in the previous section, namely bidirectional image reconstruction loss, smoothing loss, bidirectional scene structure consistency loss, and bidirectional feature perception loss. The final objective function can be constructed based on these loss functions. For example, the expression of an objective function L is as follows:
L = λ_photo·L_photo + λ_s·L_s + λ_dsc·L_dsc + λ_feat·L_feat
其中，各个λ为设定的系数，例如可以为λ_photo=1.0，λ_s=0.001，λ_dsc=0.5，λ_feat=0.05。Wherein, each λ is a preset coefficient, for example λ_photo = 1.0, λ_s = 0.001, λ_dsc = 0.5, λ_feat = 0.05.
另外，在前文所述计算各个损失函数的过程中，举例说明的是单个参考帧图像的计算结果，而若参考帧图像有多个，则每个参考帧图像都可以采用前文所述相同的方式计算得到对应的损失值，最后可以用这些参考帧图像对应损失值的平均值作为最后构建目标函数时采用的损失值。In addition, in the above process of calculating each loss function, the calculation for a single reference frame image is given as an example; if there are multiple reference frame images, the corresponding loss value of each reference frame image can be calculated in the same way as described above, and the average of the loss values corresponding to these reference frame images can then be used as the loss value adopted when constructing the final objective function.
2.7、根据所述目标函数更新所述深度估计网络的参数。2.7. Update the parameters of the depth estimation network according to the objective function.
在构建出目标函数之后，可以根据该目标函数更新该深度估计网络的参数，以达到优化和训练网络的目的。具体的，可以利用AdamW优化器求解出该目标函数相对于深度估计网络权重的梯度，并以此梯度来更新深度估计网络的权重，如此不断迭代，直至达到设定的最大迭代次数，完成该深度估计网络的训练。After the objective function is constructed, the parameters of the depth estimation network can be updated according to the objective function, so as to optimize and train the network. Specifically, the AdamW optimizer can be used to compute the gradient of the objective function with respect to the weights of the depth estimation network, and this gradient is used to update the weights of the depth estimation network; this is iterated until the preset maximum number of iterations is reached, completing the training of the depth estimation network.
进一步的，该目标函数可以同时用来对前文所述的相机姿态估计网络进行训练。同样的，可以利用AdamW优化器求解出该目标函数相对于相机姿态估计网络权重的梯度，并以此梯度来更新相机姿态估计网络的权重，如此不断迭代，直至达到设定的最大迭代次数，完成该相机姿态估计网络的训练。总的来说，在构建好目标函数后，可以用该目标函数作为监督信号来联合指导深度估计网络和相机姿态估计网络的训练。具体的，可以利用AdamW优化器求解出该目标函数相对于深度估计网络权重的梯度以及该目标函数相对于相机姿态估计网络权重的梯度，并以此梯度来同时更新深度估计网络和相机姿态估计网络的权重，如此不断迭代，直至达到设定的最大迭代次数，完成深度估计网络和相机姿态估计网络的联合训练。Further, the objective function can also be used to train the camera pose estimation network described above. Likewise, the AdamW optimizer can be used to compute the gradient of the objective function with respect to the weights of the camera pose estimation network, and this gradient is used to update the weights of the camera pose estimation network; this is iterated until the preset maximum number of iterations is reached, completing the training of the camera pose estimation network. In general, after the objective function is constructed, it can be used as a supervision signal to jointly guide the training of the depth estimation network and the camera pose estimation network. Specifically, the AdamW optimizer can be used to compute the gradients of the objective function with respect to the weights of the depth estimation network and with respect to the weights of the camera pose estimation network, and these gradients are used to simultaneously update the weights of the two networks; this is iterated until the preset maximum number of iterations is reached, completing the joint training of the depth estimation network and the camera pose estimation network.
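As an illustration of this joint optimisation, the following PyTorch-style sketch shows one training loop consistent with the description above. It is not the patent's code: the data-loader format, the network call signatures and build_objective (standing in for the objective function L) are all hypothetical.

```python
# Minimal sketch of joint AdamW training of the depth and pose networks.
import torch

def train(depth_net, pose_net, loader, build_objective, max_iters, lr=1e-4):
    params = list(depth_net.parameters()) + list(pose_net.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    it = 0
    while it < max_iters:
        for tgt, refs, K in loader:                  # target frame, reference frames, intrinsics
            depth_tgt = depth_net(tgt)               # first scene depth image
            depths_ref = [depth_net(r) for r in refs]
            poses = pose_net(tgt, refs)              # camera pose vectors
            loss = build_objective(tgt, refs, depth_tgt, depths_ref, poses, K)
            optimizer.zero_grad()
            loss.backward()                          # gradients w.r.t. both networks
            optimizer.step()                         # simultaneous weight update
            it += 1
            if it >= max_iters:
                return
```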
在完成两个网络的训练后,就可以使用单目图像(例如待测图像)作为该 深度估计网络的输入,直接计算出对应的场景深度图像。也可以使用连续的图像序列(例如任意连续的5帧单目图像)作为该相机姿态估计网络的输入,计算得到对应的相机姿态向量。需要说明的是,深度估计网络和相机姿态估计网络仅在训练期间需要联合优化,训练完成后网络的权重即固定了,测试期间不需要进行反向传播,只需要进行前向传播即可,因而测试期间两个网络可以单独使用。After completing the training of the two networks, the monocular image (such as the image to be tested) can be used as the input of the depth estimation network to directly calculate the corresponding scene depth image. It is also possible to use a continuous image sequence (for example, any continuous 5 frames of monocular images) as the input of the camera pose estimation network, and calculate the corresponding camera pose vector. It should be noted that the depth estimation network and the camera pose estimation network only need to be jointly optimized during the training period. After the training is completed, the weight of the network is fixed. Backpropagation is not required during the test period, only forward propagation is required. Therefore, Both networks can be used independently during testing.
本申请实施例采用的深度估计网络在优化更新参数时，会结合相机姿态估计网络预测输入的样本图像序列的相机姿态向量，该样本图像序列包含目标帧图像和参考帧图像；然后，根据该深度估计网络预测得到的该目标帧图像的场景深度图像、该相机姿态向量、该参考帧图像和对应相机的内参生成与该目标帧图像对应的重建图像；接着，根据该目标帧图像和该重建图像计算得到重建图像时对应的损失函数，最后基于该损失函数构建目标函数并基于该目标函数更新该深度估计网络的参数。通过这样设置，能够充分挖掘目标帧图像和参考帧图像包含的潜在图像信息，也即采样较少的样本图像即可获得足够的图像信息以完成该深度估计网络的训练，从而降低样本数据采集的成本。When optimizing and updating its parameters, the depth estimation network adopted in the embodiments of the present application works together with the camera pose estimation network, which predicts the camera pose vector of the input sample image sequence; the sample image sequence includes the target frame image and the reference frame image. Then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the internal parameters of the corresponding camera. Next, the loss function corresponding to the reconstruction is calculated according to the target frame image and the reconstructed image; finally, an objective function is constructed based on this loss function, and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the latent image information contained in the target frame image and the reference frame image can be fully exploited, that is, sufficient image information can be obtained from fewer sampled images to complete the training of the depth estimation network, thereby reducing the cost of sample data collection.
另外，通过在目标函数中添加双向的图像重建损失、双向场景结构一致性损失、双向特征感知损失和平滑损失，能够进一步挖掘图像中包含的潜在信息，降低样本数据的采集成本，并能够有效地处理视频帧中存在的运动对象、遮挡等问题以及提高对弱纹理环境的鲁棒性。In addition, by adding the bidirectional image reconstruction loss, the bidirectional scene structure consistency loss, the bidirectional feature perception loss and the smoothing loss to the objective function, the latent information contained in the images can be further exploited, the cost of sample data collection can be reduced, problems such as moving objects and occlusion in video frames can be handled effectively, and the robustness to weakly textured environments can be improved.
以下内容为通过仿真结果来说明本申请提出的图像场景深度估计以及相机姿态估计的技术效果。其中,采用Eigen划分的测试集来作为深度估计网络的评估数据,使用KITTI Odometry数据集中的09-10序列作为相机姿态估计网络的评估数据。The following content is to illustrate the technical effects of the image scene depth estimation and camera pose estimation proposed in the present application through simulation results. Among them, the test set divided by Eigen is used as the evaluation data of the depth estimation network, and the 09-10 sequence in the KITTI Odometry data set is used as the evaluation data of the camera pose estimation network.
深度估计网络采用的评估标准包括：绝对误差(AbsRel)、均方根误差(Rmse)、均方误差(SqRel)、对数均方根误差(Rmselog)和阈值(δ_t)；相机姿态估计网络采用的评估指标为绝对轨迹误差(ATE)。经过仿真测试，本申请提出的方法与现有技术的算法进行比较的测试结果如以下的表1-表3所示。The evaluation criteria used by the depth estimation network include: absolute error (AbsRel), root mean square error (Rmse), mean square error (SqRel), logarithmic root mean square error (Rmselog) and threshold (δ_t); the evaluation metric used by the camera pose estimation network is the absolute trajectory error (ATE). After simulation tests, the results of comparing the method proposed in this application with prior-art algorithms are shown in Tables 1 to 3 below.
表1Table 1
表1为在80m的深度范围内，对单目图像进行场景深度预测的结果对比。其中，绝对误差(AbsRel)、均方根误差(Rmse)、均方误差(SqRel)、对数均方根误差(Rmselog)的绝对值表示算法误差值，用于衡量算法的精度，误差值越小表示精度越高；阈值(δ_t)表示预测的场景深度与真实值的接近程度，阈值越高表示算法稳定性越好。通过表1中的测试结果可以发现，本申请提出的方法与现有技术的算法相比，能够获得更高的场景深度预测精度，以及更好的算法稳定性。Table 1 compares the scene depth prediction results for monocular images within a depth range of 80 m. The absolute values of the absolute error (AbsRel), root mean square error (Rmse), mean square error (SqRel) and logarithmic root mean square error (Rmselog) represent the algorithm error and are used to measure the accuracy of the algorithm; the smaller the error value, the higher the accuracy. The threshold (δ_t) represents how close the predicted scene depth is to the true value; the higher the threshold, the better the stability of the algorithm. From the test results in Table 1 it can be seen that, compared with prior-art algorithms, the method proposed in this application achieves higher scene depth prediction accuracy and better algorithm stability.
表2Table 2
表2为在50m的深度范围内,对单目图像进行场景深度预测的结果对比。通过表2中的测试结果同样可以发现,本申请提出的方法与现有技术的算法相比,能够获得更高的场景深度预测精度,以及更好的算法稳定性,因此能够更加鲁棒的预测出单目图像的场景深度和更多细节。Table 2 is a comparison of the results of scene depth prediction for monocular images in the depth range of 50m. From the test results in Table 2, it can also be found that, compared with the algorithm in the prior art, the method proposed in this application can obtain higher scene depth prediction accuracy and better algorithm stability, so it can predict more robustly Scene depth and more detail from monocular images.
表3table 3
表3中的绝对轨迹误差(ATE)表示相机位姿的真实值与预测的相机位姿之间的差值,误差值越小,预测的相机位姿越准确。仿真结果表明,同现有的各类算法相比,本申请提出的相机姿态估计的方法预测的相机位姿更准确。The absolute trajectory error (ATE) in Table 3 represents the difference between the true value of the camera pose and the predicted camera pose. The smaller the error value, the more accurate the predicted camera pose. The simulation results show that, compared with various existing algorithms, the camera pose estimation method proposed in the present application predicts the camera pose more accurately.
另外，图6是本申请提出的图像场景深度的估计方法和现有技术的各类算法分别进行单目图像深度预测的结果对比图，其中Ground Truth深度图是通过可视化激光雷达数据获得的深度图。In addition, FIG. 6 is a comparison diagram of the monocular image depth prediction results of the method for estimating the depth of an image scene proposed by the present application and various prior-art algorithms, where the Ground Truth depth map is a depth map obtained by visualizing lidar data.
应理解,上述各个实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the sequence numbers of the steps in the above-mentioned various embodiments does not mean the order of execution, and the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. .
上面主要描述了一种图像场景深度的估计方法,下面将对一种图像场景深度的估计装置进行描述。A method for estimating the depth of an image scene is mainly described above, and an apparatus for estimating the depth of an image scene will be described below.
请参阅图7,本申请实施例中一种图像场景深度的估计装置的一个实施例包括:Referring to FIG. 7 , an embodiment of an apparatus for estimating the depth of an image scene in an embodiment of the present application includes:
待测图像获取模块701,用于获取待测图像;an image to be measured acquisition module 701, configured to acquire an image to be measured;
场景深度估计模块702,用于将所述待测图像输入预先构建的深度估计网络,得到所述待测图像的场景深度图像;a scene depth estimation module 702, configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested;
样本获取模块703,用于获取样本图像序列,所述样本图像序列包含目标帧图像和参考帧图像,所述参考帧图像为所述样本图像序列中处于所述目标帧图像之前或之后的一帧以上的图像;A sample acquisition module 703, configured to acquire a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is a frame before or after the target frame image in the sample image sequence image above;
第一场景深度预测模块704,用于将所述目标帧图像输入所述深度估计网络,得到预测的第一场景深度图像;A first scene depth prediction module 704, configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image;
相机姿态估计模块705,用于将所述样本图像序列输入预先构建的相机姿态估计网络,得到预测的所述目标帧图像和所述参考帧图像之间的相机姿态向量;A camera pose estimation module 705, configured to input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image;
第一图像重建模块706,用于根据所述第一场景深度图像、所述相机姿态向量、所述参考帧图像以及拍摄所述样本图像序列采用的相机的内参,生成与所述目标帧图像对应的第一重建图像;A first image reconstruction module 706, configured to generate images corresponding to the target frame according to the depth image of the first scene, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence The first reconstructed image of ;
第一图像重建损失计算模块707,用于根据所述目标帧图像和所述第一重建图像计算第一图像重建损失,所述第一图像重建损失用于衡量所述目标帧图像和所述第一重建图像之间的差异;The first image reconstruction loss calculation module 707 is configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, and the first image reconstruction loss is used to measure the target frame image and the first image reconstruction loss. a difference between the reconstructed images;
目标函数构建模块708,用于基于所述第一图像重建损失构建目标函数;an objective function construction module 708, configured to construct an objective function based on the first image reconstruction loss;
网络参数更新模块709,用于根据所述目标函数更新所述深度估计网络的参数。A network parameter updating module 709, configured to update the parameters of the depth estimation network according to the objective function.
在本申请的一个实施例中,所述装置还可以包括:In an embodiment of the present application, the apparatus may further include:
第二场景深度预测模块,用于将所述参考帧图像输入所述深度估计网络,得到预测的第二场景深度图像;A second scene depth prediction module, configured to input the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
第二图像重建模块,用于根据所述第二场景深度图像、所述相机姿态向量、所述目标帧图像以及拍摄所述样本图像序列采用的相机的内参,生成与所述参考帧图像对应的第二重建图像;A second image reconstruction module, configured to generate an image corresponding to the reference frame image according to the depth image of the second scene, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence the second reconstructed image;
第二图像重建损失计算模块,用于根据所述参考帧图像和所述第二重建图像计算第二图像重建损失,所述第二图像重建损失用于衡量所述参考帧图像和所述第二重建图像之间的差异;A second image reconstruction loss calculation module, configured to calculate a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure the reference frame image and the second image reconstruction loss difference between reconstructed images;
所述目标函数构建模块可以包括:The objective function building block may include:
双向图像重建损失计算单元,用于根据所述第一图像重建损失和所述第二图像重建损失,计算双向图像重建损失;a bidirectional image reconstruction loss calculation unit, configured to calculate the bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss;
目标函数构建单元,用于基于所述双向图像重建损失构建所述目标函数。an objective function construction unit, configured to construct the objective function based on the bidirectional image reconstruction loss.
进一步的,所述第一图像重建模块可以包括:Further, the first image reconstruction module may include:
第一变换矩阵确定单元,用于根据所述相机姿态向量确定所述目标帧图像转换到所述参考帧图像的第一变换矩阵;a first transformation matrix determining unit, configured to determine a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
第一坐标计算单元,用于根据所述相机的内参和所述第一场景深度图像,计算所述目标帧图像在世界坐标系下的第一坐标;a first coordinate calculation unit, configured to calculate the first coordinate of the target frame image in the world coordinate system according to the camera's internal parameters and the first scene depth image;
第一坐标变换单元,用于基于所述第一变换矩阵对所述第一坐标进行变换,得到所述目标帧图像经转换后在世界坐标系下的第二坐标;a first coordinate transformation unit, configured to transform the first coordinate based on the first transformation matrix to obtain the second coordinate of the target frame image after the transformation in the world coordinate system;
第一坐标转换单元,用于将所述第二坐标转换为在图像坐标系下的第三坐标;a first coordinate conversion unit, configured to convert the second coordinate into a third coordinate in the image coordinate system;
第一图像重建单元,用于基于所述参考帧图像,以所述第三坐标作为网格点,通过双线性采样机制重建出所述参考帧图像经仿射变换后的图像,并将重建得到的图像确定为所述第一重建图像;A first image reconstruction unit, configured to reconstruct an affine transformed image of the reference frame image based on the reference frame image and use the third coordinate as a grid point through a bilinear sampling mechanism, and reconstruct the image after affine transformation. The obtained image is determined as the first reconstructed image;
所述第二图像重建模块可以包括:The second image reconstruction module may include:
第二变换矩阵确定单元,用于根据所述相机姿态向量确定所述参考帧图像转换到所述目标帧图像的第二变换矩阵;a second transformation matrix determining unit, configured to determine a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
第二坐标计算单元,用于根据所述相机的内参和所述第二场景深度图像,计算所述参考帧图像在世界坐标系下的第四坐标;a second coordinate calculation unit, configured to calculate the fourth coordinate of the reference frame image in the world coordinate system according to the internal reference of the camera and the depth image of the second scene;
第二坐标变换单元,用于基于所述第二变换矩阵对所述第四坐标进行变换,得到所述参考帧图像经转换后在世界坐标系下的第五坐标;a second coordinate transformation unit, configured to transform the fourth coordinate based on the second transformation matrix to obtain the transformed fifth coordinate of the reference frame image in the world coordinate system;
第二坐标转换单元,用于将所述第五坐标转换为在图像坐标系下的第六坐标;a second coordinate conversion unit, configured to convert the fifth coordinate into a sixth coordinate in the image coordinate system;
第二图像重建单元,用于基于所述目标帧图像,以所述第六坐标作为网格点,通过双线性采样机制重建出所述目标帧图像经仿射变换后的图像,并将重建得到的图像确定为所述第二重建图像。The second image reconstruction unit is configured to, based on the target frame image, take the sixth coordinate as a grid point, reconstruct the affine transformed image of the target frame image through a bilinear sampling mechanism, and reconstruct an image of the target frame image after affine transformation. The obtained image is determined as the second reconstructed image.
在本申请的一个实施例中,所述装置还可以包括:In an embodiment of the present application, the apparatus may further include:
第一坐标获取模块,用于获取所述目标帧图像在图像坐标系下的第七坐标;a first coordinate obtaining module, used for obtaining the seventh coordinate of the target frame image in the image coordinate system;
前向流坐标确定模块,用于对所述第三坐标和所述第七坐标执行对应元素作差的处理,得到第一前向流坐标;a forward flow coordinate determination module, configured to perform a difference processing of corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
第二坐标获取模块,用于获取所述参考帧图像在图像坐标系下的第八坐标;A second coordinate obtaining module, configured to obtain the eighth coordinate of the reference frame image in the image coordinate system;
后向流坐标确定模块,用于对所述第六坐标和所述第八坐标执行对应元素作差的处理,得到第一后向流坐标;a backward flow coordinate determination module, configured to perform a difference processing of corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
前向流坐标合成模块,用于以所述第三坐标作为网格点,采用双线性采样机制对所述第一后向流坐标进行仿射变换,以合成第二前向流坐标;a forward flow coordinate synthesis module, configured to use the third coordinate as a grid point and perform affine transformation on the first backward flow coordinate using a bilinear sampling mechanism to synthesize the second forward flow coordinate;
后向流坐标合成模块,用于以所述第六坐标作为网格点,采用双线性采样机制对所述第一前向流坐标进行仿射变换,以合成第二后向流坐标;a backward flow coordinate synthesis module, configured to use the sixth coordinate as a grid point, and use a bilinear sampling mechanism to perform affine transformation on the first forward flow coordinate to synthesize the second backward flow coordinate;
前向流遮挡掩码计算模块,用于根据所述第一前向流坐标和所述第二前向流坐标计算前向流遮挡掩码,所述前向流遮挡掩码用于衡量所述第一前向流坐标和所述第二前向流坐标之间的匹配程度;A forward flow occlusion mask calculation module, configured to calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, and the forward flow occlusion mask is used to measure the the degree of matching between the first forward flow coordinates and the second forward flow coordinates;
后向流遮挡掩码计算模块,用于根据所述第一后向流坐标和所述第二后向流坐标计算后向流遮挡掩码,所述后向流遮挡掩码用于衡量所述第一后向流坐标和所述第二后向流坐标之间的匹配程度;A backward flow occlusion mask calculation module, configured to calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, and the backward flow occlusion mask is used to measure the the degree of matching between the first backward flow coordinates and the second backward flow coordinates;
双向图像重建损失计算单元具体可以用于:根据所述第一图像重建损失、所述第二图像重建损失、所述前向流遮挡掩码和所述后向流遮挡掩码计算所述双向图像重建损失。The bidirectional image reconstruction loss calculation unit may be specifically configured to: calculate the bidirectional image according to the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask and the backward flow occlusion mask reconstruction loss.
在本申请的一个实施例中,所述装置还可以包括:In an embodiment of the present application, the apparatus may further include:
第一场景深度值确定模块,用于根据所述第二坐标确定所述目标帧图像的第一场景深度值;a first scene depth value determination module, configured to determine the first scene depth value of the target frame image according to the second coordinates;
第二场景深度值确定模块,用于根据所述第五坐标确定所述参考帧图像的第二场景深度值;A second scene depth value determining module, configured to determine the second scene depth value of the reference frame image according to the fifth coordinate;
第三场景深度值确定模块,用于获取所述第一场景深度图像中与所述第二坐标对应的像素点的第三场景深度值;A third scene depth value determination module, configured to obtain a third scene depth value of the pixel point corresponding to the second coordinate in the first scene depth image;
第四场景深度值确定模块,用于获取所述第二场景深度图像中与所述第五坐标对应的像素点的第四场景深度值;a fourth scene depth value determination module, configured to acquire the fourth scene depth value of the pixel point corresponding to the fifth coordinate in the second scene depth image;
第一场景深度值重建模块,用于基于所述第三坐标和所述第四场景深度值,通过双线性采样机制重建出所述目标帧图像的第五场景深度值;a first scene depth value reconstruction module, configured to reconstruct a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
第二场景深度值重建模块,用于基于所述第六坐标和所述第三场景深度值,通过双线性采样机制重建出所述参考帧图像的第六场景深度值;A second scene depth value reconstruction module, configured to reconstruct the sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
前向场景结构一致性损失计算模块,用于根据所述第一场景深度值和所述第五场景深度值计算前向场景结构一致性损失,所述前向场景结构一致性损失用于衡量通过多视图几何变换计算得到的所述目标帧图像的场景深度值与重建出的所述目标帧图像的场景深度值之间的差异;The forward scene structure consistency loss calculation module is used to calculate the forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, and the forward scene structure consistency loss is used to measure the pass The difference between the scene depth value of the target frame image obtained by the multi-view geometric transformation calculation and the scene depth value of the reconstructed target frame image;
后向场景结构一致性损失计算模块,用于根据所述第二场景深度值和所述第六场景深度值计算后向场景结构一致性损失,所述后向场景结构一致性损失用于衡量通过多视图几何变换计算得到的所述参考帧图像的场景深度值与重建出的所述参考帧图像的场景深度值之间的差异;The backward scene structure consistency loss calculation module is used to calculate the backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, and the backward scene structure consistency loss is used to measure the pass the difference between the scene depth value of the reference frame image obtained by the multi-view geometric transformation calculation and the scene depth value of the reconstructed reference frame image;
双向场景结构一致性损失计算模块,用于根据所述前向场景结构一致性损失和所述后向场景结构一致性损失,计算双向场景结构一致性损失;a bidirectional scene structure consistency loss calculation module, configured to calculate the bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss;
所述目标函数构建单元具体可以用于:基于所述双向图像重建损失和所述双向场景结构一致性损失,构建得到所述目标函数。The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
在本申请的一个实施例中,所述深度估计网络包括编码网络,所述装置还可以包括:In an embodiment of the present application, the depth estimation network includes an encoding network, and the apparatus may further include:
特征图像获取模块,用于通过所述编码网络获取所述目标帧图像的第一特征图像以及所述参考帧图像的第二特征图像;A feature image acquisition module, configured to obtain the first feature image of the target frame image and the second feature image of the reference frame image through the encoding network;
第一特征图像重建模块,用于基于所述第三坐标和所述第二特征图像,通过双线性采样机制重建出所述目标帧图像的第三特征图像;a first feature image reconstruction module, configured to reconstruct a third feature image of the target frame image through a bilinear sampling mechanism based on the third coordinates and the second feature image;
第二特征图像重建模块,用于基于所述第六坐标和所述第一特征图像,通过双线性采样机制重建出所述参考帧图像的第四特征图像;A second feature image reconstruction module, configured to reconstruct a fourth feature image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first feature image;
双向特征感知损失计算模块,用于根据所述第一特征图像、所述第二特征图像、所述第三特征图像和所述第四特征图像,计算得到双向特征感知损失,所述双向特征感知损失用于衡量通过编码网络获得的所述目标帧图像的特征图像与重建出的所述目标帧图像的特征图像之间的差异,以及通过编码网络获得的所述参考帧图像的特征图像与重建出的所述参考帧图像的特征图像之间的差异;The bidirectional feature perception loss calculation module is configured to calculate the bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, and the bidirectional feature perception loss The loss is used to measure the difference between the feature image of the target frame image obtained by the encoding network and the reconstructed feature image of the target frame image, and the feature image of the reference frame image obtained by the encoding network and the reconstructed image. the difference between the feature images of the reference frame image;
所述目标函数构建单元具体可以用于:基于所述双向图像重建损失和所述双向特征感知损失,构建得到所述目标函数。The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
进一步的,所述装置还可以包括:Further, the device may also include:
平滑损失计算模块,用于根据所述目标帧图像、所述参考帧图像、所述第一场景深度图像、所述第二场景深度图像、所述第一特征图像和所述第二特征图像,计算得到平滑损失,所述平滑损失用于正则化通过所述深度估计网络获得的场景深度图像和特征图像的梯度;a smoothing loss calculation module, configured to, according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, Calculate the smoothing loss, and the smoothing loss is used to regularize the gradient of the scene depth image and feature image obtained by the depth estimation network;
所述目标函数构建单元具体可以用于:基于所述双向图像重建损失、所述双向特征感知损失和所述平滑损失,构建得到所述目标函数。The objective function construction unit may be specifically configured to: construct the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如图1表示的任意一种图像场景深度的估计方法。Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, implements any method for estimating the depth of an image scene as shown in FIG. 1 . .
本申请实施例还提供一种计算机程序产品,当该计算机程序产品在终端设备上运行时,使得终端设备执行实现如图1表示的任意一种图像场景深度的估计方法。Embodiments of the present application also provide a computer program product, which, when the computer program product runs on a terminal device, enables the terminal device to execute any method for estimating the depth of an image scene as shown in FIG. 1 .
图8是本申请一实施例提供的终端设备的示意图。如图8所示,该实施例的终端设备8包括:处理器80、存储器81以及存储在所述存储器81中并可在所述处理器80上运行的计算机程序82。所述处理器80执行所述计算机程序82时实现上述各个图像场景深度的估计方法的实施例中的步骤,例如图1所示的步骤101至102。或者,所述处理器80执行所述计算机程序82时实现上述各 装置实施例中各模块/单元的功能,例如图7所示模块701至709的功能。FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 8 , the terminal device 8 of this embodiment includes: a processor 80 , a memory 81 , and a computer program 82 stored in the memory 81 and executable on the processor 80 . When the processor 80 executes the computer program 82, the steps in each of the above embodiments of the method for estimating the depth of an image scene are implemented, for example, steps 101 to 102 shown in FIG. 1 . Alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the above device embodiments, for example, the functions of the modules 701 to 709 shown in FIG. 7 are implemented.
所述计算机程序82可以被分割成一个或多个模块/单元,所述一个或者多个模块/单元被存储在所述存储器81中,并由所述处理器80执行,以完成本申请。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述所述计算机程序82在所述终端设备8中的执行过程。The computer program 82 may be divided into one or more modules/units, which are stored in the memory 81 and executed by the processor 80 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the terminal device 8 .
所称处理器80可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The so-called processor 80 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
所述存储器81可以是所述终端设备8的内部存储单元,例如终端设备8的硬盘或内存。所述存储器81也可以是所述终端设备8的外部存储设备,例如所述终端设备8上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,所述存储器81还可以既包括所述终端设备8的内部存储单元也包括外部存储设备。所述存储器81用于存储所述计算机程序以及所述终端设备所需的其他程序和数据。所述存储器81还可以用于暂时地存储已经输出或者将要输出的数据。The memory 81 may be an internal storage unit of the terminal device 8 , such as a hard disk or a memory of the terminal device 8 . The memory 81 can also be an external storage device of the terminal device 8, such as a plug-in hard disk equipped on the terminal device 8, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, Flash Card, etc. Further, the memory 81 may also include both an internal storage unit of the terminal device 8 and an external storage device. The memory 81 is used to store the computer program and other programs and data required by the terminal device. The memory 81 can also be used to temporarily store data that has been output or will be output.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上 述系统中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and simplicity of description, only the division of the above-mentioned functional units and modules is used as an example. Module completion, that is, dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above. Each functional unit and module in the embodiment may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit, and the above-mentioned integrated units may adopt hardware. It can also be realized in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing from each other, and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above-mentioned system, reference may be made to the corresponding process in the foregoing method embodiments, which will not be repeated here.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
在本申请所提供的实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的系统实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。In the embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the system embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be Incorporation may either be integrated into another system, or some features may be omitted, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application may also be implemented by a computer program instructing relevant hardware; the computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.
Claims (10)
- A method for estimating the depth of an image scene, characterized by comprising: acquiring an image to be tested; and inputting the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested; wherein the parameters of the depth estimation network are updated in the following manner: acquiring a sample image sequence, the sample image sequence comprising a target frame image and a reference frame image, the reference frame image being one or more frames in the sample image sequence before or after the target frame image; inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image; inputting the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image; generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the intrinsic parameters of the camera used to capture the sample image sequence; calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, the first image reconstruction loss being used to measure the difference between the target frame image and the first reconstructed image; constructing an objective function based on the first image reconstruction loss; and updating the parameters of the depth estimation network according to the objective function.
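For illustration only, not part of the claims: a minimal PyTorch-style sketch of the parameter-update procedure recited in claim 1. The function and variable names (`training_step`, `depth_net`, `pose_net`, `warp_to_target`), the L1 photometric term, and the optimizer interface are assumptions, not elements fixed by the application; `warp_to_target` is the view-reconstruction helper sketched after claim 3.

```python
import torch
import torch.nn.functional as F

def training_step(depth_net, pose_net, target, reference, K, optimizer, warp_to_target):
    """One parameter update of the depth estimation network from an unlabeled image pair."""
    depth_t = depth_net(target)                              # predicted first scene depth image
    pose = pose_net(torch.cat([target, reference], dim=1))   # camera pose vector between the frames

    # First reconstructed image: warp the reference frame into the target view
    # using the predicted depth, the pose vector and the camera intrinsics K.
    recon_t = warp_to_target(reference, depth_t, pose, K)

    # First image reconstruction loss: photometric difference between the
    # target frame and its reconstruction (L1 is an assumption).
    objective = F.l1_loss(recon_t, target)

    optimizer.zero_grad()
    objective.backward()
    optimizer.step()
    return objective.item()
```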
- The method according to claim 1, characterized in that, after acquiring the sample image sequence, the method further comprises: inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image; generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the intrinsic parameters of the camera used to capture the sample image sequence; and calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, the second image reconstruction loss being used to measure the difference between the reference frame image and the second reconstructed image; and the constructing an objective function based on the first image reconstruction loss comprises: calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss; and constructing the objective function based on the bidirectional image reconstruction loss.
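For illustration only: a sketch of how the two reconstruction losses of claim 2 could be combined into a bidirectional loss. The L1 photometric term and the unweighted sum are assumptions; the application does not fix the combination here.

```python
import torch.nn.functional as F

def bidirectional_image_reconstruction_loss(target, reference, recon_target, recon_reference):
    """Combine the target-side and reference-side reconstruction losses."""
    first_loss = F.l1_loss(recon_target, target)          # target frame vs. first reconstructed image
    second_loss = F.l1_loss(recon_reference, reference)   # reference frame vs. second reconstructed image
    return first_loss + second_loss
```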
- The method according to claim 2, characterized in that the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the intrinsic parameters of the camera used to capture the sample image sequence comprises: determining, according to the camera pose vector, a first transformation matrix for transforming the target frame image to the reference frame image; calculating a first coordinate of the target frame image in the world coordinate system according to the intrinsic parameters of the camera and the first scene depth image; transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the transformed target frame image in the world coordinate system; converting the second coordinate into a third coordinate in the image coordinate system; and reconstructing, based on the reference frame image and with the third coordinate as grid points, an affine-transformed image of the reference frame image through a bilinear sampling mechanism, and determining the reconstructed image as the first reconstructed image; and the generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the intrinsic parameters of the camera used to capture the sample image sequence comprises: determining, according to the camera pose vector, a second transformation matrix for transforming the reference frame image to the target frame image; calculating a fourth coordinate of the reference frame image in the world coordinate system according to the intrinsic parameters of the camera and the second scene depth image; transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the transformed reference frame image in the world coordinate system; converting the fifth coordinate into a sixth coordinate in the image coordinate system; and reconstructing, based on the target frame image and with the sixth coordinate as grid points, an affine-transformed image of the target frame image through a bilinear sampling mechanism, and determining the reconstructed image as the second reconstructed image.
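For illustration only: a sketch of the coordinate chain in claim 3 (back-projection, rigid transformation, re-projection, bilinear sampling), assuming the transformation matrix has already been built from the pose vector. The function name, the padding and alignment settings, and PyTorch >= 1.10 (for the `indexing` argument of `meshgrid`) are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_view(src, depth, T, K):
    """Reconstruct one view from the other, following the coordinate chain of claim 3.
    src:   (B,3,H,W) frame to be resampled (e.g. the reference frame)
    depth: (B,1,H,W) predicted depth of the view being reconstructed (e.g. the target)
    T:     (B,4,4)   transformation matrix derived from the camera pose vector
    K:     (B,3,3)   camera intrinsic matrix
    """
    B, _, H, W = src.shape
    device = src.device

    # Homogeneous pixel grid of the view being reconstructed.
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                       # (B,3,H*W)

    # "First coordinate": back-project with K^-1 and the predicted depth.
    cam = torch.inverse(K) @ pix * depth.view(B, 1, -1)              # (B,3,H*W)
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)

    # "Second coordinate": apply the transformation matrix from the pose vector.
    cam_t = (T @ cam)[:, :3]                                         # (B,3,H*W)

    # "Third coordinate": project back onto the image plane of `src`.
    proj = K @ cam_t
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)

    # Use the projected coordinates as grid points for bilinear sampling.
    grid = torch.stack([2.0 * px / (W - 1) - 1.0,
                        2.0 * py / (H - 1) - 1.0], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```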
- The method according to claim 3, characterized by further comprising: acquiring a seventh coordinate of the target frame image in the image coordinate system; performing element-wise subtraction on the third coordinate and the seventh coordinate to obtain a first forward-flow coordinate; acquiring an eighth coordinate of the reference frame image in the image coordinate system; performing element-wise subtraction on the sixth coordinate and the eighth coordinate to obtain a first backward-flow coordinate; performing, with the third coordinate as grid points, an affine transformation on the first backward-flow coordinate through the bilinear sampling mechanism to synthesize a second forward-flow coordinate; performing, with the sixth coordinate as grid points, an affine transformation on the first forward-flow coordinate through the bilinear sampling mechanism to synthesize a second backward-flow coordinate; calculating a forward-flow occlusion mask according to the first forward-flow coordinate and the second forward-flow coordinate, the forward-flow occlusion mask being used to measure the degree of matching between the first forward-flow coordinate and the second forward-flow coordinate; and calculating a backward-flow occlusion mask according to the first backward-flow coordinate and the second backward-flow coordinate, the backward-flow occlusion mask being used to measure the degree of matching between the first backward-flow coordinate and the second backward-flow coordinate; and the calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss comprises: calculating the bidirectional image reconstruction loss according to the first image reconstruction loss, the second image reconstruction loss, the forward-flow occlusion mask, and the backward-flow occlusion mask.
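For illustration only: a sketch in the spirit of the flow-consistency masks of claim 4. The round-trip check assumes the first flow and the synthesized opposite flow point in opposite directions, and the thresholds `alpha`/`beta` are common forward-backward-check values; both the sign convention and the thresholds are assumptions rather than details taken from the application.

```python
import torch
import torch.nn.functional as F

def synthesize_flow(opposite_flow, grid):
    """Synthesize the second forward (or backward) flow by bilinearly sampling the
    first backward (or forward) flow at the warped grid built from the third/sixth
    coordinate (grid is (B,H,W,2), normalised to [-1, 1])."""
    return F.grid_sample(opposite_flow, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def flow_occlusion_mask(first_flow, synthesized_flow, alpha=0.01, beta=0.5):
    """Mask that is 1 where the first flow and the synthesized flow are consistent.
    first_flow, synthesized_flow: (B,2,H,W); returns (B,1,H,W)."""
    diff = (first_flow + synthesized_flow).norm(dim=1, keepdim=True)
    bound = alpha * (first_flow.norm(dim=1, keepdim=True)
                     + synthesized_flow.norm(dim=1, keepdim=True)) + beta
    return (diff < bound).float()
```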
- The method according to claim 3, characterized by further comprising: determining a first scene depth value of the target frame image according to the second coordinate; determining a second scene depth value of the reference frame image according to the fifth coordinate; acquiring a third scene depth value of the pixel corresponding to the second coordinate in the first scene depth image; acquiring a fourth scene depth value of the pixel corresponding to the fifth coordinate in the second scene depth image; reconstructing a fifth scene depth value of the target frame image through the bilinear sampling mechanism based on the third coordinate and the fourth scene depth value; reconstructing a sixth scene depth value of the reference frame image through the bilinear sampling mechanism based on the sixth coordinate and the third scene depth value; calculating a forward scene-structure consistency loss according to the first scene depth value and the fifth scene depth value, the forward scene-structure consistency loss being used to measure the difference between the scene depth value of the target frame image obtained through multi-view geometric transformation and the reconstructed scene depth value of the target frame image; calculating a backward scene-structure consistency loss according to the second scene depth value and the sixth scene depth value, the backward scene-structure consistency loss being used to measure the difference between the scene depth value of the reference frame image obtained through multi-view geometric transformation and the reconstructed scene depth value of the reference frame image; and calculating a bidirectional scene-structure consistency loss according to the forward scene-structure consistency loss and the backward scene-structure consistency loss; and the constructing the objective function based on the bidirectional image reconstruction loss comprises: constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional scene-structure consistency loss.
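For illustration only: a sketch of one direction of the scene-structure consistency term of claim 5. The normalised absolute difference is an assumption; the application only states that the loss measures the difference between the geometrically computed depth and the bilinearly reconstructed depth.

```python
def scene_structure_consistency_loss(depth_geometry, depth_reconstructed, eps=1e-7):
    """Forward (or backward) scene-structure consistency loss.
    depth_geometry:      depth obtained through multi-view geometric transformation
                         (e.g. the z component of the transformed coordinate)
    depth_reconstructed: depth reconstructed from the other view's depth map
                         through the bilinear sampling mechanism"""
    num = (depth_geometry - depth_reconstructed).abs()
    den = (depth_geometry + depth_reconstructed).clamp(min=eps)
    return (num / den).mean()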
- The method according to any one of claims 3 to 5, characterized in that the depth estimation network comprises an encoding network, and the method further comprises: acquiring, through the encoding network, a first feature image of the target frame image and a second feature image of the reference frame image; reconstructing a third feature image of the target frame image through the bilinear sampling mechanism based on the third coordinate and the second feature image; reconstructing a fourth feature image of the reference frame image through the bilinear sampling mechanism based on the sixth coordinate and the first feature image; and calculating a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image, and the fourth feature image, the bidirectional feature perception loss being used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image; and the constructing the objective function based on the bidirectional image reconstruction loss comprises: constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
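For illustration only: a sketch of the bidirectional feature perception loss of claim 6, assuming the sampling grids have been resized to the feature resolution and using an L1 distance; both choices are assumptions, not details fixed by the application.

```python
import torch.nn.functional as F

def bidirectional_feature_perception_loss(feat_target, feat_reference, grid_t2r, grid_r2t):
    """Compare encoder features with features reconstructed from the other view."""
    # Third feature image: reconstruct the target features from the reference features.
    recon_feat_target = F.grid_sample(feat_reference, grid_t2r, mode="bilinear",
                                      padding_mode="border", align_corners=True)
    # Fourth feature image: reconstruct the reference features from the target features.
    recon_feat_reference = F.grid_sample(feat_target, grid_r2t, mode="bilinear",
                                         padding_mode="border", align_corners=True)
    return (F.l1_loss(recon_feat_target, feat_target)
            + F.l1_loss(recon_feat_reference, feat_reference))
```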
- The method according to claim 6, characterized in that, after acquiring, through the encoding network, the first feature image of the target frame image and the second feature image of the reference frame image, the method further comprises: calculating a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image, and the second feature image, the smoothing loss being used to regularize the gradients of the scene depth images and the feature images obtained through the depth estimation network; and the constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss comprises: constructing the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss, and the smoothing loss.
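For illustration only: a sketch of a smoothing term of the kind recited in claim 7, shown for one depth map (the same form can be applied to the feature images and the other frame). The edge-aware exponential weighting is a common choice and is an assumption here.

```python
import torch

def edge_aware_smoothness_loss(depth, image):
    """Penalise first-order gradients of the prediction, down-weighted where the
    corresponding image has strong edges.
    depth: (B,1,H,W), image: (B,3,H,W)"""
    d_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    d_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    i_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    i_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```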
- An apparatus for estimating the depth of an image scene, characterized by comprising: an image acquisition module, configured to acquire an image to be tested; a scene depth estimation module, configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested; a sample acquisition module, configured to acquire a sample image sequence, the sample image sequence comprising a target frame image and a reference frame image, the reference frame image being one or more frames in the sample image sequence before or after the target frame image; a first scene depth prediction module, configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image; a camera pose estimation module, configured to input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image; a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the intrinsic parameters of the camera used to capture the sample image sequence; a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, the first image reconstruction loss being used to measure the difference between the target frame image and the first reconstructed image; an objective function construction module, configured to construct an objective function based on the first image reconstruction loss; and a network parameter updating module, configured to update the parameters of the depth estimation network according to the objective function.
- A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the method for estimating the depth of an image scene according to any one of claims 1 to 7 is implemented.
- A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the method for estimating the depth of an image scene according to any one of claims 1 to 7 is implemented.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110346713.3 | 2021-03-31 | ||
CN202110346713.3A CN113160294B (en) | 2021-03-31 | 2021-03-31 | Image scene depth estimation method and device, terminal equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022206020A1 (en) | 2022-10-06 |
Family
ID=76885688
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/137609 WO2022206020A1 (en) | 2021-03-31 | 2021-12-13 | Method and apparatus for estimating depth of field of image, and terminal device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113160294B (en) |
WO (1) | WO2022206020A1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160294B (en) * | 2021-03-31 | 2022-12-23 | 中国科学院深圳先进技术研究院 | Image scene depth estimation method and device, terminal equipment and storage medium |
CN113592940B (en) * | 2021-07-28 | 2024-07-02 | 北京地平线信息技术有限公司 | Method and device for determining target object position based on image |
CN113592706B (en) * | 2021-07-28 | 2023-10-17 | 北京地平线信息技术有限公司 | Method and device for adjusting homography matrix parameters |
CN113792730B (en) * | 2021-08-17 | 2022-09-27 | 北京百度网讯科技有限公司 | Method and device for correcting document image, electronic equipment and storage medium |
CN114049388A (en) * | 2021-11-10 | 2022-02-15 | 北京地平线信息技术有限公司 | Image data processing method and device |
CN113793283B (en) * | 2021-11-15 | 2022-02-11 | 江苏游隼微电子有限公司 | Vehicle-mounted image noise reduction method |
CN114219900B (en) * | 2022-02-21 | 2022-07-01 | 北京影创信息科技有限公司 | Three-dimensional scene reconstruction method, reconstruction system and application based on mixed reality glasses |
CN114627006B (en) * | 2022-02-28 | 2022-12-20 | 复旦大学 | Progressive image restoration method based on depth decoupling network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150178900A1 (en) * | 2012-11-29 | 2015-06-25 | Korea Institute Of Science And Technology | Depth image processing apparatus and method based on camera pose conversion |
CN110503680A (en) * | 2019-08-29 | 2019-11-26 | 大连海事大学 | It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method |
CN110782490A (en) * | 2019-09-24 | 2020-02-11 | 武汉大学 | Video depth map estimation method and device with space-time consistency |
CN112819875A (en) * | 2021-02-03 | 2021-05-18 | 苏州挚途科技有限公司 | Monocular depth estimation method and device and electronic equipment |
CN113160294A (en) * | 2021-03-31 | 2021-07-23 | 中国科学院深圳先进技术研究院 | Image scene depth estimation method and device, terminal equipment and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102506959B1 (en) * | 2018-05-17 | 2023-03-07 | 나이앤틱, 인크. | Self-supervised training of depth estimation systems |
CN110490928B (en) * | 2019-07-05 | 2023-08-15 | 天津大学 | Camera attitude estimation method based on deep neural network |
CN111105451B (en) * | 2019-10-31 | 2022-08-05 | 武汉大学 | Driving scene binocular depth estimation method for overcoming occlusion effect |
US11157774B2 (en) * | 2019-11-14 | 2021-10-26 | Zoox, Inc. | Depth data model training with upsampling, losses, and loss balancing |
CN111311685B (en) * | 2020-05-12 | 2020-08-07 | 中国人民解放军国防科技大学 | Motion scene reconstruction unsupervised method based on IMU and monocular image |
CN111369608A (en) * | 2020-05-29 | 2020-07-03 | 南京晓庄学院 | Visual odometer method based on image depth estimation |
CN111783582A (en) * | 2020-06-22 | 2020-10-16 | 东南大学 | Unsupervised monocular depth estimation algorithm based on deep learning |
2021
- 2021-03-31 CN CN202110346713.3A patent/CN113160294B/en active Active
- 2021-12-13 WO PCT/CN2021/137609 patent/WO2022206020A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150178900A1 (en) * | 2012-11-29 | 2015-06-25 | Korea Institute Of Science And Technology | Depth image processing apparatus and method based on camera pose conversion |
CN110503680A (en) * | 2019-08-29 | 2019-11-26 | 大连海事大学 | It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method |
CN110782490A (en) * | 2019-09-24 | 2020-02-11 | 武汉大学 | Video depth map estimation method and device with space-time consistency |
CN112819875A (en) * | 2021-02-03 | 2021-05-18 | 苏州挚途科技有限公司 | Monocular depth estimation method and device and electronic equipment |
CN113160294A (en) * | 2021-03-31 | 2021-07-23 | 中国科学院深圳先进技术研究院 | Image scene depth estimation method and device, terminal equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
YIN ZHICHAO; SHI JIANPING: "GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 1983 - 1992, XP033476163, DOI: 10.1109/CVPR.2018.00212 * |
ZHOU TINGHUI; BROWN MATTHEW; SNAVELY NOAH; LOWE DAVID G.: "Unsupervised Learning of Depth and Ego-Motion from Video", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE COMPUTER SOCIETY, US, 21 July 2017 (2017-07-21), US , pages 6612 - 6619, XP033250026, ISSN: 1063-6919, DOI: 10.1109/CVPR.2017.700 * |
Also Published As
Publication number | Publication date |
---|---|
CN113160294B (en) | 2022-12-23 |
CN113160294A (en) | 2021-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022206020A1 (en) | Method and apparatus for estimating depth of field of image, and terminal device and storage medium | |
CN112001914B (en) | Depth image complement method and device | |
CN110501072B (en) | Reconstruction method of snapshot type spectral imaging system based on tensor low-rank constraint | |
CN110910437B (en) | Depth prediction method for complex indoor scene | |
CN114429555A (en) | Image density matching method, system, equipment and storage medium from coarse to fine | |
CN113962858A (en) | Multi-view depth acquisition method | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN114140623A (en) | Image feature point extraction method and system | |
JP2024510230A (en) | Multi-view neural human prediction using implicitly differentiable renderer for facial expression, body pose shape and clothing performance capture | |
CN111325828A (en) | Three-dimensional face acquisition method and device based on three-eye camera | |
CN112767467A (en) | Double-image depth estimation method based on self-supervision deep learning | |
Qin et al. | Depth estimation by parameter transfer with a lightweight model for single still images | |
CN116188550A (en) | Self-supervision depth vision odometer based on geometric constraint | |
Zhang et al. | End-to-end learning of self-rectification and self-supervised disparity prediction for stereo vision | |
CN111696167A (en) | Single image super-resolution reconstruction method guided by self-example learning | |
CN116934591A (en) | Image stitching method, device and equipment for multi-scale feature extraction and storage medium | |
CN114399547B (en) | Monocular SLAM robust initialization method based on multiframe | |
CN115375740A (en) | Pose determination method, three-dimensional model generation method, device, equipment and medium | |
CN114842066A (en) | Image depth recognition model training method, image depth recognition method and device | |
CN110689513B (en) | Color image fusion method and device and terminal equipment | |
CN113962846A (en) | Image alignment method and device, computer readable storage medium and electronic device | |
CN112634331A (en) | Optical flow prediction method and device | |
CN112131902A (en) | Closed loop detection method and device, storage medium and electronic equipment | |
US20230145048A1 (en) | Real-time body pose estimation system and method in unconstrained video | |
CN115457101B (en) | Edge-preserving multi-view depth estimation and ranging method for unmanned aerial vehicle platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21934656; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21934656; Country of ref document: EP; Kind code of ref document: A1 |