CN113160294B - Image scene depth estimation method and device, terminal equipment and storage medium

Info

Publication number
CN113160294B
Authority
CN
China
Prior art keywords
image
coordinate
frame image
reference frame
scene
Prior art date
Legal status
Active
Application number
CN202110346713.3A
Other languages
Chinese (zh)
Other versions
CN113160294A (en)
Inventor
Wang Fei (王飞)
Cheng Jun (程俊)
Liu Penglei (刘鹏磊)
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110346713.3A
Publication of CN113160294A
Priority to PCT/CN2021/137609 (WO2022206020A1)
Application granted
Publication of CN113160294B


Classifications

    • G06T7/50 Image analysis - Depth or shape recovery
    • G06N3/045 Neural networks - Combinations of networks
    • G06N3/08 Neural networks - Learning methods
    • G06T7/85 Camera calibration - Stereo camera calibration


Abstract

The application relates to the technical field of image processing, and provides an image scene depth estimation method and apparatus, a terminal device and a storage medium. When the parameters of the depth estimation network are optimized and updated, the camera pose vector of an input sample image sequence is predicted with the aid of a camera pose estimation network, the sample image sequence comprising a target frame image and a reference frame image. A reconstructed image corresponding to the target frame image is then generated from the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the intrinsic parameters of the corresponding camera. Next, the loss incurred when reconstructing the image is calculated from the target frame image and the reconstructed image, an objective function is constructed based on this loss, and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the image information contained in the target frame image and the reference frame image can be fully exploited, and the cost of sample data acquisition is reduced.

Description

Image scene depth estimation method and device, terminal equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image scene depth estimation method and apparatus, a terminal device, and a storage medium.
Background
Scene depth estimation from images is an important research direction in the fields of robot navigation and autonomous driving. With the development of high-performance computing devices, deep neural networks are widely used to predict the scene depth of an image. However, to ensure the accuracy of scene depth prediction, a large amount of sample data is required when training the deep neural network, which results in a high data acquisition cost.
Disclosure of Invention
In view of this, embodiments of the present application provide an image scene depth estimation method and apparatus, a terminal device, and a storage medium, which can reduce the cost of sample data acquisition.
A first aspect of an embodiment of the present application provides a method for estimating depth of an image scene, including:
acquiring an image to be detected;
inputting the image to be detected into a depth estimation network which is constructed in advance to obtain a scene depth image of the image to be detected;
wherein the parameters of the depth estimation network are updated by:
acquiring a sample image sequence, wherein the sample image sequence comprises a target frame image and a reference frame image, and the reference frame image is one or more frames in the sample image sequence before or after the target frame image;
inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera attitude vector, the reference frame image and internal parameters of a camera adopted for shooting the sample image sequence;
calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, wherein the first image reconstruction loss is used for measuring the difference between the target frame image and the first reconstructed image;
constructing an objective function based on the first image reconstruction loss;
and updating the parameters of the depth estimation network according to the objective function.
When the parameters of the depth estimation network adopted in the embodiments of the present application are optimized and updated, the camera pose vector of the input sample image sequence is predicted with the aid of the camera pose estimation network, the sample image sequence comprising a target frame image and a reference frame image. A reconstructed image corresponding to the target frame image is then generated from the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the intrinsic parameters of the corresponding camera. Next, the loss incurred when reconstructing the image is calculated from the target frame image and the reconstructed image, an objective function is constructed based on this loss, and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the potential image information contained in the target frame image and the reference frame image can be fully mined; in other words, sufficient image information can be obtained from fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data acquisition.
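The following is a minimal sketch of one such parameter-update step, assuming hypothetical `depth_net`, `pose_net`, `warp_fn` and `loss_fn` callables; it illustrates the training flow described above and is not the patent's actual code.

```python
# Sketch of one self-supervised parameter-update step (illustrative names, not the patent's code).
import torch

def training_step(depth_net, pose_net, warp_fn, loss_fn,
                  target, references, intrinsics, optimizer):
    """target: (B,3,H,W) target frame; references: list of (B,3,H,W) reference frames;
    loss_fn is assumed to return a scalar loss tensor."""
    depth = depth_net(target)                              # first scene depth image
    loss = target.new_zeros(())
    for ref in references:
        pose = pose_net(torch.cat([target, ref], dim=1))   # 6-D camera pose vector
        recon = warp_fn(ref, depth, pose, intrinsics)      # first reconstructed image
        loss = loss + loss_fn(target, recon)               # image reconstruction loss
    optimizer.zero_grad()
    loss.backward()                                        # objective built from the loss
    optimizer.step()
    return loss.detach()
```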
In an embodiment of the present application, after acquiring the sample image sequence, the method may further include:
inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera attitude vector, the target frame image and internal parameters of a camera adopted for shooting the sample image sequence;
calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, wherein the second image reconstruction loss is used for measuring the difference between the reference frame image and the second reconstructed image;
the constructing an objective function based on the first image reconstruction loss comprises:
calculating the reconstruction loss of the bidirectional image according to the reconstruction loss of the first image and the reconstruction loss of the second image;
constructing the objective function based on the bi-directional image reconstruction loss.
By adding bidirectional image reconstruction loss in an objective function of the depth estimation network, potential information in image data can be fully mined, and the robustness of a depth estimation algorithm is further improved.
Further, the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image and the internal reference of the camera used for capturing the sample image sequence may include:
determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image under a world coordinate system after transformation;
converting the second coordinate into a third coordinate in an image coordinate system;
reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and an internal reference of a camera used for shooting the sample image sequence, including:
determining a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
according to the internal reference of the camera and the second scene depth image, calculating a fourth coordinate of the reference frame image in a world coordinate system;
transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the reference frame image in a world coordinate system after transformation;
converting the fifth coordinate into a sixth coordinate in an image coordinate system;
and reconstructing an image of the target frame image after affine transformation by using the sixth coordinate as a grid point through a bilinear sampling mechanism based on the target frame image, and determining the reconstructed image as the second reconstructed image.
In one embodiment of the present application, the method may further include:
acquiring a seventh coordinate of the target frame image in an image coordinate system;
performing difference processing on corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
acquiring an eighth coordinate of the reference frame image in an image coordinate system;
performing difference processing on corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
performing affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second forward flow coordinate;
performing affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second backward flow coordinate;
calculating a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, wherein the forward flow occlusion mask is used for measuring the matching degree between the first forward flow coordinate and the second forward flow coordinate;
calculating a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, wherein the backward flow occlusion mask is used for measuring the matching degree between the first backward flow coordinate and the second backward flow coordinate;
said calculating a bi-directional image reconstruction loss from said first image reconstruction loss and said second image reconstruction loss, comprising:
calculating the bi-directional image reconstruction loss from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
The occlusion mask is used for judging whether occlusion objects exist in continuous video frames or not, and the occlusion mask is added into calculation of bidirectional image reconstruction loss, so that the accuracy of depth estimation of the image with the occlusion objects by the depth estimation network can be improved.
In one embodiment of the present application, the method may further include:
determining a first scene depth value of the target frame image according to the second coordinate;
determining a second scene depth value of the reference frame image according to the fifth coordinate;
acquiring a third scene depth value of a pixel point corresponding to the second coordinate in the first scene depth image;
acquiring a fourth scene depth value of a pixel point corresponding to the fifth coordinate in the second scene depth image;
reconstructing a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
reconstructing a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
calculating a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, wherein the forward scene structure consistency loss is used for measuring a difference between a scene depth value of the target frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the target frame image;
calculating a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, wherein the backward scene structure consistency loss is used for measuring a difference between a scene depth value of the reference frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the reference frame image;
calculating the consistency loss of the bidirectional scene structure according to the consistency loss of the forward scene structure and the consistency loss of the backward scene structure;
the constructing the objective function based on the bidirectional image reconstruction loss may include:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
When the objective function is constructed, the bidirectional scene structure consistency loss is added, so that occluding objects and moving objects in the scene of the image to be detected can be handled effectively, thereby improving the accuracy of scene depth estimation.
In an embodiment of the present application, the depth estimation network includes an encoding network, and the method may further include:
acquiring a first characteristic image of the target frame image and a second characteristic image of the reference frame image through the coding network;
reconstructing a third characteristic image of the target frame image through a bilinear sampling mechanism based on the third coordinate and the second characteristic image;
reconstructing a fourth characteristic image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first characteristic image;
calculating to obtain a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, wherein the bidirectional feature perception loss is used for measuring a difference between a feature image of the target frame image obtained through an encoding network and a feature image of the reconstructed target frame image, and a difference between a feature image of the reference frame image obtained through the encoding network and a feature image of the reconstructed reference frame image;
the constructing the objective function based on the bi-directional image reconstruction loss comprises:
and constructing and obtaining the target function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
By introducing bidirectional feature perception loss into the objective function, the weak texture scene in the image to be detected can be effectively processed, and therefore the accuracy of scene depth estimation is improved.
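A minimal sketch of a bidirectional feature-perception term consistent with the description above is given below; the function name and the L1 form of the comparison are assumptions, not the patent's exact formulation.

```python
# Sketch of a bidirectional feature-perception loss (L1 form assumed).
import torch

def feature_perception_loss(feat_tgt, feat_ref, feat_tgt_rec, feat_ref_rec):
    """All arguments are (B, C, H, W) feature maps from the encoding network;
    *_rec are the feature images reconstructed by bilinear sampling."""
    fwd = (feat_tgt - feat_tgt_rec).abs().mean()   # target-frame feature difference
    bwd = (feat_ref - feat_ref_rec).abs().mean()   # reference-frame feature difference
    return fwd + bwd
```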
Further, after acquiring the first feature image of the target frame image and the second feature image of the reference frame image through the coding network, the method may further include:
calculating to obtain a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, wherein the smoothing loss is used for regularizing gradients of the scene depth image and the feature image obtained through the depth estimation network;
the constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss comprises:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
By introducing a smoothing loss in the objective function, the gradients of the scene depth image and the feature image obtained by the depth estimation network may be regularized.
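Below is a sketch of an edge-aware smoothness term of the kind described above, in which depth (or feature) gradients are penalized less where the input image itself has strong gradients; this standard form is an assumption about the patent's exact smoothing loss.

```python
# Sketch of an edge-aware smoothness regularizer (standard form, assumed).
import torch

def edge_aware_smoothness(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """depth: (B, 1, H, W) depth or feature map; image: (B, 3, H, W) input frame."""
    d_dx = (depth[:, :, :, :-1] - depth[:, :, :, 1:]).abs()
    d_dy = (depth[:, :, :-1, :] - depth[:, :, 1:, :]).abs()
    i_dx = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    i_dy = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    # down-weight gradients at image edges
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```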
A second aspect of the embodiments of the present application provides an apparatus for estimating depth of an image scene, including:
the to-be-detected image acquisition module is used for acquiring an image to be detected;
the scene depth estimation module is used for inputting the image to be detected into a depth estimation network which is constructed in advance to obtain a scene depth image of the image to be detected;
the sample acquisition module is used for acquiring a sample image sequence, the sample image sequence comprising a target frame image and a reference frame image, wherein the reference frame image is one or more frames in the sample image sequence before or after the target frame image;
the first scene depth prediction module is used for inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
the camera attitude estimation module is used for inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and an internal reference of a camera used for shooting the sample image sequence;
a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure a difference between the target frame image and the first reconstructed image;
an objective function construction module for constructing an objective function based on the first image reconstruction loss;
and the network parameter updating module is used for updating the parameters of the depth estimation network according to the target function.
A third aspect of an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for estimating depth of an image scene as provided in the first aspect of the embodiment of the present application.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the method for estimating depth of an image scene as provided by the first aspect of embodiments of the present application.
A fifth aspect of the embodiments of the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the method for estimating depth of an image scene according to the first aspect of the embodiments of the present application.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
Fig. 1 is a flowchart of an embodiment of a method for estimating a depth of an image scene according to an embodiment of the present application;
FIG. 2 is a schematic flowchart illustrating a process of optimizing and updating a depth estimation network parameter according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a depth estimation network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a residual module in the network architecture of FIG. 3;
fig. 5 is a schematic structural diagram of a camera pose estimation network according to an embodiment of the present application;
fig. 6 is a comparison graph of the result of monocular image depth prediction performed by the image scene depth estimation method provided in the embodiment of the present application and various algorithms in the prior art;
fig. 7 is a block diagram of an embodiment of an apparatus for estimating depth of an image scene according to an embodiment of the present application;
fig. 8 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail. Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
The application provides an image scene depth estimation method and device, terminal equipment and a storage medium, and the cost of sample data acquisition can be reduced. It should be understood that the main subjects of the embodiments of the method of the present application are various types of terminal devices or servers, such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a wearable device, and the like.
Referring to fig. 1, a method for estimating depth of an image scene according to an embodiment of the present application is shown, including:
101. acquiring an image to be detected;
firstly, an image to be detected is obtained, wherein the image to be detected is any image of which the scene depth needs to be predicted.
102. And inputting the image to be detected into a pre-constructed depth estimation network to obtain a scene depth image of the image to be detected.
After the image to be detected is obtained, the image to be detected is input into a depth estimation network which is constructed in advance, a scene depth image of the image to be detected is obtained, and therefore a scene depth estimation result of the image to be detected is obtained. Specifically, the depth estimation network may be a neural network having an encoder-decoder architecture, and the application does not limit the type and network structure of the neural network used by the depth estimation network.
Referring to fig. 2, a schematic flow chart of optimizing and updating a depth estimation network parameter provided in an embodiment of the present application is shown, including the following steps:
2.1, acquiring a sample image sequence, wherein the sample image sequence comprises a target frame image and a reference frame image, and the reference frame image is an image of more than one frame in the sample image sequence before or after the target frame image;
To train and optimize the depth estimation network, training set data needs to be acquired first, and certain preprocessing operations may be performed on it. For example, the autonomous driving dataset KITTI may be used as training set data and subjected to preprocessing operations such as random flipping, random cropping and data normalization, so as to convert it into tensor data of a specified dimension that serves as the input of the depth estimation network. In an embodiment of the present application, the training set data may consist of a large number of sample image sequences, where each sample image sequence contains a target frame image and a reference frame image, and the reference frame image is one or more frames in the sample image sequence before or after the target frame image. For example, the sample image sequence may be a video clip comprising 5 consecutive video frames, assumed to be I0, I1, I2, I3 and I4; then I2 can be used as the target frame image, and I0, I1, I3 and I4 can be used as the corresponding reference frame images.
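A minimal sketch of this target/reference split for the 5-frame example above is given below; the function name is illustrative.

```python
# Sketch of splitting a 5-frame sample sequence into target and reference frames
# (I2 as target, I0, I1, I3, I4 as references), as in the example above.
def split_sequence(frames):
    """frames: list of 5 consecutive video frames [I0, I1, I2, I3, I4]."""
    assert len(frames) == 5
    target = frames[2]                       # middle frame is the target frame image
    references = frames[:2] + frames[3:]     # remaining frames are reference frame images
    return target, references
```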
2.2, inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
and inputting the target frame image in the sample image sequence into the depth estimation network to obtain a predicted first scene depth image, namely a scene depth image corresponding to the target frame image.
In one embodiment of the present application, the depth estimation network is shown schematically in fig. 3 and comprises an encoder part and a decoder part. The encoder part extracts abstract features of the input image data by layer-by-layer downsampling. Assuming that tensor data with dimensions 3 × 256 × 832 is obtained after preprocessing the target frame image, a feature image with dimensions 64 × 128 × 416 is obtained after the first layer of convolution, normalization and activation of the encoder, completing the first downsampling. The feature image is then processed by a max-pooling module and several residual modules to obtain a feature image with dimensions 256 × 64 × 208, completing the second downsampling. By analogy, after multiple downsampling steps, a feature image with dimensions 2048 × 8 × 26 is obtained.

The decoder part processes the feature image produced by the encoder by layer-by-layer upsampling. Specifically, the feature image obtained by the encoder can be processed by a convolution layer with a 3 × 3 kernel, a nonlinear ELU layer and a nearest-neighbour upsampling layer to obtain a feature image with dimensions 512 × 16 × 52. Then, as shown in fig. 3, this 512 × 16 × 52 feature image is concatenated in the channel dimension with the 1024 × 16 × 52 feature image obtained by the encoder to obtain a feature image with dimensions 1536 × 16 × 52, completing the first upsampling. By analogy, after multiple upsampling steps, a feature image with dimensions 32 × 256 × 832 is finally obtained.

This 32 × 256 × 832 feature image is then passed in turn through a convolution layer with a 3 × 3 kernel, a Sigmoid function and the mapping F(x) = 1/(10x + 0.01) to obtain the final scene depth image, where x denotes the depth image obtained after mapping by the Sigmoid function and F(x) denotes the final scene depth image. Specifically, after the feature image is transformed by the Sigmoid function, the range of each pixel is mapped to between 0 and 1. Assuming here that the depth range of the actual scene is between 0.1 m and 100 m, the mapping between pixels of the estimated depth image and actual scene depth can be established by the function F(x) = 1/(10x + 0.01); for example, x = 0 corresponds to 100 m in the actual scene. Processing by the function F(x) = 1/(10x + 0.01) therefore constrains the estimated depth image to a reasonable range between 0.1 m and 100 m.
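The final depth mapping described above can be written as a small helper, sketched below with illustrative names.

```python
# Sketch of the final depth mapping above: a sigmoid output x in (0, 1) is converted to
# depth via F(x) = 1 / (10 * x + 0.01), which bounds the estimate to roughly 0.1 m .. 100 m
# (x close to 1 gives about 0.1 m, x = 0 gives 100 m).
import torch

def sigmoid_to_depth(logits: torch.Tensor) -> torch.Tensor:
    x = torch.sigmoid(logits)          # map network output to (0, 1)
    return 1.0 / (10.0 * x + 0.01)     # F(x), constrained to roughly 0.1 m .. 100 m
```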
Fig. 4 shows a schematic structural diagram of a residual module in the network structure shown in fig. 3: the input is divided into 2 branches, one branch is processed in turn by convolution layers, a normalization layer (BN) and a ReLU function, and is then added to the other branch to obtain the output data of the residual module.
In addition, in the network structure shown in fig. 3, a technical means of shortcut connection is adopted, that is, the feature images extracted by the encoder are directly spliced in the channel dimension across the convolutional layer and the feature images with the same resolution obtained by the decoder. In the process of extracting the features of the input image by using the encoder, a convolution kernel with a fixed size is adopted to continuously extract the image features in a sliding window mode, however, due to the limitation of the size of the convolution kernel and the property of extracting the features of the convolution local image, the shallow network can only extract the local features of the image. With the increasing of the number of the convolution layers, the resolution of the extracted feature images is continuously reduced, and meanwhile, the number of the feature images is also continuously increased, so that more abstract depth features with larger receptive fields can be extracted. As for the decoder part, the last feature image output by the encoder is directly decoded, and the deep level features are subjected to multiple times of upsampling processing to obtain deep level feature images with different resolutions, for example, after the first time of upsampling processing, a feature image with the dimension of 512 × 16 × 52 is obtained, and at this time, the feature image with the dimension of 1024 × 16 × 52 extracted by the encoder directly crosses the corresponding convolution layer and is fused with the 512 × 16 × 52 feature image obtained by the decoder in the channel dimension. As shown in fig. 3, the feature image of each resolution extracted by the encoder is subjected to feature fusion with the feature image obtained by the corresponding decoder, so that fusion of the local features of the image and the depth feature information is realized.
2.3, inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
In order to obtain the camera pose vector between the target frame image and the reference frame image, a camera pose estimation network is also constructed in advance in the embodiments of the present application; its structure may be as shown in fig. 5 and consists of convolution layers with different parameters. Specifically, assume that the input sample image sequence is the 5 frames I0, I1, I2, I3 and I4; the sequence is preprocessed into tensor data of a specified dimension, which serves as the input of the camera pose estimation network. The camera pose estimation network uses several convolution layers with specified strides to extract image features and perform downsampling, obtaining the corresponding feature images in turn. For example, in fig. 5, the input tensor data is passed through 8 convolution layers to obtain a 24-dimensional feature vector, which is finally reshaped to 6 × N_ref, where 6 indicates that the camera pose vector is a 6-dimensional vector consisting of 3 translation components and 3 rotation components, and N_ref = 4 indicates that the number of reference frame images in the input sample image sequence is 4.
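The reshaping of the pose network output into per-reference-frame 6-D pose vectors can be sketched as below; the ordering of the rotation and translation halves is an assumption, as is the function name.

```python
# Sketch of reshaping the pose network output into per-reference-frame 6-D pose vectors.
import torch

def split_pose_vectors(pose_out: torch.Tensor, n_ref: int = 4):
    """pose_out: (B, 6 * n_ref) tensor from the last layer of the pose network."""
    pose = pose_out.view(-1, n_ref, 6)   # (B, N_ref, 6)
    rotation = pose[..., :3]             # 3 rotation components (ordering assumed)
    translation = pose[..., 3:]          # 3 translation components
    return rotation, translation
```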
2.4, generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera attitude vector, the reference frame image and the internal reference of a camera adopted for shooting the sample image sequence;
after obtaining the estimated first scene depth image and the camera pose vector, image reconstruction needs to be performed based on these data, and a first reconstructed image corresponding to the target frame image is obtained, so as to perform image reconstruction loss calculation subsequently.
Specifically, the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal reference of the camera used for shooting the sample image sequence may include:
(1) Determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
(2) Calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
(3) Transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image in a world coordinate system after transformation;
(4) Converting the second coordinate into a third coordinate in an image coordinate system;
(5) And reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image.
Assume that the target frame image is $I_{tgt}$, the reference frame image is $I_{ref}$, and the intrinsic matrix of the corresponding camera is $K$. The first scene depth image $D_{tgt}$ corresponding to $I_{tgt}$ can be estimated by the depth estimation network described above, and the camera pose between the two frames is estimated by the camera pose estimation network described above, giving the first transformation matrix $T$ (composed of rotation and translation vectors) that converts the target frame image $I_{tgt}$ to the reference frame image $I_{ref}$. Then, according to the camera intrinsic matrix $K$, the first scene depth image $D_{tgt}$ and the target frame image $I_{tgt}$, the coordinates of the target frame image $I_{tgt}$ in the world coordinate system (the first coordinates) can be calculated. For example, assume that a certain pixel of the target frame image $I_{tgt}$ has image coordinates $p_{tgt} = (u, v)$ and that, according to the first scene depth image $D_{tgt}$, the depth of this pixel is $d_{tgt}$; the coordinates of the pixel in the world coordinate system can then be calculated by the following formulas:

$$X = \frac{(u - c_x)\, d_{tgt}}{f}, \qquad Y = \frac{(v - c_y)\, d_{tgt}}{f}, \qquad Z = d_{tgt}, \qquad P_{tgt} = (X, Y, Z)^{T}$$

where $P_{tgt}$ represents the coordinates of the pixel in the world coordinate system, $(c_x, c_y, f)$ are parameters of the camera intrinsic matrix, $c_x$ and $c_y$ denote the principal point offsets, and $f$ denotes the focal length.
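A minimal back-projection sketch consistent with the relations above is given below; the use of separate focal lengths fx and fy, the tensor layout and the function name are assumptions.

```python
# Sketch of pinhole back-projection: X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d.
import torch

def backproject(depth: torch.Tensor, fx: float, fy: float, cx: float, cy: float):
    """depth: (B, 1, H, W) scene depth image -> (B, 3, H, W) coordinates in 3D space."""
    b, _, h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                          torch.arange(w, dtype=depth.dtype), indexing="ij")
    u = u.to(depth.device).expand(b, 1, h, w)   # column index per pixel
    v = v.to(depth.device).expand(b, 1, h, w)   # row index per pixel
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return torch.cat([x, y, depth], dim=1)      # first coordinates P = (X, Y, Z)
```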
Then, based on the first transformation matrix $T$, the first coordinate $P_{tgt}$ is transformed to obtain the second coordinate $\hat{P}_{tgt}$ of the target frame image $I_{tgt}$ in the world coordinate system after transformation. Specifically, the following calculation can be adopted:

$$\hat{P}_{tgt} = R\, P_{tgt} + t$$

where the rotation matrix $R$ and the translation vector $t$ are determined by $(R_x, R_y, R_z, t) \in SE3$, the 3D rotation angles and translation vector, which can be obtained from the first transformation matrix $T$. $R_x$, $R_y$ and $R_z$ denote the rotations about the x-, y- and z-axes of the world coordinate system respectively, $t$ denotes the translations along the x-, y- and z-axes, and SE3 denotes the special Euclidean group.
Then, the second coordinate is converted into the third coordinate $\hat{p}_{ref}$ in the image coordinate system. Specifically, the conversion can be performed by the following formulas:

$$\hat{P}_{tgt} = (\hat{X}, \hat{Y}, \hat{Z})^{T} = T_{tgt \to ref}\, P_{tgt}, \qquad \hat{p}_{ref} = \frac{1}{\hat{Z}}\, K\, \hat{P}_{tgt}$$

where $T_{tgt \to ref}$ represents the camera extrinsic matrix composed of a rotation matrix and a translation matrix.
After the third coordinate $\hat{p}_{ref}$ is obtained, an image $\hat{I}_{tgt}$ of the reference frame image $I_{ref}$ after affine transformation can be reconstructed through a bilinear sampling mechanism, based on the reference frame image $I_{ref}$ and using the third coordinate as grid points, and the reconstructed image $\hat{I}_{tgt}$ is determined as the first reconstructed image. The principle of the bilinear sampling mechanism can be found in the prior art and is not described here again.
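The projection-and-sampling step above can be sketched as follows, using torch.nn.functional.grid_sample for the bilinear sampling; the batched tensor shapes and the function name are illustrative assumptions, not the patent's implementation.

```python
# Sketch of reconstructing the target frame by warping the reference frame:
# transform the first coordinates, project with the intrinsics, normalise to
# [-1, 1] grid points and sample the reference image bilinearly.
import torch
import torch.nn.functional as F

def reconstruct_target(ref_img, world_pts, rotation, translation, K):
    """ref_img: (B,3,H,W); world_pts: (B,3,H,W) first coordinates;
    rotation: (B,3,3); translation: (B,3,1); K: (B,3,3) intrinsic matrices."""
    b, _, h, w = ref_img.shape
    pts = world_pts.view(b, 3, -1)                     # (B, 3, H*W)
    cam = rotation @ pts + translation                 # second coordinates
    pix = K @ cam                                      # project with intrinsics
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)     # third coordinates (u, v)
    u = 2.0 * pix[:, 0] / (w - 1) - 1.0                # normalise to [-1, 1]
    v = 2.0 * pix[:, 1] / (h - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(b, h, w, 2)
    return F.grid_sample(ref_img, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```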
2.5, calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, wherein the first image reconstruction loss is used for measuring the difference between the target frame image and the first reconstructed image;
Specifically, the first image reconstruction loss can be calculated by the following formula:

$$L_{rec}^{fwd} = \alpha\, \frac{1 - \mathrm{SSIM}(I_{tgt}, \hat{I}_{tgt})}{2} + (1 - \alpha)\, \mathrm{ERF}(I_{tgt} - \hat{I}_{tgt})$$

where $L_{rec}^{fwd}$ represents this first image reconstruction loss and $\alpha$ is a preset weighting parameter, which may for example be 0.85. $\mathrm{SSIM}(\cdot,\cdot)$ is the structural similarity metric function:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\delta_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\delta_x^2 + \delta_y^2 + c_2)}$$

where $\mu$ and $\delta$ are the pixel mean and variance respectively, $c_1 = 0.01^2$ and $c_2 = 0.03^2$. $\mathrm{ERF}(\cdot)$ is a robust error metric function with parameter $\epsilon = 0.01$.
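A sketch of a photometric loss of this form is given below, with alpha = 0.85; a plain L1 term stands in for the robust ERF metric, whose exact form is not given here, and the function names are illustrative.

```python
# Sketch of the SSIM/L1-style photometric loss (L1 substitutes for the ERF term).
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_loss(target, recon, alpha=0.85):
    ssim_term = (1.0 - ssim(target, recon)) / 2.0
    l1_term = (target - recon).abs()
    return alpha * ssim_term + (1.0 - alpha) * l1_term   # per-pixel loss map
```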
2.6, constructing an objective function based on the first image reconstruction loss;
after obtaining the first image reconstruction loss, an objective function may be constructed based on the first image reconstruction loss to complete parameter updating of the depth estimation network.
In an embodiment of the present application, after acquiring the sample image sequence, the method may further include:
(1) Inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
(2) Generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera attitude vector, the target frame image and internal parameters of a camera adopted for shooting the sample image sequence;
(3) And calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, wherein the second image reconstruction loss is used for measuring the difference between the reference frame image and the second reconstructed image.
Specifically, the generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal reference of the camera used for shooting the sample image sequence may include:
(2.1) determining a second transformation matrix for the reference frame image to transform to the target frame image according to the camera pose vector;
(2.2) calculating a fourth coordinate of the reference frame image in a world coordinate system according to the internal reference of the camera and the second scene depth image;
(2.3) transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the transformed reference frame image in a world coordinate system;
(2.4) converting the fifth coordinate into a sixth coordinate in an image coordinate system;
and (2.5) reconstructing an image of the target frame image after affine transformation by using the sixth coordinate as a grid point through a bilinear sampling mechanism based on the target frame image, and determining the reconstructed image as the second reconstructed image.
The calculation is similar to that of the first image reconstruction loss. When calculating the second image reconstruction loss, assume again that the target frame image is $I_{tgt}$, the reference frame image is $I_{ref}$, and the intrinsic matrix of the corresponding camera is $K$. The second scene depth image $D_{ref}$ corresponding to $I_{ref}$ can be estimated by the depth estimation network described above, and the camera pose between the two frames is estimated by the camera pose estimation network described above, giving the second transformation matrix $T_{inv}$ that converts the reference frame image $I_{ref}$ to the target frame image $I_{tgt}$; this second transformation matrix is determined from the first transformation matrix $T$ that converts the target frame image $I_{tgt}$ to the reference frame image $I_{ref}$. Then, according to the camera intrinsic matrix $K$, the second scene depth image $D_{ref}$ and the reference frame image $I_{ref}$, the coordinates of the reference frame image $I_{ref}$ in the world coordinate system (the fourth coordinates) can be calculated. Next, the fourth coordinate is transformed based on the second transformation matrix $T_{inv}$ to obtain the fifth coordinate of the reference frame image $I_{ref}$ in the world coordinate system after transformation, after which the sixth coordinate in the image coordinate system is calculated; for the specific coordinate transformation steps, reference may be made to the description of calculating the first image reconstruction loss above. Finally, based on the target frame image $I_{tgt}$ and using the sixth coordinate as grid points, an image $\hat{I}_{ref}$ of the target frame image after affine transformation can be reconstructed through the bilinear sampling mechanism, and the reconstructed image $\hat{I}_{ref}$ is determined as the second reconstructed image. The following formula may be used to calculate the second image reconstruction loss:

$$L_{rec}^{bwd} = \alpha\, \frac{1 - \mathrm{SSIM}(I_{ref}, \hat{I}_{ref})}{2} + (1 - \alpha)\, \mathrm{ERF}(I_{ref} - \hat{I}_{ref})$$

For the definition of the respective parameters in the above formula, reference may be made to the description of the formula for calculating the first image reconstruction loss above.
The first image reconstruction loss can be regarded as a forward image reconstruction loss and the second image reconstruction loss as a backward image reconstruction loss, so that a bidirectional image reconstruction loss can be constructed based on the two image reconstruction losses; a specific calculation formula can be:

$$L_{rec} = L_{rec}^{fwd} + L_{rec}^{bwd}$$
an objective function may then be constructed based on the bi-directional image reconstruction loss. By adding bidirectional image reconstruction loss in an objective function of the depth estimation network, potential information in image data can be fully mined, and the robustness of a depth estimation algorithm is further improved.
In one embodiment of the present application, the method may further include:
(1) Acquiring a seventh coordinate of the target frame image in an image coordinate system;
(2) Performing difference processing on corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
(3) Acquiring an eighth coordinate of the reference frame image in an image coordinate system;
(4) Performing difference processing on corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
(5) Performing affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second forward flow coordinate;
(6) Performing affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second backward flow coordinate;
(7) Calculating a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, wherein the forward flow occlusion mask is used for measuring the matching degree between the first forward flow coordinate and the second forward flow coordinate;
(8) And calculating a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, wherein the backward flow occlusion mask is used for measuring the matching degree between the first backward flow coordinate and the second backward flow coordinate.
This process can be summarized as a bidirectional flow consistency check, comprising a forward flow consistency check and a backward flow consistency check. First, the seventh coordinate $p_{tgt}$ of the target frame image in the image coordinate system and the third coordinate $\hat{p}_{ref}$ described above are acquired, and difference processing of corresponding elements is performed on the third coordinate and the seventh coordinate to obtain the first forward flow coordinate $F_{fwd}$, as shown in the following equation:

$$F_{fwd} = \hat{p}_{ref} - p_{tgt}$$
Similarly, the eighth coordinate $p_{ref}$ of the reference frame image in the image coordinate system and the sixth coordinate (which may be expressed as $\hat{p}_{tgt}$) are acquired, and difference processing of corresponding elements is performed on the sixth coordinate and the eighth coordinate to obtain the first backward flow coordinate $F_{bwd}$, as shown in the following equation:

$$F_{bwd} = \hat{p}_{tgt} - p_{ref}$$
Then, using the third coordinate as grid coordinates, affine transformation is performed on the first backward flow coordinate $F_{bwd}$ through a bilinear sampling mechanism to synthesize the second forward flow coordinate $\hat{F}_{fwd}$. In the ideal case, the synthesized forward flow coordinate $\hat{F}_{fwd}$ and the calculated forward flow coordinate $F_{fwd}$ are equal in magnitude and opposite in direction; this is forward flow consistency.
Similarly, using the sixth coordinate as grid coordinates, affine transformation is performed on the first forward flow coordinate $F_{fwd}$ through a bilinear sampling mechanism to synthesize the second backward flow coordinate $\hat{F}_{bwd}$. In the ideal case, the synthesized backward flow coordinate $\hat{F}_{bwd}$ and the calculated backward flow coordinate $F_{bwd}$ are equal in magnitude and opposite in direction; this is backward flow consistency.
Next, a forward flow occlusion mask $M_{fwd}$ may be calculated from the first forward flow coordinate and the second forward flow coordinate; this mask is used to measure the degree of matching between the first forward flow coordinate and the second forward flow coordinate, and may specifically be calculated by the following formula:

$$M_{fwd} = \left[\, \lVert F_{fwd} + \hat{F}_{fwd} \rVert^{2} < \alpha_1 \left( \lVert F_{fwd} \rVert^{2} + \lVert \hat{F}_{fwd} \rVert^{2} \right) + \alpha_2 \,\right]$$

where $[\cdot]$ denotes the indicator function and the parameters are $\alpha_1 = 0.01$ and $\alpha_2 = 0.5$.
A backward flow occlusion mask $M_{bwd}$ may be calculated from the first backward flow coordinate and the second backward flow coordinate in the same way; this mask is used to measure the degree of matching between the first backward flow coordinate and the second backward flow coordinate:

$$M_{bwd} = \left[\, \lVert F_{bwd} + \hat{F}_{bwd} \rVert^{2} < \alpha_1 \left( \lVert F_{bwd} \rVert^{2} + \lVert \hat{F}_{bwd} \rVert^{2} \right) + \alpha_2 \,\right]$$

where the definitions of the parameters are the same as above.
After calculating the two stream occlusion masks, calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss may include:
calculating the bi-directional image reconstruction loss from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
The occlusion mask is used for judging whether an occlusion object exists in continuous video frames or not, and the occlusion mask is added into calculation of bidirectional image reconstruction loss, so that the accuracy of depth estimation of the image with the occlusion object by the depth estimation network can be improved.
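A sketch of the bidirectional-flow occlusion check is given below: pixels where a flow and its synthesized counterpart fail to cancel are marked as occluded. The threshold form with alpha1 = 0.01 and alpha2 = 0.5 follows the reconstruction above and is an assumption about the exact formula; the function name is illustrative.

```python
# Sketch of an occlusion mask from bidirectional flow consistency.
import torch

def flow_occlusion_mask(flow, flow_synth, alpha1=0.01, alpha2=0.5):
    """flow, flow_synth: (B, 2, H, W) flow maps that should cancel when consistent.
    Returns a (B, 1, H, W) mask of non-occluded pixels."""
    mismatch = (flow + flow_synth).pow(2).sum(dim=1, keepdim=True)    # ~0 when consistent
    magnitude = flow.pow(2).sum(dim=1, keepdim=True) \
        + flow_synth.pow(2).sum(dim=1, keepdim=True)
    return (mismatch < alpha1 * magnitude + alpha2).float()
```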
Further, the method may further include:
(1) Determining a first scene depth value of the target frame image according to the second coordinate;
(2) Determining a second scene depth value of the reference frame image according to the fifth coordinate;
(3) Acquiring a third scene depth value of a pixel point corresponding to the second coordinate in the first scene depth image;
(4) Acquiring a fourth scene depth value of a pixel point corresponding to the fifth coordinate in the second scene depth image;
(5) Reconstructing a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
(6) Reconstructing a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
(7) Calculating a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, wherein the forward scene structure consistency loss is used for measuring a difference between a scene depth value of the target frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the target frame image;
(8) Calculating a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, wherein the backward scene structure consistency loss is used for measuring a difference between a scene depth value of the reference frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the reference frame image;
(9) And calculating the consistency loss of the bidirectional scene structure according to the consistency loss of the forward scene structure and the consistency loss of the backward scene structure.
The above steps are used to calculate the bidirectional scene structure consistency loss. First, according to the second coordinate $\hat{P}_{tgt}$ described above, the corresponding scene depth value $\tilde{d}_{tgt}$ (the first scene depth value) can be obtained, and according to the fifth coordinate $\hat{P}_{ref}$ described above, the corresponding scene depth value $\tilde{d}_{ref}$ (the second scene depth value) can be obtained. Then, from the first scene depth image, the depth value $d_{tgt}$ of the corresponding pixel point in the target frame image $I_{tgt}$ can be estimated (the third scene depth value), and from the second scene depth image, the depth value $d_{ref}$ of the corresponding pixel point in the reference frame image $I_{ref}$ can be estimated (the fourth scene depth value). Next, based on the third coordinate and the depth value $d_{ref}$, the fifth scene depth value $\hat{d}_{tgt}$ of the target frame image can be reconstructed through a bilinear sampling mechanism, and based on the sixth coordinate and the depth value $d_{tgt}$, the sixth scene depth value $\hat{d}_{ref}$ of the reference frame image can be reconstructed through a bilinear sampling mechanism.

Theoretically, the first scene depth value $\tilde{d}_{tgt}$ and the fifth scene depth value $\hat{d}_{tgt}$ should be equal, and the second scene depth value $\tilde{d}_{ref}$ and the sixth scene depth value $\hat{d}_{ref}$ should be equal. However, experimental tests show that they are not always equal, so the forward scene structure error $e_{fwd}$ and the backward scene structure error $e_{bwd}$ can be calculated by the following two formulas, and a consistency constraint is then applied to the scene structure:

$$e_{fwd} = \frac{\lvert \tilde{d}_{tgt} - \hat{d}_{tgt} \rvert}{\tilde{d}_{tgt} + \hat{d}_{tgt}}, \qquad e_{bwd} = \frac{\lvert \tilde{d}_{ref} - \hat{d}_{ref} \rvert}{\tilde{d}_{ref} + \hat{d}_{ref}}$$

By applying the consistency constraint to the scene structure, the positions of moving objects and occluding objects in the image scene can be located; for example, the larger the values of $e_{fwd}$ and $e_{bwd}$ are, the more likely it is that a moving object or an occluding object exists at that position.
Then, the forward scene structure consistency loss is calculated; it is used to measure the difference between the scene depth value of the target frame image calculated through multi-view geometric transformation and the scene depth value of the reconstructed target frame image, and can specifically be calculated with the following formula:

L_dsc^fwd = (1 / N_ref) · Σ D_tgt^diff

wherein N_ref represents the number of valid grid coordinates in the reference frame image, and the sum runs over those coordinates.
The backward scene structure consistency loss, which is used to measure the difference between the scene depth value of the reference frame image calculated through multi-view geometric transformation and the scene depth value of the reconstructed reference frame image, can be calculated with the following formula:

L_dsc^bwd = (1 / N_tgt) · Σ D_ref^diff

wherein N_tgt represents the number of valid grid coordinates in the target frame image, and the sum runs over those coordinates.
Finally, according to the forward scene structure consistency loss and the backward scene structure consistency loss, the bidirectional scene structure consistency loss can be calculated as follows:
L_dsc = L_dsc^fwd + L_dsc^bwd
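A minimal sketch of how the forward, backward and bidirectional scene structure consistency losses described above could be computed, assuming the normalized depth difference given above and assuming the two directional losses are simply summed; the valid-coordinate masks and variable names are illustrative.

```python
import torch

def scene_structure_error(d_proj, d_rec, eps=1e-7):
    """Normalized difference between the depth obtained by geometric
    transformation (e.g. the first scene depth value) and the depth
    reconstructed by bilinear sampling (e.g. the fifth scene depth value).
    Values lie in [0, 1)."""
    return (d_proj - d_rec).abs() / (d_proj + d_rec + eps)

def bidirectional_dsc_loss(d_tgt_proj, d_tgt_rec, d_ref_proj, d_ref_rec,
                           valid_ref, valid_tgt):
    """valid_ref / valid_tgt: boolean masks of the valid grid coordinates
    (N_ref and N_tgt points respectively)."""
    err_fwd = scene_structure_error(d_tgt_proj, d_tgt_rec)
    err_bwd = scene_structure_error(d_ref_proj, d_ref_rec)
    loss_fwd = err_fwd[valid_ref].mean()   # forward scene structure consistency loss
    loss_bwd = err_bwd[valid_tgt].mean()   # backward scene structure consistency loss
    return loss_fwd + loss_bwd, err_fwd, err_bwd
```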
the constructing the objective function based on the bidirectional image reconstruction loss may include:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
By adding the bidirectional scene structure consistency loss when the objective function is constructed, occluding objects and moving objects in the image scene to be detected can be handled effectively, thereby improving the accuracy of scene depth estimation.
On the other hand, when calculating the bidirectional image reconstruction loss, the two occlusion masks and the two scene structure errors described above may be introduced at the same time, and may be calculated by using the following formula:
L_photo = (1 / N_ref) · Σ M_fwd · (1 − D_tgt^diff) · L_photo^fwd + (1 / N_tgt) · Σ M_bwd · (1 − D_ref^diff) · L_photo^bwd

wherein L_photo^fwd and L_photo^bwd denote the per-pixel first and second image reconstruction losses, M_fwd and M_bwd denote the forward and backward flow occlusion masks, and (1 − D_tgt^diff) and (1 − D_ref^diff) are the forward and backward scene structure inconsistency weights derived from the scene structure errors. Weighting the image reconstruction loss functions with these terms achieves the purpose of handling occluded and moving objects. Specifically, the first image reconstruction loss is weighted by the forward flow occlusion mask and the forward scene structure inconsistency weight; the second image reconstruction loss is weighted by the backward flow occlusion mask and the backward scene structure inconsistency weight; and the bidirectional image reconstruction loss is constructed based on the weighted first image reconstruction loss and the weighted second image reconstruction loss.
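The weighting described above can be sketched as follows; the exact combination of occlusion mask, inconsistency weight and normalization is an assumption, shown here as a weighted average of the per-pixel reconstruction losses.

```python
import torch

def weighted_bidirectional_photo_loss(l_photo_fwd, l_photo_bwd,
                                      mask_fwd, mask_bwd,
                                      err_fwd, err_bwd, eps=1e-7):
    """l_photo_fwd / l_photo_bwd: per-pixel first / second image reconstruction losses.
    mask_fwd / mask_bwd: forward / backward flow occlusion masks in [0, 1].
    err_fwd / err_bwd: forward / backward scene structure errors in [0, 1)."""
    w_fwd = mask_fwd * (1.0 - err_fwd)   # forward weight: mask x inconsistency weight
    w_bwd = mask_bwd * (1.0 - err_bwd)   # backward weight
    loss_fwd = (w_fwd * l_photo_fwd).sum() / (w_fwd.sum() + eps)
    loss_bwd = (w_bwd * l_photo_bwd).sum() / (w_bwd.sum() + eps)
    return loss_fwd + loss_bwd
```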
In an embodiment of the present application, the depth estimation network includes an encoding network, and the method may further include:
(1) Acquiring a first characteristic image of the target frame image and a second characteristic image of the reference frame image through the coding network;
(2) Reconstructing a third characteristic image of the target frame image through a bilinear sampling mechanism based on the third coordinate and the second characteristic image;
(3) Reconstructing a fourth characteristic image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first characteristic image;
(4) And calculating to obtain a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, wherein the bidirectional feature perception loss is used for measuring a difference between the feature image of the target frame image obtained through the coding network and the feature image of the reconstructed target frame image, and a difference between the feature image of the reference frame image obtained through the coding network and the feature image of the reconstructed reference frame image.
The above steps are used to calculate the bidirectional feature perception loss. Compared with the original RGB image, the features extracted by the encoder are more discriminative in weak-texture areas. The present application uses the highest-resolution feature image extracted by the coding network to handle weak-texture regions: through the coding network in the depth estimation network, the feature image f_tgt of the target frame image (the first feature image) and the feature image f_ref of the reference frame image (the second feature image) can be extracted. Then, based on the third coordinate and the feature image f_ref of the reference frame image, the feature image f_ref can be affine-transformed through a bilinear sampling mechanism to reconstruct the third feature image f̃_tgt of the target frame image; based on the sixth coordinate and the feature image f_tgt of the target frame image, the feature image f_tgt can be affine-transformed through a bilinear sampling mechanism to reconstruct the fourth feature image f̃_ref of the reference frame image. Then, the bidirectional feature perception loss can be calculated with the following formula:

L_feat = (1 / N_ref) · Σ |f_tgt − f̃_tgt| + (1 / N_tgt) · Σ |f_ref − f̃_ref|

The bidirectional feature perception loss L_feat is used to measure the difference between the feature image of the target frame image obtained through the coding network and the feature image of the reconstructed target frame image, and the difference between the feature image of the reference frame image obtained through the coding network and the feature image of the reconstructed reference frame image.
The constructing the objective function based on the bidirectional image reconstruction loss may include:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
By introducing bidirectional feature perception loss into the objective function, the weak texture scene in the image to be detected can be effectively processed, and therefore the accuracy of scene depth estimation is improved.
Further, after the first feature image of the target frame image and the second feature image of the reference frame image are acquired through the coding network, the method may further include:
and calculating to obtain a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first characteristic image and the second characteristic image, wherein the smoothing loss is used for regularizing gradients of the scene depth image and the characteristic image obtained through the depth estimation network.
The constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss may include:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
In order to regularize the gradients of the scene depth images and the feature images obtained by the depth estimation network, a smoothing loss L_s may be introduced into the objective function, which can specifically be calculated with the following formula:

L_s = Σ |∂d_tgt| · e^(−|∂I_tgt|) + Σ |∂d_ref| · e^(−|∂I_ref|) + Σ |∂f_tgt| · e^(−|∂I_tgt|) + Σ |∂f_ref| · e^(−|∂I_ref|)

wherein |∂d_ref| denotes taking the partial derivatives of the reference frame depth map d_ref estimated by the depth estimation network and then taking the absolute value at each element position, |∂I_ref| denotes taking the partial derivatives of the reference frame image I_ref and then taking the absolute value at each element position, e^(−|∂I_ref|) denotes the natural exponential with −|∂I_ref| as the power, and so on.
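The |∂·| · e^(−|∂I|) structure above corresponds to the commonly used edge-aware smoothness term. The sketch below shows one way to compute it for a single predicted map; the full L_s would sum this term over the depth and feature maps of both frames, where the exact grouping is an assumption.

```python
import torch

def edge_aware_smoothness(pred, img):
    """First-order edge-aware smoothness: gradients of the predicted map
    (a depth image or feature image) are down-weighted where the RGB image
    itself has strong gradients.

    pred: (N, C, H, W) predicted depth or feature map
    img:  (N, 3, H, W) corresponding RGB frame
    """
    grad_pred_x = (pred[:, :, :, :-1] - pred[:, :, :, 1:]).abs()
    grad_pred_y = (pred[:, :, :-1, :] - pred[:, :, 1:, :]).abs()
    grad_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    grad_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (grad_pred_x * torch.exp(-grad_img_x)).mean() + \
           (grad_pred_y * torch.exp(-grad_img_y)).mean()
```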
In the foregoing, four types of loss functions are proposed, which are bidirectional image reconstruction loss, smoothing loss, bidirectional scene structure consistency loss and bidirectional feature perception loss, and a final objective function can be constructed and obtained based on the loss functions. For example, the expression of a certain objective function L is as follows:
L = λ_photo · L_photo + λ_s · L_s + λ_dsc · L_dsc + λ_feat · L_feat

wherein each λ is a set coefficient, for example, λ_photo = 1.0, λ_s = 0.001, λ_dsc = 0.5 and λ_feat = 0.05.
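A one-line sketch of the example objective above with the quoted coefficients; the individual loss values are assumed to be scalars computed as in the preceding sketches.

```python
def total_objective(l_photo, l_s, l_dsc, l_feat,
                    lam_photo=1.0, lam_s=0.001, lam_dsc=0.5, lam_feat=0.05):
    """Weighted sum of the four loss terms,
    L = lam_photo*L_photo + lam_s*L_s + lam_dsc*L_dsc + lam_feat*L_feat."""
    return (lam_photo * l_photo + lam_s * l_s +
            lam_dsc * l_dsc + lam_feat * l_feat)
```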
In addition, in the calculation of each loss function described above, the calculation for a single reference frame image is illustrated. If there are multiple reference frame images, a corresponding loss value may be calculated for each reference frame image in the same manner as described above, and the average of the loss values corresponding to the reference frame images may then be used as the loss value for the final construction of the objective function.
And 2.7, updating the parameters of the depth estimation network according to the objective function.
After the objective function is constructed, the parameters of the depth estimation network can be updated according to the objective function, so that the purpose of optimizing and training the network is achieved. Specifically, an AdamW optimizer can be used to solve a gradient of the objective function relative to the weight of the depth estimation network, and the weight of the depth estimation network is updated according to the gradient, so that iteration is performed continuously until a set maximum number of iterations is reached, and training of the depth estimation network is completed.
Further, the objective function may be used to train the camera pose estimation network described above. Similarly, an AdamW optimizer can be used for solving the gradient of the objective function relative to the weight of the camera pose estimation network, the weight of the camera pose estimation network is updated according to the gradient, iteration is carried out continuously until the set maximum iteration number is reached, and the training of the camera pose estimation network is completed. In general, after the objective function is constructed, the objective function can be used as a supervision signal to jointly guide the training of the depth estimation network and the camera pose estimation network. Specifically, an AdamW optimizer can be used to solve the gradient of the objective function relative to the weight of the depth estimation network and the gradient of the objective function relative to the weight of the camera pose estimation network, and update the weights of the depth estimation network and the camera pose estimation network at the same time according to the gradients, so that iteration is performed continuously until the set maximum iteration number is reached, and the joint training of the depth estimation network and the camera pose estimation network is completed.
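A skeleton of the joint optimization described above using torch.optim.AdamW; compute_objective, the data loader format, the learning rate and the maximum iteration count are placeholders and assumptions, not the patent's implementation.

```python
import itertools
import torch

def train_jointly(depth_net, pose_net, loader, compute_objective,
                  lr=1e-4, max_iterations=200000):
    """Jointly update the depth estimation network and the camera pose
    estimation network with AdamW until the set maximum iteration count."""
    optimizer = torch.optim.AdamW(
        itertools.chain(depth_net.parameters(), pose_net.parameters()), lr=lr)
    for step, (tgt, refs, intrinsics) in enumerate(loader):
        if step >= max_iterations:
            break
        # Objective function L built from the losses described above.
        loss = compute_objective(depth_net, pose_net, tgt, refs, intrinsics)
        optimizer.zero_grad()
        loss.backward()    # gradients of L w.r.t. both networks' weights
        optimizer.step()   # simultaneous update of both networks
```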
After the training of the two networks is completed, a monocular image (for example, an image to be measured) can be used as an input of the depth estimation network, and a corresponding scene depth image is directly calculated. A sequence of consecutive images (e.g., any sequence of 5 monocular images) may also be used as input to the camera pose estimation network to compute the corresponding camera pose vector. It should be noted that the depth estimation network and the camera pose estimation network only need to be jointly optimized during training, the weights of the networks are fixed after training is completed, reverse propagation is not needed during testing, only forward propagation is needed, and therefore the two networks can be used independently during testing.
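At test time the two networks are used independently with forward passes only, for example as follows; the tensor shapes and the added batch dimension are assumptions about the input format.

```python
import torch

@torch.no_grad()
def predict_depth(depth_net, image):
    """Single monocular image (C, H, W) -> scene depth image."""
    depth_net.eval()
    return depth_net(image.unsqueeze(0))   # add the batch dimension

@torch.no_grad()
def predict_pose(pose_net, image_sequence):
    """Short monocular sequence, e.g. 5 stacked frames -> camera pose vectors."""
    pose_net.eval()
    return pose_net(image_sequence.unsqueeze(0))
```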
When the depth estimation network adopted in the embodiments of the present application optimizes and updates its parameters, the camera pose vector of the input sample image sequence is first predicted in combination with the camera pose estimation network, the sample image sequence comprising a target frame image and a reference frame image; then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the internal parameters of the corresponding camera; next, the loss function associated with the image reconstruction is calculated according to the target frame image and the reconstructed image; and finally, the objective function is constructed based on the loss function and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the latent image information contained in the target frame image and the reference frame image can be fully mined; in other words, sufficient image information can be obtained from fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data acquisition.
In addition, by adding the bidirectional image reconstruction loss, the bidirectional scene structure consistency loss, the bidirectional feature perception loss and the smoothing loss to the objective function, the latent information contained in the images can be further mined and the acquisition cost of sample data reduced, problems such as moving objects and occlusion in video frames can be handled effectively, and the robustness to weak-texture environments is improved.
The technical effects of the image scene depth estimation and camera pose estimation proposed by the present application are illustrated below through simulation results. The test set of the Eigen split is used as the evaluation data for the depth estimation network, and sequences 09-10 of the KITTI odometry dataset are used as the evaluation data for the camera pose estimation network.
The evaluation metrics adopted for the depth estimation network include: absolute relative error (AbsRel), squared relative error (SqRel), root mean square error (RMSE), logarithmic root mean square error (RMSElog) and the threshold accuracy (δ_t); the evaluation metric adopted for the camera pose estimation network is the Absolute Trajectory Error (ATE). Through simulation tests, the results of the method proposed in the present application compared with prior-art algorithms are shown in Tables 1 to 3 below.
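For reference, the commonly used definitions of these depth evaluation metrics over the valid ground-truth pixels can be sketched as follows; these are the standard formulas, not text taken from the patent.

```python
import numpy as np

def depth_metrics(pred, gt):
    """pred, gt: 1-D arrays of predicted and ground-truth depths at valid pixels."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)                      # AbsRel
    sq_rel = np.mean((pred - gt) ** 2 / gt)                        # SqRel
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                      # RMSE
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))  # RMSElog
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** t) for t in (1, 2, 3)]       # delta_t thresholds
    return abs_rel, sq_rel, rmse, rmse_log, deltas
```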
TABLE 1
(Table 1 is provided as an image in the original publication: quantitative comparison of monocular depth prediction results within an 80 m depth range.)
Table 1 shows a comparison of scene depth prediction results for monocular images within a depth range of 80 m. The absolute relative error (AbsRel), root mean square error (RMSE), squared relative error (SqRel) and logarithmic root mean square error (RMSElog) are error values used to measure the accuracy of an algorithm: the smaller the error value, the higher the accuracy. The threshold metric (δ_t) represents how close the predicted scene depth is to the ground truth: the higher the threshold metric, the better the stability of the algorithm. The test results in Table 1 show that, compared with prior-art algorithms, the method provided by the present application achieves higher scene depth prediction accuracy and better algorithm stability.
TABLE 2
(Table 2 is provided as an image in the original publication: quantitative comparison of monocular depth prediction results within a 50 m depth range.)
Table 2 shows a comparison of scene depth prediction results for monocular images within a depth range of 50 m. The test results in Table 2 likewise show that, compared with prior-art algorithms, the method provided by the present application achieves higher scene depth prediction accuracy and better algorithm stability, and can therefore predict the scene depth of a monocular image, and more of its details, more robustly.
TABLE 3
(Table 3 is provided as an image in the original publication: comparison of absolute trajectory error for camera pose estimation on the KITTI odometry sequences.)
The Absolute Trajectory Error (ATE) in Table 3 represents the difference between the ground-truth camera trajectory and the predicted camera trajectory; the smaller the error value, the more accurate the predicted camera pose. The simulation results show that, compared with various existing algorithms, the method provided by the present application predicts the camera pose more accurately.
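A simplified sketch of the absolute trajectory error; full evaluations usually also align scale and rotation (for example with a Umeyama alignment), which is omitted here for brevity.

```python
import numpy as np

def absolute_trajectory_error(pred_xyz, gt_xyz):
    """pred_xyz, gt_xyz: (N, 3) camera positions of the predicted and
    ground-truth trajectories. Trajectories are aligned by their centroids
    and the RMSE of the remaining position differences is returned."""
    pred_aligned = pred_xyz - pred_xyz.mean(axis=0) + gt_xyz.mean(axis=0)
    return np.sqrt(np.mean(np.sum((pred_aligned - gt_xyz) ** 2, axis=1)))
```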
In addition, fig. 6 is a comparison of the monocular depth prediction results of the image scene depth estimation method of the present application and of various prior-art algorithms, where the Ground Truth depth map is obtained by visualizing lidar data.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
The above mainly describes a method for estimating depth of an image scene, and an apparatus for estimating depth of an image scene will be described below.
Referring to fig. 7, an embodiment of an apparatus for estimating depth of an image scene in an embodiment of the present application includes:
an image to be measured acquisition module 701, configured to acquire an image to be measured;
a scene depth estimation module 702, configured to input the image to be detected into a depth estimation network that is constructed in advance, to obtain a scene depth image of the image to be detected;
a sample obtaining module 703, configured to obtain a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is an image of more than one frame in the sample image sequence before or after the target frame image;
a first scene depth prediction module 704, configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image;
a camera pose estimation module 705, configured to input the sample image sequence into a camera pose estimation network constructed in advance, to obtain a predicted camera pose vector between the target frame image and the reference frame image;
a first image reconstruction module 706, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and an internal reference of a camera used for capturing the sample image sequence;
a first image reconstruction loss calculation module 707, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure a difference between the target frame image and the first reconstructed image;
an objective function construction module 708 for constructing an objective function based on the first image reconstruction loss;
a network parameter updating module 709, configured to update parameters of the depth estimation network according to the objective function.
In one embodiment of the present application, the apparatus may further include:
the second scene depth prediction module is used for inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
a second image reconstruction module, configured to generate a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and an internal reference of a camera used for shooting the sample image sequence;
a second image reconstruction loss calculation module, configured to calculate a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure a difference between the reference frame image and the second reconstructed image;
the objective function building module may include:
a bidirectional image reconstruction loss calculation unit for calculating a bidirectional image reconstruction loss from the first image reconstruction loss and the second image reconstruction loss;
and the objective function construction unit is used for constructing the objective function based on the bidirectional image reconstruction loss.
Further, the first image reconstruction module may include:
a first transformation matrix determination unit for determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
the first coordinate calculation unit is used for calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
the first coordinate transformation unit is used for transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image under a world coordinate system after being transformed;
a first coordinate conversion unit for converting the second coordinate into a third coordinate in an image coordinate system;
the first image reconstruction unit is used for reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image;
the second image reconstruction module may include:
a second transformation matrix determination unit for determining a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
the second coordinate calculation unit is used for calculating a fourth coordinate of the reference frame image in a world coordinate system according to the internal reference of the camera and the second scene depth image;
the second coordinate transformation unit is used for transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the reference frame image in a world coordinate system after transformation;
a second coordinate conversion unit, configured to convert the fifth coordinate into a sixth coordinate in an image coordinate system;
and the second image reconstruction unit is used for reconstructing an image of the target frame image after affine transformation by using the sixth coordinate as a grid point through a bilinear sampling mechanism based on the target frame image, and determining the reconstructed image as the second reconstructed image.
In one embodiment of the present application, the apparatus may further include:
the first coordinate acquisition module is used for acquiring a seventh coordinate of the target frame image in an image coordinate system;
a forward flow coordinate determination module, configured to perform element-wise subtraction between the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
the second coordinate acquisition module is used for acquiring an eighth coordinate of the reference frame image in an image coordinate system;
the backward flow coordinate determination module is used for performing element-wise subtraction between the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
a forward flow coordinate synthesis module, configured to perform affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and using a bilinear sampling mechanism to synthesize a second forward flow coordinate;
the backward flow coordinate synthesis module is used for carrying out affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and adopting a bilinear sampling mechanism so as to synthesize a second backward flow coordinate;
a forward flow occlusion mask calculation module, configured to calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, where the forward flow occlusion mask is used to measure a matching degree between the first forward flow coordinate and the second forward flow coordinate;
a backward flow occlusion mask calculation module, configured to calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, where the backward flow occlusion mask is used to measure a matching degree between the first backward flow coordinate and the second backward flow coordinate;
The bidirectional image reconstruction loss calculation unit may be specifically configured to: calculate the bidirectional image reconstruction loss according to the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask and the backward flow occlusion mask.
In one embodiment of the present application, the apparatus may further include:
a first scene depth value determining module, configured to determine a first scene depth value of the target frame image according to the second coordinate;
a second scene depth value determining module, configured to determine a second scene depth value of the reference frame image according to the fifth coordinate;
a third scene depth value determining module, configured to obtain a third scene depth value of a pixel point corresponding to the second coordinate in the first scene depth image;
a fourth scene depth value determining module, configured to obtain a fourth scene depth value of a pixel point corresponding to the fifth coordinate in the second scene depth image;
a first scene depth value reconstruction module, configured to reconstruct a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
a second scene depth value reconstruction module, configured to reconstruct a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
a forward scene structure consistency loss calculation module, configured to calculate a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, where the forward scene structure consistency loss is used to measure a difference between a scene depth value of the target frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the target frame image;
a backward scene structure consistency loss calculation module, configured to calculate a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, where the backward scene structure consistency loss is used to measure a difference between a scene depth value of the reference frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the reference frame image;
the bidirectional scene structure consistency loss calculation module is used for calculating bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss;
The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
In one embodiment of the present application, the depth estimation network includes a coding network, and the apparatus may further include:
the characteristic image acquisition module is used for acquiring a first characteristic image of the target frame image and a second characteristic image of the reference frame image through the coding network;
the first characteristic image reconstruction module is used for reconstructing a third characteristic image of the target frame image through a bilinear sampling mechanism based on the third coordinate and the second characteristic image;
the second characteristic image reconstruction module is used for reconstructing a fourth characteristic image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first characteristic image;
a bidirectional feature perception loss calculation module, configured to calculate a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image, and the fourth feature image, where the bidirectional feature perception loss is used to measure a difference between a feature image of the target frame image obtained through a coding network and a feature image of the reconstructed target frame image, and a difference between a feature image of the reference frame image obtained through the coding network and a feature image of the reconstructed reference frame image;
The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
Further, the apparatus may further include:
a smoothing loss calculation module, configured to calculate a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, where the smoothing loss is used to regularize the gradients of the scene depth images and the feature images obtained through the depth estimation network;
The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when executed by a processor, the computer program implements any one of the methods for estimating depth of an image scene as shown in fig. 1.
Embodiments of the present application further provide a computer program product, which when running on a terminal device, causes the terminal device to execute a method for estimating depth of an image scene, which implements any one of the methods shown in fig. 1.
Fig. 8 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in fig. 8, the terminal device 8 of this embodiment includes: a processor 80, a memory 81 and a computer program 82 stored in said memory 81 and executable on said processor 80. The processor 80, when executing the computer program 82, implements the steps in the embodiments of the method for estimating depth of an image scene described above, such as the steps 101 to 102 shown in fig. 1. Alternatively, the processor 80, when executing the computer program 82, implements the functions of each module/unit in each device embodiment described above, for example, the functions of the modules 701 to 709 shown in fig. 7.
The computer program 82 may be divided into one or more modules/units that are stored in the memory 81 and executed by the processor 80 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 82 in the terminal device 8.
The Processor 80 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The storage 81 may be an internal storage unit of the terminal device 8, such as a hard disk or a memory of the terminal device 8. The memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 8. Further, the memory 81 may also include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used for storing the computer program and other programs and data required by the terminal device. The memory 81 may also be used to temporarily store data that has been output or is to be output.
It should be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the above described functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules or units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, can realize the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in different jurisdictions; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (7)

1. A method for estimating depth of an image scene, comprising:
acquiring an image to be detected;
inputting the image to be detected into a pre-constructed depth estimation network to obtain a scene depth image of the image to be detected;
wherein the parameters of the depth estimation network are updated by:
acquiring a sample image sequence, wherein the sample image sequence comprises a target frame image and a reference frame image, and the reference frame image is more than one frame of image in the sample image sequence before or after the target frame image;
inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera attitude vector, the reference frame image and internal parameters of a camera adopted for shooting the sample image sequence;
calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, wherein the first image reconstruction loss is used for measuring the difference between the target frame image and the first reconstructed image;
constructing an objective function based on the first image reconstruction loss;
updating parameters of the depth estimation network according to the objective function;
after acquiring the sample image sequence, the method further comprises:
inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera attitude vector, the target frame image and internal parameters of a camera adopted for shooting the sample image sequence;
calculating a second image reconstruction loss from the reference frame image and the second reconstructed image, the second image reconstruction loss being used to measure a difference between the reference frame image and the second reconstructed image;
the constructing an objective function based on the first image reconstruction loss comprises:
calculating bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss;
constructing the objective function based on the bidirectional image reconstruction loss;
wherein the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal reference of the camera used for shooting the sample image sequence includes:
determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image in a world coordinate system after transformation;
converting the second coordinate into a third coordinate in an image coordinate system;
reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and an internal reference of a camera used for shooting the sample image sequence, including:
determining a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
calculating a fourth coordinate of the reference frame image in a world coordinate system according to the internal reference of the camera and the second scene depth image;
transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the reference frame image in a world coordinate system after transformation;
converting the fifth coordinate into a sixth coordinate in an image coordinate system;
based on the target frame image, reconstructing an affine-transformed image of the target frame image by using the sixth coordinate as a grid point through a bilinear sampling mechanism, and determining the reconstructed image as the second reconstructed image;
the method further comprises the following steps:
acquiring a seventh coordinate of the target frame image in an image coordinate system;
performing difference processing on corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
acquiring an eighth coordinate of the reference frame image in an image coordinate system;
performing difference processing on corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
performing affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second forward flow coordinate;
performing affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second backward flow coordinate;
calculating a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, wherein the forward flow occlusion mask is used for measuring the matching degree between the first forward flow coordinate and the second forward flow coordinate;
calculating a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, wherein the backward flow occlusion mask is used for measuring the matching degree between the first backward flow coordinate and the second backward flow coordinate;
said calculating a bi-directional image reconstruction loss from said first image reconstruction loss and said second image reconstruction loss comprises:
calculating the bi-directional image reconstruction loss from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
2. The method of claim 1, further comprising:
determining a first scene depth value of the target frame image according to the second coordinate;
determining a second scene depth value of the reference frame image according to the fifth coordinate;
acquiring a third scene depth value of a pixel point corresponding to the second coordinate in the first scene depth image;
acquiring a fourth scene depth value of a pixel point corresponding to the fifth coordinate in the second scene depth image;
reconstructing a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
reconstructing a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
calculating a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, wherein the forward scene structure consistency loss is used for measuring a difference between a scene depth value of the target frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the target frame image;
calculating a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, wherein the backward scene structure consistency loss is used for measuring the difference between the scene depth value of the reference frame image calculated by multi-view geometric transformation and the reconstructed scene depth value of the reference frame image;
calculating the consistency loss of the bidirectional scene structure according to the consistency loss of the forward scene structure and the consistency loss of the backward scene structure;
the constructing the objective function based on the bi-directional image reconstruction loss comprises:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
3. The method of claim 1 or 2, wherein the depth estimation network comprises an encoding network, the method further comprising:
acquiring a first characteristic image of the target frame image and a second characteristic image of the reference frame image through the coding network;
reconstructing a third characteristic image of the target frame image through a bilinear sampling mechanism based on the third coordinate and the second characteristic image;
reconstructing a fourth characteristic image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first characteristic image;
calculating to obtain a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, wherein the bidirectional feature perception loss is used for measuring a difference between a feature image of the target frame image obtained through an encoding network and a feature image of the reconstructed target frame image, and a difference between a feature image of the reference frame image obtained through the encoding network and a feature image of the reconstructed reference frame image;
the constructing the objective function based on the bidirectional image reconstruction loss comprises:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
4. The method of claim 3, further comprising, after acquiring the first feature image of the target frame image and the second feature image of the reference frame image over the encoding network:
calculating to obtain a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, wherein the smoothing loss is used for regularizing gradients of the scene depth image and the feature image obtained through the depth estimation network;
the constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss comprises:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
5. An apparatus for estimating depth of an image scene, comprising:
the to-be-detected image acquisition module is used for acquiring an image to be detected;
the scene depth estimation module is used for inputting the image to be detected into a depth estimation network which is constructed in advance to obtain a scene depth image of the image to be detected;
the device comprises a sample acquisition module, a processing module and a processing module, wherein the sample acquisition module is used for acquiring a sample image sequence, the sample image sequence comprises a target frame image and a reference frame image, and the reference frame image is more than one frame of image in the sample image sequence before or after the target frame image;
the first scene depth prediction module is used for inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
the camera attitude estimation module is used for inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and an internal reference of a camera used for shooting the sample image sequence;
a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure a difference between the target frame image and the first reconstructed image;
an objective function construction module for constructing an objective function based on the first image reconstruction loss;
the network parameter updating module is used for updating the parameters of the depth estimation network according to the target function;
the second scene depth prediction module is used for inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
a second image reconstruction module, configured to generate a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and an internal reference of a camera used for shooting the sample image sequence;
a second image reconstruction loss calculation module, configured to calculate a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure a difference between the reference frame image and the second reconstructed image;
the objective function building module comprises:
a bidirectional image reconstruction loss calculation unit for calculating a bidirectional image reconstruction loss from the first image reconstruction loss and the second image reconstruction loss;
an objective function construction unit for constructing the objective function based on the bidirectional image reconstruction loss;
the first image reconstruction module comprises:
a first transformation matrix determination unit for determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
the first coordinate calculation unit is used for calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
the first coordinate transformation unit is used for transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image in a world coordinate system after being transformed;
a first coordinate conversion unit for converting the second coordinate into a third coordinate in an image coordinate system;
the first image reconstruction unit is used for reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image;
the second image reconstruction module includes:
a second transformation matrix determination unit for determining a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
the second coordinate calculation unit is used for calculating a fourth coordinate of the reference frame image in a world coordinate system according to the internal reference of the camera and the second scene depth image;
the second coordinate transformation unit is used for transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the reference frame image under a world coordinate system after being transformed;
a second coordinate conversion unit, configured to convert the fifth coordinate into a sixth coordinate in an image coordinate system;
a second image reconstruction unit, configured to reconstruct an image of the target frame image after affine transformation by using the sixth coordinate as a grid point through a bilinear sampling mechanism based on the target frame image, and determine the reconstructed image as the second reconstructed image;
the device further comprises:
the first coordinate acquisition module is used for acquiring a seventh coordinate of the target frame image in an image coordinate system;
a forward flow coordinate determination module, configured to perform element-wise subtraction between the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
the second coordinate acquisition module is used for acquiring an eighth coordinate of the reference frame image in an image coordinate system;
the backward flow coordinate determination module is used for performing element-wise subtraction between the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
a forward flow coordinate synthesis module, configured to perform affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and using a bilinear sampling mechanism to synthesize a second forward flow coordinate;
a backward flow coordinate synthesis module, configured to perform affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and using a bilinear sampling mechanism, so as to synthesize a second backward flow coordinate;
a forward flow occlusion mask calculation module, configured to calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, where the forward flow occlusion mask is used to measure a matching degree between the first forward flow coordinate and the second forward flow coordinate;
a backward flow occlusion mask calculation module, configured to calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, where the backward flow occlusion mask is used to measure a matching degree between the first backward flow coordinate and the second backward flow coordinate;
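The flow coordinates and occlusion masks recited above follow the familiar forward-backward consistency pattern: subtracting the pixel grid from the projected coordinates yields a rigid flow, the opposite-direction flow is warped into the same view, and pixels where the two flows disagree are treated as occluded. The sketch below shows one common way to realize such a check (in the spirit of UnFlow-style consistency); the sign convention and the thresholds alpha and beta are illustrative assumptions, not values from the patent.

```python
import torch

def rigid_flow(projected_uv, pixel_grid):
    # Element-wise difference between the projected coordinates (e.g. the third or
    # sixth coordinate) and the original pixel grid (the seventh or eighth coordinate).
    return projected_uv - pixel_grid                                  # (B, 2, H, W)

def occlusion_mask(flow_fw, flow_bw_warped, alpha=0.01, beta=0.5):
    # Forward-backward consistency: when the warped backward flow is approximately
    # the negative of the forward flow, the pixel is considered visible in both views.
    mismatch = (flow_fw + flow_bw_warped).pow(2).sum(dim=1, keepdim=True)
    magnitude = flow_fw.pow(2).sum(dim=1, keepdim=True) \
              + flow_bw_warped.pow(2).sum(dim=1, keepdim=True)
    return (mismatch < alpha * magnitude + beta).float()              # 1 = visible, 0 = occluded
```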
the bidirectional image reconstruction loss calculation unit is specifically configured to: calculate the bidirectional image reconstruction loss from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
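A hedged sketch of how such a masked, bidirectional combination is commonly realized; the per-pixel normalization and equal weighting below are assumptions, not the patent's exact formulation.

```python
def bidirectional_reconstruction_loss(per_pixel_loss_fw, per_pixel_loss_bw,
                                      mask_fw, mask_bw, eps=1e-6):
    # Weight each per-pixel photometric loss by its occlusion mask so that pixels
    # judged occluded do not contribute, then average over the visible pixels only.
    loss_fw = (per_pixel_loss_fw * mask_fw).sum() / (mask_fw.sum() + eps)
    loss_bw = (per_pixel_loss_bw * mask_bw).sum() / (mask_bw.sum() + eps)
    return loss_fw + loss_bw
```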
6. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method for estimating depth of an image scene according to any one of claims 1 to 4 when executing the computer program.
7. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the image scene depth estimation method according to any one of claims 1 to 4.
CN202110346713.3A 2021-03-31 2021-03-31 Image scene depth estimation method and device, terminal equipment and storage medium Active CN113160294B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110346713.3A CN113160294B (en) 2021-03-31 2021-03-31 Image scene depth estimation method and device, terminal equipment and storage medium
PCT/CN2021/137609 WO2022206020A1 (en) 2021-03-31 2021-12-13 Method and apparatus for estimating depth of field of image, and terminal device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110346713.3A CN113160294B (en) 2021-03-31 2021-03-31 Image scene depth estimation method and device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113160294A CN113160294A (en) 2021-07-23
CN113160294B true CN113160294B (en) 2022-12-23

Family

ID=76885688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110346713.3A Active CN113160294B (en) 2021-03-31 2021-03-31 Image scene depth estimation method and device, terminal equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113160294B (en)
WO (1) WO2022206020A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160294B (en) * 2021-03-31 2022-12-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium
CN113592940B (en) * 2021-07-28 2024-07-02 北京地平线信息技术有限公司 Method and device for determining target object position based on image
CN113592706B (en) * 2021-07-28 2023-10-17 北京地平线信息技术有限公司 Method and device for adjusting homography matrix parameters
CN113792730B (en) * 2021-08-17 2022-09-27 北京百度网讯科技有限公司 Method and device for correcting document image, electronic equipment and storage medium
CN114049388A (en) * 2021-11-10 2022-02-15 北京地平线信息技术有限公司 Image data processing method and device
CN113793283B (en) * 2021-11-15 2022-02-11 江苏游隼微电子有限公司 Vehicle-mounted image noise reduction method
CN114219900B (en) * 2022-02-21 2022-07-01 北京影创信息科技有限公司 Three-dimensional scene reconstruction method, reconstruction system and application based on mixed reality glasses
CN114627006B (en) * 2022-02-28 2022-12-20 复旦大学 Progressive image restoration method based on depth decoupling network

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9589326B2 (en) * 2012-11-29 2017-03-07 Korea Institute Of Science And Technology Depth image processing apparatus and method based on camera pose conversion
AU2019270095B2 (en) * 2018-05-17 2024-06-27 Niantic, Inc. Self-supervised training of a depth estimation system
CN110490928B (en) * 2019-07-05 2023-08-15 天津大学 Camera attitude estimation method based on deep neural network
CN110503680B (en) * 2019-08-29 2023-08-18 大连海事大学 Unsupervised convolutional neural network-based monocular scene depth estimation method
CN110782490B (en) * 2019-09-24 2022-07-05 武汉大学 Video depth map estimation method and device with space-time consistency
CN111105451B (en) * 2019-10-31 2022-08-05 武汉大学 Driving scene binocular depth estimation method for overcoming occlusion effect
US11157774B2 (en) * 2019-11-14 2021-10-26 Zoox, Inc. Depth data model training with upsampling, losses, and loss balancing
CN111311685B (en) * 2020-05-12 2020-08-07 中国人民解放军国防科技大学 Motion scene reconstruction unsupervised method based on IMU and monocular image
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning
CN112819875B (en) * 2021-02-03 2023-12-19 苏州挚途科技有限公司 Monocular depth estimation method and device and electronic equipment
CN113160294B (en) * 2021-03-31 2022-12-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN113160294A (en) 2021-07-23
WO2022206020A1 (en) 2022-10-06

Similar Documents

Publication Publication Date Title
CN113160294B (en) Image scene depth estimation method and device, terminal equipment and storage medium
US10593021B1 (en) Motion deblurring using neural network architectures
CN110910447B (en) Visual odometer method based on dynamic and static scene separation
CN110501072B (en) Reconstruction method of snapshot type spectral imaging system based on tensor low-rank constraint
Sinha et al. GPU-based video feature tracking and matching
CN107525588B (en) Rapid reconstruction method of dual-camera spectral imaging system based on GPU
CN112001914A (en) Depth image completion method and device
CN112330729A (en) Image depth prediction method and device, terminal device and readable storage medium
CN109146787B (en) Real-time reconstruction method of dual-camera spectral imaging system based on interpolation
CN105488759B (en) A kind of image super-resolution rebuilding method based on local regression model
CN114152217B (en) Binocular phase expansion method based on supervised learning
CN110910437A (en) Depth prediction method for complex indoor scene
CN112819876A (en) Monocular vision depth estimation method based on deep learning
CN113962858A (en) Multi-view depth acquisition method
CN117542122B (en) Human body pose estimation and three-dimensional reconstruction method, network training method and device
CN116773018A (en) Space spectrum combined image reconstruction method and system for calculating spectrum imaging
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
KR20230150867A (en) Multi-view neural person prediction using implicit discriminative renderer to capture facial expressions, body posture geometry, and clothing performance
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN117095132B (en) Three-dimensional reconstruction method and system based on implicit function
CN117036613B (en) Polarization three-dimensional reconstruction method and system based on multiple receptive field blending network
CN117788544A (en) Image depth estimation method based on lightweight attention mechanism
Polasek et al. Vision UFormer: Long-range monocular absolute depth estimation
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN116934591A (en) Image stitching method, device and equipment for multi-scale feature extraction and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant