CN113160294B - Image scene depth estimation method and device, terminal equipment and storage medium

Info

Publication number
CN113160294B
Authority
CN
China
Prior art keywords
image
coordinate
frame image
reference frame
scene
Prior art date
Legal status
Active
Application number
CN202110346713.3A
Other languages
Chinese (zh)
Other versions
CN113160294A (en)
Inventor
Wang Fei (王飞)
Cheng Jun (程俊)
Liu Penglei (刘鹏磊)
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110346713.3A
Publication of CN113160294A
Priority to PCT/CN2021/137609 (WO2022206020A1)
Application granted
Publication of CN113160294B


Classifications

    • G06T7/50 Image analysis - Depth or shape recovery
    • G06N3/045 Neural networks - Combinations of networks
    • G06N3/08 Neural networks - Learning methods
    • G06T7/85 Camera calibration - Stereo camera calibration


Abstract

The application relates to the technical field of image processing, and provides an image scene depth estimation method and apparatus, a terminal device and a storage medium. When the parameters of the depth estimation network are optimized and updated, the camera pose vector of an input sample image sequence is predicted with the aid of a camera pose estimation network, the sample image sequence comprising a target frame image and a reference frame image. A reconstructed image corresponding to the target frame image is then generated from the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the intrinsic parameters of the corresponding camera. Next, the loss incurred when reconstructing the image is calculated from the target frame image and the reconstructed image, an objective function is constructed based on this loss, and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the image information contained in the target frame image and the reference frame image can be fully exploited, and the cost of sample data acquisition is reduced.

Description

Image scene depth estimation method and device, terminal equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image scene depth estimation method and apparatus, a terminal device, and a storage medium.
Background
Scene depth estimation from images is an important research direction in the fields of robot navigation and autonomous driving. With the development of high-performance computing devices, deep neural networks are widely used to predict the scene depth of an image. However, to ensure the accuracy of scene depth prediction, a large amount of sample data is required when training the deep neural network, which results in a high data acquisition cost.
Disclosure of Invention
In view of this, embodiments of the present application provide an image scene depth estimation method and apparatus, a terminal device, and a storage medium, which can reduce the cost of sample data acquisition.
A first aspect of an embodiment of the present application provides a method for estimating depth of an image scene, including:
acquiring an image to be detected;
inputting the image to be detected into a depth estimation network which is constructed in advance to obtain a scene depth image of the image to be detected;
wherein the parameters of the depth estimation network are updated by:
acquiring a sample image sequence, wherein the sample image sequence comprises a target frame image and a reference frame image, and the reference frame image is one or more frames in the sample image sequence before or after the target frame image;
inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera attitude vector, the reference frame image and internal parameters of a camera adopted for shooting the sample image sequence;
calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, wherein the first image reconstruction loss is used for measuring the difference between the target frame image and the first reconstructed image;
constructing an objective function based on the first image reconstruction loss;
and updating the parameters of the depth estimation network according to the objective function.
When the parameters of the depth estimation network adopted in the embodiments of the present application are optimized and updated, the camera pose vector of the input sample image sequence is predicted with the aid of the camera pose estimation network, the sample image sequence comprising a target frame image and a reference frame image. A reconstructed image corresponding to the target frame image is then generated from the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the intrinsic parameters of the corresponding camera. Next, the loss incurred when reconstructing the image is calculated from the target frame image and the reconstructed image, an objective function is constructed based on this loss, and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the potential image information contained in the target frame image and the reference frame image can be fully mined; in other words, sufficient image information can be obtained from fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data acquisition.
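The following is a minimal sketch of one such parameter-update step, assuming hypothetical `depth_net`, `pose_net`, `warp_fn` and `loss_fn` callables; it illustrates the training flow described above and is not the patent's actual code.

```python
# Sketch of one self-supervised parameter-update step (illustrative names, not the patent's code).
import torch

def training_step(depth_net, pose_net, warp_fn, loss_fn,
                  target, references, intrinsics, optimizer):
    """target: (B,3,H,W) target frame; references: list of (B,3,H,W) reference frames;
    loss_fn is assumed to return a scalar loss tensor."""
    depth = depth_net(target)                              # first scene depth image
    loss = target.new_zeros(())
    for ref in references:
        pose = pose_net(torch.cat([target, ref], dim=1))   # 6-D camera pose vector
        recon = warp_fn(ref, depth, pose, intrinsics)      # first reconstructed image
        loss = loss + loss_fn(target, recon)               # image reconstruction loss
    optimizer.zero_grad()
    loss.backward()                                        # objective built from the loss
    optimizer.step()
    return loss.detach()
```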
In an embodiment of the present application, after acquiring the sample image sequence, the method may further include:
inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera attitude vector, the target frame image and internal parameters of a camera adopted for shooting the sample image sequence;
calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, wherein the second image reconstruction loss is used for measuring the difference between the reference frame image and the second reconstructed image;
the constructing an objective function based on the first image reconstruction loss comprises:
calculating the reconstruction loss of the bidirectional image according to the reconstruction loss of the first image and the reconstruction loss of the second image;
constructing the objective function based on the bi-directional image reconstruction loss.
By adding bidirectional image reconstruction loss in an objective function of the depth estimation network, potential information in image data can be fully mined, and the robustness of a depth estimation algorithm is further improved.
Further, the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image and the internal reference of the camera used for capturing the sample image sequence may include:
determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image under a world coordinate system after transformation;
converting the second coordinate into a third coordinate in an image coordinate system;
reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and an internal reference of a camera used for shooting the sample image sequence, including:
determining a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
according to the internal reference of the camera and the second scene depth image, calculating a fourth coordinate of the reference frame image in a world coordinate system;
transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the reference frame image in a world coordinate system after transformation;
converting the fifth coordinate into a sixth coordinate in an image coordinate system;
and reconstructing an image of the target frame image after affine transformation by using the sixth coordinate as a grid point through a bilinear sampling mechanism based on the target frame image, and determining the reconstructed image as the second reconstructed image.
In one embodiment of the present application, the method may further include:
acquiring a seventh coordinate of the target frame image in an image coordinate system;
performing difference processing on corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
acquiring an eighth coordinate of the reference frame image in an image coordinate system;
performing difference processing on corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
performing affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second forward flow coordinate;
performing affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second backward flow coordinate;
calculating a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, wherein the forward flow occlusion mask is used for measuring the matching degree between the first forward flow coordinate and the second forward flow coordinate;
calculating a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, wherein the backward flow occlusion mask is used for measuring the matching degree between the first backward flow coordinate and the second backward flow coordinate;
said calculating a bi-directional image reconstruction loss from said first image reconstruction loss and said second image reconstruction loss, comprising:
calculating the bi-directional image reconstruction loss from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
The occlusion mask is used for judging whether occlusion objects exist in continuous video frames or not, and the occlusion mask is added into calculation of bidirectional image reconstruction loss, so that the accuracy of depth estimation of the image with the occlusion objects by the depth estimation network can be improved.
In one embodiment of the present application, the method may further include:
determining a first scene depth value of the target frame image according to the second coordinate;
determining a second scene depth value of the reference frame image according to the fifth coordinate;
acquiring a third scene depth value of a pixel point corresponding to the second coordinate in the first scene depth image;
acquiring a fourth scene depth value of a pixel point corresponding to the fifth coordinate in the second scene depth image;
reconstructing a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
reconstructing a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
calculating a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, wherein the forward scene structure consistency loss is used for measuring a difference between a scene depth value of the target frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the target frame image;
calculating a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, wherein the backward scene structure consistency loss is used for measuring a difference between a scene depth value of the reference frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the reference frame image;
calculating the consistency loss of the bidirectional scene structure according to the consistency loss of the forward scene structure and the consistency loss of the backward scene structure;
the constructing the objective function based on the bidirectional image reconstruction loss may include:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
When the objective function is constructed, the bidirectional scene structure consistency loss is added, so that occluding objects and moving objects in the scene of the image to be detected can be handled effectively, thereby improving the accuracy of scene depth estimation.
In an embodiment of the present application, the depth estimation network includes an encoding network, and the method may further include:
acquiring a first characteristic image of the target frame image and a second characteristic image of the reference frame image through the coding network;
reconstructing a third characteristic image of the target frame image through a bilinear sampling mechanism based on the third coordinate and the second characteristic image;
reconstructing a fourth characteristic image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first characteristic image;
calculating to obtain a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, wherein the bidirectional feature perception loss is used for measuring a difference between a feature image of the target frame image obtained through an encoding network and a feature image of the reconstructed target frame image, and a difference between a feature image of the reference frame image obtained through the encoding network and a feature image of the reconstructed reference frame image;
the constructing the objective function based on the bi-directional image reconstruction loss comprises:
and constructing and obtaining the target function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
By introducing bidirectional feature perception loss into the objective function, the weak texture scene in the image to be detected can be effectively processed, and therefore the accuracy of scene depth estimation is improved.
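A minimal sketch of a bidirectional feature-perception term consistent with the description above is given below; the function name and the L1 form of the comparison are assumptions, not the patent's exact formulation.

```python
# Sketch of a bidirectional feature-perception loss (L1 form assumed).
import torch

def feature_perception_loss(feat_tgt, feat_ref, feat_tgt_rec, feat_ref_rec):
    """All arguments are (B, C, H, W) feature maps from the encoding network;
    *_rec are the feature images reconstructed by bilinear sampling."""
    fwd = (feat_tgt - feat_tgt_rec).abs().mean()   # target-frame feature difference
    bwd = (feat_ref - feat_ref_rec).abs().mean()   # reference-frame feature difference
    return fwd + bwd
```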
Further, after acquiring the first feature image of the target frame image and the second feature image of the reference frame image through the coding network, the method may further include:
calculating to obtain a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, wherein the smoothing loss is used for regularizing gradients of the scene depth image and the feature image obtained through the depth estimation network;
the constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss comprises:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
By introducing a smoothing loss in the objective function, the gradients of the scene depth image and the feature image obtained by the depth estimation network may be regularized.
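Below is a sketch of an edge-aware smoothness term of the kind described above, in which depth (or feature) gradients are penalized less where the input image itself has strong gradients; this standard form is an assumption about the patent's exact smoothing loss.

```python
# Sketch of an edge-aware smoothness regularizer (standard form, assumed).
import torch

def edge_aware_smoothness(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """depth: (B, 1, H, W) depth or feature map; image: (B, 3, H, W) input frame."""
    d_dx = (depth[:, :, :, :-1] - depth[:, :, :, 1:]).abs()
    d_dy = (depth[:, :, :-1, :] - depth[:, :, 1:, :]).abs()
    i_dx = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    i_dy = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    # down-weight gradients at image edges
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```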
A second aspect of the embodiments of the present application provides an apparatus for estimating depth of an image scene, including:
the to-be-detected image acquisition module is used for acquiring an image to be detected;
the scene depth estimation module is used for inputting the image to be detected into a depth estimation network which is constructed in advance to obtain a scene depth image of the image to be detected;
the sample acquisition module is used for acquiring a sample image sequence, the sample image sequence comprising a target frame image and a reference frame image, wherein the reference frame image is one or more frames in the sample image sequence before or after the target frame image;
the first scene depth prediction module is used for inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
the camera attitude estimation module is used for inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and an internal reference of a camera used for shooting the sample image sequence;
a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure a difference between the target frame image and the first reconstructed image;
an objective function construction module for constructing an objective function based on the first image reconstruction loss;
and the network parameter updating module is used for updating the parameters of the depth estimation network according to the target function.
A third aspect of an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for estimating depth of an image scene as provided in the first aspect of the embodiment of the present application.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the method for estimating depth of an image scene as provided by the first aspect of embodiments of the present application.
A fifth aspect of the embodiments of the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the method for estimating depth of an image scene according to the first aspect of the embodiments of the present application.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
Fig. 1 is a flowchart of an embodiment of a method for estimating a depth of an image scene according to an embodiment of the present application;
FIG. 2 is a schematic flowchart illustrating a process of optimizing and updating a depth estimation network parameter according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a depth estimation network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a residual module in the network architecture of FIG. 3;
fig. 5 is a schematic structural diagram of a camera pose estimation network according to an embodiment of the present application;
fig. 6 is a comparison graph of the result of monocular image depth prediction performed by the image scene depth estimation method provided in the embodiment of the present application and various algorithms in the prior art;
fig. 7 is a block diagram of an embodiment of an apparatus for estimating depth of an image scene according to an embodiment of the present application;
fig. 8 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail. Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
The application provides an image scene depth estimation method and device, terminal equipment and a storage medium, and the cost of sample data acquisition can be reduced. It should be understood that the main subjects of the embodiments of the method of the present application are various types of terminal devices or servers, such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a wearable device, and the like.
Referring to fig. 1, a method for estimating depth of an image scene according to an embodiment of the present application is shown, including:
101. acquiring an image to be detected;
firstly, an image to be detected is obtained, wherein the image to be detected is any image of which the scene depth needs to be predicted.
102. And inputting the image to be detected into a pre-constructed depth estimation network to obtain a scene depth image of the image to be detected.
After the image to be detected is obtained, the image to be detected is input into a depth estimation network which is constructed in advance, a scene depth image of the image to be detected is obtained, and therefore a scene depth estimation result of the image to be detected is obtained. Specifically, the depth estimation network may be a neural network having an encoder-decoder architecture, and the application does not limit the type and network structure of the neural network used by the depth estimation network.
Referring to fig. 2, a schematic flow chart of optimizing and updating a depth estimation network parameter provided in an embodiment of the present application is shown, including the following steps:
2.1, acquiring a sample image sequence, wherein the sample image sequence comprises a target frame image and a reference frame image, and the reference frame image is an image of more than one frame in the sample image sequence before or after the target frame image;
To train and optimize the depth estimation network, training set data needs to be acquired first, and certain preprocessing operations may be performed on it. For example, the autonomous driving dataset KITTI may be used as training set data and subjected to preprocessing operations such as random flipping, random cropping and data normalization, so as to convert it into tensor data of a specified dimension that serves as the input of the depth estimation network. In an embodiment of the present application, the training set data may consist of a large number of sample image sequences, where each sample image sequence contains a target frame image and a reference frame image, and the reference frame image is one or more frames in the sample image sequence before or after the target frame image. For example, the sample image sequence may be a video clip comprising 5 consecutive video frames, assumed to be I0, I1, I2, I3 and I4; then I2 can be used as the target frame image, and I0, I1, I3 and I4 can be used as the corresponding reference frame images.
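A minimal sketch of this target/reference split for the 5-frame example above is given below; the function name is illustrative.

```python
# Sketch of splitting a 5-frame sample sequence into target and reference frames
# (I2 as target, I0, I1, I3, I4 as references), as in the example above.
def split_sequence(frames):
    """frames: list of 5 consecutive video frames [I0, I1, I2, I3, I4]."""
    assert len(frames) == 5
    target = frames[2]                       # middle frame is the target frame image
    references = frames[:2] + frames[3:]     # remaining frames are reference frame images
    return target, references
```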
2.2, inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
and inputting the target frame image in the sample image sequence into the depth estimation network to obtain a predicted first scene depth image, namely a scene depth image corresponding to the target frame image.
In one embodiment of the present application, the depth estimation network is shown schematically in fig. 3 and comprises an encoder part and a decoder part. The encoder part extracts abstract features of the input image data by layer-by-layer downsampling. Assuming that tensor data with dimensions 3 × 256 × 832 is obtained after preprocessing the target frame image, a feature image with dimensions 64 × 128 × 416 is obtained after the first layer of convolution, normalization and activation of the encoder, completing the first downsampling. The feature image is then processed by a max-pooling module and several residual modules to obtain a feature image with dimensions 256 × 64 × 208, completing the second downsampling. By analogy, after multiple downsampling steps, a feature image with dimensions 2048 × 8 × 26 is obtained.

The decoder part processes the feature image produced by the encoder by layer-by-layer upsampling. Specifically, the feature image obtained by the encoder can be processed by a convolution layer with a 3 × 3 kernel, a nonlinear ELU layer and a nearest-neighbour upsampling layer to obtain a feature image with dimensions 512 × 16 × 52. Then, as shown in fig. 3, this 512 × 16 × 52 feature image is concatenated in the channel dimension with the 1024 × 16 × 52 feature image obtained by the encoder to obtain a feature image with dimensions 1536 × 16 × 52, completing the first upsampling. By analogy, after multiple upsampling steps, a feature image with dimensions 32 × 256 × 832 is finally obtained.

This 32 × 256 × 832 feature image is then passed in turn through a convolution layer with a 3 × 3 kernel, a Sigmoid function and the mapping F(x) = 1/(10x + 0.01) to obtain the final scene depth image, where x denotes the depth image obtained after mapping by the Sigmoid function and F(x) denotes the final scene depth image. Specifically, after the feature image is transformed by the Sigmoid function, the range of each pixel is mapped to between 0 and 1. Assuming here that the depth range of the actual scene is between 0.1 m and 100 m, the mapping between pixels of the estimated depth image and actual scene depth can be established by the function F(x) = 1/(10x + 0.01); for example, x = 0 corresponds to 100 m in the actual scene. Processing by the function F(x) = 1/(10x + 0.01) therefore constrains the estimated depth image to a reasonable range between 0.1 m and 100 m.
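The final depth mapping described above can be written as a small helper, sketched below with illustrative names.

```python
# Sketch of the final depth mapping above: a sigmoid output x in (0, 1) is converted to
# depth via F(x) = 1 / (10 * x + 0.01), which bounds the estimate to roughly 0.1 m .. 100 m
# (x close to 1 gives about 0.1 m, x = 0 gives 100 m).
import torch

def sigmoid_to_depth(logits: torch.Tensor) -> torch.Tensor:
    x = torch.sigmoid(logits)          # map network output to (0, 1)
    return 1.0 / (10.0 * x + 0.01)     # F(x), constrained to roughly 0.1 m .. 100 m
```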
Fig. 4 shows a schematic structural diagram of a residual module in the network structure shown in fig. 3: the input is divided into 2 branches, one branch is processed in turn by convolution layers, a normalization layer (BN) and a ReLU function, and is then added to the other branch to obtain the output data of the residual module.
In addition, in the network structure shown in fig. 3, a technical means of shortcut connection is adopted, that is, the feature images extracted by the encoder are directly spliced in the channel dimension across the convolutional layer and the feature images with the same resolution obtained by the decoder. In the process of extracting the features of the input image by using the encoder, a convolution kernel with a fixed size is adopted to continuously extract the image features in a sliding window mode, however, due to the limitation of the size of the convolution kernel and the property of extracting the features of the convolution local image, the shallow network can only extract the local features of the image. With the increasing of the number of the convolution layers, the resolution of the extracted feature images is continuously reduced, and meanwhile, the number of the feature images is also continuously increased, so that more abstract depth features with larger receptive fields can be extracted. As for the decoder part, the last feature image output by the encoder is directly decoded, and the deep level features are subjected to multiple times of upsampling processing to obtain deep level feature images with different resolutions, for example, after the first time of upsampling processing, a feature image with the dimension of 512 × 16 × 52 is obtained, and at this time, the feature image with the dimension of 1024 × 16 × 52 extracted by the encoder directly crosses the corresponding convolution layer and is fused with the 512 × 16 × 52 feature image obtained by the decoder in the channel dimension. As shown in fig. 3, the feature image of each resolution extracted by the encoder is subjected to feature fusion with the feature image obtained by the corresponding decoder, so that fusion of the local features of the image and the depth feature information is realized.
2.3, inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
In order to obtain the camera pose vector between the target frame image and the reference frame image, a camera pose estimation network is also constructed in advance in the embodiments of the present application; its structure may be as shown in fig. 5 and consists of convolution layers with different parameters. Specifically, assume that the input sample image sequence is the 5 frames I0, I1, I2, I3 and I4; the sequence is preprocessed into tensor data of a specified dimension, which serves as the input of the camera pose estimation network. The camera pose estimation network uses several convolution layers with specified strides to extract image features and perform downsampling, obtaining the corresponding feature images in turn. For example, in fig. 5, the input tensor data is passed through 8 convolution layers to obtain a 24-dimensional feature vector, which is finally reshaped to 6 × N_ref, where 6 indicates that the camera pose vector is a 6-dimensional vector consisting of 3 translation components and 3 rotation components, and N_ref = 4 indicates that the number of reference frame images in the input sample image sequence is 4.
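The reshaping of the pose network output into per-reference-frame 6-D pose vectors can be sketched as below; the ordering of the rotation and translation halves is an assumption, as is the function name.

```python
# Sketch of reshaping the pose network output into per-reference-frame 6-D pose vectors.
import torch

def split_pose_vectors(pose_out: torch.Tensor, n_ref: int = 4):
    """pose_out: (B, 6 * n_ref) tensor from the last layer of the pose network."""
    pose = pose_out.view(-1, n_ref, 6)   # (B, N_ref, 6)
    rotation = pose[..., :3]             # 3 rotation components (ordering assumed)
    translation = pose[..., 3:]          # 3 translation components
    return rotation, translation
```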
2.4, generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera attitude vector, the reference frame image and the internal reference of a camera adopted for shooting the sample image sequence;
after obtaining the estimated first scene depth image and the camera pose vector, image reconstruction needs to be performed based on these data, and a first reconstructed image corresponding to the target frame image is obtained, so as to perform image reconstruction loss calculation subsequently.
Specifically, the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal reference of the camera used for shooting the sample image sequence may include:
(1) Determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
(2) Calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
(3) Transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image in a world coordinate system after transformation;
(4) Converting the second coordinate into a third coordinate in an image coordinate system;
(5) And reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image.
Assume that the target frame image is $I_{tgt}$, the reference frame image is $I_{ref}$, and the intrinsic matrix of the corresponding camera is $K$. The first scene depth image $D_{tgt}$ corresponding to $I_{tgt}$ can be estimated by the depth estimation network described above, and the camera pose between the two frames is estimated by the camera pose estimation network described above, giving the first transformation matrix $T$ (composed of rotation and translation vectors) that converts the target frame image $I_{tgt}$ to the reference frame image $I_{ref}$. Then, according to the camera intrinsic matrix $K$, the first scene depth image $D_{tgt}$ and the target frame image $I_{tgt}$, the coordinates of the target frame image $I_{tgt}$ in the world coordinate system (the first coordinates) can be calculated. For example, assume that a certain pixel of the target frame image $I_{tgt}$ has image coordinates $p_{tgt} = (u, v)$ and that, according to the first scene depth image $D_{tgt}$, the depth of this pixel is $d_{tgt}$; the coordinates of the pixel in the world coordinate system can then be calculated by the following formulas:

$$X = \frac{(u - c_x)\, d_{tgt}}{f}, \qquad Y = \frac{(v - c_y)\, d_{tgt}}{f}, \qquad Z = d_{tgt}, \qquad P_{tgt} = (X, Y, Z)^{T}$$

where $P_{tgt}$ represents the coordinates of the pixel in the world coordinate system, $(c_x, c_y, f)$ are parameters of the camera intrinsic matrix, $c_x$ and $c_y$ denote the principal point offsets, and $f$ denotes the focal length.
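A minimal back-projection sketch consistent with the relations above is given below; the use of separate focal lengths fx and fy, the tensor layout and the function name are assumptions.

```python
# Sketch of pinhole back-projection: X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d.
import torch

def backproject(depth: torch.Tensor, fx: float, fy: float, cx: float, cy: float):
    """depth: (B, 1, H, W) scene depth image -> (B, 3, H, W) coordinates in 3D space."""
    b, _, h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                          torch.arange(w, dtype=depth.dtype), indexing="ij")
    u = u.to(depth.device).expand(b, 1, h, w)   # column index per pixel
    v = v.to(depth.device).expand(b, 1, h, w)   # row index per pixel
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return torch.cat([x, y, depth], dim=1)      # first coordinates P = (X, Y, Z)
```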
Then, based on the first transformation matrix $T$, the first coordinate $P_{tgt}$ is transformed to obtain the second coordinate $\hat{P}_{tgt}$ of the target frame image $I_{tgt}$ in the world coordinate system after transformation. Specifically, the following calculation can be adopted:

$$\hat{P}_{tgt} = R\, P_{tgt} + t$$

where the rotation matrix $R$ and the translation vector $t$ are determined by $(R_x, R_y, R_z, t) \in SE3$, the 3D rotation angles and translation vector, which can be obtained from the first transformation matrix $T$. $R_x$, $R_y$ and $R_z$ denote the rotations about the x-, y- and z-axes of the world coordinate system respectively, $t$ denotes the translations along the x-, y- and z-axes, and SE3 denotes the special Euclidean group.
Then, the second coordinate is converted into the third coordinate $\hat{p}_{ref}$ in the image coordinate system. Specifically, the conversion can be performed by the following formulas:

$$\hat{P}_{tgt} = (\hat{X}, \hat{Y}, \hat{Z})^{T} = T_{tgt \to ref}\, P_{tgt}, \qquad \hat{p}_{ref} = \frac{1}{\hat{Z}}\, K\, \hat{P}_{tgt}$$

where $T_{tgt \to ref}$ represents the camera extrinsic matrix composed of a rotation matrix and a translation matrix.
After the third coordinate $\hat{p}_{ref}$ is obtained, an image $\hat{I}_{tgt}$ of the reference frame image $I_{ref}$ after affine transformation can be reconstructed through a bilinear sampling mechanism, based on the reference frame image $I_{ref}$ and using the third coordinate as grid points, and the reconstructed image $\hat{I}_{tgt}$ is determined as the first reconstructed image. The principle of the bilinear sampling mechanism can be found in the prior art and is not described here again.
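The projection-and-sampling step above can be sketched as follows, using torch.nn.functional.grid_sample for the bilinear sampling; the batched tensor shapes and the function name are illustrative assumptions, not the patent's implementation.

```python
# Sketch of reconstructing the target frame by warping the reference frame:
# transform the first coordinates, project with the intrinsics, normalise to
# [-1, 1] grid points and sample the reference image bilinearly.
import torch
import torch.nn.functional as F

def reconstruct_target(ref_img, world_pts, rotation, translation, K):
    """ref_img: (B,3,H,W); world_pts: (B,3,H,W) first coordinates;
    rotation: (B,3,3); translation: (B,3,1); K: (B,3,3) intrinsic matrices."""
    b, _, h, w = ref_img.shape
    pts = world_pts.view(b, 3, -1)                     # (B, 3, H*W)
    cam = rotation @ pts + translation                 # second coordinates
    pix = K @ cam                                      # project with intrinsics
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)     # third coordinates (u, v)
    u = 2.0 * pix[:, 0] / (w - 1) - 1.0                # normalise to [-1, 1]
    v = 2.0 * pix[:, 1] / (h - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(b, h, w, 2)
    return F.grid_sample(ref_img, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```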
2.5, calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, wherein the first image reconstruction loss is used for measuring the difference between the target frame image and the first reconstructed image;
Specifically, the first image reconstruction loss can be calculated by the following formula:

$$L_{rec}^{fwd} = \alpha\, \frac{1 - \mathrm{SSIM}(I_{tgt}, \hat{I}_{tgt})}{2} + (1 - \alpha)\, \mathrm{ERF}(I_{tgt} - \hat{I}_{tgt})$$

where $L_{rec}^{fwd}$ represents this first image reconstruction loss and $\alpha$ is a preset weighting parameter, which may for example be 0.85. $\mathrm{SSIM}(\cdot,\cdot)$ is the structural similarity metric function:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\delta_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\delta_x^2 + \delta_y^2 + c_2)}$$

where $\mu$ and $\delta$ are the pixel mean and variance respectively, $c_1 = 0.01^2$ and $c_2 = 0.03^2$. $\mathrm{ERF}(\cdot)$ is a robust error metric function with parameter $\epsilon = 0.01$.
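A sketch of a photometric loss of this form is given below, with alpha = 0.85; a plain L1 term stands in for the robust ERF metric, whose exact form is not given here, and the function names are illustrative.

```python
# Sketch of the SSIM/L1-style photometric loss (L1 substitutes for the ERF term).
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_loss(target, recon, alpha=0.85):
    ssim_term = (1.0 - ssim(target, recon)) / 2.0
    l1_term = (target - recon).abs()
    return alpha * ssim_term + (1.0 - alpha) * l1_term   # per-pixel loss map
```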
2.6, constructing an objective function based on the first image reconstruction loss;
after obtaining the first image reconstruction loss, an objective function may be constructed based on the first image reconstruction loss to complete parameter updating of the depth estimation network.
In an embodiment of the present application, after acquiring the sample image sequence, the method may further include:
(1) Inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
(2) Generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera attitude vector, the target frame image and internal parameters of a camera adopted for shooting the sample image sequence;
(3) And calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, wherein the second image reconstruction loss is used for measuring the difference between the reference frame image and the second reconstructed image.
Specifically, the generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal reference of the camera used for shooting the sample image sequence may include:
(2.1) determining a second transformation matrix for the reference frame image to transform to the target frame image according to the camera pose vector;
(2.2) calculating a fourth coordinate of the reference frame image in a world coordinate system according to the internal reference of the camera and the second scene depth image;
(2.3) transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the transformed reference frame image in a world coordinate system;
(2.4) converting the fifth coordinate into a sixth coordinate in an image coordinate system;
and (2.5) reconstructing an image of the target frame image after affine transformation by using the sixth coordinate as a grid point through a bilinear sampling mechanism based on the target frame image, and determining the reconstructed image as the second reconstructed image.
The calculation is similar to that of the first image reconstruction loss. When calculating the second image reconstruction loss, assume again that the target frame image is $I_{tgt}$, the reference frame image is $I_{ref}$, and the intrinsic matrix of the corresponding camera is $K$. The second scene depth image $D_{ref}$ corresponding to $I_{ref}$ can be estimated by the depth estimation network described above, and the camera pose between the two frames is estimated by the camera pose estimation network described above, giving the second transformation matrix $T_{inv}$ that converts the reference frame image $I_{ref}$ to the target frame image $I_{tgt}$; this second transformation matrix is determined from the first transformation matrix $T$ that converts the target frame image $I_{tgt}$ to the reference frame image $I_{ref}$. Then, according to the camera intrinsic matrix $K$, the second scene depth image $D_{ref}$ and the reference frame image $I_{ref}$, the coordinates of the reference frame image $I_{ref}$ in the world coordinate system (the fourth coordinates) can be calculated. Next, the fourth coordinate is transformed based on the second transformation matrix $T_{inv}$ to obtain the fifth coordinate of the reference frame image $I_{ref}$ in the world coordinate system after transformation, after which the sixth coordinate in the image coordinate system is calculated; for the specific coordinate transformation steps, reference may be made to the description of calculating the first image reconstruction loss above. Finally, based on the target frame image $I_{tgt}$ and using the sixth coordinate as grid points, an image $\hat{I}_{ref}$ of the target frame image after affine transformation can be reconstructed through the bilinear sampling mechanism, and the reconstructed image $\hat{I}_{ref}$ is determined as the second reconstructed image. The following formula may be used to calculate the second image reconstruction loss:

$$L_{rec}^{bwd} = \alpha\, \frac{1 - \mathrm{SSIM}(I_{ref}, \hat{I}_{ref})}{2} + (1 - \alpha)\, \mathrm{ERF}(I_{ref} - \hat{I}_{ref})$$

For the definition of the respective parameters in the above formula, reference may be made to the description of the formula for calculating the first image reconstruction loss above.
The first image reconstruction loss can be regarded as a forward image reconstruction loss and the second image reconstruction loss as a backward image reconstruction loss, so that a bidirectional image reconstruction loss can be constructed based on the two image reconstruction losses; a specific calculation formula can be:

$$L_{rec} = L_{rec}^{fwd} + L_{rec}^{bwd}$$
an objective function may then be constructed based on the bi-directional image reconstruction loss. By adding bidirectional image reconstruction loss in an objective function of the depth estimation network, potential information in image data can be fully mined, and the robustness of a depth estimation algorithm is further improved.
In one embodiment of the present application, the method may further include:
(1) Acquiring a seventh coordinate of the target frame image in an image coordinate system;
(2) Performing difference processing on corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
(3) Acquiring an eighth coordinate of the reference frame image in an image coordinate system;
(4) Performing difference processing on corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
(5) Performing affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second forward flow coordinate;
(6) Performing affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second backward flow coordinate;
(7) Calculating a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, wherein the forward flow occlusion mask is used for measuring the matching degree between the first forward flow coordinate and the second forward flow coordinate;
(8) And calculating a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, wherein the backward flow occlusion mask is used for measuring the matching degree between the first backward flow coordinate and the second backward flow coordinate.
This process can be summarized as a bidirectional flow consistency check, comprising a forward flow consistency check and a backward flow consistency check. First, the seventh coordinate $p_{tgt}$ of the target frame image in the image coordinate system and the third coordinate $\hat{p}_{ref}$ described above are acquired, and difference processing of corresponding elements is performed on the third coordinate and the seventh coordinate to obtain the first forward flow coordinate $F_{fwd}$, as shown in the following equation:

$$F_{fwd} = \hat{p}_{ref} - p_{tgt}$$
Similarly, the eighth coordinate $p_{ref}$ of the reference frame image in the image coordinate system and the sixth coordinate (which may be expressed as $\hat{p}_{tgt}$) are acquired, and difference processing of corresponding elements is performed on the sixth coordinate and the eighth coordinate to obtain the first backward flow coordinate $F_{bwd}$, as shown in the following equation:

$$F_{bwd} = \hat{p}_{tgt} - p_{ref}$$
Then, using the third coordinate as grid coordinates, affine transformation is performed on the first backward flow coordinate $F_{bwd}$ through a bilinear sampling mechanism to synthesize the second forward flow coordinate $\hat{F}_{fwd}$. In the ideal case, the synthesized forward flow coordinate $\hat{F}_{fwd}$ and the calculated forward flow coordinate $F_{fwd}$ are equal in magnitude and opposite in direction; this is forward flow consistency.
Similarly, using the sixth coordinate as grid coordinates, affine transformation is performed on the first forward flow coordinate $F_{fwd}$ through a bilinear sampling mechanism to synthesize the second backward flow coordinate $\hat{F}_{bwd}$. In the ideal case, the synthesized backward flow coordinate $\hat{F}_{bwd}$ and the calculated backward flow coordinate $F_{bwd}$ are equal in magnitude and opposite in direction; this is backward flow consistency.
Next, a forward flow occlusion mask $M_{fwd}$ may be calculated from the first forward flow coordinate and the second forward flow coordinate; this mask is used to measure the degree of matching between the first forward flow coordinate and the second forward flow coordinate, and may specifically be calculated by the following formula:

$$M_{fwd} = \left[\, \lVert F_{fwd} + \hat{F}_{fwd} \rVert^{2} < \alpha_1 \left( \lVert F_{fwd} \rVert^{2} + \lVert \hat{F}_{fwd} \rVert^{2} \right) + \alpha_2 \,\right]$$

where $[\cdot]$ denotes the indicator function and the parameters are $\alpha_1 = 0.01$ and $\alpha_2 = 0.5$.
A backward flow occlusion mask $M_{bwd}$ may be calculated from the first backward flow coordinate and the second backward flow coordinate in the same way; this mask is used to measure the degree of matching between the first backward flow coordinate and the second backward flow coordinate:

$$M_{bwd} = \left[\, \lVert F_{bwd} + \hat{F}_{bwd} \rVert^{2} < \alpha_1 \left( \lVert F_{bwd} \rVert^{2} + \lVert \hat{F}_{bwd} \rVert^{2} \right) + \alpha_2 \,\right]$$

where the definitions of the parameters are the same as above.
After calculating the two stream occlusion masks, calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss may include:
calculating the bi-directional image reconstruction loss from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
The occlusion mask is used for judging whether an occlusion object exists in continuous video frames or not, and the occlusion mask is added into calculation of bidirectional image reconstruction loss, so that the accuracy of depth estimation of the image with the occlusion object by the depth estimation network can be improved.
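A sketch of the bidirectional-flow occlusion check is given below: pixels where a flow and its synthesized counterpart fail to cancel are marked as occluded. The threshold form with alpha1 = 0.01 and alpha2 = 0.5 follows the reconstruction above and is an assumption about the exact formula; the function name is illustrative.

```python
# Sketch of an occlusion mask from bidirectional flow consistency.
import torch

def flow_occlusion_mask(flow, flow_synth, alpha1=0.01, alpha2=0.5):
    """flow, flow_synth: (B, 2, H, W) flow maps that should cancel when consistent.
    Returns a (B, 1, H, W) mask of non-occluded pixels."""
    mismatch = (flow + flow_synth).pow(2).sum(dim=1, keepdim=True)    # ~0 when consistent
    magnitude = flow.pow(2).sum(dim=1, keepdim=True) \
        + flow_synth.pow(2).sum(dim=1, keepdim=True)
    return (mismatch < alpha1 * magnitude + alpha2).float()
```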
Further, the method may further include:
(1) Determining a first scene depth value of the target frame image according to the second coordinate;
(2) Determining a second scene depth value of the reference frame image according to the fifth coordinate;
(3) Acquiring a third scene depth value of a pixel point corresponding to the second coordinate in the first scene depth image;
(4) Acquiring a fourth scene depth value of a pixel point corresponding to the fifth coordinate in the second scene depth image;
(5) Reconstructing a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
(6) Reconstructing a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
(7) Calculating a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, wherein the forward scene structure consistency loss is used for measuring a difference between a scene depth value of the target frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the target frame image;
(8) Calculating a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, wherein the backward scene structure consistency loss is used for measuring a difference between a scene depth value of the reference frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the reference frame image;
(9) And calculating the consistency loss of the bidirectional scene structure according to the consistency loss of the forward scene structure and the consistency loss of the backward scene structure.
The above steps are used to calculate the bidirectional scene structure consistency loss. First, according to the second coordinate $\hat{P}_{tgt}$ described above, the corresponding scene depth value $\tilde{d}_{tgt}$ (the first scene depth value) can be obtained, and according to the fifth coordinate $\hat{P}_{ref}$ described above, the corresponding scene depth value $\tilde{d}_{ref}$ (the second scene depth value) can be obtained. Then, from the first scene depth image, the depth value $d_{tgt}$ of the corresponding pixel point in the target frame image $I_{tgt}$ can be estimated (the third scene depth value), and from the second scene depth image, the depth value $d_{ref}$ of the corresponding pixel point in the reference frame image $I_{ref}$ can be estimated (the fourth scene depth value). Next, based on the third coordinate and the depth value $d_{ref}$, the fifth scene depth value $\hat{d}_{tgt}$ of the target frame image can be reconstructed through a bilinear sampling mechanism, and based on the sixth coordinate and the depth value $d_{tgt}$, the sixth scene depth value $\hat{d}_{ref}$ of the reference frame image can be reconstructed through a bilinear sampling mechanism.

Theoretically, the first scene depth value $\tilde{d}_{tgt}$ and the fifth scene depth value $\hat{d}_{tgt}$ should be equal, and the second scene depth value $\tilde{d}_{ref}$ and the sixth scene depth value $\hat{d}_{ref}$ should be equal. However, experimental tests show that they are not always equal, so the forward scene structure error $e_{fwd}$ and the backward scene structure error $e_{bwd}$ can be calculated by the following two formulas, and a consistency constraint is then applied to the scene structure:

$$e_{fwd} = \frac{\lvert \tilde{d}_{tgt} - \hat{d}_{tgt} \rvert}{\tilde{d}_{tgt} + \hat{d}_{tgt}}, \qquad e_{bwd} = \frac{\lvert \tilde{d}_{ref} - \hat{d}_{ref} \rvert}{\tilde{d}_{ref} + \hat{d}_{ref}}$$

By applying the consistency constraint to the scene structure, the positions of moving objects and occluding objects in the image scene can be located; for example, the larger the values of $e_{fwd}$ and $e_{bwd}$ are, the more likely it is that a moving object or an occluding object exists at that position.
Then, the forward scene structure consistency loss is calculated; it is used to measure the difference between the scene depth value of the target frame image calculated through multi-view geometric transformation and the scene depth value of the reconstructed target frame image, and can specifically be calculated with the following formula:

L_dsc^fwd = (1 / N_ref) · Σ D_tgt^diff

wherein N_ref represents the number of valid grid coordinates in the reference frame image, and the sum runs over those coordinates.
The backward scene structure consistency loss, which is used to measure the difference between the scene depth value of the reference frame image calculated through multi-view geometric transformation and the scene depth value of the reconstructed reference frame image, can be calculated with the following formula:

L_dsc^bwd = (1 / N_tgt) · Σ D_ref^diff

wherein N_tgt represents the number of valid grid coordinates in the target frame image, and the sum runs over those coordinates.
Finally, according to the forward scene structure consistency loss and the backward scene structure consistency loss, the bidirectional scene structure consistency loss can be calculated as follows:
L_dsc = L_dsc^fwd + L_dsc^bwd
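A minimal sketch of how the forward, backward and bidirectional scene structure consistency losses described above could be computed, assuming the normalized depth difference given above and assuming the two directional losses are simply summed; the valid-coordinate masks and variable names are illustrative.

```python
import torch

def scene_structure_error(d_proj, d_rec, eps=1e-7):
    """Normalized difference between the depth obtained by geometric
    transformation (e.g. the first scene depth value) and the depth
    reconstructed by bilinear sampling (e.g. the fifth scene depth value).
    Values lie in [0, 1)."""
    return (d_proj - d_rec).abs() / (d_proj + d_rec + eps)

def bidirectional_dsc_loss(d_tgt_proj, d_tgt_rec, d_ref_proj, d_ref_rec,
                           valid_ref, valid_tgt):
    """valid_ref / valid_tgt: boolean masks of the valid grid coordinates
    (N_ref and N_tgt points respectively)."""
    err_fwd = scene_structure_error(d_tgt_proj, d_tgt_rec)
    err_bwd = scene_structure_error(d_ref_proj, d_ref_rec)
    loss_fwd = err_fwd[valid_ref].mean()   # forward scene structure consistency loss
    loss_bwd = err_bwd[valid_tgt].mean()   # backward scene structure consistency loss
    return loss_fwd + loss_bwd, err_fwd, err_bwd
```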
the constructing the objective function based on the bidirectional image reconstruction loss may include:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
By adding the bidirectional scene structure consistency loss when the objective function is constructed, occluding objects and moving objects in the image scene to be detected can be handled effectively, thereby improving the accuracy of scene depth estimation.
On the other hand, when calculating the bidirectional image reconstruction loss, the two occlusion masks and the two scene structure errors described above may be introduced at the same time, and may be calculated by using the following formula:
L_photo = (1 / N_ref) · Σ M_fwd · (1 − D_tgt^diff) · L_photo^fwd + (1 / N_tgt) · Σ M_bwd · (1 − D_ref^diff) · L_photo^bwd

wherein L_photo^fwd and L_photo^bwd denote the per-pixel first and second image reconstruction losses, M_fwd and M_bwd denote the forward and backward flow occlusion masks, and (1 − D_tgt^diff) and (1 − D_ref^diff) are the forward and backward scene structure inconsistency weights derived from the scene structure errors. Weighting the image reconstruction loss functions with these terms achieves the purpose of handling occluded and moving objects. Specifically, the first image reconstruction loss is weighted by the forward flow occlusion mask and the forward scene structure inconsistency weight; the second image reconstruction loss is weighted by the backward flow occlusion mask and the backward scene structure inconsistency weight; and the bidirectional image reconstruction loss is constructed based on the weighted first image reconstruction loss and the weighted second image reconstruction loss.
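The weighting described above can be sketched as follows; the exact combination of occlusion mask, inconsistency weight and normalization is an assumption, shown here as a weighted average of the per-pixel reconstruction losses.

```python
import torch

def weighted_bidirectional_photo_loss(l_photo_fwd, l_photo_bwd,
                                      mask_fwd, mask_bwd,
                                      err_fwd, err_bwd, eps=1e-7):
    """l_photo_fwd / l_photo_bwd: per-pixel first / second image reconstruction losses.
    mask_fwd / mask_bwd: forward / backward flow occlusion masks in [0, 1].
    err_fwd / err_bwd: forward / backward scene structure errors in [0, 1)."""
    w_fwd = mask_fwd * (1.0 - err_fwd)   # forward weight: mask x inconsistency weight
    w_bwd = mask_bwd * (1.0 - err_bwd)   # backward weight
    loss_fwd = (w_fwd * l_photo_fwd).sum() / (w_fwd.sum() + eps)
    loss_bwd = (w_bwd * l_photo_bwd).sum() / (w_bwd.sum() + eps)
    return loss_fwd + loss_bwd
```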
In an embodiment of the present application, the depth estimation network includes an encoding network, and the method may further include:
(1) Acquiring a first characteristic image of the target frame image and a second characteristic image of the reference frame image through the coding network;
(2) Reconstructing a third characteristic image of the target frame image through a bilinear sampling mechanism based on the third coordinate and the second characteristic image;
(3) Reconstructing a fourth characteristic image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first characteristic image;
(4) And calculating to obtain a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, wherein the bidirectional feature perception loss is used for measuring a difference between the feature image of the target frame image obtained through the coding network and the feature image of the reconstructed target frame image, and a difference between the feature image of the reference frame image obtained through the coding network and the feature image of the reconstructed reference frame image.
The above steps are used to calculate the bidirectional feature perception loss. Compared with the original RGB image, the features extracted by the encoder are more discriminative in weak-texture areas. The present application uses the highest-resolution feature image extracted by the coding network to handle weak-texture regions: through the coding network in the depth estimation network, the feature image f_tgt of the target frame image (the first feature image) and the feature image f_ref of the reference frame image (the second feature image) can be extracted. Then, based on the third coordinate and the feature image f_ref of the reference frame image, the feature image f_ref can be affine-transformed through a bilinear sampling mechanism to reconstruct the third feature image f̃_tgt of the target frame image; based on the sixth coordinate and the feature image f_tgt of the target frame image, the feature image f_tgt can be affine-transformed through a bilinear sampling mechanism to reconstruct the fourth feature image f̃_ref of the reference frame image. Then, the bidirectional feature perception loss can be calculated with the following formula:

L_feat = (1 / N_ref) · Σ |f_tgt − f̃_tgt| + (1 / N_tgt) · Σ |f_ref − f̃_ref|

The bidirectional feature perception loss L_feat is used to measure the difference between the feature image of the target frame image obtained through the coding network and the feature image of the reconstructed target frame image, and the difference between the feature image of the reference frame image obtained through the coding network and the feature image of the reconstructed reference frame image.
The constructing the objective function based on the bidirectional image reconstruction loss may include:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
By introducing bidirectional feature perception loss into the objective function, the weak texture scene in the image to be detected can be effectively processed, and therefore the accuracy of scene depth estimation is improved.
Further, after the first feature image of the target frame image and the second feature image of the reference frame image are acquired through the coding network, the method may further include:
and calculating to obtain a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first characteristic image and the second characteristic image, wherein the smoothing loss is used for regularizing gradients of the scene depth image and the characteristic image obtained through the depth estimation network.
The constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss may include:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
In order to regularize the gradients of the scene depth images and the feature images obtained by the depth estimation network, a smoothing loss L_s may be introduced into the objective function, which can specifically be calculated with the following formula:

L_s = Σ |∂d_tgt| · e^(−|∂I_tgt|) + Σ |∂d_ref| · e^(−|∂I_ref|) + Σ |∂f_tgt| · e^(−|∂I_tgt|) + Σ |∂f_ref| · e^(−|∂I_ref|)

wherein |∂d_ref| denotes taking the partial derivatives of the reference frame depth map d_ref estimated by the depth estimation network and then taking the absolute value at each element position, |∂I_ref| denotes taking the partial derivatives of the reference frame image I_ref and then taking the absolute value at each element position, e^(−|∂I_ref|) denotes the natural exponential with −|∂I_ref| as the power, and so on.
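The |∂·| · e^(−|∂I|) structure above corresponds to the commonly used edge-aware smoothness term. The sketch below shows one way to compute it for a single predicted map; the full L_s would sum this term over the depth and feature maps of both frames, where the exact grouping is an assumption.

```python
import torch

def edge_aware_smoothness(pred, img):
    """First-order edge-aware smoothness: gradients of the predicted map
    (a depth image or feature image) are down-weighted where the RGB image
    itself has strong gradients.

    pred: (N, C, H, W) predicted depth or feature map
    img:  (N, 3, H, W) corresponding RGB frame
    """
    grad_pred_x = (pred[:, :, :, :-1] - pred[:, :, :, 1:]).abs()
    grad_pred_y = (pred[:, :, :-1, :] - pred[:, :, 1:, :]).abs()
    grad_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    grad_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (grad_pred_x * torch.exp(-grad_img_x)).mean() + \
           (grad_pred_y * torch.exp(-grad_img_y)).mean()
```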
In the foregoing, four types of loss functions are proposed, which are bidirectional image reconstruction loss, smoothing loss, bidirectional scene structure consistency loss and bidirectional feature perception loss, and a final objective function can be constructed and obtained based on the loss functions. For example, the expression of a certain objective function L is as follows:
L = λ_photo · L_photo + λ_s · L_s + λ_dsc · L_dsc + λ_feat · L_feat

wherein each λ is a set coefficient, for example, λ_photo = 1.0, λ_s = 0.001, λ_dsc = 0.5 and λ_feat = 0.05.
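A one-line sketch of the example objective above with the quoted coefficients; the individual loss values are assumed to be scalars computed as in the preceding sketches.

```python
def total_objective(l_photo, l_s, l_dsc, l_feat,
                    lam_photo=1.0, lam_s=0.001, lam_dsc=0.5, lam_feat=0.05):
    """Weighted sum of the four loss terms,
    L = lam_photo*L_photo + lam_s*L_s + lam_dsc*L_dsc + lam_feat*L_feat."""
    return (lam_photo * l_photo + lam_s * l_s +
            lam_dsc * l_dsc + lam_feat * l_feat)
```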
In addition, in the calculation of each loss function described above, the calculation for a single reference frame image is illustrated. If there are multiple reference frame images, a corresponding loss value may be calculated for each reference frame image in the same manner as described above, and the average of the loss values corresponding to the reference frame images may then be used as the loss value for the final construction of the objective function.
And 2.7, updating the parameters of the depth estimation network according to the objective function.
After the objective function is constructed, the parameters of the depth estimation network can be updated according to the objective function, so that the purpose of optimizing and training the network is achieved. Specifically, an AdamW optimizer can be used to solve a gradient of the objective function relative to the weight of the depth estimation network, and the weight of the depth estimation network is updated according to the gradient, so that iteration is performed continuously until a set maximum number of iterations is reached, and training of the depth estimation network is completed.
Further, the objective function may be used to train the camera pose estimation network described above. Similarly, an AdamW optimizer can be used for solving the gradient of the objective function relative to the weight of the camera pose estimation network, the weight of the camera pose estimation network is updated according to the gradient, iteration is carried out continuously until the set maximum iteration number is reached, and the training of the camera pose estimation network is completed. In general, after the objective function is constructed, the objective function can be used as a supervision signal to jointly guide the training of the depth estimation network and the camera pose estimation network. Specifically, an AdamW optimizer can be used to solve the gradient of the objective function relative to the weight of the depth estimation network and the gradient of the objective function relative to the weight of the camera pose estimation network, and update the weights of the depth estimation network and the camera pose estimation network at the same time according to the gradients, so that iteration is performed continuously until the set maximum iteration number is reached, and the joint training of the depth estimation network and the camera pose estimation network is completed.
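A skeleton of the joint optimization described above using torch.optim.AdamW; compute_objective, the data loader format, the learning rate and the maximum iteration count are placeholders and assumptions, not the patent's implementation.

```python
import itertools
import torch

def train_jointly(depth_net, pose_net, loader, compute_objective,
                  lr=1e-4, max_iterations=200000):
    """Jointly update the depth estimation network and the camera pose
    estimation network with AdamW until the set maximum iteration count."""
    optimizer = torch.optim.AdamW(
        itertools.chain(depth_net.parameters(), pose_net.parameters()), lr=lr)
    for step, (tgt, refs, intrinsics) in enumerate(loader):
        if step >= max_iterations:
            break
        # Objective function L built from the losses described above.
        loss = compute_objective(depth_net, pose_net, tgt, refs, intrinsics)
        optimizer.zero_grad()
        loss.backward()    # gradients of L w.r.t. both networks' weights
        optimizer.step()   # simultaneous update of both networks
```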
After the training of the two networks is completed, a monocular image (for example, an image to be measured) can be used as an input of the depth estimation network, and a corresponding scene depth image is directly calculated. A sequence of consecutive images (e.g., any sequence of 5 monocular images) may also be used as input to the camera pose estimation network to compute the corresponding camera pose vector. It should be noted that the depth estimation network and the camera pose estimation network only need to be jointly optimized during training, the weights of the networks are fixed after training is completed, reverse propagation is not needed during testing, only forward propagation is needed, and therefore the two networks can be used independently during testing.
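At test time the two networks are used independently with forward passes only, for example as follows; the tensor shapes and the added batch dimension are assumptions about the input format.

```python
import torch

@torch.no_grad()
def predict_depth(depth_net, image):
    """Single monocular image (C, H, W) -> scene depth image."""
    depth_net.eval()
    return depth_net(image.unsqueeze(0))   # add the batch dimension

@torch.no_grad()
def predict_pose(pose_net, image_sequence):
    """Short monocular sequence, e.g. 5 stacked frames -> camera pose vectors."""
    pose_net.eval()
    return pose_net(image_sequence.unsqueeze(0))
```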
When the depth estimation network adopted in the embodiments of the present application optimizes and updates its parameters, the camera pose vector of the input sample image sequence is first predicted in combination with the camera pose estimation network, the sample image sequence comprising a target frame image and a reference frame image; then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the internal parameters of the corresponding camera; next, the loss function associated with the image reconstruction is calculated according to the target frame image and the reconstructed image; and finally, the objective function is constructed based on the loss function and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the latent image information contained in the target frame image and the reference frame image can be fully mined; in other words, sufficient image information can be obtained from fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data acquisition.
In addition, by adding the bidirectional image reconstruction loss, the bidirectional scene structure consistency loss, the bidirectional feature perception loss and the smoothing loss to the objective function, the latent information contained in the images can be further mined and the acquisition cost of sample data reduced, problems such as moving objects and occlusion in video frames can be handled effectively, and the robustness to weak-texture environments is improved.
The technical effects of the image scene depth estimation and camera pose estimation proposed by the present application are illustrated below through simulation results. The test set of the Eigen split is used as the evaluation data for the depth estimation network, and sequences 09-10 of the KITTI odometry dataset are used as the evaluation data for the camera pose estimation network.
The evaluation metrics adopted for the depth estimation network include: absolute relative error (AbsRel), squared relative error (SqRel), root mean square error (RMSE), logarithmic root mean square error (RMSElog) and the threshold accuracy (δ_t); the evaluation metric adopted for the camera pose estimation network is the Absolute Trajectory Error (ATE). Through simulation tests, the results of the method proposed in the present application compared with prior-art algorithms are shown in Tables 1 to 3 below.
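For reference, the commonly used definitions of these depth evaluation metrics over the valid ground-truth pixels can be sketched as follows; these are the standard formulas, not text taken from the patent.

```python
import numpy as np

def depth_metrics(pred, gt):
    """pred, gt: 1-D arrays of predicted and ground-truth depths at valid pixels."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)                      # AbsRel
    sq_rel = np.mean((pred - gt) ** 2 / gt)                        # SqRel
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                      # RMSE
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))  # RMSElog
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** t) for t in (1, 2, 3)]       # delta_t thresholds
    return abs_rel, sq_rel, rmse, rmse_log, deltas
```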
TABLE 1
(Table 1 is provided as an image in the original publication: quantitative comparison of monocular depth prediction results within an 80 m depth range.)
Table 1 shows a comparison of scene depth prediction results for monocular images within a depth range of 80 m. The absolute relative error (AbsRel), root mean square error (RMSE), squared relative error (SqRel) and logarithmic root mean square error (RMSElog) are error values used to measure the accuracy of an algorithm: the smaller the error value, the higher the accuracy. The threshold metric (δ_t) represents how close the predicted scene depth is to the ground truth: the higher the threshold metric, the better the stability of the algorithm. The test results in Table 1 show that, compared with prior-art algorithms, the method provided by the present application achieves higher scene depth prediction accuracy and better algorithm stability.
TABLE 2
(Table 2 is provided as an image in the original publication: quantitative comparison of monocular depth prediction results within a 50 m depth range.)
Table 2 shows a comparison of scene depth prediction results for monocular images within a depth range of 50 m. The test results in Table 2 likewise show that, compared with prior-art algorithms, the method provided by the present application achieves higher scene depth prediction accuracy and better algorithm stability, and can therefore predict the scene depth of a monocular image, and more of its details, more robustly.
TABLE 3
(Table 3 is provided as an image in the original publication: comparison of absolute trajectory error for camera pose estimation on the KITTI odometry sequences.)
The Absolute Trajectory Error (ATE) in Table 3 represents the difference between the ground-truth camera trajectory and the predicted camera trajectory; the smaller the error value, the more accurate the predicted camera pose. The simulation results show that, compared with various existing algorithms, the method provided by the present application predicts the camera pose more accurately.
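A simplified sketch of the absolute trajectory error; full evaluations usually also align scale and rotation (for example with a Umeyama alignment), which is omitted here for brevity.

```python
import numpy as np

def absolute_trajectory_error(pred_xyz, gt_xyz):
    """pred_xyz, gt_xyz: (N, 3) camera positions of the predicted and
    ground-truth trajectories. Trajectories are aligned by their centroids
    and the RMSE of the remaining position differences is returned."""
    pred_aligned = pred_xyz - pred_xyz.mean(axis=0) + gt_xyz.mean(axis=0)
    return np.sqrt(np.mean(np.sum((pred_aligned - gt_xyz) ** 2, axis=1)))
```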
In addition, fig. 6 is a comparison of the monocular depth prediction results of the image scene depth estimation method of the present application and of various prior-art algorithms, where the Ground Truth depth map is obtained by visualizing lidar data.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
The above mainly describes a method for estimating depth of an image scene, and an apparatus for estimating depth of an image scene will be described below.
Referring to fig. 7, an embodiment of an apparatus for estimating depth of an image scene in an embodiment of the present application includes:
an image to be measured acquisition module 701, configured to acquire an image to be measured;
a scene depth estimation module 702, configured to input the image to be detected into a depth estimation network that is constructed in advance, to obtain a scene depth image of the image to be detected;
a sample obtaining module 703, configured to obtain a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is an image of more than one frame in the sample image sequence before or after the target frame image;
a first scene depth prediction module 704, configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image;
a camera pose estimation module 705, configured to input the sample image sequence into a camera pose estimation network constructed in advance, to obtain a predicted camera pose vector between the target frame image and the reference frame image;
a first image reconstruction module 706, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and an internal reference of a camera used for capturing the sample image sequence;
a first image reconstruction loss calculation module 707, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure a difference between the target frame image and the first reconstructed image;
an objective function construction module 708 for constructing an objective function based on the first image reconstruction loss;
a network parameter updating module 709, configured to update parameters of the depth estimation network according to the objective function.
In one embodiment of the present application, the apparatus may further include:
the second scene depth prediction module is used for inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
a second image reconstruction module, configured to generate a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and an internal reference of a camera used for shooting the sample image sequence;
a second image reconstruction loss calculation module, configured to calculate a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure a difference between the reference frame image and the second reconstructed image;
the objective function building module may include:
a bidirectional image reconstruction loss calculation unit for calculating a bidirectional image reconstruction loss from the first image reconstruction loss and the second image reconstruction loss;
and the objective function construction unit is used for constructing the objective function based on the bidirectional image reconstruction loss.
Further, the first image reconstruction module may include:
a first transformation matrix determination unit for determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
the first coordinate calculation unit is used for calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
the first coordinate transformation unit is used for transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image under a world coordinate system after being transformed;
a first coordinate conversion unit for converting the second coordinate into a third coordinate in an image coordinate system;
the first image reconstruction unit is used for reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image;
the second image reconstruction module may include:
a second transformation matrix determination unit for determining a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
the second coordinate calculation unit is used for calculating a fourth coordinate of the reference frame image in a world coordinate system according to the internal reference of the camera and the second scene depth image;
the second coordinate transformation unit is used for transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the reference frame image in a world coordinate system after transformation;
a second coordinate conversion unit, configured to convert the fifth coordinate into a sixth coordinate in an image coordinate system;
and the second image reconstruction unit is used for reconstructing an image of the target frame image after affine transformation by using the sixth coordinate as a grid point through a bilinear sampling mechanism based on the target frame image, and determining the reconstructed image as the second reconstructed image.
In one embodiment of the present application, the apparatus may further include:
the first coordinate acquisition module is used for acquiring a seventh coordinate of the target frame image in an image coordinate system;
a forward flow coordinate determination module, configured to perform element-wise subtraction between the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
the second coordinate acquisition module is used for acquiring an eighth coordinate of the reference frame image in an image coordinate system;
the backward flow coordinate determination module is used for performing element-wise subtraction between the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
a forward flow coordinate synthesis module, configured to perform affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and using a bilinear sampling mechanism to synthesize a second forward flow coordinate;
the backward flow coordinate synthesis module is used for carrying out affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and adopting a bilinear sampling mechanism so as to synthesize a second backward flow coordinate;
a forward flow occlusion mask calculation module, configured to calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, where the forward flow occlusion mask is used to measure a matching degree between the first forward flow coordinate and the second forward flow coordinate;
a backward flow occlusion mask calculation module, configured to calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, where the backward flow occlusion mask is used to measure a matching degree between the first backward flow coordinate and the second backward flow coordinate;
The bidirectional image reconstruction loss calculation unit may be specifically configured to: calculate the bidirectional image reconstruction loss according to the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask and the backward flow occlusion mask.
In one embodiment of the present application, the apparatus may further include:
a first scene depth value determining module, configured to determine a first scene depth value of the target frame image according to the second coordinate;
a second scene depth value determining module, configured to determine a second scene depth value of the reference frame image according to the fifth coordinate;
a third scene depth value determining module, configured to obtain a third scene depth value of a pixel point corresponding to the second coordinate in the first scene depth image;
a fourth scene depth value determining module, configured to obtain a fourth scene depth value of a pixel point corresponding to the fifth coordinate in the second scene depth image;
a first scene depth value reconstruction module, configured to reconstruct a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
a second scene depth value reconstruction module, configured to reconstruct a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
a forward scene structure consistency loss calculation module, configured to calculate a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, where the forward scene structure consistency loss is used to measure a difference between a scene depth value of the target frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the target frame image;
a backward scene structure consistency loss calculation module, configured to calculate a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, where the backward scene structure consistency loss is used to measure a difference between a scene depth value of the reference frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the reference frame image;
the bidirectional scene structure consistency loss calculation module is used for calculating bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss;
The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
In one embodiment of the present application, the depth estimation network includes a coding network, and the apparatus may further include:
the characteristic image acquisition module is used for acquiring a first characteristic image of the target frame image and a second characteristic image of the reference frame image through the coding network;
the first characteristic image reconstruction module is used for reconstructing a third characteristic image of the target frame image through a bilinear sampling mechanism based on the third coordinate and the second characteristic image;
the second characteristic image reconstruction module is used for reconstructing a fourth characteristic image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first characteristic image;
a bidirectional feature perception loss calculation module, configured to calculate a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image, and the fourth feature image, where the bidirectional feature perception loss is used to measure a difference between a feature image of the target frame image obtained through a coding network and a feature image of the reconstructed target frame image, and a difference between a feature image of the reference frame image obtained through the coding network and a feature image of the reconstructed reference frame image;
The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
Further, the apparatus may further include:
a smoothing loss calculation module, configured to calculate a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, where the smoothing loss is used to regularize the gradients of the scene depth images and the feature images obtained through the depth estimation network;
The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when executed by a processor, the computer program implements any one of the methods for estimating depth of an image scene as shown in fig. 1.
Embodiments of the present application further provide a computer program product, which when running on a terminal device, causes the terminal device to execute a method for estimating depth of an image scene, which implements any one of the methods shown in fig. 1.
Fig. 8 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in fig. 8, the terminal device 8 of this embodiment includes: a processor 80, a memory 81 and a computer program 82 stored in said memory 81 and executable on said processor 80. The processor 80, when executing the computer program 82, implements the steps in the embodiments of the method for estimating depth of an image scene described above, such as the steps 101 to 102 shown in fig. 1. Alternatively, the processor 80, when executing the computer program 82, implements the functions of each module/unit in each device embodiment described above, for example, the functions of the modules 701 to 709 shown in fig. 7.
The computer program 82 may be divided into one or more modules/units that are stored in the memory 81 and executed by the processor 80 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 82 in the terminal device 8.
The Processor 80 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The storage 81 may be an internal storage unit of the terminal device 8, such as a hard disk or a memory of the terminal device 8. The memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 8. Further, the memory 81 may also include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used for storing the computer program and other programs and data required by the terminal device. The memory 81 may also be used to temporarily store data that has been output or is to be output.
It should be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the above described functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules or units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, can realize the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in different jurisdictions; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (7)

1. A method for estimating depth of an image scene, comprising:
acquiring an image to be detected;
inputting the image to be detected into a pre-constructed depth estimation network to obtain a scene depth image of the image to be detected;
wherein the parameters of the depth estimation network are updated by:
acquiring a sample image sequence, wherein the sample image sequence comprises a target frame image and a reference frame image, and the reference frame image is more than one frame of image in the sample image sequence before or after the target frame image;
inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera attitude vector, the reference frame image and internal parameters of a camera adopted for shooting the sample image sequence;
calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, wherein the first image reconstruction loss is used for measuring the difference between the target frame image and the first reconstructed image;
constructing an objective function based on the first image reconstruction loss;
updating parameters of the depth estimation network according to the objective function;
after acquiring the sample image sequence, the method further comprises:
inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera attitude vector, the target frame image and internal parameters of a camera adopted for shooting the sample image sequence;
calculating a second image reconstruction loss from the reference frame image and the second reconstructed image, the second image reconstruction loss being used to measure a difference between the reference frame image and the second reconstructed image;
the constructing an objective function based on the first image reconstruction loss comprises:
calculating bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss;
constructing the objective function based on the bidirectional image reconstruction loss;
wherein the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal reference of the camera used for shooting the sample image sequence includes:
determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image in a world coordinate system after transformation;
converting the second coordinate into a third coordinate in an image coordinate system;
reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and an internal reference of a camera used for shooting the sample image sequence, including:
determining a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
calculating a fourth coordinate of the reference frame image in a world coordinate system according to the internal reference of the camera and the second scene depth image;
transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the reference frame image in a world coordinate system after transformation;
converting the fifth coordinate into a sixth coordinate in an image coordinate system;
based on the target frame image, reconstructing an affine-transformed image of the target frame image by using the sixth coordinate as a grid point through a bilinear sampling mechanism, and determining the reconstructed image as the second reconstructed image;
the method further comprises the following steps:
acquiring a seventh coordinate of the target frame image in an image coordinate system;
performing difference processing on corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
acquiring an eighth coordinate of the reference frame image in an image coordinate system;
performing difference processing on corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
performing affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second forward flow coordinate;
performing affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and adopting a bilinear sampling mechanism to synthesize a second backward flow coordinate;
calculating a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, wherein the forward flow occlusion mask is used for measuring the matching degree between the first forward flow coordinate and the second forward flow coordinate;
calculating a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, wherein the backward flow occlusion mask is used for measuring the matching degree between the first backward flow coordinate and the second backward flow coordinate;
said calculating a bi-directional image reconstruction loss from said first image reconstruction loss and said second image reconstruction loss comprises:
calculating the bi-directional image reconstruction loss from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
2. The method of claim 1, further comprising:
determining a first scene depth value of the target frame image according to the second coordinate;
determining a second scene depth value of the reference frame image according to the fifth coordinate;
acquiring a third scene depth value of a pixel point corresponding to the second coordinate in the first scene depth image;
acquiring a fourth scene depth value of a pixel point corresponding to the fifth coordinate in the second scene depth image;
reconstructing a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
reconstructing a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
calculating a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, wherein the forward scene structure consistency loss is used for measuring a difference between a scene depth value of the target frame image calculated through multi-view geometric transformation and a reconstructed scene depth value of the target frame image;
calculating a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, wherein the backward scene structure consistency loss is used for measuring the difference between the scene depth value of the reference frame image calculated by multi-view geometric transformation and the reconstructed scene depth value of the reference frame image;
calculating the consistency loss of the bidirectional scene structure according to the consistency loss of the forward scene structure and the consistency loss of the backward scene structure;
the constructing the objective function based on the bi-directional image reconstruction loss comprises:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
3. The method of claim 1 or 2, wherein the depth estimation network comprises an encoding network, the method further comprising:
acquiring a first characteristic image of the target frame image and a second characteristic image of the reference frame image through the coding network;
reconstructing a third characteristic image of the target frame image through a bilinear sampling mechanism based on the third coordinate and the second characteristic image;
reconstructing a fourth characteristic image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first characteristic image;
calculating to obtain a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, wherein the bidirectional feature perception loss is used for measuring a difference between a feature image of the target frame image obtained through an encoding network and a feature image of the reconstructed target frame image, and a difference between a feature image of the reference frame image obtained through the encoding network and a feature image of the reconstructed reference frame image;
the constructing the objective function based on the bidirectional image reconstruction loss comprises:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
4. The method of claim 3, further comprising, after acquiring the first feature image of the target frame image and the second feature image of the reference frame image over the encoding network:
calculating to obtain a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, wherein the smoothing loss is used for regularizing gradients of the scene depth image and the feature image obtained through the depth estimation network;
the constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss comprises:
and constructing and obtaining the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
5. An apparatus for estimating depth of an image scene, comprising:
the to-be-detected image acquisition module is used for acquiring an image to be detected;
the scene depth estimation module is used for inputting the image to be detected into a depth estimation network which is constructed in advance to obtain a scene depth image of the image to be detected;
the device comprises a sample acquisition module, a processing module and a processing module, wherein the sample acquisition module is used for acquiring a sample image sequence, the sample image sequence comprises a target frame image and a reference frame image, and the reference frame image is more than one frame of image in the sample image sequence before or after the target frame image;
the first scene depth prediction module is used for inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
the camera attitude estimation module is used for inputting the sample image sequence into a camera attitude estimation network which is constructed in advance to obtain a predicted camera attitude vector between the target frame image and the reference frame image;
a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and an internal reference of a camera used for shooting the sample image sequence;
a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure a difference between the target frame image and the first reconstructed image;
an objective function construction module for constructing an objective function based on the first image reconstruction loss;
the network parameter updating module is used for updating the parameters of the depth estimation network according to the target function;
the second scene depth prediction module is used for inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
a second image reconstruction module, configured to generate a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and an internal reference of a camera used for shooting the sample image sequence;
a second image reconstruction loss calculation module, configured to calculate a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure a difference between the reference frame image and the second reconstructed image;
the objective function building module comprises:
a bidirectional image reconstruction loss calculation unit for calculating a bidirectional image reconstruction loss from the first image reconstruction loss and the second image reconstruction loss;
an objective function construction unit for constructing the objective function based on the bidirectional image reconstruction loss;
the first image reconstruction module comprises:
a first transformation matrix determination unit for determining a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
the first coordinate calculation unit is used for calculating a first coordinate of the target frame image in a world coordinate system according to the internal reference of the camera and the first scene depth image;
the first coordinate transformation unit is used for transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the target frame image in a world coordinate system after being transformed;
a first coordinate conversion unit for converting the second coordinate into a third coordinate in an image coordinate system;
the first image reconstruction unit is used for reconstructing an image of the reference frame image after affine transformation by using the third coordinate as a grid point through a bilinear sampling mechanism based on the reference frame image, and determining the reconstructed image as the first reconstructed image;
the second image reconstruction module includes:
a second transformation matrix determination unit for determining a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
the second coordinate calculation unit is used for calculating a fourth coordinate of the reference frame image in a world coordinate system according to the internal reference of the camera and the second scene depth image;
the second coordinate transformation unit is used for transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the reference frame image under a world coordinate system after being transformed;
a second coordinate conversion unit, configured to convert the fifth coordinate into a sixth coordinate in an image coordinate system;
a second image reconstruction unit, configured to reconstruct an image of the target frame image after affine transformation by using the sixth coordinate as a grid point through a bilinear sampling mechanism based on the target frame image, and determine the reconstructed image as the second reconstructed image;
the device further comprises:
the first coordinate acquisition module is used for acquiring a seventh coordinate of the target frame image in an image coordinate system;
a forward flow coordinate determination module, configured to perform element-wise subtraction between the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
the second coordinate acquisition module is used for acquiring an eighth coordinate of the reference frame image in an image coordinate system;
the backward flow coordinate determination module is used for performing element-wise subtraction between the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
a forward flow coordinate synthesis module, configured to perform affine transformation on the first backward flow coordinate by using the third coordinate as a grid point and using a bilinear sampling mechanism to synthesize a second forward flow coordinate;
a backward flow coordinate synthesis module, configured to perform affine transformation on the first forward flow coordinate by using the sixth coordinate as a grid point and using a bilinear sampling mechanism, so as to synthesize a second backward flow coordinate;
a forward flow occlusion mask calculation module, configured to calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, where the forward flow occlusion mask is used to measure a matching degree between the first forward flow coordinate and the second forward flow coordinate;
a backward flow occlusion mask calculation module, configured to calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, where the backward flow occlusion mask is used to measure a matching degree between the first backward flow coordinate and the second backward flow coordinate;
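The flow coordinates and occlusion masks recited above follow the familiar forward-backward consistency pattern: subtracting the pixel grid from the projected coordinates yields a rigid flow, the opposite-direction flow is warped into the same view, and pixels where the two flows disagree are treated as occluded. The sketch below shows one common way to realize such a check (in the spirit of UnFlow-style consistency); the sign convention and the thresholds alpha and beta are illustrative assumptions, not values from the patent.

```python
import torch

def rigid_flow(projected_uv, pixel_grid):
    # Element-wise difference between the projected coordinates (e.g. the third or
    # sixth coordinate) and the original pixel grid (the seventh or eighth coordinate).
    return projected_uv - pixel_grid                                  # (B, 2, H, W)

def occlusion_mask(flow_fw, flow_bw_warped, alpha=0.01, beta=0.5):
    # Forward-backward consistency: when the warped backward flow is approximately
    # the negative of the forward flow, the pixel is considered visible in both views.
    mismatch = (flow_fw + flow_bw_warped).pow(2).sum(dim=1, keepdim=True)
    magnitude = flow_fw.pow(2).sum(dim=1, keepdim=True) \
              + flow_bw_warped.pow(2).sum(dim=1, keepdim=True)
    return (mismatch < alpha * magnitude + beta).float()              # 1 = visible, 0 = occluded
```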
the bidirectional image reconstruction loss calculation unit is specifically configured to: calculate the bidirectional image reconstruction loss from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
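A hedged sketch of how such a masked, bidirectional combination is commonly realized; the per-pixel normalization and equal weighting below are assumptions, not the patent's exact formulation.

```python
def bidirectional_reconstruction_loss(per_pixel_loss_fw, per_pixel_loss_bw,
                                      mask_fw, mask_bw, eps=1e-6):
    # Weight each per-pixel photometric loss by its occlusion mask so that pixels
    # judged occluded do not contribute, then average over the visible pixels only.
    loss_fw = (per_pixel_loss_fw * mask_fw).sum() / (mask_fw.sum() + eps)
    loss_bw = (per_pixel_loss_bw * mask_bw).sum() / (mask_bw.sum() + eps)
    return loss_fw + loss_bw
```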
6. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method for estimating depth of an image scene according to any one of claims 1 to 4 when executing the computer program.
7. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the image scene depth estimation method according to any one of claims 1 to 4.
CN202110346713.3A 2021-03-31 2021-03-31 Image scene depth estimation method and device, terminal equipment and storage medium Active CN113160294B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110346713.3A CN113160294B (en) 2021-03-31 2021-03-31 Image scene depth estimation method and device, terminal equipment and storage medium
PCT/CN2021/137609 WO2022206020A1 (en) 2021-03-31 2021-12-13 Method and apparatus for estimating depth of field of image, and terminal device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110346713.3A CN113160294B (en) 2021-03-31 2021-03-31 Image scene depth estimation method and device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113160294A CN113160294A (en) 2021-07-23
CN113160294B true CN113160294B (en) 2022-12-23

Family

ID=76885688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110346713.3A Active CN113160294B (en) 2021-03-31 2021-03-31 Image scene depth estimation method and device, terminal equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113160294B (en)
WO (1) WO2022206020A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160294B (en) * 2021-03-31 2022-12-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium
CN113592940B (en) * 2021-07-28 2024-07-02 北京地平线信息技术有限公司 Method and device for determining target object position based on image
CN113592706B (en) * 2021-07-28 2023-10-17 北京地平线信息技术有限公司 Method and device for adjusting homography matrix parameters
CN113792730B (en) * 2021-08-17 2022-09-27 北京百度网讯科技有限公司 Method and device for correcting document image, electronic equipment and storage medium
CN114049388A (en) * 2021-11-10 2022-02-15 北京地平线信息技术有限公司 Image data processing method and device
CN113793283B (en) * 2021-11-15 2022-02-11 江苏游隼微电子有限公司 Vehicle-mounted image noise reduction method
CN114219900B (en) * 2022-02-21 2022-07-01 北京影创信息科技有限公司 Three-dimensional scene reconstruction method, reconstruction system and application based on mixed reality glasses
CN114627006B (en) * 2022-02-28 2022-12-20 复旦大学 Progressive image restoration method based on depth decoupling network

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9589326B2 (en) * 2012-11-29 2017-03-07 Korea Institute Of Science And Technology Depth image processing apparatus and method based on camera pose conversion
AU2019270095B2 (en) * 2018-05-17 2024-06-27 Niantic, Inc. Self-supervised training of a depth estimation system
CN110490928B (en) * 2019-07-05 2023-08-15 天津大学 Camera attitude estimation method based on deep neural network
CN110503680B (en) * 2019-08-29 2023-08-18 大连海事大学 Unsupervised convolutional neural network-based monocular scene depth estimation method
CN110782490B (en) * 2019-09-24 2022-07-05 武汉大学 Video depth map estimation method and device with space-time consistency
CN111105451B (en) * 2019-10-31 2022-08-05 武汉大学 Driving scene binocular depth estimation method for overcoming occlusion effect
US11157774B2 (en) * 2019-11-14 2021-10-26 Zoox, Inc. Depth data model training with upsampling, losses, and loss balancing
CN111311685B (en) * 2020-05-12 2020-08-07 中国人民解放军国防科技大学 Motion scene reconstruction unsupervised method based on IMU and monocular image
CN111369608A (en) * 2020-05-29 2020-07-03 南京晓庄学院 Visual odometer method based on image depth estimation
CN111783582A (en) * 2020-06-22 2020-10-16 东南大学 Unsupervised monocular depth estimation algorithm based on deep learning
CN112819875B (en) * 2021-02-03 2023-12-19 苏州挚途科技有限公司 Monocular depth estimation method and device and electronic equipment
CN113160294B (en) * 2021-03-31 2022-12-23 中国科学院深圳先进技术研究院 Image scene depth estimation method and device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN113160294A (en) 2021-07-23
WO2022206020A1 (en) 2022-10-06

Similar Documents

Publication Publication Date Title
CN113160294B (en) Image scene depth estimation method and device, terminal equipment and storage medium
US10593021B1 (en) Motion deblurring using neural network architectures
CN110910447B (en) Visual odometer method based on dynamic and static scene separation
CN110501072B (en) Reconstruction method of snapshot type spectral imaging system based on tensor low-rank constraint
Sinha et al. GPU-based video feature tracking and matching
CN107525588B (en) Rapid reconstruction method of dual-camera spectral imaging system based on GPU
CN112001914A (en) Depth image completion method and device
CN112330729A (en) Image depth prediction method and device, terminal device and readable storage medium
CN109146787B (en) Real-time reconstruction method of dual-camera spectral imaging system based on interpolation
CN105488759B (en) A kind of image super-resolution rebuilding method based on local regression model
CN114152217B (en) Binocular phase expansion method based on supervised learning
CN110910437A (en) Depth prediction method for complex indoor scene
CN112819876A (en) Monocular vision depth estimation method based on deep learning
CN113962858A (en) Multi-view depth acquisition method
CN117542122B (en) Human body pose estimation and three-dimensional reconstruction method, network training method and device
CN116773018A (en) Space spectrum combined image reconstruction method and system for calculating spectrum imaging
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
KR20230150867A (en) Multi-view neural person prediction using implicit discriminative renderer to capture facial expressions, body posture geometry, and clothing performance
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN117095132B (en) Three-dimensional reconstruction method and system based on implicit function
CN117036613B (en) Polarization three-dimensional reconstruction method and system based on multiple receptive field blending network
CN117788544A (en) Image depth estimation method based on lightweight attention mechanism
Polasek et al. Vision UFormer: Long-range monocular absolute depth estimation
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN116934591A (en) Image stitching method, device and equipment for multi-scale feature extraction and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant