WO2022206020A1 - Method and apparatus for estimating depth of field of image, and terminal device and storage medium - Google Patents
- Publication number
- WO2022206020A1 (PCT application No. PCT/CN2021/137609)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- frame image
- coordinate
- scene
- target frame
- Prior art date
Classifications
- G06T7/50 — Image analysis; depth or shape recovery
- G06N3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06T7/85 — Analysis of captured images to determine intrinsic or extrinsic camera parameters (camera calibration); stereo camera calibration
Description
- the present application relates to the technical field of image processing, and in particular, to a method, apparatus, terminal device and storage medium for estimating the depth of an image scene.
- Scene depth estimation from images is an important research direction in the fields of robot navigation and autonomous driving.
- in practice, deep neural networks are usually used to predict the scene depth of images.
- a large amount of sample data is required when training the deep neural network, resulting in high data acquisition costs.
- the embodiments of the present application provide a method, apparatus, terminal device, and storage medium for estimating the depth of an image scene, which can reduce the cost of sample data collection.
- a first aspect of the embodiments of the present application provides a method for estimating the depth of an image scene, including:
- the parameters of the depth estimation network are updated in the following ways:
- the sample image sequence includes a target frame image and a reference frame image
- the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence
- the parameters of the depth estimation network are updated according to the objective function.
- the depth estimation network adopted in the embodiments of the present application is combined with a camera pose estimation network to predict the camera pose vector of the input sample image sequence, where the sample image sequence includes the target frame image and the reference frame image; then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the internal parameters of the corresponding camera; next, the corresponding loss function is calculated according to the target frame image and the reconstructed image; finally, an objective function is constructed based on the loss function and the parameters of the depth estimation network are updated based on the objective function.
- in this way, the potential image information contained in the target frame image and the reference frame image can be fully exploited; that is, enough image information can be obtained from fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data collection.
- after acquiring the sample image sequence, the method may further include:
- the constructing an objective function based on the first image reconstruction loss includes:
- the objective function is constructed based on the bidirectional image reconstruction loss.
- the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
- based on the reference frame image and taking the third coordinate as grid points, an image of the reference frame image after affine transformation is reconstructed through a bilinear sampling mechanism, and the reconstructed image is determined as the first reconstructed image.
- the generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
- based on the target frame image and taking the sixth coordinate as grid points, an image of the target frame image after affine transformation is reconstructed through a bilinear sampling mechanism, and the reconstructed image is determined as the second reconstructed image.
- the method may further include:
- a forward flow occlusion mask is calculated according to the first forward flow coordinate and the second forward flow coordinate, and the forward flow occlusion mask is used to measure the degree of matching between the first forward flow coordinate and the second forward flow coordinate;
- a backward flow occlusion mask is calculated according to the first backward flow coordinate and the second backward flow coordinate, and the backward flow occlusion mask is used to measure the degree of matching between the first backward flow coordinate and the second backward flow coordinate;
- the calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss includes:
- the bidirectional image reconstruction loss is calculated from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
- the occlusion mask is used to determine whether there are occluding objects in consecutive video frames. Adding the occlusion mask to the calculation of the loss of bidirectional image reconstruction can improve the depth estimation accuracy of the depth estimation network for images with occluded objects.
- the method may further include:
- the forward scene structure consistency loss is calculated according to the first scene depth value and the fifth scene depth value, and the forward scene structure consistency loss is used to measure the difference between the scene depth value of the target frame image calculated by multi-view geometric transformation and the scene depth value of the reconstructed target frame image;
- the backward scene structure consistency loss is calculated according to the second scene depth value and the sixth scene depth value, and the backward scene structure consistency loss is used to measure the difference between the scene depth value of the reference frame image calculated by multi-view geometric transformation and the scene depth value of the reconstructed reference frame image;
- the constructing the objective function based on the bidirectional image reconstruction loss may include:
- the objective function is constructed and obtained based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
- the depth estimation network includes an encoding network
- the method may further include:
- the bidirectional feature perception loss is calculated, and the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image.
- the constructing the objective function based on the bidirectional image reconstruction loss includes:
- the objective function is constructed based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
- the weak texture scene in the image to be tested can be effectively processed, thereby improving the accuracy of scene depth estimation.
- the method may further include:
- the smoothing loss is calculated, and the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained by the depth estimation network;
- the constructing of the objective function includes:
- the objective function is constructed based on the bidirectional image reconstruction loss, the bidirectional feature perception loss, and the smoothing loss.
- the gradients of scene depth images and feature images obtained by the depth estimation network can be regularized by introducing a smoothing loss into the objective function.
- a second aspect of the embodiments of the present application provides an apparatus for estimating the depth of an image scene, including:
- an image acquisition module, configured to acquire the image to be tested
- a scene depth estimation module configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested
- a sample acquisition module, configured to acquire a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence;
- a first scene depth prediction module configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image
- a camera pose estimation module configured to input the sample image sequence into a pre-built camera pose estimation network to obtain the predicted camera pose vector between the target frame image and the reference frame image;
- a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence;
- a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image;
- an objective function building module for constructing an objective function based on the first image reconstruction loss
- a network parameter updating module configured to update the parameters of the depth estimation network according to the objective function.
- a third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the method for estimating the depth of an image scene provided by the first aspect of the embodiments of the present application is implemented.
- a fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method for estimating the depth of an image scene provided by the first aspect of the embodiments of the present application is implemented.
- a fifth aspect of the embodiments of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the method for estimating the depth of an image scene described in the first aspect of the embodiments of the present application.
- FIG. 1 is a flowchart of an embodiment of a method for estimating the depth of an image scene provided by an embodiment of the present application
- FIG. 2 is a schematic flowchart of optimizing and updating depth estimation network parameters provided by an embodiment of the present application;
- FIG. 3 is a schematic structural diagram of a depth estimation network provided by an embodiment of the present application.
- FIG. 4 is a schematic structural diagram of a residual module in the network structure shown in FIG. 3;
- FIG. 5 is a schematic structural diagram of a camera pose estimation network according to an embodiment of the present application.
- FIG. 6 is a comparison diagram of the results of the depth prediction of a monocular image performed by the method for estimating the depth of an image scene provided by an embodiment of the present application and various algorithms in the prior art;
- FIG. 7 is a structural diagram of an embodiment of an apparatus for estimating the depth of an image scene provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
- the present application proposes a method, device, terminal device and storage medium for estimating the depth of an image scene, which can reduce the cost of sample data collection. It should be understood that the execution body of each method embodiment of the present application is various types of terminal devices or servers, such as mobile phones, tablet computers, notebook computers, desktop computers, and wearable devices.
- as shown in FIG. 1, a method for estimating the depth of an image scene provided by an embodiment of the present application includes:
- an image to be tested is acquired, and the image to be tested is any image whose depth of the scene needs to be predicted.
- the image to be tested is input into a pre-built depth estimation network to obtain a scene depth image of the image to be tested, thereby obtaining a scene depth estimation result of the image to be tested.
- the depth estimation network may be a neural network with an encoder-decoder architecture, and the present application does not make any limitations on the type and network structure of the neural network used in the depth estimation network.
- FIG. 2 shows a schematic flowchart of optimizing and updating depth estimation network parameters provided by an embodiment of the present application, including the following steps:
- the sample image sequence includes a target frame image and a reference frame image
- the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence
- the training set data can be obtained first, and preprocessing operations such as random flipping, random cropping, and data normalization can be performed on the training set data to convert it into tensor data of specified dimensions.
- the training set data may be composed of a large number of sample image sequences, wherein each sample image sequence includes a target frame image and a reference frame image, and the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence.
- for example, the sample image sequence can be a video clip containing 5 consecutive video frames, denoted I_0, I_1, I_2, I_3, I_4; then I_2 can be used as the target frame image, and I_0, I_1, I_3 and I_4 can be used as the corresponding reference frame images, as in the sketch below.
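- The following is a minimal, illustrative sketch of this sample construction step (not taken from the patent): it splits a 5-frame clip into the target frame and its reference frames and converts them to tensors of a specified dimension; the function name and the normalization used are assumptions.

```python
# Illustrative sketch: split a 5-frame clip into target and reference frames
# and convert them into tensors of a specified dimension (e.g. 3*256*832).
import numpy as np
import torch

def make_training_sample(frames):
    """frames: list of 5 consecutive RGB frames as HxWx3 uint8 arrays."""
    assert len(frames) == 5
    tgt = frames[2]                                       # I_2 is the target frame image
    refs = [frames[0], frames[1], frames[3], frames[4]]   # I_0, I_1, I_3, I_4 are reference frames

    def to_tensor(img):
        # Normalize to [0, 1] and reorder to C*H*W.
        t = torch.from_numpy(img).float() / 255.0
        return t.permute(2, 0, 1)

    return to_tensor(tgt), torch.stack([to_tensor(r) for r in refs])

# Example usage with random data standing in for real video frames:
frames = [np.random.randint(0, 256, (256, 832, 3), dtype=np.uint8) for _ in range(5)]
target, references = make_training_sample(frames)
print(target.shape, references.shape)  # torch.Size([3, 256, 832]) torch.Size([4, 3, 256, 832])
```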
- the target frame image in the sample image sequence is input to the depth estimation network to obtain the predicted first scene depth image, that is, the scene depth image corresponding to the target frame image.
- a schematic structural diagram of the depth estimation network is shown in FIG. 3, which includes an encoder part and a decoder part.
- the encoder part is used to extract abstract features of the input image data by layer-by-layer downsampling. Assuming that the target frame image is preprocessed into tensor data with a dimension of 3*256*832, after the first layer of the encoder applies convolution, normalization and activation function processing, a feature image with a dimension of 64*128*416 is obtained, completing the first downsampling.
- the feature image is processed by maximum pooling and multiple residual modules to obtain a feature image with a dimension of 256*64*208, and the second downsampling process is completed.
- a feature image with a dimension of 2048*8*26 is obtained.
- the decoder part is used to process the feature image obtained by the encoder by layer-by-layer upsampling. Specifically, a convolutional layer with a 3*3 convolution kernel, a nonlinear ELU processing layer and a nearest-neighbor upsampling layer can be used.
- the feature image obtained by the encoder is processed to obtain a feature image with a dimension of 512*16*52.
- the 512*16*52 feature image and the 1024*16*52 feature image obtained by the encoder are spliced in the channel dimension to obtain a feature image with a dimension of 1536*16*52, At this point, the processing of the first upsampling is completed.
- a feature image with a dimension of 32*256*832 is finally obtained.
- x represents the depth image obtained after mapping by the Sigmoid function
- F(x) represents the final scene depth image obtained.
- the value range of each pixel will be mapped between 0 and 1.
- a schematic structural diagram of the residual module in the network structure shown in FIG. 3 is shown in FIG. 4.
- the input is divided into two branches: one branch is processed in turn by the convolution layers, the batch normalization layer (BN) and the ReLU function, and is then superimposed with the other branch to obtain the output data of the residual module, as in the sketch below.
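- A minimal sketch of such a residual module follows; the layer sizes and the use of two convolutions per branch are assumptions for illustration, not taken from FIG. 4.

```python
# Illustrative residual module: one branch applies convolution, BN and ReLU in
# turn and is superimposed with the identity (shortcut) branch.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Superimpose the processed branch with the shortcut branch.
        return self.relu(self.branch(x) + x)

block = ResidualBlock(64)
out = block(torch.randn(1, 64, 128, 416))  # same shape in and out
```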
- the technical means of shortcut connection is adopted; that is, the feature image extracted by the encoder directly skips the intermediate convolution layers and is spliced in the channel dimension with the feature image of the same resolution obtained by the decoder.
- a fixed-size convolution kernel is used to continuously extract image features through a sliding window.
- the shallow network can only extract local features of the image.
- for the decoder part, it directly decodes the feature image finally output by the encoder, and performs multiple upsampling operations on the deep features to obtain deep feature images with different resolutions.
- for example, a feature image with a dimension of 512*16*52 is obtained.
- meanwhile, the feature image with a dimension of 1024*16*52 extracted by the encoder directly skips the corresponding convolutional layers and is fused with the 512*16*52 feature image obtained by the decoder.
- the feature images of each resolution extracted by the encoder are fused with the feature images obtained by the corresponding decoder to realize the fusion of local image features and depth feature information.
- a camera pose estimation network is also pre-built in this embodiment of the present application.
- a schematic structural diagram of this network is shown in FIG. 5, which includes a number of convolutional layers with different parameters. Specifically, it is assumed that the input sample image sequence is I_0, I_1, I_2, I_3, I_4, a total of 5 frames of images.
- these 5 frames of images are preprocessed into tensor data of a specified dimension, which is used as the input of the camera pose estimation network; the camera pose estimation network uses multiple convolutional layers with specified strides to extract image features and perform downsampling, sequentially obtaining the corresponding feature images.
- a 24-dimensional feature vector can then be obtained, and finally the feature vector is reshaped into a camera pose vector of size 6*N_ref, where 6 represents the six degrees of freedom of the camera pose (three rotation components and three translation components) and N_ref represents the number of reference frame images.
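- The following is an illustrative sketch of a camera pose estimation network of this kind: the 5 preprocessed frames are concatenated along the channel dimension, passed through strided convolutions, reduced to a 24-dimensional vector and reshaped to 6*N_ref (N_ref = 4 reference frames). The exact layer configuration is an assumption, not taken from FIG. 5.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    def __init__(self, num_frames=5, n_ref=4):
        super().__init__()
        chans = [16, 32, 64, 128, 256]
        layers, in_c = [], 3 * num_frames
        for c in chans:
            layers += [nn.Conv2d(in_c, c, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            in_c = c
        self.features = nn.Sequential(*layers)
        self.head = nn.Conv2d(in_c, 6 * n_ref, kernel_size=1)  # 6 DoF per reference frame
        self.n_ref = n_ref

    def forward(self, seq):
        # seq: B x (3*num_frames) x H x W, the concatenated preprocessed frames
        x = self.head(self.features(seq))
        x = x.mean(dim=[2, 3])                 # global average -> B x (6*n_ref)
        return x.view(-1, self.n_ref, 6)       # rotation (3) + translation (3) per reference frame

pose_net = PoseNet()
poses = pose_net(torch.randn(1, 15, 256, 832))
print(poses.shape)  # torch.Size([1, 4, 6])
```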
- the generating the first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
- the depth estimation network described above can be used to estimate the first scene depth image corresponding to I_tgt, denoted D_tgt.
- the camera pose estimation network described above estimates the camera pose between the two frames of images, and obtains a first transformation matrix T (consisting of a rotation vector and a translation vector) for converting from the target frame image I_tgt to the reference frame image I_ref.
- the coordinates (first coordinates) of the target frame image I_tgt in the world coordinate system can then be calculated. For example, suppose the image coordinates of a certain pixel in the target frame image I_tgt are given; according to the first scene depth image D_tgt, the depth of this pixel is determined as d_tgt, and the coordinates of the pixel in the world coordinate system can then be calculated from the camera internal parameter matrix and d_tgt.
- based on the first transformation matrix T, the first coordinate is transformed to obtain the second coordinate of the target frame image I_tgt after transformation in the world coordinate system. The transformation is parameterized by a rotation and a translation, where:
- (R_x, R_y, R_z, t) ∈ SE(3) denotes the 3D rotation angles and translation vector, which can be obtained from the first transformation matrix T.
- R_x, R_y and R_z represent the rotation about the x-axis, y-axis and z-axis in the world coordinate system, respectively.
- t represents the translation along the x-axis, y-axis and z-axis.
- SE(3) represents the special Euclidean group.
- T_tgt->ref represents the camera extrinsic parameter matrix composed of the rotation matrix and the translation vector.
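- The coordinate chain described above can be illustrated with the standard multi-view geometry below: image coordinates of I_tgt are back-projected into 3D with the depth D_tgt and the camera intrinsics K (first coordinate), transformed by the rigid transformation T_tgt->ref (second coordinate), and re-projected into the image plane to obtain the sampling grid (third coordinate). This is a sketch of the standard formulation, not code reproduced from the patent.

```python
import torch

def reproject(depth, K, T):
    """depth: B x 1 x H x W, K: B x 3 x 3 intrinsics, T: B x 4 x 4 rigid transform."""
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ones = torch.ones_like(u)
    pix = torch.stack([u, v, ones]).float().view(1, 3, -1).expand(B, -1, -1)  # homogeneous pixels

    cam = torch.inverse(K) @ pix * depth.view(B, 1, -1)        # first coordinate (3D points)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)    # homogeneous 3D points
    cam2 = (T @ cam_h)[:, :3]                                   # second coordinate after T_tgt->ref

    proj = K @ cam2                                             # project back to the image plane
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)             # third coordinate (pixel grid)
    return uv.view(B, 2, H, W)

# Toy example: identity pose, constant depth.
K = torch.tensor([[[100., 0., 416.], [0., 100., 128.], [0., 0., 1.]]])
uv = reproject(torch.ones(1, 1, 256, 832), K, torch.eye(4).unsqueeze(0))
print(uv.shape)  # torch.Size([1, 2, 256, 832])
```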
- the affine-transformed image of the reference frame image I_ref can then be reconstructed through the bilinear sampling mechanism, and the resulting reconstructed image is determined to be the first reconstructed image.
- the principle of the bilinear sampling mechanism may refer to the prior art, which will not be repeated here.
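- As an illustration of this sampling step, the sketch below warps the reference frame image with the third coordinate as the sampling grid, using torch.nn.functional.grid_sample as a stand-in for the bilinear sampling mechanism referred to above.

```python
import torch
import torch.nn.functional as F

def bilinear_warp(img, uv):
    """img: B x C x H x W source image; uv: B x 2 x H x W pixel coordinates."""
    B, _, H, W = img.shape
    # grid_sample expects coordinates normalized to [-1, 1] in (x, y) order.
    grid_x = 2.0 * uv[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)  # B x H x W x 2
    return F.grid_sample(img, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

# e.g. first_reconstructed = bilinear_warp(I_ref, third_coordinate)
```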
- α is a preset weight parameter; for example, it can be 0.85.
- SSIM(*) is the structural similarity measure function shown below:
- ERF(*) is the robustness error metric shown below:
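- The exact SSIM(*) and ERF(*) formulas are not reproduced in this text; the sketch below uses a standard windowed SSIM and an L1 error as stand-ins to illustrate how the first image reconstruction loss combines the two terms with the weight α = 0.85.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Windowed SSIM computed with 3x3 mean pooling.
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def image_reconstruction_loss(target, reconstructed, alpha=0.85):
    ssim_term = (1.0 - ssim(target, reconstructed)).clamp(0, 2) / 2.0
    robust_term = (target - reconstructed).abs()            # stand-in for ERF(*)
    return alpha * ssim_term + (1.0 - alpha) * robust_term  # per-pixel loss map

# loss_fwd = image_reconstruction_loss(I_tgt, first_reconstructed).mean()
```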
- an objective function may be constructed based on the first image reconstruction loss, so as to complete the parameter update of the depth estimation network.
- after acquiring the sample image sequence, the method may further include:
- the second reconstructed image corresponding to the reference frame image is generated according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence, which may include:
- the target frame image is I_tgt
- the reference frame image is I_ref
- the internal parameter matrix of the corresponding camera is K.
- the depth estimation network described above estimates the second scene depth image corresponding to I_ref as D_ref.
- the camera pose between the two frames of images is estimated by the camera pose estimation network described above to obtain a second transformation matrix T_inv for converting from the reference frame image I_ref to the target frame image I_tgt.
- the second transformation matrix T_inv is the inverse of the first transformation matrix T that converts from the target frame image I_tgt to the reference frame image I_ref.
- the coordinates (fourth coordinates) of the reference frame image I_ref in the world coordinate system can be calculated.
- the fourth coordinate is transformed based on the second transformation matrix T_inv to obtain the transformed fifth coordinate of the reference frame image I_ref in the world coordinate system, and the fifth coordinate is then converted into the sixth coordinate in the image coordinate system.
- for the specific coordinate transformation steps, reference may be made to the above-mentioned related content on calculating the first image reconstruction loss.
- the affine-transformed image of the target frame image I_tgt can then be reconstructed through the bilinear sampling mechanism, and the resulting reconstructed image is determined to be the second reconstructed image.
- the following formula can be used to calculate the second image reconstruction loss:
- the first image reconstruction loss can be defined as the forward image reconstruction loss
- the second image reconstruction loss can be defined as the backward image reconstruction loss
- the bidirectional image reconstruction loss can be constructed based on these two image reconstruction losses;
- the objective function can then be constructed based on this bidirectional image reconstruction loss.
- the method may further include:
- This process can be summarized as the check of bidirectional flow consistency, including forward flow consistency check and backward flow consistency check.
- a bilinear sampling mechanism is used to perform an affine transformation on the first backward flow coordinate to synthesize the second forward flow coordinate;
- ideally, the synthesized forward flow coordinate and the calculated forward flow coordinate are equal in magnitude and opposite in direction, which is the forward flow consistency.
- similarly, the bilinear sampling mechanism is used to perform an affine transformation on the first forward flow coordinate to synthesize the second backward flow coordinate;
- ideally, the synthesized backward flow coordinate and the calculated backward flow coordinate are equal in magnitude and opposite in direction, which is the backward flow consistency.
- the forward flow occlusion mask can be calculated according to the first forward flow coordinate and the second forward flow coordinate;
- this mask is used to measure the degree of matching between the first forward flow coordinate and the second forward flow coordinate.
- the backward flow occlusion mask can likewise be calculated according to the first backward flow coordinate and the second backward flow coordinate; this mask is used to measure the degree of matching between the first backward flow coordinate and the second backward flow coordinate.
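- The patent's exact mask formula is not reproduced in this text; the sketch below uses a common forward-backward consistency check as a stand-in. Since the two flow coordinates should be equal in magnitude and opposite in direction, their sum measures the mismatch; the thresholds alpha1/alpha2 are illustrative parameters.

```python
import torch

def flow_occlusion_mask(flow_a, flow_b, alpha1=0.01, alpha2=0.5):
    """flow_a, flow_b: B x 2 x H x W; returns a mask in [0, 1] (1 = well matched)."""
    mismatch = (flow_a + flow_b).pow(2).sum(dim=1, keepdim=True)
    magnitude = flow_a.pow(2).sum(dim=1, keepdim=True) + flow_b.pow(2).sum(dim=1, keepdim=True)
    return (mismatch < alpha1 * magnitude + alpha2).float()

# forward_mask  = flow_occlusion_mask(first_forward_flow,  second_forward_flow)
# backward_mask = flow_occlusion_mask(first_backward_flow, second_backward_flow)
```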
- calculating the bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss may include:
- the bidirectional image reconstruction loss is calculated from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
- the occlusion mask is used to determine whether there are occluding objects in consecutive video frames. Adding the occlusion mask to the calculation of the bidirectional image reconstruction loss can improve the depth estimation accuracy of the depth estimation network for images with occluded objects.
- the method can also include:
- the above steps are used to calculate the loss of bidirectional scene structure consistency.
- according to the second coordinate described above, the corresponding scene depth value can be obtained (first scene depth value); according to the fifth coordinate described above, the corresponding scene depth value can likewise be obtained (second scene depth value).
- in addition, the depth value d_tgt of the pixel point corresponding to the second coordinate in the first scene depth image (the third scene depth value) and the depth value d_ref of the pixel point corresponding to the fifth coordinate in the second scene depth image (the fourth scene depth value) can be obtained.
- the fifth scene depth value of the target frame image can be reconstructed through the bilinear sampling mechanism
- the sixth scene depth value of the reference frame image can be reconstructed through the bilinear sampling mechanism
- the forward scene structure error and the backward scene structure error can be calculated separately, which in turn imposes consistency constraints on the scene structure.
- based on these errors, the positions of moving objects and occluding objects in the image scene can be located: the larger the error value at a position, the more likely it is that there are moving objects or occluding objects at that position.
- the forward scene structure consistency loss is calculated, which is used to measure the difference between the scene depth value of the target frame image calculated by the multi-view geometric transformation and the scene depth value of the reconstructed target frame image.
- similarly, the backward scene structure consistency loss is calculated, which is used to measure the difference between the scene depth value of the reference frame image calculated by the multi-view geometric transformation and the scene depth value of the reconstructed reference frame image.
- N_ref represents the number of valid grid coordinates in the reference frame image.
- N_tgt represents the number of valid grid coordinates in the target frame image.
- the bidirectional scene structure consistency loss can be calculated as follows:
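- The patent's exact formula is not reproduced here; the sketch below uses a common normalized depth-difference form as a stand-in, comparing the depth values obtained by multi-view geometric transformation with the depth values reconstructed by bilinear sampling, in both directions.

```python
import torch

def scene_structure_error(depth_transformed, depth_reconstructed, eps=1e-6):
    """Per-pixel structure error; both inputs are B x 1 x H x W."""
    return (depth_transformed - depth_reconstructed).abs() / (
        depth_transformed + depth_reconstructed + eps)

def bidirectional_structure_loss(d1, d5, d2, d6):
    # d1/d5: target-frame depths (transformed vs. reconstructed);
    # d2/d6: reference-frame depths (transformed vs. reconstructed).
    forward_loss = scene_structure_error(d1, d5).mean()
    backward_loss = scene_structure_error(d2, d6).mean()
    return forward_loss + backward_loss
```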
- the constructing the objective function based on the bidirectional image reconstruction loss may include:
- the objective function is constructed and obtained based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
- the two occlusion masks and the two scene structure errors mentioned above can be introduced at the same time.
- the following formula can be used to calculate:
- the image reconstruction loss function can achieve the purpose of dealing with occlusion and moving objects.
- specifically, the first image reconstruction loss is weighted using the forward flow occlusion mask and the forward scene structure inconsistency weight, and the second image reconstruction loss is weighted using the backward flow occlusion mask and the backward scene structure inconsistency weight;
- the bidirectional image reconstruction loss is then constructed based on the weighted first image reconstruction loss and the weighted second image reconstruction loss, as in the sketch below.
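- The exact weighting formula is not reproduced in this text; the sketch below uses a simple per-pixel masked average, with (1 - structure error) as the inconsistency weight, to illustrate the combination described above.

```python
import torch

def weighted_bidirectional_loss(loss_fwd, loss_bwd,
                                mask_fwd, mask_bwd,
                                struct_err_fwd, struct_err_bwd, eps=1e-6):
    """All inputs are per-pixel maps of shape B x 1 x H x W."""
    w_fwd = mask_fwd * (1.0 - struct_err_fwd)   # down-weight occluded / inconsistent pixels
    w_bwd = mask_bwd * (1.0 - struct_err_bwd)
    fwd = (w_fwd * loss_fwd).sum() / (w_fwd.sum() + eps)
    bwd = (w_bwd * loss_bwd).sum() / (w_bwd.sum() + eps)
    return fwd + bwd
```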
- the depth estimation network includes an encoding network
- the method may further include:
- the bidirectional feature perception loss is calculated, and the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image.
- the above steps are used to calculate the perceptual loss of bidirectional features.
- the features extracted by the encoder have better discrimination in weak texture areas.
- the present application uses the highest-resolution feature map extracted by the encoding network to process weak texture areas; through the encoding network in the depth estimation network, the feature image f_tgt of the target frame image (first feature image) and the feature image f_ref of the reference frame image (second feature image) can be extracted.
- the feature image f_ref can be affinely transformed through a bilinear sampling mechanism to reconstruct the third feature image of the target frame image;
- the feature image f_tgt can be affinely transformed through a bilinear sampling mechanism to reconstruct the fourth feature image of the reference frame image.
- the bidirectional feature perception loss L_feat is used to measure the difference between the feature image of the target frame image obtained by the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained by the encoding network and the reconstructed feature image of the reference frame image.
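- As an illustration, the sketch below compares the encoder features with their warped (reconstructed) counterparts in both directions; an L1 difference is used as a stand-in for the patent's exact formula.

```python
import torch

def bidirectional_feature_loss(f_tgt, f_ref, f_tgt_recon, f_ref_recon):
    """All inputs are B x C x H x W feature images from the encoding network."""
    forward_term = (f_tgt - f_tgt_recon).abs().mean()   # target features vs. reconstruction
    backward_term = (f_ref - f_ref_recon).abs().mean()  # reference features vs. reconstruction
    return forward_term + backward_term
```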
- the constructing the objective function based on the bidirectional image reconstruction loss may include:
- the objective function is constructed based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
- the weak texture scene in the image to be tested can be effectively processed, thereby improving the accuracy of scene depth estimation.
- the method may further include:
- the smoothing loss is calculated, and the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained by the depth estimation network.
- the constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss may include:
- the objective function is constructed based on the bidirectional image reconstruction loss, the bidirectional feature perception loss, and the smoothing loss.
- a smoothing loss L_s can be introduced into the objective function, calculated over the scene depth images and feature images; an illustrative formulation is sketched below.
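- The sketch below is a common edge-aware first-order smoothness term used as a stand-in for L_s: it penalizes gradients of a scene depth image (or feature image) while weakening the penalty at image edges. The patent's exact formula is not reproduced here.

```python
import torch

def edge_aware_smoothness(depth_or_feat, image):
    """depth_or_feat: B x C x H x W; image: B x 3 x H x W (the corresponding frame)."""
    dx = (depth_or_feat[:, :, :, :-1] - depth_or_feat[:, :, :, 1:]).abs()
    dy = (depth_or_feat[:, :, :-1, :] - depth_or_feat[:, :, 1:, :]).abs()
    img_dx = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    img_dy = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    # Down-weight the smoothness penalty where the image itself has strong gradients.
    return (dx * torch.exp(-img_dx)).mean() + (dy * torch.exp(-img_dy)).mean()
```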
- for each reference frame image, the corresponding loss value can be calculated using the same method as described above; finally, the average of the loss values corresponding to these reference frame images can be used as the loss value in the final construction of the objective function.
- the parameters of the depth estimation network can be updated according to the objective function to achieve the purpose of optimizing and training the network.
- the AdamW optimizer can be used to solve the gradient of the objective function with respect to the weights of the depth estimation network, and this gradient is used to update the weights of the depth estimation network; this is iterated until the set maximum number of iterations is reached, completing the training of the depth estimation network.
- this objective function can be used to train the camera pose estimation network described above at the same time.
- similarly, the AdamW optimizer can be used to solve the gradient of the objective function with respect to the weights of the camera pose estimation network, and this gradient is used to update the weights of the camera pose estimation network; this is iterated until the set maximum number of iterations is reached, completing the training of the camera pose estimation network.
- the objective function can be used as a supervision signal to jointly guide the training of the depth estimation network and the camera pose estimation network.
- the AdamW optimizer can be used to solve the gradients of the objective function with respect to the weights of the depth estimation network and of the camera pose estimation network, and these gradients are used to update the weights of both networks simultaneously; this is iterated until the set maximum number of iterations is reached, completing the joint training of the depth estimation network and the camera pose estimation network, as in the training-loop sketch below.
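- The following is a minimal sketch of such a joint optimization loop; the dataloader and compute_objective are placeholders standing in for the sample construction and objective-function steps described in this text.

```python
import itertools
import torch

def train_jointly(depth_net, pose_net, dataloader, compute_objective,
                  lr=1e-4, max_iterations=200000):
    params = itertools.chain(depth_net.parameters(), pose_net.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    iteration = 0
    while iteration < max_iterations:
        for sample in dataloader:          # each sample: a preprocessed image sequence
            objective = compute_objective(depth_net, pose_net, sample)
            optimizer.zero_grad()
            objective.backward()           # gradients w.r.t. both networks' weights
            optimizer.step()               # update both networks simultaneously
            iteration += 1
            if iteration >= max_iterations:
                break
    return depth_net, pose_net
```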
- after training, a monocular image (such as the image to be tested) can be used as the input of the depth estimation network to directly calculate the corresponding scene depth image. It is also possible to use a continuous image sequence (for example, any 5 consecutive monocular frames) as the input of the camera pose estimation network to calculate the corresponding camera pose vector. It should be noted that the depth estimation network and the camera pose estimation network only need to be jointly optimized during training; after training is completed, the network weights are fixed, and only forward propagation (no backpropagation) is required at test time. Therefore, both networks can be used independently during testing.
- in summary, the depth estimation network adopted in the embodiments of the present application is combined with a camera pose estimation network to predict the camera pose vector of the input sample image sequence, where the sample image sequence includes the target frame image and the reference frame image; then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the internal parameters of the corresponding camera; next, the corresponding loss function is calculated according to the target frame image and the reconstructed image; finally, an objective function is constructed based on the loss function and the parameters of the depth estimation network are updated based on the objective function.
- in this way, the potential image information contained in the target frame image and the reference frame image can be fully exploited; that is, enough image information can be obtained from fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data collection.
- the potential information contained in the images can be further mined, the cost of sample data collection can be reduced, and moving objects, occlusions, etc. in video frames can be handled effectively, improving robustness in weakly textured environments.
- the following content illustrates the technical effects of the image scene depth estimation and camera pose estimation proposed in the present application through simulation results.
- the test set of the Eigen split is used as the evaluation data for the depth estimation network,
- and sequences 09-10 of the KITTI Odometry dataset are used as the evaluation data for the camera pose estimation network.
- the evaluation criteria used for the depth estimation network include the absolute relative error (AbsRel), root mean square error (Rmse), squared relative error (SqRel), logarithmic root mean square error (Rmselog) and threshold accuracy (δ_t); the evaluation metric used for the camera pose estimation network is the absolute trajectory error (ATE). After the simulation test, the test results comparing the method proposed in the present application with prior-art algorithms are shown in Tables 1 to 3 below.
- Table 1 is a comparison of the results of scene depth prediction for monocular images in the depth range of 80m.
- the absolute relative error (AbsRel), root mean square error (Rmse), squared relative error (SqRel) and logarithmic root mean square error (Rmselog) represent algorithm error values used to measure the accuracy of the algorithm: the smaller the error value, the higher the accuracy.
- the threshold accuracy (δ_t) represents how close the predicted scene depth is to the true value; the higher this value, the better the algorithm stability. From the test results in Table 1, it can be found that, compared with the prior-art algorithms, the method proposed in this application achieves higher scene depth prediction accuracy and better algorithm stability.
- Table 2 is a comparison of the results of scene depth prediction for monocular images in the depth range of 50m. From the test results in Table 2, it can also be found that, compared with the prior-art algorithms, the method proposed in this application achieves higher scene depth prediction accuracy and better algorithm stability, so it can more robustly predict scene depth and recover more detail from monocular images.
- the absolute trajectory error (ATE) in Table 3 represents the difference between the true value of the camera pose and the predicted camera pose. The smaller the error value, the more accurate the predicted camera pose.
- the simulation results show that, compared with various existing algorithms, the camera pose estimation method proposed in the present application predicts the camera pose more accurately.
- FIG. 6 is a comparison diagram of the results of monocular image depth prediction by the method for estimating the depth of the image scene proposed in the present application and various prior-art algorithms, wherein the Ground Truth depth map is a depth map obtained by visualizing lidar data.
- a method for estimating the depth of an image scene is mainly described above, and an apparatus for estimating the depth of an image scene will be described below.
- an embodiment of an apparatus for estimating the depth of an image scene in an embodiment of the present application includes:
- an image acquisition module 701, configured to acquire the image to be tested;
- a scene depth estimation module 702 configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested;
- a sample acquisition module 703, configured to acquire a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is an image of one or more frames before or after the target frame image in the sample image sequence;
- a first scene depth prediction module 704 configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image
- a camera pose estimation module 705, configured to input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image;
- a first image reconstruction module 706, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence;
- a first image reconstruction loss calculation module 707, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image;
- an objective function construction module 708, configured to construct an objective function based on the first image reconstruction loss
- a network parameter updating module 709 configured to update the parameters of the depth estimation network according to the objective function.
- the apparatus may further include:
- a second scene depth prediction module configured to input the reference frame image into the depth estimation network to obtain a predicted second scene depth image
- a second image reconstruction module configured to generate an image corresponding to the reference frame image according to the depth image of the second scene, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence the second reconstructed image;
- a second image reconstruction loss calculation module, configured to calculate a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure the difference between the reference frame image and the second reconstructed image;
- the objective function building block may include:
- a bidirectional image reconstruction loss calculation unit configured to calculate the bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss
- an objective function construction unit configured to construct the objective function based on the bidirectional image reconstruction loss.
- the first image reconstruction module may include:
- a first transformation matrix determining unit configured to determine a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector
- a first coordinate calculation unit configured to calculate the first coordinate of the target frame image in the world coordinate system according to the camera's internal parameters and the first scene depth image
- a first coordinate transformation unit configured to transform the first coordinate based on the first transformation matrix to obtain the second coordinate of the target frame image after the transformation in the world coordinate system
- a first coordinate conversion unit configured to convert the second coordinate into a third coordinate in the image coordinate system
- a first image reconstruction unit, configured to, based on the reference frame image and taking the third coordinate as grid points, reconstruct the affine-transformed image of the reference frame image through a bilinear sampling mechanism, and determine the reconstructed image as the first reconstructed image;
- the second image reconstruction module may include:
- a second transformation matrix determining unit configured to determine a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector
- a second coordinate calculation unit, configured to calculate the fourth coordinate of the reference frame image in the world coordinate system according to the internal parameters of the camera and the second scene depth image;
- a second coordinate transformation unit configured to transform the fourth coordinate based on the second transformation matrix to obtain the transformed fifth coordinate of the reference frame image in the world coordinate system
- a second coordinate conversion unit configured to convert the fifth coordinate into a sixth coordinate in the image coordinate system
- a second image reconstruction unit, configured to, based on the target frame image and taking the sixth coordinate as grid points, reconstruct the affine-transformed image of the target frame image through a bilinear sampling mechanism, and determine the reconstructed image as the second reconstructed image.
- the apparatus may further include:
- a first coordinate obtaining module used for obtaining the seventh coordinate of the target frame image in the image coordinate system
- a forward flow coordinate determination module configured to perform a difference processing of corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate
- a second coordinate obtaining module configured to obtain the eighth coordinate of the reference frame image in the image coordinate system
- a backward flow coordinate determination module configured to perform a difference processing of corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate
- a forward flow coordinate synthesis module configured to use the third coordinate as a grid point and perform affine transformation on the first backward flow coordinate using a bilinear sampling mechanism to synthesize the second forward flow coordinate;
- a backward flow coordinate synthesis module configured to use the sixth coordinate as a grid point, and use a bilinear sampling mechanism to perform affine transformation on the first forward flow coordinate to synthesize the second backward flow coordinate;
- a forward flow occlusion mask calculation module, configured to calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, where the forward flow occlusion mask is used to measure the degree of matching between the first forward flow coordinate and the second forward flow coordinate;
- a backward flow occlusion mask calculation module, configured to calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, where the backward flow occlusion mask is used to measure the degree of matching between the first backward flow coordinate and the second backward flow coordinate;
- the bidirectional image reconstruction loss calculation unit may be specifically configured to: calculate the bidirectional image according to the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask and the backward flow occlusion mask reconstruction loss.
- the apparatus may further include:
- a first scene depth value determination module configured to determine the first scene depth value of the target frame image according to the second coordinates
- a second scene depth value determining module configured to determine the second scene depth value of the reference frame image according to the fifth coordinate
- a third scene depth value determination module configured to obtain a third scene depth value of the pixel point corresponding to the second coordinate in the first scene depth image
- a fourth scene depth value determination module configured to acquire the fourth scene depth value of the pixel point corresponding to the fifth coordinate in the second scene depth image
- a first scene depth value reconstruction module configured to reconstruct a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
- a second scene depth value reconstruction module configured to reconstruct the sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
- a forward scene structure consistency loss calculation module, configured to calculate the forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, where the forward scene structure consistency loss is used to measure the difference between the scene depth value of the target frame image obtained by the multi-view geometric transformation calculation and the scene depth value of the reconstructed target frame image;
- a backward scene structure consistency loss calculation module, configured to calculate the backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, where the backward scene structure consistency loss is used to measure the difference between the scene depth value of the reference frame image obtained by the multi-view geometric transformation calculation and the scene depth value of the reconstructed reference frame image;
- a bidirectional scene structure consistency loss calculation module configured to calculate the bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss;
- the objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
- the depth estimation network includes an encoding network
- the apparatus may further include:
- a feature image acquisition module configured to obtain the first feature image of the target frame image and the second feature image of the reference frame image through the encoding network
- a first feature image reconstruction module configured to reconstruct a third feature image of the target frame image through a bilinear sampling mechanism based on the third coordinates and the second feature image;
- a second feature image reconstruction module configured to reconstruct a fourth feature image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first feature image;
- a bidirectional feature perception loss calculation module, configured to calculate the bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, where the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained by the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained by the encoding network and the reconstructed feature image of the reference frame image;
- the objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
- the device may also include:
- a smoothing loss calculation module, configured to calculate the smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, where the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained by the depth estimation network;
- the objective function construction unit may be specifically configured to: construct the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
- Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements any of the methods for estimating the depth of an image scene shown in FIG. 1.
- Embodiments of the present application also provide a computer program product, which, when the computer program product runs on a terminal device, enables the terminal device to execute any method for estimating the depth of an image scene as shown in FIG. 1 .
- FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
- the terminal device 8 of this embodiment includes: a processor 80 , a memory 81 , and a computer program 82 stored in the memory 81 and executable on the processor 80 .
- When the processor 80 executes the computer program 82, the steps in each of the above embodiments of the method for estimating the depth of an image scene are implemented, for example, steps 101 to 102 shown in FIG. 1.
- Alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the above apparatus embodiments are implemented, for example, the functions of the modules 701 to 709 shown in FIG. 7.
- the computer program 82 may be divided into one or more modules/units, which are stored in the memory 81 and executed by the processor 80 to complete the present application.
- the one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the terminal device 8 .
- the so-called processor 80 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the memory 81 may be an internal storage unit of the terminal device 8 , such as a hard disk or a memory of the terminal device 8 .
- the memory 81 can also be an external storage device of the terminal device 8, such as a plug-in hard disk equipped on the terminal device 8, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, Flash Card, etc.
- the memory 81 may also include both an internal storage unit of the terminal device 8 and an external storage device.
- the memory 81 is used to store the computer program and other programs and data required by the terminal device.
- the memory 81 can also be used to temporarily store data that has been output or will be output.
- the disclosed apparatus and method may be implemented in other manners.
- the system embodiments described above are only illustrative.
- the division of the modules or units is only a division by logical function; in actual implementation, there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
- the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, or as indirect coupling or communication connection between apparatuses or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software functional units.
- the integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- all or part of the processes in the methods of the above embodiments of the present application may be implemented by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented.
- the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form, and the like.
- the computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like.
- the content contained in the computer-readable medium may be appropriately added or removed in accordance with the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunication signals.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The present application relates to the technical field of image processing. Provided are a method and apparatus for estimating the depth of field of an image, and a terminal device and a storage medium. In the present application, during the optimization and update of parameters of a depth estimation network used, a camera pose estimation network is used to predict a camera pose vector of an input sample image sequence, wherein the sample image sequence comprises a target frame image and a reference frame image; then, according to a depth-of-field image of the target frame image that is predicted by the depth estimation network, the camera pose vector, the reference frame image, and internal parameters of a corresponding camera, a reconstructed image corresponding to the target frame image is generated; next, a corresponding loss function during image reconstruction is calculated according to the target frame image and the reconstructed image; and finally, an objective function is constructed on the basis of the loss function, and parameters of the depth estimation network are updated on the basis of the objective function. In this manner, image information included in a target frame image and a reference frame image can be fully mined, and the cost of sample data collection is reduced.
Description
The present application relates to the technical field of image processing, and in particular, to a method, an apparatus, a terminal device and a storage medium for estimating the depth of an image scene.
Scene depth estimation from images is an important research direction in the fields of robot navigation and autonomous driving. With the development of high-performance computing devices, deep neural networks are commonly used to predict the scene depth of an image. However, to ensure the accuracy of scene depth prediction, a large amount of sample data is required to train such a deep neural network, which leads to a high cost of data collection.
SUMMARY OF THE INVENTION
In view of this, the embodiments of the present application provide a method, an apparatus, a terminal device and a storage medium for estimating the depth of an image scene, which can reduce the cost of sample data collection.
A first aspect of the embodiments of the present application provides a method for estimating the depth of an image scene, including:
obtaining an image to be tested;
inputting the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested;
wherein the parameters of the depth estimation network are updated in the following manner:
obtaining a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is one or more frames of images before or after the target frame image in the sample image sequence;
inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image;
inputting the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image;
generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence;
calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image;
constructing an objective function based on the first image reconstruction loss;
updating the parameters of the depth estimation network according to the objective function.
When optimizing and updating its parameters, the depth estimation network adopted in the embodiments of the present application uses a camera pose estimation network to predict the camera pose vector of an input sample image sequence, where the sample image sequence includes a target frame image and a reference frame image; then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image, and the internal parameters of the corresponding camera; next, the loss function corresponding to the image reconstruction is calculated according to the target frame image and the reconstructed image; finally, an objective function is constructed based on the loss function, and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the potential image information contained in the target frame image and the reference frame image can be fully exploited, that is, sufficient image information can be obtained by sampling fewer sample images to complete the training of the depth estimation network, thereby reducing the cost of sample data collection.
In an embodiment of the present application, after the sample image sequence is obtained, the method may further include:
inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence;
calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure the difference between the reference frame image and the second reconstructed image;
the constructing an objective function based on the first image reconstruction loss includes:
calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss;
constructing the objective function based on the bidirectional image reconstruction loss.
By adding the bidirectional image reconstruction loss to the objective function of the depth estimation network, the potential information in the image data can be fully exploited, further improving the robustness of the depth estimation algorithm.
Further, the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
determining, according to the camera pose vector, a first transformation matrix for converting the target frame image to the reference frame image;
calculating the first coordinates of the target frame image in the world coordinate system according to the internal parameters of the camera and the first scene depth image;
transforming the first coordinates based on the first transformation matrix to obtain the second coordinates of the target frame image in the world coordinate system after the transformation;
converting the second coordinates into third coordinates in the image coordinate system;
based on the reference frame image, using the third coordinates as grid points, reconstructing an affine-transformed image of the reference frame image through a bilinear sampling mechanism, and determining the reconstructed image as the first reconstructed image;
the generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence includes:
determining, according to the camera pose vector, a second transformation matrix for converting the reference frame image to the target frame image;
calculating the fourth coordinates of the reference frame image in the world coordinate system according to the internal parameters of the camera and the second scene depth image;
transforming the fourth coordinates based on the second transformation matrix to obtain the fifth coordinates of the reference frame image in the world coordinate system after the transformation;
converting the fifth coordinates into sixth coordinates in the image coordinate system;
based on the target frame image, using the sixth coordinates as grid points, reconstructing an affine-transformed image of the target frame image through a bilinear sampling mechanism, and determining the reconstructed image as the second reconstructed image.
In an embodiment of the present application, the method may further include:
obtaining the seventh coordinates of the target frame image in the image coordinate system;
performing element-wise subtraction on the third coordinates and the seventh coordinates to obtain first forward flow coordinates;
obtaining the eighth coordinates of the reference frame image in the image coordinate system;
performing element-wise subtraction on the sixth coordinates and the eighth coordinates to obtain first backward flow coordinates;
using the third coordinates as grid points, performing an affine transformation on the first backward flow coordinates through a bilinear sampling mechanism to synthesize second forward flow coordinates;
using the sixth coordinates as grid points, performing an affine transformation on the first forward flow coordinates through a bilinear sampling mechanism to synthesize second backward flow coordinates;
calculating a forward flow occlusion mask according to the first forward flow coordinates and the second forward flow coordinates, where the forward flow occlusion mask is used to measure the degree of matching between the first forward flow coordinates and the second forward flow coordinates;
calculating a backward flow occlusion mask according to the first backward flow coordinates and the second backward flow coordinates, where the backward flow occlusion mask is used to measure the degree of matching between the first backward flow coordinates and the second backward flow coordinates;
the calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss includes:
calculating the bidirectional image reconstruction loss according to the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask and the backward flow occlusion mask.
The occlusion masks are used to determine whether there are occluding objects in consecutive video frames. Adding the occlusion masks to the calculation of the bidirectional image reconstruction loss can improve the accuracy of the depth estimation network when estimating depth for images containing occluded objects.
In an embodiment of the present application, the method may further include:
determining a first scene depth value of the target frame image according to the second coordinates;
determining a second scene depth value of the reference frame image according to the fifth coordinates;
obtaining a third scene depth value of the pixels in the first scene depth image corresponding to the second coordinates;
obtaining a fourth scene depth value of the pixels in the second scene depth image corresponding to the fifth coordinates;
reconstructing a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinates and the fourth scene depth value;
reconstructing a sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinates and the third scene depth value;
calculating a forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, where the forward scene structure consistency loss is used to measure the difference between the scene depth value of the target frame image obtained through the multi-view geometric transformation calculation and the reconstructed scene depth value of the target frame image;
calculating a backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, where the backward scene structure consistency loss is used to measure the difference between the scene depth value of the reference frame image obtained through the multi-view geometric transformation calculation and the reconstructed scene depth value of the reference frame image;
calculating a bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss;
the constructing the objective function based on the bidirectional image reconstruction loss may include:
constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
When constructing the objective function, adding the bidirectional scene structure consistency loss makes it possible to handle occluded objects and moving objects in the scene of the image to be tested effectively, thereby improving the accuracy of scene depth estimation.
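As a rough, assumed illustration of a forward scene structure consistency term (the exact formula of this application is not reproduced here), the following sketch compares the depth values obtained through the geometric transformation with the depth values reconstructed by bilinear sampling, using a normalized absolute difference that is common in self-supervised depth estimation:

```python
import torch

def scene_structure_consistency(depth_transformed, depth_resampled, eps=1e-7):
    # depth_transformed: scene depth values of the target frame obtained through the
    #                    multi-view geometric transformation calculation
    # depth_resampled:   scene depth values of the target frame reconstructed through
    #                    the bilinear sampling mechanism
    diff = (depth_transformed - depth_resampled).abs()
    return (diff / (depth_transformed + depth_resampled + eps)).mean()
```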
In an embodiment of the present application, the depth estimation network includes an encoding network, and the method may further include:
obtaining a first feature image of the target frame image and a second feature image of the reference frame image through the encoding network;
reconstructing a third feature image of the target frame image through a bilinear sampling mechanism based on the third coordinates and the second feature image;
reconstructing a fourth feature image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinates and the first feature image;
calculating a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, where the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image;
the constructing the objective function based on the bidirectional image reconstruction loss includes:
constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
By introducing the bidirectional feature perception loss into the objective function, weakly textured scenes in the image to be tested can be handled effectively, thereby improving the accuracy of scene depth estimation.
Further, after obtaining the first feature image of the target frame image and the second feature image of the reference frame image through the encoding network, the method may further include:
calculating a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, where the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained through the depth estimation network;
the constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss includes:
constructing the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
By introducing the smoothing loss into the objective function, the gradients of the scene depth images and feature images obtained through the depth estimation network can be regularized.
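The exact form of the smoothing loss is not given above; as an assumed illustration, an edge-aware gradient penalty of the following kind is commonly used to regularize the gradients of the predicted scene depth images (and, analogously, the feature images):

```python
import torch

def edge_aware_smoothness(pred, image):
    # pred:  predicted scene depth image (or feature image), shape (B, C, H, W)
    # image: corresponding input frame, shape (B, 3, H, W)
    dp_x = (pred[:, :, :, :-1] - pred[:, :, :, 1:]).abs()
    dp_y = (pred[:, :, :-1, :] - pred[:, :, 1:, :]).abs()
    di_x = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    di_y = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    # Down-weight gradients at image edges so that depth discontinuities are preserved.
    return (dp_x * torch.exp(-di_x)).mean() + (dp_y * torch.exp(-di_y)).mean()
```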
A second aspect of the embodiments of the present application provides an apparatus for estimating the depth of an image scene, including:
an image acquisition module, configured to acquire an image to be tested;
a scene depth estimation module, configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested;
a sample acquisition module, configured to acquire a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is one or more frames of images before or after the target frame image in the sample image sequence;
a first scene depth prediction module, configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image;
a camera pose estimation module, configured to input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image;
a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence;
a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image;
an objective function construction module, configured to construct an objective function based on the first image reconstruction loss;
a network parameter updating module, configured to update the parameters of the depth estimation network according to the objective function.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for estimating the depth of an image scene provided by the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for estimating the depth of an image scene provided by the first aspect of the embodiments of the present application.
A fifth aspect of the embodiments of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the method for estimating the depth of an image scene described in the first aspect of the embodiments of the present application.
It can be understood that, for the beneficial effects of the second to fifth aspects, reference may be made to the relevant description in the first aspect, which will not be repeated here.
In order to describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments or the prior art. Obviously, the drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of an embodiment of a method for estimating the depth of an image scene provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of optimizing and updating the parameters of a depth estimation network provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a depth estimation network provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a residual module in the network structure shown in FIG. 3;
FIG. 5 is a schematic structural diagram of a camera pose estimation network provided by an embodiment of the present application;
FIG. 6 is a comparison diagram of monocular image depth prediction results obtained by the method for estimating the depth of an image scene provided by an embodiment of the present application and by various prior-art algorithms;
FIG. 7 is a structural diagram of an embodiment of an apparatus for estimating the depth of an image scene provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits and methods are omitted so that unnecessary details do not obscure the description of the present application. In addition, in the description of the specification of the present application and the appended claims, the terms "first", "second", "third", etc. are only used to distinguish the description, and should not be construed as indicating or implying relative importance.
The present application proposes a method, an apparatus, a terminal device and a storage medium for estimating the depth of an image scene, which can reduce the cost of sample data collection. It should be understood that the execution body of each method embodiment of the present application is any of various types of terminal devices or servers, such as mobile phones, tablet computers, notebook computers, desktop computers and wearable devices.
Referring to FIG. 1, a method for estimating the depth of an image scene provided by an embodiment of the present application is shown, including:
101. Obtain an image to be tested.
First, an image to be tested is obtained; the image to be tested is any image whose scene depth needs to be predicted.
102. Input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested.
After the image to be tested is obtained, it is input into a pre-built depth estimation network to obtain a scene depth image of the image to be tested, thereby obtaining a scene depth estimation result for the image to be tested. Specifically, the depth estimation network may be a neural network with an encoder-decoder architecture; the present application does not impose any limitation on the type or network structure of the neural network used as the depth estimation network.
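In practice, this step is a single forward pass through the pre-built depth estimation network; a minimal usage sketch (the DepthNet class and checkpoint file name below are hypothetical) might look as follows:

```python
import torch
from depth_model import DepthNet  # hypothetical module providing the pre-built network

net = DepthNet()
net.load_state_dict(torch.load("depth_net.pth", map_location="cpu"))
net.eval()

image = torch.rand(1, 3, 256, 832)   # image to be tested, preprocessed into a tensor
with torch.no_grad():
    scene_depth = net(image)         # scene depth image of the image to be tested
```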
Referring to FIG. 2, a schematic flowchart of optimizing and updating the parameters of the depth estimation network provided by an embodiment of the present application is shown, including the following steps:
2.1. Obtain a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is one or more frames of images before or after the target frame image in the sample image sequence.
To train and optimize the depth estimation network, training set data needs to be obtained first, and certain preprocessing operations may be performed on the training set data. For example, the autonomous driving dataset KITTI may be obtained as the training set data, and preprocessing operations such as random flipping, random cropping and data normalization may be performed on it to convert the training set data into tensor data of specified dimensions as the input of the depth estimation network. In the embodiments of the present application, the training set data may consist of a large number of sample image sequences, where each sample image sequence includes a target frame image and a reference frame image, and the reference frame image is one or more frames of images before or after the target frame image in the sample image sequence. For example, a sample image sequence may be a video clip containing 5 consecutive video frames, denoted I_0, I_1, I_2, I_3, I_4; then I_2 may be used as the target frame image, and I_0, I_1, I_3 and I_4 may be used as the corresponding reference frame images.
2.2. Input the target frame image into the depth estimation network to obtain a predicted first scene depth image.
The target frame image in the sample image sequence is input into the depth estimation network to obtain the predicted first scene depth image, that is, the scene depth image corresponding to the target frame image.
In an embodiment of the present application, a schematic structural diagram of the depth estimation network is shown in FIG. 3, which includes an encoder part and a decoder part. The encoder part is used to extract abstract features of the input image data by layer-by-layer downsampling. Assuming that the target frame image is preprocessed into tensor data with dimensions 3*256*832, a feature image with dimensions 64*128*416 is obtained after the convolution, normalization and activation function of the first encoder layer, completing the first downsampling. The feature image is then processed by max pooling and multiple residual modules to obtain a feature image with dimensions 256*64*208, completing the second downsampling. By analogy, after multiple downsampling operations, a feature image with dimensions 2048*8*26 is obtained. The decoder part is used to process the feature images obtained by the encoder by layer-by-layer upsampling. Specifically, a convolutional layer with a 3*3 kernel, a nonlinear ELU layer and a nearest-neighbor upsampling layer may be used to process the feature image obtained by the encoder to obtain a feature image with dimensions 512*16*52. Then, as shown in FIG. 3, this 512*16*52 feature image is concatenated with the 1024*16*52 feature image obtained by the encoder along the channel dimension to obtain a feature image with dimensions 1536*16*52, completing the first upsampling. By analogy, after multiple upsampling operations, a feature image with dimensions 32*256*832 is finally obtained. Next, the 32*256*832 feature image is sequentially processed by a convolutional layer with a 3*3 kernel, a Sigmoid function, and F(x) = 1/(10*x + 0.01) to obtain the final scene depth image, where x denotes the depth image obtained after the Sigmoid mapping and F(x) denotes the final scene depth image. Specifically, after the feature image is transformed by the Sigmoid function, the value of each pixel is mapped into the range from 0 to 1. Assuming that the actual scene depth ranges from 0.1 m to 100 m, the function F(x) = 1/(10*x + 0.01) establishes a mapping between the pixels of the estimated depth image and the actual scene depth; for example, x = 0 corresponds to 100 m in the real scene. Therefore, processing by the function F(x) = 1/(10*x + 0.01) constrains the estimated depth image to a reasonable range between 0.1 m and 100 m.
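The final depth head described above (a 3*3 convolution, a Sigmoid, and the mapping F(x) = 1/(10*x + 0.01)) can be sketched as follows; the 32-channel input is taken from the text, while the rest is an assumed minimal implementation rather than the exact network of FIG. 3:

```python
import torch
import torch.nn as nn

class DepthHead(nn.Module):
    # Maps a 32-channel decoder feature image to a scene depth image in [0.1 m, 100 m].
    def __init__(self, in_channels=32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, feat):
        x = torch.sigmoid(self.conv(feat))   # each pixel mapped into (0, 1)
        return 1.0 / (10.0 * x + 0.01)       # F(x): x = 0 -> 100 m, x = 1 -> about 0.1 m
```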
A schematic structural diagram of the residual module in the network structure shown in FIG. 3 is shown in FIG. 4. The input is split into two branches; one branch passes through the convolutional layers, the batch normalization layer BN and the ReLU function in sequence, and is then superimposed with the other branch to obtain the output data of the residual module.
In addition, the network structure shown in FIG. 3 uses shortcut connections, that is, the feature images extracted by the encoder skip the convolutional layers and are concatenated, along the channel dimension, with the feature images of the same resolution obtained by the decoder. When the encoder extracts features from the input image, a fixed-size convolution kernel continuously extracts image features in a sliding-window manner; however, due to the limited kernel size and the local nature of convolutional feature extraction, the shallow layers can only extract local features of the image. As the number of convolutional layers increases, the resolution of the extracted feature images keeps decreasing while the number of feature images keeps increasing, so that more abstract deep features with larger receptive fields can be extracted. The decoder directly decodes the feature image finally output by the encoder and upsamples these deep features multiple times to obtain deep feature images of different resolutions; for example, after the first upsampling a feature image with dimensions 512*16*52 is obtained, and at this point the 1024*16*52 feature image extracted by the encoder skips the corresponding convolutional layers and is fused, along the channel dimension, with the 512*16*52 feature image obtained by the decoder. As shown in FIG. 3, the feature image of each resolution extracted by the encoder is fused with the feature image obtained by the corresponding decoder layer, realizing the fusion of local image features and deep feature information.
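A residual module of the kind described for FIG. 4 (one branch passing through convolution, batch normalization BN and ReLU before being added to the other branch) can be sketched as follows; the channel sizes and layer counts are assumed for illustration:

```python
import torch.nn as nn

class ResidualModule(nn.Module):
    # One branch: conv -> BN -> ReLU -> conv -> BN; the other branch is the identity.
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + x)  # superimpose the two branches
```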
2.3. Input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image.
In order to obtain the camera pose vector between the target frame image and the reference frame image, an embodiment of the present application also pre-builds a camera pose estimation network, whose structure may be as shown in FIG. 5 and which contains multiple convolutional layers with different parameters. Specifically, assuming that the input sample image sequence consists of 5 frames I_0, I_1, I_2, I_3, I_4, these 5 frames are first preprocessed into tensor data of specified dimensions as the input of the camera pose estimation network; the camera pose estimation network uses multiple convolutional layers with specified strides to extract image features and perform downsampling, obtaining the corresponding feature images in sequence. For example, in FIG. 5, after the input tensor data passes through 8 convolutional layers, a 24-dimensional feature vector can be obtained, and finally this feature vector is reshaped into a 6*N_ref camera pose vector, where 6 indicates that each camera pose vector is a 6-dimensional vector composed of 3 translation components and 3 rotation components, and N_ref = 4 indicates that the number of reference frame images in the input sample image sequence is 4.
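The reshaping of the pose feature vector into a 6*N_ref camera pose vector (3 translation components and 3 rotation components per reference frame, with N_ref = 4) can be sketched as follows; the convolutional layers of FIG. 5 are omitted:

```python
import torch

def reshape_pose_vector(pose_features, n_ref=4):
    # pose_features: (B, 6 * n_ref) vector output by the pose network, e.g. the
    #                24-dimensional feature vector mentioned for FIG. 5.
    poses = pose_features.view(-1, n_ref, 6)
    translation = poses[..., :3]   # 3 translation components per reference frame
    rotation = poses[..., 3:]      # 3 rotation components per reference frame
    return translation, rotation
```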
2.4. Generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence.
After the estimated first scene depth image and camera pose vector are obtained, image reconstruction needs to be performed based on these data to obtain the first reconstructed image corresponding to the target frame image, so that the image reconstruction loss can be calculated subsequently.
Specifically, the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence may include:
(1) determining, according to the camera pose vector, a first transformation matrix for converting the target frame image to the reference frame image;
(2) calculating the first coordinates of the target frame image in the world coordinate system according to the internal parameters of the camera and the first scene depth image;
(3) transforming the first coordinates based on the first transformation matrix to obtain the second coordinates of the target frame image in the world coordinate system after the transformation;
(4) converting the second coordinates into third coordinates in the image coordinate system;
(5) based on the reference frame image, using the third coordinates as grid points, reconstructing an affine-transformed image of the reference frame image through a bilinear sampling mechanism, and determining the reconstructed image as the first reconstructed image.
Assume that the target frame image is I_tgt, the reference frame image is I_ref, and the intrinsic parameter matrix of the corresponding camera is K. The depth estimation network described above can then estimate the first scene depth image D_tgt corresponding to I_tgt, and the camera pose estimation network described above estimates the camera pose between the two frames, yielding the first transformation matrix T (composed of a rotation vector and a translation vector) for converting from the target frame image I_tgt to the reference frame image I_ref. Then, according to the camera intrinsic matrix K, the first scene depth image D_tgt and the target frame image I_tgt, the coordinates of the target frame image I_tgt in the world coordinate system (the first coordinates) can be calculated. For example, assume that the image coordinates of a certain pixel in the target frame image I_tgt are (u_tgt, v_tgt), and that the depth of this pixel determined from the first scene depth image D_tgt is d_tgt; then the coordinates of this pixel in the world coordinate system can be calculated by the following formulas:
X = (u_tgt - c_x) * Z / f
Y = (v_tgt - c_y) * Z / f
Z = d_tgt
where (X, Y, Z) denotes the coordinates of the pixel in the world coordinate system, (c_x, c_y, f) are parameters of the camera intrinsic matrix, c_x and c_y denote the principal point offsets, and f denotes the focal length.
Then, the first coordinates are transformed based on the first transformation matrix T to obtain the second coordinates of the target frame image I_tgt in the world coordinate system after the transformation, where (R_x, R_y, R_z, t) ∈ SE3 are the 3D rotation angles and the translation vector, which can be obtained from the first transformation matrix T; R_x, R_y and R_z denote the rotation amounts about the x-axis, y-axis and z-axis of the world coordinate system respectively, t denotes the translation along the x-axis, y-axis and z-axis, and SE3 denotes the special Euclidean group.
Next, the second coordinates are converted into the third coordinates in the image coordinate system by projecting them with the camera parameters, where T_tgt->ref denotes the camera extrinsic matrix composed of the rotation matrix and the translation matrix.
After the third coordinates are obtained, based on the reference frame image I_ref and using the third coordinates as grid points, the affine-transformed image of the reference frame image I_ref can be reconstructed through the bilinear sampling mechanism, and the reconstructed image is determined as the first reconstructed image. For the principle of the bilinear sampling mechanism, reference may be made to the prior art, and details are not repeated here.
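A compact sketch of the reconstruction just described, which back-projects the target pixels with D_tgt and K, transforms them with T, projects them into the reference view and bilinearly samples I_ref at the resulting grid, is given below. It uses the standard pinhole-camera equations and torch.nn.functional.grid_sample as the bilinear sampling mechanism, and is an illustrative implementation under those assumptions rather than the verbatim one of this application:

```python
import torch
import torch.nn.functional as F

def reconstruct_target(I_ref, D_tgt, K, T):
    # I_ref: (B, 3, H, W) reference frame image
    # D_tgt: (B, 1, H, W) first scene depth image of the target frame
    # K:     (B, 3, 3)    camera intrinsic matrix
    # T:     (B, 4, 4)    first transformation matrix (target frame -> reference frame)
    B, _, H, W = D_tgt.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # First coordinates: back-project the pixels into the world coordinate system.
    cam = torch.inverse(K) @ pix * D_tgt.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)        # homogeneous coordinates

    # Second coordinates: apply the rotation and translation of T.
    cam_ref = (T @ cam_h)[:, :3, :]

    # Third coordinates: project into the image coordinate system of the reference view.
    proj = K @ cam_ref
    uv = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)

    # Use the third coordinates as grid points for bilinear sampling of I_ref.
    grid = torch.stack([2.0 * uv[:, 0] / (W - 1) - 1.0,
                        2.0 * uv[:, 1] / (H - 1) - 1.0], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_ref, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```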
2.5. Calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, where the first image reconstruction loss is used to measure the difference between the target frame image and the first reconstructed image.
Specifically, the first image reconstruction loss can be calculated from the target frame image and the first reconstructed image by combining a structural similarity term and a robust error term, where α is a preset weight parameter, which may be 0.85 for example; SSIM(*) is the structural similarity measure function, in which μ and δ are the pixel mean and variance respectively, c_1 = 0.01^2 and c_2 = 0.03^2; and ERF(*) is the robust error metric, in which ε = 0.01.
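A sketch of such a reconstruction loss is shown below. The SSIM statistics are computed with simple 3*3 average pooling, and since the exact ERF(*) expression is not reproduced above, a Charbonnier-style penalty with ε = 0.01 is assumed for illustration; the weighting by α = 0.85 follows the text:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Structural similarity with 3x3 average pooling for the local means/variances.
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def image_reconstruction_loss(target, reconstructed, alpha=0.85, eps=0.01):
    ssim_term = (1.0 - ssim(target, reconstructed)).clamp(0, 2) / 2.0
    robust_term = torch.sqrt((target - reconstructed) ** 2 + eps ** 2)  # assumed ERF form
    return (alpha * ssim_term + (1.0 - alpha) * robust_term).mean()
```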
2.6. Construct an objective function based on the first image reconstruction loss.
After the first image reconstruction loss is obtained, an objective function can be constructed based on the first image reconstruction loss, so as to complete the parameter update of the depth estimation network.
In an embodiment of the present application, after the sample image sequence is obtained, the process may further include:
(1) inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
(2) generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence;
(3) calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure the difference between the reference frame image and the second reconstructed image.
具体的,所述根据所述第二场景深度图像、所述相机姿态向量、所述目标帧图像以及拍摄所述样本图像序列采用的相机的内参,生成与所述参考帧图像对应的第二重建图像,可以包括:Specifically, the second reconstruction corresponding to the reference frame image is generated according to the second scene depth image, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence images, which can include:
(2.1)根据所述相机姿态向量确定所述参考帧图像转换到所述目标帧图像的第二变换矩阵;(2.1) Determine a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
(2.2)根据所述相机的内参和所述第二场景深度图像,计算所述参考帧图像在世界坐标系下的第四坐标;(2.2) According to the internal reference of the camera and the depth image of the second scene, calculate the fourth coordinate of the reference frame image in the world coordinate system;
(2.3)基于所述第二变换矩阵对所述第四坐标进行变换,得到所述参考帧 图像经转换后在世界坐标系下的第五坐标;(2.3) the 4th coordinate is transformed based on the described second transformation matrix to obtain the 5th coordinate of the reference frame image under the world coordinate system after conversion;
(2.4)将所述第五坐标转换为在图像坐标系下的第六坐标;(2.4) converting the fifth coordinate into the sixth coordinate under the image coordinate system;
(2.5)基于所述目标帧图像,以所述第六坐标作为网格点,通过双线性采样机制重建出所述目标帧图像经仿射变换后的图像,并将重建得到的图像确定为所述第二重建图像。(2.5) Based on the target frame image, using the sixth coordinate as a grid point, reconstruct the affine transformed image of the target frame image through the bilinear sampling mechanism, and determine the reconstructed image as the second reconstructed image.
与上述计算第一图像重建损失的方法类似,在计算第二图像重建损失时,假设目标帧图像为I
tgt,参考帧图像为I
ref,对应相机的内参矩阵为K,则可以通过上文所述的深度估计网络估计出I
ref对应的第二场景深度图像为D
ref,由上文所述的相机姿态估计网络估计出两帧图像之间的相机姿态,得到从参考帧图像I
ref转换到目标帧图像I
tgt的第二变换矩阵T
inv,该第二变换矩阵是从目标帧图像I
tgt转换到参考帧图像I
ref的第一变换矩阵T的逆变换矩阵。然后,根据该相机内参矩阵K、第二场景深度图像D
ref和该参考帧图像I
ref,可以计算出参考帧图像I
ref在世界坐标系下的坐标(第四坐标)。然后,基于该第二变换矩阵T
inv对该第四坐标进行转换,获得该参考帧图像I
ref经转换后在世界坐标系下的第五坐标,接着计算出此第五坐标在图像坐标系的第六坐标,具体的坐标变换步骤可以参照前文所述的计算第一图像重建损失的相关内容。最后,可以基于目标帧图像I
tgt,以该第六坐标为网格点,通过双线性采样机制重建出目标帧图像I
tgt经仿射变换后的图像
并将重建得到的图像
确定为该第二重建图像。计算第二图像重建损失可以采用以下公式:
Similar to the above method for calculating the first image reconstruction loss, when calculating the second image reconstruction loss, it is assumed that the target frame image is I tgt , the reference frame image is I ref , and the internal parameter matrix of the corresponding camera is K. The depth estimation network described above estimates that the depth image of the second scene corresponding to I ref is D ref , and the camera pose between the two frames of images is estimated by the camera pose estimation network described above to obtain the conversion from the reference frame image I ref to D ref . The second transformation matrix T inv of the target frame image It tgt is an inverse transformation matrix of the first transformation matrix T transformed from the target frame image It tgt to the reference frame image I ref . Then, according to the camera internal parameter matrix K, the second scene depth image D ref and the reference frame image I ref , the coordinates (fourth coordinates) of the reference frame image I ref in the world coordinate system can be calculated. Then, the fourth coordinate is transformed based on the second transformation matrix T inv to obtain the transformed fifth coordinate of the reference frame image I ref in the world coordinate system, and then the fifth coordinate in the image coordinate system is calculated. For the sixth coordinate, the specific coordinate transformation step may refer to the above-mentioned related content of calculating the reconstruction loss of the first image. Finally, based on the target frame image It tgt , and taking the sixth coordinate as the grid point, the affine transformed image of the target frame image It tgt can be reconstructed through the bilinear sampling mechanism and reconstruct the resulting image Determined to be the second reconstructed image. The following formula can be used to calculate the second image reconstruction loss:
关于上述公式中各个参数的定义,可以参照前文所述的计算第一图像重建损失的公式中的说明。For the definition of each parameter in the above formula, reference may be made to the description in the formula for calculating the reconstruction loss of the first image described above.
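As an illustration of the reconstruction procedure described in steps (2.1)-(2.5) above, the following PyTorch-style sketch shows one way the second reconstructed image could be generated. It is not the patent's code: the tensor shapes, the 4×4 pose matrix T_inv and the function name are assumptions made for the example.

```python
# Minimal sketch (assumed shapes: K is a 3x3 float tensor, depth_ref is (B,1,H,W),
# T_inv is a 4x4 reference->target pose matrix, img_tgt is (B,3,H,W)).
import torch
import torch.nn.functional as F

def reconstruct_reference(img_tgt, depth_ref, T_inv, K):
    B, _, H, W = depth_ref.shape
    # Pixel grid of the reference frame in homogeneous image coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(1, 3, -1)
    # (2.2) Back-project reference pixels to 3-D points (the "fourth coordinates").
    cam_pts = torch.inverse(K) @ pix * depth_ref.view(B, 1, -1)
    cam_pts = torch.cat([cam_pts, torch.ones(B, 1, H * W)], dim=1)  # homogeneous
    # (2.3) Transform with T_inv to obtain the "fifth coordinates".
    cam_pts_tgt = (T_inv @ cam_pts)[:, :3]
    # (2.4) Project into the target image plane to obtain the "sixth coordinates".
    proj = K @ cam_pts_tgt
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)
    grid = torch.stack([2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1], dim=-1)
    grid = grid.view(B, H, W, 2)  # normalised to [-1, 1] for grid_sample
    # (2.5) Bilinear sampling of the target frame at the projected grid points.
    return F.grid_sample(img_tgt, grid, mode="bilinear", align_corners=True)
```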
第一图像重建损失可以定义为前向图像重建损失，第二图像重建损失可以定义为后向图像重建损失，那么可以基于这两个图像重建损失构建得到双向图像重建损失，具体的计算公式可以如下：The first image reconstruction loss can be defined as the forward image reconstruction loss, and the second image reconstruction loss can be defined as the backward image reconstruction loss; the bidirectional image reconstruction loss can then be constructed based on these two image reconstruction losses, and the specific calculation formula can be as follows:
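Assuming the two directional losses are simply combined (one plausible form; the patent's exact weighting may differ, and a mask-weighted version is discussed further below):

$$L_{photo}=L_{photo}^{fwd}+L_{photo}^{bwd}$$

where $L_{photo}^{fwd}$ is the first (forward) image reconstruction loss and $L_{photo}^{bwd}$ the second (backward) one.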
之后，可以基于该双向图像重建损失构建目标函数。通过在深度估计网络的目标函数中添加双向图像重建损失，能够充分挖掘图像数据中的潜在信息，进一步提升深度估计算法的鲁棒性。Afterwards, the objective function can be constructed based on this bidirectional image reconstruction loss. By adding the bidirectional image reconstruction loss to the objective function of the depth estimation network, the latent information in the image data can be fully exploited, further improving the robustness of the depth estimation algorithm.
在本申请的一个实施例中,所述方法还可以包括:In an embodiment of the present application, the method may further include:
(1)获取所述目标帧图像在图像坐标系下的第七坐标;(1) obtaining the seventh coordinate of the target frame image under the image coordinate system;
(2)对所述第三坐标和所述第七坐标执行对应元素作差的处理,得到第一前向流坐标;(2) performing the process of making a difference between the corresponding elements on the third coordinate and the seventh coordinate to obtain the first forward flow coordinate;
(3)获取所述参考帧图像在图像坐标系下的第八坐标;(3) obtaining the eighth coordinate of the reference frame image under the image coordinate system;
(4)对所述第六坐标和所述第八坐标执行对应元素作差的处理,得到第一后向流坐标;(4) the processing of the difference of the corresponding elements is performed on the sixth coordinate and the eighth coordinate to obtain the first backward flow coordinate;
(5)以所述第三坐标作为网格点,采用双线性采样机制对所述第一后向流坐标进行仿射变换,以合成第二前向流坐标;(5) using the third coordinate as a grid point, adopting a bilinear sampling mechanism to perform affine transformation on the first backward flow coordinate to synthesize the second forward flow coordinate;
(6)以所述第六坐标作为网格点,采用双线性采样机制对所述第一前向流坐标进行仿射变换,以合成第二后向流坐标;(6) using the sixth coordinate as a grid point, using a bilinear sampling mechanism to perform affine transformation on the first forward flow coordinate to synthesize the second backward flow coordinate;
(7)根据所述第一前向流坐标和所述第二前向流坐标计算前向流遮挡掩码,所述前向流遮挡掩码用于衡量所述第一前向流坐标和所述第二前向流坐标之间的匹配程度;(7) Calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, and the forward flow occlusion mask is used to measure the first forward flow coordinate and all the the matching degree between the second forward flow coordinates;
(8)根据所述第一后向流坐标和所述第二后向流坐标计算后向流遮挡掩码,所述后向流遮挡掩码用于衡量所述第一后向流坐标和所述第二后向流坐标之间的匹配程度。(8) Calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, and the backward flow occlusion mask is used to measure the first backward flow coordinate and all the backward flow occlusion masks. The matching degree between the coordinates of the second backward flow.
这个过程可以概括为双向流一致性的检验，其中包括前向流一致性检验和后向流一致性检验。首先，获取目标帧图像在图像坐标系下的第七坐标，以及前文所述的第三坐标，然后对该第三坐标和该第七坐标执行对应元素作差的处理，得到第一前向流坐标，如以下公式所示：This process can be summarized as a bidirectional flow consistency check, which includes a forward flow consistency check and a backward flow consistency check. First, the seventh coordinates of the target frame image in the image coordinate system and the aforementioned third coordinates are obtained; then the element-wise difference between the third coordinates and the seventh coordinates is computed to obtain the first forward flow coordinates, as shown in the following formula:
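Writing the seventh coordinates (the original pixel grid of the target frame) as $p_{tgt}$ and the third coordinates (the target pixels projected into the reference view) as $\tilde{p}_{tgt\rightarrow ref}$ — notation introduced here only for illustration — the first forward flow coordinates presumably take the form:

$$F_{fwd}(p)=\tilde{p}_{tgt\rightarrow ref}(p)-p_{tgt}(p)$$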
类似的，获取该参考帧图像在图像坐标系下的第八坐标，以及前文所述的第六坐标，然后对该第六坐标和该第八坐标执行对应元素作差的处理，得到第一后向流坐标，如以下公式所示：Similarly, the eighth coordinates of the reference frame image in the image coordinate system and the aforementioned sixth coordinates are obtained; then the element-wise difference between the sixth coordinates and the eighth coordinates is computed to obtain the first backward flow coordinates, as shown in the following formula:
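With the eighth coordinates (the original pixel grid of the reference frame) written as $p_{ref}$ and the sixth coordinates (the reference pixels projected into the target view) as $\tilde{p}_{ref\rightarrow tgt}$, again as illustrative notation, the first backward flow coordinates presumably take the form:

$$F_{bwd}(p)=\tilde{p}_{ref\rightarrow tgt}(p)-p_{ref}(p)$$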
然后，以该第三坐标作为网格坐标，采用双线性采样机制对该第一后向流坐标进行仿射变换，以合成第二前向流坐标。在理想的情况下，合成的前向流坐标和计算得到的前向流坐标的大小相同，而方向相反，此为前向流一致性。Then, using the third coordinates as the grid coordinates, a bilinear sampling mechanism is used to affine-transform the first backward flow coordinates so as to synthesize the second forward flow coordinates. In the ideal case, the synthesized forward flow coordinates and the calculated forward flow coordinates have the same magnitude and opposite directions; this is forward flow consistency.
以该第六坐标作为网格坐标，采用双线性采样机制对该第一前向流坐标进行仿射变换，以合成第二后向流坐标。在理想的情况下，合成的后向流坐标和计算得到的后向流坐标的大小相同，而方向相反，此为后向流一致性。Using the sixth coordinates as the grid coordinates, a bilinear sampling mechanism is used to affine-transform the first forward flow coordinates so as to synthesize the second backward flow coordinates. In the ideal case, the synthesized backward flow coordinates and the calculated backward flow coordinates have the same magnitude and opposite directions; this is backward flow consistency.
接下来，可以根据该第一前向流坐标和该第二前向流坐标计算得到前向流遮挡掩码，该掩码用于衡量第一前向流坐标和第二前向流坐标之间的匹配程度，具体可以采用以下公式计算：Next, a forward flow occlusion mask can be calculated from the first forward flow coordinates and the second forward flow coordinates; this mask is used to measure the degree of matching between the first forward flow coordinates and the second forward flow coordinates, and can be calculated with the following formula:
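The exact mask expression is an assumption here; one plausible soft form, consistent with the requirement that the computed flow $F_{fwd}$ and the synthesized flow $\hat{F}_{fwd}$ be equal in magnitude and opposite in direction, is:

$$M_{fwd}(p)=\exp\!\left(-\alpha\,\frac{\left\|F_{fwd}(p)+\hat{F}_{fwd}(p)\right\|_{1}}{\left\|F_{fwd}(p)\right\|_{1}+\left\|\hat{F}_{fwd}(p)\right\|_{1}+\epsilon}\right)$$

where $\alpha$ and $\epsilon$ are hypothetical constants; the mask is close to 1 where the two flows are consistent and decreases towards 0 where they disagree, e.g. at occlusions.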
可以根据该第一后向流坐标和该第二后向流坐标计算得到后向流遮挡掩码，该掩码用于衡量第一后向流坐标和第二后向流坐标之间的匹配程度，具体可以采用以下公式计算：Similarly, a backward flow occlusion mask can be calculated from the first backward flow coordinates and the second backward flow coordinates; this mask is used to measure the degree of matching between the first backward flow coordinates and the second backward flow coordinates, and can be calculated with the following formula:
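Under the same assumption, the backward mask would mirror the forward one:

$$M_{bwd}(p)=\exp\!\left(-\alpha\,\frac{\left\|F_{bwd}(p)+\hat{F}_{bwd}(p)\right\|_{1}}{\left\|F_{bwd}(p)\right\|_{1}+\left\|\hat{F}_{bwd}(p)\right\|_{1}+\epsilon}\right)$$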
其中,各个参数的定义可以参照前文所述。The definition of each parameter may refer to the foregoing description.
在计算得到两个流遮挡掩码之后,所述根据所述第一图像重建损失和所述第二图像重建损失计算双向图像重建损失,可以包括:After two flow occlusion masks are obtained by calculation, calculating the bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss may include:
根据所述第一图像重建损失、所述第二图像重建损失、所述前向流遮挡掩码和所述后向流遮挡掩码计算所述双向图像重建损失。The bidirectional image reconstruction loss is calculated from the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask, and the backward flow occlusion mask.
遮挡掩码用于判断连续的视频帧中是否存在遮挡物体,将遮挡掩码添加到双向图像重建损失的计算中,能够提高该深度估计网络对带有遮挡物体的图像进行深度估计的准确率。The occlusion mask is used to determine whether there are occluding objects in consecutive video frames. Adding the occlusion mask to the calculation of the bidirectional image reconstruction loss can improve the depth estimation accuracy of the depth estimation network for images with occluded objects.
进一步的,所述方法还可以包括:Further, the method can also include:
(1)根据所述第二坐标确定所述目标帧图像的第一场景深度值;(1) determining the first scene depth value of the target frame image according to the second coordinate;
(2)根据所述第五坐标确定所述参考帧图像的第二场景深度值;(2) determining the second scene depth value of the reference frame image according to the fifth coordinate;
(3)获取所述第一场景深度图像中与所述第二坐标对应的像素点的第三场景深度值;(3) obtaining the third scene depth value of the pixel corresponding to the second coordinate in the first scene depth image;
(4)获取所述第二场景深度图像中与所述第五坐标对应的像素点的第四场景深度值;(4) acquiring the fourth scene depth value of the pixel corresponding to the fifth coordinate in the second scene depth image;
(5)基于所述第三坐标和所述第四场景深度值,通过双线性采样机制重建出所述目标帧图像的第五场景深度值;(5) based on the third coordinate and the fourth scene depth value, reconstruct the fifth scene depth value of the target frame image through a bilinear sampling mechanism;
(6)基于所述第六坐标和所述第三场景深度值,通过双线性采样机制重建出所述参考帧图像的第六场景深度值;(6) reconstructing the sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
(7)根据所述第一场景深度值和所述第五场景深度值计算前向场景结构一致性损失,所述前向场景结构一致性损失用于衡量通过多视图几何变换计算得到的所述目标帧图像的场景深度值与重建出的所述目标帧图像的场景深度值之间的差异;(7) Calculate the forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, and the forward scene structure consistency loss is used to measure the The difference between the scene depth value of the target frame image and the reconstructed scene depth value of the target frame image;
(8)根据所述第二场景深度值和所述第六场景深度值计算后向场景结构一致性损失,所述后向场景结构一致性损失用于衡量通过多视图几何变换计算得到的所述参考帧图像的场景深度值与重建出的所述参考帧图像的场景深度值之间的差异;(8) Calculate the backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, and the backward scene structure consistency loss is used to measure the the difference between the scene depth value of the reference frame image and the reconstructed scene depth value of the reference frame image;
(9)根据所述前向场景结构一致性损失和所述后向场景结构一致性损失,计算双向场景结构一致性损失。(9) Calculate the bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss.
上述步骤用于计算双向场景结构一致性损失。首先，根据前文所述的第二坐标可以得到对应场景的深度值（第一场景深度值）；根据前文所述的第五坐标可以得到对应场景的深度值（第二场景深度值）。然后，根据该第一场景深度图像，可以估计出目标帧图像I_tgt中对应图像坐标处的像素点的深度值d_tgt（第三场景深度值）；根据该第二场景深度图像，可以估计出参考帧图像I_ref中对应图像坐标处的像素点的深度值d_ref（第四场景深度值）。接着，基于第三坐标和深度值d_ref，可以通过双线性采样机制重建出目标帧图像的第五场景深度值；基于第六坐标和深度值d_tgt，可以通过双线性采样机制重建出参考帧图像的第六场景深度值。
The above steps are used to calculate the bidirectional scene structure consistency loss. First, from the aforementioned second coordinates, the depth value of the corresponding scene (the first scene depth value) can be obtained; from the aforementioned fifth coordinates, the depth value of the corresponding scene (the second scene depth value) can be obtained. Then, from the first scene depth image, the depth value d_tgt of the pixel at the corresponding image coordinates in the target frame image I_tgt (the third scene depth value) can be estimated; from the second scene depth image, the depth value d_ref of the pixel at the corresponding image coordinates in the reference frame image I_ref (the fourth scene depth value) can be estimated. Next, based on the third coordinates and the depth value d_ref, the fifth scene depth value of the target frame image can be reconstructed through the bilinear sampling mechanism; based on the sixth coordinates and the depth value d_tgt, the sixth scene depth value of the reference frame image can be reconstructed through the bilinear sampling mechanism.
在理论上，第一场景深度值和第五场景深度值应该相等，第二场景深度值和第六场景深度值应该相等。然而通过实验测试发现，它们之间并不总是相等，因此可以通过以下两个公式分别计算前向场景结构误差以及后向场景结构误差，进而对场景结构施加一致性约束：In theory, the first scene depth value and the fifth scene depth value should be equal, and the second scene depth value and the sixth scene depth value should be equal. However, experiments show that they are not always equal, so the forward scene structure error and the backward scene structure error can be calculated by the following two formulas, thereby imposing a consistency constraint on the scene structure:
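Writing the first and fifth scene depth values as $d_{tgt}^{geo}$ and $\hat{d}_{tgt}$, and the second and sixth as $d_{ref}^{geo}$ and $\hat{d}_{ref}$ (illustrative notation), one plausible normalized form of the two errors, similar to the geometry-consistency terms used in related self-supervised methods and not necessarily the patent's exact expressions, is:

$$e_{fwd}(p)=\frac{\left|d_{tgt}^{geo}(p)-\hat{d}_{tgt}(p)\right|}{d_{tgt}^{geo}(p)+\hat{d}_{tgt}(p)},\qquad e_{bwd}(p)=\frac{\left|d_{ref}^{geo}(p)-\hat{d}_{ref}(p)\right|}{d_{ref}^{geo}(p)+\hat{d}_{ref}(p)}$$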
通过对场景结构施加一致性约束，可以定位出图像场景中的运动对象和遮挡物体的位置。例如，前向场景结构误差和后向场景结构误差的值越大的位置，表示该位置越可能存在运动对象和遮挡物体。By imposing consistency constraints on the scene structure, the positions of moving objects and occluding objects in the image scene can be located. For example, positions where the forward and backward scene structure errors take larger values are more likely to contain moving objects and occluding objects.
然后，计算前向场景结构一致性损失，其用于衡量通过多视图几何变换计算得到的目标帧图像的场景深度值与重建出的目标帧图像的场景深度值之间的差异，具体可以采用以下公式计算：Then, the forward scene structure consistency loss is calculated, which is used to measure the difference between the scene depth value of the target frame image calculated through the multi-view geometric transformation and the reconstructed scene depth value of the target frame image; it can be calculated with the following formula:
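A plausible form, assuming a simple average of the forward scene structure error over the valid grid coordinates:

$$L_{dsc}^{fwd}=\frac{1}{N_{ref}}\sum_{p}e_{fwd}(p)$$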
其中，N_ref表示参考帧图像中有效网格坐标的数量。where N_ref represents the number of valid grid coordinates in the reference frame image.
计算后向场景结构一致性损失，其用于衡量通过多视图几何变换计算得到的参考帧图像的场景深度值与重建出的参考帧图像的场景深度值之间的差异，具体可以采用以下公式计算：The backward scene structure consistency loss is calculated in the same way; it is used to measure the difference between the scene depth value of the reference frame image calculated through the multi-view geometric transformation and the reconstructed scene depth value of the reference frame image, and can be calculated with the following formula:
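Under the same assumption, the backward loss averages the backward scene structure error:

$$L_{dsc}^{bwd}=\frac{1}{N_{tgt}}\sum_{p}e_{bwd}(p)$$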
其中，N_tgt表示目标帧图像中有效网格坐标的数量。where N_tgt represents the number of valid grid coordinates in the target frame image.
最后,根据前向场景结构一致性损失和后向场景结构一致性损失,可以计算双向场景结构一致性损失如下:Finally, according to the forward scene structure consistency loss and the backward scene structure consistency loss, the bidirectional scene structure consistency loss can be calculated as follows:
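Assumed here to be simply the sum of the two directional terms (the patent may instead use their average):

$$L_{dsc}=L_{dsc}^{fwd}+L_{dsc}^{bwd}$$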
所述基于所述双向图像重建损失构建所述目标函数,可以包括:The constructing the objective function based on the bidirectional image reconstruction loss may include:
基于所述双向图像重建损失和所述双向场景结构一致性损失,构建得到所述目标函数。The objective function is constructed and obtained based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
在构建目标函数时,添加双向场景结构一致性损失,能够有效处理待测图像场景中的遮挡物体与运动对象,从而提高场景深度估计的准确率。When constructing the objective function, adding the loss of bidirectional scene structure consistency can effectively deal with occluded objects and moving objects in the scene of the image to be tested, thereby improving the accuracy of scene depth estimation.
另一方面,在计算双向图像重建损失时,可以同时引入前文所述的两个遮挡掩码以及两个场景结构误差,例如可以采用以下公式计算:On the other hand, when calculating the bidirectional image reconstruction loss, the two occlusion masks and the two scene structure errors mentioned above can be introduced at the same time. For example, the following formula can be used to calculate:
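A plausible form of this weighted loss, using the occlusion masks $M_{fwd}, M_{bwd}$ and the scene-structure-inconsistency weights $(1-e_{fwd}), (1-e_{bwd})$ named in the following sentences (an assumption, not necessarily the patent's exact expression):

$$L_{photo}=\frac{1}{N}\sum_{p}\Big[M_{fwd}(p)\,\big(1-e_{fwd}(p)\big)\,\rho\big(I_{tgt}(p),\hat{I}_{tgt}(p)\big)+M_{bwd}(p)\,\big(1-e_{bwd}(p)\big)\,\rho\big(I_{ref}(p),\hat{I}_{ref}(p)\big)\Big]$$

where $\rho(\cdot,\cdot)$ is the per-pixel photometric error used in the forward and backward image reconstruction losses.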
通过使用遮挡掩码和场景结构不一致性权重对图像重建损失函数进行加权处理，能够达到处理遮挡和运动对象的目的。具体的，使用所述前向流遮挡掩码和所述前向场景结构不一致性权重对第一图像重建损失进行加权处理；使用所述后向流遮挡掩码和所述后向场景结构不一致性权重对第二图像重建损失进行加权处理；基于加权处理后的第一图像重建损失和加权处理后的第二图像重建损失构建所述双向图像重建损失。By weighting the image reconstruction loss function with the occlusion masks and the scene structure inconsistency weights, occlusion and moving objects can be handled. Specifically, the first image reconstruction loss is weighted using the forward flow occlusion mask and the forward scene structure inconsistency weight; the second image reconstruction loss is weighted using the backward flow occlusion mask and the backward scene structure inconsistency weight; and the bidirectional image reconstruction loss is constructed based on the weighted first image reconstruction loss and the weighted second image reconstruction loss.
在本申请的一个实施例中,所述深度估计网络包括编码网络,所述方法还可以包括:In an embodiment of the present application, the depth estimation network includes an encoding network, and the method may further include:
(1)通过所述编码网络获取所述目标帧图像的第一特征图像以及所述参考帧图像的第二特征图像;(1) obtaining the first characteristic image of the target frame image and the second characteristic image of the reference frame image through the encoding network;
(2)基于所述第三坐标和所述第二特征图像,通过双线性采样机制重建出所述目标帧图像的第三特征图像;(2) based on the third coordinate and the second feature image, reconstruct the third feature image of the target frame image through a bilinear sampling mechanism;
(3)基于所述第六坐标和所述第一特征图像,通过双线性采样机制重建出所述参考帧图像的第四特征图像;(3) based on the sixth coordinate and the first feature image, reconstruct the fourth feature image of the reference frame image through a bilinear sampling mechanism;
(4)根据所述第一特征图像、所述第二特征图像、所述第三特征图像和所述第四特征图像，计算得到双向特征感知损失，所述双向特征感知损失用于衡量通过编码网络获得的所述目标帧图像的特征图像与重建出的所述目标帧图像的特征图像之间的差异，以及通过编码网络获得的所述参考帧图像的特征图像与重建出的所述参考帧图像的特征图像之间的差异。(4) Calculate the bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, where the bidirectional feature perception loss is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, as well as the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image.
上述步骤用于计算双向特征感知损失。相比于原始RGB图像，经过编码器提取的特征在弱纹理区域具有更好的区分性。本申请利用编码网络提取的最高分辨率特征图来处理弱纹理区域，通过该深度估计网络中的编码网络，可以提取目标帧图像的特征图像f_tgt（第一特征图像）和参考帧图像的特征图像f_ref（第二特征图像）。然后，基于前文所述的第三坐标以及该参考帧图像的特征图像f_ref，可以通过双线性采样机制将该特征图像f_ref进行仿射变换，重建出该目标帧图像的第三特征图像；基于前文所述的第六坐标以及该目标帧图像的特征图像f_tgt，可以通过双线性采样机制将该特征图像f_tgt进行仿射变换，重建出该参考帧图像的第四特征图像。接着，可以采用以下公式计算得到双向特征感知损失：
The above steps are used to calculate the bidirectional feature perception loss. Compared with the original RGB image, the features extracted by the encoder are more discriminative in weakly textured regions. The present application uses the highest-resolution feature map extracted by the encoding network to handle weakly textured regions: through the encoding network in the depth estimation network, the feature image f_tgt of the target frame image (the first feature image) and the feature image f_ref of the reference frame image (the second feature image) can be extracted. Then, based on the aforementioned third coordinates and the feature image f_ref of the reference frame image, the feature image f_ref can be affine-transformed through the bilinear sampling mechanism to reconstruct the third feature image of the target frame image; based on the aforementioned sixth coordinates and the feature image f_tgt of the target frame image, the feature image f_tgt can be affine-transformed through the bilinear sampling mechanism to reconstruct the fourth feature image of the reference frame image. The bidirectional feature perception loss can then be calculated with the following formula:
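A plausible form, assuming a per-pixel L1 difference between the encoder features and their warped reconstructions (the patent's exact expression may differ):

$$L_{feat}=\frac{1}{N}\sum_{p}\Big(\left\|f_{tgt}(p)-\hat{f}_{tgt}(p)\right\|_{1}+\left\|f_{ref}(p)-\hat{f}_{ref}(p)\right\|_{1}\Big)$$

where $\hat{f}_{tgt}$ and $\hat{f}_{ref}$ denote the reconstructed third and fourth feature images.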
双向特征感知损失L_feat用于衡量通过编码网络获得的目标帧图像的特征图像与重建出的目标帧图像的特征图像之间的差异，以及通过编码网络获得的参考帧图像的特征图像与重建出的参考帧图像的特征图像之间的差异。The bidirectional feature perception loss L_feat is used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, as well as the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image.
所述基于所述双向图像重建损失构建所述目标函数,可以包括:The constructing the objective function based on the bidirectional image reconstruction loss may include:
基于所述双向图像重建损失和所述双向特征感知损失,构建得到所述目标函数。The objective function is constructed based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
通过在目标函数中引入双向特征感知损失,能够有效处理待测图像中的弱纹理场景,从而提高场景深度估计的准确率。By introducing the bidirectional feature perception loss into the objective function, the weak texture scene in the image to be tested can be effectively processed, thereby improving the accuracy of scene depth estimation.
进一步的,在通过所述编码网络获取所述目标帧图像的第一特征图像以及所述参考帧图像的第二特征图像之后,还可以包括:Further, after obtaining the first feature image of the target frame image and the second feature image of the reference frame image through the encoding network, the method may further include:
根据所述目标帧图像、所述参考帧图像、所述第一场景深度图像、所述第二场景深度图像、所述第一特征图像和所述第二特征图像，计算得到平滑损失，所述平滑损失用于正则化通过所述深度估计网络获得的场景深度图像和特征图像的梯度。According to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, a smoothing loss is calculated; the smoothing loss is used to regularize the gradients of the scene depth images and feature images obtained through the depth estimation network.
所述基于所述双向图像重建损失和所述双向特征感知损失,构建得到所述目标函数,可以包括:The constructing and obtaining the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss may include:
基于所述双向图像重建损失、所述双向特征感知损失和所述平滑损失,构建得到所述目标函数。The objective function is constructed based on the bidirectional image reconstruction loss, the bidirectional feature perception loss, and the smoothing loss.
为了正则化通过所述深度估计网络获得的场景深度图像和特征图像的梯度，可以在目标函数中引入平滑损失L_s，具体可以采用以下公式计算：In order to regularize the gradients of the scene depth images and feature images obtained through the depth estimation network, a smoothing loss L_s can be introduced into the objective function, which can be calculated with the following formula:
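A plausible edge-aware form, consistent with the parameter descriptions that follow (partial derivatives of the estimated depth maps and feature maps, weighted by a natural exponential of the corresponding image gradients; this is an assumption and the patent's exact expression may differ):

$$L_{s}=\sum_{p}\Big(\left|\partial d_{tgt}\right|e^{-\left|\partial I_{tgt}\right|}+\left|\partial d_{ref}\right|e^{-\left|\partial I_{ref}\right|}+\left|\partial f_{tgt}\right|e^{-\left|\partial I_{tgt}\right|}+\left|\partial f_{ref}\right|e^{-\left|\partial I_{ref}\right|}\Big)$$

where $\partial$ denotes partial derivatives along both image directions.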
其中，|∂d_ref|表示对深度估计网络估计的参考帧深度图d_ref计算偏导数，然后在每个元素位置处取绝对值；|∂I_ref|表示对参考帧图像I_ref计算偏导数，然后在每个元素位置处取绝对值；e^(-|∂I_ref|)表示以-|∂I_ref|为幂的自然指数，以此类推。where |∂d_ref| denotes taking the partial derivatives of the reference frame depth map d_ref estimated by the depth estimation network and then the absolute value at each element position, |∂I_ref| denotes taking the partial derivatives of the reference frame image I_ref and then the absolute value at each element position, and e^(-|∂I_ref|) denotes the natural exponential with -|∂I_ref| as the exponent, and so on.
前文提出了四种类型的损失函数,分别为双向图像重建损失、平滑损失、双向场景结构一致性损失以及双向特征感知损失,可以基于这几个损失函数构建得到最终的目标函数。例如,某个目标函数L的表达式如下:Four types of loss functions are proposed in the previous section, namely bidirectional image reconstruction loss, smoothing loss, bidirectional scene structure consistency loss, and bidirectional feature perception loss. The final objective function can be constructed based on these loss functions. For example, the expression of an objective function L is as follows:
L = λ_photo·L_photo + λ_s·L_s + λ_dsc·L_dsc + λ_feat·L_feat
其中，各个λ为设定的系数，例如可以为λ_photo=1.0，λ_s=0.001，λ_dsc=0.5，λ_feat=0.05。Wherein, each λ is a preset coefficient, for example λ_photo = 1.0, λ_s = 0.001, λ_dsc = 0.5, λ_feat = 0.05.
另外，在前文所述计算各个损失函数的过程中，举例说明的是单个参考帧图像的计算结果，而若参考帧图像有多个，则每个参考帧图像都可以采用前文所述相同的方式计算得到对应的损失值，最后可以用这些参考帧图像对应损失值的平均值作为最后构建目标函数时采用的损失值。In addition, in the above process of calculating each loss function, the calculation for a single reference frame image is given as an example; if there are multiple reference frame images, the corresponding loss value of each reference frame image can be calculated in the same way as described above, and the average of the loss values corresponding to these reference frame images can then be used as the loss value adopted when constructing the final objective function.
2.7、根据所述目标函数更新所述深度估计网络的参数。2.7. Update the parameters of the depth estimation network according to the objective function.
在构建出目标函数之后，可以根据该目标函数更新该深度估计网络的参数，以达到优化和训练网络的目的。具体的，可以利用AdamW优化器求解出该目标函数相对于深度估计网络权重的梯度，并以此梯度来更新深度估计网络的权重，如此不断迭代，直至达到设定的最大迭代次数，完成该深度估计网络的训练。After the objective function is constructed, the parameters of the depth estimation network can be updated according to the objective function, so as to optimize and train the network. Specifically, the AdamW optimizer can be used to compute the gradient of the objective function with respect to the weights of the depth estimation network, and this gradient is used to update the weights of the depth estimation network; this is iterated until the preset maximum number of iterations is reached, completing the training of the depth estimation network.
进一步的，该目标函数可以同时用来对前文所述的相机姿态估计网络进行训练。同样的，可以利用AdamW优化器求解出该目标函数相对于相机姿态估计网络权重的梯度，并以此梯度来更新相机姿态估计网络的权重，如此不断迭代，直至达到设定的最大迭代次数，完成该相机姿态估计网络的训练。总的来说，在构建好目标函数后，可以用该目标函数作为监督信号来联合指导深度估计网络和相机姿态估计网络的训练。具体的，可以利用AdamW优化器求解出该目标函数相对于深度估计网络权重的梯度以及该目标函数相对于相机姿态估计网络权重的梯度，并以此梯度来同时更新深度估计网络和相机姿态估计网络的权重，如此不断迭代，直至达到设定的最大迭代次数，完成深度估计网络和相机姿态估计网络的联合训练。Further, the objective function can also be used to train the camera pose estimation network described above. Likewise, the AdamW optimizer can be used to compute the gradient of the objective function with respect to the weights of the camera pose estimation network, and this gradient is used to update the weights of the camera pose estimation network; this is iterated until the preset maximum number of iterations is reached, completing the training of the camera pose estimation network. In general, after the objective function is constructed, it can be used as a supervision signal to jointly guide the training of the depth estimation network and the camera pose estimation network. Specifically, the AdamW optimizer can be used to compute the gradients of the objective function with respect to the weights of the depth estimation network and with respect to the weights of the camera pose estimation network, and these gradients are used to simultaneously update the weights of the two networks; this is iterated until the preset maximum number of iterations is reached, completing the joint training of the depth estimation network and the camera pose estimation network.
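As an illustration of this joint optimisation, the following PyTorch-style sketch shows one training loop consistent with the description above. It is not the patent's code: the data-loader format, the network call signatures and build_objective (standing in for the objective function L) are all hypothetical.

```python
# Minimal sketch of joint AdamW training of the depth and pose networks.
import torch

def train(depth_net, pose_net, loader, build_objective, max_iters, lr=1e-4):
    params = list(depth_net.parameters()) + list(pose_net.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    it = 0
    while it < max_iters:
        for tgt, refs, K in loader:                  # target frame, reference frames, intrinsics
            depth_tgt = depth_net(tgt)               # first scene depth image
            depths_ref = [depth_net(r) for r in refs]
            poses = pose_net(tgt, refs)              # camera pose vectors
            loss = build_objective(tgt, refs, depth_tgt, depths_ref, poses, K)
            optimizer.zero_grad()
            loss.backward()                          # gradients w.r.t. both networks
            optimizer.step()                         # simultaneous weight update
            it += 1
            if it >= max_iters:
                return
```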
在完成两个网络的训练后,就可以使用单目图像(例如待测图像)作为该 深度估计网络的输入,直接计算出对应的场景深度图像。也可以使用连续的图像序列(例如任意连续的5帧单目图像)作为该相机姿态估计网络的输入,计算得到对应的相机姿态向量。需要说明的是,深度估计网络和相机姿态估计网络仅在训练期间需要联合优化,训练完成后网络的权重即固定了,测试期间不需要进行反向传播,只需要进行前向传播即可,因而测试期间两个网络可以单独使用。After completing the training of the two networks, the monocular image (such as the image to be tested) can be used as the input of the depth estimation network to directly calculate the corresponding scene depth image. It is also possible to use a continuous image sequence (for example, any continuous 5 frames of monocular images) as the input of the camera pose estimation network, and calculate the corresponding camera pose vector. It should be noted that the depth estimation network and the camera pose estimation network only need to be jointly optimized during the training period. After the training is completed, the weight of the network is fixed. Backpropagation is not required during the test period, only forward propagation is required. Therefore, Both networks can be used independently during testing.
本申请实施例采用的深度估计网络在优化更新参数时，会结合相机姿态估计网络预测输入的样本图像序列的相机姿态向量，该样本图像序列包含目标帧图像和参考帧图像；然后，根据该深度估计网络预测得到的该目标帧图像的场景深度图像、该相机姿态向量、该参考帧图像和对应相机的内参生成与该目标帧图像对应的重建图像；接着，根据该目标帧图像和该重建图像计算得到重建图像时对应的损失函数，最后基于该损失函数构建目标函数并基于该目标函数更新该深度估计网络的参数。通过这样设置，能够充分挖掘目标帧图像和参考帧图像包含的潜在图像信息，也即采样较少的样本图像即可获得足够的图像信息以完成该深度估计网络的训练，从而降低样本数据采集的成本。When optimizing and updating its parameters, the depth estimation network adopted in the embodiments of the present application works together with the camera pose estimation network, which predicts the camera pose vector of the input sample image sequence; the sample image sequence includes the target frame image and the reference frame image. Then, a reconstructed image corresponding to the target frame image is generated according to the scene depth image of the target frame image predicted by the depth estimation network, the camera pose vector, the reference frame image and the internal parameters of the corresponding camera. Next, the loss function corresponding to the reconstruction is calculated according to the target frame image and the reconstructed image; finally, an objective function is constructed based on this loss function, and the parameters of the depth estimation network are updated based on the objective function. With this arrangement, the latent image information contained in the target frame image and the reference frame image can be fully exploited, that is, sufficient image information can be obtained from fewer sampled images to complete the training of the depth estimation network, thereby reducing the cost of sample data collection.
另外，通过在目标函数中添加双向的图像重建损失、双向场景结构一致性损失、双向特征感知损失和平滑损失，能够进一步挖掘图像中包含的潜在信息，降低样本数据的采集成本，并能够有效地处理视频帧中存在的运动对象、遮挡等问题以及提高对弱纹理环境的鲁棒性。In addition, by adding the bidirectional image reconstruction loss, the bidirectional scene structure consistency loss, the bidirectional feature perception loss and the smoothing loss to the objective function, the latent information contained in the images can be further exploited, the cost of sample data collection can be reduced, problems such as moving objects and occlusion in video frames can be handled effectively, and the robustness to weakly textured environments can be improved.
以下内容为通过仿真结果来说明本申请提出的图像场景深度估计以及相机姿态估计的技术效果。其中,采用Eigen划分的测试集来作为深度估计网络的评估数据,使用KITTI Odometry数据集中的09-10序列作为相机姿态估计网络的评估数据。The following content is to illustrate the technical effects of the image scene depth estimation and camera pose estimation proposed in the present application through simulation results. Among them, the test set divided by Eigen is used as the evaluation data of the depth estimation network, and the 09-10 sequence in the KITTI Odometry data set is used as the evaluation data of the camera pose estimation network.
深度估计网络采用的评估标准包括：绝对误差(AbsRel)、均方根误差(Rmse)、均方误差(SqRel)、对数均方根误差(Rmselog)和阈值(δ_t)；相机姿态估计网络采用的评估指标为绝对轨迹误差(ATE)。经过仿真测试，本申请提出的方法与现有技术的算法进行比较的测试结果如以下的表1-表3所示。The evaluation criteria used by the depth estimation network include: absolute error (AbsRel), root mean square error (Rmse), mean square error (SqRel), logarithmic root mean square error (Rmselog) and threshold (δ_t); the evaluation metric used by the camera pose estimation network is the absolute trajectory error (ATE). After simulation tests, the results of comparing the method proposed in this application with prior-art algorithms are shown in Tables 1 to 3 below.
表1Table 1
表1为在80m的深度范围内，对单目图像进行场景深度预测的结果对比。其中，绝对误差(AbsRel)、均方根误差(Rmse)、均方误差(SqRel)、对数均方根误差(Rmselog)的绝对值表示算法误差值，用于衡量算法的精度，误差值越小表示精度越高；阈值(δ_t)表示预测的场景深度与真实值的接近程度，阈值越高表示算法稳定性越好。通过表1中的测试结果可以发现，本申请提出的方法与现有技术的算法相比，能够获得更高的场景深度预测精度，以及更好的算法稳定性。Table 1 compares the scene depth prediction results for monocular images within a depth range of 80 m. The absolute values of the absolute error (AbsRel), root mean square error (Rmse), mean square error (SqRel) and logarithmic root mean square error (Rmselog) represent the algorithm error and are used to measure the accuracy of the algorithm; the smaller the error value, the higher the accuracy. The threshold (δ_t) represents how close the predicted scene depth is to the true value; the higher the threshold, the better the stability of the algorithm. From the test results in Table 1 it can be seen that, compared with prior-art algorithms, the method proposed in this application achieves higher scene depth prediction accuracy and better algorithm stability.
表2Table 2
表2为在50m的深度范围内,对单目图像进行场景深度预测的结果对比。通过表2中的测试结果同样可以发现,本申请提出的方法与现有技术的算法相比,能够获得更高的场景深度预测精度,以及更好的算法稳定性,因此能够更加鲁棒的预测出单目图像的场景深度和更多细节。Table 2 is a comparison of the results of scene depth prediction for monocular images in the depth range of 50m. From the test results in Table 2, it can also be found that, compared with the algorithm in the prior art, the method proposed in this application can obtain higher scene depth prediction accuracy and better algorithm stability, so it can predict more robustly Scene depth and more detail from monocular images.
表3table 3
表3中的绝对轨迹误差(ATE)表示相机位姿的真实值与预测的相机位姿之间的差值,误差值越小,预测的相机位姿越准确。仿真结果表明,同现有的各类算法相比,本申请提出的相机姿态估计的方法预测的相机位姿更准确。The absolute trajectory error (ATE) in Table 3 represents the difference between the true value of the camera pose and the predicted camera pose. The smaller the error value, the more accurate the predicted camera pose. The simulation results show that, compared with various existing algorithms, the camera pose estimation method proposed in the present application predicts the camera pose more accurately.
另外，图6是本申请提出的图像场景深度的估计方法和现有技术的各类算法分别进行单目图像深度预测的结果对比图，其中Ground Truth深度图是通过可视化激光雷达数据获得的深度图。In addition, FIG. 6 is a comparison diagram of the monocular image depth prediction results of the method for estimating the depth of an image scene proposed by the present application and various prior-art algorithms, where the Ground Truth depth map is a depth map obtained by visualizing lidar data.
应理解,上述各个实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the sequence numbers of the steps in the above-mentioned various embodiments does not mean the order of execution, and the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. .
上面主要描述了一种图像场景深度的估计方法,下面将对一种图像场景深度的估计装置进行描述。A method for estimating the depth of an image scene is mainly described above, and an apparatus for estimating the depth of an image scene will be described below.
请参阅图7,本申请实施例中一种图像场景深度的估计装置的一个实施例包括:Referring to FIG. 7 , an embodiment of an apparatus for estimating the depth of an image scene in an embodiment of the present application includes:
待测图像获取模块701,用于获取待测图像;an image to be measured acquisition module 701, configured to acquire an image to be measured;
场景深度估计模块702,用于将所述待测图像输入预先构建的深度估计网络,得到所述待测图像的场景深度图像;a scene depth estimation module 702, configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested;
样本获取模块703,用于获取样本图像序列,所述样本图像序列包含目标帧图像和参考帧图像,所述参考帧图像为所述样本图像序列中处于所述目标帧图像之前或之后的一帧以上的图像;A sample acquisition module 703, configured to acquire a sample image sequence, where the sample image sequence includes a target frame image and a reference frame image, and the reference frame image is a frame before or after the target frame image in the sample image sequence image above;
第一场景深度预测模块704,用于将所述目标帧图像输入所述深度估计网络,得到预测的第一场景深度图像;A first scene depth prediction module 704, configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image;
相机姿态估计模块705,用于将所述样本图像序列输入预先构建的相机姿态估计网络,得到预测的所述目标帧图像和所述参考帧图像之间的相机姿态向量;A camera pose estimation module 705, configured to input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image;
第一图像重建模块706,用于根据所述第一场景深度图像、所述相机姿态向量、所述参考帧图像以及拍摄所述样本图像序列采用的相机的内参,生成与所述目标帧图像对应的第一重建图像;A first image reconstruction module 706, configured to generate images corresponding to the target frame according to the depth image of the first scene, the camera pose vector, the reference frame image, and the internal parameters of the camera used to capture the sample image sequence The first reconstructed image of ;
第一图像重建损失计算模块707,用于根据所述目标帧图像和所述第一重建图像计算第一图像重建损失,所述第一图像重建损失用于衡量所述目标帧图像和所述第一重建图像之间的差异;The first image reconstruction loss calculation module 707 is configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, and the first image reconstruction loss is used to measure the target frame image and the first image reconstruction loss. a difference between the reconstructed images;
目标函数构建模块708,用于基于所述第一图像重建损失构建目标函数;an objective function construction module 708, configured to construct an objective function based on the first image reconstruction loss;
网络参数更新模块709,用于根据所述目标函数更新所述深度估计网络的参数。A network parameter updating module 709, configured to update the parameters of the depth estimation network according to the objective function.
在本申请的一个实施例中,所述装置还可以包括:In an embodiment of the present application, the apparatus may further include:
第二场景深度预测模块,用于将所述参考帧图像输入所述深度估计网络,得到预测的第二场景深度图像;A second scene depth prediction module, configured to input the reference frame image into the depth estimation network to obtain a predicted second scene depth image;
第二图像重建模块,用于根据所述第二场景深度图像、所述相机姿态向量、所述目标帧图像以及拍摄所述样本图像序列采用的相机的内参,生成与所述参考帧图像对应的第二重建图像;A second image reconstruction module, configured to generate an image corresponding to the reference frame image according to the depth image of the second scene, the camera pose vector, the target frame image, and the internal parameters of the camera used to capture the sample image sequence the second reconstructed image;
第二图像重建损失计算模块,用于根据所述参考帧图像和所述第二重建图像计算第二图像重建损失,所述第二图像重建损失用于衡量所述参考帧图像和所述第二重建图像之间的差异;A second image reconstruction loss calculation module, configured to calculate a second image reconstruction loss according to the reference frame image and the second reconstructed image, where the second image reconstruction loss is used to measure the reference frame image and the second image reconstruction loss difference between reconstructed images;
所述目标函数构建模块可以包括:The objective function building block may include:
双向图像重建损失计算单元,用于根据所述第一图像重建损失和所述第二图像重建损失,计算双向图像重建损失;a bidirectional image reconstruction loss calculation unit, configured to calculate the bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss;
目标函数构建单元,用于基于所述双向图像重建损失构建所述目标函数。an objective function construction unit, configured to construct the objective function based on the bidirectional image reconstruction loss.
进一步的,所述第一图像重建模块可以包括:Further, the first image reconstruction module may include:
第一变换矩阵确定单元,用于根据所述相机姿态向量确定所述目标帧图像转换到所述参考帧图像的第一变换矩阵;a first transformation matrix determining unit, configured to determine a first transformation matrix for converting the target frame image to the reference frame image according to the camera pose vector;
第一坐标计算单元,用于根据所述相机的内参和所述第一场景深度图像,计算所述目标帧图像在世界坐标系下的第一坐标;a first coordinate calculation unit, configured to calculate the first coordinate of the target frame image in the world coordinate system according to the camera's internal parameters and the first scene depth image;
第一坐标变换单元,用于基于所述第一变换矩阵对所述第一坐标进行变换,得到所述目标帧图像经转换后在世界坐标系下的第二坐标;a first coordinate transformation unit, configured to transform the first coordinate based on the first transformation matrix to obtain the second coordinate of the target frame image after the transformation in the world coordinate system;
第一坐标转换单元,用于将所述第二坐标转换为在图像坐标系下的第三坐标;a first coordinate conversion unit, configured to convert the second coordinate into a third coordinate in the image coordinate system;
第一图像重建单元,用于基于所述参考帧图像,以所述第三坐标作为网格点,通过双线性采样机制重建出所述参考帧图像经仿射变换后的图像,并将重建得到的图像确定为所述第一重建图像;A first image reconstruction unit, configured to reconstruct an affine transformed image of the reference frame image based on the reference frame image and use the third coordinate as a grid point through a bilinear sampling mechanism, and reconstruct the image after affine transformation. The obtained image is determined as the first reconstructed image;
所述第二图像重建模块可以包括:The second image reconstruction module may include:
第二变换矩阵确定单元,用于根据所述相机姿态向量确定所述参考帧图像转换到所述目标帧图像的第二变换矩阵;a second transformation matrix determining unit, configured to determine a second transformation matrix for converting the reference frame image to the target frame image according to the camera pose vector;
第二坐标计算单元,用于根据所述相机的内参和所述第二场景深度图像,计算所述参考帧图像在世界坐标系下的第四坐标;a second coordinate calculation unit, configured to calculate the fourth coordinate of the reference frame image in the world coordinate system according to the internal reference of the camera and the depth image of the second scene;
第二坐标变换单元,用于基于所述第二变换矩阵对所述第四坐标进行变换,得到所述参考帧图像经转换后在世界坐标系下的第五坐标;a second coordinate transformation unit, configured to transform the fourth coordinate based on the second transformation matrix to obtain the transformed fifth coordinate of the reference frame image in the world coordinate system;
第二坐标转换单元,用于将所述第五坐标转换为在图像坐标系下的第六坐标;a second coordinate conversion unit, configured to convert the fifth coordinate into a sixth coordinate in the image coordinate system;
第二图像重建单元,用于基于所述目标帧图像,以所述第六坐标作为网格点,通过双线性采样机制重建出所述目标帧图像经仿射变换后的图像,并将重建得到的图像确定为所述第二重建图像。The second image reconstruction unit is configured to, based on the target frame image, take the sixth coordinate as a grid point, reconstruct the affine transformed image of the target frame image through a bilinear sampling mechanism, and reconstruct an image of the target frame image after affine transformation. The obtained image is determined as the second reconstructed image.
在本申请的一个实施例中,所述装置还可以包括:In an embodiment of the present application, the apparatus may further include:
第一坐标获取模块,用于获取所述目标帧图像在图像坐标系下的第七坐标;a first coordinate obtaining module, used for obtaining the seventh coordinate of the target frame image in the image coordinate system;
前向流坐标确定模块,用于对所述第三坐标和所述第七坐标执行对应元素作差的处理,得到第一前向流坐标;a forward flow coordinate determination module, configured to perform a difference processing of corresponding elements on the third coordinate and the seventh coordinate to obtain a first forward flow coordinate;
第二坐标获取模块,用于获取所述参考帧图像在图像坐标系下的第八坐标;A second coordinate obtaining module, configured to obtain the eighth coordinate of the reference frame image in the image coordinate system;
后向流坐标确定模块,用于对所述第六坐标和所述第八坐标执行对应元素作差的处理,得到第一后向流坐标;a backward flow coordinate determination module, configured to perform a difference processing of corresponding elements on the sixth coordinate and the eighth coordinate to obtain a first backward flow coordinate;
前向流坐标合成模块,用于以所述第三坐标作为网格点,采用双线性采样机制对所述第一后向流坐标进行仿射变换,以合成第二前向流坐标;a forward flow coordinate synthesis module, configured to use the third coordinate as a grid point and perform affine transformation on the first backward flow coordinate using a bilinear sampling mechanism to synthesize the second forward flow coordinate;
后向流坐标合成模块,用于以所述第六坐标作为网格点,采用双线性采样机制对所述第一前向流坐标进行仿射变换,以合成第二后向流坐标;a backward flow coordinate synthesis module, configured to use the sixth coordinate as a grid point, and use a bilinear sampling mechanism to perform affine transformation on the first forward flow coordinate to synthesize the second backward flow coordinate;
前向流遮挡掩码计算模块,用于根据所述第一前向流坐标和所述第二前向流坐标计算前向流遮挡掩码,所述前向流遮挡掩码用于衡量所述第一前向流坐标和所述第二前向流坐标之间的匹配程度;A forward flow occlusion mask calculation module, configured to calculate a forward flow occlusion mask according to the first forward flow coordinate and the second forward flow coordinate, and the forward flow occlusion mask is used to measure the the degree of matching between the first forward flow coordinates and the second forward flow coordinates;
后向流遮挡掩码计算模块,用于根据所述第一后向流坐标和所述第二后向流坐标计算后向流遮挡掩码,所述后向流遮挡掩码用于衡量所述第一后向流坐标和所述第二后向流坐标之间的匹配程度;A backward flow occlusion mask calculation module, configured to calculate a backward flow occlusion mask according to the first backward flow coordinate and the second backward flow coordinate, and the backward flow occlusion mask is used to measure the the degree of matching between the first backward flow coordinates and the second backward flow coordinates;
双向图像重建损失计算单元具体可以用于:根据所述第一图像重建损失、所述第二图像重建损失、所述前向流遮挡掩码和所述后向流遮挡掩码计算所述双向图像重建损失。The bidirectional image reconstruction loss calculation unit may be specifically configured to: calculate the bidirectional image according to the first image reconstruction loss, the second image reconstruction loss, the forward flow occlusion mask and the backward flow occlusion mask reconstruction loss.
在本申请的一个实施例中,所述装置还可以包括:In an embodiment of the present application, the apparatus may further include:
第一场景深度值确定模块,用于根据所述第二坐标确定所述目标帧图像的第一场景深度值;a first scene depth value determination module, configured to determine the first scene depth value of the target frame image according to the second coordinates;
第二场景深度值确定模块,用于根据所述第五坐标确定所述参考帧图像的第二场景深度值;A second scene depth value determining module, configured to determine the second scene depth value of the reference frame image according to the fifth coordinate;
第三场景深度值确定模块,用于获取所述第一场景深度图像中与所述第二坐标对应的像素点的第三场景深度值;A third scene depth value determination module, configured to obtain a third scene depth value of the pixel point corresponding to the second coordinate in the first scene depth image;
第四场景深度值确定模块,用于获取所述第二场景深度图像中与所述第五坐标对应的像素点的第四场景深度值;a fourth scene depth value determination module, configured to acquire the fourth scene depth value of the pixel point corresponding to the fifth coordinate in the second scene depth image;
第一场景深度值重建模块,用于基于所述第三坐标和所述第四场景深度值,通过双线性采样机制重建出所述目标帧图像的第五场景深度值;a first scene depth value reconstruction module, configured to reconstruct a fifth scene depth value of the target frame image through a bilinear sampling mechanism based on the third coordinate and the fourth scene depth value;
第二场景深度值重建模块,用于基于所述第六坐标和所述第三场景深度值,通过双线性采样机制重建出所述参考帧图像的第六场景深度值;A second scene depth value reconstruction module, configured to reconstruct the sixth scene depth value of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the third scene depth value;
前向场景结构一致性损失计算模块,用于根据所述第一场景深度值和所述第五场景深度值计算前向场景结构一致性损失,所述前向场景结构一致性损失用于衡量通过多视图几何变换计算得到的所述目标帧图像的场景深度值与重建出的所述目标帧图像的场景深度值之间的差异;The forward scene structure consistency loss calculation module is used to calculate the forward scene structure consistency loss according to the first scene depth value and the fifth scene depth value, and the forward scene structure consistency loss is used to measure the pass The difference between the scene depth value of the target frame image obtained by the multi-view geometric transformation calculation and the scene depth value of the reconstructed target frame image;
后向场景结构一致性损失计算模块,用于根据所述第二场景深度值和所述第六场景深度值计算后向场景结构一致性损失,所述后向场景结构一致性损失用于衡量通过多视图几何变换计算得到的所述参考帧图像的场景深度值与重建出的所述参考帧图像的场景深度值之间的差异;The backward scene structure consistency loss calculation module is used to calculate the backward scene structure consistency loss according to the second scene depth value and the sixth scene depth value, and the backward scene structure consistency loss is used to measure the pass the difference between the scene depth value of the reference frame image obtained by the multi-view geometric transformation calculation and the scene depth value of the reconstructed reference frame image;
双向场景结构一致性损失计算模块,用于根据所述前向场景结构一致性损失和所述后向场景结构一致性损失,计算双向场景结构一致性损失;a bidirectional scene structure consistency loss calculation module, configured to calculate the bidirectional scene structure consistency loss according to the forward scene structure consistency loss and the backward scene structure consistency loss;
所述目标函数构建单元具体可以用于:基于所述双向图像重建损失和所述双向场景结构一致性损失,构建得到所述目标函数。The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional scene structure consistency loss.
在本申请的一个实施例中,所述深度估计网络包括编码网络,所述装置还可以包括:In an embodiment of the present application, the depth estimation network includes an encoding network, and the apparatus may further include:
特征图像获取模块,用于通过所述编码网络获取所述目标帧图像的第一特征图像以及所述参考帧图像的第二特征图像;A feature image acquisition module, configured to obtain the first feature image of the target frame image and the second feature image of the reference frame image through the encoding network;
第一特征图像重建模块,用于基于所述第三坐标和所述第二特征图像,通过双线性采样机制重建出所述目标帧图像的第三特征图像;a first feature image reconstruction module, configured to reconstruct a third feature image of the target frame image through a bilinear sampling mechanism based on the third coordinates and the second feature image;
第二特征图像重建模块,用于基于所述第六坐标和所述第一特征图像,通过双线性采样机制重建出所述参考帧图像的第四特征图像;A second feature image reconstruction module, configured to reconstruct a fourth feature image of the reference frame image through a bilinear sampling mechanism based on the sixth coordinate and the first feature image;
双向特征感知损失计算模块,用于根据所述第一特征图像、所述第二特征图像、所述第三特征图像和所述第四特征图像,计算得到双向特征感知损失,所述双向特征感知损失用于衡量通过编码网络获得的所述目标帧图像的特征图像与重建出的所述目标帧图像的特征图像之间的差异,以及通过编码网络获得的所述参考帧图像的特征图像与重建出的所述参考帧图像的特征图像之间的差异;The bidirectional feature perception loss calculation module is configured to calculate the bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image and the fourth feature image, and the bidirectional feature perception loss The loss is used to measure the difference between the feature image of the target frame image obtained by the encoding network and the reconstructed feature image of the target frame image, and the feature image of the reference frame image obtained by the encoding network and the reconstructed image. the difference between the feature images of the reference frame image;
所述目标函数构建单元具体可以用于:基于所述双向图像重建损失和所述双向特征感知损失,构建得到所述目标函数。The objective function construction unit may be specifically configured to: construct and obtain the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
进一步的,所述装置还可以包括:Further, the device may also include:
平滑损失计算模块,用于根据所述目标帧图像、所述参考帧图像、所述第一场景深度图像、所述第二场景深度图像、所述第一特征图像和所述第二特征图像,计算得到平滑损失,所述平滑损失用于正则化通过所述深度估计网络获得的场景深度图像和特征图像的梯度;a smoothing loss calculation module, configured to, according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image and the second feature image, Calculate the smoothing loss, and the smoothing loss is used to regularize the gradient of the scene depth image and feature image obtained by the depth estimation network;
所述目标函数构建单元具体可以用于:基于所述双向图像重建损失、所述双向特征感知损失和所述平滑损失,构建得到所述目标函数。The objective function construction unit may be specifically configured to: construct the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss and the smoothing loss.
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如图1表示的任意一种图像场景深度的估计方法。Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, implements any method for estimating the depth of an image scene as shown in FIG. 1 . .
本申请实施例还提供一种计算机程序产品,当该计算机程序产品在终端设备上运行时,使得终端设备执行实现如图1表示的任意一种图像场景深度的估计方法。Embodiments of the present application also provide a computer program product, which, when the computer program product runs on a terminal device, enables the terminal device to execute any method for estimating the depth of an image scene as shown in FIG. 1 .
图8是本申请一实施例提供的终端设备的示意图。如图8所示,该实施例的终端设备8包括:处理器80、存储器81以及存储在所述存储器81中并可在所述处理器80上运行的计算机程序82。所述处理器80执行所述计算机程序82时实现上述各个图像场景深度的估计方法的实施例中的步骤,例如图1所示的步骤101至102。或者,所述处理器80执行所述计算机程序82时实现上述各 装置实施例中各模块/单元的功能,例如图7所示模块701至709的功能。FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 8 , the terminal device 8 of this embodiment includes: a processor 80 , a memory 81 , and a computer program 82 stored in the memory 81 and executable on the processor 80 . When the processor 80 executes the computer program 82, the steps in each of the above embodiments of the method for estimating the depth of an image scene are implemented, for example, steps 101 to 102 shown in FIG. 1 . Alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the above device embodiments, for example, the functions of the modules 701 to 709 shown in FIG. 7 are implemented.
所述计算机程序82可以被分割成一个或多个模块/单元,所述一个或者多个模块/单元被存储在所述存储器81中,并由所述处理器80执行,以完成本申请。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述所述计算机程序82在所述终端设备8中的执行过程。The computer program 82 may be divided into one or more modules/units, which are stored in the memory 81 and executed by the processor 80 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the terminal device 8 .
所称处理器80可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The so-called processor 80 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
所述存储器81可以是所述终端设备8的内部存储单元,例如终端设备8的硬盘或内存。所述存储器81也可以是所述终端设备8的外部存储设备,例如所述终端设备8上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,所述存储器81还可以既包括所述终端设备8的内部存储单元也包括外部存储设备。所述存储器81用于存储所述计算机程序以及所述终端设备所需的其他程序和数据。所述存储器81还可以用于暂时地存储已经输出或者将要输出的数据。The memory 81 may be an internal storage unit of the terminal device 8 , such as a hard disk or a memory of the terminal device 8 . The memory 81 can also be an external storage device of the terminal device 8, such as a plug-in hard disk equipped on the terminal device 8, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, Flash Card, etc. Further, the memory 81 may also include both an internal storage unit of the terminal device 8 and an external storage device. The memory 81 is used to store the computer program and other programs and data required by the terminal device. The memory 81 can also be used to temporarily store data that has been output or will be output.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上 述系统中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and simplicity of description, only the division of the above-mentioned functional units and modules is used as an example. Module completion, that is, dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above. Each functional unit and module in the embodiment may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit, and the above-mentioned integrated units may adopt hardware. It can also be realized in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing from each other, and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above-mentioned system, reference may be made to the corresponding process in the foregoing method embodiments, which will not be repeated here.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
在本申请所提供的实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的系统实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。In the embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the system embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be Incorporation may either be integrated into another system, or some features may be omitted, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application may also be implemented by a computer program instructing relevant hardware; the computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.
Claims (10)
- A method for estimating the depth of an image scene, characterized by comprising: acquiring an image to be tested; and inputting the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested; wherein the parameters of the depth estimation network are updated in the following manner: acquiring a sample image sequence, the sample image sequence comprising a target frame image and a reference frame image, the reference frame image being one or more frames in the sample image sequence before or after the target frame image; inputting the target frame image into the depth estimation network to obtain a predicted first scene depth image; inputting the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image; generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the intrinsic parameters of the camera used to capture the sample image sequence; calculating a first image reconstruction loss according to the target frame image and the first reconstructed image, the first image reconstruction loss being used to measure the difference between the target frame image and the first reconstructed image; constructing an objective function based on the first image reconstruction loss; and updating the parameters of the depth estimation network according to the objective function.
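For illustration only, not part of the claims: a minimal PyTorch-style sketch of the parameter-update procedure recited in claim 1. The function and variable names (`training_step`, `depth_net`, `pose_net`, `warp_to_target`), the L1 photometric term, and the optimizer interface are assumptions, not elements fixed by the application; `warp_to_target` is the view-reconstruction helper sketched after claim 3.

```python
import torch
import torch.nn.functional as F

def training_step(depth_net, pose_net, target, reference, K, optimizer, warp_to_target):
    """One parameter update of the depth estimation network from an unlabeled image pair."""
    depth_t = depth_net(target)                              # predicted first scene depth image
    pose = pose_net(torch.cat([target, reference], dim=1))   # camera pose vector between the frames

    # First reconstructed image: warp the reference frame into the target view
    # using the predicted depth, the pose vector and the camera intrinsics K.
    recon_t = warp_to_target(reference, depth_t, pose, K)

    # First image reconstruction loss: photometric difference between the
    # target frame and its reconstruction (L1 is an assumption).
    objective = F.l1_loss(recon_t, target)

    optimizer.zero_grad()
    objective.backward()
    optimizer.step()
    return objective.item()
```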
- The method according to claim 1, characterized in that, after acquiring the sample image sequence, the method further comprises: inputting the reference frame image into the depth estimation network to obtain a predicted second scene depth image; generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the intrinsic parameters of the camera used to capture the sample image sequence; and calculating a second image reconstruction loss according to the reference frame image and the second reconstructed image, the second image reconstruction loss being used to measure the difference between the reference frame image and the second reconstructed image; and the constructing an objective function based on the first image reconstruction loss comprises: calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss; and constructing the objective function based on the bidirectional image reconstruction loss.
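For illustration only: a sketch of how the two reconstruction losses of claim 2 could be combined into a bidirectional loss. The L1 photometric term and the unweighted sum are assumptions; the application does not fix the combination here.

```python
import torch.nn.functional as F

def bidirectional_image_reconstruction_loss(target, reference, recon_target, recon_reference):
    """Combine the target-side and reference-side reconstruction losses."""
    first_loss = F.l1_loss(recon_target, target)          # target frame vs. first reconstructed image
    second_loss = F.l1_loss(recon_reference, reference)   # reference frame vs. second reconstructed image
    return first_loss + second_loss
```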
- The method according to claim 2, characterized in that the generating a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the intrinsic parameters of the camera used to capture the sample image sequence comprises: determining, according to the camera pose vector, a first transformation matrix for transforming the target frame image to the reference frame image; calculating a first coordinate of the target frame image in the world coordinate system according to the intrinsic parameters of the camera and the first scene depth image; transforming the first coordinate based on the first transformation matrix to obtain a second coordinate of the transformed target frame image in the world coordinate system; converting the second coordinate into a third coordinate in the image coordinate system; and reconstructing, based on the reference frame image and with the third coordinate as grid points, an affine-transformed image of the reference frame image through a bilinear sampling mechanism, and determining the reconstructed image as the first reconstructed image; and the generating a second reconstructed image corresponding to the reference frame image according to the second scene depth image, the camera pose vector, the target frame image, and the intrinsic parameters of the camera used to capture the sample image sequence comprises: determining, according to the camera pose vector, a second transformation matrix for transforming the reference frame image to the target frame image; calculating a fourth coordinate of the reference frame image in the world coordinate system according to the intrinsic parameters of the camera and the second scene depth image; transforming the fourth coordinate based on the second transformation matrix to obtain a fifth coordinate of the transformed reference frame image in the world coordinate system; converting the fifth coordinate into a sixth coordinate in the image coordinate system; and reconstructing, based on the target frame image and with the sixth coordinate as grid points, an affine-transformed image of the target frame image through a bilinear sampling mechanism, and determining the reconstructed image as the second reconstructed image.
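For illustration only: a sketch of the coordinate chain in claim 3 (back-projection, rigid transformation, re-projection, bilinear sampling), assuming the transformation matrix has already been built from the pose vector. The function name, the padding and alignment settings, and PyTorch >= 1.10 (for the `indexing` argument of `meshgrid`) are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_view(src, depth, T, K):
    """Reconstruct one view from the other, following the coordinate chain of claim 3.
    src:   (B,3,H,W) frame to be resampled (e.g. the reference frame)
    depth: (B,1,H,W) predicted depth of the view being reconstructed (e.g. the target)
    T:     (B,4,4)   transformation matrix derived from the camera pose vector
    K:     (B,3,3)   camera intrinsic matrix
    """
    B, _, H, W = src.shape
    device = src.device

    # Homogeneous pixel grid of the view being reconstructed.
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                       # (B,3,H*W)

    # "First coordinate": back-project with K^-1 and the predicted depth.
    cam = torch.inverse(K) @ pix * depth.view(B, 1, -1)              # (B,3,H*W)
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)

    # "Second coordinate": apply the transformation matrix from the pose vector.
    cam_t = (T @ cam)[:, :3]                                         # (B,3,H*W)

    # "Third coordinate": project back onto the image plane of `src`.
    proj = K @ cam_t
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)

    # Use the projected coordinates as grid points for bilinear sampling.
    grid = torch.stack([2.0 * px / (W - 1) - 1.0,
                        2.0 * py / (H - 1) - 1.0], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```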
- The method according to claim 3, characterized by further comprising: acquiring a seventh coordinate of the target frame image in the image coordinate system; performing element-wise subtraction on the third coordinate and the seventh coordinate to obtain a first forward-flow coordinate; acquiring an eighth coordinate of the reference frame image in the image coordinate system; performing element-wise subtraction on the sixth coordinate and the eighth coordinate to obtain a first backward-flow coordinate; performing, with the third coordinate as grid points, an affine transformation on the first backward-flow coordinate through the bilinear sampling mechanism to synthesize a second forward-flow coordinate; performing, with the sixth coordinate as grid points, an affine transformation on the first forward-flow coordinate through the bilinear sampling mechanism to synthesize a second backward-flow coordinate; calculating a forward-flow occlusion mask according to the first forward-flow coordinate and the second forward-flow coordinate, the forward-flow occlusion mask being used to measure the degree of matching between the first forward-flow coordinate and the second forward-flow coordinate; and calculating a backward-flow occlusion mask according to the first backward-flow coordinate and the second backward-flow coordinate, the backward-flow occlusion mask being used to measure the degree of matching between the first backward-flow coordinate and the second backward-flow coordinate; and the calculating a bidirectional image reconstruction loss according to the first image reconstruction loss and the second image reconstruction loss comprises: calculating the bidirectional image reconstruction loss according to the first image reconstruction loss, the second image reconstruction loss, the forward-flow occlusion mask, and the backward-flow occlusion mask.
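For illustration only: a sketch in the spirit of the flow-consistency masks of claim 4. The round-trip check assumes the first flow and the synthesized opposite flow point in opposite directions, and the thresholds `alpha`/`beta` are common forward-backward-check values; both the sign convention and the thresholds are assumptions rather than details taken from the application.

```python
import torch
import torch.nn.functional as F

def synthesize_flow(opposite_flow, grid):
    """Synthesize the second forward (or backward) flow by bilinearly sampling the
    first backward (or forward) flow at the warped grid built from the third/sixth
    coordinate (grid is (B,H,W,2), normalised to [-1, 1])."""
    return F.grid_sample(opposite_flow, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def flow_occlusion_mask(first_flow, synthesized_flow, alpha=0.01, beta=0.5):
    """Mask that is 1 where the first flow and the synthesized flow are consistent.
    first_flow, synthesized_flow: (B,2,H,W); returns (B,1,H,W)."""
    diff = (first_flow + synthesized_flow).norm(dim=1, keepdim=True)
    bound = alpha * (first_flow.norm(dim=1, keepdim=True)
                     + synthesized_flow.norm(dim=1, keepdim=True)) + beta
    return (diff < bound).float()
```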
- The method according to claim 3, characterized by further comprising: determining a first scene depth value of the target frame image according to the second coordinate; determining a second scene depth value of the reference frame image according to the fifth coordinate; acquiring a third scene depth value of the pixel corresponding to the second coordinate in the first scene depth image; acquiring a fourth scene depth value of the pixel corresponding to the fifth coordinate in the second scene depth image; reconstructing a fifth scene depth value of the target frame image through the bilinear sampling mechanism based on the third coordinate and the fourth scene depth value; reconstructing a sixth scene depth value of the reference frame image through the bilinear sampling mechanism based on the sixth coordinate and the third scene depth value; calculating a forward scene-structure consistency loss according to the first scene depth value and the fifth scene depth value, the forward scene-structure consistency loss being used to measure the difference between the scene depth value of the target frame image obtained through multi-view geometric transformation and the reconstructed scene depth value of the target frame image; calculating a backward scene-structure consistency loss according to the second scene depth value and the sixth scene depth value, the backward scene-structure consistency loss being used to measure the difference between the scene depth value of the reference frame image obtained through multi-view geometric transformation and the reconstructed scene depth value of the reference frame image; and calculating a bidirectional scene-structure consistency loss according to the forward scene-structure consistency loss and the backward scene-structure consistency loss; and the constructing the objective function based on the bidirectional image reconstruction loss comprises: constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional scene-structure consistency loss.
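For illustration only: a sketch of one direction of the scene-structure consistency term of claim 5. The normalised absolute difference is an assumption; the application only states that the loss measures the difference between the geometrically computed depth and the bilinearly reconstructed depth.

```python
def scene_structure_consistency_loss(depth_geometry, depth_reconstructed, eps=1e-7):
    """Forward (or backward) scene-structure consistency loss.
    depth_geometry:      depth obtained through multi-view geometric transformation
                         (e.g. the z component of the transformed coordinate)
    depth_reconstructed: depth reconstructed from the other view's depth map
                         through the bilinear sampling mechanism"""
    num = (depth_geometry - depth_reconstructed).abs()
    den = (depth_geometry + depth_reconstructed).clamp(min=eps)
    return (num / den).mean()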
- The method according to any one of claims 3 to 5, characterized in that the depth estimation network comprises an encoding network, and the method further comprises: acquiring, through the encoding network, a first feature image of the target frame image and a second feature image of the reference frame image; reconstructing a third feature image of the target frame image through the bilinear sampling mechanism based on the third coordinate and the second feature image; reconstructing a fourth feature image of the reference frame image through the bilinear sampling mechanism based on the sixth coordinate and the first feature image; and calculating a bidirectional feature perception loss according to the first feature image, the second feature image, the third feature image, and the fourth feature image, the bidirectional feature perception loss being used to measure the difference between the feature image of the target frame image obtained through the encoding network and the reconstructed feature image of the target frame image, and the difference between the feature image of the reference frame image obtained through the encoding network and the reconstructed feature image of the reference frame image; and the constructing the objective function based on the bidirectional image reconstruction loss comprises: constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss.
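For illustration only: a sketch of the bidirectional feature perception loss of claim 6, assuming the sampling grids have been resized to the feature resolution and using an L1 distance; both choices are assumptions, not details fixed by the application.

```python
import torch.nn.functional as F

def bidirectional_feature_perception_loss(feat_target, feat_reference, grid_t2r, grid_r2t):
    """Compare encoder features with features reconstructed from the other view."""
    # Third feature image: reconstruct the target features from the reference features.
    recon_feat_target = F.grid_sample(feat_reference, grid_t2r, mode="bilinear",
                                      padding_mode="border", align_corners=True)
    # Fourth feature image: reconstruct the reference features from the target features.
    recon_feat_reference = F.grid_sample(feat_target, grid_r2t, mode="bilinear",
                                         padding_mode="border", align_corners=True)
    return (F.l1_loss(recon_feat_target, feat_target)
            + F.l1_loss(recon_feat_reference, feat_reference))
```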
- The method according to claim 6, characterized in that, after acquiring, through the encoding network, the first feature image of the target frame image and the second feature image of the reference frame image, the method further comprises: calculating a smoothing loss according to the target frame image, the reference frame image, the first scene depth image, the second scene depth image, the first feature image, and the second feature image, the smoothing loss being used to regularize the gradients of the scene depth images and the feature images obtained through the depth estimation network; and the constructing the objective function based on the bidirectional image reconstruction loss and the bidirectional feature perception loss comprises: constructing the objective function based on the bidirectional image reconstruction loss, the bidirectional feature perception loss, and the smoothing loss.
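For illustration only: a sketch of a smoothing term of the kind recited in claim 7, shown for one depth map (the same form can be applied to the feature images and the other frame). The edge-aware exponential weighting is a common choice and is an assumption here.

```python
import torch

def edge_aware_smoothness_loss(depth, image):
    """Penalise first-order gradients of the prediction, down-weighted where the
    corresponding image has strong edges.
    depth: (B,1,H,W), image: (B,3,H,W)"""
    d_dx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    d_dy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    i_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    i_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```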
- An apparatus for estimating the depth of an image scene, characterized by comprising: an image acquisition module, configured to acquire an image to be tested; a scene depth estimation module, configured to input the image to be tested into a pre-built depth estimation network to obtain a scene depth image of the image to be tested; a sample acquisition module, configured to acquire a sample image sequence, the sample image sequence comprising a target frame image and a reference frame image, the reference frame image being one or more frames in the sample image sequence before or after the target frame image; a first scene depth prediction module, configured to input the target frame image into the depth estimation network to obtain a predicted first scene depth image; a camera pose estimation module, configured to input the sample image sequence into a pre-built camera pose estimation network to obtain a predicted camera pose vector between the target frame image and the reference frame image; a first image reconstruction module, configured to generate a first reconstructed image corresponding to the target frame image according to the first scene depth image, the camera pose vector, the reference frame image, and the intrinsic parameters of the camera used to capture the sample image sequence; a first image reconstruction loss calculation module, configured to calculate a first image reconstruction loss according to the target frame image and the first reconstructed image, the first image reconstruction loss being used to measure the difference between the target frame image and the first reconstructed image; an objective function construction module, configured to construct an objective function based on the first image reconstruction loss; and a network parameter updating module, configured to update the parameters of the depth estimation network according to the objective function.
- A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the method for estimating the depth of an image scene according to any one of claims 1 to 7 is implemented.
- A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the method for estimating the depth of an image scene according to any one of claims 1 to 7 is implemented.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110346713.3 | 2021-03-31 | ||
CN202110346713.3A CN113160294B (en) | 2021-03-31 | 2021-03-31 | Image scene depth estimation method and device, terminal equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022206020A1 (en) | 2022-10-06 |
Family
ID=76885688
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/137609 WO2022206020A1 (en) | 2021-03-31 | 2021-12-13 | Method and apparatus for estimating depth of field of image, and terminal device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113160294B (en) |
WO (1) | WO2022206020A1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160294B (en) * | 2021-03-31 | 2022-12-23 | 中国科学院深圳先进技术研究院 | Image scene depth estimation method and device, terminal equipment and storage medium |
CN113592940B (en) * | 2021-07-28 | 2024-07-02 | 北京地平线信息技术有限公司 | Method and device for determining target object position based on image |
CN113592706B (en) * | 2021-07-28 | 2023-10-17 | 北京地平线信息技术有限公司 | Method and device for adjusting homography matrix parameters |
CN113792730B (en) * | 2021-08-17 | 2022-09-27 | 北京百度网讯科技有限公司 | Method and device for correcting document image, electronic equipment and storage medium |
CN114049388A (en) * | 2021-11-10 | 2022-02-15 | 北京地平线信息技术有限公司 | Image data processing method and device |
CN113793283B (en) * | 2021-11-15 | 2022-02-11 | 江苏游隼微电子有限公司 | Vehicle-mounted image noise reduction method |
CN114219900B (en) * | 2022-02-21 | 2022-07-01 | 北京影创信息科技有限公司 | Three-dimensional scene reconstruction method, reconstruction system and application based on mixed reality glasses |
CN114627006B (en) * | 2022-02-28 | 2022-12-20 | 复旦大学 | Progressive image restoration method based on depth decoupling network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150178900A1 (en) * | 2012-11-29 | 2015-06-25 | Korea Institute Of Science And Technology | Depth image processing apparatus and method based on camera pose conversion |
CN110503680A (en) * | 2019-08-29 | 2019-11-26 | 大连海事大学 | It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method |
CN110782490A (en) * | 2019-09-24 | 2020-02-11 | 武汉大学 | Video depth map estimation method and device with space-time consistency |
CN112819875A (en) * | 2021-02-03 | 2021-05-18 | 苏州挚途科技有限公司 | Monocular depth estimation method and device and electronic equipment |
CN113160294A (en) * | 2021-03-31 | 2021-07-23 | 中国科学院深圳先进技术研究院 | Image scene depth estimation method and device, terminal equipment and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102506959B1 (en) * | 2018-05-17 | 2023-03-07 | 나이앤틱, 인크. | Self-supervised training of depth estimation systems |
CN110490928B (en) * | 2019-07-05 | 2023-08-15 | 天津大学 | Camera attitude estimation method based on deep neural network |
CN111105451B (en) * | 2019-10-31 | 2022-08-05 | 武汉大学 | Driving scene binocular depth estimation method for overcoming occlusion effect |
US11157774B2 (en) * | 2019-11-14 | 2021-10-26 | Zoox, Inc. | Depth data model training with upsampling, losses, and loss balancing |
CN111311685B (en) * | 2020-05-12 | 2020-08-07 | 中国人民解放军国防科技大学 | Motion scene reconstruction unsupervised method based on IMU and monocular image |
CN111369608A (en) * | 2020-05-29 | 2020-07-03 | 南京晓庄学院 | Visual odometer method based on image depth estimation |
CN111783582A (en) * | 2020-06-22 | 2020-10-16 | 东南大学 | Unsupervised monocular depth estimation algorithm based on deep learning |
2021
- 2021-03-31 CN CN202110346713.3A patent/CN113160294B/en active Active
- 2021-12-13 WO PCT/CN2021/137609 patent/WO2022206020A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150178900A1 (en) * | 2012-11-29 | 2015-06-25 | Korea Institute Of Science And Technology | Depth image processing apparatus and method based on camera pose conversion |
CN110503680A (en) * | 2019-08-29 | 2019-11-26 | 大连海事大学 | It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method |
CN110782490A (en) * | 2019-09-24 | 2020-02-11 | 武汉大学 | Video depth map estimation method and device with space-time consistency |
CN112819875A (en) * | 2021-02-03 | 2021-05-18 | 苏州挚途科技有限公司 | Monocular depth estimation method and device and electronic equipment |
CN113160294A (en) * | 2021-03-31 | 2021-07-23 | 中国科学院深圳先进技术研究院 | Image scene depth estimation method and device, terminal equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
YIN ZHICHAO; SHI JIANPING: "GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 1983 - 1992, XP033476163, DOI: 10.1109/CVPR.2018.00212 * |
ZHOU TINGHUI; BROWN MATTHEW; SNAVELY NOAH; LOWE DAVID G.: "Unsupervised Learning of Depth and Ego-Motion from Video", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE COMPUTER SOCIETY, US, 21 July 2017 (2017-07-21), US , pages 6612 - 6619, XP033250026, ISSN: 1063-6919, DOI: 10.1109/CVPR.2017.700 * |
Also Published As
Publication number | Publication date |
---|---|
CN113160294B (en) | 2022-12-23 |
CN113160294A (en) | 2021-07-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022206020A1 (en) | Method and apparatus for estimating depth of field of image, and terminal device and storage medium | |
CN112001914B (en) | Depth image complement method and device | |
CN110501072B (en) | Reconstruction method of snapshot type spectral imaging system based on tensor low-rank constraint | |
CN110910437B (en) | Depth prediction method for complex indoor scene | |
CN114429555A (en) | Image density matching method, system, equipment and storage medium from coarse to fine | |
CN113962858A (en) | Multi-view depth acquisition method | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN114140623A (en) | Image feature point extraction method and system | |
JP2024510230A (en) | Multi-view neural human prediction using implicitly differentiable renderer for facial expression, body pose shape and clothing performance capture | |
CN111325828A (en) | Three-dimensional face acquisition method and device based on three-eye camera | |
CN112767467A (en) | Double-image depth estimation method based on self-supervision deep learning | |
Qin et al. | Depth estimation by parameter transfer with a lightweight model for single still images | |
CN116188550A (en) | Self-supervision depth vision odometer based on geometric constraint | |
Zhang et al. | End-to-end learning of self-rectification and self-supervised disparity prediction for stereo vision | |
CN111696167A (en) | Single image super-resolution reconstruction method guided by self-example learning | |
CN116934591A (en) | Image stitching method, device and equipment for multi-scale feature extraction and storage medium | |
CN114399547B (en) | Monocular SLAM robust initialization method based on multiframe | |
CN115375740A (en) | Pose determination method, three-dimensional model generation method, device, equipment and medium | |
CN114842066A (en) | Image depth recognition model training method, image depth recognition method and device | |
CN110689513B (en) | Color image fusion method and device and terminal equipment | |
CN113962846A (en) | Image alignment method and device, computer readable storage medium and electronic device | |
CN112634331A (en) | Optical flow prediction method and device | |
CN112131902A (en) | Closed loop detection method and device, storage medium and electronic equipment | |
US20230145048A1 (en) | Real-time body pose estimation system and method in unconstrained video | |
CN115457101B (en) | Edge-preserving multi-view depth estimation and ranging method for unmanned aerial vehicle platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21934656; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21934656; Country of ref document: EP; Kind code of ref document: A1 |