CN117808857A - Self-supervision 360-degree depth estimation method, device, equipment and medium - Google Patents

Self-supervision 360-degree depth estimation method, device, equipment and medium

Info

Publication number
CN117808857A
CN117808857A (application CN202410232514.3A)
Authority
CN
China
Prior art keywords
depth
domain
features
map
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410232514.3A
Other languages
Chinese (zh)
Other versions
CN117808857B (en)
Inventor
王旭
何紫嫣
张秋丹
江健民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202410232514.3A priority Critical patent/CN117808857B/en
Priority claimed from CN202410232514.3A external-priority patent/CN117808857B/en
Publication of CN117808857A publication Critical patent/CN117808857A/en
Application granted granted Critical
Publication of CN117808857B publication Critical patent/CN117808857B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a self-supervision 360-degree depth estimation method, a device, equipment and a medium. An ERP image is converted into TP images through E2P conversion; the minimally-distorted TP images among the generated TP images are input into a preset backbone network, and TP feature blocks are extracted at different scales; global features in the TP feature blocks are extracted according to a frequency domain spatial domain feature aggregation model and added to the original features to obtain aggregated features that aggregate the non-local information within each block; the aggregated features are respectively input into a TP domain depth decoder and an ERP domain depth decoder for decoding, yielding a TP domain depth map with a corresponding confidence map and an ERP domain depth map, and the TP domain depth map and the ERP domain depth map are fused to obtain a fusion map; spherical view synthesis is adopted to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map, obtaining a depth estimation map. According to the depth estimation method and device, the accuracy of depth estimation can be improved.

Description

Self-supervision 360-degree depth estimation method, device, equipment and medium
Technical Field
The invention relates to the technical field of image processing, in particular to a self-supervision 360-degree depth estimation method, device, equipment and medium.
Background
As an important task for understanding three-dimensional scenes, 360-degree panoramic depth estimation plays an important role in applications such as autonomous navigation, virtual reality and three-dimensional scene reconstruction. ERP projection, i.e. equirectangular projection, is the most commonly used panoramic image format because of its simple mapping relationship and its ability to capture a complete, continuous omnidirectional scene. However, it suffers from severe spherical distortion in the polar regions, and directly applying ordinary convolutions to ERP images causes a dramatic drop in model performance, so the accuracy of depth estimation is low.
Disclosure of Invention
In order to solve the problems, the invention provides a self-supervision 360-degree depth estimation method, a device, equipment and a medium, which improve the accuracy of depth estimation.
The embodiment of the invention provides a self-supervision 360-degree depth estimation method, which comprises the following steps:
converting the ERP image into a TP image through E2P conversion;
inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
adopting spherical view synthesis to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map to obtain a depth estimation map;
wherein N_patch is a positive integer.
Preferably, after extracting the global features in each of the N_patch TP feature blocks according to the preset frequency domain spatial domain feature aggregation model and adding them to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block, the method further comprises:
inputting the N_patch aggregated features into a Conv3D layer of a preset structural feature alignment model to reduce the number of feature channels;
processing the output features of the Conv3D layer with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks;
performing matrix multiplication between the obtained attention weights and the N_patch aggregated features respectively to obtain global geometric prior information;
recalibrating the importance of each channel of the global geometric prior information to obtain calibrated global geometric prior information;
integrating the calibrated global geometric prior information into the N_patch aggregated features by element-wise broadcast addition, and performing P2E conversion to obtain the N_patch aligned aggregated features.
Preferably, the extracting the global features in each of the N_patch TP feature blocks according to the preset frequency domain spatial domain feature aggregation model and adding them to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block comprises:
extracting TP features of a TP feature block according to a TP encoder in the frequency domain spatial domain feature aggregation model;
converting the extracted TP characteristics into a frequency domain through fast Fourier transformation to obtain frequency domain characteristics;
splicing the real part and the imaginary part of the obtained frequency domain features together in a feature dimension, and extracting the global context of each TP feature block by applying Conv2D blocks in the frequency domain;
converting the global context back to a space domain through inverse Fourier transform to obtain global features;
after the global features are spliced with the TP feature block, sending them to another Conv2D block to restore the number of feature channels, and generating the N_patch aggregated features.
As a preferred solution, the generating process of the depth map and the corresponding confidence map of the TP domain includes:
processing the N_patch aggregated features by the first unit, the second unit and the third unit of the TP domain depth decoder to obtain three decoding features respectively;
fusing the three decoding features to obtain depth decoding features, and up-sampling the depth decoding features to the same resolution as the TP feature blocks;
decoding the up-sampled features by two Conv3D layers to obtain a depth map and its corresponding confidence map respectively;
wherein the process by which each unit processes the N_patch aggregated features comprises:
up-sampling, by the up-sampling layer, the N_patch aggregated features to the same size as the features of the corresponding scale, and then inputting them into a first Conv3D block of the TP domain depth decoder to enrich their spatial feature representation; connecting the obtained features with the features of the corresponding scale to combine local details and semantic priors, obtaining cascaded features; and inputting the obtained cascaded features into a second Conv3D block of the TP domain depth decoder to reduce the number of feature channels, obtaining the decoding features.
Preferably, the depth map generating process of the ERP domain includes:
aligning the N_patch aggregated features by the feature alignment module in the ERP domain depth decoder to obtain output features;
converting the output features into the ERP domain by P2E conversion to obtain converted features;
inputting the converted features into the frequency domain attention block by the first decoding unit of the ERP domain depth decoder, and optimizing them using a self-attention mechanism;
processing the optimized features by a Conv2D layer with an ELU activation function, and up-sampling them to obtain depth features of the same size as the features of the corresponding scale;
receiving, by the fourth decoding unit of the ERP domain depth decoder, the multi-scale features and the depth features simultaneously as input, to obtain the decoded depth features;
inputting the decoded ERP domain depth features into a preset depth distribution classification module, and calculating a Laplacian mixture distribution over the median values of the discrete depth intervals to obtain the depth map of the ERP domain;
wherein the depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is the minimum threshold, and N_c is the number of depth intervals.
Preferably, the fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
As a preferred embodiment, the method further comprises:
Calculating a loss value of the depth estimation map based on a loss function constructed by spherical photometric loss, smooth loss, significant direction normal loss and coplanar loss, and correcting the depth estimation map according to the loss value;
wherein the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
The embodiment of the invention also provides a self-supervision 360-degree depth estimation device, which comprises:
the image conversion module is used for converting the ERP image into a TP image through E2P conversion;
an extraction module for inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
the aggregation module is used for extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
a decoding module for inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
The map generation module is used for respectively generating views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map by adopting spherical view synthesis to obtain a depth estimation map;
wherein N_patch is a positive integer.
Preferably, the apparatus further comprises: an alignment module for:
after the global features in each of the N_patch TP feature blocks are extracted according to the preset frequency domain spatial domain feature aggregation model and added to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block, inputting the N_patch aggregated features into a Conv3D layer of a preset structural feature alignment model to reduce the number of feature channels;
processing the output features of the Conv3D layer with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks;
performing matrix multiplication between the obtained attention weights and the N_patch aggregated features respectively to obtain global geometric prior information;
recalibrating the importance of each channel of the global geometric prior information to obtain calibrated global geometric prior information;
integrating the calibrated global geometric prior information into the N_patch aggregated features by element-wise broadcast addition, and performing P2E conversion to obtain the N_patch aligned aggregated features.
Preferably, the aggregation module is specifically configured to:
Extracting TP features of a TP feature block according to a TP encoder in the frequency domain spatial domain feature aggregation model;
converting the extracted TP characteristics into a frequency domain through fast Fourier transformation to obtain frequency domain characteristics;
splicing the real part and the imaginary part of the obtained frequency domain features together in a feature dimension, and extracting the global context of each TP feature block by applying Conv2D blocks in the frequency domain;
converting the global context back to a space domain through inverse Fourier transform to obtain global features;
after the global features are spliced with the TP feature block, sending them to another Conv2D block to restore the number of feature channels, and generating the N_patch aggregated features.
Preferably, the process of generating the depth map and the corresponding confidence map of the TP-domain by the decoding module includes:
processing the N_patch aggregated features by the first unit, the second unit and the third unit of the TP domain depth decoder to obtain three decoding features respectively;
fusing the three decoding features to obtain depth decoding features, and up-sampling the depth decoding features to the same resolution as the TP feature blocks;
decoding the up-sampled features by two Conv3D layers to obtain a depth map and its corresponding confidence map respectively;
wherein the process by which each unit processes the N_patch aggregated features comprises:
up-sampling, by the up-sampling layer, the N_patch aggregated features to the same size as the features of the corresponding scale, and then inputting them into a first Conv3D block of the TP domain depth decoder to enrich their spatial feature representation; connecting the obtained features with the features of the corresponding scale to combine local details and semantic priors, obtaining cascaded features; and inputting the obtained cascaded features into a second Conv3D block of the TP domain depth decoder to reduce the number of feature channels, obtaining the decoding features.
Preferably, the process of generating the depth map of the ERP domain by the decoding module includes:
aligning the N_patch aggregated features by the feature alignment module in the ERP domain depth decoder to obtain output features;
converting the output features into the ERP domain by P2E conversion to obtain converted features;
inputting the converted features into the frequency domain attention block by the first decoding unit of the ERP domain depth decoder, and optimizing them using a self-attention mechanism;
processing the optimized features by a Conv2D layer with an ELU activation function, and up-sampling them to obtain depth features of the same size as the features of the corresponding scale;
receiving, by the fourth decoding unit of the ERP domain depth decoder, the multi-scale features and the depth features simultaneously as input, to obtain the decoded depth features;
inputting the decoded ERP domain depth features into a preset depth distribution classification module, and calculating a Laplacian mixture distribution over the median values of the discrete depth intervals to obtain the depth map of the ERP domain;
wherein the depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is the minimum threshold, and N_c is the number of depth intervals.
Preferably, the fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
Preferably, the apparatus further comprises a loss calculation module for:
calculating a loss value of the depth estimation map based on a loss function constructed by spherical photometric loss, smooth loss, significant direction normal loss and coplanar loss, and correcting the depth estimation map according to the loss value;
wherein the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
The embodiment of the invention also provides a terminal device, which comprises a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements a self-supervision 360-degree depth estimation method according to any one of the embodiments.
The embodiment of the invention also provides a computer readable storage medium, which comprises a stored computer program, wherein the computer program controls equipment where the computer readable storage medium is located to execute the self-supervision 360-degree depth estimation method according to any one of the embodiments.
The invention provides a self-supervision 360-degree depth estimation method, a device, equipment and a medium. An ERP image is converted into TP images through E2P conversion; the N_patch generated TP images with minimal distortion are input into a preset backbone network, and N_patch TP feature blocks are extracted at different scales; the global features in each of the N_patch TP feature blocks are extracted according to a preset frequency domain spatial domain feature aggregation model and added to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block; the N_patch aggregated features are respectively input into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, obtaining a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and the depth map of the TP domain and the depth map of the ERP domain are fused to obtain a fusion map; spherical view synthesis is adopted to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map, obtaining a depth estimation map. According to the depth estimation method and device, the accuracy of depth estimation can be improved.
Drawings
FIG. 1 is a schematic flow chart of a self-supervised 360° depth estimation method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a self-supervised 360° depth estimation device according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, which is a flow chart of a self-supervised 360° depth estimation method according to an embodiment of the present invention, the method includes steps S1 to S5;
S1, converting an ERP image into TP images through E2P conversion;
S2, inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
S3, extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
S4, inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
S5, adopting spherical view synthesis to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map, to obtain a depth estimation map;
wherein N_patch is a positive integer.
In order to mitigate the effect of spherical distortion when this embodiment is implemented, the present application uses TP images to estimate 360° depth information. In order to address the main problem affecting fusion quality, namely the depth-prediction differences between TP blocks, an asymmetric dual-domain depth decoding module is developed to cooperatively learn the depth features of the TP and ERP domains so as to compensate for the differences between TP blocks. Specifically, the present application first converts the 360° image into N_patch less-distorted TP images through E2P conversion, and extracts N_patch TP feature blocks from them by adopting ResNet-34 with the last residual block removed as the encoder.
Monocular depth cues in the TP block features are explored using the designed FSFC module to obtain the depth features, namely, the global features in each of the N_patch TP feature blocks are extracted according to a preset frequency domain spatial domain feature aggregation model and added to the original features to obtain the aggregated features that aggregate the non-local information within each block.
The depth features are then fed into the dual-domain depth decoder to co-decode the depth information, i.e. the N_patch aggregated features are respectively input into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, obtaining the depth map D_TP of the TP domain with its corresponding confidence map C_TP and the depth map D_ERP of the ERP domain; the depth map of the TP domain and the depth map of the ERP domain are fused to obtain D_Fusion.
Views of new viewpoints are respectively generated from D_TP, D_ERP and D_Fusion by spherical view synthesis, obtaining the depth estimation map.
This scheme uses TP images to estimate 360° depth information, reducing the influence of spherical distortion. An asymmetric dual-domain depth decoding module cooperatively learns the depth features of the TP and ERP domains to compensate for the differences between TP blocks, addressing the main problem affecting fusion quality, namely the depth-prediction differences between TP blocks, and improving the accuracy of depth estimation.
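To make the spherical view synthesis used in step S5 concrete, the following is a minimal PyTorch-style sketch of warping a source ERP frame into the target viewpoint with a predicted depth map. The function name, the ERP longitude/latitude conventions and the pose parameterization are assumptions made for illustration only; the patent does not spell out its exact formulation, so this is a sketch of the general technique rather than the application's implementation.

```python
import math
import torch
import torch.nn.functional as F

def spherical_view_synthesis(src_erp, depth_tgt, T_tgt_to_src):
    """Warp a source ERP image into the target viewpoint using the target depth map.

    src_erp:      (B, 3, H, W) source equirectangular image
    depth_tgt:    (B, 1, H, W) predicted depth of the target view
    T_tgt_to_src: (B, 4, 4)    relative camera pose (target frame -> source frame)
    """
    B, _, H, W = depth_tgt.shape
    device = depth_tgt.device
    # 1. Spherical ray direction for every ERP pixel (longitude/latitude from pixel coords).
    u = (torch.arange(W, device=device) + 0.5) / W
    v = (torch.arange(H, device=device) + 0.5) / H
    lon = (u - 0.5) * 2.0 * math.pi
    lat = (0.5 - v) * math.pi
    lat, lon = torch.meshgrid(lat, lon, indexing="ij")
    rays = torch.stack((torch.cos(lat) * torch.sin(lon),
                        torch.sin(lat),
                        torch.cos(lat) * torch.cos(lon)), dim=0)          # (3, H, W)
    # 2. Back-project to 3D points in the target frame, then move them to the source frame.
    pts = (rays.unsqueeze(0) * depth_tgt).flatten(2)                      # (B, 3, H*W)
    pts = T_tgt_to_src[:, :3, :3] @ pts + T_tgt_to_src[:, :3, 3:]
    # 3. Re-project onto the source sphere -> ERP sampling grid in [-1, 1].
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    r = pts.norm(dim=1).clamp(min=1e-6)
    lon_s = torch.atan2(x, z) / math.pi
    lat_s = torch.asin((y / r).clamp(-1.0, 1.0)) / (math.pi / 2.0)
    grid = torch.stack((lon_s, -lat_s), dim=-1).view(B, H, W, 2)
    # 4. Sample the source ERP image at the re-projected locations.
    return F.grid_sample(src_erp, grid, padding_mode="border", align_corners=False)
```

The synthesized view can then be compared photometrically with the real target frame, which is how D_TP, D_ERP and D_Fusion each receive a self-supervision signal.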
In yet another embodiment provided by the present invention, the method further includes, after the step S3:
the N_patch aggregated features are input into a Conv3D layer of a preset structural feature alignment model to reduce the number of feature channels;
the output features of the Conv3D layer are processed with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks;
matrix multiplication is performed between the obtained attention weights and the N_patch aggregated features respectively to obtain global geometric prior information;
the importance of each channel of the global geometric prior information is recalibrated to obtain calibrated global geometric prior information;
the calibrated global geometric prior information is integrated into the N_patch aggregated features by element-wise broadcast addition, and P2E conversion is performed to obtain the N_patch aligned aggregated features.
In the implementation of this embodiment, the ERP image is converted into a plurality of TP image blocks for separate feature extraction, which can effectively avoid the influence of spherical distortion, but causes a non-negligible loss of global geometric consistency, especially in long-distance regions and weak-texture regions of the panoramic image. In this regard, the present application introduces a structural feature alignment model that captures inter-block geometric-consistency-related features through a simplified non-local attention mechanism and adds these aggregated global structural features to the aggregated features to capture long-range dependencies.
Specifically, the aggregated features are first input into a Conv3D layer to reduce the number of feature channels, and the result is then processed with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks. These attention weights are then matrix-multiplied with the aggregated features to obtain the global geometric prior used for modeling the global geometry information, namely G = F ⊗ Softmax(Conv3D(F)), where F denotes the aggregated features, G the global geometric prior, and ⊗ matrix multiplication.
the present application uses a two-layer bottleneck transformation to recalibrate without significantly increasing the number of parametersThe importance of each channel of (a) to obtain a calibrated output +.>Integrating global geometry prior information into aggregate feature by means of element broadcast addition>On, and P2E converting it to obtain aligned features +.>
The structural feature alignment module provided by the present application can adaptively emphasize global features shared among different TP blocks during training, improving the geometric structure consistency between blocks. The SFA module is introduced into the ERP domain decoder, so that the structural features of the TP blocks coming from the skip connections can be aligned before P2E conversion, enabling decoding that better satisfies global consistency.
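As a rough sketch of the simplified non-local attention described above, the module below pools a single global descriptor over all TP blocks, recalibrates it with a two-layer bottleneck and adds it back by broadcasting. The tensor layout (B, C, N_patch, h, w), the reduction ratio and the exact layer choices are assumptions, and the subsequent P2E conversion is omitted; it is an illustrative reading of the SFA idea, not the application's implementation.

```python
import torch
import torch.nn as nn

class StructuralFeatureAlignment(nn.Module):
    """Sketch of the SFA idea: simplified non-local attention across the N_patch TP blocks."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.to_weight = nn.Conv3d(channels, 1, kernel_size=1)   # reduce feature channels
        self.softmax = nn.Softmax(dim=-1)
        # two-layer bottleneck that recalibrates the importance of each channel
        self.bottleneck = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, feats):                 # feats: (B, C, N_patch, h, w)
        b, c, n, h, w = feats.shape
        # attention weight of every pixel of every TP block
        attn = self.softmax(self.to_weight(feats).view(b, 1, -1))           # (B, 1, N*h*w)
        # matrix multiplication with the features -> global geometric prior (B, C, 1, 1, 1)
        prior = torch.bmm(feats.view(b, c, -1), attn.transpose(1, 2)).view(b, c, 1, 1, 1)
        prior = self.bottleneck(prior)        # calibrated prior G'
        # broadcast addition integrates the prior into every patch feature
        return feats + prior
```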
In yet another embodiment of the present invention, the step S3 specifically includes:
Extracting TP features of a TP feature block according to a TP encoder in the frequency domain spatial domain feature aggregation model;
converting the extracted TP characteristics into a frequency domain through fast Fourier transformation to obtain frequency domain characteristics;
splicing the real part and the imaginary part of the obtained frequency domain features together in a feature dimension, and extracting the global context of each TP feature block by applying Conv2D blocks in the frequency domain;
converting the global context back to a space domain through inverse Fourier transform to obtain global features;
after the global features are spliced with the TP feature block, sending them to another Conv2D block to restore the number of feature channels, and generating the N_patch aggregated features.
In the implementation of the present embodiment, learning and utilizing monocular depth cues is of great importance in the depth estimation task. However, the limited receptive field of conventional fully convolutional models makes the model unable to obtain global information of an entire 360° scene, resulting in very useful monocular depth cues, such as occlusions, texture gradients and relative sizes, being destroyed. This phenomenon is particularly pronounced in the early layers of a fully convolutional network, as the convolution kernels in the shallow layers are typically smaller. Conventional fully convolutional networks typically extend the receptive field through extensive downsampling operations. However, excessive downsampling may lose local geometric detail, resulting in poor performance of the 360° depth estimation.
Fast Fourier convolution can utilize global context information in the shallow layers of the network and effectively solves the above-described problems. Thus, the present application embeds the fast Fourier convolution into the skip connections, building a frequency domain-spatial domain feature aggregation (FSFC) module that adds the non-local information within each TP block to the features of the TP image blocks in the early layers of the encoder. Features extracted from the TP encoder are first converted to the frequency domain by a Fast Fourier Transform (FFT); the real and imaginary parts of the resulting frequency domain features are then stitched together in the feature dimension, and Conv2D blocks are applied in the frequency domain to extract the global context of each TP feature block. These Fourier features are then transformed back into the spatial domain by the inverse transform, generating the global features. After these features are spliced with the input features, they are sent to another Conv2D block to restore the number of feature channels, generating the aggregated features (i.e. the global context is aggregated). In addition, the introduction of Fourier features can promote the rapid convergence of the network in predicting the low-frequency components of the depth map, thereby providing a prior for learning the high-frequency components in later training stages.
The non-local features within a single TP block are captured through frequency domain-spatial domain feature aggregation, which replaces a simple skip connection; this expands the receptive field of the shallow layers of the network and effectively explores monocular depth cues while preserving local details, thereby reducing the loss of detail caused by excessive downsampling.
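The following is a minimal sketch of such a fast-Fourier-convolution skip connection for a single TP feature block; the kernel sizes, normalization layers and the use of a real FFT (rfft2/irfft2) are assumptions for illustration, not the exact configuration of the FSFC module.

```python
import torch
import torch.nn as nn

class FSFCBlock(nn.Module):
    """Sketch of frequency-domain / spatial-domain feature aggregation on one TP feature block."""
    def __init__(self, channels):
        super().__init__()
        # Conv2D block applied in the frequency domain (real and imaginary parts stacked)
        self.freq_conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )
        # Conv2D block that restores the channel count after concatenation
        self.fuse_conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats):                                  # feats: (B, C, H, W)
        # 1. FFT over the spatial dimensions -> frequency-domain features
        spec = torch.fft.rfft2(feats, norm="ortho")
        # 2. Concatenate real and imaginary parts along the feature dimension
        spec_feats = torch.cat((spec.real, spec.imag), dim=1)
        # 3. Global context extracted by a Conv2D block in the frequency domain
        real, imag = torch.chunk(self.freq_conv(spec_feats), 2, dim=1)
        # 4. Inverse FFT -> global (non-local) features back in the spatial domain
        global_feats = torch.fft.irfft2(torch.complex(real, imag),
                                        s=feats.shape[-2:], norm="ortho")
        # 5. Splice with the original features and restore the number of channels
        return self.fuse_conv(torch.cat((feats, global_feats), dim=1))
```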
In yet another embodiment of the present invention, the generating process of the depth map and the corresponding confidence map of the TP domain includes:
processing the N_patch aggregated features by the first unit, the second unit and the third unit of the TP domain depth decoder to obtain three decoding features respectively;
fusing the three decoding features to obtain depth decoding features, and up-sampling the depth decoding features to the same resolution as the TP feature blocks;
decoding the up-sampled features by two Conv3D layers to obtain a depth map and its corresponding confidence map respectively;
wherein the process by which each unit processes the N_patch aggregated features comprises:
up-sampling, by the up-sampling layer, the N_patch aggregated features to the same size as the features of the corresponding scale, and then inputting them into a first Conv3D block of the TP domain depth decoder to enrich their spatial feature representation; connecting the obtained features with the features of the corresponding scale to combine local details and semantic priors, obtaining cascaded features; and inputting the obtained cascaded features into a second Conv3D block of the TP domain depth decoder to reduce the number of feature channels, obtaining the decoding features.
When this embodiment is implemented, the TP images can handle the irregular distortion in the 360° image, so decoding in the TP domain can better recover local detail information. In order to gradually decode the depth information of the TP domain, the present application develops a TP domain depth decoder consisting of three units, each unit containing an up-sampling layer and two Conv3D blocks. Taking the first unit as an example, the depth features are up-sampled to the same size as the aggregated features of the corresponding scale and then input into a first Conv3D block to enrich the spatial feature representation. The obtained features are then connected with the aggregated features of the corresponding scale, combining local details and semantic priors. The resulting concatenated features are then input into a second Conv3D block to reduce the number of feature channels. The decoding features of the first unit are thus obtained. The remaining two units perform the same operations as the first unit.
After these three units, the depth decoding features are obtained. Subsequently, taking into account the feature coherence between the TP block features, the decoding features are up-sampled to the same resolution as the input TP blocks, and the up-sampled features are decoded by two Conv3D layers to obtain the per-patch depth maps and their corresponding confidence maps. Note that when generating the confidence maps, a Sigmoid activation function needs to be added after the convolutional layer to limit the confidence weights between 0 and 1. Finally, through P2E transformation and element-wise product, the depth maps of the N_patch TP blocks and their corresponding confidence maps can be respectively combined into the depth map D_TP and the confidence map C_TP. This preserves the geometric details of the 360° image well.
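One possible shape of a single decoding unit is sketched below; the Conv3D kernel layout, the trilinear up-sampling and the channel counts are assumptions, and the two Conv3D prediction heads (with a Sigmoid on the confidence branch) that follow the three units are not shown.

```python
import torch
import torch.nn as nn

def conv3d_block(in_ch, out_ch):
    # Conv3D block used inside each decoding unit (layer layout assumed)
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        nn.BatchNorm3d(out_ch),
        nn.ELU(inplace=True),
    )

class TPDecodingUnit(nn.Module):
    """Sketch of one unit of the TP domain depth decoder.

    dec_feats:  (B, C_dec,  N_patch, h,  w)  features from the previous unit
    skip_feats: (B, C_skip, N_patch, 2h, 2w) aggregated features of the matching scale
    """
    def __init__(self, dec_channels, skip_channels, out_channels):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear",
                                    align_corners=False)
        self.enrich = conv3d_block(dec_channels, dec_channels)                   # 1st Conv3D block
        self.reduce = conv3d_block(dec_channels + skip_channels, out_channels)   # 2nd Conv3D block

    def forward(self, dec_feats, skip_feats):
        # up-sample to the size of the skip features and enrich the spatial representation
        x = self.enrich(self.upsample(dec_feats))
        # concatenate with the skip features: local details + semantic priors
        x = torch.cat((x, skip_feats), dim=1)
        # reduce the number of feature channels -> decoding features of this unit
        return self.reduce(x)
```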
In yet another embodiment of the present invention, the depth map generating process of the ERP domain includes:
Feature alignment module in ERP domain depth decoderN patch Individual aggregated featuresAligning to obtain output characteristics;
converting the output characteristics into the ERP domain by adopting P2E conversion to obtain conversion characteristics
Features are processed by a first decoding unit of the ERP domain depth decoderInputting the frequency domain attention block, and optimizing by using a self-attention mechanism;
conv2D layer processing optimization features adopting ELU activation function, and obtaining and features through upsamplingDepth features of the same size +.>
Simultaneous reception of multi-scale features by a fourth decoding unit of the ERP domain depth decoderAnd depth featuresAs input, get depth feature +.>
Decoded ERP domain depth featuresInputting the depth distribution into a preset depth distribution classification module, and calculating the Laplacian mixed distribution of the median value of the discrete interval to obtain a depth map of the ERP domain;
wherein the depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is the minimum threshold, and N_c is the number of depth intervals.
When this embodiment is implemented, exploring TP images can solve the problem of spherical distortion, but the depth differences and discontinuities between blocks seriously affect the merging quality of the depth map. In response to this problem, the present application proposes an ERP domain depth decoder consisting of an SFA module and four decoding units. Specifically, the present application first uses the SFA module to align the aggregated features, and the output features are then transformed into the ERP domain using the P2E transformation, resulting in the converted ERP domain features.
In addition, in order to mine long-range dependency and global context between ERP domain features, a frequency domain attention block is embedded in each decoding unit to study self-attention and cross-attention mechanisms on ERP domain features.
For the first decoding unit, the present application first inputs the converted ERP domain features into its frequency domain attention block and optimizes them using the self-attention mechanism. The optimized features are then processed by a Conv2D layer with an ELU activation function and up-sampled to obtain decoding features of the same size as the features of the next scale. Since the first decoding unit only accepts the converted ERP domain features as input, while the subsequent decoding units simultaneously receive the multi-scale features and the decoding features of the preceding unit as input, the frequency domain attention block of the first decoding unit performs only the self-attention mechanism, while the subsequent decoding units perform the cross-attention mechanism.
Depth value discretization is performed, converting the final regression problem into a classification problem. This can further accelerate the convergence of the model and avoid local overfitting. Specifically, the present application divides the effective depth range into discrete depth intervals whose median values are uniformly distributed in reciprocal (inverse depth) space, and calculates the actual median of each depth interval by accumulating the residuals predicted by the network. The decoded ERP domain depth features are input into the Depth Distribution Classification (DDC) module to calculate the Laplacian mixture distribution over the median values of the discrete intervals, so as to obtain the depth map D_ERP output by the ERP domain.
The depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is a manually set minimum threshold (set to 0.55 in the present application), and N_c is the number of depth intervals.
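The sketch below evaluates the Laplacian mixture over the interval medians as reconstructed above: per-pixel weights and scales define one Laplace component per interval, the densities at every candidate median are thresholded and normalized into probabilities, and the depth is their expectation. The tensor shapes and the exact point at which the minimum threshold is applied are assumptions consistent with that reading, not a verbatim implementation of the patent, and the loop-free form is written for clarity rather than memory efficiency.

```python
import torch

def ddc_depth(weights, scales, medians, eps=0.55):
    """Depth Distribution Classification sketch.

    weights: (B, N_c, H, W) per-pixel mixture weights w_j
    scales:  (B, N_c, H, W) per-pixel Laplace scales b_j (positive)
    medians: (N_c,)         actual median depth c_i of every discrete interval
    Returns the ERP domain depth map of shape (B, 1, H, W).
    """
    c_i = medians.view(1, -1, 1, 1, 1)                   # candidate medians, dim 1
    c_j = medians.view(1, 1, -1, 1, 1)                   # mixture centres,   dim 2
    w = weights.unsqueeze(1)                             # (B, 1, N_c, H, W)
    b = scales.unsqueeze(1).clamp(min=1e-6)
    # Laplace density of every candidate median under every mixture component
    density = w / (2.0 * b) * torch.exp(-(c_i - c_j).abs() / b)
    # minimum threshold per component, sum over components, then normalize (the constant Z)
    probs = density.clamp(min=eps).sum(dim=2)            # (B, N_c, H, W)
    probs = probs / probs.sum(dim=1, keepdim=True)
    # expected depth: probability-weighted sum of the interval medians
    return (probs * medians.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
```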
The present application provides an asymmetric dual-domain depth decoding module that not only makes the estimated depth map D_ERP smoother in the ERP domain, but also improves the depth consistency between different TP blocks in the TP domain through back propagation.
In a further embodiment provided by the present invention, the fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
In the implementation of this embodiment, in order to further exploit the advantages of both domains, the two depth maps are finally combined explicitly to generate the final depth map D_Fusion, which alleviates spherical distortion, inter-block differences and inconsistency; the depth maps of the TP domain and the ERP domain are combined to obtain the fusion map:
fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
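Under the reading of the fusion formula given above, the confidence-weighted combination reduces to a few lines; the function name and shapes are illustrative assumptions, and both depth maps are assumed to already be expressed in the ERP domain.

```python
import torch

def fuse_depth(d_tp, c_tp, d_erp):
    """Confidence-weighted fusion of the TP domain and ERP domain depth maps.

    d_tp, d_erp: (B, 1, H, W) depth maps expressed in the ERP domain
    c_tp:        (B, 1, H, W) TP domain confidence in [0, 1] (Sigmoid output after P2E)
    """
    c = c_tp.clamp(0.0, 1.0)
    # pixels the TP branch is confident about keep the detailed TP depth;
    # the remainder falls back to the smoother ERP domain prediction
    return c * d_tp + (1.0 - c) * d_erp
```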
In yet another embodiment provided by the present invention, the method further comprises:
calculating a loss value of the depth estimation map based on a loss function constructed by spherical photometric loss, smooth loss, significant direction normal loss and coplanar loss, and correcting the depth estimation map according to the loss value;
wherein the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
When this embodiment is implemented, a supervision signal is constructed based on the spherical photometric loss, the smoothness loss, the significant-direction normal loss and the coplanar loss, so as to train the model and correct the depth estimation map;
the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
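Assuming the loss structure reconstructed above (per-map loss = spherical photometric term plus weighted smoothness, significant-direction normal and coplanar terms, summed over the TP, ERP and fusion maps), the combination can be sketched as follows; the weight values are placeholders, not the ones used by the application.

```python
def total_loss(losses_per_map, lambdas=(0.1, 0.05, 0.05)):
    """Combine the per-map loss terms into the total training loss L.

    losses_per_map: dict mapping "TP" / "ERP" / "Fusion" to a dict with the four terms
                    "ph" (spherical photometric), "smooth", "normal" and "coplanar".
    lambdas:        the preset weights (lambda1, lambda2, lambda3); values illustrative.
    """
    l1, l2, l3 = lambdas
    total = 0.0
    for terms in losses_per_map.values():
        # L_x = L_ph + lambda1 * L_smooth + lambda2 * L_normal + lambda3 * L_coplanar
        total = total + terms["ph"] + l1 * terms["smooth"] \
                + l2 * terms["normal"] + l3 * terms["coplanar"]
    return total  # L = L_TP + L_ERP + L_Fusion
```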
The method provided by the present application has the fewest model parameters among current panoramic depth estimation models, while showing better performance than existing self-supervised methods.
In still another embodiment of the present invention, referring to fig. 2, a schematic structural diagram of a self-supervising 360 ° depth estimation device according to an embodiment of the present invention is provided, where the device includes:
the image conversion module is used for converting the ERP image into a TP image through E2P conversion;
an extraction module for inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
the aggregation module is used for extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
a decoding module for inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
the map generation module is used for respectively generating views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map by adopting spherical view synthesis to obtain a depth estimation map;
wherein N_patch is a positive integer.
The self-supervised 360 ° depth estimation device provided in this embodiment can perform all the steps and functions of the self-supervised 360 ° depth estimation method provided in any one of the above embodiments, and specific functions of the device are not described herein.
Referring to fig. 3, a schematic structural diagram of a terminal device according to an embodiment of the present invention. The terminal device includes: a processor, a memory, and a computer program stored in the memory and executable on the processor, such as a self-supervised 360 ° depth estimation program. The steps in each of the embodiments of the self-supervised 360 ° depth estimation method described above, such as steps S1 to S5 shown in fig. 1, are implemented when the processor executes the computer program. Alternatively, the processor may implement the functions of the modules in the above-described device embodiments when executing the computer program.
The computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to accomplish the present invention, for example. The one or more modules may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program in the one self-supervising 360 ° depth estimation device. For example, the computer program may be divided into modules, and specific functions of each module are described in detail in a self-supervised 360 ° depth estimation method provided in any of the above embodiments, and specific functions of the apparatus are not described herein.
The self-supervision 360-degree depth estimation device can be computing equipment such as a desktop computer, a notebook computer, a palm computer, a cloud server and the like. The self-supervising 360 ° depth estimation device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of a self-supervising 360 ° depth estimation apparatus, and is not limiting of a self-supervising 360 ° depth estimation apparatus, and may include more or fewer components than illustrated, or may combine certain components, or different components, e.g., the self-supervising 360 ° depth estimation apparatus may further include input and output devices, network access devices, buses, etc.
The processor may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, which is a control center of the self-supervising 360 ° depth estimation device, and various interfaces and lines are used to connect various parts of the entire self-supervising 360 ° depth estimation device.
The memory may be used to store the computer program and/or module, and the processor may implement the various functions of the self-supervising 360 ° depth estimation device by running or executing the computer program and/or module stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart Media Card (SMC), secure Digital (SD) Card, flash Card (Flash Card), at least one disk storage device, flash memory device, or other volatile solid-state storage device.
Wherein the module integrated with the self-supervising 360 ° depth estimation device may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a stand alone product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
It should be noted that modifications and adaptations to the invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (10)

1. A self-supervising 360 ° depth estimation method, the method comprising:
converting the ERP image into a TP image through E2P conversion;
inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
adopting spherical view synthesis to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map to obtain a depth estimation map;
wherein N_patch is a positive integer.
2. The method of claim 1, wherein after the global features in each of the N_patch TP feature blocks are extracted according to the preset frequency domain spatial domain feature aggregation model and added to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block, the method further comprises:
inputting the N_patch aggregated features into a Conv3D layer of a preset structural feature alignment model to reduce the number of feature channels;
processing the output features of the Conv3D layer with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks;
performing matrix multiplication between the obtained attention weights and the N_patch aggregated features respectively to obtain global geometric prior information;
recalibrating the importance of each channel of the global geometric prior information to obtain calibrated global geometric prior information;
integrating the calibrated global geometric prior information into the N_patch aggregated features by element-wise broadcast addition, and performing P2E conversion to obtain the N_patch aligned aggregated features.
3. The method of claim 1, wherein the extracting the global features in each of the N_patch TP feature blocks according to the preset frequency domain spatial domain feature aggregation model and adding them to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block comprises:
extracting TP features of a TP feature block according to a TP encoder in the frequency domain spatial domain feature aggregation model;
converting the extracted TP characteristics into a frequency domain through fast Fourier transformation to obtain frequency domain characteristics;
splicing the real part and the imaginary part of the obtained frequency domain features together in a feature dimension, and extracting the global context of each TP feature block by applying Conv2D blocks in the frequency domain;
converting the global context back to a space domain through inverse Fourier transform to obtain global features;
after the global features are spliced with the TP feature block, sending them to another Conv2D block to restore the number of feature channels, and generating the N_patch aggregated features.
4. The self-supervising 360 ° depth estimation method of claim 1, wherein the TP-domain depth map and corresponding confidence map generation process comprises:
processing the N_patch aggregated features by the first unit, the second unit and the third unit of the TP domain depth decoder to obtain three decoding features respectively;
fusing the three decoding features to obtain depth decoding features, and up-sampling the depth decoding features to the same resolution as the TP feature blocks;
decoding the up-sampled features by two Conv3D layers to obtain a depth map and its corresponding confidence map respectively;
wherein the process by which each unit processes the N_patch aggregated features comprises:
up-sampling, by the up-sampling layer, the N_patch aggregated features to the same size as the features of the corresponding scale, and then inputting them into a first Conv3D block of the TP domain depth decoder to enrich their spatial feature representation; connecting the obtained features with the features of the corresponding scale to combine local details and semantic priors, obtaining cascaded features; and inputting the obtained cascaded features into a second Conv3D block of the TP domain depth decoder to reduce the number of feature channels, obtaining the decoding features.
5. The self-supervising 360 ° depth estimation method according to claim 1, wherein the ERP domain depth map generation process comprises:
aligning the N_patch aggregated features by the feature alignment module in the ERP domain depth decoder to obtain output features;
converting the output features into the ERP domain by P2E conversion to obtain converted features;
inputting the converted features into the frequency domain attention block by the first decoding unit of the ERP domain depth decoder, and optimizing them using a self-attention mechanism;
processing the optimized features by a Conv2D layer with an ELU activation function, and up-sampling them to obtain depth features of the same size as the features of the corresponding scale;
receiving, by the fourth decoding unit of the ERP domain depth decoder, the multi-scale features and the depth features simultaneously as input, to obtain the decoded depth features;
inputting the decoded ERP domain depth features into a preset depth distribution classification module, and calculating a Laplacian mixture distribution over the median values of the discrete depth intervals to obtain the depth map of the ERP domain;
wherein the depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is the minimum threshold, and N_c is the number of depth intervals.
6. The self-supervised 360 ° depth estimation method of claim 1, wherein said fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
7. The self-supervising 360 ° depth estimation method according to claim 1, further comprising:
calculating a loss value of the depth estimation map based on a loss function constructed by spherical photometric loss, smooth loss, significant direction normal loss and coplanar loss, and correcting the depth estimation map according to the loss value;
wherein the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
8. A self-supervising 360 ° depth estimation device, the device comprising:
the image conversion module is used for converting the ERP image into a TP image through E2P conversion;
an extraction module for inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
the aggregation module is used for extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
a decoding module for inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
The map generation module is used for respectively generating views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map by adopting spherical view synthesis to obtain a depth estimation map;
wherein N_patch is a positive integer.
9. A terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the self-supervised 360 ° depth estimation method as claimed in any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program, when run, controls a device in which the computer readable storage medium is located to perform the self-supervised 360 ° depth estimation method as claimed in any one of claims 1 to 7.
CN202410232514.3A 2024-03-01 Self-supervision 360-degree depth estimation method, device, equipment and medium Active CN117808857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410232514.3A CN117808857B (en) 2024-03-01 Self-supervision 360-degree depth estimation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410232514.3A CN117808857B (en) 2024-03-01 Self-supervision 360-degree depth estimation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117808857A true CN117808857A (en) 2024-04-02
CN117808857B CN117808857B (en) 2024-05-24



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220198605A1 (en) * 2019-03-10 2022-06-23 Google Llc 360 Degree Wide-Angle Camera With Baseball Stitch
US20230186590A1 (en) * 2021-12-13 2023-06-15 Robert Bosch Gmbh Method for omnidirectional dense regression for machine perception tasks via distortion-free cnn and spherical self-attention
CN117036436A (en) * 2023-08-10 2023-11-10 福州大学 Monocular depth estimation method and system based on double encoder-decoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Yuru; ZHAO Haitao: "Scene depth estimation based on adaptive pixel-level attention model", 应用光学 (Journal of Applied Optics), no. 03, 15 May 2020 (2020-05-15), pages 64-73 *

Similar Documents

Publication Publication Date Title
CN112001914B (en) Depth image complement method and device
CN108694700B (en) System and method for deep learning image super-resolution
CN112308763A (en) Generating a composite digital image using a neural network with a dual stream encoder architecture
AU2019268184B2 (en) Precise and robust camera calibration
US20230177643A1 (en) Image super-resolution
CN113870104A (en) Super-resolution image reconstruction
CN113505848B (en) Model training method and device
WO2021164269A1 (en) Attention mechanism-based disparity map acquisition method and apparatus
US20230143452A1 (en) Method and apparatus for generating image, electronic device and storage medium
CN112308866A (en) Image processing method, image processing device, electronic equipment and storage medium
US20230153965A1 (en) Image processing method and related device
CN110827341A (en) Picture depth estimation method and device and storage medium
CN117094362B (en) Task processing method and related device
CN113902789A (en) Image feature processing method, depth image generating method, depth image processing apparatus, depth image generating medium, and device
CN112598673A (en) Panorama segmentation method, device, electronic equipment and computer readable medium
CN117808857B (en) Self-supervision 360-degree depth estimation method, device, equipment and medium
WO2023109086A1 (en) Character recognition method, apparatus and device, and storage medium
CN117808857A (en) Self-supervision 360-degree depth estimation method, device, equipment and medium
CN113807354B (en) Image semantic segmentation method, device, equipment and storage medium
CN114549322A (en) Image super-resolution method and device based on unsupervised field self-adaption
CN114519731A (en) Method and device for complementing depth image
CN114596203A (en) Method and apparatus for generating images and for training image generation models
CN111325068A (en) Video description method and device based on convolutional neural network
CN116980541B (en) Video editing method, device, electronic equipment and storage medium
Liu et al. MODE: Monocular omnidirectional depth estimation via consistent depth fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant