CN117808857A - Self-supervision 360-degree depth estimation method, device, equipment and medium - Google Patents

Self-supervision 360-degree depth estimation method, device, equipment and medium

Info

Publication number
CN117808857A
CN117808857A (application CN202410232514.3A)
Authority
CN
China
Prior art keywords
depth
domain
features
map
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410232514.3A
Other languages
Chinese (zh)
Other versions
CN117808857B (en)
Inventor
王旭
何紫嫣
张秋丹
江健民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202410232514.3A priority Critical patent/CN117808857B/en
Priority claimed from CN202410232514.3A external-priority patent/CN117808857B/en
Publication of CN117808857A publication Critical patent/CN117808857A/en
Application granted granted Critical
Publication of CN117808857B publication Critical patent/CN117808857B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a self-supervision 360-degree depth estimation method, a device, equipment and a medium. An ERP image is converted into TP images through E2P conversion; the minimally-distorted TP images among the generated TP images are input into a preset backbone network, and TP feature blocks are extracted at different scales; global features in the TP feature blocks are extracted according to a frequency domain spatial domain feature aggregation model and added to the original features to obtain aggregated features that aggregate the non-local information within each block; the aggregated features are respectively input into a TP domain depth decoder and an ERP domain depth decoder for decoding, yielding a TP domain depth map with a corresponding confidence map and an ERP domain depth map, and the TP domain depth map and the ERP domain depth map are fused to obtain a fusion map; spherical view synthesis is adopted to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map, obtaining a depth estimation map. According to the depth estimation method and device, the accuracy of depth estimation can be improved.

Description

Self-supervision 360-degree depth estimation method, device, equipment and medium
Technical Field
The invention relates to the technical field of image processing, in particular to a self-supervision 360-degree depth estimation method, device, equipment and medium.
Background
As an important task for understanding three-dimensional scenes, 360-degree panoramic depth estimation plays an important role in applications such as autonomous navigation, virtual reality and three-dimensional scene reconstruction. ERP projection, i.e. equirectangular projection, is the most commonly used panoramic image format because of its simple mapping relationship and its ability to capture a complete, continuous omnidirectional scene. However, it suffers from severe spherical distortion in the polar regions, and directly applying ordinary convolutions to ERP images causes a dramatic drop in model performance, so the accuracy of depth estimation is low.
Disclosure of Invention
In order to solve the problems, the invention provides a self-supervision 360-degree depth estimation method, a device, equipment and a medium, which improve the accuracy of depth estimation.
The embodiment of the invention provides a self-supervision 360-degree depth estimation method, which comprises the following steps:
converting the ERP image into a TP image through E2P conversion;
inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
adopting spherical view synthesis to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map to obtain a depth estimation map;
wherein N_patch is a positive integer.
Preferably, after extracting the global features in each of the N_patch TP feature blocks according to the preset frequency domain spatial domain feature aggregation model and adding them to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block, the method further comprises:
inputting the N_patch aggregated features into a Conv3D layer of a preset structural feature alignment model to reduce the number of feature channels;
processing the output features of the Conv3D layer with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks;
performing matrix multiplication between the obtained attention weights and the N_patch aggregated features respectively to obtain global geometric prior information;
recalibrating the importance of each channel of the global geometric prior information to obtain calibrated global geometric prior information;
integrating the calibrated global geometric prior information into the N_patch aggregated features by element-wise broadcast addition, and performing P2E conversion to obtain the N_patch aligned aggregated features.
Preferably, the extracting the global features in each of the N_patch TP feature blocks according to the preset frequency domain spatial domain feature aggregation model and adding them to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block comprises:
extracting TP features of a TP feature block according to a TP encoder in the frequency domain spatial domain feature aggregation model;
converting the extracted TP characteristics into a frequency domain through fast Fourier transformation to obtain frequency domain characteristics;
splicing the real part and the imaginary part of the obtained frequency domain features together in a feature dimension, and extracting the global context of each TP feature block by applying Conv2D blocks in the frequency domain;
converting the global context back to a space domain through inverse Fourier transform to obtain global features;
after the global features are spliced with the TP feature block, sending them to another Conv2D block to restore the number of feature channels, and generating the N_patch aggregated features.
As a preferred solution, the generating process of the depth map and the corresponding confidence map of the TP domain includes:
processing the N_patch aggregated features by the first unit, the second unit and the third unit of the TP domain depth decoder to obtain three decoding features respectively;
fusing the three decoding features to obtain depth decoding features, and up-sampling the depth decoding features to the same resolution as the TP feature blocks;
decoding the up-sampled features by two Conv3D layers to obtain a depth map and its corresponding confidence map respectively;
wherein the process by which each unit processes the N_patch aggregated features comprises:
up-sampling, by the up-sampling layer, the N_patch aggregated features to the same size as the features of the corresponding scale, and then inputting them into a first Conv3D block of the TP domain depth decoder to enrich their spatial feature representation; connecting the obtained features with the features of the corresponding scale to combine local details and semantic priors, obtaining cascaded features; and inputting the obtained cascaded features into a second Conv3D block of the TP domain depth decoder to reduce the number of feature channels, obtaining the decoding features.
Preferably, the depth map generating process of the ERP domain includes:
aligning the N_patch aggregated features by the feature alignment module in the ERP domain depth decoder to obtain output features;
converting the output features into the ERP domain by P2E conversion to obtain converted features;
inputting the converted features into the frequency domain attention block by the first decoding unit of the ERP domain depth decoder, and optimizing them using a self-attention mechanism;
processing the optimized features by a Conv2D layer with an ELU activation function, and up-sampling them to obtain depth features of the same size as the features of the corresponding scale;
receiving, by the fourth decoding unit of the ERP domain depth decoder, the multi-scale features and the depth features simultaneously as input, to obtain the decoded depth features;
inputting the decoded ERP domain depth features into a preset depth distribution classification module, and calculating a Laplacian mixture distribution over the median values of the discrete depth intervals to obtain the depth map of the ERP domain;
wherein the depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is the minimum threshold, and N_c is the number of depth intervals.
Preferably, the fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
As a preferred embodiment, the method further comprises:
Calculating a loss value of the depth estimation map based on a loss function constructed by spherical photometric loss, smooth loss, significant direction normal loss and coplanar loss, and correcting the depth estimation map according to the loss value;
wherein the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
The embodiment of the invention also provides a self-supervision 360-degree depth estimation device, which comprises:
the image conversion module is used for converting the ERP image into a TP image through E2P conversion;
an extraction module for inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
the aggregation module is used for extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
a decoding module for inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
The map generation module is used for respectively generating views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map by adopting spherical view synthesis to obtain a depth estimation map;
wherein N_patch is a positive integer.
Preferably, the apparatus further comprises: an alignment module for:
after the global features in each of the N_patch TP feature blocks are extracted according to the preset frequency domain spatial domain feature aggregation model and added to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block, inputting the N_patch aggregated features into a Conv3D layer of a preset structural feature alignment model to reduce the number of feature channels;
processing the output features of the Conv3D layer with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks;
performing matrix multiplication between the obtained attention weights and the N_patch aggregated features respectively to obtain global geometric prior information;
recalibrating the importance of each channel of the global geometric prior information to obtain calibrated global geometric prior information;
integrating the calibrated global geometric prior information into the N_patch aggregated features by element-wise broadcast addition, and performing P2E conversion to obtain the N_patch aligned aggregated features.
Preferably, the aggregation module is specifically configured to:
Extracting TP features of a TP feature block according to a TP encoder in the frequency domain spatial domain feature aggregation model;
converting the extracted TP characteristics into a frequency domain through fast Fourier transformation to obtain frequency domain characteristics;
splicing the real part and the imaginary part of the obtained frequency domain features together in a feature dimension, and extracting the global context of each TP feature block by applying Conv2D blocks in the frequency domain;
converting the global context back to a space domain through inverse Fourier transform to obtain global features;
after the global features are spliced with the TP feature block, sending them to another Conv2D block to restore the number of feature channels, and generating the N_patch aggregated features.
Preferably, the process of generating the depth map and the corresponding confidence map of the TP-domain by the decoding module includes:
processing the N_patch aggregated features by the first unit, the second unit and the third unit of the TP domain depth decoder to obtain three decoding features respectively;
fusing the three decoding features to obtain depth decoding features, and up-sampling the depth decoding features to the same resolution as the TP feature blocks;
decoding the up-sampled features by two Conv3D layers to obtain a depth map and its corresponding confidence map respectively;
wherein the process by which each unit processes the N_patch aggregated features comprises:
up-sampling, by the up-sampling layer, the N_patch aggregated features to the same size as the features of the corresponding scale, and then inputting them into a first Conv3D block of the TP domain depth decoder to enrich their spatial feature representation; connecting the obtained features with the features of the corresponding scale to combine local details and semantic priors, obtaining cascaded features; and inputting the obtained cascaded features into a second Conv3D block of the TP domain depth decoder to reduce the number of feature channels, obtaining the decoding features.
Preferably, the process of generating the depth map of the ERP domain by the decoding module includes:
aligning the N_patch aggregated features by the feature alignment module in the ERP domain depth decoder to obtain output features;
converting the output features into the ERP domain by P2E conversion to obtain converted features;
inputting the converted features into the frequency domain attention block by the first decoding unit of the ERP domain depth decoder, and optimizing them using a self-attention mechanism;
processing the optimized features by a Conv2D layer with an ELU activation function, and up-sampling them to obtain depth features of the same size as the features of the corresponding scale;
receiving, by the fourth decoding unit of the ERP domain depth decoder, the multi-scale features and the depth features simultaneously as input, to obtain the decoded depth features;
inputting the decoded ERP domain depth features into a preset depth distribution classification module, and calculating a Laplacian mixture distribution over the median values of the discrete depth intervals to obtain the depth map of the ERP domain;
wherein the depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is the minimum threshold, and N_c is the number of depth intervals.
Preferably, the fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
Preferably, the apparatus further comprises a loss calculation module for:
calculating a loss value of the depth estimation map based on a loss function constructed by spherical photometric loss, smooth loss, significant direction normal loss and coplanar loss, and correcting the depth estimation map according to the loss value;
wherein the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
The embodiment of the invention also provides a terminal device, which comprises a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the processor implements a self-supervision 360-degree depth estimation method according to any one of the embodiments.
The embodiment of the invention also provides a computer readable storage medium, which comprises a stored computer program, wherein the computer program controls equipment where the computer readable storage medium is located to execute the self-supervision 360-degree depth estimation method according to any one of the embodiments.
The invention provides a self-supervision 360-degree depth estimation method, a device, equipment and a medium. An ERP image is converted into TP images through E2P conversion; the N_patch generated TP images with minimal distortion are input into a preset backbone network, and N_patch TP feature blocks are extracted at different scales; the global features in each of the N_patch TP feature blocks are extracted according to a preset frequency domain spatial domain feature aggregation model and added to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block; the N_patch aggregated features are respectively input into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, obtaining a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and the depth map of the TP domain and the depth map of the ERP domain are fused to obtain a fusion map; spherical view synthesis is adopted to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map, obtaining a depth estimation map. According to the depth estimation method and device, the accuracy of depth estimation can be improved.
Drawings
FIG. 1 is a schematic flow chart of a self-supervised 360° depth estimation method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a self-supervised 360° depth estimation device according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, which is a flow chart of a self-supervised 360° depth estimation method according to an embodiment of the present invention, the method includes steps S1 to S5;
S1, converting an ERP image into TP images through E2P conversion;
S2, inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
S3, extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
S4, inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
S5, adopting spherical view synthesis to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map, to obtain a depth estimation map;
wherein N_patch is a positive integer.
In order to mitigate the effect of spherical distortion when this embodiment is implemented, the present application uses TP images to estimate 360° depth information. In order to address the main problem affecting fusion quality, namely the depth-prediction differences between TP blocks, an asymmetric dual-domain depth decoding module is developed to cooperatively learn the depth features of the TP and ERP domains so as to compensate for the differences between TP blocks. Specifically, the present application first converts the 360° image into N_patch less-distorted TP images through E2P conversion, and extracts N_patch TP feature blocks from them by adopting ResNet-34 with the last residual block removed as the encoder.
Monocular depth cues in the TP block features are explored using the designed FSFC module to obtain the depth features, namely, the global features in each of the N_patch TP feature blocks are extracted according to a preset frequency domain spatial domain feature aggregation model and added to the original features to obtain the aggregated features that aggregate the non-local information within each block.
The depth features are then fed into the dual-domain depth decoder to co-decode the depth information, i.e. the N_patch aggregated features are respectively input into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, obtaining the depth map D_TP of the TP domain with its corresponding confidence map C_TP and the depth map D_ERP of the ERP domain; the depth map of the TP domain and the depth map of the ERP domain are fused to obtain D_Fusion.
Views of new viewpoints are respectively generated from D_TP, D_ERP and D_Fusion by spherical view synthesis, obtaining the depth estimation map.
This scheme uses TP images to estimate 360° depth information, reducing the influence of spherical distortion. An asymmetric dual-domain depth decoding module cooperatively learns the depth features of the TP and ERP domains to compensate for the differences between TP blocks, addressing the main problem affecting fusion quality, namely the depth-prediction differences between TP blocks, and improving the accuracy of depth estimation.
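To make the spherical view synthesis used in step S5 concrete, the following is a minimal PyTorch-style sketch of warping a source ERP frame into the target viewpoint with a predicted depth map. The function name, the ERP longitude/latitude conventions and the pose parameterization are assumptions made for illustration only; the patent does not spell out its exact formulation, so this is a sketch of the general technique rather than the application's implementation.

```python
import math
import torch
import torch.nn.functional as F

def spherical_view_synthesis(src_erp, depth_tgt, T_tgt_to_src):
    """Warp a source ERP image into the target viewpoint using the target depth map.

    src_erp:      (B, 3, H, W) source equirectangular image
    depth_tgt:    (B, 1, H, W) predicted depth of the target view
    T_tgt_to_src: (B, 4, 4)    relative camera pose (target frame -> source frame)
    """
    B, _, H, W = depth_tgt.shape
    device = depth_tgt.device
    # 1. Spherical ray direction for every ERP pixel (longitude/latitude from pixel coords).
    u = (torch.arange(W, device=device) + 0.5) / W
    v = (torch.arange(H, device=device) + 0.5) / H
    lon = (u - 0.5) * 2.0 * math.pi
    lat = (0.5 - v) * math.pi
    lat, lon = torch.meshgrid(lat, lon, indexing="ij")
    rays = torch.stack((torch.cos(lat) * torch.sin(lon),
                        torch.sin(lat),
                        torch.cos(lat) * torch.cos(lon)), dim=0)          # (3, H, W)
    # 2. Back-project to 3D points in the target frame, then move them to the source frame.
    pts = (rays.unsqueeze(0) * depth_tgt).flatten(2)                      # (B, 3, H*W)
    pts = T_tgt_to_src[:, :3, :3] @ pts + T_tgt_to_src[:, :3, 3:]
    # 3. Re-project onto the source sphere -> ERP sampling grid in [-1, 1].
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    r = pts.norm(dim=1).clamp(min=1e-6)
    lon_s = torch.atan2(x, z) / math.pi
    lat_s = torch.asin((y / r).clamp(-1.0, 1.0)) / (math.pi / 2.0)
    grid = torch.stack((lon_s, -lat_s), dim=-1).view(B, H, W, 2)
    # 4. Sample the source ERP image at the re-projected locations.
    return F.grid_sample(src_erp, grid, padding_mode="border", align_corners=False)
```

The synthesized view can then be compared photometrically with the real target frame, which is how D_TP, D_ERP and D_Fusion each receive a self-supervision signal.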
In yet another embodiment provided by the present invention, the method further includes, after the step S3:
the N_patch aggregated features are input into a Conv3D layer of a preset structural feature alignment model to reduce the number of feature channels;
the output features of the Conv3D layer are processed with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks;
matrix multiplication is performed between the obtained attention weights and the N_patch aggregated features respectively to obtain global geometric prior information;
the importance of each channel of the global geometric prior information is recalibrated to obtain calibrated global geometric prior information;
the calibrated global geometric prior information is integrated into the N_patch aggregated features by element-wise broadcast addition, and P2E conversion is performed to obtain the N_patch aligned aggregated features.
In the implementation of this embodiment, the ERP image is converted into a plurality of TP image blocks for separate feature extraction, which can effectively avoid the influence of spherical distortion, but causes a non-negligible loss of global geometric consistency, especially in long-distance regions and weak-texture regions of the panoramic image. In this regard, the present application introduces a structural feature alignment model that captures inter-block geometric-consistency-related features through a simplified non-local attention mechanism and adds these aggregated global structural features to the aggregated features to capture long-range dependencies.
Specifically, the aggregated features are first input into a Conv3D layer to reduce the number of feature channels, and the result is then processed with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks. These attention weights are then matrix-multiplied with the aggregated features to obtain the global geometric prior used for modeling the global geometry information, namely G = F ⊗ Softmax(Conv3D(F)), where F denotes the aggregated features, G the global geometric prior, and ⊗ matrix multiplication.
the present application uses a two-layer bottleneck transformation to recalibrate without significantly increasing the number of parametersThe importance of each channel of (a) to obtain a calibrated output +.>Integrating global geometry prior information into aggregate feature by means of element broadcast addition>On, and P2E converting it to obtain aligned features +.>
The structural feature alignment module provided by the present application can adaptively emphasize global features shared among different TP blocks during training, improving the geometric structure consistency between blocks. The SFA module is introduced into the ERP domain decoder, so that the structural features of the TP blocks coming from the skip connections can be aligned before P2E conversion, enabling decoding that better satisfies global consistency.
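As a rough sketch of the simplified non-local attention described above, the module below pools a single global descriptor over all TP blocks, recalibrates it with a two-layer bottleneck and adds it back by broadcasting. The tensor layout (B, C, N_patch, h, w), the reduction ratio and the exact layer choices are assumptions, and the subsequent P2E conversion is omitted; it is an illustrative reading of the SFA idea, not the application's implementation.

```python
import torch
import torch.nn as nn

class StructuralFeatureAlignment(nn.Module):
    """Sketch of the SFA idea: simplified non-local attention across the N_patch TP blocks."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.to_weight = nn.Conv3d(channels, 1, kernel_size=1)   # reduce feature channels
        self.softmax = nn.Softmax(dim=-1)
        # two-layer bottleneck that recalibrates the importance of each channel
        self.bottleneck = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, feats):                 # feats: (B, C, N_patch, h, w)
        b, c, n, h, w = feats.shape
        # attention weight of every pixel of every TP block
        attn = self.softmax(self.to_weight(feats).view(b, 1, -1))           # (B, 1, N*h*w)
        # matrix multiplication with the features -> global geometric prior (B, C, 1, 1, 1)
        prior = torch.bmm(feats.view(b, c, -1), attn.transpose(1, 2)).view(b, c, 1, 1, 1)
        prior = self.bottleneck(prior)        # calibrated prior G'
        # broadcast addition integrates the prior into every patch feature
        return feats + prior
```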
In yet another embodiment of the present invention, the step S3 specifically includes:
Extracting TP features of a TP feature block according to a TP encoder in the frequency domain spatial domain feature aggregation model;
converting the extracted TP characteristics into a frequency domain through fast Fourier transformation to obtain frequency domain characteristics;
splicing the real part and the imaginary part of the obtained frequency domain features together in a feature dimension, and extracting the global context of each TP feature block by applying Conv2D blocks in the frequency domain;
converting the global context back to a space domain through inverse Fourier transform to obtain global features;
after the global features are spliced with the TP feature block, sending them to another Conv2D block to restore the number of feature channels, and generating the N_patch aggregated features.
In the implementation of the present embodiment, learning and utilizing monocular depth cues is of great importance in the depth estimation task. However, the limited receptive field of conventional fully convolutional models makes the model unable to obtain global information of an entire 360° scene, resulting in very useful monocular depth cues, such as occlusions, texture gradients and relative sizes, being destroyed. This phenomenon is particularly pronounced in the early layers of a fully convolutional network, as the convolution kernels in the shallow layers are typically smaller. Conventional fully convolutional networks typically extend the receptive field through extensive downsampling operations. However, excessive downsampling may lose local geometric detail, resulting in poor performance of the 360° depth estimation.
Fast Fourier convolution can utilize global context information in the shallow layers of the network and effectively solves the above-described problems. Thus, the present application embeds the fast Fourier convolution into the skip connections, building a frequency domain-spatial domain feature aggregation (FSFC) module that adds the non-local information within each TP block to the features of the TP image blocks in the early layers of the encoder. Features extracted from the TP encoder are first converted to the frequency domain by a Fast Fourier Transform (FFT); the real and imaginary parts of the resulting frequency domain features are then stitched together in the feature dimension, and Conv2D blocks are applied in the frequency domain to extract the global context of each TP feature block. These Fourier features are then transformed back into the spatial domain by the inverse transform, generating the global features. After these features are spliced with the input features, they are sent to another Conv2D block to restore the number of feature channels, generating the aggregated features (i.e. the global context is aggregated). In addition, the introduction of Fourier features can promote the rapid convergence of the network in predicting the low-frequency components of the depth map, thereby providing a prior for learning the high-frequency components in later training stages.
The non-local features within a single TP block are captured through frequency domain-spatial domain feature aggregation, which replaces a simple skip connection; this expands the receptive field of the shallow layers of the network and effectively explores monocular depth cues while preserving local details, thereby reducing the loss of detail caused by excessive downsampling.
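The following is a minimal sketch of such a fast-Fourier-convolution skip connection for a single TP feature block; the kernel sizes, normalization layers and the use of a real FFT (rfft2/irfft2) are assumptions for illustration, not the exact configuration of the FSFC module.

```python
import torch
import torch.nn as nn

class FSFCBlock(nn.Module):
    """Sketch of frequency-domain / spatial-domain feature aggregation on one TP feature block."""
    def __init__(self, channels):
        super().__init__()
        # Conv2D block applied in the frequency domain (real and imaginary parts stacked)
        self.freq_conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )
        # Conv2D block that restores the channel count after concatenation
        self.fuse_conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats):                                  # feats: (B, C, H, W)
        # 1. FFT over the spatial dimensions -> frequency-domain features
        spec = torch.fft.rfft2(feats, norm="ortho")
        # 2. Concatenate real and imaginary parts along the feature dimension
        spec_feats = torch.cat((spec.real, spec.imag), dim=1)
        # 3. Global context extracted by a Conv2D block in the frequency domain
        real, imag = torch.chunk(self.freq_conv(spec_feats), 2, dim=1)
        # 4. Inverse FFT -> global (non-local) features back in the spatial domain
        global_feats = torch.fft.irfft2(torch.complex(real, imag),
                                        s=feats.shape[-2:], norm="ortho")
        # 5. Splice with the original features and restore the number of channels
        return self.fuse_conv(torch.cat((feats, global_feats), dim=1))
```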
In yet another embodiment of the present invention, the generating process of the depth map and the corresponding confidence map of the TP domain includes:
processing the N_patch aggregated features by the first unit, the second unit and the third unit of the TP domain depth decoder to obtain three decoding features respectively;
fusing the three decoding features to obtain depth decoding features, and up-sampling the depth decoding features to the same resolution as the TP feature blocks;
decoding the up-sampled features by two Conv3D layers to obtain a depth map and its corresponding confidence map respectively;
wherein the process by which each unit processes the N_patch aggregated features comprises:
up-sampling, by the up-sampling layer, the N_patch aggregated features to the same size as the features of the corresponding scale, and then inputting them into a first Conv3D block of the TP domain depth decoder to enrich their spatial feature representation; connecting the obtained features with the features of the corresponding scale to combine local details and semantic priors, obtaining cascaded features; and inputting the obtained cascaded features into a second Conv3D block of the TP domain depth decoder to reduce the number of feature channels, obtaining the decoding features.
When this embodiment is implemented, the TP images can handle the irregular distortion in the 360° image, so decoding in the TP domain can better recover local detail information. In order to gradually decode the depth information of the TP domain, the present application develops a TP domain depth decoder consisting of three units, each unit containing an up-sampling layer and two Conv3D blocks. Taking the first unit as an example, the depth features are up-sampled to the same size as the aggregated features of the corresponding scale and then input into a first Conv3D block to enrich the spatial feature representation. The obtained features are then connected with the aggregated features of the corresponding scale, combining local details and semantic priors. The resulting concatenated features are then input into a second Conv3D block to reduce the number of feature channels. The decoding features of the first unit are thus obtained. The remaining two units perform the same operations as the first unit.
After these three units, the depth decoding features are obtained. Subsequently, taking into account the feature coherence between the TP block features, the decoding features are up-sampled to the same resolution as the input TP blocks, and the up-sampled features are decoded by two Conv3D layers to obtain the per-patch depth maps and their corresponding confidence maps. Note that when generating the confidence maps, a Sigmoid activation function needs to be added after the convolutional layer to limit the confidence weights between 0 and 1. Finally, through P2E transformation and element-wise product, the depth maps of the N_patch TP blocks and their corresponding confidence maps can be respectively combined into the depth map D_TP and the confidence map C_TP. This preserves the geometric details of the 360° image well.
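One possible shape of a single decoding unit is sketched below; the Conv3D kernel layout, the trilinear up-sampling and the channel counts are assumptions, and the two Conv3D prediction heads (with a Sigmoid on the confidence branch) that follow the three units are not shown.

```python
import torch
import torch.nn as nn

def conv3d_block(in_ch, out_ch):
    # Conv3D block used inside each decoding unit (layer layout assumed)
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
        nn.BatchNorm3d(out_ch),
        nn.ELU(inplace=True),
    )

class TPDecodingUnit(nn.Module):
    """Sketch of one unit of the TP domain depth decoder.

    dec_feats:  (B, C_dec,  N_patch, h,  w)  features from the previous unit
    skip_feats: (B, C_skip, N_patch, 2h, 2w) aggregated features of the matching scale
    """
    def __init__(self, dec_channels, skip_channels, out_channels):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear",
                                    align_corners=False)
        self.enrich = conv3d_block(dec_channels, dec_channels)                   # 1st Conv3D block
        self.reduce = conv3d_block(dec_channels + skip_channels, out_channels)   # 2nd Conv3D block

    def forward(self, dec_feats, skip_feats):
        # up-sample to the size of the skip features and enrich the spatial representation
        x = self.enrich(self.upsample(dec_feats))
        # concatenate with the skip features: local details + semantic priors
        x = torch.cat((x, skip_feats), dim=1)
        # reduce the number of feature channels -> decoding features of this unit
        return self.reduce(x)
```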
In yet another embodiment of the present invention, the depth map generating process of the ERP domain includes:
Feature alignment module in ERP domain depth decoderN patch Individual aggregated featuresAligning to obtain output characteristics;
converting the output characteristics into the ERP domain by adopting P2E conversion to obtain conversion characteristics
Features are processed by a first decoding unit of the ERP domain depth decoderInputting the frequency domain attention block, and optimizing by using a self-attention mechanism;
conv2D layer processing optimization features adopting ELU activation function, and obtaining and features through upsamplingDepth features of the same size +.>
Simultaneous reception of multi-scale features by a fourth decoding unit of the ERP domain depth decoderAnd depth featuresAs input, get depth feature +.>
Decoded ERP domain depth featuresInputting the depth distribution into a preset depth distribution classification module, and calculating the Laplacian mixed distribution of the median value of the discrete interval to obtain a depth map of the ERP domain;
wherein the depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is the minimum threshold, and N_c is the number of depth intervals.
When this embodiment is implemented, exploring TP images can solve the problem of spherical distortion, but the depth differences and discontinuities between blocks seriously affect the merging quality of the depth map. In response to this problem, the present application proposes an ERP domain depth decoder consisting of an SFA module and four decoding units. Specifically, the present application first uses the SFA module to align the aggregated features, and the output features are then transformed into the ERP domain using the P2E transformation, resulting in the converted ERP domain features.
In addition, in order to mine long-range dependency and global context between ERP domain features, a frequency domain attention block is embedded in each decoding unit to study self-attention and cross-attention mechanisms on ERP domain features.
For the first decoding unit, the present application first inputs the converted ERP domain features into its frequency domain attention block and optimizes them using the self-attention mechanism. The optimized features are then processed by a Conv2D layer with an ELU activation function and up-sampled to obtain decoding features of the same size as the features of the next scale. Since the first decoding unit only accepts the converted ERP domain features as input, while the subsequent decoding units simultaneously receive the multi-scale features and the decoding features of the preceding unit as input, the frequency domain attention block of the first decoding unit performs only the self-attention mechanism, while the subsequent decoding units perform the cross-attention mechanism.
Depth value discretization is performed, converting the final regression problem into a classification problem. This can further accelerate the convergence of the model and avoid local overfitting. Specifically, the present application divides the effective depth range into discrete depth intervals whose median values are uniformly distributed in reciprocal (inverse depth) space, and calculates the actual median of each depth interval by accumulating the residuals predicted by the network. The decoded ERP domain depth features are input into the Depth Distribution Classification (DDC) module to calculate the Laplacian mixture distribution over the median values of the discrete intervals, so as to obtain the depth map D_ERP output by the ERP domain.
The depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is a manually set minimum threshold (set to 0.55 in the present application), and N_c is the number of depth intervals.
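The sketch below evaluates the Laplacian mixture over the interval medians as reconstructed above: per-pixel weights and scales define one Laplace component per interval, the densities at every candidate median are thresholded and normalized into probabilities, and the depth is their expectation. The tensor shapes and the exact point at which the minimum threshold is applied are assumptions consistent with that reading, not a verbatim implementation of the patent, and the loop-free form is written for clarity rather than memory efficiency.

```python
import torch

def ddc_depth(weights, scales, medians, eps=0.55):
    """Depth Distribution Classification sketch.

    weights: (B, N_c, H, W) per-pixel mixture weights w_j
    scales:  (B, N_c, H, W) per-pixel Laplace scales b_j (positive)
    medians: (N_c,)         actual median depth c_i of every discrete interval
    Returns the ERP domain depth map of shape (B, 1, H, W).
    """
    c_i = medians.view(1, -1, 1, 1, 1)                   # candidate medians, dim 1
    c_j = medians.view(1, 1, -1, 1, 1)                   # mixture centres,   dim 2
    w = weights.unsqueeze(1)                             # (B, 1, N_c, H, W)
    b = scales.unsqueeze(1).clamp(min=1e-6)
    # Laplace density of every candidate median under every mixture component
    density = w / (2.0 * b) * torch.exp(-(c_i - c_j).abs() / b)
    # minimum threshold per component, sum over components, then normalize (the constant Z)
    probs = density.clamp(min=eps).sum(dim=2)            # (B, N_c, H, W)
    probs = probs / probs.sum(dim=1, keepdim=True)
    # expected depth: probability-weighted sum of the interval medians
    return (probs * medians.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
```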
The present application provides an asymmetric dual-domain depth decoding module that not only makes the estimated depth map D_ERP smoother in the ERP domain, but also improves the depth consistency between different TP blocks in the TP domain through back propagation.
In a further embodiment provided by the present invention, the fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
In the implementation of this embodiment, in order to further exploit the advantages of both domains, the two depth maps are finally combined explicitly to generate the final depth map D_Fusion, which alleviates spherical distortion, inter-block differences and inconsistency; the depth maps of the TP domain and the ERP domain are combined to obtain the fusion map:
fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
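Under the reading of the fusion formula given above, the confidence-weighted combination reduces to a few lines; the function name and shapes are illustrative assumptions, and both depth maps are assumed to already be expressed in the ERP domain.

```python
import torch

def fuse_depth(d_tp, c_tp, d_erp):
    """Confidence-weighted fusion of the TP domain and ERP domain depth maps.

    d_tp, d_erp: (B, 1, H, W) depth maps expressed in the ERP domain
    c_tp:        (B, 1, H, W) TP domain confidence in [0, 1] (Sigmoid output after P2E)
    """
    c = c_tp.clamp(0.0, 1.0)
    # pixels the TP branch is confident about keep the detailed TP depth;
    # the remainder falls back to the smoother ERP domain prediction
    return c * d_tp + (1.0 - c) * d_erp
```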
In yet another embodiment provided by the present invention, the method further comprises:
calculating a loss value of the depth estimation map based on a loss function constructed by spherical photometric loss, smooth loss, significant direction normal loss and coplanar loss, and correcting the depth estimation map according to the loss value;
wherein the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
When this embodiment is implemented, a supervision signal is constructed based on the spherical photometric loss, the smoothness loss, the significant-direction normal loss and the coplanar loss, so as to train the model and correct the depth estimation map;
the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
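Assuming the loss structure reconstructed above (per-map loss = spherical photometric term plus weighted smoothness, significant-direction normal and coplanar terms, summed over the TP, ERP and fusion maps), the combination can be sketched as follows; the weight values are placeholders, not the ones used by the application.

```python
def total_loss(losses_per_map, lambdas=(0.1, 0.05, 0.05)):
    """Combine the per-map loss terms into the total training loss L.

    losses_per_map: dict mapping "TP" / "ERP" / "Fusion" to a dict with the four terms
                    "ph" (spherical photometric), "smooth", "normal" and "coplanar".
    lambdas:        the preset weights (lambda1, lambda2, lambda3); values illustrative.
    """
    l1, l2, l3 = lambdas
    total = 0.0
    for terms in losses_per_map.values():
        # L_x = L_ph + lambda1 * L_smooth + lambda2 * L_normal + lambda3 * L_coplanar
        total = total + terms["ph"] + l1 * terms["smooth"] \
                + l2 * terms["normal"] + l3 * terms["coplanar"]
    return total  # L = L_TP + L_ERP + L_Fusion
```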
The method provided by the present application has the fewest model parameters among current panoramic depth estimation models, while showing better performance than existing self-supervised methods.
In still another embodiment of the present invention, referring to fig. 2, a schematic structural diagram of a self-supervising 360 ° depth estimation device according to an embodiment of the present invention is provided, where the device includes:
the image conversion module is used for converting the ERP image into a TP image through E2P conversion;
an extraction module for inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
the aggregation module is used for extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
a decoding module for inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
the map generation module is used for respectively generating views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map by adopting spherical view synthesis to obtain a depth estimation map;
wherein N_patch is a positive integer.
The self-supervised 360 ° depth estimation device provided in this embodiment can perform all the steps and functions of the self-supervised 360 ° depth estimation method provided in any one of the above embodiments, and specific functions of the device are not described herein.
Referring to fig. 3, a schematic structural diagram of a terminal device according to an embodiment of the present invention. The terminal device includes: a processor, a memory, and a computer program stored in the memory and executable on the processor, such as a self-supervised 360 ° depth estimation program. The steps in each of the embodiments of the self-supervised 360 ° depth estimation method described above, such as steps S1 to S5 shown in fig. 1, are implemented when the processor executes the computer program. Alternatively, the processor may implement the functions of the modules in the above-described device embodiments when executing the computer program.
The computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to accomplish the present invention, for example. The one or more modules may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program in the one self-supervising 360 ° depth estimation device. For example, the computer program may be divided into modules, and specific functions of each module are described in detail in a self-supervised 360 ° depth estimation method provided in any of the above embodiments, and specific functions of the apparatus are not described herein.
The self-supervision 360-degree depth estimation device can be computing equipment such as a desktop computer, a notebook computer, a palm computer, a cloud server and the like. The self-supervising 360 ° depth estimation device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of a self-supervising 360 ° depth estimation apparatus, and is not limiting of a self-supervising 360 ° depth estimation apparatus, and may include more or fewer components than illustrated, or may combine certain components, or different components, e.g., the self-supervising 360 ° depth estimation apparatus may further include input and output devices, network access devices, buses, etc.
The processor may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, which is a control center of the self-supervising 360 ° depth estimation device, and various interfaces and lines are used to connect various parts of the entire self-supervising 360 ° depth estimation device.
The memory may be used to store the computer program and/or module, and the processor may implement the various functions of the self-supervising 360 ° depth estimation device by running or executing the computer program and/or module stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart Media Card (SMC), secure Digital (SD) Card, flash Card (Flash Card), at least one disk storage device, flash memory device, or other volatile solid-state storage device.
Wherein the module integrated with the self-supervising 360 ° depth estimation device may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a stand alone product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
It should be noted that modifications and adaptations to the invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (10)

1. A self-supervising 360 ° depth estimation method, the method comprising:
converting the ERP image into a TP image through E2P conversion;
inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
adopting spherical view synthesis to respectively generate views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map to obtain a depth estimation map;
wherein N_patch is a positive integer.
2. The method of claim 1, wherein after the global features in each of the N_patch TP feature blocks are extracted according to the preset frequency domain spatial domain feature aggregation model and added to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block, the method further comprises:
inputting the N_patch aggregated features into a Conv3D layer of a preset structural feature alignment model to reduce the number of feature channels;
processing the output features of the Conv3D layer with a Softmax activation function to obtain the attention weight of each pixel in the TP blocks;
performing matrix multiplication between the obtained attention weights and the N_patch aggregated features respectively to obtain global geometric prior information;
recalibrating the importance of each channel of the global geometric prior information to obtain calibrated global geometric prior information;
integrating the calibrated global geometric prior information into the N_patch aggregated features by element-wise broadcast addition, and performing P2E conversion to obtain the N_patch aligned aggregated features.
3. The method of claim 1, wherein the extracting the global features in each of the N_patch TP feature blocks according to the preset frequency domain spatial domain feature aggregation model and adding them to the original features to obtain the N_patch aggregated features that aggregate the non-local information within each block comprises:
extracting TP features of a TP feature block according to a TP encoder in the frequency domain spatial domain feature aggregation model;
converting the extracted TP characteristics into a frequency domain through fast Fourier transformation to obtain frequency domain characteristics;
splicing the real part and the imaginary part of the obtained frequency domain features together in a feature dimension, and extracting the global context of each TP feature block by applying Conv2D blocks in the frequency domain;
converting the global context back to a space domain through inverse Fourier transform to obtain global features;
after the global features are spliced with the TP feature block, sending them to another Conv2D block to restore the number of feature channels, and generating the N_patch aggregated features.
4. The self-supervising 360 ° depth estimation method of claim 1, wherein the TP-domain depth map and corresponding confidence map generation process comprises:
processing the N_patch aggregated features by the first unit, the second unit and the third unit of the TP domain depth decoder to obtain three decoding features respectively;
fusing the three decoding features to obtain depth decoding features, and up-sampling the depth decoding features to the same resolution as the TP feature blocks;
decoding the up-sampled features by two Conv3D layers to obtain a depth map and its corresponding confidence map respectively;
wherein the process by which each unit processes the N_patch aggregated features comprises:
up-sampling, by the up-sampling layer, the N_patch aggregated features to the same size as the features of the corresponding scale, and then inputting them into a first Conv3D block of the TP domain depth decoder to enrich their spatial feature representation; connecting the obtained features with the features of the corresponding scale to combine local details and semantic priors, obtaining cascaded features; and inputting the obtained cascaded features into a second Conv3D block of the TP domain depth decoder to reduce the number of feature channels, obtaining the decoding features.
5. The self-supervising 360 ° depth estimation method according to claim 1, wherein the ERP domain depth map generation process comprises:
aligning the N_patch aggregated features by the feature alignment module in the ERP domain depth decoder to obtain output features;
converting the output features into the ERP domain by P2E conversion to obtain converted features;
inputting the converted features into the frequency domain attention block by the first decoding unit of the ERP domain depth decoder, and optimizing them using a self-attention mechanism;
processing the optimized features by a Conv2D layer with an ELU activation function, and up-sampling them to obtain depth features of the same size as the features of the corresponding scale;
receiving, by the fourth decoding unit of the ERP domain depth decoder, the multi-scale features and the depth features simultaneously as input, to obtain the decoded depth features;
inputting the decoded ERP domain depth features into a preset depth distribution classification module, and calculating a Laplacian mixture distribution over the median values of the discrete depth intervals to obtain the depth map of the ERP domain;
wherein the depth map of the ERP domain is D_ERP = Σ_{i=1..N_c} p_i·c_i, with p_i = (1/Z)·Σ_{j=1..N_c} max( w_j/(2·b_j)·exp(−|c_i − c_j|/b_j), ε ), where p_i denotes the probability that the pixel belongs to the i-th depth interval, c_i and c_j denote the actual median values of the i-th and the j-th depth interval respectively, w_j and b_j denote the weight and the scale of the Laplace distribution corresponding to the j-th depth interval of each pixel, Z is a normalization constant, ε is the minimum threshold, and N_c is the number of depth intervals.
6. The self-supervised 360 ° depth estimation method of claim 1, wherein said fusion map D_Fusion = C_TP ⊙ D_TP + (1 − C_TP) ⊙ D_ERP,
wherein ⊙ denotes the element-wise product, C_TP is the confidence map of the TP domain, D_TP is the depth map of the TP domain, and D_ERP is the depth map of the ERP domain.
7. The self-supervising 360 ° depth estimation method according to claim 1, further comprising:
calculating a loss value of the depth estimation map based on a loss function constructed by spherical photometric loss, smooth loss, significant direction normal loss and coplanar loss, and correcting the depth estimation map according to the loss value;
wherein the loss value of the depth estimation map is L = L_TP + L_ERP + L_Fusion, where L_TP is the loss value of the depth map of the TP domain, L_ERP is the loss value of the depth map of the ERP domain, and L_Fusion is the loss value of the fusion map; for x being TP, ERP or Fusion, L_x = L_ph + λ1·L_smooth + λ2·L_normal + λ3·L_coplanar, where L_ph denotes the spherical photometric loss, L_smooth denotes the smoothness loss, L_normal denotes the significant-direction normal loss, L_coplanar denotes the coplanar loss, and λ1, λ2 and λ3 are preset weights.
8. A self-supervising 360 ° depth estimation device, the device comprising:
the image conversion module is used for converting the ERP image into a TP image through E2P conversion;
an extraction module for inputting the N_patch generated TP images with minimal distortion into a preset backbone network, and extracting N_patch TP feature blocks at different scales;
the aggregation module is used for extracting the global features in each of the N_patch TP feature blocks according to a preset frequency domain spatial domain feature aggregation model, and adding them to the original features to obtain N_patch aggregated features that aggregate the non-local information within each block;
a decoding module for inputting the N_patch aggregated features respectively into a preset TP domain depth decoder and a preset ERP domain depth decoder for decoding, to obtain a depth map of the TP domain with a corresponding confidence map and a depth map of the ERP domain, and fusing the depth map of the TP domain and the depth map of the ERP domain to obtain a fusion map;
The map generation module is used for respectively generating views of new viewpoints from the depth map of the TP domain, the depth map of the ERP domain and the fusion map by adopting spherical view synthesis to obtain a depth estimation map;
wherein N_patch is a positive integer.
9. A terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the self-supervised 360 ° depth estimation method as claimed in any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program, when run, controls a device in which the computer readable storage medium is located to perform the self-supervised 360 ° depth estimation method as claimed in any one of claims 1 to 7.
CN202410232514.3A 2024-03-01 Self-supervision 360-degree depth estimation method, device, equipment and medium Active CN117808857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410232514.3A CN117808857B (en) 2024-03-01 Self-supervision 360-degree depth estimation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410232514.3A CN117808857B (en) 2024-03-01 Self-supervision 360-degree depth estimation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117808857A true CN117808857A (en) 2024-04-02
CN117808857B CN117808857B (en) 2024-05-24



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220198605A1 (en) * 2019-03-10 2022-06-23 Google Llc 360 Degree Wide-Angle Camera With Baseball Stitch
US20230186590A1 (en) * 2021-12-13 2023-06-15 Robert Bosch Gmbh Method for omnidirectional dense regression for machine perception tasks via distortion-free cnn and spherical self-attention
CN117036436A (en) * 2023-08-10 2023-11-10 福州大学 Monocular depth estimation method and system based on double encoder-decoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Yuru; ZHAO Haitao: "Scene depth estimation based on adaptive pixel-level attention model", 应用光学 (Journal of Applied Optics), no. 03, 15 May 2020 (2020-05-15), pages 64-73 *

Similar Documents

Publication Publication Date Title
CN112001914B (en) Depth image complement method and device
CN108694700B (en) System and method for deep learning image super-resolution
CN112308763A (en) Generating a composite digital image using a neural network with a dual stream encoder architecture
AU2019268184B2 (en) Precise and robust camera calibration
US20230177643A1 (en) Image super-resolution
CN113870104A (en) Super-resolution image reconstruction
CN113505848B (en) Model training method and device
WO2021164269A1 (en) Attention mechanism-based disparity map acquisition method and apparatus
US20230143452A1 (en) Method and apparatus for generating image, electronic device and storage medium
CN112308866A (en) Image processing method, image processing device, electronic equipment and storage medium
US20230153965A1 (en) Image processing method and related device
CN110827341A (en) Picture depth estimation method and device and storage medium
CN117094362B (en) Task processing method and related device
CN113902789A (en) Image feature processing method, depth image generating method, depth image processing apparatus, depth image generating medium, and device
CN112598673A (en) Panorama segmentation method, device, electronic equipment and computer readable medium
CN117808857B (en) Self-supervision 360-degree depth estimation method, device, equipment and medium
WO2023109086A1 (en) Character recognition method, apparatus and device, and storage medium
CN117808857A (en) Self-supervision 360-degree depth estimation method, device, equipment and medium
CN113807354B (en) Image semantic segmentation method, device, equipment and storage medium
CN114549322A (en) Image super-resolution method and device based on unsupervised field self-adaption
CN114519731A (en) Method and device for complementing depth image
CN114596203A (en) Method and apparatus for generating images and for training image generation models
CN111325068A (en) Video description method and device based on convolutional neural network
CN116980541B (en) Video editing method, device, electronic equipment and storage medium
Liu et al. MODE: Monocular omnidirectional depth estimation via consistent depth fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant