CN116823914A - Unsupervised focal stack depth estimation method based on all-in-focus image synthesis - Google Patents
Unsupervised focal stack depth estimation method based on all-in-focus image synthesis
- Publication number
- CN116823914A (application CN202311101094.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- focus
- focal stack
- pyramid
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses an unsupervised focal stack depth estimation method based on all-in-focus image synthesis, which comprises the following steps: S1, computing all-in-focus images with an image-pyramid-based method and a focus-measurement-operator-based method, and fusing the resulting all-in-focus images to serve as supervision information; S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module; S3, introducing a three-dimensional polarized self-attention mechanism into the focal stack and dividing the input feature map into a channel polarization feature map and a spatial polarization feature map; and S4, locating the layer of maximum sharpness in the focal stack with a layered depth probability prediction module, outputting the corresponding probability values, determining the layer of best sharpness, and obtaining the all-in-focus image. The method has relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and has high practicability.
Description
Technical Field
The invention relates to the technical field of monocular depth estimation, and in particular to an unsupervised focal stack depth estimation method based on all-in-focus image synthesis.
Background
Supervised approaches show high accuracy in the depth estimation task, but they require depth ground truth, which may be difficult to obtain in practical application scenarios. In recent years, with the continuous development of deep learning and continued exploration in computer vision, the field of unsupervised monocular depth estimation has made great progress. Unsupervised monocular depth estimation refers to estimating the depth information of a scene through a computer vision algorithm without depth labels. Unsupervised focal stack depth estimation can be divided into two categories: reconstruction supervision and auxiliary supervision.
Reconstruction supervision supervises the network through a reconstruction loss, thereby learning depth information. It regards unsupervised focal stack depth estimation as a special case of multi-view monocular depth estimation: scene depth is estimated from the blur differences of the focus sequence, the focus map and the estimated intermediate depth are used to refocus and output a focal stack, and the reconstruction loss provides the supervision signal. However, because the depth estimation task is ill-posed, the reconstruction model easily yields multiple competing depth solutions, making it difficult to determine the optimal one, so the network is very unstable; meanwhile, the intermediate representation is easily interpreted as a compressed encoding of the focal stack information, making the model hard to converge, so additional losses are often introduced to constrain the intermediate representation.
Auxiliary supervision guides the learning process of the network with auxiliary information in the absence of supervision, adopting the all-in-focus image as the auxiliary supervision information. However, such models have certain limitations, such as a large number of parameters, and they require the dataset itself to provide an all-in-focus image as the supervision information, so their applicability is limited. Therefore, how to provide an unsupervised focal stack depth estimation method based on all-in-focus image synthesis is a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide an unsupervised focal stack depth estimation method based on all-in-focus image synthesis, which has relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and has high practicability.
According to an embodiment of the invention, an unsupervised focal stack depth estimation method based on all-in-focus image synthesis comprises the following steps:
S1, computing all-in-focus images with an all-in-focus image synthesis method based on an image pyramid and an all-in-focus image synthesis method based on a focus measurement operator, and fusing the resulting all-in-focus images to serve as supervision information;
S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module to obtain preliminary features, obtaining blur-ambiguity-encoding features through a differential value calculation module, and concatenating the preliminary features and the blur-ambiguity features to obtain a focal volume;
S3, introducing a three-dimensional polarized self-attention mechanism into the focal stack, and dividing the input focal volume into a channel polarization feature map and a spatial polarization feature map;
S4, passing the channel polarization feature map and the spatial polarization feature map through the layered depth probability prediction module to locate the layer of maximum sharpness in the focal stack, outputting the corresponding probability values, determining the layer of best sharpness, and obtaining the all-in-focus image.
Optionally, the image pyramid specifically includes:
Gaussian pyramid downsampling: the original image $G_0$ is taken as the bottom layer of the Gaussian pyramid, with the original resolution, and the $i$-th layer of the Gaussian pyramid is defined as:
$G_i = \mathrm{Down}(w * G_{i-1})$
where $*$ denotes the convolution operation, $w$ denotes a Gaussian convolution kernel of size $5 \times 5$, and $\mathrm{Down}(\cdot)$ denotes the downsampling process that removes the even rows and even columns of the input image;
downsampling reduces the resolution of the input image to one quarter, and the whole Gaussian pyramid is obtained by iterating this step;
Gaussian pyramid upsampling expands the original image to twice its size in each direction, fills the newly added rows and columns with 0, and then convolves the enlarged image with four times the same convolution kernel as above to obtain a reconstructed image;
the Laplacian pyramid is introduced from the reconstructed image; let $L_i$ denote the $i$-th layer of the Laplacian pyramid:
$L_i = G_i - \mathrm{Up}(G_{i+1})$
where $\mathrm{Up}(\cdot)$ denotes the upsampling process, i.e., expanding the image to twice its size in each direction with the newly added rows and columns filled with 0;
the original image $G_0$ is thus decomposed into a Gaussian pyramid and a Laplacian pyramid, and the same decomposition is performed for each image in the focal stack, resulting in a set of image pyramids.
Optionally, the fusion process of the image pyramid specifically includes:
given a focal stack sequence:
$\{I_k(x, y)\},\ k = 1, \dots, N$
where $(x, y)$ denotes the spatial coordinates of a pixel and $N$ denotes the number of images in the focus sequence, each image corresponding to a specific focus distance;
each image of the focal stack $I_k$ is decomposed into an image pyramid to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$, where $i = 1, \dots, M$ and $M$ denotes the number of pyramid layers;
focus measurement is performed on each layer $L_i^k$ of the Laplacian pyramids to obtain the index map $D_i(x, y)$ corresponding to the maximum sharpness; the all-in-focus Laplacian pyramid $L_i^{AiF}$ is generated from the index map and the Laplacian pyramids:
$L_i^{AiF}(x, y) = L_i^{D_i(x, y)}(x, y)$
using $L_i^{AiF}$, the all-in-focus Laplacian pyramid is upsampled and accumulated from the top layer down to obtain the all-in-focus image corresponding to the focal stack.
Optionally, the image-pyramid-based all-in-focus image synthesis method specifically includes: decomposing the input focal stack $I_k$ into image pyramids to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$; performing regional information entropy calculation on the Laplacian pyramid to obtain a focus-measure sharpness value for each layer; extracting, for each layer, the content with the maximum sharpness value as the all-in-focus result of that layer; and reconstructing to obtain the final all-in-focus image.
Optionally, the focus-measurement-operator-based all-in-focus image synthesis method includes: applying a small-region neighborhood fusion operator to each image $I_k$ of the focus sequence to obtain focus-measure sharpness values for all images; performing index maximization to determine the index corresponding to the best sharpness; and extracting the pixel values from the focal stack according to that index to form the all-in-focus image.
Optionally, the focus-measurement-operator-based all-in-focus image synthesis method specifically includes:
converting the vector-valued image into a scalar-valued image through vector operations to obtain comprehensive features:
let $u$ denote a vector-valued pixel and $s$ denote a scalar-valued pixel; a window of size $w \times w$ is selected, with $u_c$ the center vector-valued pixel and $u_i$ the vector-valued pixels inside the window $\Omega$;
the scalar-valued pixel $s$ corresponding to the vector-valued pixel $u_c$ is obtained by scaling the lengths of the difference vectors within the window;
the difference vector $d_i$ is obtained by computing the difference between each other vector $u_i$ in the window $\Omega$ and the center vector $u_c$:
$d_i = u_i - u_c$
the scalar value is then formed from the dot products of the resulting vectors, scaled by a local adaptive scaling factor $\alpha$: the dot products between the difference vectors measure the similarity between features, and the cross-product lengths between each difference vector $d_i$ and the center vector $u_c$ provide a complementary term;
the resulting scalar-valued image is used in an index maximization operation to evaluate the sharpness of the images, and the pixel values at the corresponding positions are extracted from the input focal stack according to the index of the best sharpness to obtain the corresponding all-in-focus image.
Optionally, the three-dimensional perception module performs high-frequency noise filtering and preliminary feature extraction on the focal stack through a four-layer network structure, and comprises several parallel convolution layers with different kernel sizes and strides for capturing blur features at different scales;
step S2 specifically comprises the following steps:
S21, filtering the focal stack with a 3D convolution network to extract blur features;
S22, introducing a differential value calculation module into the network structure and feeding the blur features into it; the differential value calculation module computes the differential values across the three R, G and B channels and fuses them into an RGB channel difference, where the channel index ranges over the different color dimensions of the input features;
S23, obtaining the RGB differential features through a downsampling layer, fusing the RGB differential features with the blur features, and constructing a focal volume that incorporates blur ambiguity.
Optionally, the channel polarization feature map is obtained by applying a polarization transformation to the input feature map $X$:
the polarization transformation converts the input feature map $X$ into two sets of basis vectors $W_q(X)$ and $W_v(X)$, corresponding to the channel-level query and key;
the similarity score $A^{ch}$ between them is computed as:
$A^{ch}(X) = \sigma\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$
where $\sigma$ denotes an activation function, $F_{SM}$ denotes the normalized exponential (softmax) function, $W_q$, $W_v$ and $W_z$ denote 1 × 1 three-dimensional convolution layers, $\sigma_1$ and $\sigma_2$ denote two tensor reshaping operators, $\times$ denotes element-level multiplication, and the number of channels between $W_v$ and $W_z$ is $C/2$;
using the score $A^{ch}$ as weights, the input vectors are weighted and summed to obtain the channel polarization feature map $Z^{ch}$ encoding the channel correlation:
$Z^{ch} = A^{ch}(X) \odot^{ch} X$
where $\odot^{ch}$ denotes the channel-level multiplication operator.
Optionally, the method for the spatial polarization feature map includes:
applying a polarization transformation to the input channel polarization feature map $Z^{ch}$ to obtain two sets of polarization vectors $W_q(Z^{ch})$ and $W_v(Z^{ch})$;
where the query branch acquires global spatial features through global pooling $F_{GP}$ over the three dimensions, and the other branch rearranges the pixels of the input feature map through three-dimensional convolution to enhance features along different spatial directions;
the similarity matrix is computed from the two sets of polarization vectors:
$A^{sp}(Z^{ch}) = \sigma\big[ \sigma_3\big( F_{SM}(\sigma_1(F_{GP}(W_q(Z^{ch})))) \times \sigma_2(W_v(Z^{ch})) \big) \big]$
where $W_q$ and $W_v$ denote standard 1 × 1 three-dimensional convolution layers with an intermediate channel number of $C/2$, $\sigma_1$, $\sigma_2$ and $\sigma_3$ denote three tensor reshaping operations, $\times$ denotes the matrix dot-product operation, and $F_{GP}$ denotes global pooling;
the corresponding weights are obtained from the similarity matrix and weighted-summed with the input channel polarization features to obtain the comprehensive self-attention feature representation that associates channel and spatial features:
$Z^{sp} = A^{sp}(Z^{ch}) \odot^{sp} Z^{ch}$
where $\odot^{sp}$ denotes the spatial multiplication operator.
Optionally, step S4 specifically includes:
S41, after passing through a codec (encoder-decoder) network with the pooling layers removed, the output of the focal stack depth estimation network is divided into multiple layers, each layer corresponding to a specific focus distance;
S42, a Softmax operation is applied across the layers to determine the layer with the best sharpness, obtaining the best focus position and the all-in-focus image;
S43, the final depth estimation result is obtained by weighted summation of the multi-layer probability values.
The beneficial effects of the invention are as follows:
The invention first synthesizes the all-in-focus image and uses it as supervision information, and then performs depth estimation through a coarse feature extraction module, a polarized self-attention module and a layered depth estimation module. By using the all-in-focus image synthesized from the focal stack as supervision information and exploiting the association capability of the self-attention mechanism to obtain scene depth, the method achieves relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and has high practicability.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is the unsupervised focal stack depth estimation model in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 2 is a block diagram of the focus measurement sharpness metric in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 3 is a qualitative comparison of all-in-focus image synthesis in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 4 is a block diagram of the three-dimensional perception module in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 5 is a block diagram of the channel difference module in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 6 is a visual comparison of generalization performance on DefocusNet in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 7 is a visual comparison of generalization performance on MobileDepth in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
Referring to fig. 1, an unsupervised focal stack depth estimation method based on all-in-focus image synthesis includes:
S1, computing all-in-focus images with an all-in-focus image synthesis method based on an image pyramid and an all-in-focus image synthesis method based on a focus measurement operator, and fusing the resulting all-in-focus images to serve as supervision information;
referring to fig. 2, a process of synthesizing an all-in-focus image by two methods is shown in this embodiment.
In the drawingsRepresenting the focusing sequence, gaussian pyramid downsampling, with the original image +.>Represents the bottom layer of the Gaussian pyramid with resolution +.>By defining the gaussian pyramid of the i-th layer:
;
wherein ,representing convolution operations +.>Representing a size of +.>Is a convolution kernel of->A downsampling process that removes even rows and even columns of the input image;
downsampling the resolution of an input imageReducing the height to one fourth, and obtaining the whole Gaussian pyramid by continuously iterating the steps;
;
wherein ,representing the upsampling process, i.e., expanding the image twice as much as it was in each direction, with the newly added rows and columns filled with 0's;
original imageIs decomposed into a gaussian pyramid and a laplacian pyramid, and the same decomposition operation is performed for each image in the focal stack, resulting in a set of image pyramids.
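For illustration, a minimal Python sketch of the pyramid decomposition described above is given below. It assumes OpenCV's pyrDown/pyrUp in place of the explicit 5 × 5 kernel and zero-insertion upsampling, and the layer count is an illustrative parameter rather than a value taken from the patent.

```python
# Sketch: Gaussian/Laplacian decomposition of one focal-stack slice,
# following the standard Burt-Adelson formulation L_i = G_i - Up(G_{i+1}).
import cv2
import numpy as np

def decompose_pyramids(image: np.ndarray, levels: int = 4):
    """Return (gaussian, laplacian) pyramids for one focal-stack slice."""
    gaussian = [image.astype(np.float32)]
    for _ in range(levels - 1):
        gaussian.append(cv2.pyrDown(gaussian[-1]))        # one quarter resolution per level

    laplacian = []
    for i in range(levels - 1):
        h, w = gaussian[i].shape[:2]
        up = cv2.pyrUp(gaussian[i + 1], dstsize=(w, h))   # expand and smooth
        laplacian.append(gaussian[i] - up)                # L_i = G_i - Up(G_{i+1})
    laplacian.append(gaussian[-1])                        # top level kept as residual
    return gaussian, laplacian

# usage: one pyramid pair per image in the focal stack
# stack = [cv2.imread(p) for p in focal_stack_paths]
# pyramids = [decompose_pyramids(img) for img in stack]
```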
In this embodiment, the fusion process of the image pyramid specifically includes:
given a focal stack sequence:
$\{I_k(x, y)\},\ k = 1, \dots, N$
where $(x, y)$ denotes the spatial coordinates of a pixel and $N$ denotes the number of images in the focus sequence, each image corresponding to a specific focus distance;
each image of the focal stack $I_k$ is decomposed into an image pyramid to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$, where $i = 1, \dots, M$ and $M$ denotes the number of pyramid layers;
focus measurement is performed on each layer $L_i^k$ of the Laplacian pyramids to obtain the index map $D_i(x, y)$ corresponding to the maximum sharpness; the all-in-focus Laplacian pyramid $L_i^{AiF}$ is generated from the index map and the Laplacian pyramids:
$L_i^{AiF}(x, y) = L_i^{D_i(x, y)}(x, y)$
using $L_i^{AiF}$, the all-in-focus Laplacian pyramid is upsampled and accumulated from the top layer down to obtain the all-in-focus image corresponding to the focal stack.
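A sketch of this fusion step follows, assuming colour (H, W, C) pyramid levels. The per-pixel focus measure used here is the absolute Laplacian response, a simple stand-in for the regional information entropy measure described below; the index map and top-down reconstruction follow the text above.

```python
# Sketch: build the all-in-focus Laplacian pyramid from the index map and
# collapse it from the top level down.
import cv2
import numpy as np

def fuse_pyramids(laplacian_pyramids):
    """laplacian_pyramids: list over N focal slices, each a list of M levels."""
    n_slices = len(laplacian_pyramids)
    n_levels = len(laplacian_pyramids[0])
    fused = []
    for lvl in range(n_levels):
        layers = np.stack([laplacian_pyramids[k][lvl] for k in range(n_slices)])  # (N, H, W, C)
        sharpness = np.abs(layers).sum(axis=-1)            # stand-in per-pixel focus measure
        index_map = sharpness.argmax(axis=0)               # D(x, y): sharpest slice per pixel
        fused.append(np.take_along_axis(layers, index_map[None, ..., None], axis=0)[0])
    return fused

def reconstruct(fused_laplacian):
    """Collapse the all-in-focus Laplacian pyramid from top to bottom."""
    image = fused_laplacian[-1]
    for lvl in range(len(fused_laplacian) - 2, -1, -1):
        h, w = fused_laplacian[lvl].shape[:2]
        image = cv2.pyrUp(image, dstsize=(w, h)) + fused_laplacian[lvl]
    return np.clip(image, 0, 255).astype(np.uint8)

# usage with the decomposition sketch above:
# aif = reconstruct(fuse_pyramids([lap for _, lap in pyramids]))
```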
In this embodiment, the image-pyramid-based all-in-focus image synthesis method specifically includes: decomposing the input focal stack $I_k$ to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$; because the whole decomposition process is completely reversible, this image transformation incurs no information loss; performing regional information entropy calculation on the Laplacian pyramid to obtain a focus-measure sharpness value for each layer; extracting, for each layer, the content with the maximum sharpness value as the all-in-focus result of that layer; and reconstructing to obtain the final all-in-focus image.
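One plausible form of the regional information entropy focus measure is sketched below, assuming the entropy is computed over a small sliding window of quantised Laplacian magnitudes; the window size and bin count are illustrative choices, not values taken from the patent.

```python
# Sketch: per-pixel local entropy of the Laplacian response as a sharpness map.
import numpy as np
from scipy.ndimage import uniform_filter

def regional_entropy(laplacian_layer: np.ndarray, window: int = 7, bins: int = 16):
    mag = np.abs(laplacian_layer)
    if mag.ndim == 3:                      # collapse colour channels
        mag = mag.sum(axis=-1)
    # quantise responses, then estimate per-pixel bin frequencies with box filters
    q = np.clip((mag / (mag.max() + 1e-8) * bins).astype(int), 0, bins - 1)
    entropy = np.zeros_like(mag, dtype=np.float32)
    for b in range(bins):
        p = uniform_filter((q == b).astype(np.float32), size=window)  # local bin frequency
        entropy -= p * np.log2(np.where(p > 0, p, 1.0))
    return entropy                          # higher entropy = richer local detail
```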
In this embodiment, the focus-measurement-operator-based all-in-focus image synthesis method includes: applying a small-region neighborhood fusion operator to each image $I_k$ of the focus sequence to obtain focus-measure sharpness values for all images; performing index maximization to determine the index corresponding to the best sharpness; and extracting the pixel values from the focal stack according to that index to form the all-in-focus image.
The all-in-focus image fusion algorithm based on the image pyramid and the small-window fusion operator can synthesize high-quality all-in-focus images. The proposed model exploits the global correlation structure to effectively improve the accuracy of depth prediction, while its lightweight design gives it real-time inference capability.
Referring to fig. 3, in this embodiment, the focus-measurement-operator-based all-in-focus image synthesis method specifically includes:
converting the vector-valued image into a scalar-valued image through vector operations to obtain comprehensive features:
let $u$ denote a vector-valued pixel and $s$ denote a scalar-valued pixel; a window of size $w \times w$ is selected, with $u_c$ the center vector-valued pixel and $u_i$ the vector-valued pixels inside the window $\Omega$;
the scalar-valued pixel $s$ corresponding to the vector-valued pixel $u_c$ is obtained by scaling the lengths of the difference vectors within the window;
the difference vector $d_i$ is obtained by computing the difference between each other vector $u_i$ in the window $\Omega$ and the center vector $u_c$:
$d_i = u_i - u_c$
the scalar value is then formed from the dot products of the resulting vectors, scaled by a local adaptive scaling factor $\alpha$, which plays an important role in computing the scalar feature image: the dot products between the difference vectors measure the similarity between features, while the cross-product lengths between each difference vector $d_i$ and the center vector $u_c$ provide a complementary term;
the resulting scalar-valued image is used in an index maximization operation to evaluate the sharpness of the images, and the pixel values at the corresponding positions are extracted from the input focal stack according to the index of the best sharpness to obtain the corresponding all-in-focus image.
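The exact combination of dot- and cross-product terms is not recoverable from the translated text; the sketch below is therefore only one plausible reading of the small-window vector operator, in which difference vectors are formed inside a w × w window and their squared lengths plus the cross-product magnitude with the centre vector are accumulated and scaled.

```python
# Sketch (assumed form): scalar sharpness map from a vector-valued RGB slice.
import numpy as np

def vector_to_scalar(image: np.ndarray, window: int = 3, eps: float = 1e-8):
    """image: (H, W, 3) RGB slice -> (H, W) scalar sharpness map."""
    img = image.astype(np.float32)
    h, w, _ = img.shape
    r = window // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    scalar = np.zeros((h, w), dtype=np.float32)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[r + dy:r + dy + h, r + dx:r + dx + w]
            d = neighbour - img                                # difference vector d_i = u_i - u_c
            dot = (d * d).sum(axis=-1)                         # dot-product similarity term
            cross = np.linalg.norm(np.cross(d, img), axis=-1)  # |d_i x u_c| cross-product term
            scalar += dot + cross
    alpha = scalar.mean() + eps                                # scaling factor (assumed global here;
    return scalar / alpha                                      # the patent's alpha is locally adaptive)
```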
S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module to obtain preliminary features, obtaining blur-ambiguity-encoding features through a differential value calculation module, and concatenating the preliminary features and the blur-ambiguity features to obtain a focal volume;
In this embodiment, the three-dimensional perception module performs high-frequency noise filtering and preliminary feature extraction on the focal stack through a four-layer network structure, and comprises several parallel convolution layers with different kernel sizes and strides for capturing blur features at different scales;
Referring to fig. 4, S2 specifically includes:
S21, filtering the focal stack with a 3D convolution network to extract blur features;
S22, introducing a differential value calculation module into the network structure and feeding the blur features into it; the differential value calculation module computes the differential values across the three R, G and B channels and fuses them into an RGB channel difference, where the channel index ranges over the different color dimensions of the input features;
S23, obtaining the RGB differential features through a downsampling layer, fusing the RGB differential features with the blur features, and constructing a focal volume that incorporates blur ambiguity.
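A PyTorch sketch of this coarse feature stage is given below. Channel widths, kernel sizes and the pairwise-absolute-difference form of the RGB channel difference are illustrative assumptions, and no spatial downsampling is applied so the two branches can be concatenated directly into the focal volume.

```python
# Sketch: parallel 3D convolutions for blur features plus an RGB channel-difference branch.
import torch
import torch.nn as nn

class PerceptionAndDifference(nn.Module):
    def __init__(self, out_ch: int = 16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(3, out_ch, kernel_size=k, padding=k // 2) for k in (3, 5, 7)
        ])
        self.reduce = nn.Conv3d(3 * out_ch, out_ch, kernel_size=1)
        self.diff_conv = nn.Conv3d(3, out_ch, kernel_size=3, padding=1)

    def forward(self, stack: torch.Tensor) -> torch.Tensor:
        # stack: (B, 3, N, H, W) -- RGB focal stack with N focus slices
        blur = torch.cat([b(stack) for b in self.branches], dim=1)
        blur = self.reduce(blur)                                 # preliminary blur features
        r, g, b = stack[:, 0:1], stack[:, 1:2], stack[:, 2:3]
        rgb_diff = torch.cat([(r - g).abs(), (g - b).abs(), (b - r).abs()], dim=1)
        diff_feat = self.diff_conv(rgb_diff)                     # blur-ambiguity encoding
        return torch.cat([blur, diff_feat], dim=1)               # focal volume

# focal_volume = PerceptionAndDifference()(torch.randn(1, 3, 5, 64, 64))
```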
S3, introducing a three-dimensional polarized self-attention mechanism into the focal stack, and dividing the input focal volume into a channel polarization feature map and a spatial polarization feature map;
In this embodiment, the channel polarization feature map is obtained by applying a polarization transformation to the input feature map $X$:
the polarization transformation converts the input feature map $X$ into two sets of basis vectors $W_q(X)$ and $W_v(X)$, corresponding to the channel-level query and key;
the similarity score $A^{ch}$ between them is computed as:
$A^{ch}(X) = \sigma\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$
where $\sigma$ denotes an activation function, $F_{SM}$ denotes the normalized exponential (softmax) function, $W_q$, $W_v$ and $W_z$ denote 1 × 1 three-dimensional convolution layers, $\sigma_1$ and $\sigma_2$ denote two tensor reshaping operators, $\times$ denotes element-level multiplication, and the number of channels between $W_v$ and $W_z$ is $C/2$;
using the score $A^{ch}$ as weights, the input vectors are weighted and summed to obtain the channel polarization feature map $Z^{ch}$ encoding the channel correlation:
$Z^{ch} = A^{ch}(X) \odot^{ch} X$
where $\odot^{ch}$ denotes the channel-level multiplication operator.
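The channel branch can be sketched as below, assuming a 3D adaptation of polarized self-attention: $W_q$ and $W_v$ are 1 × 1 × 1 convolutions producing the channel-level query and key, softmax and sigmoid give the score $A^{ch}$, and the score re-weights the input channels through a $C/2$ bottleneck. The exact composition is an assumption consistent with the description above.

```python
# Sketch: channel-polarisation branch of a 3D polarized self-attention block.
import torch
import torch.nn as nn

class ChannelPolarizedAttention3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Conv3d(channels, 1, kernel_size=1)
        self.w_v = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.w_z = nn.Conv3d(channels // 2, channels, kernel_size=1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, n, h, w = x.shape
        q = self.softmax(self.w_q(x).view(b, 1, -1))                 # (B, 1, NHW) channel query
        v = self.w_v(x).view(b, c // 2, -1)                          # (B, C/2, NHW) channel key
        z = torch.matmul(v, q.transpose(1, 2))                       # (B, C/2, 1) channel context
        a_ch = torch.sigmoid(self.w_z(z.view(b, c // 2, 1, 1, 1)))   # (B, C, 1, 1, 1) score A_ch
        return a_ch * x                                              # channel-level re-weighting

# y = ChannelPolarizedAttention3D(32)(torch.randn(2, 32, 5, 16, 16))
```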
In this embodiment, the method for the spatial polarization feature map includes:
applying a polarization transformation to the input channel polarization feature map $Z^{ch}$ to obtain two sets of polarization vectors $W_q(Z^{ch})$ and $W_v(Z^{ch})$;
where the query branch acquires global spatial features through global pooling $F_{GP}$ over the three dimensions, and the other branch rearranges the pixels of the input feature map through three-dimensional convolution to enhance features along different spatial directions;
the similarity matrix is computed from the two sets of polarization vectors:
$A^{sp}(Z^{ch}) = \sigma\big[ \sigma_3\big( F_{SM}(\sigma_1(F_{GP}(W_q(Z^{ch})))) \times \sigma_2(W_v(Z^{ch})) \big) \big]$
where $W_q$ and $W_v$ denote standard 1 × 1 three-dimensional convolution layers with an intermediate channel number of $C/2$, $\sigma_1$, $\sigma_2$ and $\sigma_3$ denote three tensor reshaping operations, $\times$ denotes the matrix dot-product operation, and $F_{GP}$ denotes global pooling;
the corresponding weights are obtained from the similarity matrix and weighted-summed with the input channel polarization features to obtain the comprehensive self-attention feature representation that associates channel and spatial features:
$Z^{sp} = A^{sp}(Z^{ch}) \odot^{sp} Z^{ch}$
where $\odot^{sp}$ denotes the spatial multiplication operator.
It should be noted that all of the convolution and tensor reshaping operations described above are carried out in three dimensions; the three-dimensional polarized self-attention mechanism can therefore account for both channel correlation and spatial blur correlation.
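The spatial branch can be sketched in the same assumed 3D polarized self-attention style: global pooling over the stack, height and width dimensions produces the global query, a 1 × 1 × 1 convolution produces the spatial key, and their product gives a per-position weight $A^{sp}$ that is multiplied back onto the channel-polarized features.

```python
# Sketch: spatial-polarisation branch of a 3D polarized self-attention block.
import torch
import torch.nn as nn

class SpatialPolarizedAttention3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.w_v = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool3d(1)                       # global pooling F_GP
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, z_ch: torch.Tensor) -> torch.Tensor:
        b, c, n, h, w = z_ch.shape
        q = self.softmax(self.pool(self.w_q(z_ch)).view(b, 1, c // 2))   # (B, 1, C/2) global query
        v = self.w_v(z_ch).view(b, c // 2, -1)                           # (B, C/2, NHW) spatial key
        a_sp = torch.sigmoid(torch.matmul(q, v)).view(b, 1, n, h, w)     # per-position weight A_sp
        return a_sp * z_ch                                               # spatial re-weighting

# out = SpatialPolarizedAttention3D(32)(torch.randn(2, 32, 5, 16, 16))
```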
The proposed model performs well on smaller focal stacks and has excellent generalization capability.
S4, passing the channel polarization feature map and the spatial polarization feature map through the layered depth probability prediction module to locate the layer of maximum sharpness in the focal stack, outputting the corresponding probability values, determining the layer of best sharpness, and obtaining the all-in-focus image.
In this embodiment, S4 specifically includes:
S41, after passing through a codec (encoder-decoder) network with the pooling layers removed, the output of the focal stack depth estimation network is divided into multiple layers, each layer corresponding to a specific focus distance;
S42, a Softmax operation is applied across the layers to determine the layer with the best sharpness, obtaining the best focus position and the all-in-focus image;
at test time, the layer of the target depth is determined from the blur information in the input focus sequence, and the depth probability value is calculated from the probability density function of the corresponding layer;
S43, the final depth estimation result is obtained by weighted summation of the multi-layer probability values.
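A compact sketch of this layered prediction is given below: a softmax across the slice dimension yields per-pixel layer probabilities, the all-in-focus image is assembled from the most probable layer, and depth is the probability-weighted sum of the focus distances. The per-slice focus distances are an assumed calibration input.

```python
# Sketch: softmax over focal layers, all-in-focus gathering, and weighted depth.
import torch

def layered_depth(logits: torch.Tensor, stack: torch.Tensor, focus_dists: torch.Tensor):
    """logits: (B, N, H, W) per-layer scores; stack: (B, N, 3, H, W); focus_dists: (N,)."""
    prob = torch.softmax(logits, dim=1)                         # per-pixel layer probabilities (S42)
    depth = (prob * focus_dists.view(1, -1, 1, 1)).sum(dim=1)   # weighted sum of focus distances (S43)
    best = prob.argmax(dim=1, keepdim=True)                     # (B, 1, H, W) sharpest layer index
    index = best.unsqueeze(2).expand(-1, -1, stack.size(2), -1, -1)
    aif = torch.gather(stack, dim=1, index=index).squeeze(1)    # all-in-focus image (B, 3, H, W)
    return depth, aif

# depth, aif = layered_depth(torch.randn(1, 5, 64, 64), torch.rand(1, 5, 3, 64, 64),
#                            torch.linspace(0.1, 1.0, 5))
```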
In example 1:
The invention is evaluated quantitatively on the 4D Light Field, DefocusNet and FlyingThings3D datasets:
As can be seen from Table 1 above, the proposed all-in-focus image synthesis method can synthesize relatively accurate all-in-focus images from a smaller focal stack.
Tables 2-4 above give the quantitative comparison of the present invention with the latest methods on the 4D Light Field, DefocusNet and FlyingThings3D datasets.
As can be seen from Tables 1-4 above, the results on the 4D Light Field dataset show that, in unsupervised depth estimation, the present invention improves on the MSE and RMSE indices of the AiFDepthNet method by 42.5% and 26.3%, respectively. Compared with supervised approaches, the present method outperforms most of them, including VDFF, PSPNet and DDFF, and its MSE and RMSE differ from those of the DefocusNet approach by only 15.0% and 4.6%. The results on the DefocusNet and FlyingThings3D datasets show that the method achieves higher accuracy than the AiFDepthNet method on the MAE, MSE and RMSE indices. Compared with the 16M parameters of the AiFDepthNet method, the present method has only 3.3M parameters and higher computational efficiency.
The invention first synthesizes the all-in-focus image and uses it as supervision information, and then performs depth estimation through a coarse feature extraction module, a polarized self-attention module and a layered depth estimation module. By using the all-in-focus image synthesized from the focal stack as supervision information and exploiting the association capability of the self-attention mechanism to obtain scene depth, the method achieves relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and has high practicability.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention, according to the technical scheme of the present invention and its inventive concept shall fall within the scope of protection of the present invention.
Claims (10)
1. An unsupervised focal stack depth estimation method based on all-in-focus image synthesis, comprising:
S1, computing all-in-focus images with an all-in-focus image synthesis method based on an image pyramid and an all-in-focus image synthesis method based on a focus measurement operator, and fusing the resulting all-in-focus images to serve as supervision information;
S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module to obtain preliminary features, obtaining blur-ambiguity-encoding features through a differential value calculation module, and concatenating the preliminary features and the blur-ambiguity features to obtain a focal volume;
S3, introducing a three-dimensional polarized self-attention mechanism into the focal stack, and dividing the input focal volume into a channel polarization feature map and a spatial polarization feature map;
S4, passing the channel polarization feature map and the spatial polarization feature map through the layered depth probability prediction module to locate the layer of maximum sharpness in the focal stack, outputting the corresponding probability values, determining the layer of best sharpness, and obtaining the all-in-focus image.
2. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein the image pyramid specifically includes:
Gaussian pyramid downsampling: the original image $G_0$ is taken as the bottom layer of the Gaussian pyramid, with the original resolution, and the $i$-th layer of the Gaussian pyramid is defined as:
$G_i = \mathrm{Down}(w * G_{i-1})$
where $*$ denotes the convolution operation, $w$ denotes a Gaussian convolution kernel of size $5 \times 5$, and $\mathrm{Down}(\cdot)$ denotes the downsampling process that removes the even rows and even columns of the input image;
downsampling reduces the resolution of the input image to one quarter, and the whole Gaussian pyramid is obtained by iterating this step;
Gaussian pyramid upsampling expands the original image to twice its size in each direction, fills the newly added rows and columns with 0, and then convolves the enlarged image with four times the same convolution kernel as above to obtain a reconstructed image;
the Laplacian pyramid is introduced from the reconstructed image; let $L_i$ denote the $i$-th layer of the Laplacian pyramid:
$L_i = G_i - \mathrm{Up}(G_{i+1})$
where $\mathrm{Up}(\cdot)$ denotes the upsampling process, i.e., expanding the image to twice its size in each direction with the newly added rows and columns filled with 0;
the original image $G_0$ is thus decomposed into a Gaussian pyramid and a Laplacian pyramid, and the same decomposition is performed for each image in the focal stack, resulting in a set of image pyramids.
3. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 2, wherein the fusion process of the image pyramid specifically includes:
given a focal stack sequence:
$\{I_k(x, y)\},\ k = 1, \dots, N$
where $(x, y)$ denotes the spatial coordinates of a pixel and $N$ denotes the number of images in the focus sequence, each image corresponding to a specific focus distance;
each image of the focal stack $I_k$ is decomposed into an image pyramid to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$, where $i = 1, \dots, M$ and $M$ denotes the number of pyramid layers;
focus measurement is performed on each layer $L_i^k$ of the Laplacian pyramids to obtain the index map $D_i(x, y)$ corresponding to the maximum sharpness; the all-in-focus Laplacian pyramid $L_i^{AiF}$ is generated from the index map and the Laplacian pyramids:
$L_i^{AiF}(x, y) = L_i^{D_i(x, y)}(x, y)$
using $L_i^{AiF}$, the all-in-focus Laplacian pyramid is upsampled and accumulated from the top layer down to obtain the all-in-focus image corresponding to the focal stack.
4. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 3, wherein the image-pyramid-based all-in-focus image synthesis method specifically includes: decomposing the input focal stack $I_k$ to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$; performing regional information entropy calculation on the Laplacian pyramid to obtain a focus-measure sharpness value for each layer; extracting, for each layer, the content with the maximum sharpness value as the all-in-focus result of that layer; and reconstructing to obtain the final all-in-focus image.
5. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 3, wherein the focus-measurement-operator-based all-in-focus image synthesis method includes: applying a small-region neighborhood fusion operator to each image $I_k$ of the focus sequence to obtain focus-measure sharpness values for all images; performing index maximization to determine the index corresponding to the best sharpness; and extracting the pixel values from the focal stack according to that index to form the all-in-focus image.
6. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 5, wherein the focus-measurement-operator-based all-in-focus image synthesis method specifically includes:
converting the vector-valued image into a scalar-valued image through vector operations to obtain comprehensive features:
let $u$ denote a vector-valued pixel and $s$ denote a scalar-valued pixel; a window of size $w \times w$ is selected, with $u_c$ the center vector-valued pixel and $u_i$ the vector-valued pixels inside the window $\Omega$;
the scalar-valued pixel $s$ corresponding to the vector-valued pixel $u_c$ is obtained by scaling the lengths of the difference vectors within the window;
the difference vector $d_i$ is obtained by computing the difference between each other vector $u_i$ in the window $\Omega$ and the center vector $u_c$:
$d_i = u_i - u_c$
the scalar value is then formed from the dot products of the resulting vectors, scaled by a local adaptive scaling factor $\alpha$: the dot products between the difference vectors measure the similarity between features, and the cross-product lengths between each difference vector $d_i$ and the center vector $u_c$ provide a complementary term;
the resulting scalar-valued image is used in an index maximization operation to evaluate the sharpness of the images, and the pixel values at the corresponding positions are extracted from the input focal stack according to the index of the best sharpness to obtain the corresponding all-in-focus image.
7. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein the three-dimensional perception module performs high-frequency noise filtering and preliminary feature extraction on the focal stack through a four-layer network structure, and comprises several parallel convolution layers with different kernel sizes and strides for capturing blur features at different scales;
step S2 specifically comprises the following steps:
S21, filtering the focal stack with a 3D convolution network to extract blur features;
S22, introducing a differential value calculation module into the network structure and feeding the blur features into it; the differential value calculation module computes the differential values across the three R, G and B channels and fuses them into an RGB channel difference, where the channel index ranges over the different color dimensions of the input features;
S23, obtaining the RGB differential features through a downsampling layer, fusing the RGB differential features with the blur features, and constructing a focal volume that incorporates blur ambiguity.
8. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein the channel polarization feature map is obtained by applying a polarization transformation to the input feature map $X$:
the polarization transformation converts the input feature map $X$ into two sets of basis vectors $W_q(X)$ and $W_v(X)$, corresponding to the channel-level query and key;
the similarity score $A^{ch}$ between them is computed as:
$A^{ch}(X) = \sigma\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$
where $\sigma$ denotes an activation function, $F_{SM}$ denotes the normalized exponential (softmax) function, $W_q$, $W_v$ and $W_z$ denote 1 × 1 three-dimensional convolution layers, $\sigma_1$ and $\sigma_2$ denote two tensor reshaping operators, $\times$ denotes element-level multiplication, and the number of channels between $W_v$ and $W_z$ is $C/2$;
using the score $A^{ch}$ as weights, the input vectors are weighted and summed to obtain the channel polarization feature map $Z^{ch}$ encoding the channel correlation:
$Z^{ch} = A^{ch}(X) \odot^{ch} X$
where $\odot^{ch}$ denotes the channel-level multiplication operator.
9. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 8, wherein the method for the spatial polarization feature map includes:
applying a polarization transformation to the input channel polarization feature map $Z^{ch}$ to obtain two sets of polarization vectors $W_q(Z^{ch})$ and $W_v(Z^{ch})$, where the query branch acquires global spatial features through global pooling $F_{GP}$ over the three dimensions, and the other branch rearranges the pixels of the input feature map through three-dimensional convolution to enhance features along different spatial directions;
the similarity matrix is computed from the two sets of polarization vectors:
$A^{sp}(Z^{ch}) = \sigma\big[ \sigma_3\big( F_{SM}(\sigma_1(F_{GP}(W_q(Z^{ch})))) \times \sigma_2(W_v(Z^{ch})) \big) \big]$
where $W_q$ and $W_v$ denote standard 1 × 1 three-dimensional convolution layers with an intermediate channel number of $C/2$, $\sigma_1$, $\sigma_2$ and $\sigma_3$ denote three tensor reshaping operations, $\times$ denotes the matrix dot-product operation, and $F_{GP}$ denotes global pooling;
the corresponding weights are obtained from the similarity matrix and weighted-summed with the input channel polarization features to obtain the comprehensive self-attention feature representation that associates channel and spatial features:
$Z^{sp} = A^{sp}(Z^{ch}) \odot^{sp} Z^{ch}$
where $\odot^{sp}$ denotes the spatial multiplication operator.
10. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein S4 specifically comprises:
S41, after passing through a codec (encoder-decoder) network with the pooling layers removed, the output of the focal stack depth estimation network is divided into multiple layers, each layer corresponding to a specific focus distance;
S42, a Softmax operation is applied across the layers to determine the layer with the best sharpness, obtaining the best focus position and the all-in-focus image;
S43, the final depth estimation result is obtained by weighted summation of the multi-layer probability values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311101094.7A CN116823914B (en) | 2023-08-30 | 2023-08-30 | Unsupervised focal stack depth estimation method based on all-focusing image synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116823914A | 2023-09-29
CN116823914B | 2024-01-09
Family
ID=88141360
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311101094.7A Active CN116823914B (en) | 2023-08-30 | 2023-08-30 | Unsupervised focal stack depth estimation method based on all-focusing image synthesis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116823914B (en) |
Citations (7)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120218386A1 (en) * | 2011-02-28 | 2012-08-30 | Duke University | Systems and Methods for Comprehensive Focal Tomography |
CN110246172A (en) * | 2019-06-18 | 2019-09-17 | 首都师范大学 | A kind of the light field total focus image extraction method and system of the fusion of two kinds of Depth cues |
CN110751160A (en) * | 2019-10-30 | 2020-02-04 | 华中科技大学 | Method, device and system for detecting object in image |
CN112465796A (en) * | 2020-12-07 | 2021-03-09 | 清华大学深圳国际研究生院 | Light field feature extraction method fusing focus stack and full-focus image |
US20220309696A1 (en) * | 2021-03-23 | 2022-09-29 | Mediatek Inc. | Methods and Apparatuses of Depth Estimation from Focus Information |
CN114792430A (en) * | 2022-04-24 | 2022-07-26 | 深圳市安软慧视科技有限公司 | Pedestrian re-identification method, system and related equipment based on polarization self-attention |
CN115830240A (en) * | 2022-12-14 | 2023-03-21 | 山西大学 | Unsupervised deep learning three-dimensional reconstruction method based on image fusion visual angle |
Non-Patent Citations (3)
Title |
---|
TIAN, B, ET AL.: "Fine-grained multi-focus image fusion based on edge features", Scientific Reports, vol. 13, no. 1 *
ZHOU MENG ET AL.: "Depth estimation method for focal stacks based on defocus blur characteristics", Journal of Computer Applications *
ZHANG XUEFEI: "Research on unsupervised deep learning models for monocular depth estimation", China Master's Theses Electronic Journal Network *
Also Published As
Publication number | Publication date |
---|---|
CN116823914B (en) | 2024-01-09 |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |