CN116823914A - Unsupervised focal stack depth estimation method based on all-focusing image synthesis - Google Patents

Unsupervised focal stack depth estimation method based on all-focusing image synthesis

Info

Publication number
CN116823914A
Authority
CN
China
Prior art keywords
image
focus
representing
focal stack
pyramid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311101094.7A
Other languages
Chinese (zh)
Other versions
CN116823914B (en)
Inventor
黄章进
周萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202311101094.7A priority Critical patent/CN116823914B/en
Publication of CN116823914A publication Critical patent/CN116823914A/en
Application granted granted Critical
Publication of CN116823914B publication Critical patent/CN116823914B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an unsupervised focal stack depth estimation method based on all-in-focus image synthesis, which comprises the following steps: S1, computing all-in-focus images with an image-pyramid-based method and a focus-measure-operator-based method, and fusing the obtained all-in-focus images to serve as supervision information; S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module; S3, introducing a three-dimensional polarized self-attention mechanism for the focal stack, which transforms the input feature map into a channel polarization feature map and a spatial polarization feature map; and S4, locating the layer of maximum sharpness in the focal stack with a hierarchical depth probability prediction module, outputting the corresponding probability values, determining the layer of best sharpness, and obtaining the all-in-focus image. The method achieves relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and is highly practical.

Description

Unsupervised focal stack depth estimation method based on all-focusing image synthesis
Technical Field
The invention relates to the technical field of monocular depth estimation, in particular to an unsupervised focal stack depth estimation method based on full-focusing image synthesis.
Background
Supervised approaches achieve high accuracy in depth estimation, but they require depth ground truth, which can be difficult to obtain in practical application scenarios. In recent years, with the continuous development of deep learning and the exploration of computer vision, unsupervised monocular depth estimation has made great progress. Unsupervised monocular depth estimation refers to estimating the depth of a scene with a computer vision algorithm without depth labels. Unsupervised focal stack depth estimation can be divided into two categories: reconstruction supervision and auxiliary supervision.
Reconstruction supervision supervises the network through a reconstruction loss and thereby learns depth information. It treats unsupervised focal stack depth estimation as a special case of multi-view monocular depth estimation: the scene depth is estimated from the blur differences across the focusing sequence, the focus map and the estimated intermediate depth are then used to refocus and output a focal stack, and the reconstruction loss provides the supervision. However, because depth estimation is ill-posed, reconstruction-based models easily yield multiple competing depth solutions, making it hard to determine the optimal one, so the network training is very unstable; moreover, the intermediate representation is easily interpreted as a compressed encoding of the focal stack information, which hinders convergence, and additional losses therefore often have to be introduced to constrain the intermediate representation.
Auxiliary supervision guides the learning process of the network with auxiliary information in the absence of ground-truth supervision, typically adopting an all-in-focus image as the auxiliary supervision signal. However, such models have limitations: they contain a large number of parameters and require the dataset itself to provide an all-in-focus image as supervision, so their applicability is restricted. How to provide an unsupervised focal stack depth estimation method based on all-in-focus image synthesis is therefore a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide an unsupervised focal stack depth estimation method based on all-in-focus image synthesis that achieves relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and is highly practical.
According to an embodiment of the invention, an unsupervised focal stack depth estimation method based on all-in-focus image synthesis comprises the following steps:
S1, computing all-in-focus images with an image-pyramid-based all-in-focus image synthesis method and a focus-measure-operator-based all-in-focus image synthesis method, and fusing the obtained all-in-focus images to serve as supervision information;
S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module to obtain preliminary features, obtaining features encoding the blur ambiguity through a differential-value calculation module, and cascading the preliminary features with the blur-ambiguity features to obtain a focal volume;
S3, introducing a three-dimensional polarized self-attention mechanism for the focal stack, which transforms the input focal volume into a channel polarization feature map and a spatial polarization feature map;
S4, feeding the channel polarization feature map and the spatial polarization feature map to a hierarchical depth probability prediction module that locates the layer of maximum sharpness in the focal stack and outputs the corresponding probability values, determining the layer of best sharpness and obtaining the all-in-focus image.
Optionally, the image pyramid is constructed as follows:
Gaussian pyramid downsampling: the original image I serves as the bottom layer G_0 of the Gaussian pyramid, with resolution W×H; the i-th layer of the Gaussian pyramid is defined by
G_i = Down(K * G_{i-1});
where * denotes the convolution operation, K denotes the Gaussian convolution kernel, and Down(·) denotes a downsampling step that removes the even rows and even columns of the input image;
each downsampling step reduces the resolution of the input image to one quarter of the original, and iterating this step yields the whole Gaussian pyramid;
Gaussian pyramid upsampling expands the original image to twice its size in each direction, fills the newly added rows and columns with zeros, and convolves the enlarged image with the same kernel as before multiplied by four to obtain a reconstructed image;
the Laplacian pyramid is introduced on the reconstructed images; let L_i denote the i-th layer of the Laplacian pyramid:
L_i = G_i - Up(G_{i+1});
where Up(·) denotes the upsampling process, i.e., expanding the image to twice its size in each direction with the newly added rows and columns filled with zeros;
the original image I is thus decomposed into a Gaussian pyramid and a Laplacian pyramid, and the same decomposition is applied to each image in the focal stack, resulting in a set of image pyramids.
Optionally, the fusion process over the image pyramids specifically comprises:
given a focal stack sequence {I_n(x, y)}, n = 1, …, N,
where (x, y) denotes the spatial coordinates of a pixel and N denotes the number of images in the focusing sequence, each picture corresponding to a specific focusing distance;
each image of the focal stack {I_n} is decomposed into its image pyramid, yielding Gaussian pyramids G_n^j and Laplacian pyramids L_n^j, where j = 1, …, l and l denotes the number of pyramid layers;
a focus measure is applied to each layer L_n^j of the Laplacian pyramids to obtain the index map M^j corresponding to the maximum sharpness; the all-in-focus Laplacian pyramid L_AIF^j is generated from the index map and the Laplacian pyramids:
L_AIF^j(x, y) = L_{M^j(x, y)}^j(x, y);
starting from the top layer of the Gaussian pyramid, the all-in-focus Laplacian pyramid L_AIF is up-sampled from top to bottom to reconstruct the all-in-focus image corresponding to the focal stack.
Optionally, the image-pyramid-based all-in-focus image synthesis method specifically comprises: decomposing the input focal stack {I_n} into image pyramids to obtain the Gaussian pyramids G_n^j and the Laplacian pyramids L_n^j; performing a regional information-entropy calculation on the Laplacian pyramids to obtain a focus-measure sharpness value for each layer; extracting the content of maximum sharpness as the all-in-focus component of the corresponding level; and reconstructing the final all-in-focus image.
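For the regional information-entropy measure mentioned above, a minimal sketch is given below; the 7x7 window and 64-bin quantisation are illustrative assumptions rather than values taken from the patent.

```python
# Sketch of a regional information-entropy sharpness map on one Laplacian level
# (window size and bin count are assumptions; higher entropy ~ richer local detail).
from scipy.ndimage import uniform_filter

def regional_entropy(level: np.ndarray, win: int = 7, bins: int = 64) -> np.ndarray:
    lo, hi = level.min(), level.max()
    q = np.clip(((level - lo) / (hi - lo + 1e-8) * (bins - 1)).astype(int), 0, bins - 1)
    entropy = np.zeros(level.shape, dtype=np.float32)
    for b in range(bins):
        p = uniform_filter((q == b).astype(np.float32), size=win)  # local probability of bin b
        entropy -= np.where(p > 0, p * np.log2(p + 1e-12), 0.0)
    return entropy
```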
Optionally, the all-in-focus image synthesis method based on the focus measurement operator comprises: applying a small-region neighbourhood fusion operator to each image I_n of the focusing sequence to obtain focus-measure sharpness values for all focus images; performing index maximization to determine the index corresponding to the best sharpness; and extracting the pixel values from the focal stack according to that index to serve as the all-in-focus image.
Optionally, the all-in-focus image synthesis method based on the focus measurement operator specifically comprises:
converting the vector-valued image into a scalar-valued image through vector operations to obtain an integrated feature:
let v denote a vector-valued pixel and s a scalar-valued pixel; select a window of size n×n, let v_c be the centre vector-valued pixel and v_i the vector-valued pixels inside the window W;
the scalar-valued pixel s corresponding to the vector-valued pixel v_c is obtained by scaling the lengths of the difference vectors within the window;
the difference vector w_i is obtained by computing the difference between each other vector v_i in the window W and the centre vector v_c:
w_i = v_i - v_c;
the scalar value s is formed from the dot products of the resulting difference vectors, scaled by a locally adaptive scaling factor: the dot product between difference vectors measures the similarity between features, while the cross product between a difference vector w_i and the centre vector v_c provides the corresponding length term;
the resulting scalar-valued image is fed to an index-maximization operation to evaluate the sharpness of the image, and the pixel values at the corresponding positions are extracted from the input focal stack according to the index of best sharpness to obtain the corresponding all-in-focus image.
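The small-window vector operator can be approximated as in the following sketch. The exact dot-product/cross-product weighting and the locally adaptive scaling factor are not reproduced here (they are not fully specified in this text); the sketch simply accumulates the squared lengths of the difference vectors w_i = v_i - v_c as the scalar sharpness value.

```python
# Simplified sketch of a small-window vector-to-scalar sharpness operator
# (approximation only; not the patented weighting).
def vector_window_sharpness(rgb: np.ndarray, n: int = 3) -> np.ndarray:
    """rgb: H x W x 3 float32 image; returns an H x W scalar sharpness map."""
    pad = n // 2
    padded = np.pad(rgb, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    sharp = np.zeros(rgb.shape[:2], dtype=np.float32)
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[pad + dy: pad + dy + rgb.shape[0],
                               pad + dx: pad + dx + rgb.shape[1]]
            w = neighbour - rgb                      # difference vector w_i = v_i - v_c
            sharp += np.einsum("hwc,hwc->hw", w, w)  # accumulate squared vector lengths
    return sharp

# Usage idea: compute the map for every slice, take argmax over the stack, gather pixels.
```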
Optionally, the three-dimensional perception module performs the high-frequency noise filtering and preliminary feature extraction of the focal stack through a four-layer network structure comprising several parallel convolution layers with different kernel sizes and strides, which capture blur features at different scales;
step S2 specifically comprises the following steps:
S21, filtering the focal stack with a 3D convolutional network to extract blur features;
S22, introducing a differential-value calculation module into the network structure; the blur features are fed into the module, which computes the differential values across the three RGB channels,
where D denotes the fused RGB channel difference and c denotes the different colour dimensions of the input feature;
S23, obtaining the RGB differential features through a downsampling layer and fusing them with the blur features to construct a focal volume that encodes the blur ambiguity.
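A rough PyTorch sketch of step S2 is given below. The layer counts, kernel sizes and the pairwise colour-difference formula are assumptions for illustration; only the overall structure (parallel 3D convolutions plus a channel-difference branch, concatenated into a focal volume) follows the description above.

```python
# Sketch of the S2 focal-volume construction (hyper-parameters and the exact
# channel-difference formula are assumptions, not the patented values).
import torch
import torch.nn as nn

class FocalVolumeBuilder(nn.Module):
    def __init__(self, feat_ch: int = 16):
        super().__init__()
        # parallel 3D convolutions with different kernel sizes (multi-scale blur features)
        self.branch3 = nn.Conv3d(3, feat_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv3d(3, feat_ch, kernel_size=5, padding=2)
        self.fuse = nn.Conv3d(2 * feat_ch, feat_ch, kernel_size=1)

    def forward(self, stack: torch.Tensor) -> torch.Tensor:
        # stack: B x 3 x N x H x W (N = number of focus slices)
        blur_feat = self.fuse(torch.cat([self.branch3(stack), self.branch5(stack)], dim=1))
        r, g, b = stack[:, 0:1], stack[:, 1:2], stack[:, 2:3]
        # assumed encoding of the RGB channel difference D: pairwise absolute colour differences
        diff = torch.cat([(r - g).abs(), (g - b).abs(), (r - b).abs()], dim=1)
        return torch.cat([blur_feat, diff], dim=1)   # cascade blur features with ambiguity features
```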
Optionally, the channel polarization feature map is obtained by applying a polarization transform to the input feature map x:
the polarization transform converts the input feature map x into two sets of basis vectors Q_ch and K_ch,
where Q_ch and K_ch are the query and key at the channel level;
the similarity score A_ch between Q_ch and K_ch is computed as
A_ch(x) = F_SG[ W_z( σ_1(W_v(x)) × F_SM(σ_2(W_q(x))) ) ];
where F_SG denotes the activation (Sigmoid) function, F_SM denotes the normalized exponential (Softmax) function, W_q, W_v and W_z respectively denote 1×1 three-dimensional convolution layers, σ_1 and σ_2 denote two tensor reshaping operators, × denotes element-level multiplication, and C_m denotes the number of internal channels between W_q/W_v and W_z;
using the score A_ch as weights, the input vectors are weighted and summed to obtain the channel polarization feature map Z_ch encoding the channel correlation:
Z_ch = A_ch(x) ⊙_ch x;
where ⊙_ch denotes the channel-level multiplication operator.
Optionally, the spatial polarization feature map is obtained as follows:
the input channel polarization feature map Z_ch undergoes a polarization change to give two sets of polarization vectors Q_sp and V_sp,
where Q_sp acquires global spatial features through global pooling F_GP over the three dimensions, and V_sp rearranges the pixels of the input feature map through a three-dimensional convolution to enhance features along different spatial directions;
the similarity matrix A_sp is computed from the two sets of polarization vectors:
A_sp(Z_ch) = F_SG[ σ_3( F_SM(σ_1(F_GP(W_q(Z_ch)))) × σ_2(W_v(Z_ch)) ) ];
where W_q and W_v respectively denote standard 1×1 three-dimensional convolution layers, C_m denotes the intermediate channel parameter of the convolutions, σ_1, σ_2 and σ_3 denote three tensor reshaping operations, × denotes the matrix dot-product operation, and F_GP denotes global pooling;
the corresponding weights obtained from the similarity matrix are combined with the input channel polarization features by weighted summation, giving the integrated self-attention feature representation Z_sp that associates the channel and spatial features:
Z_sp = A_sp(Z_ch) ⊙_sp Z_ch;
where ⊙_sp denotes the spatial multiplication operator.
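The two polarization branches described above can be sketched as a single module, loosely following a polarized self-attention design adapted to 5-D focal volumes. The tensor layout, normalisation, reshaping and the internal channel count (channels // 2 below) are assumptions; treat this as an illustrative adaptation rather than the patented formulation.

```python
# Illustrative 3D polarized self-attention block (layout and C_m = channels // 2 assumed).
import torch
import torch.nn as nn

class PolarizedSelfAttention3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        c = channels // 2
        # channel branch
        self.ch_q = nn.Conv3d(channels, 1, kernel_size=1)
        self.ch_v = nn.Conv3d(channels, c, kernel_size=1)
        self.ch_z = nn.Conv3d(c, channels, kernel_size=1)
        self.ln = nn.LayerNorm(channels)
        # spatial branch
        self.sp_q = nn.Conv3d(channels, c, kernel_size=1)
        self.sp_v = nn.Conv3d(channels, c, kernel_size=1)
        self.softmax = nn.Softmax(dim=-1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, C, d, h, w = x.shape
        # channel polarization: which channels matter, given a softmax over all positions
        q = self.softmax(self.ch_q(x).view(b, 1, -1))              # B x 1 x (D*H*W)
        v = self.ch_v(x).view(b, C // 2, -1)                       # B x C/2 x (D*H*W)
        z = torch.bmm(v, q.transpose(1, 2)).view(b, C // 2, 1, 1, 1)
        ch_weight = self.sigmoid(
            self.ln(self.ch_z(z).view(b, C)).view(b, C, 1, 1, 1))  # B x C x 1 x 1 x 1
        x_ch = x * ch_weight                                       # channel-level multiplication
        # spatial polarization: which positions matter, given globally pooled channel queries
        q = self.softmax(self.sp_q(x_ch).mean(dim=(2, 3, 4)).unsqueeze(1))  # B x 1 x C/2
        v = self.sp_v(x_ch).view(b, C // 2, -1)                             # B x C/2 x (D*H*W)
        sp_weight = self.sigmoid(torch.bmm(q, v).view(b, 1, d, h, w))       # B x 1 x D x H x W
        return x_ch * sp_weight                                             # spatial multiplication
```

For example, psa = PolarizedSelfAttention3D(64) applied to a tensor of shape (2, 64, 10, 32, 32) returns a tensor of the same shape reweighted along channels and positions.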
Optionally, step S4 specifically comprises:
S41, after an encoder-decoder network from which the pooling layers have been removed, the output of the focal stack depth estimation network is divided into several layers, each layer corresponding to a specific focusing distance;
S42, a Softmax operation is applied across the layers to determine the layer of best sharpness, giving the best focusing position and the all-in-focus image;
S43, the final depth estimation result is obtained by a weighted summation of the probability values over the layers.
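A sketch of this hierarchical prediction head follows. The probability-weighted sum over focus distances and the gathering of the all-in-focus image from the most probable layer are a natural reading of S41-S43, but the tensor names and shapes are illustrative assumptions.

```python
# Sketch of the layered depth probability head (shapes and names are assumptions).
import torch

def layered_depth(logits: torch.Tensor, focus_dists: torch.Tensor, stack: torch.Tensor):
    """logits: B x N x H x W network output; focus_dists: N focus distances;
    stack: B x N x 3 x H x W input focal stack."""
    prob = torch.softmax(logits, dim=1)                          # per-layer probability
    depth = (prob * focus_dists.view(1, -1, 1, 1)).sum(dim=1)    # B x H x W expected depth
    best = prob.argmax(dim=1, keepdim=True)                      # layer with best sharpness
    aif = torch.gather(stack, 1,
                       best.unsqueeze(2).expand(-1, -1, 3, -1, -1)).squeeze(1)
    return depth, aif                                            # depth map and all-in-focus image
```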
The beneficial effects of the invention are as follows:
The invention first synthesizes an all-in-focus image and uses it as supervision information, and then performs depth estimation through a coarse feature extraction module, a polarized self-attention module and a hierarchical depth estimation module. Using the all-in-focus image synthesized from the focal stack as supervision information and exploiting the association capability of the self-attention mechanism to recover scene depth, the method achieves relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and is highly practical.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is the model of unsupervised focal stack depth estimation in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 2 is a block diagram of the focus-measure sharpness metric in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 3 is a qualitative comparison of all-in-focus image synthesis in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 4 is a block diagram of the three-dimensional perception module in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 5 is a block diagram of the channel difference module in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 6 is a visual comparison of generalization performance on DefocusNet in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 7 is a visual comparison of generalization performance on MobileDepth in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
Referring to fig. 1, an unsupervised focal stack depth estimation method based on all-in-focus image synthesis includes:
S1, computing all-in-focus images with an image-pyramid-based all-in-focus image synthesis method and a focus-measure-operator-based all-in-focus image synthesis method, and fusing the obtained all-in-focus images to serve as supervision information;
referring to fig. 2, a process of synthesizing an all-in-focus image by two methods is shown in this embodiment.
In the figure, {I_n} denotes the focusing sequence. For Gaussian pyramid downsampling, the original image I serves as the bottom layer G_0 of the Gaussian pyramid, with resolution W×H; the i-th layer of the Gaussian pyramid is defined by
G_i = Down(K * G_{i-1});
where * denotes the convolution operation, K denotes the Gaussian convolution kernel, and Down(·) denotes a downsampling step that removes the even rows and even columns of the input image;
each downsampling step reduces the resolution of the input image to one quarter of the original, and iterating this step yields the whole Gaussian pyramid;
the Laplacian pyramid is then built on the reconstructed images, with the i-th layer
L_i = G_i - Up(G_{i+1});
where Up(·) denotes the upsampling process, i.e., expanding the image to twice its size in each direction with the newly added rows and columns filled with zeros;
the original image I is thus decomposed into a Gaussian pyramid and a Laplacian pyramid, and the same decomposition is applied to each image in the focal stack, resulting in a set of image pyramids.
In this embodiment, the fusion process over the image pyramids specifically comprises:
given a focal stack sequence {I_n(x, y)}, n = 1, …, N,
where (x, y) denotes the spatial coordinates of a pixel and N denotes the number of images in the focusing sequence, each picture corresponding to a specific focusing distance;
each image of the focal stack {I_n} is decomposed into its image pyramid, yielding Gaussian pyramids G_n^j and Laplacian pyramids L_n^j, where j = 1, …, l and l denotes the number of pyramid layers;
a focus measure is applied to each layer L_n^j of the Laplacian pyramids to obtain the index map M^j corresponding to the maximum sharpness; the all-in-focus Laplacian pyramid L_AIF^j is generated from the index map and the Laplacian pyramids: L_AIF^j(x, y) = L_{M^j(x, y)}^j(x, y);
starting from the top layer of the Gaussian pyramid, the all-in-focus Laplacian pyramid L_AIF is up-sampled from top to bottom to reconstruct the all-in-focus image corresponding to the focal stack.
In this embodiment, the image-pyramid-based all-in-focus image synthesis method specifically comprises: decomposing the input focal stack {I_n} to obtain the Gaussian pyramids G_n^j and the Laplacian pyramids L_n^j; because the whole decomposition process is completely reversible, this image transformation incurs no information loss; a regional information-entropy calculation is then performed on the Laplacian pyramids to obtain a focus-measure sharpness value for each layer, the content of maximum sharpness is extracted as the all-in-focus component of the corresponding level, and the final all-in-focus image is obtained by reconstruction.
In this embodiment, the all-in-focus image synthesis method based on the focus measurement operator comprises: applying a small-region neighbourhood fusion operator to each image I_n of the focusing sequence to obtain focus-measure sharpness values for all focus images, performing index maximization to determine the index corresponding to the best sharpness, and extracting the pixel values from the focal stack according to that index to serve as the all-in-focus image.
The all-in-focus image fusion algorithm based on the image pyramid and the small-window fusion operator can synthesize high-quality all-in-focus images. The proposed model exploits the global correlation structure to effectively improve depth prediction accuracy, while its lightweight design gives the model real-time inference capability.
Referring to fig. 3, in this embodiment the all-in-focus image synthesis method based on the focus measurement operator specifically comprises:
converting the vector-valued image into a scalar-valued image through vector operations to obtain an integrated feature:
let v denote a vector-valued pixel and s a scalar-valued pixel; select a window of size n×n, let v_c be the centre vector-valued pixel and v_i the vector-valued pixels inside the window W;
the scalar-valued pixel s corresponding to the vector-valued pixel v_c is obtained by scaling the lengths of the difference vectors within the window;
the difference vector w_i is obtained by computing the difference between each other vector v_i in the window W and the centre vector v_c:
w_i = v_i - v_c;
the scalar value s is formed from the dot products of the resulting difference vectors, scaled by a locally adaptive scaling factor that plays an important role in computing the scalar feature image: the dot product between difference vectors measures the similarity between features, while the cross product between a difference vector w_i and the centre vector v_c provides the corresponding length term;
the resulting scalar-valued image is fed to an index-maximization operation to evaluate the sharpness of the image, and the pixel values at the corresponding positions are extracted from the input focal stack according to the index of best sharpness to obtain the corresponding all-in-focus image.
S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module to obtain preliminary features, obtaining features encoding the blur ambiguity through a differential-value calculation module, and cascading the preliminary features with the blur-ambiguity features to obtain a focal volume;
in this embodiment, the three-dimensional perception module performs the high-frequency noise filtering and preliminary feature extraction of the focal stack through a four-layer network structure comprising several parallel convolution layers with different kernel sizes and strides, which capture blur features at different scales;
referring to fig. 4, S2 specifically comprises:
S21, filtering the focal stack with a 3D convolutional network to extract blur features;
S22, introducing a differential-value calculation module into the network structure; the blur features are fed into the module, which computes the differential values across the three RGB channels, where D denotes the fused RGB channel difference and c denotes the different colour dimensions of the input feature;
S23, obtaining the RGB differential features through a downsampling layer and fusing them with the blur features to construct a focal volume that encodes the blur ambiguity.
S3, introducing a three-dimensional polarized self-attention mechanism for the focal stack, which transforms the input focal volume into a channel polarization feature map and a spatial polarization feature map;
in this embodiment, the channel polarization feature map is obtained by performing polarization transformation on the input feature map x:
the polarization transformation converts the input feature map x into two sets of basis vectors and />
wherein , and />Query and key corresponding to channel level;
calculation of and />Similarity score +.>
;
wherein ,representing an activation function->Representing normalized exponential function, ++>、/> and />Respectively representing 1 x 1 three-dimensional convolution layers, < >> and />Representing two tensor remodelling operators, x represents multiplication at element level, +.> and />And->The number of channels between is->
Score for useAs weights, the input vectors are weighted and summed to obtain a channel polarization characteristic map of the channel correlation>
;
wherein ,representing a channel-level multiply operator.
In this embodiment, the spatial polarization feature map is obtained as follows:
the input channel polarization feature map Z_ch undergoes a polarization change to give two sets of polarization vectors Q_sp and V_sp,
where Q_sp acquires global spatial features through global pooling F_GP over the three dimensions, and V_sp rearranges the pixels of the input feature map through a three-dimensional convolution to enhance features along different spatial directions;
the similarity matrix A_sp is computed from the two sets of polarization vectors:
A_sp(Z_ch) = F_SG[ σ_3( F_SM(σ_1(F_GP(W_q(Z_ch)))) × σ_2(W_v(Z_ch)) ) ];
where W_q and W_v respectively denote standard 1×1 three-dimensional convolution layers, C_m denotes the intermediate channel parameter of the convolutions, σ_1, σ_2 and σ_3 denote three tensor reshaping operations, × denotes the matrix dot-product operation, and F_GP denotes global pooling;
the corresponding weights obtained from the similarity matrix are combined with the input channel polarization features by weighted summation, giving the integrated self-attention feature representation Z_sp that associates the channel and spatial features: Z_sp = A_sp(Z_ch) ⊙_sp Z_ch;
where ⊙_sp denotes the spatial multiplication operator.
It should be noted that all of the above convolution operations and tensor reshaping operations are carried out in three dimensions; the three-dimensional polarized self-attention mechanism can therefore account for both the channel correlation and the spatial blur correlation.
The proposed model performs well even on smaller focal stacks and has excellent generalization capability.
S4, feeding the channel polarization feature map and the spatial polarization feature map to the hierarchical depth probability prediction module, which locates the layer of maximum sharpness in the focal stack and outputs the corresponding probability values, determining the layer of best sharpness and obtaining the all-in-focus image.
In this embodiment, S4 specifically comprises:
S41, after an encoder-decoder network from which the pooling layers have been removed, the output of the focal stack depth estimation network is divided into several layers, each layer corresponding to a specific focusing distance;
S42, a Softmax operation is applied across the layers to determine the layer of best sharpness, giving the best focusing position and the all-in-focus image;
during testing, the level of the target depth is determined using the blur information in the input focusing sequence, and the depth probability value is calculated with the probability density function of the corresponding level;
S43, the final depth estimation result is obtained by a weighted summation of the probability values over the layers.
Example 1:
The invention is evaluated quantitatively on the 4D Light Field, DefocusNet and FlyingThings3D datasets:
As can be seen from Table 1 above, the proposed all-in-focus image synthesis method can synthesize comparatively accurate all-in-focus images from a smaller focal stack.
Tables 2-4 above give the quantitative comparison between the present invention and the latest methods on the 4D Light Field, DefocusNet and FlyingThings3D datasets.
As can be seen from Tables 1-4 above, the results on the 4D Light Field dataset show that, for unsupervised depth estimation, the present invention improves on the MSE and RMSE indices of the AiFDepthNet method by 42.5% and 26.3%, respectively. Compared with supervised approaches, the present method outperforms most of them, including VDFF, PSPNet and DDFF, and its MSE and RMSE differ from those of the DefocusNet method by only 15.0% and 4.6%. The results on the DefocusNet and FlyingThings3D datasets show that the method achieves higher accuracy than AiFDepthNet on the MAE, MSE and RMSE indices. Compared with the 16M parameters of AiFDepthNet, the present method has far fewer parameters (3.3M) and higher computational efficiency.
The invention first synthesizes an all-in-focus image and uses it as supervision information, and then performs depth estimation through a coarse feature extraction module, a polarized self-attention module and a hierarchical depth estimation module. Using the all-in-focus image synthesized from the focal stack as supervision information and exploiting the association capability of the self-attention mechanism to recover scene depth, the method achieves relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and is highly practical.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitution or modification made, within the technical scope disclosed by the present invention, by a person skilled in the art according to the technical solution and the inventive concept of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. An unsupervised focal stack depth estimation method based on all-in-focus image synthesis, comprising:
S1, computing all-in-focus images with an image-pyramid-based all-in-focus image synthesis method and a focus-measure-operator-based all-in-focus image synthesis method, and fusing the obtained all-in-focus images to serve as supervision information;
S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module to obtain preliminary features, obtaining features encoding the blur ambiguity through a differential-value calculation module, and cascading the preliminary features with the blur-ambiguity features to obtain a focal volume;
S3, introducing a three-dimensional polarized self-attention mechanism for the focal stack, which transforms the input focal volume into a channel polarization feature map and a spatial polarization feature map;
S4, feeding the channel polarization feature map and the spatial polarization feature map to a hierarchical depth probability prediction module that locates the layer of maximum sharpness in the focal stack and outputs the corresponding probability values, determining the layer of best sharpness and obtaining the all-in-focus image.
2. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein the image pyramid specifically comprises:
Gaussian pyramid downsampling: the original image I serves as the bottom layer G_0 of the Gaussian pyramid, with resolution W×H; the i-th layer of the Gaussian pyramid is defined by
G_i = Down(K * G_{i-1});
where * denotes the convolution operation, K denotes the Gaussian convolution kernel, and Down(·) denotes a downsampling step that removes the even rows and even columns of the input image;
each downsampling step reduces the resolution of the input image to one quarter of the original, and iterating this step yields the whole Gaussian pyramid;
Gaussian pyramid upsampling expands the original image to twice its size in each direction, fills the newly added rows and columns with zeros, and convolves the enlarged image with the same kernel as before multiplied by four to obtain a reconstructed image;
the Laplacian pyramid is introduced on the reconstructed images; let L_i denote the i-th layer of the Laplacian pyramid:
L_i = G_i - Up(G_{i+1});
where Up(·) denotes the upsampling process, i.e., expanding the image to twice its size in each direction with the newly added rows and columns filled with zeros;
the original image I is thus decomposed into a Gaussian pyramid and a Laplacian pyramid, and the same decomposition is applied to each image in the focal stack, resulting in a set of image pyramids.
3. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 2, wherein the fusion process over the image pyramids specifically comprises:
given a focal stack sequence {I_n(x, y)}, n = 1, …, N,
where (x, y) denotes the spatial coordinates of a pixel and N denotes the number of images in the focusing sequence, each picture corresponding to a specific focusing distance;
each image of the focal stack {I_n} is decomposed into its image pyramid, yielding Gaussian pyramids G_n^j and Laplacian pyramids L_n^j, where j = 1, …, l and l denotes the number of pyramid layers;
a focus measure is applied to each layer L_n^j of the Laplacian pyramids to obtain the index map M^j corresponding to the maximum sharpness; the all-in-focus Laplacian pyramid L_AIF^j is generated from the index map and the Laplacian pyramids: L_AIF^j(x, y) = L_{M^j(x, y)}^j(x, y);
starting from the top layer of the Gaussian pyramid, the all-in-focus Laplacian pyramid L_AIF is up-sampled from top to bottom to reconstruct the all-in-focus image corresponding to the focal stack.
4. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 3, wherein the image-pyramid-based all-in-focus image synthesis method specifically comprises: decomposing the input focal stack {I_n} to obtain the Gaussian pyramids G_n^j and the Laplacian pyramids L_n^j; performing a regional information-entropy calculation on the Laplacian pyramids to obtain a focus-measure sharpness value for each layer; extracting the content of maximum sharpness as the all-in-focus component of the corresponding level; and reconstructing the final all-in-focus image.
5. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 3, wherein the all-in-focus image synthesis method based on the focus measurement operator comprises: applying a small-region neighbourhood fusion operator to each image I_n of the focusing sequence to obtain focus-measure sharpness values for all focus images; performing index maximization to determine the index corresponding to the best sharpness; and extracting the pixel values from the focal stack according to that index to serve as the all-in-focus image.
6. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 5, wherein the all-in-focus image synthesis method based on the focus measurement operator specifically comprises:
converting the vector-valued image into a scalar-valued image through vector operations to obtain an integrated feature:
let v denote a vector-valued pixel and s a scalar-valued pixel; select a window of size n×n, let v_c be the centre vector-valued pixel and v_i the vector-valued pixels inside the window W;
the scalar-valued pixel s corresponding to the vector-valued pixel v_c is obtained by scaling the lengths of the difference vectors within the window;
the difference vector w_i is obtained by computing the difference between each other vector v_i in the window W and the centre vector v_c:
w_i = v_i - v_c;
the scalar value s is formed from the dot products of the resulting difference vectors, scaled by a locally adaptive scaling factor: the dot product between difference vectors measures the similarity between features, while the cross product between a difference vector w_i and the centre vector v_c provides the corresponding length term;
the resulting scalar-valued image is fed to an index-maximization operation to evaluate the sharpness of the image, and the pixel values at the corresponding positions are extracted from the input focal stack according to the index of best sharpness to obtain the corresponding all-in-focus image.
7. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein the three-dimensional perception module performs the high-frequency noise filtering and preliminary feature extraction of the focal stack through a four-layer network structure comprising several parallel convolution layers with different kernel sizes and strides, which capture blur features at different scales;
step S2 specifically comprises the following steps:
S21, filtering the focal stack with a 3D convolutional network to extract blur features;
S22, introducing a differential-value calculation module into the network structure; the blur features are fed into the module, which computes the differential values across the three RGB channels, where D denotes the fused RGB channel difference and c denotes the different colour dimensions of the input feature;
S23, obtaining the RGB differential features through a downsampling layer and fusing them with the blur features to construct a focal volume that encodes the blur ambiguity.
8. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein the channel polarization feature map is obtained by applying a polarization transform to the input feature map x:
the polarization transform converts the input feature map x into two sets of basis vectors Q_ch and K_ch,
where Q_ch and K_ch are the query and key at the channel level;
the similarity score A_ch between Q_ch and K_ch is computed as
A_ch(x) = F_SG[ W_z( σ_1(W_v(x)) × F_SM(σ_2(W_q(x))) ) ];
where F_SG denotes the activation (Sigmoid) function, F_SM denotes the normalized exponential (Softmax) function, W_q, W_v and W_z respectively denote 1×1 three-dimensional convolution layers, σ_1 and σ_2 denote two tensor reshaping operators, × denotes element-level multiplication, and C_m denotes the number of internal channels between W_q/W_v and W_z;
using the score A_ch as weights, the input vectors are weighted and summed to obtain the channel polarization feature map Z_ch encoding the channel correlation:
Z_ch = A_ch(x) ⊙_ch x;
where ⊙_ch denotes the channel-level multiplication operator.
9. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 8, wherein the spatial polarization feature map is obtained as follows:
the input channel polarization feature map Z_ch undergoes a polarization change to give two sets of polarization vectors Q_sp and V_sp, where Q_sp acquires global spatial features through global pooling F_GP over the three dimensions, and V_sp rearranges the pixels of the input feature map through a three-dimensional convolution to enhance features along different spatial directions;
the similarity matrix A_sp is computed from the two sets of polarization vectors:
A_sp(Z_ch) = F_SG[ σ_3( F_SM(σ_1(F_GP(W_q(Z_ch)))) × σ_2(W_v(Z_ch)) ) ];
where W_q and W_v respectively denote standard 1×1 three-dimensional convolution layers, C_m denotes the intermediate channel parameter of the convolutions, σ_1, σ_2 and σ_3 denote three tensor reshaping operations, × denotes the matrix dot-product operation, and F_GP denotes global pooling;
the corresponding weights obtained from the similarity matrix are combined with the input channel polarization features by weighted summation, giving the integrated self-attention feature representation Z_sp that associates the channel and spatial features:
Z_sp = A_sp(Z_ch) ⊙_sp Z_ch;
where ⊙_sp denotes the spatial multiplication operator.
10. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein S4 specifically comprises:
S41, after an encoder-decoder network from which the pooling layers have been removed, the output of the focal stack depth estimation network is divided into several layers, each layer corresponding to a specific focusing distance;
S42, a Softmax operation is applied across the layers to determine the layer of best sharpness, giving the best focusing position and the all-in-focus image;
S43, the final depth estimation result is obtained by a weighted summation of the probability values over the layers.
CN202311101094.7A 2023-08-30 2023-08-30 Unsupervised focal stack depth estimation method based on all-focusing image synthesis Active CN116823914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311101094.7A CN116823914B (en) 2023-08-30 2023-08-30 Unsupervised focal stack depth estimation method based on all-focusing image synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311101094.7A CN116823914B (en) 2023-08-30 2023-08-30 Unsupervised focal stack depth estimation method based on all-focusing image synthesis

Publications (2)

Publication Number Publication Date
CN116823914A 2023-09-29
CN116823914B (en) 2024-01-09

Family

ID=88141360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311101094.7A Active CN116823914B (en) 2023-08-30 2023-08-30 Unsupervised focal stack depth estimation method based on all-focusing image synthesis

Country Status (1)

Country Link
CN (1) CN116823914B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120218386A1 (en) * 2011-02-28 2012-08-30 Duke University Systems and Methods for Comprehensive Focal Tomography
CN110246172A (en) * 2019-06-18 2019-09-17 首都师范大学 A kind of the light field total focus image extraction method and system of the fusion of two kinds of Depth cues
CN110751160A (en) * 2019-10-30 2020-02-04 华中科技大学 Method, device and system for detecting object in image
CN112465796A (en) * 2020-12-07 2021-03-09 清华大学深圳国际研究生院 Light field feature extraction method fusing focus stack and full-focus image
US20220309696A1 (en) * 2021-03-23 2022-09-29 Mediatek Inc. Methods and Apparatuses of Depth Estimation from Focus Information
CN114792430A (en) * 2022-04-24 2022-07-26 深圳市安软慧视科技有限公司 Pedestrian re-identification method, system and related equipment based on polarization self-attention
CN115830240A (en) * 2022-12-14 2023-03-21 山西大学 Unsupervised deep learning three-dimensional reconstruction method based on image fusion visual angle

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TIAN, B., ET AL.: "Fine-grained multi-focus image fusion based on edge features", Scientific Reports, vol. 13, no. 1 *
ZHOU Meng et al.: "Focal stack depth estimation method based on defocus blur characteristics", Journal of Computer Applications, p. 2 *
ZHANG Xuefei: "Research on unsupervised deep learning models for monocular depth estimation", China Excellent Master's Theses Electronic Journal Network *

Also Published As

Publication number Publication date
CN116823914B (en) 2024-01-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant