CN116823914A - Unsupervised focal stack depth estimation method based on all-in-focus image synthesis - Google Patents
Unsupervised focal stack depth estimation method based on all-in-focus image synthesis
- Publication number
- CN116823914A (application CN202311101094.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- focus
- focal stack
- pyramid
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses an unsupervised focal stack depth estimation method based on all-in-focus image synthesis, which comprises the following steps: S1, computing all-in-focus images with an image-pyramid-based method and a focus-measurement-operator-based method, and fusing the resulting all-in-focus images to serve as supervision information; S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module; S3, introducing a three-dimensional polarized self-attention mechanism into the focal stack and dividing the input feature map into a channel polarization feature map and a spatial polarization feature map; and S4, locating the layer of maximum sharpness in the focal stack with a layered depth probability prediction module, outputting the corresponding probability values, determining the layer of best sharpness, and obtaining the all-in-focus image. The method has relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and has high practicability.
Description
Technical Field
The invention relates to the technical field of monocular depth estimation, and in particular to an unsupervised focal stack depth estimation method based on all-in-focus image synthesis.
Background
Supervised approaches show high accuracy in the depth estimation task, but they require depth ground truth, which may be difficult to obtain in practical application scenarios. In recent years, with the continuous development of deep learning and continued exploration in computer vision, the field of unsupervised monocular depth estimation has made great progress. Unsupervised monocular depth estimation refers to estimating the depth information of a scene through a computer vision algorithm without depth labels. Unsupervised focal stack depth estimation can be divided into two categories: reconstruction supervision and auxiliary supervision.
Reconstruction supervision supervises the network through a reconstruction loss, thereby learning depth information. It regards unsupervised focal stack depth estimation as a special case of multi-view monocular depth estimation: scene depth is estimated from the blur differences of the focus sequence, the focus map and the estimated intermediate depth are used to refocus and output a focal stack, and the reconstruction loss provides the supervision signal. However, because the depth estimation task is ill-posed, the reconstruction model easily yields multiple competing depth solutions, making it difficult to determine the optimal one, so the network is very unstable; meanwhile, the intermediate representation is easily interpreted as a compressed encoding of the focal stack information, making the model hard to converge, so additional losses are often introduced to constrain the intermediate representation.
Auxiliary supervision guides the learning process of the network with auxiliary information in the absence of supervision, adopting the all-in-focus image as the auxiliary supervision information. However, such models have certain limitations, such as a large number of parameters, and they require the dataset itself to provide an all-in-focus image as the supervision information, so their applicability is limited. Therefore, how to provide an unsupervised focal stack depth estimation method based on all-in-focus image synthesis is a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide an unsupervised focal stack depth estimation method based on all-in-focus image synthesis, which has relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and has high practicability.
According to an embodiment of the invention, an unsupervised focal stack depth estimation method based on all-in-focus image synthesis comprises the following steps:
S1, computing all-in-focus images with an all-in-focus image synthesis method based on an image pyramid and an all-in-focus image synthesis method based on a focus measurement operator, and fusing the resulting all-in-focus images to serve as supervision information;
S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module to obtain preliminary features, obtaining blur-ambiguity-encoding features through a differential value calculation module, and concatenating the preliminary features and the blur-ambiguity features to obtain a focal volume;
S3, introducing a three-dimensional polarized self-attention mechanism into the focal stack, and dividing the input focal volume into a channel polarization feature map and a spatial polarization feature map;
S4, passing the channel polarization feature map and the spatial polarization feature map through the layered depth probability prediction module to locate the layer of maximum sharpness in the focal stack, outputting the corresponding probability values, determining the layer of best sharpness, and obtaining the all-in-focus image.
Optionally, the image pyramid specifically includes:
Gaussian pyramid downsampling: the original image $G_0$ is taken as the bottom layer of the Gaussian pyramid, with the original resolution, and the $i$-th layer of the Gaussian pyramid is defined as:
$G_i = \mathrm{Down}(w * G_{i-1})$
where $*$ denotes the convolution operation, $w$ denotes a Gaussian convolution kernel of size $5 \times 5$, and $\mathrm{Down}(\cdot)$ denotes the downsampling process that removes the even rows and even columns of the input image;
downsampling reduces the resolution of the input image to one quarter, and the whole Gaussian pyramid is obtained by iterating this step;
Gaussian pyramid upsampling expands the original image to twice its size in each direction, fills the newly added rows and columns with 0, and then convolves the enlarged image with four times the same convolution kernel as above to obtain a reconstructed image;
the Laplacian pyramid is introduced from the reconstructed image; let $L_i$ denote the $i$-th layer of the Laplacian pyramid:
$L_i = G_i - \mathrm{Up}(G_{i+1})$
where $\mathrm{Up}(\cdot)$ denotes the upsampling process, i.e., expanding the image to twice its size in each direction with the newly added rows and columns filled with 0;
the original image $G_0$ is thus decomposed into a Gaussian pyramid and a Laplacian pyramid, and the same decomposition is performed for each image in the focal stack, resulting in a set of image pyramids.
Optionally, the fusion process of the image pyramid specifically includes:
given a focal stack sequence:
$\{I_k(x, y)\},\ k = 1, \dots, N$
where $(x, y)$ denotes the spatial coordinates of a pixel and $N$ denotes the number of images in the focus sequence, each image corresponding to a specific focus distance;
each image of the focal stack $I_k$ is decomposed into an image pyramid to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$, where $i = 1, \dots, M$ and $M$ denotes the number of pyramid layers;
focus measurement is performed on each layer $L_i^k$ of the Laplacian pyramids to obtain the index map $D_i(x, y)$ corresponding to the maximum sharpness; the all-in-focus Laplacian pyramid $L_i^{AiF}$ is generated from the index map and the Laplacian pyramids:
$L_i^{AiF}(x, y) = L_i^{D_i(x, y)}(x, y)$
using $L_i^{AiF}$, the all-in-focus Laplacian pyramid is upsampled and accumulated from the top layer down to obtain the all-in-focus image corresponding to the focal stack.
Optionally, the image-pyramid-based all-in-focus image synthesis method specifically includes: decomposing the input focal stack $I_k$ into image pyramids to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$; performing regional information entropy calculation on the Laplacian pyramid to obtain a focus-measure sharpness value for each layer; extracting, for each layer, the content with the maximum sharpness value as the all-in-focus result of that layer; and reconstructing to obtain the final all-in-focus image.
Optionally, the focus-measurement-operator-based all-in-focus image synthesis method includes: applying a small-region neighborhood fusion operator to each image $I_k$ of the focus sequence to obtain focus-measure sharpness values for all images; performing index maximization to determine the index corresponding to the best sharpness; and extracting the pixel values from the focal stack according to that index to form the all-in-focus image.
Optionally, the focus-measurement-operator-based all-in-focus image synthesis method specifically includes:
converting the vector-valued image into a scalar-valued image through vector operations to obtain comprehensive features:
let $u$ denote a vector-valued pixel and $s$ denote a scalar-valued pixel; a window of size $w \times w$ is selected, with $u_c$ the center vector-valued pixel and $u_i$ the vector-valued pixels inside the window $\Omega$;
the scalar-valued pixel $s$ corresponding to the vector-valued pixel $u_c$ is obtained by scaling the lengths of the difference vectors within the window;
the difference vector $d_i$ is obtained by computing the difference between each other vector $u_i$ in the window $\Omega$ and the center vector $u_c$:
$d_i = u_i - u_c$
the scalar value is then formed from the dot products of the resulting vectors, scaled by a local adaptive scaling factor $\alpha$: the dot products between the difference vectors measure the similarity between features, and the cross-product lengths between each difference vector $d_i$ and the center vector $u_c$ provide a complementary term;
the resulting scalar-valued image is used in an index maximization operation to evaluate the sharpness of the images, and the pixel values at the corresponding positions are extracted from the input focal stack according to the index of the best sharpness to obtain the corresponding all-in-focus image.
Optionally, the three-dimensional perception module performs high-frequency noise filtering and preliminary feature extraction on the focal stack through a four-layer network structure, and comprises several parallel convolution layers with different kernel sizes and strides for capturing blur features at different scales;
step S2 specifically comprises the following steps:
S21, filtering the focal stack with a 3D convolution network to extract blur features;
S22, introducing a differential value calculation module into the network structure and feeding the blur features into it; the differential value calculation module computes the differential values across the three R, G and B channels and fuses them into an RGB channel difference, where the channel index ranges over the different color dimensions of the input features;
S23, obtaining the RGB differential features through a downsampling layer, fusing the RGB differential features with the blur features, and constructing a focal volume that incorporates blur ambiguity.
Optionally, the channel polarization feature map is obtained by applying a polarization transformation to the input feature map $X$:
the polarization transformation converts the input feature map $X$ into two sets of basis vectors $W_q(X)$ and $W_v(X)$, corresponding to the channel-level query and key;
the similarity score $A^{ch}$ between them is computed as:
$A^{ch}(X) = \sigma\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$
where $\sigma$ denotes an activation function, $F_{SM}$ denotes the normalized exponential (softmax) function, $W_q$, $W_v$ and $W_z$ denote 1 × 1 three-dimensional convolution layers, $\sigma_1$ and $\sigma_2$ denote two tensor reshaping operators, $\times$ denotes element-level multiplication, and the number of channels between $W_v$ and $W_z$ is $C/2$;
using the score $A^{ch}$ as weights, the input vectors are weighted and summed to obtain the channel polarization feature map $Z^{ch}$ encoding the channel correlation:
$Z^{ch} = A^{ch}(X) \odot^{ch} X$
where $\odot^{ch}$ denotes the channel-level multiplication operator.
Optionally, the method for the spatial polarization feature map includes:
applying a polarization transformation to the input channel polarization feature map $Z^{ch}$ to obtain two sets of polarization vectors $W_q(Z^{ch})$ and $W_v(Z^{ch})$;
where the query branch acquires global spatial features through global pooling $F_{GP}$ over the three dimensions, and the other branch rearranges the pixels of the input feature map through three-dimensional convolution to enhance features along different spatial directions;
the similarity matrix is computed from the two sets of polarization vectors:
$A^{sp}(Z^{ch}) = \sigma\big[ \sigma_3\big( F_{SM}(\sigma_1(F_{GP}(W_q(Z^{ch})))) \times \sigma_2(W_v(Z^{ch})) \big) \big]$
where $W_q$ and $W_v$ denote standard 1 × 1 three-dimensional convolution layers with an intermediate channel number of $C/2$, $\sigma_1$, $\sigma_2$ and $\sigma_3$ denote three tensor reshaping operations, $\times$ denotes the matrix dot-product operation, and $F_{GP}$ denotes global pooling;
the corresponding weights are obtained from the similarity matrix and weighted-summed with the input channel polarization features to obtain the comprehensive self-attention feature representation that associates channel and spatial features:
$Z^{sp} = A^{sp}(Z^{ch}) \odot^{sp} Z^{ch}$
where $\odot^{sp}$ denotes the spatial multiplication operator.
Optionally, step S4 specifically includes:
S41, after passing through a codec (encoder-decoder) network with the pooling layers removed, the output of the focal stack depth estimation network is divided into multiple layers, each layer corresponding to a specific focus distance;
S42, a Softmax operation is applied across the layers to determine the layer with the best sharpness, obtaining the best focus position and the all-in-focus image;
S43, the final depth estimation result is obtained by weighted summation of the multi-layer probability values.
The beneficial effects of the invention are as follows:
The invention first synthesizes the all-in-focus image and uses it as supervision information, and then performs depth estimation through a coarse feature extraction module, a polarized self-attention module and a layered depth estimation module. By using the all-in-focus image synthesized from the focal stack as supervision information and exploiting the association capability of the self-attention mechanism to obtain scene depth, the method achieves relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and has high practicability.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is the unsupervised focal stack depth estimation model in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 2 is a block diagram of the focus measurement sharpness metric in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 3 is a qualitative comparison of all-in-focus image synthesis in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 4 is a block diagram of the three-dimensional perception module in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 5 is a block diagram of the channel difference module in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 6 is a visual comparison of generalization performance on DefocusNet in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention;
FIG. 7 is a visual comparison of generalization performance on MobileDepth in the unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to the present invention.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
Referring to fig. 1, an unsupervised focal stack depth estimation method based on all-in-focus image synthesis includes:
S1, computing all-in-focus images with an all-in-focus image synthesis method based on an image pyramid and an all-in-focus image synthesis method based on a focus measurement operator, and fusing the resulting all-in-focus images to serve as supervision information;
referring to fig. 2, a process of synthesizing an all-in-focus image by two methods is shown in this embodiment.
In the drawingsRepresenting the focusing sequence, gaussian pyramid downsampling, with the original image +.>Represents the bottom layer of the Gaussian pyramid with resolution +.>By defining the gaussian pyramid of the i-th layer:
;
wherein ,representing convolution operations +.>Representing a size of +.>Is a convolution kernel of->A downsampling process that removes even rows and even columns of the input image;
downsampling the resolution of an input imageReducing the height to one fourth, and obtaining the whole Gaussian pyramid by continuously iterating the steps;
;
wherein ,representing the upsampling process, i.e., expanding the image twice as much as it was in each direction, with the newly added rows and columns filled with 0's;
original imageIs decomposed into a gaussian pyramid and a laplacian pyramid, and the same decomposition operation is performed for each image in the focal stack, resulting in a set of image pyramids.
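For illustration, a minimal Python sketch of the pyramid decomposition described above is given below. It assumes OpenCV's pyrDown/pyrUp in place of the explicit 5 × 5 kernel and zero-insertion upsampling, and the layer count is an illustrative parameter rather than a value taken from the patent.

```python
# Sketch: Gaussian/Laplacian decomposition of one focal-stack slice,
# following the standard Burt-Adelson formulation L_i = G_i - Up(G_{i+1}).
import cv2
import numpy as np

def decompose_pyramids(image: np.ndarray, levels: int = 4):
    """Return (gaussian, laplacian) pyramids for one focal-stack slice."""
    gaussian = [image.astype(np.float32)]
    for _ in range(levels - 1):
        gaussian.append(cv2.pyrDown(gaussian[-1]))        # one quarter resolution per level

    laplacian = []
    for i in range(levels - 1):
        h, w = gaussian[i].shape[:2]
        up = cv2.pyrUp(gaussian[i + 1], dstsize=(w, h))   # expand and smooth
        laplacian.append(gaussian[i] - up)                # L_i = G_i - Up(G_{i+1})
    laplacian.append(gaussian[-1])                        # top level kept as residual
    return gaussian, laplacian

# usage: one pyramid pair per image in the focal stack
# stack = [cv2.imread(p) for p in focal_stack_paths]
# pyramids = [decompose_pyramids(img) for img in stack]
```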
In this embodiment, the fusion process of the image pyramid specifically includes:
given a focal stack sequence:
$\{I_k(x, y)\},\ k = 1, \dots, N$
where $(x, y)$ denotes the spatial coordinates of a pixel and $N$ denotes the number of images in the focus sequence, each image corresponding to a specific focus distance;
each image of the focal stack $I_k$ is decomposed into an image pyramid to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$, where $i = 1, \dots, M$ and $M$ denotes the number of pyramid layers;
focus measurement is performed on each layer $L_i^k$ of the Laplacian pyramids to obtain the index map $D_i(x, y)$ corresponding to the maximum sharpness; the all-in-focus Laplacian pyramid $L_i^{AiF}$ is generated from the index map and the Laplacian pyramids:
$L_i^{AiF}(x, y) = L_i^{D_i(x, y)}(x, y)$
using $L_i^{AiF}$, the all-in-focus Laplacian pyramid is upsampled and accumulated from the top layer down to obtain the all-in-focus image corresponding to the focal stack.
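A sketch of this fusion step follows, assuming colour (H, W, C) pyramid levels. The per-pixel focus measure used here is the absolute Laplacian response, a simple stand-in for the regional information entropy measure described below; the index map and top-down reconstruction follow the text above.

```python
# Sketch: build the all-in-focus Laplacian pyramid from the index map and
# collapse it from the top level down.
import cv2
import numpy as np

def fuse_pyramids(laplacian_pyramids):
    """laplacian_pyramids: list over N focal slices, each a list of M levels."""
    n_slices = len(laplacian_pyramids)
    n_levels = len(laplacian_pyramids[0])
    fused = []
    for lvl in range(n_levels):
        layers = np.stack([laplacian_pyramids[k][lvl] for k in range(n_slices)])  # (N, H, W, C)
        sharpness = np.abs(layers).sum(axis=-1)            # stand-in per-pixel focus measure
        index_map = sharpness.argmax(axis=0)               # D(x, y): sharpest slice per pixel
        fused.append(np.take_along_axis(layers, index_map[None, ..., None], axis=0)[0])
    return fused

def reconstruct(fused_laplacian):
    """Collapse the all-in-focus Laplacian pyramid from top to bottom."""
    image = fused_laplacian[-1]
    for lvl in range(len(fused_laplacian) - 2, -1, -1):
        h, w = fused_laplacian[lvl].shape[:2]
        image = cv2.pyrUp(image, dstsize=(w, h)) + fused_laplacian[lvl]
    return np.clip(image, 0, 255).astype(np.uint8)

# usage with the decomposition sketch above:
# aif = reconstruct(fuse_pyramids([lap for _, lap in pyramids]))
```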
In this embodiment, the image-pyramid-based all-in-focus image synthesis method specifically includes: decomposing the input focal stack $I_k$ to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$; because the whole decomposition process is completely reversible, this image transformation incurs no information loss; performing regional information entropy calculation on the Laplacian pyramid to obtain a focus-measure sharpness value for each layer; extracting, for each layer, the content with the maximum sharpness value as the all-in-focus result of that layer; and reconstructing to obtain the final all-in-focus image.
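One plausible form of the regional information entropy focus measure is sketched below, assuming the entropy is computed over a small sliding window of quantised Laplacian magnitudes; the window size and bin count are illustrative choices, not values taken from the patent.

```python
# Sketch: per-pixel local entropy of the Laplacian response as a sharpness map.
import numpy as np
from scipy.ndimage import uniform_filter

def regional_entropy(laplacian_layer: np.ndarray, window: int = 7, bins: int = 16):
    mag = np.abs(laplacian_layer)
    if mag.ndim == 3:                      # collapse colour channels
        mag = mag.sum(axis=-1)
    # quantise responses, then estimate per-pixel bin frequencies with box filters
    q = np.clip((mag / (mag.max() + 1e-8) * bins).astype(int), 0, bins - 1)
    entropy = np.zeros_like(mag, dtype=np.float32)
    for b in range(bins):
        p = uniform_filter((q == b).astype(np.float32), size=window)  # local bin frequency
        entropy -= p * np.log2(np.where(p > 0, p, 1.0))
    return entropy                          # higher entropy = richer local detail
```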
In this embodiment, the focus-measurement-operator-based all-in-focus image synthesis method includes: applying a small-region neighborhood fusion operator to each image $I_k$ of the focus sequence to obtain focus-measure sharpness values for all images; performing index maximization to determine the index corresponding to the best sharpness; and extracting the pixel values from the focal stack according to that index to form the all-in-focus image.
The all-in-focus image fusion algorithm based on the image pyramid and the small-window fusion operator can synthesize high-quality all-in-focus images. The proposed model exploits the global correlation structure to effectively improve the accuracy of depth prediction, while its lightweight design gives it real-time inference capability.
Referring to fig. 3, in this embodiment, the focus-measurement-operator-based all-in-focus image synthesis method specifically includes:
converting the vector-valued image into a scalar-valued image through vector operations to obtain comprehensive features:
let $u$ denote a vector-valued pixel and $s$ denote a scalar-valued pixel; a window of size $w \times w$ is selected, with $u_c$ the center vector-valued pixel and $u_i$ the vector-valued pixels inside the window $\Omega$;
the scalar-valued pixel $s$ corresponding to the vector-valued pixel $u_c$ is obtained by scaling the lengths of the difference vectors within the window;
the difference vector $d_i$ is obtained by computing the difference between each other vector $u_i$ in the window $\Omega$ and the center vector $u_c$:
$d_i = u_i - u_c$
the scalar value is then formed from the dot products of the resulting vectors, scaled by a local adaptive scaling factor $\alpha$, which plays an important role in computing the scalar feature image: the dot products between the difference vectors measure the similarity between features, while the cross-product lengths between each difference vector $d_i$ and the center vector $u_c$ provide a complementary term;
the resulting scalar-valued image is used in an index maximization operation to evaluate the sharpness of the images, and the pixel values at the corresponding positions are extracted from the input focal stack according to the index of the best sharpness to obtain the corresponding all-in-focus image.
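The exact combination of dot- and cross-product terms is not recoverable from the translated text; the sketch below is therefore only one plausible reading of the small-window vector operator, in which difference vectors are formed inside a w × w window and their squared lengths plus the cross-product magnitude with the centre vector are accumulated and scaled.

```python
# Sketch (assumed form): scalar sharpness map from a vector-valued RGB slice.
import numpy as np

def vector_to_scalar(image: np.ndarray, window: int = 3, eps: float = 1e-8):
    """image: (H, W, 3) RGB slice -> (H, W) scalar sharpness map."""
    img = image.astype(np.float32)
    h, w, _ = img.shape
    r = window // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    scalar = np.zeros((h, w), dtype=np.float32)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[r + dy:r + dy + h, r + dx:r + dx + w]
            d = neighbour - img                                # difference vector d_i = u_i - u_c
            dot = (d * d).sum(axis=-1)                         # dot-product similarity term
            cross = np.linalg.norm(np.cross(d, img), axis=-1)  # |d_i x u_c| cross-product term
            scalar += dot + cross
    alpha = scalar.mean() + eps                                # scaling factor (assumed global here;
    return scalar / alpha                                      # the patent's alpha is locally adaptive)
```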
S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module to obtain preliminary features, obtaining blur-ambiguity-encoding features through a differential value calculation module, and concatenating the preliminary features and the blur-ambiguity features to obtain a focal volume;
In this embodiment, the three-dimensional perception module performs high-frequency noise filtering and preliminary feature extraction on the focal stack through a four-layer network structure, and comprises several parallel convolution layers with different kernel sizes and strides for capturing blur features at different scales;
Referring to fig. 4, S2 specifically includes:
S21, filtering the focal stack with a 3D convolution network to extract blur features;
S22, introducing a differential value calculation module into the network structure and feeding the blur features into it; the differential value calculation module computes the differential values across the three R, G and B channels and fuses them into an RGB channel difference, where the channel index ranges over the different color dimensions of the input features;
S23, obtaining the RGB differential features through a downsampling layer, fusing the RGB differential features with the blur features, and constructing a focal volume that incorporates blur ambiguity.
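A PyTorch sketch of this coarse feature stage is given below. Channel widths, kernel sizes and the pairwise-absolute-difference form of the RGB channel difference are illustrative assumptions, and no spatial downsampling is applied so the two branches can be concatenated directly into the focal volume.

```python
# Sketch: parallel 3D convolutions for blur features plus an RGB channel-difference branch.
import torch
import torch.nn as nn

class PerceptionAndDifference(nn.Module):
    def __init__(self, out_ch: int = 16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(3, out_ch, kernel_size=k, padding=k // 2) for k in (3, 5, 7)
        ])
        self.reduce = nn.Conv3d(3 * out_ch, out_ch, kernel_size=1)
        self.diff_conv = nn.Conv3d(3, out_ch, kernel_size=3, padding=1)

    def forward(self, stack: torch.Tensor) -> torch.Tensor:
        # stack: (B, 3, N, H, W) -- RGB focal stack with N focus slices
        blur = torch.cat([b(stack) for b in self.branches], dim=1)
        blur = self.reduce(blur)                                 # preliminary blur features
        r, g, b = stack[:, 0:1], stack[:, 1:2], stack[:, 2:3]
        rgb_diff = torch.cat([(r - g).abs(), (g - b).abs(), (b - r).abs()], dim=1)
        diff_feat = self.diff_conv(rgb_diff)                     # blur-ambiguity encoding
        return torch.cat([blur, diff_feat], dim=1)               # focal volume

# focal_volume = PerceptionAndDifference()(torch.randn(1, 3, 5, 64, 64))
```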
S3, introducing a three-dimensional polarized self-attention mechanism into the focal stack, and dividing the input focal volume into a channel polarization feature map and a spatial polarization feature map;
In this embodiment, the channel polarization feature map is obtained by applying a polarization transformation to the input feature map $X$:
the polarization transformation converts the input feature map $X$ into two sets of basis vectors $W_q(X)$ and $W_v(X)$, corresponding to the channel-level query and key;
the similarity score $A^{ch}$ between them is computed as:
$A^{ch}(X) = \sigma\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$
where $\sigma$ denotes an activation function, $F_{SM}$ denotes the normalized exponential (softmax) function, $W_q$, $W_v$ and $W_z$ denote 1 × 1 three-dimensional convolution layers, $\sigma_1$ and $\sigma_2$ denote two tensor reshaping operators, $\times$ denotes element-level multiplication, and the number of channels between $W_v$ and $W_z$ is $C/2$;
using the score $A^{ch}$ as weights, the input vectors are weighted and summed to obtain the channel polarization feature map $Z^{ch}$ encoding the channel correlation:
$Z^{ch} = A^{ch}(X) \odot^{ch} X$
where $\odot^{ch}$ denotes the channel-level multiplication operator.
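The channel branch can be sketched as below, assuming a 3D adaptation of polarized self-attention: $W_q$ and $W_v$ are 1 × 1 × 1 convolutions producing the channel-level query and key, softmax and sigmoid give the score $A^{ch}$, and the score re-weights the input channels through a $C/2$ bottleneck. The exact composition is an assumption consistent with the description above.

```python
# Sketch: channel-polarisation branch of a 3D polarized self-attention block.
import torch
import torch.nn as nn

class ChannelPolarizedAttention3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Conv3d(channels, 1, kernel_size=1)
        self.w_v = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.w_z = nn.Conv3d(channels // 2, channels, kernel_size=1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, n, h, w = x.shape
        q = self.softmax(self.w_q(x).view(b, 1, -1))                 # (B, 1, NHW) channel query
        v = self.w_v(x).view(b, c // 2, -1)                          # (B, C/2, NHW) channel key
        z = torch.matmul(v, q.transpose(1, 2))                       # (B, C/2, 1) channel context
        a_ch = torch.sigmoid(self.w_z(z.view(b, c // 2, 1, 1, 1)))   # (B, C, 1, 1, 1) score A_ch
        return a_ch * x                                              # channel-level re-weighting

# y = ChannelPolarizedAttention3D(32)(torch.randn(2, 32, 5, 16, 16))
```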
In this embodiment, the method for the spatial polarization feature map includes:
applying a polarization transformation to the input channel polarization feature map $Z^{ch}$ to obtain two sets of polarization vectors $W_q(Z^{ch})$ and $W_v(Z^{ch})$;
where the query branch acquires global spatial features through global pooling $F_{GP}$ over the three dimensions, and the other branch rearranges the pixels of the input feature map through three-dimensional convolution to enhance features along different spatial directions;
the similarity matrix is computed from the two sets of polarization vectors:
$A^{sp}(Z^{ch}) = \sigma\big[ \sigma_3\big( F_{SM}(\sigma_1(F_{GP}(W_q(Z^{ch})))) \times \sigma_2(W_v(Z^{ch})) \big) \big]$
where $W_q$ and $W_v$ denote standard 1 × 1 three-dimensional convolution layers with an intermediate channel number of $C/2$, $\sigma_1$, $\sigma_2$ and $\sigma_3$ denote three tensor reshaping operations, $\times$ denotes the matrix dot-product operation, and $F_{GP}$ denotes global pooling;
the corresponding weights are obtained from the similarity matrix and weighted-summed with the input channel polarization features to obtain the comprehensive self-attention feature representation that associates channel and spatial features:
$Z^{sp} = A^{sp}(Z^{ch}) \odot^{sp} Z^{ch}$
where $\odot^{sp}$ denotes the spatial multiplication operator.
It should be noted that all of the convolution and tensor reshaping operations described above are carried out in three dimensions; the three-dimensional polarized self-attention mechanism can therefore account for both channel correlation and spatial blur correlation.
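The spatial branch can be sketched in the same assumed 3D polarized self-attention style: global pooling over the stack, height and width dimensions produces the global query, a 1 × 1 × 1 convolution produces the spatial key, and their product gives a per-position weight $A^{sp}$ that is multiplied back onto the channel-polarized features.

```python
# Sketch: spatial-polarisation branch of a 3D polarized self-attention block.
import torch
import torch.nn as nn

class SpatialPolarizedAttention3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.w_v = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool3d(1)                       # global pooling F_GP
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, z_ch: torch.Tensor) -> torch.Tensor:
        b, c, n, h, w = z_ch.shape
        q = self.softmax(self.pool(self.w_q(z_ch)).view(b, 1, c // 2))   # (B, 1, C/2) global query
        v = self.w_v(z_ch).view(b, c // 2, -1)                           # (B, C/2, NHW) spatial key
        a_sp = torch.sigmoid(torch.matmul(q, v)).view(b, 1, n, h, w)     # per-position weight A_sp
        return a_sp * z_ch                                               # spatial re-weighting

# out = SpatialPolarizedAttention3D(32)(torch.randn(2, 32, 5, 16, 16))
```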
The proposed model performs well on smaller focal stacks and has excellent generalization capability.
S4, passing the channel polarization feature map and the spatial polarization feature map through the layered depth probability prediction module to locate the layer of maximum sharpness in the focal stack, outputting the corresponding probability values, determining the layer of best sharpness, and obtaining the all-in-focus image.
In this embodiment, S4 specifically includes:
S41, after passing through a codec (encoder-decoder) network with the pooling layers removed, the output of the focal stack depth estimation network is divided into multiple layers, each layer corresponding to a specific focus distance;
S42, a Softmax operation is applied across the layers to determine the layer with the best sharpness, obtaining the best focus position and the all-in-focus image;
at test time, the layer of the target depth is determined from the blur information in the input focus sequence, and the depth probability value is calculated from the probability density function of the corresponding layer;
S43, the final depth estimation result is obtained by weighted summation of the multi-layer probability values.
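A compact sketch of this layered prediction is given below: a softmax across the slice dimension yields per-pixel layer probabilities, the all-in-focus image is assembled from the most probable layer, and depth is the probability-weighted sum of the focus distances. The per-slice focus distances are an assumed calibration input.

```python
# Sketch: softmax over focal layers, all-in-focus gathering, and weighted depth.
import torch

def layered_depth(logits: torch.Tensor, stack: torch.Tensor, focus_dists: torch.Tensor):
    """logits: (B, N, H, W) per-layer scores; stack: (B, N, 3, H, W); focus_dists: (N,)."""
    prob = torch.softmax(logits, dim=1)                         # per-pixel layer probabilities (S42)
    depth = (prob * focus_dists.view(1, -1, 1, 1)).sum(dim=1)   # weighted sum of focus distances (S43)
    best = prob.argmax(dim=1, keepdim=True)                     # (B, 1, H, W) sharpest layer index
    index = best.unsqueeze(2).expand(-1, -1, stack.size(2), -1, -1)
    aif = torch.gather(stack, dim=1, index=index).squeeze(1)    # all-in-focus image (B, 3, H, W)
    return depth, aif

# depth, aif = layered_depth(torch.randn(1, 5, 64, 64), torch.rand(1, 5, 3, 64, 64),
#                            torch.linspace(0.1, 1.0, 5))
```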
In example 1:
The invention is evaluated quantitatively on the 4D Light Field, DefocusNet and FlyingThings3D datasets:
As can be seen from Table 1 above, the proposed all-in-focus image synthesis method can synthesize relatively accurate all-in-focus images from a smaller focal stack.
Tables 2-4 above give the quantitative comparison of the present invention with the latest methods on the 4D Light Field, DefocusNet and FlyingThings3D datasets.
As can be seen from Tables 1-4 above, the results on the 4D Light Field dataset show that, in unsupervised depth estimation, the present invention improves on the MSE and RMSE indices of the AiFDepthNet method by 42.5% and 26.3%, respectively. Compared with supervised approaches, the present method outperforms most of them, including VDFF, PSPNet and DDFF, and its MSE and RMSE differ from those of the DefocusNet approach by only 15.0% and 4.6%. The results on the DefocusNet and FlyingThings3D datasets show that the method achieves higher accuracy than the AiFDepthNet method on the MAE, MSE and RMSE indices. Compared with the 16M parameters of the AiFDepthNet method, the present method has only 3.3M parameters and higher computational efficiency.
The invention first synthesizes the all-in-focus image and uses it as supervision information, and then performs depth estimation through a coarse feature extraction module, a polarized self-attention module and a layered depth estimation module. By using the all-in-focus image synthesized from the focal stack as supervision information and exploiting the association capability of the self-attention mechanism to obtain scene depth, the method achieves relatively high accuracy and good generalization in depth prediction, is suitable for depth estimation tasks in different scenes, and has high practicability.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention, according to the technical scheme of the present invention and its inventive concept shall fall within the scope of protection of the present invention.
Claims (10)
1. An unsupervised focal stack depth estimation method based on all-in-focus image synthesis, comprising:
S1, computing all-in-focus images with an all-in-focus image synthesis method based on an image pyramid and an all-in-focus image synthesis method based on a focus measurement operator, and fusing the resulting all-in-focus images to serve as supervision information;
S2, performing high-frequency noise filtering and preliminary feature extraction on the focal stack through a three-dimensional perception module to obtain preliminary features, obtaining blur-ambiguity-encoding features through a differential value calculation module, and concatenating the preliminary features and the blur-ambiguity features to obtain a focal volume;
S3, introducing a three-dimensional polarized self-attention mechanism into the focal stack, and dividing the input focal volume into a channel polarization feature map and a spatial polarization feature map;
S4, passing the channel polarization feature map and the spatial polarization feature map through the layered depth probability prediction module to locate the layer of maximum sharpness in the focal stack, outputting the corresponding probability values, determining the layer of best sharpness, and obtaining the all-in-focus image.
2. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein the image pyramid specifically includes:
Gaussian pyramid downsampling: the original image $G_0$ is taken as the bottom layer of the Gaussian pyramid, with the original resolution, and the $i$-th layer of the Gaussian pyramid is defined as:
$G_i = \mathrm{Down}(w * G_{i-1})$
where $*$ denotes the convolution operation, $w$ denotes a Gaussian convolution kernel of size $5 \times 5$, and $\mathrm{Down}(\cdot)$ denotes the downsampling process that removes the even rows and even columns of the input image;
downsampling reduces the resolution of the input image to one quarter, and the whole Gaussian pyramid is obtained by iterating this step;
Gaussian pyramid upsampling expands the original image to twice its size in each direction, fills the newly added rows and columns with 0, and then convolves the enlarged image with four times the same convolution kernel as above to obtain a reconstructed image;
the Laplacian pyramid is introduced from the reconstructed image; let $L_i$ denote the $i$-th layer of the Laplacian pyramid:
$L_i = G_i - \mathrm{Up}(G_{i+1})$
where $\mathrm{Up}(\cdot)$ denotes the upsampling process, i.e., expanding the image to twice its size in each direction with the newly added rows and columns filled with 0;
the original image $G_0$ is thus decomposed into a Gaussian pyramid and a Laplacian pyramid, and the same decomposition is performed for each image in the focal stack, resulting in a set of image pyramids.
3. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 2, wherein the fusion process of the image pyramid specifically includes:
given a focal stack sequence:
$\{I_k(x, y)\},\ k = 1, \dots, N$
where $(x, y)$ denotes the spatial coordinates of a pixel and $N$ denotes the number of images in the focus sequence, each image corresponding to a specific focus distance;
each image of the focal stack $I_k$ is decomposed into an image pyramid to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$, where $i = 1, \dots, M$ and $M$ denotes the number of pyramid layers;
focus measurement is performed on each layer $L_i^k$ of the Laplacian pyramids to obtain the index map $D_i(x, y)$ corresponding to the maximum sharpness; the all-in-focus Laplacian pyramid $L_i^{AiF}$ is generated from the index map and the Laplacian pyramids:
$L_i^{AiF}(x, y) = L_i^{D_i(x, y)}(x, y)$
using $L_i^{AiF}$, the all-in-focus Laplacian pyramid is upsampled and accumulated from the top layer down to obtain the all-in-focus image corresponding to the focal stack.
4. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 3, wherein the image-pyramid-based all-in-focus image synthesis method specifically includes: decomposing the input focal stack $I_k$ to obtain the Gaussian pyramid $\{G_i^k\}$ and the Laplacian pyramid $\{L_i^k\}$; performing regional information entropy calculation on the Laplacian pyramid to obtain a focus-measure sharpness value for each layer; extracting, for each layer, the content with the maximum sharpness value as the all-in-focus result of that layer; and reconstructing to obtain the final all-in-focus image.
5. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 3, wherein the focus-measurement-operator-based all-in-focus image synthesis method includes: applying a small-region neighborhood fusion operator to each image $I_k$ of the focus sequence to obtain focus-measure sharpness values for all images; performing index maximization to determine the index corresponding to the best sharpness; and extracting the pixel values from the focal stack according to that index to form the all-in-focus image.
6. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 5, wherein the focus-measurement-operator-based all-in-focus image synthesis method specifically includes:
converting the vector-valued image into a scalar-valued image through vector operations to obtain comprehensive features:
let $u$ denote a vector-valued pixel and $s$ denote a scalar-valued pixel; a window of size $w \times w$ is selected, with $u_c$ the center vector-valued pixel and $u_i$ the vector-valued pixels inside the window $\Omega$;
the scalar-valued pixel $s$ corresponding to the vector-valued pixel $u_c$ is obtained by scaling the lengths of the difference vectors within the window;
the difference vector $d_i$ is obtained by computing the difference between each other vector $u_i$ in the window $\Omega$ and the center vector $u_c$:
$d_i = u_i - u_c$
the scalar value is then formed from the dot products of the resulting vectors, scaled by a local adaptive scaling factor $\alpha$: the dot products between the difference vectors measure the similarity between features, and the cross-product lengths between each difference vector $d_i$ and the center vector $u_c$ provide a complementary term;
the resulting scalar-valued image is used in an index maximization operation to evaluate the sharpness of the images, and the pixel values at the corresponding positions are extracted from the input focal stack according to the index of the best sharpness to obtain the corresponding all-in-focus image.
7. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein the three-dimensional perception module performs high-frequency noise filtering and preliminary feature extraction on the focal stack through a four-layer network structure, and comprises several parallel convolution layers with different kernel sizes and strides for capturing blur features at different scales;
step S2 specifically comprises the following steps:
S21, filtering the focal stack with a 3D convolution network to extract blur features;
S22, introducing a differential value calculation module into the network structure and feeding the blur features into it; the differential value calculation module computes the differential values across the three R, G and B channels and fuses them into an RGB channel difference, where the channel index ranges over the different color dimensions of the input features;
S23, obtaining the RGB differential features through a downsampling layer, fusing the RGB differential features with the blur features, and constructing a focal volume that incorporates blur ambiguity.
8. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein the channel polarization feature map is obtained by applying a polarization transformation to the input feature map $X$:
the polarization transformation converts the input feature map $X$ into two sets of basis vectors $W_q(X)$ and $W_v(X)$, corresponding to the channel-level query and key;
the similarity score $A^{ch}$ between them is computed as:
$A^{ch}(X) = \sigma\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$
where $\sigma$ denotes an activation function, $F_{SM}$ denotes the normalized exponential (softmax) function, $W_q$, $W_v$ and $W_z$ denote 1 × 1 three-dimensional convolution layers, $\sigma_1$ and $\sigma_2$ denote two tensor reshaping operators, $\times$ denotes element-level multiplication, and the number of channels between $W_v$ and $W_z$ is $C/2$;
using the score $A^{ch}$ as weights, the input vectors are weighted and summed to obtain the channel polarization feature map $Z^{ch}$ encoding the channel correlation:
$Z^{ch} = A^{ch}(X) \odot^{ch} X$
where $\odot^{ch}$ denotes the channel-level multiplication operator.
9. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 8, wherein the method for the spatial polarization feature map includes:
applying a polarization transformation to the input channel polarization feature map $Z^{ch}$ to obtain two sets of polarization vectors $W_q(Z^{ch})$ and $W_v(Z^{ch})$, where the query branch acquires global spatial features through global pooling $F_{GP}$ over the three dimensions, and the other branch rearranges the pixels of the input feature map through three-dimensional convolution to enhance features along different spatial directions;
the similarity matrix is computed from the two sets of polarization vectors:
$A^{sp}(Z^{ch}) = \sigma\big[ \sigma_3\big( F_{SM}(\sigma_1(F_{GP}(W_q(Z^{ch})))) \times \sigma_2(W_v(Z^{ch})) \big) \big]$
where $W_q$ and $W_v$ denote standard 1 × 1 three-dimensional convolution layers with an intermediate channel number of $C/2$, $\sigma_1$, $\sigma_2$ and $\sigma_3$ denote three tensor reshaping operations, $\times$ denotes the matrix dot-product operation, and $F_{GP}$ denotes global pooling;
the corresponding weights are obtained from the similarity matrix and weighted-summed with the input channel polarization features to obtain the comprehensive self-attention feature representation that associates channel and spatial features:
$Z^{sp} = A^{sp}(Z^{ch}) \odot^{sp} Z^{ch}$
where $\odot^{sp}$ denotes the spatial multiplication operator.
10. The unsupervised focal stack depth estimation method based on all-in-focus image synthesis according to claim 1, wherein S4 specifically comprises:
S41, after passing through a codec (encoder-decoder) network with the pooling layers removed, the output of the focal stack depth estimation network is divided into multiple layers, each layer corresponding to a specific focus distance;
S42, a Softmax operation is applied across the layers to determine the layer with the best sharpness, obtaining the best focus position and the all-in-focus image;
S43, the final depth estimation result is obtained by weighted summation of the multi-layer probability values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311101094.7A CN116823914B (en) | 2023-08-30 | 2023-08-30 | Unsupervised focal stack depth estimation method based on all-focusing image synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116823914A | 2023-09-29
CN116823914B | 2024-01-09
Family
ID=88141360
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311101094.7A Active CN116823914B (en) | 2023-08-30 | 2023-08-30 | Unsupervised focal stack depth estimation method based on all-focusing image synthesis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116823914B (en) |
Citations (7)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120218386A1 (en) * | 2011-02-28 | 2012-08-30 | Duke University | Systems and Methods for Comprehensive Focal Tomography |
CN110246172A (en) * | 2019-06-18 | 2019-09-17 | 首都师范大学 | A kind of the light field total focus image extraction method and system of the fusion of two kinds of Depth cues |
CN110751160A (en) * | 2019-10-30 | 2020-02-04 | 华中科技大学 | Method, device and system for detecting object in image |
CN112465796A (en) * | 2020-12-07 | 2021-03-09 | 清华大学深圳国际研究生院 | Light field feature extraction method fusing focus stack and full-focus image |
US20220309696A1 (en) * | 2021-03-23 | 2022-09-29 | Mediatek Inc. | Methods and Apparatuses of Depth Estimation from Focus Information |
CN114792430A (en) * | 2022-04-24 | 2022-07-26 | 深圳市安软慧视科技有限公司 | Pedestrian re-identification method, system and related equipment based on polarization self-attention |
CN115830240A (en) * | 2022-12-14 | 2023-03-21 | 山西大学 | Unsupervised deep learning three-dimensional reconstruction method based on image fusion visual angle |
Non-Patent Citations (3)
Title |
---|
TIAN, B, ET AL.: "Fine-grained multi-focus image fusion based on edge features", Scientific Reports, vol. 13, no. 1 *
ZHOU MENG ET AL.: "Depth estimation method for focal stacks based on defocus blur characteristics", Journal of Computer Applications *
ZHANG XUEFEI: "Research on unsupervised deep learning models for monocular depth estimation", China Master's Theses Electronic Journal Network *
Also Published As
Publication number | Publication date |
---|---|
CN116823914B (en) | 2024-01-09 |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |