CN115830240A - Unsupervised deep learning three-dimensional reconstruction method based on image fusion visual angle - Google Patents

Unsupervised deep learning three-dimensional reconstruction method based on image fusion visual angle

Info

Publication number: CN115830240A
Application number: CN202211618155.2A
Authority: CN (China)
Prior art keywords: focus, relu, conv, image, formula
Prior art date: 2022-12-14
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 闫涛, 盖彦辛
Current Assignee: Shanxi University
Original Assignee: Shanxi University
Filing date: 2022-12-14
Publication date: 2023-03-21
Application filed by Shanxi University
Priority to CN202211618155.2A
Publication of CN115830240A

Abstract

The invention discloses an unsupervised deep learning three-dimensional shape reconstruction method based on an image fusion visual angle. The method comprises the following steps: first, a focal stack of images and the corresponding focus positions are acquired; second, a focus area detection module and a down-sampling focus area detection module are applied iteratively to obtain focus volumes at different scales; the multi-scale focus volumes are then passed through a four-layer hourglass network to output an attention map, from which a predicted depth map and a full-focus image of the scene are obtained; finally, the predicted depth map and the full-focus image are passed through a guided filtering function to obtain the final three-dimensional shape reconstruction result of the scene. The method addresses scene three-dimensional shape reconstruction from an unsupervised viewpoint and can effectively alleviate the difficulty of obtaining real depth annotations during three-dimensional shape reconstruction.

Description

Unsupervised deep learning three-dimensional reconstruction method based on image fusion visual angle
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an unsupervised deep learning three-dimensional reconstruction method based on an image fusion visual angle.
Background
Vision-based three-dimensional reconstruction is fast, offers good real-time performance, and lends itself to visual analysis; it is widely used for autonomous navigation in robotics, obstacle recognition in computer vision, three-dimensional modeling in architecture, cultural-relic restoration in archaeology, and other fields. The shared requirements of these fields therefore push three-dimensional reconstruction technology toward solutions that are both easy to realize and highly accurate.
In computer vision, additional depth cues are the key to recovering the three-dimensional structure of a scene from two-dimensional images. Traditional three-dimensional reconstruction methods start from depth cues such as defocus, shading, and shape, and often recover the depth of each pixel from the focus position of maximum sharpness. For example, the shape-from-focus (focus-based topography reconstruction) method estimates the depth map of a scene by using the change of focus information across a multi-depth-of-field image sequence as a cue, and is a typical passive optical method. Compared with other methods, shape-from-focus does not rely on high-precision depth-sensing equipment, and scene texture information is effectively preserved during reconstruction. However, the focus measure operator is affected by noise level, contrast, scene texture, and other factors, so the focus volume contains erroneous focus values, which in turn degrades the accuracy of the depth map. Furthermore, computing depth from the sharpness of each pixel is time-consuming and performs poorly on texture-less objects.
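As context for the discussion above, the following is a minimal sketch of the traditional shape-from-focus baseline: a per-pixel focus measure is evaluated over the stack and the focus position of maximum sharpness is taken as the depth. The use of Laplacian energy as the focus measure and the array layout are illustrative assumptions, not details taken from the patent.

```python
import numpy as np
from scipy import ndimage


def shape_from_focus(stack: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Traditional shape-from-focus baseline (illustrative assumptions only).

    stack:     grayscale focal stack, shape (N, H, W)
    positions: focus position of each slice, shape (N,)
    """
    # Per-slice focus measure: squared Laplacian response as a sharpness proxy.
    fm = np.stack([ndimage.laplace(s.astype(np.float32)) ** 2 for s in stack])
    idx = fm.argmax(axis=0)   # index of the sharpest slice at each pixel
    return positions[idx]     # depth = focus position of that slice
```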
Neural networks effectively extract the semantic information of an image and associate pixel information through convolution, so introducing deep learning into depth estimation to predict depth from a focal stack can effectively overcome the shortcomings of traditional methods. For example, because scene depth is strongly correlated with the amount of defocus blur, deep learning methods that recover depth from defocus information can regress depth values directly and obtain more accurate depths than conventional methods. However, such deep learning models require large datasets with ground-truth depth, and ground truth is difficult to obtain in practice, which makes it hard for these models to be widely applied to multi-focus three-dimensional topography reconstruction.
From the above analysis of the current state of research, we consider that existing methods have the following shortcomings: depth-acquisition devices usually require specialized hardware, such as structured-light projectors and confocal laser emitters; the traditional feature-evaluation approach of passive reconstruction needs prior-knowledge intervention and therefore lacks scene applicability and robustness; deep learning helps to overcome these problems, but a typical deep learning model requires depth annotations of real scenes and is difficult to apply in practice. Therefore, how to achieve domain adaptation and effectively exploit defocus information while performing three-dimensional reconstruction without real-scene depth maps is an important problem.
Accordingly, acquiring depth information during the image fusion process to realize unsupervised three-dimensional shape reconstruction effectively alleviates the difficulty of depth annotation in real-scene three-dimensional shape reconstruction.
Disclosure of Invention
In order to overcome the defects of existing solutions, the invention aims to provide an unsupervised deep learning three-dimensional reconstruction method based on an image fusion visual angle, which comprises the following steps:
step 1, a focus stack FS ∈ R^(H×W×N×C) and the corresponding focus positions P ∈ R^(H×W×N×C) are given, where H and W denote the height and width of the focal slices, N is the number of focal slices, C is the number of channels, and R denotes the real number domain;
step 2, the focus stack FS ∈ R^(H×W×N×C) from step 1 is passed through the focus area detection module of equations (1) to (3) to obtain the focus volume FV_1 ∈ R^(H×W×N×C):
F_1 = dilated(FS)    (1)
F_2 = RELU(ResNet(F_1) + FS)    (2)
FV_1 = RELU(conv(RELU(conv(F_2)))) + F_2    (3)
where dilated() denotes the dilated convolution, F_1 is the initial feature, ResNet() denotes the residual module, RELU() denotes the activation function, F_2 is the semantic feature, and conv() denotes a 3D convolution module;
step 3, the focus volume FV_1 obtained in step 2 is passed through the down-sampling focus detection module of equation (4) to output an effective down-sampled feature F_3, and feature extraction is performed on F_3 according to equations (5) and (6) to obtain the second-scale focus volume FV_2:
F_3 = RELU(stride_conv(FV_1) + conv(Maxpooling(FV_1)))    (4)
F_4 = RELU(ResNet(F_3) + FS)    (5)
FV_2 = RELU(conv(RELU(conv(F_4)))) + F_4    (6)
where stride_conv() denotes a strided convolution, Maxpooling() denotes a 3D max-pooling operation, and F_4 is the semantic feature;
step 4, the focus volume FV_2 obtained in step 3 is input to the down-sampling focus detection module of equation (7) to obtain the down-sampled output feature F_5, and the third-scale focus volume FV_3 is then obtained from F_5 according to equations (8) and (9):
F_5 = RELU(stride_conv(FV_2) + conv(Maxpooling(FV_2)))    (7)
F_6 = RELU(ResNet(F_5) + FS)    (8)
FV_3 = RELU(conv(RELU(conv(F_6)))) + F_6    (9)
where F_6 is the semantic feature;
step 5, the focus volumes FV_1, FV_2 and FV_3 obtained in steps 2, 3 and 4 are input to a four-layer hourglass network according to equation (10) to combine and refine the features of different sizes, and the intermediate attention M ∈ R^(H×W×N) of the maximum-sharpness probability at each focus position is output:
M = hourglass(FV_1, FV_2, FV_3)    (10)
Step 6, normalizing the intermediate attention M obtained in the step 5 according to the formula (11) to obtain the depth map attention M depth And a predicted depth map D is obtained by performing dot multiplication on the focal position P according to the equation (12),
Figure BDA0003998912150000031
Figure BDA0003998912150000032
where F denotes the number of pictures of the focal stack, M i,j,t Representing the intermediate attention value at pixel point (i, j) in the t-th image in the focal stack,
Figure BDA0003998912150000033
representing the attention value of the depth map at the pixel point (i, j) in the t-th slice, wherein the value range of the pixel point (i, j) is that i is more than or equal to 1 and less than or equal to H, j is more than or equal to 1 and less than or equal to W, t is a stack subscript, the range of t is more than or equal to 1 and less than or equal to N, and D i,j Representing depth information of a pixel point (i, j) in a depth map, exp () representing an exponential function, ln () representing a logarithmic function;
step 7, the intermediate attention M obtained in step 5 is normalized according to equation (13) to obtain the all-in-focus attention M^AiF, which is dot-multiplied with the focus stack FS according to equation (14) to obtain the full-focus image I (equations (13) and (14) are likewise reproduced only as images in the original publication), where M^AiF_(i,j,t) denotes the attention value of the full-focus image at pixel (i, j) of the t-th slice, and I_(i,j) denotes the gray value of pixel (i, j) in the full-focus image;
step 8, the depth map D obtained in step 6 and the full-focus image I obtained in step 7 are passed through the guided filtering function of equation (15) to obtain the final three-dimensional reconstruction result D_depth of the scene:
D_depth = GT(I, D)    (15)
where GT() denotes the guided filtering function.
Compared with the prior art, the invention has the following advantages:
(1) The proposed three-dimensional reconstruction method fully exploits the relationship between depth estimation and full-focus image estimation, and realizes unsupervised estimation of scene depth information;
(2) The proposed three-dimensional reconstruction method has good scene universality: depth estimation is achieved by extracting focus information that remains invariant during full-focus image estimation, so the method generalizes well across scenes.
Drawings
FIG. 1 is a flow chart of an unsupervised deep learning three-dimensional topography reconstruction method based on an image fusion view angle according to the present invention;
FIG. 2 is a schematic diagram of the unsupervised deep learning three-dimensional topography reconstruction method based on an image fusion view angle according to the present invention;
FIG. 3 is a schematic diagram of the focus area detection module of the unsupervised deep learning three-dimensional topography reconstruction method based on an image fusion view angle according to the present invention;
FIG. 4 is a schematic diagram of the down-sampling focus detection module of the unsupervised deep learning three-dimensional topography reconstruction method based on an image fusion view angle according to the present invention;
FIG. 5 is a schematic diagram of the four-layer hourglass network of the unsupervised deep learning three-dimensional topography reconstruction method based on an image fusion view angle according to the present invention.
Detailed Description
As shown in FIG. 1 and FIG. 2, an unsupervised deep learning three-dimensional topography reconstruction method based on an image fusion view angle comprises the following steps:
step 1, a focus stack FS ∈ R^(H×W×N×C) and the corresponding focus positions P ∈ R^(H×W×N×C) are given, where H and W denote the height and width of the focal slices, N is the number of focal slices, C is the number of channels, and R denotes the real number domain;
step 2, the focus stack FS ∈ R^(H×W×N×C) from step 1 is passed through the focus area detection module of equations (1) to (3), shown in FIG. 3, to obtain the focus volume FV_1 ∈ R^(H×W×N×C):
F_1 = dilated(FS)    (1)
F_2 = RELU(ResNet(F_1) + FS)    (2)
FV_1 = RELU(conv(RELU(conv(F_2)))) + F_2    (3)
where dilated() denotes the dilated convolution, F_1 is the initial feature, ResNet() denotes the residual module, RELU() denotes the activation function, F_2 is the semantic feature, and conv() denotes a 3D convolution module;
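To make the data flow of this module concrete, the following is a minimal PyTorch-style sketch of equations (1) to (3). The channel width, kernel sizes, dilation rate, and the depth of the residual block are assumptions not fixed by the text; the tensor layout (B, C, N, H, W) follows PyTorch's 3D-convolution convention.

```python
import torch
import torch.nn as nn


class FocusAreaDetection(nn.Module):
    """Sketch of the focus area detection module, eqs. (1)-(3).
    Kernel sizes, dilation rate and the residual-block depth are assumptions."""

    def __init__(self, channels: int = 8):
        super().__init__()
        # eq. (1): dilated 3D convolution producing the initial feature F1
        self.dilated = nn.Conv3d(channels, channels, kernel_size=3,
                                 padding=2, dilation=2)
        # eq. (2): a small residual block standing in for ResNet(.)
        self.resnet = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1))
        # eq. (3): two stacked 3D convolutions with ReLU activations
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, fs: torch.Tensor) -> torch.Tensor:
        # fs: focus stack of shape (B, C, N, H, W)
        f1 = self.dilated(fs)                                         # eq. (1)
        f2 = self.relu(self.resnet(f1) + fs)                          # eq. (2)
        fv1 = self.relu(self.conv2(self.relu(self.conv1(f2)))) + f2   # eq. (3)
        return fv1
```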
step 3, the focus volume FV_1 obtained in step 2 is passed through the down-sampling focus detection module of equation (4) (shown in FIG. 4) to output an effective down-sampled feature F_3, and feature extraction is performed on F_3 according to equations (5) and (6) to obtain the second-scale focus volume FV_2:
F_3 = RELU(stride_conv(FV_1) + conv(Maxpooling(FV_1)))    (4)
F_4 = RELU(ResNet(F_3) + FS)    (5)
FV_2 = RELU(conv(RELU(conv(F_4)))) + F_4    (6)
where stride_conv() denotes a strided convolution, Maxpooling() denotes a 3D max-pooling operation, and F_4 is the semantic feature;
step 4, the focus volume FV_2 obtained in step 3 is input to the down-sampling focus detection module of equation (7) (shown in FIG. 4) to obtain the down-sampled output feature F_5, and the third-scale focus volume FV_3 is then obtained from F_5 according to equations (8) and (9):
F_5 = RELU(stride_conv(FV_2) + conv(Maxpooling(FV_2)))    (7)
F_6 = RELU(ResNet(F_5) + FS)    (8)
FV_3 = RELU(conv(RELU(conv(F_6)))) + F_6    (9)
where F_6 is the semantic feature;
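The down-sampling focus detection module shared by steps 3 and 4 (equations (4) and (7)) can be sketched as follows: a strided 3D convolution and a max-pooling-plus-convolution branch are summed and passed through a ReLU. Halving only the spatial dimensions while keeping the stack dimension N, as well as the kernel sizes, are assumptions, since FIG. 4 is not reproduced here.

```python
import torch
import torch.nn as nn


class DownsampleFocusDetection(nn.Module):
    """Sketch of eq. (4) / eq. (7); spatial halving and kernel sizes are assumptions."""

    def __init__(self, channels: int = 8):
        super().__init__()
        # strided-convolution branch: halves H and W, keeps the stack dimension N
        self.stride_conv = nn.Conv3d(channels, channels, kernel_size=3,
                                     stride=(1, 2, 2), padding=1)
        # max-pooling branch followed by an ordinary 3D convolution
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, fv: torch.Tensor) -> torch.Tensor:
        # fv: focus volume of shape (B, C, N, H, W) with even H and W
        return self.relu(self.stride_conv(fv) + self.conv(self.pool(fv)))
```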
step 5, the focus volumes FV_1, FV_2 and FV_3 obtained in steps 2, 3 and 4 are input to a four-layer hourglass network (shown in FIG. 5) according to equation (10) to combine and refine the features of different sizes, and the intermediate attention M ∈ R^(H×W×N) of the maximum-sharpness probability at each focus position is output:
M = hourglass(FV_1, FV_2, FV_3)    (10)
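FIG. 5 is not reproduced in this text, so the sketch below is only a schematic stand-in for the four-layer hourglass of equation (10). It shows one plausible way to fuse the three multi-scale focus volumes in an encoder-decoder with skip connections and to output an attention map M of shape (B, N, H, W); the fusion-by-summation scheme, layer count, and layer widths are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HourglassAttention(nn.Module):
    """Schematic stand-in for the four-layer hourglass of eq. (10).
    Only the interface (three multi-scale volumes in, attention M out) follows
    the description; the internal structure is an assumption."""

    def __init__(self, channels: int = 8):
        super().__init__()
        self.enc1 = nn.Conv3d(channels, channels, 3, stride=(1, 2, 2), padding=1)
        self.enc2 = nn.Conv3d(channels, channels, 3, stride=(1, 2, 2), padding=1)
        self.dec2 = nn.ConvTranspose3d(channels, channels, 3, stride=(1, 2, 2),
                                       padding=1, output_padding=(0, 1, 1))
        self.dec1 = nn.ConvTranspose3d(channels, channels, 3, stride=(1, 2, 2),
                                       padding=1, output_padding=(0, 1, 1))
        self.head = nn.Conv3d(channels, 1, 3, padding=1)

    def forward(self, fv1, fv2, fv3):
        # fv1: (B, C, N, H, W); fv2: (B, C, N, H/2, W/2); fv3: (B, C, N, H/4, W/4)
        x1 = F.relu(self.enc1(fv1) + fv2)   # encoder level 1, fused with FV_2
        x2 = F.relu(self.enc2(x1) + fv3)    # encoder level 2, fused with FV_3
        y1 = F.relu(self.dec2(x2) + x1)     # decoder level 2 with skip connection
        y0 = F.relu(self.dec1(y1) + fv1)    # decoder level 1 with skip connection
        return self.head(y0).squeeze(1)     # intermediate attention M: (B, N, H, W)
```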
Step 6, normalizing the intermediate attention M obtained in the step 5 according to the formula (11) to obtain the depth map attention M depth And a predicted depth map D is obtained by performing dot multiplication on the focal position P according to the equation (12),
Figure BDA0003998912150000053
Figure BDA0003998912150000054
where F denotes the number of pictures of the focal stack, M i,j,t Representing the intermediate attention value at pixel point (i, j) in the t-th image in the focal stack,
Figure BDA0003998912150000055
representing the attention value of the depth map at the pixel point (i, j) in the t-th slice, wherein the value range of the pixel point (i, j) is that i is more than or equal to 1 and less than or equal to H, j is more than or equal to 1 and less than or equal to W, t is a stack subscript, the range of t is more than or equal to 1 and less than or equal to N, and D i,j Representing depth information of a pixel point (i, j) in a depth map, exp () representing an exponentFunction, ln () represents a logarithmic function;
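Since equations (11) and (12) are only given as images, the following sketch shows one plausible realisation consistent with the surrounding text: an exp-based (softmax) normalisation over the stack dimension followed by an attention-weighted sum of the focus positions. The exact normalisation used in the patent may differ, since the text also mentions ln().

```python
import torch
import torch.nn.functional as F


def regress_depth(m: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """m: intermediate attention of shape (B, N, H, W);
    p: focus positions of shape (B, N, H, W).
    The softmax normalisation is an assumption standing in for eq. (11)."""
    m_depth = F.softmax(m, dim=1)     # normalise over the stack dimension
    return (m_depth * p).sum(dim=1)   # weighted sum of focus positions -> depth map D
```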
step 7, the intermediate attention M obtained in step 5 is normalized according to equation (13) to obtain the all-in-focus attention M^AiF, which is dot-multiplied with the focus stack FS according to equation (14) to obtain the full-focus image I (equations (13) and (14) are likewise reproduced only as images in the original publication), where M^AiF_(i,j,t) denotes the attention value of the full-focus image at pixel (i, j) of the t-th slice, and I_(i,j) denotes the gray value of pixel (i, j) in the full-focus image;
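Analogously, a plausible sketch of step 7 (equations (13) and (14), also given only as images) fuses the focal slices with the normalised attention; the softmax normalisation is again an assumption.

```python
import torch
import torch.nn.functional as F


def fuse_all_in_focus(m: torch.Tensor, fs: torch.Tensor) -> torch.Tensor:
    """m: intermediate attention of shape (B, N, H, W);
    fs: focal stack of shape (B, N, H, W), one channel shown for brevity."""
    m_aif = F.softmax(m, dim=1)       # normalise over the stack dimension
    return (m_aif * fs).sum(dim=1)    # attention-weighted fusion -> full-focus image I
```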
step 8, the depth map D obtained in step 6 and the full-focus image I obtained in step 7 are passed through the guided filtering function of equation (15) to obtain the final three-dimensional reconstruction result D_depth of the scene:
D_depth = GT(I, D)    (15)
where GT() denotes the guided filtering function.
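Equation (15) applies a standard guided filter with the full-focus image as the guide. A minimal sketch using OpenCV's contrib module is shown below; the radius and regularisation eps values are assumptions, and cv2.ximgproc requires the opencv-contrib-python package.

```python
import cv2
import numpy as np


def refine_depth(aif: np.ndarray, depth: np.ndarray,
                 radius: int = 8, eps: float = 1e-3) -> np.ndarray:
    """Sketch of step 8, eq. (15): smooth the predicted depth map D with a
    guided filter that uses the all-in-focus image I as guide.
    radius and eps are illustrative assumptions, not values from the patent."""
    guide = aif.astype(np.float32)
    src = depth.astype(np.float32)
    return cv2.ximgproc.guidedFilter(guide, src, radius, eps)
```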

Claims (1)

1. An unsupervised deep learning three-dimensional shape reconstruction method based on an image fusion visual angle, characterized by comprising the following steps:
step 1, a focus stack FS ∈ R^(H×W×N×C) and the corresponding focus positions P ∈ R^(H×W×N×C) are given, where H and W denote the height and width of the focal slices, N is the number of focal slices, C is the number of channels, and R denotes the real number domain;
step 2, the focus stack FS ∈ R^(H×W×N×C) from step 1 is passed through the focus area detection module of equations (1) to (3) to obtain the focus volume FV_1 ∈ R^(H×W×N×C):
F_1 = dilated(FS)    (1)
F_2 = RELU(ResNet(F_1) + FS)    (2)
FV_1 = RELU(conv(RELU(conv(F_2)))) + F_2    (3)
where dilated() denotes the dilated convolution, F_1 is the initial feature, ResNet() denotes the residual module, RELU() denotes the activation function, F_2 is the semantic feature, and conv() denotes a 3D convolution module;
step 3, the focus volume FV_1 obtained in step 2 is passed through the down-sampling focus detection module of equation (4) to output an effective down-sampled feature F_3, and feature extraction is performed on F_3 according to equations (5) and (6) to obtain the second-scale focus volume FV_2:
F_3 = RELU(stride_conv(FV_1) + conv(Maxpooling(FV_1)))    (4)
F_4 = RELU(ResNet(F_3) + FS)    (5)
FV_2 = RELU(conv(RELU(conv(F_4)))) + F_4    (6)
where stride_conv() denotes a strided convolution, Maxpooling() denotes a 3D max-pooling operation, and F_4 is the semantic feature;
step 4, the focus volume FV_2 obtained in step 3 is input to the down-sampling focus detection module of equation (7) to obtain the down-sampled output feature F_5, and the third-scale focus volume FV_3 is then obtained from F_5 according to equations (8) and (9):
F_5 = RELU(stride_conv(FV_2) + conv(Maxpooling(FV_2)))    (7)
F_6 = RELU(ResNet(F_5) + FS)    (8)
FV_3 = RELU(conv(RELU(conv(F_6)))) + F_6    (9)
where F_6 is the semantic feature;
step 5, the focus volumes FV_1, FV_2 and FV_3 obtained in steps 2, 3 and 4 are input to a four-layer hourglass network according to equation (10) to combine and refine the features of different sizes, and the intermediate attention M ∈ R^(H×W×N) of the maximum-sharpness probability at each focus position is output:
M = hourglass(FV_1, FV_2, FV_3)    (10)
Step 6, normalizing the intermediate attention M obtained in the step 5 according to the formula (11) to obtain the depth map attention M depth And a predicted depth map D is obtained by performing dot multiplication on the focal position P according to the equation (12),
Figure FDA0003998912140000021
Figure FDA0003998912140000022
where F denotes the number of pictures of the focal stack, M i,j,t Representing the intermediate attention value at pixel point (i, j) in the t-th image in the focal stack,
Figure FDA0003998912140000023
representing the attention value of the depth map at the pixel point (i, j) in the t-th slice, wherein the value range of the pixel point (i, j) is that i is more than or equal to 1 and less than or equal to H, j is more than or equal to 1 and less than or equal to W, t is a stack subscript, the range of t is more than or equal to 1 and less than or equal to N, and D i,j Representing depth information of a pixel point (i, j) in the depth map, exp () representing an exponential function, ln () representing a logarithmic function;
step 7, the intermediate attention M obtained in step 5 is normalized according to equation (13) to obtain the all-in-focus attention M^AiF, which is dot-multiplied with the focus stack FS according to equation (14) to obtain the full-focus image I, where M^AiF_(i,j,t) denotes the attention value of the full-focus image at pixel (i, j) of the t-th slice, and I_(i,j) denotes the gray value of pixel (i, j) in the full-focus image;
step 8, the depth map D obtained in step 6 and the full-focus image I obtained in step 7 are passed through the guided filtering function of equation (15) to obtain the final three-dimensional reconstruction result D_depth of the scene:
D_depth = GT(I, D)    (15)
where GT() denotes the guided filtering function.
CN202211618155.2A (priority date: 2022-12-14; filing date: 2022-12-14) Unsupervised deep learning three-dimensional reconstruction method based on image fusion visual angle, Pending, published as CN115830240A (en)

Priority Applications (1)

Application Number: CN202211618155.2A (published as CN115830240A (en))
Priority Date: 2022-12-14
Filing Date: 2022-12-14
Title: Unsupervised deep learning three-dimensional reconstruction method based on image fusion visual angle


Publications (1)

Publication Number: CN115830240A
Publication Date: 2023-03-21

Family

ID=85545876

Family Applications (1)

Application Number: CN202211618155.2A (Pending, published as CN115830240A (en))
Priority Date: 2022-12-14
Filing Date: 2022-12-14
Title: Unsupervised deep learning three-dimensional reconstruction method based on image fusion visual angle

Country Status (1)

Country Link
CN (1) CN115830240A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116823914A (en) * 2023-08-30 2023-09-29 中国科学技术大学 Unsupervised focal stack depth estimation method based on all-focusing image synthesis
CN116823914B (en) * 2023-08-30 2024-01-09 中国科学技术大学 Unsupervised focal stack depth estimation method based on all-focusing image synthesis
CN117274788A (en) * 2023-10-07 2023-12-22 南开大学 Sonar image target positioning method, system, electronic equipment and storage medium
CN117274788B (en) * 2023-10-07 2024-04-30 南开大学 Sonar image target positioning method, system, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination