CN115588038A - Multi-view depth estimation method - Google Patents
- Publication number
- CN115588038A (application number CN202211279016.1A)
- Authority
- CN
- China
- Prior art keywords
- depth
- image
- feature
- module
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T 7/55: Image analysis; depth or shape recovery from multiple images
- G06N 3/084: Neural networks; learning methods; backpropagation, e.g. using gradient descent
- G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06T 2207/20081: Indexing scheme for image analysis or image enhancement; special algorithmic details; training; learning
- G06T 2207/20084: Indexing scheme for image analysis or image enhancement; special algorithmic details; artificial neural networks [ANN]
Abstract
The invention provides a multi-view depth estimation method comprising the following steps. Step 1: image input: acquire N+1 different images with a camera, using the front-view image as the reference image $I_{i=0}$ and the images from the other directions as the target images $I_i$, where i ranges from 0 to N. Step 2: feature extraction, performed by an FPN network module and a CA module. Step 2.1: the FPN network module extracts initial feature maps at different scales from the images obtained in step 1. Step 2.2: after the FPN network module extracts features from the input camera images at three different scales, the resulting initial feature maps are passed to the CA module through a DCN module. The method addresses the problems of occlusion, weak-texture regions, reflective surfaces, and repetitive patterns in the acquired images, cases in which conventional depth estimation methods predict depth information inaccurately or cannot predict it at all.
Description
Technical Field
The invention belongs to the technical field of computer vision and deep learning, and particularly relates to a multi-view depth estimation method.
Background
Acquiring depth information from images is of great significance in fields such as autonomous driving, industrial inspection, medical treatment, aerospace, and three-dimensional reconstruction, and techniques that recover depth maps from multiple views have broad prospects. Specifically, multi-view depth estimation acquires multiple images of an object or scene from different angles with a camera, uses these images as the main input for depth estimation, and finally generates a depth map with a computer vision algorithm.
A depth map is an image whose pixel values are the distances from points in the real scene to the camera; a smaller depth value indicates that a point is closer to the camera. Methods for obtaining the depth information of a target scene can be divided into active and passive approaches. Active depth acquisition uses high-precision, technically mature hardware such as lidar, which emits laser pulses and measures the distance from the target object to the sensor with mature Time of Flight (TOF) ranging to obtain depth information. Although active methods acquire the depth of a target scene quickly, conveniently, and accurately, the devices are generally very expensive, which raises the cost of depth acquisition. They also place requirements on the external environment: light interference and the measurement distance both affect the result. Passive depth acquisition instead exploits feature points in images of the target scene and predicts depth information with computer vision algorithms; the whole process is simple to operate, requires no additional equipment, and is highly practical. However, because the main information source of passive methods is a set of images from different viewpoints, occluded regions, external reflections, varying illumination intensity, and repetitive patterns in the captured images all introduce errors into the estimated depth of the target scene.
Within passive depth acquisition, image-based depth estimation methods are generally divided into those built on classical computer vision algorithms and those implemented with deep learning network frameworks.
Traditional multi-view depth estimation pipelines mainly consist of Structure from Motion (SFM) and Multi-view Stereo (MVS):
1. SFM algorithm: camera motion and depth information of the target scene are estimated from a set of two-dimensional images taken from different viewpoints. Feature points are first extracted from the images and matched pairwise; the spatial points corresponding to the matched features are reconstructed through epipolar geometry; the recovered camera poses and spatial coordinates of the feature points are then optimized with Bundle Adjustment; the remaining images that share the most matches with the recovered spatial points are added one by one, with Bundle Adjustment applied repeatedly, and a global Bundle Adjustment is performed once all spatial points have been obtained. The SFM stage provides the MVS stage with the camera pose matrices and an initial sparse set of spatial points of the scene.
2. MVS algorithm: dense reconstruction is performed on the basis of the camera parameters computed by the SFM method, outputting dense three-dimensional spatial points. MVS algorithms are implemented in several forms, based on point clouds, depth maps, or voxels.
The features used by traditional depth estimation algorithms are designed by hand and are difficult to extract in weak-texture regions, so the predicted depth of such regions is inaccurate; moreover, designing these features consumes considerable manpower and time up front.
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to provide a multi-view depth estimation method that handles occlusion, weak-texture regions, reflective surfaces, and repetitive patterns in the acquired images, cases in which conventional depth estimation methods predict depth information inaccurately or cannot predict it at all. To acquire depth information more accurately and rapidly and to make up for the shortcomings of traditional depth estimation, the following multi-view depth estimation method is provided.
In order to solve the above problems, the present invention provides a multi-view depth estimation method comprising the following specific steps:
Step 1: image input: acquire N+1 different images with a camera, using the front-view image as the reference image $I_{i=0}$ and the images from the other directions as the target images $I_i$, where i ranges from 0 to N;
Step 2: feature extraction: the feature extraction stage comprises an FPN network module and a CA module;
Step 2.1: the FPN network module extracts initial feature maps of different scales from the images obtained in step 1;
Step 2.2: after the FPN network module extracts features from the input camera images at three different scales, the obtained initial feature maps are passed to the CA module through a DCN module;
Step 3: depth refinement: depth maps of different resolutions are predicted in a cascaded manner from the multi-scale feature maps obtained by feature extraction;
Step 4: depth optimization: the initial depth map of resolution W × H output by the depth refinement module is refined with a residual learning network to obtain the optimized depth map; the network model is trained with Focal loss, and its parameters are updated by gradient descent on the total loss with the Adam optimization method, thereby guiding the training of the whole model.
Optionally, step 2.2 includes that the CA module embeds the attention information of the feature map in the horizontal and vertical directions into the channel, and includes the following specific steps:
step 2.2.1: the CA module performs global average pooling on the input features along the horizontal and vertical directions respectively, as in formula (1) and formula (2):

$$p_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(i, h) \qquad (1)$$

$$p_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(w, j) \qquad (2)$$

wherein the input tensor is $X = [x_1, x_2, \dots, x_C] \in R^{W \times H \times C}$, with {W, H, C} the image width, height and number of channels; the channels are encoded with two pooling layers of size W × 1 and 1 × H respectively, (w, j) and (i, h) denote image coordinate positions of the input tensor $x_c$, $p_c^h(h)$ is the output of the c-th channel at vertical position h, and $p_c^w(w)$ is the output of the c-th channel at horizontal position w;
step 2.2.2: the pooling-layer outputs in the horizontal and vertical directions, $p^w$ and $p^h$, are then concatenated (Concate operation), as in formula (3):

$$P = \mathcal{C}\left(p^h, p^w\right) \qquad (3)$$

wherein $\mathcal{C}(\cdot)$ denotes the Concate operation, P is the result output after the Concate operation, $p^w$ is the output of the pooling layer in the horizontal direction, and $p^h$ is the output of the pooling layer in the vertical direction;
step 2.2.3: the Concate output is fed into a 1 × 1 convolutional layer, a BN layer and a non-linear activation function to obtain an intermediate feature map, as in formula (4):

$$f = \delta\left(F_{1\times 1}(P)\right) \qquad (4)$$

wherein P is the output of the Concate operation, $F_{1\times 1}$ is a convolution with kernel size 1 × 1, $\delta$ is the non-linear activation function, and f is the intermediate feature map obtained by encoding the spatial information of the input feature map along the horizontal and vertical directions, $f \in R^{C/r \times (H+W)}$, where C is the number of channels, r is the channel reduction rate, and W and H are the image width and height;
step 2.2.4: the intermediate feature map f is split along the horizontal and vertical directions into two separate tensors $f^w \in R^{C/r \times W}$ and $f^h \in R^{C/r \times H}$; each is passed through a 1 × 1 convolution and then a Sigmoid activation function to obtain the attention weights in the horizontal and vertical directions, $q^w$ and $q^h$, as in formulas (5) and (6):

$$q^w = \sigma\left(F_{1\times 1}(f^w)\right) \qquad (5)$$

$$q^h = \sigma\left(F_{1\times 1}(f^h)\right) \qquad (6)$$

wherein $q^w$ is the horizontal attention, $q^h$ is the vertical attention, $\sigma$ denotes the Sigmoid activation function, $F_{1\times 1}$ denotes a 1 × 1 convolution operation, and $f^w$ and $f^h$ are the intermediate feature tensors in the horizontal and vertical directions respectively;
step 2.2.5: the horizontal attention $q^w$ and the vertical attention $q^h$ are multiplied with the input feature $X = [x_1, x_2, \dots, x_C] \in R^{W \times H \times C}$ to obtain the final output tensor $Y = [y_1, y_2, \dots, y_C]$, as in formula (7):

$$y_c(i, j) = x_c(i, j) \times q_c^w(i) \times q_c^h(j) \qquad (7)$$

wherein $x_c(i, j)$ is the value of the feature tensor $x_c$ at image coordinate (i, j), $q_c^w(i)$ is the attention weight of the c-th channel in the horizontal direction, and $q_c^h(j)$ is the attention weight of the c-th channel in the vertical direction.
Optionally, step 3 includes constructing a cost body, regularization of the cost body, and depth estimation, and specifically includes the following steps:
step 3.1: constructing a cost body: sending the feature graph output by the CA module to a cost body construction module to construct a cost body;
step 3.2: regularization of a cost body: regularizing the cost body by using 3D convolution;
step 3.3: depth estimation: a Softmax operation is applied to the regularized cost volume to obtain a probability volume, and the depth map is predicted from the probability volume.
Optionally, step 3.1 further comprises the following steps:
step 3.1.1: establishing a hypothetical depth plane for each pixel in the reference image according to the depth hypothetical range;
step 3.1.2: the two-dimensional features of each target image are transformed by homography into the hypothetical planes of the reference image to form feature volumes; the homography transformation is as in formula (8):

$$H_i(d) = K_i \cdot R_i \cdot \left(I - \frac{(t_0 - t_i)\, n_0^T}{d}\right) \cdot R_0^T \cdot K_0^{-1} \qquad (8)$$

wherein $H_i(d)$ is the homography matrix between the i-th target-image feature map and the reference feature map at depth d, i is the feature map index from 0 to N, and $K_i$, $R_i$, $t_i$ are the camera intrinsics, rotation matrix and translation vector when the target image was captured; $K_0^{-1}$, $R_0^T$ and $t_0$ are respectively the inverse of the camera intrinsic matrix, the transpose of the rotation matrix and the translation vector when the reference image was captured, I is the identity matrix, and $n_0^T$ is the principal-axis direction of the reference camera;
step 3.1.3: a variance-based cost metric is used to aggregate the multiple feature volumes into the cost volume, computed as in formula (9):

$$V_{cost} = \frac{1}{N+1}\sum_{i=0}^{N}\left(V_i - \overline{V}\right)^2 \qquad (9)$$

wherein $V_{cost}$ is the cost volume, obtained by computing the variance over the N target-image feature volumes $V_i$ and the reference-view feature volume, with $\overline{V}$ the mean of all feature volumes.
Optionally, step 3.2 further comprises the following steps:
step 3.2.1: performing downsampling operation through the 3D convolution and the maximum pooling layer to output each feature layer;
step 3.2.2: and then, performing 3D deconvolution and upsampling operation on the obtained feature layer.
Optionally, step 4 further includes the following steps:
step 4.1: the original scale feature depth estimation result of the reference image and depth refinement module is used as the input of a residual learning network to obtain an optimized depth map;
and 4.2: using Focal loss as a loss function for model training, the cross entropy loss function is as in equation (10):
$$L_{CE} = -\sum_{p \in p_v} \log P\!\left(p, \hat{d}\right) \qquad (10)$$

wherein $L_{CE}$ is the cross-entropy loss function, $P(p, d)$ is the predicted probability of pixel p at depth hypothesis d, $\hat{d}$ is the depth hypothesis closest to the ground-truth value, and $p_v$ is the subset of pixels with valid ground truth;
step 4.3: the balanced cross-entropy loss function, commonly used for target detection and classification, is as in formula (11):

$$L_{BL} = -\sum_{p \in p_v} \alpha \log P\!\left(p, \hat{d}\right) \qquad (11)$$

wherein $L_{BL}$ is the balanced cross-entropy loss function, which introduces a weight $\alpha \in [0, 1]$ on the basis of the cross-entropy loss; $P(p, d)$ is the predicted probability of pixel p at depth hypothesis d, $\hat{d}$ is the depth hypothesis closest to the ground-truth value, and $p_v$ is the subset of pixels with valid ground truth;
step 4.4: the Focal loss function is as in formula (12):

$$L_{FL} = -\sum_{p \in p_v} \left(1 - P\!\left(p, \hat{d}\right)\right)^{\gamma} \log P\!\left(p, \hat{d}\right) \qquad (12)$$

wherein $L_{FL}$ denotes the Focal loss, $\left(1 - P(p, \hat{d})\right)^{\gamma}$ is the modulating factor and $\gamma$ is its parameter; $P(p, d)$ is the predicted probability of pixel p at depth hypothesis d, $\hat{d}$ is the depth hypothesis closest to the ground-truth value, and $p_v$ is the subset of pixels with valid ground truth;
step 4.5: network model parameters are updated by back-propagation with the Adam optimization method according to the computed total loss.
Advantageous effects
The invention provides a multi-view depth estimation method that adopts the end-to-end network training idea of deep learning to acquire depth information. The whole framework can rapidly process multiple images and produce reliable depth information. Traditional depth estimation algorithms compute the cost volume from hand-crafted features, which are difficult to extract in non-Lambertian scenes, whereas the deep-learning-based depth estimation of this method remains accurate and robust even in weak-texture regions, on reflective surfaces, and in repetitive-pattern and occluded regions. The method uses an FPN network to obtain depth image features from the input images at three different scales; the initial multi-scale features extracted by the FPN network are passed to the CA module through a DCN module; the CA module gathers the position information of the initial feature maps along the horizontal and vertical directions and then applies convolutions that acquire a global receptive field from these two directions, so the processed feature maps can roughly be regarded as carrying global context information. Because the FPN network, as the basic feature extraction module, captures only context information of a relatively local neighbourhood, a DCN module is inserted between the FPN network and the CA module to adaptively adjust the feature extraction range with an additional learned offset. Some networks construct the cost volume directly from the multi-scale features of the FPN module and ignore global context information; this method adds the CA module and embeds the position information into the channels, which strengthens the expressive power of the feature maps and benefits the accuracy of all subsequent functional modules of the network.
The method trains the neural network framework with the Focal loss function, because on complex scene data sets the Focal loss yields higher accuracy in image boundary regions than the cross-entropy loss function that has commonly been used.
At present, thanks to the continuous improvement of hardware performance, the application of deep learning to depth estimation algorithms is developing rapidly. A deep-learning-based multi-view depth estimation method implements the algorithm with a neural network framework and completes the whole process from two-dimensional images to a predicted depth map.
The deep-learning-based multi-view depth estimation method proceeds as follows: a network framework for estimating image depth is built; after a group of images is input, the depth map corresponding to each image can be predicted; the difference between the predicted depth map and the real depth image is computed with the loss function; the network parameters are optimized iteratively with the Adam method; and finally the network with the smallest loss is taken as the trained model and used directly to predict depth maps.
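As a rough illustration of this training procedure, a minimal PyTorch-style sketch is given below; the model signature, data-loader fields and hyper-parameters are placeholders, not taken from the patent:

```python
import torch

def train(model, loader, loss_fn, epochs=16, lr=1e-3, device="cuda"):
    """Generic loop: predict depth, compare against ground truth with the
    chosen loss, and update the network parameters with Adam."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, intrinsics, extrinsics, gt_depth, mask in loader:
            images = images.to(device)            # (B, N+1, 3, H, W)
            gt_depth = gt_depth.to(device)        # (B, H, W) ground-truth depth
            mask = mask.to(device)                # valid-pixel subset p_v
            pred_depth, prob_volume = model(images,
                                            intrinsics.to(device),
                                            extrinsics.to(device))
            loss = loss_fn(prob_volume, gt_depth, mask)
            optimizer.zero_grad()
            loss.backward()                       # back-propagate the total loss
            optimizer.step()                      # Adam gradient update
```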
Drawings
FIG. 1 is a general network framework diagram of an embodiment of the present invention;
FIG. 2 is a CBR module diagram in feature extraction according to an embodiment of the present invention;
FIG. 3 is a diagram of an out module in feature extraction according to an embodiment of the present invention;
FIG. 4 is a block diagram of a DCN module in feature extraction according to an embodiment of the present invention;
FIG. 5 is a diagram of a CA module in feature extraction according to an embodiment of the present invention;
fig. 6 is a block diagram of a 3D UNet module in a regularized cost body according to an embodiment of the present invention.
Detailed Description
Referring to fig. 1 to fig. 6 in combination, according to an embodiment of the present invention, a multi-view depth estimation method includes the following specific steps:
Step 1: image input: acquire N+1 different images with a camera, using the front-view image as the reference image $I_{i=0}$ and the images from the other directions as the target images $I_i$, where i ranges from 0 to N;
furthermore, a group of multi-view images of the target scene is input, namely N +1 views with different angles are shot by a camera, wherein one of the views is a reference image and the other N views are all used as target images, and the reference image and the target images thereof are sent to a feature extraction part of the network framework.
Step 2: feature extraction: the feature extraction stage comprises an FPN network module and a CA module;
step 2.1: and (3) extracting initial feature maps with different scales based on the image acquired in the step (1) by the FPN network module.
Further, the N+1 images are first input into a Feature Pyramid Network (FPN) for feature extraction. Let the input images be $I_i$, with i from 0 to N, where $I_{i=0}$ is the reference image and N is the number of target images; the width and height are denoted W and H, so the resolution is W × H. After passing through the FPN network, each image yields initial feature maps of resolution W/4 × H/4, W/2 × H/2 and W × H; using features of different scales enlarges the receptive field. The output initial feature maps of different scales are then fed to the CA (Coordinate Attention) module.
Further, the image $I_i$ of resolution W × H is first passed through an FPN network that outputs three feature scales to extract preliminary feature maps. The FPN network consists of three CBR modules and three out output-feature modules. The CBR module, shown in fig. 2, consists of a Convolution Layer, a BN (Batch Normalization) layer and a ReLU activation function. The image $I_i$ first passes through the CBR0 module, which consists of two identical convolution layers each followed by a BN layer and a ReLU activation; each convolution has a 3 × 3 kernel, 8 output channels and stride 1. The output of CBR0 is fed into the CBR1 module, which consists of three convolution layers with BN and ReLU; the kernel sizes are 5 × 5, 3 × 3 and 3 × 3, all with 16 output channels, and the strides are 2, 1 and 1 respectively. The output of CBR1 is then fed into the CBR2 module, which is identical to CBR1 except that the channel count is 32.
Furthermore, the BN layer is a batch normalization step that is usually applied before the activation function; it normalizes the features of a layer and, by tracking the mean and variance, makes model training more stable and accelerates training and convergence. The BN layer is essentially a linear transformation that normalizes each value of the feature map after the convolution operation.
The ReLU activation function both increases the convergence speed of the network and alleviates the vanishing-gradient problem; it was proposed to overcome the saturation of Sigmoid. The Sigmoid function maps the output of each neuron to the range 0 to 1, which suits models that output prediction probabilities, but it has an inherent drawback: saturation can cause the gradient to vanish.
Further, the out modules are used to obtain feature maps of different scales, as shown in fig. 3. First, the output of CBR2 is taken as the input of the out0 module; out0 applies a 1 × 1 convolution to the CBR2 output and produces a feature map at scale W/4 × H/4. Next, the outputs of out0 and CBR1 are taken as the inputs of out1: the out0 result is upsampled by a factor of two, added to the CBR1 output, and the sum is passed through a 1 × 1 convolution, so out1 produces a feature map at scale W/2 × H/2. Finally, the outputs of out1 and CBR0 are taken as the inputs of out2: the out1 result is upsampled by a factor of two, added to the CBR0 output, and the sum is passed through a 1 × 1 convolution, so out2 produces a feature map at scale W × H. Through these steps the FPN network outputs, for each of the N+1 images, multi-scale feature maps at resolutions W/4 × H/4, W/2 × H/2 and W × H, which serve as the input of the CA module.
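As a rough illustration of this three-scale extractor, a PyTorch sketch follows; the 1 × 1 lateral convolutions used to match channel counts before each top-down addition are an assumption, since the text does not state how the 32-, 16- and 8-channel maps are aligned, and all module names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(in_ch, out_ch, kernels, strides):
    """A CBR stack: Conv2d + BatchNorm + ReLU repeated per kernel/stride pair."""
    layers, c = [], in_ch
    for k, s in zip(kernels, strides):
        layers += [nn.Conv2d(c, out_ch, k, stride=s, padding=k // 2),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        c = out_ch
    return nn.Sequential(*layers)

class FPNFeature(nn.Module):
    """Three-scale features: W/4 x H/4 (32 ch), W/2 x H/2 (16 ch), W x H (8 ch)."""
    def __init__(self):
        super().__init__()
        self.cbr0 = cbr(3, 8, [3, 3], [1, 1])           # full resolution
        self.cbr1 = cbr(8, 16, [5, 3, 3], [2, 1, 1])    # 1/2 resolution
        self.cbr2 = cbr(16, 32, [5, 3, 3], [2, 1, 1])   # 1/4 resolution
        self.out0 = nn.Conv2d(32, 32, 1)
        # hypothetical 1x1 lateral convs so the top-down additions have equal channels
        self.lat1, self.out1 = nn.Conv2d(16, 32, 1), nn.Conv2d(32, 16, 1)
        self.lat2, self.out2 = nn.Conv2d(8, 16, 1), nn.Conv2d(16, 8, 1)

    def forward(self, x):                  # x: (B, 3, H, W), H and W divisible by 4
        c0 = self.cbr0(x)
        c1 = self.cbr1(c0)
        c2 = self.cbr2(c1)
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        f_quarter = self.out0(c2)                           # W/4 x H/4, 32 channels
        f_half = self.out1(up(f_quarter) + self.lat1(c1))   # W/2 x H/2, 16 channels
        f_full = self.out2(up(f_half) + self.lat2(c0))      # W x H, 8 channels
        return f_quarter, f_half, f_full
```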
Step 2.2: and (3) carrying out feature extraction on the input camera image by the FPN network module at three different scales to obtain an initial feature map, and transitioning the initial feature map to a CA module through the DCN module.
Further, the initial feature maps of different scales extracted by the FPN network are passed to the subsequent CA module through a Deformable Convolution (DCN) operation. The CA module applies the global-receptive-field feature operation to the initial feature maps of each scale. Finally, N+1 groups of feature maps of different scales are output, with sizes W/4 × H/4, W/2 × H/2 and W × H and 32, 16 and 8 channels respectively. These N+1 groups of multi-scale feature maps are passed on to the depth refinement part.
Further, after the FPN network extracts features at three different scales from the input camera images, the obtained initial feature maps are passed to the CA module through the DCN module. As shown in fig. 4, the DCN module is composed of a deformable convolution layer (Deformable Conv), a BN layer and a ReLU activation layer; the deformable convolution has a 3 × 3 kernel, stride 1, padding 1, and a single deformable group.
The deformable convolution and the normal convolution operation differ in that: after the size of the ordinary convolution kernel is determined, the arrangement of sampling points of convolution operation is very regular and is a square. The deformable convolution adds an offset learned by an extra convolution layer to each sampling point, so that the ordering of the sampling points becomes irregular, and the same offset can be added to each sampling point to achieve the effect of sampling area scale change. The deformable convolution module simultaneously inputs the characteristic diagram and the offset as the input of the deformable convolution layer, and the convolution layer firstly offsets and then convolutes the sampling point.
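A minimal sketch of such a transition block, assuming PyTorch and torchvision's DeformConv2d; the offset-predicting convolution and the channel count are illustrative rather than taken from the patent:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCNBlock(nn.Module):
    """Deformable conv + BN + ReLU transition block (3x3 kernel, stride 1, pad 1).
    A separate ordinary conv predicts the per-sampling-point offsets."""
    def __init__(self, channels):
        super().__init__()
        # 2 offsets (dx, dy) for each of the 9 positions of the 3x3 kernel -> 18 channels
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, stride=1, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        offset = self.offset(x)          # learned shifts for every sampling point
        return self.relu(self.bn(self.dcn(x, offset)))
```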
Furthermore, the CA module embeds the attention information of the feature map in the horizontal and vertical directions into the channels, which markedly improves model performance. The CA module can be regarded as a computation unit that takes the tensor $X = [x_1, x_2, \dots, x_C] \in R^{W \times H \times C}$ as input and, after processing, outputs a transformed tensor $Y = [y_1, y_2, \dots, y_C]$ of the same size. The CA module is shown in fig. 5:
step 2.2.1: the CA module performs global average pooling on the input features along the horizontal and vertical directions respectively, as in formula (1) and formula (2):

$$p_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(i, h) \qquad (1)$$

$$p_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(w, j) \qquad (2)$$

wherein the input tensor is $X = [x_1, x_2, \dots, x_C] \in R^{W \times H \times C}$, with {W, H, C} the image width, height and number of channels; the channels are encoded with two pooling layers of size W × 1 and 1 × H respectively, (w, j) and (i, h) denote image coordinate positions of the input tensor $x_c$, $p_c^h(h)$ is the output of the c-th channel at vertical position h, and $p_c^w(w)$ is the output of the c-th channel at horizontal position w;
step 2.2.2: the pooling-layer outputs in the horizontal and vertical directions are then concatenated (Concate operation), as in formula (3):

$$P = \mathcal{C}\left(p^h, p^w\right) \qquad (3)$$

wherein $\mathcal{C}(\cdot)$ denotes the Concate operation, P is the result output after the Concate operation, $p^w$ is the output of the pooling layer in the horizontal direction, and $p^h$ is the output of the pooling layer in the vertical direction.
Step 2.2.3: the output result of the Concate is sent to a 1 × 1 convolutional layer, a BN layer and a Non-line activation function, an intermediate characteristic diagram is obtained, and the intermediate characteristic diagram is expressed by a mathematical formula (4):
f=δ(F 1×1 (p)) (4)
where P is the result of the output of the conditioner operation, F 1×1 Is convolution transformation with convolution kernel size of 1 × 1, delta is nonlinear activation function, f is intermediate characteristic diagram after encoding spatial information of input characteristic diagram along horizontal and vertical directions, f is equal to R C /r×(H+W) C is the number of channels, r is the reduction rate of the channels, and W and H are the width and height of the image;
step 2.2.4: the intermediate feature map f is split along the horizontal and vertical directions into two separate tensors $f^w \in R^{C/r \times W}$ and $f^h \in R^{C/r \times H}$; each is passed through a 1 × 1 convolution and then a Sigmoid activation function to obtain the attention weights in the horizontal and vertical directions, $q^w$ and $q^h$, as in formulas (5) and (6):

$$q^w = \sigma\left(F_{1\times 1}(f^w)\right) \qquad (5)$$

$$q^h = \sigma\left(F_{1\times 1}(f^h)\right) \qquad (6)$$

wherein $q^w$ is the horizontal attention, $q^h$ is the vertical attention, $\sigma$ denotes the Sigmoid activation function, $F_{1\times 1}$ denotes a 1 × 1 convolution operation, and $f^w$ and $f^h$ are the intermediate feature tensors in the horizontal and vertical directions respectively;
step 2.2.5: the horizontal attention $q^w$ and the vertical attention $q^h$ are multiplied with the input feature $X = [x_1, x_2, \dots, x_C] \in R^{W \times H \times C}$ to obtain the final output tensor $Y = [y_1, y_2, \dots, y_C]$, as in formula (7):

$$y_c(i, j) = x_c(i, j) \times q_c^w(i) \times q_c^h(j) \qquad (7)$$

wherein $x_c(i, j)$ is the value of the feature tensor $x_c$ at image coordinate (i, j), $q_c^w(i)$ is the attention weight of the c-th channel in the horizontal direction, and $q_c^h(j)$ is the attention weight of the c-th channel in the vertical direction.
As can be seen from the above, unlike channel attention, which compresses the feature tensor into a single feature vector by two-dimensional global pooling, the CA module decomposes channel attention into two one-dimensional feature encodings along the two spatial directions and then aggregates the features from both directions. Long-range dependencies can be captured along one spatial direction while accurate position information is retained along the other. In these two steps the CA module encodes the channel relationships and long-range dependencies with precise positional information into a pair of attention maps, which are then applied to the input tensor along the horizontal and vertical directions respectively; such attention maps can accurately locate the exact position of the object of interest in the input feature maps. The CA module therefore helps the whole model recognise targets better and yields more useful, precise information. Finally, the CA module output serves as the input of the depth refinement module.
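The steps above follow the published Coordinate Attention design, which can be sketched as follows (a PyTorch sketch under that assumption; the reduction rate and the minimum hidden width are illustrative choices):

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Pool along H and W separately, encode jointly, then re-weight the
    input with direction-aware attention maps (formulas (1)-(7))."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.conv1 = nn.Conv2d(channels, mid, 1)      # F_1x1 in formula (4)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)              # non-linear activation
        self.conv_h = nn.Conv2d(mid, channels, 1)     # F_1x1 in formula (6)
        self.conv_w = nn.Conv2d(mid, channels, 1)     # F_1x1 in formula (5)

    def forward(self, x):
        b, c, h, w = x.shape
        p_h = x.mean(dim=3, keepdim=True)             # (b, c, h, 1), formula (1)
        p_w = x.mean(dim=2, keepdim=True)             # (b, c, 1, w), formula (2)
        # concatenate along the spatial axis (formula (3)) and encode (formula (4))
        f = self.act(self.bn(self.conv1(
            torch.cat([p_h, p_w.permute(0, 1, 3, 2)], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        q_h = torch.sigmoid(self.conv_h(f_h))                      # formula (6)
        q_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # formula (5)
        return x * q_h * q_w                                       # formula (7)
```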
Step 3: depth maps of different resolutions are predicted in a cascaded manner from the multi-scale feature maps obtained by feature extraction. The depth prediction from the previous, lower-resolution feature map guides the depth range of the hypothesis planes for the next, higher-resolution feature map. For the W/4 × H/4 feature map the depth range is the whole depth range of the input scene; because this range is large, the hypothesis-plane spacing is large and only a coarse depth is produced. In the subsequent depth estimation at W/2 × H/2 and W × H, a more accurate depth estimate is recovered with finer hypothesis-plane spacing. The depth refinement module is divided into three parts: cost volume construction, cost volume regularization, and depth estimation.
Further, step 3.1: constructing a cost body: the feature graph output by the CA module is sent to a cost body construction module, and the cost body construction module is mainly divided into three steps: a hypothetical depth plane is first created for each pixel in the reference image based on a range of depth hypotheses. The two-dimensional features of each target image are then homography transformed into hypothetical planes of the reference image to form the feature volumes. Finally, a cost measure of variance is used to aggregate the plurality of feature volumes into a cost volume.
Furthermore, constructing the three-dimensional cost volume from the feature maps output by the CA module requires the depth range and the camera parameters, and the cost volume is obtained through differentiable homography warping. This not only reduces GPU memory usage but also yields more accurate depth information through depth predictions at different scales.
homography projects the target image feature map to a plurality of parallel planes under the reference image, and the process is similar to a plane scanning algorithm in three-dimensional reconstruction. And under the reference image coordinate, the target image feature map is encoded by using the camera parameter and converted into a coordinate system corresponding to the reference image through homography. Assuming that there are N target images in the input set of images, the N mapped feature maps constitute M feature volumes. The target image feature map is mapped to the reference image coordinate system, and the sizes of the mapped feature maps are different due to different depths. Because the sizes of the feature spaces formed by the feature maps are different, a bilinear interpolation algorithm needs to be adopted for each feature map to ensure that the height and the width of the mapped feature maps are the same. The homography transform determines the coordinate change from the target image feature map to the cost volume at depth value d, and the transformation formula (8) is as follows:
$$H_i(d) = K_i \cdot R_i \cdot \left(I - \frac{(t_0 - t_i)\, n_0^T}{d}\right) \cdot R_0^T \cdot K_0^{-1} \qquad (8)$$

wherein $H_i(d)$ is the homography matrix between the i-th target-image feature map and the reference feature map at depth d, i is the feature map index from 0 to N, $K_i$, $R_i$, $t_i$ are the camera intrinsics, rotation matrix and translation vector of the target view, $K_0$, $R_0$, $t_0$ are those of the reference view, I is the identity matrix, and $n_0$ is the principal-axis direction of the reference camera.
A variance-based cost metric is used to aggregate the multiple feature volumes into the cost volume, computed as in formula (9):

$$V_{cost} = \frac{1}{N+1}\sum_{i=0}^{N}\left(V_i - \overline{V}\right)^2 \qquad (9)$$

wherein $V_{cost}$ is the cost volume, obtained by computing the variance over the N target-image feature volumes $V_i$ and the reference-view feature volume, with $\overline{V}$ the mean of all feature volumes;
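A sketch of this differentiable homography warping and variance aggregation is given below (MVSNet-style, assuming PyTorch, per-view 4 × 4 projection matrices K·[R|t] and a shared set of depth hypotheses; all names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def homo_warp(src_feat, src_proj, ref_proj, depth_values):
    """Warp a target-view feature map onto the reference view for every
    hypothesised depth plane (differentiable homography)."""
    b, c, h, w = src_feat.shape
    d = depth_values.shape[1]
    proj = src_proj @ torch.inverse(ref_proj)             # reference cam -> target cam
    rot, trans = proj[:, :3, :3], proj[:, :3, 3:4]
    y, x = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32), indexing="ij")
    xyz = torch.stack([x.reshape(-1), y.reshape(-1),
                       torch.ones(h * w)]).unsqueeze(0).to(src_feat.device)
    rot_xyz = rot @ xyz                                              # (b, 3, h*w)
    rot_d_xyz = rot_xyz.unsqueeze(2) * depth_values.view(b, 1, d, 1) # (b, 3, d, h*w)
    proj_xyz = rot_d_xyz + trans.view(b, 3, 1, 1)
    grid = proj_xyz[:, :2] / proj_xyz[:, 2:3]                        # pixel coordinates
    grid_x = 2 * grid[:, 0] / (w - 1) - 1                            # normalise to [-1, 1]
    grid_y = 2 * grid[:, 1] / (h - 1) - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).view(b, d * h, w, 2)
    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.view(b, c, d, h, w)                                # one feature volume

def variance_cost_volume(ref_feat, warped_feats, depth_values):
    """Aggregate the reference and warped target feature volumes with the
    variance metric of formula (9)."""
    d = depth_values.shape[1]
    volumes = [ref_feat.unsqueeze(2).expand(-1, -1, d, -1, -1)] + warped_feats
    volumes = torch.stack(volumes, dim=0)               # (N+1, b, c, d, h, w)
    return volumes.var(dim=0, unbiased=False)           # cost volume
```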
Step 3.2: cost volume regularization: the cost volume is regularized with 3D convolutions. By building a smooth, dense matching relationship, the influence of noise on the initial cost volume is reduced, and context information is aggregated to normalize it into the regularized cost volume.
Furthermore, interference factors such as occlusion and weak-texture regions leave noise in the generated cost volume, so it must be regularized into a standard cost volume. The initial cost volume is regularized with multi-scale 3D convolutions organised as a 3D UNet, and the regularized cost volume is converted into a probability volume for depth prediction by a Softmax operation.
As shown in fig. 6, the 3D UNet network is an encoder-decoder structure. Each feature level is produced by a 3D convolution followed by a max-pooling downsampling, and the resulting feature levels are then processed by 3D deconvolution and upsampling. During training the encoder and decoder are connected by skip connections that fuse the same-scale features of the downsampling and upsampling paths, which effectively prevents vanishing gradients. Specifically, in the encoding stage the features pass through a 3D convolution and are downsampled layer by layer with stride-2 max pooling to 1/2, 1/4 and 1/8 of the original resolution; in the decoding stage the features are brought back to 1/4, 1/2 and the original feature resolution through 3D deconvolution and upsampling.
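A compact sketch of such a 3D UNet regulariser is shown below (PyTorch; it assumes the depth, height and width of the cost volume are divisible by 8, and the channel counts are illustrative):

```python
import torch
import torch.nn as nn

def conv3d_bn(in_ch, out_ch):
    return nn.Sequential(nn.Conv3d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))

class CostRegNet(nn.Module):
    """3D UNet-style regularisation: 3D conv + max pooling downsample to 1/2,
    1/4, 1/8; 3D deconvolution upsamples back; skip connections add the
    same-scale encoder features; a final 3D conv gives one channel for Softmax."""
    def __init__(self, in_ch=32, base=8):
        super().__init__()
        self.enc0 = conv3d_bn(in_ch, base)
        self.enc1 = nn.Sequential(nn.MaxPool3d(2), conv3d_bn(base, base * 2))
        self.enc2 = nn.Sequential(nn.MaxPool3d(2), conv3d_bn(base * 2, base * 4))
        self.enc3 = nn.Sequential(nn.MaxPool3d(2), conv3d_bn(base * 4, base * 8))
        def up(ci, co):
            return nn.Sequential(
                nn.ConvTranspose3d(ci, co, 3, stride=2, padding=1, output_padding=1),
                nn.BatchNorm3d(co), nn.ReLU(inplace=True))
        self.dec2 = up(base * 8, base * 4)
        self.dec1 = up(base * 4, base * 2)
        self.dec0 = up(base * 2, base)
        self.prob = nn.Conv3d(base, 1, 3, padding=1)

    def forward(self, cost):                  # cost: (B, C, D, H, W)
        e0 = self.enc0(cost)
        e1 = self.enc1(e0)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d = self.dec2(e3) + e2                # skip connection at 1/4 scale
        d = self.dec1(d) + e1                 # skip connection at 1/2 scale
        d = self.dec0(d) + e0                 # skip connection at full scale
        return self.prob(d).squeeze(1)        # (B, D, H, W), fed to Softmax
```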
Step 3.3: depth estimation:
and (4) normalizing the regularized cost body by using Softmax operation to obtain a Probability body (Proavailability volume), and predicting the depth map from the Probability body. The Soft argmin operation is used to sample uniformly over a range of depth assumptions where the expected values can yield a continuous depth estimate, outputting an initial depth map. And guiding the depth range of the next-scale image by the obtained depth information.
Further, as in the depth estimation part of fig. 1, to estimate the depth of each point the cost volume is transformed into a probability volume along the depth direction with the Softmax function. A soft argmin operation then samples uniformly over the depth hypothesis range and outputs the initial depth map. The Softmax operation compresses the depth-dimension information of the regularized cost volume into values distributed between 0 and 1, giving a probability volume that can both infer per-pixel depth and measure the confidence of the estimate. After the soft argmin operation produces the initial depth map of each image, the depth range at the current scale is adaptively re-sampled according to the depth map of the previous scale, yielding a smaller depth interval and therefore more accurate values for the later depth estimation.
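The Softmax and soft-argmin steps can be sketched as follows (PyTorch; the tensor layout is an assumption):

```python
import torch
import torch.nn.functional as F

def depth_regression(cost_volume, depth_values):
    """Softmax over the depth dimension turns the regularised cost volume into
    a probability volume; the expectation over the depth hypotheses (soft
    argmin) gives a continuous initial depth map, and the maximum probability
    serves as a per-pixel confidence estimate."""
    prob_volume = F.softmax(cost_volume, dim=1)           # (B, D, H, W)
    # depth_values: (B, D) hypothesised depths, broadcast over all pixels
    depth = torch.sum(prob_volume * depth_values.view(*depth_values.shape, 1, 1),
                      dim=1)                              # (B, H, W)
    confidence = prob_volume.max(dim=1).values            # (B, H, W)
    return depth, confidence
```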
Step 4: depth optimization: the initial W × H depth map output by the depth refinement module is refined with a residual learning network to obtain the optimized depth map; the network model is trained with Focal loss, and its parameters are updated with the Adam optimization method according to the total loss, thereby guiding the training of the whole model.
Step 4.1: referring to the depth optimization module in fig. 1, the reference image and the original-scale depth estimate from the depth refinement module are used as the input of the residual learning network to obtain the optimized depth map. Specifically, the original-resolution depth map predicted from the probability volume is refined with the reference image as guidance: after four convolutions with 3 × 3 kernels and stride 1, an optimized depth map with a single output channel is produced.
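A sketch of this residual refinement network is given below (PyTorch; the internal channel width and the exact wiring of the four convolutions are assumptions consistent with the description above):

```python
import torch
import torch.nn as nn

class DepthRefineNet(nn.Module):
    """Residual refinement: concatenate the reference image and the initial
    depth map, pass them through four 3x3 convolutions, and add the predicted
    single-channel residual back onto the initial depth."""
    def __init__(self, base=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(4, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, 1, 3, padding=1))      # single-channel residual

    def forward(self, ref_image, init_depth):
        # ref_image: (B, 3, H, W); init_depth: (B, 1, H, W)
        residual = self.layers(torch.cat([ref_image, init_depth], dim=1))
        return init_depth + residual               # optimised depth map
```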
During the training process of the network framework, the loss function is used for evaluating the difference degree between the predicted value and the true value of the model. And calculating the error of the result and the truth value of each forward iteration of the network, and guiding the next deep learning model training.
Previously, training such networks mainly used regression-based depth losses, determined by the absolute difference between the predicted and true values. The method of the invention instead treats depth estimation as a classification task to strengthen supervision in complex regions, and trains the network model with Focal loss.
The regression-based loss is the mean-squared-error loss, computed as the expectation of the squared difference between the model's prediction and the ground truth; the smaller this value, the more accurate the model. The mean-squared-error criterion is simple, but when the model obtains probabilities with Softmax and is trained by gradient descent under this loss, learning is slow.
Step 4.2: the cross-entropy loss function computes probabilities over different classes, its derivative is easy to compute for gradient descent, and the model learns quickly. $L_{CE}$ denotes the cross-entropy loss function, as in formula (10):

$$L_{CE} = -\sum_{p \in p_v} \log P\!\left(p, \hat{d}\right) \qquad (10)$$

wherein $L_{CE}$ is the cross-entropy loss function, $P(p, d)$ is the predicted probability of pixel p at depth hypothesis d, $\hat{d}$ is the depth hypothesis closest to the ground-truth value, and $p_v$ is the subset of pixels with valid ground truth.
Focal loss belongs to the loss function of the classification. A common classification loss function is the cross-entropy loss function.
Step 4.3: the cross-entropy loss is commonly used for target detection and classification; to balance the common imbalance between foreground and background classification results, a weight $\alpha \in [0, 1]$ is introduced on top of the cross-entropy loss, giving the balanced cross-entropy loss of formula (11):

$$L_{BL} = -\sum_{p \in p_v} \alpha \log P\!\left(p, \hat{d}\right) \qquad (11)$$

wherein $L_{BL}$ is the balanced cross-entropy loss function; $P(p, d)$ is the predicted probability of pixel p at depth hypothesis d, $\hat{d}$ is the depth hypothesis closest to the ground-truth value, and $p_v$ is the subset of pixels with valid ground truth.
Although formula (11) addresses the class imbalance, it does not make the model focus on hard-to-classify regions. Therefore, step 4.4: the Focal loss function increases the network's attention to hard regions by readjusting the structure of the cross-entropy loss and adding the modulating factor $\left(1 - P(p, \hat{d})\right)^{\gamma}$, where $\gamma$ is a parameter. The Focal loss, denoted $L_{FL}$, is as in formula (12):

$$L_{FL} = -\sum_{p \in p_v} \left(1 - P\!\left(p, \hat{d}\right)\right)^{\gamma} \log P\!\left(p, \hat{d}\right) \qquad (12)$$

wherein $L_{FL}$ denotes the Focal loss, $\left(1 - P(p, \hat{d})\right)^{\gamma}$ is the modulating factor with parameter $\gamma$, $P(p, d)$ is the predicted probability of pixel p at depth hypothesis d, $\hat{d}$ is the depth hypothesis closest to the ground-truth value, and $p_v$ is the subset of pixels with valid ground truth.
The Focal loss function adjusts the loss weight in regions of different difficulty: when γ = 0 it reduces to the cross-entropy loss, and with γ = 2 it suits complex scenes, so Focal loss is more accurate than the cross-entropy loss. On complex scene data sets such as Tanks and Temples, the Focal loss is better suited to the diverse and complex scenes, and in particular the accuracy in image boundary regions is higher than that obtained with the previously used cross-entropy loss, so the trained model achieves the best depth estimates.
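Treating depth estimation as classification over the depth hypotheses, the Focal loss of formula (12) can be sketched as follows (PyTorch; the tensor layout and the optional α weight are assumptions):

```python
import torch

def focal_loss(prob_volume, gt_depth_index, valid_mask, gamma=2.0, alpha=1.0):
    """Focal loss over the depth classification.
    prob_volume     (B, D, H, W)  probability volume P(p, d)
    gt_depth_index  (B, H, W)     index of the hypothesis closest to the truth
    valid_mask      (B, H, W)     boolean subset p_v of pixels with ground truth
    """
    # predicted probability of the ground-truth depth bin for every pixel
    p_t = torch.gather(prob_volume, 1, gt_depth_index.unsqueeze(1)).squeeze(1)
    p_t = p_t.clamp(min=1e-6)                       # numerical stability
    loss = -alpha * (1.0 - p_t) ** gamma * torch.log(p_t)
    return loss[valid_mask].mean()                  # gamma = 0 recovers the CE loss
```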
Step 4.5: and finally, reversely propagating and updating network model parameters by using an Adam optimization method according to the calculated total loss.
It is readily understood by a person skilled in the art that the advantageous ways described above can be freely combined, superimposed without conflict.
Claims (6)
1. A multi-view depth estimation method is characterized by comprising the following specific steps:
step 1: image input: acquiring N+1 different images with a camera, using the front-view image as the reference image $I_{i=0}$ and the images from the other directions as the target images $I_i$, where i ranges from 0 to N;
step 2: feature extraction: the feature extraction stage comprises an FPN network module and a CA module;
step 2.1: the FPN network module extracts initial feature maps of different scales from the images obtained in step 1;
step 2.2: after the FPN network module extracts features from the input camera images at three different scales, the obtained initial feature maps are passed to the CA module through a DCN module;
step 3: depth refinement: predicting depth maps of different resolutions in a cascaded manner from the multi-scale feature maps obtained by feature extraction;
step 4: depth optimization: refining the initial depth map of resolution W × H output by the depth refinement module with a residual learning network to obtain the optimized depth map, training the network model with Focal loss, and performing gradient updating of the network model according to the total loss with the Adam optimization method, thereby guiding the training of the whole model.
2. The multi-view depth estimation method of claim 1, wherein step 2.2 comprises the CA module embedding the attention information of the feature map in the horizontal and vertical directions into the channel, and comprises the following specific steps:
step 2.2.1: the CA module performs global average pooling on the input features along the horizontal and vertical directions respectively, as in formula (1) and formula (2):

$$p_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(i, h) \qquad (1)$$

$$p_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(w, j) \qquad (2)$$

wherein the input tensor is $X = [x_1, x_2, \dots, x_C] \in R^{W \times H \times C}$, with {W, H, C} the image width, height and number of channels; the channels are encoded with two pooling layers of size W × 1 and 1 × H respectively, (w, j) and (i, h) denote image coordinate positions of the input tensor $x_c$, $p_c^h(h)$ is the output of the c-th channel at vertical position h, and $p_c^w(w)$ is the output of the c-th channel at horizontal position w;
step 2.2.2: the pooling-layer outputs in the horizontal and vertical directions, $p^w$ and $p^h$, are then concatenated (Concate operation), as in formula (3):

$$P = \mathcal{C}\left(p^h, p^w\right) \qquad (3)$$

wherein $\mathcal{C}(\cdot)$ denotes the Concate operation, P is the result output after the Concate operation, $p^w$ is the output of the pooling layer in the horizontal direction, and $p^h$ is the output of the pooling layer in the vertical direction;
step 2.2.3: the Concate output is fed into a 1 × 1 convolutional layer, a BN layer and a non-linear activation function to obtain an intermediate feature map, as in formula (4):

$$f = \delta\left(F_{1\times 1}(P)\right) \qquad (4)$$

wherein P is the output of the Concate operation, $F_{1\times 1}$ is a convolution with kernel size 1 × 1, $\delta$ is the non-linear activation function, and f is the intermediate feature map obtained by encoding the spatial information of the input feature map along the horizontal and vertical directions, $f \in R^{C/r \times (H+W)}$, where C is the number of channels, r is the channel reduction rate, and W and H are the image width and height;
step 2.2.4: splitting the intermediate feature map f along the horizontal and vertical directions into two separate tensors $f^w \in R^{C/r \times W}$ and $f^h \in R^{C/r \times H}$, applying a 1 × 1 convolution to each, and then a Sigmoid activation function, to obtain the attention weights in the horizontal and vertical directions, $q^w$ and $q^h$, as in formulas (5) and (6):

$$q^w = \sigma\left(F_{1\times 1}(f^w)\right) \qquad (5)$$

$$q^h = \sigma\left(F_{1\times 1}(f^h)\right) \qquad (6)$$

wherein $q^w$ is the horizontal attention, $q^h$ is the vertical attention, $\sigma$ denotes the Sigmoid activation function, $F_{1\times 1}$ denotes a 1 × 1 convolution operation, and $f^w$ and $f^h$ are the intermediate feature tensors in the horizontal and vertical directions respectively;
step 2.2.5: multiplying the horizontal attention $q^w$ and the vertical attention $q^h$ with the input feature $X = [x_1, x_2, \dots, x_C] \in R^{W \times H \times C}$ to obtain the final output tensor $Y = [y_1, y_2, \dots, y_C]$, as in formula (7):

$$y_c(i, j) = x_c(i, j) \times q_c^w(i) \times q_c^h(j) \qquad (7)$$

wherein $x_c(i, j)$ is the value of the feature tensor $x_c$ at image coordinate (i, j), $q_c^w(i)$ is the attention weight of the c-th channel in the horizontal direction, and $q_c^h(j)$ is the attention weight of the c-th channel in the vertical direction.
3. The multi-view depth estimation method according to claim 1, wherein step 3 includes constructing a cost body, regularization of the cost body, and depth estimation, and specifically includes the following steps:
step 3.1: constructing a cost body: sending the feature graph output by the CA module to a cost body construction module to construct a cost body;
step 3.2: regularization of a cost body: regularizing the cost body by using 3D convolution;
step 3.3: depth estimation: applying a Softmax operation to the regularized cost volume to obtain a probability volume, and predicting the depth map from the probability volume.
4. The multi-view depth estimation method of claim 3, wherein step 3.1 further comprises the steps of:
step 3.1.1: establishing a hypothetical depth plane for each pixel in the reference image according to the depth hypothetical range;
step 3.1.2: and (3) transforming the two-dimensional characteristics of each target image into a hypothetical plane of the reference image by using homography to form a characteristic body, wherein the homography transformation process is as the following formula (8):
$$H_i(d) = K_i \cdot R_i \cdot \left(I - \frac{(t_0 - t_i)\, n_0^T}{d}\right) \cdot R_0^T \cdot K_0^{-1} \qquad (8)$$

wherein $H_i(d)$ is the homography matrix between the i-th target-image feature map and the reference feature map at depth d, i is the feature map index from 0 to N, and $K_i$, $R_i$, $t_i$ are the camera intrinsics, rotation matrix and translation vector when the target image was captured; $K_0^{-1}$, $R_0^T$ and $t_0$ are respectively the inverse of the camera intrinsic matrix, the transpose of the rotation matrix and the translation vector when the reference image was captured, I is the identity matrix, and $n_0^T$ is the principal-axis direction of the reference camera;
step 3.1.3: using the cost measure of variance to aggregate the plurality of feature volumes into cost volumes, calculating the cost volumes, as in equation (9):
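The homography of formula (8) and the variance aggregation of formula (9) can be sketched as follows. The camera convention, tensor shapes (3 × 3 intrinsics and rotations, 3 × 1 translations, feature bodies stacked along the first dimension) and the helper names homography_at_depth and variance_cost are assumptions for illustration, not taken from the patent; the differentiable warping itself is omitted.

```python
import torch

def homography_at_depth(K_i, R_i, t_i, K_0, R_0, t_0, n_0, d):
    """Formula (8): plane-induced homography for the hypothetical plane at depth d.
    K_* are 3x3 intrinsic matrices, R_* 3x3 rotations, t_* 3x1 translations,
    n_0 is the 3x1 principal-axis direction of the reference camera, d a scalar depth."""
    identity = torch.eye(3, dtype=K_i.dtype)
    plane_term = identity - (t_0 - t_i) @ n_0.T / d
    return K_i @ R_i @ plane_term @ R_0.T @ torch.linalg.inv(K_0)

def variance_cost(feature_bodies: torch.Tensor) -> torch.Tensor:
    """Formula (9): aggregate the (N + 1) warped feature bodies, stacked along
    dim 0 as (N+1) x C x D x H x W, into a single C x D x H x W cost body."""
    mean_body = feature_bodies.mean(dim=0, keepdim=True)
    return ((feature_bodies - mean_body) ** 2).mean(dim=0)
```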
5. The multi-view depth estimation method of claim 3, wherein step 3.2 further comprises the steps of:
step 3.2.1: performing downsampling through 3D convolution and max pooling layers to output the feature layer at each scale;
step 3.2.2: then performing 3D deconvolution and upsampling operations on the obtained feature layers.
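Steps 3.2.1-3.2.2 describe an encoder-decoder style 3D regularization. The sketch below is only illustrative: the number of scales, the channel widths, the skip connection and the single-channel output are assumptions rather than the claimed network, and it assumes D, H and W are even.

```python
import torch
import torch.nn as nn

class CostRegularization3D(nn.Module):
    """Sketch of steps 3.2.1-3.2.2: 3D convolution + max pooling for downsampling,
    followed by 3D deconvolution for upsampling."""
    def __init__(self, in_channels: int = 32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(in_channels, 16, 3, padding=1),
                                  nn.BatchNorm3d(16), nn.ReLU(inplace=True))
        self.down = nn.MaxPool3d(kernel_size=2)                        # step 3.2.1
        self.enc2 = nn.Sequential(nn.Conv3d(16, 32, 3, padding=1),
                                  nn.BatchNorm3d(32), nn.ReLU(inplace=True))
        self.up = nn.ConvTranspose3d(32, 16, kernel_size=2, stride=2)  # step 3.2.2
        self.out = nn.Conv3d(16, 1, 3, padding=1)                      # single-channel cost body

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        """cost: N x C x D x H x W cost body with C feature channels."""
        e1 = self.enc1(cost)
        e2 = self.enc2(self.down(e1))
        d1 = self.up(e2) + e1            # skip connection (an assumption of this sketch)
        return self.out(d1).squeeze(1)   # N x D x H x W regularized cost body
```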
6. The multi-view depth estimation method according to claim 1, wherein step 4 further comprises the steps of:
step 4.1: taking the reference image at its original scale and the initial depth estimation result as the input of the residual learning network of the depth refinement module to obtain an optimized depth map;
step 4.2: using Focal Loss as the loss function for model training; as its starting point, the cross-entropy loss function is as in formula (10):
L_CE = ∑_{p∈p_v} −log P_p(d̃_p)    (10)
where L_CE is the cross-entropy loss function, P_p(d) is the predicted probability of pixel p at depth hypothesis d, d̃_p is the depth hypothesis closest to the true depth value, and p_v is the subset of pixels with valid ground-truth values;
step 4.3: the balanced cross-entropy loss function, commonly used in object detection and classification, introduces a weight α on the basis of the cross-entropy loss function and is expressed as formula (11):
L_BL = ∑_{p∈p_v} −α · log P_p(d̃_p)    (11)
where L_BL is the balanced cross-entropy loss function, α ∈ [0, 1] is the introduced weight, P_p(d) is the predicted probability of pixel p at depth hypothesis d, d̃_p is the depth hypothesis closest to the true depth value, and p_v is the subset of pixels with valid ground-truth values;
step 4.4: the Focal Loss function is as in formula (12):
L_FL = ∑_{p∈p_v} −(1 − P_p(d̃_p))^γ · log P_p(d̃_p)    (12)
where L_FL denotes the Focal Loss, (1 − P_p(d̃_p))^γ is the modulating factor, γ is the focusing parameter, P_p(d) is the predicted probability of pixel p at depth hypothesis d, d̃_p is the depth hypothesis closest to the true depth value, and p_v is the subset of pixels with valid ground-truth values (an illustrative sketch of this loss follows this claim);
step 4.5: back-propagating the solved total loss and updating the network model parameters by using the Adam optimization method.
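The Focal Loss of formula (12) applied over the probability body, together with the Adam update of step 4.5, can be sketched as follows. The tensor layout (probability body N × D × H × W, per-pixel index of the hypothesis closest to the true depth, validity mask for p_v) and the name mvs_focal_loss are assumptions made only for this illustration.

```python
import torch

def mvs_focal_loss(prob_body, gt_index, valid_mask, gamma=2.0, alpha=1.0, eps=1e-6):
    """Formula (12) over the probability body.
    prob_body:  N x D x H x W probability body (Softmax over depth hypotheses);
    gt_index:   N x H x W integer index of the hypothesis closest to the true depth;
    valid_mask: N x H x W boolean mask of the pixels p_v with valid ground truth."""
    # Gather P_p(d~): the predicted probability at the hypothesis nearest the truth.
    p_t = torch.gather(prob_body, 1, gt_index.unsqueeze(1)).squeeze(1).clamp(min=eps)
    loss = -alpha * (1.0 - p_t) ** gamma * torch.log(p_t)   # per-pixel focal term
    return loss[valid_mask].mean()                          # averaged over p_v

# Step 4.5 (illustrative): back-propagate the total loss and update with Adam.
# `model`, `prob_body`, `gt_index` and `valid_mask` are placeholders here.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = mvs_focal_loss(prob_body, gt_index, valid_mask)
# optimizer.zero_grad()
# loss.backward()
# optimizer.step()
```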
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211279016.1A CN115588038A (en) | 2022-10-19 | 2022-10-19 | Multi-view depth estimation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115588038A true CN115588038A (en) | 2023-01-10 |
Family
ID=84779594
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211279016.1A Pending CN115588038A (en) | 2022-10-19 | 2022-10-19 | Multi-view depth estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115588038A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116091712A (en) * | 2023-04-12 | 2023-05-09 | 安徽大学 | Multi-view three-dimensional reconstruction method and system for computing resource limited equipment |
CN116883479A (en) * | 2023-05-29 | 2023-10-13 | 杭州飞步科技有限公司 | Monocular image depth map generation method, monocular image depth map generation device, monocular image depth map generation equipment and monocular image depth map generation medium |
CN116883479B (en) * | 2023-05-29 | 2023-11-28 | 杭州飞步科技有限公司 | Monocular image depth map generation method, monocular image depth map generation device, monocular image depth map generation equipment and monocular image depth map generation medium |
CN117011772A (en) * | 2023-07-31 | 2023-11-07 | 广东电网有限责任公司 | Risk prompting method, device and storage medium for power transmission line |
CN117011772B (en) * | 2023-07-31 | 2024-04-30 | 广东电网有限责任公司 | Risk prompting method, device and storage medium for power transmission line |
CN116958128A (en) * | 2023-09-18 | 2023-10-27 | 中南大学 | Medical image automatic positioning method based on deep learning |
CN116958128B (en) * | 2023-09-18 | 2023-12-26 | 中南大学 | Medical image automatic positioning method based on deep learning |
CN118015189A (en) * | 2024-01-31 | 2024-05-10 | 中国科学院国家空间科学中心 | Small celestial body multi-view three-dimensional reconstruction method and system based on weak illumination self-adaption |
CN118134925A (en) * | 2024-05-07 | 2024-06-04 | 亿芯微半导体科技(深圳)有限公司 | Semiconductor chip flip method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115588038A (en) | Multi-view depth estimation method | |
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
CN107292912B (en) | Optical flow estimation method based on multi-scale corresponding structured learning | |
CN111259945B (en) | Binocular parallax estimation method introducing attention map | |
CN110443883B (en) | Plane three-dimensional reconstruction method for single color picture based on droplock | |
CN110009674B (en) | Monocular image depth of field real-time calculation method based on unsupervised depth learning | |
CN111325794A (en) | Visual simultaneous localization and map construction method based on depth convolution self-encoder | |
CN114666564B (en) | Method for synthesizing virtual viewpoint image based on implicit neural scene representation | |
CN113962858B (en) | Multi-view depth acquisition method | |
CN110910437B (en) | Depth prediction method for complex indoor scene | |
CN114332639B (en) | Satellite attitude vision measurement method of nonlinear residual error self-attention mechanism | |
CN111950477A (en) | Single-image three-dimensional face reconstruction method based on video surveillance | |
CN117522990B (en) | Category-level pose estimation method based on multi-head attention mechanism and iterative refinement | |
CN114049434A (en) | 3D modeling method and system based on full convolution neural network | |
CN117808794B (en) | PDC drill bit abrasion detection method and system based on multi-view three-dimensional reconstruction | |
CN114372523A (en) | Binocular matching uncertainty estimation method based on evidence deep learning | |
CN115115685A (en) | Monocular image depth estimation algorithm based on self-attention neural network | |
CN114996814A (en) | Furniture design system based on deep learning and three-dimensional reconstruction | |
CN115330935A (en) | Three-dimensional reconstruction method and system based on deep learning | |
EP4292059A1 (en) | Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture | |
CN116958420A (en) | High-precision modeling method for three-dimensional face of digital human teacher | |
CN116563682A (en) | Attention scheme and strip convolution semantic line detection method based on depth Hough network | |
CN116205962A (en) | Monocular depth estimation method and system based on complete context information | |
CN116468769A (en) | Depth information estimation method based on image | |
CN117593702B (en) | Remote monitoring method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||