CN111402405A - Attention mechanism-based multi-view image three-dimensional reconstruction method - Google Patents

Attention mechanism-based multi-view image three-dimensional reconstruction method Download PDF

Info

Publication number
CN111402405A
Authority
CN
China
Prior art keywords
attention
layer
reconstruction
feature
view image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010205875.0A
Other languages
Chinese (zh)
Inventor
孔德慧
虞义兰
王少帆
李敬华
王立春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202010205875.0A priority Critical patent/CN111402405A/en
Publication of CN111402405A publication Critical patent/CN111402405A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-view image three-dimensional reconstruction method based on an attention mechanism, which addresses two problems of multi-view reconstruction: feature sampling being restricted to regular grid points, and the failure to effectively fuse complementary features across views during fusion. The method comprises an encoding layer, a barycentric-constrained distance attention aggregation module and a decoding layer, and proceeds as follows: an image set containing N pictures is passed through the encoding layer to obtain a depth feature set P of N elements; the feature set is input into the barycentric-constrained distance attention aggregation module, which outputs the fused feature y'; y' is passed through the deconvolution operations of the decoding layer to generate the predicted three-dimensional model Y'; and the prediction closest to the ground-truth three-dimensional model (GT) is obtained by minimizing the total reconstruction loss. The method adopts the idea of deformable convolution and performs the convolution operation with offset convolution kernels, so that the receptive field of the convolution is adjusted dynamically and adaptively, which improves the quality of feature extraction. Meanwhile, a barycentric constraint term is introduced into the attention aggregation module so that, through a barycentric-distance constraint, the aggregated feature retains the weight-dependent influence of the differently weighted features in the input feature set; this balances the deviation between the fused feature and the input features, yields better multi-view fusion features, and further improves the model reconstruction result.

Description

Attention mechanism-based multi-view image three-dimensional reconstruction method
Technical Field
The invention belongs to the field of computer vision and relates to a novel feature-learning and feature-fusion method for deep-learning-based three-dimensional reconstruction from multi-view images.
Background
Conventional three-dimensional reconstruction methods, represented by Structure from Motion (SfM) and visual simultaneous localization and mapping (vSLAM), typically rely on hand-crafted features and multi-view feature matching to reconstruct three-dimensional models. However, if the baseline between viewpoints is too long, the images exhibit significant appearance changes or self-occlusion, which poses a major challenge to feature matching and thus reduces the quality of the multi-view reconstructed model.
In recent years, several deep learning methods have been used to estimate three-dimensional shapes from multiple images and have achieved encouraging results, such as 3D-R2N2, LSM, DeepMVS, RayNet and AttSets. 3D-R2N2 and LSM both formulate multi-view reconstruction as a sequence learning problem and use RNNs to fuse the multiple depth features extracted from the input images by a shared encoder.
Although DeepMVS and RayNet achieve permutation invariance, they only capture first-order (first-moment) information from the large set of depth features and completely neglect other features that may be valuable for accurate three-dimensional shape estimation, so their reconstruction results are unsatisfactory. AttSets instead performs attentional aggregation over the depth features of an arbitrary number of multi-view images, replacing the RNN module or the max/average pooling operation, and thereby achieves multi-view information fusion and reconstruction that is independent of the input order.
However, AttSets still has two major shortcomings. First, it extracts features with conventional convolution operations: the convolution kernel is rectangular and the sampling points involved in the operation are restricted to regular grid points, which limits the algorithm's ability to extract features targeted at the object to be reconstructed and degrades the quality of the reconstruction features. Second, although the attention-aggregation-based fusion strengthens view-independent complementary features, it correspondingly suppresses some view-dependent features; the fused feature therefore deviates strongly from the low-weight features in the input feature set, much of the useful information contained in those features is not reflected in the reconstruction result, and the reconstruction quality of the model decreases.
To address these two problems, a feed-forward neural network is designed. On the one hand, it adopts the idea of deformable convolution and performs the convolution operation with offset convolution kernels, so that the receptive field of the convolution is adjusted dynamically and adaptively and the quality of feature extraction is improved. On the other hand, a barycentric constraint term is introduced into the attention aggregation module so that, through a barycentric-distance constraint, the aggregated feature retains the weight-dependent influence of the differently weighted features in the input feature set; this balances the deviation between the fused feature and the input features, yields better multi-view fusion features, and further improves the model reconstruction result.
Disclosure of Invention
To address the multi-view model reconstruction problem, the invention proposes a strategy that uses deformable convolution for adaptive information extraction from the single-view input features and a barycentric-constrained distance attention aggregation module for multi-view feature fusion, so that the quality of both the input-image features and the fused features is improved while permutation invariance of the input images is preserved, leading to better model reconstruction results.
The technical scheme of the invention is shown in FIG. 1. Overall, it can be divided into an encoding layer, a barycentric-constrained distance attention aggregation module and a decoding layer. We obtain a depth feature set of N elements, P = {p_1, p_2, …, p_N}, p_n ∈ R^{1×D}, by passing the image set containing N pictures through the encoding layer, where N is arbitrary and D is the feature dimension fixed by the chosen encoder. The feature set is input into the barycentric-constrained distance attention aggregation module, which outputs the fused feature y' ∈ R^{1×D}; y' is then passed through the deconvolution operations of the decoding layer to generate the predicted three-dimensional model Y'. We obtain the prediction closest to the ground-truth three-dimensional model (GT) by minimizing equation (1):
Loss = L_ae + L_w    (1)
where L_ae is the coding loss and L_w is the barycentric constraint term.
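For illustration only, the overall three-stage flow can be sketched in PyTorch as follows. The toy encoder and decoder, the feature dimension D and the voxel resolution are placeholder assumptions and do not reflect the network architecture actually used by the invention; only the attention-based fusion follows the aggregation described below, and the losses of equations (9) and (10) would be applied to P, y_fused and Y_pred during training.

import torch
import torch.nn as nn

class ToyPipeline(nn.Module):
    # Stand-in encoding and decoding layers; only the fusion step reflects the scheme above.
    def __init__(self, D=1024, vox=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, D))  # stand-in encoding layer
        self.W = nn.Parameter(torch.randn(D, D) * 0.01)                        # attention weight of h
        self.decoder = nn.Sequential(nn.Linear(D, vox ** 3), nn.Sigmoid())     # stand-in decoding layer

    def forward(self, views):                              # views: (N, 3, 32, 32)
        P = self.encoder(views)                            # depth feature set P, shape (N, D)
        S = torch.softmax(torch.tanh(P @ self.W), dim=0)   # attention scores over the N views
        y_fused = (P * S).sum(dim=0)                       # fused feature y'
        Y_pred = self.decoder(y_fused)                     # predicted model Y' (flattened voxel grid)
        return P, y_fused, Y_pred

views = torch.rand(4, 3, 32, 32)                           # N = 4 input views of one object
P, y_fused, Y_pred = ToyPipeline()(views)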
The encoding operation is as follows: the image set containing N pictures is passed through the encoding layer to obtain the depth feature set P of N elements. One of our innovations lies in the encoding layer, which is shown in FIG. 2 together with the encoding layer of AttSets. Adopting the idea of deformable convolution, the convolution operation is performed with offset convolution kernels, so that the receptive field of the convolution is adjusted dynamically and adaptively, the quality of feature extraction is improved, and features with stronger expressive power are extracted. Since replacing a conventional convolutional layer with a deformable convolution increases the amount of computation, and our experiments show that replacing every other layer gives almost the same effect, only half of the conventional convolutional layers are replaced here. Introducing deformable convolution into the encoding is one of the innovations of the invention.
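As a non-limiting sketch of this alternating design, a deformable 3 × 3 convolution can be built from torchvision's deform_conv2d, with the offsets Δa_m predicted from the input by an auxiliary convolution. The channel widths, the depth of the toy encoder and the offset-prediction layer below are illustrative assumptions, not the architecture of FIG. 2(b).

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConvBlock(nn.Module):
    # 3x3 deformable convolution whose sampling offsets (the Δa_m of formula (4))
    # are predicted from the input by an auxiliary 3x3 convolution.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        nn.init.zeros_(self.offset_conv.weight)   # start from the regular grid V
        nn.init.zeros_(self.offset_conv.bias)

    def forward(self, x):
        offset = self.offset_conv(x)              # (B, 18, H, W): one (dy, dx) per sample point
        return deform_conv2d(x, offset, self.weight, self.bias, padding=1)

class Encoder(nn.Module):
    # Toy encoder alternating conventional and deformable 3x3 convolutions.
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.dconv2 = DeformableConvBlock(32, 64)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.dconv4 = DeformableConvBlock(128, 128)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, feat_dim)        # each view becomes p_n ∈ R^{1×D}

    def forward(self, img):
        h = torch.relu(self.conv1(img))
        h = torch.relu(self.dconv2(h))
        h = torch.relu(self.conv3(h))
        h = torch.relu(self.dconv4(h))
        return self.fc(self.pool(h).flatten(1))

imgs = torch.rand(5, 3, 127, 127)                 # N = 5 views
P = Encoder()(imgs)                               # depth feature set P, shape (5, 1024)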
The barycentric-constrained distance attention aggregation module fuses the depth feature set P of the N elements. The loss is computed in this stage, and the coding loss is computed with the cross entropy:
L_ae = Y log Y' + (1 - Y) log(1 - Y')    (9)
the coding loss in the present invention is obtained by means of attention aggregation, so that they can also be called attention module. Because the attention aggregation module ignores the characteristics of the view angles with smaller weight, the gravity center constraint distance module is added to ensure that the sum of the distances between the input characteristics and the fusion output characteristics of each view angle is smaller, which is a second innovation point of us, and the calculation formula is as follows:
L_w = Σ_{i=1}^{N} λ_i · D(p_i, y')    (10)
where D(p_i, y') measures the distance between p_i and y'; the Euclidean distance is chosen here as the metric, and λ_i = 1. Adding equations (9) and (10) gives the total reconstruction loss of the model.
Advantageous effects
By using deformable convolution in part of the encoding layer and performing the convolution operation with offset convolution kernels, the method dynamically and adaptively adjusts the receptive field of the convolution and improves the quality of feature extraction. Meanwhile, a barycentric constraint term is introduced into the attention aggregation module so that, through a barycentric-distance constraint, the aggregated feature retains the weight-dependent influence of the differently weighted features in the input feature set; this balances the deviation between the fused feature and the input features, yields better multi-view fusion features, and further improves the model reconstruction result.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2(a) a coding layer representation of AttSets;
FIG. 2(b) is a representation of the coding layers of the present invention;
FIG. 3(a) is a representation of a conventional convolution rule sample;
FIG. 3(b) is a representation of a deformable convolution sample;
FIG. 4 is a decoding layer representation;
FIG. 5 is a representation of the attention-aggregation module of AttSets;
FIG. 6 is a representation of the barycentric-constrained distance attention aggregation module;
FIG. 7 is an exemplary graph of single-view reconstruction results on a ShapeNet test set;
FIG. 8 is an exemplary graph of multi-view reconstruction results on a ShapeNet test set.
Detailed Description
The attention-mechanism-based multi-view image three-dimensional reconstruction method is described in detail below. Its structure comprises an encoding layer, a barycentric-constrained distance attention aggregation module and a decoding layer, which are implemented as follows:
Step 1: encoding the input images
Step 1.1: the image set containing N pictures is input into the encoding layer, whose specific structure is shown in FIG. 2(b).
A conventional two-dimensional convolution comprises two steps. First, the input feature map x is sampled on a regular grid V, which defines the size and stride of the convolution kernel. Second, the sampled values weighted by w are summed. For example, with stride 1 and kernel size 3 × 3, V = {(-1, -1), (-1, 0), …, (0, 1), (1, 1)}. Each position a_0 in the output feature map p of the conventional two-dimensional convolution is computed as:
p(a_0) = Σ_{a_m ∈ V} w(a_m) · x(a_0 + a_m)    (3)
where a_m enumerates the sampling positions in V.
Introducing deformable convolution for feature extraction is one of our innovations. Deformable convolution is an improvement over the conventional rectangular convolution and is essentially an improvement of the sampling: it augments the regular grid V with sampling-point offsets {Δa_m | m = 1, …, M}, where M = |V|. Formula (3) then becomes:
p(a_0) = Σ_{a_m ∈ V} w(a_m) · x(a_0 + a_m + Δa_m)    (4)
after the deformable convolution coding outputs the feature map p, we deform it into 1 × D feature piI represents the ith viewing angleFigure (a).
Encoding all views yields the depth feature set P = {p_1, p_2, …, p_N}, p_n ∈ R^{1×D}, where N is arbitrary and D is the feature dimension fixed by the chosen encoder.
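As a quick sanity check of formulas (3) and (4), assuming PyTorch and torchvision are available, setting all offsets Δa_m to zero makes the deformable convolution of formula (4) coincide with the conventional convolution of formula (3):

import torch
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

x = torch.rand(1, 4, 16, 16)                  # input feature map
w = torch.rand(8, 4, 3, 3)                    # 3x3 kernel, |V| = 9
zero_offset = torch.zeros(1, 2 * 9, 16, 16)   # Δa_m = 0 for every sample point

regular = F.conv2d(x, w, padding=1)                        # formula (3)
deformable = deform_conv2d(x, zero_offset, w, padding=1)   # formula (4) with Δa_m = 0
print(torch.allclose(regular, deformable, atol=1e-5))      # expected: True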
Step 2: barycentric-constrained distance attention aggregation module
Step 2.1: each element of the feature set P is input into an activation function h, which can be a standard neural layer, that is, a linear transformation layer followed by a nonlinear activation function. Here we use one fully connected layer and one tanh layer as an example, and the bias term is omitted for simplicity. The output of h is a set of learned attention activations Z = {z_1, z_2, …, z_N}, where
z_n = h(p_n, W) = tanh(p_n W),    (p_n ∈ R^{1×D}, W ∈ R^{D×D}, z_n ∈ R^{1×D})    (5)
Step 2.2: the N learned attention activations are normalized to calculate a set of attention scores S = {s_1, s_2, …, s_N}. We choose softmax as the normalization operation, so the attention score of the n-th element is:
s_n^d = exp(z_n^d) / Σ_{j=1}^{N} exp(z_j^d)    (6)

where z_n^d is the d-th entry of z_n.
Step 2.3: the calculated attention scores S are multiplied by their corresponding original features in P to generate a new set of depth features, called the weighted features C = {c_1, c_2, …, c_N}, where
c_n = p_n × s_n,    (p_n ∈ R^{1×D}, s_n ∈ R^{1×1}, c_n ∈ R^{1×D})    (7)
Step 2.4: the weighted features of all N elements are summed to obtain a fixed-size feature vector, denoted y', where
y'^d = Σ_{n=1}^{N} c_n^d    (8)

where c_n^d is the d-th entry of c_n.
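Steps 2.1 to 2.4 (formulas (5)-(8)) can be sketched in a few lines, assuming the fully connected + tanh choice of h; N, D and the random initialization of W below are illustrative only:

import torch

N, D = 5, 1024
P = torch.rand(N, D)                    # depth feature set, p_n ∈ R^{1×D}
W = torch.randn(D, D) * 0.01            # learnable weight of h

Z = torch.tanh(P @ W)                   # formula (5): attention activations z_n
S = torch.softmax(Z, dim=0)             # formula (6): normalize over the N views, per dimension d
C = P * S                               # formula (7): weighted features c_n
y_fused = C.sum(dim=0)                  # formula (8): fused feature y' ∈ R^{1×D}
print(y_fused.shape)                    # torch.Size([1024])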
Step 2.5: we introduce the barycentric constraint distance, which is our second innovation. With D(p_i, y') denoting the distance between p_i and y', the barycentric constraint distance is computed as:

L_w = Σ_{i=1}^{N} λ_i · D(p_i, y')    (10)

where D(p_i, y') measures the distance between p_i and y'; the Euclidean distance is chosen here as the metric, and λ_i = 1.
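A minimal sketch of formula (10), with the Euclidean distance as D and λ_i = 1 as stated above; the feature values and the stand-in for y' are placeholders:

import torch

N, D = 5, 1024
P = torch.rand(N, D)                                      # input view features p_i
y_fused = P.mean(dim=0)                                   # placeholder for the fused feature y'
L_w = torch.norm(P - y_fused.unsqueeze(0), dim=1).sum()   # Σ_i λ_i ||p_i - y'||_2 with λ_i = 1
print(L_w)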
Step 3: decoding the fused feature to obtain the predicted three-dimensional model
Step 3.1: the fused feature y' is decoded by the decoding layer to obtain the predicted three-dimensional model Y'.
Step 3.2: the coding loss L_ae is computed as the cross entropy between the ground-truth voxel model Y and the predicted three-dimensional model Y':
L_ae = Y log Y' + (1 - Y) log(1 - Y')    (9)
Step 3.3: adding equations (9) and (10) gives the total reconstruction loss of the model:
Loss = L_ae + L_w    (1)
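A minimal sketch of the total loss of equation (1) on a toy voxel grid; the 32^3 resolution, the clamping for numerical stability and the placeholder features are assumptions, and the coding loss is implemented here as the standard binary cross entropy, i.e. the negative mean of the expression in formula (9):

import torch

Y_true = (torch.rand(32, 32, 32) > 0.5).float()          # ground-truth voxel occupancy Y
Y_pred = torch.rand(32, 32, 32).clamp(1e-6, 1 - 1e-6)    # predicted occupancy Y'

# Coding loss: standard binary cross entropy (negative mean of formula (9)).
L_ae = -(Y_true * torch.log(Y_pred) + (1 - Y_true) * torch.log(1 - Y_pred)).mean()

P = torch.rand(5, 1024)                                   # placeholder view features p_i
y_fused = P.mean(dim=0)                                   # placeholder fused feature y'
L_w = torch.norm(P - y_fused, dim=1).sum()                # formula (10) with λ_i = 1, as in the previous sketch

loss = L_ae + L_w                                         # equation (1)
print(loss)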
The single-view experimental results are shown in Table 1, where Ours-de denotes the model that only introduces deformable convolution in the encoding layer and does not use the barycentric-constrained feature aggregation; Ours-ba denotes the model that still uses conventional convolution features but introduces the barycentric constraint term; and Ours-com denotes the model that introduces both the deformable convolution features and the barycentric-constrained feature aggregation. Table 2 shows the average IoU of multi-view reconstruction for different numbers of views. The experimental results show that both improvements of the proposed scheme bring gains: the deformable convolution extracts more expressive features and yields a quality gain, the barycentric constraint term contributes clearly to the gain in reconstruction quality, and combining the two improvements gives even better results.
Table 1. Per-category average IoU for single-view reconstruction on ShapeNet. The best number for each category is highlighted in bold.
(Table 1 is provided as an image in the original publication and is not reproduced here.)
Table 2. Average IoU over all 13 categories for multi-view reconstruction on ShapeNet; the best result for each number of views is highlighted in bold.
(Table 2 is provided as an image in the original publication and is not reproduced here.)

Claims (6)

1. A multi-view image three-dimensional reconstruction method based on an attention mechanism is characterized by comprising the following steps:
(1) obtaining a depth feature set of N elements, P = {p_1, p_2, ..., p_N}, p_n ∈ R^{1×D}, by passing an image set containing N pictures through an encoding layer, where N is arbitrary and D is the feature dimension fixed by the chosen encoder;
(2) inputting the feature set P into a barycentric-constrained distance attention aggregation module and outputting the fused feature y' ∈ R^{1×D};
(3) passing the fused feature y' through the deconvolution operations of a decoding layer to generate the predicted three-dimensional model Y', and obtaining the prediction closest to the ground-truth three-dimensional model GT by minimizing the total reconstruction loss.
2. The method for three-dimensional reconstruction of multi-view image based on attention mechanism as claimed in claim 1, wherein: the coding layer employs an AttSets coding layer and is improved by replacing the conventional convolutional layers of all even layers with deformable convolutional layers.
3. The method for three-dimensional reconstruction of multi-view image based on attention mechanism as claimed in claim 1, wherein the barycentric-constrained distance attention aggregation module is specifically as follows:
(1) each element of the feature set P is input into an activation function h, whose output is a set of learned attention activations Z = {z_1, z_2, ..., z_N}, where z_n = h(p_n, W), W is the learnable weight to be trained, and p_n ∈ R^{1×D}, W ∈ R^{D×D}, z_n ∈ R^{1×D};
(2) the N learned attention activations z_n are normalized to calculate a set of attention scores S = {s_1, s_2, ..., s_N}; softmax is selected as the normalization operation, so the attention score of the n-th element is:

s_n^d = exp(z_n^d) / Σ_{j=1}^{N} exp(z_j^d)

where z_n^d is the d-th entry of z_n;
(3) the calculated attention scores S are multiplied by their corresponding original features in P to generate a new set of depth features, called the weighted features C = {c_1, c_2, ..., c_N}, where

c_n = p_n × s_n,    p_n ∈ R^{1×D}, s_n ∈ R^{1×1}, c_n ∈ R^{1×D}    (7)
(4) summing the weighted features of all N elements to obtain a fixed-size feature vector, denoted y', as follows:

y'^d = Σ_{n=1}^{N} c_n^d

where c_n^d is the d-th entry of c_n.
4. The method for three-dimensional reconstruction of multi-view image based on attention mechanism as claimed in claim 3, wherein: the activation function h is a standard neural layer, that is, a linear transformation layer followed by a nonlinear activation function.
5. The method for three-dimensional reconstruction of multi-view image based on attention mechanism as claimed in claim 4, wherein: the activation function h is preferably one fully connected layer and one tanh layer.
6. The method for three-dimensional reconstruction of multi-view image based on attention mechanism as claimed in claim 1, wherein: the total reconstruction loss is as follows:
Loss = L_ae + L_w
where:
the coding loss L_ae is computed as the cross entropy between the ground-truth voxel model Y and the predicted three-dimensional model Y', as follows:
L_ae = Y log Y' + (1 - Y) log(1 - Y'),
the loss term L_w of the barycentric constraint distance is as follows:
L_w = Σ_{i=1}^{N} λ_i · D(p_i, y')
where D(p_i, y') denotes the distance between p_i and y', preferably the Euclidean distance.
CN202010205875.0A 2020-03-23 2020-03-23 Attention mechanism-based multi-view image three-dimensional reconstruction method Pending CN111402405A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010205875.0A CN111402405A (en) 2020-03-23 2020-03-23 Attention mechanism-based multi-view image three-dimensional reconstruction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010205875.0A CN111402405A (en) 2020-03-23 2020-03-23 Attention mechanism-based multi-view image three-dimensional reconstruction method

Publications (1)

Publication Number Publication Date
CN111402405A true CN111402405A (en) 2020-07-10

Family

ID=71432775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010205875.0A Pending CN111402405A (en) 2020-03-23 2020-03-23 Attention mechanism-based multi-view image three-dimensional reconstruction method

Country Status (1)

Country Link
CN (1) CN111402405A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112785684A (en) * 2020-11-13 2021-05-11 北京航空航天大学 Three-dimensional model reconstruction method based on local information weighting mechanism
CN113095172A (en) * 2021-03-29 2021-07-09 天津大学 Point cloud three-dimensional object detection method based on deep learning
CN113129310A (en) * 2021-03-04 2021-07-16 同济大学 Medical image segmentation system based on attention routing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109285217A (en) * 2018-09-10 2019-01-29 中国科学院自动化研究所 Process type plant model method for reconstructing based on multi-view image
CN110570522A (en) * 2019-08-22 2019-12-13 天津大学 Multi-view three-dimensional reconstruction method
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN110853130A (en) * 2019-09-25 2020-02-28 咪咕视讯科技有限公司 Three-dimensional image generation method, electronic device, and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109285217A (en) * 2018-09-10 2019-01-29 中国科学院自动化研究所 Process type plant model method for reconstructing based on multi-view image
CN110570522A (en) * 2019-08-22 2019-12-13 天津大学 Multi-view three-dimensional reconstruction method
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN110853130A (en) * 2019-09-25 2020-02-28 咪咕视讯科技有限公司 Three-dimensional image generation method, electronic device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BO YANG et al.: "Robust Attentional Aggregation of Deep Feature Sets for Multi-view 3D Reconstruction", International Journal of Computer Vision *
LIU Wei et al.: "Accurate reconstruction algorithm for three-dimensional spatial points based on multiple views", Computer Engineering and Design *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112785684A (en) * 2020-11-13 2021-05-11 北京航空航天大学 Three-dimensional model reconstruction method based on local information weighting mechanism
CN112785684B (en) * 2020-11-13 2022-06-14 北京航空航天大学 Three-dimensional model reconstruction method based on local information weighting mechanism
CN113129310A (en) * 2021-03-04 2021-07-16 同济大学 Medical image segmentation system based on attention routing
CN113129310B (en) * 2021-03-04 2023-03-31 同济大学 Medical image segmentation system based on attention routing
CN113095172A (en) * 2021-03-29 2021-07-09 天津大学 Point cloud three-dimensional object detection method based on deep learning

Similar Documents

Publication Publication Date Title
CN110503598B (en) Font style migration method for generating countermeasure network based on conditional cycle consistency
CN108717568B (en) A kind of image characteristics extraction and training method based on Three dimensional convolution neural network
CN111091045B (en) Sign language identification method based on space-time attention mechanism
CN108921893B (en) Image cloud computing method and system based on online deep learning SLAM
CN111798369B (en) Face aging image synthesis method for generating confrontation network based on circulation condition
CN111145116B (en) Sea surface rainy day image sample augmentation method based on generation of countermeasure network
CN111402405A (en) Attention mechanism-based multi-view image three-dimensional reconstruction method
Ma et al. Facial expression recognition using constructive feedforward neural networks
CN109166144B (en) Image depth estimation method based on generation countermeasure network
WO2019227479A1 (en) Method and apparatus for generating face rotation image
CN110503680A (en) It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN110288697A (en) 3D face representation and method for reconstructing based on multiple dimensioned figure convolutional neural networks
EP4377898A1 (en) Neural radiance field generative modeling of object classes from single two-dimensional views
CN107316004A (en) Space Target Recognition based on deep learning
CN114359292A (en) Medical image segmentation method based on multi-scale and attention
CN113538608A (en) Controllable character image generation method based on generation countermeasure network
CN112634438A (en) Single-frame depth image three-dimensional model reconstruction method and device based on countermeasure network
CN113688765A (en) Attention mechanism-based action recognition method for adaptive graph convolution network
CN110263203B (en) Text-to-image generation method combined with Pearson reconstruction
Jiang et al. Multi-level memory compensation network for rain removal via divide-and-conquer strategy
Fang et al. One is all: Bridging the gap between neural radiance fields architectures with progressive volume distillation
CN108805844B (en) Lightweight regression network construction method based on prior filtering
CN113222808A (en) Face mask removing method based on generative confrontation network
CN115439849B (en) Instrument digital identification method and system based on dynamic multi-strategy GAN network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination