CN113506336B - Light field depth prediction method based on convolutional neural network and attention mechanism - Google Patents

Light field depth prediction method based on convolutional neural network and attention mechanism

Info

Publication number
CN113506336B
CN113506336B (application CN202110732927.4A)
Authority
CN
China
Prior art keywords
light field
layer
module
attention
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110732927.4A
Other languages
Chinese (zh)
Other versions
CN113506336A (en)
Inventor
张倩
杜昀璋
刘敬怀
花定康
王斌
朱苏磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Normal University
Original Assignee
Shanghai Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Normal University filed Critical Shanghai Normal University
Priority to CN202110732927.4A priority Critical patent/CN113506336B/en
Publication of CN113506336A publication Critical patent/CN113506336A/en
Application granted granted Critical
Publication of CN113506336B publication Critical patent/CN113506336B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/557Depth or shape recovery from multiple images from light fields, e.g. from plenoptic cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a light field depth prediction method based on a convolutional neural network and an attention mechanism, which comprises the following steps: acquiring a light field image and preprocessing the light field image to generate a light field image set; constructing a light field depth prediction model, wherein the model comprises an EPI learning module, an attention module and a feature fusion module; respectively inputting the light field image set into an EPI learning module and an attention module, and respectively acquiring EPI information of the light field image and each image weight; and respectively inputting the EPI information of the light field images and the weights of the images into a feature fusion module to obtain a light field depth prediction result. Compared with the prior art, the method has the advantages of high prediction precision, good practicability and the like.

Description

Light field depth prediction method based on convolutional neural network and attention mechanism
Technical Field
The invention relates to the technical field of light field depth estimation, in particular to a light field depth prediction method based on a convolutional neural network and an attention mechanism.
Background
The light field depth information reflects precise spatial information of the corresponding target. Scene depth acquisition is a technical key for determining whether a light field image can be widely applied, and is also one of research hotspots in the fields of computer vision and the like. The method plays an important role in the fields of three-dimensional reconstruction, target identification, automatic driving of automobiles and the like.
Currently, light field depth estimation algorithms are largely divided into non-learning-based methods and learning-based methods. The non-learning methods mainly comprise focus/defocus fusion methods and stereo-matching-based methods. The focus/defocus fusion method obtains the corresponding depth by measuring the blur of pixels across different focal stacks; the depth map obtained in this way retains more detail, but defocus errors are introduced, which reduces the accuracy of the depth map.
In recent years, deep learning has achieved great success in the field of light field depth estimation. For example, Chinese patent CN112785637A discloses a light field depth estimation method based on a dynamic fusion network, comprising the steps of: determining a light field data set and deriving a training set and a test set from it; expanding the light field dataset; building a dynamic fusion network model consisting of a dual-stream network and a multi-modal dynamic fusion module, where the dual-stream network consists of an RGB stream and a focal stack stream; taking the global RGB features and focus features output by the dual-stream network as inputs of the multi-modal dynamic fusion module and outputting the final depth map; training the constructed dynamic fusion network model on the training set; and testing the trained model on the test set and verifying it on a mobile phone dataset. The light field depth estimation method of that patent achieves better accuracy than other light field depth estimation methods, reduces noise, retains more detail, breaks the limits of the light field camera and has been successfully applied to ordinary consumer-grade camera data; however, it does not fully exploit the geometric characteristics of light field images, and its prediction accuracy is not high.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provide the light field depth prediction method based on the convolutional neural network and the attention mechanism, which has high prediction precision and good practicability.
The aim of the invention can be achieved by the following technical scheme:
a light field depth prediction method based on a convolutional neural network and an attention mechanism comprises the following steps:
step 1: acquiring a light field image and preprocessing the light field image to generate a light field image set;
step 2: constructing a light field depth prediction model, wherein the model comprises an EPI learning module, an attention module and a feature fusion module;
Step 3: inputting the light field image set obtained in the step 1 into an EPI learning module and an attention module respectively, and obtaining EPI information of the light field image and each image weight respectively;
Step 4: and respectively inputting the EPI information of the light field images and the weights of the images into a feature fusion module to obtain a light field depth prediction result.
Preferably, the preprocessing of the light field image in the step 1 specifically includes: and performing data enhancement operation on the light field image.
Preferably, the EPI learning module specifically includes:
Parallel EPI learning networks are respectively arranged at four angles of 0 degree, 45 degree, 90 degree and 135 degree, and each of the four parallel EPI learning networks comprises a two-dimensional convolution layer, an activation layer, a two-dimensional convolution layer, an activation layer and a batch normalization layer which are sequentially connected.
More preferably, the loss function of the EPI learning network returns a loss value L averaged over the total number of samples N and computed from the predicted output x and the target output y.
More preferably, the activating layer specifically comprises: sigmoid function.
Preferably, the attention module comprises a two-dimensional convolution layer, resblock, a feature extraction layer, a Cost volume layer, a pooling layer, a full connection layer and an activation layer which are connected in sequence.
More preferably, the feature extraction layer specifically includes: the spatial pyramid pooling layer.
Preferably, the step 2 further includes verifying the light field depth prediction model during training.
More preferably, the verification method is as follows:
first, the mean square error MSE of the light field depth prediction result and ground truth is calculated:
MSE = (1/N) · Σi (Dep(i) − GT(i))²
wherein N is the total number of pixels in the light field image; Dep and GT are the light field depth prediction result and ground truth respectively; i indexes each pixel in the light field image;
secondly, the peak signal-to-noise ratio PSNR is calculated:
PSNR = 10 · log10(MAX² / MSE)
wherein MAX is the maximum pixel value in the light field image;
then, the structural similarity index SSIM is calculated:
SSIM(x, y) = ((2·μx·μy + c1) · (2·σx,y + c2)) / ((μx² + μy² + c1) · (σx² + σy² + c2))
wherein x and y are the light field depth prediction result and ground truth respectively; μx and μy are the mean pixel values of x and y; σx² and σy² are the variances of the corresponding images; σx,y is the covariance of x and y; c1 and c2 are small constants that stabilise the division;
finally, it is judged whether MSE, PSNR and SSIM are all within the preset thresholds; if so, training of the model is complete, otherwise training of the model continues.
Preferably, the feature fusion module comprises 8 convolution blocks and 1 optimization block which are connected in sequence; the optimization block comprises two-dimensional convolution layers and an activation layer.
Compared with the prior art, the invention has the following beneficial effects:
1. The prediction precision is high: the light field depth prediction method fully considers the geometric characteristics of the light field image, fully utilizes the angular characteristics and symmetry of the light field image, improves the accuracy of depth estimation, and can provide more accurate results under the same working time length and working conditions.
2. The practicability is good: the light field depth prediction method provided by the invention does not depend on precise equipment such as radars, antennas and the like, can conveniently acquire the required depth information, and has strong practicability.
Drawings
FIG. 1 is a flow chart of a light field depth prediction method according to the present invention;
FIG. 2 is a schematic diagram of a light field depth prediction model according to the present invention;
FIG. 3 is a schematic diagram of the three modes of the attention module according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
A light field depth prediction method based on a convolutional neural network and an attention mechanism comprises the following steps:
step 1: acquiring a light field image and preprocessing the light field image to generate a light field image set;
Acquisition of a light field image: with the gradual maturation of light field imaging technology, consumer-level light field cameras have come into large-scale use. A light field camera captures both the position and the direction information of the light rays in a scene, and the depth information of the scene can then be obtained by analysing this information with a passive depth estimation method. A single exposure of the light field camera yields four-dimensional light field information, i.e. images of the scene from multiple viewing angles. These images form a 9×9 array of 81 images in total, and the relative position of each picture in the array is fixed. The difference between the relative positions of the pictures (i.e. the baseline) and the positional difference of the same spatial point between the pictures (i.e. the disparity) are computed, and the distance between the corresponding point in space and the camera's centre lens is then obtained from the relation between the baseline and the disparity.
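As an illustration only, assuming the standard triangulation relation Z = f·B/d (focal length f in pixels, baseline B, disparity d) and purely illustrative names, a minimal Python sketch of this step might be:

    import numpy as np

    def depth_from_disparity(disparity, focal_length_px, baseline):
        # Z = f * B / d : distance of the scene point from the camera centre lens
        # disparity       : per-pixel disparity (pixels) between adjacent views
        # focal_length_px : focal length expressed in pixels
        # baseline        : spacing between adjacent sub-aperture viewpoints
        eps = 1e-8                                   # guard against zero disparity
        return focal_length_px * baseline / (np.abs(disparity) + eps)

    # toy check: disparity 0.5 px, f = 500 px, baseline 0.01 m  ->  depth 10 m
    depth = depth_from_disparity(np.array([[0.5]]), 500.0, 0.01)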
Because acquiring light field images requires specialised equipment such as a fixed camera array, a camera gantry or a light field camera, the amount of image data available for a given scene is sometimes insufficient; in practice this problem is addressed by preprocessing the data with data enhancement.
Under the condition of keeping the geometric relation among the sub-aperture images in the light field unchanged, transforming limited data to enlarge the available data scale, wherein the data enhancement operation in the embodiment comprises the following steps:
1. Light field image with center viewpoint transferred
The acquired light field data has 9×9 views, and the spatial resolution of each view is 512×512; by translating a window of size 7×7 over the 9×9 view array, more than nine times the data available for training can be obtained;
2. Angle of change
New training data can be obtained directly by rotation; alternatively, the epipolar-plane characteristics of the viewpoints can be preserved by first rotating the sub-aperture images and then rearranging and re-connecting the viewpoint images.
3. Scaling and flipping
It should be noted that, while the image is enlarged or reduced, the disparity value is also transformed accordingly.
The above three methods can operate in various dimensions of the image, such as center view, image size, image RGB values, image random color transforms, image gray values, gamma values, etc.
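As an illustration only, a sketch of these augmentation operations on a light field array of shape (9, 9, H, W, 3) with a centre-view disparity map could look as follows; the function name, the array layout and the choice of operations performed per call are assumptions, not taken from the patent:

    import numpy as np

    def augment_light_field(lf, disp, scale=1.0):
        # lf  : light field views, shape (9, 9, H, W, 3)
        # disp: centre-view disparity map, shape (H, W)
        samples = []
        # 1. shift the centre viewpoint: every 7x7 window of the 9x9 view array
        for u0 in range(3):
            for v0 in range(3):
                samples.append((lf[u0:u0 + 7, v0:v0 + 7], disp))
        # 2. change the angle: rotate each sub-aperture image by 90 degrees and
        #    rearrange the viewpoint grid so the light field geometry stays consistent
        rot = np.rot90(lf, k=1, axes=(2, 3))
        rot = np.rot90(rot, k=1, axes=(0, 1))
        samples.append((rot, np.rot90(disp)))
        # 3. scaling: when the images are resized the disparity values are scaled by
        #    the same factor (the spatial resize itself is omitted here for brevity)
        if scale != 1.0:
            samples.append((lf, disp * scale))
        return samples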
Step 2: constructing a light field depth prediction model shown in fig. 2, wherein the model comprises an EPI learning module, an attention module and a feature fusion module;
The construction method of the EPI learning module comprises the following steps:
The four-dimensional light field image may be represented as L (x, y, u, v), where (x, y) is the spatial resolution and (u, v) is the angular resolution, and the relationship of the light field image center to other viewpoints may be represented as:
L(x,y,0,0)=L(x+d(x,y)*u,y+d(x,y)*v,u,v)
Where d (x, y) is the parallax of the center viewpoint pixel (x, y) and the pixel corresponding to the neighboring viewpoint.
For an angular direction θ (tan θ=v/u), the following relationship is reestablished:
L(x,y,0,0)=L(x+d(x,y)*u, y+d(x,y)*u*tanθ, u, u*tanθ)
Wherein, since the light field viewpoints form a regular 9×9 array, a corresponding viewpoint exists only when u*tanθ is an integer. Thus, the four viewpoint directions at angles 0°, 45°, 90° and 135° are selected, and the angular resolution of the light field image can be assumed to be (2n+1)×(2n+1).
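As an illustration, a minimal Python sketch that gathers the view stacks along these four directions from a light field array of shape (2n+1, 2n+1, H, W, C) could look as follows; the function name and array layout are assumptions:

    import numpy as np

    def directional_view_stacks(lf):
        # lf: light field views, shape (2n+1, 2n+1, H, W, C), centre view at (n, n)
        # returns the view stacks along 0, 45, 90 and 135 degrees - the only
        # directions for which u*tan(theta) stays an integer on the view grid
        n_views = lf.shape[0]
        idx = np.arange(n_views)
        stack_0 = lf[n_views // 2, :]          # horizontal row of views   (0 degrees)
        stack_90 = lf[:, n_views // 2]         # vertical column of views  (90 degrees)
        stack_45 = lf[idx, idx]                # one diagonal of views     (45 degrees)
        stack_135 = lf[idx, idx[::-1]]         # the other diagonal        (135 degrees)
        return stack_0, stack_45, stack_90, stack_135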
Therefore, parallel EPI learning networks are arranged for the four angles 0°, 45°, 90° and 135°, each performing feature extraction on the light field image data along its own direction. The four parallel EPI learning networks each include a two-dimensional convolution layer 2D Conv, an activation layer Relu, a second two-dimensional convolution layer 2D Conv, a second activation layer Relu, and a batch normalization layer BN, connected in sequence.
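A minimal PyTorch sketch of one such branch, assuming the views of one direction are stacked along the channel axis and using the 2×2 kernel with stride 1 mentioned below (channel widths are assumptions), might be:

    import torch.nn as nn

    class EPIStream(nn.Module):
        # One of the four parallel EPI learning branches (0, 45, 90 or 135 degrees):
        # 2D Conv -> ReLU -> 2D Conv -> ReLU -> batch normalization.
        def __init__(self, in_ch=9, mid_ch=64):   # in_ch: e.g. 9 stacked grayscale views (assumption)
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, kernel_size=2, stride=1),   # small kernel for small baselines
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, kernel_size=2, stride=1),
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(mid_ch),
            )

        def forward(self, x):
            return self.body(x)

    # four parallel streams, one per angular direction
    streams = nn.ModuleList(EPIStream() for _ in range(4))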
In the two-dimensional convolution layer 2D Conv, A and B are two-dimensional matrices, and the convolution result is:
C(j,k) = Σp Σq A(p,q) · B(j−p+1, k−q+1)
The activation layer Relu applies an activation function to the output z of the upper-layer neurons; the sigmoid function is specifically
φ(z) = 1 / (1 + e^(−z))
The activation function introduces nonlinearity, and φ(z) is passed on as the input of the next layer; ReLU, φ(z) = max(0, z), avoids the problems of gradient explosion and gradient vanishing to a certain extent.
Since the deep neural network has many layers, the learning speed drops: as the inputs to the lower layers drift larger or smaller, the upper layers are pushed into the saturation region and learning stops prematurely. To prevent this, batch normalization (BN) is applied after the last activation layer. The batch normalization layer BN is specifically:
x' = (x − μ) / σ,  y = g · x' + b
wherein μ is the shift (conversion) parameter and σ is the scaling parameter; these two parameters shift and scale the data so that it follows a standard distribution with mean 0 and variance 1; b is the re-shift parameter and g is the re-scaling parameter, which ensure that the expressive power of the model is not reduced by normalization.
The loss function of the EPI learning network returns a loss value L averaged over the total number of samples N and computed from the predicted output x and the target output y.
To address the very small baseline of the light field, small disparity values are measured using a convolution kernel of size 2×2 with a stride of 1. The convolution depth in the network is set to 7, and the learning rate is 1e-5.
The construction method of the attention module comprises the following steps:
A large number of pictures from different angular perspectives are acquired in the light field data. As described in the first step, depth information in three-dimensional space can be obtained by computing the disparity information and EPI information of corresponding points in these pictures. However, the pictures contain a large amount of redundant information, so an attention module is provided: it computes and assigns a weight to each picture in the light field, highlighting the importance and contribution of the pictures that are most valuable for estimating the light field depth.
The attention module includes, connected in sequence, a two-dimensional convolution layer 2D Conv, a Resblock, a feature extraction layer FE block, a Cost volume layer, a pooling layer Pooling, a fully connected layer Connected and an activation layer Relu, specifically:
Firstly, the light field image is preprocessed by the two-dimensional convolution layer 2D Conv and the Resblock, and feature extraction is then carried out in the feature extraction layer FE block to cope with texture-less areas and non-Lambertian surfaces. The feature extraction layer FE block extracts features according to the connections of neighbouring regions and concatenates all feature maps to obtain the output feature map. Next, in the Cost volume layer, the relative positions of the feature views are adjusted, and the feature maps are connected into cost volumes with five dimensions (batch size × disparity × height × width × feature size). Finally, the input cost volumes are assembled to generate the attention map, followed by the fully connected layer and the activation layer. Taking the HCI dataset as an example, there are 9×9 sub-aperture views in each scene, so a 9×9 attention map is finally obtained. This part of the operation is divided into three steps:
first, extracting image features using a feature extraction layer
The feature extraction layer uses an SPP (spatial pyramid pooling) module, which exploits the information of the regions neighbouring the corresponding points to estimate the disparity value.
The SPP module is specifically: in the CNN, the last pooling layer is removed and replaced with an SPP layer that performs max pooling. The SPP-net can be trained with standard back-propagation.
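A hedged sketch of such an SPP block in PyTorch, max-pooling the feature map at a few scales and concatenating the upsampled results with the input (the pooling scales are assumptions), could be:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SPPBlock(nn.Module):
        # Spatial pyramid pooling sketch: max-pool at several scales, upsample
        # back to the input size and concatenate along the channel axis, so that
        # neighbouring-region context is attached to every pixel.
        def __init__(self, scales=(1, 2, 4, 8)):
            super().__init__()
            self.scales = scales

        def forward(self, x):
            h, w = x.shape[-2:]
            pyramid = [x]
            for s in self.scales:
                pooled = F.adaptive_max_pool2d(x, output_size=(s, s))
                pyramid.append(F.interpolate(pooled, size=(h, w),
                                             mode='bilinear', align_corners=False))
            return torch.cat(pyramid, dim=1)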
Second, calculate the Cost volume
The feature map of each sub-aperture view is passed through the SPP module to obtain the feature map of each view. To make better use of these feature maps, a Cost volume is computed. Based on the feature maps provided by the SPP module, the input images are shifted along the u or v direction by different disparity levels, so that the second half of the network can directly access pixel information at different spatial locations using relatively small receptive fields. Nine disparity levels are set, ranging from -4 to 4. After shifting, the feature maps are concatenated into a 5D Cost volume of size batch size × disparity × height × width × feature size.
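As an illustrative sketch (using torch.roll as a simple stand-in for the sub-pixel warping a full implementation would need), the cost volume for one sub-aperture view could be assembled as follows; all names are assumptions:

    import torch

    def build_cost_volume(feat, u, v, disparities=range(-4, 5)):
        # feat  : SPP feature map of one view, shape (batch, feature, H, W)
        # (u, v): angular offset of this view from the centre view
        # returns a tensor of shape batch x disparity x H x W x feature
        shifted = []
        for d in disparities:                               # the nine disparity levels -4 ... 4
            shifted.append(torch.roll(feat, shifts=(d * u, d * v), dims=(-2, -1)))
        volume = torch.stack(shifted, dim=1)                # (batch, disparity, feature, H, W)
        return volume.permute(0, 1, 3, 4, 2)                # (batch, disparity, H, W, feature)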
Third step, obtaining the attention map
The attention map is essentially a 9×9 map that indicates the importance of each corresponding view. The first type is the free attention map, in which each view has its own importance value and all images in the light field picture are learned. The second type is the symmetric attention map: the light field image array is symmetric along the u and v axes, so only the 25 images determined by this symmetry need to be learned, and the entire map can be constructed by mirroring along the u-axis and v-axis. In the third type, the image array is symmetric along the u, v and two diagonal axes; again using symmetry, weights for the 15 symmetric images are calculated, and the complete attention map is then constructed by mirroring along the diagonal, v and u axes. By constraining the structure of the attention map, the number of learnable weights is reduced. With the Cost volume as input, the view selection module generates the attention map through a global pooling layer, then a fully connected layer, and finally an activation layer, thereby obtaining the attention distribution for all pictures of the light field image.
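A hedged PyTorch sketch of this view-selection head (global pooling, a fully connected layer and a sigmoid, with the number of learned weights depending on the symmetry mode; the hidden sizes and the omission of the mirroring step are assumptions) might be:

    import torch
    import torch.nn as nn

    class ViewAttention(nn.Module):
        # Sketch of the view-selection head: global pooling -> fully connected
        # layer -> sigmoid, yielding one importance weight per learned view.
        def __init__(self, feat_dim, mode='free'):
            super().__init__()
            # free mode learns 81 weights; u/v symmetry needs 25; u/v/diagonal symmetry needs 15
            n_learned = {'free': 81, 'uv': 25, 'uv_diag': 15}[mode]
            self.fc = nn.Linear(feat_dim, n_learned)

        def forward(self, cost_volume):
            # cost_volume: (batch, disparity, H, W, feature)
            pooled = cost_volume.mean(dim=(1, 2, 3))        # global pooling -> (batch, feature)
            weights = torch.sigmoid(self.fc(pooled))        # importance of each learned view
            # mirroring the learned weights along the u, v (and diagonal) axes to
            # rebuild the full 9x9 attention map is omitted here for brevity
            return weights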
The attention module includes three modes, as shown in FIG. 3: in the first mode, the module evaluates an attention weight for each image; in the second mode, the map is mirrored about the 0° and 90° direction axes, so only those weights are learned; in the last mode, mirroring about the 45° and 135° diagonal directions is added. The three modes together yield the attention map. The attention map is combined with the convolutional layers of the neural network in the form of weights, which then reweight the sub-aperture views.
The construction method of the feature fusion module comprises the following steps:
The feature fusion module comprises 8 convolution blocks and 1 optimization block which are connected in sequence, wherein the optimization block comprises two-dimensional convolution layers and an activation layer.
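A minimal sketch of such a fusion head in PyTorch (channel widths, the internal make-up of a convolution block and the position of the activation inside the optimization block are assumptions) could be:

    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        # generic convolution block; its internal composition is an assumption
        return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                             nn.ReLU(inplace=True))

    class FeatureFusion(nn.Module):
        # eight convolution blocks followed by one optimization block
        # (two 2D convolution layers with an activation), ending in a 1-channel depth map
        def __init__(self, in_ch=128, width=64):
            super().__init__()
            blocks = [conv_block(in_ch, width)] + [conv_block(width, width) for _ in range(7)]
            self.blocks = nn.Sequential(*blocks)
            self.optimize = nn.Sequential(nn.Conv2d(width, width, 3, padding=1),
                                          nn.ReLU(inplace=True),
                                          nn.Conv2d(width, 1, 3, padding=1))

        def forward(self, fused):
            return self.optimize(self.blocks(fused))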
Step 2 further includes verifying the light field depth prediction model during training, specifically:
first, the mean square error MSE of the light field depth prediction result and ground truth is calculated:
MSE = (1/N) · Σi (Dep(i) − GT(i))²
wherein N is the total number of pixels in the light field image; Dep and GT are the light field depth prediction result and ground truth respectively; i indexes each pixel in the light field image;
secondly, the peak signal-to-noise ratio PSNR is calculated:
PSNR = 10 · log10(MAX² / MSE)
wherein MAX is the maximum pixel value in the light field image;
then, the structural similarity index SSIM is calculated:
SSIM(x, y) = ((2·μx·μy + c1) · (2·σx,y + c2)) / ((μx² + μy² + c1) · (σx² + σy² + c2))
wherein x and y are the light field depth prediction result and ground truth respectively; μx and μy are the mean pixel values of x and y; σx² and σy² are the variances of the corresponding images; σx,y is the covariance of x and y; c1 and c2 are small constants that stabilise the division;
finally, it is judged whether MSE, PSNR and SSIM are all within the preset thresholds; if so, training of the model is complete, otherwise training of the model continues.
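A hedged sketch of this validation check in Python, computing MSE, PSNR and a simplified single-window SSIM from the definitions above (the thresholds and the stabilising constants are placeholders), could be:

    import numpy as np

    def validate(dep, gt, max_val=255.0, thresholds=(0.5, 30.0, 0.9)):
        # dep, gt   : predicted depth map and ground truth, same shape
        # thresholds: placeholder (mse_max, psnr_min, ssim_min)
        mse = np.mean((dep - gt) ** 2)
        psnr = 10.0 * np.log10(max_val ** 2 / max(mse, 1e-12))
        mu_x, mu_y = dep.mean(), gt.mean()
        var_x, var_y = dep.var(), gt.var()
        cov_xy = np.mean((dep - mu_x) * (gt - mu_y))
        c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2    # usual SSIM stabilisers
        ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
        mse_max, psnr_min, ssim_min = thresholds
        passed = (mse <= mse_max) and (psnr >= psnr_min) and (ssim >= ssim_min)
        return mse, psnr, ssim, passed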
Step 3: inputting the light field image set obtained in the step 1 into an EPI learning module and an attention module respectively, and obtaining EPI information of the light field image and each image weight respectively;
Step 4: and respectively inputting the EPI information of the light field images and the weights of the images into a feature fusion module to obtain a light field depth prediction result.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is defined by the protection scope of the claims.

Claims (8)

1. The light field depth prediction method based on the convolutional neural network and the attention mechanism is characterized by comprising the following steps of:
step 1: acquiring a light field image and preprocessing the light field image to generate a light field image set;
step 2: constructing a light field depth prediction model, wherein the model comprises an EPI learning module, an attention module and a feature fusion module;
Step 3: inputting the light field image set obtained in the step 1 into an EPI learning module and an attention module respectively, and obtaining EPI information of the light field image and each image weight respectively;
Step 4: respectively inputting the EPI information of the light field images and the weights of the images into a feature fusion module to obtain a light field depth prediction result;
The attention module comprises a two-dimensional convolution layer, resblock, a feature extraction layer, a Cost volume layer, a pooling layer, a full connection layer and an activation layer which are connected in sequence; the characteristic extraction layer specifically comprises: a spatial pyramid pooling layer; specific:
Firstly, the light field image is preprocessed by the two-dimensional convolution layer 2D Conv and the Resblock, and feature extraction is then carried out in the feature extraction layer FE block to cope with texture-less areas and non-Lambertian surfaces; the feature extraction layer FE block extracts features according to the connections of adjacent areas and concatenates all feature maps to obtain the output feature maps; next, in the Cost volume layer, the relative positions of the feature views are adjusted and the feature maps are connected into cost volumes with five dimensions, the five dimensions being batch size × disparity × height × width × feature size; finally, the input cost volumes are assembled to generate the attention map, followed by the fully connected layer and the activation layer; this part of the operation is divided into three steps:
first, extracting image features using a feature extraction layer
The feature extraction layer uses an SPP (spatial pyramid pooling) module, which estimates the disparity value by utilizing the information of the areas adjacent to the corresponding points;
the SPP module specifically comprises: removing the last pooling layer in a CNN, and changing the pooling layer into an SPP to perform the maximum pooling operation;
second, calculate the Cost volume
The feature map of each sub-aperture view is passed through the SPP module to obtain the feature map of each view; a Cost volume is computed, and according to the feature maps provided by the SPP module the input images are shifted along the u or v direction at different disparity levels; nine disparity levels are set, ranging from -4 to 4; after shifting, the feature maps are concatenated into a 5D Cost volume of size batch size × disparity × height × width × feature size;
third step, obtaining the attention map
The attention map is essentially a 9×9 map that indicates the importance of the corresponding view; the first type is the free attention map, in which each view has its own importance value and all images in the light field picture are learned; the second type is the symmetric attention map, in which the light field image array is symmetric along the u and v axes, so only the 25 images determined by this symmetry are learned and the entire map is constructed by mirroring along the u-axis and v-axis; in the third type, the image array is symmetric along the u, v and two diagonal axes, weights for the 15 symmetric images are calculated using this symmetry, and the complete attention map is then constructed by mirroring along the diagonal, v and u axes; the number of learnable weights is reduced by constraining the structure of the attention map; with the Cost volume as input, the view selection module generates the attention map through a global pooling layer, then a fully connected layer, and finally an activation layer, thereby obtaining the attention distribution for all pictures of the light field image.
2. The method for predicting the depth of a light field based on a convolutional neural network and an attention mechanism according to claim 1, wherein the preprocessing of the light field image in step 1 is specifically: and performing data enhancement operation on the light field image.
3. The method for predicting light field depth based on convolutional neural network and attention mechanism according to claim 1, wherein the EPI learning module specifically comprises:
Parallel EPI learning networks are respectively arranged at four angles of 0 degree, 45 degree, 90 degree and 135 degree, and each of the four parallel EPI learning networks comprises a two-dimensional convolution layer, an activation layer, a two-dimensional convolution layer, an activation layer and a batch normalization layer which are sequentially connected.
4. A method for predicting light field depth based on convolutional neural network and attention mechanism as recited in claim 3, wherein the loss function of said EPI learning network returns a loss value L averaged over the total number of samples N and computed from the predicted output x and the target output y.
5. The method for predicting light field depth based on convolutional neural network and attention mechanism as recited in claim 3, wherein said active layer specifically comprises: sigmoid function.
6. The method of claim 1, wherein step 2 further comprises verifying the light field depth prediction model during training.
7. The method for predicting light field depth based on convolutional neural network and attention mechanism as recited in claim 6, wherein said verification method is as follows:
first, the mean square error MSE of the light field depth prediction result and ground truth is calculated:
MSE = (1/N) · Σi (Dep(i) − GT(i))²
wherein N is the total number of pixels in the light field image; Dep and GT are the light field depth prediction result and ground truth respectively; i indexes each pixel in the light field image;
secondly, the peak signal-to-noise ratio PSNR is calculated:
PSNR = 10 · log10(MAX² / MSE)
wherein MAX is the maximum pixel value in the light field image;
then, the structural similarity index SSIM is calculated:
SSIM(x, y) = ((2·μx·μy + c1) · (2·σx,y + c2)) / ((μx² + μy² + c1) · (σx² + σy² + c2))
wherein x and y are the light field depth prediction result and ground truth respectively; μx and μy are the mean pixel values of x and y; σx² and σy² are the variances of the corresponding images; σx,y is the covariance of x and y; c1 and c2 are small constants that stabilise the division;
finally, it is judged whether MSE, PSNR and SSIM are all within the preset thresholds; if so, training of the model is complete, otherwise training of the model continues.
8. The light field depth prediction method based on the convolutional neural network and the attention mechanism according to claim 1, wherein the feature fusion module comprises 8 convolutional blocks and 1 optimization block which are connected in sequence; the optimization block comprises two-dimensional convolution layers and an activation layer.
CN202110732927.4A 2021-06-30 2021-06-30 Light field depth prediction method based on convolutional neural network and attention mechanism Active CN113506336B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110732927.4A CN113506336B (en) 2021-06-30 2021-06-30 Light field depth prediction method based on convolutional neural network and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110732927.4A CN113506336B (en) 2021-06-30 2021-06-30 Light field depth prediction method based on convolutional neural network and attention mechanism

Publications (2)

Publication Number Publication Date
CN113506336A CN113506336A (en) 2021-10-15
CN113506336B true CN113506336B (en) 2024-04-26

Family

ID=78011428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110732927.4A Active CN113506336B (en) 2021-06-30 2021-06-30 Light field depth prediction method based on convolutional neural network and attention mechanism

Country Status (1)

Country Link
CN (1) CN113506336B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113965757A (en) * 2021-10-21 2022-01-21 上海师范大学 Light field image coding method and device based on EPI (intrinsic similarity) and storage medium
CN114511605B (en) * 2022-04-18 2022-09-02 清华大学 Light field depth estimation method and device, electronic equipment and storage medium
CN114511609B (en) * 2022-04-18 2022-09-02 清华大学 Unsupervised light field parallax estimation system and method based on occlusion perception

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846473A (en) * 2018-04-10 2018-11-20 杭州电子科技大学 Light field depth estimation method based on direction and dimension self-adaption convolutional neural networks
CN109064405A (en) * 2018-08-23 2018-12-21 武汉嫦娥医学抗衰机器人股份有限公司 A kind of multi-scale image super-resolution method based on dual path network
CN111583313A (en) * 2020-03-25 2020-08-25 上海物联网有限公司 Improved binocular stereo matching method based on PSmNet
CN111696148A (en) * 2020-06-17 2020-09-22 中国科学技术大学 End-to-end stereo matching method based on convolutional neural network
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
CN112767466A (en) * 2021-01-20 2021-05-07 大连理工大学 Light field depth estimation method based on multi-mode information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019096310A1 (en) * 2017-11-20 2019-05-23 Shanghaitech University Light field image rendering method and system for creating see-through effects

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846473A (en) * 2018-04-10 2018-11-20 杭州电子科技大学 Light field depth estimation method based on direction and dimension self-adaption convolutional neural networks
CN109064405A (en) * 2018-08-23 2018-12-21 武汉嫦娥医学抗衰机器人股份有限公司 A kind of multi-scale image super-resolution method based on dual path network
CN111583313A (en) * 2020-03-25 2020-08-25 上海物联网有限公司 Improved binocular stereo matching method based on PSmNet
CN111696148A (en) * 2020-06-17 2020-09-22 中国科学技术大学 End-to-end stereo matching method based on convolutional neural network
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
CN112767466A (en) * 2021-01-20 2021-05-07 大连理工大学 Light field depth estimation method based on multi-mode information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Changha Shin et al. EPINET: A Fully-Convolutional Neural Network Using Epipolar Geometry for Depth from Light Field Images. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018 (abstract, sections 1 and 3). *
An improved stereo matching algorithm based on PSMNet; 刘建国; 冯云剑; 纪郭; 颜伏伍; 朱仕卓; Journal of South China University of Technology (Natural Science Edition), No. 01 (full text) *
Light field imaging technology and its applications in computer vision; 张驰; 刘菲; 侯广琦; 孙哲南; 谭铁牛; Journal of Image and Graphics, No. 03 (full text) *
杨博雄 (ed.). Deep Learning Theory and Practice. Beijing University of Posts and Telecommunications Press, 2020, pp. 142-143. *

Also Published As

Publication number Publication date
CN113506336A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN113506336B (en) Light field depth prediction method based on convolutional neural network and attention mechanism
RU2698402C1 (en) Method of training a convolutional neural network for image reconstruction and a system for forming an image depth map (versions)
CN110036410B (en) Apparatus and method for obtaining distance information from view
CN102997891B (en) Device and method for measuring scene depth
CN110880162B (en) Snapshot spectrum depth combined imaging method and system based on deep learning
CN111028273B (en) Light field depth estimation method based on multi-stream convolution neural network and implementation system thereof
CN111709985A (en) Underwater target ranging method based on binocular vision
CN113256699B (en) Image processing method, image processing device, computer equipment and storage medium
EP3026629A1 (en) Method and apparatus for estimating depth of focused plenoptic data
CN116129037B (en) Visual touch sensor, three-dimensional reconstruction method, system, equipment and storage medium thereof
CN115082450A (en) Pavement crack detection method and system based on deep learning network
CN116778288A (en) Multi-mode fusion target detection system and method
CN114419568A (en) Multi-view pedestrian detection method based on feature fusion
CN114092540A (en) Attention mechanism-based light field depth estimation method and computer readable medium
CN113538545B (en) Monocular depth estimation method based on electro-hydraulic adjustable-focus lens and corresponding camera and storage medium
CN111325828A (en) Three-dimensional face acquisition method and device based on three-eye camera
US20220044068A1 (en) Processing perspective view range images using neural networks
CN114332796A (en) Multi-sensor fusion voxel characteristic map generation method and system
CN116883981A (en) License plate positioning and identifying method, system, computer equipment and storage medium
Alaniz-Plata et al. ROS and Stereovision Collaborative System
CN116778091A (en) Deep learning multi-view three-dimensional reconstruction algorithm based on path aggregation
CN113160416B (en) Speckle imaging device and method for coal flow detection
CN114119704A (en) Light field image depth estimation method based on spatial pyramid pooling
Sekkati et al. Direct and indirect 3-D reconstruction from opti-acoustic stereo imaging
Rasyidy et al. A Framework for Road Boundary Detection based on Camera-LIDAR Fusion in World Coordinate System and Its Performance Evaluation Using Carla Simulator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant