CN110163246A - Unsupervised depth estimation method for monocular light field images based on a convolutional neural network

Info

Publication number: CN110163246A
Application number: CN201910276356.0A
Authority: CN
Inventors: 戴国骏; 刘高敏; 张桦; 周文晖; 陶星; 戴美想
Original assignee: Hangzhou Dianzi University
Current assignee: Zhejiang Xinmai Microelectronics Co ltd
Priority date: 2019-04-08
Filing date: 2019-04-08
Publication date: 2019-08-23
Anticipated expiration: 2039-04-08
Also published as: CN110163246B

Abstract

The invention discloses an unsupervised depth estimation method for monocular light field images based on a convolutional neural network. A publicly available large-scale light field image data set is first used as the training set, and data enhancement and data expansion are applied so that the training samples become balanced. An improved ResNet50 network model is constructed: an encoder and a decoder extract the high-level and low-level features of the model respectively, and the encoder and decoder results are fused through a dense residual structure. In addition, a super-resolution occlusion detection network is constructed so that deep learning can accurately predict occlusion between views. The objective function for the light field depth estimation task combines multiple loss functions. The predefined network model is trained on the preprocessed images, and its generalization is finally evaluated on a test set. The invention preprocesses light field images of complex scenes effectively and achieves more accurate unsupervised depth estimation of light field images.

Description

Monocular light field image unsupervised depth estimation method based on convolutional neural network

Technical Field

The invention relates to the field of light field image processing, in particular to a monocular light field image unsupervised depth estimation method based on a convolutional neural network.

Background

In recent years, single-image depth estimation based on supervised learning has developed rapidly. However, the accurate labels required for supervised learning are very difficult to acquire: they are affected by many external factors such as environment and illumination, and overcoming these influences is very costly. Based on the characteristics of image depth, the invention discloses a monocular unsupervised depth estimation method based on a deep convolutional neural network, which can estimate the depth information of an image quickly and accurately. Because the estimation is unsupervised, no depth labels need to be specially produced, which greatly reduces the up-front workload and cost of depth estimation.

Disclosure of Invention

The invention aims to solve the difficulty of obtaining labeled data sets for supervised monocular depth estimation, and provides an unsupervised depth estimation method for monocular light field images based on a convolutional neural network.

The method takes into account how occlusion between light field views (a 3 × 3 grid of views) affects the consistency of depth estimation. It enhances the data, constructs a convolutional neural network, and provides loss functions suited to light field images, thereby realizing an accurate discrete mapping from image to depth and making the depth estimation result more accurate, rapid and efficient.

To achieve this purpose, the invention provides the following technical scheme, which comprises the following main steps:

1. An unsupervised depth estimation method for monocular light field images based on a convolutional neural network, characterized by comprising the following steps:

step 1, data preprocessing:

the experimental data set is a light field image data set released by Stanford, captured from real-world objects with a Lytro Illum light field camera;

the data preprocessing comprises image brightness enhancement, horizontal/vertical flipping and random cropping;

after data preprocessing, the light field image data set is further expanded, increasing the diversity of the training and test samples;

step 2, constructing models including a convolutional neural network depth estimation model and a convolutional neural network occlusion detection model;

the depth estimation model of the convolutional neural network is specifically realized as follows:

a ResNet50 network model is taken as the encoder Encode, and the original ResNet50 is improved with adaptive normalization to suit light field images; the encoder gradually reduces the height and width of the image while increasing the number of features; the original input image is denoted $I_{256\times256\times3}$, where the subscript gives the height, width and number of channels of the image, and the intermediate results of the stage-by-stage encoding change as follows:

$I^{E}_{256\times256\times64} \to I^{E}_{128\times128\times128} \to I^{E}_{64\times64\times256} \to I^{E}_{32\times32\times512} \to I^{E}_{16\times16\times1024}$

the decoder reverses this process, restoring the height and width of the encoder's feature map step by step to the size of the original image; a dense residual structure connects the Decode and Encode processes, i.e. $I^{E}_{32\times32\times512}$ and $I^{D}_{32\times32\times512}$ are connected through a skip layer;
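The shape progression described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `encoder_shapes` and the assumption that the decoder mirrors the encoder sizes exactly are not taken from the source.

```python
def encoder_shapes(size=256, channels=64, stages=5):
    """List (height, width, channels) for each encoder stage."""
    shapes = []
    for _ in range(stages):
        shapes.append((size, size, channels))
        size //= 2      # each stage halves height and width
        channels *= 2   # and doubles the number of feature channels
    return shapes


encoder = encoder_shapes()
# Assumption: the decoder visits the same sizes in reverse order.
decoder = list(reversed(encoder))
# A skip connection joins encoder and decoder maps of equal shape,
# for example the 32 x 32 x 512 pair named in the text.
skip_shape = (32, 32, 512)
print(encoder)
print(skip_shape in encoder and skip_shape in decoder)
```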

since the disparity range of the light field camera lies in the interval [-4, 4], the predicted disparity map is extracted with a Tanh activation function; because the range of Tanh is [-1, 1], the resulting map is multiplied by 4 to obtain the real disparity map; the disparity map is produced with a 4-layer pyramid structure, so the network finally outputs fused disparity maps at 4 different scales;
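The Tanh scaling can be illustrated as follows. This is a minimal NumPy sketch; the names `MAX_DISPARITY` and `to_disparity` are hypothetical, and the raw values stand in for the output of the network's last layer.

```python
import numpy as np

MAX_DISPARITY = 4.0  # disparity range [-4, 4] stated in the text


def to_disparity(raw):
    """Map raw network outputs to disparities in [-4, 4] via a scaled tanh."""
    return MAX_DISPARITY * np.tanh(raw)


raw = np.array([[-10.0, 0.0], [0.5, 10.0]])  # stand-in for raw network outputs
disparity = to_disparity(raw)
print(disparity)
```

Whatever the raw values are, the scaled output cannot leave the stated disparity interval.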

the convolutional neural network occlusion detection model learns the occlusion relations among different views; a plurality of loss functions are used to constrain training and to address image occlusion and the consistency of depth estimation, and adaptive regularization is used in each layer; the network consists of 8 fully convolutional layers, of which layers 1 to 3 form an encoder that extracts features, layers 4 to 6 form a decoder that restores the image from the features, layer 7 performs a deconvolution to obtain a super-resolution image, and the last layer down-samples to the size of the original image;

step 3, in order to optimize the quality of the disparity map estimated by the network model, images of the other views are synthesized from the original input image by warping with the estimated disparity map using bilinear interpolation, and the synthesized images of the other views are constrained by loss functions;
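A bilinear warp of this kind might look like the sketch below. It is written under assumptions that the source does not state: images are single-channel NumPy arrays, a pixel with disparity d is sampled at an offset of d times the view offset (dx, dy), and out-of-range coordinates are clamped to the border. The function names are hypothetical.

```python
import numpy as np


def bilinear_sample(image, ys, xs):
    """Sample a 2-D image at float coordinates with bilinear interpolation."""
    h, w = image.shape
    ys = np.clip(ys, 0, h - 1)          # clamp coordinates to the image border
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = ys - y0                         # fractional parts are the blend weights
    wx = xs - x0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def warp_to_view(center, disparity, dx, dy):
    """Synthesize the view offset by (dx, dy) from the center view.

    Assumed convention: each pixel is sampled at its own position
    shifted by disparity * (dx, dy).
    """
    h, w = center.shape
    ys, xs = np.meshgrid(np.arange(h, dtype=float),
                         np.arange(w, dtype=float), indexing="ij")
    return bilinear_sample(center, ys + dy * disparity, xs + dx * disparity)


center = np.arange(16.0).reshape(4, 4)
same = warp_to_view(center, np.zeros((4, 4)), dx=1, dy=0)
shifted = warp_to_view(center, np.full((4, 4), 0.5), dx=1, dy=0)
```

With zero disparity the warp returns the center view unchanged, which is a convenient sanity check before plugging the warp into a loss.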

step 4, setting an optimizer and dynamically adjusting the learning rate so that the model has a suitable learning rate: the initial learning rate is set to 0.0001 and is reduced as the number of batches increases during training; the model is trained and its parameters solved with a momentum-based stochastic gradient descent optimization algorithm; the momentum factor μ is adjusted dynamically with the fluctuation of the loss, with an initial value of 0.5; when the loss fluctuation decreases, the network is considered relatively stable and μ is reduced accordingly, which dynamically adjusts the learning rate and refines the training process;
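The optimizer step can be sketched as below. The update rule is the standard momentum form of stochastic gradient descent; the rule for lowering μ is only one possible reading of the text, and `threshold`, `step` and `floor` are hypothetical parameters that the source does not specify.

```python
import numpy as np


def sgd_momentum_step(params, grads, velocity, lr=1e-4, mu=0.5):
    """One momentum SGD update: v <- mu * v - lr * g, then p <- p + v."""
    velocity = mu * velocity - lr * grads
    return params + velocity, velocity


def adjust_momentum(mu, recent_losses, threshold=0.01, step=0.05, floor=0.0):
    """Lower mu once the recent losses fluctuate less than `threshold`.

    threshold, step and floor are illustrative values, not from the source.
    """
    if np.std(recent_losses) < threshold:
        return max(mu - step, floor)
    return mu


params = np.array([1.0, -2.0])
grads = np.array([10.0, -10.0])
velocity = np.zeros(2)
params, velocity = sgd_momentum_step(params, grads, velocity)
params, velocity = sgd_momentum_step(params, grads, velocity)
mu_stable = adjust_momentum(0.5, [1.0, 1.0, 1.0])    # flat loss: mu is lowered
mu_noisy = adjust_momentum(0.5, [1.0, 2.0, 1.0])     # fluctuating loss: mu kept
```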

step 5, training a convolutional neural network:

firstly, 60% of the data samples in the data set of step 1 are selected as the training sample set, and a random seed is set so that the training set obtained each time is shuffled and uniformly distributed;

secondly, the loss functions and the optimizer are defined, the network parameters are adjusted, and the evaluation metrics are recorded;

finally, the network model of step 2 is trained on the data samples, and the model is saved after training so that it can be loaded quickly later;
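The 60% split with a fixed random value can be sketched as follows. The function name `split_dataset` and the default seed are hypothetical; the source only states the 60% proportion and that a random value controls the shuffling.

```python
import random


def split_dataset(samples, train_fraction=0.6, seed=0):
    """Shuffle with a fixed seed, then take the first 60% as the training set."""
    rng = random.Random(seed)        # local generator keeps the split repeatable
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train, test = split_dataset(range(10))
train_again, _ = split_dataset(range(10))
```

Reusing the same seed reproduces the same split, which keeps training runs comparable.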

step 6, testing the convolutional neural network: evaluation uses PSNR and SSIM, two metrics that quantify image quality and are used to present the quality of the synthesized images; the data predicted by the model are compared with the test data, and an accuracy metric measures the estimation performance of the model, finally giving the accuracy of the model's depth estimation on the test set.
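For reference, PSNR follows directly from the mean squared error between a synthesized view and the reference view. The sketch below uses the standard definition; the function name and the `max_value` parameter (the peak pixel value) are illustrative choices.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


reference = np.zeros((4, 4))
score = psnr(reference, np.full((4, 4), 0.1), max_value=1.0)
perfect = psnr(reference, reference, max_value=1.0)
```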

There are 3 loss functions in the convolutional neural network, specifically defined as follows:

the first loss function I is the image consistency constraint $L_{image\_loss}$, which makes the estimated image as close as possible to the original image; it also includes an image quality constraint, which requires the estimated image and the original image to be locally similar;

the second loss function II is the disparity map consistency constraint $L_{consist}$, which addresses disparity map consistency and disparity occlusion;

the third loss function III is the disparity map smoothness constraint $L_{Smooth}$, which prevents outliers in the estimated disparity map from degrading the final result.

The loss function I is specified as follows:

the loss function I measures the difference between the estimated image and the original image: the L1 distance is used for comparison and SSIM is used to assess image quality; the more similar the estimated image is to the original image, the closer the SSIM value is to 1; the loss function I is expressed as follows:

wherein the two image symbols denote the original image at view (i, j) and the estimated image at view (i, j) respectively, and α, β and ψ are hyperparameters; the first term of the formula assesses the quality and local similarity of the predicted image, and the second term measures the distance between the predicted image and the original image, i.e. the pixel-by-pixel similarity of pixel values;
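One common way to combine an SSIM term with an L1 term is sketched below. It is an assumed form, not the formula of the invention: SSIM is computed over the whole image as a single window rather than locally, the weights `alpha` and `beta` are illustrative values, and ψ is left out because its role is not stated in the text.

```python
import numpy as np


def global_ssim(a, b, data_range=1.0):
    """SSIM computed over the whole image as one window (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))


def image_loss(original, estimate, alpha=0.85, beta=0.15):
    """Assumed form: alpha * (1 - SSIM) / 2 + beta * mean absolute difference."""
    ssim_term = (1.0 - global_ssim(original, estimate)) / 2.0
    l1_term = np.abs(original - estimate).mean()
    return float(alpha * ssim_term + beta * l1_term)


original = np.linspace(0.0, 1.0, 16).reshape(4, 4)
loss_same = image_loss(original, original)
loss_inverted = image_loss(original, 1.0 - original)
```

The loss vanishes when the estimate equals the original and grows as either structure or pixel values diverge.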

the loss function II is specifically defined as follows:

wherein $D_{i+x,j+y}$ denotes the disparity map at view (i + x, j + y), and $D_{i,j}+D_{x,y}$ denotes the disparity map at view (i + x, j + y) obtained by warping the disparity map at view (i, j) by the vector (x, y);
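A consistency term of this kind can be sketched as a mean absolute difference between the two disparity maps. The exact formula is not reproduced in the text, so the L1 form and the optional `visible` mask (1 where a pixel is not occluded) are assumptions; the warped map is taken as given.

```python
import numpy as np


def consistency_loss(disp_target, disp_warped, visible=None):
    """Mean absolute difference between a view's disparity map and the
    map warped into that view, optionally restricted to visible pixels."""
    diff = np.abs(disp_target - disp_warped)
    if visible is None:
        return float(diff.mean())
    visible = visible.astype(float)
    return float((diff * visible).sum() / max(visible.sum(), 1.0))


disp_target = np.array([[1.0, 1.0], [1.0, 3.0]])
disp_warped = np.ones((2, 2))
visible = np.array([[1, 1], [1, 0]])      # last pixel treated as occluded
loss_all = consistency_loss(disp_target, disp_warped)
loss_masked = consistency_loss(disp_target, disp_warped, visible)
```

Masking out occluded pixels removes the one disagreement in this toy example, which is the effect an occlusion prediction is meant to have on the constraint.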

the loss function III is defined as follows:

wherein the first pair of symbols denotes the partial derivatives of the disparity map at view (i, j) along the horizontal and vertical axes, and the second pair denotes the partial derivatives of the original image at view (i, j) along the horizontal and vertical axes;
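A widely used edge-aware form of such a smoothness term weights each disparity gradient by a decaying function of the image gradient. The sketch below assumes that form with an exponential weight and finite differences; the formula of the invention itself is not reproduced in the text.

```python
import numpy as np


def smoothness_loss(disparity, image):
    """Penalize disparity gradients, down-weighted where the image has edges."""
    dx_d = np.abs(np.diff(disparity, axis=1))   # horizontal disparity gradient
    dy_d = np.abs(np.diff(disparity, axis=0))   # vertical disparity gradient
    dx_i = np.abs(np.diff(image, axis=1))       # horizontal image gradient
    dy_i = np.abs(np.diff(image, axis=0))       # vertical image gradient
    return float((dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean())


step = np.tile(np.array([0.0, 0.0, 1.0, 1.0]), (4, 1))   # disparity with one edge
flat_image = np.zeros((4, 4))
edge_image = 5.0 * step                                   # image edge at same place
loss_flat = smoothness_loss(step, flat_image)
loss_edge = smoothness_loss(step, edge_image)
```

A disparity jump that coincides with an image edge costs far less than the same jump in a textureless region.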

the final overall loss function is as follows:

$L_{total} = L_{image\_loss} + L_{consist} + L_{Smooth}$.

Defining the depth estimation of the light field image through multiple loss functions allows the result to be optimized from different aspects, making it more accurate.

Step 1 is data enhancement. Applying enhancement operations to the original data makes the network more robust and helps prevent overfitting. Three methods are mainly used: random flipping, color enhancement and random cropping. Assume the original image I consists of 9 pixel blocks and is represented as follows:

Random flipping comprises vertical flipping and horizontal flipping; the images obtained after flipping are I₁ and I₂ respectively, and I₁ and I₂ are represented as follows:

Random color enhancement first randomly generates an enhancement coefficient, which may apply to a single RGB color channel or to all three channels equally; assuming the coefficient applied to all three channels is α, the enhanced image I₃ is represented as follows:

Random cropping changes the pixel values of one or more regions of the original image to 0 or another value, so that the semantics of the transformed image become ambiguous and discontinuous in those regions; an example I₄ of random cropping is as follows:
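The three enhancement operations can be tried on a 3 × 3 toy image standing in for the 9 pixel blocks. This is a sketch under assumptions: color enhancement is modeled as multiplication by α, random cropping as setting one randomly placed block to a fill value, and all function names are hypothetical.

```python
import numpy as np

image = np.arange(1, 10, dtype=float).reshape(3, 3)   # 9 "pixel blocks"


def flip_vertical(img):
    """Reverse the row order (top and bottom swap)."""
    return img[::-1, :]


def flip_horizontal(img):
    """Reverse the column order (left and right swap)."""
    return img[:, ::-1]


def enhance(img, alpha):
    """Scale all channels by the same coefficient alpha (assumed form)."""
    return alpha * img


def random_erase(img, rng, size=1, fill=0.0):
    """Set one randomly placed size x size region to `fill`."""
    out = img.copy()
    h, w = out.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out[top:top + size, left:left + size] = fill
    return out


rng = np.random.default_rng(0)
i1 = flip_vertical(image)
i2 = flip_horizontal(image)
i3 = enhance(image, 2.0)
i4 = random_erase(image, rng)
```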

Although only 3 enhancement methods are used, combining them yields several times as many data samples as the original set. Training the model on the enhanced data gives it stronger robustness and generalization ability, which further improves its prediction accuracy.

The invention has the following beneficial effects:

The invention performs unsupervised depth estimation of monocular light field images based on a convolutional neural network. Because the estimation is unsupervised, no labels need to be specially produced for the data set, which makes depth estimation more convenient; at the same time, the model is constrained by multiple loss functions, so that it achieves high prediction accuracy. The improved ResNet50 model generalizes well: it uses a deeper convolutional neural network architecture and performs well, the dense residual structure makes it more robust, and instance regularization stabilizes the learning process and effectively improves the convergence rate of the model. The super-resolution occlusion detection network effectively addresses occlusion and boundary blurring, and the objective function combines multiple loss functions to guide optimization of the network model. By applying suitable training techniques and choosing suitable network parameters, optimization algorithm and learning rate settings, the network becomes more stable, the results more reliable, and the accuracy of unsupervised depth estimation of light field images greatly improved.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The invention is further illustrated by the following figures and examples.

As shown in fig. 1, the method for unsupervised depth estimation of monocular light field images based on convolutional neural network specifically includes the following steps:

Step 1: the experimental data set is a light field image data set released by Stanford, captured from real-world objects with a Lytro Illum light field camera; it contains a large number of images of plants, flowers, street views, sculptures and the like. The images are preprocessed and enhanced. The enhancement methods mainly used in the invention include image brightness enhancement, horizontal/vertical flipping and random cropping. After enhancement, the data set is further expanded, which increases the diversity of the training and test samples, makes the network model more robust, and strengthens its generalization ability so that overfitting does not occur. The model performance is also improved to a certain extent.

Applying enhancement operations to the original data makes the network more robust and helps prevent overfitting. Three methods are mainly used: random flipping, color enhancement and random cropping. Assume the original image I consists of 9 pixel blocks and is represented as follows:

Random color enhancement first randomly generates an enhancement coefficient, which may apply to a single RGB color channel or to all three channels equally; assuming the coefficient applied to all three channels is α, the enhanced image I₃ is represented as follows:

Random cropping changes the pixel values of one or more regions of the original image to 0 or another value, so that the semantics of the transformed image become ambiguous and discontinuous in those regions; an example I₄ of random cropping is as follows:

Step 2: an unsupervised depth estimation network is constructed as a convolutional neural network based on ResNet50. An encoder (Encode) and decoder (Decode) network performs foreground and background feature region identification and feature segmentation, so that the extracted feature regions are segmented more accurately and the convolutional neural network learns image features more efficiently and accurately. A pyramid model with a 4-layer structure and a residual structure fuse the low-level and high-level features learned by the network, increasing the information the network learns.

Step 3: occlusion is always the largest factor affecting depth estimation accuracy. So that occlusion information can be learned better, a super-resolution convolutional neural network model for occlusion learning is constructed to learn the occlusion relations between different views. In addition, a plurality of loss functions are used to constrain training and to address image occlusion and the consistency of depth estimation. Adaptive regularization is used in each layer, which avoids overfitting and improves the generalization ability of the network and the accuracy of depth estimation.

Step 4: to test the quality of the disparity map estimated by the network model, images of the other views are estimated from the original central image by bilinear interpolation using the estimated disparity map, and the estimated images are optimized against the original images.

Step 5: the loss functions of the network model are defined. To better guide network training, loss functions specific to light field images are defined as constraints. Three loss functions are mainly used. The first is the image consistency constraint, which makes the estimated image as close as possible to the original image and also includes an image quality constraint requiring the estimated image and the original image to be locally similar. The second is the disparity map consistency constraint, which addresses disparity map consistency. The third is the disparity map smoothness constraint, which prevents outliers in the estimated disparity map from degrading the accuracy of the final result. These loss functions need to be redefined so that they suit light field images.

The loss function of the network guides network optimization and measures the error between the predicted value and the real sample label, so its quality directly determines the quality of the network's final result. Three dedicated loss functions are therefore designed to guide network training.

Loss function 1: the image consistency constraint measures the difference between the estimated image and the original image. The L1 distance is used for comparison and SSIM is used to assess image quality; the more similar the estimated image is to the original image, the closer the SSIM value is to 1. The loss function is expressed as follows:

In the above formula, the two image symbols denote the original image at view (i, j) and the estimated image at view (i, j) respectively, and α, β and ψ are hyperparameters. The first term assesses the quality and local similarity of the predicted image, and the second term measures the distance between the predicted image and the original image, i.e. the pixel-by-pixel similarity of pixel values.

Loss function 2: the invention trains a network dedicated to occlusion detection to predict the occluded regions, and also defines a loss function that constrains the consistency between disparity maps. The loss function is defined as follows:

In the above formula, $D_{i+x,j+y}$ denotes the disparity map at view (i + x, j + y), and $D_{i,j}+D_{x,y}$ denotes the disparity map at view (i + x, j + y) obtained by warping the disparity map at view (i, j) by the vector (x, y). If the disparity estimation is correct, the two terms should be equal, i.e. consistent.

Loss function 3: to eliminate the influence of outliers in the predicted disparity map on the result, a disparity smoothness loss function is defined as a constraint. The loss function is defined as follows:

In the above formula, the first pair of symbols denotes the partial derivatives of the disparity map at view (i, j) along the horizontal and vertical axes, and the second pair denotes the partial derivatives of the original image at view (i, j) along the horizontal and vertical axes. That is, the larger the partial derivative (gradient) of the original image, the smaller the coefficient applied to the partial derivative of the disparity map, and the smoother the disparity map. The final total loss function of the invention is therefore as follows:

$L_{total} = L_{image\_loss} + L_{consist} + L_{Smooth}$

Step 6: the learning rate is adjusted dynamically so that the model has a suitable learning rate. The initial learning rate is set to 0.0001 and is reduced as the number of batches increases during training, by the following mechanism: if the loss stops decreasing for two or more training batches, the learning rate is reduced. The model is trained and its parameters solved with a momentum-based stochastic gradient descent optimization algorithm. The momentum factor μ is adjusted dynamically with the fluctuation of the loss, with an initial value of 0.5; when the loss fluctuation decreases, the network is considered basically stable and μ is reduced accordingly. This dynamically adjusts the learning rate and refines the training process. In the middle and later stages of training, when the network tends to converge and its parameters oscillate around a local minimum, it can help the network escape the local minimum and find better network parameters.
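The plateau rule can be sketched as a small scheduler. The class name `PlateauDecay` is hypothetical, and the decay `factor` must be supplied by the caller because the text does not give the value the learning rate is reduced to; the 0.5 used in the usage lines is purely illustrative.

```python
class PlateauDecay:
    """Reduce the learning rate when the loss has not improved for
    `patience` consecutive batches."""

    def __init__(self, factor, lr=1e-4, patience=2):
        if not 0.0 < factor < 1.0:
            raise ValueError("factor must lie strictly between 0 and 1")
        self.lr = lr
        self.factor = factor          # caller-supplied; not given in the source
        self.patience = patience
        self.best = float("inf")
        self.stale = 0                # batches since the last improvement

    def step(self, loss):
        """Record one batch loss and return the learning rate to use next."""
        if loss < self.best:
            self.best = loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


scheduler = PlateauDecay(factor=0.5)   # 0.5 is an illustrative value only
rates = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9]]
```

In this run the rate stays at 0.0001 while the loss improves and is cut only after two batches without improvement.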

Step 7: when the network training module trains the convolutional neural network, 60% of the data samples in the data set of step 1 are first selected as the training sample set, and a random seed is set so that the training set obtained each time is shuffled and uniformly distributed. The loss functions of step 5 and the optimizer of step 6 are defined, the network parameters are adjusted, and the evaluation metrics are recorded. The network model of step 2 is trained on the data samples, and the model is saved after training so that it can be loaded later.

Step 8: the network test module evaluates with PSNR and SSIM, two metrics that quantify image quality and are used to present the visual quality of the synthesized images. The data predicted by the model are compared with the test data, and an accuracy metric measures the estimation performance of the model, finally giving the accuracy of the model's estimated depth on the test set.

Claims

step 1, data preprocessing:

step 5, training a convolutional neural network:

2. The unsupervised depth estimation method for monocular light field images based on convolutional neural network as claimed in claim 1, wherein there are 3 loss functions in the convolutional neural network, specifically defined as follows:

the second loss function II is the disparity map consistency constraint $L_{consist}$, which addresses disparity map consistency and disparity occlusion;

3. The unsupervised depth estimation method for monocular light field images based on convolutional neural network as claimed in claim 2, wherein the loss function I is specifically as follows:

wherein the two image symbols denote the original image at view (i, j) and the estimated image at view (i, j) respectively, and α, β and ψ are hyperparameters; the first term of the formula assesses the quality and local similarity of the predicted image, and the second term measures the distance between the predicted image and the original image, i.e. the pixel-by-pixel similarity of pixel values;

the loss function II is specifically defined as follows:

wherein $D_{i+x,j+y}$ denotes the disparity map at view (i + x, j + y), and $D_{i,j}+D_{x,y}$ denotes the disparity map at view (i + x, j + y) obtained by warping the disparity map at view (i, j) by the vector (x, y);

the loss function III is defined as follows:

wherein the first pair of symbols denotes the partial derivatives of the disparity map at view (i, j) along the horizontal and vertical axes, and the second pair denotes the partial derivatives of the original image at view (i, j) along the horizontal and vertical axes;

the final overall loss function is as follows:

$L_{total} = L_{image\_loss} + L_{consist} + L_{Smooth}$.