CN113537379A - Stereo matching method based on CGANs

Stereo matching method based on CGANs

Info

Publication number
CN113537379A
Authority
CN
China
Prior art keywords
cgans
input
discriminator
image
network
Prior art date
Legal status
Granted
Application number
CN202110860315.3A
Other languages
Chinese (zh)
Other versions
CN113537379B (en)
Inventor
Wei Dong (魏东)
Liu Han (刘涵)
He Xue (何雪)
Yu Jingwei (于璟玮)
Current Assignee
Shenyang University of Technology
Original Assignee
Shenyang University of Technology
Priority date
Filing date
Publication date
Application filed by Shenyang University of Technology
Priority to CN202110860315.3A
Publication of CN113537379A
Application granted
Publication of CN113537379B
Status: Active


Classifications

    • G06F 18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 — Neural networks; combinations of networks
    • G06N 3/08 — Neural networks; learning methods

Abstract

A stereo matching method based on CGANs comprises the following steps. Image input: the left and right camera views and the ground truth are input, with the left image as the reference image, the right image as the target image, and the ground truth as the label corresponding to the left image. Feature extraction: features are extracted from the two input camera views by a pseudo-twin network and fused along the channel dimension. Generating a disparity map: the fused features serve as the condition in the CGANs, shared between the generator and the discriminator, and the generator generates a disparity map. Identifying true and false: the fused features, as the condition, are input to the discriminator together with the ground truth or the generated disparity map, and the discriminator then identifies whether the input sample is a generated sample or the ground truth. Training a model: the error between the generated disparity map and the ground truth and the result output by the discriminator are used to guide the learning of the network model.

Description

Stereo matching method based on CGANs
Technical Field
The invention belongs to the field of computer vision and the technical field of deep learning, and particularly relates to a stereo matching method based on CGANs (conditional generative adversarial networks).
Background
Stereo matching plays a crucial role in many computer vision applications, such as robot navigation, autonomous driving, augmented reality, gesture recognition, three-dimensional reconstruction, military reconnaissance, and maintenance inspection. The purpose of computer vision is to mimic the human visual system's perception of the distance of objects in a three-dimensional scene, and stereo matching recovers depth information of a three-dimensional scene from two-dimensional images. As one of the key research directions in computer vision, stereo matching first matches corresponding pixels between two camera views taken from different viewpoints, then computes the horizontal displacement between corresponding pixels to obtain the disparity, and finally derives depth through a mathematical model. Because the disparity between two corresponding pixels is inversely proportional to depth, the task of acquiring depth information can reasonably be converted, by mathematical transformation, into the task of computing disparity. Stereo matching suffers from occlusion, illumination changes, weak texture, and similar problems, and previous algorithms have all aimed at solving these problems to improve the accuracy of disparity prediction.
These algorithms fall into two categories: traditional stereo matching algorithms and end-to-end stereo matching algorithms. The first category, the traditional stereo matching algorithm, comprises four steps: matching cost computation, cost aggregation, disparity computation, and disparity refinement. However, this divide-and-conquer approach has several problems: first, the number of hyper-parameters increases; second, the implementation of stereo matching becomes complicated; third, solving the problem step by step does not necessarily yield the best result, because the combination of optimal solutions to the sub-problems is not equal to the global optimal solution; fourth, when computing disparity, the association range of a single pixel is limited by the aggregation window. The second category is the end-to-end stereo matching algorithm. Deep learning is one of the key research fields of artificial intelligence and machine learning and has achieved remarkable results in computer vision. To solve the problems caused by the separate steps of the traditional stereo matching algorithm, an end-to-end system built with deep learning can merge the four steps. Deep learning trains a multilayer neural network: training data is first fed into the first layer of neurons, whose weights are obtained through a nonlinear activation method; the output of that layer is then passed as input to the next layer to obtain the weights of the corresponding layer; the weights are continuously updated as learning progresses until reasonable weights, i.e. a distributed feature representation of the data, are learned. The end-to-end idea in deep learning realizes a process in which data is input at one end and results are output directly at the other, overcoming the shortcomings of manual design. In an end-to-end deep neural network, the features produced by fusing semantic information in each layer of neurons reduce the independence of individual pixels in disparity prediction. An end-to-end stereo matching algorithm therefore takes the two camera views as input and outputs the corresponding disparity map, leaving the intermediate feature learning and information fusion to deep learning.
Although existing end-to-end stereo matching algorithms solve the problems caused by the step-by-step implementation of traditional methods, they essentially still build a cost volume from point-wise matching; matching individual pixels to individual pixels does not necessarily fit the ideal situation, which causes a loss of precision. Moreover, most of them use 3D convolutions to process the cost volume, which makes the computation very expensive.
Disclosure of Invention
Object of the Invention
The invention provides a stereo matching method based on CGANs, aiming at the problems caused by the divide-and-conquer strategy of traditional stereo matching algorithms and the high computation cost of processing the cost volume with 3D convolutions.
Technical solution
A stereo matching method based on CGANs comprises the following steps:
image input: inputting the left and right camera views and the ground truth, taking the left image as the reference image, the right image as the target image, and the ground truth as the label corresponding to the left image;
feature extraction: extracting features from the two input camera views with a pseudo-twin network, and fusing the extracted features along the channel dimension;
generating a disparity map: the fused features serve as the condition in the CGANs, shared between the generator and the discriminator, and the generator generates a disparity map;
identifying true and false: the fused features, as the condition, are input to the discriminator together with the ground truth or the generated disparity map, and the discriminator then identifies whether the input sample is a generated sample or the ground truth;
training a model: the error between the generated disparity map and the ground truth and the result output by the discriminator are used to guide the learning of the network model.
Further, after the two camera views are input, they are cropped to 256 × 256, and the number of channels of both images is then checked; if it is 3, the next operation proceeds, otherwise an error is reported.
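For illustration, a minimal sketch of this input step in PyTorch; the function name and the top-left crop position are assumptions, since the text specifies only the 256 × 256 crop and the 3-channel check:

```python
import torch

def prepare_inputs(left: torch.Tensor, right: torch.Tensor, size: int = 256):
    """Crop both (C, H, W) views to size x size and verify they have 3 channels."""
    for name, img in (("left", left), ("right", right)):
        if img.shape[0] != 3:
            # Channel check described in the text: report an error if not 3-channel.
            raise ValueError(f"{name} view must have 3 channels, got {img.shape[0]}")
    # Illustrative top-left crop; the text does not specify where the crop is taken.
    return left[:, :size, :size], right[:, :size, :size]
```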
Further, when the input consists of two images, a pseudo-twin network is adopted. The pseudo-twin network used in this algorithm consists of two convolutional neural networks with identical structure but different weights. The features extracted from the two images must be merged into a single input before entering the next module, and are therefore superimposed along the channel dimension.
Further, the fused features are set as the condition in the CGANs and input to the U-Net in the generator. The U-Net, an encoder-decoder network, performs downsampling and upsampling on the input and generates a disparity map with 1 channel.
Further, the channel-wise superposition produced in feature extraction serves as the condition and is input to the discriminator together with the ground truth or the U-Net output. The discriminator, through convolutional neural network processing, handles the binary real/fake problem, i.e. it outputs a probability value indicating whether the input sample is the ground truth or a generated sample.
Further, the error between the disparity map generated by the U-Net and the ground truth is computed with the traditional L1 loss function, which is as follows:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_1\right]$$
where E_{x,y} denotes the expectation over x drawn from the training-data distribution and y from the ground-truth distribution; x is the condition, input to the generator G and shared with the discriminator D; G(x) is a generated sample; and y is the ground truth.
The judgment of the discriminator D on the ground truth y or the generated sample G(x) is used to compute the loss function of the CGANs, which is as follows:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right]$$
where E_x denotes the expectation over x drawn from the training-data distribution.
The two loss functions are used together for gradient updates via the Adam optimization method to guide the training of the whole network model. When training the network, the generator G needs to minimize the loss function while the discriminator D needs to maximize it. The final objective G* of the CGANs used in this algorithm is expressed as follows:
$$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)$$

where λ is a hyper-parameter that balances the CGAN loss term and the L1 loss term.
advantages and effects
To obtain more accurate disparity information, the invention addresses the problems caused by the divide-and-conquer strategy of traditional stereo matching algorithms and, at the same time, predicts disparity with a better method than traditional end-to-end algorithms that require 3D convolutions, thereby reducing computation cost. The invention provides a stereo matching method based on CGANs. Drawing on the end-to-end idea in deep learning, the invention merges the four steps of the traditional stereo matching method into one, simplifying the stereo matching algorithm and solving the problems caused by divide and conquer. Compared with traditional end-to-end algorithms of the same kind, a pseudo-twin network is adopted to process the two similar pictures, eliminating the negative influence of the convolutional neural network on subsequent U-Net learning. The condition set in the CGANs is changed from the left and right camera views to the feature maps extracted by the pseudo-twin network, which reduces the number of parameters of the training model and hence the computation cost. As the generator in the CGANs, a U-Net is chosen for its stronger ability to learn high-level semantic information and its skip connections, producing a more accurate disparity map with better effect. The network structure of the U-Net is also adjusted to find a layer configuration that balances computation against disparity map accuracy.
The method uses CGANs to generate the disparity map and thereby completes the disparity prediction task; it reduces memory and time consumption while improving accuracy to a certain extent, lowers computation cost, and simplifies the implementation of the stereo matching algorithm.
Drawings
Fig. 1 is a network structure diagram of a stereo matching method based on CGANs provided by the present invention.
Fig. 2 is a logic flow diagram of a stereo matching method based on CGANs provided in the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
Examples
A stereo matching method based on CGANs comprises the following steps:
Image input: inputting the left and right camera views and the ground truth, taking the left image as the reference image, the right image as the target image, and the ground truth as the label corresponding to the left image; as shown in fig. 1.
Feature extraction: extracting features from the two input camera views with a pseudo-twin network, and fusing the extracted features along the channel dimension;
the difference from the conventional image generation task is that: for the binocular stereo matching task, the input is no longer one picture but two left and right camera views. Two methods for performing input processing on two left and right camera views are generally used, one is to directly stack two original images on a channel dimension, and the other is to use a twin network for reference, namely to respectively extract features of input by using two convolutional neural networks with the same structure and shared weight and then fuse the features. Although the two methods can be handed to a subsequent network to learn data distribution, the first method can adversely affect feature learning, so that the effect of generating the disparity map by the generator G has a certain limitation. The second method using a twin network affects subsequent network learning because of the correlation with pixel points, and ignores the concern about a slight difference between left and right camera views when finding parallax to some extent.
To improve the effect, feature extraction can instead be realized by two neural network branches with different weights inside the network structure of the generator G, i.e. a pseudo-twin network. This eliminates the influence of the correlation computation in this step, preserves attention to the small differences between the left and right views during disparity prediction, and reduces the difficulty of subsequent network learning.
The left and right camera views, cropped to 256 × 256, are input separately into two convolutional neural networks with identical structure but different weights for feature extraction. Each of the two input images passes first through a convolution module with 64 output channels and then through one with 128 output channels, the image size remaining 256 × 256 throughout. The convolution module used in this part of the convolutional neural network consists of a 3 × 3 convolution layer with stride 1 and padding 1, a BN layer, and a LeakyReLU activation layer.
The BN layer applies batch normalization, a regularization method. The CGANs' learning process captures the distribution of the training data, but the numerical ranges of the pictures processed at each step differ, which hinders the learning of the network model. The batch normalization method common in deep learning therefore unifies the value range of the input data to [-1, 1]. This not only eases network model learning but also benefits the back-propagated gradient updates, exploits the nonlinearity of the LeakyReLU activation function, accelerates network convergence, and reduces the network's sensitivity to hyper-parameter tuning. Concretely, batch normalization subtracts, after the convolution layer, the per-channel mean computed over the batch (batch size) and divides by the standard deviation; during training, when dividing the image by the standard deviation, the divisor may be replaced directly by 255, the maximum value of an 8-bit unsigned integer and thus the maximum of an RGB channel, to reduce computation.
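As a hedged illustration of bringing 8-bit inputs into the [-1, 1] range before the network, one common torchvision-style recipe (an assumption, not necessarily the exact preprocessing used here) is:

```python
import torchvision.transforms as T

# ToTensor divides 8-bit RGB values by 255 to get [0, 1];
# Normalize with mean 0.5 and std 0.5 then maps [0, 1] to [-1, 1].
to_model_range = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```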
The LeakyReLU function is expressed as follows, where a_i is a fixed parameter in (0, +∞), set here to 0.2; x_i is the value input into the function; and y_i is the output of the function.

$$y_i = \begin{cases} x_i, & x_i \ge 0 \\ a_i\, x_i, & x_i < 0 \end{cases}$$
The features extracted separately from the two images are superimposed along the channel dimension, serving both as the input to the generator G and as the condition for the discriminator D.
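A minimal PyTorch sketch of the pseudo-twin feature extractor described above: two branches of identical structure but unshared weights, each a 64-channel then a 128-channel convolution module (3 × 3 convolution with stride 1 and padding 1, BN, LeakyReLU(0.2)), with the branch outputs concatenated along the channel dimension. Module and variable names are illustrative:

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 conv (stride 1, padding 1) -> BN -> LeakyReLU(0.2); keeps spatial size."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class PseudoTwinExtractor(nn.Module):
    """Two structurally identical branches with separate (unshared) weights."""
    def __init__(self):
        super().__init__()
        self.left_branch = nn.Sequential(conv_block(3, 64), conv_block(64, 128))
        self.right_branch = nn.Sequential(conv_block(3, 64), conv_block(64, 128))

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        f_l = self.left_branch(left)    # (B, 128, 256, 256)
        f_r = self.right_branch(right)  # (B, 128, 256, 256)
        # Fused features: the shared condition and the generator input.
        return torch.cat([f_l, f_r], dim=1)  # (B, 256, 256, 256)
```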
Generating a disparity map: the fused features serve as the condition in the CGANs, shared between the generator and the discriminator, and the generator generates a disparity map;
the condition shared by the generator and the discriminator in the CGANs is that the convolution layer in the pseudo-twin network is used for extracting higher-layer features with higher resolution from the two left and right camera views to replace the original image pixel condition.
To improve the accuracy of the generated result, the occlusion, lighting, and weak-texture problems in stereo matching need to be solved. The key is to learn high-level semantic information, so a suitable network must be chosen as the generator in the CGANs, and networks with an encoder-decoder structure can handle these problems. The encoder processes low-level features such as contours, colors, edges, textures, and shapes, continuously extracting features, shrinking the picture, and enlarging the receptive field; the decoder restores the image and processes semantically complex high-level features that aid understanding. U-Net is one such encoder-decoder network and has advantages over others in generating disparity maps. The conventional CGANs generator network structure requires all information to flow through every layer from input to output, which undoubtedly lengthens training. For the stereo matching task, although the two input camera views and the generated disparity map require complicated conversion, their structures are roughly the same, so the low-level semantic information shared between them is very important. During feature learning this information should be protected from loss and from redundant conversion operations, and the network structure of the feature learning module can be adjusted to the needs of stereo matching. A U-Net with skip connections in its network structure not only shares information between input and output but also, to some extent, avoids the waste of resources of the conventional CGANs network structure. In other words, the generator network fuses the features extracted by the pseudo-twin network along the channel dimension and hands the fused features to the U-Net to learn and generate the disparity map.
The U-Net performs 8 downsampling and 8 upsampling operations to process the input. The convolution module used in downsampling consists of a 3 × 3 convolution layer with stride 2 and padding 1, a BN layer, and a LeakyReLU activation layer. The first seven modules used in upsampling consist of a 3 × 3 deconvolution layer with stride 2 and padding 1, a BN layer, and a ReLU activation layer. The mathematical expression of the ReLU activation function is:

$$f(x) = \max(0, x)$$
the last layer of upsampling, i.e. the output layer, will replace the activation function with a Tanh function, the mathematical expression of which is as follows:
Figure BDA0003182166440000082
wherein e isxIs that the input value is subjected to an exponential function operation, e-xThe method is characterized in that exponential function operation is carried out after an input value takes a negative value.
The input data passes through 3 convolution modules with 256 output channels and 5 convolution modules with 512 output channels. During downsampling, each convolution module halves the height and width of the input, from 128 × 128 after the first module down to 1 × 1 at the end. During upsampling, the data passes through deconvolution modules with 512 and then 256 output channels; during processing, the output of the corresponding downsampling layer is superimposed via U-Net's skip connections before being input to the next deconvolution module. Each deconvolution module doubles the height and width of the image, gradually adjusting it from 1 × 1 to the 256 × 256 required for the output disparity map, consistent with the input size in step one.
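A compact PyTorch sketch of the generator under the channel schedule described above: 8 stride-2 downsampling modules (three with 256 and five with 512 output channels), mirrored deconvolution modules joined by skip connections, and a Tanh output layer producing the 1-channel disparity map. The output_padding=1 (so a stride-2 deconvolution exactly doubles the size) and the exact skip wiring are assumptions where the text is silent:

```python
import torch
import torch.nn as nn

def down(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 conv (stride 2, padding 1) -> BN -> LeakyReLU(0.2); halves H and W."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

def up(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 deconv (stride 2, padding 1, output_padding 1) -> BN -> ReLU; doubles H and W."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class UNetGenerator(nn.Module):
    def __init__(self, in_ch: int = 256):
        super().__init__()
        enc_ch = [256, 256, 256, 512, 512, 512, 512, 512]  # 3 x 256 then 5 x 512
        self.encoder = nn.ModuleList()
        prev = in_ch
        for ch in enc_ch:
            self.encoder.append(down(prev, ch))
            prev = ch
        # Decoder mirrors the encoder; skip concatenation doubles the input channels
        # at every stage except the innermost, which has no skip partner.
        dec_ch = [512, 512, 512, 512, 256, 256, 256]
        self.decoder = nn.ModuleList()
        for i, ch in enumerate(dec_ch):
            skip = 0 if i == 0 else enc_ch[len(enc_ch) - 1 - i]
            self.decoder.append(up(prev + skip, ch))
            prev = ch
        # Output layer: the eighth upsampling step, 1 channel, Tanh activation.
        self.out = nn.Sequential(
            nn.ConvTranspose2d(prev + enc_ch[0], 1, 3, stride=2, padding=1, output_padding=1),
            nn.Tanh(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for layer in self.encoder:           # 256x256 condition -> 1x1 bottleneck
            x = layer(x)
            skips.append(x)
        skips = skips[:-1][::-1]             # pair decoder stages with encoder outputs
        for i, layer in enumerate(self.decoder):
            x = layer(x if i == 0 else torch.cat([x, skips[i - 1]], dim=1))
        return self.out(torch.cat([x, skips[-1]], dim=1))  # (B, 1, 256, 256)
```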
Identifying true and false: the ground truth or the generated disparity map is input to the discriminator together with the condition, and the discriminator then identifies whether the input sample is a generated sample or the ground truth;
for the discriminator network, the original left and right camera views are no longer used as conditions shared with the generator, but the setting of the conditions is replaced with feature maps extracted for the two left and right views by the pseudo-twin network. After stacking the condition and the generated sample or the real sample on the channel dimension, inputting the stacked condition and generated sample or real sample into the convolution modules with the four layers of output channels with the numbers of 64, 128, 256 and 512, and then outputting a probability value indicating the judgment result of the discriminator by utilizing the convolution module with the output channel with the number of 1. The first four layers of convolution modules used in the discriminator are consistent in structure with the convolution modules adopted in the U-Net down sampling in the step three, and the last layer of output layer convolution module consists of convolution layers with the size of 3 x 3, the step length of 2 and the filling of 1 and a Sigmoid activation function layer. The Sigmoid function is used to handle the binary problem that the input samples are true or false. The mathematical expression of Sigmoid function is as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
where σ (x) refers to the output value of the Sigmoid function.
Training a model: the error between the generated disparity map and the ground truth and the result output by the discriminator are used to guide the learning of the network model.
During training, the generator G is trained first and the discriminator D second; the pseudo-twin network that extracts features from the two camera views is trained together with the U-Net, and this cycle repeats until training ends. The whole training is a game between the generator G and the discriminator D. G hopes the disparity map generated by the U-Net fools D, i.e. that D judges the generated sample as real, so G tries to minimize the loss function; D tries to maximize it, because it wants to improve its ability to judge generated samples as fake. Training stops when G and D both reach their optimum, theoretically achieving Nash equilibrium.
In detail, training is guided by the loss functions: gradients are updated by an optimization method and descend continuously toward the optimal solution, updating the weight parameters. The weight parameters involve both weight initialization and the optimization method.
Weight initialization gives the network model a better starting position when searching the numerical space for the global optimal solution, helping it converge better and faster during learning. The convolution layer weights are initialized from a random normal distribution with mean 0 and variance 0.02.
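A short sketch of that initialization in PyTorch; the helper name is illustrative, and std=0.02 follows the common DCGAN convention for a normal(0, 0.02) initialization:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Draw conv/deconv weights from a normal distribution with mean 0, spread 0.02."""
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

# Usage: generator.apply(init_weights); discriminator.apply(init_weights)
```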
The process by which the network model searches for the optimal solution is called optimization. The optimization method adopted here is Adam, an improvement on gradient descent; Adam is used because, once initial values of a few related hyper-parameters are set, it automatically adjusts the learning rate to help the network model converge better and faster.
The error between the disparity map generated by the U-Net and the ground truth is computed with an L1 loss function; the traditional L1 loss function is as follows:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_1\right]$$
where E_{x,y} denotes the expectation over x drawn from the training-data distribution and y from the ground-truth distribution; x is the condition, input to the generator G and shared with the discriminator D; G(x) is a generated sample; and y is the ground truth;
the judgment result of the discriminator D on the real value y or the generated sample G (x) is used for calculating the loss function of the CGANs; the loss function for CGANs is as follows:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right]$$
where E_x denotes the expectation over x drawn from the training-data distribution. The two loss functions are used together for gradient updates through the optimization method to guide the training of the whole network model. In training the network, the generator G needs to minimize the loss function, and the discriminator D needs to maximize it. To balance the CGAN loss term and the L1 loss term, a hyper-parameter λ is added. The final objective G* of the CGANs used in this algorithm is expressed as follows:
$$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)$$
where G* denotes the final objective and λ is the hyper-parameter added to balance the CGAN loss term and the L1 loss term.
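Putting the two losses together, a hedged sketch of one training iteration in PyTorch, in the pix2pix style the formulas suggest. The module names refer to the sketches above; λ = 100 and the Adam settings lr=2e-4, betas=(0.5, 0.999) are illustrative values not given in the text:

```python
import torch
import torch.nn as nn

bce, l1 = nn.BCELoss(), nn.L1Loss()
lam = 100.0  # illustrative weight for the L1 term

def train_step(extractor, generator, discriminator, opt_g, opt_d,
               left, right, gt_disp):
    # --- Generator (trained first, together with the pseudo-twin extractor) ---
    cond = extractor(left, right)            # fused features: the shared condition
    fake = generator(cond)                   # generated disparity map
    pred_fake = discriminator(cond, fake)
    # G minimizes: fool D into judging the generated sample as real, plus the L1 term.
    loss_g = bce(pred_fake, torch.ones_like(pred_fake)) + lam * l1(fake, gt_disp)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # --- Discriminator (trained second) ---
    cond = cond.detach()
    pred_real = discriminator(cond, gt_disp)
    pred_fake = discriminator(cond, fake.detach())
    # D maximizes log D(x, y) + log(1 - D(x, G(x))), written here as two BCE terms.
    loss_d = bce(pred_real, torch.ones_like(pred_real)) + \
             bce(pred_fake, torch.zeros_like(pred_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()

# Optimizers, e.g.:
# opt_g = torch.optim.Adam(list(extractor.parameters()) + list(generator.parameters()),
#                          lr=2e-4, betas=(0.5, 0.999))
# opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
```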

Claims (6)

1. A stereo matching method based on CGANs, characterized in that the method comprises the following steps:
image input: inputting the left and right camera views and the ground truth, taking the left image as the reference image, the right image as the target image, and the ground truth as the label corresponding to the left image;
feature extraction: extracting features from the two input camera views with a pseudo-twin network, and fusing the extracted features along the channel dimension;
generating a disparity map: the fused features serve as the condition in the CGANs, shared between the generator and the discriminator, and the generator generates a disparity map;
identifying true and false: the extracted and fused features serve as the condition and are input to the discriminator together with the ground truth or the generated disparity map, and the discriminator then identifies whether the input sample is a generated sample or the ground truth;
training a model: the error between the generated disparity map and the ground truth and the result output by the discriminator are used to guide the learning of the network model.
2. The CGANs-based stereo matching method according to claim 1, wherein: after the two camera views are input, they are cropped to 256 × 256, and the number of channels of both images is then checked; if it is 3, the next operation proceeds, otherwise an error is reported.
3. The CGANs-based stereo matching method according to claim 1, wherein: when the input consists of two images, a pseudo-twin network is adopted; the pseudo-twin network used in the algorithm consists of two convolutional neural networks with identical structure but different weights; and the features extracted from the two images are superimposed along the channel dimension before being input to the next module.
4. The CGANs-based stereo matching method according to claim 1, wherein: the fused features are set as the condition in the CGANs and input to the U-Net in the generator; the U-Net, an encoder-decoder network, performs downsampling and upsampling on the input and generates a disparity map with 1 channel.
5. The CGANs-based stereo matching method according to claim 4, wherein: the ground truth or the U-Net output is superimposed with the condition along the channel dimension and input to the discriminator, which, through convolutional neural network processing, handles the binary real/fake problem, i.e. outputs a probability value indicating whether the input sample is the ground truth or a generated sample.
6. The CGANs-based stereo matching method according to claim 5, wherein: the error between the disparity map generated by the U-Net and the ground truth is computed with an L1 loss function; the traditional L1 loss function is as follows:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_1\right]$$
where E_{x,y} denotes the expectation over x drawn from the training-data distribution and y from the ground-truth distribution; x is the condition, input to the generator G and shared with the discriminator D; G(x) is a generated sample; and y is the ground truth;
the judgment result of the discriminator D on the real value y or the generated sample G (x) is used for calculating the loss function of the CGANs; the loss function for CGANs is as follows:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right]$$
where E_x denotes the expectation over x drawn from the training-data distribution; the two loss functions are used together for gradient updates through the optimization method to guide the training of the whole network model; in training the network, the generator G needs to minimize the loss function, and the discriminator D needs to maximize it; to balance the CGAN loss term and the L1 loss term, a hyper-parameter λ is added; the final objective G* of the CGANs used in the algorithm is expressed as follows:
$$G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)$$
CN202110860315.3A (filed 2021-07-27) Stereo matching method based on CGANs — Active, granted as CN113537379B

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110860315.3A 2021-07-27 2021-07-27 Stereo matching method based on CGANs

Publications (2)

Publication Number Publication Date
CN113537379A 2021-10-22
CN113537379B 2024-04-16

Family

ID=78121448

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358626A * 2017-07-17 2017-11-17 清华大学深圳研究生院 Method for computing disparity using a conditional generative adversarial network
CN110136063A * 2019-05-13 2019-08-16 南京信息工程大学 Single-image super-resolution reconstruction method based on a conditional generative adversarial network
CN110263192A * 2019-06-06 2019-09-20 西安交通大学 Method for building an abrasive-grain topography database based on a conditional generative adversarial network
CN110619347A * 2019-07-31 2019-12-27 广东工业大学 Image generation method based on machine learning
CN111028277A * 2019-12-10 2020-04-17 中国电子科技集团公司第五十四研究所 SAR and optical remote sensing image registration method based on a pseudo-twin convolutional neural network
CN111091144A * 2019-11-27 2020-05-01 云南电网有限责任公司电力科学研究院 Image feature point matching method and device based on a deep pseudo-twin network
CN111145116A * 2019-12-23 2020-05-12 哈尔滨工程大学 Sea-surface rainy-day image sample augmentation method based on a generative adversarial network
WO2020172838A1 * 2019-02-26 2020-09-03 长沙理工大学 Image classification method improving an auxiliary-classifier GAN
CN112785478A * 2021-01-15 2021-05-11 南京信息工程大学 Hidden information detection method and system based on embedded probability-graph generation
CN112861774A * 2021-03-04 2021-05-28 山东产研卫星信息技术产业研究院有限公司 Method and system for identifying ship targets in remote sensing images
Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Congli et al., "Reconnaissance Image Sharpening and Quality Evaluation Methods", Hefei University of Technology Press, p. 91 *
Wei Linlin, "Research on Text-Semantics-Based Image Generation Algorithms", China Master's Theses Full-text Database, Information Science & Technology, 15 July 2020 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant