CN113537379B

CN113537379B - Three-dimensional matching method based on CGANs

Info

Publication number: CN113537379B
Application number: CN202110860315.3A
Authority: CN
Inventors: 魏东; 刘涵; 何雪; 于璟玮
Original assignee: Shenyang University of Technology
Current assignee: Shenyang University of Technology
Priority date: 2021-07-27
Filing date: 2021-07-27
Publication date: 2024-04-16
Anticipated expiration: 2041-07-27
Also published as: CN113537379A

Abstract

A stereo matching method based on CGANs, the method comprising: and (3) inputting an image: inputting two left and right camera views and a true value, taking a left image and a right image as a reference image and a target image respectively, and taking the true value as a label to correspond to the left image; feature extraction: respectively extracting features of the two input left and right camera views by using a pseudo-twin network, and fusing the extracted features by using channel dimensions; generating a disparity map: the fused features are used as conditions in the CGANs to enable the generator to be shared with the discriminator, and the generator generates a parallax map; and (5) identifying true and false: the extracted features are taken as conditions and are input into a discriminator together with a true value or a generated disparity map, and then the discriminator discriminates whether an input sample is a generated sample or a true value; training a model: the error between the generated disparity map and the true value and the result output by the discriminator are used for guiding the network model to learn.

Description

Three-dimensional matching method based on CGANs

Technical Field

The invention belongs to the technical fields of computer vision and deep learning, and particularly relates to a three-dimensional matching method based on CGANs, wherein CGANs (Conditional Generative Adversarial Networks) refers to a condition generation countermeasure network.

Background

Stereo matching plays a vital role in numerous applications of computer vision directions, such as robotic navigation, autopilot, augmented reality, gesture recognition, three-dimensional reconstruction, military reconnaissance, and maintenance detection. The purpose of computer vision is to mimic human perception of how far and near objects in a three-dimensional scene are using the vision system. And the stereo matching can obtain depth information in the three-dimensional scene according to the two-dimensional image. The stereo matching is used as one of key research directions in computer vision, and is characterized in that related pixel points are firstly matched in camera views under two different viewpoints, then the difference value of the related pixel points on horizontal displacement is calculated to obtain parallax, and finally depth information is obtained through a mathematical model. Because the parallax between two related pixel points has a proportional relation with the depth information, the task of acquiring the depth information can be reasonably converted into the task of calculating the parallax through mathematical transformation. The problems of occlusion, illumination, weak textures and the like always exist in stereo matching, and various algorithms in the past aim to solve the problems, so that the accuracy of parallax prediction is improved.

Algorithms can be classified into a conventional stereo matching algorithm and an end-to-end stereo matching algorithm. The first type of traditional stereo matching algorithm comprises four steps of matching cost calculation, cost aggregation, parallax calculation and parallax refinement. However, this partial solution has problems, namely, increasing the number of super parameters; secondly, the implementation process of stereo matching becomes complicated; thirdly, the method of the dividing step does not necessarily obtain the best result, because the optimal solution of the combined sub-problem is not equal to the global optimal solution; and fourthly, the association range of single pixel points is limited by an aggregation window when parallax is calculated. The second type is an end-to-end stereo matching algorithm. Deep learning is one of the key research fields of artificial intelligence and machine learning, and has achieved remarkable results in the computer vision direction. In order to solve the problems of the conventional stereo matching algorithm caused by the division steps, four steps can be combined by constructing an end-to-end system through deep learning. The deep learning is a method for training a multi-layer neural network, and the training process of the deep learning is that training data is firstly input into a first layer of neurons, the weights of the layers are obtained through a nonlinear activation method, then the data output by the neurons of the layers are used as input to be transmitted to the next layer to continuously obtain the weights of the corresponding layers, the values of the weights are continuously updated according to the progress of learning, and finally reasonable weights, namely the distributed characteristic representation of the learned data, are obtained. The end-to-end idea in deep learning realizes the process of inputting data from one end and directly outputting results from the other end, so that the defects brought by manual design can be improved. In the end-to-end deep neural network, the independence of each pixel point in the parallax prediction process can be reduced by the characteristics generated after the fusion of semantic information of each layer of neurons. Therefore, the end-to-end stereo matching algorithm refers to that the middle feature learning and the information fusion are both submitted to the deep learning processing by inputting two left and right camera views to output corresponding disparity maps.

Although the traditional end-to-end stereo matching algorithm solves the problems of the traditional method caused by step realization, the essence of the traditional method is that the cost quantity is formed by calculating the matching points, but the matching of the pixel points does not necessarily accord with the ideal condition, and the precision loss is caused. At the same time, most of them use 3D convolution to handle the amount of cost, which can result in very high computational costs.

Disclosure of Invention

Object of the Invention

Aiming at the problems brought by the traditional stereo matching algorithm in a divide-and-conquer way and the problem of high calculation cost caused by the 3D convolution processing cost, the invention provides a stereo matching method based on CGANs.

Technical proposal

A stereo matching method based on CGANs, the method comprising:

and (3) inputting an image: inputting two left and right camera views and a true value, wherein the left image and the right image are respectively used as a reference image and a target image, and the true value is used as a label corresponding to the left image;

feature extraction: respectively extracting features of the two input left and right camera views by using a pseudo-twin network, and fusing the extracted features in channel dimensions;

generating a disparity map: the fused features are used as conditions in the CGANs to enable the generator to be shared with the discriminator, and the generator generates a parallax map;

and (5) identifying true and false: the fused features are extracted and input into a discriminator together with a true value or a generated parallax map, and then the discriminator discriminates whether an input sample is a generated sample or a true value;

training a model: the error between the generated disparity map and the true value and the result output by the discriminator are used for guiding the network model to learn.

Further, the two left and right camera views are input, then the clipping operation with the size of 256 x 256 is performed, then whether the channel number of the two images is 3 is judged, if yes, the next operation is performed, otherwise, the error is reported.

Further, when the input is two images, a pseudo-twin network method is adopted. The pseudo-twin network used in the algorithm is composed of two convolutional neural networks with the same structure but different weights. And features extracted from both images need to be changed to one input before being input to the next module, and thus need to be superimposed in the channel dimension.

Further, the fused features are set as conditions in CGANs, which are input to the U-Net in the generator. And U-Net as the encoder/decoder network, down-sampling and up-sampling the input to generate a channel number 1 disparity map.

Further, the feature extraction is carried out by taking the result after channel dimension superposition as a condition and inputting the result and a true value or a U-Net output result into a discriminator, wherein the discriminator is used for discriminating true or false classification problems through convolutional neural network processing, namely outputting a probability value to indicate whether an input sample is the true value or a generated sample.

Further, the disparity map generated by U-Net calculates an error from the true value through the L1 conventional loss function. The L1 conventional loss function is as follows:

wherein E is _x,y It is meant that x meets the expectations of the training data distribution and y meets the true value distribution, the input in x generator G is also a condition shared with arbiter D, G (x) is the generated sample, and y is the true value.

The judgment result of the discriminator D on the true value y or the generated sample G (x) is used for calculating the loss function of the CGANs. The loss function of CGANs is as follows:

wherein E is _x Meaning that x meets the expectations of the training data distribution.

And the two loss functions are subjected to gradient update through an Adam optimization method so as to guide the training of the whole network model. During training of the network, the generator G needs to minimize the loss function and the arbiter D needs to maximize the loss function. Thus the CGANs final loss function G used in the algorithm ^* The expression is as follows:

advantages and effects

In order to obtain higher-precision parallax information, the problem caused by partial treatment of the traditional stereo matching algorithm is solved, meanwhile, the parallax is predicted by a better method instead of the traditional end-to-end algorithm requiring 3D convolution, and the calculation cost is reduced. The invention provides a three-dimensional matching method based on CGANs; the invention combines four steps in the traditional stereo matching method into one step from the end-to-end idea in deep learning, simplifies the steps of the stereo matching algorithm and solves the problems brought by partial cure. Compared with the prior similar end-to-end algorithm, the pseudo-twin network is adopted for processing two similar pictures, so that the negative influence of the part of convolutional neural network on subsequent U-Net learning is eliminated. The setting of conditions in CGANs is changed into a feature map extracted from a pseudo-twin network from left and right camera views, so that the parameter number of a training model is reduced, and the calculation cost is reduced. For the selection of the generator in the CGANs, the U-Net with better high-level semantic information learning capability and layer jump connection is used for generating a disparity map with higher precision and better effect. Meanwhile, the network structure of the U-Net is adjusted, and a relatively proper layer number setting is found between the calculated amount and the accuracy of the disparity map.

According to the method, the task of parallax prediction is completed by adopting the CGANs to generate the parallax map, so that the consumption of memory and time is reduced under the condition of improving certain accuracy, the calculation cost is reduced, and the implementation process of a stereo matching algorithm is simplified.

Drawings

Fig. 1 is a network structure diagram of a stereo matching method based on CGANs.

Fig. 2 is a logic flow diagram of a stereo matching method based on CGANs provided by the invention.

Detailed Description

The invention is further described with reference to the accompanying drawings:

examples

A stereo matching method based on CGANs, the method comprising:

and (3) inputting an image: inputting two left and right camera views and a true value, wherein the left image and the right image are respectively used as a reference image and a target image, and the true value is used as a label corresponding to the left image; as shown in fig. 1.

unlike the conventional image generation task, the following are: for the binocular stereo matching task, the input is no longer one picture but two left and right camera views. Two methods for performing input processing on two left and right camera views are generally two, namely, directly stacking two original images in a channel dimension, and referencing a twin network, namely, extracting features of input respectively by using two convolution neural networks with the same structure and shared weight, and then fusing the features. Although these two methods can be handed to a subsequent network to learn data distribution, the first method can have an adverse effect on feature learning, so that the effect of generating the disparity map by the generator G has a certain limitation. The second method using the twin network affects the subsequent network learning because of the correlation of the pixel points, and the attention of small differences between the left and right camera views when the parallax is found is ignored to a certain extent.

So for improved effect feature extraction, i.e. pseudo-twin network, can be achieved in the network structure of the generator G by two neural network branches of different weights. Therefore, the influence caused by the calculation of the correlation in the step can be eliminated, the attention of small differences of the left and right camera views in the parallax prediction process is reserved, and the difficulty of subsequent network learning can be reduced.

The left and right camera views cut into 256 x 256 sizes are respectively input into two convolutional neural networks with the same structure but different weights for feature extraction. Two images as input pass through a convolution module with the number of output channels of 64 in two layers respectively, and then pass through a convolution module with the number of output channels of 128 in three layers, and the image size is always kept to be 256×256. The convolution module used in this part of the convolutional neural network consists of a convolution layer with a size of 3*3, a step size of 1 and a padding of 1, a BN layer and a LeakyReLU activation function layer.

The BN layer is a call to batch normalization (Batch Normalization) regularization method. Because the CGANs learning process is a process of capturing the distribution rule of the training data, the distribution rule of the pictures processed each time is different in the numerical expression mode, so that the learning of the network model is not facilitated. Therefore, the value range of the input data can be unified into the interval [ -1,1] by using a batch normalization method common in deep learning. Thus, the method not only solves the problem of difficult learning of the network model, but also is beneficial to gradient updating of back propagation, nonlinear utilization of the LeakyReLU activation function, acceleration of network convergence and reduction of sensitivity of the network to adjustment of super parameters. The specific processing mode when using batch normalization is to subtract the average value calculated by each channel based on batch data (batch size) after the convolution layer and divide the average value by the standard deviation, and to directly replace the divisor with the value 255, that is, the 8-bit unsigned integer maximum value representing the maximum number of channels of RGB, when dividing the image by the standard deviation in training.

The LeakyReLU function mathematical expression is as followsIn a _i Is represented by the formula (0), ++ infinity) interval the parameters are fixed and the parameters are fixed, set to 0.2; x is x _i Representing the value entered into the function; y is _i Representing the output of the function.

Features extracted for the two images respectively are superimposed in the channel dimension. As input to generator G and condition of arbiter D.

the condition shared by the generator and the arbiter in CGANs is that higher-layer features with higher resolution are extracted from two left and right camera views by using a convolution layer in a pseudo-twin network to replace the original image pixel condition.

In order to improve the accuracy of the generated result, it is necessary to solve the problems of occlusion, light, weak texture, and the like in stereo matching. The key is to learn high-level semantic information, so that a suitable network needs to be selected as a generator in CGANs, and the network of the encoder-decoder structure has the capability to deal with these problems. The encoder processes low-level features such as contours, colors, edges, textures, shapes and the like, continuously extracts the features, reduces pictures and increases the receptive field size, and the decoder restores images, and processes high-level features which are beneficial to understanding and have complex semantics. U-Net is one of the networks of encoder and decoder architecture, and has advantages over other networks in the task of generating disparity maps. Conventional CGANs generation model network structures require all data information to flow through each layer from input to output, which undoubtedly lengthens training time. For stereo matching task, although the input two left and right camera views and the generated disparity map need to undergo complex transformation, the structures of the two views are substantially the same, so that low-level semantic information shared between them is very important. The loss of the information and redundant conversion operation are prevented in the process of feature learning, and the network structure of the feature learning module can be adjusted according to the requirement of stereo matching. The U-Net with layer jump connection in the network structure not only can realize information sharing between input and output, but also avoids resource waste caused by adopting the traditional CGANs network structure to a certain extent. In other words, the generator network works by merging the features extracted by the pseudo-twin network in the channel dimension and then giving the merged features to the U-Net to learn and generate the disparity map.

The U-Net performs 8 downsampling and 8 upsampling operations for the input processing. The convolution module used in downsampling consists of a convolution layer of size 3*3, step size 2 and padding 1, BN layer and a LeakyReLU activation function layer. The first seven layers of the convolution module used in the up-sampling process are deconvolution layers with the size of 3*3, the step length of 2 and the filling of 1, and the BN layer and the ReLU activation function layer are formed, and the mathematical expression of the ReLU activation function is as follows:

the last layer of up-sampling, the output layer, replaces the activation function with a Tanh function, the mathematical expression of which is as follows:

wherein e ^x Refers to the input value to perform exponential function operation, e ^-x The method is that the exponential function operation is carried out after the input value takes a negative value.

The incoming data will pass through 3 convolution modules with 256 output channels and 5 convolution modules with 512 output channels. In the downsampling process, the length and width of the input image are reduced by half each time the input image passes through a convolution module, and the size of 128 x 128 from the first one becomes 1*1 at the end of the downsampling process. In the up-sampling process, the data passes through a deconvolution module with the number of 4 layers of output channels being 512 and the number of 3 layers of output channels being 256, and in the processing process, the data is overlapped with the output result of the corresponding layer in the down-sampling process in the channel dimension by utilizing the jump layer connection of U-Net, and then is input to the deconvolution module. The length and width of the image are increased by half by one deconvolution module, and the image is gradually adjusted from the size 1*1 to 256×256 required for outputting the parallax image, so that the image is consistent with the size of the input image in the first step.

And (5) identifying true and false: the true value or the generated disparity map is input into a discriminator together with the condition, and then the discriminator discriminates whether the input sample is the generated sample or the true value;

for the arbiter network, the original left and right camera views are not used as conditions shared with the generator any more, but the setting of the conditions is replaced by the feature map extracted by the pseudo-twin network for the two left and right views. The condition and the generated sample or the real sample are stacked in the channel dimension and then input into the four-layer convolution modules with the output channel numbers of 64, 128, 256 and 512, and then the output layer convolution module with the output channel number of 1 is utilized to output a probability value to indicate the judging result of the discriminator. The first four layers of convolution modules used in the discriminator are consistent in structure with the convolution modules adopted in the step three when the U-Net is downsampled, and the last layer of output layer convolution module consists of convolution layers with the size of 3*3, the step length of 2 and the filling of 1, and the Sigmoid activates the function layer. The Sigmoid function is used to handle the two classification problem that the input sample is true or false. The mathematical expression of the Sigmoid function is as follows:

where σ (x) refers to the output value of the Sigmoid function.

The training is that the generator G is trained once and then the discriminator D is trained, and the pseudo-twin network for extracting the view characteristics of the two left and right cameras is trained together with the U-Net, so that the cycle is completed until the training is finished. The whole training is the process of game playing between the generator G and the arbiter D, the generator G hopes to make the parallax image generated by the U-Net deceptively pass the arbiter D, namely makes the arbiter D identify the generated sample as true, so the generator G strives to minimize the loss function. While arbiter D wants to boost its ability to discriminate that the generated sample is false, so arbiter D strives to maximize the loss function. The training process stops when the generator G and the arbiter D both get the optimal solution, theoretically reaching the nash balance.

The detailed training process is to guide the training of the whole network model through the loss function, namely, the gradient is updated by means of an optimization method, and the gradient is continuously reduced to approach the optimal solution to update the weight parameters. And the weight parameters relate to two aspects of a weight initialization and optimization method.

The weight initialization is to make the network model have a better initial position when seeking the global optimal solution in the numerical space, so that better and faster convergence is facilitated when the network model learns. The weight of the convolution layer is initialized by adopting random normal distribution with the mean value of 0 and the variance of 0.02.

The process of the network model searching for the optimal solution may be referred to as optimization. The method adopted in the optimization is an Adam method improved on the gradient descent method, and the reason for using the Adam method is that as long as the initial values of some relevant super parameters are set, the Adam method can automatically adjust the learning rate to help better and faster convergence in the network model learning.

Calculating an error between a parallax map generated by U-Net and a true value through an L1 loss function; the L1 conventional loss function is as follows:

wherein E is _x,y X meets the expectations of training data distribution and y meets the distribution of real values, the input of the x generator G is the condition shared with the arbiter D, G (x) is a generated sample, and y is the real value;

the judging result of the discriminator D on the true value y or the generated sample G (x) is used for calculating the loss function of the CGANs; the loss function of CGANs is as follows:

wherein E is _x Meaning that x meets the expectations of the training data distribution. The two loss functions are subjected to gradient updating through an optimization method so as to guide the training of the whole network model; during the training of the network, the generator G needs to minimize the loss function and the arbiter D needs to maximize the loss function; to balance the CGAN loss term and the L1 loss term, the super parameter λ is added: thus the CGANs final loss function G used in the algorithm ^* The expression is as follows:

wherein G is ^* As a loss function, λ is a super parameter added to balance the CGAN loss term and the L1 loss term.

Claims

1. A three-dimensional matching method based on CGANs is characterized in that: the method comprises the following steps:

feature extraction: respectively extracting features of the two input left and right camera views by using a pseudo-twin network, and fusing the extracted features in channel dimensions; when two images are processed, adopting a pseudo-twin network method; the pseudo-twin network used in the method is composed of two convolutional neural networks with the same structure and different weights; and features of the two images extracted need to be superimposed on the channel dimension before being input to the next module;

generating a disparity map: the fused features are used as conditions in the CGANs to enable the generator to be shared with the discriminator, and the generator generates a parallax map; the fused features are set as conditions in CGANs, and the conditions are input to U-Net in a generator; the U-Net is used as a coder decoder network, and performs downsampling and upsampling operations on input so as to generate a parallax map with 1 channel number; generating an antagonism network on the condition of the CGANs;

the U-Net processes the input, and performs 8 downsampling and 8 upsampling operations; the module used in downsampling consists of a convolution layer with a size of 3*3, a step size of 2 and a padding of 1, a BN layer and a LeakyReLU activation function layer; the first seven layers of the convolution module used in the up-sampling process are deconvolution layers with the size of 3*3, the step length of 2 and the filling of 1, a BN layer and a ReLU activation function layer, and the last layer of the up-sampling process, namely an output layer, replaces the activation function with a Tanh function;

and (5) identifying true and false: the extracted fused features are taken as conditions and are input into a discriminator together with a true value or a generated disparity map, and then the discriminator discriminates whether an input sample is a generated sample or a true value; the true value or the result output by the U-Net is overlapped with the condition in the channel dimension and then input into a discriminator, and the discriminator processes the two classification problems of true or false through a convolutional neural network, namely, outputs a probability value to indicate whether an input sample is the true value or a generated sample;

training a model: the error between the generated parallax map and the true value and the result output by the discriminator are used for guiding the network model to learn;

（5）

（6）

wherein E is _x Meaning that x meets the expectations of training data distribution; the L1 traditional loss function and the loss function of the CGANs are subjected to gradient update through an optimization method so as to guide the training of the whole network model; during the training of the network, the generator G needs to minimize the loss function and the arbiter D needs to maximize the loss function; to balance the CGAN loss term and the L1 loss term, the super parameter λ is added: thus CGANs final loss function G ^* The expression is as follows:

（7）。

2. the stereo matching method based on CGANs according to claim 1, wherein: after the two left and right camera views are input, a cropping operation with the size of 256 to 256 is performed, then whether the channel number of the two images is 3 is judged, if yes, the next operation is performed, and otherwise, the error is reported.