CN113592021B - Stereo matching method based on deformable and depth separable convolution - Google Patents

Stereo matching method based on deformable and depth separable convolution

Info

Publication number
CN113592021B
CN113592021B (application CN202110916262.2A)
Authority
CN
China
Prior art keywords
convolution
deformable
image
depth
stereo matching
Prior art date
Legal status
Active
Application number
CN202110916262.2A
Other languages
Chinese (zh)
Other versions
CN113592021A (en)
Inventor
高会敏
徐志京
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN202110916262.2A
Publication of CN113592021A
Application granted
Publication of CN113592021B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a stereo matching method based on deformable and depth separable convolution, which comprises the following steps: inputting a left image and a right image into a deformable feature extraction network model to extract effective features, wherein the left image and the right image are the two images obtained by a binocular vision camera; concatenating the effective features and fusing them to obtain a cost volume; inputting the cost volume into a depth-separable 3D CNN network model, learning features of different scales, positions and forms, and aggregating effective information to obtain the image learned by the 3D CNN network; restoring the image learned by the 3D CNN network to the original image size using up-sampling; and performing disparity regression prediction on the restored image with a softmax function, and outputting a disparity map. By applying the embodiment of the invention, the network adapts to object feature deformation, enlarges the effective receptive field and reduces information loss; the depth-separable convolution integrated into the aggregation network reduces the huge parameter count brought by the 3D CNN and cuts the computation amount.

Description

Stereo matching method based on deformable and depth separable convolution
Technical Field
The invention relates to the technical field of machine vision and binocular vision, in particular to a stereo matching method based on deformable and depth separable convolution.
Background
Computer vision is a discipline that studies how to use computers to simulate the human visual system, and binocular stereo vision is an important branch of the computer vision field. The system perceives the real world by imitating human binocular perception: it only requires two cameras mounted on the same horizontal line and stereo-rectified to be put into use. The basic pipeline comprises: image acquisition, camera calibration, image rectification, feature extraction, stereo matching and three-dimensional reconstruction. The most important step in binocular stereo vision is stereo matching, which is an essential foundation of the binocular vision field. With the development of computer vision, stereo matching has been widely applied, for example in autonomous driving, 3D modeling and industrial control. Traditional stereo matching can achieve limited performance, but it cannot obtain good results in ill-posed regions such as weak texture, disparity discontinuities and uneven radiation. In recent years, stereo matching based on deep learning has advanced greatly over traditional algorithms: convolutional neural networks have a strong capability in feature extraction, treat stereo matching as a learning task, continuously learn optimized model parameters from large amounts of data, and finally output a disparity map.
Because of the various interferences present in the real world, early stereo matching methods based on convolutional neural networks were relatively simple and did not obtain good results in ill-posed regions. Deep learning has greatly improved the accuracy and time efficiency of such algorithms in recent years, yet considerable room for progress remains regarding feature deformation and the complexity brought by 3D convolution.
Disclosure of Invention
The invention aims to provide a stereo matching method based on deformable and depth separable convolution that remedies the existing defects: a novel convolution constructed from a deformable convolution and a deformable convolution kernel solves the fixed-sampling limitation of traditional convolution, adapts to object feature deformation, enlarges the effective receptive field and reduces information loss; a depth-separable convolution is integrated into the aggregation network, reducing the huge parameter count brought by the 3D CNN and cutting the computation amount.
In order to achieve the above object, the present invention provides a stereo matching method based on deformable and depth separable convolution, comprising:
inputting a left image and a right image into a deformable feature extraction network model to extract effective features, wherein the left image and the right image are the two images obtained by a binocular vision camera;
the effective features are concatenated and fused to obtain the cost volume;
inputting the cost volume into a depth-separable 3D CNN network model, learning features of different scales, positions and forms, and aggregating effective information to obtain the image learned by the 3D CNN network;
restoring the image learned by the 3D CNN network to the size of the left image using up-sampling, wherein the left image and the right image have the same size;
performing disparity regression prediction on the restored image with a softmax function, and outputting a disparity map;
performing iterative training on the overall stereo matching network, using a joint loss function in the training process:
L = L_L1 + λ · L_Log-cosh
where L_L1 is the smooth L1 loss function, L_Log-cosh is the Log-cosh loss function, and λ is the weight coefficient of L_Log-cosh; the overall stereo matching network comprises: the deformable feature extraction network model and the 3D CNN network.
L_L1 is the smooth L1 loss:
L_L1 = (1/N) · Σ_(i=1..N) smooth_L1(d_i − d̂_i), where smooth_L1(x) = 0.5 · x² if |x| < 1, and |x| − 0.5 otherwise
L_Log-cosh is the Log-cosh loss:
L_Log-cosh = (1/N) · Σ_(i=1..N) log(cosh(d̂_i − d_i))
where N is the number of labeled pixels, d is the ground-truth disparity value, and d̂ is the predicted disparity value.
Optionally, the deformable feature extraction network model uses a novel convolution constructed by combining a deformable convolution and a deformable convolution kernel.
In one implementation, the output pixel of the deformable convolution combined with the deformable convolution kernel is:
O(j) = Σ_k I(j + k + Δj) · W(k + Δk)
and the effective receptive field is:
R(i; j) = Σ_(k_1,…,k_n) Π_(m=1..n) W_m(k_m + Δk_m) · 1[i = j + Σ_(m=1..n) (k_m + Δj_m)]
where I represents the image and W the convolution kernel; i, j denote sampling positions, k a convolution kernel position, m the m-th convolution kernel, n the layer index, Δj the offset of sampling position j, and Δk the offset of convolution kernel position k.
Optionally, the effective features are concatenated and fused to obtain the cost volume:
the extracted features are connected through a concat operation;
the connected features are rearranged and combined by a convolution and fused into new features.
In one implementation, the depth-separable 3D CNN network model comprises: a 3×3 depthwise convolution and a 1×1 pointwise convolution.
Optionally, the deformable feature extraction network further comprises: a batch normalization layer and a Leaky ReLU activation layer.
In one implementation, the expression of the softmax function is:
σ(Z_i) = e^(Z_i) / Σ_j e^(Z_j)
where Z_i is the linear prediction result of the i-th class and σ(Z_i), i = 0, 1, 2, 3, …, is the probability that the data belongs to class i.
In one implementation, the convolution formula of the up-sampling is:
y = G * x_t + b_t
where G is a two-dimensional Gaussian distribution and b_t represents the convolution bias. The Gaussian kernel formula is:
G(x, y) = (1 / (2πσ²)) · exp(−((x − x_0)² + (y − y_0)²) / (2σ²))
where (x, y) is a coordinate point of the distribution, (x_0, y_0) is the center point coordinate, and σ² is the variance.
The stereo matching method based on deformable and depth separable convolution has the following beneficial effects:
(1) It addresses the matching-accuracy problems caused by ill-posed regions in stereo matching, such as weak texture, uneven radiation and disparity discontinuities, whereas conventional stereo matching does not consider the deformation of features. The deformable feature extraction network of the invention adapts to object deformation, enlarges the effective receptive field, and extracts more effective features.
(2) Although the traditional 3D CNN brings a good matching effect, it sacrifices more time and increases the computation amount and network parameters. The depth-separable 3D CNN network of the invention learns features of different scales, positions and forms, aggregates information of different dimensions, and greatly reduces the computational complexity of the algorithm.
(3) The invention introduces a joint loss function that jointly optimizes the network, allowing the algorithm to reach its full effect and improving the stereo matching accuracy.
Drawings
FIG. 1 is a schematic flow chart of a stereo matching method based on deformable and depth separable convolution according to an embodiment of the invention.
Fig. 2 is another flow chart of a stereo matching method based on deformable and depth separable convolution according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a deformable feature extraction network according to an embodiment of the invention.
Fig. 4 is a schematic diagram of a depth separable 3D CNN network model according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the invention with reference to specific examples; other advantages and effects of the invention will become readily apparent to those skilled in the art from this disclosure. The invention may also be practiced or carried out in other, different embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the invention.
The present invention provides a stereo matching method based on deformable and depth separable convolution as shown in fig. 1-2, comprising:
s110, inputting a left image and a right image into a deformable feature extraction network model to extract effective features, wherein the left image and the right image are two images obtained by a binocular vision camera respectively;
it should be noted that, in the existing stereo matching method based on the neural network, more contexts and multi-scale information are aggregated, details are added, and problems of deformation of objects are rarely considered. The invention provides a feature extraction method of a deformable convolutional neural network.
(1) Traditional convolution:
In a general convolution, the input image is I ∈ R^(D×D), the convolution kernel is W ∈ R^(k×k), and the pixel of the output image at each coordinate j ∈ R² is:
O(j) = Σ_k I(j + k) · W(k)
where I represents the image and W the convolution kernel; i, j denote sampling positions, k a convolution kernel position, m the m-th convolution kernel, and n the layer index.
For a stack of n such layers, the effective receptive field is:
R(i; j) = Σ_(k_1,…,k_n) Π_(m=1..n) W_m(k_m) · 1[i = j + Σ_(m=1..n) k_m]
(2) Deformable convolution:
The deformable convolution has a stronger expressive power for object properties such as scale, pose and deformation. Compared with the general convolution, the output pixel of the deformable convolution is:
O(j) = Σ_k I(j + k + Δj) · W(k)
and the effective receptive field is:
R(i; j) = Σ_(k_1,…,k_n) Π_(m=1..n) W_m(k_m) · 1[i = j + Σ_(m=1..n) (k_m + Δj_m)]
where Δj represents the offset of sampling position j.
During sampling, an offset is added to every sampling position, which realizes sampling at different positions and under deformation and enlarges the receptive field. Feature points of a general convolution have a receptive field of fixed size, whereas the deformable convolution adaptively learns the receptive field according to the shape and size of the object, so it conforms better to the object's characteristics and benefits feature extraction.
(3) Deformable convolution kernel:
The output pixel of the deformable convolution kernel is:
O(j) = Σ_k I(j + k) · W(k + Δk)
and the effective receptive field is:
R(i; j) = Σ_(k_1,…,k_n) Π_(m=1..n) W_m(k_m + Δk_m) · 1[i = j + Σ_(m=1..n) k_m]
where Δk represents the offset of convolution kernel position k.
The general convolution kernel cannot adapt to the deformation of the feature, and the deformable convolution kernel can adjust the kernel space while keeping the feature points unchanged, and compared with the general convolution kernel, the deformable convolution kernel shares the data position but has different sampling kernel values.
(4) Novel convolution:
The output pixel of the deformable convolution combined with the deformable convolution kernel is:
O(j) = Σ_k I(j + k + Δj) · W(k + Δk)
and the effective receptive field is:
R(i; j) = Σ_(k_1,…,k_n) Π_(m=1..n) W_m(k_m + Δk_m) · 1[i = j + Σ_(m=1..n) (k_m + Δj_m)]
The deformable feature extraction network of the invention combines the deformable convolution and the deformable convolution kernel into the novel convolution. The left and right images obtained by the binocular vision camera are input into the deformable feature extraction network model to extract effective features, adapting to the deformation of objects and extracting more effective features. The deformable feature extraction network further comprises a batch normalization layer and a Leaky ReLU activation layer, as shown in Fig. 3. Both are common layers in neural networks: batch normalization normalizes the input of each layer of the network, while the Leaky ReLU activation layer maps the inputs of neurons to outputs using the Leaky ReLU activation function, increasing the nonlinearity of the neural network.
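To make this step concrete, the following is a minimal PyTorch sketch of one deformable feature-extraction block, using torchvision's DeformConv2d for the deformable-convolution half of the novel convolution; the deformable-kernel half has no off-the-shelf torchvision operator and is omitted here. The class name DeformableBlock, the channel widths and the LeakyReLU slope are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """One deformable block: offsets -> deformable conv -> BN -> Leaky ReLU."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # a plain conv predicts the 2*k*k sampling offsets (the Δj) per output position
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, k, padding=k // 2)
        self.bn = nn.BatchNorm2d(out_ch)             # batch normalization layer
        self.act = nn.LeakyReLU(0.1, inplace=True)   # Leaky ReLU activation layer

    def forward(self, x):
        offset = self.offset_conv(x)                 # learned offsets deform the sampling grid
        return self.act(self.bn(self.deform_conv(x, offset)))

# the left and right images pass through the same (weight-shared) extractor
block = DeformableBlock(3, 32)
left_feat = block(torch.randn(1, 3, 256, 512))       # -> (1, 32, 256, 512)
```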
S120, the effective features are concatenated and fused to obtain the cost volume;
It can be understood that the extracted effective features are connected through a concat operation, and the connected features are rearranged and combined by a convolution operation and fused into new features, forming the cost volume.
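As a sketch of how the concatenation can form the cost volume, the following assumes left and right feature maps of shape (N, C, H, W) and an illustrative maximum disparity max_disp; for each candidate disparity the right features are shifted horizontally and concatenated with the left features along the channel axis.

```python
import torch

def build_cost_volume(left_feat, right_feat, max_disp):
    """Concatenation-based cost volume of shape (N, 2C, max_disp, H, W)."""
    n, c, h, w = left_feat.shape
    cost = left_feat.new_zeros(n, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, :c, d] = left_feat
            cost[:, c:, d] = right_feat
        else:
            # right features shifted by d pixels line up with the left features
            cost[:, :c, d, :, d:] = left_feat[..., d:]
            cost[:, c:, d, :, d:] = right_feat[..., :-d]
    return cost  # consumed by the depth-separable 3D CNN
```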
S130, inputting the cost volume into the depth-separable 3D CNN network, learning features of different scales, positions and forms, and aggregating effective information to obtain the image learned by the 3D CNN network;
To improve accuracy, the cost volume is learned by the depth-separable 3D CNN, which aggregates effective information; computing separately over the spatial and channel dimensions reduces the parameter count and computational complexity brought by the 3D convolution.
The depth-separable 3D CNN decomposes a general convolution into a 3×3 depthwise convolution (depthwise convolution) and a 1×1 pointwise convolution (pointwise convolution). In general, if the input feature map size is H×W×C_1, the convolution kernel size is K×K and the output feature map size is H×W×C_2, the computation amount of the general convolution is:
α_T = H × W × C_1 × C_2 × K × K
and the computation amount of the depth-separable convolution is:
α_D = H × W × C_1 × K × K + H × W × C_1 × C_2
As these formulas show (their ratio is 1/C_2 + 1/K²), the computation amount of the depth-separable convolution is greatly reduced, which lowers the computational and time complexity of the algorithm. The depth-separable 3D CNN network model is constructed as in Fig. 4.
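The following is a minimal sketch of one such depth-separable 3D block, assuming PyTorch: a depthwise nn.Conv3d (groups equal to the channel count) is followed by a 1×1×1 pointwise convolution. The batch-normalization and Leaky ReLU layers and the channel widths are illustrative choices, not the exact Fig. 4 configuration.

```python
import torch.nn as nn

class DepthSeparable3d(nn.Module):
    """Depthwise 3x3x3 conv (per channel) followed by a 1x1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1,
                                   groups=in_ch, bias=False)   # spatial/disparity mixing
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1,
                                   bias=False)                  # channel mixing
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):          # x: (N, C, D, H, W) cost volume
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```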
S140, restoring the image learned by the 3D CNN network to the size of the left image using up-sampling, wherein the left image and the right image have the same size;
It will be appreciated that the image passes through convolution operations of different sizes that change its size, so the invention uses up-sampling to restore the image learned by the 3D CNN network to the original image size.
The convolution formula is:
y = G * x_t + b_t
where G is a Gaussian convolution kernel and b_t represents the convolution bias. The Gaussian kernel formula is:
G(x, y) = (1 / (2πσ²)) · exp(−((x − x_0)² + (y − y_0)²) / (2σ²))
where (x, y) is a coordinate point of the distribution, (x_0, y_0) is the center point coordinate, and σ² is the variance.
Up-sampling principle: enlarging the image, i.e. interpolating. An image is enlarged by the following up-sampling operation (see the sketch after this list):
(1) expand the image by a factor of 2 in each direction, filling the newly added rows and columns with 0;
(2) obtain approximations of the newly added pixels by convolving the enlarged image with 4 times the Gaussian kernel described above.
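A minimal sketch of this 2× Gaussian up-sampling step, assuming PyTorch: the image is zero-stuffed and then convolved with 4 times a Gaussian kernel, the factor 4 compensating for the inserted zeros. The 5×5 kernel size and σ = 1 are illustrative choices, not values stated in the patent.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2D Gaussian kernel centered on the kernel grid."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

def upsample2x(img):
    """Enlarge (N, C, H, W) by 2x: zero-stuff, then blur with 4x the Gaussian kernel."""
    n, c, h, w = img.shape
    up = img.new_zeros(n, c, 2 * h, 2 * w)
    up[..., ::2, ::2] = img                      # newly added rows/columns stay 0
    k = 4.0 * gaussian_kernel().to(img)          # 4x compensates for the inserted zeros
    weight = k.expand(c, 1, *k.shape)            # one depthwise kernel per channel
    return F.conv2d(up, weight, padding=2, groups=c)
```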
And S150, performing disparity regression prediction on the restored image using the softmax function, and outputting the disparity map.
Note that the softmax function is:
σ(Z_i) = e^(Z_i) / Σ_j e^(Z_j)
where Z_i is the linear prediction result of the i-th class and σ(Z_i), i = 0, 1, 2, 3, …, is the probability that the data belongs to class i.
Regression prediction: the probability of each disparity d is computed from the matching cost c_d through the softmax function σ(·), and the predicted disparity d̂ is obtained by weighting every disparity d by its probability and summing:
d̂ = Σ_(d=0)^(D_max) d × σ(−c_d)
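A minimal sketch of this soft regression, assuming the up-sampled cost volume has shape (N, D, H, W) with one cost per candidate disparity: the negated costs pass through a softmax along the disparity axis, and the candidates are summed with their probabilities as weights.

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost):
    """Soft-argmin disparity regression over a (N, D, H, W) cost volume."""
    n, d, h, w = cost.shape
    prob = F.softmax(-cost, dim=1)                       # sigma(-c_d): low cost -> high probability
    disparities = torch.arange(d, dtype=cost.dtype,
                               device=cost.device).view(1, d, 1, 1)
    return (prob * disparities).sum(dim=1)               # predicted disparity map (N, H, W)
```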
the stereo matching overall network comprises: in order to better train a network and improve the matching precision of a stereoscopic matching integral network, a combination loss function is introduced in the invention:
L=L L1 +λL Log-cosh
wherein L is L1 To smooth the L1 loss function, L Log-cosh Is a Log-dash loss function, lambda is L Log-cosh For balancing the importance of the two loss functions, the value in the experiment was 0.1.
L L1 For smoothloss:
L Log-cosh is Log-hash_loss:
wherein: n is the number of marked pixels, d is the background true value,to predict the disparity value.
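A minimal sketch of the joint loss under these definitions, assuming PyTorch tensors and a boolean mask that selects the N labeled pixels, with λ = 0.1 as stated above. The direct log(cosh(·)) form shown here can overflow for very large residuals, which a production implementation would guard against.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred, gt, mask, lam=0.1):
    """L = L_L1 + lam * L_Log-cosh over the labeled pixels."""
    diff = pred[mask] - gt[mask]
    l1 = F.smooth_l1_loss(pred[mask], gt[mask])    # smooth L1 term
    log_cosh = torch.log(torch.cosh(diff)).mean()  # Log-cosh term
    return l1 + lam * log_cosh
```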
The above embodiments merely illustrate the principles and effectiveness of the invention and are not intended to limit it. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the invention.

Claims (7)

1. A stereo matching method based on deformable and depth separable convolution, comprising:
inputting a left image and a right image into a deformable feature extraction network model to extract effective features, wherein the left image and the right image are the two images obtained by a binocular vision camera;
the effective features are concatenated and fused to obtain the cost volume;
inputting the cost volume into a depth-separable 3D CNN network, learning features of different scales, positions and forms, and aggregating effective information to obtain the image learned by the 3D CNN network;
restoring the image learned by the 3D CNN network to the size of the left image using up-sampling, wherein the left image and the right image have the same size;
performing disparity regression prediction on the restored image with a softmax function, and outputting a disparity map;
performing iterative training on the overall stereo matching network, using a joint loss function in the training process:
L = L_L1 + λ · L_Log-cosh
where L_L1 is the smooth L1 loss function, L_Log-cosh is the Log-cosh loss function, and λ is the weight coefficient of L_Log-cosh; the overall stereo matching network comprises: the deformable feature extraction network model and the 3D CNN network;
L_L1 is the smooth L1 loss:
L_L1 = (1/N) · Σ_(i=1..N) smooth_L1(d_i − d̂_i), where smooth_L1(x) = 0.5 · x² if |x| < 1, and |x| − 0.5 otherwise
L_Log-cosh is the Log-cosh loss:
L_Log-cosh = (1/N) · Σ_(i=1..N) log(cosh(d̂_i − d_i))
where N is the number of labeled pixels, d is the ground-truth disparity value, and d̂ is the predicted disparity value;
the output pixel of the deformable convolution combined with the deformable convolution kernel is:
O(j) = Σ_k I(j + k + Δj) · W(k + Δk)
and the effective receptive field is:
R(i; j) = Σ_(k_1,…,k_n) Π_(m=1..n) W_m(k_m + Δk_m) · 1[i = j + Σ_(m=1..n) (k_m + Δj_m)]
where I represents the image, W the convolution kernel, i, j the sampling positions, k a convolution kernel position, m the m-th convolution kernel, n the layer index, Δj the offset of sampling position j, and Δk the offset of convolution kernel position k.
2. The stereo matching method based on deformable and depth separable convolution according to claim 1, wherein the deformable feature extraction network model uses a novel convolution constructed by combining a deformable convolution and a deformable convolution kernel.
3. A stereo matching method based on deformable and depth separable convolution according to claim 1, wherein the effective features are concatenated and fused to obtain the cost volume by:
connecting the extracted features through a concat operation;
rearranging and combining the connected features by a convolution, fusing them into new features.
4. The stereo matching method based on deformable and depth separable convolution according to claim 1, wherein the depth-separable 3D CNN network model comprises: a 3×3 depthwise convolution and a 1×1 pointwise convolution.
5. A stereo matching method based on deformable and depth separable convolution as recited in any one of claims 1 to 4, wherein the deformable feature extraction network further comprises: a batch normalization layer and a Leaky ReLU activation layer.
6. The stereo matching method based on deformable and depth separable convolution according to claim 1, wherein the expression of the softmax function is:
σ(Z_i) = e^(Z_i) / Σ_j e^(Z_j)
where Z_i is the linear prediction result of the i-th class and σ(Z_i), i = 0, 1, 2, 3, …, is the probability that the data belongs to class i.
7. A stereo matching method based on deformable and depth separable convolution according to claim 1, wherein the convolution formula of the up-sampling is:
y = G * x_t + b_t
where G is a two-dimensional Gaussian distribution and b_t represents the convolution bias. The Gaussian kernel formula is:
G(x, y) = (1 / (2πσ²)) · exp(−((x − x_0)² + (y − y_0)²) / (2σ²))
where (x, y) is a coordinate point of the distribution, (x_0, y_0) is the center point coordinate, and σ² is the variance.
CN202110916262.2A 2021-08-11 2021-08-11 Stereo matching method based on deformable and depth separable convolution Active CN113592021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110916262.2A CN113592021B (en) 2021-08-11 2021-08-11 Stereo matching method based on deformable and depth separable convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110916262.2A CN113592021B (en) 2021-08-11 2021-08-11 Stereo matching method based on deformable and depth separable convolution

Publications (2)

Publication Number Publication Date
CN113592021A CN113592021A (en) 2021-11-02
CN113592021B (en) 2024-03-22

Family

ID=78256983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110916262.2A Active CN113592021B (en) 2021-08-11 2021-08-11 Stereo matching method based on deformable and depth separable convolution

Country Status (1)

Country Link
CN (1) CN113592021B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998453A (en) * 2022-08-08 2022-09-02 State Grid Zhejiang Electric Power Co., Ltd. Ningbo Power Supply Company Stereo matching model based on high-scale unit and application method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018119808A1 (en) * 2016-12-29 2018-07-05 浙江工商大学 Stereo video generation method based on 3d convolutional neural network
CN110533712A (en) * 2019-08-26 2019-12-03 北京工业大学 A kind of binocular solid matching process based on convolutional neural networks
CN111402129A (en) * 2020-02-21 2020-07-10 西安交通大学 Binocular stereo matching method based on joint up-sampling convolutional neural network
CN111696148A (en) * 2020-06-17 2020-09-22 中国科学技术大学 End-to-end stereo matching method based on convolutional neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018119808A1 (en) * 2016-12-29 2018-07-05 浙江工商大学 Stereo video generation method based on 3d convolutional neural network
CN110533712A (en) * 2019-08-26 2019-12-03 北京工业大学 A kind of binocular solid matching process based on convolutional neural networks
CN111402129A (en) * 2020-02-21 2020-07-10 西安交通大学 Binocular stereo matching method based on joint up-sampling convolutional neural network
CN111696148A (en) * 2020-06-17 2020-09-22 中国科学技术大学 End-to-end stereo matching method based on convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An improved stereo matching algorithm based on PSMNet; 刘建国; 冯云剑; 纪郭; 颜伏伍; 朱仕卓; Journal of South China University of Technology (Natural Science Edition) (01); full text *
Depth estimation method for tomato plant images based on self-supervised learning; 周云成; 许童羽; 邓寒冰; 苗腾; 吴琼; Transactions of the Chinese Society of Agricultural Engineering (24); full text *

Also Published As

Publication number Publication date
CN113592021A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN112270249B (en) Target pose estimation method integrating RGB-D visual characteristics
CN110728219B (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN110705448B (en) Human body detection method and device
CN111275518B (en) Video virtual fitting method and device based on mixed optical flow
CN111428586B (en) Three-dimensional human body posture estimation method based on feature fusion and sample enhancement
CN111161349B (en) Object posture estimation method, device and equipment
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN106780592A (en) Kinect depth reconstruction algorithms based on camera motion and image light and shade
CN110738697A (en) Monocular depth estimation method based on deep learning
CN106981080A (en) Night unmanned vehicle scene depth method of estimation based on infrared image and radar data
CN104899921B (en) Single-view videos human body attitude restoration methods based on multi-modal own coding model
CN110047101A (en) Gestures of object estimation method, the method for obtaining dense depth image, related device
CN114663502A (en) Object posture estimation and image processing method and related equipment
CN111968165A (en) Dynamic human body three-dimensional model completion method, device, equipment and medium
CN111553869A (en) Method for complementing generated confrontation network image under space-based view angle
JP2023545189A (en) Image processing methods, devices, and electronic equipment
CN114757904A (en) Surface defect detection method based on AI deep learning algorithm
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN113592021B (en) Stereo matching method based on deformable and depth separable convolution
CN115761791A (en) Human body semantic prediction module based on 2D image, virtual clothes changing model and method
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN115953330B (en) Texture optimization method, device, equipment and storage medium for virtual scene image
CN116978057A (en) Human body posture migration method and device in image, computer equipment and storage medium
CN116977683A (en) Object recognition method, apparatus, computer device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant