CN116703999A - Residual fusion method for binocular stereo matching - Google Patents

Residual fusion method for binocular stereo matching

Info

Publication number
CN116703999A
CN116703999A CN202310972969.4A
Authority
CN
China
Prior art keywords
cost volume
volume
cost
stereo matching
residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310972969.4A
Other languages
Chinese (zh)
Inventor
俞正中
翟聚才
钱刃
杨文帮
赵勇
李福池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan Aipeike Technology Co ltd
Original Assignee
Dongguan Aipeike Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan Aipeike Technology Co ltd filed Critical Dongguan Aipeike Technology Co ltd
Priority to CN202310972969.4A priority Critical patent/CN116703999A/en
Publication of CN116703999A publication Critical patent/CN116703999A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20228Disparity calculation for image-based rendering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

A residual fusion method for binocular stereo matching relates to the field of stereo matching. The method comprises the following steps: respectively acquiring image features of a left view and a right view of the binocular camera; performing point-by-point correlation on the image features of the left view and the right view to construct cost volumes of a plurality of set scales; performing a nonlinear operation on each set-scale cost volume to correspondingly obtain a first cost volume, and performing a linear operation on the first cost volume to correspondingly obtain a second cost volume; fitting the second cost volume by using an attention module to correspondingly obtain a third cost volume; upsampling the third cost volume to a first set resolution to obtain a fourth cost volume; taking the difference of the third cost volume and the fourth cost volume to obtain a residual cost volume; fusing the residual cost volume to a cost volume of a corresponding set scale to obtain a parallax regression map; up-sampling the parallax regression map to a second set resolution to obtain a parallax map; and estimating geometric information of objects in the left view and the right view using the parallax map.

Description

Residual fusion method for binocular stereo matching
Technical Field
The application relates to the field of stereo matching, in particular to a residual fusion method for binocular stereo matching.
Background
In stereo matching for binocular vision, a key problem is to find corresponding points in the left and right images so as to obtain the horizontal position difference of corresponding pixels between the two images, also called the parallax (disparity). The depth of each pixel can then be calculated directly from the parallax and the parameters of the binocular camera.
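For a rectified binocular rig this calculation is depth = f·B/d, where f is the focal length in pixels, B is the baseline, and d is the parallax. A minimal illustrative sketch (function and variable names are our own, not from the patent):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (in pixels) to metric depth for a rectified
    stereo pair via depth = f * B / d; non-positive disparities are masked
    to avoid division by zero."""
    disparity = np.asarray(disparity, dtype=np.float32)
    depth = np.zeros_like(disparity)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```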
Currently, most methods use convolutional neural networks for stereo matching, and the network model for stereo matching generally comprises four parts: feature extraction, cost computation, cost aggregation, and parallax regression. Some network models use 3D convolution during cost aggregation, which achieves high accuracy but introduces many more floating-point operations, runs slowly, and is difficult to deploy in real-time applications. Other network models use 2D convolution during cost aggregation, but 2D convolution is less accurate in depth estimation, which reduces the applicability of binocular stereo matching networks.
Disclosure of Invention
The application mainly solves the following technical problem: providing a high-precision residual fusion method for binocular stereo matching.
According to a first aspect, an embodiment provides a residual fusion method for binocular stereo matching, comprising:
respectively acquiring image features of a left view and a right view of the binocular camera;
performing point-by-point correlation on the image features of the left view and the right view to construct cost volumes with a plurality of set scales;
performing nonlinear operation on each cost volume with a set scale to correspondingly obtain a first cost volume, and performing linear operation on the first cost volume to correspondingly obtain a second cost volume;
fitting the second cost volume by using an attention module to correspondingly obtain a third cost volume;
upsampling the third cost volume to a first set resolution to obtain a fourth cost volume; taking the difference of the third cost volume and the fourth cost volume to obtain a residual cost volume; fusing the residual cost volume to the third cost volume to obtain a parallax feature map; fusing the parallax feature map to a cost volume of a corresponding set scale to obtain a parallax regression map;
up-sampling the parallax regression map to a second set resolution to obtain a parallax map; and estimating geometric information of objects in the left view and the right view by using the parallax map.
In one embodiment, the capturing image features of the left view and the right view of the binocular camera respectively includes:
respectively extracting features from images of different areas of the left view and the right view by using the same convolution kernel, so as to correspondingly obtain the image features of the left view and the right view.
In one embodiment, the performing point-by-point correlation on the image features of the left view and the right view to construct cost volumes of several set scales includes:
the set-scale cost volumes include: 1/3Dmax × 1/3H × 1/3W, 1/6Dmax × 1/6H × 1/6W, and 1/12Dmax × 1/12H × 1/12W;
where Dmax denotes a maximum parallax range, H denotes original heights of the left and right views, and W denotes original widths of the left and right views.
In one embodiment, the fitting the second cost volume with the attention module to obtain a third cost volume includes:
fitting the image features of the points surrounding each point in the second cost volume to that point to obtain a third cost volume, so as to strengthen the connection between each point in the second cost volume and its surrounding points.
In one embodiment, the upsampling the third cost volume to the first set resolution includes:
carrying out point-by-point convolution on the third cost volume to enlarge its channel number, and up-sampling the third cost volume with the enlarged channel number to the first set resolution by nearest-neighbor interpolation.
In one embodiment, the first set resolution comprises 1/2 resolution.
In one embodiment, the fusing of the parallax feature map to a cost volume of a corresponding set scale includes:
adjusting the cost volume of the set scale to the scale of the parallax feature map through a set convolution layer, and fusing the image feature of each point in the parallax feature map to the scale-adjusted cost volume.
In one embodiment, the set convolution layer comprises a 3×3 convolution layer with a step size of 2.
In one embodiment, the second set resolution includes original resolutions of left and right views.
According to a second aspect, an embodiment provides a computer readable storage medium having stored thereon a program executable by a processor to implement the above-described residual fusion method for binocular stereo matching.
According to the residual fusion method and the computer-readable storage medium for binocular stereo matching of the embodiments, cost volumes of different set scales are constructed from the image features of the left and right views, so that the parallax map can contain information at different scales. Each set-scale cost volume is then processed with a nonlinear operation and a linear operation, so that higher-order image information can be obtained and a more accurate parallax map produced. The attention module emphasizes different parts of each cost volume and also facilitates further cross-scale aggregation. Finally, fusion by up-sampling and down-sampling gives the fused image features a stronger discriminative capability.
Drawings
FIG. 1 is a first flowchart of a residual fusion method for binocular stereo matching in one embodiment;
FIG. 2 is a second flowchart of a residual fusion method for binocular stereo matching in one embodiment;
FIG. 3 is a schematic block diagram of a residual fusion method for binocular stereo matching in one embodiment;
FIG. 4 is a third flowchart of a residual fusion method for binocular stereo matching in one embodiment;
FIG. 5 is a fourth flowchart of a residual fusion method for binocular stereo matching in one embodiment.
Detailed Description
The application will be described in further detail below with reference to the drawings by means of specific embodiments, wherein like elements in different embodiments are given like associated numbers. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that some of the features may be omitted, or replaced by other elements, materials, or methods, in different situations. In some instances, operations related to the present application are not shown or described in the specification, in order to avoid obscuring its core; a detailed description of such operations is unnecessary for persons skilled in the art, who can fully understand them from the description herein together with their general knowledge.
Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.
The numbering of components herein, e.g. "first", "second", etc., is used merely to distinguish the described objects and does not carry any ordinal or technical meaning. The term "coupled", as used herein, includes both direct and indirect coupling, unless otherwise indicated.
An embodiment of the application provides a residual fusion method for binocular stereo matching, which consists of four parts: feature extraction, cost volume construction, cost aggregation, and parallax refinement. Please refer to fig. 1; the method specifically includes the following steps.
In the feature extraction stage, step S100 is adopted: respectively acquiring the image features of the left view and the right view of the binocular camera.
In some embodiments, when performing step S100 to obtain the image features of the left view and the right view of the binocular camera, please refer to fig. 2, the following steps are further included.
Step S110: and respectively extracting the characteristics of the images of the different areas of the left view and the right view by using the same convolution kernel so as to correspondingly obtain the image characteristics of the left view and the right view.
In some embodiments, a stacked hourglass extractor is used to extract features from images of different areas of the left and right views. The stacked hourglass extractor is formed by stacking a plurality of "hourglass" modules and includes a plurality of convolution kernels. Features of the left view and the right view are extracted with the same convolution kernels; because the stacked hourglass extractor contains multiple convolution kernels, the extracted image features cover different scales. Features of the left and right views are then spliced across scales using dense connections, so as to correspondingly obtain the image features of the left view and the right view.
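As a rough, illustrative sketch of one hourglass stage with weights shared between the two views (PyTorch; module names and channel counts are assumptions, and the dense cross-scale connections are omitted):

```python
import torch.nn as nn

class HourglassStage(nn.Module):
    """One encoder-decoder ("hourglass") stage; the extractor stacks several
    of these. Assumes even spatial dimensions so the decoded map matches x."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)          # to 1/2 scale
        self.mid = nn.Conv2d(ch, ch, 3, stride=1, padding=1)
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)   # back to full scale
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        d = self.act(self.down(x))
        m = self.act(self.mid(d))
        return self.act(self.up(m)) + x  # skip connection at the input scale

def extract_features(extractor, left, right):
    """The same convolution kernels (shared weights) process both views."""
    return extractor(left), extractor(right)
```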
In the cost volume construction stage, step S200 is adopted: the image features of the left view and the right view are correlated point by point to construct a cost volume of a plurality of set scales.
In some embodiments, each corresponding pixel point in the left view and the right view is correlated point by point, so as to construct a plurality of cost volumes with set scales.
In some embodiments, the set-scale cost volumes include 1/3Dmax × 1/3H × 1/3W, 1/6Dmax × 1/6H × 1/6W, and 1/12Dmax × 1/12H × 1/12W, where Dmax denotes the maximum parallax range, H denotes the original height of the left and right views, and W denotes the original width of the left and right views. Constructed cost volumes of different scales capture different cost information, so that a parallax map of higher precision can be constructed.
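A minimal sketch of building one such cost volume by point-by-point correlation (averaging over feature channels is an assumed correlation measure; the patent does not spell out the exact operation):

```python
import torch

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """Point-by-point correlation between left/right features at one scale.

    feat_l, feat_r: [B, C, H, W]; returns [B, max_disp, H, W], where the right
    features are shifted by d pixels before correlating at disparity d. Called
    once per scale, e.g. Dmax/3 disparities on 1/3-resolution features."""
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    return volume
```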
In the cost aggregation stage, step S300 is adopted: determining a parallax regression map from the cost volumes of different set scales. Referring to fig. 3, step S300 uses three sub-modules for this purpose, namely a combination module 300a, an attention module 300b, and a residual fusion module 300c.
In some embodiments, referring to fig. 4, determining the parallax regression map from the cost volumes of different set scales in step S300 includes the following steps.
Step S310 is performed in the combination module 300a: carrying out a nonlinear operation on each set-scale cost volume to correspondingly obtain a first cost volume, and carrying out a linear operation on the first cost volume to correspondingly obtain a second cost volume.
In some embodiments, each constructed set-scale cost volume is processed using a residual structure consisting of a 1×1 pointwise convolution and a 3×3 depthwise convolution. A nonlinear operation is applied to each set-scale cost volume with a ReLU activation function to obtain the first cost volume corresponding to each set-scale cost volume, and each first cost volume is then processed with a linear transformation to obtain the corresponding second cost volume.
Processing each constructed set-scale cost volume with a residual structure consisting of a 1×1 pointwise convolution and a 3×3 depthwise convolution means adding some extra convolution layers on top of the set-scale cost volume to form a structure similar to a residual network. This structure can enhance the expressive power of the image features and improve the accuracy and robustness of binocular stereo matching based on the parallax map. In addition, the pointwise convolution can reduce dimensionality and add nonlinearity, while the depthwise convolution increases the depth and receptive field of the network, further improving the expressive power of the image features.
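A sketch of such a residual block, treating the disparity dimension of the cost volume as the channel dimension of 2D convolutions (channel counts and the exact placement of the ReLU are assumptions consistent with the description above):

```python
import torch.nn as nn

class CombinationBlock(nn.Module):
    """Residual structure over a cost volume: 1x1 pointwise + 3x3 depthwise
    convolutions with a ReLU (nonlinear step -> "first cost volume"), followed
    by a linear transform (no activation -> "second cost volume")."""
    def __init__(self, ch):
        super().__init__()
        self.pointwise = nn.Conv2d(ch, ch, kernel_size=1)
        self.depthwise = nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch)
        self.relu = nn.ReLU(inplace=True)
        self.linear = nn.Conv2d(ch, ch, kernel_size=1)  # linear: no activation after

    def forward(self, cost):
        first = self.relu(self.depthwise(self.pointwise(cost)))  # nonlinear operation
        second = self.linear(first)                              # linear operation
        return second + cost                                     # residual connection
```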
Step S320 is performed in the attention module 300b: fitting the second cost volume with the attention module to correspondingly obtain a third cost volume.
In some embodiments, the image features of the points surrounding each point in the second cost volume are fitted to that point to obtain the third cost volume. That is, each point in the second cost volume has several points around it and may be regarded as a center point; the image features of the points around the center point are fitted to the center point, so that the center point incorporates the image features of its surrounding points.
In some embodiments, the attention module is a lightweight attention module, with which attention can be focused on the image features most relevant to the parallax map to be obtained, thereby improving the accuracy of the parallax map. Let $V \in \mathbb{R}^{C \times h \times w}$ be the input second cost volume, where $C$ is the number of input channels and $h$ and $w$ are the height and width of the second cost volume. The cost volume is divided into $g$ groups along the channel direction, written $[V_1, V_2, \ldots, V_g]$, and each group is processed separately; let $V_k$ denote one of the groups, where $1 \le k \le g$. Processing of the second cost volume with the attention module may be accomplished by the following formulas:

$$A_k = \mathrm{softmax}\big(\mathrm{PW}(\mathrm{maxpool}(V_k))\big), \qquad \tilde{V}_k = A_k \odot V_k + V_k, \qquad V' = \mathrm{Concat}(\tilde{V}_1, \ldots, \tilde{V}_g)$$

where maxpool is a 3×3 max pooling layer, PW is a point-by-point convolution, and $A_k$ is the attention map inferred from $V_k$. Each group's $A_k$ captures spatial relationships by learning cross-channel information, and the softmax activation models the probability of correlation between points. For each group, the output $\tilde{V}_k$ is obtained by element-wise multiplication and addition, and the output third cost volume $V'$ is obtained by stacking all of the highlighted $\tilde{V}_k$; Concat denotes the resulting increase in the number of channels.
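A sketch consistent with the formulas above (sharing one PW layer across the groups and taking the softmax over each group's channels are implementation assumptions):

```python
import torch
import torch.nn as nn

class GroupedAttention(nn.Module):
    """Lightweight grouped attention over a cost volume V in R^{C x h x w}:
    split channels into g groups, infer A_k = softmax(PW(maxpool(V_k))),
    re-weight each group, and keep a residual path before concatenation."""
    def __init__(self, channels, groups):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        gc = channels // groups
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)  # 3x3 max pooling
        self.pw = nn.Conv2d(gc, gc, kernel_size=1)        # point-by-point convolution

    def forward(self, v):
        outs = []
        for v_k in torch.chunk(v, self.groups, dim=1):
            a_k = torch.softmax(self.pw(self.pool(v_k)), dim=1)  # attention map A_k
            outs.append(a_k * v_k + v_k)  # element-wise multiplication and addition
        return torch.cat(outs, dim=1)     # Concat: stack the highlighted groups
```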
Step S330 is performed in the residual fusion module 300c: upsampling the third cost volume to the first set resolution to obtain a fourth cost volume; taking the difference of the third cost volume and the fourth cost volume to obtain a residual cost volume; fusing the residual cost volume to the third cost volume to obtain a parallax feature map; and fusing the parallax feature map to the cost volume of the corresponding set scale to obtain a parallax regression map.
Here, resolution refers to the image size, i.e. the height and width, of the input to the model. The input resolution is typically determined by the number of downsampling operations in the model and by the resolution of the feature map after the last downsampling.
In convolutional neural networks, the output of feature extraction is usually smaller than the input image, so the image sometimes needs to be restored to its original size for further computation; this operation, which maps an image from a lower resolution to a higher resolution, is called upsampling.
Referring to fig. 5, in some embodiments, step S330 upsamples the third cost volume to the first set resolution to obtain a fourth cost volume, takes the difference of the third cost volume and the fourth cost volume to obtain a residual cost volume, and fuses the residual cost volume to the cost volume of the corresponding set scale to obtain the parallax regression map; this includes the following steps:
step S331: and carrying out point-by-point convolution on the third price volume to enlarge the channel number, and up-sampling the third price volume with the enlarged channel number by a nearest neighbor interpolation method to reach the first set resolution to form a fourth price volume.
In some embodiments, the first set resolution is 1/2 resolution.
Step S332: and performing difference on the third cost volume and the fourth cost volume to obtain a residual cost volume, and processing the residual cost volume through a 3*3 convolution layer with the step length of 1 to fuse the residual cost volume to the third cost volume to obtain a parallax characteristic diagram.
In some embodiments, the parallax feature map is obtained by the following formula:

$$R_{up} = \mathrm{Upsampling}\big(\mathrm{PW}(V_l)\big), \qquad \tilde{D}_h = \mathrm{Conv}(V_h - R_{up}) + V_h$$

where $R_{up}$ represents the result of the upsampling process (nearest-neighbor Upsampling), PW is a pointwise convolution used to expand the number of channels, $V_l$ represents the low-resolution cost volume, $V_h$ represents the high-resolution cost volume, Conv represents a convolution operation, and $\tilde{D}_h$ represents the parallax features of the residual cost volume fused to the high-resolution cost volume.
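An illustrative sketch of this up-branch fusion (the sign of the residual and the channel counts are assumptions consistent with the formula above):

```python
import torch.nn as nn
import torch.nn.functional as F

class UpFusion(nn.Module):
    """R_up = Upsampling(PW(V_l)); residual = V_h - R_up; the residual is
    passed through a 3x3, stride-1 convolution and added back to V_h."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.pw = nn.Conv2d(low_ch, high_ch, kernel_size=1)  # enlarge channel number
        self.conv = nn.Conv2d(high_ch, high_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, v_low, v_high):
        r_up = F.interpolate(self.pw(v_low), size=v_high.shape[-2:], mode="nearest")
        residual = v_high - r_up              # residual cost volume
        return v_high + self.conv(residual)   # parallax feature map
```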
Step S333: and (3) adjusting the cost volume with the set dimension into the dimension of the parallax characteristic map through a 3X 3 convolution layer with the step length of 2 by utilizing a downsampling mode, and fusing each point image characteristic in the parallax characteristic map to the cost volume with the set dimension adjusted to obtain a parallax regression map.
In some embodiments, the parallax regression map is obtained by the following formula:

$$R_{down} = \mathrm{Conv}_{stride=2}(V_h), \qquad \tilde{D}_l = \mathrm{Conv}(V_l - R_{down}) + V_l$$

where $R_{down}$ represents the result of the downsampling process, Conv represents a convolution operation, $stride=2$ denotes a convolution step size of 2 used for downsampling, $V_h$ represents the high-resolution cost volume, $V_l$ represents the low-resolution cost volume, and $\tilde{D}_l$ represents the parallax features of the residual cost volume fused to the low-resolution cost volume.
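A matching sketch of the down-branch (again, residual sign and channel counts are assumptions):

```python
import torch.nn as nn

class DownFusion(nn.Module):
    """R_down = Conv_{3x3, stride=2}(V_h); residual = V_l - R_down; the residual
    is fused back into the low-resolution volume V_l. Assumes the stride-2
    convolution lands exactly on V_l's spatial size (e.g. 1/3 -> 1/6 scale)."""
    def __init__(self, high_ch, low_ch):
        super().__init__()
        self.down = nn.Conv2d(high_ch, low_ch, kernel_size=3, stride=2, padding=1)
        self.conv = nn.Conv2d(low_ch, low_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, v_high, v_low):
        r_down = self.down(v_high)           # downsampled high-resolution volume
        residual = v_low - r_down            # residual cost volume
        return v_low + self.conv(residual)   # features for the parallax regression map
```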
In some embodiments, during upsampling and downsampling the residual cost volumes concentrate on the differing information between the inputs, while the addition operation preserves the original cost volume in each branch so that no information is lost; the fused features therefore have a stronger discriminative capability than simple addition or concatenation.
In the parallax refinement stage, step S400 is employed: up-sampling the parallax regression map to a second set resolution to obtain a parallax map, and estimating geometric information of objects in the left view and the right view by using the parallax map.
The parallax map is an image that takes either the left view or the right view as its reference image, has the same size as the reference image, and whose element values are parallax values; it contains the geometric distance information of the scene. A continuous parallax map is estimated from the parallax regression map by parallax regression. In some embodiments, the second set resolution is the original resolution of the left and right views.
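A sketch of one common way to realize this step (soft-argmin regression is an assumption; the patent only says "parallax regression"):

```python
import torch
import torch.nn.functional as F

def regress_disparity(cost_volume, full_size):
    """Soft-argmin over the disparity dimension, then upsampling to the second
    set resolution (the original view size), rescaling disparities to match."""
    b, d, h, w = cost_volume.shape
    prob = torch.softmax(cost_volume, dim=1)              # per-pixel disparity distribution
    disps = torch.arange(d, device=cost_volume.device, dtype=prob.dtype).view(1, d, 1, 1)
    disp = (prob * disps).sum(dim=1, keepdim=True)        # expected (continuous) disparity
    scale = full_size[-1] / w                             # disparity scales with image width
    return F.interpolate(disp, size=full_size, mode="bilinear", align_corners=False) * scale
```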
In the residual fusion method for binocular stereo matching provided by the application, after the image features of the left view and the right view from the binocular camera are obtained, cost volumes of several set scales are constructed from the image features of the two views by point-by-point correlation. Taking the 1/3Dmax × 1/3H × 1/3W scale as an example, a nonlinear operation is performed on the 1/3Dmax × 1/3H × 1/3W cost volume to obtain the corresponding first cost volume, and a linear operation is then performed on the first cost volume to obtain the second cost volume. The second cost volume is fitted with the attention module to obtain the corresponding third cost volume, the third cost volume is upsampled to 1/2 resolution to obtain the fourth cost volume, the third and fourth cost volumes are differenced to obtain the residual cost volume, the residual cost volume is fused to the third cost volume to obtain the parallax feature map, and finally the parallax feature map is fused to the 1/3Dmax × 1/3H × 1/3W cost volume to obtain the parallax regression map. The other two set-scale cost volumes, 1/6Dmax × 1/6H × 1/6W and 1/12Dmax × 1/12H × 1/12W, are processed in the same way and are not described again here.
The residual fusion method for binocular stereo matching provided by the application combines features with different receptive fields using the combination module, emphasizes the salient areas in different cost volumes using the attention module, and, in the residual fusion module, uses gradual residual aggregation that focuses on extracting the differences between adjacent scales instead of simple stacking or addition, thereby improving the accuracy of binocular stereo matching and of the resulting parallax map.
Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware or by a computer program. When all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, a hard disk, and the like; the functions are realized when the program is executed by a computer. For example, the program may be stored in the memory of a device, and all or part of the functions described above can be realized when the program in the memory is executed by a processor. The program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and implemented by downloading or copying it into the memory of a local device, or by updating the version of the local device's system, so that when the program in the memory is executed by a processor, all or part of the functions in the above embodiments can be realized.
The foregoing description of the application has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the application pertains, based on the idea of the application.

Claims (10)

1. A residual fusion method for binocular stereo matching, comprising:
respectively acquiring image features of a left view and a right view of the binocular camera;
performing point-by-point correlation on the image features of the left view and the right view to construct cost volumes with a plurality of set scales;
performing nonlinear operation on each cost volume with a set scale to correspondingly obtain a first cost volume, and performing linear operation on the first cost volume to correspondingly obtain a second cost volume;
fitting the second cost volume by using an attention module to correspondingly obtain a third cost volume;
upsampling the third cost volume to a first set resolution to obtain a fourth cost volume; taking the difference of the third cost volume and the fourth cost volume to obtain a residual cost volume; fusing the residual cost volume to the third cost volume to obtain a parallax feature map; fusing the parallax feature map to a cost volume of a corresponding set scale to obtain a parallax regression map;
up-sampling the parallax regression map to a second set resolution to obtain a parallax map; and estimating geometric information of objects in the left view and the right view by using the parallax map.
2. The residual fusion method for binocular stereo matching according to claim 1, wherein the respectively acquiring image features of left and right views under the binocular camera comprises:
respectively extracting features from images of different areas of the left view and the right view by using the same convolution kernel, so as to correspondingly obtain the image features of the left view and the right view.
3. The residual fusion method for binocular stereo matching of claim 1, wherein the performing point-by-point correlation on the image features of the left and right views to construct cost volumes of several set scales comprises:
the set-scale cost volumes include: 1/3Dmax × 1/3H × 1/3W, 1/6Dmax × 1/6H × 1/6W, and 1/12Dmax × 1/12H × 1/12W;
where Dmax denotes a maximum parallax range, H denotes original heights of the left and right views, and W denotes original widths of the left and right views.
4. The residual fusion method for binocular stereo matching of claim 1, wherein fitting the second cost volume with an attention module to obtain a third cost volume comprises:
fitting the image features of the points around the points in the second cost volume to the points to obtain a third cost volume, wherein the third cost volume is used for increasing the connection between the points in the second cost volume and the surrounding points.
5. The residual fusion method for binocular stereo matching of claim 1, wherein upsampling the third cost volume to the first set resolution comprises:
carrying out point-by-point convolution on the third cost volume to enlarge its channel number, and up-sampling the third cost volume with the enlarged channel number to the first set resolution by nearest-neighbor interpolation.
6. The residual fusion method for binocular stereo matching of claim 5, wherein the first set resolution comprises 1/2 resolution.
7. The residual fusion method for binocular stereo matching of claim 1, wherein the fusing of the parallax feature map to the cost volume of the corresponding set scale comprises:
adjusting the cost volume of the set scale to the scale of the parallax feature map through a set convolution layer, and fusing the image feature of each point in the parallax feature map to the scale-adjusted cost volume.
8. The residual fusion method for binocular stereo matching of claim 7, wherein the set convolution layer comprises a 3×3 convolution layer with a step size of 2.
9. The residual fusion method for binocular stereo matching of claim 1, wherein the second set resolution includes original resolutions of left and right views.
10. A computer readable storage medium, characterized in that the medium has stored thereon a program executable by a processor to implement the method of any of claims 1-9.
CN202310972969.4A 2023-08-04 2023-08-04 Residual fusion method for binocular stereo matching Pending CN116703999A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310972969.4A CN116703999A (en) 2023-08-04 2023-08-04 Residual fusion method for binocular stereo matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310972969.4A CN116703999A (en) 2023-08-04 2023-08-04 Residual fusion method for binocular stereo matching

Publications (1)

Publication Number Publication Date
CN116703999A (en) 2023-09-05

Family

ID=87841823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310972969.4A Pending CN116703999A (en) 2023-08-04 2023-08-04 Residual fusion method for binocular stereo matching

Country Status (1)

Country Link
CN (1) CN116703999A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022120988A1 (en) * 2020-12-11 2022-06-16 深圳先进技术研究院 Stereo matching method based on hybrid 2d convolution and pseudo 3d convolution
CN116051752A (en) * 2023-02-22 2023-05-02 桂林电子科技大学 Binocular stereo matching algorithm based on multi-scale feature fusion cavity convolution ResNet
CN116229123A (en) * 2023-02-21 2023-06-06 深圳市爱培科技术股份有限公司 Binocular stereo matching method and device based on multi-channel grouping cross-correlation cost volume

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022120988A1 (en) * 2020-12-11 2022-06-16 深圳先进技术研究院 Stereo matching method based on hybrid 2d convolution and pseudo 3d convolution
CN116229123A (en) * 2023-02-21 2023-06-06 深圳市爱培科技术股份有限公司 Binocular stereo matching method and device based on multi-channel grouping cross-correlation cost volume
CN116051752A (en) * 2023-02-22 2023-05-02 桂林电子科技大学 Binocular stereo matching algorithm based on multi-scale feature fusion cavity convolution ResNet

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ZIJING HUANG ET AL: "Fast Multi-Scale Residual Fusion Network for Stereo Matching", 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1-6 *
JING MINGTAO ET AL: "Research on Swimming Pool Drowning Detection Based on Improved Mask R-CNN", Journal of Qingdao University (Engineering & Technology Edition), vol. 36, no. 1, pages 1-7 *
LIU JIEPING ET AL: "Monocular Image Depth Estimation Based on Multi-Scale Attention-Guided Network", Journal of South China University of Technology (Natural Science Edition), vol. 48, no. 12, pages 52-62 *
ZHOU XINGJIE ET AL: "An Improved Convolutional Neural Network Method for Text Recognition", Journal of Jiangsu University of Technology, vol. 26, no. 6, pages 44-49 *

Similar Documents

Publication Publication Date Title
CN109101975B (en) Image semantic segmentation method based on full convolution neural network
CN110476185B (en) Depth of field information estimation method and device
CN111915660B (en) Binocular disparity matching method and system based on shared features and attention up-sampling
EP2466901B1 (en) Depth data upsampling
CN109635714B (en) Correction method and device for document scanning image
CN110880162A (en) Snapshot spectrum depth combined imaging method and system based on deep learning
CN113344869A (en) Driving environment real-time stereo matching method and device based on candidate parallax
CN115311454A (en) Image segmentation method based on residual error feature optimization and attention mechanism
CN114742875A (en) Binocular stereo matching method based on multi-scale feature extraction and self-adaptive aggregation
CN115272683A (en) Central differential information filtering phase unwrapping method based on deep learning
KR101795952B1 (en) Method and device for generating depth image of 2d image
CN113034666B (en) Stereo matching method based on pyramid parallax optimization cost calculation
CN115035551B (en) Three-dimensional human body posture estimation method, device, equipment and storage medium
CN116703999A (en) Residual fusion method for binocular stereo matching
CN116630388A (en) Thermal imaging image binocular parallax estimation method and system based on deep learning
CN112950653B (en) Attention image segmentation method, device and medium
EP4198897A1 (en) Vehicle motion state evaluation method and apparatus, device, and medium
CN112203023B (en) Billion pixel video generation method and device, equipment and medium
CN115496654A (en) Image super-resolution reconstruction method, device and medium based on self-attention mechanism
CN114998630A (en) Ground-to-air image registration method from coarse to fine
CN114445277A (en) Depth image pixel enhancement method and device and computer readable storage medium
CN113112547A (en) Robot, repositioning method thereof, positioning device and storage medium
Cho et al. Depth map up-sampling using cost-volume filtering
CN113674154A (en) Single image super-resolution reconstruction method and system based on generation countermeasure network
CN117058252B (en) Self-adaptive fusion stereo matching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination