CN117058252B - Self-adaptive fusion stereo matching method - Google Patents

Self-adaptive fusion stereo matching method

Info

Publication number
CN117058252B
CN117058252B · Application CN202311317490.3A
Authority
CN
China
Prior art keywords
cost volume
fusion
branch
scale
left view
Prior art date
Legal status
Active
Application number
CN202311317490.3A
Other languages
Chinese (zh)
Other versions
CN117058252A (en)
Inventor
俞正中
李鹏飞
钱刃
丘文峰
赵勇
李福池
Current Assignee
Dongguan Aipeike Technology Co ltd
Original Assignee
Dongguan Aipeike Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Dongguan Aipeike Technology Co ltd filed Critical Dongguan Aipeike Technology Co ltd
Priority to CN202311317490.3A
Publication of CN117058252A
Application granted
Publication of CN117058252B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85Stereo camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

A self-adaptive fusion stereo matching method relates to the technical field of stereo matching. The method comprises the following steps: acquiring a left view and a right view shot by a binocular camera; performing feature extraction on the left view and the right view to construct a first scale cost volume and a second scale cost volume; downsampling the first scale cost volume to generate a first fusion branch; convolving the second scale cost volume to generate a second fusion branch; fusing the first fusion branch and the second fusion branch to generate an initial fusion cost volume; sequentially upsampling and downsampling the initial fusion cost volume to update the first fusion branch; convolving the second scale cost volume to maintain the second fusion branch; fusing the first fusion branch and the second fusion branch to update the initial fusion cost volume; repeating the updating of the initial fusion cost volume until a set number of times is reached, and outputting the initial fusion cost volume as a fusion cost volume; and generating a dense disparity map according to the fusion cost volume to calculate disparities of the left view and the right view.

Description

Self-adaptive fusion stereo matching method
Technical Field
The invention relates to the technical field of stereo matching, in particular to a self-adaptive fusion stereo matching method.
Background
Convolutional neural networks have made significant progress in stereo matching; however, occlusion regions remain difficult to handle. Stereo matching is a fundamental problem in computer vision with applications in many areas such as robotics and autonomous driving. The goal of stereo matching is to establish a dense correspondence between a pair of rectified stereo images. Although convolutional neural networks have been widely applied to stereo matching and outperform traditional methods, existing approaches still cannot achieve accurate matching in occluded, repetitive, and reflective regions of complex autonomous driving scenes.
Disclosure of Invention
The invention mainly solves the following technical problem: providing a stereo matching method capable of improving matching accuracy in complex autonomous driving scenes.
According to a first aspect, in one embodiment, there is provided a stereo matching method of adaptive fusion, including:
acquiring a left view and a right view shot by a binocular camera;
feature extraction is carried out on the left view and the right view to construct a first scale cost volume and a second scale cost volume, and the scale of the first scale cost volume is larger than that of the second scale cost volume;
downsampling the first scale cost volume to generate a first fused branch; convolving the second scale cost volume to generate a second fused branch that maintains the scale of the second scale cost volume; fusing the first fused branch and the second fused branch to generate an initial fused cost volume;
updating the initial fusion cost volume to generate a fusion cost volume, wherein the updating of the initial fusion cost volume comprises the following steps:
sequentially upsampling and downsampling the initial fusion cost volume to update the first fusion branch; convolving the second scale cost volume to maintain the second fusion branch; fusing the updated first fusion branch and the second fusion branch to update the initial fusion cost volume;
repeating the updating of the initial fusion cost volume until the set times are reached, and outputting the initial fusion cost volume as a fusion cost volume;
and generating a dense parallax map according to the fusion cost volume so as to calculate the parallax of the left view and the right view.
In one embodiment, the first scale cost volume comprises 1/4d,1/4h,1/4w cost volumes; the second scale cost volume comprises 1/16d,1/16h and 1/16w cost volumes; wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view.
In one embodiment, the downsampling the first scale cost volume to generate a first fused branch includes:
when the first scale cost volume is downsampled, the first scale cost volume is changed from 1/4d,1/4h,1/4w cost volume to 1/8d,1/8h and 1/8w cost volume, and then is changed from 1/8d,1/8h,1/8w cost volume to 1/16d,1/16h and 1/16w cost volume; the first fusion branch comprises 1/16d,1/16h and 1/16w cost volumes; wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view.
In one embodiment, the convolving the second scale cost volume to generate a second fused branch that maintains the scale of the second scale cost volume, comprising:
and convolving the second scale cost volume by using a two-layer 3D convolution with a step length of 1 to generate a second fusion branch maintaining the scale of the second scale cost volume.
In one embodiment, the fusing the first fused branch and the second fused branch generates an initial fused cost volume, including:
respectively carrying out maximum pooling and average pooling on the first fusion branch and the second fusion branch so as to correspondingly acquire the 2D characteristics of the first fusion branch and the 2D characteristics of the second fusion branch;
inputting the 2D characteristics of the first fusion branch and the 2D characteristics of the second fusion branch into a 2D convolution layer to correspondingly generate a weight map of the first fusion branch and a weight map of the second fusion branch;
summing the weight map of the first fusion branch and the weight map of the second fusion branch to generate an initial fusion cost volume; the initial fusion cost volume comprises 1/16d,1/16h and 1/16w cost volumes; wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view.
In one embodiment, the sequentially upsampling and downsampling the initial fusion cost volume includes:
when the initial fusion cost volume is up-sampled, the initial fusion cost volume is changed from 1/16d,1/16h,1/16w cost volume to 1/8d,1/8h and 1/8w cost volume, and then from 1/8d,1/8h and 1/8w cost volume to 1/4d,1/4h and 1/4w cost volume;
when the initial fusion cost volume is downsampled, the initial fusion cost volume is changed from 1/4d,1/4h,1/4w cost volume to 1/8d,1/8h and 1/8w cost volume, and then from 1/8d,1/8h and 1/8w cost volume to 1/16d,1/16h and 1/16w cost volume;
wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view.
In one embodiment, the set number of times is 2.
In one embodiment, the acquiring of the left view and the right view shot by the binocular camera includes:
calibrating the binocular camera;
and matching the original left view and the original right view shot by the binocular camera according to the camera calibration result, and correcting the matched original left view and original right view to generate the left view and the right view shot by the binocular camera.
In one embodiment, the correcting the matched original left view and original right view to generate the left view and right view shot by the binocular camera includes:
and correcting the matched original left view and original right view by using the Bouguet epipolar rectification method so as to generate the left view and the right view shot by the binocular camera.
According to a second aspect, an embodiment provides a computer readable storage medium having stored thereon a program executable by a processor to implement the method described above.
According to the self-adaptive fusion stereo matching method and the computer readable storage medium, the algorithm first acquires the left view and the right view of the binocular camera, then performs feature extraction on the left view and the right view to construct a first scale cost volume and a second scale cost volume, and then downsamples the first scale cost volume to generate a first fusion branch; the second scale cost volume is convolved to generate a second fusion branch that maintains the scale of the second scale cost volume. The first fusion branch and the second fusion branch are fused to generate an initial fusion cost volume, which is then updated by sequential upsampling and downsampling. These downsampling and upsampling operations change the scale of the initial fusion cost volume from large to small and then from small to large, so that detailed information in the left view and the right view is captured better and the understanding and recognition accuracy for the two views is improved. Finally, the initial fusion cost volume obtained after the set number of updates is output as the fusion cost volume, and a dense parallax map is generated from the fusion cost volume to calculate the parallaxes of the left view and the right view. By learning cost volumes of different scales and fusing the information obtained from them, the method improves the accuracy of the stereo matching effect.
Drawings
FIG. 1 is a flow chart of a stereo matching method of adaptive fusion according to one embodiment;
FIG. 2 is a flowchart of step S100 of an adaptive fusion stereo matching method according to an embodiment;
FIG. 3 is a schematic diagram of cost aggregation of a first scale cost volume and a second scale cost volume of an adaptive fusion stereo matching method according to an embodiment;
FIG. 4 is a flowchart of step S300 of an adaptive fusion stereo matching method according to an embodiment;
FIG. 5 is a schematic diagram of a first fusion branch and a second fusion branch of an embodiment of a stereo matching method of adaptive fusion;
fig. 6 is a flowchart of step S340 of the stereo matching method of adaptive fusion according to an embodiment.
Detailed Description
The invention will be described in further detail below with reference to the drawings by means of specific embodiments, in which like elements in different embodiments are given like reference numerals. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that some of these features may be omitted, or replaced by other elements, materials, or methods, in different situations. In some instances, operations associated with the present application are not shown or described in the specification in order to avoid obscuring its core; a detailed description of such operations is not necessary, since a person skilled in the art can understand them from the description herein and from general knowledge in the field.
Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.
The numbering of the components itself, e.g. "first", "second", etc., is used herein merely to distinguish between the described objects and does not have any sequential or technical meaning.
The application provides a self-adaptive fusion stereo matching method, which learns weights for cost volumes of different scales from the different distributions along their parallax dimensions and performs weighted fusion of the different branches. The self-adaptive fusion stereo matching method provided by the application can use the information of the different branches to achieve a good effect in weak-texture areas, as described in detail below.
Referring to fig. 1, the stereo matching method of adaptive fusion provided in the present application includes the following steps.
Step S100: and acquiring a left view and a right view shot by the binocular camera.
Referring to fig. 2, in one embodiment, when step S100 is performed to obtain the left view and the right view captured by the binocular camera, the following steps are further included.
Step S110: and calibrating the binocular camera.
In one embodiment, calibrating the binocular camera means determining its internal and external parameters. Calibrating the internal parameters refers to determining parameters internal to each camera, such as the focal length, principal point position and distortion. Calibrating the external parameters refers to determining the relative position and attitude between the two cameras, i.e., the rotation matrix and translation vector between them. These parameters are important for stereo matching in computer vision tasks; the external parameters are obtained by solving for the transformation between the two cameras from a set of known spatial points observed by both cameras.
Step S120: and matching the original left view and the original right view shot by the binocular camera according to the camera calibration result, and correcting the matched original left view and original right view to generate the left view and right view shot by the binocular camera.
In one embodiment, the original left view and the original right view shot by the binocular camera are matched according to the camera calibration result, and the matched views are then corrected so that, after correction, potentially matching pixels of the original left view and the original right view lie on the same horizontal epipolar line. In practice the optical axes of the two cameras may not be parallel and their optical centers may not lie in the same plane, so the Bouguet epipolar rectification method is used to correct the matched original left view and original right view and thereby generate the left view and the right view shot by the binocular camera.
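For illustration only, the following is a minimal sketch of this calibration and rectification step using OpenCV, whose stereoRectify function implements Bouguet's algorithm. The function name rectify_pair, the chessboard-style calibration inputs and the fixed-intrinsics flag are assumptions of this example and are not specified by the method.

```python
import cv2

# Assumed inputs: per-camera intrinsics K1/K2 and distortion D1/D2 from a prior
# single-camera calibration, plus matched calibration points (object_pts in 3D,
# left_pts/right_pts in pixels) and the image size (w, h).
def rectify_pair(object_pts, left_pts, right_pts, K1, D1, K2, D2, image_size,
                 raw_left, raw_right):
    # External calibration: solve the rotation R and translation T between cameras.
    _, K1, D1, K2, D2, R, T, _, _ = cv2.stereoCalibrate(
        object_pts, left_pts, right_pts, K1, D1, K2, D2, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)

    # Bouguet epipolar rectification: rotate both views so that matching pixels
    # end up on the same horizontal epipolar line.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, image_size, R, T)

    map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, image_size, cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, image_size, cv2.CV_32FC1)

    left = cv2.remap(raw_left, map1x, map1y, cv2.INTER_LINEAR)
    right = cv2.remap(raw_right, map2x, map2y, cv2.INTER_LINEAR)
    return left, right, Q  # Q re-projects disparity to depth if needed later
```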
Step S200: and extracting features of the left view and the right view to construct a first scale cost volume and a second scale cost volume.
In one embodiment, before feature extraction is performed on the left view and the right view, normalization preprocessing is applied to them, followed by operations such as random cropping and contrast changes in order to improve the generalization of matching.
In one embodiment, the normalization preprocessing refers to numerically adjusting the input left view and right view so that their value ranges are the same in each dimension, in order to improve the training effect and stability of the model and reduce overfitting. Generalization refers to how well a machine learning model performs on data it has not seen; a model with good generalization can handle new data effectively rather than merely fitting the training data, so generalization performance is an important measure of model quality. Random cropping refers to randomly cropping regions of the left view and the right view to increase the diversity of the data and the generalization ability of the model. The contrast change refers to randomly adjusting the contrast of the left view and the right view so that the model adapts better to different illumination conditions and background noise.
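As a rough illustration of this preprocessing, the sketch below assumes PyTorch-style image tensors of shape [3, h, w]; the crop size and contrast range are arbitrary example values rather than values taken from the method.

```python
import torch

def preprocess_pair(left, right, crop_hw=(256, 512)):
    # Per-channel normalization so both views share a comparable value range.
    def normalize(img):
        mean = img.mean(dim=(1, 2), keepdim=True)
        std = img.std(dim=(1, 2), keepdim=True).clamp_min(1e-6)
        return (img - mean) / std

    left, right = normalize(left), normalize(right)

    # Identical random crop for both views, so corresponding pixels stay on the
    # same rows and the ground-truth disparity remains valid.
    _, h, w = left.shape
    ch, cw = crop_hw
    top = torch.randint(0, h - ch + 1, (1,)).item()
    lft = torch.randint(0, w - cw + 1, (1,)).item()
    left = left[:, top:top + ch, lft:lft + cw]
    right = right[:, top:top + ch, lft:lft + cw]

    # Random contrast change to simulate different illumination conditions.
    factor = 0.8 + 0.4 * torch.rand(1).item()
    return left * factor, right * factor
```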
After the above processing, features are extracted from the left view and the right view to construct a first scale cost volume and a second scale cost volume, where the scale of the first scale cost volume is larger than that of the second scale cost volume. In one embodiment, the first scale cost volume is a (b, c, 1/4d, 1/4h, 1/4w) cost volume and the second scale cost volume is a (b, c, 1/16d, 1/16h, 1/16w) cost volume, where b represents the number of image pairs fed into the neural network each time, c represents the number of feature channels, d represents a preset maximum parallax value, h represents the height of the left view and the right view, and w represents the width of the left view and the right view.
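The description does not state how a cost volume is assembled from the extracted features. As one common possibility, the sketch below builds a concatenation-based cost volume at a single scale (which would give 2c rather than c feature channels); this construction is an illustrative assumption, not the patented formulation.

```python
import torch

def build_cost_volume(feat_left, feat_right, max_disp):
    """feat_left/feat_right: [b, c, h, w] features at 1/4 (or 1/16) resolution;
    max_disp: the disparity range at that scale (e.g. d/4 or d/16).
    Returns a [b, 2c, max_disp, h, w] cost volume."""
    b, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(b, 2 * c, max_disp, h, w)
    for disp in range(max_disp):
        if disp == 0:
            volume[:, :c, disp] = feat_left
            volume[:, c:, disp] = feat_right
        else:
            # Pair each left pixel with the right pixel shifted by `disp`.
            volume[:, :c, disp, :, disp:] = feat_left[:, :, :, disp:]
            volume[:, c:, disp, :, disp:] = feat_right[:, :, :, :-disp]
    return volume
```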
Step S300: and carrying out cost aggregation on the first scale cost volume and the second scale cost volume.
Please refer to fig. 3, which is a schematic diagram of cost aggregation of the first scale cost volume and the second scale cost volume, wherein the upper branch in fig. 3 is a branch corresponding to the second scale cost volume, and the lower branch in fig. 3 is a branch corresponding to the first scale cost volume.
Referring to fig. 4, in one embodiment, performing step S300 to aggregate costs of the first scale cost volume and the second scale cost volume includes the following steps.
Step S310: downsampling the first scale cost volume to generate a first fused branch.
In one embodiment, when the first scale cost volume is downsampled, the scale of the first scale cost volume is changed from 1/4d,1/4h,1/4w to 1/8d,1/8h,1/8w, and then from 1/8d,1/8h,1/8w to 1/16d,1/16h,1/16w, so that the generated first fusion branch is the 1/16d,1/16h,1/16w cost volume, wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view.
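A possible reading of this downsampling step in PyTorch is sketched below; the use of stride-2 3D convolutions for each halving, and the kernel size, normalization and activation, are assumptions, since the description only fixes the scale schedule 1/4 → 1/8 → 1/16.

```python
import torch.nn as nn

class FirstBranchDownsample(nn.Module):
    """Reduce a [b, c, d/4, h/4, w/4] cost volume to [b, c, d/16, h/16, w/16]
    in two stride-2 steps, producing the first fusion branch."""
    def __init__(self, channels):
        super().__init__()
        self.down = nn.Sequential(
            # 1/4 -> 1/8 in the disparity, height and width dimensions
            nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            # 1/8 -> 1/16
            nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, first_scale_volume):
        return self.down(first_scale_volume)
```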
Step S320: the second scale cost volume is convolved to generate a second fused branch that maintains the scale of the second scale cost volume.
In one embodiment, the second scale cost volume is convolved with a two-layer 3D convolution of step size 1 to generate a second fused branch that maintains the scale of the second scale cost volume.
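A direct reading of this step is sketched below; only the two layers and the stride of 1 are specified, so the kernel size, normalization and activation are assumptions.

```python
import torch.nn as nn

def make_second_branch(channels):
    # Two 3D convolutions with stride 1 keep the second scale cost volume at its
    # original 1/16 resolution while refining it into the second fusion branch.
    return nn.Sequential(
        nn.Conv3d(channels, channels, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
        nn.Conv3d(channels, channels, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
    )
```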
Step S330: the first fused branch and the second fused branch are fused to generate an initial fused cost volume.
Please refer to fig. 5, which is a schematic diagram of the first fusion branch and the second fusion branch; the upper half of fig. 5 is the first fusion branch and the lower half is the second fusion branch. In one embodiment, when the first fusion branch and the second fusion branch are fused, both branches are subjected to maximum pooling and average pooling along the parallax dimension D, so that the 2D features of the first fusion branch and the 2D features of the second fusion branch are obtained correspondingly. The 2D features of the first fusion branch and the 2D features of the second fusion branch are then input into a 2D convolution layer to generate, correspondingly, a weight map of the first fusion branch and a weight map of the second fusion branch. Finally, the weight map of the first fusion branch and the weight map of the second fusion branch are summed to generate an initial fusion cost volume. In one embodiment, the initial fusion cost volume is a 1/16d, 1/16h, 1/16w cost volume, where d represents a preset maximum parallax value, h represents the height of the left view and the right view, and w represents the width of the left view and the right view.
In one embodiment, an attention mechanism is introduced during the fusion process to guide the convolutional neural network to select important matching information at different scales, so that the fusion branches corresponding to the cost volumes of the two scales are fully utilized. The first fusion branch and the second fusion branch are max-pooled and average-pooled along the parallax dimension to obtain the matching information of each pixel in the parallax dimension, yielding three weight feature maps for each fusion branch: the weight feature map corresponding to maximum pooling, the weight feature map corresponding to average pooling, and the weight feature map corresponding to maximum pooling − average pooling. The three weight feature maps of each branch are spliced and input into a 2D convolution layer to obtain a weight map for each pixel. The weights of pixels at the same position in the two fusion branches are then summed to generate the initial fusion cost volume. The calculation is specifically given by the following formulas:
M(c,h,w) = Maxpool(Vk(c,d,h,w))
wherein M(c,h,w) represents the weight feature map corresponding to maximum pooling, Maxpool represents the maximum pooling calculation, Vk(c,d,h,w) represents the cost volume of the k-th fusion branch, c represents a channel, d represents the preset maximum parallax value, h represents the height of the left view and the right view, and w represents the width of the left view and the right view.
A(c,h,w) = Avgpool(Vk(c,d,h,w))
wherein A(c,h,w) represents the weight feature map corresponding to average pooling and Avgpool represents the average pooling calculation; the remaining symbols are as defined above.
I(3c,h,w) = Concat(M(c,h,w), A(c,h,w), M(c,h,w) − A(c,h,w))
wherein I(3c,h,w) represents the spliced feature map of the three weight maps, Concat represents feature concatenation, and the third term is the weight feature map corresponding to maximum pooling − average pooling.
V(c,d,h,w) = V1(c,d,h,w) ⊙ PWNet(I1(3c,h,w)) + V2(c,d,h,w) ⊙ PWNet(I2(3c,h,w))
wherein V(c,d,h,w) represents the initial fusion cost volume, V1(c,d,h,w) and V2(c,d,h,w) represent the cost volumes of the first fusion branch and the second fusion branch, ⊙ represents element-wise multiplication of a branch with its weight map, PWNet represents the 2D convolutional neural network that produces the pixel-wise weight maps, and I1(3c,h,w) and I2(3c,h,w) represent the spliced feature maps of the three weight maps of the first fusion branch and the second fusion branch respectively; c represents a channel, d the preset maximum parallax value, h the height of the left view and the right view, and w the width of the left view and the right view.
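Putting the formulas above together, the following PyTorch-style sketch shows one way the adaptive fusion could be realized. The single-convolution PWNet with a sigmoid, the sharing of PWNet between the two branches, and the use of subtraction for the maximum pooling − average pooling map are assumptions of this example and are not fixed by the description.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse two cost-volume branches of equal size [b, c, d, h, w] with
    pixel-wise weights learned from statistics along the disparity dimension."""
    def __init__(self, channels):
        super().__init__()
        # 2D network mapping the 3c-channel pooled statistics to per-pixel weights.
        self.pwnet = nn.Sequential(
            nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def _stats(self, volume):
        # Max pooling and average pooling along the disparity dimension (dim=2),
        # plus their difference, spliced into a [b, 3c, h, w] feature map I.
        m, _ = volume.max(dim=2)
        a = volume.mean(dim=2)
        return torch.cat([m, a, m - a], dim=1)

    def forward(self, branch1, branch2):
        w1 = self.pwnet(self._stats(branch1)).unsqueeze(2)  # [b, c, 1, h, w]
        w2 = self.pwnet(self._stats(branch2)).unsqueeze(2)
        # Weighted sum of the two branches gives the (initial) fusion cost volume.
        return branch1 * w1 + branch2 * w2
```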
Step S340: updating the initial fusion cost volume to generate a fusion cost volume.
Referring to fig. 6, in one embodiment, after the initial fusion cost volume is generated, the initial fusion cost volume is updated, and the whole updating process includes the following steps.
Step S341: and sequentially up-sampling and down-sampling the initial fusion cost volume, and updating the first fusion branch according to the down-sampling result.
In one embodiment, the initial fusion cost volume is 1/16d,1/16h,1/16w cost volume, so that the initial fusion cost volume is up-sampled, the scale of the initial fusion cost volume is changed from 1/16d,1/16h,1/16w to 1/8d,1/8h,1/8w, and then from 1/8d,1/8h,1/8w to 1/4d,1/4h,1/4w. After the up-sampling is completed, the down-sampling is performed, the scale of the initial fusion cost volume is changed from 1/4d,1/4h,1/4w to 1/8d,1/8h,1/8w cost volume, and then from 1/8d,1/8h,1/8w to 1/16d,1/16h,1/16w, wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view. After the initial fusion cost volume finishes up-sampling and down-sampling once, updating the first fusion branch according to the down-sampling result, wherein the updated first fusion branch is still 1/16d,1/16h and 1/16w cost volume, but the up-sampling and the down-sampling of one round are already carried out, and at the moment, the updated first fusion branch better captures detailed information in the left view and the right view, so that the understanding capability and the recognition accuracy of the left view and the right view are further improved.
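One round of this update could look like the hourglass sketched below; transposed 3D convolutions for upsampling and strided 3D convolutions for downsampling are assumptions, since the description only fixes the 1/16 → 1/8 → 1/4 → 1/8 → 1/16 scale schedule.

```python
import torch.nn as nn

class FusedVolumeHourglass(nn.Module):
    """One update round of the fused cost volume: upsample 1/16 -> 1/8 -> 1/4,
    then downsample 1/4 -> 1/8 -> 1/16 to obtain the updated first fusion branch."""
    def __init__(self, channels):
        super().__init__()
        def up():
            return nn.Sequential(
                nn.ConvTranspose3d(channels, channels, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm3d(channels), nn.ReLU(inplace=True))
        def down():
            return nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm3d(channels), nn.ReLU(inplace=True))
        self.up1, self.up2 = up(), up()          # 1/16 -> 1/8 -> 1/4
        self.down1, self.down2 = down(), down()  # 1/4 -> 1/8 -> 1/16

    def forward(self, fused_volume):
        x = self.up2(self.up1(fused_volume))
        return self.down2(self.down1(x))
```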
Step S342: the second scale cost volume is convolved to maintain a second fused branch.
In one embodiment, while the first fusion branch is updated, the second scale cost volume is still convolved with the two-layer 3D convolution of stride 1, thereby maintaining the second fusion branch.
Step S343: and fusing the updated first fused branch and the updated second fused branch to update the initial fused cost roll.
In one embodiment, the updated first fusion branch and the second fusion branch are fused, which completes the update of the initial fusion cost volume; the fusion process is the same as that used to generate the initial fusion cost volume in step S330. Steps S341-S343 are repeated until the set number of times is reached, and the initial fusion cost volume finally output by step S343 is output as the fusion cost volume. In one embodiment, the set number of times is 2, i.e., two update cycles are performed.
Step S400: and generating a dense parallax map according to the fusion cost volume so as to calculate the parallax of the left view and the right view.
The stereo matching method first acquires a left view and a right view from the binocular camera, then performs feature extraction on them to construct a first scale cost volume and a second scale cost volume, and then downsamples the first scale cost volume to generate a first fusion branch; the second scale cost volume is convolved to generate a second fusion branch that maintains the scale of the second scale cost volume. The first fusion branch and the second fusion branch are fused to generate an initial fusion cost volume, which is then updated by sequential upsampling and downsampling. These downsampling and upsampling operations change the scale of the initial fusion cost volume from large to small and then from small to large, so that detailed information in the left view and the right view is captured better and the understanding and recognition accuracy for the two views is improved. The initial fusion cost volume obtained after the set number of updates is output as the fusion cost volume, and the parallaxes of the left view and the right view are calculated from the dense parallax map generated from the fusion cost volume. By learning cost volumes of different scales and fusing the information obtained from them, the stereo matching method improves the accuracy of its matching effect.
Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware or by a computer program. When all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a computer readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, a hard disk, and the like; the above functions are realized when the program is executed by a computer. For example, the program may be stored in the memory of a device, and all or part of the functions described above are realized when the program in the memory is executed by a processor. The program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied into the memory of a local device, or used to update the system version of the local device; the above functions are then realized when the program in the memory is executed by a processor.
The foregoing description of the invention has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the invention pertains, based on the idea of the invention.

Claims (10)

1. The self-adaptive fusion stereo matching method is characterized by comprising the following steps:
acquiring a left view and a right view shot by a binocular camera;
feature extraction is carried out on the left view and the right view to construct a first scale cost volume and a second scale cost volume, and the scale of the first scale cost volume is larger than that of the second scale cost volume;
downsampling the first scale cost volume to generate a first fused branch; convolving the second scale cost volume to generate a second fused branch that maintains the scale of the second scale cost volume; fusing the first fused branch and the second fused branch to generate an initial fused cost volume;
the method comprises the steps of obtaining matching information of a single pixel point in a parallax dimension by using an attention mechanism, and correspondingly determining a maximum pooled weight feature map, an average pooled weight feature map and a maximum pooled-average pooled weight feature map of a first fused branch and a second fused branch; inputting weight feature graphs corresponding to the first fusion branch and the second fusion branch into a 2D convolution layer to correspondingly determine the 2D features of the first fusion branch and the 2D features of the second fusion branch; inputting the 2D characteristics of the first fusion branch and the 2D characteristics of the second fusion branch into a 2D convolution layer to correspondingly generate a weight map of the first fusion branch and a weight map of the second fusion branch; summing the weight map of the first fusion branch and the weight map of the second fusion branch to generate an initial fusion cost volume;
updating the initial fusion cost volume to generate a fusion cost volume, wherein the updating of the initial fusion cost volume comprises the following steps:
sequentially upsampling and downsampling the initial fusion cost volume to update the first fusion branch; convolving the second scale cost volume to maintain the second fusion branch; fusing the updated first fusion branch and the second fusion branch to update the initial fusion cost volume;
repeating the updating of the initial fusion cost volume until the set times are reached, and outputting the initial fusion cost volume as a fusion cost volume;
and generating a dense parallax map according to the fusion cost volume so as to calculate the parallax of the left view and the right view.
2. The method of stereo matching for adaptive fusion of claim 1, wherein the first scale cost volume comprises 1/4d,1/4h,1/4w cost volumes; the second scale cost volume comprises 1/16d,1/16h and 1/16w cost volumes; wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view.
3. The method of stereo matching for adaptive fusion of claim 2, wherein downsampling the first scale cost volume to generate a first fused branch comprises:
when the first scale cost volume is downsampled, the first scale cost volume is changed from 1/4d,1/4h,1/4w cost volume to 1/8d,1/8h and 1/8w cost volume, and then is changed from 1/8d,1/8h,1/8w cost volume to 1/16d,1/16h and 1/16w cost volume; the first fusion branch comprises 1/16d,1/16h and 1/16w cost volumes; wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view.
4. The method of stereo matching for adaptive fusion of claim 2, wherein convolving the second scale cost volume to generate a second fused branch that maintains the scale of the second scale cost volume, comprises:
and convolving the second scale cost volume by using a two-layer 3D convolution with a step length of 1 to generate a second fusion branch maintaining the scale of the second scale cost volume.
5. The method for stereo matching for adaptive fusion according to claim 2,
the initial fusion cost volume comprises 1/16d,1/16h and 1/16w cost volumes; wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view.
6. The stereo matching method of adaptive fusion of claim 5, wherein the sequentially upsampling and downsampling the initial fusion cost volume comprises:
when the initial fusion cost volume is up-sampled, the initial fusion cost volume is changed from 1/16d,1/16h,1/16w cost volume to 1/8d,1/8h and 1/8w cost volume, and then from 1/8d,1/8h and 1/8w cost volume to 1/4d,1/4h and 1/4w cost volume;
when the initial fusion cost volume is downsampled, the initial fusion cost volume is changed from 1/4d,1/4h,1/4w cost volume to 1/8d,1/8h and 1/8w cost volume, and then from 1/8d,1/8h and 1/8w cost volume to 1/16d,1/16h and 1/16w cost volume;
wherein d represents a preset maximum parallax value, h represents the heights of the left view and the right view, and w represents the widths of the left view and the right view.
7. The stereo matching method of adaptive fusion according to claim 1, wherein the set number of times is 2.
8. The method for stereo matching of adaptive fusion according to claim 1, wherein the acquiring the left view and the right view photographed by the binocular camera comprises:
calibrating the binocular camera;
and matching the original left view and the original right view shot by the binocular camera according to the camera calibration result, and correcting the matched original left view and original right view to generate the left view and the right view shot by the binocular camera.
9. The method of stereo matching for adaptive fusion according to claim 8, wherein the correcting the matched original left view and original right view to generate the left view and right view captured by the binocular camera comprises:
and correcting the matched original left view and original right view by using the Bouguet epipolar rectification method so as to generate the left view and the right view shot by the binocular camera.
10. A computer-readable storage medium, characterized in that the medium has stored thereon a program executable by a processor to implement the stereo matching method as claimed in any one of claims 1-9.
CN202311317490.3A 2023-10-12 2023-10-12 Self-adaptive fusion stereo matching method Active CN117058252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311317490.3A CN117058252B (en) 2023-10-12 2023-10-12 Self-adaptive fusion stereo matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311317490.3A CN117058252B (en) 2023-10-12 2023-10-12 Self-adaptive fusion stereo matching method

Publications (2)

Publication Number Publication Date
CN117058252A CN117058252A (en) 2023-11-14
CN117058252B true CN117058252B (en) 2023-12-26

Family

ID=88663116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311317490.3A Active CN117058252B (en) 2023-10-12 2023-10-12 Self-adaptive fusion stereo matching method

Country Status (1)

Country Link
CN (1) CN117058252B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101852609A (en) * 2010-06-02 2010-10-06 北京理工大学 Ground obstacle detection method based on binocular stereo vision of robot
CN106355570A (en) * 2016-10-21 2017-01-25 昆明理工大学 Binocular stereoscopic vision matching method combining depth characteristics
CN111191694A (en) * 2019-12-19 2020-05-22 浙江科技学院 Image stereo matching method
EP3822910A1 (en) * 2019-11-14 2021-05-19 Samsung Electronics Co., Ltd. Depth image generation method and device
CN115170636A (en) * 2022-06-17 2022-10-11 五邑大学 Binocular stereo matching method and device for mixed cost body and storage medium
CN116740162A (en) * 2023-08-14 2023-09-12 东莞市爱培科技术有限公司 Stereo matching method based on multi-scale cost volume and computer storage medium
CN116777971A (en) * 2023-05-29 2023-09-19 北京计算机技术及应用研究所 Binocular stereo matching method based on horizontal deformable attention module

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210481A (en) * 2020-01-10 2020-05-29 大连理工大学 Depth estimation acceleration method of multiband stereo camera

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101852609A (en) * 2010-06-02 2010-10-06 北京理工大学 Ground obstacle detection method based on binocular stereo vision of robot
CN106355570A (en) * 2016-10-21 2017-01-25 昆明理工大学 Binocular stereoscopic vision matching method combining depth characteristics
EP3822910A1 (en) * 2019-11-14 2021-05-19 Samsung Electronics Co., Ltd. Depth image generation method and device
CN111191694A (en) * 2019-12-19 2020-05-22 浙江科技学院 Image stereo matching method
CN115170636A (en) * 2022-06-17 2022-10-11 五邑大学 Binocular stereo matching method and device for mixed cost body and storage medium
CN116777971A (en) * 2023-05-29 2023-09-19 北京计算机技术及应用研究所 Binocular stereo matching method based on horizontal deformable attention module
CN116740162A (en) * 2023-08-14 2023-09-12 东莞市爱培科技术有限公司 Stereo matching method based on multi-scale cost volume and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Binocular image super-resolution reconstruction based on a multi-level fusion attention network; Xu Lei et al.; Journal of Image and Graphics (No. 4); pp. 1079-1090 *

Also Published As

Publication number Publication date
CN117058252A (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN108961327B (en) Monocular depth estimation method and device, equipment and storage medium thereof
CN110909693B (en) 3D face living body detection method, device, computer equipment and storage medium
CN109919993B (en) Parallax map acquisition method, device and equipment and control system
US20020015048A1 (en) System and method for median fusion of depth maps
KR100681320B1 (en) Method for modelling three dimensional shape of objects using level set solutions on partial difference equation derived from helmholtz reciprocity condition
JP2010513907A (en) Camera system calibration
WO2012100225A1 (en) Systems and methods for generating a three-dimensional shape from stereo color images
JP2021196951A (en) Image processing apparatus, image processing method, program, method for manufacturing learned model, and image processing system
US8433187B2 (en) Distance estimation systems and method based on a two-state auto-focus lens
CN110782412A (en) Image processing method and device, processor, electronic device and storage medium
US20230410341A1 (en) Passive and single-viewpoint 3d imaging system
CN115147709B (en) Underwater target three-dimensional reconstruction method based on deep learning
CN112509021A (en) Parallax optimization method based on attention mechanism
CN114170290A (en) Image processing method and related equipment
CN110070610B (en) Feature point matching method, and feature point matching method and device in three-dimensional reconstruction process
CN114742875A (en) Binocular stereo matching method based on multi-scale feature extraction and self-adaptive aggregation
CN110335228B (en) Method, device and system for determining image parallax
CN113313740B (en) Disparity map and surface normal vector joint learning method based on plane continuity
CN113034666B (en) Stereo matching method based on pyramid parallax optimization cost calculation
CN112270701B (en) Parallax prediction method, system and storage medium based on packet distance network
CN117058252B (en) Self-adaptive fusion stereo matching method
WO2009099117A1 (en) Plane parameter estimating device, plane parameter estimating method, and plane parameter estimating program
CN117058183A (en) Image processing method and device based on double cameras, electronic equipment and storage medium
CN112950653B (en) Attention image segmentation method, device and medium
CN114782507A (en) Asymmetric binocular stereo matching method and system based on unsupervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant