CN117576428A - Hierarchical parallel aggregation calculation method and device for stereo matching

Info

Publication number: CN117576428A
Application number: CN202311350821.3A
Authority: CN
Inventors: 赵昀; 杨文邦; 钱刃; 刘钢; 赵勇
Original assignee: Naro Era Technology Shenzhen Co ltd
Current assignee: Naro Era Technology Shenzhen Co ltd
Priority date: 2023-10-18
Filing date: 2023-10-18
Publication date: 2024-02-20

Abstract

The application discloses a hierarchical parallel aggregation calculation method and device for stereo matching. First, features are extracted from two original stereo images that were captured at the same moment and have undergone epipolar rectification, yielding low-resolution feature maps of at least two different scales. Next, the low-resolution feature maps of different scales are hierarchically aggregated to obtain cost volumes of at least one scale. Finally, the cost volumes of each scale are aggregated in parallel, and the feature map of preset size obtained by parallel aggregation is used to predict the disparity map. Because context information from the multi-scale cost volumes is fused into one integrated cost volume, and several three-dimensional dilated convolutions then capture the global and local cues of that context information simultaneously, the disparity map can finally be obtained through disparity regression, and stereo matching performance is greatly improved.

Description

Hierarchical parallel aggregation calculation method and device for stereo matching

Technical Field

The invention relates to the technical field of machine vision stereo matching, in particular to a hierarchical parallel aggregation calculation method and device for stereo matching.

Background

Stereo matching, also known as disparity estimation or binocular depth estimation, has been widely studied as one of the core techniques of computer vision and is indispensable for many applications such as autonomous driving, robot navigation and three-dimensional reconstruction. Accurate disparity estimation from rectified stereo images is essential for many computer vision tasks. The input of stereo matching is two images (a left image I_l and a right image I_r), and the output is a disparity map d composed of the disparity values of each pixel in a reference image (the left image is typically taken as the reference image). Referring to fig. 1, which is a schematic diagram of disparity map acquisition, disparity is the pixel-level difference between the positions of the corresponding points, in the left and right images, of a point in the three-dimensional scene. After the disparity map d is acquired, a depth map can be obtained according to the following depth formula:

z = (b × f) / d;

where f is the focal length of the camera lens, b is the distance between the centers of the two cameras, d is the disparity, and z is the depth value corresponding to the predicted disparity d of the matching pixels in the left and right images. How to predict disparity accurately and quickly from a given pair of rectified stereo images under limited computing resources is a core problem in stereo matching.
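As a minimal sketch of the depth formula above, the following Python snippet converts disparities to depths. It assumes the focal length is expressed in pixels and the baseline in metres (so z comes out in metres); the function name and the rig values are hypothetical and chosen only for illustration.

```python
def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """Return depth z = (b * f) / d for one pixel, or None when d is not positive."""
    if disparity_px <= 0:
        return None  # zero disparity corresponds to a point at infinity
    return (baseline_m * focal_px) / disparity_px


# Hypothetical rig: 0.12 m baseline, 700 px focal length.
disparity_row = [35.0, 70.0, 0.0]
depth_row = [depth_from_disparity(d, 0.12, 700.0) for d in disparity_row]
print(depth_row)
```

Because depth is inversely proportional to disparity, halving the disparity doubles the depth, which is why small disparity errors matter most for distant points.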

Disclosure of Invention

The main technical problem solved by the invention is how to construct a stereo matching calculation method capable of capturing context information representations.

According to a first aspect, in one embodiment, there is provided a hierarchical parallel aggregation computing method for stereo matching, including:

performing feature extraction on two original stereo images that were captured at the same moment and have undergone epipolar rectification, to obtain low-resolution feature maps of at least two different scales;

performing hierarchical aggregation on the low-resolution feature maps of different scales to obtain cost volumes of at least one scale;

and performing parallel aggregation on the cost volumes of each scale, and using the feature map of preset size obtained by parallel aggregation to predict the disparity map.

In an embodiment, the feature extraction on the two original stereo images captured at the same moment and subjected to epipolar rectification includes:

performing feature extraction on each of the two original images through a twin (Siamese) feature extraction network with shared weights, to obtain low-resolution feature maps of a first scale, a second scale, a third scale and a fourth scale respectively, wherein the values of the first scale, the second scale, the third scale and the fourth scale decrease proportionally in sequence.

In an embodiment, the twin feature extraction network includes a convolution layer with a 3x3 convolution kernel, four residual blocks, and two dilated (atrous) convolution blocks.

In an embodiment, the feature extraction on the two original stereo images captured at the same moment and subjected to epipolar rectification further includes:

regularizing each low-resolution feature map through two preset convolution layers.

In an embodiment, the regularizing of each low-resolution feature map through two preset convolution layers includes:

providing a batch normalization layer and a rectified linear unit (ReLU) activation layer after each convolution layer in the twin feature extraction network except the last convolution layer.

In one embodiment, the values of the first scale, the second scale, the third scale and the fourth scale are 1/4, 1/8, 1/16 and 1/32, respectively.

In an embodiment, the hierarchical aggregation of the low-resolution feature maps of different scales includes downsampling aggregation and/or upsampling aggregation;

the downsampling aggregation includes:

downsampling the low-resolution feature map with the higher scale value to obtain a low-resolution feature map whose scale value equals the lower scale value;

performing a same-scale convolution operation on the new low-resolution feature map obtained by downsampling and the original low-resolution feature map of the same scale value, so as to obtain the cost volume corresponding to the feature map with the higher scale value;

the upsampling aggregation includes:

upsampling the low-resolution feature map with the lower scale value to obtain a low-resolution feature map whose scale value equals the higher scale value;

and performing a same-scale convolution operation on the new low-resolution feature map obtained by upsampling and the original low-resolution feature map of the same scale value, so as to obtain the cost volume corresponding to the feature map with the lower scale value.

In an embodiment, the parallel aggregation of the cost volumes of each scale includes:

applying a three-dimensional convolution with a preset stride to each cost volume, and then using another three-dimensional convolution to reduce the feature size of the cost volume to 1/8, so as to obtain a cost volume to be dilated;

performing parallel dilated convolutions on each cost volume to be dilated, so as to output dilated feature maps equal in number and size to the cost volumes to be dilated;

concatenating the dilated feature maps to obtain a combined feature map that merges the feature maps of all cost volumes;

and inputting the combined feature map into a three-dimensional convolution model to obtain the feature map of preset size output by that model, wherein the three-dimensional convolution model comprises two cascaded three-dimensional convolution layers, the latter of which is a deconvolution layer with a stride of 2.

According to a second aspect, an embodiment provides a computer readable storage medium having stored thereon a program executable by a processor to implement a hierarchical parallel aggregation computing method as described above.

According to a third aspect, there is provided in an embodiment a hierarchical parallel aggregation computing device for stereo matching for applying the hierarchical parallel aggregation computing method as described above, the hierarchical parallel aggregation computing device comprising:

the twin feature extraction neural network unit, used for performing feature extraction on two original stereo images that were captured at the same moment and have undergone epipolar rectification, so as to obtain low-resolution feature maps of at least two different scales;

the hierarchical aggregation neural network unit, used for performing hierarchical aggregation on the low-resolution feature maps of different scales, so as to obtain cost volumes of at least one scale;

and the parallel aggregation neural network unit, used for performing parallel aggregation on the cost volumes of each scale and using the feature map of preset size obtained by parallel aggregation to predict the disparity map.

According to the hierarchical parallel aggregation computing method of the embodiment, context information from the multi-scale cost volumes is fused into one integrated cost volume, and several three-dimensional dilated convolutions then capture the global and local cues of that context information simultaneously, so that the disparity map can finally be obtained through disparity regression and stereo matching performance is greatly improved.

Drawings

FIG. 1 is a schematic diagram of disparity map acquisition;

FIG. 2 is a schematic workflow diagram of a stereo matching system in one embodiment;

FIG. 3 is a flow diagram of a hierarchical parallel aggregation computing method in one embodiment;

FIG. 4 is a block diagram of a hierarchical parallel aggregation computing device in one embodiment;

FIG. 5 is a flow chart of a hierarchical parallel aggregation computing method according to another embodiment.

Detailed Description

The invention is described in further detail below with reference to the drawings by means of specific embodiments. Like elements in different embodiments are given associated like reference numerals. In the following embodiments, numerous specific details are set forth to provide a better understanding of the present application. However, one skilled in the art will readily recognize that some of these features may be omitted, or replaced by other elements, materials or methods, in different situations. In some instances, certain operations related to the present application are not shown or described in the specification, in order to avoid obscuring its core portions; a detailed description of these operations is not necessary, since a person skilled in the art can fully understand them from the description herein and from general technical knowledge in the field.

Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.

The numbering of components, e.g. "first", "second", etc., is used herein merely to distinguish between the described objects and does not carry any sequential or technical meaning. The terms "coupled" and "connected," as used herein, encompass both direct and indirect connection (coupling), unless otherwise indicated.

Please refer to fig. 2, which is a schematic diagram of the workflow of a stereo matching system in an embodiment. The left and right images captured by two image acquisition devices at the same moment are acquired first; the devices are then calibrated according to their relative positions and acquisition parameters; epipolar rectification is then performed according to the calibration values; and finally a stereo matching algorithm is applied to compute the disparity map. At present, stereo matching networks based on deep learning generally construct a cost volume at a single scale, regularize it and regress the disparity. However, none of these methods uses multi-scale context information, which limits disparity prediction performance in ill-posed regions.

The embodiment of the application proposes fusing context information from the multi-scale cost volumes into one integrated cost volume through hierarchical aggregation and parallel aggregation, and capturing the global and local cues of that context information simultaneously with several three-dimensional dilated convolutions. The disparity map can finally be obtained by disparity regression, so that stereo matching performance is greatly improved.

Embodiment one:

referring to fig. 3, a flow chart of a hierarchical parallel aggregation computing method according to an embodiment includes:

Step 101, feature extraction.

Feature extraction is performed on two original stereo images that were captured at the same moment and have undergone epipolar rectification, to obtain low-resolution feature maps of at least two different scales. In an embodiment, feature extraction is performed on the two original images through a twin (Siamese) feature extraction network with shared weights, to obtain low-resolution feature maps of a first scale, a second scale, a third scale and a fourth scale, where the values of the four scales decrease proportionally in sequence. In one embodiment, the values of the first scale, the second scale, the third scale and the fourth scale are 1/4, 1/8, 1/16 and 1/32, respectively. In one embodiment, the twin feature extraction network includes a convolution layer with a 3x3 convolution kernel, four residual blocks, and two dilated (atrous) convolution blocks. In one embodiment, a batch normalization layer and a rectified linear unit (ReLU) activation layer are provided after each convolution layer in the twin feature extraction network except the last convolution layer. In one embodiment, the feature extraction on the two rectified original stereo images further includes regularizing each low-resolution feature map through two preset convolution layers.
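To make the four scale values concrete, the following sketch computes the spatial size of the feature map at each scale for a given input image. The 256 x 512 input size and the function name are assumptions for illustration only; the rounding rule (floor division) is also an assumption, since the text does not state how non-divisible sizes are handled.

```python
from fractions import Fraction

# Scale values named in the embodiment.
SCALES = [Fraction(1, 4), Fraction(1, 8), Fraction(1, 16), Fraction(1, 32)]


def feature_map_sizes(height, width, scales=SCALES):
    """Return {scale: (H, W)} for each scale, using floor division (an assumption)."""
    return {
        str(s): (height * s.numerator // s.denominator,
                 width * s.numerator // s.denominator)
        for s in scales
    }


# Hypothetical input resolution.
sizes = feature_map_sizes(256, 512)
for scale, (h, w) in sizes.items():
    print(scale, h, w)
```

Each successive scale halves both spatial dimensions, which matches the stride-2 downsampling between scales.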

Step 102, hierarchical aggregation.

The low-resolution feature maps of different scales are hierarchically aggregated to obtain cost volumes of at least one scale. In one embodiment, the hierarchical aggregation includes downsampling aggregation and/or upsampling aggregation.

The downsampling aggregation includes:

first, downsampling the low-resolution feature map with the higher scale value to obtain a low-resolution feature map whose scale value equals the lower scale value; then, performing a same-scale convolution operation on the new low-resolution feature map obtained by downsampling and the original low-resolution feature map of the same scale value, so as to obtain the cost volume corresponding to the feature map with the higher scale value.

The upsampling aggregation includes:

first, upsampling the low-resolution feature map with the lower scale value to obtain a low-resolution feature map whose scale value equals the higher scale value; then, performing a same-scale convolution operation on the new low-resolution feature map obtained by upsampling and the original low-resolution feature map of the same scale value, so as to obtain the cost volume corresponding to the feature map with the lower scale value.
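The two aggregation directions can be illustrated on toy one-dimensional feature rows. This is only a sketch under stated assumptions: average pooling and nearest-neighbour repetition stand in for the learned strided convolution and deconvolution, the fixed 0.5 weights stand in for a learned 1x1 convolution, and all function names are hypothetical.

```python
def downsample_stride2(channels):
    """Halve each channel by averaging neighbouring pairs (stand-in for a strided conv)."""
    return [[(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]
            for row in channels]


def upsample_x2(channels):
    """Double each channel by repeating every value (stand-in for a deconvolution)."""
    return [[v for v in row for _ in range(2)] for row in channels]


def concat_channels(a, b):
    """Concatenate two feature maps along the channel axis."""
    assert len(a[0]) == len(b[0]), "spatial sizes must match before concatenation"
    return a + b


def pointwise_mix(channels, weights):
    """1x1 convolution: each output channel is a weighted sum of the input channels."""
    length = len(channels[0])
    return [[sum(w * ch[i] for w, ch in zip(w_row, channels)) for i in range(length)]
            for w_row in weights]


high = [[1.0, 3.0, 5.0, 7.0]]  # one channel at the higher scale value
low = [[10.0, 20.0]]           # one channel at the lower scale value

# Downsampling aggregation: bring `high` to the size of `low`, then fuse.
down_agg = pointwise_mix(concat_channels(downsample_stride2(high), low), [[0.5, 0.5]])
# Upsampling aggregation: bring `low` to the size of `high`, then fuse.
up_agg = pointwise_mix(concat_channels(upsample_x2(low), high), [[0.5, 0.5]])
print(down_agg, up_agg)
```

In both directions the resampling step exists only to make the spatial sizes agree, so that the two maps can be fused channel-wise at a common scale.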

Step 103, parallel aggregation.

The cost volumes of each scale are aggregated in parallel, and the feature map of preset size obtained by parallel aggregation is used to predict the disparity map. In one embodiment, the parallel aggregation of the cost volumes of each scale includes:

First, a three-dimensional convolution with a preset stride is applied to each cost volume, and another three-dimensional convolution is then used to reduce the feature size of the cost volume to 1/8, so as to obtain the cost volume to be dilated.

Then, parallel dilated convolutions are performed on each cost volume to be dilated, so as to output dilated feature maps equal in number and size to the cost volumes to be dilated.

Next, the dilated feature maps are concatenated to obtain a combined feature map that merges the feature maps of all cost volumes.

Finally, the combined feature map is input into a three-dimensional convolution model to obtain the feature map of preset size output by that model, wherein the three-dimensional convolution model comprises two cascaded three-dimensional convolution layers, the latter of which is a deconvolution layer with a stride of 2.
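The idea behind running several dilated (expansion) convolutions in parallel can be sketched in one dimension. The example below is an illustration rather than the network described here: it uses a fixed all-ones kernel of odd length, zero padding, and hypothetical dilation rates 1 to 4, and it shows that every branch keeps the input size while reaching progressively farther neighbours.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1D cross-correlation with zero padding and the given dilation.

    Assumes an odd-length kernel so that it is centred on each output position.
    """
    half = (len(kernel) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # tap positions spread out with the dilation
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def parallel_dilated_branches(signal, kernel, dilations):
    """Run one branch per dilation rate and stack the outputs as channels."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


signal = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # unit impulse at index 4
branches = parallel_dilated_branches(signal, [1.0, 1.0, 1.0], dilations=(1, 2, 3, 4))
for rate, branch in zip((1, 2, 3, 4), branches):
    print(rate, branch)
```

With dilation 1 the impulse influences only its immediate neighbours, while with dilation 4 it reaches both ends of the row, so stacking the branches combines local and wider context at the same output size.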

Referring to fig. 4, which is a block diagram of a hierarchical parallel aggregation computing device in an embodiment, the embodiment of the present application further discloses a hierarchical parallel aggregation computing device configured to apply the hierarchical parallel aggregation computing method described above. The device includes a twin feature extraction neural network unit 100, a hierarchical aggregation neural network unit 200, and a parallel aggregation neural network unit 300. The twin feature extraction neural network unit 100 is configured to perform feature extraction on two original stereo images that were captured at the same moment and have undergone epipolar rectification, so as to obtain low-resolution feature maps of at least two different scales. The hierarchical aggregation neural network unit 200 is configured to perform hierarchical aggregation on the low-resolution feature maps of different scales, so as to obtain cost volumes of at least one scale. The parallel aggregation neural network unit 300 is configured to perform parallel aggregation on the cost volumes of each scale and to use the feature map of preset size obtained by parallel aggregation to predict the disparity map.

According to the hierarchical parallel aggregation calculation method disclosed in the embodiment of the application, features are first extracted from two original stereo images that were captured at the same moment and have undergone epipolar rectification, to obtain low-resolution feature maps of at least two different scales; the low-resolution feature maps of different scales are then hierarchically aggregated to obtain cost volumes of at least one scale; and finally the cost volumes of each scale are aggregated in parallel, and the feature map of preset size obtained by parallel aggregation is used to predict the disparity map. Context information from the multi-scale cost volumes is fused into one integrated cost volume, and several three-dimensional dilated convolutions then capture the global and local cues of that context information simultaneously, so that the disparity map can finally be obtained through disparity regression and stereo matching performance is greatly improved.

The flow of the hierarchical parallel aggregation computing method disclosed in the present application is described below by way of a specific embodiment.

Referring to fig. 5, a flow chart of a hierarchical parallel aggregation computing method according to another embodiment specifically includes:

feature extraction:

The process starts with a twin (Siamese) feature extraction network with shared weights, which takes a pair of images as input. The twin feature extraction network first uses 3 convolution layers with 3x3 kernels, 4 residual blocks, and 2 dilated convolution blocks. Specifically, the network includes two convolution layers with a stride of 2 to obtain a 1/4-scale feature map. In addition, three further downsampling blocks, each with a stride of 2, are used to obtain the low-resolution feature maps at 1/8, 1/16 and 1/32 scale. In one embodiment, to construct the concatenation cost volume, the four scale feature maps are regularized using two further convolution layers, where every convolution except the last is followed by a batch normalization layer and a rectified linear unit (ReLU) activation layer.

In one embodiment, network performance may be improved by using group-wise correlation cost volumes in combination with concatenation volumes.

Here < sum > is the concatenation operator. The size of each multi-scale cost volume is channel C multiplied by the multi-scale coefficient α applied to (depth × height × width), where α takes the values 1/4, 1/8, 1/16 and 1/32 respectively. The channel count C is 32, 64, 128 in order from the high scale to the low scale.
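The following sketch shows one common way to build the two kinds of volume and join them along the channel axis. It assumes that the two volume types named above correspond to what is usually called a group-wise correlation volume and a concatenation volume; the formula in the original text is not reproduced here, so the definitions below are an assumption, restricted to a single image row for brevity, and all names are hypothetical.

```python
def shift_right(row, d):
    """Shift a row by d pixels: index x receives the value originally at x - d."""
    return [row[x - d] if x - d >= 0 else 0.0 for x in range(len(row))]


def groupwise_correlation(left, right, max_disp, num_groups):
    """cost[g][d][x]: mean of left*right products over the channels in group g."""
    per_group = len(left) // num_groups
    width = len(left[0])
    volume = []
    for g in range(num_groups):
        group = range(g * per_group, (g + 1) * per_group)
        planes = []
        for d in range(max_disp):
            shifted = {c: shift_right(right[c], d) for c in group}
            planes.append([sum(left[c][x] * shifted[c][x] for c in group) / per_group
                           for x in range(width)])
        volume.append(planes)
    return volume


def concatenation_volume(left, right, max_disp):
    """cost[c][d][x]: left channels (masked where x < d) followed by shifted right channels."""
    left_part = [[[v if x >= d else 0.0 for x, v in enumerate(row)]
                  for d in range(max_disp)] for row in left]
    right_part = [[shift_right(row, d) for d in range(max_disp)] for row in right]
    return left_part + right_part


# Two channels, four pixels; channel 0 of `right` is channel 0 of `left` moved by one pixel.
left = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
right = [[1.0, 2.0, 3.0, 0.0], [1.0, 1.0, 1.0, 1.0]]

gwc = groupwise_correlation(left, right, max_disp=2, num_groups=2)
cat = concatenation_volume(left, right, max_disp=2)
combined = gwc + cat  # join both volumes along the channel axis
print(len(combined), len(combined[0]), len(combined[0][0]))
```

The correlation part stores a compact similarity score per group and disparity, while the concatenation part keeps the raw features so that later 3D convolutions can learn their own matching cues.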

Hierarchical aggregation:

The 1/4 combined volume is downsampled to 1/8 scale (V1) by a three-dimensional convolution with a stride of 2. The downsampled 1/8 volume (V1) is concatenated with the original 1/8 volume (V2) to form a new 1/8-scale volume (V). A 1x1x1 convolution is then applied to halve the channels of the new cost volume to the channel count corresponding to that scale. During downsampling, the four volumes are hierarchically aggregated in this way until the smallest scale (1/32) is reached. The reverse direction works analogously: together with the other three larger volumes and concatenation operations, the new lowest-scale volume (1/32) is aggregated stage by stage up to the highest-scale volume (1/4), with the corresponding cost volume upsampled by a three-dimensional deconvolution with a stride of 2. Through hierarchical aggregation, the multi-scale cost volumes of four different scales are thus aggregated into a 1/4-size volume for subsequent disparity regression.
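The down-then-up flow described above can be checked with simple shape bookkeeping. The sketch below tracks only (C, D, H, W) shapes, not data. It assumes a 256 x 512 input with a maximum disparity of 192, and it assumes 128 channels for the 1/32 volume, since the text lists only three channel counts; all function names are hypothetical.

```python
def conv3d_stride2(shape, out_channels):
    """Shape after a stride-2 3D convolution (padding assumed so each axis halves)."""
    _, d, h, w = shape
    return (out_channels, d // 2, h // 2, w // 2)


def deconv3d_stride2(shape, out_channels):
    """Shape after a stride-2 3D deconvolution (each axis doubles)."""
    _, d, h, w = shape
    return (out_channels, d * 2, h * 2, w * 2)


def fuse(a, b):
    """Concatenate along channels, then a 1x1x1 convolution halves the channel count."""
    assert a[1:] == b[1:], "volumes must share D, H, W before concatenation"
    return ((a[0] + b[0]) // 2,) + a[1:]


def hierarchical_aggregation(volumes):
    """volumes: list of (C, D, H, W) shapes ordered from the 1/4 to the 1/32 scale."""
    current = volumes[0]
    down = [current]
    for target in volumes[1:]:            # downward pass: 1/4 -> 1/32
        current = fuse(conv3d_stride2(current, target[0]), target)
        down.append(current)
    for target in reversed(volumes[:-1]):  # upward pass: 1/32 -> 1/4
        current = fuse(deconv3d_stride2(current, target[0]), target)
    return down, current


# Hypothetical shapes; the 128 channels at 1/32 are an assumption.
volumes = [(32, 48, 64, 128), (64, 24, 32, 64), (128, 12, 16, 32), (128, 6, 8, 16)]
down_shapes, final_shape = hierarchical_aggregation(volumes)
print(down_shapes[-1], final_shape)
```

The concatenation doubles the channel count at each stage, which is why the 1x1x1 convolution that halves it again is needed to keep the volume at the channel count of its scale.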

Parallel aggregation:

In one embodiment of the present application, a parallel aggregation network is proposed to aggregate the raw cost volume. To learn additional context information, the parallel aggregation network consists of three cascaded parallel aggregation modules. A 3D convolution with a stride of 2 is applied first, and another three-dimensional convolution then reduces the feature size to 1/8; next, 4 parallel dilated convolutions with increasing dilation rates output 4 feature maps of the same size. The four feature maps are concatenated and then fed into two three-dimensional convolutions, the latter of which is a deconvolution with a stride of 2. At the output, the final 1/4-size feature map is processed to predict the disparity map.

In one embodiment, when producing the output of parallel aggregation, two stacked 3D convolutions and an upsampling operator are used to generate a 1-channel 4D volume. The 4D volume is then converted into a probability volume by applying softmax along the disparity dimension. With Cd denoting the predicted cost and Dmax the maximum disparity, the disparity is then computed as follows:

；
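The formula itself did not survive in this text. As a hedged illustration only, the sketch below shows the expectation-based regression commonly used after a softmax over the disparity dimension: the predicted disparity is the probability-weighted sum of the candidate disparities 0 to Dmax - 1. Whether the softmax is applied to the volume values or to their negatives is an assumption here (the values are used directly), and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def regress_disparity(scores):
    """Expected disparity under softmax probabilities over candidates 0..len(scores)-1."""
    probs = softmax(scores)
    return sum(d * p for d, p in enumerate(probs))


# Scores along the disparity dimension for one pixel (hypothetical values).
uniform = regress_disparity([0.0, 0.0, 0.0, 0.0])
peaked = regress_disparity([0.0, 0.0, 50.0, 0.0])
print(uniform, peaked)
```

Because the result is a weighted average rather than an argmax, it is differentiable and can take sub-pixel values, which is what makes it suitable for end-to-end training.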

Cross-scale features are extracted from the multi-scale cost volumes to improve the network's understanding of multi-level context. The parallel aggregation module with dilated convolutions is used for cost volume filtering, which improves the utilization of global context information.

In this embodiment, hierarchical concatenation fuses multi-scale context information and makes full use of both global and local information to construct an aggregated cost volume, so that more accurate depth estimation is obtained; concatenating multi-scale features also builds feature representations with greater expressive power. The concatenation operation joins two tensors along the channel dimension into a single tensor. Parallel aggregation accelerates the inference speed of the network, which addresses the high time consumption and heavy computation of sequential inference in the dense pixel matching task. In addition, using dilated convolutions of multiple sizes gives the resulting feature map a larger receptive field, so that both local and global information are fully captured.

The experimental results of the embodiment show that the content-based hierarchical parallel aggregation network HPA-Net achieves state-of-the-art stereo matching performance on the KITTI dataset.

Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware, or may be implemented by a computer program. When all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a computer readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above can be realized. In addition, when all or part of the functions in the above embodiments are implemented by means of a computer program, the program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a removable hard disk, and be downloaded or copied into the memory of a local device, or used to update the system version of the local device; when the program in the memory is executed by a processor, all or part of the functions in the above embodiments can be realized.

The foregoing description of the invention has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the invention pertains, based on the idea of the invention.

Claims

1. A hierarchical parallel aggregation computing method for stereo matching, comprising:

2. The hierarchical parallel aggregation computing method according to claim 1, wherein the feature acquisition of the two stereo matching original images acquired at the same time and subjected to epipolar correction includes:

3. The hierarchical parallel aggregation computing method of claim 2, wherein the twin feature extraction network comprises a convolution layer with a 3x3 convolution kernel, four residual blocks, and two dilated (atrous) convolution blocks.

4. The hierarchical parallel aggregation computing method according to claim 3, wherein the feature acquisition of the two stereo matching original images acquired at the same time and subjected to epipolar correction further comprises:

5. The hierarchical parallel aggregation computing method according to claim 4, wherein the regularizing each low-resolution feature map by two preset convolution layers includes:

6. The hierarchical parallel aggregation computing method of claim 2, wherein the first scale, the second scale, the third scale, and the fourth scale have values of 1/4,1/8,1/16, and 1/32, respectively.

7. The hierarchical parallel aggregation computing method of claim 2, wherein the hierarchical aggregation of the low resolution feature maps of different scales comprises downsampling and/or upsampling aggregation;

the downsampling aggregation includes:

the upsampling aggregation includes:

8. The hierarchical parallel aggregation computing method of claim 2, wherein the parallel aggregation of the cost volumes of each scale comprises:

9. A computer readable storage medium having stored thereon a program executable by a processor to implement the hierarchical parallel aggregation computing method of any one of claims 1-8.

10. Hierarchical parallel aggregation computing device for stereo matching, characterized by being adapted to apply the hierarchical parallel aggregation computing method according to any one of claims 1-8, the hierarchical parallel aggregation computing device comprising: