CN111968168B - Multi-branch adjustable bottleneck convolution module and end-to-end stereo matching network - Google Patents


Info

Publication number
CN111968168B
Authority
CN
China
Prior art keywords
convolution
branch
depth
point
input
Prior art date
Legal status
Active
Application number
CN202010776723.6A
Other languages
Chinese (zh)
Other versions
CN111968168A (en)
Inventor
齐志
邢佳斌
董纪莹
刘昊
时龙兴
宋慧滨
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202010776723.6A
Publication of CN111968168A
Application granted
Publication of CN111968168B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/50: Depth or shape recovery
    • G06T7/55: Depth or shape recovery from multiple images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks


Abstract

The invention discloses a Multi-branch Adjustable Bottleneck (MAB) convolution module and an end-to-end stereo matching network for estimating the disparity between left and right images. By adjusting the scale coefficients of the branches and the dilation rate of the dilated convolution in each branch of the MAB module, the number of channels and the receptive field of the convolution are tuned, balancing the savings in computation and data access against the information content of the convolution result. The MAB module can be widely used as a lightweight feature extraction module in deep learning networks. Based on the MAB module and its 3D extension, a lightweight end-to-end stereo matching neural network is constructed; compared with previous stereo matching networks, model parameters and operation counts are greatly reduced, while accuracy tested on the SceneFlow and KITTI datasets reaches the SOTA level. It is therefore easier to deploy on resource-constrained systems such as embedded platforms and wearable devices.

Description

Multi-branch adjustable bottleneck convolution module and end-to-end stereo matching network
Technical Field
The invention provides a novel Multi-branch Adjustable Bottleneck (MAB) convolution module, its 3D extension, and a lightweight end-to-end stereo matching neural network (MABNet) built from this module for estimating the disparity between left and right images. The invention belongs to the field of stereo matching in computer vision.
Background
Stereo matching computes the disparity between the left and right images captured by a binocular camera, from which depth can be estimated. It is widely used across computer vision, including autonomous driving, 3D reconstruction, and AR/VR. However, traditional image-processing algorithms rely on hand-crafted features and cost functions, which have severe limitations: textureless regions, occluded edges, repetitive textures, and the like frequently cause matching failures. Deep-learning-based algorithms let the system learn the required features from large amounts of existing data and bring clear improvements in both accuracy and speed. Despite this good performance, such networks generally have large parameter and operation counts and are hard to deploy on resource-constrained hardware, yet the systems that use stereo matching are often automobiles, drones, wearable devices, and the like. A lightweight stereo matching method is therefore urgently needed.
Disclosure of Invention
The purpose of the invention: in view of the deficiencies of the prior art, a lightweight multi-branch adjustable bottleneck convolution module and an end-to-end stereo matching network are provided.
The technical scheme is as follows: a multi-branch adjustable bottleneck convolution module is constructed by the following steps.
When the input and output feature maps are 3-dimensional data of equal size:
Step A1: the input feature map is first split along the channel dimension into two parts; one part is passed backward directly as a residual, and the other part enters the convolution operation of step A2.
Step A2: the part of the input feature map entering the convolution operation is fed into several branches simultaneously; each branch contains a point-wise convolution and a depth-wise convolution and corresponds to one scale coefficient. The point-wise convolution maps the input from a high dimension to a low dimension, the number of retained channels being determined by the scale coefficient; the result then enters the depth-wise convolution, which leaves the channel count unchanged. The depth-wise convolution of each branch uses dilated convolution with a different dilation rate, so that for the same number of parameters the feature information extracted by each branch has a different receptive field, yielding a multi-scale convolution result.
Step A3: the feature maps of different receptive fields obtained from the parallel branches of step A2 are concatenated directly; the ratio of the scale coefficients of the branches in step A2 determines the proportion each branch's feature map occupies in the concatenated result. A point-wise convolution then adjusts the number of output channels to half the original input channel count, giving the output of the convolution-branch part.
Step A4: the residual part of step A1 and the convolution-branch output of step A3 are concatenated, giving an output whose channel count equals that of the original input; a channel shuffle operation then produces the final output of the multi-branch adjustable bottleneck convolution module.
When the input and output feature maps are 3-dimensional data of unequal size:
Step B1: all input feature maps pass through a depth-wise separable convolution to generate the residual.
Step B2: all input feature maps enter several branches simultaneously; each branch contains a point-wise convolution and a depth-wise convolution and corresponds to one scale coefficient. The point-wise convolution maps the input from a high dimension to a low dimension, the number of retained channels being determined by the scale coefficient; the result then enters the depth-wise convolution, which leaves the channel count unchanged. The depth-wise convolution of each branch uses dilated convolution with a different dilation rate, so that for the same number of parameters the feature information extracted by each branch has a different receptive field, yielding a multi-scale convolution result.
Step B3: the feature maps of different receptive fields obtained from the parallel branches of step B2 are concatenated directly; the ratio of the scale coefficients of the branches in step B2 determines the proportion each branch's feature map occupies in the concatenated result. A point-wise convolution then halves the number of channels after concatenation, giving the output of the convolution-branch part.
Step B4: the residual part of step B1 and the convolution-branch output of step B3 are concatenated, and after a channel shuffle operation an output feature map differing in size from the input is generated. The length and width of the output feature map are determined by the stride of the depth-wise convolution in step B2, and its channel count by the depth-wise separable convolution in the residual of step B1.
Further, when the input and output are 4-dimensional data, the point-wise convolution and the depth-wise convolution in the multi-branch adjustable bottleneck convolution module are extended from 2D to 3D to obtain a 3D multi-branch adjustable bottleneck convolution block. The extension uses asymmetric convolution: the 3×3×3 convolution kernel is decomposed into a 3×1×1 point-wise convolution kernel and a 1×3×3 depth-wise convolution kernel.
An end-to-end stereo matching network algorithm takes as input a left and a right image of the same resolution; the stereo matching network executes the following steps.
Step 1: two weight-sharing feature extraction networks are constructed from the 2D multi-branch adjustable bottleneck convolution module, and feature information is extracted from the left and right images to obtain a left and a right feature map.
Step 2: the left and right feature maps obtained in step 1 are cross-concatenated at each disparity level to construct a 4-dimensional cost volume, whose dimensions are: channel, disparity, and the two spatial dimensions.
Step 3: cost aggregation is performed on the 4-dimensional cost volume of step 2 using the 3D multi-branch adjustable bottleneck convolution block; branch fusion during encoding and decoding preserves the feature information, finally yielding 3 4-dimensional feature maps.
Step 4: each 4-dimensional feature map output by step 3 is convolved twice to reduce the channel dimension to 1, giving a 3-dimensional feature map whose dimensions are disparity, feature-map height, and feature-map width. These 3 dimensions are interpolated so that the disparity dimension equals a preset maximum disparity value and the height and width equal those of the original input image, generating a 3-dimensional cost volume.
Step 5: disparity regression is performed on each 3-dimensional cost volume generated in step 4 to obtain a continuous predicted disparity map; the weighted average of the 3 predicted disparity maps is taken as the final disparity estimation result.
Further, step 1 specifically comprises: the image undergoes 3 convolutions to obtain an initial feature map; then 4 groups of the multi-branch adjustable bottleneck convolution modules are applied, the groups containing 3, 16, 3, and 3 modules with 32, 64, 128, and 128 channels respectively; the outputs of groups 2, 3, and 4 are then concatenated into a feature map with 320 channels; finally 2 convolutions fuse these, yielding left and right feature maps with 32 channels each.
Further, step 3 specifically comprises: the 4-dimensional cost volume is first encoded: 4 consecutive 3D convolutions give an initial aggregated feature map; with this map as input, a standard convolution with stride 2 and a dilated convolution with stride 2 and dilation rate 2 are applied in parallel, each followed by one 3D multi-branch adjustable bottleneck convolution block, generating two encoding branches; each branch then again undergoes a standard convolution with stride 2 and a dilated convolution with dilation rate 2, generating 4 encoding branches; the aggregated feature maps output by the 4 branches are concatenated along the channel direction and fused by one 3D multi-branch adjustable bottleneck convolution block; finally, 2 deconvolutions restore the aggregated feature map to the size of the initial aggregated feature map, constituting the decoding operation. The encoding-decoding process is repeated three times, yielding 3 4-dimensional feature maps.
Further, the disparity regression performed in step 5 on each 3-dimensional cost volume generated in step 4 is specifically:

\hat{d} = \sum_{d=0}^{D_{max}-1} d \times \sigma(-c_d)

where the predicted disparity \hat{d} is the sum of each disparity value d weighted by its probability, the probability being obtained by applying the SoftMax operation \sigma(\cdot) to the negated cost -c_d, and D_{max} is the preset maximum disparity value.
Beneficial effects: first, the MAB convolution module of the invention obtains a multi-scale convolution result with fewer parameters and fewer multiply-add operations than a conventional convolution module. By adjusting the scale coefficients of the branches and the dilation rate of the dilated convolution in each branch of the MAB convolution module, the number of channels and the receptive field of the convolution are tuned, balancing the savings in computation and data access against the information content of the convolution result. The MAB convolution module can be widely used as a lightweight feature extraction module in deep learning networks. Second, a lightweight end-to-end stereo matching neural network (MABNet) is constructed from the MAB convolution module and its 3D extension; compared with conventional stereo matching networks, model parameters and operation counts are greatly reduced, while accuracy tested on the SceneFlow and KITTI datasets reaches the SOTA level. It is therefore easier to deploy on resource-constrained systems such as embedded platforms and wearable devices.
Drawings
FIG. 1 is a schematic diagram of a MAB module:
wherein C, H, and W denote the channel count, height, and width of the input feature map, λ_i (i = 1, 2, 3) denotes the scale coefficient of each branch, PWConv denotes point-wise convolution, and DWConv denotes depth-wise convolution; diagram (a) is the 2D MAB form with equal input and output channel counts; diagram (b) is the 2D MAB form with unequal channel counts, whose stride, denoted S, is generally not 1;
FIG. 2 is a schematic diagram of a 3D MAB module:
wherein the notation is as in FIG. 1; diagram (a) is the 3D MAB form with equal input and output channel counts, and diagram (b) the form with unequal channel counts;
FIG. 3 is a flowchart of the stereo matching network MABNet;
FIG. 4 is a schematic diagram of the MABNet feature extraction section;
FIG. 5 is a schematic diagram of MABNet cost volume construction:
wherein D denotes the disparity dimension;
FIG. 6 is a schematic diagram of the MABNet cost aggregation section.
Detailed Description
The invention is further explained below with reference to the drawings.
A multi-branch adjustable bottleneck convolution module, as shown in FIG. 1(a), when the input and output feature maps are 3-dimensional data of equal size:
Step A1: the input feature map is first split along the channel dimension into two parts; one part is passed backward directly as a residual, and the other part enters the convolution operation of step A2. This step effectively reduces the number of multiply-add operations and the number of parameters, because half of the input undergoes no convolution.
Step A2: the part of the input feature map entering the convolution operation is fed into several branches simultaneously; each branch contains a point-wise convolution and a depth-wise convolution and corresponds to one scale coefficient. The point-wise convolution maps the input from a high dimension to a low dimension, the number of retained channels being determined by the scale coefficient; the result then enters the depth-wise convolution, which leaves the channel count unchanged. The depth-wise convolution of each branch uses dilated convolution with a different dilation rate, so that for the same number of parameters the feature information extracted by each branch has a different receptive field, yielding a multi-scale convolution result.
In step A2, the point-wise convolution adjusts the channel count of each branch's result according to its scale coefficient. The depth-wise convolution on each branch uses dilated convolution with a different dilation rate to extract features of different receptive fields without increasing the number of parameters.
Step A3: the feature maps of different receptive fields obtained from the parallel branches of step A2 are concatenated directly; the ratio of the scale coefficients of the branches in step A2 determines the proportion each branch's feature map occupies in the concatenated result. Adjusting the scale coefficients is therefore equivalent to adjusting the proportion of features of different receptive fields in the output. A point-wise convolution then adjusts the number of output channels to half the original input channel count, giving the output of the convolution-branch part.
Step A4: the residual part of step A1 and the convolution-branch output of step A3 are concatenated, giving an output whose channel count equals that of the original input; a channel shuffle then increases information exchange among channels, enhances the expressiveness of the output features, and produces the final output of the multi-branch adjustable bottleneck convolution module.
When the input and output feature maps are 3-dimensional data of unequal size, as shown in FIG. 1(b), the channel-split operation at the start of the convolution block is removed compared with the equal-size case, and all input feature maps take part in the branch convolutions. A skip connection is added: a depth-wise separable convolution, i.e. a depth-wise convolution plus a point-wise convolution, produces the residual, which is then concatenated with the output feature map of the multi-branch part, giving a MAB convolution output feature map with twice the number of input channels. Specifically:
Step B1: all input feature maps undergo a depth-wise separable convolution to generate the residual, the depth-wise separable convolution being a depth-wise convolution followed by a point-wise convolution.
Step B2: all input feature maps enter several branches simultaneously; each branch contains a point-wise convolution and a depth-wise convolution and corresponds to one scale coefficient. The point-wise convolution maps the input from a high dimension to a low dimension, the number of retained channels being determined by the scale coefficient; the result then enters the depth-wise convolution, which leaves the channel count unchanged. The depth-wise convolution of each branch uses dilated convolution with a different dilation rate, so that for the same number of parameters the feature information extracted by each branch has a different receptive field, yielding a multi-scale convolution result.
Step B3: the feature maps of different receptive fields obtained from the parallel branches of step B2 are concatenated directly; the ratio of the scale coefficients of the branches in step B2 determines the proportion each branch's feature map occupies in the concatenated result. A point-wise convolution then adjusts the number of output channels to half the original input channel count, giving the output of the convolution-branch part.
Step B4: the residual part of step B1 and the convolution-branch output of step B3 are concatenated, and after a channel shuffle operation an output feature map differing in size from the input is generated. The length and width of the output feature map are determined by the stride of the depth-wise convolution in step B2, and its channel count by the depth-wise separable convolution in the residual of step B1.
In the above steps, the number of branches in the convolution block, the dilation rate of the dilated convolution on each branch, and the scale coefficient of each branch can all be adjusted to the network configuration and the practical application, or the optimal configuration can be found experimentally. This greatly increases the practicality and flexibility of the module. Compared with a conventional convolution module, the multi-branch adjustable bottleneck convolution module and block obtain a multi-scale convolution result with fewer parameters and multiply-add operations, and can be widely used as a lightweight feature extraction module in deep learning networks.
When the input and output are 4-dimensional data, the point-wise convolution and the depth-wise convolution in the multi-branch adjustable bottleneck convolution module are extended from 2D to 3D to obtain a 3D multi-branch adjustable bottleneck convolution block. The extension uses asymmetric convolution: the 3×3×3 convolution kernel is decomposed into a 3×1×1 point-wise convolution kernel and a 1×3×3 depth-wise convolution kernel, as shown in FIG. 2.
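The asymmetric decomposition can be sketched as below. This is a hedged illustration: the layer ordering (point-wise before depth-wise) and the absence of normalization layers are assumptions. The decomposition replaces the 27·C_in·C_out weights of a full 3×3×3 convolution with 3·C_in·C_out weights for the disparity-domain kernel plus 9·C_out for the spatial depth-wise kernel.

```python
import torch
import torch.nn as nn

class Decomposed3d(nn.Module):
    """3x3x3 kernel split into a 3x1x1 point-wise (disparity-domain) conv
    and a 1x3x3 depth-wise (spatial-domain) conv, as in the 3D MAB block."""
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        # point-wise over the disparity axis: adjusts the channel count
        self.pw = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 1, 1),
                            padding=(1, 0, 0), bias=False)
        # depth-wise over the two spatial axes: one filter per channel;
        # the dilation rate controls the spatial receptive field
        self.dw = nn.Conv3d(out_ch, out_ch, kernel_size=(1, 3, 3),
                            padding=(0, dilation, dilation),
                            dilation=(1, dilation, dilation),
                            groups=out_ch, bias=False)

    def forward(self, x):          # x: (N, C, D, H, W)
        return self.dw(self.pw(x))

full = 3 * 3 * 3 * 8 * 16                               # full 3x3x3 conv, 8 -> 16 ch
split = sum(p.numel() for p in Decomposed3d(8, 16).parameters())
```

For the 8-to-16-channel case this gives 528 parameters in place of 3456, which is the source of the block's lightness in the cost-aggregation stage.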
As shown in FIG. 3, the end-to-end stereo matching network, named MABNet, comprises four stages: feature extraction, cost volume construction, cost aggregation, and disparity regression.
Step 1: two weight-sharing feature extraction networks are constructed from the 2D MAB module; a left and a right image of the same resolution are input, and the left and right feature maps are obtained by passing the left and right images respectively through the feature extraction network shown in FIG. 4.
To reduce computation, the two networks share weights. Specifically, the image first undergoes 3 convolutions to obtain an initial feature map; then 4 groups of 2D MAB modules are applied, the groups containing 3, 16, 3, and 3 modules with 32, 64, 128, and 128 channels respectively; the outputs of groups 2, 3, and 4 are then concatenated into a feature map with 320 channels; finally 2 convolutions fuse these, yielding left and right feature maps with 32 channels each.
Step 2: the left and right feature maps obtained in step 1 are cross-concatenated at each disparity level to construct a 4-dimensional cost volume, as shown in FIG. 5. The 4 dimensions are: channel, disparity, and the two spatial dimensions.
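The cross-concatenation of step 2 can be sketched in NumPy as follows. This is a minimal illustration; the exact tensor layout and the zero-filling of positions with no valid counterpart are assumptions based on common practice in concatenation-based cost volumes (e.g. PSMNet), not details confirmed by the source.

```python
import numpy as np

def build_cost_volume(left: np.ndarray, right: np.ndarray,
                      max_disp: int) -> np.ndarray:
    """Cross-concatenate left/right feature maps at each disparity level.

    left, right: feature maps of shape (C, H, W).
    Returns a 4D cost volume of shape (2C, max_disp, H, W):
    channel, disparity, and the two spatial dimensions.
    """
    c, h, w = left.shape
    cost = np.zeros((2 * c, max_disp, h, w), dtype=left.dtype)
    for d in range(max_disp):
        # pair each left pixel at column x with the right pixel at x - d;
        # columns with no valid counterpart stay zero
        cost[:c, d, :, d:] = left[:, :, d:]
        cost[c:, d, :, d:] = right[:, :, :w - d] if d else right
    return cost
```

Concatenation (rather than a correlation dot product) keeps the full feature vectors of both views at every disparity level, which is what the subsequent 3D aggregation network consumes.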
Step 3: cost aggregation is performed on the 4-dimensional cost volume of step 2 by the network built from the 3D multi-branch adjustable bottleneck convolution block, as shown in FIG. 6; branch fusion during encoding and decoding preserves the feature information, finally yielding 3 4-dimensional feature maps. In the 3D multi-branch adjustable bottleneck convolution block, the point-wise convolution acts as the disparity-domain kernel and adjusts the number of channels, while the depth-wise convolution is the spatial-domain kernel, whose different dilation rates extract features of different receptive fields.
An encoding-decoding structure, also called an hourglass structure, is used here; unlike an ordinary encoder-decoder, it fuses a branched structure. The 4-dimensional cost volume is first encoded: 4 consecutive 3D convolutions give an initial aggregated feature map; with this map as input, a standard convolution with stride 2 (i.e. a dilated convolution with dilation rate 1 and stride 2) and a dilated convolution with stride 2 and dilation rate 2 are applied in parallel, each followed by one 3D multi-branch adjustable bottleneck convolution block, generating two encoding branches; each branch then again undergoes a standard convolution with stride 2 and a dilated convolution with dilation rate 2, generating 4 encoding branches; the aggregated feature maps output by the 4 branches are concatenated along the channel direction and fused by one 3D MAB module; finally, 2 deconvolutions restore the aggregated feature map to the size of the initial aggregated feature map, constituting the decoding operation. The encoding-decoding operation is cycled 3 times to increase the accuracy of the network, yielding 3 4-dimensional feature maps, i.e. three 4-dimensional cost volumes.
Step 4: each 4-dimensional feature map output by step 3 is convolved twice to reduce the channel dimension to 1, giving a 3-dimensional feature map whose dimensions are disparity, feature-map height, and feature-map width. These 3 dimensions are interpolated so that the disparity dimension equals the preset maximum disparity value and the height and width equal those of the original input image, generating a 3-dimensional cost volume: a 3D feature map at the original image resolution whose first dimension equals the maximum disparity value.
Step 5: disparity regression is performed on each 3-dimensional cost volume generated in step 4 to obtain a continuous predicted disparity map. Specifically:

\hat{d} = \sum_{d=0}^{D_{max}-1} d \times \sigma(-c_d)   (1)

where the predicted disparity \hat{d} is the sum of each disparity value d weighted by its probability, the probability being obtained by applying the SoftMax operation \sigma(\cdot) to the negated cost -c_d, and D_{max} is the preset maximum disparity value. The weighted average of the 3 predicted disparity maps is taken as the final disparity estimation result.
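The SoftMax-weighted disparity regression above (often called a soft argmin) can be sketched in NumPy as follows; this is a minimal illustration of the formula, not the authors' implementation.

```python
import numpy as np

def disparity_regression(cost: np.ndarray) -> np.ndarray:
    """Soft-argmin disparity regression: sum_d d * sigma(-c_d).

    cost: 3D cost volume of shape (D_max, H, W); lower cost = better match.
    Returns a continuous disparity map of shape (H, W).
    """
    d_max = cost.shape[0]
    # sigma(-c_d): numerically stable SoftMax over the disparity axis
    e = np.exp(-cost - np.max(-cost, axis=0, keepdims=True))
    prob = e / e.sum(axis=0, keepdims=True)
    disp = np.arange(d_max, dtype=cost.dtype).reshape(-1, 1, 1)
    return (prob * disp).sum(axis=0)
```

Because the output is a probability-weighted average over all disparity levels rather than a hard argmin, the predicted disparity is continuous and the operation is differentiable, which is what makes end-to-end training possible.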
The training process of the end-to-end stereo matching network MABNet based on the MAB module must overcome the problem of insufficient data. The method is as follows: without loss of generality, after data enhancement, pre-train on the large-scale synthetic dataset SceneFlow, then fine-tune on a KITTI dataset of real scenes.
In network training, the smooth L1 loss is used as the loss function, with the formula:

L(d, \hat{d}) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L1}(d_i - \hat{d}_i)   (2)

\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}   (3)

where N is the number of all valid pixels, d_i is the true disparity, and \hat{d}_i is the predicted disparity.
Since 3 predicted disparity maps are obtained, there are also 3 losses (loss_1, loss_2, loss_3). The final loss_total is a weighted sum of the three, with the weights following PSMNet:

loss_{total} = 0.5 \times loss_1 + 0.7 \times loss_2 + loss_3   (4)
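The smooth L1 loss and the PSMNet-weighted total loss above can be sketched as follows. The valid-pixel handling via a boolean mask is an assumption for illustration (KITTI ground-truth disparity is sparse, so invalid pixels must be excluded).

```python
import numpy as np

def smooth_l1(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> float:
    """Smooth L1 loss averaged over the N valid pixels.

    valid: boolean mask marking pixels with ground-truth disparity
    (assumed masking scheme, not confirmed by the source).
    """
    x = np.abs(pred[valid] - gt[valid])
    # 0.5*x^2 inside the unit interval, |x| - 0.5 outside
    per_pixel = np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)
    return float(per_pixel.mean())

def total_loss(losses) -> float:
    """PSMNet-style weighting of the three intermediate disparity outputs."""
    l1, l2, l3 = losses
    return 0.5 * l1 + 0.7 * l2 + l3
```

The quadratic region near zero keeps gradients small for nearly correct pixels, while the linear region bounds the influence of large outlier errors.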
during training, two data sets, sceneFlow and KITTI (KITTI 2012 and KITTI 2015) are used. Pre-training with SceneFlow and then fine tuning with KITTI. The data set is subjected to data enhancement, including processing such as random clipping and normalization. Wherein SceneFlow uses the end-point error EPE (end-point-error) as a measure, i.e. the average absolute value error of the parallax of all pixels. In KITTI, when the estimated parallax and the real parallax of the pixel are less than 3 pixels, the pixel is regarded as correct, and the percentage of the error 3PE (3-point-error) of three pixels is used as a measurement standard. In particular, since the test set of KITTI does not provide a true disparity map, the training set 200 of KITTI was split 160 and 40 image-wise in the experiment as training and validation sets, respectively.
Experiments determined the optimal configuration of the MAB modules in MABNet: the 2D MAB contains 3 branches with dilation rates 1, 2, and 4 and corresponding scale coefficients 1, 0.5, and 0.25; the 3D MAB contains 2 branches with dilation rates 1 and 2 and corresponding scale coefficients 0.5 and 0.25.
To further reduce the parameters and computation of the MAB-based end-to-end stereo matching network MABNet, the following quantities are all adjustable: the number of MAB modules used in step 1, together with the number of modules and channels per group; the number of multi-branch adjustable bottleneck convolution modules used in step 3, again with the number of modules and channels per group; the number of encoding-decoding groups; and the number of generated 4-dimensional feature maps and the channel count of each. This balances the computation cost, the memory-access overhead, and the information content of the convolution results, so that state-of-the-art (SOTA) depth estimation accuracy is obtained at lower computation and memory-access cost.
MABNet is further simplified into a lighter version named MABNet_tiny. In the feature extraction part, the 4 groups of 2D MAB modules are reduced to 3 groups, with 4, 8 and 4 modules and 8, 16 and 32 channels per group respectively; the outputs of the 2nd and 3rd groups are concatenated. In the cost aggregation part, only one encoding-decoding structure is used, further reducing computation, and only one predicted disparity map is generated. The overall structure is similar to MABNet, and the remaining parts, including channel and layer counts, are likewise reduced and compressed. Compared with conventional stereo matching networks, the resulting MABNet_tiny reduces the parameter count and operation count by two orders of magnitude while losing only 1-2% accuracy.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and amendments can be made without departing from the principle of the present invention, and these modifications and amendments should also be considered as the protection scope of the present invention.

Claims (6)

1. A multi-branch adjustable bottleneck convolution module, characterized in that the module is constructed as follows:
when the input and output feature maps are 3-dimensional data of equal size:
step A1: the input feature map is first split along the channel dimension into two parts; one part is passed backward directly as the residual, and the other part enters the convolution operation of step A2;
step A2: the part of the input feature map entering the convolution operation is fed into several branches simultaneously, each branch containing a point-by-point convolution and a depth-by-depth convolution and corresponding to a scale factor; the point-by-point convolution maps the input from a high dimension to a low dimension, the number of retained channels being determined by the scale factor, and the depth-by-depth convolution follows without changing the channel count; the depth-by-depth convolution of each branch uses dilated (hole) convolution with a different dilation rate, so that the feature information extracted by the branches has different receptive fields under the same parameter count, realizing a multi-scale convolution result;
step A3: the feature maps with different receptive fields obtained from the parallel branches of step A2 are concatenated directly, the ratio of the scale factors in step A2 determining the proportion each branch contributes to the concatenated result; a point-by-point convolution then adjusts the number of output channels to half the original input channel count, yielding the output of the convolution-branch part;
step A4: the residual part of step A1 and the convolution-branch output of step A3 are concatenated, giving an output with the same channel count as the original input; a channel shuffle operation then produces the final output of the multi-branch adjustable bottleneck convolution module;
when the input and output feature maps are 3-dimensional data of unequal size:
step B1: the residual is generated by applying a depth-separable convolution to the whole input feature map;
step B2: the whole input feature map is fed into several branches simultaneously, each branch containing a point-by-point convolution and a depth-by-depth convolution and corresponding to a scale factor; the point-by-point convolution maps the input from a high dimension to a low dimension, the number of retained channels being determined by the scale factor, and the depth-by-depth convolution follows without changing the channel count; the depth-by-depth convolution of each branch uses dilated convolution with a different dilation rate, so that the feature information extracted by the branches has different receptive fields under the same parameter count, realizing a multi-scale convolution result;
step B3: the feature maps with different receptive fields obtained from the parallel branches of step B2 are concatenated directly, the ratio of the scale factors in step B2 determining the proportion each branch contributes to the concatenated result; a point-by-point convolution then halves the concatenated channel count, yielding the output of the convolution-branch part;
step B4: the residual part of step B1 and the convolution-branch output of step B3 are concatenated, and a channel shuffle operation generates an output feature map whose size differs from the input; the height and width of the output feature map are determined by the stride of the depth-by-depth convolution in step B2, and its channel count by the convolutions of step B2 and the depth-separable convolution of the residual in step B1.
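The channel split of step A1 and the channel shuffle of step A4 can be sketched with numpy. This is an illustrative sketch (the patent does not specify the shuffle group count; 2 groups, matching the residual/branch halves, is an assumption):

```python
import numpy as np

def channel_split(x):
    # Step A1: split the (C, H, W) input feature map into two halves along channels
    c = x.shape[0] // 2
    return x[:c], x[c:]

def channel_shuffle(x, groups=2):
    # Step A4: interleave channels across groups so information mixes between
    # the residual half and the convolution-branch half after concatenation
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

x = np.arange(4 * 2 * 2, dtype=np.float32).reshape(4, 2, 2)
residual, branch_in = channel_split(x)
out = channel_shuffle(np.concatenate([residual, branch_in]), groups=2)
```

With 4 channels and 2 groups, the shuffle reorders channels [0, 1, 2, 3] into [0, 2, 1, 3], interleaving the two halves.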
2. The multi-branch adjustable bottleneck convolution module of claim 1, wherein, when the input and output are 4-dimensional data, the point-by-point convolution and depth-by-depth convolution in the module are extended from 2D to 3D to obtain a 3D multi-branch adjustable bottleneck convolution block, the extension being as follows: using asymmetric convolution, the 3 × 3 × 3 convolution kernel is decomposed into a 3 × 1 × 1 point-by-point convolution kernel and a 1 × 3 × 3 depth-by-depth convolution kernel.
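The saving from this asymmetric decomposition is easy to count at the level of a single kernel (this simple weight count ignores channel-mixing terms and is only meant to show why the factorization is cheaper):

```python
# Weights in one full 3x3x3 kernel vs its asymmetric factorization (claim 2)
full_kernel = 3 * 3 * 3            # 27 weights
pointwise = 3 * 1 * 1              # 3x1x1 kernel along the disparity axis
depthwise = 1 * 3 * 3              # 1x3x3 kernel in the spatial plane
factored = pointwise + depthwise   # 12 weights, less than half of the full kernel
```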
3. An end-to-end stereo matching network using the convolution module of claim 2, characterized in that a left and a right image of the same resolution are input, and the stereo matching network performs the following steps:
step 1: two weight-sharing feature extraction networks are built from the 2D multi-branch adjustable bottleneck convolution module, and feature information is extracted from the left and right images to obtain a left feature map and a right feature map;
step 2: the left and right feature maps obtained in step 1 are cross-concatenated at different disparity levels to construct a 4-dimensional cost volume, whose dimensions are: the channel dimension, the disparity dimension, and the two spatial dimensions;
step 3: cost aggregation is performed on the 4-dimensional cost volume of step 2 using the 3D multi-branch adjustable bottleneck convolution block, with branch fusion protecting the feature information during encoding and decoding, finally yielding 3 4-dimensional feature maps;
step 4: each 4-dimensional feature map output in step 3 is convolved twice so that the channel dimension becomes 1, giving a 3-dimensional feature map whose dimensions are the disparity dimension, the feature-map height and the feature-map width; interpolation over these 3 dimensions makes the disparity dimension equal to the preset maximum disparity value and the height and width equal to those of the input image, generating a 3-dimensional cost volume;
step 5: disparity regression is performed on each 3-dimensional cost volume generated in step 4, yielding one continuous predicted disparity map each; the weighted average of the 3 predicted disparity maps is taken as the final disparity estimation result.
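The cross-concatenation of step 2 can be sketched with numpy. This follows the standard concatenation-based cost volume of PSMNet-style networks, which the cited metrics and structure suggest; the zero-padding of invalid positions is an assumption:

```python
import numpy as np

def build_cost_volume(left, right, max_disp):
    # Step 2: concatenate left/right feature maps at each disparity level.
    # left, right: (C, H, W); result: (2C, max_disp, H, W)
    c, h, w = left.shape
    cost = np.zeros((2 * c, max_disp, h, w), dtype=left.dtype)
    for d in range(max_disp):
        # pair left features with right features shifted d pixels;
        # positions with no valid correspondence stay zero
        cost[:c, d, :, d:] = left[:, :, d:]
        cost[c:, d, :, d:] = right[:, :, : w - d]
    return cost

left = np.random.rand(32, 8, 16).astype(np.float32)
right = np.random.rand(32, 8, 16).astype(np.float32)
cv = build_cost_volume(left, right, max_disp=4)
```

The result is 4-dimensional: channels (doubled by the concatenation), disparity, and the two spatial dimensions, as listed in step 2.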
4. The end-to-end stereo matching network of claim 3, wherein step 1 specifically comprises: the image is first passed through 3 convolutions to obtain an initial feature map; then through 4 groups of the multi-branch adjustable bottleneck convolution modules, with 3, 16, 3 and 3 modules and 32, 64, 128 and 128 channels per group respectively; the outputs of the 2nd, 3rd and 4th groups are then concatenated into a feature map with 320 channels; finally, 2 convolutions fuse the features, giving a left and a right feature map with 32 channels each.
5. The end-to-end stereo matching network of claim 3, wherein step 3 specifically comprises: the 4-dimensional cost volume is first encoded: 4 consecutive 3D convolutions produce an initial aggregated feature map; this map is then fed, respectively, through a standard convolution with stride 2 and a dilated convolution with stride 2 and dilation rate 2, each followed by 1 3D multi-branch adjustable bottleneck convolution block, generating two branch encodings; each branch is then processed by a standard convolution with stride 2 and a dilated convolution with dilation rate 2, generating 4 branch encodings; the aggregated feature maps output by the 4 branch encodings are concatenated along the channel dimension and fused by 1 3D multi-branch adjustable bottleneck convolution block; finally, 2 deconvolutions restore the aggregated feature map to the size of the initial aggregated feature map, i.e. the decoding operation; the encoding-decoding process is repeated three times, giving 3 4-dimensional feature maps.
6. The end-to-end stereo matching network of claim 3, wherein, in step 5, performing disparity regression on each 3-dimensional cost volume generated in step 4 specifically comprises:
d̂ = Σ_{d=0}^{D_max−1} d × σ(−c_d)

wherein the predicted disparity d̂ is the sum of each disparity value d weighted by its probability, the probability being obtained by applying the SoftMax operation σ(·) to the negated cost −c_d, and D_max is the preset maximum disparity value.
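The disparity regression of claim 6 (the soft-argmin used by GC-Net/PSMNet-style networks) can be sketched in plain Python; the function name is illustrative:

```python
import math

def soft_argmin(costs):
    # d_hat = sum over d of d * softmax(-c_d), with d = 0 .. D_max - 1;
    # costs[d] is the aggregated matching cost c_d at disparity d
    exps = [math.exp(-c) for c in costs]
    z = sum(exps)
    return sum(d * e / z for d, e in enumerate(exps))
```

Because the probabilities weight every disparity level, the result is continuous rather than integer-valued: uniform costs over D_max = 4 give d̂ = 1.5, while a cost sharply lowest at d = 2 gives d̂ ≈ 2.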
CN202010776723.6A 2020-08-05 2020-08-05 Multi-branch adjustable bottleneck convolution module and end-to-end stereo matching network Active CN111968168B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010776723.6A CN111968168B (en) 2020-08-05 2020-08-05 Multi-branch adjustable bottleneck convolution module and end-to-end stereo matching network

Publications (2)

Publication Number Publication Date
CN111968168A CN111968168A (en) 2020-11-20
CN111968168B CN111968168B (en) 2022-10-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant