CN114581505A - Convolution-based binocular stereo matching network structure
- Publication number: CN114581505A (application CN202210070978.XA)
- Authority: CN (China)
- Prior art keywords: disparity, image, deconvolution, convolution, dimensional
- Legal status: Granted
Classifications
- G06T7/593: Depth or shape recovery from multiple images; from stereo images
- G06N3/045: Neural network architectures; combinations of networks
- G06T2207/10012: Image acquisition modality; stereo images
- G06T2207/20081: Special algorithmic details; training, learning
- G06T2207/20084: Special algorithmic details; artificial neural networks [ANN]
Abstract
The invention relates to a convolution-based binocular stereo matching network structure comprising a feature extraction module, a coarse disparity value generation module, a disparity range prediction module, a cost space construction module, a coarse disparity image generation module and a fine disparity image generation module. The feature extraction module extracts feature data from the input images, processes the data and outputs a first feature image for each input image. The coarse disparity value generation module acquires the first feature image, processes it and outputs a coarse disparity value for each pixel of the first feature image. The network takes a binocular image pair as input and directly outputs a disparity image, realizing an end-to-end network design that eliminates the post-processing operations of traditional binocular stereo matching methods, such as interpolation, filtering and sub-pixel enhancement, and greatly improves efficiency.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a convolution-based binocular stereo matching network structure.
Background
Computer vision is a discipline that studies how to use computers to simulate the human visual system.
Depth estimation from one or more RGB images is a long-standing research problem with applications in fields such as robotics, autonomous driving, object recognition and scene understanding, 3D modeling and animation, augmented reality, industrial control and medical diagnosis.
Binocular stereo matching is one of the core technologies of computer vision: two cameras located on the same horizontal line capture two RGB images, pixel correspondences between the images are found, and depth is obtained by the triangulation principle.
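For reference, for a rectified stereo pair with focal length f, baseline B and pixel disparity d, the standard triangulation relation (a textbook identity, not spelled out in this patent) gives the depth as Z = f·B/d, so larger disparities correspond to closer points.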
Traditional binocular stereo matching is generally divided into four steps: matching cost computation, cost aggregation, disparity computation and post-processing. However, traditional stereo matching methods produce poor matches in occluded, weakly textured or repetitively textured regions, and are sensitive to illumination, contrast and noise.
In recent years, deep-learning-based stereo matching has received wide attention: a CNN that learns a strong representation of the data can also achieve good results, as in the MC-CNN method. However, such CNN-based stereo matching methods only use the matching cost computed by the CNN as an initialization and then still perform the same steps as traditional stereo matching, which makes the pipeline complicated.
Disclosure of Invention
The present invention aims to solve the above problems by providing a convolution-based binocular stereo matching network structure.
The invention achieves this purpose through the following technical scheme:
A convolution-based binocular stereo matching network structure, comprising:
a feature extraction module for extracting feature data from an input binocular image pair, processing the feature data and outputting a first feature image for each input image;
a coarse disparity value generation module for acquiring the first feature image, processing it and outputting a coarse disparity value for each pixel of the first feature image;
a disparity range prediction module for acquiring the first feature image and the coarse disparity value of each of its pixels, processing them and outputting a disparity range interval for each pixel;
a cost space construction module for acquiring the first feature image and the disparity range interval of each of its pixels, processing them and outputting a four-dimensional cost space of the first feature image over the disparity range interval;
a coarse disparity image generation module for acquiring the four-dimensional cost space and, after processing, outputting a coarse disparity image at the scale of the cost space;
and a fine disparity image generation module for acquiring the coarse disparity image, processing it and outputting the disparity image corresponding to the binocular image pair.
As a further optimization of the present invention, the feature extraction module comprises a first convolution unit with three two-dimensional convolutions, a residual structure unit with four residual modules, a second convolution unit with four two-dimensional convolutions, a first deconvolution unit with three deconvolution modules, a third convolution unit with three two-dimensional convolutions and a second deconvolution unit with three deconvolution modules;
wherein,
the first convolution unit processes the input binocular image pair, the result is processed in turn by the residual structure unit, the second convolution unit, the first deconvolution unit, the third convolution unit and the second deconvolution unit, and the second deconvolution unit outputs the first feature image.
As a further optimization of the present invention, the input and output of each of the four residual modules serve as the input of the next adjacent residual module;
the output of each of the three deconvolution modules in the first deconvolution unit serves as the input of the next adjacent deconvolution module;
and the output of each of the three deconvolution modules in the second deconvolution unit serves as the input of the next adjacent deconvolution module, the last deconvolution module in the second deconvolution unit outputting the first feature image.
As a further optimization of the present invention, the coarse disparity value generation module comprises:
a disparity initialization unit for randomly initializing N disparity values for each pixel of the first feature image within an initial disparity search range;
a disparity propagation unit for propagating the randomly initialized disparity values of each pixel in the horizontal and vertical directions, so that each pixel has 5×N random disparity values;
and a disparity evaluation unit for computing the matching similarity of each pixel for the 5×N random disparity values and selecting the disparity value with the highest matching similarity as the coarse disparity value of that pixel.
As a further optimization of the present invention, the disparity range prediction module comprises a first three-dimensional convolution unit provided with three-dimensional convolutions and a first three-dimensional deconvolution unit provided with three-dimensional deconvolutions; the first three-dimensional convolution unit acquires the first feature image and the coarse disparity value of each of its pixels and processes them, and the range interval in which each pixel's disparity lies is output by the last three-dimensional deconvolution of the first three-dimensional deconvolution unit;
wherein,
the output of each three-dimensional convolution serves as the input of the next adjacent three-dimensional convolution;
the output of each three-dimensional deconvolution serves as the input of the next adjacent three-dimensional deconvolution.
As a further optimization of the present invention, the cost space construction module comprises a first packing layer configured to pack the first feature image and the disparity range interval of each of its pixels into a four-dimensional cost space along the channel dimension.
As a further optimization of the present invention, the coarse disparity image generation module comprises:
a first encoder-decoder unit for acquiring the four-dimensional cost space, processing it and outputting a second feature image corresponding to the binocular image pair;
and a coarse disparity regression unit for acquiring the second feature image, processing it and outputting a coarse disparity image at the same scale as the second feature image.
As a further optimization of the present invention, the fine disparity image generation module comprises:
a second packing layer for acquiring the first feature image, the second feature image and the coarse disparity image and packing them into a third feature image along the channel dimension;
a fourth convolution unit for acquiring and processing the third feature image and outputting a fine disparity image at the same scale as the coarse disparity image;
and a disparity map normalization unit for acquiring the fine disparity image and upsampling it by interpolation to obtain a disparity image of the same size as the binocular images.
The beneficial effects of the invention are as follows:
the invention takes a binocular image pair as input and directly outputs a disparity image through the binocular stereo matching network, realizing an end-to-end network design, eliminating the post-processing operations of traditional binocular stereo matching methods, such as interpolation, filtering and sub-pixel enhancement, and greatly improving efficiency.
Drawings
FIG. 1 is a block diagram of the overall architecture of the present invention;
FIG. 2 is a block diagram of the structure of the feature extraction module of the present invention;
fig. 3 is a block diagram of the coarse parallax image generation module according to the present invention;
fig. 4 is a block diagram of the fine parallax image generation module of the present invention.
Detailed Description
The present application will now be described in further detail with reference to the drawings. It should be noted that the following detailed description is given for illustrative purposes only and is not to be construed as limiting the scope of the application; those skilled in the art will be able to make numerous insubstantial modifications and adaptations to the present application based on the above disclosure.
Embodiment 1
As shown in Fig. 1, a convolution-based binocular stereo matching network structure includes a feature extraction module, a coarse disparity value generation module, a disparity range prediction module, a cost space construction module, a coarse disparity image generation module and a fine disparity image generation module;
the feature extraction module extracts feature data from the input images, processes the data and outputs a first feature image for each input image, the input images being the left and right images of a binocular image pair.
as shown in fig. 2, the feature extraction module includes a first convolution unit, a residual structure unit, a second convolution unit, a first deconvolution unit, a third convolution unit, and a second deconvolution unit, wherein,
the first convolution unit comprises three two-dimensional convolutions, and the first convolution unit processes an input image and outputs the processed image to the residual error structure unit; the sizes of convolution kernels of the two-dimensional convolution are all 3 multiplied by 3, the step lengths are respectively 2, 1 and 1, and the number of output characteristic channels of the first convolution unit is 32;
the residual error structure unit comprises four residual error modules, wherein the input and the output of each residual error module are used as the input of the next adjacent residual error module, and the residual error structure unit processes the output data of the first convolution unit and outputs the processed data to the second convolution unit; the sizes of convolution kernels of the residual error modules are all 3 multiplied by 3, the step lengths are respectively 1, 2 and 1, and the output characteristic channel numbers of the four residual error modules are respectively 32, 64, 128 and 128;
the second convolution unit comprises four two-dimensional convolutions, and the second convolution unit processes the output data of the residual error structure unit and outputs the processed output data to the first deconvolution unit; the sizes of convolution kernels of the two-dimensional convolutions are all 3 multiplied by 3, the step lengths are respectively 1, 2 and 2, and the output characteristic channels of the four two-dimensional convolutions are respectively 32, 48, 64 and 96;
the first deconvolution unit comprises three deconvolution modules, wherein the output of each deconvolution module is used as the input of the next adjacent deconvolution module, and the first deconvolution unit processes the output data of the second convolution unit and outputs the processed data to the third convolution unit; the sizes of deconvolution kernels of the deconvolution modules are all 4 multiplied by 4, the step lengths of the three deconvolution modules are all 2, and the number of output characteristic channels is 64, 48 and 32 respectively;
the third convolution unit comprises three two-dimensional convolutions, and outputs the data output by the first deconvolution unit to the second deconvolution unit after processing; the convolution kernel size of the two-dimensional convolution is 3 multiplied by 3, the step length is 2, and the number of output characteristic channels is 48, 64 and 96 respectively;
the second deconvolution unit comprises three deconvolution modules, wherein the output of each deconvolution module is used as the input of the next adjacent deconvolution module, and the last deconvolution module outputs the first characteristic image; the sizes of deconvolution kernels of the deconvolution modules are all 4 multiplied by 4, the step lengths of the three deconvolution modules are all 2, and the number of output characteristic channels is 64, 48 and 32 respectively.
The feature extraction module uses fewer residual error modules (4), the structure of the feature extraction module is simple, the speed of the network is improved, the network can still have a larger receptive field, and the first feature image with the output size of H/8 xW/8 xC (C is the number of feature channels) is used for constructing a cost space with a small size.
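The sketch below is a minimal PyTorch rendering under stated assumptions: `FeatureExtraction`, `conv_bn` and `deconv_bn` are invented names; the residual modules are approximated by plain convolutions; and where the text lists three strides for four layers, the missing strides are filled in as (1, 2, 2, 1) and (1, 2, 2, 2) so that the output lands at H/8 × W/8 as stated.

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, stride):
    # 3x3 two-dimensional convolution + batch norm + ReLU
    # (assumed as the basic block for every 2D convolution in the text)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

def deconv_bn(in_ch, out_ch):
    # 4x4 transposed convolution with stride 2, as specified for the
    # deconvolution modules; doubles the spatial resolution
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, 2, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class FeatureExtraction(nn.Module):
    def __init__(self):
        super().__init__()
        # first convolution unit: strides 2, 1, 1; 32 output channels
        self.unit1 = nn.Sequential(conv_bn(3, 32, 2), conv_bn(32, 32, 1),
                                   conv_bn(32, 32, 1))
        # residual structure unit (approximated): channels 32, 64, 128, 128
        self.unit2 = nn.Sequential(conv_bn(32, 32, 1), conv_bn(32, 64, 2),
                                   conv_bn(64, 128, 2), conv_bn(128, 128, 1))
        # second convolution unit: channels 32, 48, 64, 96
        self.unit3 = nn.Sequential(conv_bn(128, 32, 1), conv_bn(32, 48, 2),
                                   conv_bn(48, 64, 2), conv_bn(64, 96, 2))
        # first deconvolution unit: channels 64, 48, 32
        self.unit4 = nn.Sequential(deconv_bn(96, 64), deconv_bn(64, 48),
                                   deconv_bn(48, 32))
        # third convolution unit: strides all 2; channels 48, 64, 96
        self.unit5 = nn.Sequential(conv_bn(32, 48, 2), conv_bn(48, 64, 2),
                                   conv_bn(64, 96, 2))
        # second deconvolution unit: channels 64, 48, 32
        self.unit6 = nn.Sequential(deconv_bn(96, 64), deconv_bn(64, 48),
                                   deconv_bn(48, 32))

    def forward(self, img):
        x = self.unit1(img)   # H/2
        x = self.unit2(x)     # H/8
        x = self.unit3(x)     # H/64
        x = self.unit4(x)     # H/8
        x = self.unit5(x)     # H/64
        return self.unit6(x)  # first feature image, H/8 x W/8 x 32
```

Both the left and the right image of the binocular pair would be passed through this module (with shared weights, a common choice in Siamese stereo networks) to obtain the two first feature images.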
The coarse disparity value generation module acquires the first feature image, processes it and outputs a coarse disparity value for each pixel of the first feature image;
the coarse disparity value generation module comprises a disparity initialization unit, a disparity propagation unit and a disparity evaluation unit, wherein,
after acquiring the first feature image, the disparity initialization unit randomly initializes N disparity values for each pixel within an initial disparity search range; the disparity propagation unit propagates the randomly initialized disparity values of each pixel in the horizontal and vertical directions so that each pixel has 5×N random disparity values; the disparity evaluation unit then computes the matching similarity for the 5×N random disparity values of each pixel and selects the disparity value with the highest similarity as the coarse disparity value of that pixel.
Specifically, the initial disparity search range is divided evenly into N intervals and one disparity value is randomly initialized per interval for each pixel, giving each pixel N randomly initialized disparity values; the disparity propagation unit then propagates the random disparity values of each pixel in the horizontal and vertical directions via one-hot encoding, so that each pixel has 5×N random disparity values; finally, the disparity evaluation unit computes matching similarity by a dot-product operation on the first feature images along the channel dimension and selects, in each interval, the disparity value with the highest matching similarity as the coarse disparity value of that pixel. A sketch of this step follows.
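A minimal sketch of this PatchMatch-style initialize-propagate-evaluate step, assuming a nearest-neighbour warp of the right feature image and a global (rather than per-interval) best-candidate selection; `coarse_disparity` is an invented name:

```python
import torch

def coarse_disparity(feat_l, feat_r, max_disp, N):
    # feat_l, feat_r: B x C x H x W left/right first feature images
    B, C, H, W = feat_l.shape
    dev = feat_l.device

    # 1) random initialization: one sample per interval of the search range
    bins = torch.arange(N, device=dev).view(1, N, 1, 1)
    d = (bins + torch.rand(B, N, H, W, device=dev)) * (max_disp / N)

    # 2) propagation: each pixel also receives the samples of its four
    #    horizontal/vertical neighbours, giving 5*N candidates per pixel
    shifts = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]
    cands = torch.cat([torch.roll(d, s, dims=(2, 3)) for s in shifts],
                      dim=1)                           # B x 5N x H x W

    # 3) evaluation: warp the right features by each candidate and score
    #    with a channel-wise dot product; keep the best candidate
    xs = torch.arange(W, device=dev).view(1, 1, W).float()
    best_score = torch.full((B, H, W), float('-inf'), device=dev)
    best_d = torch.zeros(B, H, W, device=dev)
    for k in range(cands.shape[1]):
        dk = cands[:, k]                               # B x H x W
        x_src = (xs - dk).clamp(0, W - 1).round().long()
        idx = x_src.unsqueeze(1).expand(-1, C, -1, -1)
        warped = torch.gather(feat_r, 3, idx)          # nearest-neighbour warp
        score = (feat_l * warped).sum(dim=1)           # dot product over C
        better = score > best_score
        best_score = torch.where(better, score, best_score)
        best_d = torch.where(better, dk, best_d)
    return best_d                                      # coarse disparity map
```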
The disparity range prediction module acquires the first feature image and the coarse disparity value of each of its pixels, processes them and outputs a disparity range interval for each pixel;
the disparity range prediction module comprises a first three-dimensional convolution unit and a first three-dimensional deconvolution unit; the first three-dimensional convolution unit acquires the first feature image and the coarse disparity value of each of its pixels, processes them and feeds the result to the first three-dimensional deconvolution unit;
the first three-dimensional convolution unit comprises three-dimensional convolutions, the output of each serving as the input of the next adjacent one; the convolution kernels are all 3×3×3 and the strides are all 2;
the first three-dimensional deconvolution unit comprises three-dimensional deconvolutions with kernels of 3×3×3 and strides of (1, 2, 2); the output of each three-dimensional deconvolution serves as the input of the next adjacent one, and the last three-dimensional deconvolution outputs the range interval in which each pixel's disparity lies.
The coarse disparity value of each pixel is obtained first, a small range interval for the pixel's disparity is derived from it, and the first feature image is then used to construct a small cost space over that interval; narrowing the disparity search range per pixel greatly reduces the computation of the network and improves its prediction speed while preserving prediction accuracy. A sketch of this head is given below.
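The sketch below renders this range prediction head under stated assumptions: the text fixes only the kernel sizes and strides, so the number of three-dimensional convolutions (three), the channel width, the two-channel (lower, upper) bound output, the final collapse of the depth dimension and the name `DisparityRangePredictor` are all assumptions.

```python
import torch.nn as nn

class DisparityRangePredictor(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # 3D convolution unit: 3x3x3 kernels, stride 2 (layer count assumed)
        self.enc = nn.Sequential(
            nn.Conv3d(ch, ch, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, 2, 1), nn.ReLU(inplace=True))

        def deconv(ci, co):
            # 3x3x3 transposed convolution with stride (1, 2, 2): keeps the
            # disparity dimension, doubles the spatial dimensions
            return nn.ConvTranspose3d(ci, co, 3, stride=(1, 2, 2),
                                      padding=1, output_padding=(0, 1, 1))

        self.dec = nn.Sequential(
            deconv(ch, ch), nn.ReLU(inplace=True),
            deconv(ch, ch), nn.ReLU(inplace=True),
            deconv(ch, 2))                 # last deconv emits (lower, upper)

    def forward(self, volume):             # B x C x D x H x W input volume
        out = self.dec(self.enc(volume))   # B x 2 x D/8 x H x W
        return out.mean(dim=2)             # B x 2 x H x W range interval
```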
The cost space construction module acquires the first feature image and the disparity range interval of each of its pixels, processes them and outputs a four-dimensional cost space of the first feature image over the disparity range interval;
the cost space construction module comprises a first packing layer, which packs the first feature image and the disparity range interval of each of its pixels into a four-dimensional cost space along the channel dimension, as sketched below.
The coarse disparity image generation module acquires the four-dimensional cost space and, after processing, outputs a coarse disparity image at the scale of the cost space;
as shown in Fig. 3, the coarse disparity image generation module includes a first encoder-decoder unit and a coarse disparity regression unit, wherein,
the first encoder-decoder unit acquires the four-dimensional cost space, processes it and outputs a second feature image corresponding to the input image, and the coarse disparity regression unit acquires the second feature image, processes it and outputs a coarse disparity image at the same scale as the second feature image;
the first encoder-decoder unit comprises three three-dimensional convolution modules and three three-dimensional deconvolution modules; the convolution kernels are all 3×3×3 with strides of 2, and the deconvolution kernels are all 3×3×3 with strides of (1, 2, 2).
Specifically, the input of the first encoder-decoder unit is the four-dimensional cost space and its output is the second feature image corresponding to the input image; the coarse disparity regression unit applies a Softmax operation to this output along the channel dimension and outputs a coarse disparity image at the same scale as the second feature image.
The coarse disparity image generation module has a simple structure (three three-dimensional convolutions and three three-dimensional deconvolutions); using only a small number of 3D convolutions and deconvolutions reduces the computation of the network and increases its speed. A sketch of the regression step follows.
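In the sketch below, the Softmax along the channel (disparity) dimension follows the text, while taking the expectation over the candidate disparities (a soft-argmin) is an assumed standard step; `disparity_regression` is an invented name.

```python
import torch
import torch.nn.functional as F

def disparity_regression(scores, d_low):
    # scores: B x D x H x W output of the first encoder-decoder unit;
    # d_low: B x H x W lower bound of each pixel's disparity range interval
    B, D, H, W = scores.shape
    prob = F.softmax(scores, dim=1)          # Softmax along the channel dim
    offsets = torch.arange(D, device=scores.device).view(1, D, 1, 1).float()
    disp = d_low.unsqueeze(1) + offsets      # candidate disparity values
    return (prob * disp).sum(dim=1)          # B x H x W coarse disparity image
```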
As shown in Fig. 4, the fine disparity image generation module acquires the coarse disparity image, processes it and outputs the disparity image corresponding to the input image;
the fine disparity image generation module includes a second packing layer, a fourth convolution unit and a disparity map normalization unit, wherein,
the second packing layer acquires the first feature image, the second feature image and the coarse disparity image and packs them along the channel dimension to obtain a third feature image;
the fourth convolution unit acquires and processes the third feature image and outputs a fine disparity image at the same scale as the coarse disparity image; the fourth convolution unit comprises seven two-dimensional convolutions, all with 3×3 kernels and stride 1;
the disparity map normalization unit acquires the fine disparity image and upsamples it by interpolation to obtain a disparity image at the same scale as the input image.
The fine disparity image generation module takes the first feature image and the second feature image as guide images, abandons a complex residual module, and outputs the fine disparity image through a structurally simple fourth convolution unit (seven two-dimensional convolutions), which increases the network speed. A sketch of this refinement stage is given below.
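In the sketch below, the channel widths, the treatment of the second feature image as a 2D guide map at the coarse scale, and the rescaling of disparity values after upsampling are assumptions; `FineDisparity` is an invented name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineDisparity(nn.Module):
    def __init__(self, c1, c2, ch=32):
        super().__init__()
        layers, in_ch = [], c1 + c2 + 1      # guide features + coarse map
        # fourth convolution unit: seven 3x3 convolutions, stride 1
        for out_ch in [ch] * 6 + [1]:
            layers += [nn.Conv2d(in_ch, out_ch, 3, 1, 1),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers.pop()                          # no ReLU after the last conv
        self.refine = nn.Sequential(*layers)

    def forward(self, feat1, feat2, coarse, out_size):
        # second packing layer: concatenate along the channel dimension
        x = torch.cat([feat1, feat2, coarse.unsqueeze(1)], dim=1)
        fine = self.refine(x).squeeze(1)      # fine disparity, coarse scale
        # disparity map normalization unit: interpolation upsampling to the
        # input image size, rescaling disparities by the width ratio
        up = F.interpolate(fine.unsqueeze(1), size=out_size,
                           mode='bilinear', align_corners=False)
        return up.squeeze(1) * (out_size[1] / fine.shape[-1])
```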
It should be noted that the binocular image pair is taken as input and the disparity image is output directly by the binocular stereo matching network, realizing an end-to-end network design, eliminating the post-processing operations of traditional binocular stereo matching methods, such as interpolation, filtering and sub-pixel enhancement, and greatly improving efficiency.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention.
Claims (8)
1. A convolution-based binocular stereo matching network structure, characterized by comprising:
a feature extraction module for extracting feature data from an input binocular image pair, processing the feature data and outputting a first feature image for each input image;
a coarse disparity value generation module for acquiring the first feature image, processing it and outputting a coarse disparity value for each pixel of the first feature image;
a disparity range prediction module for acquiring the first feature image and the coarse disparity value of each of its pixels, processing them and outputting a disparity range interval for each pixel;
a cost space construction module for acquiring the first feature image and the disparity range interval of each of its pixels, processing them and outputting a four-dimensional cost space of the first feature image over the disparity range interval;
a coarse disparity image generation module for acquiring the four-dimensional cost space and, after processing, outputting a coarse disparity image at the scale of the cost space;
and a fine disparity image generation module for acquiring the coarse disparity image, processing it and outputting the disparity image corresponding to the binocular image pair.
2. The convolution-based binocular stereo matching network structure of claim 1, wherein: the feature extraction module comprises a first convolution unit with three two-dimensional convolutions, a residual structure unit with four residual modules, a second convolution unit with four two-dimensional convolutions, a first deconvolution unit with three deconvolution modules, a third convolution unit with three two-dimensional convolutions and a second deconvolution unit with three deconvolution modules;
wherein,
the first convolution unit processes the input binocular image pair, the result is processed in turn by the residual structure unit, the second convolution unit, the first deconvolution unit, the third convolution unit and the second deconvolution unit, and the second deconvolution unit outputs the first feature image.
3. The convolution-based binocular stereo matching network structure of claim 2, wherein: the input and output of each of the four residual modules serve as the input of the next adjacent residual module;
the output of each of the three deconvolution modules in the first deconvolution unit serves as the input of the next adjacent deconvolution module;
and the output of each of the three deconvolution modules in the second deconvolution unit serves as the input of the next adjacent deconvolution module, the last deconvolution module in the second deconvolution unit outputting the first feature image.
4. The convolution-based binocular stereo matching network structure of claim 3, wherein the coarse disparity value generation module comprises:
a disparity initialization unit for randomly initializing N disparity values for each pixel of the first feature image within an initial disparity search range;
a disparity propagation unit for propagating the randomly initialized disparity values of each pixel in the horizontal and vertical directions, so that each pixel has 5×N random disparity values;
and a disparity evaluation unit for computing the matching similarity of each pixel for the 5×N random disparity values and selecting the disparity value with the highest matching similarity as the coarse disparity value of that pixel.
5. The convolution-based binocular stereo matching network structure of claim 4, wherein: the disparity range prediction module comprises a first three-dimensional convolution unit provided with three-dimensional convolutions and a first three-dimensional deconvolution unit provided with three-dimensional deconvolutions; the first three-dimensional convolution unit acquires the first feature image and the coarse disparity value of each of its pixels and processes them, and the range interval in which each pixel's disparity lies is output by the last three-dimensional deconvolution of the first three-dimensional deconvolution unit;
wherein,
the output of each three-dimensional convolution serves as the input of the next adjacent three-dimensional convolution;
the output of each three-dimensional deconvolution serves as the input of the next adjacent three-dimensional deconvolution.
6. The convolution-based binocular stereo matching network structure of claim 1, wherein: the cost space construction module comprises a first packing layer configured to pack the first feature image and the disparity range interval of each of its pixels into a four-dimensional cost space along the channel dimension.
7. The convolution-based binocular stereo matching network structure of claim 1, wherein the coarse disparity image generation module comprises:
a first encoder-decoder unit for acquiring the four-dimensional cost space, processing it and outputting a second feature image corresponding to the binocular image pair;
and a coarse disparity regression unit for acquiring the second feature image, processing it and outputting a coarse disparity image at the same scale as the second feature image.
8. The convolution-based binocular stereo matching network structure of claim 1, wherein the fine disparity image generation module comprises:
a second packing layer for acquiring the first feature image, the second feature image and the coarse disparity image and packing them into a third feature image along the channel dimension;
a fourth convolution unit for acquiring and processing the third feature image and outputting a fine disparity image at the same scale as the coarse disparity image;
and a disparity map normalization unit for acquiring the fine disparity image and upsampling it by interpolation to obtain a disparity image of the same size as the binocular images.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210070978.XA (granted as CN114581505B) | 2022-01-21 | 2022-01-21 | Binocular stereo matching network system based on convolution
Publications (2)

Publication Number | Publication Date
---|---
CN114581505A | 2022-06-03
CN114581505B | 2024-07-09
Family

ID=81772923

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202210070978.XA (Active) | Binocular stereo matching network system based on convolution | 2022-01-21 | 2022-01-21
Citations (6)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110009691A * | 2019-03-28 | 2019-07-12 | 北京清微智能科技有限公司 | Disparity image generation method and system based on binocular stereo vision matching
CN110533712A * | 2019-08-26 | 2019-12-03 | 北京工业大学 | Binocular stereo matching method based on convolutional neural networks
WO2020182117A1 * | 2019-03-12 | 2020-09-17 | 腾讯科技(深圳)有限公司 | Method, apparatus, and device for obtaining disparity map, control system, and storage medium
CN112132201A * | 2020-09-17 | 2020-12-25 | 长春理工大学 | Non-end-to-end stereo matching method based on convolutional neural network
WO2021138992A1 * | 2020-01-10 | 2021-07-15 | 大连理工大学 | Disparity estimation optimization method based on up-sampling and accurate rematching
CN113313740A * | 2021-05-17 | 2021-08-27 | 北京航空航天大学 | Disparity map and surface normal vector joint learning method based on plane continuity
Non-Patent Citations (1)

Title
---
习路; 陆济湘; 涂婷: "Stereo matching method based on multi-scale convolutional neural networks" (in Chinese), Computer Engineering and Design, no. 09, 16 September 2018 *
Also Published As

Publication Number | Publication Date
---|---
CN114581505B | 2024-07-09
Legal Events

Date | Code | Title
---|---|---
| PB01 | Publication
| SE01 | Entry into force of request for substantive examination
| GR01 | Patent grant