CN112489097A - Stereo matching method based on mixed 2D convolution and pseudo 3D convolution - Google Patents
- Publication number
- CN112489097A (application number CN202011436492.0A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- pseudo
- cost
- hybrid
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/33 — Determination of transform parameters for the alignment of images (image registration) using feature-based methods
- G06T17/00 — Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20228 — Disparity calculation for image-based rendering
Abstract
The invention relates to the field of computer vision, and in particular to a stereo matching method (HybridNet) based on mixed 2D convolution and pseudo 3D convolution. The method comprises the following steps: extracting image features based on preset parameters to obtain a feature map; generating a cost volume from the feature map; aggregating the cost volume through a PSMNet structure; obtaining an initial disparity map through disparity regression; building a residual cost volume from the initial disparity map and, after residual aggregation, obtaining a disparity residual that optimizes the initial disparity map. In both the PSMNet structure and the residual aggregation, the 3D convolutions are converted into combinations of hybrid 2D convolution and pseudo 3D convolution, and the disparity map is further refined into a depth map with the CSPNet method. The combination of 2D convolutions approximates the function of a 3D convolution; because the data-shifting operation contains no learnable parameters and adds no computation, the proposed cost aggregation scheme of mixed 2D and pseudo 3D convolution greatly reduces the computation of existing models at only a slight loss of accuracy.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a stereo matching method based on mixed 2D convolution and pseudo 3D convolution.
Background
Stereo matching, as a basic task of stereo vision, is widely applied in fields such as automatic driving, three-dimensional reconstruction and virtual reality. By calculating the disparity between the rectified left and right views, the distance of an object can be recovered from the geometry of similar triangles. Compared with common active ranging sensors such as lidar, a binocular stereo camera can acquire a dense depth map at a cost far lower than that of an active sensor.
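The similar-triangle relation mentioned above reduces to depth = focal length × baseline / disparity. A minimal sketch (the focal length and baseline values below are illustrative assumptions, not figures from the patent):

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth from disparity via the similar-triangle relation Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers (assumed, KITTI-like rig): 720 px focal length, 0.54 m baseline.
print(disparity_to_depth(disparity_px=27.0, focal_px=720.0, baseline_m=0.54))  # 14.4
```

Note that depth resolution degrades quadratically with distance, which is why sub-pixel disparity accuracy matters for far objects.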
A conventional stereo matching algorithm computes the disparity of the left and right views in four main steps: cost computation, cost aggregation, disparity computation and disparity optimization. Such algorithms often suffer from low disparity accuracy and a large amount of computation. In recent years, convolutional neural networks (CNNs) have been applied to binocular stereo matching: by extracting features from and downsampling the binocular images, a CNN can markedly reduce the computation needed for disparity aggregation and calculation. At the present stage, the cost aggregation part of such networks can aggregate costs effectively with 3D convolution and achieve accurate disparity regression. However, 3D convolution is computationally expensive, which is a serious drawback for real-time applications. Some other networks use only 2D convolution for cost aggregation; to do so they compress the channel dimension of the learned features, which loses feature information and reduces accuracy.
Existing neural-network-based binocular stereo matching algorithms fall mainly into two categories: those that use 2D convolution for cost aggregation and those that use 3D convolution. Each has at least the following disadvantages:
The 2D-convolution cost aggregation algorithms compress the channel information of the cost volume generated from the left and right feature maps, forming a four-dimensional cost volume. This allows cost aggregation directly with 2D convolution, but because a large amount of feature information is discarded when compressing the channels, this type of method is not competitive in accuracy.
The 3D-convolution cost aggregation algorithms retain the channel information of the cost volume generated from the left and right feature maps, forming a five-dimensional cost volume that must be aggregated with 3D convolution. Although excellent in accuracy, these methods have no advantage in real-time performance due to the large amount of computation of 3D convolution.
Disclosure of Invention
The embodiment of the invention provides a stereo matching method based on mixed 2D convolution and pseudo 3D convolution, which can ensure the accuracy and greatly reduce the calculation amount.
According to an embodiment of the present invention, there is provided a stereo matching method based on a hybrid 2D convolution and a pseudo 3D convolution, including the steps of:
extracting image features based on preset parameters to obtain a feature map;
generating a cost volume based on the feature map;
aggregating cost through the PSMNet structure and obtaining an initial disparity map through disparity regression; wherein the 3D convolutions in the PSMNet structure are converted into combinations of hybrid 2D convolution and pseudo 3D convolution;
generating a residual cost volume from the initial disparity map, and obtaining a disparity residual that optimizes the initial disparity map through residual cost aggregation; wherein the 3D convolutions of the residual cost aggregation are replaced by combinations of hybrid 2D convolution and pseudo 3D convolution;
and further refining the optimized disparity map into a depth map with the CSPNet method.
Further, the method obtains the initial disparity map through disparity regression using a version of the hourglass structure of PSMNet, with its 3D convolutions converted into the combination of hybrid 2D convolution and pseudo 3D convolution proposed by the present invention.
Further, cost aggregation on the cost volume adopts a depth-shift plus 2D convolution formulation, and on this basis 2D convolutions and pseudo 3D convolutions are arranged at intervals.
Further, the right feature map is warped with the initial disparity map to reconstruct a left feature map, which is then combined with the original left feature map to generate the residual cost volume.
Further, image features are extracted with the PSMNet structure, producing a feature map of shape (32, H/4, W/4), where H is the input image height and W is the input image width.
Further, the cost volume is generated by similarity measurement.
Further, with a 3 × 3 × 3 kernel, the 2D convolution formula expands to:

$$V_{\mathrm{out}}(k, h, w, d) = \sum_{c}\sum_{i=-1}^{1}\sum_{j=-1}^{1}\sum_{z=-1}^{1} K(k, c, i, j, z)\, V(c, h+i, w+j, d+z)$$

where $V$ is the cost volume, $k$ indexes the output channels after convolution, $h$, $w$, $d$ are the height, width and depth coordinates of the feature map, $c$ is the input channel index, and $i$, $j$, $z$ are the offsets along the height, width and depth dimensions respectively.
Further, the disparity is optimized using the convolutional affinity propagation of CSPNet, with 4 disparity-optimization updates.
The beneficial effects of the invention are as follows: image features are extracted based on preset parameters to obtain a feature map; a cost volume is generated from the feature map and aggregated through the PSMNet structure; an initial disparity map is obtained through disparity regression; a residual cost volume is built from the initial disparity map, and after residual aggregation a disparity residual optimizes the initial disparity map, with the 3D convolutions in the PSMNet structure and the residual aggregation converted into combinations of hybrid 2D convolution and pseudo 3D convolution; the optimized disparity map is further refined into a depth map with the CSPNet method. The combination of 2D convolutions approximates the function of a 3D convolution; because the data-shifting operation contains no learnable parameters and adds no computation, the proposed mixed 2D and pseudo 3D convolution cost aggregation greatly reduces the computation of existing models at only a slight loss of accuracy. The invention has at least the following advantages:
1. It offers a solution to the large computation of current models, namely a cost aggregation method mixing 2D convolution and pseudo 3D convolution. The pseudo 3D convolution submodule models depth-dimension information without extra parameters or computation, so the model achieves higher accuracy.
2. Existing stereo matching methods face a large amount of computation, which seriously hinders use in real-time application scenarios; the cost aggregation module based on mixed 2D convolution and pseudo 3D convolution maintains accuracy while greatly reducing computation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of the stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to the present invention;
FIG. 2 is a framework diagram of the HybridNet algorithm of the present invention;
FIG. 3 is a detailed parameter diagram of HybridNet feature extraction according to the present invention;
FIG. 4 is a diagram of the depth shift module according to the present invention;
FIG. 5 is a diagram of the specific parameters of the hybrid 2D convolution and pseudo 3D convolution combination replacing the 3D convolution in HybridNet;
FIG. 6 is a diagram of the specific parameters of the hybrid 2D convolution and pseudo 3D convolution combination in the hourglass-structure version of HybridNet;
FIG. 7 is a diagram of depth optimization using the CSPNet method according to the present invention;
FIGS. 8 and 9 compare HybridNet with prior algorithms on the Scene Flow and KITTI Stereo 2015 datasets;
FIG. 10 is a view of a binocular stereo camera mounted on a vehicle in an application scenario of the present invention;
FIG. 11 is a road scene and depth map from a vehicle-mounted binocular stereo camera in an application scenario of the present invention;
fig. 12 is an example of three-dimensional reconstruction of an object in an application scenario of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in fig. 1 to 12, according to an embodiment of the present invention, there is provided a stereo matching method based on a hybrid 2D convolution and a pseudo 3D convolution, referring to fig. 1, including the following steps:
s101: extracting image features based on preset parameters to obtain a feature map;
In this embodiment, the invention adopts the feature extraction module of PSMNet with the number of channels of each convolution layer halved, obtaining features of shape (32, H/4, W/4), where H is the height and W the width of the input image. The specific parameters are shown in fig. 3.
S102: generating a cost volume based on the feature map;
In this embodiment, the cost volume is generated by similarity measurement, and the feature shape is (32, H/4, W/4, D/4), where D is the maximum disparity value; the present invention takes D = 192.
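The patent does not spell out the exact similarity measure. One common choice that preserves a channel dimension, consistent with the (32, …, D/4) shape stated above, is group-wise correlation; the sketch below assumes that form and is illustrative only:

```python
import numpy as np

def groupwise_cost_volume(feat_l, feat_r, max_disp, num_groups):
    """Similarity-measurement cost volume (sketch): split channels into groups
    and, for each candidate disparity d, correlate left features with right
    features shifted by d. feat_l, feat_r: (C, H, W) -> (G, max_disp, H, W)."""
    C, H, W = feat_l.shape
    assert C % num_groups == 0
    gc = C // num_groups
    fl = feat_l.reshape(num_groups, gc, H, W)
    fr = feat_r.reshape(num_groups, gc, H, W)
    cost = np.zeros((num_groups, max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        # left pixel x is compared against right pixel x - d
        cost[:, d, :, d:] = (fl[:, :, :, d:] * fr[:, :, :, :W - d]).mean(axis=1)
    return cost
```

With 32 groups this yields the four-dimensional per-group volume; columns where x − d falls outside the image are left at zero.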
S103: aggregating cost through the PSMNet structure and obtaining an initial disparity map through disparity regression; wherein the 3D convolutions in the PSMNet structure are converted into combinations of hybrid 2D convolution and pseudo 3D convolution.
In this embodiment, the present invention proposes a depth-shift plus 2D convolution formulation (depth shift module, DSM); the DSM is shown in figure 4.
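The exact channel split of the DSM is left to FIG. 4, which is not reproduced here; the sketch below assumes a TSM-style split in which a fraction of the channels is shifted one step along the disparity axis in each direction. A plain 2D convolution applied afterwards then mixes information across disparities:

```python
import numpy as np

def depth_shift(cost, shift_frac=0.25):
    """Sketch of the depth shift module (DSM); the shift_frac split is an
    assumption. One fraction of the channels is shifted one step forward
    along the disparity axis, an equal fraction one step backward, and the
    rest are left in place. No learnable parameters, essentially no
    computation. cost: (C, D, H, W)."""
    C = cost.shape[0]
    n = max(1, int(C * shift_frac))
    out = cost.copy()
    out[:n, 1:] = cost[:n, :-1]            # forward shift along depth
    out[:n, 0] = 0
    out[n:2 * n, :-1] = cost[n:2 * n, 1:]  # backward shift along depth
    out[n:2 * n, -1] = 0
    return out
```

Because shifting is pure data movement, the depth (disparity) dimension becomes visible to the following 2D convolution for free.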
S104: generating a residual cost volume from the initial disparity map, and obtaining a disparity residual that optimizes the initial disparity map through residual cost aggregation; wherein the 3D convolutions of the residual cost aggregation are replaced by combinations of hybrid 2D convolution and pseudo 3D convolution.
S105: further refining the optimized disparity map into a depth map with the CSPNet method.
In this embodiment, the disparity optimization adopts a CSPNet method to perform depth map optimization. As shown in fig. 7.
In this embodiment, the hourglass structure of PSMNet is used to obtain the initial disparity map; cost aggregation consists of initial disparity regression and residual disparity refinement. Initial disparity regression adopts the PSMNet structure with its 3D convolutions converted into the proposed combination of hybrid 2D convolution and pseudo 3D convolution (specific parameters in FIG. 2); residual disparity refinement adopts a version of the hourglass structure of PSMNet with its 3D convolutions likewise converted (specific parameters in figs. 5 and 6).
Cost aggregation with 3D convolution currently achieves the best stereo matching results, but at the cost of heavy computation; the mixed 2D convolution and pseudo 3D convolution cost aggregation proposed by the present application cuts the computation by more than half. As a simple comparison of the present invention with other methods, figs. 8 and 9 show HybridNet against current algorithms on the Scene Flow and KITTI Stereo 2015 datasets.
When designing the depth shift module (DSM), the relations among all dimensions must be considered, and the number of shifted channels must be adjusted after each downsampling when writing the code; meanwhile, on the stereo matching task a 1 × 1 convolution weakens the effect of cost aggregation.
Fig. 10 to 12 show application scenarios of the invention of the present application:
1. automatic driving
A binocular stereo camera mounted on the vehicle (as shown in figure 10) can estimate distance information within the image range (as shown in figure 11), providing advanced driver assistance with early warnings such as the distance to the vehicle ahead and to obstacles.
2. Binocular three-dimensional reconstruction
The key to binocular three-dimensional reconstruction is generating an accurate depth map through high-precision stereo matching; the three-dimensional reconstruction of a specific object is then completed through triangulation and texture mapping (as shown in figure 12).
The invention provides a stereo matching method based on mixed 2D convolution and pseudo 3D convolution, as shown in figure 1, comprising:
Step one: feature extraction; extracting image features based on preset parameters to obtain a feature map;
Step two: cost volume generation; generating a cost volume based on the feature map;
Step three: initial cost aggregation; aggregating cost through the PSMNet structure and obtaining an initial disparity map through disparity regression, with the 3D convolutions in the PSMNet structure converted into combinations of hybrid 2D convolution and pseudo 3D convolution;
Step four: residual optimization; generating a residual cost volume from the initial disparity map and obtaining a disparity residual that optimizes the initial disparity map through residual cost aggregation, with the 3D convolutions of the residual cost aggregation replaced by combinations of hybrid 2D convolution and pseudo 3D convolution;
Step five: depth optimization; further refining the optimized disparity map into a depth map with the CSPNet method.
In this embodiment, the method further obtains the initial disparity map through disparity regression using a version of the hourglass structure of PSMNet, with its 3D convolutions converted into the proposed combination of hybrid 2D convolution and pseudo 3D convolution; the specific parameters are shown in fig. 6.
In this embodiment, the feature extraction module of PSMNet is adopted with the number of channels of each convolution layer halved, obtaining features of shape (32, H/4, W/4), where H is the input image height and W is the input image width. The specific parameters are shown in fig. 3.
To address the large computation of 3D-convolution-based cost aggregation at the present stage, the invention designs an efficient stereo matching network mixing 2D convolution and pseudo 3D convolution (HybridNet) to realize depth estimation with low computation. During image convolution, when a convolution kernel parameter is 0 or 1, the operation reduces to shifting the corresponding data, so the learnable parameters and computation of that part can be omitted. The invention therefore uses data shifting to model the depth (disparity) dimension and proposes a pseudo 3D convolution module, so that the function of 3D convolution can be approximated by combining 2D convolutions.
In this embodiment, cost aggregation on the cost volume adopts the depth-shift plus 2D convolution formulation, and 2D convolutions and pseudo 3D convolutions are arranged at intervals on this basis.
In this embodiment, arranging 2D convolutions and pseudo 3D convolutions at intervals on top of the depth-shift scheme further reduces inference time while preserving cost aggregation performance.
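The interval arrangement can be sketched as alternating layers: a pseudo 3D layer (depth shift followed by a 2D convolution) on even positions, a plain 2D convolution on odd ones. The layer pattern, channel split and 1 × 1 kernels below are simplifying assumptions to keep the sketch short (the text itself notes that 1 × 1 convolution weakens cost aggregation, so real layers would be larger):

```python
import numpy as np

def conv2d_1x1(x, weight):
    """1 x 1 2D convolution applied slice-by-slice over the disparity axis.
    x: (C_in, D, H, W), weight: (C_out, C_in)."""
    return np.einsum('oc,cdhw->odhw', weight, x)

def depth_shift(x, n=1):
    """Pseudo 3D shift: exchange data one step along the disparity axis for
    the first 2*n channels; no learnable parameters."""
    out = x.copy()
    out[:n, 1:], out[:n, 0] = x[:n, :-1], 0.0
    out[n:2 * n, :-1], out[n:2 * n, -1] = x[n:2 * n, 1:], 0.0
    return out

def hybrid_aggregation(cost, weights):
    """Interval arrangement (sketch): pseudo 3D convolution (shift + 2D conv)
    on even layers, plain 2D convolution on odd layers."""
    for i, w in enumerate(weights):
        if i % 2 == 0:
            cost = depth_shift(cost)
        cost = np.maximum(conv2d_1x1(cost, w), 0.0)  # conv + ReLU
    return cost

rng = np.random.default_rng(0)
cost = rng.normal(size=(8, 4, 5, 6))
weights = [rng.normal(size=(8, 8)) for _ in range(4)]
out = hybrid_aggregation(cost, weights)
print(out.shape)  # (8, 4, 5, 6)
```

Skipping the shift on every other layer is what saves inference time: those layers touch no disparity-axis data movement at all.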
A pseudo 3D convolution module is proposed so that the function of 3D convolution can be approximated by combining 2D convolutions. Because the data-shifting operation contains no learnable parameters and adds no computation, the proposed mixed 2D and pseudo 3D convolution cost aggregation greatly reduces the computation of existing models at only a slight loss of accuracy.
In this embodiment, the right feature map is warped with the initial disparity map to reconstruct a left feature map, which together with the original left feature map generates the residual cost volume.
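The warping step can be sketched as sampling the right feature map at x − d for each left pixel; a nearest-neighbour version is shown here (real implementations use differentiable bilinear sampling so gradients flow through the disparity):

```python
import numpy as np

def warp_right_to_left(feat_r, disp):
    """Reconstruct the left feature map by sampling the right one at x - d.
    Nearest-neighbour sketch. feat_r: (C, H, W), disp: (H, W) in pixels;
    pixels whose source falls outside the image are left at zero."""
    C, H, W = feat_r.shape
    out = np.zeros_like(feat_r)
    for y in range(H):
        for x in range(W):
            xs = x - int(round(disp[y, x]))
            if 0 <= xs < W:
                out[:, y, x] = feat_r[:, y, xs]
    return out
```

The residual cost volume is then built by comparing this reconstruction against the original left feature map: wherever the initial disparity is correct the two agree, so the residual branch only needs to explain the remaining error.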
In this embodiment, image features are extracted with the PSMNet structure, producing features of shape (32, H/4, W/4), where H is the input image height and W is the input image width. The feature extraction module of PSMNet is adopted with the number of channels of each convolution layer halved; the specific parameters are shown in fig. 3.
To further reduce computation, the feature map size could be reduced by additional downsampling, though this costs some accuracy. Other uses: the proposed pseudo 3D convolution also applies to other 3D convolution networks, such as optical flow estimation and point cloud processing.
In this embodiment, the cost volume is generated by similarity measurement, with feature shape (32, H/4, W/4, D/4), where D is the maximum disparity value; this embodiment takes D = 192.
As shown in fig. 4, in this embodiment 2D convolutions and pseudo 3D convolutions are arranged at intervals on top of the depth-shift scheme, further reducing inference time while preserving cost aggregation performance.
With a 3 × 3 × 3 kernel, the 2D convolution formula expands to:

$$V_{\mathrm{out}}(k, h, w, d) = \sum_{c}\sum_{i=-1}^{1}\sum_{j=-1}^{1}\sum_{z=-1}^{1} K(k, c, i, j, z)\, V(c, h+i, w+j, d+z)$$

where $V$ is the cost volume, $k$ indexes the output channels after convolution, $h$, $w$, $d$ are the height, width and depth coordinates of the feature map, $c$ is the input channel index, and $i$, $j$, $z$ are the offsets along the height, width and depth dimensions respectively.
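A direct evaluation of that 3 × 3 × 3 sum makes the cost concrete: each output element needs 27 · C_in multiply-adds, versus 9 · C_in for a 3 × 3 2D convolution — the factor-of-three the hybrid scheme removes. A naive sketch with zero padding:

```python
import numpy as np

def conv3d_3x3x3(volume, kernel):
    """Direct evaluation of the 3 x 3 x 3 convolution sum in the text.
    volume: (C_in, D, H, W); kernel: (C_out, C_in, 3, 3, 3), indexed
    (k, c, z, i, j); zero padding keeps the output the same size."""
    C_in, D, H, W = volume.shape
    padded = np.pad(volume, ((0, 0), (1, 1), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], D, H, W))
    for z in range(3):
        for i in range(3):
            for j in range(3):
                patch = padded[:, z:z + D, i:i + H, j:j + W]
                out += np.einsum('oc,cdhw->odhw', kernel[:, :, z, i, j], patch)
    return out
```

When a kernel tap is 0 or 1 — the case the patent exploits — the corresponding term degenerates to dropping or copying data, which is exactly what the parameter-free depth shift reproduces.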
In this embodiment, disparity is optimized with the convolutional affinity propagation of CSPNet, with 4 disparity-optimization updates.
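The propagation step can be sketched as follows; the 3 × 3 neighbourhood and the assumption that the affinity map is predicted by a small network and softmax-normalized per pixel are illustrative choices, not details fixed by the patent:

```python
import numpy as np

def affinity_propagate(disp, affinity, iters=4):
    """CSPNet-style refinement sketch: each pixel is repeatedly replaced by
    the affinity-weighted average of its 3 x 3 neighbourhood.
    disp: (H, W); affinity: (9, H, W), assumed normalized per pixel."""
    H, W = disp.shape
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    for _ in range(iters):
        padded = np.pad(disp, 1, mode='edge')
        neigh = np.stack([padded[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
                          for dy, dx in offsets])  # (9, H, W)
        disp = (affinity * neigh).sum(axis=0)
    return disp
```

Four iterations, as stated above, let each disparity value draw on a 9 × 9 surrounding window while keeping the per-update cost to a single local weighted average.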
The invention of the present application also addresses universality: the proposed module can be inserted into any 3D convolution network, approximating the effect of 3D convolution at a computation cost close to 2D. Current mainstream stereo matching network designs contain 3D convolution, so the invention can be transferred to any network containing it; the same holds for similar dense regression tasks, such as optical flow estimation and 3D point cloud segmentation.
Compared with TSM and Non-local, two widely used plug-and-play video recognition modules, the second advantage is the balance between computation and accuracy. Like TSM and Non-local, the proposed module can be embedded into mainstream 2D networks, but its effect is higher than that of TSM, and its residual connection makes it more robust; in addition, interleaving 2D convolution and pseudo 3D convolution further reduces computation while preserving the modeling capability along the depth dimension. The computation of Non-local is larger than that of SmallBig, and the result of the proposed module on a 2D network is clearly higher than that of the Non-local + 3D network. This demonstrates that the proposed design has advantages in both computation and accuracy.
For some special application scenarios, such as security, abnormal behaviours or actions are often short and fast-changing. The proposed technique is insensitive to the speed of an action and models actions of different durations well: the Kinetics dataset consists of videos of about 10 s in which actions change slowly (for a shooting action, for instance, the sequence runs from dribbling to preparing to the final shot), whereas Something-Something consists of 2-3 s videos in which an action change, such as giving a thumbs-up, takes no more than 3 s. The proposed module obtains good results on both datasets, showing that it models actions of different durations well.
In addition, the technology has wide application range:
1. Intelligent sports training / video-assisted refereeing: because the technique is insensitive to the speed and duration of actions in video, it applies universally to various sports scenarios, such as slow-moving yoga and fast-changing figure skating or gymnastics.
2. Intelligent video review: abnormal-action recognition and analysis can be completed on the mobile terminal, with only the detected abnormality sent to the cloud server, further improving the speed and efficiency of analysis.
3. Intelligent video montage: given a huge video database, videos of the same action are automatically extracted, edited and summarized.
4. Intelligent security: action recognition can be performed directly on intelligent terminals with limited computing resources, such as smart glasses, unmanned aerial vehicles and smart cameras, with abnormal actions fed back immediately, improving the timeliness and accuracy of patrols.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.
Claims (8)
1. A stereo matching method based on hybrid 2D convolution and pseudo 3D convolution, characterized by comprising the following steps:
extracting image features based on preset parameters to obtain a feature map;
generating a cost volume based on the feature map;
performing cost aggregation with a PSMNet structure to obtain an aggregated cost volume, and obtaining an initial disparity map through disparity regression; wherein the 3D convolutions in the PSMNet structure are converted into a combination of hybrid 2D convolution and pseudo 3D convolution;
generating a residual cost volume from the initial disparity, and obtaining a disparity residual through residual cost aggregation to optimize the initial disparity map; wherein the 3D convolutions of the residual cost aggregation are replaced by a combination of hybrid 2D convolution and pseudo 3D convolution;
and applying the CSPNet method to the optimized disparity map to further optimize the depth map.
2. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the initial disparity map is obtained by disparity regression using a version of the hourglass structure of PSMNet, with its 3D convolutions converted into the hybrid 2D convolution and pseudo 3D convolution proposed by the present invention.
3. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein a depth switching method and cost aggregation with a 2D convolution formula are applied to the cost volume, and the 2D convolutions and pseudo 3D convolutions are arranged at intervals on the basis of the depth switching method.
4. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the initial disparity map is used to warp the right feature map so as to reconstruct the left feature map, and a residual cost volume is then generated together with the original left feature map.
6. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the cost volume is generated by means of a similarity measure.
7. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 3, wherein, when a 3 × 3 × 3 kernel is adopted, the convolution formula is expressed as follows:

Out_{k,h,w,d} = \sum_{c} \sum_{i=-1}^{1} \sum_{j=-1}^{1} \sum_{z=-1}^{1} W_{k,c,i,j,z} \cdot V_{c,\,h+i,\,w+j,\,d+z}

where V is the cost volume, k indexes the output channels after convolution, h, w, d are the height, width and depth of the feature map respectively, c indexes the input channels, and i, j, z are the offsets along the height, width and depth dimensions respectively.
8. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the CSPNet disparity optimization method optimizes the disparity using the convolutional affinity propagation of CSPNet, with 4 disparity optimization updates.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011436492.0A CN112489097B (en) | 2020-12-11 | 2020-12-11 | Stereo matching method based on mixed 2D convolution and pseudo 3D convolution |
PCT/CN2020/139400 WO2022120988A1 (en) | 2020-12-11 | 2020-12-25 | Stereo matching method based on hybrid 2d convolution and pseudo 3d convolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112489097A true CN112489097A (en) | 2021-03-12 |
CN112489097B CN112489097B (en) | 2024-05-17 |
Family
ID=74940986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011436492.0A Active CN112489097B (en) | 2020-12-11 | 2020-12-11 | Stereo matching method based on mixed 2D convolution and pseudo 3D convolution |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112489097B (en) |
WO (1) | WO2022120988A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023240764A1 (en) * | 2022-06-17 | 2023-12-21 | 五邑大学 | Hybrid cost body binocular stereo matching method, device and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116703999A (en) * | 2023-08-04 | 2023-09-05 | 东莞市爱培科技术有限公司 | Residual fusion method for binocular stereo matching |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106355570A (en) * | 2016-10-21 | 2017-01-25 | 昆明理工大学 | Binocular stereoscopic vision matching method combining depth characteristics |
CN110533712A (en) * | 2019-08-26 | 2019-12-03 | 北京工业大学 | A kind of binocular solid matching process based on convolutional neural networks |
CN111583313A (en) * | 2020-03-25 | 2020-08-25 | 上海物联网有限公司 | Improved binocular stereo matching method based on PSmNet |
CN111696148A (en) * | 2020-06-17 | 2020-09-22 | 中国科学技术大学 | End-to-end stereo matching method based on convolutional neural network |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472819B (en) * | 2018-09-06 | 2021-12-28 | 杭州电子科技大学 | Binocular parallax estimation method based on cascade geometric context neural network |
CN109816710B (en) * | 2018-12-13 | 2023-08-29 | 中山大学 | Parallax calculation method for binocular vision system with high precision and no smear |
CN111402311B (en) * | 2020-03-09 | 2023-04-14 | 福建帝视信息科技有限公司 | Knowledge distillation-based lightweight stereo parallax estimation method |
Non-Patent Citations (2)
Title |
---|
CHANGJIANG CAI 等: "Do End-to-end Stereo Algorithms Under-utilize Information?", 2020 INTERNATIONAL CONFERENCE ON 3D VISION (3DV), pages 2 * |
HAIHUA LU 等: "Cascaded Multi-scale and Multi-dimension Convolutional Neural Network for Stereo Matching", 2018 IEEE VISUAL COMMUNICATIONS AND IMAGE PROCESSING(VCIP) * |
Also Published As
Publication number | Publication date |
---|---|
WO2022120988A1 (en) | 2022-06-16 |
CN112489097B (en) | 2024-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
CN108734776B (en) | Speckle-based three-dimensional face reconstruction method and equipment | |
CN111402311B (en) | Knowledge distillation-based lightweight stereo parallax estimation method | |
CN106530333B (en) | Interest frequency solid matching method based on binding constraint | |
CN106952247B (en) | Double-camera terminal and image processing method and system thereof | |
CN110021043A (en) | A kind of scene depth acquisition methods based on Stereo matching and confidence spread | |
CN103136750A (en) | Stereo matching optimization method of binocular visual system | |
CN111583313A (en) | Improved binocular stereo matching method based on PSmNet | |
CN116222577B (en) | Closed loop detection method, training method, system, electronic equipment and storage medium | |
CN113763446B (en) | Three-dimensional matching method based on guide information | |
CN112489097A (en) | Stereo matching method based on mixed 2D convolution and pseudo 3D convolution | |
CN115329111B (en) | Image feature library construction method and system based on point cloud and image matching | |
CN113705796A (en) | Light field depth acquisition convolutional neural network based on EPI feature enhancement | |
CN116612468A (en) | Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism | |
Xu et al. | High-speed stereo matching algorithm for ultra-high resolution binocular image | |
Kallwies et al. | Triple-SGM: stereo processing using semi-global matching with cost fusion | |
CN108681753A (en) | A kind of image solid matching method and system based on semantic segmentation and neural network | |
CN114742875A (en) | Binocular stereo matching method based on multi-scale feature extraction and self-adaptive aggregation | |
CN111462211A (en) | Binocular parallax calculation method based on convolutional neural network | |
CN112270701B (en) | Parallax prediction method, system and storage medium based on packet distance network | |
CN117132737B (en) | Three-dimensional building model construction method, system and equipment | |
CN115908992B (en) | Binocular stereo matching method, device, equipment and storage medium | |
CN117152580A (en) | Binocular stereoscopic vision matching network construction method and binocular stereoscopic vision matching method | |
CN110610503A (en) | Three-dimensional information recovery method for power disconnecting link based on stereo matching | |
CN115880555A (en) | Target detection method, model training method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||