CN112489097A - Stereo matching method based on mixed 2D convolution and pseudo 3D convolution - Google Patents
- Publication number
- CN112489097A (application number CN202011436492.0A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- pseudo
- cost
- hybrid
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/33 — Determination of transform parameters for the alignment of images (image registration) using feature-based methods
- G06T17/00 — Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20228 — Disparity calculation for image-based rendering
Abstract
The invention relates to the field of computer vision, and in particular to a stereo matching method (HybridNet) based on mixed 2D convolution and pseudo 3D convolution. The method comprises the following steps: extracting image features based on preset parameters to obtain a feature map; generating a cost volume from the feature map; aggregating the cost volume through a PSMNet structure; obtaining an initial disparity map through disparity regression; building a residual cost volume from the initial disparity map and, after residual aggregation, obtaining a disparity residual that optimizes the initial disparity map. In both the PSMNet structure and the residual aggregation, the 3D convolutions are converted into combinations of hybrid 2D convolution and pseudo 3D convolution, and the disparity map is further refined into a depth map with the CSPNet method. The combination of 2D convolutions approximates the function of a 3D convolution; because the data-shifting operation contains no learnable parameters and adds no computation, the proposed cost aggregation scheme of mixed 2D and pseudo 3D convolution greatly reduces the computation of existing models at only a slight loss of accuracy.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a stereo matching method based on mixed 2D convolution and pseudo 3D convolution.
Background
Stereo matching, as a basic task of stereo vision, is widely applied in fields such as automatic driving, three-dimensional reconstruction and virtual reality. By calculating the disparity between the rectified left and right views, the distance of an object can be recovered from the geometry of similar triangles. Compared with common active ranging sensors such as lidar, a binocular stereo camera can acquire a dense depth map at a cost far lower than that of an active sensor.
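The similar-triangle relation mentioned above reduces to depth = focal length × baseline / disparity. A minimal sketch (the focal length and baseline values below are illustrative assumptions, not figures from the patent):

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth from disparity via the similar-triangle relation Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers (assumed, KITTI-like rig): 720 px focal length, 0.54 m baseline.
print(disparity_to_depth(disparity_px=27.0, focal_px=720.0, baseline_m=0.54))  # 14.4
```

Note that depth resolution degrades quadratically with distance, which is why sub-pixel disparity accuracy matters for far objects.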
A conventional stereo matching algorithm computes the disparity of the left and right views in four main steps: cost computation, cost aggregation, disparity computation and disparity optimization. Such algorithms often suffer from low disparity accuracy and a large amount of computation. In recent years, convolutional neural networks (CNNs) have been applied to binocular stereo matching: by extracting features from and downsampling the binocular images, a CNN can markedly reduce the computation needed for disparity aggregation and calculation. At the present stage, the cost aggregation part of such networks can aggregate costs effectively with 3D convolution and achieve accurate disparity regression. However, 3D convolution is computationally expensive, which is a serious drawback for real-time applications. Some other networks use only 2D convolution for cost aggregation; to do so they compress the channel dimension of the learned features, which loses feature information and reduces accuracy.
Existing neural-network-based binocular stereo matching algorithms fall mainly into two categories: those that use 2D convolution for cost aggregation and those that use 3D convolution. Each has at least the following disadvantages:
The 2D-convolution cost aggregation algorithms compress the channel information of the cost volume generated from the left and right feature maps, forming a four-dimensional cost volume. This allows cost aggregation directly with 2D convolution, but because a large amount of feature information is discarded when compressing the channels, this type of method is not competitive in accuracy.
The 3D-convolution cost aggregation algorithms retain the channel information of the cost volume generated from the left and right feature maps, forming a five-dimensional cost volume that must be aggregated with 3D convolution. Although excellent in accuracy, these methods have no advantage in real-time performance due to the large amount of computation of 3D convolution.
Disclosure of Invention
The embodiment of the invention provides a stereo matching method based on mixed 2D convolution and pseudo 3D convolution, which can ensure the accuracy and greatly reduce the calculation amount.
According to an embodiment of the present invention, there is provided a stereo matching method based on a hybrid 2D convolution and a pseudo 3D convolution, including the steps of:
extracting image features based on preset parameters to obtain a feature map;
generating a cost volume based on the feature map;
aggregating cost through the PSMNet structure and obtaining an initial disparity map through disparity regression; wherein the 3D convolutions in the PSMNet structure are converted into combinations of hybrid 2D convolution and pseudo 3D convolution;
generating a residual cost volume from the initial disparity map, and obtaining a disparity residual that optimizes the initial disparity map through residual cost aggregation; wherein the 3D convolutions of the residual cost aggregation are replaced by combinations of hybrid 2D convolution and pseudo 3D convolution;
and further refining the optimized disparity map into a depth map with the CSPNet method.
Further, the method obtains the initial disparity map through disparity regression using a version of the hourglass structure of PSMNet, with its 3D convolutions converted into the combination of hybrid 2D convolution and pseudo 3D convolution proposed by the present invention.
Further, cost aggregation on the cost volume adopts a depth-shift plus 2D convolution formulation, and on this basis 2D convolutions and pseudo 3D convolutions are arranged at intervals.
Further, the right feature map is warped with the initial disparity map to reconstruct a left feature map, which is then combined with the original left feature map to generate the residual cost volume.
Further, image features are extracted with the PSMNet structure, producing a feature map of shape (32, H/4, W/4), where H is the input image height and W is the input image width.
Further, the cost volume is generated by similarity measurement.
Further, with a 3 × 3 × 3 kernel, the 2D convolution formula expands to:

$$V_{\mathrm{out}}(k, h, w, d) = \sum_{c}\sum_{i=-1}^{1}\sum_{j=-1}^{1}\sum_{z=-1}^{1} K(k, c, i, j, z)\, V(c, h+i, w+j, d+z)$$

where $V$ is the cost volume, $k$ indexes the output channels after convolution, $h$, $w$, $d$ are the height, width and depth coordinates of the feature map, $c$ is the input channel index, and $i$, $j$, $z$ are the offsets along the height, width and depth dimensions respectively.
Further, the disparity is optimized using the convolutional affinity propagation of CSPNet, with 4 disparity-optimization updates.
The beneficial effects of the invention are as follows: image features are extracted based on preset parameters to obtain a feature map; a cost volume is generated from the feature map and aggregated through the PSMNet structure; an initial disparity map is obtained through disparity regression; a residual cost volume is built from the initial disparity map, and after residual aggregation a disparity residual optimizes the initial disparity map, with the 3D convolutions in the PSMNet structure and the residual aggregation converted into combinations of hybrid 2D convolution and pseudo 3D convolution; the optimized disparity map is further refined into a depth map with the CSPNet method. The combination of 2D convolutions approximates the function of a 3D convolution; because the data-shifting operation contains no learnable parameters and adds no computation, the proposed mixed 2D and pseudo 3D convolution cost aggregation greatly reduces the computation of existing models at only a slight loss of accuracy. The invention has at least the following advantages:
1. It offers a solution to the large computation of current models, namely a cost aggregation method mixing 2D convolution and pseudo 3D convolution. The pseudo 3D convolution submodule models depth-dimension information without extra parameters or computation, so the model achieves higher accuracy.
2. Existing stereo matching methods face a large amount of computation, which seriously hinders use in real-time application scenarios; the cost aggregation module based on mixed 2D convolution and pseudo 3D convolution maintains accuracy while greatly reducing computation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of the stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to the present invention;
FIG. 2 is a framework diagram of the HybridNet algorithm of the present invention;
FIG. 3 is a detailed parameter diagram of HybridNet feature extraction according to the present invention;
FIG. 4 is a diagram of the depth shift module according to the present invention;
FIG. 5 is a diagram of the specific parameters of the hybrid 2D convolution and pseudo 3D convolution combination replacing the 3D convolution in HybridNet;
FIG. 6 is a diagram of the specific parameters of the hybrid 2D convolution and pseudo 3D convolution combination in the hourglass-structure version of HybridNet;
FIG. 7 is a diagram of depth optimization using the CSPNet method according to the present invention;
FIGS. 8 and 9 compare HybridNet with prior algorithms on the Scene Flow and KITTI Stereo 2015 datasets;
FIG. 10 is a view of a binocular stereo camera mounted on a vehicle in an application scenario of the present invention;
FIG. 11 is a road scene and depth map from a vehicle-mounted binocular stereo camera in an application scenario of the present invention;
fig. 12 is an example of three-dimensional reconstruction of an object in an application scenario of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in fig. 1 to 12, according to an embodiment of the present invention, there is provided a stereo matching method based on a hybrid 2D convolution and a pseudo 3D convolution, referring to fig. 1, including the following steps:
s101: extracting image features based on preset parameters to obtain a feature map;
In this embodiment, the invention adopts the feature extraction module of PSMNet with the number of channels of each convolution layer halved, obtaining features of shape (32, H/4, W/4), where H is the height and W the width of the input image. The specific parameters are shown in fig. 3.
S102: generating a cost volume based on the feature map;
In this embodiment, the cost volume is generated by similarity measurement, and the feature shape is (32, H/4, W/4, D/4), where D is the maximum disparity value; the present invention takes D = 192.
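The patent does not spell out the exact similarity measure. One common choice that preserves a channel dimension, consistent with the (32, …, D/4) shape stated above, is group-wise correlation; the sketch below assumes that form and is illustrative only:

```python
import numpy as np

def groupwise_cost_volume(feat_l, feat_r, max_disp, num_groups):
    """Similarity-measurement cost volume (sketch): split channels into groups
    and, for each candidate disparity d, correlate left features with right
    features shifted by d. feat_l, feat_r: (C, H, W) -> (G, max_disp, H, W)."""
    C, H, W = feat_l.shape
    assert C % num_groups == 0
    gc = C // num_groups
    fl = feat_l.reshape(num_groups, gc, H, W)
    fr = feat_r.reshape(num_groups, gc, H, W)
    cost = np.zeros((num_groups, max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        # left pixel x is compared against right pixel x - d
        cost[:, d, :, d:] = (fl[:, :, :, d:] * fr[:, :, :, :W - d]).mean(axis=1)
    return cost
```

With 32 groups this yields the four-dimensional per-group volume; columns where x − d falls outside the image are left at zero.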
S103: aggregating cost through the PSMNet structure and obtaining an initial disparity map through disparity regression; wherein the 3D convolutions in the PSMNet structure are converted into combinations of hybrid 2D convolution and pseudo 3D convolution.
In this embodiment, the present invention proposes a depth-shift plus 2D convolution formulation (depth shift module, DSM); the DSM is shown in figure 4.
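The exact channel split of the DSM is left to FIG. 4, which is not reproduced here; the sketch below assumes a TSM-style split in which a fraction of the channels is shifted one step along the disparity axis in each direction. A plain 2D convolution applied afterwards then mixes information across disparities:

```python
import numpy as np

def depth_shift(cost, shift_frac=0.25):
    """Sketch of the depth shift module (DSM); the shift_frac split is an
    assumption. One fraction of the channels is shifted one step forward
    along the disparity axis, an equal fraction one step backward, and the
    rest are left in place. No learnable parameters, essentially no
    computation. cost: (C, D, H, W)."""
    C = cost.shape[0]
    n = max(1, int(C * shift_frac))
    out = cost.copy()
    out[:n, 1:] = cost[:n, :-1]            # forward shift along depth
    out[:n, 0] = 0
    out[n:2 * n, :-1] = cost[n:2 * n, 1:]  # backward shift along depth
    out[n:2 * n, -1] = 0
    return out
```

Because shifting is pure data movement, the depth (disparity) dimension becomes visible to the following 2D convolution for free.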
S104: generating a residual cost volume from the initial disparity map, and obtaining a disparity residual that optimizes the initial disparity map through residual cost aggregation; wherein the 3D convolutions of the residual cost aggregation are replaced by combinations of hybrid 2D convolution and pseudo 3D convolution.
S105: further refining the optimized disparity map into a depth map with the CSPNet method.
In this embodiment, the disparity optimization adopts a CSPNet method to perform depth map optimization. As shown in fig. 7.
In this embodiment, the hourglass structure of PSMNet is used to obtain the initial disparity map; cost aggregation consists of initial disparity regression and residual disparity refinement. Initial disparity regression adopts the PSMNet structure with its 3D convolutions converted into the proposed combination of hybrid 2D convolution and pseudo 3D convolution (specific parameters in FIG. 2); residual disparity refinement adopts a version of the hourglass structure of PSMNet with its 3D convolutions likewise converted (specific parameters in figs. 5 and 6).
Cost aggregation with 3D convolution currently achieves the best stereo matching results, but at the cost of heavy computation; the mixed 2D convolution and pseudo 3D convolution cost aggregation proposed by the present application cuts the computation by more than half. As a simple comparison of the present invention with other methods, figs. 8 and 9 show HybridNet against current algorithms on the Scene Flow and KITTI Stereo 2015 datasets.
When designing the depth shift module (DSM), the relations among all dimensions must be considered, and the number of shifted channels must be adjusted after each downsampling when writing the code; meanwhile, on the stereo matching task a 1 × 1 convolution weakens the effect of cost aggregation.
Fig. 10 to 12 show application scenarios of the invention of the present application:
1. automatic driving
A binocular stereo camera mounted on the vehicle (as shown in figure 10) can estimate distance information within the image range (as shown in figure 11), providing advanced driver assistance with early warnings such as the distance to the vehicle ahead and to obstacles.
2. Binocular three-dimensional reconstruction
The key to binocular three-dimensional reconstruction is generating an accurate depth map through high-precision stereo matching; the three-dimensional reconstruction of a specific object is then completed through triangulation and texture mapping (as shown in figure 12).
The invention provides a stereo matching method based on mixed 2D convolution and pseudo 3D convolution, as shown in figure 1, comprising:
Step one: feature extraction; extracting image features based on preset parameters to obtain a feature map;
Step two: cost volume generation; generating a cost volume based on the feature map;
Step three: initial cost aggregation; aggregating cost through the PSMNet structure and obtaining an initial disparity map through disparity regression, with the 3D convolutions in the PSMNet structure converted into combinations of hybrid 2D convolution and pseudo 3D convolution;
Step four: residual optimization; generating a residual cost volume from the initial disparity map and obtaining a disparity residual that optimizes the initial disparity map through residual cost aggregation, with the 3D convolutions of the residual cost aggregation replaced by combinations of hybrid 2D convolution and pseudo 3D convolution;
Step five: depth optimization; further refining the optimized disparity map into a depth map with the CSPNet method.
In this embodiment, the method further obtains the initial disparity map through disparity regression using a version of the hourglass structure of PSMNet, with its 3D convolutions converted into the proposed combination of hybrid 2D convolution and pseudo 3D convolution; the specific parameters are shown in fig. 6.
In this embodiment, the feature extraction module of PSMNet is adopted with the number of channels of each convolution layer halved, obtaining features of shape (32, H/4, W/4), where H is the input image height and W is the input image width. The specific parameters are shown in fig. 3.
To address the large computation of 3D-convolution-based cost aggregation at the present stage, the invention designs an efficient stereo matching network mixing 2D convolution and pseudo 3D convolution (HybridNet) to realize depth estimation with low computation. During image convolution, when a convolution kernel parameter is 0 or 1, the operation reduces to shifting the corresponding data, so the learnable parameters and computation of that part can be omitted. The invention therefore uses data shifting to model the depth (disparity) dimension and proposes a pseudo 3D convolution module, so that the function of 3D convolution can be approximated by combining 2D convolutions.
In this embodiment, cost aggregation on the cost volume adopts the depth-shift plus 2D convolution formulation, and 2D convolutions and pseudo 3D convolutions are arranged at intervals on this basis.
In this embodiment, arranging 2D convolutions and pseudo 3D convolutions at intervals on top of the depth-shift scheme further reduces inference time while preserving cost aggregation performance.
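The interval arrangement can be sketched as alternating layers: a pseudo 3D layer (depth shift followed by a 2D convolution) on even positions, a plain 2D convolution on odd ones. The layer pattern, channel split and 1 × 1 kernels below are simplifying assumptions to keep the sketch short (the text itself notes that 1 × 1 convolution weakens cost aggregation, so real layers would be larger):

```python
import numpy as np

def conv2d_1x1(x, weight):
    """1 x 1 2D convolution applied slice-by-slice over the disparity axis.
    x: (C_in, D, H, W), weight: (C_out, C_in)."""
    return np.einsum('oc,cdhw->odhw', weight, x)

def depth_shift(x, n=1):
    """Pseudo 3D shift: exchange data one step along the disparity axis for
    the first 2*n channels; no learnable parameters."""
    out = x.copy()
    out[:n, 1:], out[:n, 0] = x[:n, :-1], 0.0
    out[n:2 * n, :-1], out[n:2 * n, -1] = x[n:2 * n, 1:], 0.0
    return out

def hybrid_aggregation(cost, weights):
    """Interval arrangement (sketch): pseudo 3D convolution (shift + 2D conv)
    on even layers, plain 2D convolution on odd layers."""
    for i, w in enumerate(weights):
        if i % 2 == 0:
            cost = depth_shift(cost)
        cost = np.maximum(conv2d_1x1(cost, w), 0.0)  # conv + ReLU
    return cost

rng = np.random.default_rng(0)
cost = rng.normal(size=(8, 4, 5, 6))
weights = [rng.normal(size=(8, 8)) for _ in range(4)]
out = hybrid_aggregation(cost, weights)
print(out.shape)  # (8, 4, 5, 6)
```

Skipping the shift on every other layer is what saves inference time: those layers touch no disparity-axis data movement at all.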
A pseudo 3D convolution module is proposed so that the function of 3D convolution can be approximated by combining 2D convolutions. Because the data-shifting operation contains no learnable parameters and adds no computation, the proposed mixed 2D and pseudo 3D convolution cost aggregation greatly reduces the computation of existing models at only a slight loss of accuracy.
In this embodiment, the right feature map is warped with the initial disparity map to reconstruct a left feature map, which together with the original left feature map generates the residual cost volume.
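The warping step can be sketched as sampling the right feature map at x − d for each left pixel; a nearest-neighbour version is shown here (real implementations use differentiable bilinear sampling so gradients flow through the disparity):

```python
import numpy as np

def warp_right_to_left(feat_r, disp):
    """Reconstruct the left feature map by sampling the right one at x - d.
    Nearest-neighbour sketch. feat_r: (C, H, W), disp: (H, W) in pixels;
    pixels whose source falls outside the image are left at zero."""
    C, H, W = feat_r.shape
    out = np.zeros_like(feat_r)
    for y in range(H):
        for x in range(W):
            xs = x - int(round(disp[y, x]))
            if 0 <= xs < W:
                out[:, y, x] = feat_r[:, y, xs]
    return out
```

The residual cost volume is then built by comparing this reconstruction against the original left feature map: wherever the initial disparity is correct the two agree, so the residual branch only needs to explain the remaining error.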
In this embodiment, image features are extracted with the PSMNet structure, producing features of shape (32, H/4, W/4), where H is the input image height and W is the input image width. The feature extraction module of PSMNet is adopted with the number of channels of each convolution layer halved; the specific parameters are shown in fig. 3.
To further reduce computation, the feature map size could be reduced by additional downsampling, though this costs some accuracy. Other uses: the proposed pseudo 3D convolution also applies to other 3D convolution networks, such as optical flow estimation and point cloud processing.
In this embodiment, the cost volume is generated by similarity measurement, with feature shape (32, H/4, W/4, D/4), where D is the maximum disparity value; this embodiment takes D = 192.
As shown in fig. 4, in this embodiment 2D convolutions and pseudo 3D convolutions are arranged at intervals on top of the depth-shift scheme, further reducing inference time while preserving cost aggregation performance.
With a 3 × 3 × 3 kernel, the 2D convolution formula expands to:

$$V_{\mathrm{out}}(k, h, w, d) = \sum_{c}\sum_{i=-1}^{1}\sum_{j=-1}^{1}\sum_{z=-1}^{1} K(k, c, i, j, z)\, V(c, h+i, w+j, d+z)$$

where $V$ is the cost volume, $k$ indexes the output channels after convolution, $h$, $w$, $d$ are the height, width and depth coordinates of the feature map, $c$ is the input channel index, and $i$, $j$, $z$ are the offsets along the height, width and depth dimensions respectively.
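A direct evaluation of that 3 × 3 × 3 sum makes the cost concrete: each output element needs 27 · C_in multiply-adds, versus 9 · C_in for a 3 × 3 2D convolution — the factor-of-three the hybrid scheme removes. A naive sketch with zero padding:

```python
import numpy as np

def conv3d_3x3x3(volume, kernel):
    """Direct evaluation of the 3 x 3 x 3 convolution sum in the text.
    volume: (C_in, D, H, W); kernel: (C_out, C_in, 3, 3, 3), indexed
    (k, c, z, i, j); zero padding keeps the output the same size."""
    C_in, D, H, W = volume.shape
    padded = np.pad(volume, ((0, 0), (1, 1), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], D, H, W))
    for z in range(3):
        for i in range(3):
            for j in range(3):
                patch = padded[:, z:z + D, i:i + H, j:j + W]
                out += np.einsum('oc,cdhw->odhw', kernel[:, :, z, i, j], patch)
    return out
```

When a kernel tap is 0 or 1 — the case the patent exploits — the corresponding term degenerates to dropping or copying data, which is exactly what the parameter-free depth shift reproduces.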
In this embodiment, disparity is optimized with the convolutional affinity propagation of CSPNet, with 4 disparity-optimization updates.
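The propagation step can be sketched as follows; the 3 × 3 neighbourhood and the assumption that the affinity map is predicted by a small network and softmax-normalized per pixel are illustrative choices, not details fixed by the patent:

```python
import numpy as np

def affinity_propagate(disp, affinity, iters=4):
    """CSPNet-style refinement sketch: each pixel is repeatedly replaced by
    the affinity-weighted average of its 3 x 3 neighbourhood.
    disp: (H, W); affinity: (9, H, W), assumed normalized per pixel."""
    H, W = disp.shape
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    for _ in range(iters):
        padded = np.pad(disp, 1, mode='edge')
        neigh = np.stack([padded[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
                          for dy, dx in offsets])  # (9, H, W)
        disp = (affinity * neigh).sum(axis=0)
    return disp
```

Four iterations, as stated above, let each disparity value draw on a 9 × 9 surrounding window while keeping the per-update cost to a single local weighted average.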
The invention of the present application also addresses universality: the proposed module can be inserted into any 3D convolution network, approximating the effect of 3D convolution at a computation cost close to 2D. Current mainstream stereo matching network designs contain 3D convolution, so the invention can be transferred to any network containing it; the same holds for similar dense regression tasks, such as optical flow estimation and 3D point cloud segmentation.
Compared with TSM and Non-local, two widely used plug-and-play video recognition modules, the second advantage is the balance between computation and accuracy. Like TSM and Non-local, the proposed module can be embedded into mainstream 2D networks, but its effect is higher than that of TSM, and its residual connection makes it more robust; in addition, interleaving 2D convolution and pseudo 3D convolution further reduces computation while preserving the modeling capability along the depth dimension. The computation of Non-local is larger than that of SmallBig, and the result of the proposed module on a 2D network is clearly higher than that of the Non-local + 3D network. This demonstrates that the proposed design has advantages in both computation and accuracy.
For some special application scenarios, such as security, abnormal behaviours or actions are often short and fast-changing. The proposed technique is insensitive to the speed of an action and models actions of different durations well: the Kinetics dataset consists of videos of about 10 s in which actions change slowly (for a shooting action, for instance, the sequence runs from dribbling to preparing to the final shot), whereas Something-Something consists of 2-3 s videos in which an action change, such as giving a thumbs-up, takes no more than 3 s. The proposed module obtains good results on both datasets, showing that it models actions of different durations well.
In addition, the technology has wide application range:
1. Intelligent sports training / video-assisted refereeing: because the technique is insensitive to the speed and duration of actions in video, it applies universally to various sports scenarios, such as slow-moving yoga and fast-changing figure skating or gymnastics.
2. Intelligent video review: abnormal-action recognition and analysis can be completed on the mobile terminal, with only the detected abnormality sent to the cloud server, further improving the speed and efficiency of analysis.
3. Intelligent video montage: given a huge video database, videos of the same action are automatically extracted, edited and summarized.
4. Intelligent security: action recognition can be performed directly on intelligent terminals with limited computing resources, such as smart glasses, unmanned aerial vehicles and smart cameras, with abnormal actions fed back immediately, improving the timeliness and accuracy of patrols.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.
Claims (8)
1. A stereo matching method based on hybrid 2D convolution and pseudo 3D convolution, characterized by comprising the following steps:
extracting image features based on preset parameters to obtain a feature map;
generating a cost volume based on the feature map;
performing cost aggregation with a PSMNet structure to obtain an aggregated cost volume, and obtaining an initial disparity map through disparity regression; wherein the 3D convolutions in the PSMNet structure are converted into a combination of hybrid 2D convolution and pseudo 3D convolution;
generating a residual cost volume from the initial disparity, and obtaining a disparity residual through residual cost aggregation to optimize the initial disparity map; wherein the 3D convolutions of the residual cost aggregation are replaced by a combination of hybrid 2D convolution and pseudo 3D convolution;
and applying the CSPNet method to the optimized disparity map to further optimize the depth map.
2. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the initial disparity map is obtained by disparity regression using a version of the hourglass structure of PSMNet, with its 3D convolutions converted into the hybrid 2D convolution and pseudo 3D convolution proposed by the present invention.
3. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein a depth switching method and cost aggregation with a 2D convolution formula are applied to the cost volume, and the 2D convolutions and pseudo 3D convolutions are arranged at intervals on the basis of the depth switching method.
4. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the initial disparity map is used to warp the right feature map so as to reconstruct the left feature map, and a residual cost volume is then generated together with the original left feature map.
6. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the cost volume is generated by means of a similarity measure.
7. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 3, wherein, when a 3 × 3 × 3 kernel is adopted, the convolution formula is expressed as follows:

Out_{k,h,w,d} = \sum_{c} \sum_{i=-1}^{1} \sum_{j=-1}^{1} \sum_{z=-1}^{1} W_{k,c,i,j,z} \cdot V_{c,\,h+i,\,w+j,\,d+z}

where V is the cost volume, k indexes the output channels after convolution, h, w, d are the height, width and depth of the feature map respectively, c indexes the input channels, and i, j, z are the offsets along the height, width and depth dimensions respectively.
8. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the CSPNet disparity optimization method optimizes the disparity using the convolutional affinity propagation of CSPNet, with 4 disparity optimization updates.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011436492.0A CN112489097B (en) | 2020-12-11 | 2020-12-11 | Stereo matching method based on mixed 2D convolution and pseudo 3D convolution |
PCT/CN2020/139400 WO2022120988A1 (en) | 2020-12-11 | 2020-12-25 | Stereo matching method based on hybrid 2d convolution and pseudo 3d convolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112489097A true CN112489097A (en) | 2021-03-12 |
CN112489097B CN112489097B (en) | 2024-05-17 |
Family
ID=74940986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011436492.0A Active CN112489097B (en) | 2020-12-11 | 2020-12-11 | Stereo matching method based on mixed 2D convolution and pseudo 3D convolution |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112489097B (en) |
WO (1) | WO2022120988A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023240764A1 (en) * | 2022-06-17 | 2023-12-21 | 五邑大学 | Hybrid cost body binocular stereo matching method, device and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116703999A (en) * | 2023-08-04 | 2023-09-05 | 东莞市爱培科技术有限公司 | Residual fusion method for binocular stereo matching |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106355570A (en) * | 2016-10-21 | 2017-01-25 | 昆明理工大学 | Binocular stereoscopic vision matching method combining depth characteristics |
CN110533712A (en) * | 2019-08-26 | 2019-12-03 | 北京工业大学 | A kind of binocular solid matching process based on convolutional neural networks |
CN111583313A (en) * | 2020-03-25 | 2020-08-25 | 上海物联网有限公司 | Improved binocular stereo matching method based on PSmNet |
CN111696148A (en) * | 2020-06-17 | 2020-09-22 | 中国科学技术大学 | End-to-end stereo matching method based on convolutional neural network |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472819B (en) * | 2018-09-06 | 2021-12-28 | 杭州电子科技大学 | Binocular parallax estimation method based on cascade geometric context neural network |
CN109816710B (en) * | 2018-12-13 | 2023-08-29 | 中山大学 | Parallax calculation method for binocular vision system with high precision and no smear |
CN111402311B (en) * | 2020-03-09 | 2023-04-14 | 福建帝视信息科技有限公司 | Knowledge distillation-based lightweight stereo parallax estimation method |
Non-Patent Citations (2)
Title |
---|
CHANGJIANG CAI 等: "Do End-to-end Stereo Algorithms Under-utilize Information?", 2020 INTERNATIONAL CONFERENCE ON 3D VISION (3DV), pages 2 * |
HAIHUA LU 等: "Cascaded Multi-scale and Multi-dimension Convolutional Neural Network for Stereo Matching", 2018 IEEE VISUAL COMMUNICATIONS AND IMAGE PROCESSING(VCIP) * |
Also Published As
Publication number | Publication date |
---|---|
WO2022120988A1 (en) | 2022-06-16 |
CN112489097B (en) | 2024-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
CN108734776B (en) | Speckle-based three-dimensional face reconstruction method and equipment | |
CN111402311B (en) | Knowledge distillation-based lightweight stereo parallax estimation method | |
CN106530333B (en) | Interest frequency solid matching method based on binding constraint | |
CN106952247B (en) | Double-camera terminal and image processing method and system thereof | |
CN110021043A (en) | A kind of scene depth acquisition methods based on Stereo matching and confidence spread | |
CN103136750A (en) | Stereo matching optimization method of binocular visual system | |
CN111583313A (en) | Improved binocular stereo matching method based on PSmNet | |
CN116222577B (en) | Closed loop detection method, training method, system, electronic equipment and storage medium | |
CN113763446B (en) | Three-dimensional matching method based on guide information | |
CN112489097A (en) | Stereo matching method based on mixed 2D convolution and pseudo 3D convolution | |
CN115329111B (en) | Image feature library construction method and system based on point cloud and image matching | |
CN113705796A (en) | Light field depth acquisition convolutional neural network based on EPI feature enhancement | |
CN116612468A (en) | Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism | |
Xu et al. | High-speed stereo matching algorithm for ultra-high resolution binocular image | |
Kallwies et al. | Triple-SGM: stereo processing using semi-global matching with cost fusion | |
CN108681753A (en) | A kind of image solid matching method and system based on semantic segmentation and neural network | |
CN114742875A (en) | Binocular stereo matching method based on multi-scale feature extraction and self-adaptive aggregation | |
CN111462211A (en) | Binocular parallax calculation method based on convolutional neural network | |
CN112270701B (en) | Parallax prediction method, system and storage medium based on packet distance network | |
CN117132737B (en) | Three-dimensional building model construction method, system and equipment | |
CN115908992B (en) | Binocular stereo matching method, device, equipment and storage medium | |
CN117152580A (en) | Binocular stereoscopic vision matching network construction method and binocular stereoscopic vision matching method | |
CN110610503A (en) | Three-dimensional information recovery method for power disconnecting link based on stereo matching | |
CN115880555A (en) | Target detection method, model training method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||