CN112489097A - Stereo matching method based on mixed 2D convolution and pseudo 3D convolution - Google Patents

Stereo matching method based on mixed 2D convolution and pseudo 3D convolution

Info

Publication number: CN112489097A (application number CN202011436492.0A)
Authority: CN (China)
Prior art keywords: convolution, pseudo, cost, hybrid, map
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN112489097B
Inventors: 陈世峰, 甘万水
Current and original assignee: Shenzhen Institute of Advanced Technology of CAS (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority: CN202011436492.0A (filed 2020-12-11); PCT/CN2020/139400 (WO2022120988A1)
Publication of CN112489097A; application granted; publication of CN112489097B
Legal status: Active; anticipated expiration status: Critical

Classifications

    • G06T7/33 — Image analysis; determination of transform parameters for the alignment of images (image registration) using feature-based methods
    • G06T17/00 — Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T2207/10028 — Image acquisition modality: range image; depth image; 3D point clouds
    • G06T2207/20081 — Special algorithmic details: training; learning
    • G06T2207/20084 — Special algorithmic details: artificial neural networks [ANN]
    • G06T2207/20228 — Special algorithmic details: disparity calculation for image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the field of computer vision, and in particular to a stereo matching method (HybridNet) based on hybrid 2D convolution and pseudo 3D convolution. The method comprises the following steps: extracting image features based on preset parameters to obtain a feature map; generating a cost volume from the feature map; aggregating the cost volume through a PSMNet structure and then obtaining an initial disparity map through disparity regression; building a residual cost volume from the initial disparity map and, after residual aggregation, obtaining a disparity residual that refines the initial disparity map, wherein the 3D convolutions in both the PSMNet structure and the residual aggregation are converted into combinations of hybrid 2D convolution and pseudo 3D convolution; and refining the resulting disparity map into a depth map with the CSPNet method. The function of 3D convolution is approximated by combining 2D convolutions; because the data-shift operation contains no learnable parameters and adds no computation, the proposed cost aggregation scheme of hybrid 2D convolution and pseudo 3D convolution can greatly reduce the computation of existing models at only a slight loss of accuracy.

Description

Stereo matching method based on mixed 2D convolution and pseudo 3D convolution
Technical Field
The invention relates to the field of computer vision, in particular to a stereo matching method based on mixed 2D convolution and pseudo 3D convolution.
Background
Stereo matching, a fundamental task of stereo vision, is widely applicable to autonomous driving, three-dimensional reconstruction, virtual reality, and other fields. By computing the disparity between rectified left and right views, the distance of an object can be recovered from the geometry of similar triangles. Compared with common active ranging sensors such as lidar, a binocular stereo camera can acquire a dense depth map at a cost far lower than that of an active sensor.
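As a concrete illustration of the similar-triangle relationship (a minimal Python sketch; the focal length and baseline below are illustrative calibration values, not taken from the invention):

    # Depth from disparity for a rectified stereo pair: Z = f * B / d,
    # where f is the focal length in pixels and B the baseline in meters.
    def disparity_to_depth(disparity_px, focal_px=700.0, baseline_m=0.54):
        """Return metric depth for one pixel given its disparity."""
        if disparity_px <= 0:
            raise ValueError("disparity must be positive")
        return focal_px * baseline_m / disparity_px

    # Example: a 10-pixel disparity maps to 700 * 0.54 / 10 = 37.8 m.
    print(disparity_to_depth(10.0))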
In conventional stereo matching algorithms, computing the disparity of the left and right views is divided into four main steps: cost computation, cost aggregation, disparity computation, and disparity optimization. Traditional algorithms often suffer from low disparity accuracy and a large amount of computation. In recent years, convolutional neural networks (CNNs) have been applied to binocular stereo matching: by extracting features from the binocular images and downsampling them, the computation required for disparity aggregation and calculation can be reduced markedly. At the present stage, cost aggregation with 3D convolution can aggregate costs effectively and supports accurate disparity regression. However, 3D convolution is computationally expensive, which is a serious drawback for real-time applications. Some networks instead use only 2D convolution for cost aggregation, but to do so they compress the channel dimension of the learned features, discarding feature information and reducing accuracy.
Existing neural-network-based binocular stereo matching algorithms fall into two main categories: algorithms that use 2D convolution for cost aggregation and algorithms that use 3D convolution. Each category has at least the following disadvantage:
2D-convolution cost aggregation compresses the channel information of the cost volume generated from the left and right feature maps, forming a four-dimensional cost volume. Cost aggregation can then be performed directly with 2D convolution, but because a large amount of feature information is discarded when the channels are compressed, this class of methods is not competitive in accuracy.
3D-convolution cost aggregation preserves the channel information of the cost volume generated from the left and right feature maps, forming a five-dimensional cost volume that must be aggregated with 3D convolution. Although excellent accuracy is achieved, the large amount of computation of 3D convolution leaves no advantage in real-time performance.
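The shape difference between the two families can be made concrete with a small sketch (assumptions: PyTorch, unit batch, a correlation-style 4D volume versus a concatenation-style 5D volume; the exact construction varies between published networks):

    import torch

    B, C, H, W, D = 1, 32, 64, 128, 48  # illustrative sizes

    left = torch.randn(B, C, H, W)
    right = torch.randn(B, C, H, W)

    # 2D-convolution family: channels are compressed into a per-disparity
    # similarity score, giving a 4D volume (B, D, H, W).
    cost4d = torch.zeros(B, D, H, W)
    for d in range(D):
        cost4d[:, d, :, d:] = (left[..., d:] * right[..., :W - d]).mean(dim=1)

    # 3D-convolution family: channel information is kept by concatenating
    # shifted features, giving a 5D volume (B, 2C, D, H, W) that then
    # requires 3D convolutions to aggregate.
    cost5d = torch.zeros(B, 2 * C, D, H, W)
    for d in range(D):
        cost5d[:, :C, d, :, d:] = left[..., d:]
        cost5d[:, C:, d, :, d:] = right[..., :W - d]

    print(cost4d.shape, cost5d.shape)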
Disclosure of Invention
The embodiment of the invention provides a stereo matching method based on hybrid 2D convolution and pseudo 3D convolution, which maintains accuracy while greatly reducing the amount of computation.
According to an embodiment of the present invention, there is provided a stereo matching method based on hybrid 2D convolution and pseudo 3D convolution, including the following steps:
extracting image features based on preset parameters to obtain a feature map;
generating a cost volume based on the feature map;
obtaining an initial disparity map through disparity regression after cost aggregation with a PSMNet structure, wherein the 3D convolutions in the PSMNet structure are converted into combinations of hybrid 2D convolution and pseudo 3D convolution;
generating a residual cost volume from the initial disparity and, through residual cost aggregation, obtaining a disparity residual that refines the initial disparity, wherein the 3D convolutions of the residual cost aggregation are replaced by combinations of hybrid 2D convolution and pseudo 3D convolution;
and further refining the optimized disparity map into a depth map with the CSPNet method.
Further, the method includes obtaining the initial disparity map through disparity regression using a version of the PSMNet hourglass structure whose 3D convolutions are converted into the combination of hybrid 2D convolution and pseudo 3D convolution proposed by the present invention.
Further, cost aggregation on the cost volume uses a depth-shift scheme together with a 2D convolution formulation, and on top of the depth-shift scheme the 2D convolutions and pseudo 3D convolutions are arranged alternately.
Further, the initial disparity map is used to warp the right feature map into a reconstructed left feature map, which then forms a residual cost volume together with the original left feature map.
Further, image features are extracted with a PSMNet structure, and the resulting feature map has the shape
(32, H/4, W/4),
where H is the input image height and W is the input image width.
Further, the cost volume is generated by means of a similarity measure.
Further, in the 2D convolution formulation, the expression for a 3 × 3 × 3 kernel is:

$$O_{h,w,d}=\sum_{c=1}^{C}\sum_{i=-1}^{1}\sum_{j=-1}^{1}\sum_{z=-1}^{1}K_{c,i,j,z}\,V_{c,\,h+i,\,w+j,\,d+z}$$

where V is the cost volume, O is the output after convolution with the given number of output channels, h, w, d are the height, width, and depth coordinates of the feature map, c indexes the input channels (C in total), and i, j, z are the offsets along the height, width, and depth dimensions, respectively.
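One way to read the pseudo 3D idea behind this expression (our own sketch of the decomposition, not the patent's verbatim derivation): since the kernel taps along the depth axis only move data, the 3 × 3 × 3 convolution can be rewritten as a sum of 2D convolutions applied to depth-shifted slices of the cost volume,

$$O_{\cdot,\cdot,d}=\sum_{z=-1}^{1}K^{(z)} *_{2D} V_{\cdot,\cdot,d+z},$$

where $K^{(z)}$ denotes the 2D kernel slice at depth offset z. The depth shift itself carries no learnable parameters and essentially no computation, which is exactly what the pseudo 3D convolution exploits.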
Further, the disparity is optimized using the convolutional affinity propagation of CSPNet, with 4 disparity-optimization update iterations.
The beneficial effects of the invention are as follows: image features are extracted based on preset parameters to obtain a feature map; a cost volume is generated from the feature map; the cost volume is aggregated through a PSMNet structure and an initial disparity map is then obtained through disparity regression; a residual cost volume is obtained from the initial disparity map, and after residual aggregation a disparity residual refines the initial disparity map, where the 3D convolutions in the PSMNet structure and in the residual aggregation are converted into combinations of hybrid 2D convolution and pseudo 3D convolution; finally, the optimized disparity map is further refined into a depth map with the CSPNet method. The method approximates the function of 3D convolution by combining 2D convolutions; because the data-shift operation contains no learnable parameters and adds no computation, the proposed cost aggregation scheme of hybrid 2D convolution and pseudo 3D convolution can greatly reduce the computation of existing models at only a slight loss of accuracy. The invention has at least the following advantages:
1. It offers a solution to the large computational cost of current models, namely a cost aggregation method that mixes 2D convolution with pseudo 3D convolution. The pseudo 3D convolution submodule models depth-dimension information without extra parameters or computation, allowing the model to reach higher accuracy.
2. Existing stereo matching methods face an amount of computation that seriously hampers use in real-time application scenarios; the proposed cost aggregation module based on hybrid 2D convolution and pseudo 3D convolution maintains accuracy while greatly reducing computation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of the stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to the present invention;
FIG. 2 is a framework diagram of the HybridNet algorithm of the present invention;
FIG. 3 is a detailed parameter diagram of HybridNet feature extraction in the present invention;
FIG. 4 is a diagram of the depth shift module according to the present invention;
FIG. 5 is a diagram of the specific parameters of the hybrid 2D convolution and pseudo 3D convolution combination replacing 3D convolution in HybridNet according to the present invention;
FIG. 6 is a diagram of the specific parameters of the hybrid 2D convolution and pseudo 3D convolution combination in the hourglass-structure version of HybridNet according to the present invention;
FIG. 7 is a diagram of depth optimization with the CSPNet method in the present invention;
FIGS. 8 and 9 compare HybridNet of the present invention with existing algorithms on the Scene Flow and KITTI Stereo 2015 datasets;
FIG. 10 shows a binocular stereo camera mounted on a vehicle in an application scenario of the present invention;
FIG. 11 shows a road scene and depth map from a vehicle-mounted binocular stereo camera in an application scenario of the present invention;
FIG. 12 is an example of three-dimensional reconstruction of an object in an application scenario of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in figs. 1 to 12, according to an embodiment of the present invention, there is provided a stereo matching method based on hybrid 2D convolution and pseudo 3D convolution; referring to fig. 1, the method includes the following steps:
s101: extracting image features based on preset parameters to obtain a feature map;
in the embodiment, the feature extraction module in the PSMNet is adopted in the invention and the number of channels of the convolution layer is reduced to half of the original number, so as to obtain the features (32, H/4, W/4), where H is the height of the input image and W is the width of the input image. The specific parameters are shown in fig. 3.
S102: generating a cost volume based on the feature map;
in the embodiment, a cost volume generation mode of similarity measurement is adopted, and the characteristic shape is (32, H/4, W/4 and D/4); where D is the maximum disparity value, 192 is taken here by the present invention.
S103: aggregating the cost volume through the PSMNet-structure cost aggregation, and obtaining an initial disparity map through disparity regression; the 3D convolutions in the PSMNet structure are converted into combinations of hybrid 2D convolution and pseudo 3D convolution.
In this embodiment, the present invention formulates cost aggregation as depth shifting plus 2D convolution (the depth shift module, DSM); the DSM is shown in fig. 4.
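A minimal sketch of what such a depth-shift-plus-2D-convolution block could look like (our reconstruction from the description, assuming PyTorch and a TSM-style shift of one-eighth of the channels in each direction; the split ratio and layer sizes are assumptions, not the patent's stated parameters):

    import torch
    import torch.nn as nn

    class DepthShiftModule(nn.Module):
        """Pseudo 3D convolution: shift part of the channels by +/-1 along
        the disparity (depth) axis -- a parameter-free, compute-free move --
        then aggregate with an ordinary 2D convolution over (H, W)."""

        def __init__(self, channels, shift_div=8):
            super().__init__()
            self.fold = channels // shift_div  # channels to shift each way
            self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

        def forward(self, x):
            # x: (B, C, D, H, W) cost volume
            B, C, D, H, W = x.shape
            f = self.fold
            out = x.clone()  # boundary slices keep their values in this sketch
            out[:, :f, 1:] = x[:, :f, :-1]            # shift forward in depth
            out[:, f:2 * f, :-1] = x[:, f:2 * f, 1:]  # shift backward in depth
            # fold depth into the batch axis and run a plain 2D convolution
            out = out.permute(0, 2, 1, 3, 4).reshape(B * D, C, H, W)
            out = self.conv(out)
            return out.reshape(B, D, C, H, W).permute(0, 2, 1, 3, 4)

    dsm = DepthShiftModule(32)
    print(dsm(torch.randn(1, 32, 48, 16, 32)).shape)  # (1, 32, 48, 16, 32)

Because the shifted channels expose neighbouring disparity slices to the 2D kernel, the block sees depth context much as a 3D convolution would, while paying only for 2D convolution.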
S104: generating a residual cost volume from the initial disparity, and obtaining a disparity residual through residual cost aggregation to refine the initial disparity map; the 3D convolutions of the residual cost aggregation are replaced by combinations of hybrid 2D convolution and pseudo 3D convolution.
S105: further refining the optimized disparity map into a depth map with the CSPNet method.
In this embodiment, disparity optimization uses the CSPNet method to refine the depth map, as shown in fig. 7.
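A rough sketch of one such propagation update (our paraphrase of CSPN-style convolutional affinity propagation in PyTorch; the kernel size, normalization, and the guidance network that would produce the affinities are assumptions):

    import torch
    import torch.nn.functional as F

    def affinity_propagation_step(disparity, affinity):
        """One update: each pixel's disparity becomes an affinity-weighted
        average of its 3x3 neighbourhood. disparity: (B, 1, H, W);
        affinity: (B, 9, H, W), softmax-normalized to sum to 1."""
        weights = torch.softmax(affinity, dim=1)
        B, _, H, W = disparity.shape
        # unfold gathers the 3x3 neighbourhood of every pixel
        patches = F.unfold(disparity, kernel_size=3, padding=1).view(B, 9, H, W)
        return (weights * patches).sum(dim=1, keepdim=True)

    disp = torch.randn(1, 1, 32, 64)
    aff = torch.randn(1, 9, 32, 64)   # would come from a guidance network
    for _ in range(4):                # the invention uses 4 update iterations
        disp = affinity_propagation_step(disp, aff)
    print(disp.shape)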
In this embodiment, the initial disparity map is obtained with the PSMNet hourglass structure; cost aggregation comprises initial disparity regression and residual disparity refinement. Initial disparity regression: the PSMNet structure is adopted with its 3D convolutions converted into the proposed combination of hybrid 2D convolution and pseudo 3D convolution; the specific parameter table is shown in FIG. 2. Residual disparity refinement: a version of the PSMNet hourglass structure is adopted, again with its 3D convolutions converted into the proposed combination; the specific parameters are shown in figs. 5 and 6 below.
At present, the best stereo matching results are obtained by cost aggregation with 3D convolution, whose drawback is a large amount of computation; the proposed cost aggregation scheme mixing 2D convolution with pseudo 3D convolution can cut the computation by more than half. Figs. 8 and 9 give a simple comparison of the present invention with other methods, showing HybridNet against current algorithms on the Scene Flow and KITTI Stereo 2015 datasets.
When designing the depth shift module (DSM), the relations among all dimensions must be considered, and the number of shifted channels after downsampling must be adjusted when writing the code; meanwhile, 1 × 1 convolutions weaken the effect of cost aggregation on the stereo matching task.
Figs. 10 to 12 show application scenarios of the present invention:
1. automatic driving
Distance information within the image range (as shown in fig. 11) can be estimated with a binocular stereo camera mounted on the vehicle (as shown in fig. 10), providing early-warning information such as the distance to the vehicle ahead and to obstacles for advanced driver assistance.
2. Binocular three-dimensional reconstruction
The key to binocular three-dimensional reconstruction is generating an accurate depth map through high-precision stereo matching; the three-dimensional reconstruction of a specific object is then completed through triangulation and texture mapping (as shown in fig. 12).
The invention provides a stereo matching method based on hybrid 2D convolution and pseudo 3D convolution, as shown in fig. 1, comprising:
Step 1: feature extraction; image features are extracted based on the parameters to obtain a feature map.
Step 2: cost volume generation; a cost volume is generated from the feature map.
Step 3: initial cost aggregation; the cost volume is aggregated through the PSMNet-structure cost aggregation, and an initial disparity map is obtained through disparity regression; the 3D convolutions in the PSMNet structure are converted into combinations of hybrid 2D convolution and pseudo 3D convolution.
Step 4: residual optimization; a residual cost volume is generated from the initial disparity, and a disparity residual obtained through residual cost aggregation refines the initial disparity; the 3D convolutions of the residual cost aggregation are replaced by combinations of hybrid 2D convolution and pseudo 3D convolution.
Step 5: depth optimization; the optimized disparity map is further refined into a depth map with the CSPNet method.
In this embodiment, the method further includes obtaining an initial disparity map through disparity regression using a version of the PSMNet hourglass structure whose 3D convolutions are converted into the proposed combination of hybrid 2D convolution and pseudo 3D convolution; the specific parameters are shown in fig. 6.
To address the large computation of 3D-convolution-based cost aggregation at the present stage, the invention designs an efficient stereo matching network mixing 2D convolution with pseudo 3D convolution (HybridNet, shown in fig. 2) to realize depth estimation with little computation. In image convolution, when a convolution kernel parameter is 0 or 1 the operation reduces to shifting the corresponding data, so the learnable parameters and computation of that part can be omitted. The invention therefore uses data shifting to model the depth (disparity) dimension and proposes a pseudo 3D convolution module, so that the function of 3D convolution can be approximated in combination with 2D convolution.
In this embodiment, cost aggregation on the cost volume uses the depth-shift scheme together with the 2D convolution formulation, with 2D convolutions and pseudo 3D convolutions arranged alternately on top of the depth-shift scheme.
In this embodiment, arranging 2D convolutions and pseudo 3D convolutions alternately on top of the depth-shift scheme further reduces inference time while preserving cost aggregation performance.
A pseudo 3D convolution module is thus provided so that the function of 3D convolution can be approximated in combination with 2D convolution. Because the data-shift operation contains no learnable parameters and adds no computation, the proposed cost aggregation scheme of hybrid 2D convolution and pseudo 3D convolution greatly reduces the computation of existing models at only a slight loss of accuracy.
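Under the same assumptions as the DepthShiftModule sketch above, the alternating (interval) arrangement could be organized as follows (a sketch; the layer count, and the use of a 1 × 3 × 3 Conv3d — mathematically a per-slice 2D convolution kept on a 5D interface — are our choices):

    import torch
    import torch.nn as nn

    class DepthShift(nn.Module):
        """Parameter-free +/-1 shift of a channel slice along depth
        (see the DepthShiftModule sketch above for the full block)."""
        def forward(self, x):  # x: (B, C, D, H, W)
            out = x.clone()
            f = x.shape[1] // 8
            out[:, :f, 1:] = x[:, :f, :-1]
            out[:, f:2 * f, :-1] = x[:, f:2 * f, 1:]
            return out

    def make_aggregation_stack(channels=32, num_blocks=6):
        """Alternate plain per-slice 2D convolutions with pseudo 3D
        (depth-shift) blocks instead of full 3D convolutions."""
        layers = []
        for i in range(num_blocks):
            if i % 2 == 1:
                layers.append(DepthShift())  # pseudo 3D: adds depth context
            layers.append(nn.Conv3d(channels, channels, (1, 3, 3),
                                    padding=(0, 1, 1), bias=False))
            layers.append(nn.ReLU(inplace=True))
        return nn.Sequential(*layers)

    stack = make_aggregation_stack()
    print(stack(torch.randn(1, 32, 48, 16, 32)).shape)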
In this embodiment, the initial disparity map is used to warp the right feature map into a reconstructed left feature map, which then forms a residual cost volume together with the original left feature map.
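A sketch of the warping step (assuming bilinear grid_sample-based horizontal resampling; the interpolation scheme is our choice, not specified here):

    import torch
    import torch.nn.functional as F

    def warp_right_to_left(feat_r, disparity):
        """Reconstruct left-view features by sampling the right feature map
        at x - d for every left pixel x.
        feat_r: (B, C, H, W); disparity: (B, 1, H, W), in pixels."""
        B, C, H, W = feat_r.shape
        xs = torch.linspace(-1, 1, W).view(1, 1, W).expand(B, H, W)
        ys = torch.linspace(-1, 1, H).view(1, H, 1).expand(B, H, W)
        # shift the normalized x coordinate left by the pixel disparity
        xs = xs - 2.0 * disparity.squeeze(1) / (W - 1)
        grid = torch.stack((xs, ys), dim=-1)  # (B, H, W, 2)
        return F.grid_sample(feat_r, grid, mode='bilinear',
                             padding_mode='zeros', align_corners=True)

    warped = warp_right_to_left(torch.randn(1, 32, 64, 128),
                                torch.rand(1, 1, 64, 128) * 10)
    print(warped.shape)  # torch.Size([1, 32, 64, 128])

The residual cost volume is then built from this reconstructed left feature map and the original left feature map, so the network only has to regress the remaining disparity error.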
In this embodiment, a PSMNet structure is adopted to extract image features; the resulting feature map has the shape (32, H/4, W/4), where H is the input image height and W is the input image width.
In the feature extraction module, the feature extraction module of PSMNet is adopted with the number of channels of its convolution layers halved; the specific parameters are shown in fig. 3.
To further reduce computation, the feature map size could be reduced further by downsampling, though at a small cost in accuracy. Other uses: the proposed pseudo 3D convolution also applies to other 3D convolution networks, for example in optical flow estimation and point cloud processing.
In this embodiment, the cost volume is generated by similarity measurement. The resulting shape is (32, H/4, W/4, D/4), where D is the maximum disparity value, taken as 192 in this embodiment.
As shown in fig. 4, in this embodiment the alternating arrangement of 2D convolutions and pseudo 3D convolutions on top of the depth-shift scheme further reduces inference time while preserving cost aggregation performance.
In the 2D convolution formulation, the expression for a 3 × 3 × 3 kernel is:

$$O_{h,w,d}=\sum_{c=1}^{C}\sum_{i=-1}^{1}\sum_{j=-1}^{1}\sum_{z=-1}^{1}K_{c,i,j,z}\,V_{c,\,h+i,\,w+j,\,d+z}$$

where V is the cost volume, O is the output after convolution with the given number of output channels, h, w, d are the height, width, and depth coordinates of the feature map, c indexes the input channels (C in total), and i, j, z are the offsets along the height, width, and depth dimensions, respectively.
In this embodiment, the CSPNet disparity optimization refines the disparity with the convolutional affinity propagation of CSPNet, using 4 update iterations.
The invention also addresses generality: the proposed module can be inserted into any network containing 3D convolutions to approximate their effect at close to 2D computational cost. Mainstream stereo matching network designs include 3D convolutions, so the invention can be transferred to any such network, and likewise to similar dense regression tasks such as optical flow estimation and 3D point cloud segmentation.
Compared with TSM and Non-local, two plug-and-play video recognition modules in relatively wide use at present, a second point is the balance between computation and accuracy: TSM and Non-local can both be embedded into current mainstream 2D networks, but the proposed module outperforms TSM and, thanks to its residual connections, is more robust; in addition, the alternating combination of 2D convolution and pseudo 3D convolution further reduces computation while preserving the depth-dimension modeling capability. The computation of Non-local is larger than that of SmallBig, and the results of the proposed design on a 2D network are clearly higher than those of Non-local + 3D networks. This demonstrates that the present design has advantages in both computation and accuracy.
For some special application scenarios such as security, abnormal behaviors or actions are often short-lived and fast-changing. The proposed technique is insensitive to the speed of action across frames and models actions of different durations well: on the Kinetics dataset the videos are about 10 s long and actions evolve slowly (for a basketball shot, from dribbling to preparing to shoot to the final shot), while on Something-Something the videos are 2-3 s and a single action change takes no more than 3 s (for example a thumbs-up). The good results obtained on both datasets show that the proposed module models actions of different durations well.
In addition, the technique has a wide range of applications:
1. Intelligent sports training / video-assisted refereeing: because the technique is insensitive to action speed and duration, it applies universally to various sports scenes, from slow-moving yoga to figure skating and gymnastics with rapid action changes.
2. Intelligent video review: abnormal-action recognition and judgment can be completed on the mobile terminal, with only anomalies sent directly to the cloud server, further improving the speed and efficiency of review.
3. Intelligent video montage: facing a huge video database, videos of the same action are automatically extracted, edited, and summarized.
4. Intelligent security: action recognition can be performed directly on intelligent terminals with limited computing resources, such as smart glasses, drones, and smart cameras, with abnormal actions fed back directly, improving the timeliness and accuracy of patrols and defense.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also fall within the protection scope of the present invention.

Claims (8)

1. A stereo matching method based on hybrid 2D convolution and pseudo 3D convolution, characterized by comprising the following steps:
extracting image features based on preset parameters to obtain a feature map;
generating a cost volume based on the feature map;
aggregating the cost volume through PSMNet-structure cost aggregation, and obtaining an initial disparity map through disparity regression; wherein the 3D convolutions in the PSMNet structure are converted into combinations of hybrid 2D convolution and pseudo 3D convolution;
generating a residual cost volume from the initial disparity, and obtaining a disparity residual that refines the initial disparity map through residual cost aggregation; wherein the 3D convolutions of the residual cost aggregation are replaced by combinations of hybrid 2D convolution and pseudo 3D convolution;
and further refining the optimized disparity map into a depth map with the CSPNet method.
2. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, further comprising obtaining the initial disparity map by disparity regression using a version of the PSMNet hourglass structure and converting its 3D convolutions into the proposed combination of hybrid 2D convolution and pseudo 3D convolution.
3. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein cost aggregation on the cost volume uses a depth-shift scheme together with a 2D convolution formulation, and the 2D convolutions and pseudo 3D convolutions are arranged alternately on top of the depth-shift scheme.
4. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the initial disparity map is used to warp the right feature map into a reconstructed left feature map, which then generates a residual cost volume together with the original left feature map.
5. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein a PSMNet structure is adopted to extract image features, and the resulting feature map has the shape (32, H/4, W/4), where H is the input image height and W is the input image width.
6. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the cost volume is generated by means of a similarity measure.
7. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 3, wherein in the 2D convolution formulation the expression for a 3 × 3 × 3 kernel is:

$$O_{h,w,d}=\sum_{c=1}^{C}\sum_{i=-1}^{1}\sum_{j=-1}^{1}\sum_{z=-1}^{1}K_{c,i,j,z}\,V_{c,\,h+i,\,w+j,\,d+z}$$

where V is the cost volume, O is the output after convolution with the given number of output channels, h, w, d are the height, width, and depth coordinates of the feature map, c indexes the input channels (C in total), and i, j, z are the offsets along the height, width, and depth dimensions, respectively.
8. The stereo matching method based on hybrid 2D convolution and pseudo 3D convolution according to claim 1, wherein the CSPNet disparity optimization refines the disparity with the convolutional affinity propagation of CSPNet, using 4 update iterations.
CN202011436492.0A 2020-12-11 2020-12-11 Stereo matching method based on mixed 2D convolution and pseudo 3D convolution Active CN112489097B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011436492.0A CN112489097B (en) 2020-12-11 2020-12-11 Stereo matching method based on mixed 2D convolution and pseudo 3D convolution
PCT/CN2020/139400 WO2022120988A1 (en) 2020-12-11 2020-12-25 Stereo matching method based on hybrid 2d convolution and pseudo 3d convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011436492.0A CN112489097B (en) 2020-12-11 2020-12-11 Stereo matching method based on mixed 2D convolution and pseudo 3D convolution

Publications (2)

Publication Number Publication Date
CN112489097A true CN112489097A (en) 2021-03-12
CN112489097B CN112489097B (en) 2024-05-17

Family

ID=74940986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011436492.0A Active CN112489097B (en) 2020-12-11 2020-12-11 Stereo matching method based on mixed 2D convolution and pseudo 3D convolution

Country Status (2)

Country Link
CN (1) CN112489097B (en)
WO (1) WO2022120988A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023240764A1 (en) * 2022-06-17 2023-12-21 五邑大学 Hybrid cost body binocular stereo matching method, device and storage medium

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN116703999A (en) * 2023-08-04 2023-09-05 东莞市爱培科技术有限公司 Residual fusion method for binocular stereo matching

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106355570A (en) * 2016-10-21 2017-01-25 昆明理工大学 Binocular stereoscopic vision matching method combining depth characteristics
CN110533712A (en) * 2019-08-26 2019-12-03 北京工业大学 A kind of binocular solid matching process based on convolutional neural networks
CN111583313A (en) * 2020-03-25 2020-08-25 上海物联网有限公司 Improved binocular stereo matching method based on PSmNet
CN111696148A (en) * 2020-06-17 2020-09-22 中国科学技术大学 End-to-end stereo matching method based on convolutional neural network

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN109472819B (en) * 2018-09-06 2021-12-28 杭州电子科技大学 Binocular parallax estimation method based on cascade geometric context neural network
CN109816710B (en) * 2018-12-13 2023-08-29 中山大学 Parallax calculation method for binocular vision system with high precision and no smear
CN111402311B (en) * 2020-03-09 2023-04-14 福建帝视信息科技有限公司 Knowledge distillation-based lightweight stereo parallax estimation method

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN106355570A (en) * 2016-10-21 2017-01-25 昆明理工大学 Binocular stereoscopic vision matching method combining depth characteristics
CN110533712A (en) * 2019-08-26 2019-12-03 北京工业大学 A kind of binocular solid matching process based on convolutional neural networks
CN111583313A (en) * 2020-03-25 2020-08-25 上海物联网有限公司 Improved binocular stereo matching method based on PSmNet
CN111696148A (en) * 2020-06-17 2020-09-22 中国科学技术大学 End-to-end stereo matching method based on convolutional neural network

Non-Patent Citations (2)

Title
Changjiang Cai et al., "Do End-to-end Stereo Algorithms Under-utilize Information?", 2020 International Conference on 3D Vision (3DV)
Haihua Lu et al., "Cascaded Multi-scale and Multi-dimension Convolutional Neural Network for Stereo Matching", 2018 IEEE Visual Communications and Image Processing (VCIP)


Also Published As

Publication number Publication date
WO2022120988A1 (en) 2022-06-16
CN112489097B (en) 2024-05-17


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant