WO2022120988A1 - Stereo matching method based on hybrid 2D convolution and pseudo-3D convolution - Google Patents

Stereo matching method based on hybrid 2D convolution and pseudo-3D convolution Download PDF

Info

Publication number
WO2022120988A1
Authority
WO
WIPO (PCT)
Prior art keywords
convolution
pseudo
disparity
cost
hybrid
Prior art date
Application number
PCT/CN2020/139400
Other languages
English (en)
French (fr)
Inventor
陈世峰
甘万水
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 filed Critical 深圳先进技术研究院
Publication of WO2022120988A1 publication Critical patent/WO2022120988A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20228 Disparity calculation for image-based rendering

Definitions

  • the present invention relates to the field of computer vision, in particular, to a stereo matching method based on mixed 2D convolution and pseudo 3D convolution.
  • as a fundamental task of stereo vision, stereo matching can be widely applied in fields such as autonomous driving, 3D reconstruction, and virtual reality.
  • by computing the disparity between the rectified left and right views, the distance of a target can be calculated from the proportional relationship of similar triangles.
  • compared with common active ranging sensors such as lidar, binocular stereo cameras have the advantage of producing dense depth maps at a cost far below that of active sensors.
  • the calculation of the disparity of the left and right views is mainly divided into the following four steps: cost calculation, cost aggregation, disparity calculation, and disparity optimization.
  • Traditional stereo matching algorithms often face the problems of low disparity accuracy and a large amount of computation.
  • In recent years, convolutional neural networks (CNNs) have made great progress in binocular stereo matching.
  • the cost aggregation part of the neural network usually uses 3D convolution to effectively aggregate the cost and achieve accurate disparity regression calculation.
  • 2D-convolution cost aggregation algorithms build the cost volume from the left and right feature maps by compressing the channel information into a four-dimensional cost volume. This allows 2D convolution to be used directly for cost aggregation, but because a large amount of feature information is discarded when the channels are compressed, such methods are not competitive in accuracy.
  • 3D-convolution cost aggregation algorithms keep the channel information when building the cost volume from the left and right feature maps, forming a five-dimensional cost volume that must be aggregated with 3D convolution. Although this achieves superior accuracy, the heavy computation of 3D convolution gives it no advantage for real-time use.
  • the embodiment of the present invention provides a stereo matching method based on mixed 2D convolution and pseudo 3D convolution, which can greatly reduce the amount of calculation while ensuring the accuracy.
  • a stereo matching method based on mixed 2D convolution and pseudo 3D convolution including the following steps:
  • the initial disparity map is obtained through disparity regression after cost aggregation with the PSMNet structure; wherein, in the PSMNet structure, the 3D convolutions are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
  • a residual cost volume is generated from the initial disparity, and a disparity residual that refines the initial disparity is obtained through residual cost aggregation; wherein the 3D convolutions of the residual cost aggregation are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
  • the optimized disparity map is further refined by the CSPNet method.
  • the method also includes obtaining the initial disparity map through disparity regression using the hourglass variant of PSMNet and replacing its 3D convolutions with the hybrid 2D convolution and pseudo-3D convolution combination proposed by the present invention.
  • cost aggregation is performed on the cost volume using a depth-shift scheme and a 2D convolution formula, with the 2D convolutions and pseudo-3D convolutions arranged alternately on top of the depth-shift scheme.
  • the initial disparity map is used to warp the right feature map into a reconstructed left feature map, which is then combined with the original left feature map to generate a residual cost volume.
  • H is the height of the input image and W is the width of the input image.
  • in the 2D convolution formula, the input is the cost volume and the convolution produces the output channels
  • h, w, and d are the height, width, and depth of the feature map, respectively
  • c is the number of input channels
  • i, j, and z are the indices of the height, width, and depth dimensions, respectively.
  • the disparity is optimized using the convolutional affine propagation of CSPNet, with 4 disparity-optimization update iterations.
  • the beneficial effects of the present invention are: image features are extracted based on preset parameters to obtain a feature map; a cost volume is generated from the feature map; the aggregated cost volume is obtained through the PSMNet structure; an initial disparity map is then obtained through disparity regression; and a residual cost volume is derived from it.
  • a disparity residual that refines the initial disparity map is obtained after residual aggregation.
  • the 3D convolutions are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
  • the optimized disparity map is further refined into a depth map using the CSPNet method.
  • the present invention combines 2D convolution to approximately realize the function of 3D convolution.
  • a solution is proposed for the heavy computation of current models, namely a cost aggregation method that mixes 2D convolution and pseudo-3D convolution.
  • the pseudo 3D convolution sub-module can realize the modeling of the depth dimension information without bringing additional parameters and calculation amount, so that the model can achieve higher accuracy.
  • the cost aggregation module based on hybrid 2D convolution and pseudo-3D convolution proposed by the present invention greatly reduces the amount of computation while maintaining accuracy.
  • Fig. 1 is a flow chart of the stereo matching method based on hybrid 2D convolution and pseudo-3D convolution of the present invention;
  • Fig. 2 is a framework diagram of the HybridNet algorithm of the present invention;
  • Fig. 3 is a diagram of the specific parameters of HybridNet feature extraction of the present invention;
  • Fig. 4 is a diagram of the depth shift model of the present invention;
  • Fig. 5 is a diagram of the specific parameters of HybridNet of the present invention in which the 3D convolutions are replaced by the hybrid 2D convolution and pseudo-3D convolution combination;
  • Fig. 6 is a diagram of the specific parameters of the hourglass variant of HybridNet of the present invention in which the 3D convolutions are replaced by the hybrid 2D convolution and pseudo-3D convolution combination;
  • Fig. 7 is a diagram of the depth optimization performed by the present invention using the CSPNet method;
  • FIG. 8 and FIG. 9 are the comparisons between the HybridNet of the present invention and the prior art algorithm on the Scene flow and KITTI Stereo 2015 datasets;
  • FIG. 10 is a diagram of a binocular stereo camera mounted on a vehicle in an application scenario of the present invention.
  • FIG. 11 is a road scene and a depth map of a binocular stereo camera mounted on a vehicle in an application scenario of the present invention
  • FIG. 12 is an example of three-dimensional reconstruction of an object in an application scene of the present invention.
  • a stereo matching method based on mixed 2D convolution and pseudo 3D convolution is provided.
  • the method includes the following steps:
  • the present invention adopts the feature extraction module of PSMNet and reduces the number of channels of its convolutional layers to half of the original; the obtained features are (32, H/4, W/4), where H is the height of the input image and W is the width of the input image.
  • the specific parameters are shown in Figure 3.
  • the similarity-measurement cost volume generation method is adopted, and the resulting feature shape is (32, H/4, W/4, D/4), where D is the maximum disparity value, taken as 192 by the present invention.
  • S103 The cost volume after cost aggregation is obtained through the cost aggregation of the PSMNet structure, and the initial disparity map is obtained through disparity regression; wherein, in the PSMNet structure, the 3D convolution is replaced by a combination of hybrid 2D convolution and pseudo 3D convolution.
  • the proposed depth shift module (DSM) combines a depth shift with 2D convolution, as shown in Fig. 4.
  • S104 A residual cost volume is generated from the initial disparity, and a disparity residual that refines the initial disparity map is obtained through residual cost aggregation; the 3D convolutions of the residual cost aggregation are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution.
  • the disparity optimization adopts the method of CSPNet to optimize the depth map. As shown in Figure 7.
  • the PSMNet hourglass structure is used to obtain the initial disparity map; the cost aggregation consists of the initial disparity regression and the residual disparity fine-tuning; for the initial disparity regression, the PSMNet structure is adopted and its 3D convolutions are replaced by the hybrid 2D convolution and pseudo-3D convolution combination proposed by the present invention.
  • the specific parameter table is shown in Figure 2; for the residual disparity fine-tuning, the hourglass variant of PSMNet is adopted and its 3D convolutions are replaced by the hybrid 2D convolution and pseudo-3D convolution combination proposed by the present invention.
  • the specific parameters are shown in Figure 5 and Figure 6.
  • FIG 10 to Figure 12 are the application scenarios of the invention of the present application:
  • the distance information within the image range can be estimated to provide early warning information for advanced assisted driving, such as the distance of the preceding vehicle and the distance of obstacles.
  • the key to binocular 3D reconstruction is to generate an accurate depth map through high-precision stereo matching, and then complete the 3D reconstruction of a specific object through triangulation and texture mapping (as shown in Figure 12).
  • the present invention provides a stereo matching method based on mixed 2D convolution and pseudo 3D convolution, as shown in FIG. 1 , including:
  • Step 1 extract features; extract image features based on parameters to obtain a feature map;
  • Step 2 Generate a cost volume; generate a cost volume based on the feature map;
  • Step 3 Initial cost aggregation; the aggregated cost volume is obtained through cost aggregation with the PSMNet structure, and the initial disparity map is obtained through disparity regression; in the PSMNet structure, the 3D convolutions are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
  • Step 4 Residual optimization; a residual cost volume is generated from the initial disparity, and a disparity residual that refines the initial disparity is obtained through residual cost aggregation; the 3D convolutions of the residual cost aggregation are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
  • Step 5 Depth optimization; further optimize the depth map by using the CSPNet method for the optimized disparity map.
  • the beneficial effects of the present invention are: image features are extracted based on preset parameters to obtain a feature map; a cost volume is generated from the feature map; the aggregated cost volume is obtained through the PSMNet structure; and an initial disparity map is finally obtained through disparity regression;
  • a residual cost volume is then derived, and after residual aggregation a disparity residual is obtained to refine the initial disparity map; in the PSMNet structure and in the residual aggregation, the 3D convolutions are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
  • the optimized disparity map is further refined into a depth map using the CSPNet method.
  • the method further includes obtaining the initial disparity map through disparity regression using the hourglass variant of PSMNet and replacing its 3D convolutions with the hybrid 2D convolution and pseudo-3D convolution combination proposed by the present invention; the specific parameters are shown in Figure 6.
  • the present invention adopts the feature extraction module of PSMNet and reduces the number of channels of its convolutional layers to half of the original; the obtained features are (32, H/4, W/4), where H is the height of the input image and W is the width of the input image.
  • the specific parameters are shown in Figure 3.
  • the present invention designs an efficient stereo matching network that mixes 2D convolution and pseudo-3D convolution (HybridNet, shown in Figure 3) to achieve depth estimation at low computational cost.
  • the present invention uses this data shift as the modeling mechanism for the depth/disparity dimension and proposes a pseudo-3D convolution module, so that the function of 3D convolution can be approximated in combination with 2D convolution.
  • cost aggregation is performed on the cost volume using a depth-shift scheme together with the 2D convolution formula, and the 2D convolutions and pseudo-3D convolutions are arranged alternately on top of the depth-shift scheme.
  • the 2D convolutions and the pseudo-3D convolutions are arranged alternately on top of the depth-shift scheme, which maintains the cost aggregation performance while further reducing the inference time.
  • a pseudo-3D convolution module is proposed that combines with 2D convolution to approximate the function of 3D convolution; since this data-shift operation contains no learnable parameters and adds no computation, the proposed cost aggregation scheme of hybrid 2D convolution and pseudo-3D convolution can greatly reduce the computation of existing models with only a small loss of accuracy.
  • the initial disparity map is used to warp the right feature map into a reconstructed left feature map, which is then used together with the original left feature map to generate a residual cost volume.
  • the PSMNet structure is used to extract image features; its features are:
  • H is the height of the input image and W is the width of the input image.
  • the proposed pseudo-3D convolution is also applicable to other 3D convolutional networks, such as optical flow estimation, point cloud processing, etc.
  • the cost volume is generated by a similarity-measurement method.
  • the similarity-measurement cost volume generation method is adopted, and the resulting feature shape is (32, H/4, W/4, D/4), where D is the maximum disparity value, taken as 192 in this embodiment.
  • a 2D convolution and a pseudo 3D convolution are arranged at intervals based on the depth switching method, so as to ensure the cost aggregation performance and further reduce the inference time.
  • in the 2D convolution formula, the input is the cost volume and the convolution produces the output channels
  • h, w, and d are the height, width, and depth of the feature map, respectively
  • c is the number of input channels
  • i, j, and z are the indices of the height, width, and depth dimensions, respectively.
  • the disparity optimization method of CSPNet uses the convolutional affine propagation of CSPNet to optimize the disparity, with 4 disparity-optimization update iterations.
  • the first problem that the invention of the present application solves is universality: the module of the present invention can be inserted into any 3D convolutional network to achieve the effect of 3D convolution at a computational cost close to that of 2D convolution.
  • current mainstream stereo matching network designs all contain 3D convolution, and the invention of the present application can be transferred to such networks containing 3D convolution; it likewise applies to similar dense regression tasks, such as optical flow estimation and 3D point cloud segmentation.
  • the second point is the balance between computation and accuracy, compared against two well-known plug-and-play video recognition modules, TSM and Non-local.
  • TSM and Non-local can also be embedded into current mainstream 2D networks, but the module of the present application outperforms TSM and is more robust thanks to its residual connection; in addition, alternating 2D convolution with pseudo-3D convolution further reduces computation while preserving the ability to model the depth dimension;
  • the computational cost of Non-local is larger than that of the present application's smallBIg, and the results of the present application's 2D network are also clearly higher than those of the Non-local+3D network, which shows that the design of the present application has advantages in both computational cost and accuracy.
  • Smart sports training / video-assisted refereeing: because the technology is not sensitive to the speed or duration of video motions, it can be applied widely across sports scenarios, such as slow-moving yoga and rapidly changing figure skating or gymnastics.
  • Intelligent video review: abnormal-action recognition and judgment can be completed on the mobile terminal, and abnormalities can be sent directly to the cloud server, further improving the speed and efficiency of the judgment.
  • Intelligent video montage: facing a huge video database, videos of a given action can be automatically extracted and edited together.
  • Intelligent security: action recognition can be performed directly on intelligent terminals with limited computing resources, such as smart glasses, drones, and smart cameras, and abnormal behaviors can be fed back directly, improving the immediacy and accuracy of patrols.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

A stereo matching method based on hybrid 2D convolution and pseudo-3D convolution (HybridNet), relating to the field of computer vision. The method comprises the following steps: extracting image features based on preset parameters to obtain feature maps (S101); generating a cost volume based on the feature maps (S102); obtaining an aggregated cost volume through the PSMNet structure and finally obtaining an initial disparity map through disparity regression; deriving a residual cost volume from the initial disparity map and, after residual aggregation, obtaining a disparity residual that refines the initial disparity map, wherein the 3D convolutions in the PSMNet structure and in the residual aggregation are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution (S103); and refining the disparity map into a depth map using the CSPNet method (S105). The function of 3D convolution is approximated in combination with 2D convolution; this data-shift operation contains no learnable parameters and adds no computation. The proposed cost aggregation scheme of hybrid 2D convolution and pseudo-3D convolution can greatly reduce the computation of existing models with only a tiny loss of accuracy.

Description

Stereo matching method based on hybrid 2D convolution and pseudo-3D convolution
Technical Field
The present invention relates to the field of computer vision, and in particular to a stereo matching method based on hybrid 2D convolution and pseudo-3D convolution.
Background Art
As a fundamental task of stereo vision, stereo matching can be widely applied in fields such as autonomous driving, 3D reconstruction, and virtual reality. By computing the disparity between the rectified left and right views, the distance of a target can be calculated from the proportional relationship of similar triangles. Compared with common active ranging sensors such as lidar, binocular stereo cameras have the advantage of producing dense depth maps at a cost far below that of active sensors.
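For reference, the similar-triangles relation mentioned above is the standard stereo ranging formula Z = f · B / d, where Z is the distance to the target point, f is the focal length, B is the baseline between the two cameras, and d is the disparity of the point between the rectified left and right views; these symbols are introduced here only for clarity and are not defined in the original text.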
In traditional stereo matching algorithms, computing the disparity between the left and right views is mainly divided into four steps: cost computation, cost aggregation, disparity computation, and disparity refinement. Traditional stereo matching algorithms often suffer from low disparity accuracy and a large amount of computation. In recent years, convolutional neural networks (CNNs) have made great progress in binocular stereo matching. By extracting and downsampling features from the binocular images with a convolutional neural network before performing disparity aggregation and computation, the amount of computation can be reduced significantly. At present, the cost aggregation part of such networks usually uses 3D convolution, which aggregates the cost effectively and enables accurate disparity regression; however, 3D convolution is computationally expensive, which is very unfavorable for real-time applications. There are also networks that use only 2D convolution for cost aggregation; to do so, they compress the channel dimension of the learned features, which loses feature information and lowers the accuracy of these networks.
Existing neural-network-based binocular stereo matching algorithms fall mainly into two categories: algorithms that use 2D convolution for cost aggregation and algorithms that use 3D convolution for cost aggregation. The two categories have at least the following shortcomings:
2D-convolution cost aggregation algorithms build the cost volume from the left and right feature maps by compressing the channel information into a four-dimensional cost volume. This allows 2D convolution to be used directly for cost aggregation, but because a large amount of feature information is discarded when the channels are compressed, such methods are not competitive in accuracy.
3D-convolution cost aggregation algorithms keep the channel information when building the cost volume from the left and right feature maps, forming a five-dimensional cost volume that must be aggregated with 3D convolution. Although this achieves superior accuracy, the heavy computation of 3D convolution gives it no advantage for real-time use.
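As a rough, back-of-the-envelope operation count (an assumption made here to clarify the trade-off, not a figure from the text): a 3×3×3 3D convolution over a five-dimensional cost volume needs roughly 27 · C_in · C_out · D · H · W multiply–accumulate operations per layer, while a 3×3 2D convolution applied to every disparity slice of the same volume needs roughly 9 · C_in · C_out · D · H · W, i.e. about one third, and a parameter-free shift along the disparity axis adds nothing on top.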
Summary of the Invention
An embodiment of the present invention provides a stereo matching method based on hybrid 2D convolution and pseudo-3D convolution, which can greatly reduce the amount of computation while maintaining accuracy.
According to an embodiment of the present invention, a stereo matching method based on hybrid 2D convolution and pseudo-3D convolution is provided, comprising the following steps:
extracting image features based on preset parameters to obtain feature maps;
generating a cost volume based on the feature maps;
performing cost aggregation with the PSMNet structure and obtaining an initial disparity map through disparity regression, wherein the 3D convolutions in the PSMNet structure are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
generating a residual cost volume from the initial disparity and obtaining, through residual cost aggregation, a disparity residual that refines the initial disparity, wherein the 3D convolutions of the residual cost aggregation are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
further refining the optimized disparity map into a depth map using the CSPNet method.
Further, the method also includes obtaining the initial disparity map through disparity regression using the hourglass variant of PSMNet, with its 3D convolutions replaced by the hybrid 2D convolution and pseudo-3D convolution combination proposed by the present invention.
Further, cost aggregation is performed on the cost volume using a depth-shift scheme and the 2D convolution formula, with the 2D convolutions and pseudo-3D convolutions arranged alternately on top of the depth-shift scheme.
Further, the initial disparity map is used to warp the right feature map into a reconstructed left feature map, which is then combined with the original left feature map to generate the residual cost volume.
Further, the PSMNet structure is used to extract image features; the features are:
(32, H/4, W/4)
where H is the height of the input image and W is the width of the input image.
Further, the cost volume is generated by a similarity-measurement method.
Further, in the 2D convolution formula, the expression for the 3×3×3 case is as follows:
C^{out}_{h,w,d} = \sum_{c} \sum_{i=-1}^{1} \sum_{j=-1}^{1} \sum_{z=-1}^{1} K_{c,i,j,z} \, C_{c,\,h+i,\,w+j,\,d+z}
where C is the cost volume, C^{out} is the output channel after convolution, h, w and d are the height, width and depth of the feature map respectively, c indexes the input channels, K is the convolution kernel, and i, j and z are the indices of the height, width and depth dimensions respectively.
Further, the convolutional affine propagation of CSPNet is used to optimize the disparity, with 4 disparity-optimization update iterations.
The beneficial effects of the present invention are as follows: image features are extracted based on preset parameters to obtain feature maps; a cost volume is generated from the feature maps; the aggregated cost volume is obtained through the PSMNet structure; an initial disparity map is then obtained through disparity regression; a residual cost volume is derived from the initial disparity map, and after residual aggregation a disparity residual is obtained to refine the initial disparity map, wherein the 3D convolutions in the PSMNet structure and in the residual aggregation are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution; the optimized disparity map is further refined using the CSPNet method. The present invention combines 2D convolution to approximately realize the function of 3D convolution; because this data-shift operation contains no learnable parameters and adds no computation, the proposed cost aggregation scheme of hybrid 2D convolution and pseudo-3D convolution can greatly reduce the computation of existing models with only a tiny loss of accuracy. The present invention has at least the following advantages:
1. A solution is proposed for the heavy computation of current models, namely a cost aggregation method mixing 2D convolution and pseudo-3D convolution, in which the pseudo-3D convolution sub-module models the information of the depth dimension without adding parameters or computation, allowing the model to achieve higher accuracy.
2. Current stereo matching methods all face heavy computation, which severely limits their use in real-time application scenarios; the cost aggregation module based on hybrid 2D convolution and pseudo-3D convolution proposed by the present invention greatly reduces the amount of computation while maintaining accuracy.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and form a part of the present application; the exemplary embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
FIG. 1 is a flow chart of the stereo matching method based on hybrid 2D convolution and pseudo-3D convolution of the present invention;
FIG. 2 is a framework diagram of the HybridNet algorithm of the present invention;
FIG. 3 is a diagram of the specific parameters of HybridNet feature extraction of the present invention;
FIG. 4 is a diagram of the depth shift model of the present invention;
FIG. 5 is a diagram of the specific parameters of HybridNet of the present invention in which the 3D convolutions are replaced by the hybrid 2D convolution and pseudo-3D convolution combination;
FIG. 6 is a diagram of the specific parameters of the hourglass variant of HybridNet of the present invention in which the 3D convolutions are replaced by the hybrid 2D convolution and pseudo-3D convolution combination;
FIG. 7 is a diagram of the depth optimization performed by the present invention using the CSPNet method;
FIG. 8 and FIG. 9 are comparisons of HybridNet of the present invention with prior-art algorithms on the Scene Flow and KITTI Stereo 2015 datasets;
FIG. 10 is a diagram of a binocular stereo camera mounted on a vehicle in an application scenario of the present invention;
FIG. 11 shows a road scene and the corresponding depth map from a binocular stereo camera mounted on a vehicle in an application scenario of the present invention;
FIG. 12 is an example of the three-dimensional reconstruction of an object in an application scenario of the present invention.
Detailed Description of the Embodiments
In order to enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second" and the like in the description, claims and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in an order other than that illustrated or described herein. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
As shown in FIG. 1 to FIG. 12, according to an embodiment of the present invention, a stereo matching method based on hybrid 2D convolution and pseudo-3D convolution is provided. Referring to FIG. 1, it comprises the following steps:
S101: extracting image features based on preset parameters to obtain feature maps;
In this embodiment, the feature extraction module of PSMNet is adopted and the number of channels of its convolutional layers is reduced to half of the original; the obtained features have shape (32, H/4, W/4), where H is the height of the input image and W is the width of the input image. The specific parameters are shown in FIG. 3.
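A minimal PyTorch-style sketch of such a halved-channel, quarter-resolution feature extractor is given below; the intermediate layer widths and layer count are illustrative assumptions rather than the parameters of FIG. 3, and only the final (32, H/4, W/4) output shape follows the text.

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# Two stride-2 stages bring the input image to 1/4 resolution with 32 channels,
# i.e. a (32, H/4, W/4) feature map as described above.
feature_extractor = nn.Sequential(
    conv_bn_relu(3, 16, stride=2),    # -> (16, H/2, W/2)
    conv_bn_relu(16, 16),
    conv_bn_relu(16, 32, stride=2),   # -> (32, H/4, W/4)
    conv_bn_relu(32, 32),
    nn.Conv2d(32, 32, 3, padding=1, bias=False),
)
```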
S102: generating a cost volume based on the feature maps;
In this embodiment, a similarity-measurement cost volume generation method is adopted; the resulting feature shape is (32, H/4, W/4, D/4), where D is the maximum disparity value, taken as 192 by the present invention.
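One way to realize such a similarity-based cost volume is a group-wise correlation between the left features and the shifted right features, sketched below; the grouping and the exact similarity measure are assumptions made for illustration, since the text does not spell them out (with D = 192 and quarter-resolution features, D/4 = 48 disparity levels are evaluated).

```python
import torch

def correlation_cost_volume(feat_l, feat_r, max_disp=48, num_groups=32):
    """Similarity-based cost volume sketch.

    feat_l, feat_r: (B, C, H/4, W/4) left/right feature maps.
    Returns a volume of shape (B, num_groups, H/4, W/4, max_disp),
    i.e. the (32, H/4, W/4, D/4) layout described in the text."""
    b, c, h, w = feat_l.shape
    assert c % num_groups == 0
    ch_per_group = c // num_groups
    volume = feat_l.new_zeros(b, num_groups, h, w, max_disp)
    for d in range(max_disp):
        if d > 0:
            sim = feat_l[:, :, :, d:] * feat_r[:, :, :, :-d]
            sim = sim.view(b, num_groups, ch_per_group, h, w - d).mean(dim=2)
            volume[:, :, :, d:, d] = sim
        else:
            sim = (feat_l * feat_r).view(b, num_groups, ch_per_group, h, w).mean(dim=2)
            volume[:, :, :, :, 0] = sim
    return volume
```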
S103: obtaining the aggregated cost volume through cost aggregation with the PSMNet structure, and obtaining an initial disparity map through disparity regression; wherein, in the PSMNet structure, the 3D convolutions are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution.
In this embodiment, the present invention proposes the depth shift module (DSM), i.e. a depth shift combined with a 2D convolution formula; the DSM is shown in FIG. 4.
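A minimal sketch of the depth-shift-plus-2D-convolution idea is given below; the channel split ratio, the residual connection and the block layout are assumptions made for illustration and are not taken from FIG. 4.

```python
import torch
import torch.nn as nn

class DepthShift2DConv(nn.Module):
    """Pseudo-3D block sketch: shift part of the channels along the disparity
    axis (no parameters, no extra computation), then apply a plain 2D convolution."""

    def __init__(self, channels, shift_div=4):
        super().__init__()
        self.shift_div = shift_div
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, volume):
        # volume: (B, C, D, H, W) five-dimensional cost volume
        b, c, d, h, w = volume.shape
        fold = c // self.shift_div
        shifted = volume.clone()
        # shift one channel group forward and one backward along the disparity axis
        shifted[:, :fold, 1:] = volume[:, :fold, :-1]
        shifted[:, fold:2 * fold, :-1] = volume[:, fold:2 * fold, 1:]
        # fold the disparity dimension into the batch so a 2D convolution can be used
        x = shifted.permute(0, 2, 1, 3, 4).reshape(b * d, c, h, w)
        x = self.conv(x)
        return x.view(b, d, c, h, w).permute(0, 2, 1, 3, 4) + volume
```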
S104: generating a residual cost volume from the initial disparity, and obtaining, through residual cost aggregation, a disparity residual that refines the initial disparity map; wherein the 3D convolutions of the residual cost aggregation are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution.
S105: further refining the optimized disparity map with the CSPNet method.
In this embodiment, the disparity optimization uses the CSPNet method to refine the depth map, as shown in FIG. 7.
In this embodiment, the PSMNet hourglass structure is used to obtain the initial disparity map; the cost aggregation consists of the initial disparity regression and the residual disparity fine-tuning. For the initial disparity regression, the PSMNet structure is adopted and its 3D convolutions are replaced by the hybrid 2D convolution and pseudo-3D convolution combination proposed by the present invention; the specific parameter table is shown in FIG. 2. For the residual disparity fine-tuning, the hourglass variant of PSMNet is adopted and its 3D convolutions are replaced by the hybrid 2D convolution and pseudo-3D convolution combination proposed by the present invention; the specific parameters are shown in FIG. 5 and FIG. 6.
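The text does not spell out the disparity regression step; the sketch below shows the soft-argmin regression commonly used after cost aggregation in PSMNet-style networks, assumed here for illustration.

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost, max_disp=192):
    """Soft-argmin disparity regression over an aggregated cost volume.

    cost: (B, D, H, W) matching cost (lower = better match).
    Returns a (B, H, W) sub-pixel disparity map."""
    prob = F.softmax(-cost, dim=1)                # per-pixel distribution over disparities
    disp_values = torch.arange(max_disp, dtype=cost.dtype,
                               device=cost.device).view(1, max_disp, 1, 1)
    return torch.sum(prob * disp_values, dim=1)   # expectation over disparity candidates
```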
At present, cost aggregation with 3D convolution achieves the best stereo matching results, but its drawback is heavy computation; the cost aggregation scheme mixing 2D convolution and pseudo-3D convolution proposed by the invention of the present application can cut the computation by more than half. FIG. 8 and FIG. 9 give a simple comparison between the present invention and other methods, showing HybridNet against current algorithms on the Scene Flow and KITTI Stereo 2015 datasets.
When designing the depth shift module (DSM), the relationships between the dimensions must be considered carefully, and the number of shifted channels must be adjusted in the code after downsampling; in addition, on the stereo matching task a (1×1×1) convolution degrades the cost aggregation performance.
FIG. 10 to FIG. 12 show application scenarios of the invention of the present application:
1. Autonomous driving
With a binocular stereo camera mounted on the vehicle (FIG. 10), distance information within the image range can be estimated (FIG. 11), providing early-warning information for advanced driver assistance, such as the distance to the preceding vehicle and to obstacles.
2. Binocular 3D reconstruction
The key to binocular 3D reconstruction is high-precision stereo matching that produces an accurate depth map; the 3D reconstruction of a specific object is then completed through triangulation and texture mapping (FIG. 12).
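As a small illustration of this reconstruction step, the sketch below back-projects a disparity map into a 3D point cloud with the standard pinhole model; the intrinsics and baseline are assumed inputs and are not values given in the text.

```python
import numpy as np

def disparity_to_points(disp, fx, fy, cx, cy, baseline):
    """Back-project a (H, W) disparity map into an (H, W, 3) point cloud."""
    h, w = disp.shape
    z = fx * baseline / np.clip(disp, 1e-6, None)   # depth from disparity, Z = f*B/d
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```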
The present invention provides a stereo matching method based on hybrid 2D convolution and pseudo-3D convolution, as shown in FIG. 1, comprising:
Step 1: feature extraction; extracting image features based on parameters to obtain feature maps;
Step 2: cost volume generation; generating a cost volume based on the feature maps;
Step 3: initial cost aggregation; obtaining the aggregated cost volume through cost aggregation with the PSMNet structure, and obtaining an initial disparity map through disparity regression; wherein, in the PSMNet structure, the 3D convolutions are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
Step 4: residual optimization; generating a residual cost volume from the initial disparity and obtaining, through residual cost aggregation, a disparity residual that refines the initial disparity; wherein the 3D convolutions of the residual cost aggregation are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
Step 5: depth optimization; further refining the optimized disparity map with the CSPNet method.
The beneficial effects of the present invention are as follows: image features are extracted based on preset parameters to obtain feature maps; a cost volume is generated from the feature maps; the aggregated cost volume is obtained through the PSMNet structure; an initial disparity map is then obtained through disparity regression; a residual cost volume is derived from the initial disparity map, and after residual aggregation a disparity residual is obtained to refine the initial disparity map; wherein, in the PSMNet structure and in the residual aggregation, the 3D convolutions are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution; the optimized disparity map is further refined using the CSPNet method. The present invention combines 2D convolution to approximately realize the function of 3D convolution; because this data-shift operation contains no learnable parameters and adds no computation, the proposed cost aggregation scheme of hybrid 2D convolution and pseudo-3D convolution can greatly reduce the computation of existing models with only a tiny loss of accuracy. The present invention has at least the following advantages:
1. A solution is proposed for the heavy computation of current models, namely a cost aggregation method mixing 2D convolution and pseudo-3D convolution, in which the pseudo-3D convolution sub-module models the information of the depth dimension without adding parameters or computation, allowing the model to achieve higher accuracy.
2. Current stereo matching methods all face heavy computation, which severely limits their use in real-time application scenarios; the cost aggregation module based on hybrid 2D convolution and pseudo-3D convolution proposed by the present invention greatly reduces the amount of computation while maintaining accuracy.
In this embodiment, the method also includes obtaining the initial disparity map through disparity regression using the hourglass variant of PSMNet, with its 3D convolutions replaced by the hybrid 2D convolution and pseudo-3D convolution combination proposed by the present invention; the specific parameters are shown in FIG. 6.
In this embodiment, the present invention adopts the feature extraction module of PSMNet and reduces the number of channels of its convolutional layers to half of the original; the obtained features are (32, H/4, W/4), where H is the height of the input image and W is the width of the input image. The specific parameters are shown in FIG. 3.
To solve the heavy computation of the current 3D-convolution cost aggregation, the present invention designs an efficient stereo matching network mixing 2D convolution and pseudo-3D convolution (HybridNet, shown in FIG. 3) to achieve depth estimation at low computational cost. In image convolution, kernel parameters equal to 0 or 1 can be realized by shifting the corresponding data, which removes the learnable parameters and the computation of that part. The present invention therefore uses this data shift as the modeling mechanism for the depth/disparity dimension and proposes a pseudo-3D convolution module, so that the function of 3D convolution can be approximated in combination with 2D convolution.
In this embodiment, cost aggregation is performed on the cost volume using the depth-shift scheme and the 2D convolution formula, and the 2D convolutions and pseudo-3D convolutions are arranged alternately on top of the depth-shift scheme.
As shown in FIG. 2, the efficient stereo matching network mixing 2D convolution and pseudo-3D convolution (HybridNet, FIG. 2) achieves depth estimation at low computational cost. In image convolution, kernel parameters equal to 0 or 1 can be realized by shifting the corresponding data, which removes the learnable parameters and the computation of that part. We therefore use this data shift as the modeling mechanism for the depth/disparity dimension and propose a pseudo-3D convolution module, so that the function of 3D convolution can be approximated in combination with 2D convolution.
In this embodiment, on top of the depth-shift scheme the 2D convolutions and pseudo-3D convolutions are arranged alternately, which maintains the cost aggregation performance while further reducing the inference time.
The present invention proposes a pseudo-3D convolution module that combines with 2D convolution to approximate the function of 3D convolution. Because this data-shift operation contains no learnable parameters and adds no computation, the proposed cost aggregation scheme of hybrid 2D convolution and pseudo-3D convolution can greatly reduce the computation of existing models with only a tiny loss of accuracy.
In this embodiment, the initial disparity map is used to warp the right feature map into a reconstructed left feature map, which is then used together with the original left feature map to generate the residual cost volume.
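A common way to perform such disparity-based warping is bilinear sampling; the sketch below is an assumed implementation, since the text does not give the exact warping operator.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(feat_r, disp):
    """Warp the right feature map toward the left view using the left-view disparity.

    feat_r: (B, C, H, W) right features; disp: (B, 1, H, W) initial disparity."""
    b, c, h, w = feat_r.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat_r.device, dtype=feat_r.dtype),
        torch.arange(w, device=feat_r.device, dtype=feat_r.dtype),
        indexing="ij")
    x_src = xs.unsqueeze(0) - disp.squeeze(1)            # sample the right view at x - d
    grid_x = 2.0 * x_src / (w - 1) - 1.0                 # normalize to [-1, 1]
    grid_y = (2.0 * ys / (h - 1) - 1.0).unsqueeze(0).expand_as(grid_x)
    grid = torch.stack([grid_x, grid_y], dim=-1)         # (B, H, W, 2)
    return F.grid_sample(feat_r, grid, align_corners=True)
```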
In this embodiment, the PSMNet structure is used to extract image features; the features are:
(32, H/4, W/4)
where H is the height of the input image and W is the width of the input image.
In the feature extraction module, we adopt the feature extraction module of PSMNet and reduce the number of channels of its convolutional layers to half of the original; the specific parameters are shown in FIG. 4.
To further reduce the computation, the size of the feature maps could be reduced further through downsampling, although this also means a corresponding drop in accuracy. Other uses: the proposed pseudo-3D convolution is also applicable to other 3D convolutional networks, such as optical flow estimation and point cloud processing.
In this embodiment, the cost volume is generated by a similarity-measurement method.
In this method, a similarity-measurement cost volume generation method is adopted; the resulting feature shape is (32, H/4, W/4, D/4), where D is the maximum disparity value, taken as 192 in this embodiment.
As shown in FIG. 4, in this embodiment, on top of the depth-shift scheme the 2D convolutions and pseudo-3D convolutions are arranged alternately, which maintains the cost aggregation performance while further reducing the inference time.
In the 2D convolution formula, the expression for the 3×3×3 case is as follows:
C^{out}_{h,w,d} = \sum_{c} \sum_{i=-1}^{1} \sum_{j=-1}^{1} \sum_{z=-1}^{1} K_{c,i,j,z} \, C_{c,\,h+i,\,w+j,\,d+z}
where C is the cost volume, C^{out} is the output channel after convolution, h, w and d are the height, width and depth of the feature map respectively, c indexes the input channels, K is the convolution kernel, and i, j and z are the indices of the height, width and depth dimensions respectively.
When designing the depth shift module (DSM), the relationships between the dimensions must be considered carefully, and the number of shifted channels must be adjusted in the code after downsampling; in addition, on the stereo matching task a (1×1×1) convolution degrades the cost aggregation performance.
In this embodiment, the disparity optimization method of CSPNet uses the convolutional affine propagation of CSPNet to optimize the disparity, with 4 disparity-optimization update iterations.
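A simplified sketch of affinity-based propagation in the spirit of CSPN, run for the 4 update iterations stated above, is given below; the 3×3 neighbourhood and the normalization scheme are assumptions and not details taken from FIG. 7.

```python
import torch
import torch.nn.functional as F

def cspn_refine(disp, affinity, num_iters=4):
    """Iteratively propagate a disparity map with learned affinity weights.

    disp: (B, 1, H, W) disparity to refine.
    affinity: (B, 8, H, W) learned weights for the 8 neighbours of each pixel."""
    # normalize neighbour weights so that, together with the centre, they sum to 1
    abs_sum = affinity.abs().sum(dim=1, keepdim=True).clamp(min=1e-6)
    neigh_w = affinity / abs_sum
    centre_w = 1.0 - neigh_w.sum(dim=1, keepdim=True)
    refined = disp
    for _ in range(num_iters):
        patches = F.unfold(refined, kernel_size=3, padding=1)          # (B, 9, H*W)
        patches = patches.view(disp.shape[0], 9, *disp.shape[2:])      # (B, 9, H, W)
        neighbours = torch.cat([patches[:, :4], patches[:, 5:]], dim=1)
        refined = (neigh_w * neighbours).sum(dim=1, keepdim=True) + centre_w * disp
    return refined
```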
The first problem that the invention of the present application solves is universality: the module of the present invention can be inserted into any 3D convolutional network to achieve the effect of 3D convolution at a computational cost close to that of 2D convolution. Current mainstream stereo matching network designs all contain 3D convolution, and the invention of the present application can be transferred to the above networks containing 3D convolution. It likewise applies to similar dense regression tasks, such as optical flow estimation and 3D point cloud segmentation.
The second point is the balance between computation and accuracy, compared against two well-known plug-and-play video recognition modules, TSM and Non-local. TSM and Non-local can also be embedded into current mainstream 2D networks, but the module of the present application outperforms TSM and is more robust thanks to its residual connection; in addition, alternating 2D convolution with pseudo-3D convolution further reduces computation while preserving the ability to model the depth dimension. The computational cost of Non-local is larger than that of the present application's smallBIg, and the results of the present application's 2D network are also clearly higher than those of the Non-local+3D network. This shows that the design of the present application has advantages in both computational cost and accuracy.
For some special application scenarios, such as security, abnormal behaviors or actions are often short and fast-changing. The technology of the present application is not sensitive to how fast the action changes across frames and can model actions of different durations well: the Kinetics dataset consists of videos of about 10 seconds (for example, a basketball shot, from dribbling to preparing to shoot to finally scoring, lasts long and changes slowly), whereas Something-Something consists of 2-3 second videos (for example, a thumbs-up, where a single action change takes no more than 3 seconds). The good results obtained by the present application on both datasets demonstrate that its module can model actions of different durations well.
In addition, this technology has a wide range of applications:
1. Smart sports training / video-assisted refereeing: because the technology is not sensitive to the speed or duration of video motions, it can be applied widely across sports scenarios, such as slow-moving yoga and rapidly changing figure skating or gymnastics.
2. Intelligent video review: abnormal-action recognition and judgment can be completed on the mobile terminal, and abnormalities can be sent directly to the cloud server, further improving the speed and efficiency of the judgment.
3. Intelligent video montage: facing a huge video database, videos of a given action can be automatically extracted and edited together.
4. Intelligent security: action recognition can be performed directly on intelligent terminals with limited computing resources, such as smart glasses, drones, and smart cameras, and abnormal behaviors can be fed back directly, improving the immediacy and accuracy of patrols.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (8)

  1. A stereo matching method based on hybrid 2D convolution and pseudo-3D convolution, characterized by comprising the following steps:
    extracting image features based on preset parameters to obtain feature maps;
    generating a cost volume based on the feature maps;
    obtaining an aggregated cost volume through cost aggregation with a PSMNet structure, and obtaining an initial disparity map through disparity regression; wherein, in the PSMNet structure, the 3D convolutions are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
    generating a residual cost volume from the initial disparity, and obtaining, through residual cost aggregation, a disparity residual that refines the initial disparity map; wherein the 3D convolutions of the residual cost aggregation are replaced by a combination of hybrid 2D convolution and pseudo-3D convolution;
    further refining the optimized disparity map with the CSPNet method.
  2. The stereo matching method based on hybrid 2D convolution and pseudo-3D convolution according to claim 1, characterized in that the method further comprises obtaining the initial disparity map through disparity regression using the hourglass variant of PSMNet, and replacing its 3D convolutions with the proposed combination of hybrid 2D convolution and pseudo-3D convolution.
  3. The stereo matching method based on hybrid 2D convolution and pseudo-3D convolution according to claim 1, characterized in that cost aggregation is performed on the cost volume using a depth-shift scheme and a 2D convolution formula, and the 2D convolutions and the pseudo-3D convolutions are arranged alternately on top of the depth-shift scheme.
  4. The stereo matching method based on hybrid 2D convolution and pseudo-3D convolution according to claim 1, characterized in that the initial disparity map is used to warp the right feature map into a reconstructed left feature map, which is then combined with the original left feature map to generate the residual cost volume.
  5. The stereo matching method based on hybrid 2D convolution and pseudo-3D convolution according to claim 1, characterized in that the PSMNet structure is used to extract image features, the features being:
    (32, H/4, W/4)
    where H is the height of the input image and W is the width of the input image.
  6. The stereo matching method based on hybrid 2D convolution and pseudo-3D convolution according to claim 1, characterized in that the cost volume is generated by means of a similarity measure.
  7. The stereo matching method based on hybrid 2D convolution and pseudo-3D convolution according to claim 3, characterized in that, in the 2D convolution formula, the expression for the 3×3×3 case is as follows:
    C^{out}_{h,w,d} = \sum_{c} \sum_{i=-1}^{1} \sum_{j=-1}^{1} \sum_{z=-1}^{1} K_{c,i,j,z} \, C_{c,\,h+i,\,w+j,\,d+z}
    where C is the cost volume, C^{out} is the output channel after convolution, h, w and d are the height, width and depth of the feature map respectively, c indexes the input channels, K is the convolution kernel, and i, j and z are the indices of the height, width and depth dimensions respectively.
  8. The stereo matching method based on hybrid 2D convolution and pseudo-3D convolution according to claim 1, characterized in that the disparity optimization method of CSPNet uses the convolutional affine propagation of CSPNet to optimize the disparity, with 4 disparity-optimization update iterations.
PCT/CN2020/139400 2020-12-11 2020-12-25 基于混合2d卷积和伪3d卷积的立体匹配方法 WO2022120988A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011436492.0 2020-12-11
CN202011436492.0A CN112489097B (zh) 2020-12-11 2020-12-11 基于混合2d卷积和伪3d卷积的立体匹配方法

Publications (1)

Publication Number Publication Date
WO2022120988A1 true WO2022120988A1 (zh) 2022-06-16

Family

ID=74940986

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139400 WO2022120988A1 (zh) 2020-12-11 2020-12-25 基于混合2d卷积和伪3d卷积的立体匹配方法

Country Status (2)

Country Link
CN (1) CN112489097B (zh)
WO (1) WO2022120988A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703999A (zh) * 2023-08-04 2023-09-05 东莞市爱培科技术有限公司 一种用于双目立体匹配的残差融合方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170636A (zh) * 2022-06-17 2022-10-11 五邑大学 混合代价体的双目立体匹配方法、设备及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472819A (zh) * 2018-09-06 2019-03-15 杭州电子科技大学 一种基于级联几何上下文神经网络的双目视差估计方法
CN109816710A (zh) * 2018-12-13 2019-05-28 中山大学 一种双目视觉系统高精度且无拖影的视差计算方法
CN111402311A (zh) * 2020-03-09 2020-07-10 福建帝视信息科技有限公司 一种基于知识蒸馏的轻量级立体视差估计方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106355570B (zh) * 2016-10-21 2019-03-19 昆明理工大学 一种结合深度特征的双目立体视觉匹配方法
CN110533712B (zh) * 2019-08-26 2022-11-04 北京工业大学 一种基于卷积神经网络的双目立体匹配方法
CN111583313A (zh) * 2020-03-25 2020-08-25 上海物联网有限公司 一种基于PSMNet改进的双目立体匹配方法
CN111696148A (zh) * 2020-06-17 2020-09-22 中国科学技术大学 基于卷积神经网络的端到端立体匹配方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472819A (zh) * 2018-09-06 2019-03-15 杭州电子科技大学 一种基于级联几何上下文神经网络的双目视差估计方法
CN109816710A (zh) * 2018-12-13 2019-05-28 中山大学 一种双目视觉系统高精度且无拖影的视差计算方法
CN111402311A (zh) * 2020-03-09 2020-07-10 福建帝视信息科技有限公司 一种基于知识蒸馏的轻量级立体视差估计方法

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAI CHANGJIANG; MORDOHAI PHILIPPOS: "Do End-to-end Stereo Algorithms Under-utilize Information?", 2020 INTERNATIONAL CONFERENCE ON 3D VISION (3DV), IEEE, 25 November 2020 (2020-11-25), pages 374 - 383, XP033880205, DOI: 10.1109/3DV50981.2020.00047 *
LU HAIHUA; XU HAI; ZHANG LI; MA YANBO; ZHAO YONG: "Cascaded Multi-scale and Multi-dimension Convolutional Neural Network for Stereo Matching", 2018 IEEE VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), IEEE, 9 December 2018 (2018-12-09), pages 1 - 4, XP033541866, DOI: 10.1109/VCIP.2018.8698637 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703999A (zh) * 2023-08-04 2023-09-05 东莞市爱培科技术有限公司 一种用于双目立体匹配的残差融合方法

Also Published As

Publication number Publication date
CN112489097B (zh) 2024-05-17
CN112489097A (zh) 2021-03-12

Similar Documents

Publication Publication Date Title
Paschalidou et al. Raynet: Learning volumetric 3d reconstruction with ray potentials
CN108520554B (zh) 一种基于orb-slam2的双目三维稠密建图方法
WO2018214505A1 (zh) 立体匹配方法与系统
CN110689008A (zh) 一种面向单目图像的基于三维重建的三维物体检测方法
CN110220493B (zh) 一种双目测距方法及其装置
CN108398139B (zh) 一种融合鱼眼图像与深度图像的动态环境视觉里程计方法
CN113592026B (zh) 一种基于空洞卷积和级联代价卷的双目视觉立体匹配方法
CN106530333B (zh) 基于捆绑约束的分级优化立体匹配方法
WO2022120988A1 (zh) 基于混合2d卷积和伪3d卷积的立体匹配方法
Chen et al. A full density stereo matching system based on the combination of CNNs and slanted-planes
CN113763446B (zh) 一种基于引导信息的立体匹配方法
CN115329111B (zh) 一种基于点云与影像匹配的影像特征库构建方法及系统
CN106952304A (zh) 一种利用视频序列帧间相关性的深度图像计算方法
Xu et al. High-speed stereo matching algorithm for ultra-high resolution binocular image
CN110060264B (zh) 神经网络训练方法、视频帧处理方法、装置及系统
CN117132737B (zh) 一种三维建筑模型构建方法、系统及设备
Zhang et al. Slfnet: A stereo and lidar fusion network for depth completion
CN111179327B (zh) 一种深度图的计算方法
CN109816710B (zh) 一种双目视觉系统高精度且无拖影的视差计算方法
CN112270701A (zh) 基于分组距离网络的视差预测方法、系统及存储介质
Li et al. Monocular 3-D Object Detection Based on Depth-Guided Local Convolution for Smart Payment in D2D Systems
WO2023240764A1 (zh) 混合代价体的双目立体匹配方法、设备及存储介质
Zeng et al. Tsfe-net: Two-stream feature extraction networks for active stereo matching
CN110610503A (zh) 一种基于立体匹配的电力刀闸三维信息恢复方法
CN115375746A (zh) 基于双重空间池化金字塔的立体匹配方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20964917

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20964917

Country of ref document: EP

Kind code of ref document: A1