CN116433904A - Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution - Google Patents
Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
- Publication number
- CN116433904A (application CN202310347813.7A)
- Authority: CN (China)
- Prior art keywords: rgb, features, cross, feature, modal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/048—Activation functions
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06V10/40—Extraction of image or video features
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
Description
Technical field:
The invention relates to the fields of computer vision and image processing, and in particular to a cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution.
Background art:
Semantic segmentation takes raw data as input and converts it into masks with highlighted regions of interest, where each pixel in the image is assigned a class ID according to the object it belongs to. Semantic segmentation addresses this problem by grouping together the parts of an image that belong to the same object, which broadens its range of applications. Compared with other image-based tasks, semantic segmentation is fundamentally different and more advanced. In short, in the field of computer vision, semantic segmentation is a fully convolutional pixel classification task.
Single-modality RGB semantic segmentation struggles with challenging factors such as complex scenes: it is difficult to delineate the outline of a target and therefore to segment it precisely, and it is also difficult to locate and classify all targets against the background accurately and completely. To address this problem, depth (Depth) images are introduced into semantic segmentation, and RGB images and Depth images are combined into RGB-D data for segmentation.
The Depth map mainly provides information such as target edges. When the Depth map is introduced into the semantic segmentation task, the RGB image provides global information while the depth map provides more complete contour information, expressing geometric structure and distance. Combining RGB images with depth maps is therefore a reasonable choice for semantic segmentation.
Most previous RGB-D semantic segmentation methods either treat the Depth map as a data stream independent of the RGB image and extract its features separately, or use the Depth image as a fourth channel of the RGB image. Such methods treat RGB and Depth images indiscriminately and ignore the fact that RGB and depth information are essentially different, so the convolution operations widely used for RGB images are not suitable for processing depth information.
Considering the cross-modal ambiguity between RGB image data and Depth image data, the invention explores a cross-modal feature fusion method based on shape perception and pixel convolution. By further exploiting the local shapes of depth features and their relations in cross-modal feature fusion, the invention helps the semantic segmentation model classify pixels more accurately.
Summary of the invention:
In view of the problems raised above, the invention provides a cross-modal RGB-D semantic segmentation method based on shape perception. The technical scheme adopted is as follows:
1. Obtain the RGB-D datasets for training and testing the task.
1.1) The NYU-Depth-V2 (NYUDv2-13 and -40) dataset is used as the training set, and the SUN RGB-D dataset is used as the test set.
1.2) In the RGB-D image datasets, each sample is annotated with the scene category, 2D segmentation, 3D room layout, 3D object boxes and 3D object orientation.
2. Using deep learning techniques, an RGB-D semantic segmentation network model is constructed based on shape perception and pixel convolution through a dual encoder-decoder structure:
2.1) The encoder-decoder architecture is used as the basic architecture of the model to extract the RGB image features and the corresponding Depth image features.
2.2) The invention uses the NYU-Depth-V2 dataset to pre-train the network model built on the dual encoder-decoder architecture.
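A minimal sketch of the dual-encoder feature extraction described in steps 2.1-2.2 is given below. The ResNet-50 backbone, the use of torchvision, and the exact stage boundaries are illustrative assumptions; the text only specifies a dual encoder producing five levels of RGB and Depth features.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class DualEncoder(nn.Module):
    """Two parallel encoders producing five levels of RGB and Depth features."""

    def __init__(self):
        super().__init__()
        # Backbone choice is an assumption; the patent does not name one.
        self.rgb_backbone = models.resnet50(weights=None)
        self.depth_backbone = models.resnet50(weights=None)

    @staticmethod
    def _stages(backbone, x):
        # Collect the five hierarchical feature maps (levels 1..5).
        feats = []
        x = backbone.relu(backbone.bn1(backbone.conv1(x)))
        feats.append(x)                                   # level 1
        x = backbone.maxpool(x)
        for layer in (backbone.layer1, backbone.layer2,
                      backbone.layer3, backbone.layer4):
            x = layer(x)
            feats.append(x)                               # levels 2..5
        return feats

    def forward(self, rgb, depth):
        # Both inputs are three-channel tensors, as stated in steps 2.1/2.2.
        rgb_feats = self._stages(self.rgb_backbone, rgb)
        depth_feats = self._stages(self.depth_backbone, depth)
        return rgb_feats, depth_feats
```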
3. Cross-modal feature fusion is performed on the RGB image features and the corresponding Depth image features extracted in step 2, and this fusion is used to construct a cross-modal feature fusion network for generating multi-modal features.
3.1) The cross-modal feature fusion module consists of five levels of FCF modules that integrate the five levels of RGB image features and the corresponding Depth image features, and it updates five levels of features.
3.2) The input of the i-th FCF module is the i-th level RGB feature and the i-th level Depth feature, and five levels of features are updated through an interactive attention mechanism.
3.3) The FCF module generates multi-modal features through feature cross fusion. The specific process is as follows:
3.3.1) First, a cross-pixel convolution module is constructed to capture RGB and pixel-difference features and further enhance the RGB image features. At the same time, a shape-aware convolution is constructed for the depth map to obtain more accurate local shape and edge information and further enhance the Depth image features.
3.3.2) The RGB image features and the corresponding Depth image features are then fused with an element-wise matrix addition operation, where pixel convolution judges whether a pixel is usable and the element-wise addition determines the final value. The softmax activation function then converts the fused features into the RGB feature update weight Wr and the depth feature update weight Wd:
Here conv denotes the convolution module, an element-wise matrix multiplication and the element-wise matrix addition (add) combine the pixel convolution value and the RGB convolution value, GAP denotes global average pooling, and softmax denotes the softmax activation function.
3.3.3) After obtaining the RGB feature update weight Wr and the depth feature update weight Wd, Wr and Wd are combined with the enhanced RGB image features and the corresponding enhanced Depth image features, respectively, to obtain new RGB features and depth features.
3.3.4) Through the above operations, five levels of features are updated, and the updated features of each level are fed into the next pixel convolution module and shape-aware module; these multi-level operations enhance the receptive-field information and high-level semantic information of the features.
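A minimal sketch of one FCF level (steps 3.3.1-3.3.4) follows. The cross-pixel and shape-aware branches are approximated by plain 3×3 convolutions, and the residual way of applying Wr and Wd is an assumption; the text describes these components only at a functional level.

```python
import torch
import torch.nn as nn


class FCF(nn.Module):
    """One level of feature cross fusion (sketch)."""

    def __init__(self, channels):
        super().__init__()
        # Placeholders for the cross-pixel and shape-aware convolutions.
        self.cross_pixel = nn.Conv2d(channels, channels, 3, padding=1)
        self.shape_aware = nn.Conv2d(channels, channels, 3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.to_weights = nn.Conv2d(channels, 2, kernel_size=1)

    def forward(self, f_rgb, f_depth):
        r = self.cross_pixel(f_rgb)        # enhanced RGB features
        d = self.shape_aware(f_depth)      # enhanced Depth features
        fused = r + d                      # element-wise matrix addition
        w = torch.softmax(self.to_weights(self.gap(fused)), dim=1)
        w_r, w_d = w[:, :1], w[:, 1:]      # update weights Wr and Wd
        # Combining the weights with the enhanced features (residual form
        # assumed) yields the updated RGB and Depth features of this level.
        return f_rgb + w_r * r, f_depth + w_d * d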
4) Through the cross-fusion method, the cross-modal features, namely the RGB image features and the corresponding Depth image features, are fused to obtain the fused feature.
Here i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, conv5 denotes a convolution with a 5×5 kernel, and cat denotes the feature concatenation operation.
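A minimal sketch of this level-wise cross fusion, assuming the fused feature is produced by concatenating the two modalities along the channel axis and applying the 5×5 convolution (conv5) named above; channel sizes are illustrative.

```python
import torch
import torch.nn as nn


class CrossFuse(nn.Module):
    """Fused feature of level i: conv5(cat(RGB feature, Depth feature))."""

    def __init__(self, rgb_channels, depth_channels, out_channels):
        super().__init__()
        self.conv5 = nn.Conv2d(rgb_channels + depth_channels, out_channels,
                               kernel_size=5, padding=2)

    def forward(self, f_rgb_i, f_depth_i):
        return self.conv5(torch.cat([f_rgb_i, f_depth_i], dim=1))
```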
4.1) The updated features are passed through the effective feature layers, and features are extracted with the pixel convolution structure:
Pi = Conv(P, Ki)    (formula 3)
Di = Conv(R, Ki)    (formula 4)
Ri = Conv(Di + Pi, K1)    (formula 5)
Here i ∈ {1, 2, 3, 4, 5} denotes the level of the feature, Conv() denotes the convolution operation, Ki is the level-specific convolution kernel, Di is the RGB feature extraction result, Pi is the extracted pixel information, and K1 is a 1×1 convolution kernel. Ri is the final RGB image feature.
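A minimal sketch of formulas (3)-(5), assuming a 3×3 kernel for the level-specific Ki (the text only states that Ki differs between levels) and the stated 1×1 kernel for K1.

```python
import torch.nn as nn


class PixelConvExtraction(nn.Module):
    """Sketch of Ri = Conv(Di + Pi, K1) with Di = Conv(R, Ki) and Pi = Conv(P, Ki)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.k_i_pixel = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.k_i_rgb = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.k_1 = nn.Conv2d(channels, channels, kernel_size=1)  # K1: 1x1 kernel

    def forward(self, pixel_info, rgb_feat):
        p_i = self.k_i_pixel(pixel_info)   # Pi = Conv(P, Ki)   (formula 3)
        d_i = self.k_i_rgb(rgb_feat)       # Di = Conv(R, Ki)   (formula 4)
        return self.k_1(d_i + p_i)         # Ri = Conv(Di + Pi, K1)   (formula 5)
```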
4.2) The RGB image features generated in the above steps and the depth feature information from the modality-aware module are fed into the feature cross-fusion module to fuse multi-modal features with different receptive fields.
5) The updated fifth-level RGB image features and depth image features obtained in step 4 are input to the DeepLabV3+ decoder. The encoder output is upsampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are concatenated, a 3×3 convolution (for refinement) is applied to obtain the final fused feature, which is activated by the sigmoid function to obtain the predicted semantic map Pest.
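A minimal sketch of this decoding step, assuming bilinear upsampling and leaving channel counts and the number of classes as illustrative parameters; the complete DeepLabV3+ decoder contains additional components that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegDecoder(nn.Module):
    """Sketch of the DeepLabV3+-style decoding step described in step 5."""

    def __init__(self, high_channels, low_channels, num_classes):
        super().__init__()
        self.refine = nn.Conv2d(high_channels + low_channels, num_classes,
                                kernel_size=3, padding=1)

    def forward(self, high_feat, low_feat):
        # Upsample the encoder output 4x to match the low-level feature.
        up = F.interpolate(high_feat, scale_factor=4,
                           mode="bilinear", align_corners=False)
        x = torch.cat([up, low_feat], dim=1)   # connect the feature layers
        return torch.sigmoid(self.refine(x))   # predicted semantic map P_est
```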
6) The loss function is computed between the semantic map Pest predicted by the invention and the manually annotated semantic segmentation map PGT, and the parameter weights of the proposed model are updated step by step through the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D semantic segmentation algorithm.
7) On the basis of the model structure and parameter weights determined in step 6, the RGB-D image pairs in the test set are tested to generate the saliency map Ptest, which is evaluated with the MAE, S-measure, F-measure and E-measure metrics.
The invention implements RGB-D semantic segmentation based on deep convolutional neural networks, shape perception and pixel convolution. It extracts the rich spatial-structure and edge information in the Depth image and cross-fuses it with the global information extracted from the RGB image, so it can meet the requirements of semantic segmentation in different scenarios, especially in challenging scenes (complex backgrounds, low contrast, transparent objects, etc.). Compared with previous semantic segmentation methods, the invention has the following benefits:
First, the depth map is introduced without being treated as an extra channel of the RGB image, and the two modalities are not assigned the same contribution during feature extraction and fusion. Using deep learning techniques, the relationship between RGB-D image pairs and real classes is built through a dual encoder-decoder structure, and segmentation features are obtained through cross-modal feature extraction and fusion.
Second, a cross-fusion scheme effectively modulates the edge information with which the Depth image features complement the RGB image features without affecting the global information of the RGB image, and uses the depth distribution information itself to guide cross-modal feature fusion, eliminating the interference of background information in the RGB image and laying a solid foundation for the next stage of pixel segmentation.
Finally, the final semantic segmentation pixel map is predicted by the semantic decoder.
Description of the drawings
Figure 1 is a schematic diagram of the model structure of the invention.
Figure 2 is a schematic diagram of the cross-modal feature fusion module.
Figure 3 is a schematic diagram of the cross-pixel convolution module.
Figure 4 is a schematic diagram of the segmentation decoder.
Figure 5 is a schematic diagram of model training and testing.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described examples are only some, not all, of the examples of the present invention. Based on the examples in the present invention, all other examples obtained by persons of ordinary skill in this field without creative work fall within the protection scope of the present invention.
Referring to Figure 1, a cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution mainly comprises the following steps:
1. Obtain the RGB-D datasets for training and testing the task, define the algorithm objective of the invention, and determine the training and test sets used for training and testing the algorithm. The NYU-Depth-V2 (NYUDv2-13 and -40) dataset is used as the training set, and the SUN RGB-D dataset is used as the test set.
2. A cross-pixel convolutional network is used to extract RGB image features, a shape-aware convolutional network is used to extract Depth image features, and on this basis a dual encoder-decoder semantic segmentation model network is constructed, including an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features:
2.1. The three-channel RGB image is input to the RGB encoder to generate five levels of RGB image features.
2.2. The three-channel Depth image is input to the Depth encoder to generate five levels of Depth image features.
3. Cross-modal feature fusion is performed on the RGB image features and the corresponding Depth image features extracted in step 2, and this fusion is used to construct a cross-modal feature fusion network for generating multi-modal features.
3.1) The cross-modal feature fusion module consists of five levels of FCF modules that integrate the five levels of RGB image features and the corresponding Depth image features, and it updates five levels of features.
3.2) The input of the i-th FCF module is the i-th level RGB feature and the i-th level Depth feature, and five levels of features are updated through an interactive attention mechanism.
3.3) The FCF module generates multi-modal features through feature cross fusion. The specific process is as follows:
3.3.1) First, a cross-pixel convolution module is constructed to capture RGB and pixel-difference features and further enhance the RGB image features. At the same time, a shape-aware convolution is constructed for the depth map to obtain more accurate local shape and edge information and further enhance the Depth image features.
3.3.2) The RGB image features and the corresponding Depth image features are then fused with an element-wise matrix addition operation, where pixel convolution judges whether a pixel is usable and the element-wise addition determines the final value. The softmax activation function then converts the fused features into the RGB feature update weight Wr and the depth feature update weight Wd:
Here conv denotes the convolution module, an element-wise matrix multiplication and the element-wise matrix addition (add) combine the pixel convolution value and the RGB convolution value, GAP denotes global average pooling, and softmax denotes the softmax activation function.
3.3.3) After obtaining the RGB feature update weight Wr and the depth feature update weight Wd, Wr and Wd are combined with the enhanced RGB image features and the corresponding enhanced Depth image features, respectively, to obtain new RGB features and depth features.
3.3.4) Through the above operations, five levels of features are updated, and the updated features of each level are fed into the next pixel convolution module and shape-aware module; these multi-level operations enhance the receptive-field information and high-level semantic information of the features.
4) Through the cross-fusion method, the cross-modal features, namely the RGB image features and the corresponding Depth image features, are fused to obtain the fused feature.
Here i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, conv5 denotes a convolution with a 5×5 kernel, and cat denotes the feature concatenation operation.
4.1) The updated features are passed through the effective feature layers, and features are extracted with the pixel convolution structure:
Di = Conv(R, Ki)    (formula 3)
Pi = Conv(P, Ki)    (formula 4)
Ri = Conv(Di + Pi, K1)    (formula 5)
Here i ∈ {1, 2, 3, 4, 5} denotes the level of the feature, Conv() denotes the convolution operation, Ki is the level-specific convolution kernel, Di is the RGB feature extraction result, Pi is the extracted pixel information, and K1 is a 1×1 convolution kernel. Ri is the final RGB image feature.
4.2) The RGB image features generated in the above steps and the depth feature information from the modality-aware module are fed into the feature cross-fusion module to fuse multi-modal features with different receptive fields.
5) The updated fifth-level RGB image features and depth image features obtained in step 4 are input to the DeepLabV3+ decoder. The encoder output is upsampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are concatenated, a 3×3 convolution (for refinement) is applied to obtain the final fused feature, which is activated by the sigmoid function to obtain the predicted semantic map Pest.
6) The loss function is computed between the semantic map Pest predicted by the invention and the manually annotated semantic segmentation map PGT, and the parameter weights of the proposed model are updated step by step through the back-propagation algorithm, finally determining the structure and parameter weights of the RGB-D semantic segmentation algorithm.
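A minimal sketch of one training step of step 6. The text does not name the exact loss, so binary cross-entropy over the sigmoid output is only one plausible choice, and the model(rgb, depth) signature is an assumption.

```python
import torch.nn as nn


def train_step(model, optimizer, rgb, depth, p_gt):
    """One optimization step: predict P_est, compare with P_GT, back-propagate."""
    model.train()
    optimizer.zero_grad()
    p_est = model(rgb, depth)            # predicted semantic map P_est
    loss = nn.BCELoss()(p_est, p_gt)     # loss against the annotated map P_GT
    loss.backward()                      # back-propagation
    optimizer.step()                     # update the parameter weights
    return loss.item()
```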
7) On the basis of the model structure and parameter weights determined in step 6, the RGB-D image pairs in the test set are tested to generate the saliency map Ptest, which is evaluated with the MAE, S-measure, F-measure and E-measure metrics.
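A minimal sketch of the MAE computation used in step 7; S-measure, F-measure and E-measure are standard structural and region metrics whose implementations are longer and are omitted here.

```python
import torch


@torch.no_grad()
def mean_absolute_error(p_test, p_gt):
    """MAE between a predicted map and its ground truth, both scaled to [0, 1]."""
    return torch.mean(torch.abs(p_test.float() - p_gt.float())).item()
```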
Claims (5)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310347813.7A CN116433904A (en) | 2023-03-31 | 2023-03-31 | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution |
Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310347813.7A CN116433904A (en) | 2023-03-31 | 2023-03-31 | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116433904A (en) | 2023-07-14 |
Family
ID=87084845
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310347813.7A CN116433904A (en) (Pending) | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution | 2023-03-31 | 2023-03-31 |
Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN116433904A (en) |
Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116935052A (*) | 2023-07-24 | 2023-10-24 | 北京中科睿途科技有限公司 | Semantic segmentation method and related equipment in intelligent cabin environment |
2023
- 2023-03-31: CN application CN202310347813.7A published as CN116433904A (status: active, Pending)
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116935052A (*) | 2023-07-24 | 2023-10-24 | 北京中科睿途科技有限公司 | Semantic segmentation method and related equipment in intelligent cabin environment |
| CN116935052B (*) | 2023-07-24 | 2024-03-01 | 北京中科睿途科技有限公司 | Semantic segmentation method and related equipment in intelligent cabin environment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109522966B (en) | A target detection method based on densely connected convolutional neural network | |
JP6395158B2 (en) | How to semantically label acquired images of a scene | |
CN113609896B (en) | Object-level Remote Sensing Change Detection Method and System Based on Dual Correlation Attention | |
CN111931787A (en) | RGBD significance detection method based on feature polymerization | |
CN116503602A (en) | Unstructured environment three-dimensional point cloud semantic segmentation method based on multi-level edge enhancement | |
CN113743417B (en) | Semantic segmentation method and semantic segmentation device | |
CN109086777B (en) | Saliency map refining method based on global pixel characteristics | |
CN110781894B (en) | Point cloud semantic segmentation method, device and electronic device | |
JP2021119506A (en) | License-number plate recognition method, license-number plate recognition model training method and device | |
CN112347932B (en) | A 3D model recognition method based on point cloud-multi-view fusion | |
CN110827295A (en) | 3D Semantic Segmentation Method Based on Coupling of Voxel Model and Color Information | |
CN114283315B (en) | RGB-D significance target detection method based on interactive guiding attention and trapezoidal pyramid fusion | |
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features | |
CN114693951A (en) | An RGB-D Saliency Object Detection Method Based on Global Context Information Exploration | |
CN116704506A (en) | A Cross-Context Attention-Based Approach to Referential Image Segmentation | |
CN117351360A (en) | Remote sensing image road extraction method based on attention mechanism improvement | |
Rong et al. | 3D semantic segmentation of aerial photogrammetry models based on orthographic projection | |
CN114693953B (en) | A RGB-D salient object detection method based on cross-modal bidirectional complementary network | |
CN113780241B (en) | Acceleration method and device for detecting remarkable object | |
CN115965783A (en) | Unstructured road segmentation method based on point cloud and image feature fusion | |
CN116433904A (en) | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution | |
Shi et al. | Context‐guided ground truth sampling for multi‐modality data augmentation in autonomous driving | |
CN117745948A (en) | Space target image three-dimensional reconstruction method based on improved TransMVSnet deep learning algorithm | |
CN119068080A (en) | Method, electronic device and computer program product for generating an image | |
CN116403068A (en) | Lightweight monocular depth prediction method based on multi-scale attention fusion |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |