CN116433904A - Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution - Google Patents

Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution

Info

Publication number
CN116433904A
CN116433904A (application CN202310347813.7A)
Authority
CN
China
Prior art keywords
rgb
features
cross
feature
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310347813.7A
Other languages
Chinese (zh)
Inventor
葛斌
陆一鸣
夏晨星
朱序
卢洋
郭婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202310347813.7A priority Critical patent/CN116433904A/en
Publication of CN116433904A publication Critical patent/CN116433904A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/048: Activation functions
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/40: Extraction of image or video features
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and provides a cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution, comprising the following steps: 1) acquire an RGB-D dataset for training and testing the task and define the algorithm objective of the invention; 2) construct an RGB-D semantic segmentation network model based on shape perception and pixel convolution, using deep learning and a dual encoder-decoder structure; 3) construct a cross-modal feature fusion network for generating multi-modal features; 4) fuse the cross-modal features by a cross-fusion method to enhance the high-level semantic information of the multi-modal features; 5) in the DeepLabV3+ decoder, upsample the encoder output so that its resolution matches the low-level features, concatenate the feature layers, apply one 3×3 convolution, and activate with a sigmoid function to obtain the predicted semantic map P_est; 6) compute the loss between the predicted map P_est and the manually annotated semantic segmentation map P_GT; 7) test on the test dataset to generate the prediction map P_test and evaluate performance with the evaluation metrics.

Description

A Cross-Modal RGB-D Semantic Segmentation Method Based on Shape Perception and Pixel Convolution

Technical field:

The invention relates to the fields of computer vision and image processing, and in particular to a cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution.

Background:

Semantic segmentation takes raw image data as input and converts it into a mask with highlighted regions of interest, in which each pixel is assigned a class ID according to the object it belongs to. By grouping the image regions that belong to the same object, semantic segmentation addresses this problem and broadens its range of applications. Compared with other image-based tasks, semantic segmentation is a distinct and more advanced problem. In short, in computer vision, semantic segmentation is a fully convolutional pixel classification task.

Single-modality RGB semantic segmentation struggles with challenging factors such as complex scenes: it is difficult to delineate object contours precisely, and therefore difficult to localize and classify all targets accurately and completely against the background. To address this problem, depth (Depth) images are introduced into semantic segmentation, and RGB images and Depth images are combined to form RGB-D semantic segmentation.

A Depth map mainly provides information such as object edges. When the Depth map is introduced into the semantic segmentation task, the RGB image supplies global information, while the depth map supplies more complete contour information and expresses geometric structure and distance. Combining RGB images with depth maps for semantic segmentation is therefore a reasonable choice.

Most previous RGB-D semantic segmentation methods either treat the Depth map as a data stream independent of the RGB image and extract features from it separately, or treat the Depth image as a fourth channel of the RGB image. Such methods handle RGB and Depth images indiscriminately and ignore the fact that RGB and depth information are essentially different, so the convolution operations widely used on RGB images are not suitable for processing depth information.

Considering the cross-modal ambiguity between RGB image data and Depth image data, the invention explores a cross-modal feature fusion method based on shape perception and pixel convolution. By further exploiting the local shapes of depth features and their relations in cross-modal feature fusion, the invention helps the semantic segmentation model classify pixels more accurately.

Summary of the invention:

To address the problems raised above, the invention provides a cross-modal RGB-D semantic segmentation method based on shape perception. The technical scheme adopted is as follows:

1. Obtain the RGB-D datasets for training and testing the task.

1.1) The NYU-Depth-V2 (NYUDv2-13 and -40) dataset is used as the training set, and the SUN RGB-D dataset is used as the test set.

1.2) In the RGB-D image datasets, each sample is annotated with a scene category, a 2D segmentation, a 3D room layout, 3D object boxes, and 3D object orientations.

2. Using deep learning, build an RGB-D semantic segmentation network model based on shape perception and pixel convolution with a dual encoder-decoder structure:

2.1) The encoder-decoder architecture is used as the basic architecture of the model to extract the RGB image features F_r^i and the corresponding Depth image features F_d^i, i = 1, ..., 5.

2.2) The invention pre-trains on the NYU-Depth-V2 dataset to build the network model with the dual encoder-decoder architecture.
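As an illustrative, non-limiting sketch of step 2, the following PyTorch code builds a dual encoder that produces five feature levels per modality. The ResNet-50 backbones and the way the five levels are taken from the backbone stages are assumptions made for this sketch; the patent does not name a specific backbone.

import torch
import torch.nn as nn
import torchvision

class DualEncoder(nn.Module):
    # Two parallel backbones: one for the RGB image, one for the Depth image.
    def __init__(self):
        super().__init__()
        self.rgb = torchvision.models.resnet50(weights=None)
        self.depth = torchvision.models.resnet50(weights=None)

    @staticmethod
    def _stages(backbone, x):
        # Five feature levels F^1..F^5, one per backbone stage.
        f1 = backbone.relu(backbone.bn1(backbone.conv1(x)))
        f2 = backbone.layer1(backbone.maxpool(f1))
        f3 = backbone.layer2(f2)
        f4 = backbone.layer3(f3)
        f5 = backbone.layer4(f4)
        return [f1, f2, f3, f4, f5]

    def forward(self, rgb, depth):
        # rgb, depth: (B, 3, H, W); the depth map is assumed to be replicated to 3 channels.
        return self._stages(self.rgb, rgb), self._stages(self.depth, depth)

enc = DualEncoder()
f_r, f_d = enc(torch.randn(1, 3, 480, 640), torch.randn(1, 3, 480, 640))  # two lists of five feature maps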

3. Based on the RGB image features F_r^i and the corresponding Depth image features F_d^i extracted in step 2, perform cross-modal feature fusion, and use this fusion to construct a cross-modal feature fusion network for generating multi-modal features.

3.1) The cross-modal feature fusion module consists of five levels of FCF modules that integrate the five levels of RGB image features F_r^i and the corresponding Depth image features F_d^i, and it updates five levels of features F'_r^i and F'_d^i.

3.2) The input of the i-th level FCF module consists of F_r^i and F_d^i, and the five levels of updated features F'_r^i and F'_d^i are produced through an interactive attention mechanism.

3.3) The FCF module generates multi-modal features through feature cross fusion. The specific process is as follows:

3.3.1) First, the invention constructs a cross-pixel convolution module to capture RGB features and pixel-difference features, further enhancing the RGB image features. At the same time, a shape-aware convolution is constructed for the depth map to obtain more accurate local shape and edge information, further enhancing the Depth image features.
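The patent does not give the exact form of the cross-pixel convolution or the shape-aware convolution, so the following is only a plausible, non-limiting sketch of step 3.3.1: the RGB branch convolves the difference between each feature and its local mean, and the Depth branch reweights its neighbourhood by depth similarity before convolving.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelDifferenceConv(nn.Module):
    # Assumed cross-pixel convolution: convolve (feature - local mean) and add the input back.
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, k, padding=k // 2)
        self.k = k

    def forward(self, x):
        local_mean = F.avg_pool2d(x, self.k, stride=1, padding=self.k // 2)
        return x + self.conv(x - local_mean)

class ShapeAwareConv(nn.Module):
    # Assumed shape-aware convolution: neighbourhood responses decay with depth difference,
    # so the convolution follows local shape edges in the depth map.
    def __init__(self, channels, k=3, sigma=0.5):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, k, padding=k // 2)
        self.k, self.sigma = k, sigma

    def forward(self, d_feat, depth):
        # depth: (B, 1, H, W) raw depth used to build a local similarity mask.
        local_mean = F.avg_pool2d(depth, self.k, stride=1, padding=self.k // 2)
        sim = torch.exp(-(depth - local_mean) ** 2 / (2 * self.sigma ** 2))
        return self.conv(d_feat * sim)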

3.3.2) The element-wise matrix addition operation is then used to fuse the RGB image features with the corresponding Depth image features, where pixel convolution is used to judge whether a pixel is usable and the element-wise addition determines the final value. The softmax activation function then converts the fused features into the RGB feature update weight W_r and the depth feature update weight W_d:

(W_r, W_d) = softmax(GAP(conv(add(F_rc, F_pc))))    Formulas (1) and (2)

where conv denotes the convolution module, ⊗ denotes the element-wise matrix multiplication operation, add denotes the element-wise matrix addition operation, GAP denotes global average pooling, and softmax denotes the softmax activation function; F_pc is the pixel convolution value and F_rc is the RGB convolution value.

3.3.3) After obtaining the RGB feature update weight W_r and the depth feature update weight W_d, W_r and W_d are combined with the enhanced RGB image features and the corresponding enhanced Depth image features, respectively, to obtain new RGB features and depth features.

3.3.4) Through the above operations, the five levels of updated features F'_r^i and F'_d^i are obtained, and the updated features of each level are fed into the next pixel convolution module and shape perception module, so that the multi-level operations enhance the receptive-field information and high-level semantic information of the features.

4) The cross-modal features are fused by the cross-fusion method: the RGB image features F'_r^i and the corresponding Depth image features F'_d^i are fused to obtain the fused features F_fuse^i:

F_fuse^i = conv5(cat(F'_r^i, F'_d^i))

where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, conv5 denotes a convolution with a 5×5 kernel, and cat denotes the feature concatenation operation.

4.1) The updated features F'_r^i are passed through the effective feature layers for pixel-convolution structural feature extraction:

P_i = Conv(P, K_i)    Formula (3)

D_i = Conv(R, K_i)    Formula (4)

R_i = Conv(D_i + P_i, K_1)    Formula (5)

where i ∈ {1, 2, 3, 4, 5} denotes the level at which the feature is located, Conv() denotes the convolution operation performed, K_i is the (level-specific) convolution kernel, D_i is the RGB feature extraction result, P_i is the extracted pixel information, K_1 is a 1×1 convolution kernel, and R_i is the finally produced RGB image feature F_r^i.
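To make Formulas (3) to (5) concrete, the sketch below applies them with 3×3 convolutions standing in for K_i and a 1×1 convolution for K_1; the actual per-level kernels are not specified in the text, so the sizes here are assumptions.

import torch
import torch.nn as nn

class PixelConvBlock(nn.Module):
    # One level of Formulas (3)-(5): P_i = Conv(P, K_i), D_i = Conv(R, K_i), R_i = Conv(D_i + P_i, K_1).
    def __init__(self, channels):
        super().__init__()
        self.k_i_p = nn.Conv2d(channels, channels, 3, padding=1)  # K_i applied to pixel information P
        self.k_i_r = nn.Conv2d(channels, channels, 3, padding=1)  # K_i applied to the RGB input R
        self.k_1 = nn.Conv2d(channels, channels, 1)               # K_1, a 1x1 kernel

    def forward(self, r, p):
        p_i = self.k_i_p(p)           # Formula (3)
        d_i = self.k_i_r(r)           # Formula (4)
        return self.k_1(d_i + p_i)    # Formula (5): the updated RGB feature R_i

block = PixelConvBlock(64)
r_i = block(torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80))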

4.2) The RGB image features generated in the above steps and the depth feature information from the modality perception module are fed into the feature cross-fusion module, which fuses multi-modal features with different receptive fields.

5) The updated fifth-level RGB image features and depth image features obtained in step 4 are fed into the DeepLabV3+ decoder, where the encoder output is upsampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are concatenated, one 3×3 convolution (for refinement) is applied to obtain the final fused features, which are then activated by the sigmoid function to obtain the predicted semantic map P_est.
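A minimal, non-limiting sketch of the decoding in step 5: the high-level output is upsampled 4×, concatenated with a low-level feature, refined with one 3×3 convolution, and passed through a sigmoid. The channel sizes and the single refinement convolution are assumptions of this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegDecoder(nn.Module):
    def __init__(self, high_ch, low_ch, num_classes):
        super().__init__()
        self.refine = nn.Conv2d(high_ch + low_ch, num_classes, 3, padding=1)

    def forward(self, high, low):
        high = F.interpolate(high, scale_factor=4, mode="bilinear", align_corners=False)  # upsample encoder output 4x
        x = torch.cat([high, low], dim=1)   # connect the feature layers
        x = self.refine(x)                  # one 3x3 refinement convolution
        return torch.sigmoid(x)             # predicted semantic map P_est

dec = SegDecoder(high_ch=256, low_ch=64, num_classes=40)
p_est = dec(torch.randn(1, 256, 30, 40), torch.randn(1, 64, 120, 160))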

6) The loss function is computed between the semantic map P_est predicted by the invention and the manually annotated semantic segmentation map P_GT, and the parameter weights of the proposed model are updated step by step through the backpropagation algorithm, finally determining the structure and parameter weights of the RGB-D semantic segmentation algorithm.
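Step 6 as one hedged training step: the prediction is compared with the annotated map and the weights are updated by backpropagation. The patent does not name the loss function; cross-entropy (applied to raw logits) and SGD are assumptions of this sketch, and model stands in for the network assembled above.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(rgb, depth, p_gt):
    optimizer.zero_grad()
    logits = model(rgb, depth)       # (B, num_classes, H, W)
    loss = criterion(logits, p_gt)   # p_gt: (B, H, W) class indices (the annotated map P_GT)
    loss.backward()                  # backpropagation
    optimizer.step()                 # step-by-step update of the parameter weights
    return loss.item()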

7) On the basis of the model structure and parameter weights determined in step 6, the RGB-D image pairs in the test set are tested to generate the prediction maps P_test, and performance is evaluated with the MAE, S-measure, F-measure, and E-measure metrics.
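Of the listed metrics, MAE has the simplest closed form; the sketch below computes it between a predicted map and the ground truth as an example of step 7. S-measure, F-measure, and E-measure follow their standard definitions and are omitted here.

import torch

def mae(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # Mean absolute error between prediction and annotation, both assumed scaled to [0, 1].
    return (pred - gt).abs().mean().item()

pred = torch.rand(1, 1, 480, 640)
gt = (torch.rand(1, 1, 480, 640) > 0.5).float()
print(f"MAE = {mae(pred, gt):.4f}")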

The invention realizes RGB-D semantic segmentation based on a deep convolutional neural network, shape perception, and pixel convolution. It extracts the rich spatial structure and edge information in the Depth image and cross-fuses it with the global information extracted from the RGB image, so it can meet the requirements of semantic segmentation in different scenes, especially in challenging ones (complex background, low contrast, transparent objects, etc.). Compared with previous semantic segmentation methods, the invention has the following benefits:

First, the depth map is introduced without being treated as an extra channel of the RGB image, and the two modalities are not assigned the same contribution during feature extraction and fusion. Deep learning and a dual encoder-decoder structure are used to model the relationship between RGB-D image pairs and ground-truth classes, and segmentation features are obtained through cross-modal feature extraction and fusion.

Second, through a cross-fusion scheme, the Depth image features are effectively modulated so that they supplement the edge information of the RGB image features without affecting the global information of the RGB image, and the depth distribution information is used to guide cross-modal feature fusion, suppressing the interference of background information in the RGB image and laying the foundation for the next stage of pixel segmentation.

Finally, the final semantic segmentation pixel map is predicted through the semantic decoder.

Description of drawings

Figure 1 is a schematic diagram of the model structure of the invention

Figure 2 is a schematic diagram of the cross-modal feature fusion module

Figure 3 is a schematic diagram of the cross-pixel convolution module

Figure 4 is a schematic diagram of the segmentation decoder

Figure 5 is a schematic diagram of model training and testing

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. The described examples are only some, not all, of the examples of the invention. Based on the examples of the invention, all other examples obtained by a person of ordinary skill in this field without creative work fall within the protection scope of the invention.

Referring to Figure 1, a cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution mainly comprises the following steps:

1. Obtain the RGB-D datasets for training and testing the task, define the algorithm objective of the invention, and determine the training and test sets used for training and testing the algorithm. The NYU-Depth-V2 (NYUDv2-13 and -40) dataset is used as the training set, and the SUN RGB-D dataset is used as the test set.

2. Use the cross-pixel convolution network to extract RGB image features and the shape-aware convolution network to extract Depth image features, and on this basis build the dual encoder-decoder semantic segmentation model network, including an RGB encoder for extracting RGB image features and a Depth encoder for extracting Depth image features:

2.1. The three-channel RGB image is input into the RGB encoder to generate the five levels of RGB image features F_r^i, i = 1, ..., 5.

2.2. The three-channel Depth image is input into the Depth encoder to generate the five levels of Depth image features F_d^i, i = 1, ..., 5.
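Steps 2.1 and 2.2 assume a three-channel Depth input. Raw depth is single-channel, so the sketch below simply normalises it and repeats it across three channels; an HHA encoding would be an alternative. This preprocessing is an assumption of the sketch, not something stated in the patent.

import torch

def prepare_inputs(rgb: torch.Tensor, depth: torch.Tensor):
    # rgb: (B, 3, H, W) in [0, 255]; depth: (B, 1, H, W) raw depth values.
    rgb = rgb.float() / 255.0
    d = depth.float()
    d_min = d.amin(dim=(2, 3), keepdim=True)
    d_max = d.amax(dim=(2, 3), keepdim=True)
    d = (d - d_min) / (d_max - d_min + 1e-6)   # per-image normalisation to [0, 1]
    return rgb, d.repeat(1, 3, 1, 1)           # replicate depth to three channels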

3. Based on the RGB image features F_r^i and the corresponding Depth image features F_d^i extracted in step 2, perform cross-modal feature fusion, and use this fusion to construct a cross-modal feature fusion network for generating multi-modal features.

3.1) The cross-modal feature fusion module consists of five levels of FCF modules that integrate the five levels of RGB image features F_r^i and the corresponding Depth image features F_d^i, and it updates five levels of features F'_r^i and F'_d^i.

3.2) The input of the i-th level FCF module consists of F_r^i and F_d^i, and the five levels of updated features F'_r^i and F'_d^i are produced through an interactive attention mechanism.

3.3) The FCF module generates multi-modal features through feature cross fusion. The specific process is as follows:

3.3.1) First, the invention constructs a cross-pixel convolution module to capture RGB features and pixel-difference features, further enhancing the RGB image features. At the same time, a shape-aware convolution is constructed for the depth map to obtain more accurate local shape and edge information, further enhancing the Depth image features.

3.3.2) The element-wise matrix addition operation is then used to fuse the RGB image features with the corresponding Depth image features, where pixel convolution is used to judge whether a pixel is usable and the element-wise addition determines the final value. The softmax activation function then converts the fused features into the RGB feature update weight W_r and the depth feature update weight W_d:

(W_r, W_d) = softmax(GAP(conv(add(F_rc, F_pc))))    Formulas (1) and (2)

where conv denotes the convolution module, ⊗ denotes the element-wise matrix multiplication operation, add denotes the element-wise matrix addition operation, GAP denotes global average pooling, and softmax denotes the softmax activation function; F_pc is the pixel convolution value and F_rc is the RGB convolution value.

3.3.3) After obtaining the RGB feature update weight W_r and the depth feature update weight W_d, W_r and W_d are combined with the enhanced RGB image features and the corresponding enhanced Depth image features, respectively, to obtain new RGB features and depth features.

3.3.4) Through the above operations, the five levels of updated features F'_r^i and F'_d^i are obtained, and the updated features of each level are fed into the next pixel convolution module and shape perception module, so that the multi-level operations enhance the receptive-field information and high-level semantic information of the features.
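Pulling steps 3.3.2 to 3.3.4 together, the following non-limiting sketch shows how one FCF level might apply the modality weights to the enhanced features and emit the updated features that feed the next level. It reuses the PixelDifferenceConv, ShapeAwareConv, and ModalityWeights sketches given earlier, and the residual form of the update is an assumption.

import torch
import torch.nn as nn

class FCFLevel(nn.Module):
    # One FCF level: enhance each modality, derive W_r / W_d, and emit updated features.
    def __init__(self, channels):
        super().__init__()
        self.rgb_enhance = PixelDifferenceConv(channels)
        self.depth_enhance = ShapeAwareConv(channels)
        self.weights = ModalityWeights(channels)

    def forward(self, f_r, f_d, raw_depth):
        f_rc = self.rgb_enhance(f_r)               # enhanced RGB features (cross-pixel convolution)
        f_pc = self.depth_enhance(f_d, raw_depth)  # enhanced Depth features (shape-aware convolution)
        w_r, w_d = self.weights(f_rc, f_pc)        # Formulas (1) and (2)
        f_r_new = f_r + w_r * f_rc                 # assumed residual update of the RGB features
        f_d_new = f_d + w_d * f_pc                 # assumed residual update of the Depth features
        return f_r_new, f_d_new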

4) The cross-modal features are fused by the cross-fusion method: the RGB image features F'_r^i and the corresponding Depth image features F'_d^i are fused to obtain the fused features F_fuse^i:

F_fuse^i = conv5(cat(F'_r^i, F'_d^i))

where i ∈ {1, 2, 3, 4, 5} denotes the level of the model at which the feature is located, conv5 denotes a convolution with a 5×5 kernel, and cat denotes the feature concatenation operation.
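The cross fusion above concatenates the updated RGB and Depth features of a level and applies one 5×5 convolution; the sketch below writes that formula out directly, with the channel count as an assumption.

import torch
import torch.nn as nn

class CrossFusion(nn.Module):
    # F_fuse^i = conv5(cat(F'_r^i, F'_d^i)) for one level i.
    def __init__(self, channels):
        super().__init__()
        self.conv5 = nn.Conv2d(2 * channels, channels, kernel_size=5, padding=2)

    def forward(self, f_r, f_d):
        return self.conv5(torch.cat([f_r, f_d], dim=1))

fuse = CrossFusion(64)
f_fuse = fuse(torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80))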

4.1) The updated features F'_r^i are passed through the effective feature layers for pixel-convolution structural feature extraction:

D_i = Conv(R, K_i)    Formula (3)

P_i = Conv(P, K_i)    Formula (4)

R_i = Conv(D_i + P_i, K_1)    Formula (5)

where i ∈ {1, 2, 3, 4, 5} denotes the level at which the feature is located, Conv() denotes the convolution operation performed, K_i is the (level-specific) convolution kernel, D_i is the RGB feature extraction result, P_i is the extracted pixel information, K_1 is a 1×1 convolution kernel, and R_i is the finally produced RGB image feature F_r^i.

4.2) The RGB image features generated in the above steps and the depth feature information from the modality perception module are fed into the feature cross-fusion module, which fuses multi-modal features with different receptive fields.

5) The updated fifth-level RGB image features and depth image features obtained in step 4 are fed into the DeepLabV3+ decoder, where the encoder output is upsampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are concatenated, one 3×3 convolution (for refinement) is applied to obtain the final fused features, which are then activated by the sigmoid function to obtain the predicted semantic map P_est.

6) The loss function is computed between the semantic map P_est predicted by the invention and the manually annotated semantic segmentation map P_GT, and the parameter weights of the proposed model are updated step by step through the backpropagation algorithm, finally determining the structure and parameter weights of the RGB-D semantic segmentation algorithm.

7) On the basis of the model structure and parameter weights determined in step 6, the RGB-D image pairs in the test set are tested to generate the prediction maps P_test, and performance is evaluated with the MAE, S-measure, F-measure, and E-measure metrics.

Claims (5)

1. A cross-modal RGB-D semantic segmentation method based on shape perception, characterized in that the method comprises the following steps:
1) obtaining the RGB-D datasets for training and testing the task, and defining the algorithm objective of the invention;
2) using deep learning to build an RGB-D semantic segmentation network model based on shape perception and pixel convolution with a dual encoder-decoder structure;
3) constructing a cross-modal feature fusion network for generating multi-modal features;
4) fusing the cross-modal features through a cross-fusion method to enhance the high-level semantic information of the multi-modal features;
5) in the DeepLabV3+ decoder, upsampling the encoder output so that its resolution matches the low-level features, concatenating the feature layers, applying one 3×3 convolution, and activating with the sigmoid function to obtain the predicted semantic map P_est;
6) computing the loss between the predicted map P_est and the manually annotated semantic segmentation map P_GT;
7) testing on the test dataset to generate the prediction map P_test and evaluating performance with the evaluation metrics.

2. The cross-modal RGB-D semantic segmentation method based on shape perception according to claim 1, characterized in that the specific method of step 2) is:
2.1) the NYU-Depth-V2 (NYUDv2-13 and -40) dataset is used as the training set, and the SUN RGB-D dataset is used as the test set;
2.2) in the RGB-D image datasets, each sample is annotated with a scene category, a 2D segmentation, a 3D room layout, 3D object boxes, and 3D object orientations.

3. The cross-modal RGB-D semantic segmentation method based on shape perception according to claim 1, characterized in that the specific method of step 3) is:
3.1) the encoder-decoder architecture is used as the basic architecture of the model to extract the RGB image features F_r^i and the corresponding Depth image features F_d^i;
3.2) the network model with the dual encoder-decoder architecture is built by pre-training on the NYU-Depth-V2 dataset.

4. The cross-modal RGB-D semantic segmentation method based on shape perception according to claim 1, characterized in that the specific method of step 4) is:
4.1) the cross-modal feature fusion module consists of five levels of FCF modules that integrate the five levels of RGB image features F_r^i and the corresponding Depth image features F_d^i, and updates the five levels of features F'_r^i and F'_d^i;
4.2) the input of the i-th level FCF module consists of F_r^i and F_d^i, and the five levels of updated features F'_r^i and F'_d^i are produced through an interactive attention mechanism.

5. The cross-modal RGB-D semantic segmentation method based on shape perception according to claim 1, characterized in that the specific method of step 5) is:
5.1) the updated features F'_r^i are passed through the effective feature layers for pixel-convolution structural feature extraction:
P_i = Conv(P, K_i)    Formula (1)
D_i = Conv(R, K_i)    Formula (2)
R_i = Conv(D_i + P_i, K_1)    Formula (3)
where i ∈ {1, 2, 3, 4, 5} denotes the level at which the feature is located, Conv() denotes the convolution operation performed, K_i is the (level-specific) convolution kernel, D_i is the RGB feature extraction result, P_i is the extracted pixel information, K_1 is a 1×1 convolution kernel, and R_i is the finally produced RGB image feature F_r^i;
5.2) the RGB image features generated in the above steps and the depth feature information from the modality perception module are fed into the feature cross-fusion module, which fuses multi-modal features with different receptive fields;
6) the fifth-level RGB image features and depth image features updated in step 4 are fed into the DeepLabV3+ decoder, where the encoder output is upsampled by a factor of 4 so that its resolution matches the low-level features; after the feature layers are concatenated, one 3×3 convolution (for refinement) is applied to obtain the final fused features, which are activated by the sigmoid function to obtain the predicted semantic map P_est;
7) the loss function is computed between the semantic map P_est predicted by the invention and the manually annotated semantic segmentation map P_GT, and the parameter weights of the proposed model are updated step by step through the backpropagation algorithm, finally determining the structure and parameter weights of the RGB-D semantic segmentation algorithm.
CN202310347813.7A 2023-03-31 2023-03-31 Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution Pending CN116433904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310347813.7A CN116433904A (en) 2023-03-31 2023-03-31 Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310347813.7A CN116433904A (en) 2023-03-31 2023-03-31 Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution

Publications (1)

Publication Number Publication Date
CN116433904A true CN116433904A (en) 2023-07-14

Family

ID=87084845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310347813.7A Pending CN116433904A (en) 2023-03-31 2023-03-31 Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution

Country Status (1)

Country Link
CN (1) CN116433904A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935052A (en) * 2023-07-24 2023-10-24 北京中科睿途科技有限公司 Semantic segmentation method and related equipment in intelligent cabin environment
CN116935052B (en) * 2023-07-24 2024-03-01 北京中科睿途科技有限公司 Semantic segmentation method and related equipment in intelligent cabin environment

Similar Documents

Publication Publication Date Title
CN109522966B (en) A target detection method based on densely connected convolutional neural network
JP6395158B2 (en) How to semantically label acquired images of a scene
CN113609896B (en) Object-level Remote Sensing Change Detection Method and System Based on Dual Correlation Attention
CN111931787A (en) RGBD significance detection method based on feature polymerization
CN116503602A (en) Unstructured environment three-dimensional point cloud semantic segmentation method based on multi-level edge enhancement
CN113743417B (en) Semantic segmentation method and semantic segmentation device
CN109086777B (en) Saliency map refining method based on global pixel characteristics
CN110781894B (en) Point cloud semantic segmentation method, device and electronic device
JP2021119506A (en) License-number plate recognition method, license-number plate recognition model training method and device
CN112347932B (en) A 3D model recognition method based on point cloud-multi-view fusion
CN110827295A (en) 3D Semantic Segmentation Method Based on Coupling of Voxel Model and Color Information
CN114283315B (en) RGB-D significance target detection method based on interactive guiding attention and trapezoidal pyramid fusion
CN116485860A (en) Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features
CN114693951A (en) An RGB-D Saliency Object Detection Method Based on Global Context Information Exploration
CN116704506A (en) A Cross-Context Attention-Based Approach to Referential Image Segmentation
CN117351360A (en) Remote sensing image road extraction method based on attention mechanism improvement
Rong et al. 3D semantic segmentation of aerial photogrammetry models based on orthographic projection
CN114693953B (en) A RGB-D salient object detection method based on cross-modal bidirectional complementary network
CN113780241B (en) Acceleration method and device for detecting remarkable object
CN115965783A (en) Unstructured road segmentation method based on point cloud and image feature fusion
CN116433904A (en) Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
Shi et al. Context‐guided ground truth sampling for multi‐modality data augmentation in autonomous driving
CN117745948A (en) Space target image three-dimensional reconstruction method based on improved TransMVSnet deep learning algorithm
CN119068080A (en) Method, electronic device and computer program product for generating an image
CN116403068A (en) Lightweight monocular depth prediction method based on multi-scale attention fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination