CN116433904A - Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution - Google Patents

Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution

Info

Publication number
CN116433904A
CN116433904A
Authority
CN
China
Prior art keywords
rgb
modal
cross
feature
semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310347813.7A
Other languages
Chinese (zh)
Inventor
葛斌
陆一鸣
夏晨星
朱序
卢洋
郭婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology filed Critical Anhui University of Science and Technology
Priority to CN202310347813.7A priority Critical patent/CN116433904A/en
Publication of CN116433904A publication Critical patent/CN116433904A/en
Pending legal-status Critical Current

Classifications

    • G  PHYSICS
      • G06  COMPUTING; CALCULATING OR COUNTING
        • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00  Computing arrangements based on biological models
            • G06N 3/02  Neural networks
              • G06N 3/04  Architecture, e.g. interconnection topology
                • G06N 3/045  Combinations of networks
                • G06N 3/0455  Auto-encoder networks; Encoder-decoder networks
                • G06N 3/048  Activation functions
              • G06N 3/08  Learning methods
                • G06N 3/084  Backpropagation, e.g. using gradient descent
        • G06V  IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00  Arrangements for image or video recognition or understanding
            • G06V 10/20  Image preprocessing
              • G06V 10/26  Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
            • G06V 10/40  Extraction of image or video features
            • G06V 10/70  Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/77  Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/774  Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
                • G06V 10/80  Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V 10/806  Fusion of extracted features
              • G06V 10/82  Arrangements using neural networks
          • G06V 20/00  Scenes; Scene-specific elements
            • G06V 20/70  Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and provides a cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution, which comprises the following steps: 1) acquiring an RGB-D data set for training and testing the task, and defining the algorithm target of the invention; 2) constructing an RGB-D semantic segmentation network model based on shape perception and pixel convolution, using a deep learning technique and a dual encoder-decoder structure; 3) constructing a cross-modal feature fusion network for generating multi-modal features; 4) fusing the cross-modal features by a cross-fusion method, thereby enhancing the high-level semantic information of the multi-modal features; 5) in the DeepLabV3+ decoder, upsampling the output of the encoder so that its resolution matches the low-level features, concatenating the feature layers, applying one 3×3 convolution and activating with a sigmoid function to obtain the predicted semantic map P_est; 6) calculating a loss between the predicted semantic map P_est and the manually annotated semantic segmentation map P_GT; 7) testing on the test data set to generate the segmentation map P_test and evaluating performance with the evaluation indexes.

Description

Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
Technical field:
the invention relates to the field of computer vision and image processing, in particular to a cross-mode RGB-D semantic segmentation method based on shape perception and pixel convolution.
Background art:
Semantic segmentation takes raw data (for example, an image) as input and converts it into a mask with highlighted regions of interest, in which each pixel of the image is assigned a class ID according to the object it belongs to. By grouping the image regions that belong to the same object, semantic segmentation has greatly expanded its range of applications. Compared with other image-based tasks it is a distinct and more demanding problem: briefly, in the field of computer vision, semantic segmentation is a pixel-level classification task typically based on fully convolutional networks.
Single-modal RGB semantic segmentation faces challenging factors such as complex scenes, in which the outline of a target is hard to delineate, making accurate segmentation difficult; it is also difficult to accurately and completely locate and classify all objects against the background. To address this problem, the Depth image is introduced into semantic segmentation, and segmentation is performed on the combined RGB image and Depth image, i.e. on RGB-D data.
The Depth map mainly provides information such as target edges. When the Depth map is introduced into the semantic segmentation task, the RGB image provides global appearance information while the Depth map provides more complete contour information and expresses geometric structure and distance. Combining RGB images with Depth maps for semantic segmentation is therefore a reasonable choice.
Existing RGB-D semantic segmentation methods mainly treat the Depth map as a data stream independent of the RGB image and extract features separately, or use the Depth image as a fourth channel of the RGB image and process the two modalities indiscriminately. They do not consider that RGB and Depth information are fundamentally different, so the convolution operations widely used on RGB images are not well suited to processing Depth images.
Considering the ambiguity of cross-modal data between RGB image data and Depth image data, the invention explores a cross-modal feature fusion method based on shape perception and pixel convolution. By coupling the local shape of the depth features with further feature mining during cross-modal feature fusion, the invention helps the semantic segmentation model classify pixels more accurately.
Summary of the invention:
Aiming at the above problems, the invention provides a cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution, which adopts the following technical scheme:
1. An RGB-D data set for training and testing the task is acquired.
1.1) The NYU-Depth-V2 (NYUDv2-13 and NYUDv2-40) dataset is used as the training set and the SUN RGB-D dataset is used as the test set.
1.2) In the RGB-D image dataset, each sample is labeled with a scene category (scene), a two-dimensional segmentation (2D segmentation), a three-dimensional room layout (3D room layout), three-dimensional object boxes (3D object box) and three-dimensional object orientations (3D object orientation).
2. Using a deep learning technique, an RGB-D semantic segmentation network model based on shape perception and pixel convolution is built with a dual encoder-decoder structure:
2.1) The encoder-decoder architecture is used as the basic architecture of the model of the invention; two encoders extract, from each corresponding RGB-D image pair, the RGB image features F_r^i and the Depth image features F_d^i (i = 1, ..., 5), respectively.
2.2) The invention builds the network model of the dual encoder-decoder architecture and pre-trains it using the NYU-Depth-V2 dataset.
3. Based on the RGB image features F_r^i extracted in step 2 and the corresponding Depth image features F_d^i, cross-modal feature fusion is performed, and a cross-modal feature fusion network is constructed to generate the multi-modal features.
3.1) The cross-modal feature fusion module is composed of 5 levels of FCF modules, which take the 5 levels of RGB image features F_r^i and the corresponding Depth image features F_d^i as input and update them into the 5-level features F_r^i' and F_d^i'.
3.2) The input data of the i-th level FCF module are F_r^i and F_d^i, and the 5-level features are updated through an interactive attention mechanism into F_r^i' and F_d^i'.
3.3) The specific process by which the FCF module generates the multi-modal features through feature cross fusion is as follows:
3.3.1) First, the invention constructs a cross pixel convolution module to capture the pixel-difference characteristics of the RGB data and further enhance the RGB image features. At the same time, a shape-aware convolution is constructed for the Depth map to obtain accurate local shape and edge information, further enhancing the Depth image features.
3.3.2) The RGB image features and the corresponding Depth image features are further fused using an element-wise matrix addition operation: the pixel convolution decides whether a pixel contributes, and the element-wise addition determines the final value. The fused features are then converted into the RGB feature update weight W_r and the Depth feature update weight W_d by a softmax activation function, as given by formulas (1) and (2), where conv denotes a convolution module, ⊗ denotes element-wise matrix multiplication, add denotes element-wise matrix addition, GAP denotes global average pooling, softmax denotes the softmax activation function, F_p^i is the pixel convolution value and F_c^i is the RGB convolution value.
3.3.3) After obtaining the RGB feature update weight W_r and the Depth feature update weight W_d, W_r and W_d are combined with the enhanced RGB image features and the corresponding Depth image features, respectively, to obtain new RGB features and new Depth features.
3.3.4) Through the above operations, the 5-level features are updated into F_r^i' and F_d^i'. The updated features of each level are then fed into the pixel convolution module and the shape perception module of the next level, and the multi-level operation enhances the receptive-field information and the high-level semantic information of the features.
4) The cross-modal features are fused by a cross-fusion method: the RGB image features F_r^i and the corresponding Depth image features F_d^i are combined to obtain the fused features F_f^i = conv5(cat(F_r^i, F_d^i)), where i ∈ {1,2,3,4,5} denotes the level of the model at which the feature is located, conv5 denotes a convolution operation with a kernel size of 5×5, and cat denotes the feature concatenation operation.
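As a rough illustration of this cross-fusion, one level can be sketched in PyTorch as follows; the channel count, tensor sizes and the class name CrossFusion are assumptions for illustration, since the text only specifies a 5×5 convolution over the concatenated features:

    import torch
    import torch.nn as nn

    class CrossFusion(nn.Module):
        # Sketch of one fusion level: F_f = conv5(cat(F_r, F_d)).
        def __init__(self, channels):
            super().__init__()
            self.conv5 = nn.Conv2d(2 * channels, channels, kernel_size=5, padding=2)

        def forward(self, f_r, f_d):
            return self.conv5(torch.cat([f_r, f_d], dim=1))

    # Illustrative call with assumed sizes: 256 channels at 30x40 resolution.
    fuse = CrossFusion(256)
    f_fused = fuse(torch.randn(1, 256, 30, 40), torch.randn(1, 256, 30, 40))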
4.1) The updated features F_r^i' are further extracted through the effective feature layer using the pixel convolution structure:
P_i = Conv(P, K_i)   Formula (3)
D_i = Conv(R, K_i)   Formula (4)
R_i = Conv(D_i + P_i, K_1)   Formula (5)
where i ∈ {1,2,3,4,5} denotes the level at which the feature is located, Conv() denotes the convolution operation, K_i are the convolution kernels of the different levels, D_i is the result of RGB feature extraction, P_i is the extracted pixel information, K_1 is a 1×1 convolution kernel, and R_i is the finally generated RGB image feature F_r^i.
4.2) The RGB image features generated above and the depth feature information from the shape perception module are input into the feature cross-fusion module, which fuses the multi-modal features of different receptive fields.
5) The updated 5th-level RGB image features and Depth image features obtained in step 4 are input into the DeepLabV3+ decoder, and the output of the encoder is upsampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are concatenated, one 3×3 convolution (refinement) is applied to obtain the final fused feature, which is activated by a sigmoid function to obtain the predicted semantic map P_est.
6) A loss function is calculated between the semantic map P_est predicted by the invention and the manually annotated semantic segmentation map P_GT; the parameter weights of the proposed model are gradually updated by a back-propagation algorithm, and the structure and parameter weights of the RGB-D semantic segmentation algorithm are finally determined.
7) With the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate the segmentation map P_test, and evaluation is performed using the MAE, S-measure, F-measure and E-measure indexes.
The RGB-D semantic segmentation of the invention, realized with a deep convolutional neural network, shape perception and pixel convolution, extracts the rich spatial-structure and edge information in the Depth image and fuses it cross-modally with the global information extracted from the RGB image, so it can adapt to the requirements of semantic segmentation in different scenes, especially in challenging ones (complex background, low contrast, transparent objects, etc.). Compared with existing semantic segmentation methods, the invention has the following advantages:
First, the depth map is introduced, but not as an additional channel of the RGB map, and the two modalities are not treated as contributing equally to feature extraction and fusion. Using deep learning, the relationship between the RGB-D image pair and the true classes is built through a dual encoder-decoder structure, and the segmentation features are obtained through the extraction and fusion of cross-modal features.
Second, through cross fusion, the Depth image features effectively modulate the complementary edge information of the RGB image features without affecting the global information of the RGB image; the depth distribution information of the Depth image guides the cross-modal feature fusion, eliminating the interference of background information in the RGB image and laying the foundation for pixel segmentation in the next stage.
Finally, the final semantic segmentation pixel map is predicted by the semantic decoder.
Drawings
FIG. 1 is a schematic view of a model structure according to the present invention
FIG. 2 is a schematic diagram of a cross-modal feature fusion module
FIG. 3 is a schematic diagram of a cross-pixel convolution module
FIG. 4 is a schematic diagram of a split decoder
FIG. 5 is a schematic diagram of model training and testing
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art without inventive effort based on these embodiments fall within the scope of the present invention.
Referring to fig. 1, a cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution mainly comprises the following steps:
1. The RGB-D data sets for training and testing the task are acquired, the algorithm target of the invention is defined, and the training set and the test set used for training and testing the algorithm are determined. The NYU-Depth-V2 (NYUDv2-13 and NYUDv2-40) dataset is used as the training set and the SUN RGB-D dataset is used as the test set.
2. RGB image features are extracted with a cross pixel convolution network and Depth image features are extracted with a shape-aware convolution network; on this basis, a dual encoder-decoder semantic segmentation network is constructed, comprising an RGB encoder for extracting the RGB image features and a Depth encoder for extracting the Depth image features:
2.1) A three-channel RGB image is input into the RGB encoder to generate the 5 levels of RGB image features F_r^i (i = 1, ..., 5).
2.2) The three-channel Depth image is input into the Depth encoder to generate the 5 levels of Depth image features F_d^i (i = 1, ..., 5).
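For illustration only, a dual-encoder skeleton returning five feature levels per modality could look like the sketch below; the patent does not name a backbone, so a torchvision ResNet-50 is assumed here purely as an example:

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    class DualEncoder(nn.Module):
        # Two independent encoders: one for the three-channel RGB image and one for
        # the three-channel Depth image, each returning five feature levels F^1..F^5.
        def __init__(self):
            super().__init__()
            self.rgb = resnet50(weights=None)    # backbone choice is an assumption
            self.depth = resnet50(weights=None)

        def _levels(self, net, x):
            f1 = net.relu(net.bn1(net.conv1(x)))   # level 1
            f2 = net.layer1(net.maxpool(f1))       # level 2
            f3 = net.layer2(f2)                    # level 3
            f4 = net.layer3(f3)                    # level 4
            f5 = net.layer4(f4)                    # level 5
            return [f1, f2, f3, f4, f5]

        def forward(self, rgb, depth):
            return self._levels(self.rgb, rgb), self._levels(self.depth, depth)

    enc = DualEncoder()
    rgb_feats, depth_feats = enc(torch.randn(1, 3, 480, 640), torch.randn(1, 3, 480, 640))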
3. Based on the RGB image features F_r^i extracted in step 2 and the corresponding Depth image features F_d^i, cross-modal feature fusion is performed, and a cross-modal feature fusion network is constructed to generate the multi-modal features.
3.1) The cross-modal feature fusion module is composed of 5 levels of FCF modules, which take the 5 levels of RGB image features F_r^i and the corresponding Depth image features F_d^i as input and update them into the 5-level features F_r^i' and F_d^i'.
3.2) The input data of the i-th level FCF module are F_r^i and F_d^i, and the 5-level features are updated through an interactive attention mechanism into F_r^i' and F_d^i'.
3.3) The specific process by which the FCF module generates the multi-modal features through feature cross fusion is as follows:
3.3.1) First, the invention constructs a cross pixel convolution module to capture the pixel-difference characteristics of the RGB data and further enhance the RGB image features. At the same time, a shape-aware convolution is constructed for the Depth map to obtain accurate local shape and edge information, further enhancing the Depth image features.
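The patent does not give the exact form of this shape-aware convolution; one plausible sketch, stated here only as an assumption, separates each local Depth patch into a base component (the local mean) and a shape component (the residual), so the filter responds to local geometry and edges rather than to absolute depth values:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ShapeAwareConv(nn.Module):
        # Illustrative sketch: convolve the mean-removed (shape) part and the
        # local-mean (base) part of the depth feature separately, then sum.
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.k = k
            self.shape_conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
            self.base_conv = nn.Conv2d(in_ch, out_ch, 1, bias=False)

        def forward(self, x):
            base = F.avg_pool2d(x, self.k, stride=1, padding=self.k // 2)  # local mean
            shape = x - base                                               # local shape / edge cue
            return self.shape_conv(shape) + self.base_conv(base)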
3.3.2) The RGB image features and the corresponding Depth image features are further fused using an element-wise matrix addition operation: the pixel convolution decides whether a pixel contributes, and the element-wise addition determines the final value. The fused features are then converted into the RGB feature update weight W_r and the Depth feature update weight W_d by a softmax activation function, as given by formulas (1) and (2), where conv denotes a convolution module, ⊗ denotes element-wise matrix multiplication, add denotes element-wise matrix addition, GAP denotes global average pooling, softmax denotes the softmax activation function, F_p^i is the pixel convolution value and F_c^i is the RGB convolution value.
3.3.3) After obtaining the RGB feature update weight W_r and the Depth feature update weight W_d, W_r and W_d are combined with the enhanced RGB image features and the corresponding Depth image features, respectively, to obtain new RGB features and new Depth features.
3.3.4) Through the above operations, the 5-level features are updated into F_r^i' and F_d^i'. The updated features of each level are then fed into the pixel convolution module and the shape perception module of the next level, and the multi-level operation enhances the receptive-field information and the high-level semantic information of the features.
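Because formulas (1) and (2) appear only as images in the published text, the following PyTorch sketch reconstructs the FCF update loosely from the textual description (element-wise fusion, convolution, global average pooling, softmax over the two modalities); the fusion order, layer sizes and residual-style update are assumptions:

    import torch
    import torch.nn as nn

    class FCF(nn.Module):
        # Feature Cross-Fusion at one level: derive update weights W_r, W_d from the
        # fused features via conv + GAP + softmax, then re-weight each modality.
        def __init__(self, channels):
            super().__init__()
            self.conv = nn.Conv2d(channels, 2, kernel_size=1)   # two modality scores
            self.gap = nn.AdaptiveAvgPool2d(1)

        def forward(self, f_r, f_d):
            fused = f_r * f_d + f_d                               # element-wise multiply and add (assumed order)
            w = torch.softmax(self.gap(self.conv(fused)), dim=1)  # W_r, W_d, summing to 1
            w_r, w_d = w[:, 0:1], w[:, 1:2]
            f_r_new = f_r + w_r * f_r                             # updated RGB feature
            f_d_new = f_d + w_d * f_d                             # updated Depth feature
            return f_r_new, f_d_new

    fcf = FCF(256)
    new_r, new_d = fcf(torch.randn(1, 256, 30, 40), torch.randn(1, 256, 30, 40))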
4) The cross-modal features are fused by a cross-fusion method: the RGB image features F_r^i and the corresponding Depth image features F_d^i are combined to obtain the fused features F_f^i = conv5(cat(F_r^i, F_d^i)), where i ∈ {1,2,3,4,5} denotes the level of the model at which the feature is located, conv5 denotes a convolution operation with a kernel size of 5×5, and cat denotes the feature concatenation operation.
4.1) The updated features F_r^i' are further extracted through the effective feature layer using the pixel convolution structure:
D_i = Conv(R, K_i)   Formula (3)
P_i = Conv(P, K_i)   Formula (4)
R_i = Conv(D_i + P_i, K_1)   Formula (5)
where i ∈ {1,2,3,4,5} denotes the level at which the feature is located, Conv() denotes the convolution operation, K_i are the convolution kernels of the different levels, D_i is the result of RGB feature extraction, P_i is the extracted pixel information, K_1 is a 1×1 convolution kernel, and R_i is the finally generated RGB image feature F_r^i.
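A direct rendering of formulas (3) to (5) could look like the following sketch; the per-level kernel size K_i and the origin of the pixel-information map P are assumptions not fixed by the text:

    import torch
    import torch.nn as nn

    class PixelConvBlock(nn.Module):
        # D_i = Conv(R, K_i); P_i = Conv(P, K_i); R_i = Conv(D_i + P_i, K_1)
        def __init__(self, channels, k_i=3):
            super().__init__()
            self.conv_r = nn.Conv2d(channels, channels, k_i, padding=k_i // 2)  # K_i applied to the RGB feature R
            self.conv_p = nn.Conv2d(channels, channels, k_i, padding=k_i // 2)  # K_i applied to the pixel map P
            self.conv_1 = nn.Conv2d(channels, channels, 1)                      # K_1, a 1x1 kernel

        def forward(self, R, P):
            D_i = self.conv_r(R)
            P_i = self.conv_p(P)
            return self.conv_1(D_i + P_i)   # R_i, the final RGB image feature of this level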
4.2) The RGB image features generated above and the depth feature information from the shape perception module are input into the feature cross-fusion module, which fuses the multi-modal features of different receptive fields.
5) The updated 5th-level RGB image features and Depth image features obtained in step 4 are input into the DeepLabV3+ decoder, and the output of the encoder is upsampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are concatenated, one 3×3 convolution (refinement) is applied to obtain the final fused feature, which is activated by a sigmoid function to obtain the predicted semantic map P_est.
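The decoding step can be summarized by the sketch below of a DeepLabV3+-style head; the channel numbers, class count and feature resolutions are illustrative assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SegDecoder(nn.Module):
        # Upsample the encoder output 4x to the low-level resolution, concatenate,
        # refine with one 3x3 convolution and activate to obtain the prediction P_est.
        def __init__(self, high_ch, low_ch, num_classes):
            super().__init__()
            self.refine = nn.Conv2d(high_ch + low_ch, num_classes, 3, padding=1)

        def forward(self, high, low):
            high = F.interpolate(high, size=low.shape[2:], mode="bilinear", align_corners=False)
            x = torch.cat([high, low], dim=1)
            return torch.sigmoid(self.refine(x))   # P_est

    dec = SegDecoder(high_ch=256, low_ch=64, num_classes=40)   # e.g. the 40-class NYUDv2 labeling
    p_est = dec(torch.randn(1, 256, 60, 80), torch.randn(1, 64, 240, 320))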
6) A loss function is calculated between the semantic map P_est predicted by the invention and the manually annotated semantic segmentation map P_GT; the parameter weights of the proposed model are gradually updated by a back-propagation algorithm, and the structure and parameter weights of the RGB-D semantic segmentation algorithm are finally determined.
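A minimal training step consistent with step 6 (compare P_est with the annotated map P_GT, back-propagate, update the weights) could be sketched as follows; the cross-entropy loss and the optimizer interface are assumptions, since the text only says a loss function is calculated:

    import torch
    import torch.nn as nn

    def train_step(model, optimizer, rgb, depth, p_gt):
        # p_gt: (B, H, W) tensor of class indices, the manually annotated map P_GT.
        model.train()
        optimizer.zero_grad()
        p_est = model(rgb, depth)                        # (B, num_classes, H, W) prediction
        loss = nn.functional.cross_entropy(p_est, p_gt)  # assumed loss choice
        loss.backward()                                  # back-propagation
        optimizer.step()                                 # gradually update the parameter weights
        return loss.item()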
7) With the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate the segmentation map P_test, and evaluation is performed using the MAE, S-measure, F-measure and E-measure indexes.
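Of the evaluation indexes named in step 7, MAE is the simplest to state; a small sketch is given below, assuming both maps are normalized to [0, 1] (S-measure, F-measure and E-measure follow their standard definitions and are omitted here):

    import torch

    def mae(p_test: torch.Tensor, p_gt: torch.Tensor) -> float:
        # Mean Absolute Error between the predicted map and the ground-truth map,
        # both assumed to have the same shape and values in [0, 1].
        return (p_test - p_gt).abs().mean().item()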

Claims (5)

1. A cross-modal RGB-D semantic segmentation method based on shape perception, characterized by comprising the following steps:
1) Acquiring an RGB-D data set for training and testing the task, and defining an algorithm target of the invention;
2) Constructing a shape perception and pixel convolution based RGB-D semantic segmentation network model by using a deep learning technology and a double encoder-decoder structure;
3) Constructing a cross-modal feature fusion network for generating multi-modal features;
4) The cross-modal characteristics are fused by a cross fusion method, so that the high-level semantic information of the multi-modal characteristics is enhanced;
5) In the DeepLabV3+ decoder, the output of the encoder is upsampled so that its resolution matches the low-level features; the feature layers are concatenated, one 3×3 convolution is applied, and a sigmoid activation yields the predicted semantic map P_est;
6) A loss is calculated between the predicted semantic map P_est and the manually annotated semantic segmentation map P_GT;
7) The test data set is tested to generate the segmentation map P_test, and performance evaluation is performed using the evaluation indexes.
2. The shape-aware cross-modal RGB-D semantic segmentation method of claim 1, wherein: the specific method of the step 2) is as follows:
2.1) The NYU-Depth-V2 (NYUDv2-13 and NYUDv2-40) dataset is used as the training set and the SUN RGB-D dataset is used as the test set.
2.2) In the RGB-D image dataset, each sample is labeled with a scene category (scene), a two-dimensional segmentation (2D segmentation), a three-dimensional room layout (3D room layout), three-dimensional object boxes (3D object box) and three-dimensional object orientations (3D object orientation).
3. The shape-aware cross-modal RGB-D semantic segmentation method of claim 1, wherein: the specific method of the step 3) is as follows:
3.1) The encoder-decoder architecture is used as the basic architecture of the model of the invention; two encoders extract, from each corresponding RGB-D image pair, the RGB image features F_r^i and the Depth image features F_d^i, respectively.
3.2) The invention builds the network model of the dual encoder-decoder architecture and pre-trains it using the NYU-Depth-V2 dataset.
4. The shape-aware cross-modal RGB-D semantic segmentation method of claim 1, wherein: the specific method of the step 4) is as follows:
4.1) The cross-modal feature fusion module is composed of 5 levels of FCF modules, which take the 5 levels of RGB image features F_r^i and the corresponding Depth image features F_d^i as input and update them into the 5-level features F_r^i' and F_d^i'.
4.2) The input data of the i-th level FCF module are F_r^i and F_d^i, and the 5-level features are updated through an interactive attention mechanism into F_r^i' and F_d^i'.
5. The shape-aware cross-modal RGB-D semantic segmentation method of claim 1, wherein: the specific method of the step 5) is as follows:
5.1) The updated features F_r^i' are further extracted through the effective feature layer using the pixel convolution structure:
P_i = Conv(P, K_i)   Formula (1)
D_i = Conv(R, K_i)   Formula (2)
R_i = Conv(D_i + P_i, K_1)   Formula (3)
where i ∈ {1,2,3,4,5} denotes the level at which the feature is located, Conv() denotes the convolution operation, K_i are the convolution kernels of the different levels, D_i is the result of RGB feature extraction, P_i is the extracted pixel information, K_1 is a 1×1 convolution kernel, and R_i is the finally generated RGB image feature F_r^i.
5.2) The RGB image features generated above and the depth feature information from the shape perception module are input into the feature cross-fusion module, which fuses the multi-modal features of different receptive fields.
6) The updated 5th-level RGB image features and Depth image features obtained in step 4 are input into the DeepLabV3+ decoder, and the output of the encoder is upsampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are concatenated, one 3×3 convolution (refinement) is applied to obtain the final fused feature, which is activated by a sigmoid function to obtain the predicted semantic map P_est.
7) A loss function is calculated between the semantic map P_est predicted by the invention and the manually annotated semantic segmentation map P_GT; the parameter weights of the proposed model are gradually updated by a back-propagation algorithm, and the structure and parameter weights of the RGB-D semantic segmentation algorithm are finally determined.
CN202310347813.7A 2023-03-31 2023-03-31 Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution Pending CN116433904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310347813.7A CN116433904A (en) 2023-03-31 2023-03-31 Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310347813.7A CN116433904A (en) 2023-03-31 2023-03-31 Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution

Publications (1)

Publication Number Publication Date
CN116433904A true CN116433904A (en) 2023-07-14

Family

ID=87084845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310347813.7A Pending CN116433904A (en) 2023-03-31 2023-03-31 Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution

Country Status (1)

Country Link
CN (1) CN116433904A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935052A (en) * 2023-07-24 2023-10-24 北京中科睿途科技有限公司 Semantic segmentation method and related equipment in intelligent cabin environment
CN116935052B (en) * 2023-07-24 2024-03-01 北京中科睿途科技有限公司 Semantic segmentation method and related equipment in intelligent cabin environment

Similar Documents

Publication Publication Date Title
CN111339903B (en) Multi-person human body posture estimation method
CN112529015B (en) Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN110674741B (en) Gesture recognition method in machine vision based on double-channel feature fusion
CN112734915A (en) Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN114359509B (en) Multi-view natural scene reconstruction method based on deep learning
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
Jaus et al. Panoramic panoptic segmentation: Towards complete surrounding understanding via unsupervised contrastive learning
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
Li et al. ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion
CN113870160B (en) Point cloud data processing method based on transformer neural network
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN117315169A (en) Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching
Liu et al. Road segmentation with image-LiDAR data fusion in deep neural network
CN116433904A (en) Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
Wang et al. Multi-view attention-convolution pooling network for 3D point cloud classification
CN117745944A (en) Pre-training model determining method, device, equipment and storage medium
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
CN111274901B (en) Gesture depth image continuous detection method based on depth gating recursion unit
Li et al. DPG-Net: Densely progressive-growing network for point cloud completion
Tang et al. Encoder-decoder structure with the feature pyramid for depth estimation from a single image
Guo et al. CTpoint: A novel local and global features extractor for point cloud
Zhang et al. Hvdistill: Transferring knowledge from images to point clouds via unsupervised hybrid-view distillation
CN116665185A (en) Three-dimensional target detection method, system and storage medium for automatic driving
CN114898356A (en) 3D target detection algorithm based on sphere space characteristics and multi-mode cross fusion network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination