CN116433904A - Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution - Google Patents
- Publication number
- CN116433904A CN116433904A CN202310347813.7A CN202310347813A CN116433904A CN 116433904 A CN116433904 A CN 116433904A CN 202310347813 A CN202310347813 A CN 202310347813A CN 116433904 A CN116433904 A CN 116433904A
- Authority
- CN
- China
- Prior art keywords
- rgb
- modal
- cross
- feature
- semantic segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06N3/045: Combinations of networks
- G06N3/0455: Auto-encoder networks; encoder-decoder networks
- G06N3/048: Activation functions
- G06N3/084: Backpropagation, e.g. using gradient descent
- G06V10/40: Extraction of image or video features
- G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/806: Fusion of extracted features
- G06V10/82: Image or video recognition or understanding using neural networks
- G06V20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
The invention belongs to the field of computer vision and provides a cross-modal RGB-D semantic segmentation method based on shape perception, which comprises the following steps: 1) acquiring an RGB-D data set for training and testing the task, and defining the algorithm target of the invention; 2) constructing an RGB-D semantic segmentation network model based on shape perception and pixel convolution by using a deep learning technique and a dual encoder-decoder structure; 3) constructing a cross-modal feature fusion network for generating multi-modal features; 4) fusing the cross-modal features by a cross-fusion method, so that the high-level semantic information of the multi-modal features is enhanced; 5) in the DeepLabV3+ decoder, up-sampling the output of the encoder to match the resolution of the low-level features, applying one 3×3 convolution to the connected feature layers, and activating with a sigmoid function to obtain the predicted semantic map P_est; 6) calculating the loss between the predicted saliency map P_est and the manually annotated semantic segmentation map P_GT; 7) testing on the test data set to generate the saliency map P_test, and performing performance evaluation using the evaluation indices.
Description
Technical field:
The invention relates to the field of computer vision and image processing, and in particular to a cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution.
Background art:
Semantic segmentation takes raw image data as input and converts it into a mask with highlighted regions of interest, in which every pixel is assigned a class ID according to the object it belongs to. It groups together the image regions that belong to the same object, which broadens its range of applications. Compared with other image-based tasks, semantic segmentation is more fine-grained: in the field of computer vision, it is a pixel-level classification task typically built on fully convolutional networks.
Single-modal RGB semantic segmentation faces challenging factors such as complex scenes, in which the outline of a target is difficult to delineate and it is hard to accurately and completely locate and classify all objects against the background. To address this problem, a Depth image is introduced into semantic segmentation, and segmentation is performed on the joint RGB-D input formed by the RGB image and the Depth image.
The Depth map mainly provides information such as target edges. When it is introduced into the semantic segmentation task, the RGB image provides global appearance information, while the Depth map supplies more complete contour information and expresses geometric structure and distance. Combining RGB images with Depth maps for the semantic segmentation task is therefore a reasonable choice.
Existing RGB-D semantic segmentation methods mainly treat the Depth map as a data stream independent of the RGB image and extract features separately, or append the Depth image as a fourth channel of the RGB image and process the two modalities indiscriminately. They do not account for the fact that RGB and Depth information are fundamentally different, so the convolution operations widely used on RGB images are not well suited to processing Depth images.
Considering the ambiguity between RGB image data and Depth image data across modalities, the invention explores a cross-modal feature fusion method based on shape perception and pixel convolution. By connecting the local shape cues of the depth features with further feature mining inside the cross-modal feature fusion, the invention helps the semantic segmentation model classify pixels more accurately.
Summary of the invention:
To address these problems, the invention provides a cross-modal RGB-D semantic segmentation method based on shape perception, which adopts the following technical scheme:
1. An RGB-D data set for training and testing the task is acquired.
1.1) The NYU-Depth-V2 dataset (NYUDv2-13 and NYUDv2-40) is used as the training set, and the SUN RGB-D dataset is used as the test set.
1.2) In the RGB-D image dataset, each sample is labeled with a scene category (scene), a two-dimensional segmentation (2D segmentation), a three-dimensional room layout (3D room layout), three-dimensional object boxes (3D object box) and three-dimensional object orientations (3D object orientation).
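For illustration only, the RGB-D pairs and their annotations could be loaded with a minimal PyTorch-style dataset wrapper such as the sketch below; the directory layout, file names and normalisation are assumptions made for the example, not part of the described method.

```python
# Illustrative sketch only: loading RGB-D pairs with their segmentation labels.
# The directory layout and normalisation below are assumptions, not part of the patent.
import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class RGBDSegDataset(Dataset):
    def __init__(self, root, split="train"):
        self.rgb_dir = os.path.join(root, split, "rgb")      # hypothetical layout
        self.depth_dir = os.path.join(root, split, "depth")
        self.label_dir = os.path.join(root, split, "label")
        self.names = sorted(os.listdir(self.rgb_dir))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        rgb = np.array(Image.open(os.path.join(self.rgb_dir, name)).convert("RGB"))
        depth = np.array(Image.open(os.path.join(self.depth_dir, name)), dtype=np.float32)
        label = np.array(Image.open(os.path.join(self.label_dir, name)), dtype=np.int64)
        rgb = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0   # 3 x H x W in [0, 1]
        depth = torch.from_numpy(depth).unsqueeze(0)                   # 1 x H x W
        depth = depth / (depth.max() + 1e-6)                           # simple normalisation (assumed)
        return rgb, depth, torch.from_numpy(label)
```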
2. Using a deep learning technique, an RGB-D semantic segmentation network model based on shape perception and pixel convolution is built with a dual encoder-decoder structure:
2.1) An encoder-decoder architecture is used as the basic architecture of the model of the invention, extracting the RGB image features and the Depth image features of the input image pair respectively.
2.2) The invention builds a network model with a dual encoder-decoder architecture and pre-trains it on the NYU-Depth-V2 dataset.
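A minimal sketch of the dual-encoder idea is given below, assuming one torchvision ResNet-50 backbone per modality; the published text does not fix the backbone, so this choice and the level boundaries are illustrative only.

```python
# Illustrative sketch of the dual-encoder structure (backbone choice is an assumption).
import torch.nn as nn
from torchvision.models import resnet50


class DualEncoder(nn.Module):
    """Two parallel encoders, one for RGB and one for Depth, each yielding 5 feature levels."""

    def __init__(self):
        super().__init__()
        self.rgb_backbone = resnet50(weights=None)
        self.depth_backbone = resnet50(weights=None)

    @staticmethod
    def _levels(backbone, x):
        f1 = backbone.relu(backbone.bn1(backbone.conv1(x)))   # level 1
        f2 = backbone.layer1(backbone.maxpool(f1))             # level 2
        f3 = backbone.layer2(f2)                                # level 3
        f4 = backbone.layer3(f3)                                # level 4
        f5 = backbone.layer4(f4)                                # level 5
        return [f1, f2, f3, f4, f5]

    def forward(self, rgb, depth3):
        # depth3 is the three-channel Depth image described in the detailed embodiment
        return self._levels(self.rgb_backbone, rgb), self._levels(self.depth_backbone, depth3)
```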
3. Based on the RGB image features and the corresponding Depth image features extracted in step 2, cross-modal feature fusion is performed, and a cross-modal feature fusion network is constructed to generate the multi-modal features.
3.1) The cross-modal feature fusion module integrates the 5 levels of RGB image features and the corresponding 5 levels of Depth image features through 5 FCF modules, one per level, and updates the features of all 5 levels.
3.2) The input of the i-th level FCF module is the i-th level RGB image feature and the i-th level Depth image feature; through the interactive attention mechanism the features of all 5 levels are updated.
3.3) The specific process by which the FCF module generates the multi-modal features through feature cross-fusion is as follows:
3.3.1) First, the invention constructs a cross pixel convolution module to capture the pixel-level differences of the RGB features, further enhancing the RGB image features. At the same time, a shape-aware convolution is constructed for the Depth map to obtain accurate local shape and edge information, further enhancing the Depth image features.
3.3.2) The RGB image features and the corresponding Depth image features are further fused with an element-wise matrix addition operation, where pixel convolution determines whether a pixel is usable and the element-wise addition yields the final value. The fused features are then converted into the RGB feature update weights W_r and the depth feature update weights W_d with a softmax activation function.
In these formulas, conv denotes a convolution module, the product denotes element-wise matrix multiplication, add denotes element-wise matrix addition, GAP denotes global average pooling, and softmax denotes the softmax activation function; the two operands are the pixel convolution value and the RGB convolution value.
3.3.3) After the RGB feature update weights W_r and the depth feature update weights W_d are obtained, W_r and W_d are combined with the enhanced RGB image features and the corresponding Depth image features respectively to obtain the new RGB features and Depth features.
3.3.4) Through the above operations the features of all 5 levels are updated, and the updated features of each level are fed into the pixel convolution module and the shape perception module of the next level; the receptive-field information and the high-level semantic information of the features are enhanced through the multi-level operation.
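Because the fusion formulas themselves are not reproduced in the published text, the following is only a plausible single-level sketch of the FCF idea: both modalities are enhanced, fused by element-wise addition, and per-channel update weights W_r and W_d are derived with global average pooling and a softmax. The stand-in convolutions and the way the weights are split are assumptions, not the patented design.

```python
# Plausible sketch of one FCF level; the exact formulas are not given in the text.
import torch
import torch.nn as nn


class FCF(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Plain 3x3 convolutions stand in for the cross pixel convolution (RGB branch)
        # and the shape-aware convolution (Depth branch), which are not specified here.
        self.rgb_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.depth_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, f_rgb, f_depth):
        r = self.rgb_conv(f_rgb)                       # enhanced RGB feature
        d = self.depth_conv(f_depth)                   # enhanced Depth feature
        fused = r + d                                  # element-wise matrix addition
        w_r = torch.softmax(self.gap(fused), dim=1)    # per-channel update weights W_r
        w_d = 1.0 - w_r                                # complementary Depth weights W_d (assumed split)
        return f_rgb + w_r * r, f_depth + w_d * d      # updated features passed to the next level
```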
4) The cross-modal features, the RGB image features and the corresponding Depth image features are fused by the cross-fusion method, and the fusion features are finally obtained.
In the fusion formula, i ∈ {1,2,3,4,5} denotes the level of the model at which the feature is located, conv5 denotes a convolution operation with a 5×5 kernel, and cat denotes the feature concatenation operation.
4.1) The updated features are processed by the pixel convolution structure on the effective feature layer for feature extraction:
P_i = Conv(P, K_i)   Formula (3)
D_i = Conv(R, K_i)   Formula (4)
R_i = Conv(D_i + P_i, K_1)   Formula (5)
Here i ∈ {1,2,3,4,5} denotes the level at which the feature is located, Conv() denotes the convolution operation, K_i is the convolution kernel of the corresponding level, D_i is the result of RGB feature extraction, P_i is the extracted pixel information, K_1 is a 1×1 convolution kernel, and R_i is the finally generated RGB image feature.
4.2) The RGB image features generated in the above steps and the depth feature information from the modality-aware module are input into the feature cross-fusion module, which fuses the multi-modal features of different receptive fields.
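Formulas (3) to (5) can be transcribed almost directly. The sketch below assumes a 3×3 kernel for K_i, since the per-level kernel sizes are not specified, and is given for illustration only.

```python
# Transcription of formulas (3)-(5) for a single level i; the K_i kernel size is assumed.
import torch.nn as nn


class PixelConvBranch(nn.Module):
    def __init__(self, channels, k_i=3):
        super().__init__()
        self.conv_p = nn.Conv2d(channels, channels, k_i, padding=k_i // 2)  # P_i = Conv(P, K_i)
        self.conv_d = nn.Conv2d(channels, channels, k_i, padding=k_i // 2)  # D_i = Conv(R, K_i)
        self.conv_r = nn.Conv2d(channels, channels, 1)                      # K_1 is a 1x1 kernel

    def forward(self, p, r):
        p_i = self.conv_p(p)           # extracted pixel information
        d_i = self.conv_d(r)           # result of RGB feature extraction
        return self.conv_r(d_i + p_i)  # R_i = Conv(D_i + P_i, K_1), the final RGB image feature
```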
5) The updated 5th-level RGB image features and depth image features obtained in step 4 are input into the DeepLabV3+ decoder, and the output of the encoder is up-sampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are connected, a 3×3 convolution (refinement) is applied to obtain the final fusion feature, which is activated by a sigmoid function to obtain the predicted semantic map P_est.
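A minimal sketch of this step-5 decoder head follows, assuming the 4× up-sampled encoder output already matches the spatial size of the low-level feature; the channel counts are placeholders.

```python
# Sketch of the step-5 head: 4x up-sampling, feature-layer connection, 3x3 convolution, sigmoid.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegHead(nn.Module):
    def __init__(self, high_channels, low_channels, out_channels):
        super().__init__()
        self.refine = nn.Conv2d(high_channels + low_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, high, low):
        high = F.interpolate(high, scale_factor=4, mode="bilinear", align_corners=False)
        x = torch.cat([high, low], dim=1)       # connect the feature layers
        return torch.sigmoid(self.refine(x))    # predicted semantic map P_est
```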
6) The loss function is computed between the semantic map P_est predicted by the invention and the manually annotated semantic segmentation map P_GT, the parameter weights of the proposed model are progressively updated by the back-propagation algorithm, and the structure and parameter weights of the RGB-D semantic segmentation algorithm are finally determined.
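The text does not name the loss function, so the sketch below uses binary cross-entropy on the sigmoid output purely as a placeholder for the loss computation and back-propagation step.

```python
# Placeholder for step 6: compute a loss between P_est and P_GT and back-propagate.
import torch
import torch.nn.functional as F


def training_step(model, optimizer, rgb, depth, p_gt):
    """p_gt is the manually annotated map P_GT, assumed here to take values in [0, 1]."""
    optimizer.zero_grad()
    p_est = model(rgb, depth)                   # predicted semantic map P_est
    loss = F.binary_cross_entropy(p_est, p_gt)  # loss choice is an assumption
    loss.backward()                             # back-propagation updates the parameter weights
    optimizer.step()
    return loss.item()
```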
7) With the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate the saliency map P_test, and evaluation is performed with the MAE, S-measure, F-measure and E-measure indices.
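Of the four evaluation indices, MAE has the simplest definition, the mean absolute error between the prediction and the annotation; a minimal implementation is sketched below, while S-measure, F-measure and E-measure follow their usual definitions and are omitted.

```python
# Mean absolute error between a predicted map and its annotation, both assumed to lie in [0, 1].
import torch


def mae(p_test: torch.Tensor, p_gt: torch.Tensor) -> float:
    return (p_test - p_gt).abs().mean().item()
```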
Based on RGB-D semantic segmentation realized with a deep convolutional neural network, shape perception and pixel convolution, the invention extracts the rich spatial-structure and edge information in the Depth image and fuses it cross-modally with the global information extracted from the RGB image, so that it can meet the requirements of semantic segmentation in different scenes, especially in challenging scenes (complex background, low contrast, transparent objects, etc.). Compared with existing semantic segmentation methods, the invention has the following benefits:
First, a depth map is introduced, but it is not used as an additional channel of the RGB image, nor are the two modalities treated as contributing equally to feature extraction and fusion. Using a deep learning technique, the relationship between the RGB-D image pair and the true classes is built through the dual encoder-decoder structure, and the segmentation features are obtained through cross-modal feature extraction and fusion.
Second, through cross-fusion the Depth image features effectively modulate the complementary edge information of the RGB image features without disturbing the global information of the RGB image; the depth distribution information of the Depth image guides the cross-modal feature fusion and suppresses the interference of background information in the RGB image, laying the foundation for pixel segmentation in the next stage.
Finally, the final semantic segmentation pixel map is predicted by the semantic decoder.
Drawings
FIG. 1 is a schematic view of a model structure according to the present invention
FIG. 2 is a schematic diagram of a cross-modal feature fusion module
FIG. 3 is a schematic diagram of a cross-pixel convolution module
FIG. 4 is a schematic diagram of a split decoder
FIG. 5 is a schematic diagram of model training and testing
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art without inventive effort, based on the embodiments of the invention, fall within the protection scope of the present invention.
Referring to FIG. 1, the cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution mainly comprises the following steps:
1. The RGB-D data sets for training and testing the task are acquired, the algorithm target of the invention is defined, and the training set and test set used to train and test the algorithm are determined. The NYU-Depth-V2 dataset (NYUDv2-13 and NYUDv2-40) is used as the training set and the SUN RGB-D dataset is used as the test set.
2. RGB image features are extracted with a cross pixel convolution network, and Depth image features are extracted with a shape-aware convolution network; on this basis a dual encoder-decoder semantic segmentation network is constructed, comprising an RGB encoder for extracting the RGB image features and a Depth encoder for extracting the Depth image features:
2.1. The three-channel RGB image is input to the RGB encoder to generate 5 levels of RGB image features.
2.2. The three-channel Depth image is input to the Depth encoder to generate 5 levels of Depth image features.
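How the single-channel depth map becomes the three-channel Depth image mentioned in step 2.2 is not specified; simple channel replication, shown below, is one common option (HHA encoding is another) and is given only as an assumption.

```python
# Assumed conversion of a 1 x H x W depth map into the 3 x H x W input of the Depth encoder.
import torch


def depth_to_3ch(depth: torch.Tensor) -> torch.Tensor:
    return depth.expand(3, -1, -1).clone()
```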
3. Based on the RGB image features and the corresponding Depth image features extracted in step 2, cross-modal feature fusion is performed, and a cross-modal feature fusion network is constructed to generate the multi-modal features.
3.1) The cross-modal feature fusion module integrates the 5 levels of RGB image features and the corresponding 5 levels of Depth image features through 5 FCF modules, one per level, and updates the features of all 5 levels.
3.2) The input of the i-th level FCF module is the i-th level RGB image feature and the i-th level Depth image feature; through the interactive attention mechanism the features of all 5 levels are updated.
3.3) The specific process by which the FCF module generates the multi-modal features through feature cross-fusion is as follows:
3.3.1) First, the invention constructs a cross pixel convolution module to capture the pixel-level differences of the RGB features, further enhancing the RGB image features. At the same time, a shape-aware convolution is constructed for the Depth map to obtain accurate local shape and edge information, further enhancing the Depth image features.
3.3.2) The RGB image features and the corresponding Depth image features are further fused with an element-wise matrix addition operation, where pixel convolution determines whether a pixel is usable and the element-wise addition yields the final value. The fused features are then converted into the RGB feature update weights W_r and the depth feature update weights W_d with a softmax activation function.
In these formulas, conv denotes a convolution module, the product denotes element-wise matrix multiplication, add denotes element-wise matrix addition, GAP denotes global average pooling, and softmax denotes the softmax activation function; the two operands are the pixel convolution value and the RGB convolution value.
3.3.3) After the RGB feature update weights W_r and the depth feature update weights W_d are obtained, W_r and W_d are combined with the enhanced RGB image features and the corresponding Depth image features respectively to obtain the new RGB features and Depth features.
3.3.4) Through the above operations the features of all 5 levels are updated, and the updated features of each level are fed into the pixel convolution module and the shape perception module of the next level; the receptive-field information and the high-level semantic information of the features are enhanced through the multi-level operation.
4) The cross-modal features, the RGB image features and the corresponding Depth image features are fused by the cross-fusion method, and the fusion features are finally obtained.
In the fusion formula, i ∈ {1,2,3,4,5} denotes the level of the model at which the feature is located, conv5 denotes a convolution operation with a 5×5 kernel, and cat denotes the feature concatenation operation.
4.1) The updated features are processed by the pixel convolution structure on the effective feature layer for feature extraction:
D_i = Conv(R, K_i)   Formula (3)
P_i = Conv(P, K_i)   Formula (4)
R_i = Conv(D_i + P_i, K_1)   Formula (5)
Here i ∈ {1,2,3,4,5} denotes the level at which the feature is located, Conv() denotes the convolution operation, K_i is the convolution kernel of the corresponding level, D_i is the result of RGB feature extraction, P_i is the extracted pixel information, K_1 is a 1×1 convolution kernel, and R_i is the finally generated RGB image feature.
4.2) The RGB image features generated in the above steps and the depth feature information from the modality-aware module are input into the feature cross-fusion module, which fuses the multi-modal features of different receptive fields.
5) The updated 5th-level RGB image features and depth image features obtained in step 4 are input into the DeepLabV3+ decoder, and the output of the encoder is up-sampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are connected, a 3×3 convolution (refinement) is applied to obtain the final fusion feature, which is activated by a sigmoid function to obtain the predicted semantic map P_est.
6) The loss function is computed between the semantic map P_est predicted by the invention and the manually annotated semantic segmentation map P_GT, the parameter weights of the proposed model are progressively updated by the back-propagation algorithm, and the structure and parameter weights of the RGB-D semantic segmentation algorithm are finally determined.
7) With the model structure and parameter weights determined in step 6, the RGB-D image pairs of the test set are tested to generate the saliency map P_test, and evaluation is performed with the MAE, S-measure, F-measure and E-measure indices.
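Purely as an illustration of how steps 1 to 7 fit together at run time, a compact training-and-testing loop is sketched below; the optimizer, learning rate, number of epochs, loss and the model's call signature are all assumptions.

```python
# Illustrative end-to-end loop: train with back-propagation (steps 5-6), then evaluate (step 7).
import torch
import torch.nn.functional as F


def run(model, train_loader, test_loader, epochs=50, lr=1e-4, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        for rgb, depth, p_gt in train_loader:
            rgb, depth, p_gt = rgb.to(device), depth.to(device), p_gt.to(device).float()
            optimizer.zero_grad()
            loss = F.binary_cross_entropy(model(rgb, depth), p_gt)  # loss choice is an assumption
            loss.backward()
            optimizer.step()
    model.eval()
    errors = []
    with torch.no_grad():
        for rgb, depth, p_gt in test_loader:
            p_test = model(rgb.to(device), depth.to(device)).cpu()
            errors.append((p_test - p_gt.float()).abs().mean().item())  # MAE per batch
    return sum(errors) / max(len(errors), 1)
```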
Claims (5)
1. A cross-modal RGB-D semantic segmentation method based on shape perception, characterized by comprising the following steps:
1) Acquiring an RGB-D data set for training and testing the task, and defining an algorithm target of the invention;
2) Constructing an RGB-D semantic segmentation network model based on shape perception and pixel convolution by using a deep learning technique and a dual encoder-decoder structure;
3) Constructing a cross-modal feature fusion network for generating multi-modal features;
4) The cross-modal features are fused by a cross-fusion method, so that the high-level semantic information of the multi-modal features is enhanced;
5) In the DeepLabV3+ decoder, the output of the encoder is up-sampled to match the resolution of the low-level features; the connected feature layers are convolved once with a 3×3 convolution and then activated by a sigmoid function to obtain the predicted semantic map P_est;
6) The loss is calculated between the predicted saliency map P_est and the manually annotated semantic segmentation map P_GT;
7) The test data set is tested to generate the saliency map P_test, and performance evaluation is performed using the evaluation indices.
2. The shape-aware cross-modal RGB-D semantic segmentation method of claim 1, wherein: the specific method of the step 2) is as follows:
2.1) The NYU-Depth-V2 dataset (NYUDv2-13 and NYUDv2-40) is used as the training set, and the SUN RGB-D dataset is used as the test set.
2.2) In the RGB-D image dataset, each sample is labeled with a scene category (scene), a two-dimensional segmentation (2D segmentation), a three-dimensional room layout (3D room layout), three-dimensional object boxes (3D object box) and three-dimensional object orientations (3D object orientation).
3. The shape-aware cross-modal RGB-D semantic segmentation method of claim 1, wherein: the specific method of the step 3) is as follows:
3.1) An encoder-decoder architecture is used as the basic architecture of the model of the invention, extracting the RGB image features and the Depth image features of the input image pair respectively.
3.2) The invention builds a network model with a dual encoder-decoder architecture and pre-trains it on the NYU-Depth-V2 dataset.
4. The shape-aware cross-modal RGB-D semantic segmentation method of claim 1, wherein: the specific method of the step 4) is as follows:
4.1) The cross-modal feature fusion module integrates the 5 levels of RGB image features and the corresponding 5 levels of Depth image features through 5 FCF modules, one per level, and updates the features of all 5 levels.
5. The shape-aware cross-modal RGB-D semantic segmentation method of claim 1, wherein: the specific method of the step 5) is as follows:
5.1) The updated features are processed by the pixel convolution structure on the effective feature layer for feature extraction:
P_i = Conv(P, K_i)   Formula (1)
D_i = Conv(R, K_i)   Formula (2)
R_i = Conv(D_i + P_i, K_1)   Formula (3)
Here i ∈ {1,2,3,4,5} denotes the level at which the feature is located, Conv() denotes the convolution operation, K_i is the convolution kernel of the corresponding level, D_i is the result of RGB feature extraction, P_i is the extracted pixel information, K_1 is a 1×1 convolution kernel, and R_i is the finally generated RGB image feature.
5.2) The RGB image features generated in the above steps and the depth feature information from the modality-aware module are input into the feature cross-fusion module, which fuses the multi-modal features of different receptive fields.
6) The updated 5th-level RGB image features and depth image features obtained in step 4 are input into the DeepLabV3+ decoder, and the output of the encoder is up-sampled by a factor of 4 so that its resolution matches the low-level features. After the feature layers are connected, a 3×3 convolution (refinement) is applied to obtain the final fusion feature, which is activated by a sigmoid function to obtain the predicted semantic map P_est.
7) The loss function is computed between the semantic map P_est predicted by the invention and the manually annotated semantic segmentation map P_GT, the parameter weights of the proposed model are progressively updated by the back-propagation algorithm, and the structure and parameter weights of the RGB-D semantic segmentation algorithm are finally determined.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310347813.7A CN116433904A (en) | 2023-03-31 | 2023-03-31 | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310347813.7A CN116433904A (en) | 2023-03-31 | 2023-03-31 | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116433904A true CN116433904A (en) | 2023-07-14 |
Family
ID=87084845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310347813.7A Pending CN116433904A (en) | 2023-03-31 | 2023-03-31 | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116433904A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116935052A (en) * | 2023-07-24 | 2023-10-24 | 北京中科睿途科技有限公司 | Semantic segmentation method and related equipment in intelligent cabin environment |
CN116935052B (en) * | 2023-07-24 | 2024-03-01 | 北京中科睿途科技有限公司 | Semantic segmentation method and related equipment in intelligent cabin environment |
Similar Documents
Publication | Title
---|---
CN111339903B | Multi-person human body posture estimation method
CN112529015B | Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
Zhang et al. | Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN110674741B | Gesture recognition method in machine vision based on double-channel feature fusion
CN112734915A | Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN114359509B | Multi-view natural scene reconstruction method based on deep learning
CN112396607A | Streetscape image semantic segmentation method for deformable convolution fusion enhancement
Jaus et al. | Panoramic panoptic segmentation: Towards complete surrounding understanding via unsupervised contrastive learning
CN115713679A | Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
Li et al. | ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion
CN113870160B | Point cloud data processing method based on transformer neural network
CN113554032A | Remote sensing image segmentation method based on multi-path parallel network of high perception
CN117315169A | Live-action three-dimensional model reconstruction method and system based on deep learning multi-view dense matching
Liu et al. | Road segmentation with image-LiDAR data fusion in deep neural network
CN116433904A | Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
Wang et al. | Multi-view attention-convolution pooling network for 3D point cloud classification
CN117745944A | Pre-training model determining method, device, equipment and storage medium
CN116385660A | Indoor single view scene semantic reconstruction method and system
CN111274901B | Gesture depth image continuous detection method based on depth gating recursion unit
Li et al. | DPG-Net: Densely progressive-growing network for point cloud completion
Tang et al. | Encoder-decoder structure with the feature pyramid for depth estimation from a single image
Guo et al. | CTpoint: A novel local and global features extractor for point cloud
Zhang et al. | Hvdistill: Transferring knowledge from images to point clouds via unsupervised hybrid-view distillation
CN116665185A | Three-dimensional target detection method, system and storage medium for automatic driving
CN114898356A | 3D target detection algorithm based on sphere space characteristics and multi-mode cross fusion network
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |