CN112861911A - RGB-D semantic segmentation method based on depth feature selection fusion - Google Patents
RGB-D semantic segmentation method based on depth feature selection fusion Download PDFInfo
- Publication number
- CN112861911A (application number CN202110027615.3A)
- Authority
- CN
- China
- Prior art keywords
- layer
- image
- rgb
- semantic segmentation
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The invention discloses an RGB-D semantic segmentation method based on depth feature selection and fusion. A dual-stream convolutional neural network is adopted as the semantic segmentation model: the RGB-D image and its corresponding label image are first preprocessed; the encoders then extract the per-layer feature maps of the visible-light image and the depth image, and these feature maps are fused to obtain the fusion feature of each layer; spatial attention is then used to select among the encoder's fusion features, and the selected result is upsampled to obtain the final segmentation. The method segments small objects and contours more accurately, is more robust to illumination changes and to objects of similar appearance, and achieves higher pixel accuracy and mean intersection-over-union.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an RGB-D semantic segmentation method.
Background
Semantic segmentation refers to classifying an image at the pixel level according to semantic information. With the development of depth sensors, depth information has come to be regarded, alongside the widely used visual information, as a complementary cue for improving scene-understanding performance. Depth information encodes 3D geometry that is insensitive to illumination variation and can distinguish objects of similar appearance; depth cues can therefore compensate, to some extent, for the shortcomings of semantic segmentation based on visual cues alone. RGB-D semantic segmentation is important for many applications, such as autonomous driving, robot vision and indoor navigation.
With the development of deep learning, dual-stream networks have achieved excellent performance in RGB-D semantic segmentation. However, how to effectively fuse visible-light and depth information into a unified representation remains a basic but difficult problem in RGB-D semantic segmentation. Hazirbas et al. in the document "C. Hazirbas, L. Ma, C. Domokos, and D. Cremers. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. In Asian Conference on Computer Vision, 2016, pp. 213-228" directly add the depth feature map to the visible-light feature map to fuse the two kinds of information. Deng et al. in the document "L. Deng, M. Yang, T. Li, Y. He, and C. Wang. RFBNet: Deep Multimodal Networks with Residual Fusion Blocks for RGB-D Semantic Segmentation. Eprint Arxiv, 2019" propose a residual fusion block to achieve bottom-up interaction and fusion between the two modalities. Hu et al. in the document "X. Hu, K. Yang, L. Fei, and K. Wang. ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. In IEEE International Conference on Image Processing, 2019, pp. 1440-1444" propose an attention complementary module that assigns different weights to the two modalities for better integration. Lee et al. in the document "S. Lee, S. J. Park, and K. S. Hong. RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation. In IEEE International Conference on Computer Vision, 2017, pp. 4990-4999" use modality-specific residual connections to learn RGB and depth features and their combinations so as to exploit their complementarity. Although these methods provide structured models that integrate the two kinds of information, how to ensure that the network takes full advantage of both modalities for fine semantic segmentation remains an open question.
In addition, most current RGB-D semantic segmentation methods are based on an encoder-decoder architecture. Successive downsampling in the encoder loses part of the detail information, and reducing the feature-map resolution to a very small size through cascaded pooling layers is unfavourable for producing accurate segmentation results. To recover spatial information lost in the encoding stage, O. Ronneberger et al. propose skip connections in the document "O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234-241" to reuse information lost in the encoder to assist upsampling and finally obtain a fine segmentation result. Although this enables some lost features to be reused, it lacks pertinence and does not explicitly model the recovery of important detail information.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides an RGB-D semantic segmentation method based on depth feature selection and fusion. A dual-stream convolutional neural network is adopted as the semantic segmentation model: the RGB-D image and its corresponding label image are first preprocessed; the encoders then extract the per-layer feature maps of the visible-light image and the depth image, and these feature maps are fused to obtain the fusion feature of each layer; spatial attention is then used to select among the encoder's fusion features, and the selected result is upsampled to obtain the final segmentation. The method segments small objects and contours more accurately, is more robust to illumination changes and to objects of similar appearance, and achieves higher pixel accuracy and mean intersection-over-union.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1: constructing a semantic segmentation model:
the method comprises the steps that a double-current convolutional neural network is adopted as a semantic segmentation model and comprises two encoders and a decoder;
step 2: fusing RGB-D information;
step 2-1: preprocessing an image;
taking the RGB-D images and the corresponding label images from the public training set as a training sample set, and uniformly changing the sizes of the images in the training sample set into A x A; the RGB-D image consists of a visible light image and a depth image;
coding the depth image of the single channel into a three-channel HHA image;
step 2-2: respectively inputting the visible light image and the HHA image into two encoders for feature extraction to obtain two feature sequences, wherein each feature sequence comprises an m-layer feature diagram and is represented as formula (1):
whereinAndj-th layer feature maps of the visible light image and the depth image respectively, wherein H, W and C respectively represent the height, width and channel number of the feature maps;
step 2-3: will be provided withAndobtaining selected characteristic graphs through convolution of 1 x 1 respectively, wherein the selected characteristic graphs are expressed as a formula (2):
whereinAndselected feature maps for the j-th layer of the visible image and the depth image respectively,andrespectively representing the process of selection through convolution of 1 x 1;
step 2-4: calculating the fusion characteristics of j layers of the two characteristic sequences by adopting an equation (3):
wherein the content of the first and second substances,andrespectively representing the fusion characteristics of the j-th layer and the j-1 st layer of the two characteristic sequences, fconv(.) represents convolution, fdown(.) represents downsampling, fconcat(.) represents a cascade; fusion features of layer 1 of two feature sequencesComprises the following steps:
Step 3: reusing detail information;
Step 3-1: take the fusion feature of the m-th layer of the two feature sequences as the first layer of the decoder, $D_1=M_m$; upsample it to obtain the second layer of the decoder, $D_2=f_{up}(M_m)$;
Step 3-2: select the fusion features of layers 2 to m-1 of the two feature sequences with spatial attention, re-fuse each selected result with the corresponding fusion feature according to formula (5), and upsample the re-fused result to obtain the third to m-th layers of the decoder:

$$\hat{D}_{m-i+1}=f_{fuse}\big(f_{sa}(M_i),\ D_{m-i+1}\big),\qquad D_{m-i+2}=f_{up}\big(\hat{D}_{m-i+1}\big),\qquad i=m-1,\dots,2 \tag{5}$$

where $\hat{D}_{m-i+1}$ denotes the re-fused result at layer m-i+1, $M_i$ denotes the fusion feature of the i-th layer of the two feature sequences, $D_{m-i+1}$ and $D_{m-i+2}$ denote the (m-i+1)-th and (m-i+2)-th layers of the decoder, $f_{sa}(\cdot)$ denotes spatial-attention selection, $f_{fuse}(\cdot)$ denotes re-fusion, and $f_{up}(\cdot)$ denotes upsampling;
Step 4: training the semantic segmentation model;
training the semantic segmentation model by using a training sample set; inputting the RGB-D image of the training sample set into a semantic segmentation model, and segmenting through the steps 2 and 3;
Downsample the label images of the training sample set m-2 times in succession, so that the image obtained at each downsampling matches the scale of the feature maps of layers m-1 down to 2 of the feature sequences;
Use the label images of the training sample set and their m-2 successively downsampled versions as the supervision information for layers m down to 2, respectively, when training the semantic segmentation model;
The training objective is:

$$L=\sum_i \lambda_i\, l_i \tag{6}$$

where $l_i$ and $\lambda_i$ denote the weighted cross-entropy loss of the i-th layer and its weight; the weighted cross entropy is

$$l=-\,weight[class]\Big(x[class]-\log\sum_j \exp(x[j])\Big)$$

where class is the class index, $weight[class]$ is the frequency with which pixels of each class appear in the training set, and $x[\cdot]$ is the predicted segmentation result;
Step 5: input the RGB-D image to be segmented into the trained semantic segmentation model to obtain the image segmentation result.
Preferably, A = 480, m = 4, λ_1 = 0.1, λ_2 = 0.1, λ_3 = 0.2.
Preferably, in step 4, nearest-neighbour interpolation is used when the label images of the training sample set are downsampled m-2 times in succession.
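As a concrete illustration of why nearest-neighbour sampling is preferred, interpolating a label map would average class indices and invent labels that exist in neither neighbourhood; picking the nearest source pixel keeps every output value a valid class. A minimal NumPy sketch (the helper name is ours, not from the patent):

```python
import numpy as np

def downsample_labels_nearest(labels: np.ndarray, factor: int) -> np.ndarray:
    """Downsample an integer label map by nearest-neighbour sampling.

    Bilinear interpolation would average class indices and create
    meaningless labels, so supervision maps are shrunk by picking the
    nearest source pixel instead.
    """
    h, w = labels.shape
    rows = np.arange(h // factor) * factor
    cols = np.arange(w // factor) * factor
    return labels[np.ix_(rows, cols)]

labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
small = downsample_labels_nearest(labels, 2)
# small == [[0, 1], [2, 3]]; every output value is still a valid class index
```

Applying the helper twice (factors 2 and 4 with m = 4) yields the supervision maps for decoder layers 3 and 2.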
The invention has the following beneficial effects:
1. More accurate segmentation of small objects and contours. The proposed spatial attention mechanism attends to and reuses features that are easily lost during downsampling, so that these features contribute to recovering the segmentation mask, which particularly helps small-object and contour segmentation.
2. Stronger robustness to illumination changes and to objects of similar appearance. The method takes both visible-light and depth images as input and fuses the two kinds of information with an information fusion module into a unified, highly discriminative representation. This representation compensates for the weaknesses of visible-light-only semantic segmentation: it reduces the influence of illumination changes on model performance and allows objects of similar appearance in the scene to be distinguished.
3. Higher pixel accuracy and mean intersection-over-union of segmentation results. The method helps existing model architectures perform better scene parsing.
4. Stronger generalization ability. The proposed RGB-D information fusion module generalizes to other multi-modal fusion tasks, such as fusing visible-light and infrared information.
5. Greater practical and industrial value. The invention is applicable to vehicle driver-assistance systems and indoor robot navigation, and thus has high practical value.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a graph of semantic segmentation results generated by the method of the present invention and the comparison method.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The method mainly addresses two problems in RGB-D semantic segmentation: multi-modal feature fusion is incomplete, and important features are lost in the downsampling process. Specifically:
1. Existing algorithms do not explicitly model the fusion of visible-light and depth information.
2. Existing algorithms lose some important features during downsampling, which degrades small-object and contour segmentation performance.
As shown in fig. 1, an RGB-D semantic segmentation method based on depth feature selection and fusion includes the following steps:
step 1: constructing a semantic segmentation model:
the method comprises the steps that a double-current convolutional neural network is adopted as a semantic segmentation model and comprises two encoders and a decoder;
step 2: fusing RGB-D information;
step 2-1: preprocessing an image;
Take the RGB-D images and their corresponding label images from a public training set as the training sample set, and uniformly resize all images in the training sample set to A × A; each RGB-D image consists of a visible-light image and a depth image;
Encode the single-channel depth image into a three-channel HHA image;
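For illustration only, the following NumPy sketch conveys the spirit of an HHA-style encoding: stacking a horizontal-disparity channel, a height channel and a gravity-angle channel into a three-channel image. The channel formulas are crude stand-ins (a faithful HHA encoder needs camera intrinsics and a gravity-direction estimate); the function name and defaults are our assumptions, not the patent's:

```python
import numpy as np

def encode_hha_simplified(depth: np.ndarray, cam_height: float = 1.0) -> np.ndarray:
    """Very simplified HHA-style encoding of a single-channel depth map.

    HHA stacks horizontal disparity, height above ground, and the angle of
    the local surface normal with gravity. Here we approximate:
      channel 0: disparity = 1 / depth
      channel 1: height proxy, image-plane row back-projected by depth
      channel 2: angle proxy from the vertical depth gradient
    Each channel is min-max normalised to [0, 1].
    """
    d = np.clip(depth, 1e-3, None)
    disparity = 1.0 / d
    rows = np.linspace(1.0, -1.0, depth.shape[0])[:, None]  # image-plane y
    height = cam_height + rows * d                          # crude back-projection
    angle = np.arctan(np.gradient(d, axis=0))               # slope of depth w.r.t. rows

    def norm(c):
        rng = c.max() - c.min()
        return (c - c.min()) / rng if rng > 0 else np.zeros_like(c)

    return np.stack([norm(disparity), norm(height), norm(angle)], axis=-1)

depth = np.random.default_rng(0).random((8, 8)) + 0.5   # synthetic depth in [0.5, 1.5)
hha = encode_hha_simplified(depth)
# hha.shape == (8, 8, 3), all values in [0, 1]
```

The three-channel result can then be fed to the depth-stream encoder exactly like an RGB image.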
Step 2-2: input the visible-light image and the HHA image into the two encoders respectively for feature extraction, obtaining two feature sequences, each comprising m layers of feature maps, expressed as formula (1):

$$\{F_j^{rgb}\}_{j=1}^{m},\qquad \{F_j^{d}\}_{j=1}^{m},\qquad F_j^{rgb},\,F_j^{d}\in\mathbb{R}^{H\times W\times C} \tag{1}$$

where $F_j^{rgb}$ and $F_j^{d}$ are the j-th layer feature maps of the visible-light image and the depth image respectively, and H, W and C denote the height, width and number of channels of the feature map;
Step 2-3: pass $F_j^{rgb}$ and $F_j^{d}$ through separate 1 × 1 convolutions to obtain the selected feature maps, expressed as formula (2):

$$\tilde{F}_j^{rgb}=f_{sel}^{rgb}(F_j^{rgb}),\qquad \tilde{F}_j^{d}=f_{sel}^{d}(F_j^{d}) \tag{2}$$

where $\tilde{F}_j^{rgb}$ and $\tilde{F}_j^{d}$ are the selected feature maps of the j-th layer of the visible-light image and the depth image respectively, and $f_{sel}^{rgb}(\cdot)$ and $f_{sel}^{d}(\cdot)$ denote selection by 1 × 1 convolution;
Step 2-4: compute the fusion feature of layer j of the two feature sequences with formula (3):

$$M_j=f_{conv}\big(f_{concat}\big(f_{down}(M_{j-1}),\ \tilde{F}_j^{rgb},\ \tilde{F}_j^{d}\big)\big),\qquad j=2,\dots,m \tag{3}$$

where $M_j$ and $M_{j-1}$ are the fusion features of layers j and j-1 of the two feature sequences, $f_{conv}(\cdot)$ denotes convolution, $f_{down}(\cdot)$ denotes downsampling, and $f_{concat}(\cdot)$ denotes concatenation; the fusion feature $M_1$ of layer 1 of the two feature sequences is given by formula (4):

$$M_1=f_{conv}\big(f_{concat}\big(\tilde{F}_1^{rgb},\ \tilde{F}_1^{d}\big)\big) \tag{4}$$
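The selection and fusion of steps 2-3 and 2-4 can be sketched in NumPy as follows, with the 1 × 1 "selection" convolutions realised as per-pixel channel mixing, stride-2 subsampling standing in for the downsampling, and concatenation followed by another 1 × 1 convolution as the fusion. All shapes and weights below are illustrative assumptions, not the patent's trained parameters:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution = per-pixel linear map over channels
    (the 'selection' step of step 2-3)."""
    return np.einsum('hwc,cd->hwd', x, w)

def downsample2x(x):
    """Stride-2 subsampling standing in for the downsampling operation."""
    return x[::2, ::2, :]

def fuse_layer(m_prev, f_rgb, f_d, w_rgb, w_d, w_out):
    """Fusion feature of one layer (step 2-4): select each modality with a
    1x1 convolution, concatenate with the downsampled previous fusion
    feature (if any), and mix with another 1x1 convolution."""
    parts = [conv1x1(f_rgb, w_rgb), conv1x1(f_d, w_d)]
    if m_prev is not None:
        parts.insert(0, downsample2x(m_prev))
    cat = np.concatenate(parts, axis=-1)   # concatenation
    return conv1x1(cat, w_out)             # mixing convolution

rng = np.random.default_rng(0)
C = 4
f_rgb1 = rng.normal(size=(8, 8, C)); f_d1 = rng.normal(size=(8, 8, C))
f_rgb2 = rng.normal(size=(4, 4, C)); f_d2 = rng.normal(size=(4, 4, C))
w = lambda cin: rng.normal(size=(cin, C))

m1 = fuse_layer(None, f_rgb1, f_d1, w(C), w(C), w(2 * C))   # layer 1 fusion
m2 = fuse_layer(m1, f_rgb2, f_d2, w(C), w(C), w(3 * C))     # layer 2 fusion
# m1.shape == (8, 8, 4), m2.shape == (4, 4, 4)
```

The same `fuse_layer` call is repeated down the encoder, one invocation per layer j, each time consuming the previous fusion feature.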
Step 3: reusing detail information;
Step 3-1: take the fusion feature of the m-th layer of the two feature sequences as the first layer of the decoder, $D_1=M_m$; upsample it to obtain the second layer of the decoder, $D_2=f_{up}(M_m)$;
Step 3-2: select the fusion features of layers 2 to m-1 of the two feature sequences with spatial attention, re-fuse each selected result with the corresponding fusion feature according to formula (5), and upsample the re-fused result to obtain the third to m-th layers of the decoder:

$$\hat{D}_{m-i+1}=f_{fuse}\big(f_{sa}(M_i),\ D_{m-i+1}\big),\qquad D_{m-i+2}=f_{up}\big(\hat{D}_{m-i+1}\big),\qquad i=m-1,\dots,2 \tag{5}$$

where $\hat{D}_{m-i+1}$ denotes the re-fused result at layer m-i+1, $M_i$ denotes the fusion feature of the i-th layer of the two feature sequences, $D_{m-i+1}$ and $D_{m-i+2}$ denote the (m-i+1)-th and (m-i+2)-th layers of the decoder, $f_{sa}(\cdot)$ denotes spatial-attention selection, $f_{fuse}(\cdot)$ denotes re-fusion, and $f_{up}(\cdot)$ denotes upsampling;
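One decoder step of step 3-2 can be sketched as below, under two assumptions of ours that the patent text does not pin down: the spatial-attention selection is a sigmoid over the channel-mean response, and the re-fusion is element-wise addition before upsampling:

```python
import numpy as np

def spatial_attention(feat):
    """Single-channel spatial attention map: sigmoid of the channel-mean
    response, broadcast back over all channels when multiplied."""
    score = feat.mean(axis=-1, keepdims=True)
    return 1.0 / (1.0 + np.exp(-score))

def upsample2x(x):
    """Nearest-neighbour upsampling standing in for the decoder upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decoder_step(m_i, d_layer):
    """One decoder step: weight the encoder fusion feature with spatial
    attention, re-fuse it with the current decoder layer by element-wise
    addition (our assumed fusion operator), then upsample."""
    selected = spatial_attention(m_i) * m_i   # attention-selected detail
    refused = selected + d_layer              # re-fusion (additive variant)
    return upsample2x(refused)

m_i = np.random.default_rng(1).normal(size=(4, 4, 8))   # encoder fusion feature
d_cur = np.zeros((4, 4, 8))                             # current decoder layer
d_next = decoder_step(m_i, d_cur)
# d_next.shape == (8, 8, 8)
```

Iterating this step from the deepest fusion feature back up reproduces the coarse-to-fine recovery of the segmentation mask.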
Step 4: training the semantic segmentation model;
training the semantic segmentation model by using a training sample set; inputting the RGB-D image of the training sample set into a semantic segmentation model, and segmenting through the steps 2 and 3;
Downsample the label images of the training sample set m-2 times in succession, so that the image obtained at each downsampling matches the scale of the feature maps of layers m-1 down to 2 of the feature sequences;
Use the label images of the training sample set and their m-2 successively downsampled versions as the supervision information for layers m down to 2, respectively, when training the semantic segmentation model;
The training objective is:

$$L=\sum_i \lambda_i\, l_i \tag{6}$$

where $l_i$ and $\lambda_i$ denote the weighted cross-entropy loss of the i-th layer and its weight; the weighted cross entropy is

$$l=-\,weight[class]\Big(x[class]-\log\sum_j \exp(x[j])\Big)$$

where class is the class index, $weight[class]$ is the frequency with which pixels of each class appear in the training set, and $x[\cdot]$ is the predicted segmentation result;
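The weighted cross-entropy term can be sketched as follows; this is the standard weighted log-softmax cross entropy (the form computed by PyTorch's nn.CrossEntropyLoss with a weight vector), written in NumPy so the sketch is self-contained:

```python
import numpy as np

def weighted_cross_entropy(x, target, weight):
    """Per-pixel weighted cross entropy: for class logits x[..., class],
    l = -weight[class] * (x[class] - log(sum_j exp(x[j]))),
    averaged over pixels with the class weights as averaging weights.
    """
    x = x - x.max(axis=-1, keepdims=True)            # numerical stability
    log_softmax = x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_softmax, target[..., None], axis=-1)[..., 0]
    w = weight[target]
    return -(w * picked).sum() / w.sum()

logits = np.array([[[2.0, 0.0], [0.0, 2.0]]])        # 1 x 2 pixels x 2 classes
target = np.array([[0, 1]])                          # ground-truth class per pixel
loss = weighted_cross_entropy(logits, target, np.array([1.0, 1.0]))
# both pixels are predicted correctly with margin 2, so the loss is small
```

The per-layer losses computed this way are combined as a weighted sum with the λ_i coefficients to form the overall training objective.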
Step 5: input the RGB-D image to be segmented into the trained semantic segmentation model to obtain the image segmentation result.
The specific embodiment is as follows:
In this embodiment, simulations are performed with PyTorch on a Linux system with an Intel(R) Core(TM) i7-6800K CPU @ 3.40 GHz and 60 GB of memory. The data used in the simulations come from the public NYUDv2 and SUN RGB-D datasets. NYUDv2 contains 1449 densely labelled RGB-D image pairs captured by a Microsoft Kinect, with 795 pairs for training and 654 pairs for testing. SUN RGB-D is currently the largest RGB-D semantic segmentation dataset, with 10,335 densely annotated RGB-D images taken from 20 different scenes and captured by four different sensors: Kinect v1, Kinect v2, Xtion and RealSense. The official split uses 5285 RGB-D image/label pairs for training, with the remaining 5050 pairs used for testing. Both datasets have 40 categories.
To demonstrate its effectiveness, the invention is compared against 3DGNN, RedNet, CFN, ACNet, PAP and SA-Gate on both datasets. 3DGNN is the method proposed in the document "X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3D Graph Neural Networks for RGBD Semantic Segmentation. In IEEE International Conference on Computer Vision, 2017, pp. 5209-5218"; RedNet is proposed in the document "J. Jiang, L. Zheng, F. Luo, and Z. Zhang. RedNet: Residual Encoder-Decoder Network for Indoor RGB-D Semantic Segmentation. Eprint Arxiv, 2018"; CFN is described in the document "D. Lin, G. Chen, D. Cohen-Or, P. A. Heng, and H. Huang. Cascaded Feature Network for Semantic Segmentation of RGB-D Images. In International Conference on Computer Vision, 2017, pp. 1320-1328"; PAP is proposed in the document "Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang. Pattern-Affinitive Propagation across Depth, Surface Normal and Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4106-4115"; SA-Gate is proposed in the document "X. Chen, K. Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng. Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. In European Conference on Computer Vision, 2020". FSFNet is the method proposed in this invention; mIoU and Pixel Acc. are the evaluation metrics for RGB-D semantic segmentation quality. The comparison results are shown in Table 1:
TABLE 1
As can be seen from Table 1, on the NYUDv2 dataset the invention is comparable to the current best algorithm on Pixel Acc.; on the SUN RGB-D dataset it outperforms the other algorithms on the mIoU metric.
FIG. 2 shows semantic segmentation results generated by the invention and the comparison algorithms. As seen from the figure, the invention segments objects of different classes, such as ceilings and tables, better than the comparison algorithms, which demonstrates that it effectively combines the features of the RGB and depth information. In addition, the invention distinguishes small objects and produces more accurate contour segmentation, demonstrating that it makes better use of the detail information lost during downsampling.
Claims (3)
1. A depth feature selection fusion-based RGB-D semantic segmentation method is characterized by comprising the following steps:
step 1: constructing a semantic segmentation model:
a dual-stream convolutional neural network is adopted as the semantic segmentation model, comprising two encoders and one decoder;
step 2: fusing RGB-D information;
step 2-1: preprocessing an image;
take the RGB-D images and their corresponding label images from a public training set as the training sample set, and uniformly resize all images in the training sample set to A × A; each RGB-D image consists of a visible-light image and a depth image;
encode the single-channel depth image into a three-channel HHA image;
step 2-2: input the visible-light image and the HHA image into the two encoders respectively for feature extraction, obtaining two feature sequences, each comprising m layers of feature maps, expressed as formula (1):

$$\{F_j^{rgb}\}_{j=1}^{m},\qquad \{F_j^{d}\}_{j=1}^{m},\qquad F_j^{rgb},\,F_j^{d}\in\mathbb{R}^{H\times W\times C} \tag{1}$$

where $F_j^{rgb}$ and $F_j^{d}$ are the j-th layer feature maps of the visible-light image and the depth image respectively, and H, W and C denote the height, width and number of channels of the feature map;
step 2-3: pass $F_j^{rgb}$ and $F_j^{d}$ through separate 1 × 1 convolutions to obtain the selected feature maps, expressed as formula (2):

$$\tilde{F}_j^{rgb}=f_{sel}^{rgb}(F_j^{rgb}),\qquad \tilde{F}_j^{d}=f_{sel}^{d}(F_j^{d}) \tag{2}$$

where $\tilde{F}_j^{rgb}$ and $\tilde{F}_j^{d}$ are the selected feature maps of the j-th layer of the visible-light image and the depth image respectively, and $f_{sel}^{rgb}(\cdot)$ and $f_{sel}^{d}(\cdot)$ denote selection by 1 × 1 convolution;
step 2-4: compute the fusion feature of layer j of the two feature sequences with formula (3):

$$M_j=f_{conv}\big(f_{concat}\big(f_{down}(M_{j-1}),\ \tilde{F}_j^{rgb},\ \tilde{F}_j^{d}\big)\big),\qquad j=2,\dots,m \tag{3}$$

where $M_j$ and $M_{j-1}$ are the fusion features of layers j and j-1 of the two feature sequences, $f_{conv}(\cdot)$ denotes convolution, $f_{down}(\cdot)$ denotes downsampling, and $f_{concat}(\cdot)$ denotes concatenation; the fusion feature $M_1$ of layer 1 of the two feature sequences is given by formula (4):

$$M_1=f_{conv}\big(f_{concat}\big(\tilde{F}_1^{rgb},\ \tilde{F}_1^{d}\big)\big) \tag{4}$$
step 3: reusing detail information;
step 3-1: take the fusion feature of the m-th layer of the two feature sequences as the first layer of the decoder, $D_1=M_m$; upsample it to obtain the second layer of the decoder, $D_2=f_{up}(M_m)$;
step 3-2: select the fusion features of layers 2 to m-1 of the two feature sequences with spatial attention, re-fuse each selected result with the corresponding fusion feature according to formula (5), and upsample the re-fused result to obtain the third to m-th layers of the decoder:

$$\hat{D}_{m-i+1}=f_{fuse}\big(f_{sa}(M_i),\ D_{m-i+1}\big),\qquad D_{m-i+2}=f_{up}\big(\hat{D}_{m-i+1}\big),\qquad i=m-1,\dots,2 \tag{5}$$

where $\hat{D}_{m-i+1}$ denotes the re-fused result at layer m-i+1, $M_i$ denotes the fusion feature of the i-th layer of the two feature sequences, $D_{m-i+1}$ and $D_{m-i+2}$ denote the (m-i+1)-th and (m-i+2)-th layers of the decoder, $f_{sa}(\cdot)$ denotes spatial-attention selection, $f_{fuse}(\cdot)$ denotes re-fusion, and $f_{up}(\cdot)$ denotes upsampling;
step 4: training the semantic segmentation model;
training the semantic segmentation model by using a training sample set; inputting the RGB-D image of the training sample set into a semantic segmentation model, and segmenting through the steps 2 and 3;
downsample the label images of the training sample set m-2 times in succession, so that the image obtained at each downsampling matches the scale of the feature maps of layers m-1 down to 2 of the feature sequences;
use the label images of the training sample set and their m-2 successively downsampled versions as the supervision information for layers m down to 2, respectively, when training the semantic segmentation model;
the training objective is:

$$L=\sum_i \lambda_i\, l_i \tag{6}$$

where $l_i$ and $\lambda_i$ denote the weighted cross-entropy loss of the i-th layer and its weight; the weighted cross entropy is

$$l=-\,weight[class]\Big(x[class]-\log\sum_j \exp(x[j])\Big)$$

where class is the class index, $weight[class]$ is the frequency with which pixels of each class appear in the training set, and $x[\cdot]$ is the predicted segmentation result;
step 5: input the RGB-D image to be segmented into the trained semantic segmentation model to obtain the image segmentation result.
2. The RGB-D semantic segmentation method based on depth feature selection fusion as claimed in claim 1, wherein A = 480, m = 4, λ_1 = 0.1, λ_2 = 0.1, λ_3 = 0.2.
3. The RGB-D semantic segmentation method based on depth feature selection fusion as claimed in claim 1, wherein in step 4, nearest neighbor interpolation downsampling is adopted when downsampling is performed m-2 times on the label images of the training sample set in sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110027615.3A CN112861911B (en) | 2021-01-10 | 2021-01-10 | RGB-D semantic segmentation method based on depth feature selection fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110027615.3A CN112861911B (en) | 2021-01-10 | 2021-01-10 | RGB-D semantic segmentation method based on depth feature selection fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112861911A true CN112861911A (en) | 2021-05-28 |
CN112861911B CN112861911B (en) | 2024-05-28 |
Family
ID=76002060
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110027615.3A Active CN112861911B (en) | 2021-01-10 | 2021-01-10 | RGB-D semantic segmentation method based on depth feature selection fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112861911B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762263A (en) * | 2021-08-17 | 2021-12-07 | 慧影医疗科技(北京)有限公司 | Semantic segmentation method and system for small-scale similar structure |
CN113920317A (en) * | 2021-11-15 | 2022-01-11 | 西北工业大学 | Semantic segmentation method based on visible light image and low-resolution depth image |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271990A (en) * | 2018-09-03 | 2019-01-25 | 北京邮电大学 | A kind of semantic segmentation method and device for RGB-D image |
CN110298361A (en) * | 2019-05-22 | 2019-10-01 | 浙江省北大信息技术高等研究院 | A kind of semantic segmentation method and system of RGB-D image |
CN110490884A (en) * | 2019-08-23 | 2019-11-22 | 北京工业大学 | A kind of lightweight network semantic segmentation method based on confrontation |
US20200134375A1 (en) * | 2017-08-01 | 2020-04-30 | Beijing Sensetime Technology Development Co., Ltd. | Semantic segmentation model training methods and apparatuses, electronic devices, and storage media |
US20200402300A1 (en) * | 2019-06-21 | 2020-12-24 | Harbin Institute Of Technology | Terrain modeling method that fuses geometric characteristics and mechanical charateristics, computer readable storage medium, and terrain modeling system thereof |
Non-Patent Citations (1)
Title |
---|
QIAN Zhengfang et al.: "A Brief Analysis of the Application of Deep Learning to Future Unmanned Surface Vehicle Platforms", Shipbuilding of China, 30 August 2020 (2020-08-30) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhou et al. | Salient object detection in stereoscopic 3D images using a deep convolutional residual autoencoder | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN108734210B (en) | Object detection method based on cross-modal multi-scale feature fusion | |
CN112581409B (en) | Image defogging method based on end-to-end multiple information distillation network | |
CN110751111B (en) | Road extraction method and system based on high-order spatial information global automatic perception | |
CN111563909A (en) | Semantic segmentation method for complex street view image | |
CN113657388A (en) | Image semantic segmentation method fusing image super-resolution reconstruction | |
CN114820579A (en) | Semantic segmentation based image composite defect detection method and system | |
CN112861911A (en) | RGB-D semantic segmentation method based on depth feature selection fusion | |
CN111488884A (en) | Real-time semantic segmentation method with low calculation amount and high feature fusion | |
CN115631513B (en) | Transformer-based multi-scale pedestrian re-identification method | |
CN116311254B (en) | Image target detection method, system and equipment under severe weather condition | |
WO2024040973A1 (en) | Multi-scale fused dehazing method based on stacked hourglass network | |
CN115527096A (en) | Small target detection method based on improved YOLOv5 | |
CN114913493A (en) | Lane line detection method based on deep learning | |
CN111860116A (en) | Scene identification method based on deep learning and privilege information | |
CN112766056A (en) | Method and device for detecting lane line in low-light environment based on deep neural network | |
CN114743027B (en) | Weak supervision learning-guided cooperative significance detection method | |
CN115905838A (en) | Audio-visual auxiliary fine-grained tactile signal reconstruction method | |
Sun et al. | TSINIT: a two-stage Inpainting network for incomplete text | |
CN117079237A (en) | Self-supervision monocular vehicle distance detection method | |
CN113920317B (en) | Semantic segmentation method based on visible light image and low-resolution depth image | |
CN112733934B (en) | Multi-mode feature fusion road scene semantic segmentation method in complex environment | |
CN114220098A (en) | Improved multi-scale full-convolution network semantic segmentation method | |
Gao et al. | RGBD semantic segmentation based on global convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||