CN113920317B - Semantic segmentation method based on visible light image and low-resolution depth image - Google Patents

Semantic segmentation method based on visible light image and low-resolution depth image

Info

Publication number
CN113920317B
CN113920317B (application number CN202111369121.XA)
Authority
CN
China
Prior art keywords
resolution
image
semantic segmentation
depth image
visible light
Prior art date
Legal status
Active
Application number
CN202111369121.XA
Other languages
Chinese (zh)
Other versions
CN113920317A (en)
Inventor
袁媛 (Yuan Yuan)
苏月皎 (Su Yuejiao)
姜志宇 (Jiang Zhiyu)
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202111369121.XA
Publication of CN113920317A
Application granted
Publication of CN113920317B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformation in the plane of the image
    • G06T3/40: Scaling the whole image or part thereof
    • G06T3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention provides a semantic segmentation method based on a visible light image and a low-resolution depth image. Based on multi-task learning, a super-resolution module network first processes the low-resolution depth image under the guidance of the high-resolution visible light image to obtain a high-resolution depth image, and a semantic segmentation module network then produces the semantic segmentation result. In this way the high-resolution visible light information available in practice is exploited, the resolution and quality of the semantic segmentation are ensured, the problem of resolution alignment is solved, and the application of semantic segmentation in real life is broadened.

Description

Semantic segmentation method based on visible light image and low-resolution depth image
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to a semantic segmentation method based on a visible light image and a low-resolution depth image.
Background
Visible light and depth image (RGB-Depth, abbreviated RGB-D) semantic segmentation is the segmentation of scene objects and regions using the visual appearance information of the scene together with the scene distance information acquired by a depth sensor. As an underlying core technology, RGB-D semantic segmentation has wide applications such as indoor navigation, autonomous driving, and machine vision.
With the development of deep learning methods, RGB-D semantic segmentation technology has improved substantially and is widely applied to practical vision tasks. Gupta et al., in the document "S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. In European Conference on Computer Vision, 2014, pp. 345-360", propose a two-stream encoder-decoder network containing two encoder branches that extract visible light features and depth features, respectively, to predict the segmentation result. Most subsequent RGB-D semantic segmentation methods adopt such a two-stream network as the basis of the RGB-D semantic segmentation network to extract, fuse and upsample multi-modal features. Chen et al., in the document "X. Chen, K.-Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng. Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. In European Conference on Computer Vision, 2020", propose gate-based bi-directional cross-modal feature propagation, improving the feature fusion of the two modalities.
While the above methods all achieve good results, they usually rely on a large number of pairs of visible light and depth images of the same resolution as training data. In real situations, however, the resolution of the depth image is often low due to the depth sensor principle and hardware limitations, whereas visible light cameras have developed rapidly and the acquired visible light images have higher resolution. For this reason, many existing studies capture or downsample the visible light image to a lower resolution so that it matches the resolution of the depth image. On the one hand this does not fully utilize the high-resolution visible light image information; on the other hand it limits the generalization ability of the RGB-D semantic segmentation model and its ability to solve practical problems in real life.
Disclosure of Invention
In order to overcome the defect that existing RGB-D semantic segmentation methods cannot process RGB-D image pairs of non-aligned resolution, the invention provides a semantic segmentation method based on a visible light image and a low-resolution depth image. Based on multi-task learning, a super-resolution module network first processes the high-resolution visible light image and the low-resolution depth image to obtain a high-resolution depth image, and a semantic segmentation module network then produces the semantic segmentation result. In this way the high-resolution visible light information available in practice is exploited, the resolution and quality of the semantic segmentation are ensured, the problem of resolution alignment is solved, and the application of semantic segmentation in real life is broadened.
A semantic segmentation method based on a visible light image and a low-resolution depth image is characterized by comprising the following steps:
step 1: downsampling the depth image in the input RGB-D data set to obtain a low-resolution depth image; taking a visible light image and a low-resolution depth image in an RGB-D data set as input data, taking the depth image as supervision information, and training a super-resolution module network to obtain a trained network;
the RGB-D data set is a public visible light and depth image data set;
the super-resolution module network comprises two parallel encoders, a fusion module and a decoder, wherein each encoder extracts features from its input image to obtain a corresponding feature map; the fusion module performs additive fusion on the feature maps output by the two encoders to obtain a fused feature map; the decoder performs up-sampling on the fused feature map to obtain a depth image with the same resolution as the input visible light image;
step 2: taking a visible light image in the RGB-D data set and a depth image obtained by the super-resolution module as input data, taking the middle two-layer parameters of the encoder of the trained super-resolution module network as the supervision information of the middle two-layer parameters of the encoder of the semantic segmentation module network, and training the semantic segmentation module network to obtain the trained network;
the semantic segmentation module network comprises two parallel encoders, a fusion module and a decoder, wherein each encoder extracts features from its input image to obtain a corresponding feature map; the fusion module performs additive fusion on the feature maps output by the two encoders to obtain a fused feature map; the decoder performs up-sampling on the fused feature map, and the resulting image is the semantic segmentation result image;
step 3: the RGB-D data obtained through real acquisition is input into a trained super-resolution module network, and then is output into a semantic segmentation result image through a trained semantic segmentation module network.
Specifically, the loss function L_SR of the super-resolution module network is:

where N represents the number of high- and low-resolution depth image pairs, i represents the index of an image pair, D_HR^i represents the original depth image in the RGB-D dataset, D_LR^i represents the low-resolution depth image obtained by downsampling, I_rgb represents the visible light image in the RGB-D dataset, W_sr represents the super-resolution module network parameters, and f_sr(·) represents the super-resolution module network processing.
Specifically, the encoders in the super-resolution module network and the semantic segmentation module network are both 4-layer convolution networks, and the loss function L_seg of the semantic segmentation module network is:

where L(x, class) represents the weighted cross entropy between the predicted segmentation result and the ground-truth label, x represents the predicted semantic segmentation result, class represents the object category, W_sr^i represents the parameters of the i-th layer of the encoder in the super-resolution module network, and W_seg^i represents the parameters of the i-th layer of the encoder in the semantic segmentation module network;

the calculation formula of L(x, class) is:

where weight[class] represents the weight of the class class, whose value equals the proportion of the number of pixels of that class in the dataset to the total number of pixels, x[class] represents the class-th channel of the output feature map, j represents the position of a pixel, and x[j] represents the probability that pixel j is predicted as the class class.
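As an illustrative, non-limiting sketch (assuming a PyTorch implementation, consistent with the experiments described later; the function names are hypothetical), the weighted cross entropy with dataset-frequency class weights described above could be computed as follows:

```python
import torch
import torch.nn as nn

def class_pixel_weights(label_maps, num_classes):
    """weight[class]: proportion of pixels of each class in the dataset,
    as described above. Classes absent from the dataset receive weight 0."""
    counts = torch.zeros(num_classes)
    for labels in label_maps:  # each labels: (H, W) integer tensor
        counts += torch.bincount(labels.flatten(), minlength=num_classes).float()
    return counts / counts.sum()

def weighted_cross_entropy(logits, labels, weights):
    """Weighted cross entropy between predicted segmentation logits
    (N, C, H, W) and ground-truth label maps (N, H, W)."""
    return nn.CrossEntropyLoss(weight=weights)(logits, labels)

# Hypothetical usage with random data (40 classes, as in SUN RGB-D):
labels = [torch.randint(0, 40, (480, 640)) for _ in range(4)]
w = class_pixel_weights(labels, num_classes=40)
logits = torch.randn(4, 40, 480, 640)
print(weighted_cross_entropy(logits, torch.stack(labels), w))
```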
The beneficial effects of the invention are as follows: the introduction of the super-resolution subtask bridges the gap between the depth resolution and the visible light resolution obtainable in real life, makes full use of the high-resolution visible light information available in practice, and therefore has practical and industrial value; the correlation between the super-resolution subtask and the semantic segmentation subtask is used to assist in optimizing the semantic segmentation subtask, which improves the accuracy of the semantic segmentation network. The invention is suitable for vehicle-mounted assistance systems, automatic driving systems, indoor autonomous navigation systems and the like, and has good practical value.
Drawings
FIG. 1 is a flow chart of a semantic segmentation method based on a visible light image and a low resolution depth image according to the present invention;
FIG. 2 is a comparison of depth image results of the super-resolution module network of the present invention;
in the figure: (a) the original depth image; (b) the depth image output by the super-resolution module network; (c) the actual high-resolution depth image; (d) the high-resolution visible light image;
FIG. 3 is a comparison of segmentation results of the present invention;
in the figure: (a) the input depth image; (b) the input visible light image; (c) the segmentation result image of the present invention; (d) the ground-truth segmentation image.
Detailed Description
The invention is further illustrated below with reference to the figures and an embodiment; the invention includes but is not limited to the following embodiment.
As shown in fig. 1, the invention provides a semantic segmentation method based on a visible light image and a low-resolution depth image, which comprises the following specific implementation processes:
the visible light sensor in real life has higher resolution than the depth sensor, and in order to enable the method to be applied to practice, the processing object of the invention is a high-resolution visible light image I rgb And corresponding low resolution depth imageThe resolutions of two modal images in the existing large RGB-D data set are consistent, so that in order to train the super-resolution subtask, the depth images in the RGB-D data set are firstly downsampled to obtain the depth images with relatively low resolution
1. Super-resolution subtasks
The visible light image I_rgb and the low-resolution depth image D_LR are taken as input to train the super-resolution module network. The super-resolution module network comprises two parallel encoders, a fusion module and a decoder; the two encoder branches extract features from the high-resolution visible light image and the low-resolution depth image, and the extracted features are fused and passed into the decoder for up-sampling. The original high-resolution depth information D_HR in the dataset is used as the supervisory signal for the generated high-resolution depth image, thereby training the super-resolution module network:

where N represents the number of high- and low-resolution depth image pairs, i represents the index of an image pair, D_HR^i represents the original depth image of the i-th pair in the RGB-D dataset, D_LR^i represents the low-resolution depth image of the i-th pair obtained by downsampling, I_rgb^i represents the visible light image of the i-th pair in the RGB-D dataset, W_sr represents the super-resolution module network parameters, and f_sr(·) represents the super-resolution module network processing.
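As a minimal sketch of this training objective (assuming PyTorch; the choice of an L1 reconstruction penalty and averaging over the batch are assumptions, since the exact formula is not reproduced here):

```python
import torch.nn.functional as F

def sr_loss(f_sr, rgb_hr, depth_lr, depth_hr):
    """L_SR over a batch of N image pairs: compare the super-resolution
    module's prediction f_sr(I_rgb^i, D_LR^i; W_sr) with the original
    high-resolution depth D_HR^i."""
    pred = f_sr(rgb_hr, depth_lr)     # (N, 1, H, W) predicted HR depth
    return F.l1_loss(pred, depth_hr)  # averaged over the N pairs

# Hypothetical training step:
# optimizer.zero_grad(); loss = sr_loss(sr_net, rgb, d_lr, d_hr)
# loss.backward(); optimizer.step()
```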
2. Semantic segmentation subtasks
The super-resolution subtask thus obtains an aligned-resolution RGB-D image pair from a non-aligned-resolution RGB-D image pair. The predicted aligned RGB-D image pair is input into the semantic segmentation module network, and the features of the visible light image and of the depth image predicted by the super-resolution module network are extracted by two encoders comprising K convolution layers, where K is the number of network layers and can be adjusted according to the actual situation. The extracted features are fused and passed into a decoder for up-sampling, finally yielding the segmentation result.
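Both module networks share the two-encoder, additive-fusion, decoder structure described above. A minimal sketch under stated assumptions follows (PyTorch; the channel widths, strides and decoder design are assumptions, only the overall structure is taken from the description; K = 4 layers as stated above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """A K-layer convolutional encoder (K = 4); widths/strides are assumptions."""
    def __init__(self, in_ch, widths=(32, 64, 128, 256)):
        super().__init__()
        layers, prev = [], in_ch
        for w in widths:
            layers += [nn.Conv2d(prev, w, 3, stride=2, padding=1),
                       nn.BatchNorm2d(w), nn.ReLU(inplace=True)]
            prev = w
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

class DualEncoderNet(nn.Module):
    """Two parallel encoders, additive fusion, and an up-sampling decoder.
    With out_ch = 1 this sketches the super-resolution module (high-resolution
    depth output); with out_ch = num_classes it sketches the semantic
    segmentation module."""
    def __init__(self, out_ch, rgb_in_ch=3, depth_in_ch=1):
        super().__init__()
        self.rgb_encoder = Encoder(rgb_in_ch)
        self.depth_encoder = Encoder(depth_in_ch)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )

    def forward(self, rgb, depth):
        if depth.shape[-2:] != rgb.shape[-2:]:
            # Assumption: bring a low-resolution depth input up to the RGB
            # resolution before encoding so the feature maps align for fusion.
            depth = F.interpolate(depth, size=rgb.shape[-2:],
                                  mode="bilinear", align_corners=False)
        fused = self.rgb_encoder(rgb) + self.depth_encoder(depth)  # additive fusion
        return self.decoder(fused)

# Example: sr_net = DualEncoderNet(out_ch=1); seg_net = DualEncoderNet(out_ch=40)
```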
In order to improve the generalization performance of the model, the training of the semantic segmentation module network and the optimization of its loss function adopt a two-part joint scheme. One part is the cross entropy loss between the predicted segmentation result and the ground-truth label in the dataset; the other part is a coupling constraint between the super-resolution module network and the semantic segmentation module network: after the super-resolution module network has been trained, the parameters of the middle two layers of its encoder are taken out and used as supervision information for the middle two-layer parameters of the semantic segmentation module network's encoder, assisting the training of those layers. The following objective function is used to optimize the semantic segmentation module network parameters:

where weight[class] represents the weight of the class class, whose value equals the proportion of the number of pixels of that class in the dataset to the total number of pixels, x[class] represents the class-th channel of the output feature map, j represents the position of a pixel, and x[j] represents the probability, ranging from 0 to 1, that pixel j is predicted as the class class.
3. Model application
After the model parameters have been optimized and learned, the highest-resolution visible light image that the visible light sensor can acquire in a real scene and the highest-resolution depth image that the depth sensor can acquire are input into the super-resolution module network, yielding a depth image whose resolution is aligned upward to that of the visible light image. The predicted depth image and the visible light image acquired by the visible light sensor are then input into the semantic segmentation module network to predict the segmentation result.
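The two-stage inference can be sketched as follows (PyTorch; function and argument names are hypothetical):

```python
import torch

@torch.no_grad()
def segment_scene(sr_net, seg_net, rgb_hr, depth_lr):
    """Stage 1: the super-resolution module lifts the low-resolution depth map
    to the visible-light resolution. Stage 2: the semantic segmentation module
    predicts the per-pixel class map from the aligned-resolution pair."""
    depth_hr_pred = sr_net(rgb_hr, depth_lr)   # aligned-resolution depth
    logits = seg_net(rgb_hr, depth_hr_pred)    # (N, num_classes, H, W)
    return logits.argmax(dim=1)                # (N, H, W) label map
```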
In order to verify the effectiveness of the method, a simulation experiment was carried out with PyTorch on a Linux operating system with an Intel(R) Core(TM) i7-6800K CPU @ 3.40 GHz and 60 GB of memory. The data used in the experiments is the public SUN RGB-D dataset. SUN RGB-D is currently the largest RGB-D semantic segmentation dataset, containing 10335 annotated RGB-D images over 40 classes, of which 5285 pairs are used for training and 5050 pairs for testing. It was captured by four different sensors: Kinect V1, Kinect V2, Xtion and RealSense.
To demonstrate the effectiveness of the algorithm, RedNet, ACNet, PAP and FSFNet were chosen as comparison methods. RedNet is proposed in "J. Jiang, L. Zheng, F. Luo, and Z. Zhang. RedNet: Residual Encoder-Decoder Network for Indoor RGB-D Semantic Segmentation. arXiv preprint, 2018"; PAP is proposed in "Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, J. Yang. Pattern-Affinitive Propagation Across Depth, Surface Normal and Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4106-4115"; FSFNet is proposed in "Y. Su, Y. Yuan, Z. Jiang. Deep Feature Selection-and-Fusion for RGB-D Semantic Segmentation. In International Conference on Multimedia & Expo, 2021"; ACNet is proposed in "X. Hu, K. Yang, L. Fei, and K. Wang. ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. In IEEE International Conference on Image Processing, 2019, pp. 1440-1444". The mean intersection over union (mIoU) and pixel accuracy (Pixel Acc) are calculated to evaluate the RGB-D semantic segmentation quality; the larger the index value, the better the segmentation. The calculation results are shown in Table 1, from which it can be seen that both indexes of the invention perform better than those of the other methods.
TABLE 1
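The two reported indexes can be computed from a class confusion matrix; a minimal sketch (assuming PyTorch and integer label maps; the function name is hypothetical) is:

```python
import torch

def segmentation_metrics(preds, labels, num_classes):
    """Compute mean intersection-over-union (mIoU) and pixel accuracy from
    predicted and ground-truth integer label maps; classes that never appear
    in prediction or ground truth are ignored in the mIoU average."""
    conf = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for p, g in zip(preds, labels):
        idx = g.flatten() * num_classes + p.flatten()
        conf += torch.bincount(idx, minlength=num_classes ** 2) \
                     .reshape(num_classes, num_classes)
    tp = conf.diag().float()
    union = conf.sum(0).float() + conf.sum(1).float() - tp
    iou = tp[union > 0] / union[union > 0]
    pixel_acc = tp.sum() / conf.sum().float()
    return iou.mean().item(), pixel_acc.item()

# Hypothetical usage on one image pair:
pred = torch.randint(0, 40, (480, 640))
gt = torch.randint(0, 40, (480, 640))
print(segmentation_metrics([pred], [gt], num_classes=40))
```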
Fig. 2 shows that the super-resolution subtask of the invention can super-resolve non-aligned-resolution RGB-D image data into aligned-resolution RGB-D data, thereby laying the foundation for the subsequent semantic segmentation subtask.
Fig. 3 shows a semantic segmentation result image and a real segmentation image obtained by the method of the invention when a low resolution depth image and a visible light image are input. From the above, the invention can achieve better semantic segmentation performance under the condition that the resolution of the input depth image is lower than that of the visible light image.

Claims (3)

1. A semantic segmentation method based on a visible light image and a low-resolution depth image is characterized by comprising the following steps:
step 1: downsampling the depth image in the input RGB-D data set to obtain a low-resolution depth image; taking a visible light image and a low-resolution depth image in an RGB-D data set as input data, taking the depth image as supervision information, and training a super-resolution module network to obtain a trained network;
the RGB-D data set is a public visible light and depth image data set;
the super-resolution module network comprises two parallel encoders, a fusion module and a decoder, wherein each encoder extracts features from its input image to obtain a corresponding feature map; the fusion module performs additive fusion on the feature maps output by the two encoders to obtain a fused feature map; the decoder performs up-sampling on the fused feature map to obtain a depth image with the same resolution as the input visible light image;
step 2: taking a visible light image in the RGB-D data set and a depth image obtained by the super-resolution module as input data, taking the middle two-layer parameters of the encoder of the trained super-resolution module network as the supervision information of the middle two-layer parameters of the encoder of the semantic segmentation module network, and training the semantic segmentation module network to obtain the trained network;
the semantic segmentation module network comprises two parallel encoders, a fusion module and a decoder, wherein each encoder extracts features from its input image to obtain a corresponding feature map; the fusion module performs additive fusion on the feature maps output by the two encoders to obtain a fused feature map; the decoder performs up-sampling on the fused feature map, and the resulting image is the semantic segmentation result image;
step 3: the RGB-D data obtained through real acquisition is input into a trained super-resolution module network, and then is output into a semantic segmentation result image through a trained semantic segmentation module network.
2. The semantic segmentation method based on the visible light image and the low-resolution depth image according to claim 1, wherein the loss function L_SR of the super-resolution module network is:

where N represents the number of high- and low-resolution depth image pairs, i represents the index of an image pair, D_HR^i represents the original depth image in the RGB-D dataset, D_LR^i represents the low-resolution depth image obtained by downsampling, I_rgb represents the visible light image in the RGB-D dataset, W_sr represents the super-resolution module network parameters, and f_sr(·) represents the super-resolution module network processing.
3. The semantic segmentation method based on the visible light image and the low-resolution depth image according to claim 1 or 2, wherein the encoders in the super-resolution module network and the semantic segmentation module network are both 4-layer convolution networks, and the loss function L_seg of the semantic segmentation module network is:

where L(x, class) represents the weighted cross entropy between the predicted segmentation result and the ground-truth label, x represents the predicted semantic segmentation result, class represents the object category, W_sr^i represents the parameters of the i-th layer of the encoder in the super-resolution module network, and W_seg^i represents the parameters of the i-th layer of the encoder in the semantic segmentation module network;

the calculation formula of L(x, class) is:

where weight[class] represents the weight of the class class, whose value equals the proportion of the number of pixels of that class in the dataset to the total number of pixels, x[class] represents the class-th channel of the output feature map, j represents the position of a pixel, and x[j] represents the probability that pixel j is predicted as the class class.
CN202111369121.XA 2021-11-15 2021-11-15 Semantic segmentation method based on visible light image and low-resolution depth image Active CN113920317B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111369121.XA CN113920317B (en) 2021-11-15 2021-11-15 Semantic segmentation method based on visible light image and low-resolution depth image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111369121.XA CN113920317B (en) 2021-11-15 2021-11-15 Semantic segmentation method based on visible light image and low-resolution depth image

Publications (2)

Publication Number Publication Date
CN113920317A (en) 2022-01-11
CN113920317B (en) 2024-02-27

Family

ID=79247396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111369121.XA Active CN113920317B (en) 2021-11-15 2021-11-15 Semantic segmentation method based on visible light image and low-resolution depth image

Country Status (1)

Country Link
CN (1) CN113920317B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908531B (en) * 2023-03-09 2023-06-13 深圳市灵明光子科技有限公司 Vehicle-mounted ranging method and device, vehicle-mounted terminal and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634296A (en) * 2020-10-12 2021-04-09 深圳大学 RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN112861911A (en) * 2021-01-10 2021-05-28 西北工业大学 RGB-D semantic segmentation method based on depth feature selection fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11263756B2 (en) * 2019-12-09 2022-03-01 Naver Corporation Method and apparatus for semantic segmentation and depth completion using a convolutional neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634296A (en) * 2020-10-12 2021-04-09 深圳大学 RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN112861911A (en) * 2021-01-10 2021-05-28 西北工业大学 RGB-D semantic segmentation method based on depth feature selection fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王子羽 (Wang Ziyu); 张颖敏 (Zhang Yingmin); 陈永彬 (Chen Yongbin); 王桂棠 (Wang Guitang). Optimization of semantic segmentation networks for indoor scenes based on RGB-D images. Automation & Information Engineering, 2020, (02), full text. *

Also Published As

Publication number Publication date
CN113920317A (en) 2022-01-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant