CN112699889A - Unmanned real-time road scene semantic segmentation method based on multi-task supervision

Unmanned real-time road scene semantic segmentation method based on multi-task supervision

Info

Publication number
CN112699889A
Authority
CN
China
Prior art keywords: image, layer, road scene, semantic segmentation, real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110017471.3A
Other languages
Chinese (zh)
Inventor
周武杰
林鑫杨
钱小鸿
万健
甘兴利
叶宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110017471.3A
Publication of CN112699889A
Legal status: Withdrawn (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses an unmanned real-time road scene semantic segmentation method based on multi-task supervision, applied in the technical field of road scene semantic segmentation, which comprises the following steps: selecting the color images and thermal images of Q original road scene images and the corresponding real semantic segmentation images to form a training set; constructing a convolutional neural network that uses the MobileNetV2 lightweight network as a feature extractor, an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image, and a dense connection structure to fuse multi-level features; inputting the color images and thermal images of the original road scene images in the training set into the convolutional neural network for training to obtain predicted images; calculating the loss function values between the predicted images and the corresponding real images; and obtaining the final weight vector and final bias term according to the loss function values. The invention improves image segmentation efficiency and accuracy and meets real-time requirements.

Description

Unmanned real-time road scene semantic segmentation method based on multi-task supervision
Technical Field
The invention relates to the technical field of semantic segmentation of unmanned road scenes, in particular to an unmanned real-time road scene semantic segmentation method based on multi-task supervision.
Background
With the continuous development of autonomous driving, computer vision and natural language processing technologies, unmanned vehicles are gradually becoming a common part of daily life. While driving, an unmanned vehicle must accurately understand its surroundings in real time and react quickly to emergencies in order to avoid traffic accidents. Efficient and accurate road scene semantic segmentation has therefore become one of the research hot spots in the field of computer vision.
The semantic segmentation task is a basic task of image understanding and an important problem to be solved in the field of computer vision. Over the past few years, deep learning techniques, particularly convolutional neural networks, have shown great potential in semantic segmentation. In general, the fully convolutional network architectures used for semantic segmentation can be divided into two categories: those based on an encoder-decoder structure and those based on a dilated (atrous) convolution structure. An encoder-decoder architecture first uses the encoder to extract image features and then uses the decoder to recover the spatial resolution; a dilated convolution structure uses dilated convolutions to enlarge the overall receptive field while reducing the loss of spatial information in the encoding stage, so that the model can take the global semantic information into account.
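As a minimal illustration of the trade-off between these two designs, the short PyTorch sketch below (channel counts and tensor sizes are arbitrary assumptions, not taken from this disclosure) shows that a 3 × 3 convolution with a dilation rate of 4 covers a 9 × 9 neighbourhood per output pixel while the feature map keeps its full 480 × 640 resolution, which is exactly what makes the dilated design memory-hungry:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 480, 640)        # feature map kept at full resolution

# A standard 3x3 convolution covers a 3x3 neighbourhood per output pixel.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# A 3x3 convolution with dilation 4 covers a 9x9 neighbourhood (larger
# receptive field), while the output stays 480x640 because the padding is
# matched to the dilation rate; the full-resolution activations must be
# kept in memory for every such layer.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)

print(conv(x).shape, dilated(x).shape)  # both: torch.Size([1, 64, 480, 640])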
Although the dilated convolution structure has the advantage of preserving spatial information, keeping a high spatial resolution throughout the network without downsampling consumes much more memory, greatly slows down the inference of the model, and cannot meet real-time requirements. In addition, because a convolutional network learns richer features as its depth increases, the high memory consumption makes it difficult to give the network a deeper structure.
Therefore, a problem that urgently needs to be solved by those skilled in the art is to provide an unmanned real-time road scene semantic segmentation method with high segmentation efficiency and high segmentation accuracy that can meet real-time requirements.
Disclosure of Invention
For night road scenes, in which poor illumination conditions bring great challenges to scene understanding, the invention provides an unmanned real-time road scene semantic segmentation method based on multi-task supervision, which combines low-level and high-level feature information, uses a dense connection structure for image decoding, uses the MobileNetV2 lightweight network as a feature extractor, and uses an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image.
In order to achieve the above purpose, the invention provides the following technical scheme:
a multitask supervision-based unmanned real-time road scene semantic segmentation method comprises the following specific steps:
selecting the color images and thermal images of Q original road scene images and the corresponding real foreground-background images, real semantic segmentation images and real boundary images to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network uses the MobileNetV2 lightweight network as a feature extractor, an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image, and a dense connection structure to fuse multi-level features;
inputting the color images and thermal images of the original road scene images in the training set as original input images into the convolutional neural network for training to obtain the corresponding foreground-background prediction images, semantic segmentation prediction images and boundary prediction images;
calculating the loss function values between the foreground-background prediction images, semantic segmentation prediction images and boundary prediction images obtained by training and the corresponding real foreground-background images, real semantic segmentation images and real boundary images;
and repeating the training and the calculation of the loss function values, and taking the last training result as the final weight vector and the final bias term.
Further, the Q original road scene images are images in a road scene image database reported in the MFNet.
Further, the convolutional neural network comprises an input layer, a feature extraction layer, a feature fusion layer and a multi-task output layer;
the input layer comprises a color image input layer and a thermal image input layer, which receive the color image and the thermal image, respectively;
the feature extraction layer performs layer-by-layer feature extraction on the color image and the thermal image and extracts the deep semantic features of the images;
the feature fusion layer fuses multi-level features using a dense connection structure;
and the multi-task output layer outputs the foreground-background prediction image, the semantic segmentation prediction image and the boundary prediction image.
Further, the MobileNetV2 network removes the last two inverted residual structures and the classification layer, and the remaining part is divided into 3 blocks, wherein the color image input branch corresponds to R_Block_i, i = 1, 2, 3, and the thermal image input branch corresponds to T_Block_i, i = 1, 2, 3.
Furthermore, the output result of each module of the thermal image input branch and the output result of the corresponding module of the color image input branch are fused by element-wise addition of the corresponding features.
Further, the dense connection structure comprises a fused upsampling module, which consists of a 1 × 1 convolutional layer, a batch normalization layer and a ReLU6 activation function, a double upsampling layer, a 3 × 3 depthwise convolutional layer, a batch normalization layer and a ReLU6 activation function, and a 1 × 1 convolutional layer and a batch normalization layer.
The efficient atrous spatial feature pyramid structure comprises a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer, a ReLU6 activation function, three parallel shallow structures and one deep structure. The first shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 2 and a dilation rate of 2, a batch normalization layer and a ReLU6 activation function; the second shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 4 and a dilation rate of 4, a batch normalization layer and a ReLU6 activation function; and the third shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 8 and a dilation rate of 8. The deep structure comprises a 3 × 3 depthwise convolutional layer with a stride of 2 and padding of 1, a batch normalization layer and a ReLU6 activation function, a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer and a ReLU6 activation function, and a parallel structure. The parallel structure comprises the combination of a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 2 and a dilation rate of 2, a batch normalization layer, a ReLU6 activation function and double upsampling; the combination of a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 4 and a dilation rate of 4, a batch normalization layer, a ReLU6 activation function and double upsampling; and the combination of an adaptive maximum pooling layer and 10-fold upsampling. Finally, the outputs are concatenated and passed through a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 96 filters, a batch normalization layer and a ReLU6 activation function.
According to the above technical solution, compared with the prior art, the invention has the following beneficial effects:
1) The method takes the thermal image information as a supplement to the color image information and fuses the thermal image features with the color image features, so that objects can be predicted accurately even at night.
2) The method uses the MobileNetV2 lightweight network as the feature extractor, so that the model can meet real-time requirements, and uses an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image, which further improves the accuracy of the convolutional neural network model.
3) When constructing the convolutional neural network, the method combines low-level and high-level feature information and uses a dense connection structure to decode the image and fuse multi-level features, so that classification targets of various sizes in the road scene can be segmented accurately, which effectively improves the semantic segmentation accuracy of road scene images.
4) The method uses multi-task supervision to improve model performance through the correlation among the multiple tasks.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a structural diagram of the efficient atrous spatial feature pyramid (eASPP) used in the method of the present invention;
FIG. 3 is a block diagram of a fused upsampling module FU of the method of the present invention;
FIGS. 4a and 4b are the 1st original road scene color image and thermal image of the same scene;
FIGS. 4c, 4d and 4e are the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image, respectively, obtained by predicting the original road scene images shown in FIGS. 4a and 4b with the method of the present invention;
FIGS. 5a and 5b are the 2nd original road scene color image and thermal image of the same scene;
FIGS. 5c, 5d and 5e are the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image, respectively, obtained by predicting the original road scene images shown in FIGS. 5a and 5b with the method of the present invention;
FIGS. 6a and 6b are the 3rd original road scene color image and thermal image of the same scene;
FIGS. 6c, 6d and 6e are the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image, respectively, obtained by predicting the original road scene images shown in FIGS. 6a and 6b with the method of the present invention;
FIGS. 7a and 7b are the 4th original road scene color image and thermal image of the same scene;
FIGS. 7c, 7d and 7e are the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image, respectively, obtained by predicting the original road scene images shown in FIGS. 7a and 7b with the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a multitask supervision-based unmanned real-time road scene semantic segmentation method, and the overall implementation block diagram is shown in figure 1.
The method comprises a training stage and a testing stage, wherein the training stage comprises the following specific steps:
step S101: selecting a color image and a thermal image of Q original road scene images, a corresponding real foreground background image, a real semantic segmentation image and a real boundary image to form a training set;
step S102: constructing a convolutional neural network, wherein the convolutional neural network uses a MobileNet V2 lightweight network as a feature extractor, uses an improved high-efficiency void space feature pyramid structure to extract deep semantic features of an image, and uses a dense connection structure to fuse multi-level features;
step S103: inputting color images and thermal images of original road scene images in a training set as original input images into a convolutional neural network for training to obtain corresponding foreground and background prediction images, semantic segmentation prediction images and boundary prediction images;
step S104: calculating a loss function value between a predicted image obtained by training and a corresponding original road scene image;
step S105: and repeating training and calculating a loss function value, and determining the last training result as a final weight vector and a final bias item.
In the embodiment of the present invention, step S101 specifically includes: selecting the color images and thermal images of Q original road scene images and the corresponding real foreground-background images, real semantic segmentation images and real boundary images to form a training set. The q-th original road scene color image in the training set is denoted {I_q^RGB(i,j)}, the corresponding thermal image is denoted {I_q^T(i,j)}, the corresponding real semantic segmentation image is denoted {G_q^seg(i,j)}, and the corresponding real foreground-background image and real boundary image are denoted {G_q^fb(i,j)} and {G_q^bd(i,j)}, respectively. Here Q = 1176 is the number of training samples, q is a positive integer with 1 ≤ q ≤ Q, 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W is the width of the input image and H is the height of the input image (e.g. W = 640, H = 480); I_q^RGB(i,j), I_q^T(i,j), G_q^fb(i,j), G_q^seg(i,j) and G_q^bd(i,j) denote the pixel values of the pixel at coordinate position (i,j) in the respective images. In this embodiment, 1176 images in the road scene image database reported in MFNet are directly selected as the original road scene images.
Step S102: constructing a convolutional neural network, wherein the convolutional neural network comprises an input layer, a feature extraction layer, a feature fusion layer and a multitask output layer;
the characteristic extraction layer consists of two main networks of MobileNet V2 and an improved high-efficiency hollow space characteristic pyramid structure; the feature fusion layer uses a dense connection structure to repeatedly utilize high-level features for image decoding; the multitask output layer uses the fusion features as input and outputs a semantic prediction graph, a boundary prediction graph and a foreground and background prediction graph.
Specifically, the input layer includes a color image input layer and a thermal image input layer, which receive an RGB color image and a thermal image, respectively; both input images are required to have a width of W and a height of H.
For the feature extraction layer, the method uses MobileNetV2 as the backbone feature extractor, removes its last two inverted residual structures and the classification layer, and divides the remaining part into 3 blocks; the detailed partition is shown in Table 1. As shown in FIG. 1, for the color image input branch the corresponding structures are defined as R_Block_i, i = 1, 2, 3; for the thermal image input branch the corresponding structures are defined as T_Block_i, i = 1, 2, 3. The output of each module of the thermal image branch is fused with the output of the corresponding module of the color image branch by element-wise addition, and the fused features of the different levels are defined, from shallow to deep, as feature O_4, feature O_3 and feature O_2. In Table 1, t is the internal parameter of the bottleneck layer, c is the number of output channels, n is the number of times the module is repeated, and s is the downsampling factor of the block.
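A minimal sketch of this two-branch feature extractor is given below. It relies on the torchvision implementation of MobileNetV2; the split points of the three blocks are assumptions, since the exact partition is given in Table 1 (published as an image), and feeding the fused result into the next color-branch block is likewise an assumed wiring. The thermal input is assumed to be replicated to three channels so that the standard MobileNetV2 stem can consume it.

import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class DualBranchEncoder(nn.Module):
    # Two MobileNetV2 backbones (classifier and last stages excluded by the
    # slicing below); outputs of corresponding blocks are fused by
    # element-wise addition, giving features O4, O3, O2 from shallow to deep.
    def __init__(self):
        super().__init__()
        def make_blocks():
            feats = mobilenet_v2().features
            # assumed split points; the patent's partition is in Table 1
            return nn.ModuleList([feats[:4], feats[4:7], feats[7:14]])
        self.r_blocks = make_blocks()   # R_Block1..3, color branch
        self.t_blocks = make_blocks()   # T_Block1..3, thermal branch

    def forward(self, rgb, thermal):
        fused = []
        r, t = rgb, thermal             # thermal assumed replicated to 3 channels
        for r_blk, t_blk in zip(self.r_blocks, self.t_blocks):
            r, t = r_blk(r), t_blk(t)
            r = r + t                   # element-wise addition of corresponding features
            fused.append(r)
        o4, o3, o2 = fused              # shallow -> deep
        return o4, o3, o2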
Further, in order to enlarge the receptive field of the model, the invention uses the efficient atrous spatial feature pyramid structure (eASPP) shown in FIG. 2 to extract the deep semantic features of the image. It takes feature O_2 as input and produces feature O_1. A depthwise convolution is defined as a grouped convolution whose number of groups equals the number of input feature channels. Feature O_2 is first passed through a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer and a ReLU6 activation function, and the resulting feature is then fed into three parallel shallow structures and one deep structure. The first shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 2 and a dilation rate of 2, a batch normalization layer and a ReLU6 activation function; the second shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 4 and a dilation rate of 4, a batch normalization layer and a ReLU6 activation function; and the third shallow structure comprises a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 8 and a dilation rate of 8. The deep structure comprises a 3 × 3 depthwise convolutional layer with a stride of 2 and padding of 1, a batch normalization layer and a ReLU6 activation function, a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer and a ReLU6 activation function, and a parallel structure. The parallel structure comprises the combination of a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 2 and a dilation rate of 2, a batch normalization layer, a ReLU6 activation function and double upsampling; the combination of a 3 × 3 depthwise convolutional layer with a stride of 1, padding of 4 and a dilation rate of 4, a batch normalization layer, a ReLU6 activation function and double upsampling; and the combination of an adaptive maximum pooling layer and 10-fold upsampling. Finally, all output features are concatenated and fed into a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 96 filters, a batch normalization layer and a ReLU6 activation function to obtain feature O_1.
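A simplified PyTorch sketch of the eASPP module described above follows. It keeps the 1 × 1 reduction, the three dilated depthwise branches, the downsampled deep branch and the final 1 × 1 projection; the deep-branch outputs are resized back to the input resolution by bilinear interpolation rather than by the fixed 2-fold and 10-fold upsampling of the original design, so those factors, and the pooled output size, are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

def dw_conv(ch, dilation=1, stride=1, padding=None):
    # 3x3 depthwise convolution (groups == channels) + BN + ReLU6
    if padding is None:
        padding = dilation
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, stride=stride, padding=padding,
                  dilation=dilation, groups=ch, bias=False),
        nn.BatchNorm2d(ch), nn.ReLU6(inplace=True))

def pw_conv(cin, cout):
    # 1x1 pointwise convolution + BN + ReLU6
    return nn.Sequential(nn.Conv2d(cin, cout, 1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU6(inplace=True))

class EASPP(nn.Module):
    def __init__(self, in_ch, mid_ch=192, out_ch=96):
        super().__init__()
        self.reduce = pw_conv(in_ch, mid_ch)
        # three parallel shallow branches, dilation rates 2, 4, 8
        self.shallow = nn.ModuleList([dw_conv(mid_ch, d) for d in (2, 4, 8)])
        # deep branch: stride-2 depthwise conv, 1x1 projection, then two
        # dilated branches and a pooled branch
        self.down = dw_conv(mid_ch, stride=2, padding=1)
        self.deep_proj = pw_conv(mid_ch, mid_ch)
        self.deep_d2 = dw_conv(mid_ch, 2)
        self.deep_d4 = dw_conv(mid_ch, 4)
        self.pool = nn.AdaptiveMaxPool2d(1)
        # 3 shallow + 2 dilated deep + 1 pooled deep branch -> concatenation
        self.project = pw_conv(mid_ch * 6, out_ch)

    def forward(self, x):
        x = self.reduce(x)
        size = x.shape[-2:]
        outs = [branch(x) for branch in self.shallow]
        d = self.deep_proj(self.down(x))
        for feat in (self.deep_d2(d), self.deep_d4(d), self.pool(d)):
            outs.append(F.interpolate(feat, size=size, mode='bilinear',
                                      align_corners=False))
        return self.project(torch.cat(outs, dim=1))   # -> feature O1

Under the split assumed in the earlier encoder sketch, feature O_2 would have 96 channels, so the module would be instantiated as EASPP(96).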
TABLE 1. MobileNetV2 backbone network partitioning (the table is provided as an image in the original publication)
For the feature fusion layer, the invention uses a dense connection structure to fuse multi-level features. The fused upsampling module (FU) in the dense connection structure is shown in FIG. 3. The feature fusion layer first concatenates feature O_1 and feature O_2 and inputs the result into the fused upsampling module FU3 to obtain feature F_1. FU3 consists of a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 192 filters, a batch normalization layer and a ReLU6 activation function, a double upsampling layer, a 3 × 3 depthwise convolutional layer with a stride of 1 and padding of 1, a batch normalization layer and a ReLU6 activation function, and a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 64 filters and a batch normalization layer. Next, the result of upsampling feature O_1 by a factor of 2 is concatenated with feature F_1 and feature O_3, and the result is input into the fused upsampling module FU2 to obtain feature F_2. FU2 consists of a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 96 filters, a batch normalization layer and a ReLU6 activation function, a double upsampling layer, a 3 × 3 depthwise convolutional layer with a stride of 1 and padding of 1, a batch normalization layer and a ReLU6 activation function, and a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 32 filters and a batch normalization layer. Finally, the result of upsampling feature O_1 by a factor of 4, the result of upsampling feature F_1 by a factor of 2, feature F_2 and feature O_4 are concatenated and fed into a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 376 filters, a batch normalization layer and a ReLU6 activation function to obtain feature F_3.
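A sketch of the fused upsampling module (FU) matching the description above; the input channel counts of FU3 and FU2 depend on the channels of the concatenated features and are therefore left open here:

import torch.nn as nn

class FusedUpsample(nn.Module):
    # FU: 1x1 conv + BN + ReLU6, 2x upsampling, 3x3 depthwise conv + BN +
    # ReLU6, then 1x1 conv + BN (no activation after the last convolution).
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU6(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU6(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.block(x)

# FU3 and FU2 as described in the text; in_ch is a placeholder that depends
# on the concatenated inputs:
# fu3 = FusedUpsample(in_ch, mid_ch=192, out_ch=64)
# fu2 = FusedUpsample(in_ch, mid_ch=96,  out_ch=32)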
The multi-task output layer comprises a foreground-background prediction branch, a semantic segmentation prediction branch and a boundary prediction branch, which take feature F_3 as input and output the foreground-background prediction map, the semantic segmentation prediction map and the boundary prediction map. The foreground-background prediction branch consists of a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 94 filters, a batch normalization layer and a ReLU6 activation function, a 3 × 3 convolutional layer with a stride of 1, padding of 1 and 1 filter, and a 4-fold upsampling and foreground-background output layer, and outputs the foreground-background prediction image. The semantic segmentation prediction branch consists of 2-fold upsampling, a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 376 filters, a batch normalization layer and a ReLU6 activation function, a 3 × 3 convolutional layer with a stride of 1, padding of 1 and 9 filters, and a 2-fold upsampling and semantic classification output layer. The output of the 3 × 3 convolutional layer of the foreground-background prediction branch is passed through a Sigmoid activation function and multiplied with feature F_3, and the product is used as the input of the semantic segmentation prediction branch, which outputs the semantic segmentation prediction map. The boundary prediction branch first upsamples feature F_3 by a factor of 2 and concatenates the result with the activated output of the 1 × 1 convolutional layer of the semantic segmentation prediction branch, and the concatenated feature is used as its input. The branch consists of a 1 × 1 convolutional layer with a stride of 1, padding of 0 and 376 filters, a batch normalization layer and a ReLU6 activation function, a 3 × 3 convolutional layer with a stride of 1, padding of 1 and 1 filter, and a 2-fold upsampling and boundary output layer, and outputs the boundary prediction map.
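The sketch below mirrors the three output branches just described, including the Sigmoid gating of feature F_3 by the foreground-background response and the concatenation that feeds the boundary branch. The channel counts (94 and 376 filters, 9 semantic classes, a 376-channel F_3) follow the text; returning raw logits instead of dedicated output layers is an assumption:

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu6(cin, cout, k):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU6(inplace=True))

def up(x, scale):
    return F.interpolate(x, scale_factor=scale, mode='bilinear',
                         align_corners=False)

class MultiTaskHead(nn.Module):
    def __init__(self, in_ch=376, num_classes=9):
        super().__init__()
        # foreground-background branch
        self.fb_reduce = conv_bn_relu6(in_ch, 94, 1)
        self.fb_out = nn.Conv2d(94, 1, 3, padding=1)
        # semantic segmentation branch
        self.seg_reduce = conv_bn_relu6(in_ch, 376, 1)
        self.seg_out = nn.Conv2d(376, num_classes, 3, padding=1)
        # boundary branch (upsampled F3 concatenated with the segmentation feature)
        self.bd_reduce = conv_bn_relu6(in_ch + 376, 376, 1)
        self.bd_out = nn.Conv2d(376, 1, 3, padding=1)

    def forward(self, f3):
        fb_logit = self.fb_out(self.fb_reduce(f3))
        fb_pred = up(fb_logit, 4)                      # foreground-background map
        # segmentation input: F3 gated by the foreground-background response
        seg_mid = self.seg_reduce(up(f3 * torch.sigmoid(fb_logit), 2))
        seg_pred = up(self.seg_out(seg_mid), 2)        # semantic segmentation map
        # boundary input: 2x-upsampled F3 concatenated with the segmentation feature
        bd_in = torch.cat([up(f3, 2), seg_mid], dim=1)
        bd_pred = up(self.bd_out(self.bd_reduce(bd_in)), 2)
        return fb_pred, seg_pred, bd_pred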
Step S103: each original road scene image in the training set is input, as an original input image, into the convolutional neural network for training, so as to obtain the foreground-background prediction image, semantic segmentation prediction image and boundary prediction image corresponding to each original road scene image in the training set, denoted {P_q^fb(i,j)}, {P_q^seg(i,j)} and {P_q^bd(i,j)}, respectively.
Step S104: the loss function values between the prediction images and the corresponding real images of each original road scene image in the training set are calculated and denoted Loss_1, Loss_2 and Loss_3, where Loss_1 and Loss_3 are binary cross-entropy loss functions and Loss_2 is a multi-class cross-entropy loss function.
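A minimal sketch of the multi-task loss assumed from this description: binary cross-entropy for the two binary tasks and multi-class cross-entropy for segmentation, with an unweighted sum because the text does not state how the three terms are combined:

import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # expects logits and float targets of the same shape
ce = nn.CrossEntropyLoss()     # expects (N, C, H, W) logits and (N, H, W) long targets

def multi_task_loss(fb_pred, seg_pred, bd_pred, fb_gt, seg_gt, bd_gt):
    loss1 = bce(fb_pred, fb_gt)    # foreground-background, 2-class
    loss2 = ce(seg_pred, seg_gt)   # semantic segmentation, multi-class
    loss3 = bce(bd_pred, bd_gt)    # boundary, 2-class
    return loss1 + loss2 + loss3   # equal weighting assumed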
Step S105: steps S103 and S104 are repeatedly executed V times, and the model is trained using the Adam optimization method; the weight vector and bias term corresponding to the last training result are taken as the final weight vector and final bias term of the convolutional neural network classification training model, denoted W_best and b_best, respectively, where V > 1, and in this embodiment V = 300.
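A training-loop sketch for steps S103 to S105, assuming a model with the two-input interface sketched earlier, the multi_task_loss helper above, and a data loader over the MFNet training split; the Adam hyperparameters are left at the PyTorch defaults because the text does not specify them:

import torch

def train(model, train_loader, device='cuda', epochs=300):   # V = 300 epochs
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(epochs):
        for rgb, thermal, fb_gt, seg_gt, bd_gt in train_loader:
            rgb, thermal = rgb.to(device), thermal.to(device)
            fb_gt, seg_gt, bd_gt = (fb_gt.to(device), seg_gt.to(device),
                                    bd_gt.to(device))
            fb_pred, seg_pred, bd_pred = model(rgb, thermal)
            loss = multi_task_loss(fb_pred, seg_pred, bd_pred,
                                   fb_gt, seg_gt, bd_gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # the parameters after the final epoch serve as W_best / b_best
    torch.save(model.state_dict(), 'model_best.pth')          # file name assumed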
The specific steps of the test phase include:
step S201: order to
Figure BDA0002887214070000104
Representing a road scene color image and a thermal image to be semantically segmented; wherein, i' is more than or equal to 1 and less than or equal to W', 1. ltoreq. j '. ltoreq.H ', W ' denotes the width of the image, H ' denotes the height of the image,
Figure BDA0002887214070000105
respectively represent
Figure BDA0002887214070000106
And the middle coordinate position is the pixel value of the pixel point of (i, j).
Step S202: inputting the color image and the thermal image into a convolutional neural network classification training model and utilizing WbestAnd bbestPredicting, and recording the corresponding prediction semantic segmentation image as
Figure BDA0002887214070000107
Wherein the content of the first and second substances,
Figure BDA0002887214070000108
to represent
Figure BDA0002887214070000109
And the pixel value of the pixel point with the middle coordinate position of (i ', j').
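An inference sketch for the test phase (steps S201 and S202); the saved-weights file name follows the training sketch above and is an assumption:

import torch

def predict_segmentation(model, rgb, thermal, weights_path='model_best.pth'):
    # Load W_best / b_best and predict the semantic segmentation image for
    # one color/thermal pair (each given as a 3D tensor C x H' x W').
    model.load_state_dict(torch.load(weights_path, map_location='cpu'))
    model.eval()
    with torch.no_grad():
        _, seg_pred, _ = model(rgb.unsqueeze(0), thermal.unsqueeze(0))
    return seg_pred.argmax(dim=1)[0]   # per-pixel class labels, H' x W'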
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed:
and (3) building a convolutional neural network architecture by using a python-based deep learning library pytorch. The road scene image database test set reported in the MFNet is adopted to analyze how the segmentation effect of the road scene image (393 road scene images) predicted by the method is. Here, the segmentation performance of the predicted semantic segmentation image is evaluated by using 2 common objective parameters of the evaluated semantic segmentation method as evaluation indexes, namely, a Class average accuracy (Class accuracy) and a ratio of Intersection and Union of the segmentation image and the label image (Mean Intersection over Union, mlou). The number of predicted images per second (FPS) was used to evaluate the speed of the model.
Each road scene image in the test set is predicted using the method of the invention to obtain the corresponding predicted semantic segmentation image. The class average accuracy CA, the mean intersection over union mIoU between the segmentation images and the label images, and the number of predicted images per second FPS, which reflect the semantic segmentation performance of the method, are listed in Table 2. As can be seen from the data listed in Table 2, the method of the present invention achieves good segmentation results and a fast prediction speed on the road scene images, which indicates that obtaining the predicted semantic segmentation images corresponding to the road scene images by the method of the present invention is feasible and effective.
Table 2. Evaluation results on the test set using the method of the present invention
CA 67.7%
mIoU 54.8%
FPS 54.06
FIGS. 4a and 4b show the 1st original road scene color image and thermal image of the same scene, and FIGS. 4c, 4d and 4e show the predicted semantic segmentation image, predicted boundary image and predicted foreground-background image obtained by predicting these original road scene images with the method of the present invention; FIGS. 5a and 5b show the 2nd original road scene color image and thermal image of the same scene, and FIGS. 5c, 5d and 5e show the corresponding predicted semantic segmentation image, predicted boundary image and predicted foreground-background image; FIGS. 6a and 6b show the 3rd original road scene color image and thermal image of the same scene, and FIGS. 6c, 6d and 6e show the corresponding predicted semantic segmentation image, predicted boundary image and predicted foreground-background image; FIGS. 7a and 7b show the 4th original road scene color image and thermal image of the same scene, and FIGS. 7c, 7d and 7e show the corresponding predicted semantic segmentation image, predicted boundary image and predicted foreground-background image. The segmentation precision of the predicted semantic segmentation images obtained by the method is high.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A multitask supervision-based unmanned real-time road scene semantic segmentation method is characterized by comprising the following specific steps:
selecting the color images and thermal images of Q original road scene images and the corresponding real foreground-background images, real semantic segmentation images and real boundary images to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network uses the MobileNetV2 lightweight network as a feature extractor, an improved efficient atrous spatial feature pyramid structure to extract the deep semantic features of the image, and a dense connection structure to fuse multi-level features;
inputting the color images and thermal images of the original road scene images in the training set as original input images into the convolutional neural network for training to obtain the corresponding foreground-background prediction images, semantic segmentation prediction images and boundary prediction images;
calculating the loss function values between the foreground-background prediction images, semantic segmentation prediction images and boundary prediction images obtained by training and the corresponding real foreground-background images, real semantic segmentation images and real boundary images;
and repeating the training and the calculation of the loss function values, and taking the last training result as the final weight vector and the final bias term.
2. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 1, wherein the Q original road scene images are selected from images in a road scene image database reported in MFNet.
3. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 1, wherein the convolutional neural network comprises an input layer, a feature extraction layer, a feature fusion layer and a multi-task output layer;
the input layer comprises a color image input layer and a thermal image input layer, which receive the color image and the thermal image, respectively;
the feature extraction layer performs layer-by-layer feature extraction on the color image and the thermal image and extracts the deep semantic features of the images;
the feature fusion layer fuses multi-level features using a dense connection structure;
and the multi-task output layer outputs the foreground-background prediction image, the semantic segmentation prediction image and the boundary prediction image.
4. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 1, wherein the MobileNetV2 lightweight network removes the last two inverted residual structures and the classification layer, and the remaining part is divided into 3 blocks, wherein the color image input branch corresponds to R_Block_i, i = 1, 2, 3, and the thermal image input branch corresponds to T_Block_i, i = 1, 2, 3.
5. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 4, wherein the output result of each module of the thermal image input branch and the output result of the corresponding module of the color image input branch are fused by element-wise addition of the corresponding features.
6. The unmanned real-time road scene semantic segmentation method based on multitask supervision as claimed in claim 1, wherein the dense connection structure comprises a fused upsampling module, and the fused upsampling module comprises a 1 × 1 convolutional layer, a batch normalization layer and a ReLU6 activation function, a double upsampling layer, a 3 × 3 depthwise convolutional layer, a batch normalization layer and a ReLU6 activation function, and a 1 × 1 convolutional layer and a batch normalization layer.
CN202110017471.3A 2021-01-07 2021-01-07 Unmanned real-time road scene semantic segmentation method based on multi-task supervision Withdrawn CN112699889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110017471.3A CN112699889A (en) 2021-01-07 2021-01-07 Unmanned real-time road scene semantic segmentation method based on multi-task supervision


Publications (1)

Publication Number Publication Date
CN112699889A true CN112699889A (en) 2021-04-23

Family

ID=75515032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110017471.3A Withdrawn CN112699889A (en) 2021-01-07 2021-01-07 Unmanned real-time road scene semantic segmentation method based on multi-task supervision

Country Status (1)

Country Link
CN (1) CN112699889A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408462A (en) * 2021-06-29 2021-09-17 西南交通大学 Landslide remote sensing information extraction method based on convolutional neural network and classification thermodynamic diagram
CN113408462B (en) * 2021-06-29 2023-05-02 西南交通大学 Landslide remote sensing information extraction method based on convolutional neural network and class thermodynamic diagram
CN113420848A (en) * 2021-08-24 2021-09-21 深圳市信润富联数字科技有限公司 Neural network model training method and device and gesture recognition method and device
CN115410189A (en) * 2022-10-31 2022-11-29 松立控股集团股份有限公司 Complex scene license plate detection method
CN115410189B (en) * 2022-10-31 2023-01-24 松立控股集团股份有限公司 Complex scene license plate detection method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210423