CN112733934A - Multi-modal feature fusion road scene semantic segmentation method in complex environment - Google Patents


Info

Publication number
CN112733934A
Authority
CN
China
Prior art keywords
neural network
block
road scene
convolution
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110025132.XA
Other languages
Chinese (zh)
Inventor
周武杰
刘文宇
雷景生
万健
甘兴利
钱小鸿
许彩娥
黄杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Zhejiang University of Science and Technology ZUST
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110025132.XA
Publication of CN112733934A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation

Abstract

The invention discloses a multi-modal feature fusion road scene semantic segmentation method for complex environments. Road scene images, thermal images and segmentation label images are selected to form a training set; a convolutional neural network is constructed; the training set is input into the convolutional neural network for training to obtain semantic segmentation result images, and after training the loss function value is calculated between the set formed by the semantic segmentation result images and the set formed by the segmentation label images corresponding to all road scene images; the training step is repeated to obtain the optimal parameters corresponding to the minimum loss function value; the multi-channel components of the original road scene image to be segmented are then input, and prediction with the optimal parameters yields a saliency detection image of the original road scene image to be segmented, giving the segmentation result. The invention optimizes the decoding of feature maps by applying novel modules and combines hierarchical and multi-modal information fusion, which ultimately improves the efficiency and accuracy of the road scene semantic segmentation task.

Description

Multi-modal feature fusion road scene semantic segmentation method in complex environment
Technical Field
The invention relates to a deep learning semantic segmentation method, in particular to a multi-modal feature fusion road scene semantic segmentation method in a complex environment.
Background
With the rapid economic development of China, people's living standards are continuously improving, and vehicles have become an indispensable means of transportation. As the number of vehicles grows, problems such as road congestion and traffic accidents cause great trouble for people's travel. The concept of intelligent transportation has therefore emerged to alleviate these problems. Intelligent transportation refers to the macroscopic allocation and control of vehicles on the road through the cross-development of the Internet of Things, electronic technology, control technology and related fields, so as to relieve current traffic pressure. Intelligent transportation plays a very important role in improving traffic management efficiency, relieving traffic congestion, reducing environmental pollution and ensuring traffic safety. Intelligent driver-assistance systems are an important component of intelligent transportation, and unmanned driving is its ultimate goal. Road semantic segmentation is a necessary link for a vehicle to perceive its external environment, a core technology for realizing unmanned driving, and an important branch of computer vision and image processing.
Traditional research methods for semantic segmentation mostly rely on manual features and prior knowledge, which greatly limits research. Manual features are feature pixels obtained by computation, and prior knowledge is similar to people's general cognition. Traditional methods can solve the semantic segmentation vision task to a certain extent, but they still cannot understand complex, tiny or occluded scenes well and cannot segment such objects accurately.
In recent years, the advent of deep learning has revolutionized the way computer vision tasks are approached. Unlike traditional methods, deep learning extracts features from training samples through the autonomous learning of convolutional neural networks rather than relying on manual features and prior knowledge. In addition, sufficient data is the basic guarantee of scientific research, and compared with traditional methods, deep learning can more easily obtain large manually annotated data sets. More importantly, deep learning algorithms can process graphics in parallel on the GPU, which greatly improves learning efficiency and prediction capability.
Most current semantic segmentation methods adopt deep learning, combining convolution operations with pooling operations, fully convolutional networks and the like, and use deep convolutional neural networks to autonomously learn and extract feature information from images. However, simply using existing neural networks is not enough to meet the high-precision requirements of semantic segmentation, because some operations, including pooling, lose part of the feature information in the image and the resulting segmentation images are poor. Deeper research is therefore needed to optimize the model on this basis.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides a multi-modal feature fusion road scene semantic segmentation method in a complex environment, which has high detection efficiency and high detection accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q original road scene images, thermal images and corresponding segmentation label images to form a training set, and recording the kth original road scene image in the training set as
{Rk(x, y)}; the road scene image is an RGB color image, the corresponding thermal image is recorded as {Tk(x, y)}, and the corresponding segmentation label image is recorded as {Gk(x, y)}. Because the original road scene image, i.e. the RGB color image, has three channels while the thermal image has only one channel, each thermal image in the training set is processed by HHA encoding (horizontal disparity, height above ground, and the angle of the local surface normal to the inferred gravity direction) into a three-channel image like the road scene image, and the set formed by the resulting three-channel image components is denoted Jk. Here Rk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in the road scene image {Rk(x, y)}, Tk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in the thermal image {Tk(x, y)}, and Gk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in the segmentation label image {Gk(x, y)}.
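For illustration only, the following is a minimal Python/NumPy sketch of how one training sample could be assembled. The exact HHA encoding is not reproduced here (it is not detailed in this description), so simple channel replication is used as a stand-in, and the function name and array shapes are assumptions.

```python
import numpy as np

def make_training_sample(rgb, thermal, label):
    """Assemble one training triple (assumed shapes: rgb HxWx3, thermal HxW, label HxW).

    The single-channel thermal image is expanded to three channels so that it
    matches the road scene image; plain replication stands in for the HHA-style
    encoding described in the text."""
    assert rgb.ndim == 3 and rgb.shape[2] == 3
    thermal_3ch = np.repeat(thermal[..., None], 3, axis=2)  # stand-in for HHA encoding
    return rgb.astype(np.float32), thermal_3ch.astype(np.float32), label.astype(np.int64)
```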
Step 1_ 2: constructing a convolutional neural network:
step 1_ 3: inputting each road scene image of the original road scene in the training set and the corresponding thermal image into a convolutional neural network for training to obtain a semantic segmentation result image, and finishing the training to obtain all the road scene imagesThe set formed by the semantic segmentation result graph correspondingly obtained from the road scene image is recorded as
Figure BDA0002890088140000028
Step 1_ 4: set formed by semantic segmentation result graph obtained by computational training
Figure BDA0002890088140000029
Set of split label images corresponding to all road scene images
Figure BDA00028900881400000210
Value of the loss function in between, is recorded as
Figure BDA00028900881400000211
Step 1_ 5: Repeat step 1_3 and step 1_4 N times to obtain a trained convolutional neural network classification model and Q×N loss function values in total; then find the minimum among the Q×N loss function values, and take the weight vector and bias term corresponding to this minimum loss function value as the optimal weight vector Wbest and the optimal bias term Bbest of the convolutional neural network classification training model.
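A hedged PyTorch sketch of this training schedule is given below. The pixel-wise cross-entropy loss and the Adam optimizer are assumptions (the description does not name them), and the model and data loader are placeholders taking the road scene image and the three-channel thermal image as inputs.

```python
import copy
import torch
import torch.nn as nn

def train(model, loader, num_epochs=100, lr=1e-3, device="cuda"):
    """Repeat steps 1_3/1_4 N times and keep the parameters with the lowest loss value."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()            # assumed loss; not specified in the description
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float("inf"), None
    for epoch in range(num_epochs):              # N = 100 in the described experiment
        for rgb, thermal, label in loader:
            rgb, thermal, label = rgb.to(device), thermal.to(device), label.to(device)
            pred = model(rgb, thermal)            # semantic segmentation result map
            loss = criterion(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:           # keep the weights/biases giving the minimum loss
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss
```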
The test stage process comprises the following specific steps:
original road scene image to be segmented
, denoted {S(x', y')}: its R channel component, G channel component and B channel component are input into the trained convolutional neural network model, and prediction with the optimal weight vector Wbest and the optimal bias term Bbest yields the saliency detection image of the original road scene image to be segmented, which gives the segmentation result of the road scene; the value of the saliency detection image at coordinate position (x', y') is the predicted pixel value of the corresponding pixel.
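For the testing stage, a corresponding prediction sketch is shown below; loading the saved optimal parameters and taking a per-pixel argmax over class scores are assumptions made for illustration, not requirements of the described method.

```python
import torch

@torch.no_grad()
def predict(model, best_state, rgb, thermal, device="cuda"):
    """Predict a segmentation map for one road scene image using the optimal parameters."""
    model.load_state_dict(best_state)                      # Wbest and Bbest
    model.to(device).eval()
    logits = model(rgb.to(device), thermal.to(device))     # assumed layout: 1 x C x H x W
    return logits.argmax(dim=1).squeeze(0).cpu()           # per-pixel class labels
```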
The thermal image
is processed in advance into a three-channel image, like the road scene image, by the HHA encoding described above.
In the testing stage process, the original road scene image to be segmented
, denoted {S(x', y')}, comprises the road scene image to be segmented and its corresponding thermal image; wherein 1 ≤ x' ≤ W', 1 ≤ y' ≤ H', W' and H' denote the width and height of the original road scene image to be segmented, and S(x', y') denotes the pixel value of the pixel at coordinate position (x', y') in the original road scene image to be segmented;
the convolutional neural network comprises an input layer, a hidden layer and an output layer;
the hidden layer comprises a 1 st neural network block, a 2 nd neural network block, a 3 rd neural network block, a 4 th neural network block, a 5 th neural network block, a 6 th neural network block, a 7 th neural network block, an 8 th neural network block, a 9 th neural network block, a 10 th neural network block, an FFM (Feature Fusion Module), a 1 st transition convolution layer, a 2 nd transition convolution layer, a 1 st decoding block, a 2 nd decoding block and a 3 rd decoding block which are sequentially arranged;
for the input layer, the input end of the input layer receives the road scene image and the thermal image of the original road scene, the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the road scene image and the thermal image, and the output quantity of the input layer is the input quantity of the hidden layer.
The thermal image has three channels after being processed by the HHA coding mode, like the road scene image, and is accordingly split into three components after passing through the input layer; the width and height of the input original road scene image and thermal image are W and H.
In the hidden layer, the road scene image and the thermal image are respectively input into a 6 th neural network block and a 1 st neural network block, the 6 th neural network block, a 7 th neural network block, an 8 th neural network block, a 9 th neural network block and a 10 th neural network block are sequentially connected, the 1 st neural network block, the 2 nd neural network block, the 3 rd neural network block, the 4 th neural network block and the 5 th neural network block are sequentially connected,
the output of the 1 st neural network block and the output of the 6 th neural network block are added and then input into the 7 th neural network block, the output of the 2 nd neural network block and the output of the 7 th neural network block are added and then input into the 8 th neural network block, the output of the 3 rd neural network block and the output of the 8 th neural network block are added and then input into the 9 th neural network block, and the output of the 4 th neural network block and the output of the 9 th neural network block are added and then input into the 10 th neural network block; all 10 neural network blocks constitute the encoder.
The 10 th neural network block is sequentially connected with the 5 th neural network block; the output of the 5 th neural network block is added by the addition block and then input, together with the outputs of the 8 th neural network block and the 9 th neural network block, to the FFM feature fusion module; the output of the FFM feature fusion module is input to the 1 st decoding block; the 1 st decoding block, the 2 nd decoding block and the 3 rd decoding block are sequentially connected, and all 3 decoding blocks form the decoder.
The output of the 6 th neural network block is input to the 3 rd decoding block through the 2 nd transition convolution layer, the output of the 7 th neural network block is input to the 2 nd decoding block through the 1 st transition convolution layer, and the output of the 3 rd decoding block is used as the output of the hidden layer.
In the hidden layer:
the 1 st neural network block mainly comprises a first convolution layer and a first activation layer in sequence;
the 2 nd neural network block mainly comprises a first pooling layer and a first dense convolution block, wherein the first dense convolution block comprises five continuous dense convolution layers;
the 3 rd neural network block mainly comprises a first transition convolution block and a second dense convolution block, wherein the first transition convolution block is formed by an activation layer with an activation mode of Relu, and the second dense convolution block is formed by sequentially connecting twelve continuous dense convolution layers with the same structure;
the 4 th neural network block mainly comprises a second transition convolution block and a third dense convolution block, wherein the second transition convolution block is formed by an activation layer with an activation mode of Relu, and the third dense convolution block is formed by sequentially connecting thirty-six continuous dense convolution layers with the same structure;
the 5 th neural network block mainly comprises a third transition convolution block and a fourth dense convolution block, wherein the third transition convolution block is formed by an activation layer with an activation mode of Relu, and the fourth dense convolution block is formed by sequentially connecting twenty-four continuous dense convolution layers with the same structure;
the corresponding structures of the 6 th to 10 th neural network blocks are the same as those of the 1 st to 5 th neural network blocks.
The dense convolution layers all have the same structure, each formed by sequentially connecting two consecutive activation layers and two consecutive convolution layers.
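A sketch of one such dense convolution layer is given below. The 1 × 1 and 3 × 3 kernel sizes follow the detailed description; the DenseNet-style concatenation, growth rate and bottleneck width are assumptions used to reproduce the stated channel growth.

```python
import torch
import torch.nn as nn

class DenseConvLayer(nn.Module):
    """One dense convolution layer: ReLU -> 1x1 conv -> ReLU -> 3x3 conv, with its
    output concatenated onto its input (DenseNet-style concatenation is assumed)."""
    def __init__(self, in_channels, growth=48, bottleneck=4):
        super().__init__()
        mid = bottleneck * growth
        self.body = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, mid, kernel_size=1, stride=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, growth, kernel_size=3, stride=1, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)  # grow the channel dimension by `growth`
```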
As shown in fig. 2, the FFM feature fusion module mainly comprises a first convolution layer with a convolution kernel size of 3 × 3 and two second convolution layers with a convolution kernel size of 1 × 1. The outputs from the 7 th neural network block, the 8 th neural network block and the addition block are input into the first convolution layer after a pixel superposition operation, so the first convolution layer receives three inputs; the output of the first convolution layer is input into the two parallel second convolution layers, and the outputs of the two second convolution layers are combined by a pixel-wise multiplication operation to form the output of the FFM feature fusion module.
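A minimal PyTorch sketch of the FFM as just described (superposed inputs, one 3 × 3 convolution, two parallel 1 × 1 convolutions whose outputs are multiplied pixel-wise) is shown below; the default channel numbers are taken from the detailed embodiment, and implementing the superposition as channel concatenation is an assumption.

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Feature Fusion Module: concat -> 3x3 conv -> two parallel 1x1 convs -> pixel-wise product."""
    def __init__(self, in_channels=5088, out_channels=1272):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.branch_a = nn.Conv2d(out_channels, out_channels, kernel_size=1, stride=1)
        self.branch_b = nn.Conv2d(out_channels, out_channels, kernel_size=1, stride=1)

    def forward(self, feats):
        # feats: list of feature maps already resized to a common resolution
        x = self.conv3x3(torch.cat(feats, dim=1))   # concatenation stands in for the described superposition
        return self.branch_a(x) * self.branch_b(x)  # pixel-wise multiplication of the two branches
```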
The 1 st transition convolution layer and the 2 nd transition convolution layer are each formed by a single convolution layer.
As shown in fig. 3, the 1 st decoding block, the 2 nd decoding block and the 3 rd decoding block have the same structure. Each decoding block comprises two deconvolution modules and two dilated (expansion) convolution layers, and each deconvolution module comprises a deconvolution layer, a normalization layer and an activation function layer connected in sequence. The input of the decoding block is fed into both deconvolution modules; the outputs of the two deconvolution modules are superimposed and then processed by the first dilated convolution layer, the input of the decoding block is processed by the second dilated convolution layer, and the outputs of the two dilated convolution layers are added to form the output of the decoding block.
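A hedged sketch of such a decoding block is given below. The stride-2 deconvolution parameters and the bilinear resize of the skip branch are assumptions chosen so that the two branches have matching resolution, and implementing the superposition of the two deconvolution outputs as addition is likewise an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Decoding block DB sketch: two deconvolution modules (deconv -> BN -> ReLU) on the
    block input, their superposed outputs refined by a 3x3 conv with dilation 1, the
    (resized) block input refined by a 3x3 conv with dilation 2, and the two results added."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        def deconv_module():
            return nn.Sequential(
                nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
        self.deconv_a = deconv_module()
        self.deconv_b = deconv_module()
        self.dilated1 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, dilation=1)
        self.dilated2 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=2, dilation=2)

    def forward(self, x):
        up = self.deconv_a(x) + self.deconv_b(x)              # superposition of the two deconvolution branches
        skip = F.interpolate(x, size=up.shape[-2:], mode="bilinear", align_corners=False)
        return self.dilated1(up) + self.dilated2(skip)        # add the two dilated-convolution outputs
```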
For the output layer, its input is the output of the 3 rd decoding block, and it outputs a segmentation result graph with a width of W and a height of H.
Compared with the prior art, the invention has the advantages that:
1) The method adopts a convolutional neural network to extract features from the road scene image and the thermal image, and fuses the information from these two modalities to obtain an efficient, high-precision semantic segmentation result graph.
2) The method of the invention provides an effective module, the FFM (Feature Fusion Module), which takes feature maps as input and fuses them effectively to obtain richer feature information after fusion, which helps the decoding stage obtain a segmentation result graph with higher precision.
3) The method of the invention provides a decoding block DB (Decoder Block) structure, which takes the feature maps processed by the feature fusion module as input and decodes them with a distinctive decoding structure, that is, the image resolution is gradually restored to be consistent with the input image and a high-precision segmentation result graph is obtained.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a block diagram of a feature fusion module FFM in the method of the present invention;
FIG. 3 is a block diagram of a decoding block DB in the method of the present invention;
FIG. 4a is a segmentation label image corresponding to the 1 st original road scene image of the same scene;
FIG. 4b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 4a by using the method of the present invention;
FIG. 5a is a segmentation label image corresponding to the 2 nd original road scene image of the same scene;
FIG. 5b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 5a by using the method of the present invention;
FIG. 6a is a segmentation label image corresponding to the 3 rd original road scene image of the same scene;
FIG. 6b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 6a by using the method of the present invention;
FIG. 7a is a segmentation label image corresponding to the 4 th original road scene image of the same scene;
FIG. 7b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 7a using the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The overall implementation block diagram of the multi-modal feature fusion road scene semantic segmentation method under the complex environment is shown in FIG. 1, and the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q original road scene images, thermal images and corresponding segmentation label images to form a training set, and recording the kth original road scene image in the training set as
{Rk(x, y)}; the corresponding thermal image is recorded as {Tk(x, y)}, and the corresponding segmentation label image is recorded as {Gk(x, y)}. Because the original road scene image, i.e. the RGB color image, has three channels while the thermal image has only one channel, each thermal image in the training set is processed by HHA encoding (horizontal disparity, height above ground, and the angle of the local surface normal to the inferred gravity direction) into a three-channel image like the road scene image, and the set formed by the resulting three-channel image components is denoted Jk; wherein Q is a positive integer with Q ≥ 200, e.g. Q = 1569; k is a positive integer with 1 ≤ k ≤ Q; 1 ≤ x ≤ W and 1 ≤ y ≤ H, where W denotes the width and H the height of the original road scene image and the thermal image, e.g. W = 224 and H = 224; Rk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in {Rk(x, y)}, Tk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in {Tk(x, y)}, and Gk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in {Gk(x, y)}. The data set used in this experiment is taken directly from the public road scene semantic segmentation data set released by Ha et al., which contains 1569 pairs of road scene images and thermal images.
Step 1_ 2: Construct the convolutional neural network. The convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises a 1 st neural network block, a 2 nd neural network block, a 3 rd neural network block, a 4 th neural network block, a 5 th neural network block, a 6 th neural network block, a 7 th neural network block, an 8 th neural network block, a 9 th neural network block, a 10 th neural network block, an FFM (Feature Fusion Module), a 1 st transition convolution layer, a 2 nd transition convolution layer, a 1 st decoding block, a 2 nd decoding block and a 3 rd decoding block which are sequentially arranged.
For the input layer, its input end receives the road scene image and the thermal image of the original road scene, its output end outputs the R channel component, the G channel component and the B channel component of the original input images, and the output of the input layer is the input of the hidden layer. The thermal image, processed by the HHA coding mode, has three channels like the road scene image, i.e. it is split into three components after passing through the input layer, and the width and height of the input original road scene image and thermal image are W and H.
For the hidden layer: the 1 st neural network block consists of a first convolution layer and a first activation layer. The input of the 1 st neural network block is the three-channel thermal image obtained by HHA processing; after processing by the 1 st neural network block, 96 feature maps are output, and the set formed by these 96 feature maps is denoted P1. The convolution kernel size (kernel_size) of the first convolution layer is 7 × 7, the number of convolution kernels (filters) is 96, the stride is 2 and the zero-padding (padding) value is 3; the activation mode of the first activation layer is "Relu". Each feature map in P1 has a width of W/2 and a height of H/2.
The 2 nd neural network block consists of a first pooling layer and a first dense convolution block. The input of the 2 nd neural network block is the 96 feature maps in P1; after processing by the 2 nd neural network block, 384 feature maps are output, and the set formed by these 384 feature maps is denoted P2. The pooling size of the first max-pooling layer is 3, the stride is 2 and the padding value is 1. The first dense convolution block consists of five dense convolution layers of identical structure, each composed of two activation layers and two convolution layers; the activation mode of the activation layers is "Relu", the first of the two convolution layers has a 1 × 1 kernel and a stride of 1, and the second has a 3 × 3 kernel, a stride of 1 and a padding value of 1. Each feature map in P2 has a width of W/4 and a height of H/4.
The 3 rd neural network block consists of a first transition convolution block and a second dense convolution block. The input of the 3 rd neural network block is the 384 feature maps in P2; after processing by the 3 rd neural network block, 768 feature maps are output, and the set formed by these 768 feature maps is denoted P3. The first transition convolution block consists of an activation layer with "Relu" activation, a convolution layer with a 1 × 1 kernel and a stride of 1, and an average pooling layer with a pooling size of 2 and a stride of 2. The second dense convolution block consists of twelve dense convolution layers of identical structure (the dense convolution layer structure is the same as described above; the dense convolution layers mentioned below are likewise the same and are not repeated). Each feature map in P3 has a width of W/8 and a height of H/8.
The 4 th neural network block consists of a second transition convolution block and a third dense convolution block. The input of the 4 th neural network block is the 768 feature maps in P3; after processing by the 4 th neural network block, 2112 feature maps are output, and the set formed by these 2112 feature maps is denoted P4. The second transition convolution block consists of an activation layer with "Relu" activation, a convolution layer with a 1 × 1 kernel and a stride of 1, and an average pooling layer with a pooling size of 2 and a stride of 2; the third dense convolution block consists of thirty-six dense convolution layers of identical structure. Each feature map in P4 has a width of W/16 and a height of H/16.
The 5 th neural network block consists of a third transition convolution block and a fourth dense convolution block. The input of the 5 th neural network block is the 2112 feature maps in P4; after processing by the 5 th neural network block, 2208 feature maps are output, and the set formed by these 2208 feature maps is denoted P5. The third transition convolution block consists of an activation layer with "Relu" activation, a convolution layer with a 1 × 1 kernel and a stride of 1, and an average pooling layer with a pooling size of 2 and a stride of 2; the fourth dense convolution block consists of twenty-four dense convolution layers of identical structure. Each feature map in P5 has a width of W/32 and a height of H/32.
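For concreteness, one transition convolution block as specified above (a "Relu" activation layer, a 1 × 1 convolution with a stride of 1, and a 2 × 2 average pooling layer with a stride of 2) can be sketched as follows; the channel numbers are left as parameters.

```python
import torch.nn as nn

def transition_block(in_channels, out_channels):
    """Transition convolution block: ReLU -> 1x1 conv (stride 1) -> 2x2 average pooling (stride 2)."""
    return nn.Sequential(
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2),   # halves the spatial resolution at each stage
    )
```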
The 6 th to 10 th neural network blocks correspond to the 1 st to 5 th neural network blocks and have the same structures, which are not repeated here. The input of the 6 th neural network block is the three-channel road scene image; after processing by the 6 th neural network block, 96 feature maps are output, and the set formed by these 96 feature maps is denoted T1; each feature map in T1 has a width of W/2 and a height of H/2.
The input of the 7 th neural network block is the feature map set G1 obtained by adding the 96 feature maps in T1 and the 96 feature maps in P1; after processing by the 7 th neural network block, 384 feature maps are output, and the set formed by these 384 feature maps is denoted T2; each feature map in T2 has a width of W/4 and a height of H/4.
The input of the 8 th neural network block is the feature map set G2 obtained by adding the 384 feature maps in T2 and the 384 feature maps in P2; after processing by the 8 th neural network block, 768 feature maps are output, and the set formed by these 768 feature maps is denoted T3; each feature map in T3 has a width of W/8 and a height of H/8.
The input of the 9 th neural network block is the feature map set G3 obtained by adding the 768 feature maps in T3 and the 768 feature maps in P3; after processing by the 9 th neural network block, 2112 feature maps are output, and the set formed by these 2112 feature maps is denoted T4; each feature map in T4 has a width of W/16 and a height of H/16.
The input of the 10 th neural network block is the feature map set G4 obtained by adding the 2112 feature maps in T4 and the 2112 feature maps in P4; after processing by the 10 th neural network block, 2208 feature maps are output, and the set formed by these 2208 feature maps is denoted T5; each feature map in T5 has a width of W/32 and a height of H/32. In addition, the feature map set obtained by adding the 2208 feature maps in T5 and the 2208 feature maps in P5 is denoted G5.
For the FFM (Feature Fusion Module), the inputs of the module are G3, G4 and G5 (where the feature maps in G4 and G5 are first processed by bilinear interpolation so that their resolution matches that of the feature maps in G3, i.e. the feature map sizes are unified); the feature map set obtained after concatenating (concat) them is denoted G6, which comprises 5088 feature maps. After FFM processing, 1272 feature maps are output, and the set formed by these 1272 feature maps is denoted G7. The FFM is composed of one convolution with a 3 × 3 kernel, a stride of 1 and a padding value of 1, and two convolutions with 1 × 1 kernels and a stride of 1, the two 1 × 1 convolutions being arranged in parallel. Each feature map in G6 has a width of W/8 and a height of H/8.
For the 1 st transition convolution layer, the input is the 96 feature maps in G1; after processing by this convolution layer, 159 feature maps are output, and these 159 feature maps are denoted C1. It consists of a convolution with a 3 × 3 kernel, a stride of 1 and a padding value of 1; this convolution layer does not change the image resolution, i.e. the feature maps processed by it have a width of W/2 and a height of H/2.
For the 2 nd transition convolution layer, the input is the 384 feature maps in G2; after processing by this convolution layer, 318 feature maps are output, and these 318 feature maps are denoted C2. It consists of a convolution with a 3 × 3 kernel, a stride of 1 and a padding value of 1; this convolution layer does not change the image resolution, i.e. the feature maps processed by it have a width of W/4 and a height of H/4.
For the 1 st decoding block, the input is the 1272 feature maps in G7; after processing by the 1 st decoding block, 636 feature maps are obtained, and these 636 feature maps are denoted H1. The decoding block is composed of two deconvolutions and two dilated convolutions: the two deconvolutions have the same structure, with 4 × 4 kernels, a stride of 1 and a padding value of 1; the two dilated convolutions have 3 × 3 kernels and a stride of 1, with dilation (expansion) rates of 1 and 2 and corresponding padding values of 1 and 2, respectively. Each feature map after the 1 st decoding block has a width of W/4 and a height of H/4.
For the 2 nd decoding block, the input is the feature map set C3 obtained by adding the 318 feature maps in H1 and the 318 feature maps in C2; after processing by the 2 nd decoding block, 159 feature maps are obtained, and these 159 feature maps are denoted H2. The structures of the 2 nd decoding block and the 3 rd decoding block are the same as that of the 1 st decoding block and are not repeated here. Each feature map after the 2 nd decoding block has a width of W/2 and a height of H/2.
For the 3 rd decoding block, the input is the feature map set obtained by adding the 159 feature maps in H2 and the 159 feature maps in C1; after processing by the 3 rd decoding block, 1 feature map is obtained, and this feature map is denoted H3. The feature map after the 3 rd decoding block has a width of W and a height of H.
For the output layer, its input is the output of the 3 rd decoding block, and it outputs a segmentation result graph with a width of W and a height of H.
Step 1_ 3: Take the road scene images and the thermal images of the original road scenes in the training set as input and feed them into the convolutional neural network for training to obtain semantic segmentation result graphs; the set formed by the semantic segmentation result graphs obtained after training is recorded.
Step 1_ 4: Calculate the loss function value between the set of semantic segmentation result graphs obtained by training and the set of corresponding segmentation label images.
Step 1_ 5: Repeat step 1_3 and step 1_4 N times to obtain a convolutional neural network classification training model and Q×N loss function values in total; then find the loss function value with the minimum value among the Q×N loss function values, and take the weight vector and the bias term corresponding to this minimum loss function value as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model, denoted Wbest and Bbest respectively; where N > 1, and N = 100 in this experiment.
The test stage process comprises the following specific steps:
The R channel component, the G channel component and the B channel component of the original road scene image to be segmented are input into the trained convolutional neural network model, and prediction with the optimal weight vector Wbest and the optimal bias term Bbest yields the saliency detection image, whose value at coordinate position (x', y') is the predicted pixel value of the corresponding pixel.
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed. A multi-modal feature fusion convolutional neural network architecture based on an encoder-decoder structure was built with the Python-based deep learning library PyTorch 1.1.0. The public data set for the semantic segmentation task (1569 pairs of road scene images and thermal images) was used to analyze the segmentation result images produced by the method of the present invention. In this experiment, two common objective evaluation metrics were used as evaluation indexes: the Accuracy (Acc) of each class and the Intersection over Union (IoU) of each class, which evaluate the segmentation precision of the segmentation result images.
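A sketch of these two evaluation metrics computed from a confusion matrix is given below; the assumption is that predictions and labels are integer class maps of the same shape.

```python
import numpy as np

def per_class_acc_iou(pred, label, num_classes):
    """Per-class Accuracy (Acc) and Intersection-over-Union (IoU) from integer label maps."""
    pred = pred.reshape(-1).astype(np.int64)
    label = label.reshape(-1).astype(np.int64)
    conf = np.bincount(label * num_classes + pred,
                       minlength=num_classes * num_classes).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(np.float64)
    acc = tp / np.maximum(conf.sum(axis=1), 1)                          # correct pixels / ground-truth pixels per class
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)  # intersection / union per class
    return acc, iou
```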
Each road scene image in the data set was segmented with the method of the invention to obtain the corresponding segmentation result graph; the segmentation performance of the method, namely the Accuracy (Acc) of each class and the Intersection over Union (IoU) of each class, is listed in Table 1.
TABLE 1 evaluation results on test sets using the method of the invention
(Table 1 is reproduced as two images in the original publication, listing the per-class Accuracy (Acc) and the per-class Intersection over Union (IoU) obtained on the test set.)
As can be seen from the data listed in Table 1, the first sub-table reflects the Accuracy (Acc) of each class and the second sub-table reflects the Intersection over Union (IoU) of each class, with the best value of each entry shown in bold; the segmentation results obtained by the method of the present invention are therefore better.
FIG. 4a shows a segmentation label image corresponding to the 1 st original road scene image of the same scene in the data set selected for the experiment; FIG. 4b is a diagram showing a segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 4a by using the method of the present invention; FIG. 5a shows a segmentation label image corresponding to the 2 nd original road scene image of the same scene in the data set selected for the experiment; FIG. 5b is a diagram showing a segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 5a by using the method of the present invention; FIG. 6a shows a segmentation label image corresponding to the 3 rd original road scene image of the same scene in the data set selected for the experiment; FIG. 6b is a diagram showing the segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 6a by using the method of the present invention; FIG. 7a shows a segmentation label image corresponding to the 4 th original road scene image of the same scene in the data set selected for the experiment; FIG. 7b is a diagram showing the segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 7a by using the method of the present invention. Comparing fig. 4a and 4b, fig. 5a and 5b, fig. 6a and 6b, and fig. 7a and 7b, it can be seen that the segmentation accuracy of the segmentation result graph obtained by the method of the present invention is higher.
It can thus be seen that the method of the invention optimizes the decoding of feature maps by applying novel modules and, by combining hierarchical and multi-modal information fusion, ultimately improves the efficiency and accuracy of the road scene semantic segmentation task.

Claims (9)

1. A multi-modal feature fusion road scene semantic segmentation method in a complex environment is characterized by comprising the following steps: the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q original road scene images, thermal images and corresponding segmentation label images to form a training set, and recording the kth original road scene image in the training set as
{Rk(x, y)}, the corresponding thermal image as {Tk(x, y)}, and the corresponding segmentation label image as {Gk(x, y)};
step 1_ 2: constructing a convolutional neural network;
step 1_ 3: inputting each road scene image of the original road scenes in the training set and the thermal image corresponding to the road scene image into the convolutional neural network for training to obtain a semantic segmentation result graph, and recording the set formed by the semantic segmentation result graphs correspondingly obtained from all the road scene images after training;
step 1_ 4: calculating the loss function value between the set formed by the semantic segmentation result graphs obtained by training and the set formed by the segmentation label images corresponding to all the road scene images;
step 1_ 5: repeatedly executing step 1_3 and step 1_4 N times to obtain a convolutional neural network classification training model and Q×N loss function values in total; then finding the loss function value with the minimum value among the Q×N loss function values, and taking the weight vector and the bias term corresponding to the minimum loss function value as the optimal weight vector Wbest and the optimal bias term Bbest of the convolutional neural network classification training model;
the test stage process comprises the following specific steps:
inputting the R channel component, the G channel component and the B channel component of the original road scene image to be segmented into the trained convolutional neural network model, and predicting with the optimal weight vector Wbest and the optimal bias term Bbest to obtain the saliency detection image of the original road scene image to be segmented, which gives the segmentation result of the road scene.
2. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 1, characterized in that: the thermal image is processed in advance into a three-channel image, like the road scene image, by HHA encoding.
3. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 1, characterized in that: in the testing stage, the original road scene image to be segmented, denoted {S(x', y')}, comprises the road scene image to be segmented and its corresponding thermal image; wherein 1 ≤ x' ≤ W', 1 ≤ y' ≤ H', W' and H' denote the width and height of the original road scene image to be segmented, and S(x', y') denotes the pixel value of the pixel at coordinate position (x', y') in the original road scene image to be segmented.
4. The method for multi-modal feature fusion road scene semantic segmentation under the complex environment according to claim 1, characterized in that: the convolutional neural network comprises an input layer, a hidden layer and an output layer;
the hidden layer comprises a 1 st neural network block, a 2 nd neural network block, a 3 rd neural network block, a 4 th neural network block, a 5 th neural network block, a 6 th neural network block, a 7 th neural network block, an 8 th neural network block, a 9 th neural network block, a 10 th neural network block, an FFM (Feature Fusion Module), a 1 st transition convolution layer, a 2 nd transition convolution layer, a 1 st decoding block, a 2 nd decoding block and a 3 rd decoding block which are sequentially arranged; in the hidden layer, the road scene image and the thermal image are respectively input into the 6 th neural network block and the 1 st neural network block, the 6 th neural network block, the 7 th neural network block, the 8 th neural network block, the 9 th neural network block and the 10 th neural network block are sequentially connected, the 1 st neural network block, the 2 nd neural network block, the 3 rd neural network block, the 4 th neural network block and the 5 th neural network block are sequentially connected, the output of the 1 st neural network block and the output of the 6 th neural network block are added and then input into the 7 th neural network block, the output of the 2 nd neural network block and the output of the 7 th neural network block are added and then input into the 8 th neural network block, the output of the 3 rd neural network block and the output of the 8 th neural network block are added and then input into the 9 th neural network block, and the output of the 4 th neural network block and the output of the 9 th neural network block are added and then input into the 10 th neural network block; the 10 th neural network block is sequentially connected with the output of the 5 th neural network block, the output of the 5 th neural network block is added by the addition block and then input, together with the output of the 8 th neural network block and the output of the 9 th neural network block, to the FFM feature fusion module, the output of the FFM feature fusion module is input to the 1 st decoding block, the 1 st decoding block, the 2 nd decoding block and the 3 rd decoding block are sequentially connected, the output of the 6 th neural network block is input to the 3 rd decoding block through the 2 nd transition convolution layer, the output of the 7 th neural network block is input to the 2 nd decoding block through the 1 st transition convolution layer, and the output of the 3 rd decoding block is used as the output of the hidden layer.
5. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 4, characterized in that: in the hidden layer,
the 1 st neural network block mainly comprises a first convolution layer and a first activation layer in sequence;
the 2 nd neural network block mainly comprises a first pooling layer and a first dense convolution block, wherein the first dense convolution block comprises five continuous dense convolution layers;
the 3 rd neural network block mainly comprises a first transition convolution block and a second dense convolution block, wherein the first transition convolution block is formed by an activation layer with an activation mode of Relu, and the second dense convolution block is formed by sequentially connecting twelve continuous dense convolution layers with the same structure;
the 4 th neural network block mainly comprises a second transition convolution block and a third dense convolution block, wherein the second transition convolution block is formed by an activation layer with an activation mode of Relu, and the third dense convolution block is formed by sequentially connecting thirty-six continuous dense convolution layers with the same structure;
the 5 th neural network block mainly comprises a third transition convolution block and a fourth dense convolution block, wherein the third transition convolution block is formed by an activation layer with an activation mode of Relu, and the fourth dense convolution block is formed by sequentially connecting twenty-four continuous dense convolution layers with the same structure;
the corresponding structures of the 6 th to 10 th neural network blocks are the same as those of the 1 st to 5 th neural network blocks.
6. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 5, characterized in that: the dense convolution layers all have the same structure, each formed by sequentially connecting two consecutive activation layers and two consecutive convolution layers.
7. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 4, characterized in that: the FFM feature fusion module mainly comprises a first convolution layer with a convolution kernel size of 3 × 3 and two second convolution layers with a convolution kernel size of 1 × 1; the outputs from the 7 th neural network block, the 8 th neural network block and the addition block are input into the first convolution layer after a pixel superposition operation, the output of the first convolution layer is input into the two second convolution layers respectively, and the outputs of the two second convolution layers are combined by a pixel-wise multiplication operation to form the output of the FFM feature fusion module.
8. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 4, characterized in that: the 1 st transition convolution layer and the 2 nd transition convolution layer are each formed by a single convolution layer.
9. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 4, characterized in that: the 1 st decoding block, the 2 nd decoding block and the 3 rd decoding block have the same structure, and each comprises two deconvolution modules and two dilated convolution layers; each deconvolution module comprises a deconvolution layer, a normalization layer and an activation function layer connected in sequence; the input of the decoding block is fed into both deconvolution modules, the outputs of the two deconvolution modules are superimposed and then processed and output by the first dilated convolution layer, the input of the decoding block is processed and output by the second dilated convolution layer, and the outputs of the two dilated convolution layers are added to form the output of the decoding block.
CN202110025132.XA 2021-01-08 2021-01-08 Multi-modal feature fusion road scene semantic segmentation method in complex environment Pending CN112733934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110025132.XA CN112733934A (en) 2021-01-08 2021-01-08 Multi-modal feature fusion road scene semantic segmentation method in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110025132.XA CN112733934A (en) 2021-01-08 2021-01-08 Multi-modal feature fusion road scene semantic segmentation method in complex environment

Publications (1)

Publication Number Publication Date
CN112733934A true CN112733934A (en) 2021-04-30

Family

ID=75589810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110025132.XA Pending CN112733934A (en) 2021-01-08 2021-01-08 Multi-modal feature fusion road scene semantic segmentation method in complex environment

Country Status (1)

Country Link
CN (1) CN112733934A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984856A (en) * 2022-12-05 2023-04-18 百度(中国)有限公司 Training method of document image correction model and document image correction method

Similar Documents

Publication Publication Date Title
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
Fan et al. Learning collision-free space detection from stereo images: Homography matrix brings better data augmentation
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN113807355B (en) Image semantic segmentation method based on coding and decoding structure
CN109635662B (en) Road scene semantic segmentation method based on convolutional neural network
CN105701508A (en) Global-local optimization model based on multistage convolution neural network and significant detection algorithm
CN111401436A (en) Streetscape image segmentation method fusing network and two-channel attention mechanism
CN111696110B (en) Scene segmentation method and system
CN109753959B (en) Road traffic sign detection method based on self-adaptive multi-scale feature fusion
CN110956119B (en) Method for detecting target in image
CN114223019A (en) Feedback decoder for parameter efficient semantic image segmentation
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN112861727A (en) Real-time semantic segmentation method based on mixed depth separable convolution
CN111860411A (en) Road scene semantic segmentation method based on attention residual error learning
Wang et al. Global perception-based robust parking space detection using a low-cost camera
CN115273032A (en) Traffic sign recognition method, apparatus, device and medium
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
Xing et al. MABNet: a lightweight stereo network based on multibranch adjustable bottleneck module
CN112733934A (en) Multi-modal feature fusion road scene semantic segmentation method in complex environment
CN117037119A (en) Road target detection method and system based on improved YOLOv8
CN112446292B (en) 2D image salient object detection method and system
CN111626298A (en) Real-time image semantic segmentation device and segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination