CN112733934A - Multi-modal feature fusion road scene semantic segmentation method in complex environment - Google Patents
- Publication number
- CN112733934A (application number CN202110025132.XA)
- Authority
- CN
- China
- Prior art keywords
- neural network
- block
- road scene
- convolution
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
Abstract
The invention discloses a multi-modal feature fusion road scene semantic segmentation method for complex environments. Road scene images, thermal images and segmentation label images are selected to form a training set; a convolutional neural network is constructed; the training set is input into the convolutional neural network for training to obtain semantic segmentation result images, and after training the loss function value between the set formed by these result images and the set formed by the segmentation label images corresponding to all road scene images is calculated; the training step is repeated to obtain the optimal parameters corresponding to the minimum loss function value; the multi-channel components of the original road scene image to be segmented are then input, and prediction with the optimal parameters yields the prediction image of the original road scene image to be segmented, giving the segmentation result. The invention applies novel modules to optimize the decoding of feature maps, combines hierarchical and multi-modal information fusion, and ultimately improves the efficiency and accuracy of the road scene semantic segmentation task.
Description
Technical Field
The invention relates to a deep learning semantic segmentation method, in particular to a multi-modal feature fusion road scene semantic segmentation method in a complex environment.
Background
With the rapid economic development of China, living standards have risen continuously and vehicles have become an indispensable means of transportation. As the number of vehicles grows, problems such as road congestion and traffic accidents cause great trouble for travel. The concept of intelligent transportation has therefore emerged to alleviate these problems. Intelligent transportation allocates and controls vehicles on the road at a macroscopic level through the cross-development of the Internet of Things, electronic technology, control technology and related fields, so as to relieve current traffic pressure. It plays a very important role in improving traffic management efficiency, relieving traffic congestion, reducing environmental pollution and ensuring traffic safety. The intelligent driver assistance system is an important component of intelligent transportation, and unmanned driving is its ultimate goal. Road semantic segmentation is a necessary link in the vehicle's perception of the external environment, a core technology for realizing unmanned driving, and an important branch of computer vision and image processing.
Traditional semantic segmentation methods mostly rely on manual features and prior knowledge, which greatly limits research. Manual features are feature pixels obtained through hand-designed computation, and prior knowledge resembles general human cognition. Traditional methods can solve the semantic segmentation task to a certain extent, but they still cannot properly understand complex, tiny or occluded scenes, and cannot segment such objects accurately.
In recent years, the advent of deep learning has revolutionized computer vision. Unlike traditional methods, deep learning extracts features from training samples through the autonomous learning of a convolutional neural network instead of relying on manual features and prior knowledge. In addition, sufficient data is the basic guarantee of scientific research, and compared with traditional methods, deep learning can more easily exploit large manually labeled data sets. More importantly, deep learning algorithms can process images in parallel on the GPU, which greatly improves learning efficiency and prediction capability.
Most current semantic segmentation methods adopt deep learning, i.e. convolution operations combined with pooling operations, fully convolutional networks and so on, using a deep convolutional neural network to autonomously learn and extract feature information from images. However, simply using existing neural networks is not enough to meet the high-precision requirements of semantic segmentation, because some operations, including pooling, lose part of the feature information in the image, and the resulting segmentation maps are poor; deeper research is therefore needed to optimize the model.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides a multi-modal feature fusion road scene semantic segmentation method in a complex environment, which has high detection efficiency and high detection accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select Q original road scene images, their thermal images and the corresponding segmentation label images to form a training set; denote the kth original road scene image in the training set as {Rk(x, y)}, the corresponding thermal image as {Tk(x, y)}, and the corresponding segmentation label image as {Gk(x, y)}; the road scene image is an RGB color image.
Because the original road scene image, i.e. the RGB color image, has three channels while the thermal image has only one channel, each thermal image in the training set is processed into a three-channel image, like the road scene image, using HHA encoding (horizontal disparity, height above ground, and the angle of the local surface normal with the gravity direction); the set of the resulting three-channel image components is denoted Jk.
Here Rk(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the road scene image, Tk(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the thermal image, and Gk(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the segmentation label image.
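As a rough illustration of the channel expansion described above, the sketch below converts a single-channel thermal image into three channels. A genuine HHA encoding requires geometric information, so plain normalization plus replication is used here as a simplified, assumed stand-in, not the patent's actual encoding.

```python
import numpy as np

def thermal_to_three_channels(thermal: np.ndarray) -> np.ndarray:
    """Expand a single-channel H x W thermal image to H x W x 3.

    Simplified stand-in for the patent's three-channel encoding:
    normalize to [0, 1] and replicate the channel three times.
    """
    t = thermal.astype(np.float64)
    t = (t - t.min()) / (t.max() - t.min() + 1e-8)  # normalize to [0, 1]
    return np.stack([t, t, t], axis=-1)

thermal = np.arange(6, dtype=np.float64).reshape(2, 3)  # toy 2 x 3 thermal image
rgb_like = thermal_to_three_channels(thermal)
print(rgb_like.shape)  # (2, 3, 3)
```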
Step 1_ 2: constructing a convolutional neural network:
step 1_ 3: inputting each road scene image of the original road scene in the training set and the corresponding thermal image into a convolutional neural network for training to obtain a semantic segmentation result image, and finishing the training to obtain all the road scene imagesThe set formed by the semantic segmentation result graph correspondingly obtained from the road scene image is recorded as
Step 1_ 4: set formed by semantic segmentation result graph obtained by computational trainingSet of split label images corresponding to all road scene imagesValue of the loss function in between, is recorded as
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for N times to obtain a convolutional neural network classification training model, and obtaining Q multiplied by N loss function values in total; then finding out the loss function value with the minimum value from the QXN loss function values, and taking the weight vector corresponding to the minimum loss function value and the bias item as the optimal weight vector W of the convolutional neural network classification training modelbestAnd an optimum bias term Bbest;
The test stage process comprises the following specific steps:
Input the R channel component, the G channel component and the B channel component of the original road scene image to be segmented into the convolutional neural network training model, and use the optimal weight vector Wbest and the optimal bias term Bbest for prediction to obtain the prediction image of the original road scene image to be segmented, i.e. the segmentation result of the road scene; the pixel value of the pixel with coordinates (x', y') in the prediction image is recorded accordingly.
The thermal image is processed in advance into a three-channel image, like the road scene image, using HHA encoding.
In the testing stage, the original road scene image to be segmented comprises the road scene image to be segmented and its corresponding thermal image, where 1 ≤ x' ≤ W' and 1 ≤ y' ≤ H', W' denotes the width of the original road scene image to be segmented, H' denotes its height, and S(x', y') denotes the pixel value of the pixel with coordinates (x', y') in the original road scene image to be segmented;
the convolutional neural network comprises an input layer, a hidden layer and an output layer;
the hidden layer comprises a 1st neural network block, a 2nd neural network block, a 3rd neural network block, a 4th neural network block, a 5th neural network block, a 6th neural network block, a 7th neural network block, an 8th neural network block, a 9th neural network block, a 10th neural network block, a feature fusion module (FFM), a 1st transition convolution layer, a 2nd transition convolution layer, a 1st decoding block, a 2nd decoding block and a 3rd decoding block, which are arranged in sequence;
for the input layer, the input end of the input layer receives the road scene image and the thermal image of the original road scene, the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the road scene image and the thermal image, and the output quantity of the input layer is the input quantity of the hidden layer.
The thermal image has three channels after being processed by the HHA coding mode, which is processed into three components after passing through the input layer, and the width and the height of the input original road scene image and the thermal image are W and H.
In the hidden layer, the road scene image and the thermal image are input into the 6th neural network block and the 1st neural network block respectively; the 6th, 7th, 8th, 9th and 10th neural network blocks are sequentially connected, and the 1st, 2nd, 3rd, 4th and 5th neural network blocks are sequentially connected.
the output of the 1 st neural network block and the output of the 6 th neural network block are added and then input into the 7 th neural network block, the output of the 2 nd neural network block and the output of the 7 th neural network block are added and then input into the 8 th neural network block, the output of the 3 rd neural network block and the output of the 8 th neural network block are added and then input into the 9 th neural network block, and the output of the 4 th neural network block and the output of the 9 th neural network block are added and then input into the 10 th neural network block; all 10 neural network blocks constitute the encoder.
The outputs of the 5th neural network block and the 10th neural network block are added by an addition block and then input to the FFM feature fusion module together with the outputs at the 8th and 9th neural network block stages; the output of the FFM feature fusion module is input to the 1st decoding block; the 1st decoding block, the 2nd decoding block and the 3rd decoding block are sequentially connected, and the 3 decoding blocks together form the decoder.
The output of the 6 th neural network block is input to the 3 rd decoding block through the 2 nd transition convolution layer, the output of the 7 th neural network block is input to the 2 nd decoding block through the 1 st transition convolution layer, and the output of the 3 rd decoding block is used as the output of the hidden layer.
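The stage-wise fusion described above, where the thermal branch's output is added element-wise to the RGB branch's output before the next RGB block, can be sketched with plain arrays. The block internals below are replaced by a toy stand-in that only halves the spatial size; this is an illustrative assumption, not the patent's actual layers.

```python
import numpy as np

def fuse_add(rgb_feat: np.ndarray, thermal_feat: np.ndarray) -> np.ndarray:
    """Element-wise addition of same-shape feature maps, as used to fuse
    the thermal branch into the RGB branch between encoder stages."""
    assert rgb_feat.shape == thermal_feat.shape
    return rgb_feat + thermal_feat

def stage(x: np.ndarray) -> np.ndarray:
    """Toy stand-in for one encoder stage: halve the spatial size so the
    two branches stay aligned for the addition."""
    return x[:, ::2, ::2]

rgb = np.ones((3, 8, 8))          # C x H x W road scene features
thermal = 2 * np.ones((3, 8, 8))  # C x H x W thermal features

r1, t1 = stage(rgb), stage(thermal)  # outputs of the 6th / 1st blocks
fused = fuse_add(r1, t1)             # input to the 7th block
print(fused.shape)  # (3, 4, 4)
```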
In the hidden layer:
the 1 st neural network block mainly comprises a first convolution layer and a first activation layer in sequence;
the 2 nd neural network block mainly comprises a first pooling layer and a first dense convolution block, wherein the first dense convolution block comprises five continuous dense convolution layers;
the 3rd neural network block mainly comprises a first transition convolution block and a second dense convolution block, wherein the first transition convolution block is formed by an activation layer with Relu activation, a convolution layer and an average pooling layer, and the second dense convolution block is formed by sequentially connecting twelve structurally identical dense convolution layers;
the 4th neural network block mainly comprises a second transition convolution block and a third dense convolution block, wherein the second transition convolution block is formed by an activation layer with Relu activation, a convolution layer and an average pooling layer, and the third dense convolution block is formed by sequentially connecting thirty-six structurally identical dense convolution layers;
the 5th neural network block mainly comprises a third transition convolution block and a fourth dense convolution block, wherein the third transition convolution block is formed by an activation layer with Relu activation, a convolution layer and an average pooling layer, and the fourth dense convolution block is formed by sequentially connecting twenty-four structurally identical dense convolution layers;
the corresponding structures of the 6 th to 10 th neural network blocks are the same as those of the 1 st to 5 th neural network blocks.
The dense convolution layers all have the same structure, each formed by sequentially connecting two activation layers and two convolution layers.
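The channel counts reported later for these blocks (96 → 384 → 768 → 2112 → 2208) follow DenseNet-style bookkeeping: each dense layer concatenates a fixed number of new feature maps, and each transition block halves the channel count. The sketch below assumes a growth rate of 48 and layer counts of 6/12/36/24 (the DenseNet-161 configuration); note that with growth 48, the stated 384 output maps of the 2nd block are consistent with six dense layers rather than the five stated above, so these counts are an assumption for illustration.

```python
def dense_stage_channels(c_in: int, num_layers: int, growth: int = 48,
                         transition: bool = True) -> int:
    """Channel count after an optional transition block (halves channels)
    followed by num_layers dense layers, each concatenating `growth`
    new feature maps (DenseNet-style bookkeeping)."""
    if transition:
        c_in //= 2
    return c_in + num_layers * growth

c = 96                                            # after the first conv layer
c = dense_stage_channels(c, 6, transition=False)  # 1st dense block -> 384
stages = [c]
for n in (12, 36, 24):                            # remaining dense blocks
    c = dense_stage_channels(c, n)
    stages.append(c)
print(stages)  # [384, 768, 2112, 2208]
```

These are exactly the feature-map counts P2 through P5 given in the detailed description below.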
As shown in fig. 2, the FFM feature fusion module mainly comprises a first convolution layer with a 3×3 convolution kernel and two second convolution layers with 1×1 convolution kernels. The outputs from the 7th neural network block, the 8th neural network block and the addition block are superposed (concatenated along the channel dimension) and fed into the first convolution layer, so the first convolution layer receives three inputs; the output of the first convolution layer is input into the two parallel second convolution layers, and the outputs of the two second convolution layers are multiplied pixel-wise to give the output of the FFM feature fusion module.
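A minimal numpy sketch of the FFM wiring just described: concatenate the input feature map sets along the channel dimension, apply a 3×3 convolution, feed the result to two parallel 1×1 convolutions, and multiply their outputs element-wise. The shapes and random weights are toy stand-ins, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x: np.ndarray, w: np.ndarray, pad: int) -> np.ndarray:
    """Naive 2-D convolution. x: (Cin, H, W); w: (Cout, Cin, k, k)."""
    cout, cin, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((cout, h, wd))
    for o in range(cout):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def ffm(inputs, w3, w1a, w1b):
    """FFM sketch: channel concat -> 3x3 conv -> two parallel 1x1 convs
    -> element-wise multiplication of the two branch outputs."""
    x = np.concatenate(inputs, axis=0)  # concatenate along channels
    y = conv2d(x, w3, pad=1)            # 3x3 convolution, padding 1
    a = conv2d(y, w1a, pad=0)           # two parallel 1x1 convolutions
    b = conv2d(y, w1b, pad=0)
    return a * b                        # pixel-wise multiplication

g3 = rng.standard_normal((2, 4, 4))     # toy stand-ins for the three inputs
g4 = rng.standard_normal((2, 4, 4))
g5 = rng.standard_normal((2, 4, 4))
out = ffm([g3, g4, g5],
          w3=rng.standard_normal((3, 6, 3, 3)),
          w1a=rng.standard_normal((3, 3, 1, 1)),
          w1b=rng.standard_normal((3, 3, 1, 1)))
print(out.shape)  # (3, 4, 4)
```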
The 1st transition convolution layer and the 2nd transition convolution layer each consist of a single convolution layer.
As shown in fig. 3, the 1st decoding block, the 2nd decoding block and the 3rd decoding block have the same structure. Each decoding block comprises two deconvolution modules and two dilated convolution layers; each deconvolution module comprises a deconvolution layer, a normalization layer and an activation function layer connected in sequence. The input of the decoding block is fed into the two deconvolution modules respectively; the outputs of the two deconvolution modules are superposed and processed by the first dilated convolution layer, the input of the decoding block is processed by the second dilated convolution layer, and the outputs of the two dilated convolution layers are added to give the output of the decoding block.
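A quick size check on the decoding block using the standard output-size formulas. The detailed description lists a stride of 1 for the 4×4 deconvolution, but a stride of 2 is what makes each decoding block double the spatial resolution as the stated output sizes require, so stride 2 is assumed here; the 3×3 dilated convolutions use the padding/dilation pairs (1, 1) and (2, 2) given below.

```python
def deconv_out(n: int, k: int = 4, stride: int = 2, pad: int = 1) -> int:
    """Output size of a transposed convolution (deconvolution)."""
    return (n - 1) * stride - 2 * pad + k

def dilated_conv_out(n: int, k: int = 3, stride: int = 1, pad: int = 1,
                     dilation: int = 1) -> int:
    """Output size of a dilated convolution."""
    eff_k = dilation * (k - 1) + 1  # effective kernel size
    return (n + 2 * pad - eff_k) // stride + 1

n = 28                    # e.g. a W/8 = 224/8 feature map entering a decoding block
n_up = deconv_out(n)      # 4x4 deconvolution, stride 2, padding 1 -> doubles size
same1 = dilated_conv_out(n_up, pad=1, dilation=1)  # rate-1 branch keeps size
same2 = dilated_conv_out(n_up, pad=2, dilation=2)  # rate-2 branch keeps size
print(n_up, same1, same2)  # 56 56 56
```

So each decoding block upsamples by exactly 2× while the two dilated branches stay size-aligned for the final addition.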
For the output layer, the output of the 3rd decoding block is used to produce a segmentation result map with width W and height H.
Compared with the prior art, the invention has the advantages that:
1) The method uses a convolutional neural network to extract features from the road scene image and the thermal image, and fuses the information from these two modalities to obtain an efficient, high-precision semantic segmentation result map.
2) The method proposes an effective feature fusion module, FFM (Feature Fusion Module), which takes feature maps as input and fuses them effectively to obtain richer feature information, helping the decoding stage produce a more accurate segmentation result map.
3) The method proposes a decoding block DB (Decoder Block) structure, which takes the feature maps processed by the feature fusion module as input and decodes them with a dedicated decoding structure, gradually restoring the image resolution until it is consistent with the input image and yielding a high-precision segmentation result map.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a block diagram of a feature fusion module FFM in the method of the present invention;
FIG. 3 is a block diagram of a decoding block DB in the method of the present invention;
FIG. 4a is a segmentation label image corresponding to the 1 st original road scene image of the same scene;
FIG. 4b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 4a by using the method of the present invention;
FIG. 5a is a segmentation label image corresponding to the 2 nd original road scene image of the same scene;
FIG. 5b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 5a by using the method of the present invention;
FIG. 6a is a segmentation label image corresponding to the 3 rd original road scene image of the same scene;
FIG. 6b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 6a by using the method of the present invention;
FIG. 7a is a segmentation label image corresponding to the 4 th original road scene image of the same scene;
FIG. 7b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 7a using the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The overall implementation block diagram of the multi-modal feature fusion road scene semantic segmentation method under the complex environment is shown in FIG. 1, and the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1_1: select Q original road scene images, thermal images and the corresponding segmentation label images to form the training set; denote the kth original road scene image in the training set as {Rk(x, y)}, the corresponding thermal image as {Tk(x, y)}, and the corresponding segmentation label image as {Gk(x, y)}. Because the original road scene image, i.e. the RGB color image, has three channels while the thermal image has only one channel, each thermal image in the training set is processed into a three-channel image, like the road scene image, using HHA encoding (horizontal disparity, height above ground, and the angle of the local surface normal with the gravity direction); the set of the resulting three-channel image components is denoted Jk. Here Q is a positive integer with Q ≥ 200, e.g. Q = 1569; k is a positive integer with 1 ≤ k ≤ Q; 1 ≤ x ≤ W and 1 ≤ y ≤ H, where W denotes the width and H the height of the original road scene image and thermal image, e.g. W = 224 and H = 224; Rk(x, y) denotes the pixel value of the pixel with coordinates (x, y) in the road scene image, Tk(x, y) that in the thermal image, and Gk(x, y) that in the segmentation label image. The data set used in this experiment is taken directly from the public road scene semantic segmentation data set of Ha et al., which contains 1569 road scene image / thermal image pairs.
Step 1_2: construct a convolutional neural network. The convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises a 1st neural network block, a 2nd neural network block, a 3rd neural network block, a 4th neural network block, a 5th neural network block, a 6th neural network block, a 7th neural network block, an 8th neural network block, a 9th neural network block, a 10th neural network block, a feature fusion module (FFM), a 1st transition convolution layer, a 2nd transition convolution layer, a 1st decoding block, a 2nd decoding block and a 3rd decoding block, which are arranged in sequence.
For the input layer, its input end receives the road scene image and the thermal image of the original road scene, and its output end outputs the R channel component, the G channel component and the B channel component of the original input image; the output of the input layer is the input of the hidden layer. After processing by the HHA coding mode, the thermal image has three channels like the road scene image, i.e. it is split into three components after passing through the input layer; the width and height of the input original road scene image and thermal image are W and H respectively.
For the hidden layer:
The 1st neural network block consists of a first convolution layer and a first activation layer. The input of the 1st neural network block is the three-channel thermal image obtained by HHA processing; 96 feature maps are output after processing by the 1st neural network block, and the set formed by these 96 feature maps is denoted P1. The convolution kernel size (kernel_size) of the first convolution layer is 7×7, the number of convolution kernels (filters) is 96, the step size (stride) is 2, and the zero-padding parameter (padding) is 3; the activation mode of the first activation layer is "Relu". Each feature map in P1 has width W/2 and height H/2.
The 2nd neural network block consists of a first pooling layer and a first dense convolution block. The input of the 2nd neural network block is the 96 feature maps in P1; 384 feature maps are output after processing by the 2nd neural network block, and the set formed by these 384 feature maps is denoted P2. The pooling size of the first maximum pooling layer is 3, the stride is 2, and the zero-padding parameter is 1. The first dense convolution block is composed of five dense convolution layers with the same structure, each composed of two activation layers and two convolution layers; the activation layers all use "Relu" activation, the first of the two convolution layers has a 1×1 convolution kernel with stride 1, and the second has a 3×3 convolution kernel with stride 1 and zero-padding parameter 1. Each feature map in P2 has width W/4 and height H/4.
The 3rd neural network block consists of a first transition convolution block and a second dense convolution block. The input of the 3rd neural network block is the 384 feature maps in P2; 768 feature maps are output after processing by the 3rd neural network block, and the set formed by these 768 feature maps is denoted P3. The first transition convolution block is composed of an activation layer with "Relu" activation, a convolution layer with a 1×1 convolution kernel and stride 1, and an average pooling layer with pooling size 2 and stride 2. The second dense convolution block is composed of twelve dense convolution layers with the same structure; the dense convolution layer structure is the same as described above, as are the dense convolution layers mentioned below, and is not repeated. Each feature map in P3 has width W/8 and height H/8.
The 4th neural network block consists of a second transition convolution block and a third dense convolution block. The input of the 4th neural network block is the 768 feature maps in P3; 2112 feature maps are output after processing by the 4th neural network block, and the set formed by these 2112 feature maps is denoted P4. The second transition convolution block is composed of an activation layer with "Relu" activation, a convolution layer with a 1×1 convolution kernel and stride 1, and an average pooling layer with pooling size 2 and stride 2; the third dense convolution block is composed of thirty-six dense convolution layers with the same structure. Each feature map in P4 has width W/16 and height H/16.
The 5th neural network block consists of a third transition convolution block and a fourth dense convolution block. The input of the 5th neural network block is the 2112 feature maps in P4; 2208 feature maps are output after processing by the 5th neural network block, and the set formed by these 2208 feature maps is denoted P5. The third transition convolution block is composed of an activation layer with "Relu" activation, a convolution layer with a 1×1 convolution kernel and stride 1, and an average pooling layer with pooling size 2 and stride 2; the fourth dense convolution block is composed of twenty-four dense convolution layers with the same structure. Each feature map in P5 has width W/32 and height H/32.
The 6th to 10th neural network blocks correspond to the 1st to 5th neural network blocks and have the same structures, which are not repeated here. The input of the 6th neural network block is the three-channel road scene image; 96 feature maps are output after processing by the 6th neural network block, and the set formed by them is denoted T1; each feature map in T1 has width W/2 and height H/2. The input of the 7th neural network block is the feature map set G1 obtained by adding the 96 feature maps in T1 and the 96 feature maps in P1; 384 feature maps are output after processing by the 7th neural network block, and the set formed by them is denoted T2; each feature map in T2 has width W/4 and height H/4. The input of the 8th neural network block is the feature map set G2 obtained by adding the 384 feature maps in T2 and the 384 feature maps in P2; 768 feature maps are output after processing by the 8th neural network block, and the set formed by them is denoted T3; each feature map in T3 has width W/8 and height H/8. The input of the 9th neural network block is the feature map set G3 obtained by adding the 768 feature maps in T3 and the 768 feature maps in P3; 2112 feature maps are output after processing by the 9th neural network block, and the set formed by them is denoted T4; each feature map in T4 has width W/16 and height H/16. The input of the 10th neural network block is the feature map set G4 obtained by adding the 2112 feature maps in T4 and the 2112 feature maps in P4; 2208 feature maps are output after processing by the 10th neural network block, and the set formed by them is denoted T5; each feature map in T5 has width W/32 and height H/32. In addition, the feature map set obtained by adding the 2208 feature maps in T5 and the 2208 feature maps in P5 is denoted G5.
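The encoder resolutions can be verified with the standard output-size formula, using the layer parameters listed above (a 7×7 convolution with stride 2 and padding 3, a 3×3 max pooling with stride 2 and padding 1, then three 2×2 average poolings with stride 2 in the transition blocks) and the example size W = H = 224:

```python
def conv_out(n: int, k: int, stride: int, pad: int) -> int:
    """Standard convolution/pooling output-size formula."""
    return (n + 2 * pad - k) // stride + 1

W = 224
p1 = conv_out(W, k=7, stride=2, pad=3)   # first conv layer          -> W/2
p2 = conv_out(p1, k=3, stride=2, pad=1)  # 3x3 max pooling, stride 2 -> W/4
p3 = conv_out(p2, k=2, stride=2, pad=0)  # transition avg pooling    -> W/8
p4 = conv_out(p3, k=2, stride=2, pad=0)  # transition avg pooling    -> W/16
p5 = conv_out(p4, k=2, stride=2, pad=0)  # transition avg pooling    -> W/32
print([p1, p2, p3, p4, p5])  # [112, 56, 28, 14, 7]
```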
For the FFM (Feature Fusion Module), the inputs of the module are G3, G4 and G5, where the feature maps in G4 and G5 are first processed by bilinear interpolation to the same resolution as the feature maps in G3, i.e. the feature map sizes are unified; the feature map set obtained by concatenating them is denoted G6. G6 comprises 5088 feature maps; 1272 feature maps are output after FFM processing, and the set formed by them is denoted G7. The FFM is composed of a convolution with a 3×3 kernel, stride 1 and zero-padding parameter 1, and two convolutions with 1×1 kernels and stride 1, where the two 1×1 convolutions are arranged in parallel. Each feature map in G6 has width W/8 and height H/8.
For the 1st transition convolution layer, the input is the 96 feature maps in G1; 159 feature maps are output after processing by this convolution layer and are denoted C1. It consists of a convolution with a 3×3 kernel, stride 1 and zero-padding parameter 1; the convolution layer does not change the resolution of the image, i.e. each feature map processed by this convolution layer has width W/2 and height H/2.
For the 2nd transition convolution layer, the input is the 384 feature maps in G2; 318 feature maps are output after processing by this convolution layer and are denoted C2. It consists of a convolution with a 3×3 kernel, stride 1 and zero-padding parameter 1; the convolution layer does not change the resolution of the image, i.e. each feature map processed by this convolution layer has width W/4 and height H/4.
For the 1st decoding block, the input is the 1272 feature maps in G7; 636 feature maps are obtained after processing by the 1st decoding block, and these 636 feature maps are denoted H1. The block consists of two deconvolutions and two dilated convolutions: the two deconvolutions have the same structure, namely kernel size 4 × 4, stride 2 and zero-padding (padding) value 1; the two dilated convolutions have kernel size 3 × 3 and stride 1, with dilation rates of 1 and 2 and corresponding zero-padding (padding) values of 1 and 2, respectively. Each feature map after the 1st decoding block has a width of W/4 and a height of H/4.
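The decoding block can be sketched in PyTorch as follows. Stride 2 for the 4 × 4 deconvolutions is an assumption made so that the resolution doubles as the decoder chain requires, and the exact wiring of the second dilated convolution is ambiguous in the text, so here both dilated convolutions act on the summed deconvolution output; all names are mine:

```python
import torch
import torch.nn as nn

class DecodeBlock(nn.Module):
    """Decoding block sketch: two identical deconvolution branches
    (deconv + BN + ReLU per claim 9), summed, then refined by two
    dilated 3x3 convolutions (rates 1 and 2) whose outputs are added."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        def deconv_branch():
            return nn.Sequential(
                nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
        self.branch_a = deconv_branch()
        self.branch_b = deconv_branch()
        self.dil1 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, dilation=1)
        self.dil2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=2, dilation=2)

    def forward(self, x):
        up = self.branch_a(x) + self.branch_b(x)  # superimpose the two branches
        return self.dil1(up) + self.dil2(up)

decode1 = DecodeBlock(1272, 636)
g7 = torch.randn(1, 1272, 32, 32)      # FFM output at 1/8 resolution
print(decode1(g7).shape)  # torch.Size([1, 636, 64, 64])
```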
For the 2nd decoding block, the input is the feature map set, denoted C3, obtained by adding the 318 feature maps in H1 and the 318 feature maps in C2; 159 feature maps are obtained after processing by the 2nd decoding block, and these 159 feature maps are denoted H2. The structures of the 2nd and 3rd decoding blocks are the same as that of the 1st decoding block and are not described again here. Each feature map after the 2nd decoding block has a width of W/2 and a height of H/2.
For the 3rd decoding block, the input is the feature map set obtained by adding the 159 feature maps in H2 and the 159 feature maps in C1; 1 feature map is obtained after processing by the 3rd decoding block, and this feature map is denoted H3. Each feature map after the 3rd decoding block has a width of W and a height of H.
For the output layer, the input is the output of the 3rd decoding block, and a segmentation result map with width W and height H is output.
Step 1_3: taking the road scene images and the thermal images of the original road scenes in the training set as input, inputting them into the convolutional neural network for training to obtain semantic segmentation result maps, and recording the set formed by the semantic segmentation result maps obtained after training.
Step 1_4: calculating the value of the loss function between the set formed by the semantic segmentation result maps obtained by training and the set formed by the corresponding segmentation label images.
Step 1_5: repeatedly executing step 1_3 and step 1_4 N times to obtain the convolutional neural network classification training model and Q × N loss function values in total; then finding the loss function value with the smallest value among the Q × N loss function values; the weight vector and the bias term corresponding to this minimum loss function value are taken as the optimal weight vector and optimal bias term of the convolutional neural network classification training model, correspondingly denoted Wbest and Bbest, where N > 1; in this experiment the value of N is 100.
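Steps 1_3 through 1_5 amount to a standard training loop that tracks all Q × N loss values and keeps the parameters that produced the smallest one (Wbest, Bbest). A sketch under stated assumptions — the cross-entropy loss and Adam optimizer are my choices; the patent only speaks of "the loss function":

```python
import copy
import torch
import torch.nn as nn

def train_keep_best(model, loader, epochs, lr=1e-3):
    """Train for N epochs over Q samples, collecting Q x N loss values,
    and keep the weights/biases that gave the smallest loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # assumed loss function
    best_loss, best_state = float('inf'), None
    for epoch in range(epochs):              # N repetitions (step 1_5)
        for images, labels in loader:        # Q training pairs (step 1_3)
            opt.zero_grad()
            loss = criterion(model(images), labels)  # step 1_4
            loss.backward()
            opt.step()
            if loss.item() < best_loss:      # track the minimum loss value
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss

# Tiny stand-in model and data (2 classes, 8x8 images) for illustration:
model = nn.Conv2d(3, 2, kernel_size=1)
data = [(torch.randn(2, 3, 8, 8), torch.randint(0, 2, (2, 8, 8))) for _ in range(3)]
state, loss = train_keep_best(model, data, epochs=2)
```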
The test stage process comprises the following specific steps:
Step 2_1: for the original road scene image to be segmented and its corresponding thermal image, let S(x', y') denote the pixel value of the pixel point at coordinate position (x', y').
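The test stage then loads the optimal weights and biases and runs a single forward pass to obtain the segmentation result. An illustrative sketch with a stand-in two-stream model (the model class, its two-input forward signature, and the class count are all assumptions):

```python
import torch

@torch.no_grad()
def segment(model, rgb, thermal, best_state):
    """Load the optimal parameters (Wbest, Bbest) and predict a
    per-pixel class map for one image pair to be segmented."""
    model.load_state_dict(best_state)
    model.eval()
    logits = model(rgb, thermal)   # assumed two-input forward pass
    return logits.argmax(dim=1)    # class index per pixel

class TwoStreamStub(torch.nn.Module):
    """Stand-in for the trained fusion network (not the patent's model)."""
    def __init__(self, n_classes=9):
        super().__init__()
        self.head = torch.nn.Conv2d(6, n_classes, kernel_size=1)
    def forward(self, rgb, thermal):
        return self.head(torch.cat([rgb, thermal], dim=1))

net = TwoStreamStub()
pred = segment(net, torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32),
               net.state_dict())
print(pred.shape)  # torch.Size([1, 32, 32])
```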
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed. A multi-modal feature fusion convolutional neural network architecture based on an encoding-decoding structure was built using the Python deep learning library PyTorch 1.1.0. A public dataset for the semantic segmentation task (1569 pairs of road scene images and thermal images) was used to analyze the segmentation result images produced by the method of the invention. In this experiment, 2 common objective parameters for evaluating segmentation methods were used as evaluation indexes: the Accuracy (Acc) of each class and the Intersection over Union (IoU) of each class, which evaluate the segmentation precision of the segmentation result images.
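Both evaluation indexes can be computed from a confusion matrix; here per-class Accuracy is taken as the per-class recall, a common convention that the text does not spell out:

```python
import numpy as np

def per_class_acc_iou(pred, label, n_classes):
    """Per-class Accuracy (Acc) and Intersection over Union (IoU),
    computed from the confusion matrix of predicted vs. true labels."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for p, t in zip(pred.ravel(), label.ravel()):
        cm[t, p] += 1                      # rows: true class, cols: predicted
    tp = np.diag(cm).astype(float)
    acc = tp / np.maximum(cm.sum(axis=1), 1)                       # per-class recall
    iou = tp / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - tp, 1) # |A∩B| / |A∪B|
    return acc, iou

# Toy 2x2 prediction vs. label, 2 classes:
pred  = np.array([[0, 0], [1, 1]])
label = np.array([[0, 1], [1, 1]])
acc, iou = per_class_acc_iou(pred, label, 2)
print(acc, iou)  # acc ≈ [1.0, 0.667], iou ≈ [0.5, 0.667]
```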
Each road scene image in the dataset is segmented by the method of the invention to obtain the segmentation result map corresponding to each road scene image; the segmentation performance of the method of the invention, namely the Accuracy (Acc) of each class and the Intersection over Union (IoU) of each class, is listed in Table 1.
TABLE 1 evaluation results on test sets using the method of the invention
As can be seen from the data listed in Table 1, the first sub-table reflects the Accuracy (Acc) of each class and the second sub-table reflects the Intersection over Union (IoU) of each class, with the best value of each metric shown in bold; the segmentation result maps obtained by the method of the invention therefore achieve good segmentation results.
FIG. 4a shows a segmentation label image corresponding to the 1 st original road scene image of the same scene in the data set selected for the experiment; FIG. 4b is a diagram showing a segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 4a by using the method of the present invention; FIG. 5a shows a segmentation label image corresponding to the 2 nd original road scene image of the same scene in the data set selected for the experiment; FIG. 5b is a diagram showing a segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 5a by using the method of the present invention; FIG. 6a shows a segmentation label image corresponding to the 3 rd original road scene image of the same scene in the data set selected for the experiment; FIG. 6b is a diagram showing the segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 6a by using the method of the present invention; FIG. 7a shows a segmentation label image corresponding to the 4 th original road scene image of the same scene in the data set selected for the experiment; FIG. 7b is a diagram showing the segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 7a by using the method of the present invention. Comparing fig. 4a and 4b, fig. 5a and 5b, fig. 6a and 6b, and fig. 7a and 7b, it can be seen that the segmentation accuracy of the segmentation result graph obtained by the method of the present invention is higher.
It can thus be seen that the method of the invention optimizes the decoding of the feature images by applying a novel module and, by combining hierarchical and multi-modal information fusion, ultimately improves both the efficiency and the accuracy of the road scene semantic segmentation task.
Claims (9)
1. A multi-modal feature fusion road scene semantic segmentation method in a complex environment is characterized by comprising the following steps: the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q original road scene images, thermal images and corresponding segmentation label images to form a training set, and recording the kth original road scene image in the training set asThe thermal image is recorded asThe corresponding segmentation label image is noted
Step 1_ 2: constructing a convolutional neural network:
step 1_ 3: inputting each road scene image of the original road scene in the training set and the thermal image corresponding to the road scene image into a convolutional neural network for training to obtain a semantic segmentation result graph, and recording a set formed by the semantic segmentation result graphs correspondingly obtained from all the road scene images obtained after training as a set
Step 1_4: calculating the value of the loss function between the set formed by the semantic segmentation result graphs obtained by training and the set formed by the segmentation label images corresponding to all road scene images.
Step 1_5: repeatedly executing step 1_3 and step 1_4 N times to obtain the convolutional neural network classification training model and Q × N loss function values in total; then finding the loss function value with the smallest value among the Q × N loss function values, and taking the weight vector and the bias term corresponding to the minimum loss function value as the optimal weight vector Wbest and the optimal bias term Bbest of the convolutional neural network classification training model;
The test stage process comprises the following specific steps:
original road scene image to be segmentedInputting the R channel component, the G channel component and the B channel component into a convolutional neural network training model, and utilizing an optimal weight vector WbestAnd an optimum bias term BbestPredicting to obtain the original road scene image to be segmentedIs detectedSaliency detection imageWith the segmentation result of the road scene.
2. The method for multi-modal feature fusion road scene semantic segmentation in a complex environment according to claim 1, characterized in that: the thermal image is processed by one-hot coding into a three-channel image, the same as the road scene image.
3. The method for multi-modal feature fusion road scene semantic segmentation under the complex environment according to claim 1, characterized in that:
in the test stage, the original road scene image to be segmented comprises the road scene image to be segmented and the corresponding thermal image; wherein 1 ≤ x' ≤ W', 1 ≤ y' ≤ H', W' represents the width of the original road scene image to be segmented, H' represents the height of the original road scene image to be segmented, and S(x', y') represents the pixel value of the pixel point with coordinate position (x', y') in the original road scene image to be segmented.
4. The method for multi-modal feature fusion road scene semantic segmentation under the complex environment according to claim 1, characterized in that: the convolutional neural network comprises an input layer, a hidden layer and an output layer;
the hidden layer comprises a 1st neural network block, a 2nd neural network block, a 3rd neural network block, a 4th neural network block, a 5th neural network block, a 6th neural network block, a 7th neural network block, an 8th neural network block, a 9th neural network block, a 10th neural network block, an FFM (Feature Fusion Module) feature fusion module, a 1st transition convolution layer, a 2nd transition convolution layer, a 1st decoding block, a 2nd decoding block and a 3rd decoding block which are arranged in sequence; in the hidden layer, the road scene image and the thermal image are input into the 6th neural network block and the 1st neural network block respectively; the 6th to 10th neural network blocks are connected in sequence, and the 1st to 5th neural network blocks are connected in sequence; the output of the 1st neural network block and the output of the 6th neural network block are added and then input into the 7th neural network block; the output of the 2nd neural network block and the output of the 7th neural network block are added and then input into the 8th neural network block; the output of the 3rd neural network block and the output of the 8th neural network block are added and then input into the 9th neural network block; the output of the 4th neural network block and the output of the 9th neural network block are added and then input into the 10th neural network block; the output of the 10th neural network block and the output of the 5th neural network block are added by the addition block and then input to the FFM feature fusion module together with the output of the 8th neural network block and the output of the 9th neural network block; the output of the FFM feature fusion module is input to the 1st decoding block; the 1st decoding block, the 2nd decoding block and the 3rd decoding block are connected in sequence; the output of the 6th neural network block is input to the 3rd decoding block through the 2nd transition convolution layer; the output of the 7th neural network block is input to the 2nd decoding block through the 1st transition convolution layer; and the output of the 3rd decoding block serves as the output of the hidden layer.
5. The method for multi-modal feature fusion road scene semantic segmentation in a complex environment according to claim 4, characterized in that: in the hidden layer,
the 1 st neural network block mainly comprises a first convolution layer and a first activation layer in sequence;
the 2nd neural network block mainly comprises a first pooling layer and a first dense convolution block, wherein the first dense convolution block consists of six continuous dense convolution layers;
the 3 rd neural network block mainly comprises a first transition convolution block and a second dense convolution block, wherein the first transition convolution block is formed by an activation layer with an activation mode of Relu, and the second dense convolution block is formed by sequentially connecting twelve continuous dense convolution layers with the same structure;
the 4 th neural network block mainly comprises a second transition convolution block and a third dense convolution block, wherein the second transition convolution block is formed by an activation layer with an activation mode of Relu, and the third dense convolution block is formed by sequentially connecting thirty-six continuous dense convolution layers with the same structure;
the 5 th neural network block mainly comprises a third transition convolution block and a fourth dense convolution block, wherein the third transition convolution block is formed by an activation layer with an activation mode of Relu, and the fourth dense convolution block is formed by sequentially connecting twenty-four continuous dense convolution layers with the same structure;
the corresponding structures of the 6 th to 10 th neural network blocks are the same as those of the 1 st to 5 th neural network blocks.
6. The method according to claim 5, wherein the method for segmenting road scene semantics by multi-modal feature fusion in a complex environment is characterized in that: the dense convolution layers have the same structure and are formed by sequentially connecting two continuous active layers and two continuous convolution layers.
7. The method for multi-modal feature fusion road scene semantic segmentation in a complex environment according to claim 4, characterized in that: the FFM feature fusion module mainly comprises a first convolution layer with kernel size 3 × 3 and two second convolution layers with kernel size 1 × 1; the outputs from the 7th neural network block, the 8th neural network block and the addition block are input into the first convolution layer after a pixel superposition operation; the output of the first convolution layer is input into the two second convolution layers respectively; and the outputs of the two second convolution layers, after a pixel-wise multiplication operation, serve as the output of the FFM feature fusion module.
8. The method for multi-modal feature fusion road scene semantic segmentation in a complex environment according to claim 4, characterized in that: the 1st transition convolution layer and the 2nd transition convolution layer each consist of one convolution layer.
9. The method for multi-modal feature fusion road scene semantic segmentation in a complex environment according to claim 4, characterized in that: the 1st decoding block, the 2nd decoding block and the 3rd decoding block have the same structure, each comprising two deconvolution modules and two dilated convolution layers, wherein each deconvolution module comprises a deconvolution layer, a normalization layer and an activation function layer connected in sequence; the input of each decoding block is input into the two deconvolution modules respectively; the outputs of the two deconvolution modules are superimposed and then processed and output by the first dilated convolution layer; the input of each decoding block is also processed and output by the second dilated convolution layer; and the outputs of the two dilated convolution layers serve as the output of the decoding block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110025132.XA CN112733934A (en) | 2021-01-08 | 2021-01-08 | Multi-modal feature fusion road scene semantic segmentation method in complex environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112733934A true CN112733934A (en) | 2021-04-30 |
Family
ID=75589810
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115984856A (en) * | 2022-12-05 | 2023-04-18 | 百度(中国)有限公司 | Training method of document image correction model and document image correction method |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||