CN112733934A - Multi-modal feature fusion road scene semantic segmentation method in complex environment - Google Patents


Info

Publication number
CN112733934A
Authority
CN
China
Prior art keywords
neural network
block
road scene
convolution
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110025132.XA
Other languages
Chinese (zh)
Inventor
周武杰
刘文宇
雷景生
万健
甘兴利
钱小鸿
许彩娥
黄杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Zhejiang University of Science and Technology ZUST
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110025132.XA
Publication of CN112733934A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation

Abstract

The invention discloses a multi-modal feature fusion road scene semantic segmentation method for complex environments. Road scene images, thermal images and segmentation label images are selected to form a training set; a convolutional neural network is constructed; the training set is input into the convolutional neural network for training to obtain semantic segmentation result images, and after training the loss function value is calculated between the set formed by the semantic segmentation result images and the set formed by the segmentation label images corresponding to all road scene images; the training step is repeated to obtain the optimal parameters corresponding to the minimum loss function value; the multi-channel components of the original road scene image to be segmented are then input, and prediction with the optimal parameters yields a saliency detection image of the original road scene image to be segmented, giving the segmentation result. The invention optimizes the decoding of feature maps by applying novel modules and combines hierarchical and multi-modal information fusion, which ultimately improves the efficiency and accuracy of the road scene semantic segmentation task.

Description

Multi-modal feature fusion road scene semantic segmentation method in complex environment
Technical Field
The invention relates to a deep learning semantic segmentation method, in particular to a multi-modal feature fusion road scene semantic segmentation method in a complex environment.
Background
With the rapid economic development of China, people's living standards are continuously improving, and vehicles have become an indispensable means of transportation. As the number of vehicles grows, problems such as road congestion and traffic accidents cause great trouble for people's travel. The concept of intelligent transportation has therefore emerged to alleviate these problems. Intelligent transportation refers to the macroscopic allocation and control of vehicles on the road through the cross-development of the Internet of Things, electronic technology, control technology and related fields, so as to relieve current traffic pressure. Intelligent transportation plays a very important role in improving traffic management efficiency, relieving traffic congestion, reducing environmental pollution and ensuring traffic safety. Intelligent driver-assistance systems are an important component of intelligent transportation, and unmanned driving is its ultimate goal. Road semantic segmentation is a necessary link for a vehicle to perceive its external environment, a core technology for realizing unmanned driving, and an important branch of computer vision and image processing.
Traditional research methods for semantic segmentation mostly rely on manual features and prior knowledge, which greatly limits research. Manual features are feature pixels obtained by computation, and prior knowledge is similar to people's general cognition. Traditional methods can solve the semantic segmentation vision task to a certain extent, but they still cannot understand complex, tiny or occluded scenes well and cannot segment such objects accurately.
In recent years, the advent of deep learning has revolutionized the way computer vision tasks are approached. Unlike traditional methods, deep learning extracts features from training samples through the autonomous learning of convolutional neural networks rather than relying on manual features and prior knowledge. In addition, sufficient data is the basic guarantee of scientific research, and compared with traditional methods, deep learning can more easily obtain large manually annotated data sets. More importantly, deep learning algorithms can process graphics in parallel on the GPU, which greatly improves learning efficiency and prediction capability.
Most current semantic segmentation methods adopt deep learning, combining convolution operations with pooling operations, fully convolutional networks and the like, and use deep convolutional neural networks to autonomously learn and extract feature information from images. However, simply using existing neural networks is not enough to meet the high-precision requirements of semantic segmentation, because some operations, including pooling, lose part of the feature information in the image and the resulting segmentation images are poor. Deeper research is therefore needed to optimize the model on this basis.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides a multi-modal feature fusion road scene semantic segmentation method in a complex environment, which has high detection efficiency and high detection accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q original road scene images, thermal images and corresponding segmentation label images to form a training set, and recording the kth original road scene image in the training set as
{Rk(x, y)}; the road scene image is an RGB color image, the corresponding thermal image is recorded as {Tk(x, y)}, and the corresponding segmentation label image is recorded as {Gk(x, y)}. Because the original road scene image, i.e. the RGB color image, has three channels while the thermal image has only one channel, each thermal image in the training set is processed by HHA encoding (horizontal disparity, height above ground, and the angle of the local surface normal to the inferred gravity direction) into a three-channel image like the road scene image, and the set formed by the resulting three-channel image components is denoted Jk. Here Rk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in the road scene image {Rk(x, y)}, Tk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in the thermal image {Tk(x, y)}, and Gk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in the segmentation label image {Gk(x, y)}.
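For illustration only, the following is a minimal Python/NumPy sketch of how one training sample could be assembled. The exact HHA encoding is not reproduced here (it is not detailed in this description), so simple channel replication is used as a stand-in, and the function name and array shapes are assumptions.

```python
import numpy as np

def make_training_sample(rgb, thermal, label):
    """Assemble one training triple (assumed shapes: rgb HxWx3, thermal HxW, label HxW).

    The single-channel thermal image is expanded to three channels so that it
    matches the road scene image; plain replication stands in for the HHA-style
    encoding described in the text."""
    assert rgb.ndim == 3 and rgb.shape[2] == 3
    thermal_3ch = np.repeat(thermal[..., None], 3, axis=2)  # stand-in for HHA encoding
    return rgb.astype(np.float32), thermal_3ch.astype(np.float32), label.astype(np.int64)
```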
Step 1_ 2: constructing a convolutional neural network:
step 1_ 3: inputting each road scene image of the original road scene in the training set and the corresponding thermal image into a convolutional neural network for training to obtain a semantic segmentation result image, and finishing the training to obtain all the road scene imagesThe set formed by the semantic segmentation result graph correspondingly obtained from the road scene image is recorded as
Figure BDA0002890088140000028
Step 1_ 4: set formed by semantic segmentation result graph obtained by computational training
Figure BDA0002890088140000029
Set of split label images corresponding to all road scene images
Figure BDA00028900881400000210
Value of the loss function in between, is recorded as
Figure BDA00028900881400000211
Step 1_ 5: Repeat step 1_3 and step 1_4 N times to obtain a trained convolutional neural network classification model and Q×N loss function values in total; then find the minimum among the Q×N loss function values, and take the weight vector and bias term corresponding to this minimum loss function value as the optimal weight vector Wbest and the optimal bias term Bbest of the convolutional neural network classification training model.
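A hedged PyTorch sketch of this training schedule is given below. The pixel-wise cross-entropy loss and the Adam optimizer are assumptions (the description does not name them), and the model and data loader are placeholders taking the road scene image and the three-channel thermal image as inputs.

```python
import copy
import torch
import torch.nn as nn

def train(model, loader, num_epochs=100, lr=1e-3, device="cuda"):
    """Repeat steps 1_3/1_4 N times and keep the parameters with the lowest loss value."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()            # assumed loss; not specified in the description
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float("inf"), None
    for epoch in range(num_epochs):              # N = 100 in the described experiment
        for rgb, thermal, label in loader:
            rgb, thermal, label = rgb.to(device), thermal.to(device), label.to(device)
            pred = model(rgb, thermal)            # semantic segmentation result map
            loss = criterion(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:           # keep the weights/biases giving the minimum loss
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss
```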
The test stage process comprises the following specific steps:
original road scene image to be segmented
, denoted {S(x', y')}: its R channel component, G channel component and B channel component are input into the trained convolutional neural network model, and prediction with the optimal weight vector Wbest and the optimal bias term Bbest yields the saliency detection image of the original road scene image to be segmented, which gives the segmentation result of the road scene; the value of the saliency detection image at coordinate position (x', y') is the predicted pixel value of the corresponding pixel.
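For the testing stage, a corresponding prediction sketch is shown below; loading the saved optimal parameters and taking a per-pixel argmax over class scores are assumptions made for illustration, not requirements of the described method.

```python
import torch

@torch.no_grad()
def predict(model, best_state, rgb, thermal, device="cuda"):
    """Predict a segmentation map for one road scene image using the optimal parameters."""
    model.load_state_dict(best_state)                      # Wbest and Bbest
    model.to(device).eval()
    logits = model(rgb.to(device), thermal.to(device))     # assumed layout: 1 x C x H x W
    return logits.argmax(dim=1).squeeze(0).cpu()           # per-pixel class labels
```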
The thermal image
is processed in advance into a three-channel image, like the road scene image, by the HHA encoding described above.
In the testing stage process, the original road scene image to be segmented
, denoted {S(x', y')}, comprises the road scene image to be segmented and its corresponding thermal image; wherein 1 ≤ x' ≤ W', 1 ≤ y' ≤ H', W' and H' denote the width and height of the original road scene image to be segmented, and S(x', y') denotes the pixel value of the pixel at coordinate position (x', y') in the original road scene image to be segmented;
the convolutional neural network comprises an input layer, a hidden layer and an output layer;
the hidden layer comprises a 1 st neural network block, a 2 nd neural network block, a 3 rd neural network block, a 4 th neural network block, a 5 th neural network block, a 6 th neural network block, a 7 th neural network block, an 8 th neural network block, a 9 th neural network block, a 10 th neural network block, an FFM (Feature Fusion Module), a 1 st transition convolution layer, a 2 nd transition convolution layer, a 1 st decoding block, a 2 nd decoding block and a 3 rd decoding block which are sequentially arranged;
for the input layer, the input end of the input layer receives the road scene image and the thermal image of the original road scene, the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the road scene image and the thermal image, and the output quantity of the input layer is the input quantity of the hidden layer.
The thermal image has three channels after being processed by the HHA coding mode, like the road scene image, and is accordingly split into three components after passing through the input layer; the width and height of the input original road scene image and thermal image are W and H.
In the hidden layer, the road scene image and the thermal image are respectively input into a 6 th neural network block and a 1 st neural network block, the 6 th neural network block, a 7 th neural network block, an 8 th neural network block, a 9 th neural network block and a 10 th neural network block are sequentially connected, the 1 st neural network block, the 2 nd neural network block, the 3 rd neural network block, the 4 th neural network block and the 5 th neural network block are sequentially connected,
the output of the 1 st neural network block and the output of the 6 th neural network block are added and then input into the 7 th neural network block, the output of the 2 nd neural network block and the output of the 7 th neural network block are added and then input into the 8 th neural network block, the output of the 3 rd neural network block and the output of the 8 th neural network block are added and then input into the 9 th neural network block, and the output of the 4 th neural network block and the output of the 9 th neural network block are added and then input into the 10 th neural network block; all 10 neural network blocks constitute the encoder.
The 10 th neural network block is sequentially connected with the 5 th neural network block; the output of the 5 th neural network block is added by the addition block and then input, together with the outputs of the 8 th neural network block and the 9 th neural network block, to the FFM feature fusion module; the output of the FFM feature fusion module is input to the 1 st decoding block; the 1 st decoding block, the 2 nd decoding block and the 3 rd decoding block are sequentially connected, and all 3 decoding blocks form the decoder.
The output of the 6 th neural network block is input to the 3 rd decoding block through the 2 nd transition convolution layer, the output of the 7 th neural network block is input to the 2 nd decoding block through the 1 st transition convolution layer, and the output of the 3 rd decoding block is used as the output of the hidden layer.
In the hidden layer:
the 1 st neural network block mainly comprises a first convolution layer and a first activation layer in sequence;
the 2 nd neural network block mainly comprises a first pooling layer and a first dense convolution block, wherein the first dense convolution block comprises five continuous dense convolution layers;
the 3 rd neural network block mainly comprises a first transition convolution block and a second dense convolution block, wherein the first transition convolution block is formed by an activation layer with an activation mode of Relu, and the second dense convolution block is formed by sequentially connecting twelve continuous dense convolution layers with the same structure;
the 4 th neural network block mainly comprises a second transition convolution block and a third dense convolution block, wherein the second transition convolution block is formed by an activation layer with an activation mode of Relu, and the third dense convolution block is formed by sequentially connecting thirty-six continuous dense convolution layers with the same structure;
the 5 th neural network block mainly comprises a third transition convolution block and a fourth dense convolution block, wherein the third transition convolution block is formed by an activation layer with an activation mode of Relu, and the fourth dense convolution block is formed by sequentially connecting twenty-four continuous dense convolution layers with the same structure;
the corresponding structures of the 6 th to 10 th neural network blocks are the same as those of the 1 st to 5 th neural network blocks.
The dense convolution layers all have the same structure, each formed by sequentially connecting two consecutive activation layers and two consecutive convolution layers.
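A sketch of one such dense convolution layer is given below. The 1 × 1 and 3 × 3 kernel sizes follow the detailed description; the DenseNet-style concatenation, growth rate and bottleneck width are assumptions used to reproduce the stated channel growth.

```python
import torch
import torch.nn as nn

class DenseConvLayer(nn.Module):
    """One dense convolution layer: ReLU -> 1x1 conv -> ReLU -> 3x3 conv, with its
    output concatenated onto its input (DenseNet-style concatenation is assumed)."""
    def __init__(self, in_channels, growth=48, bottleneck=4):
        super().__init__()
        mid = bottleneck * growth
        self.body = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, mid, kernel_size=1, stride=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, growth, kernel_size=3, stride=1, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)  # grow the channel dimension by `growth`
```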
As shown in fig. 2, the FFM feature fusion module mainly comprises a first convolution layer with a convolution kernel size of 3 × 3 and two second convolution layers with a convolution kernel size of 1 × 1. The outputs from the 7 th neural network block, the 8 th neural network block and the addition block are input into the first convolution layer after a pixel superposition operation, so the first convolution layer receives three inputs; the output of the first convolution layer is input into the two parallel second convolution layers, and the outputs of the two second convolution layers are combined by a pixel-wise multiplication operation to form the output of the FFM feature fusion module.
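A minimal PyTorch sketch of the FFM as just described (superposed inputs, one 3 × 3 convolution, two parallel 1 × 1 convolutions whose outputs are multiplied pixel-wise) is shown below; the default channel numbers are taken from the detailed embodiment, and implementing the superposition as channel concatenation is an assumption.

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Feature Fusion Module: concat -> 3x3 conv -> two parallel 1x1 convs -> pixel-wise product."""
    def __init__(self, in_channels=5088, out_channels=1272):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.branch_a = nn.Conv2d(out_channels, out_channels, kernel_size=1, stride=1)
        self.branch_b = nn.Conv2d(out_channels, out_channels, kernel_size=1, stride=1)

    def forward(self, feats):
        # feats: list of feature maps already resized to a common resolution
        x = self.conv3x3(torch.cat(feats, dim=1))   # concatenation stands in for the described superposition
        return self.branch_a(x) * self.branch_b(x)  # pixel-wise multiplication of the two branches
```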
The 1 st transition convolution layer and the 2 nd transition convolution layer are each formed by a single convolution layer.
As shown in fig. 3, the 1 st decoding block, the 2 nd decoding block and the 3 rd decoding block have the same structure. Each decoding block comprises two deconvolution modules and two dilated (expansion) convolution layers, and each deconvolution module comprises a deconvolution layer, a normalization layer and an activation function layer connected in sequence. The input of the decoding block is fed into both deconvolution modules; the outputs of the two deconvolution modules are superimposed and then processed by the first dilated convolution layer, the input of the decoding block is processed by the second dilated convolution layer, and the outputs of the two dilated convolution layers are added to form the output of the decoding block.
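A hedged sketch of such a decoding block is given below. The stride-2 deconvolution parameters and the bilinear resize of the skip branch are assumptions chosen so that the two branches have matching resolution, and implementing the superposition of the two deconvolution outputs as addition is likewise an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Decoding block DB sketch: two deconvolution modules (deconv -> BN -> ReLU) on the
    block input, their superposed outputs refined by a 3x3 conv with dilation 1, the
    (resized) block input refined by a 3x3 conv with dilation 2, and the two results added."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        def deconv_module():
            return nn.Sequential(
                nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
        self.deconv_a = deconv_module()
        self.deconv_b = deconv_module()
        self.dilated1 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, dilation=1)
        self.dilated2 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=2, dilation=2)

    def forward(self, x):
        up = self.deconv_a(x) + self.deconv_b(x)              # superposition of the two deconvolution branches
        skip = F.interpolate(x, size=up.shape[-2:], mode="bilinear", align_corners=False)
        return self.dilated1(up) + self.dilated2(skip)        # add the two dilated-convolution outputs
```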
For the output layer, its input is the output of the 3 rd decoding block, and it outputs a segmentation result graph with a width of W and a height of H.
Compared with the prior art, the invention has the advantages that:
1) The method adopts a convolutional neural network to extract features from the road scene image and the thermal image, and fuses the information from these two modalities to obtain an efficient, high-precision semantic segmentation result graph.
2) The method of the invention provides an effective module, the FFM (Feature Fusion Module), which takes feature maps as input and fuses them effectively to obtain richer feature information after fusion, which helps the decoding stage obtain a segmentation result graph with higher precision.
3) The method of the invention provides a decoding block DB (Decoder Block) structure, which takes the feature maps processed by the feature fusion module as input and decodes them with a distinctive decoding structure, that is, the image resolution is gradually restored to be consistent with the input image and a high-precision segmentation result graph is obtained.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a block diagram of a feature fusion module FFM in the method of the present invention;
FIG. 3 is a block diagram of a decoding block DB in the method of the present invention;
FIG. 4a is a segmentation label image corresponding to the 1 st original road scene image of the same scene;
FIG. 4b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 4a by using the method of the present invention;
FIG. 5a is a segmentation label image corresponding to the 2 nd original road scene image of the same scene;
FIG. 5b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 5a by using the method of the present invention;
FIG. 6a is a segmentation label image corresponding to the 3 rd original road scene image of the same scene;
FIG. 6b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 6a by using the method of the present invention;
FIG. 7a is a segmentation label image corresponding to the 4 th original road scene image of the same scene;
FIG. 7b is a graph of the segmentation result obtained by segmenting the original road scene image shown in FIG. 7a using the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The overall implementation block diagram of the multi-modal feature fusion road scene semantic segmentation method under the complex environment is shown in FIG. 1, and the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q original road scene images, thermal images and corresponding segmentation label images to form a training set, and recording the kth original road scene image in the training set as
{Rk(x, y)}; the corresponding thermal image is recorded as {Tk(x, y)}, and the corresponding segmentation label image is recorded as {Gk(x, y)}. Because the original road scene image, i.e. the RGB color image, has three channels while the thermal image has only one channel, each thermal image in the training set is processed by HHA encoding (horizontal disparity, height above ground, and the angle of the local surface normal to the inferred gravity direction) into a three-channel image like the road scene image, and the set formed by the resulting three-channel image components is denoted Jk; wherein Q is a positive integer with Q ≥ 200, e.g. Q = 1569; k is a positive integer with 1 ≤ k ≤ Q; 1 ≤ x ≤ W and 1 ≤ y ≤ H, where W denotes the width and H the height of the original road scene image and the thermal image, e.g. W = 224 and H = 224; Rk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in {Rk(x, y)}, Tk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in {Tk(x, y)}, and Gk(x, y) denotes the pixel value of the pixel at coordinate position (x, y) in {Gk(x, y)}. The data set used in this experiment is taken directly from the public road scene semantic segmentation data set released by Ha et al., which contains 1569 pairs of road scene images and thermal images.
Step 1_ 2: Construct the convolutional neural network. The convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises a 1 st neural network block, a 2 nd neural network block, a 3 rd neural network block, a 4 th neural network block, a 5 th neural network block, a 6 th neural network block, a 7 th neural network block, an 8 th neural network block, a 9 th neural network block, a 10 th neural network block, an FFM (Feature Fusion Module), a 1 st transition convolution layer, a 2 nd transition convolution layer, a 1 st decoding block, a 2 nd decoding block and a 3 rd decoding block which are sequentially arranged.
For the input layer, its input end receives the road scene image and the thermal image of the original road scene, its output end outputs the R channel component, the G channel component and the B channel component of the original input images, and the output of the input layer is the input of the hidden layer. The thermal image, processed by the HHA coding mode, has three channels like the road scene image, i.e. it is split into three components after passing through the input layer, and the width and height of the input original road scene image and thermal image are W and H.
For the hidden layer: the 1 st neural network block consists of a first convolution layer and a first activation layer. The input of the 1 st neural network block is the three-channel thermal image obtained by HHA processing; after processing by the 1 st neural network block, 96 feature maps are output, and the set formed by these 96 feature maps is denoted P1. The convolution kernel size (kernel_size) of the first convolution layer is 7 × 7, the number of convolution kernels (filters) is 96, the stride is 2 and the zero-padding (padding) value is 3; the activation mode of the first activation layer is "Relu". Each feature map in P1 has a width of W/2 and a height of H/2.
The 2 nd neural network block consists of a first pooling layer and a first dense convolution block. The input of the 2 nd neural network block is the 96 feature maps in P1; after processing by the 2 nd neural network block, 384 feature maps are output, and the set formed by these 384 feature maps is denoted P2. The pooling size of the first max-pooling layer is 3, the stride is 2 and the padding value is 1. The first dense convolution block consists of five dense convolution layers of identical structure, each composed of two activation layers and two convolution layers; the activation mode of the activation layers is "Relu", the first of the two convolution layers has a 1 × 1 kernel and a stride of 1, and the second has a 3 × 3 kernel, a stride of 1 and a padding value of 1. Each feature map in P2 has a width of W/4 and a height of H/4.
The 3 rd neural network block consists of a first transition convolution block and a second dense convolution block. The input of the 3 rd neural network block is the 384 feature maps in P2; after processing by the 3 rd neural network block, 768 feature maps are output, and the set formed by these 768 feature maps is denoted P3. The first transition convolution block consists of an activation layer with "Relu" activation, a convolution layer with a 1 × 1 kernel and a stride of 1, and an average pooling layer with a pooling size of 2 and a stride of 2. The second dense convolution block consists of twelve dense convolution layers of identical structure (the dense convolution layer structure is the same as described above; the dense convolution layers mentioned below are likewise the same and are not repeated). Each feature map in P3 has a width of W/8 and a height of H/8.
The 4 th neural network block consists of a second transition convolution block and a third dense convolution block. The input of the 4 th neural network block is the 768 feature maps in P3; after processing by the 4 th neural network block, 2112 feature maps are output, and the set formed by these 2112 feature maps is denoted P4. The second transition convolution block consists of an activation layer with "Relu" activation, a convolution layer with a 1 × 1 kernel and a stride of 1, and an average pooling layer with a pooling size of 2 and a stride of 2; the third dense convolution block consists of thirty-six dense convolution layers of identical structure. Each feature map in P4 has a width of W/16 and a height of H/16.
The 5 th neural network block consists of a third transition convolution block and a fourth dense convolution block. The input of the 5 th neural network block is the 2112 feature maps in P4; after processing by the 5 th neural network block, 2208 feature maps are output, and the set formed by these 2208 feature maps is denoted P5. The third transition convolution block consists of an activation layer with "Relu" activation, a convolution layer with a 1 × 1 kernel and a stride of 1, and an average pooling layer with a pooling size of 2 and a stride of 2; the fourth dense convolution block consists of twenty-four dense convolution layers of identical structure. Each feature map in P5 has a width of W/32 and a height of H/32.
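For concreteness, one transition convolution block as specified above (a "Relu" activation layer, a 1 × 1 convolution with a stride of 1, and a 2 × 2 average pooling layer with a stride of 2) can be sketched as follows; the channel numbers are left as parameters.

```python
import torch.nn as nn

def transition_block(in_channels, out_channels):
    """Transition convolution block: ReLU -> 1x1 conv (stride 1) -> 2x2 average pooling (stride 2)."""
    return nn.Sequential(
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2),   # halves the spatial resolution at each stage
    )
```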
The 6 th to 10 th neural network blocks correspond to the 1 st to 5 th neural network blocks and have the same structures, which are not repeated here. The input of the 6 th neural network block is the three-channel road scene image; after processing by the 6 th neural network block, 96 feature maps are output, and the set formed by these 96 feature maps is denoted T1; each feature map in T1 has a width of W/2 and a height of H/2.
The input of the 7 th neural network block is the feature map set G1 obtained by adding the 96 feature maps in T1 and the 96 feature maps in P1; after processing by the 7 th neural network block, 384 feature maps are output, and the set formed by these 384 feature maps is denoted T2; each feature map in T2 has a width of W/4 and a height of H/4.
The input of the 8 th neural network block is the feature map set G2 obtained by adding the 384 feature maps in T2 and the 384 feature maps in P2; after processing by the 8 th neural network block, 768 feature maps are output, and the set formed by these 768 feature maps is denoted T3; each feature map in T3 has a width of W/8 and a height of H/8.
The input of the 9 th neural network block is the feature map set G3 obtained by adding the 768 feature maps in T3 and the 768 feature maps in P3; after processing by the 9 th neural network block, 2112 feature maps are output, and the set formed by these 2112 feature maps is denoted T4; each feature map in T4 has a width of W/16 and a height of H/16.
The input of the 10 th neural network block is the feature map set G4 obtained by adding the 2112 feature maps in T4 and the 2112 feature maps in P4; after processing by the 10 th neural network block, 2208 feature maps are output, and the set formed by these 2208 feature maps is denoted T5; each feature map in T5 has a width of W/32 and a height of H/32. In addition, the feature map set obtained by adding the 2208 feature maps in T5 and the 2208 feature maps in P5 is denoted G5.
For the FFM (Feature Fusion Module), the inputs of the module are G3, G4 and G5 (where the feature maps in G4 and G5 are first processed by bilinear interpolation so that their resolution matches that of the feature maps in G3, i.e. the feature map sizes are unified); the feature map set obtained after concatenating (concat) them is denoted G6, which comprises 5088 feature maps. After FFM processing, 1272 feature maps are output, and the set formed by these 1272 feature maps is denoted G7. The FFM is composed of one convolution with a 3 × 3 kernel, a stride of 1 and a padding value of 1, and two convolutions with 1 × 1 kernels and a stride of 1, the two 1 × 1 convolutions being arranged in parallel. Each feature map in G6 has a width of W/8 and a height of H/8.
For the 1 st transition convolution layer, the input is the 96 feature maps in G1; after processing by this convolution layer, 159 feature maps are output, and these 159 feature maps are denoted C1. It consists of a convolution with a 3 × 3 kernel, a stride of 1 and a padding value of 1; this convolution layer does not change the image resolution, i.e. the feature maps processed by it have a width of W/2 and a height of H/2.
For the 2 nd transition convolution layer, the input is the 384 feature maps in G2; after processing by this convolution layer, 318 feature maps are output, and these 318 feature maps are denoted C2. It consists of a convolution with a 3 × 3 kernel, a stride of 1 and a padding value of 1; this convolution layer does not change the image resolution, i.e. the feature maps processed by it have a width of W/4 and a height of H/4.
For the 1 st decoding block, the input is the 1272 feature maps in G7; after processing by the 1 st decoding block, 636 feature maps are obtained, and these 636 feature maps are denoted H1. The decoding block is composed of two deconvolutions and two dilated convolutions: the two deconvolutions have the same structure, with 4 × 4 kernels, a stride of 1 and a padding value of 1; the two dilated convolutions have 3 × 3 kernels and a stride of 1, with dilation (expansion) rates of 1 and 2 and corresponding padding values of 1 and 2, respectively. Each feature map after the 1 st decoding block has a width of W/4 and a height of H/4.
For the 2 nd decoding block, the input is the feature map set C3 obtained by adding the 318 feature maps in H1 and the 318 feature maps in C2; after processing by the 2 nd decoding block, 159 feature maps are obtained, and these 159 feature maps are denoted H2. The structures of the 2 nd decoding block and the 3 rd decoding block are the same as that of the 1 st decoding block and are not repeated here. Each feature map after the 2 nd decoding block has a width of W/2 and a height of H/2.
For the 3 rd decoding block, the input is the feature map set obtained by adding the 159 feature maps in H2 and the 159 feature maps in C1; after processing by the 3 rd decoding block, 1 feature map is obtained, and this feature map is denoted H3. The feature map after the 3 rd decoding block has a width of W and a height of H.
For the output layer, its input is the output of the 3 rd decoding block, and it outputs a segmentation result graph with a width of W and a height of H.
Step 1_ 3: Take the road scene images and the thermal images of the original road scenes in the training set as input and feed them into the convolutional neural network for training to obtain semantic segmentation result graphs; the set formed by the semantic segmentation result graphs obtained after training is recorded.
Step 1_ 4: Calculate the loss function value between the set of semantic segmentation result graphs obtained by training and the set of corresponding segmentation label images.
Step 1_ 5: Repeat step 1_3 and step 1_4 N times to obtain a convolutional neural network classification training model and Q×N loss function values in total; then find the loss function value with the minimum value among the Q×N loss function values, and take the weight vector and the bias term corresponding to this minimum loss function value as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model, denoted Wbest and Bbest respectively; where N > 1, and N = 100 in this experiment.
The test stage process comprises the following specific steps:
The R channel component, the G channel component and the B channel component of the original road scene image to be segmented are input into the trained convolutional neural network model, and prediction with the optimal weight vector Wbest and the optimal bias term Bbest yields the saliency detection image, whose value at coordinate position (x', y') is the predicted pixel value of the corresponding pixel.
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed. A multi-modal feature fusion convolutional neural network architecture based on an encoder-decoder structure was built with the Python-based deep learning library PyTorch 1.1.0. The public data set for the semantic segmentation task (1569 pairs of road scene images and thermal images) was used to analyze the segmentation result images produced by the method of the present invention. In this experiment, two common objective evaluation metrics were used as evaluation indexes: the Accuracy (Acc) of each class and the Intersection over Union (IoU) of each class, which evaluate the segmentation precision of the segmentation result images.
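A sketch of these two evaluation metrics computed from a confusion matrix is given below; the assumption is that predictions and labels are integer class maps of the same shape.

```python
import numpy as np

def per_class_acc_iou(pred, label, num_classes):
    """Per-class Accuracy (Acc) and Intersection-over-Union (IoU) from integer label maps."""
    pred = pred.reshape(-1).astype(np.int64)
    label = label.reshape(-1).astype(np.int64)
    conf = np.bincount(label * num_classes + pred,
                       minlength=num_classes * num_classes).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(np.float64)
    acc = tp / np.maximum(conf.sum(axis=1), 1)                          # correct pixels / ground-truth pixels per class
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)  # intersection / union per class
    return acc, iou
```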
Each road scene image in the data set was segmented with the method of the invention to obtain the corresponding segmentation result graph; the segmentation performance of the method, namely the Accuracy (Acc) of each class and the Intersection over Union (IoU) of each class, is listed in Table 1.
TABLE 1 evaluation results on test sets using the method of the invention
(Table 1 is reproduced as two images in the original publication, listing the per-class Accuracy (Acc) and the per-class Intersection over Union (IoU) obtained on the test set.)
As can be seen from the data listed in Table 1, the first sub-table reflects the Accuracy (Acc) of each class and the second sub-table reflects the Intersection over Union (IoU) of each class, with the best value of each entry shown in bold; the segmentation results obtained by the method of the present invention are therefore better.
FIG. 4a shows a segmentation label image corresponding to the 1 st original road scene image of the same scene in the data set selected for the experiment; FIG. 4b is a diagram showing a segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 4a by using the method of the present invention; FIG. 5a shows a segmentation label image corresponding to the 2 nd original road scene image of the same scene in the data set selected for the experiment; FIG. 5b is a diagram showing a segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 5a by using the method of the present invention; FIG. 6a shows a segmentation label image corresponding to the 3 rd original road scene image of the same scene in the data set selected for the experiment; FIG. 6b is a diagram showing the segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 6a by using the method of the present invention; FIG. 7a shows a segmentation label image corresponding to the 4 th original road scene image of the same scene in the data set selected for the experiment; FIG. 7b is a diagram showing the segmentation result obtained by segmenting the original road scene image of the road scene shown in FIG. 7a by using the method of the present invention. Comparing fig. 4a and 4b, fig. 5a and 5b, fig. 6a and 6b, and fig. 7a and 7b, it can be seen that the segmentation accuracy of the segmentation result graph obtained by the method of the present invention is higher.
It can thus be seen that the method of the invention optimizes the decoding of feature maps by applying novel modules and, by combining hierarchical and multi-modal information fusion, ultimately improves the efficiency and accuracy of the road scene semantic segmentation task.

Claims (9)

1. A multi-modal feature fusion road scene semantic segmentation method in a complex environment is characterized by comprising the following steps: the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q original road scene images, thermal images and corresponding segmentation label images to form a training set, and recording the kth original road scene image in the training set as
{Rk(x, y)}, the corresponding thermal image as {Tk(x, y)}, and the corresponding segmentation label image as {Gk(x, y)};
step 1_ 2: constructing a convolutional neural network;
step 1_ 3: inputting each road scene image of the original road scenes in the training set and the thermal image corresponding to the road scene image into the convolutional neural network for training to obtain a semantic segmentation result graph, and recording the set formed by the semantic segmentation result graphs correspondingly obtained from all the road scene images after training;
step 1_ 4: calculating the loss function value between the set formed by the semantic segmentation result graphs obtained by training and the set formed by the segmentation label images corresponding to all the road scene images;
step 1_ 5: repeatedly executing step 1_3 and step 1_4 N times to obtain a convolutional neural network classification training model and Q×N loss function values in total; then finding the loss function value with the minimum value among the Q×N loss function values, and taking the weight vector and the bias term corresponding to the minimum loss function value as the optimal weight vector Wbest and the optimal bias term Bbest of the convolutional neural network classification training model;
the test stage process comprises the following specific steps:
inputting the R channel component, the G channel component and the B channel component of the original road scene image to be segmented into the trained convolutional neural network model, and predicting with the optimal weight vector Wbest and the optimal bias term Bbest to obtain the saliency detection image of the original road scene image to be segmented, which gives the segmentation result of the road scene.
2. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 1, characterized in that: the thermal image is processed in advance into a three-channel image, like the road scene image, by HHA encoding.
3. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 1, characterized in that: in the testing stage, the original road scene image to be segmented, denoted {S(x', y')}, comprises the road scene image to be segmented and its corresponding thermal image; wherein 1 ≤ x' ≤ W', 1 ≤ y' ≤ H', W' and H' denote the width and height of the original road scene image to be segmented, and S(x', y') denotes the pixel value of the pixel at coordinate position (x', y') in the original road scene image to be segmented.
4. The method for multi-modal feature fusion road scene semantic segmentation under the complex environment according to claim 1, characterized in that: the convolutional neural network comprises an input layer, a hidden layer and an output layer;
the hidden layer comprises a 1 st neural network block, a 2 nd neural network block, a 3 rd neural network block, a 4 th neural network block, a 5 th neural network block, a 6 th neural network block, a 7 th neural network block, an 8 th neural network block, a 9 th neural network block, a 10 th neural network block, an FFM (Feature Fusion Module), a 1 st transition convolution layer, a 2 nd transition convolution layer, a 1 st decoding block, a 2 nd decoding block and a 3 rd decoding block which are sequentially arranged; in the hidden layer, the road scene image and the thermal image are respectively input into the 6 th neural network block and the 1 st neural network block, the 6 th neural network block, the 7 th neural network block, the 8 th neural network block, the 9 th neural network block and the 10 th neural network block are sequentially connected, the 1 st neural network block, the 2 nd neural network block, the 3 rd neural network block, the 4 th neural network block and the 5 th neural network block are sequentially connected, the output of the 1 st neural network block and the output of the 6 th neural network block are added and then input into the 7 th neural network block, the output of the 2 nd neural network block and the output of the 7 th neural network block are added and then input into the 8 th neural network block, the output of the 3 rd neural network block and the output of the 8 th neural network block are added and then input into the 9 th neural network block, and the output of the 4 th neural network block and the output of the 9 th neural network block are added and then input into the 10 th neural network block; the 10 th neural network block is sequentially connected with the output of the 5 th neural network block, the output of the 5 th neural network block is added by the addition block and then input, together with the output of the 8 th neural network block and the output of the 9 th neural network block, to the FFM feature fusion module, the output of the FFM feature fusion module is input to the 1 st decoding block, the 1 st decoding block, the 2 nd decoding block and the 3 rd decoding block are sequentially connected, the output of the 6 th neural network block is input to the 3 rd decoding block through the 2 nd transition convolution layer, the output of the 7 th neural network block is input to the 2 nd decoding block through the 1 st transition convolution layer, and the output of the 3 rd decoding block is used as the output of the hidden layer.
5. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 4, characterized in that: in the hidden layer,
the 1 st neural network block mainly comprises a first convolution layer and a first activation layer in sequence;
the 2 nd neural network block mainly comprises a first pooling layer and a first dense convolution block, wherein the first dense convolution block comprises five continuous dense convolution layers;
the 3 rd neural network block mainly comprises a first transition convolution block and a second dense convolution block, wherein the first transition convolution block is formed by an activation layer with an activation mode of Relu, and the second dense convolution block is formed by sequentially connecting twelve continuous dense convolution layers with the same structure;
the 4 th neural network block mainly comprises a second transition convolution block and a third dense convolution block, wherein the second transition convolution block is formed by an activation layer with an activation mode of Relu, and the third dense convolution block is formed by sequentially connecting thirty-six continuous dense convolution layers with the same structure;
the 5 th neural network block mainly comprises a third transition convolution block and a fourth dense convolution block, wherein the third transition convolution block is formed by an activation layer with an activation mode of Relu, and the fourth dense convolution block is formed by sequentially connecting twenty-four continuous dense convolution layers with the same structure;
the corresponding structures of the 6 th to 10 th neural network blocks are the same as those of the 1 st to 5 th neural network blocks.
6. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 5, characterized in that: the dense convolution layers all have the same structure, each formed by sequentially connecting two consecutive activation layers and two consecutive convolution layers.
7. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 4, characterized in that: the FFM feature fusion module mainly comprises a first convolution layer with a convolution kernel size of 3 × 3 and two second convolution layers with a convolution kernel size of 1 × 1; the outputs from the 7 th neural network block, the 8 th neural network block and the addition block are input into the first convolution layer after a pixel superposition operation, the output of the first convolution layer is input into the two second convolution layers respectively, and the outputs of the two second convolution layers are combined by a pixel-wise multiplication operation to form the output of the FFM feature fusion module.
8. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 4, characterized in that: the 1 st transition convolution layer and the 2 nd transition convolution layer are each formed by a single convolution layer.
9. The multi-modal feature fusion road scene semantic segmentation method in a complex environment according to claim 4, characterized in that: the 1 st decoding block, the 2 nd decoding block and the 3 rd decoding block have the same structure, and each comprises two deconvolution modules and two dilated convolution layers; each deconvolution module comprises a deconvolution layer, a normalization layer and an activation function layer connected in sequence; the input of the decoding block is fed into both deconvolution modules, the outputs of the two deconvolution modules are superimposed and then processed and output by the first dilated convolution layer, the input of the decoding block is processed and output by the second dilated convolution layer, and the outputs of the two dilated convolution layers are added to form the output of the decoding block.
CN202110025132.XA 2021-01-08 2021-01-08 Multi-modal feature fusion road scene semantic segmentation method in complex environment Pending CN112733934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110025132.XA CN112733934A (en) 2021-01-08 2021-01-08 Multi-modal feature fusion road scene semantic segmentation method in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110025132.XA CN112733934A (en) 2021-01-08 2021-01-08 Multi-modal feature fusion road scene semantic segmentation method in complex environment

Publications (1)

Publication Number Publication Date
CN112733934A true CN112733934A (en) 2021-04-30

Family

ID=75589810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110025132.XA Pending CN112733934A (en) 2021-01-08 2021-01-08 Multi-modal feature fusion road scene semantic segmentation method in complex environment

Country Status (1)

Country Link
CN (1) CN112733934A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984856A (en) * 2022-12-05 2023-04-18 百度(中国)有限公司 Training method of document image correction model and document image correction method

Similar Documents

Publication Publication Date Title
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
Fan et al. Learning collision-free space detection from stereo images: Homography matrix brings better data augmentation
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN113807355B (en) Image semantic segmentation method based on coding and decoding structure
CN109635662B (en) Road scene semantic segmentation method based on convolutional neural network
CN105701508A (en) Global-local optimization model based on multistage convolution neural network and significant detection algorithm
CN111401436A (en) Streetscape image segmentation method fusing network and two-channel attention mechanism
CN111696110B (en) Scene segmentation method and system
CN109753959B (en) Road traffic sign detection method based on self-adaptive multi-scale feature fusion
CN110956119B (en) Method for detecting target in image
CN114223019A (en) Feedback decoder for parameter efficient semantic image segmentation
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN112861727A (en) Real-time semantic segmentation method based on mixed depth separable convolution
CN111860411A (en) Road scene semantic segmentation method based on attention residual error learning
Wang et al. Global perception-based robust parking space detection using a low-cost camera
CN115273032A (en) Traffic sign recognition method, apparatus, device and medium
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
Xing et al. MABNet: a lightweight stereo network based on multibranch adjustable bottleneck module
CN112733934A (en) Multi-modal feature fusion road scene semantic segmentation method in complex environment
CN117037119A (en) Road target detection method and system based on improved YOLOv8
CN112446292B (en) 2D image salient object detection method and system
CN111626298A (en) Real-time image semantic segmentation device and segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination