CN113362349A - Road scene image semantic segmentation method based on multi-supervision network

Info

Publication number
CN113362349A
Authority
CN
China
Prior art keywords: module, output, layer, input, semantic
Prior art date
Legal status
Pending
Application number
CN202110823118.4A
Other languages
Chinese (zh)
Inventor
周武杰
董少华
强芳芳
许彩娥
Current Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Zhejiang University of Science and Technology ZUST
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110823118.4A
Publication of CN113362349A

Classifications

    • G06T 7/11: Region-based segmentation (under G06T 7/10 Segmentation; Edge detection)
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Abstract

The invention discloses a road scene image semantic segmentation method based on a multi-supervision network. The method comprises a training stage and a testing stage and includes the following steps: selecting a plurality of original road scene RGB images, the corresponding original Thermal infrared images and the corresponding real semantic segmentation images, and preprocessing them to form a training set; constructing a convolutional neural network; inputting the training set into the convolutional neural network for training, the convolutional neural network outputting seven sets of prediction maps; calculating a final loss value; repeating the training steps multiple times to obtain a convolutional neural network classification training model; and inputting the road scene RGB images to be semantically segmented, together with the corresponding original Thermal infrared images, into the model to obtain the corresponding semantic segmentation prediction maps. The method improves the efficiency and accuracy of semantic segmentation of RGB-T road scene images.

Description

Road scene image semantic segmentation method based on multi-supervision network
Technical Field
The invention relates to a road scene semantic segmentation method based on deep learning, in particular to a road scene image semantic segmentation method based on a multi-supervision network.
Background
With the rise of technologies such as autonomous driving, scene understanding and virtual reality, semantic segmentation of images has gradually become a research hotspot for computer vision and machine learning researchers; with the help of semantic segmentation, traffic scene understanding, multi-target obstacle detection and visual navigation become possible. Currently, common semantic segmentation methods include support vector machines, random forests and other algorithms. These algorithms focus primarily on binary tasks for detecting and identifying specific objects, such as road surfaces, vehicles and pedestrians. Traditional machine learning methods are usually built on hand-crafted, high-complexity features, whereas deep learning makes semantic segmentation of traffic scenes simpler and more convenient and, more importantly, greatly improves the accuracy of pixel-level image classification.
Deep-learning semantic segmentation methods perform end-to-end segmentation directly at the pixel level: the images in the training set are fed into a model framework for training to obtain the weights and the model, and predictions can then be made on the test set. The power of a convolutional neural network lies in its multi-layer structure, which can automatically learn features at multiple levels of abstraction. Current deep-learning semantic segmentation methods fall into two categories. The first is the encoder-decoder architecture: during encoding, position information is gradually discarded through pooling layers while abstract features are extracted, and during decoding the position information is gradually recovered, usually with direct connections between the decoder and the encoder. The second framework uses dilated (atrous) convolution to enlarge the receptive field: a dilated convolution with a smaller dilation rate has a smaller receptive field and learns specific local features, whereas one with a larger dilation rate has a larger receptive field and learns more abstract features that are more robust to the size, position and orientation of objects.
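For illustration (not part of the patent), a minimal PyTorch example of the dilated convolution idea described above; it shows that a larger dilation rate enlarges the receptive field without changing the output resolution, and the tensor sizes are arbitrary:

import torch
import torch.nn as nn

# Illustrative only: two 3x3 convolutions with different dilation rates. With dilation=1
# the kernel covers a 3x3 neighbourhood; with dilation=4 the same nine weights cover a
# 9x9 neighbourhood, enlarging the receptive field without extra parameters or pooling.
x = torch.randn(1, 64, 60, 80)                                    # N, C, H, W feature map
conv_small = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_large = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)
print(conv_small(x).shape, conv_large(x).shape)                   # both keep the 60 x 80 resolution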
Most existing road scene semantic segmentation methods adopt deep learning, and a large number of models are built by combining convolutional layers and pooling layers. However, the feature maps obtained simply by pooling and convolution operations are monotonous and not representative, so the feature information extracted from the image is reduced, the recovered details are coarse, and the segmentation accuracy is low.
Disclosure of Invention
The invention aims to solve the technical problem of providing a road scene image semantic segmentation method based on a multi-supervision network, which has high segmentation efficiency and high segmentation accuracy.
The technical solution adopted by the invention to solve the above technical problem is as follows: a road scene image semantic segmentation method based on a multi-supervision network, comprising a training stage and a testing stage.
The specific steps of the training stage process are as follows:
step 1_ 1: selecting a plurality of original road scene RGB images, the corresponding original Thermal infrared images and the corresponding real semantic segmentation images, and carrying out data enhancement on each original road scene RGB image and each original Thermal infrared image through cropping, brightness adjustment and flipping to obtain the initial road scene RGB images and initial Thermal infrared images; a training set is formed by the plurality of initial road scene RGB images, the plurality of initial Thermal infrared images and the corresponding real semantic segmentation images;
step 1_ 2: constructing a convolutional neural network;
step 1_ 3: inputting the training set into a convolutional neural network for training, wherein the convolutional neural network outputs seven prediction image sets corresponding to each original road scene RGB image in the training set;
step 1_ 4: processing the real semantic segmentation image corresponding to each original road scene RGB image into 9 one-hot encoded images, and recording the set of the 9 one-hot encoded images as Jtrue; respectively calculating the loss function values between the set Jtrue and the seven corresponding prediction map sets, and taking the sum of the seven loss function values as the final loss value;
step 1_ 5: repeating step 1_3 and step 1_4 V times until the convergence of the convolutional neural network saturates, so as to obtain the convolutional neural network classification training model; the weight vectors and biases of the network obtained at this point are taken as the optimal weight vectors and optimal bias terms of the convolutional neural network classification training model;
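A minimal sketch of the training procedure of steps 1_3 to 1_5 is given below; the network class, the data loader, the use of cross-entropy as the per-output loss and the SGD hyper-parameters are assumptions made for illustration, since the patent does not specify them:

import torch
import torch.nn as nn

def train_network(model, train_loader, V=300, lr=0.01):
    """Sketch of steps 1_3 to 1_5. `model` is assumed to return the seven prediction map
    sets for an (RGB, Thermal) input pair; the loss and optimizer choices are assumptions."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(V):                                            # repeat V times (step 1_5)
        for rgb, thermal, label in train_loader:                  # label: per-pixel class indices
            outputs = model(rgb, thermal)                         # seven prediction map sets (step 1_3)
            loss = sum(criterion(out, label) for out in outputs)  # final loss = sum of the seven losses (step 1_4)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model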
the test stage process comprises the following specific steps:
step 2: inputting a plurality of original road scene RGB images to be semantically segmented and original Thermal infrared images into a convolutional neural network classification training model, and predicting by using an optimal weight vector and an optimal bias term to obtain a corresponding semantic segmentation prediction graph.
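A corresponding sketch of the test stage, assuming the trained model and a weight file saved after step 1_5 (the file name is an assumption); the seventh network output is taken as the semantic segmentation prediction, as stated later in the description of the seven prediction map sets:

import torch

def predict(model, rgb, thermal, weight_path="best_multisupervision_weights.pth"):
    """Load the optimal weights and return the per-pixel class prediction."""
    model.load_state_dict(torch.load(weight_path))
    model.eval()
    with torch.no_grad():
        outputs = model(rgb, thermal)
    return outputs[6].argmax(dim=1)       # the seventh output is the semantic segmentation prediction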
The convolutional neural network comprises an encoding module and a decoding module, wherein the encoding module is connected with the decoding module;
the encoding module comprises 10 encoding modules, and the decoding module comprises a semantic AHLS module, a multitask supervision RM module, 5 information fusion FM modules, 5 feature fusion AMMF modules and 3 semantic supervision MLF modules;
the first coding module is connected with the semantic AHLS module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the sixth coding module is connected with the semantic AHLS module after sequentially passing through a seventh coding module, an eighth coding module, a ninth coding module and a tenth coding module, the input of the first coding module is an initial road scene RGB image, and the input of the sixth coding module is an initial Thermal infrared image;
the fifth coding module, the semantic AHLS module and the tenth coding module are connected with the first feature fusion AMMF module, the output of the first feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the first information fusion FM module, the output of the first information fusion FM module and the output of the semantic AHLS module are simultaneously input into the first semantic supervision MLF module, and the output of the first semantic supervision MLF module is used as the second output of the convolutional neural network;
the output of the fourth coding module, the output of the first information fusion FM module and the output of the ninth coding module are input into the second feature fusion AMMF module, the output of the second feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the second information fusion FM module, the output of the third coding module and the output of the eighth coding module are simultaneously input into the third feature fusion AMMF module, the output of the third feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the third information fusion FM module, the output of the second information fusion FM module and the output of the third information fusion FM module are simultaneously input into the second semantic supervision MLF module, and the output of the second semantic supervision MLF module is used as the third output of the convolutional neural network;
the output of the second coding module, the output of the third information fusion FM module and the output of the seventh coding module are input into a fourth feature fusion AMMF module, the output of the fourth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fourth information fusion FM module, the output of the first coding module and the output of the sixth coding module are input into a fifth feature fusion AMMF module, the output of the fifth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fifth information fusion FM module, the output of the fourth information fusion FM module and the output of the fifth information fusion FM module are simultaneously input into a third semantic supervision MLF module, and the output of the third semantic supervision MLF module is used as the fourth output of the convolutional neural network; the output of the fifth information fusion FM module is used as the first output of the convolutional neural network, the output of the fifth information fusion FM module is input to the multitask supervision RM module, and the first output, the second output and the third output of the multitask supervision RM module are respectively used as the fifth output, the sixth output and the seventh output of the convolutional neural network.
The first coding module and the sixth coding module have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first batch normalization layer and a first activation layer;
the second coding module and the seventh coding module have the same structure and are mainly formed by sequentially connecting a first downsampling layer, a first residual error unit and two second residual error units;
the third coding module and the eighth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and seven second residual error units;
the fourth coding module and the ninth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and 35 second residual error units;
the fifth coding module and the tenth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and two second residual error units.
The first residual error unit comprises a second convolution layer, a second normalization layer, a third convolution layer, a third normalization layer, a fourth convolution layer, a fourth normalization layer, a second active layer, a fifth convolution layer, a fifth normalization layer and a third active layer;
the second convolution layer is connected with the second active layer after sequentially passing through the second normalization layer, the third convolution layer, the third normalization layer, the fourth convolution layer and the fourth normalization layer, the input of the first residual error unit is the input of the second convolution layer, the input of the second convolution layer is also input into the fifth convolution layer, the fifth convolution layer is connected with the fifth normalization layer, the output of the second active layer and the output of the fifth normalization layer are added, the output is input into the third active layer, and the output of the third active layer is used as the output of the first residual error unit;
the second residual error unit comprises a sixth convolution layer, a sixth normalization layer, a seventh convolution layer, a seventh normalization layer, an eighth convolution layer, an eighth normalization layer, a fourth active layer and a fifth active layer;
the sixth convolution layer is connected with the fourth active layer after sequentially passing through the sixth normalization layer, the seventh convolution layer, the seventh normalization layer, the eighth convolution layer and the eighth normalization layer, the input of the second residual error unit is the input of the sixth convolution layer, the output of the fourth active layer added with the input of the sixth convolution layer is input into the fifth active layer, and the output of the fifth active layer is used as the output of the second residual error unit.
The 5 information fusion FM modules have the same structure, and specifically comprise:
comprises a fourth upsampling layer, a twenty-second convolutional layer and a twenty-third convolutional layer; each information fusion FM module has two inputs and one output, the second input of the information fusion FM module is input into a fourth up-sampling layer, the fourth up-sampling layer is connected with a twenty-second convolution layer, the output of the twenty-second convolution layer and the output of the information fusion FM module after being cascaded are input into a twenty-third convolution layer, and the output of the twenty-third convolution layer is used as the output of the information fusion FM module.
The semantic AHLS module comprises two first convolution modules, two first attention mechanism modules, a second convolution module and a second attention mechanism module;
the semantic AHLS module is provided with two inputs, a first input of the semantic AHLS module is connected with one first attention mechanism module after passing through one first convolution module, a second input of the semantic AHLS module is connected with the other first attention mechanism module after passing through the other first convolution module, the output of the two first attention mechanism modules after being cascaded is input into the second convolution module, the second convolution module is connected with the second attention mechanism module, and the output of the second attention mechanism module is used as the output of the semantic AHLS module;
the first convolution module mainly comprises a ninth convolution layer; the first attention mechanism module is mainly formed by sequentially connecting a first pooling layer, a first full-connection layer, a sixth activation layer, a second full-connection layer, a seventh activation layer, a third full-connection layer, an eighth activation layer and a tenth convolution layer;
the second convolution module is mainly formed by sequentially connecting an eleventh convolution layer, a ninth normalization layer and a ninth active layer; the second attention mechanism module is identical in structure to the first attention mechanism module.
The 5 feature fusion AMMF modules have the same structure, and specifically comprise:
a first blending module, a second blending module, a twelfth convolution layer, a first upsampling layer and a thirteenth convolution layer;
the first input of the feature fusion AMMF module is connected with the first coding module, the second coding module, the third coding module, the fourth coding module or the fifth coding module, the second input of the feature fusion AMMF module is connected with the sixth coding module, the seventh coding module, the eighth coding module, the ninth coding module or the tenth coding module, and the third input of the feature fusion AMMF module is connected with the semantic AHLS module, the first fusion feature output, the second fusion feature output, the third fusion feature output or the fourth fusion feature output;
the output of the feature fusion AMMF module after multiplication is used as a first fused output, the output of the first fused output after multiplication is used as a second fused output, the output of the second fused output after cascade connection is used as a second input of the feature fusion AMMF module, the output of the first blending module after cascade connection is used as a third input of the feature fusion AMMF module, the output of the first blending module and the output of the feature fusion AMMF module are input into the second blending module, the first blending module is connected with the thirteenth convolution layer after sequentially passing through the twelfth convolution layer and the first upsampling layer, and the output of the thirteenth convolution layer is used as the output of the feature fusion AMMF module;
the first blending module and the second blending module have the same structure and comprise a fourteenth convolution layer, a tenth normalization layer, a tenth activation layer, a fifteenth convolution layer, an eleventh normalization layer and a tenth activation layer;
the fourteenth convolution layer is connected with the eleventh normalization layer after sequentially passing through the tenth normalization layer, the tenth active layer and the fifteenth convolution layer, the input of the blending module is the input of the fourteenth convolution layer, the output of the fourteenth convolution layer after the input of the fourteenth convolution layer is added with the output of the eleventh normalization layer is input into the tenth active layer, and the output of the tenth active layer is used as the output of the blending module.
The 3 semantic supervision MLF modules have the same structure, and specifically comprise:
a second upsampling layer, a sixteenth convolution layer, a third upsampling layer and a seventeenth convolution layer; the first input of the semantic supervision MLF module is connected with the semantic AHLS module, the second fusion feature output or the fourth fusion feature output, and the second input of the semantic supervision MLF module is connected with the first fusion feature output, the third fusion feature output or the fifth fusion feature output;
the first input of the semantic supervised MLF module is connected with the sixteenth convolution layer after passing through the second up-sampling layer, the output of the sixteenth convolution layer and the output of the semantic supervised MLF module after being cascaded are connected with the seventeenth convolution layer after passing through the third up-sampling layer, and the output of the seventeenth convolution layer is used as the output of the semantic supervised MLF module.
The multitask supervision RM module comprises an eighteenth convolutional layer, a nineteenth convolutional layer, a twentieth convolutional layer, an exp function layer, a first multitask module, a second multitask module and a third multitask module;
the input of the multitask supervision RM module is the input of an eighteenth convolutional layer, the output of the eighteenth convolutional layer after passing through the first multitask module is used as the first output of the multitask supervision RM module, the output of the first output of the multitask supervision RM module after being multiplied by the output of the exp function layer and the output of the eighteenth convolutional layer is input into a second multitask module, the output of the second multitask module after passing through a nineteenth convolutional layer is used as the third output of the multitask supervision RM module, the output of the second multitask module after being cascaded with the output of the eighteenth convolutional layer is connected with a twentieth convolutional layer after passing through a third multitask module, and the output of the twentieth convolutional layer is used as the second output of the multitask supervision RM module;
the first multitask module, the second multitask module and the third multitask module are identical in structure and mainly formed by sequentially connecting a twenty-first convolution layer, a twelfth normalization layer and a twelfth active layer.
The size of each prediction map in the seven prediction map sets is the same as that of the RGB image of the initial road scene, and the seven prediction map sets comprise a semantic segmentation prediction map set Jpre1, a high-level semantic prediction map set Jpre2, a middle-level semantic prediction map set Jpre3, a low-level semantic prediction map set Jpre4, a foreground-background prediction map set Jpre5, a boundary prediction map set Jpre6 and a semantic prediction map set Jpre 7;
the set of semantic segmentation prediction maps Jpre1 is 9 semantic segmentation prediction maps f output by the first output of the convolutional neural networkfinalThe high-level semantic prediction graph set Jpre2 is formed by convolution9 high-level semantic prediction graph f of second output of neural networkhighThe middle-level semantic prediction graph set Jpre3 is composed of 9 middle-level semantic prediction graphs f of the third output of the convolutional neural networkmidComposed, the low-level semantic prediction graph set Jpre4 is 9 low-level semantic prediction graphs f output by the fourth output of the convolutional neural networklowThe set of foreground-background prediction maps Jpre5 is composed of 9 foreground-background prediction maps f output by the fifth output of the convolutional neural networkbinThe set of boundary prediction maps Jpre6 is composed of 9 boundary prediction maps f output from the sixth output of the convolutional neural networkbouThe semantic prediction graph set Jpre7 is composed of 9 semantic prediction graphs f output by the seventh output of the convolutional neural networksemAnd the seventh output of the convolutional neural network is used as a semantic segmentation prediction graph output in the testing stage process.
Compared with the prior art, the invention has the advantages that:
1) the method comprises the steps of constructing a convolutional neural network, inputting a road scene RGBT image in a training set into the convolutional neural network for training, and obtaining a convolutional neural network classification training model; the road scene RGBT image to be semantically segmented is input into a convolutional neural network classification training model, and a predicted semantic segmentation image corresponding to the road scene image is obtained through prediction.
2) The method adopts multi-task supervision, and performs semantic supervision, boundary supervision and foreground-background supervision on the output segmentation image respectively, thereby effectively improving the semantic segmentation precision.
3) The method divides the coding part network into three parts of high, middle and low, and carries out semantic supervision on the output prediction graph in the three parts, thereby obtaining good segmentation effect on a training set and a test set.
4) The method of the invention fully utilizes high-level semantics, combines the high-level semantics with low-level information, and fully utilizes information of each layer of the network, so that the segmentation result is more accurate.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a block diagram of an implementation of an advanced semantic AHLS module.
Fig. 3 is a block diagram of an implementation of a decoding stage feature fusion AMMF module.
Fig. 4 is a block diagram of an implementation of the high level semantic and low level information fusion FM module at the decoding stage.
Fig. 5 is a block diagram of an implementation of a semantic supervision MLF module of a high, medium, and low layers.
FIG. 6 is a block diagram of an implementation of a multitasking supervision RM module.
FIG. 7a is a first original road scene RGB image;
FIG. 7b is a segmented image obtained by segmenting the first RGB image of the original road scene shown in FIG. 7a according to the present invention;
FIG. 8a is a second original road scene RGB image;
FIG. 8b is a segmented image obtained by segmenting the second original road scene RGB image shown in FIG. 8a using the method of the present invention;
FIG. 9a is a third original road scene RGB image;
FIG. 9b is a segmented image obtained by segmenting the third RGB image of the original road scene shown in FIG. 9a according to the present invention;
FIG. 10a is a fourth original road scene RGB image;
FIG. 10b is a segmented image obtained by segmenting the fourth RGB image of the original road scene shown in FIG. 10a using the method of the present invention;
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The invention provides an RGBT road scene semantic segmentation method based on a multi-supervision network, which has a general implementation block diagram as shown in FIG. 1 and comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting a plurality of original road scene RGB images, the corresponding original Thermal infrared images and the corresponding real semantic segmentation images; in this specific implementation, 784 original road scene RGB images, the corresponding original Thermal infrared images and the corresponding real semantic segmentation images are selected. The original road scene RGB images and the corresponding original Thermal infrared images serve as the original road scene images; the set of original road scene images is denoted {J(i, j)}, and the set of corresponding real semantic segmentation images is denoted {Jtrue(i, j)}. The real semantic segmentation image corresponding to each original road scene RGB image in the training set is then processed into 9 one-hot encoded images using the existing one-hot encoding technique, and the set formed by the 9 one-hot encoded images is denoted Jtrue. The original road scene images have a height of 480 and a width of 640, with 1 ≤ i ≤ 640 and 1 ≤ j ≤ 480; J(i, j) denotes the pixel value of the pixel at coordinate position (i, j) in the set {J(i, j)} of original road scene images, and Jtrue(i, j) denotes the pixel value of the pixel at coordinate position (i, j) in the set {Jtrue(i, j)} of real semantic segmentation images. Data enhancement is applied to each original road scene RGB image and each original Thermal infrared image through cropping, brightness adjustment and flipping to obtain the initial road scene RGB images and initial Thermal infrared images, where the cropping keeps the image size unchanged by filling the cropped-away part with 0. The batch size is 4, and the training set is formed by the plurality of initial road scene RGB images, initial Thermal infrared images and the corresponding real semantic segmentation images;
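A minimal sketch of the label encoding and data augmentation described in step 1_1; the one-hot conversion follows the 9-class setting, while the augmentation strengths, the crop window size and the restriction of brightness adjustment to the RGB image are assumptions:

import numpy as np

def to_one_hot(label_map, num_classes=9):
    """Turn a 480 x 640 map of class indices into 9 binary one-hot maps (the set Jtrue)."""
    return np.stack([(label_map == c).astype(np.float32) for c in range(num_classes)])

def augment(rgb, thermal, brightness=0.2, flip_prob=0.5):
    """Brightness jitter and horizontal flip applied consistently to an RGB/Thermal pair."""
    rgb = np.clip(rgb * (1.0 + np.random.uniform(-brightness, brightness)), 0, 255)
    if np.random.rand() < flip_prob:
        rgb, thermal = rgb[:, ::-1].copy(), thermal[:, ::-1].copy()
    return rgb, thermal

def crop_keep_size(image, crop_h=400, crop_w=520):
    """Size-preserving crop: keep a random crop_h x crop_w window and fill the removed
    region with 0, so the 480 x 640 image size is unchanged."""
    out = np.zeros_like(image)
    top = np.random.randint(0, image.shape[0] - crop_h + 1)
    left = np.random.randint(0, image.shape[1] - crop_w + 1)
    out[top:top + crop_h, left:left + crop_w] = image[top:top + crop_h, left:left + crop_w]
    return out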
step 1_ 2: constructing a convolutional neural network;
as shown in fig. 1, the convolutional neural network includes two parts, namely an encoding module and a decoding module, which are respectively used for performing feature extraction operation and upsampling operation on an initial road scene RGB image and a corresponding initial Thermal infrared image, and the encoding module is connected with the decoding module;
the encoding module comprises 10 encoding modules, and the decoding module comprises a semantic AHLS module, a multitask supervision RM module, 5 information fusion FM modules, 5 feature fusion AMMF modules and 3 semantic supervision MLF modules; the semantic AHLS module is used for generating high-level semantics; the characteristic fusion AMMF module is used for fusing RGB information, Thermal information and the previous-stage output information; the semantic supervision MLF module is used for fusing high-level, middle-level and low-level semantic information; the RM module is used for semantic supervision, boundary supervision and foreground-background supervision. The information fusion FM module is used for fusing high-level semantics and low-level information.
The first coding module is connected with the semantic AHLS module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the sixth coding module is connected with the semantic AHLS module after sequentially passing through a seventh coding module, an eighth coding module, a ninth coding module and a tenth coding module, the input of the first coding module is an initial road scene RGB image, and the input of the sixth coding module is an initial Thermal infrared image;
the fifth coding module, the semantic AHLS module and the tenth coding module are connected with the first feature fusion AMMF module, the output of the first feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the first information fusion FM module, the output of the first information fusion FM module and the output of the semantic AHLS module are simultaneously input into the first semantic supervision MLF module, and the output of the first semantic supervision MLF module is used as the second output of the convolutional neural network;
the output of the fourth coding module, the output of the first information fusion FM module and the output of the ninth coding module are input into the second feature fusion AMMF module, the output of the second feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the second information fusion FM module, the output of the third coding module and the output of the eighth coding module are simultaneously input into the third feature fusion AMMF module, the output of the third feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the third information fusion FM module, the output of the second information fusion FM module and the output of the third information fusion FM module are simultaneously input into the second semantic supervision MLF module, and the output of the second semantic supervision MLF module is used as the third output of the convolutional neural network;
the output of the second coding module, the output of the third information fusion FM module and the output of the seventh coding module are input into a fourth feature fusion AMMF module, the output of the fourth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fourth information fusion FM module, the output of the first coding module and the output of the sixth coding module are input into a fifth feature fusion AMMF module, the output of the fifth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fifth information fusion FM module, the output of the fourth information fusion FM module and the output of the fifth information fusion FM module are simultaneously input into a third semantic supervision MLF module, and the output of the third semantic supervision MLF module is used as the fourth output of the convolutional neural network; the output of the fifth information fusion FM module is used as the first output of the convolutional neural network, the output of the fifth information fusion FM module is input to the multitask supervision RM module, and the first output, the second output and the third output of the multitask supervision RM module are respectively used as the fifth output, the sixth output and the seventh output of the convolutional neural network.
The first coding module and the sixth coding module have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first batch normalization layer and a first activation layer;
the second coding module and the seventh coding module have the same structure and are mainly formed by sequentially connecting a first downsampling layer, a first residual error unit and two second residual error units; the first downsampling layer is specifically max-pooling downsampling.
The third coding module and the eighth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and seven second residual error units;
the fourth coding module and the ninth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and 35 second residual error units;
the fifth coding module and the tenth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and two second residual error units.
The first residual error unit comprises a second convolution layer, a second normalization layer, a third convolution layer, a third normalization layer, a fourth convolution layer, a fourth normalization layer, a second active layer, a fifth convolution layer, a fifth normalization layer and a third active layer;
the second convolution layer is connected with the second active layer after sequentially passing through the second normalization layer, the third convolution layer, the third normalization layer, the fourth convolution layer and the fourth normalization layer, the input of the first residual error unit is the input of the second convolution layer, the input of the second convolution layer is also input into the fifth convolution layer, the fifth convolution layer is connected with the fifth normalization layer, the output of the second active layer and the output of the fifth normalization layer are added, the output is input into the third active layer, and the output of the third active layer is used as the output of the first residual error unit. The activation function of the third activation layer is a Relu activation function.
The second residual error unit comprises a sixth convolution layer, a sixth normalization layer, a seventh convolution layer, a seventh normalization layer, an eighth convolution layer, an eighth normalization layer, a fourth active layer and a fifth active layer;
the sixth convolution layer is connected with the fourth active layer after sequentially passing through the sixth normalization layer, the seventh convolution layer, the seventh normalization layer, the eighth convolution layer and the eighth normalization layer, the input of the second residual error unit is the input of the sixth convolution layer, the output of the fourth active layer added with the input of the sixth convolution layer is input into the fifth active layer, and the output of the fifth active layer is used as the output of the second residual error unit. The activation function of the fifth activation layer is a Relu activation function.
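A possible PyTorch rendering of the two residual units described above; the bottleneck width mid_ch is left as a parameter because the text specifies only the per-module output channel counts, and the single activation at the end of the main branch follows the description:

import torch
import torch.nn as nn

class FirstResidualUnit(nn.Module):
    """Bottleneck residual unit with a projection shortcut: three conv+BN stages and an
    activation on the main branch, a 1x1 conv+BN on the shortcut, then add and ReLU."""
    def __init__(self, in_ch, out_ch, mid_ch, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, 1, 0), nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 3, stride, 1), nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, out_ch, 1, 1, 0), nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, 0), nn.BatchNorm2d(out_ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + self.shortcut(x))

class SecondResidualUnit(nn.Module):
    """Bottleneck residual unit with an identity shortcut."""
    def __init__(self, channels, mid_ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, mid_ch, 1, 1, 0), nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 3, 1, 1), nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, channels, 1, 1, 0), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + x)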
As shown in fig. 4, the 5 information fusion FM modules have the same structure, specifically:
comprises a fourth upsampling layer, a twenty-second convolutional layer and a twenty-third convolutional layer; each information fusion FM module has two inputs and one output, the second input of the information fusion FM module is input into a fourth up-sampling layer, the fourth up-sampling layer is connected with a twenty-second convolution layer, the output of the twenty-second convolution layer and the output of the information fusion FM module after being cascaded are input into a twenty-third convolution layer, and the output of the twenty-third convolution layer is used as the output of the information fusion FM module.
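A sketch of the information fusion FM module under one reading of the text: the second input (the high-level semantics) is upsampled and convolved, concatenated with the first input, and fused by a final convolution; the upsampling factor and the channel widths are assumptions:

import torch
import torch.nn as nn

class InformationFusionFM(nn.Module):
    """Fuses high-level semantics (second input) with lower-level features (first input)."""
    def __init__(self, low_ch, high_ch, out_ch, scale=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)  # fourth upsampling layer
        self.conv22 = nn.Conv2d(high_ch, low_ch, 3, 1, 1)                                # twenty-second convolution layer
        self.conv23 = nn.Conv2d(low_ch * 2, out_ch, 3, 1, 1)                             # twenty-third convolution layer

    def forward(self, first_in, second_in):
        x = self.conv22(self.up(second_in))
        return self.conv23(torch.cat([x, first_in], dim=1))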
As shown in fig. 2, the semantic AHLS module includes two first convolution modules, two first attention modules, one second convolution module, and one second attention module;
the semantic AHLS module is provided with two inputs, a first input of the semantic AHLS module is connected with one first attention mechanism module after passing through one first convolution module, a second input of the semantic AHLS module is connected with the other first attention mechanism module after passing through the other first convolution module, the output of the two first attention mechanism modules after being cascaded is input into the second convolution module, the second convolution module is connected with the second attention mechanism module, and the output of the second attention mechanism module is used as the output of the semantic AHLS module;
the first convolution module mainly comprises a ninth convolution layer; the first attention mechanism module is mainly formed by sequentially connecting a first pooling layer, a first full-connection layer, a sixth activation layer, a second full-connection layer, a seventh activation layer, a third full-connection layer, an eighth activation layer and a tenth convolution layer;
the second convolution module is mainly formed by sequentially connecting an eleventh convolution layer, a ninth normalization layer and a ninth active layer; the second attention mechanism module is identical in structure to the first attention mechanism module.
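A sketch of the semantic AHLS module is given below; the text does not state how the weights produced by the pooling and fully connected layers are applied, so a squeeze-and-excitation style re-scaling of the input is assumed, as are the channel widths and the choice of activation functions:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """First/second attention mechanism module: global pooling, three fully connected
    layers with activations, and a final convolution; SE-style re-scaling is assumed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                      # first pooling layer (type assumed)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(channels, channels, 1)             # tenth convolution layer

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return self.conv(x * w)

class SemanticAHLS(nn.Module):
    """Each input passes through a convolution and an attention module; the two branches are
    concatenated and refined by a conv-BN-ReLU block and a second attention module."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch_rgb = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), ChannelAttention(out_ch))
        self.branch_thermal = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), ChannelAttention(out_ch))
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            ChannelAttention(out_ch),
        )

    def forward(self, rgb_feat, thermal_feat):
        return self.fuse(torch.cat([self.branch_rgb(rgb_feat), self.branch_thermal(thermal_feat)], dim=1))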
As shown in fig. 3, the 5 feature fusion AMMF modules have the same structure, specifically:
a first blending module, a second blending module, a twelfth convolution layer, a first upsampling layer and a thirteenth convolution layer;
the first input of the feature fusion AMMF module is connected with the first coding module, the second coding module, the third coding module, the fourth coding module or the fifth coding module, the second input of the feature fusion AMMF module is connected with the sixth coding module, the seventh coding module, the eighth coding module, the ninth coding module or the tenth coding module, and the third input of the feature fusion AMMF module is connected with the semantic AHLS module, the first fusion feature output, the second fusion feature output, the third fusion feature output or the fourth fusion feature output;
the output of the feature fusion AMMF module after multiplication is used as a first fused output, the output of the first fused output after multiplication is used as a second fused output, the output of the second fused output after cascade connection is used as a second input of the feature fusion AMMF module, the output of the first blending module after cascade connection is used as a third input of the feature fusion AMMF module, the output of the first blending module and the output of the feature fusion AMMF module are input into the second blending module, the first blending module is connected with the thirteenth convolution layer after sequentially passing through the twelfth convolution layer and the first upsampling layer, and the output of the thirteenth convolution layer is used as the output of the feature fusion AMMF module;
the first blending module and the second blending module have the same structure and comprise a fourteenth convolution layer, a tenth normalization layer, a tenth activation layer, a fifteenth convolution layer, an eleventh normalization layer and a tenth activation layer;
the fourteenth convolution layer is connected with the eleventh normalization layer after sequentially passing through the tenth normalization layer, the tenth active layer and the fifteenth convolution layer, the input of the blending module is the input of the fourteenth convolution layer, the output of the fourteenth convolution layer after the input of the fourteenth convolution layer is added with the output of the eleventh normalization layer is input into the tenth active layer, and the output of the tenth active layer is used as the output of the blending module. The activation function of the tenth activation layer is a Relu activation function.
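Because the textual description of the multiplication and concatenation order in the AMMF module is ambiguous, the sketch below is only one plausible reading: the RGB and Thermal features are multiplied, the product is further modulated by the previous-stage (or AHLS) input, the results are concatenated, refined by the two blending modules, and finally convolved and upsampled; the channel widths and the upsampling factor are assumptions:

import torch
import torch.nn as nn

class BlendingModule(nn.Module):
    """First/second blending module: conv-BN-ReLU-conv-BN with a residual add and ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, 1, 1), nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + x)

class FeatureFusionAMMF(nn.Module):
    """One plausible reading of the feature fusion AMMF module (not the patent's exact wiring)."""
    def __init__(self, channels, out_ch):
        super().__init__()
        self.blend1 = BlendingModule(channels * 3)
        self.blend2 = BlendingModule(channels * 3)
        self.conv12 = nn.Conv2d(channels * 3, out_ch, 3, 1, 1)                            # twelfth convolution layer
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)       # first upsampling layer
        self.conv13 = nn.Conv2d(out_ch, out_ch, 3, 1, 1)                                  # thirteenth convolution layer

    def forward(self, rgb_feat, thermal_feat, prev_feat):
        fused1 = rgb_feat * thermal_feat                        # first fused output
        fused2 = fused1 * prev_feat                             # second fused output
        x = self.blend2(self.blend1(torch.cat([fused1, fused2, prev_feat], dim=1)))
        return self.conv13(self.up(self.conv12(x)))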
As shown in fig. 5, the 3 semantic supervised MLF modules have the same structure, specifically:
a second upsampling layer, a sixteenth convolution layer, a third upsampling layer and a seventeenth convolution layer; the first input of the semantic supervision MLF module is connected with the semantic AHLS module, the second fusion feature output or the fourth fusion feature output, and the second input of the semantic supervision MLF module is connected with the first fusion feature output, the third fusion feature output or the fifth fusion feature output;
the first input of the semantic supervised MLF module is connected with the sixteenth convolution layer after passing through the second up-sampling layer, the output of the sixteenth convolution layer and the output of the semantic supervised MLF module after being cascaded are connected with the seventeenth convolution layer after passing through the third up-sampling layer, and the output of the seventeenth convolution layer is used as the output of the semantic supervised MLF module.
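A sketch of the semantic supervision MLF module under one reading of the text (the concatenation partner of the sixteenth convolution output is assumed to be the module's second input); the scale factors, channel widths and 9-class output are assumptions consistent with the rest of the description:

import torch
import torch.nn as nn

class SemanticSupervisionMLF(nn.Module):
    """Upsample and convolve the first input, concatenate with the second input, then
    upsample again and project to the 9 semantic classes."""
    def __init__(self, in1_ch, in2_ch, num_classes=9, scale1=2, scale2=2):
        super().__init__()
        self.up2 = nn.Upsample(scale_factor=scale1, mode="bilinear", align_corners=False)  # second upsampling layer
        self.conv16 = nn.Conv2d(in1_ch, in2_ch, 3, 1, 1)                                   # sixteenth convolution layer
        self.up3 = nn.Upsample(scale_factor=scale2, mode="bilinear", align_corners=False)  # third upsampling layer
        self.conv17 = nn.Conv2d(in2_ch * 2, num_classes, 3, 1, 1)                          # seventeenth convolution layer

    def forward(self, first_in, second_in):
        x = torch.cat([self.conv16(self.up2(first_in)), second_in], dim=1)
        return self.conv17(self.up3(x))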
As shown in fig. 6, the multitask supervision RM module includes an eighteenth convolutional layer, a nineteenth convolutional layer, a twentieth convolutional layer, an exp function layer, a first multitask module, a second multitask module, and a third multitask module;
the input of the multitask supervision RM module is the input of an eighteenth convolutional layer, the output of the eighteenth convolutional layer after passing through the first multitask module is used as the first output of the multitask supervision RM module, the output of the first output of the multitask supervision RM module after being multiplied by the output of the exp function layer and the output of the eighteenth convolutional layer is input into a second multitask module, the output of the second multitask module after passing through a nineteenth convolutional layer is used as the third output of the multitask supervision RM module, the output of the second multitask module after being cascaded with the output of the eighteenth convolutional layer is connected with a twentieth convolutional layer after passing through a third multitask module, and the output of the twentieth convolutional layer is used as the second output of the multitask supervision RM module;
the first multitask module, the second multitask module and the third multitask module have the same structure and are mainly formed by sequentially connecting a twenty-first convolution layer, a twelfth normalization layer and a twelfth active layer.
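A sketch of the multitask supervision RM module following the stated wiring; because the channel widths of the exp-weighted first output and of the shared feature are not specified, the exponential is averaged over channels to form a single spatial weight, which is an assumption, as are the other channel widths:

import torch
import torch.nn as nn

class MultiTaskBlock(nn.Module):
    """First/second/third multitask module: conv-BN-ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class MultiTaskSupervisionRM(nn.Module):
    """Shared convolution, a foreground-background branch, an exp-weighted semantic branch,
    and a concatenation branch for the boundary maps."""
    def __init__(self, in_ch, mid_ch=64, num_classes=9):
        super().__init__()
        self.conv18 = nn.Conv2d(in_ch, mid_ch, 3, 1, 1)          # eighteenth convolution layer
        self.task1 = MultiTaskBlock(mid_ch, num_classes)         # first multitask module
        self.task2 = MultiTaskBlock(mid_ch, mid_ch)              # second multitask module
        self.conv19 = nn.Conv2d(mid_ch, num_classes, 3, 1, 1)    # nineteenth convolution layer
        self.task3 = MultiTaskBlock(mid_ch * 2, mid_ch)          # third multitask module
        self.conv20 = nn.Conv2d(mid_ch, num_classes, 3, 1, 1)    # twentieth convolution layer

    def forward(self, x):
        shared = self.conv18(x)
        out1 = self.task1(shared)                                # first output (foreground-background maps)
        weight = torch.exp(out1).mean(dim=1, keepdim=True)       # exp function layer; channel handling assumed
        branch = self.task2(weight * shared)
        out3 = self.conv19(branch)                               # third output (semantic maps)
        out2 = self.conv20(self.task3(torch.cat([branch, shared], dim=1)))  # second output (boundary maps)
        return out1, out2, out3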
For the 1st coding module: it consists of a first Convolution layer (Conv), a first batch normalization layer (BatchNorm) and a first Activation layer (Act) arranged in sequence. The first convolution layer uses a convolution kernel size (kernel_size) of 7, a stride of 2, an edge padding of 3, and 64 convolution kernels. The input of the 1st coding module receives the RGB three-channel components of the original input image; the original input image received at the input is required to have a width of W and a height of H. After the normalization operation of the first batch normalization layer, 64 output feature maps are produced through the first activation layer (Relu activation); the set formed by the 64 sub-feature maps is denoted N1, where each feature map has a width of W/2 and a height of H/2.
For the 2nd coding module: it consists of 1 downsampling layer followed by 3 residual units. The 1st downsampling layer uses max-pooling downsampling with a kernel size of 3 × 3, a stride of 2 and a padding of 1. The main branch of the first residual unit is formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer. The number of output channels is 256. The shortcut branch consists of one convolution layer with kernel size 1 and stride 1 followed by one normalization layer, with 256 output channels. The other residual units are likewise formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer, with 256 output channels; their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch, and the sum passes through a Relu activation function to give the final output. The input of the 2nd coding module receives all the feature maps in N1, and its output produces 256 sub-feature maps; the set of the 256 sub-feature maps is denoted N2, where each feature map has a width of W/4 and a height of H/4.
For the 3rd coding module: it consists of 8 residual units in sequence. The main branch of the first residual unit is formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 2; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer. The number of output channels is 512. The shortcut branch consists of one convolution layer with kernel size 1 and stride 2 followed by one normalization layer, with 512 output channels. The other residual units are likewise formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer, with 512 output channels; their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch, and the sum passes through a Relu activation function to give the final output. The input of the 3rd coding module receives all the feature maps in N2, and its output produces 512 sub-feature maps; the set of the 512 sub-feature maps is denoted N3, where each feature map has a width of W/8 and a height of H/8.
For the 4th coding module: it consists of 36 residual units in sequence. The main branch of the first residual unit is formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 2; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer. The number of output channels is 1024. The shortcut branch consists of one convolution layer with kernel size 1 and stride 2 followed by one normalization layer, with 1024 output channels. The other residual units are likewise formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer, with 1024 output channels; their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch, and the sum passes through a Relu activation function to give the final output. The input of the 4th coding module receives all the feature maps in N3, and its output produces 1024 sub-feature maps; the set of the 1024 sub-feature maps is denoted N4, where each feature map has a width of W/16 and a height of H/16.
For the 5th coding module: it consists of 3 residual units in sequence. The main branch of the first residual unit is formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 2; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer. The number of output channels is 2048. The shortcut branch consists of one convolution layer with kernel size 1 and stride 2 followed by one normalization layer, with 2048 output channels. The other residual units are likewise formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer, with 2048 output channels; their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch, and the sum passes through a Relu activation function to give the final output. The input of the 5th coding module receives all the feature maps in N4, and its output produces 2048 sub-feature maps; the set of the 2048 sub-feature maps is denoted N5, where each feature map has a width of W/32 and a height of H/32.
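The five coding modules described above (a 7 x 7 stride-2 stem followed by 3, 8, 36 and 3 bottleneck residual units with 256, 512, 1024 and 2048 output channels) match the layout of a standard ResNet-152, so in practice the RGB encoder could be sliced out of a torchvision ResNet-152 as sketched below; this is an implementation assumption rather than something the patent states, and the Thermal branch (the 6th to 10th coding modules) would repeat the same construction with a single-channel stem:

import torch.nn as nn
from torchvision.models import resnet152

# Assumption: the coding modules follow the ResNet-152 layout, so they can be taken from torchvision.
backbone = resnet152(weights=None)
coding1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)   # 64 maps,   W/2 x H/2   (N1)
coding2 = nn.Sequential(backbone.maxpool, backbone.layer1)             # 256 maps,  W/4 x H/4   (N2)
coding3 = backbone.layer2                                              # 512 maps,  W/8 x H/8   (N3)
coding4 = backbone.layer3                                              # 1024 maps, W/16 x H/16 (N4)
coding5 = backbone.layer4                                              # 2048 maps, W/32 x H/32 (N5)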
For the 6th coding module: it consists of a first Convolution layer (Conv), a first batch normalization layer (BatchNorm) and a first Activation layer (Act) arranged in sequence. The first convolution layer uses a convolution kernel size (kernel_size) of 7, a stride of 2, an edge padding of 3, and 64 convolution kernels. The input of the 6th coding module receives the Thermal single-channel component of the original input image; the original input image received at the input is required to have a width of W and a height of H. After the normalization operation of the first batch normalization layer, 64 output feature maps are produced through the first activation layer (Relu activation); the set formed by the 64 sub-feature maps is denoted N6, where each feature map has a width of W/2 and a height of H/2.
The 7th coding module is composed, in order, of 1 down-sampling layer and 3 residual units. The down-sampling layer uses max pooling with a 3×3 kernel, a stride of 2 and a padding of 1. The main branch of the first residual unit is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 256. Its shortcut branch consists, in order, of one convolution layer with a kernel size of 1 and a stride of 1 and one normalization layer, with 256 output channels. Each of the remaining residual units is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 256; the shortcut branch of these units performs no operation and simply passes the input data through. The final operation of each residual unit adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. The input of the 7th coding module receives all the feature maps in N6; its output produces 256 sub-feature maps, whose set is denoted N7, where each feature map has a width of W/4 and a height of H/4.
The 8th coding module consists of 8 residual units arranged in sequence. The main branch of the first residual unit is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 512. Its shortcut branch consists, in order, of one convolution layer with a kernel size of 1 and a stride of 2 and one normalization layer, with 512 output channels. Each of the remaining residual units is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 512; the shortcut branch of these units performs no operation and simply passes the input data through. Each residual unit finally adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. The input of the 8th coding module receives all the feature maps in N7; its output produces 512 sub-feature maps, whose set is denoted N8, where each feature map has a width of W/8 and a height of H/8.
The 9th coding module consists of 36 residual units arranged in sequence. The main branch of the first residual unit is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 1024. Its shortcut branch consists, in order, of one convolution layer with a kernel size of 1 and a stride of 2 and one normalization layer, with 1024 output channels. Each of the remaining residual units is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 1024; the shortcut branch of these units performs no operation and simply passes the input data through. Each residual unit finally adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. The input of the 9th coding module receives all the feature maps in N8; its output produces 1024 sub-feature maps, whose set is denoted N9, where each feature map has a width of W/16 and a height of H/16.
The 10th coding module consists of 3 residual units arranged in sequence. The main branch of the first residual unit is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 2048. Its shortcut branch consists, in order, of one convolution layer with a kernel size of 1 and a stride of 2 and one normalization layer, with 2048 output channels. Each of the remaining residual units is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 2048; the shortcut branch of these units performs no operation and simply passes the input data through. Each residual unit finally adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. The input of the 10th coding module receives all the feature maps in N9; its output produces 2048 sub-feature maps, whose set is denoted N10, where each feature map has a width of W/32 and a height of H/32.
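The ten coding modules above mirror the five stages of a ResNet-152-style backbone, duplicated for the RGB and Thermal branches (the 6th module takes a single-channel Thermal input). As a rough illustration only, and assuming the torchvision package is available, the stages could be instantiated as follows; whether pretrained weights are used is not stated in the text, so that choice is left open here:

```python
import torch.nn as nn
from torchvision.models import resnet152

def make_encoder(in_channels=3):
    """Split a ResNet-152 into the five coding modules used by one branch."""
    backbone = resnet152(weights=None)  # pretrained weights optional; an assumption either way
    if in_channels == 1:
        # Thermal branch: the first convolution takes a single-channel input.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    stage1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)  # 1st / 6th coding module
    stage2 = nn.Sequential(backbone.maxpool, backbone.layer1)            # 2nd / 7th: 3 residual units
    stage3 = backbone.layer2                                             # 3rd / 8th: 8 residual units
    stage4 = backbone.layer3                                             # 4th / 9th: 36 residual units
    stage5 = backbone.layer4                                             # 5th / 10th: 3 residual units
    return nn.ModuleList([stage1, stage2, stage3, stage4, stage5])

rgb_encoder = make_encoder(in_channels=3)
thermal_encoder = make_encoder(in_channels=1)
```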
The high-level semantic AHLS module consists, in order, of a first convolution module, a first attention mechanism module, a Tensor splicing layer, a second convolution module and a second attention mechanism module. The first convolution module has a kernel size of 1, a stride of 1 and 64 convolution kernels. The first attention mechanism module consists, in order, of a global max pooling layer, a first fully connected layer, a first activation function, a second fully connected layer, a second activation function, a third fully connected layer, a Sigmoid function and a convolution with a kernel size of 1 and a stride of 1. The Tensor splicing operation concatenates the two input features along the channel dimension. The second convolution module consists, in order, of a convolution layer with a kernel size of 3 and a stride of 1, a normalization layer and an activation layer. The second attention mechanism module is identical to the first attention mechanism module. The RGB output of the 5th coding module is denoted R5, and the Thermal output of the 10th coding module is denoted T5. R5 and T5 are each passed, in order, through a first convolution module and a first attention mechanism module, and the respective outputs are denoted R5^out and T5^out. R5^out and T5^out are then fed into the Tensor splicing layer, whose output is f_out. Finally, f_out is passed, in order, through the second convolution module and the second attention mechanism module to output the high-level semantic information f_high.
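A minimal sketch of this high-level semantic fusion in PyTorch. The sizes of the three fully connected layers, the output width of the 3×3 convolution and the way the Sigmoid gate is applied are not spelled out in the text, so the values chosen below (reduction ratio, 64-channel fusion width) are assumptions; only the overall layout (1×1 convolution, channel attention, channel-wise splicing, 3×3 convolution, second attention) follows the description above:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global max pool -> three FC layers -> Sigmoid gate -> 1x1 conv (layout per the description)."""
    def __init__(self, ch, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(ch, ch, kernel_size=1, stride=1)

    def forward(self, x):
        w = torch.amax(x, dim=(2, 3))                   # global max pooling over H and W
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)      # per-channel gate
        return self.conv(x * w)                         # re-weight channels, then 1x1 conv

class AHLS(nn.Module):
    """High-level semantic module: per-branch 1x1 conv + attention, splice, 3x3 conv, attention."""
    def __init__(self, in_rgb=2048, in_thermal=2048, ch=64):
        super().__init__()
        self.conv_rgb = nn.Conv2d(in_rgb, ch, 1)
        self.conv_t = nn.Conv2d(in_thermal, ch, 1)
        self.att_rgb = ChannelAttention(ch)
        self.att_t = ChannelAttention(ch)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.att_out = ChannelAttention(ch)

    def forward(self, r5, t5):
        r = self.att_rgb(self.conv_rgb(r5))
        t = self.att_t(self.conv_t(t5))
        f_out = torch.cat([r, t], dim=1)        # Tensor splicing along the channel dimension
        return self.att_out(self.fuse(f_out))   # f_high
```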
The first feature fusion AMMF module 1 is constructed as follows. The RGB output of the 5th coding module is denoted R5, and the Thermal output of the 10th coding module is denoted T5. R5 and T5 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R5^out and T5^out. R5^out and T5^out are multiplied element-wise (dot product) to obtain the output f4^out1; R5^out, T5^out and f4^out1 are then added to obtain f4^out2; R5^out and f4^out2 are then spliced to obtain f4^out3; f4^out3 is then input to the first blending module to obtain f4^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch of the first blending module performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f_high generated in the step above and f4^out4 are spliced to obtain f4^out5, and f4^out5 is input to the second blending module to obtain the output f4^out6, where the second blending module is identical to the first blending module. f4^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f4^out7; f4^out7 is up-sampled by a bilinear interpolation operation to obtain f4^out8; and f4^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the first feature fusion AMMF module 1, f4^out9. At this point the feature maps are twice their original size: each feature map has a width of W/16 and a height of H/16.
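A compact sketch of the fusion arithmetic performed by one AMMF module, again in PyTorch. Here r_out and t_out are assumed to be the 64-channel outputs of the per-branch 1×1 convolution + attention (as in the AHLS sketch), prev is f_high for AMMF module 1 or the previous FM output (f4, f3, f2, f1) for the later modules, and the class names are illustrative; which branch feature is spliced with the summed feature is not fully clear from the text, so the RGB-branch output is used as an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Blending(nn.Module):
    """Residual blending block: two 3x3 conv/BN layers plus an identity shortcut, then ReLU."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return F.relu(self.body(x) + x)

class AMMF(nn.Module):
    """Multiply, add, splice, blend, fuse with the previous decoder feature, then
    1x1 conv -> 2x bilinear upsample -> 1x1 conv."""
    def __init__(self, ch=64):
        super().__init__()
        self.blend1 = Blending(2 * ch)          # operates on the spliced (2*ch) feature
        self.blend2 = Blending(3 * ch)          # after splicing with `prev` (ch more channels)
        self.conv_a = nn.Conv2d(3 * ch, ch, 1)  # 1x1, 64 kernels
        self.conv_b = nn.Conv2d(ch, ch, 1)      # 1x1, 64 kernels, after upsampling

    def forward(self, r_out, t_out, prev):
        f1 = r_out * t_out                      # dot product (element-wise multiplication)
        f2 = r_out + t_out + f1                 # addition of the three terms
        f3 = torch.cat([r_out, f2], dim=1)      # splicing (RGB branch assumed here)
        f4 = self.blend1(f3)                    # first blending module
        f5 = torch.cat([prev, f4], dim=1)       # splice with f_high / previous FM output
        f6 = self.blend2(f5)                    # second blending module
        f7 = self.conv_a(f6)
        f8 = F.interpolate(f7, scale_factor=2, mode="bilinear", align_corners=False)
        return self.conv_b(f8)                  # AMMF output, twice the input resolution
```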
Since the skip connections used by the network model are implemented as information fusion FM modules, the first information fusion FM module works as follows: the high-level semantic feature f_high is up-sampled by 2× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^1; the output f4^out9 of the first feature fusion AMMF module 1 is spliced with f_high^1 to obtain the output f4^out10; finally, f4^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the final output f4. The output at this point consists of 64 feature maps, each with a width of W/16 and a height of H/16.
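Correspondingly, a sketch of one information fusion FM module: f_high is up-sampled by the appropriate factor, projected to 64 channels, spliced with the AMMF output and projected again. The scale factor is 2 for the first FM module and 4, 8, 16 and 32 for the following ones; names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FM(nn.Module):
    """Information fusion FM module: upsample f_high, 1x1 conv, splice with AMMF output, 1x1 conv."""
    def __init__(self, ch=64, scale=2):
        super().__init__()
        self.scale = scale
        self.conv_high = nn.Conv2d(ch, ch, 1)     # 1x1, 64 kernels
        self.conv_out = nn.Conv2d(2 * ch, ch, 1)  # 1x1, 64 kernels after splicing

    def forward(self, ammf_out, f_high):
        up = F.interpolate(f_high, scale_factor=self.scale, mode="bilinear", align_corners=False)
        f_high_i = self.conv_high(up)
        return self.conv_out(torch.cat([ammf_out, f_high_i], dim=1))  # e.g. f4 for the first FM
```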
The second feature fusion AMMF module 2 is constructed as follows. The RGB output of the 4th coding module is denoted R4, and the Thermal output of the 9th coding module is denoted T4. R4 and T4 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R4^out and T4^out. R4^out and T4^out are multiplied element-wise (dot product) to obtain the output f3^out1; R4^out, T4^out and f3^out1 are then added to obtain f3^out2; R4^out and f3^out2 are then spliced to obtain f3^out3; f3^out3 is then input to the first blending module to obtain f3^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f4 generated in the step above and f3^out4 are spliced to obtain f3^out5, and f3^out5 is input to the second blending module to obtain the output f3^out6, where the second blending module is identical to the first blending module. f3^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f3^out7; f3^out7 is up-sampled by a bilinear interpolation operation to obtain f3^out8; and f3^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the second feature fusion AMMF module 2 (RAMMF2), f3^out9. At this point the feature maps are twice their original size: each feature map has a width of W/8 and a height of H/8.
Because the model uses skip connections, the second information fusion FM module works as follows: f_high is up-sampled by 4× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^2; the output f3^out9 of the second feature fusion AMMF module 2 is spliced with f_high^2 to obtain the output f3^out10; finally, f3^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the final output f3. The output at this point consists of 64 feature maps, each with a width of W/8 and a height of H/8.
The third feature fusion AMMF module 3 is constructed as follows. The RGB output of the 3rd coding module is denoted R3, and the Thermal output of the 8th coding module is denoted T3. R3 and T3 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R3^out and T3^out. R3^out and T3^out are multiplied element-wise (dot product) to obtain the output f2^out1; R3^out, T3^out and f2^out1 are then added to obtain f2^out2; R3^out and f2^out2 are then spliced to obtain f2^out3; f2^out3 is then input to the first blending module to obtain f2^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f3 generated in the step above and f2^out4 are spliced to obtain f2^out5, and f2^out5 is input to the second blending module to obtain the output f2^out6, where the second blending module is identical to the first blending module. f2^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f2^out7; f2^out7 is up-sampled by a bilinear interpolation operation to obtain f2^out8; and f2^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the third feature fusion AMMF module 3 (RAMMF3), f2^out9. At this point the feature maps are twice their original size: each feature map has a width of W/4 and a height of H/4.
Because the model uses skip connections, the third information fusion FM module works as follows: f_high is up-sampled by 8× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^3; the output f2^out9 of the third feature fusion AMMF module 3 is spliced with f_high^3 to obtain the output f2^out10; finally, f2^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the final output f2. The output at this point consists of 64 feature maps, each with a width of W/4 and a height of H/4.
The fourth feature fusion AMMF module 4 is constructed as follows. The RGB output of the 2nd coding module is denoted R2, and the Thermal output of the 7th coding module is denoted T2. R2 and T2 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R2^out and T2^out. R2^out and T2^out are multiplied element-wise (dot product) to obtain the output f1^out1; R2^out, T2^out and f1^out1 are then added to obtain f1^out2; R2^out and f1^out2 are then spliced to obtain f1^out3; f1^out3 is then input to the first blending module to obtain f1^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f2 generated in the step above and f1^out4 are spliced to obtain f1^out5, and f1^out5 is input to the second blending module to obtain the output f1^out6, where the second blending module is identical to the first blending module. f1^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f1^out7; f1^out7 is up-sampled by a bilinear interpolation operation to obtain f1^out8; and f1^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the fourth feature fusion AMMF module 4 (RAMMF4), f1^out9. At this point the feature maps are twice their original size: each feature map has a width of W/2 and a height of H/2.
Because the network uses skip connections, the fourth information fusion FM module works as follows: f_high is up-sampled by 16× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^4; the output f1^out9 of the fourth feature fusion AMMF module 4 is spliced with f_high^4 to obtain the output f1^out10; finally, f1^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the final output f1. The output at this point consists of 64 feature maps, each with a width of W/2 and a height of H/2.
The fifth feature fusion AMMF module 5 is constructed as follows. The RGB output of the 1st coding module is denoted R1, and the Thermal output of the 6th coding module is denoted T1. R1 and T1 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R1^out and T1^out. R1^out and T1^out are multiplied element-wise (dot product) to obtain the output f0^out1; R1^out, T1^out and f0^out1 are then added to obtain f0^out2; R1^out and f0^out2 are then spliced to obtain f0^out3; f0^out3 is then input to the first blending module to obtain f0^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f1 generated in the step above and f0^out4 are spliced to obtain f0^out5, and f0^out5 is input to the second blending module to obtain the output f0^out6, where the second blending module is identical to the first blending module. f0^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f0^out7; f0^out7 is up-sampled by a bilinear interpolation operation to obtain f0^out8; and f0^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the fifth feature fusion AMMF module 5 (RAMMF5), f0^out9. At this point the feature maps are twice their original size, and the width and height of each feature map are W and H, respectively.
Because the network uses skip connections, the fifth information fusion FM module works as follows: f_high is up-sampled by 32× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^5; the output f0^out9 of the fifth feature fusion AMMF module 5 is spliced with f_high^5 to obtain the output f0^out10; finally, f0^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f0. The output f0 is then passed through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the semantic prediction output f_final. The output at this point consists of 9 feature maps, each with a width of W and a height of H.
The MLF module 1 performs high-level information semantic supervision. The high-level semantic information f_high is up-sampled by 2× bilinear interpolation, and the result is passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels; the result is then added to the output f4 obtained above. At this point each feature map has a width of W/16, a height of H/16 and 64 channels. The sum is then up-sampled by 16× bilinear interpolation, and the result is finally passed through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the final high-level semantic output f_high; at this point each feature map has a width of W, a height of H and 9 channels.
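A minimal sketch of one such semantic supervision head in PyTorch (here the high-level MLF module 1; the mid- and low-level modules differ only in which features they combine and in the final up-sampling factor). The class name MLF is illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class MLF(nn.Module):
    """Semantic supervision head: upsample the deeper feature by 2x, project to 64 channels,
    add the shallower decoder feature, upsample to full resolution, predict 9 classes."""
    def __init__(self, ch=64, num_classes=9, final_scale=16):
        super().__init__()
        self.proj = nn.Conv2d(ch, ch, 1)            # 1x1, 64 kernels
        self.head = nn.Conv2d(ch, num_classes, 1)   # 1x1, 9 kernels
        self.final_scale = final_scale              # 16 for MLF1, 4 for MLF2, 1 for MLF3

    def forward(self, deep, shallow):
        up = F.interpolate(deep, scale_factor=2, mode="bilinear", align_corners=False)
        fused = self.proj(up) + shallow
        if self.final_scale > 1:
            fused = F.interpolate(fused, scale_factor=self.final_scale,
                                  mode="bilinear", align_corners=False)
        return self.head(fused)  # e.g. f_high / f_mid / f_low, shape (N, 9, H, W)
```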
The MLF module 2 performs mid-level information semantic supervision. The output f3 obtained above is up-sampled by 2× bilinear interpolation, and the result is passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels; the result is then added to the output f2 obtained above. At this point each feature map has a width of W/4, a height of H/4 and 64 channels. The sum is then up-sampled by 4× bilinear interpolation, and the result is finally passed through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the final output f_mid; at this point each feature map has a width of W, a height of H and 9 channels.
The MLF module 3 performs low-level information semantic supervision. The output f1 obtained above is up-sampled by 2× bilinear interpolation, and the result is passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels; the result is then added to the output f0 obtained above. At this point each feature map has a width of W, a height of H and 64 channels. This sum is finally passed through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the final output f_low; at this point each feature map has a width of W, a height of H and 9 channels.
The multi-task supervision RM module is built from three multi-task modules. First, the semantic prediction map f_final is fed into the first multi-task module, which consists, in order, of a first convolution layer with a kernel size of 3, a stride of 1 and 9 convolution kernels, a normalization layer and an activation layer; the output of the activation layer then passes through a second convolution layer with a kernel size of 1, a stride of 1 and 2 convolution kernels to obtain the foreground-background output f_bin. Next, f_bin is passed through an exp function to obtain the semantically supervised weight, and f_final is multiplied point-wise by weight. The weighted result is passed through the second multi-task module, which consists, in order, of a convolution layer with a kernel size of 3, a stride of 1 and 9 convolution kernels, a normalization layer and an activation layer; its output then passes through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the final semantic output f_sem. Then f_final and the output of the second multi-task module are spliced, and the spliced result is passed through the third multi-task module; the output of its activation layer finally passes through a second convolution layer with a kernel size of 1, a stride of 1 and 2 convolution kernels to obtain the final boundary output f_bou. At this point each output map has a width of W and a height of H.
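A sketch of this multi-task supervision head in PyTorch (the names RM and MTBlock are illustrative, and how the 2-channel foreground-background response is reduced to a single weight channel is not specified, so a per-pixel maximum is used as an assumption):

```python
import torch
import torch.nn as nn

class MTBlock(nn.Module):
    """One multi-task block: 3x3 convolution (9 kernels), BatchNorm, ReLU."""
    def __init__(self, in_ch, out_ch=9):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class RM(nn.Module):
    """Multi-task supervision: foreground-background, exp-weighted semantics and boundary outputs."""
    def __init__(self, num_classes=9):
        super().__init__()
        self.mt1 = MTBlock(num_classes)
        self.to_bin = nn.Conv2d(num_classes, 2, 1)   # foreground-background head (2 kernels)
        self.mt2 = MTBlock(num_classes)
        self.to_sem = nn.Conv2d(num_classes, num_classes, 1)
        self.mt3 = MTBlock(2 * num_classes)
        self.to_bou = nn.Conv2d(num_classes, 2, 1)   # boundary head (2 kernels)

    def forward(self, f_final):
        f_bin = self.to_bin(self.mt1(f_final))
        # exp of the foreground-background response acts as a per-pixel weight (assumption: max over channels).
        weight = torch.exp(f_bin).max(dim=1, keepdim=True).values
        mid = self.mt2(f_final * weight)
        f_sem = self.to_sem(mid)
        f_bou = self.to_bou(self.mt3(torch.cat([f_final, mid], dim=1)))
        return f_bin, f_sem, f_bou
```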
Step 1_3: input the training set into the convolutional neural network for training; the convolutional neural network outputs seven prediction map sets corresponding to each original road scene RGB image in the training set.
Each prediction map in the seven prediction map sets has the same size as the original road scene RGB image, and the seven sets are the semantic segmentation prediction map set Jpre1, the high-level semantic prediction map set Jpre2, the mid-level semantic prediction map set Jpre3, the low-level semantic prediction map set Jpre4, the foreground-background prediction map set Jpre5, the boundary prediction map set Jpre6 and the semantic prediction map set Jpre7.
The semantic segmentation prediction map set Jpre1 consists of the 9 semantic segmentation prediction maps f_final from the first output of the convolutional neural network; the high-level semantic prediction map set Jpre2 consists of the 9 high-level semantic prediction maps f_high from the second output; the mid-level semantic prediction map set Jpre3 consists of the 9 mid-level semantic prediction maps f_mid from the third output; the low-level semantic prediction map set Jpre4 consists of the 9 low-level semantic prediction maps f_low from the fourth output; the foreground-background prediction map set Jpre5 consists of the 9 foreground-background prediction maps f_bin from the fifth output; the boundary prediction map set Jpre6 consists of the 9 boundary prediction maps f_bou from the sixth output; and the semantic prediction map set Jpre7 consists of the 9 semantic prediction maps f_sem from the seventh output. The seventh output of the convolutional neural network is used as the semantic segmentation prediction map output during the testing stage.
Step 1_4: process the real semantic segmentation image corresponding to each original road scene RGB image into 9 one-hot coded images and denote the set of the 9 one-hot coded images as Jtrue; compute the loss function value between the set Jtrue and each of the seven corresponding prediction map sets, and take the sum of the seven loss function values as the final loss value. The loss function value between the set Jtrue of 9 one-hot coded images and the corresponding prediction map set Jprei is denoted Lossi(Jtrue, Jprei), i = 1, 2, 3, 4, 5, 6, 7, and each Lossi is computed with the cross-entropy loss (CrossEntropyLoss).
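A sketch of this multi-supervision loss, assuming PyTorch; the variable names outputs and target are illustrative, and target is the per-pixel class-index form of the one-hot label set Jtrue:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def multi_supervision_loss(outputs, target):
    """outputs: the seven prediction maps (f_final, f_high, f_mid, f_low, f_bin, f_bou, f_sem),
    each assumed to have shape (N, 9, H, W); target: (N, H, W) with class indices 0..8.
    The final loss value is simply the sum of the seven cross-entropy terms."""
    return sum(criterion(pred, target) for pred in outputs)

# Example with random tensors (9 classes; 480x640 resolution assumed, matching MFNet-sized inputs).
outputs = [torch.randn(2, 9, 480, 640) for _ in range(7)]
target = torch.randint(0, 9, (2, 480, 640))
loss = multi_supervision_loss(outputs, target)
```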
Step 1_5: repeat step 1_3 and step 1_4 for V iterations until the convergence of the convolutional neural network saturates, i.e. the training loss value fluctuates without decreasing further and the validation loss has essentially reached its minimum; a convolutional neural network classification training model is obtained at this point. The weight vector and the bias of the network obtained at this moment are taken as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model. In this example, V is set to 300.
The specific steps of the test stage process are as follows:
step 2: inputting a plurality of original road scene RGB images to be semantically segmented and original Thermal infrared images into a convolutional neural network classification training model, and predicting by using an optimal weight vector and an optimal bias term to obtain a corresponding semantic segmentation prediction graph.
In a specific implementation, 393 original RGB color images to be semantically segmented, together with the corresponding original Thermal infrared images, are taken as the test set. Let I_test (a placeholder symbol used here) denote a pair consisting of an original RGB color image to be semantically segmented and the corresponding original Thermal infrared image, where 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of I_test, H' denotes the height of I_test, and I_test(i', j') denotes the pixel value of the pixel at coordinate position (i', j') in I_test.
The R channel component, G channel component and B channel component of I_test, together with the corresponding original Thermal infrared image, are input into the convolutional neural network classification training model, and a prediction is made using the optimal weight vector W_best and the optimal bias term b_best, giving the semantic segmentation prediction map corresponding to I_test, denoted P_test, where P_test(i', j') denotes the pixel value of the pixel at coordinate position (i', j') in P_test.
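A sketch of this test-stage prediction in PyTorch; model, rgb and thermal are illustrative names, and the trained model is assumed to return its seven outputs with the semantic output f_sem last, as described above:

```python
import torch

@torch.no_grad()
def predict(model, rgb, thermal):
    """rgb: (1, 3, H, W) tensor, thermal: (1, 1, H, W) tensor; returns an (H, W) label map."""
    model.eval()
    outputs = model(rgb, thermal)          # seven prediction maps, each (1, 9, H, W)
    f_sem = outputs[-1]                    # the seventh output is the one used at test time
    return f_sem.argmax(dim=1).squeeze(0)  # per-pixel class index in 0..8
```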
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The convolutional neural network architecture was built with the Python-based deep learning library PyTorch. The test set of the road scene image database MFNET RGB-T Dataset (393 road scene images) is used to analyse the segmentation performance of the road scene images predicted by the method of the present invention. Four common objective parameters for evaluating a semantic segmentation method are used as evaluation indexes: the class accuracy (Acc), the mean class accuracy (mAcc), the ratio of the intersection to the union of each class's segmented image and the label image (IoU), and the mean ratio of the intersection to the union of the segmented images and the label images (MIoU).
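For reference, a small sketch of how these evaluation indexes can be computed from a confusion matrix; NumPy is assumed, and this is a generic implementation of the standard definitions rather than code taken from the patent:

```python
import numpy as np

def confusion_matrix(pred, label, num_classes=9):
    """pred, label: integer arrays of the same shape with values in 0..num_classes-1."""
    idx = label.reshape(-1) * num_classes + pred.reshape(-1)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def evaluate(conf):
    """conf: (C, C) matrix with conf[t, p] = number of pixels of true class t predicted as p."""
    tp = np.diag(conf).astype(float)
    acc = tp / np.maximum(conf.sum(axis=1), 1)                          # per-class accuracy (Acc)
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)  # per-class IoU
    return acc, acc.mean(), iou, iou.mean()                             # Acc, mAcc, IoU, MIoU
```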
Each road scene image in the MFNET RGB-T Dataset test set is predicted with the method of the present invention to obtain its corresponding predicted semantic segmentation image, and the class accuracy Acc, the mean class accuracy mAcc, the ratio IoU of the intersection to the union of each class's segmented image and the label image, and the mean ratio MIoU of the intersection to the union of the segmented images and the label images, which reflect the semantic segmentation performance of the method, are listed in Table 1. As can be seen from the data listed in Table 1, the segmentation results obtained for the road scene images by the method of the present invention are good, indicating that it is feasible and effective to obtain the predicted semantic segmentation image corresponding to a road scene image with the method of the present invention.
TABLE 1 evaluation results on test sets using the method of the invention
FIG. 7a shows the 1st original road scene image of the same scene; FIG. 7b shows the predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 7a with the method of the present invention; FIG. 8a shows the 2nd original road scene image of the same scene; FIG. 8b shows the predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 8a with the method of the present invention; FIG. 9a shows the 3rd original road scene image of the same scene; FIG. 9b shows the predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 9a with the method of the present invention; FIG. 10a shows the 4th original road scene image of the same scene; FIG. 10b shows the predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 10a with the method of the present invention. Comparing FIG. 7a with FIG. 7b, FIG. 8a with FIG. 8b, FIG. 9a with FIG. 9b, and FIG. 10a with FIG. 10b shows that the predicted semantic segmentation images obtained by the method of the present invention have high segmentation precision.

Claims (10)

1. A road scene image semantic segmentation method based on a multi-supervision network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting a plurality of original road scene RGB images, corresponding original Thermal infrared images and real semantic segmentation images, respectively performing data enhancement on each original road scene RGB image and each original Thermal infrared image by cropping, brightness adjustment and flipping, and forming a training set from the plurality of original road scene RGB images, the plurality of original Thermal infrared images and the corresponding real semantic segmentation images;
step 1_ 2: constructing a convolutional neural network;
step 1_ 3: inputting the training set into a convolutional neural network for training, wherein the convolutional neural network outputs seven prediction image sets corresponding to each original road scene RGB image in the training set;
step 1_4: processing the real semantic segmentation image corresponding to each original road scene RGB image into 9 one-hot coded images and recording the set of the 9 one-hot coded images as Jtrue; respectively calculating the loss function values between the set Jtrue of 9 one-hot coded images and the corresponding seven prediction image sets, and taking the sum of the seven loss function values as the final loss value;
step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for V times until the convergence of the convolutional neural network reaches saturation, and obtaining a convolutional neural network classification training model; taking the weight vector and the bias of the network obtained at the moment as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model;
the test stage process comprises the following specific steps:
step 2: inputting a plurality of original road scene RGB images to be semantically segmented and original Thermal infrared images into a convolutional neural network classification training model, and predicting by using an optimal weight vector and an optimal bias term to obtain a corresponding semantic segmentation prediction graph.
2. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 1, characterized in that: the convolutional neural network comprises an encoding module and a decoding module, wherein the encoding module is connected with the decoding module;
the encoding module comprises 10 encoding modules, and the decoding module comprises a semantic AHLS module, a multitask supervision RM module, 5 information fusion FM modules, 5 feature fusion AMMF modules and 3 semantic supervision MLF modules;
the first coding module is connected with the semantic AHLS module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the sixth coding module is connected with the semantic AHLS module after sequentially passing through a seventh coding module, an eighth coding module, a ninth coding module and a tenth coding module, the input of the first coding module is an initial road scene RGB image, and the input of the sixth coding module is an initial Thermal infrared image;
the fifth coding module, the semantic AHLS module and the tenth coding module are connected with the first feature fusion AMMF module, the output of the first feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the first information fusion FM module, the output of the first information fusion FM module and the output of the semantic AHLS module are simultaneously input into the first semantic supervision MLF module, and the output of the first semantic supervision MLF module is used as the second output of the convolutional neural network;
the output of the fourth coding module, the output of the first information fusion FM module and the output of the ninth coding module are input into the second feature fusion AMMF module, the output of the second feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the second information fusion FM module, the output of the third coding module and the output of the eighth coding module are simultaneously input into the third feature fusion AMMF module, the output of the third feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the third information fusion FM module, the output of the second information fusion FM module and the output of the third information fusion FM module are simultaneously input into the second semantic supervision MLF module, and the output of the second semantic supervision MLF module is used as the third output of the convolutional neural network;
the output of the second coding module, the output of the third information fusion FM module and the output of the seventh coding module are input into a fourth feature fusion AMMF module, the output of the fourth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fourth information fusion FM module, the output of the first coding module and the output of the sixth coding module are input into a fifth feature fusion AMMF module, the output of the fifth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fifth information fusion FM module, the output of the fourth information fusion FM module and the output of the fifth information fusion FM module are simultaneously input into a third semantic supervision MLF module, and the output of the third semantic supervision MLF module is used as the fourth output of the convolutional neural network; the output of the fifth information fusion FM module is used as the first output of the convolutional neural network, the output of the fifth information fusion FM module is input to the multitask supervision RM module, and the first output, the second output and the third output of the multitask supervision RM module are respectively used as the fifth output, the sixth output and the seventh output of the convolutional neural network.
3. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 1, characterized in that: the first coding module and the sixth coding module have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first batch of normalization layers and a first activation layer;
the second coding module and the seventh coding module have the same structure and are mainly formed by sequentially connecting a first lower sampling layer, a first residual error unit and two second residual error units;
the third coding module and the eighth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and seven second residual error units;
the fourth coding module and the ninth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and 35 second residual error units;
the fifth coding module and the tenth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and two second residual error units.
4. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 3, characterized in that: the first residual error unit comprises a second convolution layer, a second normalization layer, a third convolution layer, a third normalization layer, a fourth convolution layer, a fourth normalization layer, a second active layer, a fifth convolution layer, a fifth normalization layer and a third active layer;
the second convolution layer is connected with the second active layer after sequentially passing through the second normalization layer, the third convolution layer, the third normalization layer, the fourth convolution layer and the fourth normalization layer, the input of the first residual error unit is the input of the second convolution layer, the input of the second convolution layer is also input into the fifth convolution layer, the fifth convolution layer is connected with the fifth normalization layer, the output of the second active layer and the output of the fifth normalization layer are added, the output is input into the third active layer, and the output of the third active layer is used as the output of the first residual error unit;
the second residual error unit comprises a sixth convolution layer, a sixth normalization layer, a seventh convolution layer, a seventh normalization layer, an eighth convolution layer, an eighth normalization layer, a fourth active layer and a fifth active layer;
the sixth convolution layer is connected with the fourth active layer after sequentially passing through the sixth normalization layer, the seventh convolution layer, the seventh normalization layer, the eighth convolution layer and the eighth normalization layer, the input of the second residual error unit is the input of the sixth convolution layer, the output of the fourth active layer added with the input of the sixth convolution layer is input into the fifth active layer, and the output of the fifth active layer is used as the output of the second residual error unit.
5. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, characterized in that: the 5 information fusion FM modules have the same structure, and specifically comprise:
comprises a fourth up-sampling layer, a twenty-second convolution layer and a twenty-third convolution layer; each information fusion FM module has two inputs and one output, the second input of the information fusion FM module is input into the fourth up-sampling layer, the fourth up-sampling layer is connected with the twenty-second convolution layer, the output of the twenty-second convolution layer and the first input of the information fusion FM module, after being cascaded, are input into the twenty-third convolution layer, and the output of the twenty-third convolution layer is used as the output of the information fusion FM module.
6. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, characterized in that: the semantic AHLS module comprises two first convolution modules, two first attention mechanism modules, a second convolution module and a second attention mechanism module;
the semantic AHLS module is provided with two inputs, a first input of the semantic AHLS module is connected with one first attention mechanism module after passing through one first convolution module, a second input of the semantic AHLS module is connected with the other first attention mechanism module after passing through the other first convolution module, the output of the two first attention mechanism modules after being cascaded is input into the second convolution module, the second convolution module is connected with the second attention mechanism module, and the output of the second attention mechanism module is used as the output of the semantic AHLS module;
the first convolution module mainly comprises a ninth convolution layer; the first attention mechanism module is mainly formed by sequentially connecting a first pooling layer, a first full-connection layer, a sixth activation layer, a second full-connection layer, a seventh activation layer, a third full-connection layer, an eighth activation layer and a tenth convolution layer;
the second convolution module is mainly formed by sequentially connecting an eleventh convolution layer, a ninth normalization layer and a ninth active layer; the second attention mechanism module is identical in structure to the first attention mechanism module.
7. The road scene image semantic segmentation method based on the multi-supervision network according to claim 2, wherein the 5 feature fusion AMMF modules have the same structure, specifically:
the device comprises a first blending module, a second blending module, a twelfth convolution layer, a first up-sampling layer and a thirteenth convolution layer;
the first input of the feature fusion AMMF module is connected with the first coding module, the second coding module, the third coding module, the fourth coding module or the fifth coding module, the second input of the feature fusion AMMF module is connected with the sixth coding module, the seventh coding module, the eighth coding module, the ninth coding module or the tenth coding module, and the third input of the feature fusion AMMF module is connected with the semantic AHLS module, the first fusion feature output, the second fusion feature output, the third fusion feature output or the fourth fusion feature output;
the first input and the second input of the feature fusion AMMF module are multiplied, and the output of the multiplication is used as a first fused output; the first fused output, the first input and the second input are added, and the output of the addition is used as a second fused output; the output of the second fused output after cascade connection is input into the first blending module; the output of the first blending module, after being cascaded with the third input of the feature fusion AMMF module, is input into the second blending module; the second blending module is connected with the thirteenth convolution layer after sequentially passing through the twelfth convolution layer and the first up-sampling layer, and the output of the thirteenth convolution layer is used as the output of the feature fusion AMMF module;
the first blending module and the second blending module have the same structure and each comprise a fourteenth convolution layer, a tenth normalization layer, a tenth activation layer, a fifteenth convolution layer, an eleventh normalization layer and an eleventh activation layer;
the fourteenth convolution layer is connected with the eleventh normalization layer after sequentially passing through the tenth normalization layer, the tenth activation layer and the fifteenth convolution layer; the input of the blending module is the input of the fourteenth convolution layer; the output obtained by adding the input of the fourteenth convolution layer and the output of the eleventh normalization layer is input into the eleventh activation layer, and the output of the eleventh activation layer is used as the output of the blending module.
8. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, wherein the 3 semantic supervision MLF modules have the same structure, specifically:
the sampling device comprises a second up-sampling layer, a sixteenth convolution layer, a third up-sampling layer and a seventeenth convolution layer; the first input of the semantic supervision MLF module is connected with the semantic AHLS module, the second fusion characteristic output or the fourth fusion characteristic output, and the second input of the semantic supervision MLF module is connected with the first fusion characteristic output, the third fusion characteristic output or the fifth fusion characteristic output;
the first input of the semantic supervision MLF module is connected with the sixteenth convolution layer after passing through the second up-sampling layer; the output of the sixteenth convolution layer and the second input of the semantic supervision MLF module, after being cascaded, are connected with the seventeenth convolution layer after passing through the third up-sampling layer, and the output of the seventeenth convolution layer is used as the output of the semantic supervision MLF module.
9. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, wherein the multi-task supervision RM module comprises eighteenth convolutional layer, nineteenth convolutional layer, twentieth convolutional layer, exp function layer, first multi-task module, second multi-task module and third multi-task module;
the input of the multitask supervision RM module is the input of an eighteenth convolutional layer, the output of the eighteenth convolutional layer after passing through the first multitask module is used as the first output of the multitask supervision RM module, the output of the first output of the multitask supervision RM module after being multiplied by the output of the exp function layer and the output of the eighteenth convolutional layer is input into a second multitask module, the output of the second multitask module after passing through a nineteenth convolutional layer is used as the third output of the multitask supervision RM module, the output of the second multitask module after being cascaded with the output of the eighteenth convolutional layer is connected with a twentieth convolutional layer after passing through a third multitask module, and the output of the twentieth convolutional layer is used as the second output of the multitask supervision RM module;
the first multi-task module, the second multi-task module and the third multi-task module are identical in structure, each mainly formed by sequentially connecting a twenty-first convolution layer, a twelfth normalization layer and a twelfth activation layer.
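For illustration only (not part of the claims), a minimal PyTorch sketch of the multi-task supervision RM module is given below, following the connection order recited above. Channel widths and per-output channel counts are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskBlock(nn.Module):
    """Sketch of one multi-task sub-module: convolution -> normalization -> activation."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class RMModule(nn.Module):
    """Sketch of the multi-task supervision RM module; all channel sizes are
    illustrative assumptions."""
    def __init__(self, in_channels=64, mid_channels=64,
                 out1_channels=1, out2_channels=9, out3_channels=1):
        super().__init__()
        self.conv18 = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.mt1 = MultiTaskBlock(mid_channels, out1_channels)
        self.mt2 = MultiTaskBlock(mid_channels, mid_channels)
        self.mt3 = MultiTaskBlock(mid_channels * 2, mid_channels)
        self.conv19 = nn.Conv2d(mid_channels, out3_channels, kernel_size=3, padding=1)
        self.conv20 = nn.Conv2d(mid_channels, out2_channels, kernel_size=3, padding=1)

    def forward(self, x):
        feat = self.conv18(x)                       # eighteenth convolutional layer
        out1 = self.mt1(feat)                       # first output of the RM module
        gated = torch.exp(out1) * feat              # exp function layer, then element-wise product
        mt2_out = self.mt2(gated)                   # second multi-task module
        out3 = self.conv19(mt2_out)                 # third output (nineteenth convolutional layer)
        fused = torch.cat([mt2_out, feat], dim=1)   # concatenation with the conv18 feature
        out2 = self.conv20(self.mt3(fused))         # second output (twentieth convolutional layer)
        return out1, out2, out3
```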
10. The method as claimed in claim 1, wherein the size of each prediction map in the seven prediction map sets is the same as the size of the original RGB image of the road scene, and the seven prediction map sets comprise a semantic segmentation prediction map set J_pre1, a high-level semantic prediction map set J_pre2, a middle-level semantic prediction map set J_pre3, a low-level semantic prediction map set J_pre4, a foreground-background prediction map set J_pre5, a boundary prediction map set J_pre6 and a semantic prediction map set J_pre7;
The semantic segmentation prediction map set J_pre1 consists of the 9 semantic segmentation prediction maps f_final from the first output of the convolutional neural network; the high-level semantic prediction map set J_pre2 consists of the 9 high-level semantic prediction maps f_high from the second output; the middle-level semantic prediction map set J_pre3 consists of the 9 middle-level semantic prediction maps f_mid from the third output; the low-level semantic prediction map set J_pre4 consists of the 9 low-level semantic prediction maps f_low from the fourth output; the foreground-background prediction map set J_pre5 consists of the 9 foreground-background prediction maps f_bin from the fifth output; the boundary prediction map set J_pre6 consists of the 9 boundary prediction maps f_bou from the sixth output; and the semantic prediction map set J_pre7 consists of the 9 semantic prediction maps f_sem from the seventh output; during the test phase the seventh output of the convolutional neural network is taken as the output, from which the semantic segmentation prediction map is extracted.
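For illustration only (not part of the claims), the sketch below shows one possible way of collecting the seven prediction map sets at the original image resolution and extracting the test-phase segmentation from the seventh output; that the model returns its seven outputs as a tuple is an assumption.

```python
import torch
import torch.nn.functional as F

def collect_prediction_maps(model, image_batch):
    """Sketch only: resize each of the network's seven outputs to the input
    resolution and take the class-wise argmax of the seventh output as the
    test-phase semantic segmentation prediction map."""
    outputs = model(image_batch)                         # assumed: a tuple of seven tensors
    h, w = image_batch.shape[-2:]
    prediction_sets = [
        F.interpolate(out, size=(h, w), mode='bilinear', align_corners=False)
        for out in outputs
    ]
    segmentation = prediction_sets[6].argmax(dim=1)      # seventh output -> segmentation map
    return prediction_sets, segmentation
```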
CN202110823118.4A 2021-07-21 2021-07-21 Road scene image semantic segmentation method based on multi-supervision network Pending CN113362349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110823118.4A CN113362349A (en) 2021-07-21 2021-07-21 Road scene image semantic segmentation method based on multi-supervision network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110823118.4A CN113362349A (en) 2021-07-21 2021-07-21 Road scene image semantic segmentation method based on multi-supervision network

Publications (1)

Publication Number Publication Date
CN113362349A true CN113362349A (en) 2021-09-07

Family

ID=77540049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110823118.4A Pending CN113362349A (en) 2021-07-21 2021-07-21 Road scene image semantic segmentation method based on multi-supervision network

Country Status (1)

Country Link
CN (1) CN113362349A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190164290A1 (en) * 2016-08-25 2019-05-30 Intel Corporation Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation
CN112991350A (en) * 2021-02-18 2021-06-18 西安电子科技大学 RGB-T image semantic segmentation method based on modal difference reduction
CN112991351A (en) * 2021-02-23 2021-06-18 新华三大数据技术有限公司 Remote sensing image semantic segmentation method and device and storage medium
CN112991364A (en) * 2021-03-23 2021-06-18 浙江科技学院 Road scene semantic segmentation method based on convolution neural network cross-modal fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王子羽; 张颖敏; 陈永彬; 王桂棠: "Optimization of an indoor scene semantic segmentation network based on RGB-D images", Automation & Information Engineering, no. 02, 15 April 2020 (2020-04-15) *
青晨; 禹晶; 肖创柏; 段娟: "Research progress of deep convolutional neural networks for image semantic segmentation", Journal of Image and Graphics, no. 06, 16 June 2020 (2020-06-16) *

Similar Documents

Publication Publication Date Title
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN111563507B (en) Indoor scene semantic segmentation method based on convolutional neural network
CN110490205B (en) Road scene semantic segmentation method based on full-residual-error hole convolutional neural network
CN111612807A (en) Small target image segmentation method based on scale and edge information
CN109635662B (en) Road scene semantic segmentation method based on convolutional neural network
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN114241274B (en) Small target detection method based on super-resolution multi-scale feature fusion
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
US20220156528A1 (en) Distance-based boundary aware semantic segmentation
CN115035361A (en) Target detection method and system based on attention mechanism and feature cross fusion
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN115830471B (en) Multi-scale feature fusion and alignment domain self-adaptive cloud detection method
CN114529581A (en) Multi-target tracking method based on deep learning and multi-task joint training
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN113034506A (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN109446933B (en) Road scene semantic segmentation method based on convolutional neural network
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN113362349A (en) Road scene image semantic segmentation method based on multi-supervision network
CN116310850A (en) Remote sensing image target detection method based on improved RetinaNet
CN116051532A (en) Deep learning-based industrial part defect detection method and system and electronic equipment
CN113781504A (en) Road scene semantic segmentation method based on boundary guidance
CN111047571B (en) Image salient target detection method with self-adaptive selection training process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination