CN113362349A - Road scene image semantic segmentation method based on multi-supervision network - Google Patents
- Publication number
- CN113362349A CN113362349A CN202110823118.4A CN202110823118A CN113362349A CN 113362349 A CN113362349 A CN 113362349A CN 202110823118 A CN202110823118 A CN 202110823118A CN 113362349 A CN113362349 A CN 113362349A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/11 — Image analysis; segmentation: region-based segmentation
- G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045 — Neural networks: combinations of networks
- G06N3/08 — Neural networks: learning methods
- G06T2207/20081 — Indexing scheme for image analysis: training; learning
- G06T2207/20084 — Indexing scheme for image analysis: artificial neural networks [ANN]
Abstract
The invention discloses a road scene image semantic segmentation method based on a multi-supervision network. The method comprises a training stage and a testing stage. In the training stage, a number of original road scene RGB images, the corresponding original Thermal infrared images, and the ground-truth semantic segmentation images are selected and preprocessed to form a training set; a convolutional neural network is constructed; the training set is fed into the convolutional neural network, which outputs seven sets of prediction maps; a final loss value is computed; these steps are repeated until a convolutional neural network classification training model is obtained. In the testing stage, road scene RGB images to be semantically segmented and the corresponding original Thermal infrared images are input into the trained model to obtain the corresponding semantic segmentation prediction maps. The method improves both the efficiency and the accuracy of RGB-T road scene image semantic segmentation.
Description
Technical Field
The invention relates to a road scene semantic segmentation method based on deep learning, in particular to a road scene image semantic segmentation method based on a multi-supervision network.
Background
With the rise of technologies such as autonomous driving, scene understanding, and virtual reality, semantic segmentation of images has gradually become a research hotspot in computer vision and machine learning; with the help of semantic segmentation, visual navigation can build on traffic scene understanding and multi-target obstacle detection. Common traditional semantic segmentation methods include support vector machines, random forests, and similar algorithms. These algorithms focus primarily on binary tasks for detecting and identifying specific objects, such as road surfaces, vehicles, and pedestrians, and are typically realized through hand-crafted, high-complexity features. Deep learning makes semantic segmentation of traffic scenes far more straightforward and, more importantly, greatly improves the accuracy of pixel-level image classification.
Deep-learning semantic segmentation methods perform end-to-end segmentation directly at the pixel level: the images in the training set are input into a model framework for training to obtain weights, after which predictions can be made on the test set. The power of the convolutional neural network lies in its multi-layer structure, which can automatically learn features at multiple levels of abstraction. Current deep-learning segmentation methods fall into two families. The first is the encoding-decoding architecture: during encoding, pooling layers gradually discard positional information while extracting abstract features; during decoding, the positional information is gradually recovered, usually with direct skip connections between encoder and decoder. The second family is based on dilated (atrous) convolution, which expands the receptive field: a smaller dilation rate yields a smaller receptive field and learns specific local features, while a larger dilation rate yields a larger receptive field and learns more abstract features that are more robust to the size, position, and orientation of objects.
Most existing road scene semantic segmentation methods adopt deep learning, with models built largely from stacked convolutional and pooling layers. However, feature maps obtained simply from pooling and convolution operations are uniform and not sufficiently representative, so the extracted image feature information is reduced, the recovered detail is coarse, and the segmentation accuracy is low.
Disclosure of Invention
The invention aims to solve the technical problem of providing a road scene image semantic segmentation method based on a multi-supervision network, which has high segmentation efficiency and high segmentation accuracy.
The technical scheme adopted by the invention to solve the above technical problem is a road scene image semantic segmentation method based on a multi-supervision network, comprising a training stage and a testing stage. The specific steps of the training stage are as follows:
step 1_ 1: selecting a number of original road scene RGB images, the corresponding original Thermal infrared images, and the ground-truth semantic segmentation images; applying data enhancement (cropping, brightness adjustment, and flipping) to each original road scene RGB image and each original Thermal infrared image to obtain the initial road scene RGB images and initial Thermal infrared images; the training set is formed by these RGB images, Thermal infrared images, and the corresponding ground-truth semantic segmentation images;
step 1_ 2: constructing a convolutional neural network;
step 1_ 3: inputting the training set into a convolutional neural network for training, wherein the convolutional neural network outputs seven prediction image sets corresponding to each original road scene RGB image in the training set;
step 1_ 4: processing the ground-truth semantic segmentation image corresponding to each original road scene RGB image into 9 one-hot encoded images, and recording the set of the 9 one-hot encoded images as Jtrue; respectively calculating the loss function value between the set Jtrue and each of the seven corresponding prediction map sets, and taking the sum of the seven loss function values as the final loss value;
step 1_ 5: repeating step 1_3 and step 1_4 V times until the convergence of the convolutional neural network saturates, obtaining the convolutional neural network classification training model; the weight vectors and biases of the network at that point are taken as the optimal weight vectors and optimal bias terms of the convolutional neural network classification training model;
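The loss computation of step 1_4 can be sketched as follows. This is an illustrative Python/PyTorch sketch, not the patent's implementation: the use of cross-entropy for all seven outputs and the helper names are assumptions; only the 9-class one-hot encoding and the summing of seven loss values come from the text.

```python
# Sketch of step 1_4: one-hot encode the ground truth into 9 class maps and
# sum seven per-output losses. Cross-entropy for every output is an assumption;
# the patent may use different losses for the boundary/foreground outputs.
import torch
import torch.nn.functional as F

def one_hot_maps(label: torch.Tensor, num_classes: int = 9) -> torch.Tensor:
    """Turn an (N, H, W) integer label map into (N, C, H, W) one-hot maps."""
    return F.one_hot(label, num_classes).permute(0, 3, 1, 2).float()

def total_loss(outputs, label):
    """Sum the loss values over the seven prediction map sets."""
    return sum(F.cross_entropy(o, label) for o in outputs)

label = torch.randint(0, 9, (2, 480, 640))            # 480x640 images, 9 classes
outputs = [torch.randn(2, 9, 480, 640) for _ in range(7)]  # seven network outputs
loss = total_loss(outputs, label)                     # final loss value
```
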
the test stage process comprises the following specific steps:
step 2: inputting the road scene RGB images to be semantically segmented and the corresponding original Thermal infrared images into the convolutional neural network classification training model, and predicting with the optimal weight vectors and optimal bias terms to obtain the corresponding semantic segmentation prediction maps.
The convolutional neural network comprises an encoding module and a decoding module, wherein the encoding module is connected with the decoding module;
the encoding module comprises 10 encoding modules, and the decoding module comprises a semantic AHLS module, a multitask supervision RM module, 5 information fusion FM modules, 5 feature fusion AMMF modules and 3 semantic supervision MLF modules;
the first coding module is connected with the semantic AHLS module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the sixth coding module is connected with the semantic AHLS module after sequentially passing through a seventh coding module, an eighth coding module, a ninth coding module and a tenth coding module, the input of the first coding module is an initial road scene RGB image, and the input of the sixth coding module is an initial Thermal infrared image;
the fifth coding module, the semantic AHLS module and the tenth coding module are connected with the first feature fusion AMMF module, the output of the first feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the first information fusion FM module, the output of the first information fusion FM module and the output of the semantic AHLS module are simultaneously input into the first semantic supervision MLF module, and the output of the first semantic supervision MLF module is used as the second output of the convolutional neural network;
the output of the fourth coding module, the output of the first information fusion FM module and the output of the ninth coding module are input into the second feature fusion AMMF module, the output of the second feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the second information fusion FM module, the output of the third coding module and the output of the eighth coding module are simultaneously input into the third feature fusion AMMF module, the output of the third feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the third information fusion FM module, the output of the second information fusion FM module and the output of the third information fusion FM module are simultaneously input into the second semantic supervision MLF module, and the output of the second semantic supervision MLF module is used as the third output of the convolutional neural network;
the output of the second coding module, the output of the third information fusion FM module and the output of the seventh coding module are input into a fourth feature fusion AMMF module, the output of the fourth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fourth information fusion FM module, the output of the first coding module and the output of the sixth coding module are input into a fifth feature fusion AMMF module, the output of the fifth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fifth information fusion FM module, the output of the fourth information fusion FM module and the output of the fifth information fusion FM module are simultaneously input into a third semantic supervision MLF module, and the output of the third semantic supervision MLF module is used as the fourth output of the convolutional neural network; the output of the fifth information fusion FM module is used as the first output of the convolutional neural network, the output of the fifth information fusion FM module is input to the multitask supervision RM module, and the first output, the second output and the third output of the multitask supervision RM module are respectively used as the fifth output, the sixth output and the seventh output of the convolutional neural network.
The first coding module and the sixth coding module have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first batch of normalization layers and a first activation layer;
the second coding module and the seventh coding module have the same structure and are mainly formed by sequentially connecting a first downsampling layer, a first residual unit, and two second residual units;
the third coding module and the eighth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and seven second residual error units;
the fourth coding module and the ninth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and 35 second residual error units;
the fifth coding module and the tenth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and two second residual error units.
The first residual error unit comprises a second convolution layer, a second normalization layer, a third convolution layer, a third normalization layer, a fourth convolution layer, a fourth normalization layer, a second active layer, a fifth convolution layer, a fifth normalization layer and a third active layer;
the second convolution layer is connected with the second active layer after sequentially passing through the second normalization layer, the third convolution layer, the third normalization layer, the fourth convolution layer and the fourth normalization layer, the input of the first residual error unit is the input of the second convolution layer, the input of the second convolution layer is also input into the fifth convolution layer, the fifth convolution layer is connected with the fifth normalization layer, the output of the second active layer and the output of the fifth normalization layer are added, the output is input into the third active layer, and the output of the third active layer is used as the output of the first residual error unit;
the second residual error unit comprises a sixth convolution layer, a sixth normalization layer, a seventh convolution layer, a seventh normalization layer, an eighth convolution layer, an eighth normalization layer, a fourth active layer and a fifth active layer;
the sixth convolution layer is connected with the fourth activation layer after sequentially passing through the sixth normalization layer, the seventh convolution layer, the seventh normalization layer, the eighth convolution layer, and the eighth normalization layer; the input of the second residual unit is the input of the sixth convolution layer; the output of the fourth activation layer is added to the input of the sixth convolution layer, the sum is input into the fifth activation layer, and the output of the fifth activation layer is used as the output of the second residual unit.
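A minimal PyTorch sketch of the two residual units, under assumed channel counts, kernel sizes (1×1–3×3–1×1), and stride 1, none of which this passage specifies:

```python
# Hedged sketch: layer ordering follows the text; all hyperparameters assumed.
import torch
import torch.nn as nn

class FirstResidualUnit(nn.Module):
    """Projection residual unit: conv/BN main path plus a conv/BN shortcut."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch),             # conv2 + BN2
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), # conv3 + BN3
            nn.Conv2d(out_ch, out_ch, 1), nn.BatchNorm2d(out_ch),            # conv4 + BN4
            nn.ReLU(),                                                       # activation 2
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch))             # conv5 + BN5
        self.act = nn.ReLU()                                                 # activation 3

    def forward(self, x):
        return self.act(self.main(x) + self.shortcut(x))

class SecondResidualUnit(nn.Module):
    """Identity residual unit: conv/BN main path added to the raw input."""
    def __init__(self, ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch),                # conv6 + BN6
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),     # conv7 + BN7
            nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch),                # conv8 + BN8
            nn.ReLU(),                                               # activation 4
        )
        self.act = nn.ReLU()                                         # activation 5

    def forward(self, x):
        return self.act(self.main(x) + x)

x = torch.randn(1, 64, 32, 32)
y = SecondResidualUnit(64)(FirstResidualUnit(64, 64)(x))
```
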
The 5 information fusion FM modules have the same structure, specifically:
each comprises a fourth upsampling layer, a twenty-second convolution layer, and a twenty-third convolution layer. Each information fusion FM module has two inputs and one output: the second input of the information fusion FM module is fed into the fourth upsampling layer, the fourth upsampling layer is connected with the twenty-second convolution layer, the output of the twenty-second convolution layer is concatenated with the first input of the information fusion FM module and fed into the twenty-third convolution layer, and the output of the twenty-third convolution layer is used as the output of the information fusion FM module.
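A sketch of one information fusion FM module. Since the original wording is ambiguous, it assumes the twenty-third convolution layer receives the concatenation of the module's first input with the upsampled-and-convolved second input; channel counts and the upsampling factor are also assumptions:

```python
# Hedged sketch of the FM module: fuse high-level semantics with low-level features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FMModule(nn.Module):
    """Upsample the semantic branch (second input), convolve it, concatenate
    with the feature branch (first input), and fuse with a final convolution."""
    def __init__(self, low_ch, high_ch, scale):
        super().__init__()
        self.scale = scale
        self.conv22 = nn.Conv2d(high_ch, low_ch, 3, padding=1)     # twenty-second conv
        self.conv23 = nn.Conv2d(low_ch * 2, low_ch, 3, padding=1)  # twenty-third conv

    def forward(self, low, high):
        high = F.interpolate(high, scale_factor=self.scale,
                             mode='bilinear', align_corners=False)  # fourth upsampling
        high = self.conv22(high)
        return self.conv23(torch.cat([low, high], dim=1))

fm = FMModule(low_ch=64, high_ch=256, scale=2)
out = fm(torch.randn(1, 64, 32, 32), torch.randn(1, 256, 16, 16))
```
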
The semantic AHLS module comprises two first convolution modules, two first attention mechanism modules, a second convolution module and a second attention mechanism module;
the semantic AHLS module is provided with two inputs, a first input of the semantic AHLS module is connected with one first attention mechanism module after passing through one first convolution module, a second input of the semantic AHLS module is connected with the other first attention mechanism module after passing through the other first convolution module, the output of the two first attention mechanism modules after being cascaded is input into the second convolution module, the second convolution module is connected with the second attention mechanism module, and the output of the second attention mechanism module is used as the output of the semantic AHLS module;
the first convolution module mainly comprises a ninth convolution layer; the first attention mechanism module is mainly formed by sequentially connecting a first pooling layer, a first full-connection layer, a sixth activation layer, a second full-connection layer, a seventh activation layer, a third full-connection layer, an eighth activation layer and a tenth convolution layer;
the second convolution module is mainly formed by sequentially connecting an eleventh convolution layer, a ninth normalization layer and a ninth active layer; the second attention mechanism module is identical in structure to the first attention mechanism module.
The 5 feature fusion AMMF modules have the same structure, and specifically comprise:
the device comprises a first blending module, a second blending module, a twelfth convolution layer, a first up-sampling layer and a thirteenth convolution layer;
the first input of the feature fusion AMMF module is connected with the first coding module, the second coding module, the third coding module, the fourth coding module or the fifth coding module, the second input of the feature fusion AMMF module is connected with the sixth coding module, the seventh coding module, the eighth coding module, the ninth coding module or the tenth coding module, and the third input of the feature fusion AMMF module is connected with the semantic AHLS module, the first fusion feature output, the second fusion feature output, the third fusion feature output or the fourth fusion feature output;
the first input and the second input of the feature fusion AMMF module are multiplied, and the result is used as a first fusion output; the first fusion output is multiplied with the third input of the feature fusion AMMF module, and the result is used as a second fusion output; the first fusion output and the second fusion output are concatenated and input into the first blending module; the output of the first blending module is input into the second blending module; the second blending module is connected with the thirteenth convolution layer after sequentially passing through the twelfth convolution layer and the first upsampling layer, and the output of the thirteenth convolution layer is used as the output of the feature fusion AMMF module;
the first blending module and the second blending module have the same structure and comprise a fourteenth convolution layer, a tenth normalization layer, a tenth activation layer, a fifteenth convolution layer, and an eleventh normalization layer;
the fourteenth convolution layer is connected with the eleventh normalization layer after sequentially passing through the tenth normalization layer, the tenth activation layer, and the fifteenth convolution layer; the input of the blending module is the input of the fourteenth convolution layer; the input of the fourteenth convolution layer is added to the output of the eleventh normalization layer, the sum is input into the tenth activation layer, and the output of the tenth activation layer is used as the output of the blending module.
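The blending module reads as a standard residual conv-BN block with an identity skip; a sketch (3×3 kernels assumed, and a separate output activation used in place of the text's reused tenth activation layer):

```python
# Hedged sketch of the blending module used inside each AMMF module.
import torch
import torch.nn as nn

class BlendingModule(nn.Module):
    """conv14 + BN10 + act + conv15 + BN11, added to the input, then activated."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + x)

blend = BlendingModule(32)
out = blend(torch.randn(1, 32, 16, 16))
```
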
The 3 semantic supervision MLF modules have the same structure, and specifically comprise:
the sampling device comprises a second up-sampling layer, a sixteenth convolution layer, a third up-sampling layer and a seventeenth convolution layer; the first input of the semantic supervision MLF module is connected with the semantic AHLS module, the second fusion characteristic output or the fourth fusion characteristic output, and the second input of the semantic supervision MLF module is connected with the first fusion characteristic output, the third fusion characteristic output or the fifth fusion characteristic output;
the first input of the semantic supervision MLF module is connected with the sixteenth convolution layer after passing through the second upsampling layer; the output of the sixteenth convolution layer is concatenated with the second input of the semantic supervision MLF module, the result is connected with the seventeenth convolution layer after passing through the third upsampling layer, and the output of the seventeenth convolution layer is used as the output of the semantic supervision MLF module.
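A sketch of one semantic supervision MLF module, assuming (since the text's wording is ambiguous) that the concatenation combines the convolved first input with the module's second input; channel counts and upsampling factors are assumptions:

```python
# Hedged sketch of the MLF module: fuse a deeper prediction with a shallower one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLFModule(nn.Module):
    def __init__(self, ch1, ch2, out_ch):
        super().__init__()
        self.conv16 = nn.Conv2d(ch1, ch2, 3, padding=1)       # sixteenth conv
        self.conv17 = nn.Conv2d(ch2 * 2, out_ch, 3, padding=1)  # seventeenth conv

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[2:],
                             mode='bilinear', align_corners=False)  # 2nd upsampling
        x = torch.cat([self.conv16(deep), shallow], dim=1)
        x = F.interpolate(x, scale_factor=2,
                          mode='bilinear', align_corners=False)     # 3rd upsampling
        return self.conv17(x)

mlf = MLFModule(ch1=128, ch2=64, out_ch=9)
out = mlf(torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32))
```
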
The multitask supervision RM module comprises an eighteenth convolutional layer, a nineteenth convolutional layer, a twentieth convolutional layer, an exp function layer, a first multitask module, a second multitask module and a third multitask module;
the input of the multitask supervision RM module is the input of the eighteenth convolution layer; the output of the eighteenth convolution layer after passing through the first multitask module is used as the first output of the multitask supervision RM module; the first output of the multitask supervision RM module is passed through the exp function layer, multiplied with the output of the eighteenth convolution layer, and the result is input into the second multitask module; the output of the second multitask module after passing through the nineteenth convolution layer is used as the third output of the multitask supervision RM module; the output of the second multitask module is concatenated with the output of the eighteenth convolution layer, the result is connected with the twentieth convolution layer after passing through the third multitask module, and the output of the twentieth convolution layer is used as the second output of the multitask supervision RM module;
the first multitask module, the second multitask module and the third multitask module are identical in structure and mainly formed by sequentially connecting a twenty-first convolution layer, a twelfth normalization layer and a twelfth active layer.
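A hedged sketch of the multitask supervision RM module. The exact wiring of the exp function layer is ambiguous in the text; here exp of the first (foreground-background) output, averaged over channels, gates the shared features, which is an assumption:

```python
# Sketch of the RM module with its three supervised heads:
# first output = foreground-background, second = semantic, third = boundary.
import torch
import torch.nn as nn

class MultitaskBlock(nn.Module):
    """conv21 + BN12 + activation 12, the structure shared by all three modules."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU())

    def forward(self, x):
        return self.body(x)

class RMModule(nn.Module):
    def __init__(self, in_ch, ch, num_classes=9):
        super().__init__()
        self.conv18 = nn.Conv2d(in_ch, ch, 3, padding=1)   # eighteenth conv
        self.mt1 = MultitaskBlock(ch, num_classes)         # -> first output (fg/bg)
        self.mt2 = MultitaskBlock(ch, ch)
        self.conv19 = nn.Conv2d(ch, num_classes, 1)        # -> third output (boundary)
        self.mt3 = MultitaskBlock(ch + ch, ch)
        self.conv20 = nn.Conv2d(ch, num_classes, 1)        # -> second output (semantic)

    def forward(self, x):
        feat = self.conv18(x)
        fg = self.mt1(feat)
        gate = torch.exp(fg).mean(dim=1, keepdim=True)     # exp function layer (assumed wiring)
        m2 = self.mt2(feat * gate)
        boundary = self.conv19(m2)
        sem = self.conv20(self.mt3(torch.cat([m2, feat], dim=1)))
        return fg, sem, boundary

rm = RMModule(in_ch=64, ch=32)
fg, sem, bou = rm(torch.randn(1, 64, 32, 32))
```
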
The size of each prediction map in the seven prediction map sets is the same as that of the initial road scene RGB image. The seven sets are a semantic segmentation prediction map set Jpre1, a high-level semantic prediction map set Jpre2, a middle-level semantic prediction map set Jpre3, a low-level semantic prediction map set Jpre4, a foreground-background prediction map set Jpre5, a boundary prediction map set Jpre6, and a semantic prediction map set Jpre7;
the semantic segmentation prediction map set Jpre1 consists of the 9 semantic segmentation prediction maps f_final produced by the first output of the convolutional neural network; the high-level semantic prediction map set Jpre2 consists of the 9 high-level semantic prediction maps f_high of the second output; the middle-level semantic prediction map set Jpre3 consists of the 9 middle-level semantic prediction maps f_mid of the third output; the low-level semantic prediction map set Jpre4 consists of the 9 low-level semantic prediction maps f_low of the fourth output; the foreground-background prediction map set Jpre5 consists of the 9 foreground-background prediction maps f_bin of the fifth output; the boundary prediction map set Jpre6 consists of the 9 boundary prediction maps f_bou of the sixth output; and the semantic prediction map set Jpre7 consists of the 9 semantic prediction maps f_sem of the seventh output. The seventh output of the convolutional neural network is used as the semantic segmentation prediction map output in the testing stage.
Compared with the prior art, the invention has the advantages that:
1) The method constructs a convolutional neural network, inputs the road scene RGBT images of the training set into the network for training, and obtains a convolutional neural network classification training model; a road scene RGBT image to be semantically segmented is then input into the trained model, and the predicted semantic segmentation image corresponding to the road scene image is obtained.
2) The method adopts multi-task supervision, and performs semantic supervision, boundary supervision and foreground-background supervision on the output segmentation image respectively, thereby effectively improving the semantic segmentation precision.
3) The method divides the encoder network into high, middle, and low parts and applies semantic supervision to the prediction maps output at each of the three levels, thereby obtaining good segmentation results on both the training set and the test set.
4) The method of the invention fully utilizes high-level semantics, combines the high-level semantics with low-level information, and fully utilizes information of each layer of the network, so that the segmentation result is more accurate.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a block diagram of an implementation of an advanced semantic AHLS module.
Fig. 3 is a block diagram of an implementation of a decoding stage feature fusion AMMF module.
Fig. 4 is a block diagram of an implementation of the high level semantic and low level information fusion FM module at the decoding stage.
Fig. 5 is a block diagram of an implementation of a semantic supervision MLF module of a high, medium, and low layers.
FIG. 6 is a block diagram of an implementation of a multitasking supervision RM module.
FIG. 7a is a first original road scene RGB image;
FIG. 7b is a segmented image obtained by segmenting the first RGB image of the original road scene shown in FIG. 7a according to the present invention;
FIG. 8a is a second original road scene RGB image;
FIG. 8b is a segmented image obtained by segmenting the second original road scene RGB image shown in FIG. 8a using the method of the present invention;
FIG. 9a is a third original road scene RGB image;
FIG. 9b is a segmented image obtained by segmenting the third RGB image of the original road scene shown in FIG. 9a according to the present invention;
FIG. 10a is a fourth original road scene RGB image;
FIG. 10b is a segmented image obtained by segmenting the fourth RGB image of the original road scene shown in FIG. 10a using the method of the present invention;
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention provides an RGBT road scene semantic segmentation method based on a multi-supervision network, which has a general implementation block diagram as shown in FIG. 1 and comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting a number of original road scene RGB images, the corresponding original Thermal infrared images, and the ground-truth semantic segmentation images. In this embodiment, 784 original road scene RGB images, the corresponding original Thermal infrared images, and the ground-truth semantic segmentation images are selected; an original road scene RGB image together with its corresponding original Thermal infrared image constitutes an original road scene image. The set of original road scene images is denoted { J(i, j) }, and the set of corresponding ground-truth semantic segmentation images is denoted { Jtrue(i, j) }. The ground-truth semantic segmentation image corresponding to each original road scene RGB image in the training set is then processed into 9 one-hot encoded images using the existing one-hot encoding technique, and the set formed by the 9 one-hot encoded images is denoted Jtrue. Each original road scene image has a height of 480 and a width of 640, with 1 ≤ i ≤ 640 and 1 ≤ j ≤ 480; J(i, j) denotes the pixel value of the pixel at coordinate (i, j) in the set { J(i, j) } of original road scene images, and Jtrue(i, j) denotes the pixel value of the pixel at coordinate (i, j) in the set { Jtrue(i, j) } of ground-truth semantic segmentation images.
Data enhancement is performed on each original road scene RGB image and each original Thermal infrared image by cropping, brightness adjustment and flipping, yielding the initial road scene RGB images and initial Thermal infrared images; the cropping is a cropping method that keeps the image size unchanged, implemented by filling the cropped-out part with 0. The batch size is 4, and the training set is formed by the initial road scene RGB images, the initial Thermal infrared images and the corresponding real semantic segmentation images;
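A size-preserving augmentation in the spirit of this step might look like the sketch below (the margin size and brightness-jitter range are our own illustrative choices, not values from the patent):

```python
import numpy as np

def augment(img, max_cut=64, rng=None):
    """Size-preserving augmentation sketch: zero-fill a random cropped
    margin (so height and width stay fixed), jitter brightness, and
    randomly flip horizontally."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = img.astype(np.float32).copy()
    cut = int(rng.integers(0, max_cut))
    if cut:                                    # cropped-out part filled with 0
        out[:cut, :] = 0.0
        out[:, :cut] = 0.0
    out = out * float(rng.uniform(0.8, 1.2))   # brightness jitter
    if rng.random() < 0.5:                     # random horizontal flip
        out = out[:, ::-1]
    return out

aug = augment(np.ones((480, 640)))             # spatial size is unchanged
```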
step 1_ 2: constructing a convolutional neural network;
as shown in fig. 1, the convolutional neural network includes two parts, namely an encoding module and a decoding module, which are respectively used for performing feature extraction operation and upsampling operation on an initial road scene RGB image and a corresponding initial Thermal infrared image, and the encoding module is connected with the decoding module;
The encoding module comprises 10 coding modules, and the decoding module comprises a semantic AHLS module, a multitask supervision RM module, 5 information fusion FM modules, 5 feature fusion AMMF modules and 3 semantic supervision MLF modules. The semantic AHLS module is used for generating high-level semantics; the feature fusion AMMF module is used for fusing RGB information, Thermal information and the output information of the previous stage; the semantic supervision MLF module is used for fusing high-level, middle-level and low-level semantic information; the multitask supervision RM module is used for semantic supervision, boundary supervision and foreground-background supervision; and the information fusion FM module is used for fusing high-level semantics and low-level information.
The first coding module is connected with the semantic AHLS module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the sixth coding module is connected with the semantic AHLS module after sequentially passing through a seventh coding module, an eighth coding module, a ninth coding module and a tenth coding module, the input of the first coding module is an initial road scene RGB image, and the input of the sixth coding module is an initial Thermal infrared image;
the fifth coding module, the semantic AHLS module and the tenth coding module are connected with the first feature fusion AMMF module, the output of the first feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the first information fusion FM module, the output of the first information fusion FM module and the output of the semantic AHLS module are simultaneously input into the first semantic supervision MLF module, and the output of the first semantic supervision MLF module is used as the second output of the convolutional neural network;
the output of the fourth coding module, the output of the first information fusion FM module and the output of the ninth coding module are input into the second feature fusion AMMF module, the output of the second feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the second information fusion FM module, the output of the third coding module and the output of the eighth coding module are simultaneously input into the third feature fusion AMMF module, the output of the third feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the third information fusion FM module, the output of the second information fusion FM module and the output of the third information fusion FM module are simultaneously input into the second semantic supervision MLF module, and the output of the second semantic supervision MLF module is used as the third output of the convolutional neural network;
the output of the second coding module, the output of the third information fusion FM module and the output of the seventh coding module are input into a fourth feature fusion AMMF module, the output of the fourth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fourth information fusion FM module, the output of the first coding module and the output of the sixth coding module are input into a fifth feature fusion AMMF module, the output of the fifth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fifth information fusion FM module, the output of the fourth information fusion FM module and the output of the fifth information fusion FM module are simultaneously input into a third semantic supervision MLF module, and the output of the third semantic supervision MLF module is used as the fourth output of the convolutional neural network; the output of the fifth information fusion FM module is used as the first output of the convolutional neural network, the output of the fifth information fusion FM module is input to the multitask supervision RM module, and the first output, the second output and the third output of the multitask supervision RM module are respectively used as the fifth output, the sixth output and the seventh output of the convolutional neural network.
The first coding module and the sixth coding module have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first batch normalization layer and a first activation layer;
the second coding module and the seventh coding module have the same structure and are mainly formed by sequentially connecting a first down-sampling layer, a first residual unit and two second residual units; the first down-sampling layer is specifically max-pooling down-sampling.
The third coding module and the eighth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and seven second residual error units;
the fourth coding module and the ninth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and 35 second residual error units;
the fifth coding module and the tenth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and two second residual error units.
The first residual error unit comprises a second convolution layer, a second normalization layer, a third convolution layer, a third normalization layer, a fourth convolution layer, a fourth normalization layer, a second active layer, a fifth convolution layer, a fifth normalization layer and a third active layer;
The second convolution layer is connected with the second activation layer after sequentially passing through the second normalization layer, the third convolution layer, the third normalization layer, the fourth convolution layer and the fourth normalization layer. The input of the first residual unit is the input of the second convolution layer and is also input into the fifth convolution layer; the fifth convolution layer is connected with the fifth normalization layer. The output of the second activation layer and the output of the fifth normalization layer are added, the sum is input into the third activation layer, and the output of the third activation layer is used as the output of the first residual unit. The activation function of the third activation layer is the Relu activation function.
The second residual error unit comprises a sixth convolution layer, a sixth normalization layer, a seventh convolution layer, a seventh normalization layer, an eighth convolution layer, an eighth normalization layer, a fourth active layer and a fifth active layer;
The sixth convolution layer is connected with the fourth activation layer after sequentially passing through the sixth normalization layer, the seventh convolution layer, the seventh normalization layer, the eighth convolution layer and the eighth normalization layer. The input of the second residual unit is the input of the sixth convolution layer; the output of the fourth activation layer is added to the input of the sixth convolution layer, the sum is input into the fifth activation layer, and the output of the fifth activation layer is used as the output of the second residual unit. The activation function of the fifth activation layer is the Relu activation function.
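The two residual units described above can be rendered roughly as follows (a PyTorch sketch under our own assumptions: the standard 1×1 → 3×3 → 1×1 bottleneck layout, with a convolution + normalization projection shortcut for the first unit and an identity shortcut for the second; channel counts and activation placement are illustrative):

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Bottleneck residual unit: main branch of three convolutions with
    batch normalization, plus either a projection shortcut (first unit)
    or an identity shortcut (second unit); Add, then Relu."""
    def __init__(self, in_ch, out_ch, stride=1, project=False):
        super().__init__()
        mid = out_ch // 4
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if project else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)  # final Relu after the Add operation

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))

x = torch.randn(1, 64, 30, 40)
first = ResidualUnit(64, 256, project=True)(x)   # first residual unit
second = ResidualUnit(256, 256)(first)           # second residual unit
```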
As shown in fig. 4, the 5 information fusion FM modules have the same structure, specifically:
comprises a fourth up-sampling layer, a twenty-second convolution layer and a twenty-third convolution layer; each information fusion FM module has two inputs and one output: the second input of the information fusion FM module is input into the fourth up-sampling layer, the fourth up-sampling layer is connected with the twenty-second convolution layer, the output of the twenty-second convolution layer, after being cascaded with the first input of the information fusion FM module, is input into the twenty-third convolution layer, and the output of the twenty-third convolution layer is used as the output of the information fusion FM module.
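At shape level, the FM module's data flow can be sketched as follows (numpy stand-ins we chose ourselves: nearest-neighbour repeat for the up-sampling layer and random 1×1 channel-mixing matrices for the two convolution layers — only the data flow follows the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_module(first_in, second_in, scale):
    """FM data flow: up-sample the high-level semantics (second input),
    1x1-convolve, concatenate with the low-level first input along the
    channel axis, then 1x1-convolve again."""
    c1 = first_in.shape[0]
    up = second_in.repeat(scale, axis=1).repeat(scale, axis=2)        # up-sampling layer
    mixed = np.tensordot(rng.random((c1, up.shape[0])), up, axes=1)   # 22nd conv (1x1)
    cat = np.concatenate([mixed, first_in], axis=0)                   # cascade (concat)
    return np.tensordot(rng.random((c1, cat.shape[0])), cat, axes=1)  # 23rd conv (1x1)

low = np.ones((64, 240, 320))    # low-level feature maps (first input)
high = np.ones((64, 15, 20))     # high-level semantics (second input)
fused = fm_module(low, high, scale=16)
```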
As shown in fig. 2, the semantic AHLS module includes two first convolution modules, two first attention modules, one second convolution module, and one second attention module;
the semantic AHLS module is provided with two inputs, a first input of the semantic AHLS module is connected with one first attention mechanism module after passing through one first convolution module, a second input of the semantic AHLS module is connected with the other first attention mechanism module after passing through the other first convolution module, the output of the two first attention mechanism modules after being cascaded is input into the second convolution module, the second convolution module is connected with the second attention mechanism module, and the output of the second attention mechanism module is used as the output of the semantic AHLS module;
the first convolution module mainly comprises a ninth convolution layer; the first attention mechanism module is mainly formed by sequentially connecting a first pooling layer, a first full-connection layer, a sixth activation layer, a second full-connection layer, a seventh activation layer, a third full-connection layer, an eighth activation layer and a tenth convolution layer;
the second convolution module is mainly formed by sequentially connecting an eleventh convolution layer, a ninth normalization layer and a ninth active layer; the second attention mechanism module is identical in structure to the first attention mechanism module.
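The attention mechanism module described above (global pooling, three fully connected layers, a Sigmoid gate) behaves like SE-style channel attention; a numpy sketch with illustrative layer widths and random weights (the module's trailing 1×1 convolution is omitted for brevity):

```python
import numpy as np

def channel_attention(feat, reduction=4, seed=0):
    """Channel-attention sketch: global max pooling -> FC -> act -> FC ->
    act -> FC -> Sigmoid, then reweight the input channels by the gate.
    Weights are random here; only the data flow follows the text."""
    rng = np.random.default_rng(seed)
    c = feat.shape[0]
    v = feat.reshape(c, -1).max(axis=1)                  # global max pooling
    w1 = rng.standard_normal((c // reduction, c)) * 0.1  # first fully connected layer
    w2 = rng.standard_normal((c // reduction, c // reduction)) * 0.1
    w3 = rng.standard_normal((c, c // reduction)) * 0.1  # third fully connected layer
    v = np.maximum(w1 @ v, 0.0)
    v = np.maximum(w2 @ v, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w3 @ v)))               # Sigmoid gate in (0, 1)
    return feat * gate[:, None, None]                    # channel reweighting

f = np.ones((64, 15, 20))
attended = channel_attention(f)
```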
As shown in fig. 3, the 5 feature fusion AMMF modules have the same structure, specifically:
It comprises a first blending module, a second blending module, a twelfth convolution layer, a first up-sampling layer and a thirteenth convolution layer;
the first input of the feature fusion AMMF module is connected with the first coding module, the second coding module, the third coding module, the fourth coding module or the fifth coding module, the second input of the feature fusion AMMF module is connected with the sixth coding module, the seventh coding module, the eighth coding module, the ninth coding module or the tenth coding module, and the third input of the feature fusion AMMF module is connected with the semantic AHLS module, the first fusion feature output, the second fusion feature output, the third fusion feature output or the fourth fusion feature output;
the first input and the second input of the feature fusion AMMF module are attention-weighted and multiplied element-wise to give a first fusion output; the first fusion output is added back to the attention-weighted inputs to give a second fusion output; the second fusion output, after cascading, is input into the first blending module; the output of the first blending module, after cascading with the third input of the feature fusion AMMF module, is input into the second blending module; the second blending module is connected with the thirteenth convolution layer after sequentially passing through the twelfth convolution layer and the first up-sampling layer, and the output of the thirteenth convolution layer is used as the output of the feature fusion AMMF module;
the first blending module and the second blending module have the same structure, and each comprises a fourteenth convolution layer, a tenth normalization layer, a tenth activation layer, a fifteenth convolution layer and an eleventh normalization layer;
the fourteenth convolution layer is connected with the eleventh normalization layer after sequentially passing through the tenth normalization layer, the tenth activation layer and the fifteenth convolution layer; the input of the blending module is the input of the fourteenth convolution layer; the input of the fourteenth convolution layer is added to the output of the eleventh normalization layer, the sum is input into the tenth activation layer, and the output of the tenth activation layer is used as the output of the blending module. The activation function of the tenth activation layer is the Relu activation function.
As shown in fig. 5, the 3 semantic supervised MLF modules have the same structure, specifically:
It comprises a second up-sampling layer, a sixteenth convolution layer, a third up-sampling layer and a seventeenth convolution layer. The first input of the semantic supervision MLF module is connected with the semantic AHLS module, the second fusion feature output or the fourth fusion feature output, and the second input of the semantic supervision MLF module is connected with the first fusion feature output, the third fusion feature output or the fifth fusion feature output;
the first input of the semantic supervision MLF module is connected with the sixteenth convolution layer after passing through the second up-sampling layer; the output of the sixteenth convolution layer, after being cascaded with the second input of the semantic supervision MLF module, is connected with the seventeenth convolution layer after passing through the third up-sampling layer; and the output of the seventeenth convolution layer is used as the output of the semantic supervision MLF module.
As shown in fig. 6, the multitask supervision RM module includes an eighteenth convolutional layer, a nineteenth convolutional layer, a twentieth convolutional layer, an exp function layer, a first multitask module, a second multitask module, and a third multitask module;
the input of the multitask supervision RM module is the input of the eighteenth convolutional layer; the output of the eighteenth convolutional layer, after passing through the first multitask module, is used as the first output of the multitask supervision RM module; the first output of the multitask supervision RM module, after passing through the exp function layer, is multiplied by the output of the eighteenth convolutional layer, and the product is input into the second multitask module; the output of the second multitask module, after passing through the nineteenth convolutional layer, is used as the third output of the multitask supervision RM module; the output of the second multitask module, after being cascaded with the output of the eighteenth convolutional layer, is connected with the twentieth convolutional layer after passing through the third multitask module; and the output of the twentieth convolutional layer is used as the second output of the multitask supervision RM module;
the first multitask module, the second multitask module and the third multitask module have the same structure and are mainly formed by sequentially connecting a twenty-first convolution layer, a twelfth normalization layer and a twelfth active layer.
For the 1st coding module, it is composed of a first Convolution layer (Convolution, Conv), a first batch normalization layer (BatchNorm) and a first activation layer (Activation, Act) arranged in sequence. The first convolution layer uses a kernel size (kernel_size) of 7, a stride of 2, an edge padding of 3, and 64 convolution kernels. The input end of the 1st coding module receives the RGB three-channel components of the original input image, whose width is required to be W and height H. After the normalization operation of the first batch normalization layer, 64 output feature maps are output through the first activation layer (the activation mode is Relu), and the set formed by the 64 sub-feature maps is denoted N1; each feature map has a width of W/2 and a height of H/2.
For the 2nd coding module, it is composed in sequence of 1 down-sampling layer and 3 residual units. The 1st down-sampling layer uses max-pooling down-sampling with a kernel size of 3×3, a stride of 2 and a padding coefficient of 1. The main branch of the first residual unit is stacked in sequence from: a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer; and a first activation layer; the number of output channels is 256. Its shortcut branch consists in sequence of one convolution layer with a kernel size of 1 and a stride of 1, and one normalization layer; the number of output channels is 256. The other residual units have the same main branch (kernel sizes 1, 3, 1 with a stride of 1 throughout and 256 output channels), while their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch (Add operation) and applies the Relu activation function to obtain the final output. The input end of the 2nd coding module receives all the feature maps in N1; the output end outputs 256 sub-feature maps, whose set is denoted N2; each feature map has a width of W/4 and a height of H/4.
For the 3rd coding module, it is composed of 8 residual units in sequence. The main branch of the first residual unit is stacked in sequence from: a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer; and a first activation layer; the number of output channels is 512. Its shortcut branch consists in sequence of one convolution layer with a kernel size of 1 and a stride of 2, and one normalization layer; the number of output channels is 512. The other residual units have the same main branch but with a stride of 1 in the second convolution layer and 512 output channels, while their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch and applies the Relu activation function to obtain the final output. The input end of the 3rd coding module receives all the feature maps in N2; the output end outputs 512 sub-feature maps, whose set is denoted N3; each feature map has a width of W/8 and a height of H/8.
For the 4th coding module, it is composed of 36 residual units in sequence. The main branch of the first residual unit is stacked in sequence from: a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer; and a first activation layer; the number of output channels is 1024. Its shortcut branch consists in sequence of one convolution layer with a kernel size of 1 and a stride of 2, and one normalization layer; the number of output channels is 1024. The other residual units have the same main branch but with a stride of 1 in the second convolution layer and 1024 output channels, while their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch and applies the Relu activation function to obtain the final output. The input end of the 4th coding module receives all the feature maps in N3; the output end outputs 1024 sub-feature maps, whose set is denoted N4; each feature map has a width of W/16 and a height of H/16.
For the 5th coding module, it is composed of 3 residual units in sequence. The main branch of the first residual unit is stacked in sequence from: a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer; and a first activation layer; the number of output channels is 2048. Its shortcut branch consists in sequence of one convolution layer with a kernel size of 1 and a stride of 2, and one normalization layer; the number of output channels is 2048. The other residual units have the same main branch but with a stride of 1 in the second convolution layer and 2048 output channels, while their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch and applies the Relu activation function to obtain the final output. The input end of the 5th coding module receives all the feature maps in N4; the output end outputs 2048 sub-feature maps, whose set is denoted N5; each feature map has a width of W/32 and a height of H/32.
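The spatial sizes of N1–N5 follow from the standard convolution output-size formula, floor((size + 2·padding − kernel)/stride) + 1; a quick check for the 480×640 input used in training:

```python
def conv_out(size, kernel, stride, padding):
    """Output spatial size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

h, w = 480, 640
h, w = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)    # coding module 1: 7x7, stride 2, pad 3
stages = [(h, w)]                                    # N1
h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)    # max-pooling in coding module 2
stages.append((h, w))                                # N2
for _ in range(3):                                   # stride-2 residual units, modules 3-5
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    stages.append((h, w))
print(stages)   # [(240, 320), (120, 160), (60, 80), (30, 40), (15, 20)]
```

Each stage halves both dimensions, matching the W/2, W/4, W/8, W/16, W/32 widths (and corresponding heights) of N1 through N5.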
For the 6th coding module, it is composed of a first Convolution layer (Convolution, Conv), a first batch normalization layer (BatchNorm) and a first activation layer (Activation, Act) arranged in sequence. The first convolution layer uses a kernel size (kernel_size) of 7, a stride of 2, an edge padding of 3, and 64 convolution kernels. The input end of the 6th coding module receives the Thermal single-channel component of the original input image, whose width is required to be W and height H. After the normalization operation of the first batch normalization layer, 64 output feature maps are output through the first activation layer (the activation mode is Relu), and the set formed by the 64 sub-feature maps is denoted N6; each feature map has a width of W/2 and a height of H/2.
For the 7th coding module, it is composed in sequence of 1 down-sampling layer and 3 residual units. The 1st down-sampling layer uses max-pooling down-sampling with a kernel size of 3×3, a stride of 2 and a padding coefficient of 1. The main branch of the first residual unit is stacked in sequence from: a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer; and a first activation layer; the number of output channels is 256. Its shortcut branch consists in sequence of one convolution layer with a kernel size of 1 and a stride of 1, and one normalization layer; the number of output channels is 256. The other residual units have the same main branch (kernel sizes 1, 3, 1 with a stride of 1 throughout and 256 output channels), while their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch and applies the Relu activation function to obtain the final output. The input end of the 7th coding module receives all the feature maps in N6; the output end outputs 256 sub-feature maps, whose set is denoted N7; each feature map has a width of W/4 and a height of H/4.
For the 8th coding module, it is composed of 8 residual units in sequence. The main branch of the first residual unit is stacked in sequence from: a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer; and a first activation layer; the number of output channels is 512. Its shortcut branch consists in sequence of one convolution layer with a kernel size of 1 and a stride of 2, and one normalization layer; the number of output channels is 512. The other residual units have the same main branch but with a stride of 1 in the second convolution layer and 512 output channels, while their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch and applies the Relu activation function to obtain the final output. The input end of the 8th coding module receives all the feature maps in N7; the output end outputs 512 sub-feature maps, whose set is denoted N8; each feature map has a width of W/8 and a height of H/8.
For the 9th coding module, it is composed of 36 residual units in sequence. The main branch of the first residual unit is stacked in sequence from: a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer; and a first activation layer; the number of output channels is 1024. Its shortcut branch consists in sequence of one convolution layer with a kernel size of 1 and a stride of 2, and one normalization layer; the number of output channels is 1024. The other residual units have the same main branch but with a stride of 1 in the second convolution layer and 1024 output channels, while their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch and applies the Relu activation function to obtain the final output. The input end of the 9th coding module receives all the feature maps in N8; the output end outputs 1024 sub-feature maps, whose set is denoted N9; each feature map has a width of W/16 and a height of H/16.
For the 10th coding module, it is composed of 3 residual units in sequence. The main branch of the first residual unit is stacked in sequence from: a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer; and a first activation layer; the number of output channels is 2048. Its shortcut branch consists in sequence of one convolution layer with a kernel size of 1 and a stride of 2, and one normalization layer; the number of output channels is 2048. The other residual units have the same main branch but with a stride of 1 in the second convolution layer and 2048 output channels, while their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch and applies the Relu activation function to obtain the final output. The input end of the 10th coding module receives all the feature maps in N9; the output end outputs 2048 sub-feature maps, whose set is denoted N10; each feature map has a width of W/32 and a height of H/32.
For the high-level semantic AHLS module. It consists, in order, of a first convolution module, a first attention mechanism module, a Tensor splicing layer, a second convolution module and a second attention mechanism module. The first convolution module has kernel size 1, stride 1 and 64 kernels. The first attention mechanism module consists, in order, of a global max-pooling layer, a first fully-connected layer, a first activation function, a second fully-connected layer, a second activation function, a third fully-connected layer, a Sigmoid function, and a convolution with kernel size 1 and stride 1. The Tensor splicing operation concatenates the two input features along the channel dimension. The second convolution module consists, in order, of a convolution layer with kernel size 3 and stride 1, a normalization layer and an activation layer. The second attention mechanism module is identical to the first. The RGB output of the 5th encoding module is denoted R5, and the Thermal output of the 10th encoding module is denoted T5. R5 and T5 are each passed, in turn, through the first convolution module and the first attention mechanism module, giving outputs R5_out and T5_out respectively; R5_out and T5_out are then input to the Tensor splicing layer, whose output is fout; finally fout is passed through the second convolution module and the second attention mechanism module in turn to output the high-level semantic information fhigh.
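A sketch of the AHLS module under stated assumptions: the reduction ratio of the fully-connected layers, separate (non-shared) weights for the RGB and Thermal branches, and applying the final 1×1 convolution to the attention-reweighted feature are not specified in the text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Per the text: global max pool, three FC layers with activations,
    Sigmoid, then a 1x1 convolution. Reduction ratio r is an assumption."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch // r), nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch), nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(ch, ch, kernel_size=1, stride=1)

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.mlp(torch.amax(x, dim=(2, 3)))   # global max pool -> (b, c)
        # Reweight channels, then 1x1 conv (applying it here is an assumption)
        return self.conv(x * w.view(b, c, 1, 1))

class AHLS(nn.Module):
    """1x1 conv + attention on each modality, channel concat,
    3x3 conv block, attention again -> f_high (64 channels)."""
    def __init__(self, in_ch=2048, mid=64):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, mid, 1)
        self.conv_t = nn.Conv2d(in_ch, mid, 1)
        self.att_r = ChannelAttention(mid)
        self.att_t = ChannelAttention(mid)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * mid, mid, 3, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.att_out = ChannelAttention(mid)

    def forward(self, r5, t5):
        fr = self.att_r(self.conv_r(r5))          # R5_out
        ft = self.att_t(self.conv_t(t5))          # T5_out
        f_out = torch.cat([fr, ft], dim=1)        # Tensor splicing (channel dim)
        return self.att_out(self.fuse(f_out))     # f_high
```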
For the first feature fusion AMMF module 1. The RGB output of the 5th encoding module is denoted R5, and the Thermal output of the 10th encoding module is denoted T5. R5 and T5 are each passed through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, giving outputs R5_out and T5_out; R5_out and T5_out are combined by a dot-product operation to obtain f4_out1; R5_out, T5_out and f4_out1 are added to obtain f4_out2; R5_out and T5_out are spliced with f4_out2 to obtain f4_out3; f4_out3 is input to the first blending module to obtain f4_out4, where the main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, both convolution layers having kernel size 3 and stride 1; the shortcut branch performs no other operation and simply passes the input data through; the final step applies an Add operation to the main and shortcut branches followed by a ReLU activation function to give the output. Next, fhigh generated above and f4_out4 are spliced to obtain f4_out5; f4_out5 is input to the second blending module to obtain f4_out6, where the second blending module is identical to the first; f4_out6 is passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f4_out7; f4_out7 is upsampled by bilinear interpolation to obtain f4_out8; finally f4_out8 is passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the output f4_out9 of the first feature fusion AMMF module 1. At this point the feature maps are twice their original size: each has a width of W/16 and a height of H/16.
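The AMMF pipeline above might be sketched as follows; the exact operands of the splicing steps and all channel widths other than 64 are assumptions, since the corresponding symbols were lost from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Blend(nn.Module):
    """Blending module: residual block of two 3x3 convs; Add + ReLU."""
    def __init__(self, ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        # Add main + identity shortcut, then ReLU, as in the text
        return torch.relu(self.main(x) + x)

class AMMF(nn.Module):
    """Feature-fusion sketch: dot product, sum, and concatenation of the
    attention-refined RGB/Thermal features, two blending modules, then
    1x1 conv, 2x bilinear upsampling, and a final 1x1 conv."""
    def __init__(self, ch=64):
        super().__init__()
        self.blend1 = Blend(3 * ch)
        self.blend2 = Blend(4 * ch)
        self.conv1 = nn.Conv2d(4 * ch, ch, 1)
        self.conv2 = nn.Conv2d(ch, ch, 1)

    def forward(self, fr, ft, skip):
        f1 = fr * ft                        # dot product of attention outputs
        f2 = fr + ft + f1                   # element-wise sum
        f3 = torch.cat([fr, ft, f2], 1)     # splicing (operands assumed)
        f4 = self.blend1(f3)
        f5 = torch.cat([skip, f4], 1)       # splice with the decoder skip feature
        f6 = self.blend2(f5)
        f7 = self.conv1(f6)
        f8 = F.interpolate(f7, scale_factor=2, mode='bilinear', align_corners=False)
        return self.conv2(f8)               # 2x spatial size, 64 channels
```

AMMF module 1 would take fhigh as the skip feature; AMMF modules 2–5 would take the preceding FM module's output (f4, f3, f2, f1 respectively).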
Since the skip connections used by the network model are implemented as information fusion FM modules, the first information fusion FM module takes the output f4_out9 of the first feature fusion AMMF module 1 together with the high-level semantic information fhigh; fhigh is upsampled 2× by bilinear interpolation and passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f1_high; f4_out9 and f1_high are spliced to obtain f4_out10; finally f4_out10 is input to a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the final output f4. The output end then produces 64 feature maps, each with a width of W/16 and a height of H/16.
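Each FM module reduces to upsample–conv–concat–conv; a sketch (the upsampling factor applied to fhigh is 2 for FM1, 4 for FM2, 8 for FM3, 16 for FM4 and 32 for FM5):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FM(nn.Module):
    """Information-fusion (skip-connection) module sketch: upsample f_high
    by the given scale, 1x1 conv to 64 channels, splice with the AMMF
    output, then a final 1x1 conv back to 64 channels."""
    def __init__(self, scale, ch=64, high_ch=64):
        super().__init__()
        self.scale = scale
        self.conv_high = nn.Conv2d(high_ch, ch, 1)
        self.conv_out = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f_ammf, f_high):
        h = F.interpolate(f_high, scale_factor=self.scale,
                          mode='bilinear', align_corners=False)
        h = self.conv_high(h)                               # f^i_high
        return self.conv_out(torch.cat([f_ammf, h], dim=1)) # f_i
```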
For the second feature fusion AMMF module 2. The RGB output of the 4th encoding module is denoted R4, and the Thermal output of the 9th encoding module is denoted T4. R4 and T4 are each passed through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, giving outputs R4_out and T4_out; R4_out and T4_out are combined by a dot-product operation to obtain f3_out1; R4_out, T4_out and f3_out1 are added to obtain f3_out2; R4_out and T4_out are spliced with f3_out2 to obtain f3_out3; f3_out3 is input to the first blending module to obtain f3_out4, where the main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, both convolution layers having kernel size 3 and stride 1; the shortcut branch performs no other operation and simply passes the input data through; the final step applies an Add operation to the main and shortcut branches followed by a ReLU activation function to give the output. Next, f4 generated above and f3_out4 are spliced to obtain f3_out5; f3_out5 is input to the second blending module to obtain f3_out6, where the second blending module is identical to the first; f3_out6 is passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f3_out7; f3_out7 is upsampled by bilinear interpolation to obtain f3_out8; f3_out8 is then passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the output f3_out9 of the second feature fusion AMMF module 2. At this point the feature maps are twice their original size: each has a width of W/8 and a height of H/8.
Owing to the skip connections present in the model, the second information fusion FM module takes the output f3_out9 of the second feature fusion AMMF module 2 together with the above fhigh; fhigh is upsampled 4× by bilinear interpolation and passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f2_high; f3_out9 and f2_high are spliced to obtain f3_out10; finally f3_out10 is input to a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the final output f3. The output end then produces 64 feature maps, each with a width of W/8 and a height of H/8.
For the third feature fusion AMMF module 3. The RGB output of the 3rd encoding module is denoted R3, and the Thermal output of the 8th encoding module is denoted T3. R3 and T3 are each passed through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, giving outputs R3_out and T3_out; R3_out and T3_out are combined by a dot-product operation to obtain f2_out1; R3_out, T3_out and f2_out1 are added to obtain f2_out2; R3_out and T3_out are spliced with f2_out2 to obtain f2_out3; f2_out3 is input to the first blending module to obtain f2_out4, where the main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, both convolution layers having kernel size 3 and stride 1; the shortcut branch performs no other operation and simply passes the input data through; the final step applies an Add operation to the main and shortcut branches followed by a ReLU activation function to give the output. Next, f3 generated above and f2_out4 are spliced to obtain f2_out5; f2_out5 is input to the second blending module to obtain f2_out6, where the second blending module is identical to the first; f2_out6 is passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f2_out7; f2_out7 is upsampled by bilinear interpolation to obtain f2_out8; f2_out8 is then passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the output f2_out9 of the third feature fusion AMMF module 3. At this point the feature maps are twice their original size: each has a width of W/4 and a height of H/4.
Owing to the skip connections present in the model, the third information fusion FM module takes the output f2_out9 of the third feature fusion AMMF module 3 together with the above fhigh; fhigh is upsampled 8× by bilinear interpolation and passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f3_high; f2_out9 and f3_high are spliced to obtain f2_out10; finally f2_out10 is input to a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the final output f2. The output end then produces 64 feature maps, each with a width of W/4 and a height of H/4.
For the fourth feature fusion AMMF module 4. The RGB output of the 2nd encoding module is denoted R2, and the Thermal output of the 7th encoding module is denoted T2. R2 and T2 are each passed through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, giving outputs R2_out and T2_out; R2_out and T2_out are combined by a dot-product operation to obtain f1_out1; R2_out, T2_out and f1_out1 are added to obtain f1_out2; R2_out and T2_out are spliced with f1_out2 to obtain f1_out3; f1_out3 is input to the first blending module to obtain f1_out4, where the main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, both convolution layers having kernel size 3 and stride 1; the shortcut branch performs no other operation and simply passes the input data through; the final step applies an Add operation to the main and shortcut branches followed by a ReLU activation function to give the output. Next, f2 generated above and f1_out4 are spliced to obtain f1_out5; f1_out5 is input to the second blending module to obtain f1_out6, where the second blending module is identical to the first; f1_out6 is passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f1_out7; f1_out7 is upsampled by bilinear interpolation to obtain f1_out8; finally f1_out8 is passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the output f1_out9 of the fourth feature fusion AMMF module 4. At this point the feature maps are twice their original size: each has a width of W/2 and a height of H/2.
Owing to the skip connections present in the network, the fourth information fusion FM module takes the output f1_out9 of the fourth feature fusion AMMF module 4 together with the above fhigh; fhigh is upsampled 16× by bilinear interpolation and passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f4_high; f1_out9 and f4_high are spliced to obtain f1_out10; finally f1_out10 is input to a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the final output f1. The output end then produces 64 feature maps, each with a width of W/2 and a height of H/2.
For the fifth feature fusion AMMF module 5. The RGB output of the 1st coding module is denoted R1, and the Thermal output of the 6th coding module is denoted T1. R1 and T1 are each passed through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, giving outputs R1_out and T1_out; R1_out and T1_out are combined by a dot-product operation to obtain f0_out1; R1_out, T1_out and f0_out1 are added to obtain f0_out2; R1_out and T1_out are spliced with f0_out2 to obtain f0_out3; f0_out3 is input to the first blending module to obtain f0_out4, where the main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, both convolution layers having kernel size 3 and stride 1; the shortcut branch performs no other operation and simply passes the input data through; the final step applies an Add operation to the main and shortcut branches followed by a ReLU activation function to give the output. Next, f1 generated above and f0_out4 are spliced to obtain f0_out5; f0_out5 is input to the second blending module to obtain f0_out6, where the second blending module is identical to the first; f0_out6 is passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f0_out7; f0_out7 is upsampled by bilinear interpolation to obtain f0_out8; finally f0_out8 is passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the output f0_out9 of the fifth feature fusion AMMF module 5. At this point the feature maps are twice their original size, and each feature map has a width of W and a height of H.
Owing to the skip connections present in the network, the fifth information fusion FM module takes the output f0_out9 of the fifth feature fusion AMMF module 5 together with the above fhigh; fhigh is upsampled 32× by bilinear interpolation and passed through a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain f5_high; f0_out9 and f5_high are spliced to obtain f0_out10; f0_out10 is then input to a convolution layer with kernel size 1, stride 1 and 64 kernels to obtain the output f0; finally f0 is passed through a convolution layer with kernel size 1, stride 1 and 9 kernels to obtain the semantic prediction output ffinal. The output end then produces 9 feature maps, each with a width of W and a height of H.
For the first semantic supervision MLF module 1, which supervises high-level information. The high-level semantic information fhigh is upsampled 2× by bilinear interpolation and passed through a convolution layer with kernel size 1, stride 1 and 64 kernels, and the result is added to the output f4 obtained above; at this point each feature map has a width of W/16, a height of H/16 and 64 channels. The sum is then upsampled 16× by bilinear interpolation and finally passed through a convolution layer with kernel size 1, stride 1 and 9 kernels to obtain the final output fhigh; at this point each feature map has a width of W, a height of H and 9 channels.
For the second semantic supervision MLF module 2, which supervises middle-level information. The output f3 obtained above is upsampled 2× by bilinear interpolation and passed through a convolution layer with kernel size 1, stride 1 and 64 kernels, and the result is added to the output f2 obtained above; at this point each feature map has a width of W/4, a height of H/4 and 64 channels. The sum is then upsampled 4× by bilinear interpolation and finally passed through a convolution layer with kernel size 1, stride 1 and 9 kernels to obtain the final output fmid; at this point each feature map has a width of W, a height of H and 9 channels.
For the third semantic supervision MLF module 3, which supervises low-level information. The output f1 obtained above is upsampled 2× by bilinear interpolation and passed through a convolution layer with kernel size 1, stride 1 and 64 kernels, and the result is added to the output f0 obtained above; at this point each feature map has a width of W, a height of H and 64 channels. The sum is finally passed through a convolution layer with kernel size 1, stride 1 and 9 kernels to obtain the final output flow; at this point each feature map has a width of W, a height of H and 9 channels.
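All three MLF supervision heads share the same shape — 2× upsample the deeper feature, project to 64 channels, add the shallower feature, upsample to full resolution, project to the 9 classes — so one parameterized sketch covers them (`up_final` = 16, 4 and 1 for MLF1, MLF2 and MLF3 respectively):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLF(nn.Module):
    """Semantic-supervision head sketch: 2x upsample + 1x1 conv on the
    deeper feature, add the shallower feature, upsample to full
    resolution, then a 1x1 conv to the 9 classes."""
    def __init__(self, up_final, ch=64, n_classes=9):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 1)
        self.head = nn.Conv2d(ch, n_classes, 1)
        self.up_final = up_final

    def forward(self, deep, shallow):
        x = F.interpolate(deep, scale_factor=2, mode='bilinear', align_corners=False)
        x = self.conv(x) + shallow              # element-wise addition
        if self.up_final > 1:                   # MLF3 is already at full resolution
            x = F.interpolate(x, scale_factor=self.up_final,
                              mode='bilinear', align_corners=False)
        return self.head(x)                     # 9-channel prediction map
```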
For the multitask supervision RM module, which is based on three multitask modules. First, the semantic prediction map ffinal is input to the first multitask module, which consists, in order, of a first convolution layer with kernel size 3, stride 1 and 9 kernels, a normalization layer and an activation layer; the output of the activation layer passes through a second convolution layer with kernel size 1, stride 1 and 2 kernels to obtain the foreground-background output fbin. fbin is then passed through an exp function to obtain the semantic supervision weight, and ffinal is multiplied point-wise by this weight. The weighted result is input to the second multitask module, which consists, in order, of a convolution layer with kernel size 3, stride 1 and 9 kernels, a normalization layer and an activation layer; its output is passed through a convolution layer with kernel size 1, stride 1 and 9 kernels to obtain the final semantic output fsem. Then ffinal and the output of the second multitask module are spliced, and the spliced result is fed to the third multitask module, whose activation-layer output passes through a second convolution layer with kernel size 1, stride 1 and 2 kernels to obtain the final boundary output fbou. At this point each feature map has a width of W and a height of H; fsem has 9 channels, while fbin and fbou have 2 channels each.
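A sketch of the RM module; how the 2-channel exp weight is applied to the 9-channel map, and exactly which tensor is spliced with ffinal before the boundary head, are assumptions (those symbols are missing from the text).

```python
import torch
import torch.nn as nn

class RM(nn.Module):
    """Multitask supervision sketch: foreground/background head f_bin,
    exp-weighted semantic head f_sem, and boundary head f_bou."""
    def __init__(self, n_classes=9):
        super().__init__()
        def mt(in_ch, out_ch):
            # multitask module body: 3x3 conv, normalization, activation
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.mt1 = mt(n_classes, n_classes)
        self.bin_head = nn.Conv2d(n_classes, 2, 1)
        self.mt2 = mt(n_classes, n_classes)
        self.sem_head = nn.Conv2d(n_classes, n_classes, 1)
        self.mt3 = mt(2 * n_classes, n_classes)
        self.bou_head = nn.Conv2d(n_classes, 2, 1)

    def forward(self, f_final):
        f_bin = self.bin_head(self.mt1(f_final))     # foreground/background, 2 ch
        # exp of the foreground logit as the semantic weight (assumption:
        # the text does not say how the 2-channel map becomes a weight)
        w = torch.exp(f_bin[:, 1:2])
        f_mid = self.mt2(f_final * w)                # point-wise weighted map
        f_sem = self.sem_head(f_mid)                 # semantic output, 9 ch
        f_bou = self.bou_head(self.mt3(torch.cat([f_final, f_mid], dim=1)))
        return f_bin, f_sem, f_bou                   # boundary output: 2 ch
```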
Step 1_3: inputting the training set into the convolutional neural network for training, wherein the convolutional neural network outputs seven prediction map sets corresponding to each original road scene RGB image in the training set;
the size of each prediction map in the seven sets is the same as that of the original road scene RGB image, and the seven sets comprise a semantic segmentation prediction map set Jpre1, a high-level semantic prediction map set Jpre2, a middle-level semantic prediction map set Jpre3, a low-level semantic prediction map set Jpre4, a foreground-background prediction map set Jpre5, a boundary prediction map set Jpre6 and a semantic prediction map set Jpre7;
the semantic segmentation prediction map set Jpre1 consists of the 9 semantic segmentation prediction maps ffinal produced by the first output of the convolutional neural network, the high-level semantic prediction map set Jpre2 consists of the 9 high-level semantic prediction maps fhigh of the second output, the middle-level semantic prediction map set Jpre3 consists of the 9 middle-level semantic prediction maps fmid of the third output, the low-level semantic prediction map set Jpre4 consists of the 9 low-level semantic prediction maps flow of the fourth output, the foreground-background prediction map set Jpre5 consists of the foreground-background prediction maps fbin of the fifth output, the boundary prediction map set Jpre6 consists of the boundary prediction maps fbou of the sixth output, and the semantic prediction map set Jpre7 consists of the 9 semantic prediction maps fsem of the seventh output; finally, the seventh output of the convolutional neural network is used as the semantic segmentation prediction map output during the testing stage.
Step 1_4: processing the real semantic segmentation image corresponding to each original road scene RGB image into 9 one-hot encoded images and recording the set of the 9 one-hot encoded images as Jtrue; respectively calculating the loss function values between the set Jtrue and the seven corresponding prediction map sets, and taking the sum of the seven loss function values as the final loss value; the loss function value between the set Jtrue and the i-th prediction map set is recorded as Lossi(Jtrue, Jprei), i = 1, 2, ..., 7, and each Lossi is calculated using cross entropy (CrossEntropyLoss).
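Step 1_4's final loss — the sum of seven cross-entropy terms — might look like this in PyTorch (class-index targets are shown instead of one-hot images, which is the form `CrossEntropyLoss` expects; the 2-channel fbin/fbou heads are simplified to 9 channels here for brevity):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Class-index ground truth derived from the 9 one-hot coded images (Jtrue)
target = torch.randint(0, 9, (2, 32, 32))

# Stand-ins for the seven prediction map sets Jpre1..Jpre7
# (f_final, f_high, f_mid, f_low, f_bin, f_bou, f_sem)
preds = [torch.randn(2, 9, 32, 32) for _ in range(7)]

# Final loss value: the sum of the seven loss function values
total_loss = sum(criterion(p, target) for p in preds)
```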
Step 1_5: repeatedly executing step 1_3 and step 1_4 for V iterations until the convergence of the convolutional neural network saturates, i.e. the training loss barely fluctuates and the validation loss has nearly reached its minimum, at which point the convolutional neural network classification training model is obtained; the weight vector and bias of the network at this point are taken as the optimal weight vector and optimal bias term of the convolutional neural network classification training model; in this example, V is chosen to be 300.
The specific steps of the test stage process are as follows:
step 2: inputting a plurality of original road scene RGB images to be semantically segmented and original Thermal infrared images into a convolutional neural network classification training model, and predicting by using an optimal weight vector and an optimal bias term to obtain a corresponding semantic segmentation prediction graph.
In a specific implementation, 393 original RGB color images to be semantically segmented and their original Thermal infrared images are taken as the test set. Let S denote the pair consisting of an original RGB color image to be semantically segmented and its original Thermal infrared image, where 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H', W' denotes the width of S, H' denotes the height of S, and S(i', j') denotes the pixel value of the pixel at coordinate (i', j').
The R, G and B channel components of S and the corresponding original Thermal infrared image are input into the convolutional neural network classification training model, and prediction is carried out using the optimal weight vector Wbest and the optimal bias term bbest to obtain the semantic segmentation prediction map corresponding to S, denoted Spred, where Spred(i', j') denotes the pixel value of the pixel at coordinate (i', j').
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The convolutional neural network architecture is built with the Python-based deep learning library PyTorch. The test set of the road scene image database MFNet RGB-T Dataset (393 road scene images) is used to evaluate the segmentation performance of the method on road scene images. Four objective parameters commonly used to evaluate semantic segmentation methods serve as evaluation indexes: class accuracy (Acc), mean class accuracy (mAcc), the ratio of intersection to union between each class's segmented image and the label image (IoU), and the mean ratio of intersection to union between segmented images and label images (mIoU).
Each road scene image in the MFNet RGB-T Dataset test set is predicted with the method of the invention to obtain the corresponding predicted semantic segmentation image, and the class accuracy Acc, mean class accuracy mAcc, per-class intersection-over-union IoU and mean intersection-over-union mIoU reflecting the semantic segmentation performance of the method are listed in Table 1. As the data in Table 1 show, the segmentation results obtained for road scene images by the method of the invention are good, indicating that it is feasible and effective to use the method to obtain predicted semantic segmentation images for road scene images.
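The four evaluation indexes can all be computed from a per-class confusion matrix; a small NumPy sketch:

```python
import numpy as np

def confusion(pred, gt, n_cls):
    """Confusion matrix from flat label arrays: rows = ground truth,
    columns = prediction."""
    mask = (gt >= 0) & (gt < n_cls)
    return np.bincount(n_cls * gt[mask] + pred[mask],
                       minlength=n_cls ** 2).reshape(n_cls, n_cls)

def metrics(conf):
    """Per-class Acc and IoU, plus their means mAcc and mIoU."""
    tp = np.diag(conf).astype(float)
    acc = tp / np.maximum(conf.sum(axis=1), 1)                     # Acc per class
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)  # IoU per class
    return acc, acc.mean(), iou, iou.mean()                        # Acc, mAcc, IoU, mIoU
```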
TABLE 1 evaluation results on test sets using the method of the invention
FIG. 7a shows the 1 st original road scene image of the same scene; FIG. 7b shows a predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 7a by using the method of the present invention; FIG. 8a shows the 2 nd original road scene image of the same scene; FIG. 8b shows a predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 8a by using the method of the present invention; FIG. 9a shows the 3 rd original road scene image of the same scene; FIG. 9b shows a predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 9a by using the method of the present invention; FIG. 10a shows the 4 th original road scene image of the same scene; FIG. 10b shows a predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 10a by using the method of the present invention; comparing fig. 7a and 7b, 8a and 8b, 9a and 9b, and 10a and 10b, it can be seen that the segmentation precision of the predicted semantic segmentation image obtained by the method of the present invention is higher.
Claims (10)
1. A road scene image semantic segmentation method based on a multi-supervision network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting a plurality of original road scene RGB images, corresponding original Thermal infrared images and real semantic segmentation images; performing data enhancement on each original road scene RGB image and each original Thermal infrared image by cropping, brightness adjustment and flipping; and forming a training set from the plurality of original road scene RGB images, the plurality of original Thermal infrared images and the corresponding real semantic segmentation images;
step 1_ 2: constructing a convolutional neural network;
step 1_ 3: inputting the training set into a convolutional neural network for training, wherein the convolutional neural network outputs seven prediction image sets corresponding to each original road scene RGB image in the training set;
step 1_4: processing the real semantic segmentation image corresponding to each original road scene RGB image into 9 one-hot encoded images and recording the set of the 9 one-hot encoded images as Jtrue; respectively calculating the loss function values between the set Jtrue of 9 one-hot encoded images and the corresponding seven prediction image sets, and taking the sum of the seven loss function values as the final loss value;
step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for V times until the convergence of the convolutional neural network reaches saturation, and obtaining a convolutional neural network classification training model; taking the weight vector and the bias of the network obtained at the moment as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model;
the test stage process comprises the following specific steps:
step 2: inputting a plurality of original road scene RGB images to be semantically segmented and original Thermal infrared images into a convolutional neural network classification training model, and predicting by using an optimal weight vector and an optimal bias term to obtain a corresponding semantic segmentation prediction graph.
2. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 1, characterized in that: the convolutional neural network comprises an encoding part and a decoding part, the encoding part being connected with the decoding part;
the encoding part comprises 10 encoding modules, and the decoding part comprises a semantic AHLS module, a multitask supervision RM module, 5 information fusion FM modules, 5 feature fusion AMMF modules and 3 semantic supervision MLF modules;
the first coding module is connected with the semantic AHLS module after sequentially passing through the second coding module, the third coding module, the fourth coding module and the fifth coding module; the sixth coding module is connected with the semantic AHLS module after sequentially passing through the seventh coding module, the eighth coding module, the ninth coding module and the tenth coding module; the input of the first coding module is the original road scene RGB image, and the input of the sixth coding module is the original Thermal infrared image;
the fifth coding module, the semantic AHLS module and the tenth coding module are connected with the first feature fusion AMMF module, the output of the first feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the first information fusion FM module, the output of the first information fusion FM module and the output of the semantic AHLS module are simultaneously input into the first semantic supervision MLF module, and the output of the first semantic supervision MLF module is used as the second output of the convolutional neural network;
the output of the fourth coding module, the output of the first information fusion FM module and the output of the ninth coding module are input into the second feature fusion AMMF module, the output of the second feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the second information fusion FM module, the output of the third coding module and the output of the eighth coding module are simultaneously input into the third feature fusion AMMF module, the output of the third feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the third information fusion FM module, the output of the second information fusion FM module and the output of the third information fusion FM module are simultaneously input into the second semantic supervision MLF module, and the output of the second semantic supervision MLF module is used as the third output of the convolutional neural network;
the output of the second coding module, the output of the third information fusion FM module and the output of the seventh coding module are input into a fourth feature fusion AMMF module, the output of the fourth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fourth information fusion FM module, the output of the first coding module and the output of the sixth coding module are input into a fifth feature fusion AMMF module, the output of the fifth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fifth information fusion FM module, the output of the fourth information fusion FM module and the output of the fifth information fusion FM module are simultaneously input into a third semantic supervision MLF module, and the output of the third semantic supervision MLF module is used as the fourth output of the convolutional neural network; the output of the fifth information fusion FM module is used as the first output of the convolutional neural network, the output of the fifth information fusion FM module is input to the multitask supervision RM module, and the first output, the second output and the third output of the multitask supervision RM module are respectively used as the fifth output, the sixth output and the seventh output of the convolutional neural network.
3. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 1, characterized in that: the first coding module and the sixth coding module have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first batch normalization layer and a first activation layer;
the second coding module and the seventh coding module have the same structure and are mainly formed by sequentially connecting a first downsampling layer, a first residual unit and two second residual units;
the third coding module and the eighth coding module have the same structure and are mainly formed by sequentially connecting a first residual unit and seven second residual units;
the fourth coding module and the ninth coding module have the same structure and are mainly formed by sequentially connecting a first residual unit and 35 second residual units;
the fifth coding module and the tenth coding module have the same structure and are mainly formed by sequentially connecting a first residual unit and two second residual units.
4. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 3, characterized in that: the first residual unit comprises a second convolution layer, a second normalization layer, a third convolution layer, a third normalization layer, a fourth convolution layer, a fourth normalization layer, a second activation layer, a fifth convolution layer, a fifth normalization layer and a third activation layer;
the second convolution layer is connected with the second activation layer after sequentially passing through the second normalization layer, the third convolution layer, the third normalization layer, the fourth convolution layer and the fourth normalization layer; the input of the first residual unit is the input of the second convolution layer, and the input of the second convolution layer is also input into the fifth convolution layer; the fifth convolution layer is connected with the fifth normalization layer; the output of the second activation layer and the output of the fifth normalization layer are added, the sum is input into the third activation layer, and the output of the third activation layer is used as the output of the first residual unit;
the second residual unit comprises a sixth convolution layer, a sixth normalization layer, a seventh convolution layer, a seventh normalization layer, an eighth convolution layer, an eighth normalization layer, a fourth activation layer and a fifth activation layer;
the sixth convolution layer is connected with the fourth activation layer after sequentially passing through the sixth normalization layer, the seventh convolution layer, the seventh normalization layer, the eighth convolution layer and the eighth normalization layer; the input of the second residual unit is the input of the sixth convolution layer; the output of the fourth activation layer is added to the input of the sixth convolution layer, the sum is input into the fifth activation layer, and the output of the fifth activation layer is used as the output of the second residual unit.
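As a concrete reading of claim 4, the two residual units can be sketched in PyTorch as follows. Only the layer order is taken from the claim; the kernel sizes, channel widths, stride and the choice of ReLU as the activation are assumptions (a ResNet-style 1x1-3x3-1x1 bottleneck), not fixed by the patent text.

```python
import torch
import torch.nn as nn

class FirstResidualUnit(nn.Module):
    """Bottleneck with a projection shortcut (fifth convolution + fifth normalization)."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),              # second convolution layer
            nn.BatchNorm2d(mid_ch),                               # second normalization layer
            nn.Conv2d(mid_ch, mid_ch, 3, stride, 1, bias=False),  # third convolution layer
            nn.BatchNorm2d(mid_ch),                               # third normalization layer
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),             # fourth convolution layer
            nn.BatchNorm2d(out_ch),                               # fourth normalization layer
            nn.ReLU(inplace=True),                                # second activation layer
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),      # fifth convolution layer
            nn.BatchNorm2d(out_ch),                               # fifth normalization layer
        )
        self.out_act = nn.ReLU(inplace=True)                      # third activation layer

    def forward(self, x):
        # sum of the two branches feeds the third activation layer
        return self.out_act(self.body(x) + self.shortcut(x))

class SecondResidualUnit(nn.Module):
    """Bottleneck with an identity shortcut (input added directly to the body output)."""
    def __init__(self, ch, mid_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid_ch, 1, bias=False),                 # sixth convolution layer
            nn.BatchNorm2d(mid_ch),                               # sixth normalization layer
            nn.Conv2d(mid_ch, mid_ch, 3, 1, 1, bias=False),       # seventh convolution layer
            nn.BatchNorm2d(mid_ch),                               # seventh normalization layer
            nn.Conv2d(mid_ch, ch, 1, bias=False),                 # eighth convolution layer
            nn.BatchNorm2d(ch),                                   # eighth normalization layer
            nn.ReLU(inplace=True),                                # fourth activation layer
        )
        self.out_act = nn.ReLU(inplace=True)                      # fifth activation layer

    def forward(self, x):
        return self.out_act(self.body(x) + x)
```

With these unit counts, the encoder branch of claim 3 (1 + 2, 1 + 7, 1 + 35, 1 + 2 units) matches a ResNet-152-style layout, though the patent does not name a backbone.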
5. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, characterized in that: the 5 information fusion FM modules have the same structure, specifically:
each comprises a fourth upsampling layer, a twenty-second convolution layer and a twenty-third convolution layer; each information fusion FM module has two inputs and one output; the second input of the information fusion FM module is input into the fourth upsampling layer, the fourth upsampling layer is connected with the twenty-second convolution layer, the output of the twenty-second convolution layer, cascaded with the first input of the information fusion FM module, is input into the twenty-third convolution layer, and the output of the twenty-third convolution layer is used as the output of the information fusion FM module.
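A minimal PyTorch sketch of the FM module of claim 5, under stated assumptions: bilinear upsampling, a 1x1 twenty-second convolution and a 3x3 twenty-third convolution, and channel counts chosen freely — the claim fixes only the layer order and the cascade (channel concatenation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FMModule(nn.Module):
    """Information fusion FM module: upsample the semantic (second) input to the
    resolution of the feature (first) input, reduce it, concatenate, and fuse."""
    def __init__(self, feat_ch, sem_ch, out_ch):
        super().__init__()
        self.conv22 = nn.Conv2d(sem_ch, feat_ch, kernel_size=1)               # twenty-second convolution layer
        self.conv23 = nn.Conv2d(2 * feat_ch, out_ch, kernel_size=3, padding=1)  # twenty-third convolution layer

    def forward(self, feat, sem):
        # fourth upsampling layer: match the spatial size of the feature input
        sem = F.interpolate(sem, size=feat.shape[2:], mode='bilinear', align_corners=False)
        sem = self.conv22(sem)
        # cascade (channel concatenation), then the fusing convolution
        return self.conv23(torch.cat([feat, sem], dim=1))
```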
6. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, characterized in that: the semantic AHLS module comprises two first convolution modules, two first attention mechanism modules, a second convolution module and a second attention mechanism module;
the semantic AHLS module has two inputs; the first input of the semantic AHLS module is connected with one first attention mechanism module after passing through one first convolution module, the second input of the semantic AHLS module is connected with the other first attention mechanism module after passing through the other first convolution module, the cascaded outputs of the two first attention mechanism modules are input into the second convolution module, the second convolution module is connected with the second attention mechanism module, and the output of the second attention mechanism module is used as the output of the semantic AHLS module;
the first convolution module mainly comprises a ninth convolution layer; the first attention mechanism module is mainly formed by sequentially connecting a first pooling layer, a first fully-connected layer, a sixth activation layer, a second fully-connected layer, a seventh activation layer, a third fully-connected layer, an eighth activation layer and a tenth convolution layer;
the second convolution module is mainly formed by sequentially connecting an eleventh convolution layer, a ninth normalization layer and a ninth activation layer; the second attention mechanism module has the same structure as the first attention mechanism module.
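The attention mechanism module of claim 6 (pooling, three fully-connected layers with activations, then a convolution) reads like an SE-style channel-attention block. The sketch below makes that reading explicit; the reduction ratio, the choice of ReLU/Sigmoid activations, and the use of the attention weights to rescale the input are all assumptions not fixed by the claim.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """First attention mechanism module: pool -> FC x3 (with activations) -> 1x1 conv."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # first pooling layer
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(),            # first FC + sixth activation
            nn.Linear(ch // reduction, ch // reduction), nn.ReLU(),  # second FC + seventh activation
            nn.Linear(ch // reduction, ch), nn.Sigmoid(),         # third FC + eighth activation
        )
        self.conv10 = nn.Conv2d(ch, ch, kernel_size=1)            # tenth convolution layer

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).flatten(1)).view(b, c, 1, 1)
        return self.conv10(x * w)  # rescale channels, then project

class AHLSModule(nn.Module):
    """Semantic AHLS module: two attended branches, cascaded, fused, re-attended."""
    def __init__(self, ch):
        super().__init__()
        self.conv_a = nn.Conv2d(ch, ch, 1)    # ninth convolution layer (branch A)
        self.conv_b = nn.Conv2d(ch, ch, 1)    # ninth convolution layer (branch B)
        self.att_a = ChannelAttention(ch)
        self.att_b = ChannelAttention(ch)
        self.conv2 = nn.Sequential(           # second convolution module
            nn.Conv2d(2 * ch, ch, 3, padding=1, bias=False),  # eleventh convolution layer
            nn.BatchNorm2d(ch),                               # ninth normalization layer
            nn.ReLU(inplace=True),                            # ninth activation layer
        )
        self.att_out = ChannelAttention(ch)   # second attention mechanism module

    def forward(self, xa, xb):
        a = self.att_a(self.conv_a(xa))
        b = self.att_b(self.conv_b(xb))
        return self.att_out(self.conv2(torch.cat([a, b], dim=1)))
```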
7. The road scene image semantic segmentation method based on the multi-supervision network according to claim 2, wherein the 5 feature fusion AMMF modules have the same structure, specifically:
each comprises a first blending module, a second blending module, a twelfth convolution layer, a first upsampling layer and a thirteenth convolution layer;
the first input of the feature fusion AMMF module is connected with the first coding module, the second coding module, the third coding module, the fourth coding module or the fifth coding module, the second input of the feature fusion AMMF module is connected with the sixth coding module, the seventh coding module, the eighth coding module, the ninth coding module or the tenth coding module, and the third input of the feature fusion AMMF module is connected with the semantic AHLS module, the first fusion feature output, the second fusion feature output, the third fusion feature output or the fourth fusion feature output;
the first input and the second input of the feature fusion AMMF module are multiplied, and the product is used as a first fused output; the first fused output is multiplied again, and the product is used as a second fused output; the second fused output, after cascading, is input into the first blending module together with the third input of the feature fusion AMMF module; the output of the first blending module is input into the second blending module; the second blending module is connected with the thirteenth convolution layer after sequentially passing through the twelfth convolution layer and the first upsampling layer, and the output of the thirteenth convolution layer is used as the output of the feature fusion AMMF module;
the first blending module and the second blending module have the same structure and each comprises a fourteenth convolution layer, a tenth normalization layer, a tenth activation layer, a fifteenth convolution layer and an eleventh normalization layer;
the fourteenth convolution layer is connected with the eleventh normalization layer after sequentially passing through the tenth normalization layer, the tenth activation layer and the fifteenth convolution layer; the input of the blending module is the input of the fourteenth convolution layer; the input of the fourteenth convolution layer is added to the output of the eleventh normalization layer, the sum is input into the tenth activation layer, and the output of the tenth activation layer is used as the output of the blending module.
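The blending module of claim 7 is a plain two-convolution residual block. A sketch under stated assumptions (3x3 kernels, ReLU as the tenth activation):

```python
import torch
import torch.nn as nn

class BlendingModule(nn.Module):
    """Blending module of the AMMF block: conv-BN-ReLU-conv-BN with an identity shortcut."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),  # fourteenth convolution layer
            nn.BatchNorm2d(ch),                           # tenth normalization layer
            nn.ReLU(inplace=True),                        # tenth activation layer
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),  # fifteenth convolution layer
            nn.BatchNorm2d(ch),                           # eleventh normalization layer
        )
        self.out_act = nn.ReLU(inplace=True)              # tenth activation layer, reused per the claim

    def forward(self, x):
        # add the module input to the normalized body output, then activate
        return self.out_act(self.body(x) + x)
```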
8. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, wherein the 3 semantic supervision MLF modules have the same structure, specifically:
each comprises a second upsampling layer, a sixteenth convolution layer, a third upsampling layer and a seventeenth convolution layer; the first input of the semantic supervision MLF module is connected with the semantic AHLS module, the second fusion feature output or the fourth fusion feature output, and the second input of the semantic supervision MLF module is connected with the first fusion feature output, the third fusion feature output or the fifth fusion feature output;
the first input of the semantic supervision MLF module is connected with the sixteenth convolution layer after passing through the second upsampling layer, the output of the sixteenth convolution layer, cascaded with the second input of the semantic supervision MLF module, is connected with the seventeenth convolution layer after passing through the third upsampling layer, and the output of the seventeenth convolution layer is used as the output of the semantic supervision MLF module.
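A PyTorch sketch of the MLF module of claim 8. Bilinear upsampling, the 1x1/3x3 kernel choices, the channel widths, and projecting the final convolution to class scores are assumptions; the claim fixes only the upsample-convolve-cascade-upsample-convolve order.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLFModule(nn.Module):
    """Semantic supervision MLF module: upsample and reduce the first input,
    cascade with the second input, upsample again, project to predictions."""
    def __init__(self, in1_ch, in2_ch, n_classes, scale2=2):
        super().__init__()
        self.scale2 = scale2
        self.conv16 = nn.Conv2d(in1_ch, in2_ch, kernel_size=1)                  # sixteenth convolution layer
        self.conv17 = nn.Conv2d(2 * in2_ch, n_classes, kernel_size=3, padding=1)  # seventeenth convolution layer

    def forward(self, x1, x2):
        # second upsampling layer: bring the first input to the second input's resolution
        x1 = F.interpolate(x1, size=x2.shape[2:], mode='bilinear', align_corners=False)
        x = torch.cat([self.conv16(x1), x2], dim=1)  # cascade
        # third upsampling layer, then the prediction convolution
        x = F.interpolate(x, scale_factor=self.scale2, mode='bilinear', align_corners=False)
        return self.conv17(x)
```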
9. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, wherein the multi-task supervision RM module comprises an eighteenth convolution layer, a nineteenth convolution layer, a twentieth convolution layer, an exp function layer, a first multi-task module, a second multi-task module and a third multi-task module;
the input of the multi-task supervision RM module is the input of the eighteenth convolution layer; the output of the eighteenth convolution layer after passing through the first multi-task module is used as the first output of the multi-task supervision RM module; the first output of the multi-task supervision RM module is passed through the exp function layer and multiplied by the output of the eighteenth convolution layer, and the product is input into the second multi-task module; the output of the second multi-task module after passing through the nineteenth convolution layer is used as the third output of the multi-task supervision RM module; the output of the second multi-task module, cascaded with the output of the eighteenth convolution layer, is connected with the twentieth convolution layer after passing through the third multi-task module, and the output of the twentieth convolution layer is used as the second output of the multi-task supervision RM module;
the first multi-task module, the second multi-task module and the third multi-task module have the same structure and are mainly formed by sequentially connecting a twenty-first convolution layer, a twelfth normalization layer and a twelfth activation layer.
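The dataflow of claim 9 can be sketched as follows. Channel widths, kernel sizes, and the mapping of heads to tasks (first output as the foreground-background branch, third as the boundary branch, second as the semantic branch, per claim 10's output ordering) are assumptions; only the exp-gated multiplication and the cascade path come from the claim.

```python
import torch
import torch.nn as nn

class MultiTaskBlock(nn.Module):
    """Shared multi-task module: twenty-first convolution, twelfth normalization,
    twelfth activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class RMModule(nn.Module):
    """Multi-task supervision RM module of claim 9."""
    def __init__(self, in_ch, mid_ch, n_classes):
        super().__init__()
        self.conv18 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)   # eighteenth convolution layer
        self.task1 = MultiTaskBlock(mid_ch, mid_ch)            # first multi-task module
        self.task2 = MultiTaskBlock(mid_ch, mid_ch)            # second multi-task module
        self.task3 = MultiTaskBlock(2 * mid_ch, mid_ch)        # third multi-task module
        self.conv19 = nn.Conv2d(mid_ch, 1, 1)                  # nineteenth convolution layer (boundary head)
        self.conv20 = nn.Conv2d(mid_ch, n_classes, 1)          # twentieth convolution layer (semantic head)

    def forward(self, x):
        f = self.conv18(x)
        out1 = self.task1(f)                   # first output (foreground-background branch)
        gated = torch.exp(out1) * f            # exp function layer, then element-wise product
        t2 = self.task2(gated)
        out3 = self.conv19(t2)                 # third output
        out2 = self.conv20(self.task3(torch.cat([t2, f], dim=1)))  # second output
        return out1, out2, out3
```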
10. The method as claimed in claim 1, wherein the size of each prediction map in the seven prediction map sets is the same as the size of the original road scene RGB image, and the seven prediction map sets comprise a semantic segmentation prediction map set J_pre1, a high-level semantic prediction map set J_pre2, a middle-level semantic prediction map set J_pre3, a low-level semantic prediction map set J_pre4, a foreground-background prediction map set J_pre5, a boundary prediction map set J_pre6 and a semantic prediction map set J_pre7;
the semantic segmentation prediction map set J_pre1 consists of 9 semantic segmentation prediction maps f_final from the first output of the convolutional neural network; the high-level semantic prediction map set J_pre2 consists of 9 high-level semantic prediction maps f_high from the second output of the convolutional neural network; the middle-level semantic prediction map set J_pre3 consists of 9 middle-level semantic prediction maps f_mid from the third output of the convolutional neural network; the low-level semantic prediction map set J_pre4 consists of 9 low-level semantic prediction maps f_low from the fourth output of the convolutional neural network; the foreground-background prediction map set J_pre5 consists of 9 foreground-background prediction maps f_bin from the fifth output of the convolutional neural network; the boundary prediction map set J_pre6 consists of 9 boundary prediction maps f_bou from the sixth output of the convolutional neural network; the semantic prediction map set J_pre7 consists of 9 semantic prediction maps f_sem from the seventh output of the convolutional neural network; the seventh output of the convolutional neural network is used as the semantic segmentation prediction map extracted in the test stage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110823118.4A CN113362349A (en) | 2021-07-21 | 2021-07-21 | Road scene image semantic segmentation method based on multi-supervision network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113362349A true CN113362349A (en) | 2021-09-07 |
Family
ID=77540049
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110823118.4A Pending CN113362349A (en) | 2021-07-21 | 2021-07-21 | Road scene image semantic segmentation method based on multi-supervision network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113362349A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190164290A1 (en) * | 2016-08-25 | 2019-05-30 | Intel Corporation | Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation |
CN112991350A (en) * | 2021-02-18 | 2021-06-18 | 西安电子科技大学 | RGB-T image semantic segmentation method based on modal difference reduction |
CN112991351A (en) * | 2021-02-23 | 2021-06-18 | 新华三大数据技术有限公司 | Remote sensing image semantic segmentation method and device and storage medium |
CN112991364A (en) * | 2021-03-23 | 2021-06-18 | 浙江科技学院 | Road scene semantic segmentation method based on convolution neural network cross-modal fusion |
Non-Patent Citations (2)
Title |
---|
WANG Ziyu; ZHANG Yingmin; CHEN Yongbin; WANG Guitang: "Optimization of an indoor scene semantic segmentation network based on RGB-D images", Automation & Information Engineering, no. 02, 15 April 2020 (2020-04-15) *
QING Chen; YU Jing; XIAO Chuangbai; DUAN Juan: "Research progress on image semantic segmentation with deep convolutional neural networks", Journal of Image and Graphics, no. 06, 16 June 2020 (2020-06-16) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||