CN113362349A - Road scene image semantic segmentation method based on multi-supervision network

Info

Publication number
CN113362349A
Authority
CN
China
Prior art keywords: module, output, layer, input, semantic
Prior art date
Legal status
Pending
Application number
CN202110823118.4A
Other languages
Chinese (zh)
Inventor
周武杰
董少华
强芳芳
许彩娥
Current Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Zhejiang University of Science and Technology ZUST
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110823118.4A
Publication of CN113362349A

Classifications

    • G06T 7/11: Region-based segmentation (under G06T 7/10 Segmentation; Edge detection)
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Abstract

The invention discloses a road scene image semantic segmentation method based on a multi-supervision network. The method comprises a training stage and a testing stage and includes the following steps: selecting a plurality of original road scene RGB images, the corresponding original Thermal infrared images and the corresponding real semantic segmentation images, and preprocessing them to form a training set; constructing a convolutional neural network; inputting the training set into the convolutional neural network for training, the convolutional neural network outputting seven sets of prediction maps; calculating a final loss value; repeating the training steps multiple times to obtain a convolutional neural network classification training model; and inputting the road scene RGB images to be semantically segmented, together with the corresponding original Thermal infrared images, into the model to obtain the corresponding semantic segmentation prediction maps. The method improves the efficiency and accuracy of semantic segmentation of RGB-T road scene images.

Description

Road scene image semantic segmentation method based on multi-supervision network
Technical Field
The invention relates to a road scene semantic segmentation method based on deep learning, in particular to a road scene image semantic segmentation method based on a multi-supervision network.
Background
With the rise of technologies such as autonomous driving, scene understanding and virtual reality, semantic segmentation of images has gradually become a research hotspot for computer vision and machine learning researchers; with the help of semantic segmentation, traffic scene understanding, multi-target obstacle detection and visual navigation become possible. Currently, common semantic segmentation methods include support vector machines, random forests and other algorithms. These algorithms focus primarily on binary tasks for detecting and identifying specific objects, such as road surfaces, vehicles and pedestrians. Traditional machine learning methods are usually built on hand-crafted, high-complexity features, whereas deep learning makes semantic segmentation of traffic scenes simpler and more convenient and, more importantly, greatly improves the accuracy of pixel-level image classification.
Deep-learning semantic segmentation methods perform end-to-end segmentation directly at the pixel level: the images in the training set are fed into a model framework for training to obtain the weights and the model, and predictions can then be made on the test set. The power of a convolutional neural network lies in its multi-layer structure, which can automatically learn features at multiple levels of abstraction. Current deep-learning semantic segmentation methods fall into two categories. The first is the encoder-decoder architecture: during encoding, position information is gradually discarded through pooling layers while abstract features are extracted, and during decoding the position information is gradually recovered, usually with direct connections between the decoder and the encoder. The second framework uses dilated (atrous) convolution to enlarge the receptive field: a dilated convolution with a smaller dilation rate has a smaller receptive field and learns specific local features, whereas one with a larger dilation rate has a larger receptive field and learns more abstract features that are more robust to the size, position and orientation of objects.
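For illustration (not part of the patent), a minimal PyTorch example of the dilated convolution idea described above; it shows that a larger dilation rate enlarges the receptive field without changing the output resolution, and the tensor sizes are arbitrary:

import torch
import torch.nn as nn

# Illustrative only: two 3x3 convolutions with different dilation rates. With dilation=1
# the kernel covers a 3x3 neighbourhood; with dilation=4 the same nine weights cover a
# 9x9 neighbourhood, enlarging the receptive field without extra parameters or pooling.
x = torch.randn(1, 64, 60, 80)                                    # N, C, H, W feature map
conv_small = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_large = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)
print(conv_small(x).shape, conv_large(x).shape)                   # both keep the 60 x 80 resolution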
Most existing road scene semantic segmentation methods adopt deep learning, and a large number of models are built by combining convolutional layers and pooling layers. However, the feature maps obtained simply by pooling and convolution operations are monotonous and not representative, so the feature information extracted from the image is reduced, the recovered details are coarse, and the segmentation accuracy is low.
Disclosure of Invention
The invention aims to solve the technical problem of providing a road scene image semantic segmentation method based on a multi-supervision network, which has high segmentation efficiency and high segmentation accuracy.
The technical solution adopted by the invention to solve the above technical problem is as follows: a road scene image semantic segmentation method based on a multi-supervision network, comprising a training stage and a testing stage.
The specific steps of the training stage process are as follows:
step 1_ 1: selecting a plurality of original road scene RGB images, the corresponding original Thermal infrared images and the corresponding real semantic segmentation images, and carrying out data enhancement on each original road scene RGB image and each original Thermal infrared image through cropping, brightness adjustment and flipping to obtain the initial road scene RGB images and initial Thermal infrared images; a training set is formed by the plurality of initial road scene RGB images, the plurality of initial Thermal infrared images and the corresponding real semantic segmentation images;
step 1_ 2: constructing a convolutional neural network;
step 1_ 3: inputting the training set into a convolutional neural network for training, wherein the convolutional neural network outputs seven prediction image sets corresponding to each original road scene RGB image in the training set;
step 1_ 4: processing the real semantic segmentation image corresponding to each original road scene RGB image into 9 one-hot encoded images, and recording the set of the 9 one-hot encoded images as Jtrue; respectively calculating the loss function values between the set Jtrue and the seven corresponding prediction map sets, and taking the sum of the seven loss function values as the final loss value;
step 1_ 5: repeating step 1_3 and step 1_4 V times until the convergence of the convolutional neural network saturates, so as to obtain the convolutional neural network classification training model; the weight vectors and biases of the network obtained at this point are taken as the optimal weight vectors and optimal bias terms of the convolutional neural network classification training model;
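A minimal sketch of the training procedure of steps 1_3 to 1_5 is given below; the network class, the data loader, the use of cross-entropy as the per-output loss and the SGD hyper-parameters are assumptions made for illustration, since the patent does not specify them:

import torch
import torch.nn as nn

def train_network(model, train_loader, V=300, lr=0.01):
    """Sketch of steps 1_3 to 1_5. `model` is assumed to return the seven prediction map
    sets for an (RGB, Thermal) input pair; the loss and optimizer choices are assumptions."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(V):                                            # repeat V times (step 1_5)
        for rgb, thermal, label in train_loader:                  # label: per-pixel class indices
            outputs = model(rgb, thermal)                         # seven prediction map sets (step 1_3)
            loss = sum(criterion(out, label) for out in outputs)  # final loss = sum of the seven losses (step 1_4)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model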
the test stage process comprises the following specific steps:
step 2: inputting a plurality of original road scene RGB images to be semantically segmented and original Thermal infrared images into a convolutional neural network classification training model, and predicting by using an optimal weight vector and an optimal bias term to obtain a corresponding semantic segmentation prediction graph.
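A corresponding sketch of the test stage, assuming the trained model and a weight file saved after step 1_5 (the file name is an assumption); the seventh network output is taken as the semantic segmentation prediction, as stated later in the description of the seven prediction map sets:

import torch

def predict(model, rgb, thermal, weight_path="best_multisupervision_weights.pth"):
    """Load the optimal weights and return the per-pixel class prediction."""
    model.load_state_dict(torch.load(weight_path))
    model.eval()
    with torch.no_grad():
        outputs = model(rgb, thermal)
    return outputs[6].argmax(dim=1)       # the seventh output is the semantic segmentation prediction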
The convolutional neural network comprises an encoding module and a decoding module, wherein the encoding module is connected with the decoding module;
the encoding module comprises 10 encoding modules, and the decoding module comprises a semantic AHLS module, a multitask supervision RM module, 5 information fusion FM modules, 5 feature fusion AMMF modules and 3 semantic supervision MLF modules;
the first coding module is connected with the semantic AHLS module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the sixth coding module is connected with the semantic AHLS module after sequentially passing through a seventh coding module, an eighth coding module, a ninth coding module and a tenth coding module, the input of the first coding module is an initial road scene RGB image, and the input of the sixth coding module is an initial Thermal infrared image;
the fifth coding module, the semantic AHLS module and the tenth coding module are connected with the first feature fusion AMMF module, the output of the first feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the first information fusion FM module, the output of the first information fusion FM module and the output of the semantic AHLS module are simultaneously input into the first semantic supervision MLF module, and the output of the first semantic supervision MLF module is used as the second output of the convolutional neural network;
the output of the fourth coding module, the output of the first information fusion FM module and the output of the ninth coding module are input into the second feature fusion AMMF module, the output of the second feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the second information fusion FM module, the output of the third coding module and the output of the eighth coding module are simultaneously input into the third feature fusion AMMF module, the output of the third feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the third information fusion FM module, the output of the second information fusion FM module and the output of the third information fusion FM module are simultaneously input into the second semantic supervision MLF module, and the output of the second semantic supervision MLF module is used as the third output of the convolutional neural network;
the output of the second coding module, the output of the third information fusion FM module and the output of the seventh coding module are input into a fourth feature fusion AMMF module, the output of the fourth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fourth information fusion FM module, the output of the first coding module and the output of the sixth coding module are input into a fifth feature fusion AMMF module, the output of the fifth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fifth information fusion FM module, the output of the fourth information fusion FM module and the output of the fifth information fusion FM module are simultaneously input into a third semantic supervision MLF module, and the output of the third semantic supervision MLF module is used as the fourth output of the convolutional neural network; the output of the fifth information fusion FM module is used as the first output of the convolutional neural network, the output of the fifth information fusion FM module is input to the multitask supervision RM module, and the first output, the second output and the third output of the multitask supervision RM module are respectively used as the fifth output, the sixth output and the seventh output of the convolutional neural network.
The first coding module and the sixth coding module have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first batch normalization layer and a first activation layer;
the second coding module and the seventh coding module have the same structure and are mainly formed by sequentially connecting a first downsampling layer, a first residual error unit and two second residual error units;
the third coding module and the eighth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and seven second residual error units;
the fourth coding module and the ninth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and 35 second residual error units;
the fifth coding module and the tenth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and two second residual error units.
The first residual error unit comprises a second convolution layer, a second normalization layer, a third convolution layer, a third normalization layer, a fourth convolution layer, a fourth normalization layer, a second active layer, a fifth convolution layer, a fifth normalization layer and a third active layer;
the second convolution layer is connected with the second active layer after sequentially passing through the second normalization layer, the third convolution layer, the third normalization layer, the fourth convolution layer and the fourth normalization layer, the input of the first residual error unit is the input of the second convolution layer, the input of the second convolution layer is also input into the fifth convolution layer, the fifth convolution layer is connected with the fifth normalization layer, the output of the second active layer and the output of the fifth normalization layer are added, the output is input into the third active layer, and the output of the third active layer is used as the output of the first residual error unit;
the second residual error unit comprises a sixth convolution layer, a sixth normalization layer, a seventh convolution layer, a seventh normalization layer, an eighth convolution layer, an eighth normalization layer, a fourth active layer and a fifth active layer;
the sixth convolution layer is connected with the fourth active layer after sequentially passing through the sixth normalization layer, the seventh convolution layer, the seventh normalization layer, the eighth convolution layer and the eighth normalization layer, the input of the second residual error unit is the input of the sixth convolution layer, the output of the fourth active layer added with the input of the sixth convolution layer is input into the fifth active layer, and the output of the fifth active layer is used as the output of the second residual error unit.
The 5 information fusion FM modules have the same structure, and specifically comprise:
comprises a fourth upsampling layer, a twenty-second convolutional layer and a twenty-third convolutional layer; each information fusion FM module has two inputs and one output, the second input of the information fusion FM module is input into a fourth up-sampling layer, the fourth up-sampling layer is connected with a twenty-second convolution layer, the output of the twenty-second convolution layer and the output of the information fusion FM module after being cascaded are input into a twenty-third convolution layer, and the output of the twenty-third convolution layer is used as the output of the information fusion FM module.
The semantic AHLS module comprises two first convolution modules, two first attention mechanism modules, a second convolution module and a second attention mechanism module;
the semantic AHLS module is provided with two inputs, a first input of the semantic AHLS module is connected with one first attention mechanism module after passing through one first convolution module, a second input of the semantic AHLS module is connected with the other first attention mechanism module after passing through the other first convolution module, the output of the two first attention mechanism modules after being cascaded is input into the second convolution module, the second convolution module is connected with the second attention mechanism module, and the output of the second attention mechanism module is used as the output of the semantic AHLS module;
the first convolution module mainly comprises a ninth convolution layer; the first attention mechanism module is mainly formed by sequentially connecting a first pooling layer, a first full-connection layer, a sixth activation layer, a second full-connection layer, a seventh activation layer, a third full-connection layer, an eighth activation layer and a tenth convolution layer;
the second convolution module is mainly formed by sequentially connecting an eleventh convolution layer, a ninth normalization layer and a ninth active layer; the second attention mechanism module is identical in structure to the first attention mechanism module.
The 5 feature fusion AMMF modules have the same structure, and specifically comprise:
a first blending module, a second blending module, a twelfth convolution layer, a first upsampling layer and a thirteenth convolution layer;
the first input of the feature fusion AMMF module is connected with the first coding module, the second coding module, the third coding module, the fourth coding module or the fifth coding module, the second input of the feature fusion AMMF module is connected with the sixth coding module, the seventh coding module, the eighth coding module, the ninth coding module or the tenth coding module, and the third input of the feature fusion AMMF module is connected with the semantic AHLS module, the first fusion feature output, the second fusion feature output, the third fusion feature output or the fourth fusion feature output;
the output of the feature fusion AMMF module after multiplication is used as a first fused output, the output of the first fused output after multiplication is used as a second fused output, the output of the second fused output after cascade connection is used as a second input of the feature fusion AMMF module, the output of the first blending module after cascade connection is used as a third input of the feature fusion AMMF module, the output of the first blending module and the output of the feature fusion AMMF module are input into the second blending module, the first blending module is connected with the thirteenth convolution layer after sequentially passing through the twelfth convolution layer and the first upsampling layer, and the output of the thirteenth convolution layer is used as the output of the feature fusion AMMF module;
the first blending module and the second blending module have the same structure and comprise a fourteenth convolution layer, a tenth normalization layer, a tenth activation layer, a fifteenth convolution layer, an eleventh normalization layer and a tenth activation layer;
the fourteenth convolution layer is connected with the eleventh normalization layer after sequentially passing through the tenth normalization layer, the tenth active layer and the fifteenth convolution layer, the input of the blending module is the input of the fourteenth convolution layer, the output of the fourteenth convolution layer after the input of the fourteenth convolution layer is added with the output of the eleventh normalization layer is input into the tenth active layer, and the output of the tenth active layer is used as the output of the blending module.
The 3 semantic supervision MLF modules have the same structure, and specifically comprise:
a second upsampling layer, a sixteenth convolution layer, a third upsampling layer and a seventeenth convolution layer; the first input of the semantic supervision MLF module is connected with the semantic AHLS module, the second fusion feature output or the fourth fusion feature output, and the second input of the semantic supervision MLF module is connected with the first fusion feature output, the third fusion feature output or the fifth fusion feature output;
the first input of the semantic supervised MLF module is connected with the sixteenth convolution layer after passing through the second up-sampling layer, the output of the sixteenth convolution layer and the output of the semantic supervised MLF module after being cascaded are connected with the seventeenth convolution layer after passing through the third up-sampling layer, and the output of the seventeenth convolution layer is used as the output of the semantic supervised MLF module.
The multitask supervision RM module comprises an eighteenth convolutional layer, a nineteenth convolutional layer, a twentieth convolutional layer, an exp function layer, a first multitask module, a second multitask module and a third multitask module;
the input of the multitask supervision RM module is the input of an eighteenth convolutional layer, the output of the eighteenth convolutional layer after passing through the first multitask module is used as the first output of the multitask supervision RM module, the output of the first output of the multitask supervision RM module after being multiplied by the output of the exp function layer and the output of the eighteenth convolutional layer is input into a second multitask module, the output of the second multitask module after passing through a nineteenth convolutional layer is used as the third output of the multitask supervision RM module, the output of the second multitask module after being cascaded with the output of the eighteenth convolutional layer is connected with a twentieth convolutional layer after passing through a third multitask module, and the output of the twentieth convolutional layer is used as the second output of the multitask supervision RM module;
the first multitask module, the second multitask module and the third multitask module are identical in structure and mainly formed by sequentially connecting a twenty-first convolution layer, a twelfth normalization layer and a twelfth active layer.
The size of each prediction map in the seven prediction map sets is the same as that of the RGB image of the initial road scene, and the seven prediction map sets comprise a semantic segmentation prediction map set Jpre1, a high-level semantic prediction map set Jpre2, a middle-level semantic prediction map set Jpre3, a low-level semantic prediction map set Jpre4, a foreground-background prediction map set Jpre5, a boundary prediction map set Jpre6 and a semantic prediction map set Jpre 7;
the set of semantic segmentation prediction maps Jpre1 is 9 semantic segmentation prediction maps f output by the first output of the convolutional neural networkfinalThe high-level semantic prediction graph set Jpre2 is formed by convolution9 high-level semantic prediction graph f of second output of neural networkhighThe middle-level semantic prediction graph set Jpre3 is composed of 9 middle-level semantic prediction graphs f of the third output of the convolutional neural networkmidComposed, the low-level semantic prediction graph set Jpre4 is 9 low-level semantic prediction graphs f output by the fourth output of the convolutional neural networklowThe set of foreground-background prediction maps Jpre5 is composed of 9 foreground-background prediction maps f output by the fifth output of the convolutional neural networkbinThe set of boundary prediction maps Jpre6 is composed of 9 boundary prediction maps f output from the sixth output of the convolutional neural networkbouThe semantic prediction graph set Jpre7 is composed of 9 semantic prediction graphs f output by the seventh output of the convolutional neural networksemAnd the seventh output of the convolutional neural network is used as a semantic segmentation prediction graph output in the testing stage process.
Compared with the prior art, the invention has the advantages that:
1) the method comprises the steps of constructing a convolutional neural network, inputting a road scene RGBT image in a training set into the convolutional neural network for training, and obtaining a convolutional neural network classification training model; the road scene RGBT image to be semantically segmented is input into a convolutional neural network classification training model, and a predicted semantic segmentation image corresponding to the road scene image is obtained through prediction.
2) The method adopts multi-task supervision, and performs semantic supervision, boundary supervision and foreground-background supervision on the output segmentation image respectively, thereby effectively improving the semantic segmentation precision.
3) The method divides the coding part network into three parts of high, middle and low, and carries out semantic supervision on the output prediction graph in the three parts, thereby obtaining good segmentation effect on a training set and a test set.
4) The method of the invention fully utilizes high-level semantics, combines the high-level semantics with low-level information, and fully utilizes information of each layer of the network, so that the segmentation result is more accurate.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a block diagram of an implementation of an advanced semantic AHLS module.
Fig. 3 is a block diagram of an implementation of a decoding stage feature fusion AMMF module.
Fig. 4 is a block diagram of an implementation of the high level semantic and low level information fusion FM module at the decoding stage.
Fig. 5 is a block diagram of an implementation of a semantic supervision MLF module of a high, medium, and low layers.
FIG. 6 is a block diagram of an implementation of a multitasking supervision RM module.
FIG. 7a is a first original road scene RGB image;
FIG. 7b is a segmented image obtained by segmenting the first RGB image of the original road scene shown in FIG. 7a according to the present invention;
FIG. 8a is a second original road scene RGB image;
FIG. 8b is a segmented image obtained by segmenting the second original road scene RGB image shown in FIG. 8a using the method of the present invention;
FIG. 9a is a third original road scene RGB image;
FIG. 9b is a segmented image obtained by segmenting the third RGB image of the original road scene shown in FIG. 9a according to the present invention;
FIG. 10a is a fourth original road scene RGB image;
FIG. 10b is a segmented image obtained by segmenting the fourth RGB image of the original road scene shown in FIG. 10a using the method of the present invention;
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The invention provides an RGBT road scene semantic segmentation method based on a multi-supervision network, which has a general implementation block diagram as shown in FIG. 1 and comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting a plurality of original road scene RGB images, the corresponding original Thermal infrared images and the corresponding real semantic segmentation images; in this specific implementation, 784 original road scene RGB images, the corresponding original Thermal infrared images and the corresponding real semantic segmentation images are selected. The original road scene RGB images and the corresponding original Thermal infrared images serve as the original road scene images; the set of original road scene images is denoted {J(i, j)}, and the set of corresponding real semantic segmentation images is denoted {Jtrue(i, j)}. The real semantic segmentation image corresponding to each original road scene RGB image in the training set is then processed into 9 one-hot encoded images using the existing one-hot encoding technique, and the set formed by the 9 one-hot encoded images is denoted Jtrue. The original road scene images have a height of 480 and a width of 640, with 1 ≤ i ≤ 640 and 1 ≤ j ≤ 480; J(i, j) denotes the pixel value of the pixel at coordinate position (i, j) in the set {J(i, j)} of original road scene images, and Jtrue(i, j) denotes the pixel value of the pixel at coordinate position (i, j) in the set {Jtrue(i, j)} of real semantic segmentation images. Data enhancement is applied to each original road scene RGB image and each original Thermal infrared image through cropping, brightness adjustment and flipping to obtain the initial road scene RGB images and initial Thermal infrared images, where the cropping keeps the image size unchanged by filling the cropped-away part with 0. The batch size is 4, and the training set is formed by the plurality of initial road scene RGB images, initial Thermal infrared images and the corresponding real semantic segmentation images;
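A minimal sketch of the label encoding and data augmentation described in step 1_1; the one-hot conversion follows the 9-class setting, while the augmentation strengths, the crop window size and the restriction of brightness adjustment to the RGB image are assumptions:

import numpy as np

def to_one_hot(label_map, num_classes=9):
    """Turn a 480 x 640 map of class indices into 9 binary one-hot maps (the set Jtrue)."""
    return np.stack([(label_map == c).astype(np.float32) for c in range(num_classes)])

def augment(rgb, thermal, brightness=0.2, flip_prob=0.5):
    """Brightness jitter and horizontal flip applied consistently to an RGB/Thermal pair."""
    rgb = np.clip(rgb * (1.0 + np.random.uniform(-brightness, brightness)), 0, 255)
    if np.random.rand() < flip_prob:
        rgb, thermal = rgb[:, ::-1].copy(), thermal[:, ::-1].copy()
    return rgb, thermal

def crop_keep_size(image, crop_h=400, crop_w=520):
    """Size-preserving crop: keep a random crop_h x crop_w window and fill the removed
    region with 0, so the 480 x 640 image size is unchanged."""
    out = np.zeros_like(image)
    top = np.random.randint(0, image.shape[0] - crop_h + 1)
    left = np.random.randint(0, image.shape[1] - crop_w + 1)
    out[top:top + crop_h, left:left + crop_w] = image[top:top + crop_h, left:left + crop_w]
    return out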
step 1_ 2: constructing a convolutional neural network;
as shown in fig. 1, the convolutional neural network includes two parts, namely an encoding module and a decoding module, which are respectively used for performing feature extraction operation and upsampling operation on an initial road scene RGB image and a corresponding initial Thermal infrared image, and the encoding module is connected with the decoding module;
the encoding module comprises 10 encoding modules, and the decoding module comprises a semantic AHLS module, a multitask supervision RM module, 5 information fusion FM modules, 5 feature fusion AMMF modules and 3 semantic supervision MLF modules; the semantic AHLS module is used for generating high-level semantics; the characteristic fusion AMMF module is used for fusing RGB information, Thermal information and the previous-stage output information; the semantic supervision MLF module is used for fusing high-level, middle-level and low-level semantic information; the RM module is used for semantic supervision, boundary supervision and foreground-background supervision. The information fusion FM module is used for fusing high-level semantics and low-level information.
The first coding module is connected with the semantic AHLS module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the sixth coding module is connected with the semantic AHLS module after sequentially passing through a seventh coding module, an eighth coding module, a ninth coding module and a tenth coding module, the input of the first coding module is an initial road scene RGB image, and the input of the sixth coding module is an initial Thermal infrared image;
the fifth coding module, the semantic AHLS module and the tenth coding module are connected with the first feature fusion AMMF module, the output of the first feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the first information fusion FM module, the output of the first information fusion FM module and the output of the semantic AHLS module are simultaneously input into the first semantic supervision MLF module, and the output of the first semantic supervision MLF module is used as the second output of the convolutional neural network;
the output of the fourth coding module, the output of the first information fusion FM module and the output of the ninth coding module are input into the second feature fusion AMMF module, the output of the second feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the second information fusion FM module, the output of the third coding module and the output of the eighth coding module are simultaneously input into the third feature fusion AMMF module, the output of the third feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the third information fusion FM module, the output of the second information fusion FM module and the output of the third information fusion FM module are simultaneously input into the second semantic supervision MLF module, and the output of the second semantic supervision MLF module is used as the third output of the convolutional neural network;
the output of the second coding module, the output of the third information fusion FM module and the output of the seventh coding module are input into a fourth feature fusion AMMF module, the output of the fourth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fourth information fusion FM module, the output of the first coding module and the output of the sixth coding module are input into a fifth feature fusion AMMF module, the output of the fifth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fifth information fusion FM module, the output of the fourth information fusion FM module and the output of the fifth information fusion FM module are simultaneously input into a third semantic supervision MLF module, and the output of the third semantic supervision MLF module is used as the fourth output of the convolutional neural network; the output of the fifth information fusion FM module is used as the first output of the convolutional neural network, the output of the fifth information fusion FM module is input to the multitask supervision RM module, and the first output, the second output and the third output of the multitask supervision RM module are respectively used as the fifth output, the sixth output and the seventh output of the convolutional neural network.
The first coding module and the sixth coding module have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first batch normalization layer and a first activation layer;
the second coding module and the seventh coding module have the same structure and are mainly formed by sequentially connecting a first downsampling layer, a first residual error unit and two second residual error units; the first downsampling layer is specifically max-pooling downsampling.
The third coding module and the eighth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and seven second residual error units;
the fourth coding module and the ninth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and 35 second residual error units;
the fifth coding module and the tenth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and two second residual error units.
The first residual error unit comprises a second convolution layer, a second normalization layer, a third convolution layer, a third normalization layer, a fourth convolution layer, a fourth normalization layer, a second active layer, a fifth convolution layer, a fifth normalization layer and a third active layer;
the second convolution layer is connected with the second active layer after sequentially passing through the second normalization layer, the third convolution layer, the third normalization layer, the fourth convolution layer and the fourth normalization layer, the input of the first residual error unit is the input of the second convolution layer, the input of the second convolution layer is also input into the fifth convolution layer, the fifth convolution layer is connected with the fifth normalization layer, the output of the second active layer and the output of the fifth normalization layer are added, the output is input into the third active layer, and the output of the third active layer is used as the output of the first residual error unit. The activation function of the third activation layer is a Relu activation function.
The second residual error unit comprises a sixth convolution layer, a sixth normalization layer, a seventh convolution layer, a seventh normalization layer, an eighth convolution layer, an eighth normalization layer, a fourth active layer and a fifth active layer;
the sixth convolution layer is connected with the fourth active layer after sequentially passing through the sixth normalization layer, the seventh convolution layer, the seventh normalization layer, the eighth convolution layer and the eighth normalization layer, the input of the second residual error unit is the input of the sixth convolution layer, the output of the fourth active layer added with the input of the sixth convolution layer is input into the fifth active layer, and the output of the fifth active layer is used as the output of the second residual error unit. The activation function of the fifth activation layer is a Relu activation function.
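A possible PyTorch rendering of the two residual units described above; the bottleneck width mid_ch is left as a parameter because the text specifies only the per-module output channel counts, and the single activation at the end of the main branch follows the description:

import torch
import torch.nn as nn

class FirstResidualUnit(nn.Module):
    """Bottleneck residual unit with a projection shortcut: three conv+BN stages and an
    activation on the main branch, a 1x1 conv+BN on the shortcut, then add and ReLU."""
    def __init__(self, in_ch, out_ch, mid_ch, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, 1, 0), nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 3, stride, 1), nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, out_ch, 1, 1, 0), nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, 0), nn.BatchNorm2d(out_ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + self.shortcut(x))

class SecondResidualUnit(nn.Module):
    """Bottleneck residual unit with an identity shortcut."""
    def __init__(self, channels, mid_ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, mid_ch, 1, 1, 0), nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 3, 1, 1), nn.BatchNorm2d(mid_ch),
            nn.Conv2d(mid_ch, channels, 1, 1, 0), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + x)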
As shown in fig. 4, the 5 information fusion FM modules have the same structure, specifically:
comprises a fourth upsampling layer, a twenty-second convolutional layer and a twenty-third convolutional layer; each information fusion FM module has two inputs and one output, the second input of the information fusion FM module is input into a fourth up-sampling layer, the fourth up-sampling layer is connected with a twenty-second convolution layer, the output of the twenty-second convolution layer and the output of the information fusion FM module after being cascaded are input into a twenty-third convolution layer, and the output of the twenty-third convolution layer is used as the output of the information fusion FM module.
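A sketch of the information fusion FM module under one reading of the text: the second input (the high-level semantics) is upsampled and convolved, concatenated with the first input, and fused by a final convolution; the upsampling factor and the channel widths are assumptions:

import torch
import torch.nn as nn

class InformationFusionFM(nn.Module):
    """Fuses high-level semantics (second input) with lower-level features (first input)."""
    def __init__(self, low_ch, high_ch, out_ch, scale=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)  # fourth upsampling layer
        self.conv22 = nn.Conv2d(high_ch, low_ch, 3, 1, 1)                                # twenty-second convolution layer
        self.conv23 = nn.Conv2d(low_ch * 2, out_ch, 3, 1, 1)                             # twenty-third convolution layer

    def forward(self, first_in, second_in):
        x = self.conv22(self.up(second_in))
        return self.conv23(torch.cat([x, first_in], dim=1))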
As shown in fig. 2, the semantic AHLS module includes two first convolution modules, two first attention modules, one second convolution module, and one second attention module;
the semantic AHLS module is provided with two inputs, a first input of the semantic AHLS module is connected with one first attention mechanism module after passing through one first convolution module, a second input of the semantic AHLS module is connected with the other first attention mechanism module after passing through the other first convolution module, the output of the two first attention mechanism modules after being cascaded is input into the second convolution module, the second convolution module is connected with the second attention mechanism module, and the output of the second attention mechanism module is used as the output of the semantic AHLS module;
the first convolution module mainly comprises a ninth convolution layer; the first attention mechanism module is mainly formed by sequentially connecting a first pooling layer, a first full-connection layer, a sixth activation layer, a second full-connection layer, a seventh activation layer, a third full-connection layer, an eighth activation layer and a tenth convolution layer;
the second convolution module is mainly formed by sequentially connecting an eleventh convolution layer, a ninth normalization layer and a ninth active layer; the second attention mechanism module is identical in structure to the first attention mechanism module.
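A sketch of the semantic AHLS module is given below; the text does not state how the weights produced by the pooling and fully connected layers are applied, so a squeeze-and-excitation style re-scaling of the input is assumed, as are the channel widths and the choice of activation functions:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """First/second attention mechanism module: global pooling, three fully connected
    layers with activations, and a final convolution; SE-style re-scaling is assumed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                      # first pooling layer (type assumed)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(channels, channels, 1)             # tenth convolution layer

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return self.conv(x * w)

class SemanticAHLS(nn.Module):
    """Each input passes through a convolution and an attention module; the two branches are
    concatenated and refined by a conv-BN-ReLU block and a second attention module."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch_rgb = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), ChannelAttention(out_ch))
        self.branch_thermal = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), ChannelAttention(out_ch))
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            ChannelAttention(out_ch),
        )

    def forward(self, rgb_feat, thermal_feat):
        return self.fuse(torch.cat([self.branch_rgb(rgb_feat), self.branch_thermal(thermal_feat)], dim=1))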
As shown in fig. 3, the 5 feature fusion AMMF modules have the same structure, specifically:
a first blending module, a second blending module, a twelfth convolution layer, a first upsampling layer and a thirteenth convolution layer;
the first input of the feature fusion AMMF module is connected with the first coding module, the second coding module, the third coding module, the fourth coding module or the fifth coding module, the second input of the feature fusion AMMF module is connected with the sixth coding module, the seventh coding module, the eighth coding module, the ninth coding module or the tenth coding module, and the third input of the feature fusion AMMF module is connected with the semantic AHLS module, the first fusion feature output, the second fusion feature output, the third fusion feature output or the fourth fusion feature output;
the output of the feature fusion AMMF module after multiplication is used as a first fused output, the output of the first fused output after multiplication is used as a second fused output, the output of the second fused output after cascade connection is used as a second input of the feature fusion AMMF module, the output of the first blending module after cascade connection is used as a third input of the feature fusion AMMF module, the output of the first blending module and the output of the feature fusion AMMF module are input into the second blending module, the first blending module is connected with the thirteenth convolution layer after sequentially passing through the twelfth convolution layer and the first upsampling layer, and the output of the thirteenth convolution layer is used as the output of the feature fusion AMMF module;
the first blending module and the second blending module have the same structure and comprise a fourteenth convolution layer, a tenth normalization layer, a tenth activation layer, a fifteenth convolution layer, an eleventh normalization layer and a tenth activation layer;
the fourteenth convolution layer is connected with the eleventh normalization layer after sequentially passing through the tenth normalization layer, the tenth active layer and the fifteenth convolution layer, the input of the blending module is the input of the fourteenth convolution layer, the output of the fourteenth convolution layer after the input of the fourteenth convolution layer is added with the output of the eleventh normalization layer is input into the tenth active layer, and the output of the tenth active layer is used as the output of the blending module. The activation function of the tenth activation layer is a Relu activation function.
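Because the textual description of the multiplication and concatenation order in the AMMF module is ambiguous, the sketch below is only one plausible reading: the RGB and Thermal features are multiplied, the product is further modulated by the previous-stage (or AHLS) input, the results are concatenated, refined by the two blending modules, and finally convolved and upsampled; the channel widths and the upsampling factor are assumptions:

import torch
import torch.nn as nn

class BlendingModule(nn.Module):
    """First/second blending module: conv-BN-ReLU-conv-BN with a residual add and ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, 1, 1), nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + x)

class FeatureFusionAMMF(nn.Module):
    """One plausible reading of the feature fusion AMMF module (not the patent's exact wiring)."""
    def __init__(self, channels, out_ch):
        super().__init__()
        self.blend1 = BlendingModule(channels * 3)
        self.blend2 = BlendingModule(channels * 3)
        self.conv12 = nn.Conv2d(channels * 3, out_ch, 3, 1, 1)                            # twelfth convolution layer
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)       # first upsampling layer
        self.conv13 = nn.Conv2d(out_ch, out_ch, 3, 1, 1)                                  # thirteenth convolution layer

    def forward(self, rgb_feat, thermal_feat, prev_feat):
        fused1 = rgb_feat * thermal_feat                        # first fused output
        fused2 = fused1 * prev_feat                             # second fused output
        x = self.blend2(self.blend1(torch.cat([fused1, fused2, prev_feat], dim=1)))
        return self.conv13(self.up(self.conv12(x)))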
As shown in fig. 5, the 3 semantic supervised MLF modules have the same structure, specifically:
a second upsampling layer, a sixteenth convolution layer, a third upsampling layer and a seventeenth convolution layer; the first input of the semantic supervision MLF module is connected with the semantic AHLS module, the second fusion feature output or the fourth fusion feature output, and the second input of the semantic supervision MLF module is connected with the first fusion feature output, the third fusion feature output or the fifth fusion feature output;
the first input of the semantic supervised MLF module is connected with the sixteenth convolution layer after passing through the second up-sampling layer, the output of the sixteenth convolution layer and the output of the semantic supervised MLF module after being cascaded are connected with the seventeenth convolution layer after passing through the third up-sampling layer, and the output of the seventeenth convolution layer is used as the output of the semantic supervised MLF module.
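A sketch of the semantic supervision MLF module under one reading of the text (the concatenation partner of the sixteenth convolution output is assumed to be the module's second input); the scale factors, channel widths and 9-class output are assumptions consistent with the rest of the description:

import torch
import torch.nn as nn

class SemanticSupervisionMLF(nn.Module):
    """Upsample and convolve the first input, concatenate with the second input, then
    upsample again and project to the 9 semantic classes."""
    def __init__(self, in1_ch, in2_ch, num_classes=9, scale1=2, scale2=2):
        super().__init__()
        self.up2 = nn.Upsample(scale_factor=scale1, mode="bilinear", align_corners=False)  # second upsampling layer
        self.conv16 = nn.Conv2d(in1_ch, in2_ch, 3, 1, 1)                                   # sixteenth convolution layer
        self.up3 = nn.Upsample(scale_factor=scale2, mode="bilinear", align_corners=False)  # third upsampling layer
        self.conv17 = nn.Conv2d(in2_ch * 2, num_classes, 3, 1, 1)                          # seventeenth convolution layer

    def forward(self, first_in, second_in):
        x = torch.cat([self.conv16(self.up2(first_in)), second_in], dim=1)
        return self.conv17(self.up3(x))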
As shown in fig. 6, the multitask supervision RM module includes an eighteenth convolutional layer, a nineteenth convolutional layer, a twentieth convolutional layer, an exp function layer, a first multitask module, a second multitask module, and a third multitask module;
the input of the multitask supervision RM module is the input of an eighteenth convolutional layer, the output of the eighteenth convolutional layer after passing through the first multitask module is used as the first output of the multitask supervision RM module, the output of the first output of the multitask supervision RM module after being multiplied by the output of the exp function layer and the output of the eighteenth convolutional layer is input into a second multitask module, the output of the second multitask module after passing through a nineteenth convolutional layer is used as the third output of the multitask supervision RM module, the output of the second multitask module after being cascaded with the output of the eighteenth convolutional layer is connected with a twentieth convolutional layer after passing through a third multitask module, and the output of the twentieth convolutional layer is used as the second output of the multitask supervision RM module;
the first multitask module, the second multitask module and the third multitask module have the same structure and are mainly formed by sequentially connecting a twenty-first convolution layer, a twelfth normalization layer and a twelfth active layer.
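A sketch of the multitask supervision RM module following the stated wiring; because the channel widths of the exp-weighted first output and of the shared feature are not specified, the exponential is averaged over channels to form a single spatial weight, which is an assumption, as are the other channel widths:

import torch
import torch.nn as nn

class MultiTaskBlock(nn.Module):
    """First/second/third multitask module: conv-BN-ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, 1, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class MultiTaskSupervisionRM(nn.Module):
    """Shared convolution, a foreground-background branch, an exp-weighted semantic branch,
    and a concatenation branch for the boundary maps."""
    def __init__(self, in_ch, mid_ch=64, num_classes=9):
        super().__init__()
        self.conv18 = nn.Conv2d(in_ch, mid_ch, 3, 1, 1)          # eighteenth convolution layer
        self.task1 = MultiTaskBlock(mid_ch, num_classes)         # first multitask module
        self.task2 = MultiTaskBlock(mid_ch, mid_ch)              # second multitask module
        self.conv19 = nn.Conv2d(mid_ch, num_classes, 3, 1, 1)    # nineteenth convolution layer
        self.task3 = MultiTaskBlock(mid_ch * 2, mid_ch)          # third multitask module
        self.conv20 = nn.Conv2d(mid_ch, num_classes, 3, 1, 1)    # twentieth convolution layer

    def forward(self, x):
        shared = self.conv18(x)
        out1 = self.task1(shared)                                # first output (foreground-background maps)
        weight = torch.exp(out1).mean(dim=1, keepdim=True)       # exp function layer; channel handling assumed
        branch = self.task2(weight * shared)
        out3 = self.conv19(branch)                               # third output (semantic maps)
        out2 = self.conv20(self.task3(torch.cat([branch, shared], dim=1)))  # second output (boundary maps)
        return out1, out2, out3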
For the 1st coding module: it consists of a first Convolution layer (Conv), a first batch normalization layer (BatchNorm) and a first Activation layer (Act) arranged in sequence. The first convolution layer uses a convolution kernel size (kernel_size) of 7, a stride of 2, an edge padding of 3, and 64 convolution kernels. The input of the 1st coding module receives the RGB three-channel components of the original input image; the original input image received at the input is required to have a width of W and a height of H. After the normalization operation of the first batch normalization layer, 64 output feature maps are produced through the first activation layer (Relu activation); the set formed by the 64 sub-feature maps is denoted N1, where each feature map has a width of W/2 and a height of H/2.
For the 2nd coding module: it consists of 1 downsampling layer followed by 3 residual units. The 1st downsampling layer uses max-pooling downsampling with a kernel size of 3 × 3, a stride of 2 and a padding of 1. The main branch of the first residual unit is formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer. The number of output channels is 256. The shortcut branch consists of one convolution layer with kernel size 1 and stride 1 followed by one normalization layer, with 256 output channels. The other residual units are likewise formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer, with 256 output channels; their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch, and the sum passes through a Relu activation function to give the final output. The input of the 2nd coding module receives all the feature maps in N1, and its output produces 256 sub-feature maps; the set of the 256 sub-feature maps is denoted N2, where each feature map has a width of W/4 and a height of H/4.
For the 3rd coding module: it consists of 8 residual units in sequence. The main branch of the first residual unit is formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 2; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer. The number of output channels is 512. The shortcut branch consists of one convolution layer with kernel size 1 and stride 2 followed by one normalization layer, with 512 output channels. The other residual units are likewise formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer, with 512 output channels; their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch, and the sum passes through a Relu activation function to give the final output. The input of the 3rd coding module receives all the feature maps in N2, and its output produces 512 sub-feature maps; the set of the 512 sub-feature maps is denoted N3, where each feature map has a width of W/8 and a height of H/8.
For the 4th coding module: it consists of 36 residual units in sequence. The main branch of the first residual unit is formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 2; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer. The number of output channels is 1024. The shortcut branch consists of one convolution layer with kernel size 1 and stride 2 followed by one normalization layer, with 1024 output channels. The other residual units are likewise formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer, with 1024 output channels; their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch, and the sum passes through a Relu activation function to give the final output. The input of the 4th coding module receives all the feature maps in N3, and its output produces 1024 sub-feature maps; the set of the 1024 sub-feature maps is denoted N4, where each feature map has a width of W/16 and a height of H/16.
For the 5th coding module: it consists of 3 residual units in sequence. The main branch of the first residual unit is formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 2; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer. The number of output channels is 2048. The shortcut branch consists of one convolution layer with kernel size 1 and stride 2 followed by one normalization layer, with 2048 output channels. The other residual units are likewise formed by stacking, in sequence: a first convolution layer with kernel size 1 and stride 1; a first normalization layer; a second convolution layer with kernel size 3 and stride 1; a second normalization layer; a third convolution layer with kernel size 1 and stride 1; a third normalization layer; and a first activation layer, with 2048 output channels; their shortcut branch performs no other operation and simply passes the input data through. The final operation of each residual unit adds the main branch and the shortcut branch, and the sum passes through a Relu activation function to give the final output. The input of the 5th coding module receives all the feature maps in N4, and its output produces 2048 sub-feature maps; the set of the 2048 sub-feature maps is denoted N5, where each feature map has a width of W/32 and a height of H/32.
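The five coding modules described above (a 7 x 7 stride-2 stem followed by 3, 8, 36 and 3 bottleneck residual units with 256, 512, 1024 and 2048 output channels) match the layout of a standard ResNet-152, so in practice the RGB encoder could be sliced out of a torchvision ResNet-152 as sketched below; this is an implementation assumption rather than something the patent states, and the Thermal branch (the 6th to 10th coding modules) would repeat the same construction with a single-channel stem:

import torch.nn as nn
from torchvision.models import resnet152

# Assumption: the coding modules follow the ResNet-152 layout, so they can be taken from torchvision.
backbone = resnet152(weights=None)
coding1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)   # 64 maps,   W/2 x H/2   (N1)
coding2 = nn.Sequential(backbone.maxpool, backbone.layer1)             # 256 maps,  W/4 x H/4   (N2)
coding3 = backbone.layer2                                              # 512 maps,  W/8 x H/8   (N3)
coding4 = backbone.layer3                                              # 1024 maps, W/16 x H/16 (N4)
coding5 = backbone.layer4                                              # 2048 maps, W/32 x H/32 (N5)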
For the 6th coding module: it consists of a first Convolution layer (Conv), a first batch normalization layer (BatchNorm) and a first Activation layer (Act) arranged in sequence. The first convolution layer uses a convolution kernel size (kernel_size) of 7, a stride of 2, an edge padding of 3, and 64 convolution kernels. The input of the 6th coding module receives the Thermal single-channel component of the original input image; the original input image received at the input is required to have a width of W and a height of H. After the normalization operation of the first batch normalization layer, 64 output feature maps are produced through the first activation layer (Relu activation); the set formed by the 64 sub-feature maps is denoted N6, where each feature map has a width of W/2 and a height of H/2.
The 7th coding module is composed, in order, of 1 down-sampling layer and 3 residual units. The down-sampling layer uses max pooling with a 3×3 kernel, a stride of 2 and a padding of 1. The main branch of the first residual unit is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 256. Its shortcut branch consists, in order, of one convolution layer with a kernel size of 1 and a stride of 1 and one normalization layer, with 256 output channels. Each of the remaining residual units is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 256; the shortcut branch of these units performs no operation and simply passes the input data through. The final operation of each residual unit adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. The input of the 7th coding module receives all the feature maps in N6; its output produces 256 sub-feature maps, whose set is denoted N7, where each feature map has a width of W/4 and a height of H/4.
The 8th coding module consists of 8 residual units arranged in sequence. The main branch of the first residual unit is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 512. Its shortcut branch consists, in order, of one convolution layer with a kernel size of 1 and a stride of 2 and one normalization layer, with 512 output channels. Each of the remaining residual units is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 512; the shortcut branch of these units performs no operation and simply passes the input data through. Each residual unit finally adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. The input of the 8th coding module receives all the feature maps in N7; its output produces 512 sub-feature maps, whose set is denoted N8, where each feature map has a width of W/8 and a height of H/8.
The 9th coding module consists of 36 residual units arranged in sequence. The main branch of the first residual unit is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 1024. Its shortcut branch consists, in order, of one convolution layer with a kernel size of 1 and a stride of 2 and one normalization layer, with 1024 output channels. Each of the remaining residual units is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 1024; the shortcut branch of these units performs no operation and simply passes the input data through. Each residual unit finally adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. The input of the 9th coding module receives all the feature maps in N8; its output produces 1024 sub-feature maps, whose set is denoted N9, where each feature map has a width of W/16 and a height of H/16.
The 10th coding module consists of 3 residual units arranged in sequence. The main branch of the first residual unit is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 2; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 2048. Its shortcut branch consists, in order, of one convolution layer with a kernel size of 1 and a stride of 2 and one normalization layer, with 2048 output channels. Each of the remaining residual units is composed, in order, of a first convolution layer with a kernel size of 1 and a stride of 1; a first normalization layer; a second convolution layer with a kernel size of 3 and a stride of 1; a second normalization layer; a third convolution layer with a kernel size of 1 and a stride of 1; a third normalization layer and a first activation layer; the number of output channels is 2048; the shortcut branch of these units performs no operation and simply passes the input data through. Each residual unit finally adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. The input of the 10th coding module receives all the feature maps in N9; its output produces 2048 sub-feature maps, whose set is denoted N10, where each feature map has a width of W/32 and a height of H/32.
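The ten coding modules above mirror the five stages of a ResNet-152-style backbone, duplicated for the RGB and Thermal branches (the 6th module takes a single-channel Thermal input). As a rough illustration only, and assuming the torchvision package is available, the stages could be instantiated as follows; whether pretrained weights are used is not stated in the text, so that choice is left open here:

```python
import torch.nn as nn
from torchvision.models import resnet152

def make_encoder(in_channels=3):
    """Split a ResNet-152 into the five coding modules used by one branch."""
    backbone = resnet152(weights=None)  # pretrained weights optional; an assumption either way
    if in_channels == 1:
        # Thermal branch: the first convolution takes a single-channel input.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    stage1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)  # 1st / 6th coding module
    stage2 = nn.Sequential(backbone.maxpool, backbone.layer1)            # 2nd / 7th: 3 residual units
    stage3 = backbone.layer2                                             # 3rd / 8th: 8 residual units
    stage4 = backbone.layer3                                             # 4th / 9th: 36 residual units
    stage5 = backbone.layer4                                             # 5th / 10th: 3 residual units
    return nn.ModuleList([stage1, stage2, stage3, stage4, stage5])

rgb_encoder = make_encoder(in_channels=3)
thermal_encoder = make_encoder(in_channels=1)
```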
The high-level semantic AHLS module consists, in order, of a first convolution module, a first attention mechanism module, a Tensor splicing layer, a second convolution module and a second attention mechanism module. The first convolution module has a kernel size of 1, a stride of 1 and 64 convolution kernels. The first attention mechanism module consists, in order, of a global max pooling layer, a first fully connected layer, a first activation function, a second fully connected layer, a second activation function, a third fully connected layer, a Sigmoid function and a convolution with a kernel size of 1 and a stride of 1. The Tensor splicing operation concatenates the two input features along the channel dimension. The second convolution module consists, in order, of a convolution layer with a kernel size of 3 and a stride of 1, a normalization layer and an activation layer. The second attention mechanism module is identical to the first attention mechanism module. The RGB output of the 5th coding module is denoted R5, and the Thermal output of the 10th coding module is denoted T5. R5 and T5 are each passed, in order, through a first convolution module and a first attention mechanism module, and the respective outputs are denoted R5^out and T5^out. R5^out and T5^out are then fed into the Tensor splicing layer, whose output is f_out. Finally, f_out is passed, in order, through the second convolution module and the second attention mechanism module to output the high-level semantic information f_high.
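A minimal sketch of this high-level semantic fusion in PyTorch. The sizes of the three fully connected layers, the output width of the 3×3 convolution and the way the Sigmoid gate is applied are not spelled out in the text, so the values chosen below (reduction ratio, 64-channel fusion width) are assumptions; only the overall layout (1×1 convolution, channel attention, channel-wise splicing, 3×3 convolution, second attention) follows the description above:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global max pool -> three FC layers -> Sigmoid gate -> 1x1 conv (layout per the description)."""
    def __init__(self, ch, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(ch, ch, kernel_size=1, stride=1)

    def forward(self, x):
        w = torch.amax(x, dim=(2, 3))                   # global max pooling over H and W
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)      # per-channel gate
        return self.conv(x * w)                         # re-weight channels, then 1x1 conv

class AHLS(nn.Module):
    """High-level semantic module: per-branch 1x1 conv + attention, splice, 3x3 conv, attention."""
    def __init__(self, in_rgb=2048, in_thermal=2048, ch=64):
        super().__init__()
        self.conv_rgb = nn.Conv2d(in_rgb, ch, 1)
        self.conv_t = nn.Conv2d(in_thermal, ch, 1)
        self.att_rgb = ChannelAttention(ch)
        self.att_t = ChannelAttention(ch)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.att_out = ChannelAttention(ch)

    def forward(self, r5, t5):
        r = self.att_rgb(self.conv_rgb(r5))
        t = self.att_t(self.conv_t(t5))
        f_out = torch.cat([r, t], dim=1)        # Tensor splicing along the channel dimension
        return self.att_out(self.fuse(f_out))   # f_high
```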
The first feature fusion AMMF module 1 is constructed as follows. The RGB output of the 5th coding module is denoted R5, and the Thermal output of the 10th coding module is denoted T5. R5 and T5 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R5^out and T5^out. R5^out and T5^out are multiplied element-wise (dot product) to obtain the output f4^out1; R5^out, T5^out and f4^out1 are then added to obtain f4^out2; R5^out and f4^out2 are then spliced to obtain f4^out3; f4^out3 is then input to the first blending module to obtain f4^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch of the first blending module performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f_high generated in the step above and f4^out4 are spliced to obtain f4^out5, and f4^out5 is input to the second blending module to obtain the output f4^out6, where the second blending module is identical to the first blending module. f4^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f4^out7; f4^out7 is up-sampled by a bilinear interpolation operation to obtain f4^out8; and f4^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the first feature fusion AMMF module 1, f4^out9. At this point the feature maps are twice their original size: each feature map has a width of W/16 and a height of H/16.
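A compact sketch of the fusion arithmetic performed by one AMMF module, again in PyTorch. Here r_out and t_out are assumed to be the 64-channel outputs of the per-branch 1×1 convolution + attention (as in the AHLS sketch), prev is f_high for AMMF module 1 or the previous FM output (f4, f3, f2, f1) for the later modules, and the class names are illustrative; which branch feature is spliced with the summed feature is not fully clear from the text, so the RGB-branch output is used as an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Blending(nn.Module):
    """Residual blending block: two 3x3 conv/BN layers plus an identity shortcut, then ReLU."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return F.relu(self.body(x) + x)

class AMMF(nn.Module):
    """Multiply, add, splice, blend, fuse with the previous decoder feature, then
    1x1 conv -> 2x bilinear upsample -> 1x1 conv."""
    def __init__(self, ch=64):
        super().__init__()
        self.blend1 = Blending(2 * ch)          # operates on the spliced (2*ch) feature
        self.blend2 = Blending(3 * ch)          # after splicing with `prev` (ch more channels)
        self.conv_a = nn.Conv2d(3 * ch, ch, 1)  # 1x1, 64 kernels
        self.conv_b = nn.Conv2d(ch, ch, 1)      # 1x1, 64 kernels, after upsampling

    def forward(self, r_out, t_out, prev):
        f1 = r_out * t_out                      # dot product (element-wise multiplication)
        f2 = r_out + t_out + f1                 # addition of the three terms
        f3 = torch.cat([r_out, f2], dim=1)      # splicing (RGB branch assumed here)
        f4 = self.blend1(f3)                    # first blending module
        f5 = torch.cat([prev, f4], dim=1)       # splice with f_high / previous FM output
        f6 = self.blend2(f5)                    # second blending module
        f7 = self.conv_a(f6)
        f8 = F.interpolate(f7, scale_factor=2, mode="bilinear", align_corners=False)
        return self.conv_b(f8)                  # AMMF output, twice the input resolution
```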
Since the skip connections used by the network model are implemented as information fusion FM modules, the first information fusion FM module works as follows: the high-level semantic feature f_high is up-sampled by 2× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^1; the output f4^out9 of the first feature fusion AMMF module 1 is spliced with f_high^1 to obtain the output f4^out10; finally, f4^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the final output f4. The output at this point consists of 64 feature maps, each with a width of W/16 and a height of H/16.
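Correspondingly, a sketch of one information fusion FM module: f_high is up-sampled by the appropriate factor, projected to 64 channels, spliced with the AMMF output and projected again. The scale factor is 2 for the first FM module and 4, 8, 16 and 32 for the following ones; names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FM(nn.Module):
    """Information fusion FM module: upsample f_high, 1x1 conv, splice with AMMF output, 1x1 conv."""
    def __init__(self, ch=64, scale=2):
        super().__init__()
        self.scale = scale
        self.conv_high = nn.Conv2d(ch, ch, 1)     # 1x1, 64 kernels
        self.conv_out = nn.Conv2d(2 * ch, ch, 1)  # 1x1, 64 kernels after splicing

    def forward(self, ammf_out, f_high):
        up = F.interpolate(f_high, scale_factor=self.scale, mode="bilinear", align_corners=False)
        f_high_i = self.conv_high(up)
        return self.conv_out(torch.cat([ammf_out, f_high_i], dim=1))  # e.g. f4 for the first FM
```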
The second feature fusion AMMF module 2 is constructed as follows. The RGB output of the 4th coding module is denoted R4, and the Thermal output of the 9th coding module is denoted T4. R4 and T4 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R4^out and T4^out. R4^out and T4^out are multiplied element-wise (dot product) to obtain the output f3^out1; R4^out, T4^out and f3^out1 are then added to obtain f3^out2; R4^out and f3^out2 are then spliced to obtain f3^out3; f3^out3 is then input to the first blending module to obtain f3^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f4 generated in the step above and f3^out4 are spliced to obtain f3^out5, and f3^out5 is input to the second blending module to obtain the output f3^out6, where the second blending module is identical to the first blending module. f3^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f3^out7; f3^out7 is up-sampled by a bilinear interpolation operation to obtain f3^out8; and f3^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the second feature fusion AMMF module 2 (RAMMF2), f3^out9. At this point the feature maps are twice their original size: each feature map has a width of W/8 and a height of H/8.
Because the model uses skip connections, the second information fusion FM module works as follows: f_high is up-sampled by 4× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^2; the output f3^out9 of the second feature fusion AMMF module 2 is spliced with f_high^2 to obtain the output f3^out10; finally, f3^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the final output f3. The output at this point consists of 64 feature maps, each with a width of W/8 and a height of H/8.
The third feature fusion AMMF module 3 is constructed as follows. The RGB output of the 3rd coding module is denoted R3, and the Thermal output of the 8th coding module is denoted T3. R3 and T3 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R3^out and T3^out. R3^out and T3^out are multiplied element-wise (dot product) to obtain the output f2^out1; R3^out, T3^out and f2^out1 are then added to obtain f2^out2; R3^out and f2^out2 are then spliced to obtain f2^out3; f2^out3 is then input to the first blending module to obtain f2^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f3 generated in the step above and f2^out4 are spliced to obtain f2^out5, and f2^out5 is input to the second blending module to obtain the output f2^out6, where the second blending module is identical to the first blending module. f2^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f2^out7; f2^out7 is up-sampled by a bilinear interpolation operation to obtain f2^out8; and f2^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the third feature fusion AMMF module 3 (RAMMF3), f2^out9. At this point the feature maps are twice their original size: each feature map has a width of W/4 and a height of H/4.
Because the model uses skip connections, the third information fusion FM module works as follows: f_high is up-sampled by 8× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^3; the output f2^out9 of the third feature fusion AMMF module 3 is spliced with f_high^3 to obtain the output f2^out10; finally, f2^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the final output f2. The output at this point consists of 64 feature maps, each with a width of W/4 and a height of H/4.
The fourth feature fusion AMMF module 4 is constructed as follows. The RGB output of the 2nd coding module is denoted R2, and the Thermal output of the 7th coding module is denoted T2. R2 and T2 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R2^out and T2^out. R2^out and T2^out are multiplied element-wise (dot product) to obtain the output f1^out1; R2^out, T2^out and f1^out1 are then added to obtain f1^out2; R2^out and f1^out2 are then spliced to obtain f1^out3; f1^out3 is then input to the first blending module to obtain f1^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f2 generated in the step above and f1^out4 are spliced to obtain f1^out5, and f1^out5 is input to the second blending module to obtain the output f1^out6, where the second blending module is identical to the first blending module. f1^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f1^out7; f1^out7 is up-sampled by a bilinear interpolation operation to obtain f1^out8; and f1^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the fourth feature fusion AMMF module 4 (RAMMF4), f1^out9. At this point the feature maps are twice their original size: each feature map has a width of W/2 and a height of H/2.
Because the network uses skip connections, the fourth information fusion FM module works as follows: f_high is up-sampled by 16× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^4; the output f1^out9 of the fourth feature fusion AMMF module 4 is spliced with f_high^4 to obtain the output f1^out10; finally, f1^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the final output f1. The output at this point consists of 64 feature maps, each with a width of W/2 and a height of H/2.
The fifth feature fusion AMMF module 5 is constructed as follows. The RGB output of the 1st coding module is denoted R1, and the Thermal output of the 6th coding module is denoted T1. R1 and T1 are each passed, in order, through modules identical to the first convolution module and the first attention mechanism module of the AHLS module, and the respective outputs are denoted R1^out and T1^out. R1^out and T1^out are multiplied element-wise (dot product) to obtain the output f0^out1; R1^out, T1^out and f0^out1 are then added to obtain f0^out2; R1^out and f0^out2 are then spliced to obtain f0^out3; f0^out3 is then input to the first blending module to obtain f0^out4. The main branch of the first blending module consists, in order, of a first convolution layer, a first normalization layer, a first activation layer, a second convolution layer and a second normalization layer, where the first and second convolution layers both have a kernel size of 3 and a stride of 1; the shortcut branch performs no operation and simply passes the input data through; the last operation adds (Add) the main branch and the shortcut branch and applies a Relu activation function to obtain the final output. Next, f1 generated in the step above and f0^out4 are spliced to obtain f0^out5, and f0^out5 is input to the second blending module to obtain the output f0^out6, where the second blending module is identical to the first blending module. f0^out6 then passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f0^out7; f0^out7 is up-sampled by a bilinear interpolation operation to obtain f0^out8; and f0^out8 passes through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output of the fifth feature fusion AMMF module 5 (RAMMF5), f0^out9. At this point the feature maps are twice their original size, and the width and height of each feature map are W and H, respectively.
Because the network uses skip connections, the fifth information fusion FM module works as follows: f_high is up-sampled by 32× bilinear interpolation and then passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f_high^5; the output f0^out9 of the fifth feature fusion AMMF module 5 is spliced with f_high^5 to obtain the output f0^out10; finally, f0^out10 is input to a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels to obtain the output f0. The output f0 is then passed through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the semantic prediction output f_final. The output at this point consists of 9 feature maps, each with a width of W and a height of H.
The MLF module 1 performs high-level information semantic supervision. The high-level semantic information f_high is up-sampled by 2× bilinear interpolation, and the result is passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels; the result is then added to the output f4 obtained above. At this point each feature map has a width of W/16, a height of H/16 and 64 channels. The sum is then up-sampled by 16× bilinear interpolation, and the result is finally passed through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the final high-level semantic output f_high; at this point each feature map has a width of W, a height of H and 9 channels.
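A minimal sketch of one such semantic supervision head in PyTorch (here the high-level MLF module 1; the mid- and low-level modules differ only in which features they combine and in the final up-sampling factor). The class name MLF is illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class MLF(nn.Module):
    """Semantic supervision head: upsample the deeper feature by 2x, project to 64 channels,
    add the shallower decoder feature, upsample to full resolution, predict 9 classes."""
    def __init__(self, ch=64, num_classes=9, final_scale=16):
        super().__init__()
        self.proj = nn.Conv2d(ch, ch, 1)            # 1x1, 64 kernels
        self.head = nn.Conv2d(ch, num_classes, 1)   # 1x1, 9 kernels
        self.final_scale = final_scale              # 16 for MLF1, 4 for MLF2, 1 for MLF3

    def forward(self, deep, shallow):
        up = F.interpolate(deep, scale_factor=2, mode="bilinear", align_corners=False)
        fused = self.proj(up) + shallow
        if self.final_scale > 1:
            fused = F.interpolate(fused, scale_factor=self.final_scale,
                                  mode="bilinear", align_corners=False)
        return self.head(fused)  # e.g. f_high / f_mid / f_low, shape (N, 9, H, W)
```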
The MLF module 2 performs mid-level information semantic supervision. The output f3 obtained above is up-sampled by 2× bilinear interpolation, and the result is passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels; the result is then added to the output f2 obtained above. At this point each feature map has a width of W/4, a height of H/4 and 64 channels. The sum is then up-sampled by 4× bilinear interpolation, and the result is finally passed through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the final output f_mid; at this point each feature map has a width of W, a height of H and 9 channels.
The MLF module 3 performs low-level information semantic supervision. The output f1 obtained above is up-sampled by 2× bilinear interpolation, and the result is passed through a convolution layer with a kernel size of 1, a stride of 1 and 64 convolution kernels; the result is then added to the output f0 obtained above. At this point each feature map has a width of W, a height of H and 64 channels. This sum is finally passed through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the final output f_low; at this point each feature map has a width of W, a height of H and 9 channels.
The multi-task supervision RM module is built from three multi-task modules. First, the semantic prediction map f_final is fed into the first multi-task module, which consists, in order, of a first convolution layer with a kernel size of 3, a stride of 1 and 9 convolution kernels, a normalization layer and an activation layer; the output of the activation layer then passes through a second convolution layer with a kernel size of 1, a stride of 1 and 2 convolution kernels to obtain the foreground-background output f_bin. Next, f_bin is passed through an exp function to obtain the semantically supervised weight, and f_final is multiplied point-wise by weight. The weighted result is passed through the second multi-task module, which consists, in order, of a convolution layer with a kernel size of 3, a stride of 1 and 9 convolution kernels, a normalization layer and an activation layer; its output then passes through a convolution layer with a kernel size of 1, a stride of 1 and 9 convolution kernels to obtain the final semantic output f_sem. Then f_final and the output of the second multi-task module are spliced, and the spliced result is passed through the third multi-task module; the output of its activation layer finally passes through a second convolution layer with a kernel size of 1, a stride of 1 and 2 convolution kernels to obtain the final boundary output f_bou. At this point each output map has a width of W and a height of H.
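A sketch of this multi-task supervision head in PyTorch (the names RM and MTBlock are illustrative, and how the 2-channel foreground-background response is reduced to a single weight channel is not specified, so a per-pixel maximum is used as an assumption):

```python
import torch
import torch.nn as nn

class MTBlock(nn.Module):
    """One multi-task block: 3x3 convolution (9 kernels), BatchNorm, ReLU."""
    def __init__(self, in_ch, out_ch=9):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class RM(nn.Module):
    """Multi-task supervision: foreground-background, exp-weighted semantics and boundary outputs."""
    def __init__(self, num_classes=9):
        super().__init__()
        self.mt1 = MTBlock(num_classes)
        self.to_bin = nn.Conv2d(num_classes, 2, 1)   # foreground-background head (2 kernels)
        self.mt2 = MTBlock(num_classes)
        self.to_sem = nn.Conv2d(num_classes, num_classes, 1)
        self.mt3 = MTBlock(2 * num_classes)
        self.to_bou = nn.Conv2d(num_classes, 2, 1)   # boundary head (2 kernels)

    def forward(self, f_final):
        f_bin = self.to_bin(self.mt1(f_final))
        # exp of the foreground-background response acts as a per-pixel weight (assumption: max over channels).
        weight = torch.exp(f_bin).max(dim=1, keepdim=True).values
        mid = self.mt2(f_final * weight)
        f_sem = self.to_sem(mid)
        f_bou = self.to_bou(self.mt3(torch.cat([f_final, mid], dim=1)))
        return f_bin, f_sem, f_bou
```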
Step 1_3: input the training set into the convolutional neural network for training; the convolutional neural network outputs seven prediction map sets corresponding to each original road scene RGB image in the training set.
Each prediction map in the seven prediction map sets has the same size as the original road scene RGB image, and the seven sets are the semantic segmentation prediction map set Jpre1, the high-level semantic prediction map set Jpre2, the mid-level semantic prediction map set Jpre3, the low-level semantic prediction map set Jpre4, the foreground-background prediction map set Jpre5, the boundary prediction map set Jpre6 and the semantic prediction map set Jpre7.
The semantic segmentation prediction map set Jpre1 consists of the 9 semantic segmentation prediction maps f_final from the first output of the convolutional neural network; the high-level semantic prediction map set Jpre2 consists of the 9 high-level semantic prediction maps f_high from the second output; the mid-level semantic prediction map set Jpre3 consists of the 9 mid-level semantic prediction maps f_mid from the third output; the low-level semantic prediction map set Jpre4 consists of the 9 low-level semantic prediction maps f_low from the fourth output; the foreground-background prediction map set Jpre5 consists of the 9 foreground-background prediction maps f_bin from the fifth output; the boundary prediction map set Jpre6 consists of the 9 boundary prediction maps f_bou from the sixth output; and the semantic prediction map set Jpre7 consists of the 9 semantic prediction maps f_sem from the seventh output. The seventh output of the convolutional neural network is used as the semantic segmentation prediction map output during the testing stage.
Step 1_4: process the real semantic segmentation image corresponding to each original road scene RGB image into 9 one-hot coded images and denote the set of the 9 one-hot coded images as Jtrue; compute the loss function value between the set Jtrue and each of the seven corresponding prediction map sets, and take the sum of the seven loss function values as the final loss value. The loss function value between the set Jtrue of 9 one-hot coded images and the corresponding prediction map set Jprei is denoted Lossi(Jtrue, Jprei), i = 1, 2, 3, 4, 5, 6, 7, and each Lossi is computed with the cross-entropy loss (CrossEntropyLoss).
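A sketch of this multi-supervision loss, assuming PyTorch; the variable names outputs and target are illustrative, and target is the per-pixel class-index form of the one-hot label set Jtrue:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def multi_supervision_loss(outputs, target):
    """outputs: the seven prediction maps (f_final, f_high, f_mid, f_low, f_bin, f_bou, f_sem),
    each assumed to have shape (N, 9, H, W); target: (N, H, W) with class indices 0..8.
    The final loss value is simply the sum of the seven cross-entropy terms."""
    return sum(criterion(pred, target) for pred in outputs)

# Example with random tensors (9 classes; 480x640 resolution assumed, matching MFNet-sized inputs).
outputs = [torch.randn(2, 9, 480, 640) for _ in range(7)]
target = torch.randint(0, 9, (2, 480, 640))
loss = multi_supervision_loss(outputs, target)
```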
Step 1_5: repeat step 1_3 and step 1_4 for V iterations until the convergence of the convolutional neural network saturates, i.e. the training loss value fluctuates without decreasing further and the validation loss has essentially reached its minimum; a convolutional neural network classification training model is obtained at this point. The weight vector and the bias of the network obtained at this moment are taken as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model. In this example, V is set to 300.
The specific steps of the test stage process are as follows:
step 2: inputting a plurality of original road scene RGB images to be semantically segmented and original Thermal infrared images into a convolutional neural network classification training model, and predicting by using an optimal weight vector and an optimal bias term to obtain a corresponding semantic segmentation prediction graph.
In a specific implementation, 393 original RGB color images to be semantically segmented, together with the corresponding original Thermal infrared images, are taken as the test set. Let I_test (a placeholder symbol used here) denote a pair consisting of an original RGB color image to be semantically segmented and the corresponding original Thermal infrared image, where 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of I_test, H' denotes the height of I_test, and I_test(i', j') denotes the pixel value of the pixel at coordinate position (i', j') in I_test.
The R channel component, G channel component and B channel component of I_test, together with the corresponding original Thermal infrared image, are input into the convolutional neural network classification training model, and a prediction is made using the optimal weight vector W_best and the optimal bias term b_best, giving the semantic segmentation prediction map corresponding to I_test, denoted P_test, where P_test(i', j') denotes the pixel value of the pixel at coordinate position (i', j') in P_test.
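A sketch of this test-stage prediction in PyTorch; model, rgb and thermal are illustrative names, and the trained model is assumed to return its seven outputs with the semantic output f_sem last, as described above:

```python
import torch

@torch.no_grad()
def predict(model, rgb, thermal):
    """rgb: (1, 3, H, W) tensor, thermal: (1, 1, H, W) tensor; returns an (H, W) label map."""
    model.eval()
    outputs = model(rgb, thermal)          # seven prediction maps, each (1, 9, H, W)
    f_sem = outputs[-1]                    # the seventh output is the one used at test time
    return f_sem.argmax(dim=1).squeeze(0)  # per-pixel class index in 0..8
```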
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The convolutional neural network architecture was built with the Python-based deep learning library PyTorch. The test set of the road scene image database MFNET RGB-T Dataset (393 road scene images) is used to analyse the segmentation performance of the road scene images predicted by the method of the present invention. Four common objective parameters for evaluating a semantic segmentation method are used as evaluation indexes: the class accuracy (Acc), the mean class accuracy (mAcc), the ratio of the intersection to the union of each class's segmented image and the label image (IoU), and the mean ratio of the intersection to the union of the segmented images and the label images (MIoU).
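For reference, a small sketch of how these evaluation indexes can be computed from a confusion matrix; NumPy is assumed, and this is a generic implementation of the standard definitions rather than code taken from the patent:

```python
import numpy as np

def confusion_matrix(pred, label, num_classes=9):
    """pred, label: integer arrays of the same shape with values in 0..num_classes-1."""
    idx = label.reshape(-1) * num_classes + pred.reshape(-1)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def evaluate(conf):
    """conf: (C, C) matrix with conf[t, p] = number of pixels of true class t predicted as p."""
    tp = np.diag(conf).astype(float)
    acc = tp / np.maximum(conf.sum(axis=1), 1)                          # per-class accuracy (Acc)
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)  # per-class IoU
    return acc, acc.mean(), iou, iou.mean()                             # Acc, mAcc, IoU, MIoU
```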
Each road scene image in the MFNET RGB-T Dataset test set is predicted with the method of the present invention to obtain its corresponding predicted semantic segmentation image, and the class accuracy Acc, the mean class accuracy mAcc, the ratio IoU of the intersection to the union of each class's segmented image and the label image, and the mean ratio MIoU of the intersection to the union of the segmented images and the label images, which reflect the semantic segmentation performance of the method, are listed in Table 1. As can be seen from the data listed in Table 1, the segmentation results obtained for the road scene images by the method of the present invention are good, indicating that it is feasible and effective to obtain the predicted semantic segmentation image corresponding to a road scene image with the method of the present invention.
TABLE 1 evaluation results on test sets using the method of the invention
FIG. 7a shows the 1st original road scene image of the same scene; FIG. 7b shows the predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 7a with the method of the present invention; FIG. 8a shows the 2nd original road scene image of the same scene; FIG. 8b shows the predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 8a with the method of the present invention; FIG. 9a shows the 3rd original road scene image of the same scene; FIG. 9b shows the predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 9a with the method of the present invention; FIG. 10a shows the 4th original road scene image of the same scene; FIG. 10b shows the predicted semantic segmentation image obtained by predicting the original road scene image shown in FIG. 10a with the method of the present invention. Comparing FIG. 7a with FIG. 7b, FIG. 8a with FIG. 8b, FIG. 9a with FIG. 9b, and FIG. 10a with FIG. 10b shows that the predicted semantic segmentation images obtained by the method of the present invention have high segmentation precision.

Claims (10)

1. A road scene image semantic segmentation method based on a multi-supervision network is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting a plurality of original road scene RGB images, corresponding original Thermal infrared images and real semantic segmentation images, respectively performing data enhancement on each original road scene RGB image and each original Thermal infrared image by cropping, brightness adjustment and flipping, and forming a training set from the plurality of original road scene RGB images, the plurality of original Thermal infrared images and the corresponding real semantic segmentation images;
step 1_ 2: constructing a convolutional neural network;
step 1_ 3: inputting the training set into a convolutional neural network for training, wherein the convolutional neural network outputs seven prediction image sets corresponding to each original road scene RGB image in the training set;
step 1_4: processing the real semantic segmentation image corresponding to each original road scene RGB image into 9 one-hot coded images and recording the set of the 9 one-hot coded images as Jtrue; respectively calculating the loss function values between the set Jtrue of 9 one-hot coded images and the corresponding seven prediction image sets, and taking the sum of the seven loss function values as the final loss value;
step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for V times until the convergence of the convolutional neural network reaches saturation, and obtaining a convolutional neural network classification training model; taking the weight vector and the bias of the network obtained at the moment as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model;
the test stage process comprises the following specific steps:
step 2: inputting a plurality of original road scene RGB images to be semantically segmented and original Thermal infrared images into a convolutional neural network classification training model, and predicting by using an optimal weight vector and an optimal bias term to obtain a corresponding semantic segmentation prediction graph.
2. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 1, characterized in that: the convolutional neural network comprises an encoding module and a decoding module, wherein the encoding module is connected with the decoding module;
the encoding module comprises 10 encoding modules, and the decoding module comprises a semantic AHLS module, a multitask supervision RM module, 5 information fusion FM modules, 5 feature fusion AMMF modules and 3 semantic supervision MLF modules;
the first coding module is connected with the semantic AHLS module after sequentially passing through a second coding module, a third coding module, a fourth coding module and a fifth coding module, the sixth coding module is connected with the semantic AHLS module after sequentially passing through a seventh coding module, an eighth coding module, a ninth coding module and a tenth coding module, the input of the first coding module is an initial road scene RGB image, and the input of the sixth coding module is an initial Thermal infrared image;
the fifth coding module, the semantic AHLS module and the tenth coding module are connected with the first feature fusion AMMF module, the output of the first feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the first information fusion FM module, the output of the first information fusion FM module and the output of the semantic AHLS module are simultaneously input into the first semantic supervision MLF module, and the output of the first semantic supervision MLF module is used as the second output of the convolutional neural network;
the output of the fourth coding module, the output of the first information fusion FM module and the output of the ninth coding module are input into the second feature fusion AMMF module, the output of the second feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the second information fusion FM module, the output of the third coding module and the output of the eighth coding module are simultaneously input into the third feature fusion AMMF module, the output of the third feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the third information fusion FM module, the output of the second information fusion FM module and the output of the third information fusion FM module are simultaneously input into the second semantic supervision MLF module, and the output of the second semantic supervision MLF module is used as the third output of the convolutional neural network;
the output of the second coding module, the output of the third information fusion FM module and the output of the seventh coding module are input into a fourth feature fusion AMMF module, the output of the fourth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fourth information fusion FM module, the output of the first coding module and the output of the sixth coding module are input into a fifth feature fusion AMMF module, the output of the fifth feature fusion AMMF module and the output of the semantic AHLS module are respectively input into the first input and the second input of the fifth information fusion FM module, the output of the fourth information fusion FM module and the output of the fifth information fusion FM module are simultaneously input into a third semantic supervision MLF module, and the output of the third semantic supervision MLF module is used as the fourth output of the convolutional neural network; the output of the fifth information fusion FM module is used as the first output of the convolutional neural network, the output of the fifth information fusion FM module is input to the multitask supervision RM module, and the first output, the second output and the third output of the multitask supervision RM module are respectively used as the fifth output, the sixth output and the seventh output of the convolutional neural network.
3. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 1, characterized in that: the first coding module and the sixth coding module have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first batch of normalization layers and a first activation layer;
the second coding module and the seventh coding module have the same structure and are mainly formed by sequentially connecting a first lower sampling layer, a first residual error unit and two second residual error units;
the third coding module and the eighth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and seven second residual error units;
the fourth coding module and the ninth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and 35 second residual error units;
the fifth coding module and the tenth coding module have the same structure and are mainly formed by sequentially connecting a first residual error unit and two second residual error units.
4. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 3, characterized in that: the first residual error unit comprises a second convolution layer, a second normalization layer, a third convolution layer, a third normalization layer, a fourth convolution layer, a fourth normalization layer, a second active layer, a fifth convolution layer, a fifth normalization layer and a third active layer;
the second convolution layer is connected with the second active layer after sequentially passing through the second normalization layer, the third convolution layer, the third normalization layer, the fourth convolution layer and the fourth normalization layer, the input of the first residual error unit is the input of the second convolution layer, the input of the second convolution layer is also input into the fifth convolution layer, the fifth convolution layer is connected with the fifth normalization layer, the output of the second active layer and the output of the fifth normalization layer are added, the output is input into the third active layer, and the output of the third active layer is used as the output of the first residual error unit;
the second residual error unit comprises a sixth convolution layer, a sixth normalization layer, a seventh convolution layer, a seventh normalization layer, an eighth convolution layer, an eighth normalization layer, a fourth active layer and a fifth active layer;
the sixth convolution layer is connected with the fourth active layer after sequentially passing through the sixth normalization layer, the seventh convolution layer, the seventh normalization layer, the eighth convolution layer and the eighth normalization layer, the input of the second residual error unit is the input of the sixth convolution layer, the output of the fourth active layer added with the input of the sixth convolution layer is input into the fifth active layer, and the output of the fifth active layer is used as the output of the second residual error unit.
5. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, characterized in that: the 5 information fusion FM modules have the same structure, and specifically comprise:
comprises a fourth up-sampling layer, a twenty-second convolution layer and a twenty-third convolution layer; each information fusion FM module has two inputs and one output, the second input of the information fusion FM module is input into the fourth up-sampling layer, the fourth up-sampling layer is connected with the twenty-second convolution layer, the output of the twenty-second convolution layer and the first input of the information fusion FM module, after being cascaded, are input into the twenty-third convolution layer, and the output of the twenty-third convolution layer is used as the output of the information fusion FM module.
6. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, characterized in that: the semantic AHLS module comprises two first convolution modules, two first attention mechanism modules, a second convolution module and a second attention mechanism module;
the semantic AHLS module is provided with two inputs, a first input of the semantic AHLS module is connected with one first attention mechanism module after passing through one first convolution module, a second input of the semantic AHLS module is connected with the other first attention mechanism module after passing through the other first convolution module, the output of the two first attention mechanism modules after being cascaded is input into the second convolution module, the second convolution module is connected with the second attention mechanism module, and the output of the second attention mechanism module is used as the output of the semantic AHLS module;
the first convolution module mainly comprises a ninth convolution layer; the first attention mechanism module is mainly formed by sequentially connecting a first pooling layer, a first full-connection layer, a sixth activation layer, a second full-connection layer, a seventh activation layer, a third full-connection layer, an eighth activation layer and a tenth convolution layer;
the second convolution module is mainly formed by sequentially connecting an eleventh convolution layer, a ninth normalization layer and a ninth active layer; the second attention mechanism module is identical in structure to the first attention mechanism module.
7. The road scene image semantic segmentation method based on the multi-supervision network according to claim 2, wherein the 5 feature fusion AMMF modules have the same structure, specifically:
the device comprises a first blending module, a second blending module, a twelfth convolution layer, a first up-sampling layer and a thirteenth convolution layer;
the first input of the feature fusion AMMF module is connected with the first coding module, the second coding module, the third coding module, the fourth coding module or the fifth coding module, the second input of the feature fusion AMMF module is connected with the sixth coding module, the seventh coding module, the eighth coding module, the ninth coding module or the tenth coding module, and the third input of the feature fusion AMMF module is connected with the semantic AHLS module, the first fusion feature output, the second fusion feature output, the third fusion feature output or the fourth fusion feature output;
the first input and the second input of the feature fusion AMMF module are multiplied, and the output of the multiplication is used as a first fused output; the first fused output, the first input and the second input are added, and the output of the addition is used as a second fused output; the output of the second fused output after cascade connection is input into the first blending module; the output of the first blending module, after being cascaded with the third input of the feature fusion AMMF module, is input into the second blending module; the second blending module is connected with the thirteenth convolution layer after sequentially passing through the twelfth convolution layer and the first up-sampling layer, and the output of the thirteenth convolution layer is used as the output of the feature fusion AMMF module;
the first blending module and the second blending module have the same structure and each comprise a fourteenth convolution layer, a tenth normalization layer, a tenth activation layer, a fifteenth convolution layer, an eleventh normalization layer and an eleventh activation layer;
the fourteenth convolution layer is connected with the eleventh normalization layer after sequentially passing through the tenth normalization layer, the tenth activation layer and the fifteenth convolution layer; the input of the blending module is the input of the fourteenth convolution layer; the output obtained by adding the input of the fourteenth convolution layer and the output of the eleventh normalization layer is input into the eleventh activation layer, and the output of the eleventh activation layer is used as the output of the blending module.
8. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, wherein the 3 semantic supervision MLF modules have the same structure, specifically:
the sampling device comprises a second up-sampling layer, a sixteenth convolution layer, a third up-sampling layer and a seventeenth convolution layer; the first input of the semantic supervision MLF module is connected with the semantic AHLS module, the second fusion characteristic output or the fourth fusion characteristic output, and the second input of the semantic supervision MLF module is connected with the first fusion characteristic output, the third fusion characteristic output or the fifth fusion characteristic output;
the first input of the semantic supervision MLF module is connected with the sixteenth convolution layer after passing through the second up-sampling layer; the output of the sixteenth convolution layer and the second input of the semantic supervision MLF module, after being cascaded, are connected with the seventeenth convolution layer after passing through the third up-sampling layer, and the output of the seventeenth convolution layer is used as the output of the semantic supervision MLF module.
9. The road scene image semantic segmentation method based on the multi-supervision network as claimed in claim 2, wherein the multi-task supervision RM module comprises eighteenth convolutional layer, nineteenth convolutional layer, twentieth convolutional layer, exp function layer, first multi-task module, second multi-task module and third multi-task module;
the input of the multitask supervision RM module is the input of an eighteenth convolutional layer, the output of the eighteenth convolutional layer after passing through the first multitask module is used as the first output of the multitask supervision RM module, the output of the first output of the multitask supervision RM module after being multiplied by the output of the exp function layer and the output of the eighteenth convolutional layer is input into a second multitask module, the output of the second multitask module after passing through a nineteenth convolutional layer is used as the third output of the multitask supervision RM module, the output of the second multitask module after being cascaded with the output of the eighteenth convolutional layer is connected with a twentieth convolutional layer after passing through a third multitask module, and the output of the twentieth convolutional layer is used as the second output of the multitask supervision RM module;
the first multi-task module, the second multi-task module and the third multi-task module are identical in structure, each mainly formed by sequentially connecting a twenty-first convolution layer, a twelfth normalization layer and a twelfth activation layer.
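For illustration only (not part of the claims), a minimal PyTorch sketch of the multi-task supervision RM module is given below, following the connection order recited above. Channel widths and per-output channel counts are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskBlock(nn.Module):
    """Sketch of one multi-task sub-module: convolution -> normalization -> activation."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class RMModule(nn.Module):
    """Sketch of the multi-task supervision RM module; all channel sizes are
    illustrative assumptions."""
    def __init__(self, in_channels=64, mid_channels=64,
                 out1_channels=1, out2_channels=9, out3_channels=1):
        super().__init__()
        self.conv18 = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.mt1 = MultiTaskBlock(mid_channels, out1_channels)
        self.mt2 = MultiTaskBlock(mid_channels, mid_channels)
        self.mt3 = MultiTaskBlock(mid_channels * 2, mid_channels)
        self.conv19 = nn.Conv2d(mid_channels, out3_channels, kernel_size=3, padding=1)
        self.conv20 = nn.Conv2d(mid_channels, out2_channels, kernel_size=3, padding=1)

    def forward(self, x):
        feat = self.conv18(x)                       # eighteenth convolutional layer
        out1 = self.mt1(feat)                       # first output of the RM module
        gated = torch.exp(out1) * feat              # exp function layer, then element-wise product
        mt2_out = self.mt2(gated)                   # second multi-task module
        out3 = self.conv19(mt2_out)                 # third output (nineteenth convolutional layer)
        fused = torch.cat([mt2_out, feat], dim=1)   # concatenation with the conv18 feature
        out2 = self.conv20(self.mt3(fused))         # second output (twentieth convolutional layer)
        return out1, out2, out3
```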
10. The method as claimed in claim 1, wherein the size of each prediction map in the seven prediction map sets is the same as the size of the original RGB image of the road scene, and the seven prediction map sets comprise a semantic segmentation prediction map set J_pre1, a high-level semantic prediction map set J_pre2, a middle-level semantic prediction map set J_pre3, a low-level semantic prediction map set J_pre4, a foreground-background prediction map set J_pre5, a boundary prediction map set J_pre6 and a semantic prediction map set J_pre7;
The semantic segmentation prediction map set J_pre1 consists of the 9 semantic segmentation prediction maps f_final from the first output of the convolutional neural network; the high-level semantic prediction map set J_pre2 consists of the 9 high-level semantic prediction maps f_high from the second output; the middle-level semantic prediction map set J_pre3 consists of the 9 middle-level semantic prediction maps f_mid from the third output; the low-level semantic prediction map set J_pre4 consists of the 9 low-level semantic prediction maps f_low from the fourth output; the foreground-background prediction map set J_pre5 consists of the 9 foreground-background prediction maps f_bin from the fifth output; the boundary prediction map set J_pre6 consists of the 9 boundary prediction maps f_bou from the sixth output; and the semantic prediction map set J_pre7 consists of the 9 semantic prediction maps f_sem from the seventh output; during the test phase the seventh output of the convolutional neural network is taken as the output, from which the semantic segmentation prediction map is extracted.
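For illustration only (not part of the claims), the sketch below shows one possible way of collecting the seven prediction map sets at the original image resolution and extracting the test-phase segmentation from the seventh output; that the model returns its seven outputs as a tuple is an assumption.

```python
import torch
import torch.nn.functional as F

def collect_prediction_maps(model, image_batch):
    """Sketch only: resize each of the network's seven outputs to the input
    resolution and take the class-wise argmax of the seventh output as the
    test-phase semantic segmentation prediction map."""
    outputs = model(image_batch)                         # assumed: a tuple of seven tensors
    h, w = image_batch.shape[-2:]
    prediction_sets = [
        F.interpolate(out, size=(h, w), mode='bilinear', align_corners=False)
        for out in outputs
    ]
    segmentation = prediction_sets[6].argmax(dim=1)      # seventh output -> segmentation map
    return prediction_sets, segmentation
```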
CN202110823118.4A 2021-07-21 2021-07-21 Road scene image semantic segmentation method based on multi-supervision network Pending CN113362349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110823118.4A CN113362349A (en) 2021-07-21 2021-07-21 Road scene image semantic segmentation method based on multi-supervision network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110823118.4A CN113362349A (en) 2021-07-21 2021-07-21 Road scene image semantic segmentation method based on multi-supervision network

Publications (1)

Publication Number Publication Date
CN113362349A true CN113362349A (en) 2021-09-07

Family

ID=77540049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110823118.4A Pending CN113362349A (en) 2021-07-21 2021-07-21 Road scene image semantic segmentation method based on multi-supervision network

Country Status (1)

Country Link
CN (1) CN113362349A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190164290A1 (en) * 2016-08-25 2019-05-30 Intel Corporation Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation
CN112991350A (en) * 2021-02-18 2021-06-18 西安电子科技大学 RGB-T image semantic segmentation method based on modal difference reduction
CN112991351A (en) * 2021-02-23 2021-06-18 新华三大数据技术有限公司 Remote sensing image semantic segmentation method and device and storage medium
CN112991364A (en) * 2021-03-23 2021-06-18 浙江科技学院 Road scene semantic segmentation method based on convolution neural network cross-modal fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王子羽; 张颖敏; 陈永彬; 王桂棠: "Optimization of an indoor scene semantic segmentation network based on RGB-D images", Automation & Information Engineering, no. 02, 15 April 2020 (2020-04-15) *
青晨; 禹晶; 肖创柏; 段娟: "Research progress of deep convolutional neural networks for image semantic segmentation", Journal of Image and Graphics, no. 06, 16 June 2020 (2020-06-16) *

Similar Documents

Publication Publication Date Title
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN111563507B (en) Indoor scene semantic segmentation method based on convolutional neural network
CN110490205B (en) Road scene semantic segmentation method based on full-residual-error hole convolutional neural network
CN111612807A (en) Small target image segmentation method based on scale and edge information
CN109635662B (en) Road scene semantic segmentation method based on convolutional neural network
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN114241274B (en) Small target detection method based on super-resolution multi-scale feature fusion
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
US20220156528A1 (en) Distance-based boundary aware semantic segmentation
CN115035361A (en) Target detection method and system based on attention mechanism and feature cross fusion
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN115830471B (en) Multi-scale feature fusion and alignment domain self-adaptive cloud detection method
CN114529581A (en) Multi-target tracking method based on deep learning and multi-task joint training
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN113034506A (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN109446933B (en) Road scene semantic segmentation method based on convolutional neural network
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN113362349A (en) Road scene image semantic segmentation method based on multi-supervision network
CN116310850A (en) Remote sensing image target detection method based on improved RetinaNet
CN116051532A (en) Deep learning-based industrial part defect detection method and system and electronic equipment
CN113781504A (en) Road scene semantic segmentation method based on boundary guidance
CN111047571B (en) Image salient target detection method with self-adaptive selection training process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination