CN117934849A - Deep learning-based RGB-D image semantic segmentation method - Google Patents

Deep learning-based RGB-D image semantic segmentation method

Info

Publication number
CN117934849A
Authority
CN
China
Prior art keywords
features
rgb
stage
module
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410135853.XA
Other languages
Chinese (zh)
Inventor
郭翔宇
马伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202410135853.XA priority Critical patent/CN117934849A/en
Publication of CN117934849A publication Critical patent/CN117934849A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Using classification, e.g. of video objects
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V10/811 Fusion of classification results, the classifiers operating on different input data, e.g. multi-modal recognition
    • G06V10/82 Using neural networks
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning-based RGB-D image semantic segmentation method whose core is a deep neural network model based on multi-level feature fusion. The network comprises a bimodal non-local context encoding module, a complementary feature selection module and a semantic-guided feature calibration module. Training is divided into three steps: initialization of the model parameters, preparation of the target dataset, and training of the overall model. The invention has the following advantages: 1) a token carrying bimodal non-local context information is used to exchange information between the two modalities, so that bimodal complementary features are better extracted and the single-modality feature representations are enhanced; 2) a semantic-guided feature calibration module introduces the rich global semantic context of the top-level features into the shallow features, enriching their semantic information and suppressing their noise, which improves the segmentation result.

Description

Deep learning-based RGB-D image semantic segmentation method
Technical Field
The invention belongs to the technical field of image processing and computer vision, and relates to an RGB-D image semantic segmentation method based on deep learning.
Background
Semantic segmentation aims to classify each pixel of an input image. As a classical task, it is widely used in many fields, including autonomous driving, face segmentation, remote sensing image analysis and medical image analysis. RGB images carry rich color and texture information, while depth images contain three-dimensional geometric information. The two kinds of information complement each other, which helps to improve semantic segmentation. For example, as a supplement to the RGB information, depth provides three-dimensional geometry and is robust to illumination changes, which helps the model segment more accurately in strongly lit and shadowed regions of objects.
To extract features useful for the semantic segmentation task from RGB and depth data, it is important to develop an efficient way of interacting and fusing the two modalities' features. In the paper "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", published by Liu et al. in TITS in 2023, RGB data and depth data are taken as inputs, features are extracted with two parallel branches, and CM-FRM and FFM structures are designed to calibrate and fuse the information of the RGB and depth modalities. Although this approach offers a viable way of unifying the two kinds of information, the complementary information of the bimodal features is not fully mined. Because of the differences between the RGB and depth modalities, how to effectively identify those differences and integrate the two kinds of information into an effective feature representation remains a challenging problem.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing RGB-D semantic segmentation methods and provides a deep learning-based RGB-D semantic segmentation method.
To achieve this, the technical scheme of the invention is as follows: an RGB-D semantic segmentation model based on deep learning is constructed, the model is trained on a target dataset, the image to be segmented is processed by the trained model, and the K channel activations output by the model are used as the probabilities of the corresponding K semantic categories.
The network model in the method consists of the following three major modules:
1. A bimodal non-local context encoding module. Global context information of the bimodal features is captured with bimodal non-local context tokens.
2. A complementary feature selection module. This module is used to enhance the feature representation of each single modality.
3. A semantic-guided feature calibration module. This module uses the top-level features to perform guided calibration of the shallow features, suppressing noise in the shallow features and strengthening their semantic representation.
The model training process in the method comprises the following three stages:
1. and initializing model parameters. The encoder is pre-trained in the classification dataset ImageNet.
2. Preparation of the target data set. NYUDepthV2 is selected as the target dataset.
3. Training of the whole model. Initializing network parameters by using the pre-trained parameters, and supervising the updating process of the whole network parameters by using the cross entropy loss function.
Advantageous effects
1) A token carrying bimodal non-local context information is used to exchange information between the two modalities, so that bimodal complementary features are better extracted and the single-modality feature representations are enhanced; 2) a semantic-guided feature calibration module introduces the rich global semantic context of the top-level features into the shallow features, enriching their semantic information and suppressing their noise, which improves the segmentation result. Experiments show that, compared with the existing method, the proposed method segments strongly lit and reflective regions better, is more robust to illumination changes, effectively suppresses depth-map noise, and improves segmentation in noisy depth-map regions.
Drawings
FIG. 1 is a schematic diagram of a network framework of the method of the present invention;
FIG. 2 is a block diagram of a bimodal non-local context encoding module in the method of the present invention;
FIG. 3 is a complementary feature selection module in the method of the present invention;
FIG. 4 is a semantic guided feature calibration module in the method of the present invention;
Fig. 5 shows the experimental results of the application example: (a) the input RGB image, (b) the input depth image, (c) the semantic segmentation ground truth, (d) the semantic segmentation result of the method published by Liu et al. in TITS in 2023, "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", and (e) the semantic segmentation result of the method of the invention.
Detailed Description
The invention is implemented on the deep learning open-source framework PyTorch, and the network model is trained on an NVIDIA RTX 3090 GPU.
The module structure of the method, and the training and use of the model, are further described below in conjunction with the drawings and the detailed description. It should be understood that the detailed description is provided for illustration only and does not limit the scope of the application; after reading this application, various equivalent modifications fall within the scope defined by the appended claims.
The module composition and flow of the invention are shown in figure 1, and specifically comprise the following modules:
1. A bimodal non-local context encoding module.
For the RGB-D semantic segmentation task, complementary information exists between the RGB and depth data, so effective information interaction between them is necessary to improve the segmentation result. During this interaction, how to effectively select the complementary information between the two modalities is the key problem the task must solve. To address it, a bimodal non-local context encoding (DNCE) module is constructed; its structure is shown in fig. 2. The module consists of a unimodal context attention module and a cross-modal token interaction module, which together realize effective bimodal information fusion. The inputs of the module are the RGB feature sequence F_RGB and the depth feature sequence F_D. In stage one, F_RGB is obtained by passing the RGB image through a 7×7 convolution layer with stride 4 followed by layer normalization (LayerNorm), and F_D is obtained by passing the depth image through a 7×7 convolution layer with stride 4 followed by layer normalization. In stages two to four, F_RGB is obtained from the previous stage's F_RGB through a 3×3 convolution layer with stride 2 followed by layer normalization, and F_D is obtained from the previous stage's F_D in the same way. The numbers of channels of the feature sequences in the four stages are 64, 128, 320 and 512, respectively.
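As an illustration of this four-stage feature extraction, the following PyTorch sketch (module and variable names are mine, not from the patent, and the single-channel depth input is an assumption) builds the per-stage embedding: a 7×7 convolution with stride 4 plus layer normalization in stage one, and a 3×3 convolution with stride 2 plus layer normalization in stages two to four, with 64, 128, 320 and 512 output channels.

    import torch
    import torch.nn as nn

    class StageEmbed(nn.Module):
        """One downsampling stem: strided convolution followed by LayerNorm."""
        def __init__(self, in_ch, out_ch, kernel, stride):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, out_ch, kernel, stride, padding=kernel // 2)
            self.norm = nn.LayerNorm(out_ch)

        def forward(self, x):                      # x: (B, C_in, H, W)
            x = self.proj(x)                       # (B, C_out, H/stride, W/stride)
            b, c, h, w = x.shape
            x = x.flatten(2).transpose(1, 2)       # (B, H*W, C_out) feature sequence
            return self.norm(x), h, w

    # Stage one: 7x7 convolution, stride 4; stages two to four: 3x3 convolution, stride 2.
    channels = [64, 128, 320, 512]
    rgb_stems = nn.ModuleList([
        StageEmbed(3, channels[0], kernel=7, stride=4),
        StageEmbed(channels[0], channels[1], kernel=3, stride=2),
        StageEmbed(channels[1], channels[2], kernel=3, stride=2),
        StageEmbed(channels[2], channels[3], kernel=3, stride=2),
    ])
    depth_stems = nn.ModuleList([
        StageEmbed(1, channels[0], kernel=7, stride=4),   # assumes a one-channel depth map
        StageEmbed(channels[0], channels[1], kernel=3, stride=2),
        StageEmbed(channels[1], channels[2], kernel=3, stride=2),
        StageEmbed(channels[2], channels[3], kernel=3, stride=2),
    ])

Between stages, the (B, N, C) sequence would be reshaped back to a (B, C, H, W) map before the next stem; that plumbing is omitted here.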
In the unimodal context attention module, a bimodal non-local context token T_RGB is concatenated to F_RGB, where T_RGB is a randomly initialized parameter used to capture the global context information of the RGB features; its channel number matches F_RGB and its sequence length is 1. The result is fed into a Transformer block for self-attention learning, yielding the RGB modal features F'_RGB and an updated token that carries the RGB-modal global context information (hereinafter the RGB context token). Likewise, a bimodal non-local context token T_D is concatenated to F_D, where T_D is a randomly initialized parameter used to capture the global context information of the depth features; its channel number matches F_D and its sequence length is 1. It is then fed into a Transformer block for self-attention learning, yielding the depth modal features F'_D and an updated token that carries the depth-modal global context information (the depth context token). Each Transformer block is a stack of Transformer layers, with 3, 4, 6 and 3 stacked layers in stages one to four, respectively. Inside each Transformer layer there are layer normalization, multi-head attention and a multi-layer perceptron, with 1, 2, 5 and 8 attention heads in stages one to four, respectively.
After the RGB context token and the depth context token are obtained, the cross-modal token interaction module is used to exchange the global context information between them. The two context tokens are concatenated along the channel dimension as the input of the module, and the output is a pair of tokens carrying bimodal global context information, one corresponding to the RGB features and one corresponding to the depth features. The specific calculation process is expressed as follows:
where C is the channel dimension, T_MHSA is the calculation result of formula (1), MHSA is the multi-head attention, MLP is the multi-layer perceptron, and LN is layer normalization; the double vertical bar denotes concatenation along the channel dimension.
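The token mechanism just described can be sketched in PyTorch as follows. The class names are mine, a standard pre-norm Transformer encoder is assumed for the per-modality blocks, and the residual form of the cross-modal interaction is an assumption, since the formulas referenced above are not reproduced in this text; with a single concatenated token, the multi-head attention acts as a learned mixing of the two modalities' context channels.

    import torch
    import torch.nn as nn

    class UnimodalContextAttention(nn.Module):
        """Prepend a learnable non-local context token and run self-attention."""
        def __init__(self, dim, num_heads, depth):
            super().__init__()
            self.token = nn.Parameter(torch.zeros(1, 1, dim))   # randomly initialized token
            nn.init.trunc_normal_(self.token, std=0.02)
            layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                               batch_first=True, norm_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, feats):                               # feats: (B, N, C)
            tok = self.token.expand(feats.size(0), -1, -1)
            x = self.blocks(torch.cat([tok, feats], dim=1))     # (B, N+1, C)
            return x[:, 1:], x[:, :1]                           # modal features, context token

    class CrossModalTokenInteraction(nn.Module):
        """Exchange global context between the RGB and depth context tokens."""
        def __init__(self, dim, num_heads=1):
            super().__init__()
            self.ln1 = nn.LayerNorm(2 * dim)
            self.attn = nn.MultiheadAttention(2 * dim, num_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(2 * dim)
            self.mlp = nn.Sequential(nn.Linear(2 * dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, 2 * dim))

        def forward(self, tok_rgb, tok_d):                      # each (B, 1, C)
            t = torch.cat([tok_rgb, tok_d], dim=-1)             # concatenate along channels
            h = self.ln1(t)
            t = t + self.attn(h, h, h)[0]                       # multi-head attention with LN
            t = t + self.mlp(self.ln2(t))                       # MLP with LN
            return t.chunk(2, dim=-1)                           # bimodal-aware RGB / depth tokens

For stage one, for example, the per-modality block would be UnimodalContextAttention(64, num_heads=1, depth=3), matching the layer and head counts listed above.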
2. Complementary feature selection module.
To make the information interaction between the modalities more thorough, a complementary feature selection (CFS) module is proposed, which selects complementary features along both the channel and the spatial dimension; its structure is shown in fig. 3. When selecting complementary features along the channel dimension, the RGB context token and the depth context token are mainly used. Their concatenation is fed into a channel interaction module for attention computation, producing channel weight maps. The channel interaction module consists of linear layers, the nonlinear function ReLU and a Sigmoid; the specific calculation process is expressed as follows:
where one output is the channel weight map of the RGB modality and the other is the channel weight map of the depth modality; both are normalized to (0, 1) by the Sigmoid. The output dimension of linear layer Linear1 is half of its input dimension, and the output dimension of linear layer Linear2 equals its input dimension. One weight map is used to weight the RGB modal features along the channel dimension and the other to weight the depth modal features along the channel dimension; the specific calculation process is expressed as follows:
where F'_RGB denotes the RGB modal features produced by the unimodal context attention module and F'_D denotes the depth modal features produced by that module; the two weighted results are the complementary information extracted from the RGB modal features and from the depth modal features, respectively, in the channel dimension. The modal features and the weight maps are combined by element-wise multiplication.
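A minimal sketch of the channel-dimension selection follows. All names are mine; sharing Linear1 with two Linear2 heads is my reading of the layer sizes above, and gating each modality's features with that modality's own weight map is one possible pairing, since the exact pairing is not legible in the text.

    import torch
    import torch.nn as nn

    class ChannelComplementarySelection(nn.Module):
        """Channel interaction: Linear -> ReLU -> Linear -> Sigmoid channel weights."""
        def __init__(self, dim):
            super().__init__()
            self.linear1 = nn.Linear(2 * dim, dim)      # output dim is half the input dim
            self.relu = nn.ReLU(inplace=True)
            self.linear2_rgb = nn.Linear(dim, dim)      # output dim equals input dim
            self.linear2_d = nn.Linear(dim, dim)

        def forward(self, tok_rgb, tok_d, f_rgb, f_d):
            # tok_*: (B, 1, C) context tokens; f_*: (B, N, C) modal feature sequences
            h = self.relu(self.linear1(torch.cat([tok_rgb, tok_d], dim=-1)))
            w_rgb = torch.sigmoid(self.linear2_rgb(h))  # (B, 1, C) channel weights in (0, 1)
            w_d = torch.sigmoid(self.linear2_d(h))
            # Element-wise (broadcast) multiplication weights each modality channel-wise.
            return w_rgb * f_rgb, w_d * f_d             # channel-wise complementary features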
When feature selection is performed along the spatial dimension, the two modal features are concatenated along the channel dimension and fed into a spatial interaction module for attention computation. The spatial interaction module consists of 1×1 convolution layers, the nonlinear function ReLU and a Sigmoid; the specific calculation process is expressed as follows:
where one output is the spatial weight map of the RGB modality and the other is the spatial weight map of the depth modality, both normalized to (0, 1) by the Sigmoid. The two spatial weight maps weight the two modal features, selecting effective complementary information in the spatial dimension. The specific calculation process is expressed as follows:
where the two weighted results are the complementary information extracted from the RGB modal features and from the depth modal features, respectively, in the spatial dimension. The modal features and the weight maps are combined by element-wise multiplication.
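A corresponding sketch for the spatial dimension; producing the two single-channel weight maps from one small convolutional head is my reading of the 1×1 convolution / ReLU / Sigmoid description, and the names are mine.

    import torch
    import torch.nn as nn

    class SpatialComplementarySelection(nn.Module):
        """Spatial interaction: 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid spatial weights."""
        def __init__(self, dim):
            super().__init__()
            self.conv1 = nn.Conv2d(2 * dim, dim, kernel_size=1)
            self.relu = nn.ReLU(inplace=True)
            self.conv2 = nn.Conv2d(dim, 2, kernel_size=1)   # one spatial map per modality

        def forward(self, f_rgb, f_d):
            # f_*: (B, C, H, W) modal feature maps
            x = torch.cat([f_rgb, f_d], dim=1)              # concatenate along channels
            w = torch.sigmoid(self.conv2(self.relu(self.conv1(x))))
            w_rgb, w_d = w[:, 0:1], w[:, 1:2]               # (B, 1, H, W) weights in (0, 1)
            # Element-wise multiplication selects complementary information per location.
            return w_rgb * f_rgb, w_d * f_d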
After the complementary feature selection is completed, the selected complementary information is used to supplement the RGB image features and the depth image features, respectively; the specific calculation process is expressed as follows:
where the results are the RGB modal features enhanced by the complementary features and the depth modal features enhanced by the complementary features.
3. A semantic-guided feature calibration module.
To suppress noise information in the shallow features and enhance their semantic representation ability, a semantic-guided feature calibration (SGFR) module is designed; its structure is shown in fig. 4. In the fusion step, the FFM fusion module of CMX ("Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers") is used to fuse the bimodal features and obtain the fused features of each stage. The top-level fused features are then used to perform semantic-guided calibration of the shallow fused features, yielding a stronger feature representation. The feature calibration step enhances the shallow features with a cross-attention mechanism.
As shown in fig. 4, the numbers of channels of the shallow features obtained in stages one to three are first adjusted to 512 by a linear layer so that they match the top-level features. The low-level features are then taken as the query (Query) Q, and the top-level features as the key (Key) K and the value (Value) V. The query Q is matrix-multiplied with the transpose of the key K and passed through a softmax to obtain the attention map A:
A = softmax(Q*K^T/√C_t)
where C_t is the number of top-level feature channels and K^T is the transpose of K. The attention map A is then matrix-multiplied with the value V and added to the query Q to obtain the semantically enhanced feature F:
F=A*V+Q (12)
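The calibration step therefore amounts to a single cross-attention in which the shallow features query the top-level features. A minimal sketch under my own naming; here the linear layer that adjusts the shallow channels to 512 doubles as the query projection, which is one possible arrangement.

    import torch
    import torch.nn as nn

    class SemanticGuidedCalibration(nn.Module):
        """Shallow features (queries) attend to top-level features (keys/values)."""
        def __init__(self, in_dim, top_dim=512):
            super().__init__()
            self.to_q = nn.Linear(in_dim, top_dim)   # adjust shallow channels to 512

        def forward(self, shallow, top):
            # shallow: (B, N_s, C_s) stage one-to-three fused features
            # top:     (B, N_t, 512) stage-four fused features
            q = self.to_q(shallow)                                    # query Q
            k = v = top                                               # key K and value V
            a = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # attention map A
            return a @ v + q                                          # F = A*V + Q, formula (12)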
4. Semantic segmentation decoder.
After the semantically enhanced features of stages one to three are obtained, they are input, together with the fused features of stage four, into a decoder to generate the semantic segmentation result. The decoder first upsamples the fused features of stages two to four by bilinear interpolation to the same size as the stage-one fused features. The features of stages one to four are then concatenated along the channel dimension and processed by two 1×1 convolution layers: the first has input dimension 2048 and output dimension 512, and the second has input dimension 512 and output dimension 40, consistent with the number of categories in the dataset. Finally, the semantic segmentation prediction is obtained.
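A sketch of this decoder head (names mine; the final upsampling back to the input resolution is an assumption, since only the per-stage fusion and the two 1×1 convolutions are specified above).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SegDecoder(nn.Module):
        """Concatenate four 512-channel stage features and predict 40 classes."""
        def __init__(self, num_classes=40):
            super().__init__()
            self.fuse = nn.Conv2d(4 * 512, 512, kernel_size=1)          # 2048 -> 512
            self.classify = nn.Conv2d(512, num_classes, kernel_size=1)  # 512 -> 40

        def forward(self, feats):
            # feats: list of four (B, 512, H_i, W_i) maps, stage one first
            size = feats[0].shape[-2:]
            ups = [feats[0]] + [F.interpolate(f, size=size, mode="bilinear",
                                              align_corners=False) for f in feats[1:]]
            logits = self.classify(self.fuse(torch.cat(ups, dim=1)))    # (B, 40, H_1, W_1)
            # Assumed: upsample by 4 back to the input resolution for per-pixel prediction.
            return F.interpolate(logits, scale_factor=4, mode="bilinear",
                                 align_corners=False)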
Training stage.
Step one, initializing model parameters.
The main purpose of this step is to provide good initial parameter values for the model. In a neural network, the parameters are updated by the back-propagation (BP) algorithm, which is essentially stochastic gradient descent, so different initial values make the network converge to different solutions; good initial values help the network avoid poor local optima and also speed up convergence. The low-level features required by different tasks are similar (most are based on brightness, color and texture cues such as edges and corners) and are only abstracted into task-specific features at the higher semantic levels, so the low-level features learned under different tasks are general. Parameters learned on another, larger dataset can therefore be transferred directly to the target dataset, where only light training is needed to adjust them; this process is called fine-tuning. In this method, the parameters of the Transformer modules in the bimodal non-local context encoding module are pre-trained on the classification dataset ImageNet, which contains more than 14 million images in over 20,000 categories and thus strengthens the feature learning ability of the Transformer modules. After pre-training, the trained parameters are loaded into the Transformer modules of this method.
Step two, preparing a target data set.
The target dataset used for training is NYU Depth V2, which covers 40 categories of common indoor objects. NYU Depth V2 contains 1449 RGB-D images of size 480×640, of which 795 are used for training and 654 for testing. Only the training set is used to train the model parameters; the test set is used to evaluate the results. The dataset is augmented with two techniques, random horizontal flipping and scaling, with six scale factors of 0.5, 0.75, 1, 1.25, 1.5 and 1.75.
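The augmentation could look like the sketch below; applying nearest-neighbour interpolation to the label map, and leaving the subsequent crop or pad back to a fixed size unspecified, are my assumptions.

    import random
    import torch
    import torch.nn.functional as F

    SCALES = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]

    def augment(rgb, depth, label):
        """rgb: (3, H, W), depth: (1, H, W), label: (H, W) long tensor of class ids."""
        # Random horizontal flip, applied identically to all three maps.
        if random.random() < 0.5:
            rgb, depth, label = rgb.flip(-1), depth.flip(-1), label.flip(-1)
        # Random rescaling with one of the six scale factors.
        s = random.choice(SCALES)
        rgb = F.interpolate(rgb[None], scale_factor=s, mode="bilinear",
                            align_corners=False)[0]
        depth = F.interpolate(depth[None], scale_factor=s, mode="bilinear",
                              align_corners=False)[0]
        label = F.interpolate(label[None, None].float(), scale_factor=s,
                              mode="nearest")[0, 0].long()
        return rgb, depth, label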
And thirdly, training of the integral model.
Before the overall model is trained, its parameters must be initialized; the quality of the initialization directly affects the final semantic segmentation result. The Transformer modules of the bimodal non-local context encoding module are initialized with the parameters obtained by pre-training on the classification task, while the parameters of the multi-level feature extraction module and the fusion module are randomly initialized with a truncated normal distribution with mean 0 and variance 0.2. The learning rate is set to 6e-5 with a poly learning rate schedule, and the weight decay is 0.01.
In this semantic segmentation method, a weighted multi-class cross-entropy loss function is used to supervise the updating of the overall network parameters:
L = −Σ_i p(x_i) log q(x_i)
where x_i denotes the value of pixel x on the i-th channel, the i channels correspond to i semantic categories, p(x_i) is the true probability that pixel x belongs to category i (1 if pixel x belongs to category i, 0 otherwise), and q(x_i) is the predicted probability that pixel x belongs to category i, obtained by the softmax activation function; it ranges from 0 to 1, and the closer it is to 1, the more likely pixel x belongs to that category. The basic learning rate schedule adopted during training is "poly":
lr = base_lr × (1 − iter_cur/iter_max)^power
where power is the learning rate decay exponent, base_lr is the base learning rate, iter_cur is the current training iteration and iter_max is the total number of iterations. Experiments show that the multi-class cross-entropy loss function and the poly learning rate schedule complement each other, letting the network converge to a better solution and giving the best semantic segmentation result.
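Put together, the schedule and supervision above correspond to a training loop of the following shape. The optimizer choice (AdamW), the power value 0.9 and the total iteration count are assumptions; the patent only fixes the base learning rate 6e-5, the weight decay 0.01, the poly schedule and the cross-entropy loss.

    import torch
    import torch.nn as nn

    def poly_lr(base_lr, iter_cur, iter_max, power=0.9):
        """Poly schedule: base_lr * (1 - iter_cur / iter_max) ** power."""
        return base_lr * (1.0 - iter_cur / iter_max) ** power

    def train(model, train_loader, iter_max, base_lr=6e-5, power=0.9):
        criterion = nn.CrossEntropyLoss()     # per-pixel multi-class cross entropy
        optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.01)
        iter_cur = 0
        while iter_cur < iter_max:
            for rgb, depth, label in train_loader:      # label: (B, H, W) class indices
                for g in optimizer.param_groups:
                    g["lr"] = poly_lr(base_lr, iter_cur, iter_max, power)
                logits = model(rgb, depth)              # (B, 40, H, W) class scores
                loss = criterion(logits, label)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                iter_cur += 1
                if iter_cur >= iter_max:
                    break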
Stage of use.
Construct the network structure as described above, initialize the model parameters and prepare the training dataset. After training is completed, input the image to be segmented into the trained network; the output is its semantic segmentation result.
Method testing.
The method of the invention was validated on the NYU Depth V2 test set and compared with the published CMX method mentioned above. The results show that the proposed method segments strongly lit and reflective regions better and is more robust to illumination changes. In addition, the method effectively suppresses noise in the depth map, and segmentation in noisy depth-map regions is clearly improved.
The visual comparison is shown in fig. 5: (a) the input RGB image, (b) the input depth image, (c) the semantic segmentation ground truth, (d) the result of the method published by Liu et al. in TITS in 2023, "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", and (e) the result of the method of the invention. The quantitative comparison is shown in Table 1: the first column is the result of CMX by Liu et al. and the second column is the result of the invention; the evaluation protocol is the same as that of Liu et al., and the metric is the mean intersection-over-union (mIoU) over all classes.
TABLE 1
Method        mIoU (%)
CMX           54.1
Our method    54.7
1. Comparison of the visual results.
The visual comparison on the NYU Depth V2 dataset is shown in fig. 5. The results show that CMX mis-segments part of the left side of the input image, namely the strong-light region on the whiteboard, whereas the proposed method segments this region correctly. The visual results clearly show that the method segments strong-light regions better and is more robust to illumination changes. The method also makes effective use of the geometric information in the depth map, ensuring the intra-class consistency of the segmentation results.
2. Comparison of the quantitative evaluation results.
In the quantitative evaluation of Table 1, the proposed method reaches 54.7%, exceeding the CMX result by 0.6 percentage points, which shows that the feature representation ability of the overall network model is stronger than that of CMX.
In summary, the invention discloses a deep learning-based RGB-D image semantic segmentation method and mainly explains the composition of the network model and its training process. The overall framework comprises three key modules: a bimodal non-local context encoding module, a complementary feature selection module and a semantic-guided feature calibration module. The training process has three stages: initialization of the model parameters, preparation of the target dataset, and training of the overall model. The comparison with CMX verifies that the proposed modules make effective use of the geometric information in the depth map, improve the model's robustness to illumination changes in the image, ensure the intra-class consistency of the segmentation, and thereby clearly improve segmentation accuracy.

Claims (1)

1. A deep learning-based RGB-D image semantic segmentation method, characterized by comprising the following modules:
1. a bimodal non-local context encoding module;
This module consists of a unimodal context attention module and a cross-modal token interaction module; the inputs of the module are the RGB feature sequence F_RGB and the depth feature sequence F_D; in stage one, F_RGB is obtained by passing the RGB image through a 7×7 convolution layer with stride 4 followed by layer normalization, and F_D is obtained by passing the depth image through a 7×7 convolution layer with stride 4 followed by layer normalization; in stages two to four, F_RGB is obtained from the previous stage's F_RGB through a 3×3 convolution layer with stride 2 followed by layer normalization, and F_D is obtained from the previous stage's F_D in the same way; the numbers of channels of the feature sequences in the four stages are 64, 128, 320 and 512, respectively;
In the unimodal context attention module, a bimodal non-local context token T_RGB is concatenated to F_RGB, where T_RGB is a randomly initialized parameter whose channel number matches F_RGB and whose sequence length is 1; the result is fed into a Transformer block for self-attention learning, yielding the RGB modal features F'_RGB and an updated token carrying the RGB-modal global context information (the RGB context token); a bimodal non-local context token T_D is concatenated to F_D, where T_D is a randomly initialized parameter whose channel number matches F_D and whose sequence length is 1; it is then fed into a Transformer block for self-attention learning, yielding the depth modal features F'_D and an updated token carrying the depth-modal global context information (the depth context token); each Transformer block is a stack of Transformer layers, with 3, 4, 6 and 3 stacked layers in stages one to four, respectively; inside each Transformer layer there are layer normalization, multi-head attention and a multi-layer perceptron, with 1, 2, 5 and 8 attention heads in stages one to four, respectively;
After the RGB context token and the depth context token are obtained, the cross-modal token interaction module is used to exchange the global context information between them; the two context tokens are concatenated along the channel dimension as the input of the module, and the output is a pair of tokens carrying bimodal global context information, one corresponding to the RGB features and one corresponding to the depth features; the specific calculation process is expressed as follows:
where C is the channel dimension, T_MHSA is the calculation result of formula (1), MHSA is the multi-head attention, MLP is the multi-layer perceptron, and LN is layer normalization; the double vertical bar denotes concatenation along the channel dimension;
2. A complementary feature selection module;
When selecting complementary features along the channel dimension, the RGB context token and the depth context token are used to select complementary features from the channel dimension; their concatenation is fed into a channel interaction module for attention computation, producing channel weight maps; the channel interaction module consists of linear layers, the nonlinear function ReLU and a Sigmoid, and the specific calculation process is expressed as follows:
where one output is the channel weight map of the RGB modality and the other is the channel weight map of the depth modality; both are normalized to (0, 1) by the Sigmoid; the output dimension of linear layer Linear1 is half of its input dimension, and the output dimension of linear layer Linear2 equals its input dimension; one weight map is used to weight the RGB modal features along the channel dimension and the other to weight the depth modal features along the channel dimension, and the specific calculation process is expressed as follows:
where F'_RGB denotes the RGB modal features produced by the unimodal context attention module and F'_D denotes the depth modal features produced by the unimodal context attention module; the two weighted results are the complementary information extracted from the RGB modal features and from the depth modal features, respectively, in the channel dimension; the modal features and the weight maps are combined by element-wise multiplication;
When feature selection is performed along the spatial dimension, the two modal features are concatenated along the channel dimension and fed into a spatial interaction module for attention computation; the spatial interaction module consists of 1×1 convolution layers, the nonlinear function ReLU and a Sigmoid, and the specific calculation process is expressed as follows:
where one output is the spatial weight map of the RGB modality and the other is the spatial weight map of the depth modality, both normalized to (0, 1) by the Sigmoid; the two spatial weight maps weight the two modal features, selecting effective complementary information in the spatial dimension; the specific calculation process is expressed as follows:
where the two weighted results are the complementary information extracted from the RGB modal features and from the depth modal features, respectively, in the spatial dimension; the modal features and the weight maps are combined by element-wise multiplication;
After the complementary feature selection is completed, the selected complementary information is used to supplement the RGB image features and the depth image features, respectively; the specific calculation process is expressed as follows:
where the results are the RGB modal features enhanced by the complementary features and the depth modal features enhanced by the complementary features;
3. A semantic guided feature calibration module;
The fusion module FFM fuses the bimodal features to obtain the fused features of each stage; the top-level fused features are then used to perform semantic-guided feature calibration of the shallow fused features; the feature calibration step enhances the shallow features with a cross-attention mechanism;
First, the numbers of channels of the shallow features obtained in stages one to three are adjusted to 512 by a linear layer so that they match the top-level features; the low-level features are then taken as the query (Query) Q, and the top-level features as the key (Key) K and the value (Value) V; the query Q is matrix-multiplied with the transpose of the key K and passed through a softmax to obtain the attention map A:
A = softmax(Q*K^T/√C_t)
where C_t is the number of top-level feature channels and K^T is the transpose of K; the attention map A is then matrix-multiplied with the value V and added to the query Q to obtain the semantically enhanced feature F:
F=A*V+Q (12)
4. A semantic segmentation decoder;
After the semantically enhanced features of stages one to three are obtained, they are input, together with the fused features of stage four, into a decoder to generate the semantic segmentation result; the decoder first upsamples the fused features of stages two to four by bilinear interpolation to the same size as the stage-one fused features; the features of stages one to four are then concatenated along the channel dimension and processed by two 1×1 convolution layers, the first with input dimension 2048 and output dimension 512, the second with input dimension 512 and output dimension 40, consistent with the number of categories in the dataset; finally, the semantic segmentation prediction is obtained;
Training;
Step one, initializing the model parameters;
Pre-training the parameters of the Transformer modules in the bimodal non-local context encoding module on the classification dataset ImageNet; after pre-training is completed, loading the trained parameters into the Transformer modules of the method;
Step two, preparing a target data set;
Training the model parameters using only the training set, the test set being used to evaluate the results; augmenting the dataset using two techniques, random horizontal flipping and scaling, with six scale factors of 0.5, 0.75, 1, 1.25, 1.5 and 1.75;
Step three, training of the overall model;
Before the overall model is trained, its parameters are initialized, the parameters being randomly initialized with a truncated normal distribution whose mean and variance are 0 and 0.2, respectively; the learning rate is set to 6e-5 with a poly learning rate schedule, and the weight decay is 0.01;
In the semantic segmentation method, a weighted multi-class cross-entropy loss function is used to supervise the updating of the overall network parameters; the specific function is as follows:
L = −Σ_i p(x_i) log q(x_i)
where x_i denotes the value of pixel x on the i-th channel, the i channels correspond to i semantic categories, p(x_i) is the true probability that pixel x belongs to category i, equal to 1 if pixel x belongs to category i and 0 otherwise; q(x_i) is the predicted probability that pixel x belongs to category i, obtained by the softmax activation function, ranging from 0 to 1, with a value closer to 1 indicating a higher probability that pixel x belongs to that category;
the basic learning rate change strategy adopted in training is "poly":
lr = base_lr × (1 − iter_cur/iter_max)^power
where power is the learning rate decay exponent, base_lr is the base learning rate, iter_cur is the current training iteration and iter_max is the total number of iterations; after training is completed, the image to be segmented is input into the trained network, and the output is its semantic segmentation result.
CN202410135853.XA 2024-01-31 2024-01-31 Deep learning-based RGB-D image semantic segmentation method Pending CN117934849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410135853.XA CN117934849A (en) 2024-01-31 2024-01-31 Deep learning-based RGB-D image semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410135853.XA CN117934849A (en) 2024-01-31 2024-01-31 Deep learning-based RGB-D image semantic segmentation method

Publications (1)

Publication Number Publication Date
CN117934849A true CN117934849A (en) 2024-04-26

Family

ID=90755744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410135853.XA Pending CN117934849A (en) 2024-01-31 2024-01-31 Deep learning-based RGB-D image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN117934849A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118311976A (en) * 2024-06-05 2024-07-09 汕头大学 CFS-based multi-unmanned aerial vehicle obstacle avoidance method, system, device and medium
CN118311976B (en) * 2024-06-05 2024-09-27 汕头大学 CFS-based multi-unmanned aerial vehicle obstacle avoidance method, system, device and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination