CN116342877A - Semantic segmentation method based on improved ASPP and fusion module in complex scene - Google Patents

Semantic segmentation method based on improved ASPP and fusion module in complex scene Download PDF

Info

Publication number
CN116342877A
Authority
CN
China
Prior art keywords
module
aspp
convolution
semantic segmentation
improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310163543.4A
Other languages
Chinese (zh)
Inventor
钱华明
丁鹏
鲍家兵
于爽
孙永虎
阎淑雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202310163543.4A priority Critical patent/CN116342877A/en
Publication of CN116342877A publication Critical patent/CN116342877A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention aims to provide a semantic segmentation method based on an improved ASPP and a fusion module in complex scenes, which comprises the following steps: building a Deeplabv3+ model under the PyTorch framework; designing an RA-ASPP module based on the traditional ASPP structure; designing a CBB module; replacing the ASPP module in the Deeplabv3+ model with the RA-ASPP module, and replacing the 3×3 standard convolution of the decoding fusion part with the CBB module; training the model with the freeze-training method, carrying out ablation experiments on the PASCAL VOC07+12 dataset with Xception and MobileNetV2 as backbones, and comparing the performance of the different models. The improved modules provided by the invention improve the segmentation effect of Deeplabv3+, and the different backbones provide more choices for semantic segmentation tasks in complex scenes.

Description

Semantic segmentation method based on improved ASPP and fusion module in complex scene
Technical Field
The invention relates to an image processing method.
Background
The main task of image semantic segmentation is to understand the semantics of an image and segment objects with different semantics, where semantics refers to the meaning represented by the objects in the image, such as pedestrians, vehicles, roads and obstacles in complex scenes. Semantic segmentation divides a digital image into sets of pixels, so that pixels assigned the same label share similar characteristics.
The quality of semantic segmentation directly determines how accurately an unmanned system understands its scene. It is vital to unmanned systems such as intelligent driving, autonomous navigation at the cognition layer of robots, UAV landing systems and intelligent security monitoring. For example, in the field of autonomous driving, vehicles need image recognition and segmentation capability in order to fully understand environmental changes while travelling; in medical image diagnosis, it can help doctors analyse images of a patient's diseased regions and thus improve diagnostic efficiency. The higher the accuracy of image segmentation, the better subsequent image-processing tasks perform; image segmentation is therefore a prerequisite for image processing, a crucial foundational task, and plays an important role in computer vision. However, complex scenes often involve unstructured content, diversified targets, irregular shapes, illumination changes and object occlusion, which pose great challenges to segmentation accuracy. Small target objects and thin strip-like regions of targets are particularly difficult to segment, for example the legs of desks and chairs in indoor scenes, or thin structures such as telegraph poles and street lamps in road scenes. It is also difficult to distinguish different objects with similar appearance, or the same object with a changed appearance; for instance, a floor whose texture and appearance resemble trees may be misclassified as a tree. Finally, adaptability to illumination and seasonal variation in complex environments is weak, and robustness is poor.
With the development of deep convolutional neural networks, many computer vision tasks have seen large improvements. Semantic segmentation, one of the most important tasks in computer vision, has likewise made good progress by relying on deep learning techniques. It remains a difficult task: foreground and background objects in reality are very complex and often differ in shape, size and colour. Although deep convolutional neural networks have greatly improved on traditional semantic segmentation, there is still a gap before they can be applied to real complex scenes, and as scenes become more complex the requirements on segmentation algorithms become stricter. In scene perception, efficiently and accurately obtaining category information for targets from environmental information is very difficult for two main reasons. First, objects with similar attributes cannot be segmented accurately. Second, semantic segmentation relies on the shape of objects: it achieves high accuracy for static objects whose shapes are stable or change little, but cannot segment accurately when the measured object moves frequently or its shape changes markedly. Therefore, to raise the accuracy of image semantic segmentation, this work studies the problem more deeply: improving the segmentation effect by optimizing the network structure is essential to achieving accurate perception in complex scenes.
Disclosure of Invention
The invention aims to provide a semantic segmentation method based on an improved ASPP and a fusion module in complex scenes, which can provide more choices for semantic segmentation tasks in complex scenes.
The purpose of the invention is realized in the following way:
the invention discloses a semantic segmentation method based on an improved ASPP and a fusion module in a complex scene, which is characterized by comprising the following steps of:
(1) Building a deep model under a Pytorch framework;
(2) Based on the traditional ASPP structure, designing an RA-ASPP module;
(3) Designing a CBB module;
(4) Replacing an ASPP module in the deep labv3+ model by adopting an RA-ASPP module, and replacing 3X 3 standard convolution of a decoding fusion part by adopting a CBB module;
(5) The freezing training method is adopted to train the model, xception, mobileNetV2 is used as a backbone part to conduct ablation experiments on the PASCAL VOC07+12 data set, and performances of different models are compared.
The invention may further include:
1. Step (1) comprises the following steps:
(1.1) adopting the Xception network model as the backbone to build the Deeplabv3+ network structure, where the backbone can be switched between Xception and MobileNetV2 to meet different application requirements;
(1.2) proposing an RA-ASPP module based on the ASPP structure: first using a residual network structure to achieve denser multi-scale feature extraction, then combining the asymmetric convolution module with the atrous (dilated) convolution module to form a new AACB module that replaces the 3×3 atrous convolution module in ASPP;
(1.3) proposing a parallel fusion structure CBB, combining a 1×1 standard convolution and a bottleneck module, for the decoding fusion part.
2. The AACB module in step (2) replaces the 3×3 atrous convolution module in ASPP and inherits the dilation rates of the atrous convolution, i.e. the sampling rates of the AACB module are rate = {6, 12, 18}.
3. The CBB module in step (3) is based on the bottleneck module in ResNet and adds an SE attention module after the 3×3 convolution, with a reduction factor of 16.
4. In step (5), the PASCAL VOC07+12 dataset is used for network training: 10582 additionally annotated images are used for training and 1449 images for validation and testing; the initial learning rate is 0.007, a stochastic gradient descent optimizer is used with momentum 0.9 and weight decay 0.0001, the learning-rate schedule is cosine, and the input image size is 512×512. The Freeze batch size is 8 and the Freeze epochs number 100; the UnFreeze batch size is 8 and the UnFreeze epochs number 200; 300 epochs are trained in total.
The advantages of the invention are as follows. The proposed RA-ASPP module, which combines a residual network with asymmetric atrous convolution, further enriches the scales of feature extraction, achieves denser multi-scale feature extraction, and markedly improves the representational capability of the network. The proposed parallel fusion structure CBB, which combines a 1×1 standard convolution and a bottleneck module, reduces information loss during propagation through the network. To meet the requirements of high accuracy and real-time performance in complex-scene tasks, the invention uses Xception and MobileNetV2 as backbones and performs experimental verification on the PASCAL VOC07+12 dataset. The experimental results show that with the Xception backbone, the mean intersection over union (MIoU) of the proposed method is 79.78%, an improvement of 2.81% over the original at the cost of 1.72 FPS; the proposed modules significantly improve semantic segmentation accuracy, achieve a segmentation effect comparable to advanced semantic segmentation algorithms, and satisfy the high-accuracy requirement. With the MobileNetV2 backbone, the speed of the method reaches 37.54 FPS, 17.34 FPS faster than the original, and MIoU reaches 73.32%, balancing real-time segmentation speed and accuracy. The improved modules provided by the invention improve the segmentation effect of Deeplabv3+, and the different backbones provide more choices for semantic segmentation tasks in complex scenes.
Drawings
FIG. 1 is a structural diagram of Deeplabv3+;
FIG. 2 is a structural diagram of the improved Deeplabv3+;
FIG. 3 is a structural diagram of RA-ASPP;
FIG. 4a is a schematic diagram of the residual network unit of RA-ASPP (residual network structure), and FIG. 4b is a schematic diagram of the residual network unit of RA-ASPP (modified unit);
FIG. 5 is a schematic illustration of atrous convolution with different dilation rates;
FIG. 6 is a schematic diagram of asymmetric convolution;
FIG. 7a is a structural diagram of the CBB module, and FIG. 7b is a structural diagram of the SE module;
FIG. 8 is the loss training curve with Xception as the backbone;
FIG. 9 is the MIoU curve with Xception as the backbone;
FIG. 10 compares the per-class IoU performance of Deeplabv3+, Ours1 and Ours2 on the PASCAL VOC07+12 dataset;
FIG. 11 compares the segmentation effect of the different methods (input image, annotated image, Deeplabv3+, Ours1 (Xception), Ours2 (MobileNetV2)).
Detailed Description
The invention is described in more detail below, by way of example, with reference to the accompanying drawings:
Referring to FIGS. 1-11, the structure of the conventional Deeplabv3+ model is shown in FIG. 1, and the invention improves upon this structure. Deeplabv3+ is a typical semantic segmentation network architecture consisting of an encoder and a decoder; it segments an image at the pixel level and therefore performs well on image classification. Compared with Deeplabv3, Deeplabv2 and Deeplabv1, Deeplabv3+ adds a simple and effective decoder module, forming an encoder-decoder structure that gathers more pixel information and produces more accurate segmentation. Deeplabv3+ replaces ResNet with Xception and deepens the network: image features are extracted by the depthwise separable convolution layers of different channels in the backbone Xception model, high-level semantic information is obtained by parallel atrous convolutions with different rates in the spatial pyramid pooling module, and the channels are compressed by a 1×1 convolution. The decoder fuses the low-level features extracted by the backbone with the high-level features after 4× bilinear interpolation upsampling, then recovers spatial information with a 3×3 convolution and refines target boundaries with a further 4× bilinear interpolation upsampling. This decoding structure improves the recovery of edge information and raises accuracy.
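The decoder fusion just described can be sketched in PyTorch as follows; this is a minimal illustrative sketch, and the 48-channel low-level projection and 256-channel fusion width are common Deeplabv3+ choices assumed here rather than values stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Deeplabv3PlusDecoder(nn.Module):
    def __init__(self, low_level_channels=256, aspp_channels=256, num_classes=21):
        super().__init__()
        # 1x1 convolution compresses the low-level feature channels
        self.low_proj = nn.Sequential(
            nn.Conv2d(low_level_channels, 48, 1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        # 3x3 convolution fuses the concatenated features
        # (this is the block that the CBB module later replaces)
        self.fuse = nn.Sequential(
            nn.Conv2d(aspp_channels + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, aspp_out, low_level_feat, out_size):
        # 4x bilinear upsampling of the high-level (ASPP) features
        x = F.interpolate(aspp_out, size=low_level_feat.shape[2:],
                          mode="bilinear", align_corners=False)
        x = torch.cat([x, self.low_proj(low_level_feat)], dim=1)
        # final 4x bilinear upsampling back to the input resolution
        return F.interpolate(self.fuse(x), size=out_size,
                             mode="bilinear", align_corners=False)
```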
The invention provides a semantic segmentation method based on an improved ASPP and a fusion module in complex scenes, whose network structure is shown in FIG. 2. The method designs an RA-ASPP module and adopts the parallel fusion structure CBB at the decoding end, thereby improving the recognition accuracy and segmentation accuracy of the network.
The method comprises the following steps:
step 1: inputting an RGB image of 512×512 in size; (as in FIG. 2)
Step 2: the image is input to the Backbone portion (Backbone in fig. 2) and features are extracted from the image.
The invention provides two switchable backbones, Xception and MobileNetV2. Xception is the backbone of the traditional Deeplabv3+, while MobileNetV2 is a lightweight deep neural network proposed by Google for embedded devices such as mobile phones. MobileNetV2 is an upgraded version of MobileNet; compared with traditional convolutional neural networks, it greatly reduces the number of model parameters and the amount of computation with only a small loss of accuracy. Compared with Xception, MobileNetV2 offers low resource consumption and real-time operation, meeting the real-time requirement of semantic segmentation tasks. MobileNetV2 introduces depthwise separable convolutions in place of ordinary convolutions, and introduces a linear bottleneck and an inverted residual structure to avoid information loss and improve precision, greatly reducing the number of parameters and the amount of computation while improving the representational capability of the network. To meet the real-time requirement of semantic segmentation in complex scenes, the invention allows Xception to be replaced with MobileNetV2, thereby providing a semantic segmentation method that better fits real-time application requirements.
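As an illustration of the lightweight backbone option, the following minimal sketch wraps torchvision's MobileNetV2 so that it returns a low-level feature map for the decoder and a high-level feature map for the (RA-)ASPP module; the split point between shallow and deep layers is an assumption made here for illustration, not the exact layer choice of the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileNetV2Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        features = mobilenet_v2(weights=None).features
        self.low_level = features[:4]     # shallow layers: low-level features for the decoder
        self.high_level = features[4:-1]  # deeper layers: high-level features for (RA-)ASPP

    def forward(self, x):
        low = self.low_level(x)
        high = self.high_level(low)
        return low, high

# Example: low, high = MobileNetV2Backbone()(torch.randn(1, 3, 512, 512))
```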
Step 3: addressing the limitations of ASPP and considering the diversified targets of complex scenes, the invention proposes an RA-ASPP module that supplements the information lost during ASPP feature extraction so as to achieve a better target segmentation effect. FIG. 3 is a schematic diagram of the RA-ASPP structure of the invention.
The traditional ASPP module performs multi-scale sampling on the high-level semantic feature map produced by the backbone network to generate a multi-scale feature map. Combining atrous convolutions enlarges the receptive field of the convolution kernel without losing resolution. ASPP consists of two parallel parts: the first part comprises a 1×1 convolution layer and three 3×3 atrous convolution layers with sampling rates rate = {6, 12, 18}, each with 256 convolution kernels and followed by batch normalization; the second part is the image-level feature representation, obtained by applying global average pooling followed by a convolution layer with 256 kernels and finally bilinear upsampling to the required spatial size. The ASPP structure therefore samples in parallel with atrous convolutions of different rates. Atrous convolution enlarges the receptive field of the feature map without losing image resolution, so targets can be located accurately at high resolution; different receptive fields perceive information at different scales, and combining different sampling rates in parallel yields several receptive fields of different sizes, allowing targets of any size to be classified. However, the limitation of ASPP also lies in the atrous convolution: an atrous convolution with a large sampling rate recognizes large target objects well but loses effective information of small target objects, while an atrous convolution with a small sampling rate obtains the semantic position information of small targets but loses much of the contour and edge information of large targets. Combining atrous convolutions with different sampling rates in parallel compensates for the missing information to some extent, but the effective content within the missing information is still not well exploited.
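A minimal PyTorch sketch of this conventional ASPP layout, with a 1×1 branch, three 3×3 atrous branches at rates {6, 12, 18}, an image-level pooling branch, and a 1×1 projection of the concatenated result, is given below; it follows the description above and standard Deeplabv3+ practice rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_channels, out_channels=256, rates=(6, 12, 18)):
        super().__init__()
        def branch(kernel, dilation=1):
            padding = 0 if kernel == 1 else dilation
            return nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel, padding=padding,
                          dilation=dilation, bias=False),
                nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        # 1x1 branch plus three 3x3 atrous branches at rates {6, 12, 18}
        self.branches = nn.ModuleList(
            [branch(1)] + [branch(3, r) for r in rates])
        # image-level branch: global average pooling + 1x1 convolution
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        # 1x1 projection compresses the concatenated branches back to 256 channels
        self.project = nn.Sequential(
            nn.Conv2d(out_channels * (len(rates) + 2), out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, x):
        outs = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(outs + [pooled], dim=1))
```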
Step 3 comprises the following steps:
Step 3.1: build the residual network unit of RA-ASPP; FIG. 4 is a schematic diagram of the residual network unit of RA-ASPP.
Residual networks were proposed to address the tendency of convolutional neural networks to overfit and to suffer from vanishing gradients. The basic idea is to assume that a deep network has an optimal number of layers and that the network then contains some redundant layers, which are defined as redundancy layers. These redundancy layers are set up as identity layers, so that they perform an identity mapping from input to output, and the identity layers can be learned adaptively during network training.
FIG. 4(a) is a schematic diagram of the residual network structure. The residual network introduces the concept of a shortcut, i.e. skipping one or more layers and adding the input directly to the output of the lower layers; the calculation of the residual network is shown in formula (1).
H(x)=x+F(x) (1)
In formula (1), H(x) is the underlying mapping, x is the input, and F(x) is the output of the hidden layers in the network. Features of the image are extracted by adding the output of a cascade of several convolution layers to the input.
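A minimal residual unit illustrating formula (1) is sketched below: the stacked convolutions learn F(x) and the shortcut adds the input x back, so the block outputs H(x) = x + F(x). The specific layer arrangement is an illustrative assumption, not the patent's exact unit.

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(x): a small stack of convolution layers
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # shortcut connection: H(x) = x + F(x)
        return self.relu(x + self.f(x))
```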
In convolutional neural network structures, the deeper the network, the larger the errors that arise during training. The residual network solves, to a certain extent, the performance degradation of deep convolutional neural networks at extreme depth. As shown in FIG. 4(b), the invention adds a residual network to the traditional ASPP structure, avoiding network performance degradation; the residual connections further enrich the scales of feature extraction and realize denser multi-scale feature extraction, thereby increasing the segmentation accuracy of the network model.
Step 3.2: build the AACB module of RA-ASPP in FIG. 3, i.e. combine asymmetric convolution and atrous convolution to form a new asymmetric atrous convolution AACB module. To address the shortcomings of atrous convolution, the asymmetric convolution module and the atrous convolution module are combined into a new AACB (asymmetric atrous convolution block) module that replaces the 3×3 atrous convolution module in ASPP. The AACB module inherits the dilation rates of the atrous convolution, i.e. its sampling rates are rate = {6, 12, 18}. The proposed asymmetric atrous convolution module AACB, on the one hand, uses atrous convolution to enlarge the receptive field and capture multi-scale context information; on the other hand, it supplements the information missed by atrous convolution at the spatial level, giving the whole network better continuity.
The Deeplabv3+ network model adopts the ASPP module to enrich contextual semantic information during feature extraction, but atrous convolutions with multiple dilation rates easily cause a checkerboard effect, leading to the loss of small-scale targets and discontinuous segmentation. FIG. 5 illustrates atrous convolution with different dilation rates. Compared with an ordinary convolution layer, atrous convolution introduces a new parameter called the dilation rate, which defines the spacing between the values sampled by the convolution kernel. Atrous convolution preserves the data structure completely and performs no downsampling, which are clear advantages; however, stacking atrous convolutions also has the disadvantage of disrupting data continuity. The atrous convolution kernel is obtained by inserting zeros into an ordinary convolution kernel, thereby increasing the dilation rate. The relationship between the dilation rate and the size of the atrous convolution kernel is shown in formula (2):
kd_size = (γ − 1)(k_size − 1) + k_size  (2)
where γ is the dilation rate of the atrous convolution, k_size is the size of the ordinary convolution kernel, and kd_size is the size of the atrous convolution kernel; when γ = 1 it reduces to the ordinary convolution kernel.
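A short worked example of formula (2): with an ordinary 3×3 kernel (k_size = 3), dilation rates of 1, 6, 12 and 18 give equivalent kernel sizes of 3, 13, 25 and 37.

```python
def dilated_kernel_size(k_size: int, gamma: int) -> int:
    # formula (2): kd_size = (gamma - 1) * (k_size - 1) + k_size
    return (gamma - 1) * (k_size - 1) + k_size

assert [dilated_kernel_size(3, g) for g in (1, 6, 12, 18)] == [3, 13, 25, 37]
```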
Work on ACNet shows that the skeleton part of a standard square convolution kernel is more important than its corner part: discarding the weights of the skeleton part during training clearly reduces model accuracy, whereas discarding the weights of the corner part affects accuracy far less. Strengthening the skeleton part of the convolution kernel captures more image features and improves model accuracy. The invention therefore introduces asymmetric convolution to improve the atrous convolution part. FIG. 6 is a schematic diagram of asymmetric convolution. The asymmetric convolution module adds horizontal and vertical asymmetric convolution kernels to the standard square convolution: the input is convolved separately with 3×3, 1×3 and 3×1 kernels of three different shapes, extracting the features of the different branches, as shown in formula (3):
O_j = Σ_{k=1..C} M_{:,:,k} ∗ F_j^(k)  (3)
where M_{:,:,k} ∈ R^{U×V×C} is the feature map of the k-th channel of an input of size U×V, F_j^(k) is the k-th channel of the j-th convolution kernel, and O_j is the output feature map corresponding to the j-th convolution kernel.
To strengthen the skeleton part of the 3×3 convolution kernel, the ACB exploits the additivity of convolution to fuse the three parallel 3×3, 1×3 and 3×1 convolution kernels into a single fused feature output whose dimensions are consistent with those of the input features, as shown in formula (4):
I ∗ K^(1) + I ∗ K^(2) = I ∗ (K^(1) ⊕ K^(2))  (4)
where I is the input feature map matrix, K^(1) and K^(2) are two 2D convolution kernels of mutually compatible sizes, and ⊕ denotes summing the two 2D kernels at their corresponding positions. The fused kernel forms an asymmetric convolution module that replaces the original 3×3 convolution kernel and strengthens its skeleton part. The essence of ACNet is to replace the conventional square convolution kernel with this asymmetric convolution form; enhancing the kernel skeleton raises the weight of the skeleton part, so that more image features are captured and the accuracy of the baseline model improves.
Step 4: build the parallel fusion structure CBB combining a 1×1 convolution and a bottleneck module. The proposed CBB module reduces the partial information loss caused by upsampling and improves the accuracy of the network.
The decoding fusion part of Deeplabv3+ applies a 3×3 convolution to the concatenated features for simple feature fusion, and finally performs 4× bilinear interpolation upsampling on the features to obtain the segmentation result. The Deeplabv3+ network uses bilinear interpolation upsampling twice; the bilinear interpolation is given by formula (5), taking interpolation along the x-axis and y-axis directions inside the unit square with corners (0,0), (1,0), (0,1) and (1,1) as an example.
f(x, y) = f(0,0)(1 − x)(1 − y) + f(1,0)x(1 − y) + f(0,1)(1 − x)y + f(1,1)xy  (5)
Let x and y be the x-axis and y-axis coordinates of the target point. When (x, y) is interpolated inside the square formed by (0,0), (1,0), (0,1) and (1,1), the point is related to f(0,0) with weight (1 − x)(1 − y), to f(1,0) with weight x(1 − y), to f(0,1) with weight (1 − x)y, and to f(1,1) with weight xy. As formula (5) shows, the grey value of the target point is a weighted average of the grey values of the 4 surrounding pixels; this takes the surrounding pixel values into account but ignores the rate of change of the neighbouring points, so part of the detail information is lost after upsampling.
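A quick numeric check of formula (5): with corner values f(0,0) = 10, f(1,0) = 20, f(0,1) = 30 and f(1,1) = 40, the interpolated value at (x, y) = (0.5, 0.5) is the plain average 25.

```python
def bilinear(f00, f10, f01, f11, x, y):
    # formula (5): weighted average of the four surrounding pixels
    return (f00 * (1 - x) * (1 - y) + f10 * x * (1 - y)
            + f01 * (1 - x) * y + f11 * x * y)

assert bilinear(10, 20, 30, 40, 0.5, 0.5) == 25.0
```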
To address the information loss caused by bilinear interpolation upsampling, the CBB module proposed by the invention, whose structure is shown in FIG. 7(a), fuses a 1×1 convolution module and a bottleneck module in parallel and replaces the traditional 3×3 convolution module. In the CBB block, a 1×1 convolution is added to adjust the channels and resolution. The CBB block is based on the bottleneck block of the conventional ResNet, adding an SE block after the 3×3 convolution operation, with a reduction factor of 16 (r = 16). The SE module learns the importance of each feature channel, i.e. it assigns different weights to the feature channels, highlighting features useful for the current task and suppressing ineffective ones; this improves the efficiency of feature processing, and the SE module can be flexibly embedded into other network models.
FIG. 7(b) is a schematic diagram of the SE module. The SE module first processes the input feature map with global pooling, then reduces and subsequently restores the channel dimension through two fully connected layers, and finally obtains the channel weights after a sigmoid activation; the output is obtained by multiplying these weights with the original input feature map at the corresponding positions, so that feature maps of different importance are treated accordingly.
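The following minimal sketch combines the pieces described above: an SE block (reduction r = 16) inserted after the 3×3 convolution of a ResNet-style bottleneck, fused in parallel with a 1×1 convolution branch. The channel widths and the fusion-by-addition choice are illustrative assumptions; the patent's exact CBB wiring is shown in FIG. 7(a).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global pooling
        self.fc = nn.Sequential(                        # squeeze then excite
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                    # reweight each channel

class CBB(nn.Module):
    def __init__(self, in_channels, out_channels, mid_channels=64):
        super().__init__()
        # parallel 1x1 convolution branch adjusts channels
        self.conv1x1 = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        # ResNet-style bottleneck with SE attention after the 3x3 convolution
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            SEBlock(mid_channels, reduction=16),
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # parallel fusion of the two branches
        return self.relu(self.conv1x1(x) + self.bottleneck(x))
```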
Step 5: experimental setup.
The initial learning rate in the experiments is 0.007 and the backbone network is Xception. A stochastic gradient descent (SGD) optimizer is used with momentum 0.9; the weight decay is set to 0.0001 to prevent overfitting, the learning-rate schedule is cosine, and the input image size is 512×512. Given the processing capability of the hardware, the experiments adopt freeze training to accelerate training. The Freeze batch size is 8 and the Freeze epochs number 100; the UnFreeze batch size is 8 and the UnFreeze epochs number 200; 300 epochs are trained in total.
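A hedged sketch of this freeze-training schedule is given below: the backbone is frozen for the first 100 epochs and unfrozen for the remaining 200, using SGD with learning rate 0.007, momentum 0.9, weight decay 0.0001 and a cosine learning-rate schedule. The `model.backbone` attribute and the `make_loader` helper are assumed names used only for illustration.

```python
import torch

def train(model, make_loader, device="cuda"):
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.007,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(300):
        freeze = epoch < 100                           # Freeze epochs: 0-99
        for p in model.backbone.parameters():          # assumed attribute name
            p.requires_grad = not freeze
        model.train()
        for images, labels in make_loader(batch_size=8):  # batch size 8 in both phases
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```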
The experiments use the PASCAL VOC07+12 dataset, formed by combining the standard semantic segmentation datasets PASCAL VOC2007 and VOC2012; it contains 21 semantic segmentation categories in total, 20 foreground categories and 1 background category. The dataset is trained with 10582 additionally annotated images and validated and tested with 1449 images; no separate test set is split off. PASCAL VOC07+12 is the most commonly used dataset in semantic segmentation, and its large volume allows the trained model to show strong generalization capability. The experiments are implemented with the mainstream deep learning framework PyTorch; the software and hardware configuration is shown in Table 1.
Table 1: experimental software and hardware configuration (given as an image in the original document).
Step 6: experimental results and analysis.
Step 6.1: the invention trains the proposed semantic segmentation method based on improved ASPP and a fusion module in complex scenes on the PASCAL VOC07+12 dataset, thereby verifying the effectiveness of the algorithm. The training of the improved network with Xception as the backbone is shown here as an example.
FIG. 8 shows the trend of the loss of the proposed model during training. Because freeze training improves training efficiency, the loss curve falls quickly overall. The training loss and validation loss decrease continuously until the training loss levels off, which shows that the proposed model trains well. After epoch 100 the loss curve gradually flattens. FIG. 9 shows the variation of MIoU during training; MIoU reaches its best value of 79.78% at epoch 195.
Step 6.2: ablation experiment comparison.
Table 2: ablation experiments on the PASCAL VOC07+12 dataset (given as an image in the original document).
To verify the effectiveness of the improved modules, ablation experiments were performed on the Deeplabv3+ model on the PASCAL VOC07+12 dataset under the experimental conditions of Table 1. Table 2 shows the results. The proposed RA-ASPP module and CBB module clearly improve the segmentation accuracy of Deeplabv3+ while adding only a small number of parameters. The MIoU, MPA and PA of Ours1 are improved by 2.81%, 1.86% and 0.76% respectively over the traditional Deeplabv3+; its parameter size is 64.157 MB, an increase of 9.443 MB over the original, and it achieves real-time semantic segmentation at 18.48 FPS. To further increase the segmentation speed of the model, the backbone is switched to the lighter MobileNetV2 in Ours2. The MIoU, MPA and PA of Ours2 reach 73.32%, 82.01% and 94.05% respectively. Compared with Deeplabv3+ and Ours1, Ours2 sacrifices some accuracy, but its parameter size is only 7.299 MB and its segmentation speed reaches 37.54 FPS, 17.34 FPS faster than Deeplabv3+. Ours2 therefore provides a good balance between speed and accuracy. The ablation results prove that the two proposed modules improve the segmentation accuracy of Deeplabv3+, and that the two improved Deeplabv3+ models offer good accuracy and real-time performance.
Table 3: speed comparison on different GPUs (given as an image in the original document).
Table 3 compares the semantic segmentation speed of the models on different GPUs, where Ours1 uses Xception as the backbone and Ours2 uses MobileNetV2. The segmentation speed of Ours1 is close to that of Deeplabv3+, with a gap of about 1 FPS. Ours2 has a clear speed advantage over the traditional Deeplabv3+ and Ours1.
Table 4: MIoU of different networks on the PASCAL VOC07+12 dataset (given as an image in the original document).
As shown in Table 4, the MIoU of Ours1 on the PASCAL VOC07+12 dataset is 79.78%, higher than that of SegNet, FCN-8s, Deeplabv1, Deeplabv2, Deeplabv3 and Deeplabv3+ in the same series. The method attains an accuracy comparable to advanced semantic segmentation algorithms and shows good semantic segmentation performance. The MIoU of Ours2 is 73.32%, higher than SegNet, FCN-8s and Deeplabv1 but lower than the other algorithms in Table 4, mainly because Ours2 uses a lightweight backbone and sacrifices some accuracy to gain speed.
FIG. 10 shows the per-class IoU comparison of Deeplabv3+, Ours1 (Xception) and Ours2 (MobileNetV2) on the PASCAL VOC07+12 dataset. The improved Deeplabv3+ proposed by the invention attains an accuracy comparable to current mainstream semantic segmentation algorithms and exhibits excellent semantic segmentation performance.
FIG. 11 compares the segmentation effect of the different methods. The segmentation result of Ours1 is better than that of Deeplabv3+: the shape information of the image content is more complete, and the edge contours are finer and smoother. The segmentation effect of Ours2 is worse than that of Deeplabv3+ and Ours1, but it still predicts the corresponding segmentation result well. When the backbone is Xception, the segmentation effect of the improved Deeplabv3+ is better than that of the traditional Deeplabv3+; when the backbone is the more lightweight MobileNetV2, the segmentation effect of the improved Deeplabv3+ is worse than that of Deeplabv3+.

Claims (5)

1. A semantic segmentation method based on an improved ASPP and a fusion module in a complex scene is characterized by comprising the following steps:
(1) Building a Deeplabv3+ model under the PyTorch framework;
(2) Based on the traditional ASPP structure, designing an RA-ASPP module;
(3) Designing a CBB module;
(4) Replacing the ASPP module in the Deeplabv3+ model with the RA-ASPP module, and replacing the 3×3 standard convolution of the decoding fusion part with the CBB module;
(5) Training the model with the freeze-training method, carrying out ablation experiments on the PASCAL VOC07+12 dataset with Xception and MobileNetV2 as backbones, and comparing the performance of the different models.
2. The semantic segmentation method based on an improved ASPP and a fusion module in a complex scene according to claim 1, characterized in that step (1) comprises the following steps:
(1.1) adopting the Xception network model as the backbone to build the Deeplabv3+ network structure, where the backbone can be switched between Xception and MobileNetV2 to meet different application requirements;
(1.2) proposing an RA-ASPP module based on the ASPP structure: first using a residual network structure to achieve denser multi-scale feature extraction, then combining the asymmetric convolution module with the atrous convolution module to form a new AACB module that replaces the 3×3 atrous convolution module in ASPP;
(1.3) proposing a parallel fusion structure CBB, combining a 1×1 standard convolution and a bottleneck module, for the decoding fusion part.
3. The semantic segmentation method based on an improved ASPP and a fusion module in a complex scene according to claim 1, characterized in that the AACB module in step (2) replaces the 3×3 atrous convolution module in ASPP and inherits the dilation rates of the atrous convolution, i.e. the sampling rates of the AACB module are rate = {6, 12, 18}.
4. The semantic segmentation method based on an improved ASPP and a fusion module in a complex scene according to claim 1, characterized in that the CBB module in step (3) is based on the bottleneck module in ResNet and adds an SE attention module after the 3×3 convolution, with a reduction factor of 16.
5. The semantic segmentation method based on an improved ASPP and a fusion module in a complex scene according to claim 1, characterized in that in step (5) the PASCAL VOC07+12 dataset is used for network training: 10582 additionally annotated images are used for training and 1449 images for validation and testing; the initial learning rate is 0.007, a stochastic gradient descent optimizer is used with momentum 0.9 and weight decay 0.0001, the learning-rate schedule is cosine, and the input image size is 512×512; the Freeze batch size is 8 and the Freeze epochs number 100; the UnFreeze batch size is 8 and the UnFreeze epochs number 200; 300 epochs are trained in total.
CN202310163543.4A 2023-02-24 2023-02-24 Semantic segmentation method based on improved ASPP and fusion module in complex scene Pending CN116342877A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310163543.4A CN116342877A (en) 2023-02-24 2023-02-24 Semantic segmentation method based on improved ASPP and fusion module in complex scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310163543.4A CN116342877A (en) 2023-02-24 2023-02-24 Semantic segmentation method based on improved ASPP and fusion module in complex scene

Publications (1)

Publication Number Publication Date
CN116342877A true CN116342877A (en) 2023-06-27

Family

ID=86876639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310163543.4A Pending CN116342877A (en) 2023-02-24 2023-02-24 Semantic segmentation method based on improved ASPP and fusion module in complex scene

Country Status (1)

Country Link
CN (1) CN116342877A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037105A (en) * 2023-09-28 2023-11-10 四川蜀道新能源科技发展有限公司 Pavement crack filling detection method, system, terminal and medium based on deep learning
CN117037105B (en) * 2023-09-28 2024-01-12 四川蜀道新能源科技发展有限公司 Pavement crack filling detection method, system, terminal and medium based on deep learning

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN109859190B (en) Target area detection method based on deep learning
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN112101175A (en) Expressway vehicle detection and multi-attribute feature extraction method based on local images
CN110659664B (en) SSD-based high-precision small object identification method
CN111563909A (en) Semantic segmentation method for complex street view image
CN109753878B (en) Imaging identification method and system under severe weather
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN109886159B (en) Face detection method under non-limited condition
CN110517270B (en) Indoor scene semantic segmentation method based on super-pixel depth network
CN110717921B (en) Full convolution neural network semantic segmentation method of improved coding and decoding structure
CN112257766A (en) Shadow recognition detection method under natural scene based on frequency domain filtering processing
CN113378756B (en) Three-dimensional human body semantic segmentation method, terminal device and storage medium
CN110472634A (en) Change detecting method based on multiple dimensioned depth characteristic difference converged network
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
CN109523558A (en) A kind of portrait dividing method and system
CN114627269A (en) Virtual reality security protection monitoring platform based on degree of depth learning target detection
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
Zhang et al. Wide-area crowd counting: Multi-view fusion networks for counting in large scenes
CN113743300A (en) Semantic segmentation based high-resolution remote sensing image cloud detection method and device
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN116503709A (en) Vehicle detection method based on improved YOLOv5 in haze weather
CN116030364A (en) Unmanned aerial vehicle lightweight target detection method, system, medium, equipment and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination